BeautifulSoup in Detail

Author: 梅零落
Published: January 28, 2021
Category: Python

Introduction to BeautifulSoup4

Reference: http://www.jsphp.net/python/show-24-214-1.html

BeautifulSoup4 is a must-learn tool for writing crawlers. Its main job is extracting data from web pages.

Like lxml, Beautiful Soup is a library for parsing HTML/XML; its main purpose is likewise to parse HTML/XML documents and extract data from them.

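Beautiful Soup supports the HTML parser in the Python standard library (html.parser) as well as several third-party parsers such as lxml. If no third-party parser is installed, the standard-library parser is used. The lxml parser is more powerful and faster, so it is the recommended choice.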
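The parser choice described above can be sketched as a small helper that prefers lxml and falls back to html.parser. The helper name make_soup and the sample markup are made up for illustration; this is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: parse with lxml if installed, else html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser from the standard library.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><head><title>Demo</title></head><body><p>Hello</p></body></html>")
print(soup.title.string)
print(soup.p.get_text())
```

Naming the parser explicitly also keeps results consistent across machines, since different parsers can build slightly different trees from malformed markup.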

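Beautiful Soup automatically converts input documents to Unicode and, when output is encoded, writes it as UTF-8 by default. You normally don't need to think about encodings. The exception is a document that doesn't declare its encoding and whose encoding Beautiful Soup cannot detect automatically; in that case you only need to state the original encoding yourself.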
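As a minimal sketch of stating the original encoding, the example below builds a GBK-encoded page with no charset declaration and passes from_encoding to the constructor. The sample markup is an assumption made up for this illustration.

```python
from bs4 import BeautifulSoup

# Bytes encoded as GBK, with no charset declaration in the markup itself.
raw = "<html><head><title>你好，世界</title></head><body></body></html>".encode("gbk")

# from_encoding tells Beautiful Soup which encoding to try first.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
title = soup.title.string

# Inside the tree everything is Unicode (str); encode() produces bytes on demand.
utf8_bytes = soup.encode("utf-8")
print(title)
print(type(utf8_bytes))
```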

Main BeautifulSoup4 Parsers and Their Pros and Cons

[Figure: parser comparison table, an image in the original post]

Basic Usage of BeautifulSoup4

Suppose we have an HTML file with the following content:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css" />
    <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
  <div id="wrapper">
    <div id="head">
        <div class="head_wrapper">
          <div id="u1">
            <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻 </a>
            <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
            <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图 </a>
            <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频 </a>
            <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧 </a>
            <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品 </a>
          </div>
        </div>
    </div>
  </div>
</body>
</html>

from bs4 import BeautifulSoup

# Read the sample page as bytes and let Beautiful Soup handle decoding
with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.prettify())      # pretty-print the HTML with indentation
print(bs.title)           # the whole <title> tag
print(bs.title.name)      # the tag's name: title
print(bs.title.string)    # the text inside <title>
print(bs.head)            # the whole <head> tag and its contents
print(bs.div)             # the first <div> tag and its contents
print(bs.div["id"])       # the id of the first <div>: wrapper
print(bs.a)               # the first <a> tag
print(bs.find_all("a"))   # all <a> tags
print(bs.find(id="u1"))   # the tag with id="u1"
for item in bs.find_all("a"):
    print(item.get("href"))   # href of each <a> tag
for item in bs.find_all("a"):
    print(item.get_text())    # text of each <a> tag

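The Four Kinds of Objects in BeautifulSoup4

BeautifulSoup4 converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: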

Tag
NavigableString
BeautifulSoup
Comment

Tag

Put simply, a Tag is one of the tags in the HTML, for example:

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
# The whole <title> tag
print(bs.title)
# The whole <head> tag
print(bs.head)
# The first <a> tag
print(bs.a)
# Its type: <class 'bs4.element.Tag'>
print(type(bs.a))

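Writing the soup object followed by a tag name is an easy way to get these tags; the resulting objects have the type bs4.element.Tag. Note, however, that this returns only the first matching tag in the whole document.

A Tag has two important attributes, name and attrs: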

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
# The bs object itself is special: its name is [document]
print(bs.name)
# For any other tag, name is the tag's own name: head
print(bs.head.name)
# All attributes of the first <a> tag, as a dictionary
print(bs.a.attrs)
# Read a single attribute by key; bs.a.get('class') is equivalent when the attribute exists
print(bs.a['class'])
# Attributes can be modified
bs.a['class'] = "newClass"
print(bs.a)
# Attributes can also be deleted
del bs.a['class']
print(bs.a)
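Two details of attribute access are easy to miss, and the sketch below illustrates them on a made-up one-line snippet: class is a multi-valued attribute, so it comes back as a list, and indexing with a missing key raises KeyError while get() returns None.

```python
from bs4 import BeautifulSoup

html = '<a class="mnav top" href="http://news.baidu.com" name="tj_trnews">news</a>'
a = BeautifulSoup(html, "html.parser").a

classes = a["class"]      # multi-valued attribute, returned as a list
href = a.get("href")      # single-valued attribute, returned as a string
missing = a.get("title")  # get() returns None for a missing attribute

try:
    a["title"]            # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(classes, href, missing, raised)
```

When an attribute may be absent, get() is usually the safer of the two.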

NavigableString

Now that we can get a tag, how do we get the text inside it? Simple: use .string, for example:

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")

print(bs.title.string)        # the text inside <title>
print(type(bs.title.string))  # <class 'bs4.element.NavigableString'>
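One caveat worth a short sketch: .string only works when a tag has a single child. For a tag with mixed content it returns None, and get_text() is the usual alternative. The markup below is made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

html = "<div><p>only text</p><p>mixed <b>bold</b> text</p></div>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

single = first.string      # one child string, so .string returns it
none_case = second.string  # several children, so .string is None
full = second.get_text()   # get_text() joins all nested strings

print(repr(single), none_case, repr(full))
```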

BeautifulSoup

A BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a special kind of Tag, and read its type, name, and attributes, for example:

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")
print(type(bs.name))  # <class 'str'>
print(bs.name)        # [document]
print(bs.attrs)       # {}

Comment

A Comment object is a special type of NavigableString; when printed, its content does not include the comment markers.

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.a)
# For this example the first <a> tag must contain only a comment, with no spaces or line breaks around it:
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新闻--></a>
print(bs.a.string)        # 新闻
print(type(bs.a.string))  # <class 'bs4.element.Comment'>
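Because .string hands back a comment's content without its markers, a comment can be mistaken for visible text. A minimal sketch of guarding against that with isinstance follows; the helper visible_text and the markup are hypothetical.

```python
from bs4 import BeautifulSoup, Comment

html = '<div><a href="#"><!--news--></a><a href="#">map</a></div>'
soup = BeautifulSoup(html, "html.parser")


def visible_text(tag):
    """Hypothetical helper: return tag.string unless it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = [visible_text(a) for a in soup.find_all("a")]
print(results)
```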

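Traversing the Document Tree

.contents: returns all direct children of a Tag as a list

# A tag's .contents attribute returns its children as a list
print(bs.head.contents)
# Use a list index to get a single element
print(bs.head.contents[1])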

.children: returns all direct children of a Tag as an iterator

for child in bs.body.children:
    print(child)

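.descendants: iterates over all descendants of a Tag (children, grandchildren, and so on)
.strings: when a Tag contains more than one string, that is, its descendants hold text, use this to iterate over all of them
.stripped_strings: same usage as .strings, but with surrounding whitespace stripped and whitespace-only strings skipped
.parent: the parent node of a Tag
.parents: walks up through all ancestors of the element; returns a generator
.previous_sibling: the node immediately before the current Tag at the same level; in practice this is often a string or whitespace, such as the punctuation and line break between the current tag and the previous one
.next_sibling: the node immediately after the current Tag at the same level; in practice this is often a string or whitespace, such as the punctuation and line break between the current tag and the next one
.previous_siblings: all sibling nodes before the current Tag; returns a generator
.next_siblings: all sibling nodes after the current Tag; returns a generator
.previous_element: the object (string or tag) parsed immediately before this one; it may be the same as previous_sibling, but usually is not
.next_element: the object (string or tag) parsed immediately after this one; it may be the same as next_sibling, but usually is not
.previous_elements: a generator that walks backward through the document in parse order
.next_elements: a generator that walks forward through the document in parse order
.has_attr(): a method that checks whether a Tag has the given attribute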
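A few of the navigation attributes listed above can be tried on a small fragment. The markup below is a trimmed, made-up version of the sample page; note how next_sibling returns the line break between the two tags, while find_next_sibling("a") skips over it.

```python
from bs4 import BeautifulSoup, Tag

html = """<div id="u1">
<a class="mnav" href="http://news.baidu.com">news</a>
<a class="mnav" href="http://map.baidu.com">map</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find(id="u1")
first_a = div.a

texts = list(div.stripped_strings)             # text with whitespace removed
parent_name = first_a.parent.name              # direct parent
whitespace = first_a.next_sibling              # the line break after the first <a>
second_a = first_a.find_next_sibling("a")      # next sibling that is an <a> tag
ancestors = [p.name for p in first_a.parents]  # from the parent up to the document
child_tags = [c.name for c in div.children if isinstance(c, Tag)]

print(texts, parent_name, repr(whitespace), second_a["href"], ancestors, child_tags)
```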

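Searching the Document Tree

find_all(name, attrs, recursive, text, limit, **kwargs)

The examples above gave a brief look at find_all; next come its filters. Filters run through the whole search API and can be applied to a tag's name, to its attributes, and to strings.

The name parameter:

String filter: finds tags whose name matches the string exactly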

a_list = bs.find_all("a")
print(a_list)

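Regular expression filter: if you pass a compiled regular expression, BeautifulSoup4 matches tag names with its search() method, so the pattern can match anywhere in the name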

from bs4 import BeautifulSoup
import re

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
# Matches every tag whose name contains "a", such as <head>, <meta> and <a>
t_list = bs.find_all(re.compile("a"))
for item in t_list:
    print(item)
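Since the pattern is applied with search(), an unanchored pattern matches more tags than one might expect. The sketch below, on a made-up page, compares a loose pattern with an anchored one.

```python
import re
from bs4 import BeautifulSoup

html = ("<html><head><meta charset='utf-8'><title>t</title></head>"
        "<body><a href='#'>x</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

# "a" may appear anywhere in the tag name.
loose = [t.name for t in soup.find_all(re.compile("a"))]
# Anchoring the pattern restricts the match to the <a> tag itself.
exact = [t.name for t in soup.find_all(re.compile("^a$"))]

print(loose)
print(exact)
```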

List: if you pass a list, BeautifulSoup4 returns every node that matches any item in the list

t_list = bs.find_all(["meta","link"])
for item in t_list:
    print(item)

Function: pass a function, and tags are matched according to its return value

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")

def name_is_exists(tag):
    # True for tags that have a name attribute
    return tag.has_attr("name")

t_list = bs.find_all(name_is_exists)
for item in t_list:
    print(item)

Keyword arguments (kwargs):

from bs4 import BeautifulSoup
import re

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
# Tags with id="head"
t_list = bs.find_all(id="head")
print(t_list)
# Tags whose href contains http://news.baidu.com
t_list = bs.find_all(href=re.compile("http://news.baidu.com"))
print(t_list)
# All tags that have a class attribute (class is a Python keyword, so the argument is spelled class_)
t_list = bs.find_all(class_=True)
for item in t_list:
    print(item)

The attrs parameter:

Not every attribute can be searched with keyword arguments like this. HTML data-* attributes are one example:

t_list = bs.find_all(data-foo="value")

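Running this code raises a SyntaxError, because data-foo is not a valid Python keyword argument name. Instead, pass the attrs parameter a dictionary to search for tags with such attributes: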

t_list = bs.find_all(attrs={"data-foo":"value"})
for item in t_list:
    print(item)
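The sample page has no data-* attributes, so the lookup above returns nothing there. Here is a self-contained sketch on made-up markup; it also shows that True matches any value, and that dictionary unpacking is another way to pass such a name.

```python
from bs4 import BeautifulSoup

html = '<div data-foo="value">A</div><div data-foo="other">B</div><div>C</div>'
soup = BeautifulSoup(html, "html.parser")

# Exact value match through the attrs dictionary.
exact = [t.get_text() for t in soup.find_all(attrs={"data-foo": "value"})]
# True matches any tag that has the attribute, whatever its value.
any_value = [t.get_text() for t in soup.find_all(attrs={"data-foo": True})]
# Unpacking a dict passes the name as a keyword argument without writing it as an identifier.
unpacked = [t.get_text() for t in soup.find_all(**{"data-foo": "value"})]

print(exact, any_value, unpacked)
```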

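The text parameter:

The text parameter searches the strings in the document. Like name, it accepts a string, a regular expression, or a list. Since Beautiful Soup 4.4.0 this parameter is also available under the name string.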

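from bs4 import BeautifulSoup
import re

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")

# A string must match exactly; "hao123 " in the sample page has a trailing space, so this finds nothing
t_list = bs.find_all(text="hao123")
for item in t_list:
    print(item)
# A list matches any of its items, again exactly
t_list = bs.find_all(text=["hao123", "地图", "贴吧"])
for item in t_list:
    print(item)
# A regular expression: strings that contain a digit
t_list = bs.find_all(text=re.compile(r"\d"))
for item in t_list:
    print(item)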

To match strings by some custom property, you can again pass a function:

def length_is_two(text):
    return text and len(text) == 2
t_list = bs.find_all(text=length_is_two)
for item in t_list:
    print(item)
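String filters compare the whole string, whitespace included, which matters for link text like that in the sample page, where each label ends with a space. The sketch below uses made-up markup in the same style and the string spelling of the parameter, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

html = '<div><a href="#">hao123 </a><a href="#">地图 </a></div>'
soup = BeautifulSoup(html, "html.parser")

# Exact comparison fails because of the trailing space.
exact = soup.find_all(string="hao123")
# A regular expression can allow for surrounding whitespace.
by_regex = [str(s) for s in soup.find_all(string=re.compile(r"^hao123\s*$"))]
# A function can strip the string before comparing.
by_func = [str(s) for s in soup.find_all(string=lambda s: s is not None and s.strip() == "地图")]

print(exact, by_regex, by_func)
```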

The limit parameter:

Pass limit to cap the number of results. If a search would match 5 items and limit=2 is set, only the first 2 are returned.

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
# Only the first two <a> tags are returned
t_list = bs.find_all("a", limit=2)
for item in t_list:
    print(item)

Besides the regular forms above, find_all also has a shorthand:

# Calling a Tag or BeautifulSoup object is the same as calling find_all on it
# t_list = bs.find_all("a") => t_list = bs("a")
t_list = bs("a")
# t_list = bs.a.find_all(text="新闻") => t_list = bs.a(text="新闻")
t_list = bs.a(text="新闻")

find()

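find() returns the first Tag that matches. When we only need a single Tag, find() is the method to use. We could also call find_all() with limit=1 and take the first item of the result, but that is needlessly clumsy.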

from bs4 import BeautifulSoup

with open('./aa.html', 'rb') as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
# A list with a single result
t_list = bs.find_all("title", limit=1)
print(t_list)
# The single Tag itself
t = bs.find("title")
print(t)
# None when nothing is found
t = bs.find("abc")
print(t)

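As the output shows, find_all returns a list even with limit=1, so when we only want one value it is far less convenient than find. Keep in mind that find returns None when nothing matches.

Earlier we saw that bs.div gets the first div tag. To get the first div inside the first div, we can write: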

t = bs.div.div
# equivalent to
t = bs.find("div").find("div")
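Because find returns None when nothing matches, chaining onto its result (or onto bs.div.div) fails with an AttributeError if an element is missing. A minimal sketch of checking first follows; the helper first_href and the markup are hypothetical.

```python
from bs4 import BeautifulSoup

html = '<div id="wrapper"><div id="head"><a href="http://news.baidu.com">news</a></div></div>'
soup = BeautifulSoup(html, "html.parser")


def first_href(soup, tag_name):
    """Hypothetical helper: href of the first matching tag, or None."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    return tag.get("href")


found = first_href(soup, "a")
absent = first_href(soup, "abc")
inner_id = soup.div.div["id"]  # safe here because both divs exist

print(found, absent, inner_id)
```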

CSS Selectors

BeautifulSoup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

Find by tag name

print(bs.select('title'))
print(bs.select('a'))

Find by class name

print(bs.select('.mnav'))

Find by id

print(bs.select('#u1'))

Combined selectors

print(bs.select('div .bri'))

Find by attribute

print(bs.select('a[class="bri"]'))
print(bs.select('a[href="http://tieba.baidu.com"]'))

Find direct children

t_list = bs.select("head > title")
print(t_list)

Find sibling tags

t_list = bs.select(".mnav ~ .bri")
print(t_list)

Get the content

t_list = bs.select("title")
print(t_list[0].get_text())
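select() always returns a list, so indexing with [0] fails on an empty result. As a hedged sketch on a trimmed, made-up fragment, select_one() returns the first match or None, and selectors can be combined to pull out attribute values in one pass.

```python
from bs4 import BeautifulSoup

html = """<div id="u1">
<a class="mnav" href="http://news.baidu.com">news</a>
<a class="mnav" href="http://map.baidu.com">map</a>
<a class="bri" href="//www.baidu.com/more/">more</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# Direct <a class="mnav"> children of #u1, reduced to their href values.
hrefs = [a["href"] for a in soup.select("#u1 > a.mnav")]
# select_one() returns a single Tag, or None when nothing matches.
more = soup.select_one("a.bri")
nothing = soup.select_one("a.missing")

print(hrefs, more.get_text(), nothing)
```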

Last modified: December 20, 2022

