Beautiful Soup: find() vs. find_all()

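The Beautiful Soup library provides the find() and find_all() methods, which are among the most commonly used methods when parsing HTML or XML documents. Within a given document tree, you often need to locate a PageElement with a certain tag name, particular attributes, or a particular CSS class. These criteria are passed as arguments to find() and find_all(). The main difference between the two is that find() locates the first matching element below the object it is called on, whereas find_all() collects every matching element.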

The find() method is defined as follows:

Syntax

find(name, attrs, recursive, string, **kwargs)

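The name parameter sets a filter on the tag name, and attrs sets a filter on attribute values, given as a dictionary. recursive defaults to True, which searches all descendants of the element; set it to False to examine only its direct children. string filters on text content instead of on tags. Any additional keyword arguments (**kwargs) are treated as attribute filters, so id='nm' matches tags whose id attribute equals 'nm'.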

soup.find(id='nm')
soup.find(attrs={"name":'marks'})
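
To make the parameters above concrete, here is a minimal sketch. The markup and the variable names are invented for illustration and are not the tutorial's index.html.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html = """
<form id="f">
  <div><input id="nm" name="name" type="text"/></div>
  <input id="marks" name="marks" type="text"/>
  <p>Submit</p>
</form>
"""
soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")

by_name = form.find("input")                    # first <input> anywhere below the form
by_attrs = form.find(attrs={"name": "marks"})   # filter on an attribute value
by_kwarg = form.find(id="marks")                # attribute filter as a keyword argument
direct = form.find("input", recursive=False)    # direct children of the form only
by_string = form.find(string="Submit")          # matches text, not a tag

print(by_name["id"])               # nm
print(by_attrs["id"])              # marks
print(by_kwarg["id"])              # marks
print(direct["id"])                # marks
print(type(by_string).__name__)    # NavigableString
```

The name attribute has to go through attrs because the keyword name is already taken by the tag-name filter.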

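find_all() accepts the same arguments as find(), plus a limit argument. limit is an integer that caps the number of results: the search stops once that many matches have been found. If it is not set, find_all() returns every matching object among the descendants of the PageElement it is called on.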

soup.find_all('input')
lst = soup.find_all('li', limit=2)
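
The effect of limit can be seen in a short sketch. The three-item list below is an assumed sample, not taken from the tutorial's file.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")            # no limit: every match
first_two = soup.find_all("li", limit=2)   # stop after two matches

print(len(all_items))                         # 3
print([li.get_text() for li in first_two])    # ['one', 'two']
```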

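With limit=1, find_all() finds the same element as find(), but it still returns a ResultSet containing that single element rather than the element itself.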

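The two methods differ in their return types. find() returns the first matching Tag or NavigableString object, or None if nothing matches. find_all() returns a ResultSet, a list subclass, containing every PageElement that matches the filter; it is empty if nothing matches.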
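
The difference matters most when nothing matches, because code written for one return type fails on the other. A minimal sketch, using an assumed one-element document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
soup = BeautifulSoup('<input id="nm"/>', "html.parser")

missing_one = soup.find("select")               # no <select> in this markup
missing_all = soup.find_all("select")
one_in_list = soup.find_all("input", limit=1)

print(missing_one)                                # None
print(len(missing_all))                           # 0
print(isinstance(missing_all, list))              # True
print(one_in_list[0] == soup.find("input"))       # True
```

In practice this means checking a find() result against None before using it, while a find_all() result can be looped over directly even when it is empty.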

The following example shows the difference between find() and find_all().

Example

from bs4 import BeautifulSoup

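# index.html is assumed to contain the three <input> elements shown in the output.
# The with statement closes the file once parsing is done.
with open("index.html") as markup:
    soup = BeautifulSoup(markup, 'html.parser')
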
ret1 = soup.find('input')
ret2 = soup.find_all('input')
print(ret1, 'Return type of find:', type(ret1))
print(ret2)
print('Return type find_all:', type(ret2))

# Set limit=1
ret3 = soup.find_all('input', limit=1)
print('find:', ret1)
print('find_all:', ret3)

Output

<input id="nm" name="name" type="text"/> Return type of find: <class 'bs4.element.Tag'>
[<input id="nm" name="name" type="text"/>, <input id="age" name="age" type="text"/>, <input id="marks" name="marks" type="text"/>]
Return type find_all: <class 'bs4.element.ResultSet'>
find: <input id="nm" name="name" type="text"/>
find_all: [<input id="nm" name="name" type="text"/>]

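Note that the output shows the difference between find() and find_all(): find() returns a single Tag, whereas find_all() returns a ResultSet, even when limit=1 leaves only one element in it.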