Beautiful Soup: Get All HTML Tags

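In HTML, tags play a role similar to keywords in traditional programming languages such as Python or Java. Each tag has a predefined behavior, and the browser renders the tag's content according to that behavior. With Beautiful Soup, you can collect all the tags that appear in a given HTML document.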

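The simplest way to get a list of tags is to parse the web page into a soup object and then call the find_all() method without any arguments. It returns a ResultSet, a list-like object that contains every tag in the document.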

Let us extract the list of all tags in the Google home page.

Example

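from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/"
# Fetch the page and fail early on an HTTP error status.
req = requests.get(url, timeout=10)
req.raise_for_status()

# Parse the raw response bytes with Python's built-in HTML parser.
soup = BeautifulSoup(req.content, "html.parser")

# With no arguments, find_all() matches every tag in the document.
tags = soup.find_all()
print([tag.name for tag in tags])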

Output (the exact list depends on the page Google serves when the request is made):

['html', 'head', 'meta', 'meta', 'title', 'script', 'style', 'style', 'script', 'body', 'script', 'div', 'div', 'nobr', 'b', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'u', 'div', 'nobr', 'span', 'span', 'span', 'a', 'a', 'a', 'div', 'div', 'center', 'br', 'div', 'img', 'br', 'br', 'form', 'table', 'tr', 'td', 'td', 'input', 'input', 'input', 'input', 'input', 'div', 'input', 'br', 'span', 'span', 'input', 'span', 'span', 'input', 'script', 'input', 'td', 'a', 'input', 'script', 'div', 'div', 'br', 'div', 'style', 'div', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'span', 'div', 'div', 'a', 'a', 'a', 'a', 'p', 'a', 'a', 'script', 'script', 'script']
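Because the result above depends on a live network request, the same call can be tried offline. The following is a minimal sketch that assumes a small inline HTML string (a made-up document, not taken from any real page); it also shows that the value returned by find_all() behaves like an ordinary list.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used here instead of a live web page.
html = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p><br/></body></html>"

soup = BeautifulSoup(html, "html.parser")

# No arguments: every tag is returned, in document order.
tags = soup.find_all()
tag_names = [tag.name for tag in tags]

print(type(tags).__name__)  # ResultSet
print(len(tags))            # 7
print(tag_names)
```

Since the result supports len(), indexing, and iteration, it can be used anywhere a list of tags is expected.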

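Naturally, a tag may appear more than once in such a list. To get a list of unique tags (without duplicates), build a set of the tag names instead of a list.

Change the print statement in the code above to: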

print({tag.name for tag in tags})

Output (a set is unordered, so the order may differ from run to run):

{'body', 'head', 'p', 'a', 'meta', 'tr', 'nobr', 'script', 'br', 'img', 'b', 'form', 'center', 'span', 'div', 'input', 'u', 'title', 'style', 'td', 'table', 'html'}
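When a stable order or the number of occurrences matters, the tag names can be sorted or counted. This is a small sketch using an assumed inline HTML string and the standard library's collections.Counter; neither is part of the original example.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical fragment with a few repeated tags.
html = "<div><p>one</p><p>two</p><a href='#'>x</a><br><br></div>"
soup = BeautifulSoup(html, "html.parser")

names = [tag.name for tag in soup.find_all()]

# Sorting the set gives a reproducible order for printing or comparing.
unique_sorted = sorted(set(names))
print(unique_sorted)  # ['a', 'br', 'div', 'p']

# Counter records how many times each tag name occurs.
counts = Counter(names)
print(counts["p"], counts["br"], counts["div"])  # 2 2 1
```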

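To get only the tags that have some associated text, check each tag's string attribute and print it if it is not None. Note that string is None for a tag with no children and also for a tag that has more than one child: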

tags = soup.find_all()
for tag in tags:
    if tag.string is not None:
        print(tag.name, tag.string)
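The filter above skips any tag whose content is mixed, which can be surprising. The sketch below, built on an assumed inline fragment, shows which tags pass the string check and how get_text() can recover the text of a tag that the check leaves out.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one plain paragraph, one with mixed content, one image.
html = "<div><p>Plain text</p><p>Mixed <b>bold</b> text</p><img src='logo.png'></div>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the tags whose .string is not None.
with_string = [
    (tag.name, str(tag.string))
    for tag in soup.find_all()
    if tag.string is not None
]
print(with_string)  # [('p', 'Plain text'), ('b', 'bold')]

# The second <p> has three children, so .string is None for it,
# but get_text() still joins all of its nested text.
second_p = soup.find_all("p")[1]
print(second_p.string)      # None
print(second_p.get_text())  # Mixed bold text
```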

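There may also be singleton (void) tags that have no text but carry one or more attributes, such as the <img> tag. Such tags can be listed by looping over the result of find_all() and keeping the tags that have no children and a non-empty attrs dictionary.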
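One possible way to write such a loop is sketched here. The HTML fragment is invented for illustration, and the test used (no children, at least one attribute) is one reasonable reading of "no text but has attributes", not the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing void tags with and without attributes.
html = (
    "<div id='box'>"
    "<img src='logo.png' alt='Logo'>"
    "<br>"
    "<input type='text' name='q'>"
    "<p>Text</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

empty_with_attrs = []
for tag in soup.find_all():
    # tag.contents is empty for tags without children;
    # tag.attrs is a dict of the tag's attributes.
    if not tag.contents and tag.attrs:
        empty_with_attrs.append((tag.name, dict(tag.attrs)))
        print(tag.name, tag.attrs)
```

Here <br> is skipped because it has no attributes, while <div> and <p> are skipped because they have children.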

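In the code below, the HTML string is not a complete HTML document, because the <html> and <body> tags are not given. However, the html5lib and lxml parsers add these tags on their own while building the document tree, whereas the built-in html.parser does not. As a result, the extracted list of tags contains additional tags that are not in the string. Both html5lib and lxml are optional packages that must be installed separately.

Example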

html = '''
<h1 style="color:blue;text-align:center;">This is a heading</h1>
<p style="color:red;">This is a paragraph.</p>
<p>This is another paragraph</p>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")

tags = soup.find_all()
print({tag.name for tag in tags})

Output:

{'head', 'html', 'p', 'h1', 'body'}
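To see how the parser choice affects the result, the same fragment can be parsed with each available parser. This is a sketch only: it assumes that lxml and html5lib may or may not be installed, and it skips a parser when Beautiful Soup raises FeatureNotFound for it.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = """
<h1 style="color:blue;text-align:center;">This is a heading</h1>
<p style="color:red;">This is a paragraph.</p>
<p>This is another paragraph</p>
"""

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Optional parser that is not installed in this environment; skip it.
        continue
    # Sort the unique names so the output is stable across runs.
    results[parser] = sorted({tag.name for tag in soup.find_all()})

for parser, names in results.items():
    print(parser, names)
```

With html.parser, only the tags written in the string ('h1' and 'p') are reported, which makes it a convenient choice when the added wrapper tags are not wanted.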