Beautiful Soup: Navigating by Tags

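Tags are the main building blocks of an HTML document. A tag may contain other tags or strings, which are called its children. Beautiful Soup provides several ways to navigate and iterate over a tag's children.

The simplest way to navigate the parse tree is to refer to a tag by its name.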

`soup.head`

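soup.head is an attribute, not a function call. It returns the first <head> tag in the parsed document, together with everything between <head> and </head>.

Consider the following HTML page to be scraped: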

<html>
   <head>
      <title>Yoagoa</title>
      <script>
         document.write("Welcome to Yoagoa");
      </script>
   </head>
   <body>
      <h1>Yoagoa Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>

The following code extracts the <head> element:

Example

from bs4 import BeautifulSoup
with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print(soup.head)

Output

<head>
<title>Yoagoa</title>
<script>
document.write("Welcome to Yoagoa");
</script>
</head>
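
As a minimal sketch of what attribute-style access returns, the snippet below parses an inline, shortened copy of the sample page (an assumption made so that it runs without index.html). It suggests that soup.head is a Tag object whose own children can be reached the same way, and that a tag name absent from the document (here the hypothetical table) gives None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Inline, shortened copy of the sample page so the sketch runs without index.html.
html = """<html>
<head><title>Yoagoa</title></head>
<body><h1>Yoagoa Online Library</h1></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

head = soup.head
print(type(head).__name__)   # Tag
print(head.name)             # head
print(head.title.string)     # Yoagoa

# A tag name that does not occur in the document yields None.
print(soup.table)            # None
```

Because a missing tag yields None, chained access such as soup.table.tr would fail with an AttributeError, so it may be worth checking for None before going deeper.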

`soup.body`

Similarly, use soup.body to get the <body> element of the HTML page.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')
print(soup.body)

Output

<body>
<h1>Yoagoa Online Library</h1>
<p><b>It's all Free</b></p>
</body>

You can also chain tag names to reach a specific tag inside <body>, such as its first <h1> tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.body.h1)

Output

<h1>Yoagoa Online Library</h1>
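
Chained access can go more than one level deep. The sketch below, which uses an inline copy of the sample body (an assumption, so that it runs without index.html), reads the text of the nested tags and checks that the chained form reaches the same element as explicit find() calls.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample body so the sketch runs without index.html.
html = "<html><body><h1>Yoagoa Online Library</h1><p><b>It's all Free</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

h1_text = soup.body.h1.string     # text inside the first <h1> under <body>
bold_text = soup.body.p.b.string  # text inside <b>, nested in <p>, nested in <body>
print(h1_text)    # Yoagoa Online Library
print(bold_text)  # It's all Free

# Each attribute step behaves like find() for the first matching tag.
same = soup.body.h1 is soup.find("body").find("h1")
print(same)       # True
```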

`soup.p`

Our HTML file contains a single <p> tag. soup.p returns the first <p> tag found in the document, so here it returns that tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

print(soup.p)

Output

<p><b>It's all Free</b></p>
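
When a document has several tags with the same name, attribute-style access gives only the first one. The following sketch uses a made-up two-paragraph fragment (not part of the sample page) to illustrate the difference from find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <p> tags, used only for illustration.
html = "<body><p>first</p><p>second</p></body>"
soup = BeautifulSoup(html, "html.parser")

first = soup.p               # only the first <p>
all_p = soup.find_all("p")   # every <p> in the document

print(first)       # <p>first</p>
print(len(all_p))  # 2
```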

`Tag.contents`

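A Tag object may contain one or more PageElements (tags or strings). Its contents attribute returns a list of the tag's direct children.

Let us list the elements inside the <head> tag of index.html.

Example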

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.head
print(tag.contents)

Output

['\n',
<title>Yoagoa</title>,
'\n',
<script>
document.write("Welcome to Yoagoa");
</script>,
'\n']
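
Since contents is an ordinary Python list, it can be measured and indexed. The sketch below assumes a compact inline version of the <head> section and separates the newline strings from the tags.

```python
from bs4 import BeautifulSoup, NavigableString

# Compact inline version of the sample <head> so the sketch runs without index.html.
html = """<head>
<title>Yoagoa</title>
<script>document.write("Welcome to Yoagoa");</script>
</head>"""
soup = BeautifulSoup(html, "html.parser")

contents = soup.head.contents
print(len(contents))      # 5: three newline strings and two tags
print(contents[1].name)   # title

# The newlines between tags are stored as NavigableString objects.
strings = [item for item in contents if isinstance(item, NavigableString)]
print(len(strings))       # 3
```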

`Tag.children`

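The tags in an HTML document form a hierarchy in which elements are nested inside one another. For example, the top-level <html> tag contains the <head> and <body> tags, and each of those may contain further tags.

A Tag object has a children attribute that returns an iterator over the PageElements directly contained in the tag.

To demonstrate the children attribute, we will use the following HTML document (index.html). Its <body> contains a top-level <ul> list, and a nested <ul> follows each of the two top-level <li> items. In other words, the outer list holds the department names, and each department is followed by a nested list of its employees.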

<html>
   <head>
      <title>Yoagoa</title>
   </head>
   <body>
      <h2>Departmentwise Employees</h2>
      <ul>
      <li>Accounts</li>
         <ul>
         <li>Anand</li>
         <li>Mahesh</li>
         </ul>
      <li>HR</li>
         <ul>
         <li>Rani</li>
         <li>Ankita</li>
         </ul>
      </ul>
   </body>
</html>

The following Python code lists all direct children of the top-level <ul> tag.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

tag = soup.ul
print(list(tag.children))

Output

['\n', <li>Accounts</li>, '\n', <ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>, '\n', <li>HR</li>, '\n', <ul>
<li>Rani</li>
<li>Ankita</li>
</ul>, '\n']

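Because .children returns an iterator, we can walk through the child elements with a for loop. The newline strings between the tags are printed as well, which accounts for the blank lines in the output.

Example

for child in tag.children:
   print(child)

Output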

<li>Accounts</li>

<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>

<li>HR</li>

<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
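
In practice the newline strings are often unwanted. One possible approach, sketched below on a shortened inline copy of the list (an assumption, so that the snippet runs without index.html), is to keep only the children that are Tag objects. It also suggests that children covers direct children only: the employee names in the nested list do not appear among the top-level items.

```python
from bs4 import BeautifulSoup, Tag

# Shortened inline copy of the department list so the sketch runs without index.html.
html = """<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag children; the newline strings between tags are dropped.
direct_tags = [child for child in soup.ul.children if isinstance(child, Tag)]
names = [tag.name for tag in direct_tags]
print(names)            # ['li', 'ul', 'li']

# Only the top-level <li> items; nested items such as Anand are not direct children.
top_level_items = [tag.string for tag in direct_tags if tag.name == "li"]
print(top_level_items)  # ['Accounts', 'HR']
```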

`Tag.find_all()`

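This method returns a ResultSet, a list-like object, containing every element that matches the given tag name.

Consider the following HTML page (index.html):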

<html>
   <body>
      <h1>Yoagoa Online Library</h1>
      <p><b>It's all Free</b></p>
      <a class="prog" href="https://www.yoagoa.com/java/java_overview.htm" id="link1">Java</a>
      <a class="prog" href="https://www.yoagoa.com/cprogramming/index.htm" id="link2">C</a>
      <a class="prog" href="https://www.yoagoa.com/python/index.htm" id="link3">Python</a>
      <a class="prog" href="https://www.yoagoa.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>
      <a class="prog" href="https://www.yoagoa.com/ruby/index.htm" id="link5">C</a>
   </body>
</html>

The following code lists all <a> elements in the page.

Example

from bs4 import BeautifulSoup

with open("index.html") as fp:
   soup = BeautifulSoup(fp, 'html.parser')

result = soup.find_all("a")
print(result)

Output

[
   <a class="prog" href="https://www.yoagoa.com/java/java_overview.htm" id="link1">Java</a>,
   <a class="prog" href="https://www.yoagoa.com/cprogramming/index.htm" id="link2">C</a>,
   <a class="prog" href="https://www.yoagoa.com/python/index.htm" id="link3">Python</a>,
   <a class="prog" href="https://www.yoagoa.com/javascript/javascript_overview.htm" id="link4">JavaScript</a>,
   <a class="prog" href="https://www.yoagoa.com/ruby/index.htm" id="link5">C</a>
]
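
The result can be processed like any list. As a rough sketch, the snippet below takes two of the sample links inline (an assumption, so that it runs without index.html), pulls out each link's href attribute and text, and uses the limit argument to stop after the first match. A tag name with no matches, such as the hypothetical table here, gives an empty result rather than None.

```python
from bs4 import BeautifulSoup

# Two of the sample links, inlined so the sketch runs without index.html.
html = """<body>
<a class="prog" href="https://www.yoagoa.com/java/java_overview.htm" id="link1">Java</a>
<a class="prog" href="https://www.yoagoa.com/python/index.htm" id="link3">Python</a>
</body>"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
hrefs = [a["href"] for a in links]       # attribute values via dictionary-style access
labels = [a.get_text() for a in links]   # visible link text
print(labels)                            # ['Java', 'Python']

first_only = soup.find_all("a", limit=1) # stop searching after one match
print(len(first_only))                   # 1

missing = soup.find_all("table")         # no <table> in this fragment
print(len(missing))                      # 0
```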