Beautiful Soup - Remove Child Elements

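An HTML document is a hierarchy of tags, in which any tag may contain other tags nested several levels deep. How do we remove the child elements of a particular tag? With Beautiful Soup this is very easy to do.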

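The Beautiful Soup library has two main methods for removing a tag: decompose() and extract(). The difference is that extract() returns the element that was removed, whereas decompose() simply destroys it.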
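The difference between the two methods can be seen in a minimal sketch. The inline HTML string and the variable names are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the tutorial's index.html
html = "<div><p>first</p><span>second</span></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it
removed = soup.p.extract()
removed_markup = str(removed)

# decompose() detaches the tag and destroys it; nothing is returned
result = soup.span.decompose()

remaining = str(soup)
print(removed_markup)  # <p>first</p>
print(result)          # None
print(remaining)       # <div></div>
```

Use extract() when the removed fragment is still needed, for example to insert it elsewhere, and decompose() when it can be thrown away.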

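Hence, to remove child elements, first call the findChildren() method (a legacy alias of find_all()) on the given Tag object, and then call extract() or decompose() on each child.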
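As a rough sketch of that idea, the two steps can be wrapped in a small function. The name remove_child_tags and the sample markup are hypothetical. The sketch uses find_all(recursive=False) so that only direct children are visited, on the assumption that removing a direct child takes its nested descendants with it.

```python
from bs4 import BeautifulSoup

def remove_child_tags(tag):
    """Hypothetical helper: destroy every direct child tag of `tag`."""
    # recursive=False limits the search to direct children; nested
    # descendants disappear together with their parent.
    for child in tag.find_all(recursive=False):
        child.decompose()

# Illustrative markup, not taken from the tutorial's index.html
html = "<ul><li>one <b>bold</b></li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

remove_child_tags(soup.ul)
cleaned = str(soup)
print(cleaned)  # <ul></ul>
```

Note that this removes child tags only. Text placed directly inside the parent tag, including whitespace between the children, stays where it is.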

Consider the following code snippet:

soup = BeautifulSoup(fp, "html.parser")
soup.decompose()
print(soup)

This destroys the soup object itself, that is, the entire parse tree of the document. Obviously, this is not what we want.

Now consider the following code:

soup = BeautifulSoup(fp, "html.parser")
tags = soup.find_all()
for tag in tags:
   for t in tag.findChildren():
      t.extract()

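In the document tree, <html> is the first tag and every other tag is nested inside it. Because findChildren() returns all descendants by default, not just direct children, the first iteration of the loop already removes every tag, leaving only <html> and </html>.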
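This behaviour is easy to check on a small document. The sketch below runs the same loop on an inline HTML string, which is invented for illustration, and records the tag names before the loop and the markup after it.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the tutorial's index.html
html = "<html><body><h2>Title</h2><p>Some <b>bold</b> text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all()
names_before = [tag.name for tag in tags]

# Same loop as above: detach every descendant of every tag
for tag in tags:
    for t in tag.findChildren():
        t.extract()

result = str(soup)
print(names_before)  # ['html', 'body', 'h2', 'p', 'b']
print(result)        # <html></html>
```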

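This approach is more useful when we want to remove the children of a specific tag. For example, you may want to remove the header row of an HTML table.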

The following HTML document contains a table whose first <tr> element holds the column headers, marked up with <th> tags.

<html>
   <body>
      <h2>Beautiful Soup - Remove Child Elements</h2>
      <table border="1">
         <tr class='header'>
            <th>Name</th>
            <th>Age</th>
            <th>Marks</th>
         </tr>
         <tr>
            <td>Ravi</td>
            <td>23</td>
            <td>67</td>
         </tr>
         <tr>
            <td>Anil</td>
            <td>27</td>
            <td>84</td>
         </tr>
      </table>
   </body>
</html>

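We can use the following Python code to remove all the child elements of the <tr> tag that contains the <th> cells. The row is selected by its class attribute, 'header'.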

Example

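from bs4 import BeautifulSoup

# Parse the sample page; the with-block closes the file afterwards
with open("index.html") as fp:
   soup = BeautifulSoup(fp, "html.parser")

# Select the header row(s) of the table by class name
tags = soup.find_all('tr', {'class': 'header'})

for tag in tags:
   # Detach every child tag of the header row from the tree
   for t in tag.findChildren():
      t.extract()

print(soup)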

Output

<html>
<body>
<h2>Beautiful Soup - Remove Child Elements</h2>
<table border="1">
<tr class="header">

</tr>
<tr>
<td>Ravi</td>
<td>23</td>
<td>67</td>
</tr>
<tr>
<td>Anil</td>
<td>27</td>
<td>84</td>
</tr>
</table>
</body>
</html>

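As you can see, the <th> elements have been removed from the parse tree, while the now empty <tr class="header"> row itself remains.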
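The extract() loop removes child tags only, so the whitespace between the former <th> cells is left behind in the header row. As an alternative that the example above does not use, Tag.clear() empties a tag completely, text nodes included. The sketch below assumes a shortened inline copy of the table instead of index.html.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the table from index.html
html = """<table>
<tr class="header">
<th>Name</th>
<th>Age</th>
</tr>
<tr><td>Ravi</td><td>23</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# clear() removes everything inside the tag: child tags and text alike
for row in soup.find_all("tr", class_="header"):
    row.clear()

header_row = str(soup.find("tr", class_="header"))
remaining_cells = [td.get_text() for td in soup.find_all("td")]
print(header_row)       # <tr class="header"></tr>
print(remaining_cells)  # ['Ravi', '23']
```

The data rows are untouched because only the row with class 'header' is selected.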