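Beautiful Soup Object Types

When we pass an HTML document or string to the BeautifulSoup constructor, BeautifulSoup converts the HTML page into a tree of Python objects. This section covers the four main object types defined in the bs4 package.

The Four Main Object Types

Tag Object

HTML tags define the different kinds of content on a page. A Tag object in BeautifulSoup corresponds to an HTML or XML tag in the original page or document.

Example
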
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Yoagoa</b>', 'lxml')
tag = soup.html
print(type(tag))

Output

<class 'bs4.element.Tag'>

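A tag has many attributes and methods; its two most important features are its name and its attributes.

Name (tag.name)

Every tag has a name, which is accessed with the .name suffix. tag.name returns the tag's name, such as html or b. In these examples the lxml parser wraps the fragment in <html> and <body> tags, which is why soup.html exists.

Example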

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Yoagoa</b>', 'lxml')
tag = soup.html
print(tag.name)

Output

html
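
As a small complement to the example above, the sketch below reads the name of the <b> tag itself. It assumes the built-in 'html.parser' instead of lxml, because that parser does not add <html> and <body> wrappers; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# html.parser ships with Python and leaves the fragment unwrapped.
soup = BeautifulSoup('<b class="boldest">Yoagoa</b>', 'html.parser')

# soup.b is the first <b> tag in the tree.
first_name = soup.b.name
print(first_name)   # b

# find_all(True) matches every tag, so this collects each tag's name.
names = [tag.name for tag in soup.find_all(True)]
print(names)        # ['b']
```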

If we change a tag's name, the change is reflected in the HTML markup that BeautifulSoup generates.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Yoagoa</b>', 'lxml')
tag = soup.html
tag.name = "strong"
print(tag)

Output

<strong><body><b class="boldest">Yoagoa</b></body></strong>
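
Renaming works on any tag, not only the root. The following sketch renames the <b> tag directly; it assumes 'html.parser' so that no wrapper tags appear in the result.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Yoagoa</b>', 'html.parser')

bold = soup.b
bold.name = "strong"   # rename in place; attributes and children are kept

renamed = str(soup)
print(renamed)         # <strong class="boldest">Yoagoa</strong>
```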

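Attributes (tag.attrs)

A tag can have any number of attributes. In the example above, the <b> tag has one attribute, 'class', whose value is "boldest". Each attribute is a name paired with a value. tag.attrs returns the attributes and their values as a dictionary, and you can read an individual attribute by its key.

In the following example, the string passed to the BeautifulSoup() constructor contains an HTML input tag. The attributes of the input tag are returned by attrs.

Example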

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

print(tag.attrs)

Output

{'type': 'text', 'name': 'name', 'value': 'Raju'}
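
Reading a single attribute by key can be sketched as follows. The example assumes 'html.parser'; the missing 'id' attribute and the 'n/a' default are only there to show how absent keys behave.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'html.parser')
tag = soup.input

value = tag['value']             # direct key access; raises KeyError if absent
missing = tag.get('id')          # get() returns None for a missing attribute
fallback = tag.get('id', 'n/a')  # or a default of your choice
has_type = tag.has_attr('type')  # True if the attribute is present

print(value, missing, fallback, has_type)   # Raju None n/a True
```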

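We can add, remove, or modify a tag's attributes using dictionary operators or methods.

In the following example, the value attribute is updated. The updated HTML string shows the change.

Example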

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

print(tag.attrs)
tag['value'] = 'Ravi'
print(soup)

Output

{'type': 'text', 'name': 'name', 'value': 'Raju'}
<html><body><input name="name" type="text" value="Ravi"/></body></html>

Next, we add a new id attribute and delete the value attribute.

Example

from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'lxml')
tag = soup.input

tag['id'] = 'nm'
del tag['value']
print(soup)

Output

<html><body><input id="nm" name="name" type="text"/></body></html>
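
Because attrs is a dictionary, the same changes can also be made with dictionary methods rather than operators. This is a minimal sketch assuming 'html.parser'; the 'placeholder' attribute is an illustrative addition, not part of the original example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<input type="text" name="name" value="Raju">', 'html.parser')
tag = soup.input

# update() adds or overwrites several attributes at once.
tag.attrs.update({'id': 'nm', 'placeholder': 'Your name'})

# pop() removes an attribute and returns its old value; the None default
# avoids a KeyError when the attribute is absent.
removed = tag.attrs.pop('value', None)

remaining = sorted(tag.attrs)
print(removed)     # Raju
print(remaining)   # ['id', 'name', 'placeholder', 'type']
```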

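Multi-valued Attributes

Some HTML5 attributes can have multiple values. The most common is class, which can hold several CSS class names. Others include 'rel', 'rev', 'headers', 'accesskey', and 'accept-charset'. BeautifulSoup presents the value of a multi-valued attribute as a list.

Example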

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body"></p>', 'lxml')
print("css_soup.p['class']:", css_soup.p['class'])

css_soup = BeautifulSoup('<p class="body bold"></p>', 'lxml')
print("css_soup.p['class']:", css_soup.p['class'])

Output

css_soup.p['class']: ['body']
css_soup.p['class']: ['body', 'bold']

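However, if an attribute appears to contain several values but is not defined as multi-valued in any version of the HTML standard, BeautifulSoup leaves its value unchanged, as a single string.

Example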

from bs4 import BeautifulSoup

id_soup = BeautifulSoup('<p id="body bold"></p>', 'lxml')
print("id_soup.p['id']:", id_soup.p['id'])
print("type(id_soup.p['id']):", type(id_soup.p['id']))

Output

id_soup.p['id']: body bold
type(id_soup.p['id']): <class 'str'>
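
When code needs to handle both kinds of attribute uniformly, two options may help. The sketch below assumes 'html.parser' and a bs4 version that provides get_attribute_list() and the multi_valued_attributes constructor argument.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="body bold" class="body bold"></p>', 'html.parser')

# get_attribute_list() always returns a list, whether or not the
# attribute is treated as multi-valued.
id_list = soup.p.get_attribute_list('id')
class_list = soup.p.get_attribute_list('class')
print(id_list)      # ['body bold']
print(class_list)   # ['body', 'bold']

# multi_valued_attributes=None turns the splitting off entirely.
plain = BeautifulSoup('<p class="body bold"></p>', 'html.parser',
                      multi_valued_attributes=None)
class_str = plain.p['class']
print(class_str)    # body bold
```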

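NavigableString Object

Usually, a string is placed between the opening and closing tags of an element, and the browser's HTML engine applies the intended effect to that string when rendering the element. For example, in <b>Hello World</b>, the string sits between the <b> and </b> tags so that it is rendered in bold.

A NavigableString object represents the text content of a tag. It is an instance of the bs4.element.NavigableString class. To access the content, use .string on the tag.

Example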

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')

print(soup.string)

print(type(soup.string))

Output

Hello, Tutorialspoint!
<class 'bs4.element.NavigableString'>
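
One caveat worth illustrating: .string only returns a value when the tag has a single child. The following sketch, assuming 'html.parser' and a made-up paragraph, shows the None result for mixed content and get_text() as an alternative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>World</b>!</p>", 'html.parser')

# <p> has three children (text, <b>, text), so .string is ambiguous.
outer = soup.p.string
# <b> has exactly one child, a string.
inner = soup.b.string
# get_text() joins all the text found inside the tag.
full = soup.p.get_text()

print(outer)   # None
print(inner)   # World
print(full)    # Hello, World!
```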

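A NavigableString behaves like a Python str (a Unicode string), with extra features that support navigating and searching the tree. It can be converted to a plain string with the str() function.

Example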

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
string = str(tag.string)
print(string)

Output

Hello, Tutorialspoint!
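
The difference between a NavigableString and a plain string can be sketched as follows. The example assumes 'html.parser'; it checks that the NavigableString keeps a link to its parent tag, while the str() copy does not.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')

text = soup.h2.string
is_str = isinstance(text, str)   # NavigableString is a subclass of str
parent_name = text.parent.name   # it still knows where it sits in the tree

plain = str(text)                # a plain copy with no link to the tree
plain_has_parent = hasattr(plain, 'parent')

print(is_str, parent_name, plain_has_parent)   # True h2 False
```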

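Just as Python strings are immutable, a NavigableString cannot be modified in place. However, replace_with() can be used to replace the string inside a tag with another one.

Example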

from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>",'html.parser')

tag = soup.h2
tag.string.replace_with("OnLine Tutorials Library")
print(tag.string)

Output

OnLine Tutorials Library
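
An alternative to replace_with() is to assign to tag.string, which swaps in a new string. The sketch below assumes 'html.parser' and keeps a reference to the old string to show that the old object itself is not modified.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2 id='message'>Hello, Tutorialspoint!</h2>", 'html.parser')

tag = soup.h2
old = tag.string                        # keep a reference to the old string
tag.string = "OnLine Tutorials Library" # replaces the tag's contents

result = str(soup)
print(old)      # Hello, Tutorialspoint!
print(result)   # <h2 id="message">OnLine Tutorials Library</h2>
```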

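BeautifulSoup Object

The BeautifulSoup object represents the parsed document as a whole. It is the object we create when scraping a web resource. For most purposes it can be treated like a Tag object, so it supports the methods needed to navigate and search the document tree.

Example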

from bs4 import BeautifulSoup
# Open the file in a with block so it is closed after parsing
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup)
print(soup.name)
print('type:', type(soup))

Output

<html>
<head>
<title>Yoagoa</title>
</head>
<body>
<h2>Departmentwise Employees</h2>
<ul>
<li>Accounts</li>
<ul>
<li>Anand</li>
<li>Mahesh</li>
</ul>
<li>HR</li>
<ul>
<li>Rani</li>
<li>Ankita</li>
</ul>
</ul>
</body>
</html>
[document]
type: <class 'bs4.BeautifulSoup'>

The name attribute of a BeautifulSoup object always returns [document].
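
To illustrate that the BeautifulSoup object can be searched like a Tag, here is a minimal sketch. It parses a short inline string with 'html.parser' instead of reading index.html, so the markup shown is an assumption made for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<ul><li>Accounts</li><li>HR</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

is_tag = isinstance(soup, Tag)   # BeautifulSoup is a subclass of Tag
doc_name = soup.name

# Search methods such as find() and find_all() work on the whole document.
first_list = soup.find('ul').name
items = [li.get_text() for li in soup.find_all('li')]

print(is_tag, doc_name)   # True [document]
print(first_list, items)  # ul ['Accounts', 'HR']
```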

Two parsed documents can be combined by passing a BeautifulSoup object as an argument to certain methods, such as replace_with().

Example

from bs4 import BeautifulSoup
obj1 = BeautifulSoup("<book><title>Python</title></book>", features="xml")
obj2 = BeautifulSoup("<b>Beautiful Soup parser</b>", "lxml")

obj2.find('b').replace_with(obj1)
print(obj2)

Output

<html><body><book><title>Python</title></book></body></html>

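Comment Object

In HTML and XML documents, any text written between <!-- and --> is treated as a comment. BeautifulSoup detects such text and represents it as a Comment object.

Example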

from bs4 import BeautifulSoup
markup = "<b><!--This is a comment text in HTML--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(comment, type(comment))

Output

This is a comment text in HTML <class 'bs4.element.Comment'>

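A Comment object is a special type of NavigableString. Continuing with the soup from the previous example, the prettify() method displays the comment in a special format, on its own indented line.

Example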

print(soup.b.prettify())

Output

<b>
 <!--This is a comment text in HTML-->
</b>