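One of the most common uses of a web scraping library such as Beautiful Soup is extracting the text from an HTML document. Sometimes you need to discard all the tags and their attributes and keep only the plain text of the document. Beautiful Soup's get_text() method is designed for this purpose.
The following basic example demonstrates the get_text() method. It removes every HTML tag and returns all of the text in the document.
Example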
html = '''
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)
Output
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.
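By default, get_text() keeps whitespace exactly as it appears in the source, including the leading space inside each paragraph and the newlines between tags. The sketch below uses a shortened, made-up HTML string (not the one from the example above) and repr() to make that whitespace visible. It also shows that get_text() can be called on a single tag rather than on the whole document.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, chosen only for illustration
html = "<html><body><p> The quick, brown fox.</p>\n<p> DJs flock by.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Text of the whole document; repr() exposes spaces and newlines
text = soup.get_text()
print(repr(text))   # ' The quick, brown fox.\n DJs flock by.'

# Text of the first <p> tag only
first = soup.p.get_text()
print(repr(first))  # ' The quick, brown fox.'
```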
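The get_text() method has an optional separator parameter, a string that is placed between the pieces of text it joins. In the following example, the separator parameter of get_text() is set to #.
Example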
html = '''
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)
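Output
#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#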
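The separator is inserted between every text node, including the whitespace-only newline nodes that sit between tags, which is why a # can appear on both sides of a line. The following sketch, using a shortened made-up HTML string, lists the individual text nodes through the strings generator and shows that joining them by hand gives the same result as get_text(separator='#') for this simple markup.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, chosen only for illustration
html = "\n<p>The quick, brown fox.</p>\n<p>DJs flock by.</p>\n"
soup = BeautifulSoup(html, "html.parser")

# Every text node in document order, newline-only nodes included
pieces = [str(s) for s in soup.strings]
print(pieces)
# ['\n', 'The quick, brown fox.', '\n', 'DJs flock by.', '\n']

# get_text() places the separator between each of those nodes
joined = soup.get_text(separator="#")
print(repr(joined))
# '\n#The quick, brown fox.#\n#DJs flock by.#\n'
```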
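The get_text() method also accepts a strip parameter, which can be True or False and defaults to False. When it is True, leading and trailing whitespace is removed from each piece of text, and pieces that contain only whitespace are dropped. The following example shows the effect of setting strip to True.
Example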
html = '''
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)
Output
The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.
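With strip=True alone, the sentences run together because the default separator is an empty string. A common way around this is to pass both parameters at once. The sketch below, again on a shortened made-up HTML string, combines strip=True with a newline separator to get one clean line per paragraph.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, chosen only for illustration
html = "\n<p> The quick, brown fox. </p>\n<p> DJs flock by. </p>\n"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text node and drops whitespace-only nodes;
# the separator is then placed between the remaining nodes
clean = soup.get_text(separator="\n", strip=True)
print(clean)
# The quick, brown fox.
# DJs flock by.

# Splitting on the separator gives one list entry per paragraph
lines = clean.split("\n")
```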