一、方法描述
get_text()
方法从整个 HTML 文档或给定的标签中仅返回人类可读的文本。所有子字符串将被指定的分隔符连接,默认情况下分隔符为空字符串。
二、语法
get_text(separator, strip)
三、参数
-
separator
:子字符串将使用此参数进行连接,默认情况下为 ""
。
-
四、返回类型
get_Text()
方法返回一个字符串。
五、示例
示例 1
在下面的示例中,get_text()
方法去除了所有的 HTML 标签。
html = '''
<html>
<body>
<p> The quick, brown fox jumps over a lazy dog.</p>
<p> DJs flock by when MTV ax quiz prog.</p>
<p> Junk MTV quiz graced by fox whelps.</p>
<p> Bawds jog, flick quartz, vex nymphs.</p>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)
输出:
The quick, brown fox jumps over a lazy dog.
DJs flock by when MTV ax quiz prog.
Junk MTV quiz graced by fox whelps.
Bawds jog, flick quartz, vex nymphs.
示例 2
在下面的示例中,我们将 get_text()
方法的 separator
参数指定为 '#'
。
html = '''
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator='#')
print(text)
输出:
#The quick, brown fox jumps over a lazy dog.#
#DJs flock by when MTV ax quiz prog.#
#Junk MTV quiz graced by fox whelps.#
#Bawds jog, flick quartz, vex nymphs.#
示例 3
让我们检查当 strip
参数设置为 True
时的效果,默认情况下它是 False
。
html = '''
<p>The quick, brown fox jumps over a lazy dog.</p>
<p>DJs flock by when MTV ax quiz prog.</p>
<p>Junk MTV quiz graced by fox whelps.</p>
<p>Bawds jog, flick quartz, vex nymphs.</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(strip=True)
print(text)
输出:
The quick, brown fox jumps over a lazy dog.DJs flock by when MTV ax quiz prog.Junk MTV quiz graced by fox whelps.Bawds jog, flick quartz, vex nymphs.