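Beautiful Soup diagnose() Function

1. Description

diagnose() in Beautiful Soup is a diagnostic suite for isolating common problems. If you are having trouble understanding what Beautiful Soup does to a document, pass the document to the diagnose() function. The report shows how the different parsers handle the document and tells you if a parser is missing.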

2. Syntax

diagnose(data)

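3. Parameters

data: a string containing the markup to diagnose.

4. Return Value

diagnose() does not return a value. It prints the result of parsing the given document with every available parser.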
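
Because the report is printed rather than returned, keeping it for later inspection requires redirecting standard output. The following is a minimal sketch, assuming only that Beautiful Soup is installed; the names `buffer`, `result`, and `report` are illustrative and not part of the library.

```python
import contextlib
import io

from bs4.diagnose import diagnose

markup = "<h1>Hello World<b>Welcome</b>"

# diagnose() prints its report, so redirect stdout to collect the text.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = diagnose(markup)

report = buffer.getvalue()

print(result)                  # the function has no return value
print(report.splitlines()[0])  # header line naming the Beautiful Soup version
```

Which parsers appear in the captured report depends on what is installed in the environment where the sketch runs.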

5. Example

Let's practice with this simple document:

<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>

The following code runs the diagnostics on the HTML markup above:

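from bs4.diagnose import diagnose

# Deliberately malformed markup: an unclosed <h1>, a stray </a>, and an unclosed <p>.
markup = '''
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
'''

# Print a report showing how each installed parser handles the markup.
diagnose(markup)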

The diagnose() output begins with a header that lists the Beautiful Soup and Python versions and the parsers that were found:

Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1

If the document being diagnosed were well-formed HTML, all parsers would produce roughly the same result. Our example, however, contains several errors.

The built-in html.parser is tried first. Its report is as follows:

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
   <h1>
      Hello World
   <b>
      Welcome
   </b>
   <p>
      <b>
         Beautiful Soup
         <i>
            Tutorial
         </i>
         <p>
         </p>
      </b>
   </p>
</h1>

Notice that Python's built-in parser does not insert <html> and <body> tags. The unclosed <h1> tag is given a matching </h1> only at the very end.
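
The same observations can be checked programmatically instead of by reading the report. This is a small sketch, assuming the sample markup is parsed directly with `BeautifulSoup(markup, "html.parser")`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
"""

soup = BeautifulSoup(markup, "html.parser")

# html.parser builds a tree from the fragment as given, without a wrapper.
no_html = soup.html is None
no_body = soup.body is None

# The unclosed <h1> stays open to the end, so the first <p> is nested inside it.
first_p = soup.find("p")
p_inside_h1 = first_p.find_parent("h1") is not None

# The stray </a> has no opening tag, so no <a> element appears in the tree.
no_anchor = soup.find("a") is None

print(no_html, no_body, p_inside_h1, no_anchor)
```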

The html5lib and lxml parsers complete the document by wrapping it in <html> and <body> tags; html5lib also adds an empty <head>.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
   <head>
   </head>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
         <p>
            <b>
               Beautiful Soup
               <i>
                  Tutorial
               </i>
            </b>
         </p>
         <p>
            <b>
            </b>
         </p>
      </h1>
   </body>
</html>

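With the lxml parser, note where the closing </h1> is inserted: before the first <p>, so the paragraphs become siblings of the heading. In addition, the unclosed <b> tag is closed and the dangling </a> is removed.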

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
      </h1>
      <p>
         <b>
            Beautiful Soup
            <i>
               Tutorial
            </i>
         </b>
      </p>
      <p>
      </p>
   </body>
</html>
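
The structural differences between the three HTML parsers can also be summarized in code. The sketch below is one possible way to do that, not part of diagnose() itself: `summarize` is a hypothetical helper, and it treats a parser that is not installed (signalled by `FeatureNotFound`) as "not available" rather than failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
"""


def summarize(markup, parser):
    """Return a few structural facts about the tree built by `parser`.

    Returns None when the requested parser is not installed.
    """
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    first_p = soup.find("p")
    return {
        "has_html": soup.html is not None,
        "has_head": soup.head is not None,
        "has_body": soup.body is not None,
        "first_p_parent": first_p.parent.name if first_p else None,
    }


summaries = {
    name: summarize(markup, name)
    for name in ("html.parser", "html5lib", "lxml")
}

for name, facts in summaries.items():
    print(name, facts if facts is not None else "not available")
```

Comparing `first_p_parent` across parsers shows at a glance whether a parser kept the paragraph inside the heading or moved it out, which is the difference the reports above illustrate.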

diagnose() also parses the document as XML, which is probably superfluous in our case.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<h1>
   Hello World
   <b>
      Welcome
   </b>
   <P>
      <b>
         Beautiful Soup
      </b>
      <i>
         Tutorial
      </i>
   <p/>
   </P>
</h1>

Now let's give diagnose() an XML document instead of an HTML one.

<?xml version="1.0" ?>
   <books>
      <book>
         <title>Python</title>
         <author>Yoagoa</author>
         <price>400</price>
      </book>
   </books>

If we run the diagnostics now, the HTML parsers are still applied even though the document is XML.

Trying to parse your markup with html.parser

Warning (from warnings module):
  File "C:\Users\mlath\OneDrive\Documents\Feb23 onwards\BeautifulSoup\Lib\site-packages\bs4\builder\__init__.py", line 545
    warnings.warn(
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

With html.parser, a warning is displayed. With html5lib, the first line, which contains the XML declaration, is turned into a comment, and the rest is parsed as an HTML document.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" ?-->
<html>
   <head>
   </head>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               Yoagoa
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

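The lxml HTML parser does not convert the XML declaration into a comment; it keeps the declaration and parses the rest of the document as HTML.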

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" ?>
<html>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               Yoagoa
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml-xml parser parses the document as XML.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" ?>
   <books>
      <book>
         <title>
            Python
         </title>
         <author>
            Yoagoa
         </author>
         <price>
            400
         </price>
      </book>
   </books>
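
Outside of diagnose(), the XML parser is selected by passing `features="xml"` to the BeautifulSoup constructor, as the warning message shown earlier suggests. The sketch below assumes lxml may or may not be installed and falls back gracefully; `titles` is an illustrative name.

```python
from bs4 import BeautifulSoup, FeatureNotFound

xml_markup = """<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>Yoagoa</author>
      <price>400</price>
   </book>
</books>
"""

try:
    # features="xml" requires the lxml package.
    soup = BeautifulSoup(xml_markup, features="xml")
except FeatureNotFound:
    soup = None
    print("No XML parser available; install lxml to parse this as XML.")

# Collect the text of every <title> inside a <book>, if parsing succeeded.
titles = []
if soup is not None:
    titles = [book.find("title").get_text() for book in soup.find_all("book")]

print(titles)
```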

The diagnostic report can help uncover errors in HTML/XML documents.