Class documentation

WP_HTML_Processor

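💡 云策 documentation notes

Overview

WP_HTML_Processor is a WordPress core class for safely parsing and modifying HTML5 documents. It supports a subset of the HTML5 specification and aborts early when it encounters unsupported markup, so that it never breaks the document. Compared with WP_HTML_Tag_Processor, it is better suited to queries based on nested structure and to planned future operations such as wrapping, unwrapping, and inserting or removing nodes.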

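Key points

  • WP_HTML_Processor safely parses HTML5, supports a subset of the specification, and aborts early on unsupported markup.
  • Usage takes three steps: call a static creator method, find a location, and request a change.
  • Breadcrumbs query nested structure and are equivalent to a CSS selector built with the child combinator.
  • It handles non-normative HTML such as omitted optional tags and unexpected tag closers, but aborts on elements inside a TABLE, foreign content such as SVG and MATH, and elements outside the IN BODY insertion mode.

Code example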

$processor = WP_HTML_Processor::create_fragment( $html );
if ( $processor->next_tag( array( 'breadcrumbs' => array( 'DIV', 'FIGURE', 'IMG' ) ) ) ) {
    $processor->add_class( 'responsive-image' );
}

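Notes

  • A breadcrumb query can specify the full path from the root to avoid erroneous partial matches.
  • When using create_fragment(), breadcrumbs include the implicit outermost elements, such as HTML and BODY, by default.
  • The class is designed to stay simple and compliant; it does not support elements inside a TABLE, SVG/MATH content, and similar cases, and stops processing when it meets them.

📄 Original content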

Core class used to safely parse and modify an HTML document.

Description

The HTML Processor class properly parses and modifies HTML5 documents.

It supports a subset of the HTML5 specification, and when it encounters unsupported markup, it aborts early to avoid unintentionally breaking the document. The HTML Processor should never break an HTML document.

While the WP_HTML_Tag_Processor is a valuable tool for modifying attributes on individual HTML tags, the HTML Processor is more capable and useful for the following operations:

  • Querying based on nested HTML structure.

Eventually the HTML Processor will also support:

  • Wrapping a tag in surrounding HTML.
  • Unwrapping a tag by removing its parent.
  • Inserting and removing nodes.
  • Reading and changing inner content.
  • Navigating up or around HTML structure.

Usage

Use of this class requires three steps:

  1. Call a static creator method with your input HTML document.
  2. Find the location in the document you are looking for.
  3. Request changes to the document at that location.

Example:

$processor = WP_HTML_Processor::create_fragment( $html );
if ( $processor->next_tag( array( 'breadcrumbs' => array( 'DIV', 'FIGURE', 'IMG' ) ) ) ) {
    $processor->add_class( 'responsive-image' );
}
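
The three steps can be sketched a little more defensively. The helper below is hypothetical (`example_add_responsive_class()` is an illustrative name, not part of WordPress) and assumes WordPress 6.4 or later. It checks that `create_fragment()` actually returned a processor and reads the modified document back with `get_updated_html()`:

```php
<?php
/**
 * Hypothetical helper: adds a class to the first IMG found at DIV > FIGURE > IMG.
 * Returns the input unchanged when the HTML Processor is unavailable.
 */
function example_add_responsive_class( string $html ): string {
    // Assumption: outside WordPress the class does not exist.
    if ( ! class_exists( 'WP_HTML_Processor' ) ) {
        return $html;
    }

    // Step 1: call a static creator method with the input HTML.
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return $html;
    }

    // Step 2: find the location in the document.
    if ( $processor->next_tag( array( 'breadcrumbs' => array( 'DIV', 'FIGURE', 'IMG' ) ) ) ) {
        // Step 3: request a change at that location.
        $processor->add_class( 'responsive-image' );
    }

    // Serialize the document, including any requested change.
    return $processor->get_updated_html();
}
```

If no tag matches the breadcrumbs, no change is requested and the helper returns the original HTML.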

Breadcrumbs represent the stack of open elements from the root of the document or fragment down to the currently-matched node, if one is currently selected. Call WP_HTML_Processor::get_breadcrumbs() to inspect the breadcrumbs for a matched tag.
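
To make the idea of breadcrumbs concrete, here is a minimal sketch that collects the breadcrumbs of every tag the processor visits. The function name `example_list_breadcrumbs()` is hypothetical; only `create_fragment()`, `next_tag()` and `get_breadcrumbs()` come from the class itself:

```php
<?php
/**
 * Hypothetical helper: lists the breadcrumbs of every tag the processor visits.
 *
 * @return string[] One "A > B > C" string per visited tag.
 */
function example_list_breadcrumbs( string $html ): array {
    // Assumption: outside WordPress the class does not exist.
    if ( ! class_exists( 'WP_HTML_Processor' ) ) {
        return array();
    }

    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return array();
    }

    $paths = array();
    while ( $processor->next_tag() ) {
        // get_breadcrumbs() returns the open elements from the root down to this tag.
        $paths[] = implode( ' > ', $processor->get_breadcrumbs() );
    }

    return $paths;
}
```

For an input such as `<figure><img></figure>`, the entries should read along the lines of `HTML > BODY > FIGURE` and `HTML > BODY > FIGURE > IMG`.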

Breadcrumbs can specify nested HTML structure and are equivalent to a CSS selector comprising tag names separated by the child combinator, such as “DIV > FIGURE > IMG”.

Since all elements find themselves inside a full HTML document when parsed, the return value from get_breadcrumbs() will always contain any implicit outermost elements. For example, when parsing with create_fragment() in the BODY context (the default), any tag in the given HTML document will contain array( 'HTML', 'BODY', … ) in its breadcrumbs.

Despite containing the implied outermost elements in their breadcrumbs, tags may be found with the shortest-matching breadcrumb query. That is, array( 'IMG' ) matches all IMG elements and array( 'P', 'IMG' ) matches all IMG elements directly inside a P element. To ensure that no partial match is found by mistake, a query can specify the full breadcrumb path all the way down from the root HTML element.

Example:

$html = '<figure><img><figcaption>A <em>lovely</em> day outside</figcaption></figure>';
//               ----- Matches here.
$processor = WP_HTML_Processor::create_fragment( $html );
$processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) );

$html = '<figure><img><figcaption>A <em>lovely</em> day outside</figcaption></figure>';
//                                  ---- Matches here.
$processor = WP_HTML_Processor::create_fragment( $html );
$processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'FIGCAPTION', 'EM' ) ) );

$html = '<div><img></div><img>';
//                       ----- Matches here, because IMG must be a direct child of the implicit BODY.
$processor = WP_HTML_Processor::create_fragment( $html );
$processor->next_tag( array( 'breadcrumbs' => array( 'BODY', 'IMG' ) ) );
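
The difference between a shortest-matching query and a full-path query can be checked by counting matches. This is a minimal sketch; `example_count_matches()` is a hypothetical name introduced for illustration:

```php
<?php
/**
 * Hypothetical helper: counts the tags that match a breadcrumb query.
 *
 * @param string   $html        Input HTML fragment.
 * @param string[] $breadcrumbs Tag names, outermost first.
 */
function example_count_matches( string $html, array $breadcrumbs ): int {
    // Assumption: outside WordPress the class does not exist.
    if ( ! class_exists( 'WP_HTML_Processor' ) ) {
        return 0;
    }

    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return 0;
    }

    $count = 0;
    // Each successful call advances to the next tag matching the query.
    while ( $processor->next_tag( array( 'breadcrumbs' => $breadcrumbs ) ) ) {
        ++$count;
    }

    return $count;
}
```

Given `<div><img></div><img>`, the query `array( 'IMG' )` should count both IMG tags, while `array( 'HTML', 'BODY', 'IMG' )` should count only the second, because the first is a child of the DIV.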

HTML Support

This class implements a small part of the HTML5 specification.
It’s designed to operate within its supported subset and to abort early whenever it encounters circumstances it can’t properly handle. This is the principal way in which this class remains as simple as possible without cutting corners and breaking compliance.

Supported elements

If any unsupported element appears in the HTML input the HTML Processor will abort early and stop all processing. This draconian measure ensures that the HTML Processor won’t break any HTML it doesn’t fully understand.

The HTML Processor supports all elements other than a specific set:

  • Any element inside a TABLE.
  • Any element inside foreign content, including SVG and MATH.
  • Any element outside the IN BODY insertion mode, e.g. doctype declarations, meta, links.
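
Because the processor stops rather than guessing, calling code may want to know whether a document was processed to the end. The sketch below uses `get_last_error()`, which returns null when no error occurred. `example_fully_processed()` is a hypothetical name, and which inputs trigger an abort depends on the installed WordPress version:

```php
<?php
/**
 * Hypothetical helper: reports whether the processor reached the end of the
 * input without aborting on unsupported markup.
 */
function example_fully_processed( string $html ): bool {
    // Assumption: outside WordPress the class does not exist.
    if ( ! class_exists( 'WP_HTML_Processor' ) ) {
        return false;
    }

    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return false;
    }

    // Visit every tag; the loop also ends when the processor aborts.
    while ( $processor->next_tag() ) {
        continue;
    }

    // A non-null error means processing stopped early.
    return null === $processor->get_last_error();
}
```

A false result means the document should be left as it was, since the processor could not fully examine it.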

Supported markup

Some kinds of non-normative HTML involve reconstruction of formatting elements and re-parenting of mis-nested elements. For example, a DIV tag found inside a TABLE may in fact belong before the table in the DOM. If the HTML Processor encounters such a case it will stop processing.

The following list illustrates some common examples of unexpected HTML inputs that the HTML Processor properly parses and represents:

  • HTML with optional tags omitted, e.g. <p>one<p>two.
  • HTML with unexpected tag closers, e.g. <p>one </span> more</p>.
  • Non-void tags with self-closing flag, e.g. <div/>the DIV is still open.</div>.
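
The handling of omitted optional tags can be observed through breadcrumb depth. In `<p>one<p>two` the second P implicitly closes the first, so both paragraphs should sit at the same depth rather than nesting. This is a minimal sketch; `example_paragraph_depths()` is a hypothetical name:

```php
<?php
/**
 * Hypothetical helper: returns the breadcrumb depth of every P element.
 *
 * @return int[] One depth per P element, in document order.
 */
function example_paragraph_depths( string $html ): array {
    // Assumption: outside WordPress the class does not exist.
    if ( ! class_exists( 'WP_HTML_Processor' ) ) {
        return array();
    }

    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return array();
    }

    $depths = array();
    // array( 'P' ) is the shortest-matching query for every P element.
    while ( $processor->next_tag( array( 'breadcrumbs' => array( 'P' ) ) ) ) {
        $depths[] = count( $processor->get_breadcrumbs() );
    }

    return $depths;
}
```

For `<p>one<p>two`, both entries should be equal, each corresponding to the path HTML > BODY > P.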