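Block Editor Development Documentation

Data Flow and Data Format

💡 云策文档 Annotations

Overview

This document introduces the core concepts of data flow and data format in the WordPress Block Editor: the in-memory representation of a block editor post, the structure of a block object, and the serialization and parsing processes. It focuses on how block data is manipulated in memory as a tree and serialized with HTML comments as delimiters, which keeps it compatible with the WordPress ecosystem.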

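Key points

  • A block editor post is an in-memory tree of block objects that describes the semantics and essential data of each block; it is distinct from the post_content it eventually produces.
  • A block object has the properties clientId, type, attributes, and innerBlocks; its structure and allowed content are defined by the block type.
  • Serialization converts the block tree into HTML, using HTML comments as block delimiters that can store attributes (as JSON literals), which keeps the data consistent and readable.
  • Parsing rebuilds the block tree from the serialized HTML using the delimiters and a parsing expression grammar, which allows efficient and fault-tolerant parsing.
  • The data lifecycle runs from parsing post_content into a block tree, through manipulating the tree in the editor, to serializing it back to post_content, so there is a single source of truth.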

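Code examples

// Block object example
const block = {
    clientId, // unique string identifier
    type, // the block type (e.g. paragraph, image)
    attributes, // (key, value) set representing the direct properties/content of the current block
    innerBlocks, // an array of child blocks or inner blocks
};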

// Paragraph block example
const paragraphBlock1 = {
    clientId: '51828be1-5f0d-4a6b-8099-f4c6f897e0a3',
    type: 'core/paragraph',
    attributes: {
        content: 'This is the <strong>content</strong> of the paragraph block',
        dropCap: true,
    },
};

// Serialized block example (HTML comment delimiters)
<!-- wp:image -->
<figure class="wp-block-image"><img src="source.jpg" alt="" /></figure>
<!-- /wp:image -->

<!-- wp:latest-posts {"postsToShow":4,"displayPostDate":true} /-->

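Notes

  • A block editor post is not HTML in essence; it is structured data that happens to be stored in post_content, which preserves backward compatibility and accessibility.
  • Dynamic blocks (such as server-rendered blocks) may rely on additional mechanisms (such as global blocks or storage in WP_Post objects), so keep in mind that where a block stores its data is flexible.
  • The parser takes advantage of the simplicity and permissiveness of HTML comments and does not need fully valid HTML, which improves performance and robustness.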

📄 Original content

The format

A block editor post is the proper block-aware representation of a post: a collection of semantically consistent descriptions of what each block is and what its essential data is. This representation only ever exists in memory. It is the chase in the typesetter’s workshop, ever-shifting as sorts are attached and repositioned.

A block editor post is not the artifact it produces, namely the post_content. The latter is the printed page, optimized for the reader but retaining its invisible markings for later editing.

The input and output of the block editor is a tree of block objects with the current format:

const value = [ block1, block2, block3 ];

The block object

Each block object has an id, a set of attributes and potentially a list of child blocks.

const block = {
    clientId, // unique string identifier.
    type, // The block type (paragraph, image...)
    attributes, // (key, value) set of attributes representing the direct properties/content of the current block.
    innerBlocks, // An array of child blocks or inner blocks.
};

Note that the attribute keys and types, as well as the allowed inner blocks, are defined by the block type. For example, the core quote block has a cite string attribute representing the cite content, while a heading block has a numeric level attribute representing the level of the heading (1 to 6).

During the lifecycle of the block in the editor, the block object can receive extra metadata:

  • isValid: A boolean representing whether the block is valid or not;
  • originalContent: The original HTML serialization of the block.

Examples

// A simple paragraph block.
const paragraphBlock1 = {
    clientId: '51828be1-5f0d-4a6b-8099-f4c6f897e0a3',
    type: 'core/paragraph',
    attributes: {
        content: 'This is the <strong>content</strong> of the paragraph block',
        dropCap: true,
    },
};

// A separator block.
const separatorBlock = {
    clientId: '51828be1-5f0d-4a6b-8099-f4c6f897e0a4',
    type: 'core/separator',
    attributes: {},
};

// A columns block with a paragraph block in each column.
// paragraphBlock2 is assumed to be a second paragraph block shaped like paragraphBlock1.
const columnsBlock = {
    clientId: '51828be1-5f0d-4a6b-8099-f4c6f897e0a7',
    type: 'core/columns',
    attributes: {},
    innerBlocks: [
        {
            clientId: '51828be1-5f0d-4a6b-8099-f4c6f897e0a5',
            type: 'core/column',
            attributes: {},
            innerBlocks: [ paragraphBlock1 ],
        },
        {
            clientId: '51828be1-5f0d-4a6b-8099-f4c6f897e0a6',
            type: 'core/column',
            attributes: {},
            innerBlocks: [ paragraphBlock2 ],
        },
    ],
};
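
Because child blocks live in innerBlocks, a tree like the columns example can be walked recursively. The following is a minimal sketch, not part of the editor's API: flattenBlocks and the paragraph factory are hypothetical helpers, and the short clientId values are placeholders chosen for readability.

```javascript
// Hypothetical helper: list every block in a tree in depth-first order.
function flattenBlocks(blocks) {
    const result = [];
    for (const block of blocks) {
        result.push(block);
        // Leaf blocks may omit innerBlocks, so fall back to an empty array.
        result.push(...flattenBlocks(block.innerBlocks ?? []));
    }
    return result;
}

// Hypothetical factory for a paragraph block object.
const paragraph = (clientId, content) => ({
    clientId,
    type: 'core/paragraph',
    attributes: { content },
});

// A columns block with one paragraph in each column (placeholder ids).
const tree = [
    {
        clientId: 'columns-1',
        type: 'core/columns',
        attributes: {},
        innerBlocks: [
            {
                clientId: 'column-1',
                type: 'core/column',
                attributes: {},
                innerBlocks: [ paragraph('p-1', 'Left') ],
            },
            {
                clientId: 'column-2',
                type: 'core/column',
                attributes: {},
                innerBlocks: [ paragraph('p-2', 'Right') ],
            },
        ],
    },
];

// Each parent is listed before its children.
console.log(flattenBlocks(tree).map((block) => block.type));
```

The same recursion pattern applies to any operation that has to reach nested blocks, such as finding a block by clientId.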

Serialization and parsing

This data model, however, is something that lives in memory while editing a post. It’s not visible to the page viewer when rendered, just like a printed page has no trace of the structure of the letters that produced it in the press.

Since the whole WordPress ecosystem expects to receive HTML when rendering or editing a post, the block editor transforms its data into something that can be saved in post_content through serialization. This ensures that there’s a single source of truth for the content, and that this source remains readable and compatible with all the tools that interact with WordPress content today. Were we to store the object tree separately, we would face the risk of post_content and the tree getting out of sync and the problem of data duplication in both places.

Thus, the serialization process converts the block tree into HTML using HTML comments as explicit block delimiters—which can contain the attributes in non-HTML form. This is the act of printing invisible marks on the printed page that leave a trace of the original structured intention.

This is one end of the process. The other is how to recreate the collection of blocks whenever a post is to be edited again. A formal grammar defines how the serialized representation of a block editor post should be loaded, just as some basic rules define how to turn the tree into an HTML-like string. The block editor’s posts aren’t designed to be edited by hand; they aren’t designed to be edited as HTML documents because the block editor posts aren’t HTML in essence.

They just happen, incidentally, to be stored inside of post_content in a way in which they require no transformation in order to be viewable by any legacy system. It’s true that loading the stored HTML into a browser without the corresponding machinery might degrade the experience, and if it included dynamic blocks of content, the dynamic elements may not load, server-generated content may not appear, and interactive content may remain static. However, it at least protects against not being able to view block editor posts on themes and installations that are blocks-unaware, and it provides the most accessible way to the content. In other words, the post remains mostly intact even if the saved HTML is rendered as is.

Delimiters and parsing expression grammar

We chose instead to try to find a way to keep the formality, explicitness, and unambiguity in the existing HTML syntax. Within the HTML there were a number of options.

Of these options, a novel approach was suggested: by storing data in HTML comments, we would know that we wouldn’t break the rest of the HTML in the document, that browsers should ignore it, and that we could simplify our approach to parsing the document.

Unique to HTML comments is the fact that they cannot legitimately exist in ambiguous places, such as inside of HTML attributes like <img alt='data-id="14"'>. Comments are also quite permissive. Whereas HTML attributes are complicated to parse properly, comments are quite easily described by a leading <!-- followed by anything except -- until the first -->. This simplicity and permissiveness means that the parser can be implemented in several ways without needing to understand HTML properly, and we have the liberty to use more convenient syntax inside of the comment—we only need to escape double-hyphen sequences. We take advantage of this in how we store block attributes: as JSON literals inside the comment.
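
As a rough illustration of storing attributes as JSON inside a comment, the sketch below builds a delimiter string by hand. serializeAttributes and wrapInDelimiters are hypothetical helpers written for this example; replacing "--" with JSON unicode escapes is one possible way to keep double hyphens out of the comment and is an assumption here, not a statement of what the editor's serializer does in full.

```javascript
// Hypothetical helper: encode attributes as JSON that is safe inside an HTML comment.
function serializeAttributes(attributes) {
    // "--" would end the comment early, so write each pair as JSON unicode escapes.
    // JSON.parse turns the escapes back into hyphens when the value is read.
    return JSON.stringify(attributes).replace(/--/g, '\\u002d\\u002d');
}

// Hypothetical helper: wrap a block's saved HTML in comment delimiters.
function wrapInDelimiters(type, attributes, html) {
    // Core blocks appear without the "core/" namespace in the delimiter, as in "wp:image".
    const name = type.startsWith('core/') ? type.slice('core/'.length) : type;
    const hasAttributes = Object.keys(attributes).length > 0;
    const attrs = hasAttributes ? serializeAttributes(attributes) + ' ' : '';
    if (!html) {
        // No saved HTML: emit a single self-closing delimiter.
        return `<!-- wp:${name} ${attrs}/-->`;
    }
    return `<!-- wp:${name} ${attrs}-->\n${html}\n<!-- /wp:${name} -->`;
}

console.log(
    wrapInDelimiters('core/latest-posts', { postsToShow: 4, displayPostDate: true }, '')
);
console.log(serializeAttributes({ note: 'a -- b' }));
```

Because the escaping happens at the JSON level, reading the attributes back needs nothing beyond JSON.parse.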

After running this through the parser, we’re left with a simple object we can manipulate idiomatically, and we don’t have to worry about escaping or unescaping the data. It’s handled for us through the serialization process. Because the comments are so different from other HTML tags and because we can perform a first-pass to extract the top-level blocks, we don’t actually depend on having fully valid HTML!

This has dramatic implications for how simple and performant we can make our parser. These explicit boundaries also prevent damage in a single block from bleeding into other blocks or tarnishing the entire document. They also allow the system to identify unrecognized blocks before rendering them.

N.B.: The defining aspects of blocks are their semantics and the isolation mechanism they provide: in other words, their identity. On the other hand, where their data is stored is a more liberal aspect. Blocks support more than just static local data (via JSON literals inside the HTML comment or within the block’s HTML), and more mechanisms (e.g., global blocks or otherwise resorting to storage in complementary WP_Post objects) are expected. See attributes for details.

The anatomy of a serialized block

When blocks are saved to the content after the editing session, their attributes, depending on the nature of the block, are serialized to these explicit comment delimiters.

<!-- wp:image -->
<figure class="wp-block-image"><img src="source.jpg" alt="" /></figure>
<!-- /wp:image -->

A purely dynamic block that is to be server-rendered before display could look like this:

<!-- wp:latest-posts {"postsToShow":4,"displayPostDate":true} /-->
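
To show how delimiters like the two above can be turned back into a tree without understanding the HTML between them, here is a minimal sketch of a delimiter-only pass. It uses a regular expression and a stack rather than the formal grammar the editor relies on. DELIMITER and parseBlocks are hypothetical names, the HTML between delimiters is ignored, and clientId is omitted because it only exists in memory.

```javascript
// Matches an opening, closing, or self-closing block delimiter.
// Groups: 1 = "/" when closing, 2 = block name, 3 = JSON attributes, 4 = "/" when self-closing.
const DELIMITER =
    /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*(?:\/[a-z][a-z0-9_-]*)?)\s+(\{.*?\}\s+)?(\/)?-->/gs;

// Hypothetical parser: rebuild the block tree from delimiters only.
function parseBlocks(html) {
    const root = { innerBlocks: [] };
    const stack = [ root ];
    for (const [ , closing, name, json, selfClosing ] of html.matchAll(DELIMITER)) {
        // Names without a namespace are treated as core blocks.
        const type = name.includes('/') ? name : `core/${name}`;
        const current = stack[ stack.length - 1 ];
        if (closing) {
            // Pop only when the closer matches the innermost open block,
            // so a stray closer cannot corrupt the rest of the tree.
            if (stack.length > 1 && current.type === type) {
                stack.pop();
            }
            continue;
        }
        const block = {
            type,
            attributes: json ? JSON.parse(json) : {},
            innerBlocks: [],
        };
        current.innerBlocks.push(block);
        if (!selfClosing) {
            stack.push(block);
        }
    }
    return root.innerBlocks;
}

const saved = [
    '<!-- wp:image -->',
    '<figure class="wp-block-image"><img src="source.jpg" alt="" /></figure>',
    '<!-- /wp:image -->',
    '',
    '<!-- wp:latest-posts {"postsToShow":4,"displayPostDate":true} /-->',
].join('\n');

console.log(JSON.stringify(parseBlocks(saved), null, 2));
```

The markup inside the figure never has to be valid for this pass to succeed, since only the comment delimiters are inspected.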

The data lifecycle

In summary, the block editor workflow parses the saved document to an in-memory tree of blocks, using token delimiters to help. During editing, all manipulations happen within the block tree. The process ends by serializing the blocks back to the post_content.

The workflow process relies on a serialization/parser pair to persist posts. Hypothetically, the post data structure could be stored using a plugin or retrieved from a remote JSON file to be converted to the block tree.