wp_is_valid_utf8()

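💡 云策 documentation notes

Overview

wp_is_valid_utf8() determines whether a given byte string is a valid UTF-8 encoding. It is implemented with PHP's mb_check_encoding() and returns a boolean.

Key points

The function checks that a byte sequence conforms to the UTF-8 encoding rules, including that every character uses its minimal byte sequence and that no surrogate code points are encoded.
Non-UTF-8 data is unlikely to validate as UTF-8, but it can: many texts are simultaneously valid UTF-8, valid US-ASCII, and valid ISO-8859-1.
The function was introduced in WordPress 6.9.0 and is commonly used when validating and sanitizing strings.

Code examples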

true === wp_is_valid_utf8( '' );
true === wp_is_valid_utf8( 'just a test' );
true === wp_is_valid_utf8( "\xE2\x9C\x8F" ); // Pencil, U+270F.
true === wp_is_valid_utf8( "\u{270F}" );     // Pencil, U+270F.
true === wp_is_valid_utf8( '✏' );            // Pencil, U+270F.

false === wp_is_valid_utf8( "just \xC0 test" ); // Invalid bytes.
false === wp_is_valid_utf8( "\xE2\x9C" );       // Invalid/incomplete sequences.
false === wp_is_valid_utf8( "\xC1\xBF" );       // Overlong sequences.
false === wp_is_valid_utf8( "\xED\xB0\x80" );   // Surrogate halves.
false === wp_is_valid_utf8( "B\xFCch" );        // ISO-8859-1 high-bytes.

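Notes

A valid string conforms to the UTF-8 encoding scheme, uses the minimal byte sequence for every character, and contains no UTF-16 surrogate code points or characters above the representable range. See the Unicode Standard for details.

📄 Original content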

Determines if a given byte string represents a valid UTF-8 encoding.

Description

Note that it’s unlikely for non-UTF-8 data to validate as UTF-8, but it is still possible. Many texts are simultaneously valid UTF-8, valid US-ASCII, and valid ISO-8859-1 (latin1).
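
The overlap between encodings can be seen with a small sketch. It assumes the mbstring extension is available and defines a fallback copy of wp_is_valid_utf8() (mirroring the Source on this page) so that it also runs outside WordPress 6.9.0+. The variable names are illustrative only.

```php
<?php
// Fallback for running outside WordPress 6.9.0+; mirrors the Source on this page.
if ( ! function_exists( 'wp_is_valid_utf8' ) ) {
	function wp_is_valid_utf8( string $bytes ): bool {
		return mb_check_encoding( $bytes, 'UTF-8' );
	}
}

// Pure ASCII text is valid under several encodings at once.
$ascii          = 'just a test';
$ascii_is_utf8  = wp_is_valid_utf8( $ascii );
$ascii_is_ascii = mb_check_encoding( $ascii, 'ASCII' );

// These two bytes encode "é" in UTF-8, yet an ISO-8859-1 reader would see
// the two characters "Ã©". Validation alone cannot tell which was intended.
$ambiguous         = "\xC3\xA9";
$ambiguous_is_utf8 = wp_is_valid_utf8( $ambiguous );
$read_as_latin1    = mb_convert_encoding( $ambiguous, 'UTF-8', 'ISO-8859-1' );
```

A true result therefore means "these bytes can be decoded as UTF-8", not "these bytes were written as UTF-8".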

Example:

true === wp_is_valid_utf8( '' );
true === wp_is_valid_utf8( 'just a test' );
true === wp_is_valid_utf8( "\xE2\x9C\x8F" ); // Pencil, U+270F.
true === wp_is_valid_utf8( "\u{270F}" );     // Pencil, U+270F.
true === wp_is_valid_utf8( '✏' );            // Pencil, U+270F.

false === wp_is_valid_utf8( "just \xC0 test" ); // Invalid bytes.
false === wp_is_valid_utf8( "\xE2\x9C" );       // Invalid/incomplete sequences.
false === wp_is_valid_utf8( "\xC1\xBF" );       // Overlong sequences.
false === wp_is_valid_utf8( "\xED\xB0\x80" );   // Surrogate halves.
false === wp_is_valid_utf8( "B\xFCch" );        // ISO-8859-1 high-bytes.
                                                // E.g. The “ü” in ISO-8859-1 is a single byte 0xFC,
                                                // but in UTF-8 is the two-byte sequence 0xC3 0xBC.

A “valid” string consists of “well-formed UTF-8 code unit sequence[s],” meaning that the bytes conform to the UTF-8 encoding scheme, all characters use the minimal byte sequence required by UTF-8, and that no sequence encodes a UTF-16 surrogate code point or any character above the representable range.
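
To make the three conditions concrete, here is a minimal hand-written sketch of the byte ranges listed in the Unicode Standard's table of well-formed UTF-8 byte sequences. The function name example_is_well_formed_utf8() is hypothetical, and this is not how WordPress implements the check (the Source below delegates to mb_check_encoding()); it only illustrates where overlong forms, surrogates, and out-of-range values are rejected.

```php
<?php
/**
 * Hypothetical illustration of the well-formedness rules, checked by hand.
 * Not WordPress's implementation.
 */
function example_is_well_formed_utf8( string $bytes ): bool {
	$length = strlen( $bytes );
	$i      = 0;

	while ( $i < $length ) {
		$b0 = ord( $bytes[ $i ] );

		if ( $b0 <= 0x7F ) {
			$i++; // Single-byte (ASCII) character.
			continue;
		}

		// Pick the number of continuation bytes and the allowed range
		// for the FIRST continuation byte, based on the lead byte.
		if ( $b0 >= 0xC2 && $b0 <= 0xDF ) {
			list( $extra, $min, $max ) = array( 1, 0x80, 0xBF );
		} elseif ( 0xE0 === $b0 ) {
			list( $extra, $min, $max ) = array( 2, 0xA0, 0xBF ); // Rejects overlong 3-byte forms.
		} elseif ( 0xED === $b0 ) {
			list( $extra, $min, $max ) = array( 2, 0x80, 0x9F ); // Rejects surrogates U+D800..U+DFFF.
		} elseif ( $b0 >= 0xE1 && $b0 <= 0xEF ) {
			list( $extra, $min, $max ) = array( 2, 0x80, 0xBF );
		} elseif ( 0xF0 === $b0 ) {
			list( $extra, $min, $max ) = array( 3, 0x90, 0xBF ); // Rejects overlong 4-byte forms.
		} elseif ( $b0 >= 0xF1 && $b0 <= 0xF3 ) {
			list( $extra, $min, $max ) = array( 3, 0x80, 0xBF );
		} elseif ( 0xF4 === $b0 ) {
			list( $extra, $min, $max ) = array( 3, 0x80, 0x8F ); // Rejects values above U+10FFFF.
		} else {
			return false; // 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
		}

		if ( $i + $extra >= $length ) {
			return false; // Sequence is cut off before its last byte.
		}

		for ( $k = 1; $k <= $extra; $k++ ) {
			$b  = ord( $bytes[ $i + $k ] );
			$lo = ( 1 === $k ) ? $min : 0x80;
			$hi = ( 1 === $k ) ? $max : 0xBF;
			if ( $b < $lo || $b > $hi ) {
				return false;
			}
		}

		$i += $extra + 1;
	}

	return true;
}
```

Restricting only the first continuation byte is enough because the overlong, surrogate, and out-of-range cases are all decided by the top bits of the code point, which live in the lead byte and the byte after it.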

Parameters

$bytes (string, required): String which might contain text encoded as UTF-8.

Return

bool Whether the provided bytes can decode as valid UTF-8.

Source

function wp_is_valid_utf8( string $bytes ): bool {
	return mb_check_encoding( $bytes, 'UTF-8' );
}
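
One way the function might be used is to gate a legacy-encoding fallback. The sketch below is an assumption-laden example, not a WordPress API: example_ensure_utf8() is a hypothetical helper, and treating invalid input as ISO-8859-1 is a guess that only suits sites whose legacy data is known to be Latin-1. A fallback definition of wp_is_valid_utf8() is included so the snippet also runs outside WordPress 6.9.0+.

```php
<?php
// Fallback for running outside WordPress 6.9.0+; mirrors the Source above.
if ( ! function_exists( 'wp_is_valid_utf8' ) ) {
	function wp_is_valid_utf8( string $bytes ): bool {
		return mb_check_encoding( $bytes, 'UTF-8' );
	}
}

/**
 * Hypothetical helper: returns the input unchanged when it is valid UTF-8,
 * otherwise converts it on the ASSUMPTION that it is ISO-8859-1.
 */
function example_ensure_utf8( string $bytes ): string {
	if ( wp_is_valid_utf8( $bytes ) ) {
		return $bytes;
	}

	return mb_convert_encoding( $bytes, 'UTF-8', 'ISO-8859-1' );
}

$already_utf8 = example_ensure_utf8( "\xE2\x9C\x8F" ); // Pencil, left as-is.
$from_latin1  = example_ensure_utf8( "B\xFCch" );      // 0xFC becomes 0xC3 0xBC.
```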


Used by	Description
wxr_cdata() `wp-admin/includes/export.php`	Wraps given string in XML CDATA tag.
wp_read_image_metadata() `wp-admin/includes/image.php`	Gets extended image metadata, exif or iptc as available.
sanitize_file_name() `wp-includes/formatting.php`	Sanitizes a filename, replacing whitespace with dashes.
sanitize_title_with_dashes() `wp-includes/formatting.php`	Sanitizes a title, replacing whitespace and a few other characters with dashes.
wp_check_invalid_utf8() `wp-includes/formatting.php`	Checks for invalid UTF8 in a string.
remove_accents() `wp-includes/formatting.php`	Converts all accent characters to ASCII characters.


Changelog

Version	Description
6.9.0	Introduced.
