wp_scrub_utf8()
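Overview
wp_scrub_utf8() replaces invalid byte sequences in a UTF-8 encoded string with the Unicode Replacement Character (U+FFFD). It follows the “maximal subpart” algorithm so that encoding problems are neutralized safely and do not cause errors downstream.
Key points
- Invalid UTF-8 byte sequences are replaced with the Unicode Replacement Character (�). Use the function with care: replacing bytes too early can cause data loss or problems in downstream processing.
- Replacement follows the “maximal subpart” algorithm for secure and interoperable strings, which can produce several replacement characters in a row.
- The $text parameter is a string assumed to be UTF-8 that may contain invalid byte sequences; the function returns the string with those sequences replaced.
- The Unicode Replacement Character is itself a Unicode character (U+FFFD). After replacement it is not possible to tell whether it was originally intended or is the result of scrubbing, so replacement is best reserved for display.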
Replaces ill-formed UTF-8 byte sequences with the Unicode Replacement Character.
Description
Knowing what to do in the presence of text encoding issues can be complicated.
This function replaces invalid spans of bytes to neutralize any corruption that may be there and prevent it from causing further problems downstream.
However, it’s not always ideal to replace those bytes. In some settings it may be best to leave the invalid bytes in the string so that downstream code can handle them in a specific way. Replacing the bytes too early, like escaping for HTML too early, can introduce other forms of corruption and data loss.
When in doubt, use this function to replace spans of invalid bytes.
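Where downstream code needs the original bytes, one option is to detect invalid UTF-8 without modifying the string and defer the decision to replace. The following is a minimal sketch under that assumption; `example_needs_scrubbing()` is a hypothetical helper, not part of WordPress, and it relies only on PHP's `mb_check_encoding()`.

```php
<?php
// Hypothetical helper (not a WordPress function): reports whether the
// string contains bytes that are not valid UTF-8. The input is not modified.
function example_needs_scrubbing( string $text ): bool {
	return ! mb_check_encoding( $text, 'UTF-8' );
}

$raw = "caf\xC3\xA9 \xC0"; // Valid "café" followed by a stray C0 byte.

if ( example_needs_scrubbing( $raw ) ) {
	// $raw stays untouched here; the caller decides when, and whether,
	// to replace the invalid bytes.
	$note = 'invalid UTF-8 detected';
} else {
	$note = 'valid UTF-8';
}
```

Checking first keeps the invalid bytes available for code that wants to handle them in its own way, at the cost of carrying a possibly invalid string further through the program.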
Replacement follows the “maximal subpart” algorithm for secure and interoperable strings. This can lead to sequences of multiple replacement characters in a row.
Example:
// Valid strings come through unchanged.
'test' === wp_scrub_utf8( 'test' );
// Invalid sequences of bytes are replaced.
$invalid = "the byte \xC0 is never allowed in a UTF-8 string.";
"the byte \u{FFFD} is never allowed in a UTF-8 string." === wp_scrub_utf8( $invalid );
'the byte � is never allowed in a UTF-8 string.' === wp_scrub_utf8( $invalid );
// Maximal subparts are replaced individually.
'.�.' === wp_scrub_utf8( ".\xC0." ); // C0 is never valid.
'.�.' === wp_scrub_utf8( ".\xE2\x8C." ); // Missing A3 at end.
'.��.' === wp_scrub_utf8( ".\xE2\x8C\xE2\x8C." ); // Maximal subparts replaced separately.
'.��.' === wp_scrub_utf8( ".\xC1\xBF." ); // Overlong sequence.
'.���.' === wp_scrub_utf8( ".\xED\xA0\x80." ); // Surrogate half.
Note! The Unicode Replacement Character is itself a Unicode character (U+FFFD).
Once a span of invalid bytes has been replaced by one, it will not be possible to know whether the replacement character was originally intended to be there or if it is the result of scrubbing bytes. It is ideal to leave replacement for display only, but some contexts (e.g. generating XML or passing data into a large language model) require valid input strings.
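The point about display-only replacement can be illustrated by keeping the original bytes and scrubbing a copy. This is a sketch rather than a recommended pattern: `example_scrub_for_display()` is a hypothetical stand-in that mirrors the Source listing on this page so the example runs outside WordPress, and the variable names are illustrative.

```php
<?php
// Hypothetical stand-in for wp_scrub_utf8(), mirroring the Source listing,
// so that this sketch runs without WordPress loaded.
function example_scrub_for_display( string $text ): string {
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $text, 'UTF-8' );
	mb_substitute_character( $previous );
	return $scrubbed;
}

$stored  = "price: \xC0 10"; // Original bytes, kept as received.
$display = example_scrub_for_display( $stored );

// Comparing the two strings is the only way to tell that scrubbing
// happened; the scrubbed string alone cannot reveal it.
$was_scrubbed = ( $display !== $stored );

// A string that already contains U+FFFD passes through unchanged, so the
// character by itself says nothing about where it came from.
$intentional = "unknown glyph: \u{FFFD}";
$unchanged   = ( example_scrub_for_display( $intentional ) === $intentional );
```

Keeping `$stored` alongside `$display` preserves the information that the replacement would otherwise erase.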
Parameters
$text string required
String which is assumed to be UTF-8 but may contain invalid sequences of bytes.
Source
function wp_scrub_utf8( $text ) {
	/*
	 * While it looks like setting the substitute character could fail,
	 * the internal PHP code will never fail when provided a valid
	 * code point as a number. In this case, there’s no need to check
	 * its return value to see if it succeeded.
	 */
	$prev_replacement_character = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $text, 'UTF-8' );
	mb_substitute_character( $prev_replacement_character );

	return $scrubbed;
}
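The listing above saves and restores the mbstring substitute character, which is a process-wide setting. The sketch below suggests why that matters: a caller's own setting should be the same before and after the call. `example_scrub_preserving_setting()` is a hypothetical copy of the listing, used here so the example runs without WordPress.

```php
<?php
// Hypothetical copy of the Source listing, so the sketch is self-contained.
function example_scrub_preserving_setting( string $text ): string {
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $text, 'UTF-8' );
	mb_substitute_character( $previous ); // Put the caller's setting back.
	return $scrubbed;
}

// Assume the surrounding application has chosen "?" as its substitute.
mb_substitute_character( 0x3F );

$before = mb_substitute_character();
$result = example_scrub_preserving_setting( ".\xC0." );
$after  = mb_substitute_character();

// The global setting is the same on both sides of the call, even though
// U+FFFD was used inside the function.
$setting_preserved = ( $before === $after );
```

Without the restore step, any later mbstring conversion in the same request would silently start using U+FFFD instead of the application's chosen substitute.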
Changelog
| Version | Description |
|---|---|
| 6.9.0 | Introduced. |