_wp_utf8_decode_fallback()
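云策 documentation annotations
Overview
_wp_utf8_decode_fallback() is an internal WordPress function that converts a UTF-8 encoded string to ISO-8859-1, preserving backward compatibility with the deprecated function from the PHP standard library. It scans the UTF-8 byte sequence, copies ASCII bytes unchanged, converts code points up to U+FF to a single byte, and replaces everything else (unconvertible code points and invalid byte sequences) with a question mark.
Key points
- Purpose: converts a UTF-8 string to ISO-8859-1, matching the behavior of PHP's utf8_decode() function.
- Parameter: $utf8_text (required), the text to be treated as UTF-8 bytes.
- Return value: the converted ISO-8859-1 string.
- Implementation: a loop scans the UTF-8 bytes, passes ASCII bytes through, converts two-byte sequences whose code point is ≤ U+FF, and emits '?' for code points above U+FF and for invalid byte sequences.
- Related function: relies on _wp_scan_utf8() to check the validity of the UTF-8 bytes.
- Version history: introduced in WordPress 6.9.0.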
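The two-byte conversion described in the key points can be tried in isolation. The following is a minimal sketch, not WordPress code: demo_two_byte_to_latin1() is a hypothetical helper, and it performs its own lead-byte and continuation-byte checks, whereas the real function delegates validation to _wp_scan_utf8().

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): decodes one two-byte UTF-8
 * sequence into its ISO-8859-1 byte, or returns '?' when that is not possible.
 */
function demo_two_byte_to_latin1( string $bytes ): string {
    if ( 2 !== strlen( $bytes ) ) {
        return '?';
    }

    $byte1 = ord( $bytes[0] );
    $byte2 = ord( $bytes[1] );

    // A two-byte lead byte looks like 110xxxxx; a continuation byte looks like 10xxxxxx.
    if ( 0xC0 !== ( $byte1 & 0xE0 ) || 0x80 !== ( $byte2 & 0xC0 ) ) {
        return '?';
    }

    // Five payload bits come from the lead byte, six from the continuation byte.
    $code_point = ( ( $byte1 & 0x1F ) << 6 ) | ( $byte2 & 0x3F );

    // Reject overlong forms (below U+80) and anything beyond ISO-8859-1 (above U+FF).
    if ( $code_point < 0x80 || $code_point > 0xFF ) {
        return '?';
    }

    return chr( $code_point );
}

echo bin2hex( demo_two_byte_to_latin1( "\xC3\xA9" ) ), "\n"; // U+00E9: prints e9
echo demo_two_byte_to_latin1( "\xC4\x80" ), "\n";            // U+0100: prints ?
```

For 0xC3 0xA9 the arithmetic is ( 0x03 << 6 ) | 0x29 = 0xE9, which is the ISO-8859-1 byte for "é". For 0xC4 0x80 the result is 0x100, one past the largest value a single ISO-8859-1 byte can hold.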
Code example
function _wp_utf8_decode_fallback( $utf8_text ) {
    $utf8_text       = (string) $utf8_text;
    $at              = 0;
    $was_at          = 0;
    $end             = strlen( $utf8_text );
    $iso_8859_1_text = '';

    while ( $at < $end ) {
        // US-ASCII bytes (0x00 to 0x7F) are identical in ISO-8859-1 and UTF-8, so skip over them.
        $ascii_byte_count = strspn(
            $utf8_text,
            "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" .
            "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" .
            " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f",
            $at
        );

        if ( $ascii_byte_count > 0 ) {
            $at += $ascii_byte_count;
            continue;
        }

        // Ask _wp_scan_utf8() for the next code point; it updates $next_at and $invalid_length.
        $next_at        = $at;
        $invalid_length = 0;
        $found          = _wp_scan_utf8( $utf8_text, $next_at, $invalid_length, null, 1 );
        $span_length    = $next_at - $at;

        // Default replacement for anything that cannot be represented in ISO-8859-1.
        $next_byte = '?';

        if ( 1 !== $found ) {
            if ( $invalid_length > 0 ) {
                // The '?' for the invalid span is appended in the flush step below.
                $next_byte = '';
                goto flush_sub_part;
            }

            break;
        }

        // All convertible code points are two-bytes long.
        $byte1 = ord( $utf8_text[ $at ] );
        if ( 0xC0 !== ( $byte1 & 0xE0 ) ) {
            goto flush_sub_part;
        }

        // All convertible code points are not greater than U+FF.
        $byte2      = ord( $utf8_text[ $at + 1 ] );
        $code_point = ( ( $byte1 & 0x1F ) << 6 ) | ( $byte2 & 0x3F );
        if ( $code_point > 0xFF ) {
            goto flush_sub_part;
        }

        $next_byte = chr( $code_point );

        flush_sub_part:
        // Copy the pending ASCII run, then the converted byte (or its replacement).
        $iso_8859_1_text .= substr( $utf8_text, $was_at, $at - $was_at );
        $iso_8859_1_text .= $next_byte;
        $at              += $span_length;
        $was_at           = $at;

        if ( $invalid_length > 0 ) {
            $iso_8859_1_text .= '?';
            $at              += $invalid_length;
            $was_at           = $at;
        }
    }

    // Nothing was flushed, so the input can be returned unchanged.
    if ( 0 === $was_at ) {
        return $utf8_text;
    }

    $iso_8859_1_text .= substr( $utf8_text, $was_at );
    return $iso_8859_1_text;
}
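}
Notes
- This function exists mainly for internal compatibility. In new code, prefer a maintained conversion function such as PHP's mb_convert_encoding() (where the mbstring extension is available) or another encoding utility that WordPress provides.
- Invalid UTF-8 sequences, and valid code points above U+FF, are replaced with '?', so the conversion can lose data.
- The function relies on _wp_scan_utf8() to validate UTF-8 bytes, so make sure the file that defines it is loaded.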
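As one possible alternative outside WordPress internals, the sketch below uses PHP's mb_convert_encoding() and checks for data loss by converting the result back to UTF-8. It assumes the mbstring extension is installed; demo_utf8_to_latin1() is a hypothetical helper name, and the way mbstring treats invalid byte sequences may differ from _wp_utf8_decode_fallback().

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): converts UTF-8 to ISO-8859-1
 * with mbstring and reports whether any character had to be substituted.
 */
function demo_utf8_to_latin1( string $utf8_text ): array {
    $latin1 = mb_convert_encoding( $utf8_text, 'ISO-8859-1', 'UTF-8' );

    // Convert back and compare: any difference means the conversion was lossy.
    $round_trip = mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' );

    return array(
        'text'  => $latin1,
        'lossy' => $round_trip !== $utf8_text,
    );
}

$ok  = demo_utf8_to_latin1( "caf\xC3\xA9" );     // "café" fits in ISO-8859-1.
$bad = demo_utf8_to_latin1( "5 \xE2\x82\xAC" );  // U+20AC (euro sign) does not fit.

var_dump( bin2hex( $ok['text'] ), $ok['lossy'], $bad['lossy'] );
```

The round-trip comparison only detects that something was lost, not where; callers that need to preserve such text would have to keep the original UTF-8 string.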
Original content
Converts a string from UTF-8 to ISO-8859-1, maintaining backwards compatibility with the deprecated function from the PHP standard library.
Description
See also
Parameters
$utf8_text string required
Text treated as UTF-8 bytes.
Source
function _wp_utf8_decode_fallback( $utf8_text ) {
    $utf8_text       = (string) $utf8_text;
    $at              = 0;
    $was_at          = 0;
    $end             = strlen( $utf8_text );
    $iso_8859_1_text = '';

    while ( $at < $end ) {
        // US-ASCII bytes are identical in ISO-8859-1 and UTF-8. These are 0x00–0x7F.
        $ascii_byte_count = strspn(
            $utf8_text,
            "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" .
            "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" .
            " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f",
            $at
        );

        if ( $ascii_byte_count > 0 ) {
            $at += $ascii_byte_count;
            continue;
        }

        $next_at        = $at;
        $invalid_length = 0;
        $found          = _wp_scan_utf8( $utf8_text, $next_at, $invalid_length, null, 1 );
        $span_length    = $next_at - $at;
        $next_byte      = '?';

        if ( 1 !== $found ) {
            if ( $invalid_length > 0 ) {
                $next_byte = '';
                goto flush_sub_part;
            }

            break;
        }

        // All convertible code points are two-bytes long.
        $byte1 = ord( $utf8_text[ $at ] );
        if ( 0xC0 !== ( $byte1 & 0xE0 ) ) {
            goto flush_sub_part;
        }

        // All convertible code points are not greater than U+FF.
        $byte2      = ord( $utf8_text[ $at + 1 ] );
        $code_point = ( ( $byte1 & 0x1F ) << 6 ) | ( ( $byte2 & 0x3F ) );
        if ( $code_point > 0xFF ) {
            goto flush_sub_part;
        }

        $next_byte = chr( $code_point );

        flush_sub_part:
        $iso_8859_1_text .= substr( $utf8_text, $was_at, $at - $was_at );
        $iso_8859_1_text .= $next_byte;
        $at              += $span_length;
        $was_at           = $at;

        if ( $invalid_length > 0 ) {
            $iso_8859_1_text .= '?';
            $at              += $invalid_length;
            $was_at           = $at;
        }
    }

    if ( 0 === $was_at ) {
        return $utf8_text;
    }

    $iso_8859_1_text .= substr( $utf8_text, $was_at );
    return $iso_8859_1_text;
}
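The ASCII fast path in the source relies on strspn() with a start offset. A small standalone sketch of that step follows; the variable names are illustrative, and the mask is built programmatically here rather than spelled out as a literal, which is an assumption of this example and not how the source does it.

```php
<?php
// Build the 128-byte US-ASCII mask (0x00 to 0x7F) programmatically.
$ascii_mask = implode( '', array_map( 'chr', range( 0x00, 0x7F ) ) );

$text = "abc\xC3\xA9def"; // "abcédef" as UTF-8 bytes (8 bytes in total).

// strspn() returns the length of the run, starting at the given offset,
// that consists only of bytes found in the mask.
$run_at_0 = strspn( $text, $ascii_mask, 0 ); // Covers "abc".
$run_at_3 = strspn( $text, $ascii_mask, 3 ); // Byte 0xC3 is not ASCII.
$run_at_5 = strspn( $text, $ascii_mask, 5 ); // Covers "def".

printf( "%d %d %d\n", $run_at_0, $run_at_3, $run_at_5 ); // prints "3 0 3"
```

A run length of zero is the signal that the byte at the current offset starts a multi-byte sequence (or is invalid), which is where the source hands over to _wp_scan_utf8().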
Changelog
| Version | Description |
|---|---|
| 6.9.0 | Introduced. |