Function documentation

_wp_utf8_decode_fallback()

💡 云策 documentation notes

Overview

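_wp_utf8_decode_fallback() is an internal WordPress function that converts a UTF-8 encoded string to ISO-8859-1, maintaining backwards compatibility with the deprecated utf8_decode() function from the PHP standard library. It scans the input as UTF-8 bytes, copies ASCII bytes unchanged, converts code points up to U+FF into single bytes, and replaces everything else (code points above U+FF and invalid byte sequences) with a question mark.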
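The ASCII fast path described above can be illustrated with a minimal sketch. It assumes only PHP's built-in strspn(); the variable names $ascii_mask and $text are illustrative and do not appear in WordPress. The mask is built programmatically here rather than written as a string literal, which is a choice made for this example only.

```php
<?php
// Build the 128-byte mask of US-ASCII bytes (0x00 to 0x7F).
$ascii_mask = implode( '', array_map( 'chr', range( 0, 127 ) ) );

// "abc", then "é" as the two UTF-8 bytes C3 A9, then "def".
$text = "abc\xC3\xA9def";

// strspn() returns the length of the run of mask bytes starting at the offset.
echo strspn( $text, $ascii_mask, 0 ), "\n"; // 3: "abc" can be skipped in one step
echo strspn( $text, $ascii_mask, 3 ), "\n"; // 0: offset 3 is a non-ASCII lead byte
echo strspn( $text, $ascii_mask, 5 ), "\n"; // 3: "def" follows the two-byte sequence
```

A return value of 0 is the signal that the next byte needs UTF-8 decoding rather than a plain copy.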

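Key points

  • Purpose: converts a UTF-8 string to ISO-8859-1, compatible with PHP's utf8_decode() function.
  • Parameter: $utf8_text (required), text treated as UTF-8 bytes.
  • Return value: the text converted into ISO-8859-1.
  • Implementation: a loop scans the UTF-8 bytes. Runs of ASCII bytes are copied unchanged, two-byte sequences encoding code points up to U+FF are converted to a single byte, and code points above U+FF as well as invalid byte sequences are replaced with '?'.
  • Related function: relies on _wp_scan_utf8() to validate the UTF-8 bytes.
  • Version history: introduced in WordPress 6.9.0.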
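The two-byte conversion step in the list above comes down to a little bit arithmetic. The following is a minimal sketch of it: example_two_byte_to_latin1() is a hypothetical helper, not part of WordPress, and it assumes the caller has already confirmed that the input is one valid UTF-8 character.

```php
<?php
/**
 * Hypothetical helper (not a WordPress function): maps one already-validated
 * UTF-8 character to its ISO-8859-1 byte, or to '?' when it does not fit.
 */
function example_two_byte_to_latin1( string $bytes ): string {
    $byte1 = ord( $bytes[0] );

    // Only lead bytes of the form 110xxxxx start a two-byte sequence.
    if ( 0xC0 !== ( $byte1 & 0xE0 ) ) {
        return '?';
    }

    // Five payload bits come from the lead byte, six from the continuation byte.
    $byte2      = ord( $bytes[1] );
    $code_point = ( ( $byte1 & 0x1F ) << 6 ) | ( $byte2 & 0x3F );

    // ISO-8859-1 covers exactly U+00 through U+FF.
    return $code_point <= 0xFF ? chr( $code_point ) : '?';
}

// "é" is U+00E9, encoded as C3 A9: (0x03 << 6) | 0x29 = 0xE9.
echo bin2hex( example_two_byte_to_latin1( "\xC3\xA9" ) ), "\n"; // e9

// "Ā" is U+0100, encoded as C4 80: one past the ISO-8859-1 range.
echo example_two_byte_to_latin1( "\xC4\x80" ), "\n"; // ?
```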

Code example

function _wp_utf8_decode_fallback( $utf8_text ) {
    $utf8_text       = (string) $utf8_text;
    $at              = 0;
    $was_at          = 0;
    $end             = strlen( $utf8_text );
    $iso_8859_1_text = '';

    while ( $at < $end ) {
        // US-ASCII bytes (0x00 to 0x7F) are identical in ISO-8859-1 and UTF-8.
        $ascii_byte_count = strspn(
            $utf8_text,
            "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f",
            $at
        );

        if ( $ascii_byte_count > 0 ) {
            $at += $ascii_byte_count;
            continue;
        }

        $next_at        = $at;
        $invalid_length = 0;
        $found          = _wp_scan_utf8( $utf8_text, $next_at, $invalid_length, null, 1 );
        $span_length    = $next_at - $at;
        $next_byte      = '?';

        if ( 1 !== $found ) {
            if ( $invalid_length > 0 ) {
                $next_byte = '';
                goto flush_sub_part;
            }

            break;
        }

        // All convertible code points are two-bytes long.
        $byte1 = ord( $utf8_text[ $at ] );
        if ( 0xC0 !== ( $byte1 & 0xE0 ) ) {
            goto flush_sub_part;
        }

        // All convertible code points are not greater than U+FF.
        $byte2      = ord( $utf8_text[ $at + 1 ] );
        $code_point = ( ( $byte1 & 0x1F ) << 6 ) | ( $byte2 & 0x3F );
        if ( $code_point > 0xFF ) {
            goto flush_sub_part;
        }

        $next_byte = chr( $code_point );

        flush_sub_part:
        $iso_8859_1_text .= substr( $utf8_text, $was_at, $at - $was_at );
        $iso_8859_1_text .= $next_byte;
        $at              += $span_length;
        $was_at           = $at;

        if ( $invalid_length > 0 ) {
            $iso_8859_1_text .= '?';
            $at              += $invalid_length;
            $was_at           = $at;
        }
    }

    if ( 0 === $was_at ) {
        return $utf8_text;
    }

    $iso_8859_1_text .= substr( $utf8_text, $was_at );
    return $iso_8859_1_text;
}

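Notes

  • This function exists mainly for internal compatibility; developers should prefer modern PHP functions or the other encoding utilities WordPress provides.
  • During conversion, invalid UTF-8 sequences and characters that ISO-8859-1 cannot represent are replaced with '?', so the conversion can lose data.
  • The function depends on _wp_scan_utf8() to scan UTF-8 bytes, so make sure the file that defines it is loaded.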
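Because the '?' substitution is silent, it can help to check for loss before converting. The sketch below is one way to do that, assuming PHP's bundled PCRE functions; example_is_lossless_in_latin1() is a hypothetical name and not part of WordPress.

```php
<?php
/**
 * Hypothetical helper (not a WordPress function): reports whether $text is
 * valid UTF-8 made only of code points that ISO-8859-1 can represent.
 */
function example_is_lossless_in_latin1( string $text ): bool {
    // With the /u modifier, preg_match() returns false for invalid UTF-8,
    // 1 if any code point above U+FF is present, and 0 otherwise.
    return 0 === preg_match( '/[^\x{00}-\x{FF}]/u', $text );
}

var_dump( example_is_lossless_in_latin1( "caf\xC3\xA9" ) );   // bool(true): "é" is U+00E9
var_dump( example_is_lossless_in_latin1( "\xE2\x82\xAC5" ) ); // bool(false): "€" is U+20AC
var_dump( example_is_lossless_in_latin1( "\xFF" ) );          // bool(false): not valid UTF-8
```

A false result means a conversion to ISO-8859-1 would replace at least one character or byte sequence, so the caller can decide whether to proceed.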

📄 Original content

Converts a string from UTF-8 to ISO-8859-1, maintaining backwards compatibility with the deprecated function from the PHP standard library.

Description

See also

Parameters

$utf8_text string required
Text treated as UTF-8 bytes.

Return

string Text converted into ISO-8859-1.

Source

function _wp_utf8_decode_fallback( $utf8_text ) {
	$utf8_text       = (string) $utf8_text;
	$at              = 0;
	$was_at          = 0;
	$end             = strlen( $utf8_text );
	$iso_8859_1_text = '';

	while ( $at < $end ) {
		// US-ASCII bytes are identical in ISO-8859-1 and UTF-8. These are 0x00–0x7F.
		$ascii_byte_count = strspn(
			$utf8_text,
			"\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" .
			"\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" .
			" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f",
			$at
		);

		if ( $ascii_byte_count > 0 ) {
			$at += $ascii_byte_count;
			continue;
		}

		$next_at        = $at;
		$invalid_length = 0;
		$found          = _wp_scan_utf8( $utf8_text, $next_at, $invalid_length, null, 1 );
		$span_length    = $next_at - $at;
		$next_byte      = '?';

		if ( 1 !== $found ) {
			if ( $invalid_length > 0 ) {
				$next_byte = '';
				goto flush_sub_part;
			}

			break;
		}

		// All convertible code points are two-bytes long.
		$byte1 = ord( $utf8_text[ $at ] );
		if ( 0xC0 !== ( $byte1 & 0xE0 ) ) {
			goto flush_sub_part;
		}

		// All convertible code points are not greater than U+FF.
		$byte2      = ord( $utf8_text[ $at + 1 ] );
		$code_point = ( ( $byte1 & 0x1F ) << 6 ) | ( ( $byte2 & 0x3F ) );
		if ( $code_point > 0xFF ) {
			goto flush_sub_part;
		}

		$next_byte = chr( $code_point );

		flush_sub_part:
		$iso_8859_1_text .= substr( $utf8_text, $was_at, $at - $was_at );
		$iso_8859_1_text .= $next_byte;
		$at              += $span_length;
		$was_at           = $at;

		if ( $invalid_length > 0 ) {
			$iso_8859_1_text .= '?';
			$at              += $invalid_length;
			$was_at           = $at;
		}
	}

	if ( 0 === $was_at ) {
		return $utf8_text;
	}

	$iso_8859_1_text .= substr( $utf8_text, $was_at );
	return $iso_8859_1_text;
}

Changelog

Version Description
6.9.0 Introduced.