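Function reference

_wp_scan_utf8()

💡 Yunce documentation notes

Overview

_wp_scan_utf8() is a low-level tool that scans a string for spans of valid and invalid UTF-8 bytes. It reports the scan position and the length of the invalid span through by-reference parameters and returns the number of code points it scanned successfully, which gives higher-level UTF-8 functions a foundation to build on.

Key points

  • The function scans a UTF-8 string until it meets an invalid byte sequence, then sets $at to the position after the last successfully scanned code point and $invalid_length to the length of the invalid span.
  • It returns the number of code points scanned successfully; $max_bytes and $max_code_points limit how far the scan goes.
  • As a low-level function, it takes every argument directly instead of through an options array, which keeps costs explicit.
  • Internally it fast-tracks ASCII bytes and validates multibyte sequences against the Unicode Standard's table of well-formed UTF-8 byte sequences.
  • Related functions such as _wp_utf8_codepoint_count() and _wp_scrub_utf8_fallback() rely on this tool to implement UTF-8 functionality.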

Code example

// Example: scan a string that contains an invalid byte.
"Pi\xF1a" === $pineapple = mb_convert_encoding( "Piña", 'Windows-1252', 'UTF-8' );
$at = $invalid_length = 0;

// The first step finds the invalid 0xF1 byte.
2 === _wp_scan_utf8( $pineapple, $at, $invalid_length );
$at === 2; $invalid_length === 1;

// Skip the invalid byte; the second step then continues to the end of the string.
$at += $invalid_length;
1 === _wp_scan_utf8( $pineapple, $at, $invalid_length );
$at === 4; $invalid_length === 0;

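Notes

  • This function is meant to be a low-level foundation and offers no high-level conveniences such as an options array.
  • A scan never returns in the middle of a multi-byte character, so characters always stay intact.
  • Invalid sequences are handled as the Unicode Standard suggests: find the maximal subpart of the ill-formed sequence and replace it with a single replacement character.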
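
To make the maximal-subpart idea concrete, here is a minimal sketch that mirrors the byte ranges used in the function's source. demo_maximal_subpart_length() is a hypothetical helper, not a WordPress function. It assumes the byte at $i does not begin a well-formed sequence, and it leaves out the $max_bytes clamp that the real function applies.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): length of the maximal subpart
 * of an ill-formed UTF-8 sequence that starts at byte offset $i.
 *
 * Assumption: $bytes[ $i ] exists and does not start a well-formed sequence.
 */
function demo_maximal_subpart_length( string $bytes, int $i ): int {
	// 0xC0 never occurs in valid UTF-8, so it doubles as an
	// out-of-bounds sentinel when the string is truncated.
	$b1 = ord( $bytes[ $i ] );
	$b2 = ord( $bytes[ $i + 1 ] ?? "\xC0" );
	$b3 = ord( $bytes[ $i + 2 ] ?? "\xC0" );

	if ( 0xE0 === ( $b1 & 0xF0 ) ) {
		// Lead byte of a three-byte sequence: is the second byte in range?
		$b2_valid = (
			( 0xE0 === $b1 && $b2 >= 0xA0 && $b2 <= 0xBF ) ||
			( $b1 >= 0xE1 && $b1 <= 0xEC && $b2 >= 0x80 && $b2 <= 0xBF ) ||
			( 0xED === $b1 && $b2 >= 0x80 && $b2 <= 0x9F ) ||
			( $b1 >= 0xEE && $b1 <= 0xEF && $b2 >= 0x80 && $b2 <= 0xBF )
		);

		return $b2_valid ? 2 : 1;
	}

	if ( 0xF0 === ( $b1 & 0xF8 ) ) {
		// Lead byte of a four-byte sequence: check the second, then the third byte.
		$b2_valid = (
			( 0xF0 === $b1 && $b2 >= 0x90 && $b2 <= 0xBF ) ||
			( $b1 >= 0xF1 && $b1 <= 0xF3 && $b2 >= 0x80 && $b2 <= 0xBF ) ||
			( 0xF4 === $b1 && $b2 >= 0x80 && $b2 <= 0x8F )
		);
		$b3_valid = $b3 >= 0x80 && $b3 <= 0xBF;

		return $b2_valid ? ( $b3_valid ? 3 : 2 ) : 1;
	}

	// Stray continuation bytes, two-byte leads, and bytes that are never valid.
	return 1;
}
```

Under these assumptions a truncated "\xE2\x82" counts as one two-byte invalid span, while an overlong "\xE0\x80\x80" is rejected one byte at a time, because 0x80 is not an allowed second byte after 0xE0.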

📄 Original content

Finds spans of valid and invalid UTF-8 bytes in a given string.

Description

This is a low-level tool to power various UTF-8 functionality.
It scans through a string until it finds invalid byte spans.
When it does this, it does three things:

  • Assigns $at to the position after the last successful code point.
  • Assigns $invalid_length to the length of the maximal subpart of the invalid bytes starting at $at.
  • Returns how many code points were successfully scanned.

This information is enough to build a number of useful UTF-8 functions.

Example:

// ñ is U+00F1, which in `ISO-8859-1`/`latin1`/`Windows-1252`/`cp1252` is 0xF1.
"Pi\xF1a" === $pineapple = mb_convert_encoding( "Piña", 'Windows-1252', 'UTF-8' );
$at = $invalid_length = 0;

// The first step finds the invalid 0xF1 byte.
2 === _wp_scan_utf8( $pineapple, $at, $invalid_length );
$at === 2; $invalid_length === 1;

// Skip the invalid byte; the second step then continues to the end of the string.
$at += $invalid_length;
1 === _wp_scan_utf8( $pineapple, $at, $invalid_length );
$at === 4; $invalid_length === 0;

Note! While passing an options array here might be convenient from a calling-code standpoint, this function is intended to serve as a very low-level foundation upon which to build higher level functionality. For the sake of keeping costs explicit all arguments are passed directly.
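
As an illustration of how the three results (the updated $at, $invalid_length, and the returned count) can drive a higher-level function, the sketch below replaces each invalid span with U+FFFD while counting valid code points. So that it runs outside WordPress, it uses demo_scan_utf8(), a hypothetical stand-in with the same first three parameters. The stand-in is a simplification: it reports every invalid span as a single byte instead of computing the maximal subpart. demo_scrub_utf8() is likewise a made-up name.

```php
<?php
/**
 * Hypothetical, simplified stand-in for _wp_scan_utf8() (assumption: invalid
 * spans are always reported one byte at a time).
 */
function demo_scan_utf8( string $bytes, int &$at, int &$invalid_length ): int {
	// One well-formed UTF-8 sequence, per Table 3-7 of the Unicode Standard.
	$well_formed = '/\G(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]'
		. '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
		. '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})/';

	$end            = strlen( $bytes );
	$invalid_length = 0;
	$count          = 0;

	// Advance one code point at a time until the input ends or turns invalid.
	while ( $at < $end && 1 === preg_match( $well_formed, $bytes, $match, 0, $at ) ) {
		$at += strlen( $match[0] );
		++$count;
	}

	if ( $at < $end ) {
		$invalid_length = 1;
	}

	return $count;
}

/**
 * Replaces each invalid span with U+FFFD and counts the valid code points.
 *
 * @return array{0: string, 1: int} Scrubbed string and valid code point count.
 */
function demo_scrub_utf8( string $bytes ): array {
	$at             = 0;
	$invalid_length = 0;
	$count          = 0;
	$scrubbed       = '';
	$end            = strlen( $bytes );

	while ( $at < $end ) {
		$was_at = $at;
		$count += demo_scan_utf8( $bytes, $at, $invalid_length );

		// Copy the valid span that was just scanned.
		$scrubbed .= substr( $bytes, $was_at, $at - $was_at );

		if ( $invalid_length > 0 ) {
			$scrubbed .= "\u{FFFD}";
			// The scanner stops at the invalid span; skipping it is the caller's job.
			$at += $invalid_length;
		}
	}

	return array( $scrubbed, $count );
}

list( $scrubbed, $count ) = demo_scrub_utf8( "Pi\xF1a" );
```

The same loop shape should work with the real function in place of the stand-in, since the caller only relies on $at, $invalid_length, and the returned count.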

Parameters

$bytes (string, required)
UTF-8 encoded string which might include invalid spans of bytes.
$at (int, required)
Where to start scanning.
$invalid_length (int, required)
Will be set to how many bytes are to be ignored after $at.
$max_bytes (int|null, optional)
Stop scanning after this many bytes have been seen.

Default: null

$max_code_points (int|null, optional)
Stop scanning after this many code points have been seen.

Default: null

$has_noncharacters (bool|null, optional)
Set to indicate if scanned string contained noncharacters.

Default: null
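
For readers unfamiliar with the term, the noncharacters that $has_noncharacters refers to are the 66 code points named in the source comments: U+FDD0 through U+FDEF, plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, and so on up to U+10FFFF). The sketch below tests a decoded code point for membership. demo_is_noncharacter() is a hypothetical helper; the real function checks the equivalent byte patterns directly and never decodes.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): reports whether a code point
 * is one of the 66 Unicode noncharacters.
 */
function demo_is_noncharacter( int $code_point ): bool {
	if ( $code_point < 0 || $code_point > 0x10FFFF ) {
		return false; // Outside the Unicode code space entirely.
	}

	// U+FDD0..U+FDEF is a contiguous block of 32 noncharacters.
	if ( $code_point >= 0xFDD0 && $code_point <= 0xFDEF ) {
		return true;
	}

	// The last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
	return 0xFFFE === ( $code_point & 0xFFFE );
}
```

Noncharacters are still well-formed UTF-8, which is presumably why the function reports them through a separate flag and does not treat them as invalid bytes.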

Return

int How many code points were successfully scanned.

Source

function _wp_scan_utf8( string $bytes, int &$at, int &$invalid_length, ?int $max_bytes = null, ?int $max_code_points = null, ?bool &$has_noncharacters = null ): int {
	$byte_length       = strlen( $bytes );
	$end               = min( $byte_length, $at + ( $max_bytes ?? PHP_INT_MAX ) );
	$invalid_length    = 0;
	$count             = 0;
	$max_count         = $max_code_points ?? PHP_INT_MAX;
	$has_noncharacters = false;

	for ( $i = $at; $i < $end && $count <= $max_count; $i++ ) {
		/*
		 * Quickly skip past US-ASCII bytes, all of which are valid UTF-8.
		 *
		 * This optimization step improves the speed from 10x to 100x
		 * depending on whether the JIT has optimized the function.
		 */
		$ascii_byte_count = strspn(
			$bytes,
			"\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f" .
			"\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f" .
			" !\"#\$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f",
			$i,
			$end - $i
		);

		if ( $count + $ascii_byte_count >= $max_count ) {
			$at    = $i + ( $max_count - $count );
			$count = $max_count;
			return $count;
		}

		$count += $ascii_byte_count;
		$i     += $ascii_byte_count;

		if ( $i >= $end ) {
			$at = $end;
			return $count;
		}

		/**
		 * The above fast-track handled all single-byte UTF-8 characters. What
		 * follows MUST be a multibyte sequence otherwise there’s invalid UTF-8.
		 *
		 * Therefore everything past here is checking those multibyte sequences.
		 *
		 * It may look like there’s a need to check against the max bytes here,
		 * but since each match of a single character returns, this function will
		 * bail already if crossing the max-bytes threshold. This function SHALL
		 * NOT return in the middle of a multi-byte character, so if a character
		 * falls on each side of the max bytes, the entire character will be scanned.
		 *
		 * Because it’s possible that there are truncated characters, the use of
		 * the null-coalescing operator with "\xC0" is a convenience for skipping
		 * length checks on every continuation byte. This works because 0xC0 is
		 * always invalid in a UTF-8 string, meaning that if the string has been
		 * truncated, it will find 0xC0 and reject as invalid UTF-8.
		 *
		 * > [The following table] lists all of the byte sequences that are well-formed
		 * > in UTF-8. A range of byte values such as A0..BF indicates that any byte
		 * > from A0 to BF (inclusive) is well-formed in that position. Any byte value
		 * > outside of the ranges listed is ill-formed.
		 *
		 * > Table 3-7. Well-Formed UTF-8 Byte Sequences
		 *  ╭─────────────────────┬────────────┬──────────────┬─────────────┬──────────────╮
		 *  │ Code Points         │ First Byte │ Second Byte  │ Third Byte  │ Fourth Byte  │
		 *  ├─────────────────────┼────────────┼──────────────┼─────────────┼──────────────┤
		 *  │ U+0000..U+007F      │ 00..7F     │              │             │              │
		 *  │ U+0080..U+07FF      │ C2..DF     │ 80..BF       │             │              │
		 *  │ U+0800..U+0FFF      │ E0         │ A0..BF       │ 80..BF      │              │
		 *  │ U+1000..U+CFFF      │ E1..EC     │ 80..BF       │ 80..BF      │              │
		 *  │ U+D000..U+D7FF      │ ED         │ 80..9F       │ 80..BF      │              │
		 *  │ U+E000..U+FFFF      │ EE..EF     │ 80..BF       │ 80..BF      │              │
		 *  │ U+10000..U+3FFFF    │ F0         │ 90..BF       │ 80..BF      │ 80..BF       │
		 *  │ U+40000..U+FFFFF    │ F1..F3     │ 80..BF       │ 80..BF      │ 80..BF       │
		 *  │ U+100000..U+10FFFF  │ F4         │ 80..8F       │ 80..BF      │ 80..BF       │
		 *  ╰─────────────────────┴────────────┴──────────────┴─────────────┴──────────────╯
		 *
		 * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G27506
		 */

		// Valid two-byte code points.
		$b1 = ord( $bytes[ $i ] );
		$b2 = ord( $bytes[ $i + 1 ] ?? "\xC0" );

		if ( $b1 >= 0xC2 && $b1 <= 0xDF && $b2 >= 0x80 && $b2 <= 0xBF ) {
			++$count;
			++$i;
			continue;
		}

		// Valid three-byte code points.
		$b3 = ord( $bytes[ $i + 2 ] ?? "\xC0" );

		if ( $b3 < 0x80 || $b3 > 0xBF ) {
			goto invalid_utf8;
		}

		if (
			( 0xE0 === $b1 && $b2 >= 0xA0 && $b2 <= 0xBF ) ||
			( $b1 >= 0xE1 && $b1 <= 0xEC && $b2 >= 0x80 && $b2 <= 0xBF ) ||
			( 0xED === $b1 && $b2 >= 0x80 && $b2 <= 0x9F ) ||
			( $b1 >= 0xEE && $b1 <= 0xEF && $b2 >= 0x80 && $b2 <= 0xBF )
		) {
			++$count;
			$i += 2;

			// Covers the range U+FDD0–U+FDEF, U+FFFE, U+FFFF.
			if ( 0xEF === $b1 ) {
				$has_noncharacters |= (
					( 0xB7 === $b2 && $b3 >= 0x90 && $b3 <= 0xAF ) ||
					( 0xBF === $b2 && ( 0xBE === $b3 || 0xBF === $b3 ) )
				);
			}

			continue;
		}

		// Valid four-byte code points.
		$b4 = ord( $bytes[ $i + 3 ] ?? "\xC0" );

		if ( $b4 < 0x80 || $b4 > 0xBF ) {
			goto invalid_utf8;
		}

		if (
			( 0xF0 === $b1 && $b2 >= 0x90 && $b2 <= 0xBF ) ||
			( $b1 >= 0xF1 && $b1 <= 0xF3 && $b2 >= 0x80 && $b2 <= 0xBF ) ||
			( 0xF4 === $b1 && $b2 >= 0x80 && $b2 <= 0x8F )
		) {
			++$count;
			$i += 3;

			// Covers U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, …, U+10FFFE, U+10FFFF.
			$has_noncharacters |= (
				( 0x0F === ( $b2 & 0x0F ) ) &&
				0xBF === $b3 &&
				( 0xBE === $b4 || 0xBF === $b4 )
			);

			continue;
		}

		/**
		 * When encountering invalid byte sequences, Unicode suggests finding the
		 * maximal subpart of a text and replacing that subpart with a single
		 * replacement character.
		 *
		 * > This practice is more secure because it does not result in the
		 * > conversion consuming parts of valid sequences as though they were
		 * > invalid. It also guarantees at least one replacement character will
		 * > occur for each instance of an invalid sequence in the original text.
		 * > Furthermore, this practice can be defined consistently for better
		 * > interoperability between different implementations of conversion.
		 *
		 * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630
		 */
		invalid_utf8:
		$at             = $i;
		$invalid_length = 1;

		// Single-byte and two-byte characters.
		if ( ( 0x00 === ( $b1 & 0x80 ) ) || ( 0xC0 === ( $b1 & 0xE0 ) ) ) {
			return $count;
		}

		$b2 = ord( $bytes[ $i + 1 ] ?? "\xC0" );
		$b3 = ord( $bytes[ $i + 2 ] ?? "\xC0" );

		// Find the maximal subpart and skip past it.
		if ( 0xE0 === ( $b1 & 0xF0 ) ) {
			// Three-byte characters.
			$b2_valid = (
				( 0xE0 === $b1 && $b2 >= 0xA0 && $b2 <= 0xBF ) ||
				( $b1 >= 0xE1 && $b1 <= 0xEC && $b2 >= 0x80 && $b2 <= 0xBF ) ||
				( 0xED === $b1 && $b2 >= 0x80 && $b2 <= 0x9F ) ||
				( $b1 >= 0xEE && $b1 <= 0xEF && $b2 >= 0x80 && $b2 <= 0xBF )
			);

			$invalid_length = min( $end - $i, $b2_valid ? 2 : 1 );
			return $count;
		} elseif ( 0xF0 === ( $b1 & 0xF8 ) ) {
			// Four-byte characters.
			$b2_valid = (
				( 0xF0 === $b1 && $b2 >= 0x90 && $b2 <= 0xBF ) ||
				( $b1 >= 0xF1 && $b1 <= 0xF3 && $b2 >= 0x80 && $b2 <= 0xBF ) ||
				( 0xF4 === $b1 && $b2 >= 0x80 && $b2 <= 0x8F )
			);

			$b3_valid = $b3 >= 0x80 && $b3 <= 0xBF;

			$invalid_length = min( $end - $i, $b2_valid ? ( $b3_valid ? 3 : 2 ) : 1 );
			return $count;
		}

		return $count;
	}

	$at = $i;
	return $count;
}
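
The ASCII fast path in the source can be tried in isolation. The sketch below is a minimal illustration, assuming only PHP's built-in strspn(); demo_ascii_prefix_length() is a hypothetical name, and the mask is built with chr() and range() here where the source writes it out as a string literal.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): counts how many consecutive
 * US-ASCII bytes (0x00..0x7F) appear in $bytes starting at $offset.
 */
function demo_ascii_prefix_length( string $bytes, int $offset = 0 ): int {
	static $ascii_mask = null;

	if ( null === $ascii_mask ) {
		// All 128 ASCII bytes. strspn() treats the mask as a set of bytes.
		$ascii_mask = implode( '', array_map( 'chr', range( 0, 0x7F ) ) );
	}

	return strspn( $bytes, $ascii_mask, $offset );
}

$prefix = demo_ascii_prefix_length( "Pi\xF1a" );
```

Every ASCII byte is one valid code point, so the scanner can add this length to both its byte offset and its code point count in one step, and only needs the byte-by-byte checks once it reaches a non-ASCII byte.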

Changelog

Version Description
6.9.0 Introduced.