Function documentation

wp_check_invalid_utf8()

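💡 Yunce Docs annotation

Overview

wp_check_invalid_utf8() checks a string for invalid UTF-8. It performs the check only when blog_charset is set to UTF-8; for any other value it returns the input unchanged. The $strip parameter controls whether invalid byte sequences are replaced with the Unicode replacement character.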

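Key points

  • The function only does its work when blog_charset is UTF-8; otherwise it returns the input text as-is
  • By default it returns an empty string when the input contains any invalid UTF-8 sequence
  • Passing true for $strip replaces invalid byte sequences with the Unicode replacement character (U+FFFD �)
  • Consider using wp_scrub_utf8() instead, which does not depend on the value of blog_charset
  • The function returns the checked text as a string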

Code example

// The `blog_charset` is `latin1`, so this returns the input unchanged.
$every_possible_input === wp_check_invalid_utf8( $every_possible_input );

// Valid strings come through unchanged.
'test' === wp_check_invalid_utf8( 'test' );

$invalid = "the byte \xC0 is never allowed in a UTF-8 string.";

// Invalid strings are rejected outright.
'' === wp_check_invalid_utf8( $invalid );

// “Stripping” invalid sequences produces the replacement character instead.
"the byte \u{FFFD} is never allowed in a UTF-8 string." === wp_check_invalid_utf8( $invalid, true );
'the byte � is never allowed in a UTF-8 string.' === wp_check_invalid_utf8( $invalid, true );

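Notes

  • The function depends on the blog_charset setting; when it is not UTF-8, no check is performed
  • Passing true for $strip keeps the valid parts of the text instead of discarding the whole string
  • The related functions wp_is_valid_utf8() and wp_scrub_utf8() offer further UTF-8 handling options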

📄 Original content

Checks for invalid UTF8 in a string.

Description

Note! This function only performs its work if the blog_charset is set to UTF-8. For all other values it returns the input text unchanged.

Note! Unless requested, this returns an empty string if the input contains any sequences of invalid UTF-8. To replace invalid byte sequences, pass true as the optional $strip parameter.

Consider using wp_scrub_utf8() instead which does not depend on the value of blog_charset.

Example:

// The `blog_charset` is `latin1`, so this returns the input unchanged.
$every_possible_input === wp_check_invalid_utf8( $every_possible_input );

// Valid strings come through unchanged.
'test' === wp_check_invalid_utf8( 'test' );

$invalid = "the byte \xC0 is never allowed in a UTF-8 string.";

// Invalid strings are rejected outright.
'' === wp_check_invalid_utf8( $invalid );

// “Stripping” invalid sequences produces the replacement character instead.
"the byte \u{FFFD} is never allowed in a UTF-8 string." === wp_check_invalid_utf8( $invalid, true );
'the byte � is never allowed in a UTF-8 string.' === wp_check_invalid_utf8( $invalid, true );

Parameters

$text string required
String which is expected to be encoded as UTF-8 unless blog_charset is another encoding.
$strip bool optional
Whether to replace invalid sequences of bytes with the Unicode replacement character (U+FFFD �). Default false returns an empty string for invalid UTF-8 inputs.

Default: false

Return

string The checked text.
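
The possible return values can be traced with a small standalone sketch. It is not the WordPress implementation: sketch_check_invalid_utf8() is a hypothetical name, the $site_is_utf8 argument stands in for the site charset lookup, and mbstring's mb_check_encoding() and mb_scrub() are assumed substitutes for wp_is_valid_utf8() and wp_scrub_utf8(), so the exact number of replacement characters produced may differ from WordPress.

```php
<?php
// Hypothetical stand-in for wp_check_invalid_utf8(); not WordPress code.
// $site_is_utf8 replaces the site charset lookup so the sketch runs without WordPress.
function sketch_check_invalid_utf8( $text, $strip = false, $site_is_utf8 = true ) {
	$text = (string) $text;

	// 1. Empty input returns early, before any charset decision.
	if ( '' === $text ) {
		return '';
	}

	// 2. Non-UTF-8 sites and already-valid strings pass through untouched.
	if ( ! $site_is_utf8 || mb_check_encoding( $text, 'UTF-8' ) ) {
		return $text;
	}

	// 3. Invalid input is rejected outright unless stripping was requested.
	if ( ! $strip ) {
		return '';
	}

	// Assumption: mb_scrub() with U+FFFD as the substitute character
	// approximates wp_scrub_utf8(). The previous setting is restored afterwards.
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $text, 'UTF-8' );
	mb_substitute_character( $previous );

	return $scrubbed;
}
```

An empty return value is ambiguous: it can mean the input was empty or that it was rejected. A caller that needs to tell the two apart can compare the result against the original input.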

Source

function wp_check_invalid_utf8( $text, $strip = false ) {
	$text = (string) $text;

	if ( 0 === strlen( $text ) ) {
		return '';
	}

	// Store the site charset as a static to avoid multiple calls to get_option().
	static $is_utf8 = null;
	if ( ! isset( $is_utf8 ) ) {
		$is_utf8 = is_utf8_charset();
	}

	if ( ! $is_utf8 || wp_is_valid_utf8( $text ) ) {
		return $text;
	}

	return $strip
		? wp_scrub_utf8( $text )
		: '';
}
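
One consequence of the static $is_utf8 variable in the source above is that the charset decision is made once per request, on the first call with non-empty text, and reused afterwards. The standalone sketch below illustrates only that PHP mechanism; sketch_cached_flag() and the $lookup callable are hypothetical stand-ins for the function and for is_utf8_charset().

```php
<?php
// Hypothetical sketch of the caching pattern; not WordPress code.
function sketch_cached_flag( callable $lookup ) {
	// A static local keeps its value between calls within one PHP request.
	static $flag = null;

	// isset() is false only while $flag is still null, so the lookup runs once.
	if ( ! isset( $flag ) ) {
		$flag = (bool) $lookup();
	}

	return $flag;
}

$calls = 0;

// First call runs the lookup and caches true.
$first = sketch_cached_flag(
	function () use ( &$calls ) {
		$calls++;
		return true;
	}
);

// Second call never invokes its lookup: the cached value wins.
$second = sketch_cached_flag(
	function () use ( &$calls ) {
		$calls++;
		return false;
	}
);

var_dump( $first, $second, $calls ); // bool(true), bool(true), int(1)
```

In practice this suggests that a change to blog_charset partway through a request, for example in a test that switches the option, would not be picked up by later calls in that same request.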

Changelog

Version Description
6.9.0 Stripping replaces invalid byte sequences with the Unicode replacement character U+FFFD (�).
2.8.0 Introduced.