Audit usage of mb_* functions in PHP parser #1611

Closed
nylen opened this issue Jun 30, 2017 · 13 comments
Labels
[Feature] Parsing, [Priority] High, [Type] Bug

Comments


nylen commented Jun 30, 2017

The generated PHP parser contains several instances of mb_* (multibyte string) functions:

     38 mb_substr(
      6 mb_strlen(
      3 mb_regex_encoding(
      1 mb_eregi(
      1 mb_ereg(
      1 mb_convert_encoding(

However, PHP is not always installed with the mbstring extension enabled, so these functions do not always exist.

WP core has polyfills that attempt to emulate mb_substr and mb_strlen if they are not present; the rest of these functions are newly introduced and will cause fatal errors on PHP configurations where the mbstring extension is not enabled.
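(For reference, the core polyfills follow the usual guard pattern; this is a simplified sketch, not the exact compat.php code:)

if ( ! function_exists( 'mb_strlen' ) ) {
	function mb_strlen( $str, $encoding = null ) {
		// Count UTF-8 encoded codepoints using PCRE's /u modifier.
		// (The real _mb_strlen() also has a byte-pattern fallback for
		// environments where PCRE lacks UTF-8 support.)
		return preg_match_all( '/./us', $str, $matches );
	}
}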

There are a few options for how to proceed:

Polyfill the new functions

Converting encodings and matching regexes both seem quite complicated.

Decide that this is no longer an issue

This code in core has a lot of history, and I am not sure how common it is to have PHP installed without the mbstring extension. (It's not enabled in a default build of PHP, but this hasn't been a problem in distributions of PHP that I've used).

In particular, there is a mention of yoast.com having this configuration, but it is from 6 years ago.

Replace the new functions with something else

Fortunately, mb_convert_encoding is only used to convert a character code to a UTF-8 character, and [I am pretty sure that] the mb_ereg and mb_eregi calls are only used to match very simple regular expressions of the form ^[charlist] against single characters. So the simplest path forward is probably to remove these calls from phpegjs.

@nylen nylen added the [Feature] Parsing and [Type] Bug labels Jun 30, 2017
youknowriad (Contributor) commented:

A potential polyfill: https://github.com/symfony/polyfill-mbstring


dmsnell commented Jul 10, 2017

Closed with the merge of #1775 (I'm pretty sure 😉)

@dmsnell dmsnell closed this as completed Jul 10, 2017

nylen commented Jul 10, 2017

Unfortunately #1775 only got rid of the 2 functions that are already polyfilled in core.

     38 mb_substr(           -- No longer used.
      6 mb_strlen(           -- No longer used.
      3 mb_regex_encoding(   -- Still used.
      1 mb_eregi(            -- Still used.
      1 mb_ereg(             -- Still used.
      1 mb_convert_encoding( -- Still used.

@nylen nylen reopened this Jul 10, 2017

nylen commented Jul 10, 2017

There is now also a usage of preg_match_all in Unicode mode, which is not always available. Core has a polyfill for this too. See https://github.com/nylen/phpegjs/blob/v1.0.0-beta5/src/passes/generate-php.js#L709-L715 for details and links.
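(Core's capability check is roughly the following, paraphrased from _wp_can_use_pcre_u() in wp-includes/compat.php; when it returns false, core falls back to byte-pattern regexes:)

function example_can_use_pcre_u() {
	static $utf8_pcre = null;
	if ( null === $utf8_pcre ) {
		// If PCRE was built without UTF-8 support, this preg_match() fails
		// and emits a warning, which the @ suppresses.
		$utf8_pcre = (bool) @preg_match( '/^./u', 'a' );
	}
	return $utf8_pcre;
}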


mtias commented Jul 11, 2017

This is causing fatal errors for some people in certain configurations. Let's avoid breaking, and instead degrade gracefully by not rendering dynamic blocks on the server.

@mtias mtias added this to the Beta 0.5.0 milestone Jul 11, 2017
@mtias mtias added the [Priority] High label Jul 11, 2017
mdawaffe commented:

mb_ereg(), mb_eregi()

As I understand it, the grammar never needs multi-byte regex patterns (we only ever look for digits, ASCII text, and a few other explicit character classes). For inputs guaranteed to be UTF-8, I think we can just switch to preg_match() (with no /u needed).
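Something along these lines (the character class here is illustrative, not copied from the grammar):

$char = 'k'; // a single character from the input

// Before (requires mbstring):
//     mb_regex_encoding( 'UTF-8' );
//     mb_ereg( '^[a-z0-9]', $char );
// After (plain PCRE; safe for ASCII-only classes against UTF-8 input):
if ( preg_match( '/^[a-z0-9]/', $char ) ) {
	// the character matched the class
}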

mb_regex_encoding()

This is no longer required if we never use mb_ereg(), mb_eregi().

mb_convert_encoding()

I believe we can switch to html_entity_decode():

function Gutenberg_PEG_chr_unicode( $code ) {
	return html_entity_decode( '&#' . $code . ';', ENT_QUOTES, 'UTF-8' );
}

(The use of ENT_QUOTES is not important; it's just a placeholder for that argument.)
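(A quick check of the idea, using the helper above:)

echo Gutenberg_PEG_chr_unicode( 233 );   // "é" (U+00E9)
echo Gutenberg_PEG_chr_unicode( 19990 ); // "世" (U+4E16)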

Where, though, is this function even used?

preg_match_all() with PCRE_UTF8 (/u)

This is only used to split the input into "characters" (UTF-8 encoded Unicode codepoints) for array-access and counting, true? We could do the splitting in a more annoying way:

/**
 * Split UTF-8 String into array of UTF-8 characters (UTF-8 encoded codepoints).
 *
 * Based on WordPress' _mb_substr() compat function.
 *
 * @param string      $str      The string to split into characters.
 * @return array
 */
function _utf8_split( $str ) {
	if ( _wp_can_use_pcre_u() ) {
		// Use the regex unicode support to separate the UTF-8 characters into an array.
		preg_match_all( '/./us', $str, $match );
		return $match[0];
	}

	$regex = '/(
		  [\x00-\x7F]                  # single-byte sequences   0xxxxxxx
		| [\xC2-\xDF][\x80-\xBF]       # double-byte sequences   110xxxxx 10xxxxxx
		| \xE0[\xA0-\xBF][\x80-\xBF]   # triple-byte sequences   1110xxxx 10xxxxxx * 2
		| [\xE1-\xEC][\x80-\xBF]{2}
		| \xED[\x80-\x9F][\x80-\xBF]
		| [\xEE-\xEF][\x80-\xBF]{2}
		| \xF0[\x90-\xBF][\x80-\xBF]{2} # four-byte sequences   11110xxx 10xxxxxx * 3
		| [\xF1-\xF3][\x80-\xBF]{3}
		| \xF4[\x80-\x8F][\x80-\xBF]{2}
	)/x';

	// Start with 1 element instead of 0 since the first thing we do is pop.
	$chars = array( '' );
	do {
		// We had some string left over from the last round, but we counted it in that last round.
		array_pop( $chars );

		/*
		 * Split by UTF-8 character, limit to 1000 characters (last array element will contain
		 * the rest of the string).
		 */
		$pieces = preg_split( $regex, $str, 1000, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

		$chars = array_merge( $chars, $pieces );

	// If there's anything left over, repeat the loop.
	} while ( count( $pieces ) > 1 && $str = array_pop( $pieces ) );

	return $chars;
}

I did not even try to run this code :) No idea if it works, how it actually scales, etc.


dmsnell commented Jul 12, 2017

@mdawaffe your knowledge of text continues to amaze me. 👏

I was also wondering if we could completely ignore the mb_ functions on account of the fact that we're only searching for ASCII-representable characters. One thing I'm not sure about is whether there might be any security implications in doing so; I'm wholly unfamiliar with those kinds of implications around Unicode. Do you have any input on this, or does your response indicate that you don't see any problem with doing away with multi-byte handling?


mdawaffe commented Jul 12, 2017

In other character encodings, there might be problems. In UTF-8, though, any byte that looks like an ASCII character is never part of a multi-byte sequence. So searching for ASCII bytes shouldn't return false positives, and splitting on ASCII bytes shouldn't open the door to any tricky string-munging-based injection attacks.

All that to say: I think (for this grammar), we can get away with ignoring multi-byteness when tokenizing (if that's the right word).
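A throwaway illustration of why that holds (not taken from the parser): every continuation byte in UTF-8 is 0x80 or above, so byte-oriented functions can never find an ASCII delimiter inside a multi-byte character.

$html = '<!-- wp:paragraph --><p>naïve café</p><!-- /wp:paragraph -->';

// strpos() works on bytes, yet this match is a genuine comment opener,
// not a fragment of some multi-byte character.
var_dump( strpos( $html, '<!--' ) ); // int(0)

// Splitting on an ASCII token never slices a multi-byte character in half.
var_dump( explode( '<!--', $html ) ); // 3 pieces; "naïve café" comes through intact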


nylen commented Jul 12, 2017

Generally I would prefer to fix these issues in a way that leaves phpegjs compatible with as much of PEG.js as possible. In a couple of cases this has led me to deviate slightly from the approaches suggested above.

Everything described here is implemented in #1869.

I believe we can switch mb_convert_encoding() to html_entity_decode()

This is the approach I took in nylen/phpegjs@2cc4046. The ENT_QUOTES is necessary; otherwise, incorrect values are returned for 34 (") and 39 (').
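For example (ENT_NOQUOTES is shown only as a contrast):

// With ENT_QUOTES both numeric quote references decode to the actual characters:
html_entity_decode( '&#34;', ENT_QUOTES, 'UTF-8' ); // '"'
html_entity_decode( '&#39;', ENT_QUOTES, 'UTF-8' ); // "'"
// Other quote modes (e.g. ENT_NOQUOTES) return the entity text itself instead.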

We are not using this chr_unicode function anywhere, but it is available in PEG.js so I would prefer to keep it in phpegjs as well.

As I understand it, the grammar never needs multi-byte regex patterns

Indeed, the only usage of mb_ereg and mb_eregi was to match single characters against character classes. The case-insensitivity handling is especially hard to replicate without using the mb_* functions.

Since we're not using any case-insensitive matching in our grammar, I've added a mbstringAllowed: false option to phpegjs in nylen/phpegjs@ab1dc14. When this option is present, generated parsers will use an alternative method of matching character classes based on numeric character codes, and if case-insensitivity features are used in a grammar, the parser generation will fail with an appropriate error message.
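As a rough illustration of the idea (a hypothetical sketch, not the actual code phpegjs generates):

// Match a single character against [a-z0-9] using numeric codes instead of mb_ereg().
function example_class_matches( $char, array $ranges ) {
	$code = ord( $char ); // first byte; enough for the ASCII-only classes in our grammar
	foreach ( $ranges as $range ) {
		if ( $code >= $range[0] && $code <= $range[1] ) {
			return true;
		}
	}
	return false;
}

$ranges = array( array( 0x61, 0x7A ), array( 0x30, 0x39 ) ); // a-z, 0-9
example_class_matches( 'k', $ranges ); // true
example_class_matches( 'é', $ranges ); // false (first byte is 0xC3)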

This also yielded a slight further performance improvement (measured the same way as #1775 (comment)):

JavaScript :  11ms
PHP 5.6    : 135ms (down from 150ms)
PHP 7.0    :  17ms (no significant change)
PHP 7.1    :  16ms (down from 19ms)

preg_match_all() used to split the input into "characters" for array-access and counting - We could do the splitting in a more annoying way

Since some of the functions we need already exist in WordPress, and yes it is pretty annoying, let's keep this out of phpegjs as well. In nylen/phpegjs@ff75fac I've added a way to do this: split the input string into an array before handing it off to the parser.


nylen commented Jul 12, 2017

I did not even try to run this [UTF-8 splitting] code :) No idea if it works, how it actually scales, etc.

It works :) Based on my testing so far, it doesn't appear to have adverse performance characteristics either.

My only concern is the limit on string length, which defaulted to 1,000. Most places in WP core use this logic to extract fairly short prefixes from a string: excerpts, default post titles, etc. I've changed it to 100,000. Perhaps it should be even longer for this use case.

mdawaffe commented:

The ENT_QUOTES is necessary, otherwise incorrect values are returned for 34 (") and 39 (').

Cool. I clearly don't really know what's going on, but I'm glad I arbitrarily picked the right mode :)

My only concern is the limit on string length, which defaulted to 1,000.

That's just the per-iteration limit in the loop. It can still handle arbitrarily long strings. (Rather, some other bottleneck will appear.) The loop and per-iteration limit are there to prevent regex DoS on catastrophic inputs.
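(For reference, the limit only caps how much is split in one call; the remainder comes back untouched as the final array element:)

preg_split( '/,/', 'a,b,c,d,e', 3 ); // array( 'a', 'b', 'c,d,e' )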


nylen commented Jul 12, 2017

👍 Updated the PR to restore the previous 1000 character limit.

@mtias mtias removed this from the Beta 0.5.0 milestone Jul 12, 2017

nylen commented Jul 13, 2017

Fixed in #1869.
