Audit usage of mb_* functions in PHP parser #1611

Closed
nylen opened this issue Jun 30, 2017 · 13 comments
Labels
[Feature] Parsing, [Priority] High, [Type] Bug

Comments


nylen commented Jun 30, 2017

The generated PHP parser contains several instances of mb_* (multibyte string) functions:

     38 mb_substr(
      6 mb_strlen(
      3 mb_regex_encoding(
      1 mb_eregi(
      1 mb_ereg(
      1 mb_convert_encoding(

However, PHP is not always installed with the mbstring extension enabled, so these functions do not always exist.

WP core has polyfills that attempt to emulate mb_substr and mb_strlen if they are not present; the rest of these functions are newly introduced and will cause fatal errors on PHP configurations where the mbstring extension is not enabled.
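(For reference, the core polyfills follow the usual guard pattern; this is a simplified sketch, not the exact compat.php code:)

if ( ! function_exists( 'mb_strlen' ) ) {
	function mb_strlen( $str, $encoding = null ) {
		// Count UTF-8 encoded codepoints using PCRE's /u modifier.
		// (The real _mb_strlen() also has a byte-pattern fallback for
		// environments where PCRE lacks UTF-8 support.)
		return preg_match_all( '/./us', $str, $matches );
	}
}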

There are a few options for how to proceed:

Polyfill the new functions

Converting encodings and matching regexes both seem quite complicated.

Decide that this is no longer an issue

This code in core has a lot of history, and I am not sure how common it is to have PHP installed without the mbstring extension. (It's not enabled in a default build of PHP, but this hasn't been a problem in distributions of PHP that I've used).

In particular, there is a mention of yoast.com having this configuration, but it is from 6 years ago.

Replace the new functions with something else

Fortunately, mb_convert_encoding is only used to convert a character code to a UTF-8 character, and [I am pretty sure that] the mb_ereg and mb_eregi calls are only used to match very simple regular expressions of the form ^[charlist] against single characters. So the simplest path forward is probably to remove these calls from phpegjs.

@nylen nylen added the [Feature] Parsing and [Type] Bug labels Jun 30, 2017
youknowriad (Contributor) commented:

A potential polyfill: https://github.com/symfony/polyfill-mbstring


dmsnell commented Jul 10, 2017

Closed with the merge of #1775 (I'm pretty sure 😉)

@dmsnell dmsnell closed this as completed Jul 10, 2017

nylen commented Jul 10, 2017

Unfortunately #1775 only got rid of the 2 functions that are already polyfilled in core.

     38 mb_substr(           -- No longer used.
      6 mb_strlen(           -- No longer used.
      3 mb_regex_encoding(   -- Still used.
      1 mb_eregi(            -- Still used.
      1 mb_ereg(             -- Still used.
      1 mb_convert_encoding( -- Still used.

@nylen nylen reopened this Jul 10, 2017

nylen commented Jul 10, 2017

There is now also a usage of preg_match_all in Unicode mode, which is not always available. Core has a polyfill for this too. See https://github.com/nylen/phpegjs/blob/v1.0.0-beta5/src/passes/generate-php.js#L709-L715 for details and links.
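(Core's capability check is roughly the following, paraphrased from _wp_can_use_pcre_u() in wp-includes/compat.php; when it returns false, core falls back to byte-pattern regexes:)

function example_can_use_pcre_u() {
	static $utf8_pcre = null;
	if ( null === $utf8_pcre ) {
		// If PCRE was built without UTF-8 support, this preg_match() fails
		// and emits a warning, which the @ suppresses.
		$utf8_pcre = (bool) @preg_match( '/^./u', 'a' );
	}
	return $utf8_pcre;
}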


mtias commented Jul 11, 2017

This is causing fatal errors for some people in certain configurations. Let's avoid breaking, and instead degrade gracefully by not rendering dynamic blocks on the server.

@mtias mtias added this to the Beta 0.5.0 milestone Jul 11, 2017
@mtias mtias added the [Priority] High label Jul 11, 2017
mdawaffe commented:

mb_ereg(), mb_eregi()

As I understand it, the grammar never needs multi-byte regex patterns (we only ever look for digits, ASCII text, and a few other explicit character classes). For inputs guaranteed to be UTF-8, I think we can just switch to preg_match() (with no /u needed).
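Something along these lines (the character class here is illustrative, not copied from the grammar):

$char = 'k'; // a single character from the input

// Before (requires mbstring):
//     mb_regex_encoding( 'UTF-8' );
//     mb_ereg( '^[a-z0-9]', $char );
// After (plain PCRE; safe for ASCII-only classes against UTF-8 input):
if ( preg_match( '/^[a-z0-9]/', $char ) ) {
	// the character matched the class
}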

mb_regex_encoding()

This is no longer required if we never use mb_ereg(), mb_eregi().

mb_convert_encoding()

I believe we can switch to html_entity_decode():

function Gutenberg_PEG_chr_unicode( $code ) {
	return html_entity_decode( '&#' . $code . ';', ENT_QUOTES, 'UTF-8' );
}

(The use of ENT_QUOTES is not important; it's just a placeholder for that argument.)
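(A quick check of the idea, using the helper above:)

echo Gutenberg_PEG_chr_unicode( 233 );   // "é" (U+00E9)
echo Gutenberg_PEG_chr_unicode( 19990 ); // "世" (U+4E16)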

Where, though, is this function even used?

preg_match_all() with PCRE_UTF8 (/u)

This is only used to split the input into "characters" (UTF-8 encoded Unicode codepoints) for array-access and counting, true? We could do the splitting in a more annoying way:

/**
 * Split UTF-8 String into array of UTF-8 characters (UTF-8 encoded codepoints).
 *
 * Based on WordPress' _mb_substr() compat function.
 *
 * @param string      $str      The string to split into characters.
 * @return array
 */
function _utf8_split( $str ) {
	if ( _wp_can_use_pcre_u() ) {
		// Use the regex unicode support to separate the UTF-8 characters into an array.
		preg_match_all( '/./us', $str, $match );
		return $match[0];
	}

	$regex = '/(
		  [\x00-\x7F]                  # single-byte sequences   0xxxxxxx
		| [\xC2-\xDF][\x80-\xBF]       # double-byte sequences   110xxxxx 10xxxxxx
		| \xE0[\xA0-\xBF][\x80-\xBF]   # triple-byte sequences   1110xxxx 10xxxxxx * 2
		| [\xE1-\xEC][\x80-\xBF]{2}
		| \xED[\x80-\x9F][\x80-\xBF]
		| [\xEE-\xEF][\x80-\xBF]{2}
		| \xF0[\x90-\xBF][\x80-\xBF]{2} # four-byte sequences   11110xxx 10xxxxxx * 3
		| [\xF1-\xF3][\x80-\xBF]{3}
		| \xF4[\x80-\x8F][\x80-\xBF]{2}
	)/x';

	// Start with 1 element instead of 0 since the first thing we do is pop.
	$chars = array( '' );
	do {
		// We had some string left over from the last round, but we counted it in that last round.
		array_pop( $chars );

		/*
		 * Split by UTF-8 character, limit to 1000 characters (last array element will contain
		 * the rest of the string).
		 */
		$pieces = preg_split( $regex, $str, 1000, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

		$chars = array_merge( $chars, $pieces );

	// If there's anything left over, repeat the loop.
	} while ( count( $pieces ) > 1 && $str = array_pop( $pieces ) );

	return $chars;
}

I did not even try to run this code :) No idea if it works, how it actually scales, etc.


dmsnell commented Jul 12, 2017

@mdawaffe your knowledge of text continues to amaze me. 👏

I was also wondering if we could completely ignore the mb_ functions on account of the fact that we're only searching for ASCII-representable characters. One thing I'm not sure about is whether there might be any security implications in doing so; I'm wholly unfamiliar with those kinds of implications around Unicode. Do you have any input on this, or does your response indicate that you don't see any problem with doing away with multi-byte handling?


mdawaffe commented Jul 12, 2017

In other character encodings, there might be problems. In UTF-8, though, any byte that looks like an ASCII character is never part of a multi-byte sequence. So searching for ASCII bytes shouldn't return false positives, and splitting on ASCII bytes shouldn't open the door to any tricky string-munging-based injection attacks.

All that to say: I think (for this grammar), we can get away with ignoring multi-byteness when tokenizing (if that's the right word).
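A throwaway illustration of why that holds (not taken from the parser): every continuation byte in UTF-8 is 0x80 or above, so byte-oriented functions can never find an ASCII delimiter inside a multi-byte character.

$html = '<!-- wp:paragraph --><p>naïve café</p><!-- /wp:paragraph -->';

// strpos() works on bytes, yet this match is a genuine comment opener,
// not a fragment of some multi-byte character.
var_dump( strpos( $html, '<!--' ) ); // int(0)

// Splitting on an ASCII token never slices a multi-byte character in half.
var_dump( explode( '<!--', $html ) ); // 3 pieces; "naïve café" comes through intact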


nylen commented Jul 12, 2017

Generally I would prefer to fix these issues in a way that leaves phpegjs compatible with as much of PEG.js as possible. In a couple of cases this has led me to deviate slightly from the approaches suggested above.

Everything described here is implemented in #1869.

I believe we can switch mb_convert_encoding() to html_entity_decode()

This is the approach I took in nylen/phpegjs@2cc4046. The ENT_QUOTES is necessary; otherwise, incorrect values are returned for 34 (") and 39 (').
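For example (ENT_NOQUOTES is shown only as a contrast):

// With ENT_QUOTES both numeric quote references decode to the actual characters:
html_entity_decode( '&#34;', ENT_QUOTES, 'UTF-8' ); // '"'
html_entity_decode( '&#39;', ENT_QUOTES, 'UTF-8' ); // "'"
// Other quote modes (e.g. ENT_NOQUOTES) return the entity text itself instead.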

We are not using this chr_unicode function anywhere, but it is available in PEG.js so I would prefer to keep it in phpegjs as well.

As I understand it, the grammar never needs multi-byte regex patterns

Indeed, the only usage of mb_ereg and mb_eregi was to match single characters against character classes. The case-insensitivity handling is especially hard to replicate without using the mb_* functions.

Since we're not using any case-insensitive matching in our grammar, I've added a mbstringAllowed: false option to phpegjs in nylen/phpegjs@ab1dc14. When this option is present, generated parsers will use an alternative method of matching character classes based on numeric character codes, and if case-insensitivity features are used in a grammar, the parser generation will fail with an appropriate error message.
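As a rough illustration of the idea (a hypothetical sketch, not the actual code phpegjs generates):

// Match a single character against [a-z0-9] using numeric codes instead of mb_ereg().
function example_class_matches( $char, array $ranges ) {
	$code = ord( $char ); // first byte; enough for the ASCII-only classes in our grammar
	foreach ( $ranges as $range ) {
		if ( $code >= $range[0] && $code <= $range[1] ) {
			return true;
		}
	}
	return false;
}

$ranges = array( array( 0x61, 0x7A ), array( 0x30, 0x39 ) ); // a-z, 0-9
example_class_matches( 'k', $ranges ); // true
example_class_matches( 'é', $ranges ); // false (first byte is 0xC3)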

This also yielded a slight further performance improvement (measured the same way as #1775 (comment)):

JavaScript :  11ms
PHP 5.6    : 135ms (down from 150ms)
PHP 7.0    :  17ms (no significant change)
PHP 7.1    :  16ms (down from 19ms)

preg_match_all() used to split the input into "characters" for array-access and counting - We could do the splitting in a more annoying way

Since some of the functions we need already exist in WordPress, and yes it is pretty annoying, let's keep this out of phpegjs as well. In nylen/phpegjs@ff75fac I've added a way to do this: split the input string into an array before handing it off to the parser.


nylen commented Jul 12, 2017

I did not even try to run this [UTF-8 splitting] code :) No idea if it works, how it actually scales, etc.

It works :) Based on my testing so far, it doesn't appear to have adverse performance characteristics either.

My only concern is the limit on string length, which defaulted to 1,000. Most places in WP core use this logic to extract fairly short prefixes from a string: excerpts, default post titles, etc. I've changed it to 100,000. Perhaps it should be even longer for this use case.

mdawaffe commented:

The ENT_QUOTES is necessary, otherwise incorrect values are returned for 34 (") and 39 (').

Cool. I clearly don't really know what's going on, but I'm glad I arbitrarily picked the right mode :)

My only concern is the limit on string length, which defaulted to 1,000.

That's just the per-iteration limit in the loop. It can still handle arbitrarily long strings. (Rather, some other bottleneck will appear.) The loop and per-iteration limit are there to prevent regex DoS on catastrophic inputs.
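(For reference, the limit only caps how much is split in one call; the remainder comes back untouched as the final array element:)

preg_split( '/,/', 'a,b,c,d,e', 3 ); // array( 'a', 'b', 'c,d,e' )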


nylen commented Jul 12, 2017

👍 Updated the PR to restore the previous 1000 character limit.

@mtias mtias removed this from the Beta 0.5.0 milestone Jul 12, 2017

nylen commented Jul 13, 2017

Fixed in #1869.
