unicode: add UTF flag if subject is UTF #15

Open
rurban opened this issue Apr 9, 2017 · 2 comments

Comments

@rurban
Owner

rurban commented Apr 9, 2017

If the pattern is not UTF-8 (but is ambiguous because it uses classes such as \D, \W, ...)
and the subject is UTF-8, recompile the pattern with the UTF flag and match again.
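
A minimal sketch of that check, in C. It assumes the "ambiguous" escapes are exactly \d \D \w \W \s \S (the issue's list ends in "...", so the real set may be larger), and the names `pattern_is_ambiguous` and `needs_utf_recompile` are hypothetical, not part of the engine:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the pattern uses an escape whose meaning
 * differs between ASCII and Unicode matching. The set "dDwWsS" is an
 * assumption made for this sketch. */
static bool pattern_is_ambiguous(const char *pat, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (pat[i] != '\\')
            continue;
        char next = pat[i + 1];
        if (next != '\0' && strchr("dDwWsS", next) != NULL)
            return true;
        i++; /* skip the escaped character so "\\\\w" is not misread as \w */
    }
    return false;
}

/* Recompile with the UTF flag only when the pattern was compiled without
 * it, the subject is UTF-8, and the pattern is ambiguous. */
static bool needs_utf_recompile(bool pattern_is_utf, bool subject_is_utf,
                                const char *pat, size_t len)
{
    return !pattern_is_utf && subject_is_utf
        && pattern_is_ambiguous(pat, len);
}
```

In the XS code the two booleans would presumably come from the pattern's and the subject's UTF-8 flags rather than from scanning bytes.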

failing re_tests:

\w	\x{200C}	yp	$&	\x{200C}
\W	\x{200C}	np	-	-
\w	\x{200D}	yp	$&	\x{200D}
\W	\x{200D}	np	-	-

/^\D{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-
/^\S{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-
/^\W{11}/a	\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}\x{10FFFF}	np	-	-

# [ perl #114272]
\Vn	\xFFn/	yp	$&	\xFFn

a?\X	     a\x{100}	yp	$&	a\x{100}
@rurban
Owner Author

rurban commented Apr 9, 2017

Implementation plan:

  • If a pattern contains Unicode-sensitive classes such as \w, \s, or \d, always compile it with /u.
    If the subject is ASCII, compile it a second time with /a and do the ASCII match.
  • Otherwise, if the pattern was compiled with /a and the subject is UTF-8, recompile it with /u.
  • Cache the optional second pattern in pprivate, as a struct holding compiled_ascii_pattern and compiled_uni_pattern together with the engine. See e.g. re::engine::Hyperscan, where I also store two pointers in pprivate.
  • Also cache statistics about ASCII/Unicode usage to make better predictions (e.g. two more ints).
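
The caching part of the plan could look roughly like the C sketch below. Only the field names `compiled_ascii_pattern` and `compiled_uni_pattern` come from the plan; `pattern_cache`, `cache_select`, `compile_fn`, the counters, and `stub_compile` are hypothetical names for illustration. The sketch also simplifies the plan by compiling each variant lazily on first use, and it uses a stub in place of a real compile call:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a compiled pattern; a real engine would hold its own
 * compiled-code pointer here. */
typedef struct compiled_pattern {
    bool unicode;
} compiled_pattern;

/* Hypothetical compile callback: compiles pat in ASCII or Unicode mode. */
typedef compiled_pattern *(*compile_fn)(const char *pat, bool unicode);

/* What would be stored in pprivate: both variants plus usage counters. */
typedef struct {
    compiled_pattern *compiled_ascii_pattern; /* NULL until first needed */
    compiled_pattern *compiled_uni_pattern;   /* NULL until first needed */
    unsigned ascii_uses; /* matches against ASCII subjects */
    unsigned uni_uses;   /* matches against UTF-8 subjects */
} pattern_cache;

/* Return the variant matching the subject, compiling it at most once. */
static compiled_pattern *cache_select(pattern_cache *pc, const char *pat,
                                      bool subject_is_utf, compile_fn compile)
{
    compiled_pattern **slot = subject_is_utf ? &pc->compiled_uni_pattern
                                             : &pc->compiled_ascii_pattern;
    if (subject_is_utf)
        pc->uni_uses++;
    else
        pc->ascii_uses++;
    if (*slot == NULL)
        *slot = compile(pat, subject_is_utf);
    return *slot;
}

/* Demonstration stub: counts calls instead of compiling anything. */
static int compile_calls = 0;

static compiled_pattern *stub_compile(const char *pat, bool unicode)
{
    static compiled_pattern slots[2];
    (void)pat;
    compile_calls++;
    slots[unicode].unicode = unicode;
    return &slots[unicode];
}
```

The two counters are where a prediction heuristic could hook in, for example choosing which variant to compile first for a new pattern.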

@todd-richmond

Any progress on this? I'm looking to use PCRE2 for better performance, but I need mixed UTF-8 (regex and subject) to work.
