UTF8 capture problem #36

Open
todd-richmond opened this issue Jan 30, 2020 · 3 comments

@todd-richmond

There is a bug common to re::engine::* modules: the RXf_MATCH_UTF8 flag is not set on the Perl regex object, so captures are not computed as UTF8 when the input is UTF8. Setting the flag fixes 2 critical issues:

  1.  All captures as well as ${^PREMATCH} and ${^POSTMATCH} will correctly have their utf8 bits set
    
  2.  $-[0] and $+[0] (the start and end offsets of the match) will be computed correctly as utf8 character offsets rather than byte offsets. When these are wrong, it is impossible to extract the match as a substring of the original text. Doing that instead of using ${^POSTMATCH} is required due to a horrific perf problem
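
To illustrate why the two kinds of offset in item 2 diverge, here is a minimal, self-contained C sketch. It is not Perl's or any re::engine module's implementation; `utf8_char_offset` and `match_char_offset` are hypothetical helpers written only for this example, and they assume the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Perl API): number of UTF-8 characters that
 * start before byte_off in s. Assumes s is valid UTF-8 and byte_off lies
 * on a character boundary within the string. */
static size_t utf8_char_offset(const char *s, size_t byte_off)
{
    size_t chars = 0;

    assert(byte_off <= strlen(s));
    for (size_t i = 0; i < byte_off; i++) {
        /* Continuation bytes have the form 10xxxxxx; any other byte
         * starts a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Hypothetical helper: character offset of the first occurrence of
 * needle in text, or (size_t)-1 if needle does not occur. */
static size_t match_char_offset(const char *text, const char *needle)
{
    const char *hit = strstr(text, needle);

    if (hit == NULL)
        return (size_t)-1;
    return utf8_char_offset(text, (size_t)(hit - text));
}
```

For the text "héllo wörld" (where é and ö each occupy two bytes), the match "rld" starts at byte 10 but at character 8. An engine that reports 10 where Perl expects a character offset would make substr on the original string return the wrong text.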
    

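XS code will need to do something like this:

/* Request UTF8 handling of captures and offsets for this regex. */
#ifdef RXf_UTF8
    /* RXf_UTF8 is defined: test it in the flags value */
    if (flags & RXf_UTF8)
        extflags |= RXf_MATCH_UTF8;
#else
    /* RXf_UTF8 is not defined: test the pattern SV itself */
    if (SvUTF8(pattern))
        extflags |= RXf_MATCH_UTF8;
#endif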

@demerphq
Contributor

@todd-richmond did you file a ticket with Perl for this? This is the first time I've heard about it.

@rurban
Owner

rurban commented Jan 21, 2023

Also, the engine does not get the utf8 flag set yet; see #15.

@rurban rurban self-assigned this Jan 21, 2023
@todd-richmond
Author

@demerphq I think this is an issue with this module, not Perl, but I'm not 100% positive.
I filed a similar patch against re-engine-re2 and have been using it in production for several years. Without it, all captures are corrupt if the pattern is UTF8.


3 participants