Avoidable "PCRE compilation error" by allowing more than ASCII #35322

Closed
PallHaraldsson opened this issue Mar 31, 2020 · 2 comments · Fixed by #35607
Labels
domain:strings "Strings!" status:help wanted Indicates that a maintainer wants help on an issue or pull request

Comments

Contributor

PallHaraldsson commented Mar 31, 2020

Is the "PCRE_UTF8 option" for pcre_compile() missing, or is it something else?

"ð" (and "þ") are some of the extra letters in the Icelandic alphabet, and I would want to make sure those work too.

From memory, the same kind of regex worked in Python, and as I was new to this, it took a long time to figure out what was wrong (I was porting something from Python that supposedly should have worked). I had a longer regex, and it was (too) tedious to count out to the right letter:

julia> r"""(?P<viðburðarnafn>((\w)*))"""
ERROR: LoadError: PCRE compilation error: syntax error in subpattern name (missing terminator) at offset 6
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compile(::String, ::UInt32) at ./pcre.jl:123
 [3] compile(::Regex) at ./regex.jl:72
 [4] Regex(::String, ::UInt32, ::UInt32) at ./regex.jl:37
 [5] Regex(::String) at ./regex.jl:60
 [6] @r_str(::LineNumberNode, ::Module, ::Any) at ./regex.jl:109
in expression starting at REPL[51]:1
julia> versioninfo
versioninfo (generic function with 2 methods)

julia> versioninfo()
Julia Version 1.5.0-DEV.360
Commit 012b270df6 (2020-02-28 07:57 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, broadwell)
Environment:
  JULIA_LLVM_ARGS = -unroll-threshold=500
  JULIA_NUM_THREADS = 4
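
For anyone hitting this on an affected build, here is a minimal workaround sketch. It assumes that only the group name triggers the error (offset 6 in the message above is where "ð" sits in the name). The name `vidburdarnafn` is a hypothetical ASCII transliteration chosen for illustration, and the subject string is made up:

```julia
# Workaround sketch: keep the group name ASCII-only so that PCRE versions
# which reject non-ASCII group names can still compile the pattern.
# `vidburdarnafn` is a hypothetical transliteration, not a name from any API.
re = r"(?P<vidburdarnafn>\w*)"

# The subject text itself is an arbitrary example string.
m = match(re, "fundur í dag")

# Named captures can be looked up by name on the RegexMatch.
println(m[:vidburdarnafn])
```

This only sidesteps the problem: the group name in the code no longer matches the Icelandic word, so any code that looks up the capture by name has to use the transliterated name too.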

http://pcre.sourceforge.net/pcre.txt

UTF-8 SUPPORT

 To build PCRE with support for UTF-8 character strings, add

   --enable-utf8

 to the configure command. Of itself, this does not make PCRE
 treat  strings as UTF-8. As well as compiling PCRE with this
 option, you also have to set the PCRE_UTF8 option  when
 you call the pcre_compile() function.
StefanKarpinski added the status:help wanted and domain:strings labels Apr 13, 2020
Micket added a commit to Micket/julia that referenced this issue Apr 27, 2020
Contributor

Micket commented Apr 27, 2020

Just need to update to PCRE2 10.34 and this will just work as far as I can tell. Created PR #35607

Note that there is lots of outdated documentation around for PCRE; over the years, it seems --enable-utf8, --enable-utf, and --enable-unicode have all existed. It's all on by default nowadays, though.

Sponsor Member

vtjnash commented Feb 3, 2021

Fixed by #39310
