Regex containing umlauts in square brackets not matching #4922

Closed
muety opened this issue Dec 15, 2023 · 4 comments

@muety

muety commented Dec 15, 2023

Description

I might be mistaken, but the following seems like a bug to me. Regexes that contain a German umlaut (or probably any other non-ASCII character) inside square brackets (like ^foo[ä]b) don't match as expected, e.g. they don't match the word fooäbar.

Steps to Reproduce

Suppose you have the following user document in your database:

{
  "_id": "3f6b99deb2a9d8ba80ac3a8cf3000558",
  "_rev": "1-889fc2635b372fa8b0f436fe6c6d4393",
  "lastname": "Mütsch",
  "firstname": "Ferdinand",
  "created": 1702629477669
}

Try to run these queries against the /_find endpoint:

Query 1:

{
   "selector": {
      "lastname": {
         "$regex": "(?i)^müt"
      }
   }
}

Expected: returns above document
Actual: returns above document ✅

Query 2:

{
   "selector": {
      "lastname": {
         "$regex": "(?i)^mü[t]"
      }
   }
}

Expected: returns above document
Actual: returns above document ✅

Query 3:

{
   "selector": {
      "lastname": {
         "$regex": "(?i)^m[ü]t"
      }
   }
}

Expected: returns above document
Actual: returns empty result ❌

According to my tests on Regex101 with PCRE, the last query should match the above document's lastname field, but for some reason doesn't.

Your Environment

  • CouchDB version used: 3.3.3 (inside Docker with latest 3-tagged image)
  • Browser name and version: Firefox 120.0.1
  • Operating system and version: Fedora 38 / whatever OS is used in the Docker image
@willholley
Member

The Erlang regex syntax can be a little unusual. In this case, I think the issue is that, as described in the "Square Brackets and Character Classes" section of the Erlang re documentation, you need to set UTF mode to get the behaviour you expect.
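To illustrate why the bracketed umlaut behaves differently, here is a rough analogy in Python. It uses Python's own re module on UTF-8 encoded bytes, not Erlang's PCRE-based engine, so it is only a sketch of the likely mechanism: without UTF mode the pattern is treated byte by byte, "ü" is the two bytes 0xC3 0xBC, and [ü] therefore becomes a class that matches only one of those bytes. The helper names matches_bytes and matches_text are made up for this example.

```python
import re

lastname = "Mütsch"
patterns = ["(?i)^müt", "(?i)^mü[t]", "(?i)^m[ü]t"]  # Queries 1-3 from the issue


def matches_bytes(pattern: str, text: str) -> bool:
    """Match on raw UTF-8 bytes (analogy for a regex engine without UTF mode)."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None


def matches_text(pattern: str, text: str) -> bool:
    """Match on Unicode code points (analogy for UTF mode)."""
    return re.search(pattern, text) is not None


byte_results = [matches_bytes(p, lastname) for p in patterns]
text_results = [matches_text(p, lastname) for p in patterns]

print(byte_results)  # [True, True, False]
print(text_results)  # [True, True, True]
```

In byte mode the third pattern fails because [ü] consumes only the byte 0xC3, and the following "t" is then compared against 0xBC. This mirrors the results reported for Queries 1 to 3.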

Changing the query to:

{
   "selector": {
      "lastname": {
         "$regex": "(*UTF8)(?i)^m[ü]t"
      }
   }
}

should give you the correct result.
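If many such queries are built programmatically, the prefix can be added in one place. The following is a minimal sketch, assuming the query body is assembled in Python; utf8_regex_selector is a hypothetical helper, not part of CouchDB or any client library, and it only checks for a leading (*UTF8) verb.

```python
import json

UTF8_VERB = "(*UTF8)"


def utf8_regex_selector(field: str, pattern: str) -> dict:
    """Build a Mango query body whose $regex starts with the (*UTF8) verb.

    Hypothetical helper for illustration; it does not talk to CouchDB.
    """
    if not pattern.startswith(UTF8_VERB):
        pattern = UTF8_VERB + pattern
    return {"selector": {field: {"$regex": pattern}}}


query = utf8_regex_selector("lastname", "(?i)^m[ü]t")
# ensure_ascii=False keeps the umlaut readable in the serialized body.
print(json.dumps(query, ensure_ascii=False, indent=3))
```

The resulting dictionary corresponds to the JSON body shown above and could be sent to the /_find endpoint with any HTTP client.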

@muety
Author

muety commented Dec 15, 2023

Awesome, thanks a lot 👍

@big-r81
Contributor

big-r81 commented Dec 15, 2023

> The Erlang regex syntax can be a little unusual. In this case, I think the issue is that, as described in the "Square Brackets and Character Classes" section of the Erlang re documentation, you need to set UTF mode to get the behaviour you expect.

Is this mentioned in the docs?

@willholley
Member

The docs layout isn't great, but we do link to the Erlang syntax for reference in https://docs.couchdb.org/en/stable/api/database/find.html#condition-operators.
