Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: regex (or minimally, startswith/endswith) string tests #164

Closed
gojomo opened this issue Jul 11, 2013 · 81 comments
Closed

Comments

@gojomo
Copy link

gojomo commented Jul 11, 2013

String contains() is nice, but full regular-expression matching would be great. Or perhaps as smaller step: startswith/endswith or positional string-matching tests?

@georgir
Copy link

georgir commented Oct 16, 2013

If we get substr() or characters in strings could be iterated, #192, we could easily implement startswith/endswith and the like ourselves. Regex functionality (both test and replace, first-match or global, with capturing subpattern matches and their positions) would be a great addition though.

@jedisct1
Copy link

+1

@nrk
Copy link

nrk commented Nov 2, 2013

I think full-blown regular expressions are probably a bit of an overkill but something like Lua's pattern matching would be fast and powerful enough for most tasks.

@georgir
Copy link

georgir commented Nov 4, 2013

On first look that really seems to me the same as regexps with % instead of \ for escape. I do like their () solution for capturing position though.

@stedolan
Copy link
Contributor

stedolan commented Nov 8, 2013

String processing in jq is not good at the moment. I really want to implement a PEG-style parsing system like LPeg or Parsec, which I find a bit more useful and readable than regexes. It's a bit of work though, and I haven't had a chance recently. In the meantime, a wrapper around libc's regex functions would probably be a useful stopgap.

@amagura
Copy link

amagura commented Dec 4, 2013

+1 Support for GNU sed-like regex expression support.

@nicowilliams
Copy link
Contributor

startswith/endswith and a few more goodies are now in master.

@pkoppstein
Copy link
Contributor

As wonderful as jq is, I would argue that true greatness will require support for regular expressions with named captures. (*)

My impression is that the oniguruma open-source library (**) would be a very good match for jq.

Meanwhile, thank you stedolan !

(*) Ruby supports named captures very nicely, as illustrated by this interaction:

> /< *(?<email>[^> ]*)>/ =~ "Abe Lincoln <[email protected]>"
=> 12
> email
=> "[email protected]"

The creation and setting of local variables in this manner should be possible in jq, perhaps allowing jq expressions such as the following:

.people[].email | select( match( "(?

[^> ])>" )) | $address

Supporting regex with named captures in this manner would also allow parsing of dates in a way that would resolve the "sorting by non-numeric dates" issue mentioned above (#359).

(**) http:https://www.geocities.jp/kosako3/oniguruma is a stable, multi-platform open source regex library written in C that is used by ruby 1.9 and php.

@wtlangford
Copy link
Contributor

I'm currently working on implementing this.
Are we willing to add a hard dependency on an external library for regex support? Or would we rather make it more optional, such as a capability that has to be enabled at compile time?
If we're okay with the hard dependency, I also like oniguruma. It seems to be readily available from package managers, both on various *nixes and brew for os x, and is small in size (the fedora package took up a whopping 671 KB.

@nicowilliams
Copy link
Contributor

So far jq has been very portable largely because it has no external run-time dependencies -- just the C and math libraries. How will adding a hard dependency affect the Windows cross-compilation build, for example? I'm not sure, but I'd urge you to try a cross-compilation before doing a PR, just to make sure that I can build and release for Windows.

The way jq currently handles number parsing and printing is by including in the source tree what should really have been an external library (jv_dtoa.c). I was considering doing something like that for bignums: include libtommath (or libtomfloat, or whatever the name is that escapes me atm) in the jq source tree. You might do the same for regexp, but you need to be careful that the license of whatever library you choose is compatible with jq's.

Alternatives include:

  • using a library that is widely available, including Windows, so that for building a release one need only download the headers and DLL and it all just works (except now we'd have to distribute jq and the regexp DLL)
  • implement a plugin loading facility, using dlopen() on POSIX, LoadLibrary*() on Windows, ...

Having just done a release of jq, these issues are very much on my mind.

Thoughts?

@nicowilliams
Copy link
Contributor

Also, since I do want regexp to be part of jq and don't want users to have to separately download a regexp DLL and plugin... I think the best option is to add a suitable regexp library to the source tree. If there's a suitable such library with a git repo, then maybe we could use git submodules.

@wtlangford
Copy link
Contributor

Is there a particular flavor of regexp we are looking for? I'm working assuming we're looking to support PCRE-style regexes, but that could be wrong. And if we are, are we aiming to support things like PCRE modifiers (case-insensitive, extended mode (non-significant whitespace), etc)?

@wtlangford
Copy link
Contributor

Of course, I can also just add support in directly for them, as well as for the other regexp styles, a la PHP's http:https://www.php.net/manual/en/function.mb-regex-set-options.php
Maybe a match("/pattern/modifiers") and a match_ex("/pattern/modifiers","mode")

@nicowilliams
Copy link
Contributor

I'm not able to search right now, but the issue has come up before and IIRC
@stedolan wanted a PCRE. You can search the issues list yourself, or i
will tomorrow.

@pkoppstein
Copy link
Contributor

@wtlangford asked:

Is there a particular flavor of regexp we are looking for?

Since the "j" in "jq" ultimately refers to JavaScript, I believe that jq support for regular expressions should (at least in general) be guided by JavaScript. A useful reference is:

https://developer.mozilla.or/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

If in doubt, I would suggest checking (and maybe following the example of) Ruby 2.1:

http:https://www.ruby-doc.org/core-2.1.1/Regexp.html

One really important point I'd like to make, however, is that it should be possible in jq to work with regular expressions and flags (aka modifiers) as data. That is, one should be able to read regular expression specifications and flags as data, and manipulate them as data. Since "data" in the jq context means JSON, it follows that jq should support JSON specifications of regular expressions and flags.

For example, suppose we want a jq function match(REGEX) that takes as input a STRING. The simplest approach that avoids new syntax would be for match() to expect the regex to be supplied as a string, e.g. "/abc/i" (or perhaps as a JSON array or object). Indeed, given the wealth of possibilities, I cannot really see the point in introducing additional syntax to handle regular expressions.

Of course, if JSON had a regex type, then we'd certainly use it, but it doesn't :-(

@wtlangford
Copy link
Contributor

With respect to how to include the regex library, the one I'm currently using doesn't have a git repository, though that isn't hard to correct. It does, however, have a non-trivial build and I'm not sure how to work that into the current build. I've currently modified the build to link against a system version for testing purposes, though I would not add that to a pull request without discussing.

@wtlangford
Copy link
Contributor

@nicowilliams: A search through the issues shows that @stedolan was considering using PCRE, so I'll use that syntax, which is nice, since the JavaScript regular expressions seem to have been based off of Perl's anyways.
@pkoppstein I agree that the specification and flags should come in as data and not require a new syntax. I'm looking at ways to keep it simple, both in code and in usage. I can't decide if it seems more likely/useful that they come in as a single piece of data or as two, though. Thoughts on whether the function should be something like match("/abc/i") or match("/abc/"; "i") or match("abc"; "i") ?
If we do one of the first two, then any backslashes in the actual regex pattern will need to be escaped and we will need to unescape them during processing.
Of course, we could also define it to be match(["abc","i"])
Also: thoughts on a handful of regex-related functions, similar to javascript's exec, test, match, search? (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Working_with_Regular_Expressions)

@pkoppstein
Copy link
Contributor

@wtlangford contrasts:

  1. match("/abc/i")
  2. match("/abc/"; "i")
  3. match("abc"; "i")
  4. match(["abc","i"])

The key to resolving this issue is, I believe, that whereas jq requires that functions have a specific arity, it does support polymorphism. That is, in a sense, we can have our cake (match("abc")) and eat it (match(["abc", "i"])).

In fact, we could easily support:
5. match(STRING), match([STRING]), and match([STRING, STRING])

So unless @nicowilliams will allow us to have both match(STRING) and match(STRING;STRING), we can say:

(1) has the advantage of familiarity but at double the peskiness (because we have both '"' and '/' to trick everyone up)
(2) and (3) don't fly.

((2) seems pointless anyway.)

Another (very tiny) advantage of (5) is that it could be further extended, e.g. to support {"regex": "abc", "flags": "i"}.

@wtlangford
Copy link
Contributor

@pkoppstein I agree with your analysis. A polymorphic approach like that is very JavaScript and seems to be the best option that exposes additional functionality without too much clutter (since arity is strictly enforced).
I'll work towards
5. match(STRING), match([STRING]), and match([STRING, STRING])
and will throw in the match(OBJECT) implementation as well. It won't take much additional work.

@georgir
Copy link

georgir commented Jun 12, 2014

I just want to add that it would be best if the result from match() is very
detailed, like an array of objects with offsets and lengths of all
subpatterns, not just the matched strings. And if we support named
subpatterns, then the result should probably be an object with separate
object subkey for named subpatterns and array subkey for unnamed
subpatterns, etc...
From that and the brand new string slicing we can implement replace
functions in jq itself... I guess... are string slices assignable?

@wtlangford
Copy link
Contributor

@georgir said:

I just want to add that it would be best if the result from match() is very
detailed, like an array of objects with offsets and lengths of all
subpatterns, not just the matched strings. And if we support named
subpatterns, then the result should probably be an object with separate
object subkey for named subpatterns and array subkey for unnamed
subpatterns, etc...

But for consistency at that point if no matches are found the result should be []. And then that wouldn't work inside of select(match("abc")), which is where I see this getting a lot of use.
I like having the ability to return the matches, though. I'll likely add a search() builtin that does this, maybe returns an object like this:

{
    "matches" : [ {
        "offset" : #,
        "length" : #,
        "string" : "matchString"
    }, ... ],
    "captures" : [ "captureString", ... ],
    "namedCaptures" : { "name" : "captureString", ... }
}

@nicowilliams
Copy link
Contributor

Multiple arity could be added: one def per arity. Adding bigger arity than
six requires changes to parser.y; I would prefer not to go that way.
Varargs for jq functions would necessitate new syntax because jq function
arguments are lambdas, not values (e.g., $1 and so on for each lambda).
Varargs for C-coded functions would be OK, but in general C-coded
functions should be wrapped with jq ones, and anways, it's easy enough to
pass/take arrays in those cases.

Bottom line: the easiest thing to do is to pass an array when you want
multiple argument values rather than multiple lambdas, and always be
careful not to accidentally cause cartesian products where no one expects
them. But it's true that multiple arguments look better than arrays.

As to polymorphism, yes, but no automatic type conversions -- that's the
key.

match(...) and match(..., options) and so on would be OK. But so would
match(...) and match([..., options]). If you are up for contributing
multiple arity support in builtin"c and compile.c, go for the first, else
the second.

@nicowilliams
Copy link
Contributor

@pkoppstein The 'j' in jq does not stand for JavaScript. Nowhere in the docs (nor source) is "jq" explained as standing for anything. But the docs talk a lot about JSON right on the homepage while nothing is said about JavaScript. The reader is subliminally invited to think of "jq" as being a "JSON query" language. But still, nothing is said about what "jq" stands for. Feel free to propose importing language features from any language, but not bad language features (of which JavaScript is full) :)

@georgir
Copy link

georgir commented Jun 12, 2014

Will, captures and namedCaptures should have the structure of matches in your example, wanting offset for them and not just the strings is exactly what I ment originally. Perhaps me calling them subpatterns instead of captures was confusing. And I did overlook a regex with global-match flag, I was just thinking of single match with multiple captures. In global match mode, each full match result object should have its own captures/namedCaptures subkeys.

Whether you return an empty array or null or false for no match is fine by me... .[]? lets us iterate all possibilities without boilerplate checks, and length can be added to your select example. You might want to add a boolean test alternative to match though, if it will help performance to run with no captures and quit on the first match with the specific regex lib.

EDIT: Actually, a most jq-ish way would be to return empty on no match, one object on one match, and multiple objects on multiple matches for global-flagged regexes, as separate outputs rather than in an array.

@wtlangford
Copy link
Contributor

I do like test as a name for that better. I'll probably implement it anyways, if nothing more than to make select with a regex simpler, though I will see if I can get the lib to stop early.
Regarding the result format, that does make more sense; I misunderstood you the first time.

@wtlangford
Copy link
Contributor

So, playing with test cases, I've come across an oddity. How do we handle zero-width matches? For example, given the input abc abc and the regex (abc)*, a match will be returned after the first abc of width zero. I feel that this should not occur, but at the same time, it is also correct.
As it stands right now, there's a modifer to ignore zero-width matches. I'm tempted to invert it and make the default behavior to not return zero-width matches and have the modifier allow them. It feels like the more expected behavior to me. Thoughts?

@pkoppstein
Copy link
Contributor

@wtlangford wrote:

given the input abc abc and the regex (abc)*, a match will be returned after the first abc of width zero.

And one would hope after the second "abc" as well. For example, using irb (ruby) we see:

"abc abc".scan( /(abc)*/ )
=> [["abc"], [nil], ["abc"], [nil]]

sed (or at least sed on a Mac) also acts on zero-width matches:

$ sed -E 's/( )*/ABC/g' <<< "abc abc"
ABCaABCbABCcABCaABCbABCcABC

@nicowilliams
Copy link
Contributor

@wtlangford It's easy to filter out zero-width matches, so let them through i say :)

@pkoppstein
Copy link
Contributor

@nicowilliams wrote:

It's easy to filter out zero-width matches, so let them through i say :)

Yes, anyone who writes (abc)* deserves to have his or her zero-width matches. As in all matters to do with regex, it's nice to have ruby as a guide.

@wtlangford
Copy link
Contributor

That's a terrifying thing sed did there, yet exactly what it was told to do. After having slept on it, I find myself agreeing that the empty matches should be the default behavior. As such, I'm currently returning them as full match objects, just with 0 length. Capture groups that didn't match anything are also returned with an offset of -1, so as to preserve numbering. Would it be better to simply place nulls in the output for these instead? I don't think so, as the empty matches will still have an offset.

@wtlangford
Copy link
Contributor

@nicowilliams I'm ready to start the final bits of testing (specifically the windows build.) What's the usual method for cross-compiling jq to windows?

@nicowilliams
Copy link
Contributor

@wtlangford Install mingw64, by which I mean:

  • create an unprivileged user account in which to run it (optional)
  • get the tarballs
  • untar

then (as that user account, if you created one):

$ make distclean   # if need be
$ autoreconf -fi   # if need be
$ PATH=<path-to-mingw64-bin>:$PATH scripts/crosscompile win64 --host=x86_64-w64-mingw32

The scripts/crosscompile script will create a directory called build/$1 (build/win64 in this case) and will run ./configure with remaining arguments.

@nicowilliams
Copy link
Contributor

If you can figure out how to setup a cross-compilation environment for OS X, then you can do that too.

Finally, recall that it's ok in the interim if we just don't get regexp on Windows. The approach I'll probably be taking for bignums is to include libtommath and libtomfloat in-tree, either whole or as git submodules. Assuming the license of the regexp library of your choice is amenable to this, we can always do the same with it.

@wtlangford
Copy link
Contributor

Okay, so regexp support builds and tests fine for the mingw64 windows cross-compile build. I had to compile oniguruma with mingw64 first, but that is to be expected.
It also works on OS X, building natively.

I don't seem to be able to build shared libraries when cross-compiling for windows, and I'm not sure if that is known/intended behavior, since a clean copy of jq's master will also not build shared libraries.

As things stand right now, I've modified configure.ac to require oniguruma to exist somewhere on the system (--with-oniguruma=/path/to/oniguruma if it isn't in the standard locations).

The brew formula's head section will need to be updated once this is merged in.
I'm going to write a few more test cases for all.test and then prep for a pull request.

Any final thoughts?

@nicowilliams
Copy link
Contributor

Yeah, libjq isn't getting installed... I should fix that. I don't care about libjq on Windows, not yet, but that will be my problem; let's not make it your problem.

No other thoughts as to your PR, just go ahead and submit it. I'm really looking forward to it!

@wtlangford
Copy link
Contributor

Found a leak in my final testing. Gonna track that down.
Also have found what appears to be a weird corner case with the library that, while correct, feels odd.
/. bar/ matched on "a\u0304 bar" returns "\u0304 bar", which renders (on my computer) as a double-quote with a macron and some other things. It appears that the library isn't matching graphemes or characters, but instead codepoints. Which makes sense in a lot of ways. A regex library shouldn't be responsible for normalizing UTF-8.
Just an oddity I thought I'd write somewhere for the poor soul who ends up with this problem.

@nicowilliams
Copy link
Contributor

Er, sorry, but I do get libjq libraries when building in situ. I get both, a libjq shared object and a static link archive.

scrips/crosscompile doesn't install them into the build location, nor the headers. That's true.

@wtlangford
Copy link
Contributor

Ah. Found the issue. Is there somewhere I can put cleanup code? something along the lines of fini or __attribute((destructor))? I need to call onig_end() to make it clean up some of its leftovers. I could call end after every invocation of match, but that seems wrong, as onig_end() is to be called when we are done with the library.

@wtlangford
Copy link
Contributor

@nicowilliams would you mind sharing the way you called scripts/crosscompile so I can be sure that it isn't something I'm doing, then?

@nicowilliams
Copy link
Contributor

@wtlangford Well, in my I/O builtins branch I made all the C-coded builtins take a jq_state *, so we could hook the cleanup there and have it run at jq_teardown() time. Sadly that's not in master right now, so you'll have to do something else. If this is just one-time cleanup to make valgrind happy, then let's setup a suppressions file for valgrind. If it's not one-time then we're going to have to make the C-coded builtins take a jq_state * now, not later, and we'll need a function by which to set a cleanup handler on the jq_state *.

(struct jq_state is where the VM state lives.)

@nicowilliams
Copy link
Contributor

scripts/crosscompile only does make install-binaries, and it doesn't use DESTDIR, so it doesn't setup a decent prefix -- it's really designed only to produce the jq executable and nothing else (it removes all the temporary build artifacts too). Just take a peek at it and you'll see.

I.e., scripts/crosscompile needs a revamp. You can hack on it if what you want is to make sure that the libjq it builds works on Windows, but I won't ask for that.

@wtlangford
Copy link
Contributor

Looks like it is leftover state from compiling regexes. If I compile the same one multiple times in a single run of jq, it doesn't leak any additional memory, so I think this can wait until the jq_state gets added in. That being said, we should likely open an issue for it once this gets merged so that it doesn't get forgotten.
Also, it is only on the order of hundreds of bytes, so I don't think it is a high-priority issue.

@nicowilliams
Copy link
Contributor

@wtlangford If this leaks every time a regexp is compiled then the teardown has to happen when matching is done (since we have no way to reuse compiled regexps at this time, but you could use a thread-specific/local cache of compiled regexps if the library is reentrant and thread-safe).

Only truly one-time state can be left unclean.

@nicowilliams
Copy link
Contributor

I.e., you should check whether compiling many times leaks more.

@wtlangford
Copy link
Contributor

I poked around in the library's source and it seems that the library is just recycling used parse nodes instead of always creating new ones, which is why they are being reported as leaking. Calls to onig_end() free the node list, which removes the nodes, but will allow multiple calls to a different init function, onigenc_init(), which is currently a no-op, but I don't know that it will stay that way. I can just call onig_end() every time with impunity for now, but this might cause undesired behavior later.

@nicowilliams
Copy link
Contributor

@wtlangford I took a look at Onigurama. You can leave this leak in. There's a fixed size to the node cache, so the leak size is bound.

FYI, Onigurama init/end is not really thread safe, incidentally. In particular onig_init() should use something like a proper "once" initializer, and there's no refcounting of inits.

@nicowilliams
Copy link
Contributor

I pushed a partial revamp of scripts/crosscompile.

wtlangford added a commit to wtlangford/jq that referenced this issue Jun 18, 2014
jq now depends on oniguruma for regex support.
Modified configure.ac accordingly.

Added valgrind suppression file for oniguruma to prevent one-time and bounded
leaks from causing tests to fail.
nicowilliams pushed a commit that referenced this issue Jun 19, 2014
jq now depends on oniguruma for regex support.
Modified configure.ac accordingly.

Added valgrind suppression file for oniguruma to prevent one-time and bounded
leaks from causing tests to fail.

Signed-off-by: Nicolas Williams <[email protected]>
@nicowilliams
Copy link
Contributor

Closed with #164.

@pkoppstein
Copy link
Contributor

@wtlangford -- Congratulations and thank you!

I was especially pleased to see that "named captures" are, in a sense, already working:

$ jq 'match( "(?<x>abc)*"; "n" ).captures[0]' <<< $'"abc abc"'
{
  "offset": 0,
  "length": 3,
  "string": "abc",
  "name": "x"
}

@wtlangford
Copy link
Contributor

They turned out to be really simple to add, so I just threw them in. They're placed inside of the normal captures array, as they still count for numbering. Now I'm just working on getting the homebrew formula updated for head so that it will build again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants