Feature request: regex (or minimally, startswith/endswith) string tests #164

gojomo · 2013-07-11T05:48:53Z

String contains() is nice, but full regular-expression matching would be great. Or perhaps as smaller step: startswith/endswith or positional string-matching tests?

The text was updated successfully, but these errors were encountered:

georgir · 2013-10-16T08:48:39Z

If we get substr() or characters in strings could be iterated, #192, we could easily implement startswith/endswith and the like ourselves. Regex functionality (both test and replace, first-match or global, with capturing subpattern matches and their positions) would be a great addition though.

jedisct1 · 2013-10-23T22:28:24Z

+1

nrk · 2013-11-02T11:10:59Z

I think full-blown regular expressions are probably a bit of an overkill but something like Lua's pattern matching would be fast and powerful enough for most tasks.

georgir · 2013-11-04T14:20:20Z

On first look that really seems to me the same as regexps with % instead of \ for escape. I do like their () solution for capturing position though.

stedolan · 2013-11-08T12:04:27Z

String processing in jq is not good at the moment. I really want to implement a PEG-style parsing system like LPeg or Parsec, which I find a bit more useful and readable than regexes. It's a bit of work though, and I haven't had a chance recently. In the meantime, a wrapper around libc's regex functions would probably be a useful stopgap.

amagura · 2013-12-04T19:12:01Z

+1 Support for GNU sed-like regex expression support.

nicowilliams · 2013-12-07T10:15:21Z

startswith/endswith and a few more goodies are now in master.

pkoppstein · 2014-05-22T06:02:00Z

As wonderful as jq is, I would argue that true greatness will require support for regular expressions with named captures. (*)

My impression is that the oniguruma open-source library (**) would be a very good match for jq.

Meanwhile, thank you stedolan !

(*) Ruby supports named captures very nicely, as illustrated by this interaction:

> /< *(?<email>[^> ]*)>/ =~ "Abe Lincoln <[email protected]>"
=> 12
> email
=> "[email protected]"

The creation and setting of local variables in this manner should be possible in jq, perhaps allowing jq expressions such as the following:

.people[].email | select( match( "(?

[^> ])>" )) | $address

Supporting regex with named captures in this manner would also allow parsing of dates in a way that would resolve the "sorting by non-numeric dates" issue mentioned above (#359).

(**) http:https://www.geocities.jp/kosako3/oniguruma is a stable, multi-platform open source regex library written in C that is used by ruby 1.9 and php.

wtlangford · 2014-06-12T00:06:18Z

I'm currently working on implementing this.
Are we willing to add a hard dependency on an external library for regex support? Or would we rather make it more optional, such as a capability that has to be enabled at compile time?
If we're okay with the hard dependency, I also like oniguruma. It seems to be readily available from package managers, both on various *nixes and brew for os x, and is small in size (the fedora package took up a whopping 671 KB.

nicowilliams · 2014-06-12T00:21:56Z

So far jq has been very portable largely because it has no external run-time dependencies -- just the C and math libraries. How will adding a hard dependency affect the Windows cross-compilation build, for example? I'm not sure, but I'd urge you to try a cross-compilation before doing a PR, just to make sure that I can build and release for Windows.

The way jq currently handles number parsing and printing is by including in the source tree what should really have been an external library (jv_dtoa.c). I was considering doing something like that for bignums: include libtommath (or libtomfloat, or whatever the name is that escapes me atm) in the jq source tree. You might do the same for regexp, but you need to be careful that the license of whatever library you choose is compatible with jq's.

Alternatives include:

using a library that is widely available, including Windows, so that for building a release one need only download the headers and DLL and it all just works (except now we'd have to distribute jq and the regexp DLL)
implement a plugin loading facility, using dlopen() on POSIX, LoadLibrary*() on Windows, ...

Having just done a release of jq, these issues are very much on my mind.

Thoughts?

nicowilliams · 2014-06-12T00:23:43Z

Also, since I do want regexp to be part of jq and don't want users to have to separately download a regexp DLL and plugin... I think the best option is to add a suitable regexp library to the source tree. If there's a suitable such library with a git repo, then maybe we could use git submodules.

wtlangford · 2014-06-12T01:40:11Z

Is there a particular flavor of regexp we are looking for? I'm working assuming we're looking to support PCRE-style regexes, but that could be wrong. And if we are, are we aiming to support things like PCRE modifiers (case-insensitive, extended mode (non-significant whitespace), etc)?

wtlangford · 2014-06-12T01:44:29Z

Of course, I can also just add support in directly for them, as well as for the other regexp styles, a la PHP's http:https://www.php.net/manual/en/function.mb-regex-set-options.php
Maybe a match("/pattern/modifiers") and a match_ex("/pattern/modifiers","mode")

nicowilliams · 2014-06-12T03:02:56Z

I'm not able to search right now, but the issue has come up before and IIRC
@stedolan wanted a PCRE. You can search the issues list yourself, or i
will tomorrow.

pkoppstein · 2014-06-12T03:15:36Z

@wtlangford asked:

Is there a particular flavor of regexp we are looking for?

Since the "j" in "jq" ultimately refers to JavaScript, I believe that jq support for regular expressions should (at least in general) be guided by JavaScript. A useful reference is:

https://developer.mozilla.or/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

If in doubt, I would suggest checking (and maybe following the example of) Ruby 2.1:

http:https://www.ruby-doc.org/core-2.1.1/Regexp.html

One really important point I'd like to make, however, is that it should be possible in jq to work with regular expressions and flags (aka modifiers) as data. That is, one should be able to read regular expression specifications and flags as data, and manipulate them as data. Since "data" in the jq context means JSON, it follows that jq should support JSON specifications of regular expressions and flags.

For example, suppose we want a jq function match(REGEX) that takes as input a STRING. The simplest approach that avoids new syntax would be for match() to expect the regex to be supplied as a string, e.g. "/abc/i" (or perhaps as a JSON array or object). Indeed, given the wealth of possibilities, I cannot really see the point in introducing additional syntax to handle regular expressions.

Of course, if JSON had a regex type, then we'd certainly use it, but it doesn't :-(

wtlangford · 2014-06-12T03:47:37Z

With respect to how to include the regex library, the one I'm currently using doesn't have a git repository, though that isn't hard to correct. It does, however, have a non-trivial build and I'm not sure how to work that into the current build. I've currently modified the build to link against a system version for testing purposes, though I would not add that to a pull request without discussing.

wtlangford · 2014-06-12T04:54:51Z

@nicowilliams: A search through the issues shows that @stedolan was considering using PCRE, so I'll use that syntax, which is nice, since the JavaScript regular expressions seem to have been based off of Perl's anyways.
@pkoppstein I agree that the specification and flags should come in as data and not require a new syntax. I'm looking at ways to keep it simple, both in code and in usage. I can't decide if it seems more likely/useful that they come in as a single piece of data or as two, though. Thoughts on whether the function should be something like match("/abc/i") or match("/abc/"; "i") or match("abc"; "i") ?
If we do one of the first two, then any backslashes in the actual regex pattern will need to be escaped and we will need to unescape them during processing.
Of course, we could also define it to be match(["abc","i"])
Also: thoughts on a handful of regex-related functions, similar to javascript's exec, test, match, search? (see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Working_with_Regular_Expressions)

pkoppstein · 2014-06-12T05:51:16Z

@wtlangford contrasts:

match("/abc/i")
match("/abc/"; "i")
match("abc"; "i")
match(["abc","i"])

The key to resolving this issue is, I believe, that whereas jq requires that functions have a specific arity, it does support polymorphism. That is, in a sense, we can have our cake (match("abc")) and eat it (match(["abc", "i"])).

In fact, we could easily support:
5. match(STRING), match([STRING]), and match([STRING, STRING])

So unless @nicowilliams will allow us to have both match(STRING) and match(STRING;STRING), we can say:

(1) has the advantage of familiarity but at double the peskiness (because we have both '"' and '/' to trick everyone up)
(2) and (3) don't fly.

((2) seems pointless anyway.)

Another (very tiny) advantage of (5) is that it could be further extended, e.g. to support {"regex": "abc", "flags": "i"}.

wtlangford · 2014-06-12T13:35:05Z

@pkoppstein I agree with your analysis. A polymorphic approach like that is very JavaScript and seems to be the best option that exposes additional functionality without too much clutter (since arity is strictly enforced).
I'll work towards
5. match(STRING), match([STRING]), and match([STRING, STRING])
and will throw in the match(OBJECT) implementation as well. It won't take much additional work.

georgir · 2014-06-12T15:03:49Z

I just want to add that it would be best if the result from match() is very
detailed, like an array of objects with offsets and lengths of all
subpatterns, not just the matched strings. And if we support named
subpatterns, then the result should probably be an object with separate
object subkey for named subpatterns and array subkey for unnamed
subpatterns, etc...
From that and the brand new string slicing we can implement replace
functions in jq itself... I guess... are string slices assignable?

wtlangford · 2014-06-12T15:54:22Z

@georgir said:

I just want to add that it would be best if the result from match() is very
detailed, like an array of objects with offsets and lengths of all
subpatterns, not just the matched strings. And if we support named
subpatterns, then the result should probably be an object with separate
object subkey for named subpatterns and array subkey for unnamed
subpatterns, etc...

But for consistency at that point if no matches are found the result should be []. And then that wouldn't work inside of select(match("abc")), which is where I see this getting a lot of use.
I like having the ability to return the matches, though. I'll likely add a search() builtin that does this, maybe returns an object like this:

{
    "matches" : [ {
        "offset" : #,
        "length" : #,
        "string" : "matchString"
    }, ... ],
    "captures" : [ "captureString", ... ],
    "namedCaptures" : { "name" : "captureString", ... }
}

nicowilliams · 2014-06-12T16:21:49Z

Multiple arity could be added: one def per arity. Adding bigger arity than
six requires changes to parser.y; I would prefer not to go that way.
Varargs for jq functions would necessitate new syntax because jq function
arguments are lambdas, not values (e.g., $1 and so on for each lambda).
Varargs for C-coded functions would be OK, but in general C-coded
functions should be wrapped with jq ones, and anways, it's easy enough to
pass/take arrays in those cases.

Bottom line: the easiest thing to do is to pass an array when you want
multiple argument values rather than multiple lambdas, and always be
careful not to accidentally cause cartesian products where no one expects
them. But it's true that multiple arguments look better than arrays.

As to polymorphism, yes, but no automatic type conversions -- that's the
key.

match(...) and match(..., options) and so on would be OK. But so would
match(...) and match([..., options]). If you are up for contributing
multiple arity support in builtin"c and compile.c, go for the first, else
the second.

nicowilliams · 2014-06-12T16:48:57Z

@pkoppstein The 'j' in jq does not stand for JavaScript. Nowhere in the docs (nor source) is "jq" explained as standing for anything. But the docs talk a lot about JSON right on the homepage while nothing is said about JavaScript. The reader is subliminally invited to think of "jq" as being a "JSON query" language. But still, nothing is said about what "jq" stands for. Feel free to propose importing language features from any language, but not bad language features (of which JavaScript is full) :)

georgir · 2014-06-12T17:19:36Z

Will, captures and namedCaptures should have the structure of matches in your example, wanting offset for them and not just the strings is exactly what I ment originally. Perhaps me calling them subpatterns instead of captures was confusing. And I did overlook a regex with global-match flag, I was just thinking of single match with multiple captures. In global match mode, each full match result object should have its own captures/namedCaptures subkeys.

Whether you return an empty array or null or false for no match is fine by me... .[]? lets us iterate all possibilities without boilerplate checks, and length can be added to your select example. You might want to add a boolean test alternative to match though, if it will help performance to run with no captures and quit on the first match with the specific regex lib.

EDIT: Actually, a most jq-ish way would be to return empty on no match, one object on one match, and multiple objects on multiple matches for global-flagged regexes, as separate outputs rather than in an array.

wtlangford · 2014-06-12T17:32:16Z

I do like test as a name for that better. I'll probably implement it anyways, if nothing more than to make select with a regex simpler, though I will see if I can get the lib to stop early.
Regarding the result format, that does make more sense; I misunderstood you the first time.

wtlangford · 2014-06-18T03:25:19Z

So, playing with test cases, I've come across an oddity. How do we handle zero-width matches? For example, given the input abc abc and the regex (abc)*, a match will be returned after the first abc of width zero. I feel that this should not occur, but at the same time, it is also correct.
As it stands right now, there's a modifer to ignore zero-width matches. I'm tempted to invert it and make the default behavior to not return zero-width matches and have the modifier allow them. It feels like the more expected behavior to me. Thoughts?

pkoppstein · 2014-06-18T05:08:47Z

@wtlangford wrote:

given the input abc abc and the regex (abc)*, a match will be returned after the first abc of width zero.

And one would hope after the second "abc" as well. For example, using irb (ruby) we see:

"abc abc".scan( /(abc)*/ )
=> [["abc"], [nil], ["abc"], [nil]]

sed (or at least sed on a Mac) also acts on zero-width matches:

$ sed -E 's/( )*/ABC/g' <<< "abc abc"
ABCaABCbABCcABCaABCbABCcABC

nicowilliams · 2014-06-18T05:26:53Z

@wtlangford It's easy to filter out zero-width matches, so let them through i say :)

pkoppstein · 2014-06-18T11:25:45Z

@nicowilliams wrote:

It's easy to filter out zero-width matches, so let them through i say :)

Yes, anyone who writes (abc)* deserves to have his or her zero-width matches. As in all matters to do with regex, it's nice to have ruby as a guide.

wtlangford · 2014-06-18T12:39:59Z

That's a terrifying thing sed did there, yet exactly what it was told to do. After having slept on it, I find myself agreeing that the empty matches should be the default behavior. As such, I'm currently returning them as full match objects, just with 0 length. Capture groups that didn't match anything are also returned with an offset of -1, so as to preserve numbering. Would it be better to simply place nulls in the output for these instead? I don't think so, as the empty matches will still have an offset.

wtlangford · 2014-06-18T15:56:29Z

@nicowilliams I'm ready to start the final bits of testing (specifically the windows build.) What's the usual method for cross-compiling jq to windows?

nicowilliams · 2014-06-18T17:09:16Z

@wtlangford Install mingw64, by which I mean:

create an unprivileged user account in which to run it (optional)
get the tarballs
untar

then (as that user account, if you created one):

$ make distclean   # if need be
$ autoreconf -fi   # if need be
$ PATH=<path-to-mingw64-bin>:$PATH scripts/crosscompile win64 --host=x86_64-w64-mingw32

The scripts/crosscompile script will create a directory called build/$1 (build/win64 in this case) and will run ./configure with remaining arguments.

nicowilliams · 2014-06-18T17:15:45Z

If you can figure out how to setup a cross-compilation environment for OS X, then you can do that too.

Finally, recall that it's ok in the interim if we just don't get regexp on Windows. The approach I'll probably be taking for bignums is to include libtommath and libtomfloat in-tree, either whole or as git submodules. Assuming the license of the regexp library of your choice is amenable to this, we can always do the same with it.

wtlangford · 2014-06-18T20:11:05Z

Okay, so regexp support builds and tests fine for the mingw64 windows cross-compile build. I had to compile oniguruma with mingw64 first, but that is to be expected.
It also works on OS X, building natively.

I don't seem to be able to build shared libraries when cross-compiling for windows, and I'm not sure if that is known/intended behavior, since a clean copy of jq's master will also not build shared libraries.

As things stand right now, I've modified configure.ac to require oniguruma to exist somewhere on the system (--with-oniguruma=/path/to/oniguruma if it isn't in the standard locations).

The brew formula's head section will need to be updated once this is merged in.
I'm going to write a few more test cases for all.test and then prep for a pull request.

Any final thoughts?

nicowilliams · 2014-06-18T20:37:27Z

Yeah, libjq isn't getting installed... I should fix that. I don't care about libjq on Windows, not yet, but that will be my problem; let's not make it your problem.

No other thoughts as to your PR, just go ahead and submit it. I'm really looking forward to it!

wtlangford · 2014-06-18T20:47:44Z

Found a leak in my final testing. Gonna track that down.
Also have found what appears to be a weird corner case with the library that, while correct, feels odd.
/. bar/ matched on "a\u0304 bar" returns "\u0304 bar", which renders (on my computer) as a double-quote with a macron and some other things. It appears that the library isn't matching graphemes or characters, but instead codepoints. Which makes sense in a lot of ways. A regex library shouldn't be responsible for normalizing UTF-8.
Just an oddity I thought I'd write somewhere for the poor soul who ends up with this problem.

nicowilliams · 2014-06-18T20:52:19Z

Er, sorry, but I do get libjq libraries when building in situ. I get both, a libjq shared object and a static link archive.

scrips/crosscompile doesn't install them into the build location, nor the headers. That's true.

wtlangford · 2014-06-18T20:53:14Z

Ah. Found the issue. Is there somewhere I can put cleanup code? something along the lines of fini or __attribute((destructor))? I need to call onig_end() to make it clean up some of its leftovers. I could call end after every invocation of match, but that seems wrong, as onig_end() is to be called when we are done with the library.

wtlangford · 2014-06-18T20:54:19Z

@nicowilliams would you mind sharing the way you called scripts/crosscompile so I can be sure that it isn't something I'm doing, then?

nicowilliams · 2014-06-18T20:57:57Z

@wtlangford Well, in my I/O builtins branch I made all the C-coded builtins take a jq_state *, so we could hook the cleanup there and have it run at jq_teardown() time. Sadly that's not in master right now, so you'll have to do something else. If this is just one-time cleanup to make valgrind happy, then let's setup a suppressions file for valgrind. If it's not one-time then we're going to have to make the C-coded builtins take a jq_state * now, not later, and we'll need a function by which to set a cleanup handler on the jq_state *.

(struct jq_state is where the VM state lives.)

nicowilliams · 2014-06-18T21:01:04Z

scripts/crosscompile only does make install-binaries, and it doesn't use DESTDIR, so it doesn't setup a decent prefix -- it's really designed only to produce the jq executable and nothing else (it removes all the temporary build artifacts too). Just take a peek at it and you'll see.

I.e., scripts/crosscompile needs a revamp. You can hack on it if what you want is to make sure that the libjq it builds works on Windows, but I won't ask for that.

wtlangford · 2014-06-18T21:03:51Z

Looks like it is leftover state from compiling regexes. If I compile the same one multiple times in a single run of jq, it doesn't leak any additional memory, so I think this can wait until the jq_state gets added in. That being said, we should likely open an issue for it once this gets merged so that it doesn't get forgotten.
Also, it is only on the order of hundreds of bytes, so I don't think it is a high-priority issue.

nicowilliams · 2014-06-18T21:13:10Z

@wtlangford If this leaks every time a regexp is compiled then the teardown has to happen when matching is done (since we have no way to reuse compiled regexps at this time, but you could use a thread-specific/local cache of compiled regexps if the library is reentrant and thread-safe).

Only truly one-time state can be left unclean.

nicowilliams · 2014-06-18T21:13:36Z

I.e., you should check whether compiling many times leaks more.

wtlangford · 2014-06-18T21:26:16Z

I poked around in the library's source and it seems that the library is just recycling used parse nodes instead of always creating new ones, which is why they are being reported as leaking. Calls to onig_end() free the node list, which removes the nodes, but will allow multiple calls to a different init function, onigenc_init(), which is currently a no-op, but I don't know that it will stay that way. I can just call onig_end() every time with impunity for now, but this might cause undesired behavior later.

nicowilliams · 2014-06-18T21:42:21Z

@wtlangford I took a look at Onigurama. You can leave this leak in. There's a fixed size to the node cache, so the leak size is bound.

FYI, Onigurama init/end is not really thread safe, incidentally. In particular onig_init() should use something like a proper "once" initializer, and there's no refcounting of inits.

nicowilliams · 2014-06-18T21:43:21Z

I pushed a partial revamp of scripts/crosscompile.

jq now depends on oniguruma for regex support. Modified configure.ac accordingly. Added valgrind suppression file for oniguruma to prevent one-time and bounded leaks from causing tests to fail.

jq now depends on oniguruma for regex support. Modified configure.ac accordingly. Added valgrind suppression file for oniguruma to prevent one-time and bounded leaks from causing tests to fail. Signed-off-by: Nicolas Williams <[email protected]>

nicowilliams · 2014-06-19T02:03:41Z

Closed with #164.

pkoppstein · 2014-06-19T03:25:52Z

@wtlangford -- Congratulations and thank you!

I was especially pleased to see that "named captures" are, in a sense, already working:

$ jq 'match( "(?<x>abc)*"; "n" ).captures[0]' <<< $'"abc abc"'
{
  "offset": 0,
  "length": 3,
  "string": "abc",
  "name": "x"
}

wtlangford · 2014-06-19T03:28:00Z

They turned out to be really simple to add, so I just threw them in. They're placed inside of the normal captures array, as they still count for numbering. Now I'm just working on getting the homebrew formula updated for head so that it will build again.

georgir mentioned this issue Oct 16, 2013

Feature request: string manipulation: either substr() or a way for character iteration. #192

Closed

lluchs mentioned this issue Feb 12, 2014

endswith is not defined #289

Closed

kirbysayshi mentioned this issue May 11, 2014

sorting by non-numeric dates #359

Closed

pkoppstein mentioned this issue May 24, 2014

Error with reading large integer? #369

Closed

nicowilliams added this to the 1.5 release milestone Jun 6, 2014

nicowilliams closed this as completed Jun 19, 2014

pkoppstein mentioned this issue Jun 19, 2014

enhancement request: regex filters #432

Closed

wader mentioned this issue Jun 13, 2023

Fix regular expression section on PCRE (fix #2439), also improve example format #2613

Merged

Feature request: regex (or minimally, startswith/endswith) string tests #164

Feature request: regex (or minimally, startswith/endswith) string tests #164

Comments

gojomo commented Jul 11, 2013

georgir commented Oct 16, 2013

jedisct1 commented Oct 23, 2013

nrk commented Nov 2, 2013

georgir commented Nov 4, 2013

stedolan commented Nov 8, 2013

amagura commented Dec 4, 2013

nicowilliams commented Dec 7, 2013

pkoppstein commented May 22, 2014

wtlangford commented Jun 12, 2014

nicowilliams commented Jun 12, 2014

nicowilliams commented Jun 12, 2014

wtlangford commented Jun 12, 2014

wtlangford commented Jun 12, 2014

nicowilliams commented Jun 12, 2014

pkoppstein commented Jun 12, 2014

wtlangford commented Jun 12, 2014

wtlangford commented Jun 12, 2014

pkoppstein commented Jun 12, 2014

wtlangford commented Jun 12, 2014

georgir commented Jun 12, 2014

wtlangford commented Jun 12, 2014

nicowilliams commented Jun 12, 2014

nicowilliams commented Jun 12, 2014

georgir commented Jun 12, 2014

wtlangford commented Jun 12, 2014

wtlangford commented Jun 18, 2014

pkoppstein commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

pkoppstein commented Jun 18, 2014

wtlangford commented Jun 18, 2014

wtlangford commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

wtlangford commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

wtlangford commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

wtlangford commented Jun 18, 2014

wtlangford commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

wtlangford commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

wtlangford commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

nicowilliams commented Jun 18, 2014

nicowilliams commented Jun 19, 2014

pkoppstein commented Jun 19, 2014

wtlangford commented Jun 19, 2014