Feature request: regex (or minimally, startswith/endswith) string tests #164
Comments
If we get substr() or characters in strings could be iterated, #192, we could easily implement startswith/endswith and the like ourselves. Regex functionality (both test and replace, first-match or global, with capturing subpattern matches and their positions) would be a great addition though. |
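As a rough illustration of the point above, here is a sketch in Python (not jq) of how prefix and suffix tests fall out of plain substring slicing. The names `starts_with` and `ends_with` are hypothetical and are not jq builtins.

```python
def starts_with(s: str, prefix: str) -> bool:
    # Compare the first len(prefix) characters against the prefix.
    return s[:len(prefix)] == prefix


def ends_with(s: str, suffix: str) -> bool:
    # An empty suffix always matches; otherwise compare the tail slice.
    # The explicit check is needed because s[-0:] is the whole string.
    return suffix == "" or s[-len(suffix):] == suffix


print(starts_with("foobar", "foo"))
print(ends_with("foobar", "bar"))
```

Anything beyond fixed prefixes and suffixes (captures, replacement, global matching) is where a real regex engine becomes necessary.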
+1 |
I think full-blown regular expressions are probably a bit of an overkill but something like Lua's pattern matching would be fast and powerful enough for most tasks. |
On first look that really seems to me the same as regexps with |
String processing in jq is not good at the moment. I really want to implement a PEG-style parsing system like LPeg or Parsec, which I find a bit more useful and readable than regexes. It's a bit of work though, and I haven't had a chance recently. In the meantime, a wrapper around libc's regex functions would probably be a useful stopgap. |
+1 for GNU sed-like regular expression support. |
startswith/endswith and a few more goodies are now in master. |
As wonderful as jq is, I would argue that true greatness will require support for regular expressions with named captures. (*) My impression is that the oniguruma open-source library (**) would be a very good match for jq. Meanwhile, thank you stedolan ! (*) Ruby supports named captures very nicely, as illustrated by this interaction:
The creation and setting of local variables in this manner should be possible in jq, perhaps allowing jq expressions such as the following: .people[].email | select( match( "(? [^> ])>" )) | $address
Supporting regex with named captures in this manner would also allow parsing of dates in a way that would resolve the "sorting by non-numeric dates" issue mentioned above (#359).
(**) http://www.geocities.jp/kosako3/oniguruma is a stable, multi-platform open source regex library written in C that is used by Ruby 1.9 and PHP. |
I'm currently working on implementing this. |
So far jq has been very portable largely because it has no external run-time dependencies -- just the C and math libraries. How will adding a hard dependency affect the Windows cross-compilation build, for example? I'm not sure, but I'd urge you to try a cross-compilation before doing a PR, just to make sure that I can build and release for Windows. The way jq currently handles number parsing and printing is by including in the source tree what should really have been an external library (jv_dtoa.c). I was considering doing something like that for bignums: include libtommath (or libtomfloat, or whatever the name is that escapes me atm) in the jq source tree. You might do the same for regexp, but you need to be careful that the license of whatever library you choose is compatible with jq's. Alternatives include:
Having just done a release of jq, these issues are very much on my mind. Thoughts? |
Also, since I do want regexp to be part of jq and don't want users to have to separately download a regexp DLL and plugin... I think the best option is to add a suitable regexp library to the source tree. If there's a suitable such library with a git repo, then maybe we could use git submodules. |
Is there a particular flavor of regexp we are looking for? I'm working assuming we're looking to support PCRE-style regexes, but that could be wrong. And if we are, are we aiming to support things like PCRE modifiers (case-insensitive, extended mode (non-significant whitespace), etc)? |
Of course, I can also just add support for them directly, as well as for the other regexp styles, a la PHP's http://www.php.net/manual/en/function.mb-regex-set-options.php |
I'm not able to search right now, but the issue has come up before and IIRC |
@wtlangford asked:
Since the "j" in "jq" ultimately refers to JavaScript, I believe that
If in doubt, I would suggest checking (and maybe following the example of) Ruby 2.1: http://www.ruby-doc.org/core-2.1.1/Regexp.html One really important point I'd like to make, however, is that it should be possible in jq to work with regular expressions and flags (aka modifiers) as data. That is, one should be able to read regular expression specifications and flags as data, and manipulate them as data. Since "data" in the jq context means JSON, it follows that jq should support JSON specifications of regular expressions and flags. For example, suppose we want a jq function match(REGEX) that takes as input a STRING. The simplest approach that avoids new syntax would be for match() to expect the regex to be supplied as a string, e.g. "/abc/i" (or perhaps as a JSON array or object). Indeed, given the wealth of possibilities, I cannot really see the point in introducing additional syntax to handle regular expressions. Of course, if JSON had a regex type, then we'd certainly use it, but it doesn't :-( |
With respect to how to include the regex library, the one I'm currently using doesn't have a git repository, though that isn't hard to correct. It does, however, have a non-trivial build and I'm not sure how to work that into the current build. I've currently modified the build to link against a system version for testing purposes, though I would not add that to a pull request without discussing. |
@nicowilliams: A search through the issues shows that @stedolan was considering using PCRE, so I'll use that syntax, which is nice, since the JavaScript regular expressions seem to have been based off of Perl's anyways. |
@wtlangford contrasts:
The key to resolving this issue is, I believe, that whereas jq requires that functions have a specific arity, it does support polymorphism. That is, in a sense, we can have our cake (match("abc")) and eat it (match(["abc", "i"])). In fact, we could easily support: So unless @nicowilliams will allow us to have both match(STRING) and match(STRING;STRING), we can say: (1) has the advantage of familiarity but at double the peskiness (because we have both '"' and '/' to trick everyone up) ((2) seems pointless anyway.) Another (very tiny) advantage of (5) is that it could be further extended, e.g. to support {"regex": "abc", "flags": "i"}. |
@pkoppstein I agree with your analysis. A polymorphic approach like that is very JavaScript and seems to be the best option that exposes additional functionality without too much clutter (since arity is strictly enforced). |
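To make the polymorphic-argument idea concrete, here is a minimal sketch in Python rather than jq. It accepts a plain string, a `[regex, flags]` array, or a `{"regex": ..., "flags": ...}` object, as discussed above. The helper names `normalize` and `matches` and the flag table are assumptions made for illustration; they do not describe jq's actual implementation.

```python
import re
from typing import Tuple, Union

# Hypothetical mapping from single-letter flags to Python's re flags.
FLAGS = {"i": re.IGNORECASE, "x": re.VERBOSE}


def normalize(spec: Union[str, list, dict]) -> Tuple[str, str]:
    """Reduce any accepted regex specification to (pattern, flags)."""
    if isinstance(spec, str):
        return spec, ""
    if isinstance(spec, list):
        return spec[0], (spec[1] if len(spec) > 1 else "")
    if isinstance(spec, dict):
        return spec["regex"], spec.get("flags", "")
    # No automatic type conversion: anything else is an error.
    raise TypeError("regex spec must be a string, array, or object")


def matches(spec: Union[str, list, dict], text: str) -> bool:
    pattern, flags = normalize(spec)
    bits = 0
    for ch in flags:
        bits |= FLAGS[ch]
    return re.search(pattern, text, bits) is not None


print(matches("abc", "ABC"))
print(matches(["abc", "i"], "ABC"))
print(matches({"regex": "abc", "flags": "i"}, "ABC"))
```

Because every form is ordinary JSON-shaped data, the specification can be read from input and manipulated before use, which is the property argued for earlier in the thread.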
I just want to add that it would be best if the result from match() is very |
@georgir said:
But for consistency at that point if no matches are found the result should be
|
Multiple arity could be added: one def per arity. Adding bigger arity than Bottom line: the easiest thing to do is to pass an array when you want As to polymorphism, yes, but no automatic type conversions -- that's the
|
@pkoppstein The 'j' in jq does not stand for JavaScript. Nowhere in the docs (nor source) is "jq" explained as standing for anything. But the docs talk a lot about JSON right on the homepage while nothing is said about JavaScript. The reader is subliminally invited to think of "jq" as being a "JSON query" language. But still, nothing is said about what "jq" stands for. Feel free to propose importing language features from any language, but not bad language features (of which JavaScript is full) :) |
Will, Whether you return an empty array or null or false for no match is fine by me... EDIT: Actually, a most jq-ish way would be to return empty on no match, one object on one match, and multiple objects on multiple matches for global-flagged regexes, as separate outputs rather than in an array. |
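The "separate outputs" behaviour suggested in the EDIT above can be sketched with a Python generator: nothing is produced on no match, one object for a single match, and one object per match when matching globally. The function name `match_stream`, the `global_` parameter, and the field names are assumptions for illustration only.

```python
import re
from typing import Iterator


def match_stream(pattern: str, text: str, global_: bool = False) -> Iterator[dict]:
    # Yield one object per match; yield nothing at all when there is no match.
    for m in re.finditer(pattern, text):
        yield {
            "offset": m.start(),
            "length": m.end() - m.start(),
            "string": m.group(0),
        }
        if not global_:
            return  # non-global: stop after the first match


print(list(match_stream("z", "abc")))
print(list(match_stream("a", "aXa")))
print(list(match_stream("a", "aXa", global_=True)))
```

A caller that wants an array can always collect the stream, whereas splitting an array back into separate outputs is an extra step, so the stream is the more general default.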
I do like |
So, playing with test cases, I've come across an oddity. How do we handle zero-width matches? For example, given the input |
@wtlangford wrote:
And one would hope after the second "abc" as well. For example, using irb (ruby) we see:
|
@wtlangford It's easy to filter out zero-width matches, so let them through, I say :) |
@nicowilliams wrote:
Yes, anyone who writes |
That's a terrifying thing sed did there, yet exactly what it was told to do. After having slept on it, I find myself agreeing that the empty matches should be the default behavior. As such, I'm currently returning them as full match objects, just with 0 length. Capture groups that didn't match anything are also returned with an offset of -1, so as to preserve numbering. Would it be better to simply place nulls in the output for these instead? I don't think so, as the empty matches will still have an offset. |
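The two conventions described above (zero-width matches reported as full match objects of length 0, and non-participating capture groups reported with offset -1) can be illustrated with Python's `re` module, which happens to behave the same way on both points. This is only an analogy; the `describe` helper and the object layout are hypothetical and do not come from the jq patch.

```python
import re


def describe(pattern: str, text: str) -> list:
    out = []
    for m in re.finditer(pattern, text):
        captures = [
            # start()/end() are both -1 for a group that did not participate,
            # so the slot is kept and group numbering is preserved.
            {"offset": m.start(i), "length": m.end(i) - m.start(i)}
            for i in range(1, m.re.groups + 1)
        ]
        out.append({
            "offset": m.start(),
            "length": m.end() - m.start(),
            "captures": captures,
        })
    return out


# "x*" matches the empty string at every position of "abc".
print([(d["offset"], d["length"]) for d in describe("x*", "abc")])
# In "(a)|(b)" against "b", group 1 does not participate.
print(describe("(a)|(b)", "b")[0]["captures"])
```

Filtering out empty matches afterwards is a one-line test on `length`, which supports keeping them in the default output.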
@nicowilliams I'm ready to start the final bits of testing (specifically the windows build.) What's the usual method for cross-compiling jq to windows? |
@wtlangford Install mingw64, by which I mean:
then (as that user account, if you created one):
The |
If you can figure out how to setup a cross-compilation environment for OS X, then you can do that too. Finally, recall that it's ok in the interim if we just don't get regexp on Windows. The approach I'll probably be taking for bignums is to include libtommath and libtomfloat in-tree, either whole or as git submodules. Assuming the license of the regexp library of your choice is amenable to this, we can always do the same with it. |
Okay, so regexp support builds and tests fine for the mingw64 windows cross-compile build. I had to compile oniguruma with mingw64 first, but that is to be expected. I don't seem to be able to build shared libraries when cross-compiling for windows, and I'm not sure if that is known/intended behavior, since a clean copy of jq's master will also not build shared libraries. As things stand right now, I've modified configure.ac to require oniguruma to exist somewhere on the system ( The brew formula's head section will need to be updated once this is merged in. Any final thoughts? |
Yeah, libjq isn't getting installed... I should fix that. I don't care about libjq on Windows, not yet, but that will be my problem; let's not make it your problem. No other thoughts as to your PR, just go ahead and submit it. I'm really looking forward to it! |
Found a leak in my final testing. Gonna track that down. |
Er, sorry, but I do get libjq libraries when building in situ. I get both, a libjq shared object and a static link archive.
|
Ah. Found the issue. Is there somewhere I can put cleanup code? Something along the lines of fini or __attribute__((destructor))? I need to call |
@nicowilliams would you mind sharing the way you called scripts/crosscompile so I can be sure that it isn't something I'm doing, then? |
@wtlangford Well, in my I/O builtins branch I made all the C-coded builtins take a ( |
I.e., |
Looks like it is leftover state from compiling regexes. If I compile the same one multiple times in a single run of jq, it doesn't leak any additional memory, so I think this can wait until the jq_state gets added in. That being said, we should likely open an issue for it once this gets merged so that it doesn't get forgotten. |
@wtlangford If this leaks every time a regexp is compiled then the teardown has to happen when matching is done (since we have no way to reuse compiled regexps at this time, but you could use a thread-specific/local cache of compiled regexps if the library is reentrant and thread-safe). Only truly one-time state can be left unclean. |
I.e., you should check whether compiling many times leaks more. |
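The thread-local cache of compiled regexps mentioned above might look roughly like the following Python sketch. It assumes the compiled objects are safe to reuse within one thread; the `compiled` helper is hypothetical, and Python's `re` module already keeps an internal cache of its own, so this only illustrates the shape of the idea.

```python
import re
import threading

# Each thread gets its own private dictionary of compiled patterns.
_local = threading.local()


def compiled(pattern: str) -> "re.Pattern":
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    rx = cache.get(pattern)
    if rx is None:
        # Compile once per thread, then reuse on later calls.
        rx = cache[pattern] = re.compile(pattern)
    return rx


print(compiled("a+") is compiled("a+"))
```

With such a cache, per-compilation state is created once per distinct pattern rather than once per match, but the cache itself then needs a teardown point when the thread or interpreter state goes away.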
I poked around in the library's source and it seems that the library is just recycling used parse nodes instead of always creating new ones, which is why they are being reported as leaking. Calls to |
@wtlangford I took a look at Oniguruma. You can leave this leak in. There's a fixed size to the node cache, so the leak size is bounded. FYI, Oniguruma init/end is not really thread safe, incidentally. In particular |
I pushed a partial revamp of |
jq now depends on oniguruma for regex support. Modified configure.ac accordingly. Added valgrind suppression file for oniguruma to prevent one-time and bounded leaks from causing tests to fail. Signed-off-by: Nicolas Williams <[email protected]>
Closed with #164. |
@wtlangford -- Congratulations and thank you! I was especially pleased to see that "named captures" are, in a sense, already working:
|
They turned out to be really simple to add, so I just threw them in. They're placed inside of the normal captures array, as they still count for numbering. Now I'm just working on getting the homebrew formula updated for head so that it will build again. |
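The layout described above, where named groups live in the same captures array as unnamed ones and still count for numbering, can be sketched in Python as follows. Note that Python spells named groups `(?P<name>...)`, whereas Ruby and Oniguruma use `(?<name>...)`. The `captures` helper and its field names are assumptions for illustration.

```python
import re


def captures(pattern: str, text: str) -> list:
    m = re.search(pattern, text)
    if m is None:
        return []
    # groupindex maps name -> group number; invert it to look up by number.
    names = {index: name for name, index in m.re.groupindex.items()}
    return [
        {"offset": m.start(i), "string": m.group(i), "name": names.get(i)}
        for i in range(1, m.re.groups + 1)
    ]


# Group 1 is named "user"; group 2 is unnamed but keeps its position.
print(captures(r"(?P<user>\w+)@(\w+)", "bob@example"))
```

Keeping one array means positional and named access refer to the same entries, so existing numbering is never disturbed by adding a name.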
String contains() is nice, but full regular-expression matching would be great. Or perhaps as a smaller step: startswith/endswith or positional string-matching tests?