Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhancement request: regex filters #432

Closed
pkoppstein opened this issue Jun 19, 2014 · 51 comments
Closed

enhancement request: regex filters #432

pkoppstein opened this issue Jun 19, 2014 · 51 comments

Comments

@pkoppstein
Copy link
Contributor

jq now has basic regex support in the form of match and test (see #431 and #164) but I believe still lacks "builtin" support for the following:

split(REGEX)
sub(REGEX, STRING)
gsub(REGEX, STRING) 
scan(REGEX)
@nicowilliams
Copy link
Contributor

@pkoppstein Agreed!

It should be pretty simple to write jq-coded builtins to do this.

@nicowilliams
Copy link
Contributor

To clarify:

  • split -> return the substrings not matching the regexp (i.e., split on regexp)
  • sub -> replace the first matching substring with the given string
  • gsub -> replace all matching substrings with the given string
  • scan -> output just the matching substrings

@wtlangford
Copy link
Contributor

So, I don't think that sub and gsub can be jq-coded, since we can't slice and modify strings.
Same for split.
scan would look something like def scan(reg): match(reg; "g") | .string;

@nicowilliams
Copy link
Contributor

@wtlangford You can't modify strings no matter what, and yes, you can slice strings (with .[start:end]). sub should just do something like:

def sub(re; s): . as $s|[match(re)]|.[0]|. as $r|$s|.[0:$r.offset]+s+.[$r.offset+$r.length:-1];

EDIT: removed "g" from call to match (which was wrong anyways since I'd used a comma instead of a semi-colon).

@nicowilliams
Copy link
Contributor

@wtlangford

And:

def _stredit(edits; s): if edits|length == 0 then . else (. as $s|(edits|length -1) as $l|(edits[$l]) as $edit| .[0:$edit.offset]+s+.[$edit.offset+$edit.length:] | _stredit(edits[0:$l]; s) ) end;
def gsub(re; s): [match(re;"g")] as $edits|_stredit($edits; s);

@nicowilliams
Copy link
Contributor

diff --git a/builtin.c b/builtin.c
index 7b215e6..65ba828 100644
--- a/builtin.c
+++ b/builtin.c
@@ -952,7 +952,9 @@ static const char* const jq_builtins[] = {
   "def test(re; mode): _match_impl(re; mode; true);",
   "def test(val): if val |type == \"string\" then test(val; null) elif val | type == \"array\" and (val | length) > 1 then test(val[0]; val[1]) elif val | type == \"array\" and (val | length > 0) then test(val[0]; null) else error((val | type) + \" not a string or array\") end;",
 //  "def test(re): _match(re; null; 1);",
-  
+  "def sub(re; s): . as $s|[match(re)]|.[0]|. as $r|$s|.[0:$r.offset]+s+.[$r.offset+$r.length:-1];",
+  "def _stredit(edits; s): if edits|length == 0 then . else (. as $s|(edits|length -1) as $l|(edits[$l]) as $edit| .[0:$edit.offset]+s+.[$edit.offset+$edit.length:] | _stredit(edits[0:$l]; s) ) end;",
+  "def gsub(re; s): [match(re;\"g\")] as $edits|_stredit($edits; s);",
 };
 #undef LIBM_DD

@wtlangford
Copy link
Contributor

Oh. I don't know where I got no slicing from.

@nicowilliams
Copy link
Contributor

def _matches(re): match(re;"g")|[.offset,.length];
def scan(re): . as $s|_matches(re)|$s[.[0]:.[0]+.[1]];

split(re) turns out to be a bit trickier.

EDIT: fix cut-n-paste error.

@nicowilliams
Copy link
Contributor

Please bang on these, let me know if they work for you!

@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

scan -> output just the matching substrings

That description just covers the basic case. Here's a fuller description (taken from ruby docs) that also covers the case when the regex contains groups:

For each match, a result is generated and ... added to the result array. If the pattern contains no groups, each individual result consists of the matched string ... If the pattern contains groups, each individual result is itself an array containing one entry per group.

@nicowilliams
Copy link
Contributor

@pkoppstein Ah, OK. Can I leave this to you and split to you or @wtlangford ?

@nicowilliams
Copy link
Contributor

Can you do better than this? It doesn't seem as elegant as it could be... My take ought to be simpler, no? But I do like my split: turn the match lengths to end offsets, make a list of start and end offsets, flatten, prefix 0, append the last position, unique, split into pairs of two, then slice the string. Except it's wrong! If the first match starts at offset 0, this doesn't work.

def take(a; n): if a|length < n then a else a[0:n] , take(a[n:];n) end;
def take(n): take(.;n);
def _matches(re): match(re;"g")|[.offset,.length];
# BUSTED
def split(re): . as $s|[_matches(re)|.[0],.[0]+.[1]]|[0]+.+[$s|length]|take(2)|select(length == 2)|$s[.[0]:.[1]];

@pkoppstein
Copy link
Contributor Author

@nicowilliams, @wtlangford -- I just noticed that "offset" is not being correctly set (according to my understanding) as illustrated by:

$ jq --version
jq-1.4-34-gff5cf0a
$ jq -n -c '"aa" | match("a","g")'
{"offset":0,"length":1,"string":"a","captures":[]}
{"offset":0,"length":1,"string":"a","captures":[]}

@pkoppstein
Copy link
Contributor Author

@nicowilliams - This is a slightly revised/simplified version of what you wrote that behaves as I'd expect:

def nwise(a; n): if a|length <= n then a else a[0:n] , nwise(a[n:];n) end;
def nwise(n): nwise(.;n);

def split(re): . as $s
  | [ match(re;"g") | (.offset, .offset + .length) ]
  | [0] + . +[$s|length]
  | nwise(2)
  | $s[.[0]:.[1] ] ;

Examples:

IN: "0a23a5678a"
"0"
"23"
"5678"
""
IN: "a23a5678"
""
"23"
"5678"

@pkoppstein
Copy link
Contributor Author

Apropos scan, @nicowilliams asked:

Can I leave this to you

I have a version ready for testing once match is tamed.

@nicowilliams
Copy link
Contributor

@pkoppstein Try s/,/;/ :)

$ # WRONG
$ jq -n -c '"aa" | match("a","g")'
$ # RIGHT
$ jq -n -c '"aa" | match("a";"g")'

Yeah, the need to use semi-colons to separate arguments lists kills me too sometimes...

$  ./jq -c 'def f(a): [.,a]; f(",","g")'
"a,,b,c"
["a,,b,c",",","g"]

jq does the right thing with commas for arrays. It should have as well for arguments lists. It's too late to fix it.

@nicowilliams
Copy link
Contributor

@pkoppstein Actually, my split wasn't busted:

$ jq 'split(", ")'
", a,b, c, d, e,f, "                             
""
"a,b"
"c"
"d"
"e,f"
""

I just hadn't tested that nor thought it through.

Those empty strings aren't wrong either. They're quite correct I'd say.

Any reason not to call take take?

@nicowilliams
Copy link
Contributor

Alright, I have implementations of all four utility functions requested. Plus take.

I just need to write docs and tests and I'll push.

@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

Try s/,/;/ :)

Indeed. Thanks. In my defense (?), it was late :-(

Here's my now-tested candidate for scan:

def scan(re): 
  [ match(re; "g")
    |  if (.captures|length > 0)
      then [ .captures | .[] | .string ]
      else .string
      end  ];

Any reason not to call take take?

Mainly because

  • we say "pairwise" when n==2; some might say "n-wise" for n>2
  • take(n) often means "take n, and leave the rest" (e.g. in ruby and perhaps English)
  • take/drop are often paired

@pkoppstein
Copy link
Contributor Author

p.s. Please don't forget split/0, e.g.

def split: split("\\s+");

@nicowilliams
Copy link
Contributor

@pkoppstein @wtlangford Ooops, the new split (which shipped in 1.4) conflicts with the old split. The new one outputs multiple results; the old one (being C-coded) outputs an array of all results.

I'm not going to break compatibility. And thinking about it I think I'd like to have a split builtin that doesn't use regexps anyways: for performance reasons. I'm leaning towards renaming the RE-based split to something else. Suggestions?

@nicowilliams
Copy link
Contributor

As to take/nwise, in retrospect the obvious thing to do is to use split(n) to split an array into n-sized arrays.

I'm rather bummed out about split. I'm actually considering breaking compat, actually -- it's a horrible mistake!

@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

I'm rather bummed out about split. I'm actually considering breaking compat, actually -- it's a horrible mistake!

There must be a lesson here somewhere :-)

I don't understand why you think that string#split/1 returning an array is a mistake -- that's what every other comparable split/1 does; also .[] is easy enough to read and write, and [ .... ] is often quite awkward to read. But I suppose that goes to show my jq enlightenment score is very low.

However, if split/1 is to change, then one way out would be to take the view that the introduction of regex is a big enough leap to warrant a jump to the next "major version" (on the theory that "breaking backwards compatibility on minor versions is evil" but not otherwise). Would that require jumping to 2.0?

Using the name "split" instead of nwise/take is fine by me. An alternative to consider would be group(n).

@nicowilliams
Copy link
Contributor

@pkoppstein jq is a language where expressions generate results. like Python iterators with yield or Icon generators with suspend. In such a language it is preferable that primitives generate multiple values rather than a single array of all of them. The latter is more wasteful than the former. I'm sure you've noticed this, now you just have to accept its goodness ;)

@pkoppstein
Copy link
Contributor Author

@nicwilliams -- Maybe I'd be more easily persuaded if jq had a stream-oriented version of reduce.

For example, if we have a long pipeline, S, that produces a stream of JSON objects and wish to add these objects using +, we have either to use the form reduce S .... (which is hard to read when S is complex) or to collect the objects as an array and then use .[] to "uncollect" them again, i.e. [S] | reduce .[] as $x (...) rather than S | reduceStream . as $x (...).

Here's an example motivated by "named captures":

# return an object with the names of the named captures as keys
def capture(re):
  match(re)
   | [ .captures
       | .[] 
       | select(.name != null)
       | { (.name) : .string } ]
   | reduce .[] as $pair ({}; . + $pair);

Example:

"xyz" | capture("(?<a>.).(?<c>.)") as $vars | [ $vars.a, $vars.c ]
 # => [ "x", "z" ]

@pkoppstein
Copy link
Contributor Author

By the way, I'd like to propose adding capture/1 as a builtin, so here's the "reduce S" version suitable for builtin.c:

def capture(re):
   match(re)
   | reduce ( .captures | .[]
              | select(.name != null)
              | { (.name) : .string } )
     as $pair ({}; . + $pair);

Example:
$ jq -c -f <(cat capture.jq; echo 'capture("(?(?.).(?.))")')
"abcd"
{"both":"abc","a":"a","c":"c"}

@nicowilliams
Copy link
Contributor

@pkoppstein I agree that we need better/more iterators. That's no excuse to sully other parts of jq.

@pkoppstein
Copy link
Contributor Author

@nicowilliams - I'm not sure what might have elicited your remark about sullying jq. I was merely saying that, given jq as-it-is, the case for breaking backwards-compatibility in order to change string#split seems less than compelling to me. I hope you were not suggesting that I was proposing some kind of hack along the lines of " ... |
slurp | ..." .

It does, however, seem to me that the key to providing a syntactic alternative to "[S] | ...." is to introduce a new kind of "pipe". Let us therefore assume for the moment that we could use the symbol "|>" for a pipe that slurps its input. This would, for example, allow us to write:

range(0;10) |> map( 2*. )

instead of:

[range(0;10)] | map( 2*. )

Would that work?

@nicowilliams
Copy link
Contributor

@pkoppstein I meant that given support for generators, builtins should preferably generate than produce a single array output. It's easier to collect generated results than to iterate arrays -- in particular there's no way to process arrays in an online manner, but streams can be processed in an online manner.

@pkoppstein
Copy link
Contributor Author

@nicowilliams - This set of extra builtins (#432) also seems to have fallen off you radar screen. I believe the only real sticking point was that the existing split/1 returns an array. You were contemplating changing the existing definition so that it would instead be a generator.

In the interests of moving forward, I would propose one of the following alternatives:

  • split(REGEX|STRING) returns an array, at least for now
  • retain split(STRING) as-is (for now), and add SYNONYM(REGEX|STRING) where SYNONYM could be a verb (splitup, cleave, ...) or perhaps more fittingly for a generator, a plural noun such as segments or substrings.
  • postpone addition of split(REGEX) functionality ;-(

@pkoppstein
Copy link
Contributor Author

@nicowilliams - One other question, and one concern, as I put all the pieces together:

Are you ok with "nwise" by that name as a new public builtin? It could be just a helper subfunction, but it is generic enough. What about "cleave"?

There are two considerations which make me concerned about the regex filters blindly producing streams rather than arrays. In a nutshell:

  • BOUNDARY PROBLEM: For example, if a stream of STRING is presented to a stream-oriented split/1, then the boundaries are unclear, e.g. ("a,b,c", "d,e") | split(","). So one typically has to wrap things up in an array anyway.

@nicowilliams
Copy link
Contributor

@pkoppstein Way, will you be sending a PR? If so I'll wait until you send it.

As to the boundary problem:

$ ./jq -cn '("a,b,c", "d,e") | split(",")'
["a","b","c"]
["d","e"]
$ ./jq -cn '("a,b,c", "d,e") | split(",")[]'
"a"
"b"
"c"
"d"
"e"
$ ./jq -cn '("a,b,c", "d,e") | [split(",")[]]'
["a","b","c"]
["d","e"]
$ 

As you can see, it's easy to work around.

@nicowilliams
Copy link
Contributor

Here's what I had before, rebased to master:

https://github.com/nicowilliams/jq/compare/scan?expand=1

Please review.

@nicowilliams
Copy link
Contributor

That is, I've not done much more than rebase.

@pkoppstein
Copy link
Contributor Author

@nicowillliams wrote:

Here's what I had before, rebased to master:

Thanks. One of the things that I've done is ensure that sub and gsub handle named captures properly (i.e. in the "tostring"). Things are working nicely, e.g.

$ jq -n '"Abcabc" | gsub("(?<x>.)[^a]*"; "+\(.x)-")'
+A-+a-"

As you can see, it's easy to work around.

Yes, I know; that's why I said "So one typically has to wrap things up in an array anyway."

Of course I understand that streams have their place. It's a Very Good Thing that jq can handle a stream of JSON objects, and I like generators (especially those that go from null to N and beyond).

However, for the regex filters, I seem only to be able to see the downsides of the stream-oriented approach. After all, map/1 could have been implemented to produce a stream of mapped items, but it doesn't.

In short, since jq is JSON-oriented, and given the ease and (so far as I can tell) the speed) of [...] and .[], I still don't see the great harm in string-to-array filters, but I'd be happy to be enlightened on this as on many other topics. And I apologize if I've failed to understand the significance of a point you've already made on this topic.

@nicowilliams
Copy link
Contributor

@pkoppstein I'm unmoved on this. I made a mistake with split. For C-coded builtins it's certainly not possible to generate streams without using .[] in a jq-coded wrapper, but that's no excuse for not having a stream-output form. Also, it's always possible to turn an array into a stream, yes, but NOT in an online way, while it's always possible to turn a stream into an array at no extra cost (if you wanted an array, you didn't need online, and you were going to pay the price for collecting the stream's outputs no matter what).

Suppose Oniguruma gave us a streaming interface (maybe it does already; idk). Would it be possible to apply a regex to very large (streaming!) inputs and process the outputs in a streaming fashion? Wouldn't you want to? But if we make the interface always collect, we lose this ability.

The principle is: stream first, we can always collect when you want an array.

@nicowilliams
Copy link
Contributor

@pkoppstein Can you review the scan branch of my clone?

https://github.com/nicowilliams/jq/compare/scan?expand=1

@nicowilliams
Copy link
Contributor

Also, can you add examples for the manual?

@nicowilliams
Copy link
Contributor

@pkoppstein Actually, I'm concerned about take/2: it should take n items from from a's outputs, not from each array output by a.

Something like:

def take(a; n): n as $n |
    foreach a as $a ([];
                     if length == $n then [$a] else .+[$a] end;
                     if length == $n then . else empty end);

This points out that foreach really needs an option for one more closure: one to extract state when the UPDATE closure is done.

This sort of thing has been bothering me about this commit. I'm thinking: leave out take/2 for now.

Also, gsub/2 seems odd to me as written. For one, there's no docs, so I'm not sure how it's supposed to work. For another, the s closure is special but probably shouldn't be (s as $s | ... should be added in). Finally, I think gsub/2 could be written in a more functional style. At this point I've lost track if gsub/2 was yours or mine, so don't take offense if it's mine.

@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

there's no docs

They're all documented in the same commit in a long addition that starts at scan(regex). Perhaps you missed this addition because they're in the regex section? (One advantage of SourceTree is that you can't miss the changes.) I believe the additions are adequate -- I didn't think it was necessary (or desirable) to go into exhaustive detail, partly because that's not the style of manual.yml, but mainly because there's ample relevant documentation about sub and gsub (with captures) out there.

I'll certainly take a look again at these filters in light of your comments, but they are the way they are because of the captures. Actually, I think to write all this in most other languages would be either very long or very convoluted.

@nicowilliams
Copy link
Contributor

Oh, I guess I fell behind your branch.

@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

Oh, I guess I fell behind your branch.

The relevant commit has no "take".

$ git log -1
commit a21e610bd0938b3154a270107b2a0fefcdeeb334
Author: pkoppstein <[email protected]>
Date:   Thu Jul 31 20:32:44 2014 -0400
    regex filters (#432): scan, splits, split, sub, gsub

This has test cases, too (largely borrowed from the ones you had written).

@nicowilliams
Copy link
Contributor

My take was what you called nwise. I don't care about the name though. I do care about outputting streams though, and nwise/2's a should be a stream, not an array. It's easy to collect stream outputs into arrays; the reverse is possible too, of course, but you lose the ability to do things in an online way. Preferably we should also use streams internally (it's like TCO: it matters when the streams get large). We could make nwise private though. And, all that said, a streaming nwise is not that simple (it'd be nice if foreach had an option for a final extract to run on the state after the last iteration is done).

@pkoppstein
Copy link
Contributor Author

Hi, @nicowilliams. I only mentioned take as your reference to it seemed to indicate your critique was of an ancient version of the code somewhere.

I think the best resolution of the issue is, as you suggested, to make it private. That way, when a suitably brilliant implementation of nwise or take becomes available, it will be trivial to incorporate it into builtin.

As I've said before, since you're the Decider and since you know what you want better than I do, it would probably be simplest at this point if you make any changes that you believe are appropriate. I am of course speaking of #522

On the other hand, if you want me to submit another PR, I'd need to know which changes you'd specifically like. Thanks.

@nicowilliams
Copy link
Contributor

@pkoppstein

I pushed your commit too soon. There were problems with the docs. Also, these two tests fail:

[.[] | scan(", ")]
["a,b, c, d, e,f",", a,b, c, d, e,f, "]

[.[] | split(", ")]
["a,b, c, d, e,f",", a,b, c, d, e,f, "]

I'll push a fix soon.

@nicowilliams
Copy link
Contributor

BTW, please always run make check before submitting a PR.

Of course, I should have as well. I always do, but slipped this time.

nicowilliams added a commit that referenced this issue Aug 8, 2014
nicowilliams added a commit that referenced this issue Aug 8, 2014
@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

BTW ...

Of course I always run "make" but, when I ran "make check" once I got this message:

Support on MacOS 10.8/10.9 is experimental and mostly broken.
WARNING: Expect incorrect results, assertions and crashes

So I've assumed that it's not worthwhile. Also, I don't have the ruby environment necessary to "make" the documents. For a previous commit, I believe I made it explicit that I was not vouching for anything other than the correctness of the changes to builtin.c but I should have been explicit on this occasion as well. Sorry about that. But thank you for installing the updates and attending to the imperfections.

@guruprasannaps
Copy link

I want to validate the filename. If the file name contains the \ / : * ? " < > | I should get the result as false. I tried many things but unable to make it work. CVOuld you help me please on this. how I can validate?
test("\ / : * ? " < > |*"; "gx")] is throing an compilation error .Please help me

@pkoppstein
Copy link
Contributor Author

@guruprasannaps - One answer:

jq -R 'test("[\\/:*?\"<>|]") | not'

In short, the regex must be a JSON string.

In future, please ask usage questions at stackoverflow.com with the jq tag: https://stackoverflow.com/questions/tagged/jq

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants