enhancement request: regex filters #432

pkoppstein · 2014-06-19T19:26:26Z

jq now has basic regex support in the form of match and test (see #431 and #164) but I believe still lacks "builtin" support for the following:

split(REGEX)
sub(REGEX, STRING)
gsub(REGEX, STRING) 
scan(REGEX)

The text was updated successfully, but these errors were encountered:

nicowilliams · 2014-06-19T20:03:07Z

@pkoppstein Agreed!

It should be pretty simple to write jq-coded builtins to do this.

nicowilliams · 2014-06-19T21:28:27Z

To clarify:

split -> return the substrings not matching the regexp (i.e., split on regexp)
sub -> replace the first matching substring with the given string
gsub -> replace all matching substrings with the given string
scan -> output just the matching substrings

wtlangford · 2014-06-19T22:17:20Z

So, I don't think that sub and gsub can be jq-coded, since we can't slice and modify strings.
Same for split.
scan would look something like def scan(reg): match(reg; "g") | .string;

nicowilliams · 2014-06-19T22:27:30Z

@wtlangford You can't modify strings no matter what, and yes, you can slice strings (with .[start:end]). sub should just do something like:

def sub(re; s): . as $s|[match(re)]|.[0]|. as $r|$s|.[0:$r.offset]+s+.[$r.offset+$r.length:-1];

EDIT: removed "g" from call to match (which was wrong anyways since I'd used a comma instead of a semi-colon).

nicowilliams · 2014-06-19T22:45:04Z

@wtlangford

And:

def _stredit(edits; s): if edits|length == 0 then . else (. as $s|(edits|length -1) as $l|(edits[$l]) as $edit| .[0:$edit.offset]+s+.[$edit.offset+$edit.length:] | _stredit(edits[0:$l]; s) ) end;
def gsub(re; s): [match(re;"g")] as $edits|_stredit($edits; s);

nicowilliams · 2014-06-19T22:56:49Z

diff --git a/builtin.c b/builtin.c
index 7b215e6..65ba828 100644
--- a/builtin.c
+++ b/builtin.c
@@ -952,7 +952,9 @@ static const char* const jq_builtins[] = {
   "def test(re; mode): _match_impl(re; mode; true);",
   "def test(val): if val |type == \"string\" then test(val; null) elif val | type == \"array\" and (val | length) > 1 then test(val[0]; val[1]) elif val | type == \"array\" and (val | length > 0) then test(val[0]; null) else error((val | type) + \" not a string or array\") end;",
 //  "def test(re): _match(re; null; 1);",
-  
+  "def sub(re; s): . as $s|[match(re)]|.[0]|. as $r|$s|.[0:$r.offset]+s+.[$r.offset+$r.length:-1];",
+  "def _stredit(edits; s): if edits|length == 0 then . else (. as $s|(edits|length -1) as $l|(edits[$l]) as $edit| .[0:$edit.offset]+s+.[$edit.offset+$edit.length:] | _stredit(edits[0:$l]; s) ) end;",
+  "def gsub(re; s): [match(re;\"g\")] as $edits|_stredit($edits; s);",
 };
 #undef LIBM_DD

wtlangford · 2014-06-19T23:09:05Z

Oh. I don't know where I got no slicing from.

nicowilliams · 2014-06-19T23:13:08Z

def _matches(re): match(re;"g")|[.offset,.length];
def scan(re): . as $s|_matches(re)|$s[.[0]:.[0]+.[1]];

split(re) turns out to be a bit trickier.

EDIT: fix cut-n-paste error.

nicowilliams · 2014-06-19T23:13:34Z

Please bang on these, let me know if they work for you!

pkoppstein · 2014-06-19T23:19:45Z

@nicowilliams wrote:

scan -> output just the matching substrings

That description just covers the basic case. Here's a fuller description (taken from ruby docs) that also covers the case when the regex contains groups:

For each match, a result is generated and ... added to the result array. If the pattern contains no groups, each individual result consists of the matched string ... If the pattern contains groups, each individual result is itself an array containing one entry per group.

nicowilliams · 2014-06-19T23:21:05Z

@pkoppstein Ah, OK. Can I leave this to you and split to you or @wtlangford ?

nicowilliams · 2014-06-19T23:40:33Z

Can you do better than this? It doesn't seem as elegant as it could be... My take ought to be simpler, no? But I do like my split: turn the match lengths to end offsets, make a list of start and end offsets, flatten, prefix 0, append the last position, unique, split into pairs of two, then slice the string. Except it's wrong! If the first match starts at offset 0, this doesn't work.

def take(a; n): if a|length < n then a else a[0:n] , take(a[n:];n) end;
def take(n): take(.;n);
def _matches(re): match(re;"g")|[.offset,.length];
# BUSTED
def split(re): . as $s|[_matches(re)|.[0],.[0]+.[1]]|[0]+.+[$s|length]|take(2)|select(length == 2)|$s[.[0]:.[1]];

pkoppstein · 2014-06-20T04:42:28Z

@nicowilliams, @wtlangford -- I just noticed that "offset" is not being correctly set (according to my understanding) as illustrated by:

$ jq --version
jq-1.4-34-gff5cf0a
$ jq -n -c '"aa" | match("a","g")'
{"offset":0,"length":1,"string":"a","captures":[]}
{"offset":0,"length":1,"string":"a","captures":[]}

pkoppstein · 2014-06-20T05:26:24Z

@nicowilliams - This is a slightly revised/simplified version of what you wrote that behaves as I'd expect:

def nwise(a; n): if a|length <= n then a else a[0:n] , nwise(a[n:];n) end;
def nwise(n): nwise(.;n);

def split(re): . as $s
  | [ match(re;"g") | (.offset, .offset + .length) ]
  | [0] + . +[$s|length]
  | nwise(2)
  | $s[.[0]:.[1] ] ;

Examples:

IN: "0a23a5678a"
"0"
"23"
"5678"
""
IN: "a23a5678"
""
"23"
"5678"

pkoppstein · 2014-06-20T06:28:20Z

Apropos scan, @nicowilliams asked:

Can I leave this to you

I have a version ready for testing once match is tamed.

nicowilliams · 2014-06-20T17:10:45Z

@pkoppstein Try s/,/;/ :)

$ # WRONG
$ jq -n -c '"aa" | match("a","g")'
$ # RIGHT
$ jq -n -c '"aa" | match("a";"g")'

Yeah, the need to use semi-colons to separate arguments lists kills me too sometimes...

$  ./jq -c 'def f(a): [.,a]; f(",","g")'
"a,,b,c"
["a,,b,c",",","g"]

jq does the right thing with commas for arrays. It should have as well for arguments lists. It's too late to fix it.

nicowilliams · 2014-06-20T17:34:06Z

@pkoppstein Actually, my split wasn't busted:

$ jq 'split(", ")'
", a,b, c, d, e,f, "                             
""
"a,b"
"c"
"d"
"e,f"
""

I just hadn't tested that nor thought it through.

Those empty strings aren't wrong either. They're quite correct I'd say.

Any reason not to call take take?

nicowilliams · 2014-06-20T17:40:26Z

Alright, I have implementations of all four utility functions requested. Plus take.

I just need to write docs and tests and I'll push.

pkoppstein · 2014-06-20T18:19:10Z

@nicowilliams wrote:

Try s/,/;/ :)

Indeed. Thanks. In my defense (?), it was late :-(

Here's my now-tested candidate for scan:

def scan(re): 
  [ match(re; "g")
    |  if (.captures|length > 0)
      then [ .captures | .[] | .string ]
      else .string
      end  ];

Any reason not to call take take?

Mainly because

we say "pairwise" when n==2; some might say "n-wise" for n>2
take(n) often means "take n, and leave the rest" (e.g. in ruby and perhaps English)
take/drop are often paired

pkoppstein · 2014-06-20T18:54:11Z

p.s. Please don't forget split/0, e.g.

def split: split("\\s+");

nicowilliams · 2014-06-20T22:26:34Z

@pkoppstein @wtlangford Ooops, the new split (which shipped in 1.4) conflicts with the old split. The new one outputs multiple results; the old one (being C-coded) outputs an array of all results.

I'm not going to break compatibility. And thinking about it I think I'd like to have a split builtin that doesn't use regexps anyways: for performance reasons. I'm leaning towards renaming the RE-based split to something else. Suggestions?

nicowilliams · 2014-06-20T22:32:48Z

As to take/nwise, in retrospect the obvious thing to do is to use split(n) to split an array into n-sized arrays.

I'm rather bummed out about split. I'm actually considering breaking compat, actually -- it's a horrible mistake!

pkoppstein · 2014-06-21T03:21:32Z

@nicowilliams wrote:

I'm rather bummed out about split. I'm actually considering breaking compat, actually -- it's a horrible mistake!

There must be a lesson here somewhere :-)

I don't understand why you think that string#split/1 returning an array is a mistake -- that's what every other comparable split/1 does; also .[] is easy enough to read and write, and [ .... ] is often quite awkward to read. But I suppose that goes to show my jq enlightenment score is very low.

However, if split/1 is to change, then one way out would be to take the view that the introduction of regex is a big enough leap to warrant a jump to the next "major version" (on the theory that "breaking backwards compatibility on minor versions is evil" but not otherwise). Would that require jumping to 2.0?

Using the name "split" instead of nwise/take is fine by me. An alternative to consider would be group(n).

nicowilliams · 2014-06-21T04:19:01Z

@pkoppstein jq is a language where expressions generate results. like Python iterators with yield or Icon generators with suspend. In such a language it is preferable that primitives generate multiple values rather than a single array of all of them. The latter is more wasteful than the former. I'm sure you've noticed this, now you just have to accept its goodness ;)

pkoppstein · 2014-06-21T05:46:02Z

@nicwilliams -- Maybe I'd be more easily persuaded if jq had a stream-oriented version of reduce.

For example, if we have a long pipeline, S, that produces a stream of JSON objects and wish to add these objects using +, we have either to use the form reduce S .... (which is hard to read when S is complex) or to collect the objects as an array and then use .[] to "uncollect" them again, i.e. [S] | reduce .[] as $x (...) rather than S | reduceStream . as $x (...).

Here's an example motivated by "named captures":

# return an object with the names of the named captures as keys
def capture(re):
  match(re)
   | [ .captures
       | .[] 
       | select(.name != null)
       | { (.name) : .string } ]
   | reduce .[] as $pair ({}; . + $pair);

Example:

"xyz" | capture("(?<a>.).(?<c>.)") as $vars | [ $vars.a, $vars.c ]
 # => [ "x", "z" ]

pkoppstein · 2014-06-21T13:55:43Z

By the way, I'd like to propose adding capture/1 as a builtin, so here's the "reduce S" version suitable for builtin.c:

def capture(re):
   match(re)
   | reduce ( .captures | .[]
              | select(.name != null)
              | { (.name) : .string } )
     as $pair ({}; . + $pair);

Example:
$ jq -c -f <(cat capture.jq; echo 'capture("(?(?.).(?.))")')
"abcd"
{"both":"abc","a":"a","c":"c"}

nicowilliams · 2014-06-21T23:29:59Z

@pkoppstein I agree that we need better/more iterators. That's no excuse to sully other parts of jq.

pkoppstein · 2014-06-22T05:38:25Z

@nicowilliams - I'm not sure what might have elicited your remark about sullying jq. I was merely saying that, given jq as-it-is, the case for breaking backwards-compatibility in order to change string#split seems less than compelling to me. I hope you were not suggesting that I was proposing some kind of hack along the lines of " ... |
slurp | ..." .

It does, however, seem to me that the key to providing a syntactic alternative to "[S] | ...." is to introduce a new kind of "pipe". Let us therefore assume for the moment that we could use the symbol "|>" for a pipe that slurps its input. This would, for example, allow us to write:

range(0;10) |> map( 2*. )

instead of:

[range(0;10)] | map( 2*. )

Would that work?

nicowilliams · 2014-06-24T05:52:08Z

@pkoppstein I meant that given support for generators, builtins should preferably generate than produce a single array output. It's easier to collect generated results than to iterate arrays -- in particular there's no way to process arrays in an online manner, but streams can be processed in an online manner.

pkoppstein · 2014-07-28T06:09:09Z

@nicowilliams - This set of extra builtins (#432) also seems to have fallen off you radar screen. I believe the only real sticking point was that the existing split/1 returns an array. You were contemplating changing the existing definition so that it would instead be a generator.

In the interests of moving forward, I would propose one of the following alternatives:

split(REGEX|STRING) returns an array, at least for now
retain split(STRING) as-is (for now), and add SYNONYM(REGEX|STRING) where SYNONYM could be a verb (splitup, cleave, ...) or perhaps more fittingly for a generator, a plural noun such as segments or substrings.
postpone addition of split(REGEX) functionality ;-(

pkoppstein · 2014-07-31T03:06:32Z

@nicowilliams - One other question, and one concern, as I put all the pieces together:

Are you ok with "nwise" by that name as a new public builtin? It could be just a helper subfunction, but it is generic enough. What about "cleave"?

There are two considerations which make me concerned about the regex filters blindly producing streams rather than arrays. In a nutshell:

BOUNDARY PROBLEM: For example, if a stream of STRING is presented to a stream-oriented split/1, then the boundaries are unclear, e.g. ("a,b,c", "d,e") | split(","). So one typically has to wrap things up in an array anyway.

nicowilliams · 2014-07-31T03:52:03Z

@pkoppstein Way, will you be sending a PR? If so I'll wait until you send it.

As to the boundary problem:

$ ./jq -cn '("a,b,c", "d,e") | split(",")'
["a","b","c"]
["d","e"]
$ ./jq -cn '("a,b,c", "d,e") | split(",")[]'
"a"
"b"
"c"
"d"
"e"
$ ./jq -cn '("a,b,c", "d,e") | [split(",")[]]'
["a","b","c"]
["d","e"]
$

As you can see, it's easy to work around.

nicowilliams · 2014-07-31T04:06:46Z

Here's what I had before, rebased to master:

https://github.com/nicowilliams/jq/compare/scan?expand=1

Please review.

nicowilliams · 2014-07-31T04:07:08Z

That is, I've not done much more than rebase.

pkoppstein · 2014-07-31T05:16:21Z

@nicowillliams wrote:

Here's what I had before, rebased to master:

Thanks. One of the things that I've done is ensure that sub and gsub handle named captures properly (i.e. in the "tostring"). Things are working nicely, e.g.

$ jq -n '"Abcabc" | gsub("(?<x>.)[^a]*"; "+\(.x)-")'
+A-+a-"

As you can see, it's easy to work around.

Yes, I know; that's why I said "So one typically has to wrap things up in an array anyway."

Of course I understand that streams have their place. It's a Very Good Thing that jq can handle a stream of JSON objects, and I like generators (especially those that go from null to N and beyond).

However, for the regex filters, I seem only to be able to see the downsides of the stream-oriented approach. After all, map/1 could have been implemented to produce a stream of mapped items, but it doesn't.

In short, since jq is JSON-oriented, and given the ease and (so far as I can tell) the speed) of [...] and .[], I still don't see the great harm in string-to-array filters, but I'd be happy to be enlightened on this as on many other topics. And I apologize if I've failed to understand the significance of a point you've already made on this topic.

nicowilliams · 2014-07-31T16:04:22Z

@pkoppstein I'm unmoved on this. I made a mistake with split. For C-coded builtins it's certainly not possible to generate streams without using .[] in a jq-coded wrapper, but that's no excuse for not having a stream-output form. Also, it's always possible to turn an array into a stream, yes, but NOT in an online way, while it's always possible to turn a stream into an array at no extra cost (if you wanted an array, you didn't need online, and you were going to pay the price for collecting the stream's outputs no matter what).

Suppose Oniguruma gave us a streaming interface (maybe it does already; idk). Would it be possible to apply a regex to very large (streaming!) inputs and process the outputs in a streaming fashion? Wouldn't you want to? But if we make the interface always collect, we lose this ability.

The principle is: stream first, we can always collect when you want an array.

nicowilliams · 2014-08-06T23:12:12Z

@pkoppstein Can you review the scan branch of my clone?

https://github.com/nicowilliams/jq/compare/scan?expand=1

nicowilliams · 2014-08-06T23:12:29Z

Also, can you add examples for the manual?

nicowilliams · 2014-08-07T01:27:03Z

@pkoppstein Actually, I'm concerned about take/2: it should take n items from from a's outputs, not from each array output by a.

Something like:

def take(a; n): n as $n |
    foreach a as $a ([];
                     if length == $n then [$a] else .+[$a] end;
                     if length == $n then . else empty end);

This points out that foreach really needs an option for one more closure: one to extract state when the UPDATE closure is done.

This sort of thing has been bothering me about this commit. I'm thinking: leave out take/2 for now.

Also, gsub/2 seems odd to me as written. For one, there's no docs, so I'm not sure how it's supposed to work. For another, the s closure is special but probably shouldn't be (s as $s | ... should be added in). Finally, I think gsub/2 could be written in a more functional style. At this point I've lost track if gsub/2 was yours or mine, so don't take offense if it's mine.

pkoppstein · 2014-08-07T01:43:56Z

@nicowilliams wrote:

there's no docs

They're all documented in the same commit in a long addition that starts at scan(regex). Perhaps you missed this addition because they're in the regex section? (One advantage of SourceTree is that you can't miss the changes.) I believe the additions are adequate -- I didn't think it was necessary (or desirable) to go into exhaustive detail, partly because that's not the style of manual.yml, but mainly because there's ample relevant documentation about sub and gsub (with captures) out there.

I'll certainly take a look again at these filters in light of your comments, but they are the way they are because of the captures. Actually, I think to write all this in most other languages would be either very long or very convoluted.

nicowilliams · 2014-08-07T01:51:09Z

Oh, I guess I fell behind your branch.

pkoppstein · 2014-08-07T02:08:09Z

@nicowilliams wrote:

Oh, I guess I fell behind your branch.

The relevant commit has no "take".

$ git log -1
commit a21e610bd0938b3154a270107b2a0fefcdeeb334
Author: pkoppstein <[email protected]>
Date:   Thu Jul 31 20:32:44 2014 -0400
    regex filters (#432): scan, splits, split, sub, gsub

This has test cases, too (largely borrowed from the ones you had written).

nicowilliams · 2014-08-07T05:38:48Z

My take was what you called nwise. I don't care about the name though. I do care about outputting streams though, and nwise/2's a should be a stream, not an array. It's easy to collect stream outputs into arrays; the reverse is possible too, of course, but you lose the ability to do things in an online way. Preferably we should also use streams internally (it's like TCO: it matters when the streams get large). We could make nwise private though. And, all that said, a streaming nwise is not that simple (it'd be nice if foreach had an option for a final extract to run on the state after the last iteration is done).

pkoppstein · 2014-08-07T06:09:27Z

Hi, @nicowilliams. I only mentioned take as your reference to it seemed to indicate your critique was of an ancient version of the code somewhere.

I think the best resolution of the issue is, as you suggested, to make it private. That way, when a suitably brilliant implementation of nwise or take becomes available, it will be trivial to incorporate it into builtin.

As I've said before, since you're the Decider and since you know what you want better than I do, it would probably be simplest at this point if you make any changes that you believe are appropriate. I am of course speaking of #522

On the other hand, if you want me to submit another PR, I'd need to know which changes you'd specifically like. Thanks.

nicowilliams · 2014-08-08T22:59:43Z

@pkoppstein

I pushed your commit too soon. There were problems with the docs. Also, these two tests fail:

[.[] | scan(", ")]
["a,b, c, d, e,f",", a,b, c, d, e,f, "]

[.[] | split(", ")]
["a,b, c, d, e,f",", a,b, c, d, e,f, "]

I'll push a fix soon.

nicowilliams · 2014-08-08T23:16:44Z

BTW, please always run make check before submitting a PR.

Of course, I should have as well. I always do, but slipped this time.

pkoppstein · 2014-08-09T02:31:56Z

@nicowilliams wrote:

BTW ...

Of course I always run "make" but, when I ran "make check" once I got this message:

Support on MacOS 10.8/10.9 is experimental and mostly broken.
WARNING: Expect incorrect results, assertions and crashes

So I've assumed that it's not worthwhile. Also, I don't have the ruby environment necessary to "make" the documents. For a previous commit, I believe I made it explicit that I was not vouching for anything other than the correctness of the changes to builtin.c but I should have been explicit on this occasion as well. Sorry about that. But thank you for installing the updates and attending to the imperfections.

guruprasannaps · 2019-06-10T15:26:42Z

I want to validate the filename. If the file name contains the \ / : * ? " < > | I should get the result as false. I tried many things but unable to make it work. CVOuld you help me please on this. how I can validate?
test("\ / : * ? " < > |*"; "gx")] is throing an compilation error .Please help me

pkoppstein · 2019-06-10T16:28:49Z

@guruprasannaps - One answer:

jq -R 'test("[\\/:*?\"<>|]") | not'

In short, the regex must be a JSON string.

In future, please ask usage questions at stackoverflow.com with the jq tag: https://stackoverflow.com/questions/tagged/jq

Thank you.

nicowilliams added the feature request label Jun 21, 2014

This was referenced Jul 29, 2014

Add capture, and document regular expression filters. #493

Closed

.foo[-1] = value syntax desired for appending to arrays #489

Closed

pkoppstein mentioned this issue Jul 31, 2014

array optimizations for sort/unique #517

Closed

nicowilliams pushed a commit that referenced this issue Aug 8, 2014

regex filters (#432): scan, splits, split, sub, gsub

a696c6b

nicowilliams closed this as completed Aug 8, 2014

nicowilliams added a commit that referenced this issue Aug 8, 2014

Fix docs YAML (introduced in #432)

eb12f03

nicowilliams added a commit that referenced this issue Aug 8, 2014

Fix tests for #432

fc52e78

This was referenced Jun 10, 2019

Added regex support as per issue #164. #431

Closed

regex filters (#432): scan, splits, split, sub, gsub #522

Closed

jqlang deleted a comment from guruprasannaps Jun 11, 2019

enhancement request: regex filters #432

enhancement request: regex filters #432

Comments

pkoppstein commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

wtlangford commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

wtlangford commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

pkoppstein commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

nicowilliams commented Jun 19, 2014

pkoppstein commented Jun 20, 2014

pkoppstein commented Jun 20, 2014

pkoppstein commented Jun 20, 2014

nicowilliams commented Jun 20, 2014

nicowilliams commented Jun 20, 2014

nicowilliams commented Jun 20, 2014

pkoppstein commented Jun 20, 2014

pkoppstein commented Jun 20, 2014

nicowilliams commented Jun 20, 2014

nicowilliams commented Jun 20, 2014

pkoppstein commented Jun 21, 2014

nicowilliams commented Jun 21, 2014

pkoppstein commented Jun 21, 2014

pkoppstein commented Jun 21, 2014

nicowilliams commented Jun 21, 2014

pkoppstein commented Jun 22, 2014

nicowilliams commented Jun 24, 2014

pkoppstein commented Jul 28, 2014

pkoppstein commented Jul 31, 2014

nicowilliams commented Jul 31, 2014

nicowilliams commented Jul 31, 2014

nicowilliams commented Jul 31, 2014

pkoppstein commented Jul 31, 2014

nicowilliams commented Jul 31, 2014

nicowilliams commented Aug 6, 2014

nicowilliams commented Aug 6, 2014

nicowilliams commented Aug 7, 2014

pkoppstein commented Aug 7, 2014

nicowilliams commented Aug 7, 2014

pkoppstein commented Aug 7, 2014

nicowilliams commented Aug 7, 2014

pkoppstein commented Aug 7, 2014

nicowilliams commented Aug 8, 2014

nicowilliams commented Aug 8, 2014

pkoppstein commented Aug 9, 2014

guruprasannaps commented Jun 10, 2019

pkoppstein commented Jun 10, 2019