Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct syntax for filtering multiple values on a single column #1591

Closed
karudonaldson opened this issue Jun 17, 2024 · 14 comments
Closed

Correct syntax for filtering multiple values on a single column #1591

karudonaldson opened this issue Jun 17, 2024 · 14 comments

Comments

@karudonaldson
Copy link

karudonaldson commented Jun 17, 2024

Wanting to filter multiple values TEST1 and TEST2 on a single column.
My earlier post may answer this one but thought I'd post it separately just in case.

Trying with && but perhaps I'm missing something?
mlr -- filter '$COLUMN1 == "TEST1" && "TEST2"' input.csv > output.csv

@aborruso
Copy link
Contributor

aborruso commented Jun 17, 2024 via email

@johnkerl
Copy link
Owner

@karudonaldson do you want to include rows for which COLUMN1 is either TEST1 or TEST2? If so I would suggest

mlr --csv filter '$COLUMN1 == "TEST1" || $COLUMN1 == "TEST2"' input.csv > output.csv

@karudonaldson
Copy link
Author

karudonaldson commented Jun 17, 2024

@karudonaldson do you want to include rows for which COLUMN1 is either TEST1 or TEST2? If so I would suggest

mlr --csv filter '$COLUMN1 == "TEST1" || $COLUMN1 == "TEST2"' input.csv > output.csv

Awesome, exactly what I was looking for.

Then to clarify further, if I'm combining multiple values on a single column with multiple values on other columns, would this be the correct syntax?

mlr --csv filter '$COLUMN1 == "TEST1" || $COLUMN1 == "TEST2" || $COLUMN3 == "TEST3" && $COLUMN4 == "TEST4"' input.csv > output.csv

@johnkerl
Copy link
Owner

johnkerl commented Jun 17, 2024

Yes, but you need to know about operator precedence if you're going to mix && and ||:
https://miller.readthedocs.io/en/latest/reference-dsl-operators/#operator-precedence

Regardless of any particular programming language's precedence rules, when mixing && and || in any language I suggest using parentheses to disambiguate:

mlr --csv filter '$COLUMN1 == "TEST1" || $COLUMN1 == "TEST2" || ($COLUMN3 == "TEST3" && $COLUMN4 == "TEST4")' input.csv > output.csv

or

mlr --csv filter '($COLUMN1 == "TEST1" || $COLUMN1 == "TEST2" || $COLUMN3 == "TEST3") && $COLUMN4 == "TEST4"' input.csv > output.csv

depending on which of those are what you mean ...

@karudonaldson
Copy link
Author

karudonaldson commented Jun 17, 2024

Would you mind breaking down what the key difference is here for clarity.
To the untrained eye, they both look identical so merely wanting to understand the reasons for using parentheses in each?

@johnkerl
Copy link
Owner

johnkerl commented Jun 17, 2024

@karudonaldson consider the following input:

input.csv:

COLUMN1,COLUMN2,COLUMN3,COLUMN4
NOT,NOT,NOT,TEST4

One way:

mlr --csv --from input.csv filter '$COLUMN1 ==  "TEST1" && $COLUMN2 == "TEST2" && ($COLUMN3 == "TEST3"  || $COLUMN4 == "TEST4")'

This prints no rows. Why? Because AND of things means they all must be true, and, not all of the following are true:

  • $COLUMN1 =="TEST1" -- not true
  • $COLUMN2 == "TEST2" -- not true
  • ($COLUMN3 == "TEST3" || $COLUMN4 == "TEST4") -- true, since COLUMN4 is TEST4

Then parentheses another way:

mlr --csv --from input.csv filter '($COLUMN1 == "TEST1"  && $COLUMN2 == "TEST2" &&  $COLUMN3 == "TEST3") || $COLUMN4 == "TEST4"'

This prints

COLUMN1,COLUMN2,COLUMN3,COLUMN4
NOT,NOT,NOT,TEST4

Why? Because OR of things mean at least one must be true, and one of the following is true:

  • ($COLUMN1 == "TEST1" && $COLUMN2 == "TEST2" && $COLUMN3 == "TEST3") -- these are all false, so the AND of them is false
  • $COLUMN4 == "TEST4" -- this one is true

So, you need to decide what you want your filter expression to test for.

@karudonaldson
Copy link
Author

That makes total sense! Thanks for taking the time to explain that. Now it's clear :)

@karudonaldson
Copy link
Author

karudonaldson commented Jun 18, 2024

Re-opening since I ran into another snag.
Seem to be getting no results with the below example. Is it because the string to filter contains a period?

input.csv

COLUMN1,COLUMN2,COLUMN3,COLUMN4
TEST.1,NOT,NOT,TEST4
TEST.2,NOT,NOT,TEST4
NOT,NOT,NOT,TEST4

mlr --csv filter '$COLUMN1 == "TEST.1" || $COLUMN1 == "TEST.2"' input.csv > output.csv

Expecting:

COLUMN1,COLUMN2,COLUMN3,COLUMN4
TEST.1,NOT,NOT,TEST4
TEST.2,NOT,NOT,TEST4

@karudonaldson karudonaldson reopened this Jun 18, 2024
@johnkerl
Copy link
Owner

johnkerl commented Jun 18, 2024

@karudonaldson here's what I have:

cat input.csv

COLUMN1,COLUMN2,COLUMN3,COLUMN4
TEST.1,NOT,NOT,TEST4
TEST.2,NOT,NOT,TEST4
NOT,NOT,NOT,TEST4

mlr --csv filter '$COLUMN1 == "TEST.1" || $COLUMN1 == "TEST.2"' input.csv

COLUMN1,COLUMN2,COLUMN3,COLUMN4
TEST.1,NOT,NOT,TEST4
TEST.2,NOT,NOT,TEST4

@karudonaldson
Copy link
Author

karudonaldson commented Jun 18, 2024

Ouch I found the problem at my end.

The string "TEST.1" has white space after it!
e.g. TEST.1 ,

I've seen this before in other CSV's and we came to the conclusion it makes sense if we detect the whitespace overwrite the entire cell with the correct string name "TEST.1" then apply the filter. sigh

What is your suggestion for handling this scenario?

I tried filtering "TEST.1*" but that didn't help.

@aborruso
Copy link
Contributor

What is your suggestion for handling this scenario?

Hi, you have a verb to clean this kind of problems:
https://miller.readthedocs.io/en/6.12.0/reference-verbs/index.html#clean-whitespace

@johnkerl
Copy link
Owner

Indeed, and/or, filter TEST\.1.*

@karudonaldson
Copy link
Author

karudonaldson commented Jun 18, 2024

EDIT:
I now include a new process before any rules:
mlr --csv clean-whitespace ./source/input.csv > ./source/output.csv

With the clean new output.csv I run filter rules against it. :)

@karudonaldson
Copy link
Author

All sorted, thanks @johnkerl @aborruso!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants