Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse with Missing #61

Open
bkamins opened this issue Dec 1, 2017 · 23 comments
Open

parse with Missing #61

bkamins opened this issue Dec 1, 2017 · 23 comments

Comments

@bkamins
Copy link
Member

bkamins commented Dec 1, 2017

I would like to put under consideration if we want to add methods to parse that accept Union{T, Missing} argument (or create a similar method). This would be something like tryparse but returning missing on failure.

I would think of something like:

  • parse(Union{Int, Missing}, "123") == 123;
  • parse(Union{Int, Missing}, "123a") == missing.
@nalimilan
Copy link
Member

Actually with JuliaLang/julia#23642 tryparse will have return type Union{Some{T}, Void} (or maybe just Union{T, Void} depending on the outcome of discussions). I'm not sure we want to allow parse(Union{T, Missing}, ...) in addition to that. There have been lots of discussions about this kind of syntax around the introduction of tryparse in Base. One issue is that parse(Union{T, Missing}, x) could be syntax to parse "missing" as missing, numbers as Int, and throw and error in other cases.

@bkamins
Copy link
Member Author

bkamins commented Dec 2, 2017

Close this issue then or keep for a reference?

@nalimilan
Copy link
Member

I'd say it depends on whether you think this syntax is needed or not.

@bkamins
Copy link
Member Author

bkamins commented Dec 2, 2017

I believe we need a function that returns missing if parsing fails. It does not have to be prase, actually tryparse is more natural, but now it returns Nullable (which behavior I earlier thought would be retained as computer scientist missingness).

So let me comment on a use case and you can decide which way to go (I have much less understanding of the whole ecosystem context). A typical scenario with DataFrames is to read some data which is messy but is supposed to contain integers. Assume that we already have it as ["1", "2", "?", "", "3", "unknown"].

A pure approach would be to handle the missing data carefully, but a practical workflow (and this is how R works) is to parse it as Vector{Int} and make all that is not an integer a missing.

Computer scientist now can do:

julia> tryparse.(Int, ["1", "2", "?", "", "3", "unknown"])
6-element Array{Nullable{Int64},1}:
 1
 2
 #NULL
 #NULL
 3
 #NULL

But data scientist has to do (or there is some simpler way?):

julia> [isnull(x) ? missing : get(x) for x in tryparse.(Int, ["1", "2", "?", "", "3", "unknown"])]
6-element Array{Any,1}:
 1
 2
  missing
  missing
 3
  missing

which not only is messy but also has wrong type (Any) of the produced array.

I would like to have this fixed (I understand that solving Any problem is an additional topic discussed in JuliaLang/julia#24332).

@nalimilan
Copy link
Member

I see your point. I'm not sure we want to follow R in all aspects, in particular we have the ability to distinguish an invalid value from a missing value, so it wouldn't be absurd to be stricter, and require an explicit operation to convert an unparsable value to a missing value. That could be done quite easily using replace (see JuliaLang/julia#22324). Adding a function as you suggest also makes sense, but I'd rather wait for the design of tryparse to settle first (JuliaLang/julia#23642), we can easily extend it later.

@bkamins
Copy link
Member Author

bkamins commented Dec 2, 2017

Thanks.

@alyst
Copy link

alyst commented Dec 13, 2017

I also don't think parse() should return missing for invalid inputs.

But somewhat more conservative thing would be to allow

parse(Int, missing) = missing
#= or =# 
parse(Union{Int, Missing}, missing) = missing

I prefer the first alternative as the 2nd might imply: parse(Union{Int, Missing}, "missing") == missing.

Then it would be possible to parse.(Int, ::AbstractArray{Union{String, Missing}})

@quinnj
Copy link
Member

quinnj commented Dec 13, 2017

The approach taken in CSV.jl, which @shashi and I have discussed generalizing to parsing functions in general is to have parse(Union{Int, Missing}, str) => Union{Int, Missing} and parse(Int, "missing") => throw(ParsingException()).

@alyst
Copy link

alyst commented Dec 13, 2017

@quinnj Do you also throw ParsingException() on parse(Int, ::Missing)?

@quinnj
Copy link
Member

quinnj commented Dec 13, 2017

I don't currently have a use-case for that in CSV as there's always guaranteed to be an input, though the input may be an empty string.

@nalimilan
Copy link
Member

So you are using missing to indicate that parsing failed? Basically what tryparse does with Nullable() (and will soon do with nothing)? I think it'd prefer using parse(Union{Int, Void}, ...) for that, to distinguish clearly between "invalid" and "missing".

@quinnj
Copy link
Member

quinnj commented Dec 14, 2017

@nalimilan was that in response to my comment or @alyst ?

@nalimilan
Copy link
Member

To yours.

@quinnj
Copy link
Member

quinnj commented Dec 14, 2017

No, the user has to provide, in the case of CSV, a "null" string and if the input matches this "null" string, then missing is returned. In the case of failure, e.g. non-numeric digits when parsing a number, an error is always thrown. That leaves try-catch behavior up to the caller, who would know what to do in case of failure anyway. I supposed you could keep a tryparse function that returned nothing instead of throwing.

@pdeffebach
Copy link
Contributor

A user ran into this with Dates recently, where Dates throws an error on parsing missings.

JuliaLang/julia#28570

I would recommend a keyword argument, but if a constructor relies on parsing, we have to trust that they forward keyword arguments as well.

@nalimilan
Copy link
Member

Since Missing now lives in Base, this feature proposal should be discussed there.

@xpe
Copy link

xpe commented Feb 9, 2019

@nalimilan Right. Has a proposal / discussion started over there? Let's link to it if so.

@bkamins
Copy link
Member Author

bkamins commented Feb 9, 2019

Actually I would close it (although I know I have raised the issue in the first place).

The current meta-understanding of types in Base is the following AFAICT:

  • nothing: the value does not exist
  • missing: the value exists, but we do not know it

I would argue that tryparse follows this rule and returns nothing on error - as the proper value does not exist. And I would keep this approach for consistency. Of course this is a bit debatable as the distinction here is thin.

But maybe there are cases when would be convenient to return missing - if yes we could keep discussing it.

@nalimilan
Copy link
Member

Yes, the case for parsing to Union{T,Missing} seems to be that you want a sentinel value (e.g. "missing", "" or "NA") to be converted to missing. But invalid values should not be converted to missing. I imagine we could have something like parse(Union{T, Missing}, x, missing="missing").

I'm not aware of any issue in Base about this currently.

@bkamins
Copy link
Member Author

bkamins commented Feb 10, 2019

Yes - the sentinel approach could make sense. It could be also more general, e.g. allowing a Regex or a predicate.

@scls19fr
Copy link

Having a parse (or tryparse) function which can output missing will greatly improve workflow with DataFrame. See JuliaData/DataFrames.jl#1854

@bkamins
Copy link
Member Author

bkamins commented Jun 20, 2019

@scls19fr - as you can see from the discussion above the issue is not that simple. What syntax would you think would be the most convenient while consistent (the consistency issue was a major concern in the past).

pepelovesvim added a commit to pepelovesvim/julia that referenced this issue Sep 18, 2020
parse and tryparse to handle missing/nothing, returning missing as result.
pepelovesvim added a commit to pepelovesvim/julia that referenced this issue Sep 18, 2020
Added handling of missing/nothing inputs for parse and tryparse. Returns nothings to stay consistent with parse returns during invalid parses.
@pepelovesvim
Copy link

pepelovesvim commented Sep 18, 2020

I think one should output nothing when using parse/tryparse on missing/nothing data. Since it's tryparse is already outputting nothing on invalid strings. Especially since nothing is suppose to represent "does not exist", so parsing missing/nothing should give a result that "does not exist".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants