Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correctly parse all integer types. fixes #9289 #9316

Closed
wants to merge 1 commit into from

Conversation

tanmaykm
Copy link
Member

Used parseint and parsefloat methods instead of float64_isvalid.

end
end
end
end
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes. That can't be efficient :-\

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like it either. But couldn't find a better way. Using parse was worse.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it is only when an abstract Number type is specified to readdlm. Other cases perform as before.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is our fault for not having a good API for this. Given when you have to work with, this is the only way. But it's still terrible :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last try...end does not have a catch. Is there a reason to use try...end for that case instead of just val = parseint(BigInt, sval)?

P.S. I vaguely recall something like this showing up in a C++ benchmark, which caused compiler implementers to come up with crazy optimizations for this kind of practice, even though the practice was nominally discouraged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArchRobison that is to let it fall through and try for bool or just set a string, when T is Any or Number.

@JeffBezanson Yes, I think we can just limit to Int. Also, won't the specialized methods be able to optimize away conditions that don't apply? Which will then be better than using Union.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method will be specialized anyway; it's just more efficient not to have 5 copies of basically the same source code. Using @eval forces all 5 copies to exist; relying on specialization only generates new code for cases that are actually used. Also @eval and $ are just ugly.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unfortunate, since stack unwinding is so incredibly expensive on windows and we always create the stack trace on an error

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will probably be useful to have versions of parseint and parsefloat that indicate failure instead of throwing an exception.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to failures as return variables rather than exceptions.

As part of the String redesign, I'd like to see functions like those in https://github.com/johnmyleswhite/CSVReaders.jl/blob/master/src/05_parsers.jl get moved into Base.

Used parseint instead of float64_isvalid.
@tanmaykm
Copy link
Member Author

Modified the PR:

  • no try-catch if concrete types are specified and only one (for parseint) otherwise
  • removed support for specifying abstract numeric types. They would probably be not be of any practical use anyway
  • also not relying on return value of the try-end construct

Some performance comparisons with a 3000000x6 CSV of Ints:

Master:

@time d = readcsv("test.csv", Any);
elapsed time: 21.261691912 seconds (3379880904 bytes allocated, 78.64% gc time)

@time d = readcsv("test.csv", Int);
elapsed time: 3.682649125 seconds (2473946568 bytes allocated, 10.52% gc time)

@time d = readcsv("test.csv", Float64);
elapsed time: 5.819523024 seconds (2473946568 bytes allocated, 44.50% gc time)

Branch:

julia> @time d = readcsv("test.csv", Any);
elapsed time: 7.013307727 seconds (2792678272 bytes allocated, 46.28% gc time)

julia> @time d = readcsv("test.csv", Int);
elapsed time: 3.993296468 seconds (2473946496 bytes allocated, 9.67% gc time)

julia> @time d = readcsv("test.csv", Float64);
elapsed time: 3.930931881 seconds (2775887328 bytes allocated, 13.01% gc time)

@tanmaykm
Copy link
Member Author

Shall merge this tomorrow if there are no objections and is still open.

@ViralBShah
Copy link
Member

Nice speed bumps. Any idea why the second test is slower? Also, there was a recent try-end return value fix. Not sure if that is relevant here.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Dec 18, 2014

i'm mildly concerned this may have a major speed regression on windows do to the dependence on try/catch. do you happen to have a machine you could test on for comparison?

i would instead suggest:

if sval == "true"
    cells[row,col] = true
elseif sval == "false"
    cells[row,col] = false
elseif (sval[1] == '-' && isdigit(SubString(sval,2))) || isdigit(sval) #xxx: need to handle 0x, 0b, 0o here also
    cells[row,col] = parseint(sval)
elseif float64_isvalid(sval, tmp64)
    cells[row,col] = tmp64[1]
else
    return true
end
return false

or, just rewrite float64_isvalid to return a Nullable{Float64} and get rid of this silly nonsense about needing a output argument for a function because it can only return one value.

@tanmaykm
Copy link
Member Author

@ViralBShah I can see the second test result vary between 3.9 to 4.2, so seems to be within the margin. The recent try-end return value fix won't affect this as this does not rely on the return value any more.

@vtjnash Will try the options suggested. I tried to avoid too many string comparisons, but probably that will be faster than the try-catch.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Dec 18, 2014

a six character ascii-string compare is going to be faster than just about anything else you could attempt (plus, it will almost always short circuit after the first character is not matched in t/f). the main issue with my suggestion is defining the is_parsable_as_int functionality. i'm actually more interested in the Nullable rewrite, since that also gets rid of the global variable.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Dec 18, 2014

also, it's worth pointing out that the time to capture a stack trace is going to be proportional to the size of the stack. benchmark code is typically going to have a much shorter call stack than a real function.

(full disclosure: I'm currently testing stack overflow code on windows ed5a8fc via f()=f(); @time try f() end, and it takes a good full minute on my slow atom processor to unwind that stack trace, vs 0.25s on my speedier Xeon linux and i7 mac machines)

@tanmaykm
Copy link
Member Author

Ouch. Should definitely avoid the try-catch.

@JeffBezanson
Copy link
Sponsor Member

Returning a Nullable from float64_isvalid would be great.

@johnmyleswhite
Copy link
Member

I like the idea of using Nullable, but have come to think that the parsing code may need to throw numbered errors so that we can distinguish between failure cases. For integers, you at least want to distinguish: (1) is an integer, but caused overflow, (2) byte sequence is a number, but is a floating point number, and (3) byte sequence contains non-numeric characters and is actually a non-numeric string.

@tanmaykm
Copy link
Member Author

Here's a draft implementation of float conversion methods returning Nullable instead of requiring an output parameter - https://github.com/tanmaykm/julia/compare/maybefloat (diff).

The Nullable version seems to have a larger gc footprint. I hope I have done it correctly. @vtjnash @JeffBezanson can you please take a look at it?

julia> function time_float64(N::Int64, s::AbstractString)
           println("maybefloat64:")
           gc()
           @time for i in 1:N
               f1 = Base.maybefloat64(s)
           end

           println("float64:")
           gc()
           @time for i in 1:N
               f2 = float64(s)
           end
       end
time_float64 (generic function with 1 method)

julia> time_float64(10^8, "23423432423.234234324324")
maybefloat64:
elapsed time: 19.220756599 seconds (4000000000 bytes allocated, 7.71% gc time)
float64:
elapsed time: 16.762540432 seconds (1600000000 bytes allocated, 3.58% gc time)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Dec 20, 2014

doing the work in C eliminates the ability to perform the crucial optimizations that make this fast. instead, use ccall's ability to interpret output struct parameters and add a C representation of Julia's Nullable{Float64} type:

typedef struct {
    uint8_t isnull;
    double value;
} nullable_float64;
nullable_float64 maybefloat64(char *s) {
   nullable_float64 ret = {1, strtod(s));
   return ret;
}
ccall(:maybefloat64, Nullable{Float64}, (Ptr{Uint8},), s)

@tanmaykm
Copy link
Member Author

Thanks! With that change:

julia> time_float32(10^8, "23423432423.234234324324")
maybefloat32:
elapsed time: 8.332165974 seconds (1600000000 bytes allocated, 7.56% gc time)
float32:
elapsed time: 8.755044282 seconds (1600000000 bytes allocated, 6.64% gc time)

julia> time_float64(10^8, "23423432423.234234324324")
maybefloat64:
elapsed time: 16.754755825 seconds (2400000000 bytes allocated, 5.79% gc time)
float64:
elapsed time: 16.434358898 seconds (1600000000 bytes allocated, 3.59% gc time)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Dec 20, 2014

that's good. now my other remaining question would be: why isn't the allocation 0 bytes?

@tanmaykm
Copy link
Member Author

This seems to fix the extra allocation tanmaykm@94ee10b, @vtjnash please verify.

Now:

julia> time_float64(10^8, "23423432423.234234324324");
maybefloat64:
elapsed time: 3.64e-6 seconds (0 bytes allocated)
float64:
elapsed time: 4.068e-6 seconds (16 bytes allocated)

@vtjnash
Copy link
Sponsor Member

vtjnash commented Dec 22, 2014

wow, yes, good find! that code largely predated immutable types. that missing condition seems to have been an oversight.

any idea why that affected float64 also?

@tanmaykm
Copy link
Member Author

Oops sorry, that was for a single loop (I had commented out the loop for some testing in my repl). Here's the correct result for 10^8 loops:

julia> time_float64(10^8, "23423432423.234234324324");
maybefloat64:
elapsed time: 15.228390363 seconds (0 bytes allocated)
float64:
elapsed time: 16.982878151 seconds (1600000000 bytes allocated, 3.47% gc time)

tanmaykm added a commit to tanmaykm/julia that referenced this pull request Dec 29, 2014
Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception:
- `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)`
- `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)`

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Dec 29, 2014
Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception:
- `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)`
- `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)`

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Dec 29, 2014
Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception:
- `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)`
- `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)`

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 11, 2015
Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception:
- `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)`
- `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)`

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 11, 2015
Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception:
- `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)`
- `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)`

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 11, 2015
Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception:
- `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)`
- `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)`

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 12, 2015
Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception:
- `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)`
- `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)`

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 13, 2015
Introduces following methods that parse a string as the indicated type and return a `Nullable` with the result instead of throwing exception:
- `maybeint{T<:Integer}(::Type{T<:Integer},s::AbstractString)`
- `maybefloat32(s::AbstractString)` and `maybefloat64(s::AbstractString)`

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 17, 2015
Introduces the tryparse method:
- tryparse{T<:Integer}(::Type{T<:Integer},s::AbstractString)
- tryparse(::Type{Float..},s::AbstractString)
- a few variants of the above

And:
- tryparse(Float.., ...) call the corresponding C functions jl_try_strtof, jl_try_substrtof, jl_try_strtod and jl_try_substrtod.
- The parseint, parsefloat, float64_isvalid and float32_isvalid methods wrap the corresponding tryparse methods.
- The jl_strtod, jl_strtof, ... functions are wrappers over the jl_try_str... functions.

This should fix JuliaLang#10498 as well.

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 17, 2015
Introduces the tryparse method:
- tryparse{T<:Integer}(::Type{T<:Integer},s::AbstractString)
- tryparse(::Type{Float..},s::AbstractString)
- a few variants of the above

And:
- tryparse(Float.., ...) call the corresponding C functions jl_try_strtof, jl_try_substrtof, jl_try_strtod and jl_try_substrtod.
- The parseint, parsefloat, float64_isvalid and float32_isvalid methods wrap the corresponding tryparse methods.
- The jl_strtod, jl_strtof, ... functions are wrappers over the jl_try_str... functions.

This should fix JuliaLang#10498 as well.

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Mar 17, 2015
Introduces the tryparse method:
- tryparse{T<:Integer}(::Type{T<:Integer},s::AbstractString)
- tryparse(::Type{Float..},s::AbstractString)
- a few variants of the above

And:
- tryparse(Float.., ...) call the corresponding C functions jl_try_strtof, jl_try_substrtof, jl_try_strtod and jl_try_substrtod.
- The parseint, parsefloat, float64_isvalid and float32_isvalid methods wrap the corresponding tryparse methods.
- The jl_strtod, jl_strtof, ... functions are wrappers over the jl_try_str... functions.

This should fix JuliaLang#10498 as well.

Ref: discussions at JuliaLang#9316, JuliaLang#3631, JuliaLang#5704
@tanmaykm
Copy link
Member Author

Closing. Will resubmit from a different branch.

@tanmaykm
Copy link
Member Author

Continued in #10560

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants