Use the TOML stdlib in code loading #36018

KristofferC · 2020-05-24T21:16:24Z

The current way TOML files are parsed in Base during code loading has a few issues:

The TOML "parser" in Base does pattern matching to find exactly the information that code loading currently needs. This makes it hard to extend, since every new feature needs a new parsing routine. We might want to add new features (e.g. sub-projects) that require changes in code loading, and the current parser gets in the way of that.
The TOML parser in Base is "stateless", so every query starts parsing the file from scratch. This means that in some cases it re-parses the same file many times unnecessarily (Code loading might be better just fully parsing TOML files #27414 (comment)).
The TOML parsers used by code loading and Pkg are different, which means that they support different parts of the spec (AFAIU, the Base parser doesn't support inline tables, for example).
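To make the inline-table point concrete, here is a minimal sketch of what a spec-compliant parser returns for a project file that writes `deps` as an inline table. It assumes Julia 1.6 or later, where a `TOML` stdlib is available; the file contents are made up for illustration.

```julia
using TOML  # assumption: Julia >= 1.6, where the TOML stdlib exists

# Illustrative project file with `deps` written as an inline table.
project = TOML.parse("""
name = "Example"
deps = { Dates = "ade2ca70-3891-5945-98fb-dc099432e06a" }
""")

# The inline table is parsed into an ordinary nested dictionary.
deps = project["deps"]
```

A pattern-matching reader that only expects a `[deps]` section header followed by `name = "uuid"` lines would not see this entry at all.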

Therefore, it would make sense to try:

Make Base use a proper TOML parser that implements the spec.
Make Pkg and Base share this TOML parser implementation.
Eventually make the TOML parser a fully supported stdlib. TOML files are a big part of packages in Julia and it makes sense to me to have an easy option to parse these since they are needed for a lot of the ecosystem.

This PR starts this process by putting a newly written TOML parser into Base and re-implementing code loading to take advantage of it. The parser is available as a package with a proper API, docs, tests, etc. (https://github.com/KristofferC/TOMLX.jl), but the only thing this PR adds is the parser part (with a more low-level API), since that is what is needed in Base and I wanted this PR to have as little "noise" as possible.
It has a significant number of tests (https://github.com/KristofferC/TOMLX.jl/blob/master/test/readme.jl, https://github.com/KristofferC/TOMLX.jl/blob/master/test/toml_test.jl) and I have benchmarked it pretty thoroughly.

For this PR, I think it makes sense to mostly comment on the code loading parts (@StefanKarpinski for review). For the TOML parsing itself, perhaps it makes sense to keep that discussion in the https://github.com/KristofferC/TOMLX.jl/ repo?

After this step is done, we update Pkg to start using the parser here for its parsing.
Then we can try to finally make a TOML stdlib and have Pkg depend on that. This will remove the last bundled dependency for Pkg.

There is a TODO in this PR: I currently use a global cache object to store the filename => Dict mapping in code loading, which gets emptied in Base.require. This should be moved to a function-local cache that gets created in Base.require and threaded through all functions. I wanted to put up this PR anyway so people get a chance to review things as soon as possible.
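A rough sketch of the function-local cache idea follows. This is not the code in the PR: `ParsedTOMLCache`, `parsed_toml`, `project_name` and `project_uuid` are hypothetical names, and `TOML.parsefile` assumes Julia 1.6 or later with the `TOML` stdlib.

```julia
using TOML  # assumption: Julia >= 1.6

# Hypothetical cache type: maps a file path to its fully parsed contents.
struct ParsedTOMLCache
    d::Dict{String,Dict{String,Any}}
end
ParsedTOMLCache() = ParsedTOMLCache(Dict{String,Dict{String,Any}}())

# Parse the file on first request only; later queries reuse the stored Dict.
parsed_toml(cache::ParsedTOMLCache, path::AbstractString) =
    get!(() -> TOML.parsefile(path), cache.d, String(path))

# Each query takes the cache as an argument instead of reading a global.
project_name(cache::ParsedTOMLCache, file::AbstractString) =
    get(parsed_toml(cache, file), "name", nothing)
project_uuid(cache::ParsedTOMLCache, file::AbstractString) =
    get(parsed_toml(cache, file), "uuid", nothing)

# Usage: one cache per top-level operation, threaded through every query.
path, io = mktemp()
write(io, "name = \"Example\"\nuuid = \"7876af07-990d-54b4-ab0e-23690620f79a\"\n")
close(io)

cache = ParsedTOMLCache()
name = project_name(cache, path)
uuid = project_uuid(cache, path)  # served from the cache, no second parse
```

Because the cache is created by the caller and dropped when the operation finishes, there is nothing global to empty afterwards.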

One possible question is why this parser is not just the one that currently lives in the ext folder in Pkg. The reason is that the ext parser (forked from https://github.com/JuliaLang/TOML.jl) was written quite a long time ago for an old version of the TOML spec and requires some significant updates. I was also a bit unhappy with its performance (it allocates scratch buffers a bit too much, makes copies of strings, etc.), so I felt a clean implementation might lead to a better end result.

Fixes #27414.

timholy · 2020-05-24T23:21:42Z

Does this make Pkg operations faster? Thinking of #27414 (comment)

KristofferC · 2020-05-25T07:37:00Z

Does this make Pkg operations faster?

If you mean loading packages, then I don't think it was bottlenecked on reading the TOML files, even though it did it quite inefficiently.
If you mean stuff like Pkg.add then TOML parsing is significant but not crucial for performance (anymore). I've already fixed the biggest performance problems with the TOML parser in Pkg (JuliaLang/Pkg.jl#113, JuliaLang/Pkg.jl#116, JuliaLang/Pkg.jl#1751) so while this parser is significantly faster there are some diminishing returns in effect.

StefanKarpinski · 2020-05-25T14:30:32Z

Any measurements on how this affects code loading speed? One thought I’d had for that was to specify what values you’d like to extract from a file and only return that data instead of constructing the whole data structure. So if you’re just looking for a single value, it could be done with no allocation at all.

KristofferC · 2020-05-25T14:36:28Z

One thought I’d had for that was to specify what values you’d like to extract from a file and only return that data instead of constructing the whole data structure. So if you’re just looking for a single value, it could be done with no allocation at all.

Yeah, it is possible, but the time to fully parse my pretty big v1.4 Manifest.toml file is ~300 us (including reading it from disc). Compared to the cost of actually loading a package that is negligible. I don't think this really needs to be optimized further.

StefanKarpinski · 2020-05-25T16:22:00Z

I was thinking about the issue of parsing the same file over and over. But maybe caching is better for that.

KristofferC · 2020-05-25T16:46:35Z

I was thinking about the issue of parsing the same file over and over.

I think it is good to completely avoid opening and reading through the file, which a cache should help with. It is also simpler in the sense that you don't need a second version of the parser that does something more akin to lazy parsing.

vtjnash · 2020-06-02T19:06:52Z

base/toml_parser.jl

+ # [a.b.c.d] doesn't "define" the table [a]
+ # so keys can later be added to [a], therefore
+ # we need to keep track of what tables are
+ # actualyl "defined


Suggested change

# actualyl "defined

# actually defined

vtjnash · 2020-06-02T19:11:48Z

base/toml_parser.jl

+@eval macro $(Symbol("try"))(expr)
+ :(
+ v = $(esc(expr));
+ v isa $ParserError && return v;
+ v;
+ )
+end


Suggested change

@eval macro $(Symbol("try"))(expr)
    :(
        v = $(esc(expr));
        v isa $ParserError && return v;
        v;
    )
end

@eval macro $(:var"try")(expr)
    return quote
        v = $(esc(expr))
        v isa ParserError && return v
        v
    end
end
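For readers unfamiliar with the pattern, the macro above implements "return the error value early instead of throwing". A self-contained sketch of the same idea, using a hypothetical macro name `@propagate` (to avoid the reserved word `try`) and a simplified stand-in `ParserError` type rather than the one in `base/toml_parser.jl`:

```julia
# Simplified stand-in for the parser's error type (assumption, not the PR's definition).
struct ParserError
    msg::String
end

# Evaluate `expr`; if it produced a ParserError, return it from the enclosing
# function, otherwise yield the value.
macro propagate(expr)
    return quote
        v = $(esc(expr))
        v isa ParserError && return v
        v
    end
end

parse_digit(c::Char) = isdigit(c) ? (c - '0') : ParserError("not a digit: $(repr(c))")

function parse_two_digits(s::AbstractString)
    a = @propagate parse_digit(s[1])
    b = @propagate parse_digit(s[2])
    return 10a + b
end
```

Here `parse_two_digits("42")` yields an integer, while `parse_two_digits("4x")` returns the `ParserError` produced by the second digit without any exception being thrown.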

vtjnash · 2020-06-03T04:37:25Z

base/toml_parser.jl

+function recurse_dict!(l::Parser, d::Dict, dotted_keys::AbstractVector{String}, check=true)::Err{TOMLDict}
+ for i in 1:length(dotted_keys)
+ key = dotted_keys[i]
+ d = get!(() -> TOMLDict(), d, key)


Suggested change

d = get!(() -> TOMLDict(), d, key)

d = get!(TOMLDict, d, key)
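The suggestion works because `get!` accepts any callable as its first argument, and a type is callable (calling `TOMLDict()` constructs an empty dictionary), so no closure is needed. A minimal sketch, assuming `TOMLDict` is an alias for `Dict{String,Any}` and using a hypothetical helper `nested_table!` that skips the validity checks the real `recurse_dict!` performs:

```julia
const TOMLDict = Dict{String,Any}  # assumed alias, for illustration

# Walk (and create as needed) the chain of tables named by `dotted_keys`.
function nested_table!(root::TOMLDict, dotted_keys::AbstractVector{String})
    d = root
    for key in dotted_keys
        # `TOMLDict` itself is the default-producing callable.
        # The type assertion throws if the key already holds a non-table value.
        d = get!(TOMLDict, d, key)::TOMLDict
    end
    return d
end

root = TOMLDict()
table = nested_table!(root, ["a", "b", "c"])
table["x"] = 1
```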

vtjnash · 2020-06-03T04:42:10Z

base/toml_parser.jl

+
+isvalid_hex(c::Char) = isdigit(c) || ('a' <= c <= 'f') || ('A' <= c <= 'F')
+isvalid_oct(c::Char) = '0' <= c <= '7'
+isvalid_binary(c::Char) = '0' <= c <= '2'


Suggested change

isvalid_binary(c::Char) = '0' <= c <= '2'

isvalid_binary(c::Char) = '0' <= c <= '1'
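A small sketch of why the bound matters: with the original `'2'` bound, a literal such as `0b102` would pass the character check and only fail later. `parse_prefixed_int` below is a hypothetical helper, not a function from the PR, and it simplifies TOML's underscore rules by just stripping underscores.

```julia
isvalid_hex(c::Char) = isdigit(c) || ('a' <= c <= 'f') || ('A' <= c <= 'F')
isvalid_oct(c::Char) = '0' <= c <= '7'
isvalid_binary(c::Char) = '0' <= c <= '1'

# Hypothetical helper: parse 0x/0o/0b literals, returning `nothing` when invalid.
function parse_prefixed_int(s::AbstractString)
    for (prefix, base, ok) in (("0x", 16, isvalid_hex),
                               ("0o", 8, isvalid_oct),
                               ("0b", 2, isvalid_binary))
        if startswith(s, prefix)
            body = replace(s[3:end], "_" => "")  # simplification of the TOML rules
            (!isempty(body) && all(ok, body)) || return nothing
            return parse(Int, body; base=base)
        end
    end
    return nothing
end
```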

vtjnash · 2020-06-03T04:45:02Z

base/toml_parser.jl

+ contains_underscore |= read_underscore
+ end
+ if !ok_end_value(peek(l))
+ error()


nit: message is required

vtjnash · 2020-06-03T04:47:43Z

base/toml_parser.jl

+ subs = take_substring(l)
+ # Need to pass a AbstractString to `parse` so materialize it in case it
+ # contains underscore.
+ # vvvvvvv <- this looksl like a dude flipping the bird


Suggested change

# vvvvvvv <- this looksl like a dude flipping the bird

vtjnash · 2020-06-03T04:49:19Z

base/toml_parser.jl

+ e isa Base.OverflowError && return(ParserError(ErrOverflowError))
+ error("internal parser error: did not correctly discredit $(repr(s)) as an int")
+ end
+end


Suggested change

end

return v

end

vtjnash · 2020-06-03T04:51:45Z

base/toml_parser.jl

+ set_marker!(l)
+ @try accept_two(l, isdigit)
+ day = parse_int(l, false)
+ # Verify the real range in the constructor below


This isn't true--as stated at the top of the file, no validity checking is done

it is true when you have the Dates stdlib available:

julia> Date(1989, 02, 29)
ERROR: ArgumentError: Day: 29 out of range (1:28)

vtjnash · 2020-06-03T04:52:04Z

base/toml_parser.jl

+ if ok_end_value(peek(l))
+ if (read_space = accept(l, ' '))
+ if !isdigit(peek(l))
+ return Date(year, month, day)


Do you intend for this to throw errors (whereas everything else returns them)?

Should be a try catch here as well.
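One way the try/catch could look, sketched under assumptions: the Dates stdlib is available, `ParserError` is a simplified stand-in for the parser's error type, and `try_date` is a hypothetical helper name.

```julia
using Dates

# Simplified stand-in for the parser's error type (assumption).
struct ParserError
    msg::String
end

# Convert the exception thrown by the Dates constructor into a returned error,
# so date parsing behaves like the rest of the parser.
function try_date(year::Int, month::Int, day::Int)
    try
        return Date(year, month, day)
    catch err
        err isa ArgumentError || rethrow()
        return ParserError("invalid date: " * err.msg)
    end
end
```

With this wrapper, `try_date(1989, 2, 29)` returns a `ParserError` instead of throwing, and valid dates come back as `Date` values.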

vtjnash · 2020-06-03T04:55:33Z

base/toml_parser.jl

+accept_two(l, f::F) where {F} = accept_n(l, 2, f) || return(ParserError(ErrParsingDateTime))
+function parse_datetime(l)::Err{Union{DateTime, Date}}
+ # Year has already been eaten when we reach here
+ year = parse_int(l, false)


Suggested change

year = parse_int(l, false)

year = parse_int(l, false)::Int

base/loading.jl

vtjnash · 2020-06-03T05:20:12Z

base/loading.jl

+ entry = first(d[name])
+ if found_name
+ uuid = get(entry, "uuid", nothing)
+ return PkgId(UUID(uuid), name)


ERROR: MethodError: Cannot convert an object of type Nothing to an object of type UInt128

vtjnash · 2020-06-03T05:20:35Z

base/loading.jl

+ error("expected a single entry for $(repr(name)) in $(repr(project_file))")
+ end
+ entry = first(d[name])
+ if found_name


always true here?

vtjnash · 2020-06-03T05:22:57Z

base/loading.jl

+ elseif deps isa Dict
+ for (dep, uuid) in deps
+ if dep === name
+ return PkgId(UUID(uuid), name)


uuid might not be of the proper type if the file is malformed
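Both this comment and the `MethodError` above come down to trusting the shape of the parsed data. A defensive lookup could look roughly like the sketch below; `dep_uuid` is a hypothetical helper, not code from the PR, and returning `nothing` for malformed input is one possible policy (raising a descriptive error is another).

```julia
using UUIDs: UUID

# Hypothetical helper: return the UUID stored under `key`, or `nothing` if the
# entry is missing, is not a string, or is not a well-formed UUID.
function dep_uuid(entry::AbstractDict, key::AbstractString="uuid")
    raw = get(entry, key, nothing)
    raw isa AbstractString || return nothing
    try
        return UUID(raw)
    catch err
        err isa ArgumentError || rethrow()
        return nothing
    end
end
```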

vtjnash · 2020-06-03T05:25:08Z

base/loading.jl

- return nothing
+ found_where || return nothing
+ found_name || return PkgId(name)
+ # Only here is deps was not a dict which mean we have a unique name for the dep


Suggested change

# Only here is deps was not a dict which mean we have a unique name for the dep

# Only here if deps was not a dict which mean we have a unique name for the dep

vtjnash · 2020-06-03T05:31:18Z

base/toml_parser.jl

+ # In case we do not have the Dates stdlib available
+ # we parse DateTime into these internal structs,
+ # note that these do not do any argument checking
+ struct Date
+ year::Int
+ month::Int
+ day::Int
+ end
+ struct Time
+ hour::Int
+ minute::Int
+ second::Int
+ ms::Int
+ end
+ struct DateTime
+ date::Date
+ time::Time
+ end
+ DateTime(y, m, d, h, mi, s, ms) =
+ DateTime(Date(y,m,d), Time(h, mi, s, ms))


Would be so much better if we could either use delayed definitions here (Requires.jl?) or make these a parser error (unsupported) rather than re-implementing incompatible structs with the same names.

We can't make it a parser error (we need to support arbitrary TOML files) and we want code loading to work without Dates. So this was the best I could come up with.

vtjnash · 2020-06-03T05:34:18Z

Added the docs and tests labels since the new toml_parser.jl file will need both. Otherwise, most of this looks pretty straightforward and should be in pretty good shape to merge.

rfourquet · 2020-06-03T07:34:48Z

base/loading.jl

+ p::TOML.Parser
+ d::Dict{String, Dict{String, Any}}
+end
+TOMLCache() = TOMLCache(TOML.Parser(), Dict())


I guess it doesn't matter, but maybe:

Suggested change

TOMLCache() = TOMLCache(TOML.Parser(), Dict())

TOMLCache() = TOMLCache(TOML.Parser(), Dict{String,Dict{String,Any}}())

?

timholy · 2020-07-03T13:27:30Z

Thanks for pointing out the better alternative. It seems to be identical to the difference between

x::T = y

and

x = y::T

I'll switch to the style you demonstrated for future commits.
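For reference, the two forms behave differently in general Julia code: `x::T = y` declares the variable's type and converts on assignment, while `x = y::T` asserts that the value already has that type. A minimal sketch with hypothetical function names:

```julia
# Declaration: `x` is typed as Float64, and `y` is converted on assignment.
function declared(y)
    x::Float64 = y
    return x
end

# Assertion: `y` must already be a Float64, otherwise a TypeError is thrown.
function asserted(y)
    x = y::Float64
    return x
end

# Helper for the usage below: report whether the assertion form throws.
function assertion_throws(y)
    try
        asserted(y)
        return false
    catch err
        return err isa TypeError
    end
end
```

So `declared(1)` returns `1.0`, whereas `asserted(1)` throws because an `Int` is not a `Float64`.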

KristofferC · 2020-08-26T09:05:44Z

I've rebased this and changed the target branch to https://github.com/JuliaLang/julia/tree/kc/load_toml for a simpler review.

KristofferC · 2020-08-26T09:14:25Z

@timholy If you want to do some invalidation sanity checks, you could use this branch together with JuliaLang/Pkg.jl#1984 and see how things look.

KristofferC · 2020-08-27T14:16:47Z

This is ready to go but it breaks Revise and I don't want to break people's workflow so I'll wait a bit with merging. It should be a fairly small change to https://github.com/timholy/Revise.jl/blob/8336b10a5e7d0e4b1bb0726cf426469c8f866d2e/src/pkgs.jl#L459-L488 @timholy. I didn't really know what the function does so I had a hard time updating it myself.

KristofferC · 2020-08-28T13:50:00Z

Revise has been updated to deal with this PR now.

KristofferC added the domain:packages (Package management and loading) label May 24, 2020

KristofferC requested a review from StefanKarpinski May 24, 2020 21:16

KristofferC force-pushed the kc/load_toml branch from 96e3ca9 to 741a305 Compare May 25, 2020 06:03

KristofferC force-pushed the kc/load_toml branch from 741a305 to 23ba995 Compare May 25, 2020 13:02

tkf mentioned this pull request May 27, 2020

Add Preferences subsystem JuliaLang/Pkg.jl#1835

Closed

KristofferC force-pushed the kc/load_toml branch 7 times, most recently from 3f9ecfd to 741cff8 Compare May 29, 2020 14:30

KristofferC mentioned this pull request Jun 1, 2020

Include in ParseError the filename and line-info JuliaLang/Pkg.jl#1832

Closed

vtjnash reviewed Jun 2, 2020

vtjnash reviewed Jun 3, 2020

vtjnash added the needs docs (Documentation for this change is required) and needs tests (Unit tests are required for this change) labels Jun 3, 2020

rfourquet reviewed Jun 3, 2020

timholy mentioned this pull request Jul 7, 2020

Summary of non-ambiguous patterns of invalidation #35922

Closed

visr mentioned this pull request Jul 13, 2020

create LICENSE JuliaLang/TOML.jl#1

Merged

KristofferC mentioned this pull request Aug 5, 2020

Some package load times affected drastically by current working directory #36911

Closed

fkastner mentioned this pull request Aug 11, 2020

url to superseded version is broken JuliaAttic/TOML_old.jl#22

Closed

KristofferC mentioned this pull request Aug 11, 2020

Add Scratch stdlib #36966

Closed

KristofferC added this to the 1.6 milestone Aug 11, 2020

KristofferC mentioned this pull request Aug 13, 2020

add a TOML standard library to Julia #37034

Merged

KristofferC force-pushed the kc/load_toml branch from b1d6fc8 to 240a190 Compare August 26, 2020 09:03

KristofferC changed the base branch from master to kc/toml_stdlib August 26, 2020 09:03

KristofferC removed the needs docs (Documentation for this change is required) and needs tests (Unit tests are required for this change) labels Aug 26, 2020

KristofferC force-pushed the kc/load_toml branch from 240a190 to 47119ee Compare August 26, 2020 09:30

Base automatically changed from kc/toml_stdlib to master August 26, 2020 19:09

KristofferC force-pushed the kc/load_toml branch from 47119ee to 031448d Compare August 27, 2020 05:57

KristofferC changed the title ~~WIP: Add a TOML parser to Base and use it in code loading~~ Use the TOML stdlib in code loading Aug 28, 2020

KristofferC mentioned this pull request Aug 28, 2020

fix for future changes in code loading timholy/Revise.jl#521

Merged

move the TOML parser to Base and implement code loading on top of the…

c601fbf

… Base TOML parser

KristofferC force-pushed the kc/load_toml branch from 031448d to c601fbf Compare August 28, 2020 08:49

KristofferC merged commit c885514 into master Aug 28, 2020

KristofferC deleted the kc/load_toml branch August 28, 2020 13:50

oscardssmith pushed a commit to oscardssmith/julia that referenced this pull request Aug 28, 2020

move the TOML parser to Base and implement code loading on top of the…

5f62d56

… Base TOML parser (JuliaLang#36018)

simeonschaub pushed a commit to simeonschaub/julia that referenced this pull request Aug 29, 2020

move the TOML parser to Base and implement code loading on top of the…

f41a05f

… Base TOML parser (JuliaLang#36018)

omus mentioned this pull request Apr 7, 2021

Add UUID(::UUID) and parse(::Type{UUID}, string) JuliaLang/Compat.jl#740

Merged

KristofferC mentioned this pull request May 20, 2021

prevent doing excessive file system checks in require calls #40890

Merged

ericphanson mentioned this pull request Apr 21, 2022

add no-op constructor for VersionNumber(::VersionNumber) #45052

Merged
