Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix some overly optimistic fastpaths in subtyping #25796

Merged
merged 7 commits into from
Jul 17, 2020
Merged

Conversation

martinholters
Copy link
Member

Fixes #25430, see discussion therein.

WIP because it leads to this regression (hence the disabled core test):

julia> struct T22624{A,B,C}; v::Vector{T22624{Int64,A}}; end
ERROR: StackOverflowError:

Hints welcome about where I should look to fix this!

@martinholters
Copy link
Member Author

Ok, the recursive type definition requires obviously_unequal(Int64, A) to be true to not result in a stack overflow, so I've added another fast path to obviously_unequal for the (concrete type, typevar) case. However, from the function name, I would expect Int64 and A not to be obviously unequal without knowing the environment; after all, A could be bound to Int64. But I guess that's just the problematic semantics of lone type variables...

Anyway, this now passes tests (locally, hoping CI agrees), and fixes a bug, so getting a review would be appreciated.

@martinholters martinholters changed the title WIP: Fix some overly optimistic fastpaths in subtyping RFC: Fix some overly optimistic fastpaths in subtyping Apr 12, 2018
@JeffBezanson JeffBezanson self-assigned this Apr 12, 2018
src/jltypes.c Outdated
@@ -583,6 +583,11 @@ static int is_typekey_ordered(jl_value_t **key, size_t n)
!(jl_is_datatype(k) && (((jl_datatype_t*)k)->uid ||
(!jl_has_free_typevars(k) && !contains_unions(k)))))
return 0;
if (jl_is_datatype(k)) {
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably subsumes the jl_is_datatype(k) part of the check above (line 583).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, recursing down the structure once should be enough. Will push a simplified version once checks pass locally.

src/jltypes.c Outdated
return 1;

return is_typekey_ordered(jl_svec_data(dt1->parameters), jl_svec_len(dt1->parameters)) ==
is_typekey_ordered(jl_svec_data(dt2->parameters), jl_svec_len(dt2->parameters));
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This predicate seems pretty expensive. If we're going to use this, we should probably cache this property like we do for has-free-typevars.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Performance-wise, I'm more worried about taking the fast-paths in subtype.c much less often, though. BTW, does nanosoldier have any benchmarks that exercise the type system?

If we're going to use this

Before I set into motion working on the caching: What are the chances of that becoming true? (And what are the alternatives?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the precomputing/caching of typekeyordered. Any idea for a benchmark showing how helpful that is? touch base/compiler/compiler.jl; time make and make test-subtype are pretty much unaffected.

@JeffBezanson
Copy link
Sponsor Member

Ok, the recursive type definition requires obviously_unequal(Int64, A) to be true to not result in a stack overflow

Where was this call coming from? Was it the call to jl_types_equal in inst_datatype_inner?

@martinholters
Copy link
Member Author

Where was this call coming from?

I admit I got nowhere when I tried to walk through the code in the debugger and figure why it was ending in a stack overflow. So I added debug output showing all type pairs where the result of obviously_unequal had changed. The (Int64, A) looked promising and, yes, fixed it.

So looking at the stack traces of all calls to obviously_unequal(Int64, A) (occurring 8 times), I can verify your suspicion, they all look like:

inst_datatype_inner -> jl_types_equal -> obviously_unequal -> obviously_unequal(Int64, A)

BTW, there are also calls to obviously_unequal(A, Int64), but they may safely return false without breaking anything.

@martinholters
Copy link
Member Author

Bump. Should I work on caching the result of is_typekey_ordered? Or do we need to work out what to do with obviously_unequal(Int64, A)? Or...?

@martinholters
Copy link
Member Author

Bumpity.

@martinholters
Copy link
Member Author

Looking at #31696, it still fails in this branch, but only because jl_obviously_unequal(Int8, T) returns true. If I revert b5cb181, this actually fixes #31696 🎉, but of course, breaks #22624 again (ref. #25796 (comment)). So I think a type should not be "obviously unequal" to a typevar and we need to find some other fix to make #22624 work again, but I have no clue about that (and little time to look into it at the moment).

@martinholters
Copy link
Member Author

I've removed

julia/src/subtype.c

Lines 223 to 225 in 9d9257d

if ((jl_is_concrete_type(a) && jl_is_typevar(b)) ||
(jl_is_concrete_type(b) && jl_is_typevar(a)))
return 1;

and spent some time in the debugger to figure out why that makes

struct T22624{A,B,C}; v::Vector{T22624{Int64,A}}; end

overflow the stack. The call to jl_types_equal in

julia/src/jltypes.c

Lines 1103 to 1104 in 9d9257d

if (tw && tw != pi && (tn != jl_type_typename || jl_typeof(pi) == jl_typeof(tw)) &&
jl_types_equal(pi, tw)) {

is invoked for pi == Main.T22624{Int64, Int64, C} where C, tw == Main.T22624{A, B, C} where C where B where A. That subtype test invokes rename_unionall, which wants to construct another T22624, resulting in the infinite recursion. I think guarding the call to jl_types_equal with a suitable predicate allows to keep #22624 fixed and at the same time fix #31696. I'll poke at it a bit more...

@martinholters
Copy link
Member Author

Yep, this can be fixed by adding a suitable guard, see last commit. However, I have no idea how to justify this. My guess is it's just a more or less accidental fix. Hence, the name guard_against_22624 because that's what is. I would have thought the stack plays a role here, too, but couldn't put it to use fruitfully. @JeffBezanson can you take a look at this?

@JeffBezanson
Copy link
Sponsor Member

Nice, thanks for figuring that out. I'll take a look.

src/jltypes.c Outdated
return 0;
jl_datatype_t *dt1 = (jl_datatype_t *) t1;
jl_datatype_t *dt2 = (jl_datatype_t *) t2;
if (!(is_cacheable(dt1) || jl_svec_len(dt1->parameters) == 0)) {
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be nice to have some comments here about what the reasoning here is, what the parameters are, etc.

@JeffBezanson
Copy link
Sponsor Member

This is trying to normalize types that might equal dt.name.wrapper for some dt::DataType. So in effect guard_against_22624 is another variant of types_obviously_unequal. So we could call it type_obviously_not_wrapper and make sure it implements that rigorously.

src/jltypes.c Outdated Show resolved Hide resolved
@martinholters
Copy link
Member Author

I guess calling jl_types_equal here is generally risky, so ideally, we should fuse the !type_obviously_not_wrapper and jl_types_equal into an type_is_wrapper which does not lead to calls to rename_unionall. I don't know whether that is feasible. So what do we loose if we instead make this a type_obviously_wrapper, i.e. maybe miss opportunities to normalize to the wrapper? What's the utility of normalizing to wrappers here? As an experiment, I've disabled the whole loop trying to normalize the parameters and nothing bad seems to be happening...

@martinholters
Copy link
Member Author

See #31828 for a possible remedy concerning the normalization to wrapper/infinite recursion issue.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 25, 2019

I guess calling jl_types_equal here is generally risky,

FWIW, the lines of code immediately subsequent (lookup_type and lookup_type_stack) are also a call to jl_types_equal*, this particular stacktrace is the result of hoisting this particular common case out of the inner loop (thereby allowing us to replace the linear scan with a faster and simpler binary search).

@martinholters
Copy link
Member Author

Right. But it's the comparison to the potential wrapper that is most likely to require a rename_unionall.

@martinholters
Copy link
Member Author

Any ideas anyone about the failed precompile test on win32? I wouldn't expect that to be flaky, but I also don't see how this PR could affect only a single platform. And the output is not too helpful.

@martinholters
Copy link
Member Author

Ok, rebasing made the CI error go away. And yay, all green!

@martinholters
Copy link
Member Author

Fixed another fastpath found in typekey_eq, now the case from #25796 (comment) passes, too.

Note that this PR does not improve the normalization, but aims at avoiding bugs due to incomplete normalization. E.g. the types from #35130 are still not normalized any better, but they now compare equal as they should. While better normalization would certainly be good, I doubt it can ever be perfect, except for an uncoditional linear cache search which is probably too expensive. So I think we should apply the fixes from this PR and try to improve normalization.

RA new round of review from @vtjnash / @JeffBezanson would be appreciated.

@martinholters martinholters added the kind:bugfix This change fixes an existing bug label Jun 23, 2020
@martinholters
Copy link
Member Author

Bump. As this fixes a bug and resolves an existential crisis, it would be nice to make some progress here. The main point I'm unsure about is the caching of cashed_by_hash. Is it worth it and if so, does the current implementation make sense? But another round of general review would also be appreciated.

@StefanKarpinski
Copy link
Sponsor Member

Since this passes tests and has gotten no responses despite multiple pings over a long period, I will merge this if there's no replies in the next day or so.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jun 25, 2020

This seems slightly expensive, and still will be wrong in many other places in the system, so I'm inclined to try to go the normalization route to get the majority of cases, and see what it breaks with these rules:

  • For instantiation of Tuple, replace any TypeVar parameter with a concrete upper bound with the upper bound itself (ignores lower bound).
  • For instantiation of any other object, replace any TypeVar with ub === lb with just upper bound itself.

@martinholters
Copy link
Member Author

Ref. #36211 for an experiment going the normalization route (where I'd also appreciate feedback). In general, I think to thoroughly fix the situation, an approach like in this PR will be needed. Even if there are more places around which need fixing to call jl_type_equality_is_identity, that's a finite number, so eventually, we should be able to fix them all. OTOH, I'm skeptical whether normalization can ever be perfect. Of course, we might decide that it's acceptable to be wrong in some corner cases if the price for fixing them is too high:

This seems slightly expensive

Expensive in terms of what? Do you have an idea for a benchmark to evaluate the cost of this approach?

@martinholters
Copy link
Member Author

Do you have an idea for a benchmark to evaluate the cost of this approach?

As it seems to be quite fashionable at the moment, I measured the TTFP:

julia> @time using Plots
 15.696126 seconds (22.13 M allocations: 1.186 GiB, 3.09% gc time) # before
 10.652585 seconds (10.04 M allocations: 630.085 MiB, 2.28% gc time) # after

julia> @time plot(rand(10))
  3.294725 seconds (3.64 M allocations: 188.339 MiB, 2.04% gc time) # before
  3.257043 seconds (3.64 M allocations: 188.792 MiB, 1.37% gc time) # after

Before is 8c65f8d, not current master, as I didn't rebase. Given the latest latency improvements, I probably should repeat this after a rebase, but from this benchmark, it doesn't look expensive at all. Quite on the contrary, actually. (Without precomputation of cashed_by_hash, the times are 10.98s and 3.49s, so it's probably worth it.)

@martinholters
Copy link
Member Author

After rebasing (to 6185d24), @time plot(rand(10)) is pretty much unaffected by this PR, while @time using Plots still shows an improvement, albeit a smaller one: from 11.6s to 9.5s.

@martinholters
Copy link
Member Author

Sill trying to get a feel for the cost of this, I can first report that touch base/compiler/compiler.jl ; time make is pretty much unaffected (within measurement noise).

I'm a bit puzzled by

julia> @btime ==(Vector{Tuple{Integer}}, Vector{Tuple{>:Int}})
  11.548 μs (6 allocations: 320 bytes) # master
  5.343 μs (6 allocations: 320 bytes) # PR
false

Why is this getting faster? Does it involve an obviously_disjoint (where I could enable the fastpath) somewhere?

@vtjnash, you claimed that this "seems slightly expensive". Any idea for a benchmark where might actually show?

@JeffBezanson
Copy link
Sponsor Member

I think it's ok to merge this. It fixes some things, and there doesn't seem to be a cost. I do agree we should do more in the future to help ensure that equality_is_identity is true for types used as type tags (===typeof(x) for some x). But this seems like a reasonable stopgap.

@StefanKarpinski StefanKarpinski added the status:merge me PR is reviewed. Merge when all tests are passing label Jul 15, 2020
@JeffBezanson JeffBezanson merged commit bd52a95 into master Jul 17, 2020
@JeffBezanson JeffBezanson deleted the mh/fix_25430 branch July 17, 2020 20:16
@mbauman mbauman removed the status:merge me PR is reviewed. Merge when all tests are passing label Jul 20, 2020
simeonschaub pushed a commit to simeonschaub/julia that referenced this pull request Aug 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:types and dispatch Types, subtyping and method dispatch kind:bugfix This change fixes an existing bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unreachable reached at 0000000015349AF1
5 participants