
Add a generic code-inferrability test #37193

Merged (2 commits merged into master on Nov 15, 2020)

Conversation

@timholy commented Aug 25, 2020

This PR adds an overall "code quality assurance" test, checking elements of inferrability of Base and the stdlibs. It largely tests just the precompiled MethodInstances, so it does not offer comprehensive coverage, but because these are the MethodInstances connected to invalidations, it offers a form of protection against regressing our recent progress.

There are several elements of the test:

  • it checks whether Base.require is "bullet-proof" against invalidation. Any time Base.require gets invalidated, it takes more than half a second before you can start loading the next package in a sequence, so invalidations here are a big problem from the standpoint of package-loading performance. Currently this is marked @test_broken, because we still have some string-processing code with inference problems that could invalidate these methods, but a goal will be to turn these into @tests.
  • it tests the overall number of "worrisome" MethodInstances (length(badcounts) in the test code), as well as the average number of total dependents (backedges, and the backedges of their backedges, etc.) of these methods (meancounts in the test code); see the sketch after this list for roughly how such a dependents count can be computed. Currently this PR sets the bar at 1250 worrisome MethodInstances and 33 average dependents. If/when my current unmerged PRs make it in, these can drop to 1000 and 32, respectively. For reference, Julia 1.5 has 2030 such MethodInstances with an average of 38.5 dependents. Most of what's left to fix appears to be in the area of string processing, so with a bit more effort we may be able to drop these numbers quite dramatically.
  • it checks the inferrability of specific functions (mostly the is* functions, which should generally return Bool), and whether specific unfortunate variants of certain other methods (==, <, etc.) have been inferred. The absence of such MethodInstances indicates that all code that calls them is well-inferred.
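For concreteness, the dependents count mentioned in the second bullet could be computed along the following lines. This is a minimal sketch with an invented helper name, not the PR's actual test code, and it assumes the MethodInstance backedges are populated (as they are for precompiled code):

# Hypothetical helper: count the unique transitive dependents
# (backedges, backedges of backedges, ...) of a MethodInstance.
function count_dependents(mi::Core.MethodInstance,
                          seen = Base.IdSet{Core.MethodInstance}())
    isdefined(mi, :backedges) || return length(seen)
    for caller in mi.backedges
        caller isa Core.MethodInstance || continue  # skip non-MethodInstance edge entries
        caller in seen && continue
        push!(seen, caller)
        count_dependents(caller, seen)              # recurse; `seen` accumulates the set
    end
    return length(seen)
end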

It's worth acknowledging that this may put a new burden on contributors: not only do they need to contribute the functionality, but they need to make sure it's not dramatically worsening inference. I think this is worth having. After all, we check whitespace, and whitespace is probably not nearly as important to the overall user experience as having well-inferred, invalidation-resistant code. Nevertheless, to mitigate the extra hurdles, I've tried to put hints near the places where tests might fail, pointing developers to resources that can help them.

Closes #36393

@KristofferC commented Aug 26, 2020

It's worth acknowledging that this may put a new burden on contributors: not only do they need to contribute the functionality, but they need to make sure it's not dramatically worsening inference.

Because of this, I am not sure this is the best way to do it. By "this way" I mean setting an arbitrary cap and then hard-failing CI when it is reached. This would be a bit similar to setting a hard cap on sysimage build time and failing CI when that is reached (assuming that it is completely deterministic). When that point is reached, it is likely that a lot of CI will start failing and the whole thing grinds to a halt until someone with the skill goes in and pushes the number of invalidations down again.

Instead (just like compilation time), I feel this should be tracked via some time-series functionality where it can be monitored, and when some threshold is reached it might start blinking red or something. But hard-failing CI because someone does a small fix that pushes the number of backedges from 163 to 170, while the PR that was merged yesterday took it from 130 to 160, seems too harsh.

After all, we check whitespace, and whitespace is probably not nearly as important to the overall user experience as having well-inferred, invalidation-resistant code.

While true, the difficulty of fixing the issue also needs to be weighed in: in one case it is trivial, in the other very advanced.

@timholy commented Aug 26, 2020

Excellent points as usual. Maybe a separate buildbot pass then? I'm clueless about how one goes about implementing this.

@KristofferC

The CUDA people have a nice benchmark site based on codespeed that shows timelines for various benchmarks (https://speed.juliagpu.org/timeline/#/?exe=5,3,4&base=3+74&ben=grid&env=1&revs=50&equid=off&quarts=on&extr=on).

It would be nice to hook up the daily Nanosoldier benchmarks to an instance of that, and at the same time we could maybe add some compilation-time benchmarks (JuliaCI/BaseBenchmarks.jl#252), but also some "static" info like the number of backedges etc., so that could also be tracked over time.

@timholy commented Aug 26, 2020

That makes sense, but I worry about how many people actually look at those numbers. I don't check Julia's performance routinely. OTOH, as long as it's checked before the release perhaps that would be enough.

OTOH, if we had a separate build pass, there's no way you'd miss a change although one could decide to go ahead and merge anyway. It does make sense to separate it out from "standard" tests of course.

@KristofferC commented Aug 26, 2020

I don't check Julia's performance routinely. OTOH, as long as it's checked before the release perhaps that would be enough.

Maybe that is because we have no way of nicely visualizing it right now.

OTOH, if we had a separate build pass, there's no way you'd miss a change although one could decide to go ahead and merge anyway. It does make sense to separate it out from "standard" tests of course.

Well, that build pass is going to be green and ignored until it turns red, and then it will still mostly be ignored. And when someone (you?) sees it, then what do you do? You probably want to try to find the place where the bad stuff happened, but since you don't really have a pass/fail criterion, the best thing at that point is probably to start building a time series to see whether there are points where things suddenly got worse. You could try a bisection, but if the regression is gradual, it won't necessarily give you something valuable.

@timholy commented Aug 26, 2020

Fair enough. I'm not sure how one gets started on that though.

@KristofferC commented Aug 28, 2020

Hm, @staticfloat how long do we save the logs from the buildbot? If we add something that prints to the build log, would it be easy to build up a timeseries from that?

For now, perhaps we can:

  • Keep the test error for the "non-fuzzy" things (like making sure Base.require is solid). This is a yes or no thing and a PR that breaks it can fix it.
  • Change the error for the fuzzy tests to warnings or something along those lines.

That means we can still merge the PR and the code gets run regularly so it doesn't bit-rot, and then we can think a bit more about how exactly the "holistic" measurement should be presented/checked?
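A minimal sketch of what that hard-fail / soft-warn split might look like follows. The values and the names require_vulnerabilities and MAX_BADCOUNTS are placeholders invented here for illustration; only badcounts echoes a name from the PR description:

using Test

# Placeholder data standing in for what the real test computes:
require_vulnerabilities = Core.MethodInstance[]   # MethodInstances that could invalidate Base.require
badcounts = zeros(Int, 900)                       # per-MethodInstance "badness" counts
MAX_BADCOUNTS = 1250                              # the fuzzy threshold proposed in this PR

# Non-fuzzy check: hard failure if anything can invalidate Base.require.
@testset "Base.require invalidation resistance" begin
    @test isempty(require_vulnerabilities)
end

# Fuzzy check: warn rather than fail when the holistic count drifts past the
# threshold, so a small incremental regression does not block unrelated PRs.
if length(badcounts) > MAX_BADCOUNTS
    @warn "Number of poorly inferred MethodInstances exceeded threshold" n=length(badcounts) threshold=MAX_BADCOUNTS
end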

@martinholters

Could we do a before/after comparison like is done e.g. for code coverage? "This PR increases invalidation susceptibility by 2.3%." or whatever.

@staticfloat

Hm, @staticfloat how long do we save the logs from the buildbot? If we add something that prints to the build log, would it be easy to build up a timeseries from that?

The buildbots clean things up after 120 days. You can do this if you want to, but note that buildbot is far from a performant database. Expect to wait like 0.5s for each datapoint you want to read.

You are the second or third person in the last month who has wanted to store timeseries data for packages. I've been thinking about hosting a timeseries database like influxdb or timescaledb for the (authenticated) Julia community. Combined with a small wrapper over HTTP (like this abandoned package), it would allow us to POST timeseries datapoints and then query timeseries back out with HTTP or even fancy things like Grafana. Interested, or overkill?
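For a sense of how small such a wrapper could be, here is a hypothetical sketch using HTTP.jl against an InfluxDB 1.x-style /write endpoint; the host, database name, measurement name, and helper are all invented for illustration:

using HTTP

# Hypothetical helper: push one datapoint in InfluxDB line protocol
# ("measurement,tag=... field=...,field=...").
function post_datapoint(host::AbstractString, commit::AbstractString,
                        badcounts::Integer, meancounts::Real)
    body = "inference_qa,commit=$commit badcounts=$(badcounts)i,meancounts=$meancounts"
    return HTTP.post("$host/write?db=julia_ci", ["Content-Type" => "text/plain"], body)
end

# post_datapoint("http://localhost:8086", "110f125", 1250, 33.0)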

@timholy commented Aug 31, 2020

I do like the idea of a time series, but I think the more important thing is to be notified when a PR adds a lot of poorly-inferred calls. What I had in mind was best summarized by @martinholters (#37193 (comment)).

@timholy commented Aug 31, 2020

xref https://github.com/julia-actions/julia-invalidations, thanks @ianshmean! That covers packages, so we just need something like that for Julia itself (based on potential invalidations like this PR).

@timholy commented Nov 2, 2020

Is there any hope of doing something like this for Julia 1.6? We don't seem to have regressed, except that there seem to be more ways to invalidate Base.require than I remember previously. At least one chunk of problems comes from the preferences work (e.g., #38044, CC @staticfloat).

@KristofferC

Is there any hope of doing something like this for Julia 1.6?

I'm personally ok with putting in whatever you think would be helpful. I still worry a bit about a potential future PR "deadlock" when this starts erroring and new contributors get confused about what they did wrong. But we can always disable the tests if they become disruptive, so 🤷

@timholy commented Nov 3, 2020

How about we aim to eliminate vulnerabilities in paths leading to Base.require and make that a hard-fail, and use warnings for the rest?

@timholy force-pushed the teh/inference_qa branch 6 times, most recently from 5b69038 to 9d9d7b7 on November 9, 2020, 11:40
@timholy commented Nov 9, 2020

OK, this seems to be working (test failure seems incidental). Ready for review.

I am not terribly attached to the is* tests, so we could delete those depending on how others feel about them. For quite a few of them there are more than 3 methods, so in cases of poor inference their caller is going to have to annotate at the call site anyway. This may still have some value, since it can be nice to check what all the methods return; if they're all Bool (as opposed to, say, Union{Bool,Missing}) then one can feel more confident about asserting that.
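A check along those lines can be quite short; the following is a generic sketch (not the PR's exact test code) relying on Base.return_types, which infers a return type for every method of the given function:

# Hypothetical helper: true if every method of the predicate `f` is inferred
# to return exactly Bool (and not, say, Union{Bool,Missing}).
all_return_bool(f) = all(rt -> rt === Bool, Base.return_types(f))

all_return_bool(isodd)      # example predicates; the PR checks a larger is* list
all_return_bool(isfinite)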

@timholy merged commit 110f125 into master on Nov 15, 2020
@timholy deleted the teh/inference_qa branch on November 15, 2020, 20:25
@vchuravy

I am seeing tests added in this PR failing in #38437

@KristofferC

Did CI last run 7 days ago? Maybe it had time to regress since then?

@timholy commented Nov 16, 2020

Hmm, I should have re-run CI, although it seems broken right now.

Indeed, something has changed. Specifically, the test that I think is the most important, the impossibility of invalidating Base.require without committing piracy or specializing non-exported methods of Base, is now failing.

julia> m = methods(Base.require).ms[2]
require(uuidkey::Base.PkgId) in Base at loading.jl:900

julia> mis = remove_unlikely_methodinstances(atrisk_triggers(m, first.(mivulnerabilities_exported)))
2-element Vector{MethodInstance}:
 MethodInstance for convert(::Type{Nothing}, ::Any)
 MethodInstance for convert(::Type{Union{}}, ::Any)

julia> using Cthulhu

julia> ascend(mis[1])
Choose a call for analysis (q to quit):
     convert(::Type{Nothing}, ::Any)
       setindex!(::IdDict{Dict{String, Any}, Nothing}, ::Any, ::Any)
   +     push!(::IdSet{Dict{String, Any}}, ::Dict{String, Any})
   +   setindex!(::IdDict{Any, Nothing}, ::Any, ::Any)
       setindex!(::IdDict{MethodInstance, Nothing}, ::Any, ::Any)
   +     push!(::IdSet{MethodInstance}, ::MethodInstance)
         push!(::IdSet{MethodInstance}, ::Any)
 >         treelist!(::FoldingTrees.Node{Cthulhu.Data{MethodInstance}}, ::IOBuffer, ::Any, ::String, ::IdSet{MethodInstance})

I'm showing the tree in collapsed form, folding everything below a "reasonable" call. It's basically all IdDict/IdSet-related. The offending convert call is

val = convert(V, val)
which has been with us at least since 8b6a8c2 (February), so it's hard to believe that this directly changed. Seems like it might be something else, like perhaps an improvement elsewhere?

There's a fix in #38455.
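As a generic illustration of this kind of hardening (this is not the actual diff in #38455), asserting the type of the converted value makes downstream inference independent of how well the convert call itself infers, which removes that invalidation edge:

# Generic pattern, not the actual change: the ::V assertion pins the inferred
# type of the stored value to V, even when convert itself infers poorly.
function store!(dict::IdDict{K,V}, key, val) where {K,V}
    val = convert(V, val)::V
    dict[key] = val
    return dict
end

store!(IdDict{String,Int}(), "a", 3.0)   # converts 3.0 to 3 and asserts Int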


Successfully merging this pull request may close these issues.

Julep: attempting to measure (& test?) "absolute" risk of invalidation