Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do exact array resize!() in more cases #22038

Merged
merged 2 commits into from
Jun 1, 2017

Conversation

alyst
Copy link
Contributor

@alyst alyst commented May 23, 2017

Sometimes it's useful to have precise control over memory size allocated for a given array (see e.g. #21938).
One would assume it could be done by resize!(), but under the hood it calls jl_array_grow_at_end().
The latter method is actually tuned for the gradually growing arrays, it would allocate the exactly requested size only if the array is empty,
otherwise it would allocate the next power of 2 that is greater or equal to the requested size.

There are several options:

  • do nothing
  • change the heuristic to allocate exactly requested amounts in more cases (final decision implemented by this PR)
  • add keyword/positional exact argument to resize!()/sizehint!()
  • change the behaviour of resize!()/sizehint!() to always grow to the exactly specified size
  • add new resize_exact!() method (either exported or not)
  • ...

@KristofferC
Copy link
Sponsor Member

Any thoughts about sizehint!? Does / should it follow the same rule?

@alyst
Copy link
Contributor Author

alyst commented May 23, 2017

IMO the same changes should be done to sizehint!(). It would be nice if they remain easily interchangeable.

@KristofferC
Copy link
Sponsor Member

KristofferC commented May 23, 2017

This makes sense to me. For example, when you create a sparse matrix it is common that you know exactly how many stored elements you will have so you might do a sizehint!(I, large_number). Potentially wasting a factor of 2 memory seems unnecessary.

Edit: On the other hand you usually start with empty vectors then, which is already covered by the current code.

@StefanKarpinski
Copy link
Sponsor Member

Since the cost of the size not being exact is ~one extra resize versus up to 2x memory waste, taking size hints and resizes literally seems like a good idea.

@andreasnoack
Copy link
Member

Also, in the motivating use case, #21938, we know that the arrays will never grow again.

@kshyatt kshyatt added the arrays [a, r, r, a, y, s] label May 23, 2017
@alyst
Copy link
Contributor Author

alyst commented May 24, 2017

It seems #22044 has fixed the issue with the string interpolation that got exposed by 6d92926 (see #21938 comments): after rebasing to the current master the QR tests are fine again.

@KristofferC
Copy link
Sponsor Member

Since sizehint! ends up calling the function modified here, this would automatically work for sizehint! as well, right?

JL_DLLEXPORT void jl_array_sizehint(jl_array_t *a, size_t sz)

@alyst
Copy link
Contributor Author

alyst commented May 25, 2017

First take on making resize!() and sizehint!() always growing to an exactly specified size.

@alyst
Copy link
Contributor Author

alyst commented May 26, 2017

@nanosoldier runbenchmarks(ALL, vs=":master")

@alyst
Copy link
Contributor Author

alyst commented May 26, 2017

Looks like @nanosoldier doesn't listen to me.

@KristofferC
Copy link
Sponsor Member

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@KristofferC
Copy link
Sponsor Member

On the positive side, string, join does take less memory now :D

@yuyichao
Copy link
Contributor

The necessity last commit basically means that user code that reserves multiple elements with resize! and then assign to them in batches will be much slower than just pushing individual elements so the behavior of resize! should not be changed when the increment is small (the original change in the heuristics is good though).

@alyst
Copy link
Contributor Author

alyst commented May 26, 2017

Definitely there should be a way for the user to grow an array the old way.
It could be either the new exported grow!(a, n) method.
Or, an additional exact param for resize!(). In the latter case -- should it default to exact=true or exact=false (old behaviour)?
IMHO, calling resize!() within appending loop sounds a little bit confusing, I would rather use another verb.

@yuyichao
Copy link
Contributor

resize! should definitely stay with it's current default. I see no point of making everyone's code suddently slower unless we're moving to a new name altogether.

@alyst
Copy link
Contributor Author

alyst commented May 26, 2017

I'm not sure resize!() is used that much in the packages in the append context. For arrays there's push!() and append!(), and low-level streams are already implemented in Base.
OTOH in the Base we don't see a lot of improvements when exact=true is the default.

@alyst
Copy link
Contributor Author

alyst commented May 26, 2017

Could somebody ask nanosoldier? Thanks!

@KristofferC
Copy link
Sponsor Member

@nanosoldier runbenchmarks(ALL, vs=":master")

Copy link
Contributor

@yuyichao yuyichao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure resize!() is used that much in the packages in the append context.

The point is that this change gives a O(1) memory saving and a O(n) performance regression. It's basically always wrong when a small increment is used.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@alyst
Copy link
Contributor Author

alyst commented May 26, 2017

Now the benchmarks looks ok, let's try to decide on the API changes :)

I don't like resize!(A, n, exact::Bool) much: "non-exact resize" sounds confusing.
What about resize!(A, n, preemptive::Bool=false), where preemptive=true means the current behaviour? At the moment resize!() documentation doesn't mention the fact that it allocates more than requested and that this feature could be exploited for growing an array. So making "exact resize" a default behaviour seems a fair change. The same for sizehint!()

@yuyichao
Copy link
Contributor

At the moment resize!() documentation doesn't mention the fact that it allocates more than requested and that this feature could be exploited for growing an array. So making "exact resize" a default behaviour seems a fair change. The same for sizehint!()

No, it's not documented in the same way as push! isn't documented to be fast. People expect them to do the right thing and it's a bad argument to make them O(n) slower. They should still be documented but the lack of that is irrelevant for this change.

@alyst
Copy link
Contributor Author

alyst commented May 26, 2017

People expect them to do the right thing and it's a bad argument to make them O(n) slower.

What is "the right thing" for resize!()? At least it's context-dependent. For LAPACK preemptive behaviour could mean gigabytes of unnecessary allocated memory with all the consequences of hitting physical memory limits. IMHO if resize!() is called in a loop from the user/package code, it looks like a wrong code pattern, and the author should assert that he really means it by explicitly specifying preemptive=true.

Well, I'm out of anti-preemptive arguments for the moment :) With the updated jl_array_grow_at_end() heuristic I wouldn't be so much obsessed with specifying preemptive=false for each resize!() call in my code.
Just want to be sure that everyone interested agrees with preemptive=true (including the parameter name) before I'll update the PR.

@nalimilan
Copy link
Member

Personally I would expect "the right thing" for resize! to be "resize this array to the exact size I specified", not "leave a bit more room in case I want to resize it again soon". The right abstraction level seems to be push! and append!, which can easily pass exact=false. Do we have evidence that people use resize! when growing an array progressively, rather than calling push! or append!?

@yuyichao
Copy link
Contributor

Do we have evidence that people use resize! when growing an array progressively, rather than calling push! or append!?

Short of any other function to do it currently I believe resize! is the correct function to do it right now. There are certainly a lot of cases in base that does something similar (not just Array, but also arraylist_t in C) so I believe it's a valid use case that shouldn't be broken and doesn't worth requireing users to change their code especially because there's no way we can issue a warning for this and it'll be really hard to catch (and this is also why I think it'll be fine to change the default if we rename it altogether)

I'll still repeat that exact resize will only save O(1) memory (less than a factor of 2, maybe ~1.5x on average assuming some random distribution). The default heuristic can certainly be changed to do exact resizing after a threshold. In fact, doing at least 2x or exact resizing as default should still provide the same scaling while providing a net saving of memory.

@KristofferC
Copy link
Sponsor Member

Isn't the thing that O(1) can still be very large?

@StefanKarpinski
Copy link
Sponsor Member

My reasoning for 25% is that there are papers which report that 50% growth actually gives optimal performance for exponential growth, and 0% is O(n^2) case that's problematic, so split the difference and make the cutoff halfway between 0% and 50%. The precise cutoff probably doesn't matter.

@StefanKarpinski
Copy link
Sponsor Member

Should there be a shrinkwrap exception? I.e. when the requested size is ≤ the current size, we make it exactly that size? I guess that could lead to bad behavior if a vector is being used as a stack or a dequeue.

@yuyichao
Copy link
Contributor

The growing factor is #8269 and #16305 and I would rather not tie it up with this one. What I was suggesting as a default heuristic is basically

size_t new_size = old_size * grow_factor;
new_size = new_size > request_size ? new_size : request_size;

And we can change the grow factor if there's clear evident that it's better.

@rfourquet
Copy link
Member

It's an expected pattern to call resize! repeatedly by (small) increments (e.g. in a loop), so doing always "exact resize!" would be terrible. resize! is about the actual size of the array, I agree with @yuyichao that the allocation mechanism should do "the right thing" (I would believe most of the time RAM memory is big enough to accomodate the exponential growth factor). If one wants to opimize for space, sizehint! seems the more appropriate level to control those implementation details. At the same level, there could be a shrink_to_fit! (name borrowed from c++) to free the non-used space when the vector is not expected to grow anymore (which seems preferable to me than the "shrinkwrap" proposed above). If exact resize is needed, I vote for a new name instead of a keyword, e.g. setsize! which sounds more final than resize!.
Exact resize! beyond a threshold sounds good though, but I would feel more comfortable with at least 50% to get the "optimal exponential growth" (i.e. the same growth factor as push! etc.). If I remember correctly, c++ libs do this with a growth factor equal to 2: newsize = max(requestedsize, oldsize*2).

@rfourquet
Copy link
Member

Also #16305 (comment) may be relevant?

@alyst
Copy link
Contributor Author

alyst commented May 30, 2017

I've reverted the API changes and implemented 125%/200% growth thresholds heuristic discussed above. I think it should be sensitive enough to do exact resize in most of relevant cases. Let's see what nanosoldier thinks of that.

Shrinkwrap might be useful for some applications, although it's much more easy to hit memory limits with imprecise growth than with imprecise shrinkage. And definitely it would break much more code patterns than exact growth.

@alyst
Copy link
Contributor Author

alyst commented May 30, 2017

CI tests look ok, could somebody please ping nanosoldier as it doesn't listen to me?

@andreasnoack
Copy link
Member

@nanosoldier runbenchmarks(ALL, vs=":master")

@tkelman
Copy link
Contributor

tkelman commented May 31, 2017

it would make the most sense to me if the cutoff here was such that this PR didn't change the minimum amount of growth, but only the amount of space used when that minimum is exceeded

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented May 31, 2017

Why? Why does that make more sense? The point of my proposal was that if you're in the bad O(n^2) then each growth is small – of fixed size, typically. Having any 1+ε multiplicative threshold should suffice to avoid that. It doesn't make sense to me why 2x growth makes sense as the cutoff and no one has made a compelling argument for it. Feeble as my "25% is halfway between 0% and 50%" argument is, at least it is an argument.

@tkelman
Copy link
Contributor

tkelman commented May 31, 2017

2x is our array growth cutoff elsewhere, and it doesn't make sense for resize to arbitrarily use a different growth ratio.

This also resizes to a discontinuous size - ask for 24% larger and the space doubles, but ask for 25% larger and it grows by less? That doesn't seem like a useful or predictable behavior.

@yuyichao
Copy link
Contributor

25% is halfway between 0% and 50%

So it's in between a "optimum" and the worst? That actually feels kind of random, mostly the 0% part. And

This also resizes to a discontinuous size - ask for 24% larger and the space doubles, but ask for 25% larger and it grows by less? That doesn't seem like a useful or predictable behavior.

is actually also why I think the cutoff should not be lower than the grow factor.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented May 31, 2017

Ok. The continuity argument is a good one – I buy that. Thank you for actually making a case. Let's go with that then, and the appropriate growth factor can be tweaked later (or not at all).

@alyst
Copy link
Contributor Author

alyst commented May 31, 2017

Now it should be "continuous". I wonder if it helps sparse matrix tests. Nanosoldier, please?

@nalimilan
Copy link
Member

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@alyst
Copy link
Contributor Author

alyst commented May 31, 2017

I see no relevant regressions in the last nanosoldier run (no real improvements either).
Do you think we have reached the convergence? :)
Should I mention the change of resize!() behaviour in NEWS.md?

@StefanKarpinski
Copy link
Sponsor Member

A NEWS entry would be great. Then I think we're good to go here! Nice work :)

@alyst alyst changed the title [WIP/RFC] Exact array resize!() Do exact array resize!() in more cases May 31, 2017
alyst added 2 commits May 31, 2017 16:33
If the requested size of array buffer is >=200% of its current size,
grow exactly to the requested size, otherwise double the current size.
This should make resize!() & sizehint!() allocate the exactly requested
size in more situations (previously the exact resize was guaranteed for
empty arrays only). This heuristic should reduce memory usage when
arrays are growing non-incrementally (e.g. temporary buffers allocated
by LAPACK wrappers).
@alyst
Copy link
Contributor Author

alyst commented May 31, 2017

Squashed and added NEWS entry.
Thanks everyone for the lively discussion! :)

@alyst
Copy link
Contributor Author

alyst commented Jun 1, 2017

Anything else to do before merging?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants