
turn a few Lwt_list functions into a tail recursive variant #347

Merged
merged 2 commits on Jul 18, 2017

Conversation

@domsj (Contributor) commented May 5, 2017

No description provided.

let tx = f x and tl = iter_p f l in
tx >>= fun () -> tl
let iter_p f l =
let ts = List.map f l in
Collaborator

List.map is unfortunately not tail-recursive (see link). The same applies to List.mapi. For map, I think you can do something like List.rev (List.rev_map f l). For mapi, I guess a custom helper is necessary.
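For illustration, here is a minimal sketch of the workaround suggested above. The names `map_tr` and `mapi_tr` are hypothetical, and `rev_mapi` is a hand-written helper, since the stdlib has `List.rev_map` but no tail-recursive `mapi`:

```ocaml
(* Tail-recursive map built on the stdlib's List.rev_map. *)
let map_tr f l = List.rev (List.rev_map f l)

(* Tail-recursive mapi via a custom reversed-accumulator helper. *)
let mapi_tr f l =
  let rec rev_mapi i acc = function
    | [] -> acc
    | x :: rest -> rev_mapi (i + 1) (f i x :: acc) rest
  in
  List.rev (rev_mapi 0 [] l)
```

Both functions allocate an intermediate reversed list, which is the usual cost of the stack-safe approach.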

Contributor Author

Right. OK, I'll make some changes for this.

.travis.yml Outdated
@@ -38,3 +38,5 @@ notifications:
- [email protected]
on_success: always
on_failure: always

sudo: required
Collaborator

I don't think this should be included in the PR.

Contributor Author

OK, I'll remove it. Somehow I did need this to get Travis to build and run the tests on my fork of the Lwt repo...

Collaborator

BTW, you also have the option to submit a PR, tell us not to review it, and use it to trigger CI in this repo :) Not sure why sudo: required would be required (heh). Perhaps the main Lwt repo needs it now too, but we are squeezing by due to some legacy support from Travis...?

@aantron (Collaborator) commented May 5, 2017

Thanks. I'm a bit surprised that these aren't tail-recursive to begin with. For map_s, it looks like it would only matter if you had a large list of already-completed promises (the bind callback on a pending promise is not executed on the current stack) – but that is definitely a possibility. The parallel ones, in their original implementation, look like they could destroy the stack whether the promises start out completed or not.
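For context, one accumulator-based shape a tail-recursive `map_s` could take is sketched below. This assumes Lwt's standard `Lwt.Infix` `(>>=)` operator; it is an illustration of the general technique, not necessarily the exact code in this PR:

```ocaml
open Lwt.Infix

(* Sequential map with an accumulator: each element's promise is
   awaited before moving on, and the result list is reversed once
   at the end instead of being built with nested binds. *)
let map_s f l =
  let rec loop acc = function
    | [] -> Lwt.return (List.rev acc)
    | x :: rest -> f x >>= fun y -> loop (y :: acc) rest
  in
  loop [] l
```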

@vasilisp (Contributor) commented May 5, 2017

You may want to read Leroy's classic "defense" of the non-tail-recursive List.map, if you haven't already. If anything, his argument gets even stronger in the Lwt case. I would be very interested to see an application that meaningfully calls Lwt_list.map_* on a list of tens of thousands of elements :). And if you really need that, coding a custom tail-recursive version is not that big of a deal.

@rgrinberg (Contributor)

That post makes a good case for the existence of both maps, but it doesn't justify why the faster and more dangerous version should be the default. Similarly, I would be very interested to see an application where Lwt_list.map_* is the bottleneck. And coding a non-tail-recursive version is equally easy.

@rgrinberg (Contributor) commented May 6, 2017

Now let me actually contribute something to this discussion:

Perhaps we can get some inspiration from base's map function, which is both fast on small lists and tail-recursive, at the cost of being ugly: https://github.com/janestreet/base/blob/master/src/list.ml#L313-L345

@domsj (Contributor Author) commented May 8, 2017

You may want to read Leroy's classic "defense" of the non-tail-recursive List.map, if you haven't already. If anything, his argument gets even stronger in the Lwt case.

How does his argument get stronger in the lwt case?

The way I see it is different:

  • the extra stack space is used for a longer time (potentially while waiting for IO)
  • the performance argument (difference in amount of heap allocation) is less important as there will be other overhead (binds) already

@vasilisp (Contributor) commented May 8, 2017

How does his argument get stronger in the lwt case?

We would be even more inclined to optimize for short lists, because doing asynchronous things on long lists is likely to pose other challenges. But apparently there are counter-arguments. Ultimately, I have no objection to a tail-recursive implementation :).

@aantron (Collaborator) commented May 8, 2017

@domsj:

the extra stack space is used for a longer time (potentially while waiting for IO)

Just want to clarify, that if Lwt ends up waiting for I/O, the corresponding function returns the promise immediately, releasing the stack frame, etc. The stack space is therefore not used – Lwt will call the callback that continues the computation from that promise later, after the I/O completes, on a different stack frame – and that stack frame is likely to be shallower. It's still the same thread's stack, though.

A few Lwt functions on Windows currently have blocking implementations, but for those, Lwt blocks the whole thread making the call anyway.

None of the above is against tail recursion – it's just a clarification.
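A tiny illustration of that point, using Lwt's standard wait/wakeup API (the values here are arbitrary):

```ocaml
open Lwt.Infix

let () =
  let waiter, wakener = Lwt.wait () in
  (* bind on a pending promise returns immediately; no stack frame
     is held while waiting. *)
  let p = waiter >>= fun x -> Lwt.return (x + 1) in
  (* p is still pending here. The callback runs later, on the stack
     of whoever resolves the promise: *)
  Lwt.wakeup wakener 41;
  assert (Lwt_main.run p = 42)
```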

@domsj (Contributor Author) commented May 9, 2017

Thanks for the correction/clarification @aantron.

I haven't done any work yet regarding the lack of tail recursion in the List.map(i) functions.

I could, as suggested, add a custom implementation for it, based on e.g. the code in janestreet base ...
But it seems like a waste of time/energy when we could just add a dependency on one of the stdlib replacements (base, core, containers, ...) and use what's in there, right?

@aantron (Collaborator) commented May 9, 2017

We could, and I'm considering it/have considered it in the past, for other reasons as well. But, I think for now, if there is no evidence that the performance of map is a bottleneck, we can just use a non-heavily-optimized tail-recursive version, and avoid making decisions about large dependencies (which decisions also depend on things other than just map).

I'm assuming you hit an actual problem with these functions not being tail-recursive. So, I say let's solve that problem. We can use @rgrinberg's helpful comment separately later, as a reference, if we get feedback that these functions are too slow.

@aantron (Collaborator) commented May 9, 2017

Alternatively, we could add the "fastest map possible," and comment that it came from base, or wherever. If/when Lwt decides on a stdlib extension to depend on, the comment can tell us to eliminate that code in favor of the dependency, during a future refactoring.

@Drup (Member) commented May 9, 2017

Can't we regularly crash the stack by using the scheduler and a judiciously designed wait/wakeup? Like, let's say, every stacksize/2? :)
If I remember correctly, that's how it was done in core and batteries for a while, except using unsafe mutations. The unsafe mutation is already built into Lwt, in our case.

I tend to agree with @vasilisp though: the existing implementation is fairly good, and while a tail-rec version might be desirable, it shouldn't compromise efficiency on small lists (which is probably the main usage pattern).
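A sketch of that idea using Lwt.pause to bounce through the scheduler every so many elements. The name `map_s_chunked` is hypothetical, and the chunk size of 1000 is an arbitrary stand-in for "stacksize/2":

```ocaml
open Lwt.Infix

(* Sequential map that yields to the Lwt scheduler every [chunk]
   elements, so the stack is unwound periodically even when all
   promises resolve immediately. *)
let map_s_chunked ?(chunk = 1000) f l =
  let rec loop i acc = function
    | [] -> Lwt.return (List.rev acc)
    | x :: rest ->
      f x >>= fun y ->
      if i >= chunk then
        (* bounce through the scheduler, resetting the stack *)
        Lwt.pause () >>= fun () -> loop 0 (y :: acc) rest
      else
        loop (i + 1) (y :: acc) rest
  in
  loop 0 [] l
```

One design consequence: pausing changes scheduling behavior slightly, since other pending callbacks get a chance to run mid-map.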

@domsj (Contributor Author) commented May 9, 2017

We didn't have an actual problem with this ... we were looking for a segfault/stackoverflow, and in the end it turned out to be related to logging (and is fixed with #348).

@aantron (Collaborator) commented May 9, 2017

Can't we regularly crash the stack by using the scheduler and a judiciously designed wait/wakeup?

Yes, and I've hit this. There is a discussion about eliminating/discouraging/warning about that in #329.

If there is no current need to change Lwt_list, and doing it the "right way" is blocked on a bigger decision about dependencies, one option we can choose is to simply leave this PR open as a reference for the future. The discussion has been quite helpful.

@aantron (Collaborator) commented May 9, 2017

Also, FTR, I usually prefer tail recursion, even if the function is slower, unless there is evidence that's the bottleneck for enough users.

I guess when I am a user, I'd rather not worry that safe-looking functions might one day suddenly trigger segfaults, just from me calling them on mystery inputs, or scaling my application. I'd much rather deal with a performance bottleneck, since by then I'm likely already set up for finding them (i.e. doing profiling and the like). But I guess this is best resolved on a case-by-case basis.

@aantron (Collaborator) commented May 13, 2017

Alright, I decided to just finish off this PR. I'm not comfortable with these functions not being tail-recursive. That's just a source of anxiety for users, IMO.

@domsj I pushed an extra commit into your branch, to not use stdlib's List.map and List.mapi. If that's okay with you, I'll merge this PR.

We can optimize these functions later, if we get any evidence that they are a performance bottleneck. I left a comment pointing to the implementation in base linked by @rgrinberg. If necessary, we can use base as a guideline, or just use base.

In the future, we should go through all of Lwt_list and make sure everything is tail-recursive, then document that. I also noticed that the Lwt_list test suite needs some improvement.

@Drup (Member) commented May 13, 2017

@aantron I think you misunderstood: I was thinking about that as a feature, in order to avoid stack overflow. It doesn't make the function tail rec, but it doesn't matter if the stack is reset regularly.

Also, I'm sorry, but all the new versions are really much, much slower. Good defaults are also about performance. You are very careful about semantic changes, so I'm surprised it doesn't bother you to make basic Lwt functions several times slower.

@aantron (Collaborator) commented May 13, 2017

I think you misunderstood: I was thinking about that as a feature, in order to avoid stack overflow.

I think I did, thanks for pointing it out. I guess you mean forcing the loop through Lwt's queue once every so many iterations, which would indeed prevent stack overflow. That seems like a pretty funny hack, though. Maybe that's what we should do, but then, why the "was" in

was done in core and batteries for a while


Also, I'm sorry, but all the new versions are really much, much slower. Good defaults are also about performance. You are very careful about semantic changes, so I'm surprised it doesn't bother you to make basic Lwt functions several times slower.

It does bother me, but presumably, we can optimize tail-recursive functions for performance to some extent. Perhaps it is better to measure the various options first and do a survey? Perhaps the result should go into a blog post, or one is already available?

At the same time, it's not clear to me that, in a bigger context, even if these functions are much slower, they will often become a bottleneck. As I said above, I'd much rather have someone decide that "Lwt_list sucks because it's too slow" while they're already optimizing their code for performance, than decide "Lwt_list sucks because it blew up my stack" when they did not expect it, and it took them hours or days to find out Lwt_list was the cause (or even that it wasn't the cause, as in this PR).

In either case, the user is forced to write an alternative implementation (at least in the near term), but only the second one is graceless, catastrophic, and unpredictable. Avoiding that seems like a more basic need than performance.

@rgrinberg (Contributor) commented May 13, 2017 via email

@Drup (Member) commented May 13, 2017

That seems like a pretty funny hack, though. Maybe that's what we should do, but then, why the "was"

I think it's quite a good idea, actually. It's still done in batteries. Not sure about core. The main problem is that it's not type-safe ... but unlike batteries, we already have a mechanism for write-once references: Lwt.t. We don't need any additional Obj.magic :p

At the least, it's worth a try, given that it should have good properties with respect to small lists.

@c-cube (Collaborator) commented May 13, 2017

Note that you can mix direct style (for a max number of iterations) and then the safer, slower tailrec version. This is a good mix in my experience with normal blocking code on lists. I can show examples if you wish.
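A plain-OCaml sketch of that mix: direct (non-tail-recursive) recursion up to a cutoff, then a switch to the tail-recursive List.rev_map path. The name `map_hybrid` and the cutoff of 1000 are illustrative choices, not anyone's actual implementation:

```ocaml
(* Hybrid map: fast direct recursion for a bounded prefix, then a
   stack-safe tail-recursive fallback for the rest of the list. *)
let map_hybrid f l =
  let cutoff = 1000 in
  let rec direct i = function
    | [] -> []
    | (_ :: _) as rest when i >= cutoff ->
      (* beyond the cutoff, switch to the tail-recursive path *)
      List.rev (List.rev_map f rest)
    | x :: rest ->
      let y = f x in  (* bind first to keep left-to-right effects *)
      y :: direct (i + 1) rest
  in
  direct 0 l
```

The `let y = f x` binding matters: evaluating `f x :: direct ...` directly would leave the order of side effects unspecified.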

@aantron (Collaborator) commented May 14, 2017

@c-cube: this actually is for blocking code on lists, at least as written in the first commit in this PR. So, I guess whatever experience you have with that applies here.

@aantron (Collaborator) commented May 17, 2017

Before 3.1.0 (mid-July), we will find out which tail-recursive implementation is the fastest, or fast enough, switch to it, and ideally document it in an article or a blog post, so this work does not have to be done again for a while. Lwt APIs should not be a source of uncertainty for users.

@c-cube (Collaborator) commented Jun 1, 2017

My experience is mostly playing with such functions in containers (where I provide safe variants of some List functions).

map: https://github.com/c-cube/ocaml-containers/blob/master/src/core/CCList.ml#L23
benchmark for map: https://github.com/c-cube/ocaml-containers/blob/master/benchs/run_benchs.ml#L23

the results of benchmarks on 4.04.0+flambda: https://paste.isomorphis.me/p0K&raw
(in the last case, with lists of length 500,000, List.map is fast because it actually fails). I think CCList.map is faster on small inputs than List.map for inlining reasons; I will check. The idea, though, is that CCList.map is roughly as fast on small lists, and behaves correctly on big lists.

@aantron force-pushed the master branch 3 times, most recently from b8b3168 to 0773186 on June 8, 2017 22:53
@jsthomas (Contributor) commented Jul 8, 2017

FWIW, I've done some experiments benchmarking different implementations of map and wrote up my results here.

To summarize the relevant parts of that writeup, the data suggested that:

  • A naive tail recursive map is significantly slower than a direct implementation; the direct version takes 60-80% of the time needed by the tail-recursive version to do the same work, depending on the length of the list. However, one can get a much faster map that doesn't sacrifice safety by using a "loop unrolled" direct implementation for a fixed prefix (say, up to 1000 elements), and then switching over to the tail recursive implementation for the rest (like @c-cube explained above).
  • Compared to the cost of performing IO, the amortized cost per element of using even the slow naive tail recursive implementation is small.

@aantron (Collaborator) commented Jul 8, 2017

Thanks @jsthomas!

My summary of the findings:

  • We can implement a map that uses a fast, stdlib-like implementation for, say, the first 1000 elements, and switches to a tail-recursive implementation for larger lists. This will address the main problem in this PR. There will only be a window during which this map is slower than stdlib's: below 1000 elements, it should be as fast, and above around 100,000 elements (depending on stack size), stdlib's map doesn't work anyway (rather catastrophically).
  • Optimizing the performance of this map on the first 1000 elements, relative to stdlib, by, for instance, loop unrolling, effectively narrows the bottom bound of this window. That is, by handling the first 1000 elements faster, we increase the total number of elements on which this hybrid map becomes slower than stdlib, inside the tail-recursive suffix. We may also be able to optimize the tail-recursive portion, but I guess this is an open question. Loop unrolling also turns 1000 stack frames into 1000/K stack frames, so besides giving performance, it allows varying the length of the non-tail-recursive prefix.

What I propose we do with this right now, however, is implement naive tail-recursion, and put a link in a comment by it to the message from @jsthomas. If we really have users for whom CPU/memory performance of Lwt_list is the bottleneck, rather than I/O, we then have at least one already-researched way to optimize the functions. But we can do that on demand.

@aantron self-assigned this Jul 18, 2017
@aantron (Collaborator) commented Jul 18, 2017

We discussed further on IRC, and there are even more optimizations to potentially make.

For now, I've rebased this PR to resolve merge conflicts, and we are merging the naive tail-recursive implementation.

@aantron merged commit 540ac3d into ocsigen:master on Jul 18, 2017
@domsj deleted the lwt-list-tail-recursive branch on July 19, 2017 13:21
aantron added a commit that referenced this pull request Aug 8, 2017
The annotation is added to functions that were modified in #347. It will
be added to other functions in Lwt_list during a future thorough review
of that module.

[@ocaml.tailcall] was introduced in OCaml 4.03.0. On 4.02, the
annotation is silently ignored, so it is safe. Working on 4.03.0 and
higher is enough for any problems with tail recursion to be caught in
CI.
aantron referenced this pull request Oct 26, 2017
Before this commit, if Lwt_switch.turn_off had multiple hooks to call,
and one of the hooks raised an exception, the thread created by turn_off
would not wait for the remaining hook threads to complete.
7 participants