
turn a few Lwt_list functions into a tail recursive variant #347

Merged
merged 2 commits on Jul 18, 2017

Conversation

@domsj (Contributor) commented May 5, 2017

No description provided.

let tx = f x and tl = iter_p f l in
tx >>= fun () -> tl
let iter_p f l =
let ts = List.map f l in
Collaborator

List.map is unfortunately not tail-recursive (see link). The same applies to List.mapi. For map, I think you can do something like List.rev (List.rev_map f l). For mapi, I guess a custom helper is necessary.
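For illustration, here is a minimal sketch of the workaround suggested above. The names `map_tr` and `mapi_tr` are hypothetical, and `rev_mapi` is a hand-written helper, since the stdlib has `List.rev_map` but no tail-recursive `mapi`:

```ocaml
(* Tail-recursive map built on the stdlib's List.rev_map. *)
let map_tr f l = List.rev (List.rev_map f l)

(* Tail-recursive mapi via a custom reversed-accumulator helper. *)
let mapi_tr f l =
  let rec rev_mapi i acc = function
    | [] -> acc
    | x :: rest -> rev_mapi (i + 1) (f i x :: acc) rest
  in
  List.rev (rev_mapi 0 [] l)
```

Both functions allocate an intermediate reversed list, which is the usual cost of the stack-safe approach.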

Contributor Author

Right. OK, I'll make some changes for this.

.travis.yml Outdated
@@ -38,3 +38,5 @@ notifications:
- [email protected]
on_success: always
on_failure: always

sudo: required
Collaborator

I don't think this should be included in the PR.

Contributor Author

OK, I'll remove it. Somehow I did need this to get Travis to build and run the tests on my fork of the Lwt repo...

Collaborator

BTW, you also have the option to submit a PR, tell us not to review it, and use it to trigger CI in this repo :) Not sure why sudo: required would be required (heh). Perhaps the main Lwt repo needs it now too, but we are squeezing by due to some legacy support from Travis...?

@aantron (Collaborator) commented May 5, 2017

Thanks. I'm a bit surprised that these aren't tail-recursive to begin with. For map_s, it looks like it would only matter if you had a large list of already-completed promises (the bind callback on a pending promise is not executed on the current stack) – but that is definitely a possibility. The parallel ones, in their original implementation, look like they could destroy the stack whether the promises start out completed or not.
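For context, one accumulator-based shape a tail-recursive `map_s` could take is sketched below. This assumes Lwt's standard `Lwt.Infix` `(>>=)` operator; it is an illustration of the general technique, not necessarily the exact code in this PR:

```ocaml
open Lwt.Infix

(* Sequential map with an accumulator: each element's promise is
   awaited before moving on, and the result list is reversed once
   at the end instead of being built with nested binds. *)
let map_s f l =
  let rec loop acc = function
    | [] -> Lwt.return (List.rev acc)
    | x :: rest -> f x >>= fun y -> loop (y :: acc) rest
  in
  loop [] l
```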

@vasilisp (Contributor) commented May 5, 2017

You may want to read Leroy's classic "defense" of the non-tail-recursive List.map, if you haven't already. If anything, his argument gets even stronger in the Lwt case. I would be very interested to see an application that meaningfully calls Lwt_list.map_* on a list of tens of thousands of elements :). And if you really need that, coding a custom tail-recursive version is not that big of a deal.

@rgrinberg (Contributor)

That post makes a good case for the existence of both maps, but it doesn't justify why the faster and more dangerous version should be the default. Similarly, I would be very interested to see an application where Lwt_list.map_* is the bottleneck. And coding a non-tail-recursive version is equally easy.

@rgrinberg (Contributor) commented May 6, 2017

Now let me actually contribute something to this discussion:

Perhaps we can get some inspiration from base's map function, which is both fast on small lists and tail-recursive, at the cost of being ugly: https://github.com/janestreet/base/blob/master/src/list.ml#L313-L345

@domsj (Contributor Author) commented May 8, 2017

You may want to read Leroy's classic "defense" of the non-tail-recursive List.map, if you haven't already. If anything, his argument gets even stronger in the Lwt case.

How does his argument get stronger in the lwt case?

The way I see it is different:

  • the extra stack space is used for a longer time (potentially while waiting for IO)
  • the performance argument (difference in amount of heap allocation) is less important as there will be other overhead (binds) already

@vasilisp (Contributor) commented May 8, 2017

How does his argument get stronger in the lwt case?

We would be even more inclined to optimize for short lists, because doing asynchronous things on long lists is likely to pose other challenges. But apparently there are counter-arguments. Ultimately, I have no objection to a tail-recursive implementation :).

@aantron (Collaborator) commented May 8, 2017

@domsj:

the extra stack space is used for a longer time (potentially while waiting for IO)

Just want to clarify, that if Lwt ends up waiting for I/O, the corresponding function returns the promise immediately, releasing the stack frame, etc. The stack space is therefore not used – Lwt will call the callback that continues the computation from that promise later, after the I/O completes, on a different stack frame – and that stack frame is likely to be shallower. It's still the same thread's stack, though.

A few Lwt functions on Windows currently have blocking implementations, but for those, Lwt blocks the whole thread making the call anyway.

None of the above is against tail recursion – it's just a clarification.
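A tiny illustration of that point, using Lwt's standard wait/wakeup API (the values here are arbitrary):

```ocaml
open Lwt.Infix

let () =
  let waiter, wakener = Lwt.wait () in
  (* bind on a pending promise returns immediately; no stack frame
     is held while waiting. *)
  let p = waiter >>= fun x -> Lwt.return (x + 1) in
  (* p is still pending here. The callback runs later, on the stack
     of whoever resolves the promise: *)
  Lwt.wakeup wakener 41;
  assert (Lwt_main.run p = 42)
```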

@domsj (Contributor Author) commented May 9, 2017

Thanks for the correction/clarification @aantron.

I haven't done any work yet regarding the lack of tail recursion in the List.map(i) functions.

I could, as suggested, add a custom implementation for it, based on e.g. the code in janestreet base ...
But it seems like a waste of time/energy when we could just add a dependency on one of the stdlib replacements (base, core, containers, ...) and use what's in there, right?

@aantron (Collaborator) commented May 9, 2017

We could, and I'm considering it/have considered it in the past, for other reasons as well. But, I think for now, if there is no evidence that the performance of map is a bottleneck, we can just use a non-heavily-optimized tail-recursive version, and avoid making decisions about large dependencies (which decisions also depend on things other than just map).

I'm assuming you hit an actual problem with these functions not being tail-recursive. So, I say let's solve that problem. We can use @rgrinberg's helpful comment separately later, as a reference, if we get feedback that these functions are too slow.

@aantron (Collaborator) commented May 9, 2017

Alternatively, we could add the "fastest map possible," and comment that it came from base, or wherever. If/when Lwt decides on a stdlib extension to depend on, the comment can tell us to eliminate that code in favor of the dependency, during a future refactoring.

@Drup (Member) commented May 9, 2017

Can't we regularly crash the stack by using the scheduler and a judiciously designed wait/wakeup? Like, let's say, every stacksize/2? :)
If I remember correctly, that's how it was done in core and batteries for a while, except using unsafe mutations. The unsafe mutation is already built into Lwt, in our case.

I tend to agree with @vasilisp though: the existing implementation is fairly good, and while a tail-rec version might be desirable, it shouldn't compromise efficiency on small lists (which is probably the main usage pattern).
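A sketch of that idea using Lwt.pause to bounce through the scheduler every so many elements. The name `map_s_chunked` is hypothetical, and the chunk size of 1000 is an arbitrary stand-in for "stacksize/2":

```ocaml
open Lwt.Infix

(* Sequential map that yields to the Lwt scheduler every [chunk]
   elements, so the stack is unwound periodically even when all
   promises resolve immediately. *)
let map_s_chunked ?(chunk = 1000) f l =
  let rec loop i acc = function
    | [] -> Lwt.return (List.rev acc)
    | x :: rest ->
      f x >>= fun y ->
      if i >= chunk then
        (* bounce through the scheduler, resetting the stack *)
        Lwt.pause () >>= fun () -> loop 0 (y :: acc) rest
      else
        loop (i + 1) (y :: acc) rest
  in
  loop 0 [] l
```

One design consequence: pausing changes scheduling behavior slightly, since other pending callbacks get a chance to run mid-map.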

@domsj (Contributor Author) commented May 9, 2017

We didn't have an actual problem with this ... we were looking for a segfault/stackoverflow, and in the end it turned out to be related to logging (and is fixed with #348).

@aantron (Collaborator) commented May 9, 2017

Can't we regularly crash the stack by using the scheduler and a judiciously designed wait/wakeup?

Yes, and I've hit this. There is a discussion about eliminating/discouraging/warning about that in #329.

If there is no current need to change Lwt_list, and doing it the "right way" is blocked on a bigger decision about dependencies, one option we can choose is to simply leave this PR open as a reference for the future. The discussion has been quite helpful.

@aantron (Collaborator) commented May 9, 2017

Also, FTR, I usually prefer tail recursion, even if the function is slower, unless there is evidence that's the bottleneck for enough users.

I guess when I am a user, I'd rather not worry that safe-looking functions might one day suddenly trigger segfaults, just from me calling them on mystery inputs, or scaling my application. I'd much rather deal with a performance bottleneck, since by then I'm likely already set up for finding them (i.e. doing profiling and the like). But I guess this is best resolved on a case-by-case basis.

@aantron (Collaborator) commented May 13, 2017

Alright, I decided to just finish off this PR. I'm not comfortable with these functions not being tail-recursive. That's just a source of anxiety for users, IMO.

@domsj I pushed an extra commit into your branch, to not use stdlib's List.map and List.mapi. If that's okay with you, I'll merge this PR.

We can optimize these functions later, if we get any evidence that they are a performance bottleneck. I left a comment pointing to the implementation in base linked by @rgrinberg. If necessary, we can use base as a guideline, or just use base.

In the future, we should go through all of Lwt_list and make sure everything is tail-recursive, then document that. I also noticed that the Lwt_list test suite needs some improvement.

@Drup (Member) commented May 13, 2017

@aantron I think you misunderstood: I was thinking about that as a feature, in order to avoid stack overflow. It doesn't make the function tail rec, but it doesn't matter if the stack is reset regularly.

Also, I'm sorry, but all the new versions are really much, much slower. Good defaults are also about performance. You are very careful about semantic changes, so I'm surprised it doesn't bother you to make basic Lwt functions several times slower.

@aantron (Collaborator) commented May 13, 2017

I think you misunderstood: I was thinking about that as a feature, in order to avoid stack overflow.

I think I did, thanks for pointing it out. I guess you mean forcing the loop through Lwt's queue once every so many iterations, which would indeed prevent stack overflow. That seems like a pretty funny hack, though. Maybe that's what we should do, but then, why the "was" in

was done in core and batteries for a while


Also, I'm sorry, but all the new versions are really much, much slower. Good defaults are also about performance. You are very careful about semantic changes, so I'm surprised it doesn't bother you to make basic Lwt functions several times slower.

It does bother me, but presumably, we can optimize tail-recursive functions for performance to some extent. Perhaps it is better to measure the various options first and do a survey? Perhaps the result should go into a blog post, or one is already available?

At the same time, it's not clear to me that, in a bigger context, even if these functions are much slower, they will often become a bottleneck. As I said above, I'd much rather have someone decide that "Lwt_list sucks because it's too slow" while they're already optimizing their code for performance, than decide "Lwt_list sucks because it blew up my stack" when they did not expect it, and it took them hours or days to find out Lwt_list was the cause (or even that it wasn't the cause, as in this PR).

In either case, the user is forced to write an alternative implementation (at least in the near term), but only the second one is graceless, catastrophic, and unpredictable. Avoiding that seems like a more basic need than performance.

@rgrinberg (Contributor) commented May 13, 2017 via email

@Drup (Member) commented May 13, 2017

That seems like a pretty funny hack, though. Maybe that's what we should do, but then, why the "was"

I think it's quite a good idea, actually. It's still done in batteries. Not sure about core. The main problem is that it's not type-safe ... but unlike batteries, we already have a mechanism for write-once references: Lwt.t. We don't need any additional Obj.magic :p

At the least, it's worth a try, given that it should have good properties with respect to small lists.

@c-cube (Collaborator) commented May 13, 2017

Note that you can mix direct style (for a max number of iterations) and then the safer, slower tailrec version. This is a good mix in my experience with normal blocking code on lists. I can show examples if you wish.
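A plain-OCaml sketch of that mix: direct (non-tail-recursive) recursion up to a cutoff, then a switch to the tail-recursive List.rev_map path. The name `map_hybrid` and the cutoff of 1000 are illustrative choices, not anyone's actual implementation:

```ocaml
(* Hybrid map: fast direct recursion for a bounded prefix, then a
   stack-safe tail-recursive fallback for the rest of the list. *)
let map_hybrid f l =
  let cutoff = 1000 in
  let rec direct i = function
    | [] -> []
    | (_ :: _) as rest when i >= cutoff ->
      (* beyond the cutoff, switch to the tail-recursive path *)
      List.rev (List.rev_map f rest)
    | x :: rest ->
      let y = f x in  (* bind first to keep left-to-right effects *)
      y :: direct (i + 1) rest
  in
  direct 0 l
```

The `let y = f x` binding matters: evaluating `f x :: direct ...` directly would leave the order of side effects unspecified.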

@aantron (Collaborator) commented May 14, 2017

@c-cube: this actually is for blocking code on lists, at least as written in the first commit in this PR. So, I guess whatever experience you have with that applies here.

@aantron (Collaborator) commented May 17, 2017

Before 3.1.0 (mid-July), we will find out which tail-recursive implementation is the fastest, or fast enough, switch to it, and ideally document it in an article or a blog post, so this work does not have to be done again for a while. Lwt APIs should not be a source of uncertainty for users.

@c-cube (Collaborator) commented Jun 1, 2017

My experience is mostly playing with such functions in containers (where I provide safe variants of some List functions).

map: https://github.com/c-cube/ocaml-containers/blob/master/src/core/CCList.ml#L23
benchmark for map: https://github.com/c-cube/ocaml-containers/blob/master/benchs/run_benchs.ml#L23

the results of benchmarks on 4.04.0+flambda: https://paste.isomorphis.me/p0K&raw
(in the last case, with lists of length 500,000, List.map is fast because it actually fails). I think CCList.map is faster on small inputs than List.map for inlining reasons; I will check. The idea, though, is that CCList.map is roughly as fast on small lists, and behaves correctly on big lists.

@aantron force-pushed the master branch 3 times, most recently from b8b3168 to 0773186 on June 8, 2017 22:53
@jsthomas (Contributor) commented Jul 8, 2017

FWIW, I've done some experiments benchmarking different implementations of map and wrote up my results here.

To summarize the relevant parts of that writeup, the data suggested that:

  • A naive tail recursive map is significantly slower than a direct implementation; the direct version takes 60-80% of the time needed by the tail-recursive version to do the same work, depending on the length of the list. However, one can get a much faster map that doesn't sacrifice safety by using a "loop unrolled" direct implementation for a fixed prefix (say, up to 1000 elements), and then switching over to the tail recursive implementation for the rest (like @c-cube explained above).
  • Compared to the cost of performing IO, the amortized cost per element of using even the slow naive tail recursive implementation is small.

@aantron (Collaborator) commented Jul 8, 2017

Thanks @jsthomas!

My summary of the findings:

  • We can implement a map that uses a fast, stdlib-like implementation for, say, the first 1000 elements, and switches to a tail-recursive implementation for larger lists. This will address the main problem in this PR. There will only be a window during which this map is slower than stdlib's: below 1000 elements, it should be as fast, and above around 100,000 elements (depending on stack size), stdlib's map doesn't work anyway (rather catastrophically).
  • Optimizing the performance of this map on the first 1000 elements, relative to stdlib, by, for instance, loop unrolling, effectively narrows the bottom bound of this window. That is, by handling the first 1000 elements faster, we increase the total number of elements on which this hybrid map becomes slower than stdlib, inside the tail-recursive suffix. We may also be able to optimize the tail-recursive portion, but I guess this is an open question. Loop unrolling also turns 1000 stack frames into 1000/K stack frames, so besides giving performance, it allows varying the length of the non-tail-recursive prefix.

What I propose we do with this right now, however, is implement naive tail-recursion, and put a link in a comment by it to the message from @jsthomas. If we really have users for whom CPU/memory performance of Lwt_list is the bottleneck, rather than I/O, we then have at least one already-researched way to optimize the functions. But we can do that on demand.

@aantron self-assigned this Jul 18, 2017
@aantron (Collaborator) commented Jul 18, 2017

We discussed further on IRC, and there are even more optimizations to potentially make.

For now, I've rebased this PR to resolve merge conflicts, and we are merging the naive tail-recursive implementation.

@aantron merged commit 540ac3d into ocsigen:master on Jul 18, 2017
@domsj deleted the lwt-list-tail-recursive branch on July 19, 2017 13:21
aantron added a commit that referenced this pull request Aug 8, 2017
The annotation is added to functions that were modified in #347. It will
be added to other functions in Lwt_list during a future thorough review
of that module.

[@ocaml.tailcall] was introduced in OCaml 4.03.0. On 4.02, the
annotation is silently ignored, so it is safe. Working on 4.03.0 and
higher is enough for any problems with tail recursion to be caught in
CI.
aantron referenced this pull request Oct 26, 2017
Before this commit, if Lwt_switch.turn_off had multiple hooks to call,
and one of the hooks raised an exception, the thread created by turn_off
would not wait for the remaining hook threads to complete.
7 participants