excessive 2-message latency #45

Closed
JeffBezanson opened this issue Jun 14, 2011 · 28 comments · Fixed by #6302
Labels
performance (Must go faster)

Comments

@JeffBezanson

This says it all:

julia> tic();println(remote_call_fetch(2,+,1,1));toc()
2
elapsed time: 0.0006420612335205 sec
julia> tic();println(fetch(remote_call(2,+,1,1)));toc()
2
elapsed time: 0.0400099754333496 sec

This is all between two processes on the same machine. A single message roundtrip (request, response) is fast. But if we do the same thing in two messages (send call, send fetch, get response) it takes 60x longer. This is crazy. And there is still only one round trip, just a single extra one-way message from p1 to p2.
We already tried setting TCP_NODELAY and it did not make any difference. So far I'm stumped.
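
For reference, here is one way to probe this outside Julia's messaging code. This is only a sketch, written against the modern Sockets stdlib (the nagle/quickack helpers did not exist at the time): it sends the same two-small-writes-then-read pattern over a plain socket, to see whether the delay comes from the Nagle / delayed-ACK interaction rather than from the RPC layer itself.

```julia
# Minimal sketch, assuming modern Julia and the Sockets stdlib: reproduce the
# "two small writes, then read" pattern with Nagle disabled and (on Linux)
# quick ACKs requested.
using Sockets

port = 8091                            # arbitrary local port for the experiment
server = listen(port)
@async begin
    conn = accept(server)
    Sockets.nagle(conn, false)         # TCP_NODELAY on the responding side too
    Sockets.quickack(conn, true)       # Linux only: ACK immediately, no delay
    req = Vector{UInt8}(undef, 33)     # 18 + 15 bytes, like the strace below
    read!(conn, req)                   # wait for both "messages"
    write(conn, zeros(UInt8, 24))      # reply with a 24-byte "result"
end

sock = connect("127.0.0.1", port)
Sockets.nagle(sock, false)             # TCP_NODELAY: don't coalesce small writes
Sockets.quickack(sock, true)           # Linux only: ACK immediately, no delay

@time begin
    write(sock, zeros(UInt8, 18))      # the "call" message
    write(sock, zeros(UInt8, 15))      # the "fetch" message
    resp = Vector{UInt8}(undef, 24)
    read!(sock, resp)                  # wait for the "result"
end
```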

@JeffBezanson

I straced both sides and the output is not exactly shocking:

p1:

with call_fetch:

write(4, "\2\ncall_fetch\24\2O\7\23\0\0\0\23\0\2\1+\24\2OO", 29) = 29
select(11, [3 4 5 6 7 8 9 10], NULL, NULL, {10, 0}) = 1 (in [4], left {9, 999997})
read(4, "\2\6result\2\5fetch\24\2O\7\23\0\0\0P", 8191) = 24

with call then fetch:

write(4, "'\24\2O\7\24\0\0\0\23\0\2\1+\24\2OO", 18) = 18
write(4, "\2\5fetch\24\2O\7\24\0\0\0", 15) = 15
select(11, [3 4 5 6 7 8 9 10], NULL, NULL, {10, 0}) = 1 (in [4], left {9, 961472})
read(4, "\2\6result\2\5fetch\24\2O\7\24\0\0\0P", 8191) = 24

p2:

with call_fetch:

select(15, [0 11 12 13 14], NULL, NULL, {10, 0}) = 1 (in [0], left {8, 622846})
read(0, "\2\ncall_fetch\24\2O\7\26\0\0\0\23\0\2\1+\24\2OO", 8191) = 29
select(15, [0 11 12 13 14], NULL, NULL, {0, 0}) = 0 (Timeout)
write(0, "\2\6result\2\5fetch\24\2O\7\26\0\0\0P", 24) = 24

with call then fetch:

select(15, [0 11 12 13 14], NULL, NULL, {10, 0}) = 1 (in [0], left {8, 853315})
read(0, "'\24\2O\7\27\0\0\0\23\0\2\1+\24\2OO", 8191) = 18
select(15, [0 11 12 13 14], NULL, NULL, {0, 0}) = 0 (Timeout)
select(15, [0 11 12 13 14], NULL, NULL, {10, 0}) = 1 (in [0], left {9, 961106})
read(0, "\2\5fetch\24\2O\7\27\0\0\0", 8191) = 15
write(0, "\2\6result\2\5fetch\24\2O\7\27\0\0\0P", 24) = 24

I can get rid of the 0-timeout selects by avoiding select entirely when the workqueue is non-empty, but this makes no difference.

@ViralBShah

On the Mac I see reasonable results. Perhaps it has something to do with some buffer setting in the Linux kernel, one of those /proc entries?

julia> tic();println(remote_call_fetch(2,+,1,1));toc()
2
elapsed time: 0.0008530616760254 sec
0.0008530616760254

julia> tic();println(fetch(remote_call(2,+,1,1)));toc()
2
elapsed time: 0.00111985206604 sec
0.00111985206604

Darwin The-Surfing-Burrito.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386

@ViralBShah

Tried on 64-bit Linux (Opteron, not that it should matter) and I see the same 60x slowdown.

Linux neumann.cs.ucsb.edu 2.6.18-8.1.6.el5 #1 SMP Thu Jun 14 17:29:04 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

@ViralBShah

@StefanKarpinski

I'm not sure any of those are the issue; I think this is something odder. Maybe one of these Linux-specific TCP bugs? Unfortunately, I could only find the Google-cached version.

@StefanKarpinski

Jeff, I assume you're holding the TCP connection open, right? If so, can you try doing the latter in a loop? If only the first one is slow, then I'm more inclined to finger slow start as the culprit.

@JeffBezanson

When I loop it N times it always seems to take exactly N times longer:

julia> tic();for i=1:100;fetch(remote_call(2,+,1,1));end;toc()
elapsed time: 3.9971921443939209 sec

@StefanKarpinski

And looping the first one is proportionally faster? Can you confirm that the same TCP connection gets used throughout?

@JeffBezanson

Looping the first one 100x seems to take only 50x longer. Interesting. It's almost exactly like somebody inserted a sleep call in the second case.
I'm doing all the socket stuff manually so I'm pretty sure there are no new connections. Wouldn't that show up in the strace output?

@StefanKarpinski

Yeah, it should show up in the strace output since doing anything with opening or closing sockets requires a system call.

@JeffBezanson

This problem is officially shitty.

@JeffBezanson

I have a new idea for fixing this: implement timer events, and use them to combine messages sent within, say, 2ms of each other into a single write. Then at least on the sending side both cases will have the same strace signature.
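
In sketch form (hypothetical names, not the actual multi.jl code): outgoing messages accumulate in an IOBuffer, and a short one-shot timer flushes everything pending to the socket in a single write, so a call followed immediately by a fetch leaves the process as one write.

```julia
# Minimal sketch of timer-based message aggregation (hypothetical names):
# queued bytes are flushed by a ~2 ms one-shot Timer in a single write.
mutable struct AggregatingWriter
    io::IO                       # the underlying socket
    buf::IOBuffer                # bytes waiting to be written
    timer::Union{Timer,Nothing}  # pending flush, if any
end
AggregatingWriter(io::IO) = AggregatingWriter(io, IOBuffer(), nothing)

function send_msg(w::AggregatingWriter, bytes::Vector{UInt8})
    write(w.buf, bytes)
    if w.timer === nothing               # arm a one-shot flush timer
        w.timer = Timer(0.002) do _
            flush_pending(w)
        end
    end
    return nothing
end

function flush_pending(w::AggregatingWriter)
    w.timer === nothing || close(w.timer)
    w.timer = nothing
    write(w.io, take!(w.buf))            # one write covers all queued messages
    return nothing
end
```

A remotecall followed by a fetch would then hit send_msg twice within the 2ms window and go out as a single write, giving both cases the same on-the-wire shape.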

@StefanKarpinski

Ugh. This is such a hack. Not a complaint about the approach really, but it's just so fucked up that this has to be done. Why is the Linux TCP implementation this buggy?

@ViralBShah

I have always had problems with latency on Linux, for as long as I can remember. For the last 10 years one has always had to look for kernel patches and /proc entries, and nothing really addresses it.


@JeffBezanson

It might be a good thing to do anyway; delaying and aggregating messages is pretty common.

ghost assigned JeffBezanson on Jun 28, 2011
@ViralBShah

I think this should be moved to 2.0.

@JeffBezanson

OK, get excited. I just tried this and it actually worked:

julia> @time println(remote_call_fetch(2,+,1,1))
2
elapsed time: 0.00174689292907715 sec
julia> @time println(fetch(remote_call(2,+,1,1)))
2
elapsed time: 0.00170516967773438 sec

@JeffBezanson

Turns out we can do even better by doing message aggregation only on the requesting side:

julia> @time println(remote_call_fetch(2,+,1,1))
2
elapsed time: 0.00072383880615234 sec
julia> @time println(fetch(remote_call(2,+,1,1)))
2
elapsed time: 0.00075888633728027 sec
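
In other words (same hedged sketch as before, not the real implementation): only the requesting side batches its messages; the worker answers with a plain, immediate write, so no delay is added on the responding side.

```julia
# Sketch of requester-side-only aggregation (illustrative names and sizes):
# the two request messages go out in a single write, and the reply is read
# back directly because the worker writes it without buffering.
function request_and_fetch(sock::IO, call_bytes::Vector{UInt8}, fetch_bytes::Vector{UInt8})
    write(sock, vcat(call_bytes, fetch_bytes))  # one write, one TCP segment
    reply = Vector{UInt8}(undef, 24)            # 24-byte result, as in the strace
    read!(sock, reply)
    return reply
end
```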

@StefanKarpinski

Which commit fixed this?

@JeffBezanson

Not committed yet.

JeffBezanson added a commit that referenced this issue Jul 20, 2011
delaying and aggregating messages to reduce the number of writes.

probably fixing issue #110
the problem was doing I/O during GC in the RemoteRef finalizer.
obviously this writes data at a random time and really screws things up.

this is an epic commit.
vtjnash added a commit that referenced this issue Jan 8, 2013
JeffBezanson reopened this on Feb 8, 2013
@JeffBezanson

This performance problem is back. The fix needs to be restored within the new I/O system.

@vtjnash

vtjnash commented Feb 8, 2013

I think we had dropped the distinction between send_msg and send_msg_now when IOBuffer didn't exist. This can probably be restored fairly easily now.
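
Roughly, the split could look like this (the names send_msg / send_msg_now come from the comment above; the bodies here are an assumption, not the actual multi.jl code): deferred sends queue bytes in an IOBuffer to be flushed later, while "now" sends flush everything pending plus the new message in one write.

```julia
# Sketch of the deferred vs. immediate send split (assumed bodies).
struct WorkerConn
    sock::IO
    pending::IOBuffer
end
WorkerConn(sock::IO) = WorkerConn(sock, IOBuffer())

send_msg(c::WorkerConn, payload::Vector{UInt8}) =
    write(c.pending, payload)                 # defer; coalesced into a later write

function send_msg_now(c::WorkerConn, payload::Vector{UInt8})
    write(c.pending, payload)
    write(c.sock, take!(c.pending))           # flush pending + this message together
    return nothing
end
```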

@JeffBezanson

See also f9637a2. Threads may be necessary to make this work properly.

@amitmurthy

I used wireshark to capture network packets.

For

julia> tic(); remotecall_fetch(2,+,1,1);  toc()
elapsed time: 0.001487489 seconds
0.001487489

the capture showed:

(Wireshark capture for the remotecall_fetch case)

For

julia> tic();fetch(remotecall(2,+,1,1));toc()
elapsed time: 0.040404147 seconds
0.040404147

the capture showed:

(Wireshark capture for the remotecall + fetch case)

Is it possible it is not a network issue? In the second case the response to the first packet is arriving after 0.039 seconds, and the request-response for the second packet hardly takes any time.

Does it have anything to do with creating a RemoteRef in the second case (which is not required with remotecall_fetch)?

NOTE: The term pichat refers to port 9009 in the images. Apparently 9009 is the standard port for some P2P application called pichat.

@amitmurthy

This is a really weird bug. Notice the timings below:

julia> tic();a=remotecall(2,+,1,1);toc();tic();remotecall_fetch(2,+,1,1);toc();tic();fetch(a);toc()
elapsed time: 0.001218872 seconds
elapsed time: 0.038245016 seconds
elapsed time: 0.000710215 seconds
0.000710215

The remotecall_fetch() immediately after the remotecall() is slow!

and

julia> tic();a=remotecall_fetch(2,+,1,1);toc();tic();remotecall_fetch(2,+,1,1);toc();tic();remotecall_fetch(2,+,1,1);toc();
elapsed time: 0.001811197 seconds
elapsed time: 0.00117068 seconds
elapsed time: 0.001172147 seconds

No such issues here.

@JeffBezanson

I was able to fix this before by aggregating messages into a single write using a timer and an I/O thread.

@ViralBShah

I don't notice any of these on the Mac.

@WestleyArgentum

This is a bit of a shot in the dark, but is it at all possible that this is related to #2816?

I believe the discovery there was an unnecessary buffer copy in a very low-level function right above the TCP stack.

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018
LilithHafner pushed a commit to LilithHafner/julia that referenced this issue Oct 11, 2021
Keno pushed a commit that referenced this issue Oct 9, 2023
topolarity added a commit that referenced this issue May 6, 2024
Without this change, the compiler fails to notice that
`env_threads isa Int` in the fall-through case, leading to a union-split
with a branch that is in fact unreachable:

```
43 ┄ %109 = φ (#41 => %105, #42 => %108)::Union{Nothing, Int64}
│    %110 = (%109 isa Int64)::Bool
└───        goto #45 if not %110
...
45 ─ %126 = π (%109, Nothing)
│           Base.convert(Int64, %126)::Union{}
└───        unreachable
```

Co-authored-by: Jeff Bezanson <[email protected]>
aviatesk pushed a commit that referenced this issue May 7, 2024
lazarusA pushed a commit to lazarusA/julia that referenced this issue Jul 12, 2024