writev – zero-copy gather output #291

Merged
merged 2 commits into from
Nov 29, 2016

Conversation

Collaborator

@aantron aantron commented Nov 19, 2016

Lwt_unix.writev takes a sequence of bytes and Bigarray buffers, and writes the data to a file descriptor in one system call. Bigarray buffers are always written without copying. bytes buffers are written without copying if the file descriptor is in non-blocking mode, which is typical for sockets and pipes in Lwt.

Example usage:

let () =
  let t =
    let hello : bytes       = Bytes.unsafe_of_string "hello " in
    let world : Lwt_bytes.t = Lwt_bytes.of_string "world!" in (* bigarray *)

    let%lwt bytes_written =
      let io_vectors = Lwt_unix.IO_vectors.create () in
      Lwt_unix.IO_vectors.append_bytes io_vectors hello 0 6;
      Lwt_unix.IO_vectors.append_bigarray io_vectors world 0 6;
      Lwt_unix.(writev stdout io_vectors)
    in

    assert (bytes_written = 12);

    Lwt.return_unit
  in

  Lwt_main.run t

There is also a drop function to trim io_vectors if not all bytes are written.
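To illustrate what trimming after a partial write means, here is a minimal stdlib-only sketch. The `slice` type, `drop`, `fake_writev`, and `write_all` are hypothetical stand-ins written for this illustration; they are not the representation or the functions in this PR, and `fake_writev` only simulates a short write by appending to a `Buffer.t`.

```ocaml
(* Toy model of an I/O vector sequence: NOT Lwt's representation. *)
type slice = { buffer : bytes; offset : int; length : int }

(* [drop slices n] removes the first [n] bytes from the front of the
   sequence: fully written slices are discarded, and a partially written
   slice is trimmed in place. *)
let rec drop slices n =
  match slices with
  | [] -> []
  | s :: rest when n >= s.length -> drop rest (n - s.length)
  | _ when n <= 0 -> slices
  | s :: rest ->
    { s with offset = s.offset + n; length = s.length - n } :: rest

(* Simulated gather write that accepts at most [limit] bytes per call and
   returns how many bytes it took, like a short write on a socket. *)
let fake_writev out limit slices =
  let rec go written = function
    | s :: rest when written < limit ->
      let n = min s.length (limit - written) in
      Buffer.add_subbytes out s.buffer s.offset n;
      if n < s.length then written + n else go (written + n) rest
    | _ -> written
  in
  go 0 slices

(* Keep writing and dropping until nothing is left. Assumes [limit > 0]. *)
let write_all out limit slices =
  let rec loop slices =
    if slices = [] then ()
    else loop (drop slices (fake_writev out limit slices))
  in
  loop slices

let () =
  let out = Buffer.create 16 in
  write_all out 5
    [ { buffer = Bytes.of_string "hello "; offset = 0; length = 6 };
      { buffer = Bytes.of_string "world!"; offset = 0; length = 6 } ];
  assert (Buffer.contents out = "hello world!")
```

With the real API, the same shape would presumably be a loop around Lwt_unix.writev followed by a call to drop with the returned byte count.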

Performance

I did some rudimentary performance testing (writev.c, writev.ml), measuring throughput. The conclusions are:

  • When multiple buffers are available, Lwt_unix.writev should always be preferred over multiple calls to Lwt_unix.write (or Lwt_bytes.write, or stdlib's Unix.single_write).
  • However, if the buffers are sufficiently small (around 128 bytes on Linux, 512 bytes on OS X), it is faster (up to around 30% in C, 250% in OCaml) to copy the buffers to a single large one, and then do a single Lwt_unix.write on the coalesced buffer. This is presumably due to the overhead of dealing with the I/O vectors by the application and kernel.
  • As the buffers get larger, writev outperforms coalescing by up to 50%.
  • For large numbers of small buffers (e.g. 512 × 128B), both writev and coalescing dramatically outperform multiple calls to write. Both are about 7× faster on non-blocking Lwt file descriptors, and over 100× faster on blocking ones. The latter is probably due to Lwt synchronizing with worker threads for I/O on blocking descriptors.
  • Testing the write and writev system calls from C gives very similar ratios, and the order of magnitude of buffer size at which writev becomes faster than coalescing is the same.
  • Lwt_unix.writev is at least 90% as fast as the writev system call, except for very small buffer sizes mentioned above, where OCaml allocations become relatively significant, but where coalescing is faster in both OCaml and C.
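For reference, "coalescing" in the list above means something like the following sketch: copy the small buffers into one contiguous buffer, then issue a single write on it. The names `coalesce` and `should_coalesce` are made up for this illustration, and the threshold is a parameter rather than a recommendation; the 128 and 512 byte figures above are rough observations from uncontrolled machines.

```ocaml
(* Copy a list of small buffers into one contiguous buffer, preserving
   their order. The result can then be passed to a single write call. *)
let coalesce (buffers : bytes list) : bytes =
  let total =
    List.fold_left (fun acc b -> acc + Bytes.length b) 0 buffers in
  let result = Bytes.create total in
  ignore
    (List.fold_left
       (fun pos b ->
         let len = Bytes.length b in
         Bytes.blit b 0 result pos len;
         pos + len)
       0 buffers);
  result

(* Hypothetical heuristic: coalesce only when every buffer is at or below
   [threshold] bytes; otherwise prefer a gather write. *)
let should_coalesce ~threshold (buffers : bytes list) : bool =
  List.for_all (fun b -> Bytes.length b <= threshold) buffers

let () =
  let buffers = [ Bytes.of_string "ab"; Bytes.of_string "cd" ] in
  if should_coalesce ~threshold:128 buffers then
    assert (Bytes.to_string (coalesce buffers) = "abcd")
```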

The test machines were my OS X computer, and a Linux virtual machine running on DigitalOcean. These aren't very controlled environments. The point was to make sure there are no serious errors affecting performance in this initial implementation, rather than to obtain really high-quality measurements. Perhaps I will do some more thorough testing later, and write a short article on the results.

Other notes

  • I made the I/O vector sequence an abstract type, so we can change its representation easily in the future.
  • Optimizations could include an unsafe version that does not perform bounds checks, and/or versions that represent Lwt_unix.IO_vectors.t using C struct iovecs directly. The latter may have to carry extra data, or make assumptions that the user retains references to the buffers and, for bytes buffers, that the GC does not run between assembling the I/O vectors and calling writev (this may be especially tough in future multicore).
  • Even though the new code deals with Bigarrays, I put it in Lwt_unix. Hopefully, one day we will have modular implicits, and Lwt_bytes can be folded into Lwt_unix.
  • Maybe, at some point in the distant future, send_msg and recv_msg can be switched to use the heterogeneous I/O vectors from this PR. I am not in a hurry to break compatibility, however. Also, modular implicits may help here.
  • It should be possible to implement writev on Windows for Windows sockets only, but I have left that to be done on demand, or as part of some future effort to port more functions to Windows.

Resolves #281.

cc @rgrinberg, @seliopou

Collaborator

@avsm avsm left a comment

This is a very useful PR. Overall I'm just wondering if it makes more sense to put a Bigarray-based version of this call in Lwt_bytes, and leave the Bytes version in Lwt_unix. Right now that's the split between Bigarray and Bytes functions in Lwt.

module IO_vectors =
struct
type _bigarray =
(char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t
Collaborator

Could you use Lwt_bytes.t instead of a new type definition here? It is the same definition.

Collaborator Author

Lwt_bytes depends on Lwt_unix, otherwise yes.


type _buffer =
| Bytes of Bytes.t
| Bigarray of _bigarray
Collaborator

_buffer is not exposed in the external signature, so it shouldn't need the _ prefix.

Collaborator Author

I typically use the _ to indicate at a glance that something isn't exposed (or in rare cases, like _bigarray, is exposed, but I wish it weren't).

Collaborator Author

@avsm you're right here, I had forgotten that the leading _ suppresses dead code warnings. I removed the underscores in e7a6755.

mutable reversed_suffix : _io_vector list;
mutable count : int}

let create () = {prefix = []; reversed_suffix = []; count = 0}
Collaborator

I'm not sure it's worth the extra complexity in the interface, but an optional argument of ?bytes and ?bigarrays here could be used to initialise the iov in one call rather than multiple appends. I could use that in cstruct to perform the write with less overhead from the API.

Collaborator Author

What would be the ordering if both are provided? Perhaps it is better to have separate calls such as from_bigarray_list?

Collaborator Author

aantron commented Nov 21, 2016

@avsm Thanks.

I chose to have one function handle mixtures of bytes and Bigarrays because, if there are two functions, and you have, say, data in Bigarrays interleaved with punctuation in strings, you would completely lose the benefit of writev. You would have to either make individual calls for each buffer, or copy everything into a big buffer.
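As a rough illustration of the interleaving case, the sketch below models a mixed sequence with a variant type similar to the one in the diff, and shows that a single ordered sequence keeps the output order. `bigstring_of_string` and `gather` are helpers invented for this example. `gather` copies into a string only so that the resulting byte order can be inspected; the point of writev is that the real call does not need that copy.

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* One sequence can hold both kinds of buffer, in output order. *)
type buffer =
  | Bytes of bytes
  | Bigarray of bigstring

(* Hypothetical helper: build a Bigarray buffer from a string. *)
let bigstring_of_string (s : string) : bigstring =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout
      (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b

(* Concatenate the buffers in order. This copies, and exists only to show
   which bytes a single gather write over the same sequence would emit. *)
let gather (buffers : buffer list) : string =
  let out = Buffer.create 64 in
  List.iter
    (function
      | Bytes b -> Buffer.add_bytes out b
      | Bigarray a ->
        for i = 0 to Bigarray.Array1.dim a - 1 do
          Buffer.add_char out (Bigarray.Array1.get a i)
        done)
    buffers;
  Buffer.contents out

let () =
  (* Data in Bigarrays, punctuation in bytes, one ordered sequence. *)
  let sequence =
    [ Bigarray (bigstring_of_string "1");
      Bytes (Bytes.of_string ", ");
      Bigarray (bigstring_of_string "2") ]
  in
  assert (gather sequence = "1, 2")
```

With two separate functions, one per buffer kind, the three buffers above could not go out in one call without first copying them together.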

With that and other considerations in mind, I have an ill-defined long-term goal of merging Lwt_bytes into Lwt_unix, if/when we get modular implicits. The distinction seems a bit artificial at present, and writev seems like a case where it is better not to have it.

includes its first [n] bytes. *)

val count : t -> int
(** [count vs] is the number of I/O vectors in the sequence [vs]. *)
Contributor

I'm a bit curious as to why you decided to make this function part of the interface. I can't think of any use cases for it off the top of my head.

Collaborator Author

I was thinking that somebody might want to compare count vs with system_limit, in case the latter matters to them. I suppose, in the vast majority of cases, such users could easily keep a counter while accumulating vs. If you agree, I'll hide this function. We can always expose it later, if there is a request.
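For users who do care about the system limit, one possible approach is to split the buffers into batches before building each I/O vector sequence. The sketch below is stdlib-only: `batches` is a hypothetical helper, `limit` is passed in as a plain integer (how the limit is obtained is left open here), and the items are whatever the caller uses to describe a buffer.

```ocaml
(* Split [vectors] into consecutive batches of at most [limit] items,
   preserving order. Each batch could then become one gather write. *)
let batches ~limit vectors =
  if limit <= 0 then invalid_arg "batches: limit must be positive";
  let rec go current n acc = function
    | [] ->
      (* Flush the last, possibly short, batch. *)
      List.rev (if current = [] then acc else List.rev current :: acc)
    | v :: rest when n = limit ->
      (* Current batch is full: close it and start a new one with [v]. *)
      go [ v ] 1 (List.rev current :: acc) rest
    | v :: rest -> go (v :: current) (n + 1) acc rest
  in
  go [] 0 [] vectors

let () =
  assert (batches ~limit:2 [ "a"; "b"; "c"; "d"; "e" ]
          = [ [ "a"; "b" ]; [ "c"; "d" ]; [ "e" ] ])
```

A running counter kept while accumulating buffers, as suggested above, achieves the same thing without an exposed count.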

Collaborator Author

I suppose it also gives an easy way to know when all the data in the vectors has been written. Here is some kind of writev loop skeleton:

let rec loop () =
  Lwt_unix.writev fd vecs >>= fun n ->
  Lwt_unix.IO_vectors.drop vecs n;
  if Lwt_unix.IO_vectors.count vecs = 0 then Lwt.return_unit
  else loop ()

Of course, the user could keep track of how much data is in the vectors, but having count for this seems like less of a pain.

This could be an argument to expose a size or byte_count instead, though.

Contributor

In this particular case, it seems like it would be better to add a predicate that would check if the whole io vector has been written yet, rather than expose the raw count. Hopefully we can fulfill other cases in this direct way as well.

Collaborator Author

aantron commented Nov 28, 2016

@rgrinberg I've replaced count by empty. Not fully sure about the new interface:

(** [empty vs] is [true] if and only if [vs] has no I/O vectors, or all I/O
    vectors in [vs] have zero bytes. *)

However, writev with zero vectors results in EINVAL, but writev with at least one vector of length zero is a valid operation, and empty does not distinguish between these two cases. Still, I figured that the main use of empty is for drop loops. If the user wants to do a zero-byte write using writev, they would construct a special I/O vector and not need to call empty on it anyway.
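To make the documented semantics concrete, here is a tiny model in which a sequence is represented only by the lengths of its vectors. This is an illustration of the doc comment, not the code in the PR, and `is_empty_model` is a name invented for it.

```ocaml
(* A sequence is modeled as the list of its vector lengths. It counts as
   empty when it has no vectors or every vector has zero bytes. *)
let is_empty_model (lengths : int list) : bool =
  List.for_all (fun n -> n = 0) lengths

let () =
  assert (is_empty_model []);            (* no vectors at all *)
  assert (is_empty_model [ 0; 0 ]);      (* only zero-length vectors *)
  assert (not (is_empty_model [ 0; 3 ])) (* some bytes remain *)
```

The first two cases give the same answer, which is exactly the distinction the predicate gives up.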

Collaborator Author

aantron commented Nov 28, 2016

writev with zero vectors results in EINVAL

Apparently, that's only the case on Mac (and I guess other BSDs). Automated testing FTW.

Contributor

rgrinberg commented Nov 28, 2016 via email

Collaborator Author

aantron commented Nov 28, 2016

It seems that the BSD behavior is more strict, while still being POSIX-compliant. There are two options. I favor the first one:

  1. Fail with Invalid_argument if an empty sequence of I/O vectors (zero vectors) is passed, ensuring BSD-compatible behavior at the Lwt level. Lwt code written and tested on Linux will be portable to BSD without the risk of new unexpected failures.
  2. Allow writev to do whatever the system's writev does.

After a brief search, I haven't found any mention that the Linux behavior is deliberate. Will probably look further later, however.


BSD writev:

[EINVAL]           The iovcnt argument was less than or equal to 0, or
                   greater than IOV_MAX.

POSIX writev:

...may fail...

EINVAL The iovcnt argument was less than or equal to 0, or greater
       than {IOV_MAX}.

Linux writev:

EINVAL The vector count, iovcnt, is less than zero or greater than
       the permitted maximum.

I'll change empty to is_empty.

@aantron aantron force-pushed the writev branch 2 times, most recently from 160aa90 to fcbc348 Compare November 28, 2016 22:41
Collaborator Author

aantron commented Nov 28, 2016

Regarding zero-length I/O vector lists, I decided it's better to only document the difference and tell users not to rely on the behavior. Don't really have any argument for dealing with this corner case in any particular way.

@aantron aantron merged commit b6d1f9f into master Nov 29, 2016
@aantron aantron mentioned this pull request Nov 29, 2016
@aantron aantron deleted the writev branch November 29, 2016 22:10
@aantron aantron mentioned this pull request Jan 15, 2018
35 tasks