mod_h2 internals

A description of the module's structure, the main terms used, and how it works overall.

While the complete and accurate description will always be the source code, this document is intended to serve as an entry point for understanding how the module works and what its main moving parts are.

Terms

  • c1: a primary connection. This is a connection to an HTTP/2 client.
  • c2: a secondary connection. This is an internal, virtual one used to process a request.
  • session: the HTTP/2 state for a particular c1 connection.
  • stream: an HTTP/2 stream that (commonly) carries a request+body and delivers a response+body. Each stream has a unique 32-bit identifier, as defined in the HTTP/2 protocol. Stream 0 is the c1 connection itself.
  • mplx: the multiplexer, one per session. It processes streams, forwards input (request bodies) and collects output from processing. To process a stream, it creates a c2 connection.
  • worker: polls all registered mplxs for c2 connections to process. mplxs register themselves with the workers when they have something to process.
  • slot: a particular worker thread. The number of workers may vary based on server load and configuration.
  • beam: a mechanism for transferring APR buckets between c1 and c2. More under Memory.

File Structure

Source files are all prefixed with h2_ followed by the main topic they are about. h2_c2_filter, for example, contains all input/output filter code for c2 connections; h2_session is all about the session instances; and so on.

Session States

HTTP/2 sessions can be in one of the following states:

  • INIT: state during initialization of the session. Sends the initial SETTINGS to the client and transits to BUSY on success.
  • BUSY: reading c1 input, writing frames on c1 output and checking any c2 for I/O events. Switches to WAIT when c1 input is exhausted.
  • WAIT: collects the c1 socket and all c2 pipes into a pollset and waits for events with the connection timeout. Transits back to BUSY on c1 input.
  • IDLE: there are no streams to process. The session waits for new c1 input to arrive. If a stream has been processed already, this is like an HTTP/1 keepalive.
  • DONE: the session is done processing streams and shuts down. Possibly a last GOAWAY frame is being sent. Transits to CLEANUP when the protocol needs have been taken care of.
  • CLEANUP: release all internal resources. Make sure that any ongoing c2 processing terminates.

There are sub-states in the BUSY and WAIT handling. A session may choose to no longer accept new streams from the client while finishing processing of all ongoing streams. This is, for example, triggered by a graceful shutdown of the child process.

Errors and timeouts on the c1 connection trigger a transition to DONE.
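The states above can be pictured as a small enum. The following is a condensed sketch only, with an illustrative type name; the actual definitions and transition logic live in h2_session.c / h2_session.h and carry more detail.

    /* Sketch only: names mirror the states described above, not necessarily
     * the module's exact definitions. */
    typedef enum {
        H2_SESSION_ST_INIT,    /* sending initial SETTINGS to the client */
        H2_SESSION_ST_BUSY,    /* reading c1 input, writing frames, checking c2 I/O */
        H2_SESSION_ST_WAIT,    /* polling c1 socket and c2 pipes with the connection timeout */
        H2_SESSION_ST_IDLE,    /* no streams to process, waiting for new c1 input */
        H2_SESSION_ST_DONE,    /* shutting down, possibly sending a final GOAWAY */
        H2_SESSION_ST_CLEANUP  /* releasing resources, ending any remaining c2 processing */
    } h2_session_state_sketch;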

Stream States

These mostly correspond to the states described in the HTTP/2 standard, with some additions for internal handling:

  • IDLE: a stream has been created by a request from the client.
  • OPEN: all request headers have arrived. The stream can start processing.
  • RSVD_R: the stream (identifier) has been reserved by the client (remote).
  • RSVD_L: the stream (identifier) has been reserved by the session (locally).
  • CLOSED_R: stream was closed by the client (remote). A (possibly empty) request body is complete.
  • CLOSED_L: stream was closed by the session (locally) and the output is complete.
  • CLOSED: both stream ends have been closed.
  • CLEANUP: the session is done with the stream, its resources may be reclaimed. Such a stream is handed over to the mplx which performs the reclamation. This needs to take care of a potentially still running c2 connection.

An mplx maintains three stream lists (a short sketch follows below):

  • streams: the active streams which are being processed (or scheduled to be).
  • shold: the streams in CLEANUP which have an ongoing c2 connection that needs to terminate first.
  • spurge: the streams without or with a finished c2 that can be reclaimed.
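As a rough illustration, and assuming hypothetical field names, the bookkeeping could be pictured like this; the real structure in h2_mplx.c keys streams by their id and protects everything with the mplx mutex.

    #include <apr_hash.h>

    /* Sketch only: illustrative names, not the module's actual struct layout. */
    struct mplx_streams_sketch {
        apr_hash_t *streams; /* active streams, processed or scheduled, by stream id */
        apr_hash_t *shold;   /* CLEANUP streams whose c2 connection still runs */
        apr_hash_t *spurge;  /* streams without a c2, or with a finished one; safe to reclaim */
    };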

Memory

Setup

The APR memory model with its pools determines much of mod_h2's structure, initialization and resource reclamation strategies. Pools are the foundation of everything in Apache httpd: lists, tables, files, pipes, sockets, data and metadata transfers (buckets) are all tied to them.

The fundamental restriction of pools is that they are not thread safe. Using the same pool from two threads will mess up its internal lists, and the busier the server is, the more likely this is to happen. Since everything one does with APR's features has the potential to modify its underlying pool, none of the things listed above are thread safe.

Closing a file, for example, will modify the pool it was opened with. If 10 files are opened with the same pool, one cannot use 5 of them in one thread and the rest in another. Everything that is based on the same pool needs to stay on the same thread.

A session handling c1 needs to run in parallel to request processing on c2 connections. That means session and c2s have completely separate pools.

When a session creates a stream, it creates a new child pool for it. Pool memory can only be freed by destroying the whole pool. To handle thousands of streams without leaking memory, they have to be placed in child pools that can be reclaimed when a stream is done.
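A minimal sketch of that per-stream pool pattern, using plain APR calls (the helper name is made up for illustration):

    #include <apr_pools.h>

    /* Sketch: create a child pool for one stream and reclaim it when the
     * stream is done. Destroying the child frees everything the stream
     * allocated, without touching memory owned by other streams. */
    static apr_status_t stream_pool_lifecycle(apr_pool_t *session_pool)
    {
        apr_pool_t *stream_pool = NULL;
        apr_status_t rv = apr_pool_create(&stream_pool, session_pool);
        if (rv != APR_SUCCESS) {
            return rv;
        }
        /* ... allocate stream buffers, tables and buckets from stream_pool ... */
        apr_pool_destroy(stream_pool); /* stream done: reclaim it all at once */
        return APR_SUCCESS;
    }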

An mplx is used by the session and by c2 connections. To manage its internal structures, it also needs its own, separate pool, which is protected by its mutex. When needed, it creates new c2 connections, each with its own separate pool, since processing happens in separate worker threads.

All these separate pools have their own APR allocator (the one that manages system memory) so that they are independent. However, they are still tied together in a parent/child relationship so that none of them is lost track of (leaked). So c2 pools are children of the mplx pool, which is a child of the session pool.

Teardown

Destroying a pool modifies its parent pool. When reclaiming a c2 pool, the mplx pool is changed, so this may only happen under the protection of the mplx mutex. When reclaiming the mplx pool, the session pool is modified, so this may only happen on the thread that works on the session.

This means that tearing down a session needs to tear down the mplx, which needs to tear down all its c2 connections first. Or else.

Stream Memory

Streams have their own input/output buffers, allocated from their own pool. Similar to "normal" HTTP/1 requests, they need to make sure that all their outgoing data on c1 has actually been sent before they can be destroyed. HTTP/1 uses the EOR meta bucket for that; HTTP/2 has an H2_EOS bucket that works similarly.

On closing a stream, an H2_EOS bucket is created and sent on c1. When this bucket is destroyed, the stream is handed to the mplx for safe destruction. The mplx then removes the stream from its list of active ones. It places it on its spurge list when the stream has no c2 or the c2 has already returned from the worker. For an active c2, the stream is placed into shold.

When a worker tells an mplx that it has finished a c2, the mplx checks if the stream is still active or if it is found in shold. If the stream is in shold, it is moved to spurge. It cannot be destroyed right away, since the stream's pool is a child of the session pool; destroying it would manipulate the session pool from inside a worker thread.

The purge list is instead only processed when the session calls into the mplx.

Data Transfer

With all this pool touchiness, how do requests, responses and their bodies ever get transferred between a stream and the c2 connection that does the actual work? That merits its own chapter about bucket beams.

Bucket Beams

Apache httpd uses APR's bucket brigades to transfer data and meta information through its connection filters. So, whatever else one does, ultimately streams and c2 connections will use brigades.

The problem is: it is impossible to transfer a bucket from one brigade to another between threads.

A bucket belongs to a bucket_alloc, which belongs to a memory pool. All three are tied together and none of them is thread safe. Imagine transferring from brigade b1 on thread t1 into brigade b2 on thread t2:

  • t1 can take data out of b1, but cannot put it into b2.
  • t2 can put data into b2, but cannot take it out of b1.

So, mod_h2 needs something to juggle the data in between, while t1 and t2 do their thing. That is the job of a bucket beam.

(The name "beam" is inspired by Start Trek beam technology, where people are transported from one place to another - but not instantly. They first become frozen and semi-transparent at both location, until the arrival is complete and then they disappear at the start. Same happens to buckets in a bucket beam.)

A bucket beam has two APR rings (rings are doubly linked lists that work independently of pools. Yay!):

  • buckets_to_send: when t1 calls h2_beam_send(beam, b1), the beam takes buckets out of b1 and appends them to this ring.
  • buckets_consumed: when t2 calls h2_beam_receive(beam, brigade), buckets are taken from the to_send ring, converted to new buckets (which are appended to the receiver brigade) and the original ones are added to the 'consumed' ring.

The buckets in buckets_consumed can then be destroyed the next time thread t1 calls into the beam.
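The two-ring mechanism can be illustrated with a miniature, hypothetical "beam" built on APR rings. This is a sketch only: the real h2_bucket_beam moves actual apr_bucket structures, guards both rings with a mutex, and its h2_beam_send()/h2_beam_receive() take more arguments than the two-argument form used in the description above.

    #include <apr_general.h>
    #include <apr_ring.h>

    /* Sketch only: item_t stands in for a bucket; all names are illustrative. */
    typedef struct item_t item_t;
    struct item_t {
        APR_RING_ENTRY(item_t) link;
        /* ... payload, e.g. a pointer to the original apr_bucket ... */
    };

    typedef struct {
        APR_RING_HEAD(to_send_head, item_t) buckets_to_send;    /* filled by the sender (t1) */
        APR_RING_HEAD(consumed_head, item_t) buckets_consumed;  /* drained later by t1 */
    } mini_beam_t;

    static void mini_beam_init(mini_beam_t *beam)
    {
        APR_RING_INIT(&beam->buckets_to_send, item_t, link);
        APR_RING_INIT(&beam->buckets_consumed, item_t, link);
    }

    /* Receiver side (t2): take one item for conversion and park the original
     * on the consumed ring, where the sender (t1) will destroy it later. */
    static item_t *mini_beam_receive_one(mini_beam_t *beam)
    {
        item_t *it;
        if (APR_RING_EMPTY(&beam->buckets_to_send, item_t, link)) {
            return NULL;
        }
        it = APR_RING_FIRST(&beam->buckets_to_send);
        APR_RING_REMOVE(it, link);
        APR_RING_INSERT_TAIL(&beam->buckets_consumed, it, item_t, link);
        return it;
    }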

Bucket Conversions

Buckets in a beam are "converted" into new buckets (the originals from 'to_send' are read-only). For well-known meta buckets, a new bucket of the corresponding type is simply created and added to the receiver brigade.

This concerns 'eos', 'flush' and 'error' buckets. Several other meta buckets are not converted. For example, the 'eor' bucket is not passed on in any form, just added to the buckets_consumed ring on transfer. This is important to make sure that all sent buckets are destroyed in the correct order.

Data buckets have their data and length extracted via apr_bucket_read() and the data is written to the receiver brigade. This makes a copy of the data. Special handling is done for 'file' and 'mmap' buckets where the file/mmap itself is duped and added to the receiver. This allows transfer of large response bodies without copying any data.

For additional conversions, beams allow registration of conversion functions. All unknown buckets are passed to them. They may add their own converted buckets to the receiver brigade.
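For plain data buckets, the conversion boils down to reading the sender's bucket and writing a copy into the receiver's brigade. Here is a hedged sketch with plain APR calls; the helper name is made up, and the real logic in h2_bucket_beam.c additionally special-cases file, mmap and meta buckets.

    #include <apr_buckets.h>

    /* Sketch: copy the contents of one readable sender bucket into buckets
     * owned by the receiver's own allocator. */
    static apr_status_t convert_data_bucket(apr_bucket *src, apr_bucket_brigade *dst)
    {
        const char *data;
        apr_size_t len;
        apr_status_t rv;

        rv = apr_bucket_read(src, &data, &len, APR_BLOCK_READ);
        if (rv != APR_SUCCESS) {
            return rv;
        }
        /* apr_brigade_write() copies the bytes into the receiver's buckets */
        return apr_brigade_write(dst, NULL, NULL, data, len);
    }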

Beam Memory

Bucket beams can be configured with a buffer limit. This blocks senders when they try to add more (data) buckets than the limit allows. They become unblocked when a receiver takes data out. Data buckets of type 'file' or 'mmap' are not counted against this limit, as they do not really occupy memory in the beam's buffer, at least no additional memory, since the sender has already created these buckets.

Efficiency

While the response to an HTTP/2 request is being generated, the request occupies an h2 worker thread. Once the response is complete (headers and body buckets), and all buckets have been passed through the filter chain, the request processing returns and the worker can be used for other requests.

Buckets, however, can only be passed when they do not exceed the memory limitations for an HTTP/2 stream. Sending buckets on the output stream will block when this limit is reached, and that blocks the worker thread.

Fortunately, for static files this limit is never reached, as file buckets have a tiny memory footprint until their data is actually read. Beams will accept file buckets of any length without blocking. The same is true for mmap buckets. Serving static files therefore occupies workers only briefly, to look up the file and set the response headers; all other work then happens on the c1 connection itself.
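To see why file buckets are so cheap, here is a hedged sketch (names are illustrative, not the module's own code) of how a static file response can be put into a brigade without reading a single byte of its content:

    #include <apr_buckets.h>
    #include <apr_file_io.h>

    /* Sketch: the file bucket only references the open file; data is read
     * lazily when the bucket is consumed on the c1 side. */
    static void append_file_response(apr_bucket_brigade *bb, apr_file_t *file,
                                     apr_off_t file_len, apr_pool_t *pool)
    {
        /* covers [0, file_len) without reading or copying any data now */
        apr_brigade_insert_file(bb, file, 0, file_len, pool);
        /* end-of-stream marker, also just a tiny meta bucket */
        APR_BRIGADE_INSERT_TAIL(bb, apr_bucket_eos_create(bb->bucket_alloc));
    }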

Polling

The mplx uses pipes and pollsets to monitor communications on c1 and c2 (a short pollset sketch follows the list):

  • for c1 it polls the connection's socket for new incoming data.
  • for c2s it polls pipe_out_prod, a pipe used for signalling the availability of new output.
  • for c2s it polls pipe_in_drain, a pipe used for signalling that input has been read by the c2 (only when the stream being processed actually has input, e.g. a request body).
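Here is a hedged sketch of how a c1 socket and a c2 signalling pipe might end up in one pollset. Variable and function names are illustrative; the actual setup lives in the mplx and c1 code.

    #include <string.h>
    #include <apr_poll.h>

    /* Sketch: one pollset watching the c1 socket and one c2 signalling pipe. */
    static apr_status_t build_pollset(apr_pool_t *pool, apr_socket_t *c1_socket,
                                      apr_file_t *c2_pipe_out_prod,
                                      apr_pollset_t **ppollset)
    {
        apr_pollset_t *pollset;
        apr_pollfd_t pfd;
        apr_status_t rv;

        rv = apr_pollset_create(&pollset, 10, pool, APR_POLLSET_WAKEABLE);
        if (rv != APR_SUCCESS) {
            return rv;
        }

        /* c1: wake up on new frames (requests, window updates, ...) */
        memset(&pfd, 0, sizeof(pfd));
        pfd.p = pool;
        pfd.desc_type = APR_POLL_SOCKET;
        pfd.desc.s = c1_socket;
        pfd.reqevents = APR_POLLIN;
        apr_pollset_add(pollset, &pfd);

        /* c2: wake up when a worker signals new output on pipe_out_prod */
        memset(&pfd, 0, sizeof(pfd));
        pfd.p = pool;
        pfd.desc_type = APR_POLL_FILE;
        pfd.desc.f = c2_pipe_out_prod;
        pfd.reqevents = APR_POLLIN;
        apr_pollset_add(pollset, &pfd);

        *ppollset = pollset;
        return APR_SUCCESS;
    }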

The HTTP/2 protocol has in-built flow control on its streams. c1 needs to be continuously monitored not only for new requests, but also for updates on flow control window sizes.

Flow control also is in effect for clients sending data to the server, e.g. in a POST request. That is why the consumption of such input data is being monitored. When data is passed on to a c2's input, the stream's window for the client is updated (and the update sent to the client on c1), so more POST data can be sent.

In this way, an HTTP/2 connection is much busier in both directions than HTTP/1, where there is a clear separation between the sending and receiving phases of a request.

For c2s expecting input, an additional pipe is created: pipe_in_prod, which signals when c1 has written new data to the c2's input. This pipe is read by a c2 waiting for additional data.

The bucket beams have callbacks for various events that are used to feed these pipes:

  • h2_beam_on_received(): used to signal that c2 input has been read.
  • h2_beam_on_was_empty(): used to signal the addition of new data.

This way, sending/receiving on a bucket beam adds notifications to the pipes.
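A hedged sketch of such a notification hook follows. The callback signature and registration form are assumptions here (the actual types are declared in h2_bucket_beam.h), but the idea is simply to write a byte into the pipe that the pollset is watching.

    #include <apr_file_io.h>

    /* Sketch: a beam event hook, in the spirit of h2_beam_on_was_empty(),
     * that feeds the signalling pipe watched by the pollset. */
    static void on_beam_was_empty(void *ctx)
    {
        apr_file_t *pipe_out_prod = ctx; /* write end of the signalling pipe */
        apr_size_t len = 1;

        /* one byte is enough: the poller only needs to wake up and look */
        apr_file_write(pipe_out_prod, "1", &len);
    }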

Timeouts

When a session is in state WAIT, it polls using the c1 connection timeout. If polling times out, the session is shut down.

To enter WAIT state, a session needs to be sure that there is no pending data in c1's input filters. A session enters BUSY state whenever it is not certain about this. A BUSY session uses the pollset with timeout 0, which immediately returns, and tries to read/write more on the c1 connection.

Eventually, it will detect that c1 input filters have nothing buffered any more and enter WAIT state.

A session enters IDLE state when it has no more streams to process and nothing more to send. It will then poll with a very short timeout and, when nothing changes, return the c1 connection to the mpm for "keepalive" monitoring.

Workers

The module has its own worker pool, separate from the mpm worker threads. This is mostly for historical reasons, as HTTP/2 came late into the server, but separate pools also make sense so that the workers cannot deadlock themselves (all mpm workers holding a c1 and waiting for a free worker to process their c2s).

The h2 workers have a fixed number of slots for threads and start with a minimal number (both configurable). Additional threads are added in busy situations, up to the number of available slots. When such a slot has been idle for a number of seconds (configurable), it shuts down again. Also, inactive workers shut down when the server does a graceful reload.

h2 workers manage a queue of mplx instances that have work to be done. An idle worker removes an mplx from the queue and asks it for a c2 to process. When done, it informs the mplx that it has finished the c2 and if it is willing to work on another c2 from this mplx. It is only willing to do so when the maximum worker count has not been reached yet.
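Sketched as C pseudocode, with every helper name hypothetical (the real loop lives in h2_workers.c and sleeps on a condition variable while the queue is empty):

    /* hypothetical helpers, for illustration only */
    void *pop_next_mplx_with_work(void);             /* blocks while the queue is empty */
    void *ask_mplx_for_c2(void *m);                  /* NULL when the mplx has nothing */
    void  process_c2(void *c2);                      /* run request processing here */
    int   report_c2_done_and_may_continue(void *m, void *c2);

    static void worker_loop_sketch(void)
    {
        void *m, *c2;
        for (;;) {
            m = pop_next_mplx_with_work();
            if (!m) {
                break; /* shutdown, or this slot timed out while idle */
            }
            while ((c2 = ask_mplx_for_c2(m)) != NULL) {
                process_c2(c2);
                /* keep asking this mplx for more work only while spare
                 * workers remain, to stay fair to other c1 connections */
                if (!report_c2_done_and_may_continue(m, c2)) {
                    break;
                }
            }
        }
    }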

The reasons for this design are:

  • all mplxs belong to a particular c1 and all c1s should be treated equally. c1s that open many requests at once should not get preferential treatment over others that have "only" a single request.
  • scheduling c2s directly would require creating them, potentially a lot of them, without a worker being available. 100 connections with 10 requests ongoing would hold 1000 c2s all the time while only a small set is being worked on.
  • asking the same mplx for additional work is not unfair as long as more workers are available. And it results in better performance than switching workers all the time.

Interactions with HTTP/1

The HTTP/1 protocol handling is the default in Apache httpd and there are several places where it applies itself to "HTTP" without any consideration of the actual protocol version. There are ongoing efforts to separate the generic HTTP processing (e.g. checks on valid headers, method names, paths, etc.) from the serialization of responses and bodies on the wire.

Two areas are noteworthy here to understand how mod_h2 works.

Request Bodies

Requests in HTTP/1 either have an implied length of 0, announce a length in the content-length header, or use the "chunked" transfer encoding. The standard HTTP_IN filter in Apache therefore assumes that a request has no body if both content-length and chunked transfer information are missing.

Since HTTP/2 allows request bodies without an announced length and without chunking, mod_h2 must fake a chunked request input for those requests.
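As an illustration of what that "faking" amounts to, here is a hedged sketch that wraps one block of DATA-frame payload in HTTP/1 chunked framing; the function name is made up, and the real work happens inside a c2 input filter (h2_c2_filter.c).

    #include <apr_buckets.h>

    /* Sketch: wrap a block of request body data in chunked framing so that
     * the standard HTTP_IN filter sees a body it understands. */
    static apr_status_t wrap_as_chunk(apr_bucket_brigade *out_bb,
                                      const char *data, apr_size_t len)
    {
        apr_status_t rv;

        /* chunk header: size in hex, then CRLF */
        rv = apr_brigade_printf(out_bb, NULL, NULL, "%lx\r\n", (unsigned long)len);
        if (rv != APR_SUCCESS) {
            return rv;
        }
        rv = apr_brigade_write(out_bb, NULL, NULL, data, len);
        if (rv != APR_SUCCESS) {
            return rv;
        }
        /* a final "0\r\n\r\n" chunk follows once the HTTP/2 stream is closed */
        return apr_brigade_puts(out_bb, NULL, NULL, "\r\n");
    }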

Response Headers

Similarly, the server installs the HTTP_HEADER filter on a request's output which, on seeing the response body, inserts the response header in HTTP/1 format into the filter chain. HTTP/2 has no need for this and removes HTTP_HEADER on each of its requests.

In its stead, it applies its own filter to the output that writes a special H2HEADERS meta bucket. This bucket contains the response status and all headers and can be passed through the output beam. The c1 processing then converts H2HEADERS buckets into the HTTP/2 HEADERS frames for responses and trailers.

DDoS Protection

HTTP/2 as a protocol has more internal state than HTTP/1. This makes prevention of exploits more difficult. The known HTTP/1 attacks, such as "Slowloris", can also be applied here and are mitigated using the known mod_reqtimeout mechanisms.

But "Slo Loris" can also attack the c1 processing on the main connection. After all, the goal of such an attack is to exhaust the server's worker thread and preventing processing of new connection. By keeping the HTTP/2 c1 processing occupied, the same goal can be achieved if there is no special protection in place. This has been a bit of an arms race in the last few years.

Another attack angle aims at exhausting the h2 workers with requests that are slow, kept in an incomplete state, or simply cancelled and restarted, again and again.

To protect against such behaviour, mod_h2 keeps an internal "mood" towards a client based on its past behaviour on this connection (there is no tracking across connections). The mood starts neutral, where a client is allowed to have 6 active requests, i.e. occupy 6 h2 workers. When responses are read by the client in a timely fashion (no stalling), this mood rises and the limit is raised. Should the client stall responses (drag its feet with window updates) or reset streams before it has seen a response, the processing limit is lowered.

Another protection against flooding is scheduling fairness between connections. If one connection has a single request that needs a worker and another has 100, both get one worker assigned before the other 99 requests receive attention.