Commit

Docs update for statistics gathering changes.
Simon Prickett committed Jul 21, 2023
1 parent b03d1e7 commit 3f650f2
Showing 1 changed file with 6 additions and 1 deletion.
7 changes: 6 additions & 1 deletion enricher/README.md
@@ -2,7 +2,12 @@

This is the "enricher" component. It pulls entries from a Redis List that is used as a queue between this component and the [receiver](../receiver) component. Each entry represents a flight for which additional data needs to be fetched from the [FlightAware Aero API](https://flightaware.com/commercial/aeroapi/).

This component calls the API to fetch that data, storing it back in the Redis Hash representing the flight.
This component calls the API to fetch that data, storing it back in the Redis Hash representing the flight. It also records statistics about the aircraft seen in the following Redis data structures ([see the bonus video for details](https://www.youtube.com/watch?v=ttXq_E4Galw)):

* [Set](https://redis.io/docs/data-types/sets/): The key for this is `stats:planesseen`. It is used to record the registration of each plane seen. We can use the [`SCARD` command](https://redis.io/commands/scard/) to get the cardinality of the Set (how many different planes we have seen), the [`SISMEMBER` command](https://redis.io/commands/sismember/) to see whether we have seen a given registration, and the [`SSCAN`](https://redis.io/commands/sscan/) or [`SMEMBERS`](https://redis.io/commands/smembers/) commands to retrieve all of the registrations seen. The benefit of using a Set here is that we can do all of these things. The downside is that, because we keep all of the data, the memory used by the Set will grow over time and may become a problem.
* [HyperLogLog](https://redis.io/docs/data-types/probabilistic/hyperloglogs/): The key for this is `stats:planesapprox`. It is used to approximate the number of different plane registrations we have seen. We use the [`PFCOUNT` command](https://redis.io/commands/pfcount/) to get the approximation. The benefit of a HyperLogLog is that it lets us approximate the number of distinct planes seen to a reasonable degree of accuracy (Redis documents a standard error of 0.81%) without storing the data (it's hashed away). The downsides are that we cannot retrieve the original data from the HyperLogLog and that we lose absolute accuracy.
* [Sorted Set](https://redis.io/docs/data-types/sorted-sets/): The key for this is `stats:operators`. This is used as a scoreboard to track the most frequently seen aircraft operators (Lufthansa, Ryanair, EasyJet, Virgin Atlantic etc.) by operator code. We can use the [`ZRANGE` command](https://redis.io/commands/zrange/) to get slices of this high score table, and the [`ZRANK` command](https://redis.io/commands/zrank/) to see what a given operator's rank is. The benefit of a Sorted Set is accuracy; the downside can be memory usage for a large data set.
* [Top-K](https://redis.io/docs/data-types/probabilistic/top-k/): The key for this is `stats:aircrafttypes`. This is also used as a scoreboard, in this case for the different types of aircraft seen (Airbus A319, Boeing 737-800, Embraer 190 etc.). This is a probabilistic data structure, so there is a trade-off between accuracy and the space required to store the data. We can retrieve the leaderboard with approximate scores using the [`TOPK.LIST` command](https://redis.io/commands/topk.list/).
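
As a rough illustration of how the four keys above might be written and read, here is a minimal Python sketch that only builds the Redis commands as tuples, so it runs without a Redis server. The key names come from the list above. Everything else is an assumption made for illustration: the function names, the sample values, and the choice of write commands (`SADD`, `PFADD`, `ZINCRBY`, `TOPK.ADD`), which this README does not specify and which may differ from what the enricher actually does.

```python
# Key names as listed above.
PLANES_SEEN_KEY = "stats:planesseen"
PLANES_APPROX_KEY = "stats:planesapprox"
OPERATORS_KEY = "stats:operators"
AIRCRAFT_TYPES_KEY = "stats:aircrafttypes"


def record_commands(registration, operator_code, aircraft_type):
    """Build commands that could record one sighting (hypothetical helper).

    The write commands are an assumption; the README only names the
    commands used for reading.
    """
    return [
        # Exact membership: keeps every registration.
        ("SADD", PLANES_SEEN_KEY, registration),
        # Approximate distinct count: the registration itself is not kept.
        ("PFADD", PLANES_APPROX_KEY, registration),
        # Exact scoreboard: add 1 to this operator's score.
        ("ZINCRBY", OPERATORS_KEY, 1, operator_code),
        # Approximate scoreboard: the key must already have been created
        # with TOPK.RESERVE.
        ("TOPK.ADD", AIRCRAFT_TYPES_KEY, aircraft_type),
    ]


def report_commands(registration, operator_code, top_n=10):
    """Build the read commands described above (hypothetical helper)."""
    return [
        ("SCARD", PLANES_SEEN_KEY),
        ("SISMEMBER", PLANES_SEEN_KEY, registration),
        ("PFCOUNT", PLANES_APPROX_KEY),
        # REV returns the highest scores first (Redis 6.2 or later).
        ("ZRANGE", OPERATORS_KEY, 0, top_n - 1, "REV", "WITHSCORES"),
        # ZRANK counts from the lowest score; ZREVRANK counts from the highest.
        ("ZRANK", OPERATORS_KEY, operator_code),
        ("TOPK.LIST", AIRCRAFT_TYPES_KEY),
    ]


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    for command in record_commands("G-EZBI", "EZY", "A319"):
        print(" ".join(str(part) for part in command))
```

With a client library such as redis-py, each tuple could be sent with `client.execute_command(*command)`. Building the commands separately from sending them keeps the sketch independent of any particular client.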

## Setup
