Cortex proposal for CNCF Incubation

Background

Request for Incubation

Project Review Proposal

Original TOC presentation

Cortex is a horizontally scalable, highly available, multi-tenant, Prometheus API compatible service that offers a long-term storage solution.

Cortex is aimed at teams looking for a Prometheus solution that offers the following over vanilla Prometheus:

  • Long-term metrics storage in a variety of cloud based and on-prem NoSQL data stores

  • Tenancy model supporting commercial SaaS offerings or large/multiple Kubernetes installations requiring data separation

  • On-demand Prometheus instance provisioning

  • A highly-available architecture that benefits from cloud-native architectures run with Kubernetes

  • A Prometheus experience that scales horizontally

  • The ability to handle large metric topologies in a single instance without the need for federation

Cortex was presented at the CNCF TOC meeting on 6/5/2018. We’ve grown a lot since then and the project is a lot more active and mature now.

Notable improvements in the last year include:

  • An easy to use single process version for people to test things out.
  • Queries are now much faster (up to 10x).
  • We now use a lot less disk space.
  • Cortex is now much more stable and easier to run, with more improvements on their way.
  • Our alerting and recording rule layer is now horizontally scalable.

Further, a lot of the work in Cortex also involved improvements in upstream Prometheus; a small subset includes:

Alignment with Cloud Native

Cortex fully supports the CNCF’s goal for scalability, "Ability to support all scales of deployment, from small developer centric environments to the scale of enterprises and service providers."

There are many different ways to provide a scalable and available metric system for Kubernetes. Cortex, with its tenancy model combined with its highly available and horizontally scalable architecture, serves this goal directly. Further, while having no dependency on Kubernetes, Cortex is built with Kubernetes in mind and most users deploy it in Kubernetes. We also provide a robust way for users to scale their Prometheus servers, and work on Cortex has resulted in a lot of improvements in Prometheus itself.

Comparison with Thanos

Thanos is another CNCF project that provides high availability and long-term storage for Prometheus. Thanos and Cortex make different trade-offs that appeal to different use cases:

  • Cortex is a centralised store, while Thanos holds the recent data at the edge in the Prometheus servers themselves. This presents a different trade-off: pushing writes to a central location with Cortex vs pulling data at query time with Thanos. This results in query latency and availability differences.

  • Because of the centralised nature and push based architecture of Cortex, you can enforce quality and quantity of the data being stored, and drop data you don’t want to store long-term.

  • Multitenancy is built into Cortex, which makes it a good option for larger organisations that need to keep the data from separate teams separate. Having said that, Thanos also has multitenancy on its roadmap.

  • The Thanos architecture allows for incremental adoption and reuse of existing Prometheus deployments, whereas Cortex leverages the built-in Prometheus remote-write API.
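The centralised, push-based, multi-tenant model described in the points above can be illustrated with a toy sketch. This is not Cortex code: `TenantStore`, its methods, and the per-tenant series limit are hypothetical names chosen for illustration. It only shows why a central write path makes it possible to reject unwanted data at ingestion time and to keep each tenant's data separate.

```python
from collections import defaultdict


class TenantStore:
    """Toy central store (hypothetical): every pushed sample carries a tenant ID."""

    def __init__(self, max_series_per_tenant):
        self.max_series = max_series_per_tenant
        # tenant -> series name -> list of (timestamp, value)
        self.data = defaultdict(dict)

    def push(self, tenant, series, timestamp, value):
        """Accept a sample, or reject it if it would exceed the tenant's series limit."""
        tenant_series = self.data[tenant]
        if series not in tenant_series and len(tenant_series) >= self.max_series:
            return False  # the limit is enforced centrally, at write time
        tenant_series.setdefault(series, []).append((timestamp, value))
        return True

    def query(self, tenant, series):
        """Return samples for one tenant only; other tenants' data is never consulted."""
        return list(self.data.get(tenant, {}).get(series, []))


store = TenantStore(max_series_per_tenant=2)
accepted = [
    store.push("team-a", "http_requests_total", 1, 10.0),
    store.push("team-a", "cpu_seconds_total", 1, 0.5),
    store.push("team-a", "debug_metric", 1, 1.0),  # third series for team-a: rejected
    store.push("team-b", "http_requests_total", 1, 99.0),
]
```

In this sketch, `team-a` and `team-b` can use the same series name without seeing each other's samples, and the over-limit series is dropped before it is ever stored.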

The Cortex and Thanos communities collaborate constantly. Recent examples include using the cortex-frontend to provide query caching for Thanos, and exploring writing blocks and using the Thanos query path in Cortex. There are plans to collaborate further, and the community and usage of both projects continue to grow.

Incubation State Requirements

  1. Document that it is being used successfully in production by at least three independent end users which, in the TOC’s judgement, are of adequate quality and scope.

The list of public adopters is in the repo; some notable users include:

  • Electronic Arts is using Cortex for scaling their Prometheus servers.
  • DigitalOcean is using Cortex for scaling their Prometheus servers.
  • GoJek is using Cortex to build lens, the unified internal monitoring platform for all the services in its fleet.
  • Grafana Labs uses Cortex to run a commercial hosted Prometheus service.
  • Mayadata is using Cortex to monitor the storage nodes for the users of its platform.
  • Weaveworks uses Cortex to run a commercial hosted Prometheus service.

  1. Have a healthy number of committers. A committer is defined as someone with the commit bit; i.e., someone who can accept contributions to some or all of the project.

We have 8 maintainers, 3 of whom are independent and 4 of whom come from 3 different companies. The details are here.

  1. Demonstrate a substantial ongoing flow of commits and merged contributions.

We are seeing a constant stream of performance improvements and features from the maintainers and community. See the stats here:

  1. A clear versioning scheme.

We now have regular releases, documented in RELEASE.md (https://github.com/cortexproject/cortex/blob/master/RELEASE.md). We’ve only recently started our release process, but already have 3 releases out.

  1. Roadmap

We're nowhere near done with Cortex and have a lot of plans for future development. Our roadmap currently includes:

  • Write-Ahead Log (In Progress) - This would enable crash resiliency in Cortex. Right now we replicate each sample n ways across the ingesters to protect against crashes, but we still lose data if enough ingesters crash at once.
  • Query Parallelisation (In Progress) - Parallelise the processing of queries further where possible, enabling us to execute massive queries.
  • Simpler runtime configuration - This would make Cortex easier to operate.
  • Downsampling - Lets us store less data over longer periods of time.
  • Recording Rule Substitution - Right now whenever users add recording rules, they have to wait a long time for the recording rule to populate data before they can switch their dashboards to it. We’re exploring automatically detecting and replacing the original query with its recording rule when appropriate. Once this is in place, we can also detect regular large queries and automatically replace them with recording rules.
  • Dependency-free Cortex (no NoSQL/Object Store needed) - This would make Cortex easier to operate on-prem and on bare-metal.
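To make the motivation for the Write-Ahead Log item concrete, here is a minimal sketch of n-way replication and the failure case it leaves open. It is a simplification under stated assumptions, not Cortex's actual ring: the functions `replicas_for` and `is_lost`, the ingester names, and the "consecutive ingesters from a hashed start index" placement are all hypothetical.

```python
import hashlib


def replicas_for(series, ingesters, replication_factor):
    """Pick `replication_factor` consecutive ingesters, starting at a hash-derived index."""
    digest = hashlib.sha256(series.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(ingesters)
    return [ingesters[(start + i) % len(ingesters)] for i in range(replication_factor)]


def is_lost(series, ingesters, replication_factor, crashed):
    """In-memory samples are lost only if every replica holding the series has crashed."""
    owners = replicas_for(series, ingesters, replication_factor)
    return all(owner in crashed for owner in owners)


ingesters = ["ingester-0", "ingester-1", "ingester-2", "ingester-3", "ingester-4"]
owners = replicas_for("http_requests_total", ingesters, 3)

# Two of the three replicas crash: one copy of the data remains.
survives = not is_lost("http_requests_total", ingesters, 3, crashed=set(owners[:2]))
# All three replicas crash before flushing: without a WAL, the data is gone.
lost = is_lost("http_requests_total", ingesters, 3, crashed=set(owners))
```

Replication alone tolerates up to `replication_factor - 1` simultaneous crashes for a given series; a write-ahead log would let a restarted ingester recover its samples even when all replicas go down together.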