Roadmap

v0.1

  • Controlled scheduling of deployments according to the CRD spec
  • Allow updates of deployments (store generation of the deployment in status)
  • Propagate partition ID to the managed deployment
  • Add or remove deployments based on the CRD spec

v0.2

  • Update GOMAXPROCS based on the number of CPUs (see the sketch after this list)
  • Initial resource allocation based on current Kafka metrics + static multipliers
  • Auto-scaling based on Production/Consumption/Offset
  • Store the MetricsMap from the last query in the consumer object's status
  • Rename static predictor to naive
  • Load metricsProvider from the status
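
A minimal sketch of the GOMAXPROCS item above, assuming the operator derives the value from the managed container's CPU limit; the helper name and the rounding rule are illustrative, not the actual implementation:

```go
package operator

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// setGoMaxProcs pins GOMAXPROCS to the container's CPU limit rounded up to whole
// cores, so the Go runtime doesn't schedule more threads than the cgroup allows.
// Sketch only: where this is called from is an assumption.
func setGoMaxProcs(container *corev1.Container) {
	limit := container.Resources.Limits.Cpu()
	if limit.IsZero() {
		return // no CPU limit set, leave GOMAXPROCS alone
	}
	cores := limit.MilliValue() / 1000
	if limit.MilliValue()%1000 != 0 {
		cores++ // round a fractional limit (e.g. 2500m) up to 3
	}
	container.Env = append(container.Env, corev1.EnvVar{
		Name:  "GOMAXPROCS",
		Value: strconv.FormatInt(cores, 10),
	})
}
```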

v0.3

  • Set up Travis CI
  • Query multiple Prometheus instances for metrics
  • Write the README

v0.4

  • Validate/Fix RBAC permissions
  • Build a simple service to produce pseudo-data to local Kafka/Prometheus
  • Update the README with steps to configure a dev environment on Linux/macOS

v0.5 - scaling

  • Scale down only when no lag is present
  • Scale only after X periods of lag/no lag (see the sketch after this list)
  • Introduce another deployment status, SATURATED, to indicate that the deployment doesn't have enough resources
  • Expose the resource saturation level (how many CPUs are missing)
  • Per-deployment auto-scaling pause
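
The two lag-related items above amount to a small hysteresis rule; a rough sketch of the decision, assuming placeholder names (scaleDecider, threshold for the "X periods"), not the real scaling code:

```go
package scaling

// scaleDecider counts consecutive observations before allowing a scale step,
// so a single noisy lag sample doesn't flap the deployment up and down.
type scaleDecider struct {
	pendingUp   int // consecutive periods with lag
	pendingDown int // consecutive periods without lag
	threshold   int // the "X periods" from the roadmap item
}

// observe returns +1 to scale up, -1 to scale down, 0 to do nothing.
func (d *scaleDecider) observe(lagging bool) int {
	if lagging {
		d.pendingDown = 0
		d.pendingUp++
		if d.pendingUp >= d.threshold {
			d.pendingUp = 0
			return 1
		}
		return 0
	}
	// scale down only when no lag is present, and only after enough quiet periods
	d.pendingUp = 0
	d.pendingDown++
	if d.pendingDown >= d.threshold {
		d.pendingDown = 0
		return -1
	}
	return 0
}
```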

v0.6 - observability

  • Post behaviour updates to Kubernetes events
  • Clean up logging
  • Expose metrics about the operator's own health and behaviour
  • Grafana dashboard
  • Update the spec to deploy 3 instances of the operator
  • Add totalMaxAllowed, which will limit the total number of cores available to the consumer (see the sketch below)
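
One possible reading of the totalMaxAllowed item, sketched under the assumption that per-instance core estimates are simply scaled down when their sum exceeds the global cap; the function and parameter names are placeholders:

```go
package predictor

// capTotal proportionally shrinks per-instance core estimates when their sum
// exceeds totalMaxAllowed (the proposed global limit for the whole consumer).
// Illustrative sketch, not the operator's actual allocation logic.
func capTotal(estimates []float64, totalMaxAllowed float64) []float64 {
	var sum float64
	for _, e := range estimates {
		sum += e
	}
	if totalMaxAllowed <= 0 || sum <= totalMaxAllowed {
		return estimates // under the cap (or cap disabled): keep estimates as-is
	}
	scale := totalMaxAllowed / sum
	capped := make([]float64, len(estimates))
	for i, e := range estimates {
		capped[i] = e * scale
	}
	return capped
}
```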

v0.7 - testing

  • Guest mode: ability to run the operator without cluster-wide permissions
  • Verify that scaling works with multi-container pods
  • Verify that disabling all auto-scaling and setting resources in the deployment itself works
  • Verify that HA mode works
  • Verify that the system operates as expected when auto-scaling is disabled
  • Scale up only if there is lag, scale down only if there is none
  • Update owner: if the operator was restarted it gets a new UID, so we need to update the ownerRef during reconcile
  • [TEST] Add more integration tests
  • [TEST] Add a test to verify that env variables are always set
  • [BUG] Updating the operator spec to scale all deployments down works, but resuming doesn't

v0.8 - bugfixes

  • Verify that the system works without ResourcePolicy set
  • Make scaleStatePendingPeriod configurable
  • Profile slow reconcile (15s for ~300 deployments)
  • Fix statuses after the change in scaling logic (scale based on lag)

v0.9 - Multi-partition assignment

  • Static multi-partition assignment
  • Improve logging

v1.0

  • Promote v1alpha1 version to v1
  • Resync all metrics on reconcile; the status metric is wrong (it seems to be updated only during creation of deployments)
  • Replace partition labels with ranges, validate length

v1.1

  • If there are no metrics at all, allocate the average/mean instead of the minimum
  • Make the CPU increment configurable (see the sketch after this list)
  • Clean up logging
    • Log not only the scaling cmp value (-1, 0, 1), but also how many cores were estimated per instance
    • Introduce a verbose mode to make it easier to debug single-instance issues.
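
A small sketch of what a configurable CPU increment could mean in practice: the estimated request is rounded up to the nearest increment before being written to the deployment. The helper name and the millicore units are assumptions:

```go
package predictor

// roundToIncrement rounds an estimated CPU request (in millicores) up to the
// nearest configurable increment, e.g. 1375m with a 250m increment becomes 1500m.
// Illustrative only; the real estimator may work in different units.
func roundToIncrement(milliCPU, incrementMilli int64) int64 {
	if incrementMilli <= 0 {
		return milliCPU // increment not configured, keep the raw estimate
	}
	if rem := milliCPU % incrementMilli; rem != 0 {
		return milliCPU + (incrementMilli - rem)
	}
	return milliCPU
}
```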

vNext

  • Implement bounce mode: keep track of the last few node names per instance and add an anti-affinity rule to the deployment to avoid scheduling onto those nodes during the next scale-up (see the sketch after this list).
  • Implement progressive updates (canary).
  • Rework configuration of RAM estimation; make it possible to provide a formula, e.g. (fixed + ramPerCore)
  • Consider replacing DeploymentSpec with PodSpec/PodLabels/PodAnnotations; ability to set additional deployment-level annotations/labels
  • Recreate deployments from scratch if any of the immutable fields were changed in the deploymentSpec. Right now this requires manually deleting all deployments.
  • Use annotations to pause/resume configmap-based consumers
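
For the bounce-mode item, the anti-affinity rule could be expressed as a required node-affinity NotIn match on the hostnames the instance recently ran on. A hedged sketch; the function name and the idea of keeping recent node names in the consumer status are assumptions:

```go
package builder

import corev1 "k8s.io/api/core/v1"

// avoidRecentNodes adds a required node-affinity rule that keeps the pod off the
// nodes it recently ran on, so the next scale-up "bounces" it to a fresh node.
// Sketch only: how recentNodes is tracked is an assumption.
func avoidRecentNodes(spec *corev1.PodSpec, recentNodes []string) {
	if len(recentNodes) == 0 {
		return
	}
	if spec.Affinity == nil {
		spec.Affinity = &corev1.Affinity{}
	}
	spec.Affinity.NodeAffinity = &corev1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      "kubernetes.io/hostname",
					Operator: corev1.NodeSelectorOpNotIn,
					Values:   recentNodes,
				}},
			}},
		},
	}
}
```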

Unsorted

  • Consider using the number of messages in all estimates instead of the projected lag time
  • Add jitter to the scaling time
  • Dynamic multi-partition assignment, instead of a static numPartitionsPerInstance (see the sketch at the end of this section):
    • Configure min/max values for numPartitionsPerInstance
    • Configure min/max number of pods per consumer
    • Based on the production rate, decide how many partitions to assign; scale each instance vertically until it fits. If the per-pod resource limit is exhausted but the global one is not, scale horizontally and reduce the number of partitions per instance.
  • [BUG] An update of the auto-scaler spec (ratePerCore, ramPerCore) should (?) trigger reconciliation
  • Reset status annotation if MANUAL mode is enabled
  • Consider making number of partitions optional in the spec
  • [Feature] Implement defaulting/validating webhooks
  • [Feature] Call external webhooks on scaling events
  • [Feature] Vertical auto-scaling of balanced workloads (single deployment)
  • [Feature] Fully dynamic resource allocations based on historic data
  • [Feature] ? Consider adding support for VPA/HPA
  • [Feature] ? Operations tool: consumerctl stop/start consumer
  • [Feature] ? Consider getting all the pods to estimate uptime and avoid frequent restarts.
  • [Feature] Implement a second metrics provider (Kafka)
  • [Feature] Scale up without restart (blocked)
  • [Feature] Get Kafka lag directly from Prometheus (blocked)
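
To make the dynamic multi-partition assignment item above more concrete, a rough sketch of the decision it describes: start with the densest packing and reduce partitions per instance until the vertical estimate fits the per-pod limit. All parameter names (ratePerPartition, maxCoresPerPod, etc.) are placeholders, not real spec fields:

```go
package predictor

// choosePartitionsPerInstance picks how many partitions each pod should own:
// prefer packing many partitions per pod, and if the resulting vertical estimate
// would not fit the per-pod core limit, spread out horizontally by lowering the
// packing. Illustrative sketch of the roadmap item only.
func choosePartitionsPerInstance(
	ratePerPartition float64, // observed production rate per partition
	ratePerCore float64, // throughput one core is expected to handle
	maxCoresPerPod float64, // per-pod vertical limit
	minPerInstance, maxPerInstance int, // allowed numPartitionsPerInstance range
) int {
	for per := maxPerInstance; per >= minPerInstance; per-- {
		coresNeeded := float64(per) * ratePerPartition / ratePerCore
		if coresNeeded <= maxCoresPerPod {
			return per // fits vertically at this packing level
		}
	}
	// even the minimum packing doesn't fit the per-pod limit; fall back to it and
	// let the horizontal (pods per consumer) limits absorb the rest
	return minPerInstance
}
```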