Roadmap

v0.1

  • Controlled scheduling of deployments according to the CRD spec
  • Allow updates of deployments (store generation of the deployment in status)
  • Propagate partition ID to the managed deployment
  • Add or remove deployments based on the CRD spec

v0.2

  • Update GOMAXPROCS based on the number of CPUs (see the sketch after this list)
  • Initial resource allocation based on current Kafka metrics + static multipliers
  • Auto-scaling based on Production/Consumption/Offset
  • Store the MetricsMap from the last query in the consumer object's status
  • Rename static predictor to naive
  • Load metricsProvider from the status
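
A minimal sketch of the GOMAXPROCS item above, assuming the operator derives the value from the managed container's CPU limit; the helper name and the rounding rule are illustrative, not the actual implementation:

```go
package operator

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// setGoMaxProcs pins GOMAXPROCS to the container's CPU limit rounded up to whole
// cores, so the Go runtime doesn't schedule more threads than the cgroup allows.
// Sketch only: where this is called from is an assumption.
func setGoMaxProcs(container *corev1.Container) {
	limit := container.Resources.Limits.Cpu()
	if limit.IsZero() {
		return // no CPU limit set, leave GOMAXPROCS alone
	}
	cores := limit.MilliValue() / 1000
	if limit.MilliValue()%1000 != 0 {
		cores++ // round a fractional limit (e.g. 2500m) up to 3
	}
	container.Env = append(container.Env, corev1.EnvVar{
		Name:  "GOMAXPROCS",
		Value: strconv.FormatInt(cores, 10),
	})
}
```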

v0.3

  • Set up Travis CI
  • Query multiple Prometheus instances for metrics
  • Write the README

v0.4

  • Validate/Fix RBAC permissions
  • Build a simple service to produce pseudo-data to local Kafka/Prometheus
  • Update the README with steps to configure a dev environment on Linux/macOS

v0.5 - scaling

  • Scale down only when no lag is present
  • Scale only after X periods of lag/no lag (see the sketch after this list)
  • Introduce another deployment status, SATURATED, to indicate that the deployment doesn't have enough resources
  • Expose the resource saturation level (how many CPUs are missing)
  • Per-deployment auto-scaling pause
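
The two lag-related items above amount to a small hysteresis rule; a rough sketch of the decision, assuming placeholder names (scaleDecider, threshold for the "X periods"), not the real scaling code:

```go
package scaling

// scaleDecider counts consecutive observations before allowing a scale step,
// so a single noisy lag sample doesn't flap the deployment up and down.
type scaleDecider struct {
	pendingUp   int // consecutive periods with lag
	pendingDown int // consecutive periods without lag
	threshold   int // the "X periods" from the roadmap item
}

// observe returns +1 to scale up, -1 to scale down, 0 to do nothing.
func (d *scaleDecider) observe(lagging bool) int {
	if lagging {
		d.pendingDown = 0
		d.pendingUp++
		if d.pendingUp >= d.threshold {
			d.pendingUp = 0
			return 1
		}
		return 0
	}
	// scale down only when no lag is present, and only after enough quiet periods
	d.pendingUp = 0
	d.pendingDown++
	if d.pendingDown >= d.threshold {
		d.pendingDown = 0
		return -1
	}
	return 0
}
```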

v0.6 - observability

  • Post behaviour updates to Kubernetes events
  • Clean up logging
  • Expose metrics about the operator's own health and behaviour
  • Grafana dashboard
  • Update the spec to deploy 3 instances of the operator
  • Add totalMaxAllowed, which will limit the total number of cores available to the consumer (see the sketch below)
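
One possible reading of the totalMaxAllowed item, sketched under the assumption that per-instance core estimates are simply scaled down when their sum exceeds the global cap; the function and parameter names are placeholders:

```go
package predictor

// capTotal proportionally shrinks per-instance core estimates when their sum
// exceeds totalMaxAllowed (the proposed global limit for the whole consumer).
// Illustrative sketch, not the operator's actual allocation logic.
func capTotal(estimates []float64, totalMaxAllowed float64) []float64 {
	var sum float64
	for _, e := range estimates {
		sum += e
	}
	if totalMaxAllowed <= 0 || sum <= totalMaxAllowed {
		return estimates // under the cap (or cap disabled): keep estimates as-is
	}
	scale := totalMaxAllowed / sum
	capped := make([]float64, len(estimates))
	for i, e := range estimates {
		capped[i] = e * scale
	}
	return capped
}
```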

v0.7 - testing

  • Guest mode: ability to run the operator without cluster-wide permissions
  • Verify that scaling works with multi-container pods
  • Verify that disabling all auto-scaling and setting resources in the deployment itself works
  • Verify that HA mode works
  • Verify that the system operates as expected when auto-scaling is disabled
  • Scale up only if there is lag, scale down only if there is none
  • Update owner: if the operator was restarted it gets a new UID, so we need to update the ownerRef during reconcile
  • [TEST] Add more integration tests
  • [TEST] Add a test to verify that env variables are always set
  • [BUG] Updating the operator spec to scale all deployments down works, but resuming doesn't

v0.8 - bugfixes

  • Verify that the system works without ResourcePolicy set
  • Make scaleStatePendingPeriod configurable
  • Profile slow reconcile (15s for ~300 deployments)
  • Fix statuses after the change in scaling logic (scale based on lag)

v0.9 - Multi-partition assignment

  • Static multi-partition assignment
  • Improve logging

v1.0

  • Promote v1alpha1 version to v1
  • Resync all metrics on reconcile; the status metric is wrong (it seems to be updated only during creation of deployments)
  • Replace partition labels with ranges, validate length

v1.1

  • If there are no metrics at all, allocate the average/mean instead of the minimum
  • Make the CPU increment configurable (see the sketch after this list)
  • Clean up logging
    • Log not only the scaling cmp value (-1, 0, 1), but also how many cores were estimated per instance
    • Introduce a verbose mode to make it easier to debug single-instance issues.
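
A small sketch of what a configurable CPU increment could mean in practice: the estimated request is rounded up to the nearest increment before being written to the deployment. The helper name and the millicore units are assumptions:

```go
package predictor

// roundToIncrement rounds an estimated CPU request (in millicores) up to the
// nearest configurable increment, e.g. 1375m with a 250m increment becomes 1500m.
// Illustrative only; the real estimator may work in different units.
func roundToIncrement(milliCPU, incrementMilli int64) int64 {
	if incrementMilli <= 0 {
		return milliCPU // increment not configured, keep the raw estimate
	}
	if rem := milliCPU % incrementMilli; rem != 0 {
		return milliCPU + (incrementMilli - rem)
	}
	return milliCPU
}
```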

vNext

  • Implement bounce mode: keep track of the last few node names per instance and add an anti-affinity rule to the deployment to avoid scheduling onto those nodes during the next scale-up (see the sketch after this list).
  • Implement progressive updates (canary).
  • Rework configuration of RAM estimation; make it possible to provide a formula, e.g. (fixed + ramPerCore)
  • Consider replacing DeploymentSpec with PodSpec/PodLabels/PodAnnotations; ability to set additional deployment-level annotations/labels
  • Recreate deployments from scratch if any of the immutable fields were changed in the deploymentSpec. Right now this requires manually deleting all deployments.
  • Use annotations to pause/resume configmap-based consumers
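
For the bounce-mode item, the anti-affinity rule could be expressed as a required node-affinity NotIn match on the hostnames the instance recently ran on. A hedged sketch; the function name and the idea of keeping recent node names in the consumer status are assumptions:

```go
package builder

import corev1 "k8s.io/api/core/v1"

// avoidRecentNodes adds a required node-affinity rule that keeps the pod off the
// nodes it recently ran on, so the next scale-up "bounces" it to a fresh node.
// Sketch only: how recentNodes is tracked is an assumption.
func avoidRecentNodes(spec *corev1.PodSpec, recentNodes []string) {
	if len(recentNodes) == 0 {
		return
	}
	if spec.Affinity == nil {
		spec.Affinity = &corev1.Affinity{}
	}
	spec.Affinity.NodeAffinity = &corev1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      "kubernetes.io/hostname",
					Operator: corev1.NodeSelectorOpNotIn,
					Values:   recentNodes,
				}},
			}},
		},
	}
}
```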

Unsorted

  • Consider using the number of messages in all estimates instead of the projected lag time
  • Add jitter to the scaling time
  • Dynamic multi-partition assignment, instead of a static numPartitionsPerInstance (see the sketch at the end of this section):
    • Configure min/max values for numPartitionsPerInstance
    • Configure min/max number of pods per consumer
    • Based on the production rate, decide how many partitions to assign; scale each instance vertically until it fits. If the per-pod resource limit is exhausted but the global one is not, scale horizontally and reduce the number of partitions per instance.
  • [BUG] An update of the auto-scaler spec (ratePerCore, ramPerCore) should (?) trigger reconciliation
  • Reset status annotation if MANUAL mode is enabled
  • Consider making number of partitions optional in the spec
  • [Feature] Implement defaulting/validating webhooks
  • [Feature] Call external webhooks on scaling events
  • [Feature] Vertical auto-scaling of balanced workloads (single deployment)
  • [Feature] Fully dynamic resource allocations based on historic data
  • [Feature] ? Consider adding support for VPA/HPA
  • [Feature] ? Operations tool: consumerctl stop/start consumer
  • [Feature] ? Consider getting all the pods to estimate uptime and avoid frequent restarts.
  • [Feature] Implement a second metrics provider (Kafka)
  • [Feature] Scale up without restart (blocked)
  • [Feature] Get Kafka lag directly from Prometheus (blocked)
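
To make the dynamic multi-partition assignment item above more concrete, a rough sketch of the decision it describes: start with the densest packing and reduce partitions per instance until the vertical estimate fits the per-pod limit. All parameter names (ratePerPartition, maxCoresPerPod, etc.) are placeholders, not real spec fields:

```go
package predictor

// choosePartitionsPerInstance picks how many partitions each pod should own:
// prefer packing many partitions per pod, and if the resulting vertical estimate
// would not fit the per-pod core limit, spread out horizontally by lowering the
// packing. Illustrative sketch of the roadmap item only.
func choosePartitionsPerInstance(
	ratePerPartition float64, // observed production rate per partition
	ratePerCore float64, // throughput one core is expected to handle
	maxCoresPerPod float64, // per-pod vertical limit
	minPerInstance, maxPerInstance int, // allowed numPartitionsPerInstance range
) int {
	for per := maxPerInstance; per >= minPerInstance; per-- {
		coresNeeded := float64(per) * ratePerPartition / ratePerCore
		if coresNeeded <= maxCoresPerPod {
			return per // fits vertically at this packing level
		}
	}
	// even the minimum packing doesn't fit the per-pod limit; fall back to it and
	// let the horizontal (pods per consumer) limits absorb the rest
	return minPerInstance
}
```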