Compaction

Description

Ambry is a handle store that supports the storage of small and large objects. Two of its main features are blob deletion and blob expiration (TTL). Over time, Ambry can accumulate many “dead” objects and tombstones, and compaction is the opportunity to recover the resources spent on these “dead” objects.

Brief Introduction

Compaction is a store-local operation i.e. each replica of a partition (or in implementation parlance, a BlobStore) runs compaction independently of all other replicas of the same partition and other partitions in the node and across the cluster. This means that there is no coordination of when data will be compacted and each BlobStore runs compaction when it is ready and able.

When compaction runs on a range of log segments, all the "valid" data in these log segments is copied over to new log segments. Valid data is data that has not been deleted and that either has no expiry or has not yet expired (see the config section for what counts as "deleted" data). Once the copy is complete, the source log segments are freed. Note that the "active" segment, i.e. the segment that is currently being written to, is never eligible for compaction. More precisely, any log segment with active entries in the Journal is ineligible for compaction, and this may include segments other than the "active" segment.
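The validity rule above can be written as a small predicate. This is a minimal sketch, not Ambry's implementation: the `BlobRecord` type, its field names, and the choice to treat a blob as expired at exactly its expiry time are assumptions made for illustration. The retention window corresponds to store.deleted.message.retention.days from the config section.

```python
from dataclasses import dataclass
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class BlobRecord:
    # Hypothetical record; field names are illustrative, not Ambry's index format.
    expires_at: Optional[float] = None  # epoch seconds; None means no expiry
    deleted_at: Optional[float] = None  # epoch seconds; None means not deleted


def is_valid(blob: BlobRecord, now: float, delete_retention_days: int) -> bool:
    """Return True if compaction would copy this blob to the new log segments."""
    if blob.expires_at is not None and blob.expires_at <= now:
        return False  # the blob has expired
    if blob.deleted_at is not None:
        # A delete only counts as "in effect" once the retention window has elapsed.
        in_effect_at = blob.deleted_at + delete_retention_days * SECONDS_PER_DAY
        if in_effect_at <= now:
            return False
    return True
```

A blob deleted one day ago with a seven-day retention window is still copied; the same blob deleted eight days ago is dropped.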

Compaction frees up space in multiples of segment sizes only. It is possible for a segment to have only 1 byte of valid data but still count as a whole segment towards used capacity.
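To make the segment-granularity point concrete, the sketch below estimates how many whole segments a compaction run would free. It is a simplification under stated assumptions: it ignores any per-segment overhead and assumes valid data packs perfectly into the new segments. The function name is hypothetical.

```python
from typing import List


def segments_reclaimed(valid_bytes_per_segment: List[int], segment_size: int) -> int:
    """Estimate whole segments freed by compacting the given source segments."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    total_valid = sum(valid_bytes_per_segment)
    # Integer ceiling division: even 1 byte of valid data occupies a whole segment.
    segments_after = -(-total_valid // segment_size)
    return len(valid_bytes_per_segment) - segments_after
```

With a 1 GiB segment size, compacting a single segment that holds 1 byte of valid data frees nothing, while compacting three segments holding a total of less than 1 GiB of valid data frees two.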

Pre-requisites

To enable compaction, the log of a BlobStore must be segmented (single segment logs cannot be compacted owing to the fact that compaction never touches the segment currently being written to). Segmentation is a property that is set when the BlobStore is created and cannot be changed. If a non-segmented log has to be converted into a segmented log, it has to be deleted and recreated. Recreation can be achieved either via replication or by using the StoreCopier or DiskReformatter.

To create a segmented log, set the config store.segment.size.in.bytes in StoreConfig to an exact divisor of the partition size. For example, if the partition size is 160G, segment sizes can be 1G, 2G, 4G, 5G, 8G, 10G, 16G, 20G, 32G, 40G or 80G. The choice of segment size is important since it determines the number of files created and how often space savings are realized after compaction. Since compaction reclaims space in multiples of segment sizes only, smaller segment sizes realize space savings more frequently than larger ones but result in more files on disk.
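The list of candidate sizes can be derived mechanically. The sketch below works in whole gigabytes for readability (the actual config is expressed in bytes) and excludes the partition size itself, since a single-segment log cannot be compacted. The function names are hypothetical.

```python
from typing import List


def candidate_segment_sizes(partition_size_gb: int) -> List[int]:
    """Return segment sizes (in GB) that divide the partition size exactly."""
    return [s for s in range(1, partition_size_gb) if partition_size_gb % s == 0]


def segment_count(partition_size_gb: int, segment_size_gb: int) -> int:
    """Return how many segment files a full partition would contain."""
    if partition_size_gb % segment_size_gb != 0:
        raise ValueError("segment size must divide the partition size exactly")
    return partition_size_gb // segment_size_gb
```

For a 160G partition this yields the eleven sizes listed above. Choosing 1G means up to 160 files per store, while 80G means only 2 files but space is reclaimed in 80G steps.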

Once a segment size is set, it remains the same for the lifetime of the BlobStore (i.e. all segments are of the same size) since the config is consulted only when a store is created. Changing the config has no effect on stores that have already been bootstrapped.

Compaction Policy

Compaction can be requested via multiple triggers but is not guaranteed to run in response to any of them. Whether compaction runs is determined by a configurable CompactionPolicy. Implementations of the policy act as gatekeepers and determine whether compaction should be run in response to a trigger. They also determine the range of log segments that compaction has to be run on. There are currently two implementations of CompactionPolicy:

CompactAllPolicy: Compacts all the log segments except ones that have active entries in the Journal regardless of space saving realizations.
StatsBasedCompactionPolicy: Uses BlobStoreStats to determine the "best" range of log segments to run compaction on.
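As an illustration of how a compact-all style policy might choose its range, the sketch below selects every segment older than the first one that has entries in the Journal. This is not the CompactAllPolicy source: the function name, the segment names, and the assumption that Journal-covered segments are always the most recent ones are all illustrative.

```python
from typing import List, Optional, Set, Tuple


def compact_all_range(segments: List[str], in_journal: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (first, last) segment to compact, or None if nothing is eligible.

    `segments` is ordered oldest first; `in_journal` holds the names of
    segments that still have active entries in the Journal.
    """
    eligible = []
    for name in segments:
        if name in in_journal:
            break  # assumed: everything from here on is too recent to compact
        eligible.append(name)
    if not eligible:
        return None
    return eligible[0], eligible[-1]
```

A stats-based policy would differ mainly in how it picks the range: rather than taking everything eligible, it would weigh how much space each candidate range is expected to free.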

Compaction Related Configs

store.deleted.message.retention.days: The number of days that must elapse after a delete has been issued for the delete to be considered "in effect". Until this time has passed, the put that the delete refers to is still considered valid.
store.cleanup.operations.bytes.per.sec: The disk I/O bandwidth available for use by compaction. Note that only one compaction is active per disk at any point in time i.e. if there are 10 stores on 10 disks, compaction occurs in parallel on all 10 stores; however if there are 10 stores on 1 disk, compaction occurs one store at a time. So this configuration essentially determines how much disk I/O bandwidth compaction is allowed to use.
store.compaction.triggers: The triggers that have been enabled for compaction, i.e. the methods through which compaction can be initiated. The currently supported triggers are documented with the config in StoreConfig.
store.compaction.check.frequency.in.hours: If the "Periodic" trigger is enabled, this config determines how often the BlobStore is checked for eligibility to compact.
store.compaction.policy.factory: The CompactionPolicyFactory implementation to use as a gatekeeper for determining whether a trigger results in an actual compaction.
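The one-compaction-per-disk behaviour noted under store.cleanup.operations.bytes.per.sec can be pictured as one queue of stores per disk. The sketch below is only a model of that rule, with hypothetical store and disk names; it says nothing about how Ambry schedules the work internally.

```python
from collections import defaultdict
from typing import Dict, List


def compaction_queues(store_to_disk: Dict[str, str]) -> Dict[str, List[str]]:
    """Group stores by disk; stores in the same queue are compacted one at a time."""
    queues: Dict[str, List[str]] = defaultdict(list)
    for store, disk in sorted(store_to_disk.items()):
        queues[disk].append(store)
    return dict(queues)


def max_parallel_compactions(store_to_disk: Dict[str, str]) -> int:
    """Upper bound on concurrent compactions: one per distinct disk."""
    return len(set(store_to_disk.values()))
```

Ten stores spread over ten disks give ten queues of one store each, so all ten can compact in parallel. Ten stores on one disk give a single queue of ten.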

The following configs may be used by CompactionPolicy implementations:

store.min.used.capacity.to.trigger.compaction.in.percentage: The minimum fill level of the BlobStore for it to be eligible to run compaction.
store.min.log.segment.count.to.reclaim.to.trigger.compaction: The number of log segments that have to be reclaimed on a compaction run for the compaction to be considered viable. This is used by policies that can glean this information (StatsBasedCompactionPolicy) but not by ones that cannot (CompactAllPolicy).
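A policy's gatekeeping decision based on these two thresholds might look like the following. This is a hedged sketch: the function and parameter names are hypothetical, the threshold values in the usage note are illustrative rather than Ambry's defaults, and `reclaimable_segments=None` stands in for a policy that cannot estimate the savings.

```python
from typing import Optional


def should_compact(
    used_bytes: int,
    capacity_bytes: int,
    min_used_capacity_pct: float,
    reclaimable_segments: Optional[int] = None,
    min_segments_to_reclaim: int = 1,
) -> bool:
    """Decide whether a trigger should result in an actual compaction run."""
    used_pct = 100.0 * used_bytes / capacity_bytes
    if used_pct < min_used_capacity_pct:
        return False  # the store is not full enough to bother
    if reclaimable_segments is None:
        return True  # the policy cannot estimate savings, so it skips that check
    return reclaimable_segments >= min_segments_to_reclaim
```

For example, with an illustrative 50% threshold, a store that is 60% full but would reclaim no segments is skipped by a stats-aware policy, whereas a policy without that information would go ahead.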

Enabling Compaction

Compaction is enabled by simply setting valid triggers (comma separated) for the config store.compaction.triggers. All the other configs have been set to sensible defaults that can be changed based on requirements.

If the "Periodic" trigger has been enabled, the eligibility of every store for compaction will be checked on startup and then every store.compaction.check.frequency.in.hours hours.

If the "Admin" trigger has been enabled, admin requests for compaction of partitions will be processed and the relevant BlobStore will be checked for compaction eligibility. To make these requests, the ServerAdminTool can be used.
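As a sketch of what the resulting configuration could look like, the helper below renders the two relevant properties as text. The helper itself is hypothetical, and it only knows the two trigger names mentioned on this page; StoreConfig remains the authoritative list. The frequency value in the usage note is illustrative, not a recommended setting.

```python
from typing import List, Optional

# Only the trigger names mentioned in this page; see StoreConfig for the full list.
KNOWN_TRIGGERS = {"Periodic", "Admin"}


def compaction_properties(triggers: List[str], check_frequency_hours: Optional[int] = None) -> str:
    """Render the compaction-related properties as key=value lines."""
    unknown = [t for t in triggers if t not in KNOWN_TRIGGERS]
    if unknown:
        raise ValueError(f"unknown triggers: {unknown}")
    lines = ["store.compaction.triggers=" + ",".join(triggers)]
    if check_frequency_hours is not None:
        lines.append(f"store.compaction.check.frequency.in.hours={check_frequency_hours}")
    return "\n".join(lines)
```

Calling `compaction_properties(["Periodic", "Admin"], 24)` produces a `store.compaction.triggers=Periodic,Admin` line followed by a `store.compaction.check.frequency.in.hours=24` line.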

Effects of Compaction

There are a few possible side effects of compaction (not exhaustive):

Latency may be affected: Compaction currently does not use direct I/O and processes blobs by bringing them into the file cache. Though there is no risk of OOM, cache usage is affected: the file cache holds data that users are not accessing, and there is less space for write buffering. However, there are no lasting ill effects; things return to normal after compaction.
Replication will be reset: The current implementation resets the replication tokens of all the peers (i.e. remote replicas start from 0, while the local replica maintains its position in all remotes). This means that there will be an increase in ReplicaMetadata requests in the cluster without any exchange of data until the peers catch up. This can also cause a delay in the replication of the most recent blobs.
Expired blobs will turn into Not_Found: Blobs that were expired and were cleaned up become Not_Found. Not_Found has a specific connotation for Ambry since frontends will query all replicas until they all report Not_Found, or one of them reports Deleted/Expired or serves the blob. If clients are querying expired blobs often, then this will have a direct impact on the number of requests in the cluster and client latency.
Queries with special GetOptions can fail: Without compaction, queries that ask for expired and deleted blobs will always succeed. With compaction, it is possible that the storage nodes have already cleaned the data up, in which case these GETs will fail.
Hard delete progress may be reset: Hard delete is suspended before compaction is started. However, if the position that hard delete had reached is compacted, the hard delete progress token is reset and hard delete will redo work from the beginning. This is a rare occurrence, however, since hard delete is usually working in the range of the Journal/latest index segment.
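The cost of the Not_Found case described above can be seen in a small model of how a frontend resolves a GET. This is a simplification under stated assumptions: replicas are asked one after another (a real frontend may query in parallel), and the `Reply` enum and `resolve_get` function are hypothetical names, not Ambry's router API.

```python
from enum import Enum
from typing import Iterable, Tuple


class Reply(Enum):
    FOUND = "Found"
    NOT_FOUND = "Not_Found"
    DELETED = "Deleted"
    EXPIRED = "Expired"


def resolve_get(replies: Iterable[Reply]) -> Tuple[Reply, int]:
    """Return the final answer and the number of replicas that had to be asked."""
    asked = 0
    for reply in replies:
        asked += 1
        if reply is not Reply.NOT_FOUND:
            return reply, asked  # any definitive answer ends the search
    return Reply.NOT_FOUND, asked  # every replica had to report Not_Found
```

Before compaction, the first replica can answer Expired and the request ends after one query. Once all three replicas have compacted the blob away, the same request touches all three before returning Not_Found.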

Resources

This wiki introduces and describes compaction, pre-requisites, configs, enabling and how compaction can affect the service. For more details on the design and actual implementation, please refer to the following resources:
