
🚚 Bulker-server configuration

See also HTTP API

Running Bulker

The best way to run Bulker is to use the Docker image:

  • Use jitsucom/bulker:latest for the latest stable version
  • Use jitsucom/bulker:canary for the latest build

Alternatively, you can build your own binary by running go mod download && go build -o bulker

Bulker is configured via environment variables. All variables are prefixed with BULKER_. See the list of available variables below.
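For example, a minimal Docker-based run might look like this (the broker address and token are placeholders, not real defaults):

docker run -p 3042:3042 \
  -e BULKER_KAFKA_BOOTSTRAP_SERVERS="kafka.example.com:9092" \
  -e BULKER_RAW_AUTH_TOKENS="dev-token" \
  jitsucom/bulker:latest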

Common parameters

BULKER_INSTANCE_ID

Optional, default value: random uuid

ID of the Bulker instance. It is used for identifying Kafka consumers and metrics. If not set, an instance ID is generated, persisted to disk (~/.bulkerapp/instance_id), and reused on the next restart.

BULKER_HTTP_PORT

Optional, default value: 3042

Port of Bulker's HTTP interface.

BULKER_AUTH_TOKENS

Optional, default value: ''

A comma-separated list of hashed auth tokens that authorize users of the HTTP interface. Each token must have the format:

${salt}.${hash} where ${salt} is a random string and ${hash} is hex(sha512($token + $salt + BULKER_TOKEN_SECRET)).

$token must consist only of letters, digits, underscores and dashes

BULKER_TOKEN_SECRET

Optional, default value: empty string

A secret used for hashing tokens in BULKER_AUTH_TOKENS (see above).
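For illustration, a minimal shell sketch of producing a hashed token (all values are hypothetical, and it assumes the sha512sum utility is available):

# All values below are hypothetical
TOKEN="my-token"          # letters, digits, underscores and dashes only
SALT="k5jd8s2m"           # random string
SECRET="my-token-secret"  # same value as BULKER_TOKEN_SECRET

# hash = hex(sha512($token + $salt + BULKER_TOKEN_SECRET))
HASH=$(printf '%s%s%s' "$TOKEN" "$SALT" "$SECRET" | sha512sum | cut -d ' ' -f 1)

export BULKER_TOKEN_SECRET="$SECRET"
export BULKER_AUTH_TOKENS="${SALT}.${HASH}"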

BULKER_RAW_AUTH_TOKENS

Optional, default value: ''

A comma-separated list of plain (non-hashed) tokens. Each token must consist only of letters, digits, underscores and dashes.

Can be used instead of the BULKER_AUTH_TOKENS / BULKER_TOKEN_SECRET pair. It offers simplicity at the cost of lower security.

Not recommended for production.
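For example, a local development setup might use (tokens are hypothetical):

export BULKER_RAW_AUTH_TOKENS="dev-token-1,dev-token-2"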

Connection to Kafka

BULKER_KAFKA_BOOTSTRAP_SERVERS

Required

A comma-separated list of Kafka brokers, each in host:port format.

BULKER_KAFKA_SSL

Whether SSL should be enabled for the Kafka connection.

BULKER_KAFKA_SSL_SKIP_VERIFY

Skip SSL verification of the Kafka server certificate.

BULKER_KAFKA_SASL (aka Kafka auth)

Kafka authentication as a JSON object: {"mechanism": "SCRAM-SHA-256|PLAIN", "username": "user", "password": "password"}
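Put together, a connection to an SSL-protected, SASL-authenticated cluster could look like this (hostnames and credentials are placeholders):

export BULKER_KAFKA_BOOTSTRAP_SERVERS="kafka-1.example.com:9092,kafka-2.example.com:9092"
export BULKER_KAFKA_SSL="true"
export BULKER_KAFKA_SASL='{"mechanism": "SCRAM-SHA-256", "username": "bulker", "password": "change-me"}'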

Batching

Bulker buffers events and sends them to the destination in batches if mode=batch. A batch is sent when either of the following is true:

  • batchSize events are buffered
  • frequency minutes (a float) have passed since the first event in the batch was buffered

Batch settings that apply by default to all destinations may be set with the following variables:

BULKER_BATCH_RUNNER_DEFAULT_PERIOD_SEC

Optional, default value: 300 (5 min)

Default period for batch processing for destinations where frequency is not set explicitly. Read more about batch processing configuration below

BULKER_BATCH_RUNNER_DEFAULT_BATCH_SIZE

Optional, default value: 10000

Default batch size for destinations where batchSize is not set explicitly. Read more about batch processing configuration below
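For example, to make destinations without explicit settings flush every 10 minutes or every 20000 events, whichever comes first:

export BULKER_BATCH_RUNNER_DEFAULT_PERIOD_SEC=600
export BULKER_BATCH_RUNNER_DEFAULT_BATCH_SIZE=20000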

See also DB Feature Matrix

Streaming

If mode is stream, Bulker sends events to the destination as soon as they are received.

Error Handling and Retries

If Bulker fails to send events to a destination, it can retry sending them with exponential backoff. When an error occurs, Bulker moves the events to a Kafka topic dedicated to the Retry Consumer. In stream mode a single failed event is moved to the retry topic, while in batch mode the whole batch is moved.

The Retry Consumer is responsible for requeuing events from the retry topic. It runs periodically and relocates events from the retry topic to the original topic while incrementing their retry attempt counter.

If the stream or batch consumer reaches the maximum number of retry attempts for a specific event, that event is moved to the dead topic.

Parameters:

BULKER_MESSAGES_RETRY_COUNT

Optional, default value: 5

Max number of retry attempts.

BULKER_MESSAGES_RETRY_BACKOFF_BASE

Optional, default value: 5

Defines the base for exponential backoff (in minutes) between retry attempts. For example, if the retry count is 3 and the base is 5, the retry delays will be 5, 25 and 125 minutes.

BULKER_MESSAGES_RETRY_BACKOFF_MAX_DELAY

Optional, default value: 1440

Defines the maximum possible retry delay in minutes. The default of 1440 minutes equals 24 hours.

BULKER_BATCH_RUNNER_DEFAULT_RETRY_PERIOD_SEC

Optional, default value: 300 (5 min)

Default period for running the Retry Consumer for destinations where retryPeriodSec is not set explicitly. Read more about batch processing configuration below

BULKER_BATCH_RUNNER_DEFAULT_RETRY_BATCH_SIZE

Optional, default value: 100

Default batch size for a destination's Retry Consumer where retryBatchSize is not set explicitly. Read more about batch processing configuration below
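As a worked example, the defaults above (following the backoff pattern described for BULKER_MESSAGES_RETRY_BACKOFF_BASE) produce delays of 5, 25, 125, 625 and 1440 minutes; the fifth delay of 3125 minutes is capped by the max delay:

export BULKER_MESSAGES_RETRY_COUNT=5
export BULKER_MESSAGES_RETRY_BACKOFF_BASE=5
export BULKER_MESSAGES_RETRY_BACKOFF_MAX_DELAY=1440
# Delays: 5, 25, 125, 625, 3125 -> capped at 1440 minutes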

Kafka topic management (advanced)

Bulker automatically creates three topics for each table in a destination: one for main processing, one for failed events that should be retried, and one for failed events that won't be retried (dead). Topic names have the format in.id.{destinationId}.m.{mode}.t.{tableName}.

The mode segment is batch or stream for the main topic, retry for the retry topic, and dead for the dead topic.
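For instance, a hypothetical destination with id postgres1 and table events running in batch mode would get the topics:

  • in.id.postgres1.m.batch.t.events (main topic)
  • in.id.postgres1.m.retry.t.events (failed events to be retried)
  • in.id.postgres1.m.dead.t.events (failed events that won't be retried)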

The parameters below define how topics are created:

BULKER_KAFKA_TOPIC_RETENTION_HOURS

Optional, default value: 168 (7 days)

Main topic retention time in hours.

BULKER_KAFKA_RETRY_TOPIC_RETENTION_HOURS

Optional, default value: 168 (7 days)

Retention time in hours for the retry topic.

BULKER_KAFKA_DEAD_TOPIC_RETENTION_HOURS

Optional, default value: 168 (7 days)

Retention time in hours for the dead topic.

BULKER_KAFKA_TOPIC_REPLICATION_FACTOR

Optional, default value: 1

Replication factor for topics.

Note: for production it should be set to at least 2.
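An illustrative production-leaning setup (values are examples, not recommendations beyond the note above):

export BULKER_KAFKA_TOPIC_RETENTION_HOURS=168
export BULKER_KAFKA_DEAD_TOPIC_RETENTION_HOURS=72
export BULKER_KAFKA_TOPIC_REPLICATION_FACTOR=3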

Events Log

If BULKER_CLICKHOUSE_HOST is set, Bulker will use ClickHouse for storing the history of processed events.

BULKER_CLICKHOUSE_HOST

Optional

ClickHouse host and port for storing the events log, e.g. clickhouse.example.com:9440

BULKER_CLICKHOUSE_DATABASE

Optional

ClickHouse database where the events log is stored.

BULKER_CLICKHOUSE_SSL

Optional

Enable SSL for the ClickHouse connection.

BULKER_CLICKHOUSE_USERNAME

Optional

BULKER_CLICKHOUSE_PASSWORD

Optional
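A hypothetical events log configuration (host, database and credentials are placeholders):

export BULKER_CLICKHOUSE_HOST="clickhouse.example.com:9440"
export BULKER_CLICKHOUSE_DATABASE="bulker_logs"
export BULKER_CLICKHOUSE_SSL="true"
export BULKER_CLICKHOUSE_USERNAME="bulker"
export BULKER_CLICKHOUSE_PASSWORD="change-me"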

Defining destinations

Bulker operates with destinations. Each destination is a connection to a database or storage service (GCS, S3, etc.).

Each destination is defined as a JSON object.

There are two ways to define the list of destinations:

With BULKER_DESTINATION_* environment variables

Each environment variable BULKER_DESTINATION_* defines a destination. The value of the variable is a JSON object. Example:

BULKER_DESTINATION_POSTGRES="{id: 'postgres', }"
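A fuller sketch using the fields from the "Destination parameters" section below (host and credentials are placeholders):

export BULKER_DESTINATION_POSTGRES='{
  "id": "postgres",
  "type": "postgres",
  "credentials": {
    "host": "pg.example.com",
    "port": 5432,
    "database": "events",
    "username": "bulker",
    "password": "change-me"
  },
  "options": {"mode": "batch", "batchSize": 10000, "frequency": 5}
}'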

With HTTP Endpoint

BULKER_CONFIG_SOURCE

URL of an endpoint that returns the configuration of destination entities.

E.g. jitsucom/console's export endpoint: https://<consoles-domain>/api/admin/export/bulker-connections

BULKER_CONFIG_SOURCE_HTTP_AUTH_TOKEN

Auth token for accessing BULKER_CONFIG_SOURCE endpoint.

E.g. for jitsucom/console's export endpoint: service-admin-account:CONSOLE_AUTH_TOKENS

BULKER_CONFIG_REFRESH_PERIOD_SEC

Default value: 5

Period in seconds for refreshing configuration from BULKER_CONFIG_SOURCE endpoint.
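For example, pulling destinations from a jitsucom/console instance (domain and token are placeholders):

export BULKER_CONFIG_SOURCE="https://console.example.com/api/admin/export/bulker-connections"
export BULKER_CONFIG_SOURCE_HTTP_AUTH_TOKEN="service-admin-account:<console-auth-token>"
export BULKER_CONFIG_REFRESH_PERIOD_SEC=10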

With Redis

Set BULKER_CONFIG_SOURCE to redis://... or rediss://... and Bulker will read destinations from the enrichedConnections key in Redis.
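For example (the Redis URL is a placeholder):

export BULKER_CONFIG_SOURCE="redis://default:change-me@redis.example.com:6379"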

Destination parameters

Each destination is a JSON object:

{
  //unique id of destination. The id is referenced in HTTP-api
  id: "string", // unique destination id
  //"clickhouse", "postgres", "mysql", "snowflake", "redshift" or "bigquery"
  //"s3" and "gcs" are coming soom
  type: "string", // destination type, see below
  //optional. Time in ISO8601 format when the destination was last updated
  updatedAt: "2020-01-01T00:00:00Z",
  //how to connect to the destination. Values are destination-specific; see the credentials sections below
  credentials: {},
  options: {
    mode: "string", // "stream" or "batch"
    //maximum batch size. If not set, value of BULKER_BATCH_RUNNER_DEFAULT_BATCH_SIZE is used
    //see "Batching" section above
    //default value: 10000
    batchSize: 10000,
    //period of running batch consumer in minutes (float). If not set, value of BULKER_BATCH_RUNNER_DEFAULT_PERIOD_SEC is used
    //see "Batching" section above
    //default value: 5
    frequency: 5, 
    //name of the field that contains the unique event id
    //optional
    primaryKey: "id",
    //whether Bulker should deduplicate events by primary key. See db-feature-matrix.md. Requires primaryKey to be set.
    //default value: false
    deduplicate: false,
    //field that contains the event timestamp. If set, Bulker will create destination tables optimized for range queries and sorting by the provided column
    //optional
    timestamp: "timestamp",
    //batch size of retry consumer. If not set, value of BULKER_BATCH_RUNNER_DEFAULT_RETRY_BATCH_SIZE is used
    //see "Error Handling and Retries" section above
    //default value: 100
    retryBatchSize: 100, 
    //period of running retry consumer in minutes (float). If not set, batchPeriodSec is used, or BULKER_BATCH_RUNNER_DEFAULT_RETRY_PERIOD_SEC if batchPeriodSec is not set either.
    //see "Error Handling and Retries" section above
    //default value: 5
    retryFrequency: 5, 
  },
}

Postgres / MySQL / Redshift / Snowflake credentials

Postgres, MySQL, Redshift and Snowflake credentials share the same configuration structure:

{
  host: "string",
  port: 5432,
  database: "string",
  defaultSchema: "",
  username: "string",
  password: "string",
  //custom SQL connection parameters
  parameters: {},
  //Only for Redshift. Intermediate S3 bucket for uploading data
  s3Config: {
    //bucket name
    bucket: "string",
    //bucket region. See https://docs.aws.amazon.com/general/latest/gr/s3.html
    region: "string",
    //access credentials
    accessKeyId: "string",
    secretAccessKey: "string",
    //(optional) folder inside the bucket
    folder: "",
  }
}

Clickhouse

{
  //Clickhouse protocol: clickhouse, clickhouse-secure, http or https
  protocol: "string",
  //list of clickhouse servers as host:port. If port is not specified, default port for respective protocol will be used. http → 8123, https → 8443, clickhouse → 9000, clickhouse-secure → 9440  
  hosts: ["string"],
  //map of parameters. See https://clickhouse.com/docs/en/integrations/go/clickhouse-go/database-sql-api/#connection-settings 
  parameters: {},
  username: "string",
  password: "string",
  //name of the database
  database: "string",
  //cluster name
  cluster: "string",
  //clickhouse engine settings. Defines how new tables are created in clickhouse
  engine: {
    //todo
  }
}

BigQuery

{
  //Google Cloud project ID
  project: "string",
  //service account credentials: either a JSON object or a path to a local key file
  //See https://cloud.google.com/docs/authentication/production
  keyFile: "string",
  //BigQuery dataset name
  bqDataset: "string",
}