[FLINK-20053][table][doc] Add document for file compaction
This closes apache#13990
JingsongLi committed Nov 11, 2020
1 parent 592c2e7 commit 83d8137
Showing 2 changed files with 68 additions and 0 deletions.
34 changes: 34 additions & 0 deletions docs/dev/table/connectors/filesystem.md
@@ -150,6 +150,40 @@ become finished on the next checkpoint) control the size and number of these par
**NOTE:** For row formats (csv, json), you can set the parameter `sink.rolling-policy.file-size` or `sink.rolling-policy.rollover-interval` in the connector properties together with the parameter `execution.checkpointing.interval` in flink-conf.yaml
if you don't want to wait a long time before the data becomes visible in the file system. For other formats (avro, orc), you only need to set the parameter `execution.checkpointing.interval` in flink-conf.yaml.
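
As a rough illustration of the note above, a CSV sink table might set the rolling policy in its connector properties as sketched here. The table name, columns, path, and values are hypothetical; the checkpoint interval is assumed to be configured separately through `execution.checkpointing.interval` in flink-conf.yaml.

```sql
-- Hypothetical sink table; name, columns, path, and values are illustrative only.
CREATE TABLE csv_sink (
  user_id STRING,
  amount DOUBLE
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/csv_sink',
  'format' = 'csv',
  -- Roll the in-progress part file once it reaches this size ...
  'sink.rolling-policy.file-size' = '64MB',
  -- ... or once it has been open for this long.
  'sink.rolling-policy.rollover-interval' = '10 min'
);
```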

### File Compaction

The file sink supports file compaction, which allows applications to use shorter checkpoint intervals without generating a large number of files.

<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 20%">Key</th>
<th class="text-left" style="width: 15%">Default</th>
<th class="text-left" style="width: 10%">Type</th>
<th class="text-left" style="width: 55%">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><h5>auto-compaction</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>Whether to enable automatic compaction in the streaming sink. Data is first written to temporary files. After a checkpoint completes, the temporary files generated by that checkpoint are compacted. The temporary files are invisible before compaction.</td>
</tr>
<tr>
<td><h5>compaction.file-size</h5></td>
<td style="word-wrap: break-word;">(none)</td>
<td>MemorySize</td>
<td>The target file size for compaction. The default value is the rolling file size.</td>
</tr>
</tbody>
</table>

If enabled, file compaction will merge multiple small files into larger files based on the target file size.
When running file compaction in production, please be aware that:
- Only files within a single checkpoint are compacted, so at least as many files are generated as there are checkpoints.
- Files are invisible before they are merged, so the delay before a file becomes visible may be: checkpoint interval + compaction time.
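
The two options in the table above could be combined in a sink definition along these lines. This is a minimal sketch: the table name, columns, path, format, and the 128MB target are assumptions chosen for illustration, not recommended values.

```sql
-- Hypothetical sink table; name, columns, path, and sizes are illustrative only.
CREATE TABLE compacted_sink (
  user_id STRING,
  amount DOUBLE,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/compacted_sink',
  'format' = 'parquet',
  -- Merge the temporary files of each checkpoint after it completes.
  'auto-compaction' = 'true',
  -- Target size of the merged files; falls back to the rolling file size if unset.
  'compaction.file-size' = '128MB'
);
```

With a configuration like this, each completed checkpoint still yields at least one file, and its data only becomes visible once that checkpoint's compaction has finished.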

### Partition Commit

After writing a partition, it is often necessary to notify downstream applications, for example by adding the partition to a Hive metastore or writing a `_SUCCESS` file in the directory. The file system sink contains a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of `triggers` and `policies`.
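
As a hedged sketch of how a trigger and a policy might be combined, the table below commits a partition by writing a `_SUCCESS` file. The table name, columns, path, and delay are hypothetical, and the `sink.partition-commit.*` keys are assumed to match the filesystem connector version in use.

```sql
-- Hypothetical partitioned sink; name, columns, path, and delay are illustrative only.
CREATE TABLE partitioned_sink (
  user_id STRING,
  amount DOUBLE,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/partitioned_sink',
  'format' = 'parquet',
  -- Trigger: commit based on processing time, after the given delay.
  'sink.partition-commit.trigger' = 'process-time',
  'sink.partition-commit.delay' = '1 h',
  -- Policy: write a _SUCCESS file into the partition directory.
  'sink.partition-commit.policy.kind' = 'success-file'
);
```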
34 changes: 34 additions & 0 deletions docs/dev/table/connectors/filesystem.zh.md
@@ -150,6 +150,40 @@ become finished on the next checkpoint) control the size and number of these par
**NOTE:** For row formats (csv, json), you can set the parameter `sink.rolling-policy.file-size` or `sink.rolling-policy.rollover-interval` in the connector properties together with the parameter `execution.checkpointing.interval` in flink-conf.yaml
if you don't want to wait a long time before the data becomes visible in the file system. For other formats (avro, orc), you only need to set the parameter `execution.checkpointing.interval` in flink-conf.yaml.

### File Compaction

The file sink supports file compaction, which allows applications to use shorter checkpoint intervals without generating a large number of files.

<table class="table table-bordered">
<thead>
<tr>
<th class="text-left" style="width: 20%">Key</th>
<th class="text-left" style="width: 15%">Default</th>
<th class="text-left" style="width: 10%">Type</th>
<th class="text-left" style="width: 55%">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><h5>auto-compaction</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>Whether to enable automatic compaction in the streaming sink. Data is first written to temporary files. After a checkpoint completes, the temporary files generated by that checkpoint are compacted. The temporary files are invisible before compaction.</td>
</tr>
<tr>
<td><h5>compaction.file-size</h5></td>
<td style="word-wrap: break-word;">(none)</td>
<td>MemorySize</td>
<td>The target file size for compaction. The default value is the rolling file size.</td>
</tr>
</tbody>
</table>

If enabled, file compaction will merge multiple small files into larger files based on the target file size.
When running file compaction in production, please be aware that:
- Only files within a single checkpoint are compacted, so at least as many files are generated as there are checkpoints.
- Files are invisible before they are merged, so the delay before a file becomes visible may be: checkpoint interval + compaction time.

### Partition Commit

After writing a partition, it is often necessary to notify downstream applications, for example by adding the partition to a Hive metastore or writing a `_SUCCESS` file in the directory. The file system sink contains a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of `triggers` and `policies`.
