[FLINK-20053][table][doc] Add document for file compaction

This closes apache#13990
derekm · Nov 11, 2020 · 83d8137 · 83d8137
1 parent 592c2e7
commit 83d8137

Showing 2 changed files with 68 additions and 0 deletions.
diff --git a/docs/dev/table/connectors/filesystem.md b/docs/dev/table/connectors/filesystem.md
@@ -150,6 +150,40 @@ become finished on the next checkpoint) control the size and number of these par
 **NOTE:** For row formats (csv, json), you can set the parameter `sink.rolling-policy.file-size` or `sink.rolling-policy.rollover-interval` in the connector properties together with the parameter `execution.checkpointing.interval` in flink-conf.yaml
 if you don't want to wait a long time before the data becomes visible in the file system. For other formats (avro, orc), you only need to set the parameter `execution.checkpointing.interval` in flink-conf.yaml.
 
+### File Compaction
+
+The file sink supports file compaction, which allows applications to use shorter checkpoint intervals without generating a large number of small files.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+  <th class="text-left" style="width: 20%">Key</th>
+  <th class="text-left" style="width: 15%">Default</th>
+  <th class="text-left" style="width: 10%">Type</th>
+  <th class="text-left" style="width: 55%">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+  <td><h5>auto-compaction</h5></td>
+  <td style="word-wrap: break-word;">false</td>
+  <td>Boolean</td>
+  <td>Whether to enable automatic compaction in the streaming sink. Data is first written to temporary files. After a checkpoint completes, the temporary files generated by that checkpoint are compacted. The temporary files are invisible before compaction.</td>
+ </tr>
+ <tr>
+  <td><h5>compaction.file-size</h5></td>
+  <td style="word-wrap: break-word;">(none)</td>
+  <td>MemorySize</td>
+  <td>The target file size for compaction. The default value is the rolling file size.</td>
+ </tr>
+ </tbody>
+</table>
+
+If enabled, file compaction merges multiple small files into larger files based on the target file size.
+When running file compaction in production, be aware that:
+- Only files within a single checkpoint are compacted, so at least as many files are generated as there are checkpoints.
+- Files are invisible until they are merged, so the delay before data becomes visible may be: checkpoint interval + compaction time.
+
 ### Partition Commit
 
 After writing a partition, it is often necessary to notify downstream applications, for example by adding the partition to a Hive metastore or writing a `_SUCCESS` file in the directory. The file system sink contains a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of `triggers` and `policies`.

diff --git a/docs/dev/table/connectors/filesystem.zh.md b/docs/dev/table/connectors/filesystem.zh.md
@@ -150,6 +150,40 @@ become finished on the next checkpoint) control the size and number of these par
 **NOTE:** For row formats (csv, json), you can set the parameter `sink.rolling-policy.file-size` or `sink.rolling-policy.rollover-interval` in the connector properties together with the parameter `execution.checkpointing.interval` in flink-conf.yaml
 if you don't want to wait a long time before the data becomes visible in the file system. For other formats (avro, orc), you only need to set the parameter `execution.checkpointing.interval` in flink-conf.yaml.
 
+### File Compaction
+
+The file sink supports file compaction, which allows applications to use shorter checkpoint intervals without generating a large number of small files.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+  <th class="text-left" style="width: 20%">Key</th>
+  <th class="text-left" style="width: 15%">Default</th>
+  <th class="text-left" style="width: 10%">Type</th>
+  <th class="text-left" style="width: 55%">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+  <td><h5>auto-compaction</h5></td>
+  <td style="word-wrap: break-word;">false</td>
+  <td>Boolean</td>
+  <td>Whether to enable automatic compaction in the streaming sink. Data is first written to temporary files. After a checkpoint completes, the temporary files generated by that checkpoint are compacted. The temporary files are invisible before compaction.</td>
+ </tr>
+ <tr>
+  <td><h5>compaction.file-size</h5></td>
+  <td style="word-wrap: break-word;">(none)</td>
+  <td>MemorySize</td>
+  <td>The target file size for compaction. The default value is the rolling file size.</td>
+ </tr>
+ </tbody>
+</table>
+
+If enabled, file compaction merges multiple small files into larger files based on the target file size.
+When running file compaction in production, be aware that:
+- Only files within a single checkpoint are compacted, so at least as many files are generated as there are checkpoints.
+- Files are invisible until they are merged, so the delay before data becomes visible may be: checkpoint interval + compaction time.
+
 ### Partition Commit
 
 After writing a partition, it is often necessary to notify downstream applications, for example by adding the partition to a Hive metastore or writing a `_SUCCESS` file in the directory. The file system sink contains a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of `triggers` and `policies`.
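
For reference, the options documented in this change could be set in a table definition along the following lines. This is only a sketch and not part of the commit: the table name `compacted_sink`, its columns, the path, and the chosen size are hypothetical, and `parquet` is just one possible format. Only the `auto-compaction` and `compaction.file-size` keys come from the options table added in the diff.

```sql
-- Hypothetical sink table; name, schema, and path are illustrative only.
CREATE TABLE compacted_sink (
  user_id BIGINT,
  item STRING,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/compacted_sink',  -- assumed location
  'format' = 'parquet',                   -- assumed format
  'auto-compaction' = 'true',             -- disabled (false) by default
  'compaction.file-size' = '128MB'        -- assumed target; falls back to the rolling file size when unset
);
```

With a definition like this, the files written within one checkpoint would be merged toward the target size after that checkpoint completes, so data should become visible roughly one checkpoint interval plus the compaction time after it is written.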