[FLINK-15999][doc] Update/remove outdated information in concepts section

The content of the concepts section was gathered from pre-existing
documentation texts. This commit cleans that up a bit, removes outdated
content that is not "conceptual" and updates some of the terms that we
no longer use.
aljoscha committed Feb 21, 2020
1 parent 87e6cc2 commit a26b75f
Showing 6 changed files with 56 additions and 81 deletions.
14 changes: 3 additions & 11 deletions docs/concepts/flink-architecture.md
@@ -43,9 +43,9 @@ The Flink runtime consists of two types of processes:
- The *Flink Master* coordinates the distributed execution. It schedules
tasks, coordinates checkpoints, coordinates recovery on failures, etc.

There is always at least one *Flink Master*. A high-availability setup will
have multiple *Flink Masters*, one of which is always the *leader*, and
the others are *standby*.
There is always at least one *Flink Master*. A high-availability setup
might have multiple *Flink Masters*, one of which is always the
*leader*, and the others are *standby*.

- The *TaskManagers* (also called *workers*) execute the *tasks* (or more
specifically, the subtasks) of a dataflow, and buffer and exchange the data
@@ -129,12 +129,4 @@ two main benefits:

<img src="{{ site.baseurl }}/fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />

The APIs also include a *[resource group]({{ site.baseurl }}{% link
dev/stream/operators/index.md %}#task-chaining-and-resource-groups)* mechanism
which can be used to prevent undesirable slot sharing.
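
As a rough illustration of that mechanism, the sketch below (a hypothetical job, using the `slotSharingGroup()` setter of the DataStream API) puts one operator into its own group so that it never shares a task slot with operators of the default group:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSharingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c")
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                })
                // Place this operator in its own slot sharing group so it does
                // not share task slots with operators of the default group.
                .slotSharingGroup("expensive-map")
                .print();

        env.execute("Slot sharing sketch");
    }
}
```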

As a rule-of-thumb, a good default number of task slots would be the number of
CPU cores. With hyper-threading, each slot then takes 2 or more hardware
thread contexts.

{% top %}
79 changes: 32 additions & 47 deletions docs/concepts/stateful-stream-processing.md
@@ -26,14 +26,10 @@ under the License.

While many operations in a dataflow simply look at one individual *event at a
time* (for example an event parser), some operations remember information
across multiple events (for example window operators). These operations are
called **stateful**.

Stateful functions and operators store data across the processing of individual
elements/events, making state a critical building block for any type of more
elaborate operation.

For example:
Some examples of stateful operations:

- When an application searches for certain event patterns, the state will
store the sequence of events encountered so far.
@@ -44,26 +40,19 @@ For example:
- When historic data needs to be managed, the state allows efficient access
to events that occurred in the past.

Flink needs to be aware of the state in order to make state fault tolerant
using [checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md
%}) and to allow [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md
%}) of streaming applications.
Flink needs to be aware of the state in order to make it fault tolerant using
[checkpoints]({{ site.baseurl}}{% link dev/stream/state/checkpointing.md %})
and [savepoints]({{ site.baseurl }}{%link ops/state/savepoints.md %}).

Knowledge about the state also allows for rescaling Flink applications, meaning
that Flink takes care of redistributing state across parallel instances.

The [queryable state]({{ site.baseurl }}{% link
dev/stream/state/queryable_state.md %}) feature of Flink allows you to access
state from outside of Flink during runtime.
[Queryable state]({{ site.baseurl }}{% link dev/stream/state/queryable_state.md
%}) allows you to access state from outside of Flink during runtime.

When working with state, it might also be useful to read about [Flink's state
backends]({{ site.baseurl }}{% link ops/state/state_backends.md %}). Flink
provides different state backends that specify how and where state is stored.
State can be located on Java's heap or off-heap. Depending on your state
backend, Flink can also *manage* the state for the application, meaning Flink
deals with the memory management (possibly spilling to disk if necessary) to
allow applications to hold very large state. State backends can be configured
without changing your application logic.

* This will be replaced by the TOC
{:toc}
@@ -85,12 +74,12 @@ without changing your application logic.
Keyed state is maintained in what can be thought of as an embedded key/value
store. The state is partitioned and distributed strictly together with the
streams that are read by the stateful operators. Hence, access to the key/value
state is only possible on *keyed streams*, after a *keyBy()* function, and is
restricted to the values associated with the current event's key. Aligning the
keys of streams and state makes sure that all state updates are local
operations, guaranteeing consistency without transaction overhead. This
alignment also allows Flink to redistribute the state and adjust the stream
partitioning transparently.
state is only possible on *keyed streams*, i.e. after a keyed/partitioned data
exchange, and is restricted to the values associated with the current event's
key. Aligning the keys of streams and state makes sure that all state updates
are local operations, guaranteeing consistency without transaction overhead.
This alignment also allows Flink to redistribute the state and adjust the
stream partitioning transparently.
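
As a minimal sketch of what this looks like in user code (a hypothetical per-key counting job), a `KeyedProcessFunction` obtains keyed state such as `ValueState` through its runtime context, and every access is implicitly scoped to the key of the record currently being processed:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits a running count per key; the ValueState is automatically scoped to
// the key of the current record, so all accesses are local operations.
public class CountPerKey
        extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(
            Tuple2<String, Long> value,
            Context ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value(); // null the first time this key is seen
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(Tuple2.of(value.f0, updated));
    }
}
```

Such a function is applied after the keyed exchange, e.g. via `keyBy(...)` followed by `process(new CountPerKey())`.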

<img src="{{ site.baseurl }}/fig/state_partitioning.svg" alt="State and Partitioning" class="offset" width="50%" />

@@ -103,28 +92,28 @@ Groups.
## State Persistence

Flink implements fault tolerance using a combination of **stream replay** and
**checkpointing**. A checkpoint is related to a specific point in each of the
**checkpointing**. A checkpoint marks a specific point in each of the
input streams along with the corresponding state for each of the operators. A
streaming dataflow can be resumed from a checkpoint while maintaining
consistency *(exactly-once processing semantics)* by restoring the state of the
operators and replaying the events from the point of the checkpoint.
operators and replaying the records from the point of the checkpoint.

The checkpoint interval is a means of trading off the overhead of fault
tolerance during execution with the recovery time (the number of events that
tolerance during execution with the recovery time (the number of records that
need to be replayed).
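
As a rough sketch of how that trade-off is expressed in a job (assuming the usual `StreamExecutionEnvironment` entry point), the interval is configured when checkpointing is enabled:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds with exactly-once guarantees.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Keep at least 30 seconds between the end of one checkpoint and the
        // start of the next to bound the overhead during normal execution.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // ... define and execute the job as usual
    }
}
```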

The fault tolerance mechanism continuously draws snapshots of the distributed
streaming data flow. For streaming applications with small state, these
snapshots are very light-weight and can be drawn frequently without much impact
on performance. The state of the streaming applications is stored at a
configurable place (such as the master node, or HDFS).
configurable place, usually in a distributed file system.

In case of a program failure (due to machine-, network-, or software failure),
Flink stops the distributed streaming dataflow. The system then restarts the
operators and resets them to the latest successful checkpoint. The input
streams are reset to the point of the state snapshot. Any records that are
processed as part of the restarted parallel dataflow are guaranteed to not have
been part of the previously checkpointed state.
affected the previously checkpointed state.

{% info Note %} By default, checkpointing is disabled. See [Checkpointing]({{
site.baseurl }}{% link dev/stream/state/checkpointing.md %}) for details on how
@@ -133,13 +122,14 @@ to enable and configure checkpointing.
{% info Note %} For this mechanism to realize its full guarantees, the data
stream source (such as message queue or broker) needs to be able to rewind the
stream to a defined recent point. [Apache Kafka](https://kafka.apache.org) has
this ability and Flink's connector to Kafka exploits this ability. See [Fault
this ability and Flink's connector to Kafka exploits this. See [Fault
Tolerance Guarantees of Data Sources and Sinks]({{ site.baseurl }}{% link
dev/connectors/guarantees.md %}) for more information about the guarantees
provided by Flink's connectors.

{% info Note %} Because Flink's checkpoints are realized through distributed
snapshots, we use the words *snapshot* and *checkpoint* interchangeably.
snapshots, we use the words *snapshot* and *checkpoint* interchangeably. Often
we also use the term *snapshot* to mean either *checkpoint* or *savepoint*.

### Checkpointing

@@ -159,7 +149,7 @@ asynchronously. The checkpoint barriers don't travel in lock step and
operations can asynchronously snapshot their state.


### Barriers
#### Barriers

A core element in Flink's distributed snapshotting are the *stream barriers*.
These barriers are injected into the data stream and flow with the records as
@@ -216,7 +206,7 @@ streams on the snapshot barriers. The figure above illustrates this:
processing records from the input buffers before processing the records
from the streams.

### Snapshotting Operator State
#### Snapshotting Operator State

When operators contain any form of *state*, this state must be part of the
snapshots as well.
@@ -244,7 +234,7 @@ The resulting snapshot now contains:
<img src="{{ site.baseurl }}/fig/checkpointing.svg" alt="Illustration of the Checkpointing Mechanism" style="width:100%; padding-top:10px; padding-bottom:10px;" />
</div>

### Recovery
#### Recovery

Recovery under this mechanism is straightforward: Upon a failure, Flink selects
the latest completed checkpoint *k*. The system then re-deploys the entire
@@ -271,7 +261,8 @@ hash map, another state backend uses [RocksDB](https://rocksdb.org) as the
key/value store. In addition to defining the data structure that holds the
state, the state backends also implement the logic to take a point-in-time
snapshot of the key/value state and store that snapshot as part of a
checkpoint.
checkpoint. State backends can be configured without changing your application
logic.
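
A minimal sketch of switching backends, assuming the `flink-statebackend-rocksdb` dependency is on the classpath and using a hypothetical checkpoint URI, could look like this; the rest of the job definition stays exactly the same:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep working state in embedded RocksDB instances (on disk, off-heap)
        // and write checkpoint snapshots to the given (hypothetical) URI.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

        // ... the application logic is defined as usual and stays unchanged
    }
}
```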

<img src="{{ site.baseurl }}/fig/checkpoints.svg" alt="checkpoints and snapshots" class="offset" width="60%" />

@@ -281,24 +272,18 @@ checkpoint.

`TODO: expand this section`

Programs written in the Data Stream API can resume execution from a
**savepoint**. Savepoints allow both updating your programs and your Flink
cluster without losing any state.
All programs that use checkpointing can resume execution from a **savepoint**.
Savepoints allow both updating your programs and your Flink cluster without
losing any state.

[Savepoints]({{ site.baseurl }}{% link ops/state/savepoints.md %}) are
**manually triggered checkpoints**, which take a snapshot of the program and
write it out to a state backend. They rely on the regular checkpointing
mechanism for this. During execution programs are periodically snapshotted on
the worker nodes and produce checkpoints. For recovery only the last completed
checkpoint is needed and older checkpoints can be safely discarded as soon as a
new one is completed.
mechanism for this.

Savepoints are similar to these periodic checkpoints except that they are
Savepoints are similar to checkpoints except that they are
**triggered by the user** and **don't automatically expire** when newer
checkpoints are completed. Savepoints can be created from the [command line]({{
site.baseurl }}{% link ops/cli.md %}#savepoints) or when cancelling a job via
the [REST API]({{ site.baseurl }}{% link monitoring/rest_api.md
%}#cancel-job-with-savepoint).
checkpoints are completed.

{% top %}

2 changes: 1 addition & 1 deletion docs/concepts/stream-processing.md
@@ -52,7 +52,7 @@ gloss over this for simplicity.

Often there is a one-to-one correspondence between the transformations in the
programs and the operators in the dataflow. Sometimes, however, one
transformation may consist of multiple transformation operators.
transformation may consist of multiple operators.

{% top %}

14 changes: 5 additions & 9 deletions docs/concepts/timely-stream-processing.md
@@ -169,11 +169,6 @@ parallel streams, and operators tracking event time.

<img src="{{ site.baseurl }}/fig/parallel_streams_watermarks.svg" alt="Parallel data streams and operators with events and watermarks" class="center" width="80%" />

Note that the Kafka source supports per-partition watermarking, which you can
read more about [here]({{ site.baseurl }}{% link
dev/event_timestamps_watermarks.md %}#timestamps-per-kafka-partition).


## Lateness

It is possible that certain elements will violate the watermark condition,
@@ -207,9 +202,10 @@ overlap), and *session windows* (punctuated by a gap of inactivity).

<img src="{{ site.baseurl }}/fig/windows.svg" alt="Time- and Count Windows" class="offset" width="80%" />

More window examples can be found in this [blog
post](https://flink.apache.org/news/2015/12/04/Introducing-windows.html). More
details are in the [window docs]({{ site.baseurl }}{% link
dev/stream/operators/windows.md %}).
Please check out this [blog
post](https://flink.apache.org/news/2015/12/04/Introducing-windows.html) for
additional examples of windows, or take a look at the [window documentation]({{
site.baseurl }}{% link dev/stream/operators/windows.md %}) of the DataStream
API.
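
As a brief sketch of a keyed tumbling event-time window (assuming an input stream of `Tuple2<String, Long>` records that already carries timestamps and watermarks), a per-key sum over 5-minute windows could look like this in the DataStream API:

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TumblingWindowSketch {

    // Sums the second tuple field per key over non-overlapping 5-minute
    // event-time windows; `input` must already carry timestamps and watermarks.
    public static DataStream<Tuple2<String, Long>> sumPerKey(
            DataStream<Tuple2<String, Long>> input) {
        return input
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> value) {
                        return value.f0;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                .sum(1);
    }
}
```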

{% top %}
4 changes: 2 additions & 2 deletions docs/dev/stream/state/index.md
@@ -25,8 +25,8 @@ specific language governing permissions and limitations
under the License.
-->

In this section you will learn about the stateful abstractions that Flink
offers and how to use them in practice. Please take a look at [Stateful Stream
In this section you will learn about the APIs that Flink provides for writing
stateful programs. Please take a look at [Stateful Stream
Processing]({{site.baseurl}}{% link concepts/stateful-stream-processing.md %})
to learn about the concepts behind stateful stream processing.

24 changes: 13 additions & 11 deletions docs/dev/stream/state/state.md
@@ -22,8 +22,8 @@ specific language governing permissions and limitations
under the License.
-->

In this section you will learn about the stateful abstractions that Flink
offers and how to use them in practice. Please take a look at [Stateful Stream
In this section you will learn about the APIs that Flink provides for writing
stateful programs. Please take a look at [Stateful Stream
Processing]({{site.baseurl}}{% link concepts/stateful-stream-processing.md %})
to learn about the concepts behind stateful stream processing.

@@ -505,8 +505,8 @@ Operator State in Flink. Each parallel instance of the Kafka consumer maintains
a map of topic partitions and offsets as its Operator State.

The Operator State interfaces support redistributing state among parallel
operator instances when the parallelism is changed. There can be different
schemes for doing this redistribution.
operator instances when the parallelism is changed. There are different schemes
for doing this redistribution.

In a typical stateful Flink application you don't need operator state. It is
mostly a special type of state that is used in source/sink implementations and
@@ -515,13 +515,15 @@ scenarios where you don't have a key by which state can be partitioned.
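
To make this concrete, here is a minimal sketch of a (hypothetical) buffering sink that keeps its in-flight buffer as operator list state via the `CheckpointedFunction` interface; the list entries are what gets redistributed when the parallelism changes:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.util.ArrayList;
import java.util.List;

// A buffering sink whose in-flight buffer is kept as operator ListState,
// so it survives failures and can be redistributed on rescaling.
public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

    private final List<String> buffer = new ArrayList<>();
    private transient ListState<String> checkpointedBuffer;

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value);
        // ... flush the buffer to the external system once it is large enough
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Copy the current buffer into the operator state for this checkpoint.
        checkpointedBuffer.clear();
        checkpointedBuffer.addAll(buffer);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedBuffer = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered-elements", String.class));
        // On restore, re-populate the in-memory buffer from the state.
        for (String element : checkpointedBuffer.get()) {
            buffer.add(element);
        }
    }
}
```
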
## Broadcast State

*Broadcast State* is a special type of *Operator State*. It was introduced to
support use cases where some data coming from one stream is required to be
broadcasted to all downstream tasks, where it is stored locally and is used to
process all incoming elements on the other stream. As an example where
broadcast state can emerge as a natural fit, one can imagine a low-throughput
stream containing a set of rules which we want to evaluate against all elements
coming from another stream. Having the above type of use cases in mind,
broadcast state differs from the rest of operator states in that:
support use cases where records of one stream need to be broadcasted to all
downstream tasks, where they are used to maintain the same state among all
subtasks. This state can then be accessed while processing records of a second
stream. As an example where broadcast state can emerge as a natural fit, one
can imagine a low-throughput stream containing a set of rules which we want to
evaluate against all elements coming from another stream. Having the above type
of use cases in mind, broadcast state differs from the rest of operator states
in that:

1. it has a map format,
2. it is only available to specific operators that have as inputs a
*broadcasted* stream and a *non-broadcasted* one, and