[FLINK-15999][doc] Clean up DAG/parallel DAG section in concepts doc

spotify · Feb 21, 2020 · c38315d · c38315d
1 parent 852834b
commit c38315d
Showing 1 changed file with 44 additions and 39 deletions.
diff --git a/docs/concepts/stream-processing.md b/docs/concepts/stream-processing.md
@@ -36,56 +36,61 @@ under the License.
 
 ## Programs and Dataflows
 
-The basic building blocks of Flink programs are **streams** and **transformations**. (Note that the
-DataSets used in Flink's DataSet API are also streams internally -- more about that
-later.) Conceptually a *stream* is a (potentially never-ending) flow of data records, and a *transformation* is an
-operation that takes one or more streams as input, and produces one or more output streams as a
-result.
-
-When executed, Flink programs are mapped to **streaming dataflows**, consisting of **streams** and transformation **operators**.
-Each dataflow starts with one or more **sources** and ends in one or more **sinks**. The dataflows resemble
-arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of cycles are permitted via
-*iteration* constructs, for the most part we will gloss over this for simplicity.
+The basic building blocks of Flink programs are **streams** and
+**transformations**. Conceptually a *stream* is a (potentially never-ending)
+flow of data records, and a *transformation* is an operation that takes one or
+more streams as input, and produces one or more output streams as a result.
+
+When executed, Flink programs are mapped to **streaming dataflows**, consisting
+of **streams** and transformation **operators**. Each dataflow starts with one
+or more **sources** and ends in one or more **sinks**. The dataflows resemble
+arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of
+cycles are permitted via *iteration* constructs, for the most part we will
+gloss over this for simplicity.
 
 <img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream program, and its dataflow." class="offset" width="80%" />
 
-Often there is a one-to-one correspondence between the transformations in the programs and the operators
-in the dataflow. Sometimes, however, one transformation may consist of multiple transformation operators.
-
-Sources and sinks are documented in the [streaming connectors](../dev/connectors/index.html) and [batch connectors](../dev/batch/connectors.html) docs.
-Transformations are documented in [DataStream operators]({{ site.baseurl }}/dev/stream/operators/index.html) and [DataSet transformations](../dev/batch/dataset_transformations.html).
+Often there is a one-to-one correspondence between the transformations in the
+programs and the operators in the dataflow. Sometimes, however, one
+transformation may consist of multiple transformation operators.
 
 {% top %}
 
 ## Parallel Dataflows
 
-Programs in Flink are inherently parallel and distributed. During execution, a *stream* has one or more **stream partitions**,
-and each *operator* has one or more **operator subtasks**. The operator subtasks are independent of one another, and execute in different threads
-and possibly on different machines or containers.
+Programs in Flink are inherently parallel and distributed. During execution, a
+*stream* has one or more **stream partitions**, and each *operator* has one or
+more **operator subtasks**. The operator subtasks are independent of one
+another, and execute in different threads and possibly on different machines or
+containers.
 
-The number of operator subtasks is the **parallelism** of that particular operator. The parallelism of a stream
-is always that of its producing operator. Different operators of the same program may have different
-levels of parallelism.
+The number of operator subtasks is the **parallelism** of that particular
+operator. The parallelism of a stream is always that of its producing operator.
+Different operators of the same program may have different levels of
+parallelism.
 
 <img src="{{ site.baseurl }}/fig/parallel_dataflow.svg" alt="A parallel dataflow" class="offset" width="80%" />
 
-Streams can transport data between two operators in a *one-to-one* (or *forwarding*) pattern, or in a *redistributing* pattern:
-
- - **One-to-one** streams (for example between the *Source* and the *map()* operators in the figure
- above) preserve the partitioning and ordering of the
- elements. That means that subtask[1] of the *map()* operator will see the same elements in the same order as they
- were produced by subtask[1] of the *Source* operator.
-
- - **Redistributing** streams (as between *map()* and *keyBy/window* above, as well as between
- *keyBy/window* and *Sink*) change the partitioning of streams. Each *operator subtask* sends
- data to different target subtasks, depending on the selected transformation. Examples are
- *keyBy()* (which re-partitions by hashing the key), *broadcast()*, or *rebalance()* (which
- re-partitions randomly). In a *redistributing* exchange the ordering among the elements is
- only preserved within each pair of sending and receiving subtasks (for example, subtask[1]
- of *map()* and subtask[2] of *keyBy/window*). So in this example, the ordering within each key
- is preserved, but the parallelism does introduce non-determinism regarding the order in
- which the aggregated results for different keys arrive at the sink.
-
-Details about configuring and controlling parallelism can be found in the docs on [parallel execution](../dev/parallel.html).
+Streams can transport data between two operators in a *one-to-one* (or
+*forwarding*) pattern, or in a *redistributing* pattern:
+
+ - **One-to-one** streams (for example between the *Source* and the *map()*
+ operators in the figure above) preserve the partitioning and ordering of
+ the elements. That means that subtask[1] of the *map()* operator will see
+ the same elements in the same order as they were produced by subtask[1] of
+ the *Source* operator.
+
+ - **Redistributing** streams (as between *map()* and *keyBy/window* above, as
+ well as between *keyBy/window* and *Sink*) change the partitioning of
+ streams. Each *operator subtask* sends data to different target subtasks,
+ depending on the selected transformation. Examples are *keyBy()* (which
+ re-partitions by hashing the key), *broadcast()*, or *rebalance()* (which
+ re-partitions randomly). In a *redistributing* exchange the ordering among
+ the elements is only preserved within each pair of sending and receiving
+ subtasks (for example, subtask[1] of *map()* and subtask[2] of
+ *keyBy/window*). So in this example, the ordering within each key is
+ preserved, but the parallelism does introduce non-determinism regarding the
+ order in which the aggregated results for different keys arrive at the
+ sink.
 
 {% top %}