Commit 9ca6511

This closes apache#2281: Clean up dataflow/google references/URLs in examples

kennknowles committed Mar 23, 2017
2 parents ddc7595 + 81bcbb4

Showing 7 changed files with 42 additions and 36 deletions.
61 changes: 34 additions & 27 deletions examples/java/README.md
@@ -20,25 +20,23 @@
# Example Pipelines

The examples included in this module serve to demonstrate the basic
-functionality of Google Cloud Dataflow, and act as starting points for
+functionality of Apache Beam, and act as starting points for
the development of more complex pipelines.

## Word Count

A good starting point for new users is our set of
-[word count](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples) examples, which computes word frequencies. This series of four successively more detailed pipelines is described in detail in the accompanying [walkthrough](https://cloud.google.com/dataflow/examples/wordcount-example).
+[word count](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples) examples, which computes word frequencies. This series of four successively more detailed pipelines is described in detail in the accompanying [walkthrough](https://beam.apache.org/get-started/wordcount-example/).

-1. [`MinimalWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java) is the simplest word count pipeline and introduces basic concepts like [Pipelines](https://cloud.google.com/dataflow/model/pipelines),
-[PCollections](https://cloud.google.com/dataflow/model/pcollection),
-[ParDo](https://cloud.google.com/dataflow/model/par-do),
-and [reading and writing data](https://cloud.google.com/dataflow/model/reading-and-writing-data) from external storage.
+1. [`MinimalWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java) is the simplest word count pipeline and introduces basic concepts like [Pipelines](https://beam.apache.org/documentation/programming-guide/#pipeline),
+[PCollections](https://beam.apache.org/documentation/programming-guide/#pcollection),
+[ParDo](https://beam.apache.org/documentation/programming-guide/#transforms-pardo),
+and [reading and writing data](https://beam.apache.org/documentation/programming-guide/#io) from external storage.

-1. [`WordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java) introduces Dataflow best practices like [PipelineOptions](https://cloud.google.com/dataflow/pipelines/constructing-your-pipeline#Creating) and custom [PTransforms](https://cloud.google.com/dataflow/model/composite-transforms).
+1. [`WordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java) introduces best practices like [PipelineOptions](https://beam.apache.org/documentation/programming-guide/#pipeline) and custom [PTransforms](https://beam.apache.org/documentation/programming-guide/#transforms-composite).

1. [`DebuggingWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java)
-shows how to view live aggregators in the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf), get the most out of
-[Cloud Logging](https://cloud.google.com/dataflow/pipelines/logging) integration, and start writing
-[good tests](https://cloud.google.com/dataflow/pipelines/testing-your-pipeline).
+demonstrates some best practices for instrumenting your pipeline code.

1. [`WindowedWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WindowedWordCount.java) shows how to run the same pipeline over either unbounded PCollections in streaming mode or bounded PCollections in batch mode.
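
A minimal sketch of the concepts that list introduces (Pipeline, PCollection, ParDo, text IO), assuming a current Beam Java SDK; the class name, file names, and regex below are illustrative, not the examples' exact code:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class TinyWordCount {
  public static void main(String[] args) {
    // A Pipeline holds the graph of transforms; options are parsed from the command line.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.read().from("input.txt"))          // PCollection<String> of lines
        .apply(ParDo.of(new DoFn<String, String>() {  // ParDo: one line in, many words out
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (String word : c.element().split("[^a-zA-Z']+")) {
              if (!word.isEmpty()) {
                c.output(word);
              }
            }
          }
        }))
        .apply(Count.<String>perElement())            // KV<word, occurrence count>
        .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(c.element().getKey() + ": " + c.element().getValue());
          }
        }))
        .apply(TextIO.write().to("wordcounts"));      // write sharded text output

    p.run().waitUntilFinish();
  }
}
```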

@@ -50,46 +48,55 @@ Change directory into `examples/java` and run the examples:
-Dexec.mainClass=<MAIN CLASS> \
-Dexec.args="<EXAMPLE-SPECIFIC ARGUMENTS>"

-For example, you can execute the `WordCount` pipeline on your local machine as follows:
+Alternatively, you may choose to bundle all dependencies into a single JAR and
+execute it outside of the Maven environment.

+### Direct Runner

+You can execute the `WordCount` pipeline on your local machine as follows:

mvn compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--inputFile=<LOCAL INPUT FILE> --output=<LOCAL OUTPUT FILE>"

-Once you have followed the general Cloud Dataflow
-[Getting Started](https://cloud.google.com/dataflow/getting-started) instructions, you can execute
-the same pipeline on fully managed resources in Google Cloud Platform:
+To create the bundled JAR of the examples and execute it locally:

+mvn package

+java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \
+org.apache.beam.examples.WordCount \
+--inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>

+### Google Cloud Dataflow Runner

+After you have followed the general Cloud Dataflow
+[prerequisites and setup](https://beam.apache.org/documentation/runners/dataflow/), you can execute
+the pipeline on fully managed resources in Google Cloud Platform:

mvn compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \
--tempLocation=<YOUR CLOUD STORAGE LOCATION> \
---runner=BlockingDataflowRunner"
+--runner=DataflowRunner"

Make sure to use your project id, not the project number or the descriptive name.
-The Cloud Storage location should be entered in the form of
+The Google Cloud Storage location should be entered in the form of
`gs:https://bucket/path/to/staging/directory`.

-Alternatively, you may choose to bundle all dependencies into a single JAR and
-execute it outside of the Maven environment. For example, you can execute the
-following commands to create the
-bundled JAR of the examples and execute it both locally and in Cloud
-Platform:
+To create the bundled JAR of the examples and execute it in Google Cloud Platform:

mvn package

-java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \
-org.apache.beam.examples.WordCount \
---inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>

java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \
org.apache.beam.examples.WordCount \
--project=<YOUR CLOUD PLATFORM PROJECT ID> \
--tempLocation=<YOUR CLOUD STORAGE LOCATION> \
---runner=BlockingDataflowRunner
+--runner=DataflowRunner

## Other Examples

Other examples can be run similarly by replacing the `WordCount` class name with the example class, e.g.
`org.apache.beam.examples.cookbook.BigQueryTornadoes`,
`org.apache.beam.examples.cookbook.CombinePerKeyExamples`,
and adjusting runtime options under the `-Dexec.args` parameter, as specified in
the example itself.
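
For instance, a `BigQueryTornadoes` run on the Dataflow runner might look like the following (assuming the example's `--output` BigQuery table option; the placeholder values are ours):

    mvn compile exec:java \
    -Dexec.mainClass=org.apache.beam.examples.cookbook.BigQueryTornadoes \
    -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \
    --tempLocation=<YOUR CLOUD STORAGE LOCATION> \
    --runner=DataflowRunner \
    --output=<PROJECT>:<DATASET>.<TABLE>"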

examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java

@@ -151,7 +151,7 @@ public static void main(String[] args) {
* <p>Below we verify that the set of filtered words matches our expected counts. Note
* that PAssert does not provide any output and that successful completion of the
* Pipeline implies that the expectations were met. Learn more at
- * https://cloud.google.com/dataflow/pipelines/testing-your-pipeline on how to test
+ * https://beam.apache.org/documentation/pipelines/test-your-pipeline/ on how to test
* your Pipeline and see {@link DebuggingWordCountTest} for an example unit test.
*/
List<KV<String, Long>> expectedResults = Arrays.asList(
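
As a self-contained illustration of that testing style (the words and counts here are chosen by us, assuming a current Beam SDK):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class PAssertSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<KV<String, Long>> counts =
        p.apply(Create.of("flourish", "stomach", "flourish"))
            .apply(Count.<String>perElement());

    // PAssert attaches the assertion to the pipeline itself and prints nothing;
    // if the expectation is not met, the pipeline's run fails.
    PAssert.that(counts).containsInAnyOrder(
        KV.of("flourish", 2L), KV.of("stomach", 1L));

    p.run().waitUntilFinish();
  }
}
```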

examples/java/src/main/java/org/apache/beam/examples/cookbook/README.md

@@ -21,7 +21,7 @@

This directory holds simple "cookbook" examples, which show how to define
commonly-used data analysis patterns that you would likely incorporate into a
-larger Dataflow pipeline. They include:
+larger Apache Beam pipeline. They include:

<ul>
<li><a href="https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/cookbook/BigQueryTornadoes.java">BigQueryTornadoes</a>

@@ -20,10 +20,10 @@
# 'Gaming' examples


-This directory holds a series of example Dataflow pipelines in a simple 'mobile
+This directory holds a series of example Apache Beam pipelines in a simple 'mobile
gaming' domain. They all require Java 8. Each pipeline successively introduces
new concepts, and gives some examples of using Java 8 syntax in constructing
-Dataflow pipelines. Other than usage of Java 8 lambda expressions, the concepts
+Beam pipelines. Other than usage of Java 8 lambda expressions, the concepts
that are used apply equally well in Java 7.
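
A sketch of the Java 8 style this paragraph refers to, assuming a current Beam SDK (the CSV lines and field index are invented for the sketch):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class LambdaSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Java 8: a one-line lambda where Java 7 would need an anonymous
    // SimpleFunction or DoFn subclass.
    p.apply(Create.of("user1,TeamA,12", "user2,TeamB,7"))
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((String line) -> line.split(",")[1])); // extract the team field

    p.run().waitUntilFinish();
  }
}
```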

In the gaming scenario, many users play, as members of different teams, over
@@ -58,7 +58,7 @@ the day's cutoff point.

The next pipeline in the series is `HourlyTeamScore`. This pipeline also
processes data collected from gaming events in batch. It builds on `UserScore`,
-but uses [fixed windows](https://cloud.google.com/dataflow/model/windowing), by
+but uses [fixed windows](https://beam.apache.org/documentation/programming-guide/#windowing), by
default an hour in duration. It calculates the sum of scores per team, for each
window, optionally allowing specification of two timestamps before and after
which data is filtered out. This allows a model where late data collected after
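
The fixed-windowing step described above can be sketched roughly as follows (the element type and helper method are ours, not `HourlyTeamScore`'s exact code):

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class HourlyWindowSketch {
  // Assign each element to the hour-long fixed window containing its event-time timestamp.
  static PCollection<String> intoHourlyWindows(PCollection<String> scoreEvents) {
    return scoreEvents.apply(
        Window.<String>into(FixedWindows.of(Duration.standardHours(1))));
  }
}
```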

Injector.java

@@ -260,8 +260,7 @@ private static String generateEvent(Long currTime, int delayInMillis) {
user = team.getRandomUser();
}
String event = user + "," + teamName + "," + random.nextInt(MAX_SCORE);
-// Randomly introduce occasional parse errors. You can see a custom counter tracking the number
-// of such errors in the Dataflow Monitoring UI, as the example pipeline runs.
+// Randomly introduce occasional parse errors.
if (random.nextInt(parseErrorRate) == 0) {
System.out.println("Introducing a parse error.");
event = "THIS LINE REPRESENTS CORRUPT DATA AND WILL CAUSE A PARSE ERROR";

GameStatsTest.java

@@ -38,7 +38,7 @@
* Tests of GameStats.
* Because the pipeline was designed for easy readability and explanations, it lacks good
* modularity for testing. See our testing documentation for better ideas:
- * https://cloud.google.com/dataflow/pipelines/testing-your-pipeline.
+ * https://beam.apache.org/documentation/pipelines/test-your-pipeline/
*/
@RunWith(JUnit4.class)
public class GameStatsTest implements Serializable {

HourlyTeamScoreTest.java

@@ -45,7 +45,7 @@
* Tests of HourlyTeamScore.
* Because the pipeline was designed for easy readability and explanations, it lacks good
* modularity for testing. See our testing documentation for better ideas:
- * https://cloud.google.com/dataflow/pipelines/testing-your-pipeline.
+ * https://beam.apache.org/documentation/pipelines/test-your-pipeline/
*/
@RunWith(JUnit4.class)
public class HourlyTeamScoreTest implements Serializable {