[FLINK-11086][docs] Make HADOOP_CLASSPATH approach more prominent in docs
rmetzger committed May 13, 2020
1 parent 75c9cf1 commit 2cc63a6
Showing 4 changed files with 96 additions and 116 deletions.
81 changes: 39 additions & 42 deletions docs/ops/deployment/hadoop.md
@@ -26,34 +26,13 @@ under the License.
* This will be replaced by the TOC
{:toc}

## Referencing a Hadoop configuration

You can reference a Hadoop configuration by setting the environment variable `HADOOP_CONF_DIR`.

```sh
HADOOP_CONF_DIR=/path/to/etc/hadoop
```

Referencing the HDFS configuration in the [Flink configuration]({{ site.baseurl }}/ops/config.html#hdfs) is deprecated.

Another way to provide the Hadoop configuration is to have it on the class path of the Flink process, see more details below.

## Providing Hadoop classes

In order to use Hadoop features (e.g., YARN, HDFS) it is necessary to provide Flink with the required Hadoop classes,
as these are not bundled by default.

This can be done by
1) Adding the Hadoop classpath to Flink
2) Putting the required jar files into /lib directory of the Flink distribution
Option 1) requires very little work, integrates nicely with existing Hadoop setups and should be the
preferred approach.
However, Hadoop has a large dependency footprint that increases the risk for dependency conflicts to occur.
If this happens, please refer to option 2).

The following subsections explains these approaches in detail.

### Adding Hadoop Classpaths
This can be done by adding the Hadoop classpath to Flink through the `HADOOP_CLASSPATH` environment variable.

Flink will use the environment variable `HADOOP_CLASSPATH` to augment the
classpath that is used when starting Flink components such as the Client,
@@ -75,35 +54,24 @@ in the shell. Note that `hadoop` is the hadoop binary and that `classpath` is an

Putting the Hadoop configuration in the same class path as the Hadoop libraries makes Flink pick up that configuration.
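
As a minimal sketch of the `HADOOP_CLASSPATH` approach described above, the following helper keeps an existing `HADOOP_CLASSPATH` untouched and otherwise fills it from `hadoop classpath`. The function name `ensure_hadoop_classpath` is hypothetical and not part of Flink or Hadoop; it only assumes that the `hadoop` binary is on the `PATH`.

```sh
# Hypothetical helper (not part of Flink): make sure HADOOP_CLASSPATH is
# exported before a Flink script is started.
ensure_hadoop_classpath() {
  # Respect a value that the user or the cluster setup already provided.
  if [ -n "${HADOOP_CLASSPATH:-}" ]; then
    export HADOOP_CLASSPATH
    return 0
  fi
  # Otherwise ask the hadoop binary for its classpath, if it is available.
  if command -v hadoop >/dev/null 2>&1; then
    HADOOP_CLASSPATH="$(hadoop classpath)"
    export HADOOP_CLASSPATH
    return 0
  fi
  echo "hadoop binary not found on PATH; HADOOP_CLASSPATH left unset" >&2
  return 1
}
```

Calling `ensure_hadoop_classpath` in the same shell before `./bin/yarn-session.sh` or `./bin/flink run` should make the variable visible to the started Flink processes, since exported variables are inherited by child processes.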

### Adding Hadoop to /lib

The Flink project releases Hadoop distributions for specific versions, that relocate or exclude several dependencies
to reduce the risk of dependency clashes.
These can be found in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
For these versions it is sufficient to download the corresponding `Pre-bundled Hadoop` component and putting it into
the `/lib` directory of the Flink distribution.

If the used Hadoop version is not listed on the download page (possibly due to being a Vendor-specific version),
then it is necessary to build [flink-shaded](https://github.com/apache/flink-shaded) against this version.
You can find the source code for this project in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
## Referencing a Hadoop configuration

<span class="label label-info">Note</span> If you want to build `flink-shaded` against a vendor specific Hadoop version, you first have to configure the
vendor-specific maven repository in your local maven setup as described [here](https://maven.apache.org/guides/mini/guide-multiple-repositories.html).
You can reference a Hadoop configuration by setting the environment variable `HADOOP_CONF_DIR`.

Run the following command to build and install `flink-shaded` against your desired Hadoop version (e.g., for version `2.6.5-custom`):
```sh
HADOOP_CONF_DIR=/path/to/etc/hadoop
```

{% highlight bash %}
mvn clean install -Dhadoop.version=2.6.5-custom
{% endhighlight %}
Referencing the HDFS configuration in the [Flink configuration]({{ site.baseurl }}/ops/config.html#hdfs) is deprecated.

After this step is complete, put the `flink-shaded-hadoop-2-uber` jar into the `/lib` directory of the Flink distribution.
Another way to provide the Hadoop configuration is to have it on the class path of the Flink process; see the details above.
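
A small sketch of setting `HADOOP_CONF_DIR` from a shell follows. The helper name `use_hadoop_conf_dir` is hypothetical, and the check for `core-site.xml` is only an assumed sanity test for "this looks like a Hadoop configuration directory". The variable is exported so that processes started from the same shell inherit it.

```sh
# Hypothetical helper (not part of Flink): export HADOOP_CONF_DIR after a
# basic sanity check of the given directory.
use_hadoop_conf_dir() {
  conf_dir="$1"
  # Assumption: a usable Hadoop configuration directory contains core-site.xml.
  if [ ! -f "$conf_dir/core-site.xml" ]; then
    echo "no core-site.xml found in '$conf_dir'" >&2
    return 1
  fi
  # Export so that child processes (e.g. the Flink scripts) see the variable.
  export HADOOP_CONF_DIR="$conf_dir"
}
```

For example, `use_hadoop_conf_dir /path/to/etc/hadoop` would be run once in the shell that later starts Flink.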

## Running a job locally

To run a job locally as one JVM process using the mini cluster, the required hadoop dependencies have to be explicitly
added to the classpath of the started JVM process.

To run an application using maven (also from IDE as a maven project), the required hadoop dependencies can be added
To run an application using Maven (also from IDE as a Maven project), the required Hadoop dependencies can be added
as provided to the pom.xml, e.g.:

```xml
@@ -115,9 +83,38 @@ as provided to the pom.xml, e.g.:
</dependency>
```

This way it should work both in local and cluster run where the provided dependencies are added elsewhere as described before.
This way it should work both in local and cluster mode where the provided dependencies are added elsewhere as described before.

To run or debug an application in IntelliJ IDEA, the provided dependencies can be included in the class path
in the "Run|Edit Configurations" window.


## Using `flink-shaded-hadoop-2-uber` jar for resolving dependency conflicts (legacy)

<div class="alert alert-info" markdown="span">
<strong>Warning:</strong> Starting from Flink 1.11, using `flink-shaded-hadoop-2-uber` releases is not officially supported
by the Flink project anymore. Users are advised to provide Hadoop dependencies through `HADOOP_CLASSPATH` (see above).
</div>

Until Flink 1.10, the Flink project released Hadoop distributions for specific versions that relocate or exclude several dependencies to reduce the risk of dependency clashes.
These can be found in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
For these versions, it is sufficient to download the corresponding `Pre-bundled Hadoop` component and put it into
the `/lib` directory of the Flink distribution.

If the Hadoop version you use is not listed on the download page (for example, because it is a vendor-specific version),
you need to build [flink-shaded](https://github.com/apache/flink-shaded) against that version.
You can find the source code for this project in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.

<span class="label label-info">Note</span> If you want to build `flink-shaded` against a vendor-specific Hadoop version, you first have to configure the
vendor-specific Maven repository in your local Maven setup as described [here](https://maven.apache.org/guides/mini/guide-multiple-repositories.html).

Run the following command to build and install `flink-shaded` against your desired Hadoop version (e.g., for version `2.6.5-custom`):

{% highlight bash %}
mvn clean install -Dhadoop.version=2.6.5-custom
{% endhighlight %}

After this step is complete, put the `flink-shaded-hadoop-2-uber` jar into the `/lib` directory of the Flink distribution.
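
The final copy step of this legacy approach could be scripted roughly as follows. This is a sketch only: the helper name `install_hadoop_uber_jar` is hypothetical, and both the jar location and the Flink directory are passed in as arguments because their actual paths depend on your build and installation.

```sh
# Hypothetical helper (not part of Flink): copy a built or downloaded
# flink-shaded-hadoop-2-uber jar into the lib directory of a Flink distribution.
install_hadoop_uber_jar() {
  jar="$1"         # path to the uber jar (location depends on your build)
  flink_home="$2"  # root directory of the extracted Flink distribution
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar" >&2
    return 1
  fi
  if [ ! -d "$flink_home/lib" ]; then
    echo "no lib directory in: $flink_home" >&2
    return 1
  fi
  cp "$jar" "$flink_home/lib/"
}
```

Failing early when the jar or the `lib` directory is missing avoids a silently incomplete setup that would only surface later as missing Hadoop classes.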


{% top %}
81 changes: 39 additions & 42 deletions docs/ops/deployment/hadoop.zh.md
@@ -26,34 +26,13 @@ under the License.
* This will be replaced by the TOC
{:toc}

## Referencing a Hadoop configuration

You can reference a Hadoop configuration by setting the environment variable `HADOOP_CONF_DIR`.

```sh
HADOOP_CONF_DIR=/path/to/etc/hadoop
```

Referencing the HDFS configuration in the [Flink configuration]({{ site.baseurl }}/ops/config.html#hdfs) is deprecated.

Another way to provide the Hadoop configuration is to have it on the class path of the Flink process, see more details below.

## Providing Hadoop classes

In order to use Hadoop features (e.g., YARN, HDFS) it is necessary to provide Flink with the required Hadoop classes,
as these are not bundled by default.

This can be done by
1) Adding the Hadoop classpath to Flink
2) Putting the required jar files into /lib directory of the Flink distribution
Option 1) requires very little work, integrates nicely with existing Hadoop setups and should be the
preferred approach.
However, Hadoop has a large dependency footprint that increases the risk for dependency conflicts to occur.
If this happens, please refer to option 2).

The following subsections explains these approaches in detail.

### Adding Hadoop Classpaths
This can be done by adding the Hadoop classpath to Flink through the `HADOOP_CLASSPATH` environment variable.

Flink will use the environment variable `HADOOP_CLASSPATH` to augment the
classpath that is used when starting Flink components such as the Client,
@@ -75,35 +54,24 @@ in the shell. Note that `hadoop` is the hadoop binary and that `classpath` is an

Putting the Hadoop configuration in the same class path as the Hadoop libraries makes Flink pick up that configuration.

### Adding Hadoop to /lib

The Flink project releases Hadoop distributions for specific versions, that relocate or exclude several dependencies
to reduce the risk of dependency clashes.
These can be found in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
For these versions it is sufficient to download the corresponding `Pre-bundled Hadoop` component and putting it into
the `/lib` directory of the Flink distribution.

If the used Hadoop version is not listed on the download page (possibly due to being a Vendor-specific version),
then it is necessary to build [flink-shaded](https://github.com/apache/flink-shaded) against this version.
You can find the source code for this project in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
## Referencing a Hadoop configuration

<span class="label label-info">Note</span> If you want to build `flink-shaded` against a vendor specific Hadoop version, you first have to configure the
vendor-specific maven repository in your local maven setup as described [here](https://maven.apache.org/guides/mini/guide-multiple-repositories.html).
You can reference a Hadoop configuration by setting the environment variable `HADOOP_CONF_DIR`.

Run the following command to build and install `flink-shaded` against your desired Hadoop version (e.g., for version `2.6.5-custom`):
```sh
HADOOP_CONF_DIR=/path/to/etc/hadoop
```

{% highlight bash %}
mvn clean install -Dhadoop.version=2.6.5-custom
{% endhighlight %}
Referencing the HDFS configuration in the [Flink configuration]({{ site.baseurl }}/ops/config.html#hdfs) is deprecated.

After this step is complete, put the `flink-shaded-hadoop-2-uber` jar into the `/lib` directory of the Flink distribution.
Another way to provide the Hadoop configuration is to have it on the class path of the Flink process; see the details above.

## Running a job locally

To run a job locally as one JVM process using the mini cluster, the required hadoop dependencies have to be explicitly
added to the classpath of the started JVM process.

To run an application using maven (also from IDE as a maven project), the required hadoop dependencies can be added
To run an application using Maven (also from IDE as a Maven project), the required Hadoop dependencies can be added
as provided to the pom.xml, e.g.:

```xml
@@ -115,9 +83,38 @@ as provided to the pom.xml, e.g.:
</dependency>
```

This way it should work both in local and cluster run where the provided dependencies are added elsewhere as described before.
This way it should work both in local and cluster mode where the provided dependencies are added elsewhere as described before.

To run or debug an application in IntelliJ IDEA, the provided dependencies can be included in the class path
in the "Run|Edit Configurations" window.


## Using `flink-shaded-hadoop-2-uber` jar for resolving dependency conflicts (legacy)

<div class="alert alert-info" markdown="span">
<strong>Warning:</strong> Starting from Flink 1.11, using `flink-shaded-hadoop-2-uber` releases is not officially supported
by the Flink project anymore. Users are advised to provide Hadoop dependencies through `HADOOP_CLASSPATH` (see above).
</div>

Until Flink 1.10, the Flink project released Hadoop distributions for specific versions that relocate or exclude several dependencies to reduce the risk of dependency clashes.
These can be found in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.
For these versions, it is sufficient to download the corresponding `Pre-bundled Hadoop` component and put it into
the `/lib` directory of the Flink distribution.

If the Hadoop version you use is not listed on the download page (for example, because it is a vendor-specific version),
you need to build [flink-shaded](https://github.com/apache/flink-shaded) against that version.
You can find the source code for this project in the [Additional Components]({{ site.download_url }}#additional-components) section of the download page.

<span class="label label-info">Note</span> If you want to build `flink-shaded` against a vendor-specific Hadoop version, you first have to configure the
vendor-specific Maven repository in your local Maven setup as described [here](https://maven.apache.org/guides/mini/guide-multiple-repositories.html).

Run the following command to build and install `flink-shaded` against your desired Hadoop version (e.g., for version `2.6.5-custom`):

{% highlight bash %}
mvn clean install -Dhadoop.version=2.6.5-custom
{% endhighlight %}

After this step is complete, put the `flink-shaded-hadoop-2-uber` jar into the `/lib` directory of the Flink distribution.


{% top %}
23 changes: 8 additions & 15 deletions docs/ops/deployment/yarn_setup.md
@@ -33,11 +33,8 @@ under the License.
Start a YARN session where the job manager gets 1 GB of heap space and the task managers 4 GB of heap space assigned:

{% highlight bash %}
# get the hadoop2 package from the Flink download page at
# {{ site.download_url }}
curl -O <flink_hadoop2_download_url>
tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
cd flink-{{ site.version }}/
# If HADOOP_CLASSPATH is not set:
# export HADOOP_CLASSPATH=`hadoop classpath`
./bin/yarn-session.sh -jm 1024m -tm 4096m
{% endhighlight %}

@@ -48,11 +45,8 @@ Once the session has been started, you can submit jobs to the cluster using the
### Run a Flink job on YARN

{% highlight bash %}
# get the hadoop2 package from the Flink download page at
# {{ site.download_url }}
curl -O <flink_hadoop2_download_url>
tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
cd flink-{{ site.version }}/
# If HADOOP_CLASSPATH is not set:
# export HADOOP_CLASSPATH=`hadoop classpath`
./bin/flink run -m yarn-cluster -p 4 -yjm 1024m -ytm 4096m ./examples/batch/WordCount.jar
{% endhighlight %}

@@ -62,11 +56,9 @@ Apache [Hadoop YARN](http://hadoop.apache.org/) is a cluster resource management

**Requirements**

- at least Apache Hadoop 2.2
- at least Apache Hadoop 2.4.1
- HDFS (Hadoop Distributed File System) (or another distributed file system supported by Hadoop)

If you have troubles using the Flink YARN client, have a look in the [FAQ section](https://flink.apache.org/faq.html#yarn-deployment).

### Start Flink Session

Follow these instructions to learn how to launch a Flink Session within your YARN cluster.
@@ -75,15 +67,16 @@ A session will start all required Flink services (JobManager and TaskManagers) s

#### Download Flink

Download a Flink package for Hadoop >= 2 from the [download page]({{ site.download_url }}). It contains the required files.
Download a Flink package from the [download page]({{ site.download_url }}). It contains the required files.

Extract the package using:

{% highlight bash %}
tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
tar xvzf flink-{{ site.version }}-bin-scala*.tgz
cd flink-{{site.version }}/
{% endhighlight %}


#### Start a Session

Use the following command to start a session
27 changes: 10 additions & 17 deletions docs/ops/deployment/yarn_setup.zh.md
@@ -33,11 +33,8 @@ under the License.
Start a YARN session where the job manager gets 1 GB of heap space and the task managers 4 GB of heap space assigned:

{% highlight bash %}
# get the hadoop2 package from the Flink download page at
# {{ site.download_url }}
curl -O <flink_hadoop2_download_url>
tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
cd flink-{{ site.version }}/
# If HADOOP_CLASSPATH is not set:
# export HADOOP_CLASSPATH=`hadoop classpath`
./bin/yarn-session.sh -jm 1024m -tm 4096m
{% endhighlight %}

@@ -48,11 +45,8 @@ Once the session has been started, you can submit jobs to the cluster using the
### Run a Flink job on YARN

{% highlight bash %}
# get the hadoop2 package from the Flink download page at
# {{ site.download_url }}
curl -O <flink_hadoop2_download_url>
tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
cd flink-{{ site.version }}/
# If HADOOP_CLASSPATH is not set:
# export HADOOP_CLASSPATH=`hadoop classpath`
./bin/flink run -m yarn-cluster -p 4 -yjm 1024m -ytm 4096m ./examples/batch/WordCount.jar
{% endhighlight %}

@@ -62,11 +56,9 @@ Apache [Hadoop YARN](http://hadoop.apache.org/) is a cluster resource management

**Requirements**

- at least Apache Hadoop 2.2
- at least Apache Hadoop 2.4.1
- HDFS (Hadoop Distributed File System) (or another distributed file system supported by Hadoop)

If you have troubles using the Flink YARN client, have a look in the [FAQ section](https://flink.apache.org/faq.html#yarn-deployment).

### Start Flink Session

Follow these instructions to learn how to launch a Flink Session within your YARN cluster.
@@ -75,15 +67,16 @@ A session will start all required Flink services (JobManager and TaskManagers) s

#### Download Flink

Download a Flink package for Hadoop >= 2 from the [download page]({{ site.download_url }}). It contains the required files.
Download a Flink package from the [download page]({{ site.download_url }}). It contains the required files.

Extract the package using:

{% highlight bash %}
tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
tar xvzf flink-{{ site.version }}-bin-scala*.tgz
cd flink-{{site.version }}/
{% endhighlight %}


#### Start a Session

Use the following command to start a session
@@ -125,7 +118,7 @@ If you don't want to change the configuration file to set configuration paramete

The example invocation starts a single container for the ApplicationMaster which runs the Job Manager.

The session cluster will automatically allocate additional containers which run the Task Managers when jobs are submitted to the cluster.
The session cluster will automatically allocate additional containers which run the Task Managers when jobs are submitted to the cluster.

Once Flink is deployed in your YARN cluster, it will show you the connection details of the Job Manager.

@@ -338,4 +331,4 @@ The *JobManager* and AM are running in the same container. Once they successfull

After that, the AM starts allocating the containers for Flink's TaskManagers, which will download the jar file and the modified configuration from the HDFS. Once these steps are completed, Flink is set up and ready to accept Jobs.

{% top %}
{% top %}
