Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interactive docker image #709

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

indy-3rdman
Copy link

This PR contains a build script, Dockerfile(s), README.md and supporting files to create a docker image that can run .NET for Apache Spark in an interactive Jupyter notebook.

An initial description for the interactive image, along with the folder structure can be found here: https://github.com/indy-3rdman/spark/tree/interactive-docker-image/docker/images/interactive

Copy link
Member

@MichaelSimons MichaelSimons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mainly looked at the dotnet-interactive/Dockerfile. I left a few suggestions/questions that may help simplify the Dockerfile.

@@ -0,0 +1,46 @@
FROM jupyter/base-notebook:ubuntu-18.04

ARG NB_USER=jovyan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the usage in which this ARG would be specified when building? I am not seeing how this would work if a different user were specified. By this I mean I don't see the user being defined/added. The base image defines the "jovyan" user, why would you want do define something different? The same applies to NB_UID.

ARG NB_USER=jovyan
ARG NB_UID=1000
ARG DOTNET_CORE_VERSION=3.1
ARG DEBIAN_FRONTEND=noninteractive
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? The base image is already defining ENV DEBIAN_FRONTEND=noninteractive

ARG DEBIAN_FRONTEND=noninteractive

ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION
ENV USER ${NB_USER}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to a previous comment, are these USER related ENVs necessary since they are defined in the base image.

DOTNET_RUNNING_IN_CONTAINER=true \
DOTNET_USE_POLLING_FILE_WATCHER=true \
NUGET_XMLDOC_MODE=skip \
DOTNET_TRY_CLI_TELEMETRY_OPTOUT=true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is being proposed as part of the .NET project, telemetry should remain enabled.


RUN apt-get update \
&& apt-get install -y --no-install-recommends \
dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is requiring all of these dependencies. Ones that are already provided by the base image don't seem necessary to include.

&& apt-get install -y apt-transport-https \
&& apt-get update \
&& apt-get install -y dotnet-sdk-$DOTNET_CORE_VERSION \
&& apt-get autoremove -y --purge \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this remove anything?

&& apt-get update \
&& apt-get install -y dotnet-sdk-$DOTNET_CORE_VERSION \
&& apt-get autoremove -y --purge \
&& apt-get clean \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary because of the next line.

&& rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& chmod 755 /usr/local/bin/start-spark-debug.sh \
&& chown -R ${NB_UID} ${HOME} \
&& cd ${HOME}/nb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem necessary.


RUN apt-get update \
&& apt-get install -y --no-install-recommends \
dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the Dockerfile best practices, it helps the readability to break this apart one package per line and alphabetize them.

@indy-3rdman
Copy link
Author

@MichaelSimons, thank you very much for reviewing the Dockerfile and your valuable comments. I've just updated the PR to reflect the changes required for dotnet.spark version 1.0.0. This should also include an updated version of the Dockerfile, addressing your comments.

Copy link
Member

@MichaelSimons MichaelSimons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@indy-3rdman, Thanks for addressing my comments. I took another look and had a few more suggestions and questions.

@@ -21,24 +14,30 @@ USER root

RUN apt-get update \
&& apt-get install -y --no-install-recommends \
dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
libc6 libgcc1 libgssapi-krb5-2 libicu60 libssl1.1 libstdc++6 zlib1g \
apt-utils \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is requiring all of these native dependencies? Several are already provided by the base image so they don't seem necessary to declare.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be cleaned up now. Java obviously is required by spark.

libgssapi-krb5-2 \
libicu60 \
libssl1.1 \
libstdc++6 zlib1g \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple packages listed together, should get split apart so that zlib1g is not overlooked.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be in a separate line as well now.

libstdc++6 zlib1g \
openjdk-8-jdk \
software-properties-common \
unzip \
&& wget -q --show-progress --progress=bar:force:noscroll https://packages.microsoft.com/config/ubuntu/18.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider: I've typically seen Dockerfile avoid using --show-progress as it does have a perf impact.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That raises an interesting point about the purpose of the Dockerfile(s). As far as I am aware, the focus at the moment is to enable an user to build the image(s) her/himself, instead of automating the image build process. That's why I thought it would be useful to show the download progress. Now, for small downloads that doesn't really matter and I therefore have removed it. However, I have added the following line to the dotnet-spark/Dockerfile

&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \

as the spark download can take a while. Does that make sense?


RUN mkdir -p /dotnet/HelloSpark \
&& mkdir -p /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION} \
&& wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should at some point perform checksum validation. @JeremyLikness, are checksums published for the Spark artifacts?

RUN mkdir -p /dotnet/HelloSpark \
&& mkdir -p /dotnet/Debug/netcoreapp${DOTNET_CORE_VERSION} \
&& wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
&& tar -xvzf Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered extracting the tarball to the /dotnet folder instead of extracting it to the working dir and then immediately mv it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be changed in the latest commit.

&& wget -q --show-progress --progress=bar:force:noscroll https://github.com/dotnet/spark/releases/download/v${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
&& tar -xvzf Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz \
&& mv Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION} /dotnet/ \
&& cp /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker.exe \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this is necessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this is just a leftover from another file. Removed.

&& chmod 755 /dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}/Microsoft.Spark.Worker \
&& rm Microsoft.Spark.Worker.netcoreapp${DOTNET_CORE_VERSION}.linux-x64-${DOTNET_SPARK_VERSION}.tar.gz

COPY HelloSpark /dotnet/HelloSpark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the HelloSpark project is used to install the correct microsoft-spark-*.jar version that is required to start a spark-submit session in debug mode. Is it necessary to have the project in the resulting image? It feels a little strange to have a sample project in a "base" image.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from dotnet-spark-base to dotnet-spark Dockerfile. Additionally this is now removed, after the jar file has been copied.


USER root
ENV DAEMON_RUN=true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using the multi-line ENV format? I find that is can help the readability of the Dockerfiles in that it helps the reader easily scan the Dockerfile. Tt makes it more obvious these are all ENV instructions.

ENV DAEMON_RUN=true \
    DOTNETBACKEND_PORT=5567 \
    HADOOP_VERSION=2.7 \
    JUPYTER_ENABLE_LAB=true \
    SPARK_VERSION=$SPARK_VERSION \
    SPARK_HOME=/spark \
    PATH="${SPARK_HOME}/bin:${DOTNET_WORKER_DIR}:${PATH}"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the Dockerfiles.

ARG DOTNET_SPARK_VERSION=0.12.1
ENV DOTNET_SPARK_VERSION=$DOTNET_SPARK_VERSION
ENV DOTNET_WORKER_DIR=/dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}
ARG SPARK_VERSION=3.0.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having version numbers like this hard coded gives me pause. Is this done so that the Dockerfile as it is checked in is buildable without having to specify any args? The problem that introduces is a maintenance burden of keeping it up-to-date.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to my earlier point about the purpose of the Dockerfile(s). The intention was to have a build-able Dockerfile even if the build script is not used. I agree with your observation about maintenance. Maybe @rapoth has a view on that.

@@ -8,12 +8,17 @@ set -o nounset # abort on unbound variable
set -o pipefail # don't hide errors within pipes

readonly image_repository='3rdman'
readonly supported_apache_spark_versions=("2.3.3" "2.3.4" "2.4.0" "2.4.1" "2.4.3" "2.4.4" "2.4.5" "2.4.6")
readonly supported_dotnet_spark_versions=("0.9.0" "0.10.0" "0.11.0" "0.12.1")
readonly supported_apache_spark_versions=(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a question for Spark team. Thoughts on how to keep this version list up-to-date and other versions included in this script up-to-date? It feels like there should be long term plans for getting this updated "automatically" as part of the release process. Without this they will become stale and/or be a maintenance burden.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

ENV PATH="${PATH}:${HOME}/.dotnet/tools"

ENV DOTNET_RUNNING_IN_CONTAINER=true \
ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the Dockerfile Best Practices, sort multi-line instructions to improve readability where possible (e.g. cross dependencies)

&& rm -rf /dotnet/HelloSpark \
&& cd / \
&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \
&& wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can extract to the spark directory with a single instruction which would eliminate the need for mv

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you mean to use tar with --directory. But wouldn't that required that the directory exist already? In that case I'd have to add a mkdir first.

Copy link
Member

@MichaelSimons MichaelSimons Oct 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct, I missed what was happening here. Please ignore my comment.

RUN cd /dotnet/HelloSpark \
&& dotnet build \
&& cp /dotnet/HelloSpark/bin/Debug/netcoreapp${DOTNET_CORE_VERSION}/microsoft-spark-*.jar ${HOME}/ \
&& rm -rf /dotnet/HelloSpark \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unfortunate consequence of this pattern is that HelloSpark remains in the image as a result of obtaining it via COPY. This is not desirable. Is there a way this can be generated during the Docker build or can it be a published tarball so that is can get copied and deleted within a single Dockerfile instruction?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @MichaelSimons for your great feedback! Just creating a dummy project during the build process now.

Base automatically changed from master to main March 18, 2021 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants