Add Hugging Face datasets (Kaggle#1152)
- Included a smoke test
- Fix apt-key issue

http://b/230657835
rosbo committed Apr 29, 2022
1 parent f1a3cfc commit 2e43fcc
Showing 2 changed files with 24 additions and 4 deletions.
12 changes: 8 additions & 4 deletions Dockerfile.tmpl
@@ -51,13 +51,16 @@ RUN pip uninstall -y horovod && \
     /tmp/clean-layer.sh
 {{ end }}

+{{ if eq .Accelerator "gpu" }}
+# b/230864778: Temporarily swap the NVIDIA GPG key. Remove once new base image with new GPG key is released.
+RUN rm /etc/apt/sources.list.d/cuda.list && \
+    apt-key del 7fa2af80 && \
+    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
+{{ end }}
+
 # Use a fixed apt-get repo to stop intermittent failures due to flaky httpredir connections,
 # as described by Lionel Chan at http://stackoverflow.com/a/37426929/5881346
 RUN sed -i "s/httpredir.debian.org/debian.uchicago.edu/" /etc/apt/sources.list && \
-    # b/230864778: Temporarily swap the NVIDIA GPG key. Remove once new base image with new GPG key is released.
-    rm /etc/apt/sources.list.d/cuda.list && \
-    apt-key del 7fa2af80 && \
-    apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub && \
     apt-get update && \
     # Needed by lightGBM (GPU build)
     # https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html#build-lightgbm
@@ -491,6 +494,7 @@ RUN pip install flashtext && \
     pip install bqplot && \
     pip install earthengine-api && \
     pip install transformers && \
+    pip install datasets && \
     pip install dlib && \
     pip install kaggle-environments && \
     pip install geopandas && \
16 changes: 16 additions & 0 deletions tests/test_hf_datasets.py
@@ -0,0 +1,16 @@
+import unittest
+
+from datasets import Dataset
+
+
+class TestHuggingFaceDatasets(unittest.TestCase):
+
+    def test_map(self):
+        def some_func(batch):
+            batch['label'] = 'foo'
+            return batch
+
+        df = Dataset.from_dict({'text': ['Kaggle rocks!']})
+        mapped_df = df.map(some_func)
+
+        self.assertEqual('foo', mapped_df[0]['label'])
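
The smoke test above calls `Dataset.map` without `batched=True`, so `some_func` receives one example at a time as a dict, even though its parameter is named `batch`. The block below is a minimal stdlib-only sketch of that per-example behavior, meant only to show what the assertion checks. `map_examples` is a hypothetical helper, not part of the `datasets` library, and it does not reflect how the library is implemented.

```python
def map_examples(columns, func):
    """Apply func to each row of a column-oriented dict and return the rows.

    Hypothetical stand-in for per-example mapping; not the datasets API.
    """
    names = list(columns)
    # Turn {'col': [v0, v1, ...]} into one dict per row.
    rows = [dict(zip(names, values)) for values in zip(*columns.values())]
    # Pass a copy of each row so func cannot mutate the input rows.
    return [func(dict(row)) for row in rows]


def some_func(example):
    # Same transformation as in the smoke test: add a constant 'label' column.
    example['label'] = 'foo'
    return example


mapped = map_examples({'text': ['Kaggle rocks!']}, some_func)
print(mapped[0])  # {'text': 'Kaggle rocks!', 'label': 'foo'}
```

As in the real test, the first mapped row keeps its `text` value and gains a `label` value of `'foo'`.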
