coyo-dataset/download at main · justHungryMan/coyo-dataset


Download the metadata

Download the metadata files from the Hugging Face dataset repository.

```
# install git lfs
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

# download coyo-700m
git clone https://huggingface.co/datasets/kakaobrain/coyo-700m
```
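After cloning, it can be worth confirming that all metadata shards arrived. The sketch below assumes the parquet files sit under `coyo-700m/data/` (as the download URLs used elsewhere in this README suggest) and that there are 128 of them (parts 00000 to 00127). `count_parquet_parts` is a hypothetical helper name, not part of the dataset tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the part-*.snappy.parquet shards in a directory.
count_parquet_parts() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f -name 'part-*.snappy.parquet' \
        | wc -l | tr -d '[:space:]'
}

# Usage: assumes the clone is in ./coyo-700m with the shards under data/.
if [ -d coyo-700m/data ]; then
    found="$(count_parquet_parts coyo-700m/data)"
    if [ "$found" -eq 128 ]; then
        echo "all 128 metadata shards present"
    else
        echo "expected 128 shards, found $found" >&2
    fi
fi
```

If the count is right but the files are only a few hundred bytes each, they are probably Git LFS pointer files, which usually means `git lfs install` was not run before cloning.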

Download the images with img2dataset

img2dataset downloads images from a list of URLs and stores them in webdataset or TFRecord format for large-scale distributed training.

Download using Google Cloud Dataproc

Dataproc is a fully managed and highly scalable service for running Apache Spark and 30+ open source tools. With Dataproc, you can easily configure multi-node clusters and process large amounts of data quickly.

Copy dataproc-initialization.sh to your GCS bucket. It contains the commands that initialize the environment when each node is launched (e.g., pip install img2dataset).
```
gsutil cp dataproc-initialization.sh gs://${YOUR_GS_BUCKET}/dataproc/dataproc-initialization.sh
```

Download metadata and upload it to Google Cloud Storage

```
for i in {00000..00127}; do
    part="part-$i-17da4908-939c-46e5-91d0-15f256041956-c000.snappy.parquet"
    wget "https://huggingface.co/datasets/kakaobrain/coyo-700m/resolve/main/data/${part}" -O - \
        | gsutil cp - "gs://${YOUR_GS_BUCKET}/dataset/coyo-700m/parquet/${part}"
done
```
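The loop relies on Bash brace expansion with zero padding (`{00000..00127}`), which needs Bash 4 or later. As a sketch, the same 128 file names can also be generated with `printf`, which is handy for checking the upload afterwards. `coyo_part_name` and `coyo_all_part_names` are hypothetical helpers; the UUID suffix is copied from the command above.

```shell
#!/usr/bin/env bash
# Suffix shared by every metadata shard (taken from the download URL above).
SUFFIX="17da4908-939c-46e5-91d0-15f256041956-c000.snappy.parquet"

# Hypothetical helper: print the shard file name for an index in 0..127.
coyo_part_name() {
    printf 'part-%05d-%s\n' "$1" "$SUFFIX"
}

# Print all 128 expected shard names, one per line.
coyo_all_part_names() {
    local i
    for i in $(seq 0 127); do
        coyo_part_name "$i"
    done
}

# Show the first and last expected names.
coyo_part_name 0
coyo_part_name 127
```

Comparing the output of `coyo_all_part_names` with the base names listed by `gsutil ls gs://${YOUR_GS_BUCKET}/dataset/coyo-700m/parquet/` should reveal any shard that failed to upload.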

Create Dataproc Cluster

```
gcloud dataproc clusters create coyo-700m \
    --project=${YOUR_PROJECT_NAME} \
    --region=${YOUR_REGION} \
    --zone=${YOUR_ZONE} \
    --master-machine-type=n2-standard-16 \
    --num-workers=2 \
    --worker-machine-type=n2-standard-16 \
    --num-secondary-workers=8 \
    --secondary-worker-boot-disk-size=100 \
    --image-version=2.0-ubuntu18 \
    --scopes='https://www.googleapis.com/auth/cloud-platform' \
    --properties='yarn:yarn.nodemanager.user-home-dir=/var/lib/hadoop-yarn' \
    --initialization-actions=gs://${YOUR_GS_BUCKET}/dataproc/dataproc-initialization.sh
```

This command creates a Dataproc cluster with 11 nodes in total (1 master and 10 workers):
- 1 master machine
- 2 primary workers
- 8 secondary workers (preemptible/spot instances)
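The cluster size follows directly from the flags: counting the master, there are 1 + 2 + 8 = 11 nodes, 10 of which are workers. The arithmetic below is only a sketch. It assumes that secondary workers use the same machine type as the primary workers and that n2-standard-16 provides 16 vCPUs per node; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Values taken from the cluster-creation flags above.
MASTERS=1
PRIMARY_WORKERS=2      # --num-workers
SECONDARY_WORKERS=8    # --num-secondary-workers
VCPUS_PER_NODE=16      # n2-standard-16 (assumed for secondary workers too)

TOTAL_WORKERS=$((PRIMARY_WORKERS + SECONDARY_WORKERS))
TOTAL_NODES=$((MASTERS + TOTAL_WORKERS))
WORKER_VCPUS=$((TOTAL_WORKERS * VCPUS_PER_NODE))

echo "nodes: ${TOTAL_NODES}, workers: ${TOTAL_WORKERS}, worker vCPUs: ${WORKER_VCPUS}"
# prints: nodes: 11, workers: 10, worker vCPUs: 160
```

Because the secondary workers can be reclaimed at any time, the worker vCPU figure is an upper bound rather than a guaranteed capacity.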

Run/Submit PySpark Job

```
gcloud dataproc jobs submit pyspark --cluster=coyo-700m dataproc-img2dataset.py -- \
    --url_list=gcs://${YOUR_GS_BUCKET}/dataset/coyo-700m/parquet \
    --input_format="parquet" \
    --url_col="url" \
    --caption_col="text" \
    --output_format=webdataset \
    --output_folder=gcs://${YOUR_GS_BUCKET}/dataset/coyo-700m/webdataset \
    --distributor="pyspark" \
    --processes_count=1 \
    --thread_count=64 \
    --image_size=384 \
    --retries=1 \
    --min_image_size=200 \
    --max_aspect_ratio=3 \
    --resize_only_if_bigger=True \
    --resize_mode="keep_ratio" \
    --skip_reencode=True \
    --save_additional_columns='["clip_similarity_vitb32","clip_similarity_vitl14","nsfw_score_opennsfw2","nsfw_score_gantman","watermark_score","aesthetic_score_laion_v2"]' \
    --enable_wandb=False
```
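The `--min_image_size=200` and `--max_aspect_ratio=3` flags discard some images before they are stored. The sketch below illustrates the likely effect; it assumes the filters compare the shorter side against the minimum size and the long/short side ratio against the maximum ratio, which should be checked against the img2dataset documentation. `would_keep_image` is a hypothetical helper, not an img2dataset function.

```shell
#!/usr/bin/env bash
# Thresholds taken from the job flags above.
MIN_IMAGE_SIZE=200     # --min_image_size
MAX_ASPECT_RATIO=3     # --max_aspect_ratio

# Hypothetical helper: return 0 (keep) or 1 (drop) for a width and height,
# under the assumed filter semantics described above.
would_keep_image() {
    local w="$1" h="$2" short long
    if [ "$w" -lt "$h" ]; then short="$w"; long="$h"; else short="$h"; long="$w"; fi
    # Drop images whose shorter side is below the minimum size.
    [ "$short" -ge "$MIN_IMAGE_SIZE" ] || return 1
    # Drop images whose long/short ratio exceeds the maximum
    # (integer form of: long / short > ratio).
    [ "$long" -le $((short * MAX_ASPECT_RATIO)) ] || return 1
    return 0
}

would_keep_image 640 480 && echo "640x480: keep"
would_keep_image 1200 150 || echo "1200x150: drop (shorter side under 200)"
would_keep_image 900 250 || echo "900x250: drop (aspect ratio over 3)"
```

Under these assumptions, some URLs in the metadata will have no matching image in the output even when the download itself succeeds.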
