Split definition of DTD, EuroSAT and SUN397 #1
Hi, awesome work!

I'm trying to reproduce your results but I cannot find the split definitions you use for DTD, EuroSAT and SUN397. Would you mind pointing me to the right resources to download the versions of these datasets compatible with your code?

Thanks a lot!
Hi @gortizji. Thanks for the interest in our work and for the kind words!
Please note that in this codebase we use the suffix "Val" to indicate that we want to use the validation sets instead of the test sets (e.g. evaluating on …). Hope this helps, and let me know if you have any other questions!
Thanks @gabrielilharco for your quick answer. Some follow-up questions: …

Thanks!
For DTD and SUN397, yes, we use the first split (train1.txt + val1.txt for training and test1.txt for testing for DTD, and Training_01.txt / Testing_01.txt for SUN397, as in https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip). For EuroSAT, it indeed has 27,000 images in total, and we use a 21,600/2,700/2,700 train/val/test split (I also updated the previous message).
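If you want to double-check a rebuilt EuroSAT copy against those numbers, a quick count works (a sketch; the `EuroSAT_splits` path and one-folder-per-class layout are assumptions based on the scripts later in this thread):

```python
import os

def count_images(split_dir):
    # Walk the per-class subfolders of one split directory and count the files.
    return sum(len(files) for _, _, files in os.walk(split_dir))

# "EuroSAT_splits" is a hypothetical output path; adjust to your own.
for split, expected in [("train", 21600), ("val", 2700), ("test", 2700)]:
    print(split, count_images(os.path.join("EuroSAT_splits", split)), "expected:", expected)
```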
That makes sense 😄. Thanks a lot @gabrielilharco!
Hi again,

Could you comment on what the expected folder structure for … is?

Thanks in advance 😄
Hi @gortizji,

We expect the data to be stored without nested folders; it should look like this:
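For SUN397, for instance, the per-class folders sit directly under each split directory (a sketch consistent with the processing scripts later in this thread; filenames are placeholders):

```
sun397/
├── train/
│   ├── a_abbey/
│   │   ├── sun_xxx.jpg
│   │   └── ...
│   └── ...
└── val/
    ├── a_abbey/
    └── ...
```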
Perfect! Thanks a lot.
Hi @gabrielilharco and @gortizji,

I am facing a similar issue. I have downloaded the datasets from the provided links, but I am not sure how to structure the downloaded files so that they can be loaded correctly. Do either of you have a script that can be used to correctly structure these datasets? Thank you in advance!
I kind of figured this out myself, but for anyone else like me, here are the scripts I used. There are four datasets that require manual downloading.

```python
## PROCESS SUN397 DATASET
import os
import shutil


def process_dataset(txt_file, downloaded_data_path, output_folder):
    with open(txt_file, 'r') as file:
        lines = file.readlines()
    for i, line in enumerate(lines):
        input_path = line.strip()
        # e.g. "/a/abbey/sun_xxx.jpg" -> class folder "a_abbey"
        final_folder_name = "_".join(x for x in input_path.split('/')[:-1])[1:]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)
        if not os.path.exists(output_class_folder):
            os.makedirs(output_class_folder)
        full_input_path = os.path.join(downloaded_data_path, input_path[1:])
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")


downloaded_data_path = "path/to/downloaded/SUN/data"
process_dataset('Training_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "train"))
process_dataset('Testing_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "val"))
```
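To illustrate the class-folder mapping in `process_dataset` above: the split files from Partitions.zip list paths like `/a/abbey/sun_xxx.jpg` (the leading letter is SUN397's alphabetical bucketing), and the script flattens everything but the filename into one folder name. A quick check (the sample path is illustrative, the filename made up):

```python
input_path = "/a/abbey/sun_xxx.jpg"  # example line from Training_01.txt
final_folder_name = "_".join(input_path.split('/')[:-1])[1:]
print(final_folder_name)  # a_abbey
```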
```python
### PROCESS EuroSAT_RGB DATASET
import os
import shutil
import random


def create_directory_structure(base_dir, classes):
    for dataset in ['train', 'val', 'test']:
        path = os.path.join(base_dir, dataset)
        os.makedirs(path, exist_ok=True)
        for cls in classes:
            os.makedirs(os.path.join(path, cls), exist_ok=True)


def split_dataset(base_dir, source_dir, classes, val_size=270, test_size=270):
    # 270 val + 270 test images per class (x10 classes) matches the 21,600/2,700/2,700 split.
    for cls in classes:
        class_path = os.path.join(source_dir, cls)
        images = os.listdir(class_path)
        random.shuffle(images)  # unseeded, so the split differs between runs
        val_images = images[:val_size]
        test_images = images[val_size:val_size + test_size]
        train_images = images[val_size + test_size:]
        for img in train_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(base_dir, 'train', cls, img))
        for img in val_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(base_dir, 'val', cls, img))
        for img in test_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(base_dir, 'test', cls, img))


source_dir = '/nas-hdd/prateek/data/EuroSAT_RGB'     # replace with the path to your dataset
base_dir = '/nas-hdd/prateek/data/EuroSAT_Splitted'  # replace with the path to the output directory
classes = [d for d in os.listdir(source_dir) if os.path.isdir(os.path.join(source_dir, d))]
create_directory_structure(base_dir, classes)
split_dataset(base_dir, source_dir, classes)
```

Cheers,
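One caveat: since `random.shuffle` is unseeded, every run produces a different EuroSAT split. If you want the split to be reproducible, a minimal tweak is to pin the ordering before shuffling (a sketch; the helper name and seed value are my own choices):

```python
import os
import random

def stable_listing(class_path, seed=0):
    # Sort first so the result doesn't depend on filesystem listing order,
    # then shuffle with a fixed seed (0 is an arbitrary choice).
    images = sorted(os.listdir(class_path))
    random.Random(seed).shuffle(images)
    return images
```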
Thanks a lot @prateeky2806!
Hi @gabrielilharco and @prateeky2806,

I might have missed something, but why doesn't …?

Is there something I missed?

Thanks,
It seems that creating a folder for each class is necessary. Either way, I'll attach the scripts I used to set up resisc45:

```bash
mkdir resisc45 && cd resisc45

# Download the dataset and splits
FILE=NWPU-RESISC45.rar
ID=1DnPSU5nVSN7xv95bpZ3XQ0JhKXZOKgIv
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&id=$ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p')&id=$ID" -O $FILE && rm -rf /tmp/cookies.txt
unrar x $FILE
wget -O resisc45-train.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-train.txt"
wget -O resisc45-val.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-val.txt"
wget -O resisc45-test.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-test.txt"
```
```python
# Partition the dataset into different classes
import os
import shutil


def create_directory_structure(data_root, split):
    split_file = f'resisc45-{split}.txt'
    with open(os.path.join(data_root, split_file), 'r') as f:
        lines = f.readlines()
    for l in lines:
        l = l.strip()
        # e.g. "airplane_001.jpg" -> class "airplane"
        class_name = '_'.join(l.split('_')[:-1])
        class_dir = os.path.join(data_root, 'NWPU-RESISC45', class_name)
        if not os.path.exists(class_dir):
            os.mkdir(class_dir)
        shutil.move(os.path.join(data_root, 'NWPU-RESISC45', l),
                    os.path.join(class_dir, l))


data_root = '/home/frederic/data/resisc45'  # replace with your path
for split in ['train', 'val', 'test']:
    create_directory_structure(data_root, split)
```

Cheers,
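After this restructuring, all 45 class folders sit under `NWPU-RESISC45`, so the images can be read with a standard ImageFolder-style loader (a sketch; the path is the one from my script above, and train/val/test membership still comes from the txt files):

```python
from torchvision import datasets

# Loads every class folder created above; the splits remain defined by the txt files.
dataset = datasets.ImageFolder('/home/frederic/data/resisc45/NWPU-RESISC45')
print(len(dataset), len(dataset.classes))
```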
In case anyone wants fully automatic code for dataset preparation, I'll attach mine. Before running it, please manually download the resisc45 dataset (see the comment in `download.sh`).

`download.sh`:

```bash
sudo apt -y install kaggle
mkdir <your base dir>
cd <your base dir>
export KAGGLE_USERNAME=<your kaggle username>
export KAGGLE_KEY=<your kaggle key>
# stanford cars dataset (ref: https://github.com/pytorch/vision/issues/7545#issuecomment-1631441616)
mkdir stanford_cars && cd stanford_cars
kaggle datasets download -d jessicali9530/stanford-cars-dataset
kaggle datasets download -d abdelrahmant11/standford-cars-dataset-meta
unzip standford-cars-dataset-meta.zip
unzip stanford-cars-dataset.zip
tar -xvzf car_devkit.tgz
mv cars_test a
mv a/cars_test/ cars_test
rm -rf a
mv cars_train a
mv a/cars_train/ cars_train
rm -rf a
mv 'cars_test_annos_withlabels (1).mat' cars_test_annos_withlabels.mat
rm -rf 'cars_annos (2).mat' *.zip
cd ..
# ressic45
mkdir resisc45 && cd resisc45
# (manual download) https://onedrive.live.com/?authkey=%21AHHNaHIlzp%5FIXjs&id=5C5E061130630A68%21107&cid=5C5E061130630A68&parId=root&parQt=sharedby&o=OneUp
mv ~/NWPU-RESISC45.rar ./
sudo apt -y install unar
unar NWPU-RESISC45.rar
wget -O resisc45-train.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-train.txt"
wget -O resisc45-val.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-val.txt"
wget -O resisc45-test.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-test.txt"
rm -rf NWPU-RESISC45.rar
cd ..
# dtd
mkdir dtd && cd dtd
wget https://www.robots.ox.ac.uk/~vgg/data/dtd/download/dtd-r1.0.1.tar.gz
tar -xvzf dtd-r1.0.1.tar.gz
rm -rf dtd-r1.0.1.tar.gz
mv dtd/images images
mv dtd/imdb/ imdb
mv dtd/labels labels
cat labels/train1.txt labels/val1.txt > labels/train.txt
cat labels/test1.txt > labels/test.txt
cd ..
# euro_sat
mkdir euro_sat && cd euro_sat
wget --no-check-certificate https://madm.dfki.de/files/sentinel/EuroSAT.zip
unzip EuroSAT.zip
rm -rf EuroSAT.zip
cd ..
# sun397
mkdir sun397 && cd sun397
wget https://vision.princeton.edu/projects/2010/SUN/SUN397.tar.gz
wget https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip
unzip Partitions.zip
tar -xvzf SUN397.tar.gz
rm -rf SUN397.tar.gz
```

`split_dataset.py`:

```python
base_dir = '<your base dir>'
### PROCESS SUN397 DATASET
import os
import shutil

downloaded_data_path = f"{base_dir}/sun397"
output_path = f"{base_dir}/sun397"


def process_dataset(txt_file, downloaded_data_path, output_folder):
    with open(txt_file, 'r') as file:
        lines = file.readlines()
    for i, line in enumerate(lines):
        input_path = line.strip()
        # e.g. "/a/abbey/sun_xxx.jpg" -> class folder "a_abbey"
        final_folder_name = "_".join(x for x in input_path.split('/')[:-1])[1:]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)
        if not os.path.exists(output_class_folder):
            os.makedirs(output_class_folder)
        full_input_path = os.path.join(downloaded_data_path, input_path[1:])
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")


process_dataset(
    os.path.join(downloaded_data_path, 'Training_01.txt'),
    os.path.join(downloaded_data_path, 'SUN397'),
    os.path.join(output_path, "train"),
)
process_dataset(
    os.path.join(downloaded_data_path, 'Testing_01.txt'),
    os.path.join(downloaded_data_path, 'SUN397'),
    os.path.join(output_path, "val"),
)
### PROCESS EuroSAT_RGB DATASET
import random

src_dir = f'{base_dir}/euro_sat/2750'   # replace with the path to your dataset
dst_dir = f'{base_dir}/EuroSAT_splits'  # replace with the path to the output directory


def create_directory_structure(dst_dir, classes):
    for dataset in ['train', 'val', 'test']:
        path = os.path.join(dst_dir, dataset)
        os.makedirs(path, exist_ok=True)
        for cls in classes:
            os.makedirs(os.path.join(path, cls), exist_ok=True)


def split_dataset(dst_dir, src_dir, classes, val_size=270, test_size=270):
    # 270 val + 270 test images per class (x10 classes) matches the 21,600/2,700/2,700 split.
    for cls in classes:
        class_path = os.path.join(src_dir, cls)
        images = os.listdir(class_path)
        random.shuffle(images)  # unseeded, so the split differs between runs
        val_images = images[:val_size]
        test_images = images[val_size:val_size + test_size]
        train_images = images[val_size + test_size:]
        for img in train_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(dst_dir, 'train', cls, img))
        for img in val_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(dst_dir, 'val', cls, img))
        for img in test_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(dst_dir, 'test', cls, img))


classes = [d for d in os.listdir(src_dir) if os.path.isdir(os.path.join(src_dir, d))]
create_directory_structure(dst_dir, classes)
split_dataset(dst_dir, src_dir, classes)
### PROCESS DTD DATASET
downloaded_data_path = f"{base_dir}/dtd/images"
output_path = f"{base_dir}/dtd"


def process_dataset(txt_file, downloaded_data_path, output_folder):
    with open(txt_file, 'r') as file:
        lines = file.readlines()
    for i, line in enumerate(lines):
        input_path = line.strip()
        # e.g. "banded/banded_0002.jpg" -> class folder "banded"
        final_folder_name = input_path.split('/')[0]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)
        if not os.path.exists(output_class_folder):
            os.makedirs(output_class_folder)
        full_input_path = os.path.join(downloaded_data_path, input_path)
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")


process_dataset(
    f'{base_dir}/dtd/labels/train.txt', downloaded_data_path, os.path.join(output_path, "train")
)
process_dataset(
    f'{base_dir}/dtd/labels/test.txt', downloaded_data_path, os.path.join(output_path, "val")
)
```
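For completeness, this is how I'd run the two files end to end (after editing the `<your base dir>` placeholders and the `base_dir` variable, and doing the manual resisc45 download first):

```bash
bash download.sh         # downloads and extracts everything under <your base dir>
python split_dataset.py  # builds the per-class train/val folders
```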