Split definition of DTD, EuroSAT and SUN397 #1
Hi, awesome work!

I'm trying to reproduce your results but I cannot find the split definitions you use for DTD, EuroSAT and SUN397. Would you mind pointing me to the right resources to download the versions of these datasets compatible with your code?

Thanks a lot!
Hi @gortizji. Thanks for the interest in our work and for the kind words!
Please note that in this codebase we use the suffix "Val" to indicate that we want to use the validation sets instead of the test sets (e.g. evaluating on …). Hope this helps, and let me know if you have any other questions!
Thanks @gabrielilharco for your quick answer. Some follow-up questions: …

Thanks!
For DTD and SUN397, yes, we use the first split (train1.txt + val1.txt for training and test1.txt for testing for DTD, and Training_01.txt / Testing_01.txt for SUN397, as in https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip). For EuroSAT, it indeed has 27,000 images in total, and we use a 21,600/2,700/2,700 train/val/test split (I also updated the previous message).
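If you want to double-check a rebuilt EuroSAT copy against those numbers, a quick count works (a sketch; the `EuroSAT_splits` path and one-folder-per-class layout are assumptions based on the scripts later in this thread):

```python
import os

def count_images(split_dir):
    # Walk the per-class subfolders of one split directory and count the files.
    return sum(len(files) for _, _, files in os.walk(split_dir))

# "EuroSAT_splits" is a hypothetical output path; adjust to your own.
for split, expected in [("train", 21600), ("val", 2700), ("test", 2700)]:
    print(split, count_images(os.path.join("EuroSAT_splits", split)), "expected:", expected)
```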
That makes sense 😄. Thanks a lot @gabrielilharco!
Hi again,

Could you comment on what the expected folder structure for … is?

Thanks in advance 😄
Hi @gortizji,

We expect the data to be stored without nested folders; it should look like this:
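For SUN397, for instance, the per-class folders sit directly under each split directory (a sketch consistent with the processing scripts later in this thread; filenames are placeholders):

```
sun397/
├── train/
│   ├── a_abbey/
│   │   ├── sun_xxx.jpg
│   │   └── ...
│   └── ...
└── val/
    ├── a_abbey/
    └── ...
```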
Perfect! Thanks a lot.
Hi @gabrielilharco and @gortizji,

I am facing a similar issue. I have downloaded the datasets from the provided links, but I am not sure how to structure the downloaded files so that they can be loaded correctly. Do either of you have a script that can be used to correctly structure these datasets? Thank you in advance!
I kind of figured this out myself, but for anyone else like me, here are the scripts I used. There are four datasets that require manual downloading.

```python
## PROCESS SUN397 DATASET
import os
import shutil


def process_dataset(txt_file, downloaded_data_path, output_folder):
    with open(txt_file, 'r') as file:
        lines = file.readlines()
    for i, line in enumerate(lines):
        input_path = line.strip()
        # e.g. "/a/abbey/sun_xxx.jpg" -> class folder "a_abbey"
        final_folder_name = "_".join(x for x in input_path.split('/')[:-1])[1:]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)
        if not os.path.exists(output_class_folder):
            os.makedirs(output_class_folder)
        full_input_path = os.path.join(downloaded_data_path, input_path[1:])
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")


downloaded_data_path = "path/to/downloaded/SUN/data"
process_dataset('Training_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "train"))
process_dataset('Testing_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "val"))
```
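To illustrate the class-folder mapping in `process_dataset` above: the split files from Partitions.zip list paths like `/a/abbey/sun_xxx.jpg` (the leading letter is SUN397's alphabetical bucketing), and the script flattens everything but the filename into one folder name. A quick check (the sample path is illustrative, the filename made up):

```python
input_path = "/a/abbey/sun_xxx.jpg"  # example line from Training_01.txt
final_folder_name = "_".join(input_path.split('/')[:-1])[1:]
print(final_folder_name)  # a_abbey
```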
```python
### PROCESS EuroSAT_RGB DATASET
import os
import shutil
import random


def create_directory_structure(base_dir, classes):
    for dataset in ['train', 'val', 'test']:
        path = os.path.join(base_dir, dataset)
        os.makedirs(path, exist_ok=True)
        for cls in classes:
            os.makedirs(os.path.join(path, cls), exist_ok=True)


def split_dataset(base_dir, source_dir, classes, val_size=270, test_size=270):
    # 270 val + 270 test images per class (x10 classes) matches the 21,600/2,700/2,700 split.
    for cls in classes:
        class_path = os.path.join(source_dir, cls)
        images = os.listdir(class_path)
        random.shuffle(images)  # unseeded, so the split differs between runs
        val_images = images[:val_size]
        test_images = images[val_size:val_size + test_size]
        train_images = images[val_size + test_size:]
        for img in train_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(base_dir, 'train', cls, img))
        for img in val_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(base_dir, 'val', cls, img))
        for img in test_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(base_dir, 'test', cls, img))


source_dir = '/nas-hdd/prateek/data/EuroSAT_RGB'     # replace with the path to your dataset
base_dir = '/nas-hdd/prateek/data/EuroSAT_Splitted'  # replace with the path to the output directory
classes = [d for d in os.listdir(source_dir) if os.path.isdir(os.path.join(source_dir, d))]
create_directory_structure(base_dir, classes)
split_dataset(base_dir, source_dir, classes)
```

Cheers,
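One caveat: since `random.shuffle` is unseeded, every run produces a different EuroSAT split. If you want the split to be reproducible, a minimal tweak is to pin the ordering before shuffling (a sketch; the helper name and seed value are my own choices):

```python
import os
import random

def stable_listing(class_path, seed=0):
    # Sort first so the result doesn't depend on filesystem listing order,
    # then shuffle with a fixed seed (0 is an arbitrary choice).
    images = sorted(os.listdir(class_path))
    random.Random(seed).shuffle(images)
    return images
```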
Thanks a lot @prateeky2806!
Hi @gabrielilharco and @prateeky2806,

I might have missed something, but why doesn't …?

Is there something I missed?

Thanks,
It seems that creating a folder for each class is necessary. Either way, I'll attach the scripts I used to set up resisc45:

```bash
mkdir resisc45 && cd resisc45

# Download the dataset and splits
FILE=NWPU-RESISC45.rar
ID=1DnPSU5nVSN7xv95bpZ3XQ0JhKXZOKgIv
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&id=$ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p')&id=$ID" -O $FILE && rm -rf /tmp/cookies.txt
unrar x $FILE
wget -O resisc45-train.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-train.txt"
wget -O resisc45-val.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-val.txt"
wget -O resisc45-test.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-test.txt"
```
```python
# Partition the dataset into different classes
import os
import shutil


def create_directory_structure(data_root, split):
    split_file = f'resisc45-{split}.txt'
    with open(os.path.join(data_root, split_file), 'r') as f:
        lines = f.readlines()
    for l in lines:
        l = l.strip()
        # e.g. "airplane_001.jpg" -> class "airplane"
        class_name = '_'.join(l.split('_')[:-1])
        class_dir = os.path.join(data_root, 'NWPU-RESISC45', class_name)
        if not os.path.exists(class_dir):
            os.mkdir(class_dir)
        shutil.move(os.path.join(data_root, 'NWPU-RESISC45', l),
                    os.path.join(class_dir, l))


data_root = '/home/frederic/data/resisc45'  # replace with your path
for split in ['train', 'val', 'test']:
    create_directory_structure(data_root, split)
```

Cheers,
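After this restructuring, all 45 class folders sit under `NWPU-RESISC45`, so the images can be read with a standard ImageFolder-style loader (a sketch; the path is the one from my script above, and train/val/test membership still comes from the txt files):

```python
from torchvision import datasets

# Loads every class folder created above; the splits remain defined by the txt files.
dataset = datasets.ImageFolder('/home/frederic/data/resisc45/NWPU-RESISC45')
print(len(dataset), len(dataset.classes))
```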
In case anyone wants fully automatic code for dataset preparation, I'll attach mine. Before running it, please manually download the resisc45 dataset (see the comment in `download.sh`).

`download.sh`:

```bash
sudo apt -y install kaggle
mkdir <your base dir>
cd <your base dir>
export KAGGLE_USERNAME=<your kaggle username>
export KAGGLE_KEY=<your kaggle key>
# stanford cars dataset (ref: https://github.com/pytorch/vision/issues/7545#issuecomment-1631441616)
mkdir stanford_cars && cd stanford_cars
kaggle datasets download -d jessicali9530/stanford-cars-dataset
kaggle datasets download -d abdelrahmant11/standford-cars-dataset-meta
unzip standford-cars-dataset-meta.zip
unzip stanford-cars-dataset.zip
tar -xvzf car_devkit.tgz
mv cars_test a
mv a/cars_test/ cars_test
rm -rf a
mv cars_train a
mv a/cars_train/ cars_train
rm -rf a
mv 'cars_test_annos_withlabels (1).mat' cars_test_annos_withlabels.mat
rm -rf 'cars_annos (2).mat' *.zip
cd ..
# ressic45
mkdir resisc45 && cd resisc45
# (manual download) https://onedrive.live.com/?authkey=%21AHHNaHIlzp%5FIXjs&id=5C5E061130630A68%21107&cid=5C5E061130630A68&parId=root&parQt=sharedby&o=OneUp
mv ~/NWPU-RESISC45.rar ./
sudo apt -y install unar
unar NWPU-RESISC45.rar
wget -O resisc45-train.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-train.txt"
wget -O resisc45-val.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-val.txt"
wget -O resisc45-test.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-test.txt"
rm -rf NWPU-RESISC45.rar
cd ..
# dtd
mkdir dtd && cd dtd
wget https://www.robots.ox.ac.uk/~vgg/data/dtd/download/dtd-r1.0.1.tar.gz
tar -xvzf dtd-r1.0.1.tar.gz
rm -rf dtd-r1.0.1.tar.gz
mv dtd/images images
mv dtd/imdb/ imdb
mv dtd/labels labels
cat labels/train1.txt labels/val1.txt > labels/train.txt
cat labels/test1.txt > labels/test.txt
cd ..
# euro_sat
mkdir euro_sat && cd euro_sat
wget --no-check-certificate https://madm.dfki.de/files/sentinel/EuroSAT.zip
unzip EuroSAT.zip
rm -rf EuroSAT.zip
cd ..
# sun397
mkdir sun397 && cd sun397
wget https://vision.princeton.edu/projects/2010/SUN/SUN397.tar.gz
wget https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip
unzip Partitions.zip
tar -xvzf SUN397.tar.gz
rm -rf SUN397.tar.gz
```

`split_dataset.py`:

```python
base_dir = '<your base dir>'
### PROCESS SUN397 DATASET
import os
import shutil

downloaded_data_path = f"{base_dir}/sun397"
output_path = f"{base_dir}/sun397"


def process_dataset(txt_file, downloaded_data_path, output_folder):
    with open(txt_file, 'r') as file:
        lines = file.readlines()
    for i, line in enumerate(lines):
        input_path = line.strip()
        # e.g. "/a/abbey/sun_xxx.jpg" -> class folder "a_abbey"
        final_folder_name = "_".join(x for x in input_path.split('/')[:-1])[1:]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)
        if not os.path.exists(output_class_folder):
            os.makedirs(output_class_folder)
        full_input_path = os.path.join(downloaded_data_path, input_path[1:])
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")


process_dataset(
    os.path.join(downloaded_data_path, 'Training_01.txt'),
    os.path.join(downloaded_data_path, 'SUN397'),
    os.path.join(output_path, "train"),
)
process_dataset(
    os.path.join(downloaded_data_path, 'Testing_01.txt'),
    os.path.join(downloaded_data_path, 'SUN397'),
    os.path.join(output_path, "val"),
)
### PROCESS EuroSAT_RGB DATASET
import random

src_dir = f'{base_dir}/euro_sat/2750'   # replace with the path to your dataset
dst_dir = f'{base_dir}/EuroSAT_splits'  # replace with the path to the output directory


def create_directory_structure(dst_dir, classes):
    for dataset in ['train', 'val', 'test']:
        path = os.path.join(dst_dir, dataset)
        os.makedirs(path, exist_ok=True)
        for cls in classes:
            os.makedirs(os.path.join(path, cls), exist_ok=True)


def split_dataset(dst_dir, src_dir, classes, val_size=270, test_size=270):
    # 270 val + 270 test images per class (x10 classes) matches the 21,600/2,700/2,700 split.
    for cls in classes:
        class_path = os.path.join(src_dir, cls)
        images = os.listdir(class_path)
        random.shuffle(images)  # unseeded, so the split differs between runs
        val_images = images[:val_size]
        test_images = images[val_size:val_size + test_size]
        train_images = images[val_size + test_size:]
        for img in train_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(dst_dir, 'train', cls, img))
        for img in val_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(dst_dir, 'val', cls, img))
        for img in test_images:
            shutil.copy(os.path.join(class_path, img), os.path.join(dst_dir, 'test', cls, img))


classes = [d for d in os.listdir(src_dir) if os.path.isdir(os.path.join(src_dir, d))]
create_directory_structure(dst_dir, classes)
split_dataset(dst_dir, src_dir, classes)
### PROCESS DTD DATASET
downloaded_data_path = f"{base_dir}/dtd/images"
output_path = f"{base_dir}/dtd"


def process_dataset(txt_file, downloaded_data_path, output_folder):
    with open(txt_file, 'r') as file:
        lines = file.readlines()
    for i, line in enumerate(lines):
        input_path = line.strip()
        # e.g. "banded/banded_0002.jpg" -> class folder "banded"
        final_folder_name = input_path.split('/')[0]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)
        if not os.path.exists(output_class_folder):
            os.makedirs(output_class_folder)
        full_input_path = os.path.join(downloaded_data_path, input_path)
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")


process_dataset(
    f'{base_dir}/dtd/labels/train.txt', downloaded_data_path, os.path.join(output_path, "train")
)
process_dataset(
    f'{base_dir}/dtd/labels/test.txt', downloaded_data_path, os.path.join(output_path, "val")
)
```
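For completeness, this is how I'd run the two files end to end (after editing the `<your base dir>` placeholders and the `base_dir` variable, and doing the manual resisc45 download first):

```bash
bash download.sh         # downloads and extracts everything under <your base dir>
python split_dataset.py  # builds the per-class train/val folders
```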