CelebA Dataset

Our task is to determine whether the celebrity in the image is smiling. The target can be changed to any of the binary attributes provided by the original CelebA project by modifying the TARGET_NAME constant in preprocess/metadata_to_json. Our pipeline ignores all celebrities with fewer than 5 images.

Setup Instructions

pip3 install numpy
pip3 install pillow
From http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, download or request the metadata files identity_CelebA.txt and list_attr_celeba.txt, and place them inside the data/raw folder.
Download the celebrity faces dataset from the same site. Place the images in a folder named img_align_celeba in the same folder as above.
Run ./preprocess.sh with a choice of the following tags:
- -s := 'iid' to sample in an i.i.d. manner, or 'niid' to sample in a non-i.i.d. manner; more information on i.i.d. versus non-i.i.d. is included in the 'Notes' section
- --iu := number of users, if iid sampling; expressed as a fraction of the total number of users; default is 0.01
- --sf := fraction of data to sample, written as a decimal; default is 0.1
- -k := minimum number of samples per user
- -t := 'user' to partition users into train-test groups, or 'sample' to partition each user's samples into train-test groups
- --tf := fraction of data in training set, written as a decimal; default is 0.9
- --smplseed := seed to be used before random sampling of data
- --spltseed := seed to be used before random split of data

e.g.

./preprocess.sh -s niid --sf 1.0 -k 5 -t sample (full-sized dataset)
./preprocess.sh -s niid --sf 0.05 -k 5 -t sample (small-sized dataset)

Make sure to delete the rem_user_data, sampled_data, test, and train subfolders in the data directory before re-running preprocess.sh
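
The cleanup step above can be scripted. The following is a minimal sketch, not part of this repository: the function name clean_generated is hypothetical, and it assumes the data directory sits at the path you pass in (data, relative to the current directory, by default).

```python
import shutil
from pathlib import Path

# Subfolder names taken from this README; all_data is deliberately left alone.
STALE_SUBFOLDERS = ("rem_user_data", "sampled_data", "test", "train")


def clean_generated(data_dir="data"):
    """Delete previously generated subfolders and return the names removed.

    `data_dir` is an assumed location; adjust it to where your data lives.
    """
    removed = []
    for name in STALE_SUBFOLDERS:
        path = Path(data_dir) / name
        if path.is_dir():
            shutil.rmtree(path)  # removes the folder and everything inside it
            removed.append(name)
    return removed
```

Calling clean_generated() from the celeba folder before re-running preprocess.sh removes only the four listed subfolders, so data/raw and data/all_data are kept.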

Notes

More details on i.i.d. versus non-i.i.d.:
- In the i.i.d. sampling scenario, each datapoint is equally likely to be sampled. Thus, all users have the same underlying distribution of data.
- In the non-i.i.d. sampling scenario, the underlying distribution of data for each user is consistent with the raw data. Since we assume that data distributions vary between users in the raw data, we refer to this sampling process as non-i.i.d.
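
To make the distinction concrete, here is a toy sketch of the two sampling modes. It illustrates the idea only and is not the code that preprocess.sh runs: the function names are hypothetical, user data is simplified to a dict mapping each user to a list of samples, and num_users is an absolute count rather than the fraction that --iu expects.

```python
import random


def sample_iid(user_data, fraction, num_users, seed=0):
    """Pool every sample, draw a uniform subset, and deal it out to pseudo-users."""
    rng = random.Random(seed)
    pool = [x for samples in user_data.values() for x in samples]
    chosen = rng.sample(pool, int(len(pool) * fraction))
    # Round-robin assignment: every pseudo-user draws from the same pooled data.
    return {f"iid_user_{i}": chosen[i::num_users] for i in range(num_users)}


def sample_niid(user_data, fraction, seed=0):
    """Pick whole users at random until about `fraction` of all samples is covered."""
    rng = random.Random(seed)
    total = sum(len(samples) for samples in user_data.values())
    target = int(total * fraction)
    users = list(user_data)
    rng.shuffle(users)
    sampled, count = {}, 0
    for user in users:
        if count >= target:
            break
        sampled[user] = list(user_data[user])  # a user keeps only their own raw samples
        count += len(user_data[user])
    return sampled
```

In the first function the original user boundaries disappear, so all pseudo-users share one distribution. In the second, each sampled user retains exactly the data they had in the raw set, which preserves any per-user differences.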
More details on preprocess.sh:
- preprocess.sh processes data in this order: (1) generating all_data, (2) sampling, (3) removing users, and (4) creating the train-test split. The script looks at the data in the last generated directory and continues preprocessing from that point. For example, if the all_data directory has already been generated and the user decides to skip sampling and only remove users with the -k tag (i.e. running preprocess.sh -k 50), the script will effectively apply a remove-user filter to the data in all_data and place the resulting data in the rem_user_data directory.
- File names provide information about the preprocessing steps taken to generate them. For example, the all_data_niid_1_keep_64.json file was generated by first sampling 10 percent (.1) of the data in all_data.json in a non-i.i.d. manner and then applying the -k 64 argument to the resulting data.
Each .json file is an object with 3 keys:
1. 'users', a list of users
2. 'num_samples', a list of the number of samples for each user, and
3. 'user_data', an object with user names as keys and their respective data as values.
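
As a quick illustration of this layout, the sketch below checks the three keys on a small hand-written object and pairs each user with their sample count. The helper name summarize is hypothetical, and the shape of each user's entry inside 'user_data' (here a dict with 'x' and 'y' lists) is an assumption, since this README does not specify it.

```python
import json


def summarize(dataset):
    """Check the three top-level keys and return {user: num_samples}."""
    missing = {"users", "num_samples", "user_data"} - set(dataset)
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    if len(dataset["users"]) != len(dataset["num_samples"]):
        raise ValueError("'users' and 'num_samples' must have the same length")
    unknown = [u for u in dataset["users"] if u not in dataset["user_data"]]
    if unknown:
        raise ValueError(f"users without data: {unknown}")
    return dict(zip(dataset["users"], dataset["num_samples"]))


# Hand-written example; the user names, file names, and 'x'/'y' layout are made up.
example = json.loads("""
{
  "users": ["u1", "u2"],
  "num_samples": [2, 1],
  "user_data": {
    "u1": {"x": ["a.jpg", "b.jpg"], "y": [1, 0]},
    "u2": {"x": ["c.jpg"], "y": [1]}
  }
}
""")

counts = summarize(example)
```

For a generated file, replace the inline string with json.load on the opened file and call summarize on the result.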
Run ./stats.sh to get statistics on the data (data/all_data/all_data.json must have been generated already).
To run the reference implementations in the ../models directory, the -t sample tag must be used when running ./preprocess.sh.
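
The difference between the two -t modes can be sketched as follows. This is an illustrative toy, not the repository's split code: the function names are hypothetical, and user data is simplified to a dict mapping each user to a list of samples.

```python
import random


def split_by_sample(user_data, train_frac=0.9, seed=0):
    """'sample' mode: every user appears in both sets; their samples are divided."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, samples in user_data.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train[user], test[user] = shuffled[:cut], shuffled[cut:]
    return train, test


def split_by_user(user_data, train_frac=0.9, seed=0):
    """'user' mode: each user lands entirely in train or entirely in test."""
    rng = random.Random(seed)
    users = list(user_data)
    rng.shuffle(users)
    cut = int(len(users) * train_frac)
    train = {u: list(user_data[u]) for u in users[:cut]}
    test = {u: list(user_data[u]) for u in users[cut:]}
    return train, test
```

With 'sample', the train and test sets share the same users, while with 'user' the two sets of users are disjoint.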
