cleaned dataset #4

ha1990-12 · 2019-11-14T08:49:25Z

Could you please share the cleaned dataset?

EB-Dodo · 2019-11-22T10:34:03Z

The cleaned dataset is "clean_list.7z" with clean image IDs in it.
You may want to read the "How to use C-MS-Celeb" section for details.

AGenchev · 2020-12-31T11:47:25Z

The "How to use" section could be extended with some practical info:
First, download the academic torrent by the Hyper.AI Datasets Team, because the original dataset was removed by MS. It contains 2 data files:

FaceImageCroppedWithAlignment.tsv
FaceImageCroppedWithOutAlignment.tsv
TSV means tab-separated values. A line in these files looks like:
m.0107_f 0 http:https://getbeatmadrid.files.wordpress.com/2013/01/magic-alex.jpg http:https://getbeatmadrid.wordpress.com/2013/01/28/magic-alex/ FaceId-0 KsQsP3Pumj2B6UE/Vj4/Pg== base64_jpegdata
The columns inside are as follows:
m_id, image_search_rank, image_url, page_url, face_id, face_rectangle, face_data
For many rows the image_url and page_url will be useless, since many of the pages have been removed or are dead.
I found a description of the columns here: https://frchallenge.github.io/download/aligned cached below for reference:

File format: text files, each line is an image record containing 7 columns, delimited by TAB.
Column1: Freebase MID
Column2: ImageSearchRank
Column3: ImageURL
Column4: PageURL
Column5: FaceID
Column6: FaceRectangle_Base64Encoded (four floats, relative coordinates of UpperLeft and BottomRight corner)
Column7: FaceData_Base64Encoded
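
To make the column layout concrete, here is a minimal sketch of parsing one record. The names `FaceRecord`, `decode_face_rectangle` and `parse_record` are mine, not from the dataset. Two things are assumptions: that the four floats are little-endian float32 (the sample rectangle above decodes to values between 0 and 1 that way), and that they are ordered left, top, right, bottom.

```python
import base64
import struct
from typing import NamedTuple, Tuple


class FaceRecord(NamedTuple):
    """One line of the TSV file (hypothetical container, not part of the dataset)."""
    m_id: str
    image_search_rank: str
    image_url: str
    page_url: str
    face_id: str
    face_rectangle: Tuple[float, float, float, float]
    face_data: bytes  # decoded image bytes (JPEG, according to the post)


def decode_face_rectangle(b64_text: str) -> Tuple[float, float, float, float]:
    """Decode column 6 into four relative coordinates.

    Assumption: 16 bytes holding four little-endian float32 values,
    ordered (left, top, right, bottom).
    """
    raw = base64.b64decode(b64_text)
    return struct.unpack("<4f", raw)


def parse_record(line: str) -> FaceRecord:
    """Split one TAB-delimited line into its 7 columns and decode columns 6 and 7."""
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 7:
        raise ValueError(f"expected 7 columns, got {len(cols)}")
    m_id, rank, image_url, page_url, face_id, rect_b64, data_b64 = cols
    return FaceRecord(
        m_id=m_id,
        image_search_rank=rank,
        image_url=image_url,
        page_url=page_url,
        face_id=face_id,
        face_rectangle=decode_face_rectangle(rect_b64),
        face_data=base64.b64decode(data_b64),
    )
```

The rank is kept as a string on purpose, since it is only used for building file names later.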

The initial dataset is very noisy, and I don't recommend training person recognition on it: if you look at person "m.0107_f", you'll see images of both males and females, most of them not belonging to the same person.
The face images are not high quality (I checked FaceImageCroppedWithAlignment.tsv).
You need to extract the data to perform the filtering.
Extraction script: https://www.programmersought.com/article/53293636195/
The extracted data has a folder for each person ID, named with the index value ("m.0107_f", for example).
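
In case the linked script goes away, this is a rough sketch of what the extraction does. It is not the linked script itself. The function name `extract_faces` is mine, and the file naming `<rank>-<face_id>.jpg` is an assumption inferred from paths in the clean list such as "m.0107_f/100-FaceId-0.jpg".

```python
import base64
from pathlib import Path
from typing import Optional


def extract_faces(tsv_path: str, out_dir: str, limit: Optional[int] = None) -> int:
    """Write every face image in the TSV to <out_dir>/<m_id>/<rank>-<face_id>.jpg.

    Returns the number of images written. `limit` stops early, which is
    handy for a trial run on a very large file.
    """
    out_root = Path(out_dir)
    written = 0
    # Read line by line: the files are far too large to load at once.
    with open(tsv_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            cols = line.rstrip("\r\n").split("\t")
            if len(cols) != 7:
                continue  # skip malformed rows instead of aborting
            m_id, rank, _image_url, _page_url, face_id, _rect, face_data = cols
            person_dir = out_root / m_id
            person_dir.mkdir(parents=True, exist_ok=True)
            # Assumed naming scheme, inferred from the clean list paths.
            target = person_dir / f"{rank}-{face_id}.jpg"
            target.write_bytes(base64.b64decode(face_data))
            written += 1
            if limit is not None and written >= limit:
                break
    return written
```

A trial call could look like `extract_faces("FaceImageCroppedWithAlignment.tsv", "extracted", limit=1000)`; compare a few resulting paths with the clean list before running it on the whole file.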
Clean list (from stage 2) has 4,924,737 rows. Relabel list contains 1,539,279 rows.
Combined, the lists have 6,464,016 rows which cover the C-MS-Celeb dataset.
The 2 lists can be combined by concatenation, because they have the same columns.
The clean list has 2 columns, space-separated:

m.0107_f m.0107_f/100-FaceId-0.jpg
m.0107_f m.0107_f/102-FaceId-0.jpg

I guess the first column is the selected person_id and the second is a photo of this person.
In the clean list the sex is consistent and the images likely show the same person, so noise is reduced.
There are also omission errors: for example, m.0107_f/116-Faceid-0.jpg is missing.
But at least the false images are mostly removed, so this is our new person index to use. Next, we want to merge the relabel list:
the columns have the same meaning, except that the relabel list is a cross-folder index.
We merge and sort the index, and we are ready.
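
The merge can be sketched like this, assuming both lists are plain text files with two whitespace-separated columns as shown above. The function names and the file names in the usage note are placeholders of mine.

```python
from typing import List, Tuple

IndexRow = Tuple[str, str]  # (person_id, relative image path)


def load_index(path: str) -> List[IndexRow]:
    """Read a two-column, space-separated list into (person_id, image_path) rows."""
    rows: List[IndexRow] = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            person_id, image_path = line.split(maxsplit=1)
            rows.append((person_id, image_path))
    return rows


def merge_indexes(clean_path: str, relabel_path: str) -> List[IndexRow]:
    """Concatenate the clean list and the relabel list, then sort by person_id and path."""
    return sorted(load_index(clean_path) + load_index(relabel_path))
```

Usage would be something like `index = merge_indexes("clean_list.txt", "relabel_list.txt")`. The rows are concatenated without deduplication, so the length should equal the sum of the two row counts given above.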
Next step: the dataset is still noisy, so you might want to run a (well-trained) gender detector to remove pictures whose gender does not match the rest of the person's images.
Next step: the dataset is not very diverse. There are repeated "same" images taken from one and the same photo, so it can be further reduced to contain only unique pictures of each person.
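
As a starting point for the de-duplication step, here is a sketch that only removes byte-identical files per person by comparing SHA-256 digests. It will not catch re-encoded or differently cropped copies of the same photo; that would need some perceptual similarity measure, which I have not tried. The function name and arguments are mine.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def drop_exact_duplicates(
    index: Iterable[Tuple[str, str]], image_root: str
) -> List[Tuple[str, str]]:
    """Keep the first image of each (person_id, file content) pair.

    Rows whose file does not exist under image_root are dropped as well,
    since the lists contain some paths that were never extracted.
    """
    root = Path(image_root)
    seen: Set[Tuple[str, str]] = set()
    kept: List[Tuple[str, str]] = []
    for person_id, image_path in index:
        image_file = root / image_path
        if not image_file.is_file():
            continue
        digest = hashlib.sha256(image_file.read_bytes()).hexdigest()
        key = (person_id, digest)
        if key in seen:
            continue  # same bytes already kept for this person
        seen.add(key)
        kept.append((person_id, image_path))
    return kept
```

Duplicates are tracked per person, so the same file appearing under two different person IDs is kept for both.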
