Ubuntu IRC broken encoding, impacting generative models downstream #102

Open
briansemrau opened this issue Jan 19, 2023 · 6 comments

Comments

@briansemrau

The Ubuntu IRC dataset appears to contain broken character encoding, which noticeably impacts generated output from models trained on The Pile in certain situations.

For example, from https://irclogs.ubuntu.com/2020/08/23/%23ubuntu.txt
This file contains a garbled (mojibake) rendering of ¯\_(ツ)_/¯, which is how the string should display if it were properly encoded.

I can't currently inspect the data directly in The Pile, because the-eye.eu and eaidata.bmk.sh are both inaccessible right now.
However, I have seen lots of garbled output from GPT-J that looks remarkably similar to this broken encoding, e.g. ¯_(�)_/¯

It looks like this dataset could be cleaned using the ftfy Python library: https://ftfy.readthedocs.io/en/latest/
In my very brief testing, this appears to fix the broken encoding from the file linked above.
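To make the failure mode concrete, here is a minimal standard-library sketch of the simplest kind of mojibake: UTF-8 bytes that were mis-read as Latin-1, and the round trip that reverses it. The function names `make_mojibake` and `repair_mojibake` are made up for illustration, and the choice of Latin-1 as the wrong codec is an assumption, not something confirmed for the Ubuntu IRC logs. ftfy's `fix_text` covers this case and many others, with heuristics to avoid changing text that was already correct, so it remains the better tool for real cleanup.

```python
def make_mojibake(text: str) -> str:
    """Simulate the failure: encode as UTF-8, then mis-decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(text: str) -> str:
    """Reverse the mis-decoding; return the input unchanged if it does not round-trip.

    Assumption: the text was UTF-8 wrongly decoded as Latin-1. This naive
    version has no heuristics, so it can also alter text that merely looks
    like mojibake.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


shrug = "¯\\_(ツ)_/¯"
broken = make_mojibake(shrug)    # garbled form, longer than the original
fixed = repair_mojibake(broken)  # back to the original string
```

Text that cannot be explained as mis-decoded UTF-8 (plain ASCII, or a string that already contains ツ) falls through the `except` branch and is returned as-is.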

@Mistobaan

Mistobaan commented Jan 19, 2023

Could we download them again without errors, or are they gone?
My guess is that this is a UTF-8-to-ASCII error. Maybe the server is messing with the encoding?
[image]
Try requesting UTF-8 when doing the GET request.

@briansemrau
Author

briansemrau commented Jan 19, 2023

I don't believe you can specify character encoding in HTTP requests. I'll try to contact the author of the bot that scrapes for irclogs.ubuntu.com to get some insight, or report a bug (no way the data has been encoded wrong for over a decade, right?...)
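Whatever the server declares, the client can still choose how to decode the bytes it receives. The sketch below downloads the raw body and decodes it explicitly as UTF-8. The helper names `fetch_bytes` and `decode_utf8` and the User-Agent string are hypothetical, and treating the log as UTF-8 is an assumption that may not hold for every line of the Ubuntu IRC logs.

```python
from urllib.request import Request, urlopen


def fetch_bytes(url: str) -> bytes:
    """Download the response body without decoding it."""
    req = Request(url, headers={"User-Agent": "encoding-check-sketch"})
    with urlopen(req) as resp:
        return resp.read()


def decode_utf8(raw: bytes) -> str:
    """Decode as UTF-8; invalid bytes become U+FFFD instead of raising."""
    return raw.decode("utf-8", errors="replace")


# Usage (performs a network request, so it is left commented out):
# text = decode_utf8(fetch_bytes("https://irclogs.ubuntu.com/2020/08/23/%23ubuntu.txt"))
```

With `errors="replace"`, any byte sequence that is not valid UTF-8 turns into the replacement character �, which makes leftover encoding problems easy to search for after decoding.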

@briansemrau
Author

@keunwoochoi

@briansemrau do you know if Hugging Face would decode this properly? I'm not sure where to look in https://github.com/huggingface/datasets/tree/main/src/datasets/utils

@briansemrau
Author

briansemrau commented Apr 10, 2023

> do you know if huggingface would decode this properly?

I would not expect it to. This dataset has strange encoding to work around a specific technical problem with IRC compatibility.
You should use the code from the links I posted above to make sure the data is being properly decoded.
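As a generic illustration only, and not the code referred to in the comment above: IRC does not declare a character encoding, so a log can plausibly mix encodings from line to line. A common approach for such data is to decode each line as UTF-8 and fall back to a single-byte codec when that fails. The names `decode_line` and `decode_log`, and the choice of Latin-1 as the fallback, are assumptions made for this sketch.

```python
def decode_line(raw: bytes) -> str:
    """Decode one log line: UTF-8 first, Latin-1 as an assumed fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises.
        return raw.decode("latin-1")


def decode_log(raw: bytes) -> str:
    """Decode a whole log line by line so one bad line cannot garble the rest."""
    return "\n".join(decode_line(line) for line in raw.split(b"\n"))


# Two lines with the same word, one in UTF-8 and one in Latin-1.
mixed = "caf\u00e9".encode("utf-8") + b"\n" + "caf\u00e9".encode("latin-1")
decoded = decode_log(mixed)
```

Decoding per line rather than per file keeps valid UTF-8 lines intact even when a neighbouring line uses a different encoding.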

@keunwoochoi

i see. thank you very much!
