Commit

Update README to reflect version 1.0 finalization
carlini committed Mar 11, 2022
1 parent 8c54c64 commit 8a172b0
Showing 1 changed file with 0 additions and 2 deletions.
2 changes: 0 additions & 2 deletions README.md
@@ -1,7 +1,5 @@
# Deduplicating Training Data Makes Language Models Better

WARNING: This is a development branch. I am rewriting the code to be cleaner. Continue at your own risk.

This repository contains code to deduplicate language model datasets as described in the paper ["Deduplicating Training Data Makes Language Models Better"](https://arxiv.org/abs/2107.06499) by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch and Nicholas Carlini.
We release the ExactSubstr deduplication implementation (written in Rust) along with the scripts we used in the paper to perform ExactSubstr deduplication and inspect the results (written in Python).
We also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-4B-en.
