add starcoder2 dataset

mlabonne · Apr 30, 2024 · 380b180
1 parent c8018b9
Showing 1 changed file with 3 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -77,6 +77,7 @@ Code is another challenging domain for LLMs that lack specialized pre-training.
 | [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) | 78.6k | b-mc2 | Apr 2023 | Cleansed and augmented version of the [WikiSQL](https://huggingface.co/datasets/wikisql) and [Spider](https://huggingface.co/datasets/spider) datasets. |
 | [Magicoder-OSS-Instruct-75K](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K) | 75k | Wei et al. | Nov 2023 | OSS-Instruct dataset generated by `gpt-3.5-turbo-1106`. See [Magicoder paper](https://arxiv.org/abs/2312.02120). |
 | [Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback) | 66.4k | Zheng et al. | Feb 2024 | Diverse Code Interpreter-like dataset with multi-turn dialogues and interleaved text and code responses. See [OpenCodeInterpreter paper](https://arxiv.org/abs/2402.14658). |
+| [self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) | 50.7k | Lozhkov et al. | Apr 2024 | Created in three steps with seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. See the [blog post](https://huggingface.co/blog/sc2-instruct). |
 
 ### Conversation & Role-Play
 
@@ -142,6 +143,8 @@ Special thanks to [geronimi73](https://github.com/geronimi73) for the PR.
 
 ## References
 
+Please let me know if a dataset is not properly credited.
+
 - Wei-Lin Chiang et al, "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality," 2023.
 - Yihan Cao et al, "Instruction Mining: When Data Mining Meets Large Language Model Finetuning," 2023.
 - Subhabrata Mukherjee et al, "Orca: Progressive Learning from Complex Explanation Traces of GPT-4," 2023.