Skip to content

Commit

Permalink
add starcoder2 dataset
Browse files Browse the repository at this point in the history
  • Loading branch information
mlabonne committed Apr 30, 2024
1 parent c8018b9 commit 380b180
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,7 @@ Code is another challenging domain for LLMs that lack specialized pre-training.
| [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) | 78.6k | b-mc2 | Apr 2023 | Cleansed and augmented version of the [WikiSQL](https://huggingface.co/datasets/wikisql) and [Spider](https://huggingface.co/datasets/spider) datasets. |
| [Magicoder-OSS-Instruct-75K](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K) | 75k | Wei et al. | Nov 2023 | OSS-Instruct dataset generated by `gpt-3.5-turbo-1106`. See [Magicoder paper](https://arxiv.org/abs/2312.02120). |
| [Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback) | 66.4k | Zheng et al. | Feb 2024 | Diverse Code Interpreter-like dataset with multi-turn dialogues and interleaved text and code responses. See [OpenCodeInterpreter paper](https://arxiv.org/abs/2402.14658). |
| [self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) | 50.7k | Lozhkov et al. | Apr 2024 | Created in three steps with seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. See the [blog post](https://huggingface.co/blog/sc2-instruct). |

### Conversation & Role-Play

Expand Down Expand Up @@ -142,6 +143,8 @@ Special thanks to [geronimi73](https://github.com/geronimi73) for the PR.

## References

Please let me know if a dataset is not properly credited.

- Wei-Lin Chiang et al, "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality," 2023.
- Yihan Cao et al, "Instruction Mining: When Data Mining Meets Large Language Model Finetuning," 2023.
- Subhabrata Mukherjee et al, "Orca: Progressive Learning from Complex Explanation Traces of GPT-4," 2023.
Expand Down

0 comments on commit 380b180

Please sign in to comment.