Insights: NVIDIA/NeMo-Curator
Overview
- 10 Merged pull requests
- 0 Open pull requests
- 2 Closed issues
- 2 New issues
1 Release published by 1 person
- v0.4.1 published Oct 3, 2024
10 Pull requests merged by 6 people
- Add Curator and Spark example in docs (#261 merged Oct 7, 2024)
- Add profiling / time in different stages of fuzzy / exact / semantic dedup (#251 merged Oct 7, 2024)
- Pin to 24.8.x instead of 24.8 (#282 merged Oct 4, 2024)
- Lower nworkers for tinystories tutorial to avoid OOM (#280 merged Oct 3, 2024)
- Add spacy<3.8 pin to r0.4.1 (#279 merged Oct 3, 2024)
- Fix enabling spilling by enabling it on client process (#275 merged Oct 2, 2024)
- Fix the Image Curation Tutorial for 0.5.0 release (#277 merged Oct 2, 2024)
- Pin Rapids to 24.8 for 0.5.0 Release (#273 merged Oct 2, 2024)
- Pin to spacy<3.8 temporarily to unblock CI (#276 merged Oct 2, 2024)
- Fix 255 - Improve separate_by_metadata performance for jsonl files (#256 merged Oct 1, 2024)
2 Issues closed by 2 people
- Installation error: pip._vendor.resolvelib.resolvers.ResolutionTooDeep: 200000 (#278 closed Oct 3, 2024)
- [FEA] Improve separate_by_metadata performance when dealing with jsonl files (#255 closed Oct 1, 2024)
2 Issues opened by 2 people
- Semantic Dedup doesn't work with UCX (#283 opened Oct 8, 2024)
- NEMO Curator Not Extracting Thai Language Content (#281 opened Oct 4, 2024)
13 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- Add RPV2 Pre-training Data Curation Tutorial (#267 commented on Oct 7, 2024 • 26 new comments)
- Added example notebook for translation with ct2 model. (#262 commented on Oct 8, 2024 • 21 new comments)
- Add GPU CI/CD (#253 commented on Oct 7, 2024 • 11 new comments)
- Update deduplication docs (#258 commented on Oct 7, 2024 • 8 new comments)
- Add fineweb classifer documentation (#269 commented on Oct 7, 2024 • 7 new comments)
- Expand RMM options for Python API (#266 commented on Oct 4, 2024 • 5 new comments)
- Add support for parallel data curation (#193 commented on Oct 2, 2024 • 1 new comment)
- GitHub workflows improvements (#259 commented on Oct 7, 2024 • 0 new comments)
- Adding an example for executing NeMo modules using kubernetes Python … (#148 commented on Oct 7, 2024 • 0 new comments)
- Add Multiple Model Quality Classification example (#173 commented on Oct 7, 2024 • 0 new comments)
- Add image documentation (#238 commented on Oct 7, 2024 • 0 new comments)
- Added a translation pipeline for ctranslate2 inference (#245 commented on Oct 8, 2024 • 0 new comments)
- Fix the Image Curation Tutorial on main (#271 commented on Oct 7, 2024 • 0 new comments)