
Demos

This folder contains demos that let users try out the basic functions and tools of Data-Juicer.

Usage

Each demo lives in its own subdirectory of demos. To run a demo, change into its subdirectory and launch the app.py found there with Streamlit.

cd <subdir_of_demos>
streamlit run app.py
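
To see which subdirectories can be launched this way, the following minimal sketch lists every subdirectory that contains an app.py and builds the same command shown above. The helper names `find_demos` and `launch_command` are hypothetical and not part of Data-Juicer; the sketch only assumes the layout described in this README, with one app.py per demo subdirectory.

```python
from pathlib import Path


def find_demos(demos_root):
    """Return the sorted names of subdirectories that contain an app.py."""
    root = Path(demos_root)
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and (p / "app.py").is_file()
    )


def launch_command(demos_root, demo_name):
    """Return (working_dir, command) mirroring `cd <subdir>` + `streamlit run app.py`.

    Hypothetical helper: it only builds the command and does not start Streamlit.
    """
    demo_dir = Path(demos_root) / demo_name
    if not (demo_dir / "app.py").is_file():
        raise FileNotFoundError(f"no app.py found in {demo_dir}")
    return demo_dir, ["streamlit", "run", "app.py"]


if __name__ == "__main__":
    # Assumes the script is started from the demos folder itself.
    for name in find_demos("."):
        print(name)
```

The returned pair can be passed to `subprocess.run(command, cwd=working_dir)` to start a demo from a script. Folders without an app.py, such as one holding only sample data, are skipped.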

Available Demos

Data (data)
- This folder contains some sample datasets.
Overview scan (overview_scan)
- This demo introduces the basic concepts and functions of Data-Juicer, such as features, configuration, operators, and so on.
Data process loop (data_process_loop)
- This demo analyzes and processes a dataset, providing a comparison of statistical information before and after the processing.
Data visualization diversity (data_visualization_diversity)
- This demo analyzes the verb-noun structure of the CFT dataset and plots its diversity in sunburst format.
Data visualization op effect (data_visualization_op_effect)
- This demo analyzes the statistics of a dataset and displays the effect of each Filter op under different thresholds.
Data visualization statistics (data_visualization_statistics)
- This demo analyzes a dataset and obtains up to 13 statistics.
Process CFT Chinese data (process_cft_zh_data)
- This demo analyzes and processes part of the Chinese dataset in Alpaca-CoT to show how to process IFT or CFT data for LLM fine-tuning.
Process SCI data (process_sci_data)
- This demo analyzes and processes part of the arXiv dataset to show how to process scientific literature data for LLM pre-training.
Process code data (process_code_data)
- This demo analyzes and processes part of the Stack-Exchange dataset to show how to process code data for LLM pre-training.
Text quality classifier (tool_quality_classifier)
- This demo provides three text quality classifiers to score the dataset.
Dataset splitting by language (tool_dataset_splitting_by_language)
- This demo splits a dataset into different sub-datasets by language.
Data mixture (data_mixture)
- This demo selects and mixes samples from multiple datasets and exports them into a new dataset.
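
As an illustration of the idea behind the data mixture demo, here is a minimal sketch that samples a fraction of each input dataset, merges the samples, and exports them as JSON Lines. It is not the demo's or Data-Juicer's implementation: the function names, the per-dataset weight interpreted as a keep fraction, and the JSON Lines output format are all assumptions made for this example.

```python
import json
import random


def mix_datasets(datasets, weights, seed=0):
    """Sample a fraction of each dataset and merge the results.

    datasets: list of datasets, each a list of dict samples.
    weights:  one keep fraction in [0, 1] per dataset (assumed meaning).
    """
    if len(datasets) != len(weights):
        raise ValueError("need exactly one weight per dataset")
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    mixed = []
    for data, weight in zip(datasets, weights):
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weights must lie in [0, 1]")
        k = int(len(data) * weight)  # number of samples kept from this dataset
        mixed.extend(rng.sample(data, k))
    rng.shuffle(mixed)  # interleave samples from the different sources
    return mixed


def export_jsonl(samples, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    a = [{"text": f"a{i}", "source": "a"} for i in range(10)]
    b = [{"text": f"b{i}", "source": "b"} for i in range(4)]
    export_jsonl(mix_datasets([a, b], [0.5, 1.0]), "mixed.jsonl")
```

With these inputs the mixture keeps five samples from the first dataset and all four from the second. Sampling without replacement means no sample from a source appears twice in the output.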
