V2 Pipeline with PySpark #13

alvin319 · 2023-08-28T00:28:58Z

Summary

- Refactored the existing pipeline to use PySpark
- Swapped to a standard logging system that writes to both stdout and a log file in each run's directory
- Renamed key columns to follow a consistent theme, e.g., index -> sequence_id
- Reworked the filters to calculate sequence and token frequencies
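
For reviewers, here is a minimal sketch of the dual-destination logging described above, using only the Python standard library. The function name `setup_logger`, the logger name `pipeline`, and the file name `pipeline.log` are hypothetical choices for illustration and may not match what the PR actually uses.

```python
import logging
import sys
from pathlib import Path


def setup_logger(run_dir: str, name: str = "pipeline") -> logging.Logger:
    """Return a logger that writes to stdout and to <run_dir>/pipeline.log.

    All names here are illustrative assumptions, not the PR's actual API.
    """
    run_path = Path(run_dir)
    run_path.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records from also reaching the root logger

    # Drop handlers left over from an earlier call so lines are not duplicated.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()

    formatter = logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"
    )
    handlers = (
        logging.StreamHandler(sys.stdout),               # console output
        logging.FileHandler(run_path / "pipeline.log"),  # per-run log file
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling `setup_logger("runs/example")` once at the start of a run and then using `logger.info(...)` sends each record to both destinations. Attaching the handlers to a named logger rather than the root logger keeps third-party log output (for example from Spark's Python bindings) out of the run file unless it is opted in.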

Note
To showcase the result, I ran the pipeline with these two filters (sequence and token frequency) across all datasets; the results are available on Hugging Face (HF).

add files

0d87ce9

alvin319 requested review from Kyle1668 and uSaiPrashanth August 28, 2023 00:29

alvin319 added 3 commits August 31, 2023 23:10

test

ee8bba0

pyspark progress

305ca05

more progress

2c56369

alvin319 changed the title ~~Filter pipeline v2~~ V2 Pipeline with PySpark Sep 4, 2023

alvin319 and others added 3 commits September 4, 2023 02:47

add script to upload datasets

c4315fe

- Coalesce parquet into a single file

2cfa029

- Separate token frequency filter into its own function

Documentation and pre-detokenize tokens

51d70c5

uSaiPrashanth approved these changes Sep 5, 2023

update

45633fb

alvin319 merged commit d8698c2 into master Sep 6, 2023
1 check passed

alvin319 deleted the alvin/new-pipeline branch March 4, 2024 01:27
