add dynamic dataset for processing/tokenizing examples lazily #46

Closed
wants to merge 7 commits into from

Commits on Jan 5, 2021

  1. add dynamic dataset for processing/tokenizing examples lazily

    - Adds pysimdjson as a requirement for fast, memory-efficient JSON parsing.
    - Adds a DynamicDataset class intended for jsonlines files, using TF's TextLineDataset (C++ I/O) as the line iterator.
    trisongz committed Jan 5, 2021
    e3aec3b
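
The lazy processing idea in this commit can be illustrated with a minimal sketch. This is not the PR's DynamicDataset: the standard-library `json` module stands in for pysimdjson, any iterable of lines (such as an open file) stands in for TextLineDataset, and the names `lazy_tokenized_examples`, `toy_tokenize`, and the `"text"` key are hypothetical.

```python
import json
from typing import Callable, Iterable, Iterator, List


def lazy_tokenized_examples(
    lines: Iterable[str],
    tokenize: Callable[[str], List[int]],
    text_key: str = "text",  # assumed field name, not taken from the PR
) -> Iterator[List[int]]:
    """Yield one tokenized example per non-empty JSON line, on demand."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # stdlib json as a stand-in for pysimdjson
        yield tokenize(record[text_key])


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in tokenizer: one id (the word length) per word.
    return [len(word) for word in text.split()]
```

Because the function is a generator, a line is parsed and tokenized only when the consumer asks for the next example, so a large jsonlines file never has to be tokenized up front. Passing an open file object as `lines` reads it one line at a time.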

Commits on Jan 11, 2021

  1. dd0a17c
  2. 80c65bf

Commits on Jan 12, 2021

  1. minor fixes after testing.

    trisongz committed Jan 12, 2021
    423d688
  2. Adding token cache implementation, removing implicit padding

    - Adds dynamic chunking through a token cache: overflow tokens from tokenized examples go into the cache, which is drained first on the next call.

    - Every example added to token_cache is followed by tokenizer.eos_token_id.

    - Removes implicit padding; a sequence is padded only if it still falls under max_length after filling from the token cache.
    trisongz committed Jan 12, 2021
    f3a4116
  3. add exception

    trisongz committed Jan 12, 2021
    abb5e46
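
The token cache described in the "Adding token cache implementation" commit can be sketched as follows. This is one reading of the commit message, not the PR's code: the function name `chunk_with_token_cache` and its parameters are hypothetical, and padding only the final leftover chunk is an assumption about what "falls under max_length after filling from token cache" means.

```python
from typing import Iterable, Iterator, List


def chunk_with_token_cache(
    examples: Iterable[List[int]],
    max_length: int,
    eos_token_id: int,
    pad_token_id: int,
) -> Iterator[List[int]]:
    """Pack tokenized examples into fixed-length chunks via a token cache."""
    token_cache: List[int] = []
    for tokens in examples:
        # Each example goes into the cache followed by the EOS token.
        token_cache.extend(tokens)
        token_cache.append(eos_token_id)
        # Emit full chunks; whatever overflows stays in the cache and
        # fills the start of the next chunk.
        while len(token_cache) >= max_length:
            yield token_cache[:max_length]
            token_cache = token_cache[max_length:]
    # Assumption: pad only when the leftover is still under max_length.
    if token_cache:
        yield token_cache + [pad_token_id] * (max_length - len(token_cache))
```

With `max_length=4`, `eos_token_id=0`, and `pad_token_id=9`, the examples `[1, 2, 3]`, `[4, 5]`, `[6]` pack into `[1, 2, 3, 0]`, `[4, 5, 0, 6]`, `[0, 9, 9, 9]`: the overflow from the second chunk opens the third, and padding appears only at the very end.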

Commits on Jan 13, 2021

  1. cc7cc2f