Raw extraction #200

Closed
wants to merge 11 commits into from

AlexTMallen
Collaborator

  • Extracts hidden states without applying templates or constructing contrast tuples
  • Can be used with eval by specifying the magic dataset name “raw” and passing --data_dir
    • No few-shot examples, label balancing on by default (though balancing is now optional for everything), no streaming (enforced in PromptConfig's __post_init__)
  • Adds support for inference without contrast tuples in Reporter
    • Renames score to score_contrast_tuple
    • I'm not sure whether I should instead make them a single function that behaves differently depending on the shape of the input
  • The dataset provided via --data_dir must contain a string column “text” and a binary column “label”, and it must not have any splits
  • In this mode, the total log-probability that the LM assigns to the text is also computed
    • That way you can perform ~whatever analyses you want by defining the input dataset and reading the output CSV
    • I prepend tokenizer.bos_token to the input so that I can compute this. Will this always work and be in distribution?
  • Adds a base_fingerprint argument to the builder, which reads the fingerprint of the raw dataset so that the cache stays correct as the raw datasets are modified
  • Adds support for saving predictions to an output directory with --preds_out_dir
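
For reference, here is a minimal sketch of the total log-probability computation mentioned above, in plain Python. It is not the code from this PR: `log_softmax` and `total_logprob` are hypothetical helpers, and the sketch assumes a causal LM whose logits at position t are the next-token logits after reading tokens 0..t. It also shows why a BOS token is needed: without one, the first text token has no preceding position to be scored from.

```python
import math
from typing import List, Sequence


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def total_logprob(
    logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> float:
    """Sum of log p(token_t | tokens_<t) over every token after BOS.

    Assumptions (not taken from the PR): token_ids[0] is the BOS id, and
    logits[t] holds the next-token logits after reading token_ids[: t + 1].
    """
    if len(logits) != len(token_ids):
        raise ValueError("expected one row of logits per input token")
    total = 0.0
    # Position t predicts token t + 1, so the last row of logits is unused.
    for t in range(len(token_ids) - 1):
        total += log_softmax(logits[t])[token_ids[t + 1]]
    return total


# Toy example: 3-token vocabulary, BOS id 0, text tokens [1, 2],
# and uniform logits at every position.
uniform = [0.0, 0.0, 0.0]
toy_ids = [0, 1, 2]
lp = total_logprob([uniform, uniform, uniform], toy_ids)
```

With uniform logits each text token gets probability 1/3, so `lp` equals 2 · log(1/3). In a real run the logits would come from the model's forward pass on the BOS-prefixed input.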

@norabelrose
Member

Won't merge as-is; let's talk about ways to accomplish a similar goal within the templates system perhaps
