PII Data Detection with ALBERT base v2: Training + Inference

Overview

Inspired by the Kaggle competition The Learning Agency Lab - PII Data Detection.
The goal of this model is to detect personally identifiable information (PII) in student writing. Automating the detection and removal of PII from educational data will lower the cost of releasing educational datasets, which will support learning science research and the development of educational tools.

The dataset comprises approximately 22,000 essays written by students enrolled in a massive open online course. All of the essays were written in response to a single assignment prompt, which asked students to apply course material to a real-world problem. The goal is to annotate personally identifiable information (PII) found within the essays.
In order to protect student privacy, the original PII in the dataset has been replaced by surrogate identifiers of the same type using a partially automated process. A majority of the essays are reserved for the test set (70%).

The data is presented in JSON format, which includes a document identifier, the full text of the essay, a list of tokens, information about whitespace, and token annotations. The documents were tokenized using the spaCy English tokenizer.
Token labels are presented in BIO (Beginning, Inner, Outer) format. The PII type is prefixed with B- when it is the beginning of an entity. If the token is a continuation of an entity, it is prefixed with I-. Tokens that are not PII are labeled O.
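The BIO scheme described above can be illustrated with a small decoder that groups token labels into entity spans. This is a minimal sketch, not code from this repository: the function name `bio_to_spans` is hypothetical, and the label names `NAME_STUDENT` and `EMAIL` are used only for illustration.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO token labels into (start, end_exclusive, pii_type) spans."""
    spans = []
    start, current = None, None  # start index and type of the open span
    for i, label in enumerate(labels):
        # "B-NAME_STUDENT" -> ("B", "-", "NAME_STUDENT"); "O" -> ("O", "", "")
        prefix, _, pii_type = label.partition("-")
        continues = prefix == "I" and pii_type == current
        if not continues:
            # Close the span that was open, if any.
            if current is not None:
                spans.append((start, i, current))
                start, current = None, None
            # Open a new span on B-; a stray I- is treated leniently as a start.
            if prefix in ("B", "I"):
                start, current = i, pii_type
    if current is not None:
        spans.append((start, len(labels), current))
    return spans


example = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]
print(bio_to_spans(example))
```

Treating an I- tag with no preceding B- as the start of a new span is a design choice made here for robustness against noisy predictions; a stricter decoder could discard such tokens instead.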
- {test|train}.json - the test and training data; the test data given on this page is for illustrative purposes only, and will be replaced during Code rerun with a hidden test set.
  - (int): the index of the essay
  - document (int): an integer ID of the essay
  - full_text (string): a UTF-8 representation of the essay
  - tokens (list)
    - (string): a string representation of each token
  - trailing_whitespace (list)
    - (bool): a boolean value indicating whether each token is followed by whitespace.
  - labels (list) [training data only]
    - (string): a token label in BIO format
- sample_submission.csv - An example of the correct submission format. See the Submission File section of the Overview page for details.
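To make the record layout above concrete, here is a minimal sketch that parses one record and rebuilds the essay text from `tokens` and `trailing_whitespace`. It assumes the top level of the JSON file is a list of records with the fields listed above. The sample record and the helper name `rebuild_text` are invented for illustration and are not part of the dataset or this repository.

```python
import json

# A made-up record in the shape described above (not real competition data).
SAMPLE_JSON = """
[
  {
    "document": 7,
    "full_text": "Hi, I am Ana.",
    "tokens": ["Hi", ",", "I", "am", "Ana", "."],
    "trailing_whitespace": [false, true, true, true, false, false],
    "labels": ["O", "O", "O", "O", "B-NAME_STUDENT", "O"]
  }
]
"""


def rebuild_text(tokens, trailing_whitespace):
    """Join tokens, appending one space after each token flagged as followed by whitespace."""
    return "".join(
        tok + (" " if ws else "") for tok, ws in zip(tokens, trailing_whitespace)
    )


records = json.loads(SAMPLE_JSON)
record = records[0]

# The three per-token lists should be aligned one-to-one.
assert len(record["tokens"]) == len(record["trailing_whitespace"]) == len(record["labels"])

rebuilt = rebuild_text(record["tokens"], record["trailing_whitespace"])
print(rebuilt == record["full_text"])
```

With the real files, the same approach would read the data via `json.load` on an open `train.json`; whether the rebuilt text matches `full_text` exactly for every essay is something to check rather than assume.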

kaggle competitions download -c pii-detection-removal-from-educational-data
