Package used for processing TAGS documents downloaded as tab-separated files.
tags = TAGS.Document(path="./datasets/downloaded_tags_document.tsv")
If you need to ingest more than one file, or one or more directories, into a single dataset, you can do so using the DocumentSet object.
If you only want to include a list of documents, use the paths parameter:
tags = TAGS.DocumentSet(paths=["./datasets/downloaded_tags_document.tsv", "./datasets/another_downloaded_tags_document.tsv"])
If you would rather include any number of directories, use the directories parameter:
tags = TAGS.DocumentSet(directories=["./datasets/", "./another_dataset_folder/"])
Note that if you are including directories, make sure they contain no .tsv files other than your TAGS documents. If they do, the script will likely crash.
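Before passing directories to the constructor, it can help to list the .tsv files they contain and check that nothing unexpected is there. The following is a minimal sketch using only the standard library. The helper name list_tsv_files is hypothetical and not part of this package, and it assumes only the top level of each directory matters (whether the package also reads subdirectories is not stated here).

```python
from pathlib import Path


def list_tsv_files(directories):
    """Return every .tsv file found directly inside the given directories.

    Hypothetical helper, not part of the TAGS package. It only looks at the
    top level of each directory (an assumption about what gets ingested).
    """
    found = []
    for directory in directories:
        # sorted() keeps the listing stable from one run to the next
        found.extend(sorted(Path(directory).glob("*.tsv")))
    return found


if __name__ == "__main__":
    # Review this list by hand before building the dataset.
    for tsv_path in list_tsv_files(["./datasets/", "./another_dataset_folder/"]):
        print(tsv_path)
```

If the listing shows a file that is not a TAGS export, move it elsewhere before ingesting the directory.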
Note that you can also combine paths and directories to ingest anything you wish into your dataset.
There is one more parameter that you can provide to the constructor for both TAGS.Document and TAGS.DocumentSet: suppress_warnings. It must be a boolean (True or False) and defaults to False, so warnings are generated as you ingest your dataset.
The following two examples suppress the warnings:
tags = TAGS.Document(path="./datasets/downloaded_tags_document.tsv", suppress_warnings=True)
multiple_tags = TAGS.DocumentSet(directories=["./datasets/folder_1/", "./datasets/folder_2/"], suppress_warnings=True)
Both the TAGS.Document and the TAGS.DocumentSet objects have a property that contains a list of all IDs in the file(s) in the object, for easy processing:
tags.ids
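Since ids is described as a list, ordinary list tooling applies to it. As one illustration, the sketch below reports IDs that occur more than once, which could happen when several files are combined. Whether the package already removes duplicates is not stated here, so treat this as an optional check. The helper name duplicated_ids is hypothetical.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the IDs that occur more than once, in order of first appearance.

    Hypothetical helper, not part of the TAGS package.
    """
    counts = Counter(ids)
    return [item_id for item_id, count in counts.items() if count > 1]
```

It would be called as duplicated_ids(tags.ids), and an empty list means every ID is unique.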
A TAGS.Document object can also retrieve data for a specific ID from the file using the get_data_for_id method (it returns None if there is no data):
test_id = 1156639282024464385
tags.get_data_for_id(test_id) # get all data for an ID
tags.get_data_for_id(test_id, 'text') # get specific data for an ID
Unfortunately, the TAGS.DocumentSet does not currently include such a method.
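One possible workaround, sketched here under the assumption that you are willing to load each file as its own TAGS.Document, is to ask each document in turn and return the first result that is not None. The helper name get_data_across_documents is hypothetical and not part of this package. It relies only on the get_data_for_id behaviour described above.

```python
def get_data_across_documents(documents, item_id, field=None):
    """Return data for item_id from the first document that has it, else None.

    Hypothetical helper, not part of the TAGS package. `documents` is any
    iterable of objects exposing get_data_for_id(id) and
    get_data_for_id(id, field), as TAGS.Document is described to do.
    """
    for document in documents:
        if field is None:
            data = document.get_data_for_id(item_id)
        else:
            data = document.get_data_for_id(item_id, field)
        # None means this document has no data for the ID, so keep looking
        if data is not None:
            return data
    return None


# Intended usage (requires the TAGS package, so left as a comment):
# documents = [TAGS.Document(path=p) for p in list_of_paths]
# get_data_across_documents(documents, 1156639282024464385, 'text')
```

This loads every file separately, so it may be slower than a single DocumentSet for large collections.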