-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PresidioSentenceFaker #50
PresidioSentenceFaker #50
Conversation
Map NATIONALITY entity to NRP not to LOCATION Change dict literals to dict constructors to improve readability
Asa higher level abstraction over PresidioDataGenerator for utilising all templates and providers in this library
…nguage for predictions
…fault AnalyzerEngine Validate chosen language is available in provided AnalyzerEngine
Thanks @Robbie-Palmer. We'll review this soon. |
Thanks @Robbie-Palmer, this is a great addition and greatly improves the codebase!
|
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
presidio_evaluator/data_generator/faker_extensions/providers.py
Outdated
Show resolved
Hide resolved
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@Robbie-Palmer would you be interested in completing this PR? we'd be happy to help if needed |
@omri374 I'm sorry for the really long delay! I've been pulled into a number of unrelated projects these past few months |
Thanks @Robbie-Palmer! Happy to collaborate and help on this if needed, there are really good contributions in this PR. |
Map NATIONALITY entity to NRP not to LOCATION Change dict literals to dict constructors to improve readability
Asa higher level abstraction over PresidioDataGenerator for utilising all templates and providers in this library
…nguage for predictions
…fault AnalyzerEngine Validate chosen language is available in provided AnalyzerEngine
Update PresidioDataGenerator tests to make stronger assertions about contents of results
Update PresidioFakeRecordGenerator to use ReligionProvider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, thank you @Robbie-Palmer! this PR introduces really great code improvements and bug fixes.
On the new class topic, I agree we should first align on the right structure. Backward compatibility is not a big issue IMHO as this package was just recently released to PyPI and we can create a deprecation notice for the old class name.
Here are the options as I see it:
- Continue having one class
- Have two classes, which is proposed in this PR
- Have a base class, which is inherited by another class, similar to this PR but maintaining some relation between the two.
Another discussion was on the name(s) of classes.
It makes sense IMHO to have two classes, one serving as a wrapper for Faker and the other as an orchestrator. We should just think of the names in a way that wouldn't confuse, especially in light of the terms used in faker, e.g. Faker
and Generator
.
Happy to hear your thoughts @Robbie-Palmer and @melmatlis.
We could also set up a short call if that would be the easiest.
Co-authored-by: melmatlis <[email protected]>
…lt providers and templates
…-record-generator
…ator # Conflicts: # presidio_evaluator/data_generator/presidio_data_generator.py
Rename PresidioFakeRecordGenerator to PresidioSentenceFaker to distinguish it from `RecordGenerator`
Move functions for loading data from FakeNameGenerator.com in faker format into new datasets.py module Move logic for choosing templates out of SentenceFaker into PresidioSentenceFaker Remove generic read file function Add missing HospitalProvider Update Recognizer tests to use PresidioSentenceProvider
Make single module to hold all sentence semantic dependency logic for Faker, including SentenceFaker, RecordGenerator and RecordsFaker
…aker Rename faker_to_presidio_entity_type to ENTITY_TYPE_MAPPING Make presidio_templates_file_path and presidio_additional_entity_providers available from package Update Data Generation README to outline choices
/azp run |
Azure Pipelines could not run because the pipeline triggers exclude this branch/path. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
1_Generate_data.ipynb
andpresidio_data_generator.py
'smain
function contain code for generating fake records utilising the new providers and the mappings defined in this package.This is a useful higher-level API for consuming the functionality of this package in other projects e.g.
The version of the code in this PR takes from the notebook, which includes a mapping from Faker entity types to Presidio entity types, which isn't done in
presidio_data_generator.py
'smain
function.Otherwise, it is the same code, just wrapped up into a class.
In this PR, I've also changed what I think are bugs.
presidio_entities_map
which maps Faker entity types to Presidio entity types"URL": "DOMAIN_NAME"
which I believe should be the other way round"DOMAIN_NAME": "DOMAIN_NAME"
butDOMAIN_NAME
isn't listed in https://microsoft.github.io/presidio/supported_entities/PresidioAnalyzerWrapper
'spredict
method is hard coded to always use"en"
instead of the provided languagePresidioAnalyzerWrapper
's construction logic around language is confusing, especially between providing an analyzer or notAnd made some stylistic changes which I think makes the code more readable, though obviously, this is subjective
/
for creating pathlib Paths