[BUG] Misaligned DataFrame Schemas #87

Open
kmaphoenix opened this issue Sep 30, 2022 · 2 comments
Labels: bug (Something isn't working), dataframes (Issues or Features with Pandas Dataframes), Entity Types (Entity Resources), Intents (Intent Resources)

@kmaphoenix
Member

Current Behavior

Currently, there are several places in SCRAPI where we export and import DataFrames, and their schemas are misaligned.
This makes it hard to chain these methods into a pipeline, because columns have to be renamed (or some other ETL step added) between each export and import.

Examples:
Intents.intent_proto_to_dataframe exports the columns display_name and training_phrase in basic mode.
In advanced mode, the same method calls the utterance column text instead.
So the schema and its semantics differ within a single method.

In DataframeFunctions.bulk_update_intents_from_dataframe, basic mode expects the input columns display_name and text.
This does not match the basic-mode schema that the Intents class generates (see above).
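
To make the mismatch concrete, here is a minimal sketch that compares the two basic-mode column lists quoted above. It does not call SCRAPI; the constants and the schema_diff helper are hypothetical names used only for illustration.

```python
# Column names as quoted in this issue; nothing here calls SCRAPI itself.
EXPORT_BASIC = ["display_name", "training_phrase"]  # Intents.intent_proto_to_dataframe, basic mode
IMPORT_BASIC = ["display_name", "text"]  # bulk_update_intents_from_dataframe, basic mode


def schema_diff(exported, expected):
    """Return (missing, unexpected) column names, preserving order."""
    missing = [c for c in expected if c not in exported]
    unexpected = [c for c in exported if c not in expected]
    return missing, unexpected


missing, unexpected = schema_diff(EXPORT_BASIC, IMPORT_BASIC)
print("missing on import:", missing)        # ['text']
print("unexpected on import:", unexpected)  # ['training_phrase']
```

The same comparison could be run against real frames by passing list(df.columns) for each side.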

So if your workflow is this:

  1. Intents to Dataframe
  2. Dataframe to Sheet
  3. Sheet to Dataframe

Step 3 will break due to the misaligned schema.
We should always be in alignment for "like for like" export/import (i.e., basic export and basic import should match 100%).
We should also be in alignment semantically across modes (i.e., basic and advanced can have different schemas, but any column they share should have exactly the same name).

Expected Behavior

All DataFrame schemas within the same Resource type (i.e. Intents, Entity Types, etc.) should be in alignment.

Possible Solution

Centralize the creation and validation of all schema types in a file outside of the classes that use them.
Introduce core/schemas.py or similar to maintain a central schema repository.
Each class can then pull its schema and schema validation rules from that central module, ensuring continuity across DataFrame resources.
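
A rough sketch of what such a module could look like is below. Everything in it is an assumption for discussion: the registry layout, the function names, and in particular the choice of text as the canonical utterance column (the issue does not say whether text or training_phrase should win). Validation works on plain column names so it can be called with df.columns.

```python
"""Hypothetical core/schemas.py: one place for DataFrame column names."""

# Assumption: "text" is picked as the canonical utterance column here.
# Only the intents/basic schema is listed because it is the only one
# whose full column list appears in this issue.
SCHEMAS = {
    ("intents", "basic"): ("display_name", "text"),
}


class SchemaError(ValueError):
    """Raised when columns do not match the registered schema."""


def get_schema(resource, mode="basic"):
    """Return the registered column names for a resource type and mode."""
    try:
        return SCHEMAS[(resource, mode)]
    except KeyError:
        raise SchemaError(
            f"no schema registered for {resource!r} in {mode!r} mode"
        ) from None


def validate_columns(columns, resource, mode="basic"):
    """Check that `columns` (e.g. df.columns) contains every registered column."""
    expected = get_schema(resource, mode)
    present = set(columns)
    missing = [c for c in expected if c not in present]
    if missing:
        raise SchemaError(f"{resource}/{mode} is missing columns: {missing}")
    return list(expected)
```

With something like this in place, export methods would build their frames from get_schema("intents", mode) and import methods would call validate_columns(df.columns, "intents", mode) before doing any work, so both sides read from the same definition.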

Steps to Reproduce

Try the following

  1. Intents to Dataframe
  2. Dataframe to Sheet
  3. Sheet to Dataframe (without modifying the sheet; leave it as-is)

@kmaphoenix kmaphoenix added the bug Something isn't working label Sep 30, 2022
@kmaphoenix kmaphoenix added dataframes Issues or Features with Pandas Dataframes Intents Intent Resources Entity Types Entity Resources labels Sep 30, 2022
@kmaphoenix
Member Author

@MRyderOC I found this bug / issue when prepping for my SCRAPI demo today.
More of a minor annoyance than a bug, but I think we should be able to easily enforce this across the entire library.

cc: @Greenford @SeanScripts @cgibson6279
in case y'all want to weigh in, help triage, or have any ideas other than what I offered.

@Greenford
Collaborator

Certainly in favor of a centralized solution for column naming. I've also run into this problem a few times.
