Enhanced Summarization for DataFrames and Series #75

Merged: 8 commits, Mar 31, 2023

rgbkrk (Owner) commented Mar 27, 2023

This PR improves how we summarize DataFrames and Series so that GPT can better understand the structure and content of the data. The summarized format takes up roughly 1,600 tokens, depending largely on how big the fields in the sample are. Long text fields still pose a problem, but that's no different from the current implementation. One more thing that will hopefully help GPT: we state the level of sampling we're doing as well as the real size of the data.

Key Changes:

  • Added summarize_dataframe and summarize_series functions to generate detailed summaries for DataFrames and Series respectively.
  • The summary for DataFrames includes:
    • Number of rows and columns
    • Column information (name, data type, missing values, and percentage of missing values)
    • Basic summary statistics for numerical and categorical columns
    • A sample of the data (configurable number of rows and columns)
  • The summary for Series includes:
    • Number of values
    • Data type
    • Missing values and their percentage
    • Summary statistics (based on the data type)
    • A sample of the data (configurable number of values)
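To make the list above concrete, here is a minimal sketch of what the two functions could look like. It is not the code merged in this PR: only the names summarize_dataframe and summarize_series come from the description, while the signatures, the sample_size parameter, the plain-text layout, and the fixed random seed are assumptions for illustration. The DataFrame version covers only the column information and numerical summary.

```python
import pandas as pd


def summarize_dataframe(df: pd.DataFrame) -> str:
    """Illustrative sketch only; not the implementation from this PR."""
    n_rows, n_cols = df.shape
    missing = df.isna().sum()

    # One row per column: name, dtype, missing count, and missing percentage.
    column_info = pd.DataFrame(
        {
            "Column Name": list(df.columns),
            "Data Type": df.dtypes.astype(str).to_numpy(),
            "Missing Values": missing.to_numpy(),
            "% Missing": (missing / max(n_rows, 1) * 100).to_numpy(),
        }
    )

    parts = [
        f"Number of Rows: {n_rows}",
        f"Number of Columns: {n_cols}",
        "Column Information",
        column_info.to_string(),
    ]

    # describe() fails on an empty selection, so guard the numeric section.
    numeric = df.select_dtypes(include="number")
    if not numeric.empty:
        parts += ["Numerical Summary", numeric.describe().T.to_string()]
    return "\n\n".join(parts)


def summarize_series(s: pd.Series, sample_size: int = 5) -> str:
    """Illustrative sketch only; sample_size is a hypothetical parameter."""
    n = len(s)
    n_missing = int(s.isna().sum())
    pct_missing = n_missing / n * 100 if n else 0.0

    # Never ask for more values than the Series holds.
    sample = s.sample(min(sample_size, n), random_state=0)

    parts = [
        f"Number of Values: {n}",
        f"Data Type: {s.dtype}",
        f"Missing Values: {n_missing} ({pct_missing:.2f}%)",
        "Summary Statistics",
        s.describe().to_string(),  # statistics depend on the dtype
        f"Sample Data ({len(sample)})",
        sample.to_string(),
    ]
    return "\n\n".join(parts)
```

Calling summarize_dataframe(df) on any DataFrame returns a single string that can be placed in a prompt; the percentage column is computed as missing count divided by row count, which matches the 117 of 806 (about 14.5%) shown for inspection_date in the example output.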

Example output

Dataframe Summary

Number of Rows: 806

Number of Columns: 14

Column Information

| | Column Name | Data Type | Missing Values | % Missing |
|---|---|---|---|---|
| 0 | w3alcd | object | 0 | 0 |
| 1 | doing_business_as | object | 0 | 0 |
| 2 | restaurant_address | object | 0 | 0 |
| 3 | inspection_date | datetime64[ns] | 117 | 14.5161 |
| 4 | major_violation_improper_holding_temperature | int64 | 0 | 0 |
| 5 | minor_violation_improper_holding_temperature | int64 | 0 | 0 |
| 6 | major_violation_inadequate_cooking | int64 | 0 | 0 |
| 7 | minor_violation_inadequate_cooking | int64 | 0 | 0 |
| 8 | major_violation_personal_hygiene | int64 | 0 | 0 |
| 9 | minor_violation_personal_hygiene | int64 | 0 | 0 |
| 10 | major_violation_contaminated_equipment | int64 | 0 | 0 |
| 11 | minor_violation_contaminated_equipment | int64 | 0 | 0 |
| 12 | major_violation_unsafe_food_source | int64 | 0 | 0 |
| 13 | minor_violation_unsafe_food_source | int64 | 0 | 0 |

Numerical Summary

| | Column Name | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 0 | major_violation_improper_holding_temperature | 806 | 0.0111663 | 0.105144 | 0 | 0 | 0 | 0 | 1 |
| 1 | minor_violation_improper_holding_temperature | 806 | 0.10794 | 0.310498 | 0 | 0 | 0 | 0 | 1 |
| 2 | major_violation_inadequate_cooking | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | minor_violation_inadequate_cooking | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | major_violation_personal_hygiene | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | minor_violation_personal_hygiene | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | major_violation_contaminated_equipment | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | minor_violation_contaminated_equipment | 806 | 0.0694789 | 0.254425 | 0 | 0 | 0 | 0 | 1 |
| 8 | major_violation_unsafe_food_source | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | minor_violation_unsafe_food_source | 806 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Categorical Summary

| | Column Name | count | unique | top | freq | first | last |
|---|---|---|---|---|---|---|---|
| 0 | inspection_date | 689 | 550 | 2018-10-10 00:00:00 | 4 | 2011-01-23 00:00:00 | 2023-01-17 00:00:00 |

Sample Data (5x14)

| | doing_business_as | w3alcd | inspection_date | minor_violation_unsafe_food_source | restaurant_address | minor_violation_inadequate_cooking | major_violation_personal_hygiene | major_violation_contaminated_equipment | major_violation_improper_holding_temperature | minor_violation_personal_hygiene | major_violation_unsafe_food_source | minor_violation_contaminated_equipment | minor_violation_improper_holding_temperature | major_violation_inadequate_cooking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | AFC SUSHI @ SAFEWAY #691 | FA0001354 | 2019-02-28 00:00:00 | 0 | 1444 SHATTUCK AVE, BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 769 | VIK'S CHAAT CORNER | FA0000567 | 2016-05-18 00:00:00 | 0 | 2390 FOURTH ST , BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 325 | GYPSY'S TRATTORIA ITALIANO | FA0000674 | 2016-12-20 00:00:00 | 0 | 2519-A DURANT AVE, BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 220 | CVS PHARMACY | FA0001247 | 2018-10-26 00:00:00 | 0 | 2655 TELEGRAPH AVE, BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 419 | LE BATEAU IVRE/DRUNKEN BOAT | FA0000547 | 2022-08-26 00:00:00 | 0 | 2629 TELEGRAPH AVE , BERKELEY, CA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
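The sample above shows 5 of the 806 rows with the rows and columns in shuffled order, and its heading states the sample size. A sketch of how such a labelled sample might be drawn follows. The helper name sample_with_label and its max_rows, max_cols, and seed parameters are hypothetical; the actual sampling code in this PR may differ.

```python
import pandas as pd


def sample_with_label(df: pd.DataFrame, max_rows: int = 5,
                      max_cols: int = 14, seed=None):
    """Hypothetical helper: draw a random sample and describe its size."""
    n_rows, n_cols = df.shape

    # Clamp the requested size to what the DataFrame actually holds.
    rows = min(max_rows, n_rows)
    cols = min(max_cols, n_cols)

    # Sample columns first, then rows; both come back in shuffled order.
    sampled = (
        df.sample(n=cols, axis=1, random_state=seed)
          .sample(n=rows, axis=0, random_state=seed)
    )

    # The heading states the sample size, e.g. "Sample Data (5x14)".
    heading = f"Sample Data ({rows}x{cols})"
    return heading, sampled
```

Because the summary also reports the full row and column counts, the reader of the summary (here, GPT) can tell how small the sample is relative to the real data.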

@rgbkrk rgbkrk merged commit e14fbd3 into main Mar 31, 2023
@rgbkrk rgbkrk deleted the better-df-summary branch March 31, 2023 21:17