Cleaning of CMS open payments general payments data #55

Merged: 6 commits merged into Data4Democracy:master on Mar 21, 2017

Conversation

@sangxia (Contributor) commented Mar 8, 2017:

Scripts for checking basic consistency, splitting information into separate tables, and creating standardized names for manufacturers. Some more work needs to be done for the general payments table, but hopefully the scripts could be useful for people working on other parts of this dataset.

It could also be good to have a naming convention for companies that have changed names, been acquired, etc.

For some reason, the Date_of_Payment field for 2015 contains many more errors than for 2013 and 2014.

@jenniferthompson (Contributor) commented:

Thanks so much for this @sangxia! Would you mind adding some comments to make things a bit easier for our reviewers?

@mattgawarecki (Contributor) left a comment:

Holy crap, @sangxia! You have done an amazing amount of work getting all this payment data scraped and cleaned. Thank you so so so much.

I'm learning a lot about pandas and working with data in Python just doing this review, but I find myself getting lost at points as I try to understand what exactly is happening at each step. I think some added documentation, in the form of descriptive variable names and explanatory comments, will make this not just an invaluable script for our work in drug-spending, but a useful real-world example of the power of Python for data engineering and analysis.

Of course, I realize you've spent quite a while getting to this point, so if you'd like to pair with someone to lighten the load, I'm sure any of us would be happy to step up. 😄 Either way, you deserve some major high fives for cranking out such a complex and impactful feature! 🙌

import copy

def name_canon(s):
    ret = unicodedata.normalize('NFKD',s.strip()).encode('ascii','ignore').decode()
mattgawarecki:

I had some great success recently using unidecode to convert Unicode to simple ASCII, and it's much easier to use. Admittedly this isn't an important change, but is that something you're interested in? The equivalent code would be something like this:

from unidecode import unidecode
# ... code ...
ret = unidecode(s.strip())

mattgawarecki:

Ah, nevermind. I see pandas seems to use the same unicodedata calls. I'm all for consistency in this case. 👍

def name_canon(s):
    ret = unicodedata.normalize('NFKD',s.strip()).encode('ascii','ignore').decode()
    ret = re.sub(' +', ' ', ret)
    return str.upper(ret.strip())
mattgawarecki:

If we're stripping on line 25, it may be redundant to do it again here.
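
A minimal sketch of the simplified function, assuming the initial s.strip() plus the space-collapsing regex already guarantee there is no leading or trailing whitespace left:

import re
import unicodedata

def name_canon(s):
    # Drop accents / non-ASCII characters, collapse runs of spaces, upper-case.
    ret = unicodedata.normalize('NFKD', s.strip()).encode('ascii', 'ignore').decode()
    ret = re.sub(' +', ' ', ret)
    return ret.upper()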

colstate = 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State'
colcountry = 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country'
df = df.loc[~df[colid].isin(name_dict)].drop_duplicates()
# print('Checking null values')
mattgawarecki:

Is this block of commented code needed anymore?

# assert(((df[colstate].isnull()) & (df[colcountry] == 'United States')).sum()==0)
# assert((df.loc[df[colstate].notnull(), colstate].str.len() != 2).sum()==0)
print('Computing canonical names')
df['Name_canonical'] = df[colname] \
mattgawarecki:

Just a personal preference on my part, but I'd break lines here such that each "step" is on its own line, like so:

df['Name_canonical'] = df[colname] \
    .str.normalize('NFKD') \
    .str.encode('ascii','ignore') \
    .str.decode('utf-8') \
    .str.upper() \
    .str.replace(' +', '') \
    .str.replace('[^A-Z0-9]','') \
    .str.strip()
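
One version-dependent caveat (this PR predates the change): from pandas 2.0 onward, .str.replace() treats the pattern as a literal string by default, so on newer pandas the two regex replacements would need an explicit regex=True, roughly:

df['Name_canonical'] = df[colname] \
    .str.normalize('NFKD') \
    .str.encode('ascii','ignore') \
    .str.decode('utf-8') \
    .str.upper() \
    .str.replace(' +', '', regex=True) \
    .str.replace('[^A-Z0-9]', '', regex=True) \
    .str.strip()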

df['Name_canonical'] = df[colname] \
    .str.normalize('NFKD').str.encode('ascii','ignore').str.decode('utf-8') \
    .str.upper().str.replace(' +', '').str.replace('[^A-Z0-9]','').str.strip()
df_p = df[[colid,colname,colstate,colcountry,'Name_canonical']]
mattgawarecki:

I think it might be helpful to rename df_p to something more obvious here.

for c in [colname,colstate,colcountry]:
    df_p.loc[df_p[c].notnull(),c] = df_p.loc[df_p[c].notnull(),c].apply(name_canon)
print('Checking ID consistency')
df_cp = df_p.drop(colname,axis=1).drop_duplicates()
mattgawarecki:

Same situation as above with df_cp -- as a reader, I'm not entirely sure what this variable stands for.
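
For illustration only, one possible renaming that mirrors the quoted lines above; payer_info and payer_id_info are placeholder names, not part of the PR:

payer_info = df[[colid, colname, colstate, colcountry, 'Name_canonical']]

for c in [colname, colstate, colcountry]:
    has_value = payer_info[c].notnull()
    payer_info.loc[has_value, c] = payer_info.loc[has_value, c].apply(name_canon)

print('Checking ID consistency')
payer_id_info = payer_info.drop(colname, axis=1).drop_duplicates()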

@sangxia (Contributor, Author) commented Mar 11, 2017:

Thanks for the comments, @mattgawarecki! I made some changes and added some comments.

@mattgawarecki (Contributor) commented:

@sangxia Sorry it's taken a while for me to get back to this. I'm trying to re-review it right now. 👍

@mattgawarecki (Contributor) commented:

@sangxia I think your changes helped quite a bit! Sorry it took so long (again) but I'll be merging this PR momentarily.

If it's okay, though, I might like to take a stab at refactoring the scripts to make them more newbie-friendly. Does that sound alright to you?

@mattgawarecki merged commit bf02b82 into Data4Democracy:master on Mar 21, 2017
@sangxia (Contributor, Author) commented Mar 21, 2017:

@mattgawarecki No problem. Regarding refactoring, is there something specific I should pay attention to? I am working on some more code, so it would be good to keep that in mind. Thanks.

@mattgawarecki (Contributor) commented:

From a high level, I think smaller functions are always a good thing to strive for. Code structure and style can turn into a very philosophical discussion, but I think the best description I've heard is something like, "write like the person reading is completely new to what you're doing." It's a bit of an over-simplification, but I think it's been a useful way for me to evaluate my own code day-to-day.
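
As a rough illustration of that advice (a sketch with made-up helper names, not the actual planned refactor), the name canonicalization above could be split into small single-purpose functions:

import re
import unicodedata

def to_ascii(s):
    # Decompose accented characters and drop anything that is not plain ASCII.
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode()

def collapse_spaces(s):
    # Replace runs of spaces with a single space.
    return re.sub(' +', ' ', s)

def name_canon(s):
    # Canonical manufacturer name: ASCII only, single-spaced, upper-cased.
    return collapse_spaces(to_ascii(s.strip())).upper()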
