Cleaning of CMS open payments general payments data #55
Conversation
…acturers, indexed by ids
Thanks so much for this @sangxia! Would you mind adding some comments to make things a bit easier for our reviewers?
Holy crap, @sangxia! You have done an amazing amount of work getting all this payment data scraped and cleaned. Thank you so so so much.
I'm learning a lot about pandas and working with data in Python just doing this review, but I find myself getting lost at points as I try to understand what exactly is happening at each step. I think that some added documentation, in the form of descriptive variable names and explanatory comments, will make this not just an invaluable script for our work in drug-spending, but a useful real-world example of the power of Python for data engineering and analysis.
Of course, I realize you've spent quite a while getting to this point, so if you'd like to pair with someone to lighten the load, I'm sure any of us would be happy to step up. 😄 Either way, you deserve some major high fives for cranking out such a complex and impactful feature! 🙌
import copy

def name_canon(s):
    ret = unicodedata.normalize('NFKD',s.strip()).encode('ascii','ignore').decode()
I had some great success recently using unidecode
to convert Unicode to simple ASCII, and it's much easier to use. Admittedly this isn't an important change, but is that something you're interested in? The equivalent code would be something like this:
from unidecode import unidecode
# ... code ...
ret = unidecode(s.strip())
Ah, nevermind. I see pandas seems to use the same unicodedata
calls. I'm all for consistency in this case. 👍
def name_canon(s):
    ret = unicodedata.normalize('NFKD',s.strip()).encode('ascii','ignore').decode()
    ret = re.sub(' +', ' ', ret)
    return str.upper(ret.strip())
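For readers following along, the helper above can be exercised standalone; here's a self-contained sketch (the sample input strings are invented for demonstration, not from the CMS data):

```python
import re
import unicodedata

def name_canon(s):
    # Decompose accented characters and drop any non-ASCII remainder
    ret = unicodedata.normalize('NFKD', s.strip()).encode('ascii', 'ignore').decode()
    # Collapse runs of spaces into a single space
    ret = re.sub(' +', ' ', ret)
    return str.upper(ret.strip())

# Accents stripped, whitespace collapsed, result upper-cased
print(name_canon('  Société   Générale  '))  # SOCIETE GENERALE
```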
If we're stripping on line 25, it may be redundant to do it again here.
colstate = 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State'
colcountry = 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country'
df = df.loc[~df[colid].isin(name_dict)].drop_duplicates()
# print('Checking null values')
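As context for other reviewers: the filter above keeps only rows whose manufacturer id has not been canonicalized yet. A toy sketch of the same pattern (the column names and data here are invented, not the real CMS schema):

```python
import pandas as pd

# Invented toy data; the real frame uses the long CMS column names
df = pd.DataFrame({
    'colid': [1, 1, 2, 3],
    'name': ['Acme', 'Acme', 'Beta', 'Gamma'],
})
name_dict = {3: 'GAMMA'}  # ids whose canonical names are already known

# Series.isin on a dict tests membership against its keys, so this
# drops rows already covered by name_dict, then de-duplicates the rest
remaining = df.loc[~df['colid'].isin(name_dict)].drop_duplicates()
print(remaining)
```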
Is this block of commented code needed anymore?
# assert(((df[colstate].isnull()) & (df[colcountry] == 'United States')).sum()==0)
# assert((df.loc[df[colstate].notnull(), colstate].str.len() != 2).sum()==0)
print('Computing canonical names')
df['Name_canonical'] = df[colname] \
Just a personal preference on my part, but I'd break lines here such that each "step" is on its own line, like so:
df['Name_canonical'] = df[colname] \
    .str.normalize('NFKD') \
    .str.encode('ascii','ignore') \
    .str.decode('utf-8') \
    .str.upper() \
    .str.replace(' +', '') \
    .str.replace('[^A-Z0-9]','') \
    .str.strip()
df['Name_canonical'] = df[colname] \
    .str.normalize('NFKD').str.encode('ascii','ignore').str.decode('utf-8') \
    .str.upper().str.replace(' +', '').str.replace('[^A-Z0-9]','').str.strip()
df_p = df[[colid,colname,colstate,colcountry,'Name_canonical']]
I think it might be helpful to rename df_p to something more obvious here.
for c in [colname,colstate,colcountry]:
    df_p.loc[df_p[c].notnull(),c] = df_p.loc[df_p[c].notnull(),c].apply(name_canon)
print('Checking ID consistency')
df_cp = df_p.drop(colname,axis=1).drop_duplicates()
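For reviewers less familiar with the .loc pattern in the loop above: it applies the cleaner only to non-null cells, leaving missing values untouched. A minimal sketch with invented data:

```python
import pandas as pd

# Invented sample column; the real frame holds the CMS state/country fields
df_p = pd.DataFrame({'state': ['  ny ', None, ' ca  ']})

# Build a boolean mask of non-null cells and rewrite only those,
# so None/NaN entries pass through unchanged
mask = df_p['state'].notnull()
df_p.loc[mask, 'state'] = df_p.loc[mask, 'state'].apply(lambda s: s.strip().upper())
print(df_p['state'].tolist())
```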
Same situation as above with df_cp -- as a reader, I'm not entirely sure what this variable stands for.
Thanks for the comments, @mattgawarecki! I made some changes and added some comments.
@sangxia Sorry it's taken a while for me to get back to this. I'm trying to re-review it right now. 👍
@sangxia I think your changes helped quite a bit! Sorry it took so long (again) but I'll be merging this PR momentarily. If it's okay, though, I might like to take a stab at refactoring the scripts to make them more newbie-friendly. Does that sound alright to you?
@mattgawarecki No problem. Regarding refactoring, is there something specific I should pay attention to? I am working on some more code, so I think it would be good that I keep that in mind. Thanks.
From a high level, I think smaller functions are always a good thing to strive for. Code structure and style can turn into a very philosophical discussion, but I think the best description I've heard is something like, "write like the person reading is completely new to what you're doing." It's a bit of an over-simplification, but I think it's been a useful way for me to evaluate my own code day-to-day.
Scripts for checking basic consistency, splitting information into separate tables, and creating standardized names for manufacturers. Some more work needs to be done for the general payments table, but hopefully the scripts will also be useful for people working on other parts of this dataset.
It could also be good to have a naming convention for companies that have changed names, been acquired, etc.
For some reason, the Date_of_Payment field for 2015 contains many more errors than 2013 and 2014.