Cleaning of CMS open payments general payments data #55
Conversation
…acturers, indexed by ids
Thanks so much for this @sangxia! Would you mind adding some comments to make things a bit easier for our reviewers?
Holy crap, @sangxia! You have done an amazing amount of work getting all this payment data scraped and cleaned. Thank you so so so much.
I'm learning a lot about pandas and working with data in Python just doing this review, but I find myself getting lost at points as I try to understand what exactly is happening at each step. I think that some added documentation, in the form of descriptive variable names and explanatory comments, will make this not just an invaluable script for our work in drug-spending, but a useful real-world example of the power of Python for data engineering and analysis.
Of course, I realize you've spent quite a while getting to this point, so if you'd like to pair with someone to lighten the load, I'm sure any of us would be happy to step up. 😄 Either way, you deserve some major high fives for cranking out such a complex and impactful feature! 🙌
import copy

def name_canon(s):
    ret = unicodedata.normalize('NFKD',s.strip()).encode('ascii','ignore').decode()
I had some great success recently using unidecode
to convert Unicode to simple ASCII, and it's much easier to use. Admittedly this isn't an important change, but is that something you're interested in? The equivalent code would be something like this:
from unidecode import unidecode
# ... code ...
ret = unidecode(s.strip())
Ah, nevermind. I see pandas seems to use the same unicodedata
calls. I'm all for consistency in this case. 👍
def name_canon(s):
    ret = unicodedata.normalize('NFKD',s.strip()).encode('ascii','ignore').decode()
    ret = re.sub(' +', ' ', ret)
    return str.upper(ret.strip())
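For readers following along, the helper above can be exercised standalone; here's a self-contained sketch (the sample input strings are invented for demonstration, not from the CMS data):

```python
import re
import unicodedata

def name_canon(s):
    # Decompose accented characters and drop any non-ASCII remainder
    ret = unicodedata.normalize('NFKD', s.strip()).encode('ascii', 'ignore').decode()
    # Collapse runs of spaces into a single space
    ret = re.sub(' +', ' ', ret)
    return str.upper(ret.strip())

# Accents stripped, whitespace collapsed, result upper-cased
print(name_canon('  Société   Générale  '))  # SOCIETE GENERALE
```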
If we're stripping on line 25, it may be redundant to do it again here.
colstate = 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State'
colcountry = 'Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country'
df = df.loc[~df[colid].isin(name_dict)].drop_duplicates()
# print('Checking null values')
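As context for other reviewers: the filter above keeps only rows whose manufacturer id has not been canonicalized yet. A toy sketch of the same pattern (the column names and data here are invented, not the real CMS schema):

```python
import pandas as pd

# Invented toy data; the real frame uses the long CMS column names
df = pd.DataFrame({
    'colid': [1, 1, 2, 3],
    'name': ['Acme', 'Acme', 'Beta', 'Gamma'],
})
name_dict = {3: 'GAMMA'}  # ids whose canonical names are already known

# Series.isin on a dict tests membership against its keys, so this
# drops rows already covered by name_dict, then de-duplicates the rest
remaining = df.loc[~df['colid'].isin(name_dict)].drop_duplicates()
print(remaining)
```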
Is this block of commented code needed anymore?
# assert(((df[colstate].isnull()) & (df[colcountry] == 'United States')).sum()==0)
# assert((df.loc[df[colstate].notnull(), colstate].str.len() != 2).sum()==0)
print('Computing canonical names')
df['Name_canonical'] = df[colname] \
Just a personal preference on my part, but I'd break lines here such that each "step" is on its own line, like so:
df['Name_canonical'] = df[colname] \
    .str.normalize('NFKD') \
    .str.encode('ascii','ignore') \
    .str.decode('utf-8') \
    .str.upper() \
    .str.replace(' +', '') \
    .str.replace('[^A-Z0-9]','') \
    .str.strip()
df['Name_canonical'] = df[colname] \
    .str.normalize('NFKD').str.encode('ascii','ignore').str.decode('utf-8') \
    .str.upper().str.replace(' +', '').str.replace('[^A-Z0-9]','').str.strip()
df_p = df[[colid,colname,colstate,colcountry,'Name_canonical']]
I think it might be helpful to rename df_p to something more obvious here.
for c in [colname,colstate,colcountry]:
    df_p.loc[df_p[c].notnull(),c] = df_p.loc[df_p[c].notnull(),c].apply(name_canon)
print('Checking ID consistency')
df_cp = df_p.drop(colname,axis=1).drop_duplicates()
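For reviewers less familiar with the .loc pattern in the loop above: it applies the cleaner only to non-null cells, leaving missing values untouched. A minimal sketch with invented data:

```python
import pandas as pd

# Invented sample column; the real frame holds the CMS state/country fields
df_p = pd.DataFrame({'state': ['  ny ', None, ' ca  ']})

# Build a boolean mask of non-null cells and rewrite only those,
# so None/NaN entries pass through unchanged
mask = df_p['state'].notnull()
df_p.loc[mask, 'state'] = df_p.loc[mask, 'state'].apply(lambda s: s.strip().upper())
print(df_p['state'].tolist())
```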
Same situation as above with df_cp -- as a reader, I'm not entirely sure what this variable stands for.
Thanks for the comments, @mattgawarecki! I made some changes and added some comments.
@sangxia Sorry it's taken a while for me to get back to this. I'm trying to re-review it right now. 👍
@sangxia I think your changes helped quite a bit! Sorry it took so long (again) but I'll be merging this PR momentarily. If it's okay, though, I might like to take a stab at refactoring the scripts to make them more newbie-friendly. Does that sound alright to you?
@mattgawarecki No problem. Regarding refactoring, is there something specific I should pay attention to? I am working on some more code, so I think it would be good that I keep that in mind. Thanks.
From a high level, I think smaller functions are always a good thing to strive for. Code structure and style can turn into a very philosophical discussion, but I think the best description I've heard is something like, "write like the person reading is completely new to what you're doing." It's a bit of an over-simplification, but I think it's been a useful way for me to evaluate my own code day-to-day.
Scripts for checking basic consistency, splitting information into separate tables, and creating standardized names for manufacturers. Some more work needs to be done for the general payments table, but hopefully the scripts will also be useful for people working on other parts of this dataset.
It could also be good to have a naming convention for companies that have changed names, been acquired, etc.
For some reason, the Date_of_Payment field for 2015 contains many more errors than 2013 and 2014.