
USP Drug Classification data dictionary + tidying #33

Merged

Conversation

cduvallet (Contributor)

Continuing on issue #14, this finalizes the USP Drug Classification data dictionary, etc. Raw and tidy data are on data.world.

This data may or may not be useful - it has non-Medicare Part D medications and their respective classes/categories. The classes and categories are pretty self-explanatory (e.g. Antidepressants, Antiparkinson Agents, Sleep Disorder Agents) and can likely easily be tied to usage (depending on how we decide to define usage...).

Some follow up tasks, if we decide to use this data:

  • Figure out if it has Medicare Part D drugs, or only outpatient/non-Medicare drugs
  • See if we have spending data on these drugs, either in the Medicare data or elsewhere
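The second task above amounts to joining the USP class labels onto spending records by drug name. A minimal sketch, assuming entirely hypothetical column names (`drug_name`, `usp_class`, `total_spending` are illustrative, not from the actual datasets):

```python
import pandas as pd

# Hypothetical stand-ins for the USP classification and a spending dataset
usp = pd.DataFrame({
    "drug_name": ["sertraline", "levodopa"],
    "usp_class": ["Antidepressants", "Antiparkinson Agents"],
})
spending = pd.DataFrame({
    "drug_name": ["sertraline", "levodopa"],
    "total_spending": [1000.0, 2500.0],
})

# Left-join so every spending row keeps its class label (NaN if unmatched)
merged = spending.merge(usp, on="drug_name", how="left")
```

In practice drug-name matching is the hard part (brand vs. generic names, capitalization), so a real join would likely need a normalization step first.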

dhuppenkothen and others added 14 commits February 3, 2017 16:22
* Pulling drug use classes out of the CMS PUF files for categorizing the Part D data

* Fixed directory creation, refactored df, shortened loop
* Fix gitignore to ignore XLSX/ZIP and move CMS data to its own dir

* Add drug names back into annual spending data, for ease of use

* Forgot to add notebook in last commit

* Fix exploration notebook after earlier changes

* Add .DS_Store files to .gitignore

* Remove Medicare drug spending dataset (migrated to data.world)

* Remove data in favor of using data.world

- (External) Move all data files to data.world repository (https://data.world/data4democracy/drug-spending)
- Remove data/ directory
- Correct notebook code to work with data.world as a source

* Wrote a helper function that gets data from a URL, and a function that downloads Part D data based on the notebook

* Added docstring to function

* Added more functions that load data and wrangle it

* Added argument parser and squashed some bugs

* Removed dependence on openpyxl, since Pandas does the trick

* Notebook runs

* Minor change to command line arg and addition to help string

* Move comments into Markdown cells and add CSV output

* Added functionality to decide between input/output data formats; supports csv and feather at the moment
Markdown version of goals statement - first draft.
Cleaning drug manufacturer data sourced from CMS.
@jenniferthompson (Contributor)

@cduvallet! The data summary and data dictionary are SO helpful! I've asked Matt or Daniela to review it because I'm not a Python user, but whether or not the data relates to what we're doing immediately, having all this documented so well is fantastic. Thank you!

@dhuppenkothen (Contributor)

I can review this today, unless @mattgawarecki is on it already.

@dhuppenkothen (Contributor)

I can also check tonight if these classifications work for the Part D data that I've been playing around with.

@dhuppenkothen (Contributor) left a comment

Nice work! This'll be really useful!

import pandas as pd

if __name__ == "__main__":

@dhuppenkothen (Contributor) commented Feb 7, 2017

Is this supposed to be just a script to be executable from the command line? Or should this supply a function that can be run from within a larger Python program as well? If it's the former, you don't really need the if __name__ == "__main__" line.

I would actually suggest moving the code below into a function tidy_kegg_data() or something, and then have it execute that here. This would allow someone to run this from within a larger programme, if necessary.
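The suggested refactor might look like the sketch below. The function name comes from the comment above; the body is a placeholder standing in for the script's actual parsing logic:

```python
import pandas as pd

def tidy_kegg_data(fname="br08302.keg"):
    """Tidy the KEGG classification file into a DataFrame.

    Placeholder body for illustration only: the script's real
    parsing code would move here from the top level.
    """
    # e.g. parse `fname` line by line and collect (class, drug) pairs
    rows = [{"class": "Antidepressants", "drug": "example_drug"}]
    return pd.DataFrame(rows)

if __name__ == "__main__":
    # Command-line entry point: callers importing this module
    # can instead invoke tidy_kegg_data() directly.
    tidy_kegg_data()
```

This keeps the command-line behavior identical while letting a larger program `import` the module and call `tidy_kegg_data()` itself.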


if __name__ == "__main__":

fname = 'br08302.keg'
Contributor

Is this data on data.world? It might be worth not assuming that the data exists on the local system, or at least check whether it exists on the local system.
There's a function in scripts/read_data.py that might make that easier (you might need to git pull upstream/master).
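A minimal sketch of the "check locally before downloading" idea. The helper name and signature here are hypothetical; the project's own version lives in scripts/read_data.py and may differ:

```python
import os
import urllib.request

def fetch_if_missing(fname, url):
    """Return a local path to fname, downloading it from url
    only when the file is not already on the local system.

    Hypothetical helper for illustration; see scripts/read_data.py
    for the project's actual implementation.
    """
    if not os.path.exists(fname):
        urllib.request.urlretrieve(url, fname)
    return fname
```

Used this way, scripts never assume the data file exists locally, and repeat runs skip the download.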

@cduvallet (Contributor, Author)

Yes, both great points that I meant to address but forgot! Will have time tomorrow to fix. Thanks for pointing it out! :)

@cduvallet (Contributor, Author)

cduvallet commented Feb 8, 2017

@dhuppenkothen I made the changes you recommended; it's much nicer now. I wasn't sure of the best way to interface with read_data.py (so I just re-wrote the download data wrapper...)

Also, it seems that there are currently two ways we're keeping track of, downloading, and tidying data:

  1. The scripts/read_data.py script has individual functions that download and tidy each dataset, and
  2. The data/ folder has individual data dictionaries and corresponding tidying scripts, one for each individual dataset.

From what I understood from @mattgawarecki, I think we're going with option 2? But let me know if not, and I can incorporate this into the read_data.py script.

…aries

Merge data-dictionaries branch in preparation for restructuring
@jenniferthompson (Contributor)

@cduvallet I'll let @dhuppenkothen speak to read_data.py, but just wanted to jump in and say we had a long discussion today about repo organization, and I just submitted a PR to reflect the updated file structure. Once we get that finalized we'll clean up all the documentation, but the idea will be to have a dictionary (md) in /datadictionaries, and tidying scripts in (in your case) /python/datawrangling/[subfolders if you need it]. Not sure if that answers all your questions, but hopefully helps! Thanks so much for bearing with us while we get more streamlined - it'll help tremendously in the long run.

Selah Lynch and others added 5 commits February 8, 2017 11:45
added direct link to the datasets of interest
* Reorganization FTW

* Reorganization FTW, part 2

* Add .gitignore

* Add READMEs to each subdirectory. Rename data dictionary template (now TEMPLATE) and remove suffix from manufacturer_datadict.md.

* Add link to data.world Python client

* Update main README to reflect new file structure

* Fix link to datadictionaries

* Really fix it this time

* Fix the other datadictionaries links to overview and template

* More streamlining and edits to README
@jenniferthompson (Contributor)

Hey @cduvallet and @dhuppenkothen! Just checking in on the status of this PR. No rush intended on my end, just wanted to make sure there isn't anything blocking either of you that we need to take care of administratively.

@cduvallet (Contributor, Author)

@jenniferthompson Nope, I was just traveling this weekend so haven't gotten around to finalizing this. Will update if I need anything from y'all! :)

@cduvallet (Contributor, Author)

Okay, I think we should be ready to merge! @jenniferthompson double-check and let me know if anything needs to change?

@jenniferthompson (Contributor)

@cduvallet The data-dictionaries branch looks great! Would you mind pushing that to your master branch so it'll show up on master here? I think that should do it!

@dhuppenkothen did you have any further suggestions on the Python code?

@cduvallet (Contributor, Author)

@jenniferthompson I think I did it! Should be ready to merge if @dhuppenkothen doesn't have other comments.

@dhuppenkothen (Contributor)

Looks good to me!

@mattgawarecki mattgawarecki merged commit 57a218f into Data4Democracy:data-dictionaries Feb 15, 2017
@mattgawarecki (Contributor)

Oops. I'll get this into master instead of data-dictionaries.
