Remove columns having list datatype #82
Commented in Slack. Remove it before building the DataFrame: https://github.com/ucscXena/xena-GDC-ETL/blob/master/xena_gdc_etl/gdc.py#L431-L434. Checking for "[" is hacky and not correct.
As a matter of fact, there could be an easier solution depending on the data:
>>> from pandas.io.json import json_normalize
>>> data = [{'state': 'Florida',
... 'shortname': 'FL',
... 'info': {
... 'governor': 'Rick Scott'
... },
... 'counties': [{'name': 'Dade', 'population': 12345},
... {'name': 'Broward', 'population': 40000},
... {'name': 'Palm Beach', 'population': 60000}]},
... {'state': 'Ohio',
... 'shortname': 'OH',
... 'info': {
... 'governor': 'John Kasich'
... },
... 'counties': [{'name': 'Summit', 'population': 1234},
... {'name': 'Cuyahoga', 'population': 1337}]}]
>>> result = json_normalize(data)
>>> result
counties info.governor shortname state
0 [{'name': 'Dade', 'population': 12345}, {'name... Rick Scott FL Florida
1 [{'name': 'Summit', 'population': 1234}, {'nam... John Kasich OH Ohio
>>> result.applymap(type)
counties info.governor shortname state
0 <class 'list'> <class 'str'> <class 'str'> <class 'str'>
1 <class 'list'> <class 'str'> <class 'str'> <class 'str'>
You should skip reduce_json_array, since it messes up the data. You do still need to process the JSON properly with regard to samples. In the end, you can drop the columns that hold list values.
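A minimal sketch of that last step, assuming pandas 1.0 or later (where `json_normalize` is available as `pd.json_normalize`). The helper name `drop_list_columns` is hypothetical and not part of xena_gdc_etl; the data is a trimmed version of the example above.

```python
import pandas as pd


def drop_list_columns(df):
    """Return a copy of df without any column that holds a list value.

    Hypothetical helper for illustration only.
    """
    list_cols = [
        col for col in df.columns
        # Inspect the actual Python type of each cell, not its string form.
        if df[col].map(lambda v: isinstance(v, list)).any()
    ]
    return df.drop(columns=list_cols)


data = [
    {'state': 'Florida', 'info': {'governor': 'Rick Scott'},
     'counties': [{'name': 'Dade', 'population': 12345}]},
    {'state': 'Ohio', 'info': {'governor': 'John Kasich'},
     'counties': [{'name': 'Summit', 'population': 1234}]},
]
result = drop_list_columns(pd.json_normalize(data))
print(sorted(result.columns))
```

Testing with `isinstance` on the cell values avoids the string check for "[" altogether.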
xena_gdc_etl/gdc.py
Outdated
@@ -428,6 +427,10 @@ def get_samples_clinical(projects=None):
res = search(
'cases', in_filter=in_filter, fields=fields, expand=expand, typ='json'
)
reduced_json = reduce_json_array(res)
reduce_json_array will try to expand a length-1 array: https://github.com/ucscXena/xena-GDC-ETL/blob/master/xena_gdc_etl/utils.py#L175. This might be bad behavior. A field should be removed whenever its datatype is list, unless we decide differently, so reduce_json_array shouldn't be used here. Honestly, reduce_json_array is probably a bad function in the first place.
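To illustrate the concern, here is a rough sketch. `collapse_singleton_lists` is a hypothetical stand-in written for this example, not the project's reduce_json_array, and it assumes "expand" means replacing a length-1 array with its only element. The case records are made up.

```python
def collapse_singleton_lists(value):
    """Hypothetical stand-in: replace every length-1 list with its only element."""
    if isinstance(value, dict):
        return {k: collapse_singleton_lists(v) for k, v in value.items()}
    if isinstance(value, list):
        items = [collapse_singleton_lists(v) for v in value]
        return items[0] if len(items) == 1 else items
    return value


# Made-up records: the same field holds one sample in the first case, two in the second.
cases = [
    {'id': 'case-1', 'samples': [{'sample_id': 's1'}]},
    {'id': 'case-2', 'samples': [{'sample_id': 's2'}, {'sample_id': 's3'}]},
]
collapsed = [collapse_singleton_lists(c) for c in cases]
types = [type(c['samples']).__name__ for c in collapsed]
print(types)  # ['dict', 'list']
```

After collapsing, the same field is a dict in one record and a list in another, so a rule like "drop every field whose datatype is list" would no longer treat the two records consistently.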
@@ -447,7 +450,9 @@ def get_samples_clinical(projects=None):
'id',
record_prefix='samples.',
)
return pd.merge(cases_df, samples_df, how='inner', on='id')
merged_df = pd.merge(cases_df, samples_df, how='inner', on='id')
merged_df.drop(list(to_drops), axis=1, inplace=True)
Those fields can be dropped from the JSON directly; there is no need to wait until this point. Dropping data as early as possible saves extra computation.
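One way this could look, as a sketch only: `drop_list_fields` and its `keep` parameter are hypothetical names, and the field names in the sample record are invented for illustration. It assumes the `samples` array has to survive because it is normalized separately.

```python
def drop_list_fields(record, keep=()):
    """Return a copy of record without list-valued fields, except those in keep.

    Hypothetical helper; nested dicts are cleaned recursively.
    """
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, list) and key not in keep:
            continue  # drop the list-valued field before any DataFrame is built
        if isinstance(value, dict):
            value = drop_list_fields(value, keep)
        cleaned[key] = value
    return cleaned


# Invented record for illustration only.
case = {
    'id': 'case-1',
    'diagnoses': [{'age_at_diagnosis': 100}],
    'demographic': {'gender': 'female', 'notes': ['a', 'b']},
    'samples': [{'sample_id': 's1'}],
}
cleaned = drop_list_fields(case, keep=('samples',))
print(cleaned)
```

Applied to each record of the search result, this leaves nothing list-valued for the merge step to carry around, apart from the fields explicitly kept.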
This reverts commit 444ea2b.
Closes #64.