Remove columns having list datatype #82

Merged 6 commits, Jul 12, 2019


@ayan-b (Collaborator) commented Jul 11, 2019

Closes #64.

@yunhailuo (Collaborator)

Commented in Slack. Remove those fields before making the dataframe: https://github.com/ucscXena/xena-GDC-ETL/blob/master/xena_gdc_etl/gdc.py#L431-L434. Checking for "[" is hacky and not the right approach.

@yunhailuo yunhailuo closed this Jul 11, 2019
@ayan-b ayan-b reopened this Jul 11, 2019
@ayan-b ayan-b self-assigned this Jul 11, 2019
@yunhailuo (Collaborator) left a comment

As a matter of fact, there could be an easier solution depending on the data:

>>> from pandas.io.json import json_normalize
>>> data = [{'state': 'Florida',
...          'shortname': 'FL',
...          'info': {
...               'governor': 'Rick Scott'
...          },
...          'counties': [{'name': 'Dade', 'population': 12345},
...                      {'name': 'Broward', 'population': 40000},
...                      {'name': 'Palm Beach', 'population': 60000}]},
...         {'state': 'Ohio',
...          'shortname': 'OH',
...          'info': {
...               'governor': 'John Kasich'
...          },
...          'counties': [{'name': 'Summit', 'population': 1234},
...                       {'name': 'Cuyahoga', 'population': 1337}]}]
>>> result = json_normalize(data)
>>> result
                                            counties info.governor shortname    state
0  [{'name': 'Dade', 'population': 12345}, {'name...    Rick Scott        FL  Florida
1  [{'name': 'Summit', 'population': 1234}, {'nam...   John Kasich        OH     Ohio
>>> result.applymap(type)
         counties  info.governor      shortname          state
0  <class 'list'>  <class 'str'>  <class 'str'>  <class 'str'>
1  <class 'list'>  <class 'str'>  <class 'str'>  <class 'str'>

You could skip reduce_json_array, since it messes up the data. You do still need to process the JSON properly with regard to samples. In the end, you can drop the columns that have list values.
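A minimal sketch of that last step, dropping every column that holds a list value by checking types rather than looking for "[" in a string. The helper name `drop_list_columns` and the sample records are illustrative only, not part of the PR, and `pd.json_normalize` assumes pandas 1.0 or later (older versions import it from `pandas.io.json`, as in the session above).

```python
import pandas as pd


def drop_list_columns(df):
    """Return a copy of df without columns that contain any list value.

    Hypothetical helper for illustration; not part of xena_gdc_etl.
    """
    list_cols = [
        col for col in df.columns
        # Type check on the actual cell values, not on their string form.
        if df[col].map(lambda v: isinstance(v, list)).any()
    ]
    return df.drop(columns=list_cols)


# Made-up records shaped like the json_normalize example above.
records = [
    {"state": "Florida", "info": {"governor": "Rick Scott"},
     "counties": [{"name": "Dade", "population": 12345}]},
    {"state": "Ohio", "info": {"governor": "John Kasich"},
     "counties": [{"name": "Summit", "population": 1234}]},
]

flat = pd.json_normalize(records)   # nested dicts become dotted columns
cleaned = drop_list_columns(flat)   # 'counties' holds lists, so it is dropped
print(sorted(cleaned.columns))      # ['info.governor', 'state']
```

This still builds the list-valued column before discarding it, so it is the simple option rather than the cheapest one.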

@@ -428,6 +427,10 @@ def get_samples_clinical(projects=None):
res = search(
'cases', in_filter=in_filter, fields=fields, expand=expand, typ='json'
)
reduced_json = reduce_json_array(res)

reduce_json_array will try to expand length-1 arrays: https://github.com/ucscXena/xena-GDC-ETL/blob/master/xena_gdc_etl/utils.py#L175 This might be bad behavior. A field would be removed as long as its datatype is list, unless we decide differently. reduce_json_array shouldn't be used here. Honestly, reduce_json_array is probably a bad function in the first place.
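To illustrate why expanding length-1 arrays can interfere with a "drop if the datatype is list" rule, here is a small sketch. `unwrap_singletons` is a hypothetical stand-in that only mimics the length-1 expansion described above; it is not the actual `reduce_json_array` code, and the records are made up.

```python
def unwrap_singletons(value):
    """Hypothetical stand-in: replace every length-1 list with its only element."""
    if isinstance(value, dict):
        return {key: unwrap_singletons(item) for key, item in value.items()}
    if isinstance(value, list):
        items = [unwrap_singletons(item) for item in value]
        return items[0] if len(items) == 1 else items
    return value


# The same field is a list in both records before the expansion.
records = [
    {"id": "a", "tags": ["x"]},
    {"id": "b", "tags": ["x", "y"]},
]

reduced = [unwrap_singletons(record) for record in records]

# After the expansion the field is a str in one record and a list in the other,
# so a per-record "is it a list?" check no longer treats the field consistently.
types_seen = {type(record["tags"]).__name__ for record in reduced}
print(sorted(types_seen))  # ['list', 'str']
```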

@@ -447,7 +450,9 @@ def get_samples_clinical(projects=None):
'id',
record_prefix='samples.',
)
- return pd.merge(cases_df, samples_df, how='inner', on='id')
+ merged_df = pd.merge(cases_df, samples_df, how='inner', on='id')
+ merged_df.drop(list(to_drops), axis=1, inplace=True)

Those fields can be dropped from the JSON; there is no need to wait until this point. Dropping data as early as possible saves unnecessary computation.
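A rough sketch of dropping list-valued fields at the JSON stage, before any dataframe is built. The function name `drop_list_fields`, the `keep` parameter, and the sample record are assumptions made for illustration; `keep` is there because the samples array still has to be processed separately, as noted earlier in the review.

```python
def drop_list_fields(record, keep=()):
    """Return a copy of a JSON object without list-valued fields.

    Hypothetical helper, not part of xena_gdc_etl. Keys named in `keep`
    are preserved untouched even when their value is a list.
    """
    cleaned = {}
    for key, value in record.items():
        if key in keep:
            cleaned[key] = value
        elif isinstance(value, list):
            continue  # drop list-valued fields as early as possible
        elif isinstance(value, dict):
            cleaned[key] = drop_list_fields(value, keep)  # recurse into nested objects
        else:
            cleaned[key] = value
    return cleaned


# Made-up record; the field names are illustrative only.
case = {
    "id": "c1",
    "project": {"project_id": "P1", "disease_type": ["a", "b"]},
    "samples": [{"sample_id": "s1"}],
}

slim = drop_list_fields(case, keep={"samples"})
print(slim)
# {'id': 'c1', 'project': {'project_id': 'P1'}, 'samples': [{'sample_id': 's1'}]}
```

Applied to each record in the search result, this would leave nothing list-valued (other than the kept keys) for the later normalization and merge to carry around.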

xena_gdc_etl/xena_dataset.py (outdated, resolved)
@ayan-b ayan-b changed the title Remove columns Remove columns having list datatype Jul 12, 2019
@yunhailuo yunhailuo merged commit 444ea2b into ucscXena:master Jul 12, 2019
yunhailuo added a commit to yunhailuo/xena-GDC-ETL that referenced this pull request Jul 13, 2019
Successfully merging this pull request may close these issues.

Remove disease_type.project from all phenotype files