Remove columns having `list` datatype #82

ayan-b · 2019-07-11T18:48:46Z

Closes #64.

yunhailuo · 2019-07-11T18:57:45Z

Commented in Slack. Remove it before making the dataframe: https://github.com/ucscXena/xena-GDC-ETL/blob/master/xena_gdc_etl/gdc.py#L431-L434. Checking for "[" is hacky and not the right approach.

yunhailuo

As a matter of fact, there could be an easier solution depending on the data:

>>> from pandas.io.json import json_normalize
>>> data = [{'state': 'Florida',
...          'shortname': 'FL',
...          'info': {
...               'governor': 'Rick Scott'
...          },
...          'counties': [{'name': 'Dade', 'population': 12345},
...                      {'name': 'Broward', 'population': 40000},
...                      {'name': 'Palm Beach', 'population': 60000}]},
...         {'state': 'Ohio',
...          'shortname': 'OH',
...          'info': {
...               'governor': 'John Kasich'
...          },
...          'counties': [{'name': 'Summit', 'population': 1234},
...                       {'name': 'Cuyahoga', 'population': 1337}]}]
>>> result = json_normalize(data)
>>> result
                                            counties info.governor shortname    state
0  [{'name': 'Dade', 'population': 12345}, {'name...    Rick Scott        FL  Florida
1  [{'name': 'Summit', 'population': 1234}, {'nam...   John Kasich        OH     Ohio
>>> result.applymap(type)
         counties  info.governor      shortname          state
0  <class 'list'>  <class 'str'>  <class 'str'>  <class 'str'>
1  <class 'list'>  <class 'str'>  <class 'str'>  <class 'str'>

You could skip reduce_json_array, since it messes up the data. You do still need to process the JSON properly with regard to samples. In the end, you can drop the columns that hold list values.
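As a minimal sketch of that last step, the snippet below drops every column in which at least one cell is a list, checking the actual Python type rather than looking for "[" in a string. The helper name `drop_list_columns` and the trimmed sample records are assumptions made for illustration; they are not code from this PR.

```python
import pandas as pd

try:
    json_normalize = pd.json_normalize  # pandas >= 1.0
except AttributeError:
    from pandas.io.json import json_normalize  # older pandas, as in the session above


def drop_list_columns(df):
    """Return a copy of df without any column that has at least one list cell.

    Hypothetical helper for illustration; not part of xena_gdc_etl.
    """
    list_cols = [
        col for col in df.columns
        if df[col].map(lambda v: isinstance(v, list)).any()
    ]
    return df.drop(columns=list_cols)


# Trimmed version of the example data used in the comment above.
records = [
    {"state": "Florida", "info": {"governor": "Rick Scott"},
     "counties": [{"name": "Dade", "population": 12345}]},
    {"state": "Ohio", "info": {"governor": "John Kasich"},
     "counties": [{"name": "Summit", "population": 1234}]},
]
flat = drop_list_columns(json_normalize(records))
print(sorted(flat.columns))  # ['info.governor', 'state']
```

Checking with `isinstance` on each cell keeps string values that merely contain a bracket, which a substring test would wrongly drop.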

yunhailuo · 2019-07-11T21:59:35Z

xena_gdc_etl/gdc.py

@@ -428,6 +427,10 @@ def get_samples_clinical(projects=None):
 res = search(
 'cases', in_filter=in_filter, fields=fields, expand=expand, typ='json'
 )
+ reduced_json = reduce_json_array(res)


reduce_json_array will try to expand length-1 arrays: https://github.com/ucscXena/xena-GDC-ETL/blob/master/xena_gdc_etl/utils.py#L175. That is likely bad behavior here: unless we decide differently, a field should be removed whenever its datatype is list, and unwrapping single-element lists hides that. It shouldn't be used here. Honestly, reduce_json_array is probably a bad function in the first place.
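To illustrate the concern, here is a rough sketch of what unwrapping length-1 arrays does to field types. `unwrap_singletons` is a hypothetical stand-in written for this example, not the actual `reduce_json_array` implementation, and the `stage` field and case IDs are made up.

```python
def unwrap_singletons(value):
    """Hypothetical stand-in: recursively replace each length-1 list
    with its only element. Not the real reduce_json_array."""
    if isinstance(value, dict):
        return {k: unwrap_singletons(v) for k, v in value.items()}
    if isinstance(value, list):
        items = [unwrap_singletons(v) for v in value]
        return items[0] if len(items) == 1 else items
    return value


# Two cases with the same field; only the number of entries differs.
cases = [
    {"id": "case-1", "diagnoses": [{"stage": "I"}]},
    {"id": "case-2", "diagnoses": [{"stage": "II"}, {"stage": "III"}]},
]
reduced = [unwrap_singletons(c) for c in cases]
print([type(c["diagnoses"]).__name__ for c in reduced])  # ['dict', 'list']
```

After unwrapping, the same field is a dict in one record and a list in another, so a rule of the form "drop the field if it is a list" no longer applies uniformly across records.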

yunhailuo · 2019-07-11T22:00:55Z

xena_gdc_etl/gdc.py

@@ -447,7 +450,9 @@ def get_samples_clinical(projects=None):
 'id',
 record_prefix='samples.',
 )
- return pd.merge(cases_df, samples_df, how='inner', on='id')
+ merged_df = pd.merge(cases_df, samples_df, how='inner', on='id')
+ merged_df.drop(list(to_drops), axis=1, inplace=True)


Those fields can be dropped from the JSON directly; there is no need to wait until this point. Dropping data as early as possible saves extra computation.
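One way to drop list-valued fields at the JSON stage, before any dataframe is built, might look like the sketch below. The function name `drop_list_fields`, the `keep` parameter, and the sample record are assumptions for illustration only; `samples` is kept because it still has to be processed separately.

```python
def drop_list_fields(record, keep=()):
    """Return a copy of a JSON object without list-valued fields.

    Nested dicts are cleaned recursively. Keys named in `keep` are retained
    even when their value is a list (matched by key name at any depth).
    Hypothetical helper for illustration; not part of xena_gdc_etl.
    """
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, list) and key not in keep:
            continue  # drop the list-valued field early
        if isinstance(value, dict):
            value = drop_list_fields(value, keep)
        cleaned[key] = value
    return cleaned


# Made-up record shaped loosely like a case with nested fields.
case = {
    "id": "case-1",
    "demographic": {"gender": "female", "notes": ["a", "b"]},
    "diagnoses": [{"stage": "I"}],
    "samples": [{"sample_id": "s-1"}],
}
print(drop_list_fields(case, keep=("samples",)))
# {'id': 'case-1', 'demographic': {'gender': 'female'}, 'samples': [{'sample_id': 's-1'}]}
```

Because the lists are removed before normalization, the later merge never creates the columns that would otherwise have to be dropped.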

xena_gdc_etl/xena_dataset.py

This reverts commit 444ea2b.

ayan-b added 2 commits July 12, 2019 00:07

Remove diagnoses.treatments from expands

0f18fec

Drop columns from phenotype data whose cells are of list type

8377a08

yunhailuo closed this Jul 11, 2019

Use different approach

91990ee

ayan-b reopened this Jul 11, 2019

ayan-b self-assigned this Jul 11, 2019

yunhailuo requested changes Jul 11, 2019

ayan-b added 3 commits July 12, 2019 10:07

Remove unrelated change

acc3f81

Do not use reduce_json_array

754acac

Fix docstring

ad3ab71

ayan-b changed the title ~~Remove columns~~ Remove columns having list datatype Jul 12, 2019

yunhailuo approved these changes Jul 12, 2019

yunhailuo merged commit 444ea2b into ucscXena:master Jul 12, 2019

yunhailuo added a commit to yunhailuo/xena-GDC-ETL that referenced this pull request Jul 13, 2019

Revert "Remove columns having list datatype (ucscXena#82)"

fb3653f

This reverts commit 444ea2b.
