Remove disease_type.project from all phenotype files #64

Closed
maryjgoldman opened this issue Jul 1, 2019 · 10 comments · Fixed by #82

@maryjgoldman

I think we agreed to do this at some point, but I'm still seeing this field in the files. It is a list, and we don't want that. The data in this field is also already contained in the 'disease_type' field that we get from the XML files.
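A minimal sketch of the requested cleanup, assuming the phenotype file is a tab-separated table whose header row holds the field names. The helper name `drop_columns` and the sample rows are made up for illustration; this is not the code used in xena-GDC-ETL.

```python
import csv
import io


def drop_columns(tsv_text, unwanted):
    """Return tsv_text with every column named in `unwanted` removed."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    header = rows[0]
    # Positions of the columns we want to keep, in their original order.
    keep = [i for i, name in enumerate(header) if name not in unwanted]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Skip positions missing from short rows instead of raising.
        writer.writerow([row[i] for i in keep if i < len(row)])
    return out.getvalue()


# Made-up phenotype table: the last column holds a stringified list.
sample = (
    "submitter_id\tdisease_type\tdisease_type.project\n"
    "sample-1\ttype A\t['type A', 'type B']\n"
    "sample-2\ttype B\t['type A', 'type B']\n"
)
cleaned = drop_columns(sample, {"disease_type.project"})
print(cleaned)
```

Matching on the exact header name keeps the 'disease_type' column untouched while dropping only the list-valued duplicate.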

@yunhailuo
Collaborator

We may need more cleanup on cases.project. We have the following fields:

| Fields | Possible values |
| --- | --- |
| project.dbgap_accession_number | Always null for TCGA, but can be [phs000467, phs000465, phs000471, phs000468, phs000470, phs000466] for TARGET; useful? |
| project.disease_type | A list |
| project.name | Uterine Corpus Endometrial Carcinoma... |
| project.primary_site | A list |
| project.project_id | TCGA-UCEC... |
| project.released | Always true? |
| project.state | Always open? |
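The "Always true?" and "Always open?" questions in the table could be checked by collecting the distinct values of each project field across all cases. The sketch below assumes each case is a dict with a nested `project` dict; the function name `distinct_values` and the two records are hypothetical, and the pairing of a placeholder project with `phs000467` is invented for the example.

```python
from collections import defaultdict


def distinct_values(cases, prefix="project"):
    """Collect the set of distinct values seen for each nested field."""
    seen = defaultdict(set)
    for case in cases:
        for key, value in case.get(prefix, {}).items():
            # Lists are unhashable, so store them as tuples.
            if isinstance(value, list):
                value = tuple(value)
            seen[f"{prefix}.{key}"].add(value)
    return dict(seen)


# Hypothetical records; real ones would come from the GDC case metadata.
cases = [
    {"project": {"project_id": "TCGA-UCEC", "released": True,
                 "state": "open", "dbgap_accession_number": None}},
    {"project": {"project_id": "TARGET-EXAMPLE", "released": True,
                 "state": "open", "dbgap_accession_number": "phs000467"}},
]

summary = distinct_values(cases)
# Fields with a single distinct value carry no per-case information.
constant = sorted(name for name, values in summary.items() if len(values) == 1)
print(constant)
```

A field that shows up in `constant` for the full dataset is a reasonable candidate for removal, since it is the same for every case.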

@ayan-b
Collaborator

ayan-b commented Jul 2, 2019

Only removing disease_type.project for the time being: https://github.com/yunhailuo/xena-GDC-ETL/pull/68

@maryjgoldman
Author

Good call @ayan-b to stick with the known until Yunhai and I come to a conclusion.

As for these other fields, it sounds like we need to remove them. Should we open a new GitHub ticket, or do it here? I'm fine with either. Whatever you prefer @ayan-b

@ayan-b
Collaborator

ayan-b commented Jul 2, 2019

Let's do it here.

@ayan-b
Collaborator

ayan-b commented Jul 4, 2019

@maryjgoldman Updated data in the hub.

@maryjgoldman
Author

@yunhailuo I am confused. I am looking at your list of fields above that you asked @ayan-b to take out. Many of them are still in there, but when I look at them, they are not lists. Similarly, in the old GDC data they do not appear to be lists either. Can you please give an example where they are a list?

Fields to investigate:
project.disease_type
project.name
project.primary_site
project.project_id

@ayan-b
Collaborator

ayan-b commented Jul 9, 2019

@maryjgoldman I have only removed primary_site.project and disease_type.project since we didn't reach a conclusion for the others.

@maryjgoldman
Author

That makes a lot of sense. @ayan-b can you investigate these fields to see if there is ever a time in which they are a list? If not, then we can leave them since they are in the older GDC data as well.

Fields to investigate:
project.disease_type
project.name
project.primary_site
project.project_id
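One way to run this check is to scan every case and record any of the four fields whose value is a list. This is a sketch under the assumption that cases are dicts with a nested `project` dict; `find_list_fields` and the sample records are hypothetical and say nothing about what the real data contains.

```python
FIELDS_TO_CHECK = [
    "project.disease_type",
    "project.name",
    "project.primary_site",
    "project.project_id",
]


def find_list_fields(cases, prefix="project"):
    """Map each list-valued field name to the first list value seen."""
    examples = {}
    for case in cases:
        for key, value in case.get(prefix, {}).items():
            name = f"{prefix}.{key}"
            if isinstance(value, list) and name not in examples:
                examples[name] = value
    return examples


# Made-up records: only disease_type is a list here.
cases = [
    {"project": {"project_id": "TCGA-UCEC",
                 "name": "Uterine Corpus Endometrial Carcinoma",
                 "disease_type": ["type A", "type B"],
                 "primary_site": "site A"}},
    {"project": {"project_id": "TCGA-UCEC",
                 "name": "Uterine Corpus Endometrial Carcinoma",
                 "disease_type": ["type A"],
                 "primary_site": "site A"}},
]

found = find_list_fields(cases)
for field in FIELDS_TO_CHECK:
    status = "is a list in at least one case" if field in found else "never a list"
    print(f"{field}: {status}")
```

Keeping one example value per field makes it easy to paste a concrete case into the thread when a list does turn up.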

@yunhailuo
Collaborator

> @yunhailuo I am confused. I am looking at your list of fields above that you asked @ayan-b to take out. Many of them are still in there but when I look at them, they are not lists. Similarly, in the old GDC data they appear to not be lists too. Can you please give an example where they are a list?
>
> Fields to investigate:
> project.disease_type
> project.name
> project.primary_site
> project.project_id

Sorry, I'm not saying they are lists. I just want clarification so we can be sure about what to keep and what to remove.

@maryjgoldman
Author

Looks great. No fields are lists in the TCGA data.
