Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hicPCA using Gene Track #719

Closed
mxw010 opened this issue Jun 1, 2021 · 6 comments
Closed

hicPCA using Gene Track #719

mxw010 opened this issue Jun 1, 2021 · 6 comments

Comments

@mxw010
Copy link

mxw010 commented Jun 1, 2021

As mentioned in #716, when correcting for the signs for eigenvectors using a gene track, hicPCA back calculates size of the genome based on bins sometime, instead of the actual size of the chromosomes. For example, in my data, when hicPCA loads the interaction matrix, the size of the chromosomes are:

pMatrix.get_chromosome_sizes()   
OrderedDict([('chr1', 248956422), ('chr2', 242193529), ('chr3', 198295559), ('chr4', 190000000), ('chr5', 181538259), ('chr6', 170805979), ('chr7', 159345973), ('chr8', 145138636), ('chr9', 138394717), ('chr10', 133797422), ('chr11', 135086622), ('chr12', 133275309), ('chr13', 114364328), ('chr14', 107000000), ('chr15', 101991189), ('chr16', 90338345), ('chr17', 83257441), ('chr18', 80373285), ('chr19', 58617616), ('chr20', 64444167), ('chr21', 46709983), ('chr22', 50818468), ('chrX', 156000000)])

This is fine for most chromosomes, but chr4 and chrX both have genes out of the range of the chromosomes, resulting in the same error as seen in #716:

ERROR:hicmatrix.HiCMatrix:Index error
Traceback (most recent call last):
  File "/home/gdstantonlab/mxw010/.conda/envs/hic/lib/python3.8/site-packages/hicmatrix/HiCMatrix.py", line 270, in getRegionBinRange
    endbin = sorted(self.interval_trees[chrname][endpos:endpos + 1])[0].data
IndexError: list index out of range
Traceback (most recent call last):
  File "/home/gdstantonlab/mxw010/.conda/envs/hic/bin/hicPCA", line 7, in <module>
    main()
  File "/home/gdstantonlab/mxw010/.conda/envs/hic/lib/python3.8/site-packages/hicexplorer/hicPCA.py", line 337, in main
    vecs_list = correlateEigenvectorWithGeneTrack(ma, vecs_list, args.extraTrack)
  File "/home/gdstantonlab/mxw010/.conda/envs/hic/lib/python3.8/site-packages/hicexplorer/hicPCA.py", line 161, in correlateEigenvectorWithGeneTrack
    gene_occurrence[bin_id[1]] += 1
TypeError: 'NoneType' object is not subscriptable

Imo, this can simply be fixed by skipping lines from the gene reference file that are out of range, in function correlateEigenvectorWithGeneTrack:

for interval in bed:
    chromosome_name = interval.chromosome
    if chromosome_name not in chromosome_list:
        continue
    #skip lines out of range:
    if interval.start > pMatrix.get_chromosome_sizes()[chromosome_name]:
        continue
    # in which bin of the Hi-C matrix is the given gene?
    bin_id = pMatrix.getRegionBinRange(interval.chromosome,
	                               interval.start, interval.end)
    # add +1 for one gene occurrence in this bin
    gene_occurrence[bin_id[1]] += 1

This shouldn't have a big impact on the sign (unless there is a major problem), and would fix the bug.

@joachimwolff
Copy link
Collaborator

Hi,

Thanks for reporting this and providing a patch for this issue. I will consider it in the next release of HiCExplorer.

Best,

Joachim

@TTTPOB
Copy link

TTTPOB commented Jul 6, 2021

thank @mxw010 for providing a work around, i temporarily solved this problem by adding this skip line

edit:
sorry, but i found the problem still exists

@joachimwolff
Copy link
Collaborator

@mxw010 What reference genome is this? hg19 is bigger, and hg38 matches for the most chromosomes, but chromosomes 4 and X are not correct. (https://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.chrom.sizes and https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/#/st)

For example hg19: For chromosome 1 you have 248956422 vs. how it should be: 249250621
hg38: For chromosome 4 you have 190000000 vs. how it should be: 190,214,555; same for chrX: You have 156000000 but it should be 156,040,895.
Can it be you created your matrix without a correct chromosome sizes file?

@joachimwolff
Copy link
Collaborator

@TTTPOB Please provide more information why it fails for you. What reference genome do you use, what are your chromosome sizes, have you used a chromosome size file to create the matrix, and what exact data you use for the correlation?

@TTTPOB
Copy link

TTTPOB commented Jul 12, 2021

(sorry for not providing needed information in time, I am busy with some other project these days

@mxw010
Copy link
Author

mxw010 commented Jul 12, 2021

@joachimwolff
I did map to hg38, and the interaction matrix were binned with cooler/cooltools and then converted to h5 with HiCExplorer. The problem is not that the genome size file was wrong (I just checked: my reference file and genome.size were both correct), but hicPCA infers genome size from the interaction matrix. If, for one reason or another, that information isn't in the interaction matrix, or some bins were removed for some reason (in my data), then we would run into that specific problem. In retrospect, perhaps adding a warning is simply enough for us to know where it went wrong.

the genome.size I used in creating interaction matrix were:

chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chr8 145138636
chr9 138394717
chr10 133797422
chr11 135086622
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr19 58617616
chr20 64444167
chr21 46709983
chr22 50818468
chrX 156040895
chrY 57227415

joachimwolff added a commit that referenced this issue Jul 12, 2021
Bug fix for hicPCA with new test data and test case. Fix #719 #716 #7
@joachimwolff joachimwolff mentioned this issue Jul 27, 2021
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants