Normalization and Correction after hic2cool #724

Closed

RomeroMatt opened this issue Jun 18, 2021 · 17 comments

@RomeroMatt

Hello,
Thanks so much for the great tool. I'm new to hicexplorer (and bioinformatics, in general) and I had a quick question about normalization and correction.

I've been reading through the docs, found the hicConvertFormat info, and was able to get an .mcool matrix from my .hic.
From there, I read that the docs suggest normalizing first and then correcting, but I'm a bit confused as to whether running hicConvertFormat with --correction_name KR automatically applies the correction to the matrix, or whether I need to use hicCorrectMatrix. My original .hic file has corrections stored in it, and hicInfo shows:
"The following columns are available: ['chrom' 'start' 'end' 'KR' 'SCALE' 'VC' 'VC_SQRT']."

I found previous questions regarding this issue:
#661
#570

Where I'm confused is that in issue #661, it is suggested to use hicConvertFormat to convert .hic to .cool and then to run hicConvertFormat again to apply --correction_name, but nothing is said about normalization.

In issue #570 it is stated to use hicConvertFormat to convert .hic to .cool and then normalize with hicNormalize, followed by matrix correction via hicConvertFormat, but only if the correction is stored in the original matrix; if not, you must use hicCorrectMatrix.

I just want to make sure that I am on the right track by:
converting some.hic to some.cool file;
normalizing some.cool to some_normalized.cool;
and lastly correcting some_normalized.cool to some_normalized_corrected.cool via hicConvertFormat with the --correction_name KR option set.
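
In commands, my plan would be roughly the following (file names are placeholders, and whether --correction_name belongs in the last step is exactly my question):

hicConvertFormat -m some.hic -o some.cool --inputFormat hic --outputFormat cool
hicNormalize -m some.cool --normalize smallest -o some_normalized.cool
hicConvertFormat -m some_normalized.cool -o some_normalized_corrected.cool --inputFormat cool --outputFormat cool --correction_name KR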

I hope this makes sense and thanks so much for the help!
-Matt

@joachimwolff
Collaborator

Hi Matt,

that's a tough one, and there is no one-size-fits-all answer.

  • Normalization is necessary if you have multiple samples you want to compare. If you have just one sample, it doesn't really matter.
  • If you have multiple hic files, it depends on whether they are already normalized. If they are, you can apply the correction factors directly.
  • If you have multiple hic files and they are not normalized, you need to know why this has not been done. Maybe there is a good reason for it, e.g. if differences at certain read-coverage levels are being investigated.
  • If you need to normalize, I suggest normalizing to equal read coverage and then applying the correction via hicCorrectMatrix (see the sketch after this list).
  • normalization after convert format #570 is a bit different: that user mixes up the terms 'normalization' and 'correction'. What we wrote in that issue is that you need the correction factors stored in the specific column 'KR' if you want to apply them. However, the user calls it 'normalization' even though it is the correction.
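
As a rough sketch of that normalize-then-correct route (file names are placeholders):

hicNormalize -m sample1.cool sample2.cool --normalize smallest -o sample1_norm.cool sample2_norm.cool
hicCorrectMatrix correct --matrix sample1_norm.cool --correctionMethod KR --outFileName sample1_norm_corrected.cool
hicCorrectMatrix correct --matrix sample2_norm.cool --correctionMethod KR --outFileName sample2_norm_corrected.cool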

Best,

Joachim

@RomeroMatt
Author

Hi Joachim,
Thanks so much for the helpful response. I have been trying to test some things out, but I'm not sure if I am doing things correctly. My hic files have not been normalized as I analyzed them separately, so I can't see how they could be.
As mentioned, when converting using hic2cool, I do have the correction columns available, but obviously they are not applied without running hicConvertFormat with the --correction_name KR option set.

You mentioned normalizing first and then using hicCorrectMatrix - I am just a bit confused because this produces a pretty weird map compared to the other methods.

Here's an uncorrected map:
[image: testuncorrectedmatrix]

Here's a normalized and corrected map via hicNormalize and hicCorrectMatrix:
[image: h9std_normalized_corrected_test_matrix]

And here's a map that was normalized with hicNormalize and then run through hicConvertFormat with correction_name KR:
[image: h9std_norm_KR_test_matrix]

The second map seems to be pretty much wiping my data out, which is odd.
I hope this makes sense and clarifies where my confusion lies about normalization and correction. Thanks again!
-Matt

@joachimwolff
Collaborator

To which value range did you normalize? 0-1? Because you now have 10^0 instead of 10^5?!

@RomeroMatt
Author

Hi Joachim,
I know! That's what's weird - I have the normalize option set to "smallest". Could it be that my samples are too different with respect to sequencing depth? My lowest sample has ~450 million reads and my highest has ~1.8 billion. My first thought on seeing the figure was that maybe the correction method is getting applied twice: if correction factors are applied automatically, then perhaps by using hicCorrectMatrix I was somehow correcting twice?

@RomeroMatt
Author

Hi Joachim,
I just wanted to check in to see if you had any additional thoughts on the issues I'm running into with my data?

Another thought I had was that perhaps my dataset isn't being corrected all the way through. I am using my laptop to correct the data, and the hicexplorer website gives a timeframe for correction of hours with lots of RAM, while my hicCorrectMatrix runs on my laptop (16 GB RAM) averaged about 30 minutes. Maybe this is why my values are so low in my corrected maps?

As a reminder, I used hic2cool for this process and have the correction columns, but they are not applied. I am still confused about the difference between hicConvertFormat with the correction factor set to KR vs. hicCorrectMatrix using KR as the correction method.
I figured that using hicNormalize and then hicConvertFormat with the correction factor set to KR would solve the problem, but based on your previous answers it seems that this may not be correct.
Thanks again!
-Matt

@joachimwolff
Collaborator

Hi Matt,

sorry for the delayed answer.

With 1.8 billion reads, 16 GB of memory, and half an hour of runtime, there is no way this is not crashing, and if it doesn't, I don't know what it computes. With that read number, better go to 128 or maybe even 256 GB of memory; with fewer resources I would not even try it. Even with the 450 million reads, do not work on a notebook with 16 GB of memory. Hi-C (and especially the correction) is very memory-demanding. If you don't have the resources, I recommend the software-as-a-service approach we offer: hicexplorer.usegalaxy.eu.

The difference between hicConvertFormat with the correction factor set to KR and hicCorrectMatrix using KR is that in the first case, someone already computed the correction factors and you just need to apply them. In the second case, the factors are not present and need to be computed first, and that is what hicCorrectMatrix does.
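
Schematically, with placeholder file names:

# Case 1: KR factors are already stored in the file; just apply them:
hicConvertFormat -m matrix.cool -o matrix_corrected.cool --inputFormat cool --outputFormat cool --correction_name KR

# Case 2: no stored factors; compute them from the data:
hicCorrectMatrix correct --matrix matrix.cool --correctionMethod KR --outFileName matrix_corrected.cool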

The reason why you can't normalize and then apply the precomputed correction factors is simply that the underlying data changed with the normalization, so the stored factors are no longer the mathematically correct ones. If you apply them, the result might look alright, but that doesn't change the fact that they are wrong.

Best,

Joachim

@RomeroMatt
Author

Thanks so much, Joachim! What you're saying makes total sense. I knew I shouldn't be using 16 GB to analyze the data, but wanted to try for some reason lol. I do have an AWS account and have the jobs running there now.

Ah, the hicConvertFormat vs. hicCorrectMatrix distinction makes sense now. Thanks for the clarification - really, I should have just said it out loud and it would have made more sense!

I'll keep you updated on the AWS computing (Ubuntu Linux, 196 GB of RAM) - can we keep this open for a bit longer in case I run into any more problems?
Thanks so much again!
-Matt

@RomeroMatt
Author

RomeroMatt commented Jul 12, 2021

Hi Joachim,
So I have been testing a few things out over the last few days and I still can't seem to get the hicCorrectMatrix command to work properly - or my data is off somehow to begin with, though I don't think that's the case. Maybe you think otherwise?

I'll start from the beginning so you can have the full picture.
I have my .hic files, which I generated separately with juicer on Amazon AWS.
I converted all 3 of my datasets using these commands:

hicConvertFormat -m h9std/juicer_files/h9std_inter_30.hic -o h9std/hicexplorer_data/h9std_inter_30.cool --inputFormat hic --outputFormat cool

hicConvertFormat -m h9low/juicer_files/h9low_inter_30.hic -o h9low/hicexplorer_data/h9low_inter_30.cool --inputFormat hic --outputFormat cool

hicConvertFormat -m smpclow/juicer_files/smpclow_inter_30.hic -o smpclow/hicexplorer_data/smpclow_inter_30.cool --inputFormat hic --outputFormat cool

These commands generate an .mcool file with all my resolutions (2.5mb, 1mb, 500kb, 250kb, 100kb, 50kb, 25kb, 10kb, 5kb, 1kb) for each dataset.

From there, I normalize each resolution individually across the datasets.
Here's my code for doing that for 2.5mb resolution:

hicNormalize -m h9std/hicexplorer_data/h9std_inter_30.mcool::/resolutions/2500000 h9low/hicexplorer_data/h9low_inter_30.mcool::/resolutions/2500000 smpclow/hicexplorer_data/smpclow_inter_30.mcool::/resolutions/2500000 --normalize smallest -o h9std/hicexplorer_data/h9std_inter_30_2.5mb_norm.cool h9low/hicexplorer_data/h9low_inter_30_2.5mb_norm.cool smpclow/hicexplorer_data/smpclow_inter_30_2.5mb_norm.cool

From there, I then correct each resolution individually.
Here's the code for 2.5mb resolution:
hicCorrectMatrix correct --matrix h9std/hicexplorer_data/h9std_inter_30_2.5mb_norm.cool --correctionMethod KR --outFileName h9std/hicexplorer_data/h9std_inter_30_2.5mb_norm_KRCorrected.cool

I did that for each resolution (2.5mb all the way down to 1kb) - essentially the loop below. I did all of this on an AWS EC2 instance: Ubuntu 20.04 on an m4.16xlarge with 64 vCPUs, 256 GB of memory, and 600 GB of EBS storage. Similar to when I was using my laptop, correction took ~1 min to ~30 mins for the 2.5mb to 1kb resolutions, respectively, which is much shorter than what is on the hicexplorer tutorial page. I then tried the hicexplorer tool on Galaxy and got similar results with the 2.5mb resolution matrix - I didn't try 1kb resolution yet because I was getting the same results at 2.5mb.
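
The per-resolution runs were essentially this loop (resolution labels match my file naming above):

for res in 2.5mb 1mb 500kb 250kb 100kb 50kb 25kb 10kb 5kb 1kb; do
    hicCorrectMatrix correct --matrix h9std/hicexplorer_data/h9std_inter_30_${res}_norm.cool --correctionMethod KR --outFileName h9std/hicexplorer_data/h9std_inter_30_${res}_norm_KRCorrected.cool
done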

I don't know if it's something I'm doing wrong in the steps before hicCorrectMatrix, if my dataset is just off, or if I need to convert from cool to h5?

I also ran some diagnostic plots from 50kb resolution and those are pasted below:
Here's a plot of the raw data from the diagnostic plot sub-command:
hicCorrectMatrix diagnostic_plot --matrix h9std/hicexplorer_data/h9std_inter_30.mcool::/resolutions/50000 -o hicexplorer_figs/h9std_raw_50kb_diag_plot.png
[image: h9std_raw_50kb_diag_plot]

And here's diagnostic plots of normalized and then normalized and corrected data, respectively.
[image: h9std_norm_50kb_diag_plot]
[image: h9std_norm_corrected_50kb_diag_plot]

I also ran a plot for distribution vs. counts as well, which is pasted below as raw, norm, and norm with correction, respectively.
[image: h9std_low_smpclow_distvscounts_test_raw_50kb]
[image: h9std_low_smpclow_distvscounts_test_norm_50kb]
[image: h9std_low_smpclow_distvscounts_test_norm_corrected_50kb]

Lastly, I made a regional matrix plot on Chromosome 1 from 10-20mb with 50kb resolution.
The commands used to make those are as follows, for normalized and then normalized-with-correction, respectively:
hicPlotMatrix -m h9std/hicexplorer_data/h9std_inter_30_50kb_norm.cool -o hicexplorer_figs/h9std_norm_test_50kb_chr1.png --log1p --region 1:10000000-20000000

hicPlotMatrix -m h9std/hicexplorer_data/h9std_inter_30_50kb_norm_KRCorrected.cool -o hicexplorer_figs/h9std_norm_corrected_test_50kb_chr1.png --log1p --region 1:10000000-20000000

[image: h9std_norm_test_50kb_chr1]
[image: h9std_norm_corrected_test_50kb_chr1]

I'm not sure where all my reads are going! It just makes no sense that I start with 10^5 counts and basically go down to zero. I apologize for the long message and I hope it's something small that I am messing up.
Thanks so much for the help!
-Matt

@joachimwolff
Collaborator

joachimwolff commented Jul 14, 2021

Hi Matt,

I can confirm this bug with the KR correction and cool files. We can't really explain it at the moment, because we have not changed the code section in question in the last two years.

We will come up with a patch as soon as we have identified what goes wrong. In the meantime, the workaround I can offer you is to convert the cool files to h5 as follows:

hicConvertFormat -m matrix.cool --inputFormat cool --outputFormat h5 --load_raw_values -o matrix_raw_values.h5

Running the KR correction on the h5 file then works.
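
For example, continuing with the placeholder name from above:

hicCorrectMatrix correct --matrix matrix_raw_values.h5 --correctionMethod KR --outFileName matrix_corrected.h5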

A short explanation of what is wrong: the KR correction normalizes the data in a way that the sum is 1, leading to the small values you have at the moment. This behaviour is normal, and some competing tools only offer it that way. What we apply after the correction is a scaling of the values back to the original data range - for example, if the raw matrix total is ~10^5, the sum-to-one corrected matrix gets multiplied back up by that total. For reasons we still need to identify, this does not work for cool files.

Thanks for reporting this issue,

Joachim

@joachimwolff
Collaborator

Hi Matt,

the correction factors had not been scaled back to the original data range, as I suspected in my last post. A fix for this is available in the development branch. Feel free to try it.

Best,

Joachim

@RomeroMatt
Author

RomeroMatt commented Jul 14, 2021

Hi Joachim,
Thanks so much for the fix! Excited to try it... Noob question though - how do I get the development branch onto my AWS instance? Would I just reinstall it using conda, and would it automatically update?

@joachimwolff
Copy link
Collaborator

joachimwolff commented Jul 15, 2021

Hi Matt,

a bit of work is required:

  • Download the git repository of HiCExplorer: git clone https://github.com/deeptools/HiCExplorer.git
  • cd HiCExplorer and git checkout develop
  • Create a new conda environment: conda create --name hic_dev --file requirements.txt
  • Activate the conda environment: conda activate hic_dev
  • Install the dev branch: python setup.py install
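
Alternatively, if you already have a working environment, installing straight from the branch with pip should also work (untested sketch): pip install git+https://github.com/deeptools/HiCExplorer.git@develop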

Best,

Joachim

@RomeroMatt
Author

Thanks for the info, Joachim! I have since just been using the workaround you suggested, converting to an .h5 file. From there, I normalized and then corrected using hicCorrectMatrix.

I am, however, running into another problem. I recently ran hicPCA using the structure suggested on the hicexplorer page:
hicPCA -m h9std/hicexplorer_data/h5/norm_KRCorrected/h9std_2.5mb_norm_KRCorrected.h5 \
  -o h9std/hicexplorer_data/pca_pearson_obsexp/h9std_2.5mb_pca1.bedgraph \
     h9std/hicexplorer_data/pca_pearson_obsexp/h9std_2.5mb_pca2.bedgraph \
  -pm h9std/hicexplorer_data/pca_pearson_obsexp/h9std_2.5mb_pearson.h5 \
  -oem h9std/hicexplorer_data/pca_pearson_obsexp/h9std_2.5mb_obsexp.h5

It seemed to run smoothly. I then tried to use those output files for hicCompartmentalization with this command:
hicCompartmentalization -m h9std/hicexplorer_data/pca_pearson_obsexp/h9std_2.5mb_obsexp.h5 --pca h9std/hicexplorer_data/pca_pearson_obsexp/h9std_2.5mb_pca1.bedgraph -o h9std/hicexplorer_data/pca_pearson_obsexp/h9std_compartments_2.5mb_global_signal.png --outputMatrix h9std/hicexplorer_data/pca_pearson_obsexp/h9std_2.5mb_compartment_output.npz

However, I keep getting an error:
"Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/hicexplorer/bin/hicCompartmentalization", line 7, in
main()
File "/home/ubuntu/anaconda3/envs/hicexplorer/lib/python3.8/site-packages/hicexplorer/hicCompartmentalization.py", line 176, in main
pc1 = pd.read_table(args.pca, header=None, sep="\t",
File "/home/ubuntu/anaconda3/envs/hicexplorer/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/envs/hicexplorer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 683, in read_table
return _read(filepath_or_buffer, kwds)
File "/home/ubuntu/anaconda3/envs/hicexplorer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 482, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/home/ubuntu/anaconda3/envs/hicexplorer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 811, in init
self._engine = self._make_engine(self.engine)
File "/home/ubuntu/anaconda3/envs/hicexplorer/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "/home/ubuntu/anaconda3/envs/hicexplorer/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in init
self._reader = parsers.TextReader(self.handles.handle, **kwds)
File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.cinit
File "pandas/_libs/parsers.pyx", line 745, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte"

It's my understanding that I can use the obs/exp matrix created during hicPCA just the same as one created with hicTransform, so I should be able to plug it right into hicCompartmentalization - or is that not correct?

I am not sure if it's my files or if I'm just inputting the commands wrong. I looked at my pca1.bedgraph file in my text editor and the first few lines look like this:
26fc 8f88 0400 0100 5801 0000 0000 0000
7602 0000 0000 0000 c32f 0000 0000 0000
0000 0000 0000 0000 0000 0000 3001 0000
0000 0000 0080 0000 0000 0000 0000 0000
704a 5b02 0000 0000 f732 0000 0000 0000

I'm not familiar with the structure, but it seems odd that there are letters in there? When I look at my .h5 files, they also have letters - so maybe they're supposed to be there? I apologize for the naive questions!

@joachimwolff
Collaborator

Hi Matt,

for the PCA: you need to specify the file format: https://github.com/deeptools/HiCExplorer/blob/master/hicexplorer/hicPCA.py#L64. The default is bigwig, and therefore the 'numbers and letters' you see are hexadecimal numbers representing the binary bigwig format. That is also why you get the error in hicCompartmentalization: it expects bedgraph files, not bigwig. Likewise, h5 is a binary format based on hdf5, and the same is true for cool files. Binary files have major advantages; handling Hi-C matrices as regular text files is, in my opinion, a very naive and error-prone solution - but I know that competing Hi-C analysis software handles it that way. To open these files you need software that knows how to read the data: for the internal structures of h5 or cool files, either access via the API or a tool like the HDF viewer.
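
Concretely, something like this should produce bedgraph output that hicCompartmentalization can read (paths shortened to placeholders):

hicPCA -m h9std_2.5mb_norm_KRCorrected.h5 --format bedgraph -o pca1.bedgraph pca2.bedgraph -pm pearson.h5 -oem obsexp.h5
hicCompartmentalization -m obsexp.h5 --pca pca1.bedgraph -o compartments.png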

Best,

Joachim

@RomeroMatt
Author

Hi Joachim,
Thanks so much for the help! I apologize for the naive questions, and I appreciate your input and patience. Good to know I simply need to specify the format - I knew that, but I guess I forgot, since I specified the extension as .bedgraph and just didn't set the format option.
I'll start that process today and see if I get it worked out! Now that I'm getting more comfortable with these tools, they really are great!
Thanks again,
Matt

@RomeroMatt
Author

Hi Joachim,
Just wanted to update you that everything has been working great since receiving your help/suggestions. I had a couple more general questions:

  1. I have all my data normalized and corrected in .h5 file format. I was planning on using some of the tools in GENOVA; could I convert my normalized and corrected .h5 files back to .cool files with all the data staying the same? I'm assuming so, but just wanted to check.

  2. I have been playing with the hicFindTADs tool and was wondering if there are some resources you'd suggest for picking specific parameters for the tool? I'm sure it is context-dependent, but I would just like to make sure I'm picking the most appropriate ones for my question. So far I have just used the suggested parameters on the example pages.

Thanks so much!
-Matt

@joachimwolff
Collaborator

I have all my data normalized and corrected in .h5 file format. I was planning on using some of the tools in GENOVA; could I convert my normalized and corrected .h5 files back to .cool files and all the data will stay the same? I'm assuming so, but just wanted to check.

More or less: for h5 files, the correction factors are applied directly to the data; therefore, the values in the files created via h5 -> cool will be the same.
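
For example, with placeholder names:

hicConvertFormat -m matrix_norm_KRCorrected.h5 -o matrix_norm_KRCorrected.cool --inputFormat h5 --outputFormat cool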

I have been playing with the hicFindTADs tool and was just wondering if there were some resources you'd suggest for picking specific parameters for the tool? I'm sure that it is context-dependent, but would just like to make sure I'm picking the most appropriate ones for my question. So far I have just used the suggested parameters in the example pages.

I can't give you a 'one fits all' solution. The best way is to compute the TADs and plot them with pyGenomeTracks to see how well they match, and simply redo this until you have a good setting. The general issue with TADs is that if you have 20 algorithms, they will give you 20 different solutions...
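
One iteration of that loop could look like this (a sketch only; the file names are placeholders and the hicFindTADs parameter values are starting points to tune, not recommendations):

hicFindTADs -m matrix_50kb_corrected.cool --outPrefix tads_50kb --minDepth 150000 --maxDepth 500000 --step 100000 --correctForMultipleTesting fdr
make_tracks_file --trackFiles matrix_50kb_corrected.cool tads_50kb_domains.bed -o tracks.ini
pyGenomeTracks --tracks tracks.ini --region 1:10000000-20000000 --outFileName tads_50kb_chr1.png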
