Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Q: HMMRATAC producing too many peaks #636

Open
igordot opened this issue Apr 3, 2024 · 10 comments
Open

Q: HMMRATAC producing too many peaks #636

igordot opened this issue Apr 3, 2024 · 10 comments

Comments

@igordot
Copy link

igordot commented Apr 3, 2024

I ran HMMRATAC on two samples that should be biological replicates. The quality difference between them is high, although the quality is far from ideal for both. However, the low quality sample produced more than 10x as many peaks (<10k versus >100k) with the default settings.

For comparison, the Java version of HMMRATAC produced <1k peaks for the first sample and no peaks for the second sample (score cutoff of 30).

Here is a genome browser screenshot to illustrate the situation:
image

The called peaks for the first sample look like peaks. Not so much for the second sample. Is this expected? Are there some guidelines to help avoid this?

@taoliu
Copy link
Contributor

taoliu commented Apr 8, 2024

One thing you can try is to increase the cutoff for calling candidate regions in the macs3 version of hmmratac using the option for prescan-cutoff -c. And perhaps first run the --cutoff-analysis in hmmratac to decide the OK parameter for -c. Also it is likely that the cutoff values for -u and -l for training HMM have to be adjusted as well. Check https://macs3-project.github.io/MACS/docs/hmmratac.html#choices-of-cutoff-values

Please let me know if this can help.

@igordot
Copy link
Author

igordot commented Apr 8, 2024

I saw that, but wasn't sure how to properly use the output to determine the optimal values. Do you have an example documented somewhere?

@taoliu
Copy link
Contributor

taoliu commented Apr 9, 2024

The cutoff-analysis for hmmratac command is discussed here: https://macs3-project.github.io/MACS/docs/hmmratac.html#choices-of-cutoff-values

And a similar cutoff-analysis for callpeak command is discussed here: https://macs3-project.github.io/MACS/docs/callpeak.html#cutoff-analysis

And if you can generate the cutoff analysis report from hmmratac --cutoff-analysis-only and share with me here, we can analyze the report together in this thread.

@igordot
Copy link
Author

igordot commented Apr 9, 2024

I am sharing the cutoff report here (I had to add .txt because curiously GitHub does not allow .tsv):
test_cutoff_analysis.tsv.txt

It might even be easier to just paste it in whole:

score	npeaks	lpeaks	avelpeak
638.79	1	234	234.00
630.39	1	236	236.00
621.98	1	267	267.00
613.58	1	267	267.00
605.17	1	277	277.00
596.77	1	280	280.00
588.36	1	291	291.00
579.96	1	292	292.00
571.55	3	794	264.67
563.15	2	1972	986.00
554.74	2	2523	1261.50
546.34	3	2902	967.33
537.93	3	2913	971.00
529.53	3	2920	973.33
521.12	3	2925	975.00
512.71	3	2955	985.00
504.31	3	2989	996.33
495.90	3	3183	1061.00
487.50	3	3202	1067.33
479.09	3	3257	1085.67
470.69	3	3407	1135.67
462.28	3	3458	1152.67
453.88	3	3501	1167.00
445.47	3	3505	1168.33
437.07	3	3510	1170.00
428.66	3	3563	1187.67
420.26	3	3639	1213.00
411.85	3	3678	1226.00
403.45	3	3735	1245.00
395.04	3	4062	1354.00
386.64	3	4456	1485.33
378.23	3	4507	1502.33
369.83	3	4511	1503.67
361.42	3	4539	1513.00
353.02	3	4568	1522.67
344.61	3	4577	1525.67
336.21	3	4677	1559.00
327.80	3	4704	1568.00
319.40	3	5206	1735.33
310.99	2	6264	3132.00
302.58	3	6392	2130.67
294.18	3	6401	2133.67
285.77	3	6412	2137.33
277.37	3	6426	2142.00
268.96	3	6467	2155.67
260.56	3	6484	2161.33
252.15	3	6502	2167.33
243.75	3	6508	2169.33
235.34	4	6635	1658.75
226.94	5	6762	1352.40
218.53	5	6785	1357.00
210.13	6	6967	1161.17
201.72	6	6986	1164.33
193.32	6	7039	1173.17
184.91	6	7067	1177.83
176.51	7	7301	1043.00
168.10	7	7333	1047.57
159.70	8	7900	987.50
151.29	8	8338	1042.25
142.89	10	9056	905.60
134.48	13	9942	764.77
126.08	14	11087	791.93
117.67	18	12245	680.28
109.27	23	18535	805.87
100.86	27	25300	937.04
92.46	31	34402	1109.74
84.05	30	45057	1501.90
75.65	31	52352	1688.77
67.24	30	63203	2106.77
58.84	29	73885	2547.76
50.43	26	82657	3179.12
42.03	27	88598	3281.41
33.62	28	92070	3288.21
25.22	46	101435	2205.11
16.81	414	255592	617.37
8.40	14319	7995323	558.37
0.00	64045	2385081806	37240.72

@taoliu
Copy link
Contributor

taoliu commented Apr 9, 2024

The fold change values for reasonable peaks are within the lower range. The default parameters for hmmratac are -u 20 -l 10 -c 1.2. It's possible that we need to increase the -c parameter a bit higher. I can't judge since the current npeak, lpeak, and avelpeak values of the score cutoff range 0.00-8.40 show huge difference and we don't have a finer resolution of the analysis between 0 and 8.4. You can try to use -u 20 -l 8 -c 5, and to see if the result can improve. We will try to add an option for fine-tuning this cutoff analysis to get better resolution in the future. At the meantime, we are adding a different emission model -- Poisson emission for HMM, as an addition to the current Gaussian emission model. In our internal test on some small dataset, it gives us more similar results as the java version HMMRATAC.

How is the current HMM from hmmratac? Could you share us the command line output as https://macs3-project.github.io/MACS/docs/hmmratac.html#tune-the-hmm-model?

@Jungal10
Copy link

I ran HMMRATAC on two samples that should be biological replicates. The quality difference between them is high, although the quality is far from ideal for both. However, the low quality sample produced more than 10x as many peaks (<10k versus >100k) with the default settings.

For comparison, the Java version of HMMRATAC produced <1k peaks for the first sample and no peaks for the second sample (score cutoff of 30).

Here is a genome browser screenshot to illustrate the situation: image

The called peaks for the first sample look like peaks. Not so much for the second sample. Is this expected? Are there some guidelines to help avoid this?

Not a direct answer, but the scale on your IGV is quite different as well between the two

@igordot
Copy link
Author

igordot commented Apr 11, 2024

Not a direct answer, but the scale on your IGV is quite different as well between the two

Yes, each sample is auto-scaled independently since peak-calling is independent as well.

@taoliu
Copy link
Contributor

taoliu commented Apr 11, 2024

Also related to #638, it's highly possible that the hidden markov model learned from the low quality data can't capture the expected chromatin structure around accessible sites. This is a common problem of "machine learning". In this case, perhaps a general peak calling process such as using "callpeak" function is more appropriate.

@igordot
Copy link
Author

igordot commented Apr 11, 2024

I am not surprised that machine learning is failing for these samples. I was primarily concerned about the large difference between the Java and MACS3 implementations.

@taoliu
Copy link
Contributor

taoliu commented Apr 22, 2024

We are implementing a different emission model for HMM in MACS3 called the Poisson Emission model. During our internal tests, this simple model (compared with the Gaussian Emission model) generated results that were similar to those of the Java implementation. Stay tuned if you are interested in it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants