Question about the simulation dataset #4

FangfeiHu · 2021-08-15T08:24:08Z

Hi,

I'm a master's student at the University of Melbourne, and my research project is also on developing an approach for tandem repeat detection. To evaluate the performance of our approach, my supervisor suggested comparing it with TideHunter using the 15 simulation datasets mentioned in your paper. Would you mind telling me where I can find the simulation data?

Many thanks. I look forward to your reply.

Kind regards,
Fangfei

yangao07 · 2021-08-15T11:55:09Z

Hi Fangfei,

I just uploaded the simulation datasets used in the paper to the GitHub repo. Please try them out.

Yan

FangfeiHu · 2021-08-15T12:00:49Z

Thank you very much!!

FangfeiHu · 2021-08-16T06:12:26Z

Hi Yan,

I just tried one simulation dataset with TideHunter, but I only got 26 records in the result file. I'm not sure why this happens. Could you help me figure it out?
The dataset is sim_e0.13_s1000_c10 and here is my command:
TideHunter -f 2 TideHunter/simulation/err_rate/sim_e0.13_s1000_c10/sim.fa > sim_e0.13_s1000_c10.out

Kind regards,
Fangfei

yangao07 · 2021-08-16T09:14:29Z

Hi Fangfei,

Thanks for pointing this out.
This is actually a careless bug related to the recent update of the submodule (abPOA).
It is fixed now. Please try out the latest version.

Yan

yangao07 · 2021-08-16T09:15:46Z

Also, TideHunter can output multiple tandem repeats when more than one is found.
If you want to reproduce the results in our paper, you may need to add the -l parameter.

Yan

FangfeiHu · 2021-08-17T02:01:34Z

Thank you, Yan. I tried the latest TideHunter with sim_e0.13_s1000_c10 and it works well. One more thing I'd like to confirm: did you also improve the accuracy of TideHunter? In my results, the repeat unit (consensus length) is 100% accurate, whereas the paper reports 99.9%.

Fangfei

FangfeiHu · 2021-08-17T08:03:28Z

Hi Yan,

Sorry to bother you again. I'm trying to generate more simulation datasets with different repeat pattern sizes (for example, 20 and 50), but I'm a little confused about how to use pbsim. Would you mind sharing the commands you used to generate the simulation data? I'm also wondering how to randomly extract sequences from the reference genome.

Kind regards,
Fangfei

yangao07 · 2021-08-17T10:12:57Z

The repeat pattern size and copy number were set without using pbsim.
This was done by a custom Python script, which extracts a random sequence of a specific length (the repeat pattern size) and repeats it multiple times (the copy number) to generate a tandem repeat.
Sorry, I no longer have the script.

The different error rates and error ratios were set directly by passing different parameters to pbsim.

Yan
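
For readers who want to generate their own datasets, here is a minimal Python sketch of the procedure described above. It is not the original script (which is no longer available); the function name `make_tandem_repeat`, its parameters, and the toy random reference are all hypothetical. In practice the reference string would be loaded from a genome FASTA file, and the resulting sequence would be written out and passed to pbsim to add sequencing errors.

```python
import random


def make_tandem_repeat(reference, unit_len, copy_num, rng):
    """Extract a random unit of unit_len bases and repeat it copy_num times.

    Hypothetical helper: the original script is unavailable, so this only
    follows the description in the comment above.
    """
    if unit_len > len(reference):
        raise ValueError("repeat unit is longer than the reference")
    # Choose a start position so the unit fits entirely inside the reference.
    start = rng.randrange(len(reference) - unit_len + 1)
    unit = reference[start:start + unit_len]
    return unit, unit * copy_num


# Toy stand-in for a reference genome (assumption: a real run would read FASTA).
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(10_000))

unit, repeat = make_tandem_repeat(reference, unit_len=50, copy_num=10, rng=rng)
print(len(unit), len(repeat))  # 50 500
```

Passing an explicit `random.Random` instance keeps the extraction reproducible, which helps when regenerating the same dataset for several tools.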

yangao07 · 2021-08-17T10:14:09Z

> Thank you, Yan. I tried the latest TideHunter with sim_e0.13_s1000_c10 and it works well. One more thing I'd like to confirm: did you also improve the accuracy of TideHunter? In my results, the repeat unit (consensus length) is 100% accurate, whereas the paper reports 99.9%.
>
> Fangfei

The current version of TideHunter has some improvements over the old one, so this is expected.
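
For anyone repeating this comparison, the following is one possible way to compute the fraction of records whose consensus length matches the simulated repeat unit size. It is only a sketch: the paper's exact accuracy definition may differ, the function name is hypothetical, and the column index of the consensus length (`cons_len_col=5`, zero-based) is an assumption about the tabular output that should be checked against the TideHunter README. The demo rows are made up for illustration and are not real TideHunter output.

```python
import io


def consensus_length_accuracy(lines, expected_len, cons_len_col=5, tolerance=0):
    """Return the fraction of records whose consensus length is within
    `tolerance` of `expected_len`.

    cons_len_col is an ASSUMED zero-based column index for the consensus
    length in tab-separated output; verify it for your TideHunter version.
    """
    total = 0
    correct = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        cons_len = int(fields[cons_len_col])
        total += 1
        if abs(cons_len - expected_len) <= tolerance:
            correct += 1
    return correct / total if total else 0.0


# Made-up rows with a hypothetical layout, used only to exercise the function.
demo = io.StringIO(
    "read1\trep0\t10200\t101\t10100\t1000\t10.0\n"
    "read2\trep0\t10200\t101\t10100\t998\t10.0\n"
)
demo_accuracy = consensus_length_accuracy(demo, expected_len=1000)
print(demo_accuracy)  # 0.5
```

For a real result file, pass an open file handle instead of the `io.StringIO` demo, and set `expected_len` to the simulated repeat pattern size (1000 for sim_e0.13_s1000_c10, judging by the dataset name).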

FangfeiHu · 2021-08-18T06:48:04Z

> The repeat pattern size and copy number were set without using pbsim.
> It was done by a customized python script, which extracts a random sequence with a specific length (repeat pattern size) and copies it by multiple times (copy number) to generate a tandem repeat.
> Sorry that I no longer have the script now.
>
> The different error rates and error ratios were set directly by feeding different parameters to pbsim program.
>
> Yan

May I ask about the base frequencies when generating random flanking sequences? Did you use equal base frequencies?

yangao07 · 2021-08-18T10:15:00Z

The 100 bp sequence was also randomly extracted from the reference genome.
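
A minimal sketch of how such flanking sequences could be attached to a tandem repeat, drawing them from the reference rather than from fixed base frequencies. This is not the original script. The helper names are hypothetical, and placing an independently drawn flank on each side of the repeat is an assumption; the thread only states that the 100 bp sequence was extracted at random from the reference genome.

```python
import random


def random_slice(reference, length, rng):
    """Return a random substring of the given length from the reference."""
    if length > len(reference):
        raise ValueError("requested slice is longer than the reference")
    start = rng.randrange(len(reference) - length + 1)
    return reference[start:start + length]


def add_flanks(repeat, reference, flank_len, rng):
    """Attach flanks drawn from the reference to both ends of the repeat.

    ASSUMPTION: one flank per side, each drawn independently.
    """
    left = random_slice(reference, flank_len, rng)
    right = random_slice(reference, flank_len, rng)
    return left + repeat + right


# Toy stand-in for a reference genome (a real run would read a FASTA file).
rng = random.Random(1)
reference = "".join(rng.choice("ACGT") for _ in range(5_000))

unit = random_slice(reference, 20, rng)   # repeat pattern of size 20
repeat = unit * 10                        # copy number 10
flanked = add_flanks(repeat, reference, flank_len=100, rng=rng)
print(len(flanked))  # 400
```

Because the flanks come from the reference, their base composition follows the genome rather than a uniform distribution over A, C, G, and T.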

FangfeiHu · 2021-08-18T10:59:48Z

> The 100 bp sequence was also randomly extracted from the reference genome.

Thanks a lot! For the different repeat sizes, I noticed you used a 15% error rate. Is the error ratio the same as in 15%-a or 15%-b? Sorry for so many questions...

Fangfei

yangao07 · 2021-08-18T13:13:16Z

They are different. Please refer to the TideHunter paper published in Bioinformatics.
