Question about the simulation dataset #4

Open
FangfeiHu opened this issue Aug 15, 2021 · 13 comments

@FangfeiHu

Hi,

I'm a master's student at the University of Melbourne, and my research project is also to develop an approach for tandem repeat detection. To evaluate the performance of our approach, my supervisor suggested that I compare it with TideHunter using the 15 simulation datasets mentioned in your paper. Would you mind telling me where I could find the simulation data?

Many thanks, and I look forward to your reply.

Kind regards,
Fangfei

@yangao07
Owner

Hi Fangfei,

I just uploaded the simulation datasets used in the paper to the GitHub repo. Please try them out.

Yan

@FangfeiHu
Author

Thank you very much!!

@FangfeiHu
Author

Hi Yan,

I just tried one simulation dataset with TideHunter, but I only got 26 records in the result file. I'm not sure why this happened. Could you help me figure it out?
The dataset is sim_e0.13_s1000_c10 and here is my command:
TideHunter -f 2 TideHunter/simulation/err_rate/sim_e0.13_s1000_c10/sim.fa > sim_e0.13_s1000_c10.out

Kind regards,
Fangfei
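
A quick way to sanity-check a result like this is to compare the number of reads in the input FASTA with the number of records in the output file. The sketch below assumes the tabular output (-f 2) holds one record per non-empty line with no header line; that layout is an assumption, and the helper names are hypothetical. Since a read can yield more than one record, the two counts need not match exactly, but a very small record count stands out.

```python
def count_fasta_reads(fasta_path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_tabular_records(out_path):
    """Count non-empty lines in a tabular result file.

    Assumption: one record per line and no header line.
    """
    with open(out_path) as handle:
        return sum(1 for line in handle if line.strip())
```

For example, calling count_fasta_reads on sim.fa and count_tabular_records on sim_e0.13_s1000_c10.out gives the two numbers to compare.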

@yangao07
Owner

Hi Fangfei,

Thanks for pointing this out.
This is actually a careless bug related to the recent update of the submodule (abPOA).
It is fixed now. Please try out the latest version.

Yan

@yangao07
Owner

Also, TideHunter can output multiple tandem repeats when it finds more than one.
If you want to reproduce the results in our paper, you may need to add the -l parameter.

Yan

@FangfeiHu
Author

Thank you, Yan. I tried the latest TideHunter with sim_e0.13_s1000_c10 and it works well. One more thing I'd like to confirm: did you also improve the accuracy of TideHunter? In my results, the repeat unit (consensus length) is 100% accurate, whereas your paper reports 99.9%.

Fangfei

@FangfeiHu
Author

Hi Yan,

Sorry to bother you again. I'm trying to generate more simulation datasets with different repeat pattern sizes (for example, 20 and 50), but I'm a little confused about how to use pbsim. Would you mind sharing the commands you used to generate the simulation data? I'm also wondering how to randomly extract sequences from the reference genome.

Kind regards,
Fangfei

@yangao07
Owner

The repeat pattern size and copy number were set without using pbsim.
It was done with a custom Python script, which extracts a random sequence of a specific length (the repeat pattern size) and copies it multiple times (the copy number) to generate a tandem repeat.
Sorry, I no longer have the script.

The different error rates and error ratios were set directly by passing different parameters to pbsim.

Yan
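
Since the original script is no longer available, here is a minimal sketch of the procedure described above: pick a random window of the reference with the requested pattern size and repeat it copy-number times. This is not the author's script. The function name, its parameters, and the toy reference string are hypothetical, and the reference is assumed to be already loaded as a plain string of bases.

```python
import random


def make_tandem_repeat(reference, pattern_size, copy_number, rng=None):
    """Build a tandem repeat from a random window of `reference`.

    reference    -- reference sequence as a string (assumed already loaded)
    pattern_size -- length of the repeat unit to extract
    copy_number  -- how many times the unit is copied back to back
    rng          -- optional random.Random, for reproducible draws
    """
    if pattern_size <= 0 or copy_number <= 0:
        raise ValueError("pattern_size and copy_number must be positive")
    if pattern_size > len(reference):
        raise ValueError("pattern_size exceeds reference length")
    rng = rng or random.Random()
    # Choose a start so that the whole unit fits inside the reference.
    start = rng.randrange(len(reference) - pattern_size + 1)
    unit = reference[start:start + pattern_size]
    return unit * copy_number


if __name__ == "__main__":
    # Toy reference for illustration only; a real run would read a genome.
    toy_reference = "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGC" * 4
    repeat = make_tandem_repeat(toy_reference, pattern_size=20,
                                copy_number=10, rng=random.Random(42))
    print(len(repeat))
```

The resulting sequence would then be written to a FASTA file and passed to pbsim, where the error rate and error ratio are set.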

@yangao07
Owner

> Thank you, Yan. I tried the latest TideHunter with sim_e0.13_s1000_c10 and it works well. One more thing I'd like to confirm: did you also improve the accuracy of TideHunter? In my results, the repeat unit (consensus length) is 100% accurate, whereas your paper reports 99.9%.
>
> Fangfei

The current version of TideHunter has some improvements over the old one, so this is expected.

@FangfeiHu
Author

> The repeat pattern size and copy number were set without using pbsim.
> It was done with a custom Python script, which extracts a random sequence of a specific length (the repeat pattern size) and copies it multiple times (the copy number) to generate a tandem repeat.
> Sorry, I no longer have the script.
>
> The different error rates and error ratios were set directly by passing different parameters to pbsim.
>
> Yan

May I ask about the base frequencies when generating random flanking sequences? Did you use equal base frequencies?

@yangao07
Owner

The 100 bp sequence was also randomly extracted from the reference genome.
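
As a rough illustration of drawing flanking sequence from the reference rather than from fixed base frequencies, the sketch below samples random windows and places them around a repeat. The thread does not say where the 100 bp sequence goes, so putting one window on each side is an assumption, and the function names are hypothetical.

```python
import random


def random_window(reference, length, rng):
    """Return a random substring of `reference` with the given length."""
    if length > len(reference):
        raise ValueError("window length exceeds reference length")
    start = rng.randrange(len(reference) - length + 1)
    return reference[start:start + length]


def add_flanks(repeat, reference, flank_len=100, rng=None):
    """Surround `repeat` with flanks sampled from `reference`.

    Assumption: one flank of `flank_len` bases on each side.
    """
    rng = rng or random.Random()
    left = random_window(reference, flank_len, rng)
    right = random_window(reference, flank_len, rng)
    return left + repeat + right
```

Because the flanks are copied from the reference, their base composition follows the genome's own composition.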

@FangfeiHu
Author

> The 100 bp sequence was also randomly extracted from the reference genome.

Thanks a lot! For the different repeat sizes, I noticed you used a 15% error rate. Is the error ratio the same as in 15%-a or 15%-b? Sorry for asking so many questions...

Fangfei

@yangao07
Owner

They are different. Please refer to the TideHunter paper published in Bioinformatics.
