kallisto index freezes on Ensembl database #210

Open
clydeandforth opened this issue May 22, 2019 · 0 comments

Hi all,

I downloaded the cDNA FASTA files and GTF files for the fungal database in Ensembl, concatenated them into a single FASTA file and a single GTF file, and then tried to index them. However, the index command fails with the following error, even on a 382G high-memory node:

kallisto index -i test.gtf.gz test.fa.gz

[build] loading fasta file test.fa.gz
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10) from 4479 target sequences
[build] warning: replaced 2455576 non-ACGUT characters in the input sequence with pseudorandom nucleotides
[build] counting k-mers ... terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
/var/log/slurm/spool_slurmd/job6981046/slurm_script: line 20: 136148 Aborted kallisto index -i test.fa.gz

The test data completes with the kallisto index command. Is this a memory issue? Do I need more than a 382G node with 2 CPUs and 40 cores? Here is my system information:

kallisto 0.45.1
(GNU libc) 2.17
Linux RedHatEnterpriseServer
Red Hat Enterprise Linux Server release 7.4 (Maipo)

Input files:

database size:
test.gtf.gz 1.3G
test.fa.gz 4.3G
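
Since the compressed size says little about how much sequence actually goes into the index, a rough tally of the FASTA may help when judging whether this is a memory issue. Below is a minimal sketch in Python (not part of kallisto) that counts records, total bases, and characters outside A/C/G/T/U. The function names `fasta_stats` and `gz_fasta_stats` are hypothetical, and the idea that memory use during indexing grows with the amount of input sequence is an assumption, not something confirmed here.

```python
import gzip


def fasta_stats(lines):
    """Return (n_records, n_bases, n_non_acgtu) for an iterable of FASTA lines.

    Hypothetical helper for illustration only.
    """
    n_records = 0
    n_bases = 0
    n_other = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1  # header line starts a new record
            continue
        n_bases += len(line)
        # Count characters that are not plain nucleotides (e.g. N)
        n_other += sum(1 for c in line.upper() if c not in "ACGTU")
    return n_records, n_bases, n_other


def gz_fasta_stats(path):
    """Stream a gzip-compressed FASTA file without loading it into memory."""
    with gzip.open(path, "rt") as handle:
        return fasta_stats(handle)
```

Calling `gz_fasta_stats("test.fa.gz")` would return the three counts for the concatenated file; the last one can be compared against the non-ACGUT figure in the kallisto warning above.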

I removed most of the header information from the fasta, here is a chunk of the file:

test.fa.gz

>SAM02534
ATGCCTTCCCTGTCCCGAGTGATTAACCATCCTCTGTTTAACGTTGTCTTCTTTTTGCTG
GCTCGACAAGTAACCAAGGTCCTCCCATTAGAAGACGGGTCTTACTTATGGGGCCTTCGT
GCTCTTTACTATGGCGCTCAAGCTGCGATTATGTTACTAAATCTTTACATTATCCAGATC
ATTGAAAAGAAAAACGATCAGACTGTTTTGCGCTACGTGGAACCGGCGAAACAAACCTGG
GACGGGACCACTACAAAGGATACATTGGTGGTGACCAACTTTGCCGATTACGACAAGAGT
GAAGTCTTGAAGGGGTTGAAACAATCGGGGATTGGGCTGGCCATGGTGACCTTCTTGCAC
TTCAAATTTGGATATGTACAGCCTTTGATCATCCAAGCAATCCTTGGTTTCAAGACCTTC
TTCACGACCAAAGAAGCAAGAATCCACCTATTCAACCAATCCACCAGCAGCGGTGATCTG
AAACGACCTTTCCGGGTGGATTCTCCTTTTGGAATGAACTCACTCAACCCTCAACCCAAG
ACCGACAAGGCATCCATCAAAAAGGCGGAACGTGCTATGAAGGCGGATTAG
>SAM02535
ATGAAAGACGGCTTCAAGTCCATTACGATCGAACCGTTTAATGGGTATCTCGACTTTCAG
GGACCTATCAACGCACAGCAGTCCACCGGCAACATGGTTCTCAAAGGCGACATTCACCTG
GAGCTCACCAAAGCGGTCAATGTCAAGAAGGCCACCCTCAGGTTTATTGGGTCTAGTCGT
GTCTGCCACCACAACACCCTCGATACCGTCGATATCAGCACTCCGATCCTGCCGAAACTC
AAGACACATCTCTTCTCTTCCACTACAACACTTGGTCCTGGCGAGGTGATCTTACCGTGG
GAAATGGAAATCCTCAACATATATCCGTGCAGCGTCATGATCAAACGGGTCACCGTCTCA

I removed comment lines from the gtf file which contained information about each fungal species, here is a chunk of the file:

test.gtf.gz

scf_12295       ena     CDS     1080    1208    .       +       0       gene_id "SAM05242"; transcript_id "SAM05242"; exon_number "1"; gene_name "ABSGL_11117.1 scaffold 12295"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "ABSGL_11117.1 scaffold 12295-1"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "SAM05242";
scf_12295       ena     start_codon     1080    1082    .       +       0       gene_id "SAM05242"; transcript_id "SAM05242"; exon_number "1"; gene_name "ABSGL_11117.1 scaffold 12295"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "ABSGL_11117.1 scaffold 12295-1"; transcript_source "ena"; transcript_biotype "protein_coding";
scf_12295       ena     exon    1293    1366    .       +       .       gene_id "SAM05242"; transcript_id "SAM05242"; exon_number "2"; gene_name "ABSGL_11117.1 scaffold 12295"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "ABSGL_11117.1 scaffold 12295-1"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "SAM05242-2";
scf_12295       ena     CDS     1293    1366    .       +       0       gene_id "SAM05242"; transcript_id "SAM05242"; exon_number "2"; gene_name "ABSGL_11117.1 scaffold 12295"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "ABSGL_11117.1 scaffold 12295-1"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "SAM05242";

Thanks,

James
