Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output not generated due to Python Error #10

Closed
DuttaAnik opened this issue Jun 25, 2024 · 6 comments
Closed

Output not generated due to Python Error #10

DuttaAnik opened this issue Jun 25, 2024 · 6 comments
Assignees

Comments

@DuttaAnik
Copy link

Hello,
Thank you for developing this tool. It looks interesting. I ran this tool with some ONT reads from an untargeted sequencing run. I used the following code to run the analysis:
lemur -i Data_cutadapt.fastq -o lem_test -d rv221bacarc-rv222fungi --tax-path rv221bacarc-rv222fungi/taxonomy.tsv --mm2-type map-ont --verbose -e LOG_FILE -r species

But it threw me the following error:

Traceback (most recent call last):
  File "/home/rd/miniconda3/bin/lemur", line 901, in <module>
    main()
  File "/home/rd/miniconda3/bin/lemur", line 887, in main
    run.EM_complete()
  File "/home/rd/miniconda3/bin/lemur", line 672, in EM_complete
    self.low_abundance_threshold = 1. / n_reads
ZeroDivisionError: float division by zero

The fastq file worked well and provided desired results when I used that for taxonomic classification using Kraken2.

Then, I check the log file, which contains the following contents:

2024-06-25 11:15:09 AM INFO:	Starting run of minimap2 at 2024-06-25 11:15:09.730779
2024-06-25 11:16:51 AM DEBUG:	
2024-06-25 11:16:51 AM DEBUG:	[M::mm_idx_gen::40.391*1.47] collected minimizers
[M::mm_idx_gen::47.459*2.30] sorted minimizers
[M::main::47.691*2.30] loaded/built the index for 3335785 target sequence(s)
[M::mm_mapopt_update::47.691*2.30] mid_occ = 1000000
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 3335785
[M::mm_idx_stat::48.247*2.28] distinct minimizers: 51995586 (41.46% are singletons); average occurrences: 11.640; average spacing: 5.451; total length: 3299326635
[M::worker_pipeline::101.127*7.31] mapped 64828 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -ax map-ont -N 50 -K 500000000 -p .9 -f 0 --sam-hit-only --eqx -t 20 -o lem_test/reads.sam rv221bacarc-rv222fungi/species_taxid.fasta Chry_1_cutadapt.fastq
[M::main] Real time: 101.611 sec; CPU: 739.436 sec; Peak RSS: 20.974 GB

2024-06-25 11:16:51 AM INFO:	Finished running minimap2 in 101787.308 ms
2024-06-25 11:16:51 AM DEBUG:	                                    species                   genus              family  ... subspecies species subgroup species group
tax_id                                                                                   ...                                          
5467               Colletotrichum truncatum          Colletotrichum      Glomerellaceae  ...        NaN              NaN           NaN
5499                           Fulvia fulva                  Fulvia  Mycosphaerellaceae  ...        NaN              NaN           NaN
245174                Wallemia ichthyophaga                Wallemia        Wallemiaceae  ...        NaN              NaN           NaN
1903569           Penicillium riverlandense             Penicillium      Aspergillaceae  ...        NaN              NaN           NaN
36630                  Aspergillus fischeri             Aspergillus      Aspergillaceae  ...        NaN              NaN           NaN
...                                     ...                     ...                 ...  ...        ...              ...           ...
2977707        Synechococcus sp. AH-551-E02           Synechococcus    Synechococcaceae  ...        NaN              NaN           NaN
467085   Candidatus Symbiothrix dinenymphae  Candidatus Symbiothrix                 NaN  ...        NaN              NaN           NaN
1873883        Rhodoplanes sp. JGI PP 4-B12             Rhodoplanes   Hyphomicrobiaceae  ...        NaN              NaN           NaN
2977708        Synechococcus sp. AH-551-E05           Synechococcus    Synechococcaceae  ...        NaN              NaN           NaN
1492901            Bizionia psychrotolerans                Bizionia   Flavobacteriaceae  ...        NaN              NaN           NaN

[47390 rows x 11 columns]
2024-06-25 11:16:51 AM DEBUG:	Loaded taxonomy file rv221bacarc-rv222fungi/taxonomy.tsv
2024-06-25 11:16:51 AM INFO:	Finished loading taxonomy in 88.576 ms
2024-06-25 11:16:51 AM DEBUG:	Initialized abundance vector
2024-06-25 11:16:51 AM INFO:	Finished initializing F in 3.161 ms
2024-06-25 11:16:51 AM INFO:	Finished building alignment model in 0.027 ms
2024-06-25 11:16:56 AM DEBUG:	build_P_rgs_df extracted 100000 reads
2024-06-25 11:16:56 AM DEBUG:	build_P_rgs_df extracted 200000 reads
2024-06-25 11:16:57 AM DEBUG:	build_P_rgs_df extracted 300000 reads
2024-06-25 11:16:58 AM DEBUG:	build_P_rgs_df extracted 400000 reads
2024-06-25 11:16:58 AM DEBUG:	build_P_rgs_df extracted 500000 reads
2024-06-25 11:16:59 AM DEBUG:	build_P_rgs_df extracted 600000 reads
2024-06-25 11:17:00 AM DEBUG:	build_P_rgs_df extracted 700000 reads
2024-06-25 11:17:00 AM DEBUG:	build_P_rgs_df extracted 800000 reads
2024-06-25 11:17:34 AM DEBUG:	Empty DataFrame
Columns: [log_P, Gene, Reference, aln_len, max_aln_len, max_log_P, Name, Genome]
Index: []
2024-06-25 11:17:34 AM INFO:	Finished constructing P(r|s) in 43166.161 ms

But I do not get any output file with relative abundance of the microorganisms. In the output folder, I have only the P_rgs* files.

Could you please help me find a solution?
Thank you.

@nsapoval
Copy link
Collaborator

Hello,

The read dataframe ends up being empty after the initial processing (i.e. before the EM step is even run), the last line of log file indicates that, and hence downstream the n_reads retained is 0 raising division by 0 error. There are a few reasons this might happen, since we do have a few filters that remove low confidence read alignments to the marker genes.

If you have the .fastq or .sam file from the run available, and are willing to share it, I can try seeing what exactly kicks out the alignments and provide more detailed feedback.

In the meantime, you can try adding --min-aln-len-ratio 0.5 and --min-fidelity 0.25 flags. This will relax filtering criteria and give us an idea of whether it's the filters that cause the issue or whether there is a different problem/bug in the code.

Kind regards,
Nick

@nsapoval nsapoval self-assigned this Jun 25, 2024
@DuttaAnik
Copy link
Author

Hi, thank you for the reply. Unfortunately, adding the flags did not resolve the issue. Please send me your email so I can share the fastq file. I cannot attach it here.

@nsapoval
Copy link
Collaborator

[email protected]

@DuttaAnik
Copy link
Author

I have sent you the data. Please check your email (also the spam folder).

@nsapoval
Copy link
Collaborator

I have received the data and replicated the issue.

I think the main reason for the errors is the read length. Average read length in the sample is 370.2 bps and the max length read is 1378 bps, while the average length on a marker gene is 989 bps with the longest one being 11,061 bps. This means that with relatively sparse sample chances of a good overlap for a marker gene are quite low. Looking at the alignment lengths in the P_rgs_df_raw.tsv file I can see that they range from 40 to 120 bps. This would mean that even with the lower filter of alignment length ratio (0.5) none would pass.

I ran Lemur with --min-aln-len-ratio 0.001 and attached the results here: relative_abundance.csv. There are a few bacteria and a single fungi picked up by Lemur. However, I would be cautious with these results, since the alignments are very short.

Please let me know if you are able to get the run to work after changing the flag value to 0.001.

@DuttaAnik
Copy link
Author

Hi Nicolae,
Thank you very much for running the analysis and the suggestions. Yes, it ran now with your proposed value. However, I was just testing this tool with low-quality data and make sure that it works. Now, that it works I will try it out with a high-quality dataset. Thanks a lot for the insights.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants