Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Q1, Q2, Q3" Output of "stats" subcommand seems weird #353

Closed
2 tasks done
MapleHe opened this issue Dec 2, 2022 · 2 comments
Closed
2 tasks done

"Q1, Q2, Q3" Output of "stats" subcommand seems weird #353

MapleHe opened this issue Dec 2, 2022 · 2 comments

Comments

@MapleHe
Copy link

MapleHe commented Dec 2, 2022

Prerequisites

  • make sure you're are using the latest version by seqkit version
  • read the usage

Describe your issue

Platform: Linux
Data: raw 150bp fastq without trimming, number of reads: 42275372
Command: seqkit stats temp.fq -j 8 -a -T >>temp.tsv 2>>temp.log

seqkit Version v0.13.2 output:

Q1, Q2, Q3
150, 150, 150

seqkit Version v0.16.1, v2.3.0 output: (BTW, the version I downloaded is v2.3.1, but seqkit version shows v2.3.0)

Q1, Q2, Q3
75.0, 150.0, 75.0

When I downsample fastq to 1 read from the original fastq, using seqkit v0.13.2, output:

Q1, Q2, Q3
0, 150, 0

When I downsample fastq to 1 read from the original fastq, using seqkit v2.3.0, output:

Q1, Q2, Q3
75.0, 150.0, 75.0

Expected result:

Q1, Q2, Q3
150, 150, 150

As I understand, the Q1, Q2, Q3 should be the quartile of seq length, right ? Then v0.13.2 seems work as expected, for full dataset at least.

BTW, it could be better if there is an description/explatination of each column in the output in your document (maybe its my fault to omit it ?).

Thank you for your excellent tool.

@shenwei356
Copy link
Owner

Fixed it and added some docs:

$ seqkit head -n 1 hairpin.fa | seqkit stats 
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   RNA          1       99       99       99       99
simple statistics of FASTA/Q files

Columns:

  1.  file      input file, "-" for STDIN
  2.  format    FASTA or FASTQ
  3.  type      DNA, RNA, Protein or Unlimit
  4.  num_seqs  number of sequences
  5.  sum_len   number of bases or residues       , with gaps or spaces counted
  6.  min_len   minimal sequence length           , with gaps or spaces counted
  7.  avg_len   average sequence length           , with gaps or spaces counted
  8.  max_len   miximal sequence length           , with gaps or spaces counted
  9.  Q1        first quartile of sequence length , with gaps or spaces counted
  10. Q2        median of sequence length         , with gaps or spaces counted
  11. Q3        third quartile of sequence length , with gaps or spaces counted
  12. sum_gap   number of gaps
  13. N50       N50. https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50
  14. Q20(%)    percentage of bases with the quality score greater than 20
  15. Q30(%)    percentage of bases with the quality score greater than 30
  16. GC(%)     percentage of GC content
  
Attentions:
  1. Sequence length metrics (sum_len, min_len, avg_len, max_len, Q1, Q2, Q3)
     count the number of gaps or spaces. You can remove them with "seqkit seq -g":
         seqkit seq -g input.fasta | seqkit stats

Tips:
  1. For lots of small files (especially on SDD), use big value of '-j' to
     parallelize counting.
  2. Extract one metric with csvtk (https://github.com/shenwei356/csvtk):
         seqkit stats -Ta input.fastq.gz | csvtk cut -t -f "Q30(%)" | csvtk del-header

The version I downloaded is v2.3.1, but seqkit version shows v2.3.0

Yes, I noticed that. Thanks for reporting.

@MapleHe
Copy link
Author

MapleHe commented Dec 2, 2022

Thank you for your quick answer and fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants