"Q1, Q2, Q3" Output of "stats" subcommand seems weird #353

MapleHe · 2022-12-02T11:05:18Z

Prerequisites

make sure you're are using the latest version by seqkit version
read the usage

Describe your issue

Platform: Linux
Data: raw 150bp fastq without trimming, number of reads: 42275372
Command: seqkit stats temp.fq -j 8 -a -T >>temp.tsv 2>>temp.log

seqkit Version v0.13.2 output:

Q1, Q2, Q3
150, 150, 150

seqkit Version v0.16.1, v2.3.0 output: (BTW, the version I downloaded is v2.3.1, but seqkit version shows v2.3.0)

Q1, Q2, Q3
75.0, 150.0, 75.0

When I downsample fastq to 1 read from the original fastq, using seqkit v0.13.2, output:

Q1, Q2, Q3
0, 150, 0

When I downsample fastq to 1 read from the original fastq, using seqkit v2.3.0, output:

Q1, Q2, Q3
75.0, 150.0, 75.0

Expected result:

Q1, Q2, Q3
150, 150, 150

As I understand, the Q1, Q2, Q3 should be the quartile of seq length, right ? Then v0.13.2 seems work as expected, for full dataset at least.

BTW, it could be better if there is an description/explatination of each column in the output in your document (maybe its my fault to omit it ?).

Thank you for your excellent tool.

The text was updated successfully, but these errors were encountered:

shenwei356 · 2022-12-02T14:51:23Z

Fixed it and added some docs:

$ seqkit head -n 1 hairpin.fa | seqkit stats 
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
-     FASTA   RNA          1       99       99       99       99

simple statistics of FASTA/Q files

Columns:

  1.  file      input file, "-" for STDIN
  2.  format    FASTA or FASTQ
  3.  type      DNA, RNA, Protein or Unlimit
  4.  num_seqs  number of sequences
  5.  sum_len   number of bases or residues       , with gaps or spaces counted
  6.  min_len   minimal sequence length           , with gaps or spaces counted
  7.  avg_len   average sequence length           , with gaps or spaces counted
  8.  max_len   miximal sequence length           , with gaps or spaces counted
  9.  Q1        first quartile of sequence length , with gaps or spaces counted
  10. Q2        median of sequence length         , with gaps or spaces counted
  11. Q3        third quartile of sequence length , with gaps or spaces counted
  12. sum_gap   number of gaps
  13. N50       N50. https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50
  14. Q20(%)    percentage of bases with the quality score greater than 20
  15. Q30(%)    percentage of bases with the quality score greater than 30
  16. GC(%)     percentage of GC content
  
Attentions:
  1. Sequence length metrics (sum_len, min_len, avg_len, max_len, Q1, Q2, Q3)
     count the number of gaps or spaces. You can remove them with "seqkit seq -g":
         seqkit seq -g input.fasta | seqkit stats

Tips:
  1. For lots of small files (especially on SDD), use big value of '-j' to
     parallelize counting.
  2. Extract one metric with csvtk (https://github.com/shenwei356/csvtk):
         seqkit stats -Ta input.fastq.gz | csvtk cut -t -f "Q30(%)" | csvtk del-header

The version I downloaded is v2.3.1, but seqkit version shows v2.3.0

Yes, I noticed that. Thanks for reporting.

…#353

MapleHe · 2022-12-02T15:05:01Z

Thank you for your quick answer and fix.

…shenwei356#353

shenwei356 added the add some doc label Dec 2, 2022

shenwei356 added a commit that referenced this issue Dec 2, 2022

seqkit stats: fix compute Q1 and Q3 of sequence length for one record. …

4f2dbf0

…#353

MapleHe closed this as completed Dec 2, 2022

botond-sipos pushed a commit to botond-sipos/seqkit that referenced this issue Jan 29, 2023

seqkit stats: fix compute Q1 and Q3 of sequence length for one record. …

434b5de

…shenwei356#353

shenwei356 mentioned this issue Mar 17, 2023

Update SeqKit to v2.4.0 bioconda/bioconda-recipes#39957

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Q1, Q2, Q3" Output of "stats" subcommand seems weird #353

"Q1, Q2, Q3" Output of "stats" subcommand seems weird #353

MapleHe commented Dec 2, 2022

shenwei356 commented Dec 2, 2022

MapleHe commented Dec 2, 2022 •

edited

Loading

"Q1, Q2, Q3" Output of "stats" subcommand seems weird #353

"Q1, Q2, Q3" Output of "stats" subcommand seems weird #353

Comments

MapleHe commented Dec 2, 2022

Prerequisites

Describe your issue

Expected result:

shenwei356 commented Dec 2, 2022

MapleHe commented Dec 2, 2022 • edited Loading

MapleHe commented Dec 2, 2022 •

edited

Loading