Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training data from HG002, HG003 and HG004 #26

Closed
aCoalBall opened this issue Jun 6, 2024 · 1 comment
Closed

Training data from HG002, HG003 and HG004 #26

aCoalBall opened this issue Jun 6, 2024 · 1 comment

Comments

@aCoalBall
Copy link

Hi,
Thanks for the impressive work of DeepMod2. I am trying to generate the training/valid/test data following your criteria as described in your paper, but I can not get the same number as in Supplementary Table 4. Take HG002 for instance, I downloaded 12 files by LAB01 of the six WGBS methods (TrueSeq, MethylSeq, SPLAT, TrueMethIBS, TrueMethyIOX, and EMSeq) from EpiQC.

GSM5649436_TruSeq_HG002_LAB01_REP02.bedGraph.gz GSM5649437_TruSeq_HG002_LAB01_REP01.bedGraph.gz GSM5649458_SPLAT_HG002_LAB01_REP02.bedGraph.gz GSM5649459_SPLAT_HG002_LAB01_REP01.bedGraph.gz GSM5649479_MethylSeq_HG002_LAB01_REP02.bedGraph.gz GSM5649480_MethylSeq_HG002_LAB01_REP01.bedGraph.gz GSM5649518_EMSeq_HG002_LAB01_REP02.bedGraph.gz GSM5649520_EMSeq_HG002_LAB01_REP01.bedGraph.gz GSM5649543_TrueMethylOX_HG002_LAB01_REP02.bedGraph.gz GSM5649545_TrueMethylOX_HG002_LAB01_REP01.bedGraph.gz GSM5649565_TrueMethylBS_HG002_LAB01_REP02.bedGraph.gz GSM5649567_TrueMethylBS_HG002_LAB01_REP01.bedGraph.gz

I processed these files following the description in your paper:

  1. Before processing, for each WGBS method, I summed the counts from the two replicates and calculated the methylation percentage.
  2. For each of the six WGBS method, we find CpG sites with (coverage >=10 and percentage >=90%) as methylation and (coverage >=10 and percentage < 10%) as unmethylation.
  3. We take the overlapping for ALL the six WGBS methods for the above methylation and unmethylation sites. We then split the data into train/valid/test, according to the region of Chr2-21, Chr22 and Chr1.

Our Train: meth 4,212,407, unmeth: 2,169,575
Your Paper Train: meth 5,332,746, unmeth: 2,490,174

Our Valid: meth 134,316, unmeth: 58,293
Your Paper Valid: meth 179,954, unmeth: 65,730

Could you help me to reproduce the training data split as in Supplmentary Table 4 with more detailed information or kindly sharing your script to generate the data?
Thank you very much!

@umahsn
Copy link
Collaborator

umahsn commented Jun 7, 2024

Hi, please see the following code for reproducing the training CpG labels:

Process individual datasets for each genome:

h="HG002"

mkdir -p merged/$h

for tp in {TruSeq,TrueMethylBS,TrueMethylOX,EMSeq,MethylSeq,SPLAT}
do
# merge replicates
f1=`find * -name GSM5649*${tp}*${h}_LAB01_REP01.bedGraph`
f2=`find * -name GSM5649*${tp}*${h}_LAB01_REP02.bedGraph`

bedtools intersect -a $f1 -b $f2 -loj |awk -v OFS="\t"  '{print $1, $2, $3,($5+$11)/($5+$11+$6+$12),$5+$11+$6+$12,$5+$11,$6+$12}' |grep -v 'Un\|random\|chrEBV\|p\|M' > merged/$h/merged.$tp.bed

# filter according to depth, and separate according to high or low methylation
mincov=10
high=0.9
low=0.1

mkdir -p merged/$h/dp.$mincov/gt_${high}
mkdir -p merged/$h/dp.$mincov/lt_${low}

cat merged/$h/merged.$tp.bed|awk -v high=$high -v mincov=$mincov  '{if ($5>=mincov && $4>=high){print}}'| grep -v 'X\|Y\|M' > merged/$h/dp.$mincov/gt_${high}/$tp.bed

cat merged/$h/merged.$tp.bed|awk -v low=$low -v mincov=$mincov  '{if ($5>=mincov && $4<=low){print}}'| grep -v 'X\|Y\|M' > merged/$h/dp.$mincov/lt_${low}/$tp.bed
done

Get common sites among six technologies and split into training, testing and validation

g=2

# get common sites
for tp in {"gt_0.9","lt_0.1"}
do 
folder="merged/HG00$g/dp.10/$tp/"
bedtools multiinter -i $folder/*.bed > $folder/outer_join
cat $folder/outer_join|awk '{if($4==6){print}}'|cut -f 1-3 > $folder/common_sites_all
done

#convert sites into stranded CpGs
d=dp.10
l=lt_0.1
h=gt_0.9
mkdir -p merged/HG00$g/$d/training_labels
cat merged/HG00$g/$d/$l/common_sites_all | awk '{print $1 "\t" $2 "\t+\t" 0 "\n" $1 "\t" $2+1 "\t-\t" 0}' > merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h ; cat merged/HG00$g/$d/$h/common_sites_all|awk '{print $1 "\t" $2 "\t+\t" 1 "\n" $1 "\t" $2+1 "\t-\t" 1}' >> merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h

#split datasets
cat merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h|grep  "chr22" > merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h.chr22

cat merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h|grep  -v "chr1$(printf '\t')\|chr22" > merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h.chr2_21

cat merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h|grep  "chr1$(printf '\t')" > merged/HG00$g/$d/training_labels/HG00$g.$d.$l.$h.chr1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants