-
Notifications
You must be signed in to change notification settings - Fork 14
/
dmrseq.Rmd
655 lines (533 loc) · 26.9 KB
/
dmrseq.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
---
title: "Analyzing Bisulfite-seq data with dmrseq"
author: "Keegan Korthauer"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('dmrseq')`"
bibliography: dmrseqBib.bib
abstract: >
A basic task in the analysis of count data from Whole Genome
Bisulfite-Sequencing is the detection of differentially methylated regions.
The count data consist of, for each sample, the number of methylated
reads and the total number of reads covering CpG.
An important analysis question is to detect regions (collections of
neighboring CpGs) with systematic differences between conditions,
as compared to within-condition variability. These so-called Differentially
Methylated Regions (DMRs) are thought to be more informative than single CpGs
in terms of of biological function. Although several methods exist
to quantify and perform statistical inference on changes at the individual
CpG level, detection of DMRs is still limited to aggregating signifiant
CpGs without proper inference at the region level. The package **dmrseq**
addresses this gap by providing a rigorous permutation-based approach to
detect and perform inference for differential methylation by use of
generalized least squares models that account for inter-individual and
inter-CpG variability to generate region-level statistics that can be
comparable across the genome. This allows the framework to perform well even
on samples as small as two per group. This vignette explains the
use of the package and demonstrates typical workflows. This vignette was
generated with dmrseq package version `r packageVersion("dmrseq")`
output:
BiocStyle::html_document:
highlight: pygments
toc_float: true
fig_width: 5
vignette: >
%\VignetteIndexEntry{Analyzing Bisulfite-seq data with dmrseq}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding[utf8]{inputenc}
---
<!-- This is the source document -->
```{r setup, echo=FALSE, results="hide"}
knitr::opts_chunk$set(tidy=FALSE, cache=FALSE,
dev="png",
message=FALSE, error=FALSE, warning=TRUE)
```
# Quick start
**If you use dmrseq in published research, please cite:**
> Korthauer, K., Chakraborty, S., Benjamini, Y., and Irizarry, R.A.
> Detection and accurate False Discovery Rate control of differentially
methylated regions from Whole Genome Bisulfite Sequencing
> *Biostatistics*, 2018 (in press).
This package builds upon the
[bsseq](http:https://bioconductor.org/packages/bsseq) package [@Hansen2012],
which provides efficient storage and manipulation of bisulfite
sequencing data and inference for differentially methylated CpGs.
The main goal of **dmrseq** [@Korthauer183210]
is to provide inference for differentially methylated *regions*, or
groups of CpGs.
Here we show the most basic steps for a differential methylation
analysis. There are a variety of steps upstream of **dmrseq** that result
in the generation of counts of methylated reads and total reads covering each
CpG for each sample, including mapping of sequencing reads to a reference
genome with and without bisulfite conversion. You can use the software
of your preference for this step (one option is
[Bismark](https://www.bioinformatics.babraham.ac.uk/projects/bismark/)), as
long as you are able to obtain counts of methylation and coverage (as
opposed to solely methylation proportions, as discussed below).
This package uses a specific data structure to store and manipulate
bisulfite sequencing data introduced by the **bsseq** package. This data
structure is a *class* called `BSseq`. Objects of the class `BSseq` contain
all pertinent information for a bisulfite sequencing experiment, including
the number of reads corresponding to methylation, and the total number
of reads at each
CpG site, the location of each CpG site, and experimental metadata on the
samples. Note that here we focus on CpG methylation, since this is the
most common form of methylation in humans and many other organisms; take
care when applying this method to other types of methylation and make sure
that it will
be able to scale to the number of methylation sites, and that similar
assumptions can be made regarding spatial correlation. Also note that
the default settings for smoothing parameters and spacing/gap parameters
are set to values that we found useful, but may need to be altered for
datasets for other organisms.
To store your data in a `BSseq` object, make sure you have the following
neccessary components:
1. genomic positions, including chromosome and location, for methylation loci.
2. a (matrix) of M (Methylation) values, describing the number of reads
supporting methylation covering a single loci.
Each row in this matrix is a methylation loci and each column is a sample.
3. a (matrix) of Cov (Coverage) values,
describing the total number of reads covering a single loci.
Each row in this matrix is a methylation loci and each column is a sample.
The following code chunk assumes that `chr` and `pos` are vectors of
chromosome names and positions, respectively, for each CpG in the dataset.
You can also provide a `GRanges` object instead of `chr` and `pos`. It
also assumes that the matrices of methylation and coverage values (described
above) are named `M` and `Cov`, respectively. Note, `M` and `Cov` can also
be data stored on-disk (not in memory) using HDF5 files with the `HDF5Array`
package or `DelayedMatrix` with the `DelayedArray` package.
The `sampleNames` and `trt` objects are
vectors with sample labels and condition labels for each sample. A condition
label could be something like
treatment or control, a tissue type, or a continous measurement.
This is the covariate for which you wish to test for differences in
methylation. Once the `BSseq` object is constructed and the sample covariate
information is added, DMRs are obtained by running the `dmrseq` function.
A continuous covariate is assumed if the data type of the `testCovariate`
arugment in `dmrseq` is
continuous, with the exception of if there are only two unique values
(then a two group comparison is carried out).
```{r quickStart, eval=FALSE}
bs <- BSseq(chr = chr, pos = pos,
M = M, Cov = Cov,
sampleNames = sampleNames)
pData(bs)$Condition <- trt
regions <- dmrseq(bs=bs, testCovariate="Condition")
```
For more information on constructing and manipulating `BSseq` objects,
see the [bsseq](http:https://bioconductor.org/packages/bsseq) vignettes.
* If you used *Bismark* to align your bisulfite sequencing data,
you can use the `read.bismark` function to read bismark files
into `BSseq` objects. See below for more details.
# How to get help for dmrseq
Please post **dmrseq** questions to the
**Bioconductor support site**, which serves as a searchable knowledge
base of questions and answers:
<https://support.bioconductor.org>
Posting a question and tagging with "dmrseq" will automatically send
an alert to the package authors to respond on the support site. See
the first question in the list of [Frequently Asked Questions](#FAQ)
(FAQ) for information about how to construct an informative post.
# Input data
## Why counts instead of methylation proportions?
As input, the **dmrseq** package expects count data as obtained, e.g.,
from Bisulfite-sequencing. The value in the *i*-th row and the *j*-th column of
the `M` matrix tells how many methylated reads can be assigned to CpG *i*
in sample *j*. Likewise, the value in the *i*-th row and the *j*-th column of
the `Cov` matrix tells how many total reads can be assigned to CpG *i*
in sample *j*. Although we might be tempted to combine these matrices into
one matrix that contains the methylation *proportion* (`M`/`Cov`) at each CpG
site, it is critical to notice that this would be throwing away a lot of
information. For example, some sites have much higher coverage than others,
and naturally, we have more confidence in those with many reads mapping to them.
If we only kept the proportions, a CpG with 2 out of 2 reads methylated would
be treated the same as a CpG with 30 out of 30 reads methylated.
## How many samples do I need?
To use **dmrseq**, you need to have at least 2 samples in each condition.
Without this replicates, it is impossible to distinguish between biological
variability due to condition/covariate of interest, and inter-individual
variability within condition.
If your experiment contains additional samples, perhaps from other conditions
that are not of interest in the current test, these should be filtered out
prior to running **dmrseq**. Rather than creating a new filtered object,
the filtering step can be included in the call to the main function `dmrseq`.
For more details, see the
[Filtering CpGs and samples Section](#filtering-cpgs-and-samples).
## Bismark input
If you used Bismark for mapping and methylation level extraction, you can
use the `read.bismark` function from the **bsseq** package to read the
data directly into
a `BSeq` object.
The following example is from the help page of the function. After running
Bismark's methylation extractor, you should have output files with names
that end in `.bismark.cov.gz`. You can specify a vector of file names with
the `file` argument, and a corresponding vector of `sampleNames`. It is
recommended that you set `rmZeroCov` to TRUE in order to remove CpGs with
no coverage in any of the samples, and set `strandCollapse` to TRUE in order
to combine CpGs on opposite strands into one observation (since CpG methylation)
is symmetric.
```{r bismarkinput}
library(dmrseq)
infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
package = 'bsseq')
bismarkBSseq <- read.bismark(files = infile,
rmZeroCov = TRUE,
strandCollapse = FALSE,
verbose = TRUE)
bismarkBSseq
```
See the [bsseq](http:https://bioconductor.org/packages/bsseq) help pages for
more information on using this function.
## Count matrix input
If you haven't used Bismark, but you have count data for number of methylated
reads and total coverage for each CpG, along with their corresponding chromosome
and position information, you can construct a `BSseq` object from scratch,
like below. Notice that the `M` and `Cov` matrices have the same dimension, and
`chr` and `pos` have the same number of elements as rows in the count matrices
(which corresponds to the number of CpGs). Also note that the number of columns
in the count matrices matches the number of elements in `sampleNames` and the
condition variable 'celltype`.
```{r dissect, results="hide", echo=FALSE}
data("BS.chr21")
M <- getCoverage(BS.chr21, type="M")
Cov <- getCoverage(BS.chr21, type="Cov")
chr <- as.character(seqnames(BS.chr21))
pos <- start(BS.chr21)
celltype <- pData(BS.chr21)$CellType
sampleNames <- sampleNames(BS.chr21)
```
```{r fromScratch}
head(M)
head(Cov)
head(chr)
head(pos)
dim(M)
dim(Cov)
length(chr)
length(pos)
print(sampleNames)
print(celltype)
bs <- BSseq(chr = chr, pos = pos,
M = M, Cov = Cov,
sampleNames = sampleNames)
show(bs)
```
```{r cleanup, results="hide", echo=FALSE}
rm(M, Cov, pos, chr, bismarkBSseq)
```
The example data contains CpGs from chromosome 21 for four samples
from @Lister2009. To load this data directly (already in the `BSseq` format),
simply type `data(BS.chr21)`.
Two of the samples are replicates of the cell type 'imr90'
and the other two are replicates of the cell type 'h1'. Now that we have the
data loaded into a `BSseq` object, we can use **dmrseq**
to find regions of the genome where these two cell types have significantly
different methylation levels. But first, we need to add the sample metadata
that indicates which samples are from which cell type (the `celltype`
varialbe above). This information, which we call 'metadata',
will be used by the `dmrseq` function to decide
which samples to compare to one another. The next section shows how to add
this information to the `BSseq` object.
## Sample metadata
To add sample metadata, including the covariate of interest, you can add it
to the
`BSseq` object by adding columns to the `pData` slot. You must have at least
one column of `pData`, which contains the covariate of interest. Additional
columns are optional.
```{r meta}
pData(bs)$CellType <- celltype
pData(bs)$Replicate <- substr(sampleNames,
nchar(sampleNames), nchar(sampleNames))
pData(bs)
```
We will then tell the `dmrseq` function which metadata variable to use
for testing for methylation differences by setting the `testCovariate`
parameter equal to its column name.
## Smoothing
Note that unlike in **bsseq**, you do not need to carry out the smoothing step
with a separate function. In addition, you should not use **bsseq**'s `BSmooth()`
function to smooth the methylation levels, since **dmrseq** smooths in a very
different way. Briefly, **dmrseq** smooths methylation *differences*, so it
carries out the smoothing step once. This is automatically done with the main
`dmrseq` function. **bsseq** on the other hand, smooths each sample
independently, so smoothing needs to be carried out once per sample.
## Filtering CpGs and samples
For pairwise comparisons, **dmrseq** analyzes all CpGs that have at least one
read in at least one sample per group.
Thus, if your dataset contains CpGs with zero reads in every sample within a
group, you should filter them out prior to running `dmrseq`. Likewise,
if your `bsseq` object contains extraneous samples that are part of the
experiment but not the differential methylation testing of interest, these
should be filtered out as well.
Filtering `bsseq` objects is straightforward:
* Subset rows to filter CpG loci
* Subset columns to filter samples
If we wish to remove all CpGs that have no coverage in at least one sample
and only keep samples with a CellType of "imr90" or "h1", we would do so with:
```{r, filter}
# which loci and sample indices to keep
loci.idx <- which(DelayedMatrixStats::rowSums2(getCoverage(bs, type="Cov")==0) == 0)
sample.idx <- which(pData(bs)$CellType %in% c("imr90", "h1"))
bs.filtered <- bs[loci.idx, sample.idx]
```
```{r, results="hide", echo=FALSE}
rm(bs.filtered)
```
Note that this is a trivial example, since our toy example object `BS.chr21`
already contains only loci with coverage at least one read in all samples as well
as only samples from the "imr90" and "h1" conditions.
Also note that instead of creating a separate object, the filtering step
can be combined with the call to `dmrseq` by replacing the `bs` input with a
filtered version `bs[loci.idx, sample.idx]`.
## Adjusting for covariates
There are two ways to adjust for covariates in the dmrseq model. The first way
is to specify the `adjustCovariate` parameter of the `dmrseq()` function as
a column of the `pData()` slot that contains the covariate you
would like to adjust for. This will include that covariate directly in the
model. This is ideal if the adjustment covariate is continuous or has more
than two groups.
The second way is to specify the `matchCovariate` parameter of the `dmrseq`
function as a column of the `pData()` slot that contains the covariate you
would like to match on. This will restrict the permutations considered to only
those where the `matchCovariate` is balanced. For example, the `matchCovariate`
could represent the sex of each sample. In that case, a permutation that
includes all males in one group and all females in another would not be
considered (since there is a plausible biological difference that may induce
the null distribution to resemble non-null). This matching adjustment is ideal
for two-group comparisons.
# Differentially Methylated Regions
The standard differential expression analysis steps are wrapped
into a single function, `dmrseq`. The estimation steps performed
by this function are described briefly below, as well as in
more detail in the **dmrseq** paper. Here we run the results for a subset
of 20,000 CpGs in the interest of computation time.
```{r mainfunction, message=TRUE, warning=TRUE}
testCovariate <- "CellType"
regions <- dmrseq(bs=bs[240001:260000,],
cutoff = 0.05,
testCovariate=testCovariate)
```
Progress messages are printed to the console if `verbose` is TRUE.
The text, `condition h1 vs imr90`, tells you that positive methylation
differences mean h1 has higher methylation than imr90 (see below for
more details).
## Output of dmrseq
The results object is a `GRanges` object with the coordiates
of each candidate region, and contains the following metadata columns (which
can be extracted with the `$` operator:
1. `L` = the number of CpGs contained in the region,
2. `area` = the sum of the smoothed beta values
3. `beta` = the coefficient value for the condition difference (Note: if the
test covariate is categorical with more than 2 groups, there will be
more than one beta column),
4. `stat` = the test statistic for the condition difference,
5. `pval` = the permutation _p_-value for the significance of the test
statistic, and
6. `qval` = the _q_-value for the test statistic (adjustment
for multiple comparisons to control false discovery rate).
7. `index = an `IRanges` containing the indices of the region's
first CpG to last CpG.
```{r, showresults}
show(regions)
```
The above steps are carried out on a very small subset of data (20,000 CpGs).
This package loads data into memory one chromosome at a
time. For on human data, this means objects with a few million
entries per sample (since there are roughly 28.2 million total CpGs in the human
genome, and the largest chromosomes will have more than 2 million CpGs).
This means that whole-genome `BSseq` objects for several samples can use up
several GB of RAM. In order to improve speed, the package allows for easy
parallel processing of chromosomes, but be aware that using more cores will
also require the use of more RAM.
To use more cores, use the `register` function of
[BiocParallel](http:https://bioconductor.org/packages/BiocParallel). For example,
the following chunk (not evaluated here), would register 4 cores, and
then the functions above would
split computation over these cores.
```{r parallel, eval=FALSE}
library("BiocParallel")
register(MulticoreParam(4))
```
## Steps of the dmrseq method
**dmrseq** is a two-stage approach that first detects candidate regions and then
explicitly evaluates statistical significance at the region level while
accounting for known sources of variability.
Candidate DMRs are defined by segmenting the genome into groups of CpGs
that show consistent evidence of differential methylation.
Because the methylation levels of neighboring CpGs are highly correlated,
we first smooth the signal to combat loss of power due to low coverage as done
in **bsseq**.
In the second stage, we compute a statistic for each candidate
DMR that takes into account variability between biological replicates
and spatial correlation among neighboring loci. Significance of each
region is assessed via a permutation procedure which uses a pooled null
distribution that can be generated from as few as two biological replicates,
and false discovery rate is controlled using the Benjamini-Hochberg
procedure.
For more details, refer to the **dmrseq** paper [@Korthauer183210].
## Detecting large-scale methylation blocks
The default smoothing parameters (`bpSpan`, `minInSpan`, and `maxGapSmooth`)
are designed to focus on local DMRs, generally in the range of hundreds to
thousands of bases. In some applications, such as cancer, it is of interest
to effectively 'zoom out' in order to detect larger (lower-resolution)
methylation blocks on the order of hundreds of thousands to millions of bases.
To do so, you can
set the `block` argument to true, which will only include candidate regions with
at least `blockSize` basepairs (default = 5000). This setting will also merge
candidate regions that (1) are in the same direction and (2) are less than 1kb
apart with no covered CpGs separating them. The region-level model used is also
slightly modified - instead of a loci-specific intercept for each CpG in the
region, the intercept term is modeled as a natural spline with one interior
knot per each 10kb of length (up to 10 interior knots).
In addition, detecting large-scale blocks requires that
the smoothing window be increased to minimize the impact of noisy local
methylation measurements. To do so, the values of the
smoothing parameters should be increased. For example, to use a smoothing window
that captures at least 500 CpGs or 50,000 basepairs that are spaced apart by no
more than 1e6 bases, use `minInSpan=500`, `bpSpan=5e4`, and `maxGapSmooth=1e6`.
In addition, to avoid a block being broken up simply due to a gap with no
covered CpGs, you can increase the `maxGap` parameter.
```{r blocks, message=TRUE, warning=TRUE}
testCovariate <- "CellType"
blocks <- dmrseq(bs=bs[120001:125000,],
cutoff = 0.05,
testCovariate=testCovariate,
block = TRUE,
minInSpan = 500,
bpSpan = 5e4,
maxGapSmooth = 1e6,
maxGap = 5e3)
head(blocks)
```
The top hit is `r signif(width(blocks)[1]/1e3, 3)` thousand basepairs wide.
In general, it also might be advised to decrease the cutoff when detecting
blocks, since a smaller methylation
difference might be biologically significant if it is maintained
over a large genomic region. Note that block-finding can be more computationally
intensive since we are fitting region-level models to large numbers of CpGs at a
time. In the toy example above we are only searching over 5,000 CpGs (which
span
`r signif((max(end(bs[120001:125000,])) -
min(start(bs[120001:125000,])))/1e3,3)`
thousand basepairs), so we do not find enough null
candidate regions to carry out inference and obtain significance levels.
# Exploring and exporting results
## Explore how many regions were significant
How many regions were significant at the FDR (_q_-value) cutoff of 0.05? We
can find this by counting how many values in the `qval` column of the `regions`
object were less than 0.05.
You can also subset the regions by an FDR cutoff.
```{r}
sum(regions$qval < 0.05)
# select just the regions below FDR 0.05 and place in a new data.frame
sigRegions <- regions[regions$qval < 0.05,]
```
## Hypo- or Hyper- methylation?
You can determine the proportion of regions with hyper-methylation by counting
how many had a positive direction of effect (positive statistic).
```{r hyper}
sum(sigRegions$stat > 0) / length(sigRegions)
```
To interpret the direction of effect, note that for a two-group comparison
**dmrseq** uses alphabetical order of the covariate of interest.
The condition with a higher alphabetical rank will become the reference category.
For example, if
the two conditions are "A" and "B", the "A" group will be the reference category,
so a positive direction of effect means that
"B" is hyper-methylated relative to "A". Conversely, a negative direction of
effect means that "B" is hypo-methylated relative to "A".
## Plot DMRs
It can be useful to visualize individual DMRs, so we provide a plotting
function that is based off of **bsseq**'s plotting functions. There is also
functionality to add annotations using the
[annotatr](http:https://bioconductor.org/packages/annotatr) package to
see the nearby CpG categories (island, shore, shelf, open sea) and nearby
coding sequences.
To retrieve annotations for genomes supported by **annotatr**, use the
helper function `getAnnot`, and pass this annotation object to the `plotDMRs`
function as the `annoTrack` parameter.
```{r plot, out.width='\\textwidth', fig.height = 2.5, warning='hide'}
# get annotations for hg18
annoTrack <- getAnnot("hg18")
plotDMRs(bs, regions=regions[1,], testCovariate="CellType",
annoTrack=annoTrack)
```
Here we also plot the top methylation block from the block analysis:
```{r plotblock, out.width='\\textwidth', fig.height = 2.5, warning='hide'}
plotDMRs(bs, regions=blocks[1,], testCovariate="CellType",
annoTrack=annoTrack)
```
## Plot distribution of methylation values and coverage
It can also be helpful to visualize overall distributions of methylation values
and / or coverage. The function `plotEmpiricalDistribution` will plot the
methylation values of
the covariate of interest (specified with `testCovariate`).
```{r plot2, fig.height=3}
plotEmpiricalDistribution(bs, testCovariate="CellType")
```
By changing the `type` argument to `Cov`, it will also plot the distribution of
coverage values. In addition, samples can be plotted separately by setting
`bySample` to true.
```{r plot3, fig.height=3}
plotEmpiricalDistribution(bs, testCovariate="CellType",
type="Cov", bySample=TRUE)
```
## Exporting results to CSV files
A plain-text file of the results can be exported using the
base R functions *write.csv* or *write.delim*.
We suggest using a descriptive file name indicating the variable
and levels which were tested.
```{r export, eval=FALSE}
write.csv(as.data.frame(regions),
file="h1_imr90_results.csv")
```
## Extract raw mean methylation differences
For a two-group comparison, it might be of interest to extract the raw mean
methylation differences over the DMRs. This can be done with the helper function
`meanDiff`. For example, we can extract the raw mean difference values for
the regions at FDR level 0.05 (using the `sigRegions` object created
in the section
[Explore how many regions were significant](#explore-how-many-regions-were-significant)).
```{r, meandiff}
rawDiff <- meanDiff(bs, dmrs=sigRegions, testCovariate="CellType")
str(rawDiff)
```
# Simulating DMRs
If you have multiple samples from the same condition (e.g. control samples),
the function `simDMRS` will split these into two artificial sample groups
and then add _in silico_ DMRs. This can then be used to assess sensitivity
and specificity of DMR approaches, since we hope to be able to recover the
DMRs that were spiked in, but not identify too many other differences (since
we don't expect any biological difference between the two artificial sample
groups).
The use of this function is demonstrated below, although note that in this
toy example, we do not have enough samples from the same biological condition to
split into two groups, so instead we shuffle the cell types to create a null
sample comparison.
```{r, sim}
data(BS.chr21)
# reorder samples to create a null comparison
BS.null <- BS.chr21[1:20000,c(1,3,2,4)]
# add 100 DMRs
BS.chr21.sim <- simDMRs(bs=BS.null, num.dmrs=100)
# bsseq object with original null + simulated DMRs
show(BS.chr21.sim$bs)
# coordinates of spiked-in DMRs
show(BS.chr21.sim$gr.dmrs)
# effect sizes
head(BS.chr21.sim$delta)
```
The resulting object is a list with the following elements:
* `gr.dmrs`: a `GRanges` object containing the coordinates of the true spiked
in DMRs
* `dmr.mncov`: a numeric vector containing the mean coverage of the
simulated DMRs
* `dmr.L`: a numeric vector containing the sizes (number of CpG loci) of the
simulated DMRs
* `bs`: the `BSSeq` object containing the original null data + simulated DMRs
* `delta`: a numeric vector of effect sizes (proportion differences) of the
simulated DMRs.
# Session info
```{r sessionInfo}
sessionInfo()
```
# References