Context change
Context change/discrepancy between Bismark coverage and genome-wide cytosine reports
A question that comes up every so often is: "Why do some positions have a different cytosine context between the coverage
and genome-wide cytosine reports produced by coverage2cytosine
? In rare(r) cases, the same position can even be present
in several different contexts - how is this possible?"
Answer:
The Bismark coverage files contain every position that received a methylation call during the mapping step. There are certain cases where the cytosine context may change due to a deletion in the immediate downstream proximity of the cytosine, like in the following example:
CAATGGGA
Here, the first C is in CHH context. If there was a deletion of AAT, one would would get an alignment like this:
C---GGGA
here the context would effectively have changed from CHH to CG.
At least for mammalian systems it is quite likely that such a change would also affect the methylation state of the cytosine involved because CpGs are typically methylated whereas non-CG cytosines are largely completely unmethylated.
This context change only ever occurs for deletions immediate downstream of a cytosine, but not insertions. The reason for this is
that insertions are padded with X
during the methylation call procedure, which would render the cytosine context Unknown
.
coverage2cytosine
on the other hand is fully reference genome-based, so it will go through the reference sequence
and check whether a cytosine was covered or not. The sequence context is purely assigned based on the reference sequence.
-
For CpG context only (the default), it will therefore ‘miss’ out any of the calls where a deletion had changed the sequence context from
CpG
to eitherCHG
orCHH
. -
In
--CX
context: One may encounter cases where the context of a cytosine has changed, or much more rarely, where the very same position may have been called with different cytosine contexts in the intitial CHG, CHH and CpG-context files produced by thebismark_methylation_extractor
. If now, howevever,--CX
was used forbismark2bedGraph
andcoverage2cytosine
(or directly during thebismark_methylation_extractor
step), those different cytosine contexts will be merged again for that position, and will get the cytosine context assigned purely based on the reference sequence. In other words: If there truly was a cytosine context - that may or may not affect the methylation state - it would be, probably erroneously, attributed to the context provided by the reference sequence.
In a nutshell:
By design, Bismark generally bases its methylation call behavior on the reference genome, with the rationale being that
sequencing errors occur much more frequently than true polymorphisms in the genome sequence. The sequence context of insertions
would be dependent on the basecall accuracy of the inserted base(s), so we chose to call these methylation calls in Unknown
context.
Deletions on the other hand are a very rare error in Illumina data, and we believe it should be fine to proceed with the default behavior
of calling changed sequence context because it is very likely that a change from CHH to CpG context, or vice versa, will really
also lead to a change in that residue's methylation state.
If you work with coverage files in CpG
context only you may miss a few positions that should be CpG positions according to
the reference sequence. On the other hand it may also contain a few newly gained CpGs positions that had a different context
(CHG
or CHH
) in the reference sequence but where the read sequence says otherwise.
If you are working in --CX
mode, or with the genome-wide report (or both), I am afraid it is a little more complicated...