Methylation extraction
Appendix (III): Bismark Methylation Extractor
A brief description of the Bismark methylation extractor and a full list of options can also be viewed by typing bismark_methylation_extractor --help
USAGE: bismark_methylation_extractor [options] <filenames>
ARGUMENTS:
<filenames>
A space-separated list of Bismark result files in SAM format from which methylation information is extracted for every cytosine in the reads. For alignment files in the older custom Bismark output see option --vanilla
.
OPTIONS:
-s/--single-end
Input file(s) are Bismark result file(s) generated from single-end read data. If neither -s
nor -p
is set the type of experiment will be determined automatically.
-p/--paired-end
Input file(s) are Bismark result file(s) generated from paired-end read data. If neither -s
nor -p
is set the type of experiment will be determined automatically.
--vanilla
The Bismark result input file(s) are in the old custom Bismark format (up to version 0.5.x) and not in SAM format which is the default as of Bismark version 0.6.x or higher. Default: OFF.
--no_overlap
For paired-end reads it is theoretically possible that Read 1 and Read 2 overlap. This option avoids scoring overlapping methylation calls twice (only methylation calls of read 1 are used for in the process since read 1 has historically higher quality basecalls than read 2). Whilst this option removes a bias towards more methylation calls in the center of sequenced fragments it may de facto remove a sizeable proportion of the data. This option is on by default for paired-end data but can be disabled using --include_overlap
. Default: ON.
--include_overlap
For paired-end data all methylation calls will be extracted irrespective of whether they overlap or not. Default: OFF.
--ignore <int>
Ignore the first <int> bp from the 5' end of Read 1 (or single-end alignment files) when processing the methylation call string. This can remove e.g. a restriction enzyme site at the start of each read or any other source of bias (such as PBAT-Seq data).
--ignore_r2 <int>
Ignore the first <int> bp from the 5' end of Read 2 of paired-end sequencing results only. Since the first couple of bases in Read 2 of BS-Seq experiments show a severe bias towards non-methylation as a result of end-repairing sonicated fragments with unmethylated cytosines (see M-bias plot), it is recommended that the first couple of bp of Read 2 are removed before starting downstream analysis. Please see the section on M-bias plots in the Bismark User Guide for more details.
--ignore_3prime <int>
Ignore the last <int> bp from the 3' end of Read 1 (or single-end alignment files) when processing the methylation call string. This can remove unwanted biases from the end of reads.
--ignore_3prime_r2 <int>
Ignore the last <int> bp from the 3' end of Read 2 of paired-end sequencing results only. This can remove unwanted biases from the end of reads.
--comprehensive
Specifying this option will merge all four possible strand-specific methylation info into context-dependent output files. The default contexts are:
- CpG context
- CHG context
-
CHH context
-
--merge_non_CpG
This will produce two output files (in --comprehensive mode
) or eight strand-specific output files (default) for Cs in
- CpG context
-
non-CpG context
-
--report
Prints out a short methylation summary as well as the parameters used to run this script. Default: ON.
--no_header
Suppresses the Bismark version header line in all output files for more convenient batch processing.
-o/--output DIR
Allows specification of a different output directory (absolute or relative path). If not specified explicitly, the output will be written to the current directory.
--samtools_path
The path to your Samtools installation, e.g. /home/user/samtools/. Does not need to be specified explicitly if Samtools is in the PATH already.
--gzip
The methylation extractor files (CpG_OT_..., CpG_OB_... etc) will be written out in a GZIP
compressed form to save disk space. This option is also passed on to the genome-wide cytosine report. bedGraph
and coverage
files are written out as .gz
by default.
--mbias_only
The methylation extractor will read the entire file but only output the M-bias table and plots as well as a report (optional) and then quit. Default: OFF.
--mbias_off
The methylation extractor will process the entire file as usual but doesn't write out any M-bias report. Only recommended for users who deliberately want to keep an earlier version of the M-bias report. Default: OFF.
--multicore <int>
Sets the number of cores to be used for the methylation extraction process. If system resources are plentiful this is a viable option to speed up the extraction process (we observed a near linear speed increase for up to 10 cores used). Please note that a typical process of extracting a BAM file and writing out .gz
output streams will in fact use ~3 cores per value of --multicore <int>
specified (1 for the methylation extractor itself, 1 for a Samtools stream, 1 for a GZIP stream), so --multicore 10
is likely to use around 30 cores of system resources. This option has no bearing on the bismark2bedGraph
or coverage2cytosine
(genome-wide cytosine report) processes.
--version
Displays version information.
-h/--help
Displays this help file and exits.
bedGraph specific options:
--bedGraph
After finishing the methylation extraction, the methylation output is written into a sorted bedGraph
file that reports the position of a given cytosine and its methylation state (in %, see details below) using 0-based genomic start and 1-based end coordinates. The methylation extractor output is temporarily split up into temporary files, one per chromosome (written into the current directory or folder specified with -o/--output
); these temp files are then used for sorting and deleted afterwards. By default, only cytosines in CpG context are sorted. The option --CX_context
may be used to report all cytosines irrespective of sequence context (this will take MUCH longer!). The bedGraph
conversion step is performed by the external module bismark2bedGraph
; this script needs to reside in the same folder as the bismark_methylation_extractor itself.
--zero_based
Write out an additional coverage file (ending in .zero.cov
) that uses 0-based genomic start and 1-based genomic end coordinates (zero-based, half-open), like used in the bedGraph
file, instead of using 1-based coordinates throughout. Default: OFF.
--cutoff [threshold]
The minimum number of times any methylation state (methylated or unmethylated) has to be seen for a nucleotide before its methylation percentage is reported. Default: 1.
--remove_spaces
Replaces white spaces in the sequence ID field with underscores to allow sorting.
--CX/--CX_context
The sorted bedGraph
output file contains information on every single cytosine that was covered in the experiment irrespective of its sequence context. This applies to both forward and reverse strands. Please be aware that this option may generate large temporary and output files and may take a long time to sort (up to many hours). Default: OFF (i.e. Default = CpG context only).
--buffer_size <string>
This allows you to specify the main memory sort buffer when sorting the methylation information. Either specify a percentage of physical memory by appending % (e.g. --buffer_size 50%
) or a multiple of 1024 bytes, e.g. K
multiplies by 1024, M
by 1048576 and so on for T
etc. (e.g. --buffer_size 20G
). For more information on sort type info sort
on a command line. Defaults to 2G.
--scaffolds/--gazillion
Users working with unfinished genomes sporting tens or even hundreds of thousands of scaffolds/contigs/chromosomes frequently encountered errors with pre-sorting reads to individual chromosome files. These errors were caused by the operating system's limit of the number of filehandles that can be written to at any one time (typically 1024; to find out this limit on Linux, type: ulimit -a
). To bypass the limitation of open filehandles, the option --scaffolds
does not pre-sort methylation calls into individual chromosome files. Instead, all input files are temporarily merged into a single file (unless there is only a single file), and this file will then be sorted by both chromosome AND position using the Unix sort command. Please be aware that this option might take a looooong time to complete, depending on the size of the input files, and the memory you allocate to this process (see --buffer_size
). Nevertheless, it seems to be working.
--ample_memory
Using this option will not sort chromosomal positions using the UNIX sort
command, but will instead use two arrays to sort methylated and unmethylated calls. This may result in a faster sorting process of very large files, but this comes at the cost of a larger memory footprint (two arrays of the length of the largest human chromosome 1 (~250M bp) consume around 16GB of RAM). Due to overheads in creating and looping through these arrays it seems that it will actually be slower for small files (few million alignments), and we are currently testing at which point it is advisable to use this option. Note that --ample_memory
is not compatible with options --scaffolds/--gazillion
(as it requires pre-sorted files to begin with).
Genome-wide cytosine methylation report specific options:
--cytosine_report
After the conversion to bedGraph has completed, the option --cytosine_report
produces a genome-wide methylation report for all cytosines in the genome. By default, the output uses 1-based chromosome coordinates (zero-based start coords are optional) and reports CpG context only (all cytosine context is optional). The output considers all Cs on both forward and reverse strands and reports their position, strand, trinucleotide content and methylation state (counts are 0 if not covered). The cytosine report conversion step is performed by the external module coverage2cytosine
; this script needs to reside in the same folder as the bismark_methylation_extractor itself.
--CX/--CX_context
The output file contains information on every single cytosine in the genome irrespective of its context. This applies to both forward and reverse strands. Please be aware that this will generate output files with > 1.1 billion lines for a mammalian genome such as human or mouse. Default: OFF (i.e. Default = CpG context only).
--zero_based
Uses 0-based genomic coordinates instead of 1-based coordinates. Default: OFF.
--genome_folder <path>
Enter the genome folder you wish to use to extract sequences from (full path only). Accepted formats are FastA files ending with .fa
or .fasta
. Specifying a genome folder path is mandatory.
--split_by_chromosome
Writes the output into individual files for each chromosome instead of a single output file. Files are named to include the input filename as well as the chromosome number.
OUTPUT
The bismark_methylation_extractor output is in the form (tab delimited, 1-based coords):
<seq-ID> <methylation state*> <chromosome> <start position (= end position)> <methylation call>
Methylated cytosines receive a '+' orientation,
Unmethylated cytosines receive a '-' orientation.
The bedGraph output (optional) looks like this (tab-delimited, 0-based start, 1-based end coords):
track type=bedGraph (header line)
<chromosome> <start position> <end position> <methylation percentage>
The coverage output looks like this (tab-delimited; 1-based genomic coords):
<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count unmethylated>
The genome-wide cytosine report (optional) is tab-delimited in the following format (1-based coords):
<chromosome> <position> <strand> <count methylated> <count unmethylated> <C-context> <trinucleotide context>