Filtering out non-bisulfite converted reads
Filtering incomplete bisulfite conversion from Bismark BAM files (optional). This script examines the methylation calls of reads, or read pairs for paired-end sequencing, and filters out reads that reach a certain threshold of methylated calls in non-CG context (the default is 3). By default, filter_non_conversion looks for a certain number of methylated non-CG calls, but a percentage methylation cutoff may be specified instead.
Please Note: Be aware that this kind of filtering is not advisable - and will introduce biases - if you work with organisms which exhibit any appreciable levels of non-CG methylation (e.g. most plants).
Writes out a file called nonCG_filtered.bam and a file called nonCG_removed_seqs.bam, as well as a short report of how many sequences have been analysed and removed.
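The default count-based filter described above can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: it assumes the methylation calls are available as a Bismark-style call string (upper case = methylated, lower case = unmethylated; z/Z for CpG, x/X for CHG, h/H for CHH), that only X and H count as methylated non-CG calls, and that a read fails as soon as the count reaches the threshold. The function names are made up for this example.

```python
# Illustrative sketch only; function names are hypothetical.
# Call characters: z/Z = CpG, x/X = CHG, h/H = CHH, u/U = unknown context,
# '.' = not a cytosine. Upper case means methylated.
METHYLATED_NON_CG = frozenset("XH")  # assumption: unknown-context calls are ignored


def count_methylated_non_cg(calls: str) -> int:
    """Count methylated cytosines in CHG or CHH context in a call string."""
    return sum(1 for c in calls if c in METHYLATED_NON_CG)


def read_fails(calls: str, threshold: int = 3) -> bool:
    """True if the read reaches the threshold of methylated non-CG calls."""
    return count_methylated_non_cg(calls) >= threshold


def pair_fails(calls_r1: str, calls_r2: str, threshold: int = 3) -> bool:
    """Either Read 1 or Read 2 can fail the entire read pair."""
    return read_fails(calls_r1, threshold) or read_fails(calls_r2, threshold)


if __name__ == "__main__":
    print(read_fails("X.H.x.H"))         # three methylated non-CG calls
    print(pair_fails("h.x.Z", "X...H"))  # neither read reaches three
```

Reads for which the check returns True would end up among the removed sequences; all others would be kept.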
USAGE:
Please also note that for paired-end BAM files filter_non_conversion expects Read 1 and Read 2 to follow each other in consecutive lines! If the file has been sorted by position, make sure that you re-sort it by read name first (e.g. using samtools sort -n).
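To see why the read order matters, here is a small Python sketch of a check that Read 1 and Read 2 follow each other. It is a simplification for illustration: it works on a plain list of read names rather than a BAM file, and it assumes that both mates of a pair carry exactly the same read name. The helper name is hypothetical.

```python
# Illustrative sketch only; the helper name is hypothetical.
from typing import Sequence


def mates_are_adjacent(read_names: Sequence[str]) -> bool:
    """Check that records come in consecutive (Read 1, Read 2) pairs.

    Assumption: both mates of a pair share an identical read name.
    """
    if len(read_names) % 2 != 0:
        return False  # an unpaired record is left over
    for i in range(0, len(read_names), 2):
        if read_names[i] != read_names[i + 1]:
            return False  # the next record is not the mate
    return True


if __name__ == "__main__":
    print(mates_are_adjacent(["r1", "r1", "r2", "r2"]))  # name-grouped order
    print(mates_are_adjacent(["r1", "r2", "r1", "r2"]))  # mates separated
```

In a position-sorted file the mates of a pair are typically separated by other records, which is the situation the second call mimics.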
-s/--single
Filter single-end Bismark BAM files. If not specified the library type is auto-detected.
-p/--paired
Filter paired-end Bismark BAM files. If not specified the library type is auto-detected.
--threshold [int]
The number of methylated cytosines in non-CG context at which reads or read pairs are filtered out. For paired-end files either Read 1 or Read 2 can fail the entire read pair. [Default: 3].
--percentage_cutoff [int]
Instead of filtering on an absolute count of methylated cytosines in non-CG context (see --threshold [int]) this option allows you to define an overall percentage of methylation in non-CG context (both CHH and CHG) which, if reached or exceeded, results in the read or read pair being filtered out. For paired-end files either Read 1 or Read 2 can fail the entire read pair. Also requires a minimum number of cytosines in non-CG context to make confident filtering choices (see --minimum_count [int]).
--minimum_count [int]
At least this number of cytosines in non-CG context (CHH or CHG) have to be seen in a read (irrespective of their methylation state) before the --percentage_cutoff filter kicks in. [Default: 5].
--consecutive
Non-CG methylation has to be found on consecutive non-CGs. Any kind of unmethylated cytosine (in any context) resets the methylated non-CG counter to 0. [Default: OFF].
--samtools_path
The path to your Samtools installation, e.g. /home/user/samtools/. Does not need to be specified explicitly if Samtools is in the PATH already.
--help
Displays this help text and exits.
--version
Displays version information and exits.
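The --percentage_cutoff / --minimum_count and --consecutive options described above can be illustrated with the following Python sketch. It is not the script's implementation. It assumes Bismark-style call strings (upper case = methylated; z/Z CpG, x/X CHG, h/H CHH, u/U unknown context), counts only x/X and h/H as non-CG cytosines, and in consecutive mode lets methylated CpG calls neither extend nor reset the run. All function names are hypothetical.

```python
# Illustrative sketch only; function names are hypothetical.
METHYLATED_NON_CG = frozenset("XH")
UNMETHYLATED_NON_CG = frozenset("xh")
UNMETHYLATED_ANY_CONTEXT = frozenset("zxhu")


def fails_percentage(calls: str, percentage_cutoff: float,
                     minimum_count: int = 5) -> bool:
    """Percentage-based filter on non-CG methylation.

    The filter only applies once at least `minimum_count` non-CG
    cytosines (methylated or not) have been seen in the read.
    """
    meth = sum(1 for c in calls if c in METHYLATED_NON_CG)
    unmeth = sum(1 for c in calls if c in UNMETHYLATED_NON_CG)
    total = meth + unmeth
    if total == 0 or total < minimum_count:
        return False  # too few non-CG cytosines for a confident decision
    return 100.0 * meth / total >= percentage_cutoff


def fails_consecutive(calls: str, threshold: int = 3) -> bool:
    """Count-based filter that requires consecutive methylated non-CG calls.

    Any unmethylated cytosine (in any context) resets the counter to 0.
    Assumption: methylated CpG calls (Z) leave the counter unchanged.
    """
    run = 0
    for c in calls:
        if c in METHYLATED_NON_CG:
            run += 1
            if run >= threshold:
                return True
        elif c in UNMETHYLATED_ANY_CONTEXT:
            run = 0
    return False


if __name__ == "__main__":
    print(fails_percentage("X.H.x.h.h", percentage_cutoff=40))  # 2 of 5 non-CG calls
    print(fails_consecutive("X.H.z.H"))  # run is reset by the unmethylated CpG
```

For paired-end data, either check would be applied to Read 1 and Read 2 separately, with a failure of one read failing the whole pair.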
If you get stuck at any point or have any questions or comments please contact me via e-mail: fkrueger@altoslabs.com