
Threading model

Why the speedup is larger than "Rust is faster"


The bottleneck is gzip, not trimming. The actual trimming logic (adapter alignment, quality clipping) accounts for only ~5% of runtime; the rest is dominated by gzip compression (~60%) and decompression (~30%). Rust's speed advantage over Perl/Python only applies to that 5%.

The real wins come from architectural differences.

Trim Galore runs Cutadapt on R1, then R2, then pair-validates: reading and recompressing the data three separate times. The Oxidized Edition does everything in one pass.

Each worker independently handles trimming and gzip compression for its batch of reads, producing independently-compressed gzip blocks concatenated in order (valid per RFC 1952). This distributes the dominant cost (compression) across N workers instead of funneling through one thread.
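This multi-member property is easy to demonstrate with Python's stdlib gzip module (a toy illustration, not the tool's Rust code): blocks compressed independently and simply concatenated form one valid gzip stream.

```python
import gzip

# Each worker compresses its batch independently (in parallel, in the real
# tool); concatenating the resulting gzip members in batch order yields a
# single valid multi-member gzip stream per RFC 1952.
batches = [b"@read1\nACGT\n+\nFFFF\n", b"@read2\nTTGG\n+\nFFFF\n"]
blocks = [gzip.compress(batch) for batch in batches]
stream = b"".join(blocks)

# Standard decompressors (zcat, gzip -d, Python's gzip) read all members.
assert gzip.decompress(stream) == b"".join(batches)
```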

When you pass -j N to Trim Galore, three separate programs each independently spawn threads. The theoretical maximum is approximately 3N+3, though not all threads are necessarily active simultaneously:

```
Trim Galore -j N thread breakdown (theoretical maximum):
  Cutadapt:                 N workers + 1 reader + 1 writer  = N + 2
  pigz (compress):          N threads                        = N
  pigz/igzip (decompress):  up to N threads                  ≈ N
  Perl:                     1 main process                   = 1
  -------------------------------------------------------------------
  Total                                                      ≈ 3N + 3
```

Note: The nf-core trimgalore module accounts for this by reserving 4 cores for subprocess overhead and passing task.cpus - 4 to the -j flag (e.g., 12 allocated CPUs becomes -j 8). Thread counts above were observed via ps during benchmarking and represent approximate peak values.

The Oxidized Edition uses a single process with a fixed infrastructure cost of +4 threads:

```
Oxidized Edition --cores N thread breakdown:
  N worker threads (each: trim + gzip compress -> independent gzip block)
  2 decompression threads (one per input file)
  1 batcher thread (creates numbered batches of 4096 reads)
  1 main thread (collects blocks in order -> writes to output files)
  -------------------------------------------------------------------
  Total                                                       = N + 4
```

At --cores 1, the worker-pool is bypassed entirely: a single thread does everything with zero parallelism overhead (1 thread, 5 MB RAM). The infrastructure cost only applies from --cores 2 upward, where each additional core adds exactly 1 thread and ~10 MB of memory.
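The worker-pool pattern described above (batcher feeds workers, main thread collects compressed blocks in batch order) can be sketched with Python's stdlib. This is a toy illustration, not the tool's actual Rust implementation; names, batch size, and the elided trimming step are all illustrative.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch: bytes) -> bytes:
    """Worker: trim (elided here) + compress into an independent gzip block."""
    trimmed = batch  # the real tool does adapter/quality trimming here
    return gzip.compress(trimmed)

# Batcher: split the read stream into numbered batches.
reads = [f"@read{i}\nACGT\n+\nFFFF\n".encode() for i in range(10)]
batches = [b"".join(reads[i:i + 4]) for i in range(0, len(reads), 4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map yields results in submission order, so blocks are
    # concatenated in batch order even if workers finish out of order.
    stream = b"".join(pool.map(process_batch, batches))

assert gzip.decompress(stream) == b"".join(reads)
```

The key property is that ordering is restored at collection time, so workers never need to coordinate with each other.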

| Cores | TG threads (up to ~3N+3) | Oxidized threads (N+4) |
| ----- | ------------------------ | ---------------------- |
| 1     | up to ~6                 | 1                      |
| 4     | up to ~15                | 8                      |
| 8     | up to ~27                | 12                     |
| 16    | not measured             | 20                     |

At -j 8 vs --cores 8, that is up to ~27 threads versus exactly 12, yet the Oxidized Edition is 1.9x faster.
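The two thread formulas are simple enough to sanity-check in a few lines (a sketch of the arithmetic above, not code from either tool):

```python
def tg_threads_max(n: int) -> int:
    """Trim Galore -j N: theoretical peak of ~3N + 3 threads."""
    return 3 * n + 3

def oxidized_threads(n: int) -> int:
    """Oxidized --cores N: N + 4, except the single-threaded bypass at N=1."""
    return 1 if n == 1 else n + 4

for n in (1, 4, 8, 16):
    print(n, tg_threads_max(n), oxidized_threads(n))
```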

Parallel efficiency on the Xeon: 82% (2 cores) to 82% (4 cores) to 81% (8 cores) to 78% (16 cores) to 64% (24 cores). Scaling remains near-linear up to 16 cores, with diminishing returns beyond that. For most production use, --cores 8 to --cores 16 is the sweet spot. Beyond 16, additional cores still help but deliver progressively less benefit per core.
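Parallel efficiency here means speedup divided by core count. The timings below are illustrative values chosen to reproduce the quoted percentages, not the benchmark's raw numbers:

```python
# efficiency(N) = (t1 / tN) / N, where t1 is single-core wall time.
t1 = 100.0  # illustrative baseline, arbitrary units
timings = {2: 61.0, 4: 30.5, 8: 15.4, 16: 8.0, 24: 6.5}  # illustrative

for n, tn in timings.items():
    speedup = t1 / tn
    print(f"{n:>2} cores: speedup {speedup:.1f}x, efficiency {speedup / n:.0%}")
```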

| --cores | Memory | Notes                                                |
| ------- | ------ | ---------------------------------------------------- |
| 1       | 5 MB   | Worker-pool bypassed; single-threaded path.          |
| 2       | 43 MB  | Infrastructure threads come online (+4 fixed cost).  |
| 4       | 62 MB  |                                                      |
| 8       | 100 MB |                                                      |
| 16      | 171 MB |                                                      |
| 24      | 157 MB |                                                      |

Each additional worker adds ~10 MB on average, most of it the per-worker compression buffer.

  • Zero external dependencies: No Python, no Cutadapt, no pigz. Single static binary.
  • Simpler deployment: cargo install or download a binary. No conda environment needed.
  • Single-pass paired-end: Both reads processed together, with guaranteed synchronization, no temp files.
  • Lower memory: 5 MB single-threaded, ~10 MB per additional worker. No Python interpreter, no subprocess pipes.
  • CPU-efficient: Uses 2.6 to 5x less CPU time than Trim Galore. Meaningful on shared HPC clusters where CPU-hours = money.
  • Reproducible: Pure Rust with deterministic behaviour across platforms.
  • New features: Poly-G trimming (auto-detected for 2-colour instruments like NovaSeq/NextSeq) and poly-A trimming, both built in without external tools.

The Oxidized Edition can use the full CPU allocation directly: no need to subtract cores for subprocess overhead, since everything runs in a single process. If a Nextflow process has 12 CPUs, just pass --cores 12.

Where the historical nf-core pattern passed task.cpus - 4, the equivalent Oxidized invocation is simply --cores task.cpus: the fixed +4 infrastructure threads fit within the existing CPU budget, so no manual subtraction is needed.

--cores N produces byte-identical decompressed output for any N (verified via md5 across the benchmark range). The worker-pool emits independently-compressed gzip blocks in deterministic order, so the gzipped bytes themselves vary by core count, but decompressing them yields the same FASTQ content every time.
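The verification logic can be reproduced with a short script: hash the decompressed stream rather than the .gz bytes. The example below simulates "different gzipped bytes, same content" via compression level instead of block boundaries; file handling is illustrative.

```python
import gzip
import hashlib
import os
import tempfile

def fastq_digest(path: str) -> str:
    """md5 of the decompressed stream, mirroring `zcat file | md5sum`."""
    with gzip.open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

# Same FASTQ content, different gzipped bytes (here: compression level;
# in the tool: block boundaries that shift with --cores N).
content = b"@read1\nACGT\n+\nFFFF\n" * 1000
paths = []
for level in (1, 9):
    fd, path = tempfile.mkstemp(suffix=".fq.gz")
    with os.fdopen(fd, "wb") as raw:
        raw.write(gzip.compress(content, compresslevel=level))
    paths.append(path)

assert fastq_digest(paths[0]) == fastq_digest(paths[1])

for path in paths:
    os.remove(path)
```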