Context
With the democratization of sequencing, the volume of data that must be stored is increasing dramatically, especially in human health. Efficient storage requires fast, energy-efficient compression algorithms. We designed MiMyCS, a C++ tool that performs lossless reference-based compression of NGS datasets such as Illumina reads. To this end, MiMyCS computes a non-exhaustive mapping against a reference genome and accelerates this step with UPMEM Processing-in-Memory devices. To reduce the overall number of sequence comparisons and further accelerate the process, MiMyCS also incorporates a Bloom filter-based dispatcher that predicts which parts of the genome each read is most likely to map to. On real whole human genome sequencing datasets, we show that MiMyCS achieves a speed-up of between 1.2x and 2.7x over Genozip, the current state-of-the-art compressor, while maintaining a comparable compression ratio and significantly lowering overall energy consumption.
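The dispatcher idea can be illustrated with a minimal C++ sketch. This is not the MiMyCS implementation: the names `BloomFilter`, `Dispatcher`, `build_dispatcher` and `predict_chunk`, the FNV-1a hashing, the three hash probes and the fixed-length chunking are all assumptions made for illustration. The sketch builds one Bloom filter of k-mers per reference chunk and sends a read to the chunk whose filter reports the most k-mer hits.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical Bloom filter over k-mers (illustrative, not MiMyCS code).
class BloomFilter {
public:
    explicit BloomFilter(std::size_t bits) : bits_(bits ? bits : 1, false) {}

    void insert(std::string_view kmer) {
        for (std::uint64_t i = 0; i < kNumHashes; ++i) bits_[index(kmer, i)] = true;
    }

    // May return true for a k-mer never inserted (false positive),
    // but never returns false for an inserted k-mer.
    bool maybe_contains(std::string_view kmer) const {
        for (std::uint64_t i = 0; i < kNumHashes; ++i)
            if (!bits_[index(kmer, i)]) return false;
        return true;
    }

private:
    static constexpr std::uint64_t kNumHashes = 3;  // assumed value

    // 64-bit FNV-1a, perturbed by a seed.
    static std::uint64_t fnv1a(std::string_view s, std::uint64_t seed) {
        std::uint64_t h = 14695981039346656037ULL ^ seed;
        for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
        return h;
    }

    // Double hashing: the i-th probe is h1 + i * h2.
    std::size_t index(std::string_view kmer, std::uint64_t i) const {
        const std::uint64_t h1 = fnv1a(kmer, 0);
        const std::uint64_t h2 = fnv1a(kmer, 0x9E3779B97F4A7C15ULL) | 1ULL;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
};

// One Bloom filter per fixed-length chunk of the reference.
struct Dispatcher {
    std::size_t k = 0;
    std::vector<BloomFilter> filters;
};

Dispatcher build_dispatcher(const std::string& reference, std::size_t chunk_len,
                            std::size_t k, std::size_t bits_per_filter) {
    Dispatcher d;
    d.k = k;
    if (chunk_len == 0 || k == 0) return d;
    const std::string_view ref(reference);
    for (std::size_t start = 0; start < ref.size(); start += chunk_len) {
        // Extend each chunk by k-1 bases so k-mers spanning a boundary are indexed.
        const std::string_view chunk = ref.substr(start, chunk_len + k - 1);
        BloomFilter bf(bits_per_filter);
        for (std::size_t i = 0; i + k <= chunk.size(); ++i) bf.insert(chunk.substr(i, k));
        d.filters.push_back(bf);
    }
    return d;
}

// Returns the index of the chunk with the most k-mer hits (first chunk on ties).
std::size_t predict_chunk(const Dispatcher& d, std::string_view read) {
    std::size_t best = 0, best_hits = 0;
    for (std::size_t c = 0; c < d.filters.size(); ++c) {
        std::size_t hits = 0;
        for (std::size_t i = 0; d.k > 0 && i + d.k <= read.size(); ++i)
            if (d.filters[c].maybe_contains(read.substr(i, d.k))) ++hits;
        if (hits > best_hits) { best_hits = hits; best = c; }
    }
    return best;
}
```

A caller would build the dispatcher once per reference, then call `predict_chunk` on each read and run the costly alignment only against the returned chunk. Because Bloom filters never produce false negatives, a read copied exactly from a chunk always scores the maximum there. False positives can only inflate the scores of other chunks, which larger filters make unlikely.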
Code link
Talk
- D. Lavenier, Compression of Human Genomic Data, 2nd minisymposium on applications and benefits of UPMEM commercial Massively Parallel Processing-in-Memory Platform, EURO-PAR 2024, Madrid, Aug 2024 [slides]