Genomic Data Compression

Context

With the democratization of sequencing, the volume of data that must be stored is increasing dramatically, especially in human health. Efficient storage requires fast, energy-efficient compression algorithms. We present MiMyCS, a C++ tool that performs lossless reference-based compression of NGS datasets such as Illumina reads. To this end, MiMyCS computes a non-exhaustive mapping of the reads against a reference genome and accelerates this step with UPMEM Processing-in-Memory devices. To reduce the overall number of sequence comparisons and further accelerate the process, MiMyCS also incorporates a Bloom filter-based dispatcher that predicts which parts of the genome each read is most likely to map to. Using real human whole-genome sequencing datasets, we show that MiMyCS achieves a speed-up between 1.2x and 2.7x over Genozip, the current state-of-the-art compressor, while maintaining a comparable compression ratio and significantly lowering overall energy consumption.
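To make the dispatcher idea more concrete, here is a minimal sketch of how a Bloom filter-based dispatcher could work. It is not taken from the MiMyCS code base: the class names (`BloomFilter`, `Dispatcher`), the `fnv1a` hash, the k-mer voting strategy, and all sizes are assumptions chosen for illustration. The sketch builds one Bloom filter per genome part from that part's k-mers, then sends a read to the part whose filter reports the most k-mer hits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Illustrative sketch only: names, hashing, and parameters are assumptions,
// not the actual MiMyCS implementation.

// Seeded FNV-1a hash over the characters of a k-mer.
inline std::uint64_t fnv1a(std::string_view s, std::uint64_t seed) {
    std::uint64_t h = 14695981039346656037ULL ^ seed;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

class BloomFilter {
public:
    BloomFilter(std::size_t bits, unsigned hashes)
        : bits_(bits, false), hashes_(hashes) {
        assert(bits > 0 && hashes > 0);
    }

    void insert(std::string_view kmer) {
        for (unsigned i = 0; i < hashes_; ++i) bits_[index(kmer, i)] = true;
    }

    // May return true for a k-mer that was never inserted (false positive),
    // but never returns false for an inserted k-mer.
    bool maybe_contains(std::string_view kmer) const {
        for (unsigned i = 0; i < hashes_; ++i)
            if (!bits_[index(kmer, i)]) return false;
        return true;
    }

private:
    // Double hashing: the i-th bit position is derived from two base hashes.
    std::size_t index(std::string_view kmer, unsigned i) const {
        const std::uint64_t h1 = fnv1a(kmer, 0);
        const std::uint64_t h2 = fnv1a(kmer, 0x9E3779B97F4A7C15ULL) | 1ULL;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    unsigned hashes_;
};

class Dispatcher {
public:
    // Builds one Bloom filter per genome part from all k-mers of that part.
    Dispatcher(const std::vector<std::string>& parts, std::size_t k,
               std::size_t bits_per_filter, unsigned hashes)
        : k_(k) {
        assert(k > 0);
        for (const std::string& part : parts) {
            BloomFilter bf(bits_per_filter, hashes);
            const std::string_view view(part);
            for (std::size_t i = 0; i + k <= view.size(); ++i)
                bf.insert(view.substr(i, k));
            filters_.push_back(std::move(bf));
        }
    }

    // Returns the index of the part with the most k-mer hits for this read,
    // or -1 if no k-mer hits any filter (including reads shorter than k).
    // Ties are resolved in favor of the lowest part index.
    int predict(std::string_view read) const {
        int best = -1;
        std::size_t best_hits = 0;
        for (std::size_t p = 0; p < filters_.size(); ++p) {
            std::size_t hits = 0;
            for (std::size_t i = 0; i + k_ <= read.size(); ++i)
                if (filters_[p].maybe_contains(read.substr(i, k_))) ++hits;
            if (hits > best_hits) {
                best_hits = hits;
                best = static_cast<int>(p);
            }
        }
        return best;
    }

private:
    std::size_t k_;
    std::vector<BloomFilter> filters_;
};
```

With two toy genome parts and `k = 8`, constructing `Dispatcher d(parts, 8, 1u << 16, 3)` and calling `d.predict(read)` gives the index of the part to map the read against, so the mapper only has to compare the read with that part instead of the whole reference. A real dispatcher would likely use a compact 2-bit k-mer encoding, handle reverse complements, and tune the filter size to the acceptable false-positive rate; none of that is shown here.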

Code link

https://gitlab.inria.fr/pim/org.pim.srm

Talk

slides

Publication

F. De Moor, M. Mognol, C. Deltel, E. Drezen, J. Legriel, D. Lavenier, MiMyCS: A Processing-in-Memory Read Mapper for Compressing Next-Gen Sequencing Datasets, 11th International Workshop on High Performance Computing on Bioinformatics, BIBM 2024, Dec. 2024, Lisbon, Portugal