Genomic Data Compression

Context

With the democratization of sequencing, the volume of data that must be stored is increasing dramatically, especially in human health. Efficient storage requires fast, energy-efficient compression algorithms. We designed MiMyCS, a C++ tool that performs lossless reference-based compression of NGS datasets such as Illumina reads. To this end, MiMyCS computes a non-exhaustive mapping of the reads against a reference genome and accelerates this step with UPMEM Processing-in-Memory devices. To reduce the overall number of sequence comparisons and further accelerate the process, MiMyCS also incorporates a Bloom-filter-based dispatcher that predicts which parts of the genome each read is most likely to map to. On real human whole-genome sequencing datasets, MiMyCS achieves a speed-up of 1.2x to 2.7x over Genozip, the current state-of-the-art compressor, while maintaining a comparable compression ratio and significantly lowering overall energy consumption.
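To illustrate the idea behind a Bloom-filter-based dispatcher, here is a minimal C++ sketch. It is not the MiMyCS implementation: the class `BloomFilter`, the functions `build_filters` and `dispatch`, the choice of two hash functions, the k-mer length `k`, and the "most k-mer hits wins" rule are all assumptions made for illustration. One filter is built per genome part from its k-mers, and a read is sent to the part whose filter reports the most k-mers of that read.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fixed-size Bloom filter over k-mers (illustrative only).
// Assumes bits > 0. Membership queries may return false positives,
// but never false negatives.
class BloomFilter {
public:
    explicit BloomFilter(std::size_t bits) : bits_(bits, false) {}

    void insert(const std::string& kmer) {
        bits_[h1(kmer) % bits_.size()] = true;
        bits_[h2(kmer) % bits_.size()] = true;
    }

    bool maybe_contains(const std::string& kmer) const {
        return bits_[h1(kmer) % bits_.size()] && bits_[h2(kmer) % bits_.size()];
    }

private:
    // First hash: the standard library string hash.
    static std::size_t h1(const std::string& s) {
        return std::hash<std::string>{}(s);
    }
    // Second hash: 64-bit FNV-1a.
    static std::size_t h2(const std::string& s) {
        std::uint64_t h = 1469598103934665603ULL;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return static_cast<std::size_t>(h);
    }

    std::vector<bool> bits_;
};

// Build one filter per genome part by inserting every k-mer of that part.
std::vector<BloomFilter> build_filters(const std::vector<std::string>& parts,
                                       std::size_t k, std::size_t bits) {
    std::vector<BloomFilter> filters;
    for (const auto& part : parts) {
        BloomFilter f(bits);
        for (std::size_t i = 0; i + k <= part.size(); ++i) {
            f.insert(part.substr(i, k));
        }
        filters.push_back(std::move(f));
    }
    return filters;
}

// Return the index of the part whose filter matches the most k-mers of the
// read. Ties go to the lowest index; returns 0 if no k-mer matches anywhere.
std::size_t dispatch(const std::string& read,
                     const std::vector<BloomFilter>& filters, std::size_t k) {
    std::size_t best = 0, best_hits = 0;
    for (std::size_t p = 0; p < filters.size(); ++p) {
        std::size_t hits = 0;
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            if (filters[p].maybe_contains(read.substr(i, k))) ++hits;
        }
        if (hits > best_hits) {
            best_hits = hits;
            best = p;
        }
    }
    return best;
}
```

For example, calling `build_filters({"ACGTACGTAAAACCCC", "TTTTGGGGTTGGTTGG"}, 4, 1 << 16)` and then `dispatch("ACGTAAAA", filters, 4)` selects the first part, since every 4-mer of the read occurs there. Only the selected part then needs to be considered by the mapping step, which is how such a dispatcher can reduce the number of sequence comparisons.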

Code link

Talk

    D. Lavenier, Compression of Human Genomic Data, 2nd minisymposium on applications and benefits of UPMEM commercial Massively Parallel Processing-in-Memory Platform, EURO-PAR 2024, Madrid, Aug 2024 [slides]

Publication

    F. De Moor, M. Mognol, C. Deltel, E. Drezen, J. Legriel, D. Lavenier, MiMyCS: A Processing-in-Memory Read Mapper for Compressing Next-Gen Sequencing Datasets, 11th International Workshop on High Performance Computing on Bioinformatics, BIBM 2024, Dec. 2024, Lisbon, Portugal
