PPI4HPC system at BSC employed for massive European Genome-phenome Archive

01 Jun 2021

Written by Teresa D'Altri, CRG

Using the powerful storage facilities of the PPI4HPC system at the Barcelona Supercomputing Center (BSC), the European Genome-phenome Archive (EGA) provides an invaluable service: it securely archives all types of personally identifiable genetic and phenotypic data produced by biomedical studies and healthcare centres, and shares them with the worldwide biomedical research community.

The EGA is jointly managed by the European Bioinformatics Institute (EBI) in Cambridge (UK) and the Centre for Genomic Regulation (CRG). The teams leading it are involved in several international partnerships and consortia across numerous scientific fields, where they contribute to ambitious projects.

The human genomic data hosted at the EGA originate from projects in all fields of life science, including cancer research, neurology, immunology and gut microbiome research. The data have been generated with a variety of sequencing technologies, which can produce the entire or partial sequence of an individual’s genome. To date, the EGA holds more than 4000 studies and 6000 datasets that can be browsed and queried on its webpage.

An essential part of building the EGA service was the use of a high-performance storage system, located at BSC and acquired through the PPI4HPC public procurement procedure. This innovative infrastructure offers optimised data storage, with servers that have enough memory and data analysis capacity to scale up the EGA filesystem and process vast amounts of data. The EGA filesystem is a software layer deployed on top of the storage system provided at BSC.

Such data are extremely valuable for research and therefore, from an ethical point of view, should be reused as many times as possible to empower, validate or complement new studies. At the same time, human genomic sequences are considered private data and are protected by the European General Data Protection Regulation (GDPR). Access to the data must therefore be legally controlled and restricted to authorised researchers. The EGA provides a platform that serves both purposes: law-compliant, secure and permanent storage of private genomic data, while keeping the data searchable and thus reusable by other researchers. This is made possible by data encryption methods and a safe storage solution. The new storage infrastructure, realised through PPI4HPC, is based on IBM's parallel file system Spectrum Scale. It allows the creation of fully encrypted filesets, which improves security.

Data sharing can create virtuous cycles with great potential to benefit science and to translate into medical advances. It can amplify the value of any dataset well beyond the scope of its creation, the imagination of its owners and any geographical border. At EGA, we are proud to empower such fruitful worldwide cycles of knowledge by providing a platform that enables safe sharing of sensitive genetic data.