Reproducible science. Accessible, interoperable and reusable data. Regulatory and audit requirements. For the higher education and research community this means more data needs to be kept for longer. SWITCH has developed a long term storage service to complement existing SWITCHengines storage and institutional systems.
Data preservation and archiving can be daunting for researchers and administrations. A time horizon of 5 to 10 years or longer can quickly add complexity and cost, forcing difficult choices about what can be saved and how. In partnership with EPFL and research communities (FORS, SDSC, DLCM), SWITCH identified common use cases and started building a long-term storage service based on the community’s needs: robust, sustainable and able to interoperate with a range of applications used at institutions and by research teams. The result is a long term storage service within the SWITCH environment that makes storing and retrieving data simple and cost-efficient.
Two clear use cases emerged. Offsite long term storage is needed both for research data management and for administrative data preservation for audit purposes. For research data management, an institution may mandate particular data lifecycle management plans, and then provide infrastructure and tools to support them. Through an archiving application, users could then specify which data they are actively working on, and which can be archived. The archived data could then be automatically pushed to the institution’s long term storage capacity at SWITCH.
Later, if a subset is needed again, it can be retrieved from long term storage and restaged as working data without a financial penalty for retrieval. This allows scientific need, not the pricing model, to drive how data is used.
For administrative data, long term storage can be integrated with more classic data preservation solutions. Periodic snapshots are pushed offsite, together with a clear access log and a record showing that they have not been altered, to meet regulatory and audit needs.
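One common way to produce a record showing that a snapshot has not been altered is to store a cryptographic digest alongside it and recompute that digest at audit time. The following is a minimal sketch of that idea only; the function names are hypothetical and it does not describe how the SWITCH service or any particular preservation tool implements its integrity record.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 digest of a snapshot as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, recorded_digest: str) -> bool:
    """Compare a freshly computed digest with the one recorded at archive time."""
    return sha256_digest(data) == recorded_digest


# Hypothetical snapshot content, used for illustration only.
snapshot = b"administrative records, snapshot 1"
recorded = sha256_digest(snapshot)  # would be kept with the archived snapshot
```

At audit time, `is_unaltered(retrieved_bytes, recorded)` returns `True` only if the retrieved bytes match what was originally archived.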
Requirements common to both use cases included a simple, consistent interface, geographical separation from the primary data location for resilience, and measures to ensure data integrity. A different tradeoff between price and performance compared to live data storage such as SWITCHengines was also desirable.
A consistent technical interface is important for the storage layer to be able to support a wide range of potential archiving tools and systems. The current de facto standard is object storage, in particular the S3 interface introduced by AWS, and this was therefore selected as the only interface. Objects are stored or retrieved based on an identifier, rather than needing to be mounted directly on a file system connected to the application. Performance may be lower than that of conventional block or volume storage on SWITCHengines, but scalability and sustainability are prioritized, and the simplicity of a single interface enables costs to be reduced.
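To illustrate what "stored or retrieved based on an identifier" means in practice, here is a toy in-memory model of the object storage access pattern. It is a sketch for explanation only: the class and method names are hypothetical, and it is neither the S3 API nor the SWITCH service. The point is that an application addresses each object by a bucket name and a key, with no file system to mount.

```python
class InMemoryObjectStore:
    """Toy model of key-based object access (not a real S3 client)."""

    def __init__(self) -> None:
        # bucket name -> {object key -> object bytes}
        self._buckets: dict[str, dict[str, bytes]] = {}

    def put_object(self, bucket: str, key: str, body: bytes) -> None:
        """Store an object under its identifier, replacing any previous one."""
        self._buckets.setdefault(bucket, {})[key] = bytes(body)

    def get_object(self, bucket: str, key: str) -> bytes:
        """Retrieve an object by its identifier; raises KeyError if absent."""
        return self._buckets[bucket][key]

    def list_keys(self, bucket: str, prefix: str = "") -> list[str]:
        """List keys in a bucket that start with the given prefix."""
        return sorted(k for k in self._buckets.get(bucket, {}) if k.startswith(prefix))


store = InMemoryObjectStore()
# Keys are flat strings; the slashes are a naming convention, not directories.
store.put_object("archive", "project-a/2019/run-001.dat", b"raw measurements")
store.put_object("archive", "project-a/2019/run-002.dat", b"more measurements")
```

An archiving tool written against this kind of interface only needs put, get and list operations, which is part of what makes a single interface simple to support.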
The requirements for resilience and data integrity meant that SWITCH long term storage would be deployed at a location remote not only from the pilot institutions, but also from SWITCHengines. This enables data from SWITCHengines users to be resiliently archived once it is no longer actively used. SWITCH selected Cloudian as the supplier for the platform infrastructure. This complements the Ceph implementation on SWITCHengines to give maximum resilience to the services. Finally, an erasure coding scheme was selected that enables data to be recovered in the event of multiple disk failures.
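The trade-off behind erasure coding can be shown with simple arithmetic. In a scheme that splits each object into k data fragments and adds m parity fragments (for example a Reed-Solomon code), any k of the k + m fragments are enough to rebuild the object, so up to m fragments can be lost. The sketch below computes the resulting fault tolerance and storage overhead; the parameters k = 4 and m = 2 are purely hypothetical, since the article does not state which scheme was chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErasureScheme:
    data_fragments: int    # k: fragments holding the original data
    parity_fragments: int  # m: additional fragments used for recovery

    @property
    def tolerated_failures(self) -> int:
        """Fragments that can be lost while the object stays recoverable."""
        return self.parity_fragments

    @property
    def storage_overhead(self) -> float:
        """Raw capacity consumed per unit of user data: (k + m) / k."""
        return (self.data_fragments + self.parity_fragments) / self.data_fragments


# Hypothetical parameters for illustration, not the scheme used by SWITCH.
example = ErasureScheme(data_fragments=4, parity_fragments=2)
```

With these assumed values the scheme survives two lost fragments while using 1.5 times the raw capacity; keeping three full replicas would also survive two losses but use 3 times the capacity.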
During development, EPFL and DLCM carried out testing on an interim installation, and the full service has been in place and under test since the end of September 2019. While pilot users are still welcome, the service will be ready for real use at project pricing in Q4 2019 and will be included in the tariff model as a full service from 2020.