Data Ecosystem reference architecture

The new deliverable “Data Ecosystem reference architecture” is divided into two parts. The first contains a state-of-the-art analysis of security and privacy requirements for health data sharing and re-use, together with a corresponding analysis of the technologies and architectural approaches that can satisfy those requirements. The second part focuses on the IDEA4RC ecosystem and outlines the architectural choices that will be made to fulfil those requirements while also satisfying the users’ needs.

Privacy and security requirements and the available technologies and architectural approaches to satisfy them

Due to the great number of initiatives in the field of health data sharing within the European Union, drafting the reference architecture of the IDEA4RC ecosystem required a review of the main findings of these initiatives and of the relevant EU regulations.

The proposal for a regulation of the European Health Data Space, published in May 2022, and in particular Chapter IV, presents a comprehensive set of requirements that greatly influence the architecture and technical development of the IDEA4RC ecosystem.

These requirements have already been analyzed in the TEHDAS project, which outlined the user journey and the essential set of services for the EHDS. All the services mentioned in TEHDAS are necessary components within the IDEA4RC architecture to achieve a federated approach, allowing data to be processed in secure environments and enabling cross-border data sharing.

TEHDAS, which ended in July 2023 and will be followed by a new joint action starting in 2024, has incorporated inputs and perspectives from relevant European initiatives and projects such as BBMRI, EMA, ECDS, and ELIXIR. The EHDS2Pilot initiative, which started in October 2022 and will end in October 2024, serves as a proof of concept for the TEHDAS architecture and holds significant relevance for IDEA4RC as well.

In the early stages of the ecosystem definition, it is crucial to adhere to the principle of security and privacy by design. Privacy considerations need to be addressed at various levels, including data storage, data sharing, and data processing.

Security threats and associated mitigation strategies in IDEA4RC have been identified following the STRIDE model for computer security threats, which groups threats into six categories: spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of privilege.
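To make the six STRIDE categories concrete, the following minimal Python sketch pairs each category with the security property it is conventionally understood to violate. It is an illustration only and is not taken from the deliverable: the names Stride and properties_at_risk are hypothetical, and the example threat list is invented.

```python
from enum import Enum


class Stride(Enum):
    """STRIDE categories, each paired with the property it conventionally violates."""
    SPOOFING = "authentication"
    TAMPERING = "integrity"
    REPUDIATION = "non-repudiation"
    INFORMATION_DISCLOSURE = "confidentiality"
    DENIAL_OF_SERVICE = "availability"
    ELEVATION_OF_PRIVILEGE = "authorization"


def properties_at_risk(threats):
    """Return the sorted, de-duplicated security properties a list of threats puts at risk.

    Hypothetical helper: `threats` is any iterable of Stride members.
    """
    return sorted({threat.value for threat in threats})


# Invented example: threats noted against a hypothetical data-access endpoint.
example_threats = [Stride.SPOOFING, Stride.TAMPERING, Stride.SPOOFING]
at_risk = properties_at_risk(example_threats)
```

A threat register built this way lets each identified threat be traced to the property a mitigation must restore, which is one common way of organising a STRIDE analysis.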

Privacy threats that could affect IDEA4RC as a data-sharing system within the EHDS have been modelled using the LINDDUN framework. The main ones are linkability and identifiability. The authors discuss corresponding mitigation strategies, including anonymization techniques and rigorous de-identification methods.
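As an illustration of how identifiability risk can be quantified, the sketch below computes the k-anonymity level of a small table: the size of the smallest group of records sharing the same quasi-identifier values. This is an assumption made for the example; the text above does not state which anonymization techniques the deliverable adopts. The function name k_anonymity, the field names, and the records are all hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values.

    A result of 1 means at least one record is unique on those attributes and
    is therefore easier to re-identify. Returns 0 for an empty table.
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


# Invented records; field names and values are hypothetical.
records = [
    {"age_band": "40-49", "region": "North", "diagnosis": "sarcoma"},
    {"age_band": "40-49", "region": "North", "diagnosis": "sarcoma"},
    {"age_band": "50-59", "region": "South", "diagnosis": "head and neck"},
]

k_value = k_anonymity(records, ["age_band", "region"])
```

Here the third record is unique on age band and region, so the table is only 1-anonymous; generalizing or suppressing values until every group reaches a chosen minimum size is one way to reduce that risk.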

The deliverable then reviews the existing security and privacy technologies that could be employed to mitigate the identified risks in IDEA4RC. As for the architectural approaches and components best suited to implementing security and privacy by design, the authors focus on microservice architectures based on raw containers, orchestrators, and service meshes. They observe that a service mesh, such as Istio, offers several advantages over using only containers or an orchestrator when it comes to security, scalability, audit functionalities, and data isolation in a privacy-preserving environment for data sharing.

IDEA4RC users’ requirements and its reference architecture

Besides complying with the security and privacy requirements imposed by regulations on data sharing and re-use, IDEA4RC should also satisfy the requirements of its users. To identify the latter set of requirements, the authors combined a bottom-up and a top-down approach. On the one hand, they conducted a survey among the researchers in the 11 clinical centers involved in the consortium. On the other hand, they considered the user stories for the secondary use of data in the EHDS developed by the TEHDAS project. Comparing the requirements coming from these two sources, the authors identified two partially overlapping sets of criteria that address security, privacy, and optimization of data reuse.

Having defined the requirements the platform needs to address, the authors outline the main features of the IDEA4RC reference architecture.

First, IDEA4RC will adopt a zero-trust network approach: a security model that assumes no implicit trust for any user, device, or network traffic, whether inside or outside the network perimeter. Rather than relying solely on traditional network controls, it continuously verifies and authorizes access to resources based on several factors.
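The zero-trust principle can be sketched as an access decision that is evaluated on every request and never consults network location. The Python example below is a simplified illustration under stated assumptions, not the IDEA4RC implementation: AccessRequest, PERMITS, authorize, and the particular factors checked (token validity, device compliance, an explicit permit) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical per-request context evaluated by the policy check."""
    user_id: str
    token_valid: bool            # identity was verified for this request
    device_compliant: bool       # device passed a posture check
    resource: str
    from_internal_network: bool  # recorded, but never used to grant trust


# Hypothetical explicit permits: (user, resource) pairs allowed by policy.
PERMITS = frozenset({("researcher-1", "cohort-stats")})


def authorize(request, permits=PERMITS):
    """Grant access only if every factor checks out for this specific request.

    Network location is deliberately ignored: being inside the perimeter
    confers no implicit trust.
    """
    return (
        request.token_valid
        and request.device_compliant
        and (request.user_id, request.resource) in permits
    )


# An internal request with an invalid token is denied...
internal_bad = AccessRequest("researcher-1", False, True, "cohort-stats", True)
# ...while an external request that passes every check is allowed.
external_ok = AccessRequest("researcher-1", True, True, "cohort-stats", False)
```

The design choice worth noting is that from_internal_network appears in the request but not in the decision, which is the difference from a perimeter-based model.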

The authors then give a high-level abstract representation of the IDEA4RC platform that emphasizes its functional components and their interaction, under four main assumptions: (i) it must follow the EHDS user journey, (ii) it should provide Secure Processing Environments, (iii) it should implement a zero-trust architecture, and (iv) processing should occur only through privacy-preserving environments.

Finally, the deliverable details the software components needed to deploy the different phases of the user journey: data preparation, data discovery and feasibility study (oriented to gather information to submit a data access application), the data use phase and the data finalization phase.