Intelligent ecosystem to improve the governance, the sharing, and the re-use of health data for rare cancers

Hot takes from the second consortium meeting in Venice

IDEA4RC members met for the second time on 19 and 20 April 2023 in Venice. It was an opportunity to share the results of active working groups and agree on next steps.

The main objective of the first year of the project is the conception of the ecosystem, which is carried out by Work Package 2 (WP2).

Several requirements need to be considered. First, the IDEA4RC platform should address the needs and expectations of researchers, who will be its main users.

Moreover, while IDEA4RC aims to facilitate and speed up their work, it must comply with the requirements imposed by the GDPR, the EU regulation on data protection, and take into account the forthcoming regulation on the European Health Data Space (EHDS). The first draft of the EHDS is currently being discussed by the European Parliament and the Council of the European Union.

To become a benchmark in health data sharing for secondary use, IDEA4RC should navigate this shifting regulatory environment.

The researchers’ journey: from data discovery to publication of results

The user stories collected among researchers and clinicians of the 11 expert centers involved in the consortium roughly align with those outlined by the joint action TEHDAS. TEHDAS has been tasked with developing European principles for the secondary use of health data in view of the debate on the EHDS regulation.

The IDEA4RC Data Protection Coordination Board (DPCB), which brings together the data protection officers of the 11 expert centers, is discussing who should evaluate data access applications submitted by interested researchers, and whether the federated approach would still require a preliminary anonymization of the data each center injects into its own secure processing environment, called a capsule.

The conception of the ecosystem starts from the user stories, the sequence of actions that researchers envisage performing on the platform to conduct their study. The user stories have been identified through a survey conducted among the 11 expert centers of EURACAN involved in the consortium. This work was coordinated by Franco Mercalli (MultiMed Engineers) together with Annalisa Trama (Istituto Nazionale dei Tumori). Survey results roughly aligned with the process identified by the TEHDAS Joint Action. TEHDAS is developing European principles for the secondary use of health data to inform the European Health Data Space regulation (EHDS).

The journey of a data user starts with the data discovery phase (“What health data are available to support my research?”). In this phase, scientists might also need to assess the quality of data (“For how many patients is this information available?”). However, even counting the number of patients for whom a specific piece of information is available could be seen as “data processing” under the GDPR. This could be avoided if the data of each expert center are anonymized before being provided to the ecosystem. More on this point later.
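As a rough illustration of the data discovery question above, the following sketch counts, for each variable, how many records in an already anonymized extract carry a value. The variable names and records are hypothetical and are not taken from the IDEA4RC data model.

```python
from collections import Counter


def availability_counts(records):
    """Count, per variable, how many records have a non-missing value."""
    counts = Counter()
    for record in records:
        for variable, value in record.items():
            # Treat None and empty strings as missing.
            if value is not None and value != "":
                counts[variable] += 1
    return dict(counts)


# Hypothetical anonymized records from one expert center.
records = [
    {"tumour_site": "larynx", "stage": "II", "grade": None},
    {"tumour_site": "soft tissue", "stage": None, "grade": "G2"},
    {"tumour_site": "oral cavity", "stage": "III", "grade": ""},
]

summary = availability_counts(records)
print(summary)
```

Only aggregate counts leave the function, which is the kind of output a metadata catalogue could expose without returning patient-level rows.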

Once the researcher has identified the relevant dataset, a data access application must be submitted to the health data provider, specifying the study protocol and other analyses currently required by the GDPR. Because IDEA4RC has neither a single health data provider nor a single legal entity that can grant this permit, consortium members are discussing how to handle the data access application process.

To do this, ECCP has established a Data Protection Coordination Board (DPCB) involving the data protection officers and legal experts of the 11 expert centers, who are evaluating several options.

In this regard, partners discussed in Venice the potential impact of the Data Governance Act (DGA), which came into force in June 2022. It introduces new rules that affect the re-use of personal data for research purposes, in particular the principle of data altruism, which allows consent to be requested without referring to a specific study, potentially enabling blanket consent.

However, partners agreed that, given the involvement of health data in IDEA4RC, the ecosystem should be designed taking into account the provisions of the European Health Data Space regulation, which is the piece of legislation that paves the way for sharing health data for secondary use in the EU.

Once the data permit has been granted and the patient cohort has been selected, the researchers might want to run quality checks and then deploy federated analyses across the selected capsules of IDEA4RC.

Each center in the ecosystem will inject its data into its own capsule, a secure processing environment where IDEA4RC users who have been granted a permit can run analyses.

The high-level structure of the IDEA4RC ecosystem

Requirements coming from the user stories and those mandated by the GDPR and EHDS have been combined to find a set of criteria that IDEA4RC needs to fulfill. They have been summarized with two acronyms. The first one is FAIR (Findability, Accessibility, Interoperability, Reusability) and the second one is PrIIST-Q (Privacy, Isolation, Interoperability, Security, Trust and data Quality). The various components of the IDEA4RC architecture will be designed in order to respect these criteria.

Data will be harmonized and processed according to a common data model before being injected into the capsule. Unstructured data, such as clinicians’ notes and pathology, imaging, and surgical reports, will also be processed before entering the capsule. Natural Language Processing models developed by IDEA4RC researchers will take care of this.

The group coordinated by Eugenio Gaeta (Universidad Politécnica de Madrid) started from the requirements derived from the user stories to sketch the high-level structure of IDEA4RC. His team has compared those requirements with the ones coming from the GDPR and the EHDS.

By combining bottom-up (user stories from surveys) and top-down (user stories for the secondary use of health data in EHDS) requirements, IDEA4RC researchers have identified two sets of partially overlapping criteria.

  • Findability, Accessibility, Interoperability, Reusability (FAIR principles of scientific data management)
  • Privacy, Isolation, Interoperability, Security, Trust, Data Quality (PrIIST-Q for short).
    Privacy: The EHDS is required to comply with relevant privacy regulations, such as the General Data Protection Regulation (GDPR), to ensure that patients’ personal and health data is protected and kept confidential.
    Isolation: The EHDS must protect sensitive health information from falling into the wrong hands and maintain the privacy and confidentiality of patients. Isolation can be achieved through various security measures, such as firewalls, encryption, and access controls, which limit the flow of information between different systems and networks.
    Interoperability: The EHDS must support the exchange of health data between different systems, technologies, and organizations, to ensure that data is accessible and usable for all stakeholders. This requires the implementation of common data standards and protocols to ensure that data can be seamlessly shared and processed.
    Security: The EHDS must implement robust security measures to protect health data from unauthorized access and cyberattacks. This includes implementing encryption, authentication, and access controls to secure data in transit and at rest.
    Trust: The EHDS must establish trust between stakeholders, including patients, healthcare providers, and researchers, to ensure that data is used appropriately and for the benefit of patients. This requires transparent processes for data access and usage, as well as clear communication and engagement with stakeholders.
    Data Quality: The EHDS must ensure the quality of health data, including accuracy, completeness, and consistency, to ensure that data is usable and relevant for healthcare purposes.

Findability will be ensured by the metadata layer of IDEA4RC, which will allow researchers to search for relevant data and assess their quality. The governance layer will regulate the accessibility of the data, depending on how much each center in the ecosystem is willing to share. Interoperability will be ensured by the FHIR capsule, the secure processing environment that will implement a common data model. Reusability will be enabled by the federated AI layer, which will allow analyses to be run without moving the data from their original location, following an iterative training procedure.
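To make the idea of an iterative training procedure that never moves the data more concrete, here is a minimal sketch of federated averaging on a toy one-parameter model. It assumes a FedAvg-style scheme, which the text does not confirm as the IDEA4RC method; the function names, datasets, and hyperparameters are all hypothetical.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one center's data for y = w * x."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_round(global_weight, centers):
    """One round: each center trains locally; only the weights are averaged."""
    total = sum(len(data) for data in centers)
    updates = [(local_update(global_weight, data), len(data)) for data in centers]
    # Average the local weights, weighted by the number of local samples.
    return sum(w * n for w, n in updates) / total


# Hypothetical per-center datasets, all following y = 3x exactly.
centers = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(1.5, 4.5), (3.0, 9.0), (0.5, 1.5)],
]

weight = 0.0
for _ in range(200):
    weight = federated_round(weight, centers)
print(round(weight, 3))
```

In this sketch the coordinator only ever sees one number per center per round; the `(x, y)` pairs stay inside `local_update`, which stands in for computation running inside a capsule.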

To meet the second set of requirements, the team led by Gaeta plans to implement a fractal architecture. Each component of the application should comply with the PrIIST-Q principles so that the entire application ends up being compliant with the same principles. “The analogy with the fractal concept is that PrIIST-Q properties occur at different scales in the architecture”, explains Gaeta, “as in a tree where a branch has the same structure as the tree.”
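One way to picture the fractal idea is as a recursive check over a tree of components: a component complies only if it meets every principle itself and all of its sub-components do too. The sketch below is an illustration under that assumption; the component names and the `Component` class are hypothetical and do not describe the actual IDEA4RC architecture.

```python
from dataclasses import dataclass, field

PRINCIPLES = {"privacy", "isolation", "interoperability", "security", "trust", "quality"}


@dataclass
class Component:
    name: str
    satisfied: set
    children: list = field(default_factory=list)


def is_compliant(component):
    """A component complies if it and every sub-component meet all principles."""
    return PRINCIPLES <= component.satisfied and all(
        is_compliant(child) for child in component.children
    )


# Hypothetical component tree: one sub-component does not yet address "trust".
etl = Component("etl-service", set(PRINCIPLES))
store = Component("data-store", PRINCIPLES - {"trust"})
capsule = Component("capsule", set(PRINCIPLES), [etl, store])

print(is_compliant(capsule))
```

The same function applies unchanged at every level of the tree, which mirrors the branch-and-tree analogy in the quote.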

The FHIR capsule needs to host a dedicated service for the extraction, transformation and loading (ETL) of the data. This service manages the data ingestion from the expert center’s information system into the capsule and is being discussed with the technical team of each center.
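The three ETL stages can be sketched as follows. This is only a schematic example: the source field names (`pid`, `dx_date`, `site`, `note`), the target field names, and the in-memory list standing in for the capsule are all hypothetical, and no real FHIR or OMOP structures are used.

```python
def extract(source_rows):
    """Yield raw rows from a (hypothetical) hospital information system export."""
    yield from source_rows


def transform(row):
    """Map local fields to hypothetical common-model names and drop free text."""
    return {
        "patient_id": row["pid"],
        "diagnosis_year": int(row["dx_date"][:4]),
        "tumour_site": row["site"].strip().lower(),
    }


def load(capsule, record):
    """Append a structured record to the in-memory stand-in for the capsule."""
    capsule.append(record)


capsule = []
source_rows = [
    {"pid": "A1", "dx_date": "2019-05-02", "site": " Larynx ", "note": "free text"},
    {"pid": "B7", "dx_date": "2021-11-30", "site": "Soft tissue", "note": ""},
]

for row in extract(source_rows):
    load(capsule, transform(row))
print(capsule)
```

Keeping the transformation in a single function makes it the natural place to agree field mappings with each center's technical team.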

The data ingestion pipeline also involves the processing of unstructured data: clinician notes and pathology, imaging and surgical reports. Notes and reports will be processed using Natural Language Processing (NLP) models developed ad hoc by a joint team led by Aitor Almeida (Universidad de Deusto) and Alberto Lavelli (Fondazione Bruno Kessler).

The information extracted from notes and reports will be mapped to the common data model of IDEA4RC. Only structured data will be injected into the capsule. The common data model will be mapped to both FHIR and OMOP, since the FHIR-to-OMOP and OMOP-to-FHIR mappings are not yet mature enough.

Choosing the relevant variables to address research questions

Clinicians and researchers from the 11 expert centers have formulated research questions they wish to address thanks to the IDEA4RC ecosystem. Some of these questions will be selected as potential use cases, through which IDEA4RC will be tested. The variables to include in the common data model of the ecosystem are those needed to address the research questions.

The common data model will be defined to describe head and neck cancers and sarcomas, starting from the EURACAN rare cancers registry.

How to include the data extracted from notes and reports through Natural Language Processing was the subject of a lively debate among clinicians in Venice. They agreed to consider at first a minimal set of unstructured data when designing the use cases.

First co-creation workshop

The first co-creation workshop took place on the afternoon of the second day of the meeting and involved all participants, split into small groups.

During one of the activities, the groups were asked to discuss and improve schemes of the data flows underlying different processes in the platform. They were asked to look at the schemes from the point of view of the different stakeholders involved in IDEA4RC, from clinicians and researchers to hospital managers and patients.

All the conversations were recorded and are now being analyzed to summarize the main findings and further inform the creation of the ecosystem.

During the afternoon of the second day, Claudia Egher and Susan van Hees (University of Utrecht) led the first co-creation workshop. The objective of the workshop was to engage all the relevant actors in a timely manner and ensure that their values and expectations towards IDEA4RC are heard and considered in the development phase. This contributes to the successful implementation of the ecosystem.

All participants were involved in the two activities of the workshop.

In the first activity, partners were asked in advance to sketch the data flows from their specific point of view. Those sketches were presented on paper to the groups, which were asked to discuss them while playing the roles of different actors (researchers, clinicians, data managers, patients) regardless of their actual role in the project.

In the second activity, starting from vignettes, each group engaged in a discussion on six different topics: data governance, data quality, better care for patients diagnosed with rare cancers, barriers to use, changing relations between patients and doctors, better quality of life for patients.