Metadata Taxonomy

This deliverable presents the quality metadata and the head and neck cancers common data model that will be adopted within IDEA4RC. The governance and findability metadata, together with the sarcomas common data model, will be defined in a forthcoming deliverable.

The deliverable was originally intended to deal only with metadata, but “we decided to change a bit our initial plan, because we realized that data and metadata are extremely interrelated and it was worth to develop them together”, explains Aitor Almeida, a researcher and lecturer at the University of Deusto, who is leading the work on metadata and common data models for IDEA4RC.

Clinicians from the different centers in the consortium were asked to list the most important variables for the two families of rare cancers that IDEA4RC will consider: head and neck cancers and sarcomas. Once they had agreed upon a minimal set of variables that makes it possible to answer relevant research questions, the team at the Deusto Institute of Technology built the structure of the data model, with the main objective of capturing the temporal evolution of some of those variables. The model is structured as a sequence of cancer episodes, each containing the values of variables such as stage, recurrence, treatment, surgery, and so on. “This is the approach followed by OMOP for example”, explains Almeida, “and we want to map our data model to OMOP and to FHIR as well.” Each variable was mapped to the corresponding terms in standard vocabularies that are also adopted by OMOP and FHIR, such as SNOMED, OHDSI and OSIRIS.
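To make the episode-based structure more concrete, here is a minimal Python sketch of a patient record stored as a time-ordered sequence of episodes. It is only an illustration: the class names, field names and example values are assumptions made for this article, not the schema defined in the deliverable.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Episode:
    """One cancer episode. Field names are illustrative, not the IDEA4RC schema."""
    start: date
    stage: Optional[str] = None
    recurrence: bool = False
    treatments: list[str] = field(default_factory=list)


@dataclass
class PatientRecord:
    """A patient seen as a sequence of episodes ordered by start date."""
    patient_id: str
    episodes: list[Episode] = field(default_factory=list)

    def add_episode(self, episode: Episode) -> None:
        # Keep episodes sorted so the temporal evolution can be read in order.
        self.episodes.append(episode)
        self.episodes.sort(key=lambda e: e.start)

    def history(self, variable: str) -> list[tuple[date, object]]:
        # Return (episode start, value) pairs for one variable across episodes.
        return [(e.start, getattr(e, variable)) for e in self.episodes]


# Hypothetical example data, added out of chronological order on purpose.
record = PatientRecord("P001")
record.add_episode(Episode(date(2023, 5, 1), stage="III", recurrence=True))
record.add_episode(Episode(date(2021, 2, 10), stage="II", treatments=["surgery"]))
```

Calling `record.history("stage")` on this example returns the stage recorded in each episode in chronological order, which is the kind of temporal view the model is meant to support.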

The current deliverable contains the data model for head and neck cancers, whereas the one for sarcomas will be presented in a separate, forthcoming deliverable.

As for the metadata, the current deliverable focuses only on quality metadata, while the next one, due at the beginning of 2025, will address findability and governance metadata.

Quality metadata are meant to measure how good a specific dataset is, something researchers need to know before moving ahead with their project.

“IDEA4RC will work with federated datasets, which combine data from different centers that can be very heterogeneous”, Almeida comments. He adds that data from the same center can also differ in quality, depending on the source each variable comes from. “We need to estimate the data quality at different levels, at the variable level, at the source level, at the center level, and at the federated level”, he explains.
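As a rough illustration of what estimating quality at several levels could look like, the Python sketch below computes a single indicator, completeness (the share of non-missing values), grouped by variable, source, center and the whole federation. Completeness is used here only as an assumed example of a quality measure; the rows, names and grouping keys are hypothetical and do not come from the deliverable.

```python
from collections import defaultdict

# Hypothetical observations: each row is one expected value of one variable.
ROWS = [
    {"center": "A", "source": "registry", "variable": "stage", "value": "II"},
    {"center": "A", "source": "registry", "variable": "stage", "value": None},
    {"center": "A", "source": "notes", "variable": "recurrence", "value": True},
    {"center": "B", "source": "registry", "variable": "stage", "value": "III"},
]


def completeness(rows, *keys):
    """Share of non-missing values, grouped by the given keys."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        total[group] += 1
        if row["value"] is not None:
            filled[group] += 1
    return {group: filled[group] / total[group] for group in total}


by_variable = completeness(ROWS, "center", "source", "variable")
by_source = completeness(ROWS, "center", "source")
by_center = completeness(ROWS, "center")
federated = completeness(ROWS)  # no keys: one group covering every row
```

The same function serves every level because only the grouping keys change; with no keys at all, every row falls into one group, which stands in for the federated level.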

Once the data models for both families of rare cancers are established, the centers will start to locate each variable within their data warehouses and build the ETL (extract, transform, load) engines accordingly.
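The ETL step can be pictured with a very small Python sketch: rows are read from a center's warehouse, local column names are renamed to common-model variables, and the result is stored. All names here, including the column mapping and the in-memory "warehouse", are hypothetical and stand in for whatever each center actually uses.

```python
# Hypothetical mapping from one center's local column names
# to variables of the common data model.
LOCAL_TO_COMMON = {"tnm_stage": "stage", "relapse_flag": "recurrence"}


def extract(warehouse_rows):
    """Yield raw rows; a real engine would query the center's data warehouse."""
    yield from warehouse_rows


def transform(row, mapping=LOCAL_TO_COMMON):
    """Rename mapped columns and drop those outside the common model."""
    return {common: row[local] for local, common in mapping.items() if local in row}


def load(rows, target):
    """Append transformed rows to the target store (here, a plain list)."""
    target.extend(rows)
    return target


def run_etl(warehouse_rows, target):
    """Chain the three steps: extract, transform, load."""
    return load((transform(row) for row in extract(warehouse_rows)), target)


# Hypothetical local row with one column that is not part of the common model.
local_rows = [{"tnm_stage": "II", "relapse_flag": False, "internal_id": 7}]
common_rows = run_etl(local_rows, [])
```

In this sketch the local `internal_id` column is dropped because it has no counterpart in the assumed mapping, while the other two columns are carried over under their common-model names.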

Each center will also need to identify which variables are already structured in their systems and which need to be extracted from unstructured data, such as clinicians' notes or pathology reports, by means of Natural Language Processing.

You can download the deliverable here.