
Design and generation of Linked Clinical Data Cubes

Laurent Lefort

Hugo Leroux

CSIRO ICT Centre, Canberra, ACT, Australia
The Australian E-Health Research Centre, CSIRO, Brisbane, Queensland, Australia

Clinical Study Data Exchange technologies, based on XML, have improved the data capture phase of clinical data and enabled larger and more diverse longitudinal clinical research studies. There is now a growing interest in this community for solutions based on Semantic Web standards. Healthcare and life sciences metadata resources such as medication classifications are now shared via linked data platforms. The increasing pressure to make clinical trial data more open is another strong incentive for the adoption of linked open data technologies. This paper describes the application of semantic statistics vocabularies to deliver clinical data as linked data in a form that is easy to consume by statisticians and easy to enrich with links to complementary data sources. We combine the strengths of the RDF Data Cube and DDI-RDF vocabularies to propose a Linked Clinical Data Cube (LCDC), a set of modular data cubes that helps us manage the multi-disciplinary nature of the source data. We validate our approach on the Australian Imaging, Biomarker and Lifestyle study of Ageing (AIBL). This dataset, comprising more than 1600 variables clustered in 25 different sub-domains, has been fully converted into RDF with one general data cube and one specialised data cube for each sub-domain. This implementation demonstrates the effectiveness of combining the RDF Data Cube and DDI-RDF vocabularies for the publication of large and diverse clinical datasets as linked data. We also show that the structure of the LCDC overcomes the monolithic nature of clinical data exchange standards and expedites the navigation and querying of the data from multiple views.

Keywords: linked data, clinical study, data cube, semantic statistics

The Australian Imaging, Biomarker and Lifestyle study of Ageing1 (AIBL) [ 1 ] is a longitudinal clinical study of more than 1100 Australians aged over 65 years focusing on early pathological indicators of Alzheimer’s disease. The AIBL dataset contains 25 sub-domains encompassing more than 1600 variables. AIBL uses a Clinical Data Management System to collect and manage the study data. The tool used [ 2 ], OpenClinica2, supports the creation of customisable studies and the design of user-defined Case Report Forms (CRFs) using an Excel spreadsheet. It adheres to the Clinical Data Interchange Standards Consortium3 (CDISC) Operational Data Model (ODM) [ 3 ] XML-based standard. ODM-compliant files contain the study data and the associated descriptions of the data items, their groupings into CRFs and the associated questions and code lists.

Our main motivation for the publication of AIBL data as linked data is to make the data seamlessly available to researchers and to enrich it when possible with other data sources. Medication information collected in the study can be mapped [ 4 ] to the Australian Medicines Terminology4 (AMT) and SNOMED-CT5. The rapid growth of the healthcare and life sciences Linked Open Data cloud (OPEN-PHACTS6, Bio2RDF7) opens new opportunities to add value to the AIBL data. Linking to DrugBank8, for example, can bring extra information on drug interaction, targets and pathways.

In this paper, we explain why we have opted to use semantic statistics vocabularies and how they help us overcome the monolithic nature of the ODM data model and its limitations outside the data capture phase. The RDF Data Cube vocabulary [ 5 ] is a proven solution [ 6 ] for the construction of multi-dimensional data cubes offering multiple access points to the data via thematic slices. The DDI-RDF Discovery Vocabulary [ 7, 8, 12 ] helps us manage the links between the data cube variables and their definitions supplied via the study-specific data dictionary embedded in the ODM standard. We describe how we have split the AIBL dataset into a set of data cubes to increase its modularity and designed its URI scheme to ensure that access to it is not constrained by the original data model. The slicing strategy supports the grouping of data into time series with various temporal granularities (phase of the study, day of the observation), and into cross-sections offering different options to group patients together, such as the membership of patients in specific cohorts or their gender.

We validate our approach on the AIBL dataset. We use this example to show how we can add slices to enrich the Medication sub-cube with external linked data sources and how we can consume the linked data with a generic off-the-shelf mashup tool, Visual Box [ 9 ].

The discussion is focused on the compliance of the Linked Clinical Data Cube with the RDF Data Cube vocabulary. Our goal is to have a solution that is compatible with visualisation tools based on this specification. We have reviewed the applicability of the proposed integrity constraints to our use case and found that clinical research data is a category of data that is patchier than other categories of statistical data. We conclude that semantic statistics vocabularies can and should serve a more diverse range of use cases than the ones already documented in [ 10 ].

2 https://openclinica.com/
3 http://www.cdisc.org/
4 http://www.nehta.gov.au/our-work/clinical-terminology/australian-medicines-terminology
5 http://www.ihtsdo.org/snomed-ct/
6 http://www.openphacts.org/
7 http://bio2rdf.org/
8 http://www.drugbank.ca/

The rest of this paper is structured as follows. Section 2 details the coverage of CDISC ODM features by the RDF Data Cube and DDI-RDF Discovery vocabularies. Section 3 introduces the design of the Linked Clinical Data Cube and of its URI scheme. Section 4 presents our work on the AIBL linked clinical dataset. Section 5 contains the discussion on the alignment and compliance issues and reviews the linked data management requirements which are specific to clinical studies.

2 Coverage of CDISC ODM by Semantic Statistics vocabularies

2.1 CDISC ODM

The CDISC ODM standard [ 3 ] defines an XML-based format that facilitates the capture of clinical data during a clinical study. The tree structure of the ODM XML Schema is shown in Figure 1. For the data sub-tree, the top level element is study, followed by subject and study event (phase of the study). The next three elements match the structure of the electronic forms used for data capture. The ODM format also contains the variable definitions (items) and their associated codelists. Each “data” element is linked to a “def” element in the metadata sub-tree.

The W3C RDF Data Cube vocabulary [ 5 ] is a vocabulary for the publication of statistical data in RDF [ 10 ] which is derived from, and compatible with, the cube model that underlies SDMX10 (Statistical Data and Metadata eXchange), a statistical data and metadata standard. This cube model (Figure 2) allows users to group subsets of observations within a dataset into slices where all but one (or a small subset) of the dimensions are fixed. The dimensions, measures and attributes of the data cube and their usage in slices and observations are specified via a Data Structure Definition (or DSD) object. The guidelines for DSDs published by SDMX [ 11 ] define the methodology for slicing with detailed advice on how to design the data cube structure according to the nature of the data.

9 http://xml.coverpages.org/CDISC-ODMOverviewV1-1.pdf
10 http://sdmx.org/

The Data Documentation Initiative (DDI) is an alliance developing XML-based standards for information describing statistical and social science data. The key motivation for DDI is the need to share highly-detailed metadata to ensure the correct analysis and use of the data collected during surveys. The DDI-RDF Discovery vocabulary is an RDF version of a subset of the DDI standard created [ 8 ] by members of the Linked Open Data community. It is published as an unofficial draft [ 7 ] by DDI and reuses or extends several linked data vocabularies, including the RDF Data Cube. Figure 3 shows the relationships between the main classes defined by the Disco vocabulary [ 7, 12 ]. DDI-RDF also contains definitions for statistics based on quantitative and qualitative data, which can also be useful.
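The slicing idea in the cube model described above (fixing all but one dimension) can be illustrated with a minimal Python sketch. It uses plain dictionaries rather than RDF; the dimension names, values and the `slice_by` helper are assumptions made for illustration and are not taken from the AIBL data.

```python
from collections import defaultdict

# Toy observations: three dimensions (subject, phase, variable) and one measure.
# All names and values are illustrative, not AIBL data.
observations = [
    {"subject": "s1", "phase": "baseline", "variable": "score", "value": 28},
    {"subject": "s1", "phase": "18months", "variable": "score", "value": 27},
    {"subject": "s2", "phase": "baseline", "variable": "score", "value": 30},
    {"subject": "s2", "phase": "18months", "variable": "score", "value": 29},
]


def slice_by(observations, fixed_dimensions):
    """Group observations into slices keyed by the values of the fixed dimensions."""
    slices = defaultdict(list)
    for obs in observations:
        key = tuple(obs[d] for d in fixed_dimensions)
        slices[key].append(obs)
    return dict(slices)


# Fixing subject and variable leaves phase free: one time series per subject.
time_series = slice_by(observations, ["subject", "variable"])

# Fixing phase and variable leaves subject free: one cross-section per phase.
cross_sections = slice_by(observations, ["phase", "variable"])
```

In an RDF Data Cube, the list of fixed dimensions would be declared once in the DSD rather than passed as an argument; the grouping logic is the same.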

2.4 Coverage of ODM features by QB and Disco

We have outlined in [ 13 ] our approach to map the ODM data to the RDF Data Cube Vocabulary and the rationale behind our decision to split the AIBL dataset into one main data cube containing the common data and multiple specialised data cubes adapted to each sub-domain. The strength of the Data Cube, at the level of the main cube, is that the original structure of the ODM data model (Study-Subject-StudyEvent-Form-ItemGroup-Item) can be replicated in the generated cube if needed. Furthermore, it supports alternative methods of accessing the data, in particular, methods where the data is aggregated along other dimensions or along the same dimensions in a different order.
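To make the ODM hierarchy concrete, the following Python sketch flattens a small ODM-like fragment into one record per item, with each level of the hierarchy kept as a separate dimension. The element and attribute names follow the ODM data sub-tree, but the fragment is simplified (no XML namespace), and every OID and value in it is made up. It is a sketch of the general idea, not the conversion pipeline used for AIBL.

```python
import xml.etree.ElementTree as ET

# Simplified ODM-like fragment. The namespace is omitted and all OIDs
# and values are hypothetical.
ODM_SNIPPET = """
<ClinicalData StudyOID="ST_DEMO">
  <SubjectData SubjectKey="ss_0001">
    <StudyEventData StudyEventOID="SE_BASELINE">
      <FormData FormOID="F_NEURO">
        <ItemGroupData ItemGroupOID="IG_SCORES">
          <ItemData ItemOID="I_SCORE_A" Value="28"/>
          <ItemData ItemOID="I_SCORE_B" Value="5"/>
        </ItemGroupData>
      </FormData>
    </StudyEventData>
  </SubjectData>
</ClinicalData>
"""


def flatten_odm(xml_text):
    """Return one dict per ItemData, carrying every ancestor level as a dimension."""
    root = ET.fromstring(xml_text)
    rows = []
    for subject in root.iter("SubjectData"):
        for event in subject.iter("StudyEventData"):
            for form in event.iter("FormData"):
                for group in form.iter("ItemGroupData"):
                    for item in group.iter("ItemData"):
                        rows.append({
                            "study": root.get("StudyOID"),
                            "subject": subject.get("SubjectKey"),
                            "event": event.get("StudyEventOID"),
                            "form": form.get("FormOID"),
                            "item_group": group.get("ItemGroupOID"),
                            "item": item.get("ItemOID"),
                            "value": item.get("Value"),
                        })
    return rows


rows = flatten_odm(ODM_SNIPPET)
```

Once the records are flat, they can be regrouped along any dimension or in any order, which is what the nested XML structure does not allow directly.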

The correct use of the data recorded during surveys is also important for the producers of clinical trial data. We have opted to reuse the DDI-RDF Discovery Vocabulary to consistently manage the study-specific data dictionary exported from the OpenClinica tool via the ODM format and the CDISC metadata resources (SDTM, CDASH). disco:Universe defines the domain at multiple levels of the data cube. disco:Variable corresponds to the property used to store the data and disco:VariableDefinition is used to link to the definition of this property (metadata).

3 Design of the Linked Clinical Data Cube

The design of the Linked Clinical Data Cube is done in three steps. The first step, also discussed in [ 13 ], is to split our dataset into a number of smaller, specialised cubes. The second step is to define several slice hierarchies to offer multiple access options to individual data records. The third step is to define a URI scheme that supports access at all the levels of the slice hierarchy.

We have used the SDMX guidelines [ 11 ] to define the dimensions and attributes for our time series and cross-sectional slices. The time-series slices address the longitudinal nature of the study and organise the data into time intervals and dated and non-dated time points. The cross-section slices adopt a subject-centric approach to abstracting the data set along some important concepts such as gender, genotype and neurological classification. The Theme slices categorise the data into the study domains and sub-domains (disco:Universe in DDI-RDF) and help to link the main and specialised cubes. The navigation and querying of the data in the LCDC is easier because we provide three direct links to the node containing the data instead of one: the Phase series (at the level of Study Event Data in ODM), the Subject section (Subject Data) and the Sub-theme slice (Item Group Data).
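The benefit of having three direct links to the same data node can be sketched as three indexes over one set of nodes. The node identifiers, the tuple layout and the `build_indexes` helper below are hypothetical and serve only to illustrate the access pattern.

```python
from collections import defaultdict

# Hypothetical data nodes, each identified by (phase, subject, sub-theme).
nodes = [
    ("baseline", "ss_0001", "neuropsych"),
    ("baseline", "ss_0002", "neuropsych"),
    ("18months", "ss_0001", "neuropsych"),
    ("18months", "ss_0001", "medication"),
]


def build_indexes(nodes):
    """Index every node under its phase series, subject section and sub-theme slice."""
    by_phase = defaultdict(list)
    by_subject = defaultdict(list)
    by_subtheme = defaultdict(list)
    for node in nodes:
        phase, subject, subtheme = node
        by_phase[phase].append(node)
        by_subject[subject].append(node)
        by_subtheme[subtheme].append(node)
    return by_phase, by_subject, by_subtheme


by_phase, by_subject, by_subtheme = build_indexes(nodes)

# The same node is reachable from any of the three entry points.
target = ("18months", "ss_0001", "neuropsych")
reachable = (
    target in by_phase["18months"]
    and target in by_subject["ss_0001"]
    and target in by_subtheme["neuropsych"]
)
```

With a single nested hierarchy, as in ODM, only the first of these lookups would be direct; the other two would require a scan of the whole tree.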

The RDF Data Cube (QB) specification restricts the use of the qb:observation property to cases where the range class is a qb:Observation and does not allow qb:observation o qb:observation property chains between qb:Slice and qb:Observation via qb:ObservationGroup. We use void:subset11 for the dataset/slice and slice-sub-slice links shown in Figure 5.

The use of QB properties is shown in Figure 6, which presents only the LCDC slices which subsume qb:Slice. We use qb:observation and qb:observationGroup for the slice-observation links and slice-observation group links. The specialisedSeries and specialisedSection properties are for the links between the slices in the main and specialised cubes. The specialisedObservation property manages the links between the observation groups in the main cube and the corresponding observations in the specialised cubes and is a sub-property of qb:observation. Finally, the mainDataSet property is defined to link the observation groups back to the dataset.

11 http://www.w3.org/TR/void/
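The division of labour between void:subset and qb:observation described above can be sketched as a small traversal over triples. The triples are held as plain Python tuples with prefixed names; the `ex:` resource names and the `observations_under` helper are hypothetical, and the sketch ignores observation groups and the LCDC-specific properties.

```python
# Triples as (subject, predicate, object). The ex: names are hypothetical.
triples = [
    ("ex:dataset", "void:subset", "ex:phase-series"),
    ("ex:phase-series", "void:subset", "ex:subtheme-series"),
    ("ex:subtheme-series", "qb:observation", "ex:obs1"),
    ("ex:subtheme-series", "qb:observation", "ex:obs2"),
]


def observations_under(node, triples):
    """Follow void:subset links downwards and collect qb:observation targets."""
    found = []
    for subject, predicate, obj in triples:
        if subject != node:
            continue
        if predicate == "qb:observation":
            # qb:observation is only used when the target is an observation.
            found.append(obj)
        elif predicate == "void:subset":
            # Dataset-to-slice and slice-to-sub-slice links are traversed recursively.
            found.extend(observations_under(obj, triples))
    return found


all_observations = observations_under("ex:dataset", triples)
```

Keeping qb:observation for the final hop means that any object of that property can be assumed to be a qb:Observation, which is what the QB specification requires.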

The LCDC URI scheme (Table 1 and Table 2) follows the convention adopted by projects [ 6 ] which use the Linked Data API12. This convention uses URIs finishing with an identifier to give access to a single instance (Item endpoint) and URIs finishing with a keyword to give access to a list of instances (List endpoint).

12 https://code.google.com/p/linked-data-api/
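The item/list convention can be sketched as a check on the last path segment of a URI. The keyword set below is an assumption based on the long-form URI patterns given later in this section, and `endpoint_kind` is a hypothetical helper, not part of the Linked Data API.

```python
# Assumed keyword set, taken from the long-form LCDC URI pattern.
KEYWORDS = {"product", "theme", "subtheme", "phase", "subject"}


def endpoint_kind(uri):
    """Return 'list' if the URI ends with a keyword, otherwise 'item'."""
    last_segment = uri.rstrip("/").rsplit("/", 1)[-1]
    return "list" if last_segment in KEYWORDS else "item"


# A URI ending with a keyword addresses a list of instances.
kind_of_list_uri = endpoint_kind("ROOT/lcdc/product/odm/theme")

# A URI ending with an identifier addresses a single instance.
kind_of_item_uri = endpoint_kind("ROOT/lcdc/product/odm/theme/cognitive")
```

This simple check assumes that no identifier coincides with a keyword; a real deployment would rely on the endpoint patterns declared in the Linked Data API configuration instead.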

Table 2. Specialised cube URI scheme

Sub-theme series     ROOT/{dataset}/ts/pr/{pr}/th/{th}/st/{st}/ph/{ph}
Sub-theme section    ROOT/{dataset}/ts/pr/{pr}/th/{th}/st/{st}/nd/{nd}/su/{su}
Observations         ROOT/{dataset}/pr/{pr}/th/{th}/st/{st}/ph/{ph}/su/{su}

The URI patterns listed above are shortened to fit in more compact tables. The longer version of the pattern listed in the last row of Table 2 is:

ROOT/{dataset}/product/{product}/theme/{theme}/subtheme/{subtheme}/phase/{phase}/subject/{subject}

Using alternate keywords and identifiers gives user-friendly URIs, for example: ROOT/lcdc/product/odm/theme/cognitive/subtheme/neuropsych/phase/72months/subject/ss_1175.
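As an illustration, the observation URI pattern can be assembled from an ordered list of keyword and identifier pairs. The `observation_uri` function and the `LEVELS` table are hypothetical helpers written for this sketch; the keywords and the example identifiers are the ones used in the patterns above.

```python
# (long keyword, short keyword) for each level of the observation URI pattern.
LEVELS = [
    ("product", "pr"),
    ("theme", "th"),
    ("subtheme", "st"),
    ("phase", "ph"),
    ("subject", "su"),
]


def observation_uri(dataset, ids, short=False, root="ROOT"):
    """Build an observation URI; ids maps each long keyword to its identifier."""
    segments = [root, dataset]
    for long_keyword, short_keyword in LEVELS:
        segments.append(short_keyword if short else long_keyword)
        segments.append(ids[long_keyword])
    return "/".join(segments)


ids = {
    "product": "odm",
    "theme": "cognitive",
    "subtheme": "neuropsych",
    "phase": "72months",
    "subject": "ss_1175",
}

long_uri = observation_uri("lcdc", ids)
short_uri = observation_uri("lcdc", ids, short=True)
```

Truncating the segment list after any keyword or identifier would give the corresponding list or item URI higher up in the hierarchy.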

The LCDC ontologies are available via the URIs included in Table 3. A majority of the core classes (DataFile, Phase, Product, Question, Questionnaire, Study, StudyGroup, SubTheme, Subject, SupplVariableDefinition, Theme, Variable, VariableDefinition) and properties are based on Disco. The Observation, Time Series, Cross-Section, Domain Slice and Cube ontologies contain the classes corresponding to the different aspects of the Linked Clinical Data Cube described above and the associated properties. The ODM and AIBL ontologies define datatype properties for the identifiers present in the ODM file to capture this information as provenance data.

Ontology         URI
Core             http://purl.org/sstats/lcdc/def/core#
Observations     http://purl.org/sstats/lcdc/def/obs#
Time Series      http://purl.org/sstats/lcdc/def/time-series#
Cross-section    http://purl.org/sstats/lcdc/def/cross-section#
Domain Slice     http://purl.org/sstats/lcdc/def/domain-slice#
Cube             http://purl.org/sstats/lcdc/def/cube#
ODM              http://purl.org/sstats/lcdc/def/odm#
AIBL             http://purl.org/sstats/lcdc/def/aibl#

To create them, we have defined an Excel spreadsheet template to capture all the information required to generate the LCDC OWL ontologies with XSL transformations. Our template supports the definition of classes and properties, their alignment to QB, VoID and Disco, the URI prefixes and patterns and the DSD levels. We can also generate SPARQL queries for the retrieval of instance data for each class and for the detection of traversal links between LCDC-named entities. We plan to further extend this template to automate the creation of the Linked Data API configuration files as much as possible.
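The generation of per-class SPARQL queries can be sketched in a few lines of Python. This is not the XSL-based generator described above: `instances_query` is a hypothetical helper, and the sketch assumes that each class URI is the core ontology namespace followed by the class name.

```python
CORE = "http://purl.org/sstats/lcdc/def/core#"


def instances_query(class_uri, limit=None):
    """Return a SPARQL query that retrieves the instances of one class."""
    query = (
        "SELECT ?instance WHERE {\n"
        f"  ?instance a <{class_uri}> .\n"
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


# One query per class name; the names come from the core class list,
# and their concatenation with the namespace is an assumption.
class_names = ["Subject", "Phase", "Theme"]
queries = {name: instances_query(CORE + name) for name in class_names}
```

Driving the query text from the same table that defines the classes keeps the queries and the ontology in step when a class is renamed or added.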

4 Our application

4.1 The AIBL study

AIBL has been designed to support investigations of the predictive utility of various biomarkers, cognitive parameters and lifestyle factors as indicators of Alzheimer’s disease (AD) with a cohort of over one thousand participants residing in two Australian cities, Perth and Melbourne. Each recruited participant completed blood and neurological testing and some underwent brain imaging testing. The AIBL study data was successfully migrated to the OpenClinica platform in 2011 [ 2 ] and has been live since August 2011.

4.2 Conversion of the AIBL data

We have converted an ODM file containing the data from the AIBL study. This dataset uses more than 1600 variables clustered in 25 different sub-domains. The AIBL study has been split into five themes: Study, Clinical, Lifestyle, Imaging and Cognitive. The ‘Study’ category comprises administrative information, most of which will not be shared in the cube. Table 5 gives the total number of instances per theme for different LCDC classes.

[Table: total number of instances per theme (Clinical, Cognitive, Imaging, Lifestyle, Study) for different LCDC classes]

The LCDC design has been extended to support our plan [ 13 ] to use the AMT and SNOMED CT-AU taxonomies to enrich the medication data with other medication resources, in particular the ones that are already available as linked open data. We have implemented specific types of slices for the Concomitant Medication sub-cube to serve observations which contain links to external resources like AMT, SNOMED and the World Health Organization Anatomical Therapeutic Chemical Defined Daily Dose classification (ATC DDD). The CM (Concomitant Medication) ontologies (Table 5) contain sub-classes of the observation and cross-section classes defined in the core ontology.

13 The total number of variables is smaller than 1600 because the generation to RDF suppresses duplicates.

Ontology    URI
CM          http://purl.org/sstats/lcdc/cm/def/cm#
CMATC       http://purl.org/sstats/lcdc/cm/def/cm-atc#
CMAMT       http://purl.org/sstats/lcdc/cm/def/cm-amt#
CMSNOMED    http://purl.org/sstats/lcdc/cm/def/cm-snomed#

4.4 Visualisation of the AIBL data

An example of a visualisation developed for the LCDC data is presented in Figure 8. We have used the Visual Box14 [ 9 ] tool to build visualisations of SPARQL query results to support data verification activities.

Fig. 7. Classification of AIBL subjects at 18 months

5 Discussion

5.1 Implementation report

The RDF Data Cube Vocabulary specification, a W3C Candidate Recommendation [ 5 ], has reached the stage used by W3C to gather implementation experience prior to the final decisions on “at risk” features. We can provide feedback on the usefulness of optional terms and on the applicability of integrity constraints. We use two optional terms: the qb:ObservationGroup class and qb:observationGroup property.

14 http://visualbox.org

Some specialised data cubes do not satisfy the integrity constraints specifying that every qb:DataStructureDefinition must include at least one declared measure (IC-3), that only attributes may be optional (IC-6) and that each individual qb:Observation must have a value for every declared measure (IC-14). These constraints are too restrictive for our Nutrition data cube, where the presence or absence of a value for a particular category of food varies according to the subject’s diet. This is a concern for survey questionnaires that use previously entered values to determine whether a field on a form is mandatory.
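The effect of IC-14 on patchy data can be illustrated with a small check that reports, for each observation, the declared measures that have no value. The measure names and the nutrition records below are invented for the example, and `missing_measures` is a hypothetical helper; the normative form of the constraint is the one given in the RDF Data Cube specification [ 5 ].

```python
def missing_measures(observations, declared_measures):
    """Map each observation id to the declared measures for which it has no value."""
    report = {}
    for obs_id, values in observations.items():
        missing = [m for m in declared_measures if values.get(m) is None]
        if missing:
            report[obs_id] = missing
    return report


# Invented measures for a nutrition-style cube.
declared = ["bread_serves", "meat_serves", "dairy_serves"]

nutrition = {
    "obs1": {"bread_serves": 3, "meat_serves": 1, "dairy_serves": 2},
    # No meat value recorded, as could happen for a vegetarian diet.
    "obs2": {"bread_serves": 4, "dairy_serves": 1},
}

violations = missing_measures(nutrition, declared)
```

Under a strict reading of IC-14, the second observation would make the cube non-compliant even though the missing value is expected and meaningful.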

5.2 Coverage of our use case by Semantic Statistics vocabularies

We recommend that the semantic statistics vocabularies under development cover a broader set of use cases than the ones currently outlined in [ 10 ] that correspond to collections of "regular" CSV files, spreadsheets and OLAP data cubes. The LCDC use of the RDF Data Cube vocabulary is different from the more common use cases [ 10 ] primarily because of the unreliable, disparate and longitudinal nature of clinical data. This, however, should still allow us to reuse visualisation tools based on the RDF Data Cube specification and especially RDF Data Cube browsers such as CubeViz15.

On the other hand, we have found that the DDI-RDF vocabulary is well suited to addressing the needs of the ODM community in standardising access to the clinical data and explicitly linking the clinical data with its associated metadata.

6 Conclusions

This paper has outlined an approach to integrate clinical study data exchange standards with semantic statistics standards to make the clinical data available as linked data. In particular, we have outlined the design of a Linked Clinical Data Cube, which integrates a general and several specialised data cubes to expedite the navigation and querying of clinical data. The Linked Clinical Data Cube combines the strength of the RDF Data Cube in defining multi-dimensional data cubes and the DDI-RDF vocabulary to encode the study-specific data dictionary as linked data. Our approach was validated on a large and diverse clinical dataset with features that differ from other types of statistical datasets. The sheer volume of variables has necessitated a split of the clinical data into a set of modular data cubes to improve their manageability during the generation process and facilitate their discovery and usability by end users. We have observed that the patchy nature of clinical data is also more pronounced than for other types of statistical datasets. We are convinced that the integration of clinical study data exchange technologies and semantic statistics vocabularies will expedite the deployment of cross-study analysis and evidence-based medicine by facilitating the integration of clinical trials from disparate sources. We conclude that the association of the RDF Data Cube and DDI-RDF vocabularies is very effective in facilitating the publication of large and diverse datasets and hope that this will provide the catalyst for increased coordination between the two initiatives.

15 http://aksw.org/Projects/CubeViz.html

7 References

1. Ellis, K. A., Bush, A. I., Darby, D., De Fazio, D., Foster, J., Hudson, P., Lautenschlager, N. T., Lenzo, N., Martins, R. N., Maruff, P., Masters, C., Milner, A., Pike, K., Rowe, C., Savage, G., Szoeke, C., Taddei, K., Villemagne, V., Woodward, M. & Ames, D.: The Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer's disease. International Psychogeriatrics 21(04), pp. 672-687 (2009)

2. Leroux, H., McBride, S. & Gibson, S.: On selecting a clinical trial management system for large scale, multi-centre, multi-modal clinical research study. Stud Health Technol Inform 168, pp. 89-95 (2011)

3. CDISC: Operational Data Model (ODM) Version 1.3.1 (2009)

4. McBride, S., Lawley, M., Leroux, H. & Gibson, S.: Using Australian Medicines Terminology (AMT) and SNOMED CT-AU to better support clinical research. Studies in Health Technology and Informatics 178, pp. 144-149 (2012)

5. Cyganiak, R., Reynolds, D. & Tennison, J.: The RDF Data Cube Vocabulary. W3C Candidate Recommendation 25 June 2013. World Wide Web Consortium (2013)

6. Lefort, L., Bobruk, J., Haller, A., Taylor, K. & Woolf, A.: A Linked Sensor Data Cube for a 100 Year Homogenised Daily Temperature Dataset. In Proc. Semantic Sensor Networks Workshop (SSN 2012), CEUR Workshop Proceedings, vol. 904, ceur-ws.org, pp. 1-16 (2012)

7. Bosch, T., Cyganiak, R., Zapilko, B., Cotton, F., Gregory, A., Kämpgen, B., Olsson, O., Paulheim, H. & Wackerow, J.: DDI-RDF Discovery Vocabulary: A vocabulary for publishing metadata about data sets (research and survey data) into the Web of Linked Data. Unofficial Draft 20 June 2013. Technical report, DDI Alliance (2013)

8. Bosch, T., Cyganiak, R., Gregory, A. & Wackerow, J.: DDI-RDF Discovery Vocabulary: A Metadata Vocabulary for Documenting Research and Survey Data. In Proc. 6th Workshop on Linked Data on the Web (LDOW2013), CEUR Workshop Proceedings, vol. 996, ceur-ws.org (2013)

9. Graves, A.: Creation of visualizations based on linked data. In Proc. of the 3rd International Conference on Web Intelligence, Mining and Semantics (WIMS '13). ACM (2013)

10. Kämpgen, B. & Cyganiak, R.: Use Cases and Lessons for the Data Cube Vocabulary. W3C Working Group Note 27 February 2013. World Wide Web Consortium (2013)

11. SDMX: SDMX guidelines for DSDs. Statistical Data and Metadata Exchange (2012)

12. Bosch, T., Cyganiak, R., Wackerow, J. & Zapilko, B.: Leveraging the DDI Model for Linked Statistical Data in the Social, Behavioural, and Economic Sciences. In Proc. of the 12th International Conference on Dublin Core and Metadata Applications (DC 2012), pp. 46-55. Dublin Core Metadata Initiative (2012)

13. Leroux, H. & Lefort, L.: Using CDISC ODM and the RDF Data Cube for the Semantic Enrichment of Longitudinal Clinical Trial Data. In Proc. Semantic Web Applications and Tools for Life Sciences (SWAT4LS 2012), CEUR Workshop Proceedings, vol. 954, ceur-ws.org (2012)