The Semantic Data Dictionary Approach to Data Annotation & Integration

Sabbir M. Rashid¹, Katherine Chastain¹, Jeanette A. Stingone², Deborah L. McGuinness¹, and James P. McCusker¹

¹ Rensselaer Polytechnic Institute, Troy, NY 12180, USA
² Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA

Abstract. A standard approach to describing datasets is through the use of data dictionaries: tables which contain information about the content, description, and format of each data variable. While this approach is helpful for human readability, it is difficult for a machine to understand the meaning behind the data. Consequently, tasks involving the combination of data from multiple sources, such as data integration or schema merging, are not easily automated. In response, we present the Semantic Data Dictionary (SDD) specification, which allows for extension and integration of data from multiple domains using a common metadata standard. We have developed a structure based on the Semanticscience Integrated Ontology's (SIO) high-level, domain-agnostic conceptualization of scientific data, which is then annotated with more specific terminology from domain-relevant ontologies. The SDD format will make the specification, curation and search of data much easier than direct search of data dictionaries, both through terminology alignment and through the use of "compositional" classes for column descriptions, rather than requiring a 1:1 mapping from column to class.

1 Introduction

A common challenge in scientific research is finding data with the same semantic meaning across databases. This challenge arises because the labels of columns in data tables do not necessarily reveal the meaning of the data. Furthermore, one-to-one mappings between columns from separate sources are not readily accessible.
Column headers and traditional data dictionaries describe the conceptual structure underlying a dataset in a manner understandable by human readers, but it is difficult for computers to extract this same information. A single row in a dataset may contain data on multiple entities: for example, the subject, the subject's blood sample, and information about the subject's mother, such as whether or not she smoked during pregnancy. Understanding that these are separate but related entities, and how they are related to each other, facilitates finding other data that are relevant for comparison.

The Semantic Data Dictionary (SDD) specification is a way to represent implicit entities and their relationships using a general ontology, namely the Semanticscience Integrated Ontology (SIO). SIO provides general properties to describe the relations between entities, and measured characteristics are represented as attributes of those entities [4]. Domain-specific ontologies, such as the Children's Health Exposure Analysis Resource (CHEAR) ontology [11], allow more fine-grained and dataset-specific annotation of concepts. A well-formed SDD contains information about the entity types represented and/or referred to by each column in a tabular dataset, utilizing the relevant ontology URIs in order to convey this information in a manner that is both machine-readable and unambiguous.

We use SIO's high-level conceptualization of data as our target semantic structure when constructing the SDD. Leveraging one particular structure as a basis helps to focus a user by providing a limited subset of relationships and entities for the user to consider. The SDD can express data against any SIO-compatible ontology, and can be used to describe tabular data where there are any number of entities, attributes, timepoints, roles, and relationships.
Our intent is to create a process that is more accessible to domain scientists and data providers, as it only requires knowledge of a limited number of ontologies. Properties in the SIO ontology can be used to describe characteristics of a data variable.

In this paper we demonstrate the utility of the SDD format and the use of the SIO and CHEAR ontologies by representing a number of relevant tables. We present an evaluation of the SDD approach by creating SDD specifications for the National Health and Nutrition Examination Survey (NHANES) 2013-2014 dataset. Finally, we describe the use of Semantic Data Dictionaries in existing research projects.

2 Related Work

2.1 Data Integration

Data integration is the task of uniting data from multiple sources so as to provide a unified view of the combined data [8]. An increasingly common approach to data integration is the use of ontologies to annotate data. However, the success of this approach has led to an increase in the number of existing ontologies, which makes it difficult to decide which ontologies to use and raises possible interoperability issues between them. For the biomedical domain, the Open Biomedical Ontologies (OBO) consortium is helping to address this problem by creating a family of logically well-formed ontologies which follow a set of shared design principles [16]. Ontologies contained in the OBO Foundry include the Gene Ontology (GO), Chemical Entities of Biological Interest (ChEBI) and Human Disease Ontology (DOID)³. Another important ontology used in biomedical research and science in general is the Semanticscience Integrated Ontology (SIO), which divides entities into three distinct categories: objects, events and processes [5]. Adhering to a common foundational model such as SIO or the OBO Foundry ontologies facilitates data integration.
³ http://www.obofoundry.org/

2.2 Schema Merging

One approach to integrating information from multiple datasets is Schema Merging. General methods for Schema Merging have involved either using a set of tools to alter multiple schemas such that they are consistent with each other, or using the multiple schemas to create one merged schema [1]. In order to successfully implement these approaches, however, it is important to know which corresponding elements in each schema should be aligned. Further considerations include possible union or intersection of schema elements, generalization of attributes described in a schema, and the removal of redundant attributes or relationships [10]. Aside from algorithmic approaches to Schema Merging or Ontology Alignment, other methods may take advantage of crowdsourcing in order to acquire human contributions. An example of such a platform is CrowdMap [15], which reduces complex alignment problems to individual alignment tasks that are published online and outsourced to a distributed group of contributors.

2.3 Semantic Annotation

Semantic Annotation refers to the practice of assigning metadata that describe entities in a database or in text [7]. Recent surveys of Semantic Annotation platforms describe the architecture, methods, and performance of currently available tools that facilitate semantic annotation [13], [18]. The most effective annotation platforms (in terms of F-Measure) include MUSE [9], Armadillo [3] and KIM [12]. MUSE is an information extraction system that performs named entity recognition using a tokeniser, sentence splitter, part-of-speech tagger, and a semantic tagger [9]. Armadillo is a generic and portable architecture for scraping information from websites [3]. KIM is a platform for semantic annotation, indexing, and retrieval that includes the use of an ontology, a server, and a front-end interface [12].
Using an algorithm called Taxonomy-Based Disambiguation, which involves Spotting, Learning and Tagging, SemTag achieved automated large-scale semantic tagging of over 250 million web pages [2]. OntoAnnotate leverages existing conceptualizations from domain-specific ontologies, but relies primarily on human annotation [17].

3 Methods

3.1 SDD Specification

The Semantic Data Dictionary is made up of a collection of tabular data which can be written in Excel or Google Sheets, or in a tabular text format such as Comma Separated Value (CSV) files. The first of these files is the infosheet, which contains information about the study as well as the location of the other tables. The tables referenced in the infosheet are the Semantic Data Dictionary, Codebook, Timeline and Code Mappings. The Semantic Data Dictionary contains columns following the SDD specification, which is shown in Table 1.

Table 1. Semantic Data Dictionary Specification

Column          Value   Type       Related Property     Description
Column          ID      all                             Column header
Label           string  all        rdfs:label           Label for the column
Comment         string  all        rdfs:comment         Comment for the column
Definition      string  all        skos:definition      Text column definition
Attribute       URI     attribute  rdf:type             URI of the attribute type
attributeOf     ID      entity     sio:isAttributeOf    Entity having the attribute
Unit            ID      attribute  sio:hasUnit          Unit of measure for attribute
Time            ID      attribute  sio:measuredAt       Time point attribute was measured
Entity          URI     entity     rdf:type             Type of the entity
Role            URI     entity     sio:hasRole          Type of the role the entity plays
inRelationTo    ID      entity     sio:inRelationTo     Entity that the role is linked to
wasDerivedFrom  ID      entity     prov:wasDerivedFrom  Entity from which the attribute was derived
wasGeneratedBy  ID      all        prov:wasGeneratedBy  Activity from which the attribute was produced

The SDD contains actual columns derived from the dataset, as well as virtual columns. The actual columns contain mappings to the underlying attribute that is described by the dataset column, as well as provenance information such as how that variable was generated or derived, as shown in Table 2.

Table 2. Example Semantic Data Dictionary (Actual Columns)

Column   Attribute             attributeOf  Unit      Time      inRelationTo  wasDerivedFrom  wasGeneratedBy
id       sio:Identifier        ??child
race     sio:Race              ??mother
age      sio:Age               ??mother     sio:Year  ??visit1
edu      chear:EducationLevel  ??mother               ??visit1
bmi      chear:BMI             ??mother     kgm2      ??visit1                weight, height
weight   sio:Mass              ??mother     kg        ??visit1
height   sio:Height            ??mother     cm        ??visit1
smoker   chear:SmokingStatus   ??mother               ??pregn
pb_1     sio:Concentration     ??pb_1       mgL       ??visit1  ??sample1     ??sample1       hasco:ICP-MS
pb_2     sio:Concentration     ??pb_2       mgL       ??visit2  ??sample2     ??sample2       hasco:ICP-MS
ga       chear:GestationalAge  ??child      sio:Week  ??birth
birthwt  chear:Weight          ??child      kg        ??birth

Virtual columns are used to describe the entity that an attribute describes or the time of measurement. One benefit of using virtual columns is that they allow for the inclusion of mappings to concepts that are implicit in the data, such as the entity that an attribute belongs to. An example of virtual columns is shown in Table 3. Virtual columns involving time intervals should be stored in the Timeline table.

Table 3. Example Semantic Data Dictionary (Virtual Columns)

Column     Entity           Role          Relation      inRelationTo  wasDerivedFrom  wasGeneratedBy
??mother   sio:Human        chear:Mother                ??child
??child    sio:Human        chear:Child                 ??mother
??birth    chear:Birth                                  ??child
??preg     chear:Pregnancy                              ??child
??sample1  S                                            ??mother
??sample2  S                                            ??mother
??pb_1     Pb                             sio:isPartOf  ??sample1
??pb_2     Pb                             sio:isPartOf  ??sample2

Like standard codebooks used by the biomedical community, the Codebook table contains possible values of coded variables and their associated labels. We augment each possible value with mappings to corresponding ontological concepts, as shown in Table 4.

Table 4. SDD Example Codebook

Column  Code  Label                              Class
race    0                                        chear:White
race    1                                        chear:BlackOrAfricanAmerican
race    2                                        chear:OtherRace
edu     0     high school degree or less         chear:HighSchoolOrLess
edu     1     technical college or some college  chear:SomeCollegeorTechnicalSchool
edu     2     college graduate                   chear:CollegeGraduate
smoke   0     no smoking in pregnancy            chear:NonSmoker
smoke   1     some smoking in pregnancy          chear:Smoker

Finally, the Code Mappings table contains mappings of abbreviated terms or units to their corresponding concepts. The set of code mappings used in CHEAR can be found on GitHub⁴.

3.2 OWL Generation

Each cell of data from a dataset is used to create an instantiation of an attribute, based on the description of the column in the SDD. The value in the cell is used to assign a sio:hasValue property to the attribute instantiation. If the attributeOf column is filled out in the SDD, the sio:isAttributeOf property is used to link to the corresponding entity instantiation. If a unit is specified, the sio:hasUnit property is assigned the corresponding unit from the Units Ontology, which is determined by using the Code Mappings table. If a timepoint for the corresponding variable is specified, it is included in the OWL using the sio:existsAt property. The timepoint may also have an associated value, unit, and relation, as shown in the example OWL below.
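The generation rules described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the conversion script used in this work: the function name `cell_to_triples`, the dictionary representation of an SDD row, the instance identifiers, and the use of plain tuples in place of an RDF library are all hypothetical. Only the SIO properties and the example terms are taken from the text and tables.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def cell_to_triples(
    subject: str,
    value: str,
    sdd_row: Dict[str, str],
    entity_ids: Dict[str, str],
    code_mappings: Dict[str, str],
    time_ids: Optional[Dict[str, str]] = None,
) -> List[Triple]:
    """Turn one data cell into triples, following the rules of Section 3.2.

    sdd_row is one SDD row keyed by the column names of Table 1
    (a hypothetical dict representation). entity_ids and time_ids map
    virtual column names such as "??child" to the instances created for
    the current data row.
    """
    time_ids = time_ids or {}
    triples: List[Triple] = [
        (subject, "rdf:type", sdd_row["Attribute"]),
        (subject, "sio:hasValue", value),
    ]
    attribute_of = sdd_row.get("attributeOf")
    if attribute_of:
        triples.append((subject, "sio:isAttributeOf", entity_ids[attribute_of]))
    unit = sdd_row.get("Unit")
    if unit:
        # Abbreviations such as "kg" are resolved through the Code Mappings
        # table; anything not listed there is passed through unchanged.
        triples.append((subject, "sio:hasUnit", code_mappings.get(unit, unit)))
    time = sdd_row.get("Time")
    if time:
        triples.append((subject, "sio:existsAt", time_ids.get(time, time)))
    return triples


# The birthwt row of Table 2, applied to a single cell with value 3.
row = {"Attribute": "chear:Weight", "attributeOf": "??child",
       "Unit": "kg", "Time": "??birth"}
triples = cell_to_triples(
    subject=":birthweight",
    value="3",
    sdd_row=row,
    entity_ids={"??child": ":joe"},
    code_mappings={"kg": "uo:kilogram"},
    time_ids={"??birth": ":birth"},
)
for triple in triples:
    print(triple)
```

The example OWL that follows shows the fuller structure for a birth weight cell, where the timepoint is itself described by a value, unit, and relation; the sketch above flattens the timepoint to a single reference.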
    :birthweight a chear:Weight ;
        sio:isAttributeOf :joe ;
        sio:hasValue 3 ;
        sio:hasUnit uo:kilogram ;
        sio:existsAt [
            a sio:TimeInterval, chear:BirthTime ;
            sio:hasValue 0 ;
            sio:hasUnit sio:Day ;
            sio:inRelationTo :birth
        ] ;
        sio:existsAt [
            a sio:TimeInterval, chear:GregorianTime ;
            sio:hasValue "2016-03-12"^^xsd:dateTime ;
            sio:hasUnit sio:Day ;
            sio:inRelationTo :birth
        ] .

⁴ https://github.com/tetherless-world/chear-ontology/blob/master/code_mappings.csv

4 Evaluation

The SDD specification approach was applied to the National Health and Nutrition Examination Survey (NHANES) data from 2013-2014. Values for columns in the SDD specification were populated by a web scraping script, tailored to the structure of the NHANES website, that used the Python Beautiful Soup package. In order to assign attributes and entities, a look-up approach was used to compare NHANES entries with terms in SIO or CHEAR. Using this approach, we were able to generate SDD starting points and Codebooks for 150 documents in 6 categories (Questionnaire, Demographics, Dietary, Laboratory, Examination, and Limited Access), corresponding to roughly 4818 SDD rows and over 17000 codebook entries. Of the 4818 SDD rows, 1148 (23.83%) were mapped to existing concepts in SIO, such as Age, Height, Race and Ethnicity, or to terms from CHEAR, including Weight, Education Level, Language, and Income. The remaining rows were not mapped to any concepts due to limitations in the extraction algorithm, which used pattern matching in the labels and comments to search for the above SIO and CHEAR terms, rather than more advanced natural language processing techniques. Therefore, while the current process reduces the amount of time required, human input is still necessary to complete the annotation. Manual annotation of the remaining NHANES concepts is ongoing.
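The look-up step can be illustrated with a short Python sketch. The pattern table, the function names `suggest_attribute` and `coverage`, and the sample labels below are hypothetical and far smaller than anything needed for NHANES; the terms are the ones that appear in Table 2. The sketch matches whole-word patterns against a variable's label and comment, and computes the share of rows mapped in the same way as the 23.83% figure reported above (1148 of 4818).

```python
import re
from typing import Optional

# Hypothetical look-up table: whole-word pattern -> ontology term.
TERM_PATTERNS = [
    (re.compile(r"\bage\b", re.IGNORECASE), "sio:Age"),
    (re.compile(r"\bheight\b", re.IGNORECASE), "sio:Height"),
    (re.compile(r"\brace\b", re.IGNORECASE), "sio:Race"),
    (re.compile(r"\bweight\b", re.IGNORECASE), "chear:Weight"),
    (re.compile(r"\beducation\b", re.IGNORECASE), "chear:EducationLevel"),
]


def suggest_attribute(label: str, comment: str = "") -> Optional[str]:
    """Return the first term whose pattern occurs in the label or comment."""
    text = f"{label} {comment}"
    for pattern, term in TERM_PATTERNS:
        if pattern.search(text):
            return term
    return None  # left for manual annotation


def coverage(mapped: int, total: int) -> float:
    """Percentage of SDD rows that received a mapping, to two decimals."""
    return round(100.0 * mapped / total, 2)


# Illustrative variable labels in the style of a survey data dictionary.
labels = ["Age in years at screening", "Standing Height (cm)", "Energy (kcal)"]
for label in labels:
    print(label, "->", suggest_attribute(label))

print(coverage(1148, 4818))
```

Matching on whole words keeps a label such as "Percentage" from being read as an age, but patterns of this kind still leave any row without a listed keyword unmapped, which is consistent with the need for human input noted above.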
Furthermore, the SDD specification is being applied to additional publicly available datasets, including the Genomic Data Commons⁵, the Surveillance, Epidemiology, and End Results Program⁶, and the Medical Information Mart for Intensive Care [6]. Additionally, by using a script to convert SDDs, Codebooks, and the corresponding data into the Resource Description Framework (RDF), Knowledge Graphs have been created for the subset of NHANES that has been annotated. These graphs are being actively used in a Data Analytics course at Rensselaer Polytechnic Institute to demonstrate to students how semantics can be leveraged to perform analytics.

⁵ https://gdc.cancer.gov
⁶ https://seer.cancer.gov

5 Discussion

Concentrating on mapping many datasets to one single conceptual structure serves the Semantic Web goal of interoperability: by mapping to the SIO conceptualization, a dataset can be compared to any other dataset that has also been mapped. A Semantic Data Dictionary provides a formal means to map dataset columns into a compositional structure in a way that allows us to produce OWL-based metadata for those datasets, creating explicitly defined classes that dataset columns map to. For some studies, like NHANES, web scraping tools such as the Python library Beautiful Soup [14] can be used, allowing for a semi-automatic population of variable names, labels, and definitions. Nevertheless, populating the entities, roles or relations that correspond to a variable cannot be automated simply by using web scraping techniques, and requires collaboration with domain experts.

6 Conclusions

The Semantic Data Dictionary (SDD) standard allows for extension and integration of data from multiple public health and biomedical domains through a common metadata standard, and is convertible to OWL-based metadata that can be used to query for relevant datasets without knowledge of the structure of any one dataset.
The CHEAR project uses the SDD specification to describe data related to demographics, anthropometry, birth outcomes, pregnancy characteristics, biological responses and targeted analytes. The Center for Architecture Science and Ecology (CASE) is using Semantic Data Dictionaries to annotate data related to biological and physical environments, human demographics and physiology, and cognition. The Healthy Birth, Growth, and Development (HBGD) is using the SDD specification to capture data summary statistics, such as mean, standard deviation, minimum and maximum confidence interval values, counts, and time information. As demonstrated by its applicability in the above projects, the SDD specification is an approach for semantic annotation that can be used to represent attributes described by data elements to allow for the integration of data from multiple sources.

7 Acknowledgements

This work was funded by the National Institute of Environmental Health Sciences (NIEHS) Award 0255-0236-4609 / 1U2CES026555-01. We would like to thank Susan Teitelbaum at the Icahn School of Medicine at Mount Sinai for her leadership on the overall CHEAR data resource project, as well as her guidance in exposure and health domains.

References

1. Buneman, P., Davidson, S., and Kosky, A. Theoretical aspects of schema merging. In Advances in Database Technology - EDBT '92 (1992), Springer, pp. 152–167.
2. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J. A., and Zien, J. Y. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. In Proceedings of the 12th International Conference on World Wide Web (New York, NY, USA, 2003), WWW '03, ACM, pp. 178–186.
3. Dingli, A., Ciravegna, F., and Wilks, Y. Automatic semantic annotation using unsupervised information extraction and integration. In Proceedings of SemAnnot 2003 Workshop (2003).
4. Dumontier, M. The Semanticscience Integrated Ontology (SIO). http://sio.semanticscience.org.
5. Dumontier, M., Baker, C. J., Baran, J., Callahan, A., Chepelev, L., Cruz-Toledo, J., Del Rio, N. R., Duck, G., Furlong, L. I., Keath, N., Klassen, D., McCusker, J. P., Queralt-Rosinach, N., Samwald, M., Villanueva-Rosales, N., Wilkinson, M. D., and Hoehndorf, R. The Semanticscience Integrated Ontology (SIO) for biomedical research and knowledge discovery. Journal of Biomedical Semantics 5, 1 (2014), 14.
6. Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data 3 (2016).
7. Kiryakov, A., Popov, B., Ognyanoff, D., Manov, D., Kirilov, A., and Goranov, M. Semantic Annotation, Indexing, and Retrieval. Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, pp. 484–499.
8. Lenzerini, M. Data integration: A theoretical perspective. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2002), ACM, pp. 233–246.
9. Maynard, D. Multi-source and multilingual information extraction. Expert Update 6, 3 (2003), 11–16.
10. McBrien, P., and Poulovassilis, A. A formalisation of semantic schema integration. Information Systems 23, 5 (1998), 307–334.
11. McCusker, J. P., Rashid, S. M., Liang, Z., Liu, Y., Chastain, K., Pinheiro, P., Stingone, J. A., and McGuinness, D. L. Broad, interdisciplinary science in tela: An exposure and child health ontology.
12. Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., and Goranov, M. KIM – semantic annotation platform. In International Semantic Web Conference (2003), Springer, pp. 834–849.
13. Reeve, L., and Han, H. Survey of semantic annotation platforms. In Proceedings of the 2005 ACM Symposium on Applied Computing (New York, NY, USA, 2005), SAC '05, ACM, pp. 1634–1638.
14. Richardson, L. Beautiful Soup documentation, 2007.
15. Sarasua, C., Simperl, E., and Noy, N. F. CrowdMap: Crowdsourcing ontology alignment with microtasks. In International Semantic Web Conference (2012), Springer, pp. 525–541.
16. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25, 11 (2007), 1251.
17. Staab, S., Maedche, A., and Handschuh, S. An annotation framework for the semantic web. Inst. AIFB, Univ., 2001.
18. Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera, M., Motta, E., and Ciravegna, F. Semantic annotation for knowledge management: Requirements and a survey of the state of the art. Web Semantics: Science, Services and Agents on the World Wide Web 4, 1 (2006), 14–28.