=Paper=
{{Paper
|id=Vol-1747/IT602_ICBO2016
|storemode=property
|title=A Semantic Web Representation of Entire Populations
|pdfUrl=https://ceur-ws.org/Vol-1747/IT602_ICBO2016.pdf
|volume=Vol-1747
|authors=Daniel Welch,Amanda Hicks,Josh Hanna,William Hogan
|dblpUrl=https://dblp.org/rec/conf/icbo/0003HHH16
}}
==A Semantic Web Representation of Entire Populations ==
Daniel Welch, Amanda Hicks, Josh Hanna, William R. Hogan

Department of Health Outcomes and Policy, University of Florida, Gainesville, FL

dwelch2101@ufl.edu, aehicks@ufl.edu, joshhanna@ufl.edu, hoganwr@ufl.edu

Abstract: Accurately representing demographic realities is a critical component in creating useful, agent-based epidemiological models of infectious disease. Synthetic ecosystems are generated from Census data microsamples in a statistically sound manner to maintain population-level demographic characteristics. These highly detailed representations of populations are the basis of many advanced simulations of infectious disease epidemics. Creating a standard, machine-readable representation of synthetic ecosystem data would enable easier use and integration with epidemic simulator software. Here we describe an ontology-based representation in Resource Description Framework (RDF) and Web Ontology Language (OWL) of version 1.0 of the 2010 U.S. Synthetic Population database by RTI International. Our representation draws upon applicable classes from several reference ontologies, including the Ontology of Medically Related Social Entities (OMRSE). After failing to find suitable ontological representations of several key data elements in the Synthetic Population dataset, we created new classes in OMRSE for representing employment status, employee roles, workplaces, residences, households, and age measurements. We loaded a test RDF dataset (structured according to ontologies in OWL) of synthetic individuals into a commercial triple store (Stardog) and validated the representation with SPARQL queries.

Keywords: ontology; synthetic ecosystem; disease transmission model

===I. BACKGROUND===

Disease transmission models (DTMs) are epidemiological models that predict the future course of infectious disease outbreaks under various assumptions. They are used to study which strategies for controlling outbreaks are potentially the most effective and thus for decision making during the course of an outbreak. For example, researchers have used them to study the effects of various vaccination control strategies on pandemic influenza [1, 2, 3] and to study the Ebola outbreaks in western Africa [4].

Agent-based disease transmission models (AB-DTMs) are a class of DTMs that represent every host individual in the population of interest, and sometimes individual vector organisms as well, to increase the realism of the simulations and thereby increase the accuracy of the predictions generated [5]. To accomplish these goals, the characteristics of the population represented in the AB-DTM must closely match the characteristics of the actual population under study. As reported by Grefenstette et al., accounting for differences in population density and sociodemographics indeed affects DTM results for different regions [5].

Census data are a key resource for matching simulated populations to actual ones [5, 6]. They include data about demographics, housing units, household composition, employment, school attendance, and other physical and social dynamics that have the potential to influence infectious disease transmission. However, record-level Census data are typically only available as microsamples of the overall Census data set. Consequently, typically only 1%-5% of the population is represented. This amount of data is insufficient for use with AB-DTMs that model 100% of a population. To overcome this limitation, researchers employ statistical methods to generate a full population dataset from the microsamples such that the synthetic population-wide dataset mirrors the actual population in aggregate in terms of various demographic characteristics such as sex, race, and marital status [6]. For example, the synthetic populations (or more generally, synthetic ecosystems, since housing units, workplaces, schools, etc. are also represented) available have percentages of blacks, women, employed individuals, students, etc. that statistically match the actual population.

In addition to expanding microsample Census data to full population size, researchers incorporate significant additional information into synthetic ecosystems relevant to disease transmission [5, 6, 7]. For example, although Census data capture employment and school attendance statuses, they do not associate individual persons with individual workplaces or schools. However, for AB-DTMs, these linkages are critical for studying whether and how well school closures and workers’ decisions to stay at home (whether made individually or as a matter of public health or employer policy) control disease transmission. Therefore, a significant component of extant synthetic ecosystems is data about individual school and workplace assignment.

Researchers typically make these synthetic ecosystems available as delimited text files in a format suitable for loading into tables in a relational database management system, with limited semantics and simple integer values for representing categories such as race and gender. For example, see the extensive collection of synthetic ecosystems available at [8].

In this work, we make available full-population synthetic ecosystem data as Resource Description Framework (RDF) [9] triples. It differs from past efforts to include Census data in government linked open data (LOD) [10] in at least two key respects. First, we took a realist ontological perspective. To our knowledge, our work is the first to attempt to represent the necessary entities to cover a Census-derived dataset from a realist perspective. We were able to reuse significant components of other realist-based ontologies, but we also needed to carry out additional ontology development to accomplish the task. Second, to our knowledge, we are the first to attempt representing an entire population from Census data in a Semantic Web framework using synthetic ecosystem data created for AB-DTMs.

In previous work, we created the Ontology of Medically Related Social Entities (OMRSE) to handle demographics such as those represented in Census and electronic health record (EHR) data [11]. OMRSE is a realist representation of medically related social entities. Social entities are those entities that exist in reality but which would not exist outside of a social context. For example, the role of a doctor is distinct from the human being who bears that role. This role exists within the healthcare system and confers rights and responsibilities associated with treating and diagnosing a patient. It is the result of social agreements and interactions rather than of the physical stuff that makes up the natural world. It is realized through various processes of diagnosing, treating, prescribing, etc. We develop OMRSE in accordance with OBO Foundry best practices [12] and reuse classes from several other ontologies including Basic Formal Ontology [13], NCBI Taxonomy [14], Information Artifact Ontology [15], and the Document Acts Ontology [16].

Given the importance of school and workplace assignment and the data about them in synthetic ecosystems, it was critical to represent additionally the roles of students and employees. Furthermore, it was necessary to capture the relationships of these roles to the organizations that create them and to the individual facilities where they are realized. We also report here on the extent to which pre-existing ontologies fulfilled this need vs. the additional ontology development required.

===II. METHODS===

We reviewed the files generated by the Research Triangle Institute’s Synthia synthetic population generator [6] in conjunction with its documentation. Because Synthia uses U.S. Census files and public-use microsample (PUMS) data, we also reviewed U.S. Census definitions of the variables in those data.

We reviewed each of the data fields in the following subset of Synthia files: synth_people.txt, synth_households.txt, schools.txt, workplaces.txt. Through an iterative process, we analyzed and described each data field and determined whether to include the field in this work. The most common reason we excluded a field from the final ontological representation was redundancy. For example, we excluded data fields from synth_households.txt, schools.txt, and workplaces.txt that represented the total number of individuals assigned to a household, school, or workplace since these values could be derived by counting in the underlying data. Other fields were excluded because they were determined to be of lesser immediate importance to epidemiologic simulation, such as the prek, kinder, gr01-12, and ungraded fields in schools.txt, which represent the total number of students in different grade categories in a given school.

We then determined whether each included data field could be accurately modeled using existing ontological classes from OMRSE or other established ontologies, or whether new classes were necessary. We created graphical models of how the data would be structured ontologically as an initial specification for transforming the data into RDF, as well as to identify any new classes that we would need to create. These graphical models depict the individuals, relationships between pairs of individuals, and classes to which individuals belong. These diagrams included specifications for associating people with their workplaces and schools as represented in the dataset.

We then manually created these individuals and relationships for a single set of individuals in one household, including their associated school and workplace, in a Web Ontology Language (OWL) [17] file that imported OMRSE and the Apollo-SV ontology [18] (the latter was a choice of convenience because it already brings together ontological representations from numerous ontologies, including its own, in the domain of epidemic simulation). This OWL file served as the machine-readable specification for converting Synthia text files into RDF triples. Once we had this machine-readable specification, we created a software application that performed this conversion, and applied it to the county-based Synthia files for Alachua County, FL and Miami-Dade County, FL. This application is freely available at: https://github.com/ufbmi/synthia-rdf-converter. We then loaded the RDF triple datasets output by the application into an instance of the Stardog triple store.

====A. New Ontology Classes in OMRSE====

In accordance with OBO Foundry best practices, we reused as many classes and object properties from other ontologies as we could to generate the OWL file. After importing existing classes from OBO ontologies, it was still necessary to create new classes to represent several key elements of the Synthetic Population dataset. Specifically, we created new classes in OMRSE to represent employment status, employee roles, workplaces, residences, households, and age measurements.

====B. Queries of the RDF Dataset====

We developed queries of the RDF datasets to validate our representations as well as to identify population characteristics that are likely to influence disease transmission. If these characteristics differ significantly among regions, they could influence the choice of DTM used to study an infectious disease control strategy. For example, if two regions differ substantially in household and workplace composition, size of school-aged population, etc., an AB-DTM is likely to be the better choice. Furthermore, these queries could also be done as part of a simulation experiment to help explain differing results among geographical regions in incidence rates, peak dates, and choice of infectious disease control strategies output by the simulator.

Because the sizes of households, schools, workplaces, and the amount of overlap among them (e.g., households with an employee in the workplace and student in a school) influence disease transmission and thus potentially DTM results, we developed queries to find (1) the average numbers of individuals per household, workplace, and school; (2) the number and percentage of households with both an employee and a student; and (3) the number and percentage of workplaces with at least one employee who lives with a student. We executed these queries against both the Alachua County and Miami-Dade County datasets to contrast these locations based on characteristics relevant to disease transmission.

We loaded the RDF data into an instance of version 3 of the Stardog triple store from Complexible, Inc. This triple store runs on an Amazon Web Services r4.large instance (2 CPUs and 15.25GB of RAM). Queries were submitted from the Stardog command line on the same server on which the triple store was running. The timings we report here are from the Stardog command line output.

===III. RESULTS===

====A. RDF Representations====

To accurately model the U.S. Synthetic Population Database, we created RDF representations of the data fields relating to individual persons, households, housing units, workplaces, and schools. We created graphical models of these representations (Figs. 1-4). Fig. 2 illustrates our representation of humans in a household. Fig. 3 illustrates our representations of workplaces and employment.

Fig. 1. Key for Graphical Models.

Fig. 2. Graphical Model of RDF Specification of Household.

Fig. 3. Graphical Model of RDF Specification of a Person’s Relation to a Workplace.

Fig. 4. Graphical Model of RDF Specification of Age.

In analyzing Synthia data fields, we found that Synthia conflates households and housing units, despite being based on U.S. Census data that make the distinction clear. For example, Synthia assigns to households both the physical properties of a housing unit, such as latitude and longitude, as well as properties about the household as a social unit, such as total household income, race and age of the head of the household, and household size. Our approach distinguishes household from housing unit and asserts that housing units are individuated by their residence functions and that a household realizes the housing unit’s residence function by living there. In OMRSE, we define a household as a human or collection of humans that occupies a housing unit by storing their possessions there and habitually sleeping there, thereby participating in the realization of its residence function, and add the following description logic equivalence statement:

: household =def ('Homo sapiens' or 'collection of humans') and ('participates in' some (process and (realizes some 'residence function')))

where residence function is defined as a function that inheres in a material entity and is realized by protecting persons and their possessions from weather and by some person or group of persons habitually sleeping in at least one site that is contained by that material entity.

Many ontologies classify age as a physical quality, rather than as a measurement of some temporal interval with respect to the time the measurement was made. The Ontology for Biomedical Investigations (OBI) [19] has a class ‘age measurement datum’ with a class restriction that it ‘is about’ some age quality. The age quality class, in turn, comes from the Phenotypic Quality Ontology (PATO). By contrast, we represent age as a measurement of a one-dimensional temporal region that is occupied by a process that is part of the history of some object (Fig. 4).

====B. New OMRSE Classes====

We created a total of 11 new classes in OMRSE to support the representation of synthetic ecosystems. Each class has a textual definition adapted from U.S. Census. One major adaptation of the definitions was to put them in Aristotelian form with the name of the direct superclass as part of the definition. Other adaptations were necessary to eliminate ambiguity and to reuse other defined ontology terms. OMRSE is a publicly available resource at the following permanent URL: http://purl.obolibrary.org/obo/omrse.owl.

====C. RDF Datasets and Queries====

The Alachua County dataset comprised ~13M triples, and the Miami-Dade County dataset comprised 133M triples (Table I). The population totals for both counties are slightly lower than the 2010 Census numbers on which the Synthia datasets were based. The reason is that we did not incorporate group quarters such as nursing homes and military barracks, which is future work. The housing unit totals for both counties match the 2010 Census numbers.

{| class="wikitable"
|+ TABLE I. SUMMARY STATISTICS FOR TWO COUNTY-BASED DATASETS
! Statistic !! Alachua !! Miami-Dade
|-
| Triples || 13,315,702 || 133,973,948
|-
| People || 233,549 || 2,448,514
|-
| Schools || 64 || 442
|-
| Workplaces || 13,895 || 180,773
|-
| Housing Units || 100,517 || 867,252
|-
| Average Household Size || 2.32 || 2.82
|-
| Employees per workplace || 8.05 || 6.13
|-
| Students per school || 584 || 1070
|-
| Workplaces that overlap with a school || 7895 (56.8%) || 121,951 (67.5%)
|-
| Households with both an employee and a student || 20,244 (20%) || 255,614 (29.5%)
|}

The data show distinct differences, as expected, between Miami-Dade, a large urban county, and Alachua, a small county (in terms of population) where a large university is located. Miami-Dade has a larger household size and school size, a greater percentage of workplaces with at least one employee who lives with at least one school student, and a greater percentage of households with at least one workplace employee and school student. By contrast, Alachua has a higher average workplace size, even when the University of Florida is excluded from consideration. These differences are likely to impact simulator results: Miami-Dade will often have a larger incidence and prevalence of infectious disease that is spread from person to person, such as influenza, in the absence of control measures. Infectious disease control measures designed to reduce school and workplace transmission (such as school closure, voluntary or imposed absenteeism from work, and vaccination of the school and/or workplace population) are likely to have a greater predicted effectiveness (and thus perhaps actual effectiveness) in Miami-Dade than Alachua.

The execution time for the SPARQL queries ranged from a few milliseconds to 41 seconds. The longest of these was the query that counted all workplaces with at least one employee who lives at home with at least one student.

====D. Availability of Materials====

All materials created for this paper, namely the graphical models (including additional ones not shown here), the SPARQL queries, and the OWL files with the entire datasets for Alachua and Miami-Dade counties, are freely available under a Creative Commons Attribution (CC BY 4.0) license at: http://tinyurl.com/syneco-queries.

===IV. DISCUSSION===

We developed a Semantic Web and realism-based representation of the entire populations of two counties in Florida. We built SPARQL queries to assess differences between the two populations that are likely to influence disease transmission, as well as the results of experiments conducted using DTMs. The approach is generic and could be applied to any other synthetic ecosystem data, including for additional geographical regions. The queries are generic and could be applied to any additional county-based datasets (or datasets at other levels of geographical granularity such as Census tract) similarly transformed via our processes and representations.

We have demonstrated the feasibility of using Semantic Web technologies for representing entire populations, and in particular for representing synthetic ecosystems for use in AB-DTMs. Additionally, through additions to OMRSE and the creation of RDF synthetic datasets, we have developed some of the resources necessary to transform other U.S. Census data into Semantic Web representations. In so doing, we have made explicit much of the semantics that are implicit in those data and the synthetic ecosystems that are based on them. It is our conjecture for future work that the explicit semantics improve the ease with which synthetic ecosystems can be expanded to incorporate additional biological, social, and abiotic ecosystem elements.

Although we developed this work in the context of agent-based DTMs, this resource and approach could also be leveraged for social network analysis due to the graph-based nature of RDF. For example, one could construct queries for finding hubs in the network and people or places that a set of people have in common. Furthermore, DTMs are increasingly taking into account social networks as part of the synthetic ecosystem itself (for example, see Frias-Martinez et al. [20]). Network-based approaches and graph representations such as our RDF-based one here are more extensible and suitable for representing these networks.

Future work includes expanding the specification to include data related to group quarters, which will require additional ontological analysis and ontology development.

===ACKNOWLEDGMENTS===

This work was supported by award UL1TR001427 from the National Center for Advancing Translational Sciences (NCATS) and award U24GM110707 from the National Institute of General Medical Sciences (NIGMS). The content is solely the responsibility of the authors and does not necessarily represent the official views of NCATS, NIGMS, or the NIH.

===REFERENCES===

[1] S. T. Brown, J. H. Tai, R. R. Bailey, P. C. Cooley, W. D. Wheaton, M. A. Potter, et al., “Would school closure for the 2009 H1N1 influenza epidemic have been worth the cost?: a computational simulation of Pennsylvania,” BMC Pub. Health, vol. 11, p. 353, 2011.

[2] M. E. Halloran, N. M. Ferguson, S. Eubank, I. M. Longini, D. A. Cummings, B. Lewis, et al., “Modeling targeted layered containment of an influenza pandemic in the United States,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105(12), pp. 4639-4644, 2008.

[3] I. M. Longini, A. Nizam, S. Xu, K. Ungchusak, W. Hanshaoworakul, D. A. Cummings, and M. E. Halloran, “Containing pandemic influenza at the source,” Science, vol. 309(5737), pp. 1083-1087, 2005.

[4] C. Siettos, C. Anastassopoulou, L. Russo, C. Grigoras, and E. Mylonakis, “Modeling the 2014 ebola virus epidemic – agent-based simulations, temporal analysis and future predictions for Liberia and Sierra Leone,” PLOS Currents Outbreaks, Edition 1, 2015.

[5] J. J. Grefenstette, S. T. Brown, R. Rosenfeld, J. DePasse, N. Stone, P. C. Cooley, et al., “FRED (a Framework for Reconstructing Epidemic Dynamics): an open-source software system for modeling infectious diseases and control strategies using census-based populations,” BMC Pub. Health, vol. 13, p. 940, 2013.

[6] W. D. Wheaton, J. C. Cajka, B. M. Chasteen, D. K. Wagener, P. C. Cooley, L. Ganapathi, et al., “Synthesized population databases: a US geospatial database for agent-based models,” Methods Report, RTI Press, 2009(10), p. 905, 2009.

[7] MIDAS Informatics Services Group, “Synthetic Populations and Ecosystems of the World,” 2016. http://data.olympus.psc.edu/syneco/spew_documentation.pdf

[8] MIDAS, “Synthetic Populations and Ecosystems,” 2014. http://www.epimodels.org/drupal-new/?q=node/112

[9] W3C, “RDF Current Status,” https://www.w3.org/standards/techs/rdf#w3c_all. Last accessed 06/20/2016.

[10] L. Ding, T. Lebo, J. S. Erickson, D. DiFranzo, G. T. Williams, X. Li, et al., “TWC LOGD: a portal for linked open government data ecosystems,” Journal of Web Semantics, vol. 9(3), pp. 1–11, 2011.

[11] W. R. Hogan, S. Garimalla, and S. A. Tariq, “Representing the reality underlying demographic data,” In Proceedings of the International Conference on Biomedical Ontology, pp. 147-152, Buffalo, NY: International Conference on Biomedical Ontology, 2011.

[12] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, et al., “The Obo Foundry: coordinated evolution of ontologies to support biomedical data integration,” Nature Biotechnology, vol. 25(11), p. 1251, 2007.

[13] P. Grenon and B. Smith, “Snap and span: towards dynamic spatial ontology,” Spatial Cognition and Computation, vol. 4(1), pp. 69-104, 2004.

[14] S. Federhen, “The NCBI taxonomy database,” Nucleic Acids Research, vol. 40(D1), pp. D136-D43, 2012.

[15] W. Ceusters, Ed. “An information artifact ontology perspective on data collections and associated representational artifacts,” MIE, 2012.

[16] M. B. Almeida, L. Slaughter, and M. Brochhausen, Eds. “Towards an ontology of document acts: introducing a document act template for healthcare,” In On the Move to Meaningful Internet Systems: OTM 2012 Workshops, Rome, Italy: Springer, 2012, pp. 420-425.

[17] W3C, “OWL 2 Web Ontology Language Document Overview (Second Edition),” 2012. https://www.w3.org/TR/owl2-overview/. Last accessed 06/20/2016.

[18] M. Brochhausen, W. R. Hogan, J. Levander, S. T. Brown, N. Millet, J. Hanna, et al., “A novel representation of terms related to infectious disease epidemiology for epidemic modeling: the Apollo Structured Vocabulary and pre-existing representations,” In Proceedings of the International Conference on Biomedical Ontology, Houston, Texas: CEUR Workshop, W. R. Hogan, S. Arabandi, and M. Brochhausen, Eds. 2014, pp. 21-26.

[19] R. R. Brinkman, M. Courtot, D. Derom, J. M. Fostel, Y. He, P. Lord, et al., “Modeling biomedical experimental processes with OBI,” Journal of Biomed Semantics, vol. 1 (Suppl 1), p. S7, 2010.

[20] E. Frias-Martinez, et al., “An Agent-Based Model of Epidemic Spread Using Human Mobility and Social Network Information,” Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011.
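As a rough illustration of the three validation queries described under "Queries of the RDF Dataset" (average household size, households with both an employee and a student, and workplaces with at least one employee who lives with a student), the following Python sketch builds a tiny in-memory set of triples from simplified person rows and computes the same kinds of counts. It is not the converter or the SPARQL used in the paper. The predicate names (`ex:memberOfHousehold`, `ex:employedAt`, `ex:enrolledAt`), the row field names, and the sample rows are all hypothetical and chosen for readability; the published datasets use OMRSE and other OBO identifiers and were queried in Stardog.

```python
from collections import defaultdict

# Hypothetical predicate names for illustration only; the paper's RDF
# uses OMRSE/OBO classes and object properties instead.
MEMBER_OF = "ex:memberOfHousehold"
EMPLOYED_AT = "ex:employedAt"
ENROLLED_AT = "ex:enrolledAt"

# Hypothetical, simplified person rows (not the real Synthia column layout).
SAMPLE_PEOPLE = [
    {"id": 1, "household": "h1", "workplace": "w1"},
    {"id": 2, "household": "h1", "school": "s1"},
    {"id": 3, "household": "h2", "workplace": "w1"},
    {"id": 4, "household": "h2", "workplace": "w2"},
    {"id": 5, "household": "h3", "school": "s1"},
]


def rows_to_triples(people):
    """Turn person rows into a set of (subject, predicate, object) triples."""
    triples = set()
    for row in people:
        person = f"ex:person/{row['id']}"
        triples.add((person, MEMBER_OF, f"ex:household/{row['household']}"))
        if row.get("workplace"):
            triples.add((person, EMPLOYED_AT, f"ex:workplace/{row['workplace']}"))
        if row.get("school"):
            triples.add((person, ENROLLED_AT, f"ex:school/{row['school']}"))
    return triples


def objects_by_subject(triples, predicate):
    """Index the objects of one predicate by subject."""
    index = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            index[s].add(o)
    return index


def summarize(triples):
    """Compute counts analogous to queries (1)-(3) in the text."""
    household_of = objects_by_subject(triples, MEMBER_OF)
    works_at = objects_by_subject(triples, EMPLOYED_AT)
    studies_at = objects_by_subject(triples, ENROLLED_AT)

    # Invert person -> household into household -> members.
    members = defaultdict(set)
    for person, households in household_of.items():
        for household in households:
            members[household].add(person)

    # (1) Average number of individuals per household.
    avg_size = sum(len(m) for m in members.values()) / len(members)

    # (2) Households with an employee and a student. Assumption: one person
    # who is both employee and student is enough to count the household.
    mixed = {
        h for h, m in members.items()
        if any(p in works_at for p in m) and any(p in studies_at for p in m)
    }

    # (3) Workplaces with an employee who shares a household with a student.
    # Assumption: "lives with" means a different person in the same household.
    linked = set()
    for person, workplaces in works_at.items():
        for household in household_of.get(person, ()):
            housemates = members[household] - {person}
            if any(mate in studies_at for mate in housemates):
                linked |= workplaces

    return {
        "households": len(members),
        "avg_household_size": avg_size,
        "households_with_employee_and_student": len(mixed),
        "workplaces_with_employee_living_with_student": len(linked),
    }


if __name__ == "__main__":
    for name, value in summarize(rows_to_triples(SAMPLE_PEOPLE)).items():
        print(f"{name}: {value}")
```

Dividing the last two counts by the total number of households or workplaces gives percentages of the kind reported in Table I. In a triple store the same logic would be expressed as SPARQL graph patterns with aggregates rather than Python dictionaries.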