<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Composition of Disparate Data in GeoHealthUS for Navigation, Display and Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Wood</string-name>
          <email>david@geohealth.us</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GeoHealth US Corp</institution>
          ,
          <addr-line>Arlington, VA 22209</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>GeoHealth US Corp produces environmental monitoring data in the United States and aggregates that data with historical environmental data from US Government sources. Data sources include five programs that track pollution history at the US Environmental Protection Agency, and air quality data collected by GeoHealth US Corp itself. Environmental monitoring data is combined with information from the US Department of Health &amp; Human Services, including the US National Library of Medicine, to automatically associate environmental conditions with diseases and symptoms. Semantic Web techniques are used to perform the data integration, to navigate the data for analysis, and to drive the display of contextually relevant data in a Web user interface. GeoHealth US Corp is a Virginia Benefit Corporation (or "B-corporation"), a for-profit corporate entity that includes positive impacts on society and the environment in its core mission.</p>
      </abstract>
      <kwd-group>
        <kwd>Environment</kwd>
        <kwd>Health</kwd>
        <kwd>Environmental Data</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Linked Data Platform</kwd>
        <kwd>LDP</kwd>
        <kwd>RDF</kwd>
        <kwd>W3C</kwd>
        <kwd>Callimachus</kwd>
        <kwd>Semantic Web</kwd>
        <kwd>US EPA</kwd>
        <kwd>GeoHealthUS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Human health is affected by at least three classes of factors: lifestyle,
genetics, and environment. Of these, environmental conditions remain the most
poorly integrated into the US healthcare system. Environmental health
information in the US is sparse, scattered, and in forms that make it difficult to find,
to combine, and to relate to individual patients. In some cases, it is of
questionable quality. We set out to address those problems in order to provide a
more complete picture of the environmental impacts on US public health. The
potential benefits are significant given the size of the US healthcare market:
approximately 3 trillion USD is spent annually on healthcare[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with more than
1.3 trillion USD spent by government entities[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>GeoHealthUS approached this problem by relating multiple US Government
datasets in Linked Data formats (mostly from the US Environmental Protection
Agency) with pseudo-real-time environmental air quality data collected by the
company. Additional Linked Data was created to describe diseases and chemical
substances related to them. Such data was previously available only in traditional
(XML) formats from the US Department of Health &amp; Human Services.</p>
      <p>
        The evolving GeoHealthUS Web site is available at http://geohealth.us.
Individuals may currently view historic pollution report data while the newer
environmental data is being integrated<sup>1</sup>. The site complies with the Linked Data
Platform v1.0 specification[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and is based on the Callimachus Project's Linked
Data platform[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A SPARQL endpoint is available that allows for querying both
RDF and non-RDF data, although it is currently restricted to authenticated
users.
      </p>
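      <p>As a rough illustration of how a client might address such an endpoint, the sketch below builds a "query via GET" URL as defined by the SPARQL 1.1 Protocol. The endpoint address and the query itself are assumptions made for this example; the actual endpoint location is not given here, and a real request would also need the credentials required for authenticated users.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint URL; the real address is not stated in the paper.
ENDPOINT = "http://geohealth.us/sparql"

# Illustrative SPARQL 1.1 query over the W3C WGS84 geo vocabulary.
QUERY = """\
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?place ?lat ?long
WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long .
}
LIMIT 10
"""


def build_query_url(endpoint: str, query: str) -> str:
    """Return a SPARQL 1.1 Protocol 'query via GET' URL for the endpoint."""
    # The protocol carries the query text in a URL-encoded 'query' parameter.
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```
]]></preformat>
      <p>The sketch only constructs the request URL and does not send it, since access to the endpoint is restricted.</p>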
    </sec>
    <sec id="sec-2">
      <title>Current Air Quality Data</title>
      <p>GeoHealthUS has created mobile sensor packages called GeoHealthBoxes™ to
rapidly collect air quality information. GeoHealthBoxes are typically mounted on
vehicles, from which measurements are taken. GeoHealthBoxes use a combination of Open
Hardware, software written by GeoHealth US Corp, and a number of proprietary
environmental sensors.</p>
      <p>Newly collected environmental data from mobile sensors was considered too
voluminous for RDF modeling to make sense. Instead, RDF summaries were
created to facilitate locating and querying such data in larger relational
databases. RDF and non-RDF data are presented in a single Web user interface.
Contextual subsets of data are also presented for download in various reports and
formats (such as Turtle, JSON-LD, and CSV). Contextual subsets of non-RDF
data are converted to RDF at download time, as requested.</p>
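      <p>The conversion of a non-RDF subset to RDF at download time can be pictured with a minimal sketch such as the following, which serializes rows from a relational query as Turtle. The base URI, the ex: vocabulary, the property names, and the sample row are all invented for illustration and do not describe the actual GeoHealthUS data model.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical base URI and vocabulary, used for illustration only.
BASE = "http://example.org/geohealth/"
PREFIXES = {
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "ex": BASE + "vocab#",
}


def literal(value, datatype):
    """Format a typed Turtle literal, e.g. "38.8951"^^xsd:decimal."""
    return '"{}"^^xsd:{}'.format(value, datatype)


def rows_to_turtle(rows):
    """Serialize sensor readings (dicts from a relational query) as Turtle."""
    lines = ["@prefix {}: <{}> .".format(p, uri) for p, uri in PREFIXES.items()]
    lines.append("")
    for row in rows:
        # One subject per reading, followed by its predicate-object list.
        lines.append("<{}reading/{}>".format(BASE, row["id"]))
        lines.append("    a ex:AirQualityReading ;")
        lines.append("    geo:lat {} ;".format(literal(row["lat"], "decimal")))
        lines.append("    geo:long {} ;".format(literal(row["long"], "decimal")))
        lines.append("    ex:value {} ;".format(literal(row["value"], "decimal")))
        lines.append("    ex:observedAt {} .".format(literal(row["time"], "dateTime")))
        lines.append("")
    return "\n".join(lines)


# Invented sample row standing in for the result of a relational query.
rows = [
    {"id": 1, "lat": "38.8951", "long": "-77.0703", "value": "8.4",
     "time": "2015-06-30T12:00:00Z"},
]
turtle = rows_to_turtle(rows)
print(turtle)
```
]]></preformat>
      <p>Generating RDF only for the requested subset keeps the bulk of the sensor data in the relational store, in line with the summary-based approach described above.</p>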
    </sec>
    <sec id="sec-3">
      <title>Historic Environmental Data</title>
      <p>Historic environmental data from the US Environmental Protection Agency
(EPA) was available from the programs summarized in Table 1.</p>
      <p>Historical data from government sources represents a 25-year history of
environmental pollutants and some of their effects. All government data was either
available in RDF formats or converted into RDF models for the purpose of
simple composition. Vocabulary mapping of government data sets was undertaken
as necessary.</p>
      <p>The US EPA operates a Linked Open Data service in a quality assurance
mode. This data service is not yet publicly available. The source data for the
programs is generally available via http://data.gov, but in some cases EPA
must be contacted directly to acquire information. This situation highlights the
relative immaturity of US Government data sources when users desire to combine
arbitrary data sets for further analysis and/or repurposing.</p>
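      <p>Table 2 lists the namespaces of some of the common Semantic Web
vocabularies used to represent the RDF portion of the data. Core vocabularies included
the rdf, rdfs, owl, skos, and xsd namespaces.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Namespace</th><th>URI</th><th>Purpose</th></tr>
          </thead>
          <tbody>
            <tr><td>foaf</td><td>http://xmlns.com/foaf/0.1/</td><td>Nearness, depictions</td></tr>
            <tr><td>geo</td><td>http://www.w3.org/2003/01/geo/wgs84_pos#</td><td>Locations</td></tr>
            <tr><td>place</td><td>http://purl.org/ontology/places#</td><td>Locations</td></tr>
            <tr><td>dbpedia</td><td>http://dbpedia.org/resource/</td><td>Units of measure, companies</td></tr>
            <tr><td>vcard</td><td>http://www.w3.org/2006/vcard/ns#</td><td>Addresses</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><sup>1</sup> A prototypical and temporary interface for a portion of the air quality data is available
at http://geohealth.us/home/pages/livedata.xhtml?view</p>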
      <p>We made use of a number of RDF vocabularies specific to the EPA datasets,
which include definitions such as the classification of facilities, substances, and
reports. These vocabularies are under the base URI of http://opendata.epa.gov,
but have not yet been formally published by the US EPA. We know of them
through our prior work with the agency, and look forward to them being publicly
available soon. Location information in EPA datasets was augmented by the use
of a custom vocabulary created to represent US postal (ZIP) codes to facilitate
the creation of maps and other geographic displays.</p>
    </sec>
    <sec id="sec-4">
      <title>Relating Diseases and Symptoms</title>
      <p>Additionally, a number of custom vocabularies were developed to represent
information specific to the GeoHealthUS application, such as the description of
diseases (extracted from traditional data formats available from the US Department
of Health &amp; Human Services). This data was mapped to chemical substances via
SRS identifiers and to clinical findings using SNOMED-CT identifiers.</p>
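      <p>Linking by shared identifiers can be sketched as a simple join. The example below assumes, hypothetically, that pollution reports also carry SRS substance identifiers, and it uses made-up placeholder identifiers rather than real SRS or SNOMED-CT codes. It is meant only to show the idea of identifier alignment, not the actual GeoHealthUS queries.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

# All identifiers below are invented placeholders, not real SRS codes.
# (disease, substance id) pairs, as might be derived from the health data.
disease_substances = [
    ("disease:A", "srs:0001"),
    ("disease:A", "srs:0002"),
    ("disease:B", "srs:0002"),
]
# (facility, substance id) pairs, as might be derived from pollution reports.
facility_releases = [
    ("facility:X", "srs:0002"),
    ("facility:Y", "srs:0003"),
]


def diseases_by_facility(disease_substances, facility_releases):
    """Join the two datasets on their shared substance identifier."""
    by_substance = defaultdict(set)
    for disease, substance in disease_substances:
        by_substance[substance].add(disease)
    result = defaultdict(set)
    for facility, substance in facility_releases:
        # Facilities whose substances match no disease keep an empty set.
        result[facility] |= by_substance.get(substance, set())
    return {facility: sorted(found) for facility, found in result.items()}


links = diseases_by_facility(disease_substances, facility_releases)
print(links)
```
]]></preformat>
      <p>In an RDF setting the same join falls out of using one URI per substance in both datasets, so that a single graph pattern can traverse from a facility to the associated diseases.</p>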
    </sec>
    <sec id="sec-5">
      <title>Data Architecture</title>
      <p>[Figure: data architecture diagram. Recoverable labels: multiple source databases (DB), Historical Data, ETL, SQL, SPARQL, RDF Data Layer.]</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>The GeoHealthUS poster illustrates three benefits of a Semantic Web approach
to data integration:</p>
      <list list-type="order">
        <list-item>
          <p>Conversion to RDF formats facilitated data integration across many data
sets by the simple mechanism of identifier alignment.</p>
        </list-item>
        <list-item>
          <p>Presentation to end users is driven by a small number of SPARQL v1.1 queries,
which simplifies maintenance and reduces maintenance costs compared with
traditional approaches.</p>
        </list-item>
        <list-item>
          <p>The approach was shown to function in a large, real-world use case with
significant economic potential.</p>
        </list-item>
      </list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chantrill</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>US Health Care Spending</article-title>
          . http://www.usgovernmentspending.com/us_health_care_spending_10.html. Accessed 30 June 2015.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Centers for Medicare &amp; Medicaid Services.
          <source>National Health Expenditures 2013 Highlights</source>
          . http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData/Downloads/highlights.pdf. Accessed 30 June 2015.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Speicher</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arwe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Malhotra</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . (eds).
          <source>Linked Data Platform 1.0</source>
          . W3C Recommendation, 26 February 2015. Retrieved 19 April 2015 from http://www.w3.org/TR/2015/REC-ldp-20150226/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Leigh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <source>The Callimachus Project</source>
          . Retrieved 19 April 2015 from http://callimachusproject.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>