<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>openChart: Charting Quantitative Properties in LOD</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Filip Zembowicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Opolon</string-name>
          <email>opolon@alum.mit.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen Miles</string-name>
          <email>s_miles@mit.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Harvard University, 414 Quincy Mailing Center</institution>
          ,
          <addr-line>Cambridge, MA 02138</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MIT AutoID Labs</institution>
          ,
          <addr-line>77 Massachusetts Avenue, 35-014, Cambridge, MA 02139</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>MIT ESD</institution>
          ,
          <addr-line>77 Massachusetts Avenue, E40-286, Cambridge, MA 02139</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <volume>27</volume>
      <issue>2010</issue>
      <abstract>
        <p>In this paper, we discuss the development of openChart, a quantitative Linked Open Data charting tool. It targets novice semantic web users by generating SPARQL queries to present interesting information. We also acknowledge the problems encountered in development and suggest improvements.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Open Data</kwd>
        <kwd>Visualization</kwd>
        <kwd>Charting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. STRUCTURE</title>
      <p>Finding data on the LOD cloud using openChart<sup>1</sup> consists of
identifying an entity of interest, choosing two of its quantitative
properties, and selecting a peer group with which to compare
values. To provide an entry point into the semantic web, we use
Wikipedia’s autosuggest API to determine the entity’s Wikipedia
address. This address is then matched with the corresponding semantic web
resource using a SPARQL query against DBpedia, which
is a central hub of the LOD cloud with many owl:sameAs links
to other sources of data [2]. Other endpoints could be used
with the openChart framework, but DBpedia’s high number of
links to other LOD sources makes it well suited to a general-purpose
tool.</p>
      <p>Once an entity of interest has been identified, for example
Bangladesh, we extract the quantitative properties from its RDF
resource by using regular expressions to discard non-quantitative
values. The user selects two of these properties, for example
hdi and population density. Peer groups are then found through a
SPARQL query that looks for each distinct rdf:type whose members
have both of the selected properties. These peer groups
may or may not contain the user’s original search term, but
selecting one, for example Country, displays a scatter plot
of the two variables. The peer group feature is an important
aspect of our application because it lets the user branch out
when navigating information on the semantic web rather than
fixate on answering one question in particular.</p>
      <p>At all levels of the exploration, the data is cached locally in a
MySQL server. The frontend is written in JavaScript using the
jQuery library, while the backend is written in PHP 5 using the
ARC2 library. The plotting is done using the Protovis JavaScript library.</p>
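      <p>As a minimal sketch of the peer-group step described above, the two SPARQL queries could be assembled as follows. The function names, the LIMIT default, and the example property and class URIs are illustrative assumptions and are not taken from the openChart source.</p>
      <preformat><![CDATA[
```python
# Sketch only: function names, the LIMIT default, and the example URIs below
# are assumptions for illustration, not openChart's actual code.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def peer_group_query(prop_x: str, prop_y: str, limit: int = 50) -> str:
    """SPARQL listing each distinct rdf:type whose members have both properties."""
    return (
        "SELECT DISTINCT ?type WHERE {\n"
        f"  ?s <{RDF_TYPE}> ?type .\n"
        f"  ?s <{prop_x}> ?x .\n"
        f"  ?s <{prop_y}> ?y .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def scatter_query(peer_type: str, prop_x: str, prop_y: str) -> str:
    """SPARQL returning one (x, y) pair per member of the chosen peer group."""
    return (
        "SELECT ?s ?x ?y WHERE {\n"
        f"  ?s <{RDF_TYPE}> <{peer_type}> .\n"
        f"  ?s <{prop_x}> ?x .\n"
        f"  ?s <{prop_y}> ?y .\n"
        "}"
    )


if __name__ == "__main__":
    hdi = "http://dbpedia.org/property/hdi"  # assumed property URI
    density = "http://dbpedia.org/property/populationDensity"  # assumed property URI
    print(peer_group_query(hdi, density))
    print(scatter_query("http://dbpedia.org/ontology/Country", hdi, density))
```
]]></preformat>
      <p>The first query yields the candidate peer groups; once the user picks one, the second query returns the points for the scatter plot.</p>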
    </sec>
    <sec id="sec-2">
      <title>3. RESULTS</title>
    </sec>
    <sec id="sec-3">
      <title>3.1 Easy Exploration with Structured Queries</title>
      <p>A focus on a simple user interface has made openChart an easy-to-use
introduction for WWW users unfamiliar with Linked Open
Data. By presenting peer groups and not only information
directly relevant to a user’s query, the openChart tool emphasizes
a broad exploration of available data rather than merely answering
a specific question. Additionally, we incorporate a social
component into openChart, through which interesting relationships
between concepts can be shared. These shared charts represent new
knowledge, which will eventually be integrated into the LOD
cloud itself by defining such shared charts as RDF objects.
</p>
    </sec>
    <sec id="sec-3-2">
      <title>3.2 Identification of Errors in the Data<sup>2</sup></title>
      <p>An additional benefit of displaying data visually in openChart is
the ability to quickly identify errors within the data contained in
the LOD cloud. In isolation, it is often difficult to see errors in
scale or other such mistakes; displaying them as outliers enables
mistakes to be rapidly identified. These data points can then be
flagged for review in order to improve the quality of the source
data, or any scripts that are used to parse the data into the RDF
format in the first place. Such flagging could be achieved by
defining a quality ontology and publishing triples for user
identified errors.
        <fn>
          <label>1</label>
          <p>The demonstration can be found at http://openchart.mit.edu</p>
        </fn>
        <fn>
          <label>2</label>
          <p>An example may be seen: http://hcs.harvard.edu/datavis/linkeddata/gallery/index.php?chart=19</p>
        </fn>
      </p>
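      <p>The paper relies on visual inspection to spot such outliers. As a rough automated counterpart, and only as an assumption about how candidate points might be pre-flagged for review, the sketch below uses a modified z-score based on the median absolute deviation. The function name, the 3.5 threshold, and the sample values are all hypothetical.</p>
      <preformat><![CDATA[
```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |v - median| / MAD, where MAD is the
    median absolute deviation. If MAD is zero, nothing is flagged.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


if __name__ == "__main__":
    # Made-up densities; the last one mimics a scale error (factor of 1000).
    densities = [964, 1020, 1100, 980, 1050, 1_010_000]
    print(flag_outliers(densities))
```
]]></preformat>
      <p>Flagged indices would still need human review, since a genuine extreme value and a parsing error look the same to this test.</p>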
    </sec>
    <sec id="sec-4">
      <title>4. PROBLEMS ENCOUNTERED</title>
    </sec>
    <sec id="sec-5">
      <title>4.1 Lack of Range Descriptors</title>
      <p>When searching the LOD cloud through a SPARQL query, it
would be economical to retrieve only
properties whose ranges limit them to numerical values.
However, we found that many of the properties lack associated
rdfs:range and/or rdfs:domain values. As a result, we had to
retrieve all results and then filter them using regular expressions,
which increases the overhead of the application. We therefore suggest that
RDF authors take the time to specify rdfs:domain and rdfs:range
values, using numeric ranges such as xsd:integer and xsd:decimal where they apply, to facilitate statistical
work using Linked Open Data.</p>
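      <p>The regular-expression filtering mentioned above might look roughly like the following sketch. The pattern, the function name, and the sample property values are assumptions chosen for illustration; the actual expressions used by openChart are not given in the paper.</p>
      <preformat><![CDATA[
```python
import re

# Assumed pattern: optional sign, digits (optionally with thousands
# separators), optional decimal part, optional exponent.
NUMERIC = re.compile(r"^[+-]?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?([eE][+-]?\d+)?$")


def quantitative_properties(props: dict[str, str]) -> dict[str, float]:
    """Keep only the properties whose literal value parses as a number."""
    numeric = {}
    for name, raw in props.items():
        text = raw.strip()
        if NUMERIC.match(text):
            numeric[name] = float(text.replace(",", ""))
    return numeric


if __name__ == "__main__":
    # Hypothetical literals for one resource.
    sample = {
        "hdi": "0.543",
        "populationDensity": "1,099.3",
        "capital": "Dhaka",
        "note": "12 districts",
    }
    print(quantitative_properties(sample))
```
]]></preformat>
      <p>Every literal must be fetched before it can be tested this way, which is the overhead that a declared numeric rdfs:range would avoid.</p>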
    </sec>
    <sec id="sec-6">
      <title>4.2 Lack of Unit Descriptors</title>
      <p>Another aspect often missing from data sources, especially from
DBpedia, is units of measure. Particularly when comparing across
endpoints, the units of measurement must be
known in order to prevent scaling errors. We suggest that creators of RDF data
take the time to include unit specifications, either through
ontologies such as Quantities, Units, Dimensions and Data Types
in OWL and XML [6], or by agreeing on standardized unit
abbreviations and distributing unit-aware parsers.</p>
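      <p>A unit-aware parser of the kind suggested above could, in its simplest form, map each agreed abbreviation to a base unit and a conversion factor. The abbreviations, the table, and the function name below are hypothetical; only the conversion factors are standard.</p>
      <preformat><![CDATA[
```python
# Hypothetical abbreviation table: abbreviation -> (base unit, factor to base).
TO_BASE = {
    "m2": ("m2", 1.0),
    "ha": ("m2", 1e4),
    "km2": ("m2", 1e6),
    "mi2": ("m2", 2_589_988.110336),  # square international mile
}


def to_base(value: float, unit: str) -> tuple[float, str]:
    """Convert a value to its base unit, refusing unknown abbreviations."""
    try:
        base, factor = TO_BASE[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None
    return value * factor, base


if __name__ == "__main__":
    # Two sources reporting the same area in different units.
    print(to_base(147570, "km2"))
    print(to_base(14757000, "ha"))
```
]]></preformat>
      <p>Rejecting unknown abbreviations, rather than passing the raw number through, keeps values with unclear units out of a comparison.</p>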
    </sec>
    <sec id="sec-7">
      <title>5. FUTURE DEVELOPMENT</title>
    </sec>
    <sec id="sec-8">
      <title>5.1 Automated Provenance</title>
      <p>Because the data in openChart comes from multiple sources,
tracking the sources of a chart’s data is important for
enabling the use of the charts in research. We therefore plan to
implement a feature that displays the origins of the data contained
within a chart alongside the chart.
RDF quadruples (such as [4]) would make this
easy to implement, but methods that determine authorship from
particular endpoint characteristics could already be implemented
today.</p>
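      <p>To illustrate why quadruples make provenance straightforward, the sketch below stores each plotted value with a fourth element naming its source and then collects the distinct sources for a chart. The Quad type, the function name, and the example.org URIs are assumptions, not part of openChart.</p>
      <preformat><![CDATA[
```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    value: float
    graph: str  # fourth element: the source the statement came from


def chart_sources(quads: list[Quad]) -> list[str]:
    """Distinct source graphs in first-seen order, for display beside a chart."""
    return list(dict.fromkeys(q.graph for q in quads))


if __name__ == "__main__":
    points = [
        Quad("http://dbpedia.org/resource/Bangladesh",
             "http://dbpedia.org/property/hdi", 0.543, "http://dbpedia.org"),
        Quad("http://dbpedia.org/resource/Bangladesh",
             "http://example.org/density", 1099.3, "http://example.org/stats"),
        Quad("http://dbpedia.org/resource/India",
             "http://dbpedia.org/property/hdi", 0.612, "http://dbpedia.org"),
    ]
    print(chart_sources(points))
```
]]></preformat>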
    </sec>
    <sec id="sec-9">
      <title>5.2 Integration with Existing LOD Browsers</title>
      <p>
        Many browsers of semantic web data already exist, such as
Tabulator, which offer capabilities similar to our system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Because openChart permits only a restricted set of queries, it is
easier to use than these programs. To offset that restriction, we
are working to enable switching back and forth between
Tabulator and openChart, so that more technical users can
experience the full potential of the semantic web, using openChart
as a starting point.
      </p>
    </sec>
    <sec id="sec-10">
      <title>5.3 Publishing of Results</title>
      <p>As mentioned previously, the information gleaned from
openChart can be published for others to access. Statistical
relationships can be described using the SCOVO ontology, which
allows the specification of statistics with reference to a particular
dataset over a range of time [5]. Care must be taken to ensure the
completeness of the data, however, since the statistics generated
represent only the data published to the LOD cloud. Two groups
of statistics are generated: one describing the local cloud itself,
such as the number of triples, and another describing
the data contained therein.</p>
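      <p>As a hedged sketch of what publishing one statistic might involve, the function below serializes a single value as Turtle. The SCOVO terms used (Item, dataset, dimension) are quoted from memory of the vocabulary and should be checked against its specification; the function name, the example.org URIs, and the sample value are hypothetical.</p>
      <preformat><![CDATA[
```python
# Sketch only: SCOVO term names are assumptions to verify against the
# vocabulary; every example.org URI is a placeholder.
SCOVO = "http://purl.org/NET/scovo#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def scovo_item(item: str, dataset: str, dimensions: list[str], value: float) -> str:
    """Serialize one statistical item as Turtle, written with full URIs."""
    if not dimensions:
        raise ValueError("an item needs at least one dimension")
    dims = " , ".join(f"<{d}>" for d in dimensions)
    return "\n".join([
        f"<{item}> a <{SCOVO}Item> ;",
        f"    <{RDF}value> {value!r} ;",
        f"    <{SCOVO}dataset> <{dataset}> ;",
        f"    <{SCOVO}dimension> {dims} .",
    ])


if __name__ == "__main__":
    print(scovo_item(
        "http://example.org/item/1",
        "http://example.org/dataset/hdi",
        ["http://example.org/dim/Bangladesh", "http://example.org/dim/2009"],
        0.543,
    ))
```
]]></preformat>
      <p>Giving each item a time dimension is one way to express the "range of time" aspect mentioned above.</p>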
    </sec>
    <sec id="sec-11">
      <title>6. ACKNOWLEDGMENTS</title>
      <p>We would like to thank Tim Berners-Lee, K. Krasnow Waterman,
Reed Stuyvesant, Ian Jacobi, Oshani Seneviratne, and everyone
else who participated in and organized MIT’s Linked Data week
in January of 2010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Berners-Lee</surname>
          </string-name>
          et al.,
          <article-title>Tabulator: Exploring and Analyzing linked data on the Semantic Web</article-title>
          ,
          <source>Proceedings of the 3rd International Semantic Web User Interaction Workshop (SWUI06)</source>
          , Athens, Georgia, 6 Nov
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>DBpedia - A Crystallization Point for the Web of Data</article-title>
          .
          <source>Journal of Web Semantics: Science, Services and Agents on the World Wide Web</source>
          , Volume
          <volume>7</volume>
          , Pages
          <fpage>154</fpage>
          -
          <lpage>165</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          (http://sw.deri.org/2008/07/n-quads/)
        </mixed-citation>
      </ref>
      <ref id="ref3b">
        <mixed-citation>
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halb</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raimond</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feigenbaum</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ayers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>SCOVO: Using Statistics on the Web of Data</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Viégas</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Ham</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriss</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>McKeon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Many Eyes: A Site for Visualization at Internet Scale</article-title>
          . Infovis,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>