<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Tag Clouds to Quickly Discover Patterns in Linked Data Sets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xingjian Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeff Heflin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Lehigh University 19 Memorial Drive West</institution>
          ,
          <addr-line>Bethlehem, PA 18015</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Casual users usually have knowledge gaps that prevent them from using a Knowledge Base (KB) effectively. This problem is exacerbated by KBs for linked data sets because they cover ontologies with diverse domains and the data is often incomplete with regard to the ontologies. We believe providing visual summaries of how instances use ontological terms (classes and properties) is a promising route to reveal patterns in the KB and quickly familiarize users with it. In this paper we propose a novel contextual tag cloud system that treats the ontological terms as tags and uses the font size of tags to reflect the number of instances related to the tags. As opposed to traditional tag clouds, which have a single view over all the data, our system has a dynamically generated set of tag clouds, each of which shows proportional relations to a context specified as a tag set of classes and properties. Furthermore, our tags have precise semantics, enabling inference over tags. We optimize the infrastructure to enable scalable online computation. We give several examples of discoveries made about DBPedia using our system.</p>
      </abstract>
      <kwd-group>
        <kwd>Tag Cloud Browsing</kwd>
        <kwd>Semantic Web Exploration</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Knowledge Discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>As the Semantic Web has evolved over the last decade, the amount of interlinked structured data has grown tremendously. While such a huge amount of information potentially enables many different powerful applications, we also notice that there are obstacles preventing people from taking full advantage of this interlinked web-scale knowledge base (KB). One of the challenges is how to present this huge KB to casual users and familiarize them with it so that they can quickly start building interesting queries and getting useful answers. The users’ unfamiliarity with the KB (which we refer to as their knowledge gap) arises from various aspects of both the ontology and the data.
– Knowing the distribution of data in the KB can also be important for a successful query. A query can be semantically correct but practically less helpful, simply because the coverage of the data is incomplete with regard to the ontology.</p>
      <p>All of these aspects are knowledge gaps that casual users typically have, though sometimes some of them might be less important. For example, if the KB covers a very focused domain with full-fledged data, then the second and third gaps are less relevant: terms in a specific, focused domain usually have clear meanings with little ambiguity, and if the data is complete, then presumably any semantically meaningful query will be productive. In this case, providing some simple descriptions of the KB and a full list of ontological terms should mostly resolve the gaps.</p>
      <p>However, this is not the case in the Linked Data world. Even taking an adequate subset of the linked data cloud, we can see widespread domains from different sources. Ambiguity then becomes a common issue: sometimes a word is used with different senses (e.g. “Bridge” may refer to a structure or a card game), and sometimes the same sense is refined within different domains (e.g. “Person” in a scientific ontology may just refer to “Scientist”). One way to help users understand these terms is to show the axioms related to each term; however, sometimes the axioms are missing or too complex to present in a user-friendly way. Another approach is to examine the instances related to each term. While looking into the related instances one by one provides the most detail, it is very time consuming, and there is a risk of being misled by coincidentally looking at some erroneous data. Instead, we think providing a summary of a “type” of data can be more efficient. Ideally, a type is more than the named classes in the ontologies: it is something defined by users on the fly. Showing summaries of customized types also helps users understand the distribution of the data, which helps them decide which terms to use and how to express the query. If we can properly define “types”, then simply providing the count of each type will reflect the patterns of co-occurrence of ontological terms in instances. From these patterns, a user can glean various information: the common patterns help users understand the terms and also understand which queries are selective and which queries have adequate data; the rare patterns can lead users to interesting facts or indicate possible errors in the data. By using the tag cloud paradigm to convey this information, we provide a more straightforward way of showing these patterns: a large tag suggests a common pattern with more instances of the type, while smaller tags indicate rarer relationships, which may be either very special facts or erroneous data. Such patterns can also be interesting to present to users even in a focused domain with complete data.</p>
      <p>We believe the idea of tag cloud summaries of customized types can really help with both KB exploration and query building. Since the types are defined on the fly, there is a trade-off between expressiveness and time cost. In this paper we formally define these types and the contextual tag cloud of a type (in Section 2), propose an efficient approach to dynamically compute the necessary statistics for the tag clouds (in Section 3), and implement a system that demonstrates our idea with examples (in Section 4). We discuss related work in Section 5 and conclude in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Tag Cloud for RDF Data</title>
      <p>In traditional tag cloud applications, every tag is a link to a set of related web pages that
are marked with that tag. A tag, in the Web 2.0 sense, is usually a folksonomy provided
by users that helps categorize the content of a web page. We can make analogies here
for RDF data. An instance is like a web page document, and is already tagged with
formal ontological classes, as opposed to folksonomies. In addition, we can also include
properties as tags. An instance that has one or more triples involving a property is
considered to have this property as one of its tags.</p>
      <p>Formally, consider a KB defined by S, a set of RDF statements. Each statement s ∈ S can be represented as a triple of subject, predicate and object, i.e. s = &lt;sub, pre, obj&gt;. By applying RDF entailment rules [3], we can get S′, a closure of S which completes S with the entailed statements. Using simple queries, we can also extract C, the set of classes; P, the set of properties; I, the set of instances; and L, the set of literals. Given an instance i ∈ I, there are two types of statements with regard to i in S: the ones with i as subject and the ones with i as object. From these statements we can extract the tags for instance i: all the classes and properties that describe i. The predicates in the statements where i is the object are recorded as inverse properties (denoted p⁻ if p is used in the statement) in order to distinguish them from the predicates in the statements where i is the subject. We define P⁺ as P extended with the inverse object properties.</p>
      <p>We further introduce the negation of tags. While a tag represents that an instance is described by a particular class or property, we use a negated tag to indicate that such a description is missing. This can be useful for inspecting what portions of the data are missing important properties, e.g., how many politicians are missing a political party. We considered three possible semantics for the negated tags: (1) classical negation: instances have the tag only if the negation of the corresponding concept is logically entailed; (2) negation-as-failure: instances have this tag if the system fails to infer the positive tag, i.e. it does not have the positive tag in S′; and (3) explicit negation: instances have this tag if they do not explicitly have the positive tag in S. Since classical negation cannot be used to find missing properties, and explicit negation could lead to confusing scenarios where an instance has a positive inferred tag and a corresponding explicit negation tag, we find negation-as-failure best fits our requirement. The negation of a class c ∈ C or a property p ∈ P⁺ is denoted ¬c or ¬p. The extended class and property sets with negations are Ĉ = C ∪ {¬c | c ∈ C} and P̂ = P⁺ ∪ {¬p | p ∈ P⁺}.</p>
      <p>Each instance i ∈ I is explicitly associated with a set of tags via a function Tags : I → 2^T, where T = Ĉ ∪ P̂ is the set of all possible tags of the KB.</p>
      <p>Tags(i) = {c | c ∈ C ∧ &lt;i, rdf:type, c&gt; ∈ S′} ∪ {p | p ∈ P ∧ ∃j : &lt;i, p, j&gt; ∈ S′} ∪ {p⁻ | p ∈ P ∧ ∃j : &lt;j, p, i&gt; ∈ S′} ∪ {¬c | c ∈ C ∧ &lt;i, rdf:type, c&gt; ∉ S′} ∪ {¬p | p ∈ P ∧ ∄j : &lt;i, p, j&gt; ∈ S′} ∪ {¬p⁻ | p ∈ P ∧ ∄j : &lt;j, p, i&gt; ∈ S′}</p>
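      <p>The tag-extraction rule above can be sketched in code. This is a minimal illustration over an in-memory triple list, assuming the closure S′ is precomputed; the trailing “-” encoding for inverse tags follows the system's display convention, while the “!” prefix for negated tags is our own illustrative choice:</p>

```python
def tags(i, closure, classes, properties):
    """Compute Tags(i): positive class/property/inverse-property tags,
    plus negation-as-failure tags for every tag not inferred."""
    t = set()
    for s, p, o in closure:  # closure = S', the set of entailed statements
        if s == i and p == "rdf:type" and o in classes:
            t.add(o)                       # class tag
        if s == i and p in properties:
            t.add(p)                       # property tag
        if o == i and p in properties:
            t.add(p + "-")                 # inverse-property tag, e.g. dbp:author-
    positives = classes | properties | {p + "-" for p in properties}
    t |= {"!" + x for x in positives - t}  # negation-as-failure
    return t
```

      <p>For example, with the single statement &lt;a, author, b&gt;, instance a receives the tag author and the negated tag !author-, while b receives author-.</p>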
      <p>In traditional tag cloud systems, tags are typically listed alphabetically, and the importance of each tag is represented by a different font size as a summary over all the web pages. In our scenario, the importance is the number of instances of the type determined by the tag. Currently, most popular tag cloud systems provide only a top-level tag cloud, and not a set of contextual tag clouds that depend on some selected tags. However, in the RDF data scenario, showing contextual tag clouds is very helpful, since combinations of ontological tags have more precise semantics than folksonomies and help reveal information about ontological terms and the distribution of the data.</p>
      <p>We define a contextual tag cloud, given a set of tags T0 as the context, as a list of tags [t1, …, tn] with font sizes [fs1, …, fsn] that reflect the instance counts of the types [T0 ∪ {t1}, …, T0 ∪ {tn}]. Formally, we define the process of computing the number of instances as a function count : 2^T → ℕ. Semantically, given a tag set T, count(T) = |{i | T ⊆ Tags(i)}|. Note that count(T ∪ {¬t}) = count(T) − count(T ∪ {t}). The font size for a tag t_i in the tag cloud is</p>
      <p>fs_i = (FS_MAX − FS_MIN) · (log count(T0 ∪ {t_i}) / log count(T0)) + FS_MIN</p>
      <p>where the maximum and minimum font sizes are denoted FS_MAX and FS_MIN. We use the log of the counts so that the tag cloud shows differences between tags in orders of magnitude.</p>
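      <p>A minimal sketch (ours, not the system's code) of count and the font-size formula over in-memory per-instance tag sets; the font-size bounds are illustrative:</p>

```python
import math

FS_MIN, FS_MAX = 10, 48  # hypothetical font-size bounds in pixels

def count(T, tag_sets):
    """count(T) = |{i | T is a subset of Tags(i)}|."""
    return sum(1 for tags in tag_sets if T <= tags)

def font_size(T0, t, tag_sets):
    """fs = (FS_MAX - FS_MIN) * log count(T0 + {t}) / log count(T0) + FS_MIN."""
    c_tag = count(T0 | {t}, tag_sets)
    c_ctx = count(T0, tag_sets)
    if c_tag == 0 or c_ctx <= 1:
        return 0  # tag not displayed (or context too small for the log ratio)
    return (FS_MAX - FS_MIN) * math.log(c_tag) / math.log(c_ctx) + FS_MIN
```

      <p>A tag whose type has as many instances as the context itself gets FS_MAX; a singleton type gets FS_MIN.</p>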
      <p>In addition to calling count to generate the contextual tag cloud, the implemented system also calls count(T0 ∖ {tx}) for every existing tag tx ∈ T0, to support the removal of tags. An example of the contextual tag cloud based on {dbp:Person} is shown in Fig. 1. We shall come back to this figure when we study use cases in Section 4.</p>
      <p>The initial tag cloud has context T = ∅, or semantically T = {owl:Thing}, and the tags in the cloud reflect the absolute numbers of instances related to each tag. There is no limit on the number of current tags |T|, so there can be up to 2^(|C|+|P⁺|) combinations (negations can be calculated by the subtraction equation); this is far too many to precompute, thus calculations must be performed in real time, making performance an important issue. In order to make this idea scale to a practical real-time online system, we need to work on two fronts: (1) decrease the time cost of each call to count, and (2) reduce the number of calls to count. We discuss this crucial choice of infrastructure in the following section.</p>
    </sec>
    <sec id="sec-3">
      <title>Infrastructure</title>
      <p>Our first attempt to build a system used an RDBMS, but due to the large number of queries needed, we could not achieve the desired performance. Therefore, we explored the use of Information Retrieval (IR) techniques, which are known to scale well. Treating instances as documents consisting of tags as terms, we can index all the instances, and thus count(T) can be computed by recording the total number of hits for a boolean IR keyword query “t1 AND t2 ...”¹. The problem in indexing these instances is that we must find all tags regarding a subject in order to create the virtual documents, but in order to scale we must do this with a minimal number of passes through the data. Our solution is as follows:</p>
      <p>1. Parse and Write the Big Triple File. The raw RDF files are parsed and written out as triples (one triple per line) with qualified URIs (prefix:local name), where the prefixes are automatically generated and the namespace mappings are recorded. During this process, if a triple from the raw file is parsed as &lt;s, p, o&gt; and p is an object property, then a flipped equivalent triple &lt;o, p⁻, s&gt; is also recorded to the output file, which we call the Big Triple File. Thus, semantically, if we select all the triples with pattern &lt;s, ?, ?&gt;, we get all the information about instance s. Note that by duplicating the object property statements, the output can have up to twice as many triples as the original.</p>
      <p>2. Sort the Big Triple File. The Big Triple File is then sorted with the Unix sort command. The output of this step is another big triple file of the same size, but organized such that all the triples &lt;s, ?, ?&gt; about each instance s are contiguous.</p>
      <p>3. Ontology Inference. The ontology is usually provided in the datasets as separate files. At this step, we apply RDF entailment rules only to the ontology triples, preparing for materialization in the following step. Specifically, we apply only the taxonomy entailments, i.e. the superclass and superproperty hierarchy inferences (rdfs5 and rdfs11), and the results are kept in memory.</p>
      <p>4. Index the Sorted File. The indexer then reads through the sorted file. After reading all the lines that start with an instance s, a virtual document is created with all the ontological terms as the document content, i.e. {t | ∃&lt;s, rdf:type, t&gt; ∨ ∃&lt;s, t, ?&gt;}. Based on the entailments from the previous step, the superclasses and superproperties of the explicit tags are added to each instance by entailment rules rdfs7 and rdfs9. Eventually we get the set Tags(s) as the virtual document to index. Note that in theory we could apply all entailment rules for materialization; in practice, however, we focus on rules that infer class and property information and thus generate new tags. We also find that the domain and range entailments (rdfs2 and rdfs3) often introduce wrong classifications due to the many erroneous statements (mostly flipped subjects/objects; we give an example in Scenario 3 in Section 4) among the explicit statements, so we decided not to support those entailments.</p>
      <p>¹ Negation can be constructed by specifying a term that MUST NOT occur in the boolean query.</p>
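      <p>A compressed sketch of steps 1, 2 and 4, using in-memory lists where the real system streams files; the helper names are ours, not the paper's:</p>

```python
from itertools import groupby

def flip_and_sort(raw_triples, object_properties):
    """Steps 1-2: append a flipped triple <o, p-, s> for every object-property
    triple, then sort so all triples about one subject are contiguous."""
    big = list(raw_triples)
    for s, p, o in raw_triples:
        if p in object_properties:
            big.append((o, p + "-", s))
    return sorted(big)

def virtual_documents(sorted_triples, superterms):
    """Step 4: one pass over the sorted triples; each subject's contiguous
    cluster becomes a virtual document of tags, with superclass/superproperty
    tags materialized (rdfs7/rdfs9)."""
    docs = {}
    for s, cluster in groupby(sorted_triples, key=lambda t: t[0]):
        tags = set()
        for _, p, o in cluster:
            explicit = o if p == "rdf:type" else p
            tags.add(explicit)
            tags |= superterms.get(explicit, set())  # materialize hierarchy
        docs[s] = tags
    return docs
```

      <p>The sort is what makes the single-pass grouping possible: groupby only merges adjacent lines, which is exactly what the sorted Big Triple File guarantees.</p>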
      <p>For comparison, we discuss how to optimize the DB approach. First, we did not consider holding the KB in a single triple table, since our task would then need expensive join operations over multiple selections on that giant table. The decomposed storage model [9][1] seems more promising: the triples are inserted into n two-column tables (much smaller), where n is the number of unique properties (including rdf:type) in the data. In each of these tables, the first column contains the subjects that define that property and the second column contains the object values for those subjects. However, joins over selections on tables are still inevitable for our task, so we made several simplifications and optimizations in order to minimize the cost of the DB approach. For each t ∈ C ∪ P⁺ we create a table with a single indexed column, id, an integer that represents each unique instance with tag t. Thus count(T) can be computed by joining the tables of the classes and properties in T. For faster table joins, we use a dictionary that maps all strings (either full URIs or their qualified names) to integer ids, and use the ids in the property tables. However, this requires maintenance of the dictionary, and frequent look-ups slow down the process. Given that the task is only to collect summary information, we do not have to record the real URIs; we only need to know whether an instance has appeared before when processing a new triple. So if we reuse the first three steps of the IR approach, we can simply assign an auto-incrementing integer id whenever the instance differs from the previous one while reading through the sorted file line by line. This id is then inserted into the tables corresponding to the tags in its cluster of triples. Similarly, the superclass and superproperty inference is materialized at this step. In practice, if the insertion is done line by line, much time is wasted in the overhead of DB operations. Thus, instead, we first generate a script file of insertions and run it as a batch.</p>
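      <p>The single-column-per-tag layout can be sketched with SQLite standing in for MySQL; the tag names and table contents here are made up for illustration:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One single-column table of integer instance ids per tag; the id column is
# the primary key, so it is indexed automatically.
data = {"Person": [1, 2, 3], "Artist": [2, 3], "author": [3]}
for tag, ids in data.items():
    conn.execute(f'CREATE TABLE "{tag}" (id INTEGER PRIMARY KEY)')
    conn.executemany(f'INSERT INTO "{tag}" VALUES (?)', [(i,) for i in ids])

def count(tags):
    """count(T) as a natural join over the per-tag id tables."""
    joined = " NATURAL JOIN ".join(f'"{t}"' for t in tags)
    return conn.execute(f"SELECT COUNT(*) FROM {joined}").fetchone()[0]
```

      <p>Since every table shares the single column id, a natural join is exactly the set intersection of the per-tag instance sets.</p>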
      <p>In order to evaluate our proposed approach, we choose DBPedia 3.6 [2] as our dataset. Specifically, we load two raw data files, i.e. Ontology Infobox Types and Ontology Infobox Properties, together with the ontology. Although DBPedia itself is not a multi-source interlinked dataset, we still believe it is a proper dataset for our preliminary experiment and prototype demo system because (1) it plays a central role in the Linked Data world; and (2) it has two important features of multi-source linked datasets: an ontology with broad scope and a large scale of data. However, in this paper we do not have particular procedures for owl:sameAs data, which means that to adapt the current work to multi-source datasets, we would need to either maintain an instance map before the aforementioned steps or merge the tags of such instances after those steps. This optimization requires further study as one of our major areas for future work.</p>
      <p>We use MySQL 5.0.21 and Lucene 3.3.0 as the underlying DB and search engine. In Fig. 2 we illustrate the process of both approaches for loading the DBPedia dataset and compare the timeline between them. There are in total 2,446,683 instances with 27,221,328 triples, including the flipped ones. Using the IR approach, most of the time is spent parsing and writing the file. However, comparing the DB and IR approaches, we find that the step of running the SQL script alone already takes more than the total time of the IR approach. To compare the time cost of count under both approaches, we conduct another experiment: run count({tx, ty}) for the comprehensive set of pairwise combinations tx, ty ∈ C ∪ P⁺. There are 1952 tags, and thus we call count 1,904,176 (= 1952 × 1951 / 2) times. It takes more than 2 hours if we issue that many queries to the DB, but less than 8 minutes if we search the index. This shows the great advantage of the IR approach over the DB approach, and thus in the rest of the paper we focus on optimizing the IR approach.</p>
      <p>The experiment on pairwise combinations can also be used to reduce the number of calls to count(T) in the online computation. Using offline computation, we can find a disjoint tag set dis(tx) = {ty | count({tx, ty}) = 0} for each tag tx, and then dis(T) = ∪_{tx ∈ T} dis(tx). We can improve the naive offline approach with pruning to speed up this preprocessing. For pruning, we apply two rules: (1) whenever tx ⊒ ty, compute count({t1, tx}) before count({t1, ty}); then count({t1, tx}) = 0 implies count({t1, ty}) = 0; and (2) ignore count({t1, t2}) if t1 ⊒ t2 ∨ t1 ⊑ t2, since such pairs are non-disjoint by semantics. We could also use owl:disjointWith axioms, but they do not exist in the DBPedia ontology. After pruning, the number of calls to count is reduced to 1,545,754 (saving 19%) and the time cost is reduced from 451 s to 387 s (saving 14%).</p>
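      <p>A sketch of the offline pass with both pruning rules, where count is any implementation of the counting function and subsumed maps a tag to the tags below it in the hierarchy (both assumed helpers):</p>

```python
def precompute_disjoint(tags, count, subsumed):
    """dis(tx) = {ty | count({tx, ty}) = 0}, with rule (2) skipping
    subsumption-related pairs and rule (1) propagating a zero count from a
    superterm to everything it subsumes without further calls to count."""
    dis = {t: set() for t in tags}
    for tx in tags:
        known_zero = set()  # rule (1): tags already implied disjoint from tx
        for ty in tags:
            if ty == tx:
                continue
            if ty in subsumed.get(tx, set()) or tx in subsumed.get(ty, set()):
                continue  # rule (2): related terms are never disjoint
            if ty in known_zero or count({tx, ty}) == 0:
                dis[tx].add(ty)
                known_zero |= subsumed.get(ty, set())
    return dis
```

      <p>Rule (1) pays off whenever a broad class is disjoint from tx: none of its subclasses then needs an index query at all.</p>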
      <p>With the precomputed dis(T), we can apply three pruning rules for the online computation of the tag cloud for any given context T: (p1) ignore t ∈ dis(T); (p2) ignore t ∈ super(T), where super(T) is the union of all the superclasses and superproperties of tags in T; and (p3) always compute count({t1, tx}) before count({t1, ty}) whenever tx ⊒ ty, and ignore ty if count({t1, tx}) = 0. To find out how well these pruning rules work, we simulate the real-world task by randomly generating 100 contexts T for each context size from 1 to 5. The results are shown in Table 1. Note that to generate the tag cloud of T, the naive approach must call count(T ∪ {tx}) for every candidate tx; with pruning, the calls to count and the time for those calls are saved for the pruned tags. We also look into how much each pruning rule benefits us, where the cost of the pruned calls is recorded as in the naive approach. As we can see, pruning saves around 93% of the calls, and most of the savings come from Rule (p1), which shows that the preprocessing enables a significant optimization of the online computation. We hypothesize that Rule (p2) may save more in other ontologies with deeper hierarchies, which we want to verify by experiments in future work. Looking into the details, we find that the time cost varies widely across different T's. Generally speaking, T's that contain frequent tags such as Person take longer than those that involve rare tags, which contributes to the large standard deviation. That also explains why in general the naive time increases as |T| increases, but this is not the case for |T| = 5. The average time for each call is very small, and we can expect thousands of calls within seconds.</p>
      <p>
We implement a prototype system with the IR-based infrastructure. The tags are shown as qualified names consisting of both an ontology prefix, which usually follows the conventions of the sources themselves, and the local name of the term. The sort order is alphabetical by local name first and prefix second, so that terms from different sources with similar syntactic forms (e.g. foaf:Person and dc:Person) tend to be clustered. Since the semantics of adding a class tag and adding a property tag are slightly different, we present two separate clouds: a class view and a property view. An example tag cloud is shown in Fig. 1, and illustrates some functionalities of our system. The window contains two parts: the status part and the cloud part. The status part always floats at the top of the window, even when scrolling down the page. It shows the current tags T = {owl:Thing, dbp:Person} (in this version, the user interface of our system does not support the negation feature) and the current count(T) = 363751. It also contains the control where users can switch between the class view and the property view. As shown in the figure, the tag cloud is currently the class view for T, where we can see that the two largest intersecting classes with T are Artist and Athlete (although Actor and SoccerPlayer are also large tags, they are subclasses of Artist and Athlete respectively). The tag cloud provides users a quick summary of the distribution of the data, and if the user hovers the mouse over one of the tags, the precise number is shown in the status count; in this example, the mouse is over Politician. Clicking a tag in the cloud adds it to T and shows a new contextual tag cloud. Similarly, a user can change the contextual tag cloud by clicking the remove button next to each current tag, and can see the count change by hovering over the remove button.</p>
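      <p>Rules (p1) and (p2) above amount to filtering the candidate list before any counting. A minimal sketch, with rule (p3)'s hierarchy-aware ordering omitted and dis and super_of assumed precomputed:</p>

```python
def candidate_tags(all_tags, T, dis, super_of):
    """Drop candidates disjoint with the context (p1) and superclasses or
    superproperties of context tags (p2) before calling count."""
    pruned = set()
    for t in T:
        pruned |= dis.get(t, set())       # (p1)
        pruned |= super_of.get(t, set())  # (p2)
    return [t for t in all_tags if t not in T and t not in pruned]
```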
      <p>One important extension to the bare tag cloud is to allow users to view the instances of the current type defined by T. Clicking the “View Instances” button in the status part takes the user to the Instance Browser, as illustrated in Fig. 3. In this example, T = {dbp:Artist, dbp:director-}. Each page shows 20 instances of the current type, and each instance is a link to the DBPedia instance page. We further take advantage of the IR approach and attach more information to provide more features. We attach the labels of instances while indexing so that the links are human readable. We also attach the file offset in the Sorted Big Triple File where the lines of this instance's triples start, so that the system can quickly display a list of all the triples in the KB regarding that instance. Different display styles distinguish URI resources from literals in the instance details. Meanwhile, we also provide shortcuts to quickly modify the current set of context tags. As in the Tag Cloud Browser page, we can remove tags. Also, while looking into the details of an instance, a user might find new interesting tags and add them by clicking the add button next to those tags. Hovering over the add or remove buttons shows the count changes.</p>
      <sec id="sec-3-1">
        <title>Use Cases</title>
        <p>With these two web pages, the Tag Cloud Browser and the Instance Browser, users can explore the KB and become familiar with it by trying out many different combinations. Our system provides a visual summary of the predominant patterns by way of tag clouds, and then lets users see specific examples using the Instance Browser. We now look into some scenarios that demonstrate the usefulness of our system. We start with an example that shows how our system helps users understand the ontology and the data distribution.</p>
        <p>Scenario 1: “find someone who is both a scientist and a writer” vs. “find someone who is both an actor and a writer”. Both queries seem very straightforward because there are classes named Scientist, Writer, and Actor. However, in fact there is no common instance between any two of the three classes. When presented with this odd result, most users will suspect a data quality issue and pursue alternatives. However, if the information need had more conditions, e.g. “find a Canadian who is both a scientist and a writer”, then instead of doubting the data, the user may believe that no such person exists in the real world. Aided by our system, the user can clearly see that there is no intersection for {Scientist, Writer} in the data, and may then start with alternative expressions rather than getting misled. For this scenario, there are two properties, dbp:author- and dbp:writer-, which suggest two ways of rephrasing the original queries. The difference between these two properties is not clear to us, since they are synonyms and, from the ontology, have the same domain and range. How do we choose which property to use? One choice is the one that will return more answers, because it is more likely that we can still get answers after adding more conditions to the query. In our dataset, dbp:author- has 4825 instances, and dbp:writer- has 13158 instances. So should we always prefer dbp:writer- over dbp:author-? With the help of our contextual tag cloud, we can see that this is not true: {dbp:Scientist, dbp:author-} has 167 instances while {dbp:Scientist, dbp:writer-} has only 27 instances; {dbp:Actor, dbp:author-} has 593 instances while {dbp:Actor, dbp:writer-} has 3654 instances. Thus we might choose different properties when building two similar queries.</p>
        <p>Scenario 2: Often the interesting instances are the rare ones. While in Scenario 1 we investigate the distribution and pay more attention to common patterns for use in building queries, in many other cases users exploring the KB may be curious about the rare patterns. Our system also allows users to invert tag sizes, so that tags with smaller intersections with the contextual tags appear larger, and are thus more “noticeable”. For example, the property view tag cloud of Politician with the mouse over doctoralStudent- is shown in Fig. 4. There are only 3 instances, and after examining the details, we find these are all politicians with doctoral degrees. In another example, from the cloud of Mayor, we find starring- to be a rare but interesting tag, and after examining the only 2 answers, we learn that one mayor was in a documentary film and the other was actually also an actor.</p>
        <p>Scenario 3: In other cases, rare tags sometimes suggest errors in the data. One common error, made by both humans and automatic programs, is mistakenly inverting the subject and object of an object property. Users can quickly decide which direction is right simply by taking a vote. For example, in the property cloud of Work, we find that both author and author- are non-empty, but clearly one of them must be wrong. Since author is a much more common tag than author-, the interpretation “has author” is favored over “is author of” in this dataset. The rare tags also help reveal another common error, i.e. coreference. For example, in the property cloud of Politician (shown in Fig. 4), choreographer- curiously yields 1 answer. This could be either some very special politician or an error in the data: it turns out there are two Jim Petersons, a politician and a figure skating coach, and an error in the dataset leads to the politician being described as a choreographer.</p>
        <p>In sum, we have found at least three categories of use cases for this system: (1) understanding the KB and discovering common patterns; (2) learning interesting facts; and (3) finding errors in the data. As a prototype system, we believe it can be further extended in any of these three directions into a useful real-world application.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>Early researchers used graph representations for browsing Semantic Web data,
believing it to be a natural choice, but Karger and schraefel [6] later pointed out that “big fat graphs”
are not the ideal representation for RDF data. Many recent systems, such as /facet [5],
gFacet [4] and BrowseRDF [8], use or extend the idea of faceted browsing: the user
constructs a selection query by adding constraints, and each newly added constraint
updates the interface to display further facet options based on the results of the current
selection query. The supported selection operations vary across systems. Most systems
support selection on property values and intersection of selections. BrowseRDF additionally supports
existential selection on properties, join selection (which is equivalent to property
composition), and all of these operations on inverse properties. Our current system is
similar to faceted browsing systems in that each new tag cloud is generated based on
the previously selected tags. In comparison, our system is less expressive than BrowseRDF,
since it supports only existential selection, inverse existential selection and intersection,
and does not support any operations on property values. However, while none of these three
papers stressed scalability, our system deliberately focuses on this limited expressiveness,
i.e., the existence of properties and classes, and, by optimizing the infrastructure,
provides more scalable performance over large datasets. We also want to emphasize that,
beyond the shared flavor of adding constraints on the fly and the comparison of expressiveness and
scalability, our system has a different purpose from faceted browsing systems: it
aims at revealing the patterns of co-occurrence of ontological terms and
familiarizing users with the KB, whereas faceted browsing systems mostly help users
find specific instances that meet some criteria. Another type of exploration tool
provides summaries of datasets; e.g., ExpLOD [7] provides a summary graph of class and
property usage for grouped instances, but such summary information is buried in
bracketed labels, making patterns less obvious.</p>
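      <p>To make the comparison concrete, the selection semantics our system shares with faceted browsers, i.e. existential tags intersected over a context, can be sketched as set operations. The instance extents below are toy data invented for this sketch, not the system's optimized backend:</p>

```python
# Toy extents: tag -> set of instances carrying that tag. Class and
# property tags are treated uniformly; a trailing "-" marks an inverse
# property. All names and instance ids here are illustrative.
extents = {
    "Politician":     {"a", "b", "c"},
    "Work":           {"d", "e"},
    "birthPlace":     {"a", "b", "d"},
    "choreographer-": {"c"},
}

def select(context, extents):
    """Instances matching every tag in the context (intersection of extents)."""
    if not context:
        return set().union(*extents.values())
    return set.intersection(*(extents[t] for t in context))

def contextual_cloud(context, extents):
    """Recount every tag over the selected subset; these counts would
    drive the font sizes of the next tag cloud."""
    subset = select(context, extents)
    return {t: len(ext.intersection(subset))
            for t, ext in extents.items() if ext.intersection(subset)}

print(contextual_cloud({"Politician"}, extents))
# {'Politician': 3, 'birthPlace': 2, 'choreographer-': 1}
```

      <p>Each added context tag shrinks the selected subset, so the next cloud shows only tags that still co-occur with the whole context, which is the shared flavor with faceted browsing noted above.</p>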
      <p>To visualize the patterns we apply tag cloud techniques. Traditional tag cloud
interfaces are mostly used for displaying the frequency or popularity of tags in
systems such as flickr4 and delicious5. While tags in those systems are folksonomies,
tags in our system are ontological terms, and thus have precise semantics that enable
inference over the tags. Most tag cloud systems provide only a top-level tag cloud. TagExplorer
[10] is similar to our contextual tag cloud idea, providing a dynamic tag cloud based
on the subset of instances selected by the previously selected tags as context. However, whereas
TagExplorer treats tags as values of facets (like predefined properties) and classifies
folksonomies automatically, our system provides more precise semantics by
considering the types defined in the RDF data.</p>
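      <p>The clouds render a tag's instance count as its font size. A common convention for this mapping, sketched below as an assumption rather than the system's actual formula, is logarithmic scaling, which keeps a few very large classes from dwarfing every other tag:</p>

```python
import math

def font_size(count, min_count, max_count, min_px=10, max_px=48):
    """Map a tag's instance count to a font size on a logarithmic scale,
    a common tag-cloud convention (the system's exact mapping may differ)."""
    if max_count == min_count:
        return (min_px + max_px) // 2
    lo, hi = math.log(min_count), math.log(max_count)
    frac = (math.log(count) - lo) / (hi - lo)
    return round(min_px + frac * (max_px - min_px))

# Illustrative counts; the extremes land on the smallest and largest sizes.
counts = {"author": 52000, "birthPlace": 1200, "choreographer-": 1}
sizes = {t: font_size(n, min(counts.values()), max(counts.values()))
         for t, n in counts.items()}
```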
      <fn-group>
        <fn id="fn4"><p>4 www.flickr.com</p></fn>
        <fn id="fn5"><p>5 www.delicious.com</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper we propose a new approach to familiarizing casual users with a KB built
on linked datasets: we treat classes and properties as tags with precise semantics and
instances as documents, and use contextual tag clouds to visualize the patterns of
co-occurrence of ontological terms in the instances specified by the context tags. From the
common patterns users can better understand the distribution of data in the KB, and
from the rare patterns users can find either interesting special facts or errors in the data.</p>
      <p>Our future work mainly lies in applying this technique to larger datasets with more
ontologies and data. We believe the approach can scale up to the whole Linked Open Data cloud,
but we recognize several potential challenges. First, the infrastructure may need
further optimization, new algorithms or even parallel systems for larger datasets.
Second, we have not yet developed specific algorithms for owl:sameAs statements,
which are critical for visualization across linked data sets. Finally, the interface might need
new schemes to display larger tag clouds without contributing to
information overload; for example, we could hide some subclass tags in the cloud and reveal
them only after the user adds one of their superclasses to the context.</p>
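      <p>One standard way to handle the owl:sameAs statements mentioned above, offered here as an assumption rather than an algorithm from this paper, is a union-find pass that merges co-referent URIs into canonical representatives before tags are counted:</p>

```python
def canonical_map(same_as_pairs):
    """Union-find over owl:sameAs pairs: map every URI to one canonical
    representative so co-referent instances are counted only once."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)
    return {x: find(x) for x in parent}

# Hypothetical URIs invented for this sketch.
canon = canonical_map([
    ("dbp:Jim_Peterson", "fb:jim_peterson"),
    ("fb:jim_peterson", "yago:JimPeterson"),
])
```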
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madden</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hollenbach</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Scalable semantic web data management using vertical partitioning</article-title>
          .
          <source>In: 33rd International Conference on Very Large Data Bases (VLDB'07)</source>
          . pp.
          <fpage>411</fpage>
          -
          <lpage>422</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          .
          <source>In: 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC'07/ASWC'07)</source>
          . pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <source>RDF semantics: W3C Recommendation 10 February 2004</source>
          (
          <year>2004</year>
          ), http://www.w3.org/TR/rdf-mt/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Heim</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lohmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>gFacet: A browser for the web of data</article-title>
          .
          <source>In: International Workshop on Interacting with Multimedia Content in the Social Semantic Web (IMCSSW'08)</source>
          . vol.
          <volume>417</volume>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hildebrand</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Ossenbruggen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hardman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>/facet: A browser for heterogeneous Semantic Web repositories</article-title>
          .
          <source>In: 5th International Semantic Web Conference (ISWC'06)</source>
          . vol.
          <volume>4273</volume>
          , pp.
          <fpage>272</fpage>
          -
          <lpage>285</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Karger</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          , schraefel, m.c.:
          <article-title>The pathetic fallacy of RDF</article-title>
          .
          <source>In: 3rd International Semantic Web User Interaction Workshop</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Khatchadourian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Consens</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>ExpLOD: Summary-based exploration of interlinking and RDF usage in the linked open data cloud</article-title>
          .
          <source>In: 7th Extended Semantic Web Conference (ESWC'10)</source>
          . pp.
          <fpage>272</fpage>
          -
          <lpage>287</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Oren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delbru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Extending faceted navigation for RDF data</article-title>
          .
          <source>In: 5th International Semantic Web Conference (ISWC'06)</source>
          . pp.
          <fpage>559</fpage>
          -
          <lpage>572</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>DLDB: Extending relational databases to support Semantic Web queries</article-title>
          .
          <source>In: Workshop on Practical and Scalable Semantic Web Systems, ISWC 2003</source>
          . pp.
          <fpage>109</fpage>
          -
          <lpage>113</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sigurbjörnsson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TagExplorer: Faceted browsing of flickr photos</article-title>
          .
          <source>Tech. Rep. YL-2010- 005</source>
          , Yahoo! Research (
          <year>August 2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>