<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Association Rule Mining System for Acquiring Knowledge of DBpedia from Wikipedia Categories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiseong Kim</string-name>
          <email>jiseong@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eun-Kyung Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yousung Won</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sangha Nam</string-name>
          <email>nam.sangha@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Key-Sun Choi</string-name>
          <email>kschoi@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KAIST, Computer Science Department</institution>
          ,
          <addr-line>Daejeon</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Wikipedia categories are a useful source of knowledge that is usually expressed in noun phrases containing information about the concepts of entities or the relations among entities. DBpedia KBs categorize their entities into Wikipedia categories using RDF triples. These RDF triples represent only the categories of entities, not the concepts of entities or the relations among entities, despite the fact that the expressions of Wikipedia categories contain a wealth of both types of information. In this regard, we propose a method that extracts RDF triples encoding concepts of entities or relations among entities from the RDF triples encoding the Wikipedia categories of each DBpedia entity, using association rule mining techniques that mainly utilize lexical patterns in category expressions and a hierarchy of categories. Our extensive experiments show that our approach can mine association rules of higher quality than those of state-of-the-art approaches to this problem.</p>
      </abstract>
      <kwd-group>
        <kwd>Association rule mining</kwd>
        <kwd>Wikipedia categories</kwd>
        <kwd>DBpedia enrichment</kwd>
        <kwd>Knowledge base enrichment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>DBpedia contains plentiful and well-organized knowledge about entities that
denote real-world entities like people, animals, and locations, or abstract
entities like mathematical concepts and scientific theories. DBpedia uses Wikipedia
categories to categorize its entities using RDF triples encoding which DBpedia
entities belong to which Wikipedia categories; we will call these category
triples. For example, the category triple ⟨dbpedia:John McCarthy, dcterms:subject,
category:American computer scientists⟩ means that the DBpedia entity John McCarthy
belongs to the Wikipedia category Category:American computer scientists.
Because the expressions of Wikipedia categories contain information about the concepts
of entities or the relations among entities, we can extract RDF triples encoding
those types of information from category triples; we will call these knowledge
triples. For example, we can extract the knowledge triple ⟨dbpedia:John
McCarthy, dbpedia-owl:occupation, computer scientist⟩ from the above-mentioned
category triple. This can be achieved by mining association rules of the
form {⟨x, belongTo, c⟩} ⇒ ⟨x, r, y⟩, which means that if an entity x belongs to
the category c, then the entities x and y are in the relation r. We will call these
kinds of rules C2K (category to knowledge) rules. Such rules can be mined
by existing association rule mining (ARM) systems like WARMR, ALEPH,
and AMIE. But these ARM systems use only the frequency information of existing
triples in knowledge bases (KBs) to mine rules, despite the fact that there are
rich available features like the lexical patterns of category expressions and the
dependencies among categories.</p>
      <p>In this paper, we propose an effective method to mine C2K rules by fully
utilizing lexical patterns in category expressions and a hierarchy among categories
as features. We compare our method with the state-of-the-art ARM system
AMIE, which shows outstanding performance on ontological KBs. The experiments
show that our method outperforms AMIE in the domain of mining C2K rules.
We also propose a simple confidence measure which is more appropriate for
mining C2K rules than the standard confidence measure.</p>
      <p>In Section 2 of this paper, we explore the existing state-of-the-art research
that handles the same or similar issues. In Sections 3 and 4, we
describe the preliminaries and our approach in greater detail, respectively.
In Section 5, we apply our approach to existing KBs and analyze the results in detail.
In the last section, Section 6, we conclude and state future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Knowledge Acquisition from Wikipedia Infoboxes</title>
        <p>
          In 2007, Auer et al. initiated the DBpedia project [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] that originally extracted knowledge about
entities from Wikipedia infoboxes and encoded it in RDF triples. After further
development, the project successfully extracted 18M triples from infoboxes.
However, the fact that only about 44.2% of articles have infoboxes means that
only a minor portion of articles is covered by the triples extracted from
infoboxes [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. On the other hand, Wikipedia categories cover about 80.6% of articles
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which means that categories are a rich source of knowledge worth
studying in depth.
        </p>
        <p>
          Knowledge Acquisition from Wikipedia Categories. There have been
a number of works related to acquiring knowledge from Wikipedia
categories. For instance, in 2007, Suchanek et al. developed YAGO [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ], a large
ontology derived from Wikipedia categories, infoboxes, and the taxonomic
relations of WordNet. YAGO primarily focuses on the concepts of entities, i.e., the is-a
relation. The authors specify relations (e.g., locatedIn) and the corresponding
lexical patterns in category expressions (e.g., Rivers in x) to extract RDF triples
encoding concepts of entities or relations among entities. In 2008, Liu et al.
suggested the approach Catriple [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] that analyzes the lexical patterns in category
expressions using NLP tools and uses them to extract knowledge from categories. They
enlarged their results by using the subsumption hierarchy contained in the Wikipedia
category network. In 2008, Nastase and Strube suggested a similar approach
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] that classifies each category into several classes based on how clearly it
expresses conceptual or relational information about entities. They analyze the lexical
patterns in each class of categories using NLP tools and the category network,
and then use them to extract knowledge from categories. In this paper, we mainly
focus on an association rule mining (ARM) system that mines C2K rules more
effectively than other state-of-the-art ARM systems, so we do not compare our
final predictions to the triples of the three above-mentioned approaches.
        </p>
        <p>
          Association Rule Mining Systems. Association rules are mined on a list
of transactions. A transaction is a set of items. For example, in the context of
sales analysis, an item is a product and a transaction is a set of products bought
together by a customer in a specific event. The mined association rules are of the
form {Milk, Diaper} ⇒ Beer, meaning that people who bought a bottle of milk
and a diaper usually also bought beer, which is partly because
working couples with children are too busy to go out to a bar. These association
rules can be used to discover knowledge about entities in KBs. For example,
we can mine rules of the form {⟨e1, r1, e2⟩, ⟨e2, r2, e3⟩} ⇒ ⟨e1, r3, e3⟩ where
each ei is an entity, which means that if ⟨e1, r1, e2⟩ and ⟨e2, r2, e3⟩ are in a
KB, ⟨e1, r3, e3⟩ is likely to exist in the KB as well. We can predict new triples using these
rules. In 2013, Galárraga et al. suggested the state-of-the-art ARM system AMIE [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
that mines association rules from KBs with higher scalability and result quality
than the previous ARM systems WARMR [7-9] and ALEPH¹. AMIE
achieves high performance by using a new confidence measure and an efficient
algorithm for mining rules in KBs. Despite its superior capability, AMIE has a
weakness in the more specific problem of mining C2K rules, which have the form
{⟨x, belongTo, c⟩} ⇒ ⟨x, r, y⟩, because it uses only the frequency information of
existing triples in KBs to mine rules. Wikipedia categories are usually expressed
as noun phrases and organized in a network, so there is plentiful lexical and
hierarchical information that we can further utilize to solve this problem. Our
approach uses these kinds of information to specialize in the domain of
mining C2K rules.
        </p>
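        <p>The market-basket setting described above can be sketched in a few lines; the transactions and the rule here are illustrative, not from the paper:</p>
        <p>```python
from itertools import combinations  # handy if you enumerate candidate itemsets

# Toy transactions in the market-basket sense described above.
transactions = [
    {"milk", "diaper", "beer"},
    {"milk", "diaper", "beer", "bread"},
    {"milk", "bread"},
    {"diaper", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

def confidence(body, head, transactions):
    """supp(body ∪ head) / supp(body): how often the head co-occurs with the body."""
    return support(body | head, transactions) / support(body, transactions)

print(confidence({"milk", "diaper"}, {"beer"}, transactions))
```</p>
        <p>Here the rule {milk, diaper} ⇒ beer holds in every transaction that contains the body, so its confidence is 1.0.</p>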
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Preliminaries</title>
      <p>DBpedia: The RDF Knowledge Base. In this paper, we focus on DBpedia,
one of the RDF knowledge bases (KBs). An RDF KB is a set of RDF triples that
encode relations between an entity and a literal (a string, an integer, a date, and
so on) or relations among entities. Each RDF triple has the form ⟨s, p, o⟩, with s
denoting the subject, which is one of the entities in the KB; p denoting the
predicate, which represents one of the relations pre-defined in the KB; and o
denoting the object, which is one of the entities in the KB or a literal.</p>
      <p>Wikipedia Categories in DBpedia. DBpedia contains RDF triples
encoding which DBpedia entities belong to which Wikipedia categories. We will call
these triples category triples. Each category triple has the form ⟨e, rcat, c⟩,
with e denoting one of the entities in the KB, rcat denoting the relation indicating that
the subject is categorized in the object, and c denoting one of the Wikipedia
categories.
¹ http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph_toc.html</p>
      <p>Types of Categories. There are different types of Wikipedia categories.
One type is the conceptual categories, which contain information about the class of an
entity (e.g., Albert Einstein is in the category Category:Naturalized citizens of
the United States). Another is the relational categories (like Category:1989
births), which contain relational information involving other entities or literals. There are
also some other types that merely indicate thematic vicinity (e.g., Category:Physics).
Of these, the conceptual categories and the relational categories are the main sources
of knowledge for our method to extract information about entities.</p>
      <p>Category Expression. Wikipedia categories are usually expressed as
noun phrases composed of nouns with some modifiers (prepositional
phrases, adjectives, and so on) and some special characters (e.g., hyphens (-),
commas (,), and so on). Figure 1 below shows examples of Wikipedia categories. We
can see that there is plentiful information about occupations, locations,
education, and so on. Our method analyzes the lexical patterns in category expressions
and uses them to extract knowledge.</p>
      <p>20th-century mathematicians
Philosophers who committed suicide
People of Rhode Island in the American Civil War
Alumni of King's College, Cambridge</p>
      <p>Category Hierarchy. Since Wikipedia categories are expressed more compactly,
with a few lexical elements, than a usual complete sentence, lexical
patterns are not sufficient to get high-quality results. In this regard, we use a
category hierarchy extracted from the Wikipedia category network to enhance
our results. The Wikipedia category network (Ncat) is a directed graph which
encodes various dependencies (subsumption relationships, semantic similarity,
thematic similarity, and so on) among categories. Our method mainly uses the
subsumption relationships between categories, which are partly contained in Ncat.
To reduce the errors introduced by dependencies representing non-subsumption
relationships among categories, we convert Ncat to a directed acyclic graph, which
we will call a category hierarchy (Hcat), by eliminating the cycles in Ncat.
Although Hcat is not a complete subsumption hierarchy, it is useful for our
method.</p>
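        <p>One way to realize this conversion is a depth-first traversal that drops cycle-closing (back) edges; this is a minimal sketch under that assumption, and the toy category graph is illustrative only:</p>
        <p>```python
# Turn a category network (directed, possibly cyclic) into a DAG Hcat by
# dropping edges that would close a cycle during a depth-first traversal.
def to_dag(edges):
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)
        graph.setdefault(child, [])
    kept, state = [], {}  # state: 0 = unvisited, 1 = on DFS stack, 2 = done
    def dfs(node):
        state[node] = 1
        for child in graph[node]:
            if state.get(child, 0) == 1:
                continue          # back edge: would close a cycle, so drop it
            kept.append((node, child))
            if state.get(child, 0) == 0:
                dfs(child)
        state[node] = 2
    for node in list(graph):
        if state.get(node, 0) == 0:
            dfs(node)
    return kept

edges = [("People", "Singers"), ("Singers", "Korean singers"),
         ("Korean singers", "People")]  # the last edge closes a cycle
print(to_dag(edges))  # the cycle-closing edge is removed
```</p>
        <p>Which edges survive depends on traversal order, so the result is a DAG but, as the paper notes for Hcat, not a complete subsumption hierarchy.</p>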
      <p>Association Rules and C2K Rules. An association rule consists of a body
and a head, where the body is a set of atoms and the head is a single atom; an atom
is an RDF triple that can have variables at the subject and/or object position.
For example, in the association rule {⟨x, r1, y⟩} ⇒ ⟨x, r2, y⟩, the set of the atom
{⟨x, r1, y⟩} is the body of the rule and the atom ⟨x, r2, y⟩ is its head.
In this paper, we focus only on C2K (category to knowledge) rules, whose
body is composed of one category triple with a subject-position variable and whose
head is composed of one knowledge triple, encoding a concept of entities or a
relation among entities, with a subject-position variable. In the next section,
we define our problem more formally.</p>
      <p>The C2K Problem. Let E denote a set of entities, R a set of relations,
L a set of literals, and C a set of categories of entities. A set
of n category triples can be represented as Tcat = {⟨ei, rcat, ci⟩ | i = 1, …, n} where ei ∈ E,
rcat ∈ R, and ci ∈ C. A set of m knowledge triples can be represented as
Tknow = {⟨ei, ri, oi⟩ | i = 1, …, m} where ei ∈ E, ri ∈ R \ {rcat}, and oi ∈ E ∪ L. An RDF
KB containing n category triples and m knowledge triples can be represented as
K = Tcat ∪ Tknow. The problem is to mine C2K rules of the form tcat^x ⇒ tfact^x
from K, where tcat^x = ⟨x, rcat, c ∈ C⟩, tfact^x = ⟨x, r ∈ R \ {rcat}, o ∈ E ∪ L⟩, and
x is a variable for an entity e ∈ E.</p>
      <p>Goal. The goal of our approach is to solve the C2K problem with the DBpedia
KB and Wikipedia categories in order to mine a wealth of high-quality C2K rules.</p>
    </sec>
    <sec id="sec-4">
      <title>The Proposed Approach</title>
      <p>For mining C2K rules, our method mainly uses the lexical patterns in category
expressions and a hierarchy of categories, i.e., Hcat, which is extracted from the
Wikipedia category network. In the first step of our approach, we discover the lexical
patterns of each relation using the existing category triples and knowledge triples
in KBs. In the second step, we mine C2K rules by applying the
discovered lexical patterns to the hierarchy of categories. In the last step,
we enlarge the initial KBs by predicting triples using the mined C2K rules.
We bootstrap the mined rules by repeating the three steps of our approach until
few further rules are mined. In the following, we describe our method in detail.</p>
      <sec id="sec-4-1">
        <title>The Mining Procedure</title>
        <p>In the mining procedure, there are two main conditions to be checked. Before
the conditions are checked, the prefixes like dbpedia:, dbpedia-owl:, and Category:
are removed. The first condition is as follows:
– If there are ⟨e, rcat, c⟩ ∈ Tcat and ⟨e, r, o⟩ ∈ Tknow such that c = o, then mine
the rule:</p>
        <p>⟨x, rcat, c⟩ ⇒ ⟨x, r, o⟩
where x is a variable for entities.</p>
        <p>For example, if there are the category triple ⟨John McCarthy, rcat,
Category:Computer scientists⟩ and the knowledge triple ⟨John McCarthy, occupation,
computer scientist⟩, then the condition is satisfied and the C2K rule relating
the two triples will be made (DBpedia prefixes like dbpedia: and dbpedia-owl:
are omitted for simplicity).</p>
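        <p>A minimal sketch of this first condition, assuming triples are stored as plain tuples with prefixes already removed; the strings are simplified stand-ins for the DBpedia data:</p>
        <p>```python
# First condition: if an entity's category name equals the object of one of
# its knowledge triples, emit the C2K rule (x, rcat, c) => (x, r, o).
T_cat = {("John McCarthy", "rcat", "Computer scientists")}
T_know = {("John McCarthy", "occupation", "Computer scientists")}

def mine_condition1(T_cat, T_know):
    rules = set()
    for (e, _, c) in T_cat:
        for (e2, r, o) in T_know:
            if e == e2 and c == o:
                # body: (x, rcat, c), head: (x, r, o), x a variable for entities
                rules.add((("x", "rcat", c), ("x", r, o)))
    return rules

print(mine_condition1(T_cat, T_know))
```</p>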
        <p>The second condition is as follows:
– If there are tknow = {⟨e, ri, oi⟩ | i = 1, …, n} ⊆ Tknow and tcat = ⟨e, rcat, c⟩ ∈ Tcat
such that all objects of the triples in tknow, i.e., {oi | i = 1, …, n}, are substrings of
c = (w1, o1, w2, o2, …, on, wn+1) and all words in {w2, …, wn} have non-zero
length, then follow the next steps:
1. Make a co-lexical pattern pco = (w1, xr1, w2, xr2, …, xrn, wn+1) for the
relations r1, r2, …, rn, where xri is a variable for an object of ri.
2. Make a set of candidate categories which will be compared with pco. We
define the set of candidate categories as follows:</p>
        <p>Ccandi = {c} ∪ siblings(c)
where siblings(x) is the set of siblings of a category x in Hcat.
3. If ccandi = (v1, v2, …, vm) ∈ Ccandi matches the extracted
co-lexical pattern pco, i.e., v1 = w1, v3 = w2, …, vm = wn+1, then mine the
rules:
⟨x, rcat, ccandi⟩ ⇒ ⟨x, r1, v2⟩
⟨x, rcat, ccandi⟩ ⇒ ⟨x, r2, v4⟩</p>
        <p>…
⟨x, rcat, ccandi⟩ ⇒ ⟨x, rn, vm−1⟩
where x is a variable for entities.</p>
        <p>For example, if there are the knowledge triples ⟨Sohyang, nationality, South
Korea⟩ and ⟨Sohyang, gender, female⟩ and the category triple ⟨Sohyang, rcat,
Category:female singers of South Korea⟩, the co-lexical pattern "x singers of y"
will be extracted (DBpedia prefixes like dbpedia: and dbpedia-owl: are omitted
for simplicity). Then our method propagates the pattern through siblings of the
category Category:female singers of South Korea, such as
Category:male singers of England. Finally, our method can obtain C2K rules relating not only the
category triples with the categories Category:female singers of South Korea and
Category:male singers of England, but also the knowledge triples representing
the nationality and gender relations.</p>
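        <p>The pattern extraction and sibling propagation in this example can be sketched with a regular expression; the category names are from the example above, while make_pattern is a hypothetical helper name, not part of the system:</p>
        <p>```python
import re

def make_pattern(category, objects):
    """Replace each object occurrence in the category name with a wildcard,
    yielding an anchored co-lexical pattern such as '(.+) singers of (.+)'."""
    escaped = sorted((re.escape(o) for o in objects), key=len, reverse=True)
    return re.compile("^" + re.sub("|".join(escaped), "(.+)", category) + "$")

# Built from the knowledge-triple objects "female" and "South Korea":
pattern = make_pattern("female singers of South Korea", ["female", "South Korea"])

# Propagate the pattern to a sibling category on Hcat:
match = pattern.match("male singers of England")
print(match.groups())  # captures the gender and nationality slots
```</p>
        <p>Each captured group becomes the object of one mined rule's head, paired with the relation whose object originally filled that slot.</p>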
        <p>Bootstrapping Mined Rules. An initial KB can be sparse, i.e., some
entities have many category triples but few knowledge triples. If the triples predicted
by the mined rules are used as a part of the knowledge triples, more rules can be mined
than previously. Based on this intuition, we bootstrap the mined rules
through an iterated bootstrapping process. Let Q be the set of rules mined from
a KB K, and let Q* denote the set of trustworthy rules, defined as the proportion of the
highest-confidence rules of Q given by the first parameter
of our method (the confidence measures will be introduced in the next section).
The n-th bootstrapped KB can be defined as Kn = Kn−1 ∪ pred(Q*n−1), where
pred(x) is the set of new triples predicted by the rules contained in a set x and Q*n−1
is the set of trustworthy rules mined from Kn−1. The overall iterated bootstrapping process
can be represented as follows:</p>
        <p>K0 → K1 → … → Kn
which stops when the proportional increase in mined rules, (|Qn| − |Qn−1|) / |Qn−1|,
falls below the second parameter of our method, a threshold on the proportion
of increase in mined rules. Q*n is the final output of our method.</p>
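        <p>The bootstrapping loop can be sketched as follows; mine_rules and predict are toy stand-ins for the mining and prediction procedures above, and keep_prop and growth_eps play the roles of the two parameters:</p>
        <p>```python
# Iterated bootstrapping: mine rules, keep the top proportion by confidence,
# predict new triples, and stop once the rule set grows by no more than a
# threshold. Rules are (confidence, head) pairs in this sketch.
def bootstrap(K, mine_rules, predict, keep_prop=0.1, growth_eps=0.1):
    rules = mine_rules(K)
    while True:
        ranked = sorted(rules, key=lambda r: r[0], reverse=True)
        trusted = ranked[: max(1, int(len(ranked) * keep_prop))]  # Q* of Q
        K = K | predict(trusted)               # K_n = K_{n-1} ∪ pred(Q*_{n-1})
        new_rules = mine_rules(K)
        growth = (len(new_rules) - len(rules)) / max(1, len(rules))
        rules = new_rules
        if growth_eps >= growth:               # few further rules mined: stop
            return rules

# Toy stand-ins: one rule per KB triple (capped), predictions add one fixed triple.
def mine_rules(K):
    return {(1.0, ("rule", i)) for i in range(min(len(K), 2))}

def predict(trusted):
    return {("x", "r", "y")}

final_rules = bootstrap({("a", "r", "b")}, mine_rules, predict)
print(len(final_rules))
```</p>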
      </sec>
      <sec id="sec-4-2">
        <title>Confidence Measures for C2K Rules</title>
        <p>Transactions. C2K rules are mined on a list of transactions. A transaction
is a set of triples in Tknow which share the same subject. More precisely, a
transaction with n items that share the same entity e can be represented as
Te = {⟨e, ri, oi⟩ | i = 1, …, n} ⊆ Tknow.</p>
        <p>Support. We define the support of a set of triples (sharing an entity e as their
subject) as follows:
supp({⟨e, ri, oi⟩ | i = 1, …, n}) = I({⟨e, ri, oi⟩ | i = 1, …, n} ⊆ K)
where I(x) is an indicator function whose value is 1 when the statement x holds,
and 0 otherwise.</p>
        <p>The shared subject of the triples can be a variable x for an entity. In this case, the
support is defined as follows:
supp({⟨x, ri, oi⟩ | i = 1, …, n}) = Σe∈E I({⟨e, ri, oi⟩ | i = 1, …, n} ⊆ K)</p>
        <p>Standard Confidence. The confidence of a rule indicates how trustworthy the
rule is. The standard confidence of a C2K rule can be defined as follows:
conf(tcat^x ⇒ tknow^x) = supp({tcat^x, tknow^x}) / supp({tcat^x})
where tcat^x = ⟨x, rcat, c ∈ C⟩, tknow^x = ⟨x, r ∈ R \ {rcat}, o ∈ E ∪ L⟩, and x is a
variable for an entity e ∈ E.</p>
        <p>Unnormalized Confidence. The standard confidence of a rule can be
abnormally low or high when the KB is sparse. In order
to discourage abnormal confidence values caused by the sparsity of KBs, we
calculate the confidence of a rule using only the numerator, as follows:
conf(tcat^x ⇒ tknow^x) = supp({tcat^x, tknow^x})
where tcat^x = ⟨x, rcat, c ∈ C⟩, tknow^x = ⟨x, r ∈ R \ {rcat}, o ∈ E ∪ L⟩, and x is a
variable for an entity e ∈ E.</p>
        <p>In the later experiments, we will show that our simple measure is more effective
for estimating the precision of predictions than the standard confidence measure.</p>
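        <p>The two measures can be computed over a toy KB as follows; the entities and triples are made up for illustration:</p>
        <p>```python
# Standard vs. unnormalized confidence for the C2K rule (x, rcat, c) => (x, r, o).
T_cat = {("e1", "rcat", "c"), ("e2", "rcat", "c"), ("e3", "rcat", "c")}
T_know = {("e1", "r", "o"), ("e2", "r", "o")}
K = T_cat | T_know
entities = {"e1", "e2", "e3"}

def supp(atoms, K, entities):
    """Count entities e for which every atom, with x instantiated to e, is in K."""
    return sum(
        1 for e in entities
        if all((e, p, o) in K for (_, p, o) in atoms)
    )

body = [("x", "rcat", "c")]
both = [("x", "rcat", "c"), ("x", "r", "o")]
standard_conf = supp(both, K, entities) / supp(body, K, entities)  # 2 / 3
unnormalized_conf = supp(both, K, entities)                        # 2
print(standard_conf, unnormalized_conf)
```</p>
        <p>All three entities carry the category, but only two carry the knowledge triple, so the standard confidence is 2/3 while the unnormalized confidence is simply the joint support, 2.</p>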
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>We conducted three groups of experiments. In the first group, we compare our
approach with AMIE, the state-of-the-art publicly available ARM system.
In the second group, we analyze the results of our approach more
deeply, mostly with regard to their amount and precision. In the last group,
we compare the standard confidence measure to the unnormalized
confidence measure introduced in this paper (Section 4).</p>
      <sec id="sec-5-1">
        <title>Dataset and Experimental Setting</title>
        <p>Knowledge Bases. We run our experiments on the DBpedia dataset, which contains
RDF triples extracted from various sources in Wikipedia. Among the entire set of triples
from various sources, we use only the category-driven triples as category triples (Tcat)
and the infobox-driven triples as knowledge triples (Tknow). We use English
DBpedia 2.0 as the input KB. Because the English dataset is too large to deal with,
we randomly choose 20% of the triples in English DBpedia 2.0 and use only this
sample, called Sample(Ken), in all experiments. Sample(Ken) contains
1,027,213 category triples and 2,527,466 knowledge triples. We extract the
category hierarchy Hcat from the Wikipedia category network and use it in all
experiments.</p>
        <p>Settings. Our method has two parameters to be predetermined (Section 4).
We set both the first and the second parameter to 0.1. AMIE
is configured to extract only rules of length 2 which have a category triple as the
body of a rule and a knowledge triple as the head of a rule. The other settings of
AMIE remain unchanged.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Results and Analysis</title>
        <p>Our Approach vs. AMIE. We run our approach and AMIE on the sampled
dataset of DBpedia 2.0, i.e., Sample(Ken), and mine rules using each method.
Our approach mines 150,762 rules in three iterations, while AMIE mines 6,959
rules. We predict new triples using the mined rules and check whether the new triples
are contained in the DBpedia 3.8 dataset. The number of triples contained in
DBpedia 3.8, namely the hit count, of each method is measured and compared.
Figure 2 shows the results. The left graph shows that our approach predicts
more triples that are contained in DBpedia 3.8 than AMIE does. This
tells us that our approach can mine more useful rules than AMIE, i.e., the
knowledge extracted by our method's rules is more probable in the real world
than that of AMIE. The right graph in Figure 2 shows the hit proportion of each
method, which is the hit count over the total number of new triples predicted
by each method. The hit proportion should be proportional to the quality of the
predictions. The right graph shows that our method has a higher hit proportion
than AMIE, which indicates that the knowledge extracted by the rules of
our method is of better quality than that of AMIE. Both graphs show that
our method outperforms AMIE in the domain of C2K rule mining, owing to the
various features appropriate for C2K rules, namely the lexical and hierarchical
features of categories. The detailed values of the results are shown in Table 1, whose rows
indicate the hit count over the number of new triples predicted by each
method with different used proportions of Tknow in Sample(Ken).</p>
      </sec>
      <sec id="sec-5-3">
        <title>The Amount and Precision of Results</title>
        <p>Our approach extracts a total of
150,762 rules. Of these, 125,936 rules are initially extracted in the first iteration,
and then they are bootstrapped to 150,762 rules over three iterations,
i.e., 19.71% of the rules are added by iterated bootstrapping. We successfully
predict 1,530,253 new triples from the 1,027,213 category triples of Sample(Ken)
using the mined rules. Figures 3 and 4 show examples of the new triples.</p>
        <p>The entire results can be downloaded from our website².</p>
        <p>
          We estimate the precision of the new triples predicted by the rules mined by our
approach. Since there is no computer-processable ground truth of suitable extent
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], we have to rely on manual evaluation. Two people manually evaluated 200
randomly selected samples of the new triples. If a new triple can be inferred
from its source category triple, it is regarded as correct; domain and range
are also checked. Only the samples accepted by both evaluators are regarded as true
positives.
² http://elvis.kaist.ac.kr/demos/iswc2015_workshop
        </p>
        <p>Table 2 shows the estimated precision of the predicted triples, which are extracted
from the category triples of Sample(Ken) using the rules mined by our approach.</p>
        <p>We sort the 200 samples by the confidence of the rule used to predict each of them,
divide them into 10 equal segments, and then estimate the precision of each
segment. The table shows that triples predicted by high-confidence rules tend
to also have high precision, which means that our confidence measure can be
used to estimate the precision of predictions. In the next subsection, we
further show the effectiveness of our measure by comparing it with the standard
confidence measure.</p>
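        <p>The segment-wise precision estimate can be sketched as follows; the (confidence, is_correct) pairs below are synthetic, not the paper's 200 evaluated samples:</p>
        <p>```python
# Sort predictions by the confidence of the rule that produced them, split
# into 10 equal segments, and compute the precision of each segment.
def segment_precision(samples, n_segments=10):
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)
    size = len(ranked) // n_segments
    precisions = []
    for i in range(n_segments):
        seg = ranked[i * size : (i + 1) * size]
        precisions.append(sum(1 for (_, ok) in seg if ok) / len(seg))
    return precisions

# Synthetic samples where a prediction is correct iff its confidence is high.
samples = [(conf, conf > 40) for conf in range(100)]
print(segment_precision(samples))
```</p>
        <p>An ideal measure yields precisions that decrease monotonically across the segments, which is the behavior the experiment checks for.</p>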
        <p>Categories | Subjects | Predicates | Objects | Conf.
Living people | Jennifer Blushi | dateOfDeath | living | 41185.0
English-language films | Larger Than Life (film) | language | English-language | 1727.0
People from New York City | Deborah Orin | birthPlace | New York City | 558.0
Harvard University alumni | William Warren Bartley | almaMater | Harvard University | 339.0
Liberal Party of Canada MPs | Jacob Thomas Schell | party | Liberal Party of Canada | 210.0
Queens Park Rangers F.C. players | Dean Wilkins | clubs | Queens Park Rangers F.C. | 66.0
PlayStation 2 games | Dynasty Tactics series | ports | PlayStation 2 | 0.0</p>
        <p>Categories | Subjects | Predicates | Objects | Conf.
People from Philadelphia | Tom Brandi | location | Philadelphia | 222.0
History of Texas | Wild Cat Bluff, Texas | placeOfBirth | Texas | 56.0
Athletes at the 1932 Summer Olympics | Babe Zaharias | medalsilverProperty | 1932 Summer Olympics | 24.0
Union College, New York alumni | Orson Spencer | placeOfDeath | New York | 3.0
National Conference Pro Bowl players | Alfred Jenkins | nationality | national | 0.0</p>
        <p>
Unnormalized Conf. vs. Standard Conf. We compare our measure, the
unnormalized confidence, with the standard confidence. Figure 5 shows the
precision of each segment under each measure. The figure shows that the unnormalized
confidence values are closer to the precision line of an ideal confidence
measure than those of the standard confidence, which tend to be abnormally high or
low in some segments. Table 2 above shows the detailed figures of the results. The
conclusion of the experiment is that our simple measure is more effective than
the standard measure for discriminating the importance of C2K rules.</p>
        <p>With our approach, DBpedia can be automatically enriched every time
new category triples come in. Table 3 shows the persistently growing number
of category triples in each version of DBpedia. The column |Tcat| indicates
the number of all category triples in each version of DBpedia. The column
New |Tcat| indicates the number of category triples which do not exist in the
immediately preceding version of DBpedia (only New |Tcat| of DBpedia 3.8 is the
number of category triples not in DBpedia 2.0). With the high-quality C2K rules
mined by our approach, we can persistently construct KBs beyond the current
DBpedia from the growing Wikipedia categories.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>Throughout this paper, we have proposed an effective method for mining C2K
rules. We have also proposed an effective confidence measure for discriminating the
importance of C2K rules extracted from sparse KBs.</p>
      <p>DBpedia | 3.8 | 3.9 | 2014
|Tcat| | 15,112,372 | 16,598,682 | 18,731,754
New |Tcat| | 12,580,437 | 2,693,767 | 3,513,358</p>
      <p>Table 3. The growing number of category triples in each version of DBpedia.</p>
      <p>Our approach has been
proven capable of mining a larger number of C2K rules, which can predict more
probable and accurate new knowledge from categories than the
state-of-the-art ARM system for ontological KBs, even though that system is designed to mine
more general rules than C2K rules. Our approach mainly utilizes the lexical and
hierarchical features of categories, which is the main reason for its superior performance.
Our idea of using these features might be applied to existing ARM systems
to enhance their capability. With our approach, it is possible to persistently and
automatically enrich DBpedia-like KBs whose entities are classified into several
human-readable categories organized in a network. In future research, we will
further enhance our system and distribute the extracted results publicly.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>This work was supported by an Institute for Information &amp; communications
Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No.
R0101-15-0054, WiseKB: Big data based self-evolving knowledge base and
reasoning platform).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>DBpedia: A nucleus for a web of open data</article-title>
          . pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          . Springer Berlin Heidelberg (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Yago: a core of semantic knowledge</article-title>
          .
          <source>In Proceedings of the 16th international conference on World Wide Web</source>
          , pp.
          <fpage>697</fpage>
          -
          <lpage>706</lpage>
          . ACM (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasneci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Yago: A large ontology from Wikipedia and WordNet</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          ),
          <fpage>203</fpage>
          -
          <lpage>217</lpage>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Catriple: Extracting triples from Wikipedia categories</article-title>
          .
          <source>In The Semantic Web</source>
          , pp.
          <fpage>330</fpage>
          -
          <lpage>344</lpage>
          . Springer Berlin Heidelberg (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nastase</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strube</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Decoding Wikipedia Categories for Knowledge Acquisition</article-title>
          .
          <source>In AAAI</source>
          , Vol.
          <volume>8</volume>
          , pp.
          <fpage>1219</fpage>
          -
          <lpage>1224</lpage>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Galárraga</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teflioudi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>AMIE: association rule mining under incomplete evidence in ontological knowledge bases</article-title>
          .
          <source>In Proceedings of the 22nd international conference on World Wide Web</source>
          , pp.
          <fpage>413</fpage>
          -
          <lpage>422</lpage>
          . International World Wide Web Conferences Steering Committee (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dehaspe</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Discovery of frequent datalog patterns</article-title>
          .
          <source>Data Mining and knowledge discovery</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <fpage>7</fpage>
          -
          <lpage>36</lpage>
          . (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Dehaspe</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Discovery of relational association rules</article-title>
          .
          <source>In Relational data mining</source>
          , pp.
          <fpage>189</fpage>
          -
          <lpage>212</lpage>
          . Springer Berlin Heidelberg (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Goethals</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van den Bussche</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Relational Association Rules: Getting Warmer</article-title>
          .
          <source>Pattern Detection and Discovery</source>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>159</lpage>
          . (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          :
          <article-title>Automatic acquisition of hyponyms from large text corpora</article-title>
          .
          <source>In Proceedings of the 14th conference on Computational linguistics - Volume 2</source>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>545</lpage>
          . Association for Computational Linguistics (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>