<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Conformation Rules for Linked Data Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
          <email>ngonga@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science University of Leipzig Johannisgasse 26</institution>
          ,
          <addr-line>04103 Leipzig</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Over the last years, manifold applications that consume, link and integrate Linked Data have been developed. Yet, the speci cation of integration processes for Linked Data is rendered increasingly tedious by several factors such as the great number of knowledge bases in the Linked Data Cloud as well as schema mismatches and heterogeneous conventions for property values across knowledge bases. Especially the speci cation of rules for transforming property values has been carried out mostly manually so far. In this paper, we present CaRLA, an algorithm that allows learning transformation rules for pairs of property values expressed as strings. We present both a batch and an active learning version of CaRLA. The batch version of CaRLA uses a three-step learning approach to retrieve probable transformation rules. The active learning version extends the batch version by requesting highly informative property value pairs from the user so as to improve the learning speed of the system. We evaluate both versions of our approach on four experiments with respect to runtime and accuracy. Our results show that we can improve the precision of the data integration process by up to 12% by discovering transformation rules with human accuracy even when provided with small training datasets. In addition, we can even discover rules that were missed by human experts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The Linked Open Data (LOD) Cloud consists of more than 30 billion triples1.
Making use of this large amount of domain knowledge within large-scale
semantic applications is currently gaining considerable momentum. For example, the
DeepQA framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] combines knowledge from DBpedia2, Freebase3 and
several other knowledge bases to determine the answer to questions with a speed
superior to that of human champions. Complex applications that rely on several
sources of knowledge usually integrate them into a uni ed view by the means of
1 http://www4.wiwiss.fu-berlin.de/lodcloud/state/
2 http://dbpedia.org
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 http://www.freebase.com</title>
      <p>
        the Extract-Transform-Load (ETL) paradigm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Yet, over the last few years,
the automatic provision of such uni ed views on Linked datasets has been
rendered increasingly tedious. The di culties behind the integration of Linked Data
are not only caused by the mere growth of the datasets in the Linked Data Web
but also by large disparity across these datasets.
      </p>
      <p>Linked Data Integration is commonly impeded by two categories of
mismatches: ontology mismatches and naming convention mismatches. The second
category of common mismatches (which is the focus of this work) mostly a ects
the transformation step of ETL and lies in the di erent conventions used for
equivalent property values. For example, the labels of lms in DBpedia di er
from the labels of lms in LinkedMDB4 in three ways: First, they contain a
language tag. Second, the extension \( lm)" is added to the label of movies if
another entity with the same label exists. Third, if another lm with the same
label exists, the production year of the lm is added. Consequently, the lm
Liberty from 1929 has the label \Liberty (1929 lm)@en" in DBpedia, while
the same lm bears the label \Liberty" in LinkedMDB. A similar discrepancy in
naming persons holds for lm directors and actors. Finding a conform
representation of the labels of movies that maps the LinkedMDB representation would
require knowing the rules replace(``@en'', ) and replace(``(*film)'',
) where stands for the empty string.</p>
      <p>In this paper, we address the problem of discovering transformation rules
by presenting CaRLA, the Canonical Representation Learning Algorithm. Our
approach learns canonical (also called conform) representation of data type
property values by implementing a simple, time-e cient and accurate learning
approach. We present two versions of CaRLA: a batch learning and an active
learning version. The batch learning approach relies on a training dataset to
derive rules that can be used to generate conform representations of property
values. The active version of CaRLA (aCarLa) extends CaRLA by computing
unsure rules and retrieving highly informative candidates for annotation that
allow the validation or negation of these candidates. One of the main advantages of
CaRLA is that it can be con gured to learn transformations at character, n-gram
or even word level. By these means, it can be used to improve integration and
link discovery processes based on string similarity/distance measures ranging
from character-based (edit distance) and n-gram-based (q-grams) to word-based
(Jaccard similarity) approaches.</p>
      <p>The rest of this paper is structured as follows: In Section 2, we give an
overview of the notation used in this paper. Thereafter, we present the two
di erent versions of the CaRLA algorithm: the batch version in Section 3 and
the active version in Section 4. Subsequently, we evaluate our approach with
respect to both runtime and accuracy in four experiments (Section 5). After a
brief overview of the related work (Section 6), we present some future work and
conclude in Section 7.</p>
    </sec>
    <sec id="sec-3">
      <title>4 http://linkedmdb.org/</title>
      <sec id="sec-3-1">
        <title>Preliminaries</title>
        <p>In the following, we de ne terms and notation necessary to formalize the
approach implemented by CaRLA. Let s 2 be a string from an alphabet .
We de ne a tokenization function as follows:</p>
        <sec id="sec-3-1-1">
          <title>De nition 1 (Tokenization Function). Given an alphabet A of tokens, a</title>
          <p>tokenization function token : ! 2A maps any string s 2 to a subset of
the token alphabet A.</p>
          <p>
            Note that string similarity and distances measures rely on a large number
of di erent tokenization approaches. For example, the Levenshtein similarity [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]
relies on a tokenization at character level, while the Jaccard similarity [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] relies
on a tokenization at word level.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>De nition 2 (Transformation Rule). A transformation rule is a function</title>
          <p>r : A ! A that maps a token from the alphabet A to another token of A.</p>
          <p>In the following we will denote transform rules by using an arrow notation.
For example, the mapping of the token \Alan" to \A." will be denoted by &lt;
\Alan" ! \A."&gt;. For any rule r =&lt; x ! y &gt;, we call x the premise and y the
consequence of r. We call a transformation rule trivial when it is of the form
&lt; x ! x &gt; with x 2 A. We call two transformation rules r and r0 inverse to
each other when r =&lt; x ! y &gt; and r0 =&lt; y ! x &gt;. Throughout this work, we
will assume that the characters that make up the tokens of A belong to [ f g,
where stands for the empty character. Note that we will consequently denote
deletions by rules of the form &lt; x ! &gt; where x 2 A.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>De nition 3 (Weighted Transformation Rule). Let be the set of all</title>
          <p>rules. Given a weight function w : ! R, a weighted transformation rule is
the pair (r; w(r)), where r 2 is a transformation rule.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>De nition 4 (Transformation Function). Given a set R of (weighted) trans</title>
          <p>formation rules and a string s, we call the function 'R : ! [f g a
transformation function when it maps s to a string 'R(s) by applying all rules
ri 2 R to every token of token(s) in an arbitrary order.</p>
          <p>For example, the set R = f&lt;\Alan"!\A."&gt;g of transformation rules would
lead to 'R(\James Alan Het eld") =\James A. Het eld".
3</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Batch Learning Approach</title>
        <p>The goal of CaRLA is two-fold: First, it aims to compute rules that allow the
derivation of conform representations of property values. As entities can have
several values for the same property, CaRLA also aims to detect a condition
under which two property values should be merged during the integration
process. In the following, we will assume that two source knowledge bases are to be
integrated to one. Note that our approach can be used for any number of source
knowledge bases.
3.1</p>
        <sec id="sec-3-2-1">
          <title>Overview</title>
          <p>Formally, CaRLA addresses the problem of nding the required transformation
rules by computing an equivalence relation E between pairs of property values
(p1; p2) that is such that E (p1; p2) holds when p1 and p2 should be mapped to
the same canonical representation p. CaRLA computes E by generating two sets
of weighted transformation function rules R1 and R2 such that for a given
similarity function , E (p1; p2) ! ('R1 (p1); 'R2 (p2)) , where is a similarity
threshold. The canonical representation p is then set to 'R1 (p1). The similarity
condition ('R1 (pR1 ); 'R2 (p2)) is used to distinguish between the pairs of
properties values that should be merged.</p>
          <p>To detect R1 and R2, CaRLA assumes two training datasets P and N , of
which N can be empty. The set P of positive training examples is composed
of pairs of property value pairs (p1; p2) such that E (p1; p2) holds. The set N
of negative training examples consists of pairs (p1; p2) such that E (p1; p2) does
not hold. In addition, CaRLA assumes being given a similarity function and a
corresponding tokenization function token. Given this input, CaRLA implements
a three-step approach. It begins by computing the two sets R1 and R2 of plausible
transformation rules based on the positive examples at hand (Step 1). Then it
merges inverse rules across R1 and R2 and discards rules with a low weight during
the rule merging and ltering step. From the resulting set of rules, CaRLA
derives the similarity condition E (p1; p2) ! ('R1 (p1); 'R2 (p2)) . It then
applies these rules to the negative examples in N and tests whether the similarity
condition also holds for the negative examples. If this is the case, then it discards
rules until it reaches a local minimum of its error function. The retrieved set of
rules and the novel value of constitute the output of CaRLA and can be used to
generate the canonical representation of the properties in the source knowledge
bases.</p>
          <p>In the following, we explain each of the three steps in more detail. Throughout
the explanation we use the toy example shown in Table 1. In addition, we will
assume a word-level tokenization function and the Jaccard similarity.
The goal of the rule generation set is to compute two sets of rules R1 resp. R2
that will underlie the transformation 'R1 resp. 'R2 . We begin by tokenizing all
positive property values pi and pj such that (pi; pj) 2 P . We call T1 the set of all
tokens pi such that (pi; pj) 2 P , while T2 stands for the set of all pj. We begin
the computation of R1 by extending the set of tokens of each pj 2 T2 by adding
to it. Thereafter, we compute the following rule score function score:
score(&lt; x ! y &gt;) = jf(pi; pj) 2 P : x 2 token(pi) ^ y 2 token(pj)gj:
(1)
score computes the number of co-occurrences of the tokens x and y across P .
All tokens x 2 T1 always have a maximal co-occurrence with as it occurs in
all tokens of T2. To ensure that we do not only compute deletions, we decrease
the score of rules &lt; x ! &gt; by a factor 2 [0; 1]. Moreover, in case of a
tie, we assume the rule &lt; x ! y &gt; to be more natural than &lt; x ! y0 &gt; if
(x; y) &gt; (x; y0). Given that is bound between 0 and 1, it is su cient to add
a fraction of (x; y) to each rule &lt; x ! y &gt; to ensure that the better rule is
chosen. Our nal score function is thus given by
scorefinal(&lt; x ! y &gt;) =
(score(&lt; x ! y &gt;) + (x; y)=2: if y 6= ;
score(&lt; x ! y &gt;) else.
(2)
Finally, for each token x 2 T1, we add the rule r =&lt; x ! y &gt; to R1 i x 6= y (i.e.,
r is not trivial) and y = arg max scorefinal(&lt; x ! y0 &gt;). To compute R2 we
y02T2
simply swap T1 and T2, invert P (i.e., compute the set f(pj; pi) : (pi; pj) 2 P g)
and run through the procedure described above.</p>
          <p>For the set P in our example, we get the following sets of rules: R1 =
f(&lt;\van"!\Van"&gt;, 2.08), (&lt;\T."! &gt;, 2)g and R2 = f(&lt;\Van"!\van"&gt;,
2.08), (&lt;\(actor)"! &gt;, 2)g.
3.3</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Rule Merging and Filtering</title>
          <p>The computation of R1 and R2 can lead to a large number of inverse or
improbable rules. In our example, R1 contains the rule &lt;\van"!\Van"&gt; while
R2 contains &lt;\Van"!\van"&gt;. Applying these rules to the data would
consequently not improve the convergence of their representations. To ensure that the
transformation rules lead to similar canonical forms, the rule merging step rst
discards all rules &lt; x ! y &gt;2 R2 that are such that &lt; y ! x &gt;2 R1 (i.e., rules
in R2 that are inverse to rules in R1). Then, low-weight rules are discarded. The
idea here is that if there is not enough evidence for a rule, it might just be a
random event. The initial similarity threshold for the similarity condition is
nally set to
=
In our example, CaRLA would discard &lt;\van"!\Van"&gt; from R2. When
assuming a threshold of 10% of P's size (i.e., 0.4), no rule would be ltered out.
The output of this step would consequently be R1 = f(&lt;\van"!\Van"&gt;, 2.08),
(&lt;\T."! &gt;, 2)g and R2 = f(&lt;\(actor)"! &gt;, 2)g.
3.4</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Rule Falsi cation</title>
          <p>The aim to the rule falsi cation step is to detect a set of transformations that
lead to a minimal number of elements of N having a similarity superior to via
. To achieve this goal, we follow a greedy approach that aims to minimize the
magnitude of the set
E = f(p1; p2) 2 N : ('R1 (p1); 'R2 (p2))
(4)
Our approach simply tries to discard all rules that apply to elements of E by
ascending score. If E is then empty, the approach terminates. If E does not get
smaller, then the change is rolled back and then the next rule is tried. Else, the
rule is discarded from the set of nal rules. Note that discarding a rule can alter
the value of and thus E. Once the set E has been computed, CaRLA concludes
its computation by generating a nal value of the threshold .</p>
          <p>In our example, two rules apply to the element of N . After discarding the
rule &lt;\T."! &gt;, the set E becomes empty, leading to the termination of the
rule falsi cation step. The nal set of rules are thus R1 = f&lt;\van"!\Van"&gt;g
and R2 = f&lt;\(actor)"! &gt;g. The value of is computed to be 0.75. Table 2
shows the canonical property values for our toy example. Note that this threshold
allows to discard the elements of N as being equivalent property values.
=
('R1 (p1); 'R2 (p2))g:</p>
          <p>
            It is noteworthy that by learning transformation rules, we also found an initial
threshold for determining the similarity of property values using as
similarity function. In combination with the canonical forms computed by CaRLA, the
con guration ( ; ) can be used as an initial con guration for Link Discovery
frameworks such as LIMES. For example, the smallest Jaccard similarity for the
pair of property values for our example lies by 1=3, leading to a precision of 0.71
for a recall of 1 (F-measure: 0.83). Yet, after the computation of the
transformation rules, we reach an F-measure of 1 with a threshold of 1. Consequently, the
pair ( , ) can be used for determining an initial classi er for approaches such
as the RAVEN [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] algorithm implemented in LIMES [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Extension to Active Learning</title>
        <p>
          One of the drawbacks of batch learning approaches is that they often require a
large number of examples to generate good models. As our evaluation shows (see
Section 5), this drawback also holds for the batch version of CaRLA, as it can
easily detect very common rules but sometimes fails to detect rules that apply
to less pairs of property values. In the following, we present how this problem
can be addressed by extending CaRLA to aCARLA using active learning [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
Algorithm 1 Overview of aCaRLA
Require: Positive examples P0
Require: Negative examples N0
Require: Similarity function
Require: Damping factor
Require: Score threshold smin
Require: Tokenization function token
Require: Maximal number of annotation requests qtotal
Require: Number of questions/iteration q
        </p>
        <p>Rule sets R1 ;, R2 ;
qcurrent := 0 //Current number of questions
Ex := 0 //Set of examples to annotate
t := 0 //Iteration counter
r := null //unsure rule
B := ; //set of banned rules
while qcurrent qtotal do
(R1, R2, ) = runCarla(Pt, Nt, , , Smin, token) // Run batch learning
r =getMostUnsureRule(R1 [ R2, B, smin)
if r 6= null then</p>
        <p>Ex = computeExamples(r, q)
else</p>
        <p>Ex = getMostDi erent(q)
end if
(P , N ) = requestAnnotations(Ex)
Pt+1 Pt [ P
Nt+1 Nt [ N
B updateBannedRules(B, r)
t t + 1
qcurrent qcurrent + jExj
end while
(R1, R2, ) := runCarla(Pt, Nt, , , Smin, token)
return (R1; R2; )</p>
        <p>An overview of aCaRLA is given in Algorithm 1. The basic idea here is to
begin with small training sets P0 and N0. In each iteration, all the available
training data is used by the batch version of CaRLA to update the set of rules.
The algorithm then tries to refute or validate rules with a score below the score
threshold smin (i.e., unsure rules). For this purpose it picks the most unsure
rule r that has not been shown to be erroneous in a previous iteration (i.e., that
is not an element of the set of banned rules B). It then fetches a set Ex of
property values that map the left side (i.e., the premise) of r. Should there be no
unsure rule, then Ex is set to the q property values that are most dissimilar to
the already known property values. Annotations consisting of the corresponding
values for the elements of Ex in the other source knowledge bases are requested
by the user and written in the set P . Property values with no corresponding
values are written in N . Finally the sets of positive and negative examples are
updated and the triple (R1, R2, ) is learned anew until a stopping condition
such as a maximal number of questions is reached. As our evaluation shows, this
simple extension of the CaRLA algorithm allows it to detect e ciently the pairs
of annotations that might lead to a larger set of high-quality rules.
5
5.1</p>
      </sec>
      <sec id="sec-3-4">
        <title>Evaluation</title>
        <sec id="sec-3-4-1">
          <title>Experimental Setup</title>
          <p>In the experiments reported in this section, we evaluated CaRLA by two means:
First we aimed to measure how well CaRLA could compute transformations
created by experts. To achieve this goal, we retrieved transformation rules from
four link speci cations de ned manually by experts within the LATC project5.
An overview of these speci cations is given in Table 3. Each link speci cation
aimed to compute owl:sameAs links between entities across two knowledge bases
by rst transforming their property values and by then computing the similarity
of the entities based on the similarity of their property values. For example, the
computation of links between lms in DBpedia and LinkedMDB was carried out
by rst applying the set of R1 = f&lt; (f ilm) ! &gt;g to the labels of lms in
DBpedia and R2 = f&lt; (director) ! &gt;g to the labels of their directors. We
ran both CaRLA and aCaRLA on the property values of the interlinked entities
and measured how fast CaRLA was able to reconstruct the set of rules that were
used during the Link Discovery process.</p>
          <p>In addition, we quanti ed the quality of the rules learned by CaRLA. In each
experiments, we computed the boost in the precision of the mapping of property
pairs with and without the rules derived by CaRLA. The initial precision was</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 http://latc-project.eu</title>
      <p>computed as jjMPjj , where M
= f(pi; pj ) : (pi; pj )
min(p1;p2)2P (p1; p2)g.</p>
      <p>The precision after applying CaRLA's results was computed as jP j where
jM0j
M 0 = f(pi; pj ) : ('R1 (pi); 'R2 (pj )) min ('R1 (p1); 'R2 (p2))g. Note
(p1;p2)2P
that in both cases, the recall was 1 given that 8(pi; pj ) 2 P : (pi; pj )
min (p1; p2). In all experiments, we used the Jaccard similarity metric and
(p1;p2)2P
a word tokenizer with = 0:8. All runs were carried on a notebook running
Windows 7 Enterprise with 3GB RAM and an Intel Dual Core 2.2GHz
processor. Each of the algorithms was ran ve times. We report the rules that were
discovered by the algorithms and the number of experiments within which they
were found.</p>
      <p>100%
80%
iin 60%
sco
reP 40%
20%
100%
80%
lseodh 60%
rhT 40%
20%
Baseline
CaRLA</p>
      <p>Baseline
CaRLA
0% Actors Directors Directors_clean Movies Movies_clean Producers
(a) Precision
0% Actors Directors Directors_clean Movies Movies_clean Producers
(b) Thresholds
Table 4 shows the union of the rules learned by the batch version of CaRLA in
all ve runs. Note that the computation of a rule set lasted under 0.5s even for
the largest dataset, Movies. The columns Pn give the probability of nding a rule
for a training set of size n in our experiments. R2 is not reported because it was
empty in all setups. Our results show that in all cases, CaRLA converges quickly
and learns rules that are equivalent to those utilized by the LATC experts with
a sample set of 5 pairs. Note that for each rule of the form &lt;\@en" ! y &gt; with
y 6= that we learned, the experts used the rule &lt; y ! &gt; while the linking
platform automatically removed the language tag. We experimented with the
same datasets without language tags and computed exactly the same rules as
those devised by the experts. In some experiments (such as Directors), CaRLA
was even able detect rules that where not included in the set of rules generated
by human experts. For example, the rule &lt;\( lmmaker)" ! \(director)"&gt; is not
very frequent and was thus overlooked by the experts. In Table 4, we marked
such rules with an asterisk. The director and the movies datasets contained a
large number of typographic errors of di erent sort (incl. misplaced hyphens,
character repetitions such as in the token \Neilll", etc.) which led to poor
precision scores in our experiments. We cleaned the rst 250 entries of these datasets
from these errors and obtained the results in the rows labels Directors clean
and Movies clean. The results of CaRLA on these datasets are also shown in
Table 4. We also measured the improvement in precision that resulted from
applying CaRLA to the datasets at hand (see Figure 1). For that the precision
remained constant across the di erent dataset sizes. In the best case (cleaned
Directors dataset), we are able to improve the precision of the property mapping
by 12.16%. Note that we can improve the precision of the mapping of property
values even on the noisy datasets.</p>
      <p>Experiment</p>
      <p>Actors
Directors</p>
      <p>Directors
Directors clean</p>
      <p>Movies
Movies</p>
      <p>Movies
Movies clean
Movies clean
Movies clean</p>
      <p>Producers</p>
      <p>Interestingly, when used on the Movies dataset with a training dataset size
of 100, our framework learned low-con dence rules such as &lt;\(1999" ! &gt;,
which were yet discarded due to a too low score. These are the cases where
aCaRLA displayed its superiority. Thanks to its ability to ask for annotation
when faced with unsure rules, aCaRLA is able to validate or negate unsure
rules. As the results on the Movies example show, aCaRLA is able to detect
several supplementary rules that were overlooked by human experts. Especially,
it clearly shows that deleting the year of creation of a movie can improve the
conformation process. aCaRLA is also able to generate a signi cantly larger
number of candidates rules for the user's convenience.
6</p>
      <sec id="sec-4-1">
        <title>Related Work</title>
        <p>
          Linked Data Integration is an important topic for all applications that rely on a
large number of knowledge bases and necessitate a uni ed view on this data, e.g.,
Question Answering frameworks [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and semantic mashups [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In recent work,
several challenges and requirements to Linked Data consumption and integration
have been pointed out [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Recently, several approaches and frameworks have
been developed with the aim of addressing many of these challenges. For example,
the R2R framework [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] allows the speci cation and publication of mappings
that allow mapping resources and literals across knowledge bases. The Linked
Data Integration Framework LDIF [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], whose goal is to support the integration
of RDF data, builds upon R2R mappings and technologies such as SILK [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
and LDSpider6. The concept behind the framework is to enable users to create
periodic integration jobs via simple XML con gurations. KnoFuss [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] addresses
data integration from the point of view of link discovery by monitoring the
interaction between instance and dataset matching (which is similar to ontology
matching [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]). Also worth mentioning id Semantic Web Pipes7 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which follows
the idea of Yahoo Pipes8 to enable the integration of data in formats such as RDF
and XML. In all of these systems, the con guration of the data transformations
has to be carried out manually.
        </p>
        <p>
          Some approaches to transformation rule learning approaches have previously
been proposed in the record linkage area. For example, [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] presents an
interactive approach to data cleaning while [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] developed an approach to learn string
transformation from examples. Yet, none of these approaches uses active
learning to improve the number of rules it detects. To the best of our knowledge,
CaRLA is the rst approach tailored towards Linked Data that allows the active
learning of data conformation rules for property values expressed as strings.
7
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Conclusion and Future Work</title>
        <p>In this work, we presented two algorithms for learning data transformations. We
present a batch learning approach that allows to detect rules by performing a
cooccurrence analysis. This version is particularly useful when the knowledge bases
to be integrated are already linked. We also present an active learning approach
6 http://code.google.com/p/ldspider/
7 http://pipes.deri.org/
8 http://pipes.yahoo.com/pipes/
that allows to discover a large number of rules e ciently. We evaluated our
approach with respect to the quality of the rules it discovers and to the e ect of
the rules on matching property values. We showed that the data transformation
achieved by our approach allows to increase the precision of property mapping
by up to 12% when the recall is set to 1. In future work, we aim to extend our
approach to learning transformation rules with several tokenizers concurrently.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Arvind</given-names>
            <surname>Arasu</surname>
          </string-name>
          , Surajit Chaudhuri, and
          <string-name>
            <given-names>Raghav</given-names>
            <surname>Kaushik</surname>
          </string-name>
          .
          <article-title>Learning string transformations from examples</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <volume>514</volume>
          {
          <fpage>525</fpage>
          ,
          <year>August 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Schultz</surname>
          </string-name>
          .
          <article-title>The R2R Framework: Publishing and Discovering Mappings on the Web</article-title>
          .
          <source>In Proceedings of COLD</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jero</surname>
          </string-name>
          <article-title>^me Euzenat and Pavel Shvaiko</article-title>
          . Ontology matching. Springer-Verlag, Heidelberg (DE),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>David</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ferrucci</surname>
            , Eric W. Brown, Jennifer Chu-Carroll,
            <given-names>James</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
            , David Gondek,
            <given-names>Aditya</given-names>
          </string-name>
          <string-name>
            <surname>Kalyanpur</surname>
            , Adam Lally,
            <given-names>J. William</given-names>
          </string-name>
          <string-name>
            <surname>Murdock</surname>
          </string-name>
          , Eric Nyberg, John M. Prager, Nico Schlaefer, and
          <string-name>
            <surname>Christopher</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Welty</surname>
          </string-name>
          .
          <article-title>Building watson: An overview of the deepqa project</article-title>
          .
          <source>AI Magazine</source>
          ,
          <volume>31</volume>
          (
          <issue>3</issue>
          ):
          <volume>59</volume>
          {
          <fpage>79</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Robert</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Learning Linkage Rules using Genetic Programming</article-title>
          .
          <source>In Sixth International Ontology Matching Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Paul</given-names>
            <surname>Jaccard</surname>
          </string-name>
          .
          <article-title>Etude comparative de la distribution orale dans une portion des Alpes et des Jura</article-title>
          .
          <source>Bulletin del la Societe Vaudoise des Sciences Naturelles</source>
          ,
          <volume>37</volume>
          :
          <fpage>547</fpage>
          {
          <fpage>579</fpage>
          ,
          <year>1901</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Ralph</given-names>
            <surname>Kimball</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joe</given-names>
            <surname>Caserta</surname>
          </string-name>
          .
          <article-title>The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning</article-title>
          , Conforming, and
          <string-name>
            <given-names>Delivering</given-names>
            <surname>Data</surname>
          </string-name>
          . Wiley,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vladimir</surname>
            <given-names>I. Levenshtein.</given-names>
          </string-name>
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>Technical Report 8</source>
          ,
          <year>1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Ian</given-names>
            <surname>Millard</surname>
          </string-name>
          , Hugh Glaser, Manuel Salvadores, and
          <string-name>
            <given-names>Nigel</given-names>
            <surname>Shadbolt</surname>
          </string-name>
          .
          <article-title>Consuming multiple linked data sources: Challenges and experiences</article-title>
          .
          <source>In First International Workshop on Consuming Linked Data</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Axel-Cyrille Ngonga Ngomo. A Time-E cient Hybrid</surname>
          </string-name>
          <article-title>Approach to Link Discovery</article-title>
          .
          <source>In Sixth International Ontology Matching Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Axel-Cyrille Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          , Jens Lehmann,
          <article-title>Soren Auer, and Konrad Ho ner. RAVEN { Active Learning of Link Speci cations</article-title>
          .
          <source>In Proceedings of OM@ISWC</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Andriy</surname>
            <given-names>Nikolov</given-names>
          </string-name>
          , Victoria Uren, Enrico Motta, and
          <string-name>
            <given-names>Anne</given-names>
            <surname>Roeck</surname>
          </string-name>
          .
          <article-title>Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution</article-title>
          .
          <source>In Proceedings of the 4th Asian Conference on The Semantic Web</source>
          , pages
          <volume>332</volume>
          {
          <fpage>346</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Danh Le</surname>
            <given-names>Phuoc</given-names>
          </string-name>
          , Axel Polleres, Manfred Hauswirth, Giovanni Tummarello, and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Morbidoni</surname>
          </string-name>
          .
          <article-title>Rapid prototyping of semantic mash-ups through semantic web pipes</article-title>
          .
          <source>In WWW</source>
          , pages
          <volume>581</volume>
          {
          <fpage>590</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Vijayshankar</given-names>
            <surname>Raman and Joseph M. Hellerstein</surname>
          </string-name>
          .
          <article-title>Potter's wheel: An interactive data cleaning system</article-title>
          .
          <source>In VLDB</source>
          , pages
          <volume>381</volume>
          {
          <fpage>390</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Andreas</surname>
            <given-names>Schultz</given-names>
          </string-name>
          , Andrea Matteini, Robert Isele,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Becker</surname>
          </string-name>
          .
          <article-title>Ldif - linked data integration framework</article-title>
          .
          <source>In COLD</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Burr</given-names>
            <surname>Settles</surname>
          </string-name>
          .
          <article-title>Active learning literature survey</article-title>
          .
          <source>Technical Report 1648</source>
          , University of Wisconsin-Madison,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>