Learning Conformation Rules for Linked Data Integration. CEUR Workshop Proceedings Vol-946. https://ceur-ws.org/Vol-946/om2012_Tpaper2.pdf
    Learning Conformation Rules for Linked Data
                    Integration

                          Axel-Cyrille Ngonga Ngomo1

                   Department of Computer Science
                          University of Leipzig
                    Johannisgasse 26, 04103 Leipzig
                 ngonga@informatik.uni-leipzig.de,
          WWW home page: http://bis.uni-leipzig.de/AxelNgonga



      Abstract. Over the last few years, manifold applications that consume, link
      and integrate Linked Data have been developed. Yet, the specification
      of integration processes for Linked Data is rendered increasingly tedious
      by several factors such as the great number of knowledge bases in the
      Linked Data Cloud as well as schema mismatches and heterogeneous
      conventions for property values across knowledge bases. Especially the
      specification of rules for transforming property values has been carried
      out mostly manually so far. In this paper, we present CaRLA, an al-
      gorithm that allows learning transformation rules for pairs of property
      values expressed as strings. We present both a batch and an active learn-
      ing version of CaRLA. The batch version of CaRLA uses a three-step
      learning approach to retrieve probable transformation rules. The active
      learning version extends the batch version by requesting highly infor-
      mative property value pairs from the user so as to improve the learning
      speed of the system. We evaluate both versions of our approach on four
      experiments with respect to runtime and accuracy. Our results show that
      we can improve the precision of the data integration process by up to
      12% by discovering transformation rules with human-level accuracy, even
      when provided with small training datasets. In addition, we can even discover
      rules that were missed by human experts.


1    Introduction

The Linked Open Data (LOD) Cloud consists of more than 30 billion triples1 .
Making use of this large amount of domain knowledge within large-scale seman-
tic applications is currently gaining considerable momentum. For example, the
DeepQA framework [4] combines knowledge from DBpedia2 , Freebase3 and sev-
eral other knowledge bases to determine the answer to questions with a speed
superior to that of human champions. Complex applications that rely on several
sources of knowledge usually integrate them into a unified view by the means of
1
  http://www4.wiwiss.fu-berlin.de/lodcloud/state/
2
  http://dbpedia.org
3
  http://www.freebase.com
the Extract-Transform-Load (ETL) paradigm [7]. Yet, over the last few years,
the automatic provision of such unified views on Linked datasets has been ren-
dered increasingly tedious. The difficulties behind the integration of Linked Data
are not only caused by the mere growth of the datasets in the Linked Data Web
but also by the large disparity across these datasets.
    Linked Data Integration is commonly impeded by two categories of mis-
matches: ontology mismatches and naming convention mismatches. The second
category of common mismatches (which is the focus of this work) mostly affects
the transformation step of ETL and lies in the different conventions used for
equivalent property values. For example, the labels of films in DBpedia differ
from the labels of films in LinkedMDB4 in three ways: First, they contain a
language tag. Second, the extension “(film)” is added to the label of movies if
another entity with the same label exists. Third, if another film with the same
label exists, the production year of the film is added. Consequently, the film
Liberty from 1929 has the label “Liberty (1929 film)@en” in DBpedia, while
the same film bears the label “Liberty” in LinkedMDB. A similar discrepancy in
the naming of persons holds for film directors and actors. Finding a conform
representation of the labels of movies that matches the LinkedMDB representation
would require knowing the rules replace(“@en”, ε) and replace(“(*film)”, ε),
where ε stands for the empty string.
    In this paper, we address the problem of discovering transformation rules
by presenting CaRLA, the Canonical Representation Learning Algorithm. Our
approach learns canonical (also called conform) representation of data type prop-
erty values by implementing a simple, time-efficient and accurate learning ap-
proach. We present two versions of CaRLA: a batch learning and an active
learning version. The batch learning approach relies on a training dataset to
derive rules that can be used to generate conform representations of property
values. The active version of CaRLA (aCaRLA) extends CaRLA by computing
unsure rules and retrieving highly informative candidates for annotation that
allow the validation or negation of these rules. One of the main advantages of
CaRLA is that it can be configured to learn transformations at character, n-gram
or even word level. By these means, it can be used to improve integration and
link discovery processes based on string similarity/distance measures ranging
from character-based (edit distance) and n-gram-based (q-grams) to word-based
(Jaccard similarity) approaches.
    The rest of this paper is structured as follows: In Section 2, we give an
overview of the notation used in this paper. Thereafter, we present the two
different versions of the CaRLA algorithm: the batch version in Section 3 and
the active version in Section 4. Subsequently, we evaluate our approach with
respect to both runtime and accuracy in four experiments (Section 5). After a
brief overview of the related work (Section 6), we present some future work and
conclude in Section 7.


4
    http://linkedmdb.org/
2   Preliminaries
In the following, we define terms and notation necessary to formalize the ap-
proach implemented by CaRLA. Let s ∈ Σ ∗ be a string from an alphabet Σ.
We define a tokenization function as follows:
Definition 1 (Tokenization Function). Given an alphabet A of tokens, a
tokenization function token : Σ ∗ → 2A maps any string s ∈ Σ ∗ to a subset of
the token alphabet A.
    Note that string similarity and distance measures rely on a large number
of different tokenization approaches. For example, the Levenshtein similarity [8]
relies on a tokenization at character level, while the Jaccard similarity [6] relies
on a tokenization at word level.
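To make these notions concrete, the following sketch (in Python; the function names are ours and not part of CaRLA) shows a character-level and a word-level tokenization function, together with the Jaccard similarity computed on top of them:

```python
# Tokenization functions as in Definition 1: each maps a string to a
# subset of a token alphabet A. The choice of tokenizer determines the
# granularity at which transformation rules are later learned.

def char_tokenizer(s: str) -> set[str]:
    """Character-level tokenization (as used by edit-distance-like measures)."""
    return set(s)

def word_tokenizer(s: str) -> set[str]:
    """Word-level tokenization (as used by the Jaccard similarity)."""
    return set(s.split())

def jaccard(s1: str, s2: str, token=word_tokenizer) -> float:
    """Jaccard similarity of two strings under a given tokenization."""
    t1, t2 = token(s1), token(s2)
    return len(t1 & t2) / len(t1 | t2)
```

For instance, jaccard(“Alain Delon”, “Alain Delon (actor)”) yields 2/3, since the two token sets share two of three distinct word tokens.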
Definition 2 (Transformation Rule). A transformation rule is a function
r : A → A that maps a token from the alphabet A to another token of A.
    In the following, we will denote transformation rules by using an arrow
notation. For example, the mapping of the token “Alan” to “A.” will be denoted
by <“Alan” → “A.”>. For any rule r = <x → y>, we call x the premise and y the
consequence of r. We call a transformation rule trivial when it is of the form
<x → x> with x ∈ A. We call two transformation rules r and r′ inverse to
each other when r = <x → y> and r′ = <y → x>. Throughout this work, we
will assume that the characters that make up the tokens of A belong to Σ ∪ {ε},
where ε stands for the empty character. Note that we will consequently denote
deletions by rules of the form <x → ε>, where x ∈ A.
Definition 3 (Weighted Transformation Rule). Let Γ be the set of all
rules. Given a weight function w : Γ → R, a weighted transformation rule is
the pair (r, w(r)), where r ∈ Γ is a transformation rule.
Definition 4 (Transformation Function). Given a set R of (weighted) trans-
formation rules and a string s, we call the function ϕR : Σ ∗ → Σ ∗ ∪ {ε} a
transformation function when it maps s to a string ϕR (s) by applying all rules
ri ∈ R to every token of token(s) in an arbitrary order.
    For example, the set R = {<“Alan”→“A.”>} of transformation rules would
lead to ϕR (“James Alan Hetfield”) =“James A. Hetfield”.
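A minimal sketch of such a transformation function, assuming word-level tokenization and representing a rule set R as a Python dict (our own encoding: the empty string stands for ε, i.e. a deletion):

```python
def apply_rules(s: str, R: dict[str, str]) -> str:
    """Transformation function phi_R: apply every rule <x -> y> in R to
    the word tokens of s; tokens mapped to '' (epsilon) are deleted."""
    transformed = (R.get(token, token) for token in s.split())
    return " ".join(t for t in transformed if t != "")

R = {"Alan": "A."}
apply_rules("James Alan Hetfield", R)  # "James A. Hetfield"
```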


3   Batch Learning Approach
The goal of CaRLA is two-fold: First, it aims to compute rules that allow the
derivation of conform representations of property values. As entities can have
several values for the same property, CaRLA also aims to detect a condition
under which two property values should be merged during the integration pro-
cess. In the following, we will assume that two source knowledge bases are to be
integrated to one. Note that our approach can be used for any number of source
knowledge bases.
3.1   Overview


Formally, CaRLA addresses the problem of finding the required transformation
rules by computing an equivalence relation E between pairs of property values
(p1 , p2 ) such that E(p1 , p2 ) holds when p1 and p2 should be mapped to
the same canonical representation p. CaRLA computes E by generating two sets
of weighted transformation rules R1 and R2 such that for a given sim-
ilarity function σ, E(p1 , p2 ) → σ(ϕR1 (p1 ), ϕR2 (p2 )) ≥ θ, where θ is a similarity
threshold. The canonical representation p is then set to ϕR1 (p1 ). The similarity
condition σ(ϕR1 (p1 ), ϕR2 (p2 )) ≥ θ is used to distinguish between the pairs of
property values that should be merged.
    To detect R1 and R2 , CaRLA assumes two training datasets P and N , of
which N can be empty. The set P of positive training examples is composed
of property value pairs (p1 , p2 ) such that E(p1 , p2 ) holds. The set N
of negative training examples consists of pairs (p1 , p2 ) such that E(p1 , p2 ) does
not hold. In addition, CaRLA assumes being given a similarity function σ and a
corresponding tokenization function token. Given this input, CaRLA implements
a three-step approach. It begins by computing the two sets R1 and R2 of plausible
transformation rules based on the positive examples at hand (Step 1). Then it
merges inverse rules across R1 and R2 and discards rules with a low weight during
the rule merging and filtering step. From the resulting set of rules, CaRLA
derives the similarity condition E(p1 , p2 ) → σ(ϕR1 (p1 ), ϕR2 (p2 )) ≥ θ. It then
applies these rules to the negative examples in N and tests whether the similarity
condition also holds for the negative examples. If this is the case, then it discards
rules until it reaches a local minimum of its error function. The retrieved set of
rules and the novel value of θ constitute the output of CaRLA and can be used to
generate the canonical representation of the properties in the source knowledge
bases.
   In the following, we explain each of the three steps in more detail. Throughout
the explanation we use the toy example shown in Table 1. In addition, we will
assume a word-level tokenization function and the Jaccard similarity.



          Type     Property value 1             Property value 2
           ⊕      “Jean van Damne”         “Jean Van Damne (actor)”
           ⊕ “Thomas T. van Nguyen” “Thomas Van Nguyen (actor)”
           ⊕        “Alain Delon”             “Alain Delon (actor)”
           ⊕       “Alain Delon Jr.”        “Alain Delon Jr. (actor)”
           ⊖     “Claude T. Francois”       “Claude Francois (actor)”
Table 1. Toy example dataset. The positive examples are of type ⊕ and the negative
of type ⊖.
3.2    Rule Generation
The goal of the rule generation step is to compute two sets of rules, R1 and R2 ,
that will underlie the transformations ϕR1 and ϕR2 , respectively. We begin by tokenizing all
positive property values pi and pj such that (pi , pj ) ∈ P . We call T1 the set of all
tokens pi such that (pi , pj ) ∈ P , while T2 stands for the set of all pj . We begin
the computation of R1 by extending the set of tokens of each pj ∈ T2 by adding
ε to it. Thereafter, we compute the following rule score function score:

      score(< x → y >) = |{(pi , pj ) ∈ P : x ∈ token(pi ) ∧ y ∈ token(pj )}|.     (1)

score computes the number of co-occurrences of the tokens x and y across P .
All tokens x ∈ T1 always have a maximal co-occurrence with ε, as ε occurs in
the extended token set of every element of T2 . To ensure that we do not only
compute deletions, we decrease the score of rules <x → ε> by a factor κ ∈ [0, 1].
Moreover, in case of a tie, we assume the rule <x → y> to be more natural than
<x → y′> if σ(x, y) > σ(x, y′). Given that σ is bounded between 0 and 1, it is
sufficient to add a fraction of σ(x, y) to each rule <x → y> to ensure that the
better rule is chosen. Our final score function is thus given by
      score_final (<x → y>) = score(<x → y>) + σ(x, y)/2   if y ≠ ε,
      score_final (<x → y>) = κ × score(<x → y>)            otherwise.            (2)

Finally, for each token x ∈ T1 , we add the rule r = <x → y> to R1 iff x ≠ y (i.e.,
r is not trivial) and y = arg max_{y′ ∈ T2} score_final (<x → y′>). To compute
R2 , we simply swap T1 and T2 , invert P (i.e., compute the set {(pj , pi ) : (pi , pj ) ∈ P })
and run through the procedure described above.
   For the set P in our example, we get the following sets of rules: R1 =
{(<“van”→“Van”>, 2.08), (<“T.”→ ε>, 2)} and R2 = {(<“Van”→“van”>,
2.08), (<“(actor)”→ ε>, 2)}.
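The rule generation step can be sketched as follows. This is a simplified, one-directional reconstruction under our own assumptions: word-level tokenization, ε encoded as the empty string, and a character-set Jaccard used as the tie-breaking σ on tokens; CaRLA's actual implementation may differ in these details.

```python
from collections import Counter

def char_jaccard(x: str, y: str) -> float:
    """Character-set Jaccard similarity, used only to break ties."""
    a, b = set(x), set(y)
    return len(a & b) / len(a | b) if a | b else 1.0

def generate_rules(P, kappa=0.8):
    """Compute R1 from positive pairs P by co-occurrence counting (Eqs. 1-2)."""
    score = Counter()
    for p1, p2 in P:
        right = set(p2.split()) | {""}      # extend T2 tokens with epsilon
        for x in set(p1.split()):
            for y in right:
                score[(x, y)] += 1          # Eq. 1: co-occurrence count
    rules = {}
    for x in {x for x, _ in score}:
        def score_final(y):                 # Eq. 2: damping and tie-breaking
            s = score[(x, y)]
            return kappa * s if y == "" else s + char_jaccard(x, y) / 2
        best = max((y for x2, y in score if x2 == x), key=score_final)
        if best != x:                       # discard trivial rules <x -> x>
            rules[x] = best
    return rules
```

On the positive pairs of Table 1, this sketch recovers, among others, the rule <“van”→“Van”>.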

3.3    Rule Merging and Filtering
The computation of R1 and R2 can lead to a large number of inverse or im-
probable rules. In our example, R1 contains the rule <“van”→“Van”> while
R2 contains <“Van”→“van”>. Applying these rules to the data would conse-
quently not improve the convergence of their representations. To ensure that the
transformation rules lead to similar canonical forms, the rule merging step first
discards all rules < x → y >∈ R2 that are such that < y → x >∈ R1 (i.e., rules
in R2 that are inverse to rules in R1 ). Then, low-weight rules are discarded. The
idea here is that if there is not enough evidence for a rule, it might just be a
random event. The initial similarity threshold θ for the similarity condition is
finally set to
                        θ = min σ(ϕR1 (p1 ), ϕR2 (p2 )).                        (3)
                              (p1 ,p2 )∈P

In our example, CaRLA would discard <“Van”→“van”> from R2 . When as-
suming a score threshold of 10% of P’s size (i.e., 0.4), no rule would be filtered out.
The output of this step would consequently be R1 = {(<“van”→“Van”>, 2.08),
(<“T.”→ ε>, 2)} and R2 = {(<“(actor)”→ ε>, 2)}.
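A sketch of this step under a simplified representation of our own (rule sets as dicts mapping premise to consequence, weights keyed by (premise, consequence) pairs; the function name is ours):

```python
def merge_and_filter(R1, R2, weights, s_min):
    """Drop rules in R2 that are inverse to a rule in R1, then drop
    low-weight rules from both sets (weights: dict keyed by (x, y))."""
    R2 = {x: y for x, y in R2.items() if R1.get(y) != x}
    def keep(R):
        return {x: y for x, y in R.items() if weights[(x, y)] >= s_min}
    return keep(R1), keep(R2)

R1 = {"van": "Van", "T.": ""}
R2 = {"Van": "van", "(actor)": ""}
weights = {("van", "Van"): 2.08, ("T.", ""): 2.0,
           ("Van", "van"): 2.08, ("(actor)", ""): 2.0}
merge_and_filter(R1, R2, weights, 0.4)
# -> ({'van': 'Van', 'T.': ''}, {'(actor)': ''})
```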

3.4   Rule Falsification
The aim of the rule falsification step is to detect a set of transformations that
leads to a minimal number of elements of N having a similarity superior to θ
under σ. To achieve this goal, we follow a greedy approach that aims to minimize
the magnitude of the set

 E = {(p1 , p2 ) ∈ N : σ(ϕR1 (p1 ), ϕR2 (p2 )) ≥ θ},  θ = min_{(p1 ,p2 )∈P} σ(ϕR1 (p1 ), ϕR2 (p2 )).   (4)
Our approach tries to discard the rules that apply to elements of E in ascending
order of score. If discarding a rule does not make E smaller, the change is rolled
back and the next rule is tried; else, the rule is discarded from the set of final
rules. The approach terminates as soon as E is empty. Note that discarding a
rule can alter the value of θ and thus E. Once the set E has been computed,
CaRLA concludes its computation by generating a final value of the threshold θ.
    In our example, two rules apply to the element of N . After discarding the
rule <“T.”→ ε>, the set E becomes empty, leading to the termination of the
rule falsification step. The final sets of rules are thus R1 = {<“van”→“Van”>}
and R2 = {<“(actor)”→ ε>}. The value of θ is computed to be 0.75. Table 2
shows the canonical property values for our toy example. Note that this threshold
allows us to discard the elements of N as candidates for equivalent property values.


    Property value 1           Property value 2              Canonical value
   “Jean van Damne”      “Jean Van Damne (actor)”          “Jean Van Damne”
“Thomas T. van Nguyen” “Thomas Van Nguyen (actor)” “Thomas T. Van Nguyen”
     “Alain Delon”          “Alain Delon (actor)”            “Alain Delon”
    “Alain Delon Jr.”     “Alain Delon Jr. (actor)”         “Alain Delon Jr.”
  “Claude T. Francois”                                    “Claude T. Francois”
                          “Claude Francois (actor)”        “Claude Francois”
           Table 2. Canonical property values for our example dataset
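The greedy falsification step can be sketched as follows (again a reconstruction under our own assumptions: rule sets as dicts, ε as the empty string, word-level Jaccard as σ). On the toy example, this sketch discards <“T.”→ ε> and returns θ = 0.75, as in the text:

```python
def phi(s, R):
    """Apply rule dict R to the words of s; '' encodes a deletion."""
    return " ".join(t for t in (R.get(w, w) for w in s.split()) if t)

def jaccard(a, b):
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

def falsify(R1, R2, weights, P, N):
    """Greedily discard rules (ascending score) until no negative pair
    reaches the threshold theta induced by the positive pairs (Eq. 4)."""
    def theta():
        return min(jaccard(phi(p1, R1), phi(p2, R2)) for p1, p2 in P)
    def E():
        t = theta()
        return [(a, b) for a, b in N if jaccard(phi(a, R1), phi(b, R2)) >= t]
    candidates = sorted(((R, x) for R in (R1, R2) for x in list(R)),
                        key=lambda c: weights[(c[1], c[0][c[1]])])
    for R, x in candidates:
        err = E()
        if not err:
            break
        y = R.pop(x)                 # tentatively discard rule <x -> y>
        if len(E()) >= len(err):     # no improvement: roll the change back
            R[x] = y
    return R1, R2, theta()
```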



    It is noteworthy that by learning transformation rules, we also found an initial
threshold θ for determining the similarity of property values using σ as similar-
ity function. In combination with the canonical forms computed by CaRLA, the
configuration (σ, θ) can be used as an initial configuration for Link Discovery
frameworks such as LIMES. For example, the smallest Jaccard similarity over the
pairs of property values in our example lies at 1/3, leading to a precision of 0.71
for a recall of 1 (F-measure: 0.83). Yet, after the computation of the transforma-
tion rules, we reach an F-measure of 1 with a threshold of 1. Consequently, the
pair (σ, θ) can be used for determining an initial classifier for approaches such
as the RAVEN [11] algorithm implemented in LIMES [10].
4     Extension to Active Learning

One of the drawbacks of batch learning approaches is that they often require a
large number of examples to generate good models. As our evaluation shows (see
Section 5), this drawback also holds for the batch version of CaRLA, as it can
easily detect very common rules but sometimes fails to detect rules that apply
to fewer pairs of property values. In the following, we present how this problem
can be addressed by extending CaRLA to aCaRLA using active learning [16].


Algorithm 1 Overview of aCaRLA
Require: Positive examples P0
Require: Negative examples N0
Require: Similarity function σ
Require: Damping factor κ
Require: Score threshold smin
Require: Tokenization function token
Require: Maximal number of annotation requests qtotal
Require: Number of questions/iteration q
 Rule sets R1 ← ∅, R2 ← ∅
 qcurrent := 0 //Current number of questions
 Ex := ∅ //Set of examples to annotate
 t := 0 //Iteration counter
 r := null //Unsure rule
 B := ∅ //Set of banned rules
 while qcurrent ≤ qtotal do
    (R1 , R2 , θ) := runCarla(Pt , Nt , σ, κ, smin , token) //Run batch learning
    r := getMostUnsureRule(R1 ∪ R2 , B, smin )
    if r ≠ null then
       Ex := computeExamples(r, q)
    else
       Ex := getMostDifferent(q)
    end if
    (P , N ) := requestAnnotations(Ex)
    Pt+1 ← Pt ∪ P
    Nt+1 ← Nt ∪ N
    B ← updateBannedRules(B, r)
    t ← t + 1
    qcurrent ← qcurrent + |Ex|
 end while
 (R1 , R2 , θ) := runCarla(Pt , Nt , σ, κ, smin , token)

    return (R1 , R2 , θ)



    An overview of aCaRLA is given in Algorithm 1. The basic idea here is to
begin with small training sets P0 and N0 . In each iteration, all the available
training data is used by the batch version of CaRLA to update the set of rules.
The algorithm then tries to refute or validate rules with a score below the score
threshold smin (i.e., unsure rules). For this purpose it picks the most unsure
rule r that has not been shown to be erroneous in a previous iteration (i.e., that
is not an element of the set of banned rules B). It then fetches a set Ex of
property values that match the left side (i.e., the premise) of r. Should there be no
unsure rule, then Ex is set to the q property values that are most dissimilar to
the already known property values. Annotations consisting of the corresponding
values for the elements of Ex in the other source knowledge base are requested
from the user and written to the set P . Property values with no corresponding
values are written to N . Finally, the sets of positive and negative examples are
updated and the triple (R1 , R2 , θ) is learned anew until a stopping condition
such as a maximal number of questions is reached. As our evaluation shows, this
simple extension of the CaRLA algorithm allows it to detect efficiently the pairs
of annotations that might lead to a larger set of high-quality rules.
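The selection of the most unsure rule can be sketched as follows (a minimal reconstruction; representing each rule as a (premise, consequence) pair with a dict of weights is our own assumption, and the function name mirrors Algorithm 1 rather than CaRLA's actual code):

```python
def get_most_unsure_rule(weights, banned, s_min):
    """Return the non-banned rule with the lowest score below s_min,
    or None if every remaining rule is considered sure."""
    candidates = [(w, rule) for rule, w in weights.items()
                  if w < s_min and rule not in banned]
    return min(candidates)[1] if candidates else None
```

A rule such as <“(1999”→ ε> with a score below s_min would thus be returned for annotation before any sure rule.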


5     Evaluation

5.1     Experimental Setup

In the experiments reported in this section, we evaluated CaRLA by two means:
First we aimed to measure how well CaRLA could compute transformations
created by experts. To achieve this goal, we retrieved transformation rules from
four link specifications defined manually by experts within the LATC project5 .
An overview of these specifications is given in Table 3. Each link specification
aimed to compute owl:sameAs links between entities across two knowledge bases
by first transforming their property values and by then computing the similarity
of the entities based on the similarity of their property values. For example, the
computation of links between films in DBpedia and LinkedMDB was carried out
by first applying the rule set R1 = {<“(film)” → ε>} to the labels of films in
DBpedia and R2 = {<“(director)” → ε>} to the labels of their directors. We
ran both CaRLA and aCaRLA on the property values of the interlinked entities
and measured how fast CaRLA was able to reconstruct the set of rules that were
used during the Link Discovery process.


        Experiment Source     Target  Source property Target property Size
          Actors DBpedia LinkedMDB rdfs:label            rdfs:label 1172
         Directors DBpedia LinkedMDB rdfs:label          rdfs:label 7353
          Movies DBpedia LinkedMDB rdfs:label            rdfs:label 9859
         Producers DBpedia LinkedMDB rdfs:label          rdfs:label 1540
                        Table 3. Overview of the datasets
    In addition, we quantified the quality of the rules learned by CaRLA. In each
experiment, we computed the boost in the precision of the mapping of property
pairs with and without the rules derived by CaRLA. The initial precision was
5
    http://latc-project.eu
computed as |P |/|M |, where M = {(pi , pj ) : σ(pi , pj ) ≥ min_{(p1 ,p2 )∈P} σ(p1 , p2 )}.
The precision after applying CaRLA’s results was computed as |P |/|M ′ |, where
M ′ = {(pi , pj ) : σ(ϕR1 (pi ), ϕR2 (pj )) ≥ min_{(p1 ,p2 )∈P} σ(ϕR1 (p1 ), ϕR2 (p2 ))}. Note
that in both cases, the recall was 1, given that ∀(pi , pj ) ∈ P : σ(pi , pj ) ≥
min_{(p1 ,p2 )∈P} σ(p1 , p2 ). In all experiments, we used the Jaccard similarity metric and
a word tokenizer with κ = 0.8. All runs were carried out on a notebook running
Windows 7 Enterprise with 3GB RAM and an Intel Dual Core 2.2GHz proces-
sor. Each of the algorithms was run five times. We report the rules that were
discovered by the algorithms and the number of experiments within which they
were found.

      [Figure: two bar charts comparing Baseline and CaRLA over the datasets
      Actors, Directors, Directors_clean, Movies, Movies_clean and Producers;
      panel (a) shows Precision, panel (b) shows Thresholds.]

      Fig. 1. Comparison of the precision and thresholds with and without CaRLA.




5.2               Results and Discussion
Table 4 shows the union of the rules learned by the batch version of CaRLA in
all five runs. Note that the computation of a rule set lasted under 0.5s even for
the largest dataset, Movies. The columns Pn give the probability of finding a rule
for a training set of size n in our experiments. R2 is not reported because it was
empty in all setups. Our results show that in all cases, CaRLA converges quickly
and learns rules that are equivalent to those utilized by the LATC experts with
a sample set of 5 pairs. Note that for each rule of the form <“@en” → y> with
y ≠ ε that we learned, the experts used the rule <y → ε>, while the linking
platform automatically removed the language tag. We experimented with the
same datasets without language tags and computed exactly the same rules as
those devised by the experts. In some experiments (such as Directors), CaRLA
was even able to detect rules that were not included in the set of rules generated
by human experts. For example, the rule <“(filmmaker)” → “(director)”> is not
very frequent and was thus overlooked by the experts. In Table 4, we marked
such rules with an asterisk. The Directors and Movies datasets contained a
large number of typographic errors of different sorts (incl. misplaced hyphens,
character repetitions such as in the token “Neilll”, etc.), which led to poor preci-
sion scores in our experiments. We cleaned the first 250 entries of these datasets
of these errors and obtained the results in the rows labelled Directors clean
and Movies clean. The results of CaRLA on these datasets are also shown in
Table 4. We also measured the improvement in precision that resulted from ap-
plying CaRLA to the datasets at hand (see Figure 1). Note that the precision
remained constant across the different dataset sizes. In the best case (the cleaned
Directors dataset), we are able to improve the precision of the property mapping
by 12.16%. Note that we can improve the precision of the mapping of property
values even on the noisy datasets.


        Experiment                      R1                  P5  P10 P20 P50 P100
           Actors            <“@en” → “(actor)”>             1   1   1   1   1
          Directors         <“@en” → “(director)”>           1   1   1   1   1
          Directors     <“(filmmaker)” → “(director)”>∗      0   0   0   0  0.2
       Directors clean      <“@en” → “(director)”>           1   1   1   1   1
           Movies                <“@en” → ε>                 1   1   1   1   1
           Movies               <“(film)” → ε>               1   1   1   1   1
           Movies                <“film)” → ε>∗              0   0   0   0  0.6
        Movies clean             <“@en” → ε>                 1   1   1   1   1
        Movies clean            <“(film)” → ε>               0  0.8  1   1   1
        Movies clean             <“film)” → ε>∗              0   0   0   0   1
         Producers          <“@en” → “(producer)”>           1   1   1   1   1
                    Table 4. Overview of batch learning results




    Interestingly, when used on the Movies dataset with a training dataset size
of 100, our framework learned low-confidence rules such as <“(1999” → ε>,
which were nevertheless discarded due to their low score. These are the cases where
aCaRLA displays its superiority. Thanks to its ability to ask for annotations
when faced with unsure rules, aCaRLA is able to validate or negate such
rules. As the results on the Movies example (Table 5) show, aCaRLA is able to detect
several supplementary rules that were overlooked by human experts. In particular,
it clearly shows that deleting the year of creation of a movie can improve the
conformation process. aCaRLA is also able to generate a significantly larger
number of candidate rules for the user’s convenience.


6   Related Work

Linked Data Integration is an important topic for all applications that rely on a
large number of knowledge bases and necessitate a unified view on this data, e.g.,
Question Answering frameworks [4] and semantic mashups [13]. In recent work,
several challenges and requirements to Linked Data consumption and integration
have been pointed out [9]. Recently, several approaches and frameworks have
been developed with the aim of addressing many of these challenges. For example,
the R2R framework [2] allows the specification and publication of mappings
          Experiment                   R1                   P5  P10 P20 P50 P100
             Actors          <“@en” → “(actor)”>             1   1   1   1   1
            Directors       <“@en” → “(director)”>           1   1   1   1   1
            Directors    <“(actor)” → “(director)”>∗         0   0   0   0   1
         Directors clean    <“@en” → “(director)”>           1   1   1   1   1
             Movies              <“@en” → ε>                 1   1   1   1   1
             Movies             <“(film)” → ε>               1   1   1   1   1
             Movies              <“film)” → ε>∗              0   0   0   0   1
             Movies              <“(2006” → ε>∗              0   0   0   0   1
             Movies              <“(199” → ε>∗               0   0   0   0   1
          Movies clean           <“@en” → ε>                 1   1   1   1   1
          Movies clean          <“(film)” → ε>               0   1   1   1   1
          Movies clean           <“film)” → ε>∗              0   0   0   0   1
           Producers         <“@en” → “(producer)”>          1   1   1   1   1
                    Table 5. Overview of active learning results



that allow mapping resources and literals across knowledge bases. The Linked
Data Integration Framework LDIF [15], whose goal is to support the integration
of RDF data, builds upon R2R mappings and technologies such as SILK [5]
and LDSpider6 . The concept behind the framework is to enable users to create
periodic integration jobs via simple XML configurations. KnoFuss [12] addresses
data integration from the point of view of link discovery by monitoring the
interaction between instance and dataset matching (which is similar to ontology
matching [3]). Also worth mentioning is Semantic Web Pipes7 [13], which follows
the idea of Yahoo Pipes8 to enable the integration of data in formats such as RDF
and XML. In all of these systems, the configuration of the data transformations
has to be carried out manually.
    Some approaches to transformation rule learning have previously
been proposed in the record linkage area. For example, [14] presents an interac-
tive approach to data cleaning, while [1] developed an approach to learning string
transformations from examples. Yet, none of these approaches uses active learn-
ing to improve the number of rules it detects. To the best of our knowledge,
CaRLA is the first approach tailored towards Linked Data that allows the active
learning of data conformation rules for property values expressed as strings.


7   Conclusion and Future Work

In this work, we presented two algorithms for learning data transformations. We
presented a batch learning approach that detects rules by performing a co-
occurrence analysis. This version is particularly useful when the knowledge bases
to be integrated are already linked. We also presented an active learning approach
6
  http://code.google.com/p/ldspider/
7
  http://pipes.deri.org/
8
  http://pipes.yahoo.com/pipes/
that discovers a large number of rules efficiently. We evaluated our
approach with respect to the quality of the rules it discovers and to the effect of
the rules on matching property values. We showed that the data transformation
achieved by our approach allows increasing the precision of property mapping
by up to 12% when the recall is set to 1. In future work, we aim to extend our
approach to learning transformation rules with several tokenizers concurrently.

References
 1. Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Learning string transfor-
    mations from examples. Proc. VLDB Endow., 2(1):514–525, August 2009.
 2. Christian Bizer and Andreas Schultz. The R2R Framework: Publishing and Dis-
    covering Mappings on the Web. In Proceedings of COLD, 2010.
 3. Jérôme Euzenat and Pavel Shvaiko. Ontology matching. Springer-Verlag, Heidel-
    berg (DE), 2007.
 4. David A. Ferrucci, Eric W. Brown, Jennifer Chu-Carroll, James Fan, David
    Gondek, Aditya Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg,
    John M. Prager, Nico Schlaefer, and Christopher A. Welty. Building Watson: An
    overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.
 5. Robert Isele and Christian Bizer. Learning Linkage Rules using Genetic Program-
    ming. In Sixth International Ontology Matching Workshop, 2011.
 6. Paul Jaccard. Étude comparative de la distribution florale dans une portion des
    Alpes et des Jura. Bulletin del la Société Vaudoise des Sciences Naturelles, 37:547–
    579, 1901.
 7. Ralph Kimball and Joe Caserta. The Data Warehouse ETL Toolkit: Practical
    Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley,
    2004.
 8. Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions,
    and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.
 9. Ian Millard, Hugh Glaser, Manuel Salvadores, and Nigel Shadbolt. Consuming
    multiple linked data sources: Challenges and experiences. In First International
    Workshop on Consuming Linked Data, 2010.
10. Axel-Cyrille Ngonga Ngomo. A Time-Efficient Hybrid Approach to Link Discovery.
    In Sixth International Ontology Matching Workshop, 2011.
11. Axel-Cyrille Ngonga Ngomo, Jens Lehmann, Sören Auer, and Konrad Höffner.
    RAVEN – Active Learning of Link Specifications. In Proceedings of OM@ISWC,
    2011.
12. Andriy Nikolov, Victoria Uren, Enrico Motta, and Anne Roeck. Overcoming
    schema heterogeneity between linked semantic repositories to improve coreference
    resolution. In Proceedings of the 4th Asian Conference on The Semantic Web,
    pages 332–346, 2009.
13. Danh Le Phuoc, Axel Polleres, Manfred Hauswirth, Giovanni Tummarello, and
    Christian Morbidoni. Rapid prototyping of semantic mash-ups through semantic
    web pipes. In WWW, pages 581–590, 2009.
14. Vijayshankar Raman and Joseph M. Hellerstein. Potter’s wheel: An interactive
    data cleaning system. In VLDB, pages 381–390, 2001.
15. Andreas Schultz, Andrea Matteini, Robert Isele, Christian Bizer, and Christian
    Becker. LDIF – Linked Data Integration Framework. In COLD, 2011.
16. Burr Settles. Active learning literature survey. Technical Report 1648, University
    of Wisconsin-Madison, 2009.