<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entity Disambiguation for Wild Big Data Using Multi-Level Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jennifer Sleeman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science and Electrical Engineering University of Maryland</institution>
          ,
          <addr-line>Baltimore County, Baltimore, MD 21250</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>When RDF instances represent the same entity they are said to corefer. For example, two nodes from different RDF graphs1 can both refer to the same individual, the musical artist James Brown. Disambiguating entities is essential for knowledge base population and other tasks that result in integration or linking of data. Often, however, entity instance data originates from different sources and can be represented using different schemas or ontologies. In the age of Big Data, data can have other characteristics, such as originating from sources that are schema-less or without ontological structure. Our work involves researching new ways to process this type of data in order to perform entity disambiguation. Our approach uses multi-level clustering and includes fine-grained entity type recognition, contextualization of entities, and online processing, which can be supported by a parallel architecture.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Often when performing knowledge base population, entities that exist in the
knowledge base need to be matched to entities from newly acquired data. After
matching entities, the knowledge base can be further enriched with new
information. This matching of entities is typically called entity disambiguation (ED)
or coreference resolution when performed without a knowledge base [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Early
work related to record linkage [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] was foundational to this concept of entity
similarity. Though there is a significant amount of research in this area, including
methods which are supervised and unsupervised, these approaches tend to make
assumptions that do not hold for big data.
      </p>
      <p>
        Existing research tends to assume a static batch of data, ignoring the
streaming, temporal aspects. It assumes that the schemas or ontologies are available
and complete. Often issues such as heterogeneity and volume are not considered.
However, big data applications tend to include unalignable data from multiple
sources and often have schemas or ontologies that are absent or insufficient. We
define these characteristics in terms of 'Wild Big Data' (WBD) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and
describe how these characteristics challenge the disambiguation process. Our work
specifically addresses these characteristics with an approach that could be used
to perform ED for WBD.
1 http://dbpedia.org/resource/James_Brown and http://musicbrainz.org/artist/2033034fe2-4a47-a1b6-291e26aa3438#
The objective of this research is to perform ED given that the data is large in
volume, potentially schema-less, multi-sourced and temporal by nature. We want
to answer questions such as: how do we perform ED in a big data setting; can
we efficiently distribute the task of ED without a loss in precision; how do we
account for data that is changing over time; how do we process semantic graphs
given they may not have an associated schema or ontology; and finally, how do
we process this data given it originates from different sources with potentially
unalignable vocabularies? These questions are important to answer because they
are real problems in big data applications [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <sec id="sec-1-1">
        <title>Motivation</title>
        <p>
          Big data is a growing area of research and offers many challenges for ED [
          <xref ref-type="bibr" rid="ref11 ref2">2, 11</xref>
          ].
The main motivation of this work is the need for ED that supports data with big
data characteristics. This includes data originating from different sources which
contain different types of entities at different levels of granularity, data that may
not have a schema or ontology, and data that changes over time. This sort of
data at big data volumes complicates the ED process.
        </p>
        <p>
          Companies, organizations and government entities are sharing more data and
acquiring more data from other sources to gain new insight and knowledge [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
Often the combination of sources, such as social media, news and other types of
sources can provide more insight into topics than a single source.
        </p>
        <p>
          As is evident from efforts related to Linked Open Data (LOD) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
interoperability among different data sources is of growing importance and essential for
sharing data. As more data is made available for sharing, the need for aligning
schemas/ontologies is increasing.
        </p>
        <p>Knowledge bases typically contain entities, facts about the entities and links
between entities. As new data is made available over time, these knowledge
bases require ways to manage new information such as adding entities, links and
new attributes pertaining to the entities. There is a need to also alter existing
information such that information that becomes invalid over time is adjusted. For
example, a link may become invalid or an attribute may prove to be incorrectly
assigned to an entity.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Challenges and Opportunities</title>
        <p>By exploring how to perform ED for big data, we will offer a strong contribution
to this area, as previous research has focused only on various parts of this problem.</p>
        <p>
          Regarding the LOD [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], interoperability is a real challenge, particularly
because vocabularies are not always alignable. For example, "address" in one
vocabulary could mean street address alone, and in another it could include city, state
and zip code. We explored this problem in more depth in our previous work [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
LOD attempts to provide a way for data providers to link their data into the
cloud. However, data may not always be made available as LOD, and in order
for an application to perform ED, this alignment becomes essential.
        </p>
        <p>With unstructured text, one can use natural language processing to acquire
various facts related to entities found in the text. With RDF data, an ontology
can often be used to develop an understanding of the data. However, when data
is semi-structured such as RDF or JSON and no such ontology or schema is
present, disambiguating entities becomes problematic. Making sense of these
large data extractions becomes a real issue.</p>
        <p>Knowledge bases naturally change over time; however, it is a challenge to
enrich the knowledge base while at the same time reducing errors in
previously asserted facts. Algorithms used to perform ED are typically developed
for static data. Incremental updates and changes are harder to incorporate.
However, this is precisely what is needed as often big data applications are producing
data on a periodic basis. If one is developing a knowledge base where facts are
changing over time, the ED algorithm must accommodate these changes in a
way that does not require the algorithm to reprocess all potential matches given
new information.</p>
        <p>Volume requires that the algorithm can be distributed in such a way that
work could be performed in parallel. Again ED algorithms do not typically
assume data in terms of the volume that is present with big data applications.
However, since ED algorithms typically have O(n²) complexity, distributing the
algorithm would be necessary for such large volumes of data.</p>
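One standard way to make pairwise ED distributable is blocking. The sketch below is a generic illustration of that idea, not the system described in this paper; the entity records and the blocking key are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def block_entities(entities, key_fn):
    # Group entities into blocks; only entities sharing a block key are
    # compared, avoiding the full O(n^2) pairwise sweep.
    blocks = defaultdict(list)
    for entity in entities:
        blocks[key_fn(entity)].append(entity)
    return blocks

def candidate_pairs(blocks):
    # Each block is independent, so blocks can be scored in parallel,
    # e.g. one block per map task in a Hadoop-style architecture.
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical toy records, blocked on a coarse type label.
entities = [
    {"id": "a", "type": "Person"},
    {"id": "b", "type": "Person"},
    {"id": "c", "type": "Place"},
    {"id": "d", "type": "Person"},
]
pairs = list(candidate_pairs(block_entities(entities, lambda e: e["type"])))
# 3 within-block pairs instead of 6 unrestricted pairs
```

Blocking trades a possible loss in recall (pairs split across blocks are never compared) for a large reduction in comparisons, which is why the quality of the first-level clusters matters.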
        <p>
          Recent research which has addressed big data ED has primarily been in the
natural language processing domain. For example, a number of researchers [
          <xref ref-type="bibr" rid="ref13 ref16 ref4">4, 13,
16</xref>
          ] have explored using MapReduce for pairwise document similarity. However,
they are primarily focused on the volume characteristic. Work by Araujo et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
tackled the problem of working with heterogeneous data but they worked with
sources where the vocabularies were alignable. Work by Hogan et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] addresses
this problem of performing ED for large, heterogeneous data. However, they
assume they have access to the ontologies used and they assume they can make
use of owl:sameAs semantics (which isn't always present). The hard problem of
trying to understand data absent knowledge of how it is structured has not been
thoroughly addressed in previous research.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <p>We are developing a multi-level clustering approach that includes one level of
topic modeling and a second level of clustering using our own custom algorithm.
This approach makes big data ED more tractable. Our research makes three
major research contributions that work together to achieve an effective approach
for performing online ED.</p>
      <sec id="sec-2-1">
        <title>Research Contribution: Fine-grained Entity Type Recognition</title>
        <p>
          If we consider identifying traits of an entity, at the highest level of identification,
entities are defined by types, for example "Person", "Football Player", "Baseball
Stadium", etc. With TAC (http://www.nist.gov/tac) there are just three types (PER, ORG, GPE),
with DBpedia there are fewer than 1000 types, and with Yago there are tens of thousands
of types. Given a WBD data set, the data can contain a mix of entities,
can be composed of many different types, such as a person, a sports player, and a
team member, and can be defined by types that are defined at different levels
of granularity. For example, "Person" is at a much higher level than "Football
Player". Often type information is not available; to get around this problem, we
have proposed a solution [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] based on topic modeling that enables us to define
types when type information is not present.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Research Contribution: Multi-dimensional Clustering</title>
        <p>We are developing a clustering algorithm that performs ED based on multiple dimensions.
This algorithm would be applied to the coarse clusters generated from the
fine-grained entity type recognition. The complexity of clustering algorithms can
range from O(n²) to O(n³), so a key aspect of this work is that it supports
parallelism.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Research Contribution: Incremental Online Modeling to Support Temporal Change</title>
        <p>Our work includes knowledge base (KB) population. When
entities are assessed as similar, the information in the KB is merged with the
information contained in the newly recognized matched entity instance. However,
similarity is usually associated with some level of probability. As more data is
acquired over time, previous assertions may prove to have a lower probability
than previously asserted.</p>
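A minimal sketch of this kind of online revision follows. It is an assumption-laden illustration, not the authors' KB population system: the fact representation, the single-confidence model and the retraction threshold are all made up.

```python
def revise(kb, subject, predicate, obj, confidence, threshold=0.5):
    # Record the latest evidence for a fact; retract the fact once its
    # confidence drops below the threshold, instead of reprocessing
    # every previously asserted match from scratch.
    key = (subject, predicate)
    if confidence >= threshold:
        kb[key] = {"object": obj, "confidence": confidence}
    else:
        kb.pop(key, None)
    return kb

kb = {}
revise(kb, "e1", "sameAs", "e2", 0.9)  # asserted with high probability
revise(kb, "e1", "sameAs", "e2", 0.3)  # later evidence lowers it: retracted
```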
      </sec>
      <sec id="sec-2-4">
        <title>Relationship with State of the art</title>
        <p>
          As it relates to coreference resolution the following work [
          <xref ref-type="bibr" rid="ref1 ref12 ref24">12, 1, 24</xref>
          ] would be
considered state of the art and is comparable to our work. In the NLP domain, a
number of researchers have focused on scalable entity coreference using
MapReduce [
          <xref ref-type="bibr" rid="ref13 ref16 ref4">4, 13, 16</xref>
          ].
        </p>
        <p>
          As it relates to type identification, work by Ma et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] presents a similar
problem, whereby type information is missing. This work builds clusters that
represent entity types based on both schema and non-schema features.
Paulheim et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] also address the problem of identifying type information when
it is non-existent and they also use their approach to validate existing type
definitions. They take advantage of existing links between instances and assume
that instances of the same types should have similar relations. They acquire this
understanding by examining the statistical distribution for each link.
        </p>
        <p>
          As it relates to candidate selection, the following work [
          <xref ref-type="bibr" rid="ref15 ref23">23, 15</xref>
          ] would be
considered state of the art and comparable to our work.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation of Proposed Approach</title>
      <p>We will implement our approach as a software system that can be used
to perform ED for wild big data. We will demonstrate the effectiveness of this
approach by deploying it in a parallel environment processing wild big data.
We will use benchmarks to convey the overall performance of the system as it
compares to other systems that are not necessarily addressing the wild big data
aspects. We anticipate ED scores that have slightly lower precision but we expect
to see better computing performance as we scale the number of entities in our
system to big data sizes, since our approach is developed to be amenable to a
Hadoop-like architecture.</p>
      <sec id="sec-3-1">
        <title>Current Implementation</title>
        <p>
          For the first level of clustering we use Latent Dirichlet Allocation (LDA) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
topic modeling to form coarse clusters of entities based on their fine-grained
entity types. We use LDA to map unknown entities to known entity types to
predict the unknown entity types. We shared preliminary results of this effort
in our previous work [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Table 1 shows our latest results, which include
experiments using DBpedia data where we show accuracy given we found all types, and
accuracy given we missed one type but found the others. Figure 1 also shows
another experiment where we measured precision at N, where given N
predictions we found all of the types for a particular entity. This approach offers two
benefits: it results in overlapping clusters based on entity types, improving recall,
and it does not require knowledge of the schema or ontology of the data. The
only requirement is that there is a knowledge base of entity types that can be
used as a source for associating entity types to unknown entities.
        </p>
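As an illustration of mapping an unknown entity to known types via topic distributions, consider the sketch below. The per-type topic profiles and the cosine ranking are hypothetical stand-ins for the actual LDA model; the point is only that an entity can match several types at once, which is what produces overlapping first-level clusters.

```python
import math

def cosine(u, v):
    # Cosine similarity between two topic distributions.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_types(entity_topics, type_profiles, n=2):
    # Rank known entity types by similarity of their topic profile to the
    # topic mixture inferred for the unknown entity; keep the top n.
    ranked = sorted(type_profiles, reverse=True,
                    key=lambda t: cosine(entity_topics, type_profiles[t]))
    return ranked[:n]

# Hypothetical per-type topic profiles over a 3-topic model.
type_profiles = {
    "Person":         [0.7, 0.2, 0.1],
    "FootballPlayer": [0.6, 0.3, 0.1],
    "Place":          [0.1, 0.1, 0.8],
}
unknown = [0.65, 0.25, 0.10]  # topic mixture of a new, untyped entity
predicted = predict_types(unknown, type_profiles)
```

Because the unknown entity is close to both "Person" and "FootballPlayer", it would land in both type clusters, at different levels of granularity.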
        <p>
          We have performed research related to ED of people in our early work [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]
where we experimented with combining rules and supervised classification.
However, when ED is performed on entities of different types in combination with
different data sources, the ED process is more difficult. Often, recognizing the
types of entities and then performing ED among specific types can reduce this
problem; however, when the data sets are large to the scale of big data problems,
even recognizing these types still leaves intractably sized
subproblems. For this reason, we are building a custom clustering algorithm for the
second level of clustering. This work is still in progress.
        </p>
        <p>
          We have performed preliminary work [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] with hierarchical clustering and
did not find this to be a viable solution. Our current work clusters based on
a number of features such as distance measures, co-occurrences, graph-based
properties, and statistical distributions. Distinctive to our work, we also
incorporate context, which we derive from our topic model. Entity context provides
additional information about an entity that is not necessarily acquired from the
associated predicates for that entity. We are also currently performing
preliminary experiments related to contextualizing entities.
        </p>
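A simplified sketch of combining several similarity dimensions into one score is shown below. The feature set, the Jaccard measure and the weights are illustrative stand-ins for the distance, co-occurrence, graph-based and contextual features described above, not the custom algorithm under development.

```python
def jaccard(a, b):
    # Set overlap in [0, 1]; 0.0 when both sets are empty.
    return len(a & b) / len(a | b) if a | b else 0.0

def entity_similarity(e1, e2, weights):
    # One score per dimension, combined as a weighted average.
    scores = {
        "string":  jaccard(set(e1["name"].lower().split()),
                           set(e2["name"].lower().split())),
        "links":   jaccard(e1["links"], e2["links"]),
        "context": jaccard(e1["topics"], e2["topics"]),
    }
    return sum(weights[f] * s for f, s in scores.items()) / sum(weights.values())

# Hypothetical entity records with name, link and topic-context features.
a = {"name": "James Brown", "links": {"funk", "soul"}, "topics": {"music"}}
b = {"name": "James Brown", "links": {"funk"}, "topics": {"music"}}
weights = {"string": 0.5, "links": 0.3, "context": 0.2}
score = entity_similarity(a, b, weights)  # 0.5*1.0 + 0.3*0.5 + 0.2*1.0 = 0.85
```

The context dimension here is what the topic model would contribute: a signal about an entity beyond its own predicates.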
      </sec>
      <sec id="sec-3-2">
        <title>Current Limitations</title>
        <p>Since our approach is a two-level approach, errors from the first level of
clustering could propagate to the second level. We look to overcome this problem by
generating a model that both levels of clustering would use; however, a resolution
to this problem is still under investigation.</p>
        <p>This approach is currently limited to graph-based data. There is a lot of
unstructured text and it would be advantageous for our system to be able to
convert unstructured text to graph-based structures. In addition, in order for our
approach to work with data that is truly "wild", we require access to a knowledge
base that is rich with fine-grained entity types. The richness of the knowledge
base and its representation of the data to be processed directly influence how
well our approach will perform. For example, if our knowledge base has very
little information related to car accidents and we are processing entities from
a data source related to car accidents, we will under-perform when recognizing
the fine-grained entity types, which consequently will negatively impact our ED
algorithm.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Empirical Evaluation Methodology</title>
      <p>Since there are multiple parts to our approach, we intend to evaluate the various
parts in addition to how well the parts work together to perform ED.</p>
      <p>Hypotheses:
1. By using a multi-level clustering approach, we can perform ED for wild big
data and achieve F-measure rates that are close to those of other ED
algorithms that are not processing wild big data.
2. Fine-grained entity type recognition as a first level of clustering is a
competitive approach to performing candidate selection.
3. Our approach will be scalable such that it is comparable with other methods
that perform ED in parallel.
4. By performing ED online, we can reduce the number of errors in our KB.</p>
      <sec id="sec-4-1">
        <title>General Strategy</title>
        <p>
          Our general approach for evaluation is to evaluate our first level of clustering,
the fine-grained entity type recognition work, in isolation from ED. We will then
perform experiments related to contextualizing entities, performing ED both
from a scalability and an accuracy perspective, and finally online KB improvements.
Benchmarks: We will use data sets for which we are able to easily establish ground
truth, such as DBpedia and Freebase. However, we will also use Big Data
datasets, and we may use unstructured data sets that are processed by an OpenIE
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] system resulting in triple-based information.
        </p>
        <p>
          Our goal with the fine-grained entity type recognition work is to be able
to identify all entity types that are assigned to gold standard entities. We also
will try to identify incorrectly used and missing entity types. We will perform
experiments which benchmark our first level of clustering against other candidate
selection methods, and we will also benchmark against an existing type identification
approach [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>With our second level of clustering we hope to demonstrate that
contextualization of entities improves performance. We also plan to compare our ED
with others from an accuracy standpoint and from a complexity standpoint. We
will benchmark how well we scale in a parallel environment compared to other
parallel ED approaches.</p>
        <p>
          One feasible approach for evaluating the ED method is to use data from the
LOD, remove links, and then compare our results on the unlinked data to the
data that is linked [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We will also explore the Instance Matching Benchmark
3 for evaluation and benchmarking. Another benchmark that is more recent
is SPIMBench 4, which provides test cases for entity matching and evaluation
metrics, and supports testing scalability. Finally, we will show how a KB with
online temporal changes can reduce errors over time. We will demonstrate this by taking
an offline KB and comparing it to our online version.
        </p>
        <p>Metrics: For our evaluation we will use the standard F-measure metric. For
evaluating our clusters, we will likely use standard clustering metrics such as
measuring purity.</p>
        <p>Precision = TruePositive / (TruePositive + FalsePositive)</p>
        <p>Recall = TruePositive / (TruePositive + FalseNegative)</p>
        <p>F-measure = (2 × Precision × Recall) / (Precision + Recall)</p>
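These metrics compute directly from match counts; the counts in the example below are made up for illustration.

```python
def precision(tp, fp):
    # Fraction of predicted matches that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of true matches that were found.
    return tp / (tp + fn)

def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Made-up counts: 8 correct matches, 2 spurious, 4 missed.
p = precision(8, 2)  # 0.8
r = recall(8, 4)     # 8/12, about 0.667
f = f_measure(p, r)
```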
      </sec>
      <sec id="sec-4-2">
        <title>Current State of Evaluation</title>
        <p>
          Our early work [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] shows our evaluation of identifying fine-grained entity types
using an entropy-based approach. We now use a topic modeling approach and
have performed preliminary evaluation of this work [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. We also include in Table
1 our latest evaluation. This evaluation is based on 6000 randomly
selected DBpedia entities and 176 types used to build the model. We used 350 separately
randomly selected entities of type Creative Work, type Place, and type
Organization, as these had the highest representation in the training set.
We measured how often we were able to recognize all types associated with each
entity as defined by DBpedia. We are also in the process of a comprehensive
evaluation for this work. We are currently developing our custom clustering
algorithm and will plan to evaluate this work soon. We performed preliminary
experiments with an online KB where we reduced the errors by 70% by updating
the KB over time.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Lessons Learned, Open Issues, and Future Directions</title>
      <p>One of our challenges is finding the data we need to properly evaluate our
approach. Since we are proposing a system that works with Big Data scale datasets,
our evaluations will be harder to achieve.</p>
      <p>A second challenge is comparing and benchmarking our work against others.
Since our approach addresses problems that may overlap with other research
but are not exactly the same, we will need to benchmark parts of our system against
other research.</p>
      <p>3 http://islab.dico.unimi.it/iimb/ 4 http://www.ics.forth.gr/isl/spimbench/index.html</p>
      <p>Fig. 1: Fine-Grained Entity Type Precision at N</p>
      <p>From our previous experiments, when evaluating mappings of entity types
from one data source to another, we learned that since there will not always be
a direct mapping, we will need to have supporting heuristics, which makes the
evaluation process harder to achieve. For example, mapping between Freebase and
DBpedia is not always possible, often because types defined in one knowledge
base simply do not exist in the other.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The author would like to thank her advisor, Dr. Tim Finin, and Dr. Anupam
Joshi.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Araujo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeVries</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hidders</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwabe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : SERIMI:
          <article-title>Class-based disambiguation for effective instance matching over heterogeneous web data</article-title>
          .
          <source>In: WebDB</source>
          . pp.
          <volume>25</volume>
          –
          <issue>30</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Beheshti</surname>
            ,
            <given-names>S.M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venugopal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ryu</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benatallah</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Big data and cross-document coreference resolution: Current state and future opportunities</article-title>
          .
          <source>arXiv preprint arXiv:1311.3987</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>Probabilistic topic models</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>55</volume>
          (
          <issue>4</issue>
          ),
          <volume>77</volume>
          –
          <fpage>84</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Elsayed</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oard</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          :
          <article-title>Pairwise document similarity in large collections with mapreduce</article-title>
          .
          <source>In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers</source>
          . pp.
          <volume>265</volume>
          –
          <fpage>268</fpage>
          .
          Association for Computational Linguistics (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Open information extraction from the web</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>51</volume>
          (
          <issue>12</issue>
          ),
          <volume>68</volume>
          –
          <fpage>74</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fellegi</surname>
            ,
            <given-names>I.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sunter</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          :
          <article-title>A theory for record linkage</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>64</volume>
          (
          <issue>328</issue>
          ),
          <volume>1183</volume>
          –
          <fpage>1210</fpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montanelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noessner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stuckenschmidt</surname>
          </string-name>
          , H.:
          <article-title>Benchmarking matching applications on the semantic web</article-title>
          .
          <source>In: The Semantic Web: Research and Applications</source>
          , pp.
          <volume>108</volume>
          –
          <fpage>122</fpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Franks</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Taming the big data tidal wave: Finding Opportunities in Huge data streams with advanced Analytics</article-title>
          , vol.
          <volume>56</volume>
          . John Wiley &amp; Sons (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>10</volume>
          ,
          <fpage>76</fpage>
          -
          <lpage>110</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bicer</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Typifier: Inferring the type semantics of structured data</article-title>
          .
          <source>In: 2013 IEEE 29th Int. Conf. on Data Engineering (ICDE)</source>
          . pp.
          <fpage>206</fpage>
          -
          <lpage>217</lpage>
          . IEEE (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>McAfee</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brynjolfsson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davenport</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patil</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Big data: The management revolution</article-title>
          .
          <source>Harvard Bus Rev</source>
          <volume>90</volume>
          (
          <issue>10</issue>
          ),
          <fpage>61</fpage>
          -
          <lpage>67</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uren</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roeck</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution</article-title>
          .
          <source>In: Proc. 4th Asian Conf. on the Semantic Web</source>
          . vol.
          <volume>5926</volume>
          , pp.
          <fpage>332</fpage>
          -
          <lpage>346</lpage>
          (
          <year>December 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pantel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crestan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borkovsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vyas</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Web-scale distributional similarity and entity set expansion</article-title>
          .
          <source>In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2</source>
          . pp.
          <fpage>938</fpage>
          -
          <lpage>947</lpage>
          . Association for Computational Linguistics (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Type inference on noisy RDF data</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNamee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Entity linking: Finding extracted entities in a knowledge base</article-title>
          .
          <source>In: Multi-source, Multilingual Information Extraction and Summarization</source>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>115</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sarmento</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kehlenbeck</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>An approach to web-scale named-entity disambiguation</article-title>
          .
          <source>In: Machine Learning and Data Mining in Pattern Recognition</source>
          , pp.
          <fpage>689</fpage>
          -
          <lpage>703</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Schmachtenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>State of the LOD cloud</article-title>
          . http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/ (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sleeman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pope</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Opaque attribute alignment</article-title>
          .
          <source>In: Proc. 3rd Int. Workshop on Data Engineering Meets the Semantic Web</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sleeman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Computing foaf co-reference relations with rules and machine learning</article-title>
          .
          <source>In: The Third Int. Workshop on Social Data on the Web, ISWC</source>
          (November
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Sleeman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Cluster-based instance consolidation for subsequent matching</article-title>
          .
          <source>Knowledge Extraction and Consolidation from Social Media</source>
          p.
          <fpage>13</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sleeman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Taming wild big data</article-title>
          .
          <source>In: Symposium on Natural Language Access to Big Data</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Sleeman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Entity type recognition for heterogeneous semantic graphs</article-title>
          .
          <source>In: AI Magazine</source>
          . vol.
          <volume>36</volume>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          . AAAI Press (
          <year>March 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Automatically generating data linkages using a domain-independent candidate selection approach</article-title>
          .
          <source>In: Int. Semantic Web Conf</source>
          . (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Domain-independent entity coreference for linking ontology instances</article-title>
          .
          <source>Journal of Data and Information Quality (JDIQ)</source>
          <volume>4</volume>
          (
          <issue>2</issue>
          ),
          <fpage>7</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Knowledge harvesting in the big-data era</article-title>
          .
          <source>In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data</source>
          . pp.
          <fpage>933</fpage>
          -
          <lpage>938</lpage>
          . ACM (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>