<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BlogNEER: Applying Named Entity Evolution Recognition on the Blogosphere</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Helge Holzmann</string-name>
          <email>holzmann@L3S.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nina Tahmasebi</string-name>
          <email>ninat@chalmers.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Risse</string-name>
          <email>risse@L3S.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science &amp; Engineering Department, Chalmers University of Technology</institution>
          ,
          <addr-line>412 96 Gothenburg</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Holzmann</institution>
          ,
          <addr-line>Tahmasebi, Risse</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>L3S Research Center</institution>
          ,
          <addr-line>Appelstr. 9, 30167 Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>28</fpage>
      <lpage>39</lpage>
      <abstract>
        <p>The introduction of Social Media allowed more people to publish texts by removing barriers that are technical but also social, such as the editorial controls that exist in traditional media. The resulting language tends to be more like spoken language because people adapt their use to the medium. Since spoken language is more dynamic, more new and short-lived terms are also introduced in written format on the Web. In [1] we presented an unsupervised method for Named Entity Evolution Recognition (NEER) to find name changes in newspaper collections. In this paper we present BlogNEER, an extension to apply NEER on blog data. The language used in blogs is often closer to spoken language than to language used in traditional media. BlogNEER introduces a novel semantic filtering method that makes use of Semantic Web resources (i.e., DBpedia) to gain more information about terms. We present the approach of BlogNEER and initial results that show the potential of the approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Evolution</kwd>
        <kwd>Blogs</kwd>
        <kwd>Semantic Web</kwd>
        <kwd>DBpedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The introduction of new technology changes the way we express ourselves [2]. In
Social Media, like blogs, everyone can publish content, discuss, comment, rate,
and re-use content from anywhere with minimal effort. The constant availability
of computers and mobile devices allows communicating with little effort, few
restrictions, and increasing frequency. As there are no requirements for formal
or correct language, authors can change their language use dynamically. Under
these circumstances we expect people to adapt their language to the means of
communication by using more creative language and unconventional spellings.
Also, words which might otherwise have been reserved for use only in
conversations between friends can be introduced in written text.</p>
      <p>These changes lead to a more dynamic language where new and short-lived
terms are also introduced in written format. Local as well as global language
trends can spread via forums on the Web to a larger audience. This shortened
gap between written "Web Language" and spoken language, coupled with the
inherent dynamics of spoken language, leads to the introduction of new terms
and high dynamics also in written language.</p>
      <p>With the increasing efforts in documenting and preserving the public view
on certain events and topics like the Financial Crisis or the Olympic Games,
there is also an increasing need to make use of this content. Turning
user-generated content into valuable information requires a better "understanding" of
the content. A system that is aware of this knowledge can support information
retrieval by augmenting the query term. Awareness of language evolution is
particularly important for search tasks in archives due to the different ages of
the involved texts.</p>
      <p>Language evolution is a broad area and covers many sub-classes like word
sense evolution, term to term evolution, named entity evolution and spelling
variations. In [1] we presented our approach for Named Entity Evolution
Recognition (NEER). NEER is an unsupervised method to find name changes without
using external knowledge sources. As an example consider Pope Benedict XVI,
formerly known as Joseph Ratzinger. NEER can detect such changes in a
high-quality newspaper dataset that reports this evolution by analyzing co-occurring
terms.</p>
      <p>In this paper we present a first extension of NEER towards "Web Language"
by adapting and applying the method to blog content. The language used in
blogs is often closer to spoken language than to language used in traditional
media [3]. BlogNEER, an extension of NEER that introduces a novel semantic
filtering method, makes use of semantic resources (here DBpedia, as an example)
to gain more information about terms.</p>
      <p>In the following section we present the related work in the field of named
entity evolution. In Section 3 we give an introduction to NEER and motivate
BlogNEER. Section 4 explains our novel filtering method utilizing external
resources from the Semantic Web. In Section 5 we describe our experiments and
show an example. Section 6 concludes the work and gives an outlook on future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Previous work on automatic detection of language evolution has mainly focused
on named entity evolution. The interest has mainly been from an information
retrieval point of view as search results can be affected by named entity evolution.</p>
      <p>Berberich et al. [4] proposed a solution to this problem by reformulating a
query into terms prevalent in the past. They measure the degree of relatedness
between two terms when used at different times by comparing the contexts as
captured by co-occurrence statistics. This approach requires a recurrent
computation each time a query is submitted, as it requires a target time for the query
reformulations, which reduces efficiency and scalability. The results presented in
their paper are "anecdotal" (to use the words of the authors) and thus do not
provide a basis for comparison. However, because of the promising results we
use the same method for defining a context.</p>
      <p>Kaluarachchi et al. [5] propose to discover semantically identical concepts
(or named entities) used at different times. They discover these changing
entities using association rule mining by associating distinct entities to events.
Sentences containing a subject, a verb, objects, and nouns are targeted and the
verb is interpreted as an event. Two entities are considered semantically related
if their associated event is the same and the event occurs multiple times in a
document archive. The temporally related term of a given named entity is used
for query translation (or reformulation) and results are retrieved appropriately
w.r.t. specified time criteria. They present precision and recall for three queries
and evaluate only indirectly on the basis of retrieved documents.</p>
      <p>Kanhabua et al. [6] define a time-based synonym as a term semantically
related to a named entity at a particular time period. They extract synonyms of
named entities from link anchor texts in Wikipedia articles using the full history.
The paper evaluates the precision and recall of the time-based synonyms by
measuring increased precision and recall in search results rather than directly
evaluating the quality of the found synonyms.</p>
      <p>In more recent work, Mazeika et al. [7] consider semantically similar
entities from different time periods. They extract named entities from the YAGO
ontology and provide a visual analytics tool to analyze the evolution of named
entities of the New York Times Annotated Corpus. No name changes are tracked,
but the tool offers a visualization of the evolution of an entity in relation to
other entities.</p>
    </sec>
    <sec id="sec-3">
      <title>Named Entity Evolution Recognition</title>
      <p>The NEER approach addresses the problem of automatically detecting named
entity evolution. It works unsupervised and without incorporating external
resources. This section gives an overview of NEER and its limitations on blog
data.</p>
      <sec id="sec-3-1">
        <title>Definitions</title>
        <p>We consider a term <italic>w<sub>i</sub></italic> to be a single or multi-word lexical representation of an
entity at time <italic>t<sub>i</sub></italic>. The context <italic>C<sub>w<sub>i</sub></sub></italic> is the set of all terms related to <italic>w<sub>i</sub></italic> at time <italic>t<sub>i</sub></italic>.
Similar to Berberich et al. [4], we consider the most frequently co-occurring terms
within a distance of <italic>k</italic> words as the context; however, other contexts can be used.
We consider a change period to be a period of time in which one term evolves
into another. We consider temporal co-references to be different lexical
representations that have been used to reference the same concept or entity at
different periods in time. Direct temporal co-references are temporal
co-references that are variations of each other with some lexical overlap. Indirect
temporal co-references are temporal co-references that lack lexical overlap on
the token level. A temporal co-reference class contains all direct temporal
co-references for a given named entity, denoted as <italic>coref<sub>r</sub></italic>{<italic>w</italic><sub>1</sub>, <italic>w</italic><sub>2</sub>, ...}. Each
temporal co-reference class is represented by a class representative <italic>r</italic> which is
also a member of the class. For example, Joseph Ratzinger is the
representative of the co-reference class containing the terms {Joseph Ratzinger, Cardinal
Ratzinger, Cardinal Joseph Ratzinger, ...}.</p>
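        <p>The context definition above can be illustrated with a minimal sketch. It assumes a pre-tokenized text and single-token terms; the function name <monospace>context</monospace> and the parameters <monospace>k</monospace> and <monospace>top_n</monospace> are illustrative and not taken from the NEER implementation.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def context(tokens, term, k=3, top_n=5):
    """Return the top_n tokens that co-occur most often within k tokens of term."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        lo = max(0, i - k)
        # k tokens to the left and k tokens to the right of this occurrence
        window = tokens[lo:i] + tokens[i + 1:i + 1 + k]
        counts.update(w for w in window if w != term)
    return [w for w, _ in counts.most_common(top_n)]


# Toy input, purely for illustration.
tokens = "pope benedict visits rome and pope benedict meets bishops in rome".split()
print(context(tokens, "benedict", k=2, top_n=3))
```
]]></preformat>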
      </sec>
      <sec id="sec-3-2">
        <title>Overview of NEER</title>
        <p>The major steps of the Named Entity Evolution Recognition (NEER) approach
are depicted in Figure 1. NEER utilizes change periods for finding named entity
evolution. These periods are identified by detecting high-frequency bursts of
an entity, which are considered to indicate a change period. Texts from the
year around a burst are used for collecting the co-reference candidates by
extracting the relevant terms. These are used to build up contexts represented
as graphs. Based on the contexts, four rules are applied to find direct
co-references among the extracted terms. These are merged into co-reference classes
as follows:</p>
        <list list-type="order">
          <list-item><p>Prefix/suffix rule: Terms with the same prefix/suffix are merged (e.g., Pope Benedict and Benedict).</p></list-item>
          <list-item><p>Sub-term rule: Terms are merged if all words of one term are contained in the other term (e.g., Cardinal Joseph Ratzinger and Cardinal Ratzinger).</p></list-item>
          <list-item><p>Prolong rule: Terms having an overlap are merged into a longer term (e.g., Pope John Paul and John Paul II are merged to Pope John Paul II).</p></list-item>
          <list-item><p>Soft sub-term rule: Terms with similar frequency are merged as in rule 2, but regardless of the order of the words.</p></list-item>
        </list>
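        <p>Rules 2 to 4 above can be sketched as follows. This is an illustrative reading, not the original implementation: it assumes whitespace tokenization, treats rule 2 as order-preserving containment, and omits the frequency condition of rule 4. All function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def sub_term(a, b):
    """Rule 2: all words of the shorter term occur, in order, in the longer term."""
    short, long_ = sorted((a.split(), b.split()), key=len)
    remaining = iter(long_)
    # "word in remaining" consumes the iterator, so word order is respected
    return all(word in remaining for word in short)


def soft_sub_term(a, b):
    """Rule 4 without the frequency check: containment regardless of word order."""
    words_a, words_b = set(a.split()), set(b.split())
    return words_a.issubset(words_b) or words_b.issubset(words_a)


def prolong(a, b):
    """Rule 3: if the end of a overlaps the start of b, return the merged longer term."""
    words_a, words_b = a.split(), b.split()
    for n in range(min(len(words_a), len(words_b)), 0, -1):
        if words_a[-n:] == words_b[:n]:
            return " ".join(words_a + words_b[n:])
    return None  # no overlap, nothing to merge


print(sub_term("Cardinal Joseph Ratzinger", "Cardinal Ratzinger"))
print(prolong("Pope John Paul", "John Paul II"))
```
]]></preformat>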
        <p>Ultimately, the graphs are consolidated by means of the co-reference
classes. Afterwards, filtering methods filter out false co-references that do not
refer to the query term. For this purpose, statistical as well as machine learning
(ML) based filters were introduced. A comparison of the methods revealed their
strengths and weaknesses in increasing precision while keeping a high recall. The
ML approach performed best, with precision and recall of more than
90%. While it is possible to deliver a high accuracy with NEER + ML, training
the needed ML classifier requires manual labelling.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Limitations of NEER Applied on Blog Data</title>
        <p>Tahmasebi et al. [3] showed that language in blog texts behaves differently than
traditional written language. Blog language is much more dynamic and closer to
spoken language than written language in traditional media. Therefore, we treat
blog texts differently than texts from newspapers.</p>
        <p>Fig. 1. Pipeline used to detect temporal co-references [1]. Stages shown in the figure: Identify Change Periods (Burst Detection), Extract Text, NLP Processing, Context Creation, Finding Temporal Co-References, Filtering Co-References.</p>
        <p>The machine learning filter, which delivered the best results in the NEER
experiments, achieved a precision of more than 90% by filtering out falsely detected
co-references. Applying the approach to blog data leads to much wider contexts,
containing many unrelated terms due to the large amount of relatively low-quality
texts. Therefore, the NEER filtering methods would have a much lower effect.</p>
        <p>NEER makes no use of external resources like DBpedia since the main
development goal was to apply it to historical document collections. Incorporating
the Semantic Web allows us to filter out falsely detected names using semantic
information, which is reasonable when working with data from the Web, like
blogs.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Semantic Filtering Approach</title>
      <p>Semantic Filtering is a novel a-posteriori filtering method for NEER
incorporating the Semantic Web. With this approach we use external data
from DBpedia, as an example, to augment a term with semantic information. Employing this
information, we are able to filter out names that do not refer to the same entity. Two terms
referring to entities of different types or categories cannot be evolutions of each
other. A-posteriori means we apply this filter after applying NEER to our dataset
given a query term and one or more change periods. At this step we have access
to the NEER results, which consist of a collection of indirect co-references and a
co-reference class for the query term comprising the direct co-references. Using
this filter, all co-references that can be identified as names for other entities
are filtered out.</p>
      <p>The semantic filter incorporates semantic information from DBpedia, which
is structured as resources. A resource on DBpedia is the structured
representation of a Wikipedia page, which is automatically extracted as described
by Bizer et al. [8]. While an ambiguous name can refer to multiple resources,
every resource has its own unique name and every name only points to one
resource directly. This is realized by using disambiguation resources. E.g.,
Apple (disambiguation) is the disambiguation resource of the resources Apple
(the fruit) and Apple Inc. Unlike in this example, disambiguation resources do not
always have the "disambiguation" suffix. However, every resource has
properties, which point either to a textual or numeric value or to another resource.
Disambiguation resources can be identified by the existence of disambiguation
properties that point to their corresponding unambiguous resources.</p>
      <p>Other properties which are important for our work are the types of resources
as well as subjects, which can be conceived as categories. In addition to the
property relations (resource → property → value), DBpedia also provides the
inverse relations (value → is-property-of → resource). These can help to detect
ambiguous resources where the corresponding disambiguation resource points to
the ambiguous one (e.g., Apple (disambiguation) disambiguates Apple).</p>
      <p>By mapping a query term as well as all of its co-references (direct and
indirect) to DBpedia resources we can augment the terms with semantic properties.
These properties can help to filter out false positive results derived by NEER as
new names for the entity. It is important to mention that we only make use of
descriptive properties and do not utilize already known name evolution
information and co-references from DBpedia. In this paper we focus on a term's types
and subjects, but also make use of redirects and disambiguations. Although in
some cases redirects also represent a name change by redirecting an old name
to its new name, we do not use this information explicitly. Hence, we treat all
terms separately, even if they redirect to the same resource, as if no
redirection were available (e.g., for Czechoslovakia and Czech Republic or Slovakia).</p>
      <sec id="sec-6-1">
        <title>Disambiguation and Aggregation of Properties</title>
        <p>To map a term to a DBpedia resource, we replace spaces with underscores and
append it to the DBpedia resource URI (e.g., for "Project Natal" the resource URI
becomes http://dbpedia.org/resource/Project_Natal). In case we are able to
resolve a term to a resource, we fetch all property relations as well as the inverse
relations and save them in a lookup table. In this table, every property is indexed
twice: by the complete property URI (e.g.,
http://www.w3.org/1999/02/22-rdf-syntax-ns#type, rdf:type for short) and by the name extracted from the URI
(e.g., type). In the lookup table, every property for a term points to a list
of values, either URIs or strings for textual/numeric values. By indexing the
property names in addition to the unique identifiers we are able to retrieve
a list of all types independently of the ontology used. This is important
since some resources have the same properties assigned from different ontologies
(e.g., http://dbpedia.org/property/type in addition to rdf:type from the
example above). By indexing these using their name (i.e., type), we unify them to the
same property.</p>
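        <p>The mapping and the double indexing can be sketched as follows. The sketch works on property/value pairs that have already been fetched; how they are retrieved from DBpedia is left open. The helper names and the example values are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def resource_uri(term):
    """Replace spaces with underscores and append the term to the resource URI."""
    return DBPEDIA_RESOURCE + term.replace(" ", "_")


def property_name(uri):
    """Short name of a property: the last URI segment after '#' or '/'."""
    return uri.replace("#", "/").rstrip("/").rsplit("/", 1)[-1]


def index_properties(pairs):
    """Index every (property URI, value) pair by full URI and by short name."""
    table = defaultdict(list)
    for prop, value in pairs:
        table[prop].append(value)
        name = property_name(prop)
        if name != prop:
            table[name].append(value)
    return table


# Hypothetical, hand-written pairs standing in for fetched DBpedia data.
pairs = [
    ("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://dbpedia.org/ontology/Person"),
    ("http://dbpedia.org/property/type", "Cleric"),
]
table = index_properties(pairs)
print(resource_uri("Project Natal"))
print(table["type"])
```
]]></preformat>
        <p>Both properties end up under the single key <monospace>type</monospace>, which is the unification across ontologies described above.</p>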
        <p>After mapping the found terms to their corresponding resources, we follow
four strategies to extend and disambiguate their semantic meanings. The first
strategy is to follow DBpedia redirections if present. The second strategy is to
explore disambiguation resources for ambiguous terms that do not redirect to a
disambiguation resource. The remaining two strategies disambiguate ambiguous
terms.</p>
        <p>Redirection Strategy. Redirections are realized on DBpedia by a redirection
property (i.e., http://dbpedia.org/ontology/wikiPageRedirects,
dbpedia-owl:wikiPageRedirects for short). This is assigned to the resource that is supposed to
redirect to another. We leverage this by fetching the resource the property points
to (see Figure 2). Redirects are followed recursively. During this procedure we fetch
and index all newly found properties and aggregate them. The rationale behind
this is that, in case there is a redirection pointing to another resource, the target is
supposed to give a better entity description. Therefore, it represents the same
entity and its properties belong to the entity as well.</p>
        <p>[Figure 2: redirection example; the only recoverable label is "redirects".]</p>
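        <p>A minimal sketch of the recursive redirection strategy, assuming the resources are available as an in-memory dictionary rather than fetched from DBpedia. The resource names and the cycle guard are illustrative additions.</p>
        <preformat><![CDATA[
```python
REDIRECT = "wikiPageRedirects"


def follow_redirects(resource, graph, seen=None):
    """Aggregate the properties of a resource and of all resources it redirects to."""
    seen = set() if seen is None else seen
    if resource in seen or resource not in graph:
        return {}
    seen.add(resource)  # guards against redirect cycles
    merged = {}
    for prop, values in graph[resource].items():
        merged.setdefault(prop, []).extend(values)
    for target in graph[resource].get(REDIRECT, []):
        for prop, values in follow_redirects(target, graph, seen).items():
            merged.setdefault(prop, []).extend(values)
    return merged


# Hypothetical toy data: Old_Name redirects to New_Name, which carries the types.
graph = {
    "Old_Name": {REDIRECT: ["New_Name"]},
    "New_Name": {"type": ["Device"], "subject": ["Category:Example"]},
}
print(follow_redirects("Old_Name", graph))
```
]]></preformat>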
        <p>Ambiguation Strategy. If a resource has an ambiguous meaning, it mostly
points to a disambiguation resource using the dbpedia-owl:wikiPageRedirects
property. In this case, we apply the first strategy (redirection). However, there
are ambiguous resources that do not redirect. For instance, the resource
Apple (i.e., http://dbpedia.org/resource/Apple) represents the fruit, even though
Apple is an ambiguous term. The disambiguation resource for Apple is
Apple (disambiguation), but there is no redirection between these two. Instead,
Apple (disambiguation) uses the dbpedia-owl:wikiPageDisambiguates property to
point to its non-ambiguous resources, like Apple (the fruit).</p>
        <p>To discover ambiguous terms, we analyze all inverse disambiguation relations
of a resource and follow them backwards if there is a relation originating in a resource
with the exact same name as the original term, but with the suffix
"(disambiguation)" appended (see Figure 3). Unlike for the redirection, we do not collect all
properties. Instead, we only keep the properties of the disambiguation resource,
because the original term might not be the one we are interested in (e.g., Apple,
the fruit).</p>
        <p>[Figure 3: the resource Apple and its disambiguation resource Apple (disambiguation).]</p>
        <p>Direct Disambiguation Strategy. If a disambiguation resource has been
identified, we need to decide on one of the suggested resources as a representation
for the entity name under consideration. In case one of the candidates proposed
by DBpedia is also a direct co-reference of the term, we take this one, as shown
in the example in Figure 4. The term we try to resolve in the example is Pope
Benedict. The corresponding disambiguation resource proposes all popes with
the name Benedict up to XVI. Since Pope Benedict XVI is a direct co-reference in
the co-reference class of Pope Benedict derived by NEER, we follow this resource
as described for our redirection strategy and aggregate its properties with the
properties that have been fetched so far.</p>
        <p>[Figure 4: Pope Benedict resolved to Pope Benedict XVI (direct co-reference).]</p>
        <p>Indirect Disambiguation Strategy. For the disambiguation of terms for
which we do not have a direct co-reference as disambiguation candidate, we make
use of indirect co-references derived by NEER for that term. Using these indirect
co-references <italic>ind</italic><sub>1</sub>, <italic>ind</italic><sub>2</sub>, ... we form a term vector. Additionally, a term vector
is formed for each disambiguation candidate based on the property values of the
corresponding resource. These vectors consist of the frequencies of every
indirect co-reference occurring in the property values: (<italic>freq</italic>(<italic>ind</italic><sub>1</sub>), <italic>freq</italic>(<italic>ind</italic><sub>2</sub>), ...).
Similar to [9], we calculate the cosine similarity between two vectors to measure
which resource fits the term in our context best. That resource will be selected as
the semantic representation for the ambiguous term. This procedure is illustrated
in Figure 5.</p>
        <p>[Figure 5: Apple (disambiguation) resolved to Apple Inc. (cos 0.8) using the indirect co-references iPad, MacBook and Microsoft.]</p>
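        <p>The indirect disambiguation can be sketched as a cosine comparison. Two points are assumptions of this sketch: the term vector weights each indirect co-reference with 1, and frequencies are counted by case-insensitive substring matching in the property values. The candidate data is invented for illustration.</p>
        <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors; 0.0 if one is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_vector(indirect, property_values):
    """(freq(ind_1), freq(ind_2), ...) counted over a candidate's property values."""
    text = " ".join(property_values).lower()
    return [text.count(term.lower()) for term in indirect]


def pick_candidate(indirect, candidates):
    """Return the best-matching candidate name and all similarity scores."""
    term_vector = [1] * len(indirect)  # assumption: uniform weights
    scores = {
        name: cosine(term_vector, candidate_vector(indirect, values))
        for name, values in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Invented property values, for illustration only.
indirect = ["iPad", "MacBook", "Microsoft"]
candidates = {
    "Apple Inc.": ["maker of the iPad and the MacBook", "competitor of Microsoft"],
    "Apple (fruit)": ["fruit of the apple tree"],
}
best, scores = pick_candidate(indirect, candidates)
print(best, scores)
```
]]></preformat>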
      </sec>
      <sec id="sec-6-2">
        <title>Filtering</title>
        <p>After the disambiguation and aggregation of properties from DBpedia we
proceed with the filtering. We consider the properties type and subject regardless of their
ontology or namespace (i.e., URI), as described in Section 4.1. We treat DBpedia
under the open world assumption. That means the fact that a resource does not have
a certain property does not mean that the corresponding entity does not have
the property either. The resource has perhaps just not been annotated with the
property. However, if a resource has a certain property, we consider its values to be
complete. For instance, if a resource is annotated with types, we assume these
are all the types it has and there is no type missing.</p>
        <p>Similarity Filtering. The first filter we apply to the result set of co-references
derived by NEER compares the similarity of the query term with its co-reference
candidates based on their types and subjects from DBpedia. We compare the
sets of types and subjects of the query term with the sets of each co-reference, direct
and indirect. This only works if the query term, or rather its corresponding DBpedia
resource, has been annotated with types or subjects at all. Otherwise,
this filtering method is not applicable. The same holds for the co-references. It
would be wrong to consider two terms as referring to different entities just because
one of them has not been annotated with types or subjects while the other one
has (open world assumption, see above). In this case we treat them as correct
co-references for the query term and keep them in our result set. In case the query
term's resource and the resource of the co-reference under consideration have
both been annotated with types or subjects, we require them to have at least
one type and/or subject in common. To check this requirement, we compute the
intersections of their type sets as well as their subject sets. In case one of the
set intersections is empty, we consider the two terms as different and filter out
those co-references. Otherwise, we keep them in our result set and pass them to
the type filter.</p>
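        <p>One possible reading of the similarity filter is sketched below: a property is only compared when both resources are annotated with it, and a single empty intersection is enough to reject the candidate. The dictionary layout and the function name are assumptions of this sketch.</p>
        <preformat><![CDATA[
```python
def similarity_filter(query, candidate):
    """Return True if the candidate is kept as a co-reference of the query term.

    query and candidate map 'type' and 'subject' to sets of annotations;
    a missing or empty entry means the resource is not annotated.
    """
    for prop in ("type", "subject"):
        q, c = query.get(prop), candidate.get(prop)
        if not q or not c:
            continue  # open world: a missing annotation is no evidence
        if not set(q).intersection(c):
            return False  # both annotated, nothing in common
    return True


pope = {"type": {"Person", "Cleric"}, "subject": {"Popes"}}
cardinal = {"type": {"Person"}}             # no subjects annotated
company = {"type": {"Company"}, "subject": {"Popes"}}
unknown = {}                                 # not found or not annotated

print(similarity_filter(pope, cardinal))
print(similarity_filter(pope, company))
```
]]></preformat>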
        <p>Type Filtering. Unlike the similarity filter, the type filter considers
hierarchies of types in addition to the types a resource is directly annotated with. For
instance, both Pope Benedict XVI and Barack Obama are persons (resources
of type dbpedia-owl:Person). Therefore, the similarity filter would not have
filtered out one of them as a co-reference of the other. However, Pope Benedict XVI
is of type dbpedia-owl:Cleric while Barack Obama is annotated with
dbpedia-owl:OfficeHolder. Both types are sub-types of Person. Thus, the two terms refer
to different kinds of persons on DBpedia and most likely do not correspond to
the same entity.</p>
        <p>To achieve this filtering we need to analyze the sub-class relations of all
types assigned to a resource. Each type on DBpedia is represented as a URI
that points to a resource of that type. To obtain the hierarchy of a type, we
leverage the rdfs:subClassOf property (i.e., http://www.w3.org/2000/01/rdf-schema#subClassOf)
of the resource. This points to its super-type and allows us to
perform this procedure recursively until there is no rdfs:subClassOf property
available or no resource corresponding to a type's URI exists.</p>
        <p>After we have fetched the hierarchies for all types top-down, starting from a
type and fetching the super-types, we analyze them bottom-up. For all types
that the query term and its potential co-reference have in common we compare
all of their sub-types. For instance, for Pope Benedict XVI and Barack Obama,
having type Person in common, we compare their sub-types of type Person:
Cleric and OfficeHolder. As these are different, we consider the two terms not
to refer to the same entity and do not keep the
co-reference candidate in our result set. In case they are equivalent we proceed
with the next sub-type. This is done recursively as long as both terms have
sub-types in common or until they are not annotated with further sub-types.</p>
        <p>The open world assumption holds again if the terms under consideration have
a type in common but only one of them has been annotated with a further sub-type.
As we cannot tell whether the sub-type is missing on the other DBpedia
resource or the entity is actually not an instance of that type, we do not filter
out that co-reference and keep it in the final result set.</p>
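        <p>The type filter can be sketched on a small, hand-written hierarchy. The sketch assumes that the rdfs:subClassOf relations have already been fetched into a child-to-parent dictionary and that "different sub-types" means disjoint sets of sub-types. Names and data are illustrative.</p>
        <preformat><![CDATA[
```python
def with_ancestors(types, parent):
    """Annotated types plus all super-types reachable via the parent map."""
    closed = set()
    for t in types:
        while t is not None and t not in closed:
            closed.add(t)
            t = parent.get(t)
    return closed


def type_filter(query_types, candidate_types, parent):
    """Return True if the candidate is kept as a co-reference of the query term."""
    q = with_ancestors(query_types, parent)
    c = with_ancestors(candidate_types, parent)
    for common in q.intersection(c):
        q_sub = {t for t in q if parent.get(t) == common}
        c_sub = {t for t in c if parent.get(t) == common}
        # Open world: only reject when both sides have sub-types and none match.
        if q_sub and c_sub and q_sub.isdisjoint(c_sub):
            return False
    return True


# Toy hierarchy following the example in the text (child -> super-type).
PARENT = {"Cleric": "Person", "OfficeHolder": "Person"}

print(type_filter({"Cleric"}, {"OfficeHolder"}, PARENT))
print(type_filter({"Cleric"}, {"Person"}, PARENT))
```
]]></preformat>
        <p>Common sub-types are themselves common types, so the loop also covers the deeper levels of the hierarchy without explicit recursion.</p>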
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Experiments</title>
      <p>For our experiments we created a Ruby implementation of NEER and added the
introduced extensions for BlogNEER. For the entity extraction we used a Ruby
implementation of the Lingua English Tagger by Coburn [10].</p>
      <p>
        For the evaluation we created two datasets. The techblog dataset consists
of five popular tech blogs covering five years from 2008 to 2013, fetched from
Google Reader: TechCrunch, Gizmodo, SlashGear, Ubergizmo and
GottaBeMobile. For the general blog dataset we fetched the top 100 blogs from nine different
categories (sports, autos, science, business, politics, entertainment, technology,
living, green), based on the ranking of Technorati [11], also from Google Reader.
In addition, we used the Blogs08 TREC dataset, described by Ounis et al. [12].
      </p>
      <p>Prior to creating contexts with NEER we applied frequency filtering to
avoid feeding NEER with too many noisy terms. Those terms often do not have
a corresponding DBpedia resource and thus cannot be filtered out by
the semantic filter with similarity or type filtering, so they remain as noise in the
end result. Applying the frequency filter led to much better results by keeping
the contexts smaller.</p>
      <p>To demonstrate the results of BlogNEER we use the term "Kinect" as an
example. "Kinect" is the name of a gaming accessory from Microsoft. During its
development it was known under the name "Project Natal" until the
announcement of Kinect in June 2010. We used that month as the change period and
applied the frequency as well as the semantic filter to our results. The following
set of terms is a result containing both direct and indirect co-references without
semantic filtering:</p>
      <p>After applying the semantic filter (see Section 4.2) we get an improved result
set:</p>
      <p>Project Natal, Microsoft Kinect</p>
      <p>Due to the preliminary stage of our research, we are unable to compare
precision and recall. However, in recent experiments we already reached a recall
similar to the recall we achieved with our baseline, NEER on the New York Times
dataset [1]. Even though the precision was still lower due to noise, the semantic
filter helped with filtering out false positives as shown in the example above. In
case the noise consists of misspelled, informal or rarely used terms, which are not
known in DBpedia, we are not able to filter them out using semantic filtering.
In future work we will tackle this problem by using advanced frequency filtering
methods.</p>
      <p>Our results also indicated how differently NEER behaves on blog data.
Although both datasets consist of blogs, we observed much less noise in the
general blog dataset, which is specialized in certain categories, than in the arbitrary,
unspecialized and partly private blogs of the TREC dataset. Our experiments, even
if not yet final, already indicate the impact of frequency and semantic filtering.
We were already able to reduce the noise and achieve a consistently high recall.</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusions and Future Work</title>
      <p>To apply the NEER method to the Blogosphere we proposed BlogNEER, an
extension to the original approach. BlogNEER uses a novel a-posteriori filtering
method incorporating the Semantic Web. The semantic filter applied to the
results of NEER increased the precision by making use of data from DBpedia.
Using properties like types and subjects (i.e., categories) we are able to keep
apart terms that refer to different entities. Therefore, we can filter out names
that refer to a different entity than the query term and thus cannot be a new
name.</p>
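      <p>The idea of an a-posteriori semantic filter can be sketched as follows. This is
an illustrative sketch under stated assumptions, not the BlogNEER implementation: the
function names, the Jaccard overlap on subjects, the min_similarity threshold and the
in-memory resources dictionary (a stand-in for DBpedia lookups, filled with made-up
toy values) are all hypothetical.</p>
      <preformat>
```python
def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def semantic_filter(query, candidates, resources, min_similarity=0.3):
    """Keep candidates that may refer to the same entity as the query term.

    resources maps a term to a dict with "types" and "subjects" sets.
    It is a hypothetical stand-in for data looked up in DBpedia.
    """
    query_res = resources.get(query)
    if query_res is None:
        # Without a resource for the query there is nothing to compare against.
        return list(candidates)
    kept = []
    for term in candidates:
        res = resources.get(term)
        if res is None:
            # Unknown terms cannot be filtered semantically and remain as noise.
            kept.append(term)
            continue
        shares_type = bool(query_res["types"].intersection(res["types"]))
        similar = jaccard(query_res["subjects"], res["subjects"]) >= min_similarity
        if shares_type and similar:
            kept.append(term)
    return kept


# Made-up toy values for illustration only, not real DBpedia data.
resources = {
    "Kinect": {"types": {"Device"}, "subjects": {"Motion controllers", "Game accessories"}},
    "Project Natal": {"types": {"Device"}, "subjects": {"Motion controllers", "Game accessories"}},
    "Wii Remote": {"types": {"Device"}, "subjects": {"Other accessories"}},
    "Redmond": {"types": {"Place"}, "subjects": {"Cities"}},
}
candidates = ["Project Natal", "Wii Remote", "Redmond", "kinnect"]
result = semantic_filter("Kinect", candidates, resources)
print(result)
```
      </preformat>
      <p>In this toy run the candidate with a different type and the one with
non-overlapping subjects are dropped, while the misspelled term without a resource
passes through.</p>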
      <p>We presented a first evaluation, and a simple example showed the potential
of BlogNEER. However, to further reduce the noise we will need to filter terms
a-priori before they are processed by BlogNEER. We are also planning to
incorporate additional web resources into BlogNEER as well as to make use of other
web-specific features, for instance tags.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>[1]</label>
        <mixed-citation>Nina Tahmasebi, Gerhard Gossen, Nattiya Kanhabua, Helge Holzmann, and Thomas Risse. <article-title>NEER: An unsupervised method for named entity evolution recognition</article-title>. In <source>Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012)</source>, Mumbai, India, December <year>2012</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>[2]</label>
        <mixed-citation>Y.H. Segerstad. <article-title>Use and adaptation of written language to the conditions of computer-mediated communication</article-title>. PhD thesis, Göteborg University, <year>2002</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>[3]</label>
        <mixed-citation>Nina Tahmasebi, Gerhard Gossen, and Thomas Risse. Which words do you [...]ory and Practice of Digital Libraries, volume <volume>7489</volume>, pages <fpage>32</fpage>–<lpage>37</lpage>. Springer, <year>2012</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>[4]</label>
        <mixed-citation>Klaus Berberich, Srikanta J. Bedathur, Mauro Sozio, and Gerhard Weikum. <article-title>Bridging the terminology gap in web archive search</article-title>. In <source>WebDB</source>, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>[5]</label>
        <mixed-citation>Amal Chaminda Kaluarachchi, Aparna S. Varde, Srikanta J. Bedathur, Ger- [...] CIKM, pages <fpage>1789</fpage>–<lpage>1792</lpage>. ACM, <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>[6]</label>
        <mixed-citation>Nattiya Kanhabua and Kjetil Nørvåg. <article-title>Exploiting time-based synonyms in searching document archives</article-title>. In <source>Proceedings of the 10th annual joint conference on Digital libraries, JCDL '10</source>, pages <fpage>79</fpage>–<lpage>88</lpage>, New York, NY, USA, <year>2010</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>[7]</label>
        <mixed-citation>Arturas Mazeika, Tomasz Tylenda, and Gerhard Weikum. <article-title>Entity timelines: visual analytics and named entity evolution</article-title>. In <source>CIKM</source>, pages <fpage>2585</fpage>–<lpage>2588</lpage>, <year>2011</year>. ISBN 978-1-4503-0717-8. doi: 10.1145/2063576.2064026. URL http://doi.acm.org/10.1145/2063576.2064026.</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>[8]</label>
        <mixed-citation>Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian [...]lization point for the web of data. <source>J. Web Sem.</source>, <volume>7</volume>(<issue>3</issue>):<fpage>154</fpage>–<lpage>165</lpage>, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>[9]</label>
        <mixed-citation>A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. <article-title>Preliminary results in tag disambiguation using DBpedia</article-title>. In Knowledge Capture (K-Cap 2009) - [...] <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>[10]</label>
        <mixed-citation>Aaron Coburn. Lingua::EN::Tagger - search.cpan.org. (accessed October 27, 2009), <year>2008</year>. URL http://search.cpan.org/perldoc?Lingua::EN::Tagger.</mixed-citation>
      </ref>
      <ref id="ref11">
        <label>[11]</label>
        <mixed-citation>Technorati Inc. (accessed June 05, 2013), <year>2013</year>. URL http://www.technorati.com.</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>[12]</label>
        <mixed-citation>Iadh Ounis, Craig Macdonald, and Ian Soboroff. <article-title>Overview of the TREC-2008 blog track</article-title>. In <source>Proceedings of TREC-2008</source>, <year>2009</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>