<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A bag-of-entities approach to document focus time estimation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dipartimento di Ingegneria dell'Informazione</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Università Politecnica delle Marche</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ancona</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Italy c.morbidoni@univpm.it</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>a.cucchiarelli@univpm.it</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Detecting the document focus time, defined as the time the content of a document refers to, is an important task to support temporal information retrieval systems. In this paper we propose a novel approach to focus time estimation based on a bag-of-entities representation. In particular, we are interested in understanding if and to what extent existing open data sources can be leveraged to achieve focus time estimation. We leverage state-of-the-art Named Entity Extraction tools and exploit links to Wikipedia and DBpedia to derive temporal information relevant to entities, namely years and intervals of years. We then estimate focus time as the point in time that is most relevant to the entity set associated with a document. Our method does not rely on explicit temporal expressions in the documents and is therefore applicable in a general context. We tested our methodology on two datasets of historical events and evaluated it against a state-of-the-art approach, measuring an improvement in average estimation error.</p>
      </abstract>
      <kwd-group>
        <kwd>focus time</kwd>
        <kwd>temporal mining</kwd>
        <kwd>information retrieval</kwd>
        <kwd>bag-of-entities</kwd>
        <kwd>linked data</kwd>
        <kwd>wikipedia</kwd>
        <kwd>dbpedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The growing interest in exploiting the temporal dimension of text documents
to improve information retrieval tasks led to a relatively new field of research,
referred to as Temporal Information Retrieval [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Given that most web search
queries express implicit or explicit temporal needs, a considerable amount of
research in the field has been devoted to answering temporal queries, addressing
both recency-sensitive queries, where the user's need is to get fresh information,
and time-sensitive queries, where the information need is related to a particular
point or period in time. Characterizing documents along a temporal dimension
is therefore important to support temporally aware search engines and spans two
distinct aspects: document creation time and document focus time. While
considerable effort has been made in the literature to address the first one, few works
have investigated document focus time estimation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The focus time of a document is defined as the point in time (instant-based
focus time), or time interval (interval-based focus time), to which the content
of the document refers and is related to the meaning and the semantics of its
content. The concept was introduced in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] where a general methodology to
estimate focus time of text documents is presented. The approach, extended
and further evaluated in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], attempts to detect the focus time of a document
by deriving word-time associations from a large training corpus of on-line news.
      </p>
      <p>To introduce our approach, let us consider the short text document in
Figure 1 as an example. The document mentions a number of named entities and
can be concisely represented as a bag-of-entities. Each entity is associated
in time with years and intervals of years, which are derived from the entity
representations available on the web. Our assumption is that the focus time of
a document is likely to be the point in time relevant to the largest subset of
the entities representing the document.</p>
      <p>In this paper we start from this simple intuition and rely on a bag-of-entities
representation to investigate if and how one can leverage openly available
textual and machine-readable data to extract meaningful entity-time associations
and ultimately develop a fast, unsupervised and reliable method to estimate
document focus time. Our method does not make use of explicit temporal
expressions in documents, but rather aims at determining the focus time based
on the semantics of the content. We think that leveraging named entities and
Linked Data has the clear advantage of avoiding possibly expensive computation
to learn word-time associations. Entity-time associations are in fact derived by
simply parsing Wikipedia articles and DBpedia resources.</p>
      <p>
        In our experiments we used a state-of-the-art NERD tool<sup>1</sup> [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and defined
a set of pragmatic rules to extract and rank relevant dates and time intervals
associated with each detected entity. We then combined these associations to rank
dates and estimate the focus time with a granularity of one year.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1 https://dandelion.eu/semantic-text/entity-extraction-demo/</title>
      <p>
        We evaluate the approach on two datasets extracted from history related
books and web sites, and by comparing our results with the method proposed
in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (B1) and with a simple method based on explicit temporal expressions
in documents (B2). We found that our approach outperforms B1 in average
estimation error and considerably increases recall when compared to B2, since
in real-world datasets explicit dates are not always present.
      </p>
      <p>This paper is organized as follows. In section 2 we discuss related works.
In section 3 we describe the different phases of our method. Then we describe
our experiments in section 4 and provide an evaluation of the approach on test
data in section 5. We conclude the paper and mention possible improvements in
section 6.</p>
      <sec id="sec-2-1">
        <title>2 Related Works</title>
        <p>
          According to [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a document has mainly two temporal dimensions: its
timestamp, or creation time, and its focus time. While the first one has been largely
addressed in literature, e.g., in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], few works address the second. Related works,
such as [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], face the problem of identifying the most relevant temporal expression
in a document. However, relying only on temporal expressions is not a good
strategy as they can be rare in the text or even poorly related to the content
itself.
        </p>
        <p>
          The first work addressing focus time estimation without relying on temporal
expressions, which inspired this research, is [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The authors introduced the notion
of focus time in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and extensively evaluated a generic approach to estimate the
focus time of a document based on weighted associations between terms in the
document and years. Such associations are statistically extracted from a large
news corpus by analyzing sentences that contain explicit temporal expressions
and weighting associations based on the co-occurrences of words and dates in
sentences. In addition, words are weighted with respect to their discriminative
capabilities, using temporal entropy and temporal kurtosis measures, as well
as their relevance, using different techniques, including TextRank [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Finally
the focus time of a document is calculated as the date that maximizes the sum
of associations with all the words in the document.
        </p>
        <p>As the approach is based on a learning phase, one possible limitation is
that it needs a training corpus which properly covers the events occurring in
the documents to be estimated. This means the domain has to be known. For
example, the authors restrict their domain of interest to historical events related
to five countries in the time range 1900-2013. In this paper we investigate an
alternative approach based on a bag-of-entities representation of documents,
where each named entity is linked to Wikipedia and DBpedia. Entity-time
associations are then derived by parsing a relatively small number of documents
(i.e. the Wikipedia and DBpedia entries for all the entities in a document). In
our experiments we aim at understanding whether we can combine state-of-the-art
named entity recognition with knowledge from Wikipedia and DBpedia to
estimate focus time with reasonable precision, without the need for a learning
phase and in a domain-independent manner.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], the authors attempt to temporally classify images on the web by
analyzing the text surrounding them. The work is related to ours in that they
extract named entities from text. However, they use the dates associated with such
entities (extracted from the YAGO knowledge base [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) only to filter explicit
temporal expressions found in the text, removing those not associated with a
mentioned entity.
        </p>
        <p>
          Other works related to this research come from the temporal information
retrieval (TIR) area. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] introduce the notion of temporal clusters, which
are related to our paper as they can be possibly used to associate time spans to
documents. In [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], the authors proposed an approach to identify the most relevant
temporal expressions in a text document. The relevance of the temporal
expressions, detected using the HeidelTime temporal tagger, is calculated with a set
of predefined heuristics based on document and corpus-based features. This task
is related to ours in that identifying relevant time expressions can give clues to
detect the focus time of the document. In general, however, explicit temporal
expressions are not always present.
        </p>
        <p>
          A number of studies have made use of bag-of-entities representations to address
tasks such as document classification [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and clustering [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In the latter work,
a concept thesaurus based on the semantic relations (synonym, hypernym, and
associative relation) extracted from Wikipedia is leveraged to enhance traditional
content similarity measure for text clustering. A bag of concepts is also used in
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to improve the performance of text classification tasks in the biomedical domain.
A bag-of-entities representation, the one used in our methodology, has been
used to improve semantic search engines in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and to drive recommendation
systems in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], where the advantages of this representation in reducing complexity
are highlighted. Named entities extracted from documents were used to detect
temporal patterns of collective attention in online news, Twitter and Wikipedia
page views [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>To the best of our knowledge no research has been conducted on the feasibility
of using bag-of-entities and available online open data (specifically Wikipedia
and DBpedia) to address document focus time estimation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3 Methodology</title>
        <p><bold>3.1 Singling out the bag-of-entities for a document</bold></p>
        <p>
          The first step in our approach is to derive a bag-of-entities to represent the
document. To do so we processed each document via the Dandelion API<sup>2</sup> [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
obtaining a set of spots in the text with links to Wikipedia and DBpedia, along
with a confidence score. Dandelion is an evolution of TAGME [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and, albeit
initially designed for short texts (e.g. micro-blogs), it has proved to be effective
also for longer texts [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2 https://dandelion.eu/docs/api/datatxt/nex/v1/</title>
      <p>The obtained entities are identified by the corresponding Wikipedia page
URL, from which it is straightforward to derive the URL of the corresponding
DBpedia resource. For each entity we collected the plain text of the
corresponding article using the Wikipedia API<sup>3</sup>, and we retrieved the RDF representation
from DBpedia.</p>
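      <p>As an illustration of the URL derivation mentioned above, the following minimal sketch maps an English Wikipedia page URL to a DBpedia resource URI by reusing the page title. The function name and the prefix constant are illustrative assumptions and are not part of the original experimental code; titles containing special characters may need additional encoding care.</p>
      <preformat>
```python
from urllib.parse import urlsplit

# Assumption: DBpedia resource URIs reuse the English Wikipedia page title.
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(wiki_url: str) -> str:
    """Map a Wikipedia article URL to the matching DBpedia resource URI."""
    path = urlsplit(wiki_url).path
    if not path.startswith("/wiki/"):
        raise ValueError(f"not a Wikipedia article URL: {wiki_url}")
    title = path[len("/wiki/"):]
    return DBPEDIA_PREFIX + title


resource = wikipedia_to_dbpedia("https://en.wikipedia.org/wiki/Audrey_Hepburn")
```
      </preformat>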
      <p><bold>3.2 Enriching entities with temporal data</bold></p>
      <p>For each entity we want to retrieve the associated exact dates (years) and time
intervals, and then score them with respect to their relevance. We use a date
granularity of one year.</p>
      <p>We first extract year mentions from the full Wikipedia article and from the
short entity abstract contained in DBpedia<sup>4</sup>. The DBpedia abstract is extracted
from the Wikipedia article and represents a concise summary of the
entity. As the dates in the abstract are generally also included in the full Wikipedia
article, counting both sources in fact boosts dates that appear in the abstract, which we assume
to be more relevant in general. To extract year mentions we tried two options: i)
using the SUTIME temporal tagger<sup>5</sup> and ii) using a simple regular expression<sup>6</sup>. We
observed very few differences in the performance of our method, with the second option
providing slightly better results. For brevity's sake, in this paper we only show
results obtained using our simple regular expression.</p>
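      <p>The regular-expression option can be sketched as follows. The pattern is the one reported in the paper; the function name and the use of a counter are illustrative assumptions. Note that the pattern has no word boundaries, so it also matches the first four digits of longer digit strings.</p>
      <preformat>
```python
import re
from collections import Counter

# Pattern reported in the paper: four digits starting with 1 or 2.
YEAR_PATTERN = re.compile(r"[1-2][0-9][0-9][0-9]")


def year_mentions(text: str) -> Counter:
    """Count how often each candidate year is mentioned in a text."""
    return Counter(int(match) for match in YEAR_PATTERN.findall(text))


counts = year_mentions("Born in 1929, she died in 1993. In 1929 the crash began.")
```
      </preformat>
      <p>The resulting counts play the role of the frequency values used when ranking entity-date associations.</p>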
      <p>As an additional source of relevant years we exploit the DBpedia Linked
Data representation by spotting triples carrying temporal information. We
identified such triples by querying the DBpedia SPARQL endpoint to get all the
properties used in the dataset that have an rdfs:range<sup>7</sup> explicitly declared as a
date type (xsd:date, xsd:dateTime, xsd:gYear, etc.<sup>8</sup>). We counted the occurrences of
each property in the DBpedia graph and selected the ones used at least 10 times,
obtaining a set of 113 properties. For each entity in the document, the values of
such properties are retrieved, and the year mentions they contain are extracted and added to the
candidates.</p>
      <p>
        While single years intuitively represent dates of short or specific events involving
the entity, time intervals represent their periods of activity or existence (e.g.
life of a person, rise and fall of a government, etc.). We attempt to derive time
periods from the Linked Data (LD) representation of the entity, by considering
pairs of properties that identify time intervals. For example, for a resource
of type Person (the most frequent in the datasets), DBpedia represents his/her
period of existence as follows:
      </p>
      <preformat>dbpedia:Audrey_Hepburn dbo:birthDate "1929-05-04".
dbpedia:Audrey_Hepburn dbo:deathDate "1993-01-20".</preformat>
      <p>3 https://en.wikipedia.org/w/api.php</p>
      <p>4 the value of the dbpedia:abstract property</p>
      <p>5 http://nlp.stanford.edu/software/sutime.shtml</p>
      <p>6 We used the following regex: [1-2][0-9][0-9][0-9]</p>
      <p>7 the range of a property is the type of values that it can assume</p>
      <p>8 XML datatypes, https://www.w3.org/TR/xmlschema11-2/</p>
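      <p>A minimal sketch of how time periods could be derived from pairs of temporal properties is given below. The property dictionary, the function names and the pair list are illustrative assumptions (the pair list is only a small subset, not the full set selected for the experiments), and the year parsing assumes literals that start with a four-digit year.</p>
      <preformat>
```python
# Illustrative subset of (start, end) property pairs; an assumption, not the full list.
INTERVAL_PAIRS = [
    ("dbo:birthDate", "dbo:deathDate"),
    ("dbo:foundingYear", "dbo:dissolutionYear"),
]


def year_of(literal: str) -> int:
    """Read the year from a literal such as '1929-05-04' or '1929'."""
    return int(literal[:4])


def time_periods(properties: dict[str, str]) -> list[tuple[int, int]]:
    """Return (start_year, end_year) for every pair present on the entity."""
    periods = []
    for start_prop, end_prop in INTERVAL_PAIRS:
        if start_prop in properties and end_prop in properties:
            start = year_of(properties[start_prop])
            end = year_of(properties[end_prop])
            periods.append((start, end))
    return periods


hepburn = {"dbo:birthDate": "1929-05-04", "dbo:deathDate": "1993-01-20"}
periods = time_periods(hepburn)
```
      </preformat>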
      <p>For our experiments, we manually selected from the list of temporal
properties 14 pairs that indicate time ranges and are used in at least 1000
triples. These include, for example, birthDate/deathDate for persons, and
foundingYear/dissolutionYear for organizations.</p>
      <p><bold>3.3 Ranking entities-dates associations</bold></p>
      <p>As a next step we rank the relations between the entities and the candidate dates
by combining different contributions:</p>
      <list list-type="bullet">
        <list-item><p>The occurrences of a date in the full Wikipedia article corresponding to an entity;</p></list-item>
        <list-item><p>The occurrences of a date in the entity abstract;</p></list-item>
        <list-item><p>The occurrences of a date in the RDF temporal triples of an entity's LD representation;</p></list-item>
        <list-item><p>The matching of a date with the relevant time periods for an entity.</p></list-item>
      </list>
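      <p>We proceed as follows. For each document d in our dataset we define NE<sub>d</sub> =
{e<sub>1</sub>, e<sub>2</sub>, ..., e<sub>n</sub>} as the set of NEs in the document, w(e<sub>i</sub>) as the Wikipedia article
associated with each e<sub>i</sub>, and TW<sub>e<sub>i</sub></sub> = {t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>h</sub>} as the set of dates found in
w(e<sub>i</sub>). Then we calculate the relevance of a date t<sub>j</sub> with respect to w(e<sub>i</sub>) as:
<disp-formula><label>(1)</label><tex-math>\mathit{Wrel}(t_j, e_i) = \frac{\mathit{freq}(t_j, w(e_i))}{\sum_{k=1}^{h} \mathit{freq}(t_k, w(e_i))}</tex-math></disp-formula>
where freq(t<sub>j</sub>, w(e<sub>i</sub>)) is the number of occurrences of t<sub>j</sub> in w(e<sub>i</sub>).</p>
      <p>In the same way, we define TDP<sub>e<sub>i</sub></sub> = {t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>g</sub>} as the set of dates found
in the e<sub>i</sub> entity abstract dp(e<sub>i</sub>), and the relevance of a date t<sub>j</sub> with respect to
dp(e<sub>i</sub>) as:
<disp-formula><label>(2)</label><tex-math>\mathit{DPrel}(t_j, e_i) = \frac{\mathit{freq}(t_j, dp(e_i))}{\sum_{k=1}^{g} \mathit{freq}(t_k, dp(e_i))}</tex-math></disp-formula>
where freq(t<sub>j</sub>, dp(e<sub>i</sub>)) is the number of occurrences of t<sub>j</sub> in dp(e<sub>i</sub>). Likewise, if
TDT<sub>e<sub>i</sub></sub> = {t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>f</sub>} is the set of dates extracted from the RDF triples tr(e<sub>i</sub>)
related to e<sub>i</sub>, the relevance of a date with respect to the data in RDF triples is:
<disp-formula><label>(3)</label><tex-math>\mathit{DTrel}(t_j, e_i) = \frac{\mathit{freq}(t_j, tr(e_i))}{\sum_{k=1}^{f} \mathit{freq}(t_k, tr(e_i))}</tex-math></disp-formula></p>
      <p>We consider as the set of candidate focus times for the document the set T<sub>cand</sub> =
TW<sub>e<sub>i</sub></sub> ∪ TDP<sub>e<sub>i</sub></sub> ∪ TDT<sub>e<sub>i</sub></sub>. The next step is scoring the elements in T<sub>cand</sub> by looking
for matches in the set of time periods extracted from DBpedia, which we call
TPDP<sub>e<sub>i</sub></sub> = {tp<sub>1</sub>, tp<sub>2</sub>, ..., tp<sub>l</sub>}. We thus define the relevance of a date in T<sub>cand</sub>
with respect to the entity time periods in TPDP<sub>e<sub>i</sub></sub> related to e<sub>i</sub> as:
<disp-formula><label>(4)</label><tex-math>\mathit{TPrel}(t_j, e_i) = \frac{\sum_{k=1}^{l} \mathit{in}(t_j, tp_k)}{l}</tex-math></disp-formula>
where in(t<sub>j</sub>, tp<sub>k</sub>) is equal to 1 if and only if the date t<sub>j</sub> is included in the time interval
tp<sub>k</sub> and 0 otherwise, and l is the dimension (number of elements) of the set TPDP<sub>e<sub>i</sub></sub>.</p>
      <p>Now we can combine (1), (2), (3) and (4) to define the cumulative measure of
relevance of the date t<sub>j</sub> with respect to the entity e<sub>i</sub>:
<disp-formula><label>(5)</label><tex-math>\mathit{Rel}(t_j, e_i) = \alpha\,\mathit{Wrel}(t_j, e_i) + \beta\,\mathit{DPrel}(t_j, e_i) + \gamma\,\mathit{DTrel}(t_j, e_i) + \delta\,\mathit{TPrel}(t_j, e_i)</tex-math></disp-formula>
α, β, γ and δ sum to 1 and are parameters used to weight the contribution
given by each component of (5).</p>
      <p>Finally, the relevance of a date t<sub>j</sub> for a given document d is defined as the
normalized sum of its relevance for each entity in the document. Thus we use
the following ranking function for the candidate dates:
<disp-formula><label>(6)</label><tex-math>\mathit{DRel}(t_j, d) = \frac{\sum_{i=1}^{n} \mathit{Conf}(e_i)\,\mathit{Rel}(t_j, e_i)}{n}</tex-math></disp-formula>
where Conf(e<sub>i</sub>) is the level of confidence assigned by the NERD tool to e<sub>i</sub>
and n is the dimension (number of elements) of the set NE<sub>d</sub>. We then take as
the estimated focus time of the document d the date in T<sub>cand</sub> with the highest value
of DRel(t<sub>j</sub>, d).</p>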
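      <p>To make the ranking procedure concrete, the following sketch combines the four contributions into a per-entity relevance and then selects the best candidate year for a document. It is a simplified illustration under stated assumptions: the dictionary layout of an entity, the function names, the equal weights and the tie-breaking rule (earliest year) are all hypothetical, and the period relevance is normalized by the number of time periods of the entity.</p>
      <preformat>
```python
def normalize(counts):
    """Turn raw year counts into relative frequencies (Wrel, DPrel, DTrel)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {year: count / total for year, count in counts.items()}


def period_relevance(year, periods):
    """TPrel: fraction of the entity's time periods that contain the year."""
    if not periods:
        return 0.0
    hits = sum(1 for start, end in periods if end >= year >= start)
    return hits / len(periods)


def entity_relevance(year, entity, weights):
    """Rel: weighted sum of the four contributions for one entity."""
    alpha, beta, gamma, delta = weights
    return (
        alpha * normalize(entity["wiki"]).get(year, 0.0)
        + beta * normalize(entity["abstract"]).get(year, 0.0)
        + gamma * normalize(entity["triples"]).get(year, 0.0)
        + delta * period_relevance(year, entity["periods"])
    )


def focus_time(entities, weights=(0.25, 0.25, 0.25, 0.25)):
    """DRel: return the candidate year with the highest score, or None."""
    candidates = set()
    for entity in entities:
        for source in ("wiki", "abstract", "triples"):
            candidates.update(entity[source])
    if not candidates:
        return None
    n = len(entities)

    def drel(year):
        total = sum(e["conf"] * entity_relevance(year, e, weights) for e in entities)
        return total / n

    # Sorting first makes ties resolve to the earliest year.
    return max(sorted(candidates), key=drel)


# Hypothetical entities with made-up counts, for illustration only.
entities = [
    {"wiki": {1929: 2, 1993: 1, 1961: 1}, "abstract": {1929: 1, 1993: 1},
     "triples": {1929: 1, 1993: 1}, "periods": [(1929, 1993)], "conf": 0.9},
    {"wiki": {1961: 3, 1958: 1}, "abstract": {1961: 1},
     "triples": {1961: 1}, "periods": [], "conf": 0.8},
]
best = focus_time(entities)
```
      </preformat>
      <p>In this toy example the year 1961 obtains the highest score, because it is supported by the second entity's article, abstract and triples and also falls within the first entity's period of existence.</p>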
      <sec id="sec-3-1">
        <title>4 Experimental settings</title>
        <p><bold>4.1 Datasets</bold></p>
        <p>
          To evaluate the proposed approach we used the methodology described in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
as a first baseline (referred to as B1 from now on). B1 was tested on document
sets extracted from web sites and digital editions of history books, focusing on
events related to five countries. Unfortunately, the original datasets are not available,
so we created our two test datasets by accessing the same sources and following the
steps described in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. We call the original datasets D1 and D2.
        </p>
        <p>
<bold>Web dataset.</bold> We collected paragraphs from three web sites reporting main
events related to the five countries. The web sites we considered are the same as
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]: History Orb<sup>9</sup>, History World<sup>10</sup>, BBC Timelines<sup>11</sup>, and Infoplease<sup>12</sup>. In
order to reproduce a dataset as similar as possible to D1, we used the Wayback
Machine<sup>13</sup> to access the snapshots of the websites recorded in January 2015, which
is when they were accessed to create D1. However, we found the average length of
paragraphs (3.6 sentences) to be considerably smaller than that reported for D1
(18.3 sentences). As the performance of our method degrades on shorter documents,
due to the smaller number of entities matched in a short
document, we believe this difference does not favour our approach against B1.
The Web Dataset contains 1007 documents referring to events in the time span
1900-2015.
        </p>
        <p>
          <bold>Books dataset.</bold> We collected text paragraphs from the two digital history
books used in D2: Timeline of World History (Kerr, 2011) and Timelines of
History (Ratnikas, 2012), following the procedure described in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The resulting
dataset has an average document length of 40 sentences (against 43 in D2) and
the average event year is 1959 (against 1982 in D2). Such a difference should not
favor our method as - see [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] - the baseline method performs better for more
recent events. The Books Dataset contains 747 documents referring to events in
the time span 1900-2015.
        </p>
        <p><bold>4.2 Named entities spotting</bold></p>
        <p>
The Dandelion API, which we used to match named entities, provides a confidence
score (from 0 to 1) indicating how sure the system is about the association.
Errors in matching and disambiguating entities clearly affect the final results. To
mitigate this problem we filtered out entities with a confidence score below a
threshold value th. We experimented with different values of th and manually
checked the quality of the entity matches on 20 randomly selected documents. We
found 0.7 to be the optimal value. For this value of th, we found an average
of 8.9 entity matches per document and a total of 8958 entity
matches in the Web dataset. The Books dataset, where documents are on
average longer, has 92.8 entity matches per document on average and 69303 entity
matches in total.</p>
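      <p>The confidence filter described above can be sketched as follows. The annotation dictionaries and their keys are illustrative assumptions about how the output of an entity linker might be represented; only the threshold value of 0.7 comes from the experiments.</p>
      <preformat>
```python
def bag_of_entities(annotations, threshold=0.7):
    """Keep only the entity matches whose confidence reaches the threshold."""
    return [a for a in annotations if a["confidence"] >= threshold]


# Hypothetical annotations for a very short document.
annotations = [
    {"spot": "Japanese government", "confidence": 0.44},
    {"spot": "Imukai", "confidence": 0.26},
]
kept = bag_of_entities(annotations)
```
      </preformat>
      <p>When no entity survives the filter, as in this example, the document has an empty bag-of-entities and no focus time can be estimated for it.</p>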
        <p><bold>4.3 Evaluation measures and parameter settings</bold></p>
        <p>
In order to compare the proposed approach with the baseline methodology we
use average error to measure results. The estimation error is computed as the
absolute value of the difference between the estimated focus time and the ground
truth. The ground truth date for a document can be, in our datasets, a single
year or a range of years. In the latter case, as done in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], the estimation error is
calculated as the distance between the estimated year and the closest boundary
of the range, or zero if the estimated year is included in the range. We thus adopt
the following formula to represent the error e(t), where t is the estimated year and
[t<sub>b</sub>, t<sub>e</sub>] is the ground truth range:
<disp-formula><tex-math><![CDATA[e(t) = \begin{cases} \min\{|t_b - t|,\ |t - t_e|\} & \text{if } t \notin [t_b, t_e], \\ 0 & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
We then compute the average error over all the documents in the datasets.
        </p>
        <p>9 http://www.historyorb.com</p>
        <p>10 http://www.historyworld.net</p>
        <p>11 http://www.bbc.co.uk/history</p>
        <p>12 http://www.infoplease.com</p>
        <p>13 http://archive.org/web/
        </p>
        <p>In order to tune the parameters of our algorithm we randomly selected 10%
of the documents from both our datasets as a training set and used the remaining documents as
a test set. We experimented on the training set with several different parameter
settings, then we evaluated our method on the test set using the one providing
the lowest average error. In the selected configuration, α and one other weight are set to
0.1666, and the remaining two weights are set to 0.3333.</p>
        <p>In addition to the average error, we also measured precision, that is the
number of documents correctly annotated (e(t) = 0) with respect to the ground
truth date, divided by the total number of documents.</p>
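        <p>The evaluation measures can be sketched as follows. The function names are illustrative assumptions; a single-year ground truth is treated as a range whose two boundaries coincide.</p>
        <preformat>
```python
def estimation_error(estimated, truth_start, truth_end=None):
    """Distance in years between the estimate and the ground truth year or range."""
    if truth_end is None:
        truth_end = truth_start
    if truth_end >= estimated >= truth_start:
        return 0
    return min(abs(truth_start - estimated), abs(estimated - truth_end))


def average_error(errors):
    """Mean estimation error over a collection of documents."""
    return sum(errors) / len(errors)


def precision(errors):
    """Share of documents whose estimation error is exactly zero."""
    return sum(1 for error in errors if error == 0) / len(errors)


errors = [
    estimation_error(1950, 1945),        # single-year ground truth
    estimation_error(1950, 1945, 1955),  # estimate inside the range
    estimation_error(1940, 1945, 1955),  # estimate before the range
    estimation_error(1955, 1955),        # exact match
]
```
        </preformat>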
      </sec>
      <sec id="sec-3-2">
        <title>5 Results and discussion</title>
        <p>In Table 1 we show the results of our experiments, where our method is referred
to as BOE (Bag of Entities). In the table we compare them with two baselines:
B1 and B2. B2 uses explicit dates mentioned in the documents. Given that we are
interested in processing each document independently<sup>14</sup>, we simply calculated
the explicit-dates score by assigning e(t) = 0 if the document contains the ground
truth year (or a year in a ground truth time span). Otherwise we computed e(t)
considering the date in the document which is closest to the ground truth as the
estimated focus time.</p>
        <p>
          14 i.e. we do not consider co-occurrences of dates as well as of entities in the test corpus
for our estimations
        </p>
        <p>
          As shown, our method outperforms B1 with respect to average error. We
remind the reader, however, that we ran the algorithm on datasets that are
not identical to the ones used to evaluate B1. Even if they have been collected
from the same sources and following the same methodology, our datasets contain
shorter documents. As expected, our method performs better on longer text (e.g.
the books dataset), where more named entities can be spotted. No evaluation of
B1 is provided in [
B1 is provided in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] with respect to precision. With our method the precision for
the books dataset is relatively high, while it is substantially lower for the web dataset,
due to the shorter length of the documents, which are often composed of a
single short sentence. This is also the reason why the number of non-estimated
documents (failed column in Table 1) is considerably higher in this dataset.
Non-estimated documents are usually very short documents where the entities
spotted have a low level of confidence. An example is "Japanese government
of Imukai forms", where the two entities Japanese government and Imukai are
matched with a confidence of 0.44 and 0.26, respectively. On the other hand, B1
estimates all documents, whereas B2, even though it provides a low average error
(especially in the web dataset), fails to estimate more than 76% of the documents,
as few explicit dates are present on average in our test documents. To take into
account failed estimations, in the columns marked with + we report the average
error and the precision on the whole collection, including the non-estimated documents.
For such documents we have set the estimation error to half of the corpus time
interval, which is the error we would obtain by arbitrarily choosing the middle
of the interval as focus time.
        </p>
        <p>In Table 2 we compare the results obtained with four different settings, each
considering only the contribution of one element of formula (5). The method
marked as wiki-only weights time associations based only on the appearance of
a date in the Wikipedia article (α = 1; β, γ, δ = 0). Similarly, abst-only considers
only the abstract from DBpedia (β = 1; α, γ, δ = 0), triples-only considers only
explicit dates found in DBpedia triples (γ = 1; α, β, δ = 0), and, finally,
periods-only scores dates only with respect to their inclusion in entity-related time
intervals detected in DBpedia (δ = 1; α, β, γ = 0).</p>
        <p>We measure average error and precision with respect to the two distinct types
of ground truth we have: single years and time spans. Precision for single years
in the books dataset is 75.6%, while for time spans it is only 31.5%. This is
unexpected, as in general we would expect it to be easier to match a period than
a precise year. However, we notice that in the books dataset documents marked
with a ground truth time span are rare (21 out of 747) and the intervals are
quite short (3 years on average). On the other hand, in the web dataset, where
we have 166 documents out of 1007 with a ground truth time span and the
average length of such intervals is around 10 years, the precision is considerably
higher for time spans than for single dates. Results clearly indicate that the
combination of the different contributions increases the performance both in
average error and precision when compared to single components, with the only
exception of wiki-only, which provides better precision in the books dataset in
the case of time spans (GTint).</p>
      </sec>
      <sec id="sec-3-3">
        <title>6 Conclusions and outlook</title>
        <p>In this paper we proposed a methodology for estimating focus time of a document
based on named entities and temporal information extracted from Wikipedia and
DBpedia. The methodology is designed to provide unsupervised and domain-agnostic
focus time estimation.</p>
        <p>Results are encouraging and demonstrate that leveraging bag-of-entities is a
good strategy for addressing focus time estimation.</p>
        <p>
          In the future, we are planning to investigate how the approach could be
integrated with [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and with approaches based on explicit temporal expressions to
improve performance, as well as evaluating the use of additional temporal data
sources, e.g. YAGO.
        </p>
        <p>
          Finally we remark that the methodology evaluated here performs single year
focus time estimation only. Estimating the focus time as intervals of dates, as
done in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], as well as investigating finer granularity than years, is left for
future work.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>7 Acknowledgments</title>
        <p>A special thanks goes to SpazioDati<sup>15</sup> for supporting our research by granting
access to the Dandelion API. This work was supported by the GramsciSource
FIRB project<sup>16</sup> funded by the Italian Ministry of Education, Universities and
Research (MIUR).</p>
        <p>15 http://www.spaziodati.eu/</p>
        <p>16 http://gramsciproject.org</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gertz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Clustering and exploring search results using timeline constructions</article-title>
          .
          <source>In: CIKM'09</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Campos</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dias</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jorge</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jatowt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Survey of temporal information retrieval and related applications</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>47</volume>
          (
          <issue>2</issue>
          ) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Campos</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jorge</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dias</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nunes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Disambiguating implicit temporal queries by clustering top relevant dates in web snippets</article-title>
          .
          <source>In: IEEE/WIC/ACM International Conference on Web Intelligence, WI'12</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Caputo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Boosting a semantic search engine by named entities</article-title>
          .
          <source>In: ISMIS'09</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cornolti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferragina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciaramita</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A framework for benchmarking entity-annotation systems</article-title>
          .
          <source>In: WWW'13</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ferragina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scaiella</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities)</article-title>
          .
          <source>In: CIKM'10</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anido Rifon</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Biomedical literature classification using encyclopedic knowledge: A Wikipedia-based bag-of-concepts approach</article-title>
          .
          <source>PeerJ</source>
          <volume>2015</volume>
          (
          <issue>9</issue>
          ) (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hoffart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suchanek</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berberich</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>194</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Enhancing text clustering by leveraging Wikipedia semantics</article-title>
          .
          <source>In: SIGIR'08</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jatowt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Au Yeung</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanaka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Estimating document focus time</article-title>
          .
          <source>In: CIKM'13</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jatowt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Au Yeung</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanaka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Generic method for detecting focus time of documents</article-title>
          .
          <source>Information Processing and Management</source>
          <volume>51</volume>
          (
          <issue>6</issue>
          ) (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kanhabua</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nørvåg</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Improving temporal language models for determining time of non-timestamped documents</article-title>
          .
          <source>In: ECDL'08</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kuchař</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kliegr</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Bag-of-entities text representation for client-side (video) recommender systems</article-title>
          .
          <source>In: Workshop on Recommender Systems for Television and online Video (RecSysTV), at RecSys'14</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spaniol</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doucet</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Temporal reconciliation for dating photographs using entity information</article-title>
          .
          <source>In: Workshop on Exploiting Semantic Annotations in Information Retrieval, at CIKM'15</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarau</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>TextRank: Bringing order into texts</article-title>
          .
          <source>In: EMNLP'04</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Rizzo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Erp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Benchmarking the extraction and disambiguation of named entities on the semantic web</article-title>
          .
          <source>In: LREC'14</source>
          . Reykjavik, Iceland (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Scaiella</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prestia</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Del Tessandoro</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ver</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barbera</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmesan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>DataTXT at #Microposts2014 challenge</article-title>
          .
          <source>In: Workshop on Making Sense of Microposts, at WWW'14</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Stilo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morbidoni</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cucchiarelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velardi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Capturing users' information and communication needs for the press officers</article-title>
          .
          <source>In: Workshop on Social Media for Personalization and Search, at ECIR'17</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Strötgen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gertz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Identification of top relevant temporal expressions in documents</article-title>
          .
          <source>In: Temporal Web Analytics Workshop, at TempWeb'12</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Strötgen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gertz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multilingual and cross-domain temporal tagging</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>47</volume>
          (
          <issue>2</issue>
          ) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Concept-based short text classification and ranking</article-title>
          .
          <source>In: CIKM'14</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>