<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ontology Guided Purposive News Retrieval and Presentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abir Naskar</string-name>
          <email>abir.naskar@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rupsa Saha</string-name>
          <email>rupsa.s@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lipika Dey</string-name>
          <email>lipika.dey@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tirthankar Dasgupta</string-name>
          <email>dasgupta.tirthankar@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TCS Innovation Lab</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present a purposive News information retrieval and presentation system that curates information from News articles collected from multiple trusted sources for a given domain. A back-end domain ontology provides details about the concepts and relations of interest. We propose an attention based CNN-BiLSTM model to classify sentence tokens as ontology concepts or entities of interest. These entities are then curated and used to link articles to illustrate evolution of events over time and regions. Working systems are initiated with small annotated data sets which are later augmented with humans in the loop. It is easily customizable for various domains.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>News consumption is no longer restricted to consuming
a set of facts dished out by a specific agency. Readers
are choosing not only the type of content they want to
read but also how they read it. Increasing interest in social
statistics is also turning News into a source of data for
generating these statistics. Unlike social media, News from
trusted sources is reliable. News presentation is
therefore undergoing a sea change. Along with a bird's-eye
view of global events, the ability to delve deep
into specific stories along various dimensions and to
watch their evolution is necessary to enable systematic
studies.</p>
      <p>In this paper, we present an Ontology-guided News
information retrieval and presentation system. The
uniqueness of the proposed system lies in the use of
a back-end domain ontology that specifies the
entities and relations of interest in a domain, based on
which information components are extracted and
classified using a deep neural network architecture. These
concepts are used to create domain-specific "purposive
indices" to aid concept-oriented information retrieval
rather than simple word-based retrieval.</p>
      <p>A seed ontology of concepts, along with a few
instances of each concept, is used to create annotated
data to train a concept classification model. This model is
applied over a larger set of articles, the result of which
is then validated through human evaluation and used
subsequently to enhance the initial model. It is found
that the proposed method takes much less time and
effort to create annotated data sets for situations that
lack the large labeled data needed to exploit deep-learning
methods. Information components extracted from the
News articles are stored in indexed repositories for
downstream analytics. The results are presented to
the end-user through an innovative interactive
interface that helps in consuming information at multiple
levels of granularity.</p>
      <p>Overview of Proposed News Retrieval and Presentation Framework</p>
      <p>Figure 1 presents an overview of the proposed News
information retrieval system. A number of News
crawlers are deployed to collect News from dedicated
and reliable sources. For each article its meta-data like ...</p>
      <p>[Figure 1, panels (A) to (C). Pipeline components: Web Crawler,
News Docs, Seed Ontology, Annotated Corpus, Train Classifier,
Classification Model, Induced Repository, Entity Aggregation &amp; News
Linking, News Retrieval. Ontology concepts: Safety Incident, Violation,
Target Org, Penalty, Loc., Governing Agency, Accused, Charges, Crime,
Investigation, Victim, Law Enf; ontology relations: occursAt, DueTo, By,
On, propose, locatedAt, Commits, Against, Brings, ResultsIn, CarriesOut.
Network layers: Word Embeddings, CNN Layer, Bi-LSTM, Attention Pooling,
Fully Connected Linear layer with sigmoid Activation, and Output labels
such as Accused, None, Crime and Victim.]</p>
      <p>The ontology is composed of a schema that contains
different domain concepts (C) and a set of generic
relations (R) between these concepts. Figure 1 illustrates
a pair of ontologies for two different domains, namely
crime and occupational health and safety. Even given an
ontology that elucidates the basic components of a
domain and their underlying relationships in a generic
way, finding similar information components in vast
collections of text remains a challenging task, since
the manifestation of these components in natural
language can be extremely varied. For example, though
the names of a criminal and a victim can be extracted as
named entities from text, establishing their roles
unambiguously is not a simple task. Classifying an
instance of a concept correctly requires deep contextual
analysis. We propose the use of an attention-pooling
convolutional Bi-LSTM neural network architecture
for this task. Details are given in the
next subsection.</p>
      <p>Purposive indices for news articles are created
using the ontology concepts extracted from text. For
example, crime News articles are indexed by criminal
and victim names, section names, location of crime
etc. News articles reporting safety incidents are
similarly indexed by incident names, location
of incident, penalty incurred in different currencies etc.
Linking of related articles is done through the
purposive indices only. It may be noted that not all concepts
for the same event may be obtained at once. New
concepts can get associated as a single story unfolds
over time. Similarly, multiple isolated articles may get
linked to each other at later stages. The purposive
indices are also used to generate comparative and
aggregate statistics over various dimensions. The
visualization module enables the end-user to view News
articles at various levels of granularity.</p>
      <p>Ontology-guided Concept and Entity Detection using C-RNN Network</p>
      <p>Convolutional neural networks (CNN) exploit local
dependencies, while recurrent networks like Long Short-Term
Memory (LSTM) capture long-distance
dependencies among features. The proposed Convolutional
Bi-LSTM (C-RNN) model combines both
capabilities. For a given sentence, the network learns to
assign an ontology concept label to each word. The
input to the network is a sequence of word embeddings
with 100 dimensions each.</p>
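      <p>As a small illustration of the per-word labelling described above, the sketch below groups consecutive words that carry the same concept label into (concept, phrase) pairs. The function name group_labels, the "None" label string and the hand-written label sequence are assumptions made for illustration only; in the system the labels would come from the trained network.</p>
      <preformat><![CDATA[
```python
def group_labels(tokens, labels, none_label="None"):
    """Merge consecutive tokens that share a concept label into (concept, phrase) pairs."""
    entities = []
    current_label, current_tokens = None, []
    for token, label in zip(tokens, labels):
        # Extend the running phrase while the concept label stays the same.
        if label == current_label and label != none_label:
            current_tokens.append(token)
            continue
        # The label changed: flush the phrase collected so far, if any.
        if current_tokens:
            entities.append((current_label, " ".join(current_tokens)))
        if label == none_label:
            current_label, current_tokens = None, []
        else:
            current_label, current_tokens = label, [token]
    if current_tokens:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


# Hand-labelled example (illustrative, not actual model output).
sentence = ("Sanjeev Khanna has been taken into custody in Kolkata "
            "on charges of killing Sheena Bora")
tokens = sentence.split()
labels = ["Criminal"] * 2 + ["None"] * 11 + ["Victim"] * 2
entities = group_labels(tokens, labels)
```
]]></preformat>
      <p>A limitation of this simple grouping is that two distinct entities of the same concept class that appear next to each other would be merged into one phrase.</p>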
      <p>A convolutional layer is first used to extract local
n-gram features. All word embeddings are concatenated
to form an embedding matrix <inline-formula><tex-math>M \in \mathbb{R}^{d \times |V|}</tex-math></inline-formula>, where
|V| is the vocabulary size and d is the embedding
dimension. The matrix is divided into k regions. In each
region, we apply the convolution function
<disp-formula><label>(1)</label><tex-math>\mathrm{Conv}(x_{i:w}) = W \cdot x_{i:w} + b</tex-math></disp-formula>
to calculate the output features, where W and b
are the weights that the network learns. We apply the
same convolution operation repeatedly over the
different matrix regions to get multiple output feature
vectors.</p>
      <p>The output of the CNN layer is passed to
the bidirectional LSTMs, which read it both backward
and forward to take care of dependencies on the past
neighbours as well as future long-distance
dependencies. The Bi-LSTM layer is followed by an attention
pooling layer over the sentence representations.
Attention modules have been shown to boost accuracy
for tasks like sentiment or activity detection by
learning to focus more on certain linguistic elements than on
others, without increasing computational complexity.
In our case, attention pooling achieves higher
accuracy by learning the specific characteristics
surrounding each concept in the form of weights associated with
the output of the Bi-LSTM layer. This is represented
as:
<disp-formula><label>(2)</label><tex-math>a_i = \tanh(W_a \cdot h_i + b_a)</tex-math></disp-formula>
<disp-formula><label>(3)</label><tex-math>\alpha_i = \frac{e^{w_\alpha \cdot a_i}}{\sum_j e^{w_\alpha \cdot a_j}}</tex-math></disp-formula>
<disp-formula><label>(4)</label><tex-math>O = \sum_i \alpha_i \cdot h_i</tex-math></disp-formula>
where <inline-formula><tex-math>W_a</tex-math></inline-formula> and <inline-formula><tex-math>w_\alpha</tex-math></inline-formula> are a weight matrix and a weight vector
respectively, <inline-formula><tex-math>b_a</tex-math></inline-formula> is the bias vector, <inline-formula><tex-math>a_i</tex-math></inline-formula> is the attention vector for
the i-th sentence, and <inline-formula><tex-math>\alpha_i</tex-math></inline-formula> is the attention weight of the i-th
sentence.</p>
      <p>The output of the attention layer is then passed
to a fully connected linear layer with sigmoid
activation. The mapping of the linear layer after applying
the sigmoid activation function is given by
<disp-formula><label>(5)</label><tex-math>y = s(x) = \mathrm{sigmoid}(w \cdot x + b)</tex-math></disp-formula>
where x is the input vector, w is the weight vector,
and b is the bias value. Finally, the loss function is
computed using the cross-entropy loss defined by
<disp-formula><label>(6)</label><tex-math>L = -\sum_{i=1}^{2} \hat{y}_i \log(y_i)</tex-math></disp-formula>
where <inline-formula><tex-math>\hat{y}</tex-math></inline-formula> is the one-hot representation of the actual
label for the input word and y is the output of Equation (5). To avoid over-fitting, we
apply the dropout technique at each layer to regularize
our model.</p>
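      <p>The attention pooling step can be made concrete with a minimal pure-Python sketch. It assumes the standard formulation in which a tanh projection of each Bi-LSTM output is scored against a weight vector, the scores are normalised with a softmax, and the outputs are summed with those weights. The function name attention_pool, the argument names and the toy values are illustrative assumptions, not the authors' implementation.</p>
      <preformat><![CDATA[
```python
import math


def attention_pool(h, W_a, b_a, w_alpha):
    """Attention pooling over a list of Bi-LSTM output vectors h.

    a_i     = tanh(W_a . h_i + b_a)
    alpha_i = exp(w_alpha . a_i) / sum_j exp(w_alpha . a_j)
    O       = sum_i alpha_i * h_i
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    scores = []
    for h_i in h:
        a_i = [math.tanh(z + b) for z, b in zip(matvec(W_a, h_i), b_a)]
        scores.append(sum(w * a for w, a in zip(w_alpha, a_i)))

    # Subtract the largest score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    dim = len(h[0])
    pooled = [sum(alpha * h_i[j] for alpha, h_i in zip(alphas, h))
              for j in range(dim)]
    return alphas, pooled


# Toy values chosen only to exercise the function.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]
b_a = [0.0, 0.0]
alphas, pooled = attention_pool(h, W_a, b_a, w_alpha=[1.0, 1.0])
```
]]></preformat>
      <p>With a zero weight vector every position receives the same weight, so the pooled output reduces to the mean of the Bi-LSTM outputs; this is a convenient sanity check for any implementation of the layer.</p>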
      <p>The model is initially built from a small
annotated corpus, in which the instances of the ontology
concepts are tagged with their respective labels. This
model, when applied over a larger corpus, yields new
instances of each label, which are evaluated by
humans and accepted for next-level training if two
out of three annotators agree on the
label. Over repeated experiments on different domains,
the inter-annotator agreement is found to be around
0.65, which is fairly high. This process can be repeated multiple
times, though we have restricted it to two iterations.</p>
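      <p>The acceptance rule for human-validated instances can be sketched as below. The sketch assumes that each annotator supplies a label for a predicted instance and that the instance is kept when at least two of them match the model's prediction; the function names and the sample candidates are hypothetical.</p>
      <preformat><![CDATA[
```python
def is_accepted(predicted_label, annotator_labels, min_agree=2):
    """True if at least min_agree annotators confirm the predicted label."""
    agreeing = sum(1 for label in annotator_labels if label == predicted_label)
    return agreeing >= min_agree


def select_for_retraining(candidates, min_agree=2):
    """Keep (phrase, label) pairs that pass the agreement rule.

    candidates: iterable of (phrase, predicted_label, annotator_labels).
    """
    return [(phrase, predicted)
            for phrase, predicted, annotator_labels in candidates
            if is_accepted(predicted, annotator_labels, min_agree)]


# Hypothetical model predictions with three annotator judgements each.
candidates = [
    ("Sheena Bora", "Victim", ["Victim", "Victim", "Criminal"]),
    ("Kolkata", "Victim", ["None", "None", "Victim"]),
]
accepted = select_for_retraining(candidates)
```
]]></preformat>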
      <p>As discussed earlier, classification of concepts is
more complex than merely identifying named entities.
It is imperative to also recognize the role played by
the entity. Two example sentences are presented
here along with the concepts extracted.</p>
      <p>"Sanjeev Khanna has been taken into custody in
Kolkata on charges of killing Sheena Bora": the
concepts extracted are &lt;Criminal, Sanjeev Khanna&gt; and
&lt;Victim, Sheena Bora&gt;. Kolkata is, correctly, not labeled as any
concept.</p>
      <p>"US Department of Labor's fines Heat Seal $95,000
for 15 health violations": the concepts extracted are
&lt;Company, Heat Seal&gt; and &lt;Penalty, $95,000&gt;.</p>
      <p>Linking Articles to Indicate News Story Evolution</p>
      <p>A link between two News articles is created only if they
share purposive indices for specific concept classes. For
example, for crime incidents, victim names and
criminal names should overlap, while for safety incidents
organization name and safety incident type should
overlap.</p>
      <p>The first challenge comes in the form of entity
resolution, since named entities are spelled differently in
different sources. An edit-distance-based measure
[MV93] is used to compare different entities and
combine them if sufficiently similar. For example, an
individual was variably referred to as "Mukherjee",
"Mukerjea" and "Mookerjee" across various sources in our
crime news database.</p>
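      <p>A minimal sketch of such a comparison is given below. It uses plain Levenshtein distance divided by the length of the longer name, which is a simplification and not the normalized edit distance algorithm of [MV93]; the function names and the 0.6 threshold are assumptions chosen so that the three spellings above resolve to one entity.</p>
      <preformat><![CDATA[
```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def same_entity(a, b, threshold=0.6):
    """Hypothetical resolution rule: merge names at or above the threshold."""
    return name_similarity(a, b) >= threshold


variants = ["Mukherjee", "Mukerjea", "Mookerjee"]
all_merge = all(same_entity(x, y) for x in variants for y in variants)
```
]]></preformat>
      <p>In practice the threshold would need tuning per domain, since a value that merges spelling variants can also merge genuinely different short names.</p>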
      <p>The second challenge comes from the fact that the
sets of conceptual instances for a single story also
evolve over time. For example, it is observed that for
a long-drawn crime incident, new articles report new
names as criminals or even victims, as new
information pours in. A concept overlap threshold of 80% for
specified types is used to link the articles into a single
story.</p>
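      <p>The overlap test can be sketched as follows. The source does not say how overlap is normalised, so this sketch assumes the shared (concept, instance) pairs are divided by the size of the smaller set, which lets a later article add new names without breaking the link. The function names are illustrative, and the names "Person X", "Person Y" and "Person Z" are placeholders.</p>
      <preformat><![CDATA[
```python
def concept_overlap(index_a, index_b, concept_types):
    """Fraction of shared (concept, instance) pairs, relative to the smaller set.

    index_a, index_b: dicts mapping a concept name to a set of instances.
    concept_types: the concept classes that must overlap for this domain.
    """
    pairs_a = {(c, v) for c in concept_types for v in index_a.get(c, ())}
    pairs_b = {(c, v) for c in concept_types for v in index_b.get(c, ())}
    if not pairs_a or not pairs_b:
        return 0.0
    return len(pairs_a & pairs_b) / min(len(pairs_a), len(pairs_b))


def same_story(index_a, index_b, concept_types, threshold=0.8):
    """Link two articles when the overlap reaches the 80% threshold."""
    return concept_overlap(index_a, index_b, concept_types) >= threshold


crime_types = ("Criminal", "Victim")
early = {"Criminal": {"Sanjeev Khanna"}, "Victim": {"Sheena Bora"}}
later = {"Criminal": {"Sanjeev Khanna", "Person X"}, "Victim": {"Sheena Bora"}}
unrelated = {"Criminal": {"Person Y"}, "Victim": {"Person Z"}}
```
]]></preformat>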
      <p>The third challenge is that mere
overlap of the names of entities is not enough;
their corresponding roles also need to be the same, or at least
similar, for us to consider them to refer to the same
case.</p>
      <p>Considering the above challenges, we propose a
weighted similarity computation to determine the
similarity of two articles. The highest weight is given to
candidates that are resolved to be similar and also
belong to the same concept class. Candidates that are
similar after resolution but are identified as instances
of different classes are given a lower weight. The
final similarity measure is computed as a weighted sum
of all candidate similarities. Two articles are
considered similar if the similarity is above a user-defined
threshold.</p>
      <p>Apart from regular entity- and concept-based retrieval
and tracking the evolution of a News story, the proposed
system also enables the user to explore the evolution
of an incident across different dimensions like time and
space. Figure 2 illustrates how News evolution is
presented. The leftmost panel shows a series of crime
news items reporting an unsolved murder case, gathered
across different times and sources. These have
been linked together by the earlier algorithm using the
purposive indices extracted by the C-RNN classifier.
On selecting a particular crime news item, the extracted
entities and events are shown in the middle panel. The
top-right chart shows the temporal distribution of the
reports over the period 2015 to 2018. The bottom-right
chart displays how crime entities have changed
over time. While Murder is prevalent over the
entire timeline, new crime incidents like money
laundering have emerged as reported co-crimes at later stages.
One can also explore the presence or absence of similar
incidents across different geographical regions. This can
unearth regional affinities for certain kinds of acts.
Visualizations also help in studying aberrations, like
different charges invoked for similar crimes or variability
in penalty rates for similar safety incidents, through
canned analytics.</p>
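      <p>Returning to the linking step, the weighted similarity computation for a pair of articles can be sketched as below. The weights 1.0 and 0.5, the normalisation by the number of entities in the first article, and the function names are all assumptions made for illustration; the source only states that same-class matches get the highest weight, cross-class matches a lower one, and that the result is compared with a user-defined threshold.</p>
      <preformat><![CDATA[
```python
SAME_CLASS_WEIGHT = 1.0   # hypothetical weight: same entity, same concept class
CROSS_CLASS_WEIGHT = 0.5  # hypothetical weight: same entity, different class


def article_similarity(entities_a, entities_b, same_name):
    """Weighted similarity of two articles.

    entities_a, entities_b: lists of (concept, name) pairs.
    same_name: callable deciding whether two names resolve to one entity.
    """
    if not entities_a or not entities_b:
        return 0.0
    score = 0.0
    for concept_a, name_a in entities_a:
        best = 0.0
        for concept_b, name_b in entities_b:
            if same_name(name_a, name_b):
                weight = (SAME_CLASS_WEIGHT if concept_a == concept_b
                          else CROSS_CLASS_WEIGHT)
                best = max(best, weight)
        score += best
    # Normalise so that a user-defined threshold can be chosen in [0, 1].
    return score / len(entities_a)


def exact_match(a, b):
    """Stand-in for entity resolution; a fuzzy matcher could be passed instead."""
    return a.lower() == b.lower()


article_1 = [("Criminal", "Sanjeev Khanna"), ("Victim", "Sheena Bora")]
article_2 = [("Victim", "Sanjeev Khanna"), ("Victim", "Sheena Bora")]
score_same = article_similarity(article_1, article_1, exact_match)
score_role_mismatch = article_similarity(article_1, article_2, exact_match)
```
]]></preformat>
      <p>Because the score is normalised by the first article only, it is not symmetric; taking the minimum or the mean of both directions is one way to make it so.</p>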
      <p>We have conducted experiments for two different
domains using the ontology pair described earlier. Our
collection consists of around 12000 crime-related news articles
collected from the top 3 English news sources from
each of four regions in the Indian subcontinent (north,
south, east and west). The Occupational Health
and Safety database has been created from
approximately 4000 articles published by the Occupational Safety
and Health Administration (OSHA), United States
Department of Labor, each of which details various
transgressions by organizations and the actions taken
against them.</p>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>Each dataset is divided into 70%, 20% and 10% for
training, validation and testing respectively. A
Conditional Random Field (CRF) model is trained as the
baseline. This model uses part-of-speech (POS) tags and
n-grams as features. Additionally, a number of other
deep neural network models, namely BiLSTM,
BiLSTM with mean over time (MoT), BiLSTM with an
attention network, a convolutional network (CNN), and CNN
with BiLSTM+MoT, were used to compare the results
with the proposed CNN with BiLSTM and
attention network. Each model is trained with three
types of word embeddings: GloVe (G), Word2Vec (w2v),
and a combination of GloVe and Word2Vec (G+w2v).
Both w2v and the combination of GloVe and w2v achieve
similar performance for the proposed architecture,
which is significantly better than the baseline and the
other models. F1 scores for all the models are shown in Table
1.</p>
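      <p>The 70/20/10 partition can be reproduced with a few lines of Python. The sketch assumes a random shuffle with a fixed seed before splitting, which the source does not specify; the function name and parameters are illustrative.</p>
      <preformat><![CDATA[
```python
import random


def split_dataset(items, train_pct=70, valid_pct=20, seed=0):
    """Shuffle and split items into train, validation and test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding of the split sizes.
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_dataset(range(100))
```
]]></preformat>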
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>For the CNN, we keep the window size at 3 and the number of
filters at 30. For the BiLSTM, the state size is 200 with an initial
state value of 0.0. We use a dropout rate of 0.05, a batch
size of 10, an initial learning rate of 0.01, a decay rate of 0.05
and gradient clipping of 5.0.</p>
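      <p>For reproducibility, the reported settings can be collected in a single configuration object. The class and field names below are illustrative assumptions; the values are the ones stated in the text, including the embedding dimension of 100 given earlier.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CRNNConfig:
    """Hyperparameters as reported in the text (field names are hypothetical)."""
    embedding_dim: int = 100
    cnn_window_size: int = 3
    cnn_num_filters: int = 30
    bilstm_state_size: int = 200
    bilstm_initial_state: float = 0.0
    dropout_rate: float = 0.05
    batch_size: int = 10
    initial_learning_rate: float = 0.01
    decay_rate: float = 0.05
    gradient_clip: float = 5.0


config = CRNNConfig()
```
]]></preformat>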
      <p>Across all the target classes, the performance of
the CNN-BiLSTM model has been found to be better
than that of the others. The word2vec method for learning
word embeddings [MCCD13] has been observed to
be very effective in capturing purely contextual
information. It has also been observed that combining
the W2V and GloVe embeddings surpasses the
performance of models using the individual
embeddings. Overall, the performance of the
CNN-BiLSTM model with attention and the combined W2V-GloVe
embedding is higher than that of the other models.</p>
      <sec id="sec-3-1">
        <title>Related Works</title>
        <p>While there has been a growing body of research on
extracting structured information from text, neural-network-based,
ontology-guided News event extraction and
story evolution is still at a nascent stage. Most of the
existing methods are limited to event and named-entity
extraction and do not identify entity roles at a
granular level. Supervised learning with
different flavours of LSTM or CNN [GCW+16, MB16] is
used for entity classification. Distant supervision
involving some amount of annotated data and an initial
knowledge source has been proposed to develop
models in [ZNL+09, NZRS12, CBK+10, MBSJ09, SSW09,
NTW11]. The unsupervised approach requires
hand-crafted rules pertaining to the information to be
extracted [Hea92, SW13, JVSS98, HZW10, MB05].</p>
      </sec>
      <sec id="sec-3-2">
        <title>Conclusion</title>
        <p>In this paper we have proposed an Ontology-guided
News information retrieval system using a
Convolutional Bi-LSTM Network for Concept detection.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[CBK+10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. <article-title>Toward an architecture for never-ending language learning</article-title>. In <source>AAAI</source>, volume <volume>5</volume>, page <fpage>3</fpage>. Atlanta, <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[GCW+16] Jiang Guo, Wanxiang Che, Haifeng Wang, Ting Liu, and Jun Xu. <article-title>A unified architecture for semantic role labeling and relation classification</article-title>. In <source>Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers</source>, pages <fpage>1264</fpage>–<lpage>1274</lpage>, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref-Hea92">
        <mixed-citation>[Hea92] Marti A Hearst. <article-title>Direction-based text interpretation as an information access refinement</article-title>. <source>Text-based intelligent systems: current research and practice in information extraction and retrieval</source>, pages <fpage>257</fpage>–<lpage>274</lpage>, <year>1992</year>.</mixed-citation>
      </ref>
      <ref id="ref-HZW10">
        <mixed-citation>[HZW10] Raphael Hoffmann, Congle Zhang, and Daniel S Weld. <article-title>Learning 5000 relational extractors</article-title>. In <source>Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>, pages <fpage>286</fpage>–<lpage>295</lpage>. Association for Computational Linguistics, <year>2010</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[JVSS98] Yaochu Jin, Werner Von Seelen, and Bernhard Sendhoff. <article-title>An approach to rule-based knowledge extraction</article-title>. In <source>Fuzzy Systems Proceedings, 1998. IEEE World Congress on Computational Intelligence., The 1998 IEEE International Conference on</source>, volume <volume>2</volume>, pages <fpage>1188</fpage>–<lpage>1193</lpage>. IEEE, <year>1998</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[MB05] <article-title>Mining knowledge from text using information extraction</article-title>. <source>ACM SIGKDD explorations newsletter</source>, <volume>7</volume>(<issue>1</issue>):<fpage>3</fpage>–<lpage>10</lpage>, <year>2005</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[MB16] Makoto Miwa and Mohit Bansal. <article-title>End-to-end relation extraction using LSTMs on sequences and tree structures</article-title>. <source>arXiv preprint arXiv:1601.00770</source>, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[MBSJ09] In <source>Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2</source>, pages <fpage>1003</fpage>–<lpage>1011</lpage>. Association for Computational Linguistics, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. <article-title>Efficient estimation of word representations in vector space</article-title>. <source>arXiv preprint arXiv:1301.3781</source>, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[MV93] Andres Marzal and Enrique Vidal. <article-title>Computation of normalized edit distance and applications</article-title>. <source>IEEE transactions on pattern analysis and machine intelligence</source>, <volume>15</volume>(<issue>9</issue>):<fpage>926</fpage>–<lpage>932</lpage>, <year>1993</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[NTW11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. <article-title>Scalable knowledge harvesting with high precision and high recall</article-title>. In <source>Proceedings of the fourth ACM international conference on Web search and data mining</source>, pages <fpage>227</fpage>–<lpage>236</lpage>. ACM, <year>2011</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[NZRS12] Feng Niu, Ce Zhang, Christopher Re, and Jude Shavlik. <article-title>Elementary: Large-scale knowledge-base construction via machine learning and statistical inference</article-title>. <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>, <volume>8</volume>(<issue>3</issue>):<fpage>42</fpage>–<lpage>73</lpage>, <year>2012</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[SSW09] Fabian M Suchanek, Mauro Sozio, and Gerhard Weikum. <article-title>Sofie: a self-organizing framework for information extraction</article-title>. In <source>Proceedings of the 18th international conference on World wide web</source>, pages <fpage>631</fpage>–<lpage>640</lpage>. ACM, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[SW13] In <source>Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data</source>, pages <fpage>933</fpage>–<lpage>938</lpage>. ACM, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[ZNL+09] Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. <article-title>Statsnowball: a statistical approach to extracting entity relationships</article-title>. In <source>Proceedings of the 18th international conference on World wide web</source>, pages <fpage>101</fpage>–<lpage>110</lpage>. ACM, <year>2009</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>