Ontology Guided Purposive News Retrieval and Presentation

Abir Naskar (abir.naskar@tcs.com), Rupsa Saha (rupsa.s@tcs.com), Tirthankar Dasgupta (dasgupta.tirthankar@tcs.com) and Lipika Dey (lipika.dey@tcs.com)
TCS Innovation Lab, India

Abstract

In this paper, we present a purposive News information retrieval and presentation system that curates information from News articles collected from multiple trusted sources for a given domain. A back-end domain ontology provides details about the concepts and relations of interest. We propose an attention-based CNN-BiLSTM model to classify sentence tokens as ontology concepts or entities of interest. These entities are then curated and used to link articles so as to illustrate the evolution of events over time and across regions. Working systems are initiated with small annotated data sets, which are later augmented with humans in the loop. The system is easily customizable to various domains.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25 July 2019, published at http://ceur-ws.org

1 Introduction

News consumption is no longer restricted to consuming a set of facts dished out by a specific agency. Readers are choosing not only the type of content they want to read but also how they read it. Growing interest in social statistics is also turning News into a source of data for generating such statistics, and unlike social media, News from trusted sources is reliable. News presentation is therefore undergoing a sea change: along with a bird's eye view of global events, the ability to delve deep into specific stories along various dimensions, and to watch their evolution, is necessary to enable systematic studies.

In this paper, we present an Ontology-guided News information retrieval and presentation system. The uniqueness of the proposed system lies in the use of a back-end domain ontology that specifies the entities and relations of interest in a domain, based on which information components are extracted and classified using a deep neural network architecture. These concepts are used to create domain-specific "purposive indices" that support concept-oriented information retrieval rather than simple word-based retrieval.

A seed ontology of concepts, along with a few instances of each concept, is used to create annotated data to train a concept classification model. This model is applied over a larger set of articles, and the result is validated through human evaluation and used to enhance the initial model. The proposed method takes much less time and effort to create annotated data sets, which is valuable in any situation that lacks the large labeled corpora needed to exploit deep-learning methods. Information components extracted from the News articles are stored in indexed repositories for downstream analytics. The results are presented to the end-user through an interactive interface that supports consuming information at multiple levels of granularity.

2 Overview of the Proposed News Retrieval and Presentation Framework

Figure 1 presents an overview of the proposed News information retrieval system. A number of News crawlers are deployed to collect News from dedicated and reliable sources.
[Figure 1 appears here. Caption: Overview of the Event and Entity Extraction Architecture (A). (B) depicts the example seed ontology structure, with concepts such as Crime, Accused, Victim, Law Enforcement Agency, Org, Safety Incident, Violation and Penalty, linked by relations such as Commits, Against, occursAt, ResultsIn, CarriesOut and Brings. (C) explains the C-RNN architecture: word embeddings feed a CNN layer, a Bi-LSTM, an attention-pooling layer, and a fully connected linear layer with sigmoid activation, labeling each token of the example sentence "Sanjeev Khanna has been ... killing Sheena Bora" with tags such as \Accused, \Crime and \Victim.]

For each article, meta-data such as source, date and time of publication, location and headline are stored along with the full article. These are all indexed using Solr. Exact or near-duplicate articles are identified using Locality Sensitive Hashing (LSH); though only one copy of such an article is stored in the repository, the total count is maintained. Each article is then passed through an information extraction and classification pipeline that deploys component classifiers to detect ontology components in sentences. The extracted elements are resolved and then used to create additional "purposive named-indices" for the documents. These elements are also used to link threads of the same News story together.

The ontology is composed of a schema that contains different domain concepts (C) and a set of generic relations (R) between these concepts. Figure 1 illustrates a pair of ontologies for two different domains, namely crime and occupational health and safety. Even given an ontology that elucidates the basic components of a domain and their underlying relationships in a generic way, finding the corresponding information components in vast collections of text remains challenging, since the manifestation of these components in natural language can be extremely varied. For example, though the names of criminal and victim can be extracted as named entities from text, establishing their roles unambiguously is not a simple task. Classifying an instance of a concept correctly requires deep contextual analysis. We propose an attention-pooling convolutional Bi-LSTM neural network architecture for this task, detailed in the next subsection.

Purposive indices for news articles are created using the ontology concepts extracted from text. For example, crime News articles are indexed by criminal and victim names, section names, location of crime, etc. News articles reporting safety incidents are similarly indexed by incident names, location of incident, penalty incurred in different currencies, etc. Linking of related articles is done through the purposive indices only. It may be noted that not all concepts for the same event may be obtained at once: new concepts can become associated as a single story unfolds over time, and multiple isolated articles may get linked to each other at later stages. The purposive indices are also used to generate comparative and aggregate statistics over various dimensions. The visualization module enables the end-user to view News articles at various levels of granularity.
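To make the idea concrete, the following is a minimal sketch of a purposive index as an inverted index keyed by (concept class, entity) pairs rather than by plain words. The class and method names are our own illustrative choices, not the system's implementation; the concept labels follow the crime ontology of Figure 1.

```python
from collections import defaultdict

# Minimal sketch of a purposive index: documents are indexed by
# (concept_class, entity) pairs instead of plain words. Names are
# illustrative; the real system indexes via Solr.
class PurposiveIndex:
    def __init__(self):
        self.index = defaultdict(set)  # (concept, entity) -> doc ids

    def add(self, doc_id, concepts):
        """concepts: list of (concept_class, entity) pairs from the classifier."""
        for concept_class, entity in concepts:
            self.index[(concept_class, entity.lower())].add(doc_id)

    def query(self, concept_class, entity):
        return self.index.get((concept_class, entity.lower()), set())

idx = PurposiveIndex()
idx.add("doc-17", [("Accused", "Sanjeev Khanna"), ("Victim", "Sheena Bora")])
idx.add("doc-42", [("Victim", "Sheena Bora"), ("Loc", "Kolkata")])

# All articles in which Sheena Bora appears in the Victim role:
print(idx.query("Victim", "Sheena Bora"))  # {'doc-17', 'doc-42'}
```

Retrieval through such an index is role-aware: a query for Sheena Bora as a victim would not return articles where the same string occurs in an unrelated role, which plain word-based indexing cannot guarantee.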
2.1 Ontology-guided Concept and Entity Detection using a C-RNN Network

Convolutional neural networks (CNNs) exploit local dependencies, while recurrent networks such as Long Short-Term Memory (LSTM) networks capture long-distance dependencies among features. The proposed Convolutional Bi-LSTM (C-RNN) model combines both capabilities. For a given sentence, the network learns to assign ontology concept labels to each word. The input to the network is a sequence of word embeddings with 100 dimensions each.

A convolutional layer is first used to extract local n-gram features. All word embeddings are concatenated to form an embedding matrix M ∈ ℝ^{d×|V|}, where |V| is the vocabulary size and d is the embedding dimension. The matrix is divided into k regions, and in each region we apply the convolution function

    Conv(\bar{x}_{i:w}) = W \cdot \bar{x}_{i:w} + b    (1)

to calculate the output features, where W and b are weights learned by the network. The same convolution operation is applied repeatedly over the different matrix regions to obtain multiple output feature vectors. The output of the CNN layer is passed to the bidirectional LSTM, which reads it both forward and backward so as to capture long-distance dependencies on both past and future neighbours. The Bi-LSTM layer is followed by an attention-pooling layer over the sentence representations. Attention modules have been shown to boost accuracy for tasks like sentiment or activity detection by learning to focus more on certain linguistic elements than on others, without increasing computational complexity. In our case, attention pooling achieves higher accuracy by learning the specific characteristics surrounding each concept, in the form of weights associated with the output of the Bi-LSTM layer. This is represented as:

    a_i = \tanh(W_a h_i + b_a)    (2)

    \alpha_i = \frac{\exp(w_\alpha \cdot a_i)}{\sum_j \exp(w_\alpha \cdot a_j)}    (3)

    O = \sum_i \alpha_i h_i    (4)

where W_a and w_\alpha are a weight matrix and a weight vector respectively, b_a is the bias vector, a_i is the attention vector for the i-th hidden state h_i, and \alpha_i is its attention weight. The output of the attention layer is then passed to a fully connected linear layer with sigmoid activation:

    y = s(x) = \mathrm{sigmoid}(w \cdot x + b)    (5)

where x is the input vector, w is the weight vector, and b is the bias value. Finally, the loss is computed using the cross-entropy loss

    L = -\sum_{i=1}^{2} \bar{y}_i \log(y_i)    (6)

where \bar{y} is the one-hot representation of the actual label for the input word. To avoid over-fitting, we apply dropout at each layer to regularize the model.
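To make the architecture concrete, here is a minimal PyTorch sketch of the C-RNN tagger, assuming 100-dimensional embeddings and the hyper-parameters reported later in Section 3.3 (window size 3, 30 filters, Bi-LSTM state size 200, dropout 0.05). All class, layer and variable names are our own; we use a standard softmax cross-entropy head for per-token labeling where the paper describes a sigmoid-activated linear layer, and the way the pooled vector O feeds per-token labels is one plausible reading of the text, not the authors' confirmed implementation.

```python
import torch
import torch.nn as nn

class CRNNTagger(nn.Module):
    """Sketch of the CNN + Bi-LSTM + attention-pooling token classifier."""
    def __init__(self, vocab_size, num_labels, emb_dim=100,
                 num_filters=30, window=3, lstm_state=200, dropout=0.05):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # 1-D convolution over the token sequence extracts local n-gram features.
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=window,
                              padding=window // 2)
        self.bilstm = nn.LSTM(num_filters, lstm_state, batch_first=True,
                              bidirectional=True)
        # Attention parameters of Eqs. (2)-(3): W_a, b_a and the vector w_alpha.
        self.att_proj = nn.Linear(2 * lstm_state, 2 * lstm_state)
        self.att_vec = nn.Linear(2 * lstm_state, 1, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(4 * lstm_state, num_labels)

    def forward(self, token_ids):                      # (batch, seq)
        x = self.dropout(self.embed(token_ids))        # (batch, seq, emb)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, filters, seq)
        h, _ = self.bilstm(x.transpose(1, 2))          # (batch, seq, 2*state)
        a = torch.tanh(self.att_proj(h))               # Eq. (2)
        alpha = torch.softmax(self.att_vec(a), dim=1)  # Eq. (3)
        context = (alpha * h).sum(dim=1, keepdim=True) # Eq. (4), pooled O
        # One plausible reading: concatenate the pooled context with each
        # token's hidden state before per-token labeling.
        ctx = context.expand_as(h)
        logits = self.out(self.dropout(torch.cat([h, ctx], dim=-1)))
        return logits                                  # (batch, seq, labels)

model = CRNNTagger(vocab_size=5000, num_labels=8)
logits = model(torch.randint(0, 5000, (2, 12)))  # two sentences, 12 tokens each
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 8),
                             torch.randint(0, 8, (2, 12)).reshape(-1))
```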
The model is initially built from a small annotated corpus in which the instances of the ontology concepts are tagged with their respective labels. This model, when applied over a larger corpus, yields new instances of each label; these are evaluated by humans and accepted for the next round of training if two out of three annotators agree on the label. Over repeated experiments on different domains, the inter-annotator agreement is found to be around 0.65, which indicates substantial agreement. This bootstrapping can be repeated multiple times, though we have restricted it to two rounds.

As discussed earlier, classification of concepts is more complex than merely identifying named entities; it is imperative to also recognize the role played by each entity. Two example sentences are presented here along with the concepts extracted:

"Sanjeev Khanna has been taken into custody in Kolkata on charges of killing Sheena Bora." The concepts extracted are <Sanjeev Khanna, Accused> and <Sheena Bora, Victim>; Kolkata is, correctly, not labeled as any concept.

"US Department of Labor fines Heat Seal $95,000 for 15 health violations." The concepts extracted include <Heat Seal, Org> and <$95,000, Penalty>.

2.2 Linking Articles to Indicate News Story Evolution

A link between two News articles is created only if they share purposive indices for specific concept classes. For example, for crime incidents the victim and criminal names should overlap, while for safety incidents the organization name and safety incident type should overlap.

The first challenge here is entity resolution, since named entities are spelled differently across sources. An edit-distance based measure [MV93] is used to compare entities and combine them if they are sufficiently similar. For example, an individual was variably referred to as "Mukherjee", "Mukerjea" and "Mookerjee" across sources in our crime news database.

The second challenge comes from the fact that the set of conceptual instances for a single story also evolves over time. For example, for a long-drawn crime incident, new articles report new names as criminals or even victims as fresh information pours in. A concept overlap threshold of 80% over the specified types is used to link articles into a single story.

The third challenge is that mere overlap of entity names is not enough: the corresponding roles also need to be the same, or at least similar, for two mentions to be considered as referring to the same case.

Considering the above challenges, we propose a weighted similarity computation to determine the similarity of two articles. The highest weight is given to candidate pairs that are resolved to be similar and also belong to the same concept class. Candidates that are similar after resolution but are identified as instances of different classes are given a lower weight. The final similarity measure is computed as a weighted sum of all candidate similarities, and two articles are considered similar if this measure is above a user-defined threshold.

Once conceptually similar pairs of articles are found, a virtual group is created containing these articles along with the union of the conceptual entities contained in them. More articles are added by repeating the similarity computation of each new article against the existing groups. The system automatically links the articles within a group and maintains them chronologically.
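As a rough sketch of this linking step, the fragment below combines an approximate entity matcher (standing in for the normalized edit distance of [MV93], here approximated with Python's difflib) with a role-aware weighted overlap score. The weights, the edit-similarity cutoff and the 0.8 linking threshold are illustrative assumptions, not the paper's actual values.

```python
from difflib import SequenceMatcher

# Illustrative weights: a same-class match (e.g. Accused in both articles)
# counts fully, a cross-class match counts less. Values are assumptions.
SAME_CLASS_W, CROSS_CLASS_W, LINK_THRESHOLD = 1.0, 0.5, 0.8

def entities_match(a, b, cutoff=0.8):
    """Approximate stand-in for the normalized edit distance of [MV93]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff

def article_similarity(concepts_a, concepts_b):
    """concepts_*: lists of (concept_class, entity) pairs for one article."""
    score = 0.0
    for cls_a, ent_a in concepts_a:
        for cls_b, ent_b in concepts_b:
            if entities_match(ent_a, ent_b):
                score += SAME_CLASS_W if cls_a == cls_b else CROSS_CLASS_W
                break  # count each candidate of article A at most once
    return score / max(len(concepts_a), 1)

a = [("Accused", "Mukherjee"), ("Victim", "Sheena Bora")]
b = [("Accused", "Mukerjea"), ("Victim", "Sheena Bora"), ("Loc", "Kolkata")]
if article_similarity(a, b) >= LINK_THRESHOLD:
    print("link articles into one story")
```

Note how the spelling variants "Mukherjee" and "Mukerjea" are resolved to the same candidate, and how the match is weighted down if the two mentions carry different roles, reflecting the first and third challenges above.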
[Figure 2 appears here. Caption: Illustration of the linked News articles. Selecting one article shows the extracted events and entities. The graph at the top right shows the distribution of a News story over time; the figure at the bottom right shows the evolution of entity types over time.]

2.3 News Retrieval and Data Visualization

Apart from regular entity and concept based retrieval and tracking the evolution of a News story, the proposed system also enables the user to explore the evolution of an incident across dimensions like time and space. Figure 2 illustrates how News evolution is presented. The leftmost panel shows a series of crime reports on an unsolved murder case, gathered across different times and sources; these have been linked together by the algorithm of Section 2.2 using the purposive indices extracted by the C-RNN classifier. On selecting a particular crime news item, the extracted entities and events are shown in the middle panel. The top right chart shows the temporal distribution of the reports over the period 2015 to 2018. The bottom right chart displays how crime entities have changed over time: while murder is prevalent over the entire time-line, new crime types such as money laundering emerge as reported co-crimes at later stages. One can also explore the presence or absence of similar incidents across different geographical regions, which can unearth regional affinities for certain kinds of acts. Visualizations also help in studying aberrations, such as different charges evoked for similar crimes or variability in penalty rates for similar safety incidents, through canned analytics.

3 Experiments and Results

3.1 Data Collection

We have conducted experiments for two different domains using the ontology pair described earlier. Our collection consists of around 12,000 crime-related news articles collected from the top 3 English news sources in each of four regions of the Indian subcontinent (north, south, east and west). The Occupational Health and Safety database has been created from approximately 4,000 articles published by the Occupational Safety and Health Administration (OSHA), United States Department of Labor, each of which details transgressions by organizations and the actions taken against them.

3.2 Experiments

Each dataset is divided into 70%, 20% and 10% splits for training, validation and testing respectively. A Conditional Random Field (CRF) model, using part-of-speech (POS) tags and n-grams as features, is trained as the baseline. Additionally, a number of other deep neural network models, namely BiLSTM, BiLSTM with mean-over-time pooling (MoT), BiLSTM with attention, CNN, and CNN with BiLSTM+MoT, were compared against the proposed CNN with BiLSTM and attention. Each model is trained with three types of word embeddings: GloVe (G), Word2Vec (W2V), and their combination (G+W2V). Both W2V and the combined G+W2V embeddings achieve similar performance for the proposed architecture, which is significantly better than the baseline and the other models. F1 scores for all models are shown in Table 1.

Table 1: F1 scores for each model on the two domains, Crime and OSHA (G = GloVe, W2V = Word2Vec).

Model            |     Crime      |     OSHA
                 |  G  W2V G+W2V  |  G  W2V G+W2V
-----------------+----------------+----------------
BiLSTM           | 67   70   64   | 64   65   70
BiLSTM-MoT       | 66   69   68   | 63   66   72
CNN              | 69   72   68   | 67   67   63
BiLSTM-att       | 71   71   70   | 68   69   64
CNN+BiLSTM-MoT   | 72   75   75   | 69   70   67
CNN+BiLSTM-att   | 74   76   76   | 70   71   73
CRF (baseline)   |       58       |       61

3.3 Hyper-parameters

For the CNN, we use a window size of 3 and 30 filters. For the BiLSTM, the state size is 200 with an initial state value of 0.0. We use a dropout rate of 0.05, a batch size of 10, an initial learning rate of 0.01, a decay rate of 0.05, and gradient clipping at 5.0.
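As a sketch, these reported values might be wired into a training step as follows, reusing the CRNNTagger sketch from Section 2.1. The paper does not state the optimizer or the exact decay schedule; plain SGD with a per-epoch exponential decay is our assumption.

```python
import torch
from torch import nn, optim

# Hyper-parameters as reported in Section 3.3; the optimizer (SGD) and the
# interpretation of "decay rate of 0.05" are assumptions, not the paper's.
LR, DECAY, CLIP, BATCH, DROPOUT = 0.01, 0.05, 5.0, 10, 0.05

model = CRNNTagger(vocab_size=5000, num_labels=8, num_filters=30,
                   window=3, lstm_state=200, dropout=DROPOUT)
optimizer = optim.SGD(model.parameters(), lr=LR)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=1.0 - DECAY)
loss_fn = nn.CrossEntropyLoss()

def train_step(tokens, labels):
    optimizer.zero_grad()
    logits = model(tokens)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    loss.backward()
    # Gradient clipping at 5.0, as reported.
    nn.utils.clip_grad_norm_(model.parameters(), CLIP)
    optimizer.step()
    return loss.item()

loss = train_step(torch.randint(0, 5000, (BATCH, 12)),
                  torch.randint(0, 8, (BATCH, 12)))
scheduler.step()  # apply the decay once per epoch
```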
3.4 Results

Across all target classes, the performance of the CNN+BiLSTM based models is found to be better than that of the other models. The word2vec method for learning word embeddings [MCCD13] is observed to be very effective in capturing contextual information, and combining the locally trained W2V embeddings with the global GloVe embeddings surpasses the performance of models using either embedding alone. Overall, the CNN+BiLSTM-att model with the combined W2V+GloVe embedding performs best among all the models considered.

4 Related Works

While there is a growing body of research on extracting structured information from text, neural network based, ontology-guided News event extraction and story evolution is still in a nascent stage. Most existing methods are limited to event and named entity extraction and do not identify entity roles at a granular level. Supervised learning with different flavours of LSTMs or CNNs [GCW+16, MB16] has been used for entity classification. Distant supervision, involving some amount of annotated data and an initial knowledge source, has been proposed to develop models in [ZNL+09, NZRS12, CBK+10, MBSJ09, SSW09, NTW11]. Unsupervised approaches require hand-crafted rules pertaining to the information to be extracted [Hea92, SW13, JVSS98, HZW10, MB05].

5 Conclusion

In this paper we have proposed an Ontology-guided News information retrieval system that uses a Convolutional Bi-LSTM network for concept detection. Concept-based linking is used to connect related articles so as to present News evolution and event distribution across regions. We have also illustrated how deep-learning methods can be deployed with small volumes of annotated data. We intend to extend the proposed methods to all kinds of legal documents and to incorporate predictive technologies for anticipating activities or events.

References

[CBK+10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, volume 5, page 3. Atlanta, 2010.

[GCW+16] Jiang Guo, Wanxiang Che, Haifeng Wang, Ting Liu, and Jun Xu. A unified architecture for semantic role labeling and relation classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1264–1274, 2016.

[Hea92] Marti A. Hearst. Direction-based text interpretation as an information access refinement. In Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval, pages 257–274, 1992.

[HZW10] Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. Learning 5000 relational extractors. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 286–295. Association for Computational Linguistics, 2010.

[JVSS98] Yaochu Jin, Werner Von Seelen, and Bernhard Sendhoff. An approach to rule-based knowledge extraction. In Proceedings of the 1998 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence), volume 2, pages 1188–1193. IEEE, 1998.

[MB05] Raymond J. Mooney and Razvan Bunescu. Mining knowledge from text using information extraction. ACM SIGKDD Explorations Newsletter, 7(1):3–10, 2005.

[MB16] Makoto Miwa and Mohit Bansal. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv preprint arXiv:1601.00770, 2016.
[MBSJ09] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics, 2009.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[MV93] Andres Marzal and Enrique Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993.

[NTW11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. Scalable knowledge harvesting with high precision and high recall. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 227–236. ACM, 2011.

[NZRS12] Feng Niu, Ce Zhang, Christopher Ré, and Jude Shavlik. Elementary: Large-scale knowledge-base construction via machine learning and statistical inference. International Journal on Semantic Web and Information Systems (IJSWIS), 8(3):42–73, 2012.

[SSW09] Fabian M. Suchanek, Mauro Sozio, and Gerhard Weikum. SOFIE: A self-organizing framework for information extraction. In Proceedings of the 18th International Conference on World Wide Web, pages 631–640. ACM, 2009.

[SW13] Fabian Suchanek and Gerhard Weikum. Knowledge harvesting in the big-data era. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 933–938. ACM, 2013.

[ZNL+09] Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. StatSnowball: A statistical approach to extracting entity relationships. In Proceedings of the 18th International Conference on World Wide Web, pages 101–110. ACM, 2009.