Ontology Guided Purposive News Retrieval and Presentation

Abir Naskar (abir.naskar@tcs.com), Rupsa Saha (rupsa.s@tcs.com), Tirthankar Dasgupta (dasgupta.tirthankar@tcs.com) and Lipika Dey (lipika.dey@tcs.com)
TCS Innovation Lab, India

Abstract

In this paper, we present a purposive News information retrieval and presentation system that curates information from News articles collected from multiple trusted sources for a given domain. A back-end domain ontology provides details about the concepts and relations of interest. We propose an attention-based CNN-BiLSTM model to classify sentence tokens as ontology concepts or entities of interest. These entities are then curated and used to link articles so as to illustrate the evolution of events over time and across regions. Working systems are initiated with small annotated data sets, which are later augmented with humans in the loop. The system is easily customizable to various domains.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S. Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25 July 2019, published at http://ceur-ws.org

1 Introduction

News consumption is no longer restricted to consuming a set of facts dished out by a specific agency. Readers are choosing not only the type of content they want to read but also how they read it. Growing interest in social statistics is also turning News into a source of data for generating such statistics, and unlike social media, News from trusted sources is reliable. News presentation is therefore undergoing a sea change: along with a bird's eye view of global events, the ability to delve deep into specific stories along various dimensions, and to watch their evolution, is necessary to enable systematic studies.

In this paper, we present an Ontology-guided News information retrieval and presentation system. The uniqueness of the proposed system lies in the use of a back-end domain ontology that specifies the entities and relations of interest in a domain, based on which information components are extracted and classified using a deep neural network architecture. These concepts are used to create domain-specific "purposive indices" that support concept-oriented information retrieval rather than simple word-based retrieval.

A seed ontology of concepts, along with a few instances of each concept, is used to create annotated data to train a concept classification model. This model is applied over a larger set of articles, and the result is validated through human evaluation and used to enhance the initial model. The proposed method takes much less time and effort to create annotated data sets, which is valuable in any situation that lacks the large labeled corpora needed to exploit deep-learning methods. Information components extracted from the News articles are stored in indexed repositories for downstream analytics. The results are presented to the end-user through an interactive interface that supports consuming information at multiple levels of granularity.

2 Overview of the Proposed News Retrieval and Presentation Framework

Figure 1 presents an overview of the proposed News information retrieval system. A number of News crawlers are deployed to collect News from dedicated and reliable sources.
[Figure 1 appears here. Caption: Overview of the Event and Entity Extraction Architecture (A). (B) depicts the example seed ontology structure, with concepts such as Crime, Accused, Victim, Law Enforcement Agency, Org, Safety Incident, Violation and Penalty, linked by relations such as Commits, Against, occursAt, ResultsIn, CarriesOut and Brings. (C) explains the C-RNN architecture: word embeddings feed a CNN layer, a Bi-LSTM, an attention-pooling layer, and a fully connected linear layer with sigmoid activation, labeling each token of the example sentence "Sanjeev Khanna has been ... killing Sheena Bora" with tags such as \Accused, \Crime and \Victim.]

For each article, meta-data such as source, date and time of publication, location and headline are stored along with the full article. These are all indexed using Solr. Exact or near-duplicate articles are identified using Locality Sensitive Hashing (LSH); though only one copy of such an article is stored in the repository, the total count is maintained. Each article is then passed through an information extraction and classification pipeline that deploys component classifiers to detect ontology components in sentences. The extracted elements are resolved and then used to create additional "purposive named-indices" for the documents. These elements are also used to link threads of the same News story together.

The ontology is composed of a schema that contains different domain concepts (C) and a set of generic relations (R) between these concepts. Figure 1 illustrates a pair of ontologies for two different domains, namely crime and occupational health and safety. Even given an ontology that elucidates the basic components of a domain and their underlying relationships in a generic way, finding the corresponding information components in vast collections of text remains challenging, since the manifestation of these components in natural language can be extremely varied. For example, though the names of criminal and victim can be extracted as named entities from text, establishing their roles unambiguously is not a simple task. Classifying an instance of a concept correctly requires deep contextual analysis. We propose an attention-pooling convolutional Bi-LSTM neural network architecture for this task, detailed in the next subsection.

Purposive indices for news articles are created using the ontology concepts extracted from text. For example, crime News articles are indexed by criminal and victim names, section names, location of crime, etc. News articles reporting safety incidents are similarly indexed by incident names, location of incident, penalty incurred in different currencies, etc. Linking of related articles is done through the purposive indices only. It may be noted that not all concepts for the same event may be obtained at once: new concepts can become associated as a single story unfolds over time, and multiple isolated articles may get linked to each other at later stages. The purposive indices are also used to generate comparative and aggregate statistics over various dimensions. The visualization module enables the end-user to view News articles at various levels of granularity.
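To make the idea concrete, the following is a minimal sketch of a purposive index as an inverted index keyed by (concept class, entity) pairs rather than by plain words. The class and method names are our own illustrative choices, not the system's implementation; the concept labels follow the crime ontology of Figure 1.

```python
from collections import defaultdict

# Minimal sketch of a purposive index: documents are indexed by
# (concept_class, entity) pairs instead of plain words. Names are
# illustrative; the real system indexes via Solr.
class PurposiveIndex:
    def __init__(self):
        self.index = defaultdict(set)  # (concept, entity) -> doc ids

    def add(self, doc_id, concepts):
        """concepts: list of (concept_class, entity) pairs from the classifier."""
        for concept_class, entity in concepts:
            self.index[(concept_class, entity.lower())].add(doc_id)

    def query(self, concept_class, entity):
        return self.index.get((concept_class, entity.lower()), set())

idx = PurposiveIndex()
idx.add("doc-17", [("Accused", "Sanjeev Khanna"), ("Victim", "Sheena Bora")])
idx.add("doc-42", [("Victim", "Sheena Bora"), ("Loc", "Kolkata")])

# All articles in which Sheena Bora appears in the Victim role:
print(idx.query("Victim", "Sheena Bora"))  # {'doc-17', 'doc-42'}
```

Retrieval through such an index is role-aware: a query for Sheena Bora as a victim would not return articles where the same string occurs in an unrelated role, which plain word-based indexing cannot guarantee.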
2.1 Ontology-guided Concept and Entity Detection using a C-RNN Network

Convolutional neural networks (CNNs) exploit local dependencies, while recurrent networks such as Long Short-Term Memory (LSTM) networks capture long-distance dependencies among features. The proposed Convolutional Bi-LSTM (C-RNN) model combines both capabilities. For a given sentence, the network learns to assign ontology concept labels to each word. The input to the network is a sequence of word embeddings with 100 dimensions each.

A convolutional layer is first used to extract local n-gram features. All word embeddings are concatenated to form an embedding matrix M ∈ ℝ^{d×|V|}, where |V| is the vocabulary size and d is the embedding dimension. The matrix is divided into k regions, and in each region we apply the convolution function

    Conv(\bar{x}_{i:w}) = W \cdot \bar{x}_{i:w} + b    (1)

to calculate the output features, where W and b are weights learned by the network. The same convolution operation is applied repeatedly over the different matrix regions to obtain multiple output feature vectors. The output of the CNN layer is passed to the bidirectional LSTM, which reads it both forward and backward so as to capture long-distance dependencies on both past and future neighbours. The Bi-LSTM layer is followed by an attention-pooling layer over the sentence representations. Attention modules have been shown to boost accuracy for tasks like sentiment or activity detection by learning to focus more on certain linguistic elements than on others, without increasing computational complexity. In our case, attention pooling achieves higher accuracy by learning the specific characteristics surrounding each concept, in the form of weights associated with the output of the Bi-LSTM layer. This is represented as:

    a_i = \tanh(W_a h_i + b_a)    (2)

    \alpha_i = \frac{\exp(w_\alpha \cdot a_i)}{\sum_j \exp(w_\alpha \cdot a_j)}    (3)

    O = \sum_i \alpha_i h_i    (4)

where W_a and w_\alpha are a weight matrix and a weight vector respectively, b_a is the bias vector, a_i is the attention vector for the i-th hidden state h_i, and \alpha_i is its attention weight. The output of the attention layer is then passed to a fully connected linear layer with sigmoid activation:

    y = s(x) = \mathrm{sigmoid}(w \cdot x + b)    (5)

where x is the input vector, w is the weight vector, and b is the bias value. Finally, the loss is computed using the cross-entropy loss

    L = -\sum_{i=1}^{2} \bar{y}_i \log(y_i)    (6)

where \bar{y} is the one-hot representation of the actual label for the input word. To avoid over-fitting, we apply dropout at each layer to regularize the model.
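To make the architecture concrete, here is a minimal PyTorch sketch of the C-RNN tagger, assuming 100-dimensional embeddings and the hyper-parameters reported later in Section 3.3 (window size 3, 30 filters, Bi-LSTM state size 200, dropout 0.05). All class, layer and variable names are our own; we use a standard softmax cross-entropy head for per-token labeling where the paper describes a sigmoid-activated linear layer, and the way the pooled vector O feeds per-token labels is one plausible reading of the text, not the authors' confirmed implementation.

```python
import torch
import torch.nn as nn

class CRNNTagger(nn.Module):
    """Sketch of the CNN + Bi-LSTM + attention-pooling token classifier."""
    def __init__(self, vocab_size, num_labels, emb_dim=100,
                 num_filters=30, window=3, lstm_state=200, dropout=0.05):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # 1-D convolution over the token sequence extracts local n-gram features.
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=window,
                              padding=window // 2)
        self.bilstm = nn.LSTM(num_filters, lstm_state, batch_first=True,
                              bidirectional=True)
        # Attention parameters of Eqs. (2)-(3): W_a, b_a and the vector w_alpha.
        self.att_proj = nn.Linear(2 * lstm_state, 2 * lstm_state)
        self.att_vec = nn.Linear(2 * lstm_state, 1, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(4 * lstm_state, num_labels)

    def forward(self, token_ids):                      # (batch, seq)
        x = self.dropout(self.embed(token_ids))        # (batch, seq, emb)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, filters, seq)
        h, _ = self.bilstm(x.transpose(1, 2))          # (batch, seq, 2*state)
        a = torch.tanh(self.att_proj(h))               # Eq. (2)
        alpha = torch.softmax(self.att_vec(a), dim=1)  # Eq. (3)
        context = (alpha * h).sum(dim=1, keepdim=True) # Eq. (4), pooled O
        # One plausible reading: concatenate the pooled context with each
        # token's hidden state before per-token labeling.
        ctx = context.expand_as(h)
        logits = self.out(self.dropout(torch.cat([h, ctx], dim=-1)))
        return logits                                  # (batch, seq, labels)

model = CRNNTagger(vocab_size=5000, num_labels=8)
logits = model(torch.randint(0, 5000, (2, 12)))  # two sentences, 12 tokens each
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 8),
                             torch.randint(0, 8, (2, 12)).reshape(-1))
```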
The model is initially built from a small annotated corpus in which the instances of the ontology concepts are tagged with their respective labels. This model, when applied over a larger corpus, yields new instances of each label; these are evaluated by humans and accepted for the next round of training if two out of three annotators agree on the label. Over repeated experiments on different domains, the inter-annotator agreement is found to be around 0.65, which indicates substantial agreement. This bootstrapping can be repeated multiple times, though we have restricted it to two rounds.

As discussed earlier, classification of concepts is more complex than merely identifying named entities; it is imperative to also recognize the role played by each entity. Two example sentences are presented here along with the concepts extracted:

"Sanjeev Khanna has been taken into custody in Kolkata on charges of killing Sheena Bora." The concepts extracted are <Sanjeev Khanna, Accused> and <Sheena Bora, Victim>; Kolkata is, correctly, not labeled as any concept.

"US Department of Labor fines Heat Seal $95,000 for 15 health violations." The concepts extracted include <Heat Seal, Org> and <$95,000, Penalty>.

2.2 Linking Articles to Indicate News Story Evolution

A link between two News articles is created only if they share purposive indices for specific concept classes. For example, for crime incidents the victim and criminal names should overlap, while for safety incidents the organization name and safety incident type should overlap.

The first challenge here is entity resolution, since named entities are spelled differently across sources. An edit-distance based measure [MV93] is used to compare entities and combine them if they are sufficiently similar. For example, an individual was variably referred to as "Mukherjee", "Mukerjea" and "Mookerjee" across sources in our crime news database.

The second challenge comes from the fact that the set of conceptual instances for a single story also evolves over time. For example, for a long-drawn crime incident, new articles report new names as criminals or even victims as fresh information pours in. A concept overlap threshold of 80% over the specified types is used to link articles into a single story.

The third challenge is that mere overlap of entity names is not enough: the corresponding roles also need to be the same, or at least similar, for two mentions to be considered as referring to the same case.

Considering the above challenges, we propose a weighted similarity computation to determine the similarity of two articles. The highest weight is given to candidate pairs that are resolved to be similar and also belong to the same concept class. Candidates that are similar after resolution but are identified as instances of different classes are given a lower weight. The final similarity measure is computed as a weighted sum of all candidate similarities, and two articles are considered similar if this measure is above a user-defined threshold.

Once conceptually similar pairs of articles are found, a virtual group is created containing these articles along with the union of the conceptual entities contained in them. More articles are added by repeating the similarity computation of each new article against the existing groups. The system automatically links the articles within a group and maintains them chronologically.
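As a rough sketch of this linking step, the fragment below combines an approximate entity matcher (standing in for the normalized edit distance of [MV93], here approximated with Python's difflib) with a role-aware weighted overlap score. The weights, the edit-similarity cutoff and the 0.8 linking threshold are illustrative assumptions, not the paper's actual values.

```python
from difflib import SequenceMatcher

# Illustrative weights: a same-class match (e.g. Accused in both articles)
# counts fully, a cross-class match counts less. Values are assumptions.
SAME_CLASS_W, CROSS_CLASS_W, LINK_THRESHOLD = 1.0, 0.5, 0.8

def entities_match(a, b, cutoff=0.8):
    """Approximate stand-in for the normalized edit distance of [MV93]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff

def article_similarity(concepts_a, concepts_b):
    """concepts_*: lists of (concept_class, entity) pairs for one article."""
    score = 0.0
    for cls_a, ent_a in concepts_a:
        for cls_b, ent_b in concepts_b:
            if entities_match(ent_a, ent_b):
                score += SAME_CLASS_W if cls_a == cls_b else CROSS_CLASS_W
                break  # count each candidate of article A at most once
    return score / max(len(concepts_a), 1)

a = [("Accused", "Mukherjee"), ("Victim", "Sheena Bora")]
b = [("Accused", "Mukerjea"), ("Victim", "Sheena Bora"), ("Loc", "Kolkata")]
if article_similarity(a, b) >= LINK_THRESHOLD:
    print("link articles into one story")
```

Note how the spelling variants "Mukherjee" and "Mukerjea" are resolved to the same candidate, and how the match is weighted down if the two mentions carry different roles, reflecting the first and third challenges above.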
[Figure 2 appears here. Caption: Illustration of the linked News articles. Selecting one article shows the extracted events and entities. The graph at the top right shows the distribution of a News story over time; the figure at the bottom right shows the evolution of entity types over time.]

2.3 News Retrieval and Data Visualization

Apart from regular entity and concept based retrieval and tracking the evolution of a News story, the proposed system also enables the user to explore the evolution of an incident across dimensions like time and space. Figure 2 illustrates how News evolution is presented. The leftmost panel shows a series of crime reports on an unsolved murder case, gathered across different times and sources; these have been linked together by the algorithm of Section 2.2 using the purposive indices extracted by the C-RNN classifier. On selecting a particular crime news item, the extracted entities and events are shown in the middle panel. The top right chart shows the temporal distribution of the reports over the period 2015 to 2018. The bottom right chart displays how crime entities have changed over time: while murder is prevalent over the entire time-line, new crime types such as money laundering emerge as reported co-crimes at later stages. One can also explore the presence or absence of similar incidents across different geographical regions, which can unearth regional affinities for certain kinds of acts. Visualizations also help in studying aberrations, such as different charges evoked for similar crimes or variability in penalty rates for similar safety incidents, through canned analytics.

3 Experiments and Results

3.1 Data Collection

We have conducted experiments for two different domains using the ontology pair described earlier. Our collection consists of around 12,000 crime-related news articles collected from the top 3 English news sources in each of four regions of the Indian subcontinent (north, south, east and west). The Occupational Health and Safety database has been created from approximately 4,000 articles published by the Occupational Safety and Health Administration (OSHA), United States Department of Labor, each of which details transgressions by organizations and the actions taken against them.

3.2 Experiments

Each dataset is divided into 70%, 20% and 10% splits for training, validation and testing respectively. A Conditional Random Field (CRF) model, using part-of-speech (POS) tags and n-grams as features, is trained as the baseline. Additionally, a number of other deep neural network models, namely BiLSTM, BiLSTM with mean-over-time pooling (MoT), BiLSTM with attention, CNN, and CNN with BiLSTM+MoT, were compared against the proposed CNN with BiLSTM and attention. Each model is trained with three types of word embeddings: GloVe (G), Word2Vec (W2V), and their combination (G+W2V). Both W2V and the combined G+W2V embeddings achieve similar performance for the proposed architecture, which is significantly better than the baseline and the other models. F1 scores for all models are shown in Table 1.

Table 1: F1 scores for each model on the two domains, Crime and OSHA (G = GloVe, W2V = Word2Vec).

Model            |     Crime      |     OSHA
                 |  G  W2V G+W2V  |  G  W2V G+W2V
-----------------+----------------+----------------
BiLSTM           | 67   70   64   | 64   65   70
BiLSTM-MoT       | 66   69   68   | 63   66   72
CNN              | 69   72   68   | 67   67   63
BiLSTM-att       | 71   71   70   | 68   69   64
CNN+BiLSTM-MoT   | 72   75   75   | 69   70   67
CNN+BiLSTM-att   | 74   76   76   | 70   71   73
CRF (baseline)   |       58       |       61

3.3 Hyper-parameters

For the CNN, we use a window size of 3 and 30 filters. For the BiLSTM, the state size is 200 with an initial state value of 0.0. We use a dropout rate of 0.05, a batch size of 10, an initial learning rate of 0.01, a decay rate of 0.05, and gradient clipping at 5.0.
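As a sketch, these reported values might be wired into a training step as follows, reusing the CRNNTagger sketch from Section 2.1. The paper does not state the optimizer or the exact decay schedule; plain SGD with a per-epoch exponential decay is our assumption.

```python
import torch
from torch import nn, optim

# Hyper-parameters as reported in Section 3.3; the optimizer (SGD) and the
# interpretation of "decay rate of 0.05" are assumptions, not the paper's.
LR, DECAY, CLIP, BATCH, DROPOUT = 0.01, 0.05, 5.0, 10, 0.05

model = CRNNTagger(vocab_size=5000, num_labels=8, num_filters=30,
                   window=3, lstm_state=200, dropout=DROPOUT)
optimizer = optim.SGD(model.parameters(), lr=LR)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=1.0 - DECAY)
loss_fn = nn.CrossEntropyLoss()

def train_step(tokens, labels):
    optimizer.zero_grad()
    logits = model(tokens)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    loss.backward()
    # Gradient clipping at 5.0, as reported.
    nn.utils.clip_grad_norm_(model.parameters(), CLIP)
    optimizer.step()
    return loss.item()

loss = train_step(torch.randint(0, 5000, (BATCH, 12)),
                  torch.randint(0, 8, (BATCH, 12)))
scheduler.step()  # apply the decay once per epoch
```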
3.4 Results

Across all target classes, the performance of the CNN+BiLSTM based models is found to be better than that of the other models. The word2vec method for learning word embeddings [MCCD13] is observed to be very effective in capturing contextual information, and combining the locally trained W2V embeddings with the global GloVe embeddings surpasses the performance of models using either embedding alone. Overall, the CNN+BiLSTM-att model with the combined W2V+GloVe embedding performs best among all the models considered.

4 Related Works

While there is a growing body of research on extracting structured information from text, neural network based, ontology-guided News event extraction and story evolution is still in a nascent stage. Most existing methods are limited to event and named entity extraction and do not identify entity roles at a granular level. Supervised learning with different flavours of LSTMs or CNNs [GCW+16, MB16] has been used for entity classification. Distant supervision, involving some amount of annotated data and an initial knowledge source, has been proposed to develop models in [ZNL+09, NZRS12, CBK+10, MBSJ09, SSW09, NTW11]. Unsupervised approaches require hand-crafted rules pertaining to the information to be extracted [Hea92, SW13, JVSS98, HZW10, MB05].

5 Conclusion

In this paper we have proposed an Ontology-guided News information retrieval system that uses a Convolutional Bi-LSTM network for concept detection. Concept-based linking is used to connect related articles so as to present News evolution and event distribution across regions. We have also illustrated how deep-learning methods can be deployed with small volumes of annotated data. We intend to extend the proposed methods to all kinds of legal documents and to incorporate predictive technologies for anticipating activities or events.

References

[CBK+10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, volume 5, page 3. Atlanta, 2010.

[GCW+16] Jiang Guo, Wanxiang Che, Haifeng Wang, Ting Liu, and Jun Xu. A unified architecture for semantic role labeling and relation classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1264–1274, 2016.

[Hea92] Marti A. Hearst. Direction-based text interpretation as an information access refinement. In Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval, pages 257–274, 1992.

[HZW10] Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. Learning 5000 relational extractors. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 286–295. Association for Computational Linguistics, 2010.

[JVSS98] Yaochu Jin, Werner Von Seelen, and Bernhard Sendhoff. An approach to rule-based knowledge extraction. In Proceedings of the 1998 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence), volume 2, pages 1188–1193. IEEE, 1998.

[MB05] Raymond J. Mooney and Razvan Bunescu. Mining knowledge from text using information extraction. ACM SIGKDD Explorations Newsletter, 7(1):3–10, 2005.

[MB16] Makoto Miwa and Mohit Bansal. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv preprint arXiv:1601.00770, 2016.
[MBSJ09] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics, 2009.

[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[MV93] Andres Marzal and Enrique Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993.

[NTW11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. Scalable knowledge harvesting with high precision and high recall. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 227–236. ACM, 2011.

[NZRS12] Feng Niu, Ce Zhang, Christopher Ré, and Jude Shavlik. Elementary: Large-scale knowledge-base construction via machine learning and statistical inference. International Journal on Semantic Web and Information Systems (IJSWIS), 8(3):42–73, 2012.

[SSW09] Fabian M. Suchanek, Mauro Sozio, and Gerhard Weikum. SOFIE: A self-organizing framework for information extraction. In Proceedings of the 18th International Conference on World Wide Web, pages 631–640. ACM, 2009.

[SW13] Fabian Suchanek and Gerhard Weikum. Knowledge harvesting in the big-data era. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 933–938. ACM, 2013.

[ZNL+09] Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. StatSnowball: A statistical approach to extracting entity relationships. In Proceedings of the 18th International Conference on World Wide Web, pages 101–110. ACM, 2009.