Automatic Information Extraction and Inferencing System from Online News Sources for Substance Abuse Cases

Judith George Joseph (a), Jestin Joy (b), Sreeraj M (c), Sanjay Govind (d), Shijas Muhammed T P (e) and Tibi Sunni (f)

(a) Department of Computer Science And Engineering, Federal Institute of Science And Technology (FISAT), Kerala, India
(b) Assistant Professor, Department of Computer Science And Engineering, Federal Institute of Science And Technology (FISAT), Kerala, India
(c) Assistant Professor, Department of Computer Science, Sree Ayyappa College, Alappuzha, Kerala
(d) Department of Computer Science And Engineering, Federal Institute of Science And Technology (FISAT), Kerala, India
(e) Department of Computer Science And Engineering, Federal Institute of Science And Technology (FISAT), Kerala, India
(f) Department of Computer Science And Engineering, Federal Institute of Science And Technology (FISAT), Kerala, India

Abstract
The rising number of substance abuse cases is a serious situation that demands significant attention. Insights gained from reported substance abuse cases can greatly help law enforcement authorities and policy makers. The unstructured nature of the publicly available data is a challenge. Computational techniques can be used to extract and summarise such unstructured data efficiently. The proposed system extracts news reports on substance abuse related crimes from Malayalam online newspapers. The extracted data is then processed using Natural Language Processing (NLP) techniques to produce information from which valuable inferences can be drawn. Results show that the proposed system provides good accuracy for the data extraction task.

Keywords
Information extraction, NER, Machine Learning, Data Mining

ISIC'21: International Semantic Intelligence Conference, February 25–27, 2021, New Delhi, India
Email: judithgeorgejoseph123@gmail.com (J.G. Joseph); jestinjoy@fisat.ac.in (J. Joy); sreeraj.sac@gmail.com (S. M)
URL: http://www.sreeayyappacollege.ac.in/uploads/downloads/sreeraj.pdf (S. M)
ORCID: 0000-0003-0892-7874 (J. Joy); 0000-0003-4974-437X (S. M)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

The United Nations Office on Drugs and Crime (UNODC) reports[1] that approximately 5 per cent of the world's population used an illicit drug in 2010 and that 27 million people can be classified as problem drug users. Alcohol and illicit drug use cause around 39 deaths per million population. In addition to causing death, substance abuse is responsible for significant morbidity, and the treatment of drug addiction places a tremendous burden on society. The significant rise in reported drug abuse cases is a serious public health threat. Handling this problem needs the intervention of government, law enforcement and the public health sector. A World Health Organization (WHO) study[2] estimates that the four major causes of death from illicit drug use are AIDS, suicide, overdose and trauma. Based on this, the median number of deaths is estimated to be 1,94,058 as per the 2000 estimates. Illicit drug use also causes premature deaths in young adults and adversely affects their overall health.

Substance use is a problem in India too. The report[3] "Magnitude of Substance Use in India - 2019" by the Ministry of Social Justice and Empowerment, Government of India, shows the dismal picture in India. After alcohol, cannabis and opioids are the most commonly used substances in India, with about 2.8% of the population reported to use them. More than 30 lakh of the people with opioid use disorders are from the Indian states of Uttar Pradesh, Punjab, Haryana, Delhi, Maharashtra, Rajasthan, Andhra Pradesh and Gujarat. The enforcement activities report[4] of the Excise department, Government of Kerala, states that 7099 cases were registered during 2019 under the Narcotic Drugs and Psychotropic Substances Act.

Though governments publish[3, 4] data on substance abuse cases, it is not easy to get detailed region-wise information. For example, detailed information on the size, type and location of registered cases is not easy to find. This information is, however, available in the public domain through news reports. The problem with these news reports is that they are not in a structured format. Various techniques[5, 6, 7] have been explored for extracting structured information from unstructured textual data. Information extraction is the process of extracting information from unstructured data. It is extensively used in mining medical, business and law documents. The Internet being a rich source of unstructured textual data, web mining is also an active research area. The proposed system extracts structured information from news reports. News reports on substance abuse cases from the online editions of popular Malayalam newspapers are used for this purpose. These are then processed using Natural Language Processing (NLP) techniques such as Named Entity Recognition (NER) to extract structured information. This yields details such as the places where more cases are reported, the most commonly used drug, and the amount of each drug as reported in the news.

2. Related Works

Information extraction from unstructured data has been studied in the literature[5, 6, 7, 8, 9]. This involves extracting data from medical text and from business and law documents. Most of the research revolves around English. We have not come across much research[10, 11] on information extraction from unstructured Malayalam text. This is mainly due to the lack of publicly available datasets and computational techniques for processing the text. Works related to extracting drug related information from unstructured text are discussed below.

Lybarger, Yetisgen et al. studied the extraction of substance abuse information from clinical notes[8]. They proposed a neural network architecture for automatic extraction of substance abuse information from clinical notes. A discrete model was also tried for extracting information. The clinical notes contain information about patients' substance abuse history. The model was trained to find the presence of substance events such as alcohol, drug or tobacco use. A Maximum Entropy (MaxEnt) model was used for classifying the status. Other entities such as amount, frequency and exposure history were extracted using a Conditional Random Fields (CRF) model. The neural multi-task model predicted all entities for all substances.

Khmael Rakm Rahem and Nazlia Omar proposed a rule-based approach[9] for extracting drug related crime information from online newspaper articles. The task involved extracting information such as drug name, nationality and location, and assessing the quantity and price of the drug. A set of grammatical and heuristic rules was used for this purpose. Data from the Malaysian National News Agency (BERNAMA) is used in the system. The system achieved a precision of 0.96 on drug names, 0.83 on crime location and 0.87 on drug quantity. However, the paper does not report the dataset size or the testing details.

Rexy Arulanandam, Bastin Tony Roy Savarimuthu and Maryam A. Purvis proposed a system[12] for extracting crime information from newspaper articles. Named Entity Recognition (NER) coupled with a Conditional Random Field (CRF) is used to find the crime location in a sentence. 70 articles from the Otago Daily Times are used for evaluating the system. The LBJ NER Tagger is found to be the best tagger, with a precision of 0.98. For New Zealand articles, accuracy varies from 84% to 90% on the task of identifying locations in sentences and classifying those sentences as crime location sentences.

Eiji Aramaki, Yasuhide Miura, Masatsugu Tonoike et al.[13] proposed a system for extracting adverse drug events and effects from clinical records. Results of a study on 3,012 discharge summaries show that 7.7% of the records include adverse event information, and that 59% of them can be extracted automatically.

The authors have not come across any similar system for extracting information from Malayalam news articles.

3. Design and Implementation

The proposed system consists of two phases. In the first phase, relevant data is crawled from the web and fed to the Named Entity Recognition (NER) module to create a model for recognizing named entities. This phase is shown in Figure 1.

Figure 1: NER Training [diagram: Web Crawler, NER Training]

This phase is not an easy task, since NER has to be performed on Malayalam text. Malayalam[14] is a language spoken in the Indian state of Kerala. It is one of the 22 scheduled languages of India and is spoken by 37,919,870 people. Malayalam generally follows SOV (subject-object-verb) word order. It is a heavily agglutinative and inflected language, which makes the NER task difficult. Different techniques have been explored for Malayalam NER[15, 16, 17, 18, 19]. Most of these are based on statistical techniques. This study also uses a statistical technique for NER.

The statistical model provided by spaCy (https://spacy.io/) is used in this study. Tagged data is fed to the NER system for training. A transition based approach[20] is used for NER. It uses a word embedding strategy based on subword features and bloom embeddings. CNN filter sizes are chosen with beam search. 1D convolutional filters are applied over the input text to predict how the upcoming words may change the current entity tags.

In the second phase, the trained model is used to extract information. A rule base is also used for this purpose. This phase is shown in Figure 2.

Figure 2: Information extraction
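As a minimal sketch of how a rule base can sit on top of NER output, assuming the NER stage yields (label, text, position) triples, the Python code below applies two rules of the kind this section describes: the first location mentioned is taken as the crime location, and the quantity nearest to a drug name is taken as the amount of that drug. The `Entity` class, the label names LOC, DRUG and QUANTITY, the function names and the English sample entities are all hypothetical, chosen only for illustration; they are not the system's actual code or data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entity:
    label: str  # hypothetical label names: "LOC", "DRUG", "QUANTITY"
    text: str   # surface text of the entity
    start: int  # token position of the entity in the news item


def first_location(entities: List[Entity]) -> Optional[str]:
    """Rule: the first location mentioned is taken as the crime location."""
    for ent in sorted(entities, key=lambda e: e.start):
        if ent.label == "LOC":
            return ent.text
    return None


def nearest_quantity(drug: Entity, entities: List[Entity]) -> Optional[str]:
    """Rule: the quantity closest to a drug name is that drug's amount."""
    quantities = [e for e in entities if e.label == "QUANTITY"]
    if not quantities:
        return None
    return min(quantities, key=lambda e: abs(e.start - drug.start)).text


def extract_record(entities: List[Entity]) -> Dict[str, object]:
    """Combine both rules into one structured record per news item."""
    drugs = [e for e in entities if e.label == "DRUG"]
    return {
        "location": first_location(entities),
        "drugs": {d.text: nearest_quantity(d, entities) for d in drugs},
    }


# Invented sample: entities as an NER stage might tag them in one news item.
entities = [
    Entity("LOC", "Kochi", 0),
    Entity("QUANTITY", "2 kg", 5),
    Entity("DRUG", "cannabis", 7),
    Entity("LOC", "Aluva", 12),
]
record = extract_record(entities)
print(record)  # {'location': 'Kochi', 'drugs': {'cannabis': '2 kg'}}
```

Measuring closeness in token positions is only one possible design choice; character offsets or sentence boundaries could serve the same purpose.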
The NER model identifies the entities relevant to the information extraction task. Name, age, place, drug name, amount and size are considered in the proposed system. Tagged sentences are then fed to the processing module, which processes the information based on handcrafted rules. A snapshot of the rules used in the proposed system is given below.

1. The name of the person and the drug appear in the initial part of the news item.
2. If money occurs just before the drug's name, then it is assigned to the corresponding drug.
3. The first occurrence of a location is assigned as the location.
4. The person's age is close to the name of the person.
5. The amount of drug carried by the offender is close to the drug name.

3.1. Dataset

Though trained models exist for languages like English, no publicly available tagged dataset exists for Malayalam for this task. Data is extracted from the online editions of the Malayalam news sites Malayala Manorama, Mathrubhumi, Mangalam, News18 Malayalam, Deshabhimani and Media One. Tagging for NER was done using a web frontend based on doccano[21], an open source text annotation tool. It can be used to tag data for various tasks such as named entity recognition, text summarization and sentiment analysis. The data collected for training is from the period January 2017 to December 2019.

3.2. Implementation

The data is processed using the Python programming language. The spaCy (https://spacy.io/) NER module is used for named entity recognition, which forms an important component of the system. The availability of pretrained statistical models and support for a large number of languages make spaCy a good choice for text processing.

4. Results and discussion

The proposed system passes each news item to the NER module and then processes it using the rule based system. Figure 3 shows the result of the NER module for sample news items.

Figure 3: NER output for processed sample news items

This output is then fed to the processing module for inference. The output of the inference module is given in Figure 4.

Figure 4: Output of inference module

For evaluating the system, 50 substance abuse related news articles were collected. These news articles are from the period January 2020 to March 2020. The collected articles were manually verified to be about substance abuse cases. They were then fed to the system, and the accuracy of each identified entity was recorded. Accuracy is found by manually matching the actual entities against the predicted entities. For example, of the 50 news articles considered, the location entity is predicted correctly in 42 of them, giving an accuracy of 0.84. Table 1 lists the accuracy obtained by the system for the given 50 news articles.

Table 1: Accuracy of each entity, tested on 50 news articles

Entity      Correctly identified (of 50)   Accuracy
Location    42                             0.84
Drugs       40                             0.80
Quantity    30                             0.60
Money       38                             0.76
Person      30                             0.60
Age         32                             0.64
Date        30                             0.60
Time        36                             0.72

Results indicate that the system identifies the location and drug entities with reasonably good accuracy. Although the system identifies most entities correctly at the NER stage, the final output depends on which of them the rule based system marks as relevant. The reduction in accuracy for the other entities is due to the failure of the rule base to match the entity correctly. Rules are framed manually after going through news stories, and we cannot be sure that every news story follows the same writing style. This is a major drawback of the system. For example, the accuracies of the entities quantity, person and date are the lowest. Most news stories lack quantity and date information in a standard format. Person information is also difficult to identify, since news stories sometimes omit it and sometimes include additional person names, such as those of law enforcement authorities, making it difficult for the system to identify the right one.

5. Conclusion

An automated system for generating valuable information from online news articles can reduce the colossal amount of effort needed to do the same by other means. The data provided by the system can aid statistical research and study, the generation of key inferences for investigations, and background studies for formulating action plans. Since the system processes news reports on crimes related to substance abuse, the information provided is significant and relevant, as the issue is an ongoing serious social threat.

However, in a broad sense, the services provided by the current version of the system are limited, which also opens an opportunity for future enhancement. At present the system provides only the key aspects mentioned in the news. It can be extended into a full-fledged inference system, which would increase its clarity. The proposed system can also be enhanced so that it responds to user queries.

Acknowledgments

The authors would like to thank Adam Shamsudeen for the help extended in providing the required dataset and tagging frontend.

References

[1] UNODC, Atlas on substance use (2010), 2011. URL: https://www.who.int/publications/i/item/9789241500616.
[2] M. W.-S., L. Degenhardt, W. Hall, M. Lynskey, Illicit drug use, 2020. URL: https://www.who.int/publications/cra/chapters/volume1/1109-1176.pdf.
[3] N. D. D. T. C. (NDDTC), Magnitude of substance use in India - 2019, 2020.
[4] G. o. K. Excise department, Month wise details of enforcement activities during 2019, 2020. URL: https://excise.kerala.gov.in/enforcement-activities-2/.
[5] M. Alawad, S. Gao, J. X. Qiu, H. J. Yoon, J. Blair Christian, L. Penberthy, B. Mumphrey, X.-C. Wu, L. Coyle, G. Tourassi, Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks, Journal of the American Medical Informatics Association 27 (2020) 89–98.
[6] S. Jiang, S. Baumgartner, A. Ittycheriah, C. Yu, Factoring fact-checks: Structured information extraction from fact-checking articles, in: Proceedings of The Web Conference 2020, 2020, pp. 1592–1603.
[7] N. Milosevic, C. Gregson, R. Hernandez, G. Nenadic, A framework for information extraction from tables in biomedical literature, International Journal on Document Analysis and Recognition (IJDAR) 22 (2019) 55–78.
[8] K. Lybarger, M. Yetisgen, M. Ostendorf, Using neural multi-task learning to extract substance abuse information from clinical notes, in: AMIA Annual Symposium Proceedings, volume 2018, American Medical Informatics Association, 2018, p. 1395.
[9] K. R. Rahem, N. Omar, Drug-related crime information extraction and analysis, in: Proceedings of the 6th International Conference on Information Technology and Multimedia, IEEE, 2014, pp. 250–254.
[10] N. Mohandas, J. P. Nair, V. Govindaru, Domain specific sentence level mood extraction from Malayalam text, in: 2012 International Conference on Advances in Computing and Communications, IEEE, 2012, pp. 78–81.
[11] D. S. Nair, J. P. Jayan, E. Sherly, et al., Sentima-sentiment extraction for Malayalam, in: 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2014, pp. 1719–1723.
[12] R. Arulanandam, B. T. R. Savarimuthu, M. A. Purvis, Extracting crime information from online newspaper articles, in: Proceedings of the Second Australasian Web Conference - Volume 155, 2014, pp. 31–38.
[13] E. Aramaki, Y. Miura, M. Tonoike, T. Ohkuma, H. Masuichi, K. Waki, K. Ohe, Extraction of adverse drug effects from clinical records, MedInfo 160 (2010) 739–743.
[14] G. F. Simons, C. D. Fennig, Ethnologue: Languages of Asia, SIL International, Dallas, 2017.
[15] P. Sreeja, A. S. Pillai, Towards an efficient Malayalam named entity recognizer analysis on the challenges, Procedia Computer Science 171 (2020) 2541–2546.
[16] C. Malarkodi, S. L. Devi, A deeper study on features for named entity recognition, in: Proceedings of the WILDRE5–5th Workshop on Indian Language Data: Resources and Evaluation, 2020, pp. 66–72.
[17] J. P. Jayan, R. Rajeev, E. Sherly, A hybrid statistical approach for named entity recognition for Malayalam language, in: Proceedings of the 11th Workshop on Asian Language Resources, 2013, pp. 58–63.
[18] A. Ajees, S. M. Idicula, A named entity recognition system for Malayalam using neural networks, Procedia Computer Science 143 (2018) 962–969.
[19] S. Thottingal, Finite state transducer based morphology analysis for Malayalam language, in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, 2019, pp. 1–5.
[20] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, arXiv preprint arXiv:1603.01360 (2016).
[21] H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, X. Liang, doccano: Text annotation tool for human, 2018. URL: https://github.com/doccano/doccano, software available from https://github.com/doccano/doccano.