
Automatic Information Extraction and Inferencing System from Online News Sources for Substance Abuse Cases

Judith George Joseph

judithgeorgejoseph123@gmail.com 2 3

Jestin Joy

jestinjoy@fisat.ac.in 0 3

Sreeraj M

sreeraj.sac@gmail.com 1 3

Sanjay Govind

2 3

Shijas Muhammed T P

2 3

Tibi Sunni

2 3

0 Assistant Professor, Department of Computer Science and Engineering, Federal Institute of Science and Technology (FISAT), Kerala, India

1 Assistant Professor, Department of Computer Science, Sree Ayyappa College, Alappuzha, Kerala

2 Department of Computer Science and Engineering, Federal Institute of Science and Technology (FISAT), Kerala, India

3 ISIC'21: International Semantic Intelligence Conference

2020


The rising number of substance abuse cases is a serious situation that demands significant attention. Gaining insights from reported substance abuse cases can greatly help law enforcement authorities and policy makers. The unstructured nature of the publicly available data is a challenge. Computational techniques can be used to efficiently extract and summarise this unstructured data. The proposed system extracts news reported on substance abuse related crimes from online Malayalam newspapers. The extracted data is then processed using Natural Language Processing (NLP) techniques to generate a set of information that can be helpful in drawing valuable inferences. Results show that the proposed system provides good accuracy for the data extraction task.

Keywords: Information extraction, NER, Machine Learning, Data Mining

1. Introduction

The United Nations Office on Drugs and Crime (UNODC) reports [1] that approximately 5 per cent of the world's population used an illicit drug in 2010 and that 27 million people can be classified as problem drug users.

Alcohol and illicit drug use cause around 39 deaths per million population. In addition to causing death, substance abuse is responsible for significant morbidity, and the treatment of drug addiction places a tremendous burden on society. The significant rise in reported drug abuse cases is a serious public health threat. Handling this problem needs the intervention of government, law enforcement and the public health sector. A World Health Organization (WHO) study [2] estimates that the four major causes of death from illicit drug use are AIDS, suicide, overdose and trauma. Based on this, the median number of deaths is estimated to be 194,058 as per 2000 estimates. Illicit drug use also causes premature deaths in young adults and adversely affects their overall health.

Substance use is a problem in India too. The Ministry of Social Justice and Empowerment, Government of India report [3] "Magnitude of Substance Use in India 2019" shows the dismal picture in India. After alcohol, cannabis and opioids are the most commonly used substances in India, with a reported prevalence of about 2.8% of the population. More than 30 lakh (3 million) of the people with opioid use disorders are from the Indian states of Uttar Pradesh, Punjab, Haryana, Delhi, Maharashtra, Rajasthan, Andhra Pradesh and Gujarat. The enforcement activities report [4] by the Excise Department, Government of Kerala, states that 7099 cases were registered during 2019 under the Narcotic Drugs and Psychotropic Substances Act.

Though governments publish data on substance abuse cases [3, 4], region-wise detailed information is not easy to obtain. For example, detailed information regarding the size, type and location of registered cases is hard to find. This information is, however, available in the public domain through news reports. The problem with news reports is that they are not in a structured format. Various techniques [5, 6, 7] have been explored for extracting structured information from unstructured textual data. Information extraction is the process of extracting information from unstructured data. It is extensively used in mining medical, business and law documents. The Internet being a rich source of unstructured textual data, web mining is also an active research area. The proposed system extracts structured information from news reports. News reports on substance abuse cases in the online editions of popular Malayalam newspapers are used for this purpose. These are then processed using Natural Language Processing (NLP) techniques like Named Entity Recognition (NER) to extract structured information. This helps in finding, for example, the places where more cases are reported, the most commonly used drug, and the amount of each drug as reported in the news.

2. Related Works

Information extraction from unstructured data has been studied in the literature [5, 6, 7, 8, 9]. This involves extracting data from medical text, business and law documents. Most of the research uses English as the language. We have not come across much research [10, 11] on information extraction from Malayalam unstructured text. This is mainly due to the unavailability of public datasets and computational techniques for processing the text. Works related to extracting drug related information from unstructured text are discussed below.

Extraction of substance abuse information from clinical notes was studied by Lybarger, Yetisgen et al. [8]. They proposed a neural network architecture for automatic extraction of substance abuse information from clinical notes. A discrete model was also tried for extracting information. These clinical notes were stored with information about patients' substance abuse history. The model was trained to find the presence of substance events like alcohol, drug, or tobacco. A Maximum Entropy (MaxEnt) model was used for classifying the status. Other entities like amount, frequency and exposure history were extracted using a Conditional Random Fields (CRF) model. The neural multi-task model predicted all entities for all substances.

Khmael Rakm Rahem and Nazlia Omar proposed a rule-based approach [9] for extracting drug related crime information from online newspaper articles. The task involved extracting information like drug name, nationality and location, and assessing the quantity and price of the drug. A set of grammatical and heuristic rules was used for this purpose. Data from the Malaysian National News Agency (BERNAMA) is used in the system. The system achieved a precision of 0.96 on drug names, 0.83 on crime location and 0.87 on drug quantity. However, information regarding the dataset size and testing is missing in the paper.

Rexy Arulanandam, Bastin Tony Roy Savarimuthu and Maryam A. Purvis proposed a system [12] for extracting crime information from newspaper articles. Named Entity Recognition (NER) coupled with a Conditional Random Field (CRF) is used to find the crime location in a sentence. 70 articles from the Otago Daily Times are used for evaluating the system. The LBJ NER Tagger is found to be the best tagger, with a precision of 0.98. Accuracy varies from 84% to 90% for New Zealand articles for the task of identifying locations in sentences and classifying them into crime location sentences.

Eiji Aramaki, Yasuhide Miura, Masatsugu Tonoike et al. [13] proposed a system for extracting adverse drug events and effects from clinical records. Results of a study on 3,012 discharge summaries show that 7.7% of records include adverse event information, and 59% of them can be extracted automatically.

The authors have not come across any similar systems for extracting information from Malayalam news articles.

3. Design and Implementation

The proposed system consists of two phases. In the first phase, relevant data is crawled from the web and fed to the Named Entity Recognition (NER) module to create a model for recognizing named entities. This phase is given in Figure 1.

Figure 1: NER Training (Web Crawler, NER Training)

This phase is not an easy task since we need to perform NER on Malayalam text. Malayalam [14] is a language spoken in the Indian state of Kerala. It is one of the 22 scheduled languages of India and is spoken by 37,919,870 people. Malayalam generally follows SOV (subject-object-verb) word order. It is a heavily agglutinative and inflected language, which makes NER difficult. Different techniques have been explored for Malayalam NER [15, 16, 17, 18, 19]. Most of these are based on statistical techniques. This study also uses a statistical technique for NER. The statistical model provided by spaCy is used in this study. Tagged data is fed to the NER system for training. A transition-based approach [20] is used for NER. It uses a word embedding strategy with subword features and Bloom embeddings. CNN filter sizes are chosen with beam search. 1D convolutional filters are applied over the input text to predict how the upcoming words may change the current entity tags.
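
The step of feeding tagged data to the NER system can be illustrated with a minimal sketch. It assumes that annotations are exported as JSON lines with a `text` field and a `labels` field holding `[start, end, label]` character spans; these field names, the label names and the function `load_annotations` are assumptions made for illustration and are not taken from the proposed system. The output follows the `(text, {"entities": [...]})` pairs commonly used to train spaCy NER models. An English placeholder sentence is used instead of Malayalam text.

```python
import json


def load_annotations(jsonl_lines):
    """Turn annotation records into (text, {"entities": [...]}) training pairs."""
    examples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines in the export
        record = json.loads(line)
        text = record["text"]
        spans = []
        last_end = 0
        # Sort by start offset so overlapping spans are detected in one pass.
        for start, end, label in sorted(record.get("labels", [])):
            if not (0 <= start < end <= len(text)):
                continue  # drop spans that fall outside the text
            if start < last_end:
                continue  # drop spans overlapping one already kept
            spans.append((start, end, label))
            last_end = end
        examples.append((text, {"entities": spans}))
    return examples


# Hypothetical annotated news sentence (labels are illustrative only).
sample = [
    '{"text": "Police arrested Rajan, 32, at Aluva with 2 kg of ganja.", '
    '"labels": [[16, 21, "PERSON"], [23, 25, "AGE"], [30, 35, "LOC"], '
    '[41, 45, "QUANTITY"], [49, 54, "DRUG"]]}',
]
train_data = load_annotations(sample)
for text, annotation in train_data:
    for start, end, label in annotation["entities"]:
        print(label, text[start:end])
```

Overlapping and out-of-range spans are dropped here because entity annotations for NER training are expected to be non-overlapping; a real pipeline might instead log such records for manual correction.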

In the second phase, the trained model is used to extract information. A rule base is also used for this purpose. This phase is given in Figure 2. The NER model helps to identify relevant entities for the information extraction task. Name, age, place, drug name, amount and size are considered in the proposed system. Tagged sentences are then fed to the processing module, which processes information based on handcrafted rules. A snapshot of the rules used in the proposed system is given below.

1. The name of the person and the drug appear in the initial part of the news item.
2. If money occurs just before the drug's name, then it is assigned as that of the corresponding drug.
3. The first occurrence of a location is assigned as the location.
4. The person's age is close to the name of the person.
5. The amount of drug carried by the offender is close to the drug name.

3.1. Dataset

Though trained models exist for languages like English, no publicly available tagged dataset exists for Malayalam for this task. Data is extracted from the online editions of the Malayalam news sites Malayala Manorama, Mathrubhumi, Mangalam, News18 Malayalam, Deshabhimani and Media One. Tagging for NER was done using a web frontend based on doccano [21], an open source text annotation tool. It can be used to tag data for various tasks like named entity recognition, text summarization and sentiment analysis. The data collected for training were from the period January 2017 to December 2019.

3.2. Implementation

Processing of the data is done using the Python programming language. The spaCy NER module is used for named entity recognition, which forms the important component of the system. The availability of pretrained statistical models and support for a large number of languages make spaCy a good choice for text processing.

4. Results and discussion

The proposed system passes each news item to the NER module and processes the output using the rule-based system. Figure 3 shows the result of the NER module for sample news items. This is then fed to the processing module for inference. Output from the inference module is given in Figure 4.

Figure 3: NER output for processed sample news items

For evaluating the system, 50 substance abuse related news articles were collected from the period January 2020 to March 2020. The collected articles were manually verified to be of substance abuse cases. They were then fed to the system and the accuracy of each identified entity was recorded. Accuracy is found by manually matching the true entities with the predicted entities. For example, of the 50 news articles considered, the location entity is predicted correctly in 42 (84%).

Table 1 lists the accuracy identified by the system for the given 50 news articles.

Results indicate that the system could identify the location and drug entities with reasonably good accuracy. Although the NER model identifies most entities correctly, it is the rule-based system that marks which of them are relevant. The reduction in accuracy for the other entities is due to the failure of the rule base to match the entity correctly. Rules are framed manually after going through news stories, and we cannot be sure that every news story follows the same writing style. This is a major drawback of the system. For example, the accuracy for the quantity, person and date entities is the lowest. Most news stories lack quantity and date information in a standard format. Person information is also difficult to identify, since news stories sometimes omit it and sometimes include additional person names, such as those of law enforcement authorities, making it difficult for the system to identify the right person.

5. Conclusion

An automated system for generating valuable information out of online news articles can reduce the colossal amount of effort that must be put in to do the same by other means. The data provided by the system can aid statistical research and study, generate key inferences for investigations, and support background studies for formulating action plans. Since the system processes news reports on crimes related to substance abuse, the information provided is significant and relevant, as the issue is an ongoing serious social threat.

However, in a broad sense, the services provided by the current version of the system are limited, which also opens an opportunity for future enhancement. At present the system provides only the key aspects mentioned in the news. It can be extended into a full-fledged inference system, which would increase its clarity. The proposed system can also be enhanced so that it responds to user queries.
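
The handcrafted rules of Section 3 and the per-entity accuracy reported in Section 4 can be illustrated with a minimal sketch. The entity labels (PERSON, AGE, LOC, DRUG, QUANTITY, MONEY), the `Entity` class and the function names are hypothetical. The paper does not specify how "initial part", "just before" and "close to" are measured, so the sketch assumes first occurrence, the directly preceding tagged entity, and smallest character distance respectively. An English placeholder sentence stands in for a Malayalam news item.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    start: int  # character offset of the entity in the news item
    label: str  # hypothetical tags: PERSON, AGE, LOC, DRUG, QUANTITY, MONEY
    text: str


def first(entities, label):
    """Return the earliest entity with the given label, or None."""
    return next((e for e in entities if e.label == label), None)


def nearest(entities, label, anchor):
    """Return the entity with the given label closest to the anchor entity."""
    candidates = [e for e in entities if e.label == label]
    if anchor is None or not candidates:
        return None
    return min(candidates, key=lambda e: abs(e.start - anchor.start))


def apply_rules(entities):
    """Fill one record for a news item from its tagged entities."""
    entities = sorted(entities, key=lambda e: e.start)
    person = first(entities, "PERSON")              # rule 1
    drug = first(entities, "DRUG")                  # rule 1
    location = first(entities, "LOC")               # rule 3
    age = nearest(entities, "AGE", person)          # rule 4
    quantity = nearest(entities, "QUANTITY", drug)  # rule 5
    value = None
    if drug is not None:
        idx = entities.index(drug)
        # Rule 2: a MONEY entity directly preceding the drug name.
        if idx > 0 and entities[idx - 1].label == "MONEY":
            value = entities[idx - 1]
    fields = {"person": person, "age": age, "location": location,
              "drug": drug, "quantity": quantity, "value": value}
    return {name: (e.text if e else None) for name, e in fields.items()}


def field_accuracy(predicted, gold, field):
    """Fraction of news items whose predicted field equals the manual label."""
    correct = sum(p[field] == g[field] for p, g in zip(predicted, gold))
    return correct / len(gold)


# "Police arrested Rajan, 32, at Aluva with 2 kg of ganja. ... Kochi ..."
item = [
    Entity(16, "PERSON", "Rajan"), Entity(23, "AGE", "32"),
    Entity(30, "LOC", "Aluva"), Entity(41, "QUANTITY", "2 kg"),
    Entity(49, "DRUG", "ganja"), Entity(70, "LOC", "Kochi"),
]
record = apply_rules(item)
print(record)
```

With records of this form for each article and the corresponding manually verified records, `field_accuracy(predicted, gold, "location")` gives the kind of per-entity figure described in Section 4, for example 42 correct out of 50 articles.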

Acknowledgments

The authors would like to thank Adam Shamsudeen for the help extended in providing the required dataset and tagging frontend.

References

[1] UNODC, Atlas on substance use (2010), 2011. URL: https://www.who.int/publications/i/item/9789241500616.

[2] M. W.-S. Louisa Degenhardt, Wayne Hall, Lynskey, Illicit drug use, 2020. URL: https://www.who.int/publications/cra/chapters/volume1/1109-1176.pdf.

[3] N. D. D. T. C. (NDDTC), Magnitude of substance use in india - 2019, 2020.

[4] Excise Department, Government of Kerala, Month wise details of enforcement activities during 2019, 2020. URL: https://excise.kerala.gov.in/enforcement-activities-2/.

[5] Alawad, Gao, J. X. Qiu, H. J. Yoon, J. Blair Christian, Penberthy, Mumphrey, X.-C. Wu, L. Coyle, G. Tourassi, Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks, Journal of the American Medical Informatics Association 27 (2020) 89-98.

[6] Jiang, Baumgartner, Ittycheriah, Yu, Factoring fact-checks: Structured information extraction from fact-checking articles, in: Proceedings of The Web Conference 2020, 2020, pp. 1592-1603.

[7] Milosevic, Gregson, Hernandez, G. Nenadic, A framework for information extraction from tables in biomedical literature, International Journal on Document Analysis and Recognition (IJDAR) 22 (2019) 55-78.

[8] Lybarger, Yetisgen, Ostendorf, Using neural multi-task learning to extract substance abuse information from clinical notes, in: AMIA Annual Symposium Proceedings, volume 2018, American Medical Informatics Association, 2018, p. 1395.

[9] K. R. Rahem, Omar, Drug-related crime information extraction and analysis, in: Proceedings of the 6th International Conference on Information Technology and Multimedia, IEEE, 2014, pp. 250-254.

[10] Mohandas, J. P. Nair, Govindaru, Domain specific sentence level mood extraction from malayalam text, in: 2012 International Conference on Advances in Computing and Communications, IEEE, 2012, pp. 78-81.

[11] D. S. Nair, J. P. Jayan, Sherly, et al., Sentima-sentiment extraction for malayalam, in: 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2014, pp. 1719-1723.

[12] Arulanandam, B. T. R. Savarimuthu, M. A. Purvis, Extracting crime information from online newspaper articles, in: Proceedings of the second australasian web conference - volume 155, 2014, pp. 31-38.

[13] Aramaki, Miura, Tonoike, Ohkuma, Masuichi, Waki, Ohe, Extraction of adverse drug effects from clinical records, MedInfo 160 (2010) 739-743.

[14] G. F. Simons, C. D. Fennig, Ethnologue: languages of Asia, sil International Dallas, 2017.

[15] Sreeja, A. S. Pillai, Towards an efficient malayalam named entity recognizer analysis on the challenges, Procedia Computer Science 171 (2020) 2541-2546.

[16] Malarkodi, S. L. Devi, A deeper study on features for named entity recognition, in: Proceedings of the WILDRE5-5th Workshop on Indian Language Data: Resources and Evaluation, 2020, pp. 66-72.

[17] J. P. Jayan, Rajeev, Sherly, A hybrid statistical approach for named entity recognition for malayalam language, in: Proceedings of the 11th Workshop on Asian Language Resources, 2013, pp. 58-63.

[18] Ajees, S. M. Idicula, A named entity recognition system for malayalam using neural networks, Procedia computer science 143 (2018) 962-969.

[19] Thottingal, Finite state transducer based morphology analysis for malayalam language, in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, 2019, pp. 1-5.

[20] Lample, Ballesteros, Subramanian, Kawakami, Dyer, Neural architectures for named entity recognition, arXiv preprint arXiv:1603.01360 (2016).

[21] Nakayama, Kubo, Kamura, Taniguchi, Liang, doccano: Text annotation tool for human, 2018. URL: https://github.com/doccano/doccano, software available from https://github.com/doccano/doccano.