Automatic Information Extraction and Inferencing
System from Online News Sources for Substance Abuse
Cases
Judith George Joseph (a), Jestin Joy (b), Sreeraj M (c), Sanjay Govind (d), Shijas Muhammed T P (e)
and Tibi Sunni (f)
a Department of Computer Science And Engineering, Federal Institute of Science And Technology(FISAT), Kerala, India
b Assistant Professor, Department of Computer Science And Engineering, Federal Institute of Science And Technology(FISAT), Kerala, India
c Assistant Professor, Department of Computer Science, Sree Ayyappa College, Alappuzha, Kerala
d Department of Computer Science And Engineering, Federal Institute of Science And Technology(FISAT), Kerala, India
e Department of Computer Science And Engineering, Federal Institute of Science And Technology(FISAT), Kerala, India
f Department of Computer Science And Engineering, Federal Institute of Science And Technology(FISAT), Kerala, India


Abstract
The rising number of substance abuse cases is a serious situation that demands significant attention. Insights gained from reported substance abuse cases can greatly help law enforcement authorities and policy makers. The unstructured nature of the publicly available data is a challenge. Computational techniques can be used to extract and summarise this unstructured data efficiently. The proposed system extracts news reports on substance abuse related crimes from Malayalam online newspapers. The extracted data is then processed using Natural Language Processing (NLP) techniques to generate a set of information from which useful inferences can be drawn. Results show that the proposed system provides good accuracy for the data extraction task.

Keywords
Information extraction, NER, Machine Learning, Data Mining


ISIC'21: International Semantic Intelligence Conference, February 25–27, 2021, New Delhi, India
Email: judithgeorgejoseph123@gmail.com (J.G. Joseph); jestinjoy@fisat.ac.in (J. Joy); sreeraj.sac@gmail.com (S. M)
URL: http://www.sreeayyappacollege.ac.in/uploads/downloads/sreeraj.pdf (S. M)
ORCID: 0000-0003-0892-7874 (J. Joy); 0000-0003-4974-437X (S. M)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

The United Nations Office on Drugs and Crime (UNODC) reports[1] that approximately 5 per cent of the world's population used an illicit drug in 2010 and that 27 million people can be classified as problem drug users. Alcohol and illicit drug use cause around 39 deaths per million population. In addition to causing death, substance abuse is responsible for significant morbidity, and the treatment of drug addiction places a tremendous burden on society. The significant rise in reported drug abuse cases is a serious public health threat. Handling this problem needs the intervention of government, law enforcement and the public health sector. A World Health Organization (WHO) study[2] estimates that the four major causes of death from illicit drug use are AIDS, suicide, overdose and trauma. Based on this, the median number of deaths is estimated to be 1,94,058 as per 2000 estimates. Illicit drug use also causes premature deaths in young adults and adversely affects their overall health.

Substance use is a problem in India too. The report[3] "Magnitude of Substance Use in India - 2019" by the Ministry of Social Justice and Empowerment, Government of India, shows the dismal picture in India. After alcohol, cannabis and opioids are the most commonly used substances in India; the report puts use at about 2.8% of the population. More than 30 lakh of the people with opioid use disorders are from the Indian states of Uttar Pradesh, Punjab, Haryana, Delhi, Maharashtra, Rajasthan, Andhra Pradesh and Gujarat. The enforcement activities report[4] of the Excise department, Government of Kerala, states that 7099 cases were registered during 2019 under the Narcotic Drugs and Psychotropic Substances Act.

Though governments publish[3, 4] data regarding substance abuse cases, it is not easy to get detailed region-wise information. For example, detailed information regarding the size, type and location of registered cases is not easy to find. This information is, however, available in the public domain through news reports. The problem with these news reports is that they are not in a structured format. Various techniques[5, 6, 7] have been explored for extracting structured information from unstructured textual data. Information extraction is
the process of extracting information from unstructured data. It is extensively used in mining medical, business and law documents. With the Internet being a rich source of unstructured textual data, web mining is also an active research area. The proposed system extracts structured information from news reports. News reports on substance abuse cases from the online editions of popular Malayalam newspapers are used for this purpose. These are then processed using Natural Language Processing (NLP) techniques like Named Entity Recognition (NER) to extract structured information. This helps in finding, for example, the places where more cases are reported, the most commonly used drug and the amount of each drug as reported in the news.

2. Related Works

Information extraction from unstructured data has been studied in the literature[5, 6, 7, 8, 9]. This involves extracting data from medical text and from business and law documents. Most of the research revolves around English. We have not come across much research[10, 11] on information extraction from unstructured Malayalam text. This is mainly due to the lack of publicly available datasets and computational techniques for processing the text. Works related to extracting drug related information from unstructured text are discussed below.

Extraction of substance abuse information from clinical notes[8] was studied by Lybarger, Yetisgen et al. They proposed a neural network architecture for automatic extraction of substance abuse information from clinical notes. A discrete model was also tried. The clinical notes record patients' substance abuse history. The model was trained to detect substance events such as alcohol, drug or tobacco use. A Maximum Entropy (MaxEnt) model was used for classifying the status. Other entities such as amount, frequency and exposure history were extracted using a Conditional Random Fields (CRF) model. The neural multi-task model predicted all entities for all substances.

Khmael Rakm Rahem and Nazlia Omar proposed a rule-based approach[9] for extracting drug related crime information from online newspaper articles. The task involved extracting information such as drug name, nationality and location, and assessing the quantity and price of the drug. A set of grammatical and heuristic rules was used for this purpose. Data from the Malaysian National News Agency (BERNAMA) is used in the system. The system achieved a precision of 0.96 on drug names, 0.83 on crime location and 0.87 on drug quantity. However, the paper gives no information on the dataset size or on how testing was done.

Rexy Arulanandam, Bastin Tony Roy Savarimuthu and Maryam A. Purvis proposed a system[12] for extracting crime information from newspaper articles. Named Entity Recognition (NER) coupled with a Conditional Random Field (CRF) is used to find the crime location in a sentence. 70 articles from the Otago Daily Times are used for evaluating the system. The LBJ NER Tagger is found to be the best tagger, with a precision of 0.98. For New Zealand articles, accuracy varies from 84% to 90% on the task of identifying locations in sentences and classifying the sentences as crime location sentences.

Eiji Aramaki, Yasuhide Miura, Masatsugu Tonoike et al.[13] proposed a system for extracting adverse drug events and effects from clinical records. Results of a study on 3,012 discharge summaries show that 7.7% of records include adverse event information, and that 59% of them can be extracted automatically.

The authors have not come across any similar system for extracting information from Malayalam news articles.

3. Design and Implementation

The proposed system consists of two phases. In the first phase, relevant data is crawled from the web and fed to the Named Entity Recognition (NER) module to create a model for recognizing named entities. This phase is given in Figure 1.

Figure 1: NER Training (Web Crawler → NER Training)

This phase is not an easy task, since NER must be performed on Malayalam text. Malayalam[14] is a language spoken in the Indian state of Kerala. It is one of the 22 scheduled languages of India and is spoken by 37,919,870 people. Malayalam generally follows SOV (subject-object-verb) word order. It is a highly agglutinative and inflected language, which makes NER difficult. Different techniques have been explored for Malayalam NER[15, 16, 17, 18, 19]. Most of these are based on statistical techniques. This study also uses a statistical technique for NER.

Figure 2: Information extraction

A statistical model provided by spaCy (https://spacy.io/) is used in this study. Tagged data is fed to the NER system for training. A transition-based approach[20] is used for NER. It uses a word embedding strategy with subword features and Bloom embeddings. CNN filter sizes are chosen with beam search. 1D convolutional filters are applied over the input text to predict how the upcoming words may change the current entity tags.
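
The paper does not state how the tagged data is stored or which spaCy version is used. As a minimal sketch, assuming the annotations are exported as one JSON object per line with a `text` field and a `labels` list of `[start, end, label]` character spans (an assumed layout, not confirmed by the source), the records can be converted into the `(text, {"entities": [...]})` tuples commonly used as spaCy NER training data. The function name, the label names and the English sample sentence are illustrative only.

```python
import json


def to_training_examples(jsonl_text):
    """Convert annotation records (one JSON object per line) into
    (text, {"entities": [(start, end, label), ...]}) tuples."""
    examples = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        text = record["text"]
        # "labels" is an assumed field name holding [start, end, label] spans
        entities = sorted((s, e, lab) for s, e, lab in record.get("labels", []))
        for start, end, _ in entities:
            if not 0 <= start < end <= len(text):
                raise ValueError(f"span {start}-{end} lies outside the text")
        # Flat NER allows one label per character, so reject overlapping spans
        for (_, prev_end, _), (next_start, _, _) in zip(entities, entities[1:]):
            if next_start < prev_end:
                raise ValueError("overlapping entity spans")
        examples.append((text, {"entities": entities}))
    return examples


# Hypothetical annotated sentence, in English for readability
sample = json.dumps({
    "text": "Police seized 2 kg of cannabis in Kochi",
    "labels": [[22, 30, "DRUG"], [14, 18, "QUANTITY"], [34, 39, "LOCATION"]],
})

examples = to_training_examples(sample)
for text, annotations in examples:
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

Checking the spans before training matters for Malayalam in particular, because a character offset that is off by one can split a word in the middle of an inflected form.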
   In the second phase, the trained model is used to extract information, together with a rule base. This phase is given in Figure 2. The NER model identifies the entities relevant to the information extraction task. Name, age, place, drug name, amount and size are considered in the proposed system. Tagged sentences are then fed to the processing module, which processes the information using handcrafted rules. A snapshot of the rules used in the proposed system is given below.
    1. The names of the person and the drug appear in the initial part of the news item.
    2. If a money amount occurs just before a drug name, it is assigned to that drug.
    3. The first occurrence of a location is assigned as the location.
    4. The age of a person appears close to the name of the person.
    5. The amount of drug carried by the offender appears close to the drug name.

3.1. Dataset

Though trained models exist for languages like English, no publicly available tagged dataset exists for this task in Malayalam. Data is extracted from the online editions of the Malayalam news sites Malayala Manorama, Mathrubhumi, Mangalam, News18 Malayalam, Deshabhimani and Media One. Tagging for NER was done using a web frontend based on doccano[21], an open source text annotation tool. It can be used to tag data for various tasks such as named entity recognition, text summarization and sentiment analysis. The data collected for training covers the period January 2017 to December 2019.

3.2. Implementation

The data is processed using the Python programming language. The spaCy (https://spacy.io/) NER module is used for named entity recognition, which forms the most important component of the system. The availability of pretrained statistical models and support for a large number of languages make spaCy a good choice for text processing.

4. Results and discussion

The proposed system passes each news item to the NER module and then processes the result using the rule-based system. Figure 3 shows the output of the NER module for sample news items.

Figure 3: NER output for processed sample news items

This output is then fed to the processing module for inference. Output from the inference module is given in Figure 4.
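
The handcrafted rules listed in Section 3 can be sketched as follows. This is one possible reading of the five rules, not the authors' implementation: "initial part" is taken to mean the first PERSON entity, "just before" is taken to mean the immediately preceding entity, and "close to" is taken to mean the smallest token distance. The `Entity` class, the label names and the sample entities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    label: str  # assumed labels: PERSON, AGE, LOCATION, DRUG, QUANTITY, MONEY
    text: str
    start: int  # token position of the entity in the news item


def nearest(anchor, candidates):
    """Return the candidate closest to the anchor, or None if there is none."""
    return min(candidates, key=lambda e: abs(e.start - anchor.start), default=None)


def apply_rules(entities):
    ents = sorted(entities, key=lambda e: e.start)

    def by_label(label):
        return [e for e in ents if e.label == label]

    record = {"person": None, "age": None, "location": None, "drugs": []}

    # Rule 3: the first location mentioned is taken as the location
    locations = by_label("LOCATION")
    if locations:
        record["location"] = locations[0].text

    # Rule 1: the person appears early in the item, so take the first PERSON
    persons = by_label("PERSON")
    if persons:
        person = persons[0]
        record["person"] = person.text
        # Rule 4: the age is the AGE entity closest to that person
        age = nearest(person, by_label("AGE"))
        if age is not None:
            record["age"] = age.text

    for i, ent in enumerate(ents):
        if ent.label != "DRUG":
            continue
        drug = {"name": ent.text, "quantity": None, "money": None}
        # Rule 5: the quantity is the QUANTITY entity closest to the drug name
        quantity = nearest(ent, by_label("QUANTITY"))
        if quantity is not None:
            drug["quantity"] = quantity.text
        # Rule 2: a MONEY entity immediately before the drug belongs to it
        if i > 0 and ents[i - 1].label == "MONEY":
            drug["money"] = ents[i - 1].text
        record["drugs"].append(drug)
    return record


# Hypothetical NER output for one news item
sample = [
    Entity("LOCATION", "Kochi", 0),
    Entity("PERSON", "Arun", 3),
    Entity("AGE", "24", 5),
    Entity("MONEY", "Rs 50,000", 9),
    Entity("DRUG", "cannabis", 12),
    Entity("QUANTITY", "2 kg", 14),
    Entity("PERSON", "Inspector Raj", 20),
]

result = apply_rules(sample)
print(result)
```

In this sample the second PERSON entity stands for a law enforcement officer. Taking only the first PERSON avoids picking it up, but the same shortcut fails when an officer is named before the offender, which is the kind of rule mismatch discussed in the results.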
Figure 4: Output of inference module
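
The evaluation in this section scores each entity type by the fraction of the 50 test articles in which the predicted entity matches the manually identified one. A minimal sketch of that arithmetic, using the counts reported in Table 1 (the function and variable names are illustrative and not taken from the authors' code):

```python
TEST_ARTICLES = 50

# Number of articles (out of 50) in which each entity was correctly identified,
# as reported in Table 1
correct_counts = {
    "Location": 42,
    "Drugs": 40,
    "Quantity": 30,
    "Money": 38,
    "Person": 30,
    "Age": 32,
    "Date": 30,
    "Time": 36,
}


def accuracy(correct, total):
    """Fraction of test articles in which the entity was predicted correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


accuracies = {
    entity: round(accuracy(count, TEST_ARTICLES), 2)
    for entity, count in correct_counts.items()
}

for entity, value in accuracies.items():
    print(f"{entity:<10}{value:.2f}")
```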


   For evaluating the system, 50 substance abuse related news articles were collected. These news articles are from the period January 2020 to March 2020. The collected articles were manually verified to be about substance abuse cases. They were then fed to the system, and the accuracy of each identified entity was recorded. Accuracy is found by manually matching the actual entities against the predicted entities. For example, of the 50 news articles considered, the location entity is predicted correctly for 42.

   Table 1 lists the accuracy achieved by the system on the given 50 news articles.

     Entity      Correctly identified (of 50)   Accuracy
     Location               42                    0.84
     Drugs                  40                    0.80
     Quantity               30                    0.60
     Money                  38                    0.76
     Person                 30                    0.60
     Age                    32                    0.64
     Date                   30                    0.60
     Time                   36                    0.72

Table 1
Accuracy for each entity, tested on 50 news articles

   Results indicate that the system identifies the location and drugs entities with reasonably good accuracy. Although the NER module identifies most entities correctly, an entity counts as correct here only when the rule-based system also marks it as the relevant one. The lower accuracy for the other entities is due to the rule base failing to match the entity correctly. Rules are framed manually after going through news stories, and we cannot be sure that every news story follows the same writing style. This is a major drawback of the system. For example, the accuracy for the quantity, person and date entities is the lowest. Most news stories lack quantity and date information in a standard format. Person information is also difficult to identify: news stories sometimes omit it, and sometimes include additional person names, such as those of law enforcement officers, which makes it difficult for the system to identify the correct person.

5. Conclusion

An automated system for generating valuable information from online news articles can reduce the colossal effort otherwise needed to do the same by other means. The data provided by the system can aid statistical research and study, generate key inferences for investigations and support background studies for formulating action plans. Since the system processes news reports on crimes related to substance abuse, the information it provides is significant and relevant, as the issue is an ongoing and serious social threat.

However, in a broad sense the services provided by the current version of the system are limited, which also opens an opportunity for future enhancement. At present the system provides only the key aspects mentioned in the news. It can be extended into a full-fledged inference system, which would increase its clarity. The proposed system can also be enhanced to respond to user queries.
Acknowledgments

The authors thank Adam Shamsudeen for providing the required dataset and tagging frontend.

References

 [1] UNODC, Atlas on substance use (2010), 2011. URL: https://www.who.int/publications/i/item/9789241500616.
 [2] M. W.-S. Louisa Degenhardt, Wayne Hall, M. Lynskey, Illicit drug use, 2020. URL: https://www.who.int/publications/cra/chapters/volume1/1109-1176.pdf.
 [3] N. D. D. T. C. (NDDTC), Magnitude of substance use in India - 2019, 2020.
 [4] G. o. K. Excise department, Month wise details of enforcement activities during 2019, 2020. URL: https://excise.kerala.gov.in/enforcement-activities-2/.
 [5] M. Alawad, S. Gao, J. X. Qiu, H. J. Yoon, J. Blair Christian, L. Penberthy, B. Mumphrey, X.-C. Wu, L. Coyle, G. Tourassi, Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks, Journal of the American Medical Informatics Association 27 (2020) 89–98.
 [6] S. Jiang, S. Baumgartner, A. Ittycheriah, C. Yu, Factoring fact-checks: Structured information extraction from fact-checking articles, in: Proceedings of The Web Conference 2020, 2020, pp. 1592–1603.
 [7] N. Milosevic, C. Gregson, R. Hernandez, G. Nenadic, A framework for information extraction from tables in biomedical literature, International Journal on Document Analysis and Recognition (IJDAR) 22 (2019) 55–78.
 [8] K. Lybarger, M. Yetisgen, M. Ostendorf, Using neural multi-task learning to extract substance abuse information from clinical notes, in: AMIA Annual Symposium Proceedings, volume 2018, American Medical Informatics Association, 2018, p. 1395.
 [9] K. R. Rahem, N. Omar, Drug-related crime information extraction and analysis, in: Proceedings of the 6th International Conference on Information Technology and Multimedia, IEEE, 2014, pp. 250–254.
[10] N. Mohandas, J. P. Nair, V. Govindaru, Domain specific sentence level mood extraction from Malayalam text, in: 2012 International Conference on Advances in Computing and Communications, IEEE, 2012, pp. 78–81.
[11] D. S. Nair, J. P. Jayan, E. Sherly, et al., Sentima-sentiment extraction for Malayalam, in: 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2014, pp. 1719–1723.
[12] R. Arulanandam, B. T. R. Savarimuthu, M. A. Purvis, Extracting crime information from online newspaper articles, in: Proceedings of the Second Australasian Web Conference - Volume 155, 2014, pp. 31–38.
[13] E. Aramaki, Y. Miura, M. Tonoike, T. Ohkuma, H. Masuichi, K. Waki, K. Ohe, Extraction of adverse drug effects from clinical records, MedInfo 160 (2010) 739–743.
[14] G. F. Simons, C. D. Fennig, Ethnologue: Languages of Asia, SIL International, Dallas, 2017.
[15] P. Sreeja, A. S. Pillai, Towards an efficient Malayalam named entity recognizer analysis on the challenges, Procedia Computer Science 171 (2020) 2541–2546.
[16] C. Malarkodi, S. L. Devi, A deeper study on features for named entity recognition, in: Proceedings of the WILDRE5–5th Workshop on Indian Language Data: Resources and Evaluation, 2020, pp. 66–72.
[17] J. P. Jayan, R. Rajeev, E. Sherly, A hybrid statistical approach for named entity recognition for Malayalam language, in: Proceedings of the 11th Workshop on Asian Language Resources, 2013, pp. 58–63.
[18] A. Ajees, S. M. Idicula, A named entity recognition system for Malayalam using neural networks, Procedia Computer Science 143 (2018) 962–969.
[19] S. Thottingal, Finite state transducer based morphology analysis for Malayalam language, in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, 2019, pp. 1–5.
[20] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, arXiv preprint arXiv:1603.01360 (2016).
[21] H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, X. Liang, doccano: Text annotation tool for human, 2018. Software available from https://github.com/doccano/doccano.