<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named Entity Recognition from Chernobyl Documentaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniil Tikhomirov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikita Nikitinsky</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya Makarov</string-name>
          <email>iamakarov@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University, Higher School of Economics</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National University of Science and Technology MISIS</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The paper describes a system that extracts facts and opinions from documentary texts in order to create a domain ontology of a controversial topic, the Chernobyl disaster. The pipeline of the system is based on an RNN-based NER module, which was tested on an annotated text corpus.</p>
      </abstract>
      <kwd-group>
<kwd>information extraction</kwd>
        <kwd>NER</kwd>
        <kwd>opinion mining</kwd>
        <kwd>domain ontology</kwd>
        <kwd>Chernobyl disaster</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>In this paper, we propose a system that extracts named entities and events from a documentary corpus dealing with a controversial topic, the Chernobyl disaster. Below we describe its architecture and the initial results achieved by its first implementation.</p>
<p>The task of Named Entity Recognition (NER) is a crucial component in the development of any NLP system that requires a certain level of general or domain-specific knowledge, such as a question answering system. The present task is no exception: the step of retrieving relevant propositions about certain events pertaining to the Chernobyl disaster and the individuals’ or objects’ involvement in it should be prefaced with the extraction of such objects. This step allows us both to separate the relevant passages of our corpus from the irrelevant ones and to classify the extracted facts and opinions by their subjects.</p>
      <p>
        The domain that the NER methods will be applied to in our work is fairly limited. The number of actors and events connected to the Chernobyl disaster is rather small and predictable compared to general-domain NER tasks, or to certain domain-specific yet broader tasks, such as concept detection in medical texts, like the one described in
        <xref ref-type="bibr" rid="ref1">(Uzuner et al., 2011)</xref>
        [1].
However, the domain is not so limited, nor the corpus so small, that the task could be done manually; rather, it calls for extracting more fine-grained knowledge from the present texts under human supervision and reinforcement.
      </p>
<p>(Yadav and Bethard, 2018) [2] offers a comprehensive survey of developments in NER systems and reviews different approaches to this problem, from early rule- and dictionary-based approaches to the more recent methods that rely on feature engineering and supervised learning, as well as state-of-the-art neural network systems. Besides describing examples of every major method for NER, the survey compares their effectiveness, which makes it possible to make an informed decision about which method suits the present task. The gazetteer-based approaches described there were discarded for obvious reasons: while it is possible to annotate a portion of the corpus, one would also like to infer other possible entities that were not found in the annotated part.</p>
<p>Of the machine learning methods, the most consistently good results were achieved by feature-inferring neural network models, namely by Bi-LSTM models, both word- and character-level [2]. It should be noted that feature-engineered models, such as (Agerri and Rigau, 2017) [3], achieved similarly good results, but they have a strong disadvantage compared to the feature-inferring models: feature engineering for a new domain requires a lot of time and resources, and constitutes a separate topic in NLP all on its own. As such, when choosing among the ready-made solutions, the RNN models look much more favourable, offering the same good quality while being easy to adapt to one’s needs.</p>
<p>The suitability of neural network methods is supported by the abundance of libraries that implement this strategy. One of the more popular and available options is the customizable NER model included in SpaCy (spacy.io), a multi-purpose NLP module for Python. The model used there, however, differs from the best-performing models described in (Yadav and Bethard, 2018) [2] in that it uses a CNN approach rather than an RNN. The SpaCy model calculates word representations using both subword features (prefix, suffix, general shape of the word) and "Bloom" embeddings (a way to assign hash IDs to vectors to reduce dimensionality and speed up the model, while using several successive hashing functions to avoid hash collisions; the technique is described in (Serra and Karatzoglou, 2017) [4]). These embeddings are then passed to a trigram CNN with residual connections, where they are transformed in accordance with their context. Finally, the prediction layer of the model is a standard multi-layer perceptron. The architecture of this model is heavily inspired by (Lample et al., 2016) [5], which describes a transition-based model for chunking and labeling a sequence of inputs using a stack data structure, which allows for a "summary embedding" of several previous words. Among other things, this allows for an easy representation of multi-word named entities, as they are included in the stack together. The good results (F1 = 86.4), together with the ease of implementation and customization, make SpaCy a viable instrument for the named entity recognition in our system.</p>
<p>Another popular library that we inspect is DeepPavlov (https://deeppavlov.ai/). DeepPavlov employs the more "standard" RNN method for the named entity recognition task. It has also been trained on the OntoNotes 5.0 dataset [6], and shows results similar to those of the model used in SpaCy. The quality of both models on our dataset is compared in Section 4.2.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
<p>Our data consists of a corpus of English-language documentary works concerned with the Chernobyl disaster and, more generally, with the effects of hazards connected to such catastrophes on the environment and human health. The corpus includes printed and internet articles and books. These materials come in different ebook formats: PDF, .mobi, .epub and .djvu, some of them containing a separate text layer, and some without one. In the former case, the text layer is simply extracted using specialized Python libraries; in the latter case, such a layer is first created using image-to-text methods, in particular the Python OCR library PyTesseract.</p>
<p>The present task does not call for the usual preprocessing techniques, such as removal of punctuation or stopwords: relation extraction and argumentation detection methods demand full preservation of all elements of written text that help distinguish between sentences that contain opinions and/or argumentation and those that do not, such as commas, colons, conjunctions, etc. Therefore, the corpus preprocessing was primarily concerned with noise that comes from book formatting, which is designed for the human reader but does not translate well into any kind of automatic parsing. The sources of this noise include:
• line breaks that separate lines on PDF pages rather than paragraphs;
• tables, figures, images and their captions, which are not parsed effectively and often break a sentence in the middle;
• page numbers;
• repetitions of the book or chapter title;
• footnotes and references.</p>
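      <p>As an illustration, noise of the kinds listed above can be handled with a few regular-expression rules. The function below is a minimal sketch, not the project's actual per-source rule sets, and assumes that page numbers stand on their own lines and that empty lines mark real paragraph boundaries:</p>

```python
import re

def clean_page_text(raw: str) -> str:
    """Remove common book-formatting noise from an extracted text layer."""
    # Drop lines that consist only of a page number.
    text = re.sub(r"(?m)^\s*\d+\s*$", "", raw)
    # Join words hyphenated across a line break: "coef-\nficient" -> "coefficient".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse single line breaks (PDF line wrapping) into spaces,
    # keeping empty lines as real paragraph boundaries.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Normalise whitespace left over from the substitutions.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

page = "The reactor was operated at low\npower.\n\n42\n\nThe void coef-\nficient increased."
assert clean_page_text(page) == (
    "The reactor was operated at low power.\n\nThe void coefficient increased."
)
```

      <p>Each source work then gets its own variant of such rules on top of this common core.</p>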
<p>As it is impossible to find a one-size-fits-all solution to such problems, each source being characterized by its own particular style of formatting, a separate set of preprocessing rules was created for each source work. These include removal of line breaks that do not come after the end of a sentence, as well as of figures, tables, running titles, author names and page numbers. Apart from that, if a work is constructed as an anthology that covers various topics (e.g. various anthropogenic disasters or types of pollution), only the relevant part of such a work was manually extracted and included in the corpus. All in all, our corpus contains 462,843 tokens (without punctuation) split over 23,090 sentences. The bibliographical information about the books included in the corpus can be found in Appendix A.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiment Settings</title>
<p>The NER module was used to construct the domain ontology of the Chernobyl disaster: objects and entities that are likely to be a subject of controversy and that played a significant role in the events of the disaster. The entities and objects are included in the ontology in two ways: first, all named entities extracted by an NER model, and second, common nouns that designate objects and events that cannot be considered named entities but are nevertheless relevant to the topic at hand. These objects include, but are not limited to:
• occupations of people that were involved in the catastrophe ("operators", "liquidators", "workers");
• different parts of the Chernobyl facility or the reactor ("graphite rods", "turbines");
• various health hazards ("radiation");
• consequences of exposure to such hazards ("cancer", "sarcoma");
• physical phenomena associated with the catastrophe and the reactor operation ("runaway", "void coefficient");
• umbrella terms for different causes of the disaster ("safety violations", "design faults").</p>
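      <p>The two-way population of the ontology described above can be sketched as follows; the entity pairs and the curated term list in this snippet are made-up illustrations, not the actual model output or the project's real term inventory:</p>

```python
# Hypothetical NER output as (text, label) pairs and a hand-curated list of
# relevant common-noun terms; both are illustrative examples only.
ner_output = [
    ("Valery Legasov", "PERSON"),
    ("Pripyat", "GPE"),
    ("INSAG-7", "LAW"),
]
curated_terms = {
    "operators", "liquidators", "graphite rods", "radiation",
    "sarcoma", "void coefficient", "safety violations",
}

# The ontology keeps both kinds of entries; curated common nouns get a
# separate TERM label so the two sources remain distinguishable.
ontology = {text: label for text, label in ner_output}
ontology.update({term: "TERM" for term in curated_terms})

assert ontology["INSAG-7"] == "LAW"
assert ontology["liquidators"] == "TERM"
```

      <p>Keeping the NER-derived entries and the hand-picked common nouns under distinct labels makes it easy to review or replace either source independently.</p>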
    </sec>
    <sec id="sec-4">
      <title>4. NER Models Comparison</title>
<p>As stated above, two popular NER libraries were considered: a hybrid Bi-LSTM-CRF model which comes as a part of the DeepPavlov library, and a CNN model adopted by the SpaCy library. Both of these libraries are pre-trained on the OntoNotes 5.0 dataset, using the tagset described in (Weischedel et al., 2013) [6], and both demonstrate state-of-the-art results for the NER task with F1 = 86.4. Both of them can be adapted to different domains, are easy to train and simple to integrate into any application. In order to choose between the two approaches to the NER task, we have tested both libraries on a manually annotated text, a Wikipedia article on the Chernobyl disaster. The total size of the test corpus is 9700 tokens split over 317 sentences, with 561 entities annotated with the OntoNotes labels used in both models. We have excluded from the annotation the labels usually associated with numeric strings, namely the tags "ORDINAL", "CARDINAL", "QUANTITY", "PERCENT", "TIME", "DATE" and "MONEY": relations and events associated with those entities fall outside the scope of the present research, as it is oriented towards the extraction of facts and actors’ involvement in them, rather than the reconstruction of the temporal sequence of events. We have also excluded certain tags that were considered irrelevant for the domain we apply them to: "WORK_OF_ART", "PRODUCT" and "LANGUAGE". The list of OntoNotes tags used during manual annotation of the Wikipedia article, together with their native descriptions and our domain-specific interpretation, is provided in Table 1.</p>
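      <p>Restricting both models' output to this evaluation tagset before scoring amounts to a simple filter; a minimal sketch, assuming the predictions are available as (text, label) pairs:</p>

```python
# Tags excluded from evaluation, as listed above.
EXCLUDED_TAGS = {
    "ORDINAL", "CARDINAL", "QUANTITY", "PERCENT", "TIME", "DATE", "MONEY",
    "WORK_OF_ART", "PRODUCT", "LANGUAGE",
}

def keep_for_evaluation(entities):
    """Drop predicted entities whose label falls outside the evaluation tagset."""
    return [(text, label) for text, label in entities if label not in EXCLUDED_TAGS]

sample = [("26 April 1986", "DATE"), ("Pripyat", "GPE"), ("INSAG-7", "LAW")]
assert keep_for_evaluation(sample) == [("Pripyat", "GPE"), ("INSAG-7", "LAW")]
```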
<p>The use of certain tags differs slightly from the meaning originally intended by the creators of the tagset. The "EVENT" tag was applied to the phrase "Chernobyl disaster" and to the word "Chernobyl" when used in such a sense; the "FAC" tag was used to designate the Chernobyl Nuclear Power Plant and different parts of the plant, e.g. names of reactors, such as "Unit Four"; finally, the "LAW" tag was applied to the documents created by various commissions following the catastrophe, such as the INSAG-7 report, which is included in the corpus as one of the sources. Such an arguably frivolous use of these tags was tried as an experiment to see whether the extraction of such objects and events can be delegated to the NER module, or whether they should be found and listed manually.</p>
<table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>OntoNotes tags used during manual annotation, with their native descriptions.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Tag</th><th>OntoNotes description</th></tr>
          </thead>
          <tbody>
            <tr><td>PERSON</td><td>People, including fictional</td></tr>
            <tr><td>NORP</td><td>Nationalities or religious or political groups</td></tr>
            <tr><td>FAC</td><td>Buildings, airports, highways, bridges, etc.</td></tr>
            <tr><td>ORG</td><td>Companies, agencies, institutions, etc.</td></tr>
            <tr><td>GPE</td><td>Countries, cities, states</td></tr>
            <tr><td>LOC</td><td>Non-GPE locations, mountain ranges, bodies of water</td></tr>
            <tr><td>EVENT</td><td>Named hurricanes, battles, wars, sports events, etc.</td></tr>
            <tr><td>LAW</td><td>Named documents made into laws</td></tr>
          </tbody>
        </table>
      </table-wrap>
<p>For evaluating the results, we have calculated the usual quality metrics: Precision, Recall and F1-score. A label was considered a true positive only in the case of an exact entity match.</p>
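      <p>Under the exact-match criterion these metrics reduce to set intersection over entity descriptions; a minimal sketch, assuming gold and predicted entities are represented as (start, end, label) tuples:</p>

```python
def exact_match_prf(gold, predicted):
    """Precision, Recall and F1-score with exact entity matching.

    `gold` and `predicted` are collections of (start, end, label) tuples;
    an entity counts as a true positive only if both its span boundaries
    and its label coincide exactly with a gold annotation.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)          # exact matches only
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gold = {(0, 2, "PERSON"), (5, 6, "GPE"), (8, 10, "FAC")}
pred = {(0, 2, "PERSON"), (5, 6, "ORG")}   # one exact match, one wrong label
p, r, f = exact_match_prf(gold, pred)
assert p == 0.5
assert abs(r - 1 / 3) < 1e-12
assert abs(f - 0.4) < 1e-12
```

      <p>Note that, under this criterion, a correctly located entity with a wrong label counts both as a false positive and as a missed gold entity.</p>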
<p>Tables 2 and 3 indicate that, on average, both models show high precision and low recall, which can be interpreted as a tendency towards many false negatives: entities that are missed entirely or assigned a wrong label (which, under exact matching, also leaves the gold entity unmatched). The only entity label for which this pattern does not hold is "ORG", for which the evaluation metrics show high recall with low precision. It seems that the criteria for identifying "ORG" entities are fairly broad in both models.</p>
<p>It can also be seen that both models encounter considerable problems with detecting entities of the types "EVENT", "FAC" and "LAW". The most frequent source of mistakes was the polysemy of the word "Chernobyl", which can designate the city of Chernobyl (assigned a "GPE" label), the Chernobyl disaster ("EVENT") or the Chernobyl Nuclear Power Plant ("FAC"). This polysemy was taken into account while preparing the test corpus and, as the results show, was not handled well by either model. Other types of entities, which were used conventionally, have shown rather good results. It should be noted, however, that the CNN model employed by SpaCy shows much lower accuracy when identifying entities of the "PERSON" type, with both precision and recall significantly lower than those demonstrated by DeepPavlov’s RNN model. We consider the difference in accuracy for this particular label to be crucial. There is a finite and rather small number of unique entities of the "LAW", "EVENT" and "FAC" types mentioned in the corpus, and these entities can be specified by hand, as a limited number of facilities was involved in the Chernobyl catastrophe. In addition, most events that are of interest to our research are, in any case, mostly represented by common nouns. The same, however, cannot be said about the human actors that took part in the events: the same person can be called by different variations of the same name, and the list of people mentioned in the corpus can hardly be exhausted, as a significant part of this corpus consists of personal stories and analyses of individual involvement. Based on the results of the NER model evaluation, it was decided to use the DeepPavlov RNN model for the detection of locations, organizations and people’s names, while relying on a manually constructed list of interesting objects for the detection of documents, facilities and events.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
<p>In this paper we have described a way to extract named entities from a documentary corpus on the Chernobyl disaster. We have compared two NER models: the CNN-based model of SpaCy and the RNN-based model of DeepPavlov, with the latter achieving better quality on an annotated Wikipedia article on the Chernobyl disaster. The system can by no means be called finished, and this paper serves as a first attempt, or a proof of concept, for a large-scale project.</p>
    </sec>
    <sec id="sec-6">
<title>Acknowledgments and References</title>
      <p>The research was supported by the Russian Science Foundation grant 19-11-00281.</p>
      <p>[2] V. Yadav, S. Bethard, A Survey on Recent Advances in Named Entity Recognition from Deep Learning Models, arXiv preprint arXiv:1910.11470 (2018).
[3] R. Agerri, G. Rigau, Robust Multilingual Named Entity Recognition with Shallow Semi-supervised Features (Extended Abstract), in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Melbourne, Australia, 2017, pp. 4965–4969. URL: https://www.ijcai.org/proceedings/2017/703. doi:10.24963/ijcai.2017/703.
[4] J. Serrà, A. Karatzoglou, Getting Deep Recommenders Fit: Bloom Embeddings for Sparse Binary Input/Output Networks, arXiv:1706.03993 [cs] (2017). URL: http://arxiv.org/abs/1706.03993.
[5] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural Architectures for Named Entity Recognition, arXiv:1603.01360 [cs] (2016). URL: http://arxiv.org/abs/1603.01360.
[6] R. Weischedel, et al., OntoNotes Release 5.0 LDC2013T19, Linguistic Data Consortium, Web Download, 2013.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>South</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>DuVall</surname>
          </string-name>
          ,
          <year>2010</year>
          i2b2/
          <article-title>VA challenge on concepts, assertions, and relations in clinical text</article-title>
          ,
          <source>Journal of the American Medical Informatics Association : JAMIA</source>
          <volume>18</volume>
          (
          <year>2011</year>
          )
          <fpage>552</fpage>
          -
          <lpage>556</lpage>
          . URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3168320/. doi:10.1136/amiajnl-2011-000203.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>