=Paper=
{{Paper
|id=Vol-2481/paper9
|storemode=property
|title=Standardizing Language with Word Embeddings and Language Modeling in Reports of Near Misses in Seveso Industries
|pdfUrl=https://ceur-ws.org/Vol-2481/paper9.pdf
|volume=Vol-2481
|authors=Simone Bruno,Silvia Maria Ansaldi,Patrizia Agnello,Fabio Massimo Zanzotto
|dblpUrl=https://dblp.org/rec/conf/clic-it/BrunoAAZ19
}}
==Standardizing Language with Word Embeddings and Language Modeling in Reports of Near Misses in Seveso Industries==
<pdf width="1500px">https://ceur-ws.org/Vol-2481/paper9.pdf</pdf>
<pre>
Standardizing Language with Word Embeddings and Language Modeling
             in Reports of Near Misses in Seveso Industries

    Simone Bruno*, Silvia Maria Ansaldi+, Patrizia Agnello+, Fabio Massimo Zanzotto*
           * University of Rome Tor Vergata, fabio.massimo.zanzotto@uniroma2.it
           + INAIL, {s.ansaldi,p.agnello}@inail.it


                         Abstract

Standardizing technical language has always been a strong need of a technological
society. Today, Natural Language Processing as well as the widespread use of
computerized document writing can give a tremendous boost in reaching the goal of
standardizing technical language. In this paper, we propose two methods for
standardizing language. These methods have been applied to the dataset of near
misses, collected during the inspections at Major-Accident Hazard (MAH) Industries.[1]

[1] Copyright © 2019 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

1       Introduction

Standardizing technical language has always been a strong need of a technological
society. Artifacts, objects, measures and so on should have a clear name and a clear
description in order to ensure mutual understanding, which makes it possible to reach
important goals in building and controlling machines. However, language
standardization always faces the same problem: language is a social phenomenon
(de Saussure, 1916). Hence, whenever a group gathers to design or use a technical
object, this group can develop a specific sub-language or just adapt the shared
technical language. This adapted sub-language can then be effectively used to refer
to parts of this technical object. It is sufficient that group members agree upon
this language for mutual understanding to occur. Yet, the language used by the
specific group may prevent others from understanding what is written.

Nowadays, Natural Language Processing as well as the widespread use of computerized
document writing can give a tremendous boost in reaching the goal of standardizing
technical language. Language in use can be captured and then analyzed. Technical
people can be invited to use a standardized dictionary with writing suggestions.

This paper discusses two different methods of standardizing technical languages,
which have been applied to a dataset of near misses coming from the inspections at
Major-Accident Hazard (MAH) industries, also called “Seveso” industries. The first
method aims to help a standardization agency to propose the standard language for
writing these reports. We propose to analyze language in use by word embedding
similarity, so that the standardization agency can propose a language that is close
to the one used. The second method aims to reduce the use of unnecessary synonyms in
compiling reports of near misses. In fact, unnecessary synonyms may make a report
confusing. For this problem, we propose to use a combination of language modeling
derived from the CBOW model of word2vec (Mikolov et al., 2013) along with a classical
cosine similarity using word embeddings. We experimented with a dataset of anonymized
reports of near misses from Seveso Industries, which INAIL has institutionally
collected.

The rest of the paper is organized as follows. Section 2 describes the application
scenario and the dataset. Section 3 briefly reports on the models used in this study
and proposes the two tasks. Section 4 reports on a preliminary analysis of the
possible results of the system. Finally, Section 5 draws some conclusions and
proposes further investigations.

2     Background

2.1    Scenario

The European “Seveso” Directive deals with the control of major-accident hazards
involving dangerous substances, which can cause toxic clouds, fire, or explosion with
consequences to people, assets and environment, also outside the establishments. All
European Member States apply this Directive, which foresees periodical inspections by
National Competent Authorities; in Italy, Inail is one of these authorities. During
the inspection, the operator has to provide the inspectors with the list of near
misses, minor incidents, and accidents that occurred in the last ten years. Near
misses and minor incidents are events of loss of containment involving dangerous
substances, with no or minor consequences, respectively. In Seveso industries, the
registration and the analysis of near misses is strongly recommended, as they can be
considered as precursors of incidents with serious consequences.

In Italy, under Seveso legislation, there are about a thousand industries, including
refineries, petrochemical, and chemical plants. One of the pillars of the Seveso
Directive is the Safety Management System (SMS), whose adoption is mandatory for the
establishments’ operators, in order to control major-accident hazards.

The Safety Management System (SMS), implemented by the establishment’s operator,
addresses technical measures and organizational procedures in order to guarantee
human, asset and environmental safety, with a view to the prevention of major
accidents or the mitigation of their consequences.

In recent inspections, the focus is often on the study of incidents and near misses
(see Figure 1). The approach based on near-miss discussion is considered more “risk
based”, as it is able to single out the critical issues of the safety system.

 Ref: 66                                  Data (Date): 2007-02-15
 Titolo (Title):                          Trasudamento OCD da serbatoio di stoccaggio OCD
 Descrizione (Description):               Durante le operazioni di riempimento del serbatoio K2
                                          da nave cisterna, si è notato un leggero trasudamento
                                          di OCD per corrosione del mantello (sottospessore
                                          localizzato mantello serbatoio) a quota 6 metri circa
                                          lungo il lato ovest. Uno degli operatori addetto ai
                                          controlli durante la discarica della nave ha
                                          evidenziato l’evento. L’operazione di discarica della
                                          nave cisterna è stata fermata. Non si sono avuti
                                          rilasci, a meno del leggero trasudamento.
 Sistemi tecnici critici
 (Critical Technical System):             serbatoio
 Sostanza (Substance):                    olio combustibile (ocd)
 Fattori gestionali (Managing Factor):    4.iv
   Descrizione (Description):             Fallimento procedure di manutenzione e controllo.
   Azioni pianificate (Planned Actions):  Fuori servizio e bonifica del serbatoio.

Figure 1: Sample Report of a Near Miss within the European Seveso Directive - Italian
Localization: Translation is provided for Field Names

2.2    Corpus

The dataset consists of the near-miss reports provided by the operators of “Seveso”
establishments. The collection of reports on near misses, hereafter referred to as
the REP corpus, consists of 1300 documents called “operative experiences”. These
operative experiences span the period from 2006 to 2017 and are related to 320
plants.

Each “operative experience” describes events that occurred in the recent past (see
Figure 1 for an example). Each event is registered by the operators by filling in a
pre-defined form. The document contains information including the date, a title
summarizing the event, a short description, the reference to failed, missing or
misapplied technical or procedural barriers, those that stopped the escalation and
the recovery actions, and eventually the planned actions for improving safety.

It is out of the scope of this paper to discuss the different methods used in the
literature to manage near-miss information for improving the safety management
system. However, the common objective is to exploit the valuable information these
reports contain. Ansaldi et al. (2018) describe a method to extract knowledge from
this collection of documents, and to support foresights or intuitions about the
safety of process industries. Another application has been developed for
understanding whether the lessons from major accidents have been fully learnt and
implemented (Ansaldi et al., 2016). The issue has been addressed by looking for
similarities between near misses and accident characteristics, and by evaluating
their semantic distance.

Although the form of the document is the same for all operators, the compiling mode
varies by establishment and by the type of event recorded. The accuracy of the
documents is not homogeneous and the interpretation of the operative experience
concept changes from one establishment to another; the care taken varies with the
sector of activity, and often reveals the safety culture of the establishment. At a
few establishments, only releases of hazardous substances without consequences are
registered. In other cases, reports include anomalies, unsafe situations, failures,
and trivial errors; that is, events not directly related to major-accident hazards.
The documents are varied, but represent truthful pictures of deviations that occurred
inside the establishment.

3     Methods

The overall goal is to show that existing methodologies can help in standardizing
language in the specific case of reports on near misses in Seveso industries. We aim
to perform this standardization with two tools: (1) analyzing similarities among
words in current reports; (2) proposing a methodology to help in writing these
reports.

3.1    Challenges

The specific case of reports on near misses is particular for two compelling reasons.
The first reason is that reports are written by operators belonging to
sub-communities of speakers. In fact, people working in each plant can be considered
a sub-community, which shares a particular language. Hence, standardizing the
language of reports also means harmonizing the sub-languages of different
sub-communities, which do not interact. This problem is particularly severe when the
aim is to standardize language across all Seveso industries. The second reason is the
different background of the reports’ writers. Reports are in fact written by
operators, who may have different knowledge, different levels of schooling, and
different cultural backgrounds. This makes the goal of helping writers in compiling
reports on near misses particularly relevant.

3.2    Enabling Tools and Methodologies

To meet the overall goal, we here experiment with standard and well-assessed models
and methodologies built on the notion of word embedding. In fact, the long tradition
of representing word meaning in vectors is what is needed to: (1) help the
standardization organism to develop a common and acceptable language; (2) devise ways
to suggest more appropriate words to writers of reports. In this study, we used two
different word embeddings:

  • General Language Word Embeddings (GLwe) (Cimino et al., 2018): these are word
    embeddings pre-trained with word2vec (Mikolov et al., 2013) on a general-purpose
    corpus of the Italian language, that is, itWaC (Baroni et al., 2009)

  • Domain-adapted Word Embeddings (Dawe): these are word embeddings obtained by
    training word2vec (Mikolov et al., 2013) using GLwe as initialization and the REP
    training corpus

3.3   Task 1: Understanding Language of Near-Miss Reports

We aim to give the standardization organism, that is, INAIL, the possibility to
investigate the language used in these reports on near misses. The possibility we
explored is to provide a visual representation of the similarity computed among word
embeddings. Given this visual representation, researchers in INAIL can devise the
definition of a standard language that is built on a common and shared language. This
idea is similar to what has been done in the past for terminology extraction. The
real added value is that similarity among terms is computed according to word
embeddings.

3.4   Task 2: Standardizing Report with Assisted Writing

We aim to provide a tool to assist operators while writing reports. We explored the
first capability of this tool, that is, avoiding unnecessary use of synonyms while
writing. In the Italian tradition, repeating words is seen as bad writing. Hence,
when writing, synonyms are used to introduce variation. However, for technical
documents, unnecessary use of synonyms for core concepts may introduce
misunderstanding. Hence, we envisage a tool that helps in reducing the use of
synonyms.

The algorithm governing the tool works as follows. While a report is being written,
the algorithm accumulates words in a set W. Whenever a new content word w is added,
the algorithm computes the similarity with the words in the set W. If there is a word
w′ ∈ W for which the similarity sim(w, w′) = wᵀw′ is above a threshold τ, the
algorithm suggests w′ as a possible substitution of w. In this way, the operator is
forced to
think whether the word w′ that s/he already used is similar to the word s/he is using
now. In this case, w′ can be used to replace w and an unnecessary synonym is avoided.

4     Experimental Results

4.1    Task 1

For the first task, we experimented with the two dictionaries: the General Language
word embeddings (GLwe) and the Domain-adapted word embeddings (Dawe). Similarity
spaces for the two word embeddings (see Figure 2) may help in understanding whether
unnecessary synonyms are used and, hence, suggest a standardized word that should be
used for a group of words.

Using the two dictionaries, we built two similarity spaces (Figure 2) obtained as
follows. We selected 10 frequent words in the REP training corpus and then presented
in the two figures the top 15 words that are most similar to the 10 selected frequent
words. The similarity spaces are built according to GLwe (Figure 2a) and according to
Dawe (Figure 2b).

The Dawe similarity space (Figure 2b) apparently gives better hints on how words are
used. The dictionary seems to be more tailored to the specific domain. In fact, there
are interesting groups of words such as {avvenuto, accaduto, occorso, verificatosi}
and {causato, provocato}. These groups are missing in the GLwe similarity space
(Figure 2a).

                      (a) General Language Word Embeddings
                      (b) Domain-adapted Word Embeddings

Figure 2: Similarities among Words: Studying and Understanding Technical Language
with Word Embeddings

4.2    Task 2

For the second task, we experimented with some sample reports. The algorithm in
action is shown in Figure 3. This test has been carried out on existing reports and
aimed to show that some words can be replaced with previously used words. In report
#174, the word zona can be replaced with the word area, which has been previously
used. In report #175, the word focolaio could be replaced with the word incendio.
Finally, in report #109, the word fuoriuscita can be replaced with the word perdita.
However, the operator is free to accept or refuse the suggestion if it is not
satisfactory.

    #      Text
    174    La perdita non si era evidenziata al controllo dell’area effettuato
           preliminarmente all’inizio attività, né rilevata dal CTM presente in zona
             Suggestions for "zona": area (sim = 0.64), attività (sim = 0.31)

    175    ... Alle ore 10,15 il CT rilevava visivamente la presenza di tracce di
           virgin nafta miscelati con le acque di scarico e, mentre si accingeva a
           chiudere la valvola sul dreno di fondo colonna, improvvisamente, si
           sviluppava un principio d’incendio. Lo stesso CT, utilizzando le
           manichette di erogazione acqua già attive per il lavaggio dell’area atto a
           favorire il convogliamento dei reflui nel pozzetto di raccolta di
           raffineria, estingueva prontamente il focolaio
             Suggestions for "focolaio": incendio (sim = 0.46), intervento (sim = 0.34)

    109    Necessità di prevedere un più elevato grado di protezione contro la
           perdita di contenimento da fondo serbatoi. La fuoriuscita
             Suggestions for "fuoriuscita": perdita (sim = 0.54), contenimento (sim = 0.42)

Figure 3: Suggested replacements with already used synonyms

5     Conclusion and Future Work

Standardizing language is a need of our technological society. In this paper, we
investigated the possibility of using modern NLP techniques to reach this goal in the
specific scenario of near misses in Seveso Industries. Initial results on the corpus
provided by Inail are interesting and leave room for improvement. Future models
should include the treatment of multi-word expressions by using compositional
distributional semantic models (Zesch et al., 2013; Zanzotto et al., 2015), should
merge distributional and ontological models, and should include a clear model for
repaying knowledge producers (Zanzotto, 2019).

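The neighbour lists of Section 4.1 and the suggestion procedure of Section 3.4 can be
illustrated with a minimal sketch. It is not the authors' implementation: the
function names, the three-dimensional vectors and the threshold value 0.5 are
assumptions made only for this example, and cosine similarity is used here because it
coincides with the dot product wᵀw′ when the vectors have unit length.

```python
import math

# Hypothetical 3-dimensional vectors, invented for this illustration only.
# Real GLwe/Dawe vectors come from word2vec and have many more dimensions.
TOY_EMBEDDINGS = {
    "area": (0.9, 0.1, 0.0),
    "zona": (0.8, 0.3, 0.1),
    "perdita": (0.1, 0.9, 0.2),
    "fuoriuscita": (0.2, 0.8, 0.3),
    "attività": (0.0, 0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity; equal to the dot product for unit-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbours(word, embeddings, k=15):
    """Task 1 sketch: the k vocabulary words most similar to `word`."""
    scored = [
        (other, cosine(embeddings[word], vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def suggest_replacements(content_words, embeddings, tau=0.5):
    """Task 2 sketch: propose an already used word w' for each new word w.

    `tau` is the similarity threshold; its default value is an assumption.
    Returns tuples (position, w, w', similarity rounded to two decimals).
    """
    used = []          # the set W, kept in order of first use
    suggestions = []
    for position, w in enumerate(content_words):
        if w in used:
            continue   # plain repetition: nothing to suggest
        if w in embeddings:
            scored = [
                (cosine(embeddings[w], embeddings[u]), u)
                for u in used
                if u in embeddings
            ]
            if scored:
                best_sim, best = max(scored)
                if best_sim > tau:
                    suggestions.append((position, w, best, round(best_sim, 2)))
        used.append(w)
    return suggestions


if __name__ == "__main__":
    print(top_k_neighbours("area", TOY_EMBEDDINGS, k=2))
    print(suggest_replacements(["perdita", "area", "attività", "zona"], TOY_EMBEDDINGS))
```

With these toy vectors, the second call proposes replacing zona with the previously
used area, which mirrors the suggestion shown for report #174 in Figure 3. Only the
best candidate above the threshold is returned here; a tool could equally show
several ranked candidates, as Figure 3 does.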
References

Silvia Maria Ansaldi, Patrizia Agnello, and Paolo Bragatto. 2016. Incidents triggered
   by failures of level sensors. Chemical Engineering Transactions, 53:223–228.

Silvia Maria Ansaldi, Annalisa Pirone, Maria Rosaria Vallerotonda, Paolo Bragatto,
   Patrizia Agnello, and Corrado Delle Site. 2018. How inspections outcomes may
   improve the foresight of operators and regulators in Seveso industries. Chemical
   Engineering Transactions, 67:367–372.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The
   WaCky wide web: a collection of very large linguistically processed web-crawled
   corpora. Language Resources and Evaluation, 43(3):209–226.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell’Orletta. 2018. Multi-task learning
   in deep neural networks at EVALITA 2018. In Proceedings of the Sixth Evaluation
   Campaign of Natural Language Processing and Speech Tools for Italian. Final
   Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on
   Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation
   of word representations in vector space. CoRR, abs/1301.3781.

Ferdinand de Saussure. 1916. Cours de linguistique générale. Payot, Paris.

Fabio Massimo Zanzotto. 2019. Viewpoint: Human-in-the-loop Artificial Intelligence.
   Journal of Artificial Intelligence Research, 64:243–252.

Fabio Massimo Zanzotto, Lorenzo Ferrone, and Marco Baroni. 2015. When the whole is
   not greater than the combination of its parts: A “decompositional” look at
   compositional distributional semantics. Comput. Linguist., 41(1):165–173.

T. Zesch, I. Korkontzelos, F.M. Zanzotto, and C. Biemann. 2013. Semeval-2013 task 5:
   Evaluating phrasal semantics. volume 2, pages 39–47.

</pre>