=Paper=
{{Paper
|id=Vol-1495/paper_28
|storemode=property
|title=Descriptors for the Detection of the Chemical Risk
|pdfUrl=https://ceur-ws.org/Vol-1495/paper_28.pdf
|volume=Vol-1495
|dblpUrl=https://dblp.org/rec/conf/tia/GrabarH15
}}
==Descriptors for the Detection of the Chemical Risk==
<pdf width="1500px">https://ceur-ws.org/Vol-1495/paper_28.pdf</pdf>
<pre>
                 Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)



                   Descriptors for the detection of the chemical risk


                Natalia Grabar                                           Thierry Hamon
                UMR8163 STL                                            LIMSI-CNRS, Orsay
           CNRS, Université Lille 3                                    Université Paris 13
           Villeneuve d’Ascq, France                                Sorbonne Paris Cité, France
     natalia.grabar@univ-lille3.fr                                     hamon@limsi.fr


Abstract

We present an experiment on the automatic detection of sentences that convey
the notion of chemical risk. Our objective is to study which resources are
useful for the automatic detection of such sentences. The lexical, semantic and
opinion-oriented content of the sentences is studied. Our results indicate that
not only lexical and semantic content must be taken into account, but also
markers related to modality, opinion and polarity.

1   Introduction

Chemical risk relates to situations in which chemical products are dangerous
for human or animal health and consumption, and for the environment. Automating
the analysis can help experts control and manage the large amounts of
scientific literature that have to be analyzed to support the decision-making
process (van der Sluijs et al., 2008). The sentences that must be recognized
are, for instance: "The Panel concluded that the current NOAEL for BPA
(5 mg/kg b.w./day) would be sufficiently low to exclude any concern for this
effect", or "Despite this lack of evidence, the possibility of poultry and egg
consumption as an exposure route to HPAIV remains a concern to food safety
experts". Such sentences are to be assigned to categories related to chemical
risk: the first sentence is related to the significance of the results, while
the second is related to the quality of the scientific hypothesis. If such
sentences are detected in scientific publications or reports, it means that
these publications or reports contain information that is not fully reliable,
which can indicate the insufficiency of the corresponding studies and the
presence of risk.

Chemical risk is poorly studied, although the notion of risk is addressed by
other works: building dedicated resources (Makki et al., 2008), exploring known
industrial incidents (Tulechki and Tanguy, 2012), computing exposure to risk
(Marre et al., 2010). Our objective is to study which resources are useful for
the automatic detection of sentences that convey the notion of chemical risk.

2   Material and Methods

In addition to the lexical and semantic content of the text, we use several
kinds of resources in order to favour one aspect or another. These resources
contain markers of modality, opinion and polarity expressed by the authors with
respect to the proposed experiments:

(1) uncertainty (possible, should, may, usually) indicates that there are
    doubts about the results presented, their interpretation, etc.;
(2) negation (no, neither, lack, absent, missing) indicates that the results
    have not been observed, that the study does not respect the expected
    norms, etc.;
(3) limitations (only, shortcoming, insufficient) indicates that the work has
    some limits, such as insufficient sample size, small number of tests or
    doses explored, etc.;
(4) approximation (approximately, commonly, estimated) indicates other kinds
    of insufficiency related to imprecise values of substances, samples,
    dosage, etc.

The work is done with a corpus on chemical risk that reports on several
chemical experiments with bisphenol A (EFSA Panel, 2010). It contains over
80,000 occurrences. The reference data are obtained through a manual
categorization of the corpus sentences: 425 sentences are assigned to
55 classes of chemical risk.
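
The four marker classes listed in Section 2 can be illustrated with a minimal
sketch. It assumes a tiny lexicon made only of the example words quoted in the
text and a simple lowercase, letters-only tokenization; the names MARKERS and
tag_markers are hypothetical, and the resources actually used in the work are
not described in enough detail to reproduce them here.

```python
import re

# Hypothetical mini-lexicon: only the example markers quoted in Section 2.
MARKERS = {
    "uncertainty": {"possible", "should", "may", "usually"},
    "negation": {"no", "neither", "lack", "absent", "missing"},
    "limitations": {"only", "shortcoming", "insufficient"},
    "approximation": {"approximately", "commonly", "estimated"},
}


def tag_markers(sentence):
    """Return (token, marker_class) pairs for each marker found, in text order."""
    # Assumption: lowercase, letters-only tokenization is good enough here.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    hits = []
    for token in tokens:
        for marker_class, words in MARKERS.items():
            if token in words:
                hits.append((token, marker_class))
    return hits


example = (
    "Despite this lack of evidence, the possibility of poultry and egg "
    "consumption as an exposure route to HPAIV remains a concern to "
    "food safety experts."
)
print(tag_markers(example))  # [('lack', 'negation')]
```

Exact word matching misses inflected or derived forms (here "possibility" is
not matched by "possible"), so a fuller resource would presumably work on
lemmas or list such variants explicitly.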


[Figure 1 plots, two panels: (a) Significance of the results; (b) Natural variability. x-axis: descriptors (all, form, lemm, lf, lft, stag, tag); y-axis: performance (0 to 1); series: freq, norm, tfidf.]


               Figure 1: F-measure obtained during the categorization of sentences into classes of the chemical risk.
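
For reference, the F-measure is commonly computed as the harmonic mean of
precision and recall. The sketch below assumes this balanced form (F1), since
the text does not state which variant is used; the function name and the
counts in the usage line are illustrative only and are not results from the
paper.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1) for one category, from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: precision 0.75, recall 0.6.
print(round(f_measure(30, 10, 20), 3))  # 0.667
```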


We tackle the problem through supervised categorization with the Weka platform
(Witten and Frank, 2005). Sentences are the units to be categorized, and the
7 most frequent classes of chemical risk are the categories to which the
sentences have to be assigned. The resources and the linguistic annotation of
the corpus (Schmid, 1994) provide several descriptors. These are used to build
several sets of descriptors, which represent the semantic and linguistic
content of the sentences: forms (the forms as they occur in the corpus), lemmas
(lemmatized forms), lf (combination of forms and lemmas), tag (POS tags, such
as nouns, verbs, adjectives), lft (combination of forms, lemmas and POS tags),
stag (semantic tags of words, such as uncertainty, negation, limitations), all
(combination of all the descriptors available). The descriptors are weighted
with various methods: freq (raw frequency), norm (normalization by the length
of the sentences), and tfidf (tf-idf normalization).

3   Results

Figure 1 presents some results obtained for two categories: Significance of
the results and Natural variability of the results. We can observe some
differences depending on the descriptors: the use of forms, of semantic tags
(for Significance of the results) and of various combinations of descriptors
often provides better results for these two categories and for other
categories. We assume that these two kinds of descriptors (the lexical and
semantic content of the corpus, and the descriptors related to modality,
polarity and opinion (Vinodhini and Chandrasekaran, 2012)) provide
complementary views on the content and should be combined. These results also
indicate that chemical risk is not a fully conceptual category but is also
related to subjective and contextual values.

References

EFSA Panel. 2010. Scientific opinion on Bisphenol A: evaluation of a study
   investigating its neurodevelopmental toxicity, review of recent scientific
   literature on its toxicity and advice on the Danish risk assessment of
   Bisphenol A. EFSA Journal, 8(9):1–110.
J Makki, AM Alquier, and V Prince. 2008. Ontology population via NLP
   techniques in risk management. In Proceedings of ICSWE.
A Marre, S Biver, M Baies, C Defreneix, and C Aventin. 2010. Gestion des
   risques en radiothérapie. Radiothérapie, 724:55–61.
H Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In
   International Conference on New Methods in Language Processing,
   pages 44–49.
N Tulechki and L Tanguy. 2012. Effacement de dimensions de similarité
   textuelle pour l’exploration de collections de rapports d’incidents
   aéronautiques. In TALN, pages 439–446.
Jeroen P van der Sluijs, Arthur C Petersen, Peter H M Janssen, James S Risbey,
   and Jerome R Ravetz. 2008. Exploring the quality of evidence for complex
   and contested policy decisions. Environ. Res. Lett., 3(2).
G Vinodhini and RM Chandrasekaran. 2012. Sentiment analysis and opinion
   mining: A survey. International Journal of Advanced Research in Computer
   Science and Software Engineering, 2(6):282–292.
I.H. Witten and E. Frank. 2005. Data mining: Practical machine learning tools
   and techniques. Morgan Kaufmann, San Francisco.
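
The descriptor sets and the three weighting schemes named in Section 2 (freq,
norm, tfidf) can be sketched as follows. This is an illustration under stated
assumptions, not the configuration used with Weka in the work: the token
triples, the helper names build_descriptors and weight, and the tf-idf variant
(count times log of N over sentence frequency) are all hypothetical, since the
text does not give the exact formulas. The stag and all sets are omitted for
brevity.

```python
import math
from collections import Counter


def build_descriptors(tokens, kind):
    """Build one descriptor set from (form, lemma, pos_tag) triples."""
    forms = ["form=" + form for form, _, _ in tokens]
    lemmas = ["lemma=" + lemma for _, lemma, _ in tokens]
    tags = ["tag=" + tag for _, _, tag in tokens]
    sets = {
        "forms": forms,
        "lemmas": lemmas,
        "tag": tags,
        "lf": forms + lemmas,          # forms and lemmas combined
        "lft": forms + lemmas + tags,  # forms, lemmas and POS tags combined
    }
    return sets[kind]


def weight(sentences, scheme):
    """Weight the descriptors of each sentence; returns one dict per sentence."""
    counts = [Counter(sentence) for sentence in sentences]
    if scheme == "freq":
        # Raw frequency of each descriptor in the sentence.
        return [dict(c) for c in counts]
    if scheme == "norm":
        # Frequency divided by the number of descriptors in the sentence.
        return [
            {d: n / len(s) for d, n in c.items()}
            for s, c in zip(sentences, counts)
        ]
    if scheme == "tfidf":
        # Assumed variant: tf * log(N / df), with sentences as documents.
        n_sentences = len(sentences)
        doc_freq = Counter(d for c in counts for d in c)
        return [
            {d: n * math.log(n_sentences / doc_freq[d]) for d, n in c.items()}
            for c in counts
        ]
    raise ValueError("unknown scheme: " + scheme)


# Hypothetical annotated sentences (not taken from the corpus).
corpus = [
    [("no", "no", "DT"), ("effects", "effect", "NNS"), ("observed", "observe", "VBN")],
    [("effects", "effect", "NNS"), ("may", "may", "MD"), ("occur", "occur", "VB")],
]
lemma_sets = [build_descriptors(sentence, "lemmas") for sentence in corpus]
for scheme in ("freq", "norm", "tfidf"):
    print(scheme, weight(lemma_sets, scheme)[0])
```

With this tf-idf variant, a descriptor that occurs in every sentence gets a
weight of zero, which is one reason the three schemes can rank descriptors
differently.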

</pre>