=Paper=
{{Paper
|id=Vol-1495/paper_28
|storemode=property
|title=Descriptors for the Detection of the Chemical Risk
|pdfUrl=https://ceur-ws.org/Vol-1495/paper_28.pdf
|volume=Vol-1495
|dblpUrl=https://dblp.org/rec/conf/tia/GrabarH15
}}
==Descriptors for the Detection of the Chemical Risk==
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
Descriptors for the detection of the chemical risk
Natalia Grabar
UMR8163 STL, CNRS, Université Lille 3
Villeneuve d’Ascq, France
natalia.grabar@univ-lille3.fr

Thierry Hamon
LIMSI-CNRS, Orsay
Université Paris 13, Sorbonne Paris Cité, France
hamon@limsi.fr
Abstract

We propose an experiment on the automatic detection of sentences conveying the notion of chemical risk. Our objective is to study which resources are useful for the automatic detection of such sentences. The lexical, semantic and opinion-oriented content of the sentences is studied. Our results indicate that not only lexical and semantic content must be taken into account, but also markers related to modality, opinion and polarity.

1 Introduction

Chemical risk concerns situations in which chemical products are dangerous for human or animal health and consumption, and for the environment. Automating the detection of such risk can help experts to control and manage the large amounts of scientific literature that have to be analyzed to support the decision-making process (van der Sluijs et al., 2008). The sentences that must be recognized are, for instance: "The Panel concluded that the current NOAEL for BPA (5 mg/kg b.w./day) would be sufficiently low to exclude any concern for this effect", or "Despite this lack of evidence, the possibility of poultry and egg consumption as an exposure route to HPAIV remains a concern to food safety experts". Such sentences are to be assigned to categories related to chemical risk: the first sentence is related to the significance of the results, while the second is related to the quality of the scientific hypothesis. If such sentences are detected in scientific publications or reports, it means that these publications or reports contain information that is not fully reliable, which can indicate the insufficiency of the corresponding studies and the presence of risk.

Chemical risk is poorly studied, although the notion of risk is addressed by other works: building dedicated resources (Makki et al., 2008), exploring known industrial incidents (Tulechki and Tanguy, 2012), and computing the exposure to risk (Marre et al., 2010). Our objective is to study which resources are useful for the automatic detection of the sentences that convey the notion of chemical risk.

2 Material and Methods

In addition to the lexical and semantic content of the text, we use several kinds of resources in order to favour one aspect or another. These resources contain markers of the modality, opinion and polarity expressed by the authors about the reported experiments: (1) uncertainty (possible, should, may, usually) indicates that there are doubts about the results presented, their interpretation, etc.; (2) negation (no, neither, lack, absent, missing) indicates that the results have not been observed, that the study does not respect the expected norms, etc.; (3) limitations (only, shortcoming, insufficient) indicate that the work has some limits, such as insufficient sample size, small number of tests or doses explored, etc.; (4) approximation (approximately, commonly, estimated) indicates other kinds of insufficiency related to imprecise values of substances, samples, dosage, etc.

The work is done with a corpus on chemical risk that reports on several chemical experiments with bisphenol A (EFSA Panel, 2010). It contains over 80,000 occurrences. The reference data are obtained through a manual categorization of the corpus sentences: 425 sentences are assigned to 55 classes of chemical risk.
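The four marker families described above can be read as a small lexicon lookup. The following is a minimal Python sketch under stated assumptions: the word lists contain only the example markers quoted in the text (the actual resources are presumably larger and are not reproduced here), and the names `MARKERS` and `semantic_tags` are hypothetical, introduced only for illustration.

```python
import re

# Marker lists restricted to the examples quoted in the text;
# the real resources used in the experiments are not reproduced here.
MARKERS = {
    "uncertainty": {"possible", "should", "may", "usually"},
    "negation": {"no", "neither", "lack", "absent", "missing"},
    "limitations": {"only", "shortcoming", "insufficient"},
    "approximation": {"approximately", "commonly", "estimated"},
}


def semantic_tags(sentence):
    """Return the sorted marker classes whose words occur in the sentence."""
    # Naive tokenization (assumption): lowercase alphabetic word forms only.
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return sorted(tag for tag, words in MARKERS.items() if tokens & words)


example = ("Despite this lack of evidence, the possibility of poultry and egg "
           "consumption as an exposure route to HPAIV remains a concern.")
print(semantic_tags(example))
```

Because the lookup works on exact word forms, "possibility" does not match the uncertainty marker "possible"; matching on lemmas instead would change what is detected.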
Figure 1: F-measure obtained during the categorization of sentences into classes of the chemical risk. Panels: (a) Significance of the results; (b) Natural variability. In each panel, the x-axis lists the descriptor sets (all, form, lemm, lf, lft, stag, tag), the y-axis gives the performance from 0 to 1, and the three series correspond to the weightings freq, norm and tfidf.
We tackle the problem through supervised categorization with the Weka platform (Witten and Frank, 2005). Sentences correspond to the units, while the 7 most frequent classes of chemical risk are the categories to which the sentences have to be assigned. The resources and the linguistic annotation of the corpus (Schmid, 1994) provide several descriptors, which are used to build several sets of descriptors. They represent the semantic and linguistic content of the sentences: forms (the forms as they occur in the corpus), lemmas (lemmatized forms), lf (combination of forms and lemmas), tag (POS tags, such as nouns, verbs, adjectives), lft (combination of forms, lemmas and POS tags), stag (semantic tags of words, such as uncertainty, negation, limitations), all (combination of all the descriptors available). The descriptors are weighted with various methods: freq (raw frequency), norm (normalization by the length of the sentences), and tfidf (tf-idf normalization).

3 Results

Figure 1 presents some results obtained for two categories: Significance of the results and Natural variability of the results. We can observe some differences according to the descriptors: the exploitation of forms, semantic tags (with Significance of the results) and various combinations of descriptors provides results that are often better for these two categories and for other categories. We assume that these two kinds of descriptors (the lexical and semantic content of the corpus, and the descriptors related to modality, polarity and opinion (Vinodhini and Chandrasekaran, 2012)) provide complementary views on the content and should be combined. These results also indicate that chemical risk is not a fully conceptual category but is also related to subjective and contextual values.

References

EFSA Panel. 2010. Scientific opinion on Bisphenol A: evaluation of a study investigating its neurodevelopmental toxicity, review of recent scientific literature on its toxicity and advice on the Danish risk assessment of Bisphenol A. EFSA Journal, 8(9):1–110.

J Makki, AM Alquier, and V Prince. 2008. Ontology population via NLP techniques in risk management. In Proceedings of ICSWE.

A Marre, S Biver, M Baies, C Defreneix, and C Aventin. 2010. Gestion des risques en radiothérapie. Radiothérapie, 724:55–61.

H Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44–49.

N Tulechki and L Tanguy. 2012. Effacement de dimensions de similarité textuelle pour l’exploration de collections de rapports d’incidents aéronautiques. In TALN, pages 439–446.

Jeroen P van der Sluijs, Arthur C Petersen, Peter H M Janssen, James S Risbey, and Jerome R Ravetz. 2008. Exploring the quality of evidence for complex and contested policy decisions. Environ. Res. Lett., 3(2).

G Vinodhini and RM Chandrasekaran. 2012. Sentiment analysis and opinion mining: A survey. International Journal of Advanced Research in Computer Science and Software Engineering, 2(6):282–292.

IH Witten and E Frank. 2005. Data mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco.
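As an illustration of the three descriptor weightings named in Section 2 (freq, norm, tfidf), the following is a minimal Python sketch and not the Weka configuration used in the experiments. The function names and the toy corpus are hypothetical, and the tf-idf variant (raw count multiplied by log(N/df)) is an assumption, since the text does not state which formula was applied.

```python
import math
from collections import Counter


def freq(tokens):
    """Raw frequency of each descriptor in one sentence."""
    return dict(Counter(tokens))


def norm(tokens):
    """Frequency divided by the sentence length (number of tokens)."""
    n = len(tokens)
    return {t: c / n for t, c in Counter(tokens).items()} if n else {}


def tfidf(corpus):
    """One weight dictionary per sentence: count * log(N / df).

    Assumed variant; df is the number of sentences containing the descriptor.
    """
    n_docs = len(corpus)
    df = Counter(t for tokens in corpus for t in set(tokens))
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(tokens).items()}
        for tokens in corpus
    ]


# Toy corpus of two tokenized sentences (hypothetical data).
corpus = [["risk", "is", "low"], ["risk", "remains", "a", "concern"]]
print(freq(corpus[0]))
print(norm(corpus[1]))
print(tfidf(corpus)[0])
```

With this variant, a descriptor that occurs in every sentence (here "risk") receives a weight of zero, which is the usual motivation for tf-idf over raw frequency.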