=Paper=
{{Paper
|id=Vol-1495/paper_6
|storemode=property
|title=A Methodology for Identifying Terms and Patterns Specific to Requirements as a Textual Genre Using Automated Tools
|pdfUrl=https://ceur-ws.org/Vol-1495/paper_6.pdf
|volume=Vol-1495
|dblpUrl=https://dblp.org/rec/conf/tia/WarnierC15
}}
==A Methodology for Identifying Terms and Patterns Specific to Requirements as a Textual Genre Using Automated Tools==
Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)
Maxime Warnier
CLLE-ERSS (UMR 5263), Université Toulouse – Jean Jaurès & CNRS
Centre National d’Études Spatiales
maxime.warnier@univ-tlse2.fr

Anne Condamines
CLLE-ERSS (UMR 5263), Université Toulouse – Jean Jaurès & CNRS
anne.condamines@univ-tlse2.fr

Abstract

As a step in a project whose final goal is to propose a Controlled Natural Language for requirements writing at CNES (Centre National d’Études Spatiales), we intend to build the grammar of the textual genre of the requirements. One of the main issues faced when analyzing our corpus is the (sometimes subtle) difference between the terms and syntactic structures pertaining to the genre and those linked to the domain (in our case, the development of space systems), a difference that is generally not taken into account by automated tools. In this paper, we present a methodology aimed at detecting candidate terms and textual patterns specific to the genre by combining results obtained from a terminology extractor and a data mining tool with a validated resource in use for indexing documents at CNES. The results are then illustrated by a selection of examples from our corpus.

1 Introduction

This study is part of a wider project aiming at improving the writing of requirements¹ at CNES (Centre National d’Études Spatiales), the French Space Agency.

¹ According to one of the definitions given by IEEE (1990), a requirement is: “a condition or capability that must be met or possessed by a system or system component to satisfy a contract, standard, specification, or other formally imposed documents”.

Indeed, the requirements (as well as the specifications, that is, the documents in which they are included) are mostly written in a natural language (in this case, French), and as a consequence they may sometimes contain well-known related problems, such as ambiguity and vagueness (Pace & Rosner, 2010). A Controlled Natural Language (CNL) is a possible solution to avoid or at least substantially limit these problems by setting constraints on the lexicon, the syntax or the semantics (Kuhn, 2014).

However, in order for this CNL to be actually applied, we believe that it should not be unnecessarily restrictive and, in particular, not too far removed from the way engineers are already used to writing the documents; otherwise, they will probably simply ignore it. In other words, we wish to propose a CNL inspired by already existing data, following a corpus-driven and corpus-based methodology that we describe in more detail in (Condamines & Warnier, 2014).

This methodology relies on the existence of a textual genre, which Bhatia (1993) defines as “a recognizable communicative event characterized by a set of communicative purpose(s) identified and mutually understood by the members of the professional or academic community in which it regularly occurs”, as is clearly the case for requirements writing (since it is a recurring task performed by employees working in similar companies), and in particular of a sublanguage, defined by Somers (1998) as “an identifiable genre or text-type in a given subject field, with a relatively or even absolutely closed set of syntactic structures and vocabulary”. We were already able to provide some evidence in favor of this hypothesis (if not for all requirements, at least for requirements written in French at CNES) and we are now trying to build the grammar of this particular genre (that is to say, the set of rules followed, consciously or not, by the speakers of this community to produce acceptable utterances) by semi-automatically analyzing specifications of two former projects.

In the present study, we will focus on the results obtained by a terminological extraction. More specifically, we will propose a method to sort them (as we are interested only in the terms pertaining to the genre, not in those pertaining to
the domain) and subsequently to use them as a filter to retrieve textual patterns belonging to the grammar of the genre. An example of similar work, based on collocations and n-grams, is given by the transdisciplinary scientific lexicon (Tutin, 2007).

2 Genre vs. domain

Although this grammar should ideally be independent of the field (aerospace industry, aeronautics, software engineering, etc.), in practice, the distinction is not so simple as regards specifications². While some features are indeed inherent in the nature of the documents (because they describe something that does not exist yet, but will have to exist and to conform with the requirements, the use of the future tense and of injunctions, for instance, is common), others are closely related to the field to which the future “object” being described belongs. It may reasonably be assumed that the lexical level, since it directly refers to the object in question, is most significantly affected by the domain, but we cannot reject the hypothesis that syntactic structures too may differ from one field to another.

² The distinction between genre and domain itself is actually far from trivial (Lee, 2001).

For that reason, if we want to define a terminology of requirements, we must keep in mind that the candidate terms proposed by the terminology extractors may actually belong either to the genre or to the domain. Unfortunately, although the possibility of filtering terms by domain has already been highlighted as a user need (Blancafort et al., 2011), traditional extractors do not provide any means to distinguish a priori between genre and domain, because they are designed mostly for more didactic corpora, where the field matters much more than the genre (e.g. in order to establish the terminology in use in a company or in a knowledge domain). Furthermore, similar problems are to be expected when using other kinds of automated tools (such as data mining software), as they will also mix the two different types of words and terms.

Specifications are thus unusual, specialized corpora and they bring new challenges to terminology extraction in general. In particular, considering the fact that the candidate terms linked to the domain are probably more numerous than those linked to the genre, we want to find a way to exploit the results without a need for manually revising all of them. In the next section, we present the small experiment we conducted on our corpus of specifications as a possible way to reach this goal, but also to reuse these results to filter textual patterns identified by a text mining tool.

3 Methodology

3.1 Corpora

All the operations described hereafter were performed on two corpora of requirements in French extracted from several specifications provided by CNES. (All tables and figures were removed from the requirements, because their automatic analysis would have been more difficult.) The first corpus concerns the project called “Pleiades”³ (two very-high-resolution satellites for Earth observation) and is composed of nearly 120,000 words; the second corpus, related to the smaller project “Microscope”⁴ (a microsatellite, whose main objective is to verify a physical principle), contains nearly 44,000 words. Although the requirements were written under similar circumstances and represent the same levels of specifications for the two projects, it is worth noting that Pleiades and Microscope have totally different scales and purposes. Consequently, the fields to which they relate are at least partially distinct.

³ https://pleiades.cnes.fr/en/PLEIADES/index.htm
⁴ http://missions-scientifiques.cnes.fr/MICROSCOPE/

3.2 Candidate terms

First of all, candidate terms for both corpora were extracted using the terminology extractor developed for the Talismane toolkit (Urieli, 2013); based on a syntactic analysis, it extracts only contiguous noun phrases. The first list we obtained (Pleiades) contained 1,551 candidates, while the second one (Microscope) contained 716 candidates (minimum frequency = 5).

Since they included candidate terms for the genre and for the domain (see section 2), and since we are interested only in the former, all the entries present in a list of terms used at CNES for indexing documents in their knowledge base were removed. This list of domain terms (used here as a “stop list”) has been augmented for many years thanks to internal documents of various types and carefully validated by domain
experts. We therefore assume that the terms that it contains are representative of the fields covered by the different projects conducted at CNES over the past years; furthermore, it is safe to think that it should not contain terms belonging to the genre of requirements, because they would not be helpful for indexing (since they are too general). After this step, only 1,355 entries remained for Pleiades (a difference of almost 200 entries) and 598 for Microscope (more than 100 candidates were thus discarded).

In order to remove even more candidate terms supposedly linked to the field, we decided to keep only entries present in both lists (Pleiades and Microscope). This resulted in a much shorter list of just 300 candidate terms (meaning 1,055 were exclusive to Pleiades and 298 to Microscope). This step makes sense because the specifications of Pleiades and Microscope are comparable at many levels, but also because, as already mentioned, the two projects are sufficiently distinct. Hence, whereas the first selection was useful to eliminate candidates related to the field at a more general level (e.g. “satellite” or “simulation”), here some of the candidates were not kept because they are more dependent on one of the two projects, and thus more specialized (e.g. “magnétomètre” ‘magnetometer’ or “masse interne” ‘internal mass’). (However, because the corpus of specifications from Pleiades is almost three times larger than the other corpus, it is also probable that some terms found only in Pleiades, such as “priorité” ‘priority’, could have appeared in the Microscope corpus as well.)

Lastly, we manually revised the remaining candidate terms to eliminate some entries that were obviously noise. The final list contains 267 candidate terms (to be compared with the original list, which would have contained over 1,850 different candidates, or almost 2,000 if the extraction had been performed on the two corpora as a whole). Interestingly, the terms seem to concern both functional requirements (e.g. “fonctionnalité” ‘functionality’) and non-functional requirements (e.g. “disponibilité” ‘availability’).

3.3 Textual patterns

Of course, a grammar of genre should not be limited to the lexicon, as would be the case with the results of the terminological extraction. We would like to identify recurring syntactic structures or, at least, frequent textual patterns⁵ with the help of text mining tools.

⁵ Patterns of this kind are the basis of the so-called “boilerplates” (Hull et al., 2005), which are basically fixed structures filled with variable elements at determined positions.

For this purpose, we used SDMC (Quiniou et al., 2012) to retrieve patterns of lemmas (i.e. canonical forms of the words) frequent in the two corpora, such as “comme décrire dans le tableau” ‘as describe in the table’, appearing seventeen times in total. These patterns have variable lengths. Here again, the main problem is the huge number of results: almost 14,000 patterns were proposed, making a manual revision extremely time-consuming.

In order to reduce this number to a more reasonable size, we decided to keep only patterns containing at least one of the remaining candidate terms (for the sake of simplicity, the noun phrases were reduced to their heads); indeed, we assume that the structures based on terms belonging to the genre are themselves more likely to be typical of this same genre. This restriction limited the number of patterns to approximately 6,000, among which “être connaître avec un [précision]⁶ meilleur que (number)” ‘be know with a [precision] better than (number)’, “être conforme au [format]” ‘be consistent with the [format]’ and “devoir respecter le [contrainte]” ‘must respect the [constraint]’. The list can be further reduced by focusing on patterns containing a verb. In this way, we consider an intermediary level between the lexicon and the discourse.

⁶ The candidate terms are between square brackets.

To conclude this section, the main steps of the process we described are represented by Figure 1.

[Figure 1 (flowchart): corpus 1 and corpus 2 → terminological extraction → candidate terms (one list per corpus) → stop list → candidate terms (one list per corpus) → common entries → candidate terms (single list); corpus 1 and corpus 2 → text mining → patterns; candidate terms + patterns → filtered patterns]
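The filtering steps of sections 3.2 and 3.3 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the tooling actually used in the study: the function names, the toy frequencies, the toy stop list and the "first token is the head" heuristic are all invented here, and the extraction itself (Talismane) and the pattern mining (SDMC) are assumed to have been run beforehand.

```python
# Minimal sketch of the filtering pipeline (sections 3.2 and 3.3).
# All names and data below are hypothetical, for illustration only.

def filter_candidates(freqs, stop_list, min_freq=5):
    """Keep candidates that are frequent enough and absent from the domain stop list."""
    return {term for term, n in freqs.items()
            if n >= min_freq and term not in stop_list}

def head_of(term):
    """Crude head heuristic (assumption): first token of a French noun phrase."""
    return term.split()[0]

def filter_patterns(patterns, genre_terms):
    """Keep lemma patterns containing the head of at least one genre candidate."""
    heads = {head_of(term) for term in genre_terms}
    return [p for p in patterns if heads.intersection(p)]

# Toy candidate lists (term -> frequency), one per corpus; values are invented.
pleiades = {"satellite": 40, "contrainte": 12, "format": 9,
            "magnétomètre": 7, "précision": 6, "priorité": 5}
microscope = {"satellite": 15, "contrainte": 8, "format": 6,
              "masse interne": 9, "précision": 5, "simulation": 3}
stop_list = {"satellite", "simulation"}  # stands in for the CNES indexing terms

# Step 1: remove domain terms; step 2: keep entries common to both corpora.
genre_candidates = (filter_candidates(pleiades, stop_list)
                    & filter_candidates(microscope, stop_list))

# Step 3: filter mined patterns (tuples of lemmas) with the remaining terms.
patterns = [
    ("devoir", "respecter", "le", "contrainte"),
    ("être", "conforme", "au", "format"),
    ("comme", "décrire", "dans", "le", "tableau"),
    ("le", "satellite", "devoir"),
]
filtered = filter_patterns(patterns, genre_candidates)

print(sorted(genre_candidates))
print(filtered)
```

Using plain sets keeps the two term-level steps (stop-list removal, then intersection) explicit and order-independent; the manual revision step and the optional restriction to patterns containing a verb are deliberately left out, since they require human judgment or part-of-speech information.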
Figure 1. Main steps of the proposed methodology.

4 Results

In this section, we briefly discuss some of the results we obtained after applying the process described previously.

4.1 Regarding terms

Some terms belonging to the space domain remain: initialisms (“ASH”, “DGAPC”), terms too general to be useful for indexing (“mission”, “centre de contrôle” ‘control center’), terms of the field (“tuyère” ‘nozzle’, “calibration”).

Others, by contrast, belong more to the genre. They may describe a need (“besoin de test+programmation+restitution” ‘need for a test+programming+restitution’) or the characteristics of the object that is described (“taille du buffer temporaire+du paquet TM” ‘size of the temporary buffer+TM packet’, “durée de désaturation+la manœuvre” ‘duration of desaturation+the manoeuvre’); they can specify expected functions (“fonction de gestion+filtrage” ‘function of management+filtering’); or they can be related to the management of the project: possible problems (“défaillance” ‘failure’, “défaut” ‘defect’), necessary documentation (“rapport d’avancement+d’expertise” ‘progress+expertise report’), validation (“acceptation” ‘acceptance’, “confirmation”, “autorisation” ‘authorization’).

Some terms can belong either to the field or to the genre, depending on their modifier: “date de début du produit” ‘starting date of the product’ (genre) vs. “dates de début et de fin de vidage TM” ‘starting and ending dates of the emptying of the TM’ (field, because of the domain term “vidage TM”).

4.2 Regarding structures

The most frequent verbs in the patterns are: “être” ‘to be’, “devoir” ‘must’, “permettre” ‘to allow’, “mettre” ‘to put’, “prendre (en compte)” ‘to take (into account)’, “fournir” ‘to provide’, “pouvoir” ‘to be able’, “définir” ‘to define’, “passer (en mode+dans l’état)” ‘to enter (a mode+a state)’, “contenir” ‘to contain’, “donner” ‘to give’, “utiliser” ‘to use’, “gérer” ‘to manage’, “sélectionner” ‘to select’, “rejeter” ‘to reject’, “traiter” ‘to process’, “correspondre” ‘to correspond’, “générer” ‘to generate’, “décrire” ‘to describe’, “tenir” ‘to hold’, “exécuter” ‘to execute’, “vérifier” ‘to verify’, “calculer” ‘to calculate’.

Some structures based on these verbs are typical of the corpus:

[Det N permettre de (V+deverbal noun)]: “le DUPC permettra de modifier localement les paramètres du calcul”.
[Det N fournir Det N1 (à Det N2)]: “cette interface fournit les positions navigateur de l’instrument”.
[Det N utiliser Det N2 (pour V)]: “le système GIDE utilisera le protocole FTP pour effectuer les transferts”.
[Det N fournir (à Det N2) Det N3]: “le système de navigation fournira au système informatique central une référence de temps”.
[Sur réception de cette TC, le LVC exécute la procédure de mise ON+OFF de Det N (, par l’envoi de commandes (sur+vers+à Det N3))]: “sur réception de cette TC, le LVC exécute la procédure de mise ON de la carte IOT sélectionnée, par l’envoi de commandes discrètes sur l’OBMU” (only in Pleiades).
[Det deverbal noun doit s’exécuter (conditions)]: “la consolidation du scenario de travail au CECT doit s’exécuter en moins de 15 secondes” (only in Microscope).
[Det N (avoir la capacité de+être (capable de+autorisé à)) traiter Det N2]: “le CCC doit avoir la capacité de récupérer et traiter 291 Mo de TM par jour”.

These regular structures are therefore part of the grammar of the genre of requirements (at CNES).

5 Conclusion

As emphasized in section 2, specifications of space systems represent a particular type of corpus, because the terms of the domain and the terms of the genre are closely linked, which makes it difficult to distinguish them automatically. In section 3, we described the methodology we applied to keep only the terms belonging to the textual genre, using an existing resource (built for other needs) and a comparison between two corpora. This also allowed us to identify some structures (textual patterns) belonging to the grammar of the genre, which are used for writing functional requirements (describing expected functions) as well as non-functional requirements (describing qualities or constraints applied to the system). The grammar could be refined thanks to existing guides to writing specifications that specify the various sections of the documents and the different types of requirements, which are likely to be expressed in different ways.

Nevertheless, it also appears that it is not always possible to draw a line clearly separating terms of the field and terms of the genre, since some terms may belong to both categories. In any case, the interpretation of the results remains dependent on the objective(s) being pursued.

Finally, we used this experiment as a proof of concept; before we can generalize it, we would have to ask for validation by experts (experienced writers). It would also be very interesting to compare our corpus to specifications written in another domain.

References

Bhatia, V. K. (1993). Analysing genre: Language use in professional settings. London: Longman.

Blancafort, H., Heid, U., Gornostay, T., Méchoulam, C., Daille, B., & Sharoff, S. (2011). User-centred Views on Terminology Extraction Tools: Usage Scenarios and Integration into MT and CAT Tools. In Conference “Translation Careers and Technologies: Convergence Points for the Future” (TRALOGY). Paris, France: INIST.

Condamines, A., & Warnier, M. (2014). Linguistic Analysis of Requirements of a Space Project and Their Conformity with the Recommendations Proposed by a Controlled Natural Language. In B. Davis, K. Kaljurand, & T. Kuhn (Eds.), Controlled Natural Language (pp. 33–43). Springer International Publishing.

Hull, E., Jackson, K., & Dick, J. (2005). Requirements engineering. London: Springer.

IEEE Standard Glossary of Software Engineering Terminology. (1990). IEEE Std 610.12-1990, 1–84. http://doi.org/10.1109/IEEESTD.1990.101064

Kuhn, T. (2014). A Survey and Classification of Controlled Natural Languages. Computational Linguistics, 40(1), 121–170.

Lee, D. Y. (2001). Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. Retrieved from http://ro.uow.edu.au/artspapers/598/

Pace, G. J., & Rosner, M. (2010). A Controlled Language for the Specification of Contracts. In N. Fuchs (Ed.), CNL 2009 Workshop (pp. 226–245). Marettimo: Springer.

Quiniou, S., Cellier, P., Charnois, T., & Legallois, D. (2012). What About Sequential Data Mining Techniques to Identify Linguistic Patterns for Stylistics? In International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’12) (pp. 166–177). New Delhi, India.

Somers, H. (1998). An Attempt to Use Weighted Cusums to Identify Sublanguages. In D.M.W. Powers (Ed.), NeMLaP3/CoNLL 98: New Methods in Language Processing and Computational Natural Language Learning (pp. 131–139). ACL.

Tutin, A. (2007). Modélisation linguistique et annotation des collocations: une application au lexique transdisciplinaire des écrits scientifiques. Formaliser Les Langues Avec L’ordinateur: Actes Des Sixièmes, Sofia 2003, et Septièmes, Tours 2004, Journées Intex-Nooj, 3, 189.

Urieli, A. (2013). Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit. Université de Toulouse 2 - Le Mirail, Toulouse.