=Paper= {{Paper |id=Vol-1495/paper_6 |storemode=property |title=A Methodology for Identifying Terms and Patterns Specific to Requirements as a Textual Genre Using Automated Tools |pdfUrl=https://ceur-ws.org/Vol-1495/paper_6.pdf |volume=Vol-1495 |dblpUrl=https://dblp.org/rec/conf/tia/WarnierC15 }} ==A Methodology for Identifying Terms and Patterns Specific to Requirements as a Textual Genre Using Automated Tools== https://ceur-ws.org/Vol-1495/paper_6.pdf
                 Proceedings of the conference Terminology and Artificial Intelligence 2015 (Granada, Spain)





A Methodology for Identifying Terms and Patterns Specific to Requirements as a Textual Genre Using Automated Tools

Maxime Warnier
CLLE-ERSS (UMR 5263)
Université Toulouse – Jean Jaurès & CNRS
Centre National d’Études Spatiales
maxime.warnier@univ-tlse2.fr

Anne Condamines
CLLE-ERSS (UMR 5263)
Université Toulouse – Jean Jaurès & CNRS
anne.condamines@univ-tlse2.fr

Abstract

As a step in a project whose final goal is to propose a Controlled Natural Language for requirements writing at CNES (Centre National d’Études Spatiales), we intend to build the grammar of the textual genre of the requirements. One of the main issues faced when analyzing our corpus is the (sometimes subtle) difference between the terms and syntactic structures pertaining to the genre and those linked to the domain (in our case, the development of space systems), a difference that is generally not taken into account by automated tools. In this paper, we present a methodology aimed at detecting candidate terms and textual patterns specific to the genre by combining results obtained from a terminology extractor and a data mining tool with a validated resource in use for indexing documents at CNES. The results are then illustrated by a selection of examples from our corpus.

1 Introduction

This study is part of a wider project aiming at improving the writing of requirements¹ at CNES (Centre National d’Études Spatiales), the French Space Agency.

¹ According to one of the definitions given by IEEE (1990), a requirement is: “a condition or capability that must be met or possessed by a system or system component to satisfy a contract, standard, specification, or other formally imposed documents”.

Indeed, the requirements (as well as the specifications, that is, the documents in which they are included) are mostly written in a natural language (in this case, French), and as a consequence they may sometimes contain well-known related problems, such as ambiguity and vagueness (Pace & Rosner, 2010). A Controlled Natural Language (CNL) is a possible solution to avoid or at least substantially limit these problems by setting constraints on the lexicon, the syntax or the semantics (Kuhn, 2014).

However, in order for this CNL to be actually applied, we believe that it should not be unnecessarily restrictive and, in particular, not too far removed from the way engineers are already used to writing the documents; otherwise, they will probably simply ignore it. In other words, we wish to propose a CNL inspired by already existing data, following a corpus-driven and corpus-based methodology that we describe in more detail in (Condamines & Warnier, 2014).

This methodology relies on the existence of a textual genre, which Bhatia (1993) defines as “a recognizable communicative event characterized by a set of communicative purpose(s) identified and mutually understood by the members of the professional or academic community in which it regularly occurs”, as is clearly the case for requirements writing (since it is a recurring task performed by employees working in similar companies), and in particular of a sublanguage, defined by Somers (1998) as “an identifiable genre or text-type in a given subject field, with a relatively or even absolutely closed set of syntactic structures and vocabulary”. We were already able to provide some evidence in favor of this hypothesis (if not for all requirements, at least for requirements written in French at CNES) and we are now trying to build the grammar (that is to say, the set of rules followed, consciously or not, by the speakers of this community to produce acceptable utterances) of this particular genre by semi-automatically analyzing specifications of two former projects.

In the present study, we will focus on the results obtained by a terminological extraction. More specifically, we will propose a method to sort them (as we are interested only in the terms pertaining to the genre, not in those pertaining to




the domain) and subsequently to use them as a filter to retrieve textual patterns belonging to the grammar of the genre. An example of similar work, based on collocations and n-grams, is given by the transdisciplinary scientific lexicon (Tutin, 2007).

2 Genre vs. domain

Although this grammar should ideally be independent of the field (aerospace industry, aeronautics, software engineering, etc.), in practice, the distinction is not so simple as regards specifications². While some features are indeed inherent in the nature of the documents (because they describe something that does not exist yet, but will have to exist and to conform to the requirements, the use of the future tense and injunctions, for instance, is common), others are closely related to the field to which the future “object” being described belongs. It may reasonably be assumed that the lexical level, since it directly refers to the object in question, is most significantly affected by the domain, but we cannot reject the hypothesis that syntactic structures too may differ from one field to another.

² The distinction between genre and domain itself is actually far from trivial (Lee, 2001).

For that reason, if we want to define a terminology of requirements, we must keep in mind that the candidate terms proposed by the terminology extractors may actually belong either to the genre or to the domain. Unfortunately, although the possibility to filter terms by domain has already been highlighted as a user need (Blancafort et al., 2011), traditional extractors do not provide any means to distinguish a priori between genre and domain, because they are designed mostly for more didactic corpora, where the field matters much more than the genre (e.g. in order to establish the terminology in use in a company or in a knowledge domain). Furthermore, similar problems are to be expected when using other kinds of automated tools (such as data mining software), as they will also mix the two different types of words and terms.

Specifications are thus unusual, specialized corpora and they bring new challenges to terminology extraction in general. In particular, considering the fact that the candidate terms linked to the domain are probably more numerous than those linked to the genre, we want to find a way to exploit the results without having to revise all of them manually. In the next section, we present the small experiment we conducted on our corpus of specifications as a possible way to reach this goal, but also to reuse these results to filter textual patterns identified by a text mining tool.

3 Methodology

3.1 Corpora

All the operations described hereafter were performed on two corpora of requirements in French extracted from several specifications provided by CNES. (All tables and figures were removed from the requirements, because their automatic analysis would have been more difficult.) The first corpus concerns the project called “Pleiades”³ (two very-high-resolution satellites for Earth observation) and is composed of nearly 120,000 words; the second corpus, related to the smaller project “Microscope”⁴ (a microsatellite whose main objective is to verify a physical principle), contains nearly 44,000 words. Although the requirements were written under similar circumstances and represent the same levels of specifications for the two projects, it is worth noting that Pleiades and Microscope have totally different scales and purposes. Consequently, the fields to which they relate are at least partially distinct.

³ https://pleiades.cnes.fr/en/PLEIADES/index.htm
⁴ http://missions-scientifiques.cnes.fr/MICROSCOPE/

3.2 Candidate terms

First of all, candidate terms for both corpora were extracted using the terminology extractor developed for the Talismane toolkit (Urieli, 2013); based on a syntactic analysis, it extracts only contiguous noun phrases. The first list we obtained (Pleiades) contained 1,551 candidates, while the second one (Microscope) contained 716 candidates (minimum frequency = 5).

Since they included candidate terms for the genre and for the domain (see section 2), and since we are interested only in the former, all the entries present in a list of terms used at CNES for indexing documents in their knowledge base were removed. This list of domain terms (used here as a “stop list”) has been augmented for many years thanks to internal documents of various types and carefully validated by domain




experts. We therefore assume that the terms that it contains are representative of the fields covered by the different projects conducted at CNES over the past years; furthermore, it is safe to think that it should not contain terms belonging to the genre of requirements, because they would not be helpful for indexing (since they are too general). After this step, only 1,355 entries remained for Pleiades (a difference of almost 200 entries) and 598 for Microscope (more than 100 candidates were thus discarded).

In order to remove even more candidate terms supposedly linked to the field, we decided to keep only entries present in both lists (Pleiades and Microscope). This resulted in a much shorter list of just 300 candidate terms (meaning 1,055 were exclusive to Pleiades and 298 to Microscope). This step makes sense because the specifications of Pleiades and Microscope are comparable at many levels, but also because, as already mentioned, the two projects are sufficiently distinct. Hence, whereas the first selection was useful to eliminate candidates related to the field at a more general level (e.g. “satellite” or “simulation”), here some of the candidates were not kept because they are more dependent on one of the two projects, and thus more specialized (e.g. “magnétomètre” ‘magnetometer’ or “masse interne” ‘internal mass’). (However, because the corpus of specifications from Pleiades is almost three times larger than the other corpus, it is also probable that some terms, such as “priorité” ‘priority’, could have appeared in the Microscope corpus as well.)

Lastly, we proceeded to a manual revision of the remaining candidate terms to eliminate some entries that were obviously noise. The final list contains 267 candidate terms (to be compared with the original list, which would have contained over 1,850 different candidates, or almost 2,000 if the extraction had been performed on the two corpora as a whole). Interestingly, the terms seem to concern both functional requirements (e.g. “fonctionnalité” ‘functionality’) and non-functional requirements (e.g. “disponibilité” ‘availability’).

3.3 Textual patterns

Of course, a grammar of genre should not be limited to the lexicon, as would be the case with the results of the terminological extraction. We would like to identify recurring syntactic structures or, at least, frequent textual patterns⁵ with the help of text mining tools.

For this purpose, we used SDMC (Quiniou et al., 2012) to retrieve patterns of lemmas (i.e. canonical forms of the words) frequent in the two corpora, such as “comme décrire dans le tableau” ‘as describe in the table’, appearing seventeen times in total. These patterns have variable lengths. Here again, the main problem is the huge number of results: almost 14,000 patterns were proposed, making a manual revision extremely time-consuming.

In order to reduce this number to a more manageable size, we decided to keep only patterns containing at least one of the remaining candidate terms (for the sake of simplicity, the noun phrases were reduced to their heads); indeed, we assume that the structures based on terms belonging to the genre are themselves more likely to be typical of this same genre. This restriction limited the number of patterns to approximately 6,000, among which “être connaître avec un [précision]⁶ meilleur que (number)” ‘be know with a [precision] better than (number)’, “être conforme au [format]” ‘be consistent with the [format]’ and “devoir respecter le [contrainte]” ‘must respect the [constraint]’.

The list can be further reduced by focusing on patterns containing a verb. In this way, we consider an intermediary level between the lexicon and the discourse.

To conclude this section, the main steps of the process we described are represented by Figure 1.

[Figure 1 (diagram; labels recovered from the original): corpus 1, corpus 2; terminological extraction; candidate terms (one list per corpus); stop list; candidate terms (one list per corpus); common entries; candidate terms; text mining; patterns; filtered patterns.]

⁵ Patterns of this kind are the basis of the so-called “boilerplates” (Hull et al., 2005), which are basically fixed structures filled with variable elements at determined positions.
                                                               6
                                                                 The candidate terms are between square brackets.
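
The two automatic filtering steps of section 3.2 (removal of stop-list entries, then intersection of the two lists) amount to simple set operations. The following is a minimal sketch under stated assumptions, not the procedure actually run at CNES: the function names, the normalization step and the toy term lists are all invented for illustration.

```python
def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so that lists can be compared.

    Assumption: the real lists may need a different normalization.
    """
    return " ".join(term.lower().split())


def filter_genre_candidates(list_a, list_b, stop_list):
    """Drop stop-list (domain) terms from each list, then keep shared entries."""
    stop = {normalize(t) for t in stop_list}
    kept_a = {normalize(t) for t in list_a} - stop
    kept_b = {normalize(t) for t in list_b} - stop
    return kept_a, kept_b, kept_a & kept_b


# Toy data, invented for illustration (not the CNES lists).
pleiades = ["satellite", "magnétomètre", "fonctionnalité", "disponibilité", "priorité"]
microscope = ["satellite", "masse interne", "fonctionnalité", "disponibilité"]
domain_stop_list = ["satellite", "simulation"]

kept_p, kept_m, common = filter_genre_candidates(pleiades, microscope, domain_stop_list)
print(sorted(common))
```

The counts reported above are consistent with this set logic: 1,355 − 300 = 1,055 entries exclusive to Pleiades and 598 − 300 = 298 exclusive to Microscope. The final manual revision (300 to 267 terms) is not modeled here.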

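The pattern filter of section 3.3 (keep only lemma patterns that contain the head of at least one remaining candidate term, optionally requiring a verb) can be sketched as follows. This is an illustration under assumptions: patterns are represented as space-separated lemma strings, which is not necessarily SDMC's output format; the head of a French noun phrase is approximated by its first token; and the sample patterns, the verb set and all function names are hypothetical.

```python
def head_of(term: str) -> str:
    """Crude head heuristic (assumption): first token of a French noun phrase."""
    return term.split()[0].lower()


def filter_patterns(patterns, genre_terms, verb_lemmas=None):
    """Keep lemma patterns containing a genre-term head (and, optionally, a verb)."""
    heads = {head_of(t) for t in genre_terms}
    kept = []
    for pattern in patterns:
        lemmas = set(pattern.lower().split())
        if not lemmas & heads:
            continue  # no candidate-term head in this pattern
        if verb_lemmas is not None and not lemmas & verb_lemmas:
            continue  # verb required but none found
        kept.append(pattern)
    return kept


# Toy data: the first three patterns echo examples from the text,
# the last one is invented to show the effect of the verb filter.
patterns = [
    "être conforme au format",
    "devoir respecter le contrainte",
    "comme décrire dans le tableau",
    "le format de le",
]
genre_terms = ["format", "contrainte", "précision"]
verbs = {"être", "devoir", "décrire"}

with_term = filter_patterns(patterns, genre_terms)
with_term_and_verb = filter_patterns(patterns, genre_terms, verb_lemmas=verbs)
print(with_term)
print(with_term_and_verb)
```

A set-based membership test keeps the filter linear in the number of patterns, which matters little for the toy data but is convenient for lists of the size reported above (almost 14,000 patterns).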



Figure 1. Main steps of the proposed methodology.

4 Results

In this section, we briefly discuss some of the results we obtained after applying the process described previously.

4.1 Regarding terms

Some terms belonging to the space domain remain: initialisms (“ASH”, “DGAPC”), terms too general to be useful for indexing (“mission”, “centre de contrôle” ‘control center’), terms of the field (“tuyère” ‘nozzle’, “calibration”).

Others, by contrast, belong more to the genre. They may describe a need (“besoin de test+programmation+restitution” ‘need for a test+programming+restitution’) or the characteristics of the object that is described (“taille du buffer temporaire+du paquet TM” ‘size of the temporary buffer+TM packet’, “durée de désaturation+la manœuvre” ‘duration of desaturation+the manoeuvre’); they can specify expected functions (“fonction de gestion+filtrage” ‘function of management+filtering’); or they can be related to the management of the project: possible problems (“défaillance” ‘failure’, “défaut” ‘defect’), necessary documentation (“rapport d’avancement+d’expertise” ‘progress+expertise report’), validation (“acceptation” ‘acceptance’, “confirmation”, “autorisation” ‘authorization’).

Some terms can belong either to the field or to the genre, depending on their modifier: “date de début du produit” ‘starting date of the product’ (genre) vs. “dates de début et de fin de vidage TM” ‘starting and ending dates of the emptying of the TM’ (field, because of the domain term “vidage TM”).

4.2 Regarding structures

The most frequent verbs in the patterns are: “être” ‘to be’, “devoir” ‘must’, “permettre” ‘to allow’, “mettre” ‘to put’, “prendre (en compte)” ‘to take (into account)’, “fournir” ‘to provide’, “pouvoir” ‘to be able’, “définir” ‘to define’, “passer (en mode+dans l’état)” ‘to enter (a mode+a state)’, “contenir” ‘to contain’, “donner” ‘to give’, “utiliser” ‘to use’, “gérer” ‘to manage’, “sélectionner” ‘to select’, “rejeter” ‘to reject’, “traiter” ‘to process’, “correspondre” ‘to correspond’, “générer” ‘to generate’, “décrire” ‘to describe’, “tenir” ‘to hold’, “exécuter” ‘to execute’, “vérifier” ‘to verify’, “calculer” ‘to calculate’.

Some structures based on these verbs are typical of the corpus:

[Det N permettre de (V+deverbal noun)]: “le DUPC permettra de modifier localement les paramètres du calcul” ‘the DUPC will make it possible to modify the calculation parameters locally’.

[Det N fournir Det N1 (à Det N2)]: “cette interface fournit les positions navigateur de l’instrument” ‘this interface provides the navigator positions of the instrument’.

[Det N utiliser Det N2 (pour V)]: “le système GIDE utilisera le protocole FTP pour effectuer les transferts” ‘the GIDE system will use the FTP protocol to perform the transfers’.

[Det N fournir (à Det N2) Det N3]: “le système de navigation fournira au système informatique central une référence de temps” ‘the navigation system will provide the central computer system with a time reference’.

[Sur réception de cette TC, le LVC exécute la procédure de mise ON+OFF de Det N (, par l’envoi de commandes (sur+vers+à Det N3))]: “sur réception de cette TC, le LVC exécute la procédure de mise ON de la carte IOT sélectionnée, par l’envoi de commandes discrètes sur l’OBMU” ‘on reception of this TC, the LVC executes the procedure for switching ON the selected IOT board, by sending discrete commands on the OBMU’ (only in Pleiades).

[Det deverbal noun doit s’exécuter (conditions)]: “la consolidation du scenario de travail au CECT doit s’exécuter en moins de 15 secondes” ‘the consolidation of the work scenario at the CECT must execute in less than 15 seconds’ (only in Microscope).

[Det N (avoir la capacité de+être (capable de+autorisé à)) traiter Det N2]: “le CCC doit avoir la capacité de récupérer et traiter 291 Mo de TM par jour” ‘the CCC must have the capacity to retrieve and process 291 MB of TM per day’.

These regular structures are therefore part of the grammar of the genre of requirements (at CNES).

5 Conclusion

As emphasized in section 2, specifications of space systems represent a particular type of corpus, because the terms of the domain and the terms of the genre are closely linked, making it difficult to distinguish them automatically. In section 3, we described the methodology we applied to keep only the terms belonging to the textual genre, using an existing resource (built for other needs) and a comparison between two corpora. This also allowed us to identify some structures (textual patterns) belonging to the grammar of the genre, which are used for writing functional requirements (describing expected functions) as well as for non-functional requirements (describing qualities or constraints applied to the system). The grammar could be refined thanks to existing guides to writing specifications that specify the various sections of the documents and the different types of requirements, which are likely to be expressed in different ways.

Nevertheless, it also appears that it is not always possible to draw a line clearly separating terms of the field and terms of the genre, since some terms may belong to both categories. In any case, the interpretation of the results remains dependent on the objective(s) being pursued.

Finally, we used this experiment as a proof-of-concept; before we can generalize it, we would have to ask for validation by experts (experienced writers). It would also be very interesting to compare our corpus to specifications written in another domain.

References

Bhatia, V. K. (1993). Analysing genre: Language use in professional settings. London: Longman.

Blancafort, H., Heid, U., Gornostay, T., Méchoulam, C., Daille, B., & Sharoff, S. (2011). User-centred Views on Terminology Extraction Tools: Usage Scenarios and Integration into MT and CAT Tools. In Conference “Translation Careers and Technologies: Convergence Points for the Future” (TRALOGY). Paris, France: INIST.

Condamines, A., & Warnier, M. (2014). Linguistic Analysis of Requirements of a Space Project and Their Conformity with the Recommendations Proposed by a Controlled Natural Language. In B. Davis, K. Kaljurand, & T. Kuhn (Eds.), Controlled Natural Language (pp. 33–43). Springer International Publishing.

Hull, E., Jackson, K., & Dick, J. (2005). Requirements engineering. London: Springer.

IEEE Standard Glossary of Software Engineering Terminology. (1990). IEEE Std 610.12-1990, 1–84. http://doi.org/10.1109/IEEESTD.1990.101064

Kuhn, T. (2014). A Survey and Classification of Controlled Natural Languages. Computational Linguistics, 40(1), 121–170.

Lee, D. Y. (2001). Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. Retrieved from http://ro.uow.edu.au/artspapers/598/

Pace, G. J., & Rosner, M. (2010). A Controlled Language for the Specification of Contracts. In N. Fuchs (Ed.), CNL 2009 Workshop (pp. 226–245). Marettimo: Springer.

Quiniou, S., Cellier, P., Charnois, T., & Legallois, D. (2012). What About Sequential Data Mining Techniques to Identify Linguistic Patterns for Stylistics? In International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’12) (pp. 166–177). New Delhi, India.

Somers, H. (1998). An Attempt to Use Weighted Cusums to Identify Sublanguages. In D.M.W. Powers (Ed.), NeMLaP3/CoNLL 98: New Methods in Language Processing and Computational Natural Language Learning (pp. 131–139). ACL.

Tutin, A. (2007). Modélisation linguistique et annotation des collocations: une application au lexique transdisciplinaire des écrits scientifiques. Formaliser Les Langues Avec L’ordinateur: Actes Des Sixièmes, Sofia 2003, et Septièmes, Tours 2004, Journées Intex-Nooj, 3, 189.

Urieli, A. (2013). Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit. Université de Toulouse 2 - Le Mirail, Toulouse.