Semi-automatic Extraction of Academic Transdisciplinary Phraseology: Expanding a Corpus of Categorized Academic Phraseology Using BERT Machine Learning Models (Short Paper) Micaela Aguiar 1, José Monteiro 1 and Sílvia Araújo 1 1 University of Minho, Rua da Universidade, 4710-057 Braga, Portugal Abstract The lack of access to academic literacy skills and tools is a serious problem, as it furthers inequality among students. In this paper, we propose a methodology to semi-automatically extend a corpus of academic phraseology, previously manually extracted and categorized, using BERT machine learning models. We begin by describing the constitution of the manually extracted and categorized corpus. Next, we briefly discuss how the BERT machine learning model works. Then, we explore the methodology for the semi-automatic extension of the initial corpus: from the constitution of a new corpus, to the preparation of the documents to be processed by the model, to the manual evaluation of the retrieved sentences. We ran two tests, the results of which we report in this paper. Keywords academic phraseology, transdisciplinary scientific phraseology, natural language processing, BERT, semi-automatic extraction 1. Introduction Many reasons are given for the lack of academic literacy among higher education students. Some point the finger at earlier levels of education, and, in particular, at language subjects, whose curriculum is often focused mainly on literary texts [1]. Others point out that teachers themselves sometimes “have insufficient meta-language skills and knowledge to discuss writing issues with students and to explain their expectations with respect to student assignments” [2]. Institutions try to address the issue by offering academic writing courses. However, not all institutions provide the same offerings, and these courses are often subject to fees. This promotes inequality in access to academic literacy skills.
As part of the research project PortLinguE, a Portuguese project financed by European funds, we are developing a tool that will assist Portuguese and non-native students in academic writing tasks. This tool will be freely available to all students, and one of its goals is to bridge the gap in access to academic literacy skills. The tool centers around the creation of a phrase bank of European Portuguese academic phraseology, which was manually extracted and categorized. This paper describes a methodology to expand the initial corpus of manually extracted and categorized academic phrases using BERT machine learning models. We will begin by addressing the framework of academic phraseology, followed by a brief description of the constitution of the initial corpus and the description of the methodology for the corpus expansion, ending with the results of the tests we ran. 2nd International Conference on “Multilingual Digital Terminology Today, Design, Representation Formats and Management Systems” (MDTT 2023), June 29-30, 2023, Lisbon, Portugal EMAIL: maguiar60@gmail.com (M. Aguiar); jdiogoxmonteiro@gmail.com (J. Monteiro); saraujo@elach.uminho.pt (S. Araújo) ORCID: 0000-0002-5923-9257 (M. Aguiar); 0000-0002-2904-3501 (J. Monteiro); 0000-0003-4321-4511 (S. Araújo) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org 2. Framework Formulaic language is favored by native speakers in their communication [3]. However, using academic formulas is not a “linguistic universal skill” [4]; in fact, failing to use formulaic language may even be considered a signal of “lack of mastery of a novice writer in a specific disciplinary community” [5].
The concept of “transdisciplinary scientific lexicon” accounts for the non-terminological lexicon common to the scientific community and widely shared across disciplines [6]. Transdisciplinary scientific lexicon may refer to single words, multiword referential sequences (like collocations and fixed expressions), multiword discursive sequences (recurring expressions used to structure discourse), multiword interpersonal sequences (expressions used to convey pragmatic or modal functions) and semantic-rhetorical routines [7]. According to Tutin, semantic-rhetorical routines are typical utterances of scientific writing corresponding to a specific rhetorical function: they correspond to complete statements, built around a predicate. An academic phrase bank is a list of collocations, discourse markers, and hedging devices, but mostly of semantic-rhetorical routines, that perform various functions, such as referring to sources, describing the results of an experiment or stating the conclusions of a study. There are some academic phrase banks available online, such as the Ref-N-Write Academic Phrasebank for English (https://www.ref-n-write.com/academic-phrasebank/) or the Dictionnaire des expressions from Base ARTES for multiple languages (https://artes.app.univ-paris-diderot.fr/artes-symfony/web/app.php/fr). In Portuguese, Bab.la (https://en.bab.la/phrases/academic/opening/english-portuguese) offers a multilingual Portuguese phrase bank. Morley [8] developed the most popular English phrase bank, the University of Manchester's Academic Phrasebank (https://www.phrasebank.manchester.ac.uk/). The Academic Phrasebank corpus originally consisted of 100 postgraduate dissertations from the University of Manchester, and has since incorporated academic material from a variety of sources. Morley drew on Swales's concept of move to manually extract and categorize sections of text serving a particular communicative purpose.
There aren’t many academic phrase banks exclusively for European Portuguese, so we are developing one as a freely available tool for academic literacy. 3. Manually Extracted Corpus To create the European Portuguese academic phrase bank, we started with a corpus of 40 scientific papers taken from RepositoriUM, the repository of the University of Minho, and Repositório Aberto, the repository of the University of Porto. The papers were divided into four scientific areas: Life and Health Sciences, Exact and Engineering Sciences, Natural and Environmental Sciences, and Social Sciences and Humanities. Papers were only included in the corpus if they were written in European Portuguese and were available in open access. Taking into account the pedagogical nature of the phrase bank, we elected to create a corpus of scientific papers, because the scientific paper as a genre is ubiquitous in the work of any researcher and because it can be used as a model for young researchers and undergraduates. As Tutin and Jacques [6] point out, the lexicon shared by scientific productions is a lexicon of genre. That is why, with regard to the semantic-rhetorical categories, we used an adapted version of the typology put forward by Morley. Five main categories were considered: introduction, literature review, methodology, results, discussion and conclusions. These categories were informed by the concept of included genres (genre inclus in the French original) [9]. Included genres refers to sections of text, such as the introduction or the conclusion, that can be found in distinct genres, such as the scientific paper, the doctoral thesis or the conference paper. The extraction and categorization were carried out using MAXQDA, a qualitative data analysis and mixed methods software. Afterwards, the extracted phraseological units were simplified: any particular content was removed or substituted by more general terms.
This highlights the phraseological element, makes it easier for students to use it in their writing and discourages plagiarism. Fifty sub-categories were identified and nearly a thousand phraseological units were extracted. 4. Extending the Corpus Tutin [7] points out that, unlike other types of lexical sequences, the extraction and categorization of semantic-rhetorical routines is hard to automate, due to their large lexical, syntactic and semantic variety. To try to automatically extract this type of more complex phraseological unit, we will use the manually extracted and categorized corpus as a starting point to find similar elements in a larger corpus, using the natural language processing model BERT [10]. 4.1. BERT Model BERT (Bidirectional Encoder Representations from Transformers) is a Natural Language Processing model that analyzes text corpora in terms of similarities at word, collocation and sentence level and distributes the processed data based on semantic similarity, thus generating semantic vectors. The Transformer [11] is a deep learning architecture, first introduced in the 2017 paper “Attention is all you need” [12], that uses attention (a concept that arose in NLP and Machine Translation) to weigh the relevance of each part of the input data. Previous recurrent neural networks (RNNs) were not able to process long sentences well, because RNNs process them sequentially (left to right or right to left) and tend “to forget information from timesteps that are far behind” [13]. The concept of attention addresses this problem: similarly to humans, the model can “look at all the different words at the same time and learn to ‘pay attention’ to the correct ones depending on the task at hand” [13]. Transformers are capable of reading a sentence in both directions at the same time (hence the concept of bidirectionality). Since Transformers can process data in any order, they can be trained on large amounts of data, enabling the creation of pre-trained models [14].
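The attention mechanism described above can be sketched in a few lines. The following is a simplified, single-head scaled dot-product attention over toy vectors, not the multi-head implementation used in actual Transformers; it only illustrates how the model weighs each part of the input by its relevance to a query:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention: score each key against the query,
    turn the scores into weights, and return the weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

# The query "attends" most strongly to the key that resembles it.
context, weights = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]])
```

Because the scores are computed over all positions at once, nothing forces a left-to-right order, which is what allows the bidirectional reading mentioned above.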
BERT is the most popular and widely used pre-trained model in Natural Language Processing. BERT is usually trained for two main purposes: masked language modeling (which entails predicting a randomly masked word) and Next Sentence Prediction (which involves predicting whether two sentences are consecutive or not) [15]. This way, BERT is able to process context, since “words are defined by their surroundings, not by a pre-fixed identity” [14]. This feature enables BERT to perform semantic searches. This type of search is unique given that it seeks to “determine the intent and contextual meaning of the words a person is using for a search”. We will train BERT models to perform semantic searches and find similar phrases and structures in a large text corpus, using the initial manually extracted and categorized corpus. 4.2. Methodology In this section, we will describe the steps in the methodology we propose to semi-automatically extend a corpus of previously categorized academic phraseology. The first step in this methodology is to compile a new corpus. The new corpus will be composed of 40 PhD dissertations and 40 Master’s theses, drawn from the four disciplinary areas mentioned above. The new corpus will meet the same inclusion criteria as the original corpus: being available in open access and written in European Portuguese. This is the only step where human involvement is necessary, and our system only requires that the user create a file with the repository links of the files they wish to extract. The second step is to prepare the collected documents to be processed by the model. Generally speaking, academic texts are deposited in repositories in PDF format. These documents must be downloaded and then converted into text format. To perform this task, our Python pipeline uses the requests module to download the documents (using the URLs provided by the user). Then, the pdfplumber [16] package allows for extracting the text from each document.
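A minimal sketch of these first two steps follows. The helper `filename_from_url` is hypothetical (introduced here only for illustration), and the error handling and retry logic a production pipeline would need are omitted:

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Derive a local file name from a repository download URL
    (hypothetical helper; falls back to a generic name)."""
    name = os.path.basename(urlparse(url).path)
    return name or "document.pdf"

def download_pdfs(url_file, out_dir):
    """Download every PDF listed (one URL per line) in the user's file."""
    import requests  # third-party; deferred so the helper above stays stdlib-only
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(url_file, encoding="utf-8") as f:
        for url in (line.strip() for line in f):
            if not url:
                continue
            path = os.path.join(out_dir, filename_from_url(url))
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            with open(path, "wb") as out:
                out.write(resp.content)
            paths.append(path)
    return paths

def pdf_to_text(pdf_path):
    """Extract plain text from a PDF, page by page."""
    import pdfplumber  # third-party; https://github.com/jsvine/pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

`pdfplumber`'s `extract_text()` can return `None` for image-only pages, hence the `or ""` guard.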
After the documents have been converted into text format, the third step is to extract individual sentences from each document. To parse each sentence, we opted to use the NLTK (Natural Language Toolkit) [17] platform for Python, specifically the portuguese.pickle tokenizer. The fourth step is to use the BERT model to convert the individual extracted sentences into semantic vectors, and then store them in FAISS, an efficient vector similarity search database. Instead of manually producing the embeddings and managing the database, we opted to use Haystack [18], an end-to-end framework that enables the construction of powerful and production-ready pipelines for searching text. Our Haystack pipeline was set up to use an efficient FAISS database and a pre-trained BERT model called BERTimbau [19]. BERTimbau has the limitation of being a model trained with the BrWaC corpus (Brazilian Web as Corpus). However, there are no large BERT models trained for European Portuguese yet. With our Haystack framework working and fully indexed with sentences, the fifth step consists of finding, for each phrase of the original corpus, the top 10 most similar sentences from the new corpus. The framework facilitates this process by providing methods for querying the database. These methods have two purposes: first, they convert the original phrase into a sentence vector (using the same pre-trained model); second, they perform a dot product search between the new query vector and the vectors stored in the database. The search returns the 10 most similar phrases from the new corpus. Given the semantic nature of the process, the phrases should be similar in both content and context. The sixth step is the manual evaluation of the results. 5. Results & Discussion In the application of this methodology, we first tested the rufimelo/Legal-BERTimbau-sts-base-ma-v2 [20] model and evaluated the results manually. This first trial served for both exploratory and baseline purposes.
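The embed-index-query loop of steps four and five reduces to the pattern below. The toy `embed` function is only a deterministic stand-in for the BERTimbau encoder (a hashed bag-of-words vector, introduced here purely to make the dot-product ranking concrete); in the real pipeline, Haystack produces and stores the embeddings:

```python
import math

def embed(sentence, dim=16):
    """Toy stand-in for a BERT sentence encoder: a normalized hashed
    bag-of-words vector. The real pipeline uses BERTimbau via Haystack."""
    vec = [0.0] * dim
    for token in sentence.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def top_k(template, corpus_sentences, k=10):
    """Rank corpus sentences by dot product with the template's vector
    and return the k most similar ones."""
    q = embed(template)
    scored = [(sum(a * b for a, b in zip(q, embed(s))), s)
              for s in corpus_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]

corpus = ["os resultados mostram que x",
          "a tabela apresenta os dados",
          "os resultados mostram que y"]
best = top_k("os resultados mostram que z", corpus, k=2)
```

Since the vectors are normalized, the dot product here equals cosine similarity, which is the usual similarity measure for sentence embeddings.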
We found that the results of this first test were not particularly positive; however, they still showed promise for the application of our methodology. For this reason, we decided to run a new test, this time using the larger rufimelo/Legal-BERTimbau-sts-large-ma-v3 [21] model. We also carried out a manual evaluation of the second model, whose results we describe next. For the manual evaluation, we defined 8 categories: 1 - it corresponds to a variation of the template; 2 - it is a repetition of the template; 3 - it deviates from the original meaning by focusing on words from the context; 4 - it is not an expression (titles, numbers, punctuation, etc.) or it is not a complete expression; 5 - it is an expression that is not related to the original template; 6 - it is an ambiguous expression that only the context can clarify; 7 - it is an expression that corresponds to another function; 8 - it is an expression in a language other than Portuguese. The manual evaluation worked as follows: for each of the 50 subcategories in the phrase bank, 3 templates were categorized (two model-proposed sentences per template). The initial corpus contains variables like x, y, and z that replace expressions with particular content, so in evaluating the first model we evaluated the results both with and without variables (we determined that the variables were a source of noise for the model). That is why for the first model we evaluated 600 sentences and for the second we evaluated 300. The results of the two models are described in Table 1.
Table 1
Model Comparison

Categories             Model 1        Model 2
Total                  600            300
1 - Match              121 (20.1%)    134 (45.5%)
2 - Repetition         0 (0%)         0 (0%)
3 - Contextual Focus   187 (31.1%)    25 (4.1%)
4 - No Expression      125 (20.8%)    8 (2.6%)
5 - Random             97 (16.1%)     38 (12.6%)
6 - Ambiguous          58 (9.6%)      11 (3.6%)
7 - Mismatch           10 (1.6%)      63 (21%)
8 - Other Language     4 (0.6%)       21 (7%)

As can be seen, there was a considerable increase in the number of matches from the first model (20.1%) to the second (45.5%). The first model presented many instances where the results were bound to a word from the context of the template (31.1%), kept for the sake of readability; the second model (4.1%) presented considerably fewer occurrences of this problem. Another result that should be highlighted is the increase in mismatch cases from the first model (1.6%) to the second (21%). We categorized as mismatches occurrences in which the expression corresponds to another function, usually very close to the one presented by the model. This is because the phrase bank categories often present nuances that are difficult to resolve, for example, between the functions of “reporting unexpected results”, “commenting on the results” or “summarizing the results”. In these cases, the expressions will later be incorporated into the most appropriate functions. A limitation of this method is that it is not always possible to tell which plane of the text (introduction, methodology, conclusions, etc.) an expression is taken from, which explains the percentage of ambiguous results. Of the 50 categories, the first model showed positive results for 31 categories, the second for 41 categories. In the future, we will try to eliminate other causes of noise that we have identified for the models, such as numerals, dates and placeholder names, like Smith and Jones, to see if we can achieve better results. We will also run a new test with the second model using a larger corpus of theses and dissertations.
The final step in our work will be to annotate all occurrences of the model with the best results; recategorize the sentences that have been mismatched; and, finally, transform the sentences into simplified templates to be incorporated into the phrase bank. Phrases containing expressions that already exist in the phrase bank will be used as examples of the template in context. 6. Conclusions In this paper, we proposed a methodology to semi-automatically extend a corpus of categorized academic phraseology using BERT machine learning models. The aim was to enrich the European Portuguese academic phrase bank, which is being developed within the PortLinguE project and will be made available as a tool to support academic literacy. In the future, this extended phrase bank will be available for free online and in PDF format on the digital platform created by the PortLinguE project, called Lang2Science. Furthermore, we intend to embed this phrase bank in a search engine, using technology already developed within the PortLinguE project. The search engine also uses BERT models, so this integration will allow users to search for expressions, similar to what they do when using Google, and obtain similar phraseology or phraseology with similar functions as a result. This offers users a more dynamic way to interact with the phrase bank. 7. Acknowledgements This work was carried out within the scope of the “PortLinguE” project (PTDC / LLT-LIG / 31113/2017) financed by FEDER under Portugal 2020 and by national funds through Fundação para a Ciência e a Tecnologia, I.P. (FCT, I.P.). 8. References [1] J. A. Brandão, Literacia Académica: Da Escola Básica Ao Ensino Superior – Uma Visão Integradora, Letras & Letras (2013) 17. [2] K.M. Jonsmoen, and M. Greek, ‘Lecturers’ text competencies and guidance towards academic literacy’, Educational Action Research, 25(3) (2017) 354–69. [3] A. Wray, Formulaic language and the lexicon. Cambridge University Press, Cambridge, 2002.
[4] C. Pérez-Llantada, Formulaic language in L1 and L2 expert academic writing: Convergent and divergent usage, Journal of English for Academic Purposes, 14 (2014) 84–94. [5] J. Li, and N. Schmitt, The acquisition of lexical phrases in academic writing: a longitudinal case study, Journal of Second Language Writing, 18(2) (2009) 85–10. [6] A. Tutin, and M.-P. Jacques, Le lexique scientifique transdisciplinaire : une introduction, in: M.-P. Jacques and A. Tutin (Eds.), Lexique transversal et formules discursives des sciences humaines, ISTE Editions, 2018, pp. 1-26. [7] A. Tutin, La phraséologie transdisciplinaire des écrits scientifiques : des collocations aux routines sémantico-rhétoriques, in: A. Tutin and F. Grossmann (Eds.), L’écrit scientifique : du lexique au discours. Autour de Scientext, Presses Universitaires de Rennes, 2014, pp. 27-44. [8] J. Morley, A compendium of commonly used phrasal elements in academic English in PDF format, The University of Manchester, 2014. [9] F. Rastier, Arts et sciences du texte, PUF, Paris, 2001. [10] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL (2019). [11] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew, Transformers: State-of-the-Art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45, doi: 10.18653/v1/2020.emnlp-demos.6. [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, in: 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017. [13] N. Adaloglou, How Attention works in Deep Learning: understanding the attention mechanism in sequence models, 2020, URL: https://theaisummer.com/attention/. [14] B.
Lutkevich, BERT language model, 2020, URL: https://www.techtarget.com/searchenterpriseai/definition/BERT-language-model [15] H. Tayyar Madabushi, L. Romain, D. Divjak, and P. Milin, CxGBERT: BERT meets Construction Grammar, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4020–4032. https://doi.org/10.18653/v1/2020.coling-main.355 [16] pdfplumber. URL: https://github.com/jsvine/pdfplumber [17] NLTK. URL: https://www.nltk.org/ [18] haystack. URL: https://github.com/deepset-ai/haystack [19] F. Souza, R. Nogueira, R. Lotufo, BERTimbau: Pretrained BERT Models for Brazilian Portuguese, in: R. Cerri, R. C. Prati (Eds.), Intelligent Systems, BRACIS 2020, Springer, Cham, 2020, https://doi.org/10.1007/978-3-030-61377-8_28. [20] rufimelo/Legal-BERTimbau-sts-base-ma-v2. URL: https://huggingface.co/rufimelo/Legal-BERTimbau-sts-base-ma-v2 [21] rufimelo/Legal-BERTimbau-sts-large-ma-v3. URL: https://huggingface.co/rufimelo/Legal-BERTimbau-sts-large-ma-v3