Assembly Models for SimpleText Task 2: Results from
Wuhan University Research Group
Jianfei Huang1 , Jin Mao2
1
    School of Information Management, Wuhan University, Bayi Rd 299, Wuhan, Hubei, 430072, China
2
    Center for Studies of Information Resources, Wuhan University, Bayi Rd 299, Wuhan, Hubei, 430072, China


                                         Abstract
                                         The goal of SimpleText Task 2 is to sort and rank complex terms that are required to be explained, given
                                         a passage and a query. To this end, our group applied a pipeline of term recognition and complexity
                                         evaluation. Candidate terms are extracted and evaluated according to their similarity with the query
                                         and a few rules. We formulate the evaluation of complexity as a classification task. We compile three
                                         groups of features for terms, including lexical, syntactic, and semantic features, then, ensemble machine
                                         learning models that adopt a soft voting strategy are applied to classify the complexity of the terms.
                                         Results of cross-validation on the training set are reported. Potential further improvements about the
                                         approach in future are discussed as well.

                                         Keywords
                                         term recognition, lexical features, syntactic features, semantic features, text complexity


1. Introduction
SimpleText Task 2 involves identifying what term is unclear and ranking terms that are crucial
for readers to understand scientific text, given a passage and a query. In fact, for ranking terms
that bother readers without prior domain knowledge, we need to know which terms should
be extracted and explained. Further, evaluating term complexity could be a prior step for text
simplification according to Shardlow’s proposed approaches[1], as what to do in SimpleText
Task 3.
   Readers who do not understand the background of news articles often need to start with
some technical terms. A term may consist of one or many words. It could be a strange word, an
uncommon abbreviation, or a phrase. Apparently, a complex term cannot be understood just by
its counts in some specific corpus. Its meaning relies on many features and differs according
to context. To remove such understanding barriers, the goal of SimpleText Task 2 is to decide
which terms need explanation in a passage concerning a query and to rank them by three-level
scores and five-level scores[2]. The task can be divided into two subtasks concerning all the
above factors. One is extracting complex terms from a combination of passage and query. The
other is evaluating complexity by considering valid influencing factors as much as possible.
   In this paper, we extract key phrases and words based on similarity measures and rules, and

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
$ hngdoze@gmail.com (J. Huang); danveno@163.com (J. Mao)
 0000-0003-1125-4754 (J. Huang); 0000-0001-9572-6709 (J. Mao)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
present our submission using two ensemble models to complete the complexity classification
tasks. The former considers a large set of linguistics features, such as lexical features, syntactic
features, and semantic features. The latter has nothing different but adding the prediction result
of the former as a feature. In section 2 we introduce previous relevant works. In section 3 we
present the main points of our feature engineering. In section 4 we show the basic flow of our
model. Finally, in Section 5 we have some discussions.


2. Related works
2.1. Term Recognition
Terminology recognition methods can mainly be divided into traditional algorithms, classical
machine learning, and deep learning models. Different methods have different application
scenarios according to the tasks and data. Robertson explained the term-weighting function
TF-IDF from a theoretical level, which was considered one of the most commonly used baselines
for term recognition in information retrieval models[3]. Some studies have applied PageRank
to keyword extraction and achieved good performance[4]. In addition, some studies focused
on the clustering approach and classic machine learning classifiers, such as the Bayesian and
support vector machine approaches[5]. Further, many recent works turned to the black box of
deep learning, like using the pre-trained models, e.g., BERT. Deep learning approaches have
shown promising results.

2.2. Term Complexity
Terminology complexity is closely related to the study of text complexity. In early studies,
computational measures of text complexity have been restricted to some heuristic readability
formulations, which mainly focus on some shallow features[6]. The shallow features usually
adopt traditional readability metrics by simply counting words and characters[7], such as an
average number of syllables per word, an average number of words per sentence, Automated
Readability Index[8], and the Flesch-Kincaid score[9]. Later, some studies attempted to dig out
deeper and more general features to supplement those shallow features.
   In recent years, adopting machine learning or deep learning methods to complete feature
learning for text complexity has become a trend. Gooding and Kochmar presented CAMB based
on ensemble voting, a system that brings together 27 lexical, morphological, and psycholinguistic
features[10]. Although it achieved state-of-the-art results in the 2018 CWI shared task[11], it
dismissed the context of the target words. In the SemEval-2021 shared task 1[12], most studies
tented to capture extensive information for the target word and its context. Morphosyntax
features and pretraining embedding were applied to obtain better feature representation. The
model that attained the best performance in the above task, used both token and context features
derived from pre-trained models [13]. However, an expanded version of the CAMB system
obtained a similar performance[14]. It ranks third and is less than a percentage point below
the best result on lexical complexity prediction for single words, which showed some feature
engineering-based models can outperform most deep learning-based counterparts. Nonetheless,
combining various features and machine learning models seems to be a consensus in recent
studies.


3. Methodology
3.1. Term Recognition
To get candidate terms, we first extracted keywords and phrases in the passages via Key-
BERT 1 . A few similar algorithms can extract candidate terms, including TF-IDF, Rake, YAKE!.
While,KeyBERT computes the cosine similarity of sub-phrases and passages internally, which is
more in line with the task description. Then, the candidate terms were filtered by calculating
the similarity scores between the terms and the query with PhraseSimilarity2 . And we excluded
those starting with a, an, the, or digit in the candidate terms. We also detected the capitalization
of terms to extract acronyms. The terms obtained include words, compound words, phrases, etc.
We then removed the punctuations and reverted the terms to lowercase except for acronyms.

3.2. Feature Extraction
We designed a few lexical features, syntactic features, and semantic features for the terms.

3.2.1. Lexical Features
These are features based on lexical information about the term:

    • length: Length of the term.
    • zipf frequency3 : To make word frequency norms comparable, Brysbaert Marc et al
      provide the Zipf Scale, which is independent of corpus size[15]. Zipf frequency exactly
      aims to return the term’s frequency on a human-friendly logarithmic scale via that.
    • tf-idf score: We calculated tf-idf score based on PhraseFinder. PhraseFinder is a search
      engine for the Google Books Ngram Dataset (version 2) that features a wildcard-supporting
      query language and outstanding retrieval performance.
    • acronym: Check if all letters are uppercase. Because acronyms are often difficult to
      understand.
    • number of subwords4 , syllables5 , phonemes6 : Morphological awareness is an
      understanding of how words can be broken down into smaller units[16]. The number of
      subwords is expected as a complementary feature to the length of the term and we get it
      via BPEmb, which is trained on Wikipedia and using the Byte-Pair Encoding algorithm.
      Similarly, the other two features are well-represented in speech synthesis and are widely
      incorporated into measures or feature sets in other studies on lexical complexity.
    1
      https://github.com/MaartenGr/KeyBERT
    2
      https://github.com/franplk/PhraseSimilarity
    3
      https://pypi.org/project/wordfreq/
    4
      https://github.com/bheinzerling/bpemb
    5
      https://github.com/Kyubyong/g2p
    6
      https://pypi.org/project/syllables/
3.2.2. Syntactic Features
Complex terms may have some special syntactic roles in the sentences. We coined a few syntactic
features from the syntactic structure of a term’s context. We used stanza7 for part-of-speech
recognition and dependency parsing.
    • depth of the term: It means the distance between the term and the parse tree’s root.
    • number of the dependencies: We count all words that depend on or are depended
      on by the term, as this feature.
    • part-of-speech: We use a 17-dimension one-hot vector to represent it, and each
      dimension represents one kind of part-of-speech tag. Some words have simple meanings,
      but when combined into phrases their meanings are elusive. Prepositional phrases, verb
      phrases, noun phrases, and adjective phrases have subtle differences in our understanding
      of the meaning of phrases. For phrases, what we do is add the vectors together, therefore
      we put both single words and phrases in the 17-dimensional vectors for comparison.

3.2.3. Semantic Features
    • glove embedding8 : We extract 300-dimension embeddings pre-trained on Common
      Crawl. Further, we use the zero vector to fill missing values and reduce the dimensions to
      30 by PCA.
    • fasttext embedding9 : Fasttext embedding is considered as an alternative semantic
      feature. The dimensions are reduced to 30 by PCA as well.

3.3. Model Design
We formulated the complexity evaluation of terms as two classification tasks with 3 classes and
5 classes respectively. For the former, we concatenate all features and get 86 dimensional vector
as the input vector. We put the predicted label of the three-classification model and all features
together for the latter. Considering a large number of features and the small training set, we
trained a few state-of-the-art base models, including LightGBM, CatBoost, XGBoost, Random
Forest, Support Vector Machine, and then assembled these models using a soft voting strategy. On
the one hand, the ensemble model consists of multiple classifiers, which improves the accuracy
of the classification task. On the other hand, ensemble models reduce the occurrence of special
cases, such as predicting difficult terms into simpler ones. Figure 1 gives an overview of the
model design. Hyperparameter settings either use grid search or follow default values.


4. Results
The terms provided in the training samples are not independent, in other words, a term can
correspond to multiple passages. We deduplicated records and obtained 250 independent
sentence-term pairs as the final dataset. Then, we performed five-fold cross-validation on the
   7
     https://stanfordnlp.github.io/stanza/
   8
     https://nlp.stanford.edu/projects/glove/
   9
     https://fasttext.cc/
Figure 1: Overall model design: term complexity assessment at simpletext task 2.


dataset. According to the model design, we first verified the models for the three-class task,
and the results are shown in Table 1. The star represents our proposed integrated model. It is
shown that the integrated model is superior to the base models in terms of accuracy and AUC.

Table 1
Cross-validation results of the three-class task.
                Three-classification      Accuracy         F1 Score           AUC
                       Model            mean      std   mean      std    mean     std
               * (Integrated Model) 0.684 0.062         0.583    0.093   0.635 0.059
               LightBGM                 0.652 0.063     0.586    0.089   0.624   0.062
               * - LightBGM             0.660 0.083     0.565    0.089   0.607   0.069
               CatBoost                 0.636 0.069     0.551    0.064   0.615   0.058
               * - CatBoost             0.672 0.079     0.583    0.093   0.611   0.069
               XGBoost                  0.656 0.093     0.593 0.112      0.591   0.061
               * - XGBoost              0.668 0.084     0.576    0.096   0.631   0.064
               RandomForest             0.656 0.097     0.556    0.080   0.590   0.098
               * - RandomForest         0.664 0.066     0.581    0.090   0.626   0.055
               SVM                      0.672 0.079     0.557    0.080   0.576   0.098
               * - SVM                  0.660 0.072     0.582    0.101   0.621   0.062


   Intuitively, five grading scales are more difficult, which require a more precise assessment
of complexity. We take the prediction results of the three-class models as the extended input
feature, which can improve the performance. We also obtained the accuracy, F1 score, and AUC
value for the ensemble models of the five-class task, as shown in Table 2.
   The accuracy metrics of the two ensemble models we designed outperform the other base
models. On F1 scores and AUC metrics, they also achieved almost the best performance in the
experiment. Furthermore, according to the subset of the test set consisting of 592 sentences
manually annotated, our submissions are ranked second(2/4) on the scale 1-3 and first(1/4) on
the scale 1-5, based on the proportion of successful matches of all participants. In the subset
consisting of 167 common sentences, we ranked second in both tasks. [2]
Table 2
Cross-validation results of the five-classification model.
                 Five-classification        Accuracy         F1 Score         AUC
                       Model             mean       std    mean     std   mean    std
               * (Integrated Model) 0.464 0.064 0.445 0.068               0.670 0.025
               LightBGM                  0.448 0.060 0.414 0.076          0.653 0.023
               * - LightBGM              0.428 0.072 0.375 0.085          0.656 0.026
               CatBoost                  0.428 0.060 0.376 0.084          0.668 0.028
               * - CatBoost              0.440 0.067 0.389 0.090          0.667 0.027
               XGBoost                   0.448 0.057 0.415 0.081          0.657 0.029
               * - XGBoost               0.432 0.053 0.389 0.086          0.663 0.028
               RandomForest              0.460 0.052 0.402 0.078          0.654 0.047
               * - RandomForest          0.428 0.060 0.377 0.086          0.673 0.023
               SVM                       0.440 0.057 0.351 0.061          0.581 0.027
               * - SVM                   0.432 0.059 0.399 0.081          0.662 0.026


   However, the evaluation results of all participating teams performed poorly. One reason for
this could be that the term extraction process is not proper. Many terms are manually annotated
as requiring no explanation during the evaluation process and assigned a new difficulty score
of 0, whereas they are assigned a difficulty score of 1 in our submissions, implying that they
belonged to the easiest terms. Admittedly, the values of all these metrics are not high, indicating
that the tasks of identifying terms and predicting term complexity are difficult.


5. Discussion
In this paper, we applied a pipeline for the term complexity prediction tasks, which consists of
term recognition, feature extraction, training models, and assembling models. The ensemble
models show improved performance than the base models.
   As a preliminary study, a few limitations have been identified, which could guide our future
refinement for our approach. The pre-trained embedding we choose is trained on Common Crawl,
which is from the public domain. There can be pre-trained word embeddings for technology
and medical fields, as are the domains covered by the task corpus. Thus, one work direction is
to fine-tune a pre-trained model based on transformer architecture on a specific corpus of the
target domain and to extract the learned embeddings as a complement to semantic features.
Furthermore, our method takes into account some insignificant features, and there may be some
important features that have not been identified. Evaluating the importance of features and
emphasizing significant features in the learning models could further improve the approach.
References
 [1] M. Shardlow, A survey of automated text simplification, International Journal of Advanced
     Computer Science and Applications 4 (2014) 58–70.
 [2] L. Ermakova, P. Bellot, J. Kamps, D. Nurbakova, I. Ovchinnikova, E. SanJuan, E. Mathurin,
     R. Hannachi, S. Huet, S. Araujo, Overview of the CLEF 2022 SimpleText Lab: Automatic
     Simplification of Scientific Texts, Experimental IR Meets Multilinguality, Multimodality,
     and Interaction. Proceedings of the Thirteenth International Conference of the CLEF
     Association (CLEF 2022) 13390 (2022).
 [3] S. Robertson, Understanding inverse document frequency: on theoretical arguments for
     idf, Journal of documentation (2004).
 [4] J. Wang, J. Liu, C. Wang, Keyword extraction based on pagerank, in: Pacific-Asia Confer-
     ence on Knowledge Discovery and Data Mining, Springer, 2007, pp. 857–864.
 [5] D. Isa, L. H. Lee, V. Kallimani, R. Rajkumar, Text document preprocessing with the
     bayes formula for classification using the support vector machine, IEEE Transactions on
     Knowledge and Data engineering 20 (2008) 1264–1272.
 [6] D. S. McNamara, Y. Ozuru, A. C. Graesser, M. Louwerse, Validating coh-metrix, in:
     Proceedings of the 28th annual conference of the cognitive science society, 2006, pp.
     573–578.
 [7] S. Jönsson, E. Rennes, J. Falkenjack, A. Jönsson, A component based approach to measuring
     text complexity, in: The Seventh Swedish Language Technology Conference (SLTC-18),
     Stockholm, Sweden, 7-9 November 2018, 2018.
 [8] R. Senter, E. A. Smith, Automated readability index, Technical Report, Cincinnati Univ OH,
     1967.
 [9] G. R. Klare, Assessing readability, Reading research quarterly (1974) 62–102.
[10] S. Gooding, E. Kochmar, CAMB at CWI shared task 2018: Complex word identification with
     ensemble-based voting, in: Proceedings of the Thirteenth Workshop on Innovative Use of
     NLP for Building Educational Applications, Association for Computational Linguistics,
     New Orleans, Louisiana, 2018, pp. 184–194.
[11] S. M. Yimam, C. Biemann, S. Malmasi, G. H. Paetzold, L. Specia, S. Štajner, A. Tack,
     M. Zampieri, A report on the complex word identification shared task 2018, arXiv preprint
     arXiv:1804.09132 (2018).
[12] M. Shardlow, R. Evans, G. H. Paetzold, M. Zampieri, Semeval-2021 task 1: Lexical com-
     plexity prediction, arXiv preprint arXiv:2106.00473 (2021).
[13] C. Pan, B. Song, S. Wang, Z. Luo, DeepBlueAI at SemEval-2021 task 1: Lexical complexity
     prediction with a deep ensemble approach, in: Proceedings of the 15th International Work-
     shop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics,
     Online, 2021, pp. 578–584.
[14] A. Mosquera, Alejandro mosquera at semeval-2021 task 1: Exploring sentence and word
     features for lexical complexity prediction, in: Proceedings of the 15th International
     Workshop on Semantic Evaluation (SemEval-2021), 2021, pp. 554–559.
[15] M. Brysbaert, E. Keuleers, M. Stevens, L. Van der Haegen, A. Verma, M. Callens, W. Tops,
     V. Khare, P. Mandera, H. Vander Beken, et al., The zipf-scale: A better standardized
     measure of word frequency, Update (2013).
[16] S. H. Deacon, M. J. Kieffer, A. Laroche, The relation between morphological awareness
     and reading comprehension: Evidence from mediation and longitudinal models, Scientific
     Studies of Reading 18 (2014) 432–451.