<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-Task Learning for German Text Readability Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salar Mohtaj</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vera Schmitt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Razieh Khamsehashari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Möller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Centre for Artificial Intelligence (DFKI), Labor Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automated text readability assessment is the process of automatically assigning a number to the level of difficulty of a piece of text. Machine learning and natural language processing techniques have made it possible to measure the readability and complexity of the fast-growing textual content on the web. In this paper, we propose a multi-task learning approach to predict the readability of German text based on pre-trained models. The proposed multi-task model has been trained on three tasks: text complexity, understandability, and lexical difficulty assessment. The results show a significant improvement in the model's performance in the multi-task learning setting compared to single-task learning, where each model has been trained separately for each task.</p>
      </abstract>
      <kwd-group>
        <kwd>Text readability assessment</kwd>
        <kwd>Multi-task learning</kwd>
        <kwd>Transfer learning</kwd>
        <kwd>Text complexity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CLiC-it 2023: 9th Italian Conference on Computational Linguistics, Nov 30 – Dec 02, 2023, Venice, Italy</p>
      <p>* Corresponding author.</p>
      <p>salar.mohtaj@tu-berlin.de (S. Mohtaj); vera.schmitt@tu-berlin.de (V. Schmitt); razieh.khamsehashari@tu-berlin.de (R. Khamsehashari); sebastian.moeller@tu-berlin.de (S. Möller)</p>
      <p>ORCID: 0000-0002-0032-3833 (S. Mohtaj); 0000-0002-9735-6956 (V. Schmitt); 0000-0003-3057-0760 (S. Möller)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
    </sec>
    <sec id="sec-3">
      <title>3. Data Set</title>
      <p>In this section, we review some of the recent eforts in us- In this section, we describe the data set that has been
ing NLP and machine learning models for the evaluation used to train and test the proposed models in this paper.
of text readability. We used TextComplexityDE1 data set [8] to train the</p>
      <p>
        A supervised model for German text readability as- proposed model and also to test it against single-task
sessment is proposed in [9]. They have extracted more learning approaches. In this section, we briefly describe
than 70 features grouped in traditional, lexical, and the data set, especially the available readability scores in
morphological-based features to train text regression the data that make it possible to train multi-task learning
models. They have selected the top 20 features for the models.
training phase based on diferent criteria, such as the low As thoroughly explained in [8], TextComplexityDE data
ratio of missing values and also low correlation between set contains 1,000 sentences in the German language
features. The obtained results show that the Random For- taken from 23 Wikipedia articles from three diferent
topest model could outperform Linear Regression and Poly- ics. The sentences were annotated by German learners
nomial Regression models with respect to the Root Mean in levels A and B who were 32 years old on average and
Squared Error (RMSE) metric. They improved the results mostly held a university degree. Each sentence is mapped
on the same data set by fine-tuning pre-trained language to the Mean Opinion Score (MOS) of three diferent
readmodels in [10]. They used pre-trained models in fea- ability metrics, namely complexity, understandability,
ture extraction and fine-tuning settings and came to the and lexical dificulty. All the sentences have been rated
conclusion that the fine-tuning approach could outper- by multiple annotators on a 7-point Likert scale. The
form the feature extraction as well as classical machine complexity shows how complex a sentence for an
annolearning models. tator was in the range of very easy (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) to very complex (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ).
      </p>
      <p>A sentence-wise readability assessment model for Ger- The understandability metric shows how well the
particiman L2 readers is introduced in [11]. They extracted pants were able to understand a sentence, and the lexical
373 features from diferent types (e.g., syntax) to train dificulty presents the dificulty of the most dificult word
machine learning models for the regression and rank- in a sentence.
ing tasks. The Bayesian Ridge Regression model out- This data set has been used as the training set in the
performs the widely used readability formulae in the Text Complexity Challenge on German Text in 2022. In
regression task in their experiments. They also analyzed order to train and also evaluate the single- and multi-task
the complexity at the document level and found that the learning models in this paper, we split the data set into
maximum complexity in the sentence level impacts the the train, validation, and test parts (60%, 20%, and 20%,
document complexity. respectively).</p>
      <p>A hybrid model combining a feature engineering ap- Figure 1 shows the distribution of MOS values over
proach and transfer learning for German text complexity the training and test data sets for the three metrics. As
assessment is proposed in [12]. They have extracted word presented in the figure, there are more easy instances in
level and sentence level features from text and ensem- the data set than complex ones.
ble it with transformer-based models like Bert [13] and Table 1 provides a summary of statistics and frequency
RoBERTa [14]. The proposed model achieved the first distribution of the training and test data sets. As
deranking in the Text Complexity DE Challenge 2022 [15]. scribed in the table, the training and test sets follow a</p>
      <p>An online service for assessing the readability of Ger- similar distribution from the textual content and
readman text based on machine learning models is presented ability scores point of view.
in [16]. The authors provided the model as an online
service that is publicly available to use. The online
service provides five statistical metrics and two machine 4. Multi-task Learning Model
learning models for an input text. The machine
learning models are based on the BERT and the fine-tuned In this section, we present our model based on a
multiBERT. They achieved promising results on two diferent task learning approach to predict the complexity score
data sets based on Mean Square Error (MSE) and Mean of textual input and the understandability and lexical
Absolute Error (MAE) metrics [16]. dificulty scores. We use pre-trained language models to</p>
      <p>To the best of our knowledge, there is no text read- extract features from the input text and feed the extracted
ability prediction model for German text based on MTL features into a Recurrent Neural Network (RNN) as the
approaches. The proposed model uses the benefits of pre- initial hidden state.
trained language models as well as a multi-task learning Due to the fact that MTL can learn features that
genapproach where features that form good predictors for eralize better across tasks and considering the relation
multiple tasks are favoured over those that don’t.
0.30
0.25
The2distribution o3f MOS Compl4exity in train5and test data6sets
7
0.00 1 The dist2ribution of M3OS Understan4dability in tra5in and test d6ata sets 7
0.00 1 The dis2tribution of M3OS Lexical D4if iculty in tra5in and test d6ata sets 7
(a)
(b)
(c)
between three readability scores in the TextComplexityDE • Learning rate: 0.001, 0.0005, 0.0001
data set, we propose a joint model for the task. Consider- • Batch size: 32, 64
ing the similarity between the three tasks and in order • Dropout probability: 0.3, 0.4, 0.5
to enable knowledge sharing among tasks, we used a • Size of the hidden layer: 64, 128, 256
parallel architecture (i.e., tree-like architecture) [17] in
this work. Moreover, we trained all the models in 50 epochs and</p>
      <p>We use the German BERT model [18] (i.e., bert-base- set the early stopping patience to 10 checkpoints to
pregerman-cased) in a feature extraction setting where the vent over-fitting. In other words, the training has stopped
input text is fed into the model to convert textual input in case of no improvement in ten continuous epochs. The
into vectors. The model includes a shared layers part that model has 110,125,315 parameters in total and 1,043,971
is shared between three regression models (i.e., complex- trainable parameters since the parameters from the
preity, understandability, and lexical dificulty prediction) trained model are frozen and didn’t change during the
and a unique task-specific layer for each task. The overall training phase.
architecture of the model is depicted in Figure 2 (a). Regarding the loss weighting strategy, we used the</p>
      <p>As presented in Figure 2, the output of the BERT model "optimizing worst-case task loss" strategy, in which the
is fed into a two layers Bi-GRU model [19]. As an RNN worst-performing task has been chosen in each step as
model GRUs can handle sequence input very well and the optimization target. The importance of worst-case
showed promising results in text readability prediction task loss compared to the vanilla average task loss when
in the previous studies [20]. A fully connected layer is training an MTL model is analyzed in [21]. The achieved
on top as the last layer of the shared layers. results on the test data set are presented in the next</p>
      <sec id="sec-3-1">
        <title>The task-specific layer includes a separated, fully con- section.</title>
        <p>nected layer that is connected to the task-specific output
layer. The following hyper-parameters are tested during 5. Evaluation and Results
the training phase in order to find the best
configuration for this task. The best-performing parameters are
highlighted.</p>
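To make the parallel architecture concrete, here is a minimal NumPy sketch of a shared trunk feeding three task-specific regression heads. The dimensions (768-dim embeddings, 256 and 128 hidden units) follow the paper's setup, but the single shared matrix, ReLU activations, and random weights are illustrative stand-ins for the trained BERT + Bi-GRU stack:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen 768-dim BERT sentence embeddings feed shared layers;
# each readability score gets its own task-specific head.
EMB_DIM, SHARED_DIM, HEAD_DIM = 768, 256, 128
TASKS = ("complexity", "understandability", "lexical_difficulty")

def linear(in_dim, out_dim):
    # Random weights stand in for trained fully connected layers.
    return rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, out_dim))

# Shared trunk (the paper uses a two-layer Bi-GRU plus a fully
# connected layer here; a single matrix keeps the sketch short).
W_shared = linear(EMB_DIM, SHARED_DIM)
# One task-specific head per readability score.
heads = {t: (linear(SHARED_DIM, HEAD_DIM), linear(HEAD_DIM, 1)) for t in TASKS}

def predict(embedding):
    """Map one sentence embedding to all three readability scores at once."""
    h = np.maximum(embedding @ W_shared, 0.0)  # shared representation
    return {t: (np.maximum(h @ w1, 0.0) @ w2).item() for t, (w1, w2) in heads.items()}

sentence_emb = rng.normal(size=EMB_DIM)  # stand-in for a BERT embedding
scores = predict(sentence_emb)
```

Because the trunk is shared, a gradient step on any one task's loss updates the representation used by all three tasks, which is what enables knowledge sharing in the parallel architecture.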
        <sec id="sec-3-1-1">
          <title>In this section, we briefly describe the evaluation metric used to measure the performance of the proposed model</title>
          <p>Concatenation
128
128
128
128
s
r
e
y
La Concatenation
d
e
r
a
h</p>
          <p>S
768
768
768
768
Complexity</p>
          <p>Understandability
Ful y connected 16248</p>
          <p>Ful y connected</p>
          <p>Ful y connected
Ful y connected 128</p>
          <p>256
Drop out
256
BERT embedding</p>
          <p>Complexity/
Understandability/
Lexical Difficulty
Fully connected 128</p>
          <p>256
Drop out
256</p>
          <p>4*128
Bi-GRU (2 layers)
BERT embedding
(b)
and the obtained results from the MTL model as well as
a single-task learning model as the baseline.
5.1. Evaluation Metric
The Root Mean Square Error (RMSE) metric is used to
evaluate the models’ performance. It measures the root
of the average squared diference between the estimated
values (e.g., complexity scores) and the actual value. It is
a common metric for regression analysis including text
readability assessment.</p>
          <p>=
√︃ ∑︀
=1 ( − )2</p>
          <p>
            ̂︀

(
            <xref ref-type="bibr" rid="ref1">1</xref>
            )
where  is ith actual value,  is the ith predicted value
̂︀
and  is the number of data points.
5.2. Results
          </p>
        </sec>
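The RMSE computation in Equation (1) can be sketched in a few lines of Python; the MOS values below are hypothetical:

```python
import math

def rmse(actual, predicted):
    """Root of the mean squared difference between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n)

# Hypothetical complexity scores on the 7-point MOS scale.
mos_true = [2.0, 4.5, 3.0, 6.0]
mos_pred = [2.5, 4.0, 3.5, 5.0]
print(round(rmse(mos_true, mos_pred), 4))  # 0.6614
```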
        <sec id="sec-3-1-2">
          <title>We evaluated the performance of the proposed MTL</title>
          <p>model on the test set of the data. We compared the
obtained results in the MTL setting with the single-task
learning setting as the baseline. The overall performance
of the single-task and multi-task learning modules are
presented in Table 2.</p>
          <p>We used a similar architecture for the single-task
learning model. The single-task learning model includes the
same embedding layer (i.e., the German BERT model) and
the same 2-layers Bi-GRU layers on top. In this model,
the output of the fully connected layer is fed directly to
the output layer as depicted in Figure 2 (b). The
singletask learning model has 1,019,137 trainable parameters
(compared to 1,043,971 trainable parameters of the MTL
model). We used the same model to train the text
regression to predict text complexity, understandability, and
lexical dificulty scores, separately.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>As presented in the table, the MTL setting significantly</title>
          <p>outperforms the single-learning model in all three tasks.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>Moreover, the average error of the three tasks (0.7945) is</title>
          <p>much lower in the MTL model compared to the situation
where each model is trained separately (0.8379).</p>
          <p>It also should be noted that the number of trainable
parameters is almost the same in both models (∼ 0.025%
more parameters in the MTL model). In contrast, the
single-task learning model undergoes three separate
training sessions, one for each task. So, in addition to
achieving a better performance in predicting German
text readability, the MTL model also demonstrates higher
computational eficiency.</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>The obtained results from the MTL setting highlight</title>
          <p>the importance of the prediction of text readability score
from diferent perspectives. In other words, the results
show that the performance of a text complexity predictor
could be improved by introducing other related metrics
Task
Complexity
Understandability
Lexical dificulty
Average
such as understandability and lexical dificulty to the
model.</p>
        </sec>
      </sec>
    </sec>
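The "optimizing worst-case task loss" strategy used during training can be sketched in a few lines; the per-task loss values below are hypothetical:

```python
def worst_case_step_loss(task_losses):
    """Return the task to optimize this step: the currently worst-performing one."""
    worst_task = max(task_losses, key=task_losses.get)
    return worst_task, task_losses[worst_task]

# Hypothetical per-task losses at one training step.
step_losses = {"complexity": 0.82, "understandability": 0.74, "lexical_difficulty": 0.91}
target_task, target_loss = worst_case_step_loss(step_losses)
# The vanilla MTL baseline would instead optimize the average task loss:
average_loss = sum(step_losses.values()) / len(step_losses)
```

Optimizing the maximum rather than the average keeps any single task from being sacrificed for the others, which is the trade-off analyzed in [21].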
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>In this paper, we proposed a model based on a multi-task learning approach for the task of text readability assessment in German text. The model is trained and tested on the TextComplexityDE data set. It is simultaneously trained on three different readability scores, namely complexity, understandability, and lexical difficulty. Our results showed that the MTL model outperforms the common single-task learning models in all three scores. The obtained results in this experiment reveal the importance of annotating text readability from different perspectives.</p>
      <p>As a direction for future studies, different multi-task learning architectures (e.g., hierarchical architectures) could be tested on the task. Moreover, in this study we exclusively tested the BERT model to extract features from the input text; exploring and assessing the impact and performance of other pre-trained models is a question for future work. Finally, the performance of fine-tuning approaches to transfer learning can be compared to the feature extraction approach in future studies.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>The present study was funded by the Deutsche</title>
      </sec>
      <sec id="sec-5-2">
        <title>Forschungsgemeinschaft (DFG) through the project “Analyse und automatische Abschätzung der Qualität maschinell generierter Texte”, project number 436813723.</title>
        <p>jective assessment of text complexity: A dataset [16] F. Pickelmann, M. Färber, A. Jatowt,
Ablesfor german language, CoRR abs/1904.07733 (2019). barkeitsmesser: A system for assessing the
readarXiv:1904.07733. ability of german text, in: Advances in Information
[9] B. Naderi, S. Mohtaj, K. Karan, S. Möller, Auto- Retrieval - 45th European Conference on
Informamated text readability assessment for german lan- tion Retrieval, ECIR 2023, Dublin, Ireland, April 2-6,
guage: A quality of experience approach, in: 11th 2023, Proceedings, Part III, volume 13982 of
LecInternational Conference on Quality of Multimedia ture Notes in Computer Science, Springer, 2023, pp.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Experience QoMEX 2019, Berlin, Germany, June 5- 288–293.</title>
        <p>7, 2019, IEEE, 2019, pp. 1–3. doi:10.1109/QoMEX. [17] S. Chen, Y. Zhang, Q. Yang, Multi-task learning in
2019.8743194. natural language processing: An overview, CoRR
[10] S. Mohtaj, B. Naderi, S. Möller, F. Maschhur, abs/2109.09138 (2021). URL: https://arxiv.org/abs/
C. Wu, M. Reinhard, A transfer learning based 2109.09138. arXiv:2109.09138.
model for text readability assessment in ger- [18] B. Chan, S. Schweter, T. Möller, German’s next
man, CoRR abs/2207.06265 (2022). doi:10.48550/ language model, in: D. Scott, N. Bel, C. Zong
arXiv.2207.06265. arXiv:2207.06265. (Eds.), Proceedings of the 28th International
Confer[11] Z. Weiss, D. Meurers, Assessing sentence read- ence on Computational Linguistics, COLING 2020,
ability for German language learners with broad Barcelona, Spain (Online), December 8-13, 2020,
Inlinguistic modeling or readability formulas: When ternational Committee on Computational
Linguisdo linguistic insights make a diference?, in: Pro- tics, 2020, pp. 6788–6796.
ceedings of the 17th Workshop on Innovative Use [19] K. Cho, B. van Merriënboer, D. Bahdanau, Y. Bengio,
of NLP for Building Educational Applications (BEA On the properties of neural machine translation:
2022), Association for Computational Linguistics, Encoder–decoder approaches, in: Proceedings of
Seattle, Washington, 2022, pp. 141–153. URL: https: SSST-8, Eighth Workshop on Syntax, Semantics and
//aclanthology.org/2022.bea-1.19. doi:10.18653/ Structure in Statistical Translation, Association for
v1/2022.bea-1.19. Computational Linguistics, Doha, Qatar, 2014, pp.
[12] A. Mosquera, Tackling data drift with adversarial 103–111.</p>
        <p>validation: An application for German text complex- [20] Y. Sun, K. Chen, L. Sun, C. Hu, Attention-based
ity estimation, in: Proceedings of the GermEval deep learning model for text readability evaluation,
2022 Workshop on Text Complexity Assessment of in: 2020 International Joint Conference on Neural
German Text, Association for Computational Lin- Networks, IJCNN 2020, Glasgow, United Kingdom,
guistics, Potsdam, Germany, 2022, pp. 39–44. July 19-24, 2020, IEEE, 2020, pp. 1–8.
[13] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: [21] P. Michel, S. Ruder, D. Yogatama, Balancing
averpre-training of deep bidirectional transformers for age and worst-case accuracy in multitask learning,
language understanding, in: J. Burstein, C. Do- CoRR abs/2110.05838 (2021). arXiv:2110.05838.
ran, T. Solorio (Eds.), Proceedings of the 2019
Conference of the North American Chapter of the
Association for Computational Linguistics: Human</p>
      </sec>
      <sec id="sec-5-4">
        <title>Language Technologies, NAACL-HLT 2019, Min</title>
        <p>neapolis, MN, USA, June 2-7, 2019, Volume 1 (Long
and Short Papers), Association for Computational
Linguistics, 2019, pp. 4171–4186.
[14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen,
O. Levy, M. Lewis, L. Zettlemoyer, V.
Stoyanov, Roberta: A robustly optimized BERT
pretraining approach, CoRR abs/1907.11692 (2019).
arXiv:1907.11692.
[15] S. Mohtaj, B. Naderi, S. Möller, Overview of the
germeval 2022 shared task on text complexity
assessment of german text, in: S. Möller, S.
Mohtaj, B. Naderi (Eds.), Proceedings of the GermEval
2022 Workshop on Text Complexity Assessment
of German Text, GermEval@KONVENS 2022,
Potsdam, Germany, September 12, 2022, Association
for Computational Linguistics, 2022, pp. 1–9. URL:
https://aclanthology.org/2022.germeval-1.1.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Xia</surname>
          </string-name>
          , E. Kochmar, T. Briscoe,
          <article-title>Text readability assessment for second language learners</article-title>
          , in: J.
          <string-name>
            <surname>R. Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Leacock</surname>
          </string-name>
          , H. Yannakoudakis (Eds.),
          <source>Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , BEA@NAACL-HLT
          <year>2016</year>
          , June 16,
          <year>2016</year>
          , San Diego, California, USA, The Association for Computer Linguistics,
          <year>2016</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Aluisio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gasperin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <article-title>Readability assessment for text simplification</article-title>
          ,
          <source>in: Proceedings of the NAACL HLT 2010 fifth workshop on innovative use of NLP for building educational applications</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chatzipanagiotidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giagkou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meurers</surname>
          </string-name>
          ,
          <article-title>Broad linguistic complexity analysis for greek readability classification</article-title>
          ,
          <source>in: Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , BEA@EACL, Online, April
          <volume>20</volume>
          ,
          <year>2021</year>
          , Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>58</lpage>
          . URL: https://www.aclweb. org/anthology/2021.bea-
          <volume>1</volume>
          .5/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Blaneck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bornheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Grieger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bialonski</surname>
          </string-name>
          ,
          <article-title>Automatic readability assessment of german sentences with transformer ensembles</article-title>
          ,
          <source>in: Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text</source>
          ,
          <source>GermEval@KONVENS</source>
          <year>2022</year>
          , Potsdam, Germany,
          <year>September 12</year>
          ,
          <year>2022</year>
          , Association for Computational Linguistics,
          <year>2022</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>62</lpage>
          . URL: https:// aclanthology.org/
          <year>2022</year>
          .germeval-
          <volume>1</volume>
          .
          <fpage>10</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <article-title>A comparative study of using pre-trained language models for toxic comment classification</article-title>
          , in: J.
          <string-name>
            <surname>Leskovec</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Grobelnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Najork</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Tang</surname>
          </string-name>
          , L. Zia (Eds.),
          <source>Companion of The Web Conference</source>
          <year>2021</year>
          , Virtual Event / Ljubljana, Slovenia,
          <source>April 19-23</source>
          ,
          <year>2021</year>
          , ACM / IW3C2,
          <year>2021</year>
          , pp.
          <fpage>500</fpage>
          -
          <lpage>507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohtaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <article-title>On the importance of word embedding in automated harmful information detection</article-title>
          , in: P.
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Horák</surname>
            , I. Kopecek,
            <given-names>K.</given-names>
          </string-name>
          Pala (Eds.), Text, Speech, and Dialogue - 25th International Conference, TSD 2022, Brno,
          <source>Czech Republic, September 6-9</source>
          ,
          <year>2022</year>
          , Proceedings, volume
          <volume>13502</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2022</year>
          , pp.
          <fpage>251</fpage>
          -
          <lpage>262</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <article-title>An overview of multi-task learning in deep neural networks</article-title>
          ,
          <source>CoRR abs/1706</source>
          .05098 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1706.05098. arXiv:
          <volume>1706</volume>
          .
          <fpage>05098</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Naderi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohtaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ensikat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Möller</surname>
          </string-name>
          , Subjective assessment of text complexity: A dataset for german language, CoRR abs/1904.07733 (2019). arXiv:1904.07733.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] B. Naderi, S. Mohtaj, K. Karan, S. Möller, Automated text readability assessment for german language: A quality of experience approach, in: 11th International Conference on Quality of Multimedia Experience, QoMEX 2019, Berlin, Germany, June 5-7, 2019, IEEE, 2019, pp. 1-3. doi:10.1109/QoMEX.2019.8743194.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] S. Mohtaj, B. Naderi, S. Möller, F. Maschhur, C. Wu, M. Reinhard, A transfer learning based model for text readability assessment in german, CoRR abs/2207.06265 (2022). doi:10.48550/arXiv.2207.06265. arXiv:2207.06265.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Z. Weiss, D. Meurers, Assessing sentence readability for German language learners with broad linguistic modeling or readability formulas: When do linguistic insights make a difference?, in: Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), Association for Computational Linguistics, Seattle, Washington, 2022, pp. 141-153. URL: https://aclanthology.org/2022.bea-1.19. doi:10.18653/v1/2022.bea-1.19.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Mosquera, Tackling data drift with adversarial validation: An application for German text complexity estimation, in: Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text, Association for Computational Linguistics, Potsdam, Germany, 2022, pp. 39-44.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171-4186.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Mohtaj, B. Naderi, S. Möller, Overview of the germeval 2022 shared task on text complexity assessment of german text, in: S. Möller, S. Mohtaj, B. Naderi (Eds.), Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text, GermEval@KONVENS 2022, Potsdam, Germany, September 12, 2022, Association for Computational Linguistics, 2022, pp. 1-9. URL: https://aclanthology.org/2022.germeval-1.1.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] F. Pickelmann, M. Färber, A. Jatowt, Ablesbarkeitsmesser: A system for assessing the readability of german text, in: Advances in Information Retrieval - 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part III, volume 13982 of Lecture Notes in Computer Science, Springer, 2023, pp. 288-293.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Chen, Y. Zhang, Q. Yang, Multi-task learning in natural language processing: An overview, CoRR abs/2109.09138 (2021). URL: https://arxiv.org/abs/2109.09138. arXiv:2109.09138.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] B. Chan, S. Schweter, T. Möller, German's next language model, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, International Committee on Computational Linguistics, 2020, pp. 6788-6796.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] K. Cho, B. van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, in: Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Association for Computational Linguistics, Doha, Qatar, 2014, pp. 103-111.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Y. Sun, K. Chen, L. Sun, C. Hu, Attention-based deep learning model for text readability evaluation, in: 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020, IEEE, 2020, pp. 1-8.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] P. Michel, S. Ruder, D. Yogatama, Balancing average and worst-case accuracy in multitask learning, CoRR abs/2110.05838 (2021). arXiv:2110.05838.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>