Importance of Data and Controllability in Neural Text Simplification

Wei Xu
Georgia Institute of Technology, Atlanta, GA, U.S.A.
wei.xu@cc.gatech.edu, https://cocoxu.github.io/

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract

Natural language generation has become one of the fastest-growing areas in NLP and a popular playground for studying deep learning techniques. Many variants of sequence-to-sequence models with complicated components have been developed. Yet, as I will demonstrate in this talk, creating high-quality training data and injecting linguistic knowledge can lead to significant performance improvements that overshadow the gains from many of these model variants. I will present two recent works from my group on text simplification, a task that requires both lexical and syntactic paraphrasing to improve text accessibility: 1) a neural conditional random field (CRF) based semantic model [1, 2] to create parallel training data [3]; and 2) a controllable text generation approach [4] that incorporates syntax through pairwise ranking and data augmentation.

In the first work, we show that the success of a text simplification system depends heavily on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parallel documents but also uses a neural sentence-pair model to capture semantic similarity. Experiments demonstrate that our approach outperforms all previous work on monolingual sentence alignment tasks by more than 5 points in F1. We apply our CRF aligner to construct two new text simplification datasets, Newsela-Auto and Wiki-Auto, which are much larger and of higher quality than the existing datasets. A Transformer-based seq2seq model trained on our datasets outperforms other state-of-the-art approaches to text simplification.

In the second work, we explore how text simplification improves the readability of sentences through several rewriting transformations, such as lexical paraphrasing, deletion, and splitting. Current simplification systems are predominantly sequence-to-sequence models trained end-to-end to perform all of these operations simultaneously. However, such systems limit themselves to mostly deleting words and cannot easily adapt to the requirements of different target audiences. We propose a novel hybrid approach that leverages linguistically motivated rules for splitting and deletion, and couples them with a neural paraphrasing model to produce varied rewriting styles. We also introduce a new data augmentation method to improve the paraphrasing capability of our model. Through automatic and manual evaluations, we show that our proposed model establishes a new state of the art for the task, paraphrases more often than existing systems, and can control the degree of each simplification operation applied to the input texts.

Throughout the talk, I will also briefly cover some of our work on evaluation metrics [5], lexical simplification [6], and document-level simplification [7].
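To make the alignment component concrete, below is a minimal sketch of monotonic sentence alignment with Viterbi decoding, in the spirit of the neural CRF aligner of [1]. The actual model scores sentence pairs with a fine-tuned neural sentence-pair encoder; the `similarity` function here is only a word-overlap stand-in, and `null_score` and `jump_penalty` are illustrative constants rather than values from the paper.

```python
def similarity(simple_sent, complex_sent):
    """Emission-score stand-in; [1] uses a neural sentence-pair model instead."""
    a = set(simple_sent.lower().split())
    b = set(complex_sent.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def align(simple_sents, complex_sents, null_score=0.15, jump_penalty=0.1):
    """Viterbi decoding over alignment states: state j means the current
    simple sentence aligns to complex sentence j (or -1 = not aligned).
    Transition scores penalize non-monotonic jumps, capturing the
    sequential nature of sentences in parallel documents."""
    states = list(range(len(complex_sents))) + [-1]

    def emit(i, j):
        return null_score if j == -1 else similarity(simple_sents[i], complex_sents[j])

    def trans(prev, cur):
        return 0.0 if -1 in (prev, cur) else -jump_penalty * abs(cur - prev - 1)

    # dp maps each state to (best score so far, best alignment path).
    dp = {j: (emit(0, j), [j]) for j in states}
    for i in range(1, len(simple_sents)):
        dp = {j: max(((score + trans(p, j) + emit(i, j), path + [j])
                      for p, (score, path) in dp.items()),
                     key=lambda x: x[0])
              for j in states}
    return max(dp.values(), key=lambda x: x[0])[1]

simple = ["The cat sat on the mat.", "It was happy."]
complex_ = ["The cat, a tabby, sat on the mat.",
            "The weather was gloomy.",
            "It was exceedingly happy."]
print(align(simple, complex_))  # [0, 2]
```

Swapping the overlap score for a fine-tuned sentence-pair classifier, as in [1], is what lets the aligner capture semantic similarity beyond surface word overlap.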
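The aligned pairs then serve as training data for a standard Transformer-based seq2seq simplifier. The following fine-tuning sketch uses Hugging Face Transformers with BART as an assumed base model; the checkpoint, learning rate, and one-pair toy dataset are illustrative placeholders, not the exact setup behind the Newsela-Auto and Wiki-Auto results.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Toy complex -> simple pair; in practice, iterate over Wiki-Auto/Newsela-Auto.
pairs = [("The legislation was subsequently ratified.",
          "The law was then approved.")]
src = tok([c for c, _ in pairs], return_tensors="pt", padding=True)
tgt = tok([s for _, s in pairs], return_tensors="pt", padding=True)

# One gradient step; real training also masks pad tokens in the labels
# and runs many epochs over the full corpus.
loss = model(input_ids=src.input_ids, attention_mask=src.attention_mask,
             labels=tgt.input_ids).loss
loss.backward()
optimizer.step()

# Inference: generate a simplification for a new complex sentence.
ids = model.generate(**tok(["The committee deliberated at length."],
                           return_tensors="pt"), max_length=32)
print(tok.batch_decode(ids, skip_special_tokens=True))
```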
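For controllability, one common mechanism is to prepend control tokens computed from each training pair, so that at inference time a user can dial the degree of each operation; the sketch below illustrates that general idea only. The token names and scoring are hypothetical, and [4] itself combines linguistic rules for splitting and deletion with a neural paraphraser and pairwise candidate ranking rather than exactly this scheme.

```python
def control_prefix(complex_sent, simple_sent):
    """Derive per-pair control values at training time; at inference time
    the user sets them directly to steer the decoder."""
    c_toks, s_toks = complex_sent.split(), simple_sent.split()
    # Deletion/compression: target-to-source length ratio.
    length_ratio = round(len(s_toks) / max(len(c_toks), 1), 1)
    # Lexical paraphrasing: fraction of target words not copied from source.
    copied = len(set(c_toks) & set(s_toks)) / max(len(s_toks), 1)
    paraphrase = round(1.0 - copied, 1)
    # Sentence splitting: does the target contain more sentences?
    split = int(simple_sent.count(".") > complex_sent.count("."))
    return f"<len_{length_ratio}> <para_{paraphrase}> <split_{split}>"

src = "The incumbent senator, facing mounting criticism, opted to resign."
tgt = "The senator decided to quit. Many people had criticized him."
print(control_prefix(src, tgt) + " " + src)
# -> "<len_1.1> <para_0.8> <split_1> The incumbent senator, ..."
```

Training on the prefixed sources teaches the model to condition on these tokens; at test time, setting, e.g., `<para_0.8>` requests heavier lexical paraphrasing for a given target audience.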
To conclude, I will discuss a few questions raised by the CLEF reviewers: "Whilst text simplification has a long history, recent advances have significantly increased the quality and this may have opened up novel real-world applications: Is the quality sufficient for operational systems? What are the applications that are currently within our grasp? What is the main barrier for wide-scale deployment (comparable to MT)? Can we formulate a challenge for the obvious next step in the evolution?"

Code and data are available at: https://github.com/chaojiang06/wiki-auto

Keywords

text simplification, paraphrasing, sentence alignment, word alignment, readability

References

[1] C. Jiang, M. Maddela, W. Lan, Y. Zhong, W. Xu, Neural CRF model for sentence alignment in text simplification, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 7943–7960. URL: https://www.aclweb.org/anthology/2020.acl-main.709.
[2] W. Lan, C. Jiang, W. Xu, Neural semi-Markov CRF for monolingual word alignment, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
[3] W. Xu, C. Callison-Burch, C. Napoles, Problems in current text simplification research: New data can help, Transactions of the Association for Computational Linguistics (TACL) 3 (2015) 283–297. URL: https://www.aclweb.org/anthology/Q15-1021.
[4] M. Maddela, F. Alva-Manchego, W. Xu, Controllable text simplification with explicit paraphrasing, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021. URL: https://arxiv.org/abs/2010.11004.
[5] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing statistical machine translation for text simplification, Transactions of the Association for Computational Linguistics (TACL) 4 (2016) 401–415. URL: https://www.aclweb.org/anthology/Q16-1029.
[6] M. Maddela, W. Xu, A word-complexity lexicon and a neural readability ranking model for lexical simplification, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 3749–3760. URL: https://www.aclweb.org/anthology/D18-1410.
[7] Y. Zhong, C. Jiang, W. Xu, J. J. Li, Discourse level factors for sentence deletion in text simplification, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 34 (2020) 9709–9716. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6520.