Extraction of Formulaic Expressions from Scientific Papers

Kenichi Iwatsuki,1 Akiko Aizawa2,1
1 The University of Tokyo, Japan
2 National Institute of Informatics, Japan
iwatsuki@nii.ac.jp, aizawa@nii.ac.jp

Abstract

Phrasal patterns, such as 'in this paper we propose', are often used in scientific papers. These are called formulaic expressions (FEs) and constitute sentential communicative functions (CFs) that convey how a sentence should be read by the readers. FEs are useful for scientific paper analyses and academic writing assistance, but FE extraction methods have thus far not been investigated in detail. In this paper, we propose a sentence-level FE extraction method in which the CFs are taken into account. The proposed method is compared to existing methods to demonstrate that it is better at extracting CF-oriented FEs.

1 Introduction

In scientific papers, the authors often use fixed phrasal patterns that are specific to the genre, such as 'in this paper, we propose'. These patterns are called formulaic expressions (FEs) or formulaic sequences. FEs convey the intentions of the authors to the readers, i.e., the manner in which a sentence should be understood. This characteristic of an FE is called its communicative function (CF). For example, the phrase 'in this paper, we propose' conveys the CF 'showing the aim of the paper'. FEs are useful for understanding the composition of a scientific paper and are helpful in writing one.

A few studies have addressed the extraction of FEs and the subsequent assignment of CF labels to them (Cortes 2013; Mizumoto, Hamatani, and Imao 2017). However, these works have not rigorously investigated whether the extracted FEs convey the CFs of a sentence. Extracting word n-grams with frequency thresholds has been reported in several studies, although frequent FEs do not always convey the sentential CFs. Machine-learning approaches have hitherto been scarcely adopted because of the dearth of FE-annotated resources.

In this paper, we propose a new sentence-level FE extraction method and compare it to several existing methods. We assume that a single FE is extracted from each sentence because it conveys the entirety of the CF of that sentence. The proposed method consists of two steps. First, the named and scientific entities are removed from the sentence. Second, two types of n-grams are extracted from the sentence. The extracted FEs are then evaluated based on whether they convey the sentential CFs. The results of manual evaluations show that the proposed method can extract more FEs representing the CFs of sentences than existing methods. Considering the compilation of a list of FEs, which is a possible application of FE extraction, removing noisy FEs and enhancing precision is important. We therefore test how effective it is to filter FEs based on the number of occurrences of an FE, and show that this improves precision considerably.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Datasets

We used CF-labelled sentence datasets that were made from scientific papers of four disciplines: computational linguistics (CL), chemistry (chem), oncology (onc), and psychology (psy). Each discipline consists of four sections: introduction, methods, results, and discussion; thus, 16 datasets were used (combinations of the four disciplines and four sections). The numbers of sentences and words in these datasets are listed in Table 1. Compared with some of the existing studies, in which the sizes of the corpora were around 2 million (Simpson-Vlach and Ellis 2010) or 8 million (Mizumoto, Hamatani, and Imao 2017) words, we determined that the datasets are sufficiently large.

Discipline  Section       Sentences   Words
CL          introduction    266,904   5,934,772
            methods         362,477   7,469,502
            results         507,592  10,176,904
            discussion      111,052   2,481,983
Chem        introduction    285,810   7,526,537
            methods         376,583   8,655,414
            results         721,960  18,308,473
            discussion      175,266   4,443,967
Onc         introduction    441,141  11,051,210
            methods         976,205  20,615,171
            results       1,069,044  27,059,136
            discussion      834,641  20,897,907
Psy         introduction    484,615  13,944,874
            methods         429,155   9,898,714
            results         288,754   7,756,912
            discussion      453,118  12,641,250

Table 1: Numbers of sentences and words in each discipline and section in the prepared CF-labelled sentence datasets.

3 Methods

3.1 Two Approaches in FE Extraction

Two main approaches were considered here for extracting the FEs: corpus- and sentence-level approaches. In the corpus-level approach, the FEs are extracted from the entire corpus, whereas in the sentence-level approach, a single FE is extracted from each sentence (Figure 1). The corpus-level approach may cause problems with deciding the FE size and the overlap between FEs (Iwatsuki and Aizawa 2018). For example, when 4-grams were extracted in the experiments, the phrases 'paper we propose a' and 'we propose a method' were both extracted, but it is difficult to determine which of these is the better FE. In contrast, the sentence-level approach is free of this problem because it does not fix the length of the n-gram. Since a single FE is extracted from each sentence, only 'in this paper we propose a method' is extracted. Therefore, we adopt the sentence-level approach in the remaining experiments. We compared two corpus-level and two sentence-level methods with the proposed method.

3.2 Corpus-Level Extraction

Frequent N-grams: Word n-grams were extracted from the dataset, and depending on the frequency-based threshold, the infrequent FEs were removed. Although various studies have used different lengths and frequency thresholds for the n-grams, we extracted FEs whose lengths were three words or greater, and followed the method in Cortes (2013) for the frequency thresholds: 20 per million words (pmw) for four-word or shorter n-grams, 10 pmw for five-word phrases, 8 pmw for six- and seven-word phrases, and 6 pmw for phrases longer than seven words.

Lattice FS: This approach was originally proposed by Brooke, Šnajder, and Baldwin (2017), where n-grams are first extracted and later selected based on the concepts of covering, clearing, and overlap. Covering indicates that if the number of instances of 'we propose' is almost the same as that of 'we propose a new', the longer FE would explain the presence of the shorter FE. Clearing is the opposite idea to covering. Overlap indicates that the expressions 'in this paper we' and 'this paper we proposed' should not be accepted as FEs at the same time. These three concepts are expressed in mathematical form, and the FEs are optimised computationally. We used an available implementation (https://github.com/julianbrooke/LatticeFS).

3.3 Sentence-Level Extraction

Frequency-Based Filtering: Based on its frequency, each word of a sentence is labelled as either formulaic or non-formulaic. Non-formulaic words are removed, and the remaining words are regarded as the FE. We used two frequency thresholds, namely 1/50,000 and 1/100,000 words.

LDA-Based Filtering: Liu et al. (2016) proposed utilising latent Dirichlet allocation (LDA) because they assumed that topic-specific words do not comprise FEs. Thus, each word of a sentence was judged as either topic-specific or topic-independent based on the following criterion:

    P(w) = 1 - max_i p_w(i) / Σ_i p_w(i),

where p_w(i) is the probability of the word w in topic i. If P(w) is greater than the threshold, w is formulaic. We used P(w) > 0.65 and 10 topics, which were reported to be optimal.
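This topic-independence criterion can be sketched in a few lines of code. This is a toy illustration, not the implementation of Liu et al. (2016): the topic distributions below are invented numbers, and in practice p_w(i) would come from an LDA model trained on the corpus (10 topics and a 0.65 threshold, as above).

```python
def topic_independence(topic_probs):
    """P(w) = 1 - max_i p_w(i) / sum_i p_w(i).

    Close to 1 when the word's probability mass is spread evenly
    across topics (topic-independent, hence formulaic); close to 0
    when a single topic dominates (topic-specific).
    """
    return 1.0 - max(topic_probs) / sum(topic_probs)

def is_formulaic(topic_probs, threshold=0.65):
    return topic_independence(topic_probs) > threshold

# Toy distributions over 10 topics (illustrative numbers only):
evenly_spread = [0.1] * 10          # e.g. a function word such as 'the'
peaked = [0.91] + [0.01] * 9        # e.g. a topic word such as 'genome'

print(is_formulaic(evenly_spread))  # True: P(w) is about 0.9
print(is_formulaic(peaked))         # False: P(w) is about 0.09
```

Words passing the threshold are kept; the remaining formulaic words of the sentence then form the FE, as described above.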
Proposed Method: The proposed method comprises two steps: (1) removing named and scientific entities and (2) extracting the longest word n-grams (Figure 2). The first step was based on the idea that named and scientific entities, including places, organisations, materials, and methods, such as 'Helsinki' and 'word embeddings', do not constitute FEs. In the second step, dependency parsing was applied to the sentences to determine their roots. After removing the named and scientific entities, two types of word n-grams were labelled as formulaic:

1. the longest word n-gram satisfying a frequency threshold;
2. the longest word n-gram that contains the root of the sentence and satisfies the frequency threshold.

If multiple FEs of the same length were found, the most frequent one was prioritised. We focused on the longest word sequences because Cortes (2013) observed that lengthy FEs exist, such as 'the rest of the paper is organized as follows'. Additionally, we assumed that in several cases sentential CFs are realised around the root of the sentence, which is why the two types of n-grams are extracted. N-grams shorter than three words were ignored because such FEs would be too short, and the words remaining in the sentence after n-gram extraction were discarded. The frequency threshold was set to 3 to collect the maximum number of FEs. The entity removal was conducted with ScispaCy (Neumann et al. 2019).

In the example in Figure 2, the root word is 'show'. The longest n-gram satisfying the threshold and containing the root would thus be 'the results show that', while 'is significantly better than' would be another n-gram that does not contain the root. There are also cases where these two types of FEs overlap or are the same.

[Figure 1: Sentence-level FE extraction. The sentence 'Word embeddings play an important role in natural language processing' is labelled 0 0 1 1 1 1 1 0 0 0 (0: non-formulaic, 1: formulaic), yielding the FE 'play an important role in'.]

[Figure 2: The proposed FE extraction method. The sentence 'The results show that the BERT classifier is significantly better than the SVM classifier' becomes, after entity removal, 'The results show that the * is significantly better than the *'; n-gram extraction then yields the longest n-gram with the root ('The results show that') and the longest n-gram ('is significantly better than').]

3.4 Filtering FEs

For compiling a list of FEs, which is one of the applications of FE extraction, it is not always necessary to use all the FEs extracted from every sentence. It is more important to discard non-FEs. Because word sequences that occur only once or twice are not formulaic, filtering FEs based on the number of occurrences is effective. Therefore, we set thresholds on the number of FE occurrences in the dataset and removed the FEs not satisfying them.

4 Results

We randomly chose 100 sentences from the sentence dataset to evaluate the FE extraction. For the sentence-level methods, a single FE was extracted from each sentence. For the corpus-level methods, the FEs and sentences were not clearly connected; thus, we randomly selected a single FE from the set of extracted FEs for each sentence.

The evaluations were then conducted manually. Three annotators were asked to check whether the FEs extracted with each method had the same CFs as the sentences from which they were extracted and whether they were reusable when writing scientific papers. The FEs were presented to the annotators simultaneously, and the method that produced each FE was not disclosed. A total of 100 combinations of sentences and FEs were randomly selected for the evaluations. It should be noted that recall cannot be calculated because there are no available FE-annotated resources.

The results of the evaluations are shown in Table 2; the proposed method shows a clear advantage over the other baselines in FE extraction.

Method                        ≥2/3  3/3    κ
Frequent n-grams              0.30  0.09   0.36
Lattice FS                    0.07  0.03   0.30
Frequency-based (1/50,000)    0.04  0.02  -0.36
Frequency-based (1/100,000)   0.05  0.02  -0.39
LDA-based                     0.08  0.03  -0.20
Proposed (Step 1)             0.13  0.05  -0.27
Proposed (Step 2)             0.54  0.28   0.23
Proposed (Step 1+2)           0.58  0.39   0.44

Table 2: Ratios of FEs that two or three out of the three (≥2/3) and all three (3/3) annotators labelled as correct. Fleiss's kappa is also shown.

Table 3 shows the scores under different thresholds on the number of occurrences of FEs. From the table, it can be seen that if FEs occurring fewer than three times in a corpus are ignored, the precision changes from 0.39 (39/100) to 0.45 (24/53).

Occurrence    ≥1      ≥3     ≥5     ≥7
Ratio of 3/3  0.28    0.45   0.55   0.53
              39/100  24/53  21/47  21/46

Table 3: Ratios of FEs whose score was 3/3 under different filtering thresholds of occurrence.

5 Discussion

5.1 Errors in Entity Recognition

We analysed the errors (FEs that one third or fewer of the annotators judged as correct) made by the proposed method. Errors in the entity recognition (step 1) account for approximately 60% of all the errors. They can be classified into two types: (1) entities are not removed, and (2) formulaic words are removed as entities though they are not entities. Most of the errors were of type (2).

Table 4 lists examples of this error. From the table, it can be seen that formulaic words such as 'table' and 'investigated', which are indispensable for representing the CFs, were removed. When formulaic words are removed at this stage, meaningful n-grams cannot be extracted in step 2. These results suggest that entity recognition is crucial to the proposed method and should be improved considerably.

CF: Reference to tables or figures
Full sentence: From this table, we observe that the topics learned by our method are better in coherence than those learned from the baseline methods, which again demonstrates the effectiveness of our model.
Sentence without entities: from this * we observe that the topics learned by our * are better in * than those learned from the * which again demonstrates the * of our

CF: Showing limitation or lack of past work
Full sentence: Although the cellular uptake efficiency could be improved by adjusting the size and the sequence of DNPs in the previous study, it has not been investigated whether the DNPs can also be used in the in vivo environment rich in nucleases.
Sentence without entities: although the * could be improved by adjusting the * and the * of * in the previous * it has not been * whether the * can also be used in the * rich in the

Table 4: Examples of errors in named and scientific entity recognition. The sentences are cited from Xie, Yang, and Xing (2015); Kim et al. (2018).

5.2 Errors in N-grams

Another type of error arises in the n-gram extraction (step 2). In the proposed method, we extracted two different n-grams: the longest n-gram containing the sentential root and the longest n-gram that does not necessarily contain the root, both of which satisfied the threshold on the number of occurrences in the corpora.

In the majority of these errors, the two extracted n-grams are the same but do not contain the CF-realising part. Table 5 lists examples. A span error occurred in the second example: since 'both plasma and urine' is content, the FE should not include 'both'. The other examples missed the CF-realising part. In the first example, 'a common approach' is important to the introduction of the methodology. In the third example, a specific number was extracted; it should be noted that numbers sometimes do constitute an FE, because in some disciplines there exist very fixed numbers, such as 'a p value less than 0.05 was considered significant'. In the fourth example, the FE missed 'as proposed by', which shows that the method was used in past work. In the last example, the controversy is represented by 'has been challenged', which was not extracted.

CF: Showing brief introduction to the methodology
Sentence: A common approach used to assign structure to language is to use a probabilistic grammar where each elementary rule or production is associated with a probability.
FE: is to use a

CF: Restatement of the results
Sentence: For example, shared specific genomic aberrations were observed in both plasma and urine cfDNAs at loci of PTEN, TMPRSS2 and AR (Figure 1 and [CITATION]).
FE: were observed in both

CF: Description of the results
Sentence: Rs679620 was also associated with increased OA risk in dominant ("TC-TT", OR = 2.03, 95% CI: 1.03-4.01, P = 0.038) and over-dominant model analyses ("TC", OR = 2.04, 95% CI: 1.05-3.96, P = 0.033).
FE: p 0038 and dominant

CF: Using methods used in past work
Sentence: The smoothness value used for the AlphaSim calculation was based on the smoothness of the residual image of the statistical analysis as proposed by [CITATION].
FE: was based on

CF: Showing controversy within the field
Sentence: However, it should be noted that the biological involvement of many of these targets in HBD-3 activities has been challenged in recent years [[CITATION]].
FE: however it should be noted that the

Table 5: Examples of errors in n-gram extraction. The sentences are cited from Sarkar (1998); Xia et al. (2016); Guo et al. (2017); Vivas et al. (2019); Phan et al. (2016).

The last example also shows that n-grams that contain the sentential root do not always convey the CF. It is true that the that-clause conveys the CF of showing controversy within the field, but the phrase in the main clause, 'it should be noted that', may have a different CF. This is a limitation of regarding a sentence as the unit of a CF, because a long sentence may have more than one CF. However, it is difficult to determine the length that constitutes the unit of a CF.
Table 6 shows the CFs for which the ratio of FEs with 3/3 accuracy was 0.00 or 1.00. It can be said that the difficulty of FE extraction differs depending on the CF. CFs such as 'describing interesting or surprising results' and 'unexpected outcome' are often realised by an adverb or adjective, which is difficult to extract using the proposed method.

CF                                                        Ratio
Showing limitation or lack of past work                   0.00
Comments on the findings                                  0.00
Showing explanation or definition of terms or notations   0.00
Unexpected outcome                                        0.00
Describing interesting or surprising results              0.00
Summary of the results                                    0.00
Comparison of the results                                 0.00
Showing the limitation of the research                    0.00
Showing the characteristics of samples or data            0.00
Showing reasons why a method was adopted or rejected      0.00
Showing methodology used in past work                     1.00
Suggestion of hypothesis                                  1.00
Showing the outline of the paper                          1.00
Showing the aim of the paper                              1.00
Suggestion of future work                                 1.00
Explanation for findings                                  1.00
Showing criteria for selection                            1.00
Showing the main problem in the field                     1.00

Table 6: CFs for which the ratio of FEs with 3/3 accuracy is 0.00 or 1.00.

5.3 Error Analyses in Existing Methods

The existing FE extraction methods have different drawbacks. Table 7 lists the numbers of FEs extracted with the sentence-level methods after removing infrequent FEs occurring fewer than three times in the corpus. Compared to the proposed method, these methods extracted smaller numbers of FEs because most of their FEs rarely occur in the corpus. An example of sentence-level extraction is illustrated in Figure 3. The existing methods do not remove the non-formulaic words sufficiently here because the focus is only on a single word at a time, and words such as 'in' or 'results' do not always constitute an FE.

Method                        FEs
Frequency-based (1/50,000)     13,722
Frequency-based (1/100,000)    12,840
LDA-based                      18,033
Proposed                      285,193

Table 7: Numbers of FEs that were extracted using the different methods and occurred at least three times in the dataset.

[Figure 3: Example of FE extraction. The second step of the proposed method extracted two different n-grams. The original sentence is cited from An, Zhang, and Zhang (2018).]
  Original sentence:      In order to avoid over fitting, PA with PCA was chosen for this study.
  Frequency (1/50,000):   in order to avoid over fitting pa with * was chosen for this study
  Frequency (1/100,000):  in order to avoid over fitting pa with pca was chosen for this study
  LDA-based:              in order to avoid over fitting * with * chosen for this study
  Proposed:               in order to avoid / was chosen for this

The corpus-level methods differ in this regard. The numbers of extracted FEs are 23,847 (frequent n-grams) and 2,480,935 (Lattice FS). The frequent n-gram method extracts a smaller number of FEs because of the frequency thresholds. Further, it achieves a relatively good quality score, which is nevertheless still lower than that of the proposed method (Table 2). The Lattice FS extracts too many FEs, which can deteriorate the quality of the FEs.

6 Conclusion

In this paper, we proposed a new sentence-level FE extraction method to realise CF-oriented analysis. We compared the proposed method to four existing methods, and our manual evaluations showed that the proposed method extracted CF-realising FEs better than these other methods. Although FE extraction has not been discussed in detail thus far in the literature, we showed that there exists a more robust method than just extracting frequent n-grams, as adopted in past studies. The FEs extracted with the proposed method are provided at our website (https://github.com/Alab-NII/CF-Labelled-FE-Database) for utilisation in various tasks, such as information extraction and computer-based academic writing assistance.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 19J12466 and 18H03297.

References

An, M.; Zhang, X.; and Zhang, X. 2018. Identifying the Validity and Reliability of a Self-Report Motivation Instrument for Health-Promoting Lifestyles Among Emerging Adults. Frontiers in Psychology 9: 1222. doi:10.3389/fpsyg.2018.01222.

Brooke, J.; Šnajder, J.; and Baldwin, T. 2017. Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice. Transactions of the Association for Computational Linguistics 5: 455–470. doi:10.1162/tacl_a_00073.

Cortes, V. 2013. The purpose of this study is to: Connecting lexical bundles and moves in research article introductions. Journal of English for Academic Purposes 12(1): 33–43. doi:10.1016/j.jeap.2012.11.002.

Guo, W.; Xu, P.; Jin, T.; Wang, J.; Fan, D.; Hao, Z.; Ji, Y.; Jing, S.; Han, C.; Du, J.; Jiang, D.; Wen, S.; and Wang, J. 2017. MMP-3 gene polymorphisms are associated with increased risk of osteoarthritis in Chinese men. Oncotarget 8(45): 79491–79497. doi:10.18632/oncotarget.18493.

Iwatsuki, K.; and Aizawa, A. 2018. Using Formulaic Expressions in Writing Assistance Systems. In Proceedings of the 27th International Conference on Computational Linguistics, 2678–2689. Association for Computational Linguistics.

Kim, K.-R.; Röthlisberger, P.; Kang, S. J.; Nam, K.; Lee, S.; Hollenstein, M.; and Ahn, D.-R. 2018. Shaping Rolling Circle Amplification Products into DNA Nanoparticles by Incorporation of Modified Nucleotides and Their Application to In Vitro and In Vivo Delivery of a Photosensitizer. Molecules 23(7). doi:10.3390/molecules23071833.

Liu, Y.; Wang, X.; Liu, M.; and Wang, X. 2016. Write-righter: An Academic Writing Assistant System. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 4373–4374. Association for the Advancement of Artificial Intelligence.

Mizumoto, A.; Hamatani, S.; and Imao, Y. 2017. Applying the Bundle–Move Connection Approach to the Development of an Online Writing Support Tool for Research Articles. Language Learning 67(4): 885–921. doi:10.1111/lang.12250.

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W. 2019. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, 319–327. doi:10.18653/v1/W19-5034.

Phan, T. K.; Lay, F. T.; Poon, I. K.; Hinds, M. G.; Kvansakul, M.; and Hulett, M. D. 2016. Human β-defensin 3 contains an oncolytic motif that binds PI(4,5)P2 to mediate tumour cell permeabilisation. Oncotarget 7(2): 2054–2069. doi:10.18632/oncotarget.6520.

Sarkar, A. 1998. Conditions on Consistency of Probabilistic Tree Adjoining Grammars. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, 1164–1170. doi:10.3115/980691.980759.

Simpson-Vlach, R.; and Ellis, N. C. 2010. An Academic Formulas List: New Methods in Phraseology Research. Applied Linguistics 31(4): 487–512. doi:10.1093/applin/amp058.

Vivas, A. B.; Paraskevopoulos, E.; Castillo, A.; and Fuentes, L. J. 2019. Neurophysiological Activations of Predictive and Non-predictive Exogenous Cues: A Cue-Elicited EEG Study on the Generation of Inhibition of Return. Frontiers in Psychology 10: 227. doi:10.3389/fpsyg.2019.00227.

Xia, Y.; Huang, C.-C.; Dittmar, R.; Du, M.; Wang, Y.; Liu, H.; Shenoy, N.; Wang, L.; and Kohli, M. 2016. Copy number variations in urine cell free DNA as biomarkers in advanced prostate cancer. Oncotarget 7(24): 35818–35831. doi:10.18632/oncotarget.9027.

Xie, P.; Yang, D.; and Xing, E. 2015. Incorporating Word Correlation Knowledge into Topic Modeling. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 725–734. doi:10.3115/v1/N15-1074.