Disciplinary Variation in Syntactic Complexity: A Corpus Analysis of Professional Academic Writing

Javier Pérez-Guerra (a) and Elizaveta A. Smirnova (a, b)

(a) University of Vigo, Campus Universitario, Vigo, E-36310, Spain
(b) HSE University, 38 Studencheskaya Street, Perm, 614070, Russia

Abstract
This study analyses syntactic complexity in professional academic writing and is based on a corpus of so-called ‘hard’ and ‘soft’ papers published in leading international journals. We aim to describe the main complexity features of academic discourse and to test the hypothesis that there is considerable disciplinary variation in linguistic complexity. We conclude that, first, clausal complexity strategies are more prevalent in the ‘hard’ sciences, while phrasal-complexity features dominate in the ‘soft’ ones. Second, the data reveal a continuum across subdisciplines within the broad categories of ‘soft’ and ‘hard’ genres with respect to the adoption of complexity strategies.

Keywords
Corpus analysis, disciplinary variation, academic discourse, academic writing, syntactic complexity

1. Introduction
The phenomenon of complexity has been extensively studied in corpus linguistics in recent years. Specifically, the complexity of writing has been studied in terms of the comparison of L2 and L1 writing [e.g. 1], correlations between text complexity, language proficiency and task types [e.g. 2], and the development of text complexity after intensive instruction [e.g. 3]. However, complexity in professional academic writing has been relatively under-researched to date, despite the potential pedagogical implications of such studies. In this respect, we contend that following the linguistic conventions of a particular discipline plays a crucial role in identifying writers as experts in their own discourse communities [4].
From this perspective, a research article can serve as a benchmark for optimal academic writing, providing learners with “a rich and authentic introduction to the complexities and nuances of the genre” [5: 3]. This study reports an empirical analysis of linguistic complexity features which aims, first, to describe the complexity features of research articles written by professional authors and, second, to test the hypothesis that linguistic complexity varies across disciplines.

2. Data and methodology
The analysis of linguistic complexity in professional academic writing was conducted on a 775,000-word corpus of research papers in four ‘soft’ arts and social sciences (business studies, linguistics, history and political science) and four ‘hard’ life and physical sciences (mathematics, engineering, chemistry and physics). All papers were published in 2016 and 2017 in leading peer-reviewed journals indexed in Scopus Quartile 1. Once collected, the texts were manually cleared of tables, formulas, graphs, charts, metadata and reference lists for further analysis. The size and details of the corpus are given in Table 1.

IMS 2021 - International Conference "Internet and Modern Society", June 24-26, 2021, St. Petersburg, Russia
EMAIL: jperez@uvigo.es (A. 1); easmirnova@hse.ru (A. 2)
ORCID: 0000-0002-8882-667X (A. 1); 0000-0001-9307-6773 (A. 2)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Table 1
Corpus

| Discipline | No. texts | Word totals | Journals |
|---|---|---|---|
| HARD SCIENCES | | | |
| Chemistry | 16 | 97,947 | Cell Chemical Biology (CCB); Chem |
| Physics | 18 | 95,852 | Physics Letters B (PL); Reviews in Physics (RP) |
| Mathematics | 13 | 98,430 | Compositio Mathematica (CM); The Journal of Differential Geometry (JDG) |
| Engineering | 17 | 99,003 | Automatica (Auto); Materials Characterisation (MC) |
| Totals | 64 | 391,232 | |
| SOFT SCIENCES | | | |
| Business | 10 | 95,350 | The Journal of Management (JM); The Journal of Management Studies (JMS) |
| Linguistics | 10 | 95,603 | Applied Linguistics (AL); Lingua (Ling) |
| History | 10 | 99,303 | Contemporary European History (CEH); The Journal of Modern History (JMH) |
| Political science | 11 | 93,366 | Political Analysis (PA); World Politics (WP) |
| Totals | 41 | 383,622 | |

In this study we undertake both a quantitative analysis of measures automatically generated by a complexity analyser and a qualitative scrutiny of a number of syntactic patterns associated with syntactic complexity. Firstly, to accomplish the quantitative analysis, the corpus texts were processed using Lu’s L2 Syntactic Complexity Analyser (hereafter L2SCA). L2SCA provided the 14 indices given in Table 2 along with their descriptions, as in Lu [6: 43]. These indices were categorised into:
(i) metrics of structural complexity: indices reporting the length of units (sentences, T-units, clauses; see footnote 2), measured by counting the number of words;
(ii) metrics of syntactic complexity: indices reflecting syntactic depth and dependency, that is, those based on coordination and subordination ratios as well as on clausal/T-unit embedding within other superordinate units; and
(iii) metrics of categorial complexity: indices expressing the pervasiveness of nominal and verbal categories in the text.

At the second stage of the analysis, we carried out a qualitative analysis of clausal and phrasal complexity features, based on the taxonomy in Staples et al. [9].
The features are: sentence-final adverbial clauses of different types, wh-complement clauses, verb + that-clauses, nouns, attributive adjectives, premodifying nouns and of-genitives. The analysis of these features required extensive manual disambiguation of the examples.

Footnote 2: The notion of a T-unit is extensively used in complexity studies and is defined as “the shortest terminable units into which a connected discourse can be segmented without leaving any residue” [7: 34]. Bardovi-Harlig [8] notes that a T-unit normally comprises an independent clause along with its dependent clauses. For example, the expression This would certainly continue to be the case with the CNT, but the UGT fared differently thanks to the support of the PSOE, its European partners and even the Spanish government, who had a strong interest in weakening the Communists (CEH-2016-4) consists of one sentence, two T-units (This would certainly continue to be the case with the CNT and the UGT fared differently thanks to the support of the PSOE, its European partners and even the Spanish government, who had a strong interest in weakening the Communists) and three clauses (This would certainly continue…, …but the UGT fared differently… and …who had a strong interest…).

Table 2
L2SCA syntactic complexity indices

| Type | Subtype | Index | Description |
|---|---|---|---|
| Structural complexity | | MLS | mean length of sentence (no. of words) |
| | | MLT | mean length of T-unit (no. of words) |
| | | MLC | mean length of clause (no. of words) |
| Syntactic complexity | Coordination | CPC | coordinate-phrase/clause ratio |
| | | CPT | coordinate-phrase/T-unit ratio |
| | Subordination | CS | clause/sentence ratio |
| | | CT | clause/T-unit ratio |
| | | TS | T-unit/sentence ratio |
| | | DCC | dependent-clause/clause ratio |
| | | DCT | dependent-clause/T-unit ratio |
| | | CTT | complex-T-unit/T-unit ratio |
| Categorial complexity | Predicates | VPT | verb-phrase/T-unit ratio |
| | Nominals | CNT | complex-nominal/T-unit ratio |
| | | CNC | complex-nominal/clause ratio |

3. Results
The automated complexity indices are given in Table 3.
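The ratio-based indices of Table 2 are quotients of unit counts. As a minimal sketch (this is not L2SCA itself, which derives its counts from parsed text), the following computes three of them from the counts stated in footnote 2 for the CEH-2016-4 example: one sentence, two T-units and three clauses. The class and function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class UnitCounts:
    """Raw counts of syntactic units in a text (hypothetical container)."""
    sentences: int
    t_units: int
    clauses: int


def ratio_indices(counts: UnitCounts) -> dict:
    """Return the ratio-based indices TS, CT and CS from raw unit counts."""
    return {
        "TS": counts.t_units / counts.sentences,  # T-units per sentence
        "CT": counts.clauses / counts.t_units,    # clauses per T-unit
        "CS": counts.clauses / counts.sentences,  # clauses per sentence
    }


# Counts stated in footnote 2 for the CEH-2016-4 example sentence.
example = UnitCounts(sentences=1, t_units=2, clauses=3)
indices = ratio_indices(example)
print(indices)
```

For a whole text, the same quotients would be taken over the totals of each unit type rather than over a single sentence.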
Table 3
L2SCA syntactic complexity indices in hard/soft sciences (hard sciences: chemistry, physics, mathematics, engineering; soft sciences: business, linguistics, history, political science)

| Index | Chemistry | Physics | Mathematics | Engineering | Hard mean | Business | Linguistics | History | Political sc. | Soft mean |
|---|---|---|---|---|---|---|---|---|---|---|
| MLS | 32.3 | 26.26 | 27.99 | 27.34 | 28.47 | 32.68 | 31.47 | 63.9 | 35.84 | 40.97 |
| MLT | 29.75 | 25.35 | 25.87 | 25.33 | 26.58 | 30.87 | 29.04 | 56.74 | 31.88 | 37.13 |
| MLC | 20.03 | 16.33 | 15.12 | 15.49 | 16.74 | 17.65 | 16.52 | 29.42 | 16.02 | 19.9 |
| CPC | 0.49 | 0.31 | 0.17 | 0.34 | 0.33 | 0.66 | 0.41 | 0.37 | 0.28 | 0.43 |
| CPT | 0.74 | 0.47 | 0.29 | 0.52 | 0.5 | 0.88 | 0.7 | 0.71 | 0.56 | 0.71 |
| CS | 1.63 | 1.59 | 1.88 | 1.75 | 1.71 | 2.06 | 2.06 | 2.21 | 2.25 | 2.14 |
| CT | 1.5 | 1.19 | 1.74 | 1.62 | 1.51 | 1.79 | 1.84 | 1.93 | 2 | 1.89 |
| TS | 1.12 | 1.08 | 1.08 | 1.07 | 1.09 | 1.06 | 1.08 | 1.14 | 1.13 | 1.1 |
| DCC | 0.31 | 0.34 | 0.4 | 0.35 | 0.35 | 0.43 | 0.43 | 0.43 | 0.47 | 0.44 |
| DCT | 0.49 | 0.54 | 0.7 | 0.57 | 0.58 | 0.75 | 0.8 | 0.83 | 0.96 | 0.84 |
| CTT | 0.36 | 0.38 | 0.48 | 0.39 | 0.4 | 0.52 | 0.52 | 0.53 | 0.57 | 0.53 |
| VPT | 2.08 | 2.13 | 2.09 | 2.13 | 2.11 | 2.81 | 2.42 | 2.67 | 2.82 | 2.68 |
| CNT | 3.66 | 3.39 | 2.9 | 3.01 | 3.06 | 4.07 | 3.05 | 4.4 | 3.78 | 3.83 |
| CNC | 2.45 | 2.2 | 1.68 | 1.88 | 2.05 | 2.24 | 2.07 | 2.31 | 1.91 | 2.13 |

In an attempt to determine the relative weights of the complexity indices, a binomial generalised linear regression analysis was applied to the data, implemented via the function ‘glm’ (‘stats’ package, R Core Team 2020). A backward stepwise reduction of the number of indices led to the model in (1), which retains only the indices VPT (verb phrases per T-unit), DCC (dependent-clause ratio; labelled ‘dcs’ in the model output), TS (T-unit/sentence ratio) and CPT (coordinate phrases per T-unit). Both discrimination indices, the C (concordance) index of 0.918 and Nagelkerke’s R2 of 0.653, indicate that the model is very good at explaining the variation.
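The C (concordance) index reported above can be read as the share of hard/soft text pairs in which the model assigns the higher predicted probability to the soft text. The sketch below illustrates that definition only. The six probabilities are invented toy values, not data from the study, and the coding of ‘soft’ as 1 is an assumption.

```python
from itertools import product


def concordance_index(probabilities, labels):
    """Share of (positive, negative) pairs in which the positive case
    receives the higher predicted probability; ties count as half."""
    positives = [p for p, y in zip(probabilities, labels) if y == 1]
    negatives = [p for p, y in zip(probabilities, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    score = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            score += 1.0
        elif pos == neg:
            score += 0.5
    return score / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of 'soft' for six texts
# (invented values); label 1 = soft, 0 = hard (assumed coding).
toy_probabilities = [0.91, 0.78, 0.42, 0.55, 0.20, 0.08]
toy_labels = [1, 1, 1, 0, 0, 0]
c_index = concordance_index(toy_probabilities, toy_labels)
print(round(c_index, 3))
```

In the toy data eight of the nine soft/hard pairs are ordered correctly. A value of 0.5 would correspond to chance-level discrimination and 1.0 to perfect separation.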
(1) Definitive glm model (‘***’: 0.001, ‘*’: 0.05)

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -25.9115      4.0116   -6.459  1.05e-10  ***
vpt            3.4756      1.0531    3.300  0.000966  ***
dcs           10.6276      5.0567    2.102  0.035580  *
ts            10.1416      2.6373    3.845  0.000120  ***
cpt            3.8392      0.7312    5.250  1.52e-07  ***

Figure 1 presents the Random Forests (function ‘cforest’, ‘party’ package) corresponding to the model’s fixed predictors, with an excellent C-index of 0.918. Figure 1 reflects the significant impact of the indices CPT, VPT and DCC on the variation hard/soft science, and the more minor contribution of TS to the model.

Figure 1: Dot chart of conditional variable importance

The findings revealed by the statistical analysis of the complexity indices per broad discipline, that is, hard and soft sciences, can be interpreted as follows. The reduction of the indices led to a model with only four indices evincing different dimensions of linguistic complexity:
(i) syntactic complexity mirrored by pervasive coordination, as reflected by the index CPT, which calculates the ratio of coordinated phrases per T-unit;
(ii) syntactic complexity determined by subordination within clausal units, as evinced by the index DCC, which expresses the proportion of dependent clauses in matrix clauses, and within sentences, as corroborated by the statistical significance of the index TS, a telling indicator of the ratio of T-units per sentence;
(iii) categorial complexity associated with the frequency of, specifically, verbal constituents in T-units, here captured by the index VPT.

Random Forests have demonstrated, on the one hand, that, of the indices that proved to be very strong in the model, the measures evincing complexity triggered by coordination (CPT) and by the profusion of verbal categories (VPT) contribute to the variation of hard versus soft science to a greater extent than DCC and TS.
On the other hand, higher values of the four complexity indices increase the probability that a piece of academic writing is categorised as soft science. In other words, greater ratios of coordination, subordination and the ‘verby’ status of texts can be taken as proxies for the categorisation of a research paper within the domain of social sciences and humanities. These results are in line with Biber et al. [10: 29] when they claim that “complexity is not a single unified construct, and it is therefore not reasonable to suppose that any single measure will adequately represent this construct”. However, some remarks are in order here as regards the interpretation of our findings in light of the conclusions drawn by Biber and colleagues. In their multidimensional analysis of academic writing versus other more informal genres, Biber et al. [11] found that high(er) phrasal complexity and low(er) clausal complexity are characteristic features of academic English (as well as of newspaper and magazine writing). By contrast, the type of complexity evinced in personal, professional (even academic) spoken genres, as well as in popular written discourse (novels, personal essays), is fundamentally clausal. Specifically, they contend that T-unit- and subordination-based (i.e. clausal) measures are not typical of academic writing but of conversational discourse, whereas nominal/prepositional (i.e. phrasal) measures are good indicators of academic writing. The statistical modelling of the complexity indices reported in this section has shown that subordination, coordination and the ‘verby’ status of sentences (or, better, T-units) are defining features of soft academic writing. As we see it, this conclusion does not invalidate a dominantly phrasal characterisation of academic writing when compared to more informal speech-based/related discourse, but gives support to the multifaceted nature of academic writing.
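To make the direction of the effects in model (1) concrete, the sketch below turns the reported coefficients into a predicted probability of ‘soft’ for two index profiles, namely the hard and soft group means from Table 3. Several points are assumptions rather than facts stated in the paper: that the model uses the logit link (the default for the binomial family in ‘glm’), that ‘soft’ is the modelled outcome, and that the ‘dcs’ term corresponds to the DCC index. The model was presumably fitted on individual texts, so applying it to group means is purely illustrative.

```python
import math

# Coefficients of model (1) as reported in the paper.
COEFFICIENTS = {
    "intercept": -25.9115,
    "vpt": 3.4756,
    "dcs": 10.6276,
    "ts": 10.1416,
    "cpt": 3.8392,
}


def probability_soft(vpt, dcs, ts, cpt):
    """Inverse logit of the linear predictor (assumes a logit link and
    that 'soft' is the modelled outcome)."""
    eta = (COEFFICIENTS["intercept"]
           + COEFFICIENTS["vpt"] * vpt
           + COEFFICIENTS["dcs"] * dcs
           + COEFFICIENTS["ts"] * ts
           + COEFFICIENTS["cpt"] * cpt)
    return 1.0 / (1.0 + math.exp(-eta))


# Group means from Table 3; DCC is used for the 'dcs' term (assumption).
p_hard_profile = probability_soft(vpt=2.11, dcs=0.35, ts=1.09, cpt=0.50)
p_soft_profile = probability_soft(vpt=2.68, dcs=0.44, ts=1.10, cpt=0.71)
print(round(p_hard_profile, 2), round(p_soft_profile, 2))
```

Under these assumptions the hard-science profile falls well below a probability of 0.5 and the soft-science profile well above it, which matches the reading that higher values of the four indices point towards soft science.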
Subsequently, a more qualitative analysis of the frequencies of the features associated with clausal and phrasal complexity was carried out. The results of this analysis are shown in Figure 2, which provides the normalised frequencies (per 100,000 words) of the features. All the differences in the use of the complexity features in hard and in soft sciences were found to be statistically significant at the 1% level, except that of verb + that-clauses, which was significant at the 5% level.

As can be seen in Figure 2, adverbial clauses were found to be more common in the corpus of hard-science papers. A closer look at the types of adverbial clauses extensively employed in life and physical sciences revealed that the most frequently used one is the conditional clause, which accounts for almost a third of all adverbial clauses. This type of adverbial clause is typically used in comments on calculations, formulas and theorems (see example 1). As regards the two features evincing complementation strategies, wh-clauses prevail in the soft research papers, whereas that-clauses are more frequent in the hard disciplines. Finally, the data demonstrate that, overall, phrasal complexity features, particularly adjectival and prepositional phrases, prevail in the soft-science texts, while nominal categories are more frequent in the hard sciences, particularly in chemistry, where they are used in long names of chemical entities and processes (see example 2).

(1) The next lemma expresses the important fact that if qC > 0 and if the excess measured relative to C is much smaller than the excess measured relative to pairs of planes with higher-dimensional axes… (JDG-2017-3).
(2) In addition, methyliminodiacetic acid (MIDA)-protected boronate esters were well tolerated (Chem-2016-4)

Figure 2: Clausal/phrasal complexity features in hard/soft sciences

4. Conclusions
This study has tackled the analysis of linguistic complexity in professional academic writing in English. The analysis of automated indices of complexity in a corpus of research articles published in leading journals in hard (mathematics, chemistry, physics, engineering) and soft (linguistics, history, business, political science) sciences led to the following conclusions. Soft-science papers show significantly higher values than hard-science papers for the indices associated with syntactic complexity, namely subordination and coordination ratios. The data have also revealed that clausal-complexity features, in particular sentence-final adverbial clauses, are significantly more frequent in the corpus of hard-science papers. Phrasal complexity, measured here by the number of adjectival and prepositional phrases, proved to prevail in the soft-science category, whereas the hard-science texts exhibited greater ratios of nominal categories.

An in-depth description of linguistic complexity in professional academic texts, based on objectively measured indices, can benefit the teaching of EAP/ESP writing by guiding the production of discipline-specific language-learning materials that address the needs of learners in different sciences more effectively. From the perspective of Data Driven Learning (DDL) approaches [12], EAP/ESP practitioners could employ teaching materials with examples from research papers in a particular discipline or group of disciplines (hard vs soft) with the purpose of helping students learn how to meet the language and stylistic conventions established in a specific discipline.
In this vein, concordance lines with the most common finite adverbial clauses could, for example, be employed to demonstrate the way in which clausal complexity is achieved and realised in hard sciences, while occurrences of adjectival and prepositional phrases from papers in soft disciplines would serve as an illustration of the type of phrasal complexity in this domain.

5. References
[1] C. Lambert, S. Nakamura, Proficiency-related variation in syntactic complexity: A study of English L1 and L2 oral descriptive discourse. International Journal of Applied Linguistics 29(2) (2019) 1–17. doi: 10.1111/ijal.12224
[2] J. E. Casal, J. J. Lee, Syntactic complexity and writing quality in assessed first year L2 writing. Journal of Second Language Writing 44 (2019) 51–62. doi: 10.1016/j.jslw.2019.03.005
[3] D. Mazgutova, J. Kormos, Syntactic and lexical development in an intensive English for Academic Purposes programme. Journal of Second Language Writing 29 (2015) 3–15. doi: 10.1016/j.jslw.2015.06.004
[4] K. Hyland, As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes 27(1) (2008) 4–21. doi: 10.1016/j.esp.2007.06.001
[5] R. F. Kelly-Laubscher, N. Muna, M. van der Merwe, Using the research article as a model for teaching laboratory report writing provides opportunities for development of genre awareness and adoption of new literacy practices. English for Specific Purposes 48 (2017) 1–16. doi: 10.1016/j.esp.2017.05.002
[6] X. Lu, A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL writers’ language development. TESOL Quarterly 45(1) (2011) 36–62. doi: 10.5054/tq.2011.240859
[7] K. W. Hunt, Differences in grammatical structures written at three grade levels: The structures to be analysed by transformational methods. Report no. CRP-1998. Tallahassee: Florida State University, 1964.
[8] K. Bardovi-Harlig, A second look at T-unit analysis: Reconsidering the sentence. TESOL Quarterly 26(2) (1992) 390–395. doi: 10.2307/3587016
[9] S. Staples, J. Egbert, D. Biber, B. Gray, Academic writing development at the university level: Phrasal and clausal complexity across level of study, discipline, and genre. Written Communication 33(2) (2016) 149–183. doi: 10.1177/0741088316631527
[10] D. Biber, B. Gray, K. Poonpon, Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly 45(1) (2011) 5–35. doi: 10.5054/tq.2011.244483
[11] D. Biber, B. Gray, K. Poonpon, Pay attention to the phrasal structures: Going beyond T-units – A response to WeiWei Yang. TESOL Quarterly 47(1) (2013) 192–201. doi: 10.1002/tesq.84
[12] T. F. Johns, Should you be persuaded: two samples of data-driven learning materials. English Language Research Journal 4 (1991) 1–16.