First-Year Composition as "Big Data": Examining Student Revisions at Scale

Chris Holcomb
English Language and Literature
University of South Carolina
Columbia, South Carolina
holcombc@mailbox.sc.edu

Duncan Buell
Computer Science and Engineering
University of South Carolina
Columbia, South Carolina
buell@acm.org

ABSTRACT
Approaching First-Year Composition (FYC) as a "big data" phenomenon, we have prototyped software to study revision in a large corpus of student papers and thus to address a question central to Composition and Rhetoric scholarship: "What role does revision play in students' writing processes?" After running our program on a corpus of student writing, we see that our computational analysis challenges past research on revision and extends the methodological reach of Composition and Rhetoric to include "big data" analytics.

Keywords
first-year composition, revision, text analysis

1. INTRODUCTION
For many First-Year Composition (FYC) programs, revision is the centerpiece of their writing pedagogies. Students draft each major writing assignment, receive feedback from their peers and instructors, revise their papers based on that feedback, and submit all of their drafts and final versions at the end of the semester under a single cover (i.e., the writing portfolio). The assumption here is that students improve their writing through these multiple and guided revisions. However, given the number of papers students produce during a typical semester (it's 9,000 to 12,000 at our institution), how can we know, at a program level and on a routine basis, what happens between all these first and final drafts? How often and how much do students revise, what specific features do they typically change, and do their revisions match, exceed, or fall short of the learning outcomes and more general expectations of the FYC courses?

In answering these questions, the scholarship on revision has been fairly consistent: students revise infrequently, and, when they do make changes to their papers, they typically focus on minor edits and surface errors [3, 5, 6, 11]. According to Bazerman, "students tend to revise essays shallowly," focusing primarily on "phrasal adjustments and sentence correctness" [1, p. xii]. Arguing along similar lines, Sommers says that students typically "understand the revision process as a rewording activity"—that is, finding just the right word or eliminating lexical redundancies [11, p. 381]. Haar and Horning claim that while students occasionally "revise extensively," they are "more likely to stick to surface correction and small changes" [4, p. 4]. All in all, and especially when compared to more experienced writers, students lack a robust approach to revision, one that includes revision strategies that extend beyond word- and phrase-level changes.

As valuable as this research has been in helping us understand and respond to student revision, it is limited in two important respects, limitations that Faigley and Witte acknowledge in their own and prior studies and that still seem applicable today. First, owing to the "complexity of the analysis" involved, researchers have restricted their studies to only a "small number of subjects" [3, p. 411]. Faigley and Witte, for instance, include only 18 subjects in their study, while Sommers [11] includes 40, Horning [5] includes 9, and Treglia [13] includes 43. Second, while explaining the causes of revision, researchers focus too narrowly on the "skill of the writer" and thus ignore a range of other "situational variables" that contribute to revision or its absence ([3, p. 410]; see also [7, pp. 258-264]). In other words, "revision cannot be separated from other aspects of composing, especially during that period when writers come to grips with the demands of the particular writing situation." Research that neglects these "situational variables" is "likely to be skewed" [3, p. 411].
Both these limitations involve problems of scale: too few subjects and too few variables considered. Toward overcoming these limitations, as well as answering the question with which this essay begins ("How can we know what happens between all of these first and final drafts?"), we approached revision, and FYC more generally, as a "big data" phenomenon. More specifically, we built a corpus of first and final drafts from our students' portfolios and developed software to process them. This software allows us to examine revisions in student papers, to explore correlations between these revisions and the situational variables that may influence them, and to perform both of these operations at scale. What we found differs considerably from past research: unlike students in other studies, ours rarely focused on minor edits and surface corrections; instead, when they did revise, their changes primarily involved deleting and, more frequently, inserting complete sentences. What this suggests more generally is that our students see revision not as a "rewording activity," but as a sentence deletion and insertion activity, treating their original drafts as fixed structures into which they plug or unplug not words, but sentences.

In the rest of this paper, we describe our data set, the program we developed to analyze it, and the results it produced. We conclude by outlining future directions for our project and how "big data" analytics informs that work.

2. DATA AND PROGRAMMING
FYC at the University of South Carolina is taught in about 150 (fall semester) and 120 (spring semester) sections, each with about 20-24 students who each write three or four draft and final papers, for a rough total of about 10,000 pairs of papers each semester. These are submitted to a content management system from which we download the papers. Earlier downloads were done manually; we have since devised a script that largely automates the download. Most papers are submitted as .doc or .docx files, which can be turned into ASCII text with a Python program. Scripts and programs convert these to a standard file naming scheme and clean the ASCII files of the various Unicode or nonstandard characters that would complicate later processing (smart quotes, em dashes, en dashes, ellipses, and so forth).

We do lose some data along the way. A small fraction of the papers are submitted in formats other than .doc or .docx, and at present we do not process these. Subsequent versions of our code may be able to make use of PDF, Pages, or ODT files, for example, but we have not done that yet. We are at present drawing rather coarse conclusions from a corpus that is already large, and we would not expect students submitting PDF files, for example, to be statistically different as writers from students submitting .doc files. We remark that each paper averages a little less than 10,000 characters, so that 10,000 pairs of papers amount to only about 200 megabytes of data each semester. This is substantial enough to require some management and organization but is by no means problematic; the quantity of data is less a management problem than is separating the files into class sections, keeping track of which papers come from which standard assignment, and so on.

Similarly, we admit that our "cleaning" process could introduce corruptions in ways that might make some detailed analysis difficult or impossible. Again, however, we do not imagine that a few such character changes, if applied consistently to draft and final, would change the overall analysis currently being done.

To analyze the data set, we used Python programs (only about 2,500 total lines of code) together with the Natural Language Toolkit (NLTK) [10] and limited use of the Stanford NLP package [12] for processing the data. The NLTK routines were used primarily for breaking the documents into sentences and paragraphs. Having broken both draft and final versions into sentences, we used edit distance, a standard measure of similarity [8, 9, 14], to compute the "similarity" between sentences in draft and final versions.

Using this measure, we were able to quantify the "distance" between draft and final sentences by looping through those sentences and aligning pairs of sentences whose distance falls within a gradually increasing threshold. On its first pass, our program aligns sentences with an edit distance of zero (no difference between the sentences). On its next pass, it looks between aligned sentences and aligns, in the intervening space, the pair of sentences with the smallest pairwise distance. It repeats this step until the smallest pairwise distance exceeds 50% of the worst-case distance (the worst case being the distance achieved by deleting each word from the draft sentence and then inserting each word from the final sentence). We chose this stopping point after visually inspecting scores of sentences and determining that, beyond 50% of the worst-case distance, the program would likely be aligning two different sentences.
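To make this procedure concrete, the sketch below approximates it in Python. It is our illustration rather than the project's actual code: it uses NLTK's sentence tokenizer and word-level edit distance, repeatedly pairs the closest remaining sentences within the gaps between already-aligned pairs, and stops once the best available match exceeds 50% of the worst-case distance. Draft sentences left unmatched count as deletions; unmatched final sentences count as insertions.

# Illustrative sketch only (not the authors' code): align draft and final
# sentences by word-level edit distance, closest pairs first, stopping at
# 50% of the worst-case distance.
from nltk.tokenize import sent_tokenize          # requires the "punkt" model
from nltk.metrics.distance import edit_distance

def align_sentences(draft_text, final_text, cutoff=0.5):
    draft = [s.split() for s in sent_tokenize(draft_text)]
    final = [s.split() for s in sent_tokenize(final_text)]
    aligned = []  # non-crossing (draft_index, final_index, distance) triples

    def gaps():
        # Index ranges of still-unaligned sentences between consecutive
        # aligned pairs (plus a sentinel at the end of both documents).
        anchors = sorted(aligned) + [(len(draft), len(final), 0)]
        lo_d = lo_f = 0
        for d_idx, f_idx, _ in anchors:
            yield range(lo_d, d_idx), range(lo_f, f_idx)
            lo_d, lo_f = d_idx + 1, f_idx + 1

    while True:
        best = None
        for d_range, f_range in gaps():
            for i in d_range:
                for j in f_range:
                    dist = edit_distance(draft[i], final[j])
                    worst = len(draft[i]) + len(final[j])  # delete all, insert all
                    if dist > cutoff * worst:
                        continue  # beyond 50% of worst case: treat as different sentences
                    if best is None or dist < best[2]:
                        best = (i, j, dist)
        if best is None:
            break
        aligned.append(best)

    matched_d = {i for i, _, _ in aligned}
    matched_f = {j for _, j, _ in aligned}
    deleted = [i for i in range(len(draft)) if i not in matched_d]
    inserted = [j for j in range(len(final)) if j not in matched_f]
    return sorted(aligned), deleted, inserted

Because each new pair is chosen only from the gaps between sentences that are already aligned, the alignment never crosses itself, which corresponds to the order-preserving behavior of the algorithm noted in Section 3.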
3. RESULTS – STUDENT REVISIONS
When we ran our program on a test corpus, the results surprised us because they differed from what the scholarship on revision says we should be seeing. Unlike other studies of revision, which found that students typically focus on minor changes in diction, punctuation, and grammar, we found that when our students revised, the majority of their changes involved deleting and, especially, adding sentences. Consider Figure 1. This stacked bar chart shows the percentages of unchanged sentences (light blue), lightly edited sentences (red), sentences deleted from the draft or inserted into the final (green), and heavily edited sentences (purple). By far, the largest portion of sentences fall into the unchanged category; that is, the bulk of student writing survives unaltered from first to final draft. When we consider text that students actually changed, the bulk of those changes involve deleted and inserted sentences, followed by heavily edited and then lightly edited sentences. So while students do edit their text to some extent, their primary revision strategy involves treating their drafts as relatively fixed structures into which they plug or unplug, not words, but complete sentences.

[Figure 1: Fractions of sentences unedited, edited, or inserted/deleted in a sample of 15 sections.]

We remark that almost no great shifting of text occurs in our student papers. Our alignment algorithm is somewhat naïve in that it anchors the initial alignment to unchanged sentences and then continues with that alignment. Clearly, if entire paragraphs were moved, our algorithm would work poorly and we would see anomalous results for those papers. In fact, we see this happening in only a very small fraction of the papers.
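As a rough illustration of how each draft-final pair could be reduced to the four categories shown in Figure 1, the sketch below reuses the hypothetical align_sentences() from Section 2. The 10% boundary separating "lightly" from "heavily" edited pairs is an assumed placeholder; the paper does not report the exact cutoffs it uses.

# Hypothetical categorization building on align_sentences() above; the 10%
# light/heavy boundary is an assumed placeholder, not a figure from the paper.
from nltk.tokenize import sent_tokenize

def revision_profile(draft_text, final_text, light=0.10):
    draft = [s.split() for s in sent_tokenize(draft_text)]
    final = [s.split() for s in sent_tokenize(final_text)]
    aligned, deleted, inserted = align_sentences(draft_text, final_text)
    counts = {"unchanged": 0, "lightly_edited": 0, "heavily_edited": 0,
              "inserted_or_deleted": len(deleted) + len(inserted)}
    for i, j, dist in aligned:
        worst = len(draft[i]) + len(final[j])   # delete all, insert all
        if dist == 0:
            counts["unchanged"] += 1
        elif dist <= light * worst:
            counts["lightly_edited"] += 1
        else:
            counts["heavily_edited"] += 1
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}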
4. FUTURE DIRECTIONS
Our next steps involve explaining the revision practices we are observing. In other words, having gained a better sense of what happens between all those first and final drafts, we now plan to explore why it happens. Toward that end, our work will continue to be informed by big-data analytics. What do we mean by this? The phrase "big data" refers both to a large data set and to a collection of computational techniques for analyzing it. Both meanings apply to our project. Our corpus will eventually consist of tens of thousands of papers, a size much too large for humans to analyze in detail, so we will use natural language processing to capture and quantify features that, taken together, offer a linguistic profile of each paper. Once those features are quantified, we will employ other computational techniques (e.g., linear regression and cluster analysis) to search for correlations (and other patterns) among the papers in the corpus.

The program we have already developed supplies us with a relatively finely tuned computational model for revision. Equipped with this model, we have multiple paths forward, and, in the spirit of big data, we will explore as many of them as we can—including, but not limited to, the following:

• Turn each draft-final pair into a four-dimensional vector (i.e., the frequencies of unchanged, inserted, deleted, and edited sentences) and use cluster analysis to see if those pairs fall into any groupings (see the sketch following this list). If they do, then look within and across those clusters to see if other written features or situational variables correlate with those groupings.

• Compare the aggregate of deleted sentences (which students presumably thought were bad) with the aggregate of inserted ones (which they presumably thought were better).

• Measure sentence complexity trends in our corpus against those found in other genres, using a distinction between clausal complexity (a characteristic of spoken discourse) and phrasal complexity (a characteristic of academic writing) [2]. Do students' sentence structures align more with spoken discourse or with academic writing?

• Examine students' use of "evidentials" and compare them against their revision scores. The term "evidentials" refers to linguistic features that signal a writer's source of information and his or her perceptions about its reliability, including reporting verbs (e.g., "say," "think," and "argue"), adverbs (e.g., "actually," "probably," and "certainly"), and modals (e.g., "could," "should," and "must").

• Collaborate with other institutions that have assembled similar corpora of student writing and run their data through our program. By seeing results produced by other institutions, we will gain a better sense of whether the sentence deletion and insertion practice we observed in our corpus is a more general trend or a phenomenon peculiar to our FYC program and its curriculum. Either result would be of interest: if the data from other institutions looks like the USC data, then perhaps we have identified a broad characteristic of student writing. If that data is different, then we will have new questions to ask to determine why one group of students revises differently from the other.
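For the first of these paths, a minimal sketch of what the clustering step might look like is given below. It assumes each paper has already been reduced to a four-dimensional profile (for instance, the output of the hypothetical revision_profile() above); the choice of k-means via scikit-learn, and of four clusters, is ours for illustration, since the text commits only to "cluster analysis" in general.

# Hypothetical clustering sketch: four-dimensional revision profiles grouped
# with k-means. The algorithm and cluster count are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

FEATURES = ["unchanged", "lightly_edited", "heavily_edited", "inserted_or_deleted"]

def cluster_papers(profiles, n_clusters=4):
    # profiles: list of dicts keyed by FEATURES, each value a fraction of sentences
    X = np.array([[p[k] for k in FEATURES] for p in profiles])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return model.labels_, model.cluster_centers_

The cluster labels could then be cross-tabulated against situational variables (section, assignment, grade, and so on) to look for the groupings the bullet describes.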
5. CONCLUSION
Thus far, our project addresses one of the limitations Faigley and Witte point out in revision research: rather than restricting our research to "a small number of subjects," we are able to examine revision patterns in tens of thousands of student papers at one go. In doing so, we have unearthed trends in student writing that past studies of revision fail to predict—namely, the prevalence of sentence deletion and insertion. As we move forward with our project, we will address Faigley and Witte's second limitation: too few "situational variables" considered. Having quantified revision, we can now explore correlations between it and dozens of these variables, including grades, student major, teacher feedback, gender, and a host of features in the co-text of student revisions (e.g., sentence complexity, lexical sophistication, metadiscourse, etc.). As we do so, we will continue to enrich our understanding of what happens between all of those first and final drafts.

6. ACKNOWLEDGMENTS
We thank the Center for Digital Humanities at the University of South Carolina for financial support for student assistants, and we acknowledge the work of graduate student Gerald Jackson and undergraduate students Brian Flick, Chelsea Reeser, Sam Watson, and Ming Wong at various stages of this research.

7. REFERENCES
[1] C. Bazerman. Preface. In A. Horning and A. Becker, editors, Revision: History, Theory, and Practice. Parlor Press, West Lafayette, IN, 2006.
[2] D. Biber, B. Gray, and K. Poonpon. Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly, 45.1:5–35, 2011.
[3] L. Faigley and S. Witte. Analyzing revision. CCC, 32.4:400–414, 1981.
[4] C. Haar and A. Horning. Introduction and overview. In A. Horning and A. Becker, editors, Revision: History, Theory, and Practice. Parlor Press, West Lafayette, IN, 2006.
[5] A. Horning. Revision Revisited. Hampton Press, Inc., Cresskill, NJ, 2002.
[6] A. Horning and A. Becker, editors. Revision: History, Theory, and Practice. Parlor Press, West Lafayette, IN, 2006.
[7] J. Jones. Patterns of revision in online writing: A study of Wikipedia's featured articles. Written Communication, 25.2:262–289, 2008.
[8] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[9] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.
[10] NLTK.org. Natural Language Toolkit, 2016. http://www.nltk.org.
[11] N. Sommers. Revision strategies of student writers and experienced adult writers. CCC, 31.4:378–388, 1980.
[12] Stanford Natural Language Processing Group. Natural language processing package, 2016. http://nlp.stanford.edu.
[13] M. O. Treglia. Teacher-written commentary in college writing composition: How does it impact student revisions? Composition Studies, 37.1:67–86, 2009.
[14] R. Wagner and M. Fischer. The string-to-string correction problem.
Journal of the ACM, 21:168–178, 1974.