First-Year Composition as "Big Data": Examining Student Revisions at Scale

Chris Holcomb
English Language and Literature
University of South Carolina
Columbia, South Carolina
holcombc@mailbox.sc.edu

Duncan Buell
Computer Science and Engineering
University of South Carolina
Columbia, South Carolina
buell@acm.org

ABSTRACT
Approaching First-Year Composition (FYC) as a "big data" phenomenon, we have prototyped software to study revision in a large corpus of student papers and thus to address a question central to Composition and Rhetoric scholarship: "What role does revision play in students' writing processes?" After running our program on a corpus of student writing, we see that our computational analysis challenges past research on revision and extends the methodological reach of Composition and Rhetoric to include "big data" analytics.

Keywords
first-year composition, revision, text analysis

1. INTRODUCTION
For many First-Year Composition (FYC) programs, revision is the centerpiece of their writing pedagogies. Students draft each major writing assignment, receive feedback from their peers and instructors, revise their papers based on that feedback, and submit all of their drafts and final versions at the end of the semester under a single cover (i.e., the writing portfolio). The assumption here is that students improve their writing through these multiple and guided revisions. However, given the number of papers students produce during a typical semester (it's 9,000 to 12,000 at our institution), how can we know, at a program level and on a routine basis, what happens between all these first and final drafts? How often and how much do students revise, what specific features do they typically change, and do their revisions match, exceed, or fall short of the learning outcomes and more general expectations of the FYC courses?

In answering these questions, the scholarship on revision has been fairly consistent: students revise infrequently, and, when they do make changes to their papers, they typically focus on minor edits and surface errors [3, 5, 6, 11]. According to Bazerman, "students tend to revise essays shallowly," focusing primarily on "phrasal adjustments and sentence correctness" [1, p. xii]. Arguing along similar lines, Sommers says that students typically "understand the revision process as a rewording activity"—that is, finding just the right word or eliminating lexical redundancies [11, p. 381]. Haar and Horning claim that while students occasionally "revise extensively," they are "more likely to stick to surface correction and small changes" [4, p. 4]. All in all, and especially when compared to more experienced writers, students lack a robust approach to revision, one that includes revision strategies that extend beyond word- and phrase-level changes.

As valuable as this research has been in helping us understand and respond to student revision, it is limited in two important respects, limitations that Faigley and Witte acknowledge in their own and prior studies and that still seem applicable today. First, owing to the "complexity of the analysis" involved, researchers have restricted their studies to only a "small number of subjects" [3, p. 411]. Faigley and Witte, for instance, include only 18 subjects in their study, while Sommers [11] includes 40, Horning [5] includes 9, and Treglia [13] includes 43. Second, while explaining the causes of revision, researchers focus too narrowly on the "skill of the writer" and thus ignore a range of other "situational variables" that contribute to revision or its absence ([3, p. 410]; see also [7, pp. 258-264]). In other words, "revision cannot be separated from other aspects of composing, especially during that period when writers come to grips with the demands of the particular writing situation." Research that neglects these "situational variables" is "likely to be skewed" [3, p. 411].
Both these limitations involve problems of scale: too few subjects and too few variables considered. Toward overcoming these limitations, as well as answering the question with which this essay begins ("How can we know what happens between all of these first and final drafts?"), we approached revision, and FYC more generally, as a "big data" phenomenon. More specifically, we built a corpus of first and final drafts from our students' portfolios and developed software to process them. This software allows us to examine revisions in student papers, to explore correlations between these revisions and the situational variables that may influence them, and to perform both of these operations at scale. What we found differs considerably from past research: unlike students in other studies, ours rarely focused on minor edits and surface corrections; instead, when they did revise, their changes primarily involved deleting and, more frequently, inserting complete sentences. What this suggests more generally is that our students see revision not as a "rewording activity," but as a sentence deletion and insertion activity, treating their original drafts as fixed structures into which they plug or unplug not words, but sentences.

In the rest of this paper, we describe our data set, the program we developed to analyze it, and the results it produced. We conclude by outlining future directions for our project and how "big data" analytics informs that work.

2. DATA AND PROGRAMMING
FYC at the University of South Carolina is taught in about 150 (fall semester) and 120 (spring semester) sections, each with about 20-24 students who each write three or four draft and final papers, for a rough total of about 10,000 pairs of papers each semester. These are submitted to a content management system from which we download the papers. Earlier downloads were done manually; we have since devised a script that largely automates the download. Most papers are submitted as .doc or .docx files, which can be turned into ASCII text with a Python program. Scripts and programs convert these to a standard file naming scheme and clean the ASCII files of the various Unicode or nonstandard characters that would complicate later processing (smart quotes, em dashes, en dashes, ellipses, and so forth).

We do lose some data along the way. A small fraction of the papers are submitted in formats other than .doc or .docx, and at present we do not process these. Subsequent versions of our code may be able to make use of PDF, Pages, or ODT files, for example, but we have not done that yet. We are at present drawing rather coarse conclusions from a corpus that is already large, and we would not expect students submitting PDF files, for example, to be statistically different as writers from students submitting .doc files. We remark that each paper averages a little less than 10,000 characters, so that 10,000 pairs of papers amount to only about 200 megabytes of data each semester. This is substantial enough to require some management and organization but is by no means problematic; the quantity of data is less a management problem than is separating the files into class sections, keeping track of which papers come from which standard assignment, and so on.

Similarly, we admit that our "cleaning" process could introduce corruptions in ways that might make some detailed analysis difficult or impossible. Again, however, we do not imagine that a few such character changes, if applied consistently to draft and final, would change the overall analysis currently being done.

To analyze the data set, we used Python programs (only about 2,500 total lines of code) together with the Natural Language Toolkit (NLTK) [10] and limited use of the Stanford NLP package [12] for processing the data. The NLTK routines were used primarily for breaking the documents into sentences and paragraphs. Having broken both draft and final versions into sentences, we used edit distance, a standard measure of similarity [8, 9, 14], to compute the "similarity" between sentences in draft and final versions.

Using this measure, we were able to quantify the "distance" between draft and final sentences by looping through those sentences and aligning pairs of sentences whose distance falls within a gradually increasing threshold. On its first pass, our program aligns sentences with an edit distance of zero (no difference between the sentences). On its next pass, it looks between aligned sentences and aligns, in the intervening space, the pair of sentences with the smallest pairwise distance. It repeats this step until the smallest pairwise distance exceeds 50% of the worst-case distance (the worst case being the distance achieved by deleting each word from the draft sentence and then inserting each word from the final sentence). We chose this stopping point after visually inspecting scores of sentences and determining that, beyond 50% of the worst-case distance, the program would likely be aligning two different sentences.
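To make this procedure concrete, the sketch below approximates it in Python. It is our illustration rather than the project's actual code: it uses NLTK's sentence tokenizer and word-level edit distance, repeatedly pairs the closest remaining sentences within the gaps between already-aligned pairs, and stops once the best available match exceeds 50% of the worst-case distance. Draft sentences left unmatched count as deletions; unmatched final sentences count as insertions.

# Illustrative sketch only (not the authors' code): align draft and final
# sentences by word-level edit distance, closest pairs first, stopping at
# 50% of the worst-case distance.
from nltk.tokenize import sent_tokenize          # requires the "punkt" model
from nltk.metrics.distance import edit_distance

def align_sentences(draft_text, final_text, cutoff=0.5):
    draft = [s.split() for s in sent_tokenize(draft_text)]
    final = [s.split() for s in sent_tokenize(final_text)]
    aligned = []  # non-crossing (draft_index, final_index, distance) triples

    def gaps():
        # Index ranges of still-unaligned sentences between consecutive
        # aligned pairs (plus a sentinel at the end of both documents).
        anchors = sorted(aligned) + [(len(draft), len(final), 0)]
        lo_d = lo_f = 0
        for d_idx, f_idx, _ in anchors:
            yield range(lo_d, d_idx), range(lo_f, f_idx)
            lo_d, lo_f = d_idx + 1, f_idx + 1

    while True:
        best = None
        for d_range, f_range in gaps():
            for i in d_range:
                for j in f_range:
                    dist = edit_distance(draft[i], final[j])
                    worst = len(draft[i]) + len(final[j])  # delete all, insert all
                    if dist > cutoff * worst:
                        continue  # beyond 50% of worst case: treat as different sentences
                    if best is None or dist < best[2]:
                        best = (i, j, dist)
        if best is None:
            break
        aligned.append(best)

    matched_d = {i for i, _, _ in aligned}
    matched_f = {j for _, j, _ in aligned}
    deleted = [i for i in range(len(draft)) if i not in matched_d]
    inserted = [j for j in range(len(final)) if j not in matched_f]
    return sorted(aligned), deleted, inserted

Because each new pair is chosen only from the gaps between sentences that are already aligned, the alignment never crosses itself, which corresponds to the order-preserving behavior of the algorithm noted in Section 3.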
3. RESULTS – STUDENT REVISIONS
When we ran our program on a test corpus, the results surprised us because they differed from what the scholarship on revision says we should be seeing. Unlike other studies of revision, which found that students typically focus on minor changes in diction, punctuation, and grammar, we found that when our students revised, the majority of their changes involved deleting and, especially, adding sentences. Consider Figure 1. This stacked bar chart shows the percentages of unchanged sentences (light blue), lightly edited sentences (red), sentences deleted from the draft or inserted into the final (green), and heavily edited sentences (purple). By far, the largest portion of sentences fall into the unchanged category; that is, the bulk of student writing survives unaltered from first to final draft. When we consider text that students actually changed, the bulk of those changes involve deleted and inserted sentences, followed by heavily edited and then lightly edited sentences. So while students do edit their text to some extent, their primary revision strategy involves treating their drafts as relatively fixed structures into which they plug or unplug, not words, but complete sentences.

[Figure 1: Fractions of sentences unedited, edited, or inserted/deleted in a sample of 15 sections.]

We remark that almost no great shifting of text occurs in our student papers. Our alignment algorithm is somewhat naïve in that it anchors the initial alignment to unchanged sentences and then continues with that alignment. Clearly, if entire paragraphs were moved, our algorithm would work poorly and we would see anomalous results for those papers. In fact, we see this happening in only a very small fraction of the papers.
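As a rough illustration of how each draft-final pair could be reduced to the four categories shown in Figure 1, the sketch below reuses the hypothetical align_sentences() from Section 2. The 10% boundary separating "lightly" from "heavily" edited pairs is an assumed placeholder; the paper does not report the exact cutoffs it uses.

# Hypothetical categorization building on align_sentences() above; the 10%
# light/heavy boundary is an assumed placeholder, not a figure from the paper.
from nltk.tokenize import sent_tokenize

def revision_profile(draft_text, final_text, light=0.10):
    draft = [s.split() for s in sent_tokenize(draft_text)]
    final = [s.split() for s in sent_tokenize(final_text)]
    aligned, deleted, inserted = align_sentences(draft_text, final_text)
    counts = {"unchanged": 0, "lightly_edited": 0, "heavily_edited": 0,
              "inserted_or_deleted": len(deleted) + len(inserted)}
    for i, j, dist in aligned:
        worst = len(draft[i]) + len(final[j])   # delete all, insert all
        if dist == 0:
            counts["unchanged"] += 1
        elif dist <= light * worst:
            counts["lightly_edited"] += 1
        else:
            counts["heavily_edited"] += 1
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}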
4. FUTURE DIRECTIONS
Our next steps involve explaining the revision practices we are observing. In other words, having gained a better sense of what happens between all those first and final drafts, we now plan to explore why it happens. Toward that end, our work will continue to be informed by big-data analytics. What do we mean by this? The phrase "big data" refers both to a large data set and to a collection of computational techniques for analyzing it. Both meanings apply to our project. Our corpus will eventually consist of tens of thousands of papers, a size much too large for humans to analyze in detail, so we will use natural language processing to capture and quantify features that, taken together, offer a linguistic profile of each paper. Once those features are quantified, we will employ other computational techniques (e.g., linear regression and cluster analysis) to search for correlations (and other patterns) among the papers in the corpus.

The program we have already developed supplies us with a relatively finely tuned computational model for revision. Equipped with this model, we have multiple paths forward, and, in the spirit of big data, we will explore as many of them as we can—including, but not limited to, the following:

• Turn each draft-final pair into a four-dimensional vector (i.e., the frequencies of unchanged, inserted, deleted, and edited sentences) and use cluster analysis to see if those pairs fall into any groupings (see the sketch following this list). If they do, then look within and across those clusters to see if other written features or situational variables correlate with those groupings.

• Compare the aggregate of deleted sentences (which students presumably thought were bad) with the aggregate of inserted ones (which they presumably thought were better).

• Measure sentence complexity trends in our corpus against those found in other genres, using a distinction between clausal complexity (a characteristic of spoken discourse) and phrasal complexity (a characteristic of academic writing) [2]. Do students' sentence structures align more with spoken discourse or with academic writing?

• Examine students' use of "evidentials" and compare them against their revision scores. The term "evidentials" refers to linguistic features that signal a writer's source of information and his or her perceptions about its reliability, including reporting verbs (e.g., "say," "think," and "argue"), adverbs (e.g., "actually," "probably," and "certainly"), and modals (e.g., "could," "should," and "must").

• Collaborate with other institutions that have assembled similar corpora of student writing and run their data through our program. By seeing results produced by other institutions, we will gain a better sense of whether the sentence deletion and insertion practice we observed in our corpus is a more general trend or a phenomenon peculiar to our FYC program and its curriculum. Either result would be of interest: if the data from other institutions looks like the USC data, then perhaps we have identified a broad characteristic of student writing. If that data is different, then we will have new questions to ask to determine why one group of students revises differently from the other.
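For the first of these paths, a minimal sketch of what the clustering step might look like is given below. It assumes each paper has already been reduced to a four-dimensional profile (for instance, the output of the hypothetical revision_profile() above); the choice of k-means via scikit-learn, and of four clusters, is ours for illustration, since the text commits only to "cluster analysis" in general.

# Hypothetical clustering sketch: four-dimensional revision profiles grouped
# with k-means. The algorithm and cluster count are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

FEATURES = ["unchanged", "lightly_edited", "heavily_edited", "inserted_or_deleted"]

def cluster_papers(profiles, n_clusters=4):
    # profiles: list of dicts keyed by FEATURES, each value a fraction of sentences
    X = np.array([[p[k] for k in FEATURES] for p in profiles])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return model.labels_, model.cluster_centers_

The cluster labels could then be cross-tabulated against situational variables (section, assignment, grade, and so on) to look for the groupings the bullet describes.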
5. CONCLUSION
Thus far, our project addresses one of the limitations Faigley and Witte point out in revision research: rather than restricting our research to "a small number of subjects," we are able to examine revision patterns in tens of thousands of student papers at one go. In doing so, we have unearthed trends in student writing that past studies of revision fail to predict—namely, the prevalence of sentence deletion and insertion. As we move forward with our project, we will address Faigley and Witte's second limitation: too few "situational variables" considered. Having quantified revision, we can now explore correlations between it and dozens of these variables, including grades, student major, teacher feedback, gender, and a host of features in the co-text of student revisions (e.g., sentence complexity, lexical sophistication, metadiscourse, etc.). As we do so, we will continue to enrich our understanding of what happens between all of those first and final drafts.

6. ACKNOWLEDGMENTS
We thank the Center for Digital Humanities at the University of South Carolina for financial support for student assistants, and we acknowledge the work of graduate student Gerald Jackson and undergraduate students Brian Flick, Chelsea Reeser, Sam Watson, and Ming Wong at various stages of this research.

7. REFERENCES
[1] C. Bazerman. Preface. In A. Horning and A. Becker, editors, Revision: History, Theory, and Practice. Parlor Press, West Lafayette, IN, 2006.
[2] D. Biber, B. Gray, and K. Poonpon. Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly, 45.1:5–35, 2011.
[3] L. Faigley and S. Witte. Analyzing revision. CCC, 32.4:400–414, 1981.
[4] C. Haar and A. Horning. Introduction and overview. In A. Horning and A. Becker, editors, Revision: History, Theory, and Practice. Parlor Press, West Lafayette, IN, 2006.
[5] A. Horning. Revision Revisited. Hampton Press, Inc., Cresskill, NJ, 2002.
[6] A. Horning and A. Becker, editors. Revision: History, Theory, and Practice. Parlor Press, West Lafayette, IN, 2006.
[7] J. Jones. Patterns of revision in online writing: A study of Wikipedia's featured articles. Written Communication, 25.2:262–289, 2008.
[8] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1966.
[9] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.
[10] NLTK.org. Natural Language Toolkit, 2016. http://www.nltk.org.
[11] N. Sommers. Revision strategies of student writers and experienced adult writers. CCC, 31.4:378–388, 1980.
[12] Stanford Natural Language Processing Group. Natural language processing package, 2016. http://nlp.stanford.edu.
[13] M. O. Treglia. Teacher-written commentary in college writing composition: How does it impact student revisions? Composition Studies, 37.1:67–86, 2009.
[14] R. Wagner and M. Fischer. The string-to-string correction problem.
Journal of the ACM, 21:168–178, 1974.