<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Collaborative Review in Writing Analytics: N-Gram Analysis of Instructor and Student Comments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alex Rudniy</string-name>
          <email>arudniy@fdu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norbert Elliot</string-name>
          <email>elliot@njit.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fairleigh Dickinson University</institution>
          ,
          <addr-line>1000 River Rd, Teaneck, NJ 07666, 1-646-684-5876</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New Jersey Institute of Technology</institution>
          ,
          <addr-line>323 Dr. Martin Luther King Jr. Blvd, Newark, NJ 07102, 1-856-952-7680</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The purpose of this paper is to explore the use of n-gram analysis to analyze instructor and student comments elicited within My Reviewers, a web-based learning environment. Shown to be informative in a wide variety of applications, n-gram analysis is of interest in determining concept proliferation in topics, purposes, terminologies, and rubrics used in writing courses. As the present study demonstrates, unigram, bigram, trigram, fourgram, and fivegram analytic methods reveal important information about instructor and student use of concepts; in turn, such analysis holds the potential to lead to precise and actionable revision behaviors.</p>
      </abstract>
      <kwd-group>
        <kwd>context-informed linguistic analysis</kwd>
        <kwd>My Reviewers</kwd>
        <kwd>n-grams</kwd>
        <kwd>web-based learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In the present study, attention is given to a single course: ENC
1102: Rhetoric and Academic Research, a second-semester USF
undergraduate writing course [3]. The course introduces students
to rhetorical conventions and provides them with an opportunity
to analyze, research, and compose arguments. Designed to
improve academic writing, research, information literacy, and
critical thinking abilities, the course is unique in its focus on
exploring the ways that writers gain agency—that is, credibility
through argument, negotiation, and reasoning. In addition, the
course incorporates projects using distinct print and digital genres.
Because of its uniqueness (focus on writer agency) and variation
(use of multiple genres), the course is ideal for exploring the
usefulness of n-gram analysis in providing context-specific
information about course concepts.
To lend specificity to the analysis, this study uses the term course
concept proliferation. Generally speaking, first-year
postsecondary writing courses simultaneously advance knowledge
and skills as part of the cognitive domain of the course [4]. For
example, the ability to think critically about a specific topic (how
writers gain agency through evidence) is demonstrated through
mastery of genre (how an essay is organized through claims).
Analysis of instructor and student comments affords examination
of the proliferation of key course terms involving instruction and
key trait terms involving assessment. As such, concept proliferation is
defined as the degree to which course terms and assessment traits
are present in comments—and what that presence suggests
regarding instruction that unifies topic, purpose, terms, and
rubrics for the benefit of students.</p>
    </sec>
    <sec id="sec-2">
      <title>2. N-GRAM ANALYSIS</title>
      <p>Because of its straightforward assumptions, n-gram analysis is
ideal for a basic analysis of course concepts students should know
and the evaluation of those concepts through rubric use.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Definition</title>
      <p>
        An n-gram is defined as a sequence of n items as they appear in
text—letters, words, phonemes, part-of-speech (POS) tags, or
other elements. The n in n-gram denotes the number of items in a
sequence. Commonly, a single word is referred to as a unigram;
two words are referred to as a bigram; three words constitute a
trigram; four words constitute a four-gram; and five words
constitute a fivegram [
        <xref ref-type="bibr" rid="ref2">5,6</xref>
        ].
      </p>
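      <p>The definition above can be made concrete with a short sketch. This is an illustration only; the sample comment is invented, not drawn from the study's data.</p>

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical comment used only to illustrate the definition.
comment = "make sure your sources are credible"
tokens = comment.split()

print(ngrams(tokens, 1))  # unigrams: single words
print(ngrams(tokens, 2))  # bigrams: pairs of adjacent words
print(ngrams(tokens, 3))  # trigrams: triples of adjacent words
```

      <p>Sliding a window of size n across the token sequence yields len(tokens) - n + 1 n-grams.</p>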
    </sec>
    <sec id="sec-4">
      <title>2.2 Early Work</title>
      <p>
        The history of the n-gram model originates with Markov [
        <xref ref-type="bibr" rid="ref3 ref4">7, 8</xref>
        ]. N-grams
are considered a version of the multi-order Markov model in
which the probability of the Nth element depends on the previous
N-1 elements and can be obtained from data [5]. Shannon [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ] and
Chomsky [
        <xref ref-type="bibr" rid="ref6 ref7">10, 11</xref>
        ] are known for applying n-grams for predicting
subsequent elements within sequences (e.g., Shannon game) [
        <xref ref-type="bibr" rid="ref8">12</xref>
        ].
These elements can vary from a single character to a linguistic
entity [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ].
      </p>
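      <p>As a minimal illustration of this Markov view (the toy corpus below is invented for the example), a bigram model estimates the probability of a word from counts of its single predecessor: P(w2 | w1) = count(w1 w2) / count(w1).</p>

```python
from collections import Counter

# Toy corpus; in a bigram (order-1 Markov) model the probability of
# each word depends only on the previous word.
corpus = "the paper is good the paper is organized".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(p_bigram("paper", "is"))  # 1.0: "paper" is always followed by "is"
print(p_bigram("is", "good"))   # 0.5: "is" is followed by "good" half the time
```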
      <p>
        In the 1950s, 1960s, and 1970s, n-gram models of orders one to five
were used as a stand-alone research method in early works on
natural language processing, in particular for hand-printing
recognition and standardization, reading machines for the blind,
and language computational analysis. Due to computational
restrictions of that era, character n-grams were widely used in a
large number of studies [
        <xref ref-type="bibr" rid="ref9">13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Contemporary N-Gram Applications</title>
      <p>
        Bassil [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ] designed an n-gram-based method for spelling
correction and evaluated it on the Yahoo! N-Grams Dataset 2.0
consisting of word n-grams of sizes from 1 to 5 [
        <xref ref-type="bibr" rid="ref10">14</xref>
        ]. Nadkarni et
al. [5] describe the applications of character n-grams for
autocompletion of words and phrases, spelling correction, speech
recognition, and word disambiguation on the Google n-gram
dataset for n=1..5, which was assembled from web data and the
Google Books project [
        <xref ref-type="bibr" rid="ref11">15</xref>
        ]. The Google Books N-Gram Corpus is
commonly used for analyzing cultural, social, and linguistic
trends. It contains n-grams and their frequencies retrieved from
books in several languages over the past five hundred years [
        <xref ref-type="bibr" rid="ref12 ref13">16,
17</xref>
        ]. Mayfield and McNamee [
        <xref ref-type="bibr" rid="ref14">18</xref>
        ] applied n-gram tokenization for
stemming in a language-independent way. Gencosman et al. [
        <xref ref-type="bibr" rid="ref15">19</xref>
        ]
describe character n-gram applications in speech recognition,
optical character recognition, spelling correction, handwriting
recognition, and statistical machine translation. In addition,
Lecluze et al. [
        <xref ref-type="bibr" rid="ref16">20</xref>
        ] mention examples of character n-gram models
for author and language identification, speech analysis,
classification of multilingual documents, and information
retrieval.
      </p>
      <p>
        Rangarajan and Ravichandran [
        <xref ref-type="bibr" rid="ref17">21</xref>
        ] registered a US patent
describing a system and a method for indexing and retrieval of
stored documents using n-grams. While working on opinion
extraction and classification tasks, Dave et al. [
        <xref ref-type="bibr" rid="ref18">22</xref>
        ] identified the
n-gram model to be analytically competitive; specifically, trigrams
demonstrated the best performance compared to bigrams and
unigrams. Their work identified two major flaws related to
product reviews: rating inconsistency when qualitative
descriptions do not correlate with quantitative scores; and
ambivalence and comparison when an overall conclusion
contradicts a review body. Zhao [
        <xref ref-type="bibr" rid="ref19">23</xref>
        ] concludes that
bag-of-n-grams-based methods achieve state-of-the-art results for sentiment
classification of long movie reviews. Wang, McCallum, and Wei
[
        <xref ref-type="bibr" rid="ref20">24</xref>
        ] claim the importance of n-grams in multiple areas of NLP
and text mining, especially for parsing, machine translation and
information retrieval. The work by Bespalov et al. [
        <xref ref-type="bibr" rid="ref21">25</xref>
        ] determines
that the n-gram model in conjunction with latent semantic analysis
produces superior results for document-level classification tasks.
N-grams were successfully used by Chaovalit and Zhou [
        <xref ref-type="bibr" rid="ref22">26</xref>
        ] for
sentiment analysis. Lin and Hovy [
        <xref ref-type="bibr" rid="ref23">27</xref>
        ] demonstrated an
n-gram-based method for automatic document summarization that
outperforms human assessments in certain cases.
      </p>
      <p>
        Ye et al. [
        <xref ref-type="bibr" rid="ref24">28</xref>
        ] have established influential research in data mining
and classification, naming n-grams one of the three most important
approaches in text mining and sentiment classification. The
n-gram method is known as the simplest and most successful
method in language modeling [
        <xref ref-type="bibr" rid="ref25">29</xref>
        ].
      </p>
      <p>
        In writing analytics, n-gram models were used as a discriminator
of different genres for corpus analysis and register variations [
        <xref ref-type="bibr" rid="ref26">30</xref>
        ].
This research domain was expanded by multiple analyses
investigating n-gram variations between academic prose and
conversation [
        <xref ref-type="bibr" rid="ref27">31</xref>
        ]; analysis of frequencies, structural types and
functional categories of n-grams in textbooks [
        <xref ref-type="bibr" rid="ref28">32</xref>
        ]; student
writings in history and biology [
        <xref ref-type="bibr" rid="ref29">33</xref>
        ]; L1 and L2 academic writing
[
        <xref ref-type="bibr" rid="ref30">34</xref>
        ]; and n-gram frequencies in multiple registers [
        <xref ref-type="bibr" rid="ref31">35</xref>
        ]. Lately,
n-grams are used in the preprocessing and feature-extraction stages
while more advanced techniques are applied afterwards [
        <xref ref-type="bibr" rid="ref32">36</xref>
        ]. For
example, n-gram frequencies serve as feature values used by data
mining classification algorithms [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ]. Jain et al. [
        <xref ref-type="bibr" rid="ref33">37</xref>
        ] applied a
Markov model after extracting text features with bi- and tri-grams
and their frequencies.
      </p>
      <p>
        Justeson and Katz [
        <xref ref-type="bibr" rid="ref34">38</xref>
        ] used n-gram frequencies to identify
technical terms in texts. After sorting by frequency, this method
yielded noun phrases that were topically relevant to the
documents of their corpus. More recently, Aull [1] has used
n-gram analysis to distinguish first-year and expert writing by
emphasizing the bigram “I will.” Using such phrases, Aull found
that expert writers draw attention to their involvement in, and
control of, the socio-rhetorical subject matter of the text (e.g., “I
will discuss”). In this way, the expert writers demonstrate their
“text internal” presence and involvement within the unfolding
argument and evidence. In contrast, first-year college writers
adopted a more “text external” position in which they established
themselves as more of a participant in the “real world” outside of
the text (e.g., “I will always remember”).
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. RESEARCH QUESTIONS</title>
      <p>Baseline and descriptive, this study poses three questions:
1. How can n-gram analysis be used to examine concept proliferation of course terms students should know?
2. How can n-gram analysis be used to examine concept proliferation of assessment traits used to assess student work?
3. What type of n-gram analysis is best suited to examine concept proliferation?</p>
    </sec>
    <sec id="sec-7">
      <title>4. METHOD</title>
      <p>Instructor and student comments were retrieved from My
Reviewers for ENC 1102 courses offered during the 2014 and
2015 academic years. The data were anonymized as required by
federal regulations.</p>
      <p>
        My Reviewers allows free-response textual comments and
the designation of a numeric score on a 4-point scale for each of 5
rubric traits: focus, evidence, organization, style, and format. The
same essay draft is reviewed by several fellow students (peer
review) and an instructor (expert review). To ensure inter-rater
agreement, all comments in which instructor scores did not match
peer scores were removed. Ten datasets were then constructed,
two—one with instructor and one with peer comments—for each
of the 5 rubric traits using intermediate drafts. The dataset is
shown in Table 1.</p>
      <p>Microsoft SQL Server was used for preparing the datasets. For
text preprocessing and n-gram extraction, R, RStudio, and the tm
package were employed. Following a common procedure for the
pre-processing phase, text was converted to lower case; non-word
characters, numbers, and punctuation were removed. Stemming
was not applied in this study: because no automated algorithms
were used for subsequent text-feature comparison, reducing
n-grams to word base forms would only have complicated interpretation. In
future work, we plan to use stemmed n-grams as a preprocessing
step for more sophisticated analysis using LSA. Similarly,
adhering to common practice in text mining applications, the
corpus was stripped of stop words, though there is evidence this
operation may negatively affect results for certain tasks (e.g.
plagiarism detection) [
        <xref ref-type="bibr" rid="ref35">39</xref>
        ]. Finally, whitespace such as line breaks
and tabulation symbols was removed.
      </p>
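      <p>The study's preprocessing was done in R with the tm package; the following Python sketch approximates the same steps under stated assumptions. The stop-word list here is a small illustrative subset, not tm's list, and the sample text is invented.</p>

```python
import re

# Illustrative subset of English stop words; the study used a standard
# stop-word list, not this one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "your"}

def preprocess(text):
    """Lower-case the text, strip non-word characters (numbers and
    punctuation), remove stop words, and collapse whitespace such as
    tabs and line breaks."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop numbers and punctuation
    tokens = text.split()                  # split also collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("Make sure your 2 sources\tare Credible!"))
# ['make', 'sure', 'sources', 'credible']
```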
      <p>The corpus was tokenized into 1-, 2-, 3-, 4-, and 5-gram models.
N-gram frequencies were obtained with the help of a
term-document matrix displaying the frequency of terms occurring in a
collection of documents. The obtained models were used to build
subsets of the most common n-grams and of n-grams used more
than a hundred times per dataset. While analyzing corpus features,
n-grams used across criteria by peers, by instructors, and by both
instructors and peers were identified.</p>
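      <p>A minimal sketch of this tokenization-and-counting step, using invented comments in place of the actual datasets:</p>

```python
from collections import Counter

# Hypothetical comments standing in for one of the ten datasets.
comments = [
    "paper well organized good topic sentences",
    "good topic sentences logical progression ideas",
    "paper well organized logical progression",
]

def ngram_frequencies(docs, n):
    """Aggregate n-gram counts over all documents: a term-document
    matrix collapsed to total corpus frequencies."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts

# Most common bigrams across the corpus.
print(ngram_frequencies(comments, 2).most_common(3))
```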
    </sec>
    <sec id="sec-8">
      <title>5. RESULTS</title>
      <p>Results will be presented in terms of the study questions.
Interpretations will follow each result.</p>
    </sec>
    <sec id="sec-9">
      <title>5.1 N-gram Analysis and Course Terms</title>
      <p>Table 2 presents ENC 1102 course topics, their purpose, the
genres used, and terms that students should know from each
project. The dataset shown in Table 1 was assembled from each of
the three projects.</p>
      <p>Unique in this course is the use of constructed response tasks
based on topics uniformly used across course sections. Equally
unique are the clearly stated purpose of each topic, the variation in
genre across essays, websites, and oral presentations, and
identification of key course terms. Using the traits of focus,
evidence and organization as sources of information about course
knowledge, Table 3 presents a unigram analysis of each of the
course projects with attention to terms students should know.
Terms not mentioned in comments are listed with zero
frequencies. Following each term is the number of instances of
that term within the 100 most commonly used terms in the
comments.</p>
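      <p>The unigram analysis described here can be sketched as follows; the token list is hypothetical, and the course-term list echoes terms discussed for Table 3:</p>

```python
from collections import Counter

# Course terms drawn from the paper's discussion of Table 3.
course_terms = {"stakeholder", "rhetorical", "visual", "compromise",
                "argument", "ethos", "pathos", "logos"}

# Hypothetical tokenized comment corpus, for illustration only.
tokens = ("the argument needs a clear stakeholder rhetorical analysis "
          "good argument visual evidence compromise").split()

# The 100 most commonly used terms in the comments.
top_terms = {w for w, _ in Counter(tokens).most_common(100)}

# Concept proliferation: course terms present among the top terms,
# with frequencies, and course terms entirely absent (zero frequency).
present = {t: tokens.count(t) for t in course_terms & top_terms}
absent = sorted(course_terms - top_terms)

print(present)
print(absent)  # ethos, logos, pathos never appear in this toy corpus
```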
      <sec id="sec-9-1">
        <title>5.1.1 Course Term Results</title>
        <p>Distinct patterns emerge of congruence, disjuncture, and absence
in Table 3. There is notable congruence among the terms that both
instructors and students use. Regarding the trait of focus,
stakeholder, rhetorical, visual, compromise, and argument are
used in both instructor and student comments. Regarding the trait
of evidence, stakeholder, rhetorical, compromise, and argument
are used in both sets of comments. Regarding the trait identified
as organization, the terms stakeholder, rhetorical, compromise,
and argument are used in both sets of comments. There is also
notable disjuncture. In terms of the trait of focus, instructors use
the term visual twice as often as students. In terms of evidence,
the term rhetorical is used twice as often by instructors as by
students; as well, while instructors use the term visual, students do
not use that term at all. In terms of organization, instructors use
the term while students do not. There is a notable absence of key
terms in the comments of both groups: ethos, pathos, logos, kairos,
fallacies, empathy, negotiation, Rogerian, multimodality,
remediation, and non-engaged.</p>
      </sec>
      <sec id="sec-9-2">
        <title>5.1.2 Course Term Interpretation</title>
        <p>
          Patterns of congruence reveal that some of the course terms are
being used in comments on intermediate drafts by both instructors
and students. This pattern is praiseworthy and suggests a common
referential frame. However, instructors appear to associate the use
of visual artifacts as elements of evidence while students do not.
Similarly, terms such as rhetorical are much more commonly used
by instructors. In the case of terms from classical rhetoric—ethos,
pathos, and logos—there is no use by either group; nor is there
use of more contemporary rhetorical systems such as that
developed by Carl Rogers [
          <xref ref-type="bibr" rid="ref36">40</xref>
          ]. And the presence of logical
fallacies is not taken up by either group in the comments.
Regarding use of such information, curricular strategies might be
taken to ensure continued use of congruent terms, to investigate
differing use of terms by instructors and students, and to probe
more deeply into which terms are opaque or cosmetic and
therefore unlikely to be used to advance student learning.
        </p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>5.2 N-gram Analysis and Traits</title>
      <p>Table 4 presents the 5 assessment traits used in ENC 1102 and
their associated rubric terms.</p>
      <p>As is the case in the analysis of course terms, rubric traits also
reveal distinct patterns of congruence, disjuncture, and absence.</p>
      <sec id="sec-10-1">
        <title>5.2.1 Rubric Trait Results</title>
        <p>Unigram and bigram analyses for instructors and students are
largely congruent. For both groups, the presence of a thesis is
associated with focus, just as evidence derives from sources,
organization is understood as achieved through paragraphs, style
is associated with correct grammar, and format is achieved
through following specifications established by the Modern
Language Association. Absent are terms related to organization.
Regarding evidence, trigram analysis reveals some disjuncture.
Instructors note that sources establish credibility; students, in
contrast, note the presence and features of the works cited page—
a format substitution for the complexities of establishing claims.
Fourgram analysis reveals the presence of a writer, the innovator
Jane Chen, while student comments remain vague in their
reference to credible sources. Fivegram analysis continues to
reveal specificity in instructor comments regarding evidence while
students remain vague in noting that “quotes are really good.” In
terms of the rubric, absent are references to traits such as
synthesis, personal experiences, anecdotes, segues, diction, and
document design. Usefully, n-gram analysis clearly exposes
recurring patterns in writing comments through both the presence
and absence of concepts.</p>
      </sec>
      <sec id="sec-10-2">
        <title>5.2.2 Rubric Trait Interpretation</title>
        <p>As is the case with course terms, patterns of congruence reveal
that some rubric traits are being used in comments on intermediate
drafts by both instructors and students. This pattern suggests a
common referential frame often lacking across course sections.
However, the traits are general and do not seem to accommodate
multimodal genres; that is, while paragraphs are central to
constructing an academic, source-based essay, the rubric does not
address ways to achieve coherence in a website. Furthermore,
rubric traits do not address the oral presentation genre associated
with Project 3.</p>
        <p>It must be noted that genres beyond the essay may not be
evaluated within My Reviewers if instructors do not require that
intermediate drafts be uploaded to the platform for review. This
example demonstrates the complexities of capturing all student
performance within a digital environment.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5.3 N-gram Analysis and Concept Proliferation</title>
      <p>Tables 3, 4, and 5 reveal that various forms of n-gram analysis can
be very useful in capturing key course terms and rubric traits as
they are used in instructor and student comments. Implying
metacognition, review comments suggest a deep and deliberate
use of course concepts and evaluative frameworks. N-gram
analysis reveals the presence of such words—and the directions
that might be taken to examine their usefulness to students and
their absence in areas where more specific guidance may be
helpful to students.</p>
      <p>Where unigrams and bigrams yield larger sample sizes, however,
trigrams, fourgrams, and fivegrams reveal extremely small sample
sizes. The benefits and costs of these smaller sample sizes, and the
inferences drawn from them, should be taken into consideration
before their use.</p>
    </sec>
    <sec id="sec-13">
      <title>6. FURTHER RESEARCH DIRECTIONS</title>
      <p>
        In her call for context-informed corpus linguistics analysis, Aull
[1] has advanced connections between lexical analysis and
classroom applications. In such pedagogically-based applications
using bigram analysis, Forbes-Riley and Litman [
        <xref ref-type="bibr" rid="ref36">40</xref>
        ] have
developed approaches for adapting to student affect in intelligent
tutoring dialogue systems. At the level of the student, this study
confirms the possibility of connecting word-level patterns to
curricular design. Real-time communication of such information
to students and their instructors is the next step in advancing
context-informed corpus linguistics analyses that are
structured and actionable.
      </p>
      <p>[Table fragment: most frequent unigrams, bigrams, and trigrams in comments, with frequencies.
Unigrams: paper (1246), grammar (1019), errors (1005), good (943), sentences (910). Bigrams: word choice (256), grammatical errors (187), point view (187), grammar punctuation (124), make sure (110).
Unigrams: sources (2458), evidence (2312), paper (2309), good (2044), used (1932). Bigrams: text citations (365), credible sources (219), make sure (202), sources used (160), throughout paper (155). Trigrams: works cited page (92), use text citations (47), good use evidence (45), good use sources (37), just make sure (36).
Organization: unigrams paper (1118), well (980), paragraphs (969), paragraph (958), good (893). Bigrams: well organized (122), topic sentences (118), logical progression (71), organization paper (67), paper organized (60). Trigrams: paper well organized (44), paper organized well (23), essay well organized (11), logical progression ideas (11), transitions topic sentences (11).]</p>
    </sec>
    <sec id="sec-14">
      <title>7. ACKNOWLEDGEMENTS</title>
      <p>This research is supported by the National Science Foundation
under Award #1544239, “Collaborative Research: The Role of
Instructor and Peer Feedback in Improving the Cognitive,
Interpersonal, and Intrapersonal Competencies of Student Writers
in STEM Courses.” We wish to thank the support of our principal
investigator, Joseph M. Moxley, as well as our fellow
investigators Chris Anson, Christiane J. Donahue, Valerie Ross,
and Suzanne T. Lane. We are thankful for the reviews of Laura
Aull, Jill Burstein, and Dave Eubanks. Thanks also to Rafael
Walker for expert manuscript editing.</p>
    </sec>
    <sec id="sec-15">
      <title>8. REFERENCES</title>
      <p>[2] Dixon, Z., and Moxley, J.M. (2013). Everything is illuminated: What big data can tell us about teacher commentary. Assessing Writing 18, 241-256.
[3] University of South Florida (2016). First-year composition: University of South Florida. http://hosted.usf.edu/FYC/
[4] National Research Council. (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. Committee on Defining Deeper Learning and 21st Century Skills, J.W. Pellegrino and M.L. Hilton, Editors. Board on Testing and Assessment and Board on Science Education, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
[5] Nadkarni, P.M., Ohno-Machado, L., and Chapman, W.W. (2011). Natural language processing: an introduction. J Am Med Inform Assoc. 18, 544-551.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Aull</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>First-year university writing: A corpusbased study with implications for pedagogy</article-title>
          . London, UK: Palgrave Macmillan.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velasquez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanona-Hernández</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Syntactic N-grams as machine learning features for natural language processing</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>41</volume>
          ,
          <fpage>853</fpage>
          -
          <lpage>860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          , (
          <year>1913</year>
          ).
          <article-title>Essai d'une recherche statistique sur le texte du roman “Eugène Oneguine”</article-title>
          ,
          <source>Bull. Acad. Imper. Sci. 7</source>
          ,
          <fpage>153</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Bassil</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Parallel spell-checking algorithm based on Yahoo! N-grams dataset</article-title>
          .
          <source>International Journal of Research and Reviews in Computer Science</source>
          <volume>3</volume>
          ,
          <fpage>1429</fpage>
          -
          <lpage>1435</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Shannon</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          (
          <year>1948</year>
          ).
          <article-title>A mathematical theory of communication</article-title>
          .
          <source>Bell System Technical Journal 27</source>
          ,
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Chomsky</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>1957</year>
          ).
          <article-title>Syntactic structures</article-title>
          .
          <source>Mouton: The Hague.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Chomsky</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>1956</year>
          ).
          <article-title>Three models for the description of language</article-title>
          .
          <source>IRI Transactions on Information Theory</source>
          <volume>2</volume>
          ,
          <fpage>113</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Shannon</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          (
          <year>1951</year>
          ).
          <article-title>Prediction and entropy of printed English</article-title>
          .
          <source>Bell System Technical Journal 30</source>
          ,
          <fpage>50</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Suen</surname>
            ,
            <given-names>C.N.</given-names>
          </string-name>
          (
          <year>1979</year>
          ).
          <article-title>N-Gram Statistics for Natural Language Understanding and Text Processing</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>1</volume>
          ,
          <fpage>164</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [14]
          <source>Yahoo! Webscope dataset: Yahoo! N-Grams, ver. 2.0</source>
          . http://research.yahoo.com/Academic_Relations
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Franz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Brants</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Google N-gram database (all our N-grams belong to you)</article-title>
          . http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          , et al. (
          <year>2011</year>
          ).
          <article-title>Quantitative analysis of culture using millions of digitized books</article-title>
          .
          <source>Science</source>
          <volume>331</volume>
          ,
          <fpage>176</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Kulkarni</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          et al. (
          <year>2015</year>
          ).
          <article-title>Statistically significant detection of linguistic change</article-title>
          .
          <source>Proceedings of the 24th International Conference on World Wide Web</source>
          ,
          <fpage>625</fpage>
          -
          <lpage>635</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Mayfield</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>McNamee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2003</year>
          , July 28-August 1).
          <article-title>Single n-gram stemming</article-title>
          .
          <source>SIGIR'03</source>
          ,
          <fpage>415</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Gencosman</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          , et al. (
          <year>2014</year>
          ).
          <article-title>Character n-gram application for automatic new topic identification</article-title>
          .
          <source>Information Processing and Management</source>
          <volume>50</volume>
          ,
          <fpage>821</fpage>
          -
          <lpage>856</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Lecluze</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al. (
          <year>2013</year>
          ).
          <article-title>Which granularity to bootstrap a multilingual method of document alignment: Character n-grams or word n-grams?</article-title>
          <source>Procedia - Social and Behavioral Sciences</source>
          <volume>95</volume>
          ,
          <fpage>473</fpage>
          -
          <lpage>481</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Rangarajan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ravichandran</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>1998</year>
          ,
          Jan. 6).
          <article-title>System and method for portable document indexing using n-gram word decomposition</article-title>
          . U.S. Patent.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al. (
          <year>2003</year>
          ,
          May 3-4).
          <article-title>Mining the peanut gallery: Opinion extraction and semantic classification of product reviews</article-title>
          .
          <source>WWW2003</source>
          , Budapest, Hungary,
          <fpage>519</fpage>
          -
          <lpage>528</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Learning document embeddings by predicting n-grams for sentiment classification of long movie reviews</article-title>
          .
          <source>ICLR workshop contribution</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Topical n-grams: Phrase and topic discovery, with an application to information retrieval</article-title>
          ,
          <source>Proceedings of the 7th IEEE International Conference on Data Mining</source>
          ,
          <fpage>697</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Bespalov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al. (
          <year>2011</year>
          , October 24-28).
          <article-title>Sentiment classification based on supervised latent n-gram analysis</article-title>
          .
          <source>CIKM'11</source>
          ,
          Glasgow, Scotland,
          <fpage>375</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Chaovalit</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Movie review mining: A comparison between supervised and unsupervised classification approaches</article-title>
          .
          <source>HICSS 2005: Proceedings of the 38th Annual Hawaii International Conference on System Sciences</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.-Y.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Automatic evaluation of summaries using n-gram co-occurrence statistics</article-title>
          .
          <source>Proceedings of HLT-NAACL</source>
          <year>2003</year>
          ,
          <fpage>71</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Law</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Sentiment classification of online reviews to travel destinations by supervised machine learning approaches</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>36</volume>
          ,
          <fpage>6527</fpage>
          -
          <lpage>6535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>An</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuurmans</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cercone</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Applying machine learning to text segmentation for information retrieval</article-title>
          .
          <source>Information Retrieval</source>
          <volume>6</volume>
          ,
          <fpage>333</fpage>
          -
          <lpage>362</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <source>Procedia - Social and Behavioral Sciences</source>
          <volume>198</volume>
          ,
          <fpage>474</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Biber</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johansson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leech</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conrad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finegan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Quirk</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Longman grammar of spoken and written English</article-title>
          . London/New York: Longman.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Biber</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al. (
          <year>2004</year>
          ).
          <article-title>If you look at…: Lexical bundles in university teaching and textbooks</article-title>
          .
          <source>Applied Linguistics</source>
          <volume>25</volume>
          ,
          <fpage>371</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Lexical bundles in published and student disciplinary writing: Examples from history and biology</article-title>
          .
          <source>English for Specific Purposes</source>
          <volume>23</volume>
          ,
          <fpage>397</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y. H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Lexical bundles in L1 and L2 academic writing</article-title>
          .
          <source>Language Learning &amp; Technology</source>
          <volume>14</volume>
          ,
          <fpage>30</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [35]
          <string-name>
            <surname>Gries</surname>
            ,
            <given-names>S. T.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Bigrams in registers, domains, and varieties: a bigram gravity approach to the homogeneity of corpora</article-title>
          .
          <source>Proceedings of Corpus Linguistics</source>
          <year>2009</year>
          , University of Liverpool.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [36]
          <string-name>
            <surname>Nassirtoussi</surname>
          </string-name>
          , et al. (
          <year>2014</year>
          ).
          <article-title>Text mining for market prediction: A systematic review</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>41</volume>
          ,
          <fpage>7653</fpage>
          -
          <lpage>7670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Jaina</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al. (
          <year>2015</year>
          ).
          <article-title>Chunked n-grams for sentence validation</article-title>
          .
          <source>Procedia - Computer Science</source>
          <volume>57</volume>
          ,
          <fpage>209</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Justeson</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>Technical terminology: some linguistic properties and an algorithm for identification in text</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>1</volume>
          ,
          <fpage>9</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [39]
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2011</year>
          , Oct. 24-28).
          <article-title>Plagiarism detection based on structural information</article-title>
          .
          <source>CIKM'11</source>
          .
          Glasgow, Scotland.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [40]
          <string-name>
            <surname>Hairston</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1976</year>
          ).
          <article-title>Carl Rogers's alternative to traditional rhetoric</article-title>
          .
          <source>College Composition and Communication</source>
          ,
          <volume>27</volume>
          ,
          <fpage>373</fpage>
          -
          <lpage>377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [41]
          <string-name>
            <surname>Forbes-Riley</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Litman</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Using bigrams to identify relationships between student certainness states and tutor responses in a spoken dialogue corpus</article-title>
          .
          <source>Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue</source>
          , Portugal.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>