A Quick Intensive Course on Natural Language Processing applied to Literary Studies

Borja Navarro-Colorado
Department of Software and Computing Systems
University of Alicante
borja@dlsi.ua.es

Abstract

This paper presents how Natural Language Processing is taught to students of a Master's Degree in Literary Studies. These students' background is solely humanistic, and they have no knowledge whatsoever of Natural Language Processing (NLP). The challenge is to introduce these students to the main aspects of NLP in a 20-hour course, and show them how they can apply these techniques to the analysis of literary texts. The course focuses on three main aspects: first, to get to know a new approach to literary text analysis based on the distant reading model; second, to develop a representative literary corpus; and, finally, to apply basic NLP techniques to said corpus in order to extract relevant data. Among these techniques are word frequencies, Part of Speech tagging and distributional semantic models such as LDA Topic Modeling. Satisfaction surveys show that students are satisfied with the course.

1 Introduction

Teaching Natural Language Processing to Literature students is currently a great challenge. Students come to the course with good skills for close-reading literary text analysis and a good background in history of literature and even in literary theory. However, they do not have enough technological or mathematical background to understand how current Natural Language Processing techniques work and how they can be applied to the analysis of literary texts. Indeed, at the beginning, they are unsure about the usefulness of these resources for literary studies.

In this paper I will present the objectives and contents of a Master's course (two credits) focused on the application of computational techniques (mainly Natural Language Processing) to literary text analysis. The subject is framed in Moretti's distant reading model (Moretti, 2007; Moretti, 2013; Jockers, 2013) because I think that it is within this approach that NLP techniques are really useful for literary studies. Following a standard empirical text-analysis process, the course is organized in two main modules: the first one is devoted to corpus design, compilation and annotation; the second one is devoted to the application of some specific Natural Language Processing techniques such as, among others, lexical frequency analysis, part of speech tagging, named-entity recognition or distributional semantic analysis (LDA Topic Modeling (Blei et al., 2003)). My procedure in class is as follows: first I show my students how each of these techniques works, and then what can be expected from them when applied to literary texts. This way students extract empirical data from the corpus that they must interpret according to their literary knowledge.

Students usually pass the course without much difficulty. Satisfaction surveys show that the course is well received among students. In general they accept the necessity of empirical data to complement traditional literary analysis. However, only a few students eventually apply some of these NLP techniques in their final Master's Thesis or PhD Thesis.

In the next section I will present first the course context and the student profile; then I will show the main objectives of the course, how the content is organized and how it is taught to students (theory and practice); to conclude, I will propose some ideas for a Digital Humanities curriculum based on this experience.

2 Course context and student profile
The course is called "Computer resources for literary research" (it is taught in Spanish; the exact name is "Recursos informáticos para la investigación literaria": http://www.dlsi.ua.es/~borja/riilua/). It is a twenty-hour course included in the Master's Degree in Literary Studies (https://maesl.ua.es/index.html) at the University of Alicante (Spain). It is taught face-to-face in a computer lab classroom, where students perform their tasks under the teacher's supervision. The only task that is done outside the computer lab is the students' final essay, in which they must apply the NLP techniques they have learned during the course.

The course is taught by a teacher whose background is Spanish Language and Literature (B.A.) with a PhD in Natural Language Processing. So far it has been taught three times, starting in the 2014-2015 school year. The course had an attendance of 11 students the first year, 12 students the second year and 17 students this last year.

The students who take this course are usually young graduates in Literature. All of them share a good background in humanities and history of literature, and they use similar research methods for the traditional analysis of literary texts. They differ in the literary tradition that they have studied. Most of them are graduates in English Literature or Spanish (Castilian) Literature, but there are also graduates in other literary traditions such as Catalan, Arabic or French Literature. Some of them have a background in Linguistics as well. In this context, the course is focused mainly on the computational analysis of Spanish (both Catalan and Castilian) and English literary texts.

The students' knowledge of mathematics and computers is poor. As far as mathematics is concerned, their knowledge basically comes down to what they learned in high school. As regards computers, they are digital natives and use computers in their daily life; however, they have no knowledge at all of Computer Science: algorithms, programming, etc.

On the other hand, these students are familiar not only with the main concepts of Linguistics, but also with the literary criticism models that apply linguistic techniques to literary analysis (such as Russian Formalism, Structuralism or New Criticism). Therefore, they clearly understand the linguistic aspects of Natural Language Processing and its main problem (linguistic ambiguity). However, their lack of a thorough computational background makes it hard for them to understand how NLP works, that is, the mathematical basis of NLP. During the course, not only do I explain how to use NLP tools, but I also try to clarify how they work, that is, how these tools deal with linguistic ambiguity.

3 General objectives

In only 20 teaching hours it is not possible to introduce Python or any other programming language in the course. This limitation leaves most Natural Language Processing tools out of the syllabus. Moreover, I avoid focusing the course only on the technical application of NLP tools. More than this, students must understand the important contribution of these tools to literary studies, mostly because, before they take this course, they do not see why they should apply computational tools to the analysis of literary texts. They consider their (manual and close-reading) methodological skills and literary analysis models sufficient, so they do not see the usefulness of computational analysis. If I want to analyze the metrical aspects of García Lorca's "Little Viennese Waltz", why do I need a computer, when I can analyze all these lines properly by hand? This question is related to the usefulness of NLP for literary studies.

Our first objective is to show that the application of NLP techniques to literary text analysis makes sense only if by using these techniques I can learn something new about the literary phenomenon. The application of NLP tools to emulate human analysis makes no sense. On the contrary, they must be applied where manual analysis cannot reach.

In this regard, Moretti's Distant Reading model (Moretti, 2007; Moretti, 2013; Jockers, 2013) sets up a framework where the computational analysis of literary texts is not only useful but also necessary. I am referring to the computational analysis of large corpora in order to extract common patterns and regularities from the texts and, in general, implicit and unknown information that cannot be extracted by means of a manual analysis. Of course it is better to analyze manually the metrics of García Lorca's "Little Viennese Waltz", but it is not possible for a human being to analyze the whole metrics of all Spanish Golden Age Poetry (all the Spanish poetry composed during the 16th and 17th centuries). In this case the use of computational analysis and NLP techniques is mandatory, and it will probably show some regularities about the period that traditional approaches are not able to detect. Both approaches are, in the end, complementary.

The second objective of the course is to formulate big questions. In order to apply the Distant Reading model, students must first learn how to formulate big literary questions, questions that could be answered by applying NLP techniques to a large literary corpus. At the beginning students pose small questions as the basis for the analysis of a single novel or the work of a specific poet. I encourage students to think of big literary questions: not questions about a specific author or a specific piece of literary work, but questions about whole literary periods or genres, for example.

To develop this new point of view, I use the easy but powerful Google Books n-gram viewer (https://books.google.com/ngrams). It allows the student to look for word and n-gram frequencies in the Google Books collection and display them in a timeline. With this tool students practice how to think big. They formulate big questions about literature or, in general, cultural aspects, and then look for data in the Google n-gram tool. They then analyze the data provided by the tool and try to answer the question. The kind of questions formulated are based on Michel et al. (2011).

Finally, the third main objective of the course is to show that the application of these techniques sometimes provides quantitative data that, rather than answers, produce new research questions that must be studied (Moretti, 2007).

Once students accept these ideas they are ready to learn about the technical aspects of NLP. Now they are able to appreciate the usefulness of NLP for literary studies, and the course makes sense for them.
4 Content and lessons

Content and syllabus are based on my own research experience. This is why the content of the course is structured following a standard empirical text analysis. Given that the main objective of the Master's Degree is to prepare students for research in literary studies, this structure fits well with student expectations. In any case, the content and lessons of this course are not based on any specific previous course. Besides my own experience, to set up the course content I have taken into consideration tutorials such as Manning (2011), handbooks such as Jockers (2014), Pustejovsky and Stubbs (2013) and Jurafsky and Martin (2008), and courses on Corpus Linguistics such as McEnery (2013) or on Natural Language Processing such as Jurafsky and Manning (2012).

The syllabus of the course is as follows:

• Introduction. Objectives (2 hours).

• Module 1. Corpus compilation.
  1. Corpus design and compilation (2 hours).
  2. Corpus annotation (4 hours).

• Module 2. Corpus analysis.
  3. Frequencies, n-grams and concordances (2 hours).
  4. Regular Expressions (2 hours).
  5. Natural Language Processing (4 hours).
  6. Text mining (4 hours).

The main idea of the first module is that only with a representative literary corpus is it possible to reach reliable conclusions. The literary analysis depends, eventually, on the quality of the corpus and its annotation. In this module students learn about basic aspects of Corpus Linguistics: how to select representative texts according to a set of objective criteria; how to find, download and clean texts in order to obtain plain text; how to store text files; how to deal with text-encoding problems, etc. (Wynne, 2004; Bowker and Pearson, 2002; McEnery and Hardie, 2012). The first sketch below gives a flavour of this cleaning step.

The second lesson of this module is an introduction to manual corpus annotation. It covers topics such as XML and TEI (TEI Consortium, 2016), the application of annotation guidelines so that the resulting annotation is consistent and reliable, and the evaluation of the annotation through inter-annotator agreement (Pustejovsky and Stubbs, 2013). The second and third sketches below illustrate, respectively, the annotation format and the agreement measure.

Along with this lesson I run a simulation of a corpus annotation process. As the main resource I use the Corpus of Spanish Golden-Age Sonnets, with metrical annotation (Navarro-Colorado et al., 2016; https://github.com/bncolorado/CorpusSonetosSigloDeOro). This corpus is suitable for the exercise because it is freely available (including the annotators' guidelines), it follows the XML-TEI standard, and it has been manually annotated with literary information: the metrics of each line. This corpus provides ample annotation practice, and eventually the students can compare their own work with the original corpus annotation.
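Although the course itself deliberately avoids programming, a few lines of Python can make these steps concrete for readers. First, the kind of "cleaning" meant in lesson 1 can be as simple as cutting away front and back matter around the literary text itself. The sketch below assumes, purely hypothetically, that the downloaded file marks the body with explicit START/END delimiter lines; real sources each need their own handling, and the file name is invented.

    # Illustrative text cleaning: keep only what lies between two (assumed)
    # delimiter lines of a downloaded file, and normalize the encoding.
    raw = open("downloaded.txt", encoding="utf-8", errors="replace").read()

    START, END = "*** START OF TEXT ***", "*** END OF TEXT ***"  # hypothetical
    i, j = raw.find(START), raw.find(END)
    body = raw[i + len(START):j] if i != -1 and j != -1 else raw

    with open("clean.txt", "w", encoding="utf-8") as out:
        out.write(body.strip())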
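Second, it helps students to see how little code is needed to read an annotated file back. The following sketch is illustrative only: it assumes a TEI-encoded sonnet whose verse lines are marked with <l> elements carrying the metrical pattern in a "met" attribute, a convention that TEI's verse module allows. The file name and the exact attribute usage are assumptions, not a description of the actual corpus files.

    # A minimal sketch of reading a TEI-encoded poem: print each verse line
    # together with its (assumed) metrical annotation.
    import xml.etree.ElementTree as ET

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # standard TEI namespace

    tree = ET.parse("sonnet.xml")  # a hypothetical file from the corpus
    for line in tree.getroot().iterfind(".//tei:l", TEI_NS):
        # e.g. met="-+---+---+" encoding unstressed/stressed syllables
        print(line.get("met"), (line.text or "").strip())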
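Third, the agreement measure behind "inter-annotator agreement" can be written out directly. The sketch computes Cohen's kappa, observed agreement corrected for chance, over two toy label sequences invented for the demonstration; it is one common choice among the measures Pustejovsky and Stubbs (2013) discuss, not the course's prescribed metric.

    # Cohen's kappa for two annotators labelling the same items:
    # kappa = (p_o - p_e) / (1 - p_e), agreement corrected for chance.
    from collections import Counter

    ann1 = ["+", "-", "+", "+", "-", "+"]   # toy metrical labels, annotator 1
    ann2 = ["+", "-", "-", "+", "-", "+"]   # the same items, annotator 2

    po = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)  # observed
    c1, c2 = Counter(ann1), Counter(ann2)
    pe = sum(c1[k] * c2[k] for k in c1) / len(ann1) ** 2      # by chance
    print((po - pe) / (1 - pe))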
The second module is focused on the computational analysis of a literary corpus. It is structured in four lessons.

The objective of the first lesson (number 3) is to set up the basis for the computational treatment of texts. Specifically, I show students how words are transformed into numbers and what these numbers represent. With the AntConc tool (Anthony, 2014; http://www.laurenceanthony.net/software/antconc/), students perform several tasks, such as the extraction of the most frequent tokens of the corpus (including a stop-word filter), the extraction of the most frequent n-grams, the estimation of the type/token ratio, concordance analysis, or the extraction of the most frequent lemmas; the first sketch below illustrates these computations. The main conclusion of this lesson is that, when the corpus is really large, it is difficult to extract generalizations from it using these techniques alone (Roe, 2012).

Lesson 4 is a gentle introduction to regular expressions. The objective is to show students how to formalize linguistic expressions. Although it is not possible to go deeply into this topic, students learn how to define regular expressions that allow them to find words by stem or by rhyme, or even conditional expressions (words that appear before or after another word, etc.), as in the second sketch below.
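As before, the course uses AntConc's graphical interface rather than code, but the quantities involved fit in a short Python sketch. This is only a rough equivalent of those exercises: the tokenizer and the tiny stop-word list are placeholders, lemma extraction (which would need a lemmatizer) is omitted, and novel.txt is a hypothetical input file.

    # Rough Python equivalents of the AntConc exercises: token frequencies
    # with a stop-word filter, bigram frequencies, and the type/token ratio.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "of", "and", "a", "to", "in"}  # toy list

    text = open("novel.txt", encoding="utf-8").read().lower()
    tokens = re.findall(r"\w+", text)

    # Most frequent tokens, filtering out stop words
    freq = Counter(t for t in tokens if t not in STOP_WORDS)
    print(freq.most_common(10))

    # Most frequent bigrams (n-grams with n = 2)
    bigrams = Counter(zip(tokens, tokens[1:]))
    print(bigrams.most_common(10))

    # Type/token ratio: a simple measure of vocabulary richness
    print(len(set(tokens)) / len(tokens))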
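In the same illustrative spirit, this second sketch shows the three kinds of patterns mentioned in lesson 4 (a shared stem, a rhyme-like ending, and a condition on a neighbouring word) with Python's re module. The patterns and the example sentence are invented for the demonstration.

    # Toy regular expressions of the kind students write in lesson 4.
    import re

    text = "Cantaba el poeta; cantaban los versos que el viento soñaba."

    # Words sharing the stem "cant-"
    print(re.findall(r"\bcant\w+", text, re.IGNORECASE))

    # Words "rhyming" in -aba (here, simply sharing the ending)
    print(re.findall(r"\b\w+aba\b", text))

    # A conditional expression: the word that appears right after "el"
    print(re.findall(r"\bel\s+(\w+)", text))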
The lesson devoted to Natural Language Processing techniques (lesson 5) focuses on part of speech (PoS) taggers, syntactic parsing and named-entity (NE) recognition. In general, I first show the main architecture of this kind of tool, then the main problems (ambiguity) and finally the common error rate.

In the case of the part of speech tagger, for example, I explain that each word-lemma is related to all its possible parts of speech in a dictionary. This way the PoS ambiguity problem is presented. Then some standard solutions are explained, such as the use of a set of rules to specify the suitable part of speech for each word in each context, or the application of statistical information; a toy version of this view is sketched at the end of this lesson's description. Named-entity recognition is explained in the same way. Syntactic parsing is explained by showing how Context-Free Grammars (CFGs) and probabilistic CFGs work.

In any case, it is not our objective to explain in depth the solutions to these problems (grammar development, statistical learning, machine learning, etc.). These concepts would be hard to follow for our students. What is most important is that students understand the computational problem. This way they will apply these techniques knowing what they can expect from them.

The exercises in this lesson are carried out with FreeLing (Padró and Stanilovsky, 2012). This tool is appropriate for our course because it is multilingual: it includes a PoS tagger, NE recognition and chunkers for Spanish, English, Catalan and other languages. The drawback of FreeLing is that it is hard to install and use; among other things, it has no graphical interface. To avoid installation problems, instead of using FreeLing directly we use a web application developed by Pompeu Fabra University (UPF, Barcelona, Spain) called ContaWords (http://contawords.iula.upf.edu/executions). Using this web application is very easy: it allows students to upload several text files, which are analyzed by FreeLing on a remote server. Results are returned in spreadsheet format. Obtaining the data in a spreadsheet is a must: students can create graphics and analyze the data extracted directly from their corpus.
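To make the dictionary-plus-rules view of PoS tagging concrete, here is a deliberately naive sketch. It is not how FreeLing or any real tagger works (modern taggers are statistical and far more sophisticated); the mini-lexicon and the single disambiguation rule are invented for a classroom-style demonstration of the ambiguity problem.

    # A toy PoS tagger: a lexicon lists every possible tag for each word,
    # and one hand-written rule resolves an ambiguous case.
    LEXICON = {
        "la":    ["DET", "PRON"],   # Spanish "la": determiner or pronoun
        "canta": ["VERB"],
        "vela":  ["NOUN", "VERB"],  # "candle/sail" or "keeps vigil"
    }

    def tag(sentence):
        tags = []
        for i, word in enumerate(sentence):
            candidates = LEXICON.get(word, ["UNKNOWN"])
            if len(candidates) == 1:
                tags.append(candidates[0])
            # Toy rule: after a determiner, prefer a noun reading
            elif i > 0 and tags[-1] == "DET" and "NOUN" in candidates:
                tags.append("NOUN")
            else:
                tags.append(candidates[0])  # fall back to the first reading
        return tags

    print(tag(["la", "vela"]))   # ['DET', 'NOUN']
    print(tag(["canta", "la"]))  # ['VERB', 'DET'] (fallback reading)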
The final section of this module is devoted to computational semantics (lesson 6). Among all the different computational semantic models that have been proposed (lexical semantics, first-order logic, events and semantic roles, etc.), our course focuses solely on distributional models of semantics. I explain only this model because it allows efficient computational processing of large corpora, and because it can only be used with computers. Two main theoretical concepts are explained in this lesson: first, the idea behind distributional semantics that the meaning of a word depends on its contexts, and that words occurring in similar contexts have similar meanings (Harris, 1968); and second, how computers deal with word contexts by means of vectors and matrices (Turney and Pantel, 2010). How the semantic similarity between two words can be computed in the distributional framework is shown following Widdows (2004); a bare-bones version appears in the first sketch below.

In order to show students how distributional models of semantics (and, in general, text mining techniques) are able to extract generalizations and regularities from large corpora, I explain LDA Topic Modeling (Blei et al., 2003). I describe how it works and how it can cluster words with similar (distributional) meaning in the same topic.

Once students understand how LDA works, it is applied to a literary corpus using MALLET (McCallum, 2002). Students must extract a set of topic models and analyze them. They check whether the topic models are coherent, and whether the grouping of words in each topic can be justified on literary grounds. As I said before, I encourage students to formulate questions (for example, "why are words A and B in the same topic?") and try to answer them according to their background in literature.
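A bare-bones distributional model can be written down directly from the two concepts above: count which words co-occur with which within a small window, treat the counts as vectors, and compare vectors with the cosine measure. The three-sentence corpus below is invented; real models are built from large corpora with much more refined weighting.

    # A minimal distributional model: co-occurrence vectors plus cosine
    # similarity (cf. Turney and Pantel, 2010). The corpus is a toy.
    import math
    from collections import Counter, defaultdict

    corpus = [
        "the poet sings a song".split(),
        "the poet writes a sonnet".split(),
        "the singer sings a song".split(),
    ]

    WINDOW = 2
    vectors = defaultdict(Counter)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
            for j in range(lo, hi):
                if i != j:
                    vectors[word][sentence[j]] += 1

    def cosine(u, v):
        dot = sum(u[w] * v[w] for w in u)
        norm = math.sqrt(sum(x * x for x in u.values())) * \
               math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Words appearing in similar contexts come out as more similar
    print(cosine(vectors["poet"], vectors["singer"]))  # high
    print(cosine(vectors["poet"], vectors["song"]))    # lower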
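In class, LDA is run through MALLET's command-line interface. For readers who prefer to see the pipeline as code, the sketch below substitutes the gensim library for MALLET (a deliberate swap, used here only for illustration and assuming gensim is installed): tokenized documents go in, and clusters of co-occurring words ("topics") come out.

    # Illustrative LDA with gensim (not the course's tool, which is MALLET):
    # build a dictionary, convert documents to bags of words, train a model.
    from gensim import corpora, models

    documents = [
        ["love", "heart", "rose", "night"],
        ["war", "sword", "honor", "king"],
        ["love", "night", "moon", "heart"],
    ]  # toy pre-tokenized "poems"

    dictionary = corpora.Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2,
                          passes=50, random_state=1)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)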
5 Evaluation

With the exception of MALLET, which is the most complex tool used in the course, students do not have much difficulty using the technology and completing the exercises in the syllabus, and eventually they all pass the course.

In order to monitor courses, the University of Alicante distributes a survey to gauge the students' degree of satisfaction with each course taken. This course was marked with 9 points out of 10, showing that students are really satisfied with it. For us, these data show that the approach used to teach NLP in literary studies is appropriate. However, only a few students apply some of these techniques in their final Master's Thesis or PhD Thesis. Perhaps students need more time to assimilate all these new techniques and apply them to their daily research in literature.

6 Conclusions

In this paper I have presented the key points of our approach to teaching NLP to students of literature. I try to open the students' minds with these main ideas:

1. The use of NLP techniques in literary studies makes sense when they are applied where manual analysis cannot reach (the analysis of large literary corpora).

2. This literary analysis approach requires a wide scope: students must extend their point of view and learn how to formulate big research questions.

3. The application of these techniques sometimes provides data that, rather than giving answers, produces new research questions.

The course is structured following a standard empirical text analysis: compiling, first, a representative literary corpus, and then analyzing it with NLP techniques (frequencies, part of speech tagging, LDA Topic Modeling, etc.). As the course has only 20 hours, I do not go deeply into the technical details of NLP. I first explain how each technique works, and then students apply it to the corpus with easy-to-use tools.

Acknowledgments

I would like to thank the anonymous reviewers for their helpful suggestions and comments. Thanks also to my students for their feedback, which helps me to improve the course.

This paper was partially supported by the BBVA Foundation: grants for research groups 2016, project "Distant Reading Approach to Golden-Age Spanish Sonnets".

References

Laurence Anthony. 2014. AntConc (version 3.4.3) [computer software]. Waseda University, Tokyo, Japan. http://www.laurenceanthony.net/.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022.

Lynne Bowker and Jennifer Pearson. 2002. Working with Specialized Language: A Practical Guide to Using Corpora. Routledge, London.

Zellig Harris. 1968. Mathematical Structures of Language. Wiley, New York.

Matthew L. Jockers. 2013. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, Illinois.

Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer, Switzerland.

Dan Jurafsky and Christopher Manning. 2012. Natural Language Processing. Stanford University. http://online.stanford.edu/course/natural-language-processing.

Dan Jurafsky and James H. Martin. 2008. Speech and Language Processing. Prentice Hall.

Christopher Manning. 2011. Natural Language Tools for the Digital Humanities. Stanford University. https://nlp.stanford.edu/manning/courses/DigitalHumanities/.

Andrew K. McCallum. 2002. MALLET: A Machine Learning for Language Toolkit.

Tony McEnery and Andrew Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, Cambridge.

Tony McEnery. 2013. Corpus Linguistics: Method, Analysis, Interpretation. Lancaster University. https://www.futurelearn.com/courses/corpus-linguistics.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(176).

Franco Moretti. 2007. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.

Franco Moretti. 2013. Distant Reading. Verso.

Borja Navarro-Colorado, María Ribes Lafoz, and Noelia Sánchez. 2016. Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), Slovenia.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards Wider Multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul.

James Pustejovsky and Amber Stubbs. 2013. Natural Language Annotation for Machine Learning. O'Reilly.

Glenn Roe. 2012. The dangers and delights of data mining. In Digital Humanities Summer School, Oxford (UK). University of Oxford.

TEI Consortium, editor. 2016. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.1.0. Last modified 15th December 2016.

Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141-188.

Dominic Widdows. 2004. Geometry and Meaning. CSLI Publications.

Martin Wynne. 2004. Developing Linguistic Corpora: a Guide to Good Practice. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/index.htm.