A Quick Intensive Course on Natural Language Processing applied to Literary Studies

Borja Navarro-Colorado
Department of Software and Computing Systems
University of Alicante
borja@dlsi.ua.es

Abstract

This paper presents how Natural Language Processing is taught to students of a Master's Degree in Literary Studies. These students' background is solely humanistic, and they have no knowledge whatsoever of Natural Language Processing (NLP). The challenge is to introduce these students to the main aspects of NLP in a 20-hour course, and show them how they can apply these techniques to the analysis of literary texts. The course focuses on three main aspects: first, to get to know a new approach to literary text analysis based on the distant reading model; second, to develop a representative literary corpus; and, finally, to apply basic NLP techniques to said corpus in order to extract relevant data. Among these techniques are word frequencies, Part of Speech tagging and distributional semantic models such as LDA Topic Modeling. Satisfaction surveys show that students are satisfied with the course.

1 Introduction

Teaching Natural Language Processing to Literature students is currently a great challenge. Students come to the course with good skills for close-reading literary text analysis and a good background in history of literature and even in literary theory. However, they do not have enough technological or mathematical background to understand how current Natural Language Processing techniques work and how they can be applied to the analysis of literary texts. Indeed, at the beginning, they are unsure about the usefulness of these resources for literary studies.

In this paper I will present the objectives and contents of a Master's course (two credits) focused on the application of computational techniques (mainly Natural Language Processing) to literary text analysis. The subject is framed in Moretti's distant reading model (Moretti, 2007; Moretti, 2013; Jockers, 2013) because I think that it is within this approach that NLP techniques are really useful for literary studies. Following a standard empirical text-analysis process, the course is organized in two main modules: the first one is devoted to corpus design, compilation and annotation; the second one is devoted to the application of some specific Natural Language Processing techniques such as, among others, lexical frequency analysis, part of speech tagging, named-entity recognition or distributional semantic analysis (LDA Topic Modeling (Blei et al., 2003)). My procedure in class is as follows: first I show my students how each of these techniques works, and then what can be expected from them when applied to literary texts. This way students extract empirical data from the corpus that they must interpret according to their literary knowledge.

Students usually pass the course without much difficulty. Satisfaction surveys show that the course is well received among students. In general they accept the necessity of empirical data to complement traditional literary analysis. However, only a few students eventually apply some of these NLP techniques in their final Master's Thesis or PhD Thesis.

In the next section I will present first the course context and the student profile; then I will show the main objectives of the course, how the content is organized and how it is taught to students (theory and practice); to conclude, I will propose some ideas for a Digital Humanities curriculum based on this experience.

2 Course context and student profile
The course is called "Computer resources for literary research" (it is taught in Spanish; the exact name is "Recursos informáticos para la investigación literaria": http://www.dlsi.ua.es/~borja/riilua/). It is a twenty-hour course included in the Master's Degree in Literary Studies (https://maesl.ua.es/index.html) at the University of Alicante (Spain). It is taught face-to-face in a computer lab classroom, where students perform their tasks under the teacher's supervision. The only task that is done outside the computer lab is the students' final essay, in which they must apply the NLP techniques they have learned during the course.

The course is taught by a teacher whose background is Spanish Language and Literature (B.A.) with a PhD in Natural Language Processing. So far it has been taught three times, starting in the 2014-2015 school year. The course had an attendance of 11 students the first year, 12 students the second year and 17 students this last year.

The students who take this course are usually young graduates in Literature. All of them share a good background in humanities and history of literature, and they use similar research methods for the traditional analysis of literary texts. They differ in the literary tradition that they have studied. Most of them are graduates in English Literature or Spanish (Castilian) Literature, but there are also graduates in other literary traditions such as Catalan, Arabic or French Literature. Some of them have a background in Linguistics as well. In this context, the course is focused mainly on the computational analysis of Spanish (both Catalan and Castilian) and English literary texts.

The students' knowledge of mathematics and computers is poor. As far as mathematics is concerned, their knowledge basically comes down to what they learned in high school. As regards computers, they are digital natives and use computers in their daily life; however, they have no knowledge at all of Computer Science: algorithms, programming, etc.

On the other hand, these students are familiar not only with the main concepts of Linguistics, but also with the literary criticism models that apply linguistic techniques to literary analysis (such as Russian Formalism, Structuralism or New Criticism). Therefore, they clearly understand the linguistic aspects of Natural Language Processing and its main problem (linguistic ambiguity). However, their lack of a thorough computational background makes it hard for them to understand how NLP works, that is, the mathematical basis of NLP. During the course, not only do I explain how to use NLP tools, but I also try to clarify how they work, that is, how these tools deal with linguistic ambiguity.

3 General objectives

In only 20 teaching hours it is not possible to introduce Python or any other programming language in the course. This limitation leaves most Natural Language Processing tools out of the syllabus. Moreover, I avoid focusing the course only on the technical application of NLP tools. More than this, students must understand the important contribution of these tools to literary studies, mostly because, before they take this course, they do not see why they should apply computational tools to the analysis of literary texts. They consider their (manual and close-reading) methodological skills and literary analysis models sufficient, so they do not see the usefulness of computational analysis. If I want to analyze the metrical aspects of García Lorca's "Little Viennese Waltz", why do I need a computer, when I can analyze all these lines properly by hand? This question is related to the usefulness of NLP for literary studies.

Our first objective is to show that the application of NLP techniques to literary text analysis makes sense only if by using these techniques I can learn something new about the literary phenomenon. The application of NLP tools to emulate human analysis makes no sense. On the contrary, they must be applied where manual analysis cannot reach.

In this regard, Moretti's Distant Reading model (Moretti, 2007; Moretti, 2013; Jockers, 2013) sets up a framework where the computational analysis of literary texts is not only useful but also necessary. I am referring to the computational analysis of large corpora in order to extract common patterns and regularities from the texts and, in general, implicit and unknown information that cannot be extracted by means of a manual analysis. Of course it is better to analyze manually the metrics of García Lorca's "Little Viennese Waltz", but it is not possible for a human being to analyze the whole metrics of all Spanish Golden Age Poetry (all the Spanish poetry composed during the 16th and 17th centuries). In this case the use of computational analysis and NLP techniques is mandatory, and it will probably show some regularities about the period that traditional approaches are not able to detect. Both approaches are, in the end, complementary.

The second objective of the course is to formulate big questions. In order to apply the Distant Reading model, students must first learn how to formulate big literary questions, questions that could be answered by applying NLP techniques to a large literary corpus. At the beginning students pose small questions as the basis for the analysis of a single novel or the work of a specific poet. I encourage students to think of big literary questions: not questions about a specific author or a specific piece of literary work, but questions about whole literary periods or genres, for example.

To develop this new point of view, I use the easy but powerful Google Books n-gram viewer (https://books.google.com/ngrams). It allows the student to look for word and n-gram frequencies in the Google Books collection and display them in a timeline. With this tool students practice how to think big. They formulate big questions about literature or, in general, cultural aspects, and then look for data in the Google n-gram tool. They then analyze the data provided by the tool and try to answer the question. The kind of questions formulated are based on Michel et al. (2011).

Finally, the third main objective of the course is to show that the application of these techniques sometimes provides quantitative data that, rather than answers, produce new research questions that must be studied (Moretti, 2007).

Once students accept these ideas they are ready to learn about the technical aspects of NLP. Now they are able to appreciate the usefulness of NLP for literary studies, and the course makes sense for them.
4 Content and lessons

Content and syllabus are based on my own research experience. This is why the content of the course is structured following a standard empirical text analysis. Given that the main objective of the Master's Degree is to prepare students for research in literary studies, this structure fits well with student expectations. In any case, the content and lessons of this course are not based on any specific previous course. Besides my own experience, to set up the course content I have taken into consideration tutorials such as Manning (2011), handbooks such as Jockers (2014), Pustejovsky and Stubbs (2013) and Jurafsky and Martin (2008), and courses on Corpus Linguistics such as McEnery (2013) or on Natural Language Processing such as Jurafsky and Manning (2012).

The syllabus of the course is as follows:

• Introduction. Objectives (2 hours).

• Module 1. Corpus compilation.
  1. Corpus design and compilation (2 hours).
  2. Corpus annotation (4 hours).

• Module 2. Corpus analysis.
  3. Frequencies, n-grams and concordances (2 hours).
  4. Regular Expressions (2 hours).
  5. Natural Language Processing (4 hours).
  6. Text mining (4 hours).

The main idea of the first module is that only with a representative literary corpus is it possible to reach reliable conclusions. The literary analysis depends, eventually, on the quality of the corpus and its annotation. In this module students learn about basic aspects of Corpus Linguistics: how to select representative texts according to a set of objective criteria; how to find, download and clean texts in order to obtain plain text; how to store text files; how to deal with text-encoding problems, etc. (Wynne, 2004; Bowker and Pearson, 2002; McEnery and Hardie, 2012). The first sketch below gives a flavour of this cleaning step.

The second lesson of this module is an introduction to manual corpus annotation. It covers topics such as XML and TEI (TEI Consortium, 2016), the application of annotation guidelines so that the resulting annotation is consistent and reliable, and the evaluation of the annotation through inter-annotator agreement (Pustejovsky and Stubbs, 2013). The second and third sketches below illustrate, respectively, the annotation format and the agreement measure.

Along with this lesson I run a simulation of a corpus annotation process. As the main resource I use the Corpus of Spanish Golden-Age Sonnets, with metrical annotation (Navarro-Colorado et al., 2016; https://github.com/bncolorado/CorpusSonetosSigloDeOro). This corpus is suitable for the exercise because it is freely available (including the annotators' guidelines), it follows the XML-TEI standard, and it has been manually annotated with literary information: the metrics of each line. This corpus provides ample annotation practice, and eventually the students can compare their own work with the original corpus annotation.
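Although the course itself deliberately avoids programming, a few lines of Python can make these steps concrete for readers. First, the kind of "cleaning" meant in lesson 1 can be as simple as cutting away front and back matter around the literary text itself. The sketch below assumes, purely hypothetically, that the downloaded file marks the body with explicit START/END delimiter lines; real sources each need their own handling, and the file name is invented.

    # Illustrative text cleaning: keep only what lies between two (assumed)
    # delimiter lines of a downloaded file, and normalize the encoding.
    raw = open("downloaded.txt", encoding="utf-8", errors="replace").read()

    START, END = "*** START OF TEXT ***", "*** END OF TEXT ***"  # hypothetical
    i, j = raw.find(START), raw.find(END)
    body = raw[i + len(START):j] if i != -1 and j != -1 else raw

    with open("clean.txt", "w", encoding="utf-8") as out:
        out.write(body.strip())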
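Second, it helps students to see how little code is needed to read an annotated file back. The following sketch is illustrative only: it assumes a TEI-encoded sonnet whose verse lines are marked with <l> elements carrying the metrical pattern in a "met" attribute, a convention that TEI's verse module allows. The file name and the exact attribute usage are assumptions, not a description of the actual corpus files.

    # A minimal sketch of reading a TEI-encoded poem: print each verse line
    # together with its (assumed) metrical annotation.
    import xml.etree.ElementTree as ET

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # standard TEI namespace

    tree = ET.parse("sonnet.xml")  # a hypothetical file from the corpus
    for line in tree.getroot().iterfind(".//tei:l", TEI_NS):
        # e.g. met="-+---+---+" encoding unstressed/stressed syllables
        print(line.get("met"), (line.text or "").strip())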
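Third, the agreement measure behind "inter-annotator agreement" can be written out directly. The sketch computes Cohen's kappa, observed agreement corrected for chance, over two toy label sequences invented for the demonstration; it is one common choice among the measures Pustejovsky and Stubbs (2013) discuss, not the course's prescribed metric.

    # Cohen's kappa for two annotators labelling the same items:
    # kappa = (p_o - p_e) / (1 - p_e), agreement corrected for chance.
    from collections import Counter

    ann1 = ["+", "-", "+", "+", "-", "+"]   # toy metrical labels, annotator 1
    ann2 = ["+", "-", "-", "+", "-", "+"]   # the same items, annotator 2

    po = sum(a == b for a, b in zip(ann1, ann2)) / len(ann1)  # observed
    c1, c2 = Counter(ann1), Counter(ann2)
    pe = sum(c1[k] * c2[k] for k in c1) / len(ann1) ** 2      # by chance
    print((po - pe) / (1 - pe))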
The second module is focused on the computational analysis of a literary corpus. It is structured in four lessons.

The objective of the first lesson (number 3) is to set up the basis for the computational treatment of texts. Specifically, I show students how words are transformed into numbers and what these numbers represent. With the AntConc tool (Anthony, 2014; http://www.laurenceanthony.net/software/antconc/), students perform several tasks, such as the extraction of the most frequent tokens of the corpus (including a stop-word filter), the extraction of the most frequent n-grams, the estimation of the type/token ratio, concordance analysis, or the extraction of the most frequent lemmas; the first sketch below illustrates these computations. The main conclusion of this lesson is that, when the corpus is really large, it is difficult to extract generalizations from it using these techniques alone (Roe, 2012).

Lesson 4 is a gentle introduction to regular expressions. The objective is to show students how to formalize linguistic expressions. Although it is not possible to go deeply into this topic, students learn how to define regular expressions that allow them to find words by stem or by rhyme, or even conditional expressions (words that appear before or after another word, etc.), as in the second sketch below.
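As before, the course uses AntConc's graphical interface rather than code, but the quantities involved fit in a short Python sketch. This is only a rough equivalent of those exercises: the tokenizer and the tiny stop-word list are placeholders, lemma extraction (which would need a lemmatizer) is omitted, and novel.txt is a hypothetical input file.

    # Rough Python equivalents of the AntConc exercises: token frequencies
    # with a stop-word filter, bigram frequencies, and the type/token ratio.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "of", "and", "a", "to", "in"}  # toy list

    text = open("novel.txt", encoding="utf-8").read().lower()
    tokens = re.findall(r"\w+", text)

    # Most frequent tokens, filtering out stop words
    freq = Counter(t for t in tokens if t not in STOP_WORDS)
    print(freq.most_common(10))

    # Most frequent bigrams (n-grams with n = 2)
    bigrams = Counter(zip(tokens, tokens[1:]))
    print(bigrams.most_common(10))

    # Type/token ratio: a simple measure of vocabulary richness
    print(len(set(tokens)) / len(tokens))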
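In the same illustrative spirit, this second sketch shows the three kinds of patterns mentioned in lesson 4 (a shared stem, a rhyme-like ending, and a condition on a neighbouring word) with Python's re module. The patterns and the example sentence are invented for the demonstration.

    # Toy regular expressions of the kind students write in lesson 4.
    import re

    text = "Cantaba el poeta; cantaban los versos que el viento soñaba."

    # Words sharing the stem "cant-"
    print(re.findall(r"\bcant\w+", text, re.IGNORECASE))

    # Words "rhyming" in -aba (here, simply sharing the ending)
    print(re.findall(r"\b\w+aba\b", text))

    # A conditional expression: the word that appears right after "el"
    print(re.findall(r"\bel\s+(\w+)", text))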
The lesson devoted to Natural Language Processing techniques (lesson 5) focuses on part of speech (PoS) taggers, syntactic parsing and named-entity (NE) recognition. In general, I first show the main architecture of this kind of tool, then the main problems (ambiguity) and finally the common error rate.

In the case of the part of speech tagger, for example, I explain that each word-lemma is related to all its possible parts of speech in a dictionary. This way the PoS ambiguity problem is presented. Then some standard solutions are explained, such as the use of a set of rules to specify the suitable part of speech for each word in each context, or the application of statistical information; a toy version of this view is sketched at the end of this lesson's description. Named-entity recognition is explained in the same way. Syntactic parsing is explained by showing how Context-Free Grammars (CFGs) and probabilistic CFGs work.

In any case, it is not our objective to explain in depth the solutions to these problems (grammar development, statistical learning, machine learning, etc.). These concepts would be hard to follow for our students. What is most important is that students understand the computational problem. This way they will apply these techniques knowing what they can expect from them.

The exercises in this lesson are carried out with FreeLing (Padró and Stanilovsky, 2012). This tool is appropriate for our course because it is multilingual: it includes a PoS tagger, NE recognition and chunkers for Spanish, English, Catalan and other languages. The drawback of FreeLing is that it is hard to install and use; among other things, it has no graphical interface. To avoid installation problems, instead of using FreeLing directly we use a web application developed by Pompeu Fabra University (UPF, Barcelona, Spain) called ContaWords (http://contawords.iula.upf.edu/executions). Using this web application is very easy: it allows students to upload several text files, which are analyzed by FreeLing on a remote server. Results are returned in spreadsheet format. Obtaining the data in a spreadsheet is a must: students can create graphics and analyze the data extracted directly from their corpus.
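To make the dictionary-plus-rules view of PoS tagging concrete, here is a deliberately naive sketch. It is not how FreeLing or any real tagger works (modern taggers are statistical and far more sophisticated); the mini-lexicon and the single disambiguation rule are invented for a classroom-style demonstration of the ambiguity problem.

    # A toy PoS tagger: a lexicon lists every possible tag for each word,
    # and one hand-written rule resolves an ambiguous case.
    LEXICON = {
        "la":    ["DET", "PRON"],   # Spanish "la": determiner or pronoun
        "canta": ["VERB"],
        "vela":  ["NOUN", "VERB"],  # "candle/sail" or "keeps vigil"
    }

    def tag(sentence):
        tags = []
        for i, word in enumerate(sentence):
            candidates = LEXICON.get(word, ["UNKNOWN"])
            if len(candidates) == 1:
                tags.append(candidates[0])
            # Toy rule: after a determiner, prefer a noun reading
            elif i > 0 and tags[-1] == "DET" and "NOUN" in candidates:
                tags.append("NOUN")
            else:
                tags.append(candidates[0])  # fall back to the first reading
        return tags

    print(tag(["la", "vela"]))   # ['DET', 'NOUN']
    print(tag(["canta", "la"]))  # ['VERB', 'DET'] (fallback reading)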
The final section of this module is devoted to computational semantics (lesson 6). Among all the different computational semantic models that have been proposed (lexical semantics, first-order logic, events and semantic roles, etc.), our course focuses solely on distributional models of semantics. I explain only this model because it allows efficient computational processing of large corpora, and because it can only be used with computers. Two main theoretical concepts are explained in this lesson: first, the idea behind distributional semantics that the meaning of a word depends on its contexts, and that words occurring in similar contexts have similar meanings (Harris, 1968); and second, how computers deal with word contexts by means of vectors and matrices (Turney and Pantel, 2010). How the semantic similarity between two words can be computed in the distributional framework is shown following Widdows (2004); a bare-bones version appears in the first sketch below.

In order to show students how distributional models of semantics (and, in general, text mining techniques) are able to extract generalizations and regularities from large corpora, I explain LDA Topic Modeling (Blei et al., 2003). I describe how it works and how it can cluster words with similar (distributional) meaning in the same topic.

Once students understand how LDA works, it is applied to a literary corpus using MALLET (McCallum, 2002). Students must extract a set of topic models and analyze them. They check whether the topic models are coherent, and whether the grouping of words in each topic can be justified on literary grounds. As I said before, I encourage students to formulate questions (for example, "why are words A and B in the same topic?") and try to answer them according to their background in literature.
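A bare-bones distributional model can be written down directly from the two concepts above: count which words co-occur with which within a small window, treat the counts as vectors, and compare vectors with the cosine measure. The three-sentence corpus below is invented; real models are built from large corpora with much more refined weighting.

    # A minimal distributional model: co-occurrence vectors plus cosine
    # similarity (cf. Turney and Pantel, 2010). The corpus is a toy.
    import math
    from collections import Counter, defaultdict

    corpus = [
        "the poet sings a song".split(),
        "the poet writes a sonnet".split(),
        "the singer sings a song".split(),
    ]

    WINDOW = 2
    vectors = defaultdict(Counter)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)
            for j in range(lo, hi):
                if i != j:
                    vectors[word][sentence[j]] += 1

    def cosine(u, v):
        dot = sum(u[w] * v[w] for w in u)
        norm = math.sqrt(sum(x * x for x in u.values())) * \
               math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Words appearing in similar contexts come out as more similar
    print(cosine(vectors["poet"], vectors["singer"]))  # high
    print(cosine(vectors["poet"], vectors["song"]))    # lower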
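In class, LDA is run through MALLET's command-line interface. For readers who prefer to see the pipeline as code, the sketch below substitutes the gensim library for MALLET (a deliberate swap, used here only for illustration and assuming gensim is installed): tokenized documents go in, and clusters of co-occurring words ("topics") come out.

    # Illustrative LDA with gensim (not the course's tool, which is MALLET):
    # build a dictionary, convert documents to bags of words, train a model.
    from gensim import corpora, models

    documents = [
        ["love", "heart", "rose", "night"],
        ["war", "sword", "honor", "king"],
        ["love", "night", "moon", "heart"],
    ]  # toy pre-tokenized "poems"

    dictionary = corpora.Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2,
                          passes=50, random_state=1)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)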
5 Evaluation

With the exception of MALLET, which is the most complex tool used in the course, students do not have much difficulty using the technology and completing the exercises in the syllabus, and eventually they all pass the course.

In order to monitor courses, the University of Alicante distributes a survey to gauge the students' degree of satisfaction with each course taken. This course was marked with 9 points out of 10, showing that students are really satisfied with it. For us, these data show that the approach used to teach NLP in literary studies is appropriate. However, only a few students apply some of these techniques in their final Master's Thesis or PhD Thesis. Perhaps students need more time to assimilate all these new techniques and apply them to their daily research in literature.

6 Conclusions

In this paper I have presented the key points of our approach to teaching NLP to students of literature. I try to open the students' minds with these main ideas:

1. The use of NLP techniques in literary studies makes sense when they are applied where manual analysis cannot reach (the analysis of large literary corpora).

2. This literary analysis approach requires a wide scope: students must extend their point of view and learn how to formulate big research questions.

3. The application of these techniques sometimes provides data that, rather than giving answers, produces new research questions.

The course is structured following a standard empirical text analysis: compiling, first, a representative literary corpus, and then analyzing it with NLP techniques (frequencies, part of speech tagging, LDA Topic Modeling, etc.). As the course has only 20 hours, I do not go deeply into the technical details of NLP. I first explain how each technique works, and then students apply it to the corpus with easy-to-use tools.

Acknowledgments

I would like to thank the anonymous reviewers for their helpful suggestions and comments. Thanks also to my students for their feedback, which helps me to improve the course.

This paper was partially supported by the BBVA Foundation: grants for research groups 2016, project "Distant Reading Approach to Golden-Age Spanish Sonnets".

References

Laurence Anthony. 2014. AntConc (version 3.4.3) [computer software]. Waseda University, Tokyo, Japan. http://www.laurenceanthony.net/.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022.

Lynne Bowker and Jennifer Pearson. 2002. Working with Specialized Language: A Practical Guide to Using Corpora. Routledge, London.

Zellig Harris. 1968. Mathematical Structures of Language. Wiley, New York.

Matthew L. Jockers. 2013. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, Illinois.

Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer, Switzerland.

Dan Jurafsky and Christopher Manning. 2012. Natural Language Processing. Stanford University. http://online.stanford.edu/course/natural-language-processing.

Dan Jurafsky and James H. Martin. 2008. Speech and Language Processing. Prentice Hall.

Christopher Manning. 2011. Natural Language Tools for the Digital Humanities. Stanford University. https://nlp.stanford.edu/manning/courses/DigitalHumanities/.

Andrew K. McCallum. 2002. MALLET: A Machine Learning for Language Toolkit.

Tony McEnery and Andrew Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, Cambridge.

Tony McEnery. 2013. Corpus Linguistics: Method, Analysis, Interpretation. Lancaster University. https://www.futurelearn.com/courses/corpus-linguistics.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(176).

Franco Moretti. 2007. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.

Franco Moretti. 2013. Distant Reading. Verso.

Borja Navarro-Colorado, María Ribes Lafoz, and Noelia Sánchez. 2016. Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016), Slovenia.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards Wider Multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul.

James Pustejovsky and Amber Stubbs. 2013. Natural Language Annotation for Machine Learning. O'Reilly.

Glenn Roe. 2012. The dangers and delights of data mining. In Digital Humanities Summer School, Oxford (UK). University of Oxford.

TEI Consortium, editor. 2016. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 3.1.0. Last modified 15th December 2016.

Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37:141-188.

Dominic Widdows. 2004. Geometry and Meaning. CSLI Publications.

Martin Wynne. 2004. Developing Linguistic Corpora: a Guide to Good Practice. http://www.ahds.ac.uk/creating/guides/linguistic-corpora/index.htm.