=Paper=
{{Paper
|id=Vol-1918/vela
|storemode=property
|title=A Practical Course in Corpus Linguistics for Students with a Humanist Background
|pdfUrl=https://ceur-ws.org/Vol-1918/vela.pdf
|volume=Vol-1918
|authors=Mihaela Vela,Hannah Kermes
|dblpUrl=https://dblp.org/rec/conf/gldv/VelaK17
}}
==A Practical Course in Corpus Linguistics for Students with a Humanist Background==
Mihaela Vela (m.vela@mx.uni-saarland.de) and Hannah Kermes (h.kermes@mx.uni-saarland.de), Language Science and Technology, Saarland University

===Abstract===
We present a practical course in corpus linguistics meant to provide students with a humanities background with the knowledge and skills necessary for an empirical study as the basis for a term paper or a BA or MA thesis. The course is part of a new Bachelor program and is combined with a theoretically oriented course on corpus linguistics. The challenge is to provide students with the necessary understanding of the underlying concepts and skills of corpus linguistics without overwhelming them with too much technical detail. The course material is modular, allowing for easy updates, modifications and adaptations, and is reusable for different target groups, settings, and applications.

===1 Introduction===
In this paper we present a practical course in corpus linguistics, which is meant to provide students with a humanities background with the knowledge and skills necessary for an empirical study as the basis for a term paper or a BA or MA thesis. The course is part of a new Bachelor program, Language Science, and is combined with a theoretically oriented course on corpus linguistics. The students in the program come from various backgrounds, including translatology and language studies, among others. Most of the students have little or no experience in natural language processing.

The challenge is to provide them with the necessary understanding of the underlying concepts and skills of corpus linguistics without overwhelming them with too much technical detail. The course material is modular, allowing for easy updates, modifications and adaptations, and is reusable for different target groups, settings, and applications. The described processes, analyses and exercises are reproducible and portable to further studies.

In the following we discuss challenges for teachers and students (Section 2) and describe the general concept of the course (Section 3) and its composition (Section 4). We conclude with a brief summary and closing remarks (Section 5).

===2 Challenges for teachers and students===
A practical course on corpus linguistics for students with a humanities background poses challenges for both teachers and students. The challenges stem from the seemingly opposed character of digital applications and the humanities disciplines, as well as from the character of a practical course, which requires a lot of active learning on the part of the students.

Challenges for teachers include:
* motivating students and lowering the psychological and practical barriers
* trying to avoid or solve technical problems
* dealing with heterogeneous groups, both with regard to the prior knowledge of the students and with regard to their different learning pace
* keeping track of the learning success of the group and of individual students, adjusting the teaching speed and/or type accordingly

Challenges for students include:
* engaging with a potentially new kind of subject matter
* dealing with and solving technical problems
* coping with the high demands of active learning

Motivating students to engage with the technical aspects of corpus linguistics often boils down to answering the question about the usefulness of the methodology. Good and obvious examples of applications in the students' discipline(s) demonstrate the additional value. Simple and understandable practical exercises are useful in this respect, as they illustrate the methodology and help lower potential psychological barriers with regard to technical applications. Motivating students remains a challenge throughout a course, as technical aspects can easily become cumbersome and tedious. Active learning plays an important role in this respect. By active learning we mean an instructional method that engages students in meaningful learning activities in the classroom (e.g. doing exercises, working on and discussing problems and results) (Prince, 2004; Bonwell and Eison, 1991). It keeps the students active and involved and gives them immediate feedback on their learning success. The broader goal of the session, however, should be made clear to show why the students' activities are necessary.

Technical problems with regard to applications and code are a challenge for both teachers and students. Many technical problems, e.g. with installing software, can be avoided by using online tools, e.g. online corpora or web services for corpus annotation. Another way to limit technical problems is to provide sample code for more complex examples, which only needs to be modified or complemented for exercises or later application. This helps to focus on the main aspects of the methodology, as the technical difficulties are reduced to a minimum. Nevertheless, technical problems cannot be avoided completely, and it can even be good to provoke particular problems in class. Problem solving, especially finding errors in code or regular expressions, is also an aspect of corpus linguistic research. Working through examples and exercises in class will inevitably lead to some technical problems. However, as they occur in class, the teacher can immediately help in finding and solving them.

Teachers are often confronted with heterogeneous groups, both with regard to prior knowledge and with regard to learning speed. Although this is a general challenge in teaching, groups are often more heterogeneous in digital humanities settings, and the differences are especially pronounced when teaching technical skills. Again, an active learning environment in which examples and exercises are worked through in class can help to adapt to individual needs. It is easier for the teacher to find out about individual problems and to provide immediate support. It is also possible to provide exercises for different levels or extra exercises for more advanced students. Working on a problem as a group can also foster deeper understanding.

Having students present the results of exercises and discuss them as a group helps teachers keep track of the learning success of the group and of individual students. This is important in order to adjust the teaching speed and/or type accordingly. The discussion of the results can also be used to sum up and point to important aspects of a teaching unit. This helps the students to reflect on their own learning success and to evaluate the personal organization and structure of their learning activities.

In the following we describe the general concept of the course and how it addresses the challenges described above.

===3 General concept===
The course covers the two main aspects of corpus linguistics: (i) corpus building and (ii) corpus analysis. The main emphasis, however, is on corpus analysis, as depicted in Figure 1. Tutorials lead the students through the main steps in corpus building, from the digitized text to an annotated searchable corpus, and from the linguistic research question and corpus extraction to corpus analysis.

The course is constructed like a sample study, each tutorial representing a particular step in the process. In this sense, the tutorials in both parts build on one another, as each tutorial produces the input data for the next. However, as the necessary sample data is provided at the beginning of each tutorial, the tutorials are also self-contained. In the first part, we create a corpus out of a small plain text sample, adding metadata and basic linguistic annotation. In the second part, we look at a sample research question, extracting and analyzing the respective data. Constructing the course as a sample study allows the students to get acquainted with the process of an empirical corpus linguistic study, facilitating a later application of the methodology to a study of their own.

The course material is provided as a website with detailed step-by-step tutorials including the necessary background information, examples with sample data, and exercises. Links to external knowledge sources and tutorials provide access to additional information, including more (technical) details, more profound background or more complex applications. The tutorials may be worked through at an individual pace, adapting to the specific needs of different target groups and individual students by skipping sections or providing additional information or exercises.

Figure 1: Structure of the course.

The tutorials are written in R Markdown and converted into HTML websites. Using R Markdown for the source documents has several advantages: (i) the tutorials are easy to modify, (ii) additional information as well as new material can easily be integrated, (iii) the students can download the source document, which allows them to add individual notes and to reproduce the provided sample analysis.

Although the tutorials were initiated as course material for a university course, they can also function as self-learning tutorials or as a knowledge base and memory aid for later use when writing a term paper or a BA or MA thesis. As a university course, it follows the concept of the inverted or flipped classroom (Bergmann and Sams, 2012; Handke et al., 2012) in the sense that the sample study and exercises are worked through in the classroom individually or as a group, followed by a group discussion. This makes it possible to address problems immediately, discuss them as a group, work through advanced concepts and engage in collaborative learning and problem solving (Tucker, 2012), again more or less simulating a research process in a team.

===4 Tutorial Description===
The practical course described here consisted of ten incremental sessions on corpus linguistics. The goal, as described in the previous section, is to get students acquainted with the concepts and notions of corpus linguistics. Each session was planned as an interactive combination of a tutorial (presented by the lecturers or prepared by the students) and the corresponding exercises (solved in class by the students with the lecturer's assistance). The tutorial corresponds to the theoretical part, introducing a new topic, while the exercises are meant as practical applications.

In this section we describe the course in detail, including not only the structure, content and technical aspects of the course, but also the infrastructure necessary to teach this type of class. We decided to publish the entire material of the course online (http://fedora.clarin-d.uni-saarland.de/teaching/Corpus_Linguistics/index.html), facilitating the availability and reproducibility of all course materials. As mentioned before, we use R Markdown for this purpose, which allows us to combine narrative text and code and to produce formatted output.

As shown in Figure 1, the course is structured into four blocks, Corpus Building, Corpus Annotation, Corpus Query and Data Analysis, distributed over ten sessions. Each session introduces a new concept while at the same time using previously introduced concepts. In terms of corpora, we use the Royal Society Corpus (RSC) (Kermes et al., 2016), a historical corpus of written scientific English, as well as the BROWN (Francis and Kučera, 1979), FLOB (Mair, 1999b), LOB (Johansson and Goodluck, 1978) and FROWN (Mair, 1999a) corpora, covering different time periods and registers for both American and British English.

The course is structured as follows.
* Session 1: Corpus building with XML and TEI
* Session 2: Tagging with TreeTagger
* Session 3: Corpus annotation with WebLicht
* Session 4: Corpus query with regular expressions
* Session 5: Corpus query with patterns
* Session 6: Data extraction and data formats
* Session 7: Data analysis and data evaluation with Excel
* Session 8: Manipulating data sets with R
* Session 9: Normalization and frequency distribution with R
* Session 10: Plotting analysis results with R

Session 1 belongs to the Corpus Building block and provides an introduction to XML (eXtensible Markup Language) and TEI (Text Encoding Initiative). The goal of this class is to make students understand the importance and the syntax of mark-up languages when working with corpora. The class starts with an exercise in which students are asked to mark the title, paragraphs and sentences in a .txt file. The different solutions are meant to show the possible variation in marking linguistic units, here title, paragraphs and sentences, and to make the point that a standardized mark-up language is necessary. The tutorial first introduces the XML syntax, followed by the TEI syntax. To round off the session, it ends with an exercise on encoding the same text according to the TEI guidelines.

As described in Section 3, we provided additional information beyond the scope of the specific class. In Session 1 this additional information was provided as links, as depicted in Figure 2.

Figure 2: Additional resources for data encoding.

Session 2 and Session 3 are part of the Corpus Annotation block. Session 2 deals with part-of-speech tagging, including its definition and the introduction of the concept of a tagset. More specifically, this class also provides an introduction to the usage and configuration options of the TreeTagger (Schmid, 1994; Schmid, 1995). The exercise deals with the installation and usage of the TreeTagger as well as with tagging not only .txt files but also .xml files, as depicted in Figure 3.

Figure 3: Instructions for tagging with the TreeTagger.

Session 3 goes one step further by adding further annotation layers to the already existing part-of-speech annotation. This is carried out with WebLicht (Hinrichs et al., 2010), a web-based environment for the annotation of corpora. It includes tools for tokenization, lemmatization, part-of-speech tagging and parsing (among others), which can be combined individually into tool chains. The WebLicht tutorial describes the usage of the tool by means of screenshots and examples. The exercise for the students is to build a processing chain in WebLicht including at least a tokenizer and the TreeTagger. The files used for this exercise are the same as in Session 2.

Session 4 and Session 5 belong to the Corpus Query block and are concerned with the qualitative analysis of the texts. Session 4 is meant as an introduction to regular expressions, defining the concept of a regular expression and explaining, by example, the special characters and their role in formulating a regular expression. After practicing formulating queries with regular expressions in Notepad++ (https://notepad-plus-plus.org/), the students were introduced to the Saarland University CQPweb platform (Hardie, 2012) (https://corpora.clarin-d.uni-saarland.de/cqpweb/) and to the CQP syntax (Evert and Hardie, 2011). The tutorial consists of a series of example queries meant to consolidate the knowledge about regular expressions. The corresponding exercises consist of a set of queries to be carried out in CQPweb on the RSC and BROWN corpora. An example of such an exercise can be found in Figure 4.

Figure 4: Introductory exercise in CQPweb.

Session 5 is a continuation of Session 4. It builds on regular expression syntax, extending the simple search queries introduced before to more complex queries including patterns. In the exercise part of the class, the students are asked to build their own patterns using the CQP syntax and to query the RSC and BROWN corpora again using CQPweb, as shown in Figure 5.

Figure 5: Exercises with patterns in CQPweb.

Session 6 belongs to both the Corpus Query and the Data Analysis block, dealing with the results of a query. The result of a specific linguistically motivated query is usually a data set containing information about a particular linguistic phenomenon extracted from a particular corpus. In this class, students were introduced to the concept of a data set and to creating, formatting and manipulating it using the online search tool CQPweb. Closely related to the data set, this class introduces the notions of observations, features and values of features. During the practical part, students were asked to create their own data set based on a research question (e.g. the distribution of content verbs and their parts of speech across registers in a specific corpus) and to formulate the research question in terms of query, observations and features.

Sessions 7 to 10 belong to the Data Analysis block, introducing basic data analysis and data evaluation methods such as frequency distribution, normalization and statistical significance testing using the χ² (chi-square) test. The statistical analyses in these sessions are based on the queries and data extracted in the previous sessions.

Session 7 is a gentle introduction to data analysis, introducing (with relevant examples) all theoretical notions related to frequency distribution, normalization and χ². These concepts are applied in exercises carried out in Excel/LibreOffice. Excel/LibreOffice is not the state of the art in statistical analysis, but it has one big advantage: the statistical analysis can be performed step by step, allowing students to understand the path to the final formula. Understanding how a specific formula works (including the intermediate steps) is a great benefit for learners, who will need this kind of knowledge later in their academic studies.

Figure 6: Calculating normalized figures in Excel/LibreOffice.

Session 8 introduces statistical analysis with R in RStudio. It introduces basic notions related to R, including data manipulation such as adding column names, adding additional variables (columns), summarizing the data, and merging and combining two or more data sets, presenting appropriate examples for each of these topics. In the exercise part of this session, students are asked to extract similar data sets from other corpora, applying the same data manipulation as used in the examples. At the end, all data sets are combined into one large data set, which is then analyzed in Session 9. Figure 7 shows the introductory part related to data frames in R.

Figure 7: Exercises with data frames in R.

Session 9 relates to Session 7 in that it is concerned with basic data analysis and data evaluation methods such as frequency distribution (see Figure 8), normalization and statistical significance testing using the χ² test. The two sessions differ in the tools used for the analysis: while data analysis in Session 7 was exemplified using Excel, Session 9 uses R. The exercises from Session 7 are repeated in Session 9 to show how the tools relate to each other. However, it is also shown that R is more powerful when dealing with multivariate data sets, extending the analysis performed in Session 7.

Figure 8: Data analysis in R.

Session 10 is the continuation of Session 9, introducing additional aspects of data analysis and showing how to visualize and interpret data from different perspectives (see Figure 8). Students prepare the different visualizations and their interpretation in groups. The results are then discussed together in class. These discussions make explicit how important it is to verify the interpretation of the macro perspective of the visualization against the micro perspective, i.e. the examples from the corpora (concordance lines), thereby linking and intertwining quantitative and qualitative analysis.

Figure 9: Data analysis in R.

The R Markdown documents used throughout Sessions 8 to 10 include sample code for data manipulation and data analysis (see Figures 8 and 9). The documents are modular, so that they can be applied to different data sets and the included code can be copied, modified and adapted. Modification and adaptation are exemplified in the exercises, and the students are encouraged to make notes about technical aspects and interpretations.

===5 Conclusion===
We presented a ten-session practical course on corpus linguistics for students with a humanities background. The structure of the course is based on active learning methods to address the challenges of teaching a technical course to students with little or no technical background. An active learning environment encourages students to work on research questions alone or as a group, addressing (technical) challenges, solving technical and research-related problems and discussing results. The role of the teacher moves in the direction of an assistant who answers questions, pushes in the right direction and helps to find solutions. The course material, as presented here, allows for easy modification, adaptation and extension. This makes the course and its material applicable to different target groups and settings, making the creation of such material worth the effort.

===References===
* Jonathan Bergmann and Aaron Sams. 2012. Flip Your Classroom: Reach Every Student in Every Class Every Day. International Society for Technology in Education, Eugene, Or.
* Charles C. Bonwell and James A. Eison. 1991. Active Learning: Creating Excitement in the Classroom. Number 1, 1991 in ASHE-ERIC higher education report. School of Education and Human Development, George Washington University, Washington, DC.
* Stefan Evert and Andrew Hardie. 2011. Twenty-First Century Corpus Workbench: Updating a Query Architecture for the New Millennium. In Proceedings of the Corpus Linguistics 2011 Conference, Birmingham, UK.
* W.N. Francis and H. Kučera. 1979. Manual of Information to Accompany A Standard Corpus of Present-day Edited American English, for Use with Digital Computers. Brown University, Department of Linguistics.
* Jürgen Handke, Alexander Sperl, and Deutsche ICM-Konferenz, editors. 2012. Das inverted classroom model: Begleitband zur ersten deutschen ICM-Konferenz. Oldenbourg, München. OCLC: 810266426.
* Andrew Hardie. 2012. CQPweb – Combining Power, Flexibility and Usability in a Corpus Analysis Tool. International Journal of Corpus Linguistics, 17(3):380–409.
* Erhard Hinrichs, Marie Hinrichs, and Thomas Zastrow. 2010. WebLicht: Web-based LRT services for German. In Proceedings of the ACL 2010 System Demonstrations, pages 25–29. Association for Computational Linguistics.
* Stig Johansson, Geoffrey Leech, and Helen Goodluck. 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers.
* Hannah Kermes, Jörg Knappen, Stefania Degaetano-Ortlieb, and Elke Teich. 2016. The Royal Society Corpus: From uncharted data to corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'16), Portoroz, Slovenia.
* Christian Mair. 1999a. The Freiburg-Brown Corpus (Frown).
* Christian Mair. 1999b. The Freiburg-LOB Corpus (F-LOB).
* Michael Prince. 2004. Does active learning work? A review of the research. Journal of Engineering Education, 93(3):223–231.
* Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.
* Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In Proceedings of the ACL SIGDAT-Workshop.
* Bill Tucker. 2012. The flipped classroom. Education Next, 12(1).
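As an illustration of the Session 1 exercise (marking title, paragraphs and sentences with a standardized mark-up), the following is a minimal sketch in Python rather than the course's own material. The sample text is invented, and the result is only a TEI-like fragment using the element names `body`, `head`, `p` and `s`; a complete TEI document would also need a header and a schema, which are left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: one title and two paragraphs, already split into sentences.
title = "A sample title"
paragraphs = [
    ["First sentence.", "Second sentence."],
    ["A new paragraph."],
]

# Build the fragment: <head> for the title, <p> for paragraphs, <s> for sentences.
body = ET.Element("body")
ET.SubElement(body, "head").text = title
for sentences in paragraphs:
    p = ET.SubElement(body, "p")
    for sentence in sentences:
        ET.SubElement(p, "s").text = sentence

# Serialize to a string so the mark-up can be inspected or written to a file.
xml_string = ET.tostring(body, encoding="unicode")
print(xml_string)
```

Because every unit is wrapped in a named element, two annotators who follow the same scheme produce comparable files, which is the point the introductory exercise is meant to make.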
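The kind of regular expression practiced in Session 4 can be sketched as follows. This is an illustrative Python example with an invented sentence, not one of the course's queries; in the course the queries are run in Notepad++ and CQPweb on the RSC and BROWN corpora.

```python
import re

# Hypothetical sample text (not taken from the course corpora).
text = "The students were analysing and analyzing the analysed corpus while reading."

# \b marks a word boundary, [sz] accepts either spelling,
# and (?:e|ed|ing) lists the permitted endings without capturing them.
pattern = re.compile(r"\banaly[sz](?:e|ed|ing)\b")

matches = pattern.findall(text)
print(matches)
```

The character class and the alternation are the two special constructs doing the work here: one covers spelling variation, the other covers inflectional endings, so a single query retrieves all three word forms.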
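The step-by-step calculation described for Sessions 7 and 9 (normalization followed by a χ² test) can be sketched as below. This is a minimal Python illustration under stated assumptions, not the course's Excel or R code: the counts and corpus sizes are invented, frequencies are normalized per million tokens, and the test is the plain Pearson χ² statistic for a 2x2 table of hits versus all other tokens in two corpora.

```python
def per_million(count, corpus_size):
    """Normalized frequency: occurrences per one million tokens."""
    return count * 1_000_000 / corpus_size


def chi_square_2x2(count_a, size_a, count_b, size_b):
    """Pearson chi-square statistic for hits vs. other tokens in two corpora."""
    observed = [
        [count_a, size_a - count_a],
        [count_b, size_b - count_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the assumption that both corpora behave alike.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Hypothetical figures: 150 hits in 1,000,000 tokens vs. 100 hits in 500,000 tokens.
freq_a = per_million(150, 1_000_000)
freq_b = per_million(100, 500_000)
chi2 = chi_square_2x2(150, 1_000_000, 100, 500_000)

# 3.841 is the critical value for one degree of freedom at the 0.05 level.
significant = chi2 > 3.841
print(freq_a, freq_b, round(chi2, 3), significant)
```

Normalization makes the two corpora comparable despite their different sizes, and keeping the expected counts as an explicit intermediate step mirrors the reason given for starting in a spreadsheet: each part of the formula stays visible.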