
Analyses of Literary Texts by Using Statistical Inference Methods

Mehmet Can Yavuz

mehmetyavuz@sabanciuniv.edu
Computer Science and Engineering Department, Sabancı University, Tuzla; Management Information Systems Department, Kadir Has University, Cibali; Physics Department, Boğaziçi University, Bebek

If a road map had to be drawn for Computational Criticism and the Artificial Literature that follows from it, it would certainly have to pass through Shakespearean plays. Demonstrating the structures of these plays through text analysis can be seen both as a naive effort and as a scientific view of the characteristics of the texts. In this study, we carry out a textual analysis of Shakespeare's plays for this purpose. Methodologically, we apply Latent Dirichlet Allocation (LDA) and then Singular Value Decomposition (SVD) in order to extract topics and to reduce the topic distribution over documents to a two-dimensional space. The first question asks whether there is a genre called Romance between the Comedies and the Tragedies. The second question asks whether, if each character's speech is taken as a text, the dramatic relationships between the characters can be revealed. We find relationships between genres that are also supported by literary theory, and we find that the main characters follow the antagonisms within the play more closely as the length of their speech increases. Although the classification of the side characters is not always what one would expect from reading the plays, there are observations on dramatic fiction that are also supported by literary theory. Tragedies and revenge dramas have different character groupings.

If a road map had to be drawn for Computational Criticism (Moretti, 2013) and the Artificial Literature that follows from it, it would certainly have to pass through Elizabethan drama. In particular, Shakespearean texts are the most outstanding examples of dramatic fiction. Demonstrating the structures of these texts through text analysis can be seen both as a naive effort and as a scientific view of the characteristics of the texts. In this study, we carry out a textual analysis of Shakespeare's plays for this purpose.

To begin with, “the First Folio” is the printed volume in which Shakespeare's works were brought together for the first time (Snyder, 2001). The edition of 1623 was directed by two actors from the King's Men, the ensemble of which Shakespeare was also a member. Half of the 36-play collection had never been published anywhere before. Plays were also printed in Quarto form; both formats take their names from the way the sheets were folded. About 800 copies of the First Folio are known to have been printed, of which 233 survive today. In the First Folio, Shakespearean plays are divided into three groups: Comedies, Tragedies, and Histories. Romance is the genre that hybridizes Comedy and Tragedy, developed at the beginning of the 17th century. At the end of his career, Shakespeare wrote four romances: Pericles, Cymbeline, The Winter's Tale and The Tempest. “The First Folio” groups Cymbeline with the Tragedies, and The Winter's Tale and The Tempest with the Comedies. The reason for this may be that The Winter's Tale and The Tempest begin as tragedies and then turn into comedies, whereas Cymbeline starts as a comedy and ends as a tragedy.

Shakespeare's two tragedies Macbeth and Othello are very good examples of a true tragedy and a revenge tragedy, respectively. Tragedies are designed as the struggle between the main characters and the opposing characters who create obstacles for them. The protagonist is generally the main character with whom the audience sympathizes. Although he is not sympathetic, Macbeth is a protagonist, and the opposing characters, Duncan and Banquo, are antagonists. Similarly, there is also antagonism in revenge drama, where the main theme is revenge. The antagonist or protagonist seeks revenge for an imaginary or real injury. Iago, the antagonist, takes his revenge by provoking Othello, the protagonist, against his wife.

Computerized analysis of literary texts, in other words computational criticism, is a new and promising field (Ramsay, 2011). Pioneering works aim to answer critical questions by using Natural Language Processing (NLP) methods. Alongside these studies, the developing field of artificial literature is interested in creating fictional texts with the help of computers. In this study, we make a computational analysis of Shakespearean texts. There are basically two questions we are trying to answer. The first is whether the genres of Shakespeare's theater texts can be classified by computer. The second is whether antagonisms can be revealed if the lines spoken by each character are taken as a text. We try to answer both with the same unsupervised learning technique.

In recent years, NLP methods have been developing rapidly and text analysis methods are becoming more advanced. Topic modeling articles are among the most cited. An unsupervised topic modeling algorithm is used in this study. It generates latent topics, of which each document is a mixture. Given the latent topic distribution, a dimension reduction algorithm maps each document onto two-dimensional coordinates while preserving its intrinsic characteristics.

Related Works

The Digital Humanities field lets researchers discuss quantitative methods in literary and cultural studies (Clement et al., 2008; Crane, 2006). “Drametrics” is a field that deals with the quantitative analysis of the literary genre of drama (Romanska, 2015). Digital Shakespeare studies have also received attention since the 2000s (Hirsch, 2014; Mueller, 2008). These studies cover issues ranging from digital archives to authorship analysis (Vickers, 2011; Evert, 2017). Besides, machine-learning-based text analyses have been carried out for genre classification (Ardanuy, 2014; Hope, 2010; Schöch, 2016; Underwood, 2013; Yu, 2008). Information-theoretical approaches have also been applied successfully (Rosso, 2009). In the literature, structural elements such as the dramatis personae and scene structures are quantified, and applications are developed to support further analysis (Dennerlein, 2015; Krautter, 2018; Schmidt, 2019; Trilcke, 2015; Wilhelm, 2013; Xanthos, 2016).

In order to analyze a literary text, we use unsupervised topic modeling. Although there are linear-algebraic models such as Non-Negative Matrix Factorization (NMF) (Lee, 1999), probabilistic models are regarded as more reliable and more capable of representing the true distributions of topics. Probabilistic Latent Semantic Analysis (Hofmann, 1999) and Latent Dirichlet Allocation (Blei, 2003) are the two major unsupervised topic modeling algorithms. Although both allow us to classify texts according to topic distribution, Latent Dirichlet Allocation, as a generative model, is regarded as superior to its competitors. Principal Component Analysis (Jolliffe, 2002), Linear Discriminant Analysis (Brown, 2000) and NMF are all dimension reduction techniques, alongside Singular Value Decomposition (Golub, 1970). The last algorithm we use is K-Means, a well-known clustering algorithm that minimizes the variance within clusters (Lloyd, 1982).

Theory

In this study, we use text analysis to investigate genres and antagonisms in Shakespearean plays. Latent Dirichlet Allocation (LDA) generates the distribution of documents over topics. First, the optimum number of topics for LDA is obtained with a grid search. Then a dimension reduction algorithm, truncated Singular Value Decomposition (tSVD), maps the documents onto a two-dimensional plane, where they are graphed.
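The steps above can be sketched end to end with Scikit-learn. This is a minimal illustration under stated assumptions, not the exact code of the study: the toy corpus, the helper name `topic_map`, and the choices of three topics and two clusters are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; in the study each document is a whole play
# or the collected speech of one character.
toy_docs = [
    "love marriage jest fool wedding love",
    "wedding jest love fool disguise marriage",
    "blood murder crown dagger night blood",
    "crown murder ghost dagger blood night",
    "storm island magic daughter forgive sea",
    "magic island storm forgive daughter ship",
]


def topic_map(docs, n_topics=3, n_clusters=2, seed=0):
    """Map texts to 2-D coordinates via LDA and tSVD, then cluster them."""
    # Bag-of-words counts: the document-term matrix.
    counts = CountVectorizer().fit_transform(docs)
    # LDA turns counts into a document-topic matrix (rows sum to 1).
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(counts)
    # tSVD reduces the topic dimension to two coordinates per document.
    svd = TruncatedSVD(n_components=2, random_state=seed)
    coords = svd.fit_transform(doc_topic)
    # K-Means groups documents by Euclidean distance in the 2-D plane.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(coords)
    return doc_topic, coords, labels


doc_topic, coords, labels = topic_map(toy_docs)
```

The returned `coords` array is what would be plotted, with `labels` used to color the points.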

In the following sections, topic generation with the LDA algorithm and dimension reduction with the tSVD algorithm are explained. The aim of using tSVD is to express each text with two floating-point numbers while preserving its latent topic properties. Classification can then be based on the distances between texts in the new two-dimensional feature space. In the last step, we cluster the texts using Euclidean distance. The theoretical section is kept brief and explanatory because the main focus is on the experimental results.

If the data has a large number of features, it is useful to reduce it to the subset of features that are most relevant to the prediction problem. SVD factors any matrix $A$ into a product of three matrices,

$$A = U S V^{\top},$$

in which $U U^{\top} = I$ and $V V^{\top} = I$. $S$ is a diagonal matrix that holds the $r$ singular values, where $r$ is the rank of $A$. Truncated SVD is a reduced-rank approximation: all singular values except the $k$ largest are set to zero, and only the first $k$ columns of $U$ and $V$, which correspond to the largest singular values, are kept. The dimensions of the truncated factors are

$$[u \times k]\,[k \times k]\,[k \times v].$$

Since $A$ is approximated with $k$ dimensions, the dimension reduction takes place between the matrix multiplications. A descriptive subset of the data, called $T$, is a dense summary of the matrix $A$; in the usual formulation,

$$T = U_k S_k.$$

LDA is a generative statistical model that explains, on the basis of a set of observations, why certain parts of the data are similar. LDA assumes that observations are generated by latent variables, the latent topics. Thus, each document is a mixture of topics, each topic is a distribution over words, and each word is drawn from the mixture. The observations are the word-frequency statistics of each document, the so-called document-term matrix. This bag-of-words approach is intended to reflect how important a word is in a document. Topics are identified on the basis of term co-occurrence (the topic-term matrix), and each document is assumed to be characterized by a particular set of topics (the document-topic matrix). Topics, mixtures and the other variables are all hidden and must be inferred from the observed data, the document-term matrix. In Figure 1, the plate notation of LDA is shown. In the plate notation, there are $N \times D$ variables that represent observations, $K$ topics in total and $D$ documents in total.
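Returning to the truncated SVD step described above, the following NumPy sketch may help make the shapes concrete. It is only an illustration: the random matrix `A`, the helper name `truncated_svd`, and the choice $k = 2$ are assumptions, and the dense summary is computed as $T = U_k S_k$, the usual convention, which is assumed here rather than taken from the text.

```python
import numpy as np


def truncated_svd(A, k):
    """Return the rank-k factors (U_k, S_k, Vt_k) of A."""
    # np.linalg.svd returns singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]


rng = np.random.default_rng(0)
A = rng.random((5, 4))        # hypothetical 5 documents x 4 topics
U_k, S_k, Vt_k = truncated_svd(A, k=2)

A_k = U_k @ S_k @ Vt_k        # rank-2 approximation: [5x2][2x2][2x4]
T = U_k @ S_k                 # dense two-column summary of A
```

Because the rows of `Vt_k` are orthonormal, `T` can equivalently be obtained by projecting the data, `A @ Vt_k.T`, which is how a fitted reduction is applied to a matrix.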

Here, $\alpha$ and $\eta$ are the parameters of the prior distributions over $\theta$ and $\beta$, respectively. $\theta_d$ is the distribution of topics for document $d$ (a real vector of length $K$). $\beta_k$ is the distribution of words for topic $k$ (a real vector of length $V$). $z_{d,n}$ is the topic of the $n$th word in the $d$th document, and $w_{d,n}$ is the $n$th word of the $d$th document. Only the gray shaded circles are observed variables. The white circles are inferred by using Variational Inference. The topic of each word, the distribution over topics for each document, and the distribution of words per topic are all latent variables in this model. With this formulation, similarities between documents can be captured.
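The generative story behind these variables can be sketched by sampling a small synthetic corpus. This is an illustration only, under stated assumptions: the function name `generate_corpus`, the symmetric priors, the fixed document length `N`, and all numeric values are hypothetical, and indices are 0-based rather than starting at 1.

```python
import numpy as np


def generate_corpus(D=3, K=2, V=6, N=8, alpha=0.5, eta=0.5, seed=0):
    """Sample (theta, beta, z, w) from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # beta_k: distribution of words for topic k, shape (K, V).
    beta = rng.dirichlet(np.full(V, eta), size=K)
    # theta_d: distribution of topics for document d, shape (D, K).
    theta = rng.dirichlet(np.full(K, alpha), size=D)
    z = np.empty((D, N), dtype=int)
    w = np.empty((D, N), dtype=int)
    for d in range(D):
        for n in range(N):
            # z_{d,n}: topic of the n-th word, drawn from theta_d.
            z[d, n] = rng.choice(K, p=theta[d])
            # w_{d,n}: the word itself, drawn from the chosen topic.
            w[d, n] = rng.choice(V, p=beta[z[d, n]])
    return theta, beta, z, w


theta, beta, z, w = generate_corpus()
```

Inference runs in the opposite direction: only `w` is observed, and `theta`, `beta` and `z` have to be estimated from it.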

The model contains both continuous and discrete variables. $\theta_d$ and $\beta_k$ are vectors of probabilities. $z_{d,n}$ is an integer in $\{1, \dots, K\}$ that indicates the topic of the $n$th word in the $d$th document. $w_{d,n}$ is an integer in $\{1, \dots, V\}$ that indexes over all possible words. In the truncated SVD, $S_k$ denotes the $k$ largest singular values, where $k$ is the number of reduced features. Each feature can be expressed as a percentage of variance, which makes it possible to choose only the most significant ones.

We included two evaluations in our experiments. The first is whether or not the genre of Romance can be distinguished computationally. In order to carry out this experiment, each tragedy, comedy and romance is treated as a separate document and is processed by LDA. Afterwards, the document-topic distribution matrix is reduced to two dimensions by means of the dimension reduction algorithm, tSVD. Similarly, in the second evaluation, the lines of each character were treated as a text, and the document-topic matrix obtained from LDA was reduced to two dimensions. Two different types of tragedy are considered: Macbeth and Othello. Thus, three different experiments and optimizations were conducted for these two evaluations. In Python, the Scikit-learn library was used for the LDA, tSVD and GridSearch functions.

Two preprocessing steps were performed for each set of documents. First, stop words were removed from the dictionary. The stop-word list covers both contemporary English and Elizabethan English and contains 1144 words. The characteristic of these words is that they appear frequently in every text. Second, the texts were expressed as word frequencies to create the document-term matrix. Each text is thus expressed as a fixed-length vector of dictionary size, and stacking these vectors gives the document-term matrix.

In order to find the right number of topics, we need an optimization. Since the topics are latent variables, there is no single correct number of topics. A grid search over the number of topics is carried out, and the setting with the highest log-likelihood is taken as optimal. In all three experiments, the values between 6 and 12 were each tried three times, and the results are drawn in Figure 2. For example, for Macbeth, 3 experiments were conducted for a given topic number. The LDA function that we called was repeated up to 10 times before giving results. Thus, the LDA algorithm was repeated up to 30 times in total for a given topic number.
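The preprocessing and the search over topic numbers can be sketched with Scikit-learn as follows. This is a hedged sketch, not the study's code: the toy corpus and the short `extra_stop_words` list are hypothetical stand-ins for the plays and the 1144-word list, and reading "tried three times" as `cv=3` and "repeated up to 10 times" as `max_iter=10` is an assumption.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus standing in for plays or character speeches.
toy_docs = [
    "thou hast the crown and the dagger murder blood",
    "blood will have blood murder crown night",
    "love and marriage thy jest fool wedding",
    "the fool doth jest of love wedding marriage",
    "storm island magic daughter thee forgive",
    "magic island storm forgive daughter sea",
]

# Hypothetical Elizabethan additions to the built-in English stop words.
extra_stop_words = ["thou", "thee", "thy", "hath", "doth", "hast"]
stop_words = sorted(ENGLISH_STOP_WORDS.union(extra_stop_words))

# Document-term matrix of word frequencies, stop words removed.
vectorizer = CountVectorizer(stop_words=stop_words)
counts = vectorizer.fit_transform(toy_docs)

# Grid search over 6..12 topics. With no explicit scorer, GridSearchCV
# uses LatentDirichletAllocation.score, an approximate log-likelihood.
search = GridSearchCV(
    LatentDirichletAllocation(max_iter=10, random_state=0),
    param_grid={"n_components": list(range(6, 13))},
    cv=3,
)
search.fit(counts)
best_n_topics = search.best_params_["n_components"]
```

The per-setting mean scores in `search.cv_results_["mean_test_score"]` are the kind of values that could be plotted against the number of topics.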

As an observation, the log-likelihood increases as the number of topics decreases. However, we prefer not to try fewer than 6 latent topics because, in the literature, the number of themes/topics identified for Shakespearean plays is generally at least 6 (“William Shakespeare”, 2015).

Discussion

In Figure 3, the documents consisting of the Tragedy, Comedy and Romance plays are represented. The document-topic distribution matrix is reduced to two dimensions and graphed. More than half of the variance is explained by these two components. Even in three dimensions, the clustering does not change. The plays shown in red are Comedies, those in blue are Tragedies and those in green are Romances according to the First Folio.

In the upper left corner, the majority of the Comedies are clustered, and likewise the Tragedies are clustered in the lower right corner. Between these two clusters lie three plays known as problem plays: “All's Well That Ends Well”, “Measure for Measure” and “Troilus and Cressida”. Some critics also include “Timon of Athens”, which is a neighbor of the other problem plays (Snyder, 2001). Thus, between the two clusters there is a gray zone in which the problem plays are placed. Interestingly, although “All's Well That Ends Well” and “Measure for Measure” are grouped as Comedies in the First Folio, they are much closer to the Tragedies. An unexplained fact is that Coriolanus and Othello are also placed in this gray zone. Another open question in this grouping is “Romeo and Juliet”: a tragedy that has comedy elements, it is placed thematically very close to the Comedies cluster.

Another important observation is that the three Romances are clustered within the Tragedies. According to this analysis, the genre of Romance is not distinct from Tragedy.

After the analysis, the characters of Macbeth clearly demonstrate antagonist/protagonist relations, as graphed in Figure 4. There are two basic clusters in the tragedy of Macbeth. The first is the protagonists, led by Macbeth and Lady Macbeth. The second is the antagonists, among them the murdered king and Macduff, who suspects foul play. In the plot, protagonists are shown in blue and antagonists in red. Lady Macbeth stands at the bottom left corner, since she has little to say except to Macbeth. Macbeth himself is closer to the red cluster; as the new king, he has relations with its members. Macduff, who is suspicious and kills Macbeth in the last scene, is in the center of the red cluster. Lady Macduff is also in this cluster. The murdered King Duncan is also at the center of this cluster. However, there is also a misclassification: Siward is in the blue cluster, although Siward and Macbeth clash and Siward is killed. Other than that, the witches, who act as oracles, are in the cluster opposite to Macbeth. Other characters may not be fully explained because of their small and ambiguous roles. Apart from these two clusters, there is a green cluster at the top left. The main character of this cluster is Banquo, Macbeth's brave and noble companion, who has no idea of Macbeth's machinations until he is killed.

The tragedy of Macbeth shows a very clear separation between clusters. The distances between clusters are also meaningful. The red cluster lies between the green and the blue ones. The green cluster is actually closer to the red one than to Macbeth's evil cluster.

The characters of Othello are shown in Figure 5 in accordance with the analysis. We take Othello as an example of revenge tragedy. Unlike the true tragedy Macbeth, Othello does not have antagonist/protagonist clusters in Figure 5. Iago is a single character who sets traps to take revenge on Othello. Throughout the play, Iago misleads Othello for reasons and purposes that only he and the reader know. Othello kills his beloved wife in a fit of jealousy.

Three clusters are shown in different colors. The red cluster consists of the main characters of the play. The blue and green clusters belong to side characters, and their antagonisms are computationally ambiguous. The main characters of the red cluster at the bottom right, Othello, Emilia, Iago and Cassio, speak about almost the same subjects because of the frequency of their dialogue with each other. Therefore, a conflict between them is not visible. Iago, however, is placed in the lower right corner because he reveals his true intentions in his monologues. Othello is therefore a negative example for the methodology we developed. Characters such as the Duke of Venice and the Senator are placed in the top left corner and are in fact far outside the plot. Bianca, shown in the green cluster, is again outside the plot as Cassio's lover.

Othello yields interesting observations on revenge tragedies. In Shakespeare's revenge tragedies, a lonely character presents him/herself differently and his/her true intentions remain hidden. Thus, the clear difference from true tragedies lies in their dramatic structure.

Conclusion

The classification of genres shows that the method we use provides useful quantitative information for differentiating genres. The length of the texts is among the reasons for this success. The positioning of plays between Tragedy and Comedy is much discussed in literary theory. The Romance genre hybridizes elements of Tragedy and Comedy. Instead of mapping the Romance genre in between, the algorithm mapped four “Problem Plays” to a region between the Tragedies and the Comedies. Another interesting finding is that the Romances cannot be distinguished from the Tragedies: the method classifies the Romances within the Tragedies. The method thus shows that some literary discussions also have a quantitative basis. In the light of theoretical discussions, there may of course be a genre called Romance, but we have not yet been able to quantify this difference.

There are also some results from our experiments on the two tragedies we have chosen. We intentionally chose one true tragedy and one revenge play. Of the two, Macbeth clearly shows antagonisms. This is mainly due to the frequency of conversations within the clusters. For example, Macbeth and Lady Macbeth are always aware of each other's true intentions. Dialogues within these clusters are always compatible with each other, and therefore the cluster forms. There is a group subjectivity, which is also verified computationally. The war scene at the end of Macbeth can clearly be observed in the clusters: two clusters that will clash are formed throughout the play, which is quantifiable. In contrast, Iago, who hides his true intentions from everyone, always appears to agree with Othello. Iago never shares his intentions with anyone in the play; they are shared only through monologues. Thus, he cannot form a cluster. He is a lonely character. That is why the algorithm fails to find an antagonism. From this point of view, we can say that the method forms clusters of characters that agree with each other. The dramatic structure of revenge plays cannot be revealed by the method we propose. Our method is successful at finding clusters. We carried out a similar analysis for the play Hamlet, another type of revenge play. Hamlet stands apart in a separate cluster, as a lonely character, together with Lord Polonius, who is responsible for spying on Hamlet. Lord Polonius resembles Iago in that both hide their true intentions.

The dramatic fiction in Shakespeare's texts is revealed to a certain extent. The advantage of the proposed pipeline is that it places a non-linear step before a linear layer. Instead of directly clustering the document-term matrix, LDA generates a powerful representation of each document in a feature space. After the document-topic matrix is generated, a linear dimension reduction layer, tSVD, extracts the principal directions, or principal axes, along which the document-topic matrix has the largest variance.

We think that these naive efforts on the way to Artificial Literature also have a positive effect. For humans, and even for Shakespeare, producing a play is possible with authoring knowledge. By authoring knowledge we mean, for example, how to write a play from a dramatic perspective. Such knowledge was first set out by Aristotle and still sheds light on present-day methods. It would be possible to reverse engineer it for artificial literature, going from a quantitative analysis back to plays. Therefore, analyzing literary pieces, especially texts in dialogue form, can help us verify critical questions and theories. From these analyses, going back to literary text generation becomes possible.

Acknowledgments

This work was supported by grant 12B03P4 of Boğaziçi University.

The author would like to thank Muhittin Mungan for suggesting the Master of Science thesis as his advisor and Meltem Gürle Mungan for her kind opinion. The author would also like to thank actor Güneş Yakın for talks together on Shakespeare.

References

Ardanuy, M. C., & Sporleder, C. (2014, April). Structure-based clustering of novels. In Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL) (pp. 31-39).

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. J. Mach. Learn. Res., 3, 993-1022. doi: 10.1162/jmlr.2003.3.4-5.993

Brown, M. T., & Wicker, L. R. (2000). Discriminant analysis. In H. E. A. Tinsley & S. D. Brown (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 209-235). San Diego, CA, US: Academic Press. http://dx.doi.org/10.1016/B978-012691360-

Clement, T., Steger, S., Unsworth, J., & Uszkalo, K. (2008). How not to read a million books. Available online at http://people.brandeis.edu/~unsworth/hownot2read.html

Crane, G. (2006). What do you do with a million books? D-Lib Magazine. Available online at http://www.dlib.org/dlib/march06/crane/03crane.html

Dennerlein, K. (2015). Measuring the average population densities of plays. A case study of Andreas Gryphius, Christian Weise and Gotthold Ephraim Lessing. Semicerchio. Rivista di poesia comparata LIII: 80-88.

Evert, Stefan & Proisl, Thomas & Jannidis, Fotis & Reger, Isabella & Pielström, Steffen & Schöch, Christof & Vitt, Thorsten. (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities. 32. 4-16. 10.1093/llc/fqx023.

Hofmann, T. (1999). Probabilistic latent semantic analysis. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 289-296).

Hope, J., & Witmore, M. (2010). The Hundredth Psalm to the Tune of “Green Sleeves”: Digital Approaches to Shakespeare's Language of Genre. Shakespeare Quarterly, 61(3), 357-390. Retrieved from http://www.jstor.org/stable/40985589

Hirsch, B., & Craig, H. (2014). “Mingled Yarn”: The State of Computing in Shakespeare 2.0. In T. Bishop & A. Huang (Eds.), The Shakespearean International Yearbook (Vol. 14: Special Section, Digital Shakespeares, pp. 3-35). United Kingdom: Ashgate Publishing Limited.

Golub, G. H., & Reinsch, C. (1970). “Singular value decomposition and least squares solutions”. Numerische Mathematik. 14(5): 403-420. doi: 10.1007/BF02163027. MR 1553974.

Jolliffe, I. (2002). Principal component analysis. New York: Springer Verlag.

Krautter, B. (2018). Quantitative microanalysis? Different methods of digital drama analysis in comparison. Book of Abstracts, DH 2018. Mexico City, Mexico, pp. 225-228.

Lee, Daniel & Seung, H. (1999). Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature. 401. 788-91. 10.1038/44565.

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Trans. Information Theory, 28, 129-136.

Mueller, Martin. (2008). Digital Shakespeare, or towards a literary informatics. Shakespeare. 4. 284-301. 10.1080/17450910802295179.

Moretti, F. (2013). Distant reading. Verso Books.

Rosso, Osvaldo & Craig, Hugh & Moscato, Pablo. (2009). Shakespeare and other English Renaissance authors as characterized by Information Theory complexity quantifiers. Physica A: Statistical Mechanics and its Applications. 388. 916-926. 10.1016/j.physa.2008.11.018.

Ramsay, S. (2011). Reading Machines: Toward an Algorithmic Criticism. University of Illinois Press.

Romanska, M. (2015). Drametrics: what dramaturgs should learn from mathematicians. In Romanska, M. (ed.), The Routledge Companion to Dramaturgy. Routledge, pp. 472-481.

Schöch, Christof. (2016). Topic Modeling Genre: An Exploration of French Classical and Enlightenment Drama. Digital Humanities Quarterly. http://doi.org/10.5281/zenodo.166356

Schmidt, T., Burghardt, M., Dennerlein, K., & Wolff, C. (2019). Katharsis - A Tool for Computational Drametrics. In Book of Abstracts, DH 2019.

Snyder, S. (2001). The genres of Shakespeare's plays. In M. De Grazia & S. Wells (Eds.), The Cambridge Companion to Shakespeare (Cambridge Companions to Literature, pp. 83-98). Cambridge: Cambridge University Press. doi: 10.1017/CCOL0521650941.006

Trilcke, P., Fischer, F., & Kampkaspar, D. (2015). Digital Network Analysis of Dramatic Texts. Book of Abstracts, DH 2015. Sydney, Australia.

Underwood, T., Black, M. L., Auvil, L., & Capitanu, B. (2013). Mapping mutable genres in structurally complex volumes. 2013 IEEE International Conference on Big Data, 95-103.

Vickers, Brian. (2011). Shakespeare and Authorship Studies in the Twenty-First Century. Shakespeare Quarterly. 62. 106-142. 10.1353/shq.2011.0004.

William Shakespeare. (2015, August 21). New World Encyclopedia. Retrieved 12:11, September 2019 from //www.newworldencyclopedia.org/p/index.php?title=William_Shakespeare&oldid=990237.

Wilhelm, T., Burghardt, M., & Wolff, C. (2013). “To See or Not to See” - An Interactive Tool for the Visualization and Analysis of Shakespeare Plays. In R. Franken-Wendelstorf, Lindinger, and J. Sieck (Eds.), Kultur und Informatik: Visual Worlds & Interactive Spaces. Glückstadt: Verlag Werner Hülsbusch, pp. 175-185.

Yu, B. (2008). An evaluation of text classification methods for literary study. Literary and Linguistic Computing 23(3): 327-343.

Xanthos, A., Pante, I., Rochat, Y., & Grandjean, M. (2016). Visualising the dynamics of character networks. Book of Abstracts, DH 2016. Kraków, Poland, pp. 417-419.