Searching Texts for Signs of Aphasia Tatiana Jajcayová, Jozef Kubik, Mária Markošová, and Soňa Senkovičová Dept. of Applied Informatics Faculty of Mathematics, Physics and Informatics Comenius University, Bratislava, Slovakia, jajcayova@fmph.uniba.sk, WWW home page: http://dai.fmph.uniba.sk/w/Introduction/en Abstract: Aphasia is a brain disorder that impairs ability with the late-writer’s wife Diane Ackerman, and hoped to to speak or understand spoken language caused by damage obtain the raw unedited texts of his aphasic novel. Up to of the brain. It is supposed, that it might also influence this date, unfortunately, we have not received the raw texts. the vocabulary and complexity of the written texts of the We decided to test our tools on published works instead. affected people. In this paper, we analyzed two texts (one We analyzed both texts from the two different view- written before and the second after the onset of aphasia) points. The first group of text analyzing methods are for signs of aphasia using two different approaches. The string matching algorithms such as Rabin – Karp algo- first group of text analyzing methods are string matching rithm and approximate string matching algorithm. String algorithms able to find all occurrences of a pattern string in matching algorithms are able to find all occurrences of a another text string. The second group of methods is from pattern string in another text string. In the experiments we the complex networks theory. In scale free word webs, searched for the short words which can be omitted in text statistical measures are calculated, such as various types due to the aphasia, such as "to be", "above", "below" etc. of averages and distributions. More modern approach is and also some short phrases. Both books were scaled in to analyze the graphlet structure of the word web. Both length to be comparable for this type of analysis. studies are applied to the same text data in searching for The second group of methods is from the complex net- signs of aphasia. works theory. It is known that the positional word web constructed from the English texts is a scale free network [6]. In scale free networks, various statistical measures 1 Introduction are analyzed, such as various types of averages and dis- tributions. A more modern approach is to study a graphlet Aphasia is a brain disorder that often occurs after the brain structure of networks to compare them. The graphlet struc- damage. It is a partial or total loss of ability to speak or ture is then used to define measures, which are designed to understand spoken language, especially if the brain areas express network structural differences and similarities. responsible for language are affected. There are many ar- In this paper, we studied the two above mentioned texts eas in which speech may be impaired, and thus several from both of this point of views, searching for similarities types of aphasia – some patients know what the object is and differences. but cannot say the word for it, others replace words with unrelated ones in what would normally be a well under- standable sentence and other ones may have trouble re- 2 String matching analysis peating heard words. We will be working with texts of an author with global aphasia. We tested the two texts using several different string Paul West was an American writer born in Britain in matching algorithms. In this section we present the re- 1930. Over the course of his life he has written over fifty sults by two different types of algorithms: the Rabin-Karp books in various genres – poetry, novels and essays alike. exact string matching algorithm and approximate string In the year 2003 he suffered a stroke and as a result global matching algorithm [3, 7, 11]. We tested the theory [4] aphasia, he was not able to understand words and not able that aphasia demonstrates itself in written text by omitting to speak them as well. However, after speech therapists short words such as articles or some verbs, and thus the failed to help him speak more than a few words, his wife, work written before the injury would have a greater num- Diane Ackerman, proposed to him to write the first aphasic ber of short words than the later one. As we will see, the memoir. After three years the novel was finished, it’s name results of a series of our experiments for frequencies of is The Shadow Factory and this work is the subject of our short words in texts do not support this theory unambigu- research. To compare how aphasia changes the ability to ously. Later, we tested also some other characteristics of write, we also analysed one of his previous books, written the texts, and we present those comparisons as well. In in 1983, The Rat Man of Paris. We were in communication general, the string matching refers to the problem of find- Copyright ©2021 for this paper by its authors. Use permitted under ing all occurrences of a string, a pattern P[1..m] in another Creative Commons License Attribution 4.0 International (CC BY 4.0). string, a text T[1..n]. Large portion of this problem is find- ing ways to do so most efficiently with respect to time or computer memory. For more details see [3, 11]. 2.1 Experiments with string matching In the graphs below there are some interesting compar- isons we found while searching for short words. Figures 1 through 7 show usage of various words that have a chance to be omitted in speech. We have found Rat Man of Paris is 1.36 times longer when comparing the number of words and has 1.4 times more characters. The following graphs show comparison of two texts where the number of words Figure 3: Prepositions of place. in The Shadow Factory has been scaled accordingly. While Figures 1 and 2, where the frequencies of the us- whether it should continue pattern matching despite a mis- age of the forms of the verb ‘to be‘ and of pronouns is match or stop and move on. Edit distance is a number tested by the Rabin-Karp algorithm, give mixed results, of alterations that can be made to either the pattern or the Figure 3, which gives the comparison of the usage of currently searched part of the text in order to make them prepositions of place, slightly supports the hypothesis that match. An alteration can be either substitution of a charac- the later post-stroke work contains fewer short words of ter for a different one, insertion of a character or removal this kind. of a character. A brute force approach would simply cal- culate an edit distance from the pattern P to all substrings of the text T . This is very demanding computationally. However, as it is with many problems, this one can too be solved using dynamic programming. First we make a two dimensional array L where the number of columns and the number of rows correspond to the lengths of compared strings w1 and w2 . Each new entry in the array will be calculated from the currently compared characters and the values before. For cell Li, j , containing the minimum num- ber of alterations needed to match w1 [1.. j] with w2 [1..i], we compute the value by comparing the next characters. If the characters are the same no new alterations are needed, Figure 1: Use of forms of the verb “to be”. RMP: The Rat Man of and we take the value from L(i−1),( j−1) . If they are not Paris, SF: The Shadow Factory. the same, the total edit distance will be greater by one. If values Li,( j−1) and L(i−1), j are equal it signalizes that we would change a character. When Li,( j−1) is not equal L(i−1), j means we try inserting or deleting a character. To calculate our Li, j we take the lower value of the two and add one. The same results as above for the exact algorithms are supported by testing the use of prepositions of place by the approximate algorithm, as seen on Figure 4. The work on the novel was hard, lengthy and took three years to finish. Therefore, we also included tests compar- ing the initial parts of the novel to the concluding parts of the novel. The results (Figure 5) show that the frequency of the short words increased a little by the end of the novel. Figure 2: Use of pronouns. We also thought interesting to compare the literary text of the novel with the text of the Preface written by the While the Rabin-Karp and other exact string match- author for the publication of the book, with the assumption ing algorithms abort a pattern search when its characters that the preface may be less guarded. stopped matching the text, approximate algorithms will do Reading both, pre-stroke as well as the after-stroke, so only conditionally. We run several test using the ap- novels of Paul West is a peculiar experience. However, proximate algorithm. the feeling is very distinct. In the after-stroke work some Before the search using the approximate algorithm, we words stand out as “weird”, out of place. As the author specify an edit distance. This will tell the algorithm himself wrote in the Preface, those words are indeed in- Figure 4: Prepositions of place - approximate algorithm. Figure 6: Prepositions of time; Preface vs. novel. Figure 7: Unique words. Figure 5: Total short words. attachment leads to the scale free structure of the network, correct. He left them there on purpose, to show readers manifested in power law degree distribution the state of his mind back then. We would love to quan- tify these differences in reading experiences. At the mo- P(k) ∝ k−γ , ment we do not know how, but our graphs in Figure 7 is where k is a degree and γ is scaling exponent. Variations a move toward this direction. The hypothesis is that West of the local processes described above lead to the different might have forgotten many short words due to his con- final network structure [2, 5]. By scale free we mean that dition and that is why the number of unique short words there is no internal scale in the network. For example, the in The Shadow Factory was significantly smaller. We ob- situation is different in random graphs with Poisson de- served this behaviour also when comparing the first and gree distribution where there is an internal scale – an av- the last parts from The Shadow Factory. This time the part erage degree. Which means in the random graphs number with heavier aphasic traits would be the one written first. of nodes having the grater degree or smaller degree than The later part, indeed contains more unique words. average quickly decreases. This in not the situation in the For the investigating aphasia disorders that impair the scale free networks. ability to use words in the right context, it seems that other As we know, words in a language are not randomly dis- more complex tools, like contextual language graphs, tributed in texts. Their ordering reflects grammatical rules which capture the relationships of words in a text, will be of the language in question. Some of the words are used needed. We present such experiments and results in the very often, some of them are rather rare. Word webs en- following section. ables one to look at the organization of words in a language differently. Positional word web expresses how words are organized in sentences, that means syntactic aspect of the 3 Word web analysis language. Due to the scale free structure in all of the stud- ied languages, it is supposed, that preferential attachment It has been shown first by Barabási and Albert [1, 2], that was involved in positional word web development. the global structure of complex networks is influenced by Positional word web is created from the text as follows: the local processes of their development. Preferential node Unique words (without respect to their grammatical form) are nodes of the network. If one chooses word w, all words Properties of word webs RMP SF which are placed in the sentence before and after the word Number of nodes 8422 6582 w anywhere in the analyzed text are connected by an edge Number of edges 35054 25478 with w. The punctuation marks are not taken into account, Maximal degree 2191 1595 they are treated as if they are not included in the text. This Minimal degree 1 1 process of word web creation has been suggested by Can- Average degree 8.324 7.742 cho and Solé [6]. The result is a connected graph, because Ratio of unique words 0.14305 0.15835 all words in the text has at least one neighbour. It has been Density 0.00099 0.00118 shown by Cancho and Solé [6] and by us [8], that such po- Av. clustering coeff 0.384 0.378 sitional word web, created by the above described process Network diameter 8 7 and based on the English text is a scale free network and Av. shortest dist. 2.92499 2.89893 as such, can be analyzed as a graph. Table 1: Properties of the two word webs before and after We created two word webs, one from each text, with a aphasia. RMP: The RatMan of Paris, SF: The Shadow help of our own application. In a special cases applica- Factory. tion is able to recognize shortened versions of words and replace them by a correct ones (such as ’d, which can be had, would, did, depending on the context). Application provides standard graph analysis and calculates number of other. For example if we have a chain of tree nodes, the nodes, edges, occurrences of words in the text etc., and middle node is in a different automorphism orbit as the also standard distributions such as degree distribution for end ones, as no automorphism ever maps the middle one example. onto an end one. The end nodes belong to the same au- Graphlet structure of both networks was analyzed too. tomorphism orbit. 30 different connected graphlets have Usually one looks for connected graphlets. Graphlets 73 different automorphism orbits, so the correct analogue are small nonizomorphic induced subgraphs consisting of to the degree distribution is to measure the number of from two to five nodes [9]. There are 30 such connected nodes touching particular graphlet at a node belonging graphlets. Of course, one can take into account graphlets to a particular orbit. Therefore, we get 73 graphlet de- consisting of more nodes, but the number of such graphlets gree distributions (GDD). Then one can compare two net- grows up very quickly, making calculations very time con- works G and H calculating a measure called network GDD suming. Therefore, the convention has been established, agreement (GDDA) [9]. There are two types of GDDA - that speaking about graphlets, we have in mind the ones s, namely Aa (G, H) (arithmetic agreement) and Ag (G, H) described above [9], [10]. (geometric agreement). Here the result is a real number To compare two networks one can calculate "relative between zero and one. The closer to one the agreement is, graphlet frequency distance" (RGFD). Let Ni (G) be the the more similar the two networks are. number of graphlets of the type i in the network G, and let For the two networks G and H and the orbit j the j-th T (G) be the total number of graphlets of G. Thus, rela- agreement A j (G, H) = 1 − D j (G, H), where D j (G, H) is tive graphlet frequency distance D(G, H) between the two a j-th distance, is defined. The distance is given as fol- graphs G and H is defined as: D(G, H) = ∑29 i=0 |Fi (G) − lows: for each orbit j and the graph G the j-th GDD dGj (k) Ni (G) Fi (H)|, where Fi (G) = − log T (G) [9]. The result is a pos- is measured. If j = 0 one has a classical degree distribu- j d (k) itive real number. In general, two nets are similar, if this tion. Then dGj (k) is scaled as SGj (k) = Gk . SGj (k) is then number is under 50. This is a rule of thumb introduced normalized by TGj = ∑∞ j k=1 SG (k) giving normalized degree by Natasha Przulji in [10]. The motivation lies in the ob- j distribution NG (k). The j-th distance of the two net- servations of distances of protein-protein interaction net- j works G and H is given as D j (G, H) = √12 ∑∞ k=1 ([NG (k) − works and corresponding model networks. Better compar- 1 ison one gets using "graphlet degree distribution" (GDD). NHj (k)]2 ) 2 . By the graphlets, one can extend the concept of the de- Arithmetic agreement is then defined as Aa (G, H) = 1 72 j gree distribution. The node degree k(m) of the node m 73 ∑ j=0 A (G, H). Geometric agreement is given as 1 is a number of edges incident with the node in question Ag (G, H) = (∏72 j j=0 A (G, H)) . 73 and the degree distribution measures how many network nodes have the degree k. From the graphlet point of view, the degree k means that k graphlets of H0 type (which is 3.1 Results an edge) [9] touch (include) certain node. The same way we can look at another types of graphlets. However, in the First, the statistics of the two word webs is depicted in the graphlets it is topologically meaningful to distinguish at Table 1. From the Table 2 it seems that the second word which automorphism orbit the node touches them. Two web has more internally connected structure, because the graph nodes belong to the same automorphism orbit, if graphlet ratios are systematically slightly higher, but there there exists an automorphism which maps one onto the are no significant differences. Types of graphlets RMP SF ders that are demonstrated in language. Results of our tests two node 0.0988% 0.1176% using several exact and approximate algorithms do not de- three node 0.00979% 0.01268% cisively support the tested logopedical hypotheses. It is four node 0.001988% 0.002713% clear that use of not edited texts of authors with aphasia five node 0.000486% 0.000664% would be more revealing. Lajos Grendel, a Slovak writer and publicist, representative of Hungarian literature in Slo- Table 2: Ratios of the connected graphlet numbers, nor- vakia, is another author known to suffer aphasia. We hope malized by the number of all graphlets, connected and un- to gain an access to his “raw” texts written before and after connected of the  same size. The number of all graphlets the injury and analyse them. of size n is Nn , N is the number of network nodes. RMP: The RatMan of Paris, SF: The Shadow Factory. 4.2 Word web analysis Measures Comp. 1 Comp. 2 Comp. 3 As has been stated above, word web analysis does not RGFD 4.76126 196.043 187.868 show significant differences in networks due to aphasia. arithm. GDDA 0.870742 0.634399 0.636399 The reason might be twofold: first, it is known, that The geom. GDDA 0.8372 0.60967 0.616451 Shadow Factory has been edited by writer’s wife before publication. We made an effort to get the raw text, before Table 3: RGFD a GDDA comparisons of our word webs edition, but without success. The second possibility lies in and random graphs. Comparison 1 compares word webs a fact, stated in the Introduction, that despite speech thera- of the two books in question. Comparison 2 compares pist’s failure to teach writer to speak again, he gained a lot word web of Rat man of Paris and random graph having of his writing abilities in these three years of writing The the same number of nodes and edges. Comparison 3 com- Shadow Factory. Our analysis might confirm this, which pares word web of The shadow factory and random graph could be an important result, after more detailed studies. It having the same number of nodes and edges. shows, that language can be totally lost in one respect due to aphasia, but can be possibly gained back by specialized targeted training. RGFD of the word webs D(G1 , G2 ) = 4.76126 indicates high similarity of the graphlet structure of both texts. No influence of aphasia is seen. But GDDA analysis is better References for comparison, because it provides numbers in the (0,1) [1] A. L. Barabási, R. Albert, Emergence of scaling in random interval. We calculated both of the agreements. Geometric networks, Science 286 (1999) 3616 agreement is Ag (G1 , G2 ) = 0.8372 and arithmetic agree- [2] R. Albert, A. L. Barabási, Statistical mechanics of complex ment is Aa (G1 , G2 ) = 0.870742. Here G1 is a word web networks, Rev. Modern Phys. 74 (2002) 47 of the Rat Man of Paris, and G2 the one of The Shadow [3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Intro- Factory. Both agreements are far more closer to 1 then to duction to algorithms, 2nd edition (2001), MIT press 0 and indicate great similarity of the graphlet structure of [4] Z. Cséfalvay. Súčasný pohl’ad na diagnostiku a terapiu both networks. Therefore we can state, that we did not find afázie, Česká a slovenská neurologie a neurochirurgie. – a significant influence of aphasia on the graphlet structure Roč. 70/103, č. 2 (2007), s. 118–128 of our word webs. Also from the syntactic point of view, [5] S. N. Dorogovtsev, J. F. F. Mendes, Evolution of networks, both texts are written by the same style, with the same ba- Adv. Phys. 51 (2002) 1079 sic vocabulary. [6] R. Ferrer I Cancho, R. Solé, The small world of human We also made a comparisons of both networks to the language, Proceedings of the Royal Society of London 268 random graphs having the same number of nodes and (2001), 2261 edges. The results are in Table 3. We can see, that both [7] T. Jajcayova, Z. Majerikova, Note on Some Interesting word webs are far more similar to each other then to ran- Test Results for the Rabin-Karp String Matching Algorithm, dom graphs with respect to the graphlet structure. ITAT 2018 : Information Technologies – Applications and Degree distributions of both networks show, that they Theory are both scale free. Both degree distributions are linear [8] M. Markošová, Network model of human language, Physica in log-log plot, indicating power law (3) with the scaling A: Statistical Mechanics and its App. 387 (2008), 661 exponent γ = 2.1. Aphasia does not influence the degree [9] N. Przulj, Biological network comparison using graphlet de- distribution of the word web at all. gree, Bioinformatics 23 (2007) 177 [10] N. Przulj, D. G. Corneil, I. Jurisica, Efficient estimation of graphlet frequency distribution, Bioinformatics 22 (2006) 4 Conclusion 974 [11] S. Shabnam Hasan, F. Ahmed, R. S. Khan, Approximate 4.1 String matching analysis String Matching Algorithms: A Brief Survey and Com- Results of experiments presented above are a part of the parison, International Journal of Computer Applications broader research project into the diagnosis of brain disor- 120(8):26–31, June 2015.