Searching Texts for Signs of Aphasia

                            Tatiana Jajcayová, Jozef Kubik, Mária Markošová, and Soňa Senkovičová

                                                  Dept. of Applied Informatics
                                        Faculty of Mathematics, Physics and Informatics
                                           Comenius University, Bratislava, Slovakia,
                                                 jajcayova@fmph.uniba.sk,
                                WWW home page: http://dai.fmph.uniba.sk/w/Introduction/en

Abstract: Aphasia is a brain disorder that impairs ability                with the late-writer’s wife Diane Ackerman, and hoped to
to speak or understand spoken language caused by damage                   obtain the raw unedited texts of his aphasic novel. Up to
of the brain. It is supposed, that it might also influence                this date, unfortunately, we have not received the raw texts.
the vocabulary and complexity of the written texts of the                 We decided to test our tools on published works instead.
affected people. In this paper, we analyzed two texts (one                   We analyzed both texts from the two different view-
written before and the second after the onset of aphasia)                 points. The first group of text analyzing methods are
for signs of aphasia using two different approaches. The                  string matching algorithms such as Rabin – Karp algo-
first group of text analyzing methods are string matching                 rithm and approximate string matching algorithm. String
algorithms able to find all occurrences of a pattern string in            matching algorithms are able to find all occurrences of a
another text string. The second group of methods is from                  pattern string in another text string. In the experiments we
the complex networks theory. In scale free word webs,                     searched for the short words which can be omitted in text
statistical measures are calculated, such as various types                due to the aphasia, such as "to be", "above", "below" etc.
of averages and distributions. More modern approach is                    and also some short phrases. Both books were scaled in
to analyze the graphlet structure of the word web. Both                   length to be comparable for this type of analysis.
studies are applied to the same text data in searching for                   The second group of methods is from the complex net-
signs of aphasia.                                                         works theory. It is known that the positional word web
                                                                          constructed from the English texts is a scale free network
                                                                          [6]. In scale free networks, various statistical measures
1    Introduction                                                         are analyzed, such as various types of averages and dis-
                                                                          tributions. A more modern approach is to study a graphlet
Aphasia is a brain disorder that often occurs after the brain             structure of networks to compare them. The graphlet struc-
damage. It is a partial or total loss of ability to speak or              ture is then used to define measures, which are designed to
understand spoken language, especially if the brain areas                 express network structural differences and similarities.
responsible for language are affected. There are many ar-
                                                                             In this paper, we studied the two above mentioned texts
eas in which speech may be impaired, and thus several
                                                                          from both of this point of views, searching for similarities
types of aphasia – some patients know what the object is
                                                                          and differences.
but cannot say the word for it, others replace words with
unrelated ones in what would normally be a well under-
standable sentence and other ones may have trouble re-                    2   String matching analysis
peating heard words. We will be working with texts of an
author with global aphasia.
                                                                          We tested the two texts using several different string
   Paul West was an American writer born in Britain in
                                                                          matching algorithms. In this section we present the re-
1930. Over the course of his life he has written over fifty
                                                                          sults by two different types of algorithms: the Rabin-Karp
books in various genres – poetry, novels and essays alike.
                                                                          exact string matching algorithm and approximate string
In the year 2003 he suffered a stroke and as a result global
                                                                          matching algorithm [3, 7, 11]. We tested the theory [4]
aphasia, he was not able to understand words and not able
                                                                          that aphasia demonstrates itself in written text by omitting
to speak them as well. However, after speech therapists
                                                                          short words such as articles or some verbs, and thus the
failed to help him speak more than a few words, his wife,
                                                                          work written before the injury would have a greater num-
Diane Ackerman, proposed to him to write the first aphasic
                                                                          ber of short words than the later one. As we will see, the
memoir. After three years the novel was finished, it’s name
                                                                          results of a series of our experiments for frequencies of
is The Shadow Factory and this work is the subject of our
                                                                          short words in texts do not support this theory unambigu-
research. To compare how aphasia changes the ability to
                                                                          ously. Later, we tested also some other characteristics of
write, we also analysed one of his previous books, written
                                                                          the texts, and we present those comparisons as well. In
in 1983, The Rat Man of Paris. We were in communication
                                                                          general, the string matching refers to the problem of find-
     Copyright ©2021 for this paper by its authors. Use permitted under   ing all occurrences of a string, a pattern P[1..m] in another
Creative Commons License Attribution 4.0 International (CC BY 4.0).       string, a text T[1..n]. Large portion of this problem is find-
ing ways to do so most efficiently with respect to time or
computer memory. For more details see [3, 11].


2.1   Experiments with string matching

In the graphs below there are some interesting compar-
isons we found while searching for short words. Figures 1
through 7 show usage of various words that have a chance
to be omitted in speech. We have found Rat Man of Paris
is 1.36 times longer when comparing the number of words
and has 1.4 times more characters. The following graphs
show comparison of two texts where the number of words                            Figure 3: Prepositions of place.
in The Shadow Factory has been scaled accordingly.
   While Figures 1 and 2, where the frequencies of the us-
                                                                  whether it should continue pattern matching despite a mis-
age of the forms of the verb ‘to be‘ and of pronouns is
                                                                  match or stop and move on. Edit distance is a number
tested by the Rabin-Karp algorithm, give mixed results,
                                                                  of alterations that can be made to either the pattern or
the Figure 3, which gives the comparison of the usage of
                                                                  currently searched part of the text in order to make them
prepositions of place, slightly supports the hypothesis that
                                                                  match. An alteration can be either substitution of a charac-
the later post-stroke work contains fewer short words of
                                                                  ter for a different one, insertion of a character or removal
this kind.
                                                                  of a character. A brute force approach would simply cal-
                                                                  culate an edit distance from the pattern P to all substrings
                                                                  of the text T . This is very demanding computationally.
                                                                  However, as it is with many problems, this one can too
                                                                  be solved using dynamic programming. First we make a
                                                                  two dimensional array L where the number of columns and
                                                                  the number of rows correspond to the lengths of compared
                                                                  strings w1 and w2 . Each new entry in the array will be
                                                                  calculated from the currently compared characters and the
                                                                  values before. For cell Li, j , containing the minimum num-
                                                                  ber of alterations needed to match w1 [1.. j] with w2 [1..i],
                                                                  we compute the value by comparing the next characters. If
                                                                  the characters are the same no new alterations are needed,
Figure 1: Use of forms of the verb “to be”. RMP: The Rat Man of   and we take the value from L(i−1),( j−1) . If they are not
Paris, SF: The Shadow Factory.                                    the same, the total edit distance will be greater by one.
                                                                  If values Li,( j−1) and L(i−1), j are equal it signalizes that
                                                                  we would change a character. When Li,( j−1) is not equal
                                                                  L(i−1), j means we try inserting or deleting a character. To
                                                                  calculate our Li, j we take the lower value of the two and
                                                                  add one.
                                                                     The same results as above for the exact algorithms are
                                                                  supported by testing the use of prepositions of place by the
                                                                  approximate algorithm, as seen on Figure 4.
                                                                     The work on the novel was hard, lengthy and took three
                                                                  years to finish. Therefore, we also included tests compar-
                                                                  ing the initial parts of the novel to the concluding parts of
                                                                  the novel. The results (Figure 5) show that the frequency
                                                                  of the short words increased a little by the end of the novel.
                    Figure 2: Use of pronouns.                       We also thought interesting to compare the literary text
                                                                  of the novel with the text of the Preface written by the
   While the Rabin-Karp and other exact string match-             author for the publication of the book, with the assumption
ing algorithms abort a pattern search when its characters         that the preface may be less guarded.
stopped matching the text, approximate algorithms will do            Reading both, pre-stroke as well as the after-stroke,
so only conditionally. We run several test using the ap-          novels of Paul West is a peculiar experience. However,
proximate algorithm.                                              the feeling is very distinct. In the after-stroke work some
   Before the search using the approximate algorithm, we          words stand out as “weird”, out of place. As the author
specify an edit distance. This will tell the algorithm            himself wrote in the Preface, those words are indeed in-
     Figure 4: Prepositions of place - approximate algorithm.            Figure 6: Prepositions of time; Preface vs. novel.


                                                                                    Figure 7: Unique words.
                  Figure 5: Total short words.
                                                                 attachment leads to the scale free structure of the network,
correct. He left them there on purpose, to show readers          manifested in power law degree distribution
the state of his mind back then. We would love to quan-
tify these differences in reading experiences. At the mo-                                 P(k) ∝ k−γ ,
ment we do not know how, but our graphs in Figure 7 is           where k is a degree and γ is scaling exponent. Variations
a move toward this direction. The hypothesis is that West        of the local processes described above lead to the different
might have forgotten many short words due to his con-            final network structure [2, 5]. By scale free we mean that
dition and that is why the number of unique short words          there is no internal scale in the network. For example, the
in The Shadow Factory was significantly smaller. We ob-          situation is different in random graphs with Poisson de-
served this behaviour also when comparing the first and          gree distribution where there is an internal scale – an av-
the last parts from The Shadow Factory. This time the part       erage degree. Which means in the random graphs number
with heavier aphasic traits would be the one written first.      of nodes having the grater degree or smaller degree than
The later part, indeed contains more unique words.               average quickly decreases. This in not the situation in the
   For the investigating aphasia disorders that impair the       scale free networks.
ability to use words in the right context, it seems that other      As we know, words in a language are not randomly dis-
more complex tools, like contextual language graphs,             tributed in texts. Their ordering reflects grammatical rules
which capture the relationships of words in a text, will be      of the language in question. Some of the words are used
needed. We present such experiments and results in the           very often, some of them are rather rare. Word webs en-
following section.                                               ables one to look at the organization of words in a language
                                                                 differently. Positional word web expresses how words are
                                                                 organized in sentences, that means syntactic aspect of the
3   Word web analysis                                            language. Due to the scale free structure in all of the stud-
                                                                 ied languages, it is supposed, that preferential attachment
It has been shown first by Barabási and Albert [1, 2], that      was involved in positional word web development.
the global structure of complex networks is influenced by           Positional word web is created from the text as follows:
the local processes of their development. Preferential node      Unique words (without respect to their grammatical form)
are nodes of the network. If one chooses word w, all words              Properties of word webs        RMP           SF
which are placed in the sentence before and after the word                Number of nodes              8422         6582
w anywhere in the analyzed text are connected by an edge                  Number of edges              35054       25478
with w. The punctuation marks are not taken into account,                 Maximal degree               2191         1595
they are treated as if they are not included in the text. This             Minimal degree                1            1
process of word web creation has been suggested by Can-                    Average degree              8.324       7.742
cho and Solé [6]. The result is a connected graph, because              Ratio of unique words         0.14305     0.15835
all words in the text has at least one neighbour. It has been                  Density                0.00099     0.00118
shown by Cancho and Solé [6] and by us [8], that such po-                Av. clustering coeff          0.384       0.378
sitional word web, created by the above described process                Network diameter                8            7
and based on the English text is a scale free network and                 Av. shortest dist.          2.92499     2.89893
as such, can be analyzed as a graph.
                                                                 Table 1: Properties of the two word webs before and after
    We created two word webs, one from each text, with a
                                                                 aphasia. RMP: The RatMan of Paris, SF: The Shadow
help of our own application. In a special cases applica-
                                                                 Factory.
tion is able to recognize shortened versions of words and
replace them by a correct ones (such as ’d, which can be
had, would, did, depending on the context). Application
provides standard graph analysis and calculates number of        other. For example if we have a chain of tree nodes, the
nodes, edges, occurrences of words in the text etc., and         middle node is in a different automorphism orbit as the
also standard distributions such as degree distribution for      end ones, as no automorphism ever maps the middle one
example.                                                         onto an end one. The end nodes belong to the same au-
    Graphlet structure of both networks was analyzed too.        tomorphism orbit. 30 different connected graphlets have
Usually one looks for connected graphlets. Graphlets             73 different automorphism orbits, so the correct analogue
are small nonizomorphic induced subgraphs consisting of          to the degree distribution is to measure the number of
from two to five nodes [9]. There are 30 such connected          nodes touching particular graphlet at a node belonging
graphlets. Of course, one can take into account graphlets        to a particular orbit. Therefore, we get 73 graphlet de-
consisting of more nodes, but the number of such graphlets       gree distributions (GDD). Then one can compare two net-
grows up very quickly, making calculations very time con-        works G and H calculating a measure called network GDD
suming. Therefore, the convention has been established,          agreement (GDDA) [9]. There are two types of GDDA -
that speaking about graphlets, we have in mind the ones          s, namely Aa (G, H) (arithmetic agreement) and Ag (G, H)
described above [9], [10].                                       (geometric agreement). Here the result is a real number
    To compare two networks one can calculate "relative          between zero and one. The closer to one the agreement is,
graphlet frequency distance" (RGFD). Let Ni (G) be the           the more similar the two networks are.
number of graphlets of the type i in the network G, and let         For the two networks G and H and the orbit j the j-th
T (G) be the total number of graphlets of G. Thus, rela-         agreement A j (G, H) = 1 − D j (G, H), where D j (G, H) is
tive graphlet frequency distance D(G, H) between the two         a j-th distance, is defined. The distance is given as fol-
graphs G and H is defined as: D(G, H) = ∑29      i=0 |Fi (G) −
                                                                 lows: for each orbit j and the graph G the j-th GDD dGj (k)
                                Ni (G)
Fi (H)|, where Fi (G) = − log T (G) [9]. The result is a pos-    is measured. If j = 0 one has a classical degree distribu-
                                                                                                              j
                                                                                                             d (k)
itive real number. In general, two nets are similar, if this     tion. Then dGj (k) is scaled as SGj (k) = Gk . SGj (k) is then
number is under 50. This is a rule of thumb introduced           normalized by TGj = ∑∞        j
                                                                                          k=1 SG (k) giving normalized degree
by Natasha Przulji in [10]. The motivation lies in the ob-                       j
                                                                 distribution NG (k). The j-th distance of the two net-
servations of distances of protein-protein interaction net-                                                             j
                                                                 works G and H is given as D j (G, H) = √12 ∑∞  k=1 ([NG (k) −
works and corresponding model networks. Better compar-                    1
ison one gets using "graphlet degree distribution" (GDD).        NHj (k)]2 ) 2 .
    By the graphlets, one can extend the concept of the de-         Arithmetic agreement is then defined as Aa (G, H) =
                                                                 1    72      j
gree distribution. The node degree k(m) of the node m            73 ∑ j=0 A (G, H).  Geometric agreement is given as
                                                                                                  1
is a number of edges incident with the node in question          Ag (G, H) = (∏72   j
                                                                               j=0 A (G, H)) .
                                                                                            73
and the degree distribution measures how many network
nodes have the degree k. From the graphlet point of view,
the degree k means that k graphlets of H0 type (which is         3.1   Results
an edge) [9] touch (include) certain node. The same way
we can look at another types of graphlets. However, in the       First, the statistics of the two word webs is depicted in the
graphlets it is topologically meaningful to distinguish at       Table 1. From the Table 2 it seems that the second word
which automorphism orbit the node touches them. Two              web has more internally connected structure, because the
graph nodes belong to the same automorphism orbit, if            graphlet ratios are systematically slightly higher, but there
there exists an automorphism which maps one onto the             are no significant differences.
      Types of graphlets       RMP             SF              ders that are demonstrated in language. Results of our tests
          two node           0.0988%        0.1176%            using several exact and approximate algorithms do not de-
         three node         0.00979%       0.01268%            cisively support the tested logopedical hypotheses. It is
          four node        0.001988%      0.002713%            clear that use of not edited texts of authors with aphasia
          five node        0.000486%      0.000664%            would be more revealing. Lajos Grendel, a Slovak writer
                                                               and publicist, representative of Hungarian literature in Slo-
Table 2: Ratios of the connected graphlet numbers, nor-        vakia, is another author known to suffer aphasia. We hope
malized by the number of all graphlets, connected and un-      to gain an access to his “raw” texts written before and after
connected of the
                same size. The number of all graphlets        the injury and analyse them.
of size n is Nn , N is the number of network nodes. RMP:
The RatMan of Paris, SF: The Shadow Factory.
                                                               4.2   Word web analysis
       Measures      Comp. 1      Comp. 2      Comp. 3         As has been stated above, word web analysis does not
         RGFD        4.76126       196.043      187.868        show significant differences in networks due to aphasia.
    arithm. GDDA     0.870742     0.634399     0.636399        The reason might be twofold: first, it is known, that The
    geom. GDDA        0.8372       0.60967     0.616451        Shadow Factory has been edited by writer’s wife before
                                                               publication. We made an effort to get the raw text, before
Table 3: RGFD a GDDA comparisons of our word webs              edition, but without success. The second possibility lies in
and random graphs. Comparison 1 compares word webs             a fact, stated in the Introduction, that despite speech thera-
of the two books in question. Comparison 2 compares            pist’s failure to teach writer to speak again, he gained a lot
word web of Rat man of Paris and random graph having           of his writing abilities in these three years of writing The
the same number of nodes and edges. Comparison 3 com-          Shadow Factory. Our analysis might confirm this, which
pares word web of The shadow factory and random graph          could be an important result, after more detailed studies. It
having the same number of nodes and edges.                     shows, that language can be totally lost in one respect due
                                                               to aphasia, but can be possibly gained back by specialized
                                                               targeted training.
   RGFD of the word webs D(G1 , G2 ) = 4.76126 indicates
high similarity of the graphlet structure of both texts. No
influence of aphasia is seen. But GDDA analysis is better      References
for comparison, because it provides numbers in the (0,1)
                                                               [1] A. L. Barabási, R. Albert, Emergence of scaling in random
interval. We calculated both of the agreements. Geometric          networks, Science 286 (1999) 3616
agreement is Ag (G1 , G2 ) = 0.8372 and arithmetic agree-
                                                               [2] R. Albert, A. L. Barabási, Statistical mechanics of complex
ment is Aa (G1 , G2 ) = 0.870742. Here G1 is a word web            networks, Rev. Modern Phys. 74 (2002) 47
of the Rat Man of Paris, and G2 the one of The Shadow
                                                               [3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Intro-
Factory. Both agreements are far more closer to 1 then to          duction to algorithms, 2nd edition (2001), MIT press
0 and indicate great similarity of the graphlet structure of
                                                               [4] Z. Cséfalvay. Súčasný pohl’ad na diagnostiku a terapiu
both networks. Therefore we can state, that we did not find        afázie, Česká a slovenská neurologie a neurochirurgie. –
a significant influence of aphasia on the graphlet structure       Roč. 70/103, č. 2 (2007), s. 118–128
of our word webs. Also from the syntactic point of view,       [5] S. N. Dorogovtsev, J. F. F. Mendes, Evolution of networks,
both texts are written by the same style, with the same ba-        Adv. Phys. 51 (2002) 1079
sic vocabulary.                                                [6] R. Ferrer I Cancho, R. Solé, The small world of human
   We also made a comparisons of both networks to the              language, Proceedings of the Royal Society of London 268
random graphs having the same number of nodes and                  (2001), 2261
edges. The results are in Table 3. We can see, that both       [7] T. Jajcayova, Z. Majerikova, Note on Some Interesting
word webs are far more similar to each other then to ran-          Test Results for the Rabin-Karp String Matching Algorithm,
dom graphs with respect to the graphlet structure.                 ITAT 2018 : Information Technologies – Applications and
   Degree distributions of both networks show, that they           Theory
are both scale free. Both degree distributions are linear      [8] M. Markošová, Network model of human language, Physica
in log-log plot, indicating power law (3) with the scaling         A: Statistical Mechanics and its App. 387 (2008), 661
exponent γ = 2.1. Aphasia does not influence the degree        [9] N. Przulj, Biological network comparison using graphlet de-
distribution of the word web at all.                               gree, Bioinformatics 23 (2007) 177
                                                               [10] N. Przulj, D. G. Corneil, I. Jurisica, Efficient estimation
                                                                   of graphlet frequency distribution, Bioinformatics 22 (2006)
4     Conclusion                                                   974
                                                               [11] S. Shabnam Hasan, F. Ahmed, R. S. Khan, Approximate
4.1   String matching analysis
                                                                   String Matching Algorithms: A Brief Survey and Com-
Results of experiments presented above are a part of the           parison, International Journal of Computer Applications
broader research project into the diagnosis of brain disor-        120(8):26–31, June 2015.