Deriving Word Association Networks from Text Corpora

David Galea (dp.galea@student.qut.edu.au) and Peter Bruza (p.bruza@qut.edu.au)
Information Systems School, Queensland University of Technology
2 George Street, Brisbane, QLD 4000, Australia

Abstract

This article presents and evaluates a model to automatically derive word association networks from text corpora. Two aspects were evaluated: to what degree can corpus-based word association networks (CANs) approximate human word association networks with respect to (1) their ability to quantitatively predict word associations and (2) their structural network characteristics? Word association networks are the basis of the human mental lexicon. However, extracting such networks from human subjects is laborious and time consuming, and thus necessarily limited in relation to the breadth of human vocabulary. Automatic derivation of word associations from text corpora would address these limitations. In both evaluations corpus-based processing provided vector representations for words. These representations were then employed to derive CANs using two measures: (1) the well known cosine metric, which is a symmetric measure, and (2) a new asymmetric measure computed from orthogonal vector projections. For both evaluations, the full set of 4068 free association networks (FANs) from the University of South Florida word association norms was used as baseline human data. Two corpus-based models were benchmarked for comparison: a latent topic model and latent semantic analysis (LSA). We observed that CANs constructed using the asymmetric measure were slightly less effective than the topic model in quantitatively predicting free associates, and slightly better than LSA. The structural network analysis revealed that CANs do approximate the FANs to an encouraging degree.

Keywords: semantic networks; free association networks; corpus-based semantic representation

Introduction

The mental lexicon is a mental dictionary of words, but its structure is founded on the associative links that bind these words together. Such links are acquired through experience, and the vast and semi-random nature of this experience ensures that words within the lexicon are highly interconnected, both directly and indirectly through other words. For example, during childhood development and the associated acquisition of English, the word planet becomes associated with earth, space, moon, and so on. Even within this set, moon can itself become linked to earth, star, etc. Words are so associatively interconnected with each other that they meet the qualifications of a "small world" network, wherein it takes only a few steps to move from any one word to any other in the lexicon (Steyvers & Tenenbaum, 2005). Because of such connectivity, individual words are not represented in long-term memory as isolated entities but as part of a network of related words. One approach to extracting such a network is to employ a target as a cue and collect free associations from human subjects (Nelson, McEvoy, & Schreiber, 2004; Simon, Navarro, & Storms, 2013). For example, Figure 1 depicts such a network where t is the target word and the a_i's denote associates. An arrow, e.g., t → a1, represents that associate a1 was produced in a free association experiment in respect to target t. Table 1 shows the corresponding adjacency matrix for this example network. When collected over a subject pool, the edges can be weighted, e.g., by the probability that a given associate is produced in relation to a cue. Such networks are referred to as free association networks (FANs). FANs have formed the basis of human memory models such as Spreading Activation (Collins & Loftus, 1975) and Processing Implicit and Explicit Representations (PIER) (Nelson, Schreiber, & McEvoy, 1992; Nelson, Kitto, Galea, McEvoy, & Bruza, 2013).

Figure 1: Example of a Free Association Network

Table 1: Example adjacency matrix of the FAN depicted in Figure 1

        t     a1    a2
t       0     0.2   0.1
a1      0     0     0.6
a2      0.7   0     0

FANs have the following structural characteristics:

R1 The edges are directed, hence allowing for asymmetric associations between words.
R2 The target word has an edge with each associate in the network.
R3 The edges are weighted.

FANs are derived manually, which is time consuming and labor intensive. They are therefore restricted in relation to the breadth of vocabulary in human language, and challenging to keep up-to-date as language and associations evolve. The aim of this paper is to investigate to what degree corpus-based semantic methods can be used to approximate FANs in relation to both their structural network characteristics and their ability to quantitatively predict human word associations. We shall refer to such networks as Corpus Based Association Networks (CANs).
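To make the data structure concrete, the Figure 1 network and its Table 1 adjacency matrix can be held as a nested mapping. The sketch below is illustrative only (the representation and names are ours, not the authors'); it shows the directed, weighted form that characteristics R1–R3 describe.

```python
# A FAN as a dict-of-dicts adjacency structure: fan[u][v] is the
# weight of the directed edge u -> v (absent keys mean weight 0).
# The values reproduce Table 1 for the Figure 1 example network.
fan = {
    "t":  {"a1": 0.2, "a2": 0.1},   # forward associations from the target (R2)
    "a1": {"a2": 0.6},
    "a2": {"t": 0.7},               # backward association to the target
}

def strength(network, u, v):
    """Directed association strength from u to v (0 if no edge)."""
    return network.get(u, {}).get(v, 0.0)

# Asymmetry (R1): the strength t -> a2 differs from a2 -> t.
assert strength(fan, "t", "a2") == 0.1
assert strength(fan, "a2", "t") == 0.7
```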
Corpus Based Association Networks

A CAN comprises nodes, which correspond to words, and directed, weighted edges, which model the associations between words. We begin by describing how the nodes of a CAN are constructed.

Vector Representations of Words

Each word u (i.e., a node) has a vector-based representation u, where the vector has been computed from an underlying corpus. There are a variety of strategies to produce such vectors (Bullinaria & Levy, 2007), which are sometimes referred to as "semantic vectors" due to their ability to replicate human semantic association norm data (Dumais, 2004; Lund & Burgess, 1996; Turney & Pantel, 2010).

We used a Positive Pointwise Mutual Information (PPMI) vector representation because of its robust performance across a variety of linguistic and semantic tasks (Bullinaria & Levy, 2007). PPMI vectors are derived from discrete probability distributions built from word co-occurrence statistics. In our case, these discrete probability distributions are built from a modified version of a standard word co-occurrence matrix where the rows correspond to a set of pre-defined target words. The co-occurrence frequencies of a given target word with other words are computed using a sliding window of fixed size (denoted w) across the corpus, where sentence and paragraph boundaries are ignored. Context words are those words surrounding the target word when it is centered in the window. The frequency of each context word is accumulated as the window slides across the corpus. In this process, stop words are ignored. The frequencies are subsequently normalized to produce a probability distribution for the given target word. As a consequence, all vector elements are positive real values, and thus exist in the first orthant of Euclidean space. This property has important consequences for the bounds of the word association measures discussed in the next section. For this analysis, both target and context words were treated as single tokens. Furthermore, the window size was not explored as part of this analysis.
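The paper gives no code for this construction, but the description translates roughly as follows. This is a minimal sketch under our reading of it: the tokeniser, stop list, and window size w are placeholders, and PPMI is taken in the usual way as the positive part of log p(c|t)/p(c).

```python
import math
from collections import Counter, defaultdict

def ppmi_vectors(tokens, targets, w=3, stop_words=frozenset()):
    """Sketch of the PPMI construction described above (our reading,
    not the authors' code). `tokens` is the corpus as one token list;
    sentence/paragraph boundaries are deliberately ignored."""
    counts = defaultdict(Counter)      # counts[t][c] = co-occurrence frequency
    context_totals = Counter()         # corpus-wide context frequencies
    for i, t in enumerate(tokens):
        if t in stop_words:
            continue
        window = tokens[max(0, i - w):i] + tokens[i + 1:i + 1 + w]
        for c in window:
            if c in stop_words:
                continue
            context_totals[c] += 1
            if t in targets:           # rows are the pre-defined targets
                counts[t][c] += 1
    n = sum(context_totals.values())
    vectors = {}
    for t, ctr in counts.items():
        total = sum(ctr.values())
        vec = {}
        for c, f in ctr.items():
            # PPMI: positive part of log2( p(c|t) / p(c) )
            pmi = math.log2((f / total) / (context_totals[c] / n))
            if pmi > 0:
                vec[c] = pmi
        vectors[t] = vec               # sparse vector in the first orthant
    return vectors
```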
Measures of Association S(u, v)

The preceding section described how the nodes of a CAN are represented via corpus-based vectors. These vectors are used to compute weighted associations between words, thus providing the means to derive edges for CANs. For this paper we have utilized one well known metric, the cosine metric, as well as introducing a new measure of association called the GP measure.

The cosine metric was chosen as a baseline as it is often used to compute vector-based associations, e.g., in the Latent Semantic Analysis model, where it has shown consistently good performance in computing associations between words across a number of studies and text corpora (Landauer, Foltz, & Laham, 1998):

\cos(u, v) = \frac{\langle u, v \rangle}{\|u\| \, \|v\|}    (1)

As pointed out previously, PPMI vector representations exist in the first orthant. Consequently, the standard boundaries of the cosine metric, [−1, 1], are transformed to [0, 1], and cosine can be interpreted as a normalized measure of strength, where 0 represents no relationship between words u and v and 1 represents a perfect synonymous relationship. In having a normalized measure, requirement R3 is satisfied. Unfortunately, as cosine is a metric, its associations are necessarily symmetric, meaning cos(u, v) = cos(v, u). This violates characteristic R1 specified above. In order to satisfy R1, a measure is required that permits asymmetric associations between words. The topic model (Griffiths, Steyvers, & Tenenbaum, 2007) used conditional probabilities to achieve this. For example, the strength of association from word u to v is computed as Pr(u|v), and the strength of the reverse relation is computed as Pr(v|u). Note that these probabilities need not be the same, which thus allows for asymmetry in the associations between these two words. In this paper, however, we will build on a word association measure based on projection (Pothos, Busemeyer, & Trueblood, 2013). Initially, a simple orthogonal vector projection was considered:

P(u, v) = \frac{\langle u, v \rangle}{\|v\|}    (2)

Exploration of this measure shows that it is bound between [0, \|u\|], where 0 represents no relationship and \|u\| represents a perfect synonymous relationship. Although not normalized, this does preserve rank when comparing multiple v's to u. Unfortunately, when comparing multiple v's to different u's, say u_1 and u_2, we arrive at two sets of bounds, [0, \|u_1\|] and [0, \|u_2\|], which destroys rank equivalence (unless \|u_1\| = \|u_2\|). To overcome this undesirable property, the GP measure was developed, in which the relative difference between v and the length of the projection of u onto v is taken into account:

GP(u, v) = \begin{cases} \dfrac{P(u, v)}{\|v\|} & \text{if } P(u, v) < \|v\| \\[4pt] 1 + \dfrac{\|v\|}{\|u\|} - \cos(u, v) & \text{if } P(u, v) \ge \|v\| \end{cases}

From a technical point of view, GP is not a metric but a pre-metric. As was the case with cosine, GP is also bound to [0, 1] and can be interpreted as a normalised measure of strength (thus satisfying R3). Furthermore, it permits asymmetric associations between words, meaning GP(u, v) is not necessarily equal to GP(v, u), thus satisfying R1.
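Equations 1 and 2 and the GP cases translate directly into vector code. The sketch below follows our reconstruction of the piecewise definition; note that the two branches meet continuously, with value 1, at the boundary P(u, v) = \|v\|.

```python
import numpy as np

def cosine(u, v):
    """Equation 1: symmetric, in [0, 1] for first-orthant vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def proj(u, v):
    """Equation 2: length of the orthogonal projection of u onto v."""
    return float(np.dot(u, v) / np.linalg.norm(v))

def gp(u, v):
    """The GP pre-metric as we reconstruct it: the normalised projection
    while P(u, v) falls short of ||v||, otherwise 1 + ||v||/||u|| - cos(u, v).
    Both branches lie in [0, 1]."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    p = proj(u, v)
    if p < nv:
        return p / nv
    return 1.0 + nv / nu - cosine(u, v)

u = np.array([3.0, 1.0])
v = np.array([1.0, 1.0])
print(gp(u, v), gp(v, u))   # ~0.553 vs ~0.4: GP is asymmetric
```

The final line is the point of the exercise: swapping the arguments changes the value, which is exactly the asymmetry that R1 requires and that cosine cannot supply.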
Constructing Corpus Based Association Networks

This section describes an abstract algorithm to compute a CAN, using the notation shown in Table 3. A CAN is based around a target word t.

Table 3: Notation

u       A word
u       The vector representation for u
uA      The set of associates for u
mna     The maximum number of associates permitted in uA
S(u,v)  Method to measure the strength between u and v
Sτ      Minimum threshold value for S(u, v)
uN      Word Association Network for u
uM      Adjacency matrix used to represent uN
t       A target word
T       Set of target words, T ⊂ V
V       Vocabulary of words

The first step is to compute the list of associates tA based on t. In order to compute this list, the vector representation t is compared to the vector representation of every other word v (v ∈ V) using a measure of association S(u, v), which can be either cosine or GP. For an associate to be added to the list, the strength of association must be greater than or equal to a threshold value: S(u, v) ≥ Sτ. This ensures the target has an association with all associates in tA, thus satisfying requirement R2. The threshold is a parameter which is empirically set per measure (cosine or GP).

A word t's network tN is constructed by taking t's associate list tA, computing the strengths between each directed pair (u, v), u ≠ v, and including those strengths for which S(u, v) ≥ Sτ. The results are stored in tM so that tM(u, v) = S(u, v). This process is formalized by Algorithm 0.1.

Algorithm 0.1: CAN(t, tA)
  tA := tA ∪ {t}
  for each u ∈ tA
    for each v ∈ tA, v ≠ u
      if S(u, v) ≥ Sτ then tM(u, v) := S(u, v)

Consider the following example with a target word t and the associate list tA = {a1, a2}, and assume the following two associations among the associates are above the threshold: S(a2, a1) = S2,1 ≥ Sτ and S(a2, t) = S2,t ≥ Sτ, with all other associations S(a, b) = 0. Applying Algorithm 0.1, the first step is to add the target t as a default element to its associate list, i.e., tA = {t, a1, a2}. The next step is to consider the associations that each member of tA has with one another and keep those for which S(a, b) ≥ Sτ:

u = t:  v = a1: S(t, a1) = St,1 ≥ Sτ → tM(t, a1) = St,1
        v = a2: S(t, a2) = St,2 ≥ Sτ → tM(t, a2) = St,2
u = a1: v = t:  S(a1, t) = 0 → tM(a1, t) = 0
        v = a2: S(a1, a2) = 0 → tM(a1, a2) = 0
u = a2: v = t:  S(a2, t) = S2,t ≥ Sτ → tM(a2, t) = S2,t
        v = a1: S(a2, a1) = S2,1 ≥ Sτ → tM(a2, a1) = S2,1

(Here S(t, a1) and S(t, a2) exceed the threshold by construction, since a1 and a2 are members of tA.) The matrix returned by the algorithm, tM, is shown in Table 2.

Table 2: Adjacency matrix (tM) for t

        t      a1     a2
t       0      St,1   St,2
a1      0      0      0
a2      S2,t   S2,1   0
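Algorithm 0.1 is straightforward to render as runnable code. The sketch below is ours: `strength_fn` stands in for S (cosine or GP over the PPMI vectors), and the associate list is assumed to have been computed and thresholded beforehand, as described above.

```python
from itertools import permutations

def build_can(t, associates, strength_fn, s_tau):
    """Algorithm 0.1: given target t and its associate list tA,
    weight every directed pair whose strength clears the threshold."""
    nodes = [t] + [a for a in associates if a != t]    # tA := tA ∪ {t}
    t_m = {u: {v: 0.0 for v in nodes} for u in nodes}  # adjacency matrix tM
    for u, v in permutations(nodes, 2):                # ordered pairs, u != v
        s = strength_fn(u, v)
        if s >= s_tau:
            t_m[u][v] = s
    return t_m

# Reproducing the worked example with stipulated strengths:
S = {("t", "a1"): 0.2, ("t", "a2"): 0.1, ("a2", "t"): 0.7, ("a2", "a1"): 0.6}
tm = build_can("t", ["a1", "a2"], lambda u, v: S.get((u, v), 0.0), 0.05)
```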
Empirical Evaluation

The evaluation aims to address two questions: to what degree do CANs approximate FANs with respect to (1) their ability to quantitatively predict human word associations and (2) their structural network characteristics?

Quantitative Prediction of Word Associations

In order to evaluate the quality of associations in CANs, we analyzed the degree to which free associates from the USF norms appear in the associate list tA for all targets t. To this end we adopt the approach and corpus used to evaluate the Topic Model (Griffiths et al., 2007).

Materials: In generating the vector representations, the Touchstone Applied Science Associates (TASA) corpus was used with a standard stop word list. This corpus comprises 916060 documents. The set of target words T comprised the full 4068 target words present in the University of South Florida (USF) word association norms (Nelson et al., 2004). The baseline models for comparison are the Topic Model (Griffiths et al., 2007) and Latent Semantic Analysis (LSA) (Dumais, 2004). The Topic Model is a corpus-based approach to semantic representation which ascribes probabilities to words with respect to latent contexts called "topics". The model allows asymmetric word associations to be computed and has been evaluated on the USF word association norms. The LSA model was chosen as it is a common corpus-based benchmark that uses the cosine metric.

Procedure: The procedure involves taking each of the 4068 target words and computing the PPMI vector representation using the method described in the section "Vector Representations of Words". The size of the resulting vocabulary V was 47059 words, which is the dimensionality of the vector representations. The vocabulary was constructed by taking all words in the TASA corpus (not including stop words) and only considering those with a term frequency greater than 10 (as used with the Topic Model). Thereafter, the association strength between the target and all words of the vocabulary is computed. This list is then sorted (in descending order) by association strength, and the rank/position of the target word's first associate is found. The first associate is the associate of the target word (from the USF data) that has the strongest forward relationship. For example, in Figure 1, a1 has the strongest forward relationship to t, being S(t, a1) = 0.2, and thus would be the first associate for t. The probability of finding the first associate within the top m associates is computed as Pr(m) = n_m / n_T, where n_m is the number of first associates produced whose rank is ≤ m and n_T is the number of target words.

The cosine metric and the GP pre-metric were evaluated in this way for six different values of m (m ∈ M = {1, 5, 10, 25, 50, 100}), and the results were compared with the published results of LSA and the Topic Model documented in Griffiths et al. (2007). In order to determine the best performance, a simple method was introduced which sums the probabilities across the different values of m: P = \sum_{m \in M} Pr(m). Best performing results for CAN (cosine) are reported with window size w = 3. For CAN (GP) the best performing results were achieved with w = 6.
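The evaluation procedure just described amounts to a ranking exercise. Below is a compact sketch of it (ours; the argument names are hypothetical, and each target's USF first associate is assumed to be in-vocabulary).

```python
def first_associate_probabilities(targets, vocab, strength_fn,
                                  first_associate, ms=(1, 5, 10, 25, 50, 100)):
    """Pr(m) = n_m / n_T: the fraction of targets whose USF first associate
    appears within the top m words when the vocabulary is sorted by
    descending association strength to the target (a sketch)."""
    ranks = []
    for t in targets:
        ordered = sorted((w for w in vocab if w != t),
                         key=lambda w: strength_fn(t, w), reverse=True)
        ranks.append(ordered.index(first_associate[t]) + 1)
    pr = {m: sum(r <= m for r in ranks) / len(targets) for m in ms}
    pr["P"] = sum(pr[m] for m in ms)   # summary score P = sum over m of Pr(m)
    return pr
```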
Figure 2: Probabilities for producing the first USF associate modulo the size of the associate list m

Results: The results are presented in Figure 2. The P values for each of the four methods are: P_CAN-COS = 2.7155, P_LSA-COS = 2.4568, P_Topic-Model = 2.7818, and P_CAN-GP = 2.5932.

Of the four, the Topic Model produces the best results, followed closely by CAN (cosine). In comparing the baseline methods, CAN (cosine) outperforms LSA. In comparing the asymmetric measures, the Topic Model is slightly more effective than CAN (GP). Given that we are primarily interested in the asymmetric measures of association, we observe that the performance of the Topic Model on first associates for lower m values is considerably better than CAN (GP); however, this behavior does not continue for larger m values, where CAN (GP) approaches and then slightly supersedes the effectiveness of the Topic Model.

Comparison of CANs vs FANs using structural network characteristics

Materials: The corpus used for testing was Wikipedia 2008, which comprises 61998051 documents. Wikipedia was chosen as it allows the CAN algorithm to be tested on a very large corpus of text. The set of target words T used was the 4068 target words present in the University of South Florida (USF) word association norms (Nelson et al., 2004). Each word has a corresponding PPMI vector representation computed using the method described in the section "Vector Representations of Words". The baseline for comparison is the 4068 FANs in the USF norms.

Procedure: A PPMI vector representation for each target word was computed using the method described in the section "Vector Representations of Words". The size of the resulting vocabulary V was 255460 words, which is the dimensionality of the vector representations. The procedure involved generating a CAN for each target word using Algorithm 0.1 with GP as the measure used to compute the associations. (CANs were not constructed with cosine, as this measure is symmetric.) The CANs were generated with mna ≤ 50, where mna refers to the maximum number of associates a target can have in its CAN. This value was chosen because it is the maximum number of associates encountered across all target words in the USF word association norms.

The structural network characteristics (see Table 4) used for evaluation are derived from the CAN's adjacency matrix (tM). These characteristics are well known in network analysis and have been used to analyze the USF word association norms (Steyvers & Tenenbaum, 2005). The mean, median, and standard deviation (sample size = 4068) are calculated for each of these network characteristics. The standard deviation is used to assess the stability of the mean and median.

Table 4: Structural Characteristics

n     The number of nodes in the network.
d     The network density.
L     The average minimum distance between nodes.
<k>   The average number of connections for each node.
C     The clustering coefficient for the network.
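For reference, the Table 4 characteristics can be computed from a CAN's adjacency matrix along the following lines. This is a sketch using networkx; the paper does not state its implementation, so choices such as computing L over reachable ordered pairs only are our assumptions.

```python
import networkx as nx

def structural_characteristics(t_m):
    """Table 4 characteristics from a CAN adjacency matrix tM
    (dict-of-dicts as returned by build_can; zero entries mean no edge).
    A sketch only -- the paper does not spell out its implementation."""
    g = nx.DiGraph()
    g.add_weighted_edges_from((u, v, s) for u, row in t_m.items()
                              for v, s in row.items() if s > 0)
    n = g.number_of_nodes()
    d = nx.density(g)                          # realised fraction of possible edges
    k = sum(deg for _, deg in g.degree()) / n  # mean connections per node
    c = nx.average_clustering(g)               # clustering coefficient C
    # L: average minimum distance, taken over reachable ordered pairs,
    # since an individual CAN or FAN need not be strongly connected.
    dists = [dd for src, lengths in nx.all_pairs_shortest_path_length(g)
             for dst, dd in lengths.items() if dst != src]
    l = sum(dists) / len(dists)
    return {"n": n, "d": d, "L": l, "<k>": k, "C": c}
```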
Results: Table 5 shows that the GP measure has strengths and weaknesses in replicating the network dimension n of the FANs. Whilst the CANs over-fit the mean, they produce a perfect median value. There is a quite large standard deviation, which may be due to the fact that it is much easier to establish associations in corpus-based processing than humans are able to in free association experiments. We can conclude that whilst the CANs' ability to replicate the FANs is quite good, there is a larger spread in the number of nodes.

Table 5: Network Dimension (n)

         USF    GP
Mean     14     16.23
Median   14     14
St Dev   4.7    10.89

Table 6 shows that the mean and median network density d of the FANs is closely matched by the CANs. Not only is the GP measure a good predictor of the mean and median, its standard deviation is relatively small, indicating stability.

Table 6: Network Density (d)

         USF    GP
Mean     0.23   0.2
Median   0.21   0.15
St Dev   0.11   0.14

Table 7 reveals that the mean and median average minimum distance between nodes in the FANs is under-fitted by the CANs, but the result is stable. This is to be expected given the structure of the USF FANs. These FANs are generally quite sparse except in two respects: firstly, all associates have a forward association to the target (as per R2), and secondly, it is a common theme that backward relationships (to the target) also exist (though these can be of very low weight). Consequently, the majority of associates in a USF FAN are connected to the target by both a forward and a backward connection, which allows easier traversal between any two nodes in the FAN, resulting in a low L value. The pattern of forward connections is replicated by the CANs (R2) and is strongly desired when replicating FANs (small world behavior). The lower L value for the GP-generated CANs indicates that traversal between nodes in a CAN is easier than in a FAN. Given that the densities for FANs and CANs are almost identical (as illustrated in Table 6), and that both have forced forward connections to the target, the difference in structure probably lies in the non-target nodes being, on average, more interconnected in the CANs than in the FANs. This higher degree of interconnectedness provides more opportunities for traversal through the network and thus a lower L value.

Table 7: Average Minimum Distance Between Nodes (L)

         USF    GP
Mean     1.79   1.19
Median   1.76   1.05
St Dev   0.36   0.32

Table 8 shows that the mean and median average number of connections in the FANs is over-fitted by the CANs and is quite unstable. On average, the number of associate-to-associate relationships is greater for CANs than for FANs, which is consistent with our preceding conjecture that the non-target nodes of CANs are more interconnected than those of FANs. Again, a possible explanation is that in corpus-based techniques it is generally much easier to establish associations between words. Whether this is a result of the PPMI representation, the large size of the corpus, and/or a consequence of the GP pre-metric is currently under investigation.

Table 8: Average Number of Connections (<k>)

         USF    GP
Mean     1.12   2.34
Median   1.14   1.81
St Dev   0.15   1.94

Table 9 shows that the mean and median clustering coefficient C of the FANs are under-fitted by the CANs. The clustering coefficient measures the average density of the localized sub-network around each node. Although we have observed that words appear to be more connected in CANs than in FANs (as seen in Tables 7 and 8), and there are therefore likely to be, on average, more sub-networks in CANs, the density of these sub-networks around a node is smaller than in FANs. The direct cause of this is unknown at this stage.

Table 9: Clustering Coefficient (C)

         USF    GP
Mean     0.44   0.31
Median   0.43   0.32
St Dev   0.10   0.16
Discussion

The first component of the analysis evaluated the degree to which CANs can quantitatively predict human word associations. Two models were used as baselines for comparison: the Topic Model and LSA. The results revealed the following findings.

CANs extracted using both the cosine metric and the GP pre-metric outperform LSA, though the differences are small. The Topic Model outperforms CAN (GP pre-metric) and CAN (cosine) at higher levels of precision. At lower levels of precision, CAN (cosine) outperforms the Topic Model. That being said, all models are poor at generating the FANs' first associate at maximal precision (i.e., when m = 1). The cosine metric in conjunction with corpus-based vectors like PPMI has been shown in many studies to have a predisposition to compute semantic associations (e.g., Lund & Burgess, 1996; Dumais, 2004). As there are many cases where the first associate is not semantically associated with the target, it is therefore challenging for such associates to be ranked first based on a PPMI representation. Clearly, the asymmetry of the GP pre-metric could not mitigate the predisposition of the PPMI vector representations to compute associations of a semantic nature. Conversely, the Topic Model is better at predicting first associates, perhaps because the conditional probabilities pick up associations which are broader in nature than semantic associations.

Currently the CAN method creates vector representations for words in Euclidean space. In doing so, established metrics of Euclidean space (e.g., the cosine metric) can be used to compute word associations. These metrics must satisfy four axioms: (1) d(a, b) = d(b, a), (2) d(a, a) = 0, (3) d(a, b) ≥ 0, and (4) d(a, b) ≤ d(a, c) + d(c, b), where d(a, b) denotes the distance between points a and b in the space. Tversky challenged this assumption and found empirical evidence that symmetry (1) and the triangle inequality (4) are violated. Tversky argued that these violations implied that words do not act like points in Euclidean space (Tversky & Gati, 1982). Although the vectors for the CANs are in Euclidean space, the GP pre-metric does not base the degree of association on the distance between points in the space, but rather on the degree of projection between the respective vectors.

The second component of the analysis assessed the structural similarities of the FANs and the CANs. A set of well known network characteristics was employed to measure performance. It was found that the CANs built using the GP pre-metric performed encouragingly well at replicating the structural features of the FANs; however, issues of stability and of under- and over-fitting the network characteristics need to be investigated in more detail. Structural analysis of the USF norms has been performed previously (Griffiths et al., 2007); however, instead of analyzing the individual networks (as done in this analysis), the networks were aggregated into a single global network which was then subjected to network analysis. The focus of this study was different: we were interested in how well FANs based on individual target words can be structurally replicated. For this reason, the small world network characteristic γ (used in P(k) = k^{-γ}) was not investigated, because this characteristic is more meaningfully applied to a global network than to small individual networks.

The brute-force style strategy employed to isolate the optimal parameters for the structural analysis could be improved. Whilst it does converge to the optimal set of solutions, it is computationally inefficient and does not explore the stability of each set of solutions, nor does it assign weightings to individual parameters. Lastly, the USF norms were collected over three decades and were primarily sourced from students who attended the University of South Florida. As a consequence, the norms suffer from temporal and geographical bias. To overcome this bias, a new collection of FANs built by the University of Leuven could be used as a more comprehensive and contemporary baseline of human word association data (Simon et al., 2013).
Conclusion

The aim of this paper was to investigate to what degree corpus-based semantic methods can be used to derive weighted networks of words which approximate human free association networks (FANs), in relation to both structural network characteristics and the ability to quantitatively predict human word associations. We conclude that corpus-based methods can approximate the structural characteristics of FANs to an encouraging degree when a thresholded asymmetric measure based on vector projection is used to construct the network. The degree to which corpus-based procedures can replicate human word associations is still questionable. When benchmarked against two corpus-based models, CANs produced similar effectiveness. At this stage we conclude that when term co-occurrence statistics are used to provide vector representations, the performance of the symmetric cosine metric cannot be differentiated from that of an asymmetric measure based on vector projection. The difference in performance between CANs and the benchmark models is small, from which we conclude that CANs (cosine and GP) do show promise for further development.

References

Bullinaria, J. A., & Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39, 510-526.

Collins, A., & Loftus, E. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407-428.

Dumais, S. (2004). Latent Semantic Analysis. Annual Review of Information Science and Technology, 38, 189-200.

Griffiths, T., Steyvers, M., & Tenenbaum, J. (2007). Topics in semantic representation. Psychological Review, 114(2), 211-244.

Landauer, T., Foltz, P., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2&3), 259-284.

Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments & Computers, 28(2), 203-208.

Nelson, D., Kitto, K., Galea, D., McEvoy, C., & Bruza, P. (2013). How activation, entanglement, and searching a semantic network contribute to event memory. Memory & Cognition, 41(6), 797-819.

Nelson, D., McEvoy, C., & Schreiber, T. (2004). The University of South Florida free association, rhyme and word fragment norms. Behavior Research Methods, Instruments & Computers, 36, 408-420.

Nelson, D., Schreiber, T., & McEvoy, C. (1992). Processing implicit and explicit representations. Psychological Review, 99(2), 322-348.

Pothos, E., Busemeyer, J., & Trueblood, J. (2013). A quantum geometric model of similarity. Psychological Review, 120(3).

Simon, D., Navarro, D., & Storms, G. (2013). Better explanations of lexical and semantic cognition using networks derived from continued rather than single-word associations. Behavior Research Methods, 45, 480-498.

Steyvers, M., & Tenenbaum, J. (2005). The large scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41-78.

Turney, P., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141-188.

Tversky, A., & Gati, I. (1982). Similarity, separability and the triangle inequality. Psychological Review, 89, 123-154.