<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Graph Analysis of Word Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lars G. Bagøien Johnsen</string-name>
          <email>lars.johnsen@nb.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Research National Library of Norway</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We discuss two ways of clustering words from an underlying graph structure with the aim of uncovering semantic distinctions between words, where formal relationships between words are constructed from co-occurrences within sentences. The method illustrates how geometric relations of closeness and connectedness in graphs correspond to closeness and groupings of concepts in the real world. Graphs are analyzed using methods of network analysis taken from the social sciences. In particular, the concepts of cliques and centrality of graph structures as well as partitioning methods, are put to use. These methods find communities in graphs as sets of words, which we interpret as reflecting a grouping of the meaning. Word clusters and relationships between clusters are visualized on two layers where one is the graph itself, and the second consists of a derivative graph of the subset relation between word groups. The basic graph is rendered using the force layout algorithm, while the relationship between sets and groups of words in the graph are rendered as trees.</p>
      </abstract>
      <kwd-group>
        <kwd>graphs</kwd>
        <kwd>clusters</kwd>
        <kwd>meaning</kwd>
        <kwd>cliques</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Themes to be discussed are clustering and disambiguation of words, based on so called bag of
words or vector models, which are transformed into network structures. We will demonstrate a
particular way of generating graphs from those. Typical representatives are vectors made using
word2vec algorithm (Mikolov &amp; Zweig 2013), or produced within the application LancsBox
(Brezine et.al 2016).</p>
      <p>Our approach to disambiguation from word vectors differs from the stochastic approach in
(Bartunov et.al 2017), since word clusters and sets are constructed out of edges and nodes in
the network structure, using the topology of the network, in addition to any weighting of the
connections themselves. So, from a formal point of view, the method aligns itself with the formal
treatment of networks as found in analyses of social networks.</p>
      <p>The word vectors studied here are constructed from collocations computed from trigrams of
coordinations (see below), and can therefore be considered a subtype of the above word
vectors. Even though the vectors are constructed differently, the formal treatment of the network
structures will be the same.</p>
    </sec>
    <sec id="sec-2">
      <title>Research questions</title>
      <p>One question that is addressed is how textual raw data can be transformed into structures that
represent knowledge of language, while at the same time reflect its external significance. The
algorithms themselves have no access to the external world, so the correspondence lies in how
closeness (or groupings) of words matches the closeness of their corresponding concepts. For
example, words like apple and pear go together as vegetables and jazz and pop as genres of
music.</p>
      <p>This feature of graph analysis as a source of groupings and connections can then be used to
analyze the semantics of language and literary works, ranging from the disambiguation of
particular words to semantic fields and frames, where these objects are represented as nothing
more than a collection of words.</p>
      <p>To be specific, the meaning and reference of words are taken to be external to language, while
word vectors represent those meanings internally, within language, by associating a word with a
vector (bag of words). These vectors are taken to stand proxy for meaning, so that operations
and combinations of meaning can be performed on the vector representations, which
corresponds to meaning operations externally, see e.g. (Mikolov &amp; Zweig 2013).
Our aim here is not to provide an algebra, or a combinatorial system, of meanings to be applied
to these representations, but rather see in which ways word clusters can represent aspects of
word meaning and make distinctions between shades of meaning.</p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>Graphs may be constructed from word vectors, and vectors from graphs. For example, a
selection of word pairs, taken from skip bigrams for example, may simply be viewed as a graph
of word to word pairs, where the words are the vertices and the pairs are edges of the graph.
From a graph, a word vector (or bag of words) is constructed by collecting all the words a given
word w pairs up with. Conversely, a word w with an associated bag of words W, form a natural
graph structure by pairing the words in W with w.</p>
      <p>The construction is illustrated here with vectors made from coordinative construction in
Norwegian like ost og kjeks (cheese and biscuits), which at the same time introduce a semantics
to the word vector – two words are coordinated if they share something in the context in which
they are uttered.</p>
      <p>Each coordination is assigned a weight in the form of pointwise mutual information (PMI),
computed from the collection of all texts, ensuring to a certain degree that the conjunctions
selected are full phrasal words, and not the end or start of a phrase. For example, blindly
selecting edges X → Y from coordination instances of “X og Y” results in many pairs that are not
true coordinations, since X may just be the end of a phrase coordinated with a phrase that starts
with Y. By using frequency and PMI almost all these instances are eliminated, resulting in a
graph that contains less noise, so that an edge (X,Y) is in fact a coordination of X and Y. So, the
whole process goes like this. The graphs used to represent relations are constructed from a
subset of trigrams on the form “X og Y”, conjunction in the middle. Words in the trigram form
edges in a graph X→ Y.</p>
      <p>The actual construction of the graph iterates the formation of nodes and edges. A graph from a
given word, say Norwegian jordbaer (strawberry), is constructed by considering the set of edges
X→ jordbaer and jordbaer→ Y, resulting in a directed graph. In the analysis below, the
directedness is not taken into account; the graph is converted to an undirected graph before
undergoing further analysis.</p>
      <p>The iteration of the process gives the graphs complexity. For each X and Y above, the process of
collecting words is continued, and so on up to a certain level, creating the necessary complexity
in the graph for network analysis, where the complexity comes from the recurrence of nodes,
and from nodes pointing back to other nodes.</p>
      <p>The graphs shown here are made in three steps. In the case of is (ice), the first step collects
words connected to it, like jordbaer, then the second step adds edges for jordbaer like jordbaer
(strawberry) → moreller (cherries) and jordbaer → blåbaer (blueberry). The third step collects
edges for moreller and blåbaer. When expanding these words, a link is created between them; as
well as new common node bringebaer (raspberry), as illustrated in the following diagram, which
shows part of the graph generated by jordbaer:
This graph also illustrates the typical cluster between nodes in a network, namely that of a
clique, a set of nodes which are all connected to each other. The set of nodes {jordbaer, blåbaer,
bringebaer} is a subgraph in which all the nodes are connected, and thus defines a clique.
Another clique is formed by {jordbaer, bringebaer, moreller}.</p>
      <p>Below, we describe the process of forming clusters using cliques within the development
platform provided by the module networkx (NetworkX (2016)) for the Python programming
language.</p>
    </sec>
    <sec id="sec-4">
      <title>Main findings</title>
      <p>Consider the graph shown below, constructed as described above from the word form kirsebaer
(cherry) and displaying most of the words related to it through the coordination construction. The
layout of the graph is of the force-directed type (Fruchterman &amp; Reingold 1991) as implemented
in networkx. This type of layout provides a visual set of clusters of words based on the weights
between nodes, and the topology of the graph structure. Implementations of force-directed
layouts will, in general, give different outputs on different runs. However, the overall grouping will
be the same, although rotational orientation will vary, as well as spatial relationships.
The graph has been amended with colors marking a partitioning of the nodes according to the
Louvain method (Blondel et.al. 2008), to which we will return below.
On visual inspection, a couple of word groups are immediately apparent. There is one area with
interconnected elements containing berries, and another area containing wood types, in addition
to other accidental readings and connections for some of the words, which can be attributed to
food relations like spices, salt etc.</p>
      <p>Words are clustered in terms of k-cliques (Chakrabarti and Faloutsos 2012) showing how
different readings or meanings of kirsebaer (cherry) can be extracted from the set structure. A
kclique cluster is made such that first a clique with k members is selected, then the cluster gets
new nodes from another clique if the latter only differs from the starting clique in only one node.
So, in the case of Figures 1, {jordbaer, blåbaer, bringebaer} and {jordbaer, bringebaer, moreller}
are joined together to from {jordbaer, blåbaer, bringebaer, moreller}, and so on.</p>
      <p>For reference, we show the whole cluster structure for kirsebaer here, together with a pair of
numbers (k, s) such that k is the size of the clique from which the group is created, and s is an
arbitrary sequence number. For the groups (4, 1), (4, 3) and (7, 1) on which further commentary
follows below, see the English translations in italic:
(3, 2) merbau, furu, gran, kirsebaer, teak, fronter, blandingsved, bøk, ask, bjerk, eik, lønn, bjørk,
ek, Bjørk, mahogni, valnøtt, mahogny
(4, 3) furu, kirsebaer, eik, bjerk, bjørk (pine, cherry, oak, birch, birch)
(5, 1) plommer, kirsebaer, epler, bringebaer, bjørnebaer, moreller, blåbaer, jordbaer, solbaer
(4, 2) rips, tyttebaer, krekling, plommer, kirsebaer, Bringebaer, paerer, eple, epler, bringebaer,
bjørnebaer, Solbaer, moreller, baer, blåbaer, jordbaer, solbaer
(6, 1) plommer, kirsebaer, epler, bringebaer, bjørnebaer, moreller, blåbaer, jordbaer, solbaer
(3, 1) paere, stikkelsbaer, Jordbaer, solbaer, aebler, krydder, plommer, kirsebaer, paerer, bananer,
krekling, epler, Bringebaer, Epler, eple, bringebaer, Solbaer, tomater, moreller, Paerer, jordbaer,
rips, multer, lakris, appelsiner, poteter, Moreller, bjørnebaer, baer, plomme, blåbaer, tyttebaer,
nøtter, urter
(4, 1) paere, kirsebaer, eple, plomme (pear, cherry, apple, plum)
(7, 1) bringebaer, bjørnebaer, moreller, plommer, kirsebaer, jordbaer, solbaer (raspberries,
blackberries, cherries, plums, cherries, strawberries, black currants)
Each group is a cluster generated as a k-clique cluster from the graph. An overview of the
relationships between them can be obtained by looking at the subset relation, which is visualized
using tree structures depicting relationships between clusters.</p>
      <p>In the two trees below, the labels are built to correspond with the (k, s) pairs in the clusters, and
the labelling is taken from the two most central nodes in the graph that are also members of the
cluster. There are two main branches, one starting from (3, 1) and one starting from (3, 2):
The three clusters that appear with a translation, (7, 1), (4, 1) and (4, 3) are the topmost and the
smallest sets within their branch. (7, 1) represents the reading "berry", which, together with the
associated fruits in (4, 1) constitute the clusters emanating from the larger set (3, 1), which also
contains words for vegetables, like the word for potato. Then, on the other side is the reading
"wood" or "tree", which is separate from the rest, although kirsebaer is still a member of the sets.
So, all in all, the graph represents three variations of meaning for one word, one as a tree
(cherry tree), and one as an edible substance, the berries, which also go together with fruits
such as apple and pears (the words eple and paere in the clusters).</p>
      <p>The topology of the differences between the trees may be interpreted along different axes of
polysemy: cherry is used both for berry and tree (a kind of homonymy by metonymy), while the
readings "fruit" and "berry" are brought closer together.</p>
    </sec>
    <sec id="sec-5">
      <title>Additional structure</title>
      <p>In addition to the overlapping clustering produced by k-cliques, there is also the option of
partitioning the graph into non overlapping clusters. A community analysis (Chakrabarti &amp;
Faloutsos 2012) will also generate sets that contain words that may be interpreted as belonging
together semantically. For example, coloring the graph using communities computed using the
Louvain method (Blondel et.al. 2008), it is apparent that the coloring adds information and aids
the visualization of the clusters. Note the slight difference between this rendering of the graph
compared to the one above. The visual graph structure fits well with the structures from the
clique analysis. The nodes representing trees in the upper right corner are colored grey, while
the fruits and berries (as well as spices) occupy the lower left corner.
Communities produce a partition of the nodes as a collection of mutually disjoint sets. Therefore,
our main word kirsebaer is only member of one set, making it harder to detect ambiguities within
the sets, in contrast to the k-clique clustering which created sets that overlapped. However, the
community analysis is broader and groups together words that are missed out by the k-clique
clustering.</p>
      <p>Future plans for this research is to try and group together these two ways of structuring the data,
so that both high precision and subset structure can be used to arrive at a description of multiple
meanings for a word.</p>
      <p>The principal finding is that both k-clique clusters and community detection may be used to find
different levels of meaning for words, and that k-cliques are in general more conservative with
high precision, while community detection, in general, creates partitions that covers the graphs
entirely.</p>
    </sec>
    <sec id="sec-6">
      <title>REFERENCES</title>
      <p>Bartunov, S., Kondrashkin, D., Osokin, A. and Vetrov, D. (2017). Breaking Sticks and Ambiguities
with Adaptive Skip-gram. [online] Arxiv.org. Available at: https://arxiv.org/abs/1502.07257
Blondel, Vincent &amp; Jean-Loup Guillaume &amp; Renaud Lambiotte &amp; Etienne Lefebvre (2008) Fast
unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and
Experiment.</p>
      <p>Breder Birkenes, Magnus &amp; Johnsen, Lars G. &amp; Lindstad, Arne Martinus &amp; Ostad, Johanne,
2015. From digital library to n-grams: NB N-gram in Proceedings of the 20th Nordic Conference
of Computational Linguistics, pp 293—295, Linköping University Electronic Press, Sweden.
Brezina, V., McEnery, T., &amp; Wattam, S. (2015). Collocations in context: A new perspective on
collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173.
Chakrabarti, Deepayan and Faloutsos, Christos 2012. Graph Mining, Morgan &amp; Claypool
Publishers.</p>
      <p>Fruchterman, Thomas M. J. &amp; Reingold, Edward M. (1991), Graph Drawing by Force-Directed
Placement, Software – Practice &amp; Experience, Wiley, 21 (11): 1129–1164,
doi:10.1002/spe.4380211102
Hanneman, Robert A. and Mark Riddle. (2005). Introduction to social network methods.
Riverside, CA: University of California, Riverside. Online text:
http://faculty.ucr.edu/~hanneman/nettext/
Mikolov, T., Yih, W. &amp; Zweig, G. Linguistic regularities in continuous space word representations.
In NAACL HLT, pp. 746–751, 2013b.</p>
      <p>NetworkX (2016). https://networkx.github.io/
Turney P.D. and Pantel P. 2010. From frequency to meaning: vector space models of semantics.
Journal of Artificial Intelligence Research, 37:141–188.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We are grateful to the National Library for providing the facilities for carrying out the work
described here. The data in this study are all available from the URLs described in Breder
Birkenes et.al. 2015.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>