=Paper=
{{Paper
|id=Vol-1917/paper15
|storemode=property
|title=KPCA Embeddings: An Unsupervised Approach to Learn Vector Representations of Finite Domain Sequences
|pdfUrl=https://ceur-ws.org/Vol-1917/paper15.pdf
|volume=Vol-1917
|authors=Eduardo Brito,Rafet Sifa,Christian Bauckhage
|dblpUrl=https://dblp.org/rec/conf/lwa/BritoSB17
}}
==KPCA Embeddings: An Unsupervised Approach to Learn Vector Representations of Finite Domain Sequences==
KPCA Embeddings: An Unsupervised Approach to Learn Vector Representations of Finite Domain Sequences
A Use Case for Words and DNA Sequences

Eduardo Brito, Rafet Sifa, and Christian Bauckhage
Fraunhofer IAIS, Schloss Birlinghoven, Sankt Augustin, Germany
{Eduardo.Alfredo.Brito.Chacon,Rafet.Sifa,Christian.Bauckhage}@iais.fraunhofer.de
https://multimedia-pattern-recognition.info

Copyright © 2017 by the paper's authors. Copying permitted only for private and academic purposes. In: M. Leyer (Ed.): Proceedings of the LWDA 2017 Workshops: KDML, FGWM, IR, and FGDB. Rostock, Germany, 11.-13. September 2017, published at http://ceur-ws.org

Abstract. Most of the well-known word embeddings of the last few years rely on a predefined vocabulary, so that out-of-vocabulary words are usually skipped when they need to be processed. This may cause a significant quality drop in document representations built upon them. Additionally, most of these models do not incorporate information about the morphology of words within the word vectors, or, if they do, they require labeled data. We propose an unsupervised method to generate continuous vector representations that can be applied to any sequence over a finite domain (such as text or DNA sequences) by means of kernel principal component analysis (KPCA). We also show that, apart from their potential value as a preprocessing step within a more complex natural language processing system, our KPCA embeddings can also capture valuable linguistic information without any supervision, in particular the word morphology of German verbs. When applied to DNA sequences, they also encode enough information to detect splice junctions.

1 Introduction

Machine learning approaches for natural language processing (NLP) generally demand a numeric vector representation for words. We can distinguish any two different words from a fixed vocabulary by assigning them a one-hot vector, where all entries of the vector are zero-valued except for a single position that identifies the word. This is a very sparse representation that encodes no information about the words other than their position in the vocabulary.

A more information-rich alternative to one-hot vectors are the so-called word embeddings. They are distributed vector representations, which are dense, low-dimensional, real-valued and can capture latent features of the word [13]. Based on the distributional hypothesis [5] (words that appear in similar contexts have similar meanings), they exploit word co-occurrence so that similar words are mapped close to each other in the word vector space. Part of the success of word embeddings is due to their efficient shallow neural network architectures such as the continuous skip-gram model and the continuous bag-of-words model [9], widely popularized after the release of word2vec (https://code.google.com/archive/p/word2vec). Word embeddings also inspired research in areas other than NLP on learning vector representations, such as node representations from a graph [4].

Although syntax and semantics can be encoded with word2vec embeddings, they do not incorporate morphological information about the word. As a consequence, morphologically similar words may not be nearby in the word vector space. Some approaches make use of existing linguistic resources so that the word embeddings capture not only contextual information but also morphological information [3, 7]. Due to their dependence on language-specific resources, they will not work for languages whose available linguistic resources are scarce.
Subword-based models such as FastText [1, 6] indirectly learn morphology by learning not only word vectors but also n-gram vectors. This enables not only completely unsupervised learning but also the possibility of inferring vectors for out-of-vocabulary (OOV) words, which is an important issue for all other approaches mentioned, notably when noisy informal text needs to be processed. Despite these advantages of subword-based models, some morphologically rich languages (where these models are supposed to perform especially well) may contain very long words. This leads to a dramatic increase in the number of necessary n-gram representations, increasing time and space complexity as well.

We propose an alternative unsupervised method based on kernel principal component analysis (KPCA) that encodes morphology while learning word representations and that can generate new vectors for OOV words after training. In addition, our approach is general enough to learn vector representations for any sequence whose elements belong to a fixed, predefined finite set. In particular, we test our KPCA embeddings on two different tasks: classifying the verb category of German verbs and detecting splice junctions in DNA sequences.

2 Approach

KPCA indirectly maps vectors to a feature space (of higher dimension) in order to obtain the principal components from that space [12]. No explicit calculation in the feature space is required since we only need to be able to compute the inner product in the feature space, and this can be achieved by using kernel functions. We can exploit the freedom to select any inner product of our choice so that we can also apply KPCA to non-numeric entities.

Formally, given a zero-mean column data matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$ containing $n$ $m$-dimensional data points, principal component analysis (PCA) deals with representing the data through principal components maximizing the variance in $\mathbf{X}$ by solving

\[ \mathbf{C}\mathbf{v} = \lambda\mathbf{v}, \tag{1} \]

where $\mathbf{C} = \frac{1}{n}\mathbf{X}\mathbf{X}^\top$ is the covariance matrix, $\lambda$ an arbitrary eigenvalue and $\mathbf{v}$ its corresponding eigenvector. Considering (1), we can represent every eigenvector as a linear combination of the data points of the data matrix $\mathbf{X}$ as

\[ \mathbf{v} = \frac{1}{n\lambda}\mathbf{X}\mathbf{X}^\top\mathbf{v} = \mathbf{X}\boldsymbol{\beta}, \tag{2} \]

where $\boldsymbol{\beta} \in \mathbb{R}^n$. Substituting (2) in (1) we obtain

\[ \frac{1}{n}\mathbf{X}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \lambda\mathbf{X}\boldsymbol{\beta}, \tag{3} \]

which upon data projection can be represented as

\[ \frac{1}{n}\mathbf{X}^\top\mathbf{X}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} = \lambda\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta}. \tag{4} \]

Replacing the occurrences of the Gram matrix $\mathbf{X}^\top\mathbf{X}$ by a selected kernel matrix $\mathbf{K}$ as

\[ \frac{1}{n}\mathbf{K}\mathbf{K}\boldsymbol{\beta} = \gamma\mathbf{K}\boldsymbol{\beta} \tag{5} \]

and eliminating $\mathbf{K}$ as

\[ \frac{1}{n}\mathbf{K}\boldsymbol{\beta} = \gamma\boldsymbol{\beta}, \tag{6} \]

we arrive at a kernelized representation of conventional PCA.

For the particular case of words and DNA sequences, the inner product can be any string similarity. For our experiments, we adapt the Sørensen-Dice coefficient by considering not only bigrams but n-grams of any length in general. Let $V$ be a vocabulary of words and $G_n(w)$ the set of n-grams of a word $w \in V$. We define our similarity function $s$ of two words $x, y \in V$ as follows:

\[ s(x, y) = \sum_{n \in \mathbb{N}^+} \alpha_n \, \frac{2\,|G_n(x) \cap G_n(y)|}{|G_n(x) \cup G_n(y)|}, \qquad \sum_{n} \alpha_n = 1, \tag{7} \]

where $\alpha_n$ determines the weight of the Sørensen-Dice coefficient term for each n-gram length. We compute a similarity matrix $\mathbf{S}$ by applying this similarity function $s$ to all word pairs of our vocabulary $V$:

\[ \mathbf{S}_{ij} = s(w_i, w_j) \quad \forall w_i, w_j \in V. \tag{8} \]

Then, we calculate the kernel matrix $\mathbf{K}$ by applying a non-linear kernel function (for instance an RBF kernel or a polynomial kernel) to the similarity matrix $\mathbf{S}$.
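To make this construction concrete, the following Python sketch (our own illustration, not code from the paper) implements the weighted n-gram overlap of Eq. (7), the similarity matrix of Eq. (8) and a kernel matrix derived from it. The helper names, the toy vocabulary and the choice of treating 1 − S as a distance inside the RBF kernel are assumptions; the paper leaves the exact form of the kernel application open.

```python
import numpy as np

def ngram_set(word, n):
    """Set of character n-grams of a word (hypothetical helper)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def similarity(x, y, alphas):
    """Weighted n-gram overlap as in Eq. (7); alphas maps n-gram length to weight (weights sum to 1)."""
    total = 0.0
    for n, alpha in alphas.items():
        gx, gy = ngram_set(x, n), ngram_set(y, n)
        if gx | gy:  # skip lengths for which neither word has n-grams
            total += alpha * 2.0 * len(gx & gy) / len(gx | gy)
    return total

def similarity_matrix(vocab, alphas):
    """S[i, j] = s(w_i, w_j) for every word pair of the vocabulary, Eq. (8)."""
    return np.array([[similarity(u, v, alphas) for v in vocab] for u in vocab])

def rbf_kernel_matrix(S, sigma):
    """RBF kernel applied element-wise to the similarities, reading 1 - S as a distance (an assumption)."""
    return np.exp(-((1.0 - S) ** 2) / (2.0 * sigma ** 2))

# Toy usage with a hypothetical vocabulary and equal bigram/trigram weights:
vocab = ["lesen", "gelesen", "lese", "liebe", "geliebt"]
alphas = {2: 0.5, 3: 0.5}
K = rbf_kernel_matrix(similarity_matrix(vocab, alphas), sigma=0.7)
```

The resulting matrix K plays the role of the kernel matrix whose eigendecomposition is described next.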
After computing the eigenvectors and eigenvalues of the resulting matrix $\mathbf{K}$, we construct our projection matrix $\mathbf{P}$ by selecting $d$ eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_d$ and dividing them by their respective eigenvalues $\lambda_1, \ldots, \lambda_d$:

\[ \mathbf{P} = \left[ \frac{\mathbf{v}_1}{\lambda_1}, \ldots, \frac{\mathbf{v}_d}{\lambda_d} \right]. \tag{9} \]

We can now generate a KPCA embedding for any word $w_t$. We only need to compute the similarity function of the word against all the words processed from the vocabulary $V$ and apply the kernel function $k$ to the resulting vector. This results in a kernelized distance vector $\mathbf{r}_t$. The product of $\mathbf{r}_t$ with the projection matrix $\mathbf{P}$ constitutes the $d$-dimensional KPCA embedding $\mathbf{u}_t$ of the word $w_t$:

\[ \mathbf{r}_t = k(s(w_t, V)) \quad \text{and} \quad \mathbf{u}_t = \mathbf{P}^\top \mathbf{r}_t. \tag{10} \]

It is important to note that this approach can be generalized to encode any other non-numeric entity as long as we can define an equivalent similarity function between each pair of entities.

Fig. 1: Visualization of KPCA embeddings generated from a reduced German vocabulary with an RBF kernel, σ = 0.7 and two principal components. Each color represents a different verb tense. Black dots refer to German prefixes. Best seen in color.

3 Experiments

We show how our KPCA embeddings can be used for different kinds of sequential data. In particular, we apply our approach to represent words in order to classify different types of verb forms and to represent DNA sequences in order to recognize splice junctions.

Fig. 2: Distribution of 31 different morphological tags from the TIGER treebank. Note the imbalance, especially with respect to first and second person forms.

3.1 Verb classification

We test how well our KPCA embeddings can encode morphology with a fine-grained POS tagging task. We restrict our vocabulary to the tokens tagged as verbs in the TIGER treebank [2]. This simplifies the problem to classifying the correct morphological tag (consisting of grammatical person, number, tense and mood where they apply) of a German verb only from its KPCA embedding. First, we extract all tokens tagged as verbs (corresponding to the TIGER tags VVFIN, VAFIN, VMFIN, VVIMP, VAIMP, VVINF, VVIZU, VAINF, VMINF, VVPP, VMPP, VAPP) and remove all duplicates. This leads to 13370 unique verbs with 31 different morphological tags, whose distribution is shown in Figure 2. Then, we apply the approach described in Section 2. We build a training set consisting of 80% of the verbs and a test set with the remaining 20%. We consider only bigrams and trigrams to compute the similarity function for each pair of verbs, selecting five different weight distributions (different values for α2 and α3 in Equation 7). We also incorporate an additional character at the beginning and at the end of each verb when producing the n-grams.
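Continuing the hypothetical sketch from Section 2 (it reuses `similarity`, `vocab`, `alphas` and `K` defined there), the snippet below illustrates Eqs. (9) and (10): the projection matrix built from the d leading eigenpairs of K, and the inference of an embedding for a possibly out-of-vocabulary word. Centering of the kernel matrix and the boundary-marker decoration of the verbs are omitted for brevity, and the kernel applied at inference time is assumed to be the same RBF form as during training.

```python
import numpy as np

def fit_kpca(K, d):
    """Select d leading eigenvectors of K and scale each by its eigenvalue, as in Eq. (9)."""
    eigvals, eigvecs = np.linalg.eigh(K)   # symmetric K: real eigenpairs in ascending order
    idx = np.argsort(eigvals)[::-1][:d]    # indices of the d largest eigenvalues (assumed positive)
    return eigvecs[:, idx] / eigvals[idx]  # columns v_i / lambda_i form the projection matrix P

def embed(word, vocab, alphas, sigma, P):
    """KPCA embedding of a (possibly out-of-vocabulary) word, following Eq. (10)."""
    s = np.array([similarity(word, w, alphas) for w in vocab])  # similarities to the training vocabulary
    r = np.exp(-((1.0 - s) ** 2) / (2.0 * sigma ** 2))          # same kernel as during training (assumed)
    return P.T @ r                                              # u_t = P^T r_t

# Continuing the toy example: embed a word that need not occur in the vocabulary.
P = fit_kpca(K, d=2)
u = embed("geliebte", vocab, alphas, sigma=0.7, P=P)
```

Because inference only requires similarities against the training vocabulary, the same code path serves both in-vocabulary and OOV words.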
After running KPCA on the training set, we infer vector representations for the verbs in the test set. Using the KPCA embeddings as features and the morphological tag as label, we train k-nearest neighbors classifiers to predict the morphological tag of a word from its KPCA embedding alone. As baseline representations, we also learn a word2vec [10, 9] model for each vector size. These word vectors were learned using the default hyperparameter values.

From Table 1 we can observe that a mean accuracy above 77% can be achieved by classifiers taking only the nearest neighbor (k = 1). This can be considered a high accuracy given the extremely imbalanced label distribution (see Figure 2). Among the trained classifiers, we also find some improvement when the trigram similarity weight (α3) is at least as high as the bigram similarity weight (α2). In addition, every KPCA embedding model beats all word2vec models on this task. For the sake of a fair comparison, the word2vec results displayed in Table 1 correspond to models whose vector size matches a tested number of principal components d of the KPCA embeddings, evaluated with the same k values for k-nearest neighbors. However, we also tested additional word2vec models with vector sizes up to 100 and up to 100 neighbors. None of these larger models reached a mean accuracy above 27%.

3.2 Splice junction recognition on DNA sequences

We encode DNA sequences from the dataset "Molecular Biology (Splice-junction Gene Sequences) Data Set" from the UCI Machine Learning Repository [8] (https://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Splice-junction+Gene+Sequences)). This dataset consists of DNA subsequences represented as 60 characters out of the four nucleobases (A, T, C, G) plus four other characters (D, N, S, R) which mark ambiguity. The sequences may contain a splice junction between the first 30 and the last 30 characters. They are thus labeled with three different categories depending on whether they contain an exon/intron boundary (EI class), an intron/exon boundary (IE class) or neither (N). The distribution of the classes is displayed in Table 2. We compute the similarity function considering all n-grams, giving all terms from (7) the same weight:

\[ \alpha_i = \begin{cases} 1/58, & i \in \{2, \ldots, 59\} \\ 0, & i \notin \{2, \ldots, 59\} \end{cases} \tag{11} \]

Using the learned representations, we train k-nearest neighbors classifiers to predict to which of the three classes each DNA sequence belongs. Analyzing the prediction performance, as we can see from Table 3, we achieve a mean accuracy of 94.67% with two different kernels. This result beats all baseline systems presented with the dataset, including knowledge-based artificial neural networks (KBANN) [11].
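As a minimal illustration of the evaluation protocol, the sketch below trains a k-nearest-neighbors classifier on top of the embeddings; `X_train`, `y_train`, `X_test` and `y_test` are placeholders for KPCA embeddings and their labels (morphological tags or splice-junction classes), and the weight assignment of Eq. (11) is shown as a dictionary. The hyperparameters are illustrative, not the settings behind the tables.

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(X_train, y_train, X_test, y_test, k=1):
    """Mean accuracy of a k-nearest-neighbors classifier on top of the embeddings."""
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Equal weight for every n-gram length from 2 to 59, as in Eq. (11):
alphas_dna = {n: 1.0 / 58 for n in range(2, 60)}

# Example (with placeholder data): accuracy for the nearest-neighbor setting, k = 1.
# acc = knn_accuracy(X_train, y_train, X_test, y_test, k=1)
```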
Table 1: Mean accuracy in % predicting the verb tag with k nearest neighbors, trigram ratio α3 (α2 = 1 − α3) and d principal components, applying different kernel functions. For the word2vec baselines, d refers to the word vector size.

(a) Polynomial kernel (degree 3)

d  | α3   | k=1   | k=2   | k=3   | k=5   | k=10
KPCA embeddings
5  | 1    | 76.76 | 65.19 | 65.20 | 63.02 | 60.56
5  | 0.75 | 76.67 | 66.10 | 65.94 | 63.74 | 60.85
5  | 0.5  | 76.56 | 65.73 | 65.78 | 63.33 | 60.50
5  | 0.25 | 76.72 | 65.33 | 65.14 | 63.01 | 60.27
5  | 0    | 77.01 | 65.19 | 65.14 | 62.64 | 59.60
word2vec
5  | –    | 12.08 | 10.85 | 10.92 | 13.43 | 15.56
KPCA embeddings
10 | 1    | 77.38 | 66.93 | 66.64 | 64.92 | 62.59
10 | 0.75 | 77.13 | 66.42 | 66.26 | 63.70 | 61.37
10 | 0.5  | 76.59 | 65.05 | 64.87 | 62.19 | 59.91
10 | 0.25 | 76.63 | 65.45 | 65.31 | 62.60 | 59.65
10 | 0    | 77.05 | 65.81 | 65.96 | 63.72 | 60.68
word2vec
10 | –    | 13.91 | 11.22 | 12.23 | 15.41 | 17.39
KPCA embeddings
15 | 1    | 77.55 | 66.52 | 66.96 | 65.13 | 63.27
15 | 0.75 | 76.97 | 65.70 | 65.96 | 64.02 | 61.79
15 | 0.5  | 77.28 | 66.34 | 66.55 | 64.62 | 62.22
15 | 0.25 | 77.46 | 67.24 | 67.45 | 65.45 | 63.65
15 | 0    | 77.72 | 67.38 | 67.46 | 65.43 | 63.83
word2vec
15 | –    | 14.29 | 13.39 | 13.99 | 17.69 | 20.08
KPCA embeddings
20 | 1    | 77.22 | 66.67 | 66.43 | 64.92 | 62.75
20 | 0.75 | 77.11 | 65.61 | 65.84 | 64.68 | 61.65
20 | 0.5  | 77.56 | 66.59 | 66.42 | 65.11 | 62.52
20 | 0.25 | 77.55 | 66.88 | 67.03 | 65.16 | 63.20
20 | 0    | 77.49 | 66.65 | 66.96 | 65.47 | 63.59
word2vec
20 | –    | 13.76 | 13.24 | 14.47 | 18.61 | 20.91

(b) RBF kernel (σ = 2.26)

d  | α3   | k=1   | k=2   | k=3   | k=5   | k=10
KPCA embeddings
5  | 1    | 77.08 | 65.29 | 64.95 | 63.02 | 60.48
5  | 0.75 | 76.89 | 66.23 | 66.03 | 64.50 | 61.38
5  | 0.5  | 76.75 | 66.12 | 66.05 | 63.78 | 61.86
5  | 0.25 | 76.95 | 65.66 | 65.69 | 63.64 | 61.02
5  | 0    | 76.89 | 65.95 | 65.53 | 63.65 | 61.26
word2vec
5  | –    | 12.08 | 10.85 | 10.92 | 13.43 | 15.56
KPCA embeddings
10 | 1    | 76.80 | 66.01 | 66.38 | 64.54 | 61.92
10 | 0.75 | 76.72 | 66.13 | 66.83 | 64.30 | 61.95
10 | 0.5  | 77.17 | 66.45 | 66.70 | 64.36 | 61.96
10 | 0.25 | 77.23 | 66.30 | 66.78 | 64.40 | 62.43
10 | 0    | 77.13 | 66.26 | 66.67 | 64.42 | 62.16
word2vec
10 | –    | 13.91 | 11.22 | 12.23 | 15.41 | 17.39
KPCA embeddings
15 | 1    | 77.28 | 66.45 | 66.96 | 64.82 | 62.60
15 | 0.75 | 77.05 | 66.12 | 66.66 | 64.44 | 61.90
15 | 0.5  | 77.23 | 65.94 | 66.17 | 64.01 | 62.00
15 | 0.25 | 76.82 | 65.15 | 65.51 | 63.448| 61.11
15 | 0    | 76.51 | 65.52 | 65.34 | 63.76 | 60.98
word2vec
15 | –    | 14.29 | 13.39 | 13.99 | 17.69 | 20.08
KPCA embeddings
20 | 1    | 77.51 | 66.16 | 67.00 | 65.06 | 62.67
20 | 0.75 | 76.93 | 65.95 | 66.00 | 64.23 | 62.22
20 | 0.5  | 77.05 | 65.01 | 65.29 | 63.76 | 61.61
20 | 0.25 | 76.67 | 64.85 | 64.94 | 63.10 | 61.22
20 | 0    | 76.43 | 64.88 | 64.70 | 62.76 | 60.94
word2vec
20 | –    | 13.76 | 13.24 | 14.47 | 18.61 | 20.91

Table 2: Class distribution of the splice-junction DNA sequences dataset

Class   | Nr. sequences | Ratio
EI      | 767           | 25%
IE      | 768           | 25%
Neither | 1655          | 50%

Table 3: Mean accuracy in % predicting splice junctions with k nearest neighbors and d principal components, applying different kernel functions.

(a) Polynomial kernel (degree 2)

d  | k=1   | k=3   | k=6   | k=10  | k=14  | k=19
1  | 57.21 | 55.49 | 58.46 | 61.91 | 62.54 | 62.07
2  | 73.98 | 76.18 | 79.31 | 79.78 | 81.03 | 80.72
3  | 92.16 | 92.63 | 93.42 | 93.57 | 94.51 | 93.89
4  | 91.22 | 92.32 | 93.10 | 93.26 | 93.42 | 93.42
5  | 90.44 | 92.63 | 93.10 | 93.26 | 93.89 | 93.89
6  | 89.66 | 92.79 | 93.42 | 92.95 | 93.57 | 94.04
7  | 90.75 | 92.48 | 92.95 | 93.57 | 93.57 | 94.51
8  | 90.13 | 93.57 | 93.89 | 93.89 | 93.89 | 93.89
9  | 90.13 | 91.22 | 92.95 | 93.73 | 94.67 | 93.57
10 | 89.50 | 91.69 | 92.16 | 92.95 | 93.42 | 93.42

(b) RBF kernel (σ = 0.72)

d  | k=1   | k=3   | k=6   | k=10  | k=14  | k=19
1  | 52.66 | 56.11 | 59.09 | 58.78 | 61.44 | 60.19
2  | 70.06 | 75.24 | 74.92 | 77.12 | 76.18 | 76.49
3  | 90.60 | 90.44 | 90.28 | 91.38 | 92.01 | 92.32
4  | 92.79 | 93.42 | 94.04 | 93.26 | 93.26 | 94.04
5  | 92.32 | 92.79 | 93.26 | 94.04 | 94.20 | 93.89
6  | 91.85 | 92.79 | 93.42 | 93.89 | 92.95 | 93.26
7  | 90.91 | 93.42 | 93.89 | 94.04 | 93.73 | 93.73
8  | 90.13 | 92.16 | 93.42 | 93.57 | 92.95 | 93.42
9  | 88.87 | 91.22 | 92.16 | 93.89 | 94.67 | 94.04
10 | 88.87 | 91.85 | 92.63 | 93.26 | 92.63 | 91.54
Several results beat the baseline system KBANN (93.68% mean accuracy).

4 Discussion and future work

We showed that our KPCA embedding approach to learning vector word representations can encode the morphology of words in an unsupervised fashion, at least for the particular case of German verbs. The learned KPCA embeddings beat every word2vec model by far in the task of predicting the grammatical tag. The highest accuracy was achieved by taking the nearest neighbor to predict the verb category. Due to this fact, we suspect that our proposed word representations tend to form clusters according to their word form, from which predicting a grammatical tag is a feasible task for simple classifiers such as k-nearest neighbors classifiers.

Since many NLP applications also require syntactic and semantic information about the words, good word embeddings should incorporate information not only from the form of the represented word but also about the context in which it appears. In this direction, we will enhance our approach by adapting our similarity function so that it also considers the frequency with which each evaluated word pair appears in the same context or, alternatively, by taking our KPCA representations as the input representation of a neural network architecture. For the latter, our KPCA would "just" substitute the one-hot encoding of most neural language models.

Fig. 3: Visualization of KPCA embeddings generated from the DNA sequences with an RBF kernel, σ = 0.72 and seven principal components. Our dataset consists of DNA subsequences represented as 60 characters out of the four nucleobases (A, T, C, G) plus four other characters (D, N, S, R) which mark ambiguity. The sequences may contain a splice junction between the first 30 and the last 30 characters. They are thus labeled with three different categories depending on whether they contain an exon/intron boundary (EI class), an intron/exon boundary (IE class) or neither (N). Only the first three components are plotted. Best seen in color.

Additionally, we will extend our research by evaluating the same approach on other morphologically rich inflected languages (like any of the Romance languages) or agglutinative languages (such as Turkish). We assume they may profit the most from our approach since their word forms reveal more grammatical information than word forms from more analytic languages such as English. To this extent, KPCA embeddings may also help to overcome the lack of linguistic resources for some of these non-English languages.

Furthermore, we presented how we can learn KPCA embeddings for DNA sequences. These representations proved to be useful for the task of predicting splice junctions. It would also be interesting to generalize our method to encode other types of discrete sequential data from which n-grams can be extracted, for instance text paragraphs (word n-grams), protein sequences (amino acid n-grams) or even sheet music (note n-grams).

References

1. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
2. S. Brants, S. Dipper, P. Eisenberg, S. Hansen-Schirra, E. König, W. Lezius, C. Rohrer, G. Smith, and H. Uszkoreit. TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620, 2004.
3. R. Cotterell and H. Schütze. Morphological word-embeddings. In HLT-NAACL, pages 1287–1292, 2015.
4. A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.
5. Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
6. A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
7. G. Jurdzinski et al. Word embeddings for morphologically complex languages. Schedae Informaticae, 25:127–138, 2016.
8. M. Lichman. UCI machine learning repository, 2013.
9. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR Workshop, 2013.
10. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
11. M. O. Noordewier, G. G. Towell, and J. W. Shavlik. Training knowledge-based neural networks to recognize genes in DNA sequences. In Advances in Neural Information Processing Systems, pages 530–536, 1991.
12. B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
13. J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, 2010.