<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RatVec: A General Approach for Low-dimensional Distributed Vector Representations via Rational Kernels</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eduardo Brito</string-name>
          <email>eduardo.alfredo.brito.chacon@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Georgiev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Domingo-Fernandez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles Tapley Hoyt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Bauckhage</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>B-IT, University of Bonn</institution>
          ,
          <addr-line>Endenicher Allee 19a, 53115 Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer Center for Machine Learning</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <addr-line>Schloss Birlinghoven, 53757 Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Fraunhofer SCAI</institution>
          ,
          <addr-line>Schloss Birlinghoven, 53757 Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a general framework, RatVec, for learning vector representations of non-numeric entities based on domain-speci c similarity functions interpreted as rational kernels. We show competitive performance using k -nearest neighbors in the protein family classi cation task and in Dutch spelling correction. To promote re-usability and extensibility, we have made our code and pre-trained models available at https://github.com/ratvec.</p>
      </abstract>
      <kwd-group>
        <kwd>Representation Learning Kernel Principal Component Anal- ysis Bioinformatics Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The success of distributed vector representations in natural language processing
(NLP) tasks has motivated their use for other areas such as biological sequence
analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Most of them rely on the distributional hypothesis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: "words that
appear in similar context are similar". We aim to relax this constraint with a
general framework to learn vector representations of non-numeric entities
including text, DNA sequences, and protein sequences based on similarity functions
that can be expressed as rational kernels [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Supported by the robust
theoretical framework behind rational kernels, we obtain vector representations via
kernel PCA (KPCA) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] that, together with a simple k nearest neighbors (k NN)
classi er, constitute an e cient and explainable classi cation pipeline showing
competitive performance in two di erent tasks: protein family classi cation and
Dutch spelling correction.
      </p>
      <p>
        Copyright c 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
Our approach extends our previous work on KPCA embeddings, where we studied
vector representations for words and DNA sequences via KPCA replacing the dot
product by a speci c similarity function [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There are also a myriad of previous
kernel methods for NLP tasks such as text classi cation [
        <xref ref-type="bibr" rid="ref10 ref6">6, 10</xref>
        ] and named entity
recognition [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, they are either restricted to a speci c kernel function
(i.e. to a concept of similarity) or lack in explicit vector representations.
3
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>Let X be a dataset consisting of n elements, to which we will further refer as
the full vocabulary ; and k a rational kernel (a particular similarity function).
Computing a kernel matrix K from X may be computationally prohibitive for
large datasets due to its time and space complexity (O(n2)). Hence, we select
only the m elements considered "most representative" (in a domain-speci c
formulation) from X to construct our representative vocabulary V and compute a
kernel matrix KV from all element pairs from V . After centering and
diagonalizing KV, we construct our projection matrix PV . This is required to generate
m-dimensional representations for the full vocabulary. In a last step, we assign
the rst d m components of each computed vector to each full vocabulary
element. There are the resulting d-dimensional vector representations obtained
with our approach. When k is suitable for the task, d being of a lower order of
magnitude than m can lead to optimal results, as we will see in Section 4.1. The
choice of m can be adjusted to t the computational resources of the user.</p>
      <p>The produced representations tend to cluster naturally according to the
domain-speci c concept of similarity applied by the selected rational kernel, as
we can see in Fig. 1. This has two main advantages:
1. A simple k NN classi er (eventually 1NN) can su ce for classi cation tasks.
2. Learning the vector representations and k NN constitute an explainable
classi cation pipeline: an entity is assigned to a particular class because it is
similar to the labeled entities used to train the classi er.</p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Results</title>
      <p>
        Protein Family Classi cation
In this task, we classify protein sequences to their corresponding families with the
same Swiss-Prot dataset as in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], containing 7,027 protein families and 324,018
protein sequences5. The system is evaluated for each protein family by 10-fold
cross-validation on a balanced dataset consisting of all amino acid sequences of
the family and as many randomly sampled sequences from all other families.
      </p>
      <p>
        We produced protein representations with 25 dimensions applying our
approach with bigram and trigram similarity and trained a nearest neighbor
classi er (1NN). We generated our representative vocabulary by taking the shortest
sequence of the 1,000 most frequent protein families. We evaluated this pipeline
in the same setting as [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We report the weighted average accuracy results in
Table 1. Also, the results of the evaluation limited to subsets of the full dataset are
presented (e.g. sequences belonging to the 1,000 most frequent protein families).
      </p>
      <p>The results from Table 1 show that our approach with the bigram similarity
reaches the same accuracy as the baseline and or even outperforms it when we
restrict the dataset to the rst 3,000 or 4,000 most frequent protein families.
Considering that the reported BioVec representations are four times longer than
ours, our approach to encode proteins seems to be more e cient.</p>
      <p>
        In contrast to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we do not train any support-vector machine (SVM) but
a 1NN classi er. To assess the impact of the di erent classi cation algorithm,
we evaluated a pretrained BioVec model 6 with our setup. The results (Table
2), showing that 1NN is more suitable than SVMs for this task, are in line
with our previous experiments on classi cation tasks using distributed vector
representations, where simple algorithms such as k NN and logistic regression
outperform more complex ones such as random forests or neural networks [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>5 http://dx.doi.org/10.7910/DVN/JMFHTN</title>
        <p>
          6 https://github.com/kyu999/biovec
We also evaluated our approach by participating in the CLIN28 shared task,
where our system obtained the best F1 score among the competing teams [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The
task focused on correcting errors in extracts from Dutch Wikipedia pages. We
used an older implementation of our approach that we keep for reproducibility7.
        </p>
        <p>
          Spell checkers involve two steps: misspelling detection and correction. The
latter generally requires ranking a set of correction candidates. Generating them
implies nding all valid words di ering from the detected misspelling less than a
determined edit distance, which is mostly set to 1 for real-world applications to
limit computation time [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We avoid this strong constraint with our approach.
        </p>
        <p>
          In our RatVec framework, our full vocabulary is wdutch, a word list from the
OpenTaal project, from which the 3000 most frequent words form our
representative vocabulary. The applied rational kernel is the composition of the bigram
similarity [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] with the homogeneous polynomial kernel of degree 2. We generated
a vector representation for each full vocabulary word with 2000 dimensions.
        </p>
        <p>Once a misspelling is detected, we compute its RatVec representation and
search for the closest precomputed vector. Its related word (its nearest neighbor)
is our correction to the misspelling. Formally, this is equivalent to training a 1NN
classi er where each valid word is assigned a di erent label.</p>
        <p>Our RatVec framework is only relevant during the correction phase for spelling
mistakes that are related to the word form. Hence, we restrict our analysis to
the correction results on the ve error categories where the word form is
relevant, namely those displayed in Table 3. Although the baseline system Valkuil
achieves a better average accuracy than our approach for the analyzed
misspellings, RatVec outperforms Valkuil in some categories where it completely
fails (redundant punctuation errors) or where it cannot be even evaluated
because it failed to detect any error (capitalization and archaic spelling errors).
From these results, we interpret that our word vectors encode word forms in a
suitable way so that similar words can be retrieved.
5</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>We showed that our approach involving rational kernels on KPCA for vector
space embeddings provides rich representations for two di erent kinds of
en</p>
      <sec id="sec-4-1">
        <title>7 https://github.com/fraunhofer-iais/kpca embeddings</title>
        <p>tities: proteins and words. In the rst application, we presented how to learn
protein vectors from amino acid sequences and how to apply them to predict
their protein family. In the second, we learned word representations that encode
word form information so that they can be applied to correct misspellings once
these are detected. In both tasks, our results are comparable to state-of-the-art
approaches. Thanks to the simplicity of k NN classi ers and the interpretable
similarity concept we apply to generate the vector representations, our approach
may be advantageous for some real-world applications compared to more
complex machine learning models such as deep neural networks.</p>
        <p>In future work, we will explore possibilities of learning optimal similarity
metrics (modeled as transducers) that, incorporated in our presented approach,
solve a particular task.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Asgari</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mofrad</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          :
          <article-title>Continuous distributed representation of biological sequences for deep proteomics and genomics</article-title>
          .
          <source>PloS one</source>
          <volume>10</volume>
          (
          <issue>11</issue>
          ),
          <year>e0141287</year>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Beeksma</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>van Gompel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kunneman</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Onrust</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Regnerus</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinke</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brito</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sifa</surname>
          </string-name>
          , R.:
          <article-title>Detecting and correcting spelling errors in high-quality dutch wikipedia text</article-title>
          .
          <source>Computational Linguistics in the Netherlands Journal</source>
          <volume>8</volume>
          ,
          <issue>122</issue>
          {
          <fpage>137</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brito</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sifa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>KPCA embeddings: an unsupervised approach to learn vector representations of nite domain sequences</article-title>
          .
          <source>In: LWDA</source>
          . pp.
          <volume>87</volume>
          {
          <issue>96</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Brito</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sifa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cvejoski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojeda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards German word embeddings: A use case with predictive sentiment analysis</article-title>
          .
          <source>In: Data Science{ Analytics and Applications</source>
          , pp.
          <volume>59</volume>
          {
          <fpage>62</fpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ha</surname>
            <given-names>ner</given-names>
          </string-name>
          , P.,
          <string-name>
            <surname>Mohri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Rational kernels: Theory and algorithms</article-title>
          .
          <source>Journal of Machine Learning Research 5, 1035{1062 (Dec</source>
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Lodhi</surname>
          </string-name>
          , H.:
          <article-title>Latent semantic kernels</article-title>
          .
          <source>Journal of Intelligent Information Systems</source>
          <volume>18</volume>
          (
          <issue>2-3</issue>
          ),
          <volume>127</volume>
          {
          <fpage>152</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Giuliano</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Fine-grained classi cation of named entities exploiting latent semantic kernels</article-title>
          .
          <source>In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning</source>
          . pp.
          <volume>201</volume>
          {
          <fpage>209</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>Z.S.:</given-names>
          </string-name>
          <article-title>Distributional structure</article-title>
          .
          <source>Word</source>
          <volume>10</volume>
          (
          <issue>2-3</issue>
          ),
          <volume>146</volume>
          {
          <fpage>162</fpage>
          (
          <year>1954</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kondrak</surname>
          </string-name>
          , G.:
          <article-title>N-gram similarity and distance</article-title>
          .
          <source>In: Int. Symp. on String Processing and Information Retrieval</source>
          . pp.
          <volume>115</volume>
          {
          <fpage>126</fpage>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lodhi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saunders</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watkins</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Text classi cation using string kernels</article-title>
          .
          <source>Journal of Machine Learning Research 2(Feb)</source>
          ,
          <volume>419</volume>
          {
          <fpage>444</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Scholkopf,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          , Muller, K.R.:
          <article-title>Kernel principal component analysis</article-title>
          .
          <source>In: International conference on arti cial neural networks</source>
          . pp.
          <volume>583</volume>
          {
          <fpage>588</fpage>
          . Springer (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tijhuis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Context-based spelling correction for the Dutch language: Applied on spelling errors extracted from the Dutch wikipedia revision history (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>