<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word embeddings for semantic similarity: comparing LDA with Word2vec</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luandrie Potgieter</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alta de Waal</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Artificial Intelligence Research</institution>
          ,
          <addr-line>CAIR</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Statistics, University of Pretoria</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Distributional semantics is a subfield of Natural Language Processing (NLP) that studies methods for deriving meaning and semantic representations from text. These representations can be thought of as statistical distributions of words, under the assumption that such distributions characterise semantic behaviour. They are often referred to as embeddings, or lower-dimensional representations of the text. The two main categories of embeddings are count-based and prediction-based methods. Count-based methods most often result in global word embeddings, as they utilise the frequency of words in documents. Prediction-based embeddings, on the other hand, are often referred to as local embeddings, as they take into account a window of words adjacent to the word of interest. Once a corpus is represented by its semantic distribution (which lies in a vector space), a range of similarity measures can be applied. This in turn enables other NLP applications such as collaborative filtering, aspect-based sentiment analysis, intent classification for chatbots and machine translation. In this research, we investigate the appropriateness of Latent Dirichlet Allocation (LDA) [1] and word2vec [4] embeddings as semantic representations of a corpus. Once the semantic representation of a corpus is obtained, the distances between the documents and query documents are computed by means of distance measures such as cosine, Euclidean or Jensen-Shannon. We expect short distances between documents with similar semantic representations and longer distances between documents with different semantic representations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Experimental setup</title>
      <p>
        In our experiment, we use the 20 Newsgroups dataset, which contains 20000
documents partitioned across 20 different newsgroups [<xref ref-type="bibr" rid="ref3">3</xref>]. Examples
of the newsgroup topics are motorcycles, religion and computer software. Our
experimental workflow is as follows:
      </p>
      <list list-type="order">
        <list-item>
          <p>Train the word embeddings (LDA and word2vec) on the complete dataset. The output of this step is a vector space representation of the corpus.</p>
        </list-item>
        <list-item>
          <p>Choose the documents of one newsgroup, say motorcycles, and split them 80%/20%. The 80% subset is referred to as the reference set and the 20% subset as the query set.</p>
        </list-item>
        <list-item>
          <p>Index the 80% reference set and the 20% query set (separately) on the vector space representation obtained in step 1.</p>
        </list-item>
        <list-item>
          <p>Calculate the similarity between the reference set and the query set. For LDA, we use the Jensen-Shannon distance [<xref ref-type="bibr" rid="ref2">2</xref>], and for word2vec we use soft cosine [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
        </list-item>
        <list-item>
          <p>Create an alternative query set from a different newsgroup, say religion. Also index this query set on the vector space representation obtained in step 1.</p>
        </list-item>
        <list-item>
          <p>Calculate the similarity between the reference set and the alternative query set.</p>
        </list-item>
      </list>
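      <p>Steps 2, 4 and 6 of the workflow can be illustrated with a minimal, standard-library Python sketch. It is not the implementation used in the experiments: the function names, the three-topic distributions, the three-word vocabulary and the term-similarity matrix are all hypothetical, and each set is represented by a single vector because the text does not specify how distances are aggregated over a set of documents. The sketch assumes that the Jensen-Shannon distance is the square root of the base-2 Jensen-Shannon divergence, and that soft cosine is the cosine measure generalised with a term-similarity matrix (which in practice could be derived from word2vec vectors).</p>
      <preformat preformat-type="code">
```python
import math
import random


def split_reference_query(docs, query_fraction=0.2, seed=0):
    """Shuffle docs and return (reference_set, query_set), 80%/20% by default."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_query = round(len(shuffled) * query_fraction)
    return shuffled[n_query:], shuffled[:n_query]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; zero-probability terms of p are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence (assumed definition)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = (kl_divergence(p, m) + kl_divergence(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))  # clamp tiny negative rounding errors


def soft_cosine(a, b, s):
    """Soft cosine of term-count vectors a and b under term-similarity matrix s."""
    def bilinear(x, y):
        return sum(s[i][j] * x[i] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    return bilinear(a, b) / math.sqrt(bilinear(a, a) * bilinear(b, b))


# Step 2: split one newsgroup (ten placeholder document ids) 80%/20%.
reference_ids, query_ids = split_reference_query(range(10))

# Steps 4 and 6 for LDA: hypothetical topic distributions over three topics.
reference_topics = [0.8, 0.1, 0.1]    # motorcycles reference set
query_topics = [0.7, 0.2, 0.1]        # motorcycles query set
alternative_topics = [0.1, 0.1, 0.8]  # religion query set
d_same = jensen_shannon_distance(reference_topics, query_topics)
d_other = jensen_shannon_distance(reference_topics, alternative_topics)

# Steps 4 and 6 for word2vec: hypothetical vocabulary (bike, motorcycle, church)
# with a made-up term-similarity matrix.
similarity = [[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]]
s_same = soft_cosine([1, 0, 0], [0, 1, 0], similarity)
s_other = soft_cosine([1, 0, 0], [0, 0, 1], similarity)

print(len(reference_ids), len(query_ids))
print(d_same, d_other)
print(s_same, s_other)
```
      </preformat>
      <p>With these toy values, the distance to the query set from the same newsgroup is smaller than the distance to the alternative query set, and the soft cosine similarity is correspondingly larger. This is the pattern the workflow is designed to test for.</p>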
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          ,
          <issue>30</issue>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fuglede</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Topsoe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Jensen-Shannon divergence and Hilbert space embedding</article-title>
          .
          <source>In: International Symposium on Information Theory, 2004. ISIT 2004. Proceedings</source>
          . pp.
          <fpage>30</fpage>
          –
          <lpage>30</lpage>
          . IEEE, Chicago, Illinois, USA (
          <year>2004</year>
          ). https://doi.org/10.1109/ISIT.2004.1365067
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Karishma</given-names>
            <surname>Borkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nutan</given-names>
            <surname>Dhande</surname>
          </string-name>
          :
          <article-title>Efficient Text Classification of 20 Newsgroup Dataset using Classification Algorithm</article-title>
          .
          <source>International Journal on Recent and Innovation Trends in Computing and Communication</source>
          <volume>5</volume>
          (
          <issue>6</issue>
          ) (
          <month>Jun</month>
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>arXiv:1301.3781 [cs]</source>
          (
          <month>Jan</month>
          <year>2013</year>
          ), http://arxiv.org/abs/1301.3781
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelbukh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Soft similarity and soft cosine measure: similarity of features in vector space model</article-title>
          .
          <source>Computación y Sistemas</source>
          <volume>18</volume>
          (
          <issue>3</issue>
          ) (
          <month>Sep</month>
          <year>2014</year>
          ). https://doi.org/10.13053/cys-18-3-2043
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>