<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating Subspace Distances in Semantic Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>G. Zuccon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Azzopardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E. Gasco</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing Science, University of Glasgow</institution>
          ,
          <addr-line>Scotland</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Zirak</institution>
          ,
          <addr-line>Mondovì</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semantic space models of word meaning derived from co-occurrence statistics within a corpus of documents, such as the Hyperspace Analogue to Language (HAL) model, have been proposed in the past. While word similarity can be computed using these models, it is not clear how semantic spaces derived from different sets of documents can be compared. In this paper, we focus on this problem, and we revisit the proposal of using semantic subspace distance measurements [1]. In particular, we outline the research questions that still need to be addressed to investigate and validate these distance measures. Then, we describe our plans for future research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Traditional information retrieval (IR) paradigms use statistics of word
occurrence to match documents with users queries. Although a number of IR models
have been proposed, they usually focus on key word matching [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, there
are many occasions when the queries do not contain the exact same terms which
appear in relevant documents. Consequently, models that only consider word
occurrence statistics will fail to rank such documents highly, if at all [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It has been
argued that some of the limits of IR systems can be overcome by understanding
the \meaning" of the user query and of the indexed documents [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Mathematical
and computational models of (word) meaning and semantics have been proposed
in the past. Among those, semantic space models have attracted much interest
and shown to be successful in particular tasks, e.g. information inference for
query expansion [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Common to semantic space techniques is the mapping of
words into a high dimensional vector space [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], generally obtained by
computing3 lexical co{occurrences between words appearing in the same context, e.g.
Hyperspace Analogue to Language (HAL) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and probabilistic HAL (pHAL) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
approaches. Other models however can be used that do not derive semantic
evidences directly from word co-occurrences, e.g. Latent Semantic Analysis [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
However, a problem common to all types of semantic spaces is how to measure
the distance between subspaces. Previous work has been limited to
measuring distances between vector representations of words within the same semantic
space. For example, the Minkowski distance, or alternatively one of its
specialisation, such as the Euclidean distance, can be used to compare individual word
3 Vectors are assigned to words so as to represent the co{occurrences between a word and others.
vectors (see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Following this approach in order to measure distance between
semantic spaces, however, would led to imprecise measurements. This is because
a concept can be expressed by di erent words in di erent semantic spaces: thus
their vector representations are not going to be considered. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a measure has
been developed, inspired by Quantum Physics, to overcome this problem. The
approach compares whole subspaces (generated from semantic spaces), instead
of individual vectors, i.e. documents are represented with subspaces instead of
word-vectors. This provides a novel way in which the distance measurements
can be performed, as all the vectors that form a basis of the semantic spaces
contribute in building up the distance measurement.
      </p>
      <p>
        In this paper we build upon the proposal suggested in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and outline the
plan of a joint collaboration between the University of Glasgow and Zirak (http://www.zirak.it/).
This industry-funded project aims to explore the subspace distance measure and
empirically investigate its utility and application within IR.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Measuring the Distance between Subspaces</title>
      <p>
        The semantic subspace distance proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is inspired by the chordal
distance, a monotonic function of the inner product. The idea underlying this
measure is that all the basis vectors that describe the subspaces should concur to build
up the distance between the subspaces themselves. This is in contrast with the
use of pairwise distance measurements between representations of the same word
in different subspaces. The definition of the semantic subspace distance can be
given either in relation to the inner products between vectors belonging to the
bases (i.e. u_i, v_j) of the compared subspaces, or depending upon the trace
(indicated by the function tr(·)) of the projectors (i.e. P_a, P_b) describing the subspaces:

d_s(S_a, S_b) = sqrt( max(p, r) − Σ_{i=1}^{p} Σ_{j=1}^{r} (u_i^T v_j)^2 )    (1)
              = sqrt( max(p, r) − tr(P_a P_b) )                              (2)

where p and r are the dimensions of the subspaces S_a and S_b, respectively
(the dimension of a subspace is defined as the number of vectors that form a basis of the
subspace; recall that a basis must both span the subspace and be linearly independent).
Geometrically, this "loosely" corresponds to taking into account the projection
of a subspace onto another subspace (see eq. 2), rather than the intersection
between the two subspaces — "loosely" in the sense that eq. 2 also considers the
dimensionality of the subspaces and applies a trace operator to the projection of the
subspaces. The semantic subspace distance can be employed to
compute the distance between semantic subspaces (such as, for example, the ones
generated by HAL), aiming to obtain a more precise measurement of separation
than a naive distance based upon the comparison of single word
representations, e.g. the Minkowski distance. We refer the interested reader to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
for the derivation and a discussion of the properties of this distance.
      </p>
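For concreteness, the distance can be sketched as follows. This is our own illustrative implementation (not code released with [1]), assuming each subspace is supplied as a matrix whose columns form an orthonormal basis; the function name is ours.

```python
# A minimal sketch of the semantic subspace distance (our illustration,
# not the authors' implementation). U (n x p) and V (n x r) are assumed
# to hold orthonormal basis vectors of the two subspaces in their columns.
import numpy as np

def subspace_distance(U, V):
    """Compute d_s(Sa, Sb) = sqrt(max(p, r) - sum_ij (u_i^T v_j)^2).

    The double sum equals tr(Pa Pb) with projectors Pa = U U^T and
    Pb = V V^T, so eq. 1 and eq. 2 coincide.
    """
    p, r = U.shape[1], V.shape[1]
    overlap = np.sum((U.T @ V) ** 2)  # = tr(Pa Pb)
    # clip guards against tiny negative values from floating-point error
    return np.sqrt(max(p, r) - min(overlap, max(p, r)))
```

As a sanity check, identical subspaces are at distance 0, while two orthogonal one-dimensional subspaces are at distance 1.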
      <p>
        In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a pilot study was performed which showed that the measure grouped
relevant documents at a closer distance than irrelevant documents. Thus, the
distance might be better at discriminating between relevant and non-relevant
documents than other similarity measures. While these initial results are
encouraging, they warrant further investigation, in particular by testing the
distance measure on a number of heterogeneous document collections. Furthermore,
neither computational efficiency problems nor the suitability for real IR
applications have been investigated and discussed. In the next section we outline a
number of open questions that have yet to be investigated and our plan to
address them.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Directions of Future Research</title>
      <p>In the following we detail the research questions and the plan of investigation of a
joint project between the University of Glasgow and Zirak, which aims to develop
further knowledge about the semantic subspace distance. The final outcome of
the project is two-fold. On the one hand, we aim to achieve a deeper understanding
of the geometry of semantic spaces and of the semantic subspace
measure. On the other hand, we aim to release an open source package for creating
semantic subspaces and measuring subspace distances, so as to provide the IR
community with a tool for using the distance in real retrieval applications, such
as information filtering. Our research questions are:
1. which semantic models can be used to obtain a semantic subspace?
2. how can a basis for a semantic subspace be found? Is there a technique that
provides a (qualitatively) "better" basis than alternative techniques?
3. how can the distance be tested?
4. in which IR scenarios can the measure be applied?</p>
      <p>
        Semantic models for semantic (sub)spaces. Previous studies have used
HAL for deriving semantic spaces from sets of documents. Other techniques
might be used to derive alternative representations of documents in terms of
word co-occurrences and semantics [
        <xref ref-type="bibr" rid="ref10 ref8">8,10</xref>
        ]. However, we have chosen to focus our
implementation and investigation on HAL and probabilistic HAL [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
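As an illustration of the kind of representation HAL builds, the sketch below accumulates distance-weighted co-occurrence counts over a sliding window. The window size and the linear weighting scheme are illustrative assumptions of ours, not the exact parameters of the HAL or pHAL implementations cited above.

```python
# Illustrative HAL-style co-occurrence matrix (our sketch; window size
# and weighting are assumptions, not the cited implementations' values).
import numpy as np

def hal_matrix(tokens, window=4):
    """Accumulate distance-weighted co-occurrence counts.

    Words closer together inside the sliding window receive a higher
    weight (window - distance + 1); row i records the words that
    follow word i within the window.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            # weight decays linearly with distance inside the window
            M[index[w], index[tokens[j]]] += window - d + 1
    return M, vocab
```

The word vectors of the semantic space are then the rows (and/or columns) of this matrix.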
      <p>Finding a basis for semantic (sub)spaces. A common linear algebra
procedure for finding the basis of a subspace is the application of the
Gram-Schmidt algorithm (which produces an orthonormal basis). However, alternative
algorithms are suitable for this problem. Furthermore, given the large
dimensionality of semantic subspaces and the success of techniques for dimensionality
reduction within IR, algorithms for dimensionality reduction can be employed
to extract an approximate (reduced) basis. For example, Singular Value
Decomposition yields matrices that contain a set of orthonormal basis vectors (the
left singular vectors) for the subspace, as well as a matrix containing the singular values
associated with the basis vectors with respect to the given subspace. In preliminary
studies, the QR decomposition (which directly implements the Gram-Schmidt
algorithm) has been employed to obtain subspace bases. If dimensionality
reduction is not applied, the choice among the previous techniques makes no quantitative
difference with respect to the semantic subspace distance.
Therefore, we plan to investigate not only these techniques, but also a
particular matrix decomposition procedure, Non-negative Matrix Factorisation
(NMF), which indeed yields qualitative as well as quantitative differences when
computing the semantic subspace distance. In fact, NMF generates bases that
are not orthonormal (nor orthogonal), but such that the elements of the basis
vectors are non-negative. Qualitatively, such bases might have interpretative
advantages with respect to bases that also contain negative numbers (a "negative"
feature value is difficult to interpret, as it does not necessarily imply the absence,
or negation, of that feature).</p>
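The two orthonormal-basis options above can be sketched as follows. This is our illustration only: the rank tolerance of 1e-10 is an arbitrary choice, and NMF is not sketched here since it is not part of the standard numerical-linear-algebra toolbox used below.

```python
# Our sketch of extracting an (exact or reduced) orthonormal basis for
# the column space of a word-by-word matrix; 1e-10 is an arbitrary
# rank tolerance, not a value from the project.
import numpy as np

def qr_basis(A):
    """Orthonormal basis via QR (i.e. Gram-Schmidt orthogonalisation)."""
    Q, R = np.linalg.qr(A)
    # drop columns whose diagonal pivot is numerically zero:
    # they contribute no extra dimension to the subspace
    keep = np.abs(np.diag(R)) > 1e-10
    return Q[:, keep]

def svd_basis(A, k):
    """Approximate (reduced) basis: the top-k left singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k]
```

With no reduction, both routes span the same subspace and hence give the same subspace distance; truncating `k` in `svd_basis` trades exactness for dimensionality reduction.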
      <p>
        Testing the semantic subspace distance. It is our plan to extend the
pilot study of the semantic subspace distance reported in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], by considering a
larger and more heterogeneous set of collections. Furthermore, we are interested in
experimenting with semantic spaces created from texts and news feeds written in
languages other than English, in particular Italian. However, this might require
the creation of suitable test collections that contain topics and associated
relevance judgments. Indeed, experiments performed on these collections will
enable us to obtain an indication of the goodness of the distance measure.
      </p>
      <p>IR tasks for the semantic subspace distance. Preliminary results
suggest that the semantic subspace distance is able to discriminate between relevant
and irrelevant documents. If this finding is confirmed during the planned
investigation, then the distance might be successfully used in the tasks of information
filtering and relevance feedback, where a set of relevant documents can be
given as input to the system. A further area of application is that of document
classification and clustering, where the semantic subspace distance can be used
to determine the similarity of two documents.</p>
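A toy end-to-end illustration of this envisaged use follows; everything here (the three-sentence corpus, window size, tolerance, and all function names) is made up for the example and is not the project's released code.

```python
# Toy pipeline, entirely illustrative: HAL-style matrices -> QR bases
# -> subspace distance between document representations.
import numpy as np

def hal(tokens, vocab, window=2):
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                M[idx[w], idx[tokens[i + d]]] += window - d + 1
    return M

def basis(M):
    Q, R = np.linalg.qr(M)
    return Q[:, np.abs(np.diag(R)) > 1e-10]

def distance(U, V):
    p, r = U.shape[1], V.shape[1]
    overlap = np.sum((U.T @ V) ** 2)
    return np.sqrt(max(p, r) - min(overlap, max(p, r)))

doc_a = "the cat chased the mouse".split()
doc_b = "the cat chased the rat".split()
doc_c = "stocks fell as markets closed".split()
vocab = sorted(set(doc_a + doc_b + doc_c))
Ua, Ub, Uc = (basis(hal(d, vocab)) for d in (doc_a, doc_b, doc_c))
# on this toy corpus, the topically related pair ends up closer
print(distance(Ua, Ub) < distance(Ua, Uc))
```

Each document becomes a subspace, and the distance then acts as a drop-in dissimilarity for filtering or clustering.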
      <p>Acknowledgments This project is funded and supported by Zirak under grant
agreement no. 55500/1 and by the EPSRC Renaissance (EP/F014384/1) project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Zuccon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Rijsbergen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Semantic spaces: Measuring the distance between different subspaces</article-title>
          .
          <source>In: QI</source>
          . Volume
          <volume>5494</volume>
          . (
          <year>2009</year>
          )
          <fpage>225</fpage>
          -
          <lpage>236</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Sparck Jones</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Language modelling and relevance</article-title>
          .
          <source>In: Language Modeling for Information Retrieval</source>
          . Kluwer Academic Publishers (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Belkin</surname>
            ,
            <given-names>N.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oddy</surname>
            ,
            <given-names>R.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>H.M.</given-names>
          </string-name>
          :
          <article-title>ASK for information retrieval: Part I. Background and theory</article-title>
          .
          <source>Journal of Documentation</source>
          <volume>38</volume>
          (
          <issue>2</issue>
          ) (
          <year>1982</year>
          )
          <fpage>61</fpage>
          -
          <lpage>71</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Furnas</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deerwester</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harshman</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Streeter</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lochbaum</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          :
          <article-title>Information retrieval using a singular value decomposition model of latent semantic structure</article-title>
          .
          <source>In: SIGIR '88</source>
          . (
          <year>1988</year>
          )
          <fpage>465</fpage>
          -
          <lpage>480</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruza</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          :
          <article-title>Discovering Information Flow Using High Dimensional Conceptual Space</article-title>
          .
          <source>In: SIGIR '01</source>
          . (
          <year>2001</year>
          )
          <fpage>327</fpage>
          -
          <lpage>333</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Osgood</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suci</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <source>The Measurement of Meaning</source>
          . University of Illinois Press (US) (
          <year>1957</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lund</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgess</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Producing high-dimensional semantic spaces from lexical co-occurrence</article-title>
          .
          <source>Behavior Research Methods, Instruments &amp; Computers</source>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girolami</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crowe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Probabilistic hyperspace analogue to language</article-title>
          .
          <source>In: SIGIR '05</source>
          . (
          <year>2005</year>
          )
          <fpage>575</fpage>
          -
          <lpage>576</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Deerwester</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furnas</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harshman</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>Indexing by Latent Semantic Analysis</article-title>
          .
          <source>JASIS</source>
          <volume>41</volume>
          (
          <issue>6</issue>
          ) (
          <year>1990</year>
          )
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jurgens</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The S-Space package: an open source package for word space models</article-title>
          .
          <source>In: ACL 2010</source>
          . (
          <year>2010</year>
          )
          <fpage>30</fpage>
          -
          <lpage>35</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>