<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Vector-Oriented Retrieval in XML Data Collections</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jaroslav Pokorný</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathem</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <fpage>74</fpage>
      <lpage>76</lpage>
      <abstract>
        <p>Many modern applications produce and process XML data, which is queried on both its structural and its textual components. This is especially useful for a casual user who looks for information in web-based database systems or intranets containing XML data, such as online shops, airline reservation systems, or digital library catalogues, and who does not expect an exact answer. Many websites are built from document-centric XML documents [3]. A notable characteristic of such XML data collections is that they are mostly heterogeneous, i.e. they contain domain-focused data, possibly valid w.r.t. various DTDs or XML schemas. XML documents can come from various sources. These collections can be managed as XML databases [5] or simply as collections that give users an approximate way to search their contents. To provide such functionality, these collections must be approached with both database and information retrieval (IR) methods. Current XML query languages such as XPath and XQuery are better suited to data-centric than to document-centric XML data. Moreover, their use often requires XML schemas. In other words, these languages are no longer appropriate for searching in such environments because they cannot cope with the diversity of the data. Hence, research on integrating database querying and IR in the context of XML is undoubtedly an interesting and promising trend. Although a variety of systems supporting such methods have been proposed, conventional IR techniques [2], e.g. the vector space model, can be employed only in a restricted way. The reason is that two types of queries must be handled: content-only (CO) queries, i.e. the traditional ones in IR, and content-and-structure (CAS) queries. A number of techniques extending the vector space model have been designed, e.g. [6], [7], [8], [9], [11], and [12]. 
A common criticism of these approaches is that they do not sufficiently reflect the structure of XML documents. A more advanced, two-phase evaluation scheme is proposed in [1]. First, a modified vector space model is employed to obtain similarity scores for the textual nodes of XML trees. Then the scores are propagated upward in the XML trees, possibly with modification, and new scores may be generated for other nodes. In [13] we described a matrix model based on an extension of the vector space model for XML data. A document D in a collection C of XML documents is represented by a matrix D in which each row vector w<sub>t</sub>, associated with a term t, contains the weights of t for each path occurring in C. A query Q, also regarded as an XML tree, is expressed as a matrix Q. The matrix model evaluates the degree of similarity of D with respect to Q as the correlation between the matrices D and Q.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>
        Experiments have shown that it is not possible to rely on this score alone. Instead, we
adjust the matrix D with an additional data structure, a so-called path transform matrix,
which reflects relationships among paths. The same is done for the matrix Q. The
resulting transformed matrices TD and TQ are then used for query processing. The first
experiments were carried out with the well-known collection of Shakespeare’s plays
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and synthetic data generated by the widely used database benchmark XBench.
      </p>
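      <p>
        The transformation described above can be illustrated with a small sketch. It is not the implementation from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]: the two paths, the term weights, the entries of the path transform matrix T, the function names, and the use of cosine similarity over flattened matrices as a stand-in for the correlation are all assumptions chosen for illustration.
      </p>
      <preformat>
```python
from math import sqrt

def build_matrix(weights, terms, paths):
    """Term-by-path matrix: one row per term, one column per path."""
    return [[weights.get((t, p), 0.0) for p in paths] for t in terms]

def transform(matrix, path_transform):
    """Multiply a term-by-path matrix by a path-by-path transform matrix."""
    columns = list(zip(*path_transform))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in matrix]

def similarity(a, b):
    """Cosine of two equally shaped matrices, flattened to vectors.
    Assumed stand-in for the correlation between matrices."""
    flat_a = [x for row in a for x in row]
    flat_b = [x for row in b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm = sqrt(sum(x * x for x in flat_a)) * sqrt(sum(x * x for x in flat_b))
    return dot / norm if norm else 0.0

# Hypothetical vocabulary and paths of a tiny collection C.
terms = ["hamlet", "king"]
paths = ["/play/title", "/play/act/title"]

# Document D has "hamlet" under /play/act/title; query Q asks for it under /play/title.
D = build_matrix({("hamlet", "/play/act/title"): 1.0}, terms, paths)
Q = build_matrix({("hamlet", "/play/title"): 1.0}, terms, paths)

# Assumed path transform matrix: entry [i][j] says how strongly path i relates to path j.
T = [[1.0, 0.5],
     [0.5, 1.0]]

raw_score = similarity(D, Q)            # computed on the untransformed matrices
TD, TQ = transform(D, T), transform(Q, T)
transformed_score = similarity(TD, TQ)  # computed after applying T to both sides

print(raw_score, transformed_score)
```
      </preformat>
      <p>
        With these assumed values the raw score is zero, because D and Q have no path in common, while the transformed score is positive, because T relates the two title paths to each other.
      </p>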
      <p>
        In the further development of the matrix model we identified its critical points and proposed
a new version based on the approach of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In an experimental implementation (called
MAMEX in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) we used the INEX collection [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as input data. We compared the
vector model with the revised matrix model and explored cases in which the precision of
the results is comparable and cases in which the latter model wins. The experiments
confirmed that the matrix model is mostly no worse than the vector model and is
significantly better for queries with more terms. This can be important
for Web querying, where a page is the query unit and the collection of pages is relatively
stable.
      </p>
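      <p>
        The approach of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] scores a match between a query path and a document path by how closely the two paths resemble each other. A minimal sketch of one commonly described form of that measure follows. It illustrates the idea only and is not the MAMEX implementation; the function names and example paths are hypothetical.
      </p>
      <preformat>
```python
def is_subsequence(short, long):
    """True if `short` can be turned into `long` by inserting extra nodes."""
    remaining = iter(long)
    return all(node in remaining for node in short)

def context_resemblance(query_path, doc_path):
    """Resemblance of two root-to-node paths, in the range 0.0 to 1.0.
    Assumed form: (1 + |q|) / (1 + |d|) when q is a subsequence of d, else 0."""
    q = query_path.strip("/").split("/")
    d = doc_path.strip("/").split("/")
    if not is_subsequence(q, d):
        return 0.0
    return (1 + len(q)) / (1 + len(d))

# Hypothetical paths in the style of the Shakespeare collection.
exact = context_resemblance("/play/title", "/play/title")
nested = context_resemblance("/play/title", "/play/act/title")
unrelated = context_resemblance("/act/speaker", "/play/title")

print(exact, nested, unrelated)
```
      </preformat>
      <p>
        A query path that can be extended into the document path by inserting nodes receives a score between 0 and 1, an identical path receives 1, and paths that cannot be aligned receive 0.
      </p>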
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anh</surname>
            ,
            <given-names>V.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Compression and an IR Approach to XML Retrieval</article-title>
          .
          <source>In: Proc. of the First Workshop of INEX</source>
          , Dagstuhl, Germany,
          <year>December 2002</year>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro-Neto</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Modern information retrieval</article-title>
          . NY: ACM Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barbosa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mignet</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veltri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Studying the XML Web: Gathering Statistics from an XML Sample</article-title>
          .
          <source>World Wide Web</source>
          <volume>8</volume>
          (
          <issue>4</issue>
          ):
          <fpage>413</fpage>
          -
          <lpage>438</lpage>
          , Springer Business + Media;
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bosak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Shakespeare 2.00</source>
          . Los Altos, California, http://www.ibiblio.org/bosak/,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bourret</surname>
          </string-name>
          , R.: XML and Databases, http://www.rpbourret.com/xml/XMLAndDatabases.htm.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bremer</surname>
            ,
            <given-names>J.-M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gertz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>XQuery/IR: Integrating XML Document and Data Retrieval</article-title>
          .
          <source>In: Proc. of the 5th Int. Workshop on the Web and Databases (WebDB)</source>
          ,
          <year>June 2002</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Carmel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efraty</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landau</surname>
            ,
            <given-names>G.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maarek</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mass</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>An Extension of the Vector Space Model for Querying XML Documents via XML Fragments</article-title>
          .
          <source>In: Proc. of XML and Information Retrieval (Workshop) Tampere</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Crouch</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Apte</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bapat</surname>
          </string-name>
          , H.:
          <article-title>Using the Extended Vector Model for XML Retrieval</article-title>
          .
          <source>In: Proc. of the 1st INEX 2002 Workshop</source>
          , Dagstuhl,
          <year>December 2002</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fuhr</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Großjohann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>XIRQL: A Query Language for Information Retrieval</article-title>
          .
          <source>In: Proc. of ACM-SIGIR, New Orleans</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>172</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gövert</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Overview of the INitiative for the Evaluation of XML retrieval (INEX) 2002</article-title>
          .
          <source>In: Proc. of the First Workshop of the INitiative for the Evaluation of XML Retrieval (INEX)</source>
          , Dagstuhl,
          <year>2002</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Grabs</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schek</surname>
          </string-name>
          , H.:
          <article-title>Generating vector spaces on-the-fly for flexible XML retrieval</article-title>
          .
          <source>In: Proc. of XML and Information Retrieval (Workshop)</source>
          , Tampere, ACM Press,
          <year>2002</year>
          , pp.
          <fpage>4</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kakade</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Encoding XML in vector spaces</article-title>
          .
          <source>In: Proc. of the 27th European Conf. on Information Retrieval (ECIR)</source>
          .
          <source>LNCS 3408</source>
          . Springer, NY,
          <year>2005</year>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pokorný</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rejlek</surname>
          </string-name>
          , V.:
          <article-title>A Matrix Model for XML Data</article-title>
          . Chap.
          <source>in: Databases and Information Systems</source>
          , Volume
          <volume>118</volume>
          Frontiers in Artificial Intelligence and Applications, Eds.
          <string-name>
            <given-names>J.</given-names>
            <surname>Barzdins</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Caplinskas</surname>
          </string-name>
          , IOS Press,
          <year>2005</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Vávra</surname>
          </string-name>
          , J.:
          <article-title>Matrix model in context of XML IR methods</article-title>
          .
          <source>Master Thesis</source>
          ,
          <source>Faculty of Mathematics and Physics</source>
          , Charles University, Praha, Czech Republic,
          <year>2005</year>
          . (in Czech)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>