<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of MMFaFthUeKmaPtriachsa</institution>
          ,
          <addr-line>aMndalPoshtyrasnicssk,éCnáhmar.l2e5s,U11n8iv0e0rsity Prague</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>173</fpage>
      <lpage>180</lpage>
      <abstract>
        <p>Précis system has been designed for text based searching over relational database. Unlike the common approach, this system allows user to search requested data over a whole database, not only within one table. System takes queries formulated in free-form and produces rows containing information corresponding to the query and also information associated to them within the database. This paper presents implementation of the index, suitable for efficient searching in the Oracle relational database management system. Implementation provides SQL query language extension for searching desired data.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Précis index</kwd>
        <kwd>Relational Databases</kwd>
        <kwd>Data Mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>(0.8)</p>
      <p>(0.8)
Actor
Actor_ID
Role
Movie_ID</p>
      <p>Actor_ID
slow and ineffective. This paper presents the implementation of the Précis index in
the Oracle database system using its extensibility features.</p>
      <p>
        Following chapter describes the Précis indexing in more details. The paper concerns
on brief implementation description and obtained performance results. More details
about the implementation can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Précis system</title>
      <p>The primary data are stored in classical database and can be processed by standard
database application. The Précis querying uses data stored in all tables and columns
and takes into account associations between tables derived from foreign key
references present in the database. References, tables and columns in the database can
be additionally rated by weights, represented by numbers w∈&lt;0;1&gt;. These weights
can be either global or associated to individual users. The Précis query q consists of a
set of keywords q={k1, k2, …, kn}. According to the given query the system finds out
corresponding records and related information. Let suppose following database
structure:
(0.7)
(1.0)</p>
      <p>(0.6)</p>
      <sec id="sec-2-1">
        <title>Picture 1. Sample database structure</title>
        <p>The broader arrows represent foreign keys in the database, narrower dotted arrows
represent opposite associations. Numbers in square brackets represent ratings for
associations. Having such a database, the query q={“Woody”, “Allen”} could provide
the result (without text formatting) in form:</p>
        <p>Director: Woody Allen | December 1st 1935 | Brooklyn | New York | USA
Movie: Match Point | 2005
Movie: Melinda and Melinda | 2004
Movie: Anything Else | 2003
Role: Hollywood Ending | 2002</p>
        <p>Role: The Curse of the Jade Scorpion | 2001
Authors of the Précis system have proposed the query evaluation is in a sequence of
four successive steps: First – create list of all occurrences of all query terms in the
database D. This step provides subschema D’ of all tables containing those terms.
Second – determine the subset of the database schema accessible from D’ using
Person
Person_ID
Name
BirthDate
BirthPlace
(0.9)
(0.9)
(1.0)
Director</p>
        <p>Director_ID
(0.7)</p>
        <p>(1.0)
Movie
Movie_ID
Title
Year
Director_ID
associations. This step provides subschema D”. Third – fill database D” with the
schema D” with retrieved data. Last – optionally transform data in D” to the natural
language using text templates.</p>
        <p>Each record in the result database obtains a rating according to its relation to data
found in the first step. The rating of the row accessible through a chain of association
is computed as a product of ratings assigned to corresponding tables, columns and
association. The result then contains all data with rating exceeding given threshold.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation</title>
      <p>Our concern was on efficient inverted index building in one of most used RDBMS in
the enterprise segment – the Oracle database – and on query evaluation using this
index. The Oracle database was chosen because it provides developers with necessary
background for creating user-defined index types.</p>
      <p>The basic requirements on the Précis index were as follows:
•
•
•
•</p>
      <p>Each index should contain a set of columns stored in arbitrary number of
tables within the database.</p>
      <p>Each user should be able to create any number of indexes.</p>
      <p>One column can be assigned to any number of defined indexes.</p>
      <p>The user interface for index manipulation should be as user-friendly as
possible Querying the database should be available through the Oracle SQL
query language.</p>
      <p>According to above stated requirements, it is necessary to consider each value stored
in any of indexed columns as a separate textual document.</p>
      <p>The Précis index creation process is shown on picture 2.</p>
      <p>Datastore</p>
      <p>Oracle Text
Policy_</p>
      <p>Filter
tscenouDm
txseT</p>
      <p>Stoplist
Policy_</p>
      <p>Tokens</p>
      <sec id="sec-3-1">
        <title>Picture 2. Précis index creation process</title>
        <p>Précis Index
skoenT InPdreéxciinsg</p>
        <p>
          Engine
Each document obtains its unique identifier based on the table owner, table name,
column name and row within the database. Document stored in binary format are
filtered and converted to plaintext. Text is tokenized and converted to a set of terms
together with number of their occurrences within the document. Filtering and
tokenization uses functionality available in Oracle Media Text extension [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For each
term its term frequency (TF) within a given document document and inverted
document frequency (IDF) is computed and the standard TF*IDF formulae [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is then
used for term weight computation.
        </p>
        <p>Logical structure of the Précis inverted index is shown on picture 3.
Data are stored within BLOB records, divided to blocks of fixed size. Document lists
short enough to not exceed one block, are stored directly in the block ordered by
document ID. Longer lists are stored in form of non-redundant B-trees. Each BLOB
record can contain blocks of the same type. It is possible to use more BLOB records
containing ordered lists, providing that their block sizes differ each to other. Having
more available blocksizes ranging from ones to hundreds items speeds up both
searching and index synchronization.</p>
        <p>First block in each BLOB – the header – contains metadata about the BLOB contents.
The most of its content represents a beginning of free blocks list. The list continues
after the last used block. During synchronization the free block list is stored in the
core memory and is written back afterwards.</p>
        <sec id="sec-3-1-1">
          <title>Position</title>
          <p>0
2B
6B
10+</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Size</title>
          <p>2B
4B
4B
4B</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Description</title>
          <p>Number of used blocks in the record
Number of block with the rest of free records.</p>
          <p>Number of blocks used for free blocks list.</p>
          <p>Numbers of first 1000 free blocks within the BLOB.</p>
          <p>Table 1: BLOB header structure
B-tree nodes in the BLOB records have similar format. Two bytes at the beginning of
the node contain number of items stored within the node. Then follow pointers
interlieved with item records. Item records contain the key and needed metadata.
Metadata can either contain list of documents ordered by its numbers or its weights or
it can contain lists of terms with lengths ranging from 1-6, 7-12, 13-24, respectively
25-64 characters. Details are shown in following tables.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>B-tree purpose</title>
          <p>Terms</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>Field</title>
          <p>Key
Metadata</p>
        </sec>
        <sec id="sec-3-1-6">
          <title>Overal</title>
        </sec>
        <sec id="sec-3-1-7">
          <title>Size</title>
          <p>6, 12, 24,
64
12
Documents
ordered
number
Documents
ordered
weigth</p>
          <p>Key
by Metadata
by</p>
          <p>Key
Metadata</p>
        </sec>
        <sec id="sec-3-1-8">
          <title>Decription</title>
          <p>Term padded with trailing zero-bytes
Number of docs containing term
Number of BLOB record with document
list
Number of block within the BLOB record
containing list ordered by document
number.</p>
          <p>Number of block within the BLOB record
containing list ordered by term weigth.</p>
          <p>Zero, if the previous list is not stored as
Btree
Document number</p>
          <p>Document weight
The application is split to two layers. Upper layer is written in PL/SQL language and
contains procedures, monitoring database changes and the user interface. Lower layer
is written in C++ and maintains internal index structures. Communication between
layers is implemented thouigh OCI interface. The C++ performace is much higher
than the alternative PL/SQL implementation. On the other hand, the invocation of
procedures written in C++ from the PL/SQL code requires substantial overhead. The
code was therefore written to minimize switches from PL/SQL to C++ as much as
possible.
3.2</p>
        </sec>
        <sec id="sec-3-1-9">
          <title>User Interface</title>
          <p>The user wanting use Précis index has to have assigned role PRECISAPP containing
the role CTXAPP that allows usage of Oracle Media Text features. To maintain
indexes including foreign columns, the user has to posses additional privileges
CREATE ANY TRIGGER, ADMINISTER DATABASE TRIGGER and SELECT
on given TABLE. Due to security risks maintaining foreign columns is initially
disallowed.</p>
          <p>Index maintainance is done through package PRECIS with following interface.
The search is accelerated through the domain index built upon column DOC_ID of
table PRECIS$index_name$D. This table formally holds all documents, physically
stored in their repective tables. The query searches data from this table or any view
created over this table using operator PRECISSYS.CONTAINS. The system provides
the view PRECIS$index_name$S that provides additional columns stored in internal
index tables. The query would look like
SELECT S.*, PRECISSYS.SCORE(n)</p>
          <p>FROM PRECIS$index_name$S S</p>
          <p>WHERE PRECISSYS.CONTAINS(DOC_ID,Q,L,n) &gt; Treshold
where Q holds Boolean expression with keywords, L means limit – maximal required
number of hits and a Treshold keeps minimal required document rating. Last
parameter n holds numerical identifier of operator within the SELECT statement and
allows obtaining of assigned document rating using auxiliary operator SCORE with
the corresponding identifier. Using Boolean expressions instead of simple keyword
extends the original proposition. To formulate queries, users can use brackets, and
logical operators AND “&amp;”, OR “|” and NOT “-“. The structure of view
PRECIS$index_name$S is described in following table. The column TEXT returns
the beginning of the document and is present for testing purposes.</p>
          <p>
            During evaluation, terms used in the query are identified in the index. If the length of
the term exceeds given limit suffix_search_min_token_len, the index searches for all
its rightwise extensions. Shorter terms are searched in their exact form.
Query evaluation is based on Paice model [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Index search then identifies all
documents that have high weight assigned to at least one term used in the query. To
identify them lists ordered by weight in descendant order by weight are used. The
exact rating for these documents is then evaluated using lists ordered by document
numbers.
3.4
          </p>
        </sec>
        <sec id="sec-3-1-10">
          <title>Index Data modification</title>
          <p>Indexes are synchronized in batches. Changes are logged in the table
PRECISSYS.PRECIS_PENDING. To keep track of changes Précis index uses two
types of triggers. One of them tracks DML changes in tables and the second one
watches for DROP TABLE and TRUNCATE TABLE statements. The INSERT OR
UPDATE OR DELETE FOR EACH ROW trigger has to be created for each indexed
table. Its name is PRECIS_IndexID_TableID_TRG. The trigger
PRECIS_TBLDDL_TRG of the second type is created once in each schema
containing at least one indexed column. Each document has assigned its unique
identifier within the index, that can be translated to the quadruple (owner, table_name,
column_name, rowid) through tables PRECIS$index_name$D and
PRECIS$index_name$U. All BLOB records are stored in the table
PRECIS$index_name$I. Parametrs of all precise indexes are stored in the table
PRECISSYS.PRECIS_PARAMETER.
4</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Performance tests</title>
      <p>The graph on picture 4 shows result for index creation in comparison with Oracle
Multimedia Text.</p>
      <p>The time for 32000 documents increases significantly due to insufficient memory for
buffers, so the data have to be transferred to a from the disk and the overall number of
I/O operation was increased. In average, the index creation process is 2-3 times
slower than native Oracle Text indexing on one table.</p>
      <p>Vytvoření indexu - porovnání s Oracle Text
600
300
250
200
))s
(s(s150
e
a
imČ
T100
50
0</p>
      <p>IndInedxesxecovšnetaminiitnegrmalyl terms
IndInedxeoxbcsoanhtuajiínciíntgetremrmyss wfreithkvferenqc.í &gt;vě0tš, 0í0n1ež 0,001
OrOacralecleTeTxetxt
1000
2000</p>
      <sec id="sec-4-1">
        <title>Picture 4. Précis index creation</title>
        <p>The query evaluation was tested on collection containing 32000 documents from
Wikipedia. Results represent execution times for queries Q1 = “Astronomy”,
Q2 = “Information System”, Q3 = “Pulp Fiction”, Q4 = “Database Search Oracle
MySQL”, Q5 = “(Information Retrieval | Text Retrieval Systems)”, Q6 = “Relational
Database Index”.</p>
        <p>Rating computation
Data retrieval
Total
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We proposed data structures suitable for Précis indexing and implemented it in the
widespread Oracle database. The proposed implementation extends original idea by
possibility to formulate queries using Boolean operators. The formulation of queries
is familiar to Oracle application developers because of its similarity to Oracle
Multimedia Text queries. Embeding most of document processing path provided by
Oracle Media Text allows maintaining not only data in plain text columns, but also
documents in different format stored in the database. The proposed data structures
still provide quick searching of documents together with tolerable speed of indexing
process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Simitsis</surname>
            <given-names>A</given-names>
          </string-name>
          . and col.:
          <article-title>Précis: from unstructured keywords as queries to structured databases as answers</article-title>
          .
          <source>International Journal on Very Large Data Bases (VLDB Journal)</source>
          , Volume
          <volume>17</volume>
          ,
          <string-name>
            <surname>Number</surname>
            <given-names>1</given-names>
          </string-name>
          ,
          <year>2008</year>
          ,
          <fpage>117</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Čermák</surname>
            <given-names>M.</given-names>
          </string-name>
          : Master Thesis:
          <article-title>Index pro textové vyhledávání nad relačními daty</article-title>
          , Charles University Prague,
          <year>2008</year>
          .
          <article-title>(in Czech language</article-title>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Oracle</given-names>
            <surname>Text: Application Developer's Guide</surname>
          </string-name>
          ,
          <source>10g Release 2</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Salton</surname>
          </string-name>
          , G. and
          <string-name>
            <surname>M. J. McGill</surname>
          </string-name>
          (
          <year>1983</year>
          ).
          <article-title>Introduction to modern information retrieval</article-title>
          .
          <source>McGraw-Hill. ISBN 0070544840.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Paice</surname>
            ,
            <given-names>C. P.</given-names>
          </string-name>
          (
          <year>1984</year>
          ),
          <article-title>Soft Evaluation of Boolean Search Queries in Information Retrieval Systems</article-title>
          , Information Technology,
          <source>Res. Dev. Applications</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <fpage>33</fpage>
          -
          <lpage>42</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>