<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An FCA classification of durations of time for textual databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ulrik Sandborg-Petersen</string-name>
          <email>ulrikp@hum.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Communication and Psychology Kroghstraede 3</institution>
          ,
          <addr-line>DK - 9220 Aalborg East</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Formal Concept Analysis (FCA) is useful in many applications, not least in data analysis. In this paper, we apply the FCA approach to the problem of classifying sets of sets of durations of time, for the purposes of storing them in a database. The database system in question is, in fact, an object-oriented text database system, in which all objects are seen as arbitrary sets of integers. These sets need to be classified in textually relevant ways in order to speed up search. We present an FCA classification of these sets of sets of durations, based on linguistically motivated criteria, and show how its results can be applied to a text database system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Language as durations of time</title>
      <p>Language is always heard or read in time. That is, it is a basic human condition
that whenever we wish to communicate in verbal language, it takes time for us
to decode the message. A word, for example, may be seen as a duration of time
during which a linguistic event occurs, viz., a word is heard or read. This takes
time to occur, and thus a message or text occurs in time.</p>
      <p>In this section, we describe four properties of language which have
consequences for how we may model linguistic objects such as words or sentences.</p>
      <p>First, given that words occur in time, and given that words rarely stand
alone, but are structured into sentences, and given that sentences are (at one
level of analysis) sequences of words, it appears obvious that sequence is a basic
property of language. We will therefore not comment further on this property of
language.</p>
      <p>
        Second, language always carries some level of structure; for example, the
total duration of time which a message fills may be broken down into shorter
durations which map to words. Intermediate between the word-level and the
message-level, we usually find sentences, clauses, and phrases. Thus, linguistic
units embed within each other. For a lucid discussion of the linguistic terms
involved, please see [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        Third, language carries the property of being resumptive. By this we mean
that linguistic units are not always contiguous, i.e., they may occupy multiple,
disjoint durations of time. For one such opinion, see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>A fourth important property of linguistic units is that they may “violate each
other’s borders.” By this we mean that, while unit A may start at time a and
end at time c, unit B may start at time b and end at time d, where a &lt; b &lt; c &lt; d.
Thus, while A overlaps with B, they cannot be placed into a strict hierarchy.
3</p>
    </sec>
    <sec id="sec-3">
      <title>The EMdF model</title>
      <p>
        In his PhD thesis from 1994 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Crist-Jan Doedens formulated a model of text
which meets the four criteria outlined in the previous section. Doedens called
his model the “Monads dot Features” (MdF) model. We have taken Doedens’
MdF model and extended it in various ways, thus arriving at the Extended MdF
(EMdF) model. In this section, we describe the EMdF model.
      </p>
      <p>Central to the EMdF model is the notion that textual units (such as books,
paragraphs, sentences, and even words) can be viewed as sets of monads. A
monad is simply an integer, but may be viewed as an indivisible duration of
time.2</p>
      <p>Objects in the EMdF model are pairs (M, F ) where M is a set of monads,
and F is a set of pairs (fi, vi) where fi is the ith feature (or attribute), and vi
is the value of fi for this particular object. A special feature, “self” is always
2 Please note that we use the term “monad”, not in the well-established algebraic
sense, but as a synonym for “integer in the context of the EMdF model, meaning an
indivisible duration of time”.
present in any F belonging to any object, and provides an integer ID which is
unique across the whole database. The inequality M 6= ∅ holds for all objects in
an EMdF database.</p>
      <p>Since textual objects can often be classified into similar kinds of objects with
the same attributes (such as words, paragraphs, sections, etc.), the EMdF model
provides object types for grouping objects.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Criteria</title>
      <p>In this section, we introduce some linguistically motivated criteria that may or
may not hold for the objects of a given object type T . This will be done with
reference to the properties inherent in language as described in Sect. 2.</p>
      <p>In the following, let Inst(T ) denote the set of objects of a given object type
T . Let a and b denote objects of a given object type. Let μ denote a function
which, given an object, produces the set of monads M being the first part of
the pair (M, F ) for that object. Let m denote a monad. Let f (a) denote μ(a)’s
first (i.e., lowest) monad, and let l(a) denote μ(a)’s last (i.e., highest) monad.
Let [m1 : m2] denote the set of monads consisting of all the monads from m1 to
m2, both inclusive.</p>
      <p>Range types:
Uniqueness constraints:
single monad(T ): means that all objects are precisely 1 monad long.</p>
      <p>∀a ∈ Inst(T ) : f (a) = l(a)
single range(T ): means that all objects have no gaps (i.e., the set of
monads constituting each object is a contiguous stretch of monads).
∀a ∈ Inst(T ) : ∀m ∈ [f (a) : l(a)] : m ∈ μ(a)
multiple range(T ): is the negation of “single range(T )”, meaning that
there exists at least one object in Inst(T ) whose set of monads is
discontiguous. Notice that the requirement is not that all objects be
discontiguous; only that there exists at least one which is discontiguous.
∃a ∈ Inst(T ) : ∃m ∈ [f (a) : l(a)] : m 6∈ μ(a)
≡ ¬(∀a ∈ Inst(T ) : ∀m ∈ [f (a) : l(a)] : m ∈ μ(a))
≡ ¬(single range(T))
unique first monad(T ): means that no two objects share the same
starting monad.
∀a, b ∈ Inst(T ) : a 6= b ↔ f (a) 6= f (b)
≡ ∀a, b ∈ Inst(T ) : f (a) = f (b) ↔ a = b
unique last monad(T ): means that no two objects share the same ending
monad.
∀a, b ∈ Inst(T ) : a 6= b ↔ l(a) 6= l(b)
≡ ∀a, b ∈ Inst(T ) : l(a) = l(b) ↔ a = b</p>
      <p>
        Notice that the two need not hold at the same time.
It is immediately noticeable from looking at Fig. 1 that “ds” is quite far down
the lattice, with several parents in the lattice. It is also noticeable that “ol” is
quite far up in the lattice, with only the top node as its parent. Therefore, “ds”
may not be as good a candidate for a criterion on which to index as “ol”. Hence,
we decided to experiment with the lattice by removing the “ds” attribute.
3 See http://conexp.sourceforge.net. Also see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>By drawing this new lattice with ConExp, it is noticeable that the only
dependent attributes are “sm” and “vb”: All other attributes are at the very
top of the lattice, with only the top node as their parent. This means we are
getting closer to a set of criteria based on which to index sets of monads.</p>
      <p>The three range types should definitely be accommodated in any indexing
scheme. The reasons are: First, “single monad” can be stored very efficiently,
namely just by storing the single monad in the monad set. Second, “single range”
is also very easy to store: It is sufficient to store the first and the last monad.
Third, “multiple range”, as we have argued in Sect. 2, is necessary to support in
order to be able to store resumptive (discontiguous) linguistic units. It can be
stored by storing the monad set itself in marshalled form, perhaps along with
the first and last monads.</p>
      <p>This leaves us with the following criteria: “unique first monad”, “unique last
monad”, “overlapping”, and “violates borders” to decide upon.</p>
      <p>In real-life linguistic databases, “unique first monads” and “unique last
monads” are equally likely to be true of any given object type, in the sense that if
one is true, then the other is likely also to be true, while if one is false, then
the other is likely also to be false. This is because of the embedding nature of
language explained in Sect. 2: If embedding occurs at all within a single object
type, then it is likely that both first and last monads are not going to be unique.</p>
      <p>Therefore, we decided to see what happens to the lattice if we remove one
of the two uniqueness criteria from the list of attributes. The criterion chosen
for removal was “unique last monads”. Once this is done, ConExp reports that
“unique first monads” subsumes 11 objects, or 55%. This means that “unique
first monads” should probably be included in the set of criteria on which to
index.</p>
      <p>Similarly, still removing “ds” and “ulm”, and selecting “overlapping”, we
get the lattice drawn in Fig. 2. ConExp reports that “overlapping” subsumes 17
objects, or 85%, leaving only 3 objects out of 20 not subsumed by “overlapping”.
This indicates that “overlapping” is probably too general to be a good candidate
for treating specially.</p>
      <p>It is also noticeable that “violates borders” only subsumes 4 objects. Hence
it may not be such a good candidate for a criterion to handle specially, since it
is too specific in its scope.</p>
      <p>Thus, we arrive at the following list of criteria to handle specially in the
database: a) single monad; b) single range; c) multiple range; and d) unique first
monads.
6</p>
    </sec>
    <sec id="sec-5">
      <title>Implementation and evaluation</title>
      <p>The three range types can be easily implemented in a relational database system
along the lines outlined in the previous section.</p>
      <p>The “unique first monads” criterion can be implemented in a relational
database system by a “unique” constraint on the “first monad” column of a
table holding the objects of a given object type. Notice that for multiple range,
if we store the first monad of the monad set in a separate column from the
monad set itself, this is possible for all three range types. Notice also that, if
we use one row to store each object, the “first monad” column can be used as a
primary key if “unique first monads” holds for the object type.</p>
      <p>We have run some evaluation tests of 124 diverse Emdros queries against two
versions of the same linguistic database, each loaded into four backends (SQLite
3, SQLite 2, PostgreSQL, and MySQL). One version of the database did not
have the indexing optimizations arrived at in the previous section, whereas the
other version of the database did. The version of Emdros used was 3.0.1. The
hardware was a PC with an Intel Dual Core 2, 2.4GHz CPU, 7200RPM SATA-II
disks, and 3GB of RAM, running Fedora Core Linux 8. The 124 queries were
run twice on each database, and an average obtained by dividing by 2 the sum
of the “wall time” (i.e., real time) used for all 2 × 124 queries. The results can
be seen in Table 2.</p>
      <p>As can be seen, the gain obtained for MySQL and PostgreSQL is almost
negligible, while it is significant for the two versions of SQLite.
We have presented four properties that natural language possesses, namely
sequence, embedding, resumption, and non-hierarchic overlap, and we have seen
how these properties can be modeled as sets of durations of time.</p>
      <p>We have presented the EMdF model of text, in which indivisible units of time
(heard or read) are represented by integers, called “monads”. Textual units are
then seen as objects, represented by pairs (M, F ), where M is a set of monads,
and F is a set of attribute-value assignments. An object type then gathers all
objects with like attributes.</p>
      <p>We have then presented some criteria which are derived from some of the four
properties of language outlined above. We have formally defined these in terms
of objects and their monads. We have then derived an FCA context from these
criteria, which we have then converted to a lattice using the Concept Explorer
Software (ConExp).</p>
      <p>Backend SQLite 3 SQLite 2 PostgreSQL MySQL
Avg. time for DB without optimizations 153.92 130.99 281.56 139.41
Avg. time for DB with optimizations 132.40 120.00 274.20 136.65
Performace gain 13.98% 8.39% 2.61% 1.98%</p>
      <p>We have then analyzed the lattice, and have arrived at four criteria which
should be treated specially in an implementation.</p>
      <p>We have then suggested how these four criteria can be implemented in a
relational database system. They are, in fact, implemented in ways similar to
these suggestions in the Emdros corpus query system. We have also evaluated
the performance gains obtained by implementing the four criteria.</p>
      <p>Thus FCA has been used as a tool for reasoned selection of a number of
criteria which should be treated specially in an implementation of a database
system for annotated text.</p>
      <p>Future work could also include: a) Derivation of more, pertinent criteria from
the four properties of language; b) Exploration of these criteria using FCA; c)
Implementation of such criteria; and d) Evaluation of any performance gains.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>A triadic approach to formal concept analysis</article-title>
          . In Ellis, G.,
          <string-name>
            <surname>Levinson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Rich,
          <string-name>
            <given-names>W.</given-names>
            ,
            <surname>Sowa</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.F</surname>
          </string-name>
          ., eds.
          <source>: Proceedings of ICCS'95</source>
          . Volume 954 of LNAI., Springer Verlag (
          <year>1995</year>
          )
          <fpage>32</fpage>
          -
          <lpage>43</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          . Springer-Verlag New York, Inc., Secaucus, NJ, USA (
          <year>1997</year>
          )
          <string-name>
            <surname>Translator-C. Franzke</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Emdros - a text database engine for analyzed or annotated text</article-title>
          .
          <source>In: Proceedings of COLING</source>
          <year>2004</year>
          .
          <article-title>(</article-title>
          <year>2004</year>
          )
          <fpage>1190</fpage>
          -
          <lpage>1193</lpage>
          http://emdros.org/petersenemdros-COLING-
          <year>2004</year>
          .pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Principles, implementation strategies, and evaluation of a corpus query system</article-title>
          .
          <source>In: Proceedings of the FSMNLP 2005</source>
          .
          <article-title>Volume 4002 of LNAI</article-title>
          ., Springer Verlag (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Van</surname>
            <given-names>Valin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jr.</surname>
          </string-name>
          , R.D.:
          <article-title>An introduction to Syntax</article-title>
          . Cambridge University Press, Cambridge, U.K. (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Horrocks</surname>
          </string-name>
          , G.:
          <article-title>Generative Grammar</article-title>
          . Longman, London and New York (
          <year>1987</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>McCawley</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Parentheticals and discontinuous constituent structure</article-title>
          .
          <source>Linguistic Inquiry</source>
          <volume>13</volume>
          (
          <issue>1</issue>
          ) (
          <year>1982</year>
          )
          <fpage>91</fpage>
          -
          <lpage>106</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Doedens</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <source>Text Databases: One Database Model and Several Retrieval Languages. Editions Rodopi Amsterdam</source>
          (
          <year>1994</year>
          ) ISBN 90-5183-729-1.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sandborg-Petersen</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Annotated Text Databases in the Context of the Kaj Munk Corpus: One database model, one query language, and several applications</article-title>
          .
          <source>PhD thesis</source>
          , Aalborg University, Denmark (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yevtushenko</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          :
          <article-title>System of data analysis ”concept explorer”. (in russian)</article-title>
          .
          <source>In: Proceedings of the 7th national conference on Artificial Intelligence KII-2000</source>
          , Russia. (
          <year>2000</year>
          )
          <fpage>127</fpage>
          -
          <lpage>134</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>