An FCA classification of durations of time for textual databases Ulrik Sandborg-Petersen Department of Communication and Psychology Kroghstræde 3, DK – 9220 Aalborg East, Denmark ulrikp@hum.aau.dk Abstract. Formal Concept Analysis (FCA) is useful in many applica- tions, not least in data analysis. In this paper, we apply the FCA ap- proach to the problem of classifying sets of sets of durations of time, for the purposes of storing them in a database. The database system in question is, in fact, an object-oriented text database system, in which all objects are seen as arbitrary sets of integers. These sets need to be clas- sified in textually relevant ways in order to speed up search. We present an FCA classification of these sets of sets of durations, based on linguis- tically motivated criteria, and show how its results can be applied to a text database system. 1 Introduction Formal Concept Analysis (FCA)[1, 2] has many applications, not least of which is aiding a human analyst in making sense of large or otherwise incomprehensible data sets. In this paper, we present an application of FCA to the problem of classifying classes of linguistic objects that meet certain linguistically motivated criteria, with the purpose of storing them in a text database system. We have developed a text database system, called Emdros1 , capable of storing and retrieving not only text, but also annotations of that text [3, 4]. Emdros implements the EMdF model, in which all textual objects are seen as sets of sets of durations of time with certain attributes. The rest of the paper is laid out as follows. In Sect. 2, I describe four prop- erties of language as it relates to time. In Sect. 3, I describe the EMdF model. In Sect. 4, I mathematically define a set of criteria which may or may not hold for a given object type. This results in a Formal Context of possible classes of objects, having or not having these criteria. In Sect. 5, I use FCA to arrive at a set of criteria which should be used as indexing mechanisms in Emdros in order to speed up search. In Sect. 6, I discuss the implementation of the criteria arrived at in the previous section, and evaluate the performance gains obtained by using them. Finally, I conclude the paper and give pointers to further research. 1 http://emdros.org 2 Language as durations of time Language is always heard or read in time. That is, it is a basic human condition that whenever we wish to communicate in verbal language, it takes time for us to decode the message. A word, for example, may be seen as a duration of time during which a linguistic event occurs, viz., a word is heard or read. This takes time to occur, and thus a message or text occurs in time. In this section, we describe four properties of language which have conse- quences for how we may model linguistic objects such as words or sentences. First, given that words occur in time, and given that words rarely stand alone, but are structured into sentences, and given that sentences are (at one level of analysis) sequences of words, it appears obvious that sequence is a basic property of language. We will therefore not comment further on this property of language. Second, language always carries some level of structure; for example, the total duration of time which a message fills may be broken down into shorter durations which map to words. Intermediate between the word-level and the message-level, we usually find sentences, clauses, and phrases. Thus, linguistic units embed within each other. For a lucid discussion of the linguistic terms involved, please see [5, 6]. Third, language carries the property of being resumptive. By this we mean that linguistic units are not always contiguous, i.e., they may occupy multiple, disjoint durations of time. For one such opinion, see [7]. A fourth important property of linguistic units is that they may “violate each other’s borders.” By this we mean that, while unit A may start at time a and end at time c, unit B may start at time b and end at time d, where a < b < c < d. Thus, while A overlaps with B, they cannot be placed into a strict hierarchy. 3 The EMdF model In his PhD thesis from 1994 [8], Crist-Jan Doedens formulated a model of text which meets the four criteria outlined in the previous section. Doedens called his model the “Monads dot Features” (MdF) model. We have taken Doedens’ MdF model and extended it in various ways, thus arriving at the Extended MdF (EMdF) model. In this section, we describe the EMdF model. Central to the EMdF model is the notion that textual units (such as books, paragraphs, sentences, and even words) can be viewed as sets of monads. A monad is simply an integer, but may be viewed as an indivisible duration of time.2 Objects in the EMdF model are pairs (M, F ) where M is a set of monads, and F is a set of pairs (fi , vi ) where fi is the ith feature (or attribute), and vi is the value of fi for this particular object. A special feature, “self” is always 2 Please note that we use the term “monad”, not in the well-established algebraic sense, but as a synonym for “integer in the context of the EMdF model, meaning an indivisible duration of time”. present in any F belonging to any object, and provides an integer ID which is unique across the whole database. The inequality M 6= ∅ holds for all objects in an EMdF database. Since textual objects can often be classified into similar kinds of objects with the same attributes (such as words, paragraphs, sections, etc.), the EMdF model provides object types for grouping objects. 4 Criteria In this section, we introduce some linguistically motivated criteria that may or may not hold for the objects of a given object type T . This will be done with reference to the properties inherent in language as described in Sect. 2. In the following, let Inst(T ) denote the set of objects of a given object type T . Let a and b denote objects of a given object type. Let µ denote a function which, given an object, produces the set of monads M being the first part of the pair (M, F ) for that object. Let m denote a monad. Let f (a) denote µ(a)’s first (i.e., lowest) monad, and let l(a) denote µ(a)’s last (i.e., highest) monad. Let [m1 : m2 ] denote the set of monads consisting of all the monads from m1 to m2 , both inclusive. Range types: single monad(T ): means that all objects are precisely 1 monad long. ∀a ∈ Inst(T ) : f (a) = l(a) single range(T ): means that all objects have no gaps (i.e., the set of mon- ads constituting each object is a contiguous stretch of monads). ∀a ∈ Inst(T ) : ∀m ∈ [f (a) : l(a)] : m ∈ µ(a) multiple range(T ): is the negation of “single range(T )”, meaning that there exists at least one object in Inst(T ) whose set of monads is discon- tiguous. Notice that the requirement is not that all objects be discon- tiguous; only that there exists at least one which is discontiguous. ∃a ∈ Inst(T ) : ∃m ∈ [f (a) : l(a)] : m 6∈ µ(a) ≡ ¬(∀a ∈ Inst(T ) : ∀m ∈ [f (a) : l(a)] : m ∈ µ(a)) ≡ ¬(single range(T)) Uniqueness constraints: unique first monad(T ): means that no two objects share the same start- ing monad. ∀a, b ∈ Inst(T ) : a 6= b ↔ f (a) 6= f (b) ≡ ∀a, b ∈ Inst(T ) : f (a) = f (b) ↔ a = b unique last monad(T ): means that no two objects share the same ending monad. ∀a, b ∈ Inst(T ) : a 6= b ↔ l(a) 6= l(b) ≡ ∀a, b ∈ Inst(T ) : l(a) = l(b) ↔ a = b Notice that the two need not hold at the same time. Table 1. All the possible classes of object types. Legend: sm = single monad, sr = single range, mr = multiple range, ufm = unique first monad, ulm = unique last monad, ds = distinct, ol = overlapping, vb = violates borders. Class name sm sr mr ufm ulm ds ol vb 1.000 X X X Class name sm sr mr ufm ulm ds ol vb 1.300 X X X X X 3.000 X X 2.000 X X 3.001 X X X 2.001 X X X 3.100 X X X 2.100 X X X 3.101 X X X X 2.101 X X X X 3.200 X X X 2.200 X X X 3.201 X X X X 2.201 X X X X 3.300 X X X X 2.300 X X X X 3.301 X X X X X 2.301 X X X X X 3.310 X X X X 2.310 X X X X Linguistic properties: distinct(T ): means that all pairs of objects have no monads in common. ∀a, b ∈ Inst(T ) : a 6= b → µ(a) ∩ µ(b) = ∅ ≡ ∀a, b ∈ Inst(T ) : µ(a) ∩ µ(b) 6= ∅ → a = b overlapping(T ): is the negation of distinct(T ). ¬(distinct(T )) ≡ ∃a, b ∈ Inst(T ) : a 6= b ∧ µ(a) ∩ µ(b) 6= ∅ violates borders(T ): ∃a, b ∈ Inst(T ) : a 6= b ∧ µ(a) ∩ µ(b) 6= ∅ ∧ ((f (a) < f (b)) ∧ (l(a) ≥ f (b)) ∧ (l(a) < l(b))) Notice that violates borders(T ) → overlapping(T ), since violates borders(T ) is overlapping(T ), with an extra, conjoined term. It is possible to derive the precise set of possible classes of objects, based on logical analysis of the criteria presented in this section. For details, please see [9]. The possible classes are listed in Table 1. The context resulting from these tables is then processed by the Concept Explorer software (ConExp)3 . This produces a lattice as in Fig. 1. 5 Application It is immediately noticeable from looking at Fig. 1 that “ds” is quite far down the lattice, with several parents in the lattice. It is also noticeable that “ol” is quite far up in the lattice, with only the top node as its parent. Therefore, “ds” may not be as good a candidate for a criterion on which to index as “ol”. Hence, we decided to experiment with the lattice by removing the “ds” attribute. 3 See http://conexp.sourceforge.net. Also see [10]. Fig. 1. The lattice drawn by ConExp for the whole context. By drawing this new lattice with ConExp, it is noticeable that the only dependent attributes are “sm” and “vb”: All other attributes are at the very top of the lattice, with only the top node as their parent. This means we are getting closer to a set of criteria based on which to index sets of monads. The three range types should definitely be accommodated in any indexing scheme. The reasons are: First, “single monad” can be stored very efficiently, namely just by storing the single monad in the monad set. Second, “single range” is also very easy to store: It is sufficient to store the first and the last monad. Third, “multiple range”, as we have argued in Sect. 2, is necessary to support in order to be able to store resumptive (discontiguous) linguistic units. It can be stored by storing the monad set itself in marshalled form, perhaps along with the first and last monads. This leaves us with the following criteria: “unique first monad”, “unique last monad”, “overlapping”, and “violates borders” to decide upon. In real-life linguistic databases, “unique first monads” and “unique last mon- ads” are equally likely to be true of any given object type, in the sense that if one is true, then the other is likely also to be true, while if one is false, then the other is likely also to be false. This is because of the embedding nature of language explained in Sect. 2: If embedding occurs at all within a single object type, then it is likely that both first and last monads are not going to be unique. Therefore, we decided to see what happens to the lattice if we remove one of the two uniqueness criteria from the list of attributes. The criterion chosen for removal was “unique last monads”. Once this is done, ConExp reports that “unique first monads” subsumes 11 objects, or 55%. This means that “unique first monads” should probably be included in the set of criteria on which to index. Similarly, still removing “ds” and “ulm”, and selecting “overlapping”, we get the lattice drawn in Fig. 2. ConExp reports that “overlapping” subsumes 17 objects, or 85%, leaving only 3 objects out of 20 not subsumed by “overlapping”. This indicates that “overlapping” is probably too general to be a good candidate for treating specially. It is also noticeable that “violates borders” only subsumes 4 objects. Hence it may not be such a good candidate for a criterion to handle specially, since it is too specific in its scope. Thus, we arrive at the following list of criteria to handle specially in the database: a) single monad; b) single range; c) multiple range; and d) unique first monads. 6 Implementation and evaluation The three range types can be easily implemented in a relational database system along the lines outlined in the previous section. The “unique first monads” criterion can be implemented in a relational database system by a “unique” constraint on the “first monad” column of a table holding the objects of a given object type. Notice that for multiple range, if we store the first monad of the monad set in a separate column from the monad set itself, this is possible for all three range types. Notice also that, if we use one row to store each object, the “first monad” column can be used as a primary key if “unique first monads” holds for the object type. We have run some evaluation tests of 124 diverse Emdros queries against two versions of the same linguistic database, each loaded into four backends (SQLite 3, SQLite 2, PostgreSQL, and MySQL). One version of the database did not have the indexing optimizations arrived at in the previous section, whereas the other version of the database did. The version of Emdros used was 3.0.1. The hardware was a PC with an Intel Dual Core 2, 2.4GHz CPU, 7200RPM SATA-II disks, and 3GB of RAM, running Fedora Core Linux 8. The 124 queries were run twice on each database, and an average obtained by dividing by 2 the sum of the “wall time” (i.e., real time) used for all 2 × 124 queries. The results can be seen in Table 2. As can be seen, the gain obtained for MySQL and PostgreSQL is almost negligible, while it is significant for the two versions of SQLite. Fig. 2. The lattice drawn without the “ds” and “ulm” attributes, and with “ol” se- lected. 7 Conclusion We have presented four properties that natural language possesses, namely se- quence, embedding, resumption, and non-hierarchic overlap, and we have seen how these properties can be modeled as sets of durations of time. We have presented the EMdF model of text, in which indivisible units of time (heard or read) are represented by integers, called “monads”. Textual units are then seen as objects, represented by pairs (M, F ), where M is a set of monads, and F is a set of attribute-value assignments. An object type then gathers all objects with like attributes. We have then presented some criteria which are derived from some of the four properties of language outlined above. We have formally defined these in terms of objects and their monads. We have then derived an FCA context from these criteria, which we have then converted to a lattice using the Concept Explorer Software (ConExp). Table 2. Evaluation results on an Emdros database, in seconds. Backend SQLite 3 SQLite 2 PostgreSQL MySQL Avg. time for DB without optimizations 153.92 130.99 281.56 139.41 Avg. time for DB with optimizations 132.40 120.00 274.20 136.65 Performace gain 13.98% 8.39% 2.61% 1.98% We have then analyzed the lattice, and have arrived at four criteria which should be treated specially in an implementation. We have then suggested how these four criteria can be implemented in a relational database system. They are, in fact, implemented in ways similar to these suggestions in the Emdros corpus query system. We have also evaluated the performance gains obtained by implementing the four criteria. Thus FCA has been used as a tool for reasoned selection of a number of criteria which should be treated specially in an implementation of a database system for annotated text. Future work could also include: a) Derivation of more, pertinent criteria from the four properties of language; b) Exploration of these criteria using FCA; c) Implementation of such criteria; and d) Evaluation of any performance gains. References 1. Lehmann, F., Wille, R.: A triadic approach to formal concept analysis. In Ellis, G., Levinson, R., Rich, W., Sowa, J.F., eds.: Proceedings of ICCS’95. Volume 954 of LNAI., Springer Verlag (1995) 32–43 2. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer-Verlag New York, Inc., Secaucus, NJ, USA (1997) Translator-C. Franzke. 3. Petersen, U.: Emdros — a text database engine for analyzed or annotated text. In: Proceedings of COLING 2004. (2004) 1190–1193 http://emdros.org/petersen- emdros-COLING-2004.pdf. 4. Petersen, U.: Principles, implementation strategies, and evaluation of a corpus query system. In: Proceedings of the FSMNLP 2005. Volume 4002 of LNAI., Springer Verlag (2006) 5. Van Valin, Jr., R.D.: An introduction to Syntax. Cambridge University Press, Cambridge, U.K. (2001) 6. Horrocks, G.: Generative Grammar. Longman, London and New York (1987) 7. McCawley, J.D.: Parentheticals and discontinuous constituent structure. Linguistic Inquiry 13(1) (1982) 91–106 8. Doedens, C.J.: Text Databases: One Database Model and Several Retrieval Lan- guages. Editions Rodopi Amsterdam (1994) ISBN 90-5183-729-1. 9. Sandborg-Petersen, U.: Annotated Text Databases in the Context of the Kaj Munk Corpus: One database model, one query language, and several applications. PhD thesis, Aalborg University, Denmark (2008) 10. Yevtushenko, S.A.: System of data analysis ”concept explorer”. (in russian). In: Proceedings of the 7th national conference on Artificial Intelligence KII-2000, Rus- sia. (2000) 127–134