A set of tools for integrating linguistic and non-linguistic
                            information
                                                            Thierry Declerck¹


¹ DFKI GmbH, 66123 Saarbruecken, Germany

Abstract. In this position paper we describe the current state of development of an integrated set of tools (called SCHUG) for language processing that supports interaction with disparate sources of information, thus making Natural Language Processing (NLP) and Human Language Technology (HLT) even more relevant for Information Technology (IT) applications. The tools communicate with non-language-based devices and services via machine-readable XML annotations. Non-linguistic information, in most cases domain-specific knowledge, can thus be included straightforwardly in the linguistically analysed texts and so contribute to a knowledge markup of textual documents. The basic language technology guiding this markup is Information Extraction (IE), and the added information can be made visible by means of automatic hyperlinking and visualization techniques.

1     Introduction

In this paper we describe the current state of development of an integrated set of tools (called SCHUG) for language processing that supports interaction with various sources of information, thus making Natural Language Processing (NLP) and Human Language Technology (HLT) even more relevant for Information Technology (IT) applications. The tools communicate with non-language-based resources, devices and services, and integrate them into textual documents, via machine-readable annotations and protocols in XML, the standard underlying all Web services. It is thus important that all information-providing devices deliver XML output, or at least have an output format that can easily be transformed into XML. Ontologies providing a hierarchical description of domain-specific knowledge are very good candidates for interaction with natural language processing tools, as Information Extraction tasks have already shown, since ontologies can easily be described in XML-based representation languages and mapped onto XML-encoded results of linguistic analyses.

2     The Chunk Parser used: SCHUG

SCHUG (Shallow and CHunk-based Unification Grammar tools) has been designed in such a way that it can read results from various language processing tools (at any level of NL processing up to the detection of Grammatical Functions) and transform them into an XML document conforming to our basic (shallow) linguistic DTD², which is shown below:

<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<!ELEMENT NE ( W+ ) >
<!ATTLIST NE SUBTYPE NMTOKEN #REQUIRED >
<!ATTLIST NE TYPE ( 1 | 12 | 2 | 3 | 4 | 5 | 6 | 8 ) #REQUIRED >
<!ELEMENT NP ( NE | W )* >
<!ATTLIST NP TYPE NMTOKEN #FIXED "1" >
<!ELEMENT PARAGRAPH ( NE | NP | PP | VG | W | SC )* >
<!ELEMENT SC (NE | NP | PP | VG | W )+ >
<!ELEMENT PP ( NE | W )* >
<!ATTLIST PP TYPE NMTOKEN #FIXED "2" >
<!ELEMENT SPPC_XML ( PARAGRAPH+ ) >
<!ELEMENT VG ( W+ ) >
<!ATTLIST VG TYPE NMTOKEN #FIXED "3" >
<!ELEMENT W ( #PCDATA ) >
<!ATTLIST W COMP CDATA #IMPLIED >
<!ATTLIST W INFL CDATA #IMPLIED >
<!ATTLIST W POS CDATA #IMPLIED >
<!ATTLIST W STEM CDATA #IMPLIED >
<!ATTLIST W TC NMTOKEN #REQUIRED >

² This DTD, the SPPC DTD, has been designed for the SPPC system (see [7]), whose results are further processed by SCHUG.

This simple DTD just states that the basic linguistic analysis of a document will deliver a tree consisting of an arbitrary number of paragraphs, each containing an arbitrary combination of single words (W), nominal and prepositional phrases (NP, PP), verb groups (VG, being a list of verbs), Named Entities (NE, being persons, companies, date and time expressions etc.) and subclauses (SC, being defined for W, NE, NP, PP and VG). NP, PP, NE and VG contain at least one word. The word element is associated with a list of attributes: COMP (result of compound analysis), INFL (information about the inflectional properties of the word), STEM (the lemma of the word) and POS (the syntactic category of the word). Examples are given below. For the time being the SC element is defined quite loosely and does not state that the subclause should consist of at least one word (subordination or coordination word) and one verb group. This condition, which is valid for German, might be too specific at this place. This DTD can be extended for the purpose of more detailed linguistic analyses or for specialized applications.

The internal grammar machinery of SCHUG first maps the XML structure of the available shallow linguistic analysis onto a generic feature structure, which reflects the original XML tree annotation of the document. Appropriate rules (defined by regular patterns over annotations) can then be activated within the (shallow) unification formalism used in SCHUG for the further processing of the linguistic data. An advantage of this strategy is that it allows us on the one hand to use well-defined unification and subsumption operations on the linguistic data, and on the other hand to use the unification algorithm for integrating available non-linguistic information, which can be put in relation with the linguistically annotated terms. The feature structure is internally realized as a hash table, which also offers the advantage of efficient random access.
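As an illustration of the mapping just described, the following Python sketch loads one chunk into a nested hash table and unifies it with an external record. It is a minimal sketch under stated assumptions, not SCHUG's implementation: the function names `chunk_to_feature_structure` and `unify`, the value `EXAMPLE_DOMAIN`, and the treatment of atomic values (plain equality, no variables or re-entrancy) are all introduced here for the example. The PP fragment is the SPPC encoding of “Fuer die Angaben” used as the example in this paper.

```python
import xml.etree.ElementTree as ET

# SPPC-style encoding of the PP "Fuer die Angaben" (the paper's example).
SPPC_FRAGMENT = """
<PP TYPE="2">
  <W TC="22" POS="23" STEM="fuer" INFL="[102]">Fuer</W>
  <W TC="21" POS="7" STEM="d-det" INFL="[2 5 20 6 13 23 9 16]">die</W>
  <W TC="22" POS="1" STEM="angabe" INFL="[6 7 8 9]">Angaben</W>
</PP>
"""


def chunk_to_feature_structure(xml_text):
    """Map one chunk element onto a nested dict (a hash table)."""
    chunk = ET.fromstring(xml_text)
    items = []
    for word in chunk.findall("W"):
        # Each word keeps its string and its XML attributes as features.
        items.append({"STRG": word.text, "features": dict(word.attrib)})
    return {"TAG": chunk.tag, "TYPE": chunk.get("TYPE"), "sub_frags": items}


def unify(a, b):
    """Unify two feature structures; return None if two values clash.

    Simplification (assumption): atomic values unify only if equal.
    """
    result = dict(a)
    for key, value in b.items():
        if key not in result:
            result[key] = value              # feature only present in b
        elif isinstance(result[key], dict) and isinstance(value, dict):
            merged = unify(result[key], value)
            if merged is None:
                return None
            result[key] = merged
        elif result[key] != value:
            return None                      # incompatible atomic values
    return result


fs = chunk_to_feature_structure(SPPC_FRAGMENT)
# Hypothetical non-linguistic record to be attached to the chunk.
enriched = unify(fs, {"TAG": "PP", "DOM": "EXAMPLE_DOMAIN"})
clash = unify(fs, {"TAG": "NP"})
```

Here `enriched` carries the added `DOM` feature next to the linguistic ones, while `clash` is `None` because the `TAG` values PP and NP do not unify. Access to any sub-structure, such as `fs["sub_frags"][2]["features"]["STEM"]`, is a direct hash lookup.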
For German texts the basic XML structure is delivered by the SPPC (Shallow Processing Production Center) system, which performs tokenisation, morphological analysis, POS tagging, Named Entity detection and analysis of basic chunks (see [7]). Chunks can be defined as the non-recursive components of basic phrases, like NPs or PPs.³ An example of this mapping is given below, where the XML encoding of a PP (“Fuer die Angaben”, “for the data”) as delivered by the SPPC system is mapped into a feature structure of SCHUG⁴.

³ For more details on chunk parsing, see [1].
⁴ For reasons of processing efficiency, some values are encoded as a number, so for example the POS “Prep” is encoded as “23”, and inflectional properties of the words are encoded as lists of numbers, each representing an instantiated feature structure over relevant morphological properties, like GENDER, NUMBER, CASE. We do not go into more details here.

<PARAGRAPH>
<S>
 <PP TYPE="2">
  <W TC="22" POS="23" STEM="fuer"
     INFL="[102]">Fuer</W>
  <W TC="21" POS="7" STEM="d-det"
     INFL="[2 5 20 6 13 23 9 16]">die</W>
  <W TC="22" POS="1" STEM="angabe"
     INFL="[6 7 8 9]">Angaben</W>
 </PP>
...
                The SPPC XML structure

para[0] => {
  frag[0] => {
    TAG = PP
    SENT = BOS
    sub_frags => {
      item[0] => {
        STRG = Fuer
        features => {
          INFL = [102]
          STEM = fuer
          POS = 23
          TC = 22
        }
      }
      item[1] => {
        STRG = die
        features => {
          INFL = [2 5 20 6 13 23 9 16]
          STEM = d-det
          POS = 7
          TC = 21
        }
      }
      item[2] => {
        STRG = Angaben
        features => {
          INFL = [6 7 8 9]
          STEM = angabe
          POS = 1
          TC = 22
        }
      }
    }
                The SCHUG feature structure

On the basis of this feature structure SCHUG then applies, in a cascaded manner and only where needed, various natural language operations: POS tagging, named entity recognition, chunking, detection of Grammatical Functions and reference resolution for free text. We adopt here the general model of cascaded chunk processing, as it was defined by [2], proposing solutions at the levels of processing where enough information is available for generating correct linguistic structures. Furthermore we also include a “retagging” procedure: the XML results of some systems we want to integrate sometimes contain (regular) errors, which can be corrected by SCHUG, thus avoiding the cumulative propagation of wrong linguistic annotations. The retagging procedure also enriches the annotations provided by the underlying systems. So in the case of the SPPC example given above, SCHUG enriches the analysis with missing information, for example the dependency structure or the result of an agreement check:

para[0] => {
  frag[0] => {
    NP_HEAD = Angaben
    TAG = PP
    SENT = BOS
    PP_AGR = [102]
    STRUK = 23_7_1
    NP_SPEC = die
    STRING = Fuer die Angaben
    NP_AGR = [6 9]
    PP_NP_AGR = [9]
    TYPE = 2
    NP_SPEC_AGR = [6 9]
    PP_HEAD = Fuer
    sub_frags => {
      item[0] => {
        STRG = Fuer
        features => {
          INFL = [102]
          STEM = fuer
          POS = 23
          TC = 22
        }
      }
    }
...
                The enriched SCHUG feature structure

In the enriched feature structure above, the reader can see that SCHUG has added to the mother node of the PP constituent information about the so-called “head complement” and “head modifier” structure, thus introducing a dependency structure into the shallow analysis. The head of the PP is the preposition “Fuer”, whereas the head of the NP complement is “Angaben”. An agreement check has also been performed, and the results are given in additional features. These additional agreement features are important because they support, in further processing steps, the detection of Grammatical Functions (Subject, direct or indirect Object etc.) and the resolution of references. It should be noted that the detection of grammatical functions is a very important step towards the attachment of semantic or extra-linguistic information to texts. It offers some guidance in deciding whether information should be attached at the place where certain terms occur in texts: one might decide to attach external information only if the terms are in the subject position of a sentence, or if the sentence is not in the passive voice, etc. The resolution of references (pronominal, anaphoric, elliptical) is also important since it gives more evidence for integrating non-linguistic information: if an anaphor like “she” or “he” can be resolved to a referential expression, the system will get more evidence that the document is about a specific topic.

SCHUG currently processes two languages, German and Spanish, where the use of Spanish is for the time being limited to the base chunks NP, PP and verb groups.

At the end of the processing SCHUG delivers all the resulting information again in XML, thus providing an increased amount of annotations for the original documents.
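The agreement features of the enriched PP example can be illustrated with a short Python sketch. It assumes that the agreement check keeps the inflection codes shared by determiner and noun, which is consistent with the values in the example (NP_AGR = [6 9]) but is not stated as SCHUG's actual procedure; the helper names `parse_codes`, `agree` and `format_codes` are hypothetical. The value PP_NP_AGR = [9] additionally depends on how the preposition's code [102] restricts case, which the paper does not spell out, so it is not reproduced here.

```python
def parse_codes(infl):
    """Turn an INFL string such as "[6 7 8 9]" into a list of integer codes."""
    return [int(code) for code in infl.strip("[]").split()]


def agree(*code_lists):
    """Keep the codes present in every list, in the order of the first list."""
    first, *rest = code_lists
    return [code for code in first if all(code in other for other in rest)]


def format_codes(codes):
    """Render a code list in the bracketed notation of the feature structure."""
    return "[" + " ".join(str(code) for code in codes) + "]"


det = parse_codes("[2 5 20 6 13 23 9 16]")   # INFL of "die"
noun = parse_codes("[6 7 8 9]")              # INFL of "Angaben"

np_agr = agree(det, noun)

# Features added to the mother node (a subset of the example's features).
frag = {
    "NP_SPEC": "die",
    "NP_HEAD": "Angaben",
    "NP_AGR": format_codes(np_agr),
}
```

With the two INFL lists of the example, `np_agr` contains the codes 6 and 9, matching the NP_AGR and NP_SPEC_AGR values of the enriched structure. An empty result would signal that determiner and noun share no reading and that the chunk should be rejected or retagged.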
3    The integration of external information in the
     textual documents

At the various levels of linguistic processing (cascades), or at the end of this process, external non-linguistic information or code can be added to (unified with) the linguistic description, thus supporting a scalable integration of disparate information sources (e.g. domain ontologies, multimedia material or program code for automatic hyperlinking) into the Natural Language Processing chain. The well-known procedures acting on feature structures, unification and subsumption, allow a descriptive mapping between (for example) domain ontologies and the results of NL processing.⁵ The resulting feature structure is mapped back into an (enriched) XML structure and is thus available for further processing. Some of the added annotations can be used as a “semantic” index for content-based search. Alternatively, one can add the nodes (or some local paths) of the ontology that have been detected as relevant for the text to the Metadata list associated with the document, thus extending the core Metadata to a contentful one, which can easily be scanned by search engines, thereby facilitating the constitution of the Semantic Web. So NL processing guides the detection and presentation of additional and associated information and knowledge, which might be available at other places in a net of information, and presents it in an XML structure. For example, once an occurrence of a proper noun is found in a document, a search can be started within other documents (structured or not) to extract relevant information about the entity referred to by the proper noun and present it in a structured way to the reader. The technology responsible for this is often called automatic hyperlinking and is central in the context of document enrichment. This technology also helps to incrementally create specialized databases on entities or events. One and the same document can be enriched (annotated) with different types of annotations, depending for example on the underlying terminology, thesaurus etc.

⁵ The mapping procedure between linguistic features and non-linguistic knowledge can very probably be executed within the sole frame of XML and associated semantic representational languages, but the use of feature structures for our purposes has shown various advantages, namely the higher level of declarativity for the description of mapping (unification) rules and the higher efficiency (random access) in accessing sub-structures of the results of the linguistic analysis.

The integration of (domain-specific) knowledge during the NL processing can improve the results of the linguistic analysis, since decisions about syntactic disambiguation and attachment of linguistic chunks can in certain cases be supported by non-linguistic information.

4    An example of an application: the MUMIS
     project

The design and the ongoing implementation of SCHUG were initially undertaken to support the information extraction (IE) task in the context of the EU project MUMIS, dedicated to the indexing of multimedia material⁶.

⁶ MUMIS is an on-going EU-funded project within the Information Society Program (IST, number 1999-10651) of the European Union, section Human Language Technology (HLT). For more information see http://parlevink.cs.utwente.nl/projects/mumis/

MUMIS develops and integrates basic technologies for the automatic indexing of multimedia programme material. The domain of application is soccer. Various technology components operating offline generate formal annotations of events in the data material processed. These formal annotations (in XML) constitute the basis for the integral online part of the MUMIS project, consisting of a user interface allowing the querying of videos. The indexing of the video material with relevant events is done along the line of time codes extracted from the various documents.

For this purpose the project makes use of data from different media sources (textual documents, radio and television broadcasts) to build a specialized set of lexicons and an ontology for the selected domain (soccer). All are available in XML and are integrated into the IE processing components. The project also digitizes non-text data and applies speech recognition techniques to extract text for the purpose of annotation.

The core linguistic processing for the annotation of the multimedia material consists of advanced information extraction techniques for identifying, collecting and normalizing significant text elements (such as the names of players in a team, goals scored, time points or sequences etc.) which are critical for the appropriate annotation of the multimedia material in the case of soccer.

Because the project accesses and processes distinct media in distinct languages, there is a need for a novel type of merging tool to combine the semantically related annotations generated from those different data sources, and to detect inconsistencies and/or redundancies within the combined annotations. The merged annotations (in XML) are stored in a database, where they are combined with relevant metadata.

We are currently investigating how domain-specific annotations, gained on the basis of the merging of linguistic and domain-specific knowledge, can be included in the MPEG-7 standard, using for this the slot foreseen for “Textual Annotation”. The main issue of this investigation will be to check to what extent textual annotations can be combined with low-level video features in order to achieve better content indexing (and searching) of video material.

5   Integration of various types of documents for an
    incremental IE

As we have seen above, MUMIS makes use of various types of sources for the generation of content annotations. MUMIS also distinguishes between the textual documents it consults, and applies different processing techniques depending on the type of textual document:

1. Reports from newspapers (reports about specific games, general reports), which are classified as free texts
2. Tickers, closed captions, Action-Databases, which are classified as semi-formal texts
3. Formal descriptions about specific games, which are classified as formal texts

Since the information contained in formal texts can be considered as a database of true facts, these texts play an important role within MUMIS. Nevertheless they contain only little information about a game: the goals, the substitutions and a few other events (penalties, yellow and red cards). So there are only a few time points available for indexing videos. Semi-formal texts, like live tickers on the web, offer many more sequences of time points, related to a higher diversity of events (goal scenes, fouls etc.), and seem to offer the best textual source for our purposes. Nevertheless the quality of the texts of online tickers is often quite poor. Free texts, like newspaper articles, are of high quality, but the extraction of time points and their associated events in text is more difficult. Those texts also offer more background information which might be interesting for the users (age of the players, the clubs they are normally playing for,
etc.). Figures 1 and 2 show examples of 2 (German) formal texts on               <TEAM>
one and the same game, and 4 gives an example of a semi-formal text                <NAME>Deutschland</NAME>
                                                                                     <TRAINER>
on the same game.                                                                      <TEAM_FUNCTION>#Trainer</TEAM_FUNCTION>
                                                                                       <TRAINER_NAME>#Ribbeck</TRAINER_NAME>
England - Deutschland 1:0 (0:0)                                                      </TRAINER>
England: Seaman (2,5) - G. Neville (3,5), Keown (3), Campbell (2), P. Neville        <PLAYERS>
(4,5) - Ince (3,5), Wise (5) - Beckham (4), Scholes (3) - Shearer (3), Owen            <PLAYER>
(5) - Trainer: Keegan                                                                  <PLAYER_NAME>Kahn</PLAYER_NAME>
Deutschland: Kahn (2) - Matthaeus (3) - Babbel (3,5), Nowotny (2,5) - Deisler          <PLAYER_NOTE>#(2)</PLAYER_NOTE>
(3), Hamann (2,5), Jeremies (3,5), Ziege (3,5) - Scholl (5) - Jancker (4),             <PLAYER_POSITION>1</PLAYER_POSITION>
Kirsten (5) - Trainer: Ribbeck                                                         <PLAYER_NUMBER>##1</PLAYER_NUMBER>
Eingewechselt: 61. Gerrard fuer Owen, 72. Barmby fuer Scholes - 70. Rink               <PLAYER_OLD>##31</PLAYER_OLD>
fuer Kirsten, 72. Ballack fuer Deisler, 78. Bode fuer Jeremies                         <PLAYER_CLUB>##Bayern Muenchen</PLAYER_CLUB>
Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)                                   <PLAYER_NO_PLAYS>##26</PLAYER_NO_PLAYS>
Schiedsrichter: Collina, Pierluigi (Viareggio), Note 2 - bis auf eine falsche
Abseits-Entscheidung souveraen und sicher                                        ...
Zuschauer: 30000 (ausverkauft)
Gelbe Karten: Beckham - Babbel, Jeremies                                         <REFEREE_INFORMATION>
                                                                                    <REFEREE_NAME>#Collina, Pierluigi
                                                                                         ##Collina</REFEREE_NAME>
Figure 1. Example of a so-called formal text, where one can see that only           <REFEREE_ORIGIN>#Viareggio
 5 distinct time points can be extracted, concerning the player subsitutions             ##Italien</REFEREE_ORIGIN>
                  (“Eingewechselt”) and one goal (“Tore”).                          <REFEREE_NOTE>#2</REFEREE_NOTE>
                                                                                    ...
                                                                                    </REFEREE_INFORMATION>

Aufstellungen:
England: 1 Seaman (Arsenal London/36 Jahre/59 Laenderspiele) - 2 Gary
Neville (Manchester United/25/38), 6 Keown (Arsenal London/33/32),                Figure 3. Merged annotations generated from formal texts. Information
4 Campbell (Tottenham Hotspur/25/35), 3 Phil Neville (Manchester                  extracted from the first text is marked with “#”, from the second text with
United/23/28) - 7 Beckham (Manchester United/25/33), 14 Ince (FC Middles-        “##”. No special marker is provided if both texts give the same information.
brough/32/52), 8 Scholes (Manchester United/25/26), 17 Wise (FC Chelsea
London/33/18) - 9 Shearer (Newcastle United/29/62), 10 Owen (FC Liver-
pool/21/21) Deutschland: 1 Kahn (Bayern Muenchen/31 Jahre/26) - 2 Babbel         semi-automatically developed within the MUMIS consortium. In this
(Bayern Muenchen/27/51), 10 Matthaeus (New York Metro Stars/39/149),
6 Nowotny (Bayer Leverkusen/26/21) - 18 Deisler (Hertha BSC/20/5), 14
                                                                                 thesaurus terms in three distinct languages, Dutch, English are Ger-
Hamann (FC Liverpool/26/26), 16 Jeremies (Bayern Muenchen/26/26), 17             man are put in relation with soccer concepts. So “flankt” is put into
Ziege (FC Middlesbrough/28/52) - 7 Scholl (Bayern Muenchen/29/28) - 19           relation with the concept “cross”. With the help of those document
Jancker (Bayern Muenchen/25/8), 9 Kirsten (Bayer Leverkusen/34/50/49 fuer        external information, partially dynamically generated, the line start-
die DDR) Schiedsrichter: Collina (Italien)
                                                                                 ing with the time code “16.” in figure 4, for example, can be success-
                                                                                 fully analysed and following event annotations can be generated:
Figure 2. A second example of a so-called formal text, where one can see
   that different informatin providers give distinct information: here for       2-event_1_PLAYER = Ziege
            example the number of games for the national team.                   1-event_LOC = Goal-line::Goal-area
                                                                                 1-event_1_PLAYER = Scholes
                                                                                 3-event_EVENT_CLASS = goal_scene_fail
                                                                                 3-event_TYPE = Save
   Since the formal texts require only few linguistic analysis, but              3-event_TIME = 16:00
rather an accurate domain-specific interpretation of the symbols                 2-event_TIME = 16:00
used, a module has been defined within SCHUG, which in a first                   1-event_TIME = 16:00
                                                                                 1-event_TYPE = Cross
step maps the formal texts onto an XML annotation^7, giving the domain
semantics of the expressions in the text. In a second step SCHUG merges
all the XML-annotated formal texts about one game. Figure 3 shows a part
of such merged annotations:
   Those merged annotations are generated at a level that requires only
little linguistic analysis, and basically reflect domain-specific
information about the actors and events involved in the text. The SCHUG
module applied at this level also extracts metadata: name of the game,
date and time of the game, intermediate and final scores, etc. This is
quite important, since the metadata will guide the use of the annotations
produced so far in supporting linguistic analysis and Information
Extraction applied to more complex documents, like the ticker shown in
Figure 4. Let us take as an example the line beginning with the time code
“16.”. The word “Ziege” can be interpreted as a soccer player on the
basis of the available annotations generated from the formal texts.
Without these, the default reading (goat) would have been selected. The
other soccer terms like “flankt”, “Freistoss” etc. are interpreted on the
basis of a multilingual soccer thesaurus

   DOM = SOCCER

   But also the already available information about the player “Ziege”
(or about the player “Scholes”) is made available at this level, mixed
with linguistic information:

   OLD = ##28
   TAG = NP
   NP_AGR = [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 .. 23]
   NOTE = #(3,5)
   NP_STRUK = 25
   NP_STRING = Ziege
   OBJ_AGR = [1 3 4 5 7 8 9 11 12 14 15 16 18 19 21 22 23]
   POS = #4 ##3
   NP_HEAD = Ziege
   PLAYER = Ziege
   GF = SUBJ/DAT_OBJ/AKK_OBJ/NP_MOD_GEN
   SUBJ_AGR = [2 10 17]
   NUMBER = ##17
   NR_PLAYS = ##52
   TEAM = Deutschland
   CLUB = ##FC Middlesbrough
   TYPE = NP_PLAYER
   DOM = SOCCER

 ^7 Following a DTD resulting from the analysis of all available formal
    texts in our soccer corpus.
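
The effect of the “Ziege” entry on the reading of the word can be
illustrated with a minimal Python sketch. It is not part of SCHUG: the
class PlayerEntry, the function interpret and the one-word
DEFAULT_READINGS lexicon are hypothetical names chosen for illustration.
Only the attribute values (age 28, Deutschland, FC Middlesbrough) are
taken from the entry for “Ziege”.

```python
from dataclasses import dataclass, field


@dataclass
class PlayerEntry:
    """Domain knowledge about one player, as merged from the formal texts."""
    name: str
    age: int
    team: str
    club: str
    events: list = field(default_factory=list)  # events added from later texts


# Hypothetical stand-in for the general-language lexicon (an assumption).
DEFAULT_READINGS = {"Ziege": "goat"}


def interpret(token, players):
    """Prefer the domain reading of a token over its default reading."""
    entry = players.get(token)
    if entry is not None:
        return ("NP_PLAYER", entry)
    return ("DEFAULT", DEFAULT_READINGS.get(token))


# Annotations assumed to have been generated for the current game.
players = {
    "Ziege": PlayerEntry("Ziege", 28, "Deutschland", "FC Middlesbrough"),
}

tag, value = interpret("Ziege", players)
print(tag, value.team)   # NP_PLAYER Deutschland

tag, value = interpret("Ziege", {})
print(tag, value)        # DEFAULT goat
```

Consulting the game-specific entries before the general lexicon is one
simple way to let the domain reading win. The general lexicon itself
stays unchanged when another game is processed.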
This basic information can also be very useful for reference resolution.
For example, if a sentence contains “The 28 year old midfield player of
Middlesbrough ..”, SCHUG can consult the dynamically generated
annotations and then point to “Ziege”. SCHUG currently also adds to the
“Ziege entry” all the events it detects in the semi-formal texts. The
updated set of annotations will be of use for the subsequent analysis of
free texts.
   All the generated (XML) annotations on events, with the information
available about the actors involved, are passed to a MUMIS module in
charge of integrating text annotations and the video stream of the game,
so that this video can be queried on the basis of such events and actors,
which are also put into relation. The MUMIS searching environment allows
queries of the form: “Give me all the goal scenes in the second half of
the game, if Ziege is involved.”

Gruppe A: England - Deutschland 1:0 (0:0)
7. Ein Freistoss von Christian Ziege aus 25 Metern geht ueber das Tor.
12. Ziege flankt per Freistoss in den Strafraum und Jeremies versucht es
per Kofball, verfehlt den Kasten jedoch deutlich.
16. Scholes flankt gefaehrlich von der Torauslinie in den
Fuenfmeterraum, doch Ziege hat aufgepasst und kann klaeren.
18. Hamann versucht es mit einem Distanzschuss aus 20 Metern, aber
Seaman ist auf dem Posten.
23. Scholl mit einer Riesenchance: Nach Zuspiel von Hamann rennt er in
den englischen Strafraum, wird jedoch gleich von drei Seiten bedraengt
und kommt nur zu einem unplazierten Schuss, den Seaman sicher abfangen
kann.
27. Jancker spielt auf Ziege, dessen Schuss von der Strafraumgrenze kann
von Seaman abgefangen werden.
35. Michael Owen kommt nach Flanke von Philip Neville voellig frei vor
dem deutschen Tor zum Kopfball, doch Kahn kann zum ersten Mal sein
Koennen unter Beweis stellen und rettet auf der Linie.
43. Kahn zum zweiten: Beckham flankt auf Scholes, der zieht ab in den
rechten Winkel, aber der deutsche Keeper verhindert erneut die englische
Fuehrung.
47. Christian Zieges Freistoss aus 20 Metern geht einen halben Meter
ueber das Tor.
53. Beckham flankt per Freistoss an der deutschen Abwehr vorbei auf den
Kopf von Alan Shearer, der voellig freistehend zum 1:0 fuer die
Englaender verwandelt.
58. Scholl wird von Matthaeus bedient, aber sein Schuss geht aus
halbrechter Position um Zentimeter am Tor vorbei.
65. Seaman kann nach einem Eckball vor Kirsten klaeren, der Nachschuss
von Jancker geht knapp am Tor vorbei. Riesenmoeglichkeit fuer die
DFB-Elf.

Figure 4. Example of a so-called semi-formal text. More time points are
available here, and they can be complementary to the time points
extracted from formal texts. So, already at this level, a unification or
merging of the extracted times can be done.


6    CONCLUSION
We have shown that (shallow) multilingual linguistic procedures can be
very helpful for a whole range of IT applications, since they support the
integration of various sources of information and the Knowledge Markup
of textual documents. Within the SCHUG system it is possible to associate
non-linguistic information at various levels of linguistic analysis, as
required by the application under consideration. The XML representation
has proven to be an easy and useful means of communication between
disparate sources of information. The SCHUG tools can capture related
knowledge for a document on the basis of robust but accurate NLP and of
the ontology-driven IE supported by the system. This knowledge is
visualized to the reader via the automatic hyperlinking feature included
in SCHUG, which also allows the documents to be semantically annotated
and makes the underlying conceptual structure visible at any place in the
documents.
   We will in the future have to look at how to integrate our approach
into a general XML architecture or Knowledge Markup editing tools.


ACKNOWLEDGEMENTS
The major part of the work reported here has been done within the context
of the EU-funded project MUMIS (IST-1999-10651). We would also like to
thank Mireia Farrus for her help on the Spanish grammar, and Claudia
Crispi and Mihalea Hutanu for their work on the SCHUG platform.


REFERENCES
[1] Steven Abney, ‘Parsing by chunks’, in Principle-Based Parsing, eds.,
    Steven Abney, Robert Berwick and Carol Tenny, Kluwer Academic
    Publishers, (1991).
[2] Steven Abney, ‘Partial parsing via finite-state cascades’, in
    Workshop on Robust Parsing, 8th European Summer School in Logic,
    Language and Information (ESSLLI), (1996).
[3] Doug E. Appelt, ‘An introduction to information extraction’, AI
    Communications, 12, (1999).
[4] Thierry Declerck and P. Wittenburg, ‘Mumis – a multimedia indexing
    and searching environment’, in Proceedings of the 1st International
    Workshop on MultiMedia Annotation, MMA-2001, Tokyo, (2001).
[5] ISO/IEC JTC1/SC29/WG11. Mpeg-7 overview.
    http://mpeg.telecomitalialab.com/standards/mpeg-7/mpeg-7.htm.
[6] MUC, ed. Seventh Message Understanding Conference (MUC-7),
    http://www.muc.saic.com/, 1998. SAIC Information Extraction.
[7] Jakub Piskorski and G. Neumann, ‘An intelligent text extraction and
    navigation system’, in Proceedings of the 6th Conference on Recherche
    d’Information Assistée par Ordinateur, RIAO-2000, (2000).