A set of tools for integrating linguistic and non-linguistic information

Thierry Declerck¹

¹ DFKI GmbH, 66123 Saarbruecken, Germany

Abstract. In this position paper we describe the current state of the development of an integrated set of tools (called SCHUG) for language processing supporting interaction with disparate sources of information, thus making Natural Language Processing (NLP) and Human Language Technology (HLT) even more relevant for Information Technology (IT) applications. The set of tools realizes the communication with non-language-based devices and services via machine-readable XML annotations. Non-linguistic information, in most cases domain-specific knowledge, can thus be straightforwardly included in the linguistically analysed texts, and so contribute to a knowledge markup of textual documents. The basic language technology guiding this markup is Information Extraction (IE), and the added information can be made visible by means of automatic hyperlinking and visualization techniques.

1 Introduction

In this paper we describe the current state of the development of an integrated set of tools (called SCHUG) for language processing supporting the interaction with various sources of information, thus making Natural Language Processing (NLP) and Human Language Technology (HLT) even more relevant for Information Technology (IT) applications. The set of tools realizes the communication with non-language-based resources, devices and services, and their integration into textual documents, via machine-readable XML annotations and protocols, the standard underlying all Web services. It is thus important that all information-providing devices deliver XML output, or at least have an output format that can easily be transformed into XML. Ontologies providing a hierarchical description of domain-specific knowledge are very good candidates for interacting with natural language processing tools, as Information Extraction tasks have already shown, since ontologies can easily be described in XML-based representation languages and mapped onto XML-encoded results of linguistic analyses.

2 The Chunk Parser used: SCHUG

SCHUG (Shallow and CHunk-based Unification Grammar tools) has been designed in such a way that it can read results from various language processing tools (at any level of NL processing up to the detection of Grammatical Functions) and transform them into an XML document conforming to our basic (shallow) linguistic DTD², which is shown below:

    [DTD listing not preserved in this copy]

² This DTD, the SPPC DTD, has been designed for the SPPC system (see [7]), whose results are further processed by SCHUG.

This simple DTD just states that the basic linguistic analysis of a document will deliver a tree consisting of an arbitrary number of paragraphs, each containing an arbitrary combination of single words (W), nominal and prepositional phrases (NP, PP), verb groups (VG, a list of verbs), Named Entities (NE: persons, companies, date and time expressions etc.) and subclauses (SC, defined over W, NE, NP, PP and VG). NP, PP, NE and VG contain at least one word. The word element is associated with a list of attributes: COMP (result of compound analysis), INFL (information about the inflectional properties of the word), STEM (the lemma of the word) and POS (the syntactic category of the word). Examples are given below. For the time being the SC element is defined quite loosely and does not state that the subclause should consist of at least one word (a subordinating or coordinating word) and one verb group. This condition, which is valid for German, might be too specific at this place. The DTD can be extended for the purpose of more detailed linguistic analyses or for specialized applications.

The internal grammar machinery of SCHUG first maps the XML structure of the available shallow linguistic analysis onto a generic feature structure, which reflects the original XML tree annotation of the document. Appropriate rules (defined by regular patterns over annotations) can then be activated within the (shallow) unification formalism used in SCHUG for the further processing of the linguistic data. An advantage of this strategy is that it allows us, on the one hand, to use well-defined unification and subsumption operations on the linguistic data and, on the other hand, to use the unification algorithm for integrating available non-linguistic information, which can be related to the linguistically annotated terms. The feature structure is internally realized as a hash table, which also offers the advantage of efficient random access.

For German texts the basic XML structure is delivered by the SPPC (Shallow Processing Production Center) system, which performs tokenisation, morphological analysis, POS tagging, Named Entity detection and analysis of basic chunks (see [7]). Chunks can be defined as the non-recursive components of basic phrases, like NPs or PPs.³ An example of this mapping is given below, where the XML encoding of a PP ("Fuer die Angaben", i.e. "for the data") as delivered by the SPPC system is mapped into a feature structure of SCHUG.⁴

³ For more details on chunk parsing, see [1].

⁴ For reasons of processing efficiency, some values are encoded as a number, so for example the POS "Prep" is encoded as '23', and inflectional properties of the words are encoded as lists of numbers, each representing an instantiated feature structure over relevant morphological properties, like GENDER, NUMBER, CASE. We don't go into more details here.

    Fuer die Angaben ...
    [XML markup not preserved in this copy; only the token text survives]

The SPPC XML structure

    para[0] => {
      frag[0] => {
        TAG = PP
        SENT = BOS
        sub_frags => {
          item[0] => {
            STRG = Fuer
            features => {
              INFL = [102]
              STEM = fuer
              POS = 23
              TC = 22
            }
          }
          item[1] => {
            STRG = die
            features => {
              INFL = [2 5 20 6 13 23 9 16]
              STEM = d-det
              POS = 7
              TC = 21
            }
          }
          item[2] => {
            STRG = Angaben
            features => {
              INFL = [6 7 8 9]
              STEM = angabe
              POS = 1
              TC = 22
            }
          }
        }

The SCHUG feature structure

On the basis of this feature structure SCHUG then applies, in a cascaded manner and only where needed, various Natural Language operations: POS tagging, named entity recognition, chunking, detection of Grammatical Functions and reference resolution for free text. We adopt here the general model of cascaded chunk processing, as it was defined by [2], proposing solutions at the levels of processing where enough information is available for generating correct linguistic structures. Furthermore we also include a "retagging" procedure: the XML results of some systems we want to integrate sometimes contain (regular) errors, which can be corrected by SCHUG, thus avoiding the cumulative propagation of wrong linguistic annotations. The retagging procedure also enriches the annotations provided by the underlying systems. So in the case of the SPPC example given above, SCHUG enriches the analysis with missing information, for example the dependency structure or the result of an agreement check:

    para[0] => {
      frag[0] => {
        NP_HEAD = Angaben
        SENT = BOS
        STRUK = 23_7_1
        STRING = Fuer die Angaben
        NP_AGR = [6 9]
        PP_NP_AGR = [9]
        TYPE = 2
        NP_SPEC_AGR = [6 9]
        PP_HEAD = Fuer
        sub_frags => {
          item[0] => {
            STRG = Fuer
            features => {
              INFL = [102]
              STEM = fuer
              POS = 23
              TC = 22
            }
          }
        }
        ...

The enriched SCHUG feature structure

In the enriched feature structure above, the reader can see that SCHUG has added to the mother node of the PP constituent information about the so-called "head complement" and "head modifier" structure, thus introducing a dependency structure into the shallow analysis. The head of the PP is the preposition "Fuer", whereas the head of the NP complement is "Angaben". An agreement check has also been performed, and the results are given in additional features. These additional agreement features are important because, in later processing steps, they support the detection of Grammatical Functions (Subject, direct or indirect Object etc.) and the resolution of references. It should be noted that the detection of grammatical functions is a very important step towards the attachment of semantic or extra-linguistic information to texts. It offers some guidance in deciding whether some information should be attached at the place where certain terms occur in texts: one might decide to attach external information only if the terms are in the subject position of a sentence, or if the sentence is not in the passive voice, etc. The resolution of references (pronominal, anaphoric, ellipses) is also important since it gives more evidence for integrating non-linguistic information: in case an anaphor like "she" or "he" can be resolved to a referential expression, the system will get more evidence that the document is about a specific topic.

SCHUG currently processes two languages, German and Spanish, where the processing of Spanish is for the time being limited to the base chunks NP, PP and verb groups.

At the end of the processing SCHUG delivers all the resulting information again in XML, thus providing an increased amount of annotations for the original documents.
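As an illustration of the enrichment step described in this section, the following Python sketch rebuilds some of the features that SCHUG adds to the PP "Fuer die Angaben". It is not SCHUG code. It assumes that the feature structure is held as nested dictionaries (hash tables), that POS codes 7 and 1 stand for determiner and noun (only 23 = "Prep" is stated in the text; the other two are read off the example), and that the NP agreement check amounts to intersecting the INFL code lists of determiner and noun. The function name `enrich_pp` is hypothetical. Under these assumptions the sketch reproduces the STRUK, STRING, PP_HEAD, NP_HEAD and NP_AGR values shown above; PP_NP_AGR is not modelled, since the text does not say how it is computed.

```python
# Feature structure for "Fuer die Angaben", copied from the example above
# and stored as nested dicts (hash tables).
pp = {
    "TAG": "PP",
    "SENT": "BOS",
    "sub_frags": [
        {"STRG": "Fuer",
         "features": {"INFL": [102], "STEM": "fuer", "POS": 23, "TC": 22}},
        {"STRG": "die",
         "features": {"INFL": [2, 5, 20, 6, 13, 23, 9, 16], "STEM": "d-det",
                      "POS": 7, "TC": 21}},
        {"STRG": "Angaben",
         "features": {"INFL": [6, 7, 8, 9], "STEM": "angabe", "POS": 1, "TC": 22}},
    ],
}

# 23 = "Prep" is given in the paper; 7 = determiner and 1 = noun are
# assumptions inferred from the example.
PREP, DET, NOUN = 23, 7, 1


def enrich_pp(frag):
    """Return a copy of `frag` with head and agreement features added.

    Assumption: NP agreement is the intersection of the INFL code lists
    of the determiner and the noun. If several items share a POS code,
    the last one wins, which is good enough for this three-word chunk.
    """
    items = frag["sub_frags"]
    enriched = dict(frag)  # shallow copy; the input dict is not modified
    enriched["STRING"] = " ".join(item["STRG"] for item in items)
    enriched["STRUK"] = "_".join(str(item["features"]["POS"]) for item in items)

    by_pos = {item["features"]["POS"]: item for item in items}
    if PREP in by_pos:
        enriched["PP_HEAD"] = by_pos[PREP]["STRG"]
    if NOUN in by_pos:
        enriched["NP_HEAD"] = by_pos[NOUN]["STRG"]
    if DET in by_pos and NOUN in by_pos:
        shared = (set(by_pos[DET]["features"]["INFL"])
                  & set(by_pos[NOUN]["features"]["INFL"]))
        enriched["NP_AGR"] = sorted(shared)
    return enriched


enriched = enrich_pp(pp)
print(enriched["PP_HEAD"], enriched["NP_HEAD"], enriched["STRUK"], enriched["NP_AGR"])
```

Keeping the structure in a hash table means each added feature is a constant-time insertion at the mother node, which is in line with the random-access advantage mentioned above.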
3 The integration of external information in the textual documents

At the various levels of linguistic processing (cascades), or at the end of this process, external non-linguistic information or code can be added to (unified with) the linguistic description, thus supporting a scalable integration of disparate information sources (e.g. domain ontologies, multimedia material or program code for automatic hyperlinking) into the Natural Language Processing chain. The well-known procedures acting on feature structures, unification and subsumption, allow a descriptive mapping between (for example) domain ontologies and the results of NL processing.⁵ The resulting feature structure is mapped back into an (enriched) XML structure and is thus available for further processing. Some of the added annotations can be used as a "semantic" index for content-based search. Alternatively, one can add the relevant nodes (or some local paths) of the ontology that have been detected as relevant for the text to the metadata list associated with the document, thus extending the core metadata to a contentful one, which can easily be scanned by search engines, thus facilitating the constitution of the Semantic Web. NL processing thus guides the detection and presentation of additional and associated information and knowledge, which might be available at some other places in a net of information, and presents it in an XML structure. For example, once an occurrence of a proper noun is found in a document, a search can be started within other documents (structured or not) in order to extract relevant information about the entity referred to by the proper noun and present it in a structured way to the reader. The technology responsible for this is often called automatic hyperlinking and is central in the context of document enrichment. This technology also helps to incrementally create specialized databases on entities or events. A single document can be enriched (annotated) by different types of annotations, depending for example on the underlying terminology, thesaurus etc.

⁵ The mapping procedure between linguistic features and non-linguistic knowledge can very probably be executed within the sole frame of XML and associated semantic representational languages, but the use of feature structures for our purposes has shown various advantages: the higher level of declarativity for the description of mapping (unification) rules, and the higher efficiency (random access) in accessing sub-structures of the results of the linguistic analysis.

The integration of (domain-specific) knowledge during the NL processing can improve the results of the linguistic analysis, since decisions about syntactic disambiguation and attachment of linguistic chunks can in certain cases be supported by non-linguistic information.

4 An example of an application: the MUMIS project

The design and the ongoing implementation of SCHUG were initially undertaken to support the information extraction (IE) task in the context of the EU project MUMIS, dedicated to the indexing of multimedia material.⁶

⁶ MUMIS is an on-going EU-funded project within the Information Society Program (IST, number 1999-10651) of the European Union, section Human Language Technology (HLT). For more information see http://parlevink.cs.utwente.nl/projects/mumis/

MUMIS develops and integrates basic technologies for the automatic indexing of multimedia programme material. The domain of application is soccer. Various technology components operating offline generate formal annotations of events in the data material processed. These formal annotations (in XML) constitute the basis for the integral online part of the MUMIS project, consisting of a user interface allowing the querying of videos. The indexing of the video material with relevant events is done along the time codes extracted from the various documents.

For this purpose the project makes use of data from different media sources (textual documents, radio and television broadcasts) to build a specialized set of lexicons and an ontology for the selected domain (soccer). All are available in XML and are integrated into the IE processing components. The project also digitizes non-text data and applies speech recognition techniques to extract text for the purpose of annotation.

The core linguistic processing for the annotation of the multimedia material consists of advanced information extraction techniques for identifying, collecting and normalizing significant text elements (such as the names of players in a team, goals scored, time points or sequences etc.) which are critical for the appropriate annotation of the multimedia material in the case of soccer.

Because the project accesses and processes distinct media in distinct languages, there is a need for a novel type of merging tool in order to combine the semantically related annotations generated from those different data sources, and to detect inconsistencies and/or redundancies within the combined annotations. The merged annotations (in XML) are stored in a database, where they are combined with relevant metadata.

We are currently investigating how domain-specific annotations, gained on the basis of the merging of linguistic and domain-specific knowledge, can be included in the MPEG-7 standard, using for this the slot foreseen for "Textual Annotation". The main issue of this investigation will be to check to what extent textual annotations can be combined with low-level video features in order to achieve better content indexing (and searching) of video material.

5 Integration of various types of documents for an incremental IE

As we have seen above, MUMIS makes use of various types of sources for the generation of content annotations. MUMIS also distinguishes between the textual documents it consults, and applies different processing techniques depending on the type of textual document:

1. Reports from newspapers (reports about specific games, general reports), which are classified as free texts
2. Tickers, closed captions and action databases, which are classified as semi-formal texts
3. Formal descriptions of specific games, which are classified as formal texts

Since the information contained in formal texts can be considered as a database of true facts, they play an important role within MUMIS. Nevertheless they contain only little information about a game: the goals, the substitutions and a few other events (penalties, yellow and red cards). So there are only few time points available for indexing videos. Semi-formal texts, like live tickers on the web, offer many more time points, related to a higher diversity of events (goal scenes, fouls etc.), and seem to offer the best textual source for our purposes. Nevertheless the quality of the texts of online tickers is often quite poor. Free texts, like newspaper articles, are of high quality, but the extraction of time points and their associated events is more difficult. Those texts also offer more background information which might be interesting for the users (age of the players, the clubs they normally play for, etc.). Figures 1 and 2 show examples of two (German) formal texts on one and the same game, and Figure 4 gives an example of a semi-formal text on the same game.

    England - Deutschland 1:0 (0:0)
    England: Seaman (2,5) - G. Neville (3,5), Keown (3), Campbell (2), P. Neville (4,5) - Ince (3,5), Wise (5) - Beckham (4), Scholes (3) - Shearer (3), Owen (5) - Trainer: Keegan
    Deutschland: Kahn (2) - Matthaeus (3) - Babbel (3,5), Nowotny (2,5) - Deisler (3), Hamann (2,5), Jeremies (3,5), Ziege (3,5) - Scholl (5) - Jancker (4), Kirsten (5) - Trainer: Ribbeck
    Eingewechselt: 61. Gerrard fuer Owen, 72. Barmby fuer Scholes - 70. Rink fuer Kirsten, 72. Ballack fuer Deisler, 78. Bode fuer Jeremies
    Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)
    Schiedsrichter: Collina, Pierluigi (Viareggio), Note 2 - bis auf eine falsche Abseits-Entscheidung souveraen und sicher
    Zuschauer: 30000 (ausverkauft)
    Gelbe Karten: Beckham - Babbel, Jeremies

Figure 1. Example of a so-called formal text, where one can see that only 5 distinct time points can be extracted, concerning the player substitutions ("Eingewechselt") and one goal ("Tore").

    Aufstellungen:
    England: 1 Seaman (Arsenal London/36 Jahre/59 Laenderspiele) - 2 Gary Neville (Manchester United/25/38), 6 Keown (Arsenal London/33/32), 4 Campbell (Tottenham Hotspur/25/35), 3 Phil Neville (Manchester United/23/28) - 7 Beckham (Manchester United/25/33), 14 Ince (FC Middlesbrough/32/52), 8 Scholes (Manchester United/25/26), 17 Wise (FC Chelsea London/33/18) - 9 Shearer (Newcastle United/29/62), 10 Owen (FC Liverpool/21/21)
    Deutschland: 1 Kahn (Bayern Muenchen/31 Jahre/26) - 2 Babbel (Bayern Muenchen/27/51), 10 Matthaeus (New York Metro Stars/39/149), 6 Nowotny (Bayer Leverkusen/26/21) - 18 Deisler (Hertha BSC/20/5), 14 Hamann (FC Liverpool/26/26), 16 Jeremies (Bayern Muenchen/26/26), 17 Ziege (FC Middlesbrough/28/52) - 7 Scholl (Bayern Muenchen/29/28) - 19 Jancker (Bayern Muenchen/25/8), 9 Kirsten (Bayer Leverkusen/34/50/49 fuer die DDR)
    Schiedsrichter: Collina (Italien)

Figure 2. A second example of a so-called formal text, where one can see that different information providers give distinct information: here for example the number of games for the national team.

Since the formal texts require only little linguistic analysis, but rather an accurate domain-specific interpretation of the symbols used, a module has been defined within SCHUG which, in a first step, maps the formal texts onto an XML annotation⁷ giving the domain semantics of the expressions in the text. In a second step SCHUG merges all the XML-annotated formal texts about one game. Figure 3 shows a part of such merged annotations:

⁷ Following a DTD resulting from the analysis of all available formal texts in our soccer corpus.

    [XML markup not preserved in this copy; only the element values survive]
    Deutschland
    #Trainer #Ribbeck
    Kahn
    #(2)
    1
    ##1
    ##31
    ##Bayern Muenchen
    ##26
    ...
    #Collina, Pierluigi
    ##Collina
    #Viareggio
    ##Italien
    #2
    ...

Figure 3. Merged annotations generated from formal texts. Information extracted from the first text is marked with "#", from the second text with "##". No special marker is provided if both texts give the same information.

Those merged annotations are generated at a level that requires only little linguistic analysis, and basically reflect domain-specific information about actors and events involved in the text. The SCHUG module applied at this level also extracts metadata: name of the game, date and time of the game, intermediate and final scores etc. This is quite important, since the metadata will guide the use of the annotations produced so far for supporting linguistic analysis and Information Extraction applied to more complex documents, like the ticker shown in Figure 4. Let us take as an example the line beginning with the time code "16." The word "Ziege" can be interpreted as being a soccer player on the basis of the available annotations generated from the formal texts. Without this, the default reading (goat) would have been selected. The other soccer terms like "flankt", "Freistoss" etc. are interpreted on the basis of a multilingual soccer thesaurus semi-automatically developed within the MUMIS consortium. In this thesaurus, terms in three distinct languages (Dutch, English and German) are related to soccer concepts. So "flankt" is related to the concept "cross". With the help of this document-external information, partially dynamically generated, the line starting with the time code "16." in Figure 4, for example, can be successfully analysed and the following event annotations can be generated:

    2-event_1_PLAYER = Ziege
    1-event_LOC = Goal-line::Goal-area
    1-event_1_PLAYER = Scholes
    3-event_EVENT_CLASS = goal_scene_fail
    3-event_TYPE = Save
    3-event_TIME = 16:00
    2-event_TIME = 16:00
    1-event_TIME = 16:00
    1-event_TYPE = Cross
    DOM = SOCCER

The already available information about the player "Ziege" (or about the player "Scholes") is also made available at this level, mixed with linguistic information:

    OLD = ##28
    TAG = NP
    NP_AGR = [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 .. 23]
    NOTE = #(3,5)
    NP_STRUK = 25
    NP_STRING = Ziege
    OBJ_AGR = [1 3 4 5 7 8 9 11 12 14 15 16 18 19 21 22 23]
    POS = #4 ##3
    NP_HEAD = Ziege
    PLAYER = Ziege
    GF = SUBJ/DAT_OBJ/AKK_OBJ/NP_MOD_GEN
    SUBJ_AGR = [2 10 17]
    NUMBER = ##17
    NR_PLAYS = ##52
    TEAM = Deutschland
    CLUB = ##FC Middlesbrough
    TYPE = NP_PLAYER
    DOM = SOCCER

This basic information can also be very useful for reference resolution. For example, if a sentence contains "The 28 year old midfield player of Middlesbrough ..", SCHUG can consult the dynamically generated annotations and then point to "Ziege". SCHUG currently also adds to the "Ziege" entry all the events it detects in the semi-formal texts. The updated set of annotations will be of use for the subsequent analysis of free texts.

All the generated (XML) annotations on events, with the information available about the actors involved, are passed to a MUMIS module in charge of integrating text annotations and the video stream of the game, so that this video can be queried on the basis of such events and actors, which are also related to each other. The MUMIS searching environment allows queries of the form: "Give me all the goal scenes in the second half of the game, if Ziege is involved."

    Gruppe A: England - Deutschland 1:0 (0:0)
    7. Ein Freistoss von Christian Ziege aus 25 Metern geht ueber das Tor.
    12. Ziege flankt per Freistoss in den Strafraum und Jeremies versucht es per Kofball, verfehlt den Kasten jedoch deutlich.
    16. Scholes flankt gefaehrlich von der Torauslinie in den Fuenfmeterraum, doch Ziege hat aufgepasst und kann klaeren.
    18. Hamann versucht es mit einem Distanzschuss aus 20 Metern, aber Seaman ist auf dem Posten.
    23. Scholl mit einer Riesenchance: Nach Zuspiel von Hamann rennt er in den englischen Strafraum, wird jedoch gleich von drei Seiten bedraengt und kommt nur zu einem unplazierten Schuss, den Seaman sicher abfangen kann.
    27. Jancker spielt auf Ziege, dessen Schuss von der Strafraumgrenze kann von Seaman abgefangen werden.
    35. Michael Owen kommt nach Flanke von Philip Neville voellig frei vor dem deutschen Tor zum Kopfball, doch Kahn kann zum ersten Mal sein Koennen unter Beweis stellen und rettet auf der Linie.
    43. Kahn zum zweiten: Beckham flankt auf Scholes, der zieht ab in den rechten Winkel, aber der deutsche Keeper verhindert erneut die englische Fuehrung.
    47. Christian Zieges Freistoss aus 20 Metern geht einen halben Meter ueber das Tor.
    53. Beckham flankt per Freistoss an der deutschen Abwehr vorbei auf den Kopf von Alan Shearer, der voellig freistehend zum 1:0 fuer die Englaender verwandelt.
    58. Scholl wird von Matthaeus bedient, aber sein Schuss geht aus halbrechter Position um Zentimeter am Tor vorbei.
    65. Seaman kann nach einem Eckball vor Kirsten klaeren, der Nachschuss von Jancker geht knapp am Tor vorbei. Riesenmoeglichkeit fuer die DFB-Elf.

Figure 4. Example of a so-called semi-formal text, where one can see that more time points are available here, and that those can be complementary to the time points to be extracted from formal texts. So, already at this level, a unification or merging of extracted time points can be done.

6 CONCLUSION

We have shown that (shallow) multilingual linguistic procedures can be very helpful for a whole range of IT applications, since they support the integration of various sources of information and the Knowledge Markup of textual documents. Within the SCHUG system it is possible to associate non-linguistic information at various levels of linguistic analysis, as required by the application under consideration. The XML representation has proven to be an easy and useful means for communicating between disparate sources of information. The SCHUG tools can capture related knowledge for a document on the basis of robust but accurate NLP and of the ontology-driven IE supported by the system. This knowledge is visualized to the reader via the automatic hyperlinking feature included in SCHUG, which also makes it possible to semantically annotate the documents and to make the underlying conceptual structure visible at any place of the documents.

We will in the future have to look at how to integrate our approach in a general XML architecture or in Knowledge Markup editing tools.

ACKNOWLEDGEMENTS

The major part of the work reported here has been done within the context of the EU-funded project MUMIS (IST-1999-10651). We would also like to thank Mireia Farrus for her help on the Spanish grammar, and Claudia Crispi and Mihalea Hutanu for their work on the SCHUG platform.

REFERENCES

[1] Steven Abney, 'Parsing by chunks', in Principle-Based Parsing, eds., Steven Abney, Robert Berwick and Carol Tenny, Kluwer Academic Publishers, (1991).
[2] Steven Abney, 'Partial parsing via finite-state cascades', in Workshop on Robust Parsing, 8th European Summer School in Logic, Language and Information (ESSLLI), (1996).
[3] Doug E. Appelt, 'An introduction to information extraction', AI Communications, 12, (1999).
[4] Thierry Declerck and P. Wittenburg, 'Mumis – a multimedia indexing and searching environment', in Proceedings of the 1st International Workshop on MultiMedia Annotation, MMA-2001, Tokyo, (2001).
[5] ISO/IEC JTC1/SC29/WG11. Mpeg-7 overview. http://mpeg.telecomitalialab.com/standards/mpeg-7/mpeg-7.htm.
[6] MUC, ed. Seventh Message Understanding Conference (MUC-7), http://www.muc.saic.com/, 1998. SAIC Information Extraction.
[7] Jakub Piskorski and G. Neumann, 'An intelligent text extraction and navigation system', in Proceedings of the 6th Conference on Recherche d'Information Assistée par Ordinateur, RIAO-2000, (2000).