Timely and Non-Intrusive Active Document Annotation via Adaptive Information Extraction

Fabio Ciravegna¹, Alexiei Dingli¹, Daniela Petrelli² and Yorick Wilks¹

Abstract. The process of document annotation for the Semantic Web is complex and time consuming, as it requires a great deal of manual annotation. Information extraction from texts (IE) is a technology used by some of the most recent systems for actively supporting users in the process and reducing the burden of annotation. The integration of IE systems in annotation tools is quite a new development, and in our opinion there is still a need to think through the impact of the IE system on the process of annotation. In this paper we discuss two main requirements for active annotation: timeliness and tuning of intrusiveness. We then present and discuss a model of interaction that addresses the two issues, and Melita, an annotation framework that implements this methodology.

1. INTRODUCTION

The effort behind the Semantic Web (SW) is to add content to web documents in order to access knowledge instead of unstructured material, allowing knowledge to be managed in an automatic way. Much work is being done on (1) the definition of standards for the representation of knowledge (e.g. XML, RDF, OIL), (2) the definition of structures for knowledge organization (e.g. ontologies) and (3) the population of such knowledge structures. (1) and (2) provide the necessary infrastructure for the Semantic Web; (3) requires methodologies for creating semantically enriched documents. It is reasonable to expect users to manually annotate new documents up to a certain degree, but annotation is a slow, time-consuming process that involves high costs. It is therefore vital for the Semantic Web to produce automatic or semi-automatic methods for extracting information from web-related documents, either to help in annotating new documents or to extract additional information from existing unstructured or partially structured documents.

In this context, Information Extraction from texts (IE) is one of the most promising areas of Human Language Technologies for the Semantic Web. IE is an automatic method for locating important facts in electronic documents for subsequent use, e.g. for annotating documents or for information storing (such as populating an ontology with instances). In this perspective IE is the perfect support for knowledge identification and extraction from Web documents, as it can, for example, support document annotation either in an automatic way (unsupervised extraction of information) or in a semi-automatic way (e.g. as support for human annotators in locating relevant facts in documents, via information highlighting). In recent years considerable effort has been spent in the IE community on the use of Machine Learning to help in porting IE systems to new applications/domains [1][2][3].

Some new annotation tools for the Semantic Web already include adaptive IE capabilities to help in the annotation process. At the Open University, the MnM annotation tool [4] interfaces with both the UMass IE tools [5] and Sheffield's Amilcare [11]. At the University of Karlsruhe the Ontomat annotizer [6], an implementation of the CREAM environment, interfaces with Sheffield's Amilcare. The advantage of using adaptive IE as a support for annotation is quite clear: the IE system monitors the annotations inserted by the user and learns how to reproduce them. When equivalent cases are encountered, annotations are automatically inserted by the IE system and users just have to check them. This approach, called active learning, has been shown to reduce the burden of manual annotation by up to 80% in some cases [7].

The current methodology of interaction between annotation tool and IE system is still quite simplistic. This also influences the way in which the user and the annotation system interact. Generally a batch interaction mode is adopted, i.e. the user annotates a batch of texts and the IE tool is trained on the whole batch. Then annotation is started on another batch of texts and the IE system proposes annotations to users when cases similar to those found in the training batches are recognized. Although the use of adaptive IE constitutes quite an improvement with respect to the completely manual annotation approach, in our opinion the tremendous potential of adaptive IE technologies is not fully exploited.

We believe that it is time to consider how the interaction can be organized in order both to maximize effectiveness in the annotation process and to minimize the burden of annotating/correcting on the user's side. We expect that such a change will also influence the interaction style between user and annotation tool, by moving from a simplistic user-system interaction to real user-system collaboration (see footnote 1). We propose two user-centred criteria as measures of the appropriateness of this collaboration: timeliness and intrusiveness of the IE process. The former is the ability to react to user annotation: how timely the system is in learning from user annotations. The latter is the degree to which the system bothers the user, for example because it requires CPU time (and therefore stops the user's annotation activity) or because it suggests wrong annotations.
Affiliations: ¹ Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK, email {fabio|alexiei|yorick}@dcs.shef.ac.uk. ² Department of Information Studies, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK, email D.Petrelli@shef.ac.uk

Footnote 1: Collaboration means working together for a common goal, all partners contributing with their own capabilities and skills.

Timeliness: when the IE system (IES) is trained on blocks of texts, there is a time gap between the moment in which annotations are inserted by the user and the moment in which they are used by the system for learning. User and system work in strict sequence, one after the other. This sequential scheduling hampers true collaboration. If a batch of texts contains many similar documents, users may spend a lot of time annotating similar documents without receiving feedback from the IES, for the simple reason that no learning is scheduled for the moment. The IES is neither supportive to the user nor efficient, since similar cases are of very little use for the learner: they cannot offer the variety of phenomena that empowers learning. The bigger the batch of texts, the worse the lack of timeliness. True collaboration implies (re)training the system after every annotated text is released by the user. Training can take a considerable amount of CPU time and can therefore stop the annotation session for a while. A positive collaboration requires that the user's time is not constrained by the IES training time (otherwise the intrusiveness of the IES increases). We believe that intelligent scheduling is needed to keep learning timely without increasing intrusiveness. It is also important to bear in mind that timeliness is a matter of perception on the user's side, not an absolute feature; what is important is that users do not perceive any delay or impediment. The focus is on effective collaboration, not on timeliness at any cost.

Intrusiveness: in all the experiments with active learning done so far it has turned out to be difficult to avoid bothering users with proposed annotations generated by unreliable rules (e.g. rules induced using an insufficient number of cases). This problem is mainly related to the tuning of the IES behaviour. Some IESs provide internal tuning methods for balancing features such as precision and recall, or the minimum number of cases to be covered in order to accept a rule for annotation. Such tuning methodologies are designed for IE experts, since they require a deep knowledge of the underlying IE system. This is especially true because the user's goal is tuning the level of intrusiveness in the annotation process, and very often there is no obvious counterpart in the IES tuning methodology. For example, Amilcare allows the modification of error thresholds for rules, the number of cases covered by rules for acceptance, and the balance of precision and recall in rule tuning: none of these corresponds directly to tuning the level of intrusiveness (even if a large part of it depends on the precision/recall balance). The acceptable level of intrusiveness is subjective: some users might like to receive suggestions largely regardless of their correctness, while others do not want to be bothered unless suggestions are absolutely reliable. We think that a user-friendly interaction methodology must be implemented to help in selecting the appropriate level of intrusiveness, without requiring users to cope with the complexity of tuning an adaptive IE system.

In this paper we present an IE-based annotation methodology for the Semantic Web that takes into account the problems of timeliness and intrusiveness mentioned above.

2. THE ANNOTATION PROCESS

In our model the annotation process is split into two main phases from the system's point of view: (1) training and (2) active annotation with revision. In user terms the first corresponds to unassisted annotation, while the second just requires correction of the annotations proposed by the IES.

2.1 Training

During training users annotate texts without any contribution from the IES. In this phase the IES uses the annotations inserted by the user to train its learner, and it is constantly inducing rules. We can define two sub-phases: (a) bootstrapping and (b) training with verification. During bootstrapping the only IES task is to learn from the user annotations. The length of this sub-phase varies according to the specific IES requirements: it depends on the minimum number of examples needed for a minimum training. During the second sub-phase, the user continues with the unassisted annotation, but the behaviour of the IES changes. With some rules already available, the IES silently competes with the user in annotating the document. When the annotation of the document is finished, the IES automatically compares its annotations with those inserted by the user and calculates its accuracy. Missing annotations or mistakes are used to retrain the learner. The training phase ends when the accuracy in annotating is sufficient to provide the user's preferred level of proactivity, and it is therefore possible to move to the next phase: active annotation. We will discuss in the following section how this condition is verified.

Figure 1. The training with verification sub-phase.

2.2 Active Annotation with Revision

In this phase the annotation methodology is heavily based on the suggestions of the IES, and the user's main task is to correct and integrate the suggested annotations (i.e. remove or add annotations). Human corrections and integrations are fed back to the IES for retraining. This is the phase where the real system-user cooperation takes place: the system helps the user in annotation; the user feeds back the mistakes to help the system perform better. In user terms this is where the added value of the IES becomes apparent, because it heavily reduces the amount of annotation the user has to insert. This supervision task is much more convenient in terms of both cognition and actions. Correcting annotations is simpler than annotating bare texts, it is less time consuming and it is also likely to be less error prone.

3. A NEW MODEL OF INTERACTION

The proposed model of interaction is based on non-intrusive and timely active annotation. The first level of non-intrusiveness is that the IES does not require any specific annotation interface or any specific adaptation by the user. It integrates into the usual user environment and provides suggestions for possible annotations in a way that is both familiar and intuitive for the user. To some extent users could even be unaware that an IES is working for them. The interaction with the user is left to the annotation interface, a tool designed for specific user classes and therefore able to elicit the tuning requirements by using the correct terminology for the specific domain. Even the requirements for the appropriate IES settings must be elicited through the interface (and then converted into IES-specific settings through an API).

3.1 Intrusiveness vs. Proactivity

Intrusiveness is the risk related to proactivity. As mentioned, there are a number of ways in which the IES can be intrusive with respect to the user's task. On the one hand, when the system suggests annotations during phase 2 (active annotation with revision), it can bother users with unreliable annotations. The requirement here is to enable users to tune the IES behaviour so that the level of suggestions is appropriate. The annotation interface must bridge the qualitative vision of users (e.g. a request to be more/less active or accurate) with the specific IES settings (e.g. change error thresholds) [8]. On the other hand, IES training requires CPU time, and this can slow down or even stop the user's activity. This may happen in both the phases mentioned above (training and active annotation with revision), as discussed in the next section.

Figure 2. The active annotation with revision phase.

3.2 Limiting the User Idle Time

Training requires time, and for this reason most current systems use a batch mode of training, so as to confine the time in which the user has to wait while the system trains to specific moments (e.g. coffee time). As explained above, the batch approach presents timeliness problems: users may have to annotate a number of similar texts before the learner is activated and the IES is able to suggest annotations.

An appropriate scheduling of the learning phase can both improve the timeliness between the user's annotation and system learning and limit the user idle time to the minimum. If we observe how time is spent in the annotation process (select a document, manually annotate the document, save the annotation), we notice that most of the user's time is spent in the manual annotation step. For this reason we believe that this is the right moment to train the IES in the background without the user noticing it. In principle it would be possible to treat every annotation event in the interface as a request to train on a specific example, but this requires the ability to retract annotations in case of user errors, and this makes the interaction with the IES quite complex. In our method the IES works in the background with two parallel and asynchronous processes. On the one hand, while the user annotates document n the system learns the annotations inserted in document n-1, i.e. the learner is always one document behind the user. At the same time (i.e. as a separate process) the IES applies the rules induced in the previous learning sessions (i.e. from document 1 to document n-2) in order to extract information (either for suggesting annotations during active annotation or in order to silently test its accuracy during unassisted learning). This means that the annotation capability is always two steps behind. The advantage is that there is no idle time for the user, as the annotation of a document generally requires a great deal more time than training on a single text.

3.3 Coping with Timeliness

Timeliness is not fully obtained with the above interaction methodology: the IES annotation capability always refers to rules learned by using the entire annotated corpus but the last document. This means that the IES is not able to help when two similar documents are annotated in sequence. From the user's point of view such a situation is equivalent to training on batches of two texts, with all the disadvantages of batch training mentioned above (even if a batch of size two is quite small). In this respect the collaboration between the system and the user fails to be effective. Timeliness is a matter of perception on the user's side, not an absolute feature; what matters, we believe, is how users perceive it. In this respect we start from the consideration that in many applications the order in which documents are annotated is random. Generally users adopt criteria such as date of creation or file name order in directories. In such cases it is possible to organize the annotation order so as to avoid presenting similar documents in sequence and therefore to hide the lack of timeliness.

In order to implement such a feature we need a measure of similarity of texts from the annotation point of view. The IES can be used to work out such a measure. At the end of each learning session all the induced rules are applied to the whole unannotated corpus. As a result two main subsets of the corpus are detected: texts where the available rules fire (i.e. annotations can be added: positive subset) and texts where they do not fire at all (uncovered texts: negative subset). Each text in the positive subset can be associated with a score given by the number of annotations that can be added. The score can be used as an approximation of similarity among texts: inserted annotations mean similarity with respect to the part of the corpus annotated so far, no inserted annotation means actual difference. Such information can be used to make the timeliness more effective: a completely uncovered document is always followed by a fairly covered document. In this way a difference between successive documents is very likely, and therefore the probability that similar documents are presented in turn within the batch of two (i.e. the blindness window of the system) is very low. Incidentally, this strategy also tackles another major problem in annotation, i.e. user boredom, which is the major reason why user productivity and effectiveness fall with time. Presenting users with radically different documents should avoid the boredom that comes from coping with very similar documents in sequence.

In the next section a first implementation of the discussed interaction model is presented. We introduce both the IES used (Amilcare) and the annotation interface (Melita). Finally we discuss how the current implementation meets the requirements described.

4. ADAPTIVE IE IN AMILCARE

Amilcare is a tool for adaptive Information Extraction from text (IE) designed for supporting active annotation of documents for the Semantic Web. It performs IE by enriching texts with XML annotations, i.e. the system marks the extracted information with XML annotations. The only knowledge required for porting Amilcare to new applications or domains is the ability to manually annotate the information to be extracted in a training corpus. No knowledge of Human Language Technology is necessary. Adaptation starts with the definition of a tag-set for annotation, possibly organized as an ontology where tags are associated with concepts and relations. Then users have to manually annotate a corpus for training the learner. An annotation interface is to be connected to Amilcare for annotating texts using XML mark-up. As mentioned, Amilcare has been integrated with a number of annotation tools so far, including MnM [4] and Ontomat [6]. For example, the annotation interface in Ontomat is used to annotate texts in a user-friendly manner; Ontomat automatically converts the user annotations into XML tags to train the learner. Amilcare's learner induces rules that are able to reproduce the text annotation.

Amilcare can work in two modes: training, used to adapt to a new application, and extraction, used to actually annotate texts. In both modes, Amilcare first of all preprocesses texts using Annie, the shallow IE system included in the Gate package ([9], www.gate.ac.uk). Annie performs text tokenization (segmenting texts into words), sentence splitting (identifying sentences), part of speech tagging (lexical disambiguation), gazetteer lookup (dictionary lookup) and named entity recognition (recognition of people and organization names, dates, etc.).

When operating in training mode, Amilcare induces rules for information extraction. The learner is based on (LP)², a covering algorithm for supervised learning of IE rules based on Lazy-NLP [10] [11]. This is a wrapper induction methodology [12] that, unlike other wrapper induction approaches, uses linguistic information in the rule generalization process. The learner starts by inducing wrapper-like rules that make no use of linguistic information, where rules are sets of conjunctive conditions on adjacent words. Then the linguistic information provided by Annie is used in order to generalise rules: conditions on words are substituted with conditions on the linguistic information (e.g. conditions matching either the lexical category, or the class provided by the gazetteer, etc. [11]). All the generalizations are tested in parallel by using a variant of the AQ algorithm [13] and the best k generalizations are kept for IE. The idea is that the linguistic-based generalization is used only when the use of NLP information is reliable or effective. The measure of reliability here is not linguistic correctness (which non-expert users cannot measure), but effectiveness in extracting information using linguistic information as opposed to using shallower approaches. Lazy NLP-based learners learn which is the best strategy for each piece of information/context separately. For example, they may decide that using the result of a part of speech tagger is the best strategy for recognizing the speaker in seminar announcements, but not for spotting the seminar location. This strategy is quite effective for analyzing documents with mixed genres, quite a common situation in web documents [14].

The learner induces two types of rules: tagging rules and correction rules. A tagging rule is composed of a left hand side, containing a pattern of conditions on a connected sequence of words, and a right hand side that is an action inserting an XML tag in the texts. Each rule inserts a single XML tag, e.g. </speaker>. This makes the approach different from many adaptive IE algorithms, whose rules recognize whole pieces of information (i.e. they insert both <speaker> and </speaker> [7]), or even multi slots [15]. Correction rules shift misplaced annotations (inserted by tagging rules) to the correct position. They are learnt from the mistakes made in attempting to re-annotate the training corpus using the induced tagging rules. Correction rules are identical to tagging rules, but (1) their patterns also match the tags inserted by the tagging rules and (2) their actions shift misplaced tags rather than adding new ones. The output of the training phase is a collection of rules for IE that is associated with the specific scenario.

When working in extraction mode, Amilcare receives as input a (collection of) text(s) with the associated scenario (including the rules induced during the training phase). It preprocesses the text(s) by using Annie, then applies its rules and returns the original text with the added annotations. The Gate annotation schema is used for annotation [9].

5. THE MELITA FRAMEWORK

Melita is an ontology-based demonstrator for text annotation. The goal of Melita is not to produce a further annotation interface, but a demonstrator of how it is possible to actively interact with the IES in order to meet the requirements of timeliness and tunable proactivity mentioned above. Melita's main control panel is depicted in figure 3. It is composed of three main areas:

1. The ontology (left) representing the annotations that can be inserted; annotations are associated with concepts and relations. A specific colour is associated with each node in the ontology (e.g. "speaker" is depicted in blue).
2. The document to be annotated (centre-right). Selecting a portion of text with the mouse and then clicking on a node in the ontology inserts an annotation. Inserted annotations are shown by turning the background of the annotated text portion to the colour associated with the node in the hierarchy (e.g. the background of the portion of text representing a speaker becomes blue).
3. The IES suggestion area (bottom) where some of the suggested annotations are presented.

Melita does not differ in appearance from other annotation interfaces such as the Gate annotation tool, MnM or Ontomat. This is because, as mentioned, it is a demonstrator to show how a typical annotation interface could interact with the IES. The novelty of Melita is the possibility of (1) tuning the IES so as to provide the desired level of proactivity and (2) scheduling texts so as to provide timeliness in annotation learning.

The typical annotation cycle in Melita follows the two-phase cycle based on training and active annotation described in the previous section. Users may not be aware of the difference between the two phases. They will just notice that at some point the annotation system starts suggesting annotations, and that they have a way to influence when and how this happens. Suggestions can be presented in the suggestion area or in the document area according to a number of criteria. When a suggestion is presented in the suggestion area, an explicit selection (on the tick box) is required from the user to accept it, otherwise the suggestion is not inserted. When presented directly in the document under annotation, suggestions are displayed using the same colour code (e.g. blue background for speaker), but they are made recognizable as suggestions by a special coloured border. The assumption here is that annotations are considered correct unless the user removes them explicitly. The presentation strategy adopted displays unstable tags (i.e. tags not yet fully reliable) in the suggestion area, while tags considered reliable by the system are displayed directly in the document. Note that reliability is independent for each piece of information. For example, a system can become quite reliable in a short time in recognizing some information (e.g. seminar start time) while requiring more training examples for others (e.g. speaker). In this case there will be a moment in which the suggested annotations for the time are inserted in the document pane while the annotations for the speaker go into the suggestion panel.

5.1 Controlling Proactivity

Users can customize the behaviour of the IES by tuning the level of IES proactivity, thus changing the level of intrusiveness, by using a special slidebar (fig. 4). It allows two thresholds to be set that divide the accuracy space into three areas: the first threshold is the minimum accuracy the IES must reach in order to start inserting annotations in the suggestion panel. The second threshold defines the minimum accuracy the system must reach before it starts suggesting in the document panel. In the example in figure 4 the system will suggest in the suggestion panel when its accuracy is between 43 and 75%, and in the document panel when it is greater than 75%. When accuracy is less than 43% the IES does not suggest (i.e. it is still in training mode). This general default holds for all the nodes in the ontology, but it can be overridden for specific tags by using the same kind of window. Changing the default for specific tags is useful because users can have different feelings about intrusiveness for different kinds of information, depending on the effort required to identify and select that piece of information.

Figure 4: the slidebar for tuning the system's intrusiveness.

It is worth noting that the same slidebar shows the accuracy currently reached by the IES for the specific information: it is the blue filler mark that grows from the bottom (around 10% in figure 4). It gives feedback on the current status of the IES, e.g. whether it is in training mode, whether it is suggesting in the suggestion panel, etc. Moreover, such feedback should support intuitive changes to the current IES behaviour, e.g. turning off the IES suggestions by lifting the two arrows beyond the blue maximum level. Note that the same information is presented near each node in the ontology panel: a small square is divided into three parts (corresponding to the three areas above). The small square fills in the same way as the slidebar. In this way the user always has feedback on the current status for each piece of relevant information.

6. EVALUATING IE'S CONTRIBUTION

We performed a number of experiments to demonstrate how fast the IES can converge to an active annotation status and to quantify its contribution to the annotation task, i.e. its ability to suggest correctly. We selected the CMU seminar announcements corpus, where 483 emails are manually annotated with speaker, starting time, ending time and location of seminars. This corpus has already been used for evaluating a number of adaptive algorithms [10]. In our experiment the annotation in the corpus was used to simulate human annotation in the methodology described above. We evaluated the potential contribution of the IE system at regular intervals during corpus tagging, i.e. after the annotation of 5, 10, 20, 25, 30, 50, 62, 75, 100 and 150 documents (each subset fully including the previous one). Each time we tested the accuracy of the IES on the following 200 texts in the corpus (so when training on 25 texts, the test was performed also on the following 25 texts that will be used for training on 50). The ability to suggest on the test corpus was measured in terms of precision and recall. Recall represents here an approximation of the probability that the user receives a suggestion in tagging a new document. Precision represents the probability that such a suggestion is correct.

The maximum gain comes in annotating stime and etime. This is not surprising, as they present quite regular fillers. After training on only 20 texts, the system is potentially able to propose 368 stimes (out of 491): 303 are correct, 18 partially correct (see footnote 2) and 47 wrong, leading to Precision=84, Recall=61. With 30 texts the recognition reaches P=91, R=78, with 50 P=92, R=80. The situation is very similar for etime, while it is more complex for speaker and location, where 80% f-measure is reached only after about 100 texts. This is due to the fact that locations and speakers are much more difficult to learn than time expressions because they are much less regular. Note that in the experiment we did not use a Named Entity Recogniser (NERC). A NERC would have reduced the number of examples needed for speaker. We performed the same type of analysis on other corpora, such as the Austin TX Jobs announcement corpus, and found similar results.

Footnote 2: Where the proposed and correct annotations partially overlap. They count as half correct in calculating precision and recall.

6.1. Is it Worth Using IE?

The experiments show that the contribution of the IES can be quite high. Reliable annotation can be obtained with limited training, especially when adopting high precision IES configurations. In the case of the CMU corpus, our experiments show that it is possible to move from bootstrapping to active annotation after annotating some dozens of texts. In table 1 we show the amount of training needed for moving to active annotation for each type of information, given a minimum user requirement of 75% precision. This shows that the IES contribution heavily reduces the burden of manual annotation, and that such reduction is particularly relevant and immediate in the case of quite regular information (e.g. time expressions). In user terms this means that it is possible to focus the activity on annotating more complex pieces of information (e.g. speaker), without being bothered with repetitive ones (such as stime). With some more training cases the IES is also able to contribute to annotating the complex cases.

Tag        Texts needed for training   Prec   Rec
stime      30                          91     78
etime      20                          96     72
location   30                          82     61
speaker    100                         75     70

Table 1: The amount of training texts needed for reaching at least 75% precision and 50% recall.

7. CONCLUSIONS AND FUTURE WORK

In this paper we have presented a modality of interaction between an adaptive IES and a classical annotation interface for the Semantic Web. We have defined a modality in which the interface and the IES cooperate in order to obtain effective annotation in the way preferred by a specific user. We have also explained how to organize learning in order to reduce or avoid any idle time from the user's point of view. Then we have discussed how it is possible to maintain a reasonable timeliness in learning from examples while hiding from users the delay necessary for training the underlying IES. Finally we have presented Melita, a demonstrator that implements this methodology, and we have described how user configurations in Melita are turned into settings for Amilcare.

We believe that this methodology of interaction between the IES and the annotation interface allows the potential of adaptive IE for annotating texts to be fully exploited because:

1. It fits into the usual user environment without imposing particular requirements on the annotation interface used to train the IES.
2. It maximizes the cooperation between user and IES: users insert annotations in texts as part of their normal work and at the same time they train the IES. The IES in turn simplifies the user's work by inserting annotations similar to those inserted by the user in other documents; this collaboration is made timely and effective by the fact that the IES is retrained after each document annotation.
3. The modality in which the IES suggests new annotations is fully tunable and therefore easily adaptable to the specific user needs/preferences.
4. It allows the IES to be trained in a timely way without disrupting the user's pace with learning sessions that consume a large amount of CPU time (and therefore either stop or slow down the annotation process).

Future work will consider a better formalization of the way in which Melita's settings are turned into IES settings. The currently adopted solution is still under evaluation and needs further development and experiments, as currently it is completely arbitrary and risks being opaque to the user with respect to the way in which the IES is influenced.

8. ACKNOWLEDGEMENT

The current work has been carried out in the framework of the AKT project (Advanced Knowledge Technologies, http://www.aktors.org), an Interdisciplinary Research Collaboration (IRC) sponsored by the UK Engineering and Physical Sciences Research Council (grant GR/N15764/01).

[7] C. A. Thompson, M. E. Califf and R. J. Mooney: "Active Learning for Natural Language Parsing and Information Extraction", Proceedings of the Sixteenth International Machine Learning Conference (ICML-99), Bled, Slovenia, pp. 406-414, June 1999.
[8] F. Ciravegna and D. Petrelli: "User Involvement in Adaptive Information Extraction: Position Paper", in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), Seattle, August 2001.
[9] D. Maynard, V. Tablan, H. Cunningham, C. Ursu, H. Saggion, K. Bontcheva and Y. Wilks: "Architectural Elements of Language Engineering Robustness", Journal of Natural Language Engineering, Special Issue on Robust Methods in Analysis of Natural Language Data, 2002, forthcoming.
[10] F. Ciravegna: "Adaptive Information Extraction from Text by Rule Induction and Generalisation", in Proceedings of 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, August 2001.
[11] F. Ciravegna: "(LP)², an Adaptive Algorithm for Information Extraction from Web-related Texts", in Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-
AKT 01), Seattle, August, 2001 involves the Universities of Aberdeen, Edinburgh, Sheffield, [12] N. Kushmerick, D. Weld and R. Doorenbos: `Wrapper induction for Southampton and the Open University (www.aktors.org). AKT is information extraction', Proc. of 15th International Conference on a multimillion pound six year research project that started in 2000. Artificial Intelligence, IJCAI-97. Its objectives are to develop technologies to cope with the six main challenges of knowledge management: acquisition, [13] R. S. Mickalski, I. Mozetic, J. Hong and H. Lavrack: The multi modelling, retrieval/extraction, reuse, publication and purpose incremental learning system AQ15 and its testing application maintenance. The work on annotation interfaces described in this to three medical domains’, in Proceedings of the 5th National work would not have been possible without the discussions and Conference on Artificial Intelligence, Philadelphia: Morgan Kaufmann. interactions with Enrico Motta, Mattia Lanzoni and John [14] F. Ciravegna: “Challenges in Information Extraction from Text for Domingue (Open University), Steffen Staab and Siegfried Knowledge Management”, IEEE Intelligent Systems and Their Handschuh (University of Karlsruhe). Amilcare uses Annie for Applications, November 2001. preprocessing (www.gate.ac.uk). Thanks to the Gate group for [15] S. Soderland: `Learning information extraction rules for semi- providing Annie and for help in integrating it into Amilcare. structured and free text', Machine Learning, (1), 1-44, 1999. [16] A. Douthat, “The message understanding conference scoring B i bl i o g r a p hy software user's manual”, in [17] [1] M. E. Califf, D. Freitag, N. Kushmerick and I. Muslea (eds.): [17] 7th Message Understanding Conference Proceedings, MUC-7. AAAI-99 Workshop on Machine Learning for Information Extraction http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ July 19, 1999, Orlando Florida (http://www.isi.edu/~muslea/RISE/ML4IE/) [2] F. 
Ciravegna, R. Basili, R. Gaizauskas (eds.) ECAI2000 Workshop on Machine Learning for IE, Berlin, 2000, (www.dcs.shef.ac.uk/~fabio/ecai-workshop.html) [3] F. Ciravegna, N. Kushmenrick, R. Mooney and I. Muslea (ed.), IJCAI-2001 Workshop on Adaptive Text Extraction and Mining held in conjunction with the 17th International Conference on Artificial Intelligence (IJCAI-01), Seattle, August, 2001 (http://www.smi.ucd.ie/ATEM2001/) [4] J.B. Domingue, M. Lanzoni, E. Motta, M. Vargas-Vera and F. Ciravegna: “MnM: Ontology driven semi-automatic or automatic support for semantic markup”, submitted paper. [5] BADGER Information Extraction (IE) Software, http://www- nlp.cs.umass.edu/software/badger.html [6] S. Handschuh, S. Staab and F. Ciravegna: “S-CREAM - Semi- automatic CREAtion of Metadata”, submitted paper.
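
The scoring rule used in Section 6, where a partially correct suggestion counts as half correct, can be illustrated with a minimal sketch. The class and function names below (SuggestionCounts, precision, recall, f_measure) are hypothetical and are not part of Melita or Amilcare; the formulas are one plausible reading of the footnote, not the scorer actually used in the experiments. Applied to the stime counts reported after 20 training texts, this reading gives a precision of about 0.85, in line with the reported 84, and a recall of about 0.64, slightly above the reported 61, so the original scorer presumably differs in some detail not stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestionCounts:
    """Outcome counts for one tag (hypothetical helper, not Melita code)."""
    possible: int  # annotations present in the manually tagged test texts
    correct: int   # suggestions that match a manual annotation exactly
    partial: int   # suggestions that only partially overlap one
    wrong: int     # suggestions with no matching manual annotation

    @property
    def proposed(self) -> int:
        # Every suggestion made by the IES falls in exactly one category.
        return self.correct + self.partial + self.wrong

    @property
    def credit(self) -> float:
        # Assumption: a partial match counts as half a correct one.
        return self.correct + 0.5 * self.partial


def precision(c: SuggestionCounts) -> float:
    """Share of proposed suggestions that are (half-)correct."""
    return c.credit / c.proposed if c.proposed else 0.0


def recall(c: SuggestionCounts) -> float:
    """Share of the manual annotations that receive a (half-)correct suggestion."""
    return c.credit / c.possible if c.possible else 0.0


def f_measure(c: SuggestionCounts) -> float:
    """Balanced harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if p + r else 0.0


if __name__ == "__main__":
    # stime counts reported in Section 6 after training on 20 texts.
    stime = SuggestionCounts(possible=491, correct=303, partial=18, wrong=47)
    print(f"proposed={stime.proposed}")
    print(f"P={precision(stime):.3f} R={recall(stime):.3f} F={f_measure(stime):.3f}")
```

Keeping the four counts separate, rather than storing only the final percentages, makes it easy to try alternative scoring conventions (for example, ignoring partial matches) when comparing against published figures.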