Can ProMED-mail Bootstrap Blogs? Automatic Labeling of Victim-reporting Sentences

Avaré Stewart and Kerstin Denecke
L3S Research Center, Appelstr. 9A, 30169 Hannover, Germany

Abstract. Due to the proliferation of social media data and user-generated content, monitoring trends in this data or using it in other scenarios becomes increasingly interesting. Our research focuses on the extraction of information on health events from user-generated content, with the objective of supporting Epidemic Intelligence. Specifically, we describe and evaluate a method for identifying sentences relevant for event extraction. Labeled data is unavailable for this task, and manual annotation is expensive. Therefore, in order to reduce the number of labeled examples required, we apply a bootstrapping algorithm to this task. In more detail, we study the suitability of a classifier trained on one text type (e-mails) for the classification of texts of another text type (blogs).

1 Introduction

The spread of infectious diseases, and the increased public concern it causes, raises the necessity of having health surveillance systems on hand for detecting disease outbreaks as early as possible. All activities related to the early identification of potential health hazards, their verification, assessment, and investigation, with the objective of recommending public health control measures, are summarized by the term Epidemic Intelligence [1]. A public health event is an event that creates a need for action by public health officials, for instance an outbreak of an infectious disease or a single case of a very rare infectious disease. A public health event can be described by event information on who was infected by what, where, and when, i.e., information on a victim, a location, a time, and a disease. Besides traditional surveillance systems that monitor indicators such as death rates, drug prescriptions, and the occurrence of viruses, event-based systems have been developed.
They extract and analyse outbreak-related information from text in electronic sources such as e-mails, official reports, and news wires, and present the results to the user. Social media data and user-generated content (e.g., weblogs, Twitter messages) have so far remained unconsidered for Epidemic Intelligence. In our research, we focus on this text type.*

* This work has been done within the M-Eco project, partly funded by the European Commission under 247829.

MEDEX 2010 Proceedings

The problem of detecting health events can be decomposed into three main sub-problems:

1. Annotation: Identifying sentences containing information on an event.
2. Event Extraction: Identifying relevant facts to describe the event.
3. Event Aggregation: Aggregating information on the same event that has been reported in different sources.

In this paper, the focus is on identifying pieces of text relevant for public health event detection (the annotation problem). State-of-the-art approaches for detecting public health events rely either upon a huge number of extraction patterns or upon a large set of labeled data in order to detect the relevant information-bearing sentences. However, when extracting such information specifically from social media, additional challenges are faced which make these approaches inadequate [2], including the use of specific language and different styles of writing. Large amounts of social media data are available; the data is noisy to a large extent, often opinionated, and can contain irrelevant information. We address these challenges using semi-supervised learning in the context of event extraction. In particular, we consider the sub-problem of identifying sentences relevant to the detection of health events on infectious diseases. We study the suitability of a classifier trained on one text type for the classification of texts of another.
In more detail, patterns acquired from a manually curated source are transferred to bootstrap the classifier for medical blogs. To reduce the manual work of labeling training data, we propose an approach for automatically labeling a training set that is based upon research results known from the field of text summarization. The automatically determined training set is then used to learn a classifier for sentence classification.

The contributions of this work are (1) the presentation of a health event detection framework, (2) the introduction of a cross-corpus bootstrapping framework to identify relevant sentences, and (3) a study of related corpora for the task of bootstrapping.

2 Related Work

Our final goal is to extract public health events for outbreak detection. Existing systems for this task rely upon the enumeration of possible types of victim-reporting patterns (e.g., MediSys [3], HealthMap [4], BioCaster [5]). MediSys, for example, uses manually specified keywords in different languages to identify news articles reporting on health events. In these systems, little or no attention is devoted to using blogs and other forms of user-generated content as sources of information. Other systems rely upon the linguistic, interpretive, and analytical expertise of analysts to filter and extract information about health threats (e.g., GPHIN [6]). Given the dynamic nature of social media in general and of weblogs in particular, a pattern-based system is not practical, since the number of patterns needed may be prohibitively large. Further, the large amount of user-generated content available requires the application of automatic methods. In this paper, an approach based on semi-supervised learning for identifying relevant sentences for event extraction is introduced.

Bootstrapping approaches are semi-supervised learning approaches and can be grouped into self-training and co-training approaches.
Semi-supervised self-training has been applied in the field of sentiment analysis (e.g., subjectivity analysis [7]). Didaci and Roli [8] compare several semi-supervised learning methods for multiple classifier systems. Chen and Ji [9] present a framework where bootstrapping is used for event extraction in a cross-lingual setting: an event extraction system in one language is bootstrapped by exploring new evidence from a system in another language. In this work, we apply an existing bootstrapping algorithm to the task of sentence classification. To the best of our knowledge, the particular task of classifying sentences of user-generated content for health event detection has not been considered before. Further, bootstrapping has not been used in this context before. We study the quality of such an approach and report the results.

Methods which make use of unannotated text and intra-document information have emerged as important approaches for information gathering. Such systems typically rely upon the redundancy present on the web, and assume that facts with multiple mentions are more reliable [3]. Zhen and Li [10] consider the problem of cross-domain text classification in the news domain; they propose a support-vector-based semi-supervised algorithm to solve this problem. Yangarber et al. [11] use cross-document analysis to support building a consistent and robust fact base about epidemics or disease outbreaks. In our work, cross-corpora analysis is applied to detect victim-reporting sentences. Cross-classification is applied to train a classifier on a data set of an auxiliary domain in order to classify data of the target domain. Besides reporting quality results of the approach, we study the conditions under which it is possible to use such a learner.

3 Approach

The objective of our approach is to identify information-bearing sentences in social media, when faced with the problem of large amounts of unlabeled data.
In particular, we consider a sentence relevant when it contains information on disease outbreaks (see Section 4 for more details). Given the characteristics of social media data, a supervised classification approach seems better suited than pattern-based approaches that rely upon extensive manual work. Our objective is to reduce the manual work of labeling examples as much as possible. For this reason, we make use of data from an auxiliary domain to determine labeled training examples. In more detail, for an auxiliary domain, we build a classifier in a bootstrap process. This classifier is then applied to label sentences of the target domain. To avoid manual labeling, we introduce an approach to automatically label examples of the auxiliary domain. The single processing steps are described in more detail in the next sections.

3.1 Automatic Labeling

For learning the classifier, a related and complementary auxiliary data source is used. This related source typically uses a terse and more compact style of prose. It also exhibits structural properties which allow us to apply a weak form of labeling, i.e., to automatically label selected sentences as positive or negative examples with respect to disease reporting, based on their position in the document. This is in contrast to the sentences in the target domain, for which we have a less obvious structural pattern from which we could weakly label sentences. In more detail, the idea is to use sentences at the beginning of a document as positive examples. This idea has already proven successful in the field of text summarization, where sentences at the beginning of a document are used to produce a document summary.

Let D = {d1, d2, ..., dj} be the set of documents in an auxiliary corpus, where each document di ∈ D consists of one or more sentences. Further, let T = {t1, t2, ..., tm} be a set of feature types used to represent the sentences of D.
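The positional weak-labeling idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: documents are represented as lists of sentence strings, lead sentences become positive examples and, symmetrically, closing sentences become negative ones (the top-N/bottom-N scheme used later in the paper); the function name `weak_label` is hypothetical, not from the paper.

```python
def weak_label(documents, n=1):
    """Label the first n sentences of each document as positive (1)
    and the last n as negative (0), based purely on position."""
    labeled = []
    for sentences in documents:  # each document: a list of sentence strings
        if len(sentences) < 2 * n:
            continue  # too short for disjoint positive and negative slices
        labeled += [(s, 1) for s in sentences[:n]]   # positional positives
        labeled += [(s, 0) for s in sentences[-n:]]  # positional negatives
    return labeled

doc = ["A cholera outbreak was reported in region X.",
       "Officials confirmed 12 cases.",
       "Subscribe to our mailing list for updates."]
print(weak_label([doc]))
```

With n = 1 this yields exactly one positive (the lead sentence) and one negative (the closing sentence) per sufficiently long document.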
We keep the approach general with respect to the set of features to be used. Feature types can include bag-of-words, bag-of-concepts, part of speech, or a typed dependency structure. Given a set of documents D and a surrogate representation for a given type t applied to the sentences in D, a corpus can be modeled as a sentence database St = {s_11^t, s_12^t, ..., s_jk^t}, where s_jk^t represents the k-th sentence of the j-th document under feature type t. Further, we automatically label the top-N sentences in the database as positive cases and the bottom-N as negative examples, for a threshold value of N.

At this stage, we also introduce the concept of sublanguages. Given a sentence database St of type t for an auxiliary domain, we define the auxiliary corpus to be a sublanguage for the target corpus if a self-trained classifier built from the auxiliary domain performs well on the unlabeled examples in the target domain, within some threshold tolerance. In our experiments, we study whether the data set from the auxiliary domain is a sublanguage for the target corpus.

3.2 Cross-Corpora Bootstrapping

Using the previously described approach to automatically label the sentences of the auxiliary domain, we determine a set of labeled examples to be used to train a classification model via bootstrapping. We consider the examples produced by the automatic labeling approach as weakly labeled, i.e., there is some rationale for selecting them as positive or negative examples, but it is unclear how reliable the labels are. To reduce the bias produced by this uncertainty, we produce an improved classifier through a bootstrap process. The bootstrap process is depicted in Figure 2, and the algorithm is described in more detail in Figure 1.

3.3 Classifying Sentences

The previous step produces a classifier trained on material of the auxiliary domain.
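The bootstrap process of Section 3.2 (given as pseudocode in Fig. 1) can be approximated as follows. This is a sketch under simplifying assumptions, not the authors' implementation: a single classifier is emulated by a toy keyword-based confidence function, an absolute confidence threshold is used (the paper allows absolute or relative thresholds), and all names are illustrative.

```python
def bootstrap(weak, labeled, confidence, pool_size=3, threshold=0.7,
              max_iterations=10):
    """Iteratively move confidently labeled examples from the weakly
    labeled set into the labeled set, refilling a fixed-size pool."""
    weak = list(weak)
    pool = [weak.pop() for _ in range(min(pool_size, len(weak)))]
    for _ in range(max_iterations):
        if not pool:
            break
        # "Train" on the labeled data and score the pool; confidence()
        # stands in for the trained classifier's confidence score.
        scored = [(confidence(x, labeled), x) for x in pool]
        kept = [x for s, x in scored if s >= threshold]
        if not kept and not weak:
            break  # nothing confident left and nothing to refill with
        labeled = labeled + kept
        pool = [x for s, x in scored if s < threshold]
        while weak and len(pool) < pool_size:  # keep the pool size constant
            pool.append(weak.pop())
    return labeled

# Toy confidence function: high if the sentence shares an outbreak keyword.
conf = lambda x, labeled: 1.0 if ("case" in x or "outbreak" in x) else 0.2
weak = ["an outbreak occurred", "new cases reported",
        "weather was fine", "stock prices rose"]
print(bootstrap(weak, ["12 cases confirmed"], conf))
```

Each iteration mirrors one pass of the loop in Fig. 1: score the pool, promote confident examples into the labeled set, and refill the pool to its constant size p.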
We have chosen Support Vector Machines as the classification algorithm. Finally, this classifier is applied to the sentences of the target domain and labels them as positive or negative, i.e., as victim-related or not victim-related, respectively.

Input:
  W: a set of weakly labeled examples from an auxiliary domain
  L: a set of labeled examples from the auxiliary domain
  Ci: a set of classifiers
Initialization:
  Initialize the set L by selecting the weakly labeled candidates from W.
  Create a pool W' of examples by choosing p examples from W.
Loop until a condition is satisfied (e.g., W = ∅, or a threshold number of iterations is reached):
  - Train each classifier Ci on L, and label the examples in W'
  - For each classifier Ci, select the most confidently labeled examples (e.g., those whose confidence score is above an absolute or relative threshold for the current run) and add them to L
  - Refill W' with examples from W, keeping the size of W' constant at p

Fig. 1. Bootstrapping Algorithm

Fig. 2. Weakly Labeled Bootstrap: Overview

4 Experiments

In this preliminary work, the experimental goals are twofold. First, we are interested in knowing how well a classifier based on weak labeling identifies sentences bearing information on public health events in blog posts. Second, we are interested in characterizing the conditions under which the auxiliary source can be considered a sublanguage for the noisier corpus. We characterize noise based on the length of the sentences and the type of entities appearing in them. We conduct experiments from two perspectives of the bootstrapping process, namely the training and testing phases. The experimental setting and results are described in the next section. At the end, we discuss the approach.

4.1 Experimental Setting

In our experiments, we use data collected from ProMED-mail [12] as the auxiliary domain.
ProMED-Mail (referred to as Promed) is a global electronic reporting system which lists outbreak reports of emerging infectious diseases. It constitutes a terse source of information about epidemic events. Similarly, the World Health Organization reports disease outbreak news on its webpage (http://www.who.int/csr/don/en/index.html). This data is considered as yet another moderated data source, referred to as WHO. The data of the target domain is provided by the AvianFluDiary (http://afludiary.blogspot.com/). All data was collected directly from the websites. Summary statistics for the data are shown in Table 1.

Source        | Years     | No. of Documents | No. of Sentences
AvianFluDiary | 2006-2009 | 4249             | 100890
ProMed-Mail   | 2002-2009 | 13369            | 22170
WHO           | 1996-2009 | 1531             | 16213

Table 1. Data Collection for Experiments

The goal of our experiments is to identify information-bearing sentences in AvianFluDiary blogs. As can be seen in Table 1, even for a single blog, spanning less than half the number of years of each moderated source, the number of blog sentences is still over three times greater. We therefore seek to evaluate the effectiveness of a weakly labeled classifier at detecting relevant disease-reporting sentences in such a voluminous and more verbose data source. To do so, we train a classifier on data material of our auxiliary domain (Promed), based on the structural SVM implementation of SVM-TK v1.2 [13]. The features used in the classifier are the tree structure of the parts of speech of each sentence. To create these features, the text was normalized to remove extraneous symbols. Sentence splitting and parse trees were created with the Stanford Parser. Named entities were recognized by applying OpenCalais (http://www.opencalais.com/). As initial training documents for the classifier, we created a weakly labeled set of examples for the auxiliary domain using the automatic labeling approach introduced in Section 3.1.
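The preprocessing just mentioned (normalizing the text and splitting it into sentences) was done with the Stanford Parser in the paper; the regex-based stand-in below is a rough, purely illustrative approximation, and both function names are hypothetical.

```python
import re

def normalize(text):
    """Remove extraneous symbols and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s.,;:!?'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    """Naively split on sentence-final punctuation followed by a space."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

raw = "H5N1 was confirmed* in two farms.  Officials culled 3,000 birds."
print(split_sentences(normalize(raw)))
```

A real pipeline would use a proper statistical sentence splitter and parser, since blog text contains abbreviations, ellipses, and markup that defeat a naive regex.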
In more detail, the top-1 sentences in the sentence database were labeled as positive cases and the bottom-1 as negative examples for our classification task. Observing that not all positive and negative sentences were of equal quality, we prefiltered all top-1 sentences based on sentence length. We used default values, where the minimum and maximum sentence lengths were set to 20 and 200, respectively.

To test the classifier, a total of 5,029 (729 positive and 4,599 negative) AvianFluDiary sentences were hand labeled. The sentences were labeled with respect to the task of identifying victim-reporting sentences. The definition we used for victim reporting is based on the MedISys Disease Incidents template¹. The template relates a disease, time, location, status, cases, a description, and a URL to a natural-language text. We therefore label a sentence as a victim-reporting sentence if it contains D = {disease} in union with I = {time, location, status, cases}. Status reports the condition of a victim (e.g., hospitalized, dead), and cases refers to the number and type of victims (e.g., bird, child, etc.). Further, we labeled all sentences needed to report a single event as positive cases with the value 1; all other sentences are labeled with 0.

4.2 Experimental Results

The objectives of the experiments were twofold: determining the quality of the introduced classification approach (Part I), and characterizing conditions for sublanguages (Part II).

Part I: Bootstrapping with Automatically Labeled Data. For the first part of the experiments, we were interested in determining the quality of a classifier at identifying information-bearing sentences when it is trained on weakly labeled training material of an auxiliary domain. We tested the classifier on the AvianFluDiary data. The bootstrap learner takes into account different scenarios of the bootstrap process.
- Scenario 1: Default settings based on the values used by the authors for a similar task (auxiliary domain: Promed)
- Scenario 2: Applied on bottom-1 sentences only that are additionally filtered (auxiliary domain: Promed)
- Scenario 3: Scenario 1 with WHO as auxiliary data
- Scenario 4: Sentences are filtered based on the presence of named entities (auxiliary domain: Promed)

In Scenario 1, the pool size was set to 15, and 50 sentences per iteration were used. The stopping condition was reached when 2,000 items in the weakly labeled set had been labeled (model size). A classified sentence was selected as confident if its confidence value exceeded 70% of the maximum confidence value relative to the given iteration. An initial pool of 50 positive and 50 negative sentences was used. In Scenario 2, similar parameters are used, except that the model size is reduced to 1,700. In Scenario 3, WHO data is used as an auxiliary source of data, to see whether applying another weakly labeled sentence database yields different results. Finally, in Scenario 4, we filter the sentences based on the presence of named entities which contain both a medical condition and a location.

¹ http://medusa.jrc.it/medisys/helsinkiedition/all/home.html

For the different scenarios, the precision, recall, and accuracy values of the learner on the AvianFluDiary data are examined (see Table 2).

Scenario | Precision | Recall | Accuracy
1        | .77       | .45    | .57
2        | .71       | .66    | .69
3        | .75       | .22    | .34
4        | .80       | .40    | .53

Table 2. Bootstrap Results per Scenario

It can be seen that the different scenarios achieve differing accuracy values. The best accuracy of .69 is obtained for Scenario 2, while the worst accuracy of .34 occurs for Scenario 3. Precision values lie between .71 and .80 for all four scenarios. The recall is considerably lower, with values between .22 and .66. We discuss these results in Section 4.3.
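For reference, the values in Table 2 follow the standard confusion-matrix definitions of precision, recall, and accuracy. The counts in the sketch below are made up purely for illustration and are not the paper's actual results:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Illustrative counts only (not the paper's actual confusion matrix).
p, r, a = metrics(tp=70, fp=30, fn=30, tn=70)
print(p, r, a)  # 0.7 0.7 0.7
```

The pattern in Table 2 (precision well above recall) corresponds to a classifier with few false positives but many false negatives, i.e., it misses victim-reporting sentences more often than it mislabels irrelevant ones.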
Part II: Analysing Sublanguage Conditions. In the second part of the experiments, we are interested in characterizing the conditions under which the language of an auxiliary data source can be considered a sublanguage for the noisier corpora. We account for noise by filtering the length of the sentences in the target data and by filtering based on the type of entities appearing in the sentences. The training sentence lengths were filtered with a minimum sentence length of 30 and a maximum sentence length of 200. The results are shown in Figure 3. When varying the minimum sentence length, the best precision of more than 80% is achieved for sentences with a minimum length of 50; for shorter sentences, the precision drops below 80%. Further, when varying the maximum length of sentences, the best results are achieved for a maximum sentence length of fifty words. In the next section, these results are interpreted and discussed.

4.3 Discussion

In Part I of the experiments (see Section 4.2), an important observation is that, given a threshold value of .7 for precision, ProMed-Mail can in fact be used in an automatic way to build a bootstrap classifier for blogs. This implies that by using the top-1 sentences as positive cases, we are capable of identifying relevant health-related sentences in blog postings. Given that no human effort was incurred for labeling a training set, a fairly good classifier can be built based on weak labeling for identifying sentences bearing information on public health events.

Fig. 3. Part II: Analysis of Sublanguage Conditions — (a) fixed maximum sentence length, (b) fixed minimum sentence length

We also notice that, except for Scenario 2, in which additional filtering was applied to the bottom-1 sentences, the recall tends to be quite low.
This would suggest that although the top-1 positive cases perform well, the bottom-1 negative examples used in training are not representative enough to distinguish the negative examples present in the blogs. In light of this, we propose that further experiments are needed to better characterize the conditions under which the auxiliary source can be considered a sublanguage for the noisier corpora for identifying negative cases.

In Part II of the experiments (Section 4.2), we notice that although most sentence-length settings yield the same precision and recall, in a single range for both the fixed upper and the fixed lower bound we achieve a noticeable peak precision above 80%. This suggests that such an approach is sensitive to the length of sentences in the test data.

In summary, the introduced approach for sentence classification has proven to provide acceptable results. In contrast to existing approaches, the main benefits of the method presented here are: (a) avoidance of manual labeling of training material, and (b) reduction of the bias produced by automatic labeling through bootstrapping for learning the classifier. The presented approach is new for several reasons: bootstrapping has not yet been applied for learning a classifier in a cross-corpora setting for labeling sentences, and the problem of victim-reporting sentence classification has until now only been considered using pattern-based approaches. We introduce a weakly supervised approach to address this problem. Further, the automatic labeling of examples and its combination with bootstrapping to reduce uncertainty has not been reported and analysed before.

5 Conclusion

In this work, an approach is described to identify disease- and victim-reporting sentences from blogs to support epidemic intelligence.
Challenges given by the characteristics of blogs (mainly noise and data abundance) are addressed by using automatic labeling of data collected from moderated sources to learn a classifier for identifying relevant sentences in blogs. A bootstrap process is applied to learn a classifier and to filter noisy and irrelevant sentences. The results show that the approach taken here is quite effective at sentence-level filtering in blogs. Without manual effort, we are able to achieve a precision as high as .80 and a recall of .66. In the future, we will perform more extensive experiments, particularly using more blogs. We also intend to experiment with increasing values for top-N, and to compare this to bootstrapping on the blog set alone and to using a hybrid approach.

References

1. Paquet, C., Coulombier, D., Kaiser, R., Ciotti, M.: Epidemic intelligence: A new framework for strengthening disease surveillance in Europe. Euro Surveill. 11(12) (2006)
2. Moens, M.F.: Information extraction from blogs. In: Jansen, B.J., Spink, A., Taksa, I. (eds.): Handbook of Research on Web Log Analysis, IGI Global (2009) 469–487
3. Yangarber, R.: Verification of facts across document boundaries. In: Proceedings of the International Workshop on Intelligent Information Access (2006)
4. Freifeld, C.F., Mandl, K.D., Reis, B.Y., Brownstein, J.S.: HealthMap: Global infectious disease monitoring through automated classification and visualization of internet media reports. In: Proceedings of the International Workshop on Intelligent Information Access (2006)
5. Collier, N., et al.: BioCaster: Detecting public health rumors with a web-based text mining system. Bioinformatics, Oxford University Press (2008)
6. Mykhalovskiy, E., Weir, L.: The Global Public Health Intelligence Network and early warning outbreak detection: A Canadian contribution to global public health. Can J Public Health 97(1) (2006) 42–44
7.
Wang, B., Spencer, B., Ling, C., Zhang, H.: Semi-supervised self-training for sentence subjectivity classification. In: Canadian AI 2008, LNAI 5032 (2008) 344–355
8. Didaci, L., Roli, F.: Using co-training and self-training in semi-supervised multiple classifier systems. In: Yeung, D.-Y., et al. (eds.): SSPR & SPR 2006, LNCS 4109 (2006) 522–530
9. Chen, Z., Ji, H.: Can one language bootstrap the other: A case study on event extraction. In: SemiSupLearn '09: Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, Morristown, NJ, USA, Association for Computational Linguistics (2009) 66–74
10. Zhen, Y., Li, C.: Cross-domain knowledge transfer using semi-supervised classification. In: AI '08: Proceedings of the 21st Australian Joint Conference on Artificial Intelligence, Berlin, Heidelberg, Springer-Verlag (2008) 362–371
11. Yangarber, R., et al.: Combining information about epidemic threats from multiple sources. In: RANLP 2007, Borovets, Bulgaria (2007)
12. Madoff, L.C.: ProMED-mail: An early warning system for emerging disease. Clinical Infectious Diseases 39(2) (July 2004) 227–232
13. Moschitti, A.: A study on convolution kernels for shallow semantic parsing. In: ACL '04, Morristown, NJ, USA, Association for Computational Linguistics (2004) 335