Mining Trends in Texts on the Web

Olga Streibel (streibel@inf.fu-berlin.de)
Networked Information Systems, Free University Berlin
Königin-Luise-Str. 24-26, 14195 Berlin, Germany

PhD supervisor: Prof. Dr.-Ing. Robert Tolksdorf, Networked Information Systems, Free University Berlin

Keywords: trend mining, machine learning, knowledge acquisition, knowledge integration, semantic learning, tagging, folksonomy

From online news and blog articles, a human can often deduce the information and knowledge needed for the prediction of market movements or sociological trends. However, this recognition and comprehension process is very complex and requires experience as well as some context knowledge about the domain in which trends are to be detected. In order to support human experts in trend analysis, I propose an automatic trend mining method based on a knowledge-integrating learning approach.

Problem statement

"Many people have been led to believe that trends are about intuition. This is because the majority of the people who work with trends find it difficult to explain why something will happen the way they say it will. The explanation often boils down to "because I think so." Some people do seem to be able to predict what will happen based on their own intuition. Unfortunately, there are too many cases in which people's intuition has obviously been mistaken (...)" [26]

Detecting trends from the sociologists' point of view is an analytical method for observing changes in people's behavior over time with regard to "six attitudes towards trends", defined as trendsetters, trend followers, early mainstreamers, mainstreamers, late mainstreamers and conservatives (see "The Diamond-Shaped Trend Model" in [26]). Consequently, trends are certain patterns of people's behavior and lifestyle that evolve over a focused time interval, and the word trend refers to a process of change.

Detecting trends from the statistical point of view is based on trend analysis of time-series data with two goals of analysis: "modeling time series (i.e. to gain insight into the mechanisms or underlying forces that generate the time series) and forecasting time series (i.e., to predict the future values of the timeseries variables)" (p. 490, [10]). In these terms, a trend refers to the general direction in which a time-series graph, based on numeric data, is moving over a focused interval of time.

Detecting trends from text collections refers to the detection of emerging topics in texts. In terms of textual data mining, a trend in texts is defined as "a topic area that is growing in interest and utility over time" [13], whereas a topic in terms of Topic Detection and Tracking (TDT) research [3] is "defined to be a set of news stories that are strongly related by some seminal real world event".

All of these points of view show different dimensions of trend analysis research. However, they have one thing in common: they observe patterns of change that are based on certain variables (i.e., people, numbers, words) and that lead to a general change, the emerging trend, in the system that depends on these variables. As already defined in my trend ontology approach [23], this research uses trend mining as a general term covering trend detection, trend recognition and trending analysis. It can refer either to the detection of emerging topic areas by text analysis or to the detection of trends by numeric data analysis, as in the case of stock values. However, this work focuses only on textual data available on the Web, i.e. online news and blogs, and on learning from this data with the inclusion of related background knowledge in order to capture and explain trends. In general, when using the term trend in texts I refer to "emerging topic areas" (see also Section 4), whereas the objective of trend mining is "to provide an alert that new developments are happening in a specific area of interest in an automated way" [13].

Interesting approaches have been developed in the field of trend mining on texts (see the following section), but they still lack the integration of expert knowledge into the process of trend recognition. Such knowledge is crucial for proper trend mining, and the lack of methods that integrate expert knowledge is a research gap. This thesis aims at closing this gap. It treats trend detection as a complex learning task based on learning and recognizing complex relations and dependencies in a given domain with regard to the time dimension. I focus on a learning method that is able to integrate expert knowledge in order to automatically recognize trends in text collections.

Considering that "In general, trending analysis of textual data can be performed in any domain that involves written records of human endeavors whether scientific or artistic in nature." [20], trend mining based on texts is useful for many application domains, e.g. medical diagnosis, opinion mining, market monitoring and stock market analysis. Given the increasing availability of information on the Internet and the resulting need for intelligent data analysis, it has become an increasingly important research topic in recent years. Besides contributing to trend mining research, this thesis can have an important impact on Machine Learning and on the Semantic Web.

Main questions of the thesis

Two main research questions are important for this thesis:

1) How can existing machine learning approaches for trend mining be changed into knowledge-integrating learning approaches with regard to the development of the Semantic Web?
2) How can trend knowledge be acquired and formalized?

The main research projects in the field of trend mining are described in Topic Detection and Tracking (TDT) research [3] and in Emergent Trend Detection (ETD) [5]. Regarding work relevant for this thesis, I first concentrate on the research done in the field of trend mining with a focus on machine learning algorithms, since they seem to be crucial for automatic trend mining. One of these works, the EAnalyst system described in [15], showed that emerging trends can be determined and detected early from numeric data as well as from texts. EAnalyst has been designed and implemented as a general architecture for the association of news stories with trends. The system collects hybrid data (financial time series and time-stamped news stories), redescribes the time series data into "high-level features", called trends, and aligns each trend with time-stamped news stories. These news stories serve as the training set for learning a language model that captures the statistics of word usage patterns in the stories. This language model, learnt for every trend type, helps to monitor a stream of incoming news stories: the model processes new stories according to the learnt hypothesis. The authors define the task of trend detection as a special case of Activity Monitoring as introduced by [7]. This research supports the general precondition of my thesis: it is possible to automatically recognize trends by analyzing texts. In contrast to EAnalyst, I do not elaborate on text stream monitoring but focus more on the recognition and comprehension process for trend mining.
Emergent Trend Detection (ETD) systems, i.e. systems concerned with the detection of trends, are characterized in [13] by the following aspects, which are important for creating a trend analysis system: input data and attributes, learning algorithms, and visualization. The most relevant perspective of comparison for this work is the learning algorithms. According to the system descriptions in [13] and with regard to the prototypes [27][17][6], the following learning algorithms have proven useful for trend mining:

- combined "hypothesis testing"-based methods (TimeMines [24])
- single-pass clustering (New Event Detection [4])
- sequential pattern matching and shape query processing (Patent Miner [16][1])
- feed-forward, backpropagation NN, C4.5 and SVM (Hierarchical Distributed Dynamic Indexing [20], Wüthrich [27])
- k-NN classifier (Wüthrich [27])
- regression analysis (Wüthrich [27])

Besides, there are many research works related to trend mining, e.g. trend detection based on a fuzzy temporal profile model [8], modeling bursty streams using an infinite-state automaton [12], a finite mixture model for tracking the dynamics of topic trends [18], and clustering approaches [14][3].

Concerning both trend mining based on texts and enhanced text analysis, there are many related scientific and commercial projects on the Internet, as well as services that are to some extent relevant for this work: GoogleTrends¹, BlogPulse², OpenCalais³. Two interesting research projects are relevant for this thesis: GIDA (Generic Information-based Decision Assistant) [9][2] and its successor TREMA (Trend Mining, Fusion and Analysis of multimodal Data) [19], which concentrated on the fusion of multimodal market data in order to mine trends in financial markets (GIDA, TREMA) and in market research (TREMA). Several projects concerned with lightweight ontologies and extended vocabularies are relevant for the trend knowledge representation part of this thesis, in particular: ConceptNet⁴ and OpenMind⁵ of MIT, MoaT⁶, WordNet⁷, SentiWordNet⁸, Wortschatz Uni Leipzig⁹, DWDS¹⁰, SKOS¹¹ and SCOT¹².

Regarding the relevant work outlined above and according to the two research questions, this research focuses on the development of a semantic learning approach for automatic trend mining in texts on the Web. It also proposes the use of a trend ontology and elaborates on the extreme tagging approach [25] for knowledge acquisition in the trend mining task. However, the main goal of this work is neither to predict stock prices based on news analysis nor to create an artificial trader for market trading based on text analysis.
This research is also not about a general trend analysis system, and it does not study the influence of Web news on emerging trends (it does not take into account the distinction between trend creator news, trend follower news and mainstream news). The general assumptions for this thesis are: context is crucial for successful trend mining; collective associations such as user tags from folksonomies enable the creation of context knowledge; and statistical learning can be enhanced with background knowledge using the knowledge representation approach from the Semantic Web.

General approach

This thesis is anchored in Information Systems research, and the Design Science paradigm [22][11] is the methodology that provides the scientific framework for my research. Two main research issues are the focus of my thesis: a knowledge-integrating learning approach for trend mining based on Machine Learning, and the representation of trend knowledge based on the Semantic Web approach. Concentrating on them, I create my artefact (in terms of Design Science) and test and evaluate my trend mining approach.

So far, I have first conducted an extensive literature review comparing the following general aspects of related projects on trend mining: trend definitions, general trend analysis approaches, applied machine learning methods and document corpora. With regard to these issues I elaborated a general definition of trends in text (this gives the main setting for defining the learning problem in the next steps). Furthermore, I implemented static storage, parsing and partial preprocessing of a document corpus that consists of about 200,000 business news articles in German from the years 2006-2007. I also elaborated on the trend ontology approach [23] and on the knowledge acquisition approach using tag tagging [25].

In the next steps, I have to concentrate on the general description of the learning problem for mining trends in texts (what kind of feedback is available, what kind of features should be learnt, how to extract trend labels, what the feature space is and how well separable the different classes are, how the features can be extended into semantic features, etc.). While defining the learning problem, I also have to consider the representation of the learning data and the representation of the background knowledge. In general, this thesis elaborates on the idea of semantic learning, which is the combination of the inductive learning approach from Machine Learning with the knowledge representation approach from the Semantic Web.

The outcome of this thesis is a knowledge-integrating method for mining trends in texts, which aims at improving the quality of trend mining methods and adds a value that existing methods lack: the explanation of trends.

Proposed solution

At this stage of my work, the proposed solution starts with a few important definitions: time window, time slice, burstiness, interestingness, utility and trend indication. Based on them, an exact description of what trends in text are becomes possible.

Definition 1: Time window

A time window $t_{window}$ is a time interval in which trends can occur. Furthermore, it can be described as an ordered set of subintervals. A time slice $t_{slice}$ is a subinterval of the time window. If its starting point lies at $t_0$, its end point has to lie at $t_k < t_n$:

$$t_{window} = [t_0 \ldots t_n] \;\wedge\; t_{slice_k} = [t_0 \ldots t_k]$$

$$t_{window} := \{t_{slice_k}, \ldots, t_{slice_n}\} \;\wedge\; |t_{slice_k}| = |t_{slice_n}| \;\wedge\; k, n \in \mathbb{N} \;\wedge\; k < n \qquad (1)$$

Time slices have the same length.
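As an illustration of Definition 1, the following minimal sketch splits a time window into equal-length time slices and assigns a time-stamped document to its slice. It assumes consecutive, non-overlapping slices, which is only one possible reading of the definition; the function names `make_time_slices` and `slice_index` and the example dates are hypothetical and not part of the thesis.

```python
from datetime import datetime


def make_time_slices(start, end, n_slices):
    """Split the window [start, end) into n_slices consecutive slices of equal length."""
    if n_slices < 1 or end <= start:
        raise ValueError("need a non-empty window and at least one slice")
    length = (end - start) / n_slices  # timedelta divided by int gives a timedelta
    return [(start + k * length, start + (k + 1) * length) for k in range(n_slices)]


def slice_index(timestamp, start, end, n_slices):
    """Return the index k of the time slice that contains the timestamp."""
    if not start <= timestamp < end:
        raise ValueError("timestamp lies outside the time window")
    length = (end - start) / n_slices
    # timedelta divided by timedelta gives a float; truncate to get the slice index
    return min(int((timestamp - start) / length), n_slices - 1)


# Hypothetical example: a four-day window cut into four one-day slices.
window_start = datetime(2006, 1, 1)
window_end = datetime(2006, 1, 5)
slices = make_time_slices(window_start, window_end, 4)
k = slice_index(datetime(2006, 1, 3, 12, 0), window_start, window_end, 4)
```

Grouping the documents of a corpus by `slice_index` yields the per-slice document sets that the following definitions operate on.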

Definition 2: Burstiness

In order to distinguish the words in the documents of a given time slice from those in all documents of the time window, the TFIDF (term frequency, inverse document frequency) function [21] is adapted. For each word, the function value says how important the word is in a given period of time. This function serves to discover the burstiness of words: if a word appears only in the documents of a given time slice and not in the rest of the window (looking backwards), it could be the so-called entry point of a trend.

$$\mathrm{burst}(w)_{t_{window}} := TF(w, |D|_{t_{slice}}) \cdot IDF(w, |D|_{t_{window}}) \qquad (2)$$

$$IDF(w, |D|_{t_{window}}) := \log \frac{|D|_{t_{window}}}{DF(w)_{t_{window}}}$$

where $|D|$ is the total number of documents. If the word continues to appear in the next time slices and becomes interesting, it can become trend-indicating. Based on the time component of Definition 1, trend indication is defined by interestingness and utility as follows.

Definition 3: Interestingness

Interestingness is defined by the frequency of word $w$ in the time window. For a time slice, it is expressed as the sum of the term frequencies of $w$ in all documents of the given time slice, divided by the number of documents in this time slice (scaled by the binary logarithm).

$$\mathrm{interest}(w)_{t_{slice}} = f(w)_{t_{slice}} := \log \frac{TF(w, D_{t_{slice}})}{|D|_{t_{slice}}} \qquad (3)$$
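Formulas (2) and (3) can be tried out on a toy corpus with a minimal sketch like the following. It rests on several assumptions: documents are already tokenized into word lists, the unspecified logarithm in the IDF term is the natural logarithm, and a word that never occurs is given burstiness 0 and interestingness minus infinity. All function names and the sample data are hypothetical.

```python
import math


def term_frequency(word, docs):
    """Total number of occurrences of word in a list of tokenized documents."""
    return sum(doc.count(word) for doc in docs)


def document_frequency(word, docs):
    """Number of documents that contain word at least once."""
    return sum(1 for doc in docs if word in doc)


def burst(word, slice_docs, window_docs):
    """Formula (2): TF in the time slice times IDF over the whole time window."""
    df = document_frequency(word, window_docs)
    if df == 0:
        return 0.0  # assumption: a word absent from the window is not bursty
    idf = math.log(len(window_docs) / df)  # assumption: natural logarithm
    return term_frequency(word, slice_docs) * idf


def interest(word, slice_docs):
    """Formula (3): binary log of the term frequency per document in the slice."""
    tf = term_frequency(word, slice_docs)
    if tf == 0 or not slice_docs:
        return float("-inf")  # assumption: log of zero treated as minus infinity
    return math.log2(tf / len(slice_docs))


# Hypothetical toy window of four tokenized documents; the last two form one slice.
window_docs = [
    ["oil", "price"],
    ["oil", "rise"],
    ["bank", "crisis"],
    ["crisis", "crisis", "bank"],
]
slice_docs = window_docs[2:]
crisis_burst = burst("crisis", slice_docs, window_docs)
crisis_interest = interest("crisis", slice_docs)
```

In this toy window, "crisis" occurs three times in the slice and in two of the four window documents, so its burstiness is 3 · ln 2 and its interestingness is log2(3/2).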

For trend indication it is important to know whether the interestingness of a word rises over the time window. Following formula (1) in Definition 1, for a given time window the sequence

$$\mathrm{interest}(w)_{t_{window}} := \{ f(w)_{t_{slice_k}}, f(w)_{t_{slice_{k+1}}}, \ldots, f(w)_{t_{slice_n}} \} \qquad (4)$$

expresses increasing interestingness if

$$f(w)_{t_{slice_k}} < f(w)_{t_{slice_{k+1}}} < \ldots < f(w)_{t_{slice_n}}$$

Definition 4: Utility

Utility expresses how popular users find a given word $w$ in the given time window. I propose to retrieve it by analysing collaborative tagging systems (CTB), e.g. delicious, and estimating the popularity of the word as a tag in the same time window as for the trend estimation. The popularity can be described simply by the number of resources in the CTB that have been tagged with the word $w$ in the given time window, divided by the number of all resources tagged in this time window:

$$\mathrm{util}(w)_{t_{window}} := \log \frac{|R|_{(tag=w)\,t_{window}}}{|R|_{(tag)\,t_{window}}} \qquad (5)$$

Definition 5: Trend indication

$$\mathrm{trendind}(w)_{t_{window}} = \frac{\mathrm{burst}(w)_{t_{slice_k}} + \mathrm{interest}(w)_{t_{slice}} \cdot \mathrm{util}(w)_{t_{window}}}{\mathrm{ratio}(t_{window})}$$

where

$$\mathrm{ratio}(t_{window}) = |t_{window}|$$

is the number of time slices.

The definitions above allow for a general description of emerging topics in a given time window: in the simplest case, emerging topics are the intersection of the trend-indicating words (the set of all words that at some point in the time window start to show bursty behavior and that appear frequently enough to be discovered and rarely enough to be important in the given time window) with the set of words used as tags in a CTB in this time window. Furthermore, the trend indication allows for automatic labeling of the document corpus, dividing it into trend-indicating and neutral documents (with regard to the time slices in which the documents appear). However, this is only the statistical part of the approach, and it considers only single words. At this stage of the thesis, tests have to be done in order to prove it useful. Furthermore, I have to elaborate on the inclusion of background knowledge into the labeling, either by applying my trend ontology approach [23] or the tag tagging approach [25], in order to extend the features into "real" semantic concepts, which I call statements, and at the same time to reduce the dimension of the feature space.

As the learning approach I propose to adapt Bayes learning (however, decision trees, which are good for visualizing and comprehending the model, and support vector machines, the most reliable classification method, also have to be considered). In this case, the Bayes theorem can be stated in a very general way as:

$$P(T|S) = \frac{P(S|T)\,P(T)}{P(S)} \qquad (7)$$

$P(T|S)$ is the a posteriori probability of $T$ conditioned on $S$, where $T$ is the hypothesis and $S$ a statement. In the case of mining trends, $T$ says that there is an indication for a trend, and $P(T|S)$ reflects the probability that the given statement $S$ indicates a trend (or that $S$ is built on trend-indicating concepts and therefore indicates a trend). $P(T)$ is the a priori probability that any given statement indicates a trend, and $P(S)$ is the a priori probability of observing the statement $S$ in the training set; $P(S|T)$ can be estimated from the given data.

At this stage of my work, I am starting the tests for trend feature extraction and continue to elaborate on my solution for the integration of background knowledge as well as on a proper definition of the learning method.

Evaluation

The evaluation of my approach is primarily based on the evaluation of the model performance, which can be conducted using cross-validation and measured in general by the recall and precision values. For the cross-validation, the document corpus is divided into $i$ folds and the validation process is repeated $i$ times; in every step of the validation, a $\frac{1}{i}$ part of the document corpus is used as the test set while the remaining $\frac{i-1}{i}$ parts are used for building the learning model. If $D$ is the set of documents and $|D|$ is the total number of documents in the set, the recall and precision values are:

$$\mathrm{recall} = \frac{|D|_{trendindicating\text{-}and\text{-}retrieved}}{|D|_{trendindicating}} \qquad (8)$$

$$\mathrm{precision} = \frac{|D|_{trendindicating\text{-}and\text{-}retrieved}}{|D|_{retrieved}}$$

Also, for numeric prediction, the relative absolute error measure can be applied:

$$\frac{|p_1 - a_1| + \ldots + |p_n - a_n|}{|a_1 - \bar{a}| + \ldots + |a_n - \bar{a}|}$$

where $p_1, p_2, \ldots, p_n$ are the predicted values for the test instances, $a_1, a_2, \ldots, a_n$ the actual values, and $\bar{a}$ the mean of the actual values. The formulas above give only an insight into possible measures. The final evaluation depends on the final learning model and should also take into account the knowledge integration part (in the case of decision trees, for example, this could be done by additionally measuring the changes in information gain values).

Future Work

Many research issues are relevant to this thesis. From the information retrieval point of view, one of them is, for example, research on graph-based representation models for documents and semantic indexing of document collections. At this stage of the work it is too early to expand on the remaining issues.
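As a concrete starting point for the upcoming tests, the evaluation measures discussed in this paper (recall, precision, relative absolute error) and the split of a corpus into i folds can be sketched as follows. This is a minimal sketch under stated assumptions: documents are identified by hashable ids, the folds are formed by simple striding rather than random shuffling, and all function names and sample values are hypothetical.

```python
def precision_recall(retrieved, trend_indicating):
    """Precision and recall of a retrieved document set against the trend-indicating set."""
    retrieved, trend_indicating = set(retrieved), set(trend_indicating)
    hits = len(retrieved & trend_indicating)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(trend_indicating) if trend_indicating else 0.0
    return precision, recall


def relative_absolute_error(predicted, actual):
    """Sum of absolute errors divided by the sum of absolute deviations from the mean."""
    mean_actual = sum(actual) / len(actual)
    denominator = sum(abs(a - mean_actual) for a in actual)
    if denominator == 0:
        raise ValueError("all actual values are identical")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / denominator


def cross_validation_folds(items, i):
    """Yield (train, test) pairs; each test set holds roughly 1/i of the items."""
    for j in range(i):
        test = items[j::i]  # every i-th item, starting at offset j
        train = [x for idx, x in enumerate(items) if idx % i != j]
        yield train, test


# Hypothetical document ids: three retrieved, four actually trend-indicating.
precision, recall = precision_recall({1, 2, 3}, {2, 3, 4, 5})
rae = relative_absolute_error([2, 4, 6], [1, 4, 7])
folds = list(cross_validation_folds(list(range(6)), 3))
```

In practice the folds would more likely be drawn at random or stratified by time slice; striding is used here only to keep the sketch deterministic.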

¹ http://www.google.de/trends
² http://www.blogpulse.com/
³ http://www.opencalais.com/
⁴ http://conceptnet.media.mit.edu/
⁵ http://commons.media.mit.edu/en/
⁶ http://moat-project.org/
⁷ http://wordnet.princeton.edu/
⁸ http://sentiwordnet.isti.cnr.it/
⁹ http://wortschatz.uni-leipzig.de/
¹⁰ http://www.dwds.de/
¹¹ http://www.w3.org/2004/02/skos/
¹² http://scot-project.org/

Acknowledgments

This work has been partially supported by the InnoProfile-Corporate Semantic Web project funded by the German Federal Ministry of Education and Research (BMBF) and the BMBF Innovation Initiative for the New German Länder - Entrepreneurial Regions. The author wants to thank Prof. Robert Tolksdorf and Prof. Abraham Bernstein for their helpful comments on the content of this thesis.

References

[1] Rakesh Agrawal, Edward L. Wimmers, Mohamed Zait. Querying shapes of histories. 1995.
[2] Khurshid Ahmad, Lee Gillam. Events and the causes of events. In Proceedings of the Workshop on Making Money in the Financial Services Industry, at the 6th International Conference on Terminology and Knowledge Engineering, 2002.
[3] James Allan. Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers, 2002.
[4] James Allan, Ron Papka, Victor Lavrenko. On-line new event detection and tracking. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 1998.
[5] Michael Berry. Survey of Text Mining: Clustering, Classification, and Retrieval. Springer Science+Business Media, Inc., 2004.
[6] Raymond K. Wong, Desh Peramunetilleke. Currency exchange rate forecasting from news headlines. In Proceedings of the 13th Australasian Database Conference, 2002.
[7] Tom Fawcett, Foster Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, 1999.
[8] Paulo Félix, Santiago Fraga, Roque Marín, Senén Barro. Trend detection based on a fuzzy temporal profile model. AI in Engineering, 13(4), 1999.
[9] L. Gillam, K. Ahmad, S. Ahmad, M. Casey, D. Cheng, T. Taskaya, P. C. F. Oliveira, Manomaisupat. Economic news and stock market correlation: A study of the UK market. 2002.
[10] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2006.
[11] A. R. Hevner, S. T. March, J. Park, S. Ram. Design science in information systems research. MIS Quarterly, 28(1), 2004.
[12] Jon Kleinberg. Bursty and hierarchical structure in streams. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[13] April Kontostathis, Leon Galitsky, William M. Pottenger, Soma Roy, Daniel J. Phelps. A Survey of Emerging Trend Detection in Textual Data Mining. Springer-Verlag, 2003.
[14] April Kontostathis, Lars E. Holzman, William M. Pottenger. Use of term clusters for emerging trend detection. Technical report, 2004.
[15] Victor Lavrenko, Matt Schmill, Dawn Lawrie, Paul Ogilvie, David Jensen, James Allan. Mining of concurrent text and time series. In Proceedings of the 6th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, Workshop on Text Mining, 2000.
[16] Brian Lent, Rakesh Agrawal, Ramakrishnan Srikant. Discovering trends in text databases. AAAI Press, 1997.
[17] Marc-Andre Mittermayer, Gerhard F. Knolmayer. Newscats: A news categorization and trading system. In IEEE International Conference on Data Mining, 2006.
[18] Satoshi Morinaga, Kenji Yamanishi. Tracking dynamics of topic trends using a finite mixture model. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004.
[19] Olga Streibel. XML-Clearinghouse report 17: XML-technologies and semantic web for trend mining in business applications. Technical report, XML-Clearinghouse Project, Freie Universität Berlin, 2007.
[20] William M. Pottenger, Ting-Hao Yang. Detecting emerging concepts in textual data mining. 2001.
[21] Gerard Salton. Automatic Text Processing. Addison-Wesley, 1989.
[22] Herbert A. Simon. The Sciences of the Artificial. 3rd ed. MIT Press, Cambridge, MA, USA, 1996.
[23] Olga Streibel, Malgorzata Mochol. Trend ontology for knowledge-based trend mining in textual information. In 7th International Conference on Internet Technology: New Generations, 2010.
[24] Russel Swan, David Jensen. TimeMines: Constructing timelines with statistical models of word usage. In KDD-2000 Workshop on Text Mining.
[25] Vlad Tanasescu, Olga Streibel. Extreme tagging: Emergent semantics through the tagging of tags. ESOE, 2007.
[26] Henrik Vejlgaard. Anatomy of a Trend. McGraw-Hill, 2008.
[27] B. Wüthrich, D. Permunetilleke, S. Leung, V. Cho, J. Zhang, W. Lam. Daily prediction of major stock indices from textual WWW data. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD-98), 1998.