Trend template: mining trends with a semi-formal trend model

Olga Streibel, Lars Wißler, Robert Tolksdorf
streibel@inf.fu-berlin.de, lars.wissler@googlemail.com, tolk@ag-nbi.de
Networked Information Systems Group, Freie Universität Berlin, Berlin, Germany

Danilo Montesi
montesi@cs.unibo.it
University of Bologna, Bologna, Italy

Abstract

Predictions of rising or falling trends are helpful in scenarios in which users have to deal with huge amounts of information in a timely manner, such as financial analysis. In many cases of data analysis, this temporal aspect requires novel data mining techniques. Assuming that a given set of data, e.g. web news, contains information about a potential trend, e.g. a financial crisis, it is possible to apply statistical or probabilistic methods in order to find out more about this trend. However, we argue that in order to understand the context, the structure, and the explanation of a trend, it is necessary to take a knowledge-based approach. In our study we define trend mining and propose the application of an ontology-based trend model for mining trends from textual data. We introduce a preliminary definition of trend mining as well as two components of our trend model: the trend template and the trend ontology. Furthermore, we discuss the results of our experiments with the trend ontology on a test corpus of German web news. We show that our trend mining approach is relevant for different scenarios in ubiquitous data mining.

1 Introduction

When discussing trends, some of us may think about the ups and downs of the NASDAQ¹ or DAX² curves, or about changes in public opinion on politics before elections. Likewise, one can think about web trends, lifestyle trends or daily trends, i.e. hot topics, in the news or on social networks. Changes in a mobile data stream also fall within the definition of a trend. Understanding a trend as a hot topic is related to the research in Emerging Trend Detection (ETD) and Topic Detection and Tracking (TDT), two subfields of information retrieval [Allan, 2002][Kontostathis et al., 2003]. A trend is defined there as a topic that emerges in interest and utility over time. Accordingly, common examples of trends are the “Arab Spring”, which emerged in political news worldwide at the beginning of 2011, and the financial and real estate crisis, which started to emerge in business news worldwide in 2008. A graphical representation of a trend, based on GoogleTrends³, is shown in Fig. 1.

Figure 1: This graph shows a search volume index for the terms “financial crisis” (blue curve) and “insolvent” (red curve) in Germany from 2006 to 2011. Source: GoogleTrends

Several methods have been proposed for detecting trends in texts or discovering trends in web news (see Section 3). Other works provide approaches from statistics and time series analysis that can be applied for analyzing trends in non-textual data. Our work contributes to the general understanding of trend mining, which we see as highly relevant to ubiquitous data mining. In this paper, we explain our abstract concept of a trend template and go on to describe a trend ontology, which is an instance of the trend template.

2 Ubiquitous data mining and trend mining

Ubiquitous Data Mining (UDM) is defined as an essential part of ubiquitous computing [Witten and Eibe, 2005]. UDM techniques help in extracting useful knowledge from data that describes the world in movement, including the aspects of space and time. Time is the necessary dimension for trend mining: there is no trend without time. And a trend is one of the aspects of a world in movement. Before we discuss general trend characteristics, we want to mention the sociological and statistical perspectives on the trend, as well as define trend mining.
This helps in understanding the trend characteristics that create the basis for the definition of our trend template later in this paper.

¹ http://www.nasdaq.com/ online accessed 04-17-2013
² http://dax-indices.com/EN/index.aspx?pageID=4 online accessed 04-17-2013
³ http://www.google.com/trends/ online accessed 04-17-2013

2.1 Trend from different perspectives

Detecting trends from the sociological point of view is an analytical method for observing changes in people's behavior over time with regard to “six attitudes towards trends” [Vejlgaard, 2008]. The definition of these six attitudes is based on eight different personality profiles of groups who participate in the trend process: trend creators, trend setters, trend followers, early mainstreamers, mainstreamers, late mainstreamers, conservatives and anti-innovators.

Detecting trends from the statistics perspective is based on trend analysis of time-series data with two goals in mind: “modeling time series (i.e. to gain insight into the mechanisms or underlying forces that generate the time series) and forecasting time series (i.e. to predict the future values of the time-series variables)” [Han and Kamber, 2006]. The trend analysis process consists of four major components: trend or long-term movements, cyclic movements or cyclic variations, seasonal movements or seasonal variations, and irregular or random movements [Han and Kamber, 2006]. A trend, in this context, is an indicator for a change in the data mean [Mitsa, 2010].

2.2 Trend mining

Since data mining can be described as “the extraction of implicit, previously unknown, and potentially useful information from data” [Witten and Eibe, 2005], we propose the use of the term trend mining as defined below:

DEF 2.1 Trend mining is the extraction of implicit, previously unknown and potentially useful knowledge from time-ordered text or data. Trend mining techniques can be used for capturing a trend in order to support users by providing previously unknown information and knowledge about the general development in their field of interest.

3 Related Research

In general, when mining trends from textual data, at least the following three research areas should be mentioned: emerging trend detection, topic detection and tracking, and temporal data mining.

In [Kontostathis et al., 2003] several systems that detect emerging trends in textual data are presented. These ETD systems are classified into two main categories: semi-automatic and fully automatic. Each system is characterized by the following aspects: input data and attributes, learning algorithms, and visualization. This comparison includes an overview of the research published in [Allan et al., 1998][Lent et al., 1997][Agrawal et al., 1995][Swan and Jensen, 2000][Swan and Allan, 1999][Watts et al., 1997].

TDT research [Allan, 2002] is predominantly related to event-based approaches. Event-based approaches for trend mining rest on the assumption that trends are always triggered by an event, which is often defined in the literature as “something happening” or “something taking place” [Lita Lundquist, 2000]. Considering a trend from the event research perspective means that trend detection has to be understood as a monitoring task. This is mostly the case for so-called short-term trends, which are indeed triggered by some event; in order to detect them we have to monitor the stream in which they occur, e.g. the occurrence of the “Eyjafjallajökull eruption”⁴, which was reported in social networks and on the news in March 2010. However, so-called long-term trends, e.g. the “financial crisis” that started to be on-topic in 2008, are not necessarily tied to one specific event. Such a trend is rather driven by a chain of events or even by “soft” indicators such as public opinion or news. No sharp distinction has been made between the TDT and ETD research fields, which means that some research, such as [Swan and Allan, 1999] or [Lavrenko et al., 2000], can in fact be classified into both fields.

Temporal data mining research [Mitsa, 2010] offers methods for clustering, classification, dimension reduction and processing of time-series data [Wang et al., 2005]. It addresses temporal data in general and the techniques of time series analysis on these data. One definition of temporal data is “time series data which consist of real valued sampled at regular time intervals” [Mitsa, 2010]. Temporal data mining applies the data mining methodology and deals with the same approaches for classification or clustering that are relevant for mining trends in textual data.

⁴ The eruption of an Icelandic volcano in March 2010 that caused air travel chaos in Europe and lost revenue for the airlines. http://www.volcanodiscovery.com/iceland/eyjafjallajoekull.html online accessed 04-17-2013

4 Trend template

Based on our experiments and considerations, we outline the following assumption about trends in the general context of this work: a trend can be described by the following characteristics: trigger, context, amplitude, direction, time interval, and relation. Fig. 2 illustrates the trend template. In Section 4.1, we define each characteristic more precisely.

Figure 2: Trend template: an abstract conceptualization

4.1 Definitions

Trigger is anything that triggers the trend. It can be an event, a person, or a topic. A trigger can, but does not have to, cause a trend. A trigger makes the trend visible. An example of a trigger is the Lehman Brothers⁵ insolvency, which can be classified as both a topic and an event.

Context is the area of the trigger. If the trigger is a topic, then the context is this topic's area, e.g. the Lehman Brothers insolvency is mentioned in the context of the real estate market.

Amplitude is the strength of a given trend. It can be expressed by a number (the higher the number, the more impact the trend has) or by a qualitative value that describes the trend phase, e.g. beginning (setter), emerging (follower), mainstream, fading (conservative).

Time is necessary when spotting a trend, since there can be no trend without time. It is the interval in which the trend appears, independent of the amplitude, e.g. the real estate crisis appeared between the years 2008 and 2011.

Relation expresses the dependency between the trigger and the context. It puts the given trigger, e.g. the Lehman Brothers insolvency, into a relation with the given context, e.g. the real estate crisis: the Lehman Brothers insolvency is part of the real estate crisis.

⁵ http://www.lehman.com/ online accessed 04-17-2013

4.2 Formal description

The trend template is an abstract model that describes the main concepts that are important and necessary for knowledge-based trend mining. In the following, we define the trend template more explicitly:

DEF. 4.1: Trend template (TT) is a quintuple:

$$TT := \langle T, C, R, TW, A \rangle$$

where $T$ is trigger, $C$ is context, $R$ is relation, $TW$ is time window, and $A$ is amplitude.

DEF. 4.2: T - Trigger is a set of concepts:

$$T := \{t_0, \ldots, t_n\},\ n \in \mathbb{N} \wedge t \in T$$

so that if $E$, $P$, $L$, $T_o$ are the sets defining

events: $E := \{e_0, \ldots, e_n\},\ n \in \mathbb{N} \wedge e \in E$
persons: $P := \{p_0, \ldots, p_n\},\ n \in \mathbb{N} \wedge p \in P$
locations: $L := \{l_0, \ldots, l_n\},\ n \in \mathbb{N} \wedge l \in L$
topics: $T_o := \{t_{o_0}, \ldots, t_{o_n}\},\ n \in \mathbb{N} \wedge t_o \in T_o$

then:

$$T := E \cup P \cup T_o \cup L$$

DEF. 4.3: C - Context is a union set consisting of a set of concepts and a set of relations between them, where $c$ is a context element:

$$C := C_{co} \cup R_{co},\ c \in C$$

with $C_{co}$ the set of concepts

$$C_{co} := \{c_{co_0}, \ldots, c_{co_n}\},\ n \in \mathbb{N} \wedge c_{co} \in C_{co}$$

and $R_{co}$ the set of relations:

$$R_{co} := \{r_{co_0}, \ldots, r_{co_n}\},\ n \in \mathbb{N} \wedge r_{co} \in R_{co} \wedge R_{co} \subseteq C_{co} \times C_{co}$$

whereas $r_{co}$ defines a binary relation:

$$r_{co} : c_{co_x}, c_{co_y} \longrightarrow r_{co}(c_{co_x}, c_{co_y}) \wedge c_{co_x} \neq c_{co_y}$$

and the context element is defined by:

$$c = c_{co} \cup (c_{co_i}, c_{co_j})$$
$$C = C_{co} \cup C_{co} \times C_{co}$$

DEF. 4.4: R - Relational is a set of relations:

$$R := \{r_0, \ldots, r_n\},\ n \in \mathbb{N} \wedge r \in R \wedge R := \{T \times C\}$$

with

$$r_i : t_i, c_i \longrightarrow r_i(t_i, c_i)$$

DEF. 4.5: TW - Time window is a function that assigns a time slice to the time points:

$$TP := \{t_{point} \mid t_{point} = ms \vee second \vee minute \vee hour \vee day \vee month \vee year\}$$
$$TS := \langle t_{point_0} \ldots t_{point_n} \rangle$$
$$TW : TP \longrightarrow TS$$

DEF. 4.6: A - Amplitude is a function that assigns a value to the quadruple $\langle T, C, R, TW \rangle$:

$$A : T \times C \times R \times TW \longrightarrow \mathbb{N} \cup V$$

where $\mathbb{N}$ is the set of natural numbers and $V$ is the set of categorical values

$$a : (t, c, r, tw) \longrightarrow n \vee v$$

5 Trend Ontology

One way of implementing the trend template is the realization of this model in the form of an ontology. We can understand the ontology as an instance of the trend template.

Based on the trend template described above, we created an applicable model, using SKOS⁶ and RDFS/OWL⁷ concepts and properties. Our model serves as a general model that can be extended for a particular application domain and applied for annotating a text corpus in order to retrieve the trend structure. The trend ontology is divided into the levels meta, middle and low, which correspond to three abstract layers of the model. Whereas the low level and the middle level relate to the corresponding application domain (in our case the German Stock Exchange, DAX), the meta level is the most interesting one. The meta ontology incorporates the general trend characteristics and can be applied to any application domain.

The central concepts of the ontology are Trigger, TriggerCollection, Indication, Relational and ValuePartition. They have been modeled as subconcepts of skos:Concept, skos:Collection and time:TemporalEntity, with different semantic constructions, e.g. skos:related, skos:member. The concepts mirror the composition of the trend template. Trigger consists of three subconcepts: event, person, location.

The main goal of the meta ontology is to offer all necessary concepts and relations in order to span the trend template as a structure over a text corpus. To actually translate a specific document corpus into such a structure, the meta ontology needs to be combined with a domain-specific trend ontology, which defines domain-specific concepts, their keywords and possibly also their relations. This can be done either manually, by extracting common terms as keywords and linking them to their respective concepts, or automatically, by entity recognition. The pseudocode in Algorithm 6.1 describes the algorithm that we applied to build up the trend description on the test corpus.

Algorithm 6.1: CREATETRENDDESCRIPTION(c, o)

    comment: parse ∀ document ∈ corpus
    comment: into ontology
    parse(c, inO, outO){
      model.read(inO)
      create.reasoner(inO)
      for each d ∈ c
      do {
        parse(keywords);
        match.model(keywords, inO){
          for keyword ← 0 to i
            if inO.concept.label == keyword or
               keyword ∈ inO.concept.label
               keyword.prefix or keyword.postfix ==
                 inO.concept.label.prefix or .postfix
            then matches.add(keyword)}
        relate.model(matches, inO){
          if model.getRelation(matches).isEmpty
          then model.createRelation(matches)
          else model.incCounter(matches)}}
      model.write(outO)

⁶ http://www.w3.org/2004/02/skos/ online accessed 04-17-2013
⁷ http://www.w3.org/TR/owl-features/ online accessed 04-17-2013

6 Experiments

Our test corpus, which we call German finance data⁸, consists of about 40,500 news articles related to the fields of business and finance, provided as XML files. The corpus is in German and contains news articles from January 2007 to May 2008. The text was parsed in cooperation with neofonie⁹ from the following sources: comdirect¹⁰, derivatecheck¹¹, Handelsblatt¹², GodmodeTrader¹³, Yahoo¹⁴, Financial Times Deutschland¹⁵, and finanzen.net¹⁶.

In general, the content of the corpus is focused on finance and business information concerning German companies and stocks. It focuses on the situation at the DAX, as well as on reviews and ratings of German companies and shares. To evaluate usefulness and practicability, the trend ontology has been filled with two different parts of the test corpus: stock-market-specific documents in Part 1 and general business news in Part 2 (subsequently first and second part). They contain over 5,000 and 16,000 documents respectively. We specified several basic questions and respective queries as relevant for trends in general and specifically for stock market trends. Querying the ontology for the total occurrence of concepts yields the following output (shortened to some of the most relevant concepts): Germany (9,137), USA (4,808), Deutsche Telekom (442), Allianz (433), Switzerland (382), Starbucks (104). The output corresponds directly to the corpus of German stock news, with a clear focus on German companies followed by the still dominant US market. A similar query for frequently mentioned lines of business in the context of Germany, in contrast to the USA, yields a major focus on industry for Germany: 4.5% to 7.1% of the total occurrences of Germany appear in the context of different lines of industry. The USA is strong in the context of IT (9%) and services (6.9%). Moreover, we checked the so-called topic structure by using our ontology. Here is a general example for the concept Germany:

trendonto:#Germany (9137) has Topic
trendonto:#Financial : 1142
trendonto:#buy : 1003
trendonto:#MachineBuildingIndustry : 650
trendonto:#Share : 606
trendonto:#StockPrice : 562
trendonto:#Up : 520
trendonto:#Industry : 510
trendonto:#Investment : 468
trendonto:#Supplier : 422
trendonto:#AutomobilIndustry : 414

Figure 3: Performance of shares in the first corpus (5,000 documents) by ontology-based ranking and comparison with share indices in the time window July 2007 to July 2011.

In Fig. 3 we compare the performance values for the stock markets as ranked by the ontology (test based on the time window July 2007 to April 2008) with the values actually reported (time window July 2007 to July 2011). Applying the trend ontology to the test set makes it possible to find specific information about a particular trend that is described in the documents of the test set. The preliminary experimental results that we partially present in this paper show that our idea of a trend template could help in harvesting knowledge from the given test data in a timely manner.

⁸ Currently (May 2013) in the publishing process at Linguistic Data Consortium http://www.ldc.upenn.edu/
⁹ http://www.neofonie.de, online accessed 04-25-2012
¹⁰ http://www.comdirect.de/inf/index.html, online accessed 04-25-2012
¹¹ http://derivatecheck.de/, online accessed 04-25-2012
¹² http://www.handelsblatt.com/weblogs/, online accessed 04-25-2012
¹³ http://www.godmode-trader.de/, online accessed 04-25-2012
¹⁴ http://de.biz.yahoo.com/, online accessed 04-25-2012
¹⁵ http://www.ftd.de/, online accessed 04-25-2012
¹⁶ http://www.finanzen.net, online accessed 04-30-2012

7 Conclusions and future work

This paper presents our research on knowledge-based trend mining, wherein the main contribution is our semi-formal model of a trend template. We showed that the implementation of the trend template in the form of a trend ontology allows for capturing the trend structure of a test document set. Our experiments confirm that a knowledge-based approach for mining trends from data allows for extended trend explanations. Currently we are comparing the trend ontology experiment results with the results from adapted K-Means clustering and LDA-based topic modeling algorithms applied to our test set.

Acknowledgments

This work has been partially supported by the “InnoProfile-Corporate Semantic Web” project funded by the German Federal Ministry of Education and Research (BMBF) and the BMBF Innovation Initiative for the New German Länder - Entrepreneurial Regions.

References

[Agrawal et al., 1995] Rakesh Agrawal, Edward L. Wimmers, and Mohamed Zait. Querying shapes of histories. In Proceedings of the 21st VLDB, pages 502–514. Morgan Kaufmann Publishers Inc., 1995.

[Allan et al., 1998] James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection and tracking. In SIGIR’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 37–45. ACM, 1998.

[Allan, 2002] James Allan, editor. Topic Detection and Tracking. Event-based Information Organization. Kluwer Academic Publishers, 2002.

[Han and Kamber, 2006] J. Han and M. Kamber. Data Mining Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2006.

[Kontostathis et al., 2003] April Kontostathis, Leon Galitsky, William M. Pottenger, Soma Roy, and Daniel J. Phelps. A Survey of Emerging Trend Detection in Textual Data Mining. Springer-Verlag, 2003.

[Lavrenko et al., 2000] Victor Lavrenko, Matt Schmill, Dawn Lawrie, Paul Ogilvie, David Jensen, and James Allan. Mining of concurrent text and time series. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop on Text Mining, pages 37–44, 2000.

[Lent et al., 1997] Brian Lent, Rakesh Agrawal, and Ramakrishnan Srikant. Discovering trends in text databases. In Proceedings of the KDD’97, pages 227–230. AAAI Press, 1997.

[Lita Lundquist, 2000] Robert J. Jarvella and Lita Lundquist. Language, Text, and Knowledge. Mental Models of Expert Communication. De Gruyter, 2000.

[Mitsa, 2010] Theophano Mitsa, editor. Temporal Data Mining. Chapman Hall/CRC Press, 2010.

[Swan and Allan, 1999] Russell Swan and James Allan. Extracting significant time varying features from text. In CIKM’99: Proceedings of the eighth international conference on Information and knowledge management, pages 38–45. ACM, 1999.

[Swan and Jensen, 2000] Russell Swan and David Jensen. Timemines: Constructing timelines with statistical models of word usage. In KDD-2000 Workshop on Text Mining, 2000.

[Vejlgaard, 2008] Henrik Vejlgaard. Anatomy of A Trend. McGraw-Hill, 2008.

[Wang et al., 2005] X. Wang, K. Smith, and R. Hyndman. Dimension reduction for clustering time series using global characteristics. In Vaidy Sunderam, Geert van Albada, Peter Sloot, and Jack Dongarra, editors, Computational Science - ICCS 2005, volume 3516 of Lecture Notes in Computer Science, pages 11–14. Springer Berlin / Heidelberg, 2005.

[Watts et al., 1997] Robert J. Watts, Alan L. Porter, Scott Cunningham, and Donghua Zhu. Toas intelligence mining; analysis of natural language processing and computational linguistics. In PKDD ’97: Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, pages 323–334. Springer-Verlag, 1997.

[Witten and Eibe, 2005] Ian H. Witten and F. Eibe. Data Mining Concepts and Techniques. Morgan Kaufmann Publishers Inc., 2005.
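
As a supplementary illustration of the trend template in DEF. 4.1 to 4.6, the following minimal Python sketch shows how a single instance of the quintuple could be represented as a data structure. It is not the ontology-based implementation used in the paper. The class names, field names and validation rules are assumptions made for illustration, and the amplitude value in the example is invented; the trigger, context, relation and time interval are taken from the Lehman Brothers example in Section 4.1.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union

# Hypothetical names: the paper defines only the mathematical structure.
TRIGGER_KINDS = {"event", "person", "location", "topic"}  # E, P, L, To in DEF. 4.2

# A context element is a concept or a pair of related concepts (DEF. 4.3).
ContextElement = Union[str, Tuple[str, str]]


@dataclass(frozen=True)
class Trigger:
    label: str
    kinds: FrozenSet[str]  # a trigger may belong to several sets, e.g. topic and event

    def __post_init__(self):
        if not self.kinds or not self.kinds <= TRIGGER_KINDS:
            raise ValueError(f"unknown trigger kind in {set(self.kinds)}")


@dataclass(frozen=True)
class TrendTemplate:
    trigger: Trigger                   # t in T
    context: FrozenSet[ContextElement] # elements c in C
    relation: str                      # name of the relation r(t, c), DEF. 4.4
    time_window: Tuple[str, str]       # first and last time point, DEF. 4.5
    amplitude: Union[int, str]         # natural number or categorical value, DEF. 4.6

    def __post_init__(self):
        for element in self.context:
            # DEF. 4.3 requires the two concepts of a context relation to differ.
            if isinstance(element, tuple) and element[0] == element[1]:
                raise ValueError("a context relation must link two distinct concepts")
        if isinstance(self.amplitude, int) and self.amplitude < 0:
            raise ValueError("a numeric amplitude must be a natural number")


# Example values from Section 4.1; the amplitude "mainstream" is illustrative only.
lehman = TrendTemplate(
    trigger=Trigger("Lehman Brothers insolvency", frozenset({"event", "topic"})),
    context=frozenset({"real estate market"}),
    relation="is part of",
    time_window=("2008", "2011"),
    amplitude="mainstream",
)
```

Making the classes immutable mirrors the reading of the template as a tuple of values; a mutable or ontology-backed representation would be an equally valid design.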
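
The matching and relating steps of Algorithm 6.1 can also be sketched in plain Python. The sketch below is one possible reading of the pseudocode, not the authors' implementation: it replaces the ontology model and reasoner with an in-memory dictionary of concept labels, it assumes that prefix and postfix matching compares the first or last four characters (the paper does not state a length), and it assumes that a relation is created between every pair of concepts matched in the same document. All function and parameter names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def match_keywords(keywords, concept_labels, affix_len=4):
    """Return the concepts whose label matches at least one keyword.

    A keyword matches a label if both are equal, if the keyword is contained
    in the label, or if they share a prefix or postfix of affix_len characters
    (affix_len is an assumption; the paper gives no value).
    """
    matches = []
    for keyword in keywords:
        k = keyword.lower()
        if not k:
            continue  # an empty keyword would match every label
        for concept, label in concept_labels.items():
            lab = label.lower()
            if (k == lab or k in lab
                    or k[:affix_len] == lab[:affix_len]
                    or k[-affix_len:] == lab[-affix_len:]):
                matches.append(concept)
                break
    return matches


def build_trend_description(corpus, concept_labels):
    """Count how often two concepts are matched in the same document.

    corpus is a list of documents, each given as a list of keywords.
    A Counter creates a relation on first use and increments it afterwards,
    which corresponds to createRelation / incCounter in the pseudocode.
    """
    relations = Counter()
    for keywords in corpus:
        concepts = sorted(set(match_keywords(keywords, concept_labels)))
        for pair in combinations(concepts, 2):
            relations[pair] += 1
    return relations


# Toy data, invented for illustration.
concept_labels = {"Germany": "Germany", "Share": "share", "StockPrice": "stock price"}
corpus = [
    ["germany", "shares", "bank"],
    ["Germany", "stock price"],
    ["share", "germany"],
]
relations = build_trend_description(corpus, concept_labels)
print(relations)
```

On the toy data, the pair (Germany, Share) is counted twice and (Germany, StockPrice) once, which is the kind of co-occurrence count that the topic structure query in Section 6 reports.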