Forecasting out-of-the-ordinary financial events¹

Marco Brambilla², Davide Greco³, Sara Marchesini², Luca Marconi², Mirjana Mazuran², Martina Morlacchi Bonfanti², Alessandro Negrini² and Letizia Tanca²

¹ This research is partially supported by the IBM Faculty Award “SOFIA: Semi-autOmatic Financial Information Analytics”
² Politecnico di Milano
³ Accento

Abstract. Being able to understand the financial market is very important for investors and, given the width and complexity of the topic, tools to support investor decisions are badly needed. In this paper we present Mercurio, a system that supports the decision-making process of financial investors through the automatic extraction and analysis of financial data coming from the Web. Mercurio formalizes the knowledge and reasoning of an expert in financial journalism and uses it to identify relevant events within financial newspapers. Moreover, it performs automatic analysis of financial indexes to identify relevant events related to the stock market. Then, sequential pattern mining is used to predict exceptional events on the basis of the knowledge of their past occurrences and relationships with other events, in order to warn investors about them.

1 Introduction

Financial data are produced daily and made available on the Web; the possibility to process them therefore allows us to model and study a world that is inherently complex due to the rules governing the financial market and to the internal and external factors influencing it. Investors constantly read financial news and analyze financial indexes, using their knowledge and experience to predict market events and make profitable investments. Our research aims at developing Mercurio, a decision support system to help investors during these activities.

Mercurio identifies relevant financial events, understands how they are related to each other and exploits this knowledge to predict future happenings. It uses: (i) the knowledge of an expert in financial journalism, whose deep understanding of the news goes beyond natural language processing alone, and (ii) financial indicators that provide an objective overview of the stock and, more generally, of the companies’ performances. On one hand, a domain expert knows “how to” read an article and understand its meaning, especially since its literal inspection might not coincide with the real meaning of what has happened. On the other hand, financial indicators provide an impartial overview of the past and current financial situation of companies. Financial happenings are all about signals and indications that companies leave behind during their life, and that the system must capture and interpret. Investment decisions are still made by human investors, and Mercurio provides them with more knowledge, possibly hidden to human observers, to improve their decision-making process.

Among the many financial data available on the web, Mercurio looks for those that convey “important” happenings, i.e., happenings that influence and possibly shake the market: we call them events. Some of them are more relevant because they represent considerable changes of the financial market: we call them catastrophes, and they coincide with extraordinary financial moves (not necessarily negative, though), e.g. mergers and acquisitions, other significant moves of the company management, or stock price variations. The occurrence of a catastrophe is usually anticipated by “symptoms” that we call signals. For example, an investor might observe that often, before a crash, a company gives an interview stating that profits are increasing; from then on, whenever such an interview is published, the expert will expect the related stock to fall in the stock market. Thus, an article containing an interview about increasing profit is a signal, while a stock crash is a catastrophe.

The paper is organized as follows: Section 2 briefly describes some proposals with aims similar to ours, Section 3 gives the details of the Mercurio system, Section 4 describes the current state of the implementation and, finally, Section 5 draws the conclusions we have reached so far and outlines future research directions.

2 Related work

Market prediction always receives high interest in the financial literature: mostly, only numerical data are used, but some approaches also exploit textual information to increase the quality of input data and improve predictions.

Works in [3, 4, 5, 6, 7] use Automated Text Categorization techniques to predict short-term market reactions to news. Articles are categorized depending on the influence their publication has on financial indexes, and then correlated with financial trends; different approaches use different types of classifiers. Our approach differs from these as we use expert knowledge to determine the relevance of articles. Among the examined works, [8] has a goal similar to that of Mercurio: finding sequences of articles that anticipate a changing trend. Once again the focus is on numerical data, while we are interested in predicting strategically extraordinary financial moves.

Existing works are primarily data driven; however, some proposals use a-priori knowledge about the application domain. Works in [9, 10] analyze financial articles and create a handcrafted thesaurus containing words that drive the stock prices and that are later used to predict stock prices. Similarly, [11] uses a-priori domain knowledge to predict interest rates: a cognitive map represents cause-effect relationships among the events in the domain and is used as the basis to retrieve the relevant news; these are then classified as either positive or negative according to the way they influence the rates. A work similar to ours is [12], where the objective is to predict the Tokyo stock exchange price using a-priori knowledge in the form of rules. Domain rules are defined by eliciting non-numerical factors that influence the stock price; however, these rules differ from ours as they convey general knowledge about political and international events. On the contrary, we focus on financial and economic events typical of a company’s life. The latter approaches differ from ours either in the way knowledge is represented or in the kind of knowledge adopted as background; we are currently trying to find a basis for an effective comparison, since the systems are not available and thus an experimental comparison on the same corpus is for the moment impossible.

To the best of our knowledge, a comprehensive system that makes use of both textual and numerical information to predict strategically extraordinary financial moves is still missing.

3 The Mercurio system

We envision an integrated and modular system that draws information from various sources and uses it appropriately with the final aim of predicting the happening of extraordinary financial events, that is, catastrophes. Finance is a domain in which the key to successful data analysis is the integrated analysis of heterogeneous data, where time-dependent and highly frequent numerical data (e.g., price and volume) and textual data (e.g., news articles) should be considered jointly [13]. Both categories might encompass various data sources that can be easily added to the system (as shown in Figure 1). Each of the textual data sources is managed by an Event Recognizer that is able to extract events from the data and feed them into Mercurio. Events can be catastrophes (i.e. they convey considerable changes of the financial market) or signals (i.e. symptoms anticipating a catastrophe). Event recognition strategies vary depending on the type and nature of the managed data: for instance, each financial market (Italian, British, etc.) has its own language and dynamics, and there are differences also among financial newspapers of the same country.

[Figure 1. Mercurio architecture. Components shown: textual information sources (“Corriere della Sera”, “Sole 24 Ore”, “Radiocor”, ...), one Event Recognizer per source, the Time-Sequence Generator, Model Constructor and Model Predictor inside Mercurio, numerical information (Index 1, Index 2, ...), and Alerts.]

In Mercurio, the events extracted from the financial news are received by the Time-Sequence Generator, which arranges them on one or more timelines depending on the use the system has to make of them. If the aim is to construct a model from them all, then the Time-Sequence Generator creates a single timeline where all the received events are placed and provides this timeline as input to the Model Constructor. On the other hand, if the aim is to predict the future happenings related to specific companies, each created timeline contains only events related to a specific company, and these data are given as input to the Model Predictor.

The Model Constructor module takes a sequence of events and uses Sequential Pattern Mining techniques to find frequent subsequences of events, thus creating a model of the data represented in terms of a set of sequential patterns. These patterns, together with the timeline of a company, are taken as input by the Model Predictor module, which uses them to forecast the happening of a certain catastrophe with respect to a certain company. The output provided by the Model Predictor is composed of a set of alerts such as “there is a P% probability that company A will encounter catastrophe C within X timeslots”.

The most challenging and crucial aspect of the project is thus the process of event recognition and sequencing; however, as a side analysis, the time series generated by the Time-Sequence Generator can be compared with numerical data (indexes), arranged on their own timeline, in order to understand correlations between them.

3.1 Textual information

Events can be recognized inside textual information through text analysis; in Mercurio we propose the use of three different approaches:

• Semantic approach: events are recognized by means of semantic rules that formalize the knowledge and experience of our domain expert.
• Automatic approach: events are identified by applying clustering algorithms to financial news.
• Hybrid approach: a combination of the previous approaches where catastrophes are recognized with semantic rules and signals by means of clustering.

In the semantic approach, in particular, rules define a relationship between sentence structures and corresponding events. This is one of the innovative features of Mercurio and can be further improved by introducing different formalization strategies.

Some rules are independent of each other in the sense that they represent events that do not interact in any way. Other rules instead might represent events that are somehow related, e.g., one event might be a composition of two different events. Moreover, some rules are related to events that involve only one company while others might represent an interaction among different financial players. These considerations generate a rule categorization that also introduces the need for rule ordering. Such ordering is needed during the phase when rules are applied to the financial news in order to ensure the correct event recognition.

An interesting idea is to organize and formalize the semantic rules into an ontology. The concepts in the ontology would represent events, and relationships among concepts would describe how events are related to each other and how they interact and depend on each other. Each concept should be related to a set of words (or sentence structures): those that express the corresponding rule. These words could be defined ad hoc according to the semantic rules in Mercurio, but can also originate from external ontologies describing the financial scenario or others. This addition helps to enrich the semantic formalization by taking into account both synonyms and new terms. The use of an ontology would also allow us, through the use of inference, to discover novel information about the formalized data, possibly stimulating the discovery of new events.

3.2 Numerical information

Time-dependent series such as financial indexes are represented as values on a timeline. Each timeslot (e.g. hour or day) is associated, according to the index, with a value, e.g. an opening value, price, closing value, average and so on. The timeline containing these values can be used, in addition to the timeline containing events coming from textual data, to enrich our data representation for the user. This is possible not only by taking into consideration single values but also by looking at some patterns inside the index.

A first technique is based on Bollinger Bands⁴ that, given a numerical series, provide an upper and a lower band such that the observed values usually oscillate within them. Whenever a value goes beyond these bands, it means that an unusual oscillation is happening. Thus, a trend that goes below the lower band is an unexpected price fall while a trend that goes above the upper band is an unexpected price rise.

⁴ http://www.investopedia.com/terms/b/bollingerbands.asp

A second technique that has been applied in the financial context is the detection of specific patterns, in terms of curve shape, inside financial time series (rather than single interesting points). The financial domain comprises some well-known and meaningful trend patterns [14] such as “double top”, “spike bottom”, “wedge” and so on. Another interesting approach is to approximate financial time series through the use of segments, for example by using piecewise segmentation [8]. In this way each segment represents a trend in the series; thus, we might have segments representing increasing, stable or decreasing volumes or prices.

Yet another segmentation technique specifically adopted in the financial scenario is based on Turning Points (TP) [15]. TPs are local minimum and maximum points from the historical data and are widely used in technical analysis for predicting the movement of a stock. In fact, they represent the trend of the stock change and can be used to identify the beginning or end of a transaction period.

4 Current implementation

Currently, our system predicts catastrophes by taking into consideration the information coming from financial news, while the part allowing the comparison with financial indexes is not implemented yet. The system comprises three main phases:

1. Data acquisition and management: financial news are extracted from web sources, structured and stored into a relational database; their contents are then cleaned and pre-processed;
2. Event recognition: articles are analyzed to identify both catastrophes and signals. Mercurio adopts the three different approaches introduced in Section 3: (i) semantic approach, (ii) automatic approach and (iii) hybrid approach.
3. Model construction: the events found in the previous step are used in combination with sequential pattern mining to learn a model, represented by means of temporal patterns, to predict the arrival of catastrophes.

4.1 Data acquisition and management

Mercurio currently monitors 250 Italian mid-cap companies and the information about them is gathered from important Italian financial and economic web sources such as “Il Sole 24 Ore”, “Radiocor”, “La Repubblica” and “Il Corriere della Sera”. Articles about companies are extracted directly from the newspaper websites and stored into a MySQL database (our initial data contains about 14,000 articles, from year 2010 to 2015), keeping only those that: (i) are part of financial and economic sections and (ii) refer to one of the chosen companies. After this phase the article texts are cleaned by tokenization, stopword elimination and word stemming.

Two different text pre-processing strategies are adopted, one used during the semantic event recognition and the other for the automatic event recognition. In the first strategy we keep all special characters, symbols, punctuation marks, numbers, words, company names and details of persons because they are needed by the expert’s rules. In the second strategy these data are not significant, and sometimes even misleading when applying clustering algorithms; thus they are eliminated from the texts.

4.2 Event recognition

Events are detected through text analysis of the financial news. Mercurio implements three event recognition approaches; all of them output a temporal sequence containing the recognized events.

4.2.1 Semantic event recognition.

Mercurio uses a set of rules that formalize the recognition of relevant events inside financial news. Rules define a relationship between some keywords, regular expressions (in general, sentence structures), and corresponding events (e.g. “take” is a keyword related to an acquisition event). An article that contains the expressions defined in a rule is assigned a label corresponding to the event formalized by the rule. Each article is assigned zero, one or more labels depending on the rules it triggers.

Rules capture meanings that go beyond natural language processing alone. For example, financial newspapers usually publish interviews when requested by a company. The question is: why would a company want to be interviewed? When this breaks a trend of non-communication it must be a signal. Also, an article that mentions the gross profit of a company is not a good sign because this indicator does not provide the amount of real revenue of the company, thus it could hide a negative trend of the company, whereas the net profit is not ambiguous, so this is a positive financial communication.

Currently, Mercurio encompasses 30 semantic rules, 7 of which identify catastrophes while the rest formalize signals.

4.2.2 Automatic event recognition.

This approach does not use any a-priori knowledge but relies on the detection of events by only applying clustering algorithms to the pre-processed financial news. Articles are represented in the Vector Space Model [1], where the weight of each term is its TF-IDF score in the article. Then, articles are clustered using the K-means algorithm and each article is assigned one label, corresponding to the cluster it belongs to.

The process of article clustering has proven to be quite challenging: at the end of the clustering phase we tried to interpret the results and found it impossible to distinguish between clusters representing signals and those representing catastrophes. This was a big drawback from our point of view since we were not able to understand how to predict catastrophes.

4.2.3 Hybrid event recognition.

To overcome the problem described above, we “added some semantics” to the automatic approach, obtaining what we called the hybrid one. In this approach, catastrophes are found by using the semantic rules that formalize catastrophic events, while signals are obtained by clustering all those articles that were not isolated by the rules defining catastrophic events.

4.3 Model construction

The output of the event recognition phase is a sequence of events, each associated with a timestamp that corresponds to the date and time of publishing of the article in which the event was found. Based on this sequence, Mercurio uses Sequential Pattern Mining to find “recurring” temporal patterns in the input data which are then used to predict future catastrophes.

This step is performed by using AIDA [2], a tool that encompasses both the model creation and prediction features. The tool is applied in two phases: (i) given as input a temporal sequence of events, a specific event e from the sequence and a minimum support threshold, it finds all temporal patterns that end with e and whose support is above the threshold; (ii) given the found model and a real-time flow of previously unseen articles, it predicts the happening of the learned events within a certain time span. In particular, during the prediction phase, each incoming new article is processed and labeled according to the events it triggers. Then, the system tries to match each event to the ones in the patterns of the model. If this happens, it waits for another event that would match the next event in the pattern. This process is repeated until a pattern expires because of time constraints or its last but one event is reached. When the latter happens, we can predict the happening of the next event, which is the one corresponding to the last node of the pattern and which, by construction, is always a catastrophe.

4.4 Experiments

Let us briefly discuss the performance of our prototype and compare the semantic approach (SA) and hybrid approach (HA). First of all, let us recall the differences between the two approaches, in terms of article-event relationships: (i) in SA an article might contain both catastrophes and signals, while in HA this is not possible because clustering is computed only on those articles that do not trigger any catastrophe; (ii) in SA an article might not trigger any rule and thus not generate any event, while in HA all the articles are associated with exactly one event, either a catastrophe or a cluster label; (iii) in SA an article might trigger more than one signal, while in HA each article belongs to only one cluster and is thus related to only one signal. These differences make it difficult to qualitatively compare the results of the two approaches: articles that trigger the same events in SA often belong to different clusters in HA.

In the semantic approach we considered 2549 instances of events (556 catastrophes, 1993 signals) and, for each catastrophe, built a model to predict it. The constructed models contain an average of 9 patterns whose lengths vary between 2 and 7. In the hybrid approach we considered 3283 articles (438 catastrophes, 2845 clustered). The constructed models contain an average of 13 patterns whose lengths vary between 2 and 6. The hybrid approach allows us to obtain a greater number of patterns w.r.t. the semantic approach and results in an increase of the average number of patterns for each catastrophe.

All the constructed models were tested on previously unseen data to determine the precision and recall of the predictions. We recall that low precision means that there are many wrong predictions, i.e. many times the system predicts a catastrophe which does not actually happen, and low recall means that there are many missed predictions, i.e. many times the system does not predict a catastrophe and the catastrophe actually happens.

The results obtained by applying the two methods vary depending on the catastrophe: (i) some catastrophes cannot be predicted because their model has only one pattern, which does not appear in the testing set; (ii) some catastrophes have maximum precision and maximum recall, thus they are perfectly predicted, i.e., there are only correct predictions and no wrong or missed ones; (iii) other catastrophes always have maximum precision because the system makes only correct predictions about them; however, (iv) some have low recall, which means that many times the catastrophe happens and the system was not able to predict it.

These results strongly depend on the minimum support threshold: the higher the support threshold, the higher the precision and the lower the recall; conversely, the lower the support threshold, the lower the precision and the higher the recall. In general, we noticed that both approaches offer satisfactory performance; however, we are working on making the models more accurate, so that the final prototype will be based on more training data and on an integration of the two techniques.

5 Conclusion

In this paper we discussed Mercurio, a system that supports the decision-making process of investors through the automatic extraction and analysis of financial data, with the aim of predicting extraordinary financial moves. Current results are encouraging but leave room for many improvements, especially related to enrichments of the current model, such as introducing weights and polarity for each event and using statistical information about the whole financial market, its different sectors and each monitored company.

REFERENCES

[1] G. Salton, A. Wong, C. S. Yang. A vector space model for automatic indexing. Commun. ACM 18, 11 (November 1975), 613-620.
[2] M. Mazuran, M. Simoni, L. Tanca. AIDA: Automatic Indexing based on DAta mining. SEBD 2015. pp.176-183.
[3] S. Bacher. Mining Unstructured Financial News to Forecast Intraday Stock Price Movements. PhD Thesis. University of Mannheim. 2012.
[4] G.P.C. Fung, J. Xu Yu, W. Lam. News Sensitive Stock Trend Prediction. PAKDD 2002. pp.481-493.
[5] G. Gidofalvi. Using News Articles to Predict Stock Price Movements. Department of Computer Science and Engineering, University of California, San Diego. 2001.
[6] M.A. Mittermayer. Forecasting Intraday Stock Price Trends with Text Mining Techniques. HICSS 2004.
[7] D. Peramunetilleke, R.K. Wong. Currency exchange rate forecasting from news headlines. ADC 2002. pp.131-139.
[8] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, J. Allan. Language models for financial news recommendation. CIKM 2000. pp.389-396.
[9] M.A. Mittermayer, G. F. Knolmayer. NewsCATS: A News Categorization and Trading System. ICDM 2006. pp.1002-1007.
[10] B. Wuthrich, V. Cho, S. W. Leung, D. Permunetilleke, K. Sankaran, J. Zhang. Daily stock market forecast from textual web data. ICSMC 1998. pp.2720-2725.
[11] T. Hong, I. Han. Knowledge-based data mining of news information on the Internet using cognitive maps and neural networks. Expert Systems with Applications 2002, 23(1):1-8.
[12] K. Kohara, T. Ishikawa, Y. Fukuhara, Y. Nakamura. Stock Price Prediction Using Prior Knowledge and Neural Networks. Intelligent Systems in Accounting, Finance and Management 1997, 6(1):11-22.
[13] F. Wanner, T. Schreck, W. Jentner, L. Sharalieva, D. A. Keim. Relating Interesting Quantitative Time Series Pattern with Text Events and Text Features. SPIE 2013.
[14] T. Fu, F. Chung, V. Ng, R. Luk. Evolutionary Segmentation of financial time series into subsequences. Evolutionary Computation 2001.
[15] J. Yin, Y. Si, Z. Gong. Financial Time Series Segmentation Based On Turning Points. ICSSE 2011.
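
As an illustration of the Bollinger Band technique described in Section 3.2, the following minimal Python sketch flags values that leave the bands. It is not part of Mercurio: the function name `bollinger_events`, the 20-slot window, the band width of two standard deviations and the choice of computing the bands on the slots preceding each value are all assumptions made for the example.

```python
from statistics import mean, pstdev


def bollinger_events(values, window=20, k=2.0):
    """Flag timeslots whose value falls outside the Bollinger Bands.

    Assumption: the bands for slot i are computed on the `window` slots
    preceding i (moving average +/- k population standard deviations).
    Returns a list of (slot_index, label) pairs.
    """
    events = []
    for i in range(window, len(values)):
        past = values[i - window:i]
        middle = mean(past)
        width = k * pstdev(past)
        if values[i] > middle + width:
            events.append((i, "unexpected_rise"))
        elif values[i] < middle - width:
            events.append((i, "unexpected_fall"))
    return events


# Hypothetical closing prices: a quiet oscillation followed by a jump.
prices = [10.0, 11.0] * 10 + [20.0]
print(bollinger_events(prices))
```

Each flagged slot could then be placed on the numerical timeline as an event, next to the events extracted from the news.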
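
The semantic event recognition of Section 4.2.1 can be pictured with the short Python sketch below, in which every rule pairs a regular expression with an event label and an article receives zero, one or more labels. The three rules, their English wording and the names `RULES` and `label_article` are hypothetical; the actual Mercurio rules are written by the domain expert for Italian news and are not reproduced here.

```python
import re

# Hypothetical rules: (event label, event kind, sentence-structure pattern).
RULES = [
    ("acquisition", "catastrophe",
     re.compile(r"\b(acquires?|takes? over|takeover)\b", re.IGNORECASE)),
    ("interview", "signal",
     re.compile(r"\binterview(ed)?\b", re.IGNORECASE)),
    ("gross_profit", "signal",
     re.compile(r"\bgross profit\b", re.IGNORECASE)),
]


def label_article(text):
    """Return the (label, kind) pairs of all rules triggered by the text."""
    return [(label, kind) for label, kind, pattern in RULES
            if pattern.search(text)]


print(label_article("In an interview, the CEO said gross profit rose."))
print(label_article("The board met on Tuesday."))
```

Keeping the rules in an ordered list also gives a natural place to encode the rule ordering discussed in Section 3.1.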
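
The prediction phase of Section 4.3 can be sketched as follows. This is a simplified reading of the matching loop, not the AIDA implementation: the function `predict`, the representation of a pattern as a tuple of labels whose last element is the catastrophe, and the single `max_span` time constraint measured from the first matched event are assumptions made for the example.

```python
def predict(patterns, events, max_span):
    """Emit an alert when all but the last event of a pattern have been seen.

    patterns: tuples of event labels (length >= 2), last label = catastrophe.
    events:   (timestamp, label) pairs sorted by timestamp.
    max_span: maximum time allowed since the first matched event (assumed).
    Returns a list of (timestamp, predicted_catastrophe) pairs.
    """
    alerts = []
    partial = []  # (pattern, index of next expected event, start timestamp)
    for ts, label in events:
        # Drop partial matches that expired because of the time constraint.
        partial = [p for p in partial if ts - p[2] <= max_span]
        # A pattern starting with this label opens a new partial match.
        candidates = partial + [
            (pat, 0, ts) for pat in patterns
            if len(pat) >= 2 and pat[0] == label
        ]
        partial = []
        for pat, nxt, start in candidates:
            if pat[nxt] == label:
                nxt += 1
            if nxt == len(pat) - 1:
                # Last-but-one event reached: predict the final catastrophe.
                alerts.append((ts, pat[-1]))
            else:
                partial.append((pat, nxt, start))
    return alerts


# Hypothetical model and event flow for one company.
model = [("interview", "gross_profit", "crash")]
flow = [(1, "interview"), (3, "gross_profit")]
print(predict(model, flow, max_span=5))
```

A fuller version would attach to each alert the probability and the number of timeslots mentioned in Section 3, which this sketch leaves out.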