Event detection and time series alignment to improve stock market forecasting

Elliot Maître
Institut de Recherche en Informatique de Toulouse / Scalian
Toulouse, France
elliot.maitre@irit.fr

Zakaria Chemli
Scalian
Paris, France
zakaria.chemli@scalian.com

Max Chevalier
Institut de Recherche en Informatique de Toulouse
Toulouse, France
max.chevalier@irit.fr

Bernard Dousset
Institut de Recherche en Informatique de Toulouse
Toulouse, France
bernard.dousset@irit.fr

Jean-Philippe Gitto
Scalian
Blagnac, France
jean-philippe.gitto@scalian.com

Olivier Teste
Institut de Recherche en Informatique de Toulouse
Toulouse, France
olivier.teste@irit.fr

ABSTRACT
Buying commodities is a critical issue for multiple industries because the variations of stock prices are induced not only by multiple economic parameters but also by external events. Raw material buyers must keep track of information in numerous fields, which constitutes a major challenge considering the exponential growth of online data. To tackle this issue, we propose an event detection approach in order to assist them in their anticipation process. Indeed, a lot of contextual information is contained in text, and exploiting it can improve one's ability to anticipate. Thus, we develop a framework of event detection and qualification, then we quantify the impact of these events on the stock market to help buyers in their anticipation process. In this paper, we will first introduce our context, then explain the scope of our work and our goals. After detailing the related work, we will present our proposition, conclude and propose some future work possibilities.

CCS CONCEPTS
• Information systems → Data management systems; Information retrieval; • Computing methodologies → Natural language processing.

KEYWORDS
Event detection, text analysis, NLP, neural networks, time series, commodities

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Time series play a major role in several industrial fields, such as energy [1], transport [29], economy [11] or finance [28]. Being able to accurately forecast time series is a major asset for companies that need to anticipate the modeled phenomenon. In commodities buying, the stock market is described by time series and is particularly volatile, making its forecasting both a strategic and a challenging task [7], [10]. Classic stock forecasting methods like [15] or [24] are usually based on economic data, such as currencies, indices or futures, but most of them do not take into account textual data, which can contain valuable information. Improving time series forecasting using textual information is a challenging research issue [30].

In order to extract text data, multiple sources can be considered. An important one is micro-blogging. Several studies showed the predictive power of such media [23]. Sentiment analysis on Twitter can be helpful [2], the activity on social networks can be correlated with variations of the stock [26], and Twitter data can be used to forecast polls that are then used to interpret stock variations [22]. Specialized financial websites, such as Seeking Alpha, where communities of traders share their insights about the stock market, also contain meaningful information for stock market forecasting [5]. Thus, multiple sources of information like micro-blogging and specialized community websites can be combined to improve stock market forecasting.

Leveraging the expertise of several buyers via multiple interviews, we observed that they base their decisions on events happening in the real world, as reported by newspapers and social networks. Hence, given that the stock market reacts to news and events [8], we will particularly focus on event detection in text. Indeed, some periods are more intense than others [4] and are considered more important. These periods, characterized by some events, carry more information than other periods. Being able to detect these events and quantify their impact constitutes a major asset for buyers and traders. It is a difficult task, as illustrated by the impact on the stock market of the Covid-19 outbreak, which was widely discussed but largely underestimated. With adapted tools, one could have anticipated this crisis and behaved accordingly in order to mitigate the impact.

Our research aims at providing a tool leveraging information contained in text data, especially events, in order to assist people in their time series anticipation process, i.e. commodities buyers in our context. In this paper we will focus on the event detection step. We will first introduce our general work, then we will focus on the related work about event detection in text. Afterwards, we will develop our proposal.

2 OVERVIEW OF OUR PROPOSAL
The task of commodities price forecasting is particularly complex due to the tremendous number of parameters that influence the variations of the stock. To bring more contextual information to the buyers and to our model, we want to combine time series with text information. This is not a straightforward process and it needs to be broken down into sub-tasks. Hence, our work will be articulated around three major steps, as illustrated by Figure 1:
(1) Time series analysis to find coherent temporal areas,
(2) Temporal event extraction,
(3) Events and time-series alignment.

Figure 1: General approach

While these steps are mutually dependent, it is also possible to treat them separately. Each of them constitutes a scientific challenge and thus will be developed separately [20], [9], [14], [25], [24], [15]. In the rest of this paper, we will particularly focus on part (2), which is the part we are currently working on, and give insights about (3), which is the next step of our work. Part (1) is currently not in the scope of this work; we plan to use existing approaches to tackle this issue.

3 RELATED WORK
There are different approaches to perform event detection in text. The two principal ones are topic modeling and event trigger detection. The former is a statistical approach while the latter is based on word classification.

3.1 Event trigger based approaches
The event trigger based approach is a classification method which consists in classifying words into event categories. Some words, named trigger words, are supposed to trigger the event in the sentence and carry its meaning. Detecting and classifying those words hence allows one to understand whether a sentence depicts an event. ACE 2005 [12] is the reference dataset for this task and has been studied multiple times [20], [9], [14]. According to the ACE 2005 annotation guideline, in the sentence "A police officer was killed in New Jersey today", an event detection system should be able to recognize the word "killed" as a trigger for the event "Die". Currently, the state of the art for this task is achieved by using neural networks, and several approaches have been proposed on this basis. Nguyen and Grishman [20] introduced a CNN-based approach to detect these triggers. In [9], the authors improve this work by adding a Bi-LSTM to the CNN in order to include sentence context in the detection. The authors of [14] propose a self-regulating GAN to perform the detection. In [18], the authors include even more context through a document-scale approach.

3.2 Topic modeling approaches
While the previous approach is mostly based on semantic and syntactic properties, topic modeling approaches are statistical approaches. The authors of [27] propose to use Twitter users as human sensors to detect earthquake occurrences in real time. They use keywords to detect these target events and probabilistic models to detect the location of the events. Weng and Lee [31] analyze the wavelet signal of words in tweets in order to filter trivial words, and cluster the remaining words to detect events. In [17], the authors analyze daily topics on Twitter via Latent Dirichlet Allocation (LDA) and then determine the similarity between daily topics. They detect bumps in word usage and then cluster topics into "eventy topics". The authors of [21] propose a sub-event detection technique using topic modeling. This technique detects sub-events linked to an event and assigns a label to these sub-events. In [13], the authors propose a real-time framework to detect minor and major events on Twitter. The first module of the framework detects events and then the second module clusters these events.

Thus, event trigger based approaches tend to exploit the power of deep neural networks, while topic modeling approaches are based on the frequency of words and on what is discussed on social networks. We argue that combining the assets of each technique could be an interesting objective. The power of representation brought by neural networks is complementary to the detection approach of topic modeling.

4 OUR PROPOSAL: EVENT DETECTION COMBINING TOPIC MODELING AND NEURAL MODELS
Several constraints, such as the influence of possibly unknown parameters and the real-time nature, arise from the definition of the stock market. To predict future stock, one must exploit historical data but also real-time data. Hence, our framework must be applicable to data streams such as the Twitter stream. Moreover, some events may not be comparable to past events, so the classification must be able to handle and assign labels to unknown classes. However, we do not aim at real-time commodities trading; we want to assist buyers in their daily buying decisions. We only want our solution to be applicable in a real-time context, i.e. with a granularity sufficient to help buyers in their daily transactions.

4.1 Motivations
Topic-modeling approaches correspond to our prerequisites, but some of them are not adapted to data streams or do not work with unknown classes. Recent work which satisfies our constraints fails to exploit the properties of the language and is only based on a probabilistic approach linked with word occurrences.

Neural based approaches, such as the methods used in the trigger-based approaches, are powerful at exploiting patterns discovered in past data. Moreover, they bring more information by leveraging semantic and syntactic information, with methods such as word and sentence embeddings.

Our goal is to exploit this information to improve the quality of event representation. We think that these approaches are complementary and we assert that combining them will allow us to leverage the time and frequency aspect derived from topic modeling and the representation power of neural networks, in order to optimize event classification.

4.2 Our method
To do so, we propose a novel approach based on word and sentence embeddings. The idea behind this method is to leverage the geometric power of these embeddings. Using the representation obtained, similar documents should have similar representations in the embedding space. By comparing the distance between documents, we will be able to create clusters of documents. Each cluster corresponds to an event. Some events may be related, and clusters of similar events might be regrouped in an event cluster. This event cluster represents a class of events, such as sports events or geopolitical events. Hence, unknown events can be assimilated to events in the same event cluster. We will order documents by their publication time, so we can adapt to the real-world context we want to apply this method to, i.e. commodities stock estimation using event detection in a text data stream.

Our proposition is articulated as follows:
(1) Text data is extracted from sources previously selected by buyers, such as trusted Twitter users, in order to gather text written in regular English and focused on sharing important information. Indeed, most of the content on the internet is created by a few users.
(2) In order to have an exploitable event representation, we embed the content, using word embedding and sentence embedding.
(3) The embedded content is clustered, leveraging the amount of information the embeddings bring. This can be done by placing the embedded content on the vertices of a graph and creating an edge between each pair of vertices, weighted by the distance between the two embeddings. If the distance is above a certain threshold, the edge is removed in order to create clusters of related contents.
(4) The clusters are labeled by determining a representative document. An example of a representative document is the document with the minimum average distance to the other tweets of the cluster.

Thus, the clusters obtained are expected to be of great quality thanks to a better representation, allowing a better identification and classification of events. These detected events will have two usages: they will be used in the next steps in order to estimate the variations of the time series, and they will also be given to the buyers in order to help them make their decision, alongside our time series estimation. Since the tweets are extracted from the Twitter stream, we will keep them in their order of appearance, which allows us to take time into account and adapt to the type of application we want.

4.3 Pros and cons
This method brings more information than a regular topic modeling approach, leveraging the representation power of neural based approaches. It allows us to consider the documents in a time-ordered manner, which is not the case in most classification problems. This makes it suitable for time-based applications such as ours.

However, the efficiency of such a model for unknown events is not certain. Indeed, neural networks sometimes fail to generalize correctly. Handling an event containing too much novelty might be misleading for some models. The time aspect may also have some impact on the efficiency of the model.

Moreover, neural based approaches require annotated data, which is not always available, especially in contexts such as Twitter where the amount of data is huge. This problem has been considered in recent work, notably in [19], where the authors propose a weakly-supervised approach to limit annotation time. The problem of unknown classes is not appropriately handled by these approaches. Detecting novelty without labeling it could be an insight in order to detect change in the time series, but in the meantime, we want to focus on a method allowing us to label unknown events.

Thus, this method helps us bring more information in order to fulfill our classification objective and to adapt to our time-dependent context; however, it may raise several issues that we have not addressed yet.

5 LINKING EVENTS AND TIME SERIES VARIATIONS TO ESTIMATE FUTURE TIME SERIES VARIATIONS
Following the idea of combining time series and text, the detected events will be fed to a generative adversarial network (GAN) along with time series data, to predict expected variations of the stock prices. Figure 2 illustrates the process we will describe. Our intuition is that the GAN will be able to link detected events and variations in the time series. A GAN is composed of two major parts: the generator and the discriminator. The generator tries to mimic the actual data and the discriminator tries to identify fake data produced by the generator. We want to produce time series estimations, so our solution is articulated as follows: the generator part of the GAN will produce time series estimations taking events as input. The discriminator will be fed with two inputs, the actual time series and the fake time series, which is generated by the generator. The objective for the generator is to produce time series estimations that are close enough to reality to fool the discriminator. The discriminator's objective is to maximize its accuracy in differentiating fake and real input.

Figure 2: GAN example

Since the final output we want is a time series estimation, our general objective is to have a generator as optimized as possible. The discriminator is only used in the training loop, in order to give feedback to the generator and train it to produce valuable output. In order to give hints about the future time series variations, the generator will take as input the events we have previously detected, which are supposed to carry information that influences these variations. By training it properly, the generator will be able to extract information from the events and from the feedback of the discriminator. The feedback from the discriminator contains information about the time series, which is not directly available to the generator. Indeed, the final objective is to have a generator which is able to predict time series variations by only exploiting the events we detect.

To summarize, the GAN corresponds to the event-quantifying step and the event-time series alignment step. Its objective is to automatically extract information from the detected events it takes as input, and link it with the variations in the historical time series data.

6 CONCLUSION
Considering the constraints induced by our context, namely detecting possibly unknown events in order to help buyers in their daily buying decisions, we deduced that a combination of topic-modeling approaches and neural based models is a promising method to complete our task. We propose to embed content using recent models, i.e. word and sentence embeddings, in order to produce a better clustering leveraging the representation power of these models and therefore have a better event classification.

7 FUTURE WORK
In [3], the authors temporalize word2vec to detect the most discussed topics during certain phases of the bitcoin time series. We would like to transpose this idea to our context, by detecting which events are activated during special phases of the commodities stock. Using the timestamps of the documents, the idea is to determine which clusters of events are activated during a certain period of time and link them with stock variations. While using timestamps to order documents is not difficult, determining when an event is activated brings many more difficulties, such as tracking event evolution and detecting the end of an event. Another goal is to be able to directly link time series and events, with a method similar to [25]. Finally, Transformer-based architectures are currently revolutionising the NLP domain. We would like to be able to better represent events, leveraging the power of Transformer encoders such as BERT [6]. Wu et al. did something similar with news representation in [32]. Indeed, transformers are able to produce quality embeddings for both words and sentences and have proved their quality by outperforming static embedding techniques. A major drawback of transformer-based methods is their computation cost. Thus, the usage of distilled models such as TinyBERT [16] could be a solution.

REFERENCES
[1] John Asafu-Adjaye. 2000. The Relationship between Energy Consumption, Energy Prices and Economic Growth: Time Series Evidence from Asian Developing Countries. Energy Economics 22 (12 2000), 615–625. https://doi.org/10.1016/S0140-9883(00)00050-5
[2] Johan Bollen, Huina Mao, and Xiao-Jun Zeng. 2010. Twitter mood predicts the stock market. CoRR abs/1010.3003 (2010). arXiv:1010.3003 http://arxiv.org/abs/1010.3003
[3] Andrew Burnie and Emine Yilmaz. 2019. An Analysis of the Change in Discussions on Social Media with Bitcoin Price. 889–892. https://doi.org/10.1145/3331184.3331304
[4] Patrick Champagne. 2000. L'événement comme enjeu. (2000). https://doi.org/10.3406/reso.2000.2231
[5] Hailiang Chen, Prabuddha De, Yu Hu, and Byoung-Hyoun Hwang. 2013. Wisdom of Crowds: The Value of Stock Opinions Transmitted Through Social Media. Review of Financial Studies (12 2013). https://doi.org/10.2139/ssrn.1807265
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[7] Claude B. Erb and Campbell R. Harvey. 2006. The Strategic and Tactical Value of Commodity Futures. Financial Analysts Journal 62, 2 (2006), 69–97. https://doi.org/10.2469/faj.v62.n2.4084
[8] Eugene F. Fama. 1965. The Behavior of Stock-Market Prices. The Journal of Business 38, 1 (1965), 34–105. http://www.jstor.org/stable/2350752
[9] Xiaocheng Feng, Lifu Huang, Duyu Tang, Heng Ji, Bing Qin, and Ting Liu. 2016. A Language-Independent Neural Network for Event Detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Berlin, Germany, 66–71. https://doi.org/10.18653/v1/P16-2011
[10] Gary Gereffi. 1999. International trade and industrial upgrading in the apparel commodity chain. Journal of International Economics 48, 1 (June 1999), 37–70. https://ideas.repec.org/a/eee/inecon/v48y1999i1p37-70.html
[11] Clive Granger and Paul Newbold. 1986. Forecasting Economic Time Series (2 ed.). Elsevier. https://EconPapers.repec.org/RePEc:eee:monogr:9780122951831
[12] Ralph Grishman, David Westbrook, and Adam Meyers. 2005. NYU's English ACE 2005 system description. Proceedings of ACE 2005 Evaluation Workshop. Journal on Satisfiability 51 (01 2005).
[13] Mahmud Hasan, Mehmet A. Orgun, and Rolf Schwitter. 2019. Real-time event detection from the Twitter data stream using the TwitterNews+ framework. Information Processing and Management 56, 3 (5 2019), 1146–1165. https://doi.org/10.1016/j.ipm.2018.03.001
[14] Yu Hong, Wenxuan Zhou, Jingli Zhang, Guodong Zhou, and Qiaoming Zhu. 2018. Self-regulation: Employing a Generative Adversarial Network to Improve Event Detection. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 515–526. https://doi.org/10.18653/v1/P18-1048
[15] Wei Huang, Yoshiteru Nakamori, and Shou-Yang Wang. 2005. Forecasting Stock Market Movement Direction with Support Vector Machine. Comput. Oper. Res. 32, 10 (Oct. 2005), 2513–2522. https://doi.org/10.1016/j.cor.2004.03.016
[16] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. https://openreview.net/forum?id=rJx0Q6EFPB
[17] Nathan Keane, Connie Yee, and Liang Zhou. 2015. Using Topic Modeling and Similarity Thresholds to Detect Events. In Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation. Association for Computational Linguistics, Denver, Colorado, 34–42. https://doi.org/10.3115/v1/W15-0805
[18] Dorian Kodelja, Romaric Besançon, and Olivier Ferret. 2019. Exploiting a More Global Context for Event Detection Through Bootstrapping. 763–770. https://doi.org/10.1007/978-3-030-15712-8_51
[19] Shulin Liu, Yang Li, Feng Zhang, Tao Yang, and Xinpeng Zhou. 2019. Event Detection without Triggers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 735–744. https://doi.org/10.18653/v1/N19-1080
[20] Thien Huu Nguyen and Ralph Grishman. 2015. Event Detection and Domain Adaptation with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Beijing, China, 365–371. https://doi.org/10.3115/v1/P15-2060
[21] Diogo Nolasco and Jonice Oliveira. 2019. Subevents detection through topic modeling in social media posts. Future Generation Comp. Syst. 93 (2019), 290–303.
[22] Brendan O'Connor, Ramnath Balasubramanyan, Bryan Routledge, and Noah Smith. 2010. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. International AAAI Conference on Weblogs and Social Media 11.
[23] Nuno Oliveira, Paulo Cortez, and Nelson Areal. 2016. The impact of microblogging data for stock market prediction: Using Twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert Systems with Applications 73 (12 2016). https://doi.org/10.1016/j.eswa.2016.12.036
[24] Ping-Feng Pai and Chih-Sheng Lin. 2005. A hybrid ARIMA and support vector machines model in stock price forecasting. Omega 33 (12 2005), 497–505. https://doi.org/10.1016/j.omega.2004.07.024
[25] Filipe Rodrigues, Ioulia Markou, and Francisco Pereira. 2018. Combining time-series and textual data for taxi demand prediction in event areas: A deep learning approach. Information Fusion 49 (07 2018). https://doi.org/10.1016/j.inffus.2018.07.007
[26] Eduardo Ruiz, Vagelis Hristidis, Carlos Castillo, Aristides Gionis, and Alejandro Jaimes. 2012. Correlating Financial Time Series with Micro-Blogging Activity. WSDM 2012 - Proceedings of the 5th ACM International Conference on Web Search and Data Mining, 513–522. https://doi.org/10.1145/2124295.2124358
[27] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. 2010. Earthquake Shakes Twitter Users: Real-Time Event Detection by Social Sensors. Proceedings of the 19th International Conference on World Wide Web, WWW '10, 851–860. https://doi.org/10.1145/1772690.1772777
[28] Ruey S. Tsay. 2005. Analysis of financial time series (2nd ed.). Wiley-Interscience, Hoboken, NJ. http://gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=ppn+483463442&sourceid=fbw_bibsonomy
[29] Mascha C. van der Voort, Mark Dougherty, M.S. Dougherty, and Susan Watson. 1996. Combining Kohonen maps with Arima time series models to forecast traffic flow. Transportation research. Part C: Emerging technologies 4, 5 (1996), 307–318. https://doi.org/10.1016/S0968-090X(97)82903-8
[30] Baohua Wang, Hejiao Huang, and Xiaolong Wang. 2012. A novel text mining approach to financial time series forecasting. Neurocomputing 83 (04 2012), 136–145. https://doi.org/10.1016/j.neucom.2011.12.013
[31] Jianshu Weng and Bu-Sung Lee. 2011. Event Detection in Twitter. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2767
[32] Chuhan Wu, Fangzhao Wu, Mingxiao An, Yongfeng Huang, and Xing Xie. 2019. Neural News Recommendation with Topic-Aware News Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 1154–1159. https://doi.org/10.18653/v1/P19-1110
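
As an illustration of steps (3) and (4) of the method in Section 4.2, the following minimal Python sketch clusters embedded documents with a thresholded distance graph and picks one representative document per cluster. It is not the authors' implementation. It assumes cosine distance as the metric, assumes that the edges discarded are those whose distance exceeds the threshold (so that only close documents stay connected), and reads clusters as the connected components of the remaining graph. The function names, the toy two-dimensional vectors and the `max_distance` value are all hypothetical; real inputs would be sentence embeddings of tweets.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def threshold_clusters(embeddings, max_distance):
    """Cluster documents as connected components of a thresholded graph.

    Every document is a vertex; an edge between two documents is kept only
    when the distance between their embeddings is at most `max_distance`.
    """
    n = len(embeddings)
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_distance(embeddings[i], embeddings[j]) <= max_distance:
                parent[find(i)] = find(j)  # merge the two components

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


def representative(cluster, embeddings):
    """Index of the document with the minimum average distance to the others."""
    if len(cluster) == 1:
        return cluster[0]

    def mean_distance(i):
        others = [j for j in cluster if j != i]
        total = sum(cosine_distance(embeddings[i], embeddings[j]) for j in others)
        return total / len(others)

    return min(cluster, key=mean_distance)


# Toy 2-D "embeddings", assumed to be already sorted by publication time.
docs = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.8, 0.2],
    [0.1, 0.9],
]
clusters = threshold_clusters(docs, max_distance=0.1)
labels = [representative(c, docs) for c in clusters]
print(clusters, labels)
```

The pairwise loop is quadratic in the number of documents, which is acceptable for a sketch but would probably need a sliding time window or an approximate nearest-neighbour index before being applied to a stream such as Twitter.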