Visual Analytics Methods for the Automatic Content Generation from Streaming Data

Fabio Giachelle [0000-0001-5015-5498]
Department of Information Engineering, University of Padua
fabio.giachelle@unipd.it

Abstract. We present a PhD project on the application of Visual Analytics (VA) methods to the automatic generation of wiki documents (i.e. wikification) and event storylines from streaming data. In contrast to static, automatically generated wiki-like documents, this project investigates the use of VA techniques to automatically generate wiki documents made up of dynamic contents, based on user preferences. The purpose of the project is to make the user an active component of the wikification process, able to provide useful feedback on which contents are more relevant to the topic of interest, thus improving the wikification algorithms. To this end, the project focuses on exploiting VA methods and data provenance to enhance data comprehension, by means of continuous interaction with the user according to the human-in-the-loop model.

Keywords: Visual Analytics · Wikification · Data Provenance · Human-in-the-loop.

1 Introduction

1.1 Overall context

Every day, millions of users browse the Internet looking for information to satisfy their information needs. In particular, Wikipedia is one of the most visited reference websites of all time and probably the most popular web-based, free-content encyclopedia in the world. Since Wikipedia is based on a model of openly editable content, the number of articles grows continuously. In recent years, the automatic creation of Wikipedia articles (i.e. automatic wikification) has attracted growing interest. In particular, recent research works focus on the automatic creation of wiki documents that are dynamically edited over time, based on distributed streaming data from heterogeneous sources such as news feeds and social media.
However, these documents are available to end users only as static web pages. As a consequence, users cannot provide any feedback to assess and improve the performance of the wikification algorithms. For this reason, as shown in Figure 1, the focus of this project is to allow the dynamic visualization of automatically generated articles, by means of interactive visual interfaces that control the algorithms underlying them. In this way, users can judge which contents are more relevant to the given topic, and the system can exploit implicit and explicit user feedback to assess and improve the performance of the wikification algorithms.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FDIA 2019, 17-18 July 2019, Milan, Italy.

Fig. 1: Wikification and improved understanding of data by means of visual analytics methods.

1.2 State-of-the-art

Wikification and event storylines generation

The term “wikification” [12] refers to the creation of documents containing entities linked to Wikipedia, which represents the target knowledge base. The task of identifying entities in a given text and linking them to a specific knowledge base is known as “entity linking” [13]. Event storylines, instead, are a chronological reconstruction of a sequence of events over time, related to a topic of interest [1]. In particular, we consider an “event” to be something that happens and is important or of interest to users. Nowadays, news and information about events are shared mostly through social media. Hence, many research works focus on collecting useful information from social network services for the automatic generation of event storylines [10]. The main idea is to generate event storylines by fusing retrieved crowdsourced data [3, 7], so as to grant access to a single automatically generated web page containing all the useful information regarding a specific event of interest.
This is a change of paradigm from the retrieval of existing relevant content to the generation of new documents as a synthesis of the relevant content [11]. Recent research works have studied new wikification algorithms to create dynamic Wikipedia pages that are automatically edited based on social activity, e.g. on Twitter [2]. However, these methods do not consider any interaction with the end user, either in the wikification or in the consultation phase. Hence, users have no means to dynamically select or exclude sources, or to easily understand where a piece of information comes from. In addition, no feedback signals are gathered to improve the wikification algorithms. For these reasons, this project focuses on the use of visual analytics techniques to support human-in-the-loop interaction and the comprehension of data provenance, which is exploited to improve both the wikification and the event storylines generation processes.

Visual Analytics

Visual Analytics (VA) is “the science of analytical reasoning facilitated by interactive visual interfaces” [14]. VA integrates information visualization with data and model interaction, with the purpose of helping the user to understand data and dynamically modify the algorithms underlying them. Progressive Visual Analytics (PVA) methods overcome the inefficiencies of the traditional “compute-wait-visualize” workflow: they allow analysts to inspect the partial results of an algorithm without having to wait for the end of the process. The partial results of each stage are shown in the visual interface, so that the user can make decisions that influence the progression of the analytical algorithms running in the background [6]. In Information Retrieval (IR), VA techniques have recently been applied to make experimental evaluation easier and more intuitive [4,5].
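To make the progressive workflow concrete, the following minimal Python sketch processes a document stream in small chunks and exposes a snapshot of the running result after each chunk, so that an interface could redraw and a user could stop early. It is only an illustration under stated assumptions: the function names `progressive_term_counts` and `run_until`, the chunk size, and the choice of term counting as the analytical task are hypothetical and are not taken from any of the cited systems.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def progressive_term_counts(
    documents: Iterable[str], chunk_size: int = 2
) -> Iterator[Counter]:
    """Yield a snapshot of the running term counts after every chunk.

    Hypothetical stand-in for an analytical algorithm that exposes
    partial results instead of a single final answer.
    """
    counts: Counter = Counter()
    chunk: List[str] = []
    for doc in documents:
        chunk.append(doc)
        if len(chunk) == chunk_size:
            for text in chunk:
                counts.update(text.lower().split())
            chunk = []
            yield Counter(counts)  # copy, so earlier snapshots stay unchanged
    if chunk:  # flush the last, possibly shorter, chunk
        for text in chunk:
            counts.update(text.lower().split())
        yield Counter(counts)


def run_until(
    snapshots: Iterator[Counter], term: str, threshold: int
) -> List[Counter]:
    """Consume partial results and stop once `term` reaches `threshold`.

    The stopping rule plays the role of a user who inspects each partial
    view and decides that enough evidence has been collected.
    """
    seen: List[Counter] = []
    for snapshot in snapshots:
        seen.append(snapshot)
        if snapshot[term] >= threshold:
            break
    return seen
```

Because the computation is a generator, no work is done on the rest of the stream once the consumer stops iterating, which is the property that distinguishes this pattern from a “compute-wait-visualize” pipeline.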
Despite being a promising and effective approach for dealing with streaming data, however, PVA has, to the best of our knowledge, never been used in IR.

2 Project objectives

This project aims at developing innovative visual analytics tools to improve the wikification process and storyline generation, by exploiting dynamic and heterogeneous streaming data. It therefore focuses on the use of visual analytics techniques to support human-in-the-loop interactions, and on the use of data provenance to improve both the quality of the wikification and event storylines generation processes and the user experience. In Figure 1, the user sits in the middle of the loop that generates articles. According to the human-in-the-loop model, there is a continuous interaction between the user and the visual interface. The system architecture we propose aims at maximizing the human contribution, which is fundamental to assess and improve the performance of the wikification and event storylines generation processes. Hence, we will focus on the application of VA methods in a human-in-the-loop architecture, in which user feedback is used to produce dynamic wiki articles that summarize the relevant information regarding a topic. This approach enhances data comprehension and exploits data provenance to reward the sources that provide more relevant and authoritative contents. This represents a change of paradigm, from static automatically generated articles to dynamic ones, in which the user becomes an active component in the improvement of wikification algorithms.

3 Research work description

The four main stages of this project are as follows:

1. State-of-the-art inspection: This stage focuses on studying the state of the art of web crawling, clustering algorithms for streaming data coming from social media, entity linking, wikification, data provenance, the human-in-the-loop model, and VA methods for dynamic interactive contents and data explainability.

2. Automatic wikification and event storylines generation: This stage aims at reproducing state-of-the-art wikification algorithms and involves the following tasks (see the left part of Figure 1):
   – Web crawling and gathering of streaming data from social media, information networks and microblogging services.
   – Entity linking to the reference knowledge base. To this end, we will consider the use of relevant ontologies such as BabelNet¹.
   – Clustering of retrieved documents, articles, news and posts. Since streaming data come from multiple heterogeneous sources and information may be duplicated, clustering algorithms are necessary to aggregate semantically related documents. Figure 2 shows an example of the results for an underspecified query (“Aircraft”): the retrieved documents regard two different domains (“New Aircrafts” and “Ethiopian Airlines crash”), and the purpose of clustering algorithms is to assign each document to the appropriate cluster. For clustering purposes, we exploit semantic information to enrich the bag-of-words (BOW) model and create a bag-of-concepts (BOC) document representation [8].
   – Event storylines reconstruction. Temporal information is exploited to produce timelines that present events in chronological order. For this purpose, one possible benchmark dataset is presented in [16].

¹ https://babelnet.org

Fig. 2: Clustering of web crawled streaming data.

3. Application of Visual Analytics (VA) methods: This stage focuses on the application of VA methods to the wikified contents generated in the previous phase. Therefore, during this stage, new VA tools for automatic wikification will be developed. Since VA methods rely on the human-in-the-loop model, according to which there is a continuous interaction between the user and the visual interface, VA interfaces play an important role.
For this reason, this stage involves the study and development of intuitive VA interfaces designed to provide complete control over the parameters that influence the analytics algorithms used to extract useful information from data. To this end, the choice of the UI framework (e.g. React²) for the development of the VA interfaces is crucial, because the interfaces need to be reactive and capable of updating quickly, following the continuous flow of data coming from multiple streaming sources. In this context, the most relevant contents, selected by the analytical algorithms running in the background, are shown in the visual interface, so that users can judge which contents are more relevant to the given topic and satisfy their information needs. The judgements provided act as feedback to improve the wikification algorithms and to allow the visualization of dynamic contents.

4. Evaluation: The last stage concerns the evaluation of the overall architecture presented in Figure 1. In particular, this stage focuses on evaluating the performance of the wikification algorithms. The evaluation will be carried out by means of a tool that will be developed to compare different wikification algorithms, based on user assessments. Furthermore, different user studies will be conducted to investigate whether and how the developed tools improve the effectiveness of wikification algorithms and speed up access to knowledge. Examples of such user studies are A/B testing, focus groups, web analytics and first-click testing.

4 Final remarks

This PhD project focuses on the application of VA methods to automatic wikification and event storylines generation. The use of VA techniques allows us to generate dynamic wiki-like documents based on user feedback and interaction.
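As a purely illustrative example of how user feedback and data provenance might interact, the minimal Python sketch below keeps one weight per source, raises or lowers it according to user judgements, and re-ranks the candidate contents accordingly. The `Item` class, the functions `update_source_weights` and `rank`, the additive update rule and the step size are all assumptions made for this sketch; they do not describe the project's actual algorithms.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    text: str
    source: str       # provenance: where the content came from
    relevance: float  # score from a (hypothetical) wikification algorithm


def update_source_weights(
    weights: Dict[str, float], source: str, relevant: bool, step: float = 0.1
) -> Dict[str, float]:
    """Return new weights after one user judgement on content from `source`.

    Sources start at weight 1.0; a positive judgement adds `step`,
    a negative one subtracts it, and weights never drop below zero.
    """
    updated = dict(weights)
    current = updated.get(source, 1.0)
    updated[source] = max(0.0, current + (step if relevant else -step))
    return updated


def rank(items: List[Item], weights: Dict[str, float]) -> List[Item]:
    """Order items by algorithmic relevance scaled by their source weight."""
    return sorted(
        items,
        key=lambda item: item.relevance * weights.get(item.source, 1.0),
        reverse=True,
    )
```

In this sketch the weights are never mutated in place, so an interface could keep earlier states of the weights and show the user how each judgement changed the ranking.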
In recent years, some research works have studied methods for storyline visualization [15], but the use of visual analytics techniques for wikification and event storylines generation still needs to be examined. In addition, this project aims at investigating the combination of VA methods with algorithmic strategies, e.g. clustering of news and microblog posts [9]. It is worth noting that automatic wikification and the generation of event storylines are open problems. In recent years, plenty of work has been done to address these problems, but they are far from being solved. However, the efforts made to address them are useful to improve the quality of automatically generated wiki documents and, most importantly, to ease access to knowledge.

Acknowledgments: This work is partially supported by the Computational Data Citation (CDC-STARS) project of the University of Padua.

² https://reactjs.org

References

1. AlNoamany, Y., Weigle, M.C., Nelson, M.L.: Generating stories from archived collections. In: Proceedings of the 2017 ACM on Web Science Conference. pp. 309–318. WebSci ’17, ACM, New York, NY, USA (2017)
2. Alonso, O., Kandylas, V., Tremblay, S.E.: Automatic story evolution wikification from social data. In: Twelfth International AAAI Conference on Web and Social Media (2018)
3. Alonso, O., Sellam, T.: Quantitative information extraction from social data. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. pp. 1005–1008. SIGIR ’18, ACM, New York, NY, USA (2018)
4. Angelini, M., Fazzini, V., Ferro, N., Santucci, G., Silvello, G.: Claire: A combinatorial visual analytics system for information retrieval evaluation. Information Processing & Management 54(6), 1077–1100 (2018)
5. Angelini, M., Ferro, N., Santucci, G., Silvello, G.: Virtue: A visual tool for information retrieval performance evaluation and failure analysis. Journal of Visual Languages & Computing 25(4), 394–413 (2014)
6. Angelini, M., Santucci, G., Schumann, H., Schulz, H.J.: A review and characterization of progressive visual analytics. In: Informatics. vol. 5, p. 31. Multidisciplinary Digital Publishing Institute (2018)
7. Guo, B., Ouyang, Y., Zhang, C., Zhang, J., Yu, Z., Wu, D., Wang, Y.: Crowdstory: Fine-grained event storyline generation by fusion of multi-modal crowdsourced data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 1–19 (2017)
8. Huang, A., Milne, D., Frank, E., Witten, I.H.: Clustering documents using a Wikipedia-based concept representation. In: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. pp. 628–636. PAKDD ’09, Springer-Verlag (2009)
9. Li, L., Ye, J., Deng, F., Xiong, S., Zhong, L.: A comparison study of clustering algorithms for microblog posts. Cluster Computing 19(3), 1333–1345 (2016)
10. Lin, C., Lin, C., Li, J., Wang, D., Chen, Y., Li, T.: Generating event storylines from microblogs. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management. pp. 175–184. CIKM ’12, ACM, New York, NY, USA (2012)
11. Lioma, C., Larsen, B., Petersen, C., Simonsen, J.G.: Deep learning relevance: Creating relevant information (as opposed to retrieving it) (2016)
12. Mihalcea, R., Csomai, A.: Wikify!: Linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management. pp. 233–242. CIKM ’07 (2007)
13. Rao, D., McNamee, P., Dredze, M.: Entity linking: Finding extracted entities in a knowledge base. In: Multi-source, Multilingual Information Extraction and Summarization. pp. 93–115. Springer Berlin Heidelberg (2013)
14. Scholtz, J.: Beyond usability: Evaluation aspects of visual analytic environments. In: 2006 IEEE Symposium on Visual Analytics Science and Technology. pp. 145–150. IEEE (2006)
15. Tanahashi, Y., Hsueh, C.H., Ma, K.L.: An efficient framework for generating storyline visualizations from streaming data. IEEE Transactions on Visualization and Computer Graphics 21(6), 730–742 (2015)
16. Zubiaga, A.: A longitudinal assessment of the persistence of Twitter datasets. Journal of the Association for Information Science and Technology 69(8), 974–984 (2018)