=Paper=
{{Paper
|id=Vol-2841/DARLI-AP_14
|storemode=property
|title=Towards human in the loop based query rewriting for exploring datasets
|pdfUrl=https://ceur-ws.org/Vol-2841/DARLI-AP_14.pdf
|volume=Vol-2841
|authors=Genoveva Vargas-Solar,Mehrdad Farokhnejad,Javier A. Espinosa-Oviedo
|dblpUrl=https://dblp.org/rec/conf/edbt/Vargas-SolarFE21
}}
==Towards human in the loop based query rewriting for exploring datasets==
Towards Human-in-the-Loop Based Query Rewriting for Exploring Datasets

Genoveva Vargas-Solar, CNRS, LIRIS-LAFMIA, Lyon, France, genoveva.vargas-solar@liris.cnrs.fr
Mehrdad Farokhnejad, Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France, mehrdad.farokhnejad@univ-grenoble-alpes.fr
Javier A. Espinosa-Oviedo, Univ. Lyon 2, ERIC-LAFMIA, Lyon, France, javier.espinosa@acm.org

© 2021 Copyright held by the owner/author(s). Published in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT

Data exploration promotes a new querying philosophy that gradually converges into queries that can be used to exploit raw data collections according to data explorers' (i.e., users') expectations. Data exploration aims to guide the understanding of data collections with different rawness degrees and to define the types of questions that can be asked on top of them, often through interactive exploration processes. This paper introduces a human-guided data exploration approach defining exploration operations that result in different types of factual and analytic queries. Our first results include a proposal of query morphing and queries as answers strategies. The paper also describes an experimental setting used for testing the data exploration techniques.

1 INTRODUCTION

The evolution of querying, information retrieval and human-computer interaction has led to a shift of interest from the traditional query-response paradigm to systems that involve human intelligence. Approaches such as interactive query expansion (IQE) [3, 9, 19] have shown the importance of data consumers in the data exploration process. Users' intentions help to navigate through unknown data, formulate queries and find the desired information. In most cases, user feedback acts as a vital relevance criterion for the next query iteration. Such requirements of modern exploration-driven processes call for rethinking data querying processes.

Traditional data management systems assume that when users ask a query (i) they have good knowledge of the schema, meaning and contents of the database, and (ii) they are sure that this particular query is the one they wanted to ask. In short, it is assumed that users know what they are looking for. In response, systems like DBMSs always try to produce correct and complete results [7]. These assumptions are becoming less true as the volume and diversity of data grow, and as raw datasets representing phenomena observations, rather than facts, need to be explored by data scientists and other users. First, the structure and content of the database are hard to understand. Second, finding the right question to ask is a long-running and complex task, often requiring a great deal of experimentation with queries, backtracking of query results, and revision of results at various points in the process [18]. Existing systems have limited provisions to help users reformulate their queries as they evolve with the search progression [10].

This paper proposes a data exploration approach that:
• defines a loop where, given a user intention expressed using terms and raw data collections, the exploration strategies propose different types of possible queries that can be asked on top of data, and that potentially correspond to the user's expectations;
• interacts with the user to refine intentions based on the proposed queries and starts the loop again until the proposed queries converge with user expectations.

Different data exploration pipelines can be defined by combining different exploration techniques for performing specific data exploration tasks.

The rest of the paper is organised as follows. Section 2 summarises related work in the field of data exploration and proposes a classification of approaches and techniques. Section 3 provides a detailed description of the approach. Section 4 describes our experimental setting, including the dataset and its pre-processing steps. Section 5 concludes the paper and discusses future work.
2 RELATED WORK

Data exploration calls for combining different exploration, querying and processing methods and strategies proposed in diverse domains. We therefore performed a systematic review and propose a classification of existing data exploration techniques and methods (see Figure 1). The classification consists of facets, each representing an aspect of data exploration, and dimensions that denote the concepts that define each facet.

Figure 1: Querying techniques for exploring datasets

As shown in Figure 1, the facets classify: (F1) the type of queries addressed by existing work; (F2) the type of algorithms used for exploring data collections; (F3) the knowledge domain of data collections and data types; (F4) the exploration processes done with human intervention; and (F5) data exploration techniques and systems conceived for understanding the content of raw datasets.

Since exploration can put different types of queries in action, facet F1 classifies the types of queries that are defined and used in different works that exploit datasets. The spectrum goes from "classic" keyword and relational queries evaluated on top of more or less curated datasets, to data processing operations on raw datasets (e.g., descriptive statistics). Within this spectrum, these types represent families of queries that can include aggregation and clustering operations. We mainly identify "query by example" techniques, which are particularly useful when knowledge about the datasets' content is too weak (see d1.8). Query by example is an intuitive way to explore data, so many techniques apply it to data exploration. Examples can represent either approaches, like reverse engineering querying, or queries, like query morphing. We also note that data exploration is a loop that obtains approximated results and that the techniques are specialised according to the type of data model (relational, graph, semi-structured, text, multimedia).

Depending on the domain, works propose algorithms rather than operators (as in relational contexts) to process datasets and to discover and derive a precise statistical understanding of their content (facet F2). Algorithms sometimes depend on the type of data structures used for representing data. For example, there are algorithms for processing graphs (centrality, pathfinding, etc.) or querying tables (selection, projection, etc.). Many works use well-known heuristics, data mining, machine learning and artificial intelligence algorithms for processing datasets and gaining insight into their content. Finally, other works propose their own strategies without adhering to a specific domain.

The vision of data exploration in this work is that it should be a human-guided process. Therefore, we have studied techniques where humans intervene to adjust and guide the process of receiving information (d4.5). We studied works on group recommendation, consensus functions, group preference and group disagreement. These studies address objectives like designing consensus functions that aggregate individual group members' preferences to reflect the overall group's preference for each item [1, 4, 13] or disagreement about an item [16]. Consensus functions can be applied within a data exploration process where a user can agree or disagree about the proposed queries; the system can recommend queries according to given constraints that can be interpreted as preferences.

According to our classification, facet F5 considers dimensions that represent exploration techniques. Regarding exploration query expression (d5.1), we have identified three types of approaches: multi-scale query processing for gradual exploration; query morphing to adjust for proximity results; and queries as answers as query alternatives to cope with lack of providence. Results filtering (d5.2) addresses analysis and visualisation to give insight into data content. Finally, data exploration systems and environments (d5.3) are tailored for exploring data incrementally and adaptively.

Concerning data exploration techniques, M. L. Kersten et al. [13] have compiled five methods for exploring data sets through querying: one-minute DB kernels, multi-scale queries, result-set post-processing, query morphing and queries as answers. These methods revisit fundamental characteristics of existing systems, like the notion of result completeness and correctness promoted by traditional databases, the splitting of query execution on different fragments of a database, the precision of queries, and the one-shot computation of query results. These query systems provide a broader (i.e., less precise but with a wider scope) approach, discarding exactness and completeness for speed and a more global vision of the data.

Finally, facet F3 classifies the types of datasets used to test different exploration techniques and approaches. Dataset content is often textual with different rawness degrees (newspapers, micro-texts from social networks), or content already processed using NLP (Natural Language Processing) techniques and represented as graphs or tables. Other datasets are built by collecting observations monitored using, for example, IoT infrastructures. These data sets contain records of measures or even video or images.

We make the following remarks about the state of the art we have studied. Data exploration pipelines are mostly ad hoc, implemented in an artisanal manner, and only partially human-guided. Machine learning, analytics and querying techniques (e.g., query by example, queries as answers, etc.) are complementary. We observed that no existing system integrates them so that data scientists can develop exploration pipelines that thoroughly understand data and its analytics potential. Therefore, there is room for proposing approaches for each of them, defining rules on how they can be combined within data exploration pipelines, and integrating them to provide a complete data exploration environment.
3 QUERYING PIPELINES FOR EXPLORING DATASETS

Figure 2 shows our general approach, which is based on query rewriting techniques and can be summarised as follows: "given an initial query, provide sets of queries that can help data consumers better exploit data collections". The approach considers that data collections are textual and indexed (not necessarily cleaned) and that the representative vocabulary used in their content has been extracted and classified. For example, in a crisis management scenario, the classes are events (e.g., someone looks for shelter, a building has been damaged) and actions (e.g., a hotel provides shelter for victims, people are approaching a damaged building to search for victims).

Figure 2: Deriving queries to explore data collections

The approach is intended to rewrite initial keyword queries by morphing expressions to produce results that can retrieve representative insights into these collections' content. The rewriting process is gradual and interactive: the user expresses an initial expression, and the exploration process provides new queries associated with content samples that can give insight into the content of the dataset. The alternative queries are assessed and adjusted by the user. Then, the exploration process is triggered again until a set of queries is chosen to be evaluated to produce results. Results produced by different exploration strategies can also be used as input to others. For example, query morphing's output can be used as input for queries as answers.

The next sections describe the two rewriting techniques, query morphing and queries as answers (expansion), that we have proposed for exploring datasets.

3.1 Query morphing

Query morphing is the process of rewriting conjunctive and disjunctive keyword queries, by adding terms, to increase the possibility of exploring as many items as possible in a collection. We proposed and implemented a "query morphing" pipeline that can help the data scientist better specify her query (see Figure 3). Our query morphing pipeline uses a vocabulary and Wordnet to look for associated terms and synonyms that help expand the terms, enhancing the chance of matching them with relevant data items in the target collection. The pipeline is described as follows. Given a conjunctive and disjunctive keyword query represented as an expression tree, traverse the tree depth-first until finding a leaf representing a term, and then:

(1) Use a vocabulary representing the dataset content and Wordnet to seek:
(a) equivalent terms, generate a node with the corresponding operator, and connect the initial term with the equivalent terms in a conjunctive expression subtree;
(b) more general terms, and connect the initial term with these terms in a disjunctive expression subtree;
(c) let the user assess and adjust the morphed query and eventually restart the expansion process. The assessment process includes weighting the terms and exploring result samples to see potential results.

Figure 3: Query morphing as answers pipeline

For example, for the query "victims AND missing AND shelter", using Wordnet (http://wordnetweb.princeton.edu/perl/), the query can be expanded as follows: "(victim OR casualty OR "unfortunate person") AND (missing OR absent) AND (shelter OR protection OR housing)". The key is using a concept ontology or glossary that can provide the maximum number of equivalent and more general terms. In this example, we only found equivalent terms. The user can then mark which terms should or should not be included in the expanded query. She can also test different combinations of the query and compare the results to see which morphed query produces the results that best respond to her expectations.

Once the new query expression has been rewritten, as done in information retrieval techniques, we use the inverted index to find the corresponding documents where the query terms occur. Then, we use the frequency matrix to compute the final result set, tagged with precision and recall measures.
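To make the term-expansion step concrete, the following is a minimal sketch in Python, assuming NLTK's WordNet interface. It illustrates the idea rather than reproducing the authors' pipeline, and the helper names (expand_term, morph_query) are ours.

```python
# Minimal sketch of the query morphing step (an illustration, not the authors'
# implementation): each term of a conjunctive keyword query is expanded with
# WordNet synonyms and hypernyms, then re-assembled into a morphed query.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_term(term, max_terms=3):
    """Return (synonyms, more_general_terms) found in WordNet for a term."""
    synonyms, hypernyms = set(), set()
    for synset in wn.synsets(term):
        for lemma in synset.lemma_names():
            if lemma.lower() != term.lower():
                synonyms.add(lemma.replace('_', ' '))
        for hyper in synset.hypernyms():
            hypernyms.add(hyper.lemma_names()[0].replace('_', ' '))
    return list(synonyms)[:max_terms], list(hypernyms)[:max_terms]

def morph_query(terms):
    """Rewrite a conjunctive keyword query (list of terms) into a morphed query string."""
    clauses = []
    for term in terms:
        synonyms, hypernyms = expand_term(term)
        alternatives = [term] + synonyms + hypernyms
        clauses.append('(' + ' OR '.join(f'"{t}"' if ' ' in t else t
                                         for t in alternatives) + ')')
    return ' AND '.join(clauses)

if __name__ == '__main__':
    # Mirrors the paper's example query "victims AND missing AND shelter".
    print(morph_query(['victim', 'missing', 'shelter']))
```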
3.2 Query as answers

Given an initial conjunctive/disjunctive keyword query, the query is rewritten and transformed into several queries by extending it with more general and more specific terms, synonyms, etc., and by exploiting the knowledge domain (see Figure 3). The result is a set of possible alternative queries with associated sample results, so that the user can choose which ones to execute.

In our approach, the initial query is represented by an expression tree (intermediate representation) where nodes are conjunction and disjunction operators and leaves are terms. During the rewriting process, the tree is modified by adding new types of nodes and tagged arcs. New nodes represent "and" and "or" nodes that do not belong to the initial query, as well as more general/specific terms associated with an "initial" term. These new nodes are connected with the nodes of the initial query by tagged arcs. A tag can indicate whether it connects a node with a conjunction or a disjunction of more general/precise terms.

For example, consider an initial query linking three terms t1, t2 and t3 in a conjunctive and disjunctive initial query (see Figure 3). It is then rewritten into a new query represented by a tree that extends the query with terms that can be synonyms or related terms of t1 and t2. Three possible queries are derived: Q11, which provides an alternative to t1 with a new complex query containing a synonym or an associated term t4, saying that we can look for "t1 or t1 and t4". Similarly, Q12 provides an alternative to t2, saying that we can look for either t2 or t2 and t6, where t6 could be a synonym or a related term. Finally, Q13 is a complex query that integrates Q11 and Q12 with t3 of the initial query.

The following steps are performed for computing query alternatives, where every step aims at deriving the initial query into queries that add knowledge. For each leaf in the expression tree of the query:

(1) Use a vocabulary (extracted from the dataset content) and Wordnet to seek:
• equivalent terms, generate a node with the corresponding operator, and connect the initial term with the equivalent terms in a conjunctive expression subtree;
• more general terms, and connect the initial term with these terms in a disjunctive expression subtree.
(2) Use a frequency matrix to look for terms that are often associated with the initial term with a specific frequency, and get a sample of documents that can belong to the query results (a sketch of this step is given after this list).
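Step (2) can be sketched as follows, assuming a simple in-memory term-document frequency matrix. The function names, thresholds and toy collection below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: given a term-document frequency matrix, propose terms
# that frequently co-occur with an initial query term and return a small
# sample of documents in which they co-occur, so the user can assess each
# alternative query before executing it.
from collections import Counter, defaultdict

def build_frequency_matrix(documents):
    """documents: dict doc_id -> list of tokens. Returns term -> {doc_id: count}."""
    matrix = defaultdict(Counter)
    for doc_id, tokens in documents.items():
        for token in tokens:
            matrix[token][doc_id] += 1
    return matrix

def associated_terms(matrix, term, min_cooccurrence=2, top_k=5, sample_size=3):
    """Terms co-occurring with `term` at least `min_cooccurrence` times, plus sample docs."""
    docs_with_term = set(matrix.get(term, {}))
    cooccurrence = Counter()
    for other, postings in matrix.items():
        if other != term:
            cooccurrence[other] = len(docs_with_term & set(postings))
    suggestions = []
    for other, count in cooccurrence.most_common(top_k):
        if count < min_cooccurrence:
            break
        shared = sorted(docs_with_term & set(matrix[other]))[:sample_size]
        suggestions.append((other, count, shared))  # (term, co-occurrences, sample doc ids)
    return suggestions

# Example: propose alternatives for the term 'shelter' on a toy collection.
docs = {1: ['victim', 'shelter', 'earthquake'],
        2: ['shelter', 'housing', 'volunteer'],
        3: ['shelter', 'housing', 'victim']}
print(associated_terms(build_frequency_matrix(docs), 'shelter'))
```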
The user can choose the queries that best target her expectations. A history of queries is maintained that can be reused for suggesting or pre-calculating morphing or queries-as-answers results, or for adjusting the chosen query set with new queries as the dataset evolves.

4 EXPERIMENTS

To experiment with our general approach, let us consider a disaster management scenario where various data collections are produced during the life cycle of the disaster and must be explored to organise relief, resilience and duty-of-memory actions. The scenario we use is related to disaster management under a horizontal organisation (the phenomenon of organisation of civilians into horizontal and marginal groups has come up in different countries during river flooding and annual landslides, particularly in Latin America), where civilians take active action when an event happens (e.g., earthquake, flooding, fire) and continue to influence decision making during the other phases of its management. In this context, social media is a fast-paced channel used by affected people to describe their situation and observations, seek information, specify their requests, and offer their voluntary assistance, thereby providing actionable information [17, 21]. Critical data is continuously posted on social media like Twitter during the disaster life cycle (the event, relief, resilience, duty of memory).

During such life-threatening emergencies, affected and vulnerable people, humanitarian organisations, and other concerned authorities search for information useful to provide help and prevent a crisis. Nobody has control over the type of data exchanged by actors. These data are crucial in making critical decisions like saving lives, searching for people, and providing shelter and medical assistance. For this reason, it is necessary to explore past and present data in an agile manner to find hints to make decisions and act individually and collectively. Social network data collections can include reports on damage to the architectural and built environment and volunteers informing that they have answered calls for help (see Figure 4).

Figure 4: Exploring crisis social network posts during a crisis

The question is whether these data collections can help to find: (i) causal correlations, for example, given a post asking for help, is it possible to know whether actions have already been taken, since when, and whether the problem has been solved; (ii) patterns, for example, is it possible to find patterns showing which zones have been systematically damaged in other events, and is there more risk and help required in those regions; (iii) spatio-temporal relations, for example, is it possible to figure out, from the beginning of the event until a given subsequent time, whether actors have installed camps to provide first aid, and whether help comes from urban areas; (iv) how to ask about the type of help still being required a day after the event.

Note that these questions are not asking for results; they are asking for assistance on how to ask them on top of data to potentially best explore the data. How can I express my query so as to receive the best guidance to act? Is my query pertinent given the data I can have access to? Data exploration techniques can help assist in expressing queries that can potentially explore data collections and be pertinent according to their content.

This section describes the experimental setting for the assessment of our approach. Our experiments deal with the crisis scenario introduced previously, and they use micro-text datasets from Twitter concerning this topic. Given Twitter's 140-character limit, the frequency matrix is of limited use, so we used the word2vec model to pre-process the dataset and then find similar terms for rewriting queries. In this work, we consider the words provided by the word2vec model, and the words are also indexed in the frequency matrix for extending queries. With this information, we modify the tree by adding "AND" and "OR" nodes, and thereby we create other possible queries that derive from the initial one.

We have experimented with generating the knowledge domain and then using it for validating morphing queries. Our experiment is based on the disaster management use case using Twitter posts as document collections. The experiment applies text mining techniques to build the vocabulary and classify it into events produced and actions performed during a disaster life cycle. In the following, we first describe the datasets we used in our experiments and then the experimental setting, including the algorithms used to process the data collections and classify the extracted vocabulary.
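As an illustration of this word2vec-based term expansion, here is a small sketch assuming the gensim library and the CrisisNLP pre-trained 300-dimensional crisis embeddings [11]; the embedding file name and the helper function are hypothetical.

```python
# Illustrative sketch (not the authors' code): load pre-trained crisis word
# embeddings and ask for the nearest terms of each query keyword, to be used
# as "OR" candidates when extending the query tree.
from gensim.models import KeyedVectors

# Hypothetical local path to the CrisisNLP word2vec vectors [11].
embeddings = KeyedVectors.load_word2vec_format(
    'crisisNLP_word2vec_model.bin', binary=True)

def similar_terms(keyword, topn=10):
    """Return up to `topn` terms close to `keyword` in the embedding space."""
    if keyword not in embeddings.key_to_index:
        return []
    return [term for term, _score in embeddings.most_similar(keyword, topn=topn)]

# Candidate expansions for each keyword of the query "victims AND missing AND shelter".
for keyword in ['victims', 'missing', 'shelter']:
    print(keyword, '->', similar_terms(keyword))
```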
4.1 Dataset preparation

Most social media studies focus on Twitter, mainly because of its timeliness and the availability of information from a large user base. We use the CrisisNLP [11] labelled and unlabelled datasets. The datasets contain approximately 5 million unlabelled and 50k labelled tweets; the size of this dataset is about 7 gigabytes. The datasets cover various event types such as earthquakes, floods, typhoons, etc., and were collected from the Twitter streaming API using different keywords and hashtags during the disasters. The tweets are labelled into various informative classes (e.g., urgent needs, donation offers, infrastructure damage, dead or injured people) and one not-related or irrelevant class. Figure 5 shows a sample of labelled tweets from the data collection.

Figure 5: Examples of some labelled tweets posted during the 2014 California Earthquake

Data Preprocessing. Since tweet texts are brief, informal, noisy, unstructured, and often contain misspellings and grammatical mistakes, preprocessing must be done before using them in further analysis. Moreover, due to Twitter's 140-character limit, Twitter users intentionally shorten words using abbreviations, acronyms, slang, and sometimes words without spaces; hence, we need to normalise those out-of-vocabulary (OOV) terms [11]. Besides, tweets frequently contain duplicates, as the same information is often retweeted or re-posted by many users [20]. The presence of duplicates can result in an over-estimation of the performance of retrieval/extraction methodologies; therefore, we eliminated duplicate tweets using Excel's remove-duplicates toolkit. Currently, we use an unlabelled data set related to the 2014 California Earthquake. We performed the following preprocessing steps to clean the micro-documents:

(1) We removed stop words (e.g. 'a', 'at', 'here'), non-ASCII characters, punctuation (e.g. '.', '!'), URLs (e.g. 'http://t.co/24Db832o4U'), hashtags (e.g. '#Napaquake') and Twitter reserved words (e.g. 'RT', 'via').
(2) We tokenised the tweets using the nltk.tokenize library [24].
(3) We performed lemmatisation using the WordNet Lemmatizer library [24], e.g. troubled (trouble), and used a list of crisis-related OOVs [11] to normalise tweets' terms, e.g. govt (government), 2morrow (tomorrow), missin (missing).
(4) We removed duplicate tweets.

After cleaning the 126,161 unlabelled tweets related to the 2014 California Earthquake, we obtained a set of 73,562 tweets. This set is used for all experiments reported in this work.
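The cleaning steps above can be condensed into a small sketch, assuming NLTK and a toy OOV dictionary; it is an illustration under those assumptions, not the exact script used for the experiments.

```python
# Condensed sketch of the preprocessing steps (illustrative only).
import re, string
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.tokenize import word_tokenize    # requires nltk.download('punkt')
from nltk.stem import WordNetLemmatizer    # requires nltk.download('wordnet')

STOP_WORDS = set(stopwords.words('english'))
# Toy stand-in for the crisis-related OOV normalisation list of [11].
OOV_NORMALISATION = {'govt': 'government', '2morrow': 'tomorrow', 'missin': 'missing'}
LEMMATIZER = WordNetLemmatizer()

def clean_tweet(text):
    text = re.sub(r'http\S+', ' ', text)             # remove URLs
    text = re.sub(r'#\w+', ' ', text)                # remove hashtags
    text = re.sub(r'\b(RT|via)\b', ' ', text)        # remove Twitter reserved words
    text = text.encode('ascii', 'ignore').decode()   # drop non-ASCII characters
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [t.lower() for t in word_tokenize(text)]
    tokens = [OOV_NORMALISATION.get(t, t) for t in tokens if t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

def deduplicate(tweets):
    """Keep the first occurrence of each cleaned tweet (duplicate removal)."""
    seen, unique = set(), []
    for tweet in tweets:
        key = tuple(clean_tweet(tweet))
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique

print(clean_tweet('RT govt shelters open after #Napaquake http://t.co/24Db832o4U'))
```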
The pipeline implemented to create the knowledge base required for experimenting with our data exploration techniques consists of two steps: (i) indexing the data collections' content using usual information retrieval techniques; (ii) creating a vocabulary using classification techniques.

Indexing the data collection. As a result of indexing the cleaned tweet collection, we created an inverted index and a frequency matrix representing the content of the collection. We implemented an inverted index to provide agile access to the positions of the documents in which a term appears. The inverted index is used as a dictionary that associates each word with a list of identifiers of the documents where the word appears. This structure prevents the running time of token comparisons from growing quadratically: instead of comparing, record by record, each token to every other token to see if they match, the inverted indices are used to look up records that match a particular token.

Currently, we use the unlabelled data set of 73,562 tweets related to the 2014 California Earthquake. We generated an inverted index consisting of 20,313 rows. The rows correspond to terms in our raw data collection, and the columns correspond to documents where the terms occur. The inverted index allows a fast full-text search. It can help to explore queries' terms to find the documents where the terms occur.

A term frequency matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents [14]. The matrix contains 73,562 columns, where each column corresponds to a document (tweet) and each row to a term. A cell in the matrix contains the number of times that the term appears in the document. The top 20 most frequent terms in our data collection can help us expand the query using the data collection.
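A minimal sketch of the two structures described above, an inverted index and a term frequency matrix, together with a conjunctive keyword lookup that intersects posting lists; it is illustrative only and the helper names are ours.

```python
# Illustrative sketch (not the authors' implementation) of the index structures.
from collections import Counter, defaultdict

def build_indexes(tweets):
    """tweets: dict doc_id -> list of cleaned tokens."""
    inverted_index = defaultdict(set)        # term -> {doc_id, ...}
    frequency_matrix = defaultdict(Counter)  # term -> Counter({doc_id: count})
    for doc_id, tokens in tweets.items():
        for token in tokens:
            inverted_index[token].add(doc_id)
            frequency_matrix[token][doc_id] += 1
    return inverted_index, frequency_matrix

def conjunctive_query(inverted_index, terms):
    """Documents containing all query terms (AND semantics), via posting-list intersection."""
    postings = [inverted_index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

tweets = {1: ['victim', 'trapped', 'building'],
          2: ['shelter', 'victim', 'earthquake'],
          3: ['volunteer', 'shelter', 'food']}
index, tf = build_indexes(tweets)
print(conjunctive_query(index, ['victim', 'shelter']))   # -> {2}
print(tf['shelter'])                                      # per-document counts
```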
Creating a vocabulary. We implemented a classification pipeline to build a vocabulary of events and actions related to disasters, thereby generating a knowledge base describing the tweet data collections used for our experiment. The pipeline combines machine learning methods, reproducing existing work proposed by [11, 15]. We applied supervised techniques such as Random Forest [5], Support Vector Machines (SVM) and Convolutional Neural Networks (CNN) [6, 8] to classify the tweets of our experiment dataset and build the vocabulary of events and actions. As the word "Event" suggests, we considered tweets containing a subject related to any occurrence during or after the crisis, for example, damage to a building or people trapped in buildings. For an "Action", we considered those tweets that focus on operations and activities during or after the crisis, such as the government or NGOs providing help to the affected people.

We performed a set of experiments on the California and Nepal earthquake datasets, consisting of approximately 3032 labelled tweets, of which 2203 tweets belong to the Nepal dataset and 829 tweets to the California dataset. As usual in machine learning, we divided the data collection into training and test datasets: the first set comprised 70% of the messages (training set) and the second 30% of the messages (test set). We trained all three kinds of classifiers using the preprocessed data.

We used a multilayer perceptron with a CNN. We conducted experiments on the same dataset and eventually established that the CNN outperformed the other classifiers on this task by an adequate margin compared to our previous work. For the evaluation of the trained models, we compared the results to [11, 15]. The results obtained by the CNN model are better than those of traditional techniques, and we were able to obtain the same results as the original papers [11, 15] (see Table 1).

Table 1: Accuracy, precision, recall and F-score of the CNN model on the California Earthquake and Nepal Earthquake crisis tweet data.

  Dataset                  Accuracy  Precision  Recall  F-score
  California Earthquake    92.72     86.53      90.00   88.23
  Nepal Earthquake [12]    89.31     91.25      91.87   91.85
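The paper does not detail the CNN architecture, so the following is a purely hypothetical sketch of a tweet classifier in the spirit of [15], written with TensorFlow/Keras; every hyper-parameter (filter size, number of filters, dense layer, optimiser, number of classes) is an assumption.

```python
# Hypothetical CNN tweet classifier sketch (architecture and hyper-parameters
# are assumptions, not the authors' configuration).
import tensorflow as tf

def build_cnn_classifier(vocab_size, num_classes, embedding_dim=300):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.Conv1D(128, 5, activation='relu'),   # 1-D convolution over word windows
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])

# num_classes here is an example value for the informative classes plus the irrelevant class.
model = build_cnn_classifier(vocab_size=20000, num_classes=9)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, y_train, validation_split=0.3, epochs=10)  # 70/30 split as in the text
```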
4.2 Testing query morphing

We implemented the "query morphing" process that we proposed to help the data scientist better specify her query or define several queries representing what she is looking for. Our query morphing algorithm uses Wordnet to look for associated terms and synonyms that help expand the terms, enhancing the chance of matching them with relevant tweets in the target collection.

For assessing the quality of the expanded terms, we compared the performance of our proposed classification-based query expansion method against the traditional query expansion method. We calculated the mean average of cosine similarity (MACS) between the query and the expanded query terms to assess the proposed approach's performance. The experimental results show that the expanded query terms obtained from the classified query expansion model are more similar and relevant than those of the non-classification model.

Figure 6: Query morphing example

In this section, we present an ablation study of the performance of our proposed classification-based query morphing method. We use the available CrisisNLP pre-trained word embeddings obtained via the word2vec method [11] to obtain query and expansion term vectors. In the vector space model, all queries and terms are represented as vectors in a 300-dimensional space. Document similarity is determined by computing the similarity of their content vectors. To obtain a query vector, we represent the keywords in user queries as vectors, then sum all the keyword vectors and average them. For our analysis, we calculated the average similarity between the query vector (Q_vector) and the 'm' keyword vectors obtained for a given query (T_vector) by using the similarity formula Sim given in Equation 1:

  Sim(CTs, Query) = ( sum_{i=1..m} Cosine(Q_vector, T_vector[i]) ) / m        (1)

where CTs are the candidate terms and 'm' is a hyper-parameter in query expansion-based retrieval that gives the number of expansion terms (ET), following the studies [2, 22]. We set the number of expansion terms to 10, 20 and 30 (ET@10, ET@20, ET@30). We repeated this task for 100 queries and report the mean average of each ET@ set in Table 2. The experimental results show that the morphed queries expanded with new terms obtained from the classified query morphing model are more similar and relevant than those of the non-classification model. The ET@10, ET@20 and ET@30 scores of our proposed classification model surpassed those of the traditional non-classification-based model. We also observe that we achieve the best performance when the number of expansion terms is set to 10.

Table 2: Mean average of cosine similarity (MACS) between query and morphed query terms with and without the classification model.

  Query Expansion Model   ET@10  ET@20  ET@30
  Classification          0.420  0.377  0.371
  Non-classification      0.401  0.366  0.369
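Equation (1) can be transcribed directly into a small NumPy function; the toy vectors below merely stand in for the word2vec embeddings used in the experiments.

```python
# Illustrative transcription of Equation (1): the query vector is the average
# of its keyword vectors, and MACS is the mean cosine similarity between that
# vector and the m expansion-term vectors.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def query_vector(keyword_vectors):
    """Sum the keyword vectors and average them (as described in the text)."""
    return np.mean(np.asarray(keyword_vectors), axis=0)

def macs(q_vector, expansion_term_vectors):
    """Sim(CTs, Query): mean cosine similarity over the m expansion terms."""
    return float(np.mean([cosine(q_vector, t) for t in expansion_term_vectors]))

# Toy example with random 300-dimensional vectors standing in for word2vec embeddings.
rng = np.random.default_rng(0)
q = query_vector(rng.normal(size=(3, 300)))      # query with 3 keywords
terms = rng.normal(size=(10, 300))               # ET@10 expansion terms
print(round(macs(q, terms), 3))
```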
Currently, we use pseudo relevance feedback. This method automates the manual part of relevance feedback: it is assumed that the user takes the top-m ranked morphed query terms returned by the initial query as relevant to expand her query. Result scoring must be completed with user feedback that finally guides the process. We have proposed a solution for exploring scientific papers through an experiment defining a set of exploration queries. Results were assessed by scientists of the National Institute of Genetic Engineering and Biotechnology, Tehran, Iran, and of Golestan University of Medical Sciences. Scientists provided feedback about exploration operations through questionnaires that are processed to obtain satisfaction metrics. We are currently defining a crowd-based setting for obtaining feedback in the case of crisis datasets. The idea is to work with different groups of users (victims, volunteers, logistics decision-makers, police, medical staff) and queries to assess exploration results.

5 CONCLUSION AND FUTURE WORK

This paper introduced a general dataset exploration approach that includes the human in the loop. The current approach includes two exploration techniques (i.e., query morphing and queries as answers) that help define queries that can fully explore and exploit a dataset. They are complementary query rewriting techniques, where initially expanding a query can help adjust the terms used for exploring a dataset and then produce possible combinations of terms and possible queries that can lead to different scopes. In both cases, the user finally chooses a set of queries representative of her interests, and the produced results target her expectations. We have tested query morphing in the case of crisis dataset exploration, where people involved in a critical event, either as victims or volunteers, can define queries for retrieving information to look for or provide help. Our future work includes modelling query exploration pipelines that can combine different techniques for exploring data collections. We will also propose ways of morphing and giving queries as answers where queries can be analytical or imply quantitative data views.

REFERENCES

[1] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. 2009. Group recommendation: Semantics and efficiency. Proceedings of the VLDB Endowment 2, 1 (2009), 754–765.
[2] Hiteshwar Kumar Azad and Akshay Deepak. 2019. Query expansion techniques for information retrieval: a survey. Information Processing & Management 56, 5 (2019), 1698–1735.
[3] Nicholas J Belkin. 2008. Some (what) grand challenges for information retrieval. In ACM SIGIR Forum, Vol. 42. ACM New York, NY, USA, 47–54.
[4] Ludovico Boratto, Salvatore Carta, Alessandro Chessa, Maurizio Agelli, and M Laura Clemente. 2009. Group recommendation with automatic identification of users communities. In 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, Vol. 3. IEEE, 547–550.
[5] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[6] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
[7] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. 2016. AIDE: an active learning-based approach for interactive data exploration. IEEE Transactions on Knowledge and Data Engineering 28, 11 (2016), 2842–2856.
[8] Kunihiko Fukushima and Sei Miyake. 1982. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets. Springer, 267–285.
[9] Parantapa Goswami, Eric Gaussier, and Massih-Reza Amini. 2017. Exploring the space of information retrieval term scoring functions. Information Processing & Management 53, 2 (2017), 454–472.
[10] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. 2015. Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 277–281.
[11] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a lifeline: Human-annotated Twitter corpora for NLP of crisis-related messages. arXiv preprint arXiv:1605.05894 (2016).
[12] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Paris, France.
[13] Martin L Kersten, Stratos Idreos, Stefan Manegold, and Erietta Liarou. 2011. The researcher's guide to the data deluge: Querying a scientific database in just a few seconds. Proceedings of the VLDB Endowment 4, 12 (2011), 1474–1477.
[14] Hans Peter Luhn. 1957. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development 1, 4 (1957), 309–317.
[15] Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. 2016. Rapid classification of crisis-related data on social networks using convolutional neural networks. arXiv preprint arXiv:1608.03902 (2016).
[16] Mark O'Connor, Dan Cosley, Joseph A Konstan, and John Riedl. 2001. PolyLens: a recommender system for groups of users. In ECSCW 2001. Springer, 199–218.
[17] Leysia Palen and Sarah Vieweg. 2008. The emergence of online widescale interaction in unexpected events: assistance, alliance & retreat. In Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work. 117–126.
[18] Olga Papaemmanouil, Yanlei Diao, Kyriaki Dimitriadou, and Liping Peng. 2016. Interactive Data Exploration via Machine Learning Models. IEEE Data Engineering Bulletin 39, 4 (2016), 38–49.
[19] Ian Ruthven. 2003. Re-examining the potential effectiveness of interactive query expansion. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 213–220.
[20] Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben, and Ujwal Gadiraju. 2013. Groundhog day: near-duplicate detection on Twitter. In Proceedings of the 22nd International Conference on World Wide Web. 1273–1284.
[21] Sarah Vieweg, Amanda L Hughes, Kate Starbird, and Leysia Palen. 2010. Microblogging during two natural hazards events: what Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1079–1088.
[22] Chengxiang Zhai and John Lafferty. 2001. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management. 403–410.