=Paper= {{Paper |id=Vol-2841/DARLI-AP_14 |storemode=property |title=Towards human in the loop based query rewriting for exploring datasets |pdfUrl=https://ceur-ws.org/Vol-2841/DARLI-AP_14.pdf |volume=Vol-2841 |authors=Genoveva Vargas-Solar,Mehrdad Farokhnejad,Javier A. Espinosa-Oviedo |dblpUrl=https://dblp.org/rec/conf/edbt/Vargas-SolarFE21 }} ==Towards human in the loop based query rewriting for exploring datasets== https://ceur-ws.org/Vol-2841/DARLI-AP_14.pdf
Towards Human-in-the-Loop Based Query Rewriting for Exploring Datasets

Genoveva Vargas-Solar, CNRS, LIRIS-LAFMIA, Lyon, France (genoveva.vargas-solar@liris.cnrs.fr)
Mehrdad Farokhnejad, Univ. Grenoble Alpes, Grenoble INP, CNRS, LIG, Grenoble, France (mehrdad.farokhnejad@univ-grenoble-alpes.fr)
Javier A. Espinosa-Oviedo, Univ. Lyon 2, ERIC-LAFMIA, Lyon, France (javier.espinosa@acm.org)

ABSTRACT
Data exploration promotes a new querying philosophy that gradually converges into queries that can be used to exploit raw data collections according to data explorers' (i.e., users') expectations. Data exploration aims to guide the understanding of data collections with different rawness degrees and to define the type of questions that can be asked on top of them, often through interactive exploration processes. This paper introduces a human-guided data exploration approach defining exploration operations that result in different types of factual and analytic queries. Our first results include a proposal of query morphing and queries-as-answers strategies. This paper describes an experimental setting used for testing the data exploration techniques.

© 2021 Copyright held by the owner/author(s). Published in Proceedings of the 24th International Conference on Extending Database Technology (EDBT), March 23-26, 2021, ISBN 978-3-89318-084-4 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION
The evolution of querying, information retrieval and human-computer interaction has shifted interest from the traditional query-response paradigm to human-in-the-loop intelligent systems. Approaches such as interactive query expansion (IQE) [3, 9, 19] have shown the importance of data consumers in the data exploration process. Users' intentions help to navigate through unknown data, formulate queries and find the desired information. In most cases, user feedback acts as a vital relevance criterion for the next query search iteration. Such novel requirements of modern exploration-driven processes call for rethinking data querying processes.

Traditional data management systems assume that when users ask a query (i) they have good knowledge of the schema, meaning and contents of the database, and (ii) they are sure that this particular query is the one they wanted to ask. In short, it is assumed that users know what they are looking for. In response, systems like DBMSs always try to produce correct and complete results [7]. These assumptions are becoming less true as the volume and diversity of data grow, and as raw datasets representing phenomena observations, rather than facts, need to be explored by data scientists and other users. First, the structure and content of the database are hard to understand. Second, finding the right question to ask is a long-running and complex task, often requiring a great deal of experimentation with queries, backtracking on query results, and revision of results at various points in the process [18]. Existing systems have limited provisions to help users reformulate their queries as they evolve with the search progression [10].

This paper proposes a data exploration approach that:
• defines a loop where, given a user intention expressed using terms and raw data collections, the exploration strategies propose different types of possible queries that can be asked on top of data and that potentially correspond to the user's expectations;
• interacts with the user for refining intentions based on the proposed queries and starts the loop again until the proposed queries converge with the user's expectations.
Different data exploration pipelines can be defined by combining different exploration techniques for performing specific data exploration tasks.

The rest of the paper is organised as follows. Section 2 summarises related work in the field of data exploration, proposing a classification of approaches and techniques. Section 3 provides a detailed description of the approach. Section 4 describes our experimental setting, including the dataset and its pre-processing steps. Section 5 concludes the paper and discusses future work.

2 RELATED WORK
Data exploration calls for combining different exploration, querying and processing methods and strategies proposed in diverse domains. Therefore, we performed a systematic review and propose a classification of existing data exploration techniques and methods (see Figure 1). The classification consists of facets, each representing an aspect of data exploration, and dimensions that denote the concepts that define each facet. As shown in Figure 1, the facets classify: (F1) the type of queries addressed by existing work; (F2) the type of algorithms used for exploring data collections; (F3) the knowledge domain of data collections and data types; (F4) the exploration processes done with human intervention; and (F5) data exploration techniques and systems conceived for understanding raw datasets' content.

Since exploration can put different types of queries into action, facet F1 classifies the types of queries that are defined and used in different works that exploit datasets. The spectrum goes from "classic" keyword and relational queries evaluated on top of more or less curated datasets, to data processing operations on raw datasets (e.g., descriptive statistics). In this spectrum, these types represent families of queries that can include aggregation and clustering operations. We mainly identify "query by example" techniques, useful particularly in cases where the knowledge about the datasets' content is too weak (see d1.8). Query by example is an intuitive way to explore data, so many techniques apply it to data exploration. Examples can represent approaches like reverse-engineering querying, and queries like query morphing or queries as answers. We also note that data exploration is a loop that obtains approximated results, and the techniques
are specialised according to the type of data model (relational, graph, semi-structured, text, multimedia).

Depending on the domain, works propose algorithms rather than operators (as in relational contexts) to process datasets and to discover and derive a precise statistical understanding of their content (facet F2). Algorithms sometimes depend on the type of data structures used for representing data. For example, there are algorithms for processing graphs (centrality, pathfinding, etc.) or querying tables (selection, projection, etc.). Many works use well-known heuristics, data mining, machine learning and artificial intelligence algorithms for processing datasets and gaining insight into their content. Finally, other works propose their own strategies without adhering to a specific domain.

The vision of data exploration in this work is that it should be a human-guided process. Therefore, we have studied techniques where humans intervene to adjust and guide the process of receiving information (d4.5). We studied works on group recommendation, consensus functions, group preference and group disagreement. These studies address objectives like designing consensus functions that aggregate individual group members' preferences to reflect the overall group's preference for each item [1, 4, 13] or disagreement about an item [16]. Consensus functions can be applied within a data exploration process where a user can agree and disagree about the proposed queries; the system can recommend queries according to given constraints that can be interpreted as preferences.

According to our classification, facet F5 considers dimensions that represent exploration techniques. Regarding exploration query expression (d5.1), we have identified three types of approaches: multi-scale query processing for gradual exploration; query morphing to adjust for proximity results; and queries as answers, providing query alternatives to cope with users' lack of foresight. Results filtering (d5.2) addresses analysis and visualisation to give insight into data content. Finally, data exploration systems and environments (d5.3) are tailored for exploring data incrementally and adaptively.

Concerning data exploration techniques, M. L. Kersten et al. [13] have compiled five methods to explore data sets by querying: one-minute DB kernels, multi-scale queries, result-set post-processing, query morphing, and queries as answers. These methods revisit fundamental characteristics of existing systems, like the notion of result completeness and correctness promoted by traditional databases, the splitting of query execution over different fragments of a database, the precision of queries, and the one-shot computation of query results. These query systems provide a broader (i.e., less precise but with a wider scope) approach, discarding exactness and completeness for speed and a more global vision of the data.

Finally, facet F3 classifies the type of datasets used to test different exploration techniques and approaches. Dataset content is often textual, with different rawness degrees (newspapers, micro-texts from social networks), or already processed using NLP (Natural Language Processing) techniques and represented as graphs or tables. Other datasets are built by collecting observations monitored using, for example, IoT infrastructures. These datasets contain records of measures or even video or images.

We have the following remarks about the state of the art we have studied. Data exploration pipelines are mostly ad hoc, implemented in an artisanal manner, and only partially human-guided. Machine learning, analytics and querying techniques (e.g., query by example, queries as answers, etc.) are complementary. We observed that no existing system integrates them so that data scientists can develop exploration pipelines that can thoroughly understand data and its analytics potential. Therefore, there is room for proposing approaches for each of them, defining rules on how they can be combined within data exploration pipelines, and integrating them to provide a complete data exploration environment.

3 QUERYING PIPELINES FOR EXPLORING DATASETS
Figure 2 shows our general approach based on query rewriting techniques, summarised as follows: "given an initial query, provide sets of queries that can help data consumers better exploit data collections". The approach considers that data collections are textual and indexed (not necessarily cleaned) and that the representative vocabulary used in their content has been extracted and classified. For example, in a crisis management scenario, the classes are events (e.g., someone looks for shelter, a building has been damaged) and actions (e.g., a hotel provides shelter for victims, people are approaching a damaged building to search for victims).

The approach is intended to rewrite initial keyword queries by morphing expressions to produce results that can retrieve representative insights into these collections' content. The rewriting process is gradual and interactive: the user expresses an initial expression, and the exploration process provides new queries associated with content samples that can give insight into the content of the dataset. The alternative queries are assessed and adjusted by the user. Then, the exploration process is triggered again until a set of queries is chosen to be evaluated to produce results. Results produced by different exploration strategies can also be used as input to others. For example, query morphing's output can be used as an input for queries as answers.

The next sections describe two rewriting techniques, query morphing and queries as answers (expansion), that we have proposed for exploring datasets.

3.1 Query morphing
Query morphing is the process of rewriting conjunctive and disjunctive keyword queries, by adding terms, to increase the possibility of exploring the largest number of items in a collection. We proposed and implemented a "query morphing" pipeline that can help the data scientist make her query more precise (see Figure 3). Our query morphing pipeline uses a vocabulary and WordNet to look for associated terms and synonyms that help expand the terms, enhancing the chance of matching them with relevant data items in the target collection. The pipeline is described as follows. Given a conjunctive and disjunctive keyword query represented as an expression tree, traverse the tree depth-first until finding a leaf representing a term, and then:
(1) Use a vocabulary representing the dataset content and WordNet, seeking:
(a) equivalent terms; generate a node with the operator and then connect the initial term with the equivalent terms in a conjunctive expression subtree.
(b) more general terms; connect the initial term with these terms in a disjunctive expression subtree.
(c) assess and adjust the morphed query with the user and eventually restart the expansion process. The assessment process includes weighting the terms and exploring result samples to see potential results.
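As a minimal sketch of this expansion step, the snippet below models the expression tree with nested tuples and uses a small hand-written thesaurus in place of WordNet and the dataset vocabulary (the synonym lists are illustrative assumptions, not the actual vocabulary). As in the worked example of Section 3.1, equivalent terms are attached to the initial term as a disjunction:

```python
# Sketch of the query-morphing expansion step. The expression tree is
# modelled with nested tuples ("AND"/"OR", left, right); EQUIVALENT is a
# toy thesaurus standing in for WordNet / the dataset vocabulary
# (illustrative data only).
EQUIVALENT = {
    "victims": ["casualty", "unfortunate person"],
    "missing": ["absent"],
    "shelter": ["protection", "housing"],
}

def morph(tree):
    """Depth-first walk: expand each leaf term into a disjunction of
    the term and its equivalent terms."""
    if isinstance(tree, str):                 # leaf: a single term
        expanded = tree
        for equivalent in EQUIVALENT.get(tree, []):
            expanded = ("OR", expanded, equivalent)
        return expanded
    op, left, right = tree                    # internal AND/OR node
    return (op, morph(left), morph(right))

def to_string(tree):
    """Render an expression tree as a boolean keyword query."""
    if isinstance(tree, str):
        return f'"{tree}"' if " " in tree else tree
    op, left, right = tree
    return f"({to_string(left)} {op} {to_string(right)})"

query = ("AND", ("AND", "victims", "missing"), "shelter")
print(to_string(morph(query)))
```

In the actual pipeline, step (1) would populate the equivalents from WordNet and the extracted vocabulary instead of a static table, and step (c) would let the user prune the expanded tree before evaluation.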
Figure 1: Querying techniques for exploring datasets

Figure 2: Deriving queries to explore data collections
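The retrieval step of Section 3.1 evaluates a rewritten query over an inverted index that maps each term to the documents (tweet identifiers) where it occurs. A minimal sketch of this boolean evaluation follows; the index contents, identifiers and term grouping are made up for illustration:

```python
# Toy inverted index: term -> set of document (tweet) ids. The contents
# are illustrative; in the experiments the index is built from the
# cleaned tweet collection.
INDEX = {
    "victims": {1, 4}, "casualty": {2},
    "missing": {1, 2, 7}, "absent": {4},
    "shelter": {1, 2, 4}, "housing": {5},
}

def evaluate(groups):
    """Evaluate a query given as a conjunction of OR-groups of terms,
    e.g. (victims OR casualty) AND (missing OR absent) AND ..."""
    result = None
    for group in groups:
        # OR within a group: union of the terms' posting sets
        docs = set().union(*(INDEX.get(t, set()) for t in group))
        # AND across groups: intersection of the group results
        result = docs if result is None else result & docs
    return result or set()

query = [["victims", "casualty"], ["missing", "absent"], ["shelter", "housing"]]
print(sorted(evaluate(query)))  # → [1, 2, 4]
```

The frequency matrix would then score the retrieved documents so the final result set can be tagged with precision and recall measures.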
For example, for the query "victims AND missing AND shelter", using WordNet¹, the query can be expanded as follows: "(victim OR casualty OR "unfortunate person") AND (missing OR absent) AND (shelter OR protection OR housing)". The key is using a concept ontology or glossary to find as many equivalent and more general terms as possible. In this example, we only found equivalent terms. The user can then mark which terms should or should not be included in the expanded query. She can also test different combinations of the query and compare the results to see which morphed query produces the results that best respond to her expectations.

Once the new query expression has been rewritten, as done in information retrieval techniques, we use the inverted index to find the corresponding documents where the query terms occur. Then, we use the frequency matrix to compute the final result set, tagged with precision and recall measures.

¹ http://wordnetweb.princeton.edu/perl/

3.2 Queries as answers
Given an initial conjunctive/disjunctive keyword query, the query is rewritten and transformed into several queries by extending it with more general and more specific terms, synonyms, etc., and by exploiting the knowledge domain (see Figure 3). The result is a set of possible alternative queries with associated sample results, so that the user can choose which ones to execute.

In our approach, the initial query is represented by an expression tree (intermediate representation) where nodes are conjunction and disjunction operators and leaves are terms. During the rewriting process, the tree is modified by adding new types of nodes and tagged arcs. New nodes represent "and" and "or" nodes that do not belong to the initial query, as well as more general/specific terms associated with an "initial" term. These new nodes are connected with the nodes of the initial query by tagged arcs. A tag can indicate whether it connects a node with a conjunction or disjunction of more general/specific terms.
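A minimal sketch of this derivation for a conjunctive initial query: each term t that has an associated term s yields an alternative query replacing t by "t OR (t AND s)". The associations would come from WordNet or the frequency matrix in the actual pipeline; the table below and the t1…t6 term names are illustrative assumptions:

```python
# Toy associated-terms table standing in for WordNet / frequency-matrix
# lookups (illustrative assumption).
ASSOCIATED = {"t1": "t4", "t2": "t6"}

def alternatives(terms):
    """For a conjunctive query over `terms`, derive one alternative
    query per term that has an associated term: t -> (t OR (t AND s))."""
    queries = []
    for i, t in enumerate(terms):
        s = ASSOCIATED.get(t)
        if s is None:
            continue                      # no associated term: skip
        rewritten = list(terms)
        rewritten[i] = f"({t} OR ({t} AND {s}))"
        queries.append(" AND ".join(rewritten))
    return queries

for q in alternatives(["t1", "t2", "t3"]):
    print(q)
# → (t1 OR (t1 AND t4)) AND t2 AND t3
# → t1 AND (t2 OR (t2 AND t6)) AND t3
```

A combined alternative would merge several per-term rewrites into a single query, and each derived query would be presented to the user together with a sample of its results.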
                                           Figure 3: Query morphing as answers pipeline


For example, consider an initial query linking three terms t1, t2 and t3 in a conjunctive and disjunctive initial query (see Figure 3). It is then rewritten into new queries represented by a tree that extends the query with terms that can be synonyms or related terms of t1 and t2. Three possible queries are derived: Q11, which provides an alternative to t1 as a new complex query with a synonym or associated term t4, saying that we can look for "t1, or t1 and t4". Similarly, Q12 provides an alternative to t2, saying that we can look for either t2, or t2 and t6, where t6 can be a synonym or a related term. Finally, Q13 is a complex query that integrates Q11 and Q12 with t3 of the initial query.

The following steps are performed for computing query alternatives, where every step aims at deriving the initial query into queries that add knowledge. For each leaf in the expression tree of the query:
(1) Use a vocabulary (extracted from the dataset content) and WordNet, seeking:
• equivalent terms; generate a node with the operator and then connect the initial term with the equivalent terms in a conjunctive expression subtree;
• more general terms; connect the initial term with these terms in a disjunctive expression subtree.
(2) Use a frequency matrix for looking for terms that are often associated with the initial term with a specific frequency, and get a sample of documents that can belong to the query results.

The user can choose those queries that best target her expectations. A history of queries is maintained that can be reused for suggesting or pre-calculating morphing or queries-as-answers results, or for adjusting the chosen query set with new queries as the dataset evolves.

4 EXPERIMENTS
To experiment with our general approach, let us consider a disaster management scenario where various data collections are produced during the life cycle of the disaster and must be explored to organise relief, resilience and duty-of-memory actions. The scenario we use is related to disaster management under a horizontal organisation², where civilians take active action when an event happens (e.g., earthquake, flooding, fire) and continue to influence decision making during the other phases of its management. In this context, social media is a fast-paced channel used by affected people to describe their situation and observations, seek information, specify their requests, and offer their voluntary assistance, providing actionable information [17, 21]. Critical data is continuously posted on social media like Twitter during the disaster life cycle (the event, relief, resilience, duty of memory).

During such life-threatening emergencies, affected and vulnerable people, humanitarian organisations, and other concerned authorities search for information useful to provide help and prevent a crisis. Nobody has control over the type of data exchanged by actors. These data are crucial in making critical decisions like saving lives, searching for people, and providing shelter and medical assistance. For this reason, it is required to explore past and present data in an agile manner to find hints to make decisions and act individually and collectively. Social network data collections can include reports on architectural and built-environment damage, and posts by volunteers reporting that they have answered calls for help (see Figure 4).

The question is whether these data collections can help to find (i) causal correlations, for example, is it possible to know, given a post asking for help, whether actions have already been

² The phenomenon of organisation of civilians in horizontal and marginal groups has come up in different countries during river flooding and annual landslides in diverse regions, particularly in Latin America.
                                    Figure 4: Exploring crisis social network posts during crisis


taken? since when, and whether the problem has been solved; (ii) patterns, is it possible to find patterns showing which zones have been systematically damaged in other events? Is there more risk and more help required in those regions?; (iii) Spatio-temporal relations, from the beginning of the event until a given subsequent time, is it possible to figure out whether actors have installed camps to provide first aid? Does help come from urban areas?; (iv) how to ask about the type of help still being required a day after the event?

Note that these questions are not asking for results; they are asking for assistance on how to ask them on top of data to best explore that data. How can I express my query so as to receive the best guidance to act? Is my query pertinent, given the data I have access to?

Data exploration techniques can assist in expressing queries that can potentially explore data collections and be pertinent according to their content.

This section describes the experimental setting for the assessment of our approach. Our experiments deal with the crisis scenario introduced previously, and they use micro-text datasets from Twitter concerning this topic. Given Twitter's 140-character limit, the frequency matrix alone is of limited use, so we used a word2vec model to pre-process the dataset and then find similar terms for rewriting queries. In this work, we consider the words provided by the word2vec model, and words are also indexed in the frequency matrix for extending queries. With this information, we modify the tree, adding "AND" and "OR" nodes, and thereby create other possible queries that derive from the initial one.

We have experimented with generating the knowledge domain and then using it for validating morphing queries. Our experiment is based on the disaster management use case, using Twitter posts as document collections. The experiment applies text mining techniques to build the vocabulary and classify it into events produced and actions performed during a disaster life-cycle. In this section, we first describe the datasets we used in our experiments and then the experimental setting, including the algorithms used to process data collections and classify the extracted vocabulary.

4.1 Dataset preparation
Among social media studies, most focus on Twitter, mainly because of its timeliness and the availability of information from a large user base. We use the CrisisNLP [11] labelled and unlabelled datasets. The datasets contain approximately 5 million unlabelled and 50k labelled tweets; their size is about 7 gigabytes. The datasets cover various event types such as earthquakes, floods, typhoons, etc. They were collected from the Twitter streaming API using different keywords and hashtags during the disasters. The tweets are labelled into various informative classes (e.g., urgent needs, donation offers, infrastructure damage, dead or injured people) and one not-related or irrelevant class. Figure 5 shows a sample of labelled tweets from the data collection.

Data Preprocessing. Since tweet texts are brief, informal, noisy, unstructured, and often contain misspellings and grammatical mistakes, preprocessing must be done before using them in further analysis. Moreover, due to Twitter's 140-character limit, Twitter users intentionally shorten words using abbreviations, acronyms, slang, and sometimes words without spaces; hence we need to normalise those out-of-vocabulary (OOV) terms [11]. Besides, tweets frequently contain duplicates, as the same information is often retweeted/re-posted by many users [20]. The presence of duplicates can result in an over-estimation of the performance of retrieval/extraction methodologies. Therefore, we eliminated duplicate tweets using Excel's 'remove duplicates' toolkit. Currently, we use an unlabelled dataset of 73562 tweets related to the 2014 California Earthquake. We performed the following preprocessing steps to clean the micro-documents:
(1) We removed stop words (e.g. 'a', 'at', 'here'), non-ASCII characters, punctuation (e.g. '.', '!'), URLs (e.g. 'http://t.co/24Db832o4U'), hashtags (e.g. '#Napaquake') and Twitter reserved words (e.g. 'RT', 'via').
(2) We tokenized the tweets using the nltk.tokenize library [24].
(3) We performed stemming (lemmatization) using the WordNet Lemmatizer library [24]: e.g. troubled (trouble).
Figure 5: Examples of some labelled tweets posted during the 2014 California Earthquake
We also used a list of crisis-related OOVs [11] to normalize tweets' terms: e.g. govt (government), 2morrow (tomorrow), missin (missing).
(4) We removed duplicate tweets.

After cleaning the 126161 unlabelled tweets related to the 2014 California Earthquake, we obtained a set of 73562 tweets. This set is used for all experiments reported in this work.

The pipeline implemented to create the knowledge base required for experimenting with our data exploration techniques consists of two steps: (i) indexing the data collections' content using information retrieval techniques; (ii) creating a vocabulary using classification techniques.

Indexing the data collection. As a result of indexing the cleaned tweet collection, we created an inverted index and a frequency matrix representing the content of the collection. We implemented an inverted index to provide agile access to the positions of the documents in which a term appears. The inverted index is used as a dictionary that associates each word with the list of document identifiers where the word appears. This structure prevents the running time of token comparisons from growing quadratically: instead of comparing, record by record, each token to every other token to see if they match, the inverted indices are used to look up records that match a particular token.

Currently, we use the 73562 unlabelled tweets related to the 2014 California Earthquake. We generated an inverted index consisting of 20313 rows. The rows correspond to terms in our raw data collection, and the columns correspond to documents where the terms occur. The inverted index allows a fast full-text search. It can help to explore queries' terms to find the documents where the terms occur.

A term frequency matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents [14]. The matrix contains 73562 columns, where each column corresponds to a document (tweet) and each row to a term. A cell in the matrix contains the number of times the term appears in the document. The top 20 most frequent terms in our data collection can help us expand the query using the data collection itself.

Creating a vocabulary. We implemented a classification pipeline to build a vocabulary of events and actions related to disasters, thereby generating a knowledge base describing the tweets' data collections used for our experiment. The pipeline combines ma-

For an "Event", as the name suggests, we considered tweets containing a subject related to any occurrence during or after the crisis: for example, damage happened to a building, or people are trapped in buildings. For an "Action", we considered those tweets that focus on operations and activities during or after the crisis, such as government or NGOs providing help to the affected people.

We performed a set of experiments on the California and Nepal earthquake datasets, consisting of approximately 3032 labelled tweets, of which 2203 tweets are from the Nepal dataset and 829 from the California dataset. As usual in machine learning techniques, we divided the data collection into training and test datasets: the first set comprised 70% of the messages (the training set) and the second 30% (the test set). We trained three different kinds of classifiers using the preprocessed data.

We used a multilayer perceptron with a CNN. We conducted experiments on the same dataset and eventually established that the CNN outperformed the task by an adequate margin compared to our previous work.

For the evaluation of the trained models, we compared the results to [11, 15]. The results obtained by the CNN model are better than those of traditional techniques, and we were able to obtain the same results as the original papers [11, 15] (see Table 1).

Table 1: Accuracy, Precision, Recall and F-score of the CNN model on the California Earthquake and Nepal Earthquake crisis tweet data.

Datasets/SYS             Accuracy   Precision   Recall   F-score
California Earthquake    92.72      86.53       90.00    88.23
Nepal Earthquake [12]    89.31      91.25       91.87    91.85

4.2 Testing query morphing
We implemented the "query morphing" process that we proposed to help the data scientist make her query more precise, or define several queries representing what she is looking for. Our query morphing algorithm uses WordNet to look for associated terms and synonyms that help expand the terms, enhancing the chance of matching them with relevant tweets in the target collection.

For assessing expanded-term quality, we have compared the performance of our proposed classification-based query expansion method against the traditional query expansion method. We
chine learning methods reproducing an existing work proposed            calculated the mean average of Cosine Similarity (MACS) be-
by [11, 15].                                                            tween the query and expanded query terms to assess the pro-
   We applied supervised techniques such as Random Forest [5],          posed approach’s performance. The experimental results show
Support Vector Machines (SVM) and Convolutional Neural Net-             that the expanded query terms, obtained from the classified query
work (CNN) [6, 8] to classify the tweets of our experiment dataset      expansion model, are more similar and relevant than the non-
to build the vocabulary of events and actions. As the word "Event"      classification model.
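The morphing step described above can be sketched as follows. This is a minimal, self-contained illustration, not the system's implementation: the actual algorithm queries WordNet for associated terms and synonyms, which is stubbed here with a small hypothetical SYNONYMS table.

```python
# Minimal sketch of query morphing. The real pipeline looks up
# associated terms and synonyms in WordNet; SYNONYMS is a hypothetical
# stand-in so the sketch stays self-contained.
SYNONYMS = {
    "earthquake": ["quake", "tremor", "seism"],
    "help": ["aid", "assistance", "relief"],
}

def morph_query(terms, max_expansions=10):
    """Expand a query with up to `max_expansions` associated terms
    (the ET@10 / ET@20 / ET@30 settings of the evaluation)."""
    expanded = list(terms)
    for term in terms:
        for candidate in SYNONYMS.get(term.lower(), []):
            if candidate not in expanded and \
                    len(expanded) < len(terms) + max_expansions:
                expanded.append(candidate)
    return expanded
```

With the toy table above, `morph_query(["earthquake", "help"], max_expansions=3)` returns the original terms plus "quake", "tremor" and "seism"; the classification-based variant would additionally filter candidates against the disaster vocabulary.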
                                                 Figure 6: Query morphing example
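The indexing structures described earlier (the inverted index and the term frequency matrix) can be sketched as follows; a stdlib-only illustration of the idea, not the system's implementation:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Associate each term with the sorted list of identifiers of the
    documents in which it appears, avoiding quadratic token-by-token
    comparisons at query time."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def build_term_frequency(documents):
    """Term frequency matrix as a nested dict: term -> doc_id -> count
    (rows are terms, columns are documents)."""
    tf = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            tf[term][doc_id] += 1
    return tf

# Looking up a term touches only its posting list:
docs = ["earthquake hits california",
        "california needs help",
        "help is on the way"]
index = build_inverted_index(docs)
print(index["california"])  # [0, 1]
```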


   In this section, we presented an ablation study of the performance of our proposed classification-based query morphing method. We use the available CrisisNLP pre-trained word embeddings, built with the word2vec method [11], to obtain the vectors of queries and expansion terms. In the vector space model, all queries and terms are represented as vectors in a 300-dimensional space. Document similarity is determined by computing the similarity of their content vectors. To obtain a query vector, we represent the keywords of a user query as vectors, then sum all the keyword vectors and average them. For our analysis, we calculated the average similarity between the query vector (Q_vector) and the 'm' keyword vectors obtained for a given query (T_vector) using the similarity formula Sim given in Equation 1.

                    Sim(CTs, Query) = ( Σ_{i=1}^{m} Cosine(Q_vector, T_vector[i]) ) / m        (1)

   where CTs are the candidate terms and 'm' is a hyper-parameter of query expansion-based retrieval that gives the number of expansion terms (ET), following the studies [2, 22]. We set the number of expansion terms to 10, 20 and 30 (ET@10, ET@20, ET@30). We repeated this task for 100 queries and report the mean of the averages of each ET@ set in Table 2. The experimental results show that the morphed queries, expanded with the new terms obtained from the classification-based query morphing model, are more similar and relevant than those of the non-classification model. The ET@10, ET@20 and ET@30 scores of our proposed classification model surpass those of the traditional non-classification-based model. We also observe that setting the number of expansion terms to 10 achieves the best performance.

Table 2: The mean average of Cosine Similarity (MACS) between query and morphed query terms, with and without the classification model.

   Query Expansion Model    ET@10    ET@20    ET@30
   Classification           0.420    0.377    0.371
   Non-classification       0.401    0.366    0.369

   Currently, we use pseudo-relevance feedback. This method automates the manual part of relevance feedback: it assumes that the user takes the top-m ranked morphed query terms returned by the initial query as relevant for expanding her query. Result scoring must be complemented with user feedback that finally guides the process. We have proposed a solution for exploring scientific papers through an experiment defining a set of exploration queries. The results were assessed by scientists of the National Institute of Genetic Engineering and Biotechnology, Tehran, Iran, and of the Golestan University of Medical Sciences. The scientists provided feedback about the exploration operations through questionnaires that are processed to obtain satisfaction metrics. We are currently defining a crowd-based setting for obtaining feedback in the case of crisis datasets. The idea is to work with different groups of users (victims, volunteers, logistics decision-makers, police, medical staff) and queries to assess the exploration results.

5    CONCLUSION AND FUTURE WORK

This paper introduced a general dataset exploration approach that includes the human in the loop. The current approach includes two exploration techniques (i.e., query morphing and queries as answers) to help define queries that can fully explore and exploit a dataset. They are complementary query rewriting techniques: initially expanding a query helps adjust the terms used for exploring a dataset, and then produces possible combinations of terms and possible queries that can lead to different scopes. In both cases, the user finally chooses a set of queries representative of her interests, with results that match her expectations. We have tested query morphing in the case of crisis dataset exploration, where the people involved in a critical event, either as victims or volunteers, can define queries for retrieving the information they look for or for providing help.
   Our future work includes modelling query exploration pipelines
that can combine different techniques for exploring data collec-
tions. We will also propose ways of morphing and giving queries
as answers where queries can be analytical or imply quantitative
data views.
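To make the evaluation protocol of Section 4.2 concrete, the MACS measure of Equation 1 can be computed as in the following sketch, where `q_vector` and the entries of `term_vectors` stand for the word2vec-based query and expansion-term vectors (illustrated here with toy low-dimensional vectors instead of the 300-dimensional embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def query_vector(keyword_vectors):
    """Sum the keyword vectors of a query and average them,
    as described in Section 4.2."""
    m = len(keyword_vectors)
    dim = len(keyword_vectors[0])
    return [sum(v[i] for v in keyword_vectors) / m for i in range(dim)]

def macs(q_vector, term_vectors):
    """Equation 1: mean of the cosine similarities between the query
    vector and its m expansion-term vectors."""
    return sum(cosine(q_vector, t) for t in term_vectors) / len(term_vectors)
```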

REFERENCES
 [1] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong
     Yu. 2009. Group recommendation: Semantics and efficiency. Proceedings of
     the VLDB Endowment 2, 1 (2009), 754–765.
 [2] Hiteshwar Kumar Azad and Akshay Deepak. 2019. Query expansion tech-
     niques for information retrieval: a survey. Information Processing & Manage-
     ment 56, 5 (2019), 1698–1735.
 [3] Nicholas J Belkin. 2008. Some (what) grand challenges for information retrieval.
     In ACM SIGIR Forum, Vol. 42. ACM New York, NY, USA, 47–54.
 [4] Ludovico Boratto, Salvatore Carta, Alessandro Chessa, Maurizio Agelli, and
     M Laura Clemente. 2009. Group recommendation with automatic identification
     of users communities. In 2009 IEEE/WIC/ACM International Joint Conference
     on Web Intelligence and Intelligent Agent Technology, Vol. 3. IEEE, 547–550.
 [5] Leo Breiman. 2001. Random forests. Machine learning 45, 1 (2001), 5–32.
 [6] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine
     learning 20, 3 (1995), 273–297.
 [7] Kyriaki Dimitriadou, Olga Papaemmanouil, and Yanlei Diao. 2016. AIDE:
     an active learning-based approach for interactive data exploration. IEEE
     Transactions on Knowledge and Data Engineering 28, 11 (2016), 2842–2856.
 [8] Kunihiko Fukushima and Sei Miyake. 1982. Neocognitron: A self-organizing
     neural network model for a mechanism of visual pattern recognition. In
     Competition and cooperation in neural nets. Springer, 267–285.
 [9] Parantapa Goswami, Eric Gaussier, and Massih-Reza Amini. 2017. Explor-
     ing the space of information retrieval term scoring functions. Information
     Processing & Management 53, 2 (2017), 454–472.
[10] Stratos Idreos, Olga Papaemmanouil, and Surajit Chaudhuri. 2015. Overview
     of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD Inter-
     national Conference on Management of Data. 277–281.
[11] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a
     lifeline: Human-annotated twitter corpora for NLP of crisis-related messages.
     arXiv preprint arXiv:1605.05894 (2016).
[12] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a Life-
     line: Human-annotated Twitter Corpora for NLP of Crisis-related Messages.
     In Proceedings of the Tenth International Conference on Language Resources and
     Evaluation (LREC 2016) (23-28). European Language Resources Association
     (ELRA), Paris, France.
[13] Martin L Kersten, Stratos Idreos, Stefan Manegold, and Erietta Liarou. 2011.
     The researcher’s guide to the data deluge: Querying a scientific database in just
     a few seconds. Proceedings of the VLDB Endowment 4, 12 (2011), 1474–1477.
[14] Hans Peter Luhn. 1957. A statistical approach to mechanized encoding and
     searching of literary information. IBM Journal of research and development 1,
     4 (1957), 309–317.
[15] Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muham-
     mad Imran, and Prasenjit Mitra. 2016. Rapid classification of crisis-related
     data on social networks using convolutional neural networks. arXiv preprint
     arXiv:1608.03902 (2016).
[16] Mark O’Connor, Dan Cosley, Joseph A Konstan, and John Riedl. 2001. PolyLens:
     a recommender system for groups of users. In ECSCW 2001. Springer, 199–218.
[17] Leysia Palen and Sarah Vieweg. 2008. The emergence of online widescale
     interaction in unexpected events: assistance, alliance & retreat. In Proceedings
     of the 2008 ACM conference on Computer supported cooperative work. 117–126.
[18] Olga Papaemmanouil, Yanlei Diao, Kyriaki Dimitriadou, and Liping Peng. 2016.
     Interactive Data Exploration via Machine Learning Models. IEEE Data Eng.
     Bull. 39, 4 (2016), 38–49.
[19] Ian Ruthven. 2003. Re-examining the potential effectiveness of interactive
     query expansion. In Proceedings of the 26th annual international ACM SIGIR
     conference on Research and development in information retrieval. 213–220.
[20] Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben, and Ujwal Gadiraju.
     2013. Groundhog day: near-duplicate detection on twitter. In Proceedings of
     the 22nd international conference on World Wide Web. 1273–1284.
[21] Sarah Vieweg, Amanda L Hughes, Kate Starbird, and Leysia Palen. 2010. Mi-
     croblogging during two natural hazards events: what twitter may contribute
     to situational awareness. In Proceedings of the SIGCHI conference on human
     factors in computing systems. 1079–1088.
[22] Chengxiang Zhai and John Lafferty. 2001. Model-based feedback in the lan-
     guage modeling approach to information retrieval. In Proceedings of the tenth
     international conference on Information and knowledge management. 403–410.