<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Foundation for Spatio-Textual-Temporal Cube Analytics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohsin Iqbal</string-name>
          <email>mohsin@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Lissandrini</string-name>
          <email>matteo@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Torben Bach Pedersen</string-name>
          <email>tbp@cs.aau.dk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aalborg University</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Large amounts of spatial, textual, and temporal (STT) data are being produced daily. This is data containing an unstructured component (text), a spatial component (geographic position), and a time component (timestamp). Therefore, there is a need for a powerful and general way of analyzing STT data together. In this paper, we define and formalize the Spatio-Textual-Temporal Cube (STTCube) structure to enable combined effective and efficient analytical queries over STT data. Our novel data model over STT objects enables novel joint and integrated STT insights that are hard to obtain using existing methods. Moreover, we introduce the new concept of STT measures with associated novel STTOLAP operators. To allow for efficient large-scale analytics, we present a pre-aggregation framework for exact and approximate computation of STT measures. Our comprehensive experimental evaluation on a real-world Twitter dataset confirms that our proposed methods reduce query response time by 1-5 orders of magnitude compared to the No Materialization baseline and decrease storage cost between 97% and 99.9% compared to the Full Materialization baseline while adding only a negligible overhead in the STTCube construction time. Moreover, approximate computation achieves an accuracy between 90% and 100% while reducing query response time by 3-5 orders of magnitude compared to No Materialization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Due to the increased usage of mobile devices and advancements
in accurate geo-tagging, more and more geo-tagged data is being
produced [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In particular, social media platforms like Twitter
and Facebook are some of the main sources of geo-tagged data,
usually in the form of posts, comments, and reviews (e.g.,
Figure 1). This type of data contains spatial, textual, and temporal
(STT) information. As a result, STT data analysis is becoming
increasingly important [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] since it allows extracting new insights
regarding customer satisfaction, user-generated content shared
online, and brand reputation [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>
        STT data contains information regarding topics discussed w.r.t.
time and location, hence presenting an invaluable link between
user opinions and the real world. For example, STT data can
help us analyze an advertisement campaign to identify the best
locations for ad placements. Traditionally, this information is
accessed through spatial keyword-queries [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], e.g., to retrieve
topics within a certain location, or identify in which locations
some topic is discussed. However, keyword and topic search are
point-wise search tasks. Instead, there is a significant need to
provide more extensive analytics analogous to traditional OLAP-style
analytics. An example STT query is “find the top-k trending
hashtags aggregated by topic within a user-defined region (i.e., polygon)
around Paris this month".
      </p>
      <p>
        The traditional data cube model is one of the most widely
used tools to analyze structured data. Since their introduction,
data cubes have been extended to analyze different types of data,
like sales [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], locations [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], time-series [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and text [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], but
separately. In particular, some works propose OLAP operators to
analyze either textual data [
        <xref ref-type="bibr" rid="ref3 ref37">3, 37</xref>
        ] or spatial data [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. However,
no previous work proposed a unified model and set of operators
enabling integrated and joint analysis of STT data. Moreover, as
we propose to jointly analyze STT dimensions together with other
dimensions, we are also able to define novel families of measures
that have not been studied before, namely STT measures. These
measures, as we show later, make it possible to produce more advanced
analytics than, e.g., simple keyword frequency.
Contributions. In this paper, we introduce the
Spatio-Textual-Temporal Cube (STTCube) to analyze STT data. Adding spatial,
textual, and temporal support to a traditional data cube is not
straightforward due to the presence of n−n relationships in
textual hierarchies and because existing families of measures cannot
support joint and integrated analysis involving spatial, textual,
and temporal dimensions, e.g., finding the trending keywords
grouped by regions, defined by geometry shapes, over a time
interval (Section 3.3). Hence, we introduce new families of
measures and OLAP operators that extract combined insights from
STT dimensions and measures. STTCube provides specialized
spatio-textual and spatio-textual-temporal measures such as
Top-k Dense Keywords within an area and Top-k Volatile Keywords
within an area that deliver the integrated aggregates over STT
data. Moreover, a set of analytical operators, namely STT slice,
dice, roll-up, and drill-down, is proposed. This results in a data
model able to support spatio-textual-temporal OLAP (STTOLAP)
operators. Furthermore, we propose Partial Exact Materialization
(PEM) and Partial Approximate Materialization (PAM) methods for
efficient exact and approximate computations of STT measures,
respectively. Among other things, we also provide a systematic
set of solutions to handle n−n relationships in textual hierarchies.
      </p>
      <p>In this work, we present the following contributions: I) We
extend the standard cube model to add support for spatial,
textual, and temporal dimensions and hierarchies and spatio-textual
and spatio-textual-temporal measures (Sections 3.1 to 3.3). II) We
propose a set of analytical operators (STTOLAP) over
spatio-textual-temporal data (Section 3.4). III) We introduce keyword
density and keywords volatility as prototypical spatio-textual and
spatio-textual-temporal measures (Section 3.3). IV) We propose a
pre-aggregation framework (STTCube materialization) for
efficient exact (PEM) and approximate (PAM) computation of the
proposed STT measures (Section 4). V) We propose techniques for
processing spatio-textual-temporal objects and the construction
of the STTCube (Section 5). VI) We evaluate the pre-aggregation
framework's (PEM and PAM) query response time, storage cost,
and accuracy by comparing it with the No STT Cube, Full
Materialization, and No Materialization baselines. Our pre-aggregation
framework provides 1-5 orders of magnitude improvement in
query response time and a 97% to 99.9% reduction in storage cost
with an accuracy between 90% and 100% (Section 6).
</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        OLAP and the Data Cube [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] are used heavily in business
intelligence to obtain insights over the historical, current, and future
state of business. With the emergence of web and social media, an
immense amount of unstructured data is being produced, which
must be included in the analytical process. Table 2 summarizes
the state of the art on spatial, textual, and temporal analytics by
listing the properties and gaps in the current methods.
      </p>
      <p>
        The Text-Cube [
        <xref ref-type="bibr" rid="ref22 ref30">22, 30</xref>
        ] allows OLAP-like queries on text data
by providing dimensions and hierarchies for terms. Moreover, it
supports the computation of two information retrieval (IR)
measures: inverted index and term frequency. EXODuS [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] processes
semi-structured document stores (i.e., JSON) using a
schema-on-read approach to allow exploratory OLAP on text. Text OLAP [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]
extends traditional OLAP to support textual dimensions and
keyword-based top-k search [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Yet, all these approaches lack
support for spatial and temporal data and the advanced measures
and operators required for spatio-textual-temporal analytics.
      </p>
      <p>
        For spatial data, GeoMiner [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] proposes a cube structure
for mining characteristics, comparisons, and association rules
from geo-spatial data. The coupling of GIS and OLAP is known
as Spatial OLAP (SOLAP) [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] and Spatial cube [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] allows performing
SOLAP on the semantic web. Yet, these solutions focus on
spatial data only and lack support for textual and temporal data.
      </p>
      <p>
        There are solutions that combine more than one component
of data, e.g., spatio-temporal [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], into the same model but do
not provide combined STT analytics. Among those, the
contextualized warehouse [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] combines traditional OLAP with a textual
warehouse. This allows the user to provide some keywords, select
a market (country or region), retrieve documents matching the
keywords as context, and then analyze the facts related to those
keywords and documents. Similarly, Topic Cube [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] extends the
functionality of a traditional cube and combines probabilistic
topic modeling with OLAP by introducing the topic hierarchy.
TwitterStand [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and StreamCube [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] exploit textual and
spatial information to gain insights by clustering Twitter hashtags
and tweets in a region, respectively. STT data is also analyzed to
extract events and topics information in TextStreams [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] and
TopicExploration [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. Finally, SocialCube [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] tries to capture
human, social, and cultural behavior by performing linguistic
analysis (sentiment analysis) over tweets. All these approaches
focus on the unstructured nature of text along with spatial and
temporal data, but they do not provide integrated STT analytics;
for example, they do not provide the ability to compute aggregate
spatial, textual, temporal, and spatio-textual-temporal measures
over spatial, textual, and temporal dimensions and hierarchies.
      </p>
      <p>
        Spatial top-k keyword-queries [
        <xref ref-type="bibr" rid="ref25 ref5 ref9">5, 9, 25</xref>
        ] answer only point-wise
queries and do not support aggregation functions or hierarchies.
Thus, they do not support more complex OLAP-style analytical
tasks, which we do. There are methods that solve a very specific
task for a specific type of data [
        <xref ref-type="bibr" rid="ref2 ref21 ref28">2, 21, 28</xref>
        ]. These methods are
fundamentally different from STTCube because STTCube provides
a generic framework for a wide range of STT analytics over different
kinds of STT data sources, including, but not limited to, geo-tagged
tweets. Also, STTCube can take advantage of the improvements
suggested over other cubes, e.g., Nanocubes [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and DICE [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
making it a powerful tool for OLAP-style STT analytics.
      </p>
      <p>Our summary of related work in Table 2 shows that no
existing method provides integrated support for STT data, unlike
STTCube. To the best of our knowledge, a proper formalization
of a data cube model for STT data able to support complex
analytics for STT objects at scale is missing. In particular, no previous
method studies dimensions, hierarchies, and measures that
allow processing STT data jointly. Furthermore, the main novel
challenge for STTOLAP is handling n−n relationships inside the
STT dimensions effectively, since n−n relationships do not allow
traditional pre-aggregation techniques to be used. Moreover,
arbitrary temporal ranges with multiple levels of granularity add
complexity to STT measure computations. As a remedy, we
propose STTCube which enables the joint and integrated analysis
of STT objects by introducing new sets of STT measures to gain
in-depth insights using STTOLAP operators.
</p>
    </sec>
    <sec id="sec-3">
      <title>SPATIO-TEXTUAL-TEMPORAL CUBES</title>
      <p>Here, we define the STTCube, an extension of the traditional data
cube to allow storage and analysis of STT objects. Data cubes are
used to model and analyze multi-dimensional data. The basic
building block of a data cube is the cell, which contains a fact. Each
fact is the observational object for analysis, with one or more
numerical measures associated with it.</p>
      <p>Definition 1 (Data Cube). An n-dimensional data cube CS
is a tuple CS = (D, M, F), with a set of dimensions D = {d1, d2, . . . ,
dn}, a set of measures M = {m1, m2, . . . , mk}, and a set of facts F.
A dimension d ∈ D has a set of hierarchies H_d. Each hierarchy
h ∈ H_d has a set of hierarchy steps (discussed in Section 3.1) and
is organized into a set of levels L_h. Each level l ∈ L_h contains a set
of members and has a set of attributes A_l. Each attribute a ∈ A_l is
defined over a domain. Each measure m ∈ M is a function defined
over a domain which can return either a single value or a complex
object. The domain of a dimension d is denoted by dom(d).
</p>
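<p>To make Definition 1 concrete, the following minimal sketch (Python, with illustrative names not taken from the paper) models facts as records, a level as a mapping function, and a COUNT measure aggregated along a hierarchy step:</p>

```python
from collections import Counter

# Illustrative mini-cube: each fact carries a member for each dimension;
# a level is modeled as a function mapping a fact to a level member.
facts = [
    {"city": "Aalborg", "term": "apple"},
    {"city": "Aalborg", "term": "fruit"},
    {"city": "Aarhus",  "term": "apple"},
]

# A hierarchy step City -> Region as a child-to-parent mapping.
city_to_region = {"Aalborg": "North Jutland", "Aarhus": "Central Jutland"}

def fact_count(facts, level):
    """The COUNT measure aggregated at the given level."""
    return Counter(level(f) for f in facts)

by_city = fact_count(facts, lambda f: f["city"])
by_region = fact_count(facts, lambda f: city_to_region[f["city"]])
print(by_city["Aalborg"], by_region["North Jutland"])  # 2 2
```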
      <p>. . .</p>
      <sec id="sec-3-1">
        <title>Spatio-Textual-Temporal (STT) Objects</title>
        <p>An STT object records place (geo-coordinates or location where it was created), text (a
review, or a user comment), and time (when it was created).
Social networks with geo-tagged micro-blog posts are typical STT
data sources (e.g., the geo-tagged tweet in Figure 1).</p>
        <p>Definition 2 (STT object). A spatio-textual-temporal object is
a tuple o = ⟨l, x, t⟩ where l, x, and t represent the location, text,
and time components, respectively.</p>
        <p>The Location is represented as the latitude and longitude pair
l ∈ (R × R). The Text is an ordered list x = ⟨x1, x2, . . . , xn⟩ where
xi ∈ W is a string and is called a Term. Among all Terms,
keywords are a user-defined subset of important Terms K ⊆ W. For
instance, the user can decide that hashtags (terms starting with
#) have special meaning and are a special type of keyword. Time
specifies a precise instant (a timestamp) at some resolution (e.g.,
seconds). Table 1 contains examples of STT objects with their
location, a set of keywords extracted from the text, and timestamp.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The STTCube Schema</title>
      <p>For analytical processing of STT objects we propose to model
them as an STTCube. An STTCube CS = (D, M, F) is a data cube
(Definition 1) with three special dimensions, namely Location,
Text, and Time, along with zero or more traditional dimensions,
that is, D = {Location, Text, Time, d4, . . . , dn}.</p>
      <p>Dimensions. An STTCube stores STT objects as facts modeling
their spatial, textual, and temporal features in the corresponding
dimensions. Figure 2 shows a 3-dimensional STTCube built on the
sample dataset in Table 1 where each row represents one fact (i.e.,
the members of F) with dimensions D = {Location, Text, Time}.
Domains for the respective dimensions are
dom(Location) = {(57.016, 09.991), (56.187, 10.171), . . . }
dom(Text) = {apple, Fruit, #love, . . . }
dom(Time) = {11:12:13 20-10-2019, 11:18:23 24-10-2019, . . . }
Hence, w.r.t. Definition 2, the dimensions capturing l, x, and t
are the spatial, textual, and temporal dimensions, respectively.
STTCube supports one spatial, textual, and temporal dimension
with the possibility of having multiple hierarchies for each.
Dimension Hierarchy. A hierarchy is spatial, textual, or
temporal if it contains spatial, textual, or temporal levels, respectively.
In Figure 2, the Location dimension is a spatial dimension with a
spatial hierarchy going from Location to City, Region, and Country, and
the Text dimension is a textual dimension aggregating text into the
Term, Theme, Topic, and Concept levels. Similarly, Time is a
temporal dimension. Hierarchy steps HS_h = {hs1, hs2, hs3, . . . , hsn}
define the mechanism of moving from a lower (child) level to an
upper (parent) level and vice versa. A hierarchy step hs = (l↓, l↑,
card) ∈ HS_h entails that members of a child level l↓ can be
aggregated together if they correspond to the same member at the
parent level l↑ and that this correspondence between child and
parent members has the given card ∈ {1−1, 1−n, n−1, n−n}.
For instance, the step from Date to Month has an n−1 cardinality,
while Term to Topic has an n−n cardinality (e.g., the Carrot Term
corresponds to both the Gardening and Food Topics, while the
Food Topic has as child members not only Carrot but also Apple).
Level Attributes. As mentioned earlier, a level l is associated
with a set of attributes A_l = {a1, a2, . . . , an} and has a set of
members M_l = {m1, m2, . . . , mn}. Attribute values describe the different
characteristics of each member from that level. Spatial, textual, and
temporal levels are then usually characterized by spatial, textual,
and temporal attributes. For instance, at the City level, all
members have the Boundary attribute whose value is the polygon
defining the boundary of the respective city. An example of a textual
attribute is Sentiment, which captures the polarity of the
associated textual member. Similarly, an integer value representing the
number of days in a specific month is a temporal attribute.</p>
    </sec>
    <sec id="sec-5">
      <title>Managing STT Hierarchies</title>
      <p>We now describe the STTCube's dimensions and hierarchies.
Spatial Dimensions. Spatial information can be analyzed at
different levels and granularities. It is important to note that facts
in an STTCube are composed only by geographical points (i.e.,
each tweet or user post is associated with a coordinate, not with
shapes or polygons). Points can be aggregated either within a
predefined spatial grid or based on semantic information.</p>
      <p>Grid-Based Hierarchy. Here, the geographic area being
analyzed is divided into small equal-size cells with a predefined
resolution, e.g., 1×1 km². At the lowest level, each latitude and
longitude point is assigned to the cell it falls in. To analyze data
at a coarser granularity, neighboring cells are combined into a
larger cell at the parent level (e.g., 3×3 km²). This hierarchy can
be built automatically, without the need for any meta-data.</p>
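<p>A grid-based hierarchy step can be sketched as follows (Python; the cell size and the 3×3 merge factor are illustrative):</p>

```python
import math

def base_cell(x_km: float, y_km: float) -> tuple:
    """Assign a (projected) point to its 1x1 base cell."""
    return (math.floor(x_km), math.floor(y_km))

def parent_cell(cell: tuple, factor: int = 3) -> tuple:
    """Merge a block of factor x factor base cells into one parent cell."""
    cx, cy = cell
    return (cx // factor, cy // factor)

c = base_cell(4.7, 8.2)
print(c, parent_cell(c))  # (4, 8) (1, 2)
```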
      <p>Semantic-Based Hierarchy. Here, data is analyzed within a
predefined taxonomy, e.g., an administrative division. Therefore, we
move within the taxonomy, e.g., from the Location to the City
level, from the City level to the Region level, and so on up to the
All level. This hierarchy requires each object coordinate to be
associated with a member in the lowest level in the hierarchy
(usually in a pre-processing step) and requires the taxonomy
information to build the entire hierarchy.</p>
      <p>
        Textual Dimensions. Hierarchies in the textual dimension move
from specific concepts to general ones. This follows a generic
taxonomic structure connecting more specific terms to more general
ones (i.e., hypernyms) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Textual hierarchies are implemented
using WordNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] which is discussed in Section 6. In particular,
Terms are the base level which are grouped into Themes, Themes
into larger categories called Topics, and Topics in turn grouped
into Concepts. Differently from most hierarchies, the members in
the levels of a textual hierarchy are typically in an n−n relationship.
Hence, when moving between textual levels we need to decide
how measure values get aggregated. Below we propose a set of
aggregation techniques to address this issue.
      </p>
      <p>Replication-Based Hierarchy. This is a common approach where
each member of a child level is aggregated into all the parent
members. Hence, its value is effectively replicated. This approach
leads to a counting problem when parent levels are further
aggregated. For example, the first data instance in Table 1 will be
part of two Themes: 1) Fruits, because it contains the Terms {apple,
fruit}, and 2) Emotion, because of the Term {#love}.</p>
      <p>Majority-Based Hierarchy. If a fact can be mapped to more
than one parent member, then that fact will be part of the parent
member which has the most representation (e.g., in terms of
frequency). This scheme avoids double counting of facts in parent
members. In case of ties, some tie-breaking heuristic or a
user-defined criterion can be employed instead, e.g., the first fact in
Table 1 will be part of only the Fruits Theme because it has the
two representative Terms {apple, fruit}, as compared to Emotion
having only one Term {#love}.</p>
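<p>The two aggregation schemes can be contrasted on the Table 1 example (Python sketch; the term-to-theme mapping is illustrative):</p>

```python
from collections import Counter

# Illustrative term-to-theme step of the textual hierarchy.
term_to_themes = {"apple": ["Fruits"], "fruit": ["Fruits"], "#love": ["Emotion"]}

def replication(terms):
    """Replication-based: the fact counts under every matching parent."""
    return sorted({theme for t in terms for theme in term_to_themes[t]})

def majority(terms):
    """Majority-based: the fact counts only under the best-represented parent."""
    votes = Counter(theme for t in terms for theme in term_to_themes[t])
    return votes.most_common(1)[0][0]

terms = ["apple", "fruit", "#love"]
print(replication(terms))  # ['Emotion', 'Fruits']
print(majority(terms))     # Fruits
```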
      <p>Custom Hierarchy. In general, other user-specified criteria and
rules can be defined to establish how child-parent level steps
will be aggregated in case of ambiguities. For instance, a
domain-specific importance score can be assigned to the hierarchy
members during the STTCube construction. In this way, facts will be
part of only the parent member with the highest importance.
Temporal Dimensions. Similarly, the temporal dimension allows
analyzing STT objects at different levels of granularity w.r.t. time
and has the following two temporal hierarchies: Date → Day →
Month → Quarter → Year → All and Date → Hour → TimeOfDay
→ All. Here, the first contains a hierarchy of Date
aggregated by the temporal levels Day, Month, Quarter, and Year (total
5 levels including All), whereas the second is a hierarchy for
TimeOfDay having 4 levels in total.</p>
    </sec>
    <sec id="sec-6">
      <title>Spatial, Textual, and Temporal Measures</title>
      <p>As defined earlier, an n-dimensional STTCube has a set of
measures M = {m1, m2, m3, . . . , mk}, which make it possible to analyze STT
objects by computing values at different levels of granularity. For
instance, the STTCube in Figure 2 models Location, Text, and Time
with Fact Count as a measure (i.e., Fact Count ∈ M). In practice, it
maintains the count of STT objects at given spatial, textual, and
temporal aggregation levels. Measure values at different levels
in the hierarchies are obtained by applying an aggregation
function over the STT objects. Examples of aggregation functions are
SUM, COUNT, MIN, MAX, and AVG. The STTCube in Figure 2
uses COUNT as an aggregation function. For example, it reports
that on September 20th at AAU Bus Terminal the Term apple was
mentioned in 2 facts.</p>
      <p>A measure is spatial if it is defined over a spatial domain. A
spatial measure is then computed over a collection of spatial
values (e.g., geographical points, or geometry shapes like polygons).
A spatial measure can be a simple value, e.g., the (numeric) area
of the convex hull of multiple shapes, or a complex spatial object,
e.g., the polygon representing the convex hull itself. A measure
is textual if it is defined over a textual domain, and can be either
a simple numeric value or a complex textual object. Analogously,
a measure is temporal if it is defined over a temporal domain. A
measure is spatio-textual if it is defined over a spatial and textual
domain and is a combination of spatial and textual measures.
Finally, a measure is spatio-textual-temporal if it is defined over
a spatial, textual, and temporal domain and is a combination of
spatial, textual, and temporal measures. Below, we propose a list
of spatio-textual and spatio-textual-temporal measures to be used
as part of STTCube to analyze STT objects effectively.
Top-k Keywords within an Area is a spatio-textual measure
which returns a list of tuples ⟨g, kw⟩ consisting of a geometry
shape g representing a geographical area and the list of top-k most
frequent keywords kw = ⟨kw1, kw2, . . . , kwk⟩ in that area. Analogous
to previous measures, it can also be computed at different levels
of aggregation, so that it can return the top-k keywords for each
City or each Region.</p>
      <p>Keyword Density is a spatio-textual measure which returns
a list of tuples ⟨g, kw, den⟩ consisting of a geometry shape g
representing a geographical area, a keyword kw, and its density
den in the area g. The density den of a keyword kw over an area g
is computed as den = freq(g, kw) / SurfaceArea(g), in which freq(g, kw) is the
frequency of the keyword kw in the area g (i.e., the number of
objects located within g in which kw appears) and SurfaceArea(g)
is the surface area of g. For example, if we have two Regions R1,
R2 with SurfaceArea(R1) = 10 km², SurfaceArea(R2) = 100 km², and the
term Apple with frequency 5 and 30 in R1 and R2, respectively (see
Figure 3), then the keyword densities are den1 = 0.5 and den2 = 0.3 for R1 and
R2, respectively.</p>
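<p>The running example can be checked with a one-line computation (Python):</p>

```python
def keyword_density(freq: int, surface_area_km2: float) -> float:
    """Keyword Density: freq(g, kw) / SurfaceArea(g)."""
    return freq / surface_area_km2

# Apple in R1 (freq 5, 10 km^2) and R2 (freq 30, 100 km^2), as in Figure 3.
print(keyword_density(5, 10.0), keyword_density(30, 100.0))  # 0.5 0.3
```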
      <sec id="sec-6-1">
        <title>Top-k Dense Keywords within an Area</title>
        <p>This is a spatio-textual
measure which returns a list of tuples ⟨g, kw⟩ computing the
keyword density as described in the measure above, but in this
case, it returns the top-k keywords kw = ⟨kw1, kw2, . . . , kwk⟩ with
the highest density.</p>
        <p>
Keyword Volatility is a spatio-textual-temporal measure
(it becomes textual-temporal if no region is specified) which returns a
list of tuples ⟨g, kw, T, Δden⟩ consisting of a geometry shape g
representing a geographical area, a keyword kw, a time interval
T, and its change in density Δden in the area g over the time
interval T (divided into n equal intervals). The change in density
Δden of a keyword kw in an area g over a time interval T is
computed as Δden = Σ_{i=1..n} |den_i − den_{i−1}|, where den_i represents the
density of the keyword kw in the area g at a specific time instance
t_i. Furthermore, the change in density computation formula
can be updated depending on the analysis requirements, e.g., it
can be changed to weighted density (assigning different weights to
each interval in T) or to a rate-of-change computation using linear
regression [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
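<p>The basic (unweighted) change-in-density computation is a sum of absolute differences between consecutive intervals; a sketch with illustrative density values:</p>

```python
def keyword_volatility(densities):
    """Sum of |den_i - den_{i-1}| over the n sub-intervals of T."""
    return sum(abs(densities[i] - densities[i - 1])
               for i in range(1, len(densities)))

# Densities of one keyword in one area at three consecutive time instances.
print(keyword_volatility([0.5, 0.25, 0.75]))  # 0.75
```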
      </sec>
      <sec id="sec-6-2">
        <title>Top-k Volatile Keywords within an Area</title>
        <p>This is a spatio-textual-temporal
measure which returns a list of tuples ⟨g, kw⟩,
computing the keyword volatility as described above, but in this case,
it returns the top-k volatile keywords kw = ⟨kw1, kw2, . . . , kwk⟩ with
the highest change in density.</p>
        <p>
          Distributive, Algebraic, and Holistic Measures. There are
three types (also known as additivity) of measures: distributive,
algebraic, and holistic, depending on whether it is possible to
compute the value of a measure at a parent level directly from
the values at the child level [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. For distributive and algebraic
measures, this is possible. For instance, the Fact Count at the State
level can be computed by summing up the Fact Counts at the
City level. Keyword Density is instead an algebraic measure. We can
compute the higher-level aggregate values of this measure if we
store for each child level both the frequency of each keyword and
the SurfaceArea. The Top-k Keywords, the Top-k Dense Keywords,
and Top-k Volatile keywords within an area measures, instead, are
holistic, since the value at a parent level cannot be computed
directly from the values at the child level, but it is necessary to
recompute them directly from the base facts every time.
        </p>
      <p>Consider the computation of Top-3 Dense Keywords within an
Area in Figure 3, given the two Regions R1 and R2 with SurfaceArea
10 km² and 100 km², respectively, and the computation at the parent
level R3 = R1 ∪ R2 (grayed-out rows are not part of the computed
measure value). The values in the top-3 for the members R1 and
R2 at the child level are not sufficient to compute the correct
densities for region R3. Both some of the computed densities
(compare the algebraically computed column with the correct values
reported in Figure 3) and, consequently, the final ranking would be wrong. For instance, the
keyword Strawberry would not have been returned (if computed
algebraically) because it is neither in the top-3 for R1 nor R2. To
compute the correct response, either we have to store all the
aggregate values for each possible cell or we have to reprocess all
the facts covered by the query. When dealing with large datasets
these approaches are not feasible. Hence, in Section 4 we provide
a framework for the computation of an exact and approximate
solution with accuracy guarantees.</p>
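<p>The holistic behavior can be reproduced with a small sketch (Python; the frequencies are illustrative, not the ones in Figure 3): a keyword can be outside the top-3 of both child regions yet inside the top-3 of the parent region, so the parent value must be recomputed from base frequencies:</p>

```python
from collections import Counter

r1 = {"area": 10.0,  "freq": Counter(apple=5,  pear=4,    plum=3,   strawberry=2)}
r2 = {"area": 100.0, "freq": Counter(apple=30, banana=28, mango=25, strawberry=24)}

def top_k_dense(region, k=3):
    """Top-k Dense Keywords of one region, recomputed from frequencies."""
    dens = {kw: f / region["area"] for kw, f in region["freq"].items()}
    return [kw for kw, _ in sorted(dens.items(), key=lambda x: -x[1])[:k]]

# Parent region r3 = r1 U r2: area and frequencies merged from the base data.
r3 = {"area": r1["area"] + r2["area"], "freq": r1["freq"] + r2["freq"]}
print(top_k_dense(r1))  # ['apple', 'pear', 'plum']
print(top_k_dense(r2))  # ['apple', 'banana', 'mango']
print(top_k_dense(r3))  # strawberry enters the parent top-3
```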
      </sec>
    </sec>
    <sec id="sec-7">
      <title>STTOLAP Operators</title>
      <p>
A data cube allows different OnLine Analytical Processing (OLAP)
operators to group, filter, and analyze cells and subsets of cells
at different levels of granularity and under different
perspectives. Those operators are known as Slice, Dice, Roll-Up, and
Drill-Down [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. We extend the basic OLAP operators to
STTOLAP operators, i.e., for spatial, textual, and temporal dimensions,
hierarchies, and measures (handling of n−n relationships is
explained in Sections 4 and 5). In general, an OLAP (and STTOLAP)
operator OP accepts as input a cube C′, some parameters params
and outputs a new cube C′′, i.e., OP(C′, params) = C′′. In this way,
a new OLAP operator can be applied to C′′. Among all cubes, we
distinguish the initial or base cube C as the cube containing all
the original information at the base level.
      </p>
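<p>The closed interface OP(C′, params) = C′′ means that operators compose; a minimal sketch (Python, modeling a cube simply as a list of fact records, names illustrative):</p>

```python
def slice_op(cube, dim, member):
    """STT slice: keep only the facts whose value in `dim` equals `member`."""
    return [f for f in cube if f[dim] == member]

def rollup_count(cube, dim, level_map):
    """Roll `dim` up one level and re-aggregate the Fact Count measure."""
    counts = {}
    for f in cube:
        parent = level_map[f[dim]]
        counts[parent] = counts.get(parent, 0) + 1
    return counts

base = [{"city": "Aalborg"}, {"city": "Aalborg"}, {"city": "Aarhus"}]
regions = {"Aalborg": "North Jutland", "Aarhus": "Central Jutland"}

# Operators chain: the output cube of one is the input of the next.
print(rollup_count(slice_op(base, "city", "Aalborg"), "city", regions))
# {'North Jutland': 2}
```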
    </sec>
    <sec id="sec-8">
      <title>CUBE MATERIALIZATION</title>
      <p>
        Cube materialization is the process of pre-aggregating measure
values at different levels of granularity in the cube to compute
query responses from pre-aggregated results instead of the raw
data, and hence improve query response time for STTOLAP
operators [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In a data cube, a cuboid is a collection of level members
and associated measure values for a unique combination of
dimension hierarchy levels. Each unique combination is represented
by a separate cuboid. For instance, if we request the Fact Count
for the State of Denmark and have stored Fact Count at the
Region level, we can avoid accessing the raw data and compute
the aggregation from much fewer rows. This is an example of
partial materialization, i.e., the actual cuboid at the State level,
containing the answer to the query was not materialized, but the
system was still able to exploit the cuboid for Region.
      </p>
      <p>What to materialize and how much to materialize depends on
the trade-off between query response time and storage cost. Full
Materialization (FM) is obtained by pre-computing measure
values for all combinations of levels in all hierarchies. This approach
requires huge storage but achieves the best query response time
since every operation can just look up already pre-computed
results. At the other extreme, No Materialization (NM) only
materializes the base cuboid and does not require any extra space,
but will require aggregated measure values to be recomputed
from the base cuboid every time, hence incurring much slower
response times. A middle-ground solution is to partially
materialize the cube, i.e., to materialize only some of the possible
cuboids. In this strategy, some queries will be able to exploit
pre-aggregated values at the current level, while other queries
can exploit pre-aggregated values at lower levels for distributive
or algebraic measures.</p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Cost Model</title>
      <p>The core of the proposed partial materialization approach
depends on the trade-off between the storage cost of materializing
any particular cuboid and the actual benefit that the
materialization of the cuboid provides. To evaluate this benefit, we have to
estimate the (run time) cost of a query. To devise a cost model for
this estimation, we performed a micro-benchmark which
confirmed that the running time is directly proportional to the data
size (the number of rows). Hence, we can use the following linear
cost model for benefit calculation:</p>
      <p>Benefit(v) = Σ<sub>v′ ∈ descendants(v) ∪ {v}</sub> ( cost(v′) − size(v) )</p>
    </sec>
    <sec id="sec-10">
      <title>4.2 Partial Exact Materialization</title>
      <p>We propose an exact partial materialization technique for
pre-computing the spatio-textual and spatio-textual-temporal measure
values. To answer an STT query for these measures, we materialize
two other distributive measures, namely Keyword Frequency
and SurfaceArea. Then, since Keyword Density and Keyword
Volatility are algebraic measures, they can be computed from
the values of Keyword Frequency and SurfaceArea. Finally,
Top-k Dense Keywords and Top-k Volatile Keywords are holistic, but
for an exact solution we materialize Top-ALL and hence compute
them from the materialized measure values (Figure 3).</p>
      <p>
        We adopt the chosen linear cost model (Section 4.1) and extend
the greedy algorithm approach [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to our task (Algorithm 1).
Additionally, and different from [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], Algorithm 1 accepts an input
parameter K and materializes only the top-K measure values in
each cuboid. For instance, for K = 10, it will materialize the top-10
keywords in each cuboid. Then, any top-k query, with k ≤ K, for
a materialized cuboid will return the pre-computed answer.
      </p>
      <p>
        Algorithm 1, given a size budget B (measured in rows, cuboids,
or GB), proceeds until the size of the current cube is as large
as possible within the budget (Line 6). At each step, it selects
among all the non-materialized cuboids (Line 3) the one with the
highest benefit (Line 4) and materializes it (Line 5). The difference
between the exact (PEM) and approximate (PAM) materialization
using Algorithm 1 is the value of K. When K = ∞, the full sorted
list of measure values is stored so that all top-k queries can
be answered for that cuboid. We set K = ∞ for PEM and a finite K
(to materialize only the top-K measure values) for PAM.
Query rewriting. Finally, as in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], after STTCube
materialization, queries are still formulated in terms of the base cuboid but are
rewritten by the system to be evaluated over the smallest cuboid.
Algorithm 2: Top-K Volatile Keywords in an Area
      </p>
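      <p>The greedy selection loop described above can be sketched as follows. The benefit callback is a stand-in for the cost-model computation of Section 4.1, and all names are illustrative assumptions, not the paper's implementation.</p>
      <p>
```python
# Minimal sketch of the greedy loop of Algorithm 1: within a size budget
# B, repeatedly materialize the not-yet-materialized cuboid with the
# highest current benefit. benefit_fn is a stand-in for the cost model.

def greedy_materialize(cuboids, size, B, benefit_fn):
    materialized = []          # the base cuboid is implicitly always kept
    used = 0
    while True:
        # candidates that still fit in the remaining budget (Line 6)
        candidates = [v for v in cuboids
                      if v not in materialized and used + size[v] <= B]
        if not candidates:
            break
        # pick the highest-benefit candidate (Lines 3-4), materialize it (Line 5)
        best = max(candidates, key=lambda v: benefit_fn(v, materialized))
        materialized.append(best)
        used += size[best]
    return materialized
```
      </p>
      <p>In PEM each selected cuboid would store its full sorted measure lists (K = ∞); in PAM only the top-K rows of each selected cuboid are stored.</p>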
      <p>
1  TopKVolatile(Φ = {⟨a1, L1, t1⟩, . . . , ⟨an, Ln, tn⟩}, T, k)
   Input: Set of top-(k+1) volatile keyword lists Φ with their areas and timestamps, set of timestamps T, integer k
   Output: ⟨a, L, T⟩, the top-k keywords L in the merged area a over time interval T; g, the number of guaranteed top positions
2  a ← ∪i∈[1,n] ai ; A ← SurfaceArea(a)   // merge areas
3  F ← {}, ΔF ← {}, P ← {}   // empty dictionaries
4  foreach t ∈ T do
5    foreach ⟨ai, Li, ti⟩ ∈ Φ do
6      foreach j ∈ [1, . . . , k+1] do
7        if ti ∈ t then
8          w ← Li.kw(j)   // keyword at position j
9          f ← Li.freq(j)   // frequency at position j
10         F[w] ← F[w] + f
11         ΔF[w] ← ΔF[w] + |P[w] − f|
12         P[w] ← f
13     τ += Li.freq(k+1)   // upper bound for truncated keywords
14 L ← topK(k, F, ΔF)
15 g ← max j ∈ [1, . . . , k] such that L.freq(j) ≥ τ
16 return ⟨a, L, T⟩, g   // top-k volatile keywords
      </p>
    </sec>
    <sec id="sec-11">
      <title>4.3 Partial Approximate Materialization</title>
      <p>As a result of the materialization performed by Algorithm 1, when
querying a non-materialized cuboid, we can directly exploit
values in the cuboid’s materialized ancestors when computing all
distributive and algebraic measures. On the other hand, for
holistic measures, we have to perform some additional computation.
For instance, as mentioned earlier, to compute the value for the
Top-k Dense Keywords in an area we can exploit the pre-computed
Keyword Density values, but then we need to perform the top-k
selection. That is, if the top-k for the current view is not
materialized, we cannot exploit the materialized top-k of the ancestor
views without incurring the risk of returning the wrong result.</p>
      <p>
        Yet, it is possible to exploit the top-k computation in some
materialized cuboid to retrieve an approximate top-k and estimate
the result's accuracy [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. In practice, for the Top-k Dense
Keywords within an area, given a target k for the top-k computation,
when materializing a cuboid, we materialize the top-k+1 most
dense keywords for that cuboid (i.e., set K = k+1 in Algorithm 1).
Then, to compute the top-k dense keywords for a descendant
cuboid by exploiting a materialized ancestor cuboid, we
determine which members of the list are guaranteed to be correct.
      </p>
      <p>Algorithm 2 implements this computation for Top-k Volatile
Keywords within an area. It receives as input the set Φ = {⟨a1, L1,
t1⟩, ⟨a2, L2, t2⟩, . . . , ⟨an, Ln, tn⟩} of lists of top-K (i.e., k+1) volatile
keywords in a specific area with respective timestamps, a time
interval T divided into x equal-sized intervals (e.g., days or months),
and the value for k. The output is the ranked list of top-k volatile
keywords in the area a that is composed by merging the
areas a1, a2, . . . , an. It computes the SurfaceArea of the merged area
a (line 2). Then it merges all the aggregated keyword frequencies
(line 10) and changes in keyword frequencies (line 11) for each
time instance in T (line 7) in the respective dictionaries F and ΔF
(lines 4-13) by getting each keyword in each list (line 8) and the
corresponding frequencies (line 9). If a keyword is not found in
the F, ΔF, or P dictionary, then its value is considered to be
zero. Moreover, it keeps track of the upper-bound frequency τ for
keywords outside the current materialized ranking for possible
error reporting (line 13). Once all frequencies and changes in
frequencies are merged, we compute the top-k volatile keywords
using the aggregated values (line 14). Finally, by comparing the
value of τ with the frequencies of keywords in the aggregated
top-k, we report how many positions in the current ranking are
guaranteed to be exact (line 15). In the best case, the frequency of
the keyword at position k will be at least τ and thus the computed
top-k is guaranteed to be correct.</p>
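      <p>The core merge-and-guarantee idea can be sketched as follows. This simplified version ignores the temporal (volatility) dimension and merges only per-area top-(k+1) frequency lists; the list contents and names are invented for the example.</p>
      <p>
```python
# Simplified sketch of the idea behind Algorithm 2: merge per-area
# top-(k+1) keyword frequency lists, compute the merged top-k, and
# report how many of its positions are guaranteed correct. The temporal
# dimension is omitted for brevity.
from collections import defaultdict

def merge_topk(lists, k):
    freq = defaultdict(int)
    tau = 0  # upper bound on the merged frequency of any truncated keyword
    for top in lists:                # each `top` is [(keyword, freq), ...] of length k+1
        for word, f in top[:k]:
            freq[word] += f          # aggregate frequencies across areas
        tau += top[k][1]             # the (k+1)-th frequency bounds unseen keywords
    ranked = sorted(freq.items(), key=lambda x: -x[1])[:k]
    guaranteed = sum(1 for _, f in ranked if f >= tau)  # positions that cannot change
    return ranked, guaranteed
```
      </p>
      <p>A keyword whose merged frequency is at least τ cannot be displaced by any keyword that was cut off in some area's materialized list, which is exactly the accuracy-reporting argument of lines 13-15.</p>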
    </sec>
    <sec id="sec-12">
      <title>5 STTCUBE CONSTRUCTION</title>
      <p>Here, we describe the proposed approach for constructing an
STTCube. Algorithm 3 takes a collection X of STT objects to be
analyzed, a textual taxonomy T with semantic information about
the terms, themes, topics, and concepts, and a geographical
taxonomy G for cities, regions, and countries. Standard date functions
are used for the temporal dimension processing. Moreover, it
also receives as input the parameters B and K as the budget and
number of top-K keywords for the partial materialization.</p>
      <p>Algorithm 3 constructs the STTCube in an incremental way:
it initializes an empty cube (line 2), and then the corresponding
spatial, textual, and temporal dimensions (lines 3-5) as well as the
Fact Table (line 6). If the cube is already constructed, i.e., the cube
is being updated instead of constructed for the first time, then
Algorithm 3 loads the existing STTCube (lines 2-6) and updates
it with new information. In particular, the spatial dimension has
the grid-based hierarchy and the semantic-based hierarchy, both with the base level
at each object's Location (i.e., the geographical point), and then
the levels City, Region, Country, and All (5 levels in total). The
textual dimension, instead, has the hierarchy built from the base
level Term, and then Theme, Topic, Concept, and All (5 levels in
total). Finally, the temporal dimension contains the Date and
TimeOfDay hierarchies mentioned in Section 3.</p>
      <p>Once the basic structure is prepared, Algorithm 3 loops through
each STT object in X (lines 8-13). In this loop, it extracts and
initializes from each STT object the base-level members for each
dimension. Then, once the base level data has been extracted, it
proceeds with building the various dimension hierarchies starting
from the existing base-level members and exploiting the provided
spatial and textual taxonomies (lines 8-12). Once the dimension
hierarchies are built, the STT object itself is then inserted in the
fact table of the STTCube (line 13) so that each fact is linked to
the lowest (base) level members in the respective dimensions.
In this step (line 13), the fact measure values are also computed
(e.g., the keyword count). As the last step (line 14) Algorithm 3
executes the (partial) materialization procedure.</p>
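      <p>The construction loop above can be sketched schematically as follows. All field names, the object representation, and the two tiny taxonomies are illustrative assumptions, not the paper's implementation.</p>
      <p>
```python
# Schematic sketch of the loop of Algorithm 3: extract base-level
# members from each STT object, climb the taxonomies to build hierarchy
# levels, then insert a fact linked to the base members.

def build_sttcube(objects, geo_taxonomy, text_taxonomy):
    dims = {"spatial": set(), "textual": set(), "temporal": set()}
    fact_table = []
    for location, terms, ts in objects:       # each object: (location, terms, timestamp)
        # spatial step: Location -> parent via the geographical taxonomy G
        dims["spatial"].add((location, geo_taxonomy.get(location, "UNKNOWN")))
        for term in terms:                    # 1-n: one fact, many Terms
            # textual step: Term -> parent via the taxonomy T (self-parent fallback)
            dims["textual"].add((term, text_taxonomy.get(term, term)))
        dims["temporal"].add(ts)
        fact_table.append({"loc": location, "terms": terms, "ts": ts,
                           "keyword_count": len(terms)})  # base measure
    return dims, fact_table
```
      </p>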
      <sec id="sec-12-1">
        <title>Spatial Hierarchies Construction</title>
        <p>In our proposed STTCube,
the base level for the spatial hierarchies is the Location present in
the raw data, i.e., the longitude and latitude points. Hence, we use
the Military Grid Reference System (MGRS) for the grid-based
hierarchy, and when building the semantic-based hierarchy, individual
points are linked to the respective cities using the information in
the available geographical taxonomy G, or to a special member
for points that link to unknown locations. Therefore, this
corresponds to the step function from Location to City. The spatial
taxonomy G is also used to generate the spatial hierarchy step
functions for the higher levels.</p>
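        <p>The two spatial step functions can be sketched as follows: a grid-cell assignment (a crude square-degree stand-in for MGRS cells, not real MGRS encoding) and the semantic Location-to-City step via the taxonomy G, with a special member for unknown locations. The taxonomy contents are invented.</p>
        <p>
```python
# Sketches of the two spatial step functions. grid_cell is a stand-in
# for MGRS cell assignment; location_to_city is the Location -> City
# step using the geographical taxonomy G.
import math

def grid_cell(lat, lon, side=1.0):
    """Snap a (lat, lon) point to a square grid cell of `side` degrees."""
    return (math.floor(lat / side), math.floor(lon / side))

def location_to_city(point, G, unknown="UNKNOWN_CITY"):
    """Step function from Location to City; unmatched points map to a
    special UNKNOWN member, as in the semantic-based hierarchy."""
    return G.get(point, unknown)
```
        </p>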
        <p>Textual Hierarchies Construction. The unstructured nature
of the text makes it a challenging task to convert it into a
dimension of a cube. In Algorithm 3, the ProcessText function
(line 11) implements the following steps: (1) splits the text into
individual words, (2) removes stop words, and (3) converts the
remaining words to their base form (e.g., “works” and “working”
have the same base form “work”). The final processed text is used
to populate the Term base level in the textual dimension. This
implements the base step function in the textual hierarchies, and
links every fact to one or more Terms, hence it has a 1–n
cardinality. Moreover, while constructing the higher levels, using the
semantic taxonomy T (e.g., WordNet), each STT object is linked
to one or more Themes, and similarly for Topics and Concepts.</p>
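        <p>The three ProcessText steps can be sketched as follows. Note that the paper uses Stanford CoreNLP for lemmatization; the crude suffix stripper below is only a placeholder for that step, and the stop-word set is invented.</p>
        <p>
```python
# Minimal stand-in for ProcessText in Algorithm 3:
# (1) tokenize, (2) remove stop words, (3) reduce words to a base form.

STOP_WORDS = {"the", "a", "is", "at", "and"}  # illustrative subset

def base_form(word):
    # Placeholder for real lemmatization (the paper uses Stanford CoreNLP).
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def process_text(text):
    tokens = text.lower().split()                         # (1) split into words
    tokens = [t for t in tokens if t not in STOP_WORDS]   # (2) remove stop words
    return [base_form(t) for t in tokens]                 # (3) base form
```
        </p>
        <p>For example, "works" and "working" both reduce to the same base form "work", which then populates the Term base level.</p>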
      </sec>
    </sec>
    <sec id="sec-13">
      <title>6 EXPERIMENTAL EVALUATION</title>
      <p>Now, we report on the performance of STTCube analysis. In
particular, we compare the different materialization strategies for
STTCube and No STTCube (NC) implementations in terms of
query response time (QRT) and storage cost. NC answers the
queries by computing the query response from base data without
constructing the STTCube. Also, we compare QRT and
hierarchy construction time for different combinations of hierarchy
schemes. Moreover, we also report on the accuracy of PAM and
demonstrate its performance advantage when compared to
PEM. Lastly, we compare QRTs for different spatial and textual
hierarchy schemes, showing that the combination of Grid-based
spatial and Majority-based textual (GM) hierarchy schemes achieves
the fastest QRTs among all hierarchy combinations.
Experimental Setup. We evaluate the STTCube on a real-world
Twitter dataset containing 125 million tweets collected over six
weeks. Each tweet contains the tweet location, text, and time.
We implemented the STTCube in a leading commercial RDBMS,
referred to as RDBMS-X, as we cannot disclose its name. The proposed
design is realized using a snowflake schema to avoid redundancy
in the dimension data.</p>
      <p>We implemented the Pre-Processing (PP) component, where
the whole raw dataset is parsed and the relational tables are
populated, in Java (v11). All tests are run on a Windows Server
machine with 2 Intel Xeon 2.50GHz CPUs and 16GB RAM.</p>
      <p>
        We extracted the taxonomy for the spatial dimension from
GeoNames [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For the City level, we considered all the cities
having a population &gt; 1000, and for the Region level, we use the
administrative divisions information available in the GeoNames
dataset. We use a reverse geocoding process to find the city
name for the Location coordinates.
      </p>
      <p>
        For the textual dimension, as a taxonomy for Terms, Themes,
Topics, and Concepts, we use the widely used WordNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We
use the direct HYPERNYM link of WordNet to decide the parent
member for a Term, Theme, and Topic. If a term is present in
WordNet and has a super-class (HYPERNYM) then the
superclass becomes the parent of the term. Otherwise, it becomes its
own parent (this avoid unbalanced hierarchies and UNKNOWN
values in the hierarchy). For text pre-processing –tokenization,
lemmatization, and stop word removal– we use the Stanford Core
NLP library [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. We implemented the temporal dimension using
the standard Date and Time functions supported in RDBMS-X.
      </p>
      <p>We implemented the semantic-based and grid-based hierarchy
schemes for the spatial dimension, the replication-based and
majority-based hierarchy schemes for the textual dimension (Section 3.2),
and the Date hierarchy for the temporal dimension.</p>
      <p>Spatial, Textual, and Temporal Level Members. The base
levels contain 40.1 million unique Location Points and 9.8 million
unique Terms. The GeoNames taxonomy contains 132K cities,
divided into 4K administrative divisions (regions) for 247
countries. Among those, we have tweets for 104K cities, 3.8K regions,
and 246 distinct countries. In the textual hierarchy, terms are
grouped into 23.8K Themes, 19.4K Topics, and 17.6K Concepts.
Furthermore, the temporal dimension spans 37 days. Finally,
for PAM we materialize the K = 31 densest keywords.</p>
      <p>We compare PEM and PAM strategies with the following three
baselines. No STTCube (NC): is the traditional RDBMS setup
with all textual, spatial, and temporal functions implemented
as built-in or user-defined functions. Specifically, NC uses
user-defined functions for text (for retrieving individual terms) and
location processing (e.g., identification of the city a particular
longitude, latitude point belongs to) and built-in functions for
timestamp. Further, NC filters on location and timestamp for
the queried area and time and performs a series of joins, e.g., 4
joins for Concept level, to retrieve information for the requested
textual level. Finally, it groups results on the textual and
temporal columns, computes the STT measure values, and performs
the top-k selection. NC is the traditional solution one would go
for without the STTCube. No Materialization (NM): constructs
the STTCube and minimizes the storage cost by only
materializing the base cuboid and computing all query responses from it.
Full Materialization (FM): minimizes the QRT by
materializing every cuboid in the STTCube. With this approach queries are
answered through a lookup in the pre-computed cuboid. These
three baselines are at the extreme ends of the space-time trade-off
and are usually infeasible for large datasets.</p>
      <p>Queries. We perform experiments on five different sizes of
datasets using nine different STT queries. Each STT query, described
in Table 3, targets a different level of spatial, textual, and
temporal granularity. Each query requests either dense or volatile
keywords with a time range, which is used for volatile but not for
dense keyword queries. We execute each query ten
times with randomly generated parameters for each method and
report the mean and standard deviation.</p>
      <p>Query Response Time. For the Top-k Dense and Top-k Volatile
Keywords within an area measures, we compare the QRT of PEM and
PAM with the NC, NM, and FM baselines. For Keyword Density
and Keyword Volatility, no approximate solution is possible, so we
only compare PEM with NC, NM, and FM. As the Majority-based
textual hierarchy scheme does not process Terms (Section 3.2), we
only evaluate the five out of nine queries requesting Theme, Topic, and
Concept for it (Figures 4c, 4d, 4g, and 4h). Furthermore, we cannot
evaluate PAM for Q9 as no approximate solution is possible for
it.</p>
      <p>Figure 4: QRTs of NC, NM, PEM, PAM, and FM for queries Q1-Q9 (x-axis), QRT in msec on the y-axis (log scale). Panels: (a) GR - Keyword Density; (b) GR - Keyword Volatility; (c) GM - Keyword Density; (d) GM - Keyword Volatility; (e) GR - Top-k Dense Keywords; (f) GR - Top-k Volatile Keywords; (g) GM - Top-k Dense Keywords; (h) GM - Top-k Volatile Keywords.</p>
      <p>We plot results in Figures 4a–4h for 100% (125M) of the data, as
the results are similar for smaller data sizes. Each row in Figure 4
shows the QRTs for one particular combination of spatio-textual
hierarchy schemes. Specifically, Figures 4a, 4b, 4e, and 4f show
the QRTs for the Grid-based spatial and Replication-based textual
(GR) hierarchy combination for all measures. Similarly, Figures 4c,
4d, 4g, and 4h show QRTs for the Grid-based spatial and
Majority-based textual (GM) combination. Figure 4 has queries on the
x-axis and QRTs in msec on the y-axis (note: log scale). Figure 4
confirms that NC is 1−5 orders of magnitude slower than NM.
Specifically, regardless of the spatial hierarchy scheme, it is 1−2
and 3−5 orders of magnitude slower than NM for the
Replication-based and Majority-based textual hierarchies, respectively. The
Majority-based textual hierarchy scheme achieves faster QRTs
because it does not process individual Terms but directly links
the Theme to the fact, hence drastically reducing the number of rows
to process (from millions to thousands). Furthermore, NM is 1−4
and 3−5 orders of magnitude slower than PEM and PAM,
respectively, for all measures and combinations of hierarchy schemes.
PEM is on average six times slower than FM, which achieves its
fast QRTs at the expense of a highly increased storage cost
(Figure 6a). PAM achieves near-optimal QRTs because it materializes
only the K densest keywords in the cuboid, hence it has much
fewer rows to process. QRTs for Q9 for the Top-k Volatile Keywords
within an area and Top-k Dense Keywords within an area
measures for all combinations of hierarchy schemes are the worst for
PAM (same as NM) because it requests ALL keywords' densities
instead of the top-k, which cannot be computed from the approximate
pre-aggregated information. To generate a response for Q9, we
have to process all detail data directly from the base facts. In
comparison, PEM and PAM materialize a subset of views (also a
subset of rows for PAM) and use the pre-aggregated measure
values in those views to efficiently generate a response for a query
instead of processing base facts, thus improving the overall QRT.
NC is the slowest of all (1−5 orders of magnitude slower than
the slowest STTCube variant, NM) because it has to process the complete
dataset to compute each query response, and cannot take
advantage of the STTCube optimizations for STT measures. Among
all the hierarchy scheme combinations, GM has the fastest QRTs,
mainly because the Majority-based scheme drastically reduces the
row count by linking the Theme directly to each Fact instead
of individual Terms, whereas GR has the slowest QRTs due to
the Replication-based scheme having far more rows to process than the
Majority-based textual hierarchy. Furthermore, the Grid and Semantic-based
spatial hierarchies have similar QRTs.</p>
      <p>Figure 5 shows the scalability of PEM and PAM over growing
data sizes for different combinations of hierarchy schemes and
confirms that the QRTs are almost constant as the data grows.
This is because the sizes of the materialized views do not increase much
as the data grows. Only new dimension members, e.g., new
cities or topics, increase the size of materialized views, but only
by a small fraction. Figures 5f–5j confirm that the GM hierarchy
combination results in the fastest QRTs, i.e., all QRTs &lt; 100 msec.
On the contrary, Figures 5a–5e show that GR yields the slowest
QRTs, with QRTs as high as 400 msec. Figure 5 confirms that PAM
consistently achieves the fastest QRTs (mostly &lt; 100 msec with
a few a bit over 100 msec) regardless of hierarchy schemes. Figure 5
shows that PEM and PAM scale linearly w.r.t. data size.
Storage Cost. We now compare the storage cost for FM, PEM,
PAM, and NM. We do not compare NC's storage cost because it
does not construct the STTCube, and hence does not materialize
anything. We only show the storage cost for up to 20 million tweets because
FM takes an unfeasible amount of time (shown in Figure 6c), while
for the other methods and over the larger datasets we observe the
same trend. We use the number of rows in a view as its storage
cost. The base cube's storage cost is always needed. Besides that,
every additional materialized view adds to the storage cost, as
displayed in Figure 6a, which shows the storage cost of NM, PAM,
PEM, and FM over growing data sizes. The materialization of the
STTCube using PEM and PAM adds only 13% and 0.1% to the
storage cost of the base cube, respectively, whereas using FM
increases the storage cost by more than an order of magnitude.
PEM reduces the storage cost by only materializing a subset of
views (four views) and still achieves 2-5 orders of magnitude
improvement in QRT (Figure 4). PAM further reduces the storage
cost by only materializing a subset of rows in a view (the top-K) and
gains an additional order of magnitude improvement in QRT.
On the other hand, FM materializes all views in a cube, i.e., 500
(5 × 5 × 5 × 4) views in our case, which makes the view
materialization storage cost much higher (one order of magnitude) than
the base cube itself, as shown in Figure 6a. Figure 6a confirms
that our proposed methods PEM and PAM reduce the storage
cost by between 97% and 99.9% compared to FM.</p>
      <sec id="sec-13-1">
        <title>Views Selection for Materialization</title>
        <p>Our proposed methods
PEM and PAM are partial materialization methods that
materialize only a subset of the cuboids. Hence, an important trade-off to
be understood is between the number of cuboids to materialize,
the corresponding storage cost, and the gain in query response
time achieved. We empirically evaluate the benefit gained
(improvements in QRT for all dependent cells which can be answered
using this view) against the cost of materializing the view
(Algorithm 1). We consider the base cube as a necessary view to
be materialized and consider its benefit as zero. Figure 6b shows
that materializing three cuboids ((Day, City, Term), (Day, Location,
Theme), and (Day, Region, Term)) on top of the base cube gains the
most benefit, after which we do not get a significant advantage from
materializing further cuboids. The reason is that the materialized
cuboids are already small enough, so the benefit of materializing
any descendant cuboid is small. Hence, materializing 4 cuboids
represents the best trade-off between QRT and storage cost.
Pre-Processing and Cube Construction. Here, we report the
time for the construction of STTCube. Construction of an STTCube
is divided into two steps: 1) Pre-Processing (PP) of base facts (STT
objects) and population of the relational tables and 2)
materialization of views. Further, the materialization of views can be
done either using FM, PEM, or PAM. In Figure 6c, we have data
sizes on the x-axis and time in minutes on the y-axis (note: log
scale). FM is the most time-consuming of all; it adds
significant overhead on top of the PP time and does not scale. On the
contrary, the PEM and PAM times are negligible compared to the FM
time. Hence, with PEM and PAM, the STTCube construction time
scales linearly. To evaluate STTCube’s ability to handle updates
(maintenance wall-clock time), we performed several updates
of 25M tweets each (PP_INC in Figure 6c). Experiments confirm
that STTCube’s update time grows linearly with the amount of
new STT objects because it only processes the new STT objects
and updates respective fact and dimensions tables.</p>
        <p>Furthermore, we compare the different hierarchy schemes
w.r.t. their construction time. Figure 6d shows the hierarchies'
construction time for the different hierarchy schemes. It is evident
from Figure 6d that, among all, the Replication-based textual
hierarchy scheme takes the longest to construct because, for each
single spatio-textual-temporal object, it has to process each
individual Term and construct a hierarchy for it. Whereas, for all
other schemes, for each spatio-textual-temporal object only one
hierarchy instance is processed. Figure 6d confirms that all of the
hierarchy schemes are constructed in linear time w.r.t. data size,
allowing STTCube to support multiple hierarchy schemes.
Accuracy. Given that PAM efficiently computes the approximate
measure values, it becomes necessary to evaluate its accuracy.
To evaluate the accuracy of PAM, we use NM's results as ground
truth. Our evaluation results in Table 7a confirm that it achieves
high accuracy. Specifically, it is 100% for 6 out of 8 queries, and
90-97% for 2. Queries with 90-97% accuracy request as many
keywords as are materialized, and the risk of having wrong results
near the border (bottom of the top-k list) is higher.</p>
        <p>QRT of STTOLAP Operators. Our proposed materialization
strategies (PEM and PAM) improve the QRTs for STTOLAP
operators. To demonstrate this, we perform a series of STTOLAP
operations and measure their QRT for the different materialization
strategies. Figure 7b shows the QRTs of multiple STTOLAP
operations for the different materialization strategies. We have the STTOLAP
operations on the x-axis (RU, D, S, and DD represent the STT Roll-Up,
Dice, Slice, and Drill-Down operators, respectively) and QRT in
msec on the y-axis. It is evident that NM is on average 3-5 orders
of magnitude slower than PEM, which is one order of magnitude
slower than PAM. Furthermore, PAM achieves near-optimal QRTs,
just a fraction higher than FM. These experiments confirm that
STTCube's materialization methods (PEM and PAM) improve the
STTOLAP operators' QRTs by materializing only a subset of cuboids.
Top-K Value Estimation. Here, we study the relationship
between QRT and the value of the materialized K.</p>
        <p>Figure 7b: STTOLAP Operations' QRTs, for the operation sequence Start, RU, RU, D, S, DD, DD and methods NM, PEM, PAM, and FM.</p>
        <p>We create seven
different STTCube materialization versions using 10, 20, 50, 100,
200, 500, and 1000 as the value of K. Next, we use the Gamma
distribution to generate 100 random numbers, to be used as top-k
values, in the range of 1 to 1000. We chose the Gamma
distribution because it resembles a common long-tail distribution for
top-k values. We execute each query for all the 100 generated
top-k values over all seven materialization versions. Figure 7c
shows the QRT for all queries over the different materialization
versions. For K = 10 and 20, the median value is the same as the box
top, hence not visible in the plot. It is evident from Figure 7c
that a larger materialized K achieves faster QRTs (lower
median value) because almost all the queries are answered using
the pre-computed measure values. But, in the case of a smaller K,
all the queries requesting k &gt; K need to be answered using the
non-pre-computed measure values from the base cuboid, hence
resulting in slower QRTs (higher median value). A larger value
of K, such as 1000, is not recommended because 1) there will be
very few queries requesting a larger top-k and 2) it will incur
more storage cost (Figure 6a). Specifically, between K = 50 and
100 and between K = 100 and 200, QRT decreases by 35% and 0% while storage
increases by 250% and 200%, respectively. Hence, these experiments
confirm that choosing a value between 20 and 50 for K in our current
experimental settings is a near-optimal choice.</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>7 CONCLUSION AND FUTURE WORK</title>
      <p>In this paper, we defined and formalized the STTCube structure
to effectively perform STT cube analytics. We introduced STT
hierarchies, STT measures, and STTOLAP operators to analyze STT
data together. For efficient exact and approximate computation
of STT measures, we proposed a pre-aggregation framework able
to provide faster response times while requiring only a controlled amount
of extra storage for pre-computed measure values. We
observed that partial materialization provides a 1 to 5 orders of
magnitude reduction in query response time, with between 97%
and 99.9% reduced storage cost compared to full materialization
techniques. Moreover, approximate materialization provides
accuracy between 90% and 100%, while requiring considerably
less space compared to no materialization techniques. In future
work, we plan to enhance the STTCube with additional STT measures
and a distributed implementation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <year>2020</year>
          . GeoNames. http://download.geonames.org/. Accessed:
          <fpage>2020</fpage>
          -09-09.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Almaslukh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Magdy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Mokbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elnikety</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. G.</given-names>
            <surname>Aref</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Local trend discovery on real-time microblogs with uncertain locations in tight memory environments</article-title>
          .
          <source>GeoInformatica</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Azabou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khrouf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Soulé-Dupuy</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Vallès</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Analyzing textual documents with new OLAP operators</article-title>
          .
          <source>AICCSA</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Cong,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skovsgaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Yiu</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Spatial Keyword Querying</article-title>
          .
          <source>ER</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Cong,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Spatial Keyword Query Processing: An Experimental Evaluation</article-title>
          .
          <source>PVLDB</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Dong, J. Han,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Wah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Multi-dimensional Regression Analysis of Time-series Data Streams</article-title>
          .
          <source>VLDB</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Chouder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rizzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Chalal</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Exploratory OLAP over Doc Stores</article-title>
          .
          <source>IS</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Feng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Querying and mining geo-textual data for exploration: Challenges and opportunities</article-title>
          .
          <source>ICDEW</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cong</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Spatial Keyword Queries and Beyond</article-title>
          .
          <source>SIGMOD</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Lin</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Oza</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Efficient Keyword-Based Search for Top-K Cells in Text Cube</article-title>
          .
          <source>TKDE</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>WordNet: An Electronic Lexical Database</article-title>
          .
          <source>MIT Press</source>
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Zhang, J. Han,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>STREAMCUBE: Hierarchical spatio-temporal hashtag clustering for event exploration over the Twitter stream</article-title>
          .
          <source>ICDE</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosworth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Layman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirahesh</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Data cube: a relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS</article-title>
          .
          <source>ICDE</source>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gür</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zimanyi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A foundation for spatial data warehouses on the Semantic Web</article-title>
          .
          <source>Semantic Web</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Koperski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Stefanovic</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>GeoMiner: A System Prototype for Spatial Data Mining</article-title>
          .
          <source>SIGMOD</source>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Harinarayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Implementing data cubes efficiently</article-title>
          .
          <source>SIGMOD</source>
          (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Jayachandran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tunga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kamat</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Nandi</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Combining User Interaction, Speculative Query Execution and Sampling in the DICE System</article-title>
          .
          <source>ICDE</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomsen</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Multidimensional Databases and Data Warehousing</article-title>
          . Morgan &amp; Claypool Publishers.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Kenney</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Keeping</surname>
          </string-name>
          .
          <year>1962</year>
          . Mathematics of Statistics, Part 1, chapter 15. van Nostrand (
          <year>1962</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Knijf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Frasincar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Hogenboom</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Domain taxonomy learning from text: The subsumption method versus hierarchical clustering</article-title>
          .
          <source>DKE</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Lieberman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sankaranarayanan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sperling</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>STEWARD: Architecture of a Spatio-textual Search Engine</article-title>
          .
          <source>GIS</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ding</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Text Cube: Computing IR Measures for Multidimensional Text Database Analysis</article-title>
          .
          <source>ICDM</source>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Klosowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Scheidegger</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Nanocubes for real-time exploration of spatiotemporal datasets</article-title>
          .
          <source>TVCG</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hancock</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>M.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Pokorny</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A Text Cube Approach to Human Social and Cultural Behavior in the Twitter Stream</article-title>
          .
          <source>SBP</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Magdy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Abdelhafeez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kang</surname>
          </string-name>
          , E. Ong, and
          <string-name>
            <given-names>M.F.</given-names>
            <surname>Mokbel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Microblogs data management: a survey</article-title>
          .
          <source>The VLDB Journal</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Finkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>McClosky</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          .
          <source>ACL</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R.</given-names>
            <surname>Othman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Belkaroui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Faiz</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Customer Opinion Summarization Based on Twitter Conversations</article-title>
          .
          <source>WIMS</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pat</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kanza</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Where's Waldo?: Geosocial Search over Myriad Geotagged Posts</article-title>
          .
          <source>SIGSPATIAL</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Pérez-Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Berlanga-Llavori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Aramburu-Cabo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Contextualizing Data Warehouses with Documents</article-title>
          .
          <source>DSS</source>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ravat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Teste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tournier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Zurfluh</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Top_keyword: An aggregation function for textual document OLAP</article-title>
          .
          <source>DaWaK</source>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rivest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bédard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Marchand</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Toward better support for spatial decision making: defining the characteristics of spatial on-line analytical processing (SOLAP)</article-title>
          .
          <source>Geomatica</source>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sankaranarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Teitler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Lieberman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sperling</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>TwitterStand: News in Tweets</article-title>
          .
          <source>SIGSPATIAL</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Skovsgaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sidlauskas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Scalable top-k spatiotemporal term querying</article-title>
          .
          <source>ICDE</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Walther</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaisser</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Geo-spatial Event Detection in Twitter Stream</article-title>
          .
          <source>ECIR</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Deep learning for spatio-temporal data mining: A survey</article-title>
          .
          <source>TKDE</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ni</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Hierarchical Topic Modeling of Twitter Data for Online Analytical Processing</article-title>
          .
          <source>IEEE Access</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and J. Han.
          <year>2019</year>
          .
          <article-title>Multidimensional Mining of Massive Text Data</article-title>
          .
          <source>DMKD</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Oza</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Topic Modeling for OLAP on Multidimensional Text Databases</article-title>
          .
          <source>Stat. Anal. Data Min.</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Cong</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Topic Exploration in Spatio-Temporal Document Collections</article-title>
          .
          <source>SIGMOD</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>