<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TourMiner: Effective and Efficient Clustering Big Mobile Social Data for Supporting Advanced Analytics Tools</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gloria Bordogna</string-name>
          <email>bordogna.g@irea.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfredo Cuzzocrea</string-name>
          <email>alfredo.cuzzocrea@dia.units.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Frigerio</string-name>
          <email>frigerio.l@irea.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Psaila</string-name>
          <email>psaila@unibg.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Amendola</string-name>
          <email>daniloamendola@gmail.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNR-IREA</institution>
          ,
          <addr-line>Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIA Dept., University of Trieste</institution>
          ,
          <addr-line>Trieste</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>DIGIP Dept., University of Bergamo</institution>
          ,
          <addr-line>Bergamo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>ICAR-CNR</institution>
          ,
          <addr-line>Rende</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>IRCCS Bonino-Pulejo</institution>
          ,
          <addr-line>Messina</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays a great deal of attention is devoted to the issue of supporting big data analytics over big mobile social data. These data are generated by modern emerging social systems like Twitter, Facebook, Instagram, and so forth. Mining big mobile social data has been of great interest, as analyzing such data is critical for a wide spectrum of big data applications (e.g., smart cities). Among several proposals, clustering is a well-known solution for extracting interesting and actionable knowledge from massive amounts of big mobile (geo-located) social data. Inspired by this main thesis, this paper proposes an effective and efficient similarity-matrixbased algorithm for clustering big mobile social data, called TourMiner, which is specifically targeted to clustering trips extracted from tweets, in order to mine most popular tours. The main characteristic of TourMiner consists in applying clustering over a well-suited similarity matrix computed on top of trips.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Social networks coupled with the diffusion of smart mobile devices equipped with GPS
sensors have contributed to generate a big amounts of users-generated contents which are
geo-referenced and timestamped. These are recognized in literature as big mobile social
data (e.g., [1,2,3]). By analizying recent research trends, it follows that, nowadays, a great
deal of attention is devoted to the issue of supportingbig mobile social data analytics.
These data are generated by modern emerging social systems like Twitter, Facebook,
Instagram, and so forth. Mining big mobile social data (e.g., [13,14]) has been of great
interest, as analyzing such data is critical for a wide spectrum of big data applications
(e.g., smart cities). The possibility of cross-analyzing geo-references, timestamps,
characteristics of content authors, and, last but not least, message contents, poses new
challenges to the research community (e.g., [4]). CISCO estimated that data on the
Internet will increase at a Compound Annual Growth Rate of 25% by the year 2017. Thus,
in order to deal with continuously-growing-in-size data sets, it will be necessary to
frequently scale-up existing algorithms or to define new paradigms for big mobile social
data analytics (e.g., [5,6]).</p>
      <p>Another aspect to be faced-off is represented by the amenity of synthesizing big data
collections with the purpose of highlighting relevant meanings embedded within them.
Several real-world applications, such as social network community identification (e.g.,
[7]) and their trajectory summarization (e.g., [8]), require means to filter-out the
communities of users and further to group and represent similar trajectories. This latest
task implies scaling-down the collecions of big data by removing noise and redundancies,
so that highlighting the main representative spatio-temporal patterns (e.g., [9]). Faceted
search (e.g., [10]) and clustering techniques can be used to this purpose since their goal is
to identify groups of items within the target data set with common characteristics in a
feature space, while removing outliers and noise which are considered uninteresting from
further analysis.</p>
      <p>Inspired by this main thesis, this paper proposes an efficient implementation of
similarity-matrix-based algorithm for clustering big mobile social data, called
TourMiner, which is specifically targeted to clustering trips extracted from geo-located
tweets, in order to mine most popular tours. The main characteristic of TourMiner
consists in applying clustering (any clustering algorithm) over a well-suited similarity
matrix (possibly using distinct similarity measures) computed on top of trips.</p>
      <p>Looking at specific algorithmic charateristics, TourMiner can be considered as a
hybrid algorithm as it sequentially applies a “scaling-up” and subsequently a
“scalingdown” phase to mine tours. In particular, the first phase applies a scale-up approach in
order to identify the community of tourists visiting a region of interest within a big set of
Tweet messages, which are collected during a period of time, so that data increasingly
scale up with time, thus efficiently improving our previous contribution [15]. The second
phase, which is based on geo-partitioning and geo-clustering, applies a scale-down
approach for identifying and synthetising the main tours followed by most of the tourists
within the target region, by exploiting low-cost hardware, GPU-based computing
paradigms (e.g., [16]) and open-source libraries to manage available memory resources,
thus significantly extending and further improving our previous contribution [17] and
clearly matching the strict computational requirements of big data management by means
of low cost hardware (e.g., [18]).</p>
    </sec>
    <sec id="sec-2">
      <title>2 TourMiner: An Effective and Efficient Algorithm for</title>
    </sec>
    <sec id="sec-3">
      <title>Most-Popular Tours from Big Social Media</title>
    </sec>
    <sec id="sec-4">
      <title>Mining</title>
      <p>In this Section, we provide principles and implementation of the proposed algorithm
TourMiner. TourMiner comprises three main phases:
1.
2.</p>
      <p>Geo-partitioning: trips extracted from tweets are generated according to
different representations so that they are defined as semantic trajectories over
a spatial domain ontology, intended as a representation of spatial entities of
interest with their geographic footprints;
Computing the pairwise trip similarity matrix: trips are processed in order to
compute a suitable similarity matrix that is the fundamental data structure of
our clustering approach;
Trip clustering (based on the similarity matrix): finally, on the basis of the
similarity matrix, trips are clustered as to extract actionable knowledge from
the target big mobile social media data.</p>
      <p>In the following, we describe these phases in details.</p>
      <sec id="sec-4-1">
        <title>2.1 Geo-Partitioning</title>
        <p>Firstly, each trip is represented by a semantic trajectory, i.e., a sequence of IDs a1, a2, …,
an that identify interesting locations on the spatial domain, such as Points Of Interest
(POIs) like restaurants, hotels, museums, etc, or area districts such as city areas,
municipalities, etc. An example of semantic trajectory defined by sequence of IDs could
be:</p>
        <p>&lt;Malpensa Airport, Milano Central Station, Bergamo Railway Station, Piazza
Vecchia&gt;</p>
        <p>By this representation we may argue that the traveller landed in Malpensa airport, then
went to Milano Central Station and from there he travelled by train to Bergamo where he
visited the medioeval Piazza Vecchia.</p>
        <p>This same trajectory could be represented in a coarsest way bearing more uncertainty
on the visited locations, for example by a sequence of municipalities, namely:
&lt;Malpensa, Milano, Bergamo, Bergamo&gt;
In this case, we can just guess the visited cities.</p>
        <p>In this phase, one can choose to represent a trip based on the following two kinds of
IDs:</p>
        <p>a) Categorical IDs: trips are represented in terms of a sequence of objects of
interest that are implemented as Point Of Interest (POI) on the spatial domain or zones
such as municipalities, ZIP code areas, NUTS etc. In this case, IDs are categorical and
neither a metric, nor an order is defined between them.</p>
        <p>b) Numerical IDs: trips are represented in terms of a sequence of natural numbers
ai  N among which a linear order is defined so that ai  aj if i  j. These IDs identify
cells in a regular tessellation of the spatial domain, and can be generated by space filling
functions [12] such as the Z-Order and the Peano curve. They are named as GeoHash
codes and a metric over them is defined by a function of kind: |.| N → N that satisfies the
following properties: (i) |ai – ai| = 0; (ii) |ai – aj| = |aj – ai|; (iii) |ai – aj| + |aj – ak|  |ai –
ak|.</p>
        <p>The advantage of using GeoHash codes is represented by the opportunity of using
similarity measures based on a metric distance, thus achieving more accuracy. In fact,
when using categorical IDs, the matching operation between two IDs is crisp, they either
match or not, and, in this context, it is meaningless to evaluate fuzzy matching, since two
IDs, e.g. ZIP codes, may differ only by one digit and be distant in space. On the other
hand, when using GeoHash codes, one obtains trajectoies whose semantic is both implicit
and uncertain with respect to the objects of interest: to explicit the semantics, one should
determine which POIs are within each cell identified by the GeoHash codes.</p>
      </sec>
      <sec id="sec-4-2">
        <title>2.2 Computing the Pairwise Trip Similarity Matrix</title>
        <p>Computing the pairwise trip similarity matrix depends on the types of IDs used to
represent trips (see Section 2.1), i.e. via categorical IDs or numerical IDs. In the
following, we address both cases. Before computing such similarity matrix, we first
eliminate redundancies, i.e. consecutive duplications of the same ID in the trip
representation. Therefore, a trip of kind:
becomes of kind:
after redundancy removal.</p>
        <p>A = ID1, ID2, ID3, ID3, ID3, ID4, ID3</p>
        <p>A = ID1, ID2, ID3, ID4, ID3
Furthermore, pointwise trips are deleted. For instance, a trip of kind:</p>
        <p>B = ID1, ID1, ID1</p>
        <p>B = ID1
becomes of kind:
after redundancy removal and then, being pointwise, it is deleted.</p>
        <p>The motivation of redundancy removal relies in the idea of not to favor the subsequent
grouping of trips into a tour just because many tweets have been sent from the same
locality. This usually happens in the airport area, and free WIFI hotspots .</p>
        <p>We next describe how the pairwise trip similarity matrix is defined for both cases (i.e.,
categorical IDs and numerical IDs).</p>
        <p>Case 1: Categorical IDs. Let A = a1, a2, …, aN and B = b1, b2, …, bM denote a pair of
trips of different length, such that N  M, being ai and bj strings that identify categories
(i.e., ZIP or NUTS codes). Let A  B denote the intersection trip between A and B. We
define the cardinality of A  B, denoted by A  B, as follows:
|  | =</p>
        <p>max
 =1,…max⁡( , )⁡|⁡ ⁡≠⁡ ⁡∀⁡ &lt; ⁡
(  =   )
(1)
(2)
(3)
(4)
min( , )
∑</p>
        <p>Formula (1) is computed as follows. We first scan every element of the shortest trip A
and match it with every element of the longest trip B by increasing the counter if there is a
new element in the longest trip (not previously matching any other element of B) that
matches the current element of A. This way, we constrain an element of B to be selected
only once by one single element of A. This measure is symmetric. This means that by
scanning the longest trip and matching it with every element of the shortest one of the two
respective trips we would obtain the same result due to the fact that we cancel, at each
match, the elements already selected.</p>
        <p>Furthermore, the cardinality of intersection trip satisfies the following properties:
A  B = B  A</p>
        <p>A  A = A</p>
        <p>These similarity measures, just like to case of trips modeled as categorical IDs, are
used to enrich the knowledge discovery phase from big social media implemented in our
framework.</p>
        <p>On the basis of the definitions above, the pairwise trip similarity matrix SIM is
introduced as a D  D matrix such that, for each pair of trips ti and tj, the entry SIM(ti,tj)
satisfies the following properties:
 . 
 . 
(  ,   ) = 1
(  ,   ) = 
(  ,   )
(10)</p>
        <p>Based on (ii), the similarity matrix SIM is a triangular matrix hence we need to
compute only its values below the main diagonal, i.e.  i = 1, …, D. We compute
SIM(ti,tj) with j = 1,…, i-1 only.</p>
        <p>Moreover, since the computation of the similarity between any two trips is
independent from the other trips, in our implementation this process has been parallelized
by exploiting GPU-based computing paradigms.</p>
      </sec>
      <sec id="sec-4-3">
        <title>2.3 Similarity-Matrix-based Trip Clustering</title>
        <p>Here we describe the core strategy of our proposed TourMiner algorithm, i.e. clustering
of trips based on the similarity matrix SIM (see Section 2.2).</p>
        <p>In this phase, clustering is applied to group trips into clusters based on their similarity,
and then the biggest clusters are selected as popular tours. In principle, any clustering
algorithm based on a similarity matrix could be applied. In our implementation, we tested
both the density based DBSCAN clustering algorithm, generating a flat partition [24], and
the hierarchical agglomerative complete link clustering whose main idea consists in
generating, at each hierarchical cycle, a new cluster by focusing on the two clusters, said
Ci and Cj , with greatest similarity in SIM. The complete link clustering main steps are the
following:
Step 1. Identification of the two clusters to fuse Ci and Cj as follows:
{ ,  } = argmax(
( ,  ))⁡⁡ ,  = 1, … , 
(11)
Step 2. Computation of the similarity of the actual clusters with respect to the candidate
cluster in terms of the minimum among the similarities with the elements of the candidate
cluster (see formula in step 4.3 in of Algorithm 1). This guarantees that, when a new
element or cluster is added to a cluster to form a higher-level cluster, all its (clustered)
elements share a minimum similarity degree. This allows generating compact clusters.Step
3. Selection of the popular tours: they are the first k biggest clusters in partition with
minimum similarity , being k and  tunable input parameters. identifies the level of the
cluster hierarchy, i.e., the cluster partition. k is used to select the clusters with greatest
number of trips from this partition.
In order to prove the effectiveness and the efficiency of our proposed algorithm
TourMiner, we designed and developed an experimental campaign, whose results are
reported in this Section.</p>
        <p>In particular, experiments were aimed at evaluating the proposed clustering approach
in terms of both time and memory needed to process the distinct phases (see Section 2).
We designed and evaluated optimized solutions for all these phases. We discuss these
solutions in the following.</p>
        <p>We developed ad-hoc metrics over such phases, and we studied the evolution of such
metrics by increasing the number of tweets and the number of trips. Both parameters
well-test the scalability of our proposed algorithm. In the following, we provide our
experimental results.</p>
        <p>First, in order to have a better understanding on the realibility of our experimental
campaign, Fig. 1 shows the time needed to create annotated trips, i.e., semantic
trajectories, in terms of GeoHash codes depending on the number of tweets. This, indeed,
allows us to better estimate the size of data we deal with, as size is one of the main
characteristics of big data to be considered when designing algorithms that run on big data
sets.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Conclusions and Future Work</title>
      <p>In this paper, we have introduced and experimentally assessed TourMiner, an effective
and efficient algorithm for spatially mining big mobile social data about to semantic
trajectories. TourMiner implements a clustering procedure that founds on (i) suitable
semantic representations of trips extracted from Twitter data, alternatively in terms of
categorical IDs or numerical IDs, and (ii) a well-suited similarity matrix computed on top
of such trips. We provided motivations, definitions and analysis of TourMiner, along
with its experimental assessment and analysis over Twitter data according to several
(experimental) parameters. Results clearly comfirm the benefits coming from our
proposal.</p>
      <p>Future work is mainly oriented towards enriching our proposed framework with
innovative characteristics like uncertain big data management (e.g., [19]) and big data
approximation paradigms (e.g., [27]).
[1] S. M. Allen, M. J. Chorley, G. Colombo, E. Jaho, M. Karaliopoulos, I. Stavrakakis, R. M. Whitaker,
“Exploiting User Interest Similarity and Social Links for Micro-Blog Forwarding in Mobile
Opportunistic Networks”, Pervasive and Mobile Computing 11, pp. 106–131, 2014
[2] O. Khalid, M. U. S. Khan, S. U. Khan, A. Y. Zomaya, “Omnisuggest: A ubiquitous cloud-based
contextaware recommendation system for mobile social networks”, IEEE Transactions on Services Computing
7(3), pp. 401–414, 2014
[3] Y. Yu, X. Huang, X. Zhu, G. Wang, “Camel: A journey group t-pattern mining system based on
instagram trajectory data”, in: Proceedings of IEEE DASFAA, pp. 527–530, 2014
[4]</p>
      <p>M. Balduini, A. Bozzon, E. D. Valle, Y. Huang, G. Houben, “Recommending Venues using Continuous
Predictive Social Media Analytics”, IEEE Internet Computing 18(5), pp. 28–35, 2014.
[5] Alfredo Cuzzocrea, Domenico Saccà, Jeffrey D. Ullman, “Big data: a research agenda”, in: Proceedings
of IDEAS 2013, pp. 198–203, 2013
[6] A. Cuzzocrea, “Analytics over Big Data: Exploring the Convergence of Data Warehousing, OLAP and</p>
      <p>Data-Intensive Cloud Infrastructures”, in: Proceedings of COMPSAC 2013, pp. 481–483, 2013
[7] M. Takaffoli, R. Rabbany, O. R. Zaïane, “Incremental Local Community Identification in Dynamic</p>
      <p>Social Networks”, in: Proceedings of ASONAM, pp. 90–94, 2013
[8] H. Su, K. Zheng, K. Zeng, J. Huang, S.W. Sadiq, N.J. Yuan, X. Zhou, “Making Sense of Trajectory Data:</p>
      <p>A Partition-and-Summarization approach”, in: Proceedings of IEEE ICDE, pp. 963–974, 2015
[9] H. Jiang, L. Huang, J.J. Zhang, Y. Zhang, D.W. Gao, “Spatialtemporal Characterization of
Synchrophasor Measurement Systems – A Big Data Approach for Smart Grid System Situational
Awareness”, in: Proceedings of ACSSC, pp. 750–754, 2014
[10] H. Zhuge, Y. Wilks, “Faceted Search, Social Networking and Interactive Semantics”, World Wide Web
17(4), pp. 589–593, 2014
[11] R.C. Daley, J.B. Dennis, “Virtual Memory, Processes, and Sharing in MULTICS”, Communications of
the ACM 11(5), pp. 306–312, 1968
[12] M.F. Mokbel, W.G. Aref, I. Kamel, “Analysis of Multi-Dimensional Space-Filling Curves”,</p>
      <p>GeoInformatica 7(3), pp. 179–209, 2003
[13] S. Jin, W. Lin, H. Yin, S. Yang, A. Li, B. Deng, “Community Structure Mining in Big Data Social Media</p>
      <p>Networks with MapReduce”, Cluster Computing 18(3), pp. 999–1010, 2015
[14] N.J. Yuan, “Mining Social and Urban Big Data”, in: Proceedings of ACM WWW (Companion Volume),
p. 1103, 2015
[15] A. Cuzzocrea, G. Psaila, M. Toccu, “An Innovative Framework for Effectively and Efficiently
Supporting Big Data Analytics over Geo-Located Mobile Social Media”, in: Proceedings of ACM
IDEAS, 2016
[16] J.A. Cedillo, I. Rivera-Zarate, J.C. Herrera Lozada, M. Olguin-Carbajal, “GPU Programming Paradigm”,</p>
      <p>Journal of Theoretical and Applied Information Technology, pp. 133–138, 2009
[17] G. Bordogna, A. Cuzzocrea, L. Frigerio, G. Psaila, “Clustering Geo-Tagged Tweets for Advanced Big</p>
      <p>Data Analytics”, in: Proceedings of IEEE BigData Congress, 2016
[18] B. Yu, A. Cuzzocrea, D.H. Jeong, S. Maydebura, “On Managing Very Large Sensor-Network Data Using</p>
      <p>Bigtable”, in: Proceedings of CCGRID 2012, pp. 918–922, 2012
[19] A. Cuzzocrea, C.K.S. Leung, R.K. MacKinnon, “Mining Constrained Frequent Itemsets from Distributed</p>
      <p>Uncertain Data”, Future Generation Computer Systems 37 (1), 117–126, 2014
[20] F. Giannotti, M. Nanni, F. Pinelli, D. Pedreschi, “Trajectory Pattern Mining”, in: Proceedings of ACM</p>
      <p>SIGKDD, pp. 330–339, 2007
[21] L.O. Alvares, V. Bogorny, B. Kuijpers, J.A.F. de Macedo, B.Moelans, A. Vaisman, “A Model for</p>
      <p>Enriching Trajectories wth Semantic Geographical Information”, in: Proceedings of ACM GIS, 2007.
[22] S. Spaccapietra, C. Parent, M.L. Daminai, J.A.F. de Macedo, F. Porto, C. Vangenot, “A Conceptual View
on Trajectories”, IEEE Transactions on Knowledge and Data Engineering, 2008
[23] Z. Yan, D. Chakraborty, C. Parent, S. Spaccapietra, K. Aberer, “Semitri: A Framework for Semantic</p>
      <p>Annotation of Hetherogeneous Trajectories”, in: Proceedings of EDBT, 2011
[24] M. Ester, H.P. Kriegel, J. Sander, X. Xu, “A Density-based Algorithm for Discovering Clusters in Large</p>
      <p>Spatial Databases with Noise”, in: Proceedings of ACM KDD, 1996
[25] Z. Yin, L. Cao, J. Han, J. Luo, T.S. Huang, “Diversified Trajectory Pattern Ranking in Geo-Tagged</p>
      <p>Social Media”, in: Proceedings of SDM, pp. 980–991, 2011
[26] C. Comito, D. Falcone, D. Talia, “Mining Popular Travel Routes from Social Network Geo-Tagged
Data”, in: (E. Damiani et al. eds), “Intelligent Interactive Multimedia Systems and Services”, Springer,
pp. 81–95, 2015
[27] A. Cuzzocrea, U. Matrangolo, “Analytical synopses for Approximate Query Answering in OLAP
Environments”, in: Proceedings of DEXA, pp. 359–370, 2004</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>