<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Querying In uence Graphs over Online Social Networks E ectively and E ciently (Discussion Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vincenzo Moscato</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Picariello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giancarlo Sperl</string-name>
          <email>giancarlo.sperlig@unina.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfredo Cuzzocrea</string-name>
          <email>alfredo.cuzzocrea@dia.units.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIA Dept., University of Trieste</institution>
          ,
          <addr-line>Trieste</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIETI Dept., University of Naples \Federico II"</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>From the Social Networks Analysis (SNA) perspective, Viral Marketing has the aim to maximize the number of people that become aware of a given product/service by identifying a few number of individuals, considered more \in uential" that can be promoted products or services. In this paper we propose a novel concept of in uence graph that can be easily derived by querying social network modelled as a graph. Furthermore, spread di usion model has been de ned as a Combinatorial Multi-Armed Bandit (CMAB) problem for retrieving the most in uential users, without any kind of preliminary knowledge.</p>
      </abstract>
      <kwd-group>
        <kwd>Social Networks Analytics</kwd>
        <kwd>Viral Marketing</kwd>
        <kwd>In uence Di usion and Maximization</kwd>
        <kwd>Online Social Networks Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The use of Online Social Networks or OSNs have rapidly become an essential
part of every day life, and today about 2 billions of users are connected on simple
smart platforms, where single people or communities of people share personal
information, feelings, opinions about their life or on public facts [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ], and, of
course, such environments represent a novel channel for assessing, choosing and
buying goods and services .
      </p>
      <p>Sociologists nd out that individuals are frequently, explicitly or implicitly,
in uenced by their social contacts while deciding whether to adopt an innovation
(such as a political idea, or to buy a new product): the way in which new practices
spread through a population should so mainly depend on the fact that people
in uence each others' behavior.</p>
      <p>Copyright c 2019 for the individual papers by the papers authors. Copying
permitted for private and academic purposes. This volume is published and copyrighted
by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.</p>
      <p>It appears really important for big companies to target \opinion leaders",
because if they are able to in uence users of a community, it will start a large
cascade of further recommendations. One of the main problems here is to
maximize the number of people that become aware of a given product, nding the best
set of users to start the di usion and, consequently, to maximize the spread: this
is, for example, the basic idea behind each social advertisement campaigns, and
it corresponds to solve an In uence Maximization (IM) problem, which has the
goal to nd a small subset of nodes that could maximize the spread of in uence
over a social network graph.</p>
      <p>An application of this strategy in social advertising is surely the so-called
Viral Marketing. Viral Marketing allows to improve the spread awareness about
a given product/service through a sort of \word-of-mouth" advertising. In a
nutshell, a company selects a xed (small) number of individuals, considered more
in uential, and o ers or discounts them the product/services, hoping that they
will be recursively recommended: as in epidemic di usion, the viral e ect starts
and new products can reach a large number of people. In this way, users will
in uence their neighbors/friends triggering a \cascade e ect". Viral marketing
thus capitalizes on the advantages of social networks, and in particular their
high capacity of fast and e ective information di usion. In this paper, we
propose a novel approach for viral marketing based on modeling OSNs as a graph
database. Using regular path queries, we can then query the database and extract
relevant \social paths", by which users in uence the other ones, considering a
number of di erent social interactions. Relevant social paths are thus
opportunely merged into an in uence graph that describes the in uence spread over
a social network, following the Independent Cascade (IC) model. Finally, we
exploit a particular online learning problem, namely the Combinatorial
MultiArmed Bandit (CMAB) framework, to determine the most in uential users. The
chosen approach allows us to automatically estimate the in uence probabilities
during the in uence maximization stage, without any preliminary knowledge.
2</p>
    </sec>
    <sec id="sec-2">
      <title>OSN modeling</title>
      <p>Our idea consists in modeling a generic OSN as a graph database that is an
undirected edge-labeled graph G = (V; L; E) where:
{ V is the set of graph vertices, representing the main entities of a social</p>
      <p>network;
{ L is a set of labels, usually belonging to a given vocabulary and describing
the di erent kinds of relationships that can occur among the social network
entities;
{ E V L V is the set of edges;
{ V and E are abstract data types with a set of properties (expressed using
several attributes that can be di erent depending on the type of nodes/edges).</p>
      <p>To better explain the proposed approach, we consider a case study based
on YELP social network. YELP is an On-line Social Network in which users
can build friendship relationships and submit reviews about business objects,
described by a set of attributes (i.e. business hour, accessibility and parking
and so on), across all industries. Furthermore, an evaluation metrics based on
a 5-point rating is assigned to both entities for estimating their relevance with
respect to their reviews and actions on YELP.</p>
      <p>Considering our model, we can represent Yelp as a graph in which the vertices
set is composed by Users and Business objects while friendship and reviews
represent respectively the relationships between two users or a user and business
object. Furthermore, vertices and edges can be described by a set of attributes.</p>
      <p>Using this general model we have de ned a methodology to analyze users'
behavior, proposing the novel concepts of \social path".</p>
      <p>A social path is a particular edge-labeled path, i.e. a sequence p = (v1; l1; v2; : : : ; lk; vk+1)
with (vi; li; vi+1) 2 E for each i 2 f1; ; kg.</p>
      <p>Given the graph in Figure 1, examples of social paths connecting two users
belonging to YELP are: p1 = (vinni; f riendship; giank) and p2 = (vinni; review; gamberorosso; review; giank).</p>
      <p>From a social paths, we can derive relevant social paths, i.e. special social
paths that may be used in a variety of Social Networks Analytics applications.</p>
      <p>The elements belonging to a relevant social path must match a set of conditions
, de ned on the attributes of nodes and edges belonging to the path.</p>
      <p>For example, in YELP, the paths p1 and p2 can be relevant for an in uence
analysis problem and represent di erent ways by which a user can \in uence"
another one.</p>
      <p>
        We explicitly note that a relevant social path can be expressed by regular
path queries on a graph database [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A regular path query (RPQ) over the set
of edge labels L is expressed as a regular expression over L. 3 In other terms,
RPQs have the form Q(x; y) : (x; R; y), R being a regular expression over the
vocabulary of edge labels.
      </p>
      <p>To extract the set of nodes' pairs connected by relevant social paths, we can
rst exploit regular path queries and then ltering the obtained results on the
base of conditions.</p>
      <p>As an example, the query Q(ui; uj) : (ui; review; review; uj) returns the set
of users pairs that have performed a review on the same business object, and in
particular, it can be represented using the graph in Figure 2.</p>
      <p>Exploiting as further condition = (e2:t e1:t) t ^ (e1:m = e2:m) (m
and t respectively being the mood and timestamp of a given review), we can
select only the set of users pairs in which the user uj has performed a review
immediately after ui with the same mood.
3 In uence analysis: building an In uence Graph from</p>
      <p>
        an OSN
We can express more complex social paths using conjunctive RPQs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As an
example, the query Q(ui; uj) : (ui; review; review; uj) ^ (ui; f riendship; uj)
returns the set of users pairs that have performed a review on the same business
object and are friends.
      </p>
      <p>We de ne in uential paths all the social paths that are relevant for an
inuence analysis problem. The execution of speci c RPQS describing in uential
paths on the database graph (representing a particular social network) allows
to determine a new homogeneous graph (vertices are only users), that we call
\in uence graph", we used to e ectively model the in uence spread over an
OSN.</p>
      <p>More precisely, an In uence Graph is a labeled graph IG = (V^ ; Eb; ) where:
3The answer Q(D) to a RPQ Q over a database D is the set of pairs of nodes
connected in D by a social path traversing a sequence of edges forming a word in the
regular language L(Q) de ned by Q.</p>
      <p>{ V^ is the set of nodes such that each v 2 Vb corresponds to a user u 2 U ;
{ E^ V^ V^ is the set of edge (with no self-loops);
{ : V^ V^ ! [0; 1] is a function that assigns to each edge e = (vi; vj) a label,
representing the probability that user ui can in uence user uj.</p>
      <p>To build an in uence graph, we used a sentiment analysis technique to
associate a prede ned mood to each YELP review whilst we initially decided to
exploit only the in uence graph structure/topology, without considering the
in</p>
      <p>
        uence probabilities. In particular, we adopted a Combinatorial Multi-Armed
Bandit (CMAB) approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that allows us to automatically estimate the in
uence probabilities during the in uence maximization stage. More in detail, the
CMAB approach consists of multiple rounds, in each of them we leverage an IM
method. Considering that we start without the knowledge of in uence
probabilities, the obtained spread will be low compared to the optimal solution and we
refer to such situation as a regret. At each step, we then attempt to reduce the
regret and, at the same time, to improve in uence probabilities estimation using
an exploration-exploitation trade-o .
      </p>
      <p>In particular, the regret after n rounds can be described by the Equation 1:
it practically represents the di erence between spread related to the optimal
seed set S (knowing the right probabilities) and that returned by the CMAB
strategy applied to current seed set Si at iteration i.</p>
      <p>= n
(S )
n
X (Si)</p>
      <p>(1)
i=1</p>
      <p>Indeed, CMAB is based on the following exploration-exploitation trade-o :
{ exploration allows us to improve our knowledge, incrementally learning the</p>
      <p>in uence probabilities over the network;
{ exploitation allows us to achieve the largest spread by exploiting the current</p>
      <p>knowledge.
3.1</p>
      <p>Modeling In uence maximization as CMAB problem
In our model, we consider m arms each one with a random variable Xji 2 [0; 1]
having j as mean, which represents the obtained reward when the arm j is
triggered at the round i. Furthermore, in the CMAB model it is possible to play
a superarm A in each round: in this way all arms in A are triggered. We indicate
with pjA the probability of arm j to be triggered when superarm A is played.</p>
      <p>At each round, we then adopt an approximation oracle that leverages the
current means' distribution ^ = (^1; ^2; :::; ^m) to nd the best (super)arm
to play. After that, the superarm is played and consequently several arms are
triggered. In the end, we observe the rewards obtained from each arm to update
their mean estimation ^j. To this goal, we use the number of times nj;i that an
arm j has been triggered until the round i and the process continues until i = n.</p>
      <p>In other terms, the CMAB relies on the Independent Cascade (IC) di usion
model that mimics the spread of an epidemic disease, evolving along discrete
steps.</p>
      <p>Given an in uence graph IG = (V^ ; Eb; ) , where at each edge (u; v) 2 Eb is
assigned a in uence probability u;v 2 [0; 1], the process starts with an initial
set of active nodes S0 and each node u 2 S0 has a single chance to activate
each outgoing inactive node v: it happens with probability u;v. The node v can
become active in the next step and attempt to in uence each one of its inactive
out-neighbors. In this way the cascade continues until the process converges; the</p>
      <p>nal spread (Sn) represents the number of nodes that are activated starting
from S0 after n iterations. Thus, IM tries to maximize the nal spread using
reduced seed sets.
3.2</p>
      <p>The IM algorithm
The procedure receives as input an in uence graph IG without the knowledge of
in uence probabilities . In the initial step, it is used the vector ^0 that assigns
0 or a low in uence probability to each edge (u; v).</p>
      <p>
        At each round, the algorithm attempts to reduce the regret by implementing
a particular exploration-exploitation trade-o technique (coin ip):
{ the exploitation is performed using as approximation oracle the very simple
and e cient greedy strategy proposed by the authors in [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] that allows to
select a seed set S (with jSj = k) representing the played superarm A (it
requires to solve an IM problem on the graph IG with in uence probability
estimating through the current means' vector ^);
{ the exploration is obtained generating a seed set S in a random way.
      </p>
      <p>When a superarm is played, the di usion process starts and leads to active
several inactive nodes by triggering other arms. At the end of the process, it is
evaluated the reward considering the number of nodes that are activated. This
information is the used to update the means' estimation ^ by the following
Equation:</p>
      <p>Pz
^j = i=1 nj;i (2)</p>
      <p>z
nj;i being the number of activated nodes playing the arm j at step i and z 2 [1; n].</p>
      <p>Thus, we improve the knowledge to minimize the regret in the next step.
4</p>
      <p>System Overview and Implementation Details
Here, we present an overview of the system for viral marketing purposes, whose
main components together with the related work ow are shown in Figure 3.</p>
      <p>The system relies on a multilayer architecture typical of Big Data
infrastructures based on the Apache Hadoop4 technological stack and supports four
main tasks: i) data ingestion, ii) data storing, iii) batch computation, iv) data
visualization.</p>
      <p>During the data ingestion, and using the functionalities provided by an ETL
module, information about users' social interactions are:</p>
      <p>StagingArea
ETL</p>
      <p>KB
GrapOhSBNuilder</p>
      <p>Sentiment
Analyzer
Query Engine
{ periodically extracted from several OSNs (e.g., YELP, TripAdvisor, etc.)</p>
      <p>using the related API;
{ cleaned and transformed in according to a particular log format containing
information on user, timestamp and the performed action with several
attributes (as an example h vinni, 13-05-2018, Gambero Rosso, \the food was
delicious"; : : :i);
{ loaded into a particular Staging Area.</p>
      <p>Concerning implementation details, the ETL module was realized exploiting
facilities provided by the Talend5 tool. In turn, the Staging Area repository was
implemented through the Cassandra6 columnar database.</p>
      <p>The data storing task has the goal of generating the OSN graph database
by means of OSN Graph Builder module, using data from staging area. For each
OSN, the nal graph is then stored into system Knowledge Base (KB). The OSN
Graph Builder module was realized on the top of Spark7 engine (choosing Python
as programming language), while we exploited Neo4j8 for the KB.</p>
      <p>During the batch computation task several activities are performed:
{ textual reviews are processed to infer the related sentiment using Sentiment</p>
      <p>Analyzer module that is saved as edge property within the KB;
{ on the base of a set of RPQs, the KB is queried by a Query Engine module</p>
      <p>to extract the in uence graph;
{ an in uence maximization procedure (based on the CMAB theory) is applied
on the in uence graph to determine the most in uential users (through
In</p>
      <p>uence Maximization module).</p>
      <p>Regarding implementation details, the Sentiment Analyzer module was
implemented on the top of Spark engine leveraging several GATE9 NLP libraries.</p>
      <p>5https://www.talend.com/l
6http://cassandra.apache.org/
7https://spark.apache.org/
8https://neo4j.com/
9https://gate.ac.uk/
GATE detects sentiment related to each sentence in the text in terms of score (a
numeric value for the sentiment), sarcasm (a boolean value to indicate whether
the sentence is sarcastic or not) and polarity (a boolean value that can be
positive or negative). RPQs are expressed in the Cypher Graph Query Language10
and the Query Engine module was realized using Spark. The IG is stored in
a textual raw format directly on HDFS. Eventually, the In uence Maximization
module implements the CMAB IM algorithm in Python leveraging Spark graphs'
management libraries.</p>
      <p>Data visualization task allows to present results of several elaborations (e.g.,
in uence graph, in uential users, etc.) thanks to the functionalities provided by
Data visualization module implemented using the Jupiter11 tool.
5</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this paper we presented a novel approach for viral marketing based on
modeling OSNs as particular graph databases. In particular, we derive the in uence
graph by analyzing in uence paths and using CMAB technique.</p>
      <p>Future works will be devoted to provide more evaluation of the proposed
approach and to compare the proposed methodology with di erent and more
recent in uence maximization approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Albors</surname>
          </string-name>
          , Juan C Ramos, and Jose L Hervas.
          <article-title>New learning network paradigms: Communities of objectives, crowdsourcing, wikis and open source</article-title>
          .
          <source>International Journal of Information Management</source>
          ,
          <volume>28</volume>
          (
          <issue>3</issue>
          ):
          <volume>194</volume>
          {
          <fpage>202</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Flora</given-names>
            <surname>Amato</surname>
          </string-name>
          , Antonio Bosco, Vincenzo Moscato, Antonio Picariello, and
          <string-name>
            <given-names>Giancarlo</given-names>
            <surname>Sperl</surname>
          </string-name>
          .
          <article-title>A novel in uence di usion model based on user generated content in online social networks</article-title>
          .
          <source>In Proceedings of the 6th International Conference on Data Science, Technology and Applications. SCITEPRESS - Science and Technology Publications</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Flora</given-names>
            <surname>Amato</surname>
          </string-name>
          , Vincenzo Moscato, Antonio Picariello, and
          <string-name>
            <given-names>Giancarlo</given-names>
            <surname>Sperl</surname>
          </string-name>
          .
          <article-title>Di usion algorithms in multimedia social networks: a preliminary model</article-title>
          .
          <source>In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</source>
          <year>2017</year>
          , pages
          <fpage>844</fpage>
          {
          <fpage>851</fpage>
          . ACM,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Angela</given-names>
            <surname>Bonifati</surname>
          </string-name>
          , Wim Martens, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Timm</surname>
          </string-name>
          .
          <article-title>An analytical study of large sparql query logs</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <volume>149</volume>
          {
          <fpage>161</fpage>
          ,
          <year>October 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Glen</given-names>
            <surname>Urban</surname>
          </string-name>
          .
          <article-title>Don't just relate-advocate!: a blueprint for pro t in the era of customer power</article-title>
          .
          <source>Pearson Education</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Sharan</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Laks Lakshmanan,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Schmidt</surname>
          </string-name>
          , et al.
          <article-title>In uence maximization with bandits</article-title>
          .
          <source>arXiv preprint arXiv:1503.00024</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>