Information Evolution Modeling and Tracking:
      State-of-Art, Challenges and Opportunities

                    Ekaterina Shabunina and Gabriella Pasi

Università degli Studi di Milano-Bicocca, Dipartimento di Informatica Sistemistica e
                  Comunicazione, Viale Sarca 336, 20126 Milan, Italy
                   {ekaterina.shabunina, pasi}@disco.unimib.it


      Abstract. In the Web 2.0, where everyone is the creator of content,
      information spreads and evolves rapidly through unpredictable paths of
      rebounds between news sources and Social Media. In this context, mod-
      eling, analyzing and tracking the information evolution through time
      offers unprecedented opportunities to diverse research fields, including
      Information Retrieval. In this paper we present a concise analysis of
      the state of the art on Information Evolution on the Web, and we summarize
      the interesting opportunities it offers to Information Retrieval.


1    Introduction
The emergence of the Web 2.0 has granted users the freedom to interact with
other users and to contribute content to the World Wide Web. Consequently, it
has motivated novel research directions, and it has also provided new
perspectives within existing ones. Social Media, which facilitates the creation
and sharing of User Generated Content (UGC), is the most common means of
propagating opinions and ideas on the Web. Social Media thus makes it possible
to analyze the content generated by a vast number of users from different
countries and social backgrounds, with the aim of extracting cultural trends and
ideas that spread in real time. This enables unprecedented opportunities such
as, for example, the early detection of social crises, disasters and
emergencies [2,10]. The identification of social phenomena in UGC can bring
insight into the behavior of users in Social Networks, the patterns of their
interactions, and the structure of the information spread depending on the
phenomenon driving it. Thus, the study of the evolution of information on Social
Media makes it possible to track how information related to specific topics or
events changes over time, and to monitor the evolution of cultural, political
and social ideas. Evidently, modeling, analyzing and tracking the evolution of
information over time on the Web and, particularly, on Social Media promises a
myriad of opportunities and applications in several fields, among them
Information Retrieval and Social Media Analytics.
    In the following sections we present a concise analysis of the state of the
art in modeling, analyzing and tracking the evolution of information on the Web
with respect to its main challenges (Section 2) and the opportunities it offers to
Information Retrieval and related fields (Section 3).


2    State-of-Art and Challenges
Research on Information Propagation (IP) aims to analyze the spread of
information on the Web through time. This issue has been explored by a number of
works in the literature. The majority of the proposed approaches have considered
IP as a network-centered problem [6,13]. The main focus of this line of research
is to study how information spreads in a user network. In parallel, another line
of work on IP aims to study how the content of a piece of information evolves
over time [3,5,8]. The present paper focuses on this latter approach, whose
objective is the quantitative and qualitative evaluation of the evolution of a
stream of information.
    In content-centered IP the common approach is to first identify the core
units of information in a stream of Social Media posts. In the literature these
units of information have frequently been referred to as “memes”
[1,3,4,8,9,11,12], a notion coined by R. Dawkins in 1976 to refer to a unit of
human cultural evolution, analogous to a gene in genetics [7]. The analysis of
information evolution is then performed on the core units identified in the
studied information stream.
    There are two main challenges in content-centered IP. The first is the
formal representation of textual information. In the majority of works in the
literature a unit of information is identified as a short, frequently quoted
phrase together with its slight variations [1,8,11,12]. In [8,12] it is formally
defined as a phrase graph, with nodes representing the phrases and edges
representing the edit distance between the phrases. In [9] a unit of information
is assumed to be represented by several objects, such as “hashtags” and
“mentions” on Twitter, URLs, and the preprocessed text of the tweet itself.
Similarly, in [4] different types of displays of memes from the Yahoo! Meme
platform are considered: short snippets of text, photos, audio, video, tokens in
URLs, etc., all of which are represented as bags of words.
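    The phrase-graph representation described above can be illustrated with a
minimal sketch. It is not the construction used in [8,12]: the word-level
Levenshtein distance, the undirected edges, the `max_distance` threshold and the
sample phrases are all assumptions made here for illustration only.

```python
from itertools import combinations


def word_edit_distance(a, b):
    """Levenshtein distance between two phrases, counted in word edits."""
    x, y = a.lower().split(), b.lower().split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,           # delete wx
                            curr[j - 1] + 1,       # insert wy
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def build_phrase_graph(phrases, max_distance=2):
    """Return {(p, q): distance} for phrase pairs within max_distance.

    The threshold is a hypothetical parameter, not taken from the literature.
    """
    edges = {}
    for p, q in combinations(phrases, 2):
        d = word_edit_distance(p, q)
        if d <= max_distance:
            edges[(p, q)] = d
    return edges


# Illustrative phrases only; nodes are phrases, edge weights are edit distances.
phrases = ["yes we can", "yes we can do it", "no we cannot"]
graph = build_phrase_graph(phrases)
```

    Keeping the distance as an edge weight, rather than only a yes/no link,
leaves open how close two variants must be to count as the same unit of
information.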
    The work in [5] is among the first to address the identification and formal
representation of an information granule in a Social Media stream; such an
information granule is defined by the authors as an ememe, or “electronic meme”,
and is formally represented as a micro-ontology generated from posts on the
blogosphere by means of an OWL schema.
    The study in [3] presents a semi-supervised attempt to represent the memes in
a set of documents as semantic networks. It first extracts n-grams that co-occur
in many documents and then constructs a semantic network in which the nouns and
adjectives of the extracted n-grams are the nodes and the verbs are the edges
between them.
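    A rough sketch of this two-step idea is given below. It is not the procedure
of [3]: the input is assumed to be already part-of-speech tagged, and the tag
names, the `min_docs` threshold and the rule that links the nearest noun or
adjective on each side of a verb are hypothetical choices made for illustration.

```python
from collections import Counter


def frequent_ngrams(tagged_docs, n=3, min_docs=2):
    """Return n-grams of (token, tag) pairs found in at least min_docs documents."""
    doc_freq = Counter()
    for doc in tagged_docs:
        grams = {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}
        doc_freq.update(grams)  # each document counts an n-gram at most once
    return [g for g, c in doc_freq.items() if c >= min_docs]


def semantic_network(ngrams):
    """Build (node, verb, node) edges from the given n-grams.

    Assumed rule: each verb links the nearest NOUN/ADJ to its left with the
    nearest NOUN/ADJ to its right inside the same n-gram.
    """
    edges = set()
    for gram in ngrams:
        for i, (token, tag) in enumerate(gram):
            if tag != "VERB":
                continue
            left = [t for t, g in gram[:i] if g in ("NOUN", "ADJ")]
            right = [t for t, g in gram[i + 1:] if g in ("NOUN", "ADJ")]
            if left and right:
                edges.add((left[-1], token, right[0]))
    return edges


# Toy, hand-tagged documents (illustrative data only).
docs = [
    [("cats", "NOUN"), ("chase", "VERB"), ("mice", "NOUN"), ("daily", "ADV")],
    [("big", "ADJ"), ("cats", "NOUN"), ("chase", "VERB"), ("mice", "NOUN")],
    [("dogs", "NOUN"), ("like", "VERB"), ("bones", "NOUN")],
]
network = semantic_network(frequent_ngrams(docs))
```

    In this toy input only the n-gram shared by the first two documents passes
the document-frequency filter, so the third document contributes no edge.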
    The second challenge in the content-centered study of information diffusion
concerns the methods for measuring, evaluating and analyzing the evolution of
information over time. In [5] a set of operators is proposed, in the context of
Semantic Web ontologies, to measure some useful properties of ememes: fidelity
(the degree to which a meme is accurately reproduced, computed as the fuzzy
matching between the given blog post and the original ememe description),
mutation (the difference between the maximum and the minimum fidelity among the
instances of the ememe), spread (the reproductive activity of a meme, calculated
as the number of instances of the ememe in the searched source), and longevity
(the duration of the ememe’s life span, which is the difference between the
dates of the most recent and the oldest posts that contain an instance of the
ememe). Similarly, in [3] three meme metrics are proposed in the context of
semantic networks: longevity (analogous to longevity in [5]), fecundity
(analogous to spread in [5]) and copy-fidelity (analogous to fidelity in [5]).
One of the most recent and large-scale studies on memes, presented in [1],
analyzes the mutation and replication rates in meme evolution with the Yule
process. The work in [11] studies the changes introduced in quoted texts as they
diffuse through time; the authors examine properties of the variants of quoted
texts and uncover patterns in the rate of appearance of new variants, their
length, the types of changes introduced, their popularity and the type of sites
that replicate them. The temporal patterns of variations in quoted phrases are
studied in [8] by extracting the temporal threads of all blogs and news media
sites that mention the meme phrase, identifying the patterns and time lags of
quoting between them, and analyzing how they change over time across the whole
thread.
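    The four ememe measures attributed to [5] above can be made concrete with a
small sketch. The definitions of mutation, spread and longevity follow the
description in the text; the fidelity function, however, is only a stand-in
(Jaccard overlap of token sets) for the fuzzy matching of [5], and the function
names and sample posts are assumptions.

```python
from datetime import date


def fidelity(post_text, ememe_description):
    """Stand-in for fuzzy matching: Jaccard overlap of the two token sets."""
    a = set(post_text.lower().split())
    b = set(ememe_description.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def ememe_metrics(instances, ememe_description):
    """Compute mutation, spread and longevity for one ememe.

    instances: non-empty list of (publication_date, text) pairs.
    """
    scores = [fidelity(text, ememe_description) for _, text in instances]
    dates = [d for d, _ in instances]
    return {
        "mutation": max(scores) - min(scores),             # max minus min fidelity
        "spread": len(instances),                          # number of instances
        "longevity_days": (max(dates) - min(dates)).days,  # newest minus oldest
    }


# Illustrative posts only.
instances = [
    (date(2016, 1, 1), "information evolves on the web"),
    (date(2016, 1, 4), "information evolves quickly on the web"),
    (date(2016, 1, 11), "information evolves"),
]
metrics = ememe_metrics(instances, "information evolves on the web")
```

    Any similarity function returning a value in [0, 1] could replace
`fidelity` here without changing how the other three measures are derived.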


3    Opportunities

As previously outlined, the possibility to track through time the evolution of
information on the Web can bring large and unprecedented benefits and oppor-
tunities to many research areas such as Information Retrieval, Social Network
Analysis and others. Here we emphasize some promising directions for the ex-
ploitation of content-centered IP in the context of IR.
    User profiling for personalized search is commonly performed by tracking the
user’s activities on the Web to infer a representation of the user’s interests.
More recently, user profiles have been defined based on the content generated by
users on Social Media [14]. User interests are typically dynamic and evolve over
time. User profiling is thus a natural scenario for applying the automatic
analysis of the evolution of an information stream. Regardless of how the user’s
topical interests are gathered, whether from query logs or from the content
generated by the user on Social Media, the methods for tracking the evolution of
information over time can be applied to update the formal representation of the
user model.
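    One simple way to picture such an update is a term-weight profile that is
decayed each time new user content arrives. This is only a sketch under stated
assumptions: the text does not prescribe an update rule, and the bag-of-words
profile, the `decay` factor and the function name are hypothetical.

```python
from collections import Counter


def update_profile(profile, new_text, decay=0.5):
    """Down-weight existing term weights by decay, then add counts from new_text.

    profile: dict mapping term -> weight (an assumed bag-of-words user model).
    """
    updated = {term: weight * decay for term, weight in profile.items()}
    for term, count in Counter(new_text.lower().split()).items():
        updated[term] = updated.get(term, 0.0) + count
    return updated


# Illustrative stream of user posts, oldest first.
profile = {}
for post in ["football match tonight", "football transfer news", "election news"]:
    profile = update_profile(profile, post)
```

    With this rule, terms that stop appearing in the stream fade gradually
instead of being dropped at once, so the profile follows a drift in interests.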
    Additionally, exploiting the evolution over time identified in the users’
topical interests can help to deal with the “filter bubble” problem in
personalized search and personalized recommendation, by diversifying the
retrieved results through the natural and non-evident change in the
information.
    Another interesting application of tracking the evolution of textual informa-
tion is the analysis of queries formulated by the user over a given time interval.
This could bring insights on how users’ interests change in time.


References

 1. L. A. Adamic, T. M. Lento, E. Adar, and P. C. Ng. Information evolution in
    social networks. In Proceedings of the Ninth ACM International Conference on
    Web Search and Data Mining, WSDM ’16, pages 473–482, New York, NY, USA,
    2016. ACM.
 2. M. Avvenuti, S. Cresci, A. Marchetti, C. Meletti, and M. Tesconi. EARS (Earthquake
    Alert and Report System): A real time decision support system for earthquake crisis
    management. In Proceedings of the 20th ACM SIGKDD International Conference
    on Knowledge Discovery and Data Mining, KDD ’14, pages 1749–1758, New York,
    NY, USA, 2014. ACM.
 3. H. Beck-Fernandez and D. F. Nettleton. Identification and extraction of memes
    represented as semantic networks from free text online forums. In MDAI 2013 -
    Modeling Decisions for Artificial Intelligence, Barcelona, 20 November 2013.
 4. F. Bonchi, C. Castillo, and D. Ienco. Meme ranking to maximize posts virality in
    microblogging platforms. Journal of Intelligent Information Systems, 40(2):211–
    239, 2013.
 5. G. Bordogna and G. Pasi. A fuzzy approach to the conceptual identification of
    ememes on the blogosphere. In Fuzzy Systems (FUZZ), 2013 IEEE International
    Conference on, pages 1–8, July 2013.
 6. J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec. Can cascades
    be predicted? In Proceedings of the 23rd International Conference on World Wide
    Web, WWW ’14, pages 925–936, New York, NY, USA, 2014. ACM.
 7. R. Dawkins. The Selfish Gene. Oxford University Press, Oxford, UK, 1976.
 8. J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of
    the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference
    on Knowledge Discovery and Data Mining, KDD ’09, pages 497–506, New York,
    NY, USA, 2009. ACM.
 9. J. Ratkiewicz, M. Conover, M. Meiss, B. Gonçalves, S. Patil, A. Flammini, and
    F. Menczer. Truthy: Mapping the spread of astroturf in microblog streams. In
    Proceedings of the 20th International Conference Companion on World Wide Web,
    WWW ’11, pages 249–252, New York, NY, USA, 2011. ACM.
10. T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: Real-
    time event detection by social sensors. In Proceedings of the 19th International
    Conference on World Wide Web, WWW ’10, pages 851–860, New York, NY, USA,
    2010. ACM.
11. M. P. Simmons, L. A. Adamic, and E. Adar. Memes online: Extracted, subtracted,
    injected, and recollected. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts,
    editors, ICWSM. The AAAI Press, 2011.
12. C. Suen, S. Huang, C. Eksombatchai, R. Sosic, and J. Leskovec. NIFTY: A system
    for large scale information flow tracking and clustering. In Proceedings of the 22nd
    International Conference on World Wide Web, WWW ’13, pages 1237–1248, New
    York, NY, USA, 2013. ACM.
13. J. Yang and J. Leskovec. Modeling information diffusion in implicit networks. In
    Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM
    ’10, pages 599–608, Washington, DC, USA, 2010. IEEE Computer Society.
14. A. Younus, C. O’Riordan, and G. Pasi. A language modeling approach to per-
    sonalized search based on users’ microblog behavior. In Advances in Information
    Retrieval - 36th European Conference on IR Research, ECIR 2014, Amsterdam,
    The Netherlands, April 13-16, 2014. Proceedings, pages 727–732, 2014.