=Paper=
{{Paper
|id=Vol-2080/paper9
|storemode=property
|title=A Generic Framework to Perform Comprehensive Analysis of Tweets
|pdfUrl=https://ceur-ws.org/Vol-2080/paper9.pdf
|volume=Vol-2080
|authors=Marie-Noelle Bessagnet
|dblpUrl=https://dblp.org/rec/conf/ecir/Bessagnet18
}}
==A Generic Framework to Perform Comprehensive Analysis of Tweets==
<pdf width="1500px">https://ceur-ws.org/Vol-2080/paper9.pdf</pdf>
<pre>
                                           BIR 2018 Workshop on Bibliometric-enhanced Information Retrieval


A generic framework to perform comprehensive analysis
                      of tweets

                               Marie-Noelle BESSAGNET1
                                    1
                                    Laboratoire LIUPPA
                                    IAE Pau-Bayonne
                         Université de Pau et des Pays de l’Adour
                                       64012 PAU
                      marie-noelle.bessagnet@univ-pau.fr


       Abstract. Recently, there has been an increased interest in the use of social me-
       dia data as important traffic information sources and as a data research in the
       field of Human and Social Sciences. Social media are used, for example, to
       identify Twitter user communities in the context of altmetrics. In this paper, we
       highlight the potential use of social media data analysis by non computer sci-
       ence researchers. We present an approach based on a multi-dimensional analy-
       sis which combines the thematic, temporal and spatial features of tweets. We
       detail an experimentation using different tools in order to create the “perfecti-
       ble” toolbox for researchers in HSS (for example territorial marketing) or for
       local territory managers. This approach can be applied in the context of
       altmetrics.

       Keywords: Geographical information, Data analysis, Territorial Policy,
       Almetric.


1      Introduction

Twitter, a popular microblogging service, has received much attention recently. This online
social network is used by millions of people around the world to remain socially connected to
their friends, family members, and coworkers through their computers and mobile phones [1].
Twitter asks one question, “What’s happening?” Answers must be fewer than 140 characters
   A tweet is often used as a message to friends, colleagues and other people in order to inform
about situations, to share opinions, feelings, etc… about many subjects. We think we can use
the information embedded in a tweet in order to identify correlations between more or less
structured data.
   Counting articles and citations, analyzing citations and co-authors graphs have become ways
to assess researchers and institutions performance. But, before using these ways to assess re-
searchers, the research data on which they are working are very important. For researchers in
the field of Human and Social Sciences, a set of tweets can be seen as a research data and re-
quires different steps known as the research data lifecycle: collecting, recording, processing,
analysing and then results can be published. Tweet analysis is used in many domains such as
Real-Time Event Detection, political opinion analysis, improvement of social media strategy,
marketing strategy, environment, almetrics, etc. We can quote the research work of [2] who


                                              80
                                                   BIR 2018 Workshop on Bibliometric-enhanced Information Retrieval


demonstrates that publications from the social sciences, humanities and the medical and life
sciences show the highest presence of altmetrics, indicating their potential value and interest for
these fields. In another domain, we can quote the research work of [3] who works on meta data
linked to the tweets in order to analyse the behavior of people within a metropolitan urban area
In many organizations the job of “community manager” is created in support of the organiza-
tion’s strategy. Results from tweets analysis can be done as statistics figures like Key Perfor-
mance Indicator (KPI)1.
   We know as computer scientists that it is possible to extract, explore, synthesize and visual-
ize knowledge based on a large mass of information available on Twitter. And sometimes, we
are competent to do so. But the question is whether we can propose a framework and some
tools to non computer scientist researchers to enable them to achieve these actions in a simply
way. What kind of comprehensive analysis can we perform using a collection of tweets? To
what end?
   The goal of this paper is twofold. It first presents a framework devised from a geographical
approach based on different theories and tools. Secondly it illustrates the use of dedicated tools
for tweet analysis in order to identify correlations between thematic areas (what?), location of
events inside the tweets or location of authors tweets (where?), time of the event (when?), and
identification of sentiment.
   The next section presents the general framework.


2         Defining a generic framework to perform comprehensive
          analysis

In this section we will present our generic framework and a toolbox we can imagine to process
with this kind of research data.


2.1       A generic framework

The flowchart in Figure 1 depicts how tweets can be analyzed to assess the 5 W dimensions
(who, when, what, where, why), three of which (When, Where and What) emphasize the geo-
graphical information. We can know who was tweeting, how, what about, what is the opinion
expressed in the tweet, what kind of extraction we can do concerning a territory or a publica-
tion. As we said, for researchers in the field of Human and Social Sciences (HSS), a set of
tweets can be seen as a research data. Thanks to this research data (meta data linked to the
tweet or content of the tweet), researchers in the field of HSS can work on non structured textu-
al document. For them, tweets are like others non structured textual documents: poems, episto-
lary exchanges between artists, ancient documents…
   The main objective of the method is to analyze a collection of tweets from a multi-
dimensional perspective. The general approach describes a process made up of different steps
of analysis conducted to build a dashboard adapted to user profile: (1) preparation and valida-
tion of tweets, (2) first analysis based on user profiles and information embedded in the differ-
ent fields of the tweets, (3) the multidimensional information analysis of the content of the
tweet which permits the exploration of the collection of tweets and (4) summarization step


1
    KPIs evaluate the success of an organization or of a particular activity (such as projects, programs,
     products and other initiatives) in which it engages.


                                                      81
                                            BIR 2018 Workshop on Bibliometric-enhanced Information Retrieval


thanks to dashboard, map, timeline, KPI,....
   After collecting the tweets we propose two types of analysis:
        1. The first one is based on information mentioned in user profiles where we can ana-
             lyze, in a basic way, the who, when and where dimensions,
        2. The second one deals with the content of the tweet where more sophisticated anal-
             yses can be made of the 5 W dimensions.
   Thanks to data in the user profiles, some statistics can be processed on the collection of
tweets summarizing some of their features: number of tweets per day; users classified by loca-
tion, by country; the relevant platforms, the most cited hashtags, the number of retweets,…
This kind of analysis can be performed without cleaning the tweets, in a simple way. As we
said, since the text is essentially informal, many challenges must be taken up in order to per-
form the different tweet analyses according the different dimensions.


2.2    The perfectible Toolbox
  Based on our experiments we present a toolbox that although simple, can easily be extended
with new and/or improved methods. The Twitter REST search API was used for collecting
tweets using GET requests. We implemented a crawler to harvest tweets automatically, every
day between January 2017 and July 2017, which could contain at least the following term:
#Bearn.

First step of preprocessing
   According to [4], the literature presents some proposals for dealing with such information,
namely: a) Filtering: removal of URLs, Twitter user names (starting with @) and Twitter spe-
cial words (“RT”, “via”, ...); b) Removal of stop words; c) Use of synonyms for the decom-
posed terms; d) Part of speech tagging usage (POS tagging); e) Recognition/Extraction of enti-
ties; f) Stemming: method for reducing a term to its radical, removing endings, affixes, and
thematic vowels; and g) Treatment of the composite terms containing HashTags. The terms are
normally separated according to the capitalization of letters. For example, “#VeryGood” be-
comes “Very Good” - a blank space is added between the words.
The classification and geocoding steps are responsible for classifying sentiment polarity and
inferring geographical locations mentioned in the text, respectively.

IRAMUTEQ Environment
   IRAMUTEQ is an environment dedicated to lexicometry. Thanks to this environment, a re-
searcher can analyze much of the corpora. We can define lexicometry as the measurement of
the frequency with which words occur in text. It provides statistical indicators and graphical
representations that permit to investigate written text such as tweets. This kind of analysis
completes content analysis [5].


ELIXA Environment
   There are three levels of sentimental analysis –1) document level sentiment classification; 2)
sentence level sentiment classification; and 3) aspect level sentiment analysis. This study has
carried out a sentiment analysis by using EliXa which is an Aspect based sentiment analysis
platform developed by the Elhuyar Foundation. Given that a sentence may contain multiple
opinions, they define a window span around a given opinion target (5 words before and 5 words
after). Both scores are calculated as the sum of every positive/negative score in the correspond-
ing lexicon divided by the number of words in the sentence [6].


                                               82
                                            BIR 2018 Workshop on Bibliometric-enhanced Information Retrieval


GATE Environment
   We have developed a processing environment with the GATE platform ([7] ; [8]). This envi-
ronment especially embeds the POS tagger Treetagger [9] and is also relevant for French lan-
guage. Two steps are required: step 1 performs texts and ontology (GEONTO) to produce the
same texts in which concepts and relationships are annotated. This process begins with the
lemmatization of the texts (thanks to French Tokeniser and TreeTagger-FR-No-Tokenization
modules) and continues with the annotation of terms suited to defined labels in the ontology
(Flexible Gazetteer module). Each annotation includes the following details: the original term,
the corresponding lemma, the identified label, name of the concept or the relation, object type
(instance, class or relation). Step 2 performs rules (JAPE Transducer_00B7B module) derived
from regular expressions combining domain concepts, codomain concepts and relations. It
relies on annotations of step 1 and the ontology in order to validate the annotation of the seman-
tic relations of the triplets (relation, domain, codomain).
   Other processes allow us to annotate spatial entities and temporal ones. Details are given in
[10].


2.3     In the context of altmetrics
   This framework and the tools described are adapted to analyse tweets in the context of
Almetrics. The two types of analysis we propose on the 5W dimensions are adapted to tweets
linked to set of scientific papers. Instead of waiting months, we can rapidly know the sentiment
expressed in a tweet linked to a scientific paper. For example, we can measure the e-reputation
of researchers thanks to positive sentiment. A major limitation is done to altmetrics findings.
We do not know their meaning in the process of scholarly communication but we think
almetrics can change the way to recognize the scientific contributions of researchers.
   .


                                               83
                                                                                                                                                                                          BIR 2018 Workshop on Bibliometric-enhanced Information Retrieval


                                                                                                                             Tweets transcript from Twitter for a period


                                              Users Profiles                                                                                                                                                                    Tweets


Assessing Who/                                                                   Assessing Why (some                              Assessing « About What »                             Assessing What                           Assessing « About Where »
                                 Assessing «When »
 Where/When                                                                          Generalities)                                   (Thematic Entities)                            « Sentiment analysis»                            (Spatial Entities)


                                                                                                         To connect a tweet to a category                                                                                        Geolocation found in the
                            First                                       More Complex Statistical                                                                         To connect a tweet to a positive,
                                                                                                         (suited to a local territory policy                                                                                          content of the
                           Analysis                                            Analysis                                                                                   negative or neutral sentiment
                                                                                                                   for example)                                                                                                          tweets


                     Number of tweets per
                                                 IRAMUTEQ environment


                             day


                                                                                                                                                                ELIXA environment
                                                                                                          GATE environment


                                                                                                                                                                                                             GATE environment
     Command lines


                          Location
                                                                                                                                                                                                                                   Extraction of Spatial
                     Number of « Tweetos »                                                                                       Textual analysis
                                                                                                                                                                                     Sentiment analysis                                  Entities
                        Used platform                                   Textual statistical analysis                               (Ontology
                                                                                                                                                                                         (Lexicons)                                  Textual analysis
                      Most cited Hastags                                                                                           Gazeteers)
                                                                                                                                                                                                                                       (Gazeteers)
                          Followers


                                                                         DashBoard adapted to user profile – Summarization like maps, temporal analysis, words cloud, KPI


                                                                                  Fig. 1. A general framework (about tweets corpora analysis).


                                                                                                                                     84
                                           BIR 2018 Workshop on Bibliometric-enhanced Information Retrieval


References
[1] Panteras, G., Wise, S., Lu, X., Croitoru, A., Crooks, A., & Stefanidis, A. (2015). Triangu-
     lating social multimedia content for event localization using Flickr and Twitter. Transac-
     tions in GIS, 19(5), 694-715.
[2] Rodrigo Costas, Zohreh Zahedi, Paul Wouters, Do altmetrics correlate with citations?
     Extensive comparison of altmetric indicators with citations from a multidisciplinary per-
     spective, https://arxiv.org/abs/1401.4321, last access 3/2/2018.
[3] Françoise Lucchini, Bernard Elissalde, Leny Grassot et Julien Baudry, « Paris tweets,
     données numériques géolocalisées et évènements urbains », Netcom, 30-3/4 | 2016, 207-
     230.
[4] Alves, André Luiz Firmino ; de Souza Baptista, Cláudio ; Firmino, Anderson Almeida ;
     de Oliveira, Maxwell Guimarães ; de Paiva, Anselmo Cardoso: A Spatial and Temporal
     Sentiment Analysis Approach Applied to Twitter Microtexts.. In: JIDM, 6 (2015), Nr. 2,
     S. 118-129
[5] Daniel Pélissier, “Comment préparer l’analyse de textes de sites Web grâce à la
     lexicométrie et au logiciel Iramuteq ?,” dans Présence numérique des organisations,
     14/04/2016, https://presnumorg.hypotheses.org/187
[6] I. San Vicente, X. Saralegi, y R. Agerri, «EliXa: A modular and flexible ABSA platform»,
     in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015),
     Denver, Colorado, 2015, pp. 748-752.
[7] Cunningham H., Gaizauskas R., Wilks Y. (1995). A General Architecture for Text Engi-
     neering (GATE) – a new approach to Language Engineering R&D. Rapport technique no
     CS – 95 – 21. Department of Computer Science, University of Sheffield.
[8] Bontcheva K., Tablan V., Maynard D., Cunningham H. (2004). Evolving GATE to Meet
     New Challenges in Language Engineering. Natural Language Engineering, vol. 10, no 3/4,
     p. 349–373.
[9] Schmid H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceed-
     ings of international conference on new methods in language processing, p. 44-49.
[10] Buscaldi D., Bessagnet M.-N., Royer A., Sallaberry C. (2013). Using the semantics of
     texts for information retrieval: A concept- and domain relation-based approach. In B. Ca-
     tania et al.(Eds.), Adbis (2), vol. 241, p. 257-266. Springer.


                                              85

</pre>