<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A generic framework to perform comprehensive analysis of tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marie-Noelle BESSAGNET</string-name>
          <email>marie-noelle.bessagnet@univ-pau.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratoire LIUPPA IAE Pau-Bayonne Université de Pau et des Pays de l'Adour 64012 PAU</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>80</fpage>
      <lpage>85</lpage>
      <abstract>
<p>Recently, there has been increased interest in the use of social media data as an important source of traffic information and as research data in the field of the Humanities and Social Sciences. Social media are used, for example, to identify Twitter user communities in the context of altmetrics. In this paper, we highlight the potential of social media data analysis for non-computer-science researchers. We present an approach based on a multi-dimensional analysis that combines the thematic, temporal and spatial features of tweets. We detail an experiment using different tools in order to create a “perfectible” toolbox for researchers in HSS (for example, territorial marketing) or for local territory managers. This approach can also be applied in the context of altmetrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Geographical information</kwd>
        <kwd>Data analysis</kwd>
        <kwd>Territorial Policy</kwd>
        <kwd>Altmetrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Twitter, a popular microblogging service, has received much attention recently. This online
social network is used by millions of people around the world to remain socially connected to
their friends, family members, and coworkers through their computers and mobile phones [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Twitter asks one question, “What’s happening?”, and answers must be no longer than 140 characters.
      </p>
      <p>A tweet is often used as a message to friends, colleagues and other people in order to inform them about situations, to share opinions, feelings, etc. about many subjects. We think we can use the information embedded in a tweet in order to identify correlations between more or less structured data.</p>
      <p>
        Counting articles and citations and analyzing citation and co-author graphs have become ways to assess the performance of researchers and institutions. But before using these measures to assess researchers, the research data on which they work are very important. For researchers in the field of the Humanities and Social Sciences, a set of tweets can be seen as research data and requires the different steps known as the research data lifecycle: collecting, recording, processing and analysing, after which the results can be published. Tweet analysis is used in many domains such as real-time event detection, political opinion analysis, improvement of social media strategy, marketing strategy, the environment, altmetrics, etc. We can quote the research work of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which demonstrates that publications from the social sciences, humanities and the medical and life sciences show the highest presence of altmetrics, indicating their potential value and interest for these fields. In another domain, we can quote the research work of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which works on metadata linked to tweets in order to analyse the behavior of people within a metropolitan urban area. In many organizations the job of “community manager” has been created in support of the organization’s strategy. Results from tweet analysis can be presented as statistical figures such as Key Performance Indicators (KPIs).
      </p>
      <p>As computer scientists, we know that it is possible to extract, explore, synthesize and visualize knowledge from the large mass of information available on Twitter. And sometimes, we are competent to do so. But the question is whether we can propose a framework and some tools to non-computer-scientist researchers to enable them to achieve these actions in a simple way. What kind of comprehensive analysis can we perform using a collection of tweets? To what end?</p>
      <p>The goal of this paper is twofold. It first presents a framework devised from a geographical approach based on different theories and tools. Secondly, it illustrates the use of dedicated tools for tweet analysis in order to identify correlations between thematic areas (what?), the location of events inside the tweets or the location of tweet authors (where?), the time of the event (when?), and the identification of sentiment.</p>
      <p>The next section presents the general framework.</p>
    </sec>
    <sec id="sec-2">
      <title>Defining a generic framework to perform comprehensive analysis</title>
      <p>In this section we present our generic framework and a toolbox that we can imagine for processing this kind of research data.</p>
      <sec id="sec-2-1">
        <title>A generic framework</title>
        <p>The flowchart in Figure 1 depicts how tweets can be analyzed to assess the 5W dimensions (who, when, what, where, why), three of which (when, where and what) emphasize the geographical information. We can know who was tweeting, how, about what, what opinion is expressed in the tweet, and what kind of extraction we can do concerning a territory or a publication. As we said, for researchers in the field of the Humanities and Social Sciences (HSS), a set of tweets can be seen as research data. Thanks to this research data (metadata linked to the tweet or the content of the tweet), researchers in the field of HSS can work on unstructured textual documents. For them, tweets are like other unstructured textual documents: poems, epistolary exchanges between artists, ancient documents…</p>
        <p>The main objective of the method is to analyze a collection of tweets from a multidimensional perspective. The general approach describes a process made up of different steps of analysis conducted to build a dashboard adapted to the user profile: (1) preparation and validation of tweets, (2) a first analysis based on user profiles and the information embedded in the different fields of the tweets, (3) a multidimensional analysis of the content of the tweets, which permits the exploration of the collection, and (4) a summarization step thanks to dashboards, maps, timelines and KPIs, which evaluate the success of an organization or of a particular activity (such as projects, programs, products and other initiatives) in which it engages.</p>
        <p>After collecting the tweets, we propose two types of analysis:
1. the first one is based on the information mentioned in the user profiles, where we can analyze, in a basic way, the who, when and where dimensions;
2. the second one deals with the content of the tweets, where more sophisticated analyses can be made along the 5W dimensions.</p>
        <p>Thanks to the data in the user profiles, some statistics can be computed on the collection of tweets, summarizing some of their features: the number of tweets per day; users classified by location or by country; the relevant platforms; the most cited hashtags; the number of retweets… This kind of analysis can be performed without cleaning the tweets, in a simple way. As we said, since the text is essentially informal, many challenges must be taken up in order to perform the different tweet analyses according to the different dimensions.</p>
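        <p>As a minimal sketch, the profile-based statistics above can be computed with simple frequency counts. The field names ("created_at", "user_location") below are illustrative stand-ins for the tweet metadata, not the exact Twitter schema.</p>

```python
from collections import Counter

# Sketch of two of the profile-based statistics listed above:
# number of tweets per day, and users classified by location.
tweets = [
    {"created_at": "2017-01-10", "user_location": "Pau"},
    {"created_at": "2017-01-10", "user_location": "Bayonne"},
    {"created_at": "2017-01-11", "user_location": "Pau"},
]

# Count tweets grouped by their posting day.
per_day = Counter(t["created_at"] for t in tweets)

# Count tweets grouped by the location declared in the user profile.
per_location = Counter(t["user_location"] for t in tweets)
```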
      </sec>
      <sec id="sec-2-2">
        <title>The perfectible Toolbox</title>
        <p>Based on our experiments, we present a toolbox that, although simple, can easily be extended with new and/or improved methods. The Twitter REST search API was used for collecting tweets via GET requests. We implemented a crawler to harvest, automatically and every day between January 2017 and July 2017, the tweets containing at least the following term: #Bearn.</p>
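        <p>As a sketch (not our exact crawler), one daily harvesting iteration against the Twitter REST search API (v1.1 GET search/tweets) can be illustrated as follows; authentication is omitted and only the request construction is shown, with <monospace>since_id</monospace> used to avoid re-fetching tweets already stored.</p>

```python
import urllib.parse

# Public endpoint of the Twitter REST search API (v1.1).
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def build_search_request(query, since_id=None, count=100):
    """Build the GET request URL for one daily harvesting run."""
    params = {"q": query, "count": count, "result_type": "recent"}
    if since_id is not None:
        # Ask only for tweets newer than the last one already stored,
        # so the daily runs do not accumulate duplicates.
        params["since_id"] = since_id
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)

url = build_search_request("#Bearn", since_id=123456789)
```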
      </sec>
      <sec id="sec-2-3">
        <title>First step of preprocessing</title>
        <p>
          According to [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the literature presents some proposals for dealing with such information, namely: a) filtering: removal of URLs, Twitter user names (starting with @) and Twitter special words (“RT”, “via”, ...); b) removal of stop words; c) use of synonyms for the decomposed terms; d) use of part-of-speech (POS) tagging; e) recognition/extraction of entities; f) stemming: a method for reducing a term to its radical by removing endings, affixes and thematic vowels; and g) treatment of the composite terms containing hashtags, whose terms are normally separated according to the capitalization of letters. For example, “#VeryGood” becomes “Very Good” - a blank space is added between the words.
        </p>
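        <p>Two of these steps can be sketched in Python: (a) filtering of URLs, user names and Twitter special words, and (g) splitting composite hashtags on capitalization. The regular expressions below are illustrative, not the exact implementation of [4].</p>

```python
import re

# Twitter special words removed by the filtering step.
SPECIAL = {"rt", "via"}

def filter_tweet(text):
    """Step (a): remove URLs, @user names and Twitter special words."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove user names
    words = [w for w in text.split() if w.lower() not in SPECIAL]
    return " ".join(words)

def split_hashtag(tag):
    """Step (g): insert a blank space before each inner capital letter."""
    return re.sub(r"(?<!^)(?=[A-Z])", " ", tag.lstrip("#"))
```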
        <p>The classification and geocoding steps are responsible for classifying sentiment polarity and
inferring geographical locations mentioned in the text, respectively.</p>
      </sec>
      <sec id="sec-2-4">
        <title>IRAMUTEQ Environment</title>
        <p>
          IRAMUTEQ is an environment dedicated to lexicometry. Thanks to this environment, a researcher can analyze large corpora. We can define lexicometry as the measurement of the frequency with which words occur in a text. IRAMUTEQ provides statistical indicators and graphical representations that make it possible to investigate written texts such as tweets. This kind of analysis complements content analysis [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
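        <p>At its core, lexicometry as defined above reduces to word-frequency tables, which IRAMUTEQ then enriches with statistical indicators and graphs. A minimal illustration on a toy corpus of tweets:</p>

```python
from collections import Counter

# Toy corpus standing in for a collection of tweets.
tweets = [
    "beautiful landscape in Bearn",
    "Bearn gastronomy is beautiful",
]

# Frequency of each word across the whole corpus (case-folded).
freq = Counter(w.lower() for t in tweets for w in t.split())
```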
      </sec>
      <sec id="sec-2-5">
        <title>ELIXA Environment</title>
        <p>
          There are three levels of sentiment analysis: 1) document-level sentiment classification; 2) sentence-level sentiment classification; and 3) aspect-level sentiment analysis. This study carried out a sentiment analysis using EliXa, an aspect-based sentiment analysis platform developed by the Elhuyar Foundation. Given that a sentence may contain multiple opinions, its authors define a window span around a given opinion target (5 words before and 5 words after). Both scores are calculated as the sum of every positive/negative score in the corresponding lexicon divided by the number of words in the sentence [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
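        <p>The window-span scoring can be sketched as follows; this is a toy reconstruction based on the description above, with tiny illustrative lexicons rather than EliXa’s actual ones.</p>

```python
# Tiny stand-in polarity lexicons (EliXa's real lexicons are larger).
POS = {"great", "good", "beautiful"}
NEG = {"bad", "ugly", "awful"}

def polarity_scores(words, target_index, span=5):
    """Score a window of `span` words before/after the opinion target.

    Each score is the number of lexicon hits in the window divided by
    the number of words in the sentence, as described above.
    """
    lo = max(0, target_index - span)
    window = words[lo:target_index + span + 1]
    pos = sum(1 for w in window if w in POS)
    neg = sum(1 for w in window if w in NEG)
    return pos / len(words), neg / len(words)

words = "what a great and beautiful view of the bad weather".split()
pos, neg = polarity_scores(words, words.index("view"))
```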
      </sec>
      <sec id="sec-2-6">
        <title>GATE Environment</title>
        <p>
          We have developed a processing environment with the GATE platform ([
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] ; [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]). This
environment especially embeds the POS tagger Treetagger [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and is also suitable for the French language. Two steps are required: step 1 processes the texts with an ontology (GEONTO) to produce the same texts in which concepts and relationships are annotated. This process begins with the lemmatization of the texts (thanks to the French Tokeniser and TreeTagger-FR-No-Tokenization modules) and continues with the annotation of the terms matching the labels defined in the ontology (Flexible Gazetteer module). Each annotation includes the following details: the original term, the corresponding lemma, the identified label, the name of the concept or relation, and the object type (instance, class or relation). Step 2 applies rules (JAPE Transducer module) derived from regular expressions combining domain concepts, codomain concepts and relations. It relies on the annotations of step 1 and the ontology in order to validate the annotation of the semantic relations of the triplets (relation, domain, codomain).
        </p>
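        <p>The gazetteer-style annotation of step 1 can be sketched as follows; the tiny dictionary below is a hypothetical stand-in for GEONTO, and the record fields mirror the annotation details listed above (original term, concept name, object type).</p>

```python
import re

# Toy stand-in for the GEONTO ontology: label -> (concept, object type).
ONTOLOGY = {
    "pau": ("City", "instance"),
    "river": ("Watercourse", "class"),
}

def annotate(text):
    """Annotate every term of `text` whose label is in the ontology."""
    annotations = []
    for term in re.findall(r"\w+", text.lower()):
        if term in ONTOLOGY:
            concept, obj_type = ONTOLOGY[term]
            annotations.append(
                {"term": term, "concept": concept, "type": obj_type})
    return annotations

anns = annotate("A river near Pau")
```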
        <p>
          Other processes allow us to annotate spatial entities and temporal ones. Details are given in
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-7">
        <title>In the context of altmetrics</title>
        <p>This framework and the tools described are suited to analysing tweets in the context of altmetrics. The two types of analysis we propose along the 5W dimensions are adapted to tweets linked to a set of scientific papers. Instead of waiting months, we can rapidly know the sentiment expressed in a tweet linked to a scientific paper. For example, we can measure the e-reputation of researchers thanks to positive sentiment. A major limitation remains for altmetrics findings: we do not know their meaning in the process of scholarly communication, but we think altmetrics can change the way the scientific contributions of researchers are recognized.</p>
        <p>[Figure 1. Flowchart of the framework: tweets transcribed from Twitter for a given period are split into user profiles and tweet content. A first analysis assesses the Who/Where/When dimensions from the user profiles, while the tweet content is assessed along the “When”, “Why” (some generalities), “About What” (thematic entities), “What” (sentiment analysis) and “About Where” (spatial entities) dimensions. The results feed a dashboard adapted to the user profile, with summarizations such as maps, temporal analyses, word clouds and KPIs.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Panteras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wise</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croitoru</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crooks</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stefanidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Triangulating social multimedia content for event localization using Flickr and Twitter</article-title>
          . Transactions in GIS,
          <volume>19</volume>
          (
          <issue>5</issue>
          ),
          <fpage>694</fpage>
          -
          <lpage>715</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Rodrigo</given-names>
            <surname>Costas</surname>
          </string-name>
          , Zohreh Zahedi, Paul Wouters,
          <article-title>Do altmetrics correlate with citations? Extensive comparison of altmetric indicators with citations from a multidisciplinary perspective</article-title>
          , https://arxiv.org/abs/1401.4321, last access 3/2/2018.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Françoise</given-names>
            <surname>Lucchini</surname>
          </string-name>
          , Bernard Elissalde, Leny Grassot et Julien Baudry, « Paris tweets, données numériques géolocalisées et évènements urbains »,
          <source>Netcom</source>
          ,
          <volume>30</volume>
          (
          <issue>3/4</issue>
          ),
          <year>2016</year>
          ,
          <fpage>207</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Alves</surname>
          </string-name>
          , André Luiz Firmino; de Souza Baptista, Cláudio; Firmino, Anderson Almeida; de Oliveira, Maxwell Guimarães; de Paiva, Anselmo Cardoso:
          <article-title>A Spatial and Temporal Sentiment Analysis Approach Applied to Twitter Microtexts</article-title>
          .
          <source>JIDM</source>
          ,
          <volume>6</volume>
          (
          <year>2015</year>
          ), no. 2, pp.
          <fpage>118</fpage>
          -
          <lpage>129</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Pélissier</surname>
          </string-name>
          , “
          <article-title>Comment préparer l'analyse de textes de sites Web grâce à la lexicométrie et au logiciel Iramuteq ?,” dans Présence numérique des organisations</article-title>
          ,
          <volume>14</volume>
          /04/2016, https://presnumorg.hypotheses.org/187
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I. San</given-names>
            <surname>Vicente</surname>
          </string-name>
          , X. Saralegi, y R. Agerri, «
          <article-title>EliXa: A modular and flexible ABSA platform»</article-title>
          ,
          <source>in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval</source>
          <year>2015</year>
          ), Denver, Colorado,
          <year>2015</year>
          , pp.
          <fpage>748</fpage>
          -
          <lpage>752</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cunningham</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilks</surname>
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>A General Architecture for Text Engineering (GATE) - a new approach to Language Engineering R&amp;D. Rapport technique no CS - 95 - 21</article-title>
          . Department of Computer Science, University of Sheffield.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Bontcheva</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tablan</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Evolving GATE to Meet New Challenges in Language Engineering</article-title>
          .
          <source>Natural Language Engineering</source>
          , vol.
          <volume>10</volume>
          , no 3
          <issue>/4</issue>
          , p.
          <fpage>349</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Schmid</surname>
            <given-names>H.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
          .
          <source>In Proceedings of international conference on new methods in language processing</source>
          , p.
          <fpage>44</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Buscaldi</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bessagnet</surname>
            <given-names>M.-N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Royer</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sallaberry</surname>
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Using the semantics of texts for information retrieval: A concept- and domain relation-based approach</article-title>
          . In B. Catania et al.(Eds.),
          <source>Adbis (2)</source>
          , vol.
          <volume>241</volume>
          , p.
          <fpage>257</fpage>
          -
          <lpage>266</lpage>
          . Springer.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>