Data Expedition into the Swiss Twitter Corpus —
                              Workshop Results at SwissText 2018

Ralf Grubenmann¹, William Fallouh¹, Christoforos Nalmpantis¹, Mark Cieliebak²
¹ SpinningBytes AG, ² Zurich University of Applied Sciences
rg@spinningbytes.com, william.fallouh@isen.yncrea.fr,
christofernal@gmail.com, ciel@zhaw.ch


In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the
3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland,
June 2018

Abstract
Data Expeditions are short, collaborative events in which interdisciplinary
teams look for interesting patterns and insights in a dataset. This is a report
on the Data Expedition into the Swiss Twitter Corpus hosted by SpinningBytes AG
at the SwissText 2018 conference. The aim was to research interesting topics
related to Switzerland in the Swiss Twitter Corpus
(https://www.swisstwittercorpus.ch/). Two teams with a total of 11 participants
were given 140’521 Switzerland-related Tweets with relevant metadata and
analyzed topics of their choice during the 4-hour workshop. We explain how the
data expedition was organized and discuss some of the results and lessons
learned.

1   Introduction
Ever more data is available to industry and to researchers, which continually
opens up new avenues of research. Data Expeditions are a popular tool for
education and research that can quickly produce interesting analyses from a
dataset (Radchenko and Sakoyan, 2016; Ciociola and Reggi, 2015; Burov et al.,
2016). They allow groups of people to quickly try out new ideas and test
hypotheses, leading to new results that might not be found in a more
traditional research setting.
   Since visitors to the SwissText conference come from a wide spectrum of
industrial as well as research backgrounds, we chose this format to give
participants hands-on experience, an opportunity to exchange ideas with people
outside their field, and a chance to discover new topics for future research.
Since the focus of SwissText is on Natural Language Processing in Switzerland,
giving the participants Tweets from or regarding Switzerland was the natural
choice. These Tweets were specifically curated for this workshop, as detailed
in Section 2.

2   Swiss Twitter Corpus
The Swiss Twitter Corpus is a collection of over 3 million Tweets related to
Switzerland that has been collected since January 2018. Being related to
Switzerland, or "Swissness", is defined as either originating in Switzerland,
being written by an important Swiss Twitter account or being about one of a
number of hand-curated keywords related to a Swiss topic. Additionally, we
check whether the user's profile location is in Switzerland and whether the
language of the user is Swiss-German.
   For the expedition, a subsample was drawn by selecting Tweets with Swiss
geocoordinates, Tweets with at least two keywords present and Tweets with a
Swissness-Score of at least 3. The Swissness-Score counts how many of the
Swissness-Rules apply to a Tweet. For instance, a Tweet with two relevant
keywords and geocoordinates in Switzerland would get a Swissness-Score of 3.
This results in a sample of 140’521 Tweets that are highly relevant to
Switzerland.
   Each Corpus entry contains the Tweet text, the name of the user, the date of
the Tweet, the Tweet language (as provided by Twitter), the user's country
code, the latitude and longitude (if provided), the keywords found, a sentiment
annotation (based on Deriu et al. (2017)) as well as the Swissness-Score and
the reason why the Tweet was included in the set.
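
The Swissness-Score and the subsample selection described in Section 2 can be
sketched as follows. This is a minimal illustration, not the code used to build
the corpus: the `Tweet` fields and the function names are hypothetical, and the
sketch assumes that every matched keyword and every other rule that applies
contributes one point, which is consistent with the example in Section 2 (two
keywords plus Swiss geocoordinates giving a score of 3).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tweet:
    """Hypothetical, simplified corpus entry; field names are illustrative."""
    text: str
    keywords: List[str] = field(default_factory=list)  # matched curated keywords
    geo_in_switzerland: bool = False          # Tweet geocoordinates are in Switzerland
    important_swiss_account: bool = False     # written by an important Swiss account
    profile_in_switzerland: bool = False      # user's profile location is in Switzerland
    user_language_swiss_german: bool = False  # user's language is Swiss-German


def swissness_score(tweet: Tweet) -> int:
    """Count the rules that apply (assumption: one point per keyword and per flag)."""
    flags = [
        tweet.geo_in_switzerland,
        tweet.important_swiss_account,
        tweet.profile_in_switzerland,
        tweet.user_language_swiss_german,
    ]
    return len(tweet.keywords) + sum(flags)


def in_expedition_sample(tweet: Tweet, min_keywords: int = 2, min_score: int = 3) -> bool:
    """Selection for the expedition: any one of the three criteria is enough."""
    return (
        tweet.geo_in_switzerland
        or len(tweet.keywords) >= min_keywords
        or swissness_score(tweet) >= min_score
    )


# Two relevant keywords plus Swiss geocoordinates, as in the example in Section 2.
example = Tweet(
    text="Federer wins in Basel",
    keywords=["Federer", "Basel"],
    geo_in_switzerland=True,
)
print(swissness_score(example))       # 3
print(in_expedition_sample(example))  # True
```

Because the three selection criteria are combined with a logical "or", a Tweet
with Swiss geocoordinates is kept even if its score is below 3.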

3   Data Expedition
The Data Expedition used the following format. Participants were split into
groups of 4-8 people (5 and 6 in our case), with the goal of forming diverse
groups that mix developers, researchers, designers and storytellers. The
participants were then handed the data along with an explanation of the data
format and an introduction to the Data Expedition. They then had roughly 3.5
hours to decide on one or more research topics and to analyze and visualize the
data and their results. Afterwards, the teams presented their findings to each
other. A summary of the findings was presented the following day to the general
audience of the conference.

4   Results
This section details the results and findings of the two teams.

4.1 Team 1
The first team focused on finding interesting patterns related to Swiss
celebrities, most of whom were already included in the keyword set. They looked
at the relative number of Tweets per celebrity to find the most and least
popular ones. The most popular celebrity in the dataset was Roger Federer, a
famous Swiss tennis player, and the least popular one was Christoph Blocher, a
Swiss politician. The ranking can be seen in Figure 1(a).
   The group then analyzed the Tweets about Roger Federer further. First, they
looked at the language and geographic location of relevant Tweets, noticing
that most Tweets originate in Switzerland, but that Roger Federer is also a
popular topic worldwide (see Figure 1(b)). A majority of the Tweets were
written in English, followed by German and French, with almost no Tweets in
Italian.
   To finish their analysis, they looked into common words occurring together
with Roger Federer as well as Hashtags related to the topic. Many of the
associations found were to be expected, like "Wawrinka", an important opponent
of Federer, though some were surprising to the participants, like "Rotterdam",
for which they could not find an explanation.

   (a) Popularity of Swiss celebrities measured by number of Tweets.
   (b) Geographic location of Tweets about Roger Federer.
   Figure 1: Visual results of the first team.

4.2 Team 2
The second team decided to create a Twitter-based tourism guide of Switzerland.
Specifically, they wanted to recommend top locations for a visitor to
Switzerland in the months of February to May. They compared manually curated
tourism guides with the Twitter data. To this end, they performed case studies
for four different Swiss destinations.

          Destination   # of Tweets
          Geneva               7536
          Lausanne             3379
          Sion                 1969
          Zermatt               909
          Verbier               312

Table 1: Popular destinations around Lake Geneva sorted by number of Tweets.

   Lake Geneva: Geneva was listed as the most popular tourist destination in
multiple guides. The task was hampered by the existence of a town called "Lake
Geneva" in the United States of America, which had to be filtered out. The team
created a ranking of towns around Lake Geneva by popularity (number of Tweets),
as seen in Table 1.
   Brugg: They then analyzed mentions of Brugg, a relatively small town in
Switzerland, because a number of participants came from the FHNW university
situated in Brugg. Due to the small size of Brugg,


only 40 relevant Tweets were found in the dataset and no conclusion could be
drawn, though participants did find an amusing, sexually explicit Tweet that
they shared with the other workshop participants.
   Lucerne: Lucerne, another popular tourist location, was mentioned in 5115
Tweets, with 227 mentioning the nearby Titlis mountain, a local tourist
attraction. The participants did not find any interesting information regarding
this town, though they remarked on the Queen Victoria exhibition taking place
there, which was of interest to the British team member.
   Lugano: Next, the second team wanted to see if the mountains around the city
of Lugano were mentioned in the dataset, since those are purportedly popular
tourist locations. Surprisingly, the mountains were mentioned only 14 times in
total, even though Lugano itself was mentioned 2792 times.

          City         % of positive Tweets
          Bern                         61.2
          Luzern                       65.9
          Basel                        71.4
          Zürich                       72.6
          Lugano                       86.7
          Geneva                       89.8
          Lausanne                     90.5
          Zermatt                      94.4

Table 2: Percent of positive Tweets (positive sentiment larger than negative
sentiment) in various Swiss cities.

   To round off their analysis, the team members looked at the distribution of
sentiment annotations for mentions of Swiss cities (see Table 2). They could
not find any overwhelmingly negative Swiss cities, but noticed that in general,
the Italian and French parts of Switzerland are happier than the German one.

5   Discussion
We organized and executed a data expedition into Swiss Twitter data with a
group of 11 people. The participants were very motivated and interested in the
topic at hand and discovered several new and surprising insights from the data.
Even though the total time available for the analysis was only 3 hours, the
teams quickly settled on a topic to study and produced first results. The
workshop itself was praised by several participants and received positive
feedback in general, which suggests that data expeditions are a useful and
easily introduced tool in education and research.
   In the future, it might be useful to let participants choose their role in
advance, to ease team formation. Producing general statistics about the data in
advance and adding scaffolding code for participants to use might help
participants find a suitable topic and speed up development, at the risk of
biasing participants towards certain avenues of exploration.
   Overall, the expedition was successful and we will likely repeat the format
in the future.

Acknowledgments
We would like to thank all the participants in the expedition (in no particular
order): Stephen, Khalil, Nathan, Jacky, Stefan, Ela, Alma Karalic, Christoph
Sess, Michael Sladoje, Alexandru Dimofte, Matthias Sommer.
   We would also like to thank the organizers of the SwissText conference for
the opportunity to lead this workshop.

References
AV Burov, AV Baranov, and AV Tagaev. 2016. Data expedition as an effective tool
  of creating a culture of working with open data of the future state and
  municipal officials. In Proceedings of the International Conference on
  Electronic Governance and Open Society: Challenges in Eurasia. ACM, pages
  167–170.
Chiara Ciociola and Luigi Reggi. 2015. A scuola di opencoesione: From open data
  to civic engagement. Open Data as Open Educational Resources, page 26.
Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller,
  Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. 2017. Leveraging Large
  Amounts of Weakly Supervised Data for Multi-Language Sentiment
  Classification. In WWW 2017 - International World Wide Web Conference. Perth,
  Australia.
Irina Radchenko and Anna Sakoyan. 2016. On some Russian educational projects in
  open data and data journalism. In Open Data for Education, Springer, pages
  153–165.

