=Paper=
{{Paper
|id=Vol-2226/paper11
|storemode=property
|title=Data Expedition into the Swiss Twitter Corpus
|pdfUrl=https://ceur-ws.org/Vol-2226/paper11.pdf
|volume=Vol-2226
|authors=Ralf Grubenmann,William Fallouh,Christoforos Nalmpantis,Mark Cieliebak
|dblpUrl=https://dblp.org/rec/conf/swisstext/GrubenmannFNC18
}}
==Data Expedition into the Swiss Twitter Corpus==
Data Expedition into the Swiss Twitter Corpus: Workshop Results at SwissText 2018

Ralf Grubenmann (1), William Fallouh (1), Christoforos Nalmpantis (1), Mark Cieliebak (2)

(1) SpinningBytes AG, (2) Zurich University of Applied Sciences

rg@spinningbytes.com, william.fallouh@isen.yncrea.fr, christofernal@gmail.com, ciel@zhaw.ch
In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

===Abstract===
Data Expeditions are short, collaborative events in which interdisciplinary teams look for interesting patterns and insights in a dataset. This is a report on the Data Expedition into the Swiss Twitter Corpus, hosted by SpinningBytes AG at the SwissText 2018 conference. The aim was to research interesting topics related to Switzerland in the Swiss Twitter Corpus (https://www.swisstwittercorpus.ch/). Two teams with a total of 11 participants were given 140’521 Switzerland-related Tweets with relevant metadata and analyzed topics of their choice during the 4-hour workshop. We explain how the data expedition was organized and discuss some of the results and lessons learned.

===1 Introduction===
More and more data is available to industry and to researchers, which continually opens up new avenues of research. Data Expeditions are a popular tool in education and research for exploring such data, and they can quickly produce interesting analyses from a dataset (Radchenko and Sakoyan, 2016; Ciociola and Reggi, 2015; Burov et al., 2016). They allow groups of people to quickly try out new ideas and test hypotheses, leading to new results that might not be found in a more traditional research setting.

Since visitors to the SwissText conference come from a wide spectrum of industrial as well as research backgrounds, we chose this format to give participants hands-on experience, an opportunity to exchange ideas with people outside their field, and possibly the chance to discover new topics for future research. Since the focus of SwissText is on Natural Language Processing in Switzerland, giving the participants Tweets from or regarding Switzerland was the natural choice. These Tweets were specifically curated for this workshop, as detailed in Section 2.

===2 Swiss Twitter Corpus===
The Swiss Twitter Corpus is a collection of over 3 million Tweets related to Switzerland, which has been collected since January 2018. Being related to Switzerland, or "Swissness", is defined as either originating in Switzerland, being written by an important Swiss Twitter account or being about one of a number of hand-curated keywords related to a Swiss topic. Additionally, we look at whether the user's profile location is in Switzerland and whether the language of the user is Swiss-German.

For the expedition, a subsample was drawn by selecting Tweets with Swiss geocoordinates, Tweets with at least two keywords present and Tweets with a Swissness-Score of at least 3. The Swissness-Score counts how many of the Swissness rules apply to a Tweet. For instance, a Tweet with two relevant keywords and geocoordinates in Switzerland would get a Swissness-Score of 3. This results in a sample of 140’521 Tweets that are highly relevant to Switzerland.

Each corpus entry contains the Tweet text, the name of the user, the date of the Tweet, the Tweet language (as provided by Twitter), the user's country code, the latitude and longitude (if provided), the keywords found, a sentiment annotation (based on Deriu et al. (2017)), as well as the Swissness-Score and the reason why the Tweet was included in the set.
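
The selection rules described above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the code used to build the corpus: the `Tweet` class and its field names are hypothetical, every matched keyword is assumed to count as one rule (which is consistent with the example of two keywords plus Swiss geocoordinates giving a score of 3), and the three selection criteria are assumed to be combined as a union.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tweet:
    """Hypothetical corpus entry; field names are assumptions for this sketch."""
    text: str
    in_swiss_geo: bool = False            # geocoordinates lie inside Switzerland
    by_swiss_account: bool = False        # written by an important Swiss account
    profile_in_switzerland: bool = False  # user profile location is in Switzerland
    user_lang_swiss_german: bool = False  # user language is Swiss-German
    keywords: List[str] = field(default_factory=list)  # matched curated keywords


def swissness_score(tweet: Tweet) -> int:
    """Count how many Swissness rules apply (assumed: one point per keyword)."""
    boolean_rules = [
        tweet.in_swiss_geo,
        tweet.by_swiss_account,
        tweet.profile_in_switzerland,
        tweet.user_lang_swiss_german,
    ]
    return sum(boolean_rules) + len(tweet.keywords)


def in_subsample(tweet: Tweet) -> bool:
    """Assumed union of the three selection criteria named in the text."""
    return (
        tweet.in_swiss_geo
        or len(tweet.keywords) >= 2
        or swissness_score(tweet) >= 3
    )


if __name__ == "__main__":
    # Invented example mirroring the one in the text: two keywords + Swiss geo.
    example = Tweet(text="(illustrative text)", in_swiss_geo=True,
                    keywords=["keyword_a", "keyword_b"])
    print(swissness_score(example), in_subsample(example))
```

Under these assumptions the example Tweet from the text scores 3 and is selected by each of the three criteria on its own.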
===3 Data Expedition===
The Data Expedition had the following format. Participants were split into groups of 4-8 people (5 and 6 in our case), with the goal of forming diverse groups that mix developers, researchers, designers and storytellers. The participants were then handed the data along with an explanation of the data format and an introduction to the Data Expedition. They then had roughly 3.5 hours to decide on one or more research topics and to analyze and visualize the data and their results. The teams were then able to present their findings to each other. A summary of the findings was presented the following day to the general audience of the conference.

Figure 1: Visual results of the first team. (a) Popularity of Swiss celebrities measured by number of Tweets. (b) Geographic location of Tweets about Roger Federer.
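
A popularity ranking by number of Tweets, as in Figure 1(a), amounts to counting how many Tweets mention each name. The following is a minimal sketch of that idea, not the team's actual code: the `rank_by_tweet_count` helper, the `text` field name and the case-insensitive substring matching are all assumptions, and the sample entries are invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_by_tweet_count(entries: Iterable[Dict[str, str]],
                        names: List[str]) -> List[Tuple[str, int]]:
    """Rank names by the number of Tweets whose text mentions them.

    Matching is a simple case-insensitive substring test (an assumption);
    each Tweet counts at most once per name.
    """
    counts = Counter({name: 0 for name in names})  # keep names with zero hits
    for entry in entries:
        text = entry["text"].lower()
        for name in names:
            if name.lower() in text:
                counts[name] += 1
    return counts.most_common()  # sorted by count, highest first


if __name__ == "__main__":
    # Invented sample entries, not taken from the corpus.
    sample = [
        {"text": "Roger Federer wins again"},
        {"text": "Federer in Rotterdam"},
        {"text": "Blocher speaks"},
    ]
    print(rank_by_tweet_count(sample, ["Federer", "Blocher", "Wawrinka"]))
    # prints [('Federer', 2), ('Blocher', 1), ('Wawrinka', 0)]
```

Substring matching is the simplest choice but can overcount ambiguous names; restricting matches to the curated keyword field of each corpus entry would be a stricter alternative.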
===4 Results===
This section details the results and findings of the two teams.

====4.1 Team 1====
The first team focused on finding interesting patterns related to Swiss celebrities, most of whom were already included in the keyword set. They looked into the relative number of Tweets per celebrity to find the most and least popular ones. The most popular celebrity in the dataset was Roger Federer, a famous Swiss tennis player, and the least popular one was Christoph Blocher, a Swiss politician. The ranking can be seen in Figure 1.

The group then focused on analyzing Tweets about Roger Federer further. First, they looked at the language and geographic location of relevant Tweets, noticing that most Tweets originate in Switzerland, but that Roger Federer is also a popular topic worldwide (see Figure 1). A majority of the Tweets were written in English, followed by German and French, with almost no Tweets in Italian.

To finish their analysis, they looked into common words occurring together with Roger Federer as well as hashtags related to the topic. Many of the associations found were to be expected, like "Wawrinka", an important opponent of Federer, though some were surprising to the participants, like "Rotterdam", for which they could not find an explanation.

====4.2 Team 2====
The second team decided to create a Twitter-based tourism guide of Switzerland. Specifically, they wanted to recommend top locations for a visitor to Switzerland in the months of February to May. They compared manually curated tourism guides with the Twitter data. To this end, they performed case studies for four different Swiss destinations.

Lake Geneva: Geneva was listed as the most popular tourist destination in multiple guides. The task was hampered by the existence of a town called "Lake Geneva" in the United States of America, which had to be filtered out. The team created a ranking of towns around Lake Geneva by popularity (number of Tweets), as seen in Table 1.

{| class="wikitable"
|+ Table 1: Popular destinations around Lake Geneva sorted by number of Tweets.
! Destination !! # of Tweets
|-
| Geneva || 7536
|-
| Lausanne || 3379
|-
| Sion || 1969
|-
| Zermatt || 909
|-
| Verbier || 312
|}

Brugg: They then analyzed mentions of Brugg, a relatively small town in Switzerland, because a number of participants came from the FHNW university situated in Brugg. Due to the small size of Brugg,
only 40 relevant Tweets were found in the dataset and no conclusion could be drawn, though participants did find an amusing, sexually explicit Tweet that they shared with the other workshop participants.

Lucerne: Lucerne, another popular tourist location, was mentioned in 5115 Tweets, with 227 mentioning the nearby Titlis mountain, a local tourist attraction. The participants did not find any interesting information regarding this town, though they remarked on the Queen Victoria exhibition taking place there, which was of interest to the British team member.

Lugano: Next, the second team wanted to see if the mountains around the city of Lugano were mentioned in the dataset, since those are purportedly popular tourist locations. Surprisingly, the mountains were mentioned only 14 times in total, even though Lugano itself was mentioned 2792 times.

To round off their analysis, the team members looked at the distribution of sentiment annotations for mentions of Swiss cities (see Table 2). They could not find any overwhelmingly negative Swiss cities, but noticed that in general, the Italian- and French-speaking parts of Switzerland appear happier than the German-speaking one.

{| class="wikitable"
|+ Table 2: Percent of positive Tweets (positive sentiment larger than negative sentiment) in various Swiss cities.
! City !! % of positive Tweets
|-
| Bern || 61.2
|-
| Luzern || 65.9
|-
| Basel || 71.4
|-
| Zürich || 72.6
|-
| Lugano || 86.7
|-
| Geneva || 89.8
|-
| Lausanne || 90.5
|-
| Zermatt || 94.4
|}

===5 Discussion===
We organized and executed a data expedition into Swiss Twitter data with a group of 11 people. The participants were very motivated and interested in the topic at hand and discovered several new and surprising insights from the data. Even though the total time available for the analysis was only 3 hours, the teams quickly settled on a topic to study and produced first results. The workshop itself was praised by several participants and received positive feedback in general, which suggests that data expeditions are a useful and easily introduced tool in education and research.

In the future, it might be useful to let participants choose their role in advance, to ease team formation. Producing general statistics about the data in advance and adding scaffolding code for participants to use might help participants find a suitable topic and speed up development, at the risk of biasing participants towards certain avenues of exploration.

Overall, the expedition was successful and we will likely repeat the format in the future.

===Acknowledgments===
We would like to thank all the participants in the expedition (in no particular order): Stephen, Khalil, Nathan, Jacky, Stefan, Ela, Alma Karalic, Christoph Sess, Michael Sladoje, Alexandru Dimofte, Matthias Sommer.

We would also like to thank the organizers of the SwissText conference for the opportunity to lead this workshop.

===References===
AV Burov, AV Baranov, and AV Tagaev. 2016. Data expedition as an effective tool of creating a culture of working with open data of the future state and municipal officials. In Proceedings of the International Conference on Electronic Governance and Open Society: Challenges in Eurasia. ACM, pages 167–170.

Chiara Ciociola and Luigi Reggi. 2015. A scuola di opencoesione: From open data to civic engagement. Open Data as Open Educational Resources, page 26.

Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. 2017. Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification. In WWW 2017 - International World Wide Web Conference. Perth, Australia.

Irina Radchenko and Anna Sakoyan. 2016. On some Russian educational projects in open data and data journalism. In Open Data for Education, Springer, pages 153–165.