Data Expedition into the Swiss Twitter Corpus: Workshop Results at SwissText 2018

Ralf Grubenmann¹, William Fallouh¹, Christoforos Nalmpantis¹, Mark Cieliebak²
¹ SpinningBytes AG, ² Zurich University of Applied Sciences
rg@spinningbytes.com, william.fallouh@isen.yncrea.fr, christofernal@gmail.com, ciel@zhaw.ch

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

Abstract

Data Expeditions are short, collaborative events in which interdisciplinary teams look for interesting patterns and insights in a dataset. This is a report on the Data Expedition into the Swiss Twitter Corpus hosted by SpinningBytes AG at the SwissText 2018 conference. The aim was to research interesting topics related to Switzerland in the Swiss Twitter Corpus¹. Two teams with a total of 11 participants were given 140'521 Switzerland-related Tweets with relevant metadata and analyzed topics of their choice during the 4-hour workshop. We explain how the data expedition was organized and discuss some of the results and lessons learned.

1 Introduction

More and more data is available to industry and to researchers, which continually opens up new avenues of research. Data expeditions are a popular tool for educational and research purposes that can quickly produce interesting analyses from a dataset (Radchenko and Sakoyan, 2016; Ciociola and Reggi, 2015; Burov et al., 2016). They allow groups of people to quickly try out new ideas and test hypotheses, leading to new results that might not be found in a more traditional research setting.

Since visitors to the SwissText conference come from a wide spectrum of industrial as well as research backgrounds, we decided to go with this format to give participants hands-on experience, an opportunity to exchange ideas with people outside their field, and possibly the chance to discover new topics for future research. Since the focus of SwissText is on Natural Language Processing in Switzerland, giving the participants Tweets from or regarding Switzerland was the natural choice. These Tweets were specifically curated for this workshop, as detailed in Section 2.

2 Swiss Twitter Corpus

The Swiss Twitter Corpus is a collection of over 3 million Tweets related to Switzerland which has been collected since January 2018. Being related to Switzerland, or "Swissness", is defined as either originating in Switzerland, being written by an important Swiss Twitter account or being about one of a number of hand-curated keywords related to a Swiss topic. Additionally, we look at whether the user's profile location is in Switzerland and whether the language of the user is Swiss-German.

For the expedition, a subsample was selected consisting of Tweets with Swiss geocoordinates, Tweets with at least two keywords present and Tweets with a Swissness-Score of at least 3. The Swissness-Score counts how many of the Swissness rules apply to a Tweet; for instance, a Tweet with two relevant keywords and geocoordinates in Switzerland would get a Swissness-Score of 3. This results in a sample of 140'521 Tweets that are highly relevant to Switzerland.

Each Corpus entry contains the Tweet text, the name of the user, the date of the Tweet, the Tweet language², the user's country code, the latitude and longitude (if provided), the keywords found, a sentiment annotation (based on Deriu et al. (2017)), as well as the Swissness-Score and the reason why the Tweet was included in the set.

¹ https://www.swisstwittercorpus.ch/
² As provided by Twitter.

3 Data Expedition

The Data Expedition was structured as follows.
Participants were split into groups of 4-8 people (5 and 6 in our case), with the goal of forming diverse groups with a mix of developers, researchers, designers and storytellers. The participants were then handed the data along with an explanation of the data format and an introduction to the Data Expedition. They then had roughly 3.5 hours to decide on one or more research topics and to analyze and visualize the data and their results. Afterwards, the teams presented their findings to each other. A summary of the findings was presented the following day to the general audience of the conference.

4 Results

This section details the results and findings of the two teams.

4.1 Team 1

The first team focused on finding interesting patterns related to Swiss celebrities, most of whom were already included in the keyword set. They looked into the relative number of Tweets per celebrity to find the most and least popular ones. The most popular celebrity in the dataset was Roger Federer, a famous Swiss tennis player, and the least popular one was Christoph Blocher, a Swiss politician. The ranking can be seen in Figure 1a.

The group then focused on analyzing Tweets about Roger Federer further. First, they looked at the language and geographic location of relevant Tweets, noticing that most Tweets originate in Switzerland, but that Roger Federer is also a popular topic worldwide (see Figure 1b). A majority of the Tweets were written in English, followed by German and French, with almost no Tweets in Italian.

To finish their analysis, they looked into common words occurring together with Roger Federer as well as hashtags related to the topic. Many of the associations found were to be expected, like "Wawrinka", an important opponent of Federer, though some were surprising to the participants, like "Rotterdam", for which they could not find an explanation.

[Figure 1: Visual results of the first team. (a) Popularity of Swiss celebrities measured by number of Tweets. (b) Geographic location of Tweets about Roger Federer.]

4.2 Team 2

The second team decided to create a Twitter-based tourism guide of Switzerland. Specifically, they wanted to recommend top locations for a visitor to Switzerland in the months of February to May. They compared manually curated tourism guides with the Twitter data. To this end, they performed case studies for four different Swiss destinations.

Lake Geneva: Geneva was listed as the most popular tourist destination in multiple guides. The task was hampered by the existence of a town called "Lake Geneva" in the United States of America, which had to be filtered out. The team created a ranking of towns around Lake Geneva by popularity (number of Tweets), as seen in Table 1.

Destination    # of Tweets
Geneva         7536
Lausanne       3379
Sion           1969
Zermatt        909
Verbier        312

Table 1: Popular destinations around Lake Geneva sorted by number of Tweets.

Brugg: They then analyzed mentions of Brugg, a relatively small town in Switzerland, because a number of participants came from the FHNW university situated in Brugg. Due to the small size of Brugg, only 40 relevant Tweets were found in the dataset and no conclusion could be drawn, though participants did find an amusing, sexually explicit Tweet that they shared with the other workshop participants.

Lucerne: Lucerne, another popular tourist location, was mentioned in 5115 Tweets, with 227 mentioning the nearby Titlis mountain, a local tourist attraction. The participants did not find any interesting information regarding this town, though they remarked on the Queen Victoria exhibition taking place there, which was of interest to the British team member.

Lugano: Next, the second team wanted to see if the mountains around the city of Lugano were mentioned in the dataset, since those are purportedly popular tourist locations. Surprisingly, the mountains were only mentioned a total of 14 times, even though Lugano itself was mentioned 2792 times.

To round off their analysis, the team members looked at the distribution of sentiment annotations for mentions of Swiss cities (see Table 2). They could not find any overwhelmingly negative Swiss cities, but noticed that, in general, the Italian- and French-speaking parts of Switzerland appear happier than the German-speaking part.

City        % of positive Tweets
Bern        61.2
Luzern      65.9
Basel       71.4
Zürich      72.6
Lugano      86.7
Geneva      89.8
Lausanne    90.5
Zermatt     94.4

Table 2: Percentage of positive Tweets (positive sentiment larger than negative sentiment) in various Swiss cities.

5 Discussion

We organized and executed a data expedition into Swiss Twitter data with a group of 11 people. The participants were very motivated and interested in the topic at hand and discovered several new and surprising insights from the data. Even though the total time available for the analysis was only about 3.5 hours, the teams quickly settled on a topic to study and produced first results. The workshop itself was praised by several participants and received positive feedback in general, which suggests that data expeditions are a useful and easily introduced tool in education and research.

In the future, it might be useful to let participants choose their role in advance, to ease team formation. Producing general statistics about the data in advance and adding scaffolding code for participants to use might help participants find a suitable topic and speed up development, at the risk of biasing participants towards certain avenues of exploration.

Overall, the expedition was successful and we will likely repeat the format in the future.

Acknowledgments

We would like to thank all the participants in the expedition (in no particular order): Stephen, Khalil, Nathan, Jacky, Stefan, Ela, Alma Karalic, Christoph Sess, Michael Sladoje, Alexandru Dimofte, Matthias Sommer.

We would also like to thank the organizers of the SwissText conference for the opportunity to lead this workshop.

References

AV Burov, AV Baranov, and AV Tagaev. 2016. Data expedition as an effective tool of creating a culture of working with open data of the future state and municipal officials. In Proceedings of the International Conference on Electronic Governance and Open Society: Challenges in Eurasia. ACM, pages 167–170.

Chiara Ciociola and Luigi Reggi. 2015. A scuola di opencoesione: From open data to civic engagement. Open Data as Open Educational Resources, page 26.

Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. 2017. Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification. In WWW 2017 - International World Wide Web Conference. Perth, Australia.

Irina Radchenko and Anna Sakoyan. 2016. On some Russian educational projects in open data and data journalism. In Open Data for Education, Springer, pages 153–165.
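
Illustrative sketch of the Swissness-Score. The scoring and subsample selection described in Section 2 can be made concrete with a short example. The following Python code is only a sketch under stated assumptions, not the corpus pipeline itself: the `Tweet` fields, the bounding box used to approximate "Swiss geocoordinates", and the function names are all hypothetical. Counting each matched keyword as one rule is inferred from the example in Section 2, where two keywords plus Swiss geocoordinates give a score of 3.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Rough bounding box around Switzerland (an assumption for illustration only):
# (lat_min, lat_max, lon_min, lon_max)
SWISS_BBOX = (45.8, 47.9, 5.9, 10.5)


@dataclass
class Tweet:
    """Hypothetical, simplified view of one corpus entry."""
    text: str
    coordinates: Optional[Tuple[float, float]] = None  # (latitude, longitude)
    keywords: List[str] = field(default_factory=list)  # matched curated keywords
    by_important_swiss_account: bool = False
    profile_in_switzerland: bool = False
    user_language_swiss_german: bool = False


def has_swiss_coordinates(tweet: Tweet) -> bool:
    """True if the Tweet carries coordinates inside the rough bounding box."""
    if tweet.coordinates is None:
        return False
    lat, lon = tweet.coordinates
    lat_min, lat_max, lon_min, lon_max = SWISS_BBOX
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


def swissness_score(tweet: Tweet) -> int:
    """Count how many Swissness rules apply (each keyword counts once)."""
    score = len(tweet.keywords)
    score += int(has_swiss_coordinates(tweet))
    score += int(tweet.by_important_swiss_account)
    score += int(tweet.profile_in_switzerland)
    score += int(tweet.user_language_swiss_german)
    return score


def in_expedition_sample(tweet: Tweet) -> bool:
    """Selection as described in Section 2: any one criterion suffices."""
    return (
        has_swiss_coordinates(tweet)
        or len(tweet.keywords) >= 2
        or swissness_score(tweet) >= 3
    )


# Example from Section 2: two keywords plus Swiss geocoordinates.
example = Tweet(
    text="Federer trainiert in Zürich",
    coordinates=(47.37, 8.54),
    keywords=["Federer", "Zürich"],
)
```

With these assumptions, `swissness_score(example)` evaluates to 3, matching the example in Section 2, and `in_expedition_sample(example)` is true. A Tweet with a single keyword and no other signal would be excluded.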