<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">0747-5632</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/1718487.1718542</article-id>
      <title-group>
        <article-title>Data Collection and Annotation Pipeline for Social Good Projects</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christoph Scheunemann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julian Naumann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Max Eichler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Stowe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Gurevych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science Technical University of Darmstadt https://</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <volume>2017</volume>
      <issue>3</issue>
      <fpage>4</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>Vast amounts of data are generated during crisis events through both formal and informal sources, and this data can be used to make a positive impact in all phases of crisis events. However, collecting and annotating data quickly and effectively in the face of crises is a challenging task. Crises require quick, robust, and efficient annotation to best respond to unfolding events. Data must be accessed and aggregated across different platforms and sources, and annotation tools must be able to utilize this data effectively. This work describes an architecture built for rapid collection and annotation of data from multiple sources which can then be built into machine learning and data analysis models. We extract data from social media via multiple systems for Twitter data collection, as well as building architecture for the collection of news articles from diverse sources. These can then be input into the INCEpTION annotation framework, which has been adapted to allow for easy management of multiple annotators, aiming to improve functionality to facilitate the application of citizen science. This allows us to rapidly prototype new annotation schema across a diverse array of data sources, which can then be deployed for machine learning. As a use case, we explore annotation of COVID-19 related Tweets and news articles for case prediction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Data collection and annotation is a difficult process; this
difficulty is compounded in crisis situations. In order to be
effective, data collection and annotation need to be quick
and comprehensive. This requires analyzing “real-time” data
sources (like social media) as well as broader data sources
(such as published news). This information is
critical to build an understanding of the event in all phases
of crisis, which is in turn necessary to provide adequate
relief to impacted areas, build resilience, provide information
to affected populations, and generally mitigate harm.</p>
      <p>In order to facilitate the rapid collection and analysis of
relevant information, we build an architecture for
collection and annotation of data from multiple sources (Figure
1). First, we explore Twitter, building a system to extract not
only live-streaming Tweets, but also older Tweets. This
allows users to quickly deploy a system that collects Tweets
as they arrive (for instance, during the time immediately
after a disaster event occurs), and then later collect relevant
data from before that point to complement the live stream.
We also implement functionality to collect users’ streams as
well as reply graphs, both of which can increase our
understanding of crisis events.</p>
      <p>We also build architecture for collection of news
articles based on keywords. This collection spans thousands of
sources, allowing for rapid extraction of relevant news
articles, giving a more comprehensive view of crisis events.
While social media can be an effective lens to view crises,
utilizing only one data source necessarily introduces the
biases inherent to certain platforms. Adding additional data
sources, particularly more formal sources, allows for not
only a more thorough analysis of known problems but also
new, interesting contributions combining informal social
data with more formal sources.</p>
      <p>
        We use both of these systems for data collection, and
provide infrastructure for then inputting the collected data to the
INCEpTION platform for annotation. INCEpTION
        <xref ref-type="bibr" rid="ref3">(Eckart
de Castilho et al. 2018)</xref>
        is a powerful, flexible, web-based
annotation platform. We supplement its architecture with
functionality for improved management of multiple annotators
over multiple tasks. This facilitates the use of citizen science
        <xref ref-type="bibr" rid="ref6">(Gura 2013)</xref>
        by allowing project managers to easily manage
and assess a large number of non-expert annotators who can
contribute to relevant projects.
      </p>
      <p>To highlight the utility of this architecture, we present a
case study concerning the COVID-19 pandemic. We collect
relevant data via our Twitter and news crawling
architectures, supply it to INCEpTION, and can then quickly
deliver it to annotators for detailed coding. We use this
output to then make predictions about COVID-19 cases based
not only on previous case numbers, but also on social
media reactions and news responses. This combination of
social media and news events provides a novel viewpoint on
this humanitarian crisis, highlighting the necessity of rapid
data collection and annotation for multiple sources, and
providing a blueprint for applying this architecture for
social good projects. A practical step-by-step walkthrough
of our work, including data collection from Twitter and
news sources, import to INCEpTION, and making use of
workflow improvements, is available at
https://github.com/UKPLab/social-good-data-pipeline.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        There are numerous perspectives on collecting social
media and other data during crisis situations
        <xref ref-type="bibr" rid="ref2 ref7">(Reuter et al.
2017; Imran et al. 2015; Spence, Lachlan, and Rainear 2016;
Castillo 2016)</xref>
        . There are many issues that need to be
accounted for (for a relevant bibliography, see Palen et al.
(2020)). In order to be maximally impactful, data collection
needs to be quick and efficient. In addition, flexibility is
necessary to handle the changing nature of crisis events.
Capturing data from multiple sources is essential to alleviate biases
associated with using only a single type of data. Of course,
there are also numerous ethical considerations that need to
be made, particularly with regard to collection and
distribution of data from populations that may be at risk. Even in
public data such as Twitter, there are issues regarding users’
understanding of their data that researchers need to be aware
of
        <xref ref-type="bibr" rid="ref5">(Fiesler and Proferes 2018)</xref>
        .
      </p>
      <p>The primary aspect we attempt to address is the massive
amount of data created during these events. Keyword
collection is often insufficient: keywords can both overgenerate
(yielding excessive irrelevant data) and undergenerate
(missing important data that lacks a given keyword).
Consider the following example tweets (Stowe et al. 2018b):</p>
      <sec id="sec-2-1">
        <p>1. Rock you like a HURRICANE!!</p>
        <p>2. Just bought a 100 tea-light candles so yes, if we lose
power, my apartment will look like The Bachelor finale</p>
        <p>In collecting data for a possible hurricane event, using the
<italic>hurricane</italic> keyword will yield (1), despite the fact that it is
likely irrelevant. Conversely, given the context of a
hurricane, (2) very likely gives relevant information pertaining to
the pre-crisis phase, exhibiting fear about incoming power
losses and taking preparatory actions. However, it is
unlikely that a basic list of keywords will capture this tweet,
excepting perhaps <italic>power</italic>, which will again greatly increase
the number of false positives.</p>
        <p>For this reason, data is often first collected with keywords,
then machine learning algorithms are applied to better filter
and find more relevant information. Our methodology supports
this pattern: we build tools for quickly extracting relevant
data based on keywords. This data is then sent to an
annotation pipeline where fine-grained tags can be applied by
trained or volunteer workers. Then, using this labelled data,
we can build supervised machine learning models for better
data processing and analysis.</p>
        <p>
          There are a variety of related approaches for data
collection. Perhaps most relevant is the Artificial Intelligence for
Digital Response (AIDR), a platform for collecting and
classifying social media messages during disasters
          <xref ref-type="bibr" rid="ref8">(Imran et al.
2014)</xref>
          . AIDR allows users to define search queries and
collect Twitter data, which can then be further filtered using
user-defined machine learning systems.
        </p>
        <p>Another relevant system is the Social Data Collection
(SDC) of Reuter et al. (2017). This architecture allows for
search and analysis of social media data through a
user-friendly graphical interface, allowing non-experts to quickly
build and analyze relevant data sets.</p>
        <p>
          The architecture of
          <xref ref-type="bibr" rid="ref1">Anderson et al. (2015)</xref>
          also provides a
solution for Twitter collection in emergency situations. Their
system is designed to scale, incorporating a variety of
different technologies to process, store, and analyze massive
incoming data streams via the Twitter firehose. Their
architecture is ideally suited to processing the extreme volume of
data available on Twitter, but requires substantial setup and
expertise, as well as relying on the official data stream from
Twitter. Our methodology is orthogonal in that it is built for
rapid, lightweight data collection, requiring little technical
expertise.
        </p>
        <p>Our framework differs from others dealing with social media in
a number of critical areas. First, we collect both informal
social media data (via Twitter) and more formal news sources.
This allows for a broader understanding of events. Second,
our system retrieves both real-time and historic data, making
it practical for all phases of crises. Finally, our architecture
is built to seamlessly pass data to an annotation framework
built to support citizen-science-based annotation. This
allows for the rapid collection and annotation of diverse data,
which can then be leveraged for machine learning purposes.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Twitter Collection</title>
      <p>Our Twitter collection focuses on four different ways of
scraping Twitter data, while relying on two different
platforms for obtaining the data. The first method of obtaining
data uses the official Twitter API, which we
access through the Tweepy library.<sup>1</sup> The second method relies
on the NASTY scraper, which accesses the Twitter website
by simulating HTTP requests and processing the resulting
HTML.<sup>2</sup> See Figure 2 for an overview of the different ways
of obtaining data.</p>
      <p><sup>1</sup> https://www.tweepy.org</p>
      <sec id="sec-3-1">
        <sec id="sec-3-1-1">
          <title>Live Streaming API</title>
          <p>The first way of obtaining Tweets is by using the
streaming feature of the Twitter API. First, the researcher supplies
keywords which they are interested in. The scraper then
accesses the live stream of globally tweeted messages and
filters it so that every Tweet contains at least one of the
keywords. Each matching Tweet is then passed on for further
processing, such as metadata stripping. Through this method we
are able to collect roughly 4,100,000 Tweets a day for our
case study (described below). It utilizes the unpaid Twitter
API, allowing easy access for researchers. It represents the
most efficient way to gather data quickly. The three other use
cases our scraper provides are supplementary to the
streaming of live data.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Historical Scraping</title>
          <p>The historical feature provides a way to gather data from
the past. This is useful when certain dates from the past
become interesting retrospectively, e.g. how populations
reacted at the very beginning of a pandemic before researchers
began their analysis. Similarly to the streaming feature, the
researcher provides keywords which are then used in the
request to the Twitter website. As this feature relies on the
NASTY scraper instead of the official Twitter API, it is
slower than the streaming feature and provides less data
overall. In our tests, the historical feature was able to find
up to roughly 710,000 Tweets for a specific search while the
streaming feature was able to find 4,100,000 Tweets for the
same time frame. 73% of the 710,000 historical Tweets were
duplicates of the live stream data while the other 27% were
distinct from the streaming data, indicating the potential
usefulness of combining both methods. The two collections are
merged in a way that avoids duplicates, as explained further below.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>User Collection</title>
          <p>The user-collection feature provides researchers with an
easy way of obtaining Tweets from specific users. This can
be useful in cases where specific Twitter accounts provide
useful information on a continual basis or where users of
interest can be followed over time to observe
changes in behavior. Previous research has shown the
difficulty of analyzing tweets in isolation (Stowe et al. 2018a),
and user streams allow us to better understand tweets within
context.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p><sup>2</sup> https://github.com/lschmelzeisen/nasty</p>
        <sec id="sec-3-2-1">
          <title>Reply Threads</title>
          <p>Reply thread collection is the last supplemental method the
scraper provides. Given a Tweet ID it collects all the replies
in a tree-like fashion. This supplements the user collection,
and is also useful to researchers for examining Twitter
discourses which are more in-context than just one-off
messages.</p>
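          <p>As an illustration of collecting replies in a tree-like fashion, the following minimal Python sketch walks a reply tree breadth-first from a root Tweet ID. It is not the scraper's actual implementation: the fetch_replies callable is a hypothetical stand-in for whatever returns the direct replies to a Tweet, and the toy IDs are invented for the example.</p>
          <preformat><![CDATA[
```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def collect_reply_tree(
    root_id: str,
    fetch_replies: Callable[[str], Iterable[str]],
) -> Dict[str, List[str]]:
    """Return a mapping from each Tweet ID to the IDs of its direct replies."""
    tree: Dict[str, List[str]] = {}
    queue = deque([root_id])
    while queue:
        tweet_id = queue.popleft()
        if tweet_id in tree:  # already expanded; guards against repeats
            continue
        children = list(fetch_replies(tweet_id))
        tree[tweet_id] = children
        queue.extend(children)
    return tree


# Toy data standing in for a real reply lookup (hypothetical IDs).
TOY_REPLIES = {"1": ["2", "3"], "2": ["4"]}
toy_tree = collect_reply_tree("1", lambda tid: TOY_REPLIES.get(tid, []))
```
]]></preformat>
          <p>Storing the result as a parent-to-children mapping keeps the thread structure intact, so a conversation can later be read in context rather than as isolated messages.</p>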
        </sec>
        <sec id="sec-3-2-2">
          <title>Post-Processing</title>
          <p>In case of topics which are not bound to one region but are
rather trans-national, we provide a naive filtering
implementation which provides the possibility to filter by language
after the Tweets are collected. This implementation relies on
the language meta-tag provided by Twitter, which is not
always correct.</p>
          <p>As all four methods potentially gather overlapping data,
efficient and scalable duplicate detection was implemented
in order to build data sets of good quality.</p>
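          <p>One simple way to picture duplicate detection across collection methods is to key every Tweet by its ID and keep the first copy seen. The sketch below assumes each Tweet is a dictionary with an "id" field; the function name and the sample records are illustrative and are not taken from our code base.</p>
          <preformat><![CDATA[
```python
from typing import Dict, Iterable, List


def merge_without_duplicates(*sources: Iterable[dict]) -> List[dict]:
    """Merge Tweet collections, keeping the first copy of each Tweet ID."""
    seen: Dict[str, dict] = {}
    for source in sources:
        for tweet in source:
            # setdefault leaves an existing entry untouched, so earlier
            # sources take precedence over later ones.
            seen.setdefault(tweet["id"], tweet)
    return list(seen.values())


# Invented sample records: "b" appears in both collections.
stream = [{"id": "a", "text": "from stream"}, {"id": "b", "text": "from stream"}]
historical = [{"id": "b", "text": "from search"}, {"id": "c", "text": "from search"}]
merged = merge_without_duplicates(stream, historical)
```
]]></preformat>
          <p>A dictionary lookup per Tweet keeps the merge linear in the number of Tweets, which matters at the scale of millions of Tweets per day.</p>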
          <p>In the case of disaster studies which unfold over variable
time spans, efficient data management is important to ensure
scalable processing of the collected data. As crisis events
unfold, the needs of researchers may change, and thus we
need flexibility in terms of data collection. Therefore, we
included a filtering feature for Tweet dehydration, whereby
researchers can decide which metadata of a Tweet is
interesting for their research and which can be filtered out after
obtaining the Tweets.</p>
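          <p>The two post-collection filters described in this section, filtering by language and keeping only selected metadata, can be sketched in a few lines of Python. The field names ("id", "text", "lang", "source") are assumptions about the stored Tweet records, and both helper functions are hypothetical.</p>
          <preformat><![CDATA[
```python
from typing import Iterable, List


def filter_by_language(tweets: Iterable[dict], lang: str) -> List[dict]:
    """Keep Tweets whose language tag equals the requested code."""
    return [tweet for tweet in tweets if tweet.get("lang") == lang]


def keep_fields(tweet: dict, fields: Iterable[str]) -> dict:
    """Return a copy of the Tweet restricted to the requested metadata fields."""
    return {name: tweet[name] for name in fields if name in tweet}


# Invented sample records.
raw_tweets = [
    {"id": "1", "text": "stay safe", "lang": "en", "source": "web"},
    {"id": "2", "text": "bleibt gesund", "lang": "de", "source": "app"},
]
english = filter_by_language(raw_tweets, "en")
slim = [keep_fields(tweet, ["id", "text", "lang"]) for tweet in english]
```
]]></preformat>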
          <p>Finally, we provide an easy way to export the data set
from the default, universally-accessible format to the UIMA
XML CAS format which is required for use in INCEpTION,
in order to facilitate the pipeline from data collection to
annotation. A separate publishing feature strips all metadata
from the Tweets, as required by the Twitter terms of
service, so that the data set can be published.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Legality of Collecting Data From Twitter</title>
          <p>As we are using the NASTY scraper to scrape Twitter data,
users of our application are potentially in violation of
Twitter’s terms of service, as scraping is expressly prohibited and only
allowed for certain bots (e.g. Googlebot). Morally, it is
questionable if citizens can be prohibited from freely accessing
information in any way they deem appropriate from
platforms which are used by politicians around the world.</p>
          <p>Legally, in many jurisdictions it is still perfectly legal to
use our tool, and by extension the NASTY scraper, for
the following reasons:</p>
          <p>1. In Germany, up to 75% of any publicly accessible data
may be copied for academic purposes.<sup>3</sup></p>
          <p>2. In the United States of America, some courts have
affirmed the right to scrape publicly available information.<sup>4</sup></p>
          <p>3. Depending on the jurisdiction, it is unclear whether
Twitter’s terms of service extend to users of our scraper, as
users never agreed to the terms and could therefore not be
bound by their conditions.</p>
          <p><sup>3</sup> https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3491192</p>
          <p>For further discussion on this topic, see the documentation in
the NASTY scraper repository.<sup>5</sup></p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>News Collection</title>
      <p>Our news collection uses two data sources, (1) the GDELT
Project<sup>6</sup> and (2) News API,<sup>7</sup> both of which provide access to
a diverse set of worldwide news outlets. GDELT provides
access to a set of near real-time news articles and blogs for
the last three months, while News API focuses on news outlets.
Because we provide data from both sources, researchers can
easily compare data quality and adjust to different use cases,
e.g. through the inclusion of blogs or a sole focus on news.
While these two APIs have significant built-in functionality
regarding search, they only provide researchers with URLs
and some metadata. This leaves researchers with the task
of scraping and parsing websites manually for data
collection. We implemented an end-to-end data pipeline to crawl,
scrape, and parse news sites, so researchers can
focus on experiments instead of data collection.</p>
      <p>GDELT and News API both provide URLs plus additional
metadata matching a search query. A variety of filters
including language, country,<sup>8</sup> domain, and date range can be
applied to each search query. Results can be further refined
by using “and” / “or” syntax to find a subset of results, e.g.
COVID-19 AND (lockdown OR masks).</p>
      <p><sup>4</sup> http://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf</p>
      <p><sup>5</sup> https://github.com/lschmelzeisen/nasty#legal-and-moral-considerations</p>
      <p><sup>6</sup> https://www.gdeltproject.org</p>
      <p><sup>7</sup> https://newsapi.org</p>
      <p><sup>8</sup> Country selection is only available on GDELT data.</p>
      <p>
        We developed a data collection pipeline visualized in
Figure 3 which passes search query and filters to GDELT or
News API and scrapes each returned URL for its raw HTML
data. To make the data more usable we apply boilerplate
detection
        <xref ref-type="bibr" rid="ref10">(Kohlschütter, Fankhauser, and Nejdl 2010)</xref>
        to
extract the article title and main content. Boilerplate detection
looks at the number of words and the link density to
distinguish text from navigation, advertisement, and other
sections of a website. While long(-ish) text usually features a
high number of words with a low link density, the opposite
is true for boilerplate which can then be removed. To further
simplify the usage of our data we tokenize each article on
a sentence level, and extract metadata like title, description,
timestamp, domain, and language.
      </p>
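      <p>The intuition behind boilerplate detection can be illustrated with a small heuristic over the two features named above, word count and link density. This is a simplified sketch and not the published classifier: the TextBlock structure and the thresholds of 15 words and 0.3 link density are assumptions chosen only for the example.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str          # visible text of the block
    linked_words: int  # how many of its words sit inside hyperlinks


def link_density(block: TextBlock) -> float:
    """Fraction of the block's words that are part of a link."""
    words = len(block.text.split())
    return block.linked_words / words if words else 0.0


def is_content(
    block: TextBlock,
    min_words: int = 15,
    max_link_density: float = 0.3,
) -> bool:
    """Heuristic: long blocks with few links are treated as article text."""
    words = len(block.text.split())
    return words >= min_words and link_density(block) <= max_link_density


# Invented examples: a navigation bar and a sentence of article text.
nav = TextBlock("Home News Sport Weather", linked_words=4)
body = TextBlock(
    "Officials announced on Monday that schools in the region will reopen "
    "next week after case numbers fell for the third week in a row.",
    linked_words=0,
)
```
]]></preformat>
      <p>Here the navigation block is short and consists entirely of links, so it would be discarded, while the longer link-free sentence would be kept as main content.</p>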
      <p>We provide JSON data via multiple REST API endpoints
which vastly reduces the setup time for data collection and
gives easier access to researchers from non-technical
backgrounds. It is as simple as determining a search query plus
optional filters and sending an HTTP request. In Python, a full
data collection process can be set up in fewer than five lines of
code.</p>
      <p>Currently we support English, German, and Italian news
articles for both data sources, though this could eventually
be extended to the 65 languages supported
by GDELT and the 14 supported by News API.</p>
      <p>Due to a focus on concurrent data processing our pipeline
returns results in seconds which allows researchers to
quickly try multiple search queries to gather a fitting data
set for their needs. This is especially critical during a
crisis where time is of the essence and circumstances might
change quickly. If one searches for a hundred news
articles, we create a separate thread to scrape and parse each
of them. This is important since we are often faced with
slow servers, timeouts, and sometimes sites that block our
requests because of their noncompliance with GDPR laws.
Since we frequently encounter websites that are
inaccessible, we rarely retrieve all requested results. We mitigate this
by adding a small buffer of additional sites which fill the gap
if a process fails. This allows us to return as many news
articles as possible without sacrificing performance. With our
data pipeline researchers can rapidly try new search queries
and evaluate how these changes affect the rest of their work.</p>
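      <p>The combination of one worker per URL and a small buffer of extra candidates can be sketched with Python's standard concurrent.futures module. This is an illustrative sketch, not our pipeline code: scrape_with_buffer, fake_fetch, the buffer size, and the example.org URLs are all hypothetical, and the fetch function simulates failures instead of performing real requests.</p>
      <preformat><![CDATA[
```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional


def scrape_with_buffer(
    urls: List[str],
    wanted: int,
    fetch: Callable[[str], Optional[str]],
    buffer: int = 5,
) -> List[str]:
    """Fetch wanted + buffer URLs in parallel; return at most `wanted` pages."""
    candidates = urls[: wanted + buffer]
    if not candidates:
        return []
    pages: List[str] = []
    # One worker thread per candidate URL, as described in the text.
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = [pool.submit(fetch, url) for url in candidates]
        for future in as_completed(futures):
            try:
                page = future.result()
            except Exception:
                continue  # timeouts, blocked requests, slow servers
            if page is not None:
                pages.append(page)
    return pages[:wanted]


def fake_fetch(url: str) -> Optional[str]:
    """Stand-in for a real HTTP request; every third URL 'fails'."""
    index = int(url.rsplit("/", 1)[1])
    if index % 3 == 0:
        raise TimeoutError(url)
    return "page " + str(index)


demo_urls = ["https://example.org/" + str(i) for i in range(12)]
demo_pages = scrape_with_buffer(demo_urls, wanted=4, fetch=fake_fetch, buffer=3)
```
]]></preformat>
      <p>In this toy run, seven candidates are fetched for four requested pages, three of them fail, and the buffer fills the gap. The order of the returned pages depends on which threads finish first.</p>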
      <p>In addition to raw data collection, the pipeline
performance allows for the development of end user applications.
Our data pipeline has been employed for a near real-time
argument mining web application, showing its practicality for
collection of news data.<sup>9</sup></p>
    </sec>
    <sec id="sec-5">
      <title>INCEpTION</title>
      <p>
        We extended INCEpTION
        <xref ref-type="bibr" rid="ref3">(Eckart de Castilho et al. 2018)</xref>
        , a
fully functional tool to annotate data from different possible
data sources. INCEpTION is an online annotation platform
that incorporates many related tasks into a joint web-based
platform.<sup>10</sup> In this section we explain what INCEpTION is
and how an annotator can annotate data. Furthermore, we
give an introduction to how we improved usability and
functionality to support citizen science. We recommend
following our walkthrough, which is linked in the introduction,
as it explains in detail how INCEpTION
can be set up and how our features can be used; they are
currently experimental and must therefore be enabled
manually.
      </p>
      <p><sup>9</sup> http://asg.ukp.informatik.tu-darmstadt.de</p>
      <p><sup>10</sup> https://www.informatik.tu-darmstadt.de/ukp/research_6/current_projects/inception/index.en.jsp</p>
      <sec id="sec-5-1">
        <sec id="sec-5-1-1">
          <title>Project structure</title>
          <p>A project is the baseline for any annotation workflow in
INCEpTION. All of the documents must be available in that
specific project in order to be annotated. A project may
have many annotators who can work simultaneously on any
of the documents.</p>
          <p>In order to add annotators to a project, every annotator
requires an account on the corresponding server. Annotators
only have the ability to annotate the documents they are
provided with. While these accounts currently need to be
manually created, part of our ongoing work is the incorporation
of volunteer account creation, allowing non-experts to
participate in annotation projects.</p>
          <p>Data can be imported to INCEpTION directly using files
in the UIMA XML CAS format. This format is output by
both our Twitter and news crawling tools, simplifying the
processing pipeline. Annotation is done document by
document. When an annotator is finished with a specific
document, they must mark it as “finished” to continue with the
next one until all documents are processed.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>Workload</title>
          <p>The new workload manager seen in Figure 4 is our primary
addition to INCEpTION. Our goal is to make INCEpTION
more flexible regarding the monitoring of the documents
within a project and thereby more suitable for
citizen science. We want to give project managers not only more
control over the documents and the annotation workflow, but
also a better and faster overview of their projects, allowing
them to better manage a greater number of volunteer
annotators.</p>
          <p>Our new workload page can be accessed only by a project
manager and mainly consists of a substantial but easy-to-understand
table containing all the data a project manager needs
about their documents. We added a comprehensive set
of filters and made the table sortable to increase its
readability. We implemented filtering for the documents a specific
annotator is working on, as well as for documents that have
not been annotated at all. These documents can then
either be removed from the project or assigned manually to annotators
to speed up progress.</p>
          <p>Furthermore, the project manager is also able to get
detailed information on how each of their annotators is
progressing with their work, e.g. by seeing exactly
how many of their assigned documents are already finished
and how many are still left to be
done.</p>
          <p>Our new workload system also enables a new workflow
in the annotation process. Now, annotators can be
automatically assigned to documents rather than making their own
decisions. This prevents the case of having one specific
document annotated too often, whereas others are not annotated
at all. In addition, we decided that the annotation process
works better if documents are finished linearly.
Therefore, we disabled the ability for annotators to switch
between their assigned documents. Instead, they must first
finish the document they are currently working on before
getting to the next one.</p>
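          <p>The idea of automatic assignment can be illustrated with a small Python sketch that always hands an annotator the least-annotated document they have not yet worked on. This is not INCEpTION's actual implementation: the function, the cap of two annotators per document, and the sample data are assumptions made for the example.</p>
          <preformat><![CDATA[
```python
from typing import Dict, Optional, Set


def next_document(
    annotator: str,
    annotators_per_doc: Dict[str, Set[str]],
    max_annotators: int = 2,
) -> Optional[str]:
    """Pick the least-annotated document this annotator has not yet worked on."""
    open_docs = [
        (len(assigned), doc)
        for doc, assigned in annotators_per_doc.items()
        if annotator not in assigned and len(assigned) < max_annotators
    ]
    if not open_docs:
        return None  # nothing left for this annotator
    _, doc = min(open_docs)  # fewest annotators first, ties broken by name
    annotators_per_doc[doc].add(annotator)
    return doc


# Invented project state: doc1 is full, doc2 has one annotator, doc3 has none.
assignments = {"doc1": {"ann_a", "ann_b"}, "doc2": {"ann_a"}, "doc3": set()}
first_pick = next_document("ann_c", assignments)
second_pick = next_document("ann_c", assignments)
```
]]></preformat>
          <p>Because the untouched document is offered first and the full document is never offered, no document is annotated too often while others are left out.</p>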
          <p>Another area we have focused on is performance. The
larger the project, the bigger the difference in load times.
Even for very small projects with only 20 documents, our
workload page is three times faster than the old monitoring
page (40 ms compared to 120 ms). This was mainly achieved
by refreshing only the necessary parts of the page and by avoiding
the use and re-creation of many smaller web resources.</p>
          <p>To maintain consistency with older projects, we give
researchers the possibility to decide which workload manager
to use for their project: either the previous monitoring type
or our new workload manager. Switching between both of
them is available at any time. The previous monitoring type
also contains an overview table, but lacks important features
for a project manager, including filtering and sorting.</p>
          <p>In summary, we built a data pipeline for collecting
both social media data and news from formal and
informal sources. We updated the INCEpTION platform to
improve management of annotation projects, improving speed
and performance in order to facilitate larger, citizen-science-based
annotation. This architecture will
facilitate quicker development of projects combining NLP
practitioners and on-the-ground workers to develop a
high-quality understanding of crisis events.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Case Study: COVID-19</title>
      <p>To showcase the usability of our data collection and
annotation pipeline, we designed a small case study to predict
COVID-19 case numbers. We decided to include Twitter
sentiment (positive, neutral, or negative) as a potential
indicator of people’s behavior, together with news articles that
reference an increase or decrease in measures by government
or private entities (increased quarantine/mask regulations vs.
reopening schools and businesses).</p>
      <p>We set up our Twitter scraper, which collects publicly
available COVID-19 related Tweets by searching for the
keywords “corona”, “covid”, “epidemic”, “pandemic”, and
“lockdown”. A smaller, random subset of these Tweets is
then exported into the UIMA XML CAS format which is
used by INCEpTION. We try to distinguish between
positive sentiment, e.g. people who are optimistic about the
current situation, negative sentiment, e.g. people who are
unhappy with their government’s actions, and neutral sentiment, i.e.
unrelated or unspecific Tweets, or factual reporting without
expressed opinion. Our underlying assumption is that
people with a positive sentiment are more likely to follow
government rules, which (theoretically) leads to a decrease in case
numbers.</p>
      <p>For news data we collect articles mentioning COVID-19
AND (lockdown OR masks OR measure OR opening). We
annotate these articles into those mentioning an increase in
restricting measures, e.g. lockdown or required masks in
supermarkets, a decrease, e.g. opening of schools or
allowing international travel, and those that mention no measures or
merely discuss current measures. To support the development of these
annotation guidelines, INCEpTION provides automatic
calculation of inter-annotator agreement for two annotators.</p>
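      <p>To make the notion of agreement between two annotators concrete, the following Python sketch computes Cohen's kappa, one common chance-corrected agreement measure. We do not claim that this reproduces INCEpTION's computation; the label names and the two annotation sequences are invented for the example.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label counts.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for four news articles.
ann_1 = ["increase", "increase", "decrease", "none"]
ann_2 = ["increase", "decrease", "decrease", "none"]
kappa = cohens_kappa(ann_1, ann_2)
```
]]></preformat>
      <p>In this toy case the annotators agree on three of four articles (observed agreement 0.75), the agreement expected by chance is 5/16, and kappa is therefore 7/11, or about 0.64.</p>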
      <p>After obtaining both the annotated Twitter sentiment data and
the annotated news articles about lockdown measures, we train a
neural network which takes those two features as well as
officially reported COVID-19 case numbers in order to
predict case number development over a short time frame of
five days into the future. This work is
in progress; for documentation on the results, as well as
a comprehensive walkthrough of our procedure, see
https://github.com/UKPLab/social-good-data-pipeline.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Work</title>
      <p>Many projects in NLP start with the collection and
annotation of data. This holds especially true during a novel
crisis where there might be no existing data set available.
This is a time-consuming endeavour and costs valuable
resources when time is of the essence. We developed a
general purpose data collection and annotation pipeline for
researchers to rapidly gather Twitter and news data from
various sources in near real-time, which then can be annotated
using the INCEpTION annotation platform. We extended
INCEpTION to accommodate citizen science by allowing
researchers to manage a group of annotators and their
workload.</p>
      <p>As future work, we would like to improve the robustness
of our Twitter scraper. The unofficial NASTY library relies
on website parsing and is not supported by Twitter; it
therefore sometimes experiences performance issues and needs
patching. With regard to news collection, extraction of user
comments on news articles would be useful to better
understand the article in context, public reactions, and its likely
effects on the population.</p>
      <p>INCEpTION will be further optimized for the needs
of citizen science. One important planned feature is
inviting annotators and researchers via their
.edu addresses, which will allow a more flexible login
mechanism. Letting people from academia around the globe
create accounts and log in by following an email link will
make it much easier to find annotators for projects and will
improve INCEpTION's ability to cope with citizen science.
To maintain the quality of the annotation work, we also want
to let project managers “mark” annotators who are performing
poorly; annotators who drop below a certain threshold will
then not be offered any new documents. To support this, a key
planned feature is a standardized test document that each
project automatically contains and that all annotators must
annotate.</p>
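      <p>The planned quality check could be sketched as follows. This is
only an illustration under stated assumptions: the text does not specify
how annotator performance is measured, so accuracy against reference
labels on the test document, the threshold value, and all names below
are hypothetical.</p>
      <preformat>
```python
def accuracy_on_test_document(gold, submitted):
    """Share of test-document items where the annotator matches the reference labels."""
    if len(gold) != len(submitted) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(g == s for g, s in zip(gold, submitted)) / len(gold)


def eligible_annotators(gold, submissions, threshold=0.7):
    """Names of annotators whose accuracy is at or above the (assumed) threshold."""
    return sorted(
        name
        for name, labels in submissions.items()
        if accuracy_on_test_document(gold, labels) >= threshold
    )


# Hypothetical reference labels and annotator submissions for one test document.
gold = ["pos", "neg", "neu", "pos"]
submissions = {
    "alice": ["pos", "neg", "neu", "pos"],
    "bob": ["pos", "neg", "pos", "pos"],
    "carol": ["neg", "pos", "neu", "neg"],
}
still_offered_documents = eligible_annotators(gold, submissions)
print(still_offered_documents)
```
      </preformat>
      <p>Annotators missing from the returned list would be the ones a
project manager might mark and stop assigning new documents to.</p>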
      <p>Having immediate access to social media data from
Twitter and news articles across the world while utilizing an
annotation platform for citizen science allows researchers to
focus on their analysis of data and development of machine
learning models to better understand how society reacts
during an ongoing crisis. This data processing pipeline
facilitates better use of data for humanitarian efforts,
particularly those that occur during crisis events.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work has been supported by the German Federal
Ministry of Education and Research (BMBF) under the
promotional reference 01UG1816B (CEDIFOR), as well as the
German Research Foundation under grant No. EC 503/1-1
and GU 798/21-1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>K. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Aydin</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Barrenechea</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cardenas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hakeem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Jambi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Design Challenges/Solutions for Environments Supporting the Analysis of Social Media Data in Crisis Informatics Research</article-title>
          .
          <source>In 2015 48th Hawaii International Conference on System Sciences</source>
          ,
          <fpage>163</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Big Crisis Data: Social Media in Disasters and Time-Critical Situations</article-title>
          .
          ISBN 9781107135765. doi: 10.1017/9781316476840.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Eckart de Castilho</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Klie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Boullosa</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Linking Text and Knowledge Using the INCEpTION Annotation Platform</article-title>
          .
          <source>In 14th IEEE International Conference on e-Science, e-Science 2018, Amsterdam, The Netherlands, October 29 - November 1, 2018</source>
          ,
          <fpage>327</fpage>
          -
          <lpage>328</lpage>
          . IEEE Computer Society. doi: 10.1109/eScience.2018.00077. URL https://doi.org/10.1109/eScience.2018.00077.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Fiesler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Proferes</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>“Participant” Perceptions of Twitter Research Ethics</article-title>
          .
          <source>Social Media + Society</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2056305118763366</fpage>
          . doi: 10.1177/2056305118763366. URL https://doi.org/10.1177/2056305118763366.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Gura</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Citizen science: Amateur experts</article-title>
          .
          <source>Nature</source>
          <volume>496</volume>
          :
          <fpage>259</fpage>
          -
          <lpage>261</lpage>
          . doi: 10.1038/nj7444-259a.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Imran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Diaz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Vieweg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Processing Social Media Messages in Mass Emergency: A Survey</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>47</volume>
          (
          <issue>4</issue>
          ).
          ISSN 0360-0300. doi: 10.1145/2771588. URL https://doi.org/10.1145/2771588.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Imran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lucas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Meier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Vieweg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>AIDR: Artificial Intelligence for Disaster Response</article-title>
          . doi: 10.1145/2567948.2577034.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Kohlschütter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fankhauser</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Nejdl</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Boilerplate detection using shallow text features</article-title>
          . In Davison, B. D.; Suel, T.; Craswell, N.; and Liu, B., eds.,
          <source>Proceedings of the Third</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>