    Machine Learning Assists the Classification of
     Reports by Citizens on Disease-Carrying
                   Mosquitoes

        Antonio Rodriguez1 , Frederic Bartumeus2,3,4 , and Ricard Gavaldà1
               1
                   Universitat Politècnica de Catalunya, Barcelona (Spain)
    2
        Centre for Advanced Studies of Blanes (CEAB-CSIC), 17300 Girona (Spain)
               3
                  CREAF, Cerdanyola del Vallès, 08193 Barcelona (Spain)
               4
                  ICREA, Pg Lluís Companys 23, 08010 Barcelona (Spain)



         Abstract. Mosquito Alert (www.mosquitoalert.com/en) is an expert-
         validated citizen science platform for tracking and controlling disease-
         carrying mosquitoes. Citizens download a free app and use their phones
         to send reports of presumed sightings of two world-wide disease vector
         mosquito species (the Asian Tiger and the Yellow Fever mosquito). These
         reports are then supervised by a team of entomologists and, once vali-
         dated, added to a database. As the platform prepares to scale to much
         larger geographical areas and user bases, the expert validation by en-
         tomologists becomes the main bottleneck. In this paper we describe the
         use of machine learning on the citizen reports to automatically validate a
         fraction of them, therefore allowing the entomologists either to deal with
         larger report streams or to concentrate on those that are more strate-
         gic, such as reports from new areas (so that early warning protocols are
         activated) or from areas with high epidemiological risks (so that con-
         trol actions to reduce mosquito populations are activated). The current
         prototype flags a third of the reports as “almost certainly positive” with
         high confidence. It is currently being integrated into the main workflow
         of the Mosquito Alert platform.


1       Introduction
One of the unintended consequences of globalization is the expansion of inva-
sive species outside of their original habitats. These may have harmful or even
devastating effects on the invaded ecosystem. In other cases, these species can
carry serious diseases (e.g. Dengue, Chikungunya, Zika, Yellow Fever) affecting
humans or other animals. This is the case of the mosquito species that are the
focus of this paper.
    Mosquito Alert (www.mosquitoalert.com/en) [1] is an expert-validated citizen
science platform developed at the Centre for Ecological Research and Forestry
Applications (CREAF) and the Centre for Advanced Studies of Blanes (CEAB-CSIC), near
Barcelona (Spain). It started in 2014 under the name AtrapaelTigre (“catch the
tiger”) because it initially focused on determining the Asian Tiger mosquito
(Aedes albopictus) distribution and spreading process in Spain. In the past two


years, Mosquito Alert has built up a community, bringing together citizens,
scientists (modelers and medical entomologists), and stakeholders (public health
administrations and mosquito control services) to help minimize mosquito-borne
disease risks in Spain. Based on a solid multidisciplinary team, Mosquito Alert
is now starting to scale up to offer a global tool that can aid in the fight against
Zika, Chikungunya, Dengue, and Yellow Fever worldwide; this has implied adding
to the platform the Yellow Fever mosquito (Aedes aegypti), recently notorious
as the main vector of the Zika epidemics in South America [2], extending the
geographical area, and moving the system from data-rich to Big Data scenarios.
Currently, Mosquito Alert has attracted interest in Latin America, United States
and China, where new communities are expected to grow and generate new data.
    The “frontend” of the platform is a freely downloadable app for Android and
iOS phones that anyone can use to send reports of mosquito sightings. Up to
now there have been 20,000 downloads of the app in Spain, and there are several
thousand continuously active participants on average throughout the year. The num-
ber of participants is increasing rapidly in light of the global attention to the
current Zika epidemics in Latin America. Reports from the app users are stored
in the “backend” of the platform and help scientists to detect adult mosquitoes
and their breeding grounds. This information is used to build distribution and
future expansion models, as well as directly helping to control their expansion by
activating early warning alerts and control actions to decrease mosquito popula-
tions and epidemic risks. It is thus a clear example of citizen science [3] in which
the general public co-participates in the scientific research and development of
the platform, via knowledge sharing, intellectual abilities, resources, or tools.
Citizens are actors and at the same time end users experiencing the benefits of
such a research process.
    More precisely, citizens’ actions involve sending reports of spotted tiger
mosquitoes or their breeding sites. The latter are small water containers that
can proliferate after rain periods either in public spaces (e.g. fountains, sewers,
or water drainers) or in private areas (e.g. flower pots in terraces and gardens,
rain collecting devices in urban growing gardens, etc). When sending the re-
port to the platform servers, citizens are asked to fill in a short questionnaire,
attach a photograph if possible, and give permission to attach the geolocation
of that report. For confidentiality purposes, no information about the
citizens is collected other than the random anonymous identifier assigned when
they register in the app. This unique identifier is used to trace the overall activity
and performance of each anonymous user in order to improve engagement and
communication with the community. These reports are inspected and validated
(or rejected) by a team of entomologists and are included in a database and an
interactive webmap. The project, or authorities, can then derive actions from
that information, such as dispatching verification and control teams to reported
locations.
   Classification is strictly done by visual examination of the pictures, if included
with the report. Each report is validated by three experts, with a super-expert
making the final decision in case of disagreement.

    Reports need supervised validation because a certain fraction of the reports
are erroneous: non-experienced citizens may report regular mosquitoes as tiger
mosquitoes despite the tutorials on the web and the information and guided
questionnaire provided in the app. The work of the validating entomologists is
one of the bottlenecks for the scalability of the system. This paper describes
an application of machine learning to spare entomologists the verification of
part of the reports; more specifically, with the datasets gathered so far, the
application can flag over 30% of the received reports as true tiger mosquito
sightings, with high confidence. This will allow the entomologists to focus on
other more valuable tasks, such as verifying new reports from areas where the
specimens have not yet been established, or organizing control teams in high-risk
epidemiological areas, as well as being able to handle larger geographical areas
and larger user bases. Other possibilities to the same effect beyond analyzing the
reports could be considered in the future, for example, analyzing the pictures
themselves with image processing techniques and cross-validating expert and
non-expert supervision to allow citizens to improve over time.
    In fact, feedback on pictures is partially available through the public map
in the Mosquito Alert webpage. In the pop-ups of the map, participants can
see their own pictures with the comments by experts. By checking the expert
comments and the pictures, an untrained citizen can quickly learn which
pictures are considered high-quality, i.e. which ones are useful to identify the
targeted species. In a future app release, feedback on scores and overall quality
of their pictures will be sent directly to the users’ cellphones.
    The manuscript is structured as follows. In Sections 2 and 3 we describe
the datasets used for the study and the preprocessing applied to them. In Section 4 we
describe the experiments with different classifiers, their results, and the current
choice of a classifier to be implemented. In Section 5 we sketch the architecture of
the system as it is currently being implemented in Mosquito Alert. Finally, in Sec-
tion 6 we recap the outcomes of the experience and highlight a few possibilities
for future work.


2    The Datasets

The project provided two main files for the study. The first lists the users that
have downloaded the app, including user ID, download time and other related
information (app version downloaded). The more interesting one is the file with the
reports received during 2014 and 2015, whose fields will be discussed shortly.
    The dataset contains 16,967 users and 10,618 reports; about a third of regis-
tered users sent no reports, another third sent a single report, and, following a
Zipfian-looking distribution, the maximum number of reports sent by a single
user is 38. Note that in that period the app only allowed reporting the tiger
mosquito species Aedes albopictus.
    The key value of this dataset is that it contains the label or classification provided
by the entomologists for each report, allowing us to treat the problem as a

supervised learning one. Five labels are possible, encoded as integers in the
range −2 . . . 2.

 – 2: this is for sure a tiger mosquito sighting
 – 1: this is probably a tiger mosquito sighting
 – 0: there is not enough information to classify
 – -1: this is probably not a tiger mosquito sighting
 – -2: this is for sure not a tiger mosquito sighting

Reports with label 0 were unfortunately the majority, typically those without a
picture, whose validity the entomologists cannot assess for sure. We removed
them since it was difficult to use them for either training or testing. This left
a total of 2094 usable reports, distributed in classes as follows:

                             class   2   1 -1 -2
                          frequency 47% 46% 2% 5%

Therefore, this is a moderately class-imbalanced problem, with positive instances
about 13 times as frequent as negative ones.


2.1   Contents of the reports

The most relevant fields for each report are:

 – userId,
 – app version number,
 – phone operating system,
 – report date and time,
 – report georeference (latitude and longitude), if available,
 – report type (adult mosquito or breeding site),
 – the answers to three taxonomic questions present in the questionnaire:
     • Q1: Is it small, black and has white stripes?
     • Q2: Does it have a white stripe in both head and thorax?
     • Q3: Does it have white stripes in both abdomen and legs?
   For each question the user can select one of three options (No/I don’t
   know/Yes), represented by the numbers -1, 0 and 1 respectively;
 – an optional comment in free-text format, and
 – finally, the label or class assigned by the entomologist, taking values in
   {−2, −1, 1, 2} as explained before.
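For concreteness, a single report can be pictured as a record like the following (a minimal sketch; the field names are hypothetical, chosen only to mirror the list above):

```python
# Hypothetical sketch of one report record; field names are illustrative.
report = {
    "user_id": "a3f9c2e1",              # random anonymous identifier
    "app_version": "1.2",               # app version number
    "os": "Android",                    # phone operating system
    "timestamp": "2015-08-14T19:32:00", # report date and time
    "lat": 41.387, "lon": 2.170,        # georeference, if available
    "report_type": "adult",             # adult mosquito or breeding site
    "q1": 1, "q2": 0, "q3": -1,         # -1 = No, 0 = I don't know, 1 = Yes
    "note": "Seen near a fountain",     # optional free-text comment
    "label": 2,                         # entomologist class, in {-2, -1, 1, 2}
}
```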


3     Instance Construction

As usual in machine learning, features had to be transformed and new features
built so that classifier-building algorithms can use them. Most of the new features
are created by aggregating across all reports from the same user, or across all
reports from a geographical zone.


   In particular, the Mosquito Alert platform locates reports in a grid formed
by square cells of 4km × 4km, used as a reference for the project. Knowing that
a mosquito normally travels about 700 meters by itself during its life (if not
transported, e.g. by entering a car), we decided to additionally consider circular
areas of 1km radius around each report. It is expected that positive reports tend to
appear more often in invaded areas, e.g. around previous positive reports. A
report from an area from which no previous positive reports exist is, on the one
hand, more likely to be a user error, but on the other hand very important as it
may be an early alert for a new invaded area.
   The main features included in the dataset for classifier training are thus:

 – Discretized time-of-day of the report (0-6am, 6am-noon, noon-6pm, 6pm-midnight).
   This information is relevant as some mosquito species may have different
   daily activity patterns. Other discretization ranges are of course possible.
 – Month of the year. Mosquitoes are visible mostly during summer months,
   although the distribution curve is affected and shifted to some extent by
   weather conditions.
 – Number of previous reports by the same user; note that some of the following
   features do not make sense if this is the first report by a user.
 – The answers to questions Q1, Q2, Q3.
 – The operating system used (Android or iOS), just in case it happens to be
   relevant.
 – User accuracy: Fraction of previous reports in agreement with entomologists.
 – Time between user sign-in and this report.
 – Time between last report by the user and this one.
 – Average time between reports by this user.
 – User Action Areas: Number of cells from which the user has sent reports.
 – User Mobility Index: This variable tries to express the user's mobility between
   cells and their activity in each of them. A user that always sends reports from
   a specific cell but has sent only one report from a different cell is definitely
   different from one who sends the same number of reports from two different
   cells. To express this movement activity, the variable has been computed as
   the standard deviation of the number of reports sent from every different cell
   the user has been active in.
 – Reports around 1km in the last hour, last day, last week, and last month:
   Four variables indicating the number of reports in a circle of radius 1km
   around the location of this report, in the four time periods indicated.
 – Valid reports around 1km in the last hour, last day, last week, and last month:
   Same as before, but using the proportion of reports validated as
   positive.
 – A boolean indicating whether the report has a comment or not. The user
   taking the time to enter a comment may indicate more careful work on
   his/her part.
 – The class, in {−2, −1, 1, 2}.
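As an illustration, the User Mobility Index above can be computed as follows (a sketch assuming the population standard deviation; the paper does not specify which variant is used):

```python
from collections import Counter
from statistics import pstdev

def mobility_index(report_cells):
    """User Mobility Index: standard deviation of the number of reports
    a user has sent from each distinct 4km x 4km grid cell."""
    counts = list(Counter(report_cells).values())
    return pstdev(counts)

# Five reports from cell A and one from B: mostly stationary, high spread.
print(mobility_index(["A", "A", "A", "A", "A", "B"]))  # 2.0
# Three reports from each of two cells: evenly mobile, zero spread.
print(mobility_index(["A", "A", "A", "B", "B", "B"]))  # 0.0
```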


4    Classifiers tested

Four classifiers were tested to predict the report class, in particular
their implementations in R packages:

 – Naïve Bayes - e1071 R package
 – k-nearest neighbors - e1071 R package
 – Decision trees (C4.5) - RWeka R package
 – Random forests - randomForest R package.

A 10-fold cross-validation procedure was used to assess classifier performance.
We omit the discussion of parameter choices and model validation for each
algorithm, as well as of other resampling methods.
    In all cases, the initial results were rather poor. Very often, the classifier re-
turned one of the majority classes (either 1 or 2) on all instances, never returning
-1 or -2. Two variations of the training process were carried out:

 1. The four classes -2, -1, 1, 2 were reduced to two by merging “probably nega-
    tive” (-1) into “negative” (-2) and “probably positive” (1) into “positive” (2).
    This was a reasonable option since the Mosquito Alert team considered it
    too challenging for the classifier to tell these categories apart, as the dis-
    tinction between total and partial certainty about a positive or a negative
    report was somewhat subtle.
 2. Oversampling the infrequent class (-2) in the training set by factors of 5x to
    10x, since classifiers tended to simply return “2” on all instances. Crucially,
    no replication was added to the test fold, so the reported error rates are
    those of the trained classifier on the original distribution, not on the
    oversampled one.
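The oversampling variation can be sketched as follows; note that only the training fold is replicated, while the test fold keeps the original distribution (helper name and record format are illustrative):

```python
def oversample_minority(train_fold, minority_label, factor):
    """Replicate minority-class instances in the training fold only; the
    test fold keeps the original distribution, so measured error rates
    reflect the real class frequencies."""
    minority = [x for x in train_fold if x["label"] == minority_label]
    return train_fold + minority * (factor - 1)

train_fold = [{"label": 2}] * 9 + [{"label": -2}]
boosted = oversample_minority(train_fold, -2, 5)
print(len(boosted))                                 # 14
print(sum(1 for x in boosted if x["label"] == -2))  # 5
```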

Results were far better after these modifications except for k-nearest neighbors,
which kept performing poorly. Confusion matrices for each of the other three
classifiers are given in Tables 1, 2, 3.


                                  Predicted class
                                negative  positive
           True class negative     5.8%      0.6%
                      positive    61.4%     32.2%

                  Table 1. Confusion matrix, Naive Bayes



Two of the classifiers stood out for their performance: Random Forests and
Naïve Bayes. Random Forests achieve the highest accuracy (87.7%). However,
they do not have particularly high recall or particularly high precision on
either class, so it is unclear what use they can be given. On the other hand,
Naïve Bayes has a substantially smaller accuracy (38%) but has an extremely


                                  Predicted class
                                negative  positive
           True class negative     3.9%      2.4%
                      positive    16.4%     77.3%

           Table 2. Confusion matrix, C4.5 Decision Trees. M=10.


                                  Predicted class
                                negative  positive
           True class negative     3.7%      2.6%
                      positive     9.7%     84.0%

            Table 3. Confusion matrix, Random Forests. 500 trees.



good precision on the positive class (32.2 / 32.8 = 98.2%). That is, whenever it
classifies a report as positive, that report is with very high probability indeed
positive. Furthermore, it does classify about a third (32.8%) of the reports as
positive, and only about 10% of the truly negative reports are labeled as positive.
That is what is required for our intended purpose, that is, identifying with high
confidence a significant fraction of reports that can be classified without taking
entomologists’ time. It also has the advantage that it can be interpreted relatively
well by the experts; interpretation is harder for Random Forests.
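These figures follow directly from Table 1 by simple arithmetic:

```python
# Arithmetic check on Table 1 (Naive Bayes); entries are percentages of
# all test instances, rows = true class, columns = predicted class.
tn, fp = 5.8, 0.6    # true negatives row
fn, tp = 61.4, 32.2  # true positives row

precision_pos = tp / (tp + fp)         # flagged reports that are truly positive
print(round(100 * precision_pos, 1))   # 98.2
print(round(tp + fp, 1))               # 32.8  (about a third auto-flagged)
print(round((tp + tn) / 100, 2))       # 0.38  (overall accuracy)
print(round(100 * fp / (fp + tn), 1))  # 9.4   (truly negative reports flagged)
```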
    Thus, we chose Naïve Bayes as the classifier to be implemented in the pro-
totype, keeping in mind that Random Forests is a good candidate if some other
usage that requires prioritizing overall accuracy appears in the future. The possi-
bility of combining the two (or more) classifiers via voting is being investigated.
    Figure 1 presents the ROC curve of the Naïve Bayes classifier, and Table 4
lists the features by decreasing order of importance. It can be observed that the
most important ones are the questionnaire answers Q2, Q3, and Q1, together
with the number of reports within 1km in the last month. However, trying to
predict on the basis of these four variables alone caused a noticeable decrease
in performance, i.e. the other variables do help. The least significant ones are
(not very surprisingly) the fact that the user is new and the phone’s operating
system.


5    Integration in the project
The Mosquito Alert system is a hook model [4] consisting of three main blocks:
the Mosquito Alert app (available for Android and iOS), a corresponding Django-
based server-side functionality including a database containing the reports sent
by the app users, and an online platform http://www.mosquitoalert.com, pro-
viding three different levels of services. At the first level we have a platform called
EntoLab. This is a restricted-access service through which a set of experts can
first filter out inappropriate reports and classify the rest as either
positive or negative. Only classified reports are afterwards made visible


Fig. 1. ROC curve for the NB classifier. The area under curve is 0.8, and the maximum
sensitivity (recall) that can be achieved without any false positives is close to 0.3.



to the rest of the services. An intermediate level called ManagersPortal grants
on-demand access to stakeholders (e.g. public health administrations, mosquito
control services, private companies) with particular interest in the informa-
tion about the spread of the mosquito and areas at imminent risk of being
invaded. Finally, an open-access level allows citizens to visualize all the infor-
mation gathered, synthesized in the form of interactive maps (e.g. observations,
app downloads), where they can find their individual contributions validated
by the experts. This top level constitutes the necessary reward that closes the
hook-model loop.
    Because of the inherent bias of the set of reports towards positive ones (as
well as in the classifier itself), the idea is to implement the classifier as a filter
that yields an ordered list of the pending reports based on their positive score (i.e.
the probability of being positive). Afterwards, based on the ROC of the
classifier (plus any side considerations such as current geographical interest), the
experts can decide how many reports from the top of the list can be considered
correctly classified and in no need of further expert supervision. Because of
the relatively low computational cost of the Naïve Bayes classifier, this filtering
is implemented as a batch process scheduled at regular intervals (e.g. daily), but
can be tuned and triggered as desired by the platform managers. This may be
reconsidered in the future if computationally heavier classifiers were used.
    The reports database contains all received reports, both already classified and
pending classification. Thus, a batch implementation of the algorithm consists
of the following steps:


                    Variable name              Importance
                    reportQ2Answ                   0.7424
                    reportQ3Answ                   0.7038
                    reports1kmLastMonth            0.6623
                    reportQ1Answ                   0.6615
                    userNumReports                 0.6405
                    userNumActionAreas             0.6348
                    validReports1kmLastMonth       0.6216
                    userTimeForFirstReport         0.6197
                    reports1kmLastWeek             0.6158
                    userAccuracy                   0.6085
                    userTimeSinceLastReport        0.5923
                    userMeanTimeBetweenReports     0.5912
                    validReports1kmLastWeek        0.5688
                    reports1kmLastDay              0.5594
                    validReports1kmLastDay         0.5452
                    validReports1kmLastHour        0.5301
                    reports1kmLastHour             0.5281
                    userMobilityIndex              0.5260
                    reportMonth                    0.5199
                    reportNote                     0.5081
                    reportTimeOfDay                0.5057
                    os                             0.4650
                    newUser                        0.3648
Table 4. Variable importance in the NB classifier. Numbers are the values of the model
coefficients after standardization.



 1. Training & testing:
    (a) select all labeled (validated) reports with a report generation date later
        than a fixed date (in order to control computational cost and capture
        relatively up-to-date information);
    (b) preprocess the data and generate the set of training/test instances;
    (c) split the instance set into training and test subsets (or use cross-
        validation);
    (d) train the classifier with the training set;
    (e) test the classifier with the test set and compute the ROC.

 2. Classifying:
    (a) select all pending reports (also with a report generation date later than
        a given date);
    (b) preprocess the data and generate the set of instances; at this point we
        need the previous set of training/test instances to compute features like
        Reports around 1km and Valid reports around 1km;
    (c) classify the instances;
    (d) order the set of instances by decreasing positive score.
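Steps 2(c)-2(d), together with the thresholding decision made by the experts, can be sketched as follows (a minimal illustration; the scoring interface is hypothetical):

```python
def rank_pending(reports, positive_score, threshold):
    """Score each pending report, order by decreasing positive score, and
    split off those confident enough to skip expert validation.
    `positive_score` stands in for the trained classifier's probability
    output (hypothetical interface)."""
    scored = sorted(reports, key=positive_score, reverse=True)
    auto_validated = [r for r in scored if positive_score(r) >= threshold]
    needs_expert = [r for r in scored if positive_score(r) < threshold]
    return auto_validated, needs_expert

pending = [{"id": 1, "p": 0.99}, {"id": 2, "p": 0.40}, {"id": 3, "p": 0.95}]
auto, expert = rank_pending(pending, lambda r: r["p"], 0.90)
print([r["id"] for r in auto])    # [1, 3]
print([r["id"] for r in expert])  # [2]
```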


6   Conclusions and Future Work
In summary, a simple machine learning method opens the possibility of saving
at least a third of the expert time with a small rate of false positives.
    An alternative use of classifiers being considered is to have them exclude
both surely positive and surely negative reports, and send the entomologists
only the reports that the classifier is uncertain about. To this end one could
consider the combination of Random Forests (which, as mentioned, have overall
higher accuracy than Naïve Bayes) with ROC curve analysis. At the time of
writing we are also starting to explore the option of deleting the “uncertain”
classes (-1, 1) from the training set, so that training is carried out only on the
basis of the “certain” labels (-2, 2). Preliminary experiments indicate considerable
precision improvement, but further research is needed to ensure that classifiers
trained by these methods fit the platform needs.
    Another aspect that could be exploited for classification is the use of “crowd
intelligence” to extract information from the pictures. This is currently done at
the platform www.crowdcrafting.org, where mosquitoalert.com is one among
the hosted projects. Mosquito pictures sent by citizens are redirected to this
platform for people to validate. The results from this crowd validation are also
visualized in the map, together with the expert validation. In future app versions,
the possibility will be given to validate pictures from the cellphone itself, so
users will not only be asked to take pictures but also to validate other users’
pictures. Either from a web platform (crowdcrafting.org) or from the Mosquito
Alert app, the system will collect citizen (i.e. non-expert) classification
information to be compared with the expert one. In the future, we will be able
to add citizen validation scores as input features, analyze the convergence
between expert and citizen validations by region or collectively, and exploit
all this new information for training the classifiers.
    It is clear that one route to go is to develop and test new features and algo-
rithms. A set of new features could involve the direct extraction of information
from the pictures themselves, for example, through image processing techniques.
The classification system can be fed with new input features, and as the super-
vised set of reports is constantly growing it makes sense to re-test the chosen
classifiers or look for new ones in order to improve the overall performance of
the classification system.
    Integrating the classifier with expert-mandated rules is another necessary
step. As mentioned before, some reports may be more strategic or urgent than
others, e.g. those arriving from new areas with no past sightings or particularly
vulnerable to disease. The possibility of using the non-expert user base for ad-
ditional cross-validation, while simultaneously speeding up their learning curve,
would also require careful integration with the classifier. Gamification could be
an avenue to study.
    One of the main future challenges of the Mosquito Alert platform is the
scalability of the system, in order to support the growing flow of reports and an
increase in its complexity. It is true that at the moment a considerable flow of
reports can be comfortably handled with a single machine. This could certainly
change if the aforementioned image-processing techniques were incorporated.
But even with the current structure, some challenges may appear soon. In
particular, creating an instance out of a new report requires locating the previous
reports within the same 4km × 4km cell and within a radius of 1km. For a
large report stream, this may be costly without incorporating the proper data
structures. Helpfully, the platform currently uses the PostgreSQL database,
which has an extension called PostGIS that adds support for geographic objects,
allowing location queries to be run in SQL.
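As an illustration of the underlying geometric test, the 1km-radius check reduces to a great-circle distance computation (a naive per-pair sketch; in practice a spatial index such as PostGIS's ST_DWithin avoids scanning all reports):

```python
from math import radians, sin, cos, asin, sqrt

def within_1km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) test: is the second point within 1 km of
    the first? A naive per-pair check like this costs a full scan per new
    report; a spatial index (e.g. PostGIS ST_DWithin) avoids that."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    dist_km = 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius in km
    return dist_km <= 1.0

# Roughly 0.9 km apart (same longitude, 0.008 degrees of latitude):
print(within_1km(41.387, 2.170, 41.395, 2.170))  # True
# Roughly 4 km apart:
print(within_1km(41.387, 2.170, 41.423, 2.170))  # False
```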


References
1. Mosquito Alert: A citizen platform for studying and controlling mosquitos which
   transmit global diseases. http://www.mosquitoalert.com/en/. Accessed July 11,
   2016.
2. P. Vasconcelos, L. Marrama, E. Wiltshire, D. Domanovic, and A. Würz. Zika virus
   disease epidemic: Preparedness planning guide for diseases transmitted by Aedes
   aegypti and Aedes albopictus. Available at: http://ecdc.europa.eu/en/publications/
   Publications/zika-preparedness-planning-guide-aedes-mosquitoes.pdf. Accessed
   July 11, 2016.
3. F. Serrano Sanz, T. Holocher-Ertl, B. Kieslinger, F. Sanz García, and C. G. Silva.
   White Paper on Citizen Science for Europe. Available at: http://www.socientize.
   eu/sites/default/files/white-paper_0.pdf. Accessed July 11, 2016.
4. N. Eyal and R. Hoover. Hooked: How to Build Habit-Forming Products. 2014.
   ISBN-13: 978-0241184837.



