                                    Active Learning for the Text Classification of
                                           Rock Climbing Logbook Data

                                                     Eoghan Cunningham1,2 and Derek Greene1,2
                                            1
                                               School of Computer Science, University College Dublin, Ireland
                                        2
                                            Insight Centre for Data Analytics, University College Dublin, Ireland



                                        Abstract. This work applies active learning to the novel problem of
                                        automatically classifying user-generated logbook comments, published in
                                        online rock climbing forums. These short comments record details about
                                        a climber’s experience on a given route. We show that such comments
                                        can be successfully classified using a minimal amount of training data.
                                        Furthermore, we provide valuable insight into real-world applications of
                                        active learning where the cost of annotation is high and the data is
                                        imbalanced. We outline the benefits of a model-free approach for active
                                        learning, and discuss the difficulties that are faced when evaluating the
                                        use of additional training data.


                               1      Introduction
                               The purest style of traditional rock climbing is called “on-sight” climbing. This
                               refers to climbing a route with no more information than can be seen from the
                               ground3. Any information about a rock climb that would aid a climber in their
                               ascent is known as “beta”. Beta often takes the form of tips or instructions about
                               how to climb the route and what equipment to bring. It is typical for climbers to
                               share beta with each other before attempting a route. However, many climbers
                               who seek a greater challenge and hope to claim an “on-sight” ascent, will actively
                               avoid reading or hearing beta prior to their ascent.
                                    Rock climbers in the UK and Ireland currently log ascents of routes on fo-
                               rums hosted by UKClimbing.com (UKC). When logging ascents, UKC users can
                               leave comments about their experience on a given route. The site has recently
                               added functionality which allows users to label the comments that contain beta
                               information. Like spoilers in a film, some users wish to avoid reading beta before
                               climbing a route. With the addition of such labels, these users are able to hide
                               comments that contain beta. However, the vast majority of the comments on
                               UKC remain unlabelled. In this work we investigate the potential for a text clas-
                               sification algorithm to be used to automatically assign labels to these comments.
                                    On collecting a dataset of comments from UKC, we found that the scenario
                               here corresponds to that of many supervised machine learning tasks, where an
                               abundance of unlabelled data exists, but the amount of training data available
                               is quite limited. A description of this novel dataset is provided in Section 3.1.
                                3
                                    https://en.wikipedia.org/wiki/Glossary_of_climbing_terms




Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Crowdsourced annotations have proven to be useful when creating new training
sets from unlabelled natural language corpora [13], but this process requires
considerable time and effort on the part of the human annotators.
    Therefore, in this work we employ active learning methods [15] to compile
a larger training set, allowing us to build a text classification model that gen-
eralises well to unseen comments. In active learning, the examples chosen for
human labelling are selected carefully, so as to yield the maximum increase in
classification accuracy with the minimum amount of human effort. This project
investigated the effectiveness of a range of different text classification methods
and active learning strategies, as applied to the large dataset of comments col-
lected from UKC. These experiments are described in Section 3.2.
    After comparing alternative active learning approaches, we adopt the model-
free, exploration based strategy EGAL [5] to build an informative training set.
As our training set would be labelled by a committee of expert human anno-
tators, the cost of annotation was particularly high. This represents a unique
case where the benefits of a model-free approach are especially important, as we
see in Section 4.1. Using the additional training data, we are able to improve
the performance of our comment classifier. The evaluation of the quality of this
additional training data is a key challenge of this work, as outlined in Section
4.3. Finally, we achieve further improvements by including non-textual comment
metadata in the classification process, as discussed later in Section 4.4.


2     Related Work
2.1   Active Learning
In many supervised learning tasks, there are not enough labelled samples to
create sufficient training and test sets. Often the cost of labelling samples to
create such sets is very high. In these cases active learning can be employed to
reduce the number of samples that need to be labelled.
    In typical supervised learning tasks, training data is a random or stratified
sample selected from the larger space. In the case of active learning, the learner
builds the training set by choosing the samples which are most useful to the
classification process. This is done to reduce the number of samples that need to
be labelled. The key hypothesis is that, if a learning algorithm is allowed to choose
the data from which it learns, it will perform better with less training [12]. Pool-
based active learning was shown to reduce the amount of training data required
to achieve a given level of accuracy in a text classification task 500-fold [7].
Given a pool of unlabeled data U , an active learner is made up of the following:
a classifier f trained on a pool of labeled data X, and a query function q().
Given the data in the labeled pool X, q() decides which instances in U to query
next. The samples queried from U are presented to an oracle, typically a human
annotator, to be labelled and added to X. The query function q(), often called
the selection strategy, is a vital component of an active learning system.
    In the work by Lewis and Gale [7], the query function selected the unlabeled
samples about which the classifier was least certain. This approach is known as

uncertainty sampling. A query function performing uncertainty sampling in a
binary classification task may simply return the unlabeled samples for which the
probability of positive classification is closest to 0.5 [7, 6]. Settles and Craven
proposed an information density framework [12], which is based on the idea
that the most informative queries are not only those which are uncertain, but
also those that are the most representative of the distribution of the samples
in the data. Such samples are found in dense regions of the sample space. The
information density approach weights the uncertainty of a sample by its
average similarity to the other samples in the sample space.
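    To make these selection strategies concrete, the following Python sketch (using
scikit-learn; not taken from any of the cited works) illustrates an uncertainty
sampling query function for a binary task, together with a density-weighted variant
in the spirit of the information density framework. The classifier clf, the
unlabelled feature matrix X_unlabelled and the beta weighting parameter are
illustrative assumptions.

# A sketch (not the authors' code) of uncertainty sampling and a
# density-weighted variant for a binary task.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def uncertainty_query(clf, X_unlabelled, batch_size=10):
    # Uncertainty is highest where P(positive) is closest to 0.5.
    probs = clf.predict_proba(X_unlabelled)[:, 1]
    uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)
    return np.argsort(-uncertainty)[:batch_size]

def density_weighted_query(clf, X_unlabelled, batch_size=10, beta=1.0):
    # Weight each sample's uncertainty by its average similarity to the
    # rest of the unlabelled pool, favouring dense regions of the space.
    probs = clf.predict_proba(X_unlabelled)[:, 1]
    uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)
    density = cosine_similarity(X_unlabelled).mean(axis=1)
    return np.argsort(-(uncertainty * density ** beta))[:batch_size]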
    As methods that consider only density and/or uncertainty can fail to explore
the sample space, some selection strategies seek to select queries that differ
from the samples in the labelled pool. Exploration guided active learn-
ing (EGAL) [5] is a selection strategy which uses both diversity and density to
find informative queries. By querying from dense regions of the space, EGAL
avoids querying outliers while ensuring that selected samples are suitably di-
verse from the labelled pool of samples. Further, as EGAL does not measure
uncertainty in sampling, it can be implemented without a classification model.
This model-free approach allows for all of the queries to be found and anno-
tated in a single batch. In many applications where human annotation is not
readily available, model-free approaches are highly favorable over model-based
approaches in which models are retrained after each smaller batch [10].
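    As a rough illustration of the kind of model-free selection EGAL performs,
the sketch below (a simplification, not a faithful reimplementation of Hu et al.)
ranks sufficiently diverse candidates by their density; the similarity threshold
alpha and the use of the median diversity as a cut-off are simplifying assumptions.

# A simplified, model-free EGAL-style selection step; alpha and the median
# diversity cut-off are stand-ins for the thresholds of the original method.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def egal_style_query(X_unlabelled, X_labelled, batch_size=100, alpha=0.25):
    sim_uu = cosine_similarity(X_unlabelled)               # pool-to-pool similarity
    sim_ul = cosine_similarity(X_unlabelled, X_labelled)   # pool-to-labelled similarity
    # Density: total similarity to neighbours within radius alpha.
    density = np.where(sim_uu >= alpha, sim_uu, 0.0).sum(axis=1)
    # Diversity: one minus similarity to the nearest labelled example.
    diversity = 1.0 - sim_ul.max(axis=1)
    # Keep only candidates sufficiently diverse from the labelled pool,
    # then rank that candidate set by density.
    candidates = np.flatnonzero(diversity >= np.median(diversity))
    return candidates[np.argsort(-density[candidates])][:batch_size]

Because no classifier appears anywhere in this computation, the full batch of
queries can be selected before any annotation is received.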


2.2   Crowdsourcing Annotated Data

There has been much success in recent years using crowdsourcing to annotate
data. reCAPTCHA is one of the best-known examples of crowdsourcing in classifi-
cation tasks [16]. To determine whether users were humans or computers, CAPTCHAs
were introduced as an “automated public Turing test”, where the user is asked to
decipher some text. reCAPTCHA introduced a second, similar task. In this case
the text used is not known in advance, but rather is typically scanned text from
a corpus of text that is being digitized. If a human user provides a satisfactory
answer to the captcha, their response is considered to be a useful annotation for
the original scanned image. More recently, sites like Amazon’s Mechanical Turk
[11] have been a popular source for human-annotated data. Mechanical Turk
(MTurk) has been shown to be effective in collecting annotations for a range
of classification tasks [14]. However, when domain-specific expertise is required,
MTurk is often not a suitable source of information.


3     Methods

In this section we describe the key methods used to identify the most appropriate
active learning approach for our data, how we applied this approach to collect
additional training data and finally, how we evaluated the resulting augmented
training set. We also include a brief description of the dataset we have collected.

3.1   Dataset

A dataset of 100,000 comments was collected from UKC.com in September 2019.
These include ascents covering the period 1952–2019, and the average comment
was just 15 words long. In addition to the comment text, we also collected addi-
tional non-textual features associated with each comment. It was hypothesised
that these additional features might aid in comment classification. In the case of
each comment, where the information was available, the location of the route,
the location that the climber was local to and the maximum grade achieved
by the climber that year were collected as comment metadata. Of the 100,000
comments collected, 304 (0.304%) of them were labelled as containing beta. On
inspecting those comments and their labels, an expert human annotator agreed
with only 170 of them. It is clear that the data source contains even fewer la-
belled comments than initially expected and that the quality of this labelling is
very poor. However, manual re-annotation of these comments provided a labelled
pool of comments that we were confident were correctly labelled and could serve
as a starting point for the active learning process. While 99.7% of the comments
in this dataset remain without class labels, we believe there to be a large class
imbalance in the data where comments containing beta are in the minority. We
include two comments from the dataset below to provide insight into the nature
of the data and the language used to give beta. Comment 1 contains beta, while
Comment 2 does not.
 1. “Laybacked the main crack until the prominent right foothold, then stepped
    up with right foot whilst reaching up with left hand for a sound hold.”
 2. “Damp when we first arrived but soon brightened up. An excellent afternoon,
    first visit to the crag.”
A cross-validation experiment conducted on the pool of 304 annotated comments
showed that the classification of beta was a learnable task, but that a larger
training set would be required to improve classification performance and ensure the
model would generalise well to unseen data. Our dataset may prove useful to
other researchers, in particular those considering classification of user generated
text, short text or datasets with a large class imbalance.


3.2   Active Learning Selection Strategies

To expand our training set as efficiently as possible, we sought to find an ap-
propriate active learning strategy for our dataset. We evaluated six different
experimental combinations: three selection strategies, each with two classifica-
tion models. The models proposed were a Support Vector Machine (SVM) with
a linear kernel and a Random Forest classifier as these were shown to perform
well for our data. The selection strategies were uncertainty sampling, random
sampling, and EGAL. In each case the comments were represented as TF-IDF
vectors of uni-grams, bi-grams and tri-grams.
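    As a concrete illustration, the following sketch shows this representation and
the two candidate classifiers in scikit-learn; the two example comments are taken
from Section 3.1, and all hyperparameter values are illustrative rather than those
tuned for this work.

# A sketch of the feature representation and candidate classifiers;
# hyperparameter values are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

comments = [
    "Laybacked the main crack until the prominent right foothold.",   # beta
    "Damp when we first arrived but soon brightened up.",             # no beta
]
labels = [1, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))      # uni-, bi- and tri-grams
X = vectorizer.fit_transform(comments)

svm_model = SVC(kernel="linear", probability=True)    # linear-kernel SVM
rf_model = RandomForestClassifier(n_estimators=100)   # random forest
rf_model.fit(X, labels)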
    In order to evaluate the different approaches, we simulated active learning
on our pool of labelled samples. A small random labelled pool was removed and
used as a training set while the remaining data was considered unlabelled. On

each iteration, samples would be selected from the unlabelled pool, added to
the training set and used to train a classification model. As we lacked sufficient
annotated data to remove a hold-out test set, at each iteration, the model was
evaluated on all of the available annotated data. Each selection strategy was
then evaluated by plotting the classification performance (balanced accuracy
score) over the size of the training set. The best performing approach would
be the one that produced the steepest learning curve, i.e., the approach that
achieved the highest accuracy with smallest amount of training data. As each
model was tested on a test set that contained samples also present in the training
set, the resulting accuracy would likely overestimate the performance on unseen
data, but the overall shape of the learning curve would still be informative. This
approach to evaluation is common when the labelled dataset is small [8]. While
using a hold out test set, or even testing on the unlabelled pool, are also common
approaches [1, 2], they are only feasible in cases where a fully-annotated dataset
is available.
    In each case, the learning process was repeated 10 times, each with a differ-
ent labelled pool of 10 randomly-chosen samples. All base labelled pools were
balanced with 5 positive and 5 negative samples. In each case classification was
evaluated using a balanced accuracy score which was averaged over the 10 ex-
periments. The results of this experiment are provided in Section 4.1.
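    A minimal sketch of this simulation procedure is given below; it assumes a
feature matrix X_pool, a NumPy label array y_pool and a select_queries function
such as the uncertainty sketch in Section 2.1, and it uses a random (rather than
class-balanced) seed pool for brevity.

# A minimal sketch of the simulated active-learning evaluation; it assumes a
# labelled pool (X_pool, y_pool) and a select_queries(clf, X, batch_size)
# function, and uses a random seed pool rather than a balanced one.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

def simulate_active_learning(X_pool, y_pool, select_queries,
                             seed_size=10, batch_size=10, max_size=100):
    rng = np.random.default_rng(0)
    labelled = list(rng.choice(len(y_pool), size=seed_size, replace=False))
    curve = []
    while len(labelled) < max_size:
        unlabelled = np.setdiff1d(np.arange(len(y_pool)), labelled)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_pool[labelled], y_pool[labelled])
        # With no hold-out set available, evaluate on all annotated data.
        curve.append((len(labelled),
                      balanced_accuracy_score(y_pool, clf.predict(X_pool))))
        picked = select_queries(clf, X_pool[unlabelled], batch_size)
        labelled.extend(unlabelled[picked].tolist())
    return curve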


3.3   Annotation Environment

The selection strategy that performed best in the experiment outlined above
was used to query 100 comments from the unlabelled pool of comments. These com-
ments were labelled by a committee of expert annotators and added to the
labelled comments used for training. A web page was developed to allow for re-
mote, reliable annotation of comments by the human annotators. A brief pilot
study was conducted with three participants in order to evaluate this annotation
tool. Participants were asked to log on, annotate 30 comments and complete a
questionnaire about their experience. From this pilot study we ascertained that the
site was functioning properly and that we could expect annotators to label 100
comments over the course of a week.
    After collecting the user judgements from the annotation environment, the
labels were evaluated to assess inter-annotator agreement and quantify consen-
sus. In addition to a simple percentage agreement calculation, a Cohen’s kappa
statistic [3] was calculated. Following this, majority voting was used to assign a
class label to every comment. Each annotator could also be ranked by the rate
at which they disagreed with this majority. This was done to identify unreliable
annotators. These annotators could then be removed from the study, and con-
sensus and agreement measures recalculated. This is based on the greedy
consensus rules proposed by Endriss and Fernandez [4], where annotators were
only permitted to disagree with the majority vote t times before being removed
from consideration. The results of these calculations are outlined in Section 4.2.
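    The sketch below illustrates these consensus calculations; it assumes the
collected judgements are available as a mapping from each annotator to the
comments they labelled, and averages pairwise percentage agreement and Cohen's
kappa before applying a simple majority vote.

# A sketch of the consensus calculations; `labels` maps each annotator id to a
# dict of {comment id: 0/1 label}, and only shared comments are compared.
from collections import Counter
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pairwise_agreement(labels):
    agreements, kappas = [], []
    for a, b in combinations(labels, 2):
        shared = sorted(set(labels[a]) & set(labels[b]))
        if not shared:
            continue
        ya = [labels[a][c] for c in shared]
        yb = [labels[b][c] for c in shared]
        agreements.append(sum(x == y for x, y in zip(ya, yb)) / len(shared))
        kappas.append(cohen_kappa_score(ya, yb))
    return sum(agreements) / len(agreements), sum(kappas) / len(kappas)

def majority_vote(labels):
    # Assign each comment the label chosen by the majority of its annotators.
    votes = Counter()
    for annotations in labels.values():
        for comment, label in annotations.items():
            votes[(comment, label)] += 1
    comments = {c for annotations in labels.values() for c in annotations}
    return {c: max((0, 1), key=lambda l: votes[(c, l)]) for c in comments}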

3.4   Augmented Training Set

The active learning stage of the project resulted in 100 additional samples that
had been queried by our learner and labelled by participants using the annota-
tion environment. In this section we outline the experiments used to evaluate the
quality of this additional training data, and investigate how it might improve
our ability to perform text classification of comments from UKC.com. The addi-
tional training samples were evaluated using a hold out test set of the comments
most recently posted to UKC. Two models were trained; a pre-active learning
model, trained on the original pool of labelled comments and a post-active learn-
ing model, trained on the original comments and the additional training data.
Using our unseen test set, we could evaluate how each classifier might perform in
a real-world deployment scenario. In our evaluations we report both a Balanced
Accuracy Score (BAS) and the Area under the ROC Curve (AUC). The BAS
is used due to the class imbalance in the data and reports performance on an
unseen test set, at a decision threshold chosen after parameter tuning. The AUC
instead considers performance across all decision thresholds and the classifier
that maximises AUC is considered to be better able to distinguish between the
two classes. While cross-validation is a common approach to evaluating classi-
fication, particularly in cases where annotated training and test data is scarce,
we deemed it inappropriate for evaluating our additional training data for the
following reasons:
  1. The additional training data should not be used as test data, as this would
     reward any selection strategy that queried samples similar to the already
     labelled training data and may result in trivial classification.
  2. The original training data should not be used as test data as the additional
     training samples are chosen to augment the training set, not to replace it.
Instead, we have used cross-validation only as a means of parameter-tuning. In
the case of each model, a 10 fold cross-validation was performed using all of the
data available to the model, i.e., the pre-active learning model was tuned using
only the samples that had been labelled prior to active learning.
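    The sketch below illustrates this tuning-then-evaluation protocol; the
parameter grid, the fixed 0.5 decision threshold and the use of a random forest
are illustrative assumptions rather than the configuration used in this work.

# A sketch of the tuning-then-evaluation protocol: 10-fold cross-validation is
# used only to choose hyperparameters, and the unseen test set is used once.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

def tune_and_evaluate(X_train, y_train, X_test, y_test):
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 20]},
        scoring="balanced_accuracy", cv=10)
    search.fit(X_train, y_train)
    probs = search.best_estimator_.predict_proba(X_test)[:, 1]
    threshold = 0.5  # in practice, chosen during cross-validation
    return {"BAS": balanced_accuracy_score(y_test, probs >= threshold),
            "AUC": roc_auc_score(y_test, probs)}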
     The test set was composed of 150 comments that had been most recently
posted to UKC.com. These were annotated by three expert human annotators.
We refer to this set as Set 1. This set was heavily imbalanced, as it contained only
17 samples that were labelled as beta. While this test set was representative of
the typical comments posted on UKC, a more informative test set should contain
more instances of the positive class. Although it may have been possible to find
more examples of beta by searching for key words or other textual features, it
was deemed that this was likely to result in trivial classification.
     We chose instead to use features not included in the classification process
to attempt to find more instances of the beta class. From our initial labelled
set of comments, we found that there were some users who were very active on
the site and were more likely to include beta in their comments. By collecting
up to 15 of the most recent comments from 10 such users we collected a set of
139 comments which, on annotation, was shown to contain 37 positive samples.
This final set was added to the original test set to give a larger, richer test

set of 289 comments. We refer to this set as Set 2. Both pre- and post-active
learning models were tuned, trained and evaluated on each set and the results
are reported in Section 4.3.


3.5   Additional Features

In addition to the textual features, three non-textual features were extracted
and evaluated: (i) local to, a boolean feature indicating whether the climber was local
to the area where the route was climbed; (ii) comment length, the length of the
comment; (iii) challenge, a feature describing how challenging the climber would
find the route, calculated by comparing the grade of the route to the maximum grade
achieved by the climber in the same year. In the case of each non-textual feature,
the impact of that feature on the target variable was measured using information
gain. This was calculated using the annotated base set and the additional data
from the EGAL queries. The unseen test set was omitted from these calculations.
The effect of including these features in classification was then evaluated using
the post-active learning training set and the final unseen test set. Given the
relatively small number of additional features, it was possible to evaluate all
possible combinations of features exhaustively. For each subset of features, the
model was trained on all of the textual features and that subset. Performance
on the test set was reported as BAS and AUC.
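    The following sketch illustrates both evaluations; scikit-learn's mutual
information estimate is used here as a stand-in for the information gain
calculation, and the array extra holding the three non-textual features, along
with the TF-IDF matrix X_text, are assumed inputs.

# A sketch of the non-textual feature evaluation; mutual information is used
# as a stand-in for information gain, and `extra` is an (n_samples, 3) array.
from itertools import combinations
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_selection import mutual_info_classif

names = ["local to", "comment length", "challenge"]

def information_gain(extra, y):
    # Estimate the dependence between each non-textual feature and the class.
    return dict(zip(names, mutual_info_classif(extra, y, random_state=0)))

def feature_subsets(X_text, extra):
    # Yield every non-empty subset of the non-textual features together with
    # the combined (textual + subset) feature matrix for classification.
    for r in range(1, len(names) + 1):
        for idx in combinations(range(len(names)), r):
            combined = hstack([X_text, csr_matrix(extra[:, list(idx)])])
            yield [names[i] for i in idx], combined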


4     Results

4.1   Selection Strategy

The results of the experiment comparing different active learning selection strate-
gies are provided below. Table 1 reports the area under the learning curve for
each selection strategy, up to a training set size of 100 samples. Figure 1 shows
the learning curves of the best performing selection strategies: (i) uncertainty
sampling using a random forest classifier; (ii) EGAL using a random forest clas-
sifier. The smoother learning curve achieved using uncertainty sampling is due
to the larger batch size. Each model-based approach was implemented using a
batch size of 10, while EGAL was implemented with a batch size of 1.
     As outlined in Section 2.1, EGAL is a model-free approach, meaning the
classifier does not need to be retrained between batches of queries. As a result,
it can be implemented with a batch size of 1 and still, all 100 queries can be made
before any annotation is received. This is not the case however for model-based
approaches like uncertainty sampling. In the case of uncertainty sampling, each
batch must be annotated before the next batch can be queried. In light of this,
the batches used to evaluate uncertainty sampling may have been too small. We
anticipate that, even with a larger batch size of 20 samples, the amount of
annotation received would have been reduced, since annotators could not have
provided labels as and when they wished: the next batch of queries could not
be made until the previous batch had been fully annotated.

                 Selection Strategy              Area under learning curve (0–100)
                 Random Sampling (SVM)                  60.41
                 Uncertainty Sampling (SVM)             57.75
                 EGAL (SVM)                             49.92
                 Random Sampling (RF)                   55.88
                 Uncertainty Sampling (RF)              68.41
                 EGAL (RF)                              67.09
        Table 1. Area under the learning curve (up to 100 training samples) for each query selection strategy.




     Fig. 1. Learning curves for the best performing active learning strategies.




    Fig. 2. Learning curves for uncertainty sampling, with different batch sizes.




    Figure 2 plots the learning curves achieved using uncertainty sampling with
a random forest classifier, as the batch size is increased. It is clear from this
figure that for larger batches, the classification performance decreases. EGAL is
shown to outperform the other strategies across all batch sizes larger than 10.
In general it is desirable to use larger batches to reduce dependencies amongst
annotators and increase the total number of annotations that we receive. For
this reason we have selected EGAL for our active learning system.

4.2   User Annotation
Using EGAL, 100 comments were queried from the pool of unlabelled comments
and annotated by 10 human experts using our annotation tool. Here we report
the results of some brief experiments to evaluate consensus amongst these anno-
tators. In total we received 680 labels, meaning that, on average, each comment’s
class label was aggregated over close to 7 votes. Using pairwise agreement
we calculated a total agreement percentage of 76.7% and a Cohen’s kappa value
of 0.498. Percentage agreement is often criticized as a measure of inter-annotator
agreement because it does not account for chance agreement: if we assign 10 random
labels to each of our 100 comments, we achieve 50% percentage agreement. For
this reason we consider the Cohen’s kappa value in addition to percentage agree-
ment. A kappa value between 0.4 and 0.6 is considered moderate agreement.
    We also chose to consider the rate at which annotators disagreed with the
majority. The average annotator disagreed with the majority between 8 and 12%
of the time. However, there were two outliers: annotators who disagreed with the
majority 21% and 33% of the time. These annotators received kappa scores of
0.21 and 0.32 indicating very low agreement. After removing these annotators
from the study we recalculated the agreement measures. Across the remaining
annotators the percentage agreement was calculated to be 82% with a kappa of
0.63. This indicated substantial agreement.
    On investigation it was found that both of the unreliable annotators had
strong biases that led to their poor agreement scores. 17 of the 18 errors made by
one annotator were false positives, while the other problematic annotator made
17 false negatives. It is encouraging to find that the unreliable annotators were
biased rather than simply inconsistent. Had these annotations been highly inconsistent, it
may suggest that labelling beta is not a learnable task. The fact that they are
seen to be biased suggests that, with better calibration, their agreement scores
could have been improved. This calibration could be achieved by changing the
instructions and examples provided to the annotators prior to annotation.

4.3   Augmented Training Set
Table 2 reports the balanced accuracy score (BAS) and area under the ROC
curve (AUC) for each model on each of the unseen test sets. We can see that
in all cases the classifier trained post active learning outperforms the classifier
trained before active learning. The balanced accuracy scores are achieved using
the decision threshold selected from the cross-validation and, as such, represent
our best estimate of the model’s performance were it to be deployed. The AUC
reported is the area under the ROC curve and instead quantifies the classifier’s
performance across all decision thresholds. This may provide a more robust
means of comparing the models, as it is not subject to decisions made in the cross-
validation process, namely the choice of decision threshold. This choice of
decision threshold may overfit to the relatively small labelled set used in
the cross-validation. While there may be a better choice that generalizes better
to all comments, the balanced accuracy scores reported in Table 2 represent our
best effort at labelling unseen data both before and after active learning.

             Metric        Pre-Active Learning Post-Active Learning
             AUC Set 1            0.794               0.824
             BAS Set 1            0.699               0.715
             AUC Set 2            0.830               0.837
             BAS Set 2            0.736               0.756
                Table 2. AUC and BAS scores on unseen test sets.



Feature               Information Gain
challenge                  0.0175
is local                   0.0015
comment len                0.1825
Textual Features
’right’                    0.0790
’reach’                    0.0686
’crack’                    0.0672
’foot’                     0.0604
’left’                     0.0584
       Table 3. Feature Information Gain.


Feature Set                        AUC   BAS
comment len                       0.850  0.795
is local                          0.843  0.765
challenge                         0.844  0.772
comment len, is local             0.851  0.780
comment len, challenge            0.849  0.781
is local, challenge               0.839  0.765
comment len, is local, challenge  0.845  0.760
       Table 4. AUC and BAS scores for additional features on unseen test set.




4.4   Additional Features
Each of the additional non-textual features was evaluated by calculating its
information gain on the target class. These values are reported in Table 3. The
five most informative textual features are included in this table for comparison.
When we consider these textual features we see examples of the highly specific
language that is common to the comments containing beta. Words like ‘reach’,
‘left’, ‘right’, and ‘foot’ are all indicative of the instructive language used to
give beta. We can see how these terms form the instructions that might help a
climber succeed on a route. Table 3 shows the comment length to be the most
informative, not only of the additional features but of any feature.
    In addition to calculating the information gain, it was also proposed to eval-
uate the additional features by including them in the classification and assessing
their effect on performance. Given the relatively small number of additional fea-
tures, it was possible to evaluate each possible combination of features. Table 4
reports the performance of the post-active learning classifier on the final unseen
test set with each combination of the additional features. These results show
that the addition of any of the non-textual features resulted in an increase in
performance. However, unsurprisingly the comment length feature proved to be
the most beneficial, achieving a 4% increase in BAS and a 1.3% increase in AUC.
    In an attempt to better understand the effect of these non-textual features
on comment classification, it was decided to repeat these evaluations using our
pre-active learning classifier. Figure 3 shows how the non-textual features affect
classification performance in both pre- and post-active learning models. Each




       Fig. 3. Illustration of the effect of additional features on classification.


line in Figure 3 reports the BAS before and after the non-textual features were
included. The slope of each line represents the effect of these features on the model’s
performance. It is apparent from the figure that the pre-active learning model
sees less improvement from the additional features than the post-active learning
model. This is a further testament to the quality of the active learning training
data. In essence, with this additional training data, our model makes better use
of the non-textual features.


5    Conclusion

We sought to apply active learning to the problem of automatically classifying
user-generated logbook comments from online rock climbing forums. UKC.com
implemented labels on their site which allow users to identify comments contain-
ing beta, yet the majority of these comments remain unlabelled. We have shown
our final classification model can identify beta in comments on UKC.com with
80% accuracy. We believe this to be close to the upper limit of what is achievable
for this classification task. As we saw in Section 4.2, expert human annotators
achieve 82% pairwise agreement in the same task. We have used Exploration
Guided Active Learning (EGAL) to identify the most informative samples to
add to our training data. By including non-textual metadata features in the
classification process, we have improved our model’s balanced accuracy on an
unseen test set by ≈ 6%. We believe the techniques we have employed should
prove useful in similar tasks involving the identification of anomalies in user-
generated text where annotation is limited. An obvious analog to our work is
identifying spoilers in movie reviews. Further, we have highlighted the necessity
for the use of model-free active learning approaches in cases where the cost of
annotation is high. Many model-based selection strategies are evaluated using
batches of five or fewer samples [7, 9, 15], as this offers the best results in off-line
experiments. However, such an approach is not feasible when annotations are
crowd-sourced. Finally, we provided insights into the complications associated
with evaluating active learning techniques when the cost of annotation is high.

The new dataset compiled for this work is made available online4 , and may prove
useful to other researchers, particularly those considering classification of short,
user-generated text or data with a large class imbalance.

Acknowledgement. This work was supported by Science Foundation Ireland
(SFI) under Grant Number SFI/12/RC/2289 P2.

References
 1. Baldridge, J., Osborne, M.: Active learning and the total cost of annotation. In:
    Proc. Conf. Empirical Methods in Natural Language Processing. pp. 9–16 (2004)
 2. Brinker, K.: Incorporating diversity in active learning with support vector ma-
    chines. In: Proc. 20th Int. Conference on Machine Learning. pp. 59–66 (2003)
 3. Cohen, J.: A coefficient of agreement for nominal scales. Educational and psycho-
    logical measurement 20(1), 37–46 (1960)
 4. Endriss, U., Fernández, R.: Collective annotation of linguistic resources: Basic prin-
    ciples and a formal model. In: Proc. 51st Annual Meeting of the Association for
    Computational Linguistics. vol. 1, pp. 539–549 (2013)
 5. Hu, R., Delany, S.J., Mac Namee, B.: EGAL: Exploration Guided Active Learning
    for TCBR. In: Proc. 18th International Conf. Case-Based Reasoning Research and
    Development. pp. 156–170. Springer-Verlag (2010)
 6. Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learn-
    ing. In: Proc. Int. Conf. Machine Learning. pp. 148–156. Morgan Kaufmann (1994)
 7. Lewis, D.D., Gale, W.A.: A sequential algorithm for training text classifiers. In:
    Proc. 17th Annual International ACM SIGIR Conf. Research and Development in
    Information Retrieval. pp. 3–12. Springer-Verlag New York (1994)
 8. Mac Namee, B., Delany, S.J.: Sweetening the data set: Using active learning to
    label unlabelled datasets. In: Proc. 19th Irish Conference on Artificial Intelligence
    and Cognitive Science (2008)
 9. McCallum, A.K., Nigam, K.: Employing EM and pool-based active learning for
    text classification. In: Proc. Int. Conf. on Machine Learning. pp. 359–367 (1998)
10. O’Neill, J., Delany, S., MacNamee, B.: Model-Free and Model-Based Active Learn-
    ing for Regression, vol. 513, pp. 375–386 (01 2017)
11. Rashtchian, C., Young, P., Hodosh, M., Hockenmaier, J.: Collecting image anno-
    tations using Amazon’s Mechanical Turk. In: Proc. NAACL HLT Workshop on
    Creating Speech and Language Data with Amazon’s Mechanical Turk (2010)
12. Settles, B.: Active learning literature survey. Computer Sciences Technical Re-
    port 1648, University of Wisconsin–Madison (2009)
13. Snow, R., O’Connor, B., Jurafsky, D., Ng, A.: Cheap and fast – but is it good?
    Evaluating non-expert annotations for natural language tasks. In: Proc. Conf. Em-
    pirical Methods in Natural Language Processing. pp. 254–263 (2008)
14. Sorokin, A., Forsyth, D.A.: Utility data annotation with amazon mechanical turk.
    IEEE Conf. Computer Vision and Pattern Recognition Workshops pp. 1–8 (2008)
15. Tong, S., Koller, D.: Support vector machine active learning with applications to
    text classification. J. Machine Learning Research 2, 45–66 (2002)
16. Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., Blum, M.: reCAPTCHA:
    Human-based character recognition via web security measures. Science 321(5895),
    1465–1468 (2008)

4
     https://github.com/eoghancunn/logbookdataset