Active Learning for Spreadsheet Cell Classification

Julius Gonsior, Josephine Rehak, Maik Thiele, Elvis Koci, Michael GΓΌnther and Wolfgang Lehner
Technische UniversitΓ€t Dresden
firstname.lastname@tu-dresden.de

Β© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Spreadsheets are among the most successful content generation tools, used in almost every enterprise to create a plethora of semi-structured data. However, this information is often intermingled with various formatting, layout, and textual metadata, making it hard to identify and extract the actual tabularly structured payload. For this reason, automated information extraction from spreadsheets is a challenging task. Previous papers proposed cell classification as a first step of the table extraction process, which, however, requires a substantial amount of labeled training data that is expensive to obtain. Therefore, in this paper we investigate a semi-supervised approach called Active Learning (AL) that can be used to train classification models by selecting only the most informative examples from an unlabeled dataset. In detail, we implement an AL cycle for spreadsheet cell classification by investigating different selection strategies and stopping criteria. We compare the performance of various AL strategies and derive guidelines for semi-supervised cell classification. Our experiments show that by implementing AL for cell classification, we are able to reduce the amount of training data by 90% without any accuracy losses compared to a passive classifier.

KEYWORDS
Information Extraction, Active Learning, Semi-supervised, Machine Learning, Classification, Spreadsheets

1 INTRODUCTION
Spreadsheets are powerful content generation tools, assisting novices and professionals alike. They contain data that are roughly relational, but accompanied by various formatting, layout, and textual metadata. Thus, the generated content is primarily designed for human consumption and carries a lot of implicit information. Due to these reasons, automatic table extraction and recognition [2, 5, 17] for spreadsheets is a very difficult task.

The most crucial step for all table recognition approaches is cell classification, determining for each non-empty cell its role (data, header, metadata etc.) within the spreadsheet. However, the training of classification models relies on human-labeled training data, which involves extensive human effort. In this paper, we therefore propose a semi-supervised approach for cell classification. In detail, we propose an active learning (AL) cycle that is able to determine the optimal amount of training data needed to train a cell classification model.

Figure 1: Active Learning Cycle for Cell Classification (learner and human oracle, query strategies such as random or uncertainty sampling, batch size, and stopping criteria)

Figure 1 sketches the AL cycle and its main steps. The AL process starts with two sets of data: the unlabeled sample set U and the already labeled dataset L, with |L| << |U|. Initially, the learner is trained on the small labeled dataset L. The main task of the AL cycle is to iteratively increase the set of labeled data L by identifying the most promising cells in U. The cycle should stop as soon as the trained classifier reaches the best accuracy. The AL cycle is performed by two main actors: a learner and an oracle. In our case, the learner is a cell classifier which is continuously retrained on the newly labeled data L_q. The oracle is the label-providing entity, i.e. a human annotator providing the gold standard. Based on the utilized query strategy (see Section 3.1) an unlabeled sample U_q is chosen from the unlabeled pool U. The purpose of this task is to identify such cells that contribute most to the classifier to be trained. All other cells remain unlabeled, potentially saving a lot of human effort. The cells proposed by the query strategy U_q are then given to the oracle, resulting in L_q. The newly labeled data L_q is added to L and the process starts again by retraining the classifier on the extended dataset. The AL cycle proceeds until a stopping criterion (see Section 3.2) is met or until U is out of queries.

Our results show that the classification models trained on the data determined by the AL cycle outperform the passive classifier and reduce the amount of training data by 90%.
Outline. The remainder of this paper is organized as follows: In Section 2, we shortly review the task of cell classification for spreadsheets. The individual parts of the AL cycle for cell classification are introduced in Section 3. In detail, we discuss different query strategies, stopping criteria and batch sizes, evaluated in Section 4. We additionally evaluate varying start set sizes and give a recommendation on how to implement the AL cycle for cell classification. Finally, we present related work in Section 5 and conclude in Section 6.

2 CELL CLASSIFICATION FOR SPREADSHEETS
The objective of capturing the tabular data embedded in spreadsheets can be treated as a classification problem where the specific structures of a table have to be identified. Given a spreadsheet, the goal is to classify each non-empty cell with a given set of labels. The set of labels we use in this work is described in Section 2.1. Section 2.2 briefly presents the cell features used to train a classifier.

2.1 Cell Labels
The research literature [2, 5, 33] basically defines seven roles for non-empty cells: Data, Header, Derived, GroupHeader, Title, Note, and Other. However, for the specific task of table identification not all of these roles are equally important and most approaches just rely on the main table building blocks: header, data, and metadata cells. Header cells give names to columns, describing the values below them. They can be nested, occupying several consecutive rows. Data cells follow the structure defined by header cells and contain the table's payload. Metadata cells provide additional information about the sheet as a whole, or about a subset of its values. Some typical examples are footnotes and table titles.

2.2 Cell Features
The label of a cell is determined by a classifier, which makes use of cell features such as formatting, content types, formula references, and additional context from nearby cells [2, 5]. In total, we consider 159 available features to train our cell classifiers. A detailed explanation of all cell features can be found in [19].
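The 159 features themselves are documented in [19] and are not repeated here. Purely as an illustration of the three feature groups named above (formatting, content types, and context from nearby cells), the following sketch derives a handful of simple per-cell features with openpyxl; the feature names and the openpyxl-based extraction are our own assumptions, not the feature set used in this work.

```python
from openpyxl import load_workbook

def toy_cell_features(xlsx_path):
    """Illustrative only: a few formatting, content-type and context features per
    non-empty cell (the full 159-feature set is described in [19])."""
    ws = load_workbook(xlsx_path).active
    features = {}
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is None:
                continue
            above = ws.cell(row=cell.row - 1, column=cell.column) if cell.row > 1 else None
            features[(cell.row, cell.column)] = {
                "is_bold": bool(cell.font and cell.font.bold),        # formatting
                "is_numeric": isinstance(cell.value, (int, float)),   # content type
                "is_formula": cell.data_type == "f",                  # formula reference
                "text_length": len(str(cell.value)),                  # content
                "empty_above": above is None or above.value is None,  # nearby-cell context
            }
    return features
```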
2.3 Training Dataset
For our experiments we used the DECO dataset [18], an existing corpus of labeled spreadsheets that can be used to train classification models and additionally, in the context of our work, to simulate an oracle (for more information see Section 3) in the AL cycle. DECO contains 1,165 labeled spreadsheets which have been derived from the Enron corpus [13]. The time to annotate a sheet within a spreadsheet has been logged during the creation of DECO. On average the annotation takes 4.3 minutes per spreadsheet, with a maximum of 284.4 minutes. This shows the complexity of the cumbersome manual labeling task. It is clear that a reduction of to-be-labeled spreadsheets will drastically improve the cell classification approach and thus the overall table extraction process.

3 ACTIVE LEARNING FOR SPREADSHEET CELL CLASSIFICATION
To implement an AL cycle (shown in Figure 1) for a given supervised machine learning task, such as cell classification, one has to decide on a query strategy, a stopping criterion and a certain batch size. The query strategy selects the samples which should be labeled next by the oracle. In Section 3.1, we give an overview of some popular query strategies which are later considered in our evaluation. To achieve our goal of reducing the human effort in labeling spreadsheet cells, we have to find the optimal point for stopping the AL cycle. Therefore, different stopping criteria are discussed in Section 3.2. Another issue that impacts the labeling effort is the batch size, i.e. the number of samples or cells labeled within each iteration of the AL cycle, discussed in Section 3.3.

3.1 Query Strategies
In this section, we shortly introduce the different strategies for choosing the most informative queries. Each strategy approximates the contained informativeness of unlabeled data for a potential classifier.

Random Sampling. Random sampling is a common AL query strategy and found application in [3, 7, 27]. Unlike the other methods, random sampling chooses queries at random and fully independently of their informativeness. However, even with this strategy a rise in prediction accuracy is possible, since the amount of training data is steadily increased. We use random sampling as a baseline to compare the other strategies, too.

Uncertainty Sampling. Uncertainty sampling chooses queries which are the most uncertain to predict. Hence, learning these queries should result in more certain predictions of the classifier. We compare three uncertainty metrics: least confident, margin sampling and entropy [28]. Least confidence [20] tries to capture the probability that the classifier is mislabeling the data, using the posterior probability P where \hat{y} is the most likely prediction:

    U_{q,LC} = \arg\max_{x \in U} \left( 1 - P(\hat{y} \mid x) \right)    (1)

Information about other classes next to the most probable one is not taken into account by this strategy. Margin sampling [25] in contrast uses the posteriors for the first and second most probable classes and samples the instances with the smallest margin:

    U_{q,M} = \arg\min_{x \in U} \left( P(\hat{y}_1 \mid x) - P(\hat{y}_2 \mid x) \right)    (2)

Entropy uncertainty [30] uses all possible classes and captures the entropy of a given distribution:

    U_{q,E} = \arg\max_{x \in U} \left( -\sum_i P(y_i \mid x) \log P(y_i \mid x) \right)    (3)
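As a minimal sketch of how these three metrics translate into an actual query selection, the snippet below scores an unlabeled pool with any scikit-learn-style classifier exposing predict_proba; the function and variable names are ours, not taken from the paper's implementation.

```python
import numpy as np

def least_confident(proba):
    # Eq. (1): 1 - P(y_hat | x); larger values mean more uncertain cells.
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # Eq. (2): P(y_hat_1 | x) - P(y_hat_2 | x); smaller values mean more uncertain cells.
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def entropy(proba):
    # Eq. (3): -sum_i P(y_i | x) log P(y_i | x); larger values mean more uncertain cells.
    return -(proba * np.log(np.clip(proba, 1e-12, None))).sum(axis=1)

def uncertainty_query(clf, X_unlabeled, strategy="entropy", k=1):
    """Return the indices of the k most informative unlabeled cells."""
    proba = clf.predict_proba(X_unlabeled)
    if strategy == "least_confident":
        scores = least_confident(proba)
    elif strategy == "margin":
        scores = -margin(proba)   # negate so that argmax corresponds to the smallest margin
    else:
        scores = entropy(proba)
    return np.argsort(scores)[-k:]
```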
Query-by-Committee. In contrast to the other strategies, the Query-by-Committee [29] approach maintains multiple models in parallel, called committee members. They are all trained on the same current labeled set L and then vote on the query candidates. The vote is conducted by letting all committee members predict all unlabeled samples and measuring a controversy score for each sample. The most controversial query is considered the most informative. For the measurement of disagreement various methods exist. In this paper, we choose vote entropy as proposed by [9]:

    U_{q,VE} = \arg\max_{x \in U} \left( -\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C} \right)    (4)

whereby C denotes the committee size, y_i ranges over all possible labels and V(y_i) is the number of times a label was predicted by a classifier. This scoring method scores dissenting votes highest and concurring votes lowest. Our applied committee consisted of three Random Forest classifiers with different random initializations, one NaΓ―ve Bayes, and one Support Vector Machine.
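A compact sketch of such a committee, assuming scikit-learn estimators and the vote-entropy score of Eq. (4); the hyperparameters and helper names below are placeholders of ours rather than the settings used in the experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def build_committee():
    # Three differently seeded Random Forests, one Naive Bayes and one SVM.
    return [RandomForestClassifier(random_state=s) for s in (0, 1, 2)] + [GaussianNB(), SVC()]

def vote_entropy_query(committee, X_labeled, y_labeled, X_unlabeled, k=1):
    votes = []
    for member in committee:
        member.fit(X_labeled, y_labeled)        # every member is trained on the same L
        votes.append(member.predict(X_unlabeled))
    votes = np.stack(votes, axis=1)             # shape: (num unlabeled cells, committee size C)
    C = votes.shape[1]
    scores = np.empty(len(votes))
    for i, row in enumerate(votes):
        _, counts = np.unique(row, return_counts=True)   # V(y_i) for every voted label
        p = counts / C
        scores[i] = -(p * np.log(p)).sum()               # vote entropy, Eq. (4)
    return np.argsort(scores)[-k:]                       # the most controversial samples
```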
3.2 Stopping Criteria
So far, the AL cycle would stop only if the set of unlabeled data U is out of cells. Obviously this leads to no reduction in labeling effort. Therefore, we introduce a stopping criterion that is able to detect whether proceeding with the AL cycle no longer results in any accuracy gain. Additionally, it can be shown that by reducing the amount of training data we prevent overfitting and the test accuracy is increased (Section 4.6). In this paper, we investigate three different stopping criteria (SC) from [34].

Maximum Uncertainty. Maximum Uncertainty uses the same measurement as the Uncertainty Sampling Least Confident strategy in Section 3.1. The basic idea is that over multiple learning cycles the classifier's uncertainty should decrease, as the classifier has more knowledge over its data. Therefore, the stopping criterion uses the uncertainty of the most uncertain sample in U as measure. If the uncertainty value drops below a user-defined threshold, the stopping criterion will end the AL cycle:

    SC_{MU} = \begin{cases} 1 & \text{if } U_q \in U \text{ and } UM(U_q) \le \theta_{MU} \\ 0 & \text{otherwise} \end{cases}    (5)

UM(U_q) is the method retrieving the uncertainty of a sample U_q and \theta_{MU} the predefined threshold. For batches we used the smallest uncertainty of the batch; other common strategies are the mean or the median.

Selected Accuracy Method. This method focuses on the classification accuracy of the chosen queries U_q. It assumes that the classifier always chooses the most difficult queries for labeling. Hence the model's predictions on the most uncertain queries should reflect the classifier's certainty. The AL process stops if the classifier successfully predicts the labels of those most difficult queries.

    SC_{SA} = \begin{cases} 1 & \text{if } acc_{|U_q|}(C) \ge \theta_{SA} \\ 0 & \text{otherwise} \end{cases}    (6)

acc_m(C) determines, with the help of the oracle, the accuracy of the classifier's predictions on the queries; m denotes the iteration of the AL cycle of the query. \theta_{SA} is the threshold. For batches the mean accuracy over the cells is used.

Standard Deviation-based Stopping. In contrast to the other approaches, Standard Deviation-based Stopping [6] assesses multiple cycles for trend detection. Two criteria have to be met for stopping: first, the standard deviation of the accuracies for the last five U_q has to be lower than a predefined threshold \theta_1, and secondly, the accuracy of the current U_q has to be larger than the threshold \theta_2. The first criterion identifies plateaus in the query accuracy and the second one prevents local optima.

    SC_{\sigma} = \begin{cases} 1 & \text{if } \sigma(acc_{n-5 \ldots n}) \le \theta_1 \wedge acc_n \ge \theta_2, \; n \ge 5 \\ 0 & \text{otherwise} \end{cases}    (7)

The current cycle under inspection is denoted with n. Since the stopping criteria described above all return Boolean values, they can be used in combination in the form of logical expressions.
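The three criteria, and their combination as a logical expression, can be sketched as plain Boolean functions; the threshold values below are arbitrary placeholders, not the settings used in the experiments.

```python
import numpy as np

def sc_max_uncertainty(batch_uncertainties, theta_mu=0.1):
    # Eq. (5): stop once the smallest uncertainty of the queried batch falls below theta_MU.
    return min(batch_uncertainties) <= theta_mu

def sc_selected_accuracy(y_true_query, y_pred_query, theta_sa=0.9):
    # Eq. (6): stop once the classifier already predicts the supposedly hardest queries well.
    return np.mean(np.asarray(y_true_query) == np.asarray(y_pred_query)) >= theta_sa

def sc_standard_deviation(query_accuracies, theta_1=0.02, theta_2=0.9):
    # Eq. (7): plateau over the last five query accuracies plus a minimum current accuracy.
    if len(query_accuracies) < 5:
        return False
    last_five = np.asarray(query_accuracies[-5:])
    return last_five.std() <= theta_1 and last_five[-1] >= theta_2

def should_stop(y_true_q, y_pred_q, query_accuracies):
    # Example of a combined criterion (cf. Section 4.6): standard deviation as a necessary
    # and selected accuracy as a sufficient condition.
    return sc_standard_deviation(query_accuracies) and sc_selected_accuracy(y_true_q, y_pred_q)
```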
3.3 Batch Sizes
It is common practice in machine learning to train a model on batches of samples instead of single data points. In the later evaluation (Section 4.4) we compare simple top-k batches and so-called spreadsheet batches: A top-k batch consists of the top-k cells being selected by the query strategy (Section 3.1). However, this strategy has the disadvantage that the selected cells potentially belong to a large number of spreadsheets, which leads to higher efforts for the oracle, i.e. the human annotator. As reported in Section 2.3 it takes several minutes to label a spreadsheet, while most of the time is needed to understand the spreadsheet content and structure. For this reason, we propose to provide just one spreadsheet to the human annotator within an AL cycle, i.e. a batch consists of all cells from this single spreadsheet. To select the best spreadsheet within an AL cycle we have to adapt our defined query strategy metrics QS (see Section 3.1) for spreadsheet batches as follows, where U_i denotes the set of unlabeled cells of spreadsheet i and SB denotes the spreadsheet to be used as U_q:

    SB = \arg\max_i \frac{1}{|U_i|} \sum_{x \in U_i} QS(x)    (8)
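A small sketch of the spreadsheet-wise selection of Eq. (8), assuming the unlabeled cells are grouped per sheet and a query-strategy scorer such as the entropy function from the earlier sketch; all names are ours.

```python
import numpy as np

def select_spreadsheet(clf, unlabeled_cells_by_sheet, score_fn):
    """Eq. (8): pick the sheet whose unlabeled cells receive the highest mean query score.

    unlabeled_cells_by_sheet maps a sheet id to the feature matrix U_i of its unlabeled
    cells; score_fn turns clf.predict_proba(U_i) into per-cell informativeness scores.
    """
    best_sheet, best_score = None, -np.inf
    for sheet_id, U_i in unlabeled_cells_by_sheet.items():
        mean_score = score_fn(clf.predict_proba(U_i)).mean()   # (1/|U_i|) * sum_x QS(x)
        if mean_score > best_score:
            best_sheet, best_score = sheet_id, mean_score
    return best_sheet
```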
4 EVALUATION
To evaluate our semi-supervised approach for spreadsheet cell classification, we performed a set of experiments on the DECO dataset (Section 2.3). To use this dataset for cell classification, three changes have been made: 1) As described already in Section 2.1, the seven layout roles have to be merged into the three broader categories data, header, metadata. 2) Due to the very high amount of data cells (94%) in DECO, the data class would get an increased prior probability. Therefore, we decided to limit the number of data cells to 100 per spreadsheet, chosen randomly. 3) Cells with the label other were not included in our experiments, since they are too ambiguous and not helpful for table identification. The original dataset contains 98.68% data cells, 1.18% header cells and 0.14% metadata cells, with a total of 1,393,398 cells. After performing the aforementioned steps, the distribution is 83.27% data cells, 14.99% header cells and 1.74% metadata cells.

We implemented the AL cycle sketched in Figure 1 in Python in the following way: the learner consists of several classification models such as Decision Trees, Random Forest, NaΓ―ve Bayes and SVMs, using their Scikit-learn [22] implementation. To perform AL on the DECO dataset we split it into 80% used for L and U as well as 20% for testing. Within the experiments, we usually set |L| to 1% and |U| to 99% (of the overall 80%). Since our dataset is already completely annotated, our implemented oracle can automatically provide the requested label, i.e. for a given U_q provide L_q. While this setup is perfect to test the different configurations of the AL cycle, it is important to note that in a real-world setting the oracle will always be a human annotator.

In Section 4.1, we first propose a metric to measure the performance of an AL cycle in terms of accuracy improvement and sample size reduction. Before looking at the AL strategies, we report in Section 4.2 on the accuracy of passive learners, i.e. classifiers using the whole DECO dataset for training. In Section 4.3, we study the impact of different start set sizes on the learning curves. Varying batch sizes, sampling strategies and stopping criteria are investigated in Section 4.4, Section 4.5 and Section 4.6.
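The following condensed sketch mirrors this setup: an 80/20 train/test split, a 1% start set, a Random Forest learner and an oracle simulated by the already existing DECO labels. It assumes cell features, labels and sheet ids as NumPy arrays and reuses the spreadsheet selection from Section 3.3; all helper names and sizes are our own choices, not the original implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def simulate_al_cycle(X, y, sheets, query_sheet, n_iterations=100, seed=0):
    """query_sheet(clf, X_pool, sheet_ids_pool) returns the id of the next spreadsheet U_q,
    e.g. built from select_spreadsheet() and an uncertainty score."""
    # 80% of the cells form L and U, 20% are held out for testing.
    X_tr, X_te, y_tr, y_te, sh_tr, _ = train_test_split(
        X, y, sheets, test_size=0.2, random_state=seed)
    rng = np.random.default_rng(seed)
    # L starts with roughly 1% of the training cells, U holds the remaining 99%.
    labeled = set(rng.choice(len(X_tr), size=max(1, len(X_tr) // 100), replace=False).tolist())
    learning_curve = []
    for _ in range(n_iterations):
        L = sorted(labeled)
        clf = RandomForestClassifier(random_state=seed).fit(X_tr[L], y_tr[L])
        learning_curve.append(clf.score(X_te, y_te))             # test accuracy per AL cycle
        U = [i for i in range(len(X_tr)) if i not in labeled]
        if not U:
            break
        next_sheet = query_sheet(clf, X_tr[U], sh_tr[U])         # spreadsheet batch (Section 3.3)
        labeled.update(i for i in U if sh_tr[i] == next_sheet)   # simulated oracle: labels exist
    return learning_curve
```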
4.1 Slope Calculation
There does not exist a standardized numeric measure to report on AL performance. A common method in AL research is to compare smoothed accuracy graphs. However, to compare different settings in a faster way, many papers also rate the classifier performance by looking at the start and end point of the learning curve. To take into account also the reduction of the human effort in labeling data, we additionally divide the gained accuracy by the number of queries that have been used to achieve this result. The measurement is denoted as slope.

    slope = \frac{acc_{end} - acc_{start}}{|queries|}    (9)

acc_start is the classifier's accuracy before any AL cycle passes. acc_end is determined by the stopping criteria (see Section 3.2) or, in the worst case, corresponds to the point when the unlabeled sample set U is out of data.
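Eq. (9) is a one-liner; the short helper below also reproduces one row of Table 2 as a sanity check (accuracy gains are reported there in percentage points).

```python
def slope(acc_start, acc_end, n_queries):
    # Eq. (9): accuracy gain per query, i.e. improvement normalized by the labeling effort.
    return (acc_end - acc_start) / n_queries

# Standard deviation stopping in Table 2: +11.19 percentage points after 70 queries.
assert abs(slope(0.0, 11.19, 70) - 1.599e-01) < 1e-3
```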
4.2 Passive Learning
In a first experiment, we compared the performance of different classifiers trained on the whole dataset L, without applying the AL cycle. In detail, we investigated Decision Trees (DT), Random Forest (RF), NaΓ―ve Bayes (NB), Support Vector Machines (SVM), and Multi-Layer-Perceptron (MLP) using a standard random search over the hyperparameters and five-fold cross-validation. Table 1 lists the achieved classification results. The data has been downsampled and the data points of the underrepresented classes have been used as weights for the header and metadata accuracies, as the F1-Scores would otherwise be significantly lower than those of the data class. Even though MLP and SVM achieved the best average F1-score, we used the close second best option RF for the remaining AL experiments, since its training time is significantly lower. In the following experiments the shown baseline is the accuracy of a passive classifier trained on the whole dataset L.

F1-Score                    Decision Tree   NaΓ―ve Bayes   Random Forest     SVM      MLP
Data                             0.95           0.91           0.95         0.95     0.95
Header                           0.75           0.69           0.76         0.78     0.78
Metadata                         0.23           0.35           0.46         0.47     0.52
Weighted Avg.                    0.90           0.87           0.91         0.92     0.92
Avg. training time (sec)         0.15           0.09           0.33       112.52   490.91

Table 1: F1-Scores of the Passive Classifiers
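As a hedged sketch of such a passive baseline (random hyperparameter search with five-fold cross-validation for one of the candidate models), the snippet below uses scikit-learn's RandomizedSearchCV; the search space and scoring are placeholder choices of ours, not the grids used in the experiments.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def tune_passive_random_forest(X, y):
    # Random search over a small illustrative hyperparameter space with five-fold CV.
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={"n_estimators": randint(10, 300),
                             "max_depth": randint(2, 40)},
        n_iter=20, cv=5, scoring="f1_weighted", random_state=0)
    return search.fit(X, y).best_estimator_
```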
4.3 Start Set Size
We compare three different start set sizes for L: 1%, 5% and 10%; the batch size is one spreadsheet at a time, and the query strategy is entropy-based uncertainty. Figure 2 shows that a smaller start set size leads to a higher increase in accuracy at the beginning of the AL cycle. Nevertheless, after around 100 queried sheets the graphs for all different start sizes have almost the same shape. In terms of reducing human effort, we can therefore state that a small start size is more beneficial and leads to the same or even higher accuracies.

Figure 2: Comparison of different start sizes (1%, 5%, 10%) and the passive classifier as baseline (accuracy over the number of queried sheets)

4.4 Batch Size
Instead of querying the oracle for just one new labeled data point during a single AL cycle, the classifier is refitted on a whole batch of newly labeled cells. Different batch sizes, respectively 50, 150, 250 and all cells of a single spreadsheet, are selected for comparison. The start set size for this experiment is 1% and an entropy-based uncertainty query strategy is used (Section 3.1). Figure 3 shows that the best results can be achieved with a batch size of 50. However, the difference between cell-based batches and the batches consisting of cells that belong to a single spreadsheet is very small. Therefore, we advise to use a batch size of one single spreadsheet per AL cycle in order to reduce the effort for the human annotator.

Figure 3: Comparison of different batch sizes (50, 150, 250, one sheet) and the passive classifier as baseline (accuracy over the number of batches)

4.5 Sampling Strategies
Figure 4 shows the classification accuracies for the different sampling strategies, all with the same start size of 1% and spreadsheet-wise batches. One can see a clear difference between the random strategy and the other strategies: For the random strategy the accuracy goes down after a few queried sheets. Entropy-based uncertainty sampling, uncertainty margin sampling and committee sampling show the highest increase, while the latter achieved the best results and is therefore the recommended strategy.

Figure 4: Accuracies for different sampling strategies (random, least confident, margin, entropy, committee) and the passive classifier as baseline (accuracy over the number of queried sheets)

4.6 Stopping Criteria
For evaluating the stopping criteria, we use a start size of 1% and committee sampling. Figure 5 shows the values of the three stopping metrics proposed in Section 3.2. The markers indicate the stopping points resulting from these stopping metrics. The vertical dashed line denotes the optimal stopping point derived from the peak of the learning curve for the test data. It should be noted again that the learning curve is just available in our experimental setting. In practice the stopping point has to be determined by the stopping criteria, which just rely on the output of the classifier.

Figure 5: Comparison of stopping strategies Max Uncertainty, Standard Deviation and Selected Accuracy together with the test accuracy of committee sampling (mauve) and the optimal stopping point (marked using a vertical dashed line)

For sheet-based batches, the maximum uncertainty stopping values are quite consistent; only for the last queries a drop is noticeable. The reason is probably an overall consistent average uncertainty over all unlabeled sheets. The selected accuracy metric fluctuates quite a lot and depends heavily on the sheet, but an upward trend is nevertheless noticeable the more labeled training data becomes available. The standard deviation of the last five classification accuracies of the queried spreadsheets is quite stable, having some valleys. This stopping criterion provides a robust categorization of the accuracies, but cannot be used to distinguish between plateaus and valleys. Even though in our case the minimum of the standard deviation is almost exactly at the optimum stopping point, for a general case we recommend to use standard deviation as a necessary criterion and selected accuracy as a sufficient criterion to detect a continuous peak.

To see the amount of effort which can be reduced by AL, we calculated the slopes for the stopping points as listed in Table 2. The best results were achieved with standard deviation stopping, resulting in an increase of the accuracy by +11.19% using a start set of just 6 spreadsheets and a training set of 70 spreadsheets, compared to 671 spreadsheets of the passive classifier.

                                 acc Ξ” %   |queries|      slope
No Stopping                        3.00       671       4.474e-03
Maximum Uncertainty Stopping       3.40       661       5.144e-03
Standard Deviation Stopping       11.19        70       1.599e-01
Selected Accuracy Stopping        10.88        43       2.530e-01

Table 2: Comparison of different learning slopes (spreadsheet batches, committee sampling, 1% start size)

4.7 Discussion
The following insights can be derived from our experiments: As a first step we recommend a comparison of different classifiers. For our scenario of cell classification, we decide on a Random Forest classifier providing high accuracy and fast training time. Compared to weakly-supervised approaches (see Section 5), active learning requires a start set of already labeled data. However, our experiments showed that even a very small start set consisting of only 6 spreadsheets is totally sufficient. For cell classification, we decide to use one spreadsheet per batch to reduce the annotator effort. Regarding the sampling strategies, we can conclude that except for the random approach all strategies worked well. The best slope was achieved using the committee strategy, which is therefore our recommended one. It is however noteworthy that this strategy requires the highest amount of computing resources, since several classifiers have to be trained in parallel. We could not identify a single stopping criterion which suffices for identifying nearly optimal stopping points. Therefore, our recommendation is a combination of a high selected accuracy and a low standard deviation of the query accuracies.
5 RELATED WORK
Unlabeled Spreadsheet Corpora. We find multiple unlabeled spreadsheet corpora in the literature: Euses [12] has served the spreadsheet community for a while. It was created with the help of search engines, issuing queries containing keywords such as "financial" and "inventory", and file type ".xls". Overall, it comprises 4,498 unique spreadsheets, organized into categories (folders) based on the used keywords. The more recent Enron corpus [13] contains 15,770 spreadsheets, extracted from the Enron email archive (http://info.nuix.com/Enron.html). This corpus is unique regarding its exclusive view on the use of spreadsheets in enterprise settings. Another recent corpus is Fuse [4], which comprises 249,376 unique spreadsheets, extracted from Common Crawl (http://commoncrawl.org/). Each spreadsheet is accompanied by a JSON file, which includes NLP token extraction and metrics related to the use of formulas. So far, these three corpora have been used by researchers viewing spreadsheets from a software engineering perspective. Formula error detection and debugging [15, 26], but also usage, life-cycle, modeling, and governance of spreadsheets [1, 8, 14] are important research subjects within this community.

Table Recognition. In the literature we find various approaches with regards to layout inference and table recognition in spreadsheets as prerequisites for information extraction. These approaches can be classified as rule-based, heuristics-based, and ML-based ones. Here, we consider those that are ML-based, like [11, 19]. More recent publications apply to some extent machine learning techniques [2, 5, 6, 19]. Other works make use of domain specific languages, such as [16], [31], and [14].

Active Learning. The existing AL approaches can be grouped into three categories: Membership Query Synthesis, Stream-Based Selective Sampling, and Pool-Based Sampling [28]. Membership Query Synthesis assumes that new data points can arbitrarily be created and the learner can request any data point from the problem space. For Stream-Based Selective Sampling a data point is sampled from the problem space and the learner can decide if the generated sample should be labeled or discarded. Both approaches are based on the assumption that arbitrary real-world data points can be produced in an inexpensive way, which is not the case in our scenario. Pool-Based Sampling instead assumes a fixed pool of labeled data L and a large pool of unlabeled data U from the problem space, which holds true for our spreadsheet dataset. Therefore, in this paper we implemented a pool-based sampling approach. In the context of spreadsheets, [6] proposed a rule-assisted AL approach to detect certain properties. The authors introduce a hybrid approach that combines active learning with crude, easy-to-write user-provided rules.

Weak Supervision. The common notion behind weak supervision techniques is that it is much easier to generate a large amount of so-called weakly-labeled data than a small amount of well-labeled data, and that more data always leads to better results. Under this assumption multiple approaches have been proposed: In Snorkel [24] user-defined labeling functions (LF) are used to label given data points. Writing LFs, however, requires experts with domain knowledge, in contrast to spreadsheet cell labeling which can also be done by non-experts. Additionally, Snorkel users often have to provide a development set of labeled data providing quick approximate signals on the accuracies of the LFs. Hence, the human effort is not necessarily lower compared to AL.

As successor of Snorkel, Snuba [32] was created, where the noisy LFs are automatically derived from a small labeled dataset. However, our experiments with the open-source code base of Snuba have shown some limitations. In the current state it is not easily applicable to problems of more than two labels and it seems to only work well for certain types of problems, to which the task of classifying spreadsheet cells does not belong: First, our dataset is quite unbalanced, which results in LFs that only focus on the majority class, and second, with a growing number of labeling classes the LFs again tend to focus only on the most common label. While investigating this we discovered multiple issues in the current implementation of Snuba which prevent the overall sound concept from working well in a generalized setting.

Other methods include assigning a label per feature instead of assigning a label per data point [10], which is suitable in a setting of textual data, where it is easy to specify distinctive terms per label class, but not in the case of classifying spreadsheet cells. Another approach [21, 23] first clusters the input data and then assigns a label per cluster based on a fraction of labeled data points per cluster. Our problem space has a dimension of 159 features and early experiments showed that it is not easy to identify appropriate clusters in such a high-dimensional space.

6 CONCLUSIONS
In this paper, we proposed a semi-supervised approach for cell classification that drastically reduces the amount of training data without any losses in accuracy. In detail, we investigated different query strategies, start set sizes, stopping criteria, and batch sizes to implement the AL cycle. With the DECO corpus, consisting of 1,165 labeled spreadsheets, we tested all these configurations and derived good settings for them. Using committee sampling, a start set size of 1% and the selected accuracy stopping criterion we are able to reduce the size of the training dataset by 90%. In absolute numbers, our AL cell classification approach obtains the same accuracies using just 76 labeled spreadsheets (6 sheets in the start set and 70 selected by the AL approach) compared to the passive classifier trained on 1,165 spreadsheets. Therefore, the effort for the human annotators was reduced from 3,029 to 197 minutes. Given this reduction of annotation work and consequently in annotation costs, it now becomes feasible to apply ML-driven cell classification and table identification in practice.
REFERENCES
[1] Robin Abraham and Martin Erwig. 2006. Inferring templates from spreadsheets. In Proceedings of the 28th International Conference on Software Engineering. ACM, 182–191.
[2] Marco D Adelfio and Hanan Samet. [n.d.]. Schema extraction for tabular data on the web. VLDB'16 ([n.d.]), 421–432.
[3] Yoram Baram, Ran El-Yaniv, and Kobi Luz. 2004. Online Choice of Active Learning Algorithms. J. Mach. Learn. Res. 5 (Dec. 2004), 255–291.
[4] Titus Barik, Kevin Lubick, Justin Smith, John Slankas, and Emerson Murphy-Hill. 2015. Fuse: a reproducible, extendable, internet-scale corpus of spreadsheets. In MSR'15. 486–489.
[5] Zhe Chen and Michael Cafarella. [n.d.]. Automatic web spreadsheet data extraction. In SSQ'13. 1.
[6] Zhe Chen, Sasha Dadiomov, Richard Wesley, Gang Xiao, Daniel Cory, Michael Cafarella, and Jock Mackinlay. 2017. Spreadsheet Property Detection With Rule-assisted Active Learning. In CIKM '17. ACM, New York, NY, USA, 999–1008. https://doi.org/10.1145/3132847.3132882
[7] David Cohn, Les Atlas, and Richard Ladner. 1994. Improving generalization with active learning. Machine Learning 15, 2 (01 May 1994), 201–221. https://doi.org/10.1007/BF00993277
[8] JΓ‘come Cunha, JoΓ£o Paulo Fernandes, Jorge Mendes, and JoΓ£o Saraiva. 2012. MDSheet: A framework for model-driven spreadsheet engineering. In ICSE. IEEE Press, 1395–1398.
[9] Ido Dagan and Sean P. Engelson. 1995. Committee-based Sampling for Training Probabilistic Classifiers. In ICML'95. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 150–157.
[10] Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proc. EMNLP. ACL Press, 81–90.
[11] Julian Eberius, Christoper Werner, Maik Thiele, Katrin Braunschweig, Lars Dannecker, and Wolfgang Lehner. [n.d.]. DeExcelerator: A framework for extracting relational data from partially structured documents. In CIKM'13. 2477–2480.
[12] Marc Fisher and Gregg Rothermel. [n.d.]. The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. In SIGSOFT'05, Vol. 30. 1–5.
[13] Felienne Hermans and Emerson Murphy-Hill. 2015. Enron's spreadsheets and related emails: A dataset and analysis. In ICSE. IEEE Press, 7–16.
[14] Felienne Hermans, Martin Pinzger, and Arie Van Deursen. [n.d.]. Automatically extracting class diagrams from spreadsheets. ECOOP'10 ([n.d.]), 52–75.
[15] Felienne Hermans, Martin Pinzger, and Arie van Deursen. 2015. Detecting and refactoring code smells in spreadsheet formulas. Empirical Software Engineering 20, 2 (2015), 549–575.
[16] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive visual specification of data transformation scripts. In SIGCHI. ACM, 3363–3372.
[17] Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. 2019. A Genetic-based Search for Automatic Table Recognition in Spreadsheets. In ICDAR'19.
[18] Elvis Koci, Maik Thiele, Josephine Rehak, Oscar Romero, and Wolfgang Lehner. 2019. DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition. In ICDAR'19.
[19] Elvis Koci, Maik Thiele, Γ“scar Romero Moral, and Wolfgang Lehner. 2016. A machine learning approach for layout inference in spreadsheets. In KDIR. SciTePress, 77–88.
[20] David D. Lewis. 1995. A Sequential Algorithm for Training Text Classifiers: Corrigendum and Additional Data. SIGIR Forum 29, 2 (Sept. 1995), 13–19. https://doi.org/10.1145/219587.219592
[21] Zhiliang Liu, Xiaomin Zhao, Jianxiao Zou, and Hongbing Xu. 2013. A Semi-Supervised Approach Based on k-Nearest Neighbor. Journal of Software 8 (04 2013). https://doi.org/10.4304/jsw.8.4.768-775
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. JMLR 12 (2011), 2825–2830.
[23] FΓ‘bio Perez, RΓ©mi Lebret, and Karl Aberer. 2018. Cluster-Based Active Learning. CoRR abs/1812.11780 (2018). arXiv:1812.11780
[24] Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher RΓ©. 2017. Snorkel: Rapid Training Data Creation with Weak Supervision. CoRR abs/1711.10160 (2017). arXiv:1711.10160
[25] Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active Hidden Markov Models for Information Extraction. In IDA '01. Springer-Verlag, London, UK, 309–318.
[26] Thomas Schmitz, Dietmar Jannach, Birgit Hofer, Patrick W. Koch, Konstantin Schekotihin, and Franz Wotawa. 2017. A decomposition-based approach to spreadsheet testing and debugging. In Visual Languages and Human-Centric Computing. IEEE Computer Society, 117–121.
[27] Greg Schohn and David Cohn. 2000. Less is More: Active Learning with Support Vector Machines. Machine Learning – International Workshop then Conference (10 2000).
[28] Burr Settles. 2010. Active learning literature survey. Computer Sciences Technical Report 1648 (07 2010).
[29] H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by Committee. In COLT '92. ACM, New York, NY, USA, 287–294. https://doi.org/10.1145/130385.130417
[30] C. E. Shannon. 2001. A Mathematical Theory of Communication. SIGMOBILE Mob. Comput. Commun. Rev. 5, 1 (Jan. 2001), 3–55. https://doi.org/10.1145/584091.584093
[31] Alexey O Shigarov and Andrey A Mikhailov. 2017. Rule-based spreadsheet data transformation from arbitrary to relational tables. Information Systems 71 (2017), 123–136.
[32] Paroma Varma and Christopher RΓ©. 2018. Snuba: Automating Weak Supervision to Label Training Data. Proc. VLDB Endow. 12, 3 (Nov. 2018), 223–236. https://doi.org/10.14778/3291264.3291268
[33] Xinxin Wang. 1996. Tabular abstraction, editing, and formatting. Technical Report. University of Waterloo, Waterloo, Ontario, Canada.
[34] Jingbo Zhu, Huizhen Wang, Eduard Hovy, and Matthew Ma. 2010. Confidence-based Stopping Criteria for Active Learning for Data Annotation. ACM Trans. Speech Lang. Process. 6, 3, Article 3 (April 2010), 24 pages. https://doi.org/10.1145/1753783.1753784