               Active Learning for Spreadsheet Cell Classification
Julius Gonsior, Josephine Rehak, Maik Thiele, Elvis Koci, Michael Günther and Wolfgang Lehner
                                                                Technische Universität Dresden
                                                              firstname.lastname@tu-dresden.de
ABSTRACT
Spreadsheets are among the most successful content generation tools, used in almost every enterprise to create a plethora of semi-structured data. However, this information is often intermingled with various formatting, layout, and textual metadata, making it hard to identify and extract the actual tabularly structured payload. For this reason, automated information extraction from spreadsheets is a challenging task. Previous papers proposed cell classification as a first step of the table extraction process, which, however, requires a substantial amount of labeled training data that is expensive to obtain. Therefore, in this paper we investigate a semi-supervised approach called Active Learning (AL) that can be used to train classification models by selecting only the most informative examples from an unlabeled dataset. In detail, we implement an AL cycle for spreadsheet cell classification by investigating different selection strategies and stopping criteria. We compare the performance of various AL strategies and derive guidelines for semi-supervised cell classification. Our experiments show that by implementing AL for cell classification, we are able to reduce the amount of training data by 90% without any accuracy losses compared to a passive classifier.

KEYWORDS
Information Extraction, Active Learning, Semi-supervised, Machine Learning, Classification, Spreadsheets

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 INTRODUCTION
Spreadsheets are powerful content generation tools, assisting novices and professionals alike. They contain data that are roughly relational, but accompanied by various formatting, layout, and textual metadata. Thus, the generated content is primarily designed for human consumption and carries a lot of implicit information. Due to these reasons, automatic table extraction and recognition [2, 5, 17] for spreadsheets is a very difficult task. The most crucial step for all table recognition approaches is cell classification, determining for each non-empty cell its role (data, header, metadata etc.) within the spreadsheet. However, the training of classification models relies on human-labeled training data, which involves extensive human effort. In this paper, we therefore propose a semi-supervised approach for cell classification. In detail, we propose an active learning (AL) cycle that is able to determine the optimal amount of training data needed to train a cell classification model.

[Figure 1: Active Learning Cycle for Cell Classification]

Figure 1 sketches the AL cycle and its main steps. The AL process starts with two sets of data: the unlabeled sample set 𝑈 and the already labeled dataset 𝐿, with |𝐿| << |𝑈|. Initially, the learner is trained on the small labeled dataset 𝐿. The main task of the AL cycle is to iteratively increase the set of labeled data 𝐿 by identifying the most promising cells in 𝑈. The cycle should stop as soon as the trained classifier reaches the best accuracy. The AL cycle is performed by two main actors: a learner and an oracle. In our case, the learner is a cell classifier which is continuously retrained on the newly labeled data 𝐿𝑞. The oracle is the label-providing entity, i.e. a human annotator providing the gold standard. Based on the utilized query strategy (see Section 3.1), an unlabeled sample 𝑈𝑞 is chosen from the unlabeled pool 𝑈. The purpose of this step is to identify those cells that contribute most to the classifier to be trained. All other cells remain unlabeled, potentially saving a lot of human effort. The cells proposed by the query strategy, 𝑈𝑞, are then given to the oracle, resulting in 𝐿𝑞. The newly labeled data 𝐿𝑞 is added to 𝐿 and the process starts again by retraining the classifier on the extended dataset. The AL cycle proceeds until a stopping criterion (see Section 3.2) is met or until 𝑈 is out of queries.
Our results show that the classification models trained on the data determined by the AL cycle outperform the passive classifier and reduce the amount of training data by 90%.

Outline. The remainder of this paper is organized as follows: In Section 2, we shortly review the task of cell classification for spreadsheets. The individual parts of the AL cycle for cell classification are introduced in Section 3. In detail, we discuss different query strategies, stopping criteria and batch sizes, evaluated in Section 4. We additionally evaluate varying start set sizes and give a recommendation on how to implement the AL cycle for cell classification. Finally, we present related work in Section 5 and conclude in Section 6.

2 CELL CLASSIFICATION FOR SPREADSHEETS
The objective of capturing the tabular data embedded in spreadsheets can be treated as a classification problem where the specific structures of a table have to be identified. Given a spreadsheet, the goal is to classify each non-empty cell with a given set of labels. The set of labels we use in this work is described in Section 2.1. Section 2.2 briefly presents the cell features used to train a classifier.

2.1 Cell Labels
The research literature [2, 5, 33] basically defines seven roles for non-empty cells: Data, Header, Derived, GroupHeader, Title, Note, and Other. However, for the specific task of table identification not
all of these roles are equally important, and most approaches just rely on the main table building blocks: header, data, and metadata cells. Header cells give names to columns, describing the values below them. They can be nested, occupying several consecutive rows. Data cells follow the structure defined by header cells and contain the table’s payload. Metadata cells provide additional information about the sheet as a whole, or about a subset of its values. Some typical examples are footnotes and table titles.

2.2 Cell Features
The label of a cell is determined by a classifier which makes use of cell features such as formatting, content types, formula references, and additional context from nearby cells [2, 5]. In total, we consider 159 available features to train our cell classifiers. A detailed explanation of all cell features can be found in [19].

2.3 Training Dataset
For our experiments we used the DECO dataset [18], an existing corpus of labeled spreadsheets that can be used to train classification models and additionally, in the context of our work, to simulate an oracle (for more information see Section 3) in the AL cycle. DECO contains 1,165 labeled spreadsheets which have been derived from the Enron corpus [13]. The time to annotate a sheet within a spreadsheet has been logged during the creation of DECO. On average, the annotation takes 4.3 minutes per spreadsheet, with a maximum value of 284.4 minutes. This shows the complexity of the cumbersome manual labeling task. It is clear that a reduction of to-be-labeled spreadsheets will drastically improve the cell classification approach and thus the overall table extraction process.
3 ACTIVE LEARNING FOR SPREADSHEET CELL CLASSIFICATION
To implement an AL cycle (shown in Figure 1) for a given supervised machine learning task, such as cell classification, one has to decide on a query strategy, a stopping criterion and a certain batch size. The query strategy selects the samples which should be labeled next by the oracle. In Section 3.1, we give an overview of some popular query strategies which are later considered in our evaluation. To achieve our goal of reducing the human effort in labeling spreadsheet cells, we have to find the optimal point for stopping the AL cycle. Therefore, different stopping criteria are discussed in Section 3.2. Another issue that impacts the labeling effort is the batch size, i.e. the number of samples or cells labeled within each iteration of the AL cycle, discussed in Section 3.3.
3.1 Query Strategies
In this section, we shortly introduce the different strategies for choosing the most informative queries. Each strategy approximates the informativeness of unlabeled data for a potential classifier.
Random Sampling. Random sampling is a common AL query strategy and found application in [3, 7, 27]. Unlike the other methods, random sampling chooses queries at random, fully independently of their informativeness. However, even with this strategy a rise in prediction accuracy is possible, since the amount of training data is steadily increased. We use random sampling as a baseline to compare the other strategies against.
Uncertainty Sampling. Uncertainty sampling chooses queries which are the most uncertain to predict. Hence, learning these queries should result in more certain predictions of the classifier. We compare three uncertainty metrics: least confident, margin sampling and entropy [28]. Least confident [20] tries to capture the probability that the classifier is mislabeling the data, using the posterior probability P where ŷ is the most likely prediction:

    U_{q,LC} = \arg\max_{x \in U} \big(1 - P(\hat{y} \mid x)\big)    (1)

Information about other classes next to the most probable one is not taken into account by this strategy. Margin sampling [25], in contrast, uses the posteriors for the first and second most probable classes and samples the instances with the smallest margin:

    U_{q,M} = \arg\min_{x \in U} \big(P(\hat{y}_1 \mid x) - P(\hat{y}_2 \mid x)\big)    (2)

Entropy uncertainty [30] uses all possible classes and captures the entropy of the given distribution:

    U_{q,E} = \arg\max_{x \in U} \Big(-\sum_i P(y_i \mid x) \log P(y_i \mid x)\Big)    (3)
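Assuming a Scikit-learn classifier that exposes predict_proba, the three metrics of Equations (1)-(3) can be computed as in the following sketch (function and variable names are ours, not part of a fixed API):

    import numpy as np

    def uncertainty_query(clf, X_unlab, metric="entropy"):
        proba = clf.predict_proba(X_unlab)      # posteriors P(y_i|x) per unlabeled cell
        if metric == "least_confident":
            scores = 1.0 - proba.max(axis=1)                         # Eq. (1)
        elif metric == "margin":
            part = np.sort(proba, axis=1)
            scores = -(part[:, -1] - part[:, -2])                    # Eq. (2), negated so argmax applies
        else:                                    # "entropy"
            scores = -np.sum(proba * np.log(proba + 1e-12), axis=1)  # Eq. (3)
        return int(np.argmax(scores))            # index of the most informative cell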
Query-by-Committee. In contrast to the other strategies, the Query-by-Committee [29] approach maintains multiple models in parallel, called committee members. They are all trained on the same current labeled set 𝐿 and then vote on the query candidates. The vote is conducted by letting all committee members predict all unlabeled samples and measuring a controversy score for each sample. The most controversial query is considered the most informative. For the measurement of disagreement, various methods exist. In this paper, we choose vote entropy as proposed by [9]:

    U_{q,VE} = \arg\max_{x \in U} \Big(-\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C}\Big)    (4)

whereby C denotes the committee size, y_i ranges over all possible labels and V(y_i) is the number of times a label was predicted by a classifier. This scoring method scores dissenting votes highest and concurring votes lowest. Our applied committee consisted of three Random Forest classifiers with different random initializations, one Naïve Bayes, and one Support Vector Machine.
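A sketch of the vote entropy computation for such a committee of fitted Scikit-learn classifiers (again under assumed names):

    import numpy as np

    def vote_entropy_query(committee, X_unlab):
        # V(y_i): per sample, count how often each label was predicted
        votes = np.stack([clf.predict(X_unlab) for clf in committee])  # shape (C, n)
        C = len(committee)
        scores = np.empty(X_unlab.shape[0])
        for j in range(X_unlab.shape[0]):
            _, counts = np.unique(votes[:, j], return_counts=True)
            ratios = counts / C                              # V(y_i) / C
            scores[j] = -np.sum(ratios * np.log(ratios))     # Eq. (4)
        return int(np.argmax(scores))      # the most controversial sample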
3.2 Stopping Criteria
So far, the AL cycle would stop only if the set of unlabeled data 𝑈 runs out of cells. Obviously, this leads to no reduction in labeling effort. Therefore, we introduce stopping criteria that are able to detect whether proceeding with the AL cycle no longer results in any accuracy gain. Additionally, it can be shown that by reducing the amount of training data, we prevent overfitting and the test accuracy is increased (Section 4.6). In this paper, we investigate three different stopping criteria (SC) from [34].
Maximum Uncertainty. Maximum Uncertainty uses the same measurement as the least confident uncertainty sampling strategy in Section 3.1. The basic idea is that over multiple learning cycles the classifier's uncertainty should decrease, as the classifier gains more knowledge about its data. Therefore, the stopping criterion uses the uncertainty of the most uncertain sample in 𝑈 as measure. If the uncertainty value drops below a user-defined threshold, the stopping criterion ends the AL cycle:

    SC_{MU} = \begin{cases} 1 & \text{if } U_q \in U \text{ and } UM(U_q) \le \theta_{MU} \\ 0 & \text{otherwise} \end{cases}    (5)

UM(U_q) is the method retrieving the uncertainty of a sample U_q, and θ_MU is the predefined threshold. For batches we used the smallest uncertainty of the batch; other common strategies are the mean or the median.

Selected Accuracy Method. This method focuses on the classification accuracy of the chosen queries U_q. It assumes that the classifier always chooses the most difficult queries for labeling. Hence, the model's predictions on the most uncertain queries should reflect the classifier's certainty. The AL process stops if the classifier successfully predicts the labels of those most difficult queries:

    SC_{SA} = \begin{cases} 1 & \text{if } acc_{|U_q|}(C) \ge \theta_{SA} \\ 0 & \text{otherwise} \end{cases}    (6)

acc_m(C) determines, with the help of the oracle, the accuracy of the classifier's predictions on the queries, where m denotes the iteration of the AL cycle of the query and θ_SA is the threshold. For batches, the mean accuracy over the cells is used.

Standard Deviation-based Stopping. In contrast to the other approaches, Standard Deviation-based Stopping [6] assesses multiple cycles for trend detection. Two criteria have to be met for stopping: first, the standard deviation of the accuracies for the last five U_q has to be lower than a predefined threshold θ_1 and, secondly, the accuracy of the current U_q has to be larger than the threshold θ_2. The first criterion identifies plateaus in the query accuracy and the second one prevents local optima.

    SC_{\sigma} = \begin{cases} 1 & \text{if } \sigma(acc_{n-5 \ldots n}) \le \theta_1 \wedge acc_n \ge \theta_2, \text{ for } n \ge 5 \\ 0 & \text{otherwise} \end{cases}    (7)

The current cycle under inspection is denoted with n.
Since the stopping criteria described above all return Boolean values, they can be used in combination in the form of logical expressions.
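As a sketch, the three criteria can be written as Boolean functions; the threshold defaults below are illustrative only, since the paper leaves the concrete values user-defined:

    import numpy as np

    def sc_max_uncertainty(clf, X_unlab, theta_mu=0.1):
        # Eq. (5): stop once even the most uncertain sample in U is confident enough
        least_confident = 1.0 - clf.predict_proba(X_unlab).max(axis=1)
        return least_confident.max() <= theta_mu

    def sc_selected_accuracy(y_pred_query, y_true_query, theta_sa=0.9):
        # Eq. (6): stop if the supposedly hardest queries are predicted correctly
        return np.mean(y_pred_query == y_true_query) >= theta_sa

    def sc_standard_deviation(query_accuracies, theta_1=0.01, theta_2=0.9):
        # Eq. (7): plateau detection over the accuracies of the last five queries
        if len(query_accuracies) < 5:
            return False
        last = np.asarray(query_accuracies[-5:])
        return last.std() <= theta_1 and last[-1] >= theta_2

A combined criterion is then a plain logical expression, e.g. sc_standard_deviation(accs) and sc_selected_accuracy(y_pred, y_true), which matches the recommendation derived later in Section 4.6.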
3.3 Batch Sizes
It is common practice in machine learning to train a model on batches of samples instead of single data points. In the later evaluation (Section 4.4) we compare simple top-k batches and so-called spreadsheet batches: A top-k batch consists of the top-k cells selected by the query strategy (Section 3.1). However, this strategy has the disadvantage that the selected cells potentially belong to a large number of spreadsheets, which leads to higher effort for the oracle, i.e. the human annotator. As reported in Section 2.3, it takes several minutes to label a spreadsheet, and most of that time is needed to understand the spreadsheet content and structure. For this reason, we propose to provide just one spreadsheet to the human annotator within an AL cycle, i.e. a batch consists of all cells from this single spreadsheet. To select the best spreadsheet within an AL cycle, we adapt our query strategy metrics QS (see Section 3.1) for spreadsheet batches as follows, where U_i denotes the set of unlabeled cells of spreadsheet i and SB denotes the spreadsheet to be used as U_q:

    SB = \arg\max_i \frac{1}{|U_i|} \sum_{x \in U_i} QS(x)    (8)
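Equation (8) amounts to a grouped mean over the per-cell scores of the chosen query strategy; a sketch with assumed array inputs:

    import numpy as np

    def select_spreadsheet(qs_scores, sheet_ids):
        # qs_scores: informativeness QS(x) of every unlabeled cell
        # sheet_ids: the spreadsheet each unlabeled cell belongs to
        best_sheet, best_mean = None, -np.inf
        for sheet in np.unique(sheet_ids):
            mean_score = qs_scores[sheet_ids == sheet].mean()  # (1/|U_i|) sum QS(x), Eq. (8)
            if mean_score > best_mean:
                best_sheet, best_mean = sheet, mean_score
        return best_sheet   # SB: all unlabeled cells of this sheet form U_q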
                                                                           search over the hyperparameter and five-fold cross-validation.
4 EVALUATION
To evaluate our semi-supervised approach for spreadsheet cell classification, we performed a set of experiments on the DECO dataset (Section 2.3). To use this dataset for cell classification, three changes have been made: 1) As already described in Section 2.1, the seven layout roles have to be merged into the three broader categories data, header, and metadata. 2) Due to the very high amount of data cells in DECO, the data class would get an increased prior probability. Therefore, we decided to limit the number of data cells to 100 per spreadsheet, chosen randomly. 3) Cells with the label other were not included in our experiments, since they are too ambiguous and not helpful for table identification. The original dataset contains 98.68% data cells, 1.18% header cells and 0.14% metadata cells, with a total of 1,393,398 cells. After performing the aforementioned steps, the distribution is 83.27% data cells, 14.99% header cells and 1.74% metadata cells.
We implemented the AL cycle sketched in Figure 1 in Python in the following way: the learner consists of several classification models such as Decision Trees, Random Forest, Naïve Bayes and SVMs, using their Scikit-learn [22] implementations. To perform AL on the DECO dataset, we split it into 80% used for 𝐿 and 𝑈 as well as 20% for testing. Within the experiments, we usually set |𝐿| to 1% and |𝑈| to 99% (of the overall 80%). Since our dataset is already completely annotated, our implemented oracle can automatically provide the requested label, i.e. for a given 𝑈𝑞 provide 𝐿𝑞. While this setup is perfect to test the different configurations of the AL cycle, it is important to note that in a real-world setting the oracle will always be a human annotator.
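A minimal sketch of this experimental setup, assuming the cell feature matrix X and labels y as NumPy arrays (the helper name make_al_setup is ours):

    from sklearn.model_selection import train_test_split

    def make_al_setup(X, y, seed=0):
        # 80% of the cells form the AL pool (L and U), 20% are held out for testing
        X_pool, X_test, y_pool, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        # within the pool, 1% serves as the initial labeled start set L
        X_lab, X_unlab, y_lab, y_unlab = train_test_split(
            X_pool, y_pool, train_size=0.01, random_state=seed)
        # y_unlab acts as the simulated oracle: DECO is fully annotated, so a
        # requested label L_q can simply be looked up; in a real-world setting
        # this lookup is replaced by a human annotator
        return X_lab, y_lab, X_unlab, y_unlab, X_test, y_test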


In Section 4.1, we first propose a metric to measure the performance of an AL cycle in terms of accuracy improvement and sample size reduction. Before looking at the AL strategies, we report in Section 4.2 on the accuracy of passive learners, i.e. classifiers using the whole DECO dataset for training. In Section 4.3, we study the impact of different start set sizes on the learning curves. Varying batch sizes, sampling strategies and stopping criteria are investigated in Section 4.4, Section 4.5 and Section 4.6.

4.1 Slope Calculation
There does not exist a standardized numeric measure to report on AL performance. A common method in AL research is to compare smoothed accuracy graphs. However, to compare different settings in a faster way, many papers also rate the classifier performance by looking at the start and end point of the learning curve. To take into account also the reduction of the human effort in labeling data, we additionally divide the gained accuracy by the number of queries that have been used to achieve this result. The measurement is denoted as slope:

    slope = \frac{acc_{end} - acc_{start}}{|queries|}    (9)

acc_start is the classifier's accuracy before any AL cycle passes. acc_end is determined by the stopping criteria (see Section 3.2) or, in the worst case, corresponds to the point when the unlabeled sample set 𝑈 is out of data.
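As a sketch, the slope computation is a one-liner (function name ours):

    def slope(acc_start, acc_end, n_queries):
        # Eq. (9): accuracy gain divided by the number of queries needed for it;
        # e.g. standard deviation stopping in Table 2: 11.19 / 70 ≈ 1.599e-01
        # (accuracies measured in percent)
        return (acc_end - acc_start) / n_queries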

4.2 Passive Learning
In a first experiment, we compared the performance of different classifiers trained on the whole dataset 𝐿, without applying the AL cycle. In detail, we investigated Decision Trees (DT), Random Forest (RF), Naïve Bayes (NB), Support Vector Machines (SVM), and Multi-Layer Perceptrons (MLP), using a standard random search over the hyperparameters and five-fold cross-validation. Table 1 lists the achieved classification results. The data has been downsampled and the data points of the underrepresented classes have been used as weights for the accuracies of the header and metadata cells, as the F1-scores would otherwise be significantly lower compared to the F1-scores of the data class. Even though MLP and SVM achieved the best average F1-score, we used the close second best option RF for the remaining AL experiments, since its training time is significantly lower. In the following experiments, the shown baseline is the accuracy of a passive classifier trained on the whole dataset 𝐿.

                              Decision   Naïve   Random     SVM      MLP
                                Tree     Bayes   Forest
    Data                        0.95     0.91     0.95      0.95     0.95
    Header                      0.75     0.69     0.76      0.78     0.78
    Metadata                    0.23     0.35     0.46      0.47     0.52
    Weighted Avg. F1-Score      0.90     0.87     0.91      0.92     0.92
    Avg. training time (sec)    0.15     0.09     0.33    112.52   490.91

Table 1: F1-Scores of the Passive Classifiers
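A sketch of how such a passive baseline can be tuned with Scikit-learn, assuming the 159-dimensional feature matrix X and cell labels y; the concrete parameter ranges are illustrative, since the experiments only fix "random search with five-fold cross-validation":

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    def tune_passive_rf(X, y):
        param_dist = {"n_estimators": randint(10, 200),   # illustrative ranges
                      "max_depth": randint(2, 30)}
        search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                                    n_iter=20, cv=5, scoring="f1_weighted")
        return search.fit(X, y)    # random search with five-fold CV, weighted F1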
                                                                         sampling show the highest increase while the latter achieved the
4.3 Start Set Size
We compare three different start set sizes for 𝐿: 1%, 5% and 10%; the batch size is one spreadsheet at a time, and the query strategy is entropy-based uncertainty. Figure 2 shows that a smaller start set size leads to a higher increase in accuracy at the beginning of the AL cycle. Nevertheless, after around 100 queried sheets the graphs have almost the same shape for all start sizes. In terms of reducing human effort, we can therefore state that a small start size is more beneficial and leads to the same or even higher accuracies.

[Figure 2: Comparison of different start sizes and the passive classifier as baseline (accuracy vs. # queried sheets)]
4.4 Batch Size
Instead of querying the oracle for just one new labeled data point during a single AL cycle, the classifier is refitted on a whole batch of newly labeled cells. Different batch sizes, respectively 50, 150, 250 and all cells of a single spreadsheet, are compared. The start set size for this experiment is 1%, and the entropy-based uncertainty query strategy is used (Section 3.1). Figure 3 shows that the best results can be achieved with a batch size of 50. However, the difference between cell-based batches and the batches consisting of all cells that belong to a single spreadsheet is very small. Therefore, we advise to use a batch size of one single spreadsheet per AL cycle in order to reduce the effort for the human annotator.

[Figure 3: Comparison of different batch sizes and the passive classifier as baseline (accuracy vs. # batches)]
4.5 Sampling Strategies
Figure 4 shows the classification accuracies for the different sampling strategies, all with the same start size of 1% and spreadsheet-wise batches. One can see a clear difference between the random strategy and the other strategies: for the random strategy, the accuracy goes down after a few queried sheets. Entropy-based uncertainty sampling, uncertainty margin sampling and committee sampling show the highest increase, while the latter achieved the best results and is therefore the recommended strategy.

[Figure 4: Accuracies for different sampling strategies and the passive classifier as baseline (accuracy vs. # queried sheets)]

4.6 Stopping Criteria
For evaluating the stopping criteria, we use a start size of 1% and committee sampling. Figure 5 shows the values of the three stopping metrics proposed in Section 3.2. The markers indicate the stopping points resulting from these stopping metrics. The vertical dashed line denotes the optimal stopping point derived from the peak of the learning curve for the test data. It should be noted again that the learning curve is only available in our experimental setting. In practice, the stopping point has to be determined by the stopping criteria, which rely just on the output of the classifier.

[Figure 5: Comparison of the stopping strategies Max Uncertainty, Standard Deviation and Selected Accuracy together with the test accuracy of committee sampling (mauve) and the optimal stopping point (marked using a vertical dashed line)]

For sheet-based batches, the maximum uncertainty stopping values are quite consistent; only for the last queries a drop is noticeable. The reason is probably an overall consistent average uncertainty over all unlabeled sheets. The selected accuracy metric fluctuates quite a lot and depends heavily on the sheet, but an upward trend is nevertheless noticeable the more labeled training data becomes available. The standard deviation of the last five classification accuracies of the queried spreadsheets is quite stable, having some valleys. This stopping criterion provides a robust categorization of the accuracies, but cannot be used to distinguish between plateaus and valleys. Even though in our case the minimum of the standard deviation is almost exactly at
the optimum stopping point, for the general case we recommend to use standard deviation as a necessary criterion and selected accuracy as a sufficient criterion to detect a continuous peak.
To see the amount of effort which can be reduced by AL, we calculated the slopes for the stopping points as listed in Table 2. The best results were achieved with standard deviation stopping, resulting in an increase of the accuracy by +11.19% using a start set of just 6 spreadsheets and a training set of 70 spreadsheets, compared to 671 spreadsheets for the passive classifier.

                                     acc Δ %   |queries|     slope
    No Stopping                        3.00        671     4.474e-03
    Maximum Uncertainty Stopping       3.40        661     5.144e-03
    Standard Deviation Stopping       11.19         70     1.599e-01
    Selected Accuracy Stopping        10.88         43     2.530e-01

Table 2: Comparison of different learning slopes (spreadsheet batches, committee sampling, 1% start size)
4.7 Discussion
The following insights can be derived from our experiments: As a first step, we recommend a comparison of different classifiers. For our scenario of cell classification, we decided on a Random Forest classifier, providing high accuracy and fast training time. Compared to weak-supervision approaches (see Section 5), active learning requires a start set of already labeled data. However, our experiments showed that even a very small start set consisting of only 6 spreadsheets is totally sufficient. For cell classification, we decided to use one spreadsheet per batch to reduce the annotator effort. Regarding the sampling strategies, we can conclude that except for the random approach all strategies worked well. The best slope was achieved using the committee strategy, which is therefore our recommended one. It is, however, noteworthy that this strategy requires the highest amount of computing resources, since several classifiers have to be trained in parallel. We could not identify a single stopping criterion which suffices for identifying nearly optimal stopping points. Therefore, our recommendation is a combination of a high selected accuracy and a low standard deviation of the query accuracies.
                                                                                                       ever, our experiments with the open-source code base of Snuba
5 RELATED WORK
Unlabeled Spreadsheet Corpora. We find multiple unlabeled spreadsheet corpora in the literature: Euses [12] has served the spreadsheet community for a while. It was created with the help of search engines, issuing queries containing keywords such as “financial” and “inventory”, and file type “.xls”. Overall, it comprises 4,498 unique spreadsheets, organized into categories (folders) based on the used keywords. The more recent Enron corpus [13] contains 15,770 spreadsheets, extracted from the Enron email archive (http://info.nuix.com/Enron.html). This corpus is unique regarding its exclusive view on the use of spreadsheets in enterprise settings. Another recent corpus is Fuse [4], which comprises 249,376 unique spreadsheets, extracted from Common Crawl (http://commoncrawl.org/). Each spreadsheet is accompanied by a JSON file, which includes NLP token extraction and metrics related to the use of formulas. So far, these three corpora have been used by researchers viewing spreadsheets from a software engineering perspective. Formula error detection and debugging [15, 26], but also usage, life-cycle, modeling, and governance of spreadsheets [1, 8, 14] are important research subjects within this community.
Table Recognition. In the literature we find various approaches with regard to layout inference and table recognition in spreadsheets as prerequisites for information extraction. These approaches can be classified as rule-based, heuristics-based, and ML-based ones. Here, we consider those that are ML-based, like [11, 19]. More recent publications apply machine learning techniques to some extent [2, 5, 6, 19]. Other works make use of domain-specific languages, such as [16], [31], and [14].
Active Learning. The existing AL approaches can be grouped in three categories: Membership Query Synthesis, Stream-Based Selective Sampling, and Pool-Based Sampling [28]. Membership Query Synthesis assumes that new data points can arbitrarily be created and the learner can request any data point from the problem space. For Stream-Based Selective Sampling, a data point is sampled from the problem space and the learner can decide if the generated sample should be labeled or discarded. Both approaches are based on the assumption that arbitrary real-world data points can be produced in an inexpensive way, which is not the case in our scenario. Pool-Based Sampling instead assumes a fixed pool of labeled data 𝐿 and a large pool of unlabeled data 𝑈 from the problem space, which holds true for our spreadsheet dataset. Therefore, in this paper we implemented a pool-based sampling approach. In the context of spreadsheets, [6] proposed a rule-assisted AL approach to detect certain properties. The authors introduce a hybrid approach that combines active learning with crude, easy-to-write, user-provided rules.
Weak Supervision. The common notion behind weak supervision techniques is that it is much easier to generate a large amount of so-called weakly-labeled data than a small amount of well-labeled data, and that more data always leads to better results. Under this assumption, multiple approaches have been proposed: In Snorkel [24], user-defined labeling functions (LFs) are used to label given data points. Writing LFs, however, requires experts with domain knowledge, in contrast to spreadsheet cell labeling, which can be done by non-experts as well. Additionally, Snorkel users often have to provide a development set of labeled data providing quick approximate signals on the accuracies of the LFs. Hence, the human effort is not necessarily lower compared to AL.
As successor of Snorkel, Snuba [32] was created, where the noisy LFs are automatically derived from a small labeled dataset. However, our experiments with the open-source code base of Snuba
have shown some limitations. In its current state, it is not easily applicable to problems with more than two labels, and it seems to work well only for certain types of problems, to which the task of classifying spreadsheet cells does not belong: First, our dataset is quite unbalanced, which results in LFs that focus only on the majority class; second, with a growing number of labeling classes the LFs again tend to focus only on the most common label. While investigating this, we discovered multiple issues in the current implementation of Snuba which prevent the overall sound concept from working well in a generalized setting.
Other methods include assigning a label per feature instead of assigning a label per data point [10], which is suitable in a setting of textual data, where it is easy to specify distinctive terms per label class, but not in the case of classifying spreadsheet cells. Another approach [21, 23] first clusters the input data and then assigns a label per cluster based on a fraction of labeled data points per cluster. Our problem space has a dimension of 159 features, and early experiments showed that it is not easy to identify appropriate clusters in such a high-dimensional space.
6 CONCLUSIONS
In this paper, we proposed a semi-supervised approach for cell classification that drastically reduces the amount of training data without any losses in accuracy. In detail, we investigated different query strategies, start set sizes, stopping criteria, and batch sizes to implement the AL cycle. With the DECO corpus, consisting of 1,165 labeled spreadsheets, we tested all these configurations and derived good settings for them. Using committee sampling, a start set size of 1% and the selected accuracy stopping criterion, we are able to reduce the size of the training dataset by 90%. In absolute numbers, our AL cell classification approach obtains the same accuracies using just 76 labeled spreadsheets (6 sheets in the start set and 70 selected by the AL approach) compared to the passive classifier trained on 1,165 spreadsheets. Therefore, the effort for the human annotators was reduced from 3,029 to 197 minutes. Given this reduction of annotation work, and consequently in annotation costs, it now becomes feasible to apply ML-driven cell classification and table identification in practice.
REFERENCES
[1] Robin Abraham and Martin Erwig. 2006. Inferring templates from spreadsheets. In Proceedings of the 28th International Conference on Software Engineering. ACM, 182–191.
[2] Marco D Adelfio and Hanan Samet. [n.d.]. Schema extraction for tabular data on the web. VLDB’16 ([n.d.]), 421–432.
[3] Yoram Baram, Ran El-Yaniv, and Kobi Luz. 2004. Online Choice of Active Learning Algorithms. J. Mach. Learn. Res. 5 (Dec. 2004), 255–291.
[4] Titus Barik, Kevin Lubick, Justin Smith, John Slankas, and Emerson Murphy-Hill. 2015. Fuse: a reproducible, extendable, internet-scale corpus of spreadsheets. In MSR’15. 486–489.
[5] Zhe Chen and Michael Cafarella. [n.d.]. Automatic web spreadsheet data extraction. In SSQ’13. 1.
[6] Zhe Chen, Sasha Dadiomov, Richard Wesley, Gang Xiao, Daniel Cory, Michael Cafarella, and Jock Mackinlay. 2017. Spreadsheet Property Detection With Rule-assisted Active Learning. In CIKM ’17. ACM, New York, NY, USA, 999–1008. https://doi.org/10.1145/3132847.3132882
[7] David Cohn, Les Atlas, and Richard Ladner. 1994. Improving generalization with active learning. Machine Learning 15, 2 (01 May 1994), 201–221. https://doi.org/10.1007/BF00993277
[8] Jácome Cunha, João Paulo Fernandes, Jorge Mendes, and João Saraiva. 2012. MDSheet: A framework for model-driven spreadsheet engineering. In ICSE. IEEE Press, 1395–1398.
[9] Ido Dagan and Sean P. Engelson. 1995. Committee-based Sampling for Training Probabilistic Classifiers. In ICML’95. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 150–157.
[10] Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proc. EMNLP. ACL Press, 81–90.
[11] Julian Eberius, Christopher Werner, Maik Thiele, Katrin Braunschweig, Lars Dannecker, and Wolfgang Lehner. [n.d.]. DeExcelerator: A framework for extracting relational data from partially structured documents. In CIKM’13. 2477–2480.
[12] Marc Fisher and Gregg Rothermel. [n.d.]. The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. In SIGSOFT’05, Vol. 30. 1–5.
[13] Felienne Hermans and Emerson Murphy-Hill. 2015. Enron’s spreadsheets and related emails: A dataset and analysis. In ICSE. IEEE Press, 7–16.
[14] Felienne Hermans, Martin Pinzger, and Arie van Deursen. [n.d.]. Automatically extracting class diagrams from spreadsheets. ECOOP’10 ([n.d.]), 52–75.
[15] Felienne Hermans, Martin Pinzger, and Arie van Deursen. 2015. Detecting and refactoring code smells in spreadsheet formulas. Empirical Software Engineering 20, 2 (2015), 549–575.
[16] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive visual specification of data transformation scripts. In SIGCHI. ACM, 3363–3372.
[17] Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. 2019. A Genetic-based Search for Automatic Table Recognition in Spreadsheets. In ICDAR’19.
[18] Elvis Koci, Maik Thiele, Josephine Rehak, Oscar Romero, and Wolfgang Lehner. 2019. DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition. In ICDAR’19.
[19] Elvis Koci, Maik Thiele, Óscar Romero Moral, and Wolfgang Lehner. 2016. A machine learning approach for layout inference in spreadsheets. In KDIR. SciTePress, 77–88.
[20] David D. Lewis. 1995. A Sequential Algorithm for Training Text Classifiers: Corrigendum and Additional Data. SIGIR Forum 29, 2 (Sept. 1995), 13–19. https://doi.org/10.1145/219587.219592
[21] Zhiliang Liu, Xiaomin Zhao, Jianxiao Zou, and Hongbing Xu. 2013. A Semi-Supervised Approach Based on k-Nearest Neighbor. Journal of Software 8 (04 2013). https://doi.org/10.4304/jsw.8.4.768-775
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. JMLR 12 (2011), 2825–2830.
[23] Fábio Perez, Rémi Lebret, and Karl Aberer. 2018. Cluster-Based Active Learning. CoRR abs/1812.11780 (2018). arXiv:1812.11780
[24] Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid Training Data Creation with Weak Supervision. CoRR abs/1711.10160 (2017). arXiv:1711.10160
[25] Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active Hidden Markov Models for Information Extraction. In IDA ’01. Springer-Verlag, London, UK, 309–318.
[26] Thomas Schmitz, Dietmar Jannach, Birgit Hofer, Patrick W. Koch, Konstantin Schekotihin, and Franz Wotawa. 2017. A decomposition-based approach to spreadsheet testing and debugging. In Visual Languages and Human-Centric Computing. IEEE Computer Society, 117–121.
[27] Greg Schohn and David Cohn. 2000. Less is More: Active Learning with Support Vector Machines. Machine Learning: International Workshop then Conference (10 2000).
[28] Burr Settles. 2010. Active learning literature survey. Computer Sciences Technical Report 1648 (07 2010).
[29] H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by Committee. In COLT ’92. ACM, New York, NY, USA, 287–294. https://doi.org/10.1145/130385.130417
[30] C. E. Shannon. 2001. A Mathematical Theory of Communication. SIGMOBILE Mob. Comput. Commun. Rev. 5, 1 (Jan. 2001), 3–55. https://doi.org/10.1145/584091.584093
[31] Alexey O Shigarov and Andrey A Mikhailov. 2017. Rule-based spreadsheet data transformation from arbitrary to relational tables. Information Systems 71 (2017), 123–136.
[32] Paroma Varma and Christopher Ré. 2018. Snuba: Automating Weak Supervision to Label Training Data. Proc. VLDB Endow. 12, 3 (Nov. 2018), 223–236. https://doi.org/10.14778/3291264.3291268
[33] Xinxin Wang. 1996. Tabular abstraction, editing, and formatting. Technical Report. University of Waterloo, Waterloo, Ontario, Canada.
[34] Jingbo Zhu, Huizhen Wang, Eduard Hovy, and Matthew Ma. 2010. Confidence-based Stopping Criteria for Active Learning for Data Annotation. ACM Trans. Speech Lang. Process. 6, 3, Article 3 (April 2010), 24 pages. https://doi.org/10.1145/1753783.1753784