Active Learning for Spreadsheet Cell Classification

Julius Gonsior, Josephine Rehak, Maik Thiele, Elvis Koci, Michael GΓΌnther and Wolfgang Lehner
Technische UniversitΓ€t Dresden
firstname.lastname@tu-dresden.de

Β© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Spreadsheets are among the most successful content generation tools, used in almost every enterprise to create a plethora of semi-structured data. However, this information is often intermingled with various formatting, layout, and textual metadata, making it hard to identify and extract the actual tabularly structured payload. For this reason, automated information extraction from spreadsheets is a challenging task. Previous papers proposed cell classification as a first step of the table extraction process, which, however, requires a substantial amount of labeled training data that is expensive to obtain. Therefore, in this paper we investigate a semi-supervised approach called Active Learning (AL) that can be used to train classification models by selecting only the most informative examples from an unlabeled dataset. In detail, we implement an AL cycle for spreadsheet cell classification by investigating different selection strategies and stopping criteria. We compare the performance of various AL strategies and derive guidelines for semi-supervised cell classification. Our experiments show that by implementing AL for cell classification, we are able to reduce the amount of training data by 90% without any accuracy losses compared to a passive classifier.

KEYWORDS
Information Extraction, Active Learning, Semi-supervised, Machine Learning, Classification, Spreadsheets

1 INTRODUCTION
Spreadsheets are powerful content generation tools, assisting novices and professionals alike. They contain data that are roughly relational, but accompanied by various formatting, layout, and textual metadata. Thus, the generated content is primarily designed for human consumption and carries a lot of implicit information. Due to these reasons, automatic table extraction and recognition [2, 5, 17] for spreadsheets is a very difficult task.

The most crucial step for all table recognition approaches is cell classification, determining for each non-empty cell its role (data, header, metadata etc.) within the spreadsheet. However, the training of classification models relies on human-labeled training data, which involves extensive human effort. In this paper, we therefore propose a semi-supervised approach for cell classification. In detail, we propose an active learning (AL) cycle that is able to determine the optimal amount of training data needed to train a cell classification model.

Figure 1: Active Learning Cycle for Cell Classification (learner and human oracle, query strategies such as random or uncertainty sampling, batch size, and stopping criteria)

Figure 1 sketches the AL cycle and its main steps. The AL process starts with two sets of data: the unlabeled sample set U and the already labeled dataset L, with |L| << |U|. Initially, the learner is trained on the small labeled dataset L. The main task of the AL cycle is to iteratively increase the set of labeled data L by identifying the most promising cells in U. The cycle should stop as soon as the trained classifier reaches the best accuracy. The AL cycle is performed by two main actors: a learner and an oracle. In our case, the learner is a cell classifier which is continuously retrained on the newly labeled data L_q. The oracle is the label-providing entity, i.e. a human annotator providing the gold standard. Based on the utilized query strategy (see Section 3.1) an unlabeled sample U_q is chosen from the unlabeled pool U. The purpose of this task is to identify such cells that contribute most to the classifier to be trained. All other cells remain unlabeled, potentially saving a lot of human effort. The cells proposed by the query strategy U_q are then given to the oracle, resulting in L_q. The newly labeled data L_q is added to L and the process starts again by retraining the classifier on the extended dataset. The AL cycle proceeds until a stopping criterion (see Section 3.2) is met or until U is out of queries.

Our results show that the classification models trained on the data determined by the AL cycle outperform the passive classifier and reduce the amount of training data by 90%.
Outline. The remainder of this paper is organized as follows: In Section 2, we shortly review the task of cell classification for spreadsheets. The individual parts of the AL cycle for cell classification are introduced in Section 3. In detail, we discuss different query strategies, stopping criteria and batch sizes, evaluated in Section 4. We additionally evaluate varying start set sizes and give a recommendation on how to implement the AL cycle for cell classification. Finally, we present related work in Section 5 and conclude in Section 6.

2 CELL CLASSIFICATION FOR SPREADSHEETS
The objective of capturing the tabular data embedded in spreadsheets can be treated as a classification problem where the specific structures of a table have to be identified. Given a spreadsheet, the goal is to classify each non-empty cell with a given set of labels. The set of labels we use in this work is described in Section 2.1. Section 2.2 briefly presents the cell features used to train a classifier.

2.1 Cell Labels
The research literature [2, 5, 33] basically defines seven roles for non-empty cells: Data, Header, Derived, GroupHeader, Title, Note, and Other. However, for the specific task of table identification not all of these roles are equally important and most approaches just rely on the main table building blocks: header, data, and metadata cells. Header cells give names to columns, describing the values below them. They can be nested, occupying several consecutive rows. Data cells follow the structure defined by header cells and contain the table's payload. Metadata cells provide additional information about the sheet as a whole, or about a subset of its values. Some typical examples are footnotes and table titles.

2.2 Cell Features
The label of a cell is determined by a classifier, which makes use of cell features such as formatting, content types, formula references, and additional context from nearby cells [2, 5]. In total, we consider 159 available features to train our cell classifiers. A detailed explanation of all cell features can be found in [19].
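The 159 features themselves are documented in [19] and are not repeated here. Purely as an illustration of the three feature groups named above (formatting, content types, and context from nearby cells), the following sketch derives a handful of simple per-cell features with openpyxl; the feature names and the openpyxl-based extraction are our own assumptions, not the feature set used in this work.

```python
from openpyxl import load_workbook

def toy_cell_features(xlsx_path):
    """Illustrative only: a few formatting, content-type and context features per
    non-empty cell (the full 159-feature set is described in [19])."""
    ws = load_workbook(xlsx_path).active
    features = {}
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is None:
                continue
            above = ws.cell(row=cell.row - 1, column=cell.column) if cell.row > 1 else None
            features[(cell.row, cell.column)] = {
                "is_bold": bool(cell.font and cell.font.bold),        # formatting
                "is_numeric": isinstance(cell.value, (int, float)),   # content type
                "is_formula": cell.data_type == "f",                  # formula reference
                "text_length": len(str(cell.value)),                  # content
                "empty_above": above is None or above.value is None,  # nearby-cell context
            }
    return features
```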
2.3 Training Dataset
For our experiments we used the DECO dataset [18], an existing corpus of labeled spreadsheets that can be used to train classification models and additionally, in the context of our work, to simulate an oracle (for more information see Section 3) in the AL cycle. DECO contains 1,165 labeled spreadsheets which have been derived from the Enron corpus [13]. The time to annotate a sheet within a spreadsheet has been logged during the creation of DECO. On average the annotation takes 4.3 minutes per spreadsheet, with a maximum of 284.4 minutes. This shows the complexity of the cumbersome manual labeling task. It is clear that a reduction of to-be-labeled spreadsheets will drastically improve the cell classification approach and thus the overall table extraction process.

3 ACTIVE LEARNING FOR SPREADSHEET CELL CLASSIFICATION
To implement an AL cycle (shown in Figure 1) for a given supervised machine learning task, such as cell classification, one has to decide on a query strategy, a stopping criterion and a certain batch size. The query strategy selects the samples which should be labeled next by the oracle. In Section 3.1, we give an overview of some popular query strategies which are later considered in our evaluation. To achieve our goal of reducing the human effort in labeling spreadsheet cells, we have to find the optimal point for stopping the AL cycle. Therefore, different stopping criteria are discussed in Section 3.2. Another issue that impacts the labeling effort is the batch size, i.e. the number of samples or cells labeled within each iteration of the AL cycle, discussed in Section 3.3.

3.1 Query Strategies
In this section, we shortly introduce the different strategies for choosing the most informative queries. Each strategy approximates the contained informativeness of unlabeled data for a potential classifier.

Random Sampling. Random sampling is a common AL query strategy and found application in [3, 7, 27]. Unlike the other methods, random sampling chooses queries at random and fully independently of their informativeness. However, even with this strategy a rise in prediction accuracy is possible, since the amount of training data is steadily increased. We use random sampling as a baseline to compare the other strategies, too.

Uncertainty Sampling. Uncertainty sampling chooses queries which are the most uncertain to predict. Hence, learning these queries should result in more certain predictions of the classifier. We compare three uncertainty metrics: least confident, margin sampling and entropy [28]. Least confidence [20] tries to capture the probability that the classifier is mislabeling the data, using the posterior probability P where \hat{y} is the most likely prediction:

    U_{q,LC} = \arg\max_{x \in U} \left( 1 - P(\hat{y} \mid x) \right)    (1)

Information about other classes next to the most probable one is not taken into account by this strategy. Margin sampling [25] in contrast uses the posteriors for the first and second most probable classes and samples the instances with the smallest margin:

    U_{q,M} = \arg\min_{x \in U} \left( P(\hat{y}_1 \mid x) - P(\hat{y}_2 \mid x) \right)    (2)

Entropy uncertainty [30] uses all possible classes and captures the entropy of a given distribution:

    U_{q,E} = \arg\max_{x \in U} \left( -\sum_i P(y_i \mid x) \log P(y_i \mid x) \right)    (3)
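As a minimal sketch of how these three metrics translate into an actual query selection, the snippet below scores an unlabeled pool with any scikit-learn-style classifier exposing predict_proba; the function and variable names are ours, not taken from the paper's implementation.

```python
import numpy as np

def least_confident(proba):
    # Eq. (1): 1 - P(y_hat | x); larger values mean more uncertain cells.
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # Eq. (2): P(y_hat_1 | x) - P(y_hat_2 | x); smaller values mean more uncertain cells.
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def entropy(proba):
    # Eq. (3): -sum_i P(y_i | x) log P(y_i | x); larger values mean more uncertain cells.
    return -(proba * np.log(np.clip(proba, 1e-12, None))).sum(axis=1)

def uncertainty_query(clf, X_unlabeled, strategy="entropy", k=1):
    """Return the indices of the k most informative unlabeled cells."""
    proba = clf.predict_proba(X_unlabeled)
    if strategy == "least_confident":
        scores = least_confident(proba)
    elif strategy == "margin":
        scores = -margin(proba)   # negate so that argmax corresponds to the smallest margin
    else:
        scores = entropy(proba)
    return np.argsort(scores)[-k:]
```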
Query-by-Committee. In contrast to the other strategies, the Query-by-Committee [29] approach maintains multiple models in parallel, called committee members. They are all trained on the same current labeled set L and then vote on the query candidates. The vote is conducted by letting all committee members predict all unlabeled samples and measuring a controversy score for each sample. The most controversial query is considered the most informative. For the measurement of disagreement various methods exist. In this paper, we choose vote entropy as proposed by [9]:

    U_{q,VE} = \arg\max_{x \in U} \left( -\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C} \right)    (4)

whereby C denotes the committee size, y_i ranges over all possible labels and V(y_i) is the number of times a label was predicted by a classifier. This scoring method scores dissenting votes highest and concurring votes lowest. Our applied committee consisted of three Random Forest classifiers with different random initializations, one NaΓ―ve Bayes, and one Support Vector Machine.
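A compact sketch of such a committee, assuming scikit-learn estimators and the vote-entropy score of Eq. (4); the hyperparameters and helper names below are placeholders of ours rather than the settings used in the experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

def build_committee():
    # Three differently seeded Random Forests, one Naive Bayes and one SVM.
    return [RandomForestClassifier(random_state=s) for s in (0, 1, 2)] + [GaussianNB(), SVC()]

def vote_entropy_query(committee, X_labeled, y_labeled, X_unlabeled, k=1):
    votes = []
    for member in committee:
        member.fit(X_labeled, y_labeled)        # every member is trained on the same L
        votes.append(member.predict(X_unlabeled))
    votes = np.stack(votes, axis=1)             # shape: (num unlabeled cells, committee size C)
    C = votes.shape[1]
    scores = np.empty(len(votes))
    for i, row in enumerate(votes):
        _, counts = np.unique(row, return_counts=True)   # V(y_i) for every voted label
        p = counts / C
        scores[i] = -(p * np.log(p)).sum()               # vote entropy, Eq. (4)
    return np.argsort(scores)[-k:]                       # the most controversial samples
```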
3.2 Stopping Criteria
So far, the AL cycle would stop only if the set of unlabeled data U is out of cells. Obviously this leads to no reduction in labeling effort. Therefore, we introduce a stopping criterion that is able to detect whether proceeding with the AL cycle no longer results in any accuracy gain. Additionally, it can be shown that by reducing the amount of training data we prevent overfitting and the test accuracy is increased (Section 4.6). In this paper, we investigate three different stopping criteria (SC) from [34].

Maximum Uncertainty. Maximum Uncertainty uses the same measurement as the Uncertainty Sampling Least Confident strategy in Section 3.1. The basic idea is that over multiple learning cycles the classifier's uncertainty should decrease, as the classifier has more knowledge over its data. Therefore, the stopping criterion uses the uncertainty of the most uncertain sample in U as measure. If the uncertainty value drops below a user-defined threshold, the stopping criterion will end the AL cycle:

    SC_{MU} = \begin{cases} 1 & \text{if } U_q \in U \text{ and } UM(U_q) \le \theta_{MU} \\ 0 & \text{otherwise} \end{cases}    (5)

UM(U_q) is the method retrieving the uncertainty of a sample U_q and \theta_{MU} the predefined threshold. For batches we used the smallest uncertainty of the batch; other common strategies are the mean or the median.

Selected Accuracy Method. This method focuses on the classification accuracy of the chosen queries U_q. It assumes that the classifier always chooses the most difficult queries for labeling. Hence the model's predictions on the most uncertain queries should reflect the classifier's certainty. The AL process stops if the classifier successfully predicts the labels of those most difficult queries.

    SC_{SA} = \begin{cases} 1 & \text{if } acc_{|U_q|}(C) \ge \theta_{SA} \\ 0 & \text{otherwise} \end{cases}    (6)

acc_m(C) determines, with the help of the oracle, the accuracy of the classifier's predictions on the queries; m denotes the iteration of the AL cycle of the query. \theta_{SA} is the threshold. For batches the mean accuracy over the cells is used.

Standard Deviation-based Stopping. In contrast to the other approaches, Standard Deviation-based Stopping [6] assesses multiple cycles for trend detection. Two criteria have to be met for stopping: first, the standard deviation of the accuracies for the last five U_q has to be lower than a predefined threshold \theta_1, and secondly, the accuracy of the current U_q has to be larger than the threshold \theta_2. The first criterion identifies plateaus in the query accuracy and the second one prevents local optima.

    SC_{\sigma} = \begin{cases} 1 & \text{if } \sigma(acc_{n-5 \ldots n}) \le \theta_1 \wedge acc_n \ge \theta_2, \; n \ge 5 \\ 0 & \text{otherwise} \end{cases}    (7)

The current cycle under inspection is denoted with n. Since the stopping criteria described above all return Boolean values, they can be used in combination in the form of logical expressions.
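The three criteria, and their combination as a logical expression, can be sketched as plain Boolean functions; the threshold values below are arbitrary placeholders, not the settings used in the experiments.

```python
import numpy as np

def sc_max_uncertainty(batch_uncertainties, theta_mu=0.1):
    # Eq. (5): stop once the smallest uncertainty of the queried batch falls below theta_MU.
    return min(batch_uncertainties) <= theta_mu

def sc_selected_accuracy(y_true_query, y_pred_query, theta_sa=0.9):
    # Eq. (6): stop once the classifier already predicts the supposedly hardest queries well.
    return np.mean(np.asarray(y_true_query) == np.asarray(y_pred_query)) >= theta_sa

def sc_standard_deviation(query_accuracies, theta_1=0.02, theta_2=0.9):
    # Eq. (7): plateau over the last five query accuracies plus a minimum current accuracy.
    if len(query_accuracies) < 5:
        return False
    last_five = np.asarray(query_accuracies[-5:])
    return last_five.std() <= theta_1 and last_five[-1] >= theta_2

def should_stop(y_true_q, y_pred_q, query_accuracies):
    # Example of a combined criterion (cf. Section 4.6): standard deviation as a necessary
    # and selected accuracy as a sufficient condition.
    return sc_standard_deviation(query_accuracies) and sc_selected_accuracy(y_true_q, y_pred_q)
```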
3.3 Batch Sizes
It is common practice in machine learning to train a model on batches of samples instead of single data points. In the later evaluation (Section 4.4) we compare simple top-k batches and so-called spreadsheet batches: A top-k batch consists of the top-k cells being selected by the query strategy (Section 3.1). However, this strategy has the disadvantage that the selected cells potentially belong to a large number of spreadsheets, which leads to higher efforts for the oracle, i.e. the human annotator. As reported in Section 2.3 it takes several minutes to label a spreadsheet, while most of the time is needed to understand the spreadsheet content and structure. For this reason, we propose to provide just one spreadsheet to the human annotator within an AL cycle, i.e. a batch consists of all cells from this single spreadsheet. To select the best spreadsheet within an AL cycle we have to adapt our defined query strategy metrics QS (see Section 3.1) for spreadsheet batches as follows, where U_i denotes the set of unlabeled cells of spreadsheet i and SB denotes the spreadsheet to be used as U_q:

    SB = \arg\max_i \frac{1}{|U_i|} \sum_{x \in U_i} QS(x)    (8)
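A small sketch of the spreadsheet-wise selection of Eq. (8), assuming the unlabeled cells are grouped per sheet and a query-strategy scorer such as the entropy function from the earlier sketch; all names are ours.

```python
import numpy as np

def select_spreadsheet(clf, unlabeled_cells_by_sheet, score_fn):
    """Eq. (8): pick the sheet whose unlabeled cells receive the highest mean query score.

    unlabeled_cells_by_sheet maps a sheet id to the feature matrix U_i of its unlabeled
    cells; score_fn turns clf.predict_proba(U_i) into per-cell informativeness scores.
    """
    best_sheet, best_score = None, -np.inf
    for sheet_id, U_i in unlabeled_cells_by_sheet.items():
        mean_score = score_fn(clf.predict_proba(U_i)).mean()   # (1/|U_i|) * sum_x QS(x)
        if mean_score > best_score:
            best_sheet, best_score = sheet_id, mean_score
    return best_sheet
```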
4 EVALUATION
To evaluate our semi-supervised approach for spreadsheet cell classification, we performed a set of experiments on the DECO dataset (Section 2.3). To use this dataset for cell classification, three changes have been made: 1) As described already in Section 2.1, the seven layout roles have to be merged into the three broader categories data, header, metadata. 2) Due to the very high amount of data cells (94%) in DECO, the data class would get an increased prior probability. Therefore, we decided to limit the number of data cells to 100 per spreadsheet, chosen randomly. 3) Cells with the label other were not included in our experiments, since they are too ambiguous and not helpful for table identification. The original dataset contains 98.68% data cells, 1.18% header cells and 0.14% metadata cells, with a total of 1,393,398 cells. After performing the aforementioned steps, the distribution is 83.27% data cells, 14.99% header cells and 1.74% metadata cells.

We implemented the AL cycle sketched in Figure 1 in Python in the following way: the learner consists of several classification models such as Decision Trees, Random Forest, NaΓ―ve Bayes and SVMs, using their Scikit-learn [22] implementation. To perform AL on the DECO dataset we split it into 80% used for L and U as well as 20% for testing. Within the experiments, we usually set |L| to 1% and |U| to 99% (of the overall 80%). Since our dataset is already completely annotated, our implemented oracle can automatically provide the requested label, i.e. for a given U_q provide L_q. While this setup is perfect to test the different configurations of the AL cycle, it is important to note that in a real-world setting the oracle will always be a human annotator.

In Section 4.1, we first propose a metric to measure the performance of an AL cycle in terms of accuracy improvement and sample size reduction. Before looking at the AL strategies, we report in Section 4.2 on the accuracy of passive learners, i.e. classifiers using the whole DECO dataset for training. In Section 4.3, we study the impact of different start set sizes on the learning curves. Varying batch sizes, sampling strategies and stopping criteria are investigated in Section 4.4, Section 4.5 and Section 4.6.
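The following condensed sketch mirrors this setup: an 80/20 train/test split, a 1% start set, a Random Forest learner and an oracle simulated by the already existing DECO labels. It assumes cell features, labels and sheet ids as NumPy arrays and reuses the spreadsheet selection from Section 3.3; all helper names and sizes are our own choices, not the original implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def simulate_al_cycle(X, y, sheets, query_sheet, n_iterations=100, seed=0):
    """query_sheet(clf, X_pool, sheet_ids_pool) returns the id of the next spreadsheet U_q,
    e.g. built from select_spreadsheet() and an uncertainty score."""
    # 80% of the cells form L and U, 20% are held out for testing.
    X_tr, X_te, y_tr, y_te, sh_tr, _ = train_test_split(
        X, y, sheets, test_size=0.2, random_state=seed)
    rng = np.random.default_rng(seed)
    # L starts with roughly 1% of the training cells, U holds the remaining 99%.
    labeled = set(rng.choice(len(X_tr), size=max(1, len(X_tr) // 100), replace=False).tolist())
    learning_curve = []
    for _ in range(n_iterations):
        L = sorted(labeled)
        clf = RandomForestClassifier(random_state=seed).fit(X_tr[L], y_tr[L])
        learning_curve.append(clf.score(X_te, y_te))             # test accuracy per AL cycle
        U = [i for i in range(len(X_tr)) if i not in labeled]
        if not U:
            break
        next_sheet = query_sheet(clf, X_tr[U], sh_tr[U])         # spreadsheet batch (Section 3.3)
        labeled.update(i for i in U if sh_tr[i] == next_sheet)   # simulated oracle: labels exist
    return learning_curve
```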
4.1 Slope Calculation
There does not exist a standardized numeric measure to report on AL performance. A common method in AL research is to compare smoothed accuracy graphs. However, to compare different settings in a faster way, many papers also rate the classifier performance by looking at the start and end point of the learning curve. To take into account also the reduction of the human effort in labeling data, we additionally divide the gained accuracy by the number of queries that have been used to achieve this result. The measurement is denoted as slope.

    slope = \frac{acc_{end} - acc_{start}}{|queries|}    (9)

acc_start is the classifier's accuracy before any AL cycle passes. acc_end is determined by the stopping criteria (see Section 3.2) or, in the worst case, corresponds to the point when the unlabeled sample set U is out of data.
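Eq. (9) is a one-liner; the short helper below also reproduces one row of Table 2 as a sanity check (accuracy gains are reported there in percentage points).

```python
def slope(acc_start, acc_end, n_queries):
    # Eq. (9): accuracy gain per query, i.e. improvement normalized by the labeling effort.
    return (acc_end - acc_start) / n_queries

# Standard deviation stopping in Table 2: +11.19 percentage points after 70 queries.
assert abs(slope(0.0, 11.19, 70) - 1.599e-01) < 1e-3
```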
4.2 Passive Learning
In a first experiment, we compared the performance of different classifiers trained on the whole dataset L, without applying the AL cycle. In detail, we investigated Decision Trees (DT), Random Forest (RF), NaΓ―ve Bayes (NB), Support Vector Machines (SVM), and Multi-Layer-Perceptron (MLP) using a standard random search over the hyperparameters and five-fold cross-validation. Table 1 lists the achieved classification results. The data has been downsampled and the data points of the underrepresented classes have been used as weights for the header and metadata accuracies, as the F1-Scores would otherwise be significantly lower than those of the data class. Even though MLP and SVM achieved the best average F1-score, we used the close second best option RF for the remaining AL experiments, since its training time is significantly lower. In the following experiments the shown baseline is the accuracy of a passive classifier trained on the whole dataset L.

F1-Score                    Decision Tree   NaΓ―ve Bayes   Random Forest     SVM      MLP
Data                             0.95           0.91           0.95         0.95     0.95
Header                           0.75           0.69           0.76         0.78     0.78
Metadata                         0.23           0.35           0.46         0.47     0.52
Weighted Avg.                    0.90           0.87           0.91         0.92     0.92
Avg. training time (sec)         0.15           0.09           0.33       112.52   490.91

Table 1: F1-Scores of the Passive Classifiers
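As a hedged sketch of such a passive baseline (random hyperparameter search with five-fold cross-validation for one of the candidate models), the snippet below uses scikit-learn's RandomizedSearchCV; the search space and scoring are placeholder choices of ours, not the grids used in the experiments.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

def tune_passive_random_forest(X, y):
    # Random search over a small illustrative hyperparameter space with five-fold CV.
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={"n_estimators": randint(10, 300),
                             "max_depth": randint(2, 40)},
        n_iter=20, cv=5, scoring="f1_weighted", random_state=0)
    return search.fit(X, y).best_estimator_
```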
4.3 Start Set Size
We compare three different start set sizes for L: 1%, 5% and 10%; the batch size is one spreadsheet at a time, and the query strategy is entropy-based uncertainty. Figure 2 shows that a smaller start set size leads to a higher increase in accuracy at the beginning of the AL cycle. Nevertheless, after around 100 queried sheets the graphs for all different start sizes have almost the same shape. In terms of reducing human effort, we can therefore state that a small start size is more beneficial and leads to the same or even higher accuracies.

Figure 2: Comparison of different start sizes (1%, 5%, 10%) and the passive classifier as baseline (accuracy over the number of queried sheets)

4.4 Batch Size
Instead of querying the oracle for just one new labeled data point during a single AL cycle, the classifier is refitted on a whole batch of newly labeled cells. Different batch sizes, respectively 50, 150, 250 and all cells of a single spreadsheet, are selected for comparison. The start set size for this experiment is 1% and an entropy-based uncertainty query strategy is used (Section 3.1). Figure 3 shows that the best results can be achieved with a batch size of 50. However, the difference between cell-based batches and the batches consisting of cells that belong to a single spreadsheet is very small. Therefore, we advise to use a batch size of one single spreadsheet per AL cycle in order to reduce the effort for the human annotator.

Figure 3: Comparison of different batch sizes (50, 150, 250, one sheet) and the passive classifier as baseline (accuracy over the number of batches)

4.5 Sampling Strategies
Figure 4 shows the classification accuracies for the different sampling strategies, all with the same start size of 1% and spreadsheet-wise batches. One can see a clear difference between the random strategy and the other strategies: For the random strategy the accuracy goes down after a few queried sheets. Entropy-based uncertainty sampling, uncertainty margin sampling and committee sampling show the highest increase, while the latter achieved the best results and is therefore the recommended strategy.

Figure 4: Accuracies for different sampling strategies (random, least confident, margin, entropy, committee) and the passive classifier as baseline (accuracy over the number of queried sheets)

4.6 Stopping Criteria
For evaluating the stopping criteria, we use a start size of 1% and committee sampling. Figure 5 shows the values of the three stopping metrics proposed in Section 3.2. The markers indicate the stopping points resulting from these stopping metrics. The vertical dashed line denotes the optimal stopping point derived from the peak of the learning curve for the test data. It should be noted again that the learning curve is just available in our experimental setting. In practice the stopping point has to be determined by the stopping criteria, which just rely on the output of the classifier.

Figure 5: Comparison of stopping strategies Max Uncertainty, Standard Deviation and Selected Accuracy together with the test accuracy of committee sampling (mauve) and the optimal stopping point (marked using a vertical dashed line)

For sheet-based batches, the maximum uncertainty stopping values are quite consistent; only for the last queries a drop is noticeable. The reason is probably an overall consistent average uncertainty over all unlabeled sheets. The selected accuracy metric fluctuates quite a lot and depends heavily on the sheet, but an upward trend is nevertheless noticeable the more labeled training data becomes available. The standard deviation of the last five classification accuracies of the queried spreadsheets is quite stable, having some valleys. This stopping criterion provides a robust categorization of the accuracies, but cannot be used to distinguish between plateaus and valleys. Even though in our case the minimum of the standard deviation is almost exactly at the optimum stopping point, for a general case we recommend to use standard deviation as a necessary criterion and selected accuracy as a sufficient criterion to detect a continuous peak.

To see the amount of effort which can be reduced by AL, we calculated the slopes for the stopping points as listed in Table 2. The best results were achieved with standard deviation stopping, resulting in an increase of the accuracy by +11.19% using a start set of just 6 spreadsheets and a training set of 70 spreadsheets, compared to 671 spreadsheets of the passive classifier.

                                 acc Ξ” %   |queries|      slope
No Stopping                        3.00       671       4.474e-03
Maximum Uncertainty Stopping       3.40       661       5.144e-03
Standard Deviation Stopping       11.19        70       1.599e-01
Selected Accuracy Stopping        10.88        43       2.530e-01

Table 2: Comparison of different learning slopes (spreadsheet batches, committee sampling, 1% start size)

4.7 Discussion
The following insights can be derived from our experiments: As a first step we recommend a comparison of different classifiers. For our scenario of cell classification, we decide on a Random Forest classifier providing high accuracy and fast training time. Compared to weakly-supervised approaches (see Section 5), active learning requires a start set of already labeled data. However, our experiments showed that even a very small start set consisting of only 6 spreadsheets is totally sufficient. For cell classification, we decide to use one spreadsheet per batch to reduce the annotator effort. Regarding the sampling strategies, we can conclude that except for the random approach all strategies worked well. The best slope was achieved using the committee strategy, which is therefore our recommended one. It is however noteworthy that this strategy requires the highest amount of computing resources, since several classifiers have to be trained in parallel. We could not identify a single stopping criterion which suffices for identifying nearly optimal stopping points. Therefore, our recommendation is a combination of a high selected accuracy and a low standard deviation of the query accuracies.
5 RELATED WORK
Unlabeled Spreadsheet Corpora. We find multiple unlabeled spreadsheet corpora in the literature: Euses [12] has served the spreadsheet community for a while. It was created with the help of search engines, issuing queries containing keywords such as "financial" and "inventory", and file type ".xls". Overall, it comprises 4,498 unique spreadsheets, organized into categories (folders) based on the used keywords. The more recent Enron corpus [13] contains 15,770 spreadsheets, extracted from the Enron email archive (http://info.nuix.com/Enron.html). This corpus is unique regarding its exclusive view on the use of spreadsheets in enterprise settings. Another recent corpus is Fuse [4], which comprises 249,376 unique spreadsheets, extracted from Common Crawl (http://commoncrawl.org/). Each spreadsheet is accompanied by a JSON file, which includes NLP token extraction and metrics related to the use of formulas. So far, these three corpora have been used by researchers viewing spreadsheets from a software engineering perspective. Formula error detection and debugging [15, 26], but also usage, life-cycle, modeling, and governance of spreadsheets [1, 8, 14] are important research subjects within this community.

Table Recognition. In the literature we find various approaches with regards to layout inference and table recognition in spreadsheets as prerequisites for information extraction. These approaches can be classified as rule-based, heuristics-based, and ML-based ones. Here, we consider those that are ML-based, like [11, 19]. More recent publications apply to some extent machine learning techniques [2, 5, 6, 19]. Other works make use of domain specific languages, such as [16], [31], and [14].

Active Learning. The existing AL approaches can be grouped into three categories: Membership Query Synthesis, Stream-Based Selective Sampling, and Pool-Based Sampling [28]. Membership Query Synthesis assumes that new data points can arbitrarily be created and the learner can request any data point from the problem space. For Stream-Based Selective Sampling a data point is sampled from the problem space and the learner can decide if the generated sample should be labeled or discarded. Both approaches are based on the assumption that arbitrary real-world data points can be produced in an inexpensive way, which is not the case in our scenario. Pool-Based Sampling instead assumes a fixed pool of labeled data L and a large pool of unlabeled data U from the problem space, which holds true for our spreadsheet dataset. Therefore, in this paper we implemented a pool-based sampling approach. In the context of spreadsheets, [6] proposed a rule-assisted AL approach to detect certain properties. The authors introduce a hybrid approach that combines active learning with crude, easy-to-write user-provided rules.

Weak Supervision. The common notion behind weak supervision techniques is that it is much easier to generate a large amount of so-called weakly-labeled data than a small amount of well-labeled data, and that more data always leads to better results. Under this assumption multiple approaches have been proposed: In Snorkel [24] user-defined labeling functions (LF) are used to label given data points. Writing LFs, however, requires experts with domain knowledge, in contrast to spreadsheet cell labeling which can also be done by non-experts. Additionally, Snorkel users often have to provide a development set of labeled data providing quick approximate signals on the accuracies of the LFs. Hence, the human effort is not necessarily lower compared to AL.

As successor of Snorkel, Snuba [32] was created, where the noisy LFs are automatically derived from a small labeled dataset. However, our experiments with the open-source code base of Snuba have shown some limitations. In the current state it is not easily applicable to problems of more than two labels and it seems to only work well for certain types of problems, to which the task of classifying spreadsheet cells does not belong: First, our dataset is quite unbalanced, which results in LFs that only focus on the majority class, and second, with a growing number of labeling classes the LFs again tend to focus only on the most common label. While investigating this we discovered multiple issues in the current implementation of Snuba which prevent the overall sound concept from working well in a generalized setting.

Other methods include assigning a label per feature instead of assigning a label per data point [10], which is suitable in a setting of textual data, where it is easy to specify distinctive terms per label class, but not in the case of classifying spreadsheet cells. Another approach [21, 23] first clusters the input data and then assigns a label per cluster based on a fraction of labeled data points per cluster. Our problem space has a dimension of 159 features and early experiments showed that it is not easy to identify appropriate clusters in such a high-dimensional space.

6 CONCLUSIONS
In this paper, we proposed a semi-supervised approach for cell classification that drastically reduces the amount of training data without any losses in accuracy. In detail, we investigated different query strategies, start set sizes, stopping criteria, and batch sizes to implement the AL cycle. With the DECO corpus, consisting of 1,165 labeled spreadsheets, we tested all these configurations and derived good settings for them. Using committee sampling, a start set size of 1% and the selected accuracy stopping criterion we are able to reduce the size of the training dataset by 90%. In absolute numbers, our AL cell classification approach obtains the same accuracies using just 76 labeled spreadsheets (6 sheets in the start set and 70 selected by the AL approach) compared to the passive classifier trained on 1,165 spreadsheets. Therefore, the effort for the human annotators was reduced from 3,029 to 197 minutes. Given this reduction of annotation work and consequently in annotation costs, it now becomes feasible to apply ML-driven cell classification and table identification in practice.
REFERENCES
[1] Robin Abraham and Martin Erwig. 2006. Inferring templates from spreadsheets. In Proceedings of the 28th International Conference on Software Engineering. ACM, 182–191.
[2] Marco D Adelfio and Hanan Samet. [n.d.]. Schema extraction for tabular data on the web. VLDB'16 ([n.d.]), 421–432.
[3] Yoram Baram, Ran El-Yaniv, and Kobi Luz. 2004. Online Choice of Active Learning Algorithms. J. Mach. Learn. Res. 5 (Dec. 2004), 255–291.
[4] Titus Barik, Kevin Lubick, Justin Smith, John Slankas, and Emerson Murphy-Hill. 2015. Fuse: a reproducible, extendable, internet-scale corpus of spreadsheets. In MSR'15. 486–489.
[5] Zhe Chen and Michael Cafarella. [n.d.]. Automatic web spreadsheet data extraction. In SSQ'13. 1.
[6] Zhe Chen, Sasha Dadiomov, Richard Wesley, Gang Xiao, Daniel Cory, Michael Cafarella, and Jock Mackinlay. 2017. Spreadsheet Property Detection With Rule-assisted Active Learning. In CIKM '17. ACM, New York, NY, USA, 999–1008. https://doi.org/10.1145/3132847.3132882
[7] David Cohn, Les Atlas, and Richard Ladner. 1994. Improving generalization with active learning. Machine Learning 15, 2 (01 May 1994), 201–221. https://doi.org/10.1007/BF00993277
[8] JΓ‘come Cunha, JoΓ£o Paulo Fernandes, Jorge Mendes, and JoΓ£o Saraiva. 2012. MDSheet: A framework for model-driven spreadsheet engineering. In ICSE. IEEE Press, 1395–1398.
[9] Ido Dagan and Sean P. Engelson. 1995. Committee-based Sampling for Training Probabilistic Classifiers. In ICML'95. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 150–157.
[10] Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proc. EMNLP. ACL Press, 81–90.
[11] Julian Eberius, Christoper Werner, Maik Thiele, Katrin Braunschweig, Lars Dannecker, and Wolfgang Lehner. [n.d.]. DeExcelerator: A framework for extracting relational data from partially structured documents. In CIKM'13. 2477–2480.
[12] Marc Fisher and Gregg Rothermel. [n.d.]. The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. In SIGSOFT'05, Vol. 30. 1–5.
[13] Felienne Hermans and Emerson Murphy-Hill. 2015. Enron's spreadsheets and related emails: A dataset and analysis. In ICSE. IEEE Press, 7–16.
[14] Felienne Hermans, Martin Pinzger, and Arie Van Deursen. [n.d.]. Automatically extracting class diagrams from spreadsheets. ECOOP'10 ([n.d.]), 52–75.
[15] Felienne Hermans, Martin Pinzger, and Arie van Deursen. 2015. Detecting and refactoring code smells in spreadsheet formulas. Empirical Software Engineering 20, 2 (2015), 549–575.
[16] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. 2011. Wrangler: Interactive visual specification of data transformation scripts. In SIGCHI. ACM, 3363–3372.
[17] Elvis Koci, Maik Thiele, Oscar Romero, and Wolfgang Lehner. 2019. A Genetic-based Search for Automatic Table Recognition in Spreadsheets. In ICDAR'19.
[18] Elvis Koci, Maik Thiele, Josephine Rehak, Oscar Romero, and Wolfgang Lehner. 2019. DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition. In ICDAR'19.
[19] Elvis Koci, Maik Thiele, Γ“scar Romero Moral, and Wolfgang Lehner. 2016. A machine learning approach for layout inference in spreadsheets. In KDIR. SciTePress, 77–88.
[20] David D. Lewis. 1995. A Sequential Algorithm for Training Text Classifiers: Corrigendum and Additional Data. SIGIR Forum 29, 2 (Sept. 1995), 13–19. https://doi.org/10.1145/219587.219592
[21] Zhiliang Liu, Xiaomin Zhao, Jianxiao Zou, and Hongbing Xu. 2013. A Semi-Supervised Approach Based on k-Nearest Neighbor. Journal of Software 8 (04 2013). https://doi.org/10.4304/jsw.8.4.768-775
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. JMLR 12 (2011), 2825–2830.
[23] FΓ‘bio Perez, RΓ©mi Lebret, and Karl Aberer. 2018. Cluster-Based Active Learning. CoRR abs/1812.11780 (2018). arXiv:1812.11780
[24] Alexander Ratner, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu, and Christopher RΓ©. 2017. Snorkel: Rapid Training Data Creation with Weak Supervision. CoRR abs/1711.10160 (2017). arXiv:1711.10160
[25] Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active Hidden Markov Models for Information Extraction. In IDA '01. Springer-Verlag, London, UK, 309–318.
[26] Thomas Schmitz, Dietmar Jannach, Birgit Hofer, Patrick W. Koch, Konstantin Schekotihin, and Franz Wotawa. 2017. A decomposition-based approach to spreadsheet testing and debugging. In Visual Languages and Human-Centric Computing. IEEE Computer Society, 117–121.
[27] Greg Schohn and David Cohn. 2000. Less is More: Active Learning with Support Vector Machines. Machine Learning – International Workshop then Conference (10 2000).
[28] Burr Settles. 2010. Active learning literature survey. Computer Sciences Technical Report 1648 (07 2010).
[29] H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by Committee. In COLT '92. ACM, New York, NY, USA, 287–294. https://doi.org/10.1145/130385.130417
[30] C. E. Shannon. 2001. A Mathematical Theory of Communication. SIGMOBILE Mob. Comput. Commun. Rev. 5, 1 (Jan. 2001), 3–55. https://doi.org/10.1145/584091.584093
[31] Alexey O Shigarov and Andrey A Mikhailov. 2017. Rule-based spreadsheet data transformation from arbitrary to relational tables. Information Systems 71 (2017), 123–136.
[32] Paroma Varma and Christopher RΓ©. 2018. Snuba: Automating Weak Supervision to Label Training Data. Proc. VLDB Endow. 12, 3 (Nov. 2018), 223–236. https://doi.org/10.14778/3291264.3291268
[33] Xinxin Wang. 1996. Tabular abstraction, editing, and formatting. Technical Report. University of Waterloo, Waterloo, Ontario, Canada.
[34] Jingbo Zhu, Huizhen Wang, Eduard Hovy, and Matthew Ma. 2010. Confidence-based Stopping Criteria for Active Learning for Data Annotation. ACM Trans. Speech Lang. Process. 6, 3, Article 3 (April 2010), 24 pages. https://doi.org/10.1145/1753783.1753784