Automatic detection of abusive South African tweets using a semi-supervised learning approach Oluwafemi Oriola1[0000−0003−0255−6160] and Eduan Kotzé1[0000−0002−5572−4319] Department of Computer Science and Informatics, University of the Free State, Bloemfontein, South Africa {OriolaO, kotzeJE}@ufs.ac.za Abstract. Major setbacks for detection of abusive South African tweets are inadequacy of annotated corpus and high cost of annotation, which semi-supervised learning solves. Semi-supervised learning techniques en- hance training data by combining labelled and unlabelled data. However, existing approaches have skewed classification of unlabelled data towards labelled data despite class imbalance of labelled data and unmatched fea- ture distribution between labelled and testing data, that is common in abusive texts. This paper presents a reliable semi-supervised learning ap- proach that reduces the noise in training data by combining features of unlabelled data with varying sizes of important features of labelled data. Chi-square statistics is used for the feature selection, while k-means al- gorithm is used for clustering of data points. By majority voting rule, reliable labels are assigned to the data points. Classifications with Sup- port Vector Machine and Logistic Regression classifiers show that the proposed approach improves prediction performance. Keywords: South African Tweets · Abusive Language · Semi-Supervised Machine Learning · Clustering · Classification 1 Introduction The rise in the act of racial and social media conflicts and their negative conse- quences means that improved detection of abusive languages on social networks cannot be over-emphasised. One of the major setbacks to improvement of detec- tion of abusive languages in social networks is inadequacy of lexical resources for many languages [1]. Abusive language is referred to as an oral or textual expres- sion that contains dirty words or phrases [2]. This expression can be derogatory, profane, cyber-bullying or hate speech. Over the last decades, South Africa has experienced upsurge in various de- grees of violence such as violent protests and xenophobic attacks, which have led to loss of human and material resources. Many of these violent incidents could be attributed to fast spread of inciteful and abusive comments, perpe- trated through social networks. However, there has not been any infrastructural measure to check the soaring volumes of such communications. Recently, South African government promulgated laws to address incidents of hate speech [3], but it is important for such legal tool to be supported by an active preventive measure, which abusive language detection will provide. 2 O. Oriola, E. Kotzé As at January 2019, there was twenty-three million active social media users in South Africa, out which twenty percent, about five million have subscribed to Twitter [4]. A recent study of hate speech in multi-domain perspectives has shown that Twitter has been highly used as a medium to propagate racial com- munications in South Africa [5]. In fact, most of the recent studies on abusive communication online have focused on Twitter. Therefore, this research explores Twitter contents called tweets for detection of abusive language in social media. Machine Learning is a reliable technique that has been used for abusive tweet detection [6]. Supervised machine learning techniques have been the most widely used but they perform poorly when there are few labelled data. Unsupervised machine learning techniques are used to observe the relationship between fea- tures by relying on the similarities among the data and probabilistic approach. They have been mostly applied to zero resourced problems [7]. Some works have combined the two methods as semi-supervised learning, when there are few labelled data and large unlabelled data [8], [9] but they have all relied on deep learning techniques, which are computationally expensive and require huge training data. South Africa is a multilingual society, but English is mostly used to commu- nicate on the social media. The existing English corpora for abusive language detection have been labelled using crowd-sourcing tools or by annotators that are familiar with the contexts of the abusive discourses. However, studies have shown that language use varies across societies, contexts and individual [10], which is very prominent in abusive South African tweets. Tweets written in En- glish might be code-mixed with profane and non-profane words, in indigenous language or slangs that are peculiar to South Africa. To our best knowledge, the only lexical resources for abusive language that is specific to South Africa is available in Hatebase [11]; however, the resources are very few. This paper therefore focuses on development of inexpensive and reliable Semi- supervised learning approach, which combines large unlabelled tweets and few labelled tweets to automatically detect abusive tweets. 2 Semi-supervised learning techniques Semi-supervised learning (SSL) techniques are machine learning techniques that rely on both labelled and unlabelled data for classification tasks. Here, machines learn from fewer labelled data points with the help of large number of unlabelled data points. Several research works have used unlabelled data to enhance the performance of classification models. These can be categorised into Self-Learning Approach and Active Learning Approach. In Self-learning approach, unlabelled data is automatically annotated and the instances with high confidence are added to the training datasets iteratively. It can be categorised into Self-training meth- ods [12], [13] and Generative learning methods [14], [15]. Active Learning was developed to improve the selection process of unlabelled samples and solve class 3 imbalance problem, but they often relied on manual method [16] or co-training approach [17], [18], [19], which are costly. In this work, we are interested in Cluster-then-Label Generative method be- cause they were developed to address missing data problem. Kumar et al. [14] applied Cluster-then-Label to cross-domain adaptation problem, in which unla- belled data in a source domain was merged with unlabelled data in the target domain and clustered using Fuzzy K-Means algorithm. Labels were assigned to the clusters using common knowledge from experts, while classification to pre- dict target dataset was carried out using Dual Margin Binary Hypersphere-based Support Vector Machine. Albalate et al. [15] applied the labelled samples’ labels to the clusters of unlabelled data using optimum cluster labelling approach of Hungarian algorithm and removed uncertainty using Silhoutte Cluster Pruning. Leng et al. [20] proposed Adaptive Semi-supervised clustering algorithm with label propagation to label unlabelled dataset. The available labels of the labelled samples were used to assign labels to the unlabelled data based on K-Nearest Neighbour to core objects defined by adaptive threshold. The adaptive threshold was estimated by the density of each cluster, which the label data point belonged to. Also, new cluster was detected by the distance from the clusters core objects. Peikari et al. [21] clustered labelled and unlabelled datasets and mapped out the high-density regions in the data space. Fuzzy C-Means was used to assign la- bels to the identified clusters, while Support Vector Machine was used to label the data on the low-density region. Forestier and Wemmert [22] focused on how multiple clustering algorithms can be combined with a supervised learning al- gorithm to achieve better results than classical semi-supervised and supervised algorithms. They proposed Supervised Learning Ensembles with multiple clus- tering. The clustering combined labelled and unlabelled objects and maximized intra-cluster similarity using multiple observations. The above semi-supervised learning approaches have relied on the labels of the labelled data to assign labels to the unlabelled data despite the class- imbalance nature of the labelled data and partially matched features of the labelled and testing data, which is often the case in real-life. 3 Proposed method In this section, we formalise the approach used to detect South African abusive tweets. 3.1 Semi-supervised learning approach The Semi-supervised learning method proposed in this work is motivated by the following assumptions which indicate that labelling decision cannot be skewed towards labelled data, in real-life. – Features of the testing data might not match exactly the features of either labelled or unlabelled data. – Datasets of similar contexts share asymmetrical features. 4 O. Oriola, E. Kotzé 3.2 Classification problem Let X be set of n tweet samples xi � X. Given a binary-class classification problem with l very low labelled instances and u large unlabelled data such that U > L; the set of labelled instances L = {(x1 , y1 ), . . . ., (xl , yl )} and the set of unlabelled instances U = {xl+1 , . . . , xl+u }, where y = (0, 1) are the class values of the data. Since the objective of semi-supervised learning is to build a classification model based on the training dataset, then we define our approach as presented in equation (1). y = CX(x) : y�{0, 1} (1) The schematic diagram in Figure 1 depicts the semi-supervised learning pro- cess. This involves three procedures: labelling of unlabelled data as described in section 3.3 and 3.4, training of merged unlabelled and labelled data, and testing with test data as described in section 4.3.3. Fig. 1. Proposed Semi-supervised Learning Process 3.3 Clustering In order to apply both unlabelled and labelled data samples as training data without skewness towards labelled data, the features of the unlabelled data are fused with different sizes of features from labelled data. Given that R is the set of features of unlabelled data U and S is the set of features of labelled data, that is R = {rl + 1, . . . , rl + u} and S = {s1 , . . . , sl } Then, L�{a1 , a2 , . . . , aq } and ai < ai+1 where a �A is the linearly selected size of features and q is the size of features. By Matrix Multiplication, R0 = R ∗ S (2) 5 Applying K-Means algorithm to cluster partition of Ro into k disjoint clusters C = {0, 1} given that k − yi = 0, we get J sets of cluster partitions (j1 , j2 , . . . , ji ) k � � �� 0 �� ji = argmint ��R − µi�� (3) i−1 r�ti The pseudocode is presented below: Data: Labelled data features (R); unlabelled data instances (U) Result: The labels VT for unlabelled data instances y for qi for R = L + 1, L + 2, . . . , L + u do for qi = 3, 5, 10, 15, 20, 25, 30 do for S = 1, 2, . . . , L do R0 = R ∗ S; �k � �� �� j = argmint i−1 r�ti ��R0 − µi�� ; VT = j; end end end 3.4 Fusion and Labelling n order to assign labels to the unlabelled data, majority vote rule is applied to every instance of R0 in A such that VT > 0. The selected label yi = Cu (max(Vt )) (4) Where R0 = {rl+1 , . . . , rl+u } , VT = labels for each instance for all tested sizes of features. The pseudocode is presented below: Data: Labels for unlabelled data instances, VT Result: Majority Voting labels, Vm for T = 1, . . . , t do max count = Max (AllCounters) if max count > t/2 then Vm =class label corresponding to max count end end 4 Experimental steps 4.1 Data collection and annotation Total of 21,350 tweets of South African discourses on Twitter between the period of May 5, 2019 and May 13, 2019 were collected using Twitter Archival tool, a Google Sheets plugin, that works based on Twitter Search API. The collection 6 O. Oriola, E. Kotzé targeted tweets related to 2019 South African national elections, popular South Africa individuals and trending issues such as land reclamation, Orania and white communities. Tweets that contained non-English words were removed, except names of individuals, towns, people, and organizations. Retweets and repeated tweets as well as tweets with empty word characters were also removed. Following these steps, the number of tweets remaining was 10,245. The tweets were divided into three data samples of labelled, unlabelled and testing data. Total of 1,737 tweets were randomly selected for annotation while the remaining 8,548 tweets were not annotated. The selected samples were annotated by two annotators as either ‘abusive’ (A) or ‘non-abusive’ (NA), from which 1,730 tweets were selected because of agreement on their labels. The Cohen’s Kappa agreement score [23] was 0.8490, which indicated almost excellent agreement. The 7 tweets, which were disagreed upon were added to the unlabelled dataset thus, making it 8,555. The 1,730 tweets were divided into 338 labelled tweets and 1,352 testing data. The resulting distribution of the dataset is presented in Table 1. Table 1. Distribution of Dataset Number Non-abusive Abusive Dataset Samples of Instances (NA) (A) Unlabelled Training 8555 - - data Labelled 338 286 52 data Testing Testing 1352 1118 234 data 4.2 Data pre-processing The samples of the dataset went through various stages of pre-processing so as to be suitable for text processing. These stages include removal of username, punctuations, special characters and symbols including emoticons and emojis, removal of hash symbols in hashtags, removal of English stopwords, stemming and change of all texts to lower case. 4.3 Data processing Three major stages of process were involved, which include feature engineering, clustering and classification. Feature Engineering The texts in the tweets were transformed into Term Fre- quency and Inverse Document Frequency (TF-IDF) feature space, where weights were created as indicated in equation (5) [24]. TF-IDF was chosen over Bag of 7 Words (BoW) because TF-IDF considers the IDF of each term unlike BoW and performed better than most surface-level feature representations [25]. The TF-IDF weights for a given term t in a document d is given as: T F − IDF (t, d) = T F (t, d) ∗ IDF (t) (5) when IDF (t) = log[n/(DF (t) + 1] , n = total number of documents in the document set; DF (t) = document frequency of t We also made use of word and character n-gram models. The word n-gram include as Unigram (1), combination of Unigram and Bigram (1, 2) and combination of Unigram, Bigram and Trigram (1, 2, 3), while character n-gram include character n-gram with length sizes from 2 to 6 (2-to-6), 3 to 7 (3-to-7) and 4 to 8 (4-to-8) Clustering We employed TF-IDF vectorization on the features of the labelled and unlabelled data, without over or under-sampling. The important features of the labelled samples were extracted using the Chi-Square statistics [26] with K values of 3, 5, 10, 15, 20, 25 and 30. This was followed by fusion of the features of labelled and unlabelled samples as described in equation (4). The K-Means unsupervised learning algorithm (number of clusters = 2) was used to cluster the fused samples, resulting in seven different cluster samples of two cluster par- titions each. By majority voting rule presented in equation (4), the most reliable cluster partition was obtained. Abusive label was assigned to cluster partition with more abusive words, while non-abusive label was assigned to cluster parti- tion with lesser abusive words. Classification We applied n-gram features weighted by TF-IDF vectorization on the combination of semi-supervised labelled data and the originally labelled data samples. They were used because of their effective performance in previ- ous text classification problems [5], [25]. SMOTE oversampling technique [24] was applied to reduce class imbalance. The testing data sample was also trans- formed in the same manner. Support Vector Machine (kernel=linear kernel) and Logistic Regression (kernel=liblinear) classifiers were used to train the merged data samples and detect abusive tweets from the testing data sample using the n-gram features. In order to select the best training model, Grid-search Hyperparameter Tun- ing approach was implemented over 10-fold Cross Validation. For the Support Vector Machine classifier (SVM), different C-regularization values ranging from 0.001 to 1000 were tested. For the Logistic Regression (LogReg), both L1 and L2 penalty functions with np.logspace values over -4, 4 and 20 were tested. 4.4 Performance metrics Precision, Recall, F-Measure, Accuracy and Mean Accuracy are metrics used to evaluate the performance of the proposed Semi-supervised Learning Method. The performance metrics are defined as presented in equations (6) to (10). The 8 O. Oriola, E. Kotzé equations rely on the true positive (TP), which is the number of correctly pre- dicted abusive tweets; true negative (TN), which is the number of correctly predicted non-abusive tweets; false positive (FP), which is the number of in- correctly predicted abusive tweets; false negative (FN), which is the number of incorrectly predicted non-abusive tweets. TP Precision P = (6) TP + FP TP Recall R = (7) TP + FN 2X(Recall X P recison) F-Measure F1 = (8) (Recall + P recision) TP + TN Accuracy A = (9) TP + FP + TN + FN �10 Ai Mean Accuracy = i=1 (10) 10 4.5 Performance evaluation The proposed semi-supervised learning approach (SSL) was compared with two supervised learning approaches. – Method A: This is a supervised learning method, in which only labelled data was used for training data. – Method B: This is a supervised learning method with unmatched train- ing data distribution consisting of labelled data and unlabelled data, which pseudo-labels was obtained by K-means clustering of unlabelled data without the features of the labelled data. 5 Results We evaluated the SSL for word n-gram and character n-gram features with Support Vector Machine and Logistic Regression classifiers and compared the performances with method A and method B. The testing results of accuracy and mean accuracy for SVM and LogReg over 10-fold cross validation are presented in Table 2 and Table 3, respectively. Also, the results of Precision, Recall and F-Measure for SVM for word and character n-gram features are presented in Table 4, while the results of LogReg are presented in Table 5. The comparisons of accuracy and F-measure of the SSL, method A and method B are presented in Figure 2. The results of the performance of the SSL, method B and method A in Table 2 showed that the SSL recorded the highest accuracy of 0.9585 and 0.9667 for word n-gram and character n-gram, respectively followed by method A. The difference 9 Table 2. Accuracy and Mean Accuracy for Support Vector Machine testing Best Feature Features Best Mean Model Accuracy Type Parameter Accuracy (n-gram) Method C Word 1 0.8335 0.8727 A =1 Char 3-to-7 C=1 0.8713 0.8816 Method Word 1,2,3 C=0.01 0.7988 0.8639 B Char 4-to-8 C=0.001 0.7980 0.8639 SSL Word 1 C=0.01 0.9585 0.9581 Char 2-to-6 C=1000 0.9667 0.9597 in the accuracy of the SSL and method A were 0.0942 (≈0.1) and 0.0968 (≈0.1) for word and character n-gram, respectively. In Table 3, the SSL recorded the highest accuracy of 0.9578 and 0.9696 for word n-gram and character n-gram, respectively followed by method A. The difference in the accuracy of the SSL and method A were 0.1228 and 0.0924 (≈0.1) for word and character n-gram, respectively. The same ranges of differences were recorded in mean accuracy, which showed clearly that the SSL convincingly outperformed method A and method B, in terms of accuracy. Method B recorded the lowest accurcy in both Table 2 and Table 3. Table 3. Accuracy and Mean Accuracy for Logistic Regression Testing Best Feature Features Best Mean Model Accuracy Type Parameter Accuracy (n-gram) Method L1, Word 1,2 0.8350 0.8786 A C = 3792.69 L2, Char 3-to-7 0.8772 0.8816 C= 78.47 Method L1, Word 1 0.7980 0.8639 B C = 0.0001 L1, Char 2-to-6 0.7980 0.8639 C = 0.0001 L1, SSL Word 1 0.9578 0.9580 C = 29.76 L2, Char 2-to-6 0.9696 0.9613 C = 78.47 In Table 4, the value of precision, recall and F-Measure for SSL were higher than both method A and method B. While 0.96 and 0.95 precisions were recorded 10 O. Oriola, E. Kotzé Table 4. Precision, Recall and F-Measure for Support Vector Machine testing Model Feature Type Precision Recall F-Measure NA A NA A NA A Method A Word 0.89 0.59 0.91 0.53 0.9 0.56 Char 0.89 0.76 0.96 0.52 0.96 0.62 Method B Word 0.8 0.8 1 0 0.89 0.01 Char 0.8 0 1 0 0.89 0 SSL Word 0.96 0.95 0.99 0.8 0.98 0.87 Char 0.97 0.94 0.99 0.87 0.98 0.9 Table 5. Precision, Recall and F-Measure for Logistic Regression Testing Model Feature Type Precision Recall F-Measure NA A NA A NA A Method A Word 0.88 0.6 0.92 0.49 0.9 0.54 Char 0.89 0.8 0.97 0.51 0.93 0.62 Method B Word 0.8 0 1 0 0.89 0 Char 0.8 0 1 0 0.89 0 SSL Word 0.98 0.87 0.97 0.89 0.97 0.88 Char 0.98 0.93 0.99 0.89 0.98 0.91 by the SSL in word n-gram evaluation for non-abusive and abusive, respectively, method A recorded 0.80 and 0.80, and method B recorded 0.89 and 0.90. In char- acter n-gram evaluation, the SSL recorded 0.97 and 0.94, respectively, method A recorded 0.89 and 0.76, while method B recorded 0.80 and 0.00. In the case of recall, the highest values of 0.99 and 0.80 were recorded by the SSL for non- abusive and abusive, respectively, followed by 0.91 and 0.53 for method A. In character n-gram, the SSL recorded the highest recall of 0.99 and 0.87, respec- tively, followed by method A. The 1.00 recall against 0.00 for method B showed bias against abusive class during training. In the case of F-Measure, the SSL recorded the higest performance of 0.87 and 0.90 for abusive tweet detection for word n-gram and character n-gram, respecively followed by method A. The value of precision, recall and F-Measure for the SSL in Table 5 were also higher than both method A and method B. In character n-gram evaluation, the SSL recorded 0.98 and 0.93, respectively, method A recorded 0.89 and 0.80, while method B recorded 0.80 and 0.00. For recall, while 0.97 and 0.89 precision was recorded by proposed SSL for non- abusive and abusive, respectively, method A recorded 0.92 and 0.49 and method B recorded 1.00 and 0.00. For character n-gram, the SSL recorded 0.99 and 0.89, respectively, method A recorded 0.96 and 0.52, while method B recorded 1.00 and 0.00. The 1.00 recall against 0.00 for method B, showed bias against abusive class during training. In terms of F-Measure, the SSL recorded the higest performance of 0.90 and 0.91 for abusive tweet detection for word n-gram and character n-gram respecively followed by method A. The lowest F-Measure was recorded by method B. 11 The bar chart in Figure 2(a) showed that there was consistent increase in the width of the charts from method B through method A to the SSL. There was also slight increase from word n-gram to character n-gram. In Figure 2(b), drastic rise was observed from the bar of method B to method A. So also was drastic rise from method A to proposed SSL. These outcomes indicated that the proposed SSL addressed the problems of class imbalance and unmatched distribution. Fig. 2. Performance for the Different Methods : (a) Accuracy (b) F-measure 6 Conclusion We have developed a semi-supervised learning approach that combined both la- belled and unlabelled data, without skewness towards labelled data for improved detection of abusive tweets in binary classification model. The approach reduced the impact of class imbalance and unmatched distribution among labelled, un- labelled and testing data features. Matrix multiplication was used to fuse the labelled and unlabelled features; K-Means algorithm was used to cluster the fused features; majority voting rule was applied to select reliable labels for the unlabelled samples. The labelled and the previous unlabelled samples were used as training data. The performance of the approach was evaluated using word n-gram and character n-gram features as well as support vector machine and logistic regression classifiers. The results showed that our semi-supervised learning approach performed better than su- pervised learning approaches, with few training data or noisy training data. In future, classification of abusive language will be considered. 12 O. Oriola, E. Kotzé References 1. A. Søgaard, I. Vulic, S. Ruder, and M. Faruqui. Cross-lingual word embeddings. Synth. Lect. Hum. Lang. Technol, page 132 1–132. 2. M.O. Ibrohim and I. Budi. Sciencedirect a dataset dataset and and preliminaries preliminaries study study for for abusive abusive language language detection de- tection in indonesian social media in indonesian social media. Procedia Comput. Sci, 135:222–229. 3. R.S.A. Republıc of south afrıca preventıon and combatıng of hate crımes and hate. 4. J. Clement. South africa: digital population as of january 2019. Statista. Available. 5. Multilingual cross-domain perspectives on online hate speech. Clips tech. rep. series 8,. 6. A. Schmidt and M. Wiegand. A survey on hate speech detection using natural language processing. 7. H. Kamper, L. Karen, and G. Sharon. An embedded segmental k-means model for unsupervised segmentation and clustering of speech. Comput. Lang. arXiv:1703,. 8. C. Khatri, B. Hedayatnia, R. Goel, A. Venkatesh, R. Gabriel, and A. Mandal. Detecting offensive content in open-domain conversations using two stage semi- supervision arxiv : 1811. 12900v1 [ cs, 30(v 2018). 9. I. Gunasekara. A review of standard text classification practices for multi-label toxicity identification of online content. 10. B.language Norton. Identity and the ownership of english. Language and Identity, 31(3):409–429. 11. HateBase. Hatebase: The world’s largest structured repository of regionalized, multilingual hate speech. Hatebase. Available at:. 12. I.E. Livieris, A. Kanavos, V. Tampakas, and P. Pintelas. An auto-adjustable semi- supervised. 13. B.A. Hamilton. Semi-supervised learning with self-supervised networks. arXiv:1906.10343v1. 14. S. Kumar, X. Gao, and I. Welch. Cluster-than-label : Semi-supervised approach for domain adaptation. In 2017 IEEE 31st International Conference on Advanced Information Networking and Applications, page 704–711. 15. A. Albalate, A. Suchindranath, D. Suendermann, and W. Minker. A semi- supervised cluster-and-label approach for utterance classification. In Inter- speech2010, page 1–4. 16. M. Chegini. clustering , and active learning, 3:9–17. 17. D.-H. Lee. Pseudo-label : The simple and ecient semi-supervised learning method for deep neural networks. In ICML 2013 Workshop : Challenges in Representation Learning (WREPL, page 2–7. Atlanta, Georgia, USA. 18. B. Miller. Active learning approaches for labeling text. 19. Y. Li, Y. Lv, S. Wang, J. Liang, J. Li, and X. Li. Cooperative hybrid semi- supervised learning for text classification. Symmetry, 11(2). 20. Mingwei Leng, Jinjin Wang, Jianjun Cheng, and X. Hanhai Zhou. C. Journal of Software. J. Softw. Eng, 22:1–7. 21. M. Peikari, S. Salama, S. Nofech-mozes, and A.L. Martel. Open a cluster-then- label semi- supervised learning approach for pathology image classification. Sci. Rep, 1–13. 22. G. Forestier, C. Wemmert, G. Forestier, and C. Wemmert. Semi-supervised learn- ing using multiple clusterings with limited labeled data with limited labeled data. Inf. Sci. Elsevier, 361–362:48–65. 13 23. J. Cohen. A coefficient of agreement for nominal scales. Educ. Psychol. Meas, 20:37–46. 24. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., G. O., and D.E. Scikit-learn. Machine learning in python. J. Mach. Learn. Res, 12:2825–2830. 25. T. Davidson, D. Warmsley, M. Macy, and I. Weber. Automated hate speech de- tection and the problem of offensive language. In Proceedings of the Eleventh International AAAI Conference on Web and Social Media(ICWSM, page 512–515. 26. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vis.