Logical Vision: Meta-Interpretive Learning for Simple Geometrical Concepts

Wang-Zhou Dai¹, Stephen H. Muggleton² and Zhi-Hua Zhou¹
¹ National Key Laboratory for Novel Software Technology, Nanjing University
² Department of Computing, Imperial College London

Abstract. Progress in statistical learning in recent years has enabled computers to recognize objects with near-human ability. However, recent studies have revealed particular drawbacks of current computer vision systems which suggest considerable differences between the way these systems function and human visual cognition. The major differences are that: 1) current computer vision systems learn high-level notions directly from the low-level feature space and ignore mid-level representations, which makes it difficult for them to incorporate background knowledge; 2) typical computer vision systems learn visual concepts discriminatively instead of encoding the knowledge necessary to produce a visual representation of the class. In this paper, we introduce a framework, referred to as Logical Vision, which is demonstrated on learning visual concepts constructively and symbolically. Given a set of images, a set of first-order logic formulae as background knowledge, and a set of examples of the target visual concepts, Logical Vision extracts logical facts concerning geometrical elements from an image by sampling low-level features under the guidance of the background knowledge and conjecturing geometrical elements from them. It first extracts logical facts describing mid-level features; a generative Meta-Interpretive Learning technique is then applied to learn high-level notions, since it is capable of learning recursion, inventing predicates and so on. Owing to its symbolic representation paradigm, Logical Vision is fully implemented in Prolog, apart from the low-level image feature extraction primitives. In our implementation, Logical Vision extracts polygon edges as mid-level symbols, and a generalized Meta-Interpretive Learner is applied to learn high-level geometrical notions. Experiments are conducted on learning shapes (e.g. triangles, quadrilaterals, etc.), regular polygons and right-angle triangles. These demonstrate that learning visual concepts constructively and symbolically is effective.

1 Introduction

Computer vision is a sub-field of Artificial Intelligence which aims at the analysis and interpretation of visual information. It is an information-processing procedure that receives raw, loosely structured data as input and outputs explicit, meaningful descriptions of the structured information.

Fig. 1. Images of Papilionidae and Pieridae (source: http://www.papua-insects.nl): each butterfly can be roughly seen as a composition of different polygons; the visual concepts of butterfly species are defined by their component polygons, the (position) relations between the components, the color patterns inside each component, etc.

Computer vision can be categorized into high-level and low-level vision. Low-level vision is aimed at image processing tasks and the identification of local features. By contrast, high-level vision is aimed at delivering an overall scene analysis in terms of relations between objects in the scene [14]. Recently, almost all research has concentrated on low-level vision and drawn on the power and efficiency of statistical learning to enable computer vision algorithms to identify information from low-level metrics (e.g.
color and gradient information in small patches of objects surrounding interest points in images). The recognition task is then based on searching for and matching the low-level features with complex statistical classifiers such as decision trees, neural networks, support vector machines and so on. In addition to traditional low-level features, recent popular feature descriptors like SIFT [19] and SURF [4] can even encode spatial relationships between the original low-level features. These features are designed to be invariant to changes in scale, illumination, rotation, and affine transformations. During the last decade, more and more complex classifiers and matching methods have been used to combine strong and weak features for more effective recognition.

Recently, deep neural networks (DNNs) [12, 5] have demonstrated impressive, state-of-the-art results on many pattern recognition tasks, especially image classification problems [17, 15]. DNNs are able to learn hierarchical layers of representation from sensory input, and it is possible to train neurons to be selective for high-level concepts that function as detectors for faces, human bodies, and cat faces, which underlies their human-competitive performance in many tasks [17, 15]. However, recent studies have revealed some major differences between DNNs and human visual cognition [1, 31], differences which exist in most statistics-based computer vision learning algorithms.

For example, it is easy to produce images that are completely unrecognizable to humans, yet which state-of-the-art visual learning algorithms believe to be recognizable objects with over 99% confidence [1]. Because these algorithms learn a discriminative model, synthetic images that lie deep within a classification region (i.e. far from the decision boundary) in the low-level feature space can produce high-confidence predictions, even though they are also far from the natural images of the class [1]. In other cases, small perturbations of the input images, imperceptible to human eyes, can arbitrarily change the classifier’s prediction [31]. Analysis by the authors shows that the instability is caused by the classifiers’ sensitivity to small changes of low-level features in the input images.

Moreover, humans can typically learn from a single visual example [16], unlike statistical learning, which depends on hundreds or thousands of images. Humans achieve this ability using background knowledge, which plays a critical role; by contrast, statistics-based computer vision algorithms have no general mechanism for incorporating background knowledge. According to [21], the human vision process can be postulated as a hierarchical architecture with different intermediate representations and processing levels. At each stage of recognition, the representations (symbols) obtained from previous stages play the role of background knowledge. For example, in Figure 1, to recognize a butterfly one should first be able to detect polygons in the image; then, from the different shapes, position relations and color patterns of the polygons, one can categorize the butterflies into different species.

In this paper we consider the approach of using modern ILP techniques to support the incorporation of background knowledge in the generation of scene analysis in terms of high-level relations. For this purpose we propose a novel visual concept learning framework, called Logical Vision, to realize this symbolic visual processing paradigm.
Logical Vision first uses background knowledge about mid-level symbols to guide the sampling of low-level features, then uses the sampled results to revise previously conjectured mid-level symbols. With the extracted mid-level feature symbols as background knowledge, a generalized Meta-Interpretive Learner [23] is used to learn high-level visual concepts, because it enhances the constructive paradigm of Logical Vision through its ability to learn recursive theories, invent predicates and learn from a single example.

The long-term aim of Logical Vision is to learn to analyze natural objects and scenes by partitioning them into colored polygons, as Figure 1 shows. As a first step, in this work we apply the Logical Vision framework to tasks involving learning simple geometrical concepts such as triangles, quadrilaterals, regular polygons and so on. In order to model the visual process for general applications, we use one of the lowest-level features as a primitive for learning other concepts. In this work, we define a “point” as a pixel which has a large gradient in its local region. Based on this primitive, we can learn more complex objects such as edges, polygons, combinations of polygons, etc., stage by stage. Owing to its symbolic representation, Logical Vision can be fully implemented in Prolog given low-level image feature extraction primitives as the initial background knowledge. Our experimental results show its effectiveness in learning target visual concepts which are difficult for typical statistical computer vision algorithms based on low-level features.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 proposes the Logical Vision framework. Section 4 describes the implementation of the Metagol_{LogicalVision} approach, followed by experimental results in Section 5. Finally, Section 6 concludes and discusses future work.

2 Related work

State-of-the-art computer vision algorithms are mostly deep neural networks (DNNs) trained on large-scale datasets, such as [17, 15, 29, 30]. For small-scale tasks, people usually use DNN descriptors learned from large-scale data as the feature space for learning and recognition. It has been shown that DNN features combined with standard statistical learning techniques still achieve state-of-the-art performance on these kinds of tasks [29, 30]. In Section 5 we compare the proposed approach with statistical classifiers using DNN features.

Besides the end-to-end visual learning paradigms, there exists another classical line of work which tries to parse images hierarchically, like human cognition, e.g. [25]. Recently, this framework has become more tractable owing to progress in machine learning and statistics. For example, [11, 9, 27] developed grammar models for hierarchical object recognition. [13] proposed the “composition machine” for constructing probabilistic hierarchical image models and encoding contextual relationships. [18] proposed a hierarchical feature coding approach which uses code-word learning, coding and pooling to obtain high-level features. There are also approaches that combine different levels of features [6, 35]. Most statistics-based computer vision systems either design objective functions as constraints in the statistical learning procedure, or manually develop specific features to enhance the learning process. However, it is difficult for them to incorporate general background knowledge in a logical formalism.
To incorporate background knowledge into statistical learning, many algorithms have been proposed in the last decade [20, 2, 22]. Some of them use background knowledge about low-level features to constrain the statistical learning process [20]; others directly learn models in a high-level feature space, in which first-order logic background knowledge can be naturally applied [2, 22]. Unlike these approaches, Logical Vision can exploit background knowledge at different levels and can process raw image data directly.

More closely related are those approaches which, like Logical Vision, adopt a symbolic learning paradigm. A representative work in this branch is [3]. The proposed approach first extracts low-level feature descriptors over the whole image. Based on these descriptors, interest points are selected to form candidate interest regions. A statistical model is then trained to categorize the candidate interest regions; positive ones are retained and labeled as object symbols. Finally, the object symbols are used for learning high-level concepts with supplementary background knowledge expressed in first-order logic. Our work shares the same objective; however, Logical Vision seeks symbolic representations in most stages of visual cognition, which enables more flexible ways of incorporating background knowledge. Besides directly using the statistically extracted facts as material for relational learning, Logical Vision is able to guide the low-level feature extraction with first-order knowledge.

3 The proposed framework

In this section we introduce the Logical Vision framework. The input to Logical Vision consists of a set of geometrical primitives B_P, one or more images I as background knowledge, and a set of logical facts E representing examples of the target visual concepts. The task is to learn a hypothesis H defining the target visual concept such that B_P, I, H |= E. Given an input image, Logical Vision alternately conjectures mid-level visual objects and samples low-level features to support or revise those conjectures. After obtaining the mid-level features, a meta-interpreter is executed to learn the target visual concepts.

3.1 Mid-level feature extraction

The purpose of mid-level feature extraction is to obtain the logical facts B_A representing the mid-level features of each input image I, which are necessary for learning the target visual concepts by ILP.

Mid-level feature extraction in Logical Vision is realized by repeatedly executing a “conjecturing and sampling” procedure. It uses mid-level feature conjectures to guide the sampling of low-level features, which are then used to revise previously constructed conjectures.

Here, low-level features refer to local visual metrics such as color information, gradients, SIFT and SURF descriptors, etc. The term “mid-level feature/symbol” is a relative concept: mid-level features are logical facts that represent possible sub-parts or components of higher-level concepts. For example, the low-level feature “color gradient” can be useful for describing mid-level features such as edges or contours. However, if the target concept is “butterfly”, the mid-level feature “edge” can itself be seen as a sub-part of higher-level concepts like “shape” and “region”.
Together with more features like “color pattern” and “region size”, and background knowledge about “position relations” and so on, we can finally learn the concept of a butterfly within a purely symbolic paradigm by ILP (see Figure 1).

The intuition behind the “conjecturing and sampling” process is an analogy with the human vision process. Suppose a man stands in front of a huge wall painting, so large that he can clearly observe only a small region at a time. To get a whole picture of the painting, he can move his eyes around to see different small regions and guess at its entire view. During the observation, he can either sample more details to support his conjectures, or revise them by further sampling. After enough samples, he will believe that his final conjecture is the ground truth.

Formally, mid-level feature extraction from an image I can be described as follows:

1. Sample low-level features in a subarea (e.g. surrounding a focal point) of I and add them to the sampled low-level feature set F.
2. Conjecture a mid-level feature (edge, region, texture, etc.) C according to F.
3. Validate the conjecture C on image I by taking a few more samples. If the validation fails, reject C and go to step 1; otherwise go to step 4.
4. When C is valid, add it to the mid-level feature set B_A, then remove the low-level features f(C) encapsulated by C, leaving F' = F − f(C). For example, if the low-level features are pixels whose local areas have a large color variance and the mid-level features to be extracted are contours on the image, then once a conjectured contour C has been accepted into B_A we should remove all other pixels on the contour, so f(C) = {pixel | pixel is on contour C}.
5. If F' = ∅, terminate the construction procedure and return B_A; otherwise go to step 1.

Briefly speaking, Logical Vision uses mid-level feature conjectures to guide the sampling of low-level features, then uses the sampled results to revise previously obtained conjectures. The low-level features themselves, such as pixel colors, local color variances or gradient directions, are usually redundant and too fine-grained to represent higher-level visual concepts. After the background-knowledge-guided extraction, they can be compactly abduced into logical symbols such as edges, regions and textures, which serve as the basis for learning higher-level concepts.

This human-cognition-mimicking paradigm assumes that the mid-level conjectures are constructed from pre-defined predicates, i.e. background knowledge. This assumption is reasonable because a person needs certain knowledge about particular simple primitives in order to learn more complicated concepts. For example, we have to define “gradient” before we learn the notion of “contour”, and we have to understand “line segment” before we learn the concept of “polygon”. The procedure follows a symbolic learning paradigm, so it can easily be implemented with logic programming tools like Prolog, as the sketch below illustrates. It also means that Logical Vision hardly needs typical computer vision operations such as sliding windows or image filtering: we believe that in human cognition, observation is a process that happens only when necessary rather than an exhaustive enumeration, and background-knowledge/conjecture-guided abduction is a proper way to model this kind of action.
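The loop above maps directly onto Prolog. The following is a minimal sketch, not the paper’s implementation: conjecture/2, validate/2, covered_features/3 and sample_low_level/2 are hypothetical placeholders for background-knowledge-dependent primitives (the concrete instances used in this paper, edge_point/1 and edge/3, appear in Section 4.1).

    % A minimal sketch of the conjecturing-and-sampling loop (steps 1-5).
    % conjecture/2, validate/2, covered_features/3 and sample_low_level/2
    % are hypothetical placeholders, not the paper's actual primitives.
    extract_mid_level(Img, BA) :-
        sample_low_level(Img, Fs),            % step 1: initial samples
        extract_mid_level(Img, Fs, [], BA).

    extract_mid_level(_Img, [], BA, BA).      % step 5: no features left
    extract_mid_level(Img, Fs, Acc, BA) :-
        conjecture(Fs, C),                    % step 2: conjecture C from F
        (   validate(Img, C)                  % step 3: a few extra samples
        ->  covered_features(C, Fs, Rest),    % step 4: drop f(C) from F
            extract_mid_level(Img, Rest, [C|Acc], BA)
        ;   sample_low_level(Img, New),       % validation failed: back to step 1
            append(New, Fs, Fs1),
            extract_mid_level(Img, Fs1, Acc, BA)
        ).

Any backtracking into conjecture/2 simply proposes an alternative mid-level symbol for the same samples, which is how rejected conjectures are revised.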
By incorporating a first-order formalism into the feature extraction procedure, we want to show that symbolic reasoning is able to solve particular computer vision tasks which have long been considered statistical problems.

Table 1. Prolog code for the generalized meta-interpreter. The interpreter recursively proves a series of atomic goals by matching them against the heads of meta-rules. After testing the Order constraint, save_subst checks whether the meta-substitution is already in the program and otherwise adds it to form an augmented program. On completion the returned program, by construction, derives all the examples.

    % Generalized meta-interpreter
    prove([], Prog, Prog).
    prove([Atom|As], Prog1, Prog2) :-
        metarule(Name, MetaSub, (Atom :- Body), Order),
        Order,
        save_subst(metasub(Name, MetaSub), Prog1, Prog3),
        prove(Body, Prog3, Prog4),
        prove(As, Prog4, Prog2).

3.2 Meta-Interpretive Learning

After obtaining the mid-level logic symbols B_A, Logical Vision uses a generalized Meta-Interpretive Learner to learn the target visual concepts. The input to generalized Meta-Interpretive Learning (MIL) [24] consists of a generalized meta-interpreter B_M and domain-specific primitives B_P, together with two sets of ground atoms: the background knowledge B_A and the examples E. The output of MIL is a revised form of the background knowledge, containing the original background knowledge B_A and the domain-specific primitives B_P augmented with additional ground atoms representing a hypothesis H. According to Inverse Entailment, B, H |= E is equivalent to B, ¬E |= ¬H. In this form we see that B, ¬E is given to the meta-interpreter, where ¬E is a goal (a headless Horn clause), and the resulting abduced program represents ¬H. The Prolog implementation of the generalized meta-interpreter is shown in Table 1. In this work, we use dyadic meta-rules [24] to learn the target theories.

Algorithm 1 LogicalVision_{Poly}(B_P, I, Metagol_{LogicalVision}, E, N)

Input: geometrical primitives B_P, input image I, examples E, Meta-Interpretive learner Metagol_{LogicalVision}, sampling level N.
Output: hypothesis H of the target visual concept.

    Initialize the edge point set F = ∅ and the sampled edge set B_E = ∅;
    randomly sample some edge points {P1, P2, ...} and let F = F ∪ {P1, P2, ...};
    repeat
        select a pair of edge points P1, P2 ∈ F;
        validate whether P1P2 forms an edge by querying edge(P1,P2,N);
        if edge(P1,P2,N) succeeds then
            extend P1P2 in both directions to form a conjectured edge C;
            B_E = B_E ∪ {C};
            remove from F all edge points P that lie on edge C;
        else
            randomly sample a line crossing the segment P1P2 for new edge points;
            add those not encapsulated by any sampled edge in B_E to F;
        end if
    until F = ∅;
    find connected edges in B_E to construct the polygon facts B_A;
    learn a hypothesis H from B_A, B_P and E with Metagol_{LogicalVision} through MIL;
    return H.

4 Implementation

Below we describe the implementation of Logical Vision for the task of learning polygon shapes. The target concepts of this task are definitions of different kinds of polygons (e.g. triangles, regular polygons, etc.). Our implementation is displayed as Algorithm 1, which is referred to as LogicalVision_{Poly}.

4.1 Polygon extraction

To learn the concepts of polygon shapes, the mid-level features B_A to be extracted are polygons, denoted as polygon(Pol_i, [Edge1,...,EdgeN]). The process of polygon extraction can be split into two stages: edge discovery and polygon construction.
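To make the representation concrete, the fact below shows what B_A might contain for a single triangle image. Only the outer polygon(Pol_i, [Edge1,...,EdgeN]) form is fixed by the text; the inner edge/2 terms and the p/2 pixel coordinates are illustrative assumptions.

    % Hypothetical B_A fact for one extracted triangle; the edge/2 and
    % p/2 (pixel coordinate) term shapes are illustrative assumptions.
    polygon(pol_1, [edge(p(30,20), p(90,20)),
                    edge(p(90,20), p(60,80)),
                    edge(p(60,80), p(30,20))]).

Given facts of this form, together with a primitive such as list_length/2, hypotheses like the triangle/1 definitions reported in Section 5.3 become directly provable.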
For simplicity, in the polygon construction stage, LogicalVision_{Poly} groups connected edges into a list representing a polygon. The major challenge is therefore to discover the edges. Following the framework of Section 3.1, we introduce primitives as background knowledge to perform the conjecturing and validation. For example, the background knowledge for “edge” is defined as follows:

    edge(P1,P2,0) :-
        midpoint(P1,P2,P),
        edge_point(P1), edge_point(P2), edge_point(P).
    edge(P1,P2,N) :-
        midpoint(P1,P2,P),
        edge_point(P1), edge_point(P2),
        N1 is N - 1,
        edge(P1,P,N1), edge(P,P2,N1).

in which P1 and P2 are the conjectured end points of an edge, N is the recursion limit that controls the depth of edge validation, and midpoint/3 finds the midpoint between two pixels. Given the conjecture that (P1, P2) is an edge, the predicate recursively samples the points between P1 and P2 and tests whether they too have a high local variance (color gradient). The color gradient test is performed by edge_point/1, the only primitive that interacts with low-level image features; it returns true when the color gradient magnitude of the pixel exceeds a pre-defined threshold. The color gradient is computed as

    G(A) = \sqrt{G_x(A)^2 + G_y(A)^2}    (1)

where

    G_x(A) = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A    (2)

    G_y(A) = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * A    (3)

are Sobel filters, A is the image patch around the queried focal point, and * denotes the two-dimensional convolution operation. G_x and G_y represent the image derivatives in the horizontal and vertical directions, respectively. We implemented the image-processing routines with OpenCV [10], and used a C++/Prolog interface [34] to enable communication between the predicate edge_point/1 and the input images. A detailed example of the edge extraction process is illustrated in Figure 2.

Fig. 2. (a) Two edge points A and B are sampled. (b) Edge AB is conjectured but is invalid: for midpoint(A,B,P), edge_point(P) is false. So a random line crossing AB is sampled, and two new edge points C and D are discovered. (c) Edge AC is conjectured and passes the recursive edge test, so AC is extended until no further contiguous edge points are found. Finally the edge A′C′ is recorded and A, C are removed from F.

4.2 Metagol_{LogicalVision}

Nevertheless, the polygon extraction procedure of Section 4.1 sometimes results in a noisy B_A. This may cause the depth-first search in Metagol to fail or to return ground hypotheses that cover only one example. We therefore altered the original Metagol to enable it to abduce imperfect hypotheses and to evaluate them using FOIL gain [28]. The procedure is as follows:

1. Abduce a hypothesis P with the generalized meta-interpreter.
2. If the hypothesis covers all positive examples and rejects all negative examples, return P; otherwise go to step 3.
3. Evaluate the quality of P with FOIL gain [28]. If P is better than the current best hypothesis P̂* then replace P̂* with P. Go to step 1 to abduce another candidate hypothesis.

The domain-specific primitives B_P of Metagol_{LogicalVision} include the background knowledge necessary for learning the polygon-shape-related target concepts. For example, in order to learn the concept of a regular polygon, Metagol_{LogicalVision} uses angles_list/2 to obtain all inner angles of a polygon and std_dev_bounded/2 to test whether the standard deviation of a list of numbers is bounded by an automatically learned threshold.
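The paper does not show how std_dev_bounded/2 is implemented. A minimal sketch, assuming the inner angles arrive as a plain list of numbers and the bound is supplied as the second argument (which Metagol_{LogicalVision} instantiates with the learned threshold, e.g. 0.02 in Section 5.3), could read:

    % Sketch of std_dev_bounded/2 (not shown in the paper): succeeds when
    % the standard deviation of a list of numbers is at most Bound.
    std_dev_bounded(Xs, Bound) :-
        length(Xs, N), N > 0,
        sum_list(Xs, Sum),
        Mean is Sum / N,
        sq_diff_sum(Xs, Mean, 0, SqSum),
        StdDev is sqrt(SqSum / N),
        StdDev =< Bound.

    % Accumulate the sum of squared deviations from the mean.
    sq_diff_sum([], _, Acc, Acc).
    sq_diff_sum([X|Xs], Mean, Acc0, Acc) :-
        Acc1 is Acc0 + (X - Mean) ** 2,
        sq_diff_sum(Xs, Mean, Acc1, Acc).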
Moreover, the stochastic implementation of polygon extraction occasionally discovers redundant edges (see Figure 4); therefore the B_P of Logical Vision includes a post-processing primitive rmv_rdndnt(L1, L2, T). It enables Metagol_{LogicalVision} to learn a threshold T for automatically removing extra edges. Here L1 and L2 are the input and output edge lists respectively, and the threshold T controls the degree of refinement: given two connected edges AB and BC, if (|AB| + |BC|)/|AC| ≤ T, then rmv_rdndnt/3 omits them and constructs a new edge AC.

5 Experiments

In this section we describe experiments which compare Metagol_{LogicalVision} with statistics-based computer vision approaches on tasks of learning simple geometrical concepts from binary-colored images.

5.1 Materials

Three labeled image datasets were generated for polygon shape learning. For simplicity, the images are binary-colored and each image contains one polygon. The target concepts of the three tasks are: 1) triangle/1, quadrilateral/1, pentagon/1 and hexagon/1; 2) regular_poly/1 (regular polygon); 3) right_tri/1 (right-angle triangle). Note that in the third task, we used the best hypothesis for triangle/1 learned in the first task as background knowledge. Part of each dataset is presented in Figure 3. All datasets were partitioned into 5 folds; 4 folds were used for training and the remainder for testing. The details of the tasks are as follows:

Learning polygon shapes: We randomly generated 40 images each of triangles, quadrilaterals, pentagons and hexagons. Each image contains one polygon, drawn in black on a white canvas. The target concepts in this task are triangle/1, quadrilateral/1, pentagon/1 and hexagon/1.

Learning regular polygons: We randomly generated 10 images each of regular and irregular triangles, quadrilaterals, pentagons and hexagons, giving 80 images in total. In this task, the target concept is regular_poly/1.

Learning right-angle triangles: We randomly generated 40 images of right-angle triangles as positive examples. The negative examples consist of 10 each of triangles, quadrilaterals, pentagons and hexagons. In order to force a correct definition of a right-angle triangle to be learned, the generated quadrilaterals, pentagons and hexagons in the negative example set may contain right angles with probability 0.4. Note that in this task, Metagol_{LogicalVision} re-used the best hypothesis for triangle/1 learned in the first task as background knowledge. The target concept of this task is right_tri/1.

Fig. 3. Part of the datasets for the 3 learning tasks: (a) learning triangles, quadrilaterals, etc.; (b) learning regular polygons; (c) learning right-angle triangles.

Fig. 4. Noise in polygon extraction: (a) is the ground-truth image; (b) and (c) are two polygons extracted by our algorithm, where (c) contains a redundant vertex.

5.2 Methods

LogicalVision_{Poly}: This is the proposed approach. In order to handle the noise caused by polygon extraction (e.g. Figure 4), for each image we ran the extraction procedure five times independently to duplicate the input instances (both for training and testing). During evaluation, the learned hypotheses were tested on all five extracted polygons and the final prediction was made by an equally weighted vote.

Statistics-based learning: We used a popular statistics-based computer vision toolbox, VLFeat [33], to implement the statistical learning algorithms.
The experiments were carried out with several kinds of features. Because the datasets are small, we used a support vector machine (LIBSVM [7]) as the classifier; its parameters were selected by 5-fold cross-validation. The features used in the experiments are as follows:

– HOG: The Histogram of Oriented Gradients (HOG) [8] is a feature descriptor commonly used in computer vision and image processing for object detection. The technique counts occurrences of gradient orientations in localized portions of an image. In order to preserve all gradient information in the image, the HOG descriptors in our experiments were not summarized by k-nearest-neighbor bag-of-words models. This feature is a 148800-dimensional vector.

– Dense-SIFT: The Scale Invariant Feature Transform (SIFT) [19] is an image descriptor for image-based matching and recognition. The dense-SIFT descriptor is roughly equivalent to running SIFT on a dense grid of locations at a fixed scale and orientation. It is often used for object categorization and has proven very useful in practice for image matching and object recognition under real-world conditions. In our experiments, it produces a 34048-dimensional vector for each image.

– LBP: The local binary pattern (LBP) [26] is a powerful feature for texture classification which labels the pixels of an image by thresholding the neighborhood of each pixel and interpreting the result as a binary number. LBP texture analysis can be seen as a unifying approach to the traditionally divergent statistical and structural models of texture analysis. We extracted a 69600-dimensional vector for this feature.

– CNN: A convolutional neural network (CNN) is a type of biologically inspired feed-forward artificial neural network in which the individual neurons are tiled so that they respond to overlapping regions of the visual field. Combined with deep architectures, CNNs have become very popular for image and video recognition [15, 29]. In the experiments, we extracted a 4096-dimensional feature for each image with a pre-trained deep CNN model called imagenet-vgg-verydeep-16 [30], which is implemented in MatConvNet [32]. The descriptors were trained on the ImageNet ILSVRC-2012 dataset (1.5 million photos) and have shown state-of-the-art generalization performance in many image classification tasks [30].

– Feature combinations: The statistical computer vision experiments were also carried out on combinations of the above feature sets.

5.3 Results

Tables 2 and 3 show the results of our experiments. The performance of the compared methods was evaluated by both predictive accuracy and F1-score on the hold-out test data in each fold.

Table 2. Predictive accuracy of learning simple geometrical shapes on single-object datasets. The feature combinations are abbreviated by the initial letters of each method.

    ACC         tri          quad         pen          hex          reg          r_tri
    HOG         0.83 ± 0.04  0.76 ± 0.01  0.73 ± 0.03  0.75 ± 0.07  0.63 ± 0.08  0.74 ± 0.04
    dense-SIFT  0.82 ± 0.05  0.66 ± 0.06  0.64 ± 0.04  0.71 ± 0.03  0.71 ± 0.05  0.77 ± 0.07
    LBP         0.87 ± 0.05  0.69 ± 0.04  0.67 ± 0.03  0.73 ± 0.03  0.65 ± 0.05  0.75 ± 0.05
    CNN         0.91 ± 0.01  0.75 ± 0.00  0.75 ± 0.00  0.84 ± 0.02  0.59 ± 0.06  0.85 ± 0.04
    H+d         0.82 ± 0.01  0.75 ± 0.01  0.76 ± 0.01  0.76 ± 0.01  0.64 ± 0.05  0.80 ± 0.03
    C+d         0.82 ± 0.01  0.75 ± 0.00  0.76 ± 0.01  0.76 ± 0.01  0.69 ± 0.04  0.80 ± 0.03
    C+L         0.87 ± 0.05  0.75 ± 0.01  0.76 ± 0.01  0.76 ± 0.01  0.61 ± 0.05  0.78 ± 0.06
    C+d+L       0.82 ± 0.01  0.75 ± 0.00  0.76 ± 0.01  0.76 ± 0.01  0.64 ± 0.05  0.80 ± 0.04
    LV_Poly     1.00 ± 0.00  0.99 ± 0.01  1.00 ± 0.00  0.99 ± 0.01  1.00 ± 0.00  1.00 ± 0.00

Table 3. F1-measure of learning simple geometrical shapes on single-object datasets. The feature combinations are abbreviated by the initial letters of each method.

    F1          tri          quad         pen          hex          reg          r_tri
    HOG         0.67 ± 0.11  0.80 ± 0.03  0.81 ± 0.02  0.72 ± 0.11  0.43 ± 0.05  0.31 ± 0.16
    dense-SIFT  0.51 ± 0.10  0.67 ± 0.03  0.67 ± 0.04  0.67 ± 0.08  0.39 ± 0.04  0.33 ± 0.19
    LBP         0.53 ± 0.15  0.72 ± 0.03  0.71 ± 0.05  0.67 ± 0.07  0.39 ± 0.03  0.36 ± 0.19
    CNN         0.33 ± 0.18  0.86 ± 0.00  0.86 ± 0.00  0.71 ± 0.05  0.47 ± 0.04  0.40 ± 0.07
    H+d         0.63 ± 0.12  0.85 ± 0.00  0.84 ± 0.02  0.81 ± 0.08  0.37 ± 0.06  0.22 ± 0.15
    C+d         0.63 ± 0.12  0.86 ± 0.00  0.84 ± 0.02  0.81 ± 0.08  0.47 ± 0.10  0.22 ± 0.15
    C+L         0.53 ± 0.15  0.84 ± 0.03  0.85 ± 0.02  0.81 ± 0.05  0.37 ± 0.10  0.31 ± 0.20
    C+d+L       0.63 ± 0.12  0.86 ± 0.00  0.84 ± 0.02  0.81 ± 0.08  0.37 ± 0.06  0.24 ± 0.14
    LV_Poly     1.00 ± 0.00  0.97 ± 0.02  1.00 ± 0.00  0.98 ± 0.02  1.00 ± 0.00  1.00 ± 0.00

The following are some examples of the hypotheses learned by LogicalVision_{Poly}:

    triangle_1(A,C,H) :- rmv_rdndnt(A,B,C), list_length(B,H).
    triangle_0(A,A2,B2) :- polygon(A,B), triangle_1(B,A2,B2).
    triangle(A) :- triangle_0(A,0.04,3).

    triangle_0(A,G) :- polygon(A,B), list_length(B,G).
    triangle(A) :- triangle_0(A,3).

    quadrilateral_0(A,G) :- polygon(A,B), list_length(B,G).
    quadrilateral(A) :- quadrilateral_0(A,4).

    pentagon_0(A,G) :- polygon(A,B), list_length(B,G).
    pentagon(A) :- pentagon_0(A,5).

    hexagon_1(A,C,H) :- rmv_rdndnt(A,B,C), list_length(B,H).
    hexagon_0(A,A2,B2) :- polygon(A,B), hexagon_1(B,A2,B2).
    hexagon(A) :- hexagon_0(A,0.004,6).

    hexagon_0(A,G) :- polygon(A,B), list_length(B,G).
    hexagon(A) :- hexagon_0(A,6).

    regular_poly_1(A,G) :- angles_list(A,B), std_dev_bounded(B,G).
    regular_poly_0(A,A2) :- polygon(A,B), regular_poly_1(B,A2).
    regular_poly(A) :- regular_poly_0(A,0.02).

    right_tri_2(A,G,H) :- angles_list(A,B), has_angle(B,G,H).
    right_tri_1(A,A2,B2) :- polygon(A,B), right_tri_2(B,A2,B2).
    right_tri_0(A,A2,B2) :- right_tri_1(A,A2,B2), triangle(A).
    right_tri(A) :- right_tri_0(A,0.5,0.015).

where list_length/2 returns the length of a list; angles_list/2 returns the list of sizes of a polygon’s angles; std_dev_bounded/2 examines whether the standard deviation of a set of real numbers is bounded by a threshold; has_angle/3 checks whether a list (of angle sizes) contains a real number within an error bound; and rmv_rdndnt/3 removes redundant edges as introduced in Section 4.2. For the right_tri/1 task, we included the best hypothesis from the triangle/1 task as background knowledge.

From the results we can see that the performance of LogicalVision_{Poly} is significantly better than that of the compared statistics-based vision learning methods on these tasks, which suggests that symbolic learning can benefit visual concept learning.

6 Conclusion and future work

6.1 Conclusion and discussion

This paper studies a novel approach to the problem of visual concept learning, distinct from that employed by traditional computer vision learning algorithms. By using the proposed Logical Vision approach, we are able to exploit background knowledge flexibly and effectively in visual concept learning tasks.
Owing to its symbolic paradigm, both the background-knowledge-guided mid-level feature extraction and the high-level visual concept learning can be implemented simply in logic programming languages such as Prolog. The experimental results indicate that the proposed framework has the potential to handle tasks which are traditionally hard for more statistically oriented approaches.

The main reason why LogicalVision_{Poly} outperforms statistics-based computer vision learners in our experiments is that it enables the incorporation of first-order background knowledge, which is very useful for learning the target concepts. Logical Vision exploits background knowledge in two ways. The first is using high-level background knowledge to guide the observation of low-level features; this mechanism is reflected in the use of the predicate edge/3 in Section 4.1. The second is using background knowledge as primitive predicates, as in general Inductive Logic Programming approaches, e.g. the predicates list_length/2, angles_list/2, etc. defined in Metagol_{LogicalVision}. It is difficult for statistical learning algorithms to incorporate these kinds of prior knowledge into their learning processes. For example, the HOG feature descriptor exploits exactly the same local gradient information as the edge_point/1 predicate, but its statistical learning framework lacks an effective methodology for including other complex background knowledge so as to organize the low-level features into more informative mid-level ones. On the other hand, CNN feature descriptors can encode mid/high-level features to a certain degree, yet this ability can hardly be obtained from a small dataset.

Unlike statistics-based computer vision systems, Logical Vision treats the visual learning problem in a manner closer to human cognition. Its learning process relies much more on background knowledge than on the quantity of data. Moreover, the learned theory can itself serve as background knowledge for subsequent tasks. Furthermore, it learns a constructive theory to define the target concept, rather than a discriminative model that only tells the differences between classes. From this point of view, it is not hard to understand why LogicalVision_{Poly} outperformed the statistics-based methods in the experiments.

6.2 Future work

In this work, we built a first system targeting simple geometrical concept learning tasks such as polygon shape learning. Although polygon identification itself cannot be directly applied to general computer vision tasks, it can be seen as a basic step towards further visual recognition. As illustrated in Section 1, complex cognition tasks can be decomposed hierarchically into easier sub-problems [21]. Having obtained knowledge of polygons, we can then define complex shapes as combinations of separate or occluded polygons. Moreover, by taking into consideration the pixels inside polygon regions, we can define complex objects as shapes with patterns. Furthermore, we can include background knowledge about three-dimensional geometry to help the understanding of an image by symbolic processing.

In future extensions of this work we hope, as a first step, to extend our study to more complicated tasks such as images involving a multiplicity of overlapping colored polygons.
7 Acknowledgments

This research was supported by the National Natural Science Foundation of China (61333014, 61321491), the RAEng Newton Research Collaboration Programme (NRCP/1415/133) and the Program B for Outstanding PhD Candidates of Nanjing University.

References

1. Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA (2015)
2. Andrzejewski, D., Zhu, X., Craven, M., Recht, B.: A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence. pp. 1171–1177. Barcelona, Spain (2011)
3. Antanas, L., van Otterlo, M., Oramas Mogrovejo, J., Tuytelaars, T., De Raedt, L.: There are plenty of places like home: Using relational representations in hierarchies for distance-based image understanding. Neurocomputing 123, 75–85 (2014)
4. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Computer Vision and Image Understanding 110(3), 346–359 (2008)
5. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1), 1–127 (2009)
6. Borenstein, E., Ullman, S.: Combined top-down/bottom-up segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(12), 2109–2125 (2008)
7. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011), software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
8. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Proceedings of the 13th IEEE Conference on Computer Vision and Pattern Recognition. pp. 886–893. San Diego, CA (2005)
9. Felzenszwalb, P.: Object detection grammars. In: Proceedings of the 13th IEEE International Conference on Computer Vision Workshops. pp. 691–691. Barcelona, Spain (2011)
10. Bradski, G.: OpenCV library. http://opencv.org/ (2000)
11. Hartz, J., Neumann, B.: Learning a knowledge base of ontological concepts for high-level scene interpretation. In: Proceedings of the 6th International Conference on Machine Learning and Applications. pp. 436–443. Cincinnati, OH (2007)
12. Hinton, G.E.: Learning multiple layers of representation. Trends in Cognitive Sciences 11(10), 428–434 (2007)
13. Jin, Y., Geman, S.: Context and hierarchy in a probabilistic image model. In: Proceedings of the 12th IEEE Conference on Computer Vision and Pattern Recognition. pp. 2145–2152. New York, NY (2006)
14. Krig, S.: Computer Vision Metrics: Survey, Taxonomy, and Analysis. Apress, Berkeley, CA (2014)
15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
16. Lake, B.M., Salakhutdinov, R., Gross, J., Tenenbaum, J.B.: One shot learning of simple visual concepts. In: Proceedings of the 33rd Annual Conference of the Cognitive Science Society. pp. 2568–2573 (2011)
17. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3361–3368. Colorado Springs, CO (2011)
18. Liu, J., Huang, Y., Wang, L., Wu, S.: Hierarchical feature coding for image classification. Neurocomputing 144, 509–515 (2014)
19. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
20. Maclin, R., Shavlik, J., Walker, T., Torrey, L.: Knowledge-based support-vector regression for reinforcement learning. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence Workshop on Reasoning, Representation, and Learning in Computer Games. Edinburgh, Scotland, UK (2005)
21. Marr, D.: Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt & Co., Inc., New York, NY (1982)
22. Mei, S., Zhu, J., Zhu, J.: Robust RegBayes: Selectively incorporating first-order logic domain knowledge into Bayesian models. In: Proceedings of the 31st International Conference on Machine Learning. pp. 253–261. Beijing, China (2014)
23. Muggleton, S.H., Lin, D., Pahlavi, N., Tamaddoni-Nezhad, A.: Meta-interpretive learning: Application to grammatical inference. Machine Learning 94(1), 25–49 (2014)
24. Muggleton, S.H., Lin, D., Tamaddoni-Nezhad, A.: Meta-interpretive learning of higher-order dyadic datalog: Predicate invention revisited. Machine Learning (2015), published online: DOI 10.1007/s10994-014-5471-y
25. Ohta, Y.-i., Kanade, T., Sakai, T.: An analysis system for scenes containing objects with substructures. In: Proceedings of the 4th International Joint Conference on Pattern Recognition. pp. 752–754. Kyoto, Japan (1978)
26. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987 (2002)
27. Porway, J., Wang, Q., Zhu, S.C.: A hierarchical and contextual model for aerial image parsing. International Journal of Computer Vision 88(2), 254–283 (2010)
28. Quinlan, J.R.: Learning logical definitions from relations. Machine Learning 5, 239–266 (1990)
29. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. In: Proceedings of the 2nd International Conference on Learning Representations. Banff, Canada (2014)
30. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd International Conference on Learning Representations. San Diego, CA (2015)
31. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I.J., Fergus, R.: Intriguing properties of neural networks. CoRR abs/1312.6199 (2013)
32. Vedaldi, A., Lenc, K.: MatConvNet – Convolutional neural networks for MATLAB. In: Proceedings of the 23rd Annual ACM Conference on Multimedia. Brisbane, Australia (2015)
33. Vedaldi, A., Fulkerson, B.: VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/ (2008)
34. Wielemaker, J., Schrijvers, T., Triska, M., Lager, T.: SWI-Prolog. Theory and Practice of Logic Programming 12(1-2), 67–96 (2012)
35. Zheng, S., Tu, Z., Yuille, A.: Detecting object boundaries using low-, mid-, and high-level information. In: Proceedings of the 15th IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–8. Minneapolis, MN (2007)