=Paper=
{{Paper
|id=Vol-2206/paper1
|storemode=property
|title=GPU-Accelerated Hypothesis Cover Set Testing for Learning in Logic
|pdfUrl=https://ceur-ws.org/Vol-2206/paper1.pdf
|volume=Vol-2206
|authors=Eyad Algahtani,Dimitar Kazakov
|dblpUrl=https://dblp.org/rec/conf/ilp/AlgahtaniK18
}}
==GPU-Accelerated Hypothesis Cover Set Testing for Learning in Logic==
Eyad Algahtani and Dimitar Kazakov

University of York, Heslington, York YO10 5GH, UK
ea922@york.ac.uk, kazakov@cs.york.ac.uk
WWW home page: https://www-users.cs.york.ac.uk/kazakov/

Abstract. ILP learners are commonly implemented to consider sequentially each training example for each of the hypotheses tested. Computing the cover set of a hypothesis in this way is costly, and introduces a major bottleneck in the learning process. This computation can be implemented more efficiently through the use of data-level parallelism. Here we propose a GPU-accelerated approach to this task for propositional logic and for a subset of first order logic. This approach can be used with one's strategy of choice for the exploration of the hypothesis space. At present, the hypothesis language is limited to logic formulae using unary and binary predicates, such as those covered by certain types of description logic. The approach is tested on a commodity GPU and datasets of up to 200 million training examples, achieving run times of below 30 ms per cover set computation.

Keywords: Inductive Logic Programming, learning in logic, hypothesis cover set, data parallelism, GPU, GPGPU

===1 Introduction===

Classical ILP learners are implemented as sequential algorithms, which either consider training examples one at a time (Golem [1]), or search through the hypothesis space computing the cover set of one hypothesis at a time (Foil [4]), or combine elements of these two sequential strategies, e.g. they select a positive example and then explore the hypothesis space implied by it (Progol [5], Aleph [6]). In all cases, computing the cover set of a hypothesis is a costly process that usually involves considering all negative examples (see footnote 1) along with all relevant positive examples (see footnote 2) in a sequential manner. This computation is based on set membership tests, and such tests can be parallelised efficiently through the use of data-parallel computations. In particular, GPU-based data parallelism has proven very efficient in a number of application areas [15]. We propose a GPU-accelerated approach to the computation of the cover set of a given hypothesis for a subset of first order logic, which results in a speed-up of several orders of magnitude. This approach could potentially be integrated with various strategies for the exploration of the hypothesis space. At present, the hypothesis language is limited to logic formulae using unary and binary predicates, such as those covered by certain types of description logic. This should be considered in the context of an increasing number of ILP algorithms targeting specifically description logic as a data and hypothesis language [17–19]. We test our approach on a commodity GPU, an Nvidia GeForce GTX 1070, and datasets of up to 200 million examples, achieving run times of under 30 ms on the full dataset.

Footnote 1: These may not always be explicitly provided, but estimated [7] or inferred using an assumption, such as output completeness in Foidl [16].
Footnote 2: In the case of a cautious learner, that is all positive examples. For eager learners, only the positive examples not yet covered by another clause are considered.
The rest of this paper is structured as follows: Section 2 reviews related work, Section 3 introduces background on GPUs and GPGPU, Section 4 presents and evaluates our novel, parallel implementation of the cover set computation for propositional data, Section 5 extends our approach to hypotheses with binary predicates, and Section 6 discusses the results and future work on the implementation of an ILP algorithm making use of the GPU-accelerated cover set computation proposed here.

===2 Related Work===

Previous work on the efficient implementation of cover set evaluation in ILP has included efforts to avoid redundancy, e.g. through the use of query packs [9], and the use of concurrent computation [11]. Fonseca et al. [8] provide a review and an evaluation of strategies for parallelising ILP, which they split into three groups. Firstly, when the target predicate amounts to a description of separate classes, each of these can be learned independently, in parallel, potentially from a separate copy of the whole dataset. Secondly, the search space can be explored in a parallel fashion, dividing the task among several processes. Thirdly, the data can be split and processed separately, in parallel, including for the task of testing hypothesis coverage, where the results of such mapping are merged in the final step.

Another way to categorise parallel evaluation approaches in ILP is according to whether they use shared or distributed memory [10]. Ohwada and Mizoguchi [11] used the former approach to combine branch-and-bound ILP search with a parallel logic programming language, while Graham, Page and Kamal [12] parallelised the hypothesis evaluation by assigning different clauses to different processors to be evaluated at the same time. In the distributed memory category, Matsui et al. [13] studied the impact of parallel evaluation on FOIL [4], achieving linear growth in the speed-up for up to 4 processors, and lower relative gains for larger numbers of processors as communication overheads became more prominent. In other work, Konstantopoulos [14] combined a parallel coverage test on multiple machines (using the MPI library) with the Aleph system [6].

Our approach shares similarities with the last two approaches [13, 14]. However, it uses the GPU in a shared memory architecture. The GPU provides much greater computational power for the same cost than a typical multi-core CPU, so the available hardware resources are used more efficiently and effectively. Also, in our approach, the entire dataset is copied from the main memory to the GPU's own memory once, before the learning process starts (gigabytes of data are transferred within a few seconds). The data remain in the GPU's memory until the learning process is completed. Our approach has a very low communication overhead compared to that of Konstantopoulos [14]; this reduces the learning time dramatically (especially for millions of examples) by allowing more rules to be evaluated per second. Moreover, during learning, the CPU sends each rule for evaluation to the GPU asynchronously, and is then free to focus on the remaining parts of the learning process while waiting for the result.

===3 GPU and GPGPU===

The graphics processing unit (GPU) is specialised hardware with many processing cores, intended to accelerate graphics-related computations. In recent years, there has been growing interest in using the massive computational capabilities of GPUs to accelerate general purpose computations, especially in the scientific community.
This has evolved into the concept of General-Purpose GPU (GPGPU) computing, which enables the use of GPUs for the acceleration of general purpose computations [15]. The GPU is built on the Single Instruction Multiple Data (SIMD) architecture, i.e. the same operation is applied to every data item in parallel. In essence, the GPU assigns each thread to process a certain area of the input data and output its results. The number of threads depends on the GPU used, and it is not uncommon to have more than 1000 threads. Figure 1 illustrates the difference between the CPU and the GPU on the problem of summing two arrays. On the CPU, the result for each row is computed sequentially by a single thread. On the GPU, there are many threads, each of which processes a group of array rows (indices) simultaneously, resulting in a massive performance gain, especially for large data.

A sequence diagram of a typical GPU-accelerated program can be seen in Figure 2. In that figure, the program is executed mostly by the CPU. However, when the program contains a computationally intensive operation, it sends it to the GPU, which calculates the results and returns them to the CPU. After the computation task has been sent to the GPU, control returns immediately to the CPU, which can either wait for the results or do something else in the meantime. When developing GPU-accelerated programs, the code for both GPU and CPU is written in the same high-level programming language, resides in the same file(s) and is compiled together with all other code.

Fig. 1. Adding two arrays: CPU vs GPU
Fig. 2. Interplay between CPU and GPU in a typical GPU-accelerated program.

There are many programming platforms that enable the writing, debugging and execution of general purpose GPU programs. Some of them are brand-specific, such as CUDA [2]; others are brand-agnostic, e.g. OpenCL [3]. Since our work revolves around the CUDA platform, we discuss it in more detail. Compute Unified Device Architecture (CUDA) is Nvidia's brand-specific platform for writing general purpose programs for its GPUs. CUDA divides the GPU into grids, each of which can run a separate general purpose GPU program, known as a kernel, and contains multiple blocks of threads (thread blocks); the number of thread blocks within a grid is known as the grid size. Thread blocks are executed by Streaming Multiprocessors (SMs). A GPU has one or more SMs, and each SM can execute a certain number of thread blocks in parallel, independently of the other SMs. The GPU has a hardware scheduler that constantly distributes thread blocks to SMs: whenever an SM is idle (or has finished computing), the scheduler feeds it another thread block. In other words, all the SMs compute results in parallel, independently of each other, while the hardware scheduler keeps them busy at all times by continuously feeding thread blocks to the available SM(s). When all SMs in the GPU are actively computing, the GPU can easily (depending on the hardware) execute tens of thousands of threads simultaneously. A kernel needs at minimum two parameters: the grid size (number of blocks) and the block size (number of threads per block). There are standard, commonly used approaches to computing the number of blocks as a function of the dataset size and of N, the user-specified number of threads per block.
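For illustration, the following is a minimal CUDA sketch of the array-addition example of Figure 1; it is our own illustration rather than code from the paper, and the names addArrays and launchAdd, the float element type, and the one-element-per-thread mapping are our assumptions. It shows how the grid size is derived from the dataset size and the block size N in the standard way mentioned above.

 // Each thread handles one element; out-of-range threads in the last block do nothing.
 __global__ void addArrays(const float* a, const float* b, float* r, size_t n) {
     size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;  // global thread index
     if (i < n)
         r[i] = a[i] + b[i];
 }
 // Host-side launch: N threads per block, enough blocks to cover all n elements.
 void launchAdd(const float* dA, const float* dB, float* dR, size_t n) {
     const unsigned int N = 32;                                 // threads per block (block size)
     unsigned int gridSize = (unsigned int)((n + N - 1) / N);   // ceil(n / N) blocks
     addArrays<<<gridSize, N>>>(dA, dB, dR, n);
 }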
For the purposes of this study, which focuses on the development of a suitable representation of the data and of the corresponding algorithms, we are only concerned with the choice of N, from which all other relevant parameters of the computation are derived.

There are several types of memory on the GPU. The first type is the registers, which are the fastest memory on the GPU and the smallest in capacity. The second type is global memory; it is the largest memory on the GPU and the slowest in terms of access speed. The third type is local memory, which is an abstraction of the global memory with smaller capacity and the same slow access speed. The fourth type is shared memory, which is shared among the threads within the same block; it can be very fast (comparable to the registers) or slow (when multiple threads try to access the same memory bank at the same time). The fifth type is constant memory, a read-only memory that can have high access speeds (very close to register access speed). The last type is texture memory, a read-only memory suited to 2D spatial locality; it has a higher access speed than local and global memory. See Figure 3 for the CUDA memory model and Table 1 for a summary of the CUDA memory types.

Table 1. Summary of CUDA memory types

 Memory type | Access scope | Access type | Access speed
 Register    | Thread       | Read/Write  | Fastest
 Shared      | Block        | Read/Write  | Slow-to-fast
 Constant    | Application  | Read only   | Slow-to-fast
 Texture     | Application  | Read only   | Medium
 Local       | Thread       | Read/Write  | Slowest
 Global      | Application  | Read/Write  | Slowest

Having discussed CUDA and its memory model, we now describe the typical flow of a GPU-accelerated program using CUDA. For example, assume we want to calculate the sum of two arrays, A and B. First, we allocate memory for three arrays in the GPU's memory, namely A, B and R (to store the result). Second, we copy the contents of A and B from the main memory (RAM) into the allocated GPU memory. Next, we run the GPU code to sum the two arrays in parallel (just as in Figure 1) and store the result in R. After the GPU finishes the computation, R holds the result, so we copy it from the GPU's memory back to the CPU. The flow of the program is identical to Figure 2; Figure 4 provides more technical detail.

Fig. 3. CUDA memory model [20].
Fig. 4. Typical CUDA-accelerated application.
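The host-side part of this flow could look as follows. This is a hedged sketch of our own, not the paper's code: the function name sumOnGpu and the device pointers dA, dB, dR are our inventions, and it reuses the hypothetical addArrays kernel from the earlier sketch.

 #include <cuda_runtime.h>
 // Kernel from the earlier sketch (defined elsewhere in the same program).
 __global__ void addArrays(const float* a, const float* b, float* r, size_t n);
 // Allocate GPU memory for A, B and R, copy the inputs, launch the kernel,
 // and copy the result back, mirroring the four steps described above (Figure 4).
 void sumOnGpu(const float* hostA, const float* hostB, float* hostR, size_t n) {
     float *dA, *dB, *dR;
     size_t bytes = n * sizeof(float);
     cudaMalloc((void**)&dA, bytes);                          // 1. allocate A, B, R on the GPU
     cudaMalloc((void**)&dB, bytes);
     cudaMalloc((void**)&dR, bytes);
     cudaMemcpy(dA, hostA, bytes, cudaMemcpyHostToDevice);    // 2. copy the inputs to the GPU
     cudaMemcpy(dB, hostB, bytes, cudaMemcpyHostToDevice);
     const unsigned int N = 32;                               // threads per block
     addArrays<<<(unsigned int)((n + N - 1) / N), N>>>(dA, dB, dR, n);  // 3. run the kernel
     cudaMemcpy(hostR, dR, bytes, cudaMemcpyDeviceToHost);    // 4. copy the result back
     cudaFree(dA); cudaFree(dB); cudaFree(dR);
 }

The final cudaMemcpy waits for the kernel to finish, so no explicit synchronisation is needed in this simple case.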
===4 Computing Propositional Hypothesis Cover Sets===

====4.1 Proposed Method====

A GPU can manipulate matrices very efficiently. Here this is first used to show how to speed up the calculation of the cover set of propositional hypotheses. We adopt the terminology used in Description Logic and talk about concepts, defined as sets over a universe of individuals, in the same way in which unary predicates can be defined for a domain consisting of a set of terms. We can represent a propositional dataset as a Boolean 2D matrix M with each column corresponding to a concept and each row to an individual (see Figure 5). The membership of an individual in a concept (i.e. whether the unary predicate is true when it takes this term as its argument) is represented as a Boolean value in the matrix. To make it possible to process a large amount of data (of the order of gigabytes), this array is stored in the global memory of the GPU.

Fig. 5. Concepts × Individuals matrix representation

Using the above representation, it is possible to employ data parallelism (Single Instruction Multiple Data) for the application of logic operations to any subset of concepts. The rows of the table are divided among multiple threads. For maximum efficiency, the number of threads per block (block size) is always chosen to be a power of 2, N = 2^i, where the upper limit of i is imposed by the GPU hardware. Since this article does not focus on performance optimisation, we only report results for N = 32 threads per block, a value that resulted in good performance across the board and is representative of the problem and GPU at hand. The grid size for the kernel is then calculated from the number of individuals and the block size: for 200,000,000 individuals and N = 32, this results in 6,250,000 blocks.

Now the cover set of a conjunction C_1 ∩ ... ∩ C_m of any set of concepts can be calculated through the parallel execution of all threads T_j, each of which processes sequentially the individuals I_k it has been allocated (see Algorithm 1). Lazy evaluation is used to calculate whether each individual is a member of the conjunction of concepts: if it is not covered by one of them, the rest are ignored and the result is returned immediately (see the if...break line of the pseudocode).

 Algorithm 1. For a Boolean matrix M (individuals × concepts)
 procedure ParallelConceptConjunctionCoverSet
   set S := list of concepts in the conjunction
   parallel_foreach thread T_i
   | foreach individual I_j in thread T_i
   | | set result(row(I_j)) := 1
   | | foreach concept C_k in set S
   | | | set result(row(I_j)) := result(row(I_j)) && M(row(I_j), column(C_k))
   | | | if (result(row(I_j)) == 0) break
   | | endfor
   | endfor
   endfor
   return Boolean array: result(1..numberOfIndividuals)

The cover set of a disjunction C_1 ∪ ... ∪ C_m of a set of concepts is computed in an analogous way, starting for each individual with result set to 0 and applying the logical OR operator until the temporary result is set to 1 or all concepts in S have been considered. Similarly, the negation operator divides all individuals among multiple threads and returns a vector result(1..numberOfIndividuals) with all Boolean values negated. In all cases, the size of the cover set is computed by summing up the elements of the vector result(1..numberOfIndividuals). This is done with the help of the Nvidia Thrust library, in a process that adds a very small overhead, which is included in all reported execution times. For all logic operators, the vector representing the cover set can be left in the global memory of the GPU to implement memoization [21].

When given two sets of positive and negative examples of a target concept, the computation of the cover set for each is controlled by a separate CPU thread, and both are run in parallel. The CPU threads communicate directly with the GPU to conduct the coverage test (on the GPU) and to retrieve the results. All results reported here are the overall run times of executing both parallel CPU threads.

It should be noted that the 2D matrix shown in Figure 5 and discussed above is represented internally as a 1D array in row-major order, in order to minimise the time for memory allocation and access. For example, allocating memory for a 10 × 10 array would require 1 + 10 = 11 memory allocation calls, whereas a 1D array of size 100 needs only one such call. Allocating memory sequentially also facilitates coalesced global memory access [20].
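To make the above concrete, here is a hedged CUDA sketch of our own (not the authors' code) of how the conjunction cover-set test of Algorithm 1 and the Thrust-based counting step might look. The names conjunctionCoverSet and coverSetSize, the use of unsigned char for Boolean values, and the one-individual-per-thread mapping (the paper assigns a group of rows to each thread) are our assumptions.

 #include <thrust/device_ptr.h>
 #include <thrust/reduce.h>
 // M is the Boolean individuals-by-concepts matrix, stored row-major as a 1D array;
 // `concepts` lists the column indices taking part in the conjunction.
 __global__ void conjunctionCoverSet(const unsigned char* M, int numConcepts,
                                     const int* concepts, int conceptsInConj,
                                     unsigned char* result, long numIndividuals) {
     long i = blockIdx.x * (long)blockDim.x + threadIdx.x;    // one individual per thread
     if (i >= numIndividuals) return;
     unsigned char r = 1;
     for (int k = 0; k < conceptsInConj; ++k) {
         r = r && M[i * numConcepts + concepts[k]];           // row-major access
         if (r == 0) break;                                   // lazy evaluation
     }
     result[i] = r;
 }
 // Cover set size: parallel sum of the result vector on the GPU via the Thrust library.
 long coverSetSize(unsigned char* dResult, long numIndividuals) {
     thrust::device_ptr<unsigned char> p(dResult);
     return thrust::reduce(p, p + numIndividuals, 0L);
 }

Leaving dResult in global memory after the call corresponds to the memoization option described above.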
====4.2 Results and Evaluation====

Due to the lazy evaluation strategy used, the worst case execution time (WCET) for evaluating the cover set of a conjunction of concepts is obtained when the membership matrix M contains all 1s. This is also computationally equivalent to the WCET for evaluating the cover set of a disjunction of concepts when M contains only 0s. Here we list the parallel (GPU) and serial (GPU and CPU) execution times for datasets in which the number of individuals (Table 2, Table 3 and Figure 6) or the number of concepts (Table 4 and Figure 7) is varied. The counting of examples (the final step) for both the CPU (serial) and GPU (serial and parallel) runs is computed in parallel on the GPU using Nvidia's Thrust library. Also, due to CPU caching, there are fluctuations in the execution times of the serial CPU experiment.

Table 2. Conjunction/disjunction of 4 concepts: cover set run times in parallel (GPU)

 Instances (pos+neg) | All 1s, conj | All 1s, disj | All 0s, conj | All 0s, disj
 2 × 100             | 10.69 µs     | 9.09 µs      | 10.69 µs     | 11.20 µs
 2 × 1,000           | 12.83 µs     | 16.38 µs     | 11.14 µs     | 11.81 µs
 2 × 10,000          | 20.35 µs     | 29.95 µs     | 17.34 µs     | 20.35 µs
 2 × 100,000         | 44.16 µs     | 31.97 µs     | 31.55 µs     | 45.02 µs
 2 × 1,000,000       | 315.65 µs    | 224.74 µs    | 223.30 µs    | 314.08 µs
 2 × 10,000,000      | 2.76 ms      | 2.05 ms      | 2.06 ms      | 2.75 ms
 2 × 100,000,000     | 27.55 ms     | 20.63 ms     | 19.10 ms     | 25.45 ms

Table 3. Conjunction of 4 concepts: serial GPU (single thread) vs serial CPU (single thread)

 Instances (pos+neg) | GPU, all 1s | GPU, all 0s | CPU, all 1s | CPU, all 0s
 2 × 100             | 222.7 µs    | 31.6 µs     | ~439.32 µs  | ~624 µs
 2 × 1,000           | 2.1 ms      | 109.9 µs    | ~994.59 µs  | ~1.44 ms
 2 × 10,000          | 21.2 ms     | 1 ms        | ~1.79 ms    | ~1.63 ms
 2 × 100,000         | 211.6 ms    | 10 ms       | ~3.68 ms    | ~1.94 ms
 2 × 1,000,000       | 1.8 s       | 99.9 ms     | ~12.12 ms   | ~4.2 ms
 2 × 10,000,000      | -           | -           | ~103.02 ms  | ~29.23 ms
 2 × 100,000,000     | -           | -           | ~969.66 ms  | ~262.72 ms

Table 4. Computing the conjunction of N concepts for varying N in parallel (GPU), 2 × 10,000,000 instances (pos+neg)

 Concepts | Time (all 1s) | Time (all 0s)
 4        | 2.77 ms       | 2.06 ms
 8        | 4.38 ms       | 1.95 ms
 16       | 7.41 ms       | 2.34 ms
 32       | 18.66 ms      | 3.88 ms

The results show that both the worst case and the best case execution times grow linearly with the number of individuals. Clearly, memoization can be used as a trade-off between time and space complexity, e.g. by using the existing result for C_1 ∩ ... ∩ C_m in order to compute the cover set of the conjunction of any superset of these concepts. When the number of concepts is varied, the slope of the WCET plot suggests approximately linear growth.

Fig. 6. Computing the conjunction of N concepts, N = 4: execution time (ms, log scale) against the number of instances for GPU multi-threaded, GPU single-threaded and CPU single-threaded runs, in the all-1s (worst case) and all-0s (best case) settings (data from Table 2 and Table 3).
Fig. 7. Computing the conjunction of N concepts for 2 × 10^7 individuals and varying N (data from Table 4).

===5 Computing Cover Sets for Hypotheses with Binary Predicates===

We shall now consider how the data and hypothesis languages can be extended to include roles, i.e. binary predicates. We adopt the representation for this type of data shown in Figure 8: an array with three columns, in which the middle column contains the name of the role/predicate, here represented by an integer, and the pair of individuals related through that role are listed in Column 1 and Column 3, respectively.

Fig. 8. Individual-Role-Individual matrix representation

With this representation in mind, we can, for instance, compute a hypothesis of the type ∃role.(C_1 ⊓ ... ⊓ C_m), e.g. ∃hasChild.(Female ⊓ Musician), using Algorithm 1 to compute the conjunction and Algorithm 2 to apply the existential restriction operator. In Algorithm 2, setAllElementsInResultToVal_inParallel(value) is a utility function used to fill all the elements of the result array (in parallel) with either 0 or 1. Algorithms for other DL operators, such as value restriction or number restriction, can be defined in a similar way.

 Algorithm 2. For an individual-role-individual matrix R and a Boolean matrix M (individuals × concepts)
 procedure ParallelExistentialRestriction
   call setAllElementsInResultToVal_inParallel(0)
   var Concept := the concept in the restriction
   set IndA := list of individualA values from R (rows with the given role)
   set IndB := list of individualB values from R (rows with the given role)
   parallel_foreach thread T_i
   | foreach pair of corresponding individuals (I_A, I_B) from IndA and IndB in thread T_i
   | | set result(row(I_A)) := result(row(I_A)) || M(row(I_B), column(Concept))
   | endfor
   endfor
   return Boolean array: result(1..numberOfIndividuals)
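As an illustration of how Algorithm 2 might map onto CUDA, here is a hedged sketch of our own rather than the authors' code. The kernel name existentialRestriction, the flat integer layout of the role table R and the unsigned char Boolean matrix are our assumptions, and result is assumed to have been zero-filled beforehand (the setAllElementsInResultToVal_inParallel(0) step).

 // R is the individual-role-individual table stored row-major as a flat int array,
 // one triple (individualA, role, individualB) per row; M is the Boolean
 // individuals-by-concepts matrix of Section 4, stored row-major.
 __global__ void existentialRestriction(const int* R, long numTriples, int role,
                                        const unsigned char* M, int numConcepts,
                                        int conceptCol, unsigned char* result) {
     long j = blockIdx.x * (long)blockDim.x + threadIdx.x;   // one role triple per thread
     if (j >= numTriples) return;
     if (R[3 * j + 1] != role) return;                       // keep only the requested role
     int a = R[3 * j];                                       // Column 1: first individual
     int b = R[3 * j + 2];                                   // Column 3: related individual
     if (M[(long)b * numConcepts + conceptCol])              // b belongs to the concept, so
         result[a] = 1;                                      // a is in the cover set of the restriction
 }

Several threads may set the same result entry, but since they all write the value 1 this race is benign.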
===6 Conclusions===

We have demonstrated here how GPGPU can be used to accelerate the computation of the cover set for a hypothesis expressed in propositional logic. The results show that even a commodity GPU makes it possible to process datasets of a size well beyond what can be expected from a CPU-based sequential algorithm of the same type, and within a time that makes the evaluation of many thousands of hypotheses on a dataset with 10^8 training examples a viable proposition. We have also indicated how our approach can be extended to handle binary predicates. The goal in sight is the ability to evaluate hypotheses on DL databases, e.g. linked open data ontologies. The approach adopted here fits well with the open world assumption that is commonly made in DL. An example of the potential benefits is the ability to learn about user preferences in e-commerce [19]. A case study with real-world data would need to identify the type of DL needed, potentially extend the cover set algorithms accordingly, and add an appropriate search strategy.

===References===

1. Muggleton, S. and Feng, C.: Efficient Induction of Logic Programs. Turing Institute, pp. 368–381 (1990).
2. CUDA Zone. URL: https://developer.nvidia.com/cuda-zone. Retrieved: June 2018.
3. OpenCL Overview. URL: https://www.khronos.org/opencl/. Retrieved: June 2018.
4. Quinlan, R.: Learning Logical Definitions from Relations. Machine Learning 5:239–266 (1990).
5. Muggleton, S.: Inverse Entailment and Progol. New Generation Computing 13(3):245–286 (1995).
6. Srinivasan, A.: The Aleph Manual (2001). URL: http://www.cs.ox.ac.uk/activities/machinelearning/Aleph/aleph.html
7. Muggleton, S.: Learning from Positive Data. Selected Papers from the 6th International Workshop on Inductive Logic Programming, pp. 358–376 (1996).
8. Fonseca, N. A., Silva, F. and Camacho, R.: Strategies to Parallelize ILP Systems. Inductive Logic Programming: Proc. of the 15th Intl Conf. (2005).
9. Blockeel, H., Dehaspe, L., Demoen, B., Janssens, G., Ramon, J. and Vandecasteele, H.: Executing Query Packs in ILP. International Conference on Inductive Logic Programming ILP 2000, pp. 60–77 (2000).
10. Fonseca, N. A., Srinivasan, A., Silva, F. and Camacho, R.: Parallel ILP for Distributed-memory Architectures. Machine Learning 74(3):257–279 (2009).
11. Ohwada, H. and Mizoguchi, F.: Parallel Execution for Speeding up Inductive Logic Programming Systems. Proceedings of the 9th Intl Workshop on Inductive Logic Programming, pp. 277–286 (1999).
12. Graham, J., Page, D. and Kamal, A.: Accelerating the Drug Design Process through Parallel Inductive Logic Programming Data Mining. Proceedings of the 2003 IEEE Bioinformatics Conference CSB2003 (2003).
13. Matsui, T., Inuzuka, N., Seki, H. and Itoh, H.: Comparison of Three Parallel Implementations of an Induction Algorithm. Proceedings of the 8th Intl Parallel Computing Workshop, pp. 181–188 (1998).
14. Konstantopoulos, S.: A Data-parallel Version of Aleph. Proceedings of the Workshop on Parallel and Distributed Computing for Machine Learning (2003).
15. Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E. and Purcell, T.: A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum 26(1):80–113 (2007).
16. Califf, M. E. and Mooney, R. J.: Advantages of Decision Lists and Implicit Negatives in Inductive Logic Programming. New Generation Computing 16:263–281 (1998).
17. Bühmann, L., Lehmann, J. and Westphal, P.: DL-Learner – A Framework for Inductive Learning on the Semantic Web. Journal of Web Semantics: Science, Services and Agents on the World Wide Web 39:15–24 (2016).
18. Fanizzi, N., d'Amato, C. and Esposito, F.: DL-FOIL: Concept Learning in Description Logic. Proc. of the 18th Conf. on Inductive Logic Programming (2008).
19. Qomariyah, N. and Kazakov, D.: Learning from Ordinal Data with Inductive Logic Programming in Description Logic. Proc. of the Late Breaking Papers of the 27th Intl Conf. on Inductive Logic Programming, pp. 38–50 (2017).
20. NVIDIA CUDA Programming Guide (2007). URL: http://developer.download.nvidia.com/compute/cuda/1.0/NVIDIA_CUDA_Programming_Guide_1.0.pdf. Retrieved: May 2018.
21. Michie, D.: Memo Functions and Machine Learning. Nature 218:19–22 (1968).