Partial enumeration of minimal transversals of a hypergraph Lhouari Nourine, Alain Quilliot and Hélène Toussaint Clermont-Université, Université Blaise Pascal, LIMOS, CNRS, France {nourine, quilliot, helene.toussaint}@isima.fr Abstract. In this paper, we propose the first approach to deal with enumeration problems with huge number of solutions, when interesting- ness measures are not known. The idea developed in the following is to partially enumerate the solutions, i.e. to enumerate only a representative sample of the set of all solutions. Clearly many works are done in data sampling, where a data set is given and the objective is to compute a representative sample. But, to our knowledge, we are the first to deal with sampling when data is given implicitly, i.e. data is obtained using an algorithm. The experiments show that the proposed approach gives good results according to several criteria (size, frequency, lexicographical order). 1 Introduction Most of problems in data mining ask for the enumeration of all solutions that satisfy some given property [1, 10]. This is a natural process in many applications, e.g. marked basket analysis [1] and biology [2] where experts have to choose between those solutions. An enumeration problem asks to design an output- polynomial algorithm for listing without duplications the set of all solutions. An output-polynomial algorithm is an algorithm whose running time is bounded by a polynomial depending on the sum of the sizes of the input and output. There are several approachs to enumerate all solutions to a given enumeration problem. Johnson et al. [13] have given a polynomial-time algorithm to enumer- ate all maximal cliques or stables of a given graph. Fredman and Khachiyan [7] have proposed a quasi-polynomial-time algorithm to enumerate all minimal transversal of an hypergraph. For enumeration problems the size of the output may be exponential in the size of the input, which in general is different from optimization problems where the size of the output is polynomially related to the size of the input. The drawback of the enumeration algorithms is that the number of solutions may be exponential in the size of the input, which is infea- sible in practice. In data mining, some interestingness measures or constraints are used to bound the size of the output, e.g. these measures can be explicitly specified by the user [8]. In operation research, we use quality criteria in order to consider appropriate decision [21]. In this paper, we deal with enumeration problems with huge number of so- lutions, when interestingness measures are not known. This case happens when c paper author(s), 2015. Published in Sadok Ben Yahia, Jan Konecny (Eds.): CLA 2015, pp. 123–134, ISBN 978–2–9544948–0–7, Blaise Pascal University, LIMOS laboratory, Clermont-Ferrand, 2015. Copying permitted only for private and academic purposes. 124 Lhouari Nourine, Alain Quilliot and Hélène Toussaint the expert has no idea about data and knowledges that are looking for. The objective is to enumerate only a representative sample of the set of all solutions. Clearly many works are done in data sampling, where are given a data set and the objective is to compute a representative sample. To our knowledges, this idea is new for sampling when data is given implicitly, i.e. data is obtained using an algorithm. One can use the naive approach which first enumerates all the solu- tions and then applies sampling methods, which is not possible for huge number of solutions. To evaluate our approach, we consider a challenging enumeration problem, which is related to mining maximal frequent item sets [1, 10], dualization of monotone boolean functions [5] and other problems [10]. We applied our ap- proach to several instances of transversal hypergraphs [17, 20], and obtain good results. 2 Related works Golovach et al. [9] have proposed an algorithm to enumerate all minimal domi- nating sets of a graph. First they generate maximal independent sets and then apply a flipping operation to them to generate new minimal dominating sets, where the enumeration of maximal independent sets is polynomial. Clearly, a relaxation of the flipping operation leads to a partial enumeration since the number of minimal dominating sets can be exponential in the number of min- imal independent sets, e.g. cobipartie graphs. Jelassi et al.[12] and Raynaud et al.[19] have considered some kind of redundancy in hypergraphs like twin ele- ments to obtain a concise representation. Their ideas can avoid the enumeration of similar minimal transversals of an hypergraph. 3 Transversal hypergraph enumeration A hypergraph H = (V, E) consists of a finite collection E of sets over a finite set V . The elements of E are called hyperedges, or simply edges. An hypergraph is said simple if for any E, E 0 ∈ E E 6⊆ E 0 . A transversal (or hitting set) of H is a set T ⊆ V that intersects every edge of E. A vertex x in a transversal T is said to be redundant if T \ {x} is still a transversal. A transversal is minimal if it does not contain any redundant vertex. The set T of all minimal transversal of H = (V, E) constitutes together with V also a hypergraph T r(H) P = (V, T ), which is called the transversal hypergraph of H. We denote by k = E∈E | E |. Example 1. Consider the hypergraph H = (V, E): V = {1, 2, 3, 4, 5} and E = {E1 , E2 , E3 } with E1 = {1, 3, 4}, E2 = {1, 3, 5} and E3 = {1, 2}. The set of all minimal transversals is T = {{1}, {2, 3}, {2, 4, 5}} and k = 3 + 3 + 2 = 8 Given a simple hypergraph H = (V, E), the transversal hypergraph enumera- tion problem concerns the enumeration without repetitions of T r(H). This prob- lem has been intensively studied due to its link with several problems isuch as Partial enumeration of minimal transversals of a hypergraph 125 data mining and learning [3, 4, 11, 15, 18]. Recently, Kante et al.[14] have shown that the enumeration of all minimal transversals of an hypergraph is polynomi- ally equivalent to the enumeration of all minimal domination sets of a graph. It is known that the corresponding decision problem belongs to coNP, but still open whether there exists an output-polynomial-time algorithm. 4 Partial transversal hypergraph enumeration We introduce the partial (or incomplete) search algorithm for enumerating min- imal transversals of an hypergraph H. The search space is the set of all transver- sals which is very large. The strategy is divided into two steps: – The initialization procedure considers a transversal T of H and then ap- plies a reduction (at random) algorithm to T in order to obtain a minimal transversal Tm of H. This step is detailed in section 4.1. – The local search algorithm considers a minimal transversal Tm and then applies local changes to Tm in which we add and delete vertices according to some ordering of the vertices. This step is detailed in section 4.2 These steps are repeated for at most k transversals depending on the input hypergraph H. Figure 1 illustrates the proposed approach. Fig. 1. Approach to partial enumeration of minimal transversals 4.1 Initialization Let H(V, E) be the input hypergraph, E ∈ E and x ∈ E. The initialization step starts with the transversal (V \ E) ∪ {x} and then applies a reduction 126 Lhouari Nourine, Alain Quilliot and Hélène Toussaint algorithm to obtain a minimal transversal. The following property shows that the set (V \ E) ∪ {x} is a transversal and any minimal transversal contained in (V \ E) ∪ {x} contains the vertex x. Property 1. Let H = (V, E) be a simple hypergraph, E ∈ E and x ∈ E. Then (V \ E) ∪ {x} a transversal of H. Moreover, any minimal transversal T ⊆ (V \ E) ∪ {x} will contain x. Proof. Let H = (V, E) be a simple hypergraph and E 0 ∈ E with E 6= E 0 . Since H is simple then there exists at least one element y ∈ E 0 such that y 6∈ E. So y ∈ (V \ E) and thus E 0 ∩ (V \ E) 6= ∅. We conclude that (V \ E) ∪ {x} is a transversal since x ∈ E. Now let T ⊆ (V \E)∪{x} be a minimal transversal. Then E ∩(V \E)∪{x} = {x} and thus x must belong to T otherwise T does not intersect E.  According to property 1, we can apply the initialization procedure to any pair (x, E) where E ∈ E and x ∈ E. In other words, the initialization is applied to at most k transversals of H as shown in Algorithm 1. Algorithm 1: Initialization Input : A hypergraph H(V, E) and σ an ordering of V Output: A sample of minimal transversals begin ST RAN S = ∅; for E ∈ E do for x ∈ E do T = (V \ E) ∪ {x};{Initial transversal} Tm = Reduce(T, σ); ST RAN S = ST RAN S ∪ {Tm }; return(ST RAN S); Now we describe the reduction process, which takes a transversal T and a random ordering σ of V and returns a minimal transversal Tm . Indeed, we delete vertices from T according to the ordering σ until we obtain a minimal transversal. Algorithm 2: Reduce(T, σ) Input : A transversal T and an ordering σ = σ1 ...σ|V | of the vertices of H. Output: A minimal transversal for i = 1 to |V | do if T \ {σi } is a transversal then T ← T \ {σi } ; Return(T ); Partial enumeration of minimal transversals of a hypergraph 127 Example 2 (continued). Suppose we are given the hypergraph in example 1 and σ = (1, 2, 3, 4, 5) for the input to Algorithm 1. First, it takes the hyperedge E = {1, 3, 4} and for x = 1 we obtain the minimal transversal {1}, for x = 3 we obtain {2, 3} and for x = 4 we obtain {2, 4, 5}. Then the algorithm con- tinue with the hyper edges {1, 3, 5} and {1, 2}. Finally the algorithm returns ST RAN S = {{1}, {2, 3}, {2, 4, 5}}, i.e. the other iterations do not add new minimal transversals. Theorem 1. Algorithm 1 computes at most k minimal transversals of an input hypergraph H = (V, E). Proof. The initialization procedure considers at most k minimal transversals of H. Since a minimal transversal can be obtained several times, the result follows.  The following proposition shows that any minimal transversal of the hyper- graph H = (V, E) can be obtained using the initialization procedure. Indeed, the choice of the ordering σ is important in the proposed strategy. Proposition 1. Let H = (V, E) be an hypergraph and T be a minimal transver- sal of H. Then, there exists a total order σ, E ∈ E and x ∈ E such that T = Reduce((V \ E) ∪ {x}, σ). Proof. Let T be a minimal transversal of H = (V, E). Then there exists at least one hyperedge E ∈ E such that T ∩ E = {x}, x ∈ V , otherwise T is not minimal. Thus T ⊆ (V \ E) ∪ {x}. Now, if we take the elements that are not in T before the elements in T in σ, the algorithm Reduce((V \ E) ∪ {x}, σ) returns T .  The initialization procedure guaranties that for any vertex x ∈ V at least one minimal transversal containing x is listed. The experiments in section 5, shows the sample of minimal transversals obtained by the initialization procedure is a representative sample of the set of all minimal transversals. 4.2 Local search algorithms The local search algorithm takes each minimal transversal found in the initial- ization step and searches for new minimal transversals to improve the initial solution. The search of neighbors is based on vertices orderings. Let H = (V, E) be an hypergraph and x ∈ V . We define the frequency of x as the number of minimal transversals of H that contain x. The algorithm takes a minimal transversal T and a bound N max which bounds the number of iterations and the number of neighboors generated by T . Each iteration of the while loop, starts with a minimal transversal T and computes two orderings as follows: – σ c is an ordering according to the increasing order of frequency of vertices in V \ T in minimal transversals already obtained by the current call. This ordering has a better coverage of the solution set, i.e. by keeping the rarest vertices in the transversals. 128 Lhouari Nourine, Alain Quilliot and Hélène Toussaint – σ is a random ordering of the vertices in T . Algorithm 3: Neighboor(T, N max) Input : A minimal transversal T of H = (V, E) and an integer N max Output: A set of minimal transversals Q = T; i = 1; while i ≤ N max do σ c ← the set V \ T sorted in increasing order of frequency of vertices in minimal transversals in Q; σ ← sort T at random; Add elements the elements in σ c to T until a vertex x ∈ T \ σ c becomes redundant in T ; T =Reduce(T, σ); Q = Q ∪ {T }; i = i + 1; return(Q); Now we give the global procedure of the proposed approach. Algorithm 4: Global procedure for partial enumeration of minimal transversals Input : An hypergraph H = (V, E) and an integer N max Output: A sample of minimal transversals of H σ =choose a random ordering of V ; ST RAN = Q = Initialization(H, σ); while Q 6= ∅ do T = choose a minimal transversal T in Q; ST RAN S = ST RAN S ∪ N eighboor(T, N max); Return(ST RAN S); In the following, we describe experiments to evaluate the results that have been obtained. 5 Experimentation The purpose of the experiments is to see if the proposed approach allow us to generate a representative set of solutions. For this reason, we have conducted the experiments on two different classes of hypergraphs (see [20]) for which the number of minimal transversals is huge compared to the size of the input. We use Uno’s Algorithm SHD (Sparsity-based Hypergraph Dualization, ver. 3.1) [20], to enumerate all minimal transversals. The experiments are done using linux CentOS cpu Intel Xeon 3.6 GHz and C++ language. In the following, we denote Tpartial the set of minimal transversals generated by Algorithm 4, and Texact the set of all minimal transversals. First, we analyze Partial enumeration of minimal transversals of a hypergraph 129 T the percentage Tpartial exact and then we evaluate the representativeness of the sample Tpartial . 5.1 The size of Tpartial We will distinguish between minimal transversals that are obtained using Algo- rithm 1 (or the initialization procedure) and those that are generated using the local search. For these tests we set the maximal number of neighboors N max to 3. Tables 1 and 2 show the results for the two classes of hypergraph instances, namely ”lose” and random ”p8”. The first three columns have the following meaning: – instance: instance name. – instance size: the size of the instance (number of edges × number of vertices). – total # of transv.: the exact number of minimal transversals | Texact |. The second (resp. last) three columns give the results for the initialization procedure (resp. Global algorithm): – # transv. found : the number of minimal transversals found. – % transv. found : the percentage of minimal transversals found. – cpu (s): the run time in seconds Table 1. Results for all ”lose” instances According to these tests, we can see that the percentage of minimal transver- sals found using the initialization procedure is very low, but it decreases as far as the size of Texact increases. Clearly, this percentage is strongly related to the input. Indeed, the number k (entropy) of the hypergraph increases according to the size of the input hypergraph. We can also see that the local search increases significantly the number of solutions found by a factor 2 to 2.5 approximatively. 130 Lhouari Nourine, Alain Quilliot and Hélène Toussaint Table 2. Results for all ”p8 ” instances But it remains coherent with the chosen value N max = 3. It argues that the lo- cal search finds other transversals that are not found either by the initialization procedure nor previous local search. Notice that the parameter N max can be increased whenever the size of Tpartial is not sufficient. 5.2 The representativeness of Tpartial To evaluate the representativeness of Tpartial , we consider three criteria: – Size of the minimal transversals in Tpartial . – Frequency of vertices in Tpartial . – Lexicographical rank of the minimal transversals in Tpartial . Each criteria is illustrated using a bar graph for two instances from different classes. The bar graphs in figures 2 and 3 are surprising. Indeed the bar graphs vary nearly in the same manner with respect to the initialization and the local search algorithm for all the considered criteria. Figures 2(a) 3(a) show that the percentage of minimal transversals of each size (found either by the initialization procedure and local search) fits the per- centage of all minimal transversals that are found. Figures 2(b) and 3(b) show that the same analysis holds when ordering min- imal transversals lexicographically (e.g. based on a total ordering of vertices). Clearly, the lexicographical rank of a minimal transversal belongs to the interval [1..2|V | ]. For visualization aspect, we divide this interval into |V | subintervals, where the subinterval i contains the number of minimal transversals with a rank |V | |V | r ∈ [ 2|V | i; 2|V | (i + 1)[, (i = 0, . . . , |V | − 1). Figures 2(c) and 3(c) confirm this behavior when considering frequency of vertices. Indeed, frequency of vertices in minimal transversals in Tpartial is the same when considering all minimal transversals, i.e.. the set Texact . Partial enumeration of minimal transversals of a hypergraph 131 Fig. 2. The bar graph for ”lose100”: a) number of minimal transversals per size b) number of minimal transversals according the lexicographical rank; c) Frequency of vertices in minimal transversals. Fig. 3. The bar graph for ”p8 200”; a) number of minimal transversals per size; b) number of minimal transversals according the lexicographical rank; c) Frequency of vertices in minimal transversals. 132 Lhouari Nourine, Alain Quilliot and Hélène Toussaint Fig. 4. Visualizing the solutions space of ”lose100”. The abscissa is given by the size of the transversal (transversals of the same size are spread out using a norm) and the ordinate corresponds to the frequency of its vertices. Fig. 5. Visualizing the solutions space of ”p8 200”. The abscissa is given by the size of the transversal (transversals of the same size are spread out using a norm) and the ordinate corresponds to the frequency of its vertices. Partial enumeration of minimal transversals of a hypergraph 133 Figures 4 and 5 show that the set Tpartial is also representative even when considering minimal transversals with the same size. Indeed, minimal transver- sals having the same size are spread out using a norm. We notice that the points corresponding to minimal transversals in Tpartial are scattered in the image. This experiment allows us to conclude that the sample Tpartial produced by Algorithm 4 is representative relatively to the criteria under consideration. Other results can be found in http://www2.isima.fr/˜toussain/ 6 Conclusion and discussions We are convinced that the initialization procedure is the most important in this approach. Indeed, the set of minimal transversals obtained using this procedure is a representative sample, since it garantee that for any vertex of the hypergraph there is at least one minimal transversal which contains it (see property 1). Moreover the local search procedure can be used to increase the number of solutions, and as we have seen in the experiments, it keeps the same properties as the initialization procedure. We hope that this approach improves enumeration in big data and will be of interests to the readers to investigate heuristics methods [6] for enumeration problems. This paper opens new challenges related to partial and approximate enu- meration problems. For example, given an hypergraph H = (V, E), is there an algorithm that for any given ε, it enumerates a set Tpartial ⊆ T r(H) such that (1 − ε)|T r(H)| ≤ |Tpartial | ≤ |T r(H)|? We also require that the algorithm is output-polynomial in the sizes of H, Tpartial and 1ε . To our knowledge, there is no work on approximate algorithms for enumeration problems, but results on approximate counting problems may be applied [16]. Acknowledgment: This work has been funded by the french national re- search agency (ANR DAG project, 2009-2013) and CNRS (Mastodons PETASKY project, 2012-2015). References 1. R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. In ACM SIGMOD 1993, Washington D.C., pages 207–216, 1993. 2. J. Y. Chen and S. Lonardi. Biological Data Mining. Chapman and Hall/CRC, 2009. 3. T. Eiter and G. Gottlob. Identifying the minimal transversals of a hypergraph and related problems. SIAM J. Comput., 24(6):1278–1304, 1995. 4. T. Eiter, G. Gottlob, and K. Makino. New results on monotone dualization and generating hypergraph transversals. SIAM J. Comput., 32(2):514–537, 2003. 5. T. Eiter, K. Makino, and G. Gottlob. Computational aspects of monotone dual- ization: A brief survey. Discrete Applied Mathematics, 156(11):2035–2049, 2008. 134 Lhouari Nourine, Alain Quilliot and Hélène Toussaint 6. T. Feo and M. Resende. Greedy randomized adaptive search procedures. Journal of Global Optimization, 6(2):109–133, 1995. 7. M. Fredman and L. Khachiyan. On the complexity of dualization of monotone disjunctive normal forms. Journal of Algorithms, 21:618–628, 1996. 8. L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Comput. Surv., 38(3), Sept. 2006. 9. P. Golovach, P. Heggernes, D. Kratsch, and Y. Villanger. An incremental polyno- mial time algorithm to enumerate all minimal edge dominating sets. Algorithmica, 72(3):836–859, 2015. 10. D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharm. Discovering all most specific sentences. ACM Trans. Database Syst., 28(2):140–174, 2003. 11. D. Gunopulos, R. Khardon, H. Mannila, and H. Toivonen. Data mining, hyper- graph transversals, and machine learning. In PODS, pages 209–216, 1997. 12. M. Jelassi, C. Largeron, and S. Ben Yahia. Concise representation of hypergraph minimal transversals: Approach and application on the dependency inference prob- lem. In Research Challenges in Information Science (RCIS), 2015 IEEE 9th In- ternational Conference on, pages 434–444, May 2015. 13. D. S. Johnson, C. H. Papadimitriou, and M. Yannakakis. On generating all maxi- mal independent sets. Inf. Process. Lett., 27(3):119–123, 1988. 14. M. M. Kanté, V. Limouzy, A. Mary, and L. Nourine. On the enumeration of mini- mal dominating sets and related notions. SIAM Journal on Discrete Mathematics, 28(4):1916–1929, 2014. 15. L. Khachiyan, E. Boros, K. M. Elbassioni, and V. Gurvich. An efficient implemen- tation of a quasi-polynomial algorithm for generating hypergraph transversals and its application in joint generation. Discrete Applied Mathematics, 154(16):2350– 2372, 2006. 16. J. Liu and P. Lu. Fptas for counting monotone cnf. In Proceedings of the Twenty- Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’15, pages 1531–1548. SIAM, 2015. 17. K. Murakami and T. Uno. Efficient algorithms for dualizing large-scale hyper- graphs. Discrete Applied Mathematics, 170:83–94, 2014. 18. L. Nourine and J.-M. Petit. Extending set-based dualization: Application to pat- tern mining. In ECAI, pages IOS Press ed, Montpellier, France, 2012. 19. O. Raynaud, R. Medina, and C. Noyer. Twin vertices in hypergraphs. Electronic Notes in Discrete Mathematics, 27:87–89, 2006. 20. T. Uno. http://research.nii.ac.jp/ uno/dualization.html. 21. C. A. Weber, J. R. Current, and W. Benton. Vendor selection criteria and methods. European Journal of Operational Research, 50(1):2 – 18, 1991.