Abstract and Local Concepts in Attributed Networks Henry Soldano1,2 , Guillaume Santini1 , and Dominique Bouthinon1 1 Université Paris 13, Sorbonne Paris Cité, L.I.P.N UMR-CNRS 7030 F-93430, Villetaneuse, France 2 Atelier de BioInformatique, ISYEB - UMR 7205 CNRS MNHN UPMC EPHE, Museum National d’Histoire Naturelle, F-75005, Paris, France Abstract. We consider attribute pattern mining in attributed graphs through re- cent developments of Formal Concept Analysis. The corresponding methods re- strain the extensional space 2O , i.e. the space of possible pattern extensions in the object set O, to a subset satisfying structural properties. When considering an at- tributed graph, we consider its vertices as the objects under study, each described in a pattern language, as 2I where I is an attribute set. The restriction of the ex- tensional space depends then on the graph topology. We consider two levels. At the global level, the core idea is to reduce the extension of each pattern in such a way that the corresponding abstract extension induces a subgraph made of dense parts whose nodes satisfy some connectivity property. At the local level a pattern has various extensions each associated to one dense part. We obtain that way ab- stract closed patterns and local closed patterns, together with abstract and local implication rules. Overall, we propose here a way to extract information associ- ated to the attributes labelling the graph vertices, according to its topology. We consider in particular the detection of communities in subgraphs of an attributed network associated to local closed patterns and local implications. 1 Introduction We consider an attributed graph G(O, E) where E is the edge set and whose vertices in O are labelled by a description in an attribute pattern language with a lattice structure, typically 2I where I is a set of binary attributes. A way recently investigated to search for frequent patterns is to define a restricted extensional space which is the range of an interior operator on 2O (see Section 3). The idea of such an operator in the case of an attributed graph is to minimally reduce a vertex subset until all the vertices of the reduced vertex subset, also called an abstract vertex subset, satisfies some connectivity property within the corresponding induced subgraph[1]. We call a graph abstraction the corresponding extensional space, i.e. the range of the interior operator mentioned above. A typical example of a connectivity property is the degree ≥ k property which is such that all vertices of an abstract vertex subset e have a degree greater than or equal to k in the subgraph Ge induced by e. Another example is the k-clique abstraction in which vertices in e have to belong to some k-clique in Ge . This approach, based on a previous work on abstraction in Formal Concept Analysis [2] produces abstract closed patterns i.e. maximal attribute patterns, obtained by applying a closure operator on the abstract support sets obtained when applying the interior operator to the support sets3 . 3 In data mining the extension of a pattern in a set of objects is also called its support set. In this case, the set of (abstract support set, abstract closed pattern) pairs forms a lattice called an abstract concept lattice, and we obtain a set of related abstract implications denoted by 2q → 2w that hold whenever the abstract support set of q is included in the abstract support set of w. Recent works in attributed graph mining are interested in searching for local pat- terns made of a constraint on a subset of attributes together with a density constraint on a vertex subset, and this using various notions of maximality [3,4]. In a companion article [5], we have defined local closed patterns corresponding to maximal attribute patterns each associated to one dense subgraph, allowing to extract local implications, particular to specific dense groups of objects. For that purpose Formal Concept Anal- ysis (FCA) had to be extended in order to take into account this notion of locality. In that case, several closure operators may be applied to the same pattern: a closed pattern will then be local as the closure will depend on which region of the extensional space is concerned. The simplest example is obtained by considering that the support set of a pattern induces a subgraph made of various connected components, and associating to each connected component a local closed pattern, i.e. the most specific pattern shared by the vertices of this connected component. In what follows, the subgraph induced by the pattern q support set is simply called the pattern q subgraph. Formally, the dense vertex subsets we consider as elements of the extensional space form a partial order called a confluence in a recent investigation in Formal Concept Analysis [6] and close to, but different from, confluent familiies recently investigated in [7]. The confluence structure generalizes the abstraction structure and may have several minimal elements (see Section 4). In the case mentioned above, a simple graph confluence, the minimal elements are the singletons {v}, where v is a vertex, and the elements of the confluence are connected vertex subsets i.e. vertex subsets each inducing a connected subgraph. The structure of the set of (local support set, local closed pattern) pairs, we call local concepts, has been shown to be a more general structure generalizing the lattice struc- ture, and called a pre-confluence [5]. We call local concept pre-confluence the ordered set of local concepts. Again we may associate to this structure a set of implications, called local implications written 2m q → 2m w where m is any minimal element of the confluence. Such a local implication means that the (unique) local support set including m of pattern q is included in the local support set of w including m. In the simple exam- ple mentioned above, 2{v} q → 2{v} w holds whenever i) v belongs to the support set of q and ii) the connected component containing v of the pattern q subgraph is included in the connected component containing v of the pattern w subgraph. Both approaches can be mixed by considering the simple graph confluence F men- tioned above together with a graph abstraction A. What happens then is that FA = F ∩A also is a confluence, we call a cc-confluence. In practice, this means that we choose some abstraction A, for instance considering the degree >= k graph abstraction and then consider among its elements only connected vertex subsets. When investigating some attributed graph we have then to chose A, or consider a set of graph abstractions ordered by set theoretic inclusion A1 , . . . , An obtained for instance by increasing k of the degree >= k. As we will see in our experiments, we may extract this way local patterns and implications at different levels. However connected components of abstract subgraphs as represented in cc-conflu- ences does not always completely capture the idea of communities as considered in social network analysis. As discussed in [5], we may however enlarge the local closed patterns approach by deriving a new graph GT from G whose vertex set T is a set of vertex subsets of G. Figure 1 displays a graph whose vertices represents pupils on a school in the West of Scotland, whose edges represent friendship relations and whose vertex attributes concern substance use and sporting activity4 . As a running example we consider the empty pattern subgraph i.e. the whole graph. By applying a 3-clique graph abstraction we select the vertices related by colored or bold edges. We will then obtain 4 local closed patterns li each representing the most specific attribute pattern occurring in a connected component of the empty pattern abstract subgraph. However, in this example the largest connected component is clearly made of distinct dense parts, i.e. communities, we would like to consider when defining local closed patterns. Fortu- nately, when considering k-communities [8] we can solve this problem by applying the cc-confluence approach to a new graph derived from the original graph. More precisely, a k-community is a vertex subset in a graph G that corresponds to a connected compo- nent in a derived graph GT . The vertices of GT are k-cliques in G and an edge relates two vertices whenever the corresponding k-cliques share k − 1 vertices in G. Each col- ored subgraph in Figure 1 defines such a 3-community. When labelling each vertex in 28 39 27 12 43 44 23 42 29 7 26 24 22 34 33 30 19 17 31 37 2 25 18 21 32 10 11 5 14 15 35 1 16 40 48 41 38 9 4 47 46 8 6 50 45 49 36 3 13 20 Fig. 1. The original friendship graph of a group of West Scotland pupils. The pupils and edges forming 3-communities of size at least 4 are colored. Empty vertices do not belong to the 3-clique abstraction of the graph. GT by the intersection of the corresponding k-clique vertex labels, the local closed pat- tern associated to the connected component of some pattern subgraph in GT represents the most specific attribute pattern occurring in the corresponding k-community. 4 http://www.stats.ox.ac.uk/˜snijders/siena/s50_data.htm Section 2 describes the attributed graphs used in our experiments. Section 3 presents abstract concept lattices, abstract implications and graph abstractions. Section 4 defines local concept pre-confluences, related local implications and cc-confluences. In Sec- tion 5 we show how using derived c-confluences we extract the set of 3-communities associated to pattern subgraphs, and we display the local concept pre-confluence of the teenage friendship attributed network displayed Figure 1. In Section 6 we briefly discuss the implementation used in our experiments. 2 Datasets We will further consider experiments in two datasets. In both cases the data is described as a graph G = (O, E) whose vertices have as labels elements of 2I where I is a set of items, i.e. binary attributes. As objects are not always described using binary attributes, the binarization preprocessing is described when necessary. 2.1 Teenage Friends and Lifestyle Study The dataset is denoted as s50-1 and is a standard attributed graph dataset5 . It represents 148 friendship relations between 50 pupils of a school in the West of Scotland, and labels concern the substance use (tobacco, cannabis and alcohol) and sporting activity. Values of the corresponding variables are ordered. The binarization process consists in defining variables representing the value intervals. T stands for Tobacco consumption and has values 1 (no smoking), 2 (occasional) and 3 (regular). C stands for cannabis consumption and has values 1 (never tries) to 4, D stands for alcohol consumption and has values 1 (does not drink) to 5, and S stands for sporting activity and has two values 1 (occasional) and (2) regular. A binary variable represents an interval, as for instance C23 that has value 1 whenever the value of C is in [2, 3]. For sake of simplicity we have merged the two highest values in variables T,C and D. For instance values 4 and 5 in alcohol consumption are merged into a 4m (4 and more) value. We report hereunder the binary attributes whose conjunctions allow to represent any interval (for instance D=2 is obtained as {D12,D23m}): Tobacco Cannabis Alcohol T1,T2m C1,C12,C23m,C3m D1,D12,D123,D23m,D34m,D4m 2.2 A DBLP dataset This is the DBLP dataset as described in [9]. There is 45131 vertices, 228188 edges and 555 connected components. Vertices are authors that have published at least one paper in one among 29 journal or conference of the Database and Datamining commu- nities6 during the 1/1990 to 2/2011 period. An edge links two authors whenever they 5 http://www.stats.ox.ac.uk/˜snijders/siena/s50_data.htm 6 Conferences: KDD, ICDM, ECML/PKDD, PAKDD, SIAM DM, AAAI, ICML, IJCAI, IDA, DASFAA, VLDB, CIKM, SIGMOD, PODS, ICDE, EDBT, ICDT, SAC ? Journals: IEEE TKDE, DAMI, IEEE Int. Sys., SIGKDD Exp., Comm. ACM, IDA J., KAIS, SADM, PVLDB, VLDB J., ACM TKDD are coauthors of at least one article. The conferences are clustered in three clusters: DB (databases), DM (data mining) and AI (artificial intelligence) according to a conference ranking site categorization7 . The binary attributes are the journal and conference names together with the three clusters. An attribute has value 1 if the author has published in the corresponding journal or conference or cluster. 3 Abstract closed patterns in attributed networks 3.1 Abstract closed patterns A standard pattern mining consists in considering the set of occurrences of a pattern q, belonging to some pattern language L with a lattice structure 8 , as a subset of an object set O. This language is partially ordered following a general-to-specific ordering and each object o is described as a particular pattern d(o). A pattern q occurs in object o whenever d(o) is less specific than q. The set of occurrences ext(q) of a pattern q is called its support set or its extension in O. A pattern q is said support-closed whenever it is a maximal pattern (i.e. maximally specific) among those that are equivalent in what they share the same support set as q. Now, whenever there is a unique support closed pattern corresponding to a given support set e, as it is the case in the standard FCA or itemset mining framework, an intension function int(e) returns the support-closed pattern associated to the support set e. This means that we relate a pattern q to the corresponding support closed pattern by applying the closure operator int ◦ ext to q. The pattern language L typically is 2I where I is a set of binary attributes (aka items). With no loss of generality we will further use such a pattern language. The closure operator then simply intersects the object descriptions of the support set of the entry pattern. The set of frequent support closed patterns, i.e. the support-closed elements with support greater than or equal to some threshold minsupp represents then all the equiv- alence classes corresponding to frequent supports. Such a class has also minimal ele- ments, called generators. When the patterns belong to 2X , the min-max basis of impli- cation rules[10] that represents all the implications t → t0 that hold on O, i.e. such that ext(t) ⊆ ext(t0 ), is defined as follows: m = {g → f \g | f is a closed pattern , g is a generator f 6= g, ext(t) = ext(f )} We define hereunder closure operators and also interior operators that will be fur- ther used to restrict the support sets to be abstract support sets. In what follows all ordered sets are finite, and in particular any topped meet-semilattice (resp. pointed join- semilattice) is a lattice. 7 http://webdocs.cs.ualberta.ca/˜zaiane/htmldocs/ConfRanking. html. DB = {VLDB, SIGMOD, PODS, ICDE, ICDT, EDBT, DASFAA, CIKM}; DM= {SIGKDD Explorations, ICDM, PAKDD, ECML/PKDD, SDM}; AI= {IJCAI, AAAI, ICML, ECML/PKDD}; 8 We recall that in a lattice any pair of elements (x, y) has a greatest lower bound x ∧ y (or meet) and a least upper bound (or join) x ∨ y Definition 1 Let U be an ordered set and f : U → U a self map such that for any x, y ∈ U , f is monotone, i.e. x ≤ y implies f (x) ≤ f (y) and idempotent, i.e. f (f (x)) = f (x), then: - If f is extensive, i.e. f (x) ≥ x, f is called a closure operator - If f is intensive, i.e. f (x) ≤ x, f is called a dual closure operator, an interior operator, or also a projection. In the first case, an element such that x = f (x) is called a closed element. Ranges of interior operators on lattices are called abstractions and are characterized by the following Proposition: Proposition 1 (see [2]) A subset A of X = 2O is the range p[X] of some interior operator p on X, if and only if for any elements x, y in A, their join x ∪ y also belongs to A and A contains the empty set. The interior operator is related to its range as follows: p(x) = sup{a∈A|a≤x} a. Let then p be the interior operator associated to some abstraction A, p(x) is the great- est element of A included in x. Closed pattern analysis has been recently extended to abstract closed pattern analysis by noticing that applying an interior operator on the ex- tensional space 2O we obtain again closure operators on the pattern language 2I [11,2]: Proposition 2 Let X = 2O and L = 2I , p be an interior operator on 2O , and A = p[X] be the associated abstraction, we have that (int, p ◦ ext) is a Galois connection on (A, L), i.e.: f = int ◦ p ◦ ext is a closure operator on L, The abstract support set of pattern q is defined as p ◦ ext(q). There is then a unique abstract support closed pattern, i.e. a most specific pattern among all patterns sharing the same abstract support set, which is obtained as f (q) = int ◦ p ◦ ext(q). f (q) is then called an abstract closed pattern. This leads to the definition of abstract concepts organized in a concept lattice: Corollary 1 [2]. The set of (abstract support set, abstract closed pattern) pairs (e = ext(c), c = int(e)), ordered following A, is a lattice called an abstract concept lattice. Note that, as p is monotone, whenever ext(q) ⊆ ext(w), i.e. q → w is valid we also have extA (p) = p ◦ ext(q) ⊆ extA (w) = p ◦ ext(w), i.e. the abstract implication 2A q → 2A w is also valid. This way we obtain abstract min-max basis with the same definition as in section 3.1 except that extA replaces ext and therefore abstract implications relate minimal el- ements (i.e. A-generators) to maximal element (the abstract closed pattern, or A-closed pattern) of the same abstract equivalence class. We have then the following definition: Definition 2 The abstract min-max basis mA of valid abstract implications is defined as mA = {2A g → 2A f \g | f is an A-closed pattern, g is a A-generator , f 6= g, extA (g) = extA (f )} 3.2 Graph abstractions These ideas has been applied to attributed graphs by defining graph abstractions [1]. The set of objects O is then the set of vertices of a graph G = (O, E) and each vertex o is labelled by an attribute pattern d(o) ∈ 2I . A graph abstraction is an abstraction of 2O defined through a characteristic property P (x, e) which expresses some minimal connectivity requirement of the vertex x within the subgraph Ge induced by some vertex subset e: Lemma 1 Let P be such that – P (x, e) implies x ∈ e and – e ⊆ e0 and P (x, e) implies P (x, e0 ), and let q be a mapping defined by q(e) = {x ∈ e|P (x, e)}, then the mapping p defined by p(e) = fixedpoint(q, e) is an interior operator on 2O p(e) represents the greatest vertex subset of e inducing a subgraph whose vertices all satisfy the associated characteristic property. We give hereunder examples of graph abstractions, defined through their characteristic property and exemplified in Figure 2. 1. degree ≥ k. 2. k-club ≥ s: x has to belong to at least one k-club of size at least s in Ge . This is a relaxation of the notion of clique[12]: a k-club is a subset c of vertices such that there is a path of length ≤ k between any pair of vertices in Gc . A triangle, a 3-clique, is a 1-club of size 3 (Figure 2-a). Figure 2-b represents a 2-club of size 6 and therefore a 2-club≥ 6 abstract group. 3. nearStar(k, d): x has to have degree at least k or there must be a path of length at most d between x and some y with degree at least k. For instance, the simplest nearStar(8, 1) abstract group is a central node connected with 8 nodes. Such an abstraction is useful when we want the abstraction to preserve hubs [13](i.e high degree vertices) together with their (low degree) neighbors (see Figure 2-c). 4. cc ≥ s: x has to belong to a connected component of size at least s in Ge (see Figure 2-d). 5. k-cliqueGroup ≥ s: x has to belong to a k-clique group of size at least s. A k- clique group is a union of k-vertex cliques that can be reached from each other through a series of adjacent k-vertex cliques (where adjacency means sharing k - 1 nodes). Maximal k-clique groups are denoted as k-cliques communities and formalize the idea of community in complex networks [14]. Finally, it is interesting to note that we can combine two (or more) abstractions A1 and A2 in two ways, defining a new composite abstraction either stronger or weaker than both A1 and A2 . For instance, we may want to consider an abstract subgraph where vertices both have a degree larger than some k and belong to a connected component exceeding a minimal size s. On the contrary, we may want an abstract subgraph such that at least one of the two characteristic properties is satisfied by all the vertices. This would be the case for instance, if we want to keep both vertices that have a degree larger than, say 10, and vertices in a star, i.e connected to a hub which degree is at least 50. The following lemma states that we can freely combine abstractions in both directions. high degree vertices) together with their (low degree) neighbors and abstract closures int p ext(t). All top-down generate and clos (see Figure 2-c). algorithms, like LCM [16] can then be adapted to direct computatio 4. cc s: x has to belong to a connected component of size at least of abstract closed patterns4 . In the experiments in the next section w s in Ge (see Figure 2-d). have used an indirect approach: we first compute frequent closed pa 5. k-cliqueGroup s: x has to belong to a k-clique group of size terns and corresponding generators using the CORON software[15 at least s. A k-clique group is a union of k-cliques (cliques of Starting from the closed patterns t and their supports, we then com size k) that can be reached from each other through a series of pute the abstract closed patterns int p ext(t). Finally we consid adjacent k-cliques (where adjacency means sharing k - 1 nodes). for each abstract closed pattern tA the generators of all the closed pa Maximal k-cliques groups are denoted as k-cliques communities terns that have produced tA and select the minimal elements amon and formalize the idea of community in complex networks [10]. them in order to obtain the corresponding abstract generators5 . Fro abstract generators and abstract closed terms, computing the min max implication rule basis is straightforward. On one hand, the in direct approach needs prior computation of the (non abstract) close ⇥ ⇥ patterns, and this can be much more costly than the direct compu ⇥ ⇥ ⇥ ⇥ tation of abstract closed patterns. On the other hand, once this fir computation is performed, we can apply as many abstract comput tions we need, varying graph abstractions and their parameters, an this can be cost-saving when investigating some new large attribute (a) (b) (c) (d) graph (see Section 4.3). We describe hereunder a generic algorithm, relying on the abstra Figure 2. Graph abstractions corresponding to various vertex characteris- Fig. 2. Graph abstractions corresponding tion characteristic property, to compute the projection of some subs tic properties. In each graph to various plain vertex circles and plain characteristic properties. lines form the abstract In each graph of the set of objects O: plain circles and plain lines subgraph, form crosses andthe abstract dotted subgraph, lines represent crosses the vertices and and edges out dotted of the lines represent the vertices and edges abstract out ofsubgraph. the abstract (a) x hassubgraph. to belong to a(a) x has triangle, (b) to belong x has totoaa triangle, to belong (b)ex✓has // Given to a characteristic property P O and 2-clan of size at least 6, (c) x has to be connected to a vertex y such that the belong to a 2-club of size at least 6, (c) x has to be connected to a vertex y suchu0thatfalse the degree degree of y is at least 6, (d) x has to belong to a connected component whose e e of y is at least 6, (d)size x ishas to belong at least 3. to a connected component whose size is at least 3. While u = false u true For all vertex x in e0 Finally, it is interesting to note that we can combine two (or more) If P (x, e0 ) is false abstractions A1 and A2 in two ways, defining a new composite ab- u false straction either stronger or weaker than both A1 and A2 . For in- e0 e0 {x} Lemma 2 Let P1stance,and weP2may two wantcharacteristic properties to consider an abstract subgraph of abstractions where vertices defined endIfon the both have a degree larger than some k and belong to a connected endFor same object set O,component and let exceeding P1 ∧ P2a and P ∨ P be defined as follows: minimal1 size s.2 On the contrary, we may endWhile want an abstract subgraph such that at least one of the two charac- // As u = false, P (x, e0 ) is true for all x in e0 – P1 ∧ P2 (x, e)teristic = P1properties (x, e) ∧is P satisfied by all the vertices. This would be the 2 (x, e) // e0 = p(e0 ) is the abstraction of e with respect to P case for instance, if we want to keep both vertices that have a degree – P1 ∨ P2 (x, e)larger = Pthan, 1 (x,saye) 10,∨and P2vertices (x, e)in a star, i.e connected to a hub which degree is at least 50. The following lemma states that we can freely This generic algorithm is in O(n2 ⇤ d) where d is the cost of com combine abstractions in both directions. puting P (x, e0 ). In the graph abstraction case, computing P (x, e Both P1 ∧ P2 and P1 ∨ P2 are characteristic properties of abstractions.requires to update the induced subgraph Ge0 when some vertex is r Lemma 2 Let P1 and P2 two characteristic properties of abstrac- moved from e0 . Furthermore, the cost d depends on the characterist tions defined on the same object set O, and let P1 ^ P2 and P1 _ P2 property and will be small as far as the property needs to consid be defined as follows: only close neighbors of x. For instance, considering the degree 3.3 Experiments abstraction, first, there is no need to access neighbors of x, and fu • P1 ^ P2 (x, e) = P1 (x, e) ^ P2 (x, e) thermore, rather than explicitly updating Ge0 when some x is r • P1 _ P2 (x, e) = P1 (x, e) _ P2 (x, e) moved from e0 it is more efficient to decrease the degree of the ve Some experiments on the two datasets described in Section 2 have been Both P1 ^ P2 and P1 _ P2 are characteristic properties of abstrac- performed tices connected to x in e0 . Another example is the cc s graph ab and presented in [1]. tions.We discuss here some new details and experiments on the straction, in DBLP which computing the abstraction of some e comes dow to compute the connected components of Ge and remove the sma dataset. The experiment consisted in applying a degree ≥ k abstraction Finally note that requiring a frequency property also corresponds with increasing ones with no need to iterate the process. k-values and we focussed in abstract to an abstraction patterns property whose characteristic obtained is Pwith k==|e|16 which corresponds m (x, e) to a very strong abstraction: minsupp, and that in ancanabstract support be therefore combined setto each author is required any abstraction, 4 Experiments to have therefore defining frequent abstract closed patterns. 16 co-authors within the abstract support set. We obtained few abstractWe closed patterns consider here some preliminary experiments in three datasets. I and in particular the abstract closed pattern VLDBJ, ICDE, SIGMOD,allVLDB 3.2 Graph-based closed patterns computation and and the the data is described as a graph G = (O, E three experiments, related abstract implication 2 VLDBJ → 2 ICDE, SIGMOD, VLDB. analysis Both 4 Work the ab- in progress stract closed pattern and its abstract generator VLDBJ have an When we have defined abstractions and corresponding projections, abstract support set 5 Recall that each closed pattern that produces an abstract closed pattern t represents an equivalence class of patterns that will be included in the cla of 38 among the graph-based 1276 VLDBJ authors abstract closed in are patterns thealso dataset. The implication de facto defined. Using of tstates thatequivalence A in the new a relation relying on abstract supports. dense group of co-authors that have published in the Very Large Database Journal also have published in several database conferences. We present Figure 3 the correspond- ing subgraph. Such a very dense co-authoring subgraph within the VLDBJ subgraph is somewhat unexpected. We made then some investigations in the DBLP repository, focussing of these authors, and found an article whose abstract begins as follows: A group of senior database researchers gathers every few years to assess the state of database research ... with the following reference: dblp: Serge Abiteboul 28/07/2015 15:18 Entrepôts de contenu autour de XML et des services Web. EDA 2006: 1-2 [c128] Serge Abiteboul, Ioana Manolescu, Emanuel Taropa: A Framework for Distributed XML Data Management. EDBT 2006: 1049-1058 [c127] Serge Abiteboul, Pierre Senellart: Querying and Updating Probabilistic Information in XML. Fig. 3. The subgraph obtained when applying the degree ≥ 16 abstraction to the VLDBJ EDBT 2006: 1059-1068 subgraph in the DBLP co-authoring experiment. [c126] Karl Schnaitter, Serge Abiteboul, Tova Milo, Neoklis Polyzotis: COLT: continuous on-line tuning. SIGMOD Conference 2006: 793-795 2005 [j56] Serge Abiteboul, Rakesh Agrawal, Philip A. Bernstein, Michael J. Carey, Stefano Ceri, W. Bruce Croft, David J. DeWitt, Michael J. Franklin, Hector Garcia-Molina, Dieter Gawlick, Jim Gray, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Martin L. Kersten, Michael J. Pazzani, Michael Lesk, David Maier, Jeffrey F. Naughton, Hans-Jörg Schek, Timos K. Sellis, Avi Silberschatz, Michael Stonebraker, Richard T. Snodgrass, Jeffrey D. Ullman, Gerhard Weikum, Jennifer Widom, Stanley B. Zdonik: The Lowell database research self-assessment. Commun. ACM 48(5): 111-118 (2005) [j55] Yannis E. Ioannidis, David Maier, Serge Abiteboul, Peter Buneman, Susan B. Davidson, In some sense the explanation of the Edward patternA.we Fox, Alon Y. Halevy, discovered Craig A. is straightforward. How- Knoblock, Fausto Rabitti, Hans-Jörg Schek, Gerhard Weikum: ever, the whole purposeDigital of pattern mining is to find unexpected patterns, library information-technology infrastructures. Int. J. hidden within large datasets, and interpret themLibraries on Digital in order to266-274 5(4): acquire(2005) some new knowledge. It is exactly what happens here: we were not aware of these regular meetings of senior database re- [j54] Serge Abiteboul, Richard Hull, Victor Vianu, Sheila A. Greibach, searchers, and we learned something Michael new, A. Harrison, though, Ellis ofDaniel Horowitz, course, this knowledge J. Rosenkrantz, Jeffrey is clearly widely known within the D. database community. Ullman, Moshe Y. Vardi: When considering aInweaker memory of Seymour Ginsburg abstraction, namely here1928 a - 2004. degreeSIGMOD ≥ 4 Record abstraction, we 34(1): 5-12 (2005) obtain more abstract closed patterns sometimes made of several connected components. Figure 4[j53] represents the Tova DMKD,Milo, Serge pattern Abiteboul, IDArev subgraph Bernd together Amann, Omar with the subgraph Benjelloun, Frederic Dang induced by the abstract support Ngoc: set of the pattern. This abstract subgraph is made of two connected components, Exchanging the one in intensional the right part 30(1): 1-40 (2005) XML data. ACM Trans. Database Syst. of the Figure is made of 10 vertices and we are then interested in knowing whether there is some more specific pattern than the abstract [c125] Serge Abiteboul, closed pattern DMKD, IDArev Susan whichB. Davidson, would be Tova Milo: by this connected com- shared Active XML and Data Activation. Abstract State Machines 2005: ponent. Answering such11-16 questions means mining at a local level the attributed graph, and this is the subject of the next section. http://dblp.uni-trier.de/pers/hd/a/Abiteboul:Serge Page 9 sur 26 6 dblp.nb Extension Abstraite Locale centrée sur le triangle Fig. 4. The DMKD,IDArev pattern subgraph in the DBLP co-authoring experiment. The red ver- tices and edges represent the subgraph induced by the degree ≥ 4 abstract support set. 4 Local closed patterns in attributed networks In [5] we introduced locality in the closure framework with as main motivation investi- gating local patterns in attributed graphs. We first summarize here main definitions and results. For that purpose we have to consider pre-confluences and confluences which are structures weaker than lattices investigated in [1,5]. Confluences, in particular, are close to but different from confluent families as defined in [7]. We further denote by E x the up sets {y ∈ E|y ≥ x} of an ordered set E, by Ex its down sets {y ∈ E|y ≤ x}, and by min(E) the set of its minimal elements. First note that partial orders considered here are all finite. We first define a pre- confluence as an ordered structures that generalize the lattice structure: Definition 3 F is a pre-confluence if and only if for any m ∈ min(F ), F m = {x ∈ F | x ≥ m} is a lattice. A consequence of this definition is that a (finite) lattice is a pre-confluence with a minimum. The structure has a partial join operator: Lemma 3 For any x, y ∈ F m their least upper bound does not depend on m: 1. x ∨F y is the least element of F x ∩ F y This means that a pre-confluence is a union of lattices in which joins coincide. A particular case is which of a pre-confluence included in a host lattice and which is join preserving: Definition 4 Let T be a lattice and F ⊆ T be a pre-confluence with as join ∨F = ∨T , F is called a confluence of T . An abstraction of T , as defined above is a confluence of T with ⊥T as minimum. We have then the following property when considering 2O as the host lattice: Proposition 3 Let X = 2O be a lattice, F ⊆ X is a confluence of X if and only if for any x, y ∈ F m with m ∈ min(F ), we have that x ∪ y belongs to F . A confluence, is then associated to a set of interior operators: Proposition 4 Let F be a confluence of a lattice X, and m ∈ min(F ), – pm : X m → X m , such that pm (x) = ∨q∈F m ∩Xx q, is an interior operator and pm [X m ] = F m . We are concerned here with extensional confluences, i.e. confluences of X = 2O [5] that generalize extensional abstractions as graph abstractions. In this case, let x be an element of X greater than or equal to some minimal element m of F , then pm (x) returns the greatest subset of x in F that includes m. In the example that follows we define a graph confluence by only considering vertex subsets inducing connected subgraphs of some graph G = (O, E). 1 3 1 1 4 2 3 4 2 3 1 4 1 1 4 4 2 2 3 3 2 3 1 4 2 3 Fig. 5. A square graph (in the bottom of the figure) and the Hass diagram of the confluence F 1+3 of connected vertex subsets that contain vertices 1 or 3. Example 1. Let G = (O, E) be a graph (displayed at the bottom of Figure 5) whose vertex set is O = {1, 2, 3, 4}. Let F ⊆ 2O be the set of vertex subsets inducing con- nected subgraphs of G. We call them connected vertex subsets. F is a confluence whose set of minimal elements is min(F ) = {{1}, {2}, {3}, {4}}, i.e. the set of singletons of 2O . The union of two connected vertex subsets that each contains a given vertex s obvi- ously also is a connected vertex subsets and therefore F is a confluence of 2O . In what follows, by abuse of notation we write ps and F s rather than p{s} and F {s} .The inte- rior operator ps projects then any vertex subset e containing vertex s on the connected component of the subgraph Ge induced by e that contains s. The up set F s is then the set of connected vertex subsets containing s and the union of all these F s represents the whole set of connected subgraphs of G. The subset F 1+3 = F 1 ∪ F 3 representing con- nected vertex subsets containing vertices 1 or 3 also is a confluence. Figure 5 displays the diagram of F 1+3 . 2 The support set e of a pattern q may then be projected, through interior operators, on various smaller local support sets {ei } corresponding, in the graph confluence case, to the connected components of the pattern subgraph. These interior operators are asso- ciated to local closure operators[5]: Proposition 5 Let F be a confluence of X = 2O , m a minimal element of F and Lint(m) be the down set of the pattern lattice L whose elements q are such that q ≥ int(m), then fm = int ◦ pm ◦ ext is a closure operator on Lint(m) In the graph confluence case, let e = ext(q), ps (e) is the connected component of the pattern subgraph Ge to which belongs the vertex s. Obviously ps (e) = pv (e) for any vertex v in the same connected component. Therefore fs (q) is a local closed pattern w.r.t. any vertex in this connected component, i.e. the most specific pattern shared by the vertices in the connected component. Now an important result is that that the set of local support sets is a pre-confluence: Theorem 1. The mapping h : F → F : h(e) = pm ◦ ext ◦ int(e) for m ≤ e is a closure operator on F and E = h[F ] is a pre-confluence. h(e) is therefore the local support set of int(e) that contains m ≤ e. h[F ] is a pre-confluence isomorphic to the set P of local concept pairs defined as follows: Definition 5 The set of local concept pairs P = {(e, l) | e = pm ◦ ext(l), l = int(e), m ≤ e} is called a local concept pre-confluence. To summarize we have defined local concepts as (local support set, local closed pat- terns) pairs and shown they were organized in a structure with possibly several minimal elements, therefore generalizing the concept lattice definition. In the simple graph con- fluence exemplified above the local support sets simply are the connected components of the pattern subgraphs. We will now extend graph confluences by intersecting this simple graph confluence with abstractions. 4.1 Cc-confluences We remark now that we can freely intersect confluences: Proposition 6 Let F1 and F2 be confluences of X, then F1 ∩ F2 is a confluence And as abstractions of X are confluences of X with the bottom element of X as their unique minimal element, the above proposition means we can freely intersect ab- stractions with confluences to build smaller confluences. Many confluences can then be derived from a graph confluence by intersecting it with some abstractions. We call this family of confluences the cc-confluences. For instance, considering A as the k- clique abstraction, we obtain the cc-confluence of connected subgraphs of G made of k-cliques. Note that cc-confluences have an important property: rather than consider- ing the minimal elements of F when defining local closure operators we can consider vertices. This is because given a vertex subset v in some pattern subgraph, all minimal elements containing v belong to the same connected component as v, and therefore the local support sets are the same. This is computationally important as this means that when considering local support sets we only need to consider each vertex in the support set and associate it to the connected component to which it belongs. 4.2 Local implications Inclusion of local support sets define local implication rules 2m q → 2m w where m is a minimal element of FA in the support set of q. Note that, as the local support set of pattern q is obtained by applying an interior operator, which is monotone, to the support set of q, we have that whenever 2q → 2w is valid and m ⊆ extA (w), we also have that 2m q → 2m w is valid, i.e. we may infer the latter local rule from the former abstract rule. We search now for a basis B of valid local implication rules from which we may infer any local implication rules. We will consider a basis Bm for a given minimal element m of F and obtain the whole basis B = ∪Bm by joining these bases. Consider a given abstract closed pattern c whose abstract support set has a connected component that contains m, and let l = fm (c) be the corresponding local closed pattern, with respect to the cc-confluence FA . We have then that the implication rule 2m c → 2m l holds. We select then a basis Bm of informative (l 6= c) and irredundant ( there is no other rule 2m c0 → 2m l with c0 less specific than c in the rule set) ones. From Bm we may infer all local implication rules associated to m by applying standard axioms in the same way as in the case of the min-max basis in the standard closed or abstract framework. The basis B = ∪Bm represents the local knowledge deriving from the reduction of the extensional space from abstraction A to cc-confluence FA . 4.3 Experiments on cc-confluences To shortly examplifiy local implications and local concepts we come back to our exper- iments on the DBLP dataset in Section 3.3 and specifically the abstract closed pattern DMKD, IDArev, with respect to the degree ≥ 4 abstraction, whose abstract support set is represented in red on Figure 4. When considering the connected component on the left of FIgure 4, we obtain the local implication DMKD,IDArev →268924 DMgroup stating that in the connected component of the abstract subgraph to which belongs the author 268924, each author has also published in some data mining conference belong- ing to DMgroup. Now, when considering the Teenage Friends attributed graph displayed Figure 1, clearly the friendship relations are organized in 3-cliques, therefore any stronger ab- straction will be poorly informative. However, as mentioned in Section 1, when consid- ering the 3-clique abstract graph associated to the empty pattern the unique connected component could be separated in several (overlapping) communities (displayed in vari- ous colors). We discuss and exemplify in the next section how to apply the local closure strategy to discover such subcommunities in an attributed graph. 5 Derived cc-graph confluences In what follows, we will consider a family T ⊆ 2O of vertex subsets, and consider T as the vertex set of a graph GT = (T, ET ) derived from G. The simple graph confluence F of 2T is then the new extensional space and we will search for the corresponding local closed patterns. The local support sets are afterwards transformed into support sets in 2O . Let u : 2T → 2O be such that u(eT ) = ∪t∈eT t. u(eT ) is called the flattening of eT . We consider then the two maps extT and intT defined as follows: – extT : L → 2T with extT (q) = {t|t ⊆ ext(q)} – intT : 2T → L with intT (eT ) = int ◦ u(eT ) extT (q) represents the support set of q in T when considering that q occurs in t when- ever q occurs in all elements of t (seen as a subset of O). Conversely intT (eT ) represents the greatest pattern in L whose support set in T includes eT , i.e. whose support set in O contains, as subsets, the elements of eT . Now, consider as T the family of k-cliques of G and that (t1 , t2 ) ∈ ET whenever t1 and t2 share k − 1 vertices in G. A k-community in G [8] is a vertex subset that results from the flattening (in the sens defined above) of some connected component of GT . The local closed patterns w.r.t. F are then most specific patterns occurring in k-communities of pattern subgraphs of G. This way we obtain a local concept pre-confluence, and associated local implications, whose local support sets are these k-communities. 5.1 Experiments on derived cc-confluences Coming back to our Teenage Friendship attributed graph, we have applied this strategy and built the derived graph GT where T is the set of 3-cliques of the original attributed graph. We display Figures 6 and 7 the local concept pre-confluence of 3-communities which size is greater than or equal to 4 members9 . The minimal 3-communities are the lowest ones on both Figures. Each element of the pre-confluence represents a (3- community, local closed pattern) pair but may be associated to several non redundant lo- cal implications. This happens for one 3-community displayed on the right at the bottom of Figure 6 and associated to two local implications each represented in a square. Each square displays in red the 3-community, and in red+green+blue the abstract support set of the abstract closed pattern forming the left part of the implication. In Figure 7 we have a unique maximal 3-community on the top, and a hierarchy of sub-communities. 6 Implementation In our experiments we use the CORON software[15] to compute frequent closed pat- terns, according to some frequency threshold, then apply a set of PYTHON functions 9 Formally, this means that we also apply to the derived graph an abstraction to avoid connected components corresponding to 3-communities smaller than 4 members. file:///Users/soldano/Dropbox/rangementsAutres/softGuillau... file:///Users/soldano/Dropbo Fig. 6. The pre-confluence of size ≥ 4 3-communities of the Teenage Friendship graph(part-I) file:///Users/soldano/Dropbox/rangements Fig. 7. The pre-confluence of size ≥ 4 3-communities of the Teenage Friendship graph(part-II) 1 of 1 17/06/2015 11:08 1 of 1 as a post processing10 . Starting from the set of frequent (possibly abstract) closed pat- terns C we then compute for each such pattern c ∈ C the subgraph induced by its (abstract) support set, extract the various connected components {e1 , ..., ek } that are large enough, compute the corresponding local closed patterns {c1 , ..., ck } and output the corresponding local implications. When computing k-communities, we start from the k-clique graph abstract closed and build the k-clique graph GT , where T is the set of k-cliques in G, compute the local closures corresponding to the subgraphs of GT induced by the connected components of our abstract closed patterns, and output the corresponding triples, where the local support sets are flattened to be expressed as subsets of O. In a work in progress, we consider a more efficient computation of abstract and local closed patterns and related implications. The corresponding algorithm uses a divide and conquer strategy similar to that proposed in [7], and therefore allows to directly apply the frequency constraints at the abstract and local level. 7 Conclusion In the present article we are interested in addressing problems in which the extensional space, made of the vertex subsets of an attributed network, is constrained according to connectivity properties. We have first considered abstract vertex subsets in which a constraint have to be satisfied by each vertex in the subgraph they induce, as for instance a minimum degree constraint. The extensional space is in this case a particular lattice called an abstraction. We have then shown, benefiting from previous work in FCA, how abstract support closed patterns, i.e. maximal patterns among those sharing the same abstract support set, could be obtained using a closure operator. This led to define a wide class of abstract concept lattices, whose elements are (abstract support set, abstract closed pattern) pairs, each corresponding to a particular abstraction. We obtain this way a global information on how the graph topology is related to the patterns support sets. We have then considered a way to extract local knowledge from an attributed network. For that purpose, using a recent extension of FCA to local extensional spaces, called confluences, we have related each pattern to various local support sets, corresponding to connected components in subgraphs induced by abstract vertex subsets. We obtain this way a set of local concepts, organized in a generalization of the lattice structure called a pre-confluence. Furthermore we have defined both abstract implications and local implications rules representing knowledge which is valid at the abstract and local levels, i.e., regarding the latter, in the vicinity of particular vertices. Finally we have applied these ideas to define the pre-confluence of 3-communities in a social network. These 3-communities are in fact sub-communities as each is a 3-community in some subnetwork induced by an attribute pattern. Overall, what we propose here is a new way, brought by recent developments in Formal Concept Analysis, to explore social networks as attributed graphs. Future works concerns, on the extensional side, applying these ideas to attributed directed graphs or multiplex networks. We also consider to use abstract and local extensional constraints while extending the pattern language to a 10 The corresponding software is to be found in https://lipn.univ-paris13.fr/ ˜santini/ . wider class of pattern languages. First, as in [16,11,17] by building a meet-semilattice adatpted to the mining problem and using interior operators to reduce it to a tractable language. This has been in particular successfully applied to graph mining [18].Then, as in [6,7] by considering confluent languages allowing to consider connectivity within the pattern language. References 1. Soldano, H., Santini, G.: Graph abstraction for closed pattern mining in attributed network. In Schaub, T., Friedrich, G., O’Sullivan, B., eds.: European Conference in Artificial Intel- ligence (ECAI). Volume 263 of Frontiers in Artificial Intelligence and Applications., IOS Press (2014) 849–854 2. Soldano, H., Ventos, V.: Abstract Concept Lattices. In Valtchev, P., Jäschke, R., eds.: Interna- tional Conference on Formal Concept Analysis (ICFCA). Volume 6628 of LNAI., Springer, Heidelberg (2011) 235–250 3. Mougel, P.N., Rigotti, C., Gandrillon, O.: Finding collections of k-clique percolated com- ponents in attributed graphs. In: PAKDD(2), Advances in Knowledge Discovery and Data Mining - 16th Pacific-Asia Conference, PAKDD 2012, Kuala Lumpur, Malaysia, May 29 - June 1, 2012. Volume 7302 of Lecture Notes in Computer Science., Springer (2012) 181–192 4. Silva, A., Meira, Jr., W., Zaki, M.J.: Mining attribute-structure correlated patterns in large attributed graphs. Proc. VLDB Endow. 5(5) (January 2012) 466–477 5. Soldano, H.: Extensional confluences and local closure operators. In Baixeries, J., Sacarea, C., Ojeda-Aciego, M., eds.: Formal Concept Analysis - 13th International Conference, ICFCA 2015, Nerja, Spain, June 23-26, 2015, Proceedings. Volume 9113 of Lecture Notes in Computer Science., Springer (2015) 128–144 6. Soldano, H.: Closed patterns and abstraction beyond lattices. In Glodeanu, C.V., Kaytoue, M., Sacarea, C., eds.: Formal Concept Analysis 12th International Conference, ICFCA 2014, Cluj-Napoca, Romania, June 10-13. Volume 8478 of Lecture Notes in Computer Science., Springer (2014) 203–218 7. Boley, M., Horváth, T., Poigné, A., Wrobel, S.: Listing closed sets of strongly accessible set systems with applications to data mining. Theor. Comput. Sci. 411(3) (2010) 691–700 8. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043) (Jun 2005) 814–818 9. Bechara Prado, A., Plantevit, M., Robardet, C., Boulicaut, J.F.: Mining Graph Topological Patterns: Finding Co-variations among Vertex Descriptors. IEEE Transactions on Knowl- edge and Data Engineering 25(9) (September 2013) 2090–2104 10. Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., Lakhal, L.: Generating a condensed repre- sentation for association rules. Journal Intelligent Information Systems (JIIS) 24(1) (2005) 29–60 11. Pernelle, N., Rousset, M.C., Soldano, H., Ventos, V.: Zoom: a nested Galois lattices-based system for conceptual clustering. J. of Experimental and Theoretical Artificial Intelligence 2/3(14) (2002) 157–187 12. Balasundaram, B., Butenko, S., Trukhanov, S.: Novel approaches for analyzing biological networks. Journal of Combinatorial Optimization 10 (2005) 23–39 13. Barabàsi, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439) (1999) 509–512 14. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043) (Jun 2005) 814–818 15. Szathmary, L., Napoli, A.: Coron: A framework for levelwise itemset mining algorithms. In Ganter, B., Godin, R., Nguifo, E.M., eds.: Third International Conference on Formal Concept Analysis (ICFCA’05), Lens, France, Supplementary Proceedings. (2005) 110–113 Supple- mentary Proceedings. 16. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. ICCS-01, LNCS 2120 (2001) 129–142 17. Ferré, S., Ridoux, O.: An introduction to logical information systems. Information Process- ing and Management 40(3) (2004) 383–419 18. Kuznetsov, S.O., Samokhin, M.V.: Learning closed sets of labeled graphs for chemical appli- cations. In Kramer, S., Pfahringer, B., eds.: ILP. Volume 3625 of Lecture Notes in Computer Science., Springer (2005) 190–208