An Algorithmic Way to Generate Simplexes for Topological Data Analysis

Krzysztof Rykaczewski^{1,2}, Piotr Wiśniewski^{2}, and Krzysztof Stencel^{2,3}

1 DLabs, Lęborska 3B, 80-230 Gdańsk, Poland
  krykaczewski@gmail.com
2 Faculty of Mathematics and Computer Science, Nicolaus Copernicus University,
  Chopina 12/18, 87-100 Toruń, Poland
  pikonrad@mat.umk.pl
3 Faculty of Mathematics, Informatics and Mechanics, University of Warsaw,
  Banacha 2, 02-097 Warsaw, Poland
  stencel@mimuw.edu.pl

Abstract. In this article we present a new algorithm for creating simplicial Vietoris-Rips complexes that is easily parallelizable using computation models like MapReduce and Apache Spark. The algorithm does not involve any computation in homology spaces.

Keywords: data analysis, topology, simplicial complex

1 Introduction

The article of de Silva and Ghrist [2] showed that algebraic topology is a very practical tool. In that paper the authors analyse the coverage of an area by a network of sensors. The network is interpreted as a simplicial complex, whose homology type is then computed and exploited: if the complex is homotopy equivalent to a point, the coverage is complete. The authors thus translated a technical problem into a mathematical one. In the article [3] Ghrist searches for topological structures in big data sets. Again, this amounts to building simplicial complexes and analysing them.

The abovementioned works inspire the question of how to build simplicial complexes efficiently and how to analyse them. In this article we propose a novel algorithm to build a simplicial complex. The algorithm has the unique property that it can be implemented within massively parallel computational models like MapReduce and Apache Spark.

2 Background

Hereafter we assume that the given data set is a finite set of points P := {x_1, ..., x_n} ⊂ R^d. Such data can arise as a result of physical experiments, psychological tests etc., and can generally be high-dimensional. We have no insight into what the data actually look like or how they are embedded, but knowing their "shape" would be enough to say something about how they are arranged in the original space.

In recent papers Carlsson and collaborators [1] presented ideas for exploring properties of high-dimensional data sets. They use tools of algebraic topology to find certain characteristics of the data or a simpler representation of them. This approach allows one to discover objects and features that are immersed in a high-dimensional space.

We start the presentation of basic facts with the notion of a simplex. Colloquially speaking, a simplex is a generalization of a segment, a triangle and a tetrahedron to an arbitrary dimension. More formally, suppose the k + 1 points p_0, ..., p_k ∈ R^k are affinely independent, which means that p_1 − p_0, ..., p_k − p_0 are linearly independent. Then the k-simplex determined by them is the set of points

    σ := { λ_0 p_0 + ··· + λ_k p_k | λ_i ≥ 0, 0 ≤ i ≤ k, Σ_{i=0}^{k} λ_i = 1 },    (1)

i.e. a k-simplex is a k-dimensional polytope which is the convex hull of its k + 1 vertices. We often write σ = [p_0, ..., p_k] for short. The points p_0, ..., p_k are called the vertices of the k-simplex. Each simplex is uniquely determined by its vertices. Any subset of the vertices constitutes a simplex of dimension smaller than the whole one, called a face of the simplex.

We say that two vertices v and w are neighbors if there is a 1-simplex (an edge) connecting them.
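To make the affine-independence condition concrete, the following minimal sketch (our illustration, not part of the algorithm presented later; the function name is ours) tests it with NumPy by checking the rank of the difference vectors:

import numpy as np

def affinely_independent(points):
    """Check whether the k+1 given points p_0, ..., p_k are affinely
    independent, i.e. whether p_1 - p_0, ..., p_k - p_0 are linearly
    independent."""
    p = np.asarray(points, dtype=float)
    diffs = p[1:] - p[0]                    # the vectors p_i - p_0
    return np.linalg.matrix_rank(diffs) == len(points) - 1

print(affinely_independent([[0, 0], [1, 0], [0, 1]]))  # True: a triangle
print(affinely_independent([[0, 0], [1, 1], [2, 2]]))  # False: collinear points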
Definition 1. A simplicial complex K is a set of simplexes that satisfies the following conditions:
1. Any face of a simplex from K is also in K.
2. The intersection of any two simplexes σ_1, σ_2 ∈ K is either ∅ or a face of both σ_1 and σ_2.

Having a data set P, we can use it to build a variety of complexes. We shall now formulate the two most frequently used definitions.

Definition 2. For a discrete set of points P ⊂ R^d and a parameter ε > 0, we define the Čech complex of radius ε, denoted Č(P, ε), by

    Č(P, ε) := { [p_1, p_2, ..., p_k] | {p_1, p_2, ..., p_k} ⊂ P, ∩_i B(p_i, ε/2) ≠ ∅ },    (2)

where B(p, ε) is the closed ball of radius ε centered at p.

The nerve theorem states that the Čech complex Č(P, ε) has the homotopy type of the union of closed balls of radius ε/2 around the data P. This means that the Čech complex gives a faithful representation of the thickened data (they have a common "shape"). The Vietoris-Rips complex is closely related to the Čech complex.

Fig. 1. Example of a Čech complex.

Definition 3. For a given ε > 0, the Vietoris-Rips complex (later called the Rips complex) R(P, ε) is the largest simplicial complex that shares the same 1-skeleton (graph) as the Čech complex Č(P, ε). More explicitly,

    R(P, ε) := { σ := [p_1, p_2, ..., p_k] | {p_1, p_2, ..., p_k} ⊂ P, diam(σ) ≤ ε },    (3)

where diam(σ) is the diameter of the simplex σ.

Fig. 2. Comparison of Čech (left) and Vietoris-Rips (right) complexes.

Lemma 1 (Vietoris-Rips). For every finite set of points P ⊂ R^d and ε ≥ 0,

    Č(P, ε) ⊂ R(P, ε) ⊂ Č(P, 2ε).    (4)

Thus, if the Čech complexes for both ε and 2ε approximate the data well, then the Vietoris-Rips complex does so as well. This estimate can be further improved [2].

Remark 1. The Čech complex is difficult to compute, but it is quite small. The Vietoris-Rips complex, in contrast, is easy to compute, but is usually very big. Both Čech and Vietoris-Rips complexes can produce a simplicial complex of dimension greater than that of the considered space. Therefore, many alternatives to them have been considered: the Delaunay complex, the alpha complex, the (lazy) witness complex, the Mapper complex, the sparsified Rips complex and the graph induced complex (GIC).

For small ε > 0 we have a disjoint set of balls, whereas for big ε the covering by balls is simply connected (without holes), so we totally lose the knowledge about the structure of the data. Since (in general) we do not know which radius ε to take, we consider them all and obtain in this way a filtration of simplicial complexes, i.e.

    Č(P, ε_1) ⊂ Č(P, ε_2) ⊂ ... ⊂ Č(P, ε_n),    (5)

for 0 < ε_1 ≤ ε_2 ≤ ... ≤ ε_n. As the radius changes, so does the topology of the data: some holes are created and some disappear. Roughly speaking, the holes that last longest represent the shape of the data. This is usually represented in the form of a so-called barcode [3].

Let us recall the following definition from topology.

Definition 4. Two topological spaces X and Y are called homotopy equivalent if there exist continuous maps f : X → Y and g : Y → X such that the composition f ◦ g is homotopic (i.e. it can be deformed in a continuous way) to the identity id_Y on Y, and g ◦ f is homotopic to id_X.
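As an illustration of Definition 3, the following brute-force sketch (our own, feasible only for very small point sets; all names are ours) enumerates the simplexes of R(P, ε) directly from equation (3). The algorithm of Section 4 exists precisely because this direct enumeration does not scale:

from itertools import combinations
import math

def rips_complex(points, eps, max_dim=2):
    """Enumerate the simplexes of R(points, eps) up to dimension max_dim:
    a subset spans a simplex iff its diameter does not exceed eps."""
    simplexes = []
    for k in range(1, max_dim + 2):          # k vertices span a (k-1)-simplex
        for subset in combinations(range(len(points)), k):
            if all(math.dist(points[i], points[j]) <= eps
                   for i, j in combinations(subset, 2)):
                simplexes.append(subset)
    return simplexes

P = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8), (3.0, 3.0)]
print(rips_complex(P, eps=1.2))
# The first three points are pairwise closer than 1.2, so they span a
# triangle; the fourth point only contributes a vertex.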
3 Related Work

Data analysis is the process of modeling data in order to discover useful information. Unfortunately, mining high-dimensional data is very challenging. Therefore, a number of methods have been created to make it easier for researchers. One of the youngest but most rapidly growing fields is topological data analysis (TDA), an approach to analysing data sets with techniques from topology. This method relies on the calculation of properties of the space/data which are maintained (are invariant) under certain transformations. Although most topological invariants are difficult to calculate, there are algorithms which can calculate homology groups and barcodes [4, 5]. These groups are often used in applications [3].

Topological methods, due to their nature, can cope with many problems where traditional methods fail. They have broad applications in the coverage problem in sensor networks [2, 3], detecting breast cancer [6], computer vision [7], detection of periodicity in time series (behaviour classification) [8] etc.

Another issue addressed in this field is the speed of the algorithms. In the paper [9] Dłotko et al. presented a distributed algorithm for homology computation over a sensor network. Their technique involves reduction and coreduction of simplicial complexes. Apart from the algebraic approach, there are algorithms that obtain an approximation of the shape of the data without any calculations on homology groups [10]. The algorithm proposed in the latter paper can have problems with more complicated structures, like the sphere, but it has been shown to do well with coverage in sensor networks.

Notice that apart from using Čech and Rips complexes there are other directions in computational topology, but we do not discuss them here [2].

4 A Method to Build a Complex Using MapReduce

In this section we present an efficient method to build Vietoris-Rips complexes. We make the following assumptions:

– R^n is a Euclidean space.
– A fixed number of dimensions n is given.
– M ⊂ R^n is the set of input data points.
– We assume that each input data point has a unique identifier id(x). Those identifiers are items of a linearly ordered set.
– For any i ∈ {1, 2, ..., n} we define Π_i : R^n → R to be the projection onto the i-th dimension.
– In R^n we assume the Euclidean distance. For any two x, y ∈ R^n let |x, y| denote the Euclidean distance between x and y.
– Let us choose ε > 0. We are going to build the Vietoris-Rips complex R(M, ε).

The first step of building a Vietoris-Rips complex is to find all pairs of points x, y ∈ M such that |x, y| < ε. If the set M is large, computing the distances between all pairs of points is too time-consuming. If the number of dimensions n is equal to 1, we can assign all points to segments of the form [kε, (k + 1)ε). Then, instead of computing distances for all pairs, we consider only those pairs of points that fall into the same segment or two neighbouring segments. As a result, each point is paired only with points from three segments. This can significantly accelerate the computation. Of course, in a one-dimensional space a simpler method based on sorting and sequential scanning will be faster. However, that method cannot be extended to more dimensions.

For a two-dimensional space we can apply a method similar to the one described above. Instead of segments, we then have squares with sides of length ε. Eventually, each point is paired for distance computation only with points from nine squares. The three-dimensional case is similar, but we use cubes with sides of length ε and each point is paired with points from 27 cubes. Unfortunately, the enormous growth of the number of cells prohibits the direct usage of this method in high dimensions. In the case of a 100-dimensional space, the "segments" are 100-dimensional hypercubes with ε-long sides, and each point would be paired for distance computation with points from 3^100 hypercubes. Obviously, the naïve all-pairs method will be better in this case.
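For illustration, here is a minimal in-memory sketch of this bucketing idea for the two-dimensional case (the function name and data layout are ours; the distributed variant used in this paper is developed below):

from collections import defaultdict
import math

def close_pairs_grid_2d(points, eps):
    """Find all pairs of 2-D points at distance < eps by hashing each point
    into an eps-sided grid cell and comparing it only with points from the
    same cell or the eight neighbouring cells."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(math.floor(x / eps), math.floor(y / eps))].append(idx)
    result = []
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), []):
                    for i in members:
                        # requiring i < j makes each pair appear exactly once,
                        # at the home cell of the point with the smaller index
                        if i < j and math.dist(points[i], points[j]) < eps:
                            result.append((i, j))
    return result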
However, we can apply the abovementioned three-dimensional method in multi-dimensional spaces indirectly. If we choose three dimensions i, j, k, we can project the whole space onto them and search for close pairs as in a three-dimensional space. Let us consider the projection Π = Π_i × Π_j × Π_k : R^n → R^3. Obviously, for each pair x, y ∈ R^n the following inequality holds:

    |Π(x), Π(y)|_3 ≤ |x, y|,

where |·,·|_3 denotes the Euclidean distance in R^3.

We assign all points to ε-sided cubes according to their projections onto the chosen dimensions. Then it is enough to pair each point for distance computation only with points residing in the same cube or in the 26 neighbouring cubes. If two points are assigned neither to the same cube nor to neighbouring cubes, the distance of their projections onto the chosen dimensions is larger than ε. Therefore, so is their distance in R^n, and their pair does not have to be considered.

The efficiency of this method notably depends on the choice of the three projected dimensions. Well-chosen dimensions may reduce the number of required distance computations. Our first idea was to compare the projection onto each single dimension with the uniform distribution using an algorithm based on the Kullback–Leibler divergence [11]. However, this approach required two scans of the input data. Moreover, it could hardly distinguish denser and sparser distributions, and the problem lies in sparseness (with respect to ε) rather than in uniformity. As a result, we invented the following quality measure of dimensions. For a given dimension i and an integer t we define:

    σ(i, t) = |{x ∈ M | tε ≤ Π_i(x) < (t + 1)ε}|.

The number σ(i, t) tells how many points fall into the segment identified by t when we project onto the dimension i. Note that if a point x satisfies the condition in the definition above, then t = ⌊Π_i(x)/ε⌋, where ⌊a⌋ is the largest integer not greater than a. For a dimension i ∈ {1, 2, ..., n} and an input data point x ∈ M we define t_i(x) to be this integer:

    t_i(x) = ⌊Π_i(x)/ε⌋.

Observe that the computation of all non-zero values of σ is possible using only a single scan of the input data. Moreover, it is easily implementable in the MapReduce computation model. The mapper emits ⟨i, t_i(x)⟩ for each input data point x ∈ M and for each dimension i. The reducer simply counts the number of occurrences of each pair ⟨i, t⟩. The non-dispersion measure of a dimension i is the number

    Δ_i = Σ_{t∈Z} σ(i, t)².

Now, for each dimension we compute its non-dispersion measure by means of the MapReduce procedure presented above. Then we choose the three dimensions with the smallest non-dispersion.

Assume i, j, k to be the three dimensions with the smallest non-dispersion. The algorithm that finds all pairs of points closer than ε using these dimensions can also be easily parallelized in the MapReduce computation model. The mapper reads all input data points and for each point x emits 27 key-value pairs ⟨⟨t_1, t_2, t_3⟩, x⟩, where t_1, t_2, t_3 satisfy the following conditions:

    t_1 ∈ { t_i(x) − 1, t_i(x), t_i(x) + 1 },
    t_2 ∈ { t_j(x) − 1, t_j(x), t_j(x) + 1 },
    t_3 ∈ { t_k(x) − 1, t_k(x), t_k(x) + 1 }.

A single reducer collects all pairs that come with a given key ⟨t_1, t_2, t_3⟩. If a pair x, y satisfies the conditions id(x) < id(y), t_1 = t_i(x), t_2 = t_j(x) and t_3 = t_k(x), then the distance between the points x and y is computed. The three equalities on the cell coordinates ensure that x itself lies in the ε-sided cube ⟨t_1, t_2, t_3⟩, so each pair of points is examined at most once. If |x, y| < ε, the pair ⟨x, y⟩ is added to the result.
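In PySpark, the single-scan computation of all measures Δ_i might be sketched as follows. We assume that points is an RDD of (id, coords) pairs, with coords a list of n floats; these names and the data layout are our assumptions, not fixed by the paper:

import math

def choose_dimensions(points, n, eps):
    """Compute Delta_i = sum_t sigma(i, t)^2 for every dimension i in a
    single scan and return the three dimensions with the smallest measure."""
    deltas = (points
              # map: emit ((i, t_i(x)), 1) for every dimension i
              .flatMap(lambda p: [((i, math.floor(p[1][i] / eps)), 1)
                                  for i in range(n)])
              # reduce: sigma(i, t) = number of points in segment t of dim i
              .reduceByKey(lambda a, b: a + b)
              # square each sigma(i, t) and sum over t to obtain Delta_i
              .map(lambda kv: (kv[0][0], kv[1] ** 2))
              .reduceByKey(lambda a, b: a + b)
              .collect())
    return [i for i, _ in sorted(deltas, key=lambda kv: kv[1])[:3]]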
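The close-pair search itself can then be sketched in the same setting (again our own sketch with the same assumed (id, coords) layout; dims stands for the three chosen dimensions, and the 27 keys per point correspond to the cube of the point and its 26 neighbours):

from itertools import product
import math

def close_pairs(points, dims, eps):
    """Emit each point under 27 cube keys, group by cube, and compute
    distances only for pairs whose first point lives in that very cube."""
    def cell(coords):
        return tuple(math.floor(coords[d] / eps) for d in dims)

    def emit(p):
        home = cell(p[1])
        return [(tuple(c + d for c, d in zip(home, delta)), p)
                for delta in product((-1, 0, 1), repeat=3)]

    def pairs_in_cube(kv):
        key, pts = kv
        return [(x[0], y[0])
                for x in pts if cell(x[1]) == key      # x lies in this cube
                for y in pts
                if x[0] < y[0] and math.dist(x[1], y[1]) < eps]

    return points.flatMap(emit).groupByKey().flatMap(pairs_in_cube)

On heavily skewed data one would avoid groupByKey, but this form stays closest to the map/reduce description above.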
The collection of all pairs ⟨x, y⟩ that satisfy |x, y| < ε is the most time-consuming phase of the construction of the Vietoris-Rips complex. Those pairs are one-dimensional simplexes. The second phase computes triplets ⟨x, y, z⟩ such that the three points are pairwise closer than ε and id(x) < id(y) < id(z). The following phases identify longer and longer lists of points that satisfy analogous conditions. This can be done as in the Apriori algorithm for finding frequent subsets, because if all points in a set are closer than ε to one another, then the same must hold for all its subsets. If ε is not too big, these phases are not time-consuming. However, it is much simpler to build higher-dimensional simplexes using the following lemma.

Lemma 2 (Raising the dimension of simplexes). Let R be a Vietoris-Rips complex. A simplex a := ⟨x_1, ..., x_n, x_{n+1}, x_{n+2}⟩ belongs to R if and only if all of the three following simplexes are also in R:

    b := ⟨x_1, ..., x_n, x_{n+1}⟩,
    c := ⟨x_1, ..., x_n, x_{n+2}⟩,
    d := ⟨x_{n+1}, x_{n+2}⟩.

Proof. The "only if" part is straightforward. If a ∈ R, then all its subsimplexes obviously belong to R. Thus, so do b, c and d.

The proof of the "if" part is based on the following property of Vietoris-Rips complexes: a simplex belongs to R whenever all its one-dimensional subsimplexes belong to R. Take a one-dimensional simplex e := ⟨x_j, x_k⟩ ⊂ a, where j < k and j, k ∈ {1, ..., n + 2}. We will show that e ∈ R. We have to consider three cases. Firstly, if j = n + 1 ∧ k = n + 2, then e = d ∈ R. Secondly, if j < n + 1 ∧ k = n + 2, then e ⊂ c ∈ R, thus e ∈ R. Thirdly, if k < n + 2, then j < n + 1, thus e ⊂ b ∈ R and also e ∈ R. Since the choice of j and k is arbitrary, the "if" part has been proven. So has been the whole lemma. □

As the result of the first phase of the algorithm, we have a collection pairs that contains all pairs ⟨x_i, x_j⟩ such that

    |x_i, x_j| < ε ∧ id_i < id_j.

We recall that id_i is the identifier of the point x_i. In the second phase of the algorithm, we build the collections of simplexes of subsequent dimensions n that belong to the Vietoris-Rips complex R. The algorithm stops when for a given n we get the empty list of simplexes. For a given n, we search for all b, c and d that satisfy the constraints of Lemma 2. The symbols introduced in Lemma 2 will prove to be handy again. We perform the following steps:

1. We convert the collection of simplexes ⟨x_1, x_2, ..., x_{n+1}⟩ of the dimension n to the collection of pairs ⟨⟨x_1, x_2, ..., x_n⟩, x_{n+1}⟩.
2. We compute a natural self-join of this set of pairs using the first coordinate. As a result we get the set of all triplets ⟨⟨x_1, x_2, ..., x_n⟩, x_j, x_k⟩ (a as in Lemma 2) such that there are pairs ⟨⟨x_1, x_2, ..., x_n⟩, x_j⟩ (b) and ⟨⟨x_1, x_2, ..., x_n⟩, x_k⟩ (c) in the previous step.
3. We leave (select) only those triplets that satisfy id_j < id_k.
4. We equi-join the remaining triplets with the set of all pairs using the last two coordinates of the triplets. We do this to assure that the last two coordinates of each triplet (d) form a one-dimensional simplex belonging to the complex.
5. We flatten the remaining triplets to get simplexes of the dimension n + 1 of the form ⟨x_1, x_2, ..., x_n, x_j, x_k⟩.

Note that this procedure contains only operations that are expressible in relational algebra (or calculus). Therefore, the whole algorithm is formulated in a highly abstract language that is known to be easy to optimise and execute in parallel.
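The five steps can be traced on a toy example in plain Python (a minimal in-memory sketch with names of our choosing; simplexes are stored as tuples of point identifiers sorted in increasing order):

from collections import defaultdict

def raise_dimension(simplexes, pairs):
    """One round of steps 1-5: combine n-dimensional simplexes that share a
    common prefix into (n+1)-dimensional ones."""
    pair_set = set(pairs)
    by_prefix = defaultdict(list)
    for s in simplexes:                      # step 1: (prefix, last vertex)
        by_prefix[s[:-1]].append(s[-1])
    result = []
    for prefix, lasts in by_prefix.items():  # step 2: self-join on the prefix
        for j in lasts:
            for k in lasts:
                if j < k and (j, k) in pair_set:   # steps 3-4: order, edge check
                    result.append(prefix + (j, k)) # step 5: flatten
    return result

# The four faces of a tetrahedron plus all six edges yield the tetrahedron:
triangles = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(raise_dimension(triangles, edges))   # [(1, 2, 3, 4)]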
We exploited this possibility and implemented the algorithm in Apache Spark for efficient execution on large computing infrastructures of commodity computers. An example Python code that implements the whole algorithm is shown below (we assume that points is an RDD of (id, coordinates) pairs and pairs is the RDD of 1-simplexes computed in the first phase, stored as tuples of identifiers sorted in increasing order).

def rightTupleSum(prefix, pair):
    # Concatenate the common prefix of b and c with the edge d (Lemma 2).
    return prefix + pair

komplex = []
komplex.append(points.map(lambda p: p[0]))   # 0-simplexes: the point identifiers
komplex.append(pairs)                        # initial 1-simplexes
base = pairs
pairsAsKey = pairs.map(lambda p: (p, 1))
numNewSimplexes = pairs.count()

# iterated phase: raise the dimension until no new simplexes appear
while numNewSimplexes > 0:
    base = base.map(lambda a: (a[:-1], a[-1]))           # step 1: (prefix, last)
    base = (base.join(base)                              # step 2: self-join on prefix
                .filter(lambda a: a[1][0] < a[1][1])     # step 3: keep id_j < id_k
                .map(lambda a: (a[1], a[0]))             # re-key by the candidate edge d
                .join(pairsAsKey)                        # step 4: d must be a 1-simplex
                .map(lambda a: rightTupleSum(a[1][0], a[0])))  # step 5: flatten
    numNewSimplexes = base.count()
    if numNewSimplexes > 0:
        komplex.append(base)

5 Conclusions

In this article we have presented an efficient method to build Vietoris-Rips complexes. Moreover, this method is easily parallelizable using the MapReduce computation model or Apache Spark. Further research will focus on algorithms that reduce the constructed complexes. This will facilitate finding the homology types of complexes and, as a result, will also allow assessing the correctness of the selected value of ε.

References

1. Carlsson, G., Ishkhanov, T., de Silva, V., Zomorodian, A.: On the local behavior of spaces of natural images. International Journal of Computer Vision 76 (2008) 1–12
2. de Silva, V., Ghrist, R.: Coverage in sensor networks via persistent homology. Algebraic & Geometric Topology 7 (2007) 339–358
3. Ghrist, R.: Barcodes: the persistent topology of data. Bulletin of the American Mathematical Society 45 (2008) 61–75
4. Edelsbrunner, H., Letscher, D., Zomorodian, A.: Topological persistence and simplification. Discrete and Computational Geometry 28 (2002) 511–533
5. Zomorodian, A., Carlsson, G.: Computing persistent homology. Discrete & Computational Geometry 33 (2005) 249–274
6. Nicolau, M., Levine, A.J., Carlsson, G.: Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences 108 (2011) 7265–7270
7. Freedman, D., Chen, C.: Algebraic topology for computer vision. Computer Vision (2009) 239–268
8. Vejdemo-Johansson, M., Pokorny, F.T., Skraba, P., Kragic, D.: Cohomological learning of periodic motion. Applicable Algebra in Engineering, Communication and Computing 26 (2015) 5–26
9. Dłotko, P., Ghrist, R., Juda, M., Mrozek, M.: Distributed computation of coverage in sensor networks by homological methods. Applicable Algebra in Engineering, Communication and Computing 23 (2012) 29–58
10. Ćwiszewski, A., Wiśniewski, P.: Coverage verification algorithm for sensor networks. In: Computer Applications for Bio-technology, Multimedia, and Ubiquitous City. Springer (2012) 397–405
11. Amari, S.I., Cichocki, A., Yang, H.H., et al.: A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems (1996) 757–763