<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Heterogeneous Data Co-Clustering by Pseudo-Semantic Affinity Functions</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>RAI Centre for Research and Technological Innovation</institution>
          ,
          <addr-line>Corso E. Giambone 68, I10135 Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The convergence between Web technology and multimedia production is enabling the distribution of content through dynamic media platforms such as RSS feeds and hybrid digital television. Heterogeneous data clustering is needed to analyse, manage and access desired information from this variety of information sources. This paper defines a new class of pseudo-semantic affinity functions that allows for a compact representation of cross-modal document relations.</p>
      </abstract>
      <kwd-group>
        <kwd>Co-clustering</kwd>
        <kwd>Pseudo-semantic Model</kwd>
        <kwd>Syntactic Affinity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The proliferation of multimedia production tools is enabling the convergence of
different media technologies. The result of this technological breakthrough is the
creation of new relationships between different media objects and the modalities
in which they are generated and consumed.</p>
      <p>Heterogeneous data clustering can be defined as the set of methods that
combine data assets that are different in nature, for example audiovisual and
textual material, and use the obtained information to present additional
knowledge, potentially not discoverable by the analysis of the individual sources. A
type of multi-dimensional heterogeneous data clustering is co-clustering, which
allows simultaneous clustering of the rows and columns of a matrix. Given a set
of M rows in N columns, a co-clustering algorithm generates co-clusters i.e. a
subset of rows which exhibit similar behaviour across a subset of columns, or
vice-versa. A pioneer work is described in [1], where a collection of text data
is modelled as a bipartite graph between documents and words, thus reducing
the clustering problem to a graph partitioning problem. A similar approach
applied to multimedia collections is presented in [7]. The work in [5] describes a
co-clustering algorithm that monotonically increases the preserved mutual
information by intertwining both the row and column clustering at all stages. As
clustering of high-dimensional data can suffer from the curse of
dimensionality problem, Domeniconi et al. [2] proposed a technique to generate clusters in
subspaces spanned by different combinations of dimensions.</p>
      <p>Another problem related to co-clustering is how to set the optimum
similarity function for maximising the cluster separability [3]. This paper presents
an asymmetric vector affinity function on which pseudo-semantic dependency
graphs among data objects are built based on the cross-projection of two sets of
heterogeneous data spaces. Under this framework we implemented a novel
coclustering algorithm for the seamless fusion of multimedia content coming from
various sources. Because of asymmetric nature of the adopted model,
hierarchical relations, i.e. syntactic entailment and equivalence, between data objects, as
well as representativeness degrees of the found groupings can be expressed.</p>
      <p>The paper is organised as follows. §2 describes the mathematical principles
on which the pseudo-semantic affinity measure is built. §3 outlines the functional
blocks performing the proposed co-clustering algorithm. §4 presents the
experimental evaluations conducted to demonstrate the effectiveness and potential
of the framework and the algorithms. §5 provides an example of a real-world
application based on the proposed technology. §6 concludes the paper.
</p>
    </sec>
    <sec id="sec-2">
      <title>Pseudo-semantic Affinity</title>
      <p>In our work co-clustering is based on the concept of pseudo-semantic
affinity (PSA), a generalisation of the concept of semantic similarity originally
proposed in [4]. By this generalisation we aim at increasing completeness without
losing accuracy. The PSA is loosely inspired by the way common language describes
relationships among people: we call affine those people who are related by some
kinship path. Some affinity concepts are symmetric, e.g. being a sibling of; some
symmetric affinities are similarities, e.g. DNA sequence alignment measurements;
but there are also generic affinities for which neither property holds, e.g. a
nephew is not his uncle's uncle.</p>
      <p>Definition 1. Let a, b be two non-negative real vectors of size N. A
pseudo-semantic affinity is a function Sm,n(a, b) : ℝ₀₊ᴺ × ℝ₀₊ᴺ → ℝ₀₊ of the form:</p>
      <p>Sm,n(a, b) = ⟨a, b⟩ / (‖a‖^m ‖b‖^n)   (1)
where m, n ∈ [0, +∞), and ⟨·,·⟩ is the inner product.</p>
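      <p>Eq. (1) can be sketched directly in code; the following minimal Python implementation (an illustrative sketch, not part of the original framework) also shows the asymmetry of the measure:</p>
      <preformat>
```python
import math

def psa(a, b, m, n):
    """Pseudo-semantic affinity S_{m,n}(a, b) = <a, b> / (||a||^m ||b||^n)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na ** m * nb ** n)

# S_{1,1} reduces to the Cosine similarity, and the function is
# asymmetric in general: S_{m,n}(a, b) differs from S_{m,n}(b, a).
a = [1.0, 0.0]
b = [1.0, 1.0]
print(psa(a, b, 1, 1))                  # cosine of 45 degrees, about 0.7071
print(psa(a, b, 2, 0), psa(b, a, 2, 0)) # 1.0 and 0.5
```
      </preformat>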
      <p>
        It can be demonstrated that the PSA is a pseudoquasimetric1 (or hemimetric)
for the similarity of a and b in the considered vector space. Also, if a and b are
two feature vectors representing two data objects, then Eq. (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) models syntactic
properties of the relations between a and b, such as equivalence, entailment (or,
conversely, dependency), and inclusion, thus providing hierarchical relationships
among the considered data sets.
      </p>
      <p>
        Let Δ = m − n and k ∈ ℝ. Then S(·) has the following analytical properties:
Auto-affinity (2), Co-symmetry (3), Symmetrisation (4), Index Shift/Scaling (5)-(6),
Argument Scaling (7), and Cosine Generalisation (8):
      </p>
      <p>Sm,n(a, a) = ‖a‖^(1−m) ‖a‖^(1−n) = ‖a‖^(2−n−m)   (2)</p>
      <p>Sm,n(a, b) = Sn,m(b, a)   (3)</p>
      <p>Sm,m(a, b) = Sn,n(a, b) ‖a‖^(−Δ) ‖b‖^(−Δ)   (4)</p>
      <p>Sm,n(a, b) = Sn,n(a, b) ‖a‖^(−Δ) = Sm,m(a, b) ‖b‖^Δ   (5)</p>
      <p>Sm+k,n(a, b) = Sm,n(a, b) ‖a‖^(−k) = Sm+k,n+k(a, b) ‖b‖^k   (6)</p>
      <p>Sm,n(a, ka) = k^(1−n) Sm,n(a, a)   (7)</p>
      <p>Sm,n(a, b) = cos(a, b) ‖a‖^(1−n−Δ) ‖b‖^(1−n)   (8)</p>
      <p>
        Eq. (2) implies that the affinity of a vector with itself depends on its norm.
However, it is possible to demonstrate that independence from ‖a‖ is achieved
iff Sm,n(a, a) = γ, where γ is a constant value. This implies that ‖a‖^(2−m−n) = γ,
which is true if and only if 2 − m − n = 0. In this case the PSA is said to be
well-defined. Eq. (8) states that any PSA is a generalisation of the Cosine similarity
measure, where cos(a, b) = ⟨a, b⟩ / (‖a‖ ‖b‖) = S1,1(a, b).
      </p>
      <p>1 See http://en.wikipedia.org/wiki/Metric_(mathematics)#Pseudoquasimetrics</p>
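      <p>The analytical properties above can be verified numerically; in the following self-contained sketch the vectors and exponents are arbitrary illustrative choices, not values from the paper:</p>
      <preformat>
```python
import math

def psa(a, b, m, n):
    # S_{m,n}(a, b) = <a, b> / (||a||^m ||b||^n), Eq. (1)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na ** m * nb ** n)

a, b = [3.0, 1.0], [2.0, 2.0]
m, n, k = 1.7, 0.4, 0.9
delta = m - n
na = math.sqrt(sum(x * x for x in a))
nb = math.sqrt(sum(x * x for x in b))

# Co-symmetry (3)
assert math.isclose(psa(a, b, m, n), psa(b, a, n, m))
# Symmetrisation (4)
assert math.isclose(psa(a, b, m, m), psa(a, b, n, n) * na ** -delta * nb ** -delta)
# Index shift (6)
assert math.isclose(psa(a, b, m + k, n), psa(a, b, m, n) * na ** -k)
# Argument scaling (7)
ka = [k * x for x in a]
assert math.isclose(psa(a, ka, m, n), k ** (1 - n) * psa(a, a, m, n))
# Well-definedness: m + n = 2 makes auto-affinity (2) equal to 1
assert math.isclose(psa(a, a, 1.5, 0.5), 1.0)
print("all properties verified")
```
      </preformat>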
      <sec id="sec-2-1">
        <title>Geometrical Properties of PSA</title>
        <p>Let A = {x1, . . . , xM} ⊆ ℝ₀₊ᴺ be a set of non-negative real-valued vectors
on which we want to solve a constraining problem on the values of the PSA
measurement. A particular constraining problem is represented by thresholding,
i.e. finding couples (xi, xj), with xi, xj ∈ A, such that Sm,n(xi, xj) ≥ α. One way in
which this problem can be solved is by analysing the adjacency
matrix</p>
        <p>A = Acos • F^(n,Δ) = Acos • Φ^(1−n−Δ) • (Φ^(1−n))ᵀ   (9)
where aij = Sm,n(xi, xj), • is the Hadamard product, Acos is the Cosine
similarity adjacency matrix, and φij = ‖xi‖. This re-formulation allows for the
interpretation that an adjacency matrix corresponding to a PSA Sn+Δ,n(xi, xj) is a
distortion of the Cosine similarity adjacency matrix, obtained by
Hadamard-multiplying Acos by the term Φ^(1−n−Δ) • (Φ^(1−n))ᵀ. We observe that the matrix Φ
is a static feature for A, i.e. it can be calculated once and for all, and that the level
of distortion of Acos depends only on n and Δ. To give empirical examples of
these features, let us consider as A a set of randomly generated vectors in the
space [0, 1] × [0, 1] and let us add the element x∗ = (1/2, 1/2) to A. We then
calculate the matrix A of Eq. (9) and find the following subsets of A:
Eqx∗ = {x ∈ A : Sm,n(x∗, x) ≥ α ∧ Sm,n(x, x∗) ≥ α}   (10)
Disx∗ = {x ∈ A : Sm,n(x∗, x) &lt; α ∧ Sm,n(x, x∗) &lt; α}   (11)
Enx∗ = {x ∈ A : Sm,n(x∗, x) ≥ α ∧ Sm,n(x, x∗) &lt; α}   (12)</p>
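        <p>The thresholding experiment described here can be reproduced in a few lines; in the following sketch the sample size and random seed are hypothetical choices, and each random point is classified against x∗ = (1/2, 1/2) into the four subsets of Eqs. (10)-(13):</p>
        <preformat>
```python
import math
import random

def norm(v): return math.sqrt(sum(x * x for x in v))
def dot(u, v): return sum(x * y for x, y in zip(u, v))
def psa(u, v, m, n): return dot(u, v) / (norm(u) ** m * norm(v) ** n)

random.seed(0)
# Random points in [0, 1] x [0, 1], plus the probe x* = (1/2, 1/2)
A = [(random.random(), random.random()) for _ in range(200)]
x_star = (0.5, 0.5)
m, n, alpha = 2.0, 0.0, 0.87

# Classify each x w.r.t. x* using the two asymmetric PSA values,
# mirroring the subsets Eq, Dis, En, Ien of Eqs. (10)-(13)
subsets = {"Eq": [], "Dis": [], "En": [], "Ien": []}
for x in A:
    fwd = psa(x_star, x, m, n) >= alpha   # S(x*, x) >= alpha
    bwd = psa(x, x_star, m, n) >= alpha   # S(x, x*) >= alpha
    key = {(True, True): "Eq", (False, False): "Dis",
           (True, False): "En", (False, True): "Ien"}[(fwd, bwd)]
    subsets[key].append(x)

print({k: len(v) for k, v in subsets.items()})
```
        </preformat>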
        <p>Ienx∗ = {x ∈ A : Sm,n(x∗, x) &lt; α ∧ Sm,n(x, x∗) ≥ α}   (13)
Eqs. (10) and (11) denote, respectively, syntactic equivalence and syntactic
disjointness between vectors x∗ and x. Eq. (12) denotes a syntactic entailment from
x∗ to x. Eq. (13) denotes an (inverse) syntactic entailment to x∗ from x (see [6]
for further detail). Examples of the results for four possible configurations of m,
n, with α = 0.87 are shown in Fig. 1, where the subsets Eqx∗, Disx∗, Enx∗, and
Ienx∗ are rendered in white, black, light grey, and dark grey respectively.
A particular case of Eq. (1) is when m = 2 and n = 0. In this situation the PSA
measure has some interesting properties w.r.t. plain Cosine similarity, which
will be illustrated by some examples in this Section. Empirical investigations
demonstrated the effectiveness of choosing S2,0(·) as the affinity function for
real-world clustering problems (see §4). Let us consider vectors a, b and c = a/2,
for which it results S1,1(a, b) = cos(a, b) ≈ 0.77, S2,0(a, b) = 1, S2,0(b, a) = 0.6,
S1,1(c, b) = cos(c, b) ≈ 0.77, S2,0(c, b) = 2, S2,0(b, c) = 0.3, S2,0(c, a) = 2 and
S2,0(a, c) = 0.5. It can be observed that S1,1(c, b) = S1,1(a, b), S2,0(a, c) =
S2,0(a, a/2) = (1/2) S2,0(a, a) and S2,0(a, b) = (1/2) S2,0(c, b), as defined by the
analytical properties of the PSA measure.</p>
        <p>These examples highlight the following qualitative properties of the
pseudo-semantic affinity measurement introduced in this research:
– The pseudo-semantic affinity measure accounts compactly for the
asymmetric relationship existing among vectors in terms of angle and size at the
same time;
– as such, and differently from Cosine similarity, the pseudo-semantic
affinity is sensitive to vector scaling, and accounts proportionally for the scaling
coefficient.</p>
        <p>A natural synthesis of the two above examples leads to the statement that vector a is
completely included in vector b, and that vector c is doubly included in vector b.
Furthermore, we can say that vector b is included by 3/5 in vector a. Intuitively, this
means that S2,0(a, b) accounts for the extent to which the vector a is explained
by the vector b.</p>
        <p>As a final example, let us consider a vector e defined as e = a + ε, with ε
orthogonal both to a and b and ρ = ‖ε‖/‖a‖ ≪ 1, i.e. a vector representing an
additive bi-orthogonal noise added to a. This example models situations in which there
are random errors in the calculation of feature vectors due to uncorrelated noisy
input, e.g. spurious terms included in a textual feature vector by an imperfect
HTML text extraction engine, or wrong words in the transcription of spoken
content. The PSA measures between e and b are:</p>
        <p>cos(e, b) = cos(b, e) ≈ cos(a, b) (1 − ρ²/2)</p>
        <p>S2,0(e, b) ≈ S2,0(a, b) (1 − ρ²) and S2,0(b, e) = S2,0(b, a).
These results highlight an important feature distinguishing the pseudo-semantic
affinity measurement from Cosine similarity: the addition of bi-orthogonal noise to
a vector always affects Cosine similarity, while it affects pseudo-semantic affinity
only partially. In other words, the degree of inclusion in another vector b of a
vector a perturbed by bi-orthogonal noise ε is affected, but the opposite
does not hold, i.e. the degree of inclusion of vector b in vector a is not
affected by bi-orthogonal noise added to a.</p>
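        <p>This noise-robustness claim can be checked directly; in the following sketch a, b and ε are hypothetical vectors, chosen so that ε is orthogonal to both:</p>
        <preformat>
```python
import math

def norm(v): return math.sqrt(sum(x * x for x in v))
def dot(u, v): return sum(x * y for x, y in zip(u, v))
def cos(u, v): return dot(u, v) / (norm(u) * norm(v))
def psa(u, v, m, n): return dot(u, v) / (norm(u) ** m * norm(v) ** n)

# a and b live in the x-y plane; eps points along z, hence it is
# orthogonal to both (illustrative vectors, not from the paper)
a = (2.0, 1.0, 0.0)
b = (1.0, 2.0, 0.0)
eps = (0.0, 0.0, 0.2)
e = tuple(x + y for x, y in zip(a, eps))

# Cosine similarity is perturbed in both directions ...
print(cos(a, b), cos(e, b))
# ... while S_{2,0}(b, .) is exactly unchanged: eps is invisible to b
print(psa(b, a, 2, 0), psa(b, e, 2, 0))
assert psa(b, e, 2, 0) == psa(b, a, 2, 0)
```
        </preformat>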
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Co-clustering Algorithm</title>
      <p>Let Π = {π1, . . . , πm} and B = {β1, . . . , βn} be two input spaces of data sets, for which a
metric in the hybrid output space H = Π ∪ B is not defined. Our co-clustering
algorithm uses the PSA measure to aggregate the objects from the input spaces
in the output space H. Figure 2 shows all the procedural blocks involved in
the framework. The data from the ancillary space (B) are used as a dynamic
dictionary with which the data from the primary space (Π) are represented.
Each aggregation γi∗ ∈ D∗(α) consists of two components:
– A subset of items of Π which share common attributes in B, as given by
a PSA measurement Sm,n(πa, πb), πa, πb ∈ Π, which operates on the
cross-relevance between items of Π and items of B condensed by the
transformation T into the space A. It is possible to demonstrate that the PSA is
a pseudo-quasi-metric and thus can be used to compare the heterogeneous
data objects in A;
– The attributes of the dictionary B which are relevant to the aggregated
elements of γi, either to any or to all of them depending on the chosen criterion
for multimodal aggregation (i.e. union or intersection, respectively).
The following subsections describe all the constituent parts of the framework.</p>
      <sec id="sec-3-1">
        <title>Relevance Cross-Projection</title>
        <p>Let T : Π × B → [0, 1] be a linking function such that T(πi, βj) tends to 1 if
βj ∈ B is somewhat relevant to πi ∈ Π, and T(πi, βj) tends to 0 if βj ∈ B is
not relevant to πi ∈ Π. As an example, T(·) could be the score function that
projects the data from the primary space Π onto the data of the ancillary space B.
The relations between items from the two spaces can thus be expressed by the
cross-projection matrix T, whose elements are tij = T(πi, βj).
From the matrix T we evaluate the affinity between all the couples of primary
space objects. Formally, let πa, πb be two objects from the primary space,
represented by the cross-projection vectors ta = (T(πa, β1), . . . , T(πa, βn)) and
tb = (T(πb, β1), . . . , T(πb, βn)), ta, tb ∈ T. Then the PSA of order m, n from ta to
tb is evaluated according to Eq. (1), and its value is stored in the affinity matrix
E = (eab), where</p>
        <p>eab = 1, if a = b;
eab = S(ta, tb), if a ≠ b and S(ta, tb) ≥ α;
eab = 0, otherwise.   (14)</p>
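        <p>The construction of the affinity matrix E of Eq. (14) can be sketched as follows; the toy cross-projection matrix is a hypothetical example, not data from the paper:</p>
        <preformat>
```python
import math

def psa(u, v, m, n):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu ** m * nv ** n)

def affinity_matrix(T, m, n, alpha):
    """Build E = (e_ab) from the cross-projection matrix T, Eq. (14):
    each row of T is the vector t_a of relevance scores over B."""
    M = len(T)
    E = [[0.0] * M for _ in range(M)]
    for a in range(M):
        for b in range(M):
            if a == b:
                E[a][b] = 1.0
            else:
                s = psa(T[a], T[b], m, n)
                E[a][b] = s if s >= alpha else 0.0
    return E

# Toy cross-projection matrix: 3 primary objects x 4 ancillary attributes
T = [[0.9, 0.8, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0],
     [0.0, 0.1, 0.9, 0.8]]
E = affinity_matrix(T, m=2, n=0, alpha=0.5)
print(E)
```
        </preformat>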
      </sec>
      <sec id="sec-3-2">
        <title>Graph Visit and Multimodal Aggregation</title>
        <p>The affinity matrix E can be interpreted as the adjacency matrix of a directed
disconnected graph G = (V, E) whose vertices are the primary space objects
and whose edges are the affinities among them. The co-clustering problem can
thus be reduced to a graph visiting problem. For that, we first symmetrise E by
considering each couple of nodes as either connected or not; then we use the depth
first search (DFS) algorithm, obtaining the set D(α) = {g1, g2, . . . , gK} of
disconnected sub-graphs included in G, so that ⋃i gi = G and ∀i ≠ j, gi ∩ gj = ∅.
Since the DFS algorithm is valid for undirected graphs, we first generate an
undirected representation of the original oriented graph using one of the following
criteria:
– Maximum affinity: vertices a, b are linked if max(S(ta, tb), S(tb, ta)) &gt; α
– Average affinity: vertices a, b are linked if (1/2)(S(ta, tb) + S(tb, ta)) &gt; α
– Minimum affinity: vertices a, b are linked if min(S(ta, tb), S(tb, ta)) &gt; α
Given D(α) we can finally build the set of multimodal aggregations D(α)∗ =
{γ1∗, . . . , γ∗|D(α)|} ⊆ 2^H by collecting the elements of B which are relevant to
each γi ∈ D(α). Letting K = |D(α)|, this can be done according to one of the
following criteria.</p>
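        <p>The symmetrisation criteria and the DFS-based extraction of D(α) can be sketched as follows; the function names and the toy matrix are illustrative assumptions, not the paper's implementation:</p>
        <preformat>
```python
# Symmetrise the directed affinity matrix E with one of the three
# criteria, then extract connected components with a depth-first search.
def symmetrise(E, alpha, criterion="max"):
    M = len(E)
    combine = {"max": max, "min": min,
               "avg": lambda x, y: 0.5 * (x + y)}[criterion]
    return [[1 if a != b and combine(E[a][b], E[b][a]) > alpha else 0
             for b in range(M)] for a in range(M)]

def components(adj):
    """DFS over the undirected graph; returns the sub-graphs D(alpha)."""
    M, seen, out = len(adj), set(), []
    for start in range(M):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(w for w in range(M) if adj[v][w])
        out.append(sorted(comp))
    return out

E = [[1.0, 0.9, 0.0],
     [0.2, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
print(components(symmetrise(E, alpha=0.5, criterion="max")))  # [[0, 1], [2]]
```
        </preformat>
<p>Note how the criterion changes the result: under minimum affinity the weak back-edge 0.2 breaks the link between vertices 0 and 1, yielding three singleton components.</p>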
        <p>Union criterion</p>
        <p>∀i : γi∗ = γi ∪ Bi, i = 1, . . . , K   (15)
Bi = ⋃ βij, j = 1, . . . , |γi|   (16)
βij = {b ∈ B : R(πij, b) &gt; η},   (17)
where η is a parametric threshold. Following this criterion, the function of D(α)∗
is that of integrating the partition D(α) with all the relevant elements of B.
Notice that D(α)∗ is not in general a partition of H = Π ∪ B, because elements
of B may be relevant to elements of Π belonging to different elements of D(α),
and because some elements of B may not be relevant to any element of Π.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Intersection criterion</title>
        <p>∀i : γi∗ = γi ∪ Bi, i = 1, . . . , K   (18)
Bi = ⋃ βij, j = 1, . . . , |γi|   (19)
βij = {b ∈ B : ∀k : R(πik, b) &gt; η}.   (20)
Following this criterion, the function of D(α)∗ is that of integrating the partition
D(α) with the elements of B that are relevant to every element of γi. Notice that
in this case too, D(α)∗ is not a partition of H = Π ∪ B.</p>
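        <p>The union and intersection criteria can be sketched together; here the relevance function R is assumed, for illustration only, to coincide with the toy cross-projection scores:</p>
        <preformat>
```python
def aggregate(groups, T, eta, criterion="union"):
    """groups: clusters of primary-object indices, i.e. D(alpha);
    T[i][j]: relevance of ancillary attribute j to primary object i."""
    n_attrs = len(T[0])
    result = []
    for g in groups:
        if criterion == "union":
            # attributes relevant to at least one element of the cluster
            attrs = {j for j in range(n_attrs)
                     if any(T[i][j] > eta for i in g)}
        else:
            # attributes relevant to every element of the cluster
            attrs = {j for j in range(n_attrs)
                     if all(T[i][j] > eta for i in g)}
        result.append((g, sorted(attrs)))
    return result

T = [[0.9, 0.8, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0],
     [0.0, 0.1, 0.9, 0.8]]
D = [[0, 1], [2]]
print(aggregate(D, T, eta=0.55, criterion="union"))
print(aggregate(D, T, eta=0.55, criterion="intersection"))
```
        </preformat>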
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>This section provides empirical evidence to demonstrate the benefits of our
co-clustering framework and algorithm. In particular, we show that the co-clustering
approach based on the PSA measurement outperforms clustering methods based on
symmetric similarity measures. For this purpose, we applied the algorithm to
the real case of aggregating news content from the Web and TV broadcasts, i.e.
the primary space Π and the ancillary space B defined in §3, respectively. For
the evaluations we set up a pool of 25 users, taken from the employees of our
organisation, who were unaware of the rationale of the framework.</p>
      <p>We randomly collected a set of 2,155 news articles taken from 16 different
newspaper and press agency websites. The articles were randomly chosen in
order to cover a broad range of subjects such as politics, sports, entertainment,
current events and foreign affairs. Two different feature extraction and clustering
strategies were applied to these data.</p>
      <p>The first strategy adopted the common term frequency-inverse document
frequency (tf/idf) weighting-based vector model and Cosine similarity. Each Web
article was represented by the tf/idf values of the terms occurring in the
document. Clustering was performed using Eq. (8) as the similarity measure, with
n = 1 and Δ = 0. The minimum affinity criterion for different values of α was
used for determining whether two graph vertices were connected or not.</p>
      <p>In the second strategy we first performed part-of-speech tagging, selecting
for each article the set of words tagged as nouns, proper nouns or adjectives.
These word sets were then matched with the speech content of about 24,000
TV news stories, resulting from the automatic segmentation of 3,670 newscast
programmes. Each Web article was thus represented by a cross-projection
vector containing the relevance scores, w.r.t. the set of words extracted from the
article, of the text documents obtained by the automatic transcription of each
TV story's speech content. This procedure practically implements the relevance
cross-projection described in §3.1. Further detail on the overall process can be
found in [6]. For each couple of Web articles a, b the pseudo-semantic affinities
Sm,n(a, b) and Sm,n(b, a) were calculated across several combinations of m, n.
Clustering was performed applying the maximum affinity criterion for different
values of α and using the union criterion with η = 0.55.</p>
      <p>We evaluated the effectiveness of our framework by measuring the clustering
cohesion (Ω), the clustering coverage (Ψ) and the clustering efficiency (Λ =
2ΩΨ / (Ω + Ψ)). The clustering cohesion can be related to the precision of the
algorithm and is calculated as the ratio between the number of objects in a cluster
that are relevant to the concept expressed by the cluster and the total number of
objects included in the same cluster, averaged over the set of detected clusters.
The concept expressed by the cluster is the concept that the user assumed as such
during the assessment of a specific cluster. This implicitly means that performance
should not empirically be lower than 0.5, since it is expected that, in a situation in
which the two halves of a cluster are equally distributed between two concepts, the
user would choose one of the two as the reference concept and discard the other.
The clustering coverage can be related to the recall of the algorithm and is
calculated as the ratio between the number of objects included in clusters with
cardinality greater than 1 (Avg. Card. × Tot. Clusters) and the total number of
objects to be clustered. The clustering efficiency can be related to the F-measure,
accounting for the balance between the cohesion of the detected clusters and
their expected cardinality.</p>
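      <p>The three evaluation measures can be sketched as follows; the cluster assessments in the example are hypothetical numbers, not the paper's results:</p>
      <preformat>
```python
# Cohesion (precision-like), coverage (recall-like) and efficiency
# (their harmonic mean), as described above.
def cohesion(clusters_relevant, clusters_size):
    # ratio of relevant objects per cluster, averaged over clusters
    ratios = [r / s for r, s in zip(clusters_relevant, clusters_size)]
    return sum(ratios) / len(ratios)

def coverage(clusters_size, total_objects):
    # objects in clusters with cardinality > 1, over all objects
    clustered = sum(s for s in clusters_size if s > 1)
    return clustered / total_objects

def efficiency(omega, psi):
    return 2 * omega * psi / (omega + psi)

sizes = [10, 8, 5, 1]       # detected cluster cardinalities (hypothetical)
relevant = [9, 7, 4, 1]     # objects judged relevant per cluster
omega = cohesion(relevant, sizes)
psi = coverage(sizes, total_objects=40)
print(omega, psi, efficiency(omega, psi))
```
      </preformat>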
      <p>Table 1 shows the performance results obtained for the tf/idf-based clustering
strategy. Table 2 shows the performance results obtained for the relevance
cross-projection-based co-clustering strategy using different symmetric combinations
of the PSA measure. Table 3 compares these two clustering approaches in terms
of clustering efficiency. Observe that the cross-projection-based strategy always
outperforms the tf/idf-based strategy. Table 4 shows the performance results
obtained for the relevance cross-projection-based co-clustering strategy using
generic combinations of the asymmetric PSA measure. Results indicate that the
best performers are not always the same. For example, for α = 0.5 the best cohesion
is achieved for m = 7 and n = 3, while for α = 0.95 it is m = 5 and n = 0.
However, it can be observed that under some combinations of m and n, the
algorithm always achieves good results independently of the chosen value of
α. In the absence of the concrete possibility of full ground-truth testing of all
possible combinations, these or similar global observations may help in finding
the proper choice of PSA parameters for a given information retrieval system.
For example, Table 4 shows that the combination (m, n, α) = (2, 0, 0.85) provides
the better balance between clustering cohesion and coverage for the studied case.
Though these findings are not enough to identify clear conditions and criteria
under which to select the proper parameter combination for an information
retrieval system, we think that at least they clearly show that the generalisation
towards the PSA domain is neither useless nor trivial to resolve, and as such needs
further thorough investigation.</p>
      <p>In order to provide interactive visualisation of the generated multimodal
aggregations for news retrieval, browsing and editing, we developed a browsing
interface that enables users to track down the structural properties of the selected
aggregation and, at the same time, navigate and manage its contents. This
functionality is inspired by hyperlink networks, e.g. social and citation networks and
Internet communications, and by graph theory. The structure of each multimodal
aggregation is represented as an oriented graph where the included Web articles are
the nodes and the relationships among them are the edges, reflecting the
asymmetric structure obtained by applying the PSA-based co-clustering algorithm.
The graphs are browsable through a Java Web interface. Users can select the
visualisation layout to apply for graph display (the default is the Fruchterman-Reingold
layout, FRL). The nodes are labelled according to the titles of the Web articles,
and are sized according to their representativeness (i.e., in-degree) w.r.t. the main
topic of the aggregation, so that more representative nodes are drawn with larger
shapes and bolder colours. Moving over a node shows the information related to
the node, in terms of the node's title, in-degree and out-degree. Similarly, moving
over an edge shows the edge weight. Starting from one node, the user can select,
traverse and manipulate all its context/detail nodes. Context nodes of a node
v are those nodes z for which S(v, z) ≥ α, while detail nodes are those nodes
w for which S(w, v) ≥ α. This functionality allows users to find semantically
meaningful paths between the news items of the aggregation.</p>
      <p>The use of the FRL graph layout algorithm allows for intuitive visual-aided
data navigation and manipulation. The assumption is that the graph's topology
can be used to describe the underlying concepts of the aggregation. Cohesive
aggregations are modelled by dense graphs; less cohesive aggregations are
modelled by sparse graphs. Figures 3 and 4 show two examples. The former
illustrates how the system is able to visually distinguish two subtopics in the
context of a larger topic. The overall topic is the summer traffic jams. The two
subtopics (one of which is highlighted in yellow) are aggregations reporting on the
same main topic (traffic jams) over two consecutive weeks, marked respectively
as "red" and "black" by the Italian Transport Ministry. The latter expresses the
same concept in the case of the January 2009 bad weather period. However, in
this case the highlighted part refers to the disasters that happened in France and
Spain (foreign countries), while the remaining part refers to Italian casualties.</p>
      <p>Fig. 4. Visual-aided graph-based news navigation - example 2.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>This paper presented a family of pseudo-semantic affinity (PSA) functions for
modelling the relationships between heterogeneous data objects. A co-clustering
framework based on these functions was also presented. The motivation behind
the use of this framework is its capability of representing hierarchical
relationships among data of different nature, which are not directly derivable by using
one-way clustering techniques (e.g. tf/idf vector space model and Cosine
similarity). The experiments demonstrate that the PSA measure outperforms Cosine
similarity in terms of cluster cohesion and coverage, thus providing better
precision and recall. It was also demonstrated that the PSA is only partially affected
by noisy data. Future work will explore how to adaptively select the best
combination of PSA parameters for the optimisation of information retrieval processes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          <article-title>Co-clustering documents and words using bipartite spectral graph partitioning</article-title>
          .
          <source>In KDD '01: Proc. of the 7th ACM SIGKDD Intl. Conf. on Knowledge discovery and data mining</source>
          , pages
          <fpage>269</fpage>
          -
          <lpage>274</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>C.</given-names>
            <surname>Domeniconi</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunopulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yan</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Razgan</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          .
          <article-title>Locally Adaptive Metrics for Clustering High Dimensional Data</article-title>
          .
          <source>Data Mining and Knowledge Discovery Journal</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>63</fpage>
          -
          <lpage>97</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Domeniconi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <article-title>Composite Kernels for Semi-supervised Clustering</article-title>
          .
          <source>Knowledge and Information Systems</source>
          ,
          <volume>1</volume>
          -
          <fpage>18</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>V.</given-names>
            <surname>Kashyap</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          .
          <article-title>Semantic and schematic similarities between database objects: a context-based approach</article-title>
          .
          <source>The VLDB Journal</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <fpage>276</fpage>
          -
          <lpage>304</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>B.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Co-clustering by block value decomposition</article-title>
          .
          <source>In KDD '05: Proc. of the 11th ACM SIGKDD Intl. Conf. on Knowledge discovery and data mining</source>
          , pages
          <fpage>635</fpage>
          -
          <lpage>640</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Messina</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Montagnuolo</surname>
          </string-name>
          .
          <article-title>A generalised cross-modal clustering method applied to multimedia news semantic indexing and retrieval</article-title>
          .
          <source>In WWW '09: Proc. of the 18th Intl. Conf. on World wide web</source>
          , pages
          <fpage>321</fpage>
          -
          <lpage>330</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M.</given-names>
            <surname>Rege</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Fotouhi</surname>
          </string-name>
          .
          <article-title>Co-clustering image features and semantic concepts</article-title>
          .
          <source>In Proc. of the Intl. Conf. on Image Processing</source>
          , pages
          <fpage>137</fpage>
          -
          <lpage>140</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>