<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Social Network Aggregation Using Face-Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Patrick</forename><surname>Minder</surname></persName>
							<email>minder@ifi.uzh.ch</email>
							<affiliation key="aff0">
								<orgName type="department">Dynamic and Distributed Information Systems Group</orgName>
								<orgName type="institution">University of Zurich</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Abraham</forename><surname>Bernstein</surname></persName>
							<email>bernstein@ifi.uzh.ch</email>
							<affiliation key="aff0">
								<orgName type="department">Dynamic and Distributed Information Systems Group</orgName>
								<orgName type="institution">University of Zurich</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Social Network Aggregation Using Face-Recognition</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">FA36AFFB9EC7CF30F2F384D8D269F038</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T04:24+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the rapid growth of the social web, an increasing number of people have started to replicate their off-line preferences and lives in an on-line environment. Consequently, the social web provides an enormous source of social network data, which can be used in both commercial and research applications. However, people often take part in multiple social network sites and, generally, share only a selected amount of data with the audience of a specific platform. Consequently, the interlinkage of social graphs from different sources is becoming increasingly important for applications such as social network analysis, personalization, or recommender systems. This paper proposes a novel method to enhance available user re-identification systems for social network data aggregation based on face-recognition algorithms. Furthermore, the method is combined with traditional text-based approaches in an attempt to counterbalance the weaknesses of both methods. Using two samples of real-world social networks (with 1610 and 1690 identities, respectively), we show that even though a pure face-recognition based method is outperformed by the traditional text-based method (area under the ROC curve 0.938 vs. 0.986), the combined method significantly outperforms both of these (0.998, p = 0.0001), suggesting that the face-based method indeed carries information complementary to raw text attributes.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>With the rapid growth of the social web, an increasing number of people have started to replicate their off-line preferences and lives in an on-line environment. Indeed, the usage of social network sites (SNSs) such as Facebook, Google+, or LinkedIn, messaging services (e.g., Twitter), tagging systems (e.g., del.icio.us), and sharing and recommendation services (e.g., Last.fm) has not only increased immensely, but the activities on these sites have become an integral element in the daily lives of millions of people. Hence, the social web provides an enormous source for social network data collection.</p><p>Often people take part in several of these SNSs. In some cases this multi-participation arises from necessity, as some features may only be provided by some sites and not by others. However, in most cases, it is also the result of free choice. The many services allow people to "partition" their lives (e.g., they may use Facebook for the private and LinkedIn for the professional network). In fact, the construction of site-specific identities makes it possible to adopt multiple personalities, as identifying features such as the email address can be changed easily, an effect that has been called "multiplicity" by Internet researchers <ref type="bibr" target="#b20">[21]</ref>. Hence, users will continue to maintain multiple identities even if one service caters to all their needs. At the same time, the identification of users for interlinking data from different and distributed systems is becoming increasingly important for different kinds of applications. 
In personalization, the use of cross-site profiles is essential, as the incorporation of multi-source user profile data significantly increases the quality of preference recommendations <ref type="bibr" target="#b3">[4]</ref>; in social network analysis, the merging of multiple networks provides a more complete picture of the overall social graph and helps to minimize the data selection bias from which most single-site studies suffer <ref type="bibr" target="#b0">[1]</ref>; and trust networks can be created by aggregating relationships among network participants <ref type="bibr" target="#b16">[17]</ref>. Even if the semantic web were to become immensely popular, the increased usage of a global identifier may not simplify the universal identification of a person, as some sites may not use the same identifiers or may even ignore the identification scheme entirely, and users may choose to maintain multiple identifiers to ensure their multiplicity. In fact, Mika et al. <ref type="bibr" target="#b15">[16]</ref> argue that the key problem in the extraction of social network data, the disambiguation of identities and relationships, still remains, as different social web applications refer to relationship types, attributes, or tastes in profiles in different ways and do not share any common key for the identification of users. 
As a consequence, both researchers and practitioners (such as marketers) face a complicated research question: how can we combine the multitude of information available about a person in multiple SNSs to develop a holistic, combined (and as complete as possible) user model when the identities of the user on different sites are difficult to link?</p><p>Current proposals for interlinking social network profiles, based on comparing text-based attributes of user profiles <ref type="bibr" target="#b3">[4]</ref> or on using the network structure <ref type="bibr" target="#b12">[13]</ref>, have the drawback that they scale poorly, or that they require some overlap in the relationship structure and incur a large computational expenditure, respectively. In this paper we propose to enhance current text-based methods, in the absence of semantic metadata, by combining them with face-recognition algorithms. Specifically, we propose to use face-recognition software to compare the images uploaded by users on different SNSs as an additional feature for identity merging. As we show, this statistical entity resolution procedure significantly enhances the merging precision of two SNSs. Consequently, the contributions of this paper are: (1) the presentation of an enhanced identity merging framework that incorporates images; (2) the presentation of an algorithm that merges identities based on face-recognition software; and (3) the combination of the traditional text-based and the introduced image-based merge approach to counterbalance the respective weaknesses of each approach.</p><p>To this end, we first ground our idea by giving an overview of related work and introducing the fundamental concepts of entity resolution (i.e., re-identification) and face-recognition. Then we present our novel re-identification technique and discuss our prototype. 
Finally, we evaluate our procedure empirically on three real-world datasets and close with a discussion of the limitations, future work, and some general conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Winkler <ref type="bibr" target="#b25">[26]</ref> showed that with a minimal set of attributes a large portion of the US population can be re-identified based on US Census data. Furthermore, Gross et al. <ref type="bibr" target="#b9">[10]</ref> showed that about 80% of social network site users provide enough public data for a direct re-identification and that at least 61% of the published profile images on Facebook.com allow a direct identification by a human.</p><p>Carmagnola et al. <ref type="bibr" target="#b3">[4]</ref> and Bekkermann et al. <ref type="bibr" target="#b1">[2]</ref> provide a cross-system identity discovery system, which is based on text-based identification probability calculations, whereby publicly available textual attributes of social network sites are analyzed by their positive or negative influence on identification. Further, <ref type="bibr" target="#b2">[3]</ref> suggest the use of key phrase extraction for the name disambiguation process, which is also used in POLYPHONET <ref type="bibr" target="#b13">[14]</ref> for interlinking web pages. <ref type="bibr" target="#b12">[13]</ref> and <ref type="bibr" target="#b21">[22]</ref> provide re-identification algorithms based on network similarity. These systems provide high accuracy, but fall short in terms of computational complexity and time expenditure.</p><p>A lot of research concerns shared approaches <ref type="bibr" target="#b11">[12]</ref>, especially the application of common semantic languages, such as the FOAF ontology<ref type="foot" target="#foot_0">1</ref>, the SIOC (Semantically-Interlinked Online Communities) ontology<ref type="foot" target="#foot_1">2</ref> for online communities, or the SCOT (Social Semantic Cloud Of Tags) ontology<ref type="foot" target="#foot_2">3</ref> for tagging systems. Such systems are desirable, but not widespread in practice. 
The best-known system based on such data is FLINK <ref type="bibr" target="#b14">[15]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Theoretical Foundations</head><p>In this section, we present the theoretical foundations for our approach. First, we present a formal model for entity resolution and then succinctly explain the basics of face-recognition. Both foundations are used in our framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Entity Resolution and the Fellegi-Sunter Model</head><p>Entity resolution can be defined as the methodology of merging corresponding records from two or more sources <ref type="bibr" target="#b25">[26]</ref>. Consider, for example, a profile about "Peter J. Miller" and another one about "Peter Jonathan Miller" on two different SNSs. Entity resolution tries to decide whether these two profiles belong to the same person. To do so, it assumes that an individual shares similar features in different environments, which can be used to identify an entity even though no common key is defined. The resolution process is generally complicated by the fact that different entities may share similar attribute values.</p><p>Most current re-identification approaches are variants of the Fellegi-Sunter model, a distance- and rule-based technique. The Fellegi-Sunter model determines a match between two entities by computing the similarity of their attribute (or feature) vectors <ref type="bibr" target="#b8">[9]</ref>. 
Specifically, given entities a ∈ A and b ∈ B, where A and B are the sets of entities in SNS A and SNS B, respectively, it tries to assign each pair (a, b) of the space A × B to a set M or U, whereby:</p><formula xml:id="formula_0">M := \{(a, b) : a \in A \wedge b \in B \wedge a = b\} \text{ (the set of true matches)}, \qquad U := \{(a, b) : a \in A \wedge b \in B \wedge a \neq b\} \text{ (the set of non-matches)}</formula><p>It does so using a comparison function γ that computes the similarity measures for each of the n comparable attributes of the entities and arranges these in a vector:</p><formula xml:id="formula_1">\gamma(a, b) = \{\gamma_1(a, b), \ldots, \gamma_n(a, b)\}</formula><p>Based on the comparison vector γ(a, b), a decision rule L now assigns each pair (a, b) to either the set M or the set U as follows:</p><formula xml:id="formula_2">(a, b) \in \begin{cases} M &amp; \text{if } p(M \mid \gamma) \geq p(U \mid \gamma) \\ U &amp; \text{otherwise} \end{cases}</formula><p>whereby p(M | γ) is the probability that the comparison vector γ belongs to the match class and p(U | γ) the probability that γ belongs to U. In other words, the Fellegi-Sunter model treats all pairs of possible matches as independent. Recently, several authors have argued that this independence assumption offers an opportunity for enhancements. Singla et al. <ref type="bibr" target="#b17">[18]</ref>, for example, propose such an enhancement based on Markov logic.</p></div>
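<div xmlns="http://www.tei-c.org/ns/1.0"><p>The decision rule of the Fellegi-Sunter model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the implementation used in this paper: the comparator exact_match, the placeholder posterior p_match (a plain mean of the similarities), and the three example profiles are hypothetical.</p>

```python
from typing import Callable, Dict, List, Sequence, Tuple

Profile = Dict[str, str]
Comparator = Callable[[str, str], float]


def exact_match(x: str, y: str) -> float:
    """Toy comparator (assumption): 1.0 for identical non-empty values, else 0.0."""
    return 1.0 if x and x.lower() == y.lower() else 0.0


def comparison_vector(a: Profile, b: Profile,
                      comparators: Sequence[Tuple[str, Comparator]]) -> List[float]:
    """Build gamma(a, b): one similarity score per comparable attribute."""
    return [compare(a.get(name, ""), b.get(name, "")) for name, compare in comparators]


def p_match(gamma: Sequence[float]) -> float:
    """Placeholder for p(M | gamma): the mean similarity (an assumption)."""
    return sum(gamma) / len(gamma)


def classify(gamma: Sequence[float]) -> str:
    """Decision rule: M if p(M | gamma) >= p(U | gamma), otherwise U."""
    p_m = p_match(gamma)
    p_u = 1.0 - p_m  # two classes only, so the posteriors sum to one
    return "M" if p_m >= p_u else "U"


# Hypothetical attributes and profiles, for illustration only.
COMPARATORS = [("name", exact_match), ("email", exact_match), ("city", exact_match)]
a = {"name": "Peter J. Miller", "email": "pjm@example.org", "city": "Zurich"}
b = {"name": "Peter Jonathan Miller", "email": "pjm@example.org", "city": "Zurich"}
c = {"name": "Petra Miller", "email": "petra@example.org", "city": "Basel"}

for other in (b, c):
    gamma = comparison_vector(a, other, COMPARATORS)
    print(gamma, classify(gamma))
```

<p>Under these assumptions the pair (a, b) is assigned to M because two of its three attributes agree, while (a, c) is assigned to U. In practice, p(M | γ) would be estimated from training data rather than taken as a plain mean.</p></div>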
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Face-Recognition and the Eigenface Algorithm</head><p>The face provides an enormous set of characteristics that the human perception system uses to identify other individuals. The problem of face-recognition can be formulated as follows: "Given still or video images of a scene, identify or verify one or more person in the scene using a stored database of faces. Available collateral information [...] may be used in narrowing the search (enhancing recognition)" <ref type="bibr">[25, p. 4]</ref>. Accordingly, face-recognition includes <ref type="bibr" target="#b24">[25]</ref>: (1) the detection and location of an unknown number of faces in an image <ref type="bibr" target="#b10">[11]</ref>; (2) the extraction of key facial features; and (3) the identification [25, p. 12], which includes a comparison and matching of invariant biometric face signatures <ref type="bibr">[25, p. 14-16]</ref>. The identification can be done using holistic matching, feature-based matching, or hybrid matching methods, which take the whole face, local features (e.g., the location or geometry of the nose), or both as an input vector for classification, respectively <ref type="bibr">[25, p. 14]</ref>.</p><p>Our re-identification framework uses the holistic face-recognition algorithm Eigenface <ref type="bibr" target="#b19">[20]</ref>, which is based on Principal Component Analysis (PCA) and covers all relevant local and global features <ref type="bibr" target="#b19">[20]</ref>. The Eigenface approach tries to encode all the relevant extracted information of a face image such that the encoding can be done efficiently, allowing for a comparison of the information with a database of encoded models <ref type="bibr">[25, p. 67]</ref>. The Eigenface algorithm can be split up into two parts:</p><p>(1) Representation of the Image Database in Principal Component Vectors. Based on PCA, the principal components of a face image are extracted by (1) acquiring an initial set of face images; (2) defining the face space by calculating the eigenvectors (Eigenfaces) from the set and, using PCA, eliminating all but the k best eigenvectors, i.e., those with the highest eigenvalues; and (3) representing each known individual by projecting their face image onto the face space.</p><p>Therefore, an image I(x, y) can be interpreted as a vector in an N-dimensional space, where N = rc, with r the number of rows and c the number of columns of the image <ref type="bibr" target="#b19">[20]</ref>. Every coordinate of the N-dimensional vector I(x, y) in the image space corresponds to a pixel of the image. This representation obscures any relationship between neighbouring pixels, which is unproblematic as long as all images are rearranged in the same manner. Thus, the average face of the initially acquired training set</p><formula xml:id="formula_3">\Gamma := \{\gamma_1, \gamma_2, \ldots, \gamma_M\} \text{ can be calculated by } \bar{\gamma} = \frac{1}{M} \sum_{n=1}^{M} \gamma_n</formula><p>and the distance between an image and the average image is measured by φ_i = γ_i − γ̄. The orthonormal vectors u_l that define the Eigenfaces are obtained from the eigenvectors e_l as:</p><formula xml:id="formula_4">u_l = \sum_{k=1}^{M} e_{lk} \phi_k \quad \forall l \in [1, M]</formula><p>whereby the eigenvectors e_l are calculated from the matrix L = AᵀA, where L_mn = φ_mᵀ φ_n and A = [φ_1, φ_2, ..., φ_M]. The derivation of the best eigenvectors from the covariance matrix is presented in <ref type="bibr" target="#b18">[19]</ref>. The k most significant eigenvectors of L span a k-dimensional face space, a subspace of the N-dimensional image space, in which every face is represented as a linear combination of the Eigenfaces <ref type="bibr" target="#b19">[20]</ref> [25, p. 67-72].</p><p>(2) The Identification Process. The identification or verification of an image proceeds by (1) subtracting the mean image from the new face image and projecting the result onto each of the eigenvectors (Eigenfaces); (2) determining whether the image is a face by calculating its distance to the face space and comparing it to a defined threshold; and (3) if it is a face, classifying the weight pattern as a known or unknown individual using a distance metric such as the Euclidean distance.</p><p>Thus, a new face image I(x, y) is projected into the face space by ω_k = u_kᵀ(γ − γ̄) for all k ∈ [1, ..., M]. The weight vector Ω = [ω_1, ..., ω_M] represents the influence of each eigenvector on the input image. Hence, given a threshold θ_ε, if the face class k that minimizes the Euclidean distance satisfies</p><formula xml:id="formula_5">\varepsilon_k = \lVert \Omega - \Omega_k \rVert \quad \text{and} \quad \theta_\varepsilon &gt; \varepsilon_k<label>(1)</label></formula><p>then the image belongs to the same individual. Otherwise, the face is classified as unknown. Furthermore, the distance between an image and the face space can be characterised by the squared distance between the mean-adjusted input image and its projection onto the face space:</p><formula xml:id="formula_6">\varepsilon^2 = \lVert \phi - \phi_f \rVert^2, \quad \text{where } \phi = \gamma_k - \bar{\gamma} \text{ and } \phi_f = \sum_{i=1}^{M} \omega_i u_i<label>(2)</label></formula><p>Therefore, a new face image I(x, y) is classified as a non-face image if</p><formula xml:id="formula_7">\varepsilon &gt; \theta_\varepsilon, \text{ as a known face image if } \varepsilon &lt; \theta_\varepsilon \wedge \varepsilon_k &lt; \theta_\varepsilon, \text{ and as an unknown face image if } \varepsilon &lt; \theta_\varepsilon \wedge \varepsilon_k &gt; \theta_\varepsilon.</formula></div>
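<div xmlns="http://www.tei-c.org/ns/1.0"><p>The identification process can be illustrated with a small pure-Python sketch. It assumes that the mean face and a set of orthonormal Eigenfaces have already been computed, and it covers only the projection and the three-way classification (non-face, known, unknown). The 4-pixel "images", the two hand-picked Eigenfaces, the labels, and the threshold value are hypothetical and serve only to make the example verifiable by hand.</p>

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def project(image: Vector, mean: Vector, eigenfaces: Sequence[Vector]) -> List[float]:
    """Weights omega_k = u_k^T (image - mean), one per Eigenface u_k."""
    phi = [x - m for x, m in zip(image, mean)]
    return [dot(u, phi) for u in eigenfaces]


def distance_to_face_space(image: Vector, mean: Vector,
                           eigenfaces: Sequence[Vector]) -> float:
    """epsilon = ||phi - phi_f||, where phi_f = sum_i omega_i * u_i."""
    phi = [x - m for x, m in zip(image, mean)]
    weights = project(image, mean, eigenfaces)
    phi_f = [sum(w * u[i] for w, u in zip(weights, eigenfaces))
             for i in range(len(phi))]
    return math.dist(phi, phi_f)


def identify(image: Vector, mean: Vector, eigenfaces: Sequence[Vector],
             known: Dict[str, Vector], theta: float) -> str:
    """Return 'non-face', 'unknown', or the label of the closest known face class."""
    if distance_to_face_space(image, mean, eigenfaces) > theta:
        return "non-face"
    omega = project(image, mean, eigenfaces)
    label, eps_k = min(((name, math.dist(omega, w)) for name, w in known.items()),
                       key=lambda item: item[1])
    return label if theta > eps_k else "unknown"


# Hypothetical toy data: 4-pixel images and two orthonormal Eigenfaces.
MEAN = [0.5, 0.5, 0.5, 0.5]
EIGENFACES = [[0.5, 0.5, -0.5, -0.5], [0.5, -0.5, 0.5, -0.5]]
KNOWN = {
    "alice": project([1.0, 1.0, 0.0, 0.0], MEAN, EIGENFACES),
    "bob": project([1.0, 0.0, 1.0, 0.0], MEAN, EIGENFACES),
}
THETA = 0.5

print(identify([0.9, 1.0, 0.1, 0.0], MEAN, EIGENFACES, KNOWN, THETA))  # close to alice
print(identify([1.0, 0.0, 0.0, 1.0], MEAN, EIGENFACES, KNOWN, THETA))  # far from face space
print(identify([0.0, 0.0, 1.0, 1.0], MEAN, EIGENFACES, KNOWN, THETA))  # in face space, no class
```

<p>The sketch uses a single threshold for both tests, as in the classification rule above. A real system would obtain the Eigenfaces from a PCA of the training images instead of fixing them by hand.</p></div>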
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Re-Identification Framework</head><p>Our theoretical re-identification framework for user disambiguation in a social network aggregation and cross-system personalization process is based on the Fellegi-Sunter model. The presented algorithms calculate the probability that two user profiles belong to the same entity and can incorporate images as an additional feature based on the Eigenface method. Therefore, the framework provides three kinds of methods: a pure face-recognition based, a text-attribute based, and a joined re-identification method.</p><p>The methods follow a simple re-identification algorithm. Assume two sets A = {a_1, a_2, ..., a_m} and B = {b_1, b_2, ..., b_n} of user profiles from two different SNSs. Each profile is characterized by a set of text attributes and a single profile image. We can now define E = {e_1, e_2, ..., e_z} as the set of different individuals who have a profile in one or both social networks. Consequently, the re-identification algorithm is based on the following three subtasks:</p><p>1. Attribute Comparison: The attributes of two social network profiles are compared pairwise. The result is a comparison vector γ(a_i, b_j) = {d_1, d_2, ..., d_n}, where n is the number of compared attributes and d_k ∈ [0, 1] indicates the distance between the values of the k-th attribute of the profiles a_i and b_j. Therefore, a distance d_k of 0 indicates that the two attribute instances are completely equal, and a value of 1 indicates the opposite.</p><p>2. Matching Probability Calculation: Then, based on the comparison vector γ(a_i, b_j), the probability ρ(a_i, b_j) that a pair (a_i, b_j) belongs to the same entity is calculated.</p><p>3. Merging Task: Finally, if the probability ρ(a_i, b_j) is greater than or equal to a threshold value θ ∈ [0, 1] (i.e., ρ(a_i, b_j) ≥ θ), then the profiles a_i and b_j are assumed to belong to the same person.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Attribute Comparison and Matching Probability Calculation</head><p>The following three generic methods allow the comparison of n different attributes and the calculation of a matching probability. The methods cover the first two subtasks of the re-identification algorithm introduced above.</p><p>(1) Pure Face-Recognition Based Method. This method re-identifies user profiles solely by applying the face-recognition algorithm Eigenface to profile images. Hence, ∀a_i ∈ A ∧ b_j ∈ B, the probability ρ(a_i, b_j) that two profiles a_i and b_j belong to the same entity e_l ∈ E is defined as:</p><formula xml:id="formula_8">\rho(a_i, b_j) = \varepsilon_{ij}(a_i, b_j) = \lVert \Omega_{a_i} - \Omega_{b_j} \rVert \in [0, 1]</formula><p>Here, it is assumed that the profile images are projected into the face space by ω_{a_i} = u_kᵀ(a_i − γ̄) and ω_{b_j} = u_kᵀ(b_j − γ̄). Additionally, the set B is used as the training set for the initialization task, thus Γ = B.</p><p>(2) Text-Attribute Based Method. This algorithm re-identifies user profiles by comparing text attributes. The attributes are compared with the token-based QGRAM algorithm <ref type="bibr" target="#b6">[7]</ref>. Note that spelling errors affect the similarity only minimally when using QGRAM, as it uses q-grams instead of words as tokens. For the k-th attribute the algorithm computes a normalized distance d(a_ik, b_jk) ∈ [0, 1], where the distance is zero if the values of the k-th attribute of a_i and b_j are syntactically equivalent. As we discuss in Section 6, we considered name, email address, birthday, and city as a minimal set of text attributes in the experiments, as they were shown to be strong indicators for identification <ref type="bibr" target="#b4">[5]</ref> <ref type="bibr" target="#b25">[26]</ref> <ref type="bibr" target="#b9">[10]</ref> and other attributes such as address or phone number are often not accessible. 
As a result, the matching probability is calculated by a logistic function <ref type="bibr" target="#b7">[8]</ref>:</p><formula xml:id="formula_9">\rho(a_i, b_j) = \frac{\exp(Y_T(a_i, b_j))}{1 + \exp(Y_T(a_i, b_j))} \in [0, 1], \quad \text{where } Y_T(a_i, b_j) = \alpha_0 + \sum_{k=1}^{n} \alpha_k \, d(a_{ik}, b_{jk})</formula><p>The intercept α_0 and the regression coefficients {α_1, ..., α_n} of the linear model Y_T(a_i, b_j) are learned by a logistic regression on a specific training set.</p><p>(3) Joined Method. Finally, the two previously described methods are joined into a method that uses both face-image-based and text-attribute-based identification. Thus, for all pairs of profiles a_i ∈ A ∧ b_j ∈ B, it is assumed that the matching probability is equal to:</p><formula xml:id="formula_10">\rho(a_i, b_j) = \frac{\exp(Y_J(a_i, b_j))}{1 + \exp(Y_J(a_i, b_j))} \in [0, 1], \quad \text{where } Y_J(a_i, b_j) = \alpha_0 + \sum_{k=1}^{n} \alpha_k \, d(a_{ik}, b_{jk}) + \beta \, \varepsilon_{ij}(a_i, b_j)</formula><p>Again, the intercept α_0 and the regression coefficients {α_1, ..., α_n, β} of the linear model Y_J(a_i, b_j) are learned by a logistic regression on a specific training set.</p></div>
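<div xmlns="http://www.tei-c.org/ns/1.0"><p>The text-based and joined matching probabilities can be sketched as follows. The q-gram distance below is a simplified stand-in (padded trigrams compared by block distance) and is not claimed to reproduce the SimMetrics QGRAM implementation. The coefficients are made up for illustration and are not the values learned in our experiments; they are chosen negative so that a larger distance lowers the matching probability. The profiles and the face distances are likewise hypothetical.</p>

```python
import math
from collections import Counter
from typing import Dict


def qgrams(text: str, q: int = 3) -> Counter:
    """Multiset of padded, lower-cased q-grams of the text."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_distance(x: str, y: str, q: int = 3) -> float:
    """Normalized distance in [0, 1]; 0 means identical q-gram profiles."""
    gx, gy = qgrams(x, q), qgrams(y, q)
    differing = sum(abs(gx[g] - gy[g]) for g in set(gx) | set(gy))
    return differing / (sum(gx.values()) + sum(gy.values()))


def logistic(y: float) -> float:
    """exp(y) / (1 + exp(y)), written in the equivalent form 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def matching_probability(a: Dict[str, str], b: Dict[str, str], alpha0: float,
                         alphas: Dict[str, float], beta: float = 0.0,
                         face_distance: float = 0.0) -> float:
    """rho(a, b) for the text model Y_T; a non-zero beta yields the joined model Y_J."""
    y = alpha0 + sum(alpha * qgram_distance(a[k], b[k]) for k, alpha in alphas.items())
    return logistic(y + beta * face_distance)


# Hypothetical coefficients and profiles, for illustration only.
ALPHA0 = 4.0
ALPHAS = {"name": -6.0, "city": -3.0}
BETA = -4.0

a = {"name": "Peter J. Miller", "city": "Zurich"}
b = {"name": "Peter Jonathan Miller", "city": "Zurich"}
c = {"name": "Anna Keller", "city": "Geneva"}

print(matching_probability(a, b, ALPHA0, ALPHAS))             # text-based, likely match
print(matching_probability(a, c, ALPHA0, ALPHAS))             # text-based, unlikely match
print(matching_probability(a, b, ALPHA0, ALPHAS, BETA, 0.1))  # joined, similar faces
```

<p>Because the name variants share most of their trigrams, the pair (a, b) receives a much higher probability than (a, c). In the joined call, the face distance enters the linear model as one more weighted term.</p></div>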
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Merging Task</head><p>Finally, based on one of the matching probabilities introduced above, a pair (a_i, b_j) is considered to belong to the same entity (i.e., (a_i, b_j) ∈ M) if:</p><formula xml:id="formula_11">\forall a_i \in A \wedge b_j \in B : \rho(a_i, b_j) \geq \theta \longrightarrow (a_i, b_j) \in M<label>(3)</label></formula></div>
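<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Prototype</head><p>Our re-identification framework consists of four major components. Currently, the Data Gathering and Acquisition module enables the acquisition of network data from the social network sites Facebook, LinkedIn, Twitter, and Flickr, whereby it only considers publicly available data. The Data Preprocessing module preprocesses the crawled data by transforming profile attributes into an internal schema and establishing connections between profiles for each relationship in the source network. The implementation provides functionality for the integration of both text attributes and profile images. For the integration of profile images, we use an implementation of the face detection algorithm OpenCV<ref type="foot" target="#foot_3">4</ref> HaarClassifier <ref type="bibr" target="#b22">[23]</ref> provided by the Faint<ref type="foot" target="#foot_4">5</ref> open source project. The algorithm returns the coordinates of every face region in an input image, whereby one of the n returned regions is randomly selected and resized to a 50 × 50-pixel image. The Matching module performs a pairwise comparison of all possible profile pairs (a_i, b_j), where a_i ∈ A ∧ b_j ∈ B. The goal of the matching task is to calculate the comparison vector γ(a_i, b_j) and the matching probability ρ(a_i, b_j) for each of the methods introduced in Section 4.1. The module uses the text-based algorithm QGRAM provided by the open-source project SimMetrics<ref type="foot" target="#foot_5">6</ref> and our own implementation of the Eigenface algorithm. Finally, the Merging module merges the data sources into an aggregated social graph based on the rule introduced in Section 4.2.</p></div>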
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Experiments</head><p>We evaluated the accuracy of the framework based on two experiments. In the first experiment we determined various input parameters, the intercept and the coefficients for the two regression models. The second experiment benchmarked the suitability of profile images for user disambiguation in the pure face-recognition and joined method against the text-based matching algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Experiment 1: Determining the Parameters</head><p>In the first experiment, two social networks with 47 and 45 users were generated from data crawled on Facebook. 36 of these users had a profile in both networks. The profile image was randomly selected from all publicly available images in the specific Facebook profile. We performed a pairwise comparison of the two sets, where for each pair the attribute similarities were stored as a quintuple [name, email address, birthday, city, image similarity] whilst varying the number of Eigenfaces in the image similarity computation. Finally, the optimal number of Eigenfaces and the parameters for the two linear models were calculated using a logistic regression model in SPSS<ref type="foot" target="#foot_6">7</ref>.</p><p>Performance metric. The profile image similarity measurements based on Eigenfaces were compared using Receiver Operating Characteristic (ROC) curves. The ROC curve graphs the true positive rate, or sensitivity (y-axis), against the false positive rate, or 1 − specificity (x-axis), where an ideal curve would go from the origin (0,0) to the top left corner (0,1) before proceeding to the top right corner (1,1) <ref type="bibr">[24, p. 244 -225]</ref>. The area under the ROC curve (AUC, also called c-statistic in medicine) can be used as a single-number performance metric for the merge accuracy. In contrast to the traditional precision, recall, or F-measure, both the ROC curve and the AUC are independent of the prior data distribution and hence serve as a more robust metric for comparing the performance of two approaches.</p><p>Results. As illustrated in Figure <ref type="figure" target="#fig_2">1</ref>, the number of Eigenfaces influences the matching accuracy. 
The accuracy of the algorithm increases with the number of Eigenfaces up to a certain point, beyond which any further increase is not beneficial or is even detrimental to the overall performance. Thus, the Eigenface algorithm should use between 50 and 60% of the top-most Eigenfaces, a result similar to <ref type="bibr" target="#b23">[24]</ref>. The resulting input parameters for the linear models are shown in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Computational Costs. The computational cost of the face-image comparison is higher than that of the text-based comparison alone. On our test machine (an Apple iMac computer with a 3.06 GHz Intel Core 2 Duo processor and 4 GB of RAM), the comparison of the four text attributes considered takes between 10 and 20 ms per pair without data preprocessing; the image-based comparison alone takes 25 to 35 ms per pair. Additionally, once per image, the face preprocessing, including face detection and image resizing, takes between five and six seconds. </p></div>
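<div xmlns="http://www.tei-c.org/ns/1.0"><p>The AUC used as performance metric equals the probability that a randomly chosen matching pair receives a higher score than a randomly chosen non-matching pair. The following sketch computes it directly from this pairwise ranking view. It is meant as an illustration only (the evaluation in this paper was carried out in SPSS), and the scores and labels are hypothetical.</p>

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Area under the ROC curve via pairwise ranking (ties count one half)."""
    positives = [s for s, is_match in zip(scores, labels) if is_match]
    negatives = [s for s, is_match in zip(scores, labels) if not is_match]
    if not positives or not negatives:
        raise ValueError("need at least one match and one non-match")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical matching probabilities for six profile pairs and their true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

# 8 of the 9 (match, non-match) combinations are ranked correctly.
print(round(auc(scores, labels), 3))
```

<p>The quadratic pairwise loop is adequate for small samples. For large numbers of profile pairs, a rank-based computation after sorting the scores would be preferable.</p></div>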
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Experiment 2</head><p>For the second experiment we collected a subgraph of both Facebook and LinkedIn. Starting from the first author's profile, we collected 1610 (Facebook) and 1690 (LinkedIn) profiles and manually determined that 166 users were present in both samples. We compared all these profiles with the three approaches using the input parameters determined in Experiment 1.</p><p>Results. Figure <ref type="figure" target="#fig_3">2</ref> graphs the ROC curves for the three methods. Note that whilst the text-based method (AUC=0.986) outperforms the pure image-based method (AUC=0.938), the combined method (AUC=0.998) significantly outperforms either method (p = 0.001, p = 0.0001, compared using the non-parametric method described by DeLong <ref type="bibr" target="#b5">[6]</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Discussion, Limitations and Future Work</head><p>As the above results show, the combined method clearly outperforms each of the others. It is interesting to observe that the ROC curves of both the text-based and the image-based method shoot almost straight up until about (0, 0.9). Then the text-based method flattens out whilst the combined one continues to rise. This suggests that the initial part of the combined method's accuracy is contributed mostly by the text-based method. Only then does the image-based method contribute additional predictive power. When looking at the regression parameters, this suggestion receives some additional support, as the parameters for Email and City lose in their contribution whilst the algorithm relies more on the Name, the Image, and, interestingly, the Birthday. Obviously, all these results are limited by the usage of only one, albeit real-world, dataset and will have to be validated with others. Also, our experiment assumed that we knew the semantic alignment of the text attributes. When merging only two SNSs this assumption seems reasonable; when more are involved, this alignment may introduce additional error. Consequently, we probably overestimated the accuracy of the textual method.</p><p>Last but not least, a real-world system would probably not perform a full pairwise comparison but, to limit the computational expenditure, use some optimization approach.</p><p>We intend to investigate all these limitations in our future work.</p><p>In this paper we proposed an extension of the traditional text-attribute-based method for re-identification in social networks using the images of profiles. The experimental results show that the pure face-recognition based re-identification method cannot compete with the traditional text-based methods in accuracy and computational performance. 
A combined method, however, significantly outperforms the pure text-based method in accuracy, suggesting that it contains complementary information. As we showed, this combined method significantly improves the accuracy of a social network system merge. Consequently, we believe that it provides a more solid basis for both researchers and practitioners interested in investigating multiple SNSs and facing the problems of multiplicity.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>1 .</head><label>1</label><figDesc>Attribute Comparison: The attributes of two social network profiles are compared pairwise. The result is a comparison vector γ(a i , b j ) = {d 1 , d 2 , ..., d n }, where n is the number of compared attributes and d k ∈ [0, 1] indicates the distance between the values of the k th -attribute of the profiles a i and b j .</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>4</head><label>4</label><figDesc>HaarClassifier <ref type="bibr" target="#b22">[23]</ref> provided by the Faint<ref type="foot" target="#foot_4">5</ref> open source project. The algorithm returns the coordinates of every face region in an input image, whereby one of the n returned regions is randomly selected and resized to a 50 × 50-pixel image. The Matching module performs a pairwise comparison of all possible profile pairs (a_i, b_j), where a_i ∈ A ∧ b_j ∈ B. The goal of the matching task is to calculate the comparison vector γ(a_i, b_j) and the matching probability ρ(a_i, b_j) for each of the methods introduced in Section 4.1. The module uses the text-based algorithm QGRAM provided by the open-source project SimMetrics<ref type="foot" target="#foot_5">6</ref> and our own implementation of the Eigenface algorithm. Finally, the Merging module merges the data sources into an aggregated social graph based on the rule introduced in Section 4.2.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Showing the influence of the number of Eigenfaces on the area under the ROC-Curve based on data of the first experiment and a confidence interval of 95%</figDesc><graphic coords="9,194.32,344.18,226.47,151.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: Results of the second experiment merging two subnetworks of Facebook and LinkedIn</figDesc><graphic coords="11,208.52,103.46,198.31,147.36" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Input parameters for the regression-based text-based and joined method models, learned on the dataset of the first experiment and used as input in the second experiment.</figDesc><table><row><cell>Method</cell><cell>α_0</cell><cell>α_Name</cell><cell>α_Email</cell><cell>α_Birthday</cell><cell>α_City</cell><cell>β</cell></row><row><cell>Text-Based Method Y_T</cell><cell>-0.319</cell><cell>25.655</cell><cell>-1.763</cell><cell>9.750</cell><cell>25.334</cell><cell>-</cell></row><row><cell>Joined Method Y_J</cell><cell>-6.659</cell><cell>26.656</cell><cell>0.234</cell><cell>11.536</cell><cell>18.272</cell><cell>8.788</cell></row></table></figure>
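<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the role of the Table 1 parameters concrete, the following Python sketch turns per-attribute scores into a matching probability. The coefficient values are copied from the joined-method row of Table 1; everything else is an assumption for illustration. In particular, this section does not state the functional form of the regression, so the logistic link used here is assumed, and the names JOINED and matching_probability are hypothetical. How each input score is defined and oriented follows the method sections of the paper and is not reproduced here.</p>
<p><code lang="python">
```python
import math

# Coefficients of the joined method Y_J, copied from Table 1.
JOINED = {
    "intercept": -6.659,  # alpha_0
    "name": 26.656,       # alpha_Name
    "email": 0.234,       # alpha_Email
    "birthday": 11.536,   # alpha_Birthday
    "city": 18.272,       # alpha_City
    "image": 8.788,       # beta
}


def matching_probability(scores, coefficients=JOINED):
    """Combine per-attribute scores in [0, 1] into a probability.

    Assumption: the model is a logistic regression, so the weighted
    sum is passed through the logistic function.
    """
    z = coefficients["intercept"]
    for name, value in scores.items():
        z += coefficients[name] * value
    return 1.0 / (1.0 + math.exp(-z))
```
</code></p>
<p>Under this assumed form the intercept alone gives a probability close to zero, and each attribute moves the weighted sum in proportion to its coefficient, which is one way to read the observation that Name, Image and Birthday carry more weight in the joined model than Email.</p></div>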
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.foaf-project.org / http://xmlns.com/foaf/spec/20100101.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://sioc-project.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://scot-project.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://sourceforge.net/projects/opencvlibrary/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://faint.sourceforge.net/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://www.sourceforge.net/projects/simmetrics/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">http://www.spss.com/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The missing links: Bugs and bug-fix commits</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bachmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bird</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rahman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Devanbu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bernstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM SIGSOFT / FSE &apos;10: Proceedings of the 18th International Symposium on the Foundations of Software Engineering</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Disambiguating web appearances of people in a social network</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bekkerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the WWW</title>
				<meeting>the WWW</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Extracting key phrases to disambiguate personal names on the web</title>
		<author>
			<persName><forename type="first">D</forename><surname>Bollegara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ishizuka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CICLing</title>
				<meeting>CICLing</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">User identification for cross-system personalisation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Carmagnola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cena</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences: an International Journal</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="16" to="32" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">User data distributed on the social web: how to identify users on different social systems and collecting data about them</title>
		<author>
			<persName><forename type="first">F</forename><surname>Carmagnola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Osborne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Torre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems</title>
				<meeting>the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach</title>
		<author>
			<persName><forename type="first">E</forename><surname>Delong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Delong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Clarke-Pearson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="837" to="845" />
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Duplicate record detection: A survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Elmagarmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ipeirotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Verykios</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Fahrmeir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Pigeot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tutz</surname></persName>
		</author>
		<title level="m">Statistik - Der Weg zur Datenanalyse</title>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<pubPlace>Berlin Heidelberg New York</pubPlace>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A theory for record linkage</title>
		<author>
			<persName><forename type="first">I</forename><surname>Fellegi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sunter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Statistical Association</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="page" from="1183" to="1210" />
			<date type="published" when="1969">1969</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Information revelation and privacy in online social networks</title>
		<author>
			<persName><forename type="first">R</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Acquisti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society</title>
				<meeting>the 2005 ACM Workshop on Privacy in the Electronic Society</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="71" to="80" />
		</imprint>
	</monogr>
	<note>Workshop On Privacy In The Electronic Society</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Adaptive automatic facial feature segmentation</title>
		<author>
			<persName><surname>Demirel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">C</forename><surname>Clarke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of 2nd International Conference on Automatic Face and Gesture Recognition</title>
				<meeting>of 2nd International Conference on Automatic Face and Gesture Recognition</meeting>
		<imprint>
			<date>196</date>
			<biblScope unit="page" from="277" to="282" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">User profile elicitation and conversion in a mashup environment</title>
		<author>
			<persName><forename type="first">E</forename><surname>Leonard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Houben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Van Der Sluijs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hidders</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Herder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Abel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Heckmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Int. Workshop on Lightweight Integration on the Web, in conjunction with ICWE</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Unsupervised Name Disambiguation via Social Network Similarity</title>
		<author>
			<persName><forename type="first">B</forename><surname>Malin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Polyphonet: An advanced social network extraction system</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hamasaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ishizuka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th International World Wide Web Conference</title>
				<meeting>the 15th International World Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Flink: Semantic web technology for the extraction and analysis of social networks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mika</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="211" to="223" />
			<date type="published" when="2005-01">Jan 2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Descriptions of social relations</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gangemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Friend of a Friend, Social Network and the Semantic Web</title>
				<meeting>the First Workshop on Friend of a Friend, Social Network and the Semantic Web</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Interlinking distributed social graphs</title>
		<author>
			<persName><forename type="first">M</forename><surname>Rowe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Linked Data on the Web Workshop, 18th Int. World Wide Web Conference</title>
				<meeting>Linked Data on the Web Workshop, 18th Int. World Wide Web Conference</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Entity resolution with Markov logic</title>
		<author>
			<persName><forename type="first">P</forename><surname>Singla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDM Sixth International Conference on Data Mining</title>
				<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Application of the Karhunen-Loève procedure for the characterization of human faces</title>
		<author>
			<persName><forename type="first">L</forename><surname>Sirovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kirby</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="103" to="108" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Face recognition using eigenfaces</title>
		<author>
			<persName><forename type="first">M</forename><surname>Turk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pentland</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="1991-01">Jan 1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Cyberspace and identity</title>
		<author>
			<persName><forename type="first">S</forename><surname>Turkle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Contemporary Sociology</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="643" to="648" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Matching Profiles from Social Network Sites</title>
		<author>
			<persName><forename type="first">I</forename><surname>Veldman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
		<respStmt>
			<orgName>University of Twente</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Master&apos;s thesis</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Rapid object detection using a boosted cascade of simple features</title>
		<author>
			<persName><forename type="first">P</forename><surname>Viola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jones</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Reliable Face Recognition Methods - System, Design, Implementation and Evaluation</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wechsler</surname></persName>
		</author>
		<imprint>
			<publisher>Springer Media LLC</publisher>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Face Processing - Advanced Modeling and Methods</title>
		<author>
			<persName><forename type="first">Wenyi</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>Academic Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Winkler</surname></persName>
		</author>
		<title level="m">The state of record linkage and current research problems</title>
				<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
		<respStmt>
			<orgName>Statistical Research Division, U.S. Census Bureau</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Tech. rep.</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
