<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Social Network Aggregation Using Face-Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Minder</string-name>
          <email>minder@ifi.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Bernstein</string-name>
          <email>bernstein@ifi.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Zurich, Dynamic and Distributed Information Systems Group</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the rapid growth of the social web, an increasing number of people have started to replicate their off-line preferences and lives in an on-line environment. Consequently, the social web provides an enormous source of social network data, which can be used in both commercial and research applications. However, people often take part in multiple social network sites and, generally, they share only a selected amount of data with the audience of a specific platform. Consequently, the interlinkage of social graphs from different sources is becoming increasingly important for applications such as social network analysis, personalization, or recommender systems. This paper proposes a novel method to enhance available user re-identification systems for social network data aggregation based on face-recognition algorithms. Furthermore, the method is combined with traditional text-based approaches in order to counter-balance the weaknesses of both methods. Using two samples of real-world social networks (with 1610 and 1690 identities, respectively) we show that even though a pure face-recognition-based method is outperformed by the traditional text-based method (area under the ROC curve 0.938 vs. 0.986), the combined method significantly outperforms both of these (0.998, p = 0.0001), suggesting that the face-based method indeed carries information complementary to raw text attributes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the rapid growth of the social web, an increasing number of people have started
to replicate their off-line preferences and lives in an on-line environment. Indeed,
not only has the usage of social network sites (SNS) such as Facebook, Google+, or LinkedIn,
of messaging services (e.g., Twitter), of tagging systems (e.g., del.icio.us), and of
sharing and recommendation services (e.g., Last.fm) increased
immensely, but the activities on these sites have become an integral element of the daily
lives of millions of people. Hence, the social web provides an enormous source
for social network data collection.</p>
      <p>
        Often people take part in several of these SNSs. In some cases this
multi-participation arises from necessity, as some features may only be provided by
some sites and not by others. In most cases, however, it is the result of free
choice: the many services allow people to “partition” their lives (e.g., they may
use Facebook for their private and LinkedIn for their professional network). In fact,
the construction of site-specific identities makes it possible to maintain multiple
personalities, as identifying features such as the email address can be changed
easily; Internet researchers have called this effect “multiplicity” [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
Hence, users will continue to maintain multiple identities even if one service were to
cater to all their needs.
      </p>
      <p>
        At the same time, the identification of users for interlinking data from
different and distributed systems is becoming increasingly important for many kinds
of applications. In personalization, the use of cross-site profiles is essential, as the
incorporation of multi-source user profile data significantly increases the quality
of preference recommendations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; in social network analysis, the merging of
multiple networks provides a more complete picture of the overall social graph
and helps to minimize the data selection bias from which most single-site studies
suffer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; and trust networks can be created by aggregating relationships among
network participants [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Even if the semantic web were to become immensely
popular, the increased usage of a global identifier might not simplify the universal
identification of a person, as some sites may not use the same identifiers or may
ignore the identification scheme entirely, and users may choose to maintain
multiple identifiers in order to preserve their multiplicity. In fact, Mika et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
argue that the key problem in the extraction of social network data, the
disambiguation of identities and relationships, still remains, as different social
web applications refer to relationship types, attributes, or tastes in profiles in
different ways and do not share any common key for the identification of users.
As a consequence, both researchers and practitioners (such as marketers) are
faced with a complicated research question: how can we combine the
multitude of information available about a person in multiple SNSs into a
holistic, combined (and as complete as possible) user model when the user's
identities on the different sites are difficult to link?
      </p>
      <p>
        Current proposals for interlinking social network profiles based on comparing
text-based attributes of user profiles [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or using the network structure [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] have
the drawback that they either scale poorly or that they require some
overlap in the relationship structure and result in a large computational expenditure,
respectively. In this paper we propose to enhance current text-based methods,
in the absence of semantic metadata, by combining them with face-recognition
algorithms. Specifically, we propose to use face-recognition software to compare the
images uploaded by users on different SNSs as an additional feature for identity
merging. As we show, this statistical entity resolution procedure significantly
enhances the merging precision of two SNSs. Consequently, the contributions of this
paper are: (1) the presentation of an enhanced identity-merging framework that
incorporates images; (2) the presentation of an algorithm that merges identities
based on face-recognition software; and (3) the combination of the traditional text-based
and the introduced image-based merge approach to counter-balance the respective
weaknesses of each approach.
      </p>
      <p>To this end, we first ground our idea by giving an overview of related work and
introducing the fundamental concepts of entity resolution (i.e., re-identification)
and face-recognition. Then we present our novel re-identification technique and
discuss our prototype. Finally, we evaluate our procedure empirically on
real-world datasets and close with a discussion of the limitations, future work,
and some general conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Winkler [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] showed that, with a minimal set of attributes, a large portion of the
US population can be re-identified based on US Census data. Furthermore, Gross
et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] showed that about 80% of social network site users provide enough
public data for a direct re-identification and that at least 61% of the published
profile images on Facebook.com allow a direct identification by a human.
      </p>
      <p>
        Carmagnola et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Bekkerman et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] each provide a cross-system
identity discovery system based on text-based identification probability
calculations, whereby publicly available textual attributes of social network sites
are analyzed for their positive or negative influence on identification.
Further, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] suggest the use of key phrase extraction for the name disambiguation
process, which is also used in POLYPHONET [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] for interlinking web pages. Moreover,
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] provide re-identification algorithms based on network similarity.
These systems achieve high accuracy but suffer from high computational complexity and
time expenditure.
      </p>
      <p>
        A lot of research concerns shared approaches [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]: especially the
application of common semantic languages, such as the FOAF ontology
(http://www.foaf-project.org), the SIOC (Semantically-Interlinked Online Communities)
ontology (http://sioc-project.org/) for online communities, or the SCOT
(Social Semantic Cloud Of Tags) ontology (http://scot-project.org/) for tagging
systems. Such systems are desirable but not yet widespread in practice. The most
well-known system based on such data is FLINK [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Theoretical Foundations</title>
      <p>In this section, we present the theoretical foundations for our approach. First,
we present a formal model for entity resolution and then succinctly explain the
basics of face-recognition. Both foundations are used in our framework.
</p>
      <sec id="sec-3-1">
        <title>Entity Resolution and the Fellegi-Sunter Model</title>
        <p>
          Entity resolution can be defined as the methodology of merging corresponding
records from two or more sources [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Consider for example a profile about
“Peter J. Miller” and another one about “Peter Jonathan Miller” on two different
SNSs. Entity resolution tries to decide whether these two profiles belong to the same
person or not. Entity resolution therefore assumes that an individual shares
similar features in different environments, which can be used to identify an entity
even though no common key is defined. Generally, to complicate the resolution
process, there are different entities that share similar attribute values.
        </p>
        <p>
          Most current re-identification approaches are variants of the Fellegi-Sunter
model, a distance- and rule-based technique. The Fellegi-Sunter model
determines a match between two entities by computing the similarity of their attribute
(or feature) vectors [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Specifically, given entities a ∈ A and b ∈ B, where both
A and B are the set of entities in SNS A and B, it tries to assign each pair
(a, b) of the space A × B to a set M or U whereby:
        </p>
        <p>M := the set of true matches = {(a, b) : a ∈ A ∧ b ∈ B ∧ a = b}
U := the set of non-matches = {(a, b) : a ∈ A ∧ b ∈ B ∧ a ≠ b}
It does so using a comparison function γ that computes the similarity measures
for each of the n comparable attributes of the entities and arranges these in a
vector:</p>
        <p>γ(a, b) = {γ1(a, b), ..., γn(a, b)}
Based on the comparison vector γ(a, b), a decision rule L now assigns each pair
(a, b) to either the set M or the set U as follows:
(a, b) ∈ M if p(M|γ) ≥ p(U|γ), and (a, b) ∈ U otherwise,</p>
        <p>
whereby p(M|γ) is the probability that the comparison vector γ belongs to
the match class and p(U|γ) the probability that γ belongs to U. In other words, the
Fellegi-Sunter model treats all pairs of possible matches as independent. Recently,
several authors have argued that this independence assumption offers an opportunity
for enhancements. Singla et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], e.g., propose such an enhancement based on Markov logic.
        </p>
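The decision rule above can be sketched in a few lines of Python; a minimal illustration assuming binarized attribute agreements and known (here entirely hypothetical) per-attribute agreement probabilities p(γk|M), p(γk|U) and priors, with the Fellegi-Sunter conditional-independence assumption:

```python
from math import prod

# Hypothetical per-attribute agreement probabilities and priors (not from the paper).
P_AGREE_M = [0.95, 0.90, 0.85]   # name, email, birthday agree given a true match
P_AGREE_U = [0.05, 0.01, 0.10]   # ... agree given a non-match
P_M, P_U = 0.001, 0.999          # prior probability that a random pair matches

def comparison_vector(a, b, attrs):
    """gamma(a, b): 1 where the attribute values agree, else 0."""
    return [1 if a[k] == b[k] else 0 for k in attrs]

def classify(gamma):
    """Assign the pair to M if p(M|gamma) >= p(U|gamma), else to U."""
    # Attributes are treated as conditionally independent, as in Fellegi-Sunter.
    like_m = prod(p if g else 1 - p for g, p in zip(gamma, P_AGREE_M))
    like_u = prod(p if g else 1 - p for g, p in zip(gamma, P_AGREE_U))
    return "M" if like_m * P_M >= like_u * P_U else "U"

a = {"name": "Peter J. Miller", "email": "pm@example.org", "birthday": "1980-01-02"}
b = {"name": "Peter J. Miller", "email": "pm@example.org", "birthday": "1980-01-02"}
gamma = comparison_vector(a, b, ["name", "email", "birthday"])
print(gamma, classify(gamma))  # [1, 1, 1] M
```

With all three attributes agreeing, the match likelihood dominates the prior odds against a match, so the pair is assigned to M.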
      </sec>
      <sec id="sec-3-2">
        <title>Face-Recognition and the Eigenface Algorithm</title>
        <p>
          The face provides an enormous set of characteristics that the human perception
system uses to identify other individuals. The problem of face-recognition can
be formulated as follows: “Given still or video images of a scene, identify or
verify one or more persons in the scene using a stored database of faces. Available
collateral information [...] may be used in narrowing the search (enhancing
recognition)” [25, p. 4]. Accordingly, face-recognition includes [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]: (1) The detection
and location of an unknown number of faces in an image [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]; (2) The
extraction of key facial features; and (3) The identification [25, p. 12], which includes
a comparison and matching of invariant biometric face signatures [25, pp. 14-16].
The identification can be done using holistic matching, feature-based
matching, or hybrid matching methods, which use the whole face, local
features (e.g., the location or geometry of the nose), or both as the input vector
for classification, respectively [25, p. 14].
        </p>
        <p>
          Our re-identification framework uses the holistic face-recognition algorithm
Eigenface [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] based on Principal Component Analysis (PCA) and covering all
relevant local and global features [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The Eigenface approach tries to code all
the relevant extracted information of a face image, such that the encoding can
be done efficiently, allowing for a comparison of the information to a database
of encoded models [25, p. 67]. The Eigenface algorithm can be split into two
parts:
        </p>
        <p>(1) Representation of the Image Database in Principal Component Vectors
Based on PCA, the principal components of a face-image are extracted, by (1)
acquiring an initial set of face images; (2) Defining the face space by calculating
the eigenvectors (Eigenfaces) from the set using PCA and eliminating all but the k
eigenvectors with the highest eigenvalues; and (3) representing each
known individual by projecting their face image onto the face space.</p>
        <p>
          Therefore, an image I(x, y) can be interpreted as a vector in an N-dimensional
space, where N = r · c with r rows and c columns of the image [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Every
coordinate in the N-dimensional vector I(x, y), the image space, corresponds
to a pixel of the image. This representation of an image obfuscates any
relationship between neighbouring pixels, but is consistent as long as all images are
rearranged in the same manner. Thus the average face of the initially acquired
training set Γ := {γ1, γ2, ..., γm} can be calculated by
        </p>
        <p>γ¯ = (1/m) Σ_{n=1}^{m} γn</p>
        <p>and the distance between an image and the average image is measured by
φi = γi − γ¯, whereby the orthonormal vectors ul define the Eigenfaces with the
eigenvectors:</p>
        <p>ul = Σ_{k=1}^{M} elk φk, ∀l ∈ [1, M]</p>
        <p>
whereby the eigenvectors el are calculated from the covariance matrix L = AAᵀ,
where Lmn = φmᵀφn and A = [φ1, φ2, ..., φM]. The derivation of the best
eigenvectors out of the covariance matrix is presented in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The k most significant
eigenvectors of L span a k-dimensional face space, a subspace of the
N-dimensional image space, where every face is represented as a linear combination of
the Eigenfaces [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] [25, p. 67 - 72].
        </p>
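As an illustration of this representation step, the eigenface computation can be sketched with NumPy on toy data; random vectors stand in for face images, and the dimensions, 50% cut-off, and variable names are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, c = 12, 8, 8                 # m training "faces" of r x c pixels (toy data)
gammas = rng.random((m, r * c))    # each image flattened to an N = r*c vector

mean_face = gammas.mean(axis=0)    # the average face (gamma bar)
A = (gammas - mean_face).T         # columns phi_i = gamma_i - gamma bar, shape (N, m)

# Covariance trick: the eigenvectors of the small m x m matrix A^T A yield the
# eigenvectors of A A^T after multiplication by A (as in Turk and Pentland).
L = A.T @ A
eigvals, eigvecs = np.linalg.eigh(L)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
k = int(0.5 * m)                           # keep ~50% of the eigenfaces (cf. Sec. 6)
U = A @ eigvecs[:, order[:k]]              # eigenfaces, shape (N, k)
U /= np.linalg.norm(U, axis=0)             # orthonormalize the columns

def project(image_vec):
    """Weights omega_k = u_k^T (gamma - gamma bar) for a flattened image."""
    return U.T @ (image_vec - mean_face)

omega = project(gammas[0])
print(omega.shape)                 # a k-dimensional weight vector in the face space
```

Each image is thus reduced from N pixel values to k weights, which is what makes the later pairwise comparisons cheap.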
        <p>(2) The Identification Process The identification or verification of
an image is processed by: (1) subtracting the mean image from the new face
image and projecting the result onto each of the eigenvectors (Eigenfaces); (2)
determining whether the image is a face by calculating the distance to the face
space and comparing it to a defined threshold; and (3) if it is a face, classifying
the weight pattern as a known or unknown individual by using a distance metric,
such as the Euclidean distance.</p>
        <p>Thus, a new face image I(x, y) will be projected into the face space by
ωk = ukᵀ(γ − γ¯) for all k ∈ [1, ..., M′]. The weight vector Ωᵀ = [ω1, ..., ωM′]
represents the influence of each eigenvector on the input image. Hence, given a
threshold θε, if the face class k, which minimizes the Euclidean distance, satisfies
εk = ‖Ω − Ωk‖ and εk &lt; θε
(1)
then the image belongs to the same individual; else the face is classified as
unknown. Furthermore, the distance between an image and the face space can be
characterised by the squared distance between the mean-adjusted input image and
its projection onto the face space:
ε² = ‖φ − φf‖², where φ = γ − γ¯ and φf = Σ_{i=1}^{M′} ωi ui
(2)
Therefore, a new face image I(x, y) will be classified as a non-face image if
ε &gt; θε, as a known face image if ε &lt; θε ∧ εk &lt; θε, and as an unknown face image
if ε &lt; θε ∧ εk &gt; θε.</p>
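The three-way decision rule (non-face / known / unknown) can be sketched as follows; a minimal illustration assuming the eigenfaces U, the mean face, and the stored weight vectors of known individuals are already given, with a toy hand-built face space and a hypothetical threshold:

```python
import numpy as np

def classify_face(gamma, U, mean_face, known_omegas, theta_eps):
    """Classify a flattened image gamma per the Eigenface rule (cf. Eqs. 1-2)."""
    phi = gamma - mean_face              # mean-adjusted input image
    omega = U.T @ phi                    # projection into the face space
    phi_f = U @ omega                    # reconstruction from the face space
    eps = np.linalg.norm(phi - phi_f)    # distance to the face space
    if eps > theta_eps:
        return "non-face"
    # Distance to the closest known face class k.
    eps_k = min(np.linalg.norm(omega - omega_k) for omega_k in known_omegas)
    return "known" if theta_eps > eps_k else "unknown"

# Toy example: a 2-dimensional "face space" inside a 4-dimensional image space.
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
mean_face = np.zeros(4)
known = [np.array([1.0, 1.0])]
print(classify_face(np.array([1.0, 1.0, 0.0, 0.0]), U, mean_face, known, 0.5))  # known
```

An input lying in the face space and close to a stored class is reported as known; one far from the face space as non-face; one in the face space but far from all classes as unknown.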
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Re-Identification Framework</title>
      <p>Our theoretical re-identification framework for user disambiguation in social
network aggregation and cross-system personalization is based on the
Fellegi-Sunter model. The presented algorithms calculate the probability that
two user profiles belong to the same entity and incorporate the ability to
use images as an additional feature based on the Eigenface method. The framework
therefore provides three kinds of methods: a pure face-recognition-based, a
text-attribute-based, and a joined re-identification method.</p>
      <p>
        The methods follow a simple re-identification algorithm. Assume, two sets
A = {a1, a2, ..., am} and B = {b1, b2, ..., bn} of user profiles from two
different SNSs. Each profile is characterized by a set of text attributes and a single
profile image. We can now define E = {e1, e2, ..., ez} as the set of different
individuals, who have a profile in one or both social networks. Consequently, the
re-identification algorithm is based on the following three subtasks:
1. Attribute Comparison: The attributes of two social network profiles are
compared pairwise. The result is a comparison vector γ(ai, bj) = {d1, d2, ..., dn},
where n is the number of compared attributes and dk ∈ [0, 1] indicates the
distance between the values of the kth attribute of the profiles ai and bj.
A distance dk of 0 indicates that the two attribute instances are
completely equal, and a value of 1 indicates the opposite.
2. Matching Probability Calculation: Then, based on the comparison vector
γ(ai, bj), the probability ρ(ai, bj) that a pair (ai, bj) belongs to the same
entity is calculated.
3. Merging Task: Finally, if the probability ρ(ai, bj) is greater than or equal to a
threshold value θ ∈ [0, 1] (i.e., ρ(ai, bj) ≥ θ), then the profiles ai and bj are
assumed to belong to the same person.
      </p>
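The three subtasks above can be sketched as a generic pipeline; the distance and probability functions are placeholders for the concrete methods of Section 4.1, and the toy profiles and threshold are hypothetical:

```python
def reidentify(profiles_a, profiles_b, attr_distances, match_probability, theta):
    """Generic three-step re-identification: compare, score, merge.

    attr_distances: list of functions d_k(a, b) -> [0, 1], where 0 means equal
    match_probability: maps a comparison vector to a probability rho in [0, 1]
    theta: merge threshold
    """
    matches = []
    for a in profiles_a:                       # subtask 1: pairwise comparison
        for b in profiles_b:
            gamma = [d(a, b) for d in attr_distances]
            rho = match_probability(gamma)     # subtask 2: matching probability
            if rho >= theta:                   # subtask 3: merging rule
                matches.append((a["id"], b["id"], rho))
    return matches

# Hypothetical toy run with a single exact-match distance on the name attribute.
dists = [lambda a, b: 0.0 if a["name"] == b["name"] else 1.0]
prob = lambda gamma: 1.0 - gamma[0]
A = [{"id": "a1", "name": "Peter Miller"}]
B = [{"id": "b1", "name": "Peter Miller"}, {"id": "b2", "name": "Jane Doe"}]
print(reidentify(A, B, dists, prob, theta=0.9))  # [('a1', 'b1', 1.0)]
```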
      <sec id="sec-4-1">
        <title>Attribute Comparison and Matching Probability Calculation</title>
        <p>The following three generic methods allow the comparison of n different
attributes and the calculation of a matching probability. The methods cover the
first two subtasks of the re-identification algorithm introduced above.</p>
        <p>
          (1) Pure Face-Recognition Based Method The method re-identifies
user profiles only by the application of the face-recognition algorithm Eigenface
on profile images. Hence, ∀ai ∈ A ∧ bj ∈ B, the probability ρ(ai, bj), that two
profiles ai and bj belong to the same entity el ∈ E, is defined as:
ρ(ai, bj) = εij(ai, bj) = ‖Ωai − Ωbj‖ ∈ [0, 1],
whereby it is assumed that the profile images are projected into the face space
by ωai = ukᵀ(ai − γ¯) and ωbj = ukᵀ(bj − γ¯). Additionally, the set B is used as
training set for the initialization task, thus Γ = B.
        </p>
        <p>
          (2) Text-Attribute Based Method The algorithm re-identifies user
profiles by the application of text-attribute comparison. The attributes are
compared with the token-based QGRAM algorithm [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Note that spelling errors
affect the similarity only minimally when using QGRAM, as q-grams instead
of words are used as tokens. For the kth attribute the algorithm computes a
normalized distance d(aik, bjk) ∈ [0, 1], where the distance is zero if the values of the
kth-attribute of ai and bj are syntactically equivalent. As we discuss in Section
6, we considered name, email address, birthday, city as a minimal set of text
attributes in the experiments, as they were shown to be strong indicators for
identification [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and other attributes such as address or phone number
are often not accessible. As a result, the matching probability is calculated by a
logistic function [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]:
ρ(ai, bj) = exp(YT(ai, bj)) / (1 + exp(YT(ai, bj))) ∈ [0, 1]
</p>
        <p>
where YT(ai, bj) = α0 + Σ_{k=1}^{n} αk d(aik, bjk).
The intercept α0 and the regression coefficients {α1, ..., αn} for the linear regression
model YT(ai, bj) are learned by a logistic regression on a specific training set.
        </p>
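The text-based scoring can be sketched as follows; a simple hand-rolled q-gram distance stands in for the SimMetrics QGRAM implementation, and the coefficients are hand-set illustrative values (in the paper they are learned by logistic regression):

```python
from math import exp

def qgram_distance(s, t, q=2):
    """Normalized q-gram distance in [0, 1]; 0 means syntactically equal."""
    grams = lambda x: {x[i:i + q] for i in range(max(len(x) - q + 1, 1))}
    gs, gt = grams(s.lower()), grams(t.lower())
    return 1.0 - len(gs & gt) / max(len(gs | gt), 1)

def rho_text(a, b, alphas, alpha0):
    """Logistic matching probability over the four attribute distances."""
    dists = [qgram_distance(a[k], b[k]) for k in ("name", "email", "birthday", "city")]
    y = alpha0 + sum(ak * dk for ak, dk in zip(alphas, dists))
    return exp(y) / (1 + exp(y))

# Hypothetical coefficients: negative weights penalize larger distances.
alphas, alpha0 = [-10.0, -10.0, -10.0, -10.0], 8.0
a = {"name": "Peter J. Miller", "email": "pm@x.org", "birthday": "1980-01-02", "city": "Zurich"}
b = {"name": "Peter Jonathan Miller", "email": "pm@x.org", "birthday": "1980-01-02", "city": "Zurich"}
print(round(rho_text(a, b, alphas, alpha0), 3))
```

Identical attribute values yield distances of zero and hence a matching probability near one; profiles differing in every attribute are pushed toward zero.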
        <p>(3) Joined Method Finally, the two previously described methods are
joined into a method that uses both face-image-based and text-attribute-based
identification. Thus, for all pairs of profiles ai ∈ A ∧ bj ∈ B, the matching
probability is assumed to be:
ρ(ai, bj) = exp(YJ(ai, bj)) / (1 + exp(YJ(ai, bj))) ∈ [0, 1]
where YJ(ai, bj) = α0 + Σ_{k=1}^{n} αk d(aik, bjk) + β εij(ai, bj).</p>
        <p>Again, the intercept α0 and the regression coefficients {α1, ..., αn, β} for the
linear regression model YJ(ai, bj) are learned by a logistic regression on a specific
training set.
Finally, based on one of the matching probabilities introduced above, a pair
(ai, bj) is said to belong to the same entity (i.e., (ai, bj) ∈ M) if:
∀ai ∈ A ∧ bj ∈ B : ρ(ai, bj) ≥ θ −→ (ai, bj) ∈ M
(3)</p>
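The joined score simply appends the Eigenface distance as one more regressor, followed by the threshold rule; a sketch with hypothetical coefficients and data:

```python
from math import exp

def rho_joined(text_dists, eps_ij, alphas, beta, alpha0):
    """Logistic matching probability over text distances plus the face distance."""
    y = alpha0 + sum(a * d for a, d in zip(alphas, text_dists)) + beta * eps_ij
    return exp(y) / (1 + exp(y))

def merge(scored_pairs, theta):
    """Keep the pairs whose matching probability reaches the threshold (Eq. 3)."""
    return [(i, j) for i, j, rho in scored_pairs if rho >= theta]

# Hypothetical coefficients; negative weights penalize larger distances.
alphas, beta, alpha0 = [-10.0, -10.0, -10.0, -10.0], -5.0, 10.0
rho = rho_joined([0.0, 0.1, 0.0, 0.0], eps_ij=0.2, alphas=alphas, beta=beta, alpha0=alpha0)
print(merge([("a1", "b1", rho)], theta=0.9))  # [('a1', 'b1')]
```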
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Prototype</title>
      <p>
        Our re-identification framework consists of four major components. Currently,
the Data Gathering and Acquisition module enables the acquisition of network
data from the social network sites Facebook, LinkedIn, Twitter, and Flickr,
whereby only publicly available data is concerned. The Data Preprocessing module
preprocesses the crawled data by transforming profile attributes into an internal
schema and establishing connections between profiles for each relationship in the
source network. The implementation provides functionality for the integration of
both text attributes and profile images. For the integration of profile images,
we use an implementation of the OpenCV (http://sourceforge.net/projects/opencvlibrary/)
Haar Classifier face-detection algorithm [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] provided by the Faint (http://faint.sourceforge.net/) open source project.
The algorithm returns the coordinates of every face region in an input image,
whereby one region of the n returned regions is randomly selected and resized to
a 50 × 50-pixel image. The Matching module performs a pairwise comparison of
all possible profile pairs (ai, bj), where ai ∈ A ∧ bj ∈ B. The goal of the matching
task is to calculate the comparison vector γ(ai, bj) and the matching probability
ρ(ai, bj) for each of the methods introduced in Section 4.1. The module uses the
text-based algorithm QGRAM provided by the open-source project SimMetrics
(http://www.sourceforge.net/projects/simmetrics/), and our own implementation
of the Eigenface algorithm. Finally, the Merging module merges the data sources
into an aggregated social graph based on the rule introduced in Section 4.1.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>We evaluated the accuracy of the framework based on two experiments. In the
first experiment we determined various input parameters, the intercept and
the coefficients for the two regression models. The second experiment
benchmarked the suitability of profile images for user disambiguation in the pure
face-recognition and joined methods against the text-based matching algorithm.</p>
      <sec id="sec-6-1">
        <title>Experiment 1: Determining the Parameters</title>
        <p>In the first experiment, two social networks with sizes of 47 and 45 profiles were
generated from data crawled on Facebook; 36 of these users had a profile in both
networks. The profile image was randomly selected from all publicly available
images published in the specific Facebook profile. We performed a pairwise
comparison of the two sets, whereby for each pair the attribute similarities were stored
as a quintuple [name, email address, birthday, city, image similarity] whilst
varying the number of Eigenfaces in the image-similarity computation. Finally, the
optimal number of Eigenfaces and the parameters for the two linear models were
calculated using a logistic regression model in SPSS (http://www.spss.com/).</p>
        <p>Performance metric The profile image similarity measurements based on
Eigenfaces were compared using Receiver Operating Characteristics (ROC) curves.
The ROC curve graphs the true positive rate (y-axis; sensitivity)
against the false positive rate (x-axis; 1 − specificity), where an ideal
curve would go from the origin (0,0) to the top-left (0,1) corner before
proceeding to the top-right (1,1) corner [24, p. 244 - 225]. The area under the ROC curve
(AUC, also called c-statistic in medicine) can be used as a single number
performance metric for the merge accuracy. In contrast to the traditional precision,
recall, or f-measure it has the advantage that both the ROC-curve and the AUC
are independent of the prior data-distribution and, hence, serve as a more robust
metric to compare the performance of two approaches.</p>
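The AUC can be computed directly from scored pairs as the probability that a randomly chosen true match outscores a randomly chosen non-match (the rank/c-statistic formulation); a small self-contained sketch with hypothetical scores:

```python
def auc(scores_matches, scores_nonmatches):
    """Area under the ROC curve via the rank (Wilcoxon / c-statistic) formulation."""
    wins = sum(1.0 if m > u else 0.5 if m == u else 0.0
               for m in scores_matches for u in scores_nonmatches)
    return wins / (len(scores_matches) * len(scores_nonmatches))

# Toy example: matching probabilities for true matches vs. non-matches.
print(auc([0.9, 0.8, 0.95], [0.1, 0.4, 0.85]))  # 8 of 9 pairs ranked correctly
```

An AUC of 1.0 means every match outscores every non-match, while 0.5 corresponds to random ranking; this is why it is independent of the prior class distribution.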
        <p>
          Results As illustrated in Figure 1, the number of Eigenfaces influences the
accuracy of the match. The accuracy of the algorithm increases with the
number of Eigenfaces up to a specific point, beyond which any further increase is
not beneficial or even detrimental to the overall performance. Thus, the Eigenface
algorithm should use between 50 and 60% of the top-most Eigenfaces, a result
similar to [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. The resulting input parameters for the linear models are shown
in Table 1.
        </p>
        <p>Computational Costs The computational costs for the face-image
comparison are higher than for the single text-based comparison. On our test machine
(an Apple iMac with a 3.06 GHz Intel Core 2 Duo processor and 4 GB
of RAM) the comparison of the four concerned text attributes takes between 10
and 20 ms per pair without data preprocessing; the image-based comparison alone
takes 25 to 35 ms/pair. Additionally, once per image, the face preprocessing,
including face detection and image resizing, takes between five and six seconds.</p>
        <p>Table 1. Input parameters for the linear models:
Attribute | α0 | αName | αEmail | αBirthday | αCity | β
Text-Based Method YT | -0.319 | 25.655 | -1.763 | 9.750 | 25.334 | -
Joined Method YJ | -6.659 | 26.656 | 0.234 | 11.536 | 18.272 | 8.788</p>
        <p>
For the second experiment we collected a subgraph of both Facebook and LinkedIn.
Departing from the first author's profile, we collected 1610 (Facebook) and
1690 (LinkedIn) profiles, respectively, and manually determined that 166 users were
present in both samples. We compared all these profiles with the three
approaches using the input parameters determined in Experiment 1.</p>
        <p>
Results Figure 2 graphs the ROC curves for the three methods. Note that whilst
the text-based method (AUC = 0.986) outperforms the pure image-based method
(AUC = 0.938), the combined method (AUC = 0.998) significantly outperforms
either method (p = 0.001 and p = 0.0001, respectively, compared with the
non-parametric method described by DeLong [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]).
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>Discussion, Limitations and Future Work</title>
        <p>As the above results show, the combined method clearly outperforms each of the
others. It is interesting to observe that the ROC curves of both the text-based and
the image-based method shoot almost straight up until about (0, 0.9). Then
the text-based method flattens out whilst the combined one continues to rise.
This suggests that the bulk of the combined method's accuracy is contributed
by the text-based features; only beyond that point does the image-based method
contribute additional predictive power. When looking at the regression parameters,
this suggestion receives some additional support, as the parameters for Email and
City lose in their contribution whilst the algorithm relies more on Name,
Image, and, interestingly, Birthday.</p>
        <p>Obviously, all these results are limited by the usage of only one, albeit
real-world, dataset and will have to be validated with others. Also, our experiment
assumed that we knew the semantic alignment of the text attributes. When
merging only two SNSs this assumption seems reasonable; when more are involved,
the alignment may introduce additional error. Consequently, we probably
overestimated the accuracy of the textual method.</p>
        <p>Last but not least, a real-world system would probably not perform a full
pairwise comparison to limit the computational expenditure but use some
optimization approach.</p>
        <p>We intend to investigate all these limitations in our future work.</p>
        <p>In this paper we proposed an extension of the traditional text-attribute-based
method for re-identification in social networks using profile images. The
experimental results show that the pure face-recognition-based re-identification
method cannot compete with the traditional text-based methods in accuracy and
computational performance. A combined method, however, significantly
outperforms the pure text-based method in accuracy, suggesting that the images carry
complementary information. As we showed, this combined method significantly
improves the accuracy of a social network merge. Consequently, we believe that it
provides a more solid basis for both researchers and practitioners interested in
investigating multiple SNSs and facing the problems of multiplicity.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bachmann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devanbu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The missing links: Bugs and bug-fix commits</article-title>
          .
          <source>In: ACM SIGSOFT / FSE '10: Proceedings of the 18th International Symposium on the Foundations of Software Engineering</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bekkerman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Disambiguating web appearances of people in a social network</article-title>
          .
          <source>In: Proceedings of WWW 2005</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bollegara</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsuo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishizuka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Extracting key phrases to disambiguate personal names on the web</article-title>
          .
          <source>In: Proceedings of CICLing 2006</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Carmagnola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cena</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>User identification for cross-system personalisation</article-title>
          .
          <source>Information Sciences: an International Journal</source>
          <volume>179</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>16</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Carmagnola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osborne</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torre</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>User data distributed on the social web: how to identify users on different social systems and collecting data about them</article-title>
          .
          <source>In: Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>DeLong</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DeLong</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clarke-Pearson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach</article-title>
          .
          <source>Biometrics</source>
          <volume>44</volume>
          (
          <issue>3</issue>
          ),
          <fpage>837</fpage>
          -
          <lpage>845</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Elmagarmid</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ipeirotis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verykios</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Duplicate record detection: A survey</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>19</volume>
          (
          <issue>1</issue>
          ) (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fahrmeir</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pigeot</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tutz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <source>Statistik - Der Weg zur Datenanalyse</source>
          . Springer-Verlag Berlin Heidelberg New York (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fellegi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sunter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A theory for record linkage</article-title>
          .
          <source>Journal of the American Statistical Association (64)</source>
          ,
          <fpage>1183</fpage>
          -
          <lpage>1210</lpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Acquisti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Information revelation and privacy in online social networks</article-title>
          .
          <source>In: Workshop On Privacy In The Electronic Society, Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society</source>
          . pp.
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Demirel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Clarke, T.J., P.C.:
          <article-title>Adaptive automatic facial feature segmentation</article-title>
          .
          <source>Proc. of 2nd International Conference on Automatic Face and Gesture Recognition</source>
          . pp.
          <fpage>277</fpage>
          -
          <lpage>282</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Leonard</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Houben</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Sluijs</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hidders</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herder</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heckmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>User profile elicitation and conversion in a mashup environment</article-title>
          .
          <source>In: Int. Workshop on Lightweight Integration on the Web, in conjunction with ICWE 2009</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Malin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Unsupervised Name Disambiguation via Social Network Similarity</article-title>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Matsuo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mori</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamasaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishizuka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Polyphonet: An advanced social network extraction system</article-title>
          .
          <source>In: Proceedings of the 15th International World Wide Web Conference</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Flink: Semantic web technology for the extraction and analysis of social networks</article-title>
          .
          <source>Journal of Web Semantics</source>
          <volume>3</volume>
          ,
          <fpage>211</fpage>
          -
          <lpage>223</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Descriptions of social relations</article-title>
          .
          <source>In: Proceedings of the First Workshop on Friend of a Friend, Social Network and the Semantic Web</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Rowe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Interlinking distributed social graphs</article-title>
          .
          <source>In: Proc. Linked Data on the Web Workshop, 18th Int. World Wide Web Conference</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Singla</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Entity resolution with markov logic</article-title>
          .
          <source>In: ICDM Sixth International Conference on Data Mining</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sirovich</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirby</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Application of the Karhunen-Loève procedure for the characterization of human faces</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>12</volume>
          ,
          <fpage>103</fpage>
          -
          <lpage>108</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Turk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pentland</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Face recognition using eigenfaces</article-title>
          .
          <source>Conference on Computer Vision and Pattern Recognition</source>
          (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Turkle</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Cyberspace and identity</article-title>
          .
          <source>Contemporary Sociology</source>
          <volume>28</volume>
          (
          <issue>6</issue>
          ),
          <fpage>643</fpage>
          -
          <lpage>648</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Veldman</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Matching Profiles from Social Network Sites</article-title>
          .
          <source>Master's thesis</source>
          , University of Twente (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Viola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Rapid object detection using a boosted cascade of simple features</article-title>
          .
          <source>In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Wechsler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Reliable Face Recognition Methods - System, Design, Implementation and Evaluation</article-title>
          .
          <source>Springer Media LLC</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chellappa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Face Processing - Advanced Modeling and Methods</article-title>
          . Academic Press (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Winkler</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>The state of record linkage and current research problems</article-title>
          . Tech. rep., Statistical Research Division, U.S. Census Bureau. (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>