CEUR-WS Vol-2073, article 02 — https://ceur-ws.org/Vol-2073/article-02.pdf
A Knowledge Base for Personal Information Management

David Montoya (Square Sense) — david@montoya.one
Thomas Pellissier Tanon (LTCI, Télécom ParisTech) — ttanon@enst.fr
Serge Abiteboul (Inria Paris & DI ENS, CNRS, PSL Research University) — serge.abiteboul@inria.fr
Pierre Senellart (DI ENS, CNRS, PSL Research University & Inria Paris & LTCI, Télécom ParisTech) — pierre@senellart.com
Fabian M. Suchanek (LTCI, Télécom ParisTech) — suchanek@enst.fr

LDOW 2018, April 2018, Lyon, France

ABSTRACT
Internet users have personal data spread over several devices and across several web systems. In this paper, we introduce a novel open-source framework for integrating the data of a user from different sources into a single knowledge base. Our framework integrates data of different kinds into a coherent whole, starting with email messages, calendar, contacts, and location history. We show how event periods in the user's location data can be detected and how they can be aligned with events from the calendar. This allows users to query their personal information within and across different dimensions, and to perform analytics over their emails, events, and locations. Our system models data using RDF, extending the schema.org vocabulary and providing a SPARQL interface.

1 INTRODUCTION
Internet users commonly have their personal data spread over several devices and services. This includes emails, messages, contact lists, calendars, location histories, and many others. However, commercial systems often function as data traps, where it is easy to check information in but difficult to query and exploit it. For example, a user may have all her emails stored with an email provider, but cannot find out which of her colleagues she interacts with most frequently. She may have all her location history on her phone, but cannot find out at which of her friends' places she spends the most time. Thus, a user often has, paradoxically, no means to make full use of data that she has created or provided. As more and more of our lives happen in the digital sphere, users are actually giving away part of their life to external data services.

We aim to put the user back in control of her own data. We introduce a novel framework that integrates and enriches personal information from different sources into a single knowledge base (KB) that lives on the user's machine, a machine she controls. Our system, Thymeflow, replicates data of different kinds from outside services and thus acts as a digital home for personal data. This provides the user with a high-level global view of that data, which she can use for querying and analysis. All of this integration and analysis happens locally on the user's computer, thus guaranteeing her privacy.

Designing such a personal KB is not easy: data of completely different natures has to be modeled in a uniform manner, pulled into the knowledge base, and integrated with other data. For example, we have to find out that the same person appears with different email addresses in address books from different sources. Standard KB alignment algorithms do not perform well in our scenario, as we show in our experiments. Furthermore, integration spans data of different modalities: to create a coherent user experience, we need to align calendar events (temporal information) with the user's location history (spatiotemporal) and place names (spatial).

We provide a fully functional and open-source personal knowledge management system. A first contribution of our work is the management of location data. Such information is becoming commonly available through the use of mobile applications such as Google's Location History [20]. We believe that such data becomes useful only if it is semantically enriched with events and people in the user's personal space. We provide such an enrichment.

A second contribution is the adaptation of ontology alignment techniques to the context of personal KBs. The alignment of persons and organizations is rather standard. More novel are alignments based on time (a meeting in the calendar and a GPS location) or space (an address in contacts and a GPS location).

Our third contribution is an architecture that allows the integration of heterogeneous personal data sources into a coherent whole. This includes the design of incremental synchronization, where a change in a data source triggers the loading and treatment of just these changes in the central KB. Conversely, the user is able to perform updates on the KB, which are made persistent wherever possible in the sources. We also show how to integrate knowledge enrichment components into this process, such as entity resolution and spatio-temporal alignments.

As implemented, our system can provide answers to questions such as: Who have I contacted the most in the past month (requires alignments of different email addresses)? How many times did I go to Alice's place last year (requires alignment between contact list and location history)? Where did I have lunch with Alice last week (requires alignment between calendar and location history)?

Our system, Thymeflow, was previously demonstrated in [32]. It is based on an extensible framework available under an open-source software license¹. People can therefore freely use it, and researchers can build on it.

We first introduce our data model and sources in Section 2, and then present the system architecture of Thymeflow in Section 3.

¹ https://github.com/thymeflow/thymeflow


Section 4 details our knowledge enrichment processes, and Section 5 our experimental results. Related work is described in Section 6. Before concluding in Section 8, we discuss lessons learnt while building and experimenting with Thymeflow in Section 7.

2 DATA MODEL
In this section, we briefly describe the schema of the knowledge base, and discuss the mapping of data sources to that schema.

Schema. We use the RDF standard [9] for knowledge representation. We use the namespace prefixes schema for http://schema.org/, and rdf and rdfs for the standard namespaces of RDF and RDF Schema, respectively. A named graph is a set of RDF triples associated with a URI (its name). A knowledge base (KB) is a set of named graphs.

For modeling personal information, we use the schema.org vocabulary when possible. This vocabulary is supported by Google, Microsoft, Yahoo, and Yandex, and documented online. Wherever this vocabulary is not fine-grained enough for our purposes, we complement it with our own vocabulary, which lives in the namespace http://thymeflow.com/personal# with prefix personal.

Figure 1 illustrates a part of our schema. Nodes represent classes, rounded colored ones are non-literal classes, and an edge with label p from X to Y means that the predicate p links instances of X to instances of type Y. We use locations, people, organizations, and events from schema.org, and complement them with more fine-grained types such as Stay, EmailAddress, and PhoneNumber. The Person and Organization classes are aggregated into a personal:Agent class.

Emails and contacts. We treat emails in the RFC 822 format [8]. An email is represented as a resource of type schema:Email with properties such as schema:sender, personal:primaryRecipient, and personal:copyRecipient, which link to personal:Agent instances. Other properties are included for the subject, the sent and received dates, the body, the attachments, the threads, etc.

Email addresses are great sources of knowledge. An email address such as "jane.doe@inria.fr" provides the given and family names of a person, as well as her affiliation. However, some email addresses provide less knowledge and some almost none, e.g., "j4569@gmail.com". Sometimes, email fields contain a name, as in "Jane Doe <j4569@gmail.com>", which gives us a name triple. In our model, personal:Agent instances extracted from emails with the same combination of email address and name are considered indistinguishable (i.e., they are represented by the same URI). An email address does not necessarily belong to an individual; it can also belong to an organization, as in edbt-school-2013@imag.fr or fancy_pizza@gmail.com. This is why, for instance, the sender, in our data model, is a personal:Agent, and not a schema:Person.

A vCard contact [36] is represented as an instance of personal:Agent with properties such as schema:familyName and schema:address. We normalize telephone numbers, based on a country setting provided by the user.

Calendar. The iCalendar format [11] can represent events. We model them as instances of schema:Event, with properties such as name, location, organizer, attendee, and date. The location is typically given as a postal address, and we will discuss later how to associate it to geo-coordinates and richer place semantics. The Facebook Graph API [15] also models events the user is attending or interested in, with richer location data and a list of attendees (as a list of names).

Location history. Smartphones are capable of tracking the user's location over time using different positioning technologies: satellite navigation, Wi-Fi, and cellular. Location history applications continuously run in the background, and store the user's location either locally or on a distant server. Each point in the user's location history is represented by time, longitude, latitude, and horizontal accuracy (the measurement's standard error). We use the Google Location History format, in JSON, as Google users can easily export their history in this format. A point is represented by a resource of type personal:Location with properties schema:geo, for geographic coordinates with accuracy, and personal:time for time.

3 SYSTEM ARCHITECTURE
A personal knowledge base could be seen as a view defined over personal information sources. The user would query this view in a mediation style [17] and the data would be loaded only on demand. However, accessing, analyzing, and integrating these data sources on the fly would be expensive tasks. For this reason, Thymeflow uses a warehousing approach. Data is loaded from external sources into a persistent store and then enriched.

Thymeflow is a web application that the user installs, providing it with a list of data sources, together with credentials to access them (such as tokens or passwords). The system accesses the data sources and pulls in the data. All code runs locally on the user's machine. None of the data leaves the user's computer. Thus, the user remains in complete control of her data. The system uses adapters to access the sources, and to transform the data into RDF. We store the data in a persistent triple store, which the user can query using SPARQL.

One of the main challenges in the creation of a personal KB is the temporal factor: data sources may change, and these updates should be reflected in the KB. Changes can happen during the initial load time, while the system is asleep, or after some inferences have already been computed. To address these dynamics, Thymeflow uses software modules called synchronizers and enrichers. Figure 2 shows synchronizers on the left, and enrichers in the center. Synchronizers are responsible for accessing data sources, enrichers (see Section 4) for inferring new statements, such as alignments between entities obtained by entity resolution.

Modules are scheduled dynamically and may be triggered by updates in the data sources (e.g., calendar entries) or by new pieces of information derived in the KB (e.g., the alignment of a position in the location history with a calendar event). The modules may also be started regularly for particularly costly alignment processes. When a synchronizer detects a change in a source, a pipeline of enricher modules is triggered, as shown in Figure 2. Enrichers can also use knowledge from external data sources, such as Wikidata [44], Yago [41], or OpenStreetMap.

Synchronizer modules are responsible for retrieving new data from a data source. For each data source that has been updated, the adapter for that particular source transforms the source updates since the last synchronization into a set of insertions/deletions in RDF.
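The deltas produced by synchronizers can be pictured as two sets of triples (insertions and deletions) tied to the named graph they affect. The following is an illustrative sketch only, with hypothetical names, not Thymeflow's actual module interface:

```python
from dataclasses import dataclass, field

@dataclass
class DeltaUpdate:
    """Changes to the KB since the last synchronization of one data source."""
    graph: str                                      # named graph recording provenance
    insertions: set = field(default_factory=set)    # triples to add
    deletions: set = field(default_factory=set)     # triples to retract

def apply_delta(kb, delta):
    """Apply a delta to a KB modeled as a set of (graph, triple) pairs."""
    # Retract the deleted triples, but only from the delta's own graph.
    kb = {entry for entry in kb
          if not (entry[0] == delta.graph and entry[1] in delta.deletions)}
    # Add the inserted triples under the same graph.
    kb |= {(delta.graph, t) for t in delta.insertions}
    return kb
```

Keying both insertions and deletions by named graph is what lets a re-synchronization of one source leave statements from other sources (and from enrichers) untouched.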


[Figure omitted: schema diagram. Classes include EmailMessage, EmailAddress, PhoneNumber, Stay, Event, Agent, Person, Organization, Location, GeoCoordinates, GeoVector, Place, PostalAddress, Address, and Country, linked by schema: and personal: properties such as sender, recipient, attendee, organizer, location, geo, address, name, givenName, familyName, and birthDate. Legend: classes and properties come from the schema:, personal:, or xsd: namespaces; dashed edges denote rdfs:subClassOf.]

Figure 1: Personal data model
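For illustration, the classes and properties of Figure 1 can be instantiated as plain triples. The sketch below uses hypothetical URIs and literal values, and is not Thymeflow's actual internal representation:

```python
# Hypothetical example: an email message and its sender modeled with the
# schema.org vocabulary plus the personal: extension described above.
SCHEMA = "http://schema.org/"
PERSONAL = "http://thymeflow.com/personal#"

msg = "urn:example:msg1234"            # hypothetical message URI
agent = "urn:example:agent/jane-doe"   # hypothetical agent facet URI

triples = [
    (msg, SCHEMA + "headline", "Lunch on Friday?"),
    (msg, SCHEMA + "dateSent", "2018-04-23T10:41:00Z"),
    (msg, SCHEMA + "sender", agent),          # sender is an Agent, not a Person
    (agent, SCHEMA + "name", "Jane Doe"),
    (agent, SCHEMA + "email", "mailto:jane.doe@inria.fr"),
]

# A toy lookup: who sent this message?
senders = [o for (s, p, o) in triples if s == msg and p == SCHEMA + "sender"]
```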


[Figure omitted: personal information sources P1..Pn and external sources X1..Xm feed synchronizers S1..Sn; a loader/updater produces ∆0, which flows through an enricher pipeline E1..Ep (producing ∆1..∆p) into the knowledge base; the KB also receives user updates and serves SPARQL query answering, data analysis, data browsing, and visualization.]

Figure 2: System architecture
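The enricher pipeline of Figure 2 can be sketched as a chain of functions, each taking the KB and the changes produced so far (∆i) and returning new statements (∆i+1). This is a simplification with a hypothetical toy enricher, not the system's actual module API:

```python
def run_pipeline(kb, delta0, enrichers):
    """Feed each enricher the KB plus the changes produced so far.

    Each enricher is a function (kb, delta) -> new_delta; its output is
    merged into the KB and passed on, mirroring the deltas of Figure 2.
    """
    delta = delta0
    for enrich in enrichers:
        kb |= delta                  # make previous changes visible in the KB
        delta = enrich(kb, delta)    # compute the next delta
    kb |= delta
    return kb

def toy_entity_resolver(kb, delta):
    """Hypothetical enricher: link two agents that share an email address."""
    seen = {}
    out = set()
    for (s, p, o) in kb | delta:
        if p == "schema:email":
            if o in seen and seen[o] != s:
                out.add((s, "personal:sameAs", seen[o]))
            seen[o] = s
    return out
```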


This is of course relatively simple for data sources that track modifications, e.g., CalDAV (calendar), CardDAV (contacts), and IMAP (email). For others, this requires more processing. The result of this process is a delta update, i.e., a set of updates to the KB since the last time that particular source was considered.

The KB records the provenance of each newly obtained piece of information. Synchronizers record a description of the data source, and enrichers record their own name. We use named graphs to store the provenance. For example, the statements extracted from an email message in the user's email server will be contained in a graph named with the concatenation of the server's email folder URL and the message id. The graph's URI is itself an instance of personal:Document, and is related to its source via the personal:documentOf property. The source is an instance of personal:Source and is in this case the email server's URL. Account information is included in an instance of personal:Account via the personal:sourceOf property. Account instances allow us to gather different kinds of data sources (e.g., CardDAV, CalDAV, and IMAP servers) belonging to one provider (e.g., corporate IT services) which the user accesses through one identification. This provenance can be used to answer queries such as "What meetings were recorded in my work calendar for next Monday?".

Finally, the system allows the propagation of information from the KB to the data sources. These can either be insertions/deletions derived by the enrichers, or insertions/deletions explicitly specified by the user. For instance, consider the information that different email addresses correspond to the same person. This information can be pushed to data sources, which may for example result in merging two contacts in the user's list of contacts. To propagate the information to the source, we translate from the structure and terminology of the KB back to that of the data source and use the API of that source. The user has the means of controlling this propagation, e.g., specifying whether contact information in our system should be synchronized to her phone's contact list.

The user can update the KB directly by inserting or deleting knowledge statements. Such updates to the KB are specified in the SPARQL Update language [18]. When no source is specified for recording this new information, the system considers all the sources that know the subject of the particular statement. For insertions, if no source is able to register a corresponding insertion, the system performs the insertion in a special locally persistent graph, called the overwrite graph. For deletions, if one source fails to perform a deletion (e.g., because the statement is read-only), the system removes the statement from the KB anyway (even if the data is still in some upstream source). A negative statement is added to the overwrite graph. This negative statement will prevent a source statement from reintroducing the corresponding statement in the KB: the negative statement overwrites the source statement.

4 ENRICHERS
We describe the general principles of enricher modules. We then describe two specific enrichments: agent matching and event geolocation.

After loading, enricher modules perform inference tasks such as entity resolution, event geolocation, and other knowledge enrichment tasks. An enricher works in a differential manner: it takes as input the current state of the KB, and a collection of changes ∆i that have recently happened. It computes a new collection ∆i+1 of enrichments. Intuitively, this allows reacting to changes in a data source. When some ∆0 is detected (typically by some synchronizer), the system runs a pipeline of enrichers to take these changes into consideration. For instance, when a new entry is entered in the calendar with an address, a geocoding enricher is called to locate it. Another enricher will later attempt to match it with a position in the location history. For performance, particularly costly enrichers wait until there are enough changes, or until no more changes are happening, before running on a batch of changes. This is the case for the entity resolution enricher. We now present this enricher and another one that has been incorporated into the system.

4.1 Agent Matching
Facets. The KB keeps information as close to the original data as possible. Thus, the knowledge base will typically contain several entities for the same person, if that person appears with different names or different email addresses. We call such resources facets of the same real-world agent. Different facets of the same agent will be linked by the personal:sameAs relation. The task of identifying equivalent facets has been intensively studied under different names such as record linkage, entity resolution, or object matching [5]. In our case, we use techniques that are tailored to the context of personal KBs: identifier-based matching and attribute-based matching.

Identifier-based matching. We can match two facets if they have the same value for some particular attribute (such as an email address or a telephone number), which, in some sense, identifies or determines the entity. This approach is commonly used in personal information systems (in research and industry) and gives fairly good results for linking, e.g., facets extracted from emails and the ones extracted from contacts. Such a matching may occasionally be incorrect, e.g., when two spouses share a mobile phone or two employees share the same customer relations email address. In our experience, such cases are rare, and we postpone their study to future work.

Two agent facets with the same first and family names have, for instance, a higher probability of representing the same agent than two agent facets with different names, all other attributes held constant. Besides names, attributes that can help determine a matching include schema:birthDate, schema:gender, and schema:email.

We tried holistic matching algorithms for graph alignments [40] that we adapted to our setting. The results turned out to be disappointing (see Section 5). We believe this is due to the following: (i) almost all agent facets have a schema:email, and possibly a schema:name, but most of them lack other attributes, which are thus almost useless; (ii) names extracted from emails may contain pseudonyms, abbreviations, or lack family names, which reduces matching precision; (iii) we cannot reliably compute name frequency metrics from the knowledge base, since a rare name may appear many times for different email addresses if a person happens to be a friend of the user. Therefore, we developed our own algorithm, AgentMatch, which works as follows:

   (1) We partition Agents using the equivalence relation computed by matching identifying attributes.


   (2) For each Agent equivalence class, we compute its corre-
       sponding set of names, and, for each name, its number of
       occurrences (in email messages, etc.).
   (3) We compute Inverse Document Frequency (IDF) scores,
       where the documents are the equivalence classes, and the
       terms are the name occurrences.
   (4) For each pair of equivalence classes, we compute a numerical
       similarity between each pair of names using an approximate
       string distance that finds the best matching of words between
       the two names and then compares matched words using
       another string similarity function (discussed below). The sim-
       ilarity between two names is computed as a weighted mean
       using the sum of word IDFs as weights. The best matching
       of words corresponds to a maximum-weight matching in
       the bipartite graph of words, where weights are computed
       using the second string similarity function. The similarity
       (in [0, 1]) between two equivalence classes is computed as a
       weighted mean of name-pair similarities, using the product
       of word occurrences as weights.
   (5) Pairs for which the similarity is above a certain threshold
       are considered to correspond to two equivalent facets.
The second similarity function we use is based on the Levenshtein
edit distance, after string normalization (accent removal and lower-
casing). In our experiments, we have also tried the Jaro–Winkler
distance. For performance reasons, we use 2- or 3-gram-based in-
dexing of words in agent names, and only consider in step (4) of the
process those Agent parts with some ratio S of q-grams in common
in at least one word. For instance, two Agent parts with names
“Susan Doe” and “Susane Smith” would be candidates.

4.2    Geolocating Events
We discuss how to geolocate events, e.g., how we can detect that
Monday’s lunch was at “Shana Thai Restaurant, 311 Moffett Boule-
vard, Mountain View, CA 94043”. For this, we first analyze the
location history from the user’s smartphone to detect places where
the user stayed for a prolonged period of time. We then perform
a spatiotemporal alignment between such stays and the events
in the user’s calendar. Finally, we use geocoding to provide location
semantics to the events, e.g., a restaurant name and a street address.

   Detecting stays. Locations in the user’s location history can be
put into two categories: stays and moves. Stays are locations where
the user remained for some period of time (e.g., dinner at a restau-
rant, gym training, office work), and moves are the others. Moves
usually correspond to locations along a journey from one place to
another, but might also correspond to richer outdoor activity (e.g.,
jogging, sightseeing). Figure 3 illustrates two stay clusters located
inside the same building.

Figure 3: Two clusters of stays (blue points inside black cir-
cles) within the same building. Red points are outliers. The
other points are moves.

   To transform the user’s location history into a sequence of stays
and moves, we perform time-based spatial clustering [29]. The idea
is to create clusters along the time axis. Locations are sorted by
increasing time, and each new location is either added to an exist-
ing cluster (that is geographically close and that is not too old), or
added to a new cluster. To do so, a location is spatially represented
as a two-dimensional unimodal normal distribution N(µ, σ²). The
assumption of a normally distributed error is typical in the field of
processing location data. For instance, a cluster of size 1 formed
by location point p = (t, x, y, a), where t is the time, a the accu-
racy, and (x, y) the coordinates, is represented by the distribution
P = N(µ_P = (x, y), σ_P² = a²). When checking whether location
p can be added to an existing cluster C represented by distribu-
tion Q, the process computes the Hellinger distance [34] between
the distribution P and the normal distribution Q = N(µ_Q, σ_Q²):

   H²(P, Q) = 1 − √(2σ_P σ_Q / (σ_P² + σ_Q²)) · exp(−¼ · d(µ_P, µ_Q)² / (σ_P² + σ_Q²)) ∈ [0, 1],

where d(µ_P, µ_Q) is the geographical distance between cluster centers.
The Hellinger distance takes into account both the accuracy and the
geographical distance between cluster centers, which allows us
to handle outliers no matter the location accuracy. The location
is added to C if this distance is below a certain threshold λ, i.e.,
H²(P, Q) ⩽ λ² < 1. In our system, we used a threshold of 0.95.
   When p is added to cluster C, the resulting cluster is defined with
a normal distribution whose expectation is the arithmetic mean of
the location point centers weighted by the inverse accuracy squared,
and whose variance is the harmonic mean of the accuracies squared.
Formally, if a cluster C is formed by locations {p₁, …, pₙ}, where
pᵢ = (tᵢ, xᵢ, yᵢ, aᵢ), then C is defined with distribution N(µ, σ²),
where µ is the arithmetic mean of the location centers (xᵢ, yᵢ)
weighted by their inverse accuracy squared aᵢ⁻², and the variance
σ² is the harmonic mean of the location accuracies squared aᵢ²:

   µ = ( Σᵢ₌₁ⁿ (xᵢ, yᵢ)/aᵢ² ) · ( Σᵢ₌₁ⁿ 1/aᵢ² )⁻¹        σ² = n · ( Σᵢ₌₁ⁿ 1/aᵢ² )⁻¹

The coordinates are assumed to have been projected to a Euclidean
plane locally approximating distances and angles on Earth around
the cluster points. If n = 1, then µ = (x₁, y₁) and σ² = a₁², which
corresponds to the definition of a cluster of size 1.
   A cluster that lasted more than a certain threshold is a candidate
for being a stay. A difficulty is that a single location history (e.g.,
Google Location History) may record locations of different devices,
e.g., a telephone and a tablet. The identity of the device may not be
recorded. The algorithm understands that two far-away locations,
very close in time, must come from different devices. Typically, one
of the devices is considered to be stationary, and we try to detect a
movement of the other. Another difficulty comes when traveling in
high-speed trains with poor network connectivity. Location trackers
will often give the same location for a few minutes, which leads to
the detection of an incorrect stay.

   Matching stays with events. After the extraction of stays using
the previous algorithm, the next step is to match these with calendar
events. Such a matching turns out to be difficult because: (i) the
location of an event (address or geo-coordinates) is often missing;
(ii) when present, an address often does not identify a geographical
entity, as in “John’s home” or “room C110”; (iii) in our experience,
starting times are generally reasonable (although a person may be
late or early for a meeting) but durations are often not meaningful
(around 70% of events in our test datasets were scheduled for 1 hour;
among the 1-hour events that we aligned, only 9% lasted between
45 and 75 minutes); (iv) some stays are incorrect.
   Because of (i) and (ii), we do not rely much on the location
explicitly listed in the user’s calendars. We match a stay with an
event primarily based on time: the time overlap (or proximity) and
the duration. In particular, we match the stay and the event if the
ratio of the overlap duration over the entire stay duration is greater
than a threshold θ. As we have seen, event durations are often
unreliable because of (iii). Our method still yields reasonable results,
because it tolerates errors on the start of the stay for long stays
(because of their duration) and for short ones (because calendar
events are usually scheduled for at least one hour). If the event has
geographical coordinates, we filter out stays that are too far away
from that location (i.e., when the distance is greater than δ). We
discuss the choice of θ and δ for this process in Section 5.

   Geocoding event addresses. Once stays are associated with events,
we enrich the events with rich place semantics (country, street name,
postal code, place name). If an event has an explicit address, we use
a geocoder. Thymeflow allows using different geocoders, e.g., the
Google Maps Geocoding API [21], which returns the geographic
coordinates of an address, along with structured place and address
data. The enricher only keeps the geocoder’s most relevant result
and adds its data (geographic coordinates, identifier, street address,
etc.) to the location in the knowledge base. For events that do not
have an explicit address but that have been matched to a stay, we
use the geocoder to transform the geographic coordinates of the
stay into a list of nearby places. The most precise result is added
as the event location. If the event has both an explicit address and
a match with a stay, we call the geocoder on this address, while
restricting the search to a small area around the stay coordinates.

5   EXPERIMENTS
In this section, we present the results of our experiments. We used
datasets from two real users, whom we call Angela and Barack. An-
gela’s dataset consists of 7,336 emails, 522 calendar events, 204,870
location points, and 124 contacts extracted from Google’s email,
contact, calendar, and location history services. This corresponds
to 1.6M triples in our schema. Barack’s dataset consists of 136,301
emails, 3,080 calendar events, 1,229,245 location points, and 582
contacts extracted from the same sources. Barack’s emails cover a
period of 5,540 days, and his locations cover 1,676 days. This corresponds to
10.3M triples, where 70.9 % come from the location history, 28.8 %
from emails, 0.3 % from calendars, and less than 0.1 % from contacts.
   We measured the loading times of Angela’s dataset into the
system in two different scenarios: from source data on the Internet
(using the Google API, except for the location history, which is not
provided by the API and was loaded from a file), and from source
data stored in local files. Loading took 19 and 4 minutes, respectively,
on a desktop computer (Intel i7-2600k 4-core, 3.4 GHz, 20 GB RAM,
SSD).

5.1    Agent Matching
We evaluated the precision and recall of the AgentMatch algorithm
(Section 4) on Barack’s dataset. This dataset contains 40,483 Agent
instances with a total of 25,381 schema:name values, of which 17,706
are distinct; it also contains 40,455 schema:email values, of which
24,650 are distinct. To compute the precision and recall, we sampled
2,000 pairs of distinct Agents, and asked Barack to manually assign
to each pair a ground-truth value (true/false). Barack was
provided with the email address and name of each agent, and was
allowed to query the KB to get extra information.
   We tested both Levenshtein and Jaro–Winkler as the secondary
string distance, with and without IDF term weights. The term q-
gram match ratio (S) was set to 0.6. We varied λ so as to maximize
the F1 value. Precision decreases while recall increases for decreas-
ing threshold values. Our baseline is IdMatch, which matches two
contacts iff they have the same email address.
   As a competitor, we considered PARIS [40], an ontology alignment
algorithm that is parametrized by a single threshold. We used string
similarity for email addresses, and the name similarity metric used
by AgentMatch, except that it is applied to single Agent instances.
PARIS computes the average number of outgoing edges for each
relation. Since our dataset contains duplicates, we gave PARIS an
advantage by computing these values upfront.
   We also considered Google’s “Find duplicates” feature. Google
was not able to handle more than 27,000 contacts at the same time,
and so we had to run it multiple times in batches. Since the fi-
nal output depends on the order in which contacts were loaded,
we present two results, one for which the contacts were supplied
sorted by email address (Google1), and another for a random order
(Google2). Since Google’s algorithm failed to merge contacts that
IdMatch did merge, we also tested running IdMatch on Google’s
output (GoogleId) for both runs. We also tested the Mac OS X contact
de-duplication feature. However, its result did not contain all the
metadata from the original contacts, so we could not evaluate
this feature.
   The results are shown in Table 1. As expected, our baseline
IdMatch has a perfect precision, but a low recall (43%). Google,
likewise, gives preference to precision, but achieves a higher recall
than the baseline (50%). The recall improves further if Google
is combined with IdMatch (61%). PARIS, in contrast, favors recall
(92%) over precision (83%), and achieves a better F1 value overall.
The highest F1-measure (95%) is reached by AgentMatch with the
Jaro–Winkler distance for a threshold of 0.825. It has a precision
comparable to Google’s, and a recall comparable to PARIS’s.

5.2    Detecting Stays
We evaluated the extraction of stays from the location history on
Barack’s dataset. We randomly chose 15 days, and presented him
a web interface with (1) the raw locations on a map, (2) the stays
detected by Thymeflow, and (3) the stays detected by his Google
Timeline [20]. Barack was then asked to annotate each stay as “true”,
i.e., corresponding to an actual stay, or “false”, and to report missing
stays, based on his memories. He was also allowed to use any other
available knowledge (e.g., his calendar). The exact definition of an
actual stay was left to Barack, and the reliability of his annotations
was dependent on his recollection. In total, Barack found 64 ac-
tual stays. Sometimes, an algorithm would detect an actual stay
as multiple consecutive stays. In that case, we counted one true
stay, and counted the number of duplicates and the resulting move
duration in between. For instance, an actual stay of 2 hours output
as two stays of 29 min and 88 min, with a short move of 3 minutes
in between, would count as 1 true stay, 1 duplicate, and 3 minutes
of duplicate move duration.
   Table 2 shows the resulting precision and recall for each method,
with varying stay duration thresholds for Thymeflow. We also show
the duplicate ratio #D (number of duplicates over the number of true
stays) and the move duration ratio DΘ (duplicate move duration
over the total duration of true stays). Overall, Thymeflow obtains
results comparable to Google Timeline for stay extraction, with
better precision and recall, but more duplicates.

Table 1: Precision and recall of the Agent Matching task
on Barack’s dataset, for different parameters of the Agent-
Match, IdMatch, PARIS and Google algorithms

  Algorithm     Similarity      IDF    λ        Prec.    Rec.     F1
  IdMatch                                       1.000    0.430    0.601
  Google1                                       0.995    0.508    0.672
  Google2                                       0.996    0.453    0.623
  GoogleId2                                     0.997    0.625    0.768
  GoogleId1                                     0.996    0.608    0.755
  PARIS         Jaro–Winkler    T      0.425    0.829    0.922    0.873
  AgentMatch    Levenshtein     F      0.725    0.945    0.904    0.924
  AgentMatch    Levenshtein     T      0.775    0.948    0.900    0.923
  AgentMatch    Jaro–Winkler    F      0.925    0.988    0.841    0.909
  AgentMatch    Jaro–Winkler    T      0.825    0.954    0.945    0.949

Table 2: Stay extraction evaluation on Barack’s dataset

  Method    θ     #D       DΘ      Prec.     Recall     F1
  Thyme     15    33.3 %   0.6 %   91.3 %    98.4 %     94.7 %
  Thyme     10    46.0 %   0.9 %   82.9 %    98.4 %     90.0 %
  Thyme      5    62.5 %   1.0 %   62.1 %    100.0 %    76.6 %
  Google          17.5 %   N/A     87.7 %    89.1 %     88.4 %

   Matching stays with events. We also evaluated the matching of
stays in the user’s location history with the events in their calendar.
We sampled stays from Angela’s and Barack’s datasets, and produced
all possible matchings to events, i.e., all matchings produced by the
algorithm whatever the threshold. Angela and Barack were then
asked to manually label the matches as correct or incorrect. The
matching process relies on two parameters, namely the duration
ratio threshold θ and the filtering distance δ. We varied θ and found
that a value of 0.2 leads to the best F1 values. With this value, we varied
δ, and found that the performance improves consistently with larger
values. This indicates that filtering out stays that are too far from
the event location coordinates (where available) is not helpful.
With these settings, the matching performs
quite well: we achieve a precision and recall of around 70%.

5.3    Geocoding
We evaluated different geocoding enrichers for the (event, stay) pairs
described in Section 4. We used the Google Maps Geocoding API.
We considered three different inputs to this API: the event loca-
tion address attribute (Event), the stay coordinates (Stay), and the
event location attribute with the stay coordinates given as a bias
(StayEvent). We also considered a more precise version of Event,
which produces a result only if the geocoder returns a single un-
ambiguous location (EventSingle). Finally, we devised the method
StayEvent+Stay, which returns the StayEvent result if it exists, and
the Stay result otherwise.
   For each input, the geocoder gave either no result (M), a false
result (F), a true place (T), or just a true address (A). For instance,
an event occurring in “Hôtel Ritz Paris” is true if the output is, for
instance, “Ritz Paris”, while an output of “15 Place Vendôme, Paris”
would count as a true address. For comparison, we also evaluated
the place given by Google Timeline [20].

Table 3: Geocoding task on matched (event, stay) pairs in
Barack’s dataset (in %)

  Method            M     F      T      PT     T|A    PT|A    F1
  GoogleTimeline     0    82.8   14.8   14.8   17.2   17.2    29.4
  EventSingle       50    40.8    4.0    7.9    9.6   19.0    27.7
  Event             26    63.6    4.0    5.4   10.0   13.6    22.9
  Stay               0    69.6    0.8    0.8   30.4   30.4    46.6
  StayEvent         50    16.4   27.2   54.4   33.6   67.2    57.3
  StayEvent+Stay     0    50.0   28.4   28.4   50.0   50.0    66.7

   Due to limitations of the API, geocoding from stay coordinates
mostly yielded address results (99.2% of the time). To better evaluate
these methods, we computed the number of times in which the
output was either a true place or a true address (denoted T|A). For
those methods that did not always return a result, we computed a
precision metric PT|A (resp., PT), that is equal to the ratio of T|A
(resp., T) to the number of times a result was returned. We computed
an F1-measure based on the PT|A precision, and a recall assimilated
to the number of times the geocoder returned a result (1 − M).
   The evaluation was performed on 250 randomly picked (stay,
event) pairs from Barack’s dataset. The results are shown in Ta-
ble 3. Google Timeline gave the right place or address only
17.2% of the time. The EventSingle method, likewise, performs
poorly, indicating that the places are indeed highly ambiguous. The
best precision (PT|A of 67.2%) is obtained by geocoding with
StayEvent, but this method returns a result only 50.0% of the time.
StayEvent+Stay, in contrast, can find the right place 28.4% of the
time, and the right place or address 50.0% of the time, which is our
best result. We are happy with this performance, considering that
around 45% of the event locations were room numbers without
mention of a building or place name (e.g., C101).
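The StayEvent+Stay fallback can be sketched as follows. The two geocoder callables are hypothetical stand-ins for a real service such as the Google Maps Geocoding API; this is an illustrative sketch, not the actual Thymeflow implementation:

```python
# Sketch of the StayEvent+Stay strategy: geocode the event's address biased
# towards the matched stay, and fall back to reverse-geocoding the stay
# coordinates when no (usable) address result is obtained.
# `geocode` and `reverse_geocode` are hypothetical geocoder callables.

from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)


def stay_event_plus_stay(
    event_address: Optional[str],
    stay_coords: Coords,
    geocode: Callable[..., Optional[dict]],              # geocode(address, bias=coords) -> place or None
    reverse_geocode: Callable[[Coords], Optional[dict]], # coords -> nearby place or None
) -> Optional[dict]:
    """Best-effort location for a matched (event, stay) pair."""
    if event_address:
        # StayEvent: restrict the address search to the area around the stay.
        place = geocode(event_address, bias=stay_coords)
        if place is not None:
            return place
    # Stay: no address, or no result for it; use the stay's coordinates.
    return reverse_geocode(stay_coords)
```

With real geocoder bindings plugged in, the function returns the StayEvent result whenever one exists and the reverse-geocoded stay location otherwise, mirroring the behavior evaluated in Table 3.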
5.4    Use Cases
The user can query the KB using SPARQL [25]. Additionally,
Thymeflow uses a modern triple store supporting full-text and
geospatial queries. Since the KB unites different data sources,
queries can seamlessly span multiple sources and data types. This
allows the user (Angela) to ask, for instance:
      • What are the phone numbers of her birthday party guests?
        (combining information from the contacts and the emails)
      • What places did she visit during her last trip to London?
        (combining geocoding information with stays)
      • For each person she meets more than 3 times a week, what
        are the top 2 places where she usually meets that particular
        person? (based on her calendar and location history)
Such queries are not supported by current proprietary cloud ser-
vices, which do not allow arbitrary queries.

   Hub of personal knowledge. Finally, the user can use the bidi-
rectionality of synchronization to enrich her existing services. For
instance, she can enrich her personal address book (CardDAV) with
knowledge inferred by the system (e.g., a friend’s birth date ex-
tracted from Facebook) using a SPARQL/Update query.

6   RELATED WORK
We now review the related work on personal information man-
agement, on information integration, and on the specific tasks of
location analysis and calendar matching.

6.1    Personal Information Management
This work is motivated by the concept of personal information man-
agement (PIM), taking the viewpoint of [1] as to what a PIM system
should be. [28] groups PIM research and development into the fol-
lowing problems: finding and re-finding, keeping, and organizing
personal information. We now present some notable contributions.

   Finding and Re-finding. PIM has been concerned with improving
how individuals go about retrieving a piece of information to meet
a particular need.
   For searching within the user’s personal computer, desktop full-
text search tools have been developed for various operating
systems and platforms [46]. Search entries are for instance the files
and folders on the file system, email messages, browser history
pages, calendar events, contacts, and applications. Search may be
performed on both the content and the metadata. In particular, the
IRIS [4] and NEPOMUK [24] projects used knowledge represen-
tation technologies to provide semantic search facilities, and go
beyond search to provide facilities for exchanging data between
different applications within a single desktop computer.
   Other research efforts have focused on ameliorating the process

see in our everyday lives [23]. MyLifeBits is a notable documented
example [19]. The advent of cheaper, more advanced and efficient
wearable devices able to capture the different aspects of one’s life
has made lifelogging indiscriminate of what it logs. Lifelogging
activities include recording a history of machine-enabled tasks (e.g.,
communications, editing, the web browser’s history), passively captur-
ing what we see and hear (e.g., via a wearable camera), monitoring
personal biometrics (e.g., steps taken, sleep quality), and logging
mobile device and environmental context (e.g., the user’s location,
smart home sensing).
   Different from lifelogging, which does not focus on the analysis
of the logged information, the quantified self is a movement to
incorporate data acquisition technology on certain focused aspects
of one’s daily life [23]. The quantified self focuses on logging expe-
riences with a clearer understanding of the goals, such as exercise
levels for fitness and health care.

   Organizing. PIM also deals with the management and organiza-
tion of information. It is for instance concerned with the manage-
ment of privacy, security, distribution, and enrichment of informa-
tion.
   A personal data service (PDS) lets the user store, manage, and
deploy her information in a structured way. The PDS may be used
to manage different identities and/or as a central point of information
exchange between services. For instance, an application that recom-
mends new tracks based on what the user likes to listen to may need
to use adapters and authenticate with different services keeping
a listening history. Instead, the PDS centralizes this information,
and the application only needs an adapter to connect to this PDS.
Higgins [43], OpenPDS [10] and the Hub of All Things [26] are
examples of PDSs.
   MyData [37] describes a consent management framework that
lets the user control the flow of data between a service that has in-
formation about her and a service that uses this information. In
this framework, which is still at an early stage of development, a
central system holds credentials to access the different services on
the user’s behalf. The user specifies the rules by which flows of
information between any two of those services are authorized. The
central system is in charge of providing or revoking the necessary
authorizations on each of those services to implement these rules.
Contrary to a PDS, the actual data does not need to flow through
the central system. Two services may spontaneously share infor-
mation about the user with each other if legally entitled (e.g., two
public bodies), in which case the central system is notified. It is
an all-or-nothing approach that represents a paradigm shift from
currently implemented ad-hoc flows of personal information across
organizations.
   Organizing information as a time-ordered stream of documents
(a lifestream) has been proposed as a simple scheme for reducing
of finding things the user has already seen, using whatever context
                                                                              the time the user spends in manually organizing documents into
or meta-information that the user remembers [14, 39, 42].
                                                                              a classic hierarchical file system [16]. It has the advantage of pro-
   Keeping. PIM has also addressed the question: What kind of                 viding unified view of the user’s personal information. Lifestreams
information should be captured and stored in digital form?                    can be seen as a natural representation of lifelog information. The
   A central idea of [3]’s vision is creating a device that is able to dig-   Digital Me system uses this kind of representation to unify data
itally capture all of the experiences and acquired knowledge of the           from different loggers [38].
user, so that it can act as a supplement to her memory. Lifelogging              For managing personal information, different levels of organiza-
attempts to fulfil this vision by visually capturing the world that we        tion and abstraction have been proposed. Personal data lakes [45]
A Knowledge Base for Personal Information Management                                                   LDOW 2018, April 2018, Lyon, France


and personal data spaces [12] offer little integration and focus on handling storage, metadata, and search. On the other end, personal knowledge bases, which include more semantics, have been used: Haystack [30], SEMEX [13], IRIS [4], and NEPOMUK [24]. Such a structure allows the flexible representation of things like "this file, authored by this person, was presented at this meeting about this project". They integrate several sources of information, including documents, media, email messages, contacts, calendars, chats, and the web browser's history. However, these projects date from 2007 and before, and assume that most of the user's personal information is stored on her personal computer. Today, most of it is spread across several devices [1].
   Some proprietary service providers, such as Google and Apple, have arguably come quite close to our vision of a personal knowledge base. They integrate calendars, emails, and address books, and allow smart exchanges between them. Some of them even provide intelligent personal assistants that proactively interact with the user. However, these are closed-source proprietary solutions that promote vendor lock-in. In response, open-source alternative solutions have been developed, for cloud storage in particular, such as ownCloud [35] and Cozy [7]. These have evolved into application platforms that host various other kinds of user-oriented services (e.g., email, calendar, contacts). They leverage multi-device synchronization facilities and standard protocols to facilitate integration with existing contact managers and calendars. Cozy is notable for providing adapters for importing data from different kinds of services (e.g., activity trackers, finance, social) into a document-oriented database. These tools bring the convenience of modern software-as-a-service solutions while letting the user stay in control, not give away her privacy, and free herself from vendor lock-in.

6.2    Information Integration
Data matching (also known as record or data linkage, entity resolution, or object/field matching) is the task of finding records that refer to the same entity across different sources. It is extensively used in data mining projects and in large-scale information systems by businesses, public bodies, and governments [5]. Example application areas include national censuses, the health sector, and fraud detection. In the context of personal information, SEMEX [13] integrates entity resolution facilities. SEMEX imports information from documents, bibliographies, contacts, and email, and uses attributes as well as associations found between persons, institutions, and conferences to reconcile references. However, differently from our work, the integration is done at import time, so the user cannot later manually revoke it through an update, and incremental synchronization is not handled. Recently, contact managers from well-known service providers have started providing de-duplication tools for finding duplicate contacts and merging them in bulk. However, these tools are often restricted to contacts present in the user's address book and do not merge contacts from social networks or emails.
   Common standards, such as vCard and iCalendar, have advanced the state of the art by allowing provider-independent administration of personal information. There is also a proposed standard for mapping vCard content and iCalendars into RDF [6, 27]. While such standards are useful in our context, they do not provide the means to match calendars, emails, and events, as we do. The only set of vocabularies besides schema.org that provides broad coverage of all the entities we are dealing with is the OSCAF ontologies [33]. But their development was stopped in 2013 and they are not maintained anymore, contrary to schema.org, which is actively supported by companies like Google and widely used on the web [22]. Recently, a personal data service has been proposed that reuses the OSCAF ontologies, but it uses a relational database instead of a knowledge base [38].

6.3    Location Analysis and Calendar Matching
The ubiquity of networked mobile devices able to track users' locations over time has been widely exploited for estimating traffic and studying mobility patterns in urban areas. Improvements in the accuracy and battery efficiency of mobile location technologies have made possible the estimation of user activities and visited places on a daily basis [2, 20, 29]. Most of these works have mainly exploited sensor data (accelerometer, location, network) and readily available geographic data. Few of them, however, have exploited the user's calendar and other available data for creating richer and more semantic activity histories. Recently, a study has recognized the importance of using the location history and social network information for improving the representation of information contained in the user's calendar, e.g., for distinguishing genuine real-world events from reminders [31].

7    DISCUSSION
The RDF model seems well-suited for building a personal knowledge base. However powerful, making full use of it relies on being able to write and run performant queries. At this point, we cannot focus on optimizing for a specific application. It is not yet clear to what extent Thymeflow should hold raw data (such as the entire location history), which, depending on how it is used, may be loaded on demand, in mediation style. Additionally, we would like to raise the following issues that could drive future research and development:
      • Gathering data for experiments: The research community might benefit from building and maintaining sets of annotated multi-dimensional personal information for use in different kinds of tasks. This is challenging, especially due to privacy concerns.
      • Opportunities: Internet companies that already hold a lot of user data are not yet integrating everything they have into a coherent whole, and are not performing as well as we think they could. For instance, Google Location History does not integrate the user's calendar, while we do. We think that there are still many opportunities to create new products and functionalities from existing data alone.
      • Inaccessible information: The hard truth is that many popular Internet-based services still do not provide an API for conveniently retrieving user data out of them, or such an API is not feature-complete (e.g., Facebook, WhatsApp).

8    CONCLUSION
The Thymeflow system integrates data from emails, calendars, address books, and location history, providing novel functionalities
on top of them. It can merge different facets of the same agent, determine prolonged stays in the location history, and align them with events in the calendar.
   Our work is a unique attempt at building a personal knowledge base. First, the system is complementary to, and does not pretend to replace, the existing user experience, applications, and functionalities, e.g., for reading/writing emails, managing a calendar, or organizing files. Second, we embrace personal information as being fundamentally distributed and heterogeneous, and focus on the need of providing knowledge integration on top, for creating completely new services (query answering, analytics). Finally, while the system could benefit from more advanced analysis, such as the extraction of entities from rich text (e.g., emails) for linking them with elements of the KB, our first focus is on enriching existing semi-structured data, which improves the quality of data for use by other services.
   Our system can be extended in a number of directions, including incorporating more data sources, extracting semantics from text, and complex analysis of users' data and behavior. Future applications include personal analytics, cross-vendor search, intelligent event planning, recommendation, and prediction. Also, our system could use a simpler query language, perhaps natural language, or even proactively interact with the user, in the style of Apple's Siri, Google's Google Now, Microsoft's Cortana, or Amazon Echo.
   While the data obtained by Thymeflow remains under the user's direct control, fully respecting her privacy, the data residing outside of it may not. However, using Thymeflow, the user could have a better understanding of what other systems know about her, which is an important first step in gaining control over it.

REFERENCES
 [1] Serge Abiteboul, Benjamin André, and Daniel Kaplan. 2015. Managing your digital life. CACM 58, 5 (2015).
 [2] Daniel Ashbrook and Thad Starner. 2003. Using GPS to learn significant locations and predict movement across multiple users. Personal and Ubiquitous Computing 7, 5 (2003).
 [3] Vannevar Bush. 1945. As We May Think. The Atlantic (1945).
 [4] Adam Cheyer, Jack Park, and Richard Giuli. 2005. IRIS: Integrate. Relate. Infer. Share. Technical Report. DTIC Document.
 [5] Peter Christen. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer.
 [6] Dan Connolly and Libby Miller. 2005. RDF Calendar - an application of the Resource Description Framework to iCalendar Data. http://www.w3.org/TR/rdfcal/.
 [7] Cozy Cloud. 2016. Cozy – Simple, versatile, yours. (2016). https://cozy.io/
 [8] David H. Crocker. 1982. Standard for the format of ARPA Internet text messages. RFC 822. IETF. https://tools.ietf.org/html/rfc822
 [9] Richard Cyganiak, David Wood, and Markus Lanthaler. 2014. RDF 1.1 Concepts and Abstract Syntax. http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/.
[10] Yves-Alexandre de Montjoye, Erez Shmueli, Samuel S Wang, and Alex Sandy Pentland. 2014. openPDS: Protecting the privacy of metadata through SafeAnswers. PLoS ONE 9, 7 (2014), e98790.
[11] B. Desruisseaux. 2009. Internet Calendaring and Scheduling Core Object Specification (iCalendar). RFC 5545. IETF. https://tools.ietf.org/html/rfc5545
[12] Jens-Peter Dittrich and Marcos Antonio Vaz Salles. 2006. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. In VLDB. 367–378. http://dl.acm.org/citation.cfm?id=1164160
[13] Xin Dong and Alon Y. Halevy. 2005. A Platform for Personal Information Management and Integration. In CIDR. 119–130. http://www.cidrdb.org/cidr2005/papers/P10.pdf
[14] Susan T. Dumais, Edward Cutrell, Jonathan J. Cadiz, Gavin Jancke, Raman Sarin, and Daniel C. Robbins. 2003. Stuff I've seen: a system for personal information retrieval and re-use. In SIGIR. 72–79. DOI:https://doi.org/10.1145/860435.860451
[15] Facebook. 2016. The Graph API. (2016). https://developers.facebook.com/docs/graph-api/
[16] Eric Freeman and David Gelernter. 1996. Lifestreams: A storage model for personal data. ACM SIGMOD Record 25, 1 (1996), 80–86.
[17] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey Ullman, Vasilis Vassalos, and Jennifer Widom. 1997. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems 8, 2 (1997).
[18] Paula Gearon, Alexandre Passant, and Axel Polleres. 2013. SPARQL 1.1 Update. https://www.w3.org/TR/sparql11-update/.
[19] Jim Gemmell, Gordon Bell, and Roger Lueder. 2006. MyLifeBits: a personal database for everything. Commun. ACM 49, 1 (2006), 88–95.
[20] Google. 2016. Google Maps Timeline. (2016). https://www.google.fr/maps/timeline
[21] Google. 2017. Google Maps APIs. (2017). https://developers.google.com/maps/documentation/
[22] RV Guha, Dan Brickley, and Steve Macbeth. 2016. Schema.org: Evolution of structured data on the web. CACM 59, 2 (2016).
[23] Cathal Gurrin, Alan F Smeaton, and Aiden R Doherty. 2014. Lifelogging: Personal big data. Foundations and Trends in Information Retrieval 8, 1 (2014), 1–125.
[24] Siegfried Handschuh, Knud Möller, and Tudor Groza. 2007. The NEPOMUK project - on the way to the social semantic desktop. In I-SEMANTICS.
[25] Steve Harris, Andy Seaborne, and Eric Prud'hommeaux. 2013. SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/.
[26] HATDeX Ltd. 2017. Hub of All Things. (2017). https://hubofallthings.com
[27] Renato Iannella and James McKinney. 2014. vCard Ontology - for describing People and Organizations. http://www.w3.org/TR/2014/NOTE-vcard-rdf-20140522/.
[28] William Jones and Jaime Teevan. 2011. Personal Information Management. University of Washington Press, Seattle, WA, USA.
[29] Jong Hee Kang, William Welbourne, Benjamin Stewart, and Gaetano Borriello. 2004. Extracting Places from Traces of Locations. In WMASH.
[30] David R Karger, Karun Bakshi, David Huynh, Dennis Quan, and Vineet Sinha. 2005. Haystack: A customizable general-purpose information management tool for end users of semistructured data. In CIDR.
[31] Tom Lovett, Eamonn O'Neill, James Irwin, and David Pollington. 2010. The calendar as a sensor: analysis and improvement using data fusion with social networks and location. In UbiComp.
[32] David Montoya, Thomas Pellissier Tanon, Serge Abiteboul, and Fabian Suchanek. 2016. Thymeflow, A Personal Knowledge Base with Spatio-temporal Data. In CIKM. Demonstration paper.
[33] Nepomuk Consortium and OSCAF. 2007. OSCAF Ontologies. (2007). http://oscaf.sourceforge.net/
[34] Mikhail S Nikulin. 2001. Hellinger distance. Encyclopedia of Mathematics (2001).
[35] ownCloud. 2016. ownCloud – A safe home for all your data. (2016). https://owncloud.org/
[36] S. Perreault. 2011. vCard Format Specification. RFC 6350. IETF. https://tools.ietf.org/html/rfc6350
[37] A. Poikola, K. Kuikkaniemi, and H. Honko. 2014. MyData – A Nordic Model for human-centered personal data management and processing. (2014). https://www.lvm.fi/documents/20181/859937/MyData-nordic-model/2e9b4eb0-68d7-463b-9460-821493449a63?version=1.0
[38] Mats Sjöberg, Hung-Han Chen, Patrik Floréen, Markus Koskela, Kai Kuikkaniemi, Tuukka Lehtiniemi, and Jaakko Peltonen. 2016. Digital Me: Controlling and Making Sense of My Digital Footprint. (2016). http://reknow.fi/dime/
[39] Craig AN Soules and Gregory R Ganger. 2005. Connections: using context to enhance file search. ACM SIGOPS Operating Systems Review 39, 5 (2005), 119–132.
[40] Fabian M Suchanek, Serge Abiteboul, and Pierre Senellart. 2011. PARIS: Probabilistic alignment of relations, instances, and schema. PVLDB 5, 3 (2011).
[41] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In WWW.
[42] Jaime Teevan. 2007. The re:search engine: simultaneous support for finding and re-finding. In UIST. 23–32. DOI:https://doi.org/10.1145/1294211.1294217
[43] Paul Trevithick and Mary Ruddy. 2012. Higgins – Personal Data Service. (2012). http://www.eclipse.org/higgins/
[44] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. CACM 57, 10 (2014).
[45] Coral Walker and Hassan Alrehamy. 2015. Personal Data Lake with Data Gravity Pull. In BDCloud. 160–167. DOI:https://doi.org/10.1109/BDCloud.2015.62
[46] Wikipedia contributors. 2016. List of search engines – Desktop search engines. (2016). https://en.wikipedia.org/w/index.php?title=List_of_search_engines