CEUR-WS Vol-2073, article 02 — https://ceur-ws.org/Vol-2073/article-02.pdf
A Knowledge Base for Personal Information Management

David Montoya (Square Sense) — david@montoya.one
Thomas Pellissier Tanon (LTCI, Télécom ParisTech) — ttanon@enst.fr
Serge Abiteboul (Inria Paris & DI ENS, CNRS, PSL Research University) — serge.abiteboul@inria.fr
Pierre Senellart (DI ENS, CNRS, PSL Research University & Inria Paris & LTCI, Télécom ParisTech) — pierre@senellart.com
Fabian M. Suchanek (LTCI, Télécom ParisTech) — suchanek@enst.fr

LDOW 2018, April 2018, Lyon, France

ABSTRACT
Internet users have personal data spread over several devices and across several web systems. In this paper, we introduce a novel open-source framework for integrating the data of a user from different sources into a single knowledge base. Our framework integrates data of different kinds into a coherent whole, starting with email messages, calendar, contacts, and location history. We show how event periods in the user's location data can be detected and how they can be aligned with events from the calendar. This allows users to query their personal information within and across different dimensions, and to perform analytics over their emails, events, and locations. Our system models data using RDF, extending the schema.org vocabulary and providing a SPARQL interface.

1 INTRODUCTION
Internet users commonly have their personal data spread over several devices and services. This includes emails, messages, contact lists, calendars, location histories, and many others. However, commercial systems often function as data traps, where it is easy to check information in but difficult to query and exploit it. For example, a user may have all her emails stored with an email provider, but cannot find out which of her colleagues she interacts with most frequently. She may have all her location history on her phone, but cannot find out at which of her friends' places she spends the most time. Thus, a user often has, paradoxically, no means to make full use of data that she has created or provided. As more and more of our lives happen in the digital sphere, users are actually giving away part of their life to external data services.

We aim to put the user back in control of her own data. We introduce a novel framework that integrates and enriches personal information from different sources into a single knowledge base (KB) that lives on the user's machine, a machine she controls. Our system, Thymeflow, replicates data of different kinds from outside services and thus acts as a digital home for personal data. This provides the user with a high-level global view of that data, which she can use for querying and analysis. All of this integration and analysis happens locally on the user's computer, thus guaranteeing her privacy.

Designing such a personal KB is not easy: data of completely different natures has to be modeled in a uniform manner, pulled into the knowledge base, and integrated with other data. For example, we have to find out that the same person appears with different email addresses in address books from different sources. Standard KB alignment algorithms do not perform well in our scenario, as we show in our experiments. Furthermore, integration spans data of different modalities: to create a coherent user experience, we need to align calendar events (temporal information) with the user's location history (spatiotemporal) and place names (spatial).

We provide a fully functional and open-source personal knowledge management system. A first contribution of our work is the management of location data. Such information is becoming commonly available through the use of mobile applications such as Google's Location History [20]. We believe that such data becomes useful only if it is semantically enriched with events and people in the user's personal space. We provide such an enrichment.

A second contribution is the adaptation of ontology alignment techniques to the context of personal KBs. The alignment of persons and organizations is rather standard. More novel are alignments based on time (a meeting in the calendar and a GPS location) or space (an address in contacts and a GPS location).

Our third contribution is an architecture that allows the integration of heterogeneous personal data sources into a coherent whole. This includes the design of incremental synchronization, where a change in a data source triggers the loading and treatment of just these changes in the central KB. Conversely, the user is able to perform updates on the KB, which are made persistent wherever possible in the sources. We also show how to integrate knowledge enrichment components into this process, such as entity resolution and spatio-temporal alignments.

As implemented, our system can provide answers to questions such as: Who have I contacted the most in the past month (requires alignments of different email addresses)? How many times did I go to Alice's place last year (requires alignment between contact list and location history)? Where did I have lunch with Alice last week (requires alignment between calendar and location history)?

Our system, Thymeflow, was previously demonstrated in [32]. It is based on an extensible framework available under an open-source software license¹. People can therefore freely use it, and researchers can build on it.

We first introduce our data model and sources in Section 2, and then present the system architecture of Thymeflow in Section 3.

¹ https://github.com/thymeflow/thymeflow


Section 4 details our knowledge enrichment processes, and Section 5 our experimental results. Related work is described in Section 6. Before concluding in Section 8, we discuss lessons learnt while building and experimenting with Thymeflow in Section 7.

2 DATA MODEL
In this section, we briefly describe the schema of the knowledge base, and discuss the mapping of data sources to that schema.

Schema. We use the RDF standard [9] for knowledge representation. We use the namespace prefixes schema for http://schema.org/, and rdf and rdfs for the standard namespaces of RDF and RDF Schema, respectively. A named graph is a set of RDF triples associated with a URI (its name). A knowledge base (KB) is a set of named graphs.

For modeling personal information, we use the schema.org vocabulary when possible. This vocabulary is supported by Google, Microsoft, Yahoo, and Yandex, and documented online. Wherever this vocabulary is not fine-grained enough for our purposes, we complement it with our own vocabulary, which lives in the namespace http://thymeflow.com/personal# with prefix personal.

Figure 1 illustrates a part of our schema. Nodes represent classes, rounded colored ones are non-literal classes, and an edge with label p from X to Y means that the predicate p links instances of X to instances of type Y. We use locations, people, organizations, and events from schema.org, and complement them with more fine-grained types such as Stay, EmailAddress, and PhoneNumber. The Person and Organization classes are aggregated into a personal:Agent class.

Emails and contacts. We treat emails in the RFC 822 format [8]. An email is represented as a resource of type schema:Email with properties such as schema:sender, personal:primaryRecipient, and personal:copyRecipient, which link to personal:Agent instances. Other properties are included for the subject, the sent and received dates, the body, the attachments, the threads, etc.

Email addresses are great sources of knowledge. An email address such as "jane.doe@inria.fr" provides the given and family names of a person, as well as her affiliation. However, some email addresses provide less knowledge and some almost none, e.g., "j4569@gmail.com". Sometimes, email fields contain a name, as in "Jane Doe <j4569@gmail.com>", which gives us a name triple. In our model, personal:Agent instances extracted from emails with the same combination of email address and name are considered indistinguishable (i.e., they are represented by the same URI). An email address does not necessarily belong to an individual; it can also belong to an organization, as in edbt-school-2013@imag.fr or fancy_pizza@gmail.com. This is why, for instance, the sender, in our data model, is a personal:Agent, and not a schema:Person.

A vCard contact [36] is represented as an instance of personal:Agent with properties such as schema:familyName and schema:address. We normalize telephone numbers, based on a country setting provided by the user.

Calendar. The iCalendar format [11] can represent events. We model them as instances of schema:Event, with properties such as name, location, organizer, attendee, and date. The location is typically given as a postal address, and we will discuss later how to associate it to geo-coordinates and richer place semantics. The Facebook Graph API [15] also models events the user is attending or interested in, with richer location data and a list of attendees (as a list of names).

Location history. Smartphones are capable of tracking the user's location over time using different positioning technologies: satellite navigation, Wi-Fi, and cellular. Location history applications continuously run in the background, and store the user's location either locally or on a distant server. Each point in the user's location history is represented by time, longitude, latitude, and horizontal accuracy (the measurement's standard error). We use the Google Location History format, in JSON, as Google users can easily export their history in this format. A point is represented by a resource of type personal:Location with properties schema:geo, for geographic coordinates with accuracy, and personal:time for time.

3 SYSTEM ARCHITECTURE
A personal knowledge base could be seen as a view defined over personal information sources. The user would query this view in a mediation style [17] and the data would be loaded only on demand. However, accessing, analyzing, and integrating these data sources on the fly would be expensive tasks. For this reason, Thymeflow uses a warehousing approach. Data is loaded from external sources into a persistent store and then enriched.

Thymeflow is a web application that the user installs, providing it with a list of data sources, together with credentials to access them (such as tokens or passwords). The system accesses the data sources and pulls in the data. All code runs locally on the user's machine. None of the data leaves the user's computer. Thus, the user remains in complete control of her data. The system uses adapters to access the sources, and to transform the data into RDF. We store the data in a persistent triple store, which the user can query using SPARQL.

One of the main challenges in the creation of a personal KB is the temporal factor: data sources may change, and these updates should be reflected in the KB. Changes can happen during the initial load time, while the system is asleep, or after some inferences have already been computed. To address these dynamics, Thymeflow uses software modules called synchronizers and enrichers. Figure 2 shows synchronizers on the left, and enrichers in the center. Synchronizers are responsible for accessing data sources, enrichers (see Section 4) for inferring new statements, such as alignments between entities obtained by entity resolution.

Modules are scheduled dynamically and may be triggered by updates in the data sources (e.g., calendar entries) or by new pieces of information derived in the KB (e.g., the alignment of a position in the location history with a calendar event). The modules may also be started regularly for particularly costly alignment processes. When a synchronizer detects a change in a source, a pipeline of enricher modules is triggered, as shown in Figure 2. Enrichers can also use knowledge from external data sources, such as Wikidata [44], Yago [41], or OpenStreetMap.

Synchronizer modules are responsible for retrieving new data from a data source. For each data source that has been updated, the adapter for that particular source transforms the source updates since the last synchronization into a set of insertions/deletions in RDF.
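The deltas produced by synchronizers can be pictured as two sets of triples (insertions and deletions) tied to the named graph they affect. The following is an illustrative sketch only, with hypothetical names, not Thymeflow's actual module interface:

```python
from dataclasses import dataclass, field

@dataclass
class DeltaUpdate:
    """Changes to the KB since the last synchronization of one data source."""
    graph: str                                      # named graph recording provenance
    insertions: set = field(default_factory=set)    # triples to add
    deletions: set = field(default_factory=set)     # triples to retract

def apply_delta(kb, delta):
    """Apply a delta to a KB modeled as a set of (graph, triple) pairs."""
    # Retract the deleted triples, but only from the delta's own graph.
    kb = {entry for entry in kb
          if not (entry[0] == delta.graph and entry[1] in delta.deletions)}
    # Add the inserted triples under the same graph.
    kb |= {(delta.graph, t) for t in delta.insertions}
    return kb
```

Keying both insertions and deletions by named graph is what lets a re-synchronization of one source leave statements from other sources (and from enrichers) untouched.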


[Figure omitted: schema diagram. Classes include EmailMessage, EmailAddress, PhoneNumber, Stay, Event, Agent, Person, Organization, Location, GeoCoordinates, GeoVector, Place, PostalAddress, Address, and Country, linked by schema: and personal: properties such as sender, recipient, attendee, organizer, location, geo, address, name, givenName, familyName, and birthDate. Legend: classes and properties come from the schema:, personal:, or xsd: namespaces; dashed edges denote rdfs:subClassOf.]

Figure 1: Personal data model
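For illustration, the classes and properties of Figure 1 can be instantiated as plain triples. The sketch below uses hypothetical URIs and literal values, and is not Thymeflow's actual internal representation:

```python
# Hypothetical example: an email message and its sender modeled with the
# schema.org vocabulary plus the personal: extension described above.
SCHEMA = "http://schema.org/"
PERSONAL = "http://thymeflow.com/personal#"

msg = "urn:example:msg1234"            # hypothetical message URI
agent = "urn:example:agent/jane-doe"   # hypothetical agent facet URI

triples = [
    (msg, SCHEMA + "headline", "Lunch on Friday?"),
    (msg, SCHEMA + "dateSent", "2018-04-23T10:41:00Z"),
    (msg, SCHEMA + "sender", agent),          # sender is an Agent, not a Person
    (agent, SCHEMA + "name", "Jane Doe"),
    (agent, SCHEMA + "email", "mailto:jane.doe@inria.fr"),
]

# A toy lookup: who sent this message?
senders = [o for (s, p, o) in triples if s == msg and p == SCHEMA + "sender"]
```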


[Figure omitted: personal information sources P1..Pn and external sources X1..Xm feed synchronizers S1..Sn; a loader/updater produces ∆0, which flows through an enricher pipeline E1..Ep (producing ∆1..∆p) into the knowledge base; the KB also receives user updates and serves SPARQL query answering, data analysis, data browsing, and visualization.]

Figure 2: System architecture
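The enricher pipeline of Figure 2 can be sketched as a chain of functions, each taking the KB and the changes produced so far (∆i) and returning new statements (∆i+1). This is a simplification with a hypothetical toy enricher, not the system's actual module API:

```python
def run_pipeline(kb, delta0, enrichers):
    """Feed each enricher the KB plus the changes produced so far.

    Each enricher is a function (kb, delta) -> new_delta; its output is
    merged into the KB and passed on, mirroring the deltas of Figure 2.
    """
    delta = delta0
    for enrich in enrichers:
        kb |= delta                  # make previous changes visible in the KB
        delta = enrich(kb, delta)    # compute the next delta
    kb |= delta
    return kb

def toy_entity_resolver(kb, delta):
    """Hypothetical enricher: link two agents that share an email address."""
    seen = {}
    out = set()
    for (s, p, o) in kb | delta:
        if p == "schema:email":
            if o in seen and seen[o] != s:
                out.add((s, "personal:sameAs", seen[o]))
            seen[o] = s
    return out
```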


This is of course relatively simple for data sources that track modifications, e.g., CalDAV (calendar), CardDAV (contacts), and IMAP (email). For others, this requires more processing. The result of this process is a delta update, i.e., a set of updates to the KB since the last time that particular source was considered.

The KB records the provenance of each newly obtained piece of information. Synchronizers record a description of the data source, and enrichers record their own name. We use named graphs to store the provenance. For example, the statements extracted from an email message in the user's email server will be contained in a graph named with the concatenation of the server's email folder URL and the message id. The graph's URI is itself an instance of personal:Document, and is related to its source via the personal:documentOf property. The source is an instance of personal:Source and is in this case the email server's URL. Account information is included in an instance of personal:Account via the personal:sourceOf property. Account instances allow us to gather different kinds of data sources (e.g., CardDAV, CalDAV, and IMAP servers) belonging to one provider (e.g., corporate IT services) which the user accesses through one identification. This provenance can be used to answer queries such as "What meetings were recorded in my work calendar for next Monday?".

Finally, the system allows the propagation of information from the KB to the data sources. These can either be insertions/deletions derived by the enrichers, or insertions/deletions explicitly specified by the user. For instance, consider the information that different email addresses correspond to the same person. This information can be pushed to data sources, which may for example result in merging two contacts in the user's list of contacts. To propagate the information to the source, we translate from the structure and terminology of the KB back to that of the data source and use the API of that source. The user has the means of controlling this propagation, e.g., specifying whether contact information in our system should be synchronized to her phone's contact list.

The user can update the KB directly by inserting or deleting knowledge statements. Such updates to the KB are specified in the SPARQL Update language [18]. When no source is specified for recording this new information, the system considers all the sources that know the subject of the particular statement. For insertions, if no source is able to register a corresponding insertion, the system performs the insertion in a special locally persistent graph, called the overwrite graph. For deletions, if one source fails to perform a deletion (e.g., because the statement is read-only), the system removes the statement from the KB anyway (even if the data is still in some upstream source). A negative statement is added to the overwrite graph. This negative statement will prevent a source statement from reintroducing the corresponding statement in the KB: the negative statement overwrites the source statement.

4 ENRICHERS
We describe the general principles of enricher modules. We then describe two specific enrichments: agent matching and event geolocation.

After loading, enricher modules perform inference tasks such as entity resolution, event geolocation, and other knowledge enrichment tasks. An enricher works in a differential manner: it takes as input the current state of the KB, and a collection of changes ∆i that have recently happened. It computes a new collection ∆i+1 of enrichments. Intuitively, this allows reacting to changes in a data source. When some ∆0 is detected (typically by some synchronizer), the system runs a pipeline of enrichers to take these changes into consideration. For instance, when a new entry is entered in the calendar with an address, a geocoding enricher is called to locate it. Another enricher will later attempt to match it with a position in the location history. For performance, particularly costly enrichers wait until there are enough changes, or until no more changes are happening, before running on a batch of changes. This is the case for the entity resolution enricher. We now present this enricher and another one that has been incorporated into the system.

4.1 Agent Matching
Facets. The KB keeps information as close to the original data as possible. Thus, the knowledge base will typically contain several entities for the same person, if that person appears with different names or different email addresses. We call such resources facets of the same real-world agent. Different facets of the same agent will be linked by the personal:sameAs relation. The task of identifying equivalent facets has been intensively studied under different names such as record linkage, entity resolution, or object matching [5]. In our case, we use techniques that are tailored to the context of personal KBs: identifier-based matching and attribute-based matching.

Identifier-based matching. We can match two facets if they have the same value for some particular attribute (such as an email address or a telephone number), which, in some sense, identifies or determines the entity. This approach is commonly used in personal information systems (in research and industry) and gives fairly good results for linking, e.g., facets extracted from emails and the ones extracted from contacts. Such a matching may occasionally be incorrect, e.g., when two spouses share a mobile phone or two employees share the same customer relations email address. In our experience, such cases are rare, and we postpone their study to future work.

Two agent facets with the same first and family names have, for instance, a higher probability of representing the same agent than two agent facets with different names, all other attributes held constant. Besides names, attributes that can help determine a matching include schema:birthDate, schema:gender, and schema:email.

We tried holistic matching algorithms for graph alignments [40] that we adapted to our setting. The results turned out to be disappointing (see Section 5). We believe this is due to the following: (i) almost all agent facets have a schema:email, and possibly a schema:name, but most of them lack other attributes, which are thus almost useless; (ii) names extracted from emails may contain pseudonyms, abbreviations, or lack family names, which reduces matching precision; (iii) we cannot reliably compute name frequency metrics from the knowledge base, since a rare name may appear many times for different email addresses if a person happens to be a friend of the user. Therefore, we developed our own algorithm, AgentMatch, which works as follows:

   (1) We partition Agents using the equivalence relation computed by matching identifying attributes.


   (2) For each Agent equivalence class, we compute its corre-
       sponding set of names, and, for each name, its number of
       occurrences (in email messages, etc.).
   (3) We compute Inverse Document Frequency (IDF) scores,
       where the documents are the equivalence classes, and the
       terms are the name occurrences.
   (4) For each pair of equivalence classes, we compute a numerical
       similarity between each pair of names using an approximate
       string distance that finds the best matching of words between
       the two names and then compares matched words using
       another string similarity function (discussed below). The sim-
       ilarity between two names is computed as a weighted mean
       using the sum of word IDFs as weights. The best matching
       of words corresponds to a maximum-weight matching in
       the bipartite graph of words, where weights are computed
       using the second string similarity function. The similarity
       (in [0, 1]) between two equivalence classes is computed as a
       weighted mean of name-pair similarities, using the product
       of word occurrences as weights.
   (5) Pairs for which the similarity is above a certain threshold
       are considered to correspond to two equivalent facets.
The second similarity function we use is based on the Levenshtein
edit distance, after string normalization (accent removal and lower-
casing). In our experiments, we have also tried the Jaro–Winkler
distance. For performance reasons, we use 2- or 3-gram-based in-
dexing of words in agent names, and only consider in step (4) of the
process those Agent parts with some ratio S of q-grams in common
in at least one word. For instance, two Agent parts with names
“Susan Doe” and “Susane Smith” would be candidates.

4.2    Geolocating Events
We discuss how to geolocate events, e.g., how we can detect that
Monday’s lunch was at “Shana Thai Restaurant, 311 Moffett Boule-
vard, Mountain View, CA 94043”. For this, we first analyze the
location history from the user’s smartphone to detect places where
the user stayed for a prolonged period of time. We then perform
a spatiotemporal alignment between such stays and the events
in the user’s calendar. Finally, we use geocoding to provide location
semantics to the events, e.g., a restaurant name and a street address.

   Detecting stays. Locations in the user’s location history can be
put into two categories: stays and moves. Stays are locations where
the user remained for some period of time (e.g., dinner at a restau-
rant, gym training, office work), and moves are the others. Moves
usually correspond to locations along a journey from one place to
another, but might also correspond to richer outdoor activity (e.g.,
jogging, sightseeing). Figure 3 illustrates two stay clusters located
inside the same building.

Figure 3: Two clusters of stays (blue points inside black cir-
cles) within the same building. Red points are outliers. The
other points are moves.

   To transform the user’s location history into a sequence of stays
and moves, we perform time-based spatial clustering [29]. The idea
is to create clusters along the time axis. Locations are sorted by
increasing time, and each new location is either added to an exist-
ing cluster (that is geographically close and that is not too old), or
added to a new cluster. To do so, a location is spatially represented
as a two-dimensional unimodal normal distribution N(µ, σ²). The
assumption of a normally distributed error is typical in the field of
processing location data. For instance, a cluster of size 1 formed
by location point p = (t, x, y, a), where t is the time, a the accu-
racy, and (x, y) the coordinates, is represented by the distribution
P = N(µ_P = (x, y), σ_P² = a²). When checking whether location
p can be added to an existing cluster C represented by distribu-
tion Q, the process computes the Hellinger distance [34] between
the distribution P and the normal distribution Q = N(µ_Q, σ_Q²):

   H²(P, Q) = 1 − √(2σ_P σ_Q / (σ_P² + σ_Q²)) · exp(−¼ · d(µ_P, µ_Q)² / (σ_P² + σ_Q²)) ∈ [0, 1],

where d(µ_P, µ_Q) is the geographical distance between cluster centers.
The Hellinger distance takes into account both the accuracy and the
geographical distance between cluster centers, which allows us
to handle outliers no matter the location accuracy. The location
is added to C if this distance is below a certain threshold λ, i.e.,
H²(P, Q) ⩽ λ² < 1. In our system, we used a threshold of 0.95.
   When p is added to cluster C, the resulting cluster is defined with
a normal distribution whose expectation is the arithmetic mean of
the location point centers weighted by the inverse accuracy squared,
and whose variance is the harmonic mean of the accuracies squared.
Formally, if a cluster C is formed by locations {p₁, …, pₙ}, where
pᵢ = (tᵢ, xᵢ, yᵢ, aᵢ), then C is defined with distribution N(µ, σ²),
where µ is the arithmetic mean of the location centers (xᵢ, yᵢ)
weighted by their inverse accuracy squared aᵢ⁻², and the variance
σ² is the harmonic mean of the location accuracies squared aᵢ²:

   µ = ( Σᵢ₌₁ⁿ (xᵢ, yᵢ)/aᵢ² ) · ( Σᵢ₌₁ⁿ 1/aᵢ² )⁻¹        σ² = n · ( Σᵢ₌₁ⁿ 1/aᵢ² )⁻¹

The coordinates are assumed to have been projected to a Euclidean
plane locally approximating distances and angles on Earth around
the cluster points. If n = 1, then µ = (x₁, y₁) and σ² = a₁², which
corresponds to the definition of a cluster of size 1.
   A cluster that lasted more than a certain threshold is a candidate
for being a stay. A difficulty is that a single location history (e.g.,
Google Location History) may record locations of different devices,
e.g., a telephone and a tablet. The identity of the device may not be
recorded. The algorithm understands that two far-away locations,
very close in time, must come from different devices. Typically, one
of the devices is considered to be stationary, and we try to detect a
movement of the other. Another difficulty comes when traveling in
high-speed trains with poor network connectivity. Location trackers
will often give the same location for a few minutes, which leads to
the detection of an incorrect stay.

   Matching stays with events. After the extraction of stays using
the previous algorithm, the next step is to match these with calendar
events. Such a matching turns out to be difficult because: (i) the
location of an event (address or geo-coordinates) is often missing;
(ii) when present, an address often does not identify a geographical
entity, as in “John’s home” or “room C110”; (iii) in our experience,
starting times are generally reasonable (although a person may be
late or early for a meeting) but durations are often not meaningful
(around 70% of events in our test datasets were scheduled for 1 hour;
among the 1-hour events that we aligned, only 9% lasted between
45 and 75 minutes); (iv) some stays are incorrect.
   Because of (i) and (ii), we do not rely much on the location
explicitly listed in the user’s calendars. We match a stay with an
event primarily based on time: the time overlap (or proximity) and
the duration. In particular, we match the stay and the event if the
ratio of the overlap duration over the entire stay duration is greater
than a threshold θ. As we have seen, event durations are often
unreliable because of (iii). Our method still yields reasonable results,
because it tolerates errors on the start of the stay for long stays
(because of their duration) and for short ones (because calendar
events are usually scheduled for at least one hour). If the event has
geographical coordinates, we filter out stays that are too far away
from that location (i.e., when the distance is greater than δ). We
discuss the choice of θ and δ for this process in Section 5.

   Geocoding event addresses. Once stays are associated with events,
we enrich the events with rich place semantics (country, street name,
postal code, place name). If an event has an explicit address, we use
a geocoder. Thymeflow allows using different geocoders, e.g., the
Google Maps Geocoding API [21], which returns the geographic
coordinates of an address, along with structured place and address
data. The enricher only keeps the geocoder’s most relevant result
and adds its data (geographic coordinates, identifier, street address,
etc.) to the location in the knowledge base. For events that do not
have an explicit address but that have been matched to a stay, we
use the geocoder to transform the geographic coordinates of the
stay into a list of nearby places. The most precise result is added
as the event location. If the event has both an explicit address and
a match with a stay, we call the geocoder on this address, while
restricting the search to a small area around the stay coordinates.

5   EXPERIMENTS
In this section, we present the results of our experiments. We used
datasets from two real users, whom we call Angela and Barack. An-
gela’s dataset consists of 7,336 emails, 522 calendar events, 204,870
location points, and 124 contacts extracted from Google’s email,
contact, calendar, and location history services. This corresponds
to 1.6M triples in our schema. Barack’s dataset consists of 136,301
emails, 3,080 calendar events, 1,229,245 location points, and 582
contacts extracted from the same sources. Barack’s emails cover a
period of 5,540 days, and his locations cover 1,676 days. This corresponds to
10.3M triples, where 70.9 % come from the location history, 28.8 %
from emails, 0.3 % from calendars, and less than 0.1 % from contacts.
   We measured the loading times of Angela’s dataset into the
system in two different scenarios: from source data on the Internet
(using the Google API, except for the location history, which is not
provided by the API and was loaded from a file), and from source
data stored in local files. Loading took 19 and 4 minutes, respectively,
on a desktop computer (Intel i7-2600k 4-core, 3.4 GHz, 20 GB RAM,
SSD).

5.1    Agent Matching
We evaluated the precision and recall of the AgentMatch algorithm
(Section 4) on Barack’s dataset. This dataset contains 40,483 Agent
instances with a total of 25,381 schema:name values, of which 17,706
are distinct; it also contains 40,455 schema:email values, of which
24,650 are distinct. To compute the precision and recall, we sampled
2,000 pairs of distinct Agents, and asked Barack to manually assign
to each pair a ground-truth value (true/false). Barack was
provided with the email address and name of each agent, and was
allowed to query the KB to get extra information.
   We tested both Levenshtein and Jaro–Winkler as the secondary
string distance, with and without IDF term weights. The term q-
gram match ratio (S) was set to 0.6. We varied λ so as to maximize
the F1 value. Precision decreases while recall increases for decreas-
ing threshold values. Our baseline is IdMatch, which matches two
contacts iff they have the same email address.
   As a competitor, we considered PARIS [40], an ontology alignment
algorithm that is parametrized by a single threshold. We used string
similarity for email addresses, and the name similarity metric used
by AgentMatch, except that it is applied to single Agent instances.
PARIS computes the average number of outgoing edges for each
relation. Since our dataset contains duplicates, we gave PARIS an
advantage by computing these values upfront.
   We also considered Google’s “Find duplicates” feature. Google
was not able to handle more than 27,000 contacts at the same time,
and so we had to run it multiple times in batches. Since the fi-
nal output depends on the order in which contacts were loaded,
we present two results, one for which the contacts were supplied
sorted by email address (Google1), and another for a random order
(Google2). Since Google’s algorithm failed to merge contacts that
IdMatch did merge, we also tested running IdMatch on Google’s
output (GoogleId) for both runs. We also tested the Mac OS X contact
de-duplication feature. However, its result did not contain all the
metadata from the original contacts, so we could not evaluate
this feature.
   The results are shown in Table 1. As expected, our baseline
IdMatch has a perfect precision, but a low recall (43%). Google,
likewise, gives preference to precision, but achieves a higher recall
than the baseline (50%). The recall improves further if Google
is combined with IdMatch (61%). PARIS, in contrast, favors recall
(92%) over precision (83%), and achieves a better F1 value overall.
The highest F1-measure (95%) is reached by AgentMatch with the
Jaro–Winkler distance for a threshold of 0.825. It has a precision
comparable to Google’s, and a recall comparable to PARIS’s.

5.2    Detecting Stays
We evaluated the extraction of stays from the location history on
Barack’s dataset. We randomly chose 15 days, and presented him
a web interface with (1) the raw locations on a map, (2) the stays
detected by Thymeflow, and (3) the stays detected by his Google
Timeline [20]. Barack was then asked to annotate each stay as “true”,
i.e., corresponding to an actual stay, or “false”, and to report missing
stays, based on his memories. He was also allowed to use any other
available knowledge (e.g., his calendar). The exact definition of an
actual stay was left to Barack, and the reliability of his annotations
was dependent on his recollection. In total, Barack found 64 ac-
tual stays. Sometimes, an algorithm would detect an actual stay
as multiple consecutive stays. In that case, we counted one true
stay, and counted the number of duplicates and the resulting move
duration in between. For instance, an actual stay of 2 hours output
as two stays of 29 min and 88 min, with a short move of 3 minutes
in between, would count as 1 true stay, 1 duplicate, and 3 minutes
of duplicate move duration.
   Table 2 shows the resulting precision and recall for each method,
with varying stay duration thresholds for Thymeflow. We also show
the duplicate ratio #D (number of duplicates over the number of true
stays) and the move duration ratio DΘ (duplicate move duration
over the total duration of true stays). Overall, Thymeflow obtains
results comparable to Google Timeline for stay extraction, with
better precision and recall, but more duplicates.

Table 1: Precision and recall of the Agent Matching task
on Barack’s dataset, for different parameters of the Agent-
Match, IdMatch, PARIS and Google algorithms

  Algorithm     Similarity      IDF    λ        Prec.    Rec.     F1
  IdMatch                                       1.000    0.430    0.601
  Google1                                       0.995    0.508    0.672
  Google2                                       0.996    0.453    0.623
  GoogleId2                                     0.997    0.625    0.768
  GoogleId1                                     0.996    0.608    0.755
  PARIS         Jaro–Winkler    T      0.425    0.829    0.922    0.873
  AgentMatch    Levenshtein     F      0.725    0.945    0.904    0.924
  AgentMatch    Levenshtein     T      0.775    0.948    0.900    0.923
  AgentMatch    Jaro–Winkler    F      0.925    0.988    0.841    0.909
  AgentMatch    Jaro–Winkler    T      0.825    0.954    0.945    0.949

Table 2: Stay extraction evaluation on Barack’s dataset

  Method    θ     #D       DΘ      Prec.     Recall     F1
  Thyme     15    33.3 %   0.6 %   91.3 %    98.4 %     94.7 %
  Thyme     10    46.0 %   0.9 %   82.9 %    98.4 %     90.0 %
  Thyme      5    62.5 %   1.0 %   62.1 %    100.0 %    76.6 %
  Google          17.5 %   N/A     87.7 %    89.1 %     88.4 %

   Matching stays with events. We also evaluated the matching of
stays in the user’s location history with the events in their calendar.
We sampled stays from Angela’s and Barack’s datasets, and produced
all possible matchings to events, i.e., all matchings produced by the
algorithm whatever the threshold. Angela and Barack were then
asked to manually label the matches as correct or incorrect. The
matching process relies on two parameters, namely the duration
ratio threshold θ and the filtering distance δ. We varied θ and found
that a value of 0.2 leads to the best F1 values. With this value, we varied
δ, and found that the performance improves consistently with larger
values. This indicates that filtering out stays that are too far from
the event location coordinates (where available) is not helpful.
With these settings, the matching performs
quite well: we achieve a precision and recall of around 70%.

5.3    Geocoding
We evaluated different geocoding enrichers for the (event, stay) pairs
described in Section 4. We used the Google Maps Geocoding API.
We considered three different inputs to this API: the event loca-
tion address attribute (Event), the stay coordinates (Stay), and the
event location attribute with the stay coordinates given as a bias
(StayEvent). We also considered a more precise version of Event,
which produces a result only if the geocoder returns a single un-
ambiguous location (EventSingle). Finally, we devised the method
StayEvent+Stay, which returns the StayEvent result if it exists, and
the Stay result otherwise.
   For each input, the geocoder gave either no result (M), a false
result (F), a true place (T), or just a true address (A). For instance,
an event occurring in “Hôtel Ritz Paris” is true if the output is, for
instance, “Ritz Paris”, while an output of “15 Place Vendôme, Paris”
would count as a true address. For comparison, we also evaluated
the place given by Google Timeline [20].

Table 3: Geocoding task on matched (event, stay) pairs in
Barack’s dataset (in %)

  Method            M     F      T      PT     T|A    PT|A    F1
  GoogleTimeline     0    82.8   14.8   14.8   17.2   17.2    29.4
  EventSingle       50    40.8    4.0    7.9    9.6   19.0    27.7
  Event             26    63.6    4.0    5.4   10.0   13.6    22.9
  Stay               0    69.6    0.8    0.8   30.4   30.4    46.6
  StayEvent         50    16.4   27.2   54.4   33.6   67.2    57.3
  StayEvent+Stay     0    50.0   28.4   28.4   50.0   50.0    66.7

   Due to limitations of the API, geocoding from stay coordinates
mostly yielded address results (99.2% of the time). To better evaluate
these methods, we computed the number of times in which the
output was either a true place or a true address (denoted T|A). For
those methods that did not always return a result, we computed a
precision metric PT|A (resp., PT), that is equal to the ratio of T|A
(resp., T) to the number of times a result was returned. We computed
an F1-measure based on the PT|A precision, and a recall assimilated
to the number of times the geocoder returned a result (1 − M).
   The evaluation was performed on 250 randomly picked (stay,
event) pairs from Barack’s dataset. The results are shown in Ta-
ble 3. Google Timeline gave the right place or address only
17.2% of the time. The EventSingle method, likewise, performs
poorly, indicating that the places are indeed highly ambiguous. The
best precision (PT|A of 67.2%) is obtained by geocoding with
StayEvent, but this method returns a result only 50.0% of the time.
StayEvent+Stay, in contrast, can find the right place 28.4% of the
time, and the right place or address 50.0% of the time, which is our
best result. We are happy with this performance, considering that
around 45% of the event locations were room numbers without
mention of a building or place name (e.g., C101).
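The StayEvent+Stay fallback can be sketched as follows. The two geocoder callables are hypothetical stand-ins for a real service such as the Google Maps Geocoding API; this is an illustrative sketch, not the actual Thymeflow implementation:

```python
# Sketch of the StayEvent+Stay strategy: geocode the event's address biased
# towards the matched stay, and fall back to reverse-geocoding the stay
# coordinates when no (usable) address result is obtained.
# `geocode` and `reverse_geocode` are hypothetical geocoder callables.

from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)


def stay_event_plus_stay(
    event_address: Optional[str],
    stay_coords: Coords,
    geocode: Callable[..., Optional[dict]],              # geocode(address, bias=coords) -> place or None
    reverse_geocode: Callable[[Coords], Optional[dict]], # coords -> nearby place or None
) -> Optional[dict]:
    """Best-effort location for a matched (event, stay) pair."""
    if event_address:
        # StayEvent: restrict the address search to the area around the stay.
        place = geocode(event_address, bias=stay_coords)
        if place is not None:
            return place
    # Stay: no address, or no result for it; use the stay's coordinates.
    return reverse_geocode(stay_coords)
```

With real geocoder bindings plugged in, the function returns the StayEvent result whenever one exists and the reverse-geocoded stay location otherwise, mirroring the behavior evaluated in Table 3.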
5.4    Use Cases
The user can query the KB using SPARQL [25]. Additionally,
Thymeflow uses a modern triple store supporting full-text and
geospatial queries. Since the KB unites different data sources,
queries can seamlessly span multiple sources and data types. This
allows the user (Angela) to ask, for instance:
      • What are the phone numbers of her birthday party guests?
        (combining information from the contacts and the emails)
      • What places did she visit during her last trip to London?
        (combining geocoding information with stays)
      • For each person she meets more than 3 times a week, what
        are the top 2 places where she usually meets that particular
        person? (based on her calendar and location history)
Such queries are not supported by current proprietary cloud ser-
vices, which do not allow arbitrary queries.

   Hub of personal knowledge. Finally, the user can use the bidi-
rectionality of synchronization to enrich her existing services. For
instance, she can enrich her personal address book (CardDAV) with
knowledge inferred by the system (e.g., a friend’s birth date ex-
tracted from Facebook) using a SPARQL/Update query.

6   RELATED WORK
We now review the related work on personal information man-
agement, on information integration, and on the specific tasks of
location analysis and calendar matching.

6.1    Personal Information Management
This work is motivated by the concept of personal information man-
agement (PIM), taking the viewpoint of [1] as to what a PIM system
should be. [28] groups PIM research and development into the fol-
lowing problems: finding and re-finding, keeping, and organizing
personal information. We now present some notable contributions.

   Finding and Re-finding. PIM has been concerned with improving
how individuals go about retrieving a piece of information to meet
a particular need.
   For searching within the user’s personal computer, desktop full-
text search tools have been developed for various operating
systems and platforms [46]. Search entries are for instance the files
and folders on the file system, email messages, browser history
pages, calendar events, contacts, and applications. Search may be
performed on both the content and the metadata. In particular, the
IRIS [4] and NEPOMUK [24] projects used knowledge represen-
tation technologies to provide semantic search facilities, and go
beyond search to provide facilities for exchanging data between
different applications within a single desktop computer.
   Other research efforts have focused on ameliorating the process

see in our everyday lives [23]. MyLifeBits is a notable documented
example [19]. The advent of cheaper, more advanced and efficient
wearable devices able to capture the different aspects of one’s life
has made lifelogging indiscriminate of what it logs. Lifelogging
activities include recording a history of machine-enabled tasks (e.g.,
communications, editing, the web browser’s history), passively captur-
ing what we see and hear (e.g., via a wearable camera), monitoring
personal biometrics (e.g., steps taken, sleep quality), and logging
mobile device and environmental context (e.g., the user’s location,
smart home sensing).
   Different from lifelogging, which does not focus on the analysis
of the logged information, the quantified self is a movement to
incorporate data acquisition technology on certain focused aspects
of one’s daily life [23]. The quantified self focuses on logging expe-
riences with a clearer understanding of the goals, such as exercise
levels for fitness and health care.

   Organizing. PIM also deals with the management and organiza-
tion of information. It is for instance concerned with the manage-
ment of privacy, security, distribution, and enrichment of informa-
tion.
   A personal data service (PDS) lets the user store, manage, and
deploy her information in a structured way. The PDS may be used
to manage different identities and/or as a central point of information
exchange between services. For instance, an application that recom-
mends new tracks based on what the user likes to listen to may need
to use adapters and authenticate with different services keeping
a listening history. Instead, the PDS centralizes this information,
and the application only needs an adapter to connect to this PDS.
Higgins [43], OpenPDS [10] and the Hub of All Things [26] are
examples of PDSs.
   MyData [37] describes a consent management framework that
lets the user control the flow of data between a service that has in-
formation about her and a service that uses this information. In
this framework, which is still at an early stage of development, a
central system holds credentials to access the different services on
the user’s behalf. The user specifies the rules by which flows of
information between any two of those services are authorized. The
central system is in charge of providing or revoking the necessary
authorizations on each of those services to implement these rules.
Contrary to a PDS, the actual data does not need to flow through
the central system. Two services may spontaneously share infor-
mation about the user with each other if legally entitled (e.g., two
public bodies), in which case the central system is notified. It is
an all-or-nothing approach that represents a paradigm shift from
currently implemented ad-hoc flows of personal information across
organizations.
   Organizing information as a time-ordered stream of documents
(a lifestream) has been proposed as a simple scheme for reducing
of finding things the user has already seen, using whatever context
                                                                              the time the user spends in manually organizing documents into
or meta-information that the user remembers [14, 39, 42].
                                                                              a classic hierarchical file system [16]. It has the advantage of pro-
   Keeping. PIM has also addressed the question: What kind of                 viding unified view of the user’s personal information. Lifestreams
information should be captured and stored in digital form?                    can be seen as a natural representation of lifelog information. The
   A central idea of [3]’s vision is creating a device that is able to dig-   Digital Me system uses this kind of representation to unify data
itally capture all of the experiences and acquired knowledge of the           from different loggers [38].
user, so that it can act as a supplement to her memory. Lifelogging              For managing personal information, different levels of organiza-
attempts to fulfil this vision by visually capturing the world that we        tion and abstraction have been proposed. Personal data lakes [45]
A Knowledge Base for Personal Information Management                                                   LDOW 2018, April 2018, Lyon, France


and personal data spaces [12] offer little integration and focus on handling storage, metadata, and search. On the other end, personal knowledge bases, which include more semantics, have been used: Haystack [30], SEMEX [13], IRIS [4], and NEPOMUK [24]. Such a structure allows the flexible representation of things like "this file, authored by this person, was presented at this meeting about this project". They integrate several sources of information, including documents, media, email messages, contacts, calendars, chats, and the web browser's history. However, these projects date from 2007 and before, and assume that most of the user's personal information is stored on her personal computer. Today, most of it is spread across several devices [1].
   Some proprietary service providers, such as Google and Apple, have arguably come quite close to our vision of a personal knowledge base. They integrate calendars, emails, and address books, and allow smart exchanges between them. Some of them even provide intelligent personal assistants that proactively interact with the user. However, these are closed-source proprietary solutions that promote vendor lock-in. In response, open-source alternative solutions have been developed, for cloud storage in particular, such as ownCloud [35] and Cozy [7]. These have evolved into application platforms that host various other kinds of user-oriented services (e.g., email, calendar, contacts). They leverage multi-device synchronization facilities and standard protocols to facilitate integration with existing contact managers and calendars. Cozy is notable for providing adapters for importing data from different kinds of services (e.g., activity trackers, finance, social) into a document-oriented database. These tools bring the convenience of modern software-as-a-service solutions while letting the user stay in control, not give away her privacy, and free herself from vendor lock-in.

6.2    Information Integration
Data matching (also known as record or data linkage, entity resolution, or object/field matching) is the task of finding records that refer to the same entity across different sources. It is extensively used in data mining projects and in large-scale information systems by businesses, public bodies, and governments [5]. Example application areas include national censuses, the health sector, and fraud detection. In the context of personal information, SEMEX [13] integrates entity resolution facilities. SEMEX imports information from documents, bibliographies, contacts, and email, and uses attributes as well as associations found between persons, institutions, and conferences to reconcile references. However, differently from our work, the integration is done at import time, so the user cannot later manually revoke it through an update, and incremental synchronization is not handled. Recently, contact managers from well-known service providers have started providing de-duplication tools for finding duplicate contacts and merging them in bulk. However, these tools are often restricted to contacts present in the user's address book and do not merge contacts from social networks or emails.
   Common standards, such as vCard and iCalendar, have advanced the state of the art by allowing provider-independent administration of personal information. There is also a proposed standard for mapping vCard content and iCalendars into RDF [6, 27]. While such standards are useful in our context, they do not provide the means to match calendars, emails, and events, as we do. The only set of vocabularies besides schema.org that provides broad coverage of all the entities we are dealing with is the OSCAF ontologies [33]. But their development was stopped in 2013 and they are not maintained anymore, contrary to schema.org, which is actively supported by companies like Google and widely used on the web [22]. Recently, a personal data service has been proposed that reuses the OSCAF ontologies, but it uses a relational database instead of a knowledge base [38].

6.3    Location Analysis and Calendar Matching
The ubiquity of networked mobile devices able to track users' locations over time has been widely exploited for estimating traffic and studying mobility patterns in urban areas. Improvements in the accuracy and battery efficiency of mobile location technologies have made possible the estimation of user activities and visited places on a daily basis [2, 20, 29]. Most of these works have mainly exploited sensor data (accelerometer, location, network) and readily available geographic data. Few of them, however, have exploited the user's calendar and other available data for creating richer and more semantic activity histories. Recently, a study has recognized the importance of using the location history and social network information for improving the representation of information contained in the user's calendar, e.g., for distinguishing genuine real-world events from reminders [31].

7    DISCUSSION
The RDF model seems well-suited for building a personal knowledge base. However powerful, making full use of it relies on being able to write and run performant queries. At this point, we cannot focus on optimizing for a specific application. It is not yet clear to what extent Thymeflow should hold raw data (such as the entire location history), which, depending on how it is used, may be loaded on demand, in mediation style. Additionally, we would like to raise the following issues that could drive future research and development:
      • Gathering data for experiments: The research community might benefit from building and maintaining sets of annotated multi-dimensional personal information for use in different kinds of tasks. This is challenging, especially due to privacy concerns.
      • Opportunities: Internet companies that already hold a lot of user data are not yet integrating everything they have into a coherent whole, and are not performing as well as we think they could. For instance, Google Location History does not integrate the user's calendar, while we do. We think that there are still many opportunities to create new products and functionalities from existing data alone.
      • Inaccessible information: The hard truth is that many popular Internet-based services still do not provide an API for conveniently retrieving user data out of them, or such an API is not feature-complete (e.g., Facebook, WhatsApp).

8    CONCLUSION
The Thymeflow system integrates data from emails, calendars, address books, and location history, providing novel functionalities
on top of them. It can merge different facets of the same agent, determine prolonged stays in the location history, and align them with events in the calendar.
   Our work is a unique attempt at building a personal knowledge base. First, the system is complementary to, and does not pretend to replace, the existing user experience, applications, and functionalities, e.g., for reading/writing emails, managing a calendar, or organizing files. Second, we embrace personal information as being fundamentally distributed and heterogeneous, and focus on the need of providing knowledge integration on top, for creating completely new services (query answering, analytics). Finally, while the system could benefit from more advanced analysis, such as the extraction of entities from rich text (e.g., emails) for linking them with elements of the KB, our first focus is on enriching existing semi-structured data, which improves the quality of data for use by other services.
   Our system can be extended in a number of directions, including incorporating more data sources, extracting semantics from text, and complex analysis of users' data and behavior. Future applications include personal analytics, cross-vendor search, intelligent event planning, recommendation, and prediction. Also, our system could use a simpler query language, perhaps natural language, or even proactively interact with the user, in the style of Apple's Siri, Google's Google Now, Microsoft's Cortana, or Amazon Echo.
   While the data obtained by Thymeflow remains under the user's direct control, fully respecting her privacy, the data residing outside of it may not. However, using Thymeflow, the user could have a better understanding of what other systems know about her, which is an important first step in gaining control over it.

REFERENCES
 [1] Serge Abiteboul, Benjamin André, and Daniel Kaplan. 2015. Managing your digital life. CACM 58, 5 (2015).
 [2] Daniel Ashbrook and Thad Starner. 2003. Using GPS to learn significant locations and predict movement across multiple users. Personal and Ubiquitous Computing 7, 5 (2003).
 [3] Vannevar Bush. 1945. As We May Think. The Atlantic (1945).
 [4] Adam Cheyer, Jack Park, and Richard Giuli. 2005. IRIS: Integrate. Relate. Infer. Share. Technical Report. DTIC Document.
 [5] Peter Christen. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer.
 [6] Dan Connolly and Libby Miller. 2005. RDF Calendar - an application of the Resource Description Framework to iCalendar Data. http://www.w3.org/TR/rdfcal/.
 [7] Cozy Cloud. 2016. Cozy – Simple, versatile, yours. (2016). https://cozy.io/
 [8] David H. Crocker. 1982. Standard for the format of ARPA Internet text messages. RFC 822. IETF. https://tools.ietf.org/html/rfc822
 [9] Richard Cyganiak, David Wood, and Markus Lanthaler. 2014. RDF 1.1 Concepts and Abstract Syntax. http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/.
[10] Yves-Alexandre de Montjoye, Erez Shmueli, Samuel S Wang, and Alex Sandy Pentland. 2014. openPDS: Protecting the privacy of metadata through SafeAnswers. PLoS ONE 9, 7 (2014), e98790.
[11] B. Desruisseaux. 2009. Internet Calendaring and Scheduling Core Object Specification (iCalendar). RFC 5545. IETF. https://tools.ietf.org/html/rfc5545
[12] Jens-Peter Dittrich and Marcos Antonio Vaz Salles. 2006. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. In VLDB. 367–378. http://dl.acm.org/citation.cfm?id=1164160
[13] Xin Dong and Alon Y. Halevy. 2005. A Platform for Personal Information Management and Integration. In CIDR. 119–130. http://www.cidrdb.org/cidr2005/papers/P10.pdf
[14] Susan T. Dumais, Edward Cutrell, Jonathan J. Cadiz, Gavin Jancke, Raman Sarin, and Daniel C. Robbins. 2003. Stuff I've seen: a system for personal information retrieval and re-use. In SIGIR. 72–79. DOI:https://doi.org/10.1145/860435.860451
[15] Facebook. 2016. The Graph API. (2016). https://developers.facebook.com/docs/graph-api/
[16] Eric Freeman and David Gelernter. 1996. Lifestreams: A storage model for personal data. ACM SIGMOD Record 25, 1 (1996), 80–86.
[17] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey Ullman, Vasilis Vassalos, and Jennifer Widom. 1997. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems 8, 2 (1997).
[18] Paula Gearon, Alexandre Passant, and Axel Polleres. 2013. SPARQL 1.1 Update. https://www.w3.org/TR/sparql11-update/.
[19] Jim Gemmell, Gordon Bell, and Roger Lueder. 2006. MyLifeBits: a personal database for everything. Commun. ACM 49, 1 (2006), 88–95.
[20] Google. 2016. Google Maps Timeline. (2016). https://www.google.fr/maps/timeline
[21] Google. 2017. Google Maps APIs. (2017). https://developers.google.com/maps/documentation/
[22] RV Guha, Dan Brickley, and Steve Macbeth. 2016. Schema.org: Evolution of structured data on the web. CACM 59, 2 (2016).
[23] Cathal Gurrin, Alan F Smeaton, and Aiden R Doherty. 2014. Lifelogging: Personal big data. Foundations and Trends in Information Retrieval 8, 1 (2014), 1–125.
[24] Siegfried Handschuh, Knud Möller, and Tudor Groza. 2007. The NEPOMUK project - on the way to the social semantic desktop. In I-SEMANTICS.
[25] Steve Harris, Andy Seaborne, and Eric Prud'hommeaux. 2013. SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/.
[26] HATDeX Ltd. 2017. Hub of All Things. (2017). https://hubofallthings.com
[27] Renato Iannella and James McKinney. 2014. vCard Ontology - for describing People and Organizations. http://www.w3.org/TR/2014/NOTE-vcard-rdf-20140522/.
[28] William Jones and Jaime Teevan. 2011. Personal Information Management. University of Washington Press, Seattle, WA, USA.
[29] Jong Hee Kang, William Welbourne, Benjamin Stewart, and Gaetano Borriello. 2004. Extracting Places from Traces of Locations. In WMASH.
[30] David R Karger, Karun Bakshi, David Huynh, Dennis Quan, and Vineet Sinha. 2005. Haystack: A customizable general-purpose information management tool for end users of semistructured data. In CIDR.
[31] Tom Lovett, Eamonn O'Neill, James Irwin, and David Pollington. 2010. The calendar as a sensor: analysis and improvement using data fusion with social networks and location. In UbiComp.
[32] David Montoya, Thomas Pellissier Tanon, Serge Abiteboul, and Fabian Suchanek. 2016. Thymeflow, A Personal Knowledge Base with Spatio-temporal Data. In CIKM. Demonstration paper.
[33] Nepomuk Consortium and OSCAF. 2007. OSCAF Ontologies. (2007). http://oscaf.sourceforge.net/
[34] Mikhail S Nikulin. 2001. Hellinger distance. Encyclopedia of Mathematics (2001).
[35] ownCloud. 2016. ownCloud – A safe home for all your data. (2016). https://owncloud.org/
[36] S. Perreault. 2011. vCard Format Specification. RFC 6350. IETF. https://tools.ietf.org/html/rfc6350
[37] A. Poikola, K. Kuikkaniemi, and H. Honko. 2014. MyData – A Nordic Model for human-centered personal data management and processing. (2014). https://www.lvm.fi/documents/20181/859937/MyData-nordic-model/2e9b4eb0-68d7-463b-9460-821493449a63?version=1.0
[38] Mats Sjöberg, Hung-Han Chen, Patrik Floréen, Markus Koskela, Kai Kuikkaniemi, Tuukka Lehtiniemi, and Jaakko Peltonen. 2016. Digital Me: Controlling and Making Sense of My Digital Footprint. (2016). http://reknow.fi/dime/
[39] Craig AN Soules and Gregory R Ganger. 2005. Connections: using context to enhance file search. ACM SIGOPS Operating Systems Review 39, 5 (2005), 119–132.
[40] Fabian M Suchanek, Serge Abiteboul, and Pierre Senellart. 2011. PARIS: Probabilistic alignment of relations, instances, and schema. PVLDB 5, 3 (2011).
[41] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In WWW.
[42] Jaime Teevan. 2007. The re:search engine: simultaneous support for finding and re-finding. In UIST. 23–32. DOI:https://doi.org/10.1145/1294211.1294217
[43] Paul Trevithick and Mary Ruddy. 2012. Higgins – Personal Data Service. (2012). http://www.eclipse.org/higgins/
[44] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. CACM 57, 10 (2014).
[45] Coral Walker and Hassan Alrehamy. 2015. Personal Data Lake with Data Gravity Pull. In BDCloud. 160–167. DOI:https://doi.org/10.1109/BDCloud.2015.62
[46] Wikipedia contributors. 2016. List of search engines – Desktop search engines. (2016). https://en.wikipedia.org/w/index.php?title=List_of_search_engines