=Paper= {{Paper |id=Vol-170/paper-12 |storemode=property |title=iMeMex: A Platform for Personal Dataspace Management |pdfUrl=https://ceur-ws.org/Vol-170/paper11.pdf |volume=Vol-170 }} ==iMeMex: A Platform for Personal Dataspace Management== https://ceur-ws.org/Vol-170/paper11.pdf
iMeMex: A Platform for Personal Dataspace Management∗

Marcos Antonio Vaz Salles          Jens-Peter Dittrich

Institute of Information Systems
ETH Zurich
8092 Zurich, Switzerland
dbis.ethz.ch | iMeMex.org

∗ This work is partially supported by the Swiss National Science Foundation (SNF) under contract 200021-112115.

© 2006 for the individual paper by the paper's authors. Copying permitted for private and scientific purposes. Re-publication of material on this page requires permission by the copyright owners.
Proceedings of the VLDB2006 Ph.D. Workshop
Seoul, Rep of Korea, 2006

ABSTRACT
Desktop computers provide thousands of different applications that query and store data in hundreds of thousands of files of different formats. Those files are stored in the local filesystem and also in a number of remote data sources, such as network shares or as attachments to emails. To handle this heterogeneous and distributed mix of personal information, data processing logic is re-invented inside each application. This results in an undesirable situation: most advanced data management functionality, such as complex queries, backup and recovery, versioning, and provenance tracking, is (at least partially) performed by end-users in tedious, manual tasks. To solve these problems we propose a software platform that brings physical and logical data independence to the desktop, freeing users from low-level data management activities. Unlike current relational DBMSs, this platform unifies data from several independent personal data sources without imposing semantic schema integration. It manages the complex dataspace [12] of one's personal information. We attack three major research challenges in the building of that platform: (i) definition of a data model that allows the integration of information in distinct representations and locations, (ii) design of a new search&query language over this data model along with algorithms for the efficient processing of complex queries, and (iii) formulation of an update model that enables soft durability guarantees, when compared to ACID properties, on data authored independently from the platform.

1.    INTRODUCTION
   In 1945, Bush [3] presented a vision of a personal information management system named memex. That vision has deeply influenced several advances in computing. Part of that vision led to the development of the Personal Computer in the 1980's. It also led to the development of hypertext and the World Wide Web in the 1990's. Since then, several projects have attempted to implement other memex-like functionality [13, 2, 4, 18]. Further, personal information management regained interest in the DB research community [15, 9, 8]. Moreover, it was identified as an important topic in the Lowell Report [1], discussed in a VLDB panel [19], considered in an NSF-sponsored workshop [17], debated in the SIGIR 2006 PIM workshop [22], and became the topic of both SIGMOD 2005 keynotes [2, 21].
   In spite of these previous efforts, we argue that a satisfactory solution has not yet been brought forward to the issues of physical and logical data independence on the desktop. Physical data independence relates to abstraction from the devices and formats in which data is represented. This is clearly not achieved by the simple data model of the current generation of file systems. Applications develop specific solutions that directly handle protocols to access the data (email, RSS/ATOM, network file system, etc.) and also formats in which data is stored (XML, LaTeX, image and audio formats, etc.). This creates application-specific data silos in which data management functionality, e.g., querying, updating, performing backup and recovery operations, is absent or re-invented. Logical data independence relates to the capability of defining views over the logical data model in which data is represented. It is also only partially achieved with current desktop systems, e.g. smart folders.
Personal Dataspaces. DBMS technology successfully resolved the physical and logical data independence problem for highly structured data, but not for the highly heterogeneous data mix present in personal information. Indeed, Franklin et al. [12] argue that today we rarely have a situation in which all the data that needs to be managed can fit nicely into a relational DBMS. Rather, most of the data will be authored independently from the DBMS and will not be in its full control. Franklin et al. introduce the term dataspace to describe this world of disparate, distributed and independently authored unstructured, semi-structured and structured data.
   In this project we focus on personal dataspaces, that is, the total of all personal information pertaining to a given individual. In contrast to the vision of [12], we propose one concrete Personal Dataspace Management System (PDSMS) implementation, named iMeMex (integrated memex). Unlike traditional information integration approaches, a PDSMS does not require semantic data integration before any data services are provided. Rather, a PDSMS is a data co-existence approach in which tighter integration is performed in "pay-as-you-go" fashion [12].
Current Status. The ultimate goal of the dissertation is to build the first publicly available PDSMS. The dissertation has so far one year of development and a research plan has been drawn for the next three years. In the first year of work, the Ph.D. student has helped to set the vision and context for the iMeMex project. As a result of this work, we have written a research proposal detailing the goals and work breakdown for the whole project. This proposal has been accepted by the Swiss National Science Foundation (SNF) [6] and supports two Ph.D. positions for a period of three years.
   To evaluate our ideas, we have developed one first prototype of
iMeMex. It was demonstrated in [8] and provided a traditional file system interface to explore arbitrary views over one's personal information. In parallel to the development of the prototype, we have defined our data model for representing personal information: the iMeMex Data Model (iDM). That model is presented in [7].
   The current, second, version of the iMeMex platform extends the first prototype and incorporates the work on iDM. It offers a unified view on a set of personal data sources and allows basic query processing on that view. Our current implementation (Java 1.4) contains about 215 classes and 22,000 lines of code.
Outline. We present related personal information management approaches in Section 2. We proceed by discussing the research challenges involved in building a PDSMS in Section 3. We then describe in Section 4 some of our solutions to these challenges, which consist of: (1) a unified data model for personal information (Section 4.1); (2) a flexible query language that operates on this model, along with techniques for efficient query processing (Section 4.2); (3) an update model for a PDSMS which includes mechanisms for recovery and versioning of all data present in a personal dataspace (Section 4.3); and (4) the architecture of a PDSMS, which integrates all of the previous contributions into a unified framework (Section 4.4). Finally, we conclude in Section 5.

2.    STATE OF THE ART
   As we approach an age in which each computer user will face the challenge of managing her own personal terabyte, PIM research has obtained renewed interest in a variety of areas, such as HCI, IR and data management [17]. Due to space limitations, we only comment on a few solutions in this section. Current operating systems have been amended in the past years to include full-text search appliances, such as Google Desktop, Apple Spotlight, and Phlat [4]. These systems offer an intuitive keyword search interface, sometimes augmented by generic metadata (modification date, author, etc.). Their data models, however, are unable to represent structural information inside documents. A PDSMS, in contrast, enriches keyword and property search with advanced structural querying.
   Systems such as SEMEX [9] and Haystack [18] allow users to browse by association. They employ an ETL cycle to extract information from desktop data sources into a repository and represent that information in a domain model (ontology). The domain model is a high-level mediated schema over the personal information sources. These systems focus on creating a queryable, however non-updatable, view on the user's personal information. In contrast, a PDSMS offers support for not only advanced querying and browsing but also for updating information in the underlying personal dataspace whenever possible. In fact, all of the systems above may be thought of as applications on top of a PDSMS.
   Other systems offer tools to ease the management of personal data. Lifestreams [13] organizes all personal documents in a timeline. In Placeless Documents [11], users may tag their documents with active properties, such as "backup" or "replicate", and the appropriate actions will be carried out by the system. MyLifeBits [2] models each piece of information as a resource and allows these resources to be annotated and organized in collections. Microsoft WinFS [23], now discontinued¹, represented information in an item data model which is a subset of the object-oriented data model and offered a basic class library to represent data items commonly found in user desktops. Like MyLifeBits, WinFS based storage of items on a relational DBMS. All of these systems need full control of the data to offer features such as backup&recovery. In contrast, a PDSMS enables data to be authored and updated independently by the interfaces offered by the underlying data sources. Further, in these systems, advanced PDSMS queries that bridge structural information across the inside-outside file boundary are not available (see Example 1).

¹ The downloadable beta as well as all other preliminary information about WinFS were recently removed from its web-site [23].

3.    RESEARCH CHALLENGES
   In the following, we discuss major research challenges that are targeted by the Ph.D. on the iMeMex PDSMS.
Challenge 1 (Representing Personal Information). A major research challenge of managing personal information is dealing with its heterogeneity. Heterogeneity relates to data models and formats used to represent the information. It also relates to the data sources in which that information is available and to the mechanisms available for data delivery (push/pull). Let's consider an example:
EXAMPLE 1 (INSIDE AND OUTSIDE FILES) Users organize their workspaces in folder hierarchies and use applications to store information inside files. Each file is an independent data cage in which complex structural representations may be stored. Consider the following query: "Show me all LaTeX 'Introduction' sections pertaining to project PIM that contain the phrase 'Personal Information'". With current technology, this query cannot be issued in one single request by the user as it has to bridge the inside-outside file boundary. The user may only search the file system using simple system tools like grep, find, or a keyword search engine. However, these tools may return a large number of results which would have to be examined manually to determine the final result. Even when a matching file is encountered, for structured file formats like Microsoft PowerPoint the user typically has to conduct a second search inside the file to find the desired information [4]. Moreover, state-of-the-art operating systems do not support at all exploitation of structured information inside the user's documents. □
   Ideally, we would like to have a common representation for all personal information in different data models and sources. This common representation (or view) would enable queries that ignore how the data is stored or where it is located. In addition, we should be able to construct that view without performing labor-intensive semantic schema integration. Rather, we would like to perform lightweight data model integration and leave expensive semantic integration to be carried out in a "pay-as-you-go" fashion.
Challenge 2 (Querying Personal Information). Once we have an integrated view on one's personal dataspace, the next natural challenge is how to query this view. Users have traditionally employed browsing (i.e., neighborhood expansion) and keyword queries to explore their data. Ideally, we would like to provide one single search&query language to analyze and modify all data in a personal dataspace. This query language should allow impreciseness in query formulation and also integrate ranking of query results. Further, advanced functionality, such as branching expressions and joins, should be available. Note that expressions written in this language should be processed with interactive response times.
Challenge 3 (Updating Personal Information). Given a unified view of all personal information, another important challenge is to provide means to update that personal information through that unified view. Current desktop search engines (DSEs) are read optimized systems. DSEs are able to detect updates made to the data sources and to incorporate those updates into their index structures. They have, however, two important drawbacks: (1) DSEs do not allow applications to perform updates on the data sources through the DSEs' interfaces, and (2) DSEs do not offer any update guarantees on the underlying data, such as durability (e.g., to allow recovery of past images of the data).
   DBMSs, on the other hand, provide update interfaces and also strict transactional ACID guarantees, but demand a high price for them: full control of the data. In contrast to both approaches, a PDSMS occupies the middle-ground between a read-only DSE (without update guarantees) and a write-optimized DBMS (with strict ACID guarantees). Guarantees may vary according to the interfaces offered by the data sources managed by the PDSMS. To enforce soft durability guarantees, a PDSMS must provide algorithms for backup&recovery, versioning, update handling, among others.

[Figure 1, panel (a) Heterogeneous Personal Information: the folder Projects contains the folders PIM and OLAP; PIM contains the file "vldb 2006.tex" with the following content:]

    \documentclass{vldb}
    \title{iDM: A Unified ...}
    \abstract{Personal Information...}
    \begin{document}
    \section{Introduction}
    Personal Information...
    ...
    \subsection{The Problem}
    \label{sec:theproblem}
    .. concepts in Section~\ref{sec:preliminaries} ..
    \section{Preliminaries}
    \label{sec:preliminaries}
    As mentioned in Section~\ref{sec:theproblem} ..
    \end{document}

[Figure 1, panel (b) Resource View Graph: the same data drawn as one graph, with the node labels below; "ref" edges connect The Problem and Preliminaries:]

    Projects
      PIM
        vldb 2006.tex
          documentclass
          title (text)
          abstract (text)
          document
            Introduction (text)
              The Problem (text, ref)
            Preliminaries (text, ref)
      OLAP

Figure 1: iDM represents heterogeneous information in a single resource view graph
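
To make the inside-outside file boundary concrete, the following minimal Python sketch treats the file content of Figure 1(a) as structure rather than as an opaque blob. It is an illustration only, not the iMeMex implementation: the helper names (`parse_sections`, `reference_edges`, `find_sections`) and the path `/Projects/PIM/vldb 2006.tex` are assumptions made for this example, and the regular expressions only handle the simple `\section`, `\subsection`, `\label` and `\ref` forms that appear in the figure.

```python
import re

# Content of "vldb 2006.tex" as shown in Figure 1(a); elisions are kept as-is.
TEX = r"""
\documentclass{vldb}
\title{iDM: A Unified ...}
\abstract{Personal Information...}
\begin{document}
\section{Introduction}
Personal Information...
...
\subsection{The Problem}
\label{sec:theproblem}
.. concepts in Section~\ref{sec:preliminaries} ..
\section{Preliminaries}
\label{sec:preliminaries}
As mentioned in Section~\ref{sec:theproblem} ..
\end{document}
"""

HEADING = re.compile(r"\\(section|subsection)\{([^}]*)\}")


def parse_sections(tex):
    """Split LaTeX source into one unit per heading (hypothetical helper)."""
    matches = list(HEADING.finditer(tex))
    units = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(tex)
        units.append({"kind": m.group(1), "title": m.group(2),
                      "body": tex[m.end():end]})
    return units


def reference_edges(units):
    """Return (source title, target title) pairs for every resolvable \\ref."""
    owner = {}
    for u in units:
        for label in re.findall(r"\\label\{([^}]*)\}", u["body"]):
            owner[label] = u["title"]
    edges = set()
    for u in units:
        for target in re.findall(r"\\ref\{([^}]*)\}", u["body"]):
            if target in owner:
                edges.add((u["title"], owner[target]))
    return edges


def find_sections(files, folder, title, phrase):
    """Example 1 style query: sections called `title`, stored under `folder`,
    whose text contains `phrase`."""
    hits = []
    for path, tex in files.items():
        if f"/{folder}/" not in path:
            continue
        for u in parse_sections(tex):
            if u["title"] == title and phrase in u["body"]:
                hits.append((path, u["title"]))
    return hits


files = {"/Projects/PIM/vldb 2006.tex": TEX}  # assumed path for the example
units = parse_sections(TEX)
edges = reference_edges(units)
hits = find_sections(files, "PIM", "Introduction", "Personal Information")
```

Here `hits` answers the query of Example 1 in a single request, and `edges` contains a reference in each direction between "The Problem" and "Preliminaries", which is the cycle that panel (b) draws as "ref" edges.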
4.    OUR APPROACH
   In this section, we provide more details on our previous and ongoing work on the iMeMex PDSMS. First, we discuss our solutions for each of the challenges described in the previous section. We then conclude by presenting the iMeMex PDSMS architecture, which serves as a framework to deploy the presented techniques.

4.1    Representing Personal Information
   Figure 1(a) depicts the situation described in Example 1. It shows a files&folders hierarchy with information on research projects of one of the authors. Note that arbitrary cyclic graph structures may naturally arise in data inside files. In the LaTeX document "vldb 2006.tex", for example, inside the subsection "The Problem", there is a reference to the section "Preliminaries" and vice versa. An extended example of such occurrences is provided in [7].
Why XML is not Enough. Ideally, files&folders as well as the structure inside files should be represented in the same logical data model. One could try to employ XML technology to address this challenge of representation heterogeneity. In fact, we followed that approach in [8]. Unfortunately, XML is associated with both a logical data model and a physical markup to represent this logical model. This means that the manipulation of XML views is coupled with serialization concerns. Recent work has identified this gap, e.g. [20, 16, 14], and argues in favor of clearly separated logical data models supporting more advanced features, e.g. multiple hierarchies [16]. However, none of the existing approaches is sufficient to naturally represent the complex, possibly infinite, distributed and lazily computed information graph encountered in a personal dataspace. Therefore, we have decided to represent all personal information based on a novel, more powerful, logical data model: the iMeMex Data Model (iDM).
Resource View Graph. We briefly sketch a few characteristics of iDM in this section; full details are provided elsewhere [7]. iDM enables a logical representation of a personal dataspace, as shown in Figure 1(b). The main features of iDM are:
  • Resource Views: in iDM, all personal information is represented by fine-grained resource views. A resource view is made of components that express structured, semi-structured and unstructured pieces of the underlying data. For instance, resource views may represent nodes in a files&folders hierarchy as well as elements in an XML, LaTeX or other office document. Other than that, we use resource views to uniformly represent email messages, email attachments, infinite data streams, relational data, RSS/ATOM messages, bookmarks, query results, calls to web services and many others [7]. The granularity at which resource views are represented is determined by a set of plugin components in our system architecture (see Section 4.4).
  • Graphs: resource views in iDM are linked to each other forming directed graph structures. In Figure 1(b), we show the resource view graph that corresponds to the personal data in Figure 1(a). In that graph, there is no inside-outside file boundary. All structural elements (folders, sections, subsections, etc.) are represented in the same model and queries may address them uniformly. Note that cycles may naturally arise in that graph (in this example as a consequence of section cross referencing).
  • Intensional Data: any given resource view or parts of a resource view graph may be either materialized (i.e., extensional data) or computed on demand as the result of a query or of a remote web service invocation (i.e., intensional data [20]). This is in sharp contrast to static data models such as XML.
  • Stream Support: another important feature of our model is the ability of resource views to contain finite as well as infinite components. Infinite resource view components are used to represent data streams (e.g., RSS, publish/subscribe) and content streams (e.g., audio and video) in our model.
   In our approach, the notion of impreciseness is included in our query language, briefly discussed in Section 4.2.
Data Model Instantiations. A resource view is given by the following four formal components:
     name η       Name of the resource view.
     tuple τ      List of attribute value pairs ((name0, value0), (name1, value1), ...).
     content χ    (in)finite In-/Output of content (e.g. text).
     group γ      References to other resource views.
                  - S: (in)finite set {...}
                  - Q: (in)finite ordered sequence ⟨...⟩
   We use resource view classes to constrain resource view components. Resource view classes allow integration of data from diverse data models into iDM without requiring time consuming semantic schema integration. A resource view $V_i$ of class $C$ is denoted by $V_i^C$. Similarly, its components are denoted by $\eta_i^C$, $\tau_i^C$, $\chi_i^C$, and $\gamma_i^C$.
   We show in Table 1 how our model may be constrained to represent files, folders and the core subset of XML. We denote the name of an underlying data item $i$ by $N_i$, attribute-value pairs associated to it by a schema $W$ and a tuple $T_i$, and its content by $C_i$.
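
The four components and the idea of class constraints can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the iDM definition from [7]: the names `ResourceView`, `conforms` and `reachable` are invented for this sketch, only finite, materialized components are modeled (no streams, no intensional data), the classes `latexfile` and `latexsection` are hypothetical stand-ins chosen by analogy with `xmlfile`, and cross references are simply appended to the ordered group Q rather than drawn as separate "ref" nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False, repr=False)  # identity semantics keep cyclic graphs safe
class ResourceView:
    cls: str                                                  # resource view class
    name: Optional[str] = None                                # name component (eta)
    tuple_: Dict[str, object] = field(default_factory=dict)   # tuple component (tau)
    content: Optional[str] = None                             # content component (chi)
    group_set: List["ResourceView"] = field(default_factory=list)  # group, S part
    group_seq: List["ResourceView"] = field(default_factory=list)  # group, Q part


# Assumption: a LaTeX file counts as a file when it appears inside a folder.
FILE_LIKE = {"file", "latexfile"}


def conforms(v):
    """Check the file and folder rows of the class table; other classes pass."""
    if v.cls == "file":
        return v.name is not None and not v.group_set and not v.group_seq
    if v.cls == "folder":
        children_ok = all(c.cls in FILE_LIKE | {"folder"} for c in v.group_set)
        return (v.name is not None and v.content is None
                and not v.group_seq and children_ok)
    return True


def reachable(root):
    """Visit every resource view reachable from root exactly once."""
    seen, stack, order = set(), [root], []
    while stack:
        v = stack.pop()
        if id(v) in seen:
            continue
        seen.add(id(v))
        order.append(v)
        stack.extend(v.group_set + v.group_seq)
    return order


# The graph of Figure 1(b), simplified to folders, the file and its sections.
problem = ResourceView("latexsection", "The Problem", content="...")
prelim = ResourceView("latexsection", "Preliminaries", content="...")
intro = ResourceView("latexsection", "Introduction",
                     content="Personal Information...", group_seq=[problem])
problem.group_seq.append(prelim)   # cross reference into "Preliminaries"
prelim.group_seq.append(problem)   # and back, which closes a cycle
tex = ResourceView("latexfile", "vldb 2006.tex", group_seq=[intro, prelim])
pim = ResourceView("folder", "PIM", group_set=[tex])
olap = ResourceView("folder", "OLAP")
projects = ResourceView("folder", "Projects", group_set=[pim, olap])

names = sorted(v.name for v in reachable(projects))
```

Because folders and sections are the same kind of object, `reachable` walks from `Projects` down into the sections of the file without a separate step at the file boundary, and the visited set is what keeps the walk finite despite the cross-reference cycle.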
Resource View Class          Resource View Components Definition
Description     Name      $\eta_i^C$   $\tau_i^C$      $\chi_i^C$   $\gamma_i^C$
----------------------------------------------------------------------------------------------------------------
File            file      $N_f$        $W_{FS}, T_f$   $C_f$        S: ∅
                                                                    Q: ⟨⟩
Folder          folder    $N_F$        $W_{FS}, T_F$                S: $\{V_1^{child}, \ldots, V_m^{child}\}$, child ∈ {file, folder}
                                                                    Q: ⟨⟩
XML text node   xmltext                                $C_t$        S: ∅
                                                                    Q: ⟨⟩
XML element     xmlelem   $N_E$        $W_E, T_E$                   S: ∅
                                                                    Q: $\langle V_1^{child}, \ldots, V_n^{child} \rangle$, child ∈ {xmltext, xmlelem}
XML document    xmldoc                                              S: ∅
                                                                    Q: $\langle V_{root}^{xmlelem} \rangle$
XML File        xmlfile   $N_f$        $W_{FS}, T_f$   $C_f$        S: ∅
                                                                    Q: $\langle V_{doc}^{xmldoc} \rangle$

  Table 1: Resource View Classes for files&folders and XML

count for trade-offs in the usage of alternative query plans, e.g., to consider join orders and different access methods.
Neighborhood Queries. Providing context is key to enable exploration of query results [5]. Thus, it is a common pattern to query the neighborhood of objects returned from a previous query. One alternative to speed-up such queries is to keep their results materialized in a special index structure. This index may cover only the immediate neighborhood of each resource view or it may be extended to include other reachable resource views. We plan to evaluate how much context should be kept in the index structure to account for trade-offs in querying speed, indexing time and update processing.

4.3    Updating Personal Information
   The iMeMex PDSMS should offer soft durability guarantees on updates made through its interface or via the APIs of the underlying data sources bypassing iMeMex. In the following, we discuss our ideas for tackling that challenge.
The instantiations shown in Table 1 allow the creation of resource                       Dataspace Update Model. We plan to design an update model for
view graphs as the one shown in Figure 1(b). Our data model is,                          the iMeMex PDSMS that accounts for the fact that data may be in-
however, much more powerful: instantiations for relations and data                       dependently updated via the APIs of the underlying data sources.
streams, as well as a more rigorous discussion on intensional as-                        In this scenario, ACID guarantees are too strict, once the iMeMex
pects [20] of iDM are presented in [7].                                                  PDSMS may be notified of updates “after the fact”. Neverthe-
                                                                                         less, we believe that classical database logging techniques may be
4.2      Querying Personal Information                                                   adapted to this setting to provide softer recovery guarantees (e.g.,
   The iMeMex PDSMS should offer querying services on the re-                            all items updated more than 5 min ago may be recovered).
source view graph representing all of one’s personal dataspace. In                       Versioning. In relational systems, previous versions of a given
the following, we discuss our ongoing work and open issues on                            tuple may be reconstructed from the database log (see e.g. “time
query specification and processing.                                                      travel” feature of Oracle). However, personal items are typically
Personal Dataspace Query Language. We propose a new search&                              more heavyweight than relational tuples, as they may have medium
query language for schema-agnostic querying of a resource view                           to large content. An alternative to logging would be to keep an in-
graph: the iMeMex Query Language (iQL). The definition of the                            dependent versioning subsystem (e.g. Subversion) to track content
iQL syntax and associated semantics is work in progress. In our                          evolution. We plan to investigate how to integrate versioning into
current implementation, the syntax of iQL is a mix between typical                       our update model for personal information and also whether there
search engine keyword expressions and XPath navigational restric-                        are profitable interactions with the techniques chosen for recovery.
tions. The semantics of our language are, however, much differ-                          Write back. Updates to personal information may be performed
ent than those of XPath and XQuery. Our language’s goal is to                            via the API of a given data source or via iMeMex’s API. In the latter
enable querying of a resource view graph that has not necessarily                        case, one must write the data back to the affected data sources. If
been submitted to expensive schema integration. Therefore, as in                         the data is not already present in any data source, iMeMex must
search technology, we account for impreciseness in query formula-                        decide in which subsystem(s) it is most suitable to be represented.
tion. For example, by default, when an attribute name is specified                       Distribution. When a user has several devices, it is natural to ask
(e.g. size > 10K), we do not require exact matches on the (implicit                      how to manage several iMeMex instances and coordinate distributed
or explicit) schema for that attribute, but rather return fuzzy, ranked                  query and update processing among these instances. We believe
results for the resource views that better match the specified con-                      that the many challenges of this scenario exceed the scope of the
ditions (e.g. size, fileSize, docSize). This allows us to define mal-                    current Ph.D. work. Those challenges will be tackled by a separate
leable schemas as in [10]. Other important features of iQL are the                       Ph.D. thesis as part of the iMeMex project.
ability to reflect structural constraints, e.g. to explore the context or
neighborhood of items, the definition of extensible algebraic oper-                      4.4    iMeMex PDSMS Architecture
ations like joins and grouping, and the specification of updates to                         We present in this section the current architecture of the iMeMex
the resource view graph.                                                                 PDSMS, which serves as a framework for all of the previously dis-
Indexing Techniques. In our current implementation of iMeMex,                            cussed technical contributions. We also indicate points of ongoing
we index all components of every resource view created in the sys-                       work in which the architecture will be extended.
tem. This full indexing strategy follows the intuition that the PIM                         The core idea of iMeMex is to implement a logical layer that ab-
environment shares with data warehousing the characteristic of low                       stracts from the underlying subsystems and data sources, such as
update rates, allowing us to trade space and indexing time for query                     file systems, email servers, network shares, music streams, RSS
performance. The information from each component of a resource                           feeds, etc. That logical layer does not take full control of the data,
view (e.g., name or group of related resource views) goes to a dif-                      so it may be bypassed by applications. Figure 2 depicts that layer
ferent index and we perform intersects to process conditions on sev-                     and its current implementation in iMeMex.
eral components. We plan to investigate whether it pays off to have                         iMeMex contains two important sublayers: iQL Query Proces-
integrated index structures for various resource view components.                        sor and Resource View Manager. The main task of the iQL Query
In contrast to traditional XML indexing, our index structures must                       Processsor is to translate incoming iQL queries and to create query
operate in a general graph data model on possibly infinite data.                         plans for those queries. Our current implementation is based on
Cost-based Optimization. Cost-based optimization (CBO) is one                            rule-based query optimization. We plan to invest in cost-based op-
key technique to provide interactive response times in read-mostly                       timization techniques as part of future work.
environments. We are planning to build a CBO for iMeMex to ac-                              The Resource View Manager (RVM) is the central instance to
managing resource views. Its major components are: Data Source Proxy, ContentToiDMConverters, Replica&Indexes Module, and Synchronization Manager. We describe them in the following.

[Figure 2: iMeMex PDSMS architecture. Three layers, top to bottom. Application Layer: iMeMex iQL GUI, iMeMex iQL Shell, 3rd party. iMeMex PDSMS Layer: iMeMex Query Language (iQL); Query Processor; Resource View Manager with Handler, Sync Manager, Replica & Indexes (Indexes, Replicas, Catalog), ContentToiDM Converters (XML, LaTeX, ...), and Data Source Proxy with Data Source Plugins; further boxes labeled Search Engine, DBMS, FS, IMAP, RSS, ... Data Source Layer: File System, RSS, IMAP, Database, Live Streaming, etc.]

Data Source Proxy. Provides connectivity to the distinct types of subsystems. It contains a set of Data Source Plugins that represent the data from the different subsystems (e.g., file systems, RSS, IMAP, databases, etc.) as an initial iDM graph.

ContentToiDMConverters Module. Enriches the iDM graph provided by the data source proxy. This is achieved by converting resource view content to iDM subgraphs that then reflect structural information (e.g., in LaTeX, XML, etc.). The result is an iDM graph such as the one presented in Section 4.1.

Replica&Indexes Module. Materializes mappings between resource view identifiers and resource view components (e.g., name or group of related resource views) to accelerate query processing. A mapping from resource view identifiers to copies of component instances is termed a replica. The inverse mapping is termed an index. Currently, our implementations of replicas and indexes are based on a DBMS (Apache Derby) for structured information, such as attribute-value pairs and resource view connections, and on inverted keyword lists (Apache Lucene) for textual information, such as names and text content. We plan to extend this module to provide specialized index structures as discussed in Section 4.2.

Synchronization Manager. Monitors registered data sources for changes. When a data source is registered at the RVM, the Synchronization Manager analyzes the data found on the data source and sends each resource view definition to the Replica&Indexes Module. The Synchronization Manager also subscribes to update notifications from the data source. As a consequence, updates performed on the data source that bypass the RVM layer are immediately taken into account by the Synchronization Manager and the Replica&Indexes Module. If the data source does not offer update notifications, the Synchronization Manager generates them based on a generic polling facility. We will extend this module to incorporate recovery and versioning techniques, as described in Section 4.3.

5. CONCLUSION

Personal Information Management has become a key necessity for almost everybody. Reflecting this prominence, considerable attention has been given to PIM research in the recent past. At the same time, it has become clear that what is missing is a unified approach to bring physical and logical data independence to the management of one’s personal dataspace. We address three major research challenges in the pursuit of this goal. First, we define a new data model capable of representing the heterogeneous data mix found in personal dataspaces. As one application of our model we bridge the artificial boundary that separates inside and outside files. Second, we are working on a new search&query language that operates on our data model. The processing of expressions in this language calls for the design of efficient techniques, e.g., for indexing and neighborhood querying. Third, we are working on a dataspace update model. That model will provide soft durability guarantees, write-back to data sources, as well as detection of changes made on data sources bypassing iMeMex. We plan to design integrated recovery and versioning techniques to support our update model. By building the first publicly available PDSMS, we believe that we make a significant contribution to the development of advanced PIM applications.

6. REFERENCES

 [1] S. Abiteboul, R. Agrawal, P. A. Bernstein, and others. The Lowell Database Research Self Assessment. The Computing Research Repository (CoRR), cs.DB/0310006, 2003.
 [2] G. Bell. MyLifeBits: a Memex-Inspired Personal Store; Another TP Database (Keynote). In ACM SIGMOD, 2005.
 [3] V. Bush. As we may think. Atlantic Monthly, 1945.
 [4] E. Cutrell, D. Robbins, S. Dumais, and R. Sarin. Fast, flexible filtering with Phlat - Personal search and organization made easy. In CHI, 2006.
 [5] J.-P. Dittrich, P. M. Fischer, and D. Kossmann. AGILE: Adaptive Indexing for Context-Aware Information Filters. In ACM SIGMOD, 2005.
 [6] J.-P. Dittrich and D. Kossmann. iMeMex: A Unified Approach to Personal Information Management. In SNF project under contract 200021-112115.
 [7] J.-P. Dittrich and M. A. V. Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. In VLDB, 2006.
 [8] J.-P. Dittrich, M. A. V. Salles, D. Kossmann, and L. Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). In VLDB, 2005.
 [9] X. Dong and A. Halevy. A Platform for Personal Information Management and Integration. In CIDR, 2005.
[10] X. Dong and A. Y. Halevy. Malleable Schemas: A Preliminary Report. In WebDB, pages 139–144, 2005.
[11] P. Dourish et al. Extending Document Management Systems with User-Specific Active Properties. ACM Transactions on Information Systems (TOIS), 18(2):140–170, 2000.
[12] M. Franklin, A. Halevy, and D. Maier. From Databases to Dataspaces: A New Abstraction for Information Management. SIGMOD Record, 34(4):27–33, 2005.
[13] E. Freeman and D. Gelernter. Lifestreams: A Storage Model for Personal Data. SIGMOD Record, 25(1):80–86, 1996.
[14] J. Graupmann, R. Schenkel, and G. Weikum. The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents. In VLDB, 2005.
[15] A. Halevy et al. Crossing the Structure Chasm. In CIDR, 2003.
[16] H. V. Jagadish, L. V. S. Lakshmanan, M. Scannapieco, D. Srivastava, and N. Wiwatwattana. Colorful XML: One Hierarchy Isn’t Enough. In ACM SIGMOD, 2004.
[17] W. Jones and H. Bruce. A Report on the NSF-Sponsored Workshop on Personal Information Management, Seattle, Washington, 2005.
[18] D. R. Karger et al. Haystack: A Customizable General-Purpose Information Management Tool for End Users of Semistructured Data. In CIDR, 2005.
[19] M. Kersten, G. Weikum, M. Franklin, D. Keim, A. Buchmann, and S. Chaudhuri. Panel: A Database Striptease or How to Manage Your Personal Databases. In VLDB, 2003.
[20] T. Milo, S. Abiteboul, et al. Exchanging Intensional XML Data. In ACM SIGMOD, 2003.
[21] T. Mitchell. Computer Workstations as Intelligent Agents (Keynote). In ACM SIGMOD, 2005.
[22] SIGIR PIM 2006. http://pim.ischool.washington.edu/pim06home.htm.
[23] WinFS. http://msdn.microsoft.com/data/WinFS/.
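
As an illustration of the fuzzy attribute matching discussed for iQL in Section 4.2 (a condition such as size > 10K also matching fileSize or docSize, with ranked results), the following minimal Python sketch may be helpful. It is not the iQL implementation: the function names attribute_score and fuzzy_select, the dictionary layout of resource views, the 0.5 threshold, and the use of generic string similarity from Python's difflib are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def attribute_score(query_attr, candidate):
    """Similarity in [0, 1] between a query attribute name and a stored one.

    Assumption: plain string similarity stands in for whatever matching
    a real malleable-schema implementation would use.
    """
    return SequenceMatcher(None, query_attr.lower(), candidate.lower()).ratio()


def fuzzy_select(views, attr, predicate, threshold=0.5):
    """Return (score, view name) pairs, best attribute match first.

    A view qualifies if any of its attributes is similar enough to `attr`
    and that attribute's value satisfies `predicate`.
    """
    results = []
    for view in views:
        best = 0.0
        for name, value in view["tuple"].items():
            score = attribute_score(attr, name)
            if score >= threshold and predicate(value):
                best = max(best, score)
        if best > 0.0:
            results.append((best, view["name"]))
    # Highest score first; ties broken by name for a stable order.
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


# Hypothetical resource views, each with a name and an attribute tuple.
VIEWS = [
    {"name": "report.tex", "tuple": {"size": 20480}},
    {"name": "mail-42", "tuple": {"docSize": 15000}},
    {"name": "photo.jpg", "tuple": {"fileSize": 50000}},
    {"name": "note.txt", "tuple": {"size": 100}},
    {"name": "song.mp3", "tuple": {"duration": 99999}},
]

# Rough analogue of the condition "size > 10K".
RANKED = fuzzy_select(VIEWS, "size", lambda v: v > 10 * 1024)
```

In this sketch an exact attribute name ranks first, near matches follow, and unrelated attributes fall below the threshold and are ignored.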
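
The distinction drawn in Section 4.4 between a replica (resource view identifier to copies of component instances) and an index (the inverse mapping), together with the per-component indexes and intersections of Section 4.2, can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the actual system is described as building on Apache Derby and Apache Lucene, whereas the class ReplicaAndIndexes, its methods, and the component names used here are hypothetical.

```python
from collections import defaultdict


class ReplicaAndIndexes:
    """In-memory stand-in for the replica and index mappings."""

    def __init__(self):
        # Replica: resource view id -> copy of its components.
        self.replica = {}
        # One index per component: component -> value -> set of view ids.
        self.index = defaultdict(lambda: defaultdict(set))

    def add(self, view_id, components):
        """Register a resource view in the replica and in every index."""
        self.replica[view_id] = dict(components)
        for component, value in components.items():
            self.index[component][value].add(view_id)

    def lookup(self, conditions):
        """Ids of views matching all conditions.

        Each condition is answered by its own component index; the
        per-component id sets are then intersected.
        """
        id_sets = [
            self.index[component].get(value, set())
            for component, value in conditions.items()
        ]
        if not id_sets:
            return set(self.replica)
        return set.intersection(*id_sets)


store = ReplicaAndIndexes()
store.add(1, {"name": "paper.tex", "class": "file"})
store.add(2, {"name": "paper.tex", "class": "xmlfile"})
store.add(3, {"name": "notes.txt", "class": "file"})
```

A condition on two components, such as name and class, is answered by looking up each component index separately and intersecting the resulting id sets; the replica then returns the stored component copies for the matching ids.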
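
For data sources without update notifications, Section 4.4 states that the Synchronization Manager generates notifications through a generic polling facility. One plausible way to do this, sketched below, is to compare successive snapshots of the source. The snapshot format (item identifier mapped to a version stamp), the event names, and the class PollingSyncManager are assumptions for illustration and are not taken from the iMeMex code base.

```python
from typing import Callable, Dict, List, Tuple

Snapshot = Dict[str, int]   # item id -> version stamp (e.g. modification time)
Event = Tuple[str, str]     # (kind, item id)


def diff_snapshots(previous: Snapshot, current: Snapshot) -> List[Event]:
    """Derive update notifications from two snapshots of a data source."""
    events: List[Event] = []
    for item, version in current.items():
        if item not in previous:
            events.append(("created", item))
        elif previous[item] != version:
            events.append(("modified", item))
    for item in previous:
        if item not in current:
            events.append(("deleted", item))
    return sorted(events)


class PollingSyncManager:
    """Polls a source and forwards synthesized notifications to a listener."""

    def __init__(self, take_snapshot: Callable[[], Snapshot],
                 notify: Callable[[Event], None]):
        self.take_snapshot = take_snapshot
        self.notify = notify
        self.last: Snapshot = {}

    def poll_once(self) -> List[Event]:
        """Take one snapshot, emit the differences, remember the snapshot."""
        current = dict(self.take_snapshot())
        events = diff_snapshots(self.last, current)
        for event in events:
            self.notify(event)
        self.last = current
        return events
```

In this sketch the first poll after registration reports every item as created, which loosely mirrors the initial analysis of a newly registered data source; later polls report only the differences. A real deployment would call poll_once on a timer and would have to choose a polling interval.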