=Paper=
{{Paper
|id=Vol-170/paper-12
|storemode=property
|title=iMeMex: A Platform for Personal Dataspace Management
|pdfUrl=https://ceur-ws.org/Vol-170/paper11.pdf
|volume=Vol-170
}}
==iMeMex: A Platform for Personal Dataspace Management==
iMeMex: A Platform for Personal Dataspace Management∗
Marcos Antonio Vaz Salles Jens-Peter Dittrich
Institute of Information Systems
ETH Zurich
8092 Zurich, Switzerland
dbis.ethz.ch | iMeMex.org
∗ This work is partially supported by the Swiss National Science Foundation (SNF) under contract 200021-112115.
© 2006 for the individual paper by the paper’s authors. Copying permitted for private and scientific purposes. Re-publication of material on this page requires permission by the copyright owners.
Proceedings of the VLDB2006 Ph.D. Workshop, Seoul, Rep. of Korea, 2006
ABSTRACT
Desktop computers provide thousands of different applications that query and store data in hundreds of thousands of files of different formats. Those files are stored in the local filesystem and also in a number of remote data sources, such as network shares or as attachments to emails. To handle this heterogeneous and distributed mix of personal information, data processing logic is re-invented inside each application. This results in an undesirable situation: most advanced data management functionality, such as complex queries, backup and recovery, versioning, provenance tracking, among others, is (at least partially) performed by end-users in tedious, manual tasks. To solve these problems we propose a software platform that brings physical and logical data independence to the desktop, freeing users from low-level data management activities. Unlike current relational DBMSs, this platform unifies data from several independent personal data sources without imposing semantic schema integration. It manages the complex dataspace [12] of one’s personal information. We attack three major research challenges in the building of that platform: (i) definition of a data model that allows the integration of information in distinct representations and locations, (ii) design of a new search&query language over this data model along with algorithms for the efficient processing of complex queries, and (iii) formulation of an update model that enables soft durability guarantees, when compared to ACID properties, on data authored independently from the platform.
1. INTRODUCTION
In 1945, Bush [3] presented a vision of a personal information management system named memex. That vision has deeply influenced several advances in computing. Part of that vision led to the development of the Personal Computer in the 1980s. It also led to the development of hypertext and the World Wide Web in the 1990s. Since then, several projects have attempted to implement other memex-like functionality [13, 2, 4, 18]. Further, personal information management regained interest in the DB research community [15, 9, 8]. Moreover, it was identified as an important topic in the Lowell Report [1], discussed in a VLDB panel [19], considered in an NSF-sponsored workshop [17], debated in the SIGIR 2006 PIM workshop [22], and became the topic of both SIGMOD 2005 keynotes [2, 21].
In spite of these previous efforts, we argue that no satisfactory solution has yet been put forward for the issues of physical and logical data independence on the desktop. Physical data independence relates to abstraction from the devices and formats in which data is represented. This is clearly not achieved by the simple data model of the current generation of file systems. Applications develop specific solutions that directly handle protocols to access the data (email, RSS/ATOM, network file system, etc.) and also formats in which data is stored (XML, LaTeX, image and audio formats, etc.). This creates application-specific data silos in which data management functionality, e.g., querying, updating, performing backup and recovery operations, is absent or re-invented. Logical data independence relates to the capability of defining views over the logical data model in which data is represented. It is also only partially achieved with current desktop systems, e.g. smart folders.
Personal Dataspaces. DBMS technology successfully resolved the physical and logical data independence problem for highly structured data, but not for the highly heterogeneous data mix present in personal information. Indeed, Franklin et al. [12] argue that today we rarely have a situation in which all the data that needs to be managed can fit nicely into a relational DBMS. Rather, most of the data will be authored independently from the DBMS and will not be in its full control. Franklin et al. introduce the term dataspace to describe this world of disparate, distributed and independently authored unstructured, semi-structured and structured data.
In this project we focus on personal dataspaces, that is, the total of all personal information pertaining to a given individual. In contrast to the vision of [12], we propose one concrete Personal Dataspace Management System (PDSMS) implementation, named iMeMex (integrated memex). Unlike traditional information integration approaches, a PDSMS does not require semantic data integration before any data services are provided. Rather, a PDSMS is a data co-existence approach in which tighter integration is performed in “pay-as-you-go” fashion [12].
Current Status. The ultimate goal of the dissertation is to build the first publicly available PDSMS. The dissertation has so far seen one year of development and a research plan has been drawn up for the next three years. In the first year of work, the Ph.D. student has helped to set the vision and context for the iMeMex project. As a result of this work, we have written a research proposal detailing the goals and work breakdown for the whole project. This proposal has been accepted by the Swiss National Science Foundation (SNF) [6] and supports two Ph.D. positions for a period of three years.
To evaluate our ideas, we have developed a first prototype of
iMeMex. It was demonstrated in [8] and provided a traditional file system interface to explore arbitrary views over one’s personal information. In parallel to the development of the prototype, we have defined our data model for representing personal information: the iMeMex Data Model (iDM). That model is presented in [7].
The current, second, version of the iMeMex platform extends the first prototype and incorporates the work on iDM. It offers a unified view on a set of personal data sources and allows basic query processing on that view. Our current implementation (Java 1.4) contains about 215 classes and 22,000 lines of code.
Outline. We present related personal information management approaches in Section 2. We proceed by discussing the research challenges involved in building a PDSMS in Section 3. We then describe in Section 4 some of our solutions to these challenges, which consist of: (1) a unified data model for personal information (Section 4.1); (2) a flexible query language that operates on this model, along with techniques for efficient query processing (Section 4.2); (3) an update model for a PDSMS which includes mechanisms for recovery and versioning of all data present in a personal dataspace (Section 4.3); and (4) the architecture of a PDSMS, which integrates all of the previous contributions into a unified framework (Section 4.4). Finally, we conclude in Section 5.
2. STATE OF THE ART
As we approach an age in which each computer user will face the challenge of managing her own personal terabyte, PIM research has attracted renewed interest in a variety of areas, such as HCI, IR and data management [17]. Due to space limitations, we only comment on a few solutions in this section. Current operating systems have been amended in the past years to include full-text search appliances, such as Google Desktop, Apple Spotlight, and Phlat [4]. These systems offer an intuitive keyword search interface, sometimes augmented by generic metadata (modification date, author, etc.). Their data models, however, are unable to represent structural information inside documents. A PDSMS, in contrast, enriches keyword and property search with advanced structural querying.
Systems such as SEMEX [9] and Haystack [18] allow users to browse by association. They employ an ETL cycle to extract information from desktop data sources into a repository and represent that information in a domain model (ontology). The domain model is a high-level mediated schema over the personal information sources. These systems focus on creating a queryable, but non-updatable, view on the user’s personal information. In contrast, a PDSMS offers support for not only advanced querying and browsing but also for updating information in the underlying personal dataspace whenever possible. In fact, all of the systems above may be thought of as applications on top of a PDSMS.
Other systems offer tools to ease the management of personal data. Lifestreams [13] organizes all personal documents in a timeline. In Placeless Documents [11], users may tag their documents with active properties, such as “backup” or “replicate”, and the appropriate actions will be carried out by the system. MyLifeBits [2] models each piece of information as a resource and allows these resources to be annotated and organized in collections. Microsoft WinFS [23], now discontinued¹, represented information in an item data model which is a subset of the object-oriented data model and offered a basic class library to represent data items commonly found in user desktops. Like MyLifeBits, WinFS based its storage of items on a relational DBMS. All of these systems need full control of the data to offer features such as backup&recovery. In contrast, a PDSMS enables data to be authored and updated independently through the interfaces offered by the underlying data sources. Further, in these systems, advanced PDSMS queries that bridge structural information across the inside-outside file boundary are not available (see Example 1).
¹ The downloadable beta as well as all other preliminary information about WinFS were recently removed from its web-site [23].
3. RESEARCH CHALLENGES
In the following, we discuss major research challenges that are targeted by the Ph.D. on the iMeMex PDSMS.
Challenge 1 (Representing Personal Information). A major research challenge of managing personal information is dealing with its heterogeneity. Heterogeneity relates to data models and formats used to represent the information. It also relates to the data sources in which that information is available and to the mechanisms available for data delivery (push/pull). Let’s consider an example:
EXAMPLE 1 (INSIDE AND OUTSIDE FILES) Users organize their workspaces in folder hierarchies and use applications to store information inside files. Each file is an independent data cage in which complex structural representations may be stored. Consider the following query: “Show me all LaTeX ‘Introduction’ sections pertaining to project PIM that contain the phrase ‘Personal Information’”. With current technology, this query cannot be issued in one single request by the user as it has to bridge the inside-outside file boundary. The user may only search the file system using simple system tools like grep, find, or a keyword search engine. However, these tools may return a large number of results which would have to be examined manually to determine the final result. Even when a matching file is found, for structured file formats like Microsoft PowerPoint the user typically has to conduct a second search inside the file to find the desired information [4]. Moreover, state-of-the-art operating systems do not support exploitation of structured information inside the user’s documents at all. □
Ideally, we would like to have a common representation for all personal information in different data models and sources. This common representation (or view) would enable queries that ignore how the data is stored or where it is located. In addition, we should be able to construct that view without performing labor-intensive semantic schema integration. Rather, we would like to perform lightweight data model integration and leave expensive semantic integration to be carried out in a “pay-as-you-go” fashion.
Challenge 2 (Querying Personal Information). Once we have an integrated view on one’s personal dataspace, the next natural challenge is how to query this view. Users have traditionally employed browsing (i.e., neighborhood expansion) and keyword queries to explore their data. Ideally, we would like to provide one single search&query language to analyze and modify all data in a personal dataspace. This query language should allow impreciseness in query formulation and also integrate ranking of query results. Further, advanced functionality, such as branching expressions and joins, should be available. Note that expressions written in this language should be processed with interactive response times.
Challenge 3 (Updating Personal Information). Given a unified view of all personal information, another important challenge is to provide means to update that personal information through that unified view. Current desktop search engines (DSEs) are read-optimized systems. DSEs are able to detect updates made to the data sources and to incorporate those updates into their index structures. They have, however, two important drawbacks: (1) DSEs do not allow applications to perform updates on the data sources through the DSEs’ interfaces, and (2) DSEs do not offer any update guarantees on the underlying data, such as durability (e.g., to allow recovery of past images of the data).
DBMSs, on the other hand, provide update interfaces and also strict transactional ACID guarantees, but demand a high price for them: full control of the data.
In contrast to both approaches, a PDSMS occupies the middle ground between a read-only DSE (without update guarantees) and a write-optimized DBMS (with strict ACID guarantees). Guarantees may vary according to the interfaces offered by the data sources managed by the PDSMS. To enforce soft durability guarantees, a PDSMS must provide algorithms for backup&recovery, versioning, update handling, among others.
Figure 1: iDM represents heterogeneous information in a single resource view graph. (a) Heterogeneous Personal Information: a folder “Projects” with subfolders “PIM” and “OLAP”; “PIM” holds the LaTeX file “vldb 2006.tex”, whose sections “Introduction” (with subsection “The Problem”) and “Preliminaries” reference each other. (b) Resource View Graph: the same folders, file, document class, title, abstract, document, sections, text and ref elements as nodes of one graph.
4. OUR APPROACH
In this section, we provide more details on our previous and ongoing work on the iMeMex PDSMS. First, we discuss our solutions for each of the challenges described in the previous section. We then conclude by presenting the iMeMex PDSMS architecture, which serves as a framework to deploy the presented techniques.
4.1 Representing Personal Information
Figure 1(a) depicts the situation described in Example 1. It shows a files&folders hierarchy with information on research projects of one of the authors. Note that arbitrary cyclic graph structures may naturally arise in data inside files. In the LaTeX document “vldb 2006.tex”, for example, inside the subsection “The Problem”, there is a reference to the section “Preliminaries” and vice versa. An extended example of such occurrences is provided in [7].
Why XML is not Enough. Ideally, files&folders as well as the structure inside files should be represented in the same logical data model. One could try to employ XML technology to address this challenge of representation heterogeneity. In fact, we followed that approach in [8]. Unfortunately, XML is associated with both a logical data model and a physical markup to represent this logical model. This means that the manipulation of XML views is coupled with serialization concerns. Recent work has identified this gap, e.g. [20, 16, 14], and argues in favor of clearly separated logical data models supporting more advanced features, e.g. multiple hierarchies [16]. However, none of the existing approaches is sufficient to naturally represent the complex, possibly infinite, distributed and lazily computed information graph encountered in a personal dataspace. Therefore, we have decided to represent all personal information based on a novel, more powerful, logical data model: the iMeMex Data Model (iDM).
Resource View Graph. We briefly sketch a few characteristics of iDM in this section; full details are provided elsewhere [7]. iDM enables a logical representation of a personal dataspace, as shown in Figure 1(b). The main features of iDM are:
• Resource Views: in iDM, all personal information is represented by fine-grained resource views. A resource view is made of components that express structured, semi-structured and unstructured pieces of the underlying data. For instance, resource views may represent nodes in a files&folders hierarchy as well as elements in an XML, LaTeX or other office document. Other than that, we use resource views to uniformly represent email messages, email attachments, infinite data streams, relational data, RSS/ATOM messages, bookmarks, query results, calls to web services and many others [7]. The granularity at which resource views are represented is determined by a set of plugin components in our system architecture (see Section 4.4).
• Graphs: resource views in iDM are linked to each other forming directed graph structures. In Figure 1(b), we show the resource view graph that corresponds to the personal data in Figure 1(a). In that graph, there is no inside-outside file boundary. All structural elements (folders, sections, subsections, etc.) are represented in the same model and queries may address them uniformly. Note that cycles may naturally arise in that graph (in this example as a consequence of section cross-referencing).
• Intensional Data: any given resource view or parts of a resource view graph may be either materialized (i.e., extensional data) or computed on demand as the result of a query or of a remote web service invocation (i.e., intensional data [20]). This is in sharp contrast to static data models such as XML.
• Stream Support: another important feature of our model is the ability of resource views to contain finite as well as infinite components. Infinite resource view components are used to represent data streams (e.g., RSS, publish/subscribe) and content streams (e.g., audio and video) in our model.
In our approach, the notion of impreciseness is included in our query language, briefly discussed in Section 4.2.
Data Model Instantiations. A resource view is given by the following four formal components:
name η: Name of the resource view.
tuple τ: List of attribute-value pairs ((name_0, value_0), (name_1, value_1), ...).
content χ: (In)finite input/output of content (e.g. text).
group γ: References to other resource views.
- S: (in)finite set {...}
- Q: (in)finite ordered sequence ⟨...⟩
We use resource view classes to constrain resource view components. Resource view classes allow integration of data from diverse data models into iDM without requiring time-consuming semantic schema integration. A resource view V_i of class C is denoted by V_i^C. Similarly, its components are denoted by η_i^C, τ_i^C, χ_i^C, and γ_i^C.
We show in Table 1 how our model may be constrained to represent files, folders and the core subset of XML. We denote the name of an underlying data item i by N_i, attribute-value pairs associated to it by a schema W and a tuple T_i, and its content by C_i.
{| class="wikitable"
|+ Table 1: Resource View Classes for files&folders and XML
! Description !! Class name !! Name η_i^C !! Tuple τ_i^C !! Content χ_i^C !! Group γ_i^C: set S !! Group γ_i^C: sequence Q
|-
| File || file || N_f || W_FS, T_f || C_f || ∅ || ⟨⟩
|-
| Folder || folder || N_F || W_FS, T_F || || {V_1^child, ..., V_m^child}, child ∈ {file, folder} || ⟨⟩
|-
| XML text node || xmltext || || || C_t || ∅ || ⟨⟩
|-
| XML element || xmlelem || N_E || W_E, T_E || || ∅ || ⟨V_1^child, ..., V_n^child⟩, child ∈ {xmltext, xmlelem}
|-
| XML document || xmldoc || || || || ∅ || ⟨V_root^xmlelem⟩
|-
| XML File || xmlfile || N_f || W_FS, T_f || C_f || ∅ || ⟨V_doc^xmldoc⟩
|}
The instantiations shown in Table 1 allow the creation of resource view graphs such as the one shown in Figure 1(b). Our data model is, however, much more powerful: instantiations for relations and data streams, as well as a more rigorous discussion on intensional aspects [20] of iDM are presented in [7].
4.2 Querying Personal Information
The iMeMex PDSMS should offer querying services on the resource view graph representing all of one’s personal dataspace. In the following, we discuss our ongoing work and open issues on query specification and processing.
Personal Dataspace Query Language. We propose a new search&query language for schema-agnostic querying of a resource view graph: the iMeMex Query Language (iQL). The definition of the iQL syntax and associated semantics is work in progress. In our current implementation, the syntax of iQL is a mix between typical search engine keyword expressions and XPath navigational restrictions. The semantics of our language are, however, quite different from those of XPath and XQuery. Our language’s goal is to enable querying of a resource view graph that has not necessarily been submitted to expensive schema integration. Therefore, as in search technology, we account for impreciseness in query formulation. For example, by default, when an attribute name is specified (e.g. size > 10K), we do not require exact matches on the (implicit or explicit) schema for that attribute, but rather return fuzzy, ranked results for the resource views that best match the specified conditions (e.g. size, fileSize, docSize). This allows us to define malleable schemas as in [10]. Other important features of iQL are the ability to reflect structural constraints, e.g. to explore the context or neighborhood of items, the definition of extensible algebraic operations like joins and grouping, and the specification of updates to the resource view graph.
Indexing Techniques. In our current implementation of iMeMex, we index all components of every resource view created in the system. This full indexing strategy follows the intuition that the PIM environment shares with data warehousing the characteristic of low update rates, allowing us to trade space and indexing time for query performance. The information from each component of a resource view (e.g., name or group of related resource views) goes to a different index and we perform intersections to process conditions on several components. We plan to investigate whether it pays off to have integrated index structures for various resource view components. In contrast to traditional XML indexing, our index structures must operate in a general graph data model on possibly infinite data.
Cost-based Optimization. Cost-based optimization (CBO) is one key technique to provide interactive response times in read-mostly environments. We are planning to build a CBO for iMeMex to account for trade-offs in the usage of alternative query plans, e.g., to consider join orders and different access methods.
Neighborhood Queries. Providing context is key to enable exploration of query results [5]. Thus, it is a common pattern to query the neighborhood of objects returned from a previous query. One alternative to speed up such queries is to keep their results materialized in a special index structure. This index may cover only the immediate neighborhood of each resource view or it may be extended to include other reachable resource views. We plan to evaluate how much context should be kept in the index structure to account for trade-offs in querying speed, indexing time and update processing.
4.3 Updating Personal Information
The iMeMex PDSMS should offer soft durability guarantees on updates made through its interface or via the APIs of the underlying data sources bypassing iMeMex. In the following, we discuss our ideas for tackling that challenge.
Dataspace Update Model. We plan to design an update model for the iMeMex PDSMS that accounts for the fact that data may be independently updated via the APIs of the underlying data sources. In this scenario, ACID guarantees are too strict, since the iMeMex PDSMS may be notified of updates “after the fact”. Nevertheless, we believe that classical database logging techniques may be adapted to this setting to provide softer recovery guarantees (e.g., all items updated more than 5 min ago may be recovered).
Versioning. In relational systems, previous versions of a given tuple may be reconstructed from the database log (see e.g. the “time travel” feature of Oracle). However, personal items are typically more heavyweight than relational tuples, as they may have medium to large content. An alternative to logging would be to keep an independent versioning subsystem (e.g. Subversion) to track content evolution. We plan to investigate how to integrate versioning into our update model for personal information and also whether there are profitable interactions with the techniques chosen for recovery.
Write back. Updates to personal information may be performed via the API of a given data source or via iMeMex’s API. In the latter case, one must write the data back to the affected data sources. If the data is not already present in any data source, iMeMex must decide in which subsystem(s) it is most suitable to be represented.
Distribution. When a user has several devices, it is natural to ask how to manage several iMeMex instances and coordinate distributed query and update processing among these instances. We believe that the many challenges of this scenario exceed the scope of the current Ph.D. work. Those challenges will be tackled by a separate Ph.D. thesis as part of the iMeMex project.
4.4 iMeMex PDSMS Architecture
We present in this section the current architecture of the iMeMex PDSMS, which serves as a framework for all of the previously discussed technical contributions. We also indicate points of ongoing work in which the architecture will be extended.
The core idea of iMeMex is to implement a logical layer that abstracts from the underlying subsystems and data sources, such as file systems, email servers, network shares, music streams, RSS feeds, etc. That logical layer does not take full control of the data, so it may be bypassed by applications. Figure 2 depicts that layer and its current implementation in iMeMex.
iMeMex contains two important sublayers: iQL Query Processor and Resource View Manager. The main task of the iQL Query Processor is to translate incoming iQL queries and to create query plans for those queries. Our current implementation is based on rule-based query optimization. We plan to invest in cost-based optimization techniques as part of future work.
The Resource View Manager (RVM) is the central instance to
managing resource views. Its major components are: Data Source Proxy, ContentToiDMConverters, Replica&Indexes Module, and Synchronization Manager. We describe them in the following.
Figure 2: iMeMex PDSMS architecture. From top to bottom: Application Layer (iMeMex iQL GUI, iMeMex iQL Shell, 3rd party); iMeMex Query Language (iQL); iMeMex PDSMS Layer with the Query Processor and the Resource View Manager (Handler, Sync Manager, Replica & Indexes with Replicas, Catalog and Indexes, ContentToiDM Converters for XML, LaTeX, ..., and the Data Source Proxy with Data Source Plugins for Search Engine, DBMS, FS, IMAP, RSS, ...); Data Source Layer (File System, RSS, IMAP, Database, Live Streaming, etc.).
Data Source Proxy. Provides connectivity to the distinct types of subsystems. It contains a set of Data Source Plugins that represent the data from the different subsystems (e.g., file systems, RSS, IMAP, databases, etc.) as an initial iDM graph.
ContentToiDMConverters Module. Enriches the iDM graph provided by the data source proxy. This is achieved by converting resource view content to iDM subgraphs that then reflect structural information (e.g., in LaTeX, XML, etc.). The result is an iDM graph such as the one presented in Section 4.1.
Replica&Indexes Module. Materializes mappings between resource view identifiers and resource view components (e.g., name or group of related resource views) to accelerate query processing. A mapping from resource view identifiers to copies of component instances is termed a replica. The inverse mapping is termed an index. Currently, our implementations of replicas and indexes are based on a DBMS (Apache Derby) for structured information, such as attribute-value pairs and resource view connections, and on inverted keyword lists (Apache Lucene) for textual information, such as names and text content. We plan to extend this module to provide specialized index structures as discussed in Section 4.2.
Synchronization Manager. Monitors registered data sources for changes. When a data source is registered at the RVM, the Synchronization Manager analyzes the data found on the data source and sends each resource view definition to the Replica&Indexes Module. The Synchronization Manager also subscribes to update notifications from the data source. As a consequence, updates performed on the data source bypassing the RVM layer are then immediately considered by the Synchronization Manager and the Replica&Indexes Module. If the data source does not offer update notifications, the Synchronization Manager generates them based on a generic polling facility. We will extend this module to incorporate recovery and versioning techniques, as described in Section 4.3.
5. CONCLUSION
Personal Information Management has become a key necessity for almost everybody. Reflecting this prominence, considerable attention has been given to PIM research in the recent past. At the same time, it has become clear that what is missing is a unified approach to bring physical and logical data independence to the management of one’s personal dataspace. We address three major research challenges in the pursuit of this goal. First, we define a new data model capable of representing the heterogeneous data mix found in personal dataspaces. As one application of our model we bridge the artificial boundary that separates inside and outside files. Second, we are working on a new search&query language that operates on our data model. The processing of expressions in this language calls for the design of efficient techniques, e.g. for indexing and neighborhood querying. Third, we are working on a dataspace update model. That model will provide soft durability guarantees, write-back to data sources as well as detection of changes made on data sources bypassing iMeMex. We plan to design integrated recovery and versioning techniques to support our update model. By building the first publicly available PDSMS, we believe that we make a significant contribution to the development of advanced PIM applications.
6. REFERENCES
[1] S. Abiteboul, R. Agrawal, P. A. Bernstein, et al. The Lowell Database Research Self Assessment. The Computing Research Repository (CoRR), cs.DB/0310006, 2003.
[2] G. Bell. MyLifeBits: a Memex-Inspired Personal Store; Another TP Database (Keynote). In ACM SIGMOD, 2005.
[3] V. Bush. As we may think. Atlantic Monthly, 1945.
[4] E. Cutrell, D. Robbins, S. Dumais, and R. Sarin. Fast, flexible filtering with Phlat — Personal search and organization made easy. In CHI, 2006.
[5] J.-P. Dittrich, P. M. Fischer, and D. Kossmann. AGILE: Adaptive Indexing for Context-Aware Information Filters. In ACM SIGMOD, 2005.
[6] J.-P. Dittrich and D. Kossmann. iMeMex: A Unified Approach to Personal Information Management. In SNF project under contract 200021-112115.
[7] J.-P. Dittrich and M. A. V. Salles. iDM: A Unified and Versatile Data Model for Personal Dataspace Management. In VLDB, 2006.
[8] J.-P. Dittrich, M. A. V. Salles, D. Kossmann, and L. Blunschi. iMeMex: Escapes from the Personal Information Jungle (Demo Paper). In VLDB, 2005.
[9] X. Dong and A. Halevy. A Platform for Personal Information Management and Integration. In CIDR, 2005.
[10] X. Dong and A. Y. Halevy. Malleable Schemas: A Preliminary Report. In WebDB, pages 139–144, 2005.
[11] P. Dourish et al. Extending Document Management Systems with User-Specific Active Properties. ACM Transactions on Information Systems (TOIS), 18(2):140–170, 2000.
[12] M. Franklin, A. Halevy, and D. Maier. From Databases to Dataspaces: A New Abstraction for Information Management. SIGMOD Record, 34(4):27–33, 2005.
[13] E. Freeman and D. Gelernter. Lifestreams: A Storage Model for Personal Data. SIGMOD Record, 25(1):80–86, 1996.
[14] J. Graupmann, R. Schenkel, and G. Weikum. The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents. In VLDB, 2005.
[15] A. Halevy et al. Crossing the Structure Chasm. In CIDR, 2003.
[16] H. V. Jagadish, L. V. S. Lakshmanan, M. Scannapieco, D. Srivastava, and N. Wiwatwattana. Colorful XML: One Hierarchy Isn’t Enough. In ACM SIGMOD, 2004.
[17] W. Jones and H. Bruce. A Report on the NSF-Sponsored Workshop on Personal Information Management, Seattle, Washington, 2005.
[18] D. R. Karger et al. Haystack: A Customizable General-Purpose Information Management Tool for End Users of Semistructured Data. In CIDR, 2005.
[19] M. Kersten, G. Weikum, M. Franklin, D. Keim, A. Buchmann, and S. Chaudhuri. Panel: A Database Striptease or How to Manage Your Personal Databases. In VLDB, 2003.
[20] T. Milo, S. Abiteboul, et al. Exchanging Intensional XML Data. In ACM SIGMOD, 2003.
[21] T. Mitchell. Computer Workstations as Intelligent Agents (Keynote). In ACM SIGMOD, 2005.
[22] SIGIR PIM 2006. http://pim.ischool.washington.edu/pim06home.htm.
[23] WinFS. http://msdn.microsoft.com/data/WinFS/.