=Paper=
{{Paper
|id=Vol-2167/paper6
|storemode=property
|title=WASP: Web Archiving and Search Personalized
|pdfUrl=https://ceur-ws.org/Vol-2167/paper6.pdf
|volume=Vol-2167
|authors=Johannes Kiesel,Arjen P. De Vries,Matthias Hagen,Benno Stein,Martin Potthast
|dblpUrl=https://dblp.org/rec/conf/desires/KieselVHSP18
}}
==WASP: Web Archiving and Search Personalized==
https://ceur-ws.org/Vol-2167/paper6.pdf
WASP: Web Archiving and Search Personalized
Johannes Kiesel, Bauhaus-Universität Weimar, Weimar, Germany, johannes.kiesel@uni-weimar.de
Arjen P. de Vries, Radboud University, Nijmegen, The Netherlands, a.devries@cs.ru.nl
Matthias Hagen, Martin-Luther-Universität Halle-Wittenberg, Halle, Germany, matthias.hagen@informatik.uni-halle.de
Benno Stein, Bauhaus-Universität Weimar, Weimar, Germany, benno.stein@uni-weimar.de
Martin Potthast, Leipzig University, Leipzig, Germany, martin.potthast@uni-leipzig.de
ABSTRACT
Logging and re-finding the information we encounter every day while browsing the web is a non-trivial task that is, at best, inadequately supported by existing tools. It is time to take another step forward: we introduce WASP, a fully functional prototype of a personal web archive and search system, which is available open source and as an executable Docker image. Based on the experiences and insights gained while designing and using WASP, we outline how personal web archive and search systems can be implemented, discuss what technological and privacy-related challenges such systems face, and propose a setup to evaluate their effectiveness. As a key insight, we argue that the indexing and retrieval for a personal archive search can be strongly tailored towards a specific user and their behavior on the visited pages compared to regular web search.

1 INTRODUCTION
Lifelogging¹ has become a common practice, as a result of the omnipresence of smartphones, smart watches and fitness trackers, and emerging technologies such as smart glasses, wearable technologies and sensor-enabled smart homes. Isn’t it surprising that keeping track of one’s online activities is comparably underdeveloped? A significant amount of work has been invested into understanding personal information management [10] and developing tools to support it, including the winner of the SIGIR 2014 Test of Time Award “Stuff I’ve Seen” (SIS) by Dumais et al. [6]. With a bit of irony, however, neither SIS nor its follow-up Phlat [5] are available today, even if the key insights gained have likely informed the development of Windows desktop search and the intelligent assistant Cortana. Likewise, Spotlight on MacOS supports search over local documents and other digital assets. Both are integrated with the web browsers from Microsoft and Apple, respectively, to index browsing history. Meanwhile, the history tabs of modern Web browsers provide access to the history of the currently open browser as well as pages recently visited on other devices. However, current browsers do not align and integrate the browsing histories across devices, nor, apparently, do the aforementioned tools index the content of web pages visited, but only their titles and URLs. In fact, the possibility to track (let alone search) one’s browsing history using off-the-shelf tools is still fairly limited.

In this context, it is not surprising that personal information access was one of the major topics discussed at the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018) [4]. The attendees noted that this problem, open for so long, has not been addressed adequately, and, worse, that it is an ever more daunting challenge to help people re-find and re-visit their online information and prior information interactions with these sources, as this information today resides in multiple devices and a large variety of information services that each construct their own data silos and search APIs (if such access is offered at all). Specifically, the report mentions the high cost of entry for scientists as a major obstacle, where “there is substantial engineering required for a minimal working system: to fetch data from different silos, parse different data formats, and monitor user activity.”

We propose to take a pragmatic “shortcut” and to establish empirically how far that workaround can bring us. Increasingly, access to our digital information takes place through the web browser as the interface. Therefore, we set out to develop WASP, a prototype system for personal web archiving and search. WASP saves one’s personal web browsing history using state-of-the-art web archiving technology and offers a powerful retrieval interface over that history. This browser-focused setup enables the user to recall information they personally gathered without the need to deal with the large variety of information sources. Even if we do not cover the full range of digital objects that may accrue on a person’s desktop and mobile devices, high-quality archival of web pages visited may capture a large fraction of the information we interact with.

In addition to a detailed technical description of WASP in Section 2, this paper reports on the observations that we made (Section 3) and the challenges for personal web archiving and search that we identified (Section 4) through our extensive use of the WASP prototype, which we provide both open source and as an executable Docker container so that others can use it within their research or personal lifelogging setup.², ³

¹ https://en.wikipedia.org/wiki/Lifelog
² https://hub.docker.com/r/webis/wasp/
³ https://github.com/webis-de/wasp
DESIRES 2018, August 2018, Bertinoro, Italy
© 2018 Copyright held by the author(s).
2 THE WASP PROTOTYPE
The WASP⁴ prototype integrates existing archiving, indexing, and reproduction technology for the first time into a single application. Figure 1 illustrates how the user’s browser interacts through WASP with the World Wide Web under the three usage scenarios archival, search, and reproduction of web pages, as detailed below.

Figure 1: Architecture of the prototype: (a) during regular browsing, the container works as a forward proxy that stores all requests and responses in web archive files (WARCs) and indexes them; (b) when browsing to localhost:/search, the browser shows our search interface (Figure 2), where results link to (c) the reproduction server, which serves content from the WARC that (fuzzy) matches a specific time and URL.

2.1 Archiving Proxy and Indexing
After starting WASP, the user has to reconfigure his or her browser to accept WASP as forward proxy and to trust its certificate. WASP then archives all HTTP(S) requests and responses from and to the browser in the standard Web archiving format (WARC) (Figure 1 (a)). This is achieved using the Internet Archive’s warcprox software,⁵ whose WARC files contain all the information necessary to reproduce an archived browsing session at a later time.

In order to enable searching the archived content, we devised a software component that monitors WARC files and automatically indexes HTML responses and their corresponding requests in an ElasticSearch index.⁶ In detail, we use the Lemur project’s WARC parser⁷ and Apache’s HttpClient library⁸ to read HTTP messages as they are appended to the WARC files. The title and text of the HTTP responses that have the MIME type HTML are extracted using the Jericho HTML Parser library.⁹ The title and text of each HTTP response are indexed along with the corresponding HTTP request’s time and URL. Later page revisits (identified by warcprox through hash value matching on HTTP responses) are added to the response’s record in the index. When the index is queried, the aggregation of requests avoids duplicate results in case a page is visited more than once.

Even if web pages vanish or change, WASP can reproduce the content the user saw in the past using the Web archiving toolkit pywb.¹⁰ Like our automatic indexing setup described above, pywb monitors and indexes changes to the web archives. While the ElasticSearch index is tailored toward search within the HTML content of the archived web pages, the pywb index is tailored towards retrieving the HTTP response corresponding to a given HTTP request, enabling efficient reproduction of pages from the personal archive.
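The indexing step of Section 2.1 can be illustrated with a minimal sketch. It is not WASP's implementation: WASP relies on the Lemur WARC parser, the Jericho HTML parser, and ElasticSearch, whereas this sketch uses only the Python standard library, keeps the index in a dictionary, and hashes the response body itself to detect revisits. All names (`TitleAndTextExtractor`, `IndexEntry`, `index_response`) are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from html.parser import HTMLParser


class TitleAndTextExtractor(HTMLParser):
    """Collects the <title> and the visible text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


@dataclass
class IndexEntry:
    title: str
    text: str
    requests: list = field(default_factory=list)  # one (time, url) per visit


def index_response(index, html, request_time, url):
    """Adds an HTML response to the index; identical bodies share one entry."""
    key = hashlib.sha1(html.encode("utf-8")).hexdigest()
    if key not in index:
        parser = TitleAndTextExtractor()
        parser.feed(html)
        parser.close()
        index[key] = IndexEntry("".join(parser.title_parts).strip(),
                                " ".join(parser.text_parts))
    # A revisit only appends another request to the existing record.
    index[key].requests.append((request_time, url))
    return index[key]


index = {}
page = ("<html><head><title>Hello</title><script>var x = 1;</script></head>"
        "<body><p>Some text</p></body></html>")
index_response(index, page, "2018-08-28T10:00:00Z", "https://example.org/")
entry = index_response(index, page, "2018-08-29T09:30:00Z", "https://example.org/")
```

Because both visits return the same body, the index holds a single entry with two requests, which mirrors how aggregating requests avoids duplicate results for revisited pages.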
2.2 Search Interface
Access to the archived web pages is provided using the ElasticSearch index detailed in Section 2.1 (Figure 1 (b)). Under a configurable port, WASP provides the user with a basic search engine. Figure 2 shows a screenshot of the interface. Unlike regular web search engines, WASP’s interface provides controls to specify the time the user recalls visiting the desired web page (items 1, 2, and 3 in Figure 2)¹¹ in addition to the familiar query box (item 4). Web pages are retrieved by matching query words against the title and contents of web pages visited in the specified time interval. ElasticSearch’s highlight feature is used to generate query-related snippets for the results (item 9).

A difference to a regular search engine results page is that in WASP, each result item consists of two hyperlinks: one resolving the URL to the live web (item 6) as usual, and another one pointing to the archived version of the web page. This latter hyperlink refers to the port of the WASP container’s reproduction proxy and the access time and URL of the web page that should be reproduced. In case several non-identical versions of the same page are found in the requested interval, the prototype displays all of them as separate results. However, we expect that more mature personal web archiving and search systems will rather condense the different versions of a web page, especially when the context of the query terms is similar in the versions. The resulting user experience offers key advantages with respect to search users’ privacy: search activities remain local to WASP, and the user is left in control whether to visit the live web page (without leaking their preferences to another search engine), or to be satisfied with the archived result.

⁴ WASP is short for Web Archiving and Search, Personalized
⁵ https://github.com/internetarchive/warcprox
⁶ https://www.elastic.co/
⁷ http://www.lemurproject.org/clueweb09/workingWithWARCFiles.php
⁸ https://hc.apache.org/httpcomponents-client-ga/
⁹ http://jericho.htmlparser.net/docs/index.html
¹⁰ https://github.com/webrecorder/pywb
¹¹ Date and time picker widget: https://eonasdan.github.io/bootstrap-datetimepicker/
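The retrieval step of Section 2.2 (query words matched against title and content, restricted to a time interval, with highlighted snippets) could be expressed as an ElasticSearch request body along the following lines. This is a sketch under assumptions, not WASP's actual query: the field names `title`, `text`, and `visit_time` are hypothetical, and the sketch assumes a flat visit-time field rather than the aggregated request records the prototype stores.

```python
from datetime import datetime, timezone


def build_search_body(query, start, end, size=10):
    """Builds a request body for a time-restricted full-text search.

    The field names "title", "text", and "visit_time" are assumptions.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Match the query words against title and page text.
                "must": {
                    "multi_match": {"query": query, "fields": ["title", "text"]}
                },
                # Keep only pages visited in the selected interval.
                "filter": {
                    "range": {
                        "visit_time": {
                            "gte": start.isoformat(),
                            "lte": end.isoformat(),
                        }
                    }
                },
            }
        },
        # Ask for query-related snippets from the page text.
        "highlight": {"fields": {"text": {}}},
    }


body = build_search_body(
    "web archive",
    datetime(2018, 8, 1, tzinfo=timezone.utc),
    datetime(2018, 8, 31, tzinfo=timezone.utc),
)
```

Placing the time restriction in the `filter` clause keeps it from influencing the relevance score, so ranking depends only on how well the query words match.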
Figure 2: Search interface for WASP: (1) shortcuts for frequently used time settings; (2) selected query time interval; (3) date and time picker for exact time specification; (4) query box; (5) description of current result page; (6) title of result with links to archived and live version; (7) URL of the result; (8) archive time of the result; (9) snippet for the result.

Figure 3: Screenshot of a web page reproduced from the archive. pywb is configured to insert a small black banner at the bottom right of the browser viewport to remind users that they are viewing an archived page.

2.3 Reproduction Server
When using a personal Web archive in a re-finding scenario, WASP fulfills the need of users to access information from their browsing history using pywb, a state-of-the-art web page reproduction software which uses intricate URL rewriting and code injection to serve the archived web pages like they were originally received (Figure 1 (c)). Through the use of specific URLs, pywb can serve multiple versions of the same web page. WASP’s search interface uses this feature to refer the user to exactly that version of the web page that corresponds to the clicked result link. In order to avoid confusion on the user’s side as to whether or not they are browsing within the archive, a small black banner is inserted and fixed to the bottom right corner of the browser viewport for all pages that are reproduced from the archive (cf. Figure 3).

3 QUALITATIVE EVALUATION
Given that the WASP prototype became operational only recently, the ongoing evaluation of its archiving and retrieval quality is still in its infancy. Nevertheless, since we have been using the prototype, this section reports on insights gathered so far, namely the results of an error analysis regarding archiving quality, and an outline of evaluation methodology regarding retrieval quality.

3.1 Archiving Quality: Error Analysis
When revisiting an archived web page, one naturally expects the version reproduced from the archive to look and behave exactly the same as the live version did at the time of archiving. Yet, technological difficulties may prevent the faithful reproduction of an archived web page. Since it is usually impractical for WASP to take web server snapshots, WASP will only capture a page’s client side. Therefore, only a subset of the potential server interactions end up being represented in the archive and available for the reproduction: the scrolling, clicking, form submissions, video and audio stream playback, etc. that the user performed on the live web page. If user interactions on the archived web page trigger unseen requests to the web server, reproducing the archived web page will either do nothing, show an error, or stop working.

However, even in the case the user repeats the same basic interactions on the archived page that they performed on the live page, only about half of web pages can be reproduced flawlessly [9]. These reproduction errors mostly stem from randomized requests. Indeed, in about two-thirds of flawed reproductions, the errors are on the level of missing advertisements or similar. While pywb replaces the JavaScript random number generator by a deterministic one, this only affects the archived page and does not fully solve the problem: different timings in the network communications lead to a varying execution order and thus a different order of pop-requests from the “random” number sequence. To greater effect, pywb employs a fuzzy matching of GET parameters that ignores some of the parameters that it assumes to have random values (e.g., session ids), be it by the parameter name or by a hash-like appearance of the parameter value. While it is unclear how many false positives this process introduces, it naturally cannot find all random parameters as there exists no standard whatsoever in this regard.

Another interesting problem for web archiving that we noticed is push notifications: while they are properly recorded, it remains a difficult choice if and when to trigger them during the reproduction of a web page. Should the trigger time be based on the time spent on the page or based on other events?
Finally, we found that differences between browsers can also affect the reproduction quality. Though this had only minor effects on our experience with WASP so far, the ongoing development of the web technology stack may render old web pages in the archive less reproducible in the long run. For example, consider the ongoing demise of Flash as a major container for dynamic content. In this regard, old versions of browsers and even old versions of operating systems may need to be kept, which is a definite requirement for web archiving in general, and also possible based on WASP’s use of Docker containers, though not necessarily important for our usage scenario of personal web archiving.
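The fuzzy matching of GET parameters discussed earlier in this section can be sketched as a URL canonicalization that drops parameters presumed to be random, either by name or by a hash-like value. The sketch below is an illustration under assumptions, not pywb's actual rule set: the parameter names in `RANDOM_PARAM_NAMES` and the pattern `HASH_LIKE` are hypothetical heuristics.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical heuristics chosen for illustration only.
RANDOM_PARAM_NAMES = {"sid", "sessionid", "session_id", "nonce", "cb"}
HASH_LIKE = re.compile(r"^[0-9a-fA-F]{16,}$")


def canonicalize(url):
    """Drops query parameters that look random and sorts the remaining ones."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in RANDOM_PARAM_NAMES and not HASH_LIKE.match(value)
    ]
    # The fragment is discarded because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(kept)), ""))


def fuzzy_match(requested_url, archived_urls):
    """Returns the archived URLs whose canonical form equals the request's."""
    target = canonicalize(requested_url)
    return [url for url in archived_urls if canonicalize(url) == target]


archived = [
    "https://example.org/ad?sid=abc&slot=3",
    "https://example.org/ad?slot=4",
]
matches = fuzzy_match("https://example.org/ad?slot=3&sid=xyz", archived)
```

The sketch also shows where false positives come from: a legitimate identifier that happens to consist of 16 or more hexadecimal characters would be dropped as well, so two genuinely different requests could be treated as the same one.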
3.2 Retrieval Quality Evaluation: An Outline
In principle, it should be easier to re-find something in a personal web archive than using some commercial search engine on the live web. Since a personal archive will typically be many orders of magnitude smaller, not as many candidate results for simple queries exist as on the live web. Ideally, compared to finding a needle in the huge haystack of the web, with a tailored search interface for one’s smaller personal archive, the ratio of needles to hay is much higher in a re-finding scenario than in general web search. Still, since WASP is a prototype that was created very recently, we can only provide anecdotes of retrieval problems and sketch how we want to evaluate whether WASP actually helps to re-find needles.

The main evaluation scenario we envision is re-finding something a user recalls having seen earlier on the web. Such re-finding intents will be different from the frequent re-visit patterns users show on the web [1] since their purpose is not to visit some favorite page but to check some information seen before. In this regard, we do not believe that, at the time of visiting a web page the first time around, users will have enough foresight and presence of mind to anticipate its future uses and hence place a bookmark, rendering a search in their personal archive indispensable.

We used WASP for one week in an informal self-experiment to figure out what problems arise and what should thus be integrated in a formal evaluation. The most obvious problem that differs from the general web search scenario is that of dealing with several versions of the same web page. During our short-term usage of WASP, we found that most retrieved web pages are actually relevant, but that the result lists are cluttered with different versions of the same web page that were, with respect to our information needs, practically identical, as predicted by a recent retrievability study of Web archive search [12]. A probably even more difficult problem, but one that our scenario shares with general web search, arises from the fact that nowadays web pages request a large part of their content dynamically and only if necessary. A good example of this is the Twitter timeline: while scrolling through the timeline, more tweets are requested from the server. Since WASP is currently limited to indexing HTML responses, it catches only some parts of the tweets (see Figure 4), which turn out to be HTML templates requested via Ajax for integration into the Twitter page.

Figure 4: Example of dynamic HTML content in WASP: (a) original tweet as it appeared while scrolling down the Twitter timeline; (b) Twitter card as it was requested for display, archived, and indexed.

Based on these observations, we propose the following evaluation setup for personal web archives. Since re-finding in personal web archives has not been part of any evaluation campaign so far, a respective set of topics and user interactions has to be built up front. Besides monitoring user queries against WASP’s search functionality for users who agree to share parts of their browsing and search activity, one will periodically prompt active users of WASP with a re-finding game similar to PageHunt [11]. The user will be shown the screenshot of a page they have seen, or only parts thereof (e.g., only the color scheme of the layout), or will be asked to re-find a piece of information they have seen a given period of time ago (e.g., three days ago, two weeks ago, etc.). Their task will be to come up with a sequence of queries (and clicks) such that in the end the prescribed web page appears in the top-k ranks of WASP’s retrieval component. In such cases, the desired item will be known for evaluation purposes and the re-finding task can have several difficulty levels (showing full information vs. only color scheme, target information at top of a page or only requested upon interaction, etc.). To measure retrieval success, the length of real and the comparably artificial re-finding query and click sequences can be measured as well as the specificity of the queries contrasted by the size of the personal collection. But of course, the overall interesting measure will be for how many real re-finding tasks the users are able to pull out the desired result from their personal archive: their needle stack.

4 DISCUSSION AND LESSONS LEARNED
Our primary goal with WASP was to develop a vertical prototype of a web archiving and retrieval framework, which archives every web page and every request made by a web page, and then indexes everything archived. Based on first practical experiences with using WASP for our own respective web traffic, however, there are still many things to be sorted out before we can claim a flawless retrieval experience. Unsurprisingly, the devil is in the details, but somewhat surprisingly, we will be forced to revisit the basic notions of what is a web page, what needs to be archived, and what needs to be indexed. This section discusses lessons learned, outlining a number of exciting future directions for research and development on web archiving and retrieval in general, and for WASP in particular.
4.1 Which pages to archive?
Although WASP currently follows “archive first, ask questions later,” users of a personal archiving system likely do not wish for all their traffic to be archived, even if stored within their personal data space. Without specific measures, sensitive data will end up in the archive, e.g., banking pages, health-related browsing, as well as browsing sessions with privacy-mode enabled (where users expect all traces of their activities to be purged after the browser is closed); users may not expect such data to emerge in search results weeks, months, or even years later. Furthermore, just as some users regularly clean or clear their browsing history, they will wish to clean or clear their archive. Similarly, it will be necessary to protect the personal archive from unauthorized access, analyze known and new attack vectors on the archiving setup, and formalize the security implications that stem from the use of such a system.

Based on these deliberations, it is clear that the user must be given fine-grained control over what sites or pages are archived, allowing for personal adjustments and policies. The recorded archive needs to be browseable, so that individual entries can be selected for removal. For more convenient browsing (both for cleaning and general re-finding), we suggest a screenshot-based interface as shown in Figure 5. At present, users can already influence which pages should not be archived using proxy-switching plugins available for all modern browsers that seamlessly integrate with WASP’s proxy-based architecture (e.g., cf. Figure 6). Of course, specifying wildcard expressions hardly qualifies as a user-friendly interface for non-computer scientists, so that a better interface will be required in practice (e.g., using classification techniques similar to [7]).

Figure 5: Screenshot mode mockup; the screenshot in the 2nd row and 2nd column is highlighted by mouse-over.

Figure 6: Firefox toolbar indicating archiving is activated. The context menu of this icon allows the user to turn off the proxy usage, thereby implementing a “pause-archiving” button.

Under some circumstances personal archiving systems could act on their own behalf to allow for an improved experience of the archived page, by archiving content the users did not request themselves. This possibility leads to several new research questions. For example, should all videos on a visited page be requested and archived, so that the user can watch them later on from their archive? Or in general, should the system predict and simulate interactions that the user may later want to do on the archived page to archive the corresponding resources while they are still available? Moreover, should the system perform such a simulation multiple times in order to detect the randomness in the web page’s requests and consider this information in the reproduction?

4.2 Which pages to index?
While a comprehensive archive is necessary for a high-quality reproduction of web pages, not everything that the browser receives is actually of interest to the user. From our own web browsing habits, we can informally tell that many pages opened are not relevant for future retrieval, because they are dismissed upon first glance (e.g., pop-ups) or not looked at at all.

Besides accidental page visits, another example of irrelevant pages may be found in more complex web applications. Take web-based RSS feed readers the likes of Feedly as an example: there is no need to index every page and every state of every page of the feed reader. Rather, the feed items to which the user pays attention are of interest for indexing, since only they are the ones the user may eventually remember and wish to revisit. In this regard, two cases can be distinguished, namely the case where feed items are displayed only partially, so that the user has to click on a link pointing to an external web page to consume a piece of content, and the case where feed items are displayed in full on the feed reader’s page. The former case is straightforward, since a click indicates user attention, so that the feed reader’s page can be entirely ignored. In the latter case, however, every feed item the user reads should be indexed, whereas the ones the user skips should not, so as not to pollute the user’s personal search results.

More generally, all kinds of portal pages and doorway pages, ranging from newspaper front pages via social network pages to search results pages, are candidates for omission. Analyzing the user’s browsing behavior gives evidence which pages they sufficiently scrutinized to be indexed. If a user spends time reading the headlines and excerpts of a front page, this would suggest indexing that page, but it may be difficult to discern in practice. Otherwise, a user’s behavior may be used as implicit relevance feedback to be incorporated into tailored retrieval models.
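As a rough illustration of how browsing behavior might decide what gets indexed, the following sketch filters page visits by dwell time and turns dwell time into a simple feedback weight. It rests on assumptions throughout: WASP does not record dwell times, so the sketch presumes some browser-side component supplies them, and the thresholds of 5 and 120 seconds as well as the names `PageVisit`, `should_index`, and `feedback_weight` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageVisit:
    url: str
    dwell_seconds: float  # assumed signal: time the page was in focus


def should_index(visit, min_dwell_seconds=5.0):
    """Returns True if the user plausibly paid attention to the page."""
    return visit.dwell_seconds >= min_dwell_seconds


def feedback_weight(visit, saturation_seconds=120.0):
    """Maps dwell time to a ranking weight in [0, 1], capped at saturation."""
    return min(visit.dwell_seconds, saturation_seconds) / saturation_seconds


visits = [
    PageVisit("https://example.org/popup", 0.8),     # dismissed at first glance
    PageVisit("https://example.org/article", 95.0),  # read for a while
]
indexed = [visit.url for visit in visits if should_index(visit)]
```

A single threshold cannot tell a front page whose headlines were read from one that was merely left open, which is the difficulty noted above; a practical system would likely need further signals.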
4.3 What is the document unit for indexing?
In its present form, WASP archives everything that runs under a given URL (including GET parameters, but excluding fragment identifiers) as one unit. Just like in regular search, not every piece of content is relevant for indexing. Main content extraction is an obvious solution to this problem, but the current state of the art frequently fails on pages where many small pieces of content can be found. Furthermore, many websites today spread one coherent piece of content over many sub-pages (so-called pagination). For instance, news publishers often employ pagination, forcing readers to switch pages (possibly to serve extra display ads or improve engagement metrics that determine the value of the display ads shown on the publisher’s site). For archive retrieval purposes, however, pagination can be detrimental, penalizing the relevance of a paginated news article to a query, since only parts of the article are scored at a time.

On the other hand, physical pages are also not necessarily atomic: many web pages built with modern web design tools are single-page applications, where different pieces of content are shown upon user request under the same URL. For instance, a blog platform may show each blog post requested by a user simply by loading it in the background using a JavaScript-based AJAX request, and replacing the currently shown post with a new one. In this case, the perfect web archive search would identify the single posts and index them separately, injecting code upon reproduction that replaces the displayed post with the desired one. Currently, we are technologically far from such a feature. In a different case, like the Twitter timeline, a web page consists of several (possibly independent) content segments. Again, each such segment should be indexed separately for an appropriate relevance computation. To meet this challenge, web pages should be segmented into coherent units of content that belong together on a page, and each segment identified should be treated as a document unit. However, just like with most of the aforementioned problems, page segmentation, too, is still in its infancy.

For an optimization, the click behavior and dwell times on certain pages may be the best features to determine what parts should be indexed, whether pages should be merged into one, or one divided into many. Furthermore, such information on user behavior would be very useful for ranking results in the personal search. Currently, however, such behavioral data is probably not even available to commercial search engines.

5 RELATED WORK
WASP is directly related to prior work on desktop search, including the already mentioned Stuff I’ve Seen [6]. However, apart from not indexing all documents that may exist on a desktop, the intended usage differs slightly as well: WASP aims to track everything a user has seen, as they saw it, and in that sense provides some notion of versioning. While not yet implemented, a future version should explore the functionality once implemented in diff-IE, i.e., to rank pages that evolved differently from static ones, and this way provide immediate insight in changes of the web over time [13].

WASP is also related to search tools for web archives, such as ArchiveSpark [8]. However, due to handling a single user’s view of the online world only, the system aspects to be addressed include less emphasis on scalability. Developments in the UI/UX of web archive search are, however, likely transferable in both directions: as argued in Section 4.2, what we learn from observing interactions with personal web archives may very well carry over to the large web archives of interest to Digital Humanities researchers [3].

We find that a new blend of techniques that have been proposed previously will be necessary to design the right user experience, and we realize that we have only scratched the surface so far. For example, searching the social web is different from searching the web, as shown convincingly in [2]. We also highlight the immediate relevance of research into focused retrieval carried out in the context of INEX. The question of how to determine a retrieval unit has clearly not been solved yet, and the usage scenario of personalized web archive search that we envision has increased the urgency to revisit that line of research.

6 SUMMARY
This paper introduces WASP, a prototypical implementation of a personal web archive and search system, provides a first qualitative evaluation of such a system, outlines future steps in this regard, and discusses the challenges that such systems face. WASP combines state-of-the-art archiving and retrieval technology, to which it adds an intuitive and tailored search interface. Generally, the use case for personal web archive search is more the one of a re-finding engine. We identify current limitations in archiving technology for this use case and discuss how the evaluation of a search engine has to be adapted for search in personal web archives (e.g., to several versions of a single web page when it is revisited). In the same context, we discuss what content should be archived and what content should be indexed, highlighting privacy issues (e.g., archiving in incognito mode) and advantages (re-finding information using only local data).

REFERENCES
[1] E. Adar, J. Teevan, and S.T. Dumais. 2008. Large scale analysis of web revisitation patterns. In CHI ’08. 1197–1206.
[2] O. Alonso, V. Kandylas, S.-E. Tremblay, J.M. Hofman, and S. Sen. 2017. What’s Happening and What Happened: Searching the Social Web. In WebSci ’17. 191–200.
[3] Anat Ben-David and Hugo Huurdeman. 2014. Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria 25, 1-2 (2014), 93–111.
[4] J. Shane Culpepper, Fernando Diaz, and Mark D. Smucker. 2018. Report from the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018). Technical Report.
[5] E. Cutrell, D. Robbins, S. Dumais, and R. Sarin. 2006. Fast, Flexible Filtering with Phlat. In CHI ’06. 261–270.
[6] S. Dumais, E. Cutrell, J.J. Cadiz, G. Jancke, R. Sarin, and D.C. Robbins. 2003. Stuff I’ve Seen: A System for Personal Information Retrieval and Re-use. In SIGIR ’03. 72–79.
[7] C. Eickhoff, K. Collins-Thompson, P.N. Bennett, and S.T. Dumais. 2013. Designing Human-Readable User Profiles for Search Evaluation. In ECIR 2013. 701–705.
[8] H. Holzmann, V. Goel, and A. Anand. 2016. ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation. In JCDL ’16. 83–92.
[9] Johannes Kiesel, Florian Kneist, Milad Alshomary, Benno Stein, Matthias Hagen, and Martin Potthast. 2018. Reproducible Web Corpora: Interactive Archiving with Automatic Quality Assessment. Journal of Data and Information Quality (2018).
[10] W. Jones. 2010. Keeping found things found: The study and practice of personal information management. Morgan Kaufmann.
[11] H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. 2009. Improving search engines using human computation games. In CIKM ’09. 275–284.
[12] Th. Samar, M.C. Traub, J. van Ossenbruggen, L. Hardman, and A.P. de Vries. 2018. Quantifying retrieval bias in Web archive search. International Journal on Digital Libraries 19, 1 (01 Mar 2018), 57–75.
[13] J. Teevan, S. Dumais, and D. Liebling. 2010. A Longitudinal Study of How Highlighting Web Content Change Affects People’s Web Interactions. In CHI ’10.