=Paper=
{{Paper
|id=Vol-2167/paper6
|storemode=property
|title=WASP: Web Archiving and Search Personalized
|pdfUrl=https://ceur-ws.org/Vol-2167/paper6.pdf
|volume=Vol-2167
|authors=Johannes Kiesel,Arjen P. De Vries,Matthias Hagen,Benno Stein,Martin Potthast
|dblpUrl=https://dblp.org/rec/conf/desires/KieselVHSP18
}}
==WASP: Web Archiving and Search Personalized==
<pdf width="1500px">https://ceur-ws.org/Vol-2167/paper6.pdf</pdf>
<pre>
WASP: Web Archiving and Search Personalized

Johannes Kiesel, Bauhaus-Universität Weimar, Weimar, Germany
johannes.kiesel@uni-weimar.de

Arjen P. de Vries, Radboud University, Nijmegen, The Netherlands
a.devries@cs.ru.nl

Matthias Hagen, Martin-Luther-Universität Halle-Wittenberg, Halle, Germany
matthias.hagen@informatik.uni-halle.de

Benno Stein, Bauhaus-Universität Weimar, Weimar, Germany
benno.stein@uni-weimar.de

Martin Potthast, Leipzig University, Leipzig, Germany
martin.potthast@uni-leipzig.de

ABSTRACT
Logging and re-finding the information we encounter every day while
browsing the web is a non-trivial task that is, at best, inadequately
supported by existing tools. It is time to take another step forward:
we introduce WASP, a fully functional prototype of a personal web
archive and search system, which is available as open source and as an
executable Docker image. Based on the experiences and insights gained
while designing and using WASP, we outline how personal web archive and
search systems can be implemented, discuss what technological and
privacy-related challenges such systems face, and propose a setup to
evaluate their effectiveness. As a key insight, we argue that, compared
to regular web search, indexing and retrieval for personal archive
search can be strongly tailored towards a specific user and their
behavior on the visited pages.

1     INTRODUCTION
Lifelogging¹ has become a common practice as a result of the
omnipresence of smartphones, smart watches, and fitness trackers, and of
emerging technologies such as smart glasses, wearable technologies, and
sensor-enabled smart homes. Isn’t it surprising that keeping track of
one’s online activities is comparably underdeveloped? A significant
amount of work has been invested into understanding personal information
management [10] and developing tools to support it, including the winner
of the SIGIR 2014 Test of Time Award, “Stuff I’ve Seen” (SIS) by Dumais
et al. [6]. Somewhat ironically, however, neither SIS nor its follow-up
Phlat [5] is available today, even if the key insights gained have
likely informed the development of Windows desktop search and the
intelligent assistant Cortana. Likewise, Spotlight on MacOS supports
search over local documents and other digital assets. Both are
integrated with the web browsers from Microsoft and Apple, respectively,
to index browsing history. Meanwhile, the history tabs of modern web
browsers provide access to the history of the currently open browser as
well as pages recently visited on other devices. However, current
browsers do not align and integrate the browsing histories across
devices, nor, apparently, do the aforementioned tools index the content
of web pages visited, but only their titles and URLs. In fact, the
possibility to track (let alone search) one’s browsing history using
off-the-shelf tools is still fairly limited.
   In this context, it is not surprising that personal information
access was one of the major topics discussed at the Third Strategic
Workshop on Information Retrieval in Lorne (SWIRL 2018) [4]. The
attendees noted that this problem, open for so long, has not been
addressed adequately, and, worse, that it is an ever more daunting
challenge to help people re-find and re-visit their online information
and prior information interactions with these sources, as this
information today resides on multiple devices and in a large variety of
information services, each of which constructs its own data silos and
search APIs (if such access is offered at all). Specifically, the report
mentions the high cost of entry for scientists as a major obstacle,
where “there is substantial engineering required for a minimal working
system: to fetch data from different silos, parse different data
formats, and monitor user activity.”
   We propose to take a pragmatic “shortcut” and to establish
empirically how far that workaround can bring us. Increasingly, access
to our digital information takes place through the web browser as the
interface. Therefore, we set out to develop WASP, a prototype system for
personal web archiving and search. WASP saves one’s personal web
browsing history using state-of-the-art web archiving technology and
offers a powerful retrieval interface over that history. This
browser-focused setup enables the user to recall information they
personally gathered without the need to deal with the large variety of
information sources. Even if we do not cover the full range of digital
objects that may accrue on a person’s desktop and mobile devices,
high-quality archival of web pages visited may capture a large fraction
of the information we interact with.
   In addition to a detailed technical description of WASP in Section 2,
this paper reports on the observations that we made (Section 3) and the
challenges for personal web archiving and search that we identified
(Section 4) through our extensive use of the WASP prototype, which we
provide both as open source and as an executable Docker container so
that others can use it within their research or personal lifelogging
setup.²,³

1 https://en.wikipedia.org/wiki/Lifelog
2 https://hub.docker.com/r/webis/wasp/
3 https://github.com/webis-de/wasp

DESIRES 2018, August 2018, Bertinoro, Italy
© 2018 Copyright held by the author(s).


2     THE WASP PROTOTYPE
The WASP⁴ prototype integrates existing archiving, indexing, and
reproduction technology for the first time into a single application.
Figure 1 illustrates how the user’s browser interacts through WASP with
the World Wide Web under the three usage scenarios archival, search, and
reproduction of web pages, as detailed below.

2.1     Archiving Proxy and Indexing
After starting WASP, the user has to reconfigure his or her browser to
accept WASP as forward proxy and to trust its certificate. WASP then
archives all HTTP(S) requests and responses from and to the browser in
the standard Web archiving format (WARC) (Figure 1 (a)). This is
achieved using the Internet Archive’s warcprox software,⁵ whose WARC
files contain all the information necessary to reproduce an archived
browsing session at a later time.
   In order to enable searching the archived content, we devised a
software component that monitors WARC files and automatically indexes
HTML responses and their corresponding requests in an ElasticSearch
index.⁶ In detail, we use the Lemur project’s WARC parser⁷ and Apache’s
HttpClient library⁸ to read HTTP messages as they are appended to the
WARC files. The title and text of HTTP responses with an HTML MIME type
are extracted using the Jericho HTML Parser library.⁹ The title and text
of each HTTP response are indexed along with the corresponding HTTP
request’s time and URL. Later page revisits (identified by warcprox
through hash value matching on HTTP responses) are added to the
response’s record in the index. When the index is queried, the
aggregation of requests avoids duplicate results in case a page is
visited more than once.
   Even if web pages vanish or change, WASP can reproduce the content
the user saw in the past using the Web archiving toolkit pywb.¹⁰ Like
our automatic indexing setup described above, pywb monitors and indexes
changes to the web archives. While the ElasticSearch index is tailored
towards search within the HTML content of the archived web pages, the
pywb index is tailored towards retrieving the HTTP response
corresponding to a given HTTP request, enabling efficient reproduction
of pages from the personal archive.

[Figure 1: three-panel diagram (a), (b), (c) with the components
Browser, warcprox (proxy), WARCs, pywb, Search Interface, Index, and
World Wide Web; in panel (b) the browser requests /search, in panel (c)
it requests /archive/<time>/<url>.]

Figure 1: Architecture of the prototype: (a) during regular browsing,
the container works as a forward proxy that stores all requests and
responses in web archive files (WARCs) and indexes them; (b) when
browsing to localhost:<search-port>/search, the browser shows our search
interface (Figure 2), where results link to (c) the reproduction server,
which serves content from the WARC that (fuzzy) matches a specific time
and URL.
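
The indexing behavior described in Section 2.1 can be illustrated with a
minimal in-memory sketch. It is not WASP's implementation (which relies
on ElasticSearch and Java libraries); the names ArchiveIndex,
ResponseRecord, and payload_digest, as well as the use of SHA-1 over
title and text, are assumptions made for illustration. Each distinct
response is stored once, and every later revisit only appends a
(time, URL) pair to that record, so a query yields one result per
response.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ResponseRecord:
    """One indexed HTML response plus all requests that led to it."""
    title: str
    text: str
    requests: list = field(default_factory=list)  # (iso_time, url) pairs


class ArchiveIndex:
    """Hypothetical in-memory stand-in for the search index."""

    def __init__(self):
        self.records = {}  # content digest -> ResponseRecord

    @staticmethod
    def payload_digest(body: str) -> str:
        # Assumption: identical content hashes identify a page revisit.
        return hashlib.sha1(body.encode("utf-8")).hexdigest()

    def add(self, iso_time, url, title, text):
        """Index a response, or attach the request to an existing record."""
        digest = self.payload_digest(title + "\n" + text)
        record = self.records.setdefault(digest, ResponseRecord(title, text))
        record.requests.append((iso_time, url))
        return digest

    def search(self, word):
        """Return each matching response once, however often it was visited."""
        word = word.lower()
        return [r for r in self.records.values()
                if word in r.title.lower() or word in r.text.lower()]


index = ArchiveIndex()
index.add("2018-08-01T10:00:00", "https://example.org/", "Example", "hello archive")
index.add("2018-08-02T09:30:00", "https://example.org/", "Example", "hello archive")
index.add("2018-08-03T12:00:00", "https://example.org/", "Example", "changed content")

hits = index.search("hello")
print(len(hits), len(hits[0].requests))
```

In this example the two identical visits collapse into a single record
with two requests, while the changed page becomes a separate record.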

2.2     Search Interface
Access to the archived web pages is provided using the ElasticSearch
index detailed in Section 2.1 (Figure 1 (b)). Under a configurable port,
WASP provides the user with a basic search engine. Figure 2 shows a
screenshot of the interface. Unlike regular web search engines, WASP’s
interface provides controls to specify the time the user recalls
visiting the desired web page (1, 2, 3 in Figure 2)¹¹ in addition to the
familiar query box (4). Web pages are retrieved by matching query words
against the title and contents of web pages visited in the specified
time interval. ElasticSearch’s highlight feature is used to generate
query-related snippets for the results (9).
   A difference to a regular search engine results page is that in WASP,
each result item consists of two hyperlinks (6): one resolving the URL
to the live web as usual, and another one pointing to the archived
version of the web page. This latter hyperlink refers to the port of the
WASP container’s reproduction proxy and the access time and URL of the
web page that should be reproduced. In case several non-identical
versions of the same page are found in the requested interval, the
prototype displays all of them as separate results. However, we expect
that more mature personal web archiving and search systems will rather
condense the different versions of a web page, especially when the
context of the query terms is similar in the versions. The resulting
user experience offers key advantages with respect to search users’
privacy: search activities remain local to WASP, and the user is left in
control whether to visit the live web page (without leaking their
preferences to another search engine), or to be satisfied with the
archived result.

4 WASP is short for Web Archiving and Search, Personalized
5 https://github.com/internetarchive/warcprox
6 https://www.elastic.co/
7 http://www.lemurproject.org/clueweb09/workingWithWARCFiles.php
8 https://hc.apache.org/httpcomponents-client-ga/
9 http://jericho.htmlparser.net/docs/index.html
10 https://github.com/webrecorder/pywb
11 Date and time picker widget: https://eonasdan.github.io/bootstrap-datetimepicker/
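
A search as described in Section 2.2 amounts to a full-text match
restricted to a time interval, plus highlighting for the snippets. The
following sketch builds such a request body as a plain dictionary in the
style of the ElasticSearch query DSL (bool, multi_match, range,
highlight). The field names title, content, and time, the helper names,
and the port number are assumptions; they are not taken from WASP's
actual index mapping. The second helper assembles an archive link
following the /archive/<time>/<url> pattern shown in Figure 1 (c); the
14-digit timestamp format is likewise an assumption.

```python
from datetime import datetime


def build_search_body(query, start, end, size=10):
    """Match query words in title/content for pages visited in [start, end]."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match over the (assumed) title and content fields.
                "must": {
                    "multi_match": {"query": query, "fields": ["title", "content"]}
                },
                # Restrict results to the time interval picked in the interface.
                "filter": {
                    "range": {
                        "time": {"gte": start.isoformat(), "lte": end.isoformat()}
                    }
                },
            }
        },
        # Request highlighted fragments to serve as result snippets.
        "highlight": {"fields": {"content": {}}},
    }


def archive_link(port, visit_time, url):
    """Link to the archived version of url as seen at visit_time."""
    return f"http://localhost:{port}/archive/{visit_time:%Y%m%d%H%M%S}/{url}"


body = build_search_body("wasp", datetime(2018, 8, 1), datetime(2018, 8, 2))
link = archive_link(8002, datetime(2018, 8, 1, 10, 0, 0), "https://example.org/")
print(link)
```

Each result item would then carry two links: the original URL for the
live web and a link of the second kind for the archived version.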
Figure 2: Search interface for WASP: (1) shortcuts for frequently used
time settings; (2) selected query time interval; (3) date and time
picker for exact time specification; (4) query box; (5) description of
current result page; (6) title of result with links to archived and live
version; (7) URL of the result; (8) archive time of the result;
(9) snippet for the result.

Figure 3: Screenshot of a web page reproduced from the archive. pywb is
configured to insert a small black banner at the bottom right of the
browser viewport to remind users that they are viewing an archived page.

2.3      Reproduction Server
When using a personal Web archive in a re-finding scenario, WASP
fulfills the need of users to access information from their browsing
history using pywb, a state-of-the-art web page reproduction software
which uses intricate URL rewriting and code injection to serve the
archived web pages like they were originally received (Figure 1 (c)).
Through the use of specific URLs, pywb can serve multiple versions of
the same web page. WASP’s search interface uses this feature to refer
the user to exactly that version of the web page that corresponds to the
clicked result link. In order to avoid confusion on the user’s side as
to whether or not they are browsing within the archive, a small black
banner is inserted and fixed to the bottom right corner of the browser
viewport for all pages that are reproduced from the archive (cf.
Figure 3).

3       QUALITATIVE EVALUATION
Given that the WASP prototype became operational only recently, the
ongoing evaluation of its archiving and retrieval quality is still in
its infancy. Nevertheless, since we have been using the prototype, this
section reports on insights gathered so far, namely the results of an
error analysis regarding archiving quality, and an outline of evaluation
methodology regarding retrieval quality.

3.1      Archiving Quality: Error Analysis
When revisiting an archived web page, one naturally expects the version
reproduced from the archive to look and behave exactly the same as the
live version did at the time of archiving. Yet, technological
difficulties may prevent the faithful reproduction of an archived web
page. Since it is usually impractical for WASP to take web server
snapshots, WASP will only capture a page’s client side. Therefore, only
a subset of the potential server interactions ends up being represented
in the archive and available for the reproduction: the scrolling,
clicking, form submissions, video and audio stream playback, etc. that
the user performed on the live web page. If user interactions on the
archived web page trigger unseen requests to the web server, reproducing
the archived web page will either do nothing, show an error, or stop
working.
   However, even in the case the user repeats the same basic
interactions on the archived page that they performed on the live page,
only about half of web pages can be reproduced flawlessly [9]. These
reproduction errors mostly stem from randomized requests. Indeed, in
about two-thirds of flawed reproductions, the errors are on the level of
missing advertisements or similar. While pywb replaces the JavaScript
random number generator by a deterministic one, this only affects the
archived page and does not fully solve the problem: different timings in
the network communications lead to a varying execution order and thus a
different order of pop-requests from the “random” number sequence. To
greater effect, pywb employs a fuzzy matching of GET parameters that
ignores some of the parameters that it assumes to have random values
(e.g., session ids), be it by the parameter name or by a hash-like
appearance of the parameter value. While it is unclear how many false
positives this process introduces, it naturally can’t find all random
parameters as there exists no standard whatsoever in this regard.
   Another interesting problem for web archiving that we noticed is
posed by push notifications: while they are properly recorded, it
remains a difficult choice if and when to trigger them during the
reproduction of a web page. Should the trigger time be based on the time
spent on the page or based on other events?
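
The idea behind the fuzzy matching of GET parameters described above can
be sketched as URL canonicalization: parameters that look random are
dropped before an incoming request is compared with archived requests.
This is an illustration only and not pywb's actual rule set; the
parameter names in RANDOM_NAMES, the hash-like pattern, and the function
names are assumptions.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of parameter names that typically carry random values.
RANDOM_NAMES = {"sessionid", "sid", "_", "cachebuster", "rnd"}
# Assumed heuristic: 16 or more hexadecimal characters look like a hash.
HASH_LIKE = re.compile(r"^[0-9a-fA-F]{16,}$")


def canonicalize(url):
    """Drop presumably random GET parameters and sort the remaining ones."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in RANDOM_NAMES and not HASH_LIKE.match(value)
    ]
    kept.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def fuzzy_match(requested, archived):
    """True if both URLs agree after ignoring presumably random parameters."""
    return canonicalize(requested) == canonicalize(archived)


print(fuzzy_match("https://example.org/ad?slot=3&sid=abc",
                  "https://example.org/ad?sid=xyz&slot=3"))
```

A heuristic like HASH_LIKE also drops legitimate values that merely look
like hashes (for instance long numeric identifiers), which is the kind
of false positive mentioned above, and it cannot catch random parameters
with unusual names or short values.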


   Finally, we found that differences between browsers can also affect
the reproduction quality. Though this had only minor effects on our
experience with WASP so far, the ongoing development of the web
technology stack may render old web pages in the archive less
reproducible in the long run. For example, consider the ongoing demise
of Flash as a major container for dynamic content. In this regard, old
versions of browsers and even old versions of operating systems may need
to be kept, which is a definite requirement for web archiving in
general, and also possible based on WASP’s use of Docker containers,
though not necessarily important for our usage scenario of personal web
archiving.


3.2    Retrieval Quality Evaluation: An Outline
In principle, it should be easier to re-find something in a personal web
archive than using some commercial search engine on the live web. Since
a personal archive will typically be many orders of magnitude smaller,
not as many candidate results for simple queries exist as on the live
web. Ideally, compared to finding a needle in the huge haystack of the
web, with a tailored search interface for one’s smaller personal
archive, the ratio of needles to hay is much higher in a re-finding
scenario than in general web search. Still, since WASP is a prototype
that was created very recently, we can only provide anecdotes of
retrieval problems and sketch how we want to evaluate whether WASP
actually helps to re-find needles.
   The main evaluation scenario we envision is re-finding something a
user recalls having seen earlier on the web. Such re-finding intents
will be different from the frequent re-visit patterns users show on the
web [1] since their purpose is not to visit some favorite page but to
check some information seen before. In this regard, we do not believe
that, at the time of visiting a web page the first time around, users
will have enough foresight and presence of mind to anticipate its future
uses and hence place a bookmark, rendering a search in their personal
archive indispensable.
   We used WASP for one week in an informal self-experiment to figure
out what problems arise and what should thus be integrated in a formal
evaluation. The most obvious problem that differs from the general web
search scenario is that of dealing with several versions of the same web
page. During our short-term usage of WASP, we found that most retrieved
web pages are actually relevant, but that the result lists are cluttered
with different versions of the same web page that were, with respect to
our information needs, practically identical, as predicted by a recent
retrievability study of Web archive search [12]. A probably even more
difficult problem, but one that our scenario shares with general web
search, arises from the fact that nowadays web pages request a large
part of their content dynamically and only if necessary. A good example
of this is the Twitter timeline: while scrolling through the timeline,
more tweets are requested from the server. Since WASP is currently
limited to indexing HTML responses, it catches only some parts of the
tweets (see Figure 4), which turn out to be HTML templates requested via
Ajax for integration into the Twitter page.

Figure 4: Example of dynamic HTML content in WASP: (a) original tweet as
it appeared while scrolling down the Twitter timeline; (b) Twitter card
as it was requested for display, archived, and indexed.

   Based on these observations, we propose the following evaluation
setup for personal web archives. Since re-finding in personal web
archives has not been part of any evaluation campaign so far, a
respective set of topics and user interactions has to be built up front.
Besides monitoring user queries against WASP’s search functionality for
users who agree to share parts of their browsing and search activity,
one will periodically prompt active users of WASP with a re-finding game
similar to PageHunt [11]. The user will be shown the screenshot of a
page they have seen, or only parts thereof (e.g., only the color scheme
of the layout), or will be asked to re-find a piece of information they
have seen a given period of time ago (e.g., three days ago, two weeks
ago, etc.). Their task will be to come up with a sequence of queries
(and clicks) such that in the end the prescribed web page appears in the
top-k ranks of WASP’s retrieval component. In such cases, the desired
item will be known for evaluation purposes and the re-finding task can
have several difficulty levels (showing full information vs. only color
scheme, target information at top of a page or only requested upon
interaction, etc.). To measure retrieval success, the length of real and
the comparably artificial re-finding query and click sequences can be
measured as well as the specificity of the queries contrasted with the
size of the personal collection. But of course, the overall interesting
measure will be for how many real re-finding tasks the users are able to
pull out the desired result from their personal archive, their needle
stack.

4    DISCUSSION AND LESSONS LEARNED
Our primary goal with WASP was to develop a vertical prototype of a web
archiving and retrieval framework, which archives every web page and
every request made by a web page, and then indexes everything archived.
Based on first practical experiences with using WASP for our own
respective web traffic, however, there are still many things to be
sorted out before we can claim a flawless retrieval experience.
Unsurprisingly, the devil is in the details, but somewhat surprisingly,
we will be forced to revisit the basic notions of what is a web page,
what needs to be archived, and what needs to be indexed. This section
discusses lessons learned, outlining a number of exciting future
directions for research and development on web archiving and retrieval
in general, and for WASP in particular.

4.1    Which pages to archive?
Although WASP currently follows “archive first, ask questions later,”
users of a personal archiving system likely do not wish for all their
traffic to be archived, even if stored within their personal data space.
Without specific measures, sensitive data will end up in the archive,
e.g., banking pages, health-related browsing, as well as browsing
sessions with privacy-mode enabled (where users expect all traces of
their activities to be purged after the browser is closed); users may
not expect such data to emerge in search results weeks, months, or even
years later. Furthermore, just as some users regularly clean or clear
their browsing history, they will wish to clean or clear their archive.
Similarly, it will be necessary to protect the personal archive from
unauthorized access, analyze known and new attack vectors on the
archiving setup, and formalize the security implications that stem from
the use of such a system.

Figure 5: Screenshot mode mockup; the screenshot in the 2nd row and 2nd
column is highlighted by mouse-over.

Figure 6: Firefox toolbar indicating archiving is activated. The context
menu of this icon allows turning off the proxy usage, thereby
implementing a “pause-archiving” button.

    Based on these deliberations, it is clear that the user must be
given fine-grained control over what sites or pages are archived,
allowing for personal adjustments and policies. The recorded archive
needs to be browseable, so that individual entries can be selected for
removal. For more convenient browsing (both for cleaning and general
re-finding), we suggest a screenshot-based interface as shown in
Figure 5. At present, users can already influence which pages should not
be archived using proxy-switching plugins available for all modern
browsers that seamlessly integrate with WASP’s proxy-based architecture
(e.g., cf. Figure 6). Of course, specifying wildcard expressions hardly
qualifies as a user-friendly interface for non-computer scientists, so
that a better interface will be required in practice (e.g., using
classification techniques similar to [7]).
                                                                               Besides accidental page visits, another example of irrelevant
                                                                            pages may be found in more complex web applications. Take web-
                                                                            based RSS feed readers the likes of Feedly as an example: there is
                                                                            no need to index every page and every state of every page of the
                                                                            feed reader. Rather, the feed items to which the user pays attention
    Under some circumstances personal archiving systems could
                                                                            are of interest for indexing, since only they are the ones the user
act on their own behalf to allow for an improved experience of
                                                                            may eventually remember and wish to revisit. In this regard, two
the archived page, by archiving content the users did not request
                                                                            cases can be distinguished, namely the case where feed items are
themselves. This possibility leads to several new research ques-
                                                                            displayed only partially, so that the user has to click on a link
tions. For example, should all videos on a visited page be requested
                                                                            pointing to an external web page to consume a piece of content, and
and archived, so that the user can watch them later on from their
                                                                            the case where feed items are displayed in full on the feed reader’s
archive? Or in general, should the system predict and simulate
                                                                            page. The former case is straightforward, since a click indicates user
interactions that the user may later want to do on the archived
                                                                            attention, so that the feed reader’s page can be entirely ignored. In
page to archive the corresponding resources while they are still
                                                                            the latter case, however, every feed item the user reads should be
available? Moreover, should the system perform such a simulation
                                                                            indexed, whereas the ones the user skips should not so as not to
multiple times in order to detect the randomness in the web page’s
                                                                            pollute the user’s personal search results.
requests and consider this information in the reproduction?
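
As a minimal sketch of what the wildcard-based exclusion mentioned in Section 4.1 amounts to,
the following Python snippet decides whether a requested URL should be archived. The pattern
list, the function name should_archive, and the choice to match patterns against host plus path
are assumptions made for illustration; they do not describe how WASP or any particular
proxy-switching plugin is actually configured.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical exclusion patterns, matched against "<host><path>" of a request.
EXCLUDE_PATTERNS = [
    "*.mybank.example/*",   # every page on any subdomain of a banking site
    "mail.example.org/*",   # a web mail client
    "*/login*",             # login pages on any host
]


def should_archive(url, patterns=EXCLUDE_PATTERNS):
    """Return False if the URL matches any exclusion wildcard, True otherwise."""
    parts = urlsplit(url)
    target = (parts.hostname or "") + parts.path
    return not any(fnmatchcase(target, pattern) for pattern in patterns)
```

Even this small rule set illustrates the usability problem raised above: the user has to
anticipate hosts and path shapes in advance, which is hardly feasible without technical
knowledge.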
                                                                               More generally, all kinds of portal pages and doorway pages,
                                                                            ranging from newspaper front pages via social network pages to
4.2    Which pages to index?                                                search results pages, are candidates for omission. Analyzing the
While a comprehensive archive is necessary for a high-quality               user’s browsing behavior provides evidence of which pages they
reproduction of web pages, not everything that the browser receives         scrutinized sufficiently to be indexed. If a user spends time reading
is actually of interest to the user. From our own web browsing habits,      the headlines and excerpts of a front page, this would suggest
we can informally tell that many pages opened are not relevant for          indexing that page, but this may be difficult to discern in practice.
future retrieval, because they are dismissed upon first glance (e.g.,       Beyond that, a user’s behavior may be used as implicit relevance
pop-ups) or never even looked at.                                           feedback to be incorporated into tailored retrieval models.
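
The idea of using browsing behavior as evidence for indexing can be illustrated with a
deliberately simple dwell-time filter. This is only a sketch under stated assumptions: the
PageVisit record, the availability of a per-page dwell time, and the threshold value are all
hypothetical and not part of WASP.

```python
from dataclasses import dataclass


@dataclass
class PageVisit:
    url: str
    dwell_seconds: float  # assumed to be measurable by the browser or proxy


# Hypothetical threshold; a real system would have to tune or learn it.
MIN_DWELL_SECONDS = 5.0


def pages_to_index(visits, min_dwell=MIN_DWELL_SECONDS):
    """Return the URLs of all visits whose dwell time reaches the threshold."""
    return [visit.url for visit in visits if visit.dwell_seconds >= min_dwell]
```

A fixed threshold like this would drop pop-ups that are dismissed at first glance, but it
cannot tell a front page that was read attentively from one that was merely left open, which
is the difficulty noted above.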
DESIRES 2018, August 2018, Bertinoro, Italy            Johannes Kiesel, Arjen P. de Vries, Matthias Hagen, Benno Stein, and Martin Potthast


4.3    What is the document unit for indexing?                            less emphasis on scalability. Developments in the UI/UX of web
In its present form, WASP archives everything that runs under a           archive search are, however, likely transferable, in both directions—
given URL—including GET parameters, but excluding fragment                as argued in Section 4.2, what we learn from observing interactions
identifiers—as one unit. Just like in regular search, not every piece     with personal web archives may very well carry over to the large
of content is relevant for indexing. Main content extraction is an        web archives of interest to Digital Humanities researchers [3].
obvious solution to this problem, but the current state of the art           We find that a new blend of techniques that have been proposed
frequently fails on pages where many small pieces of content can          previously will be necessary to design the right user experience,
be found. Furthermore, many websites today spread one coherent            and we realize that we have only scratched the surface so far. For
piece of content over many sub-pages (so-called pagination). For          example, searching the social web is different from searching the
instance, news publishers often employ pagination, forcing readers        web, as shown convincingly in [2]. We also highlight the immediate
to switch pages (possibly to serve extra display ads or improve           relevance of research into focused retrieval carried out in the context
engagement metrics that determine the value of the display ads            of INEX. The question of how to determine a retrieval unit has clearly
shown on the publisher’s site). For archive retrieval purposes, how-      not been solved yet, and the usage scenario of personalized web
ever, pagination can be detrimental, penalizing the relevance of a        archive search that we envision has increased the urgency to revisit
paginated news article to a query, since only parts of the article are    that line of research.
scored at a time.
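
The document unit described at the beginning of Section 4.3 (a URL including its GET
parameters, but without its fragment identifier) can be sketched in a few lines of Python. The
function names are illustrative assumptions and are not taken from the WASP code base.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def document_unit_key(url):
    """Return the URL without its fragment identifier; GET parameters are kept."""
    return urldefrag(url).url


def group_by_document_unit(visited_urls):
    """Group visited URLs by document unit, preserving the order of visits."""
    units = defaultdict(list)
    for url in visited_urls:
        units[document_unit_key(url)].append(url)
    return dict(units)
```

Under this definition, two sub-pages of a paginated article that differ in a GET parameter end
up as separate units, which is exactly the situation that penalizes paginated content.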
    On the other hand, physical pages are also not necessarily atomic:    6     SUMMARY
many web pages built with modern web design tools are single-             This paper introduces WASP, a prototypical implementation of a
page applications, where different pieces of content are shown upon       personal web archive and search system. It provides a first qualita-
user request under the same URL. For instance, a blog platform            tive evaluation of such a system, outlines future steps in this
may show each blog post requested by a user simply by loading             regard, and discusses the challenges that such systems face.
it in the background using a JavaScript-based AJAX request, and           WASP combines state-of-the-art archiving and retrieval technology,
replacing the currently shown post with a new one. In this case,          to which it adds an intuitive and tailored search interface. Gener-
an ideal web archive search would identify the individual posts           ally, the use case for personal web archive search is closer to that
and index them separately, injecting code upon reproduction that          of a re-finding engine. We identify current limitations in archiv-
replaces the displayed post with the desired one. Currently, we           ing technology for this use case and discuss how the evaluation
are technologically far from such a feature. In a different case,         of a search engine has to be adapted for search in personal web
like the Twitter timeline, a web page consists of several (possibly       archives (e.g., to account for several versions of a single web page when it is
independent) content segments. Again, each such segment should            revisited). In the same context, we discuss what content should be
be indexed separately for an appropriate relevance computation. To        archived and what content should be indexed, highlighting privacy
meet this challenge, web pages should be segmented into coherent          issues (e.g., archiving in incognito mode) and advantages (re-finding
units of content that belong together on a page, and each segment         information using only local data).
identified should be treated as a document unit. However, just like
with most of the aforementioned problems, page segmentation, too,         REFERENCES
is still in its infancy.                                                   [1] E. Adar, J. Teevan, and S.T. Dumais. 2008. Large scale analysis of web revisitation
                                                                               patterns. In CHI ’08. 1197–1206.
    To optimize these choices, click behavior and dwell times on certain   [2] O. Alonso, V. Kandylas, S.-E. Tremblay, J.M. Hofman, and S. Sen. 2017. What’s
pages may be the best features to determine what parts should be               Happening and What Happened: Searching the Social Web. In WebSci ’17. 191–
indexed, whether pages should be merged into one, or one divided               200.
                                                                           [3] Anat Ben-David and Hugo Huurdeman. 2014. Web Archive Search as Research:
into many. Furthermore, such information on user behavior would                Methodological and Theoretical Implications. Alexandria 25, 1-2 (2014), 93–111.
be very useful for ranking results in the personal search. Currently,      [4] J. Shane Culpepper, Fernando Diaz, and Mark D. Smucker. 2018. Report from
however, such behavioral data is probably not even available to                the Third Strategic Workshop on Information Retrieval in Lorne (SWIRL 2018).
                                                                               Technical Report.
commercial search engines.                                                 [5] E. Cutrell, D. Robbins, S. Dumais, and R. Sarin. 2006. Fast, Flexible Filtering with
                                                                               Phlat. In CHI ’06. 261–270.
                                                                           [6] S. Dumais, E. Cutrell, J.J. Cadiz, G. Jancke, R. Sarin, and D.C. Robbins. 2003. Stuff
5     RELATED WORK                                                             I’ve Seen: A System for Personal Information Retrieval and Re-use. In SIGIR ’03.
                                                                               72–79.
WASP is directly related to prior work on desktop search, including        [7] C. Eickhoff, K. Collins-Thompson, P.N. Bennett, and S.T. Dumais. 2013. Designing
the already mentioned Stuff I’ve Seen [6]. However, apart from not             Human-Readable User Profiles for Search Evaluation. In ECIR 2013. 701–705.
indexing all documents that may exist on a desktop, the intended           [8] H. Holzmann, V. Goel, and A. Anand. 2016. ArchiveSpark: Efficient Web Archive
                                                                               Access, Extraction and Derivation. In JCDL ’16. 83–92.
usage differs slightly as well: WASP aims to track everything a user       [9] Johannes Kiesel, Florian Kneist, Milad Alshomary, Benno Stein, Matthias Hagen,
has seen, as they saw it, and in that sense provides some notion               and Martin Potthast. 2018. Reproducible Web Corpora: Interactive Archiving with
                                                                               Automatic Quality Assessment. Journal of Data and Information Quality (2018).
of versioning. While not yet implemented, a future version should         [10] W. Jones. 2010. Keeping found things found: The study and practice of personal
explore the functionality once implemented in diff-IE, i.e., to rank           information management. Morgan Kaufmann.
pages that evolved differently from static ones, and this way provide     [11] H. Ma, R. Chandrasekar, C. Quirk, and A. Gupta. 2009. Improving search engines
                                                                               using human computation games. In CIKM ’09. 275–284.
immediate insight into changes of the web over time [13].                 [12] Th. Samar, M.C. Traub, J. van Ossenbruggen, L. Hardman, and A.P. de Vries. 2018.
   WASP is also related to search tools for web archives, such as              Quantifying retrieval bias in Web archive search. International Journal on Digital
ArchiveSpark [8]. However, due to handling a single user’s view of             Libraries 19, 1 (01 Mar 2018), 57–75.
                                                                          [13] J. Teevan, S. Dumais, and D. Liebling. 2010. A Longitudinal Study of How
the online world only, the system aspects to be addressed include              Highlighting Web Content Change Affects People’s Web Interactions. In CHI ’10.

</pre>