=Paper= {{Paper |id=Vol-2167/paper6 |storemode=property |title=WASP: Web Archiving and Search Personalized |pdfUrl=https://ceur-ws.org/Vol-2167/paper6.pdf |volume=Vol-2167 |authors=Johannes Kiesel,Arjen P. De Vries,Matthias Hagen,Benno Stein,Martin Potthast |dblpUrl=https://dblp.org/rec/conf/desires/KieselVHSP18 }} ==WASP: Web Archiving and Search Personalized== https://ceur-ws.org/Vol-2167/paper6.pdf
                        WASP: Web Archiving and Search Personalized
Johannes Kiesel, Bauhaus-Universität Weimar, Weimar, Germany, johannes.kiesel@uni-weimar.de
Arjen P. de Vries, Radboud University, Nijmegen, The Netherlands, a.devries@cs.ru.nl
Matthias Hagen, Martin-Luther-Universität Halle-Wittenberg, Halle, Germany, matthias.hagen@informatik.uni-halle.de
Benno Stein, Bauhaus-Universität Weimar, Weimar, Germany, benno.stein@uni-weimar.de
Martin Potthast, Leipzig University, Leipzig, Germany, martin.potthast@uni-leipzig.de

ABSTRACT
Logging and re-finding the information we encounter every day while
browsing the web is a non-trivial task that is, at best, inadequately
supported by existing tools. It is time to take another step forward: we
introduce WASP, a fully functional prototype of a personal web archive and
search system, which is available open source and as an executable Docker
image. Based on the experiences and insights gained while designing and
using WASP, we outline how personal web archive and search systems can be
implemented, discuss what technological and privacy-related challenges such
systems face, and propose a setup to evaluate their effectiveness. As a key
insight, we argue that, compared to regular web search, indexing and
retrieval for personal archive search can be strongly tailored to a specific
user and their behavior on the visited pages.

1     INTRODUCTION
Lifelogging¹ has become a common practice as a result of the omnipresence
of smartphones, smart watches, and fitness trackers, and of emerging
technologies such as smart glasses, wearable technologies, and
sensor-enabled smart homes. Isn’t it surprising that keeping track of one’s
online activities is comparatively underdeveloped? A significant amount of
work has been invested into understanding personal information management
[10] and developing tools to support it, including the winner of the SIGIR
2014 Test of Time Award, “Stuff I’ve Seen” (SIS) by Dumais et al. [6].
Somewhat ironically, however, neither SIS nor its follow-up Phlat [5] is
available today, even if the key insights gained have likely informed the
development of Windows desktop search and the intelligent assistant Cortana.
Likewise, Spotlight on macOS supports search over local documents and other
digital assets. Both are integrated with the web browsers from Microsoft and
Apple, respectively, to index browsing history. Meanwhile, the history tabs
of modern web browsers provide access to the history of the currently open
browser as well as to pages recently visited on other devices. However,
current browsers do not align and integrate the browsing histories across
devices, nor, apparently, do the aforementioned tools index the content of
web pages visited, but only their titles and URLs. In fact, the possibility
to track (let alone search) one’s browsing history using off-the-shelf tools
is still fairly limited.
   In this context, it is not surprising that personal information access
was one of the major topics discussed at the Third Strategic Workshop on
Information Retrieval in Lorne (SWIRL 2018) [4]. The attendees noted that
this problem, open for so long, has not been addressed adequately and,
worse, that helping people re-find and re-visit their online information and
their prior interactions with these sources is an ever more daunting
challenge, as this information today resides on multiple devices and in a
large variety of information services, each of which constructs its own data
silo and search API (if such access is offered at all). Specifically, the
report mentions the high cost of entry for scientists as a major obstacle,
where “there is substantial engineering required for a minimal working
system: to fetch data from different silos, parse different data formats,
and monitor user activity.”
   We propose to take a pragmatic “shortcut” and to establish empirically
how far that workaround can bring us. Increasingly, access to our digital
information takes place through the web browser as the interface. Therefore,
we set out to develop WASP, a prototype system for personal web archiving
and search. WASP saves one’s personal web browsing history using
state-of-the-art web archiving technology and offers a powerful retrieval
interface over that history. This browser-focused setup enables the user to
recall information they personally gathered without the need to deal with
the large variety of information sources. Even if we do not cover the full
range of digital objects that may accrue on a person’s desktop and mobile
devices, high-quality archival of web pages visited may capture a large
fraction of the information we interact with.
   In addition to a detailed technical description of WASP in Section 2,
this paper reports on the observations that we made (Section 3) and the
challenges for personal web archiving and search that we identified
(Section 4) through our extensive use of the WASP prototype, which we
provide both open source and as an executable Docker container so that
others can use it within their research or personal lifelogging setup.²,³

1 https://en.wikipedia.org/wiki/Lifelog

2 https://hub.docker.com/r/webis/wasp/
3 https://github.com/webis-de/wasp

DESIRES 2018, August 2018, Bertinoro, Italy
© 2018 Copyright held by the author(s).


2     THE WASP PROTOTYPE
The WASP⁴ prototype integrates, for the first time, existing archiving,
indexing, and reproduction technology into a single application. Figure 1
illustrates how the user’s browser interacts through WASP with the World
Wide Web under the three usage scenarios: archival, search, and reproduction
of web pages, as detailed below.

[Figure 1: diagram of the WASP components (Browser, proxy, warcprox, WARCs,
Index, Search Interface, pywb, World Wide Web), drawn in several panels
including (a) and (b).]

2.1     Archiving Proxy and Indexing
After starting WASP, the user has to reconfigure their browser to accept
WASP as a forward proxy and to trust its certificate. WASP then archives all
HTTP(S) requests and responses from and to the browser in the standard web
archiving format (WARC) (Figure 1 (a)). This is achieved using the Internet
Archive’s warcprox software,⁵ whose WARC files contain all the information
necessary to reproduce an archived browsing session at a later time.
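
To make the archived unit more concrete, the following is a minimal sketch of how a single HTTP response can be wrapped in a WARC/1.0 record and read back. It is not warcprox's code: the function names `build_warc_record` and `parse_warc_record` are hypothetical, only a small subset of the WARC header fields is written, and the parser assumes exactly one uncompressed record. Real WARC files typically carry additional header fields and may be compressed.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Tuple


def build_warc_record(target_uri: str, http_message: bytes,
                      record_type: str = "response") -> bytes:
    """Serialize one WARC/1.0 record wrapping a raw HTTP message.

    Hypothetical helper for illustration; only a minimal header set is written.
    """
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", f"application/http; msgtype={record_type}"),
        ("Content-Length", str(len(http_message))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The content block is followed by two CRLF pairs that terminate the record.
    return head.encode("utf-8") + http_message + b"\r\n\r\n"


def parse_warc_record(record: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split one serialized record into its WARC headers and the HTTP message."""
    # The first empty line ends the WARC header section.
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if lines[0] != "WARC/1.0":
        raise ValueError("not a WARC/1.0 record")
    headers = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length tells us where the HTTP message ends.
    length = int(headers["Content-Length"])
    return headers, rest[:length]


# Usage: archive a tiny HTML response and recover it unchanged.
response = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
            b"<html><title>Example</title></html>")
record = build_warc_record("https://example.org/", response)
headers, message = parse_warc_record(record)
```

Because the complete HTTP message (status line, headers, and body) is stored verbatim together with its target URI and capture date, a replay tool has what it needs to serve the same response again for the same request.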
   To enable searching the archived content, we devised a software component
that monitors the WARC files and automatically indexes HTML responses and
their corresponding requests in an ElasticSearch index.⁶ In detail, we use
the Lemur project’s WARC parser⁷ and Apache’s HttpClient library⁸ to read
HTTP messages as they are appended to the WARC files. The title and text of
HTTP responses that have the MIME type HTML are extracted using the Jericho
HTML Parser library.⁹ The title and