<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Collecting and Analysing Personal Information Management Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Charlie Abela</string-name>
          <email>charlie.abela@um.edu.mt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Staff</string-name>
          <email>chris.staff@um.edu.mt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siegfried Handschuh</string-name>
          <email>siegfried.handschuh@deri.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Mathematics, University of Passau</institution>
          ,
          <addr-line>Bavaria</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Intelligent Computer Systems, University of Malta</institution>
          ,
          <country country="MT">Malta</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Personal Information Management (PIM) research has investigated the information trail generated by an individual while performing an information-seeking task on their desktop, with the aim of improving PIM tool-support. Nevertheless, due to its personal nature, such data is rarely released for reuse. Furthermore, no tool exists that allows a PIM researcher to investigate how PIM-related data evolves over time, or to analyse the results of applying different approaches to such data. In this paper, we present the Personal Information Management Analytix framework (PiMx), which leverages a graph-analytics approach for the analysis and visualisation of evolving activity-data generated by individuals performing tasks on their desktops. We further describe a data collection methodology that opens up the data for reuse and briefly discuss how PiMx is used to analyse such a collection.</p>
      </abstract>
      <kwd-group>
        <kwd>Graph analytics</kwd>
        <kwd>Personal Information Management</kwd>
        <kwd>Task identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        When we perform some information-seeking task we tend to spend a considerable
amount of time looking back, establishing past references and remembering [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
To find and re-find information items, we tend to rely on our organisational skills
and the support of search, bookmarking and history tools [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, most of
these tools tend to consider the user’s information-seeking activities as unrelated
events, unlike the way we actually organise things, which is usually in terms of
directories (on our desktop) and tasks (conceptually) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>In our research we are motivated by the need to better understand how these
activities evolve over time and the extent to which it is possible to automatically
organise them in terms of tasks. In this paper we present PiMx, a Personal
information Management analytix framework that we implemented to support
us in our investigation. PiMx enables us to simulate the incremental evolution
of PIM data and to exploit graph-analytics to analyse and visualise the user’s
information-seeking process. It is also possible to apply different algorithmic
approaches and analyse their performance in addressing the task-identification
problem. To the best of our knowledge, no such tool is available.</p>
      <p>It is difficult to find suitable PIM datasets freely available for reuse and
evaluation. We therefore performed a controlled experiment to collect our own
dataset. We briefly describe the methodology we adopted and show how we
use PiMx to analyse and compare a number of approaches aimed at automatically
identifying task-clusters from the data.</p>
      <p>In the rest of the paper, we discuss related work in Sec. 2, followed in Sec. 3
by a description of the controlled experiment we performed and the data that
was collected. In Sec. 4 we give an overview of the PiMx framework used to
analyse the collected data, and we conclude with some future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The task modelling ontology proposed by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provides task-related information
support for knowledge workers and links the user’s task activities with her personal
information context. The user’s desktop activity context was also modelled by [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
as an OWL-DL ontology and used to enhance the performance of task detection
algorithms. A similar context model was proposed by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which defines events
and contextual elements relevant to a knowledge worker and therefore also
covers projects and collaborative work. We have adopted this latter model for
our activity data.
      </p>
      <p>
        Evaluating PIM tool-support is inherently difficult, in particular because of
the lack of readily available datasets [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One available dataset is provided by
the Web History Repository project<sup>3</sup>. Participants in the project can voluntarily
contribute their Web browsing history, which is anonymised and sent to
a remote server via a dedicated Firefox<sup>4</sup> plug-in. Each user’s history is uniquely identified
by a global ID and the URLs accessed are encrypted.
      </p>
      <p>Existing graph analysis tools, such as Visone<sup>5</sup> and Gephi<sup>6</sup>, support
neither the analysis, over time, of streams of users’ activity nor the application
and comparison of different user-defined algorithms over those streams.</p>
    </sec>
    <sec id="sec-3">
      <title>Collecting and Modeling Activity Data</title>
      <p>Although the dataset from the Web History Repository is substantial, we were
unable to use it because our research requires, for each user, the sequence
in which documents were accessed as well as the tasks the user was engaged
in when accessing them.</p>
      <sec id="sec-3-1">
        <p><sup>3</sup> http://webhistoryproject.blogspot.com/ <sup>4</sup> https://www.mozilla.org/en-US/firefox/desktop/ <sup>5</sup> http://visone.info/ <sup>6</sup> http://gephi.github.io/</p>
        <p>We therefore conducted our own data collection experiment in a controlled environment, during which we logged the browsing
activity of 20 participants while performing three pre-defined tasks. The tasks
were: providing specific information about the planning of a vacation in a specific
country; answering questions related to the research area of human computation;
and, providing information about any two upcoming music events.</p>
        <p>For the experiment, we set up a cluster of machines in one of our laboratories.
Each PC ran Windows and had two activity-monitoring applications installed.
The first application was a Firefox plug-in used to collect each participant’s
Web browsing activity. The second application monitored file activity
(such as access to word processing documents) on the participant’s desktop. We cleaned the data and
removed references that could lead to the identification of the participants.
References to the accessed documents were, however, retained. In the future, the
dataset can be made accessible and shared for research evaluation purposes, in
line with the Web History Repository’s philosophy.</p>
        <p>
          The logged data is represented in RDF and is based on the context model
developed by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. It includes information about the type of event, such as whether
it is a navigational or a tabbed event; the application that generated the event;
the timestamp; the URI of the document accessed as a result of the event; and an
excerpt of text from the window caption. Other information specific to particular
events is also captured. This includes the URI and window caption of the page
that was in focus before the event was triggered and information about files
found on the user’s desktop such as the file name and whether a document was
edited or not. Listing 1.1 shows an instance of an
EnteredURL event, which is generated whenever the user manually enters a
URL in the browser’s address bar.
          <preformat># The @prefix declarations are not in the original listing; the actions: namespace is an assumption.
@prefix actions: &lt;http://test.org/vocabulary/actions/&gt; .
@prefix rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .

&lt;http://test.org/actions/EnteredURL20140430T134829&gt;
    a &lt;http://test.org/vocabulary/actions/EnteredURL&gt; ;
    actions:timeStamp "2014-04-30T13:48:29"^^xsd:dateTime ;
    actions:processName "firefox"^^rdfs:Literal ;
    actions:uri &lt;https://www.google.com.mt/search?q=carl+bee+song&gt; ;
    actions:docInfo "carl_bee_song"^^rdfs:Literal ;
    actions:fromURI &lt;http://www.tvm.com.mt/news-isle-of-mtv-malta/&gt; ;
    actions:fromPageTitle "Isle_of_MTV_Malta"^^rdfs:Literal .</preformat>
        </p>
        <p>Listing 1.1: EnteredURL Event</p>
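        <p>As a rough illustration of how such an event could be consumed downstream, the following Python sketch holds the fields of Listing 1.1 in a plain record and derives the document switch (from the page previously in focus to the accessed page) that the event implies. The class name, field names and helper function are hypothetical and are not part of PiMx; an actual pipeline would read the RDF through a library such as Apache Jena rather than construct the record by hand.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class ActivityEvent:
    """Hypothetical in-memory form of one logged event (fields follow Listing 1.1)."""
    event_type: str
    timestamp: datetime
    process_name: str
    uri: str
    doc_info: str
    from_uri: Optional[str] = None
    from_page_title: Optional[str] = None


def document_switch(event: ActivityEvent) -> Optional[Tuple[str, str]]:
    """Return the (previous, accessed) URI pair implied by the event, if any."""
    if event.from_uri is None or event.from_uri == event.uri:
        return None
    return (event.from_uri, event.uri)


# Values copied from Listing 1.1.
entered_url = ActivityEvent(
    event_type="EnteredURL",
    timestamp=datetime.fromisoformat("2014-04-30T13:48:29"),
    process_name="firefox",
    uri="https://www.google.com.mt/search?q=carl+bee+song",
    doc_info="carl_bee_song",
    from_uri="http://www.tvm.com.mt/news-isle-of-mtv-malta/",
    from_page_title="Isle_of_MTV_Malta",
)
print(document_switch(entered_url))
```
]]></preformat>
        <p>Events that carry no previous page, or that stay on the same page, yield no switch in this sketch.</p>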
      </sec>
    </sec>
    <sec id="sec-4">
      <title>PiMx: tool for analysing the data</title>
      <p>We implemented the PiMx (Personal information Management analytix)
framework to better analyse the collected data. The tool enables us to load a user’s
activity-log (collected during the experiment described in Sec. 3), simulate the
user’s behaviour by replaying that user’s activity trail, and analyse the
evolution of the task-clusters through different views. The replay can be paused and
resumed at any stage. PiMx uses the JUNG graph library<sup>7</sup> for the graph-analytics
and Apache Jena<sup>8</sup> for modelling and querying the data.</p>
      <sec id="sec-4-1">
        <p><sup>7</sup> http://jung.sourceforge.net/ <sup>8</sup> https://jena.apache.org/</p>
        <p>[Fig. 1. Panel titles: PiMx Viz, PiMx History, PiMx Stats, PiMx Clusters]</p>
        <p>PiMx includes an interactive PiMx-Viz component which currently presents
two visualisations (see Fig. 1). The first visualisation shows the complete,
unadulterated activity-log as an undirected graph that evolves over time. The second
displays the coloured task-clusters as they are incrementally created. Nodes are
assigned a globally unique ID and their size is computed in relation to a ranking
value. Edges are weighted based on the number of switches between any two
nodes. Further information about the nodes and edges, such as a node’s degree
and URL, as well as the type of an edge and the timestamp of its last access, can be
viewed by hovering over them. It is also possible to click on an individual node
and visualise the subgraph induced by that node’s neighbourhood.</p>
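        <p>The edge weighting and neighbourhood view described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the activity trail is reduced to a hypothetical list of document identifiers, a switch is taken to be any pair of consecutive, distinct documents, and the function names are ours. PiMx itself builds its graphs with JUNG.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def build_switch_graph(trail):
    """Undirected graph as {edge: weight}; the weight counts switches between two documents."""
    weights = Counter()
    for previous, current in zip(trail, trail[1:]):
        if previous != current:
            # frozenset makes the edge undirected: A->B and B->A share one key.
            weights[frozenset((previous, current))] += 1
    return weights


def neighbourhood_subgraph(weights, node):
    """Edges of the subgraph induced by a node together with its direct neighbours."""
    neighbours = {node}
    for edge in weights:
        if node in edge:
            neighbours |= edge
    return {edge: w for edge, w in weights.items() if edge <= neighbours}


# Hypothetical trail of accessed documents, in access order.
trail = ["A", "B", "A", "C", "B", "D"]
graph = build_switch_graph(trail)
for edge, weight in sorted(graph.items(), key=lambda item: sorted(item[0])):
    print(sorted(edge), weight)
```
]]></preformat>
        <p>Replaying a longer prefix of the trail and rebuilding the graph gives successive snapshots, which is one simple way to approximate the incremental evolution shown in the visualisation.</p>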
        <p>The PiMx-Stats component (Fig. 1) shows different graph-related statistics,
such as the number of vertices and edges in the graph and the clustering coefficient,
average distance, and diameter. There is also information about the number
of search and removed nodes. The former represent the pages associated with
search queries while the latter refer to those documents that were closed by the
participants after being accessed. This view also provides information about the
type and number of occurrences of the events that were triggered.</p>
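        <p>For readers unfamiliar with these measures, the sketch below computes them for a small, hypothetical undirected graph given as an adjacency map. It assumes a connected graph with at least two vertices and takes the clustering coefficient to be the average of the local (per-node) coefficients; the text does not state which definition PiMx or JUNG reports, so that choice is an assumption.</p>
        <preformat><![CDATA[
```python
from collections import deque
from itertools import combinations


def bfs_distances(adjacency, source):
    """Shortest path length (in hops) from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def graph_statistics(adjacency):
    """Vertex and edge counts, average local clustering coefficient, average distance, diameter."""
    lengths = []
    for source in adjacency:
        for target, distance in bfs_distances(adjacency, source).items():
            if target != source:
                lengths.append(distance)
    local = []
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            local.append(0.0)  # fewer than two neighbours: no triangle is possible
            continue
        links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
        local.append(2 * links / (k * (k - 1)))
    return {
        "vertices": len(adjacency),
        "edges": sum(len(n) for n in adjacency.values()) // 2,
        "clustering_coefficient": sum(local) / len(local),
        "average_distance": sum(lengths) / len(lengths),
        "diameter": max(lengths),
    }


# Hypothetical graph: a triangle A-B-C with a pendant node D attached to B.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
stats = graph_statistics(adjacency)
print(stats)
```
]]></preformat>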
        <p>It is possible to incrementally view a detailed history of all the accessed
documents through the PiMx-History view. This information includes the application
used to access a document, the window caption, the time of the last access, the
URL and the number of times that a document was accessed. The researcher can
apply filters over this data and view it based on different time-windows, such
as by last hour, last 4 hours, today and yesterday, as well as by application or
file-type. A search facility based on Jena-Fuseki’s text query and a Lucene index
allows for keyword search over the data.</p>
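        <p>The time-window and application filters can be pictured as simple predicates over the history records. The following sketch assumes a hypothetical record layout (a dictionary with <italic>uri</italic>, <italic>application</italic>, <italic>last_access</italic> and <italic>access_count</italic> keys) and made-up sample entries; it does not reproduce the PiMx-History implementation, which queries the data through Jena.</p>
        <preformat><![CDATA[
```python
from datetime import datetime, timedelta


def filter_history(entries, now, window=None, application=None):
    """Keep entries last accessed within `window` of `now`, optionally for one application."""
    selected = []
    for entry in entries:
        if window is not None and now - entry["last_access"] > window:
            continue
        if application is not None and entry["application"] != application:
            continue
        selected.append(entry)
    return selected


# Hypothetical history records; URIs and counts are illustrative only.
now = datetime(2014, 4, 30, 14, 0, 0)
history = [
    {"uri": "http://example.org/a", "application": "firefox",
     "last_access": datetime(2014, 4, 30, 13, 48, 29), "access_count": 3},
    {"uri": "file:///C:/notes.docx", "application": "winword",
     "last_access": datetime(2014, 4, 30, 9, 15, 0), "access_count": 1},
    {"uri": "http://example.org/b", "application": "firefox",
     "last_access": datetime(2014, 4, 29, 17, 30, 0), "access_count": 2},
]
last_hour = filter_history(history, now, window=timedelta(hours=1))
firefox_only = filter_history(history, now, application="firefox")
print([entry["uri"] for entry in last_hour])
print([entry["uri"] for entry in firefox_only])
```
]]></preformat>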
        <p>The PiMx-Clustering component is specific to the goal we wanted to attain
and provides information about the automatically generated task-clusters. Each
cluster is assigned a unique ID and each document node within a cluster has
associated with it a ranking value based on its importance within that cluster.</p>
        <p>
          Through PiMx we are currently able to compare the suitability of different
algorithmic approaches. In particular, we applied the Bron-Kerbosch algorithm
for finding maximal cliques [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], the community detection algorithm developed by
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and our iDeTaCt density-based clustering algorithm, details of which can be
found in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
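        <p>To make the first of these approaches concrete, the sketch below implements the basic (non-pivoting) form of the Bron-Kerbosch algorithm over an adjacency map. The example graph is hypothetical, and the text does not say which variant of the algorithm is used within PiMx, so this should be read as a generic illustration of maximal-clique enumeration rather than as the PiMx code.</p>
        <preformat><![CDATA[
```python
def bron_kerbosch(adjacency):
    """Return all maximal cliques of an undirected graph (basic version, no pivoting)."""
    cliques = []

    def expand(clique, candidates, excluded):
        # A clique is maximal when no vertex is left to extend it (candidates)
        # and no already-processed vertex could have extended it (excluded).
        if not candidates and not excluded:
            cliques.append(clique)
            return
        for node in list(candidates):
            expand(clique | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(frozenset(), set(adjacency), set())
    return cliques


# Hypothetical document graph: a triangle A-B-C with a pendant node D attached to B.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
maximal_cliques = bron_kerbosch(adjacency)
print(sorted(sorted(clique) for clique in maximal_cliques))
```
]]></preformat>
        <p>In a task-identification setting, each maximal clique could be read as a candidate group of documents that were all switched between directly; whether such groups correspond to tasks is exactly what a comparison across approaches would need to establish.</p>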
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In the near future we will be collecting more PIM data and making it available
for reuse. We also plan to provide extensibility interfaces for PiMx so that other
evolving data, such as social-network and web-usage data, can be visualised,
analysed and searched.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abela</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Handschuh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automatic Task-Cluster Generation based on Document Switching and Revisitation</article-title>
          . In:
          <source>Proceedings of the 1st Workshop on Deep Content Analytics Techniques for Personalized and Intelligent Services, co-located with UMAP'15</source>
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bron</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kerbosch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Algorithm 457: finding all cliques of an undirected graph</article-title>
          .
          <source>Commun. ACM</source>
          <volume>16</volume>
          , no.
          <issue>9</issue>
          : pp.
          <fpage>575</fpage>
          -
          <lpage>577</lpage>
          . (
          <year>1973</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teevan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Personal Information Management</source>
          . University of Washington Press, ISBN 9780295987378 (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mayer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Web History Tools and Revisitation Support: A Survey of Existing Approaches and Directions</article-title>
          .
          <source>Found. Trends Human-Computer Interaction</source>
          <volume>2</volume>
          , no.
          <issue>3</issue>
          : pp.
          <fpage>173</fpage>
          -
          <lpage>278</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Morris</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ringel Morris</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venolia</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Searchbar: a search-centric web history for task resumption and information re-finding</article-title>
          . In:
          <source>26th annual SIGCHI conference on Human factors in computing systems, CHI '08</source>
          , pp.
          <fpage>1207</fpage>
          -
          <lpage>1216</lpage>
          . ACM Press, New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>M. E. J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Girvan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Finding and evaluating community structure in networks</article-title>
          .
          <source>Physical Review E</source>
          <volume>69</volume>
          ,
          <elocation-id>026113</elocation-id>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ong</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riss</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grebner</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Semantic Task Management Framework</article-title>
          . In: K. Tochtermann and H. Maurer (eds.),
          <source>I-KNOW '08: Proceedings of the 8th International Conference on Knowledge Management</source>
          , Graz, Austria (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rath</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devaurs</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lindstaedt</surname>
            ,
            <given-names>S.N.</given-names>
          </string-name>
          :
          <article-title>UICO: an ontology-based user interaction context model for automatic task detection on the computer desktop</article-title>
          . In:
          <source>Proceedings of the 1st Workshop on Context, Information and Ontologies (CIAO '09)</source>
          . ACM, New York, NY, USA (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Schwarz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A Context Model for Personal Knowledge Management Applications</article-title>
          . In:
          <source>Modeling and Retrieval of Context</source>
          , volume
          <volume>3946</volume>
          , Springer, Berlin Heidelberg (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>