<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Using Information Retrieval for the Selection and Sensitivity Review of Digital Public Records</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timothy Gollins</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Graham McDonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing Science, University of Glasgow</institution>
          ,
          <addr-line>Glasgow</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>Open government facilitates citizens' access to government records, through Freedom of Information laws, and through government archives after a period of years (e.g. 20) has elapsed. However, there are growing challenges in established archival processes that have been brought about by the introduction of digital records and the consequent breakdown of the pre-existing administrative practices within government institutions. In this paper, we discuss challenges that arise from two stages in the archiving of digital government records, which information retrieval research can address: the selection/appraisal of appropriate records to archive, and the review of those records to ensure that no sensitive information is released. We also suggest tentative solutions for sensitivity review based on our own work.</p>
      </abstract>
      <kwd-group>
        <kwd>Sensitivity Review</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Open government holds that citizens have the right to
access the records (documents and proceedings) of government
and other public organisations, to facilitate accountability
under the rule of law [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Freedom of information (FOI)
legislation (e.g. in the UK [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and the US [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]) facilitates
this right; citizens can request that government documents be
provided, subject to certain prescribed exemptions (e.g.
personal privacy, health &amp; safety, commercial confidentiality).
      </p>
      <p>
        The principles of open government have also been
enshrined historically, both in the UK and in other
jurisdictions, in that public records must be released to archives
after a number of years have elapsed (in the UK 30 years, but
now being reduced to 20 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). There are two broad models
of archival legislation guaranteeing access to public records.
These are “Open by default” as typified by the UK [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ],
and “Closed<fn id="fn1"><label>1</label><p>Closed records are those records held by an organisation or archive that have yet to be released to public view.</p></fn> by default - release on FOI request” typified
by the US federal code [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        Under both access models, it is necessary to ensure that
no sensitivities remain in the records released. This requires
that the records are reviewed by human assessors who are
familiar with the topics concerned and can verify that no
exemptions should be applied. For instance, in the UK, the
mention of the name of an informant in a theatre of conflict
could put their life or family in danger, and the record would
be closed on the grounds of health &amp; safety [
        <xref ref-type="bibr" rid="ref1">1</xref>
        , section 38]
for up to 120 years. In the US there has been considerable
work on the impact of privacy on archival practice [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], with
regulations on closure deriving from the constitution, federal
codes and state laws.
      </p>
      <p>
        Some 20 to 30 years ago, governments in developed countries
increasingly moved to digital record keeping, as the means of
information production became digital (e.g. networked PCs
&amp; email). This resulted in substantial changes in
administrative practice, a consequent increase in the volume and
complexity of (digital) records kept, and a breakdown in
the previously well-managed patterns of their organisation [
        <xref ref-type="bibr" rid="ref10 ref9">9,
10</xref>
        ]. However, until very recently the archival and records
management community have almost exclusively focused on
the apparently insurmountable challenge of preserving
digital records. Recent work has begun to demonstrate that this
emphasis on preservation may be misplaced [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; the more
immediate challenge arises in safely capturing the digital
records in the first place. This includes difficulties in
selection/appraisal and, in particular, sensitivity review. Any
archive of public records will soon be forced to address both
of these issues to ensure open government remains a reality.
In the remainder of this paper, we detail the challenges that
may be addressed by research in information retrieval
(Section 2), and provide concluding remarks and a roadmap for
future efforts based on our own work (Section 3).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. CHALLENGES IN DIGITAL ARCHIVING</title>
      <p>In the following, we discuss challenges for information
retrieval in the archiving of digital records.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Selection/Appraisal</title>
      <p>When a record reaches the age at which it should be archived, it
must be appraised to decide whether it should be kept for
permanent preservation. This is an essential response to the
unsustainable costs of keeping (storage and conservation or
preservation) and finding (curating and cataloguing or indexing)
everything. In the digital environment, while some aspects
of these costs change substantially, in practice archives
cannot afford to keep everything and while all records are
important by some measure, some are clearly more important
than others; the need to select and appraise remains.</p>
      <p>
        As the volume of digital records to be deposited in archives
around the world increases, archivists will need tools to
enable them to efficiently and effectively determine those
records worthy of permanent preservation. Archivists must
also be able to organise digital records in ways that reflect the
circumstances of their creation, so that they can be reliably
interpreted by historians of the future; in archives, context
is king. The breakdown of administrative practices that
occurred in the transition to the digital environment [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]
means that the traditional reliance on metadata will not
work. Tools that can extract and confer meaningful
structure on large corpora of digital records, based not only on
topic matter but also on the context of creation and
distribution, will be essential. This presents a new set of significant
and interesting challenges for information retrieval,
information extraction, and text classification research, as work on
the George Bush Senior presidential archive illustrates [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Sensitivity Review</title>
      <p>
        The challenge of reviewing digital records for sensitivity
is particularly acute. Closing significant volumes of public
records, as a precaution to prevent a small volume of truly
sensitive records being released, is lawful in some
jurisdictions when justified by the cost of review [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, such
precautionary closure will not be morally, ethically or
politically acceptable in an era of increasingly open government.
It is essential that decisions on the closure of records are
made at the level of the individual document.
      </p>
      <p>
        Review for sensitivity may seem similar to the challenge of
identifying documents that are relevant to a specific legal
matter in litigation (i.e. e-discovery [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). However, in the
case of sensitivity review, while the nature of a sensitivity
can be described (e.g. personal privacy), the specific features
that will render the record sensitive are generally unknown
to the reviewer in advance. This is because such
sensitivities are not only conferred by the content of the record
(the topics and entities) but also by the context of creation
and distribution (who said what to whom in which
circumstances). Finally, sensitivity is not limited to considerations
of personal privacy, but also includes commercial
confidentiality, health &amp; safety of individuals, matters of defence &amp;
national security, and damage to international relations.
      </p>
      <p>
        In our own proof-of-concept work on Project Abacá [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
we have established that, while for many UK government
departments the protection of personal privacy is the most
significant issue by volume of records, other sensitivities often
represent greater overall risks (e.g. damage to international
relations or national security). We have also established that
some of the most challenging aspects of privacy protection
are shared by sensitivity. Of particular interest is the
diffuse nature of both privacy and sensitivity, which means
that apparently innocuous statements combined with open
information or knowledge can result in significant breaches.
In this respect, we believe that sensitivity is a wider
concept that actually encompasses privacy, and hence solutions
to the privacy protection problem that do not address these
complex and subtle aspects will be inadequate – indeed, the
study of sensitivity is essential to developing general
solutions for privacy.
      </p>
      <p>
        The volume, complexity and lack of organisation of
digital records and the risks implicit in an error of judgement
(the risk of precautionary closure or the risk of
inappropriate opening), together with the nuanced nature of sensitivity,
make this field a particularly interesting source of research
questions, as we have begun to explore [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>In considering digital archival practices as a source of
significant research challenges, we have identified a number of
strands from our own work on sensitivity review, which have
parallels in classical IR tasks and research. We draw on this
classical work to inspire the extension of the field to address
sensitivity (and thus privacy) review.</p>
      <p>
        This includes: understanding human judgement of
sensitivity (cf. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]), the identification of features (from the
document or its context, explicit or implicit) that indicate
sensitivity (cf. [
        <xref ref-type="bibr" rid="ref13 ref6">6, 13</xref>
        ]), understanding the relationship between
automation of sensitivity review and technical assistance of
human reviewers in managing the risks of review (cf. [
        <xref ref-type="bibr" rid="ref18 ref3">3, 18</xref>
        ]), and understanding the significance of the order of presentation
of documents in the human sensitivity review task, and thus
in machine assistance (cf. [
        <xref ref-type="bibr" rid="ref14 ref4">4, 14</xref>
        ]).
      </p>
      <p>
        Our own work with UK government departments [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] makes
it clear that a fully automated approach to sensitivity review
is unlikely to be acceptable. There is a clear reluctance on
the part of reviewers to trust technology alone. Nevertheless,
in the UK at least, many recognise the challenges brought
about by the digital age, and the need for new methods and
tools to explicitly manage the increased risks from the open
release of digital records in the era of internet search.
      </p>
      <p>
        Our work in developing our test collection has shown the
value of close observation and study of human reviewers in
beginning to understand the nature of sensitivity. It also
helped us to identify additional document and context
features to classify for sensitivity; the application of a
simple bag-of-words text classification baseline appears
inadequate [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The development of a learned classifier, drawing
on features extracted from a representative test collection,
appears to be a fruitful starting point to develop a decision
support and review prioritisation tool [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
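      <p>As a rough illustration of the review prioritisation idea discussed above, the following Python sketch ranks unreviewed records by a score learned from previously reviewed ones. It is a minimal sketch under stated assumptions, not the classifier developed in our work: the function names (tokenize, train_log_odds, score, prioritise), the example texts and the smoothed log-odds weighting are all hypothetical, and the model is exactly the kind of simple bag-of-words baseline that appears inadequate on its own. A practical tool would add features drawn from the context of creation and distribution.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case whitespace tokenisation (a deliberately naive assumption)."""
    return text.lower().split()


def train_log_odds(labelled: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Learn a weight per term from reviewed records.

    Each weight is the log-odds of the term occurring in sensitive rather
    than non-sensitive text, with add-one smoothing over the vocabulary.
    """
    sensitive: Counter = Counter()
    other: Counter = Counter()
    for text, is_sensitive in labelled:
        (sensitive if is_sensitive else other).update(tokenize(text))
    vocab = set(sensitive) | set(other)
    n_sensitive = sum(sensitive.values()) + len(vocab)
    n_other = sum(other.values()) + len(vocab)
    return {
        term: math.log((sensitive[term] + 1) / n_sensitive)
        - math.log((other[term] + 1) / n_other)
        for term in vocab
    }


def score(text: str, weights: Dict[str, float]) -> float:
    """Sum the weights of the terms in a record; unseen terms contribute 0."""
    return sum(weights.get(token, 0.0) for token in tokenize(text))


def prioritise(
    unreviewed: Dict[str, str], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (record_id, score) pairs, highest score first."""
    ranked = [(rid, score(text, weights)) for rid, text in unreviewed.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical reviewed records: (text, judged sensitive?)
    reviewed = [
        ("informant named in conflict report", True),
        ("meeting agenda for budget planning", False),
        ("informant family address disclosed", True),
        ("budget meeting minutes", False),
    ]
    queue = {
        "r1": "budget agenda circulated",
        "r2": "report names informant",
    }
    for record_id, value in prioritise(queue, train_log_odds(reviewed)):
        print(record_id, round(value, 3))
```
]]></preformat>
      <p>The ranking only orders the review queue; every record is still passed to a human reviewer, in keeping with the reluctance to trust technology alone noted above.</p>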
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <source>Freedom of Information Act</source>
          <year>2000</year>
          (UK).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>Public Records Act</source>
          <year>1958</year>
          , as amended (UK).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          .
          <article-title>The Economics in Interactive Information Retrieval</article-title>
          .
          <source>In Proceedings of SIGIR</source>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Berardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <article-title>A utility-theoretic ranking method for semi-automated text classification</article-title>
          .
          <source>In Proceedings of SIGIR</source>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Bingham</surname>
          </string-name>
          .
          <source>The Rule of Law</source>
          . Penguin, London,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Forman</surname>
          </string-name>
          .
          <article-title>An Extensive Empirical Study of Feature Selection Metrics for Text Classification</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <volume>3</volume>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollins</surname>
          </string-name>
          .
          <article-title>Putting Parsimonious Preservation into Practice</article-title>
          .
          <source>Tech Report, The National Archives</source>
          ,
          <year>2012</year>
          . http://bit.ly/1m4xerK
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollins</surname>
          </string-name>
          .
          <article-title>Towards a Classifier for Digital Sensitivity Review</article-title>
          .
          <source>In Proceedings of ECIR</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moss</surname>
          </string-name>
          .
          <article-title>The Hutton Inquiry, the President of Nigeria and What the Butler Hoped to See</article-title>
          .
          <source>English Historical Review</source>
          ,
          <volume>120</volume>
          (
          <issue>487</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Moss</surname>
          </string-name>
          .
          <article-title>Where Have All the Files Gone? Lost in Action Points Every One?</article-title>
          <source>J. Contemporary History</source>
          ,
          <volume>47</volume>
          (
          <issue>4</issue>
          ),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          Project Abacá.
          <source>Project Website</source>
          . http://projectabaca.wordpress.com/.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          .
          <article-title>Information Retrieval for E-Discovery</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>7</volume>
          (
          <issue>2-3</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>A Study of Relevance Propagation for Web Search</article-title>
          .
          <source>In Proceedings of SIGIR</source>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          .
          <article-title>The Effect of Threshold Priming and Need for Cognition on Relevance Calibration and Assessment</article-title>
          .
          <source>In Proceedings of SIGIR</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Solove</surname>
          </string-name>
          .
          <article-title>Access and Aggregation: Privacy, Public Records, and the Constitution</article-title>
          .
          <source>Minnesota Law Review</source>
          ,
          <volume>86</volume>
          (
          <issue>6</issue>
          ),
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Underwood</surname>
          </string-name>
          .
          <article-title>Speech Acts and Electronic Records</article-title>
          .
          <source>Proceedings of DigCCurr2009 Digital Curation: Practice, Promise and Prospects</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <source>5 U.S. Code § 552 - Public information; agency rules, opinions, orders, records, and proceedings</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          .
          <article-title>Approximate Recall Confidence Intervals</article-title>
          .
          <source>Trans. Inf. Syst.</source>
          ,
          <volume>31</volume>
          (
          <issue>1</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Pickens</surname>
          </string-name>
          .
          <article-title>Assessor Disagreement and Text Classifier Accuracy</article-title>
          .
          <source>In Proceedings of SIGIR</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>