<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Artificial Intelligence and Accessibility for Administrative Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sara Frug</string-name>
          <email>sara@liicornell.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Bruce</string-name>
          <email>tom@liicornell.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Legal Information Institute, Cornell University</institution>
          ,
          <addr-line>Ithaca NY</addr-line>
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>In this paper, we suggest that accessibility is an emerging, underfulfilled legal requirement that presents not only a potential locus for activity but also an avenue for research. We describe a proof-of-concept use of machine-learning-based image classification as a managerial support tool for accessibility enhancement, and suggest directions for further research. Although this discussion focuses on the government information landscape in the United States, the adoption of the Web Content Accessibility Guidelines in the European Union extends its applicability.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Accessibility • Assistive technologies • People with disabilities</p>
    </sec>
    <sec id="sec-2">
      <title>Regulatory Requirements</title>
      <p>
        In the United States, the 1998 amendments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to the
Rehabilitation Act of 1973 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] explicitly require that federal
electronic and information technology (EIT) be accessible to
people with disabilities. The regulations promulgated under the
1998 amendments required adoption of standards consistent with
(but not identical to) the Web Content Accessibility Guidelines
Version 1.0 Level A. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] In 2017, the regulations were refreshed
to incorporate by reference the Web Content Accessibility
Guidelines Version 2.0. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
      </p>
      <p>
The Web Content Accessibility Guidelines provide both specific
requirements and a general framework for understanding what
makes a document accessible. The acronym “POUR”
(Perceivable, Operable, Understandable, Robust) summarizes
these requirements, the most fundamental of which ensure that
information (e.g., words) not be locked in a medium (e.g., a
picture PDF) that cannot be perceived by a person with a
disability (e.g., blindness). [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
      </p>
    </sec>
    <sec id="sec-3">
      <title>Non-Compliance</title>
      <p>
        In 2008 (ten years after the 1998 amendments), the Digital
Communications Division of the Department of Health and
Human Services (HHS) wrote:
“Section 508 requires that Web sites and associated
content created with federal funding, whether internal or
external, government- or contractor-hosted, are
accessible to persons with disabilities. The law has been
in effect since June 21, 2001. Federal compliance –
including that of HHS -- has lagged.” [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
By that point, the 2.0 version of the Web Content Accessibility
Guidelines was about to be released. HHS’s compliance timetable
put project completion at 2013.
      </p>
      <p>
        In 2018, WCAG 2.0 became the standard for Federal websites.
The safe harbor provision, however, protected legacy content.
“This safe harbor provision applies on an
“element-by-element” basis in that each component or portion of
existing ICT is assessed separately. In specifying
“components or portions” of existing ICT, the safe
harbor provision independently exempts those aspects
of ICT that comply with the existing 508 Standards
from mandatory upgrade or modification after the final
rule takes effect. This means, for example, that if two
paragraphs of text are changed on an agency Web page,
only the altered paragraphs are required to comply with
the Revised 508 Standards; the rest of the Web page can
remain “as is” so long as otherwise compliant with the
existing 508 Standards.” [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
As of this writing, even Section508.gov and 18F’s Accessibility
Guide yielded accessibility errors.
      </p>
      <p>
        Beyond the protection of the safe harbor, government agencies
persist in publishing new, non-accessible content. Most
prominently, on April 18, 2019, the U.S. Department of Justice
released the much-anticipated so-called Mueller Report as an
image-PDF, downloadable from a web page that displayed the
following notice:
“The Department recognizes that these documents may
not yet be in an accessible format. If you have a
disability and the format of any material on the site
interferes with your ability to access some information,
please email the Department of Justice webmaster. To
enable us to respond in a manner that will be of most
help to you, please indicate the nature of the
accessibility problem, your preferred format (electronic
format (ASCII, etc.), standard print, large print, etc.),
the web address of the requested material, and your full
contact information, so we can reach you if questions
arise while fulfilling your request.” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
Although it is the highest-profile example, this is far from the only instance
of new, non-compliant content published on federal agency
websites.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Publication Practices</title>
      <p>The Mueller Report is a good example of a general data
impoverishment phenomenon in government publishing, which
deserves to be the object of attention from all communities that
consume government information. The Mueller Report could not
have been drafted as a set of pictures of words; rather, the
original, machine-readable document had to have been converted
for publication into a set of pictures. This data-impoverishment
process is not unique to this document—it can be observed
throughout the Code of Federal Regulations. Documents that had
to have been authored electronically are converted to pictures for
publication, leaving the data consumers to “unscramble the egg”
and convert them back into machine-readable data formats.</p>
    </sec>
    <sec id="sec-5">
      <title>Artificial Intelligence and Document Accessibility</title>
      <p>
        Although there is promising work, notably from Rohatgi [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Wu
et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and Choi et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], to support extraction of
machine-readable data from images of charts, graphs, and other data
artifacts for researchers and application developers, common
image types have not been addressed systematically.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Pilot Project: Workflows, Experimentation, and Decision Support</title>
      <p>LII has begun a pilot project to establish a data conversion
workflow and support automation efforts for data
de-impoverishment. The approach has four prongs: 1)
manually sort the images, converting figures to SVG and images of equations
to MathML (MML); 2) annotate SVG images with descriptions of their
content; 3) research machine-readable sources for data represented as
pictures; 4) apply machine-learning techniques to provide
decision support for human annotation and conversion.
The pilot project involved collaboration among a specialist in
graphics conversion, law and computer science students, and LII’s
text specialist. The graphics conversion specialist analyzed 14,486
images from the Code of Federal Regulations and sorted them into
categories, such as math (6255), diagrams (1410), data tables
(1238), maps (3194), forms (1892), labels (351), and logos (77);
some outlier categories, such as photographs, were discovered in
the process. Images transformed prior to this project (1149) were
sorted into math (241) and non-math (908) and set aside for
testing. The images were grouped according to which areas of the
CFR they appeared in and prioritized according to how much web
traffic each containing document (section or appendix) received
on the LII website. As of this writing, the graphics conversion
specialist has converted 2913 math elements to MML and 1005
diagrams to SVG format. Also as of this writing, law students
have located alternate sources for 2706 images, most notably over
90 images of pages from the 1991 Standards for Accessible
Design as Originally Published on July 26, 1991. The data that
has been gathered and generated in this process will be reusable
for other such endeavors.</p>
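      <p>The grouping and prioritization step described above can be sketched as follows. This is a minimal illustration rather than LII's actual tooling: the function name <monospace>build_queue</monospace>, the record layout, and the sample identifiers and traffic figures are all hypothetical.</p>
      <preformat>
```python
from collections import defaultdict

def build_queue(images, traffic):
    """Group images by containing document, then order groups by web traffic.

    images:  iterable of (image_id, document_id) pairs (hypothetical layout)
    traffic: dict mapping document_id to page views (hypothetical metric)
    """
    by_document = defaultdict(list)
    for image_id, document_id in images:
        by_document[document_id].append(image_id)
    # Highest-traffic documents first; documents without traffic data sort last.
    ordered = sorted(by_document, key=lambda doc: traffic.get(doc, 0), reverse=True)
    return [(doc, by_document[doc]) for doc in ordered]

# Illustrative data only; these identifiers and counts are made up.
images = [("img1", "doc-A"), ("img2", "doc-B"), ("img3", "doc-A")]
traffic = {"doc-A": 1200, "doc-B": 5400}
queue = build_queue(images, traffic)
```
      </preformat>
      <p>Keeping all images from one document together in the queue reflects the co-location criterion in the text; a different workflow might instead rank individual images.</p>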
      <p>In the process of planning our accessibility project, LII discovered
the following problems. First, manual annotation of images has
proceeded quite slowly compared with other tasks. As of this
writing, fewer than 100 image annotations have been completed.
Second, math conversion is much faster than SVG conversion.
Third, sorting for the purposes of identifying good candidates for
SVG conversion produces a different categorization than sorting
for purposes of distinguishing similar content.</p>
      <p>Because LII wished to deploy newly accessible content as quickly
as possible, we focused on techniques that would enable us to
prepopulate a queue with mathematical content, which is
easy both to classify and to convert. At the same time, the
classification process provides additional clues to aid in re-sorting
non-mathematical images for further treatment. Using Keras and
OpenCV, we trained a classifier on the eCFR images for the
purpose of identifying math. Initial results yielded precision 0.86
and recall 0.88. In practical terms, this approach immediately
identified 215 out of 243 math images for conversion and
incorrectly identified only 35 out of 875 non-math images as math. This
enables us to speed deployment by prepopulating a work queue
through automation.</p>
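      <p>The reported precision and recall follow directly from the counts given above. The short check below is only a worked calculation; the function name <monospace>precision_recall</monospace> is our own and is not part of the pilot code.</p>
      <preformat>
```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw confusion counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Counts taken from the text: 215 of 243 math images were found,
# and 35 of 875 non-math images were wrongly flagged as math.
tp = 215
fp = 35
fn = 243 - 215
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```
      </preformat>
      <p>Precision is 215 / (215 + 35) and recall is 215 / 243, which round to the 0.86 and 0.88 reported above.</p>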
    </sec>
    <sec id="sec-8">
      <title>Future Work</title>
      <p>The initial proof-of-concept effort simplified the task to address
identification of mathematical images and non-mathematical
images. This pre-sorting is adequate for cost estimation purposes
and makes it feasible to generate machine-readable data before
comprehensive sorting is complete.</p>
      <p>Because conversion projects frequently include tabular data,
forms, and textual images, training the model using additional
categories would be quite valuable. Because images may contain
mixed content, feature identification and multi-label classification
are natural areas for further work.</p>
      <p>The initial proof-of-concept effort deliberately eschewed image
preprocessing. Characteristics of the images suggest techniques
for producing more robust and comprehensive models. For
example, basic case-insensitive extraction detected image labels,
namely variants of the terms “figure” (1395), “illustration” (19), “plate”
(240), or “legend” (410), in approximately 14% of the
training-set images. Because the choice to annotate within the image rather
than within the text surrounding the image should be arbitrary,
and because images classified as equations almost never have a
legend, it seems worthwhile to purge the image legend before
training.</p>
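      <p>A case-insensitive label search of this kind might look like the sketch below. It assumes that text has already been extracted from each image (for example by OCR), which the paper does not describe; the pattern, the variants it accepts, and the names <monospace>LABEL_PATTERN</monospace> and <monospace>find_labels</monospace> are hypothetical.</p>
      <preformat>
```python
import re
from collections import Counter

# Hypothetical pattern: matches the four label terms and simple variants
# such as "Fig.", "FIGURE", or "Plates". The exact variants the authors
# matched are not stated in the text.
LABEL_PATTERN = re.compile(
    r"\b(fig(?:ure)?|illustration|plate|legend)s?\b", re.IGNORECASE
)

# Map abbreviations onto one canonical term for counting.
CANONICAL = {"fig": "figure"}

def find_labels(text):
    """Return a Counter of canonical label terms found in extracted text."""
    terms = (match.group(1).lower() for match in LABEL_PATTERN.finditer(text))
    return Counter(CANONICAL.get(term, term) for term in terms)

sample = "FIGURE 1 to Appendix A. Legend: see Plate II and fig. 3"
counts = find_labels(sample)
```
      </preformat>
      <p>The character offsets of each match could also be recorded so that the corresponding label region can be masked before training.</p>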
      <p>Finally, LII has not yet taken advantage of metadata
external to the images themselves. Because the images in question
are embedded within documents that are published on the web,
several additional variables could be made available to the model.
The training data could include the catchline for the section or
appendix within which the image appears; the full structural
location of that document; the text, if any, immediately preceding
or following the image; terms assigned to the containing
document from an unsupervised topic model; terms assigned to
the containing Part by the Office of the Federal Register; even
variables such as co-location within a single document or volume
of web-traffic to the containing document could prove relevant to
image type and could be worth testing.</p>
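      <p>One way to carry these external variables alongside each image is a simple record type, sketched below. Every field name is an assumption made for illustration; the paper lists the kinds of metadata but does not define a schema.</p>
      <preformat>
```python
from dataclasses import dataclass, field, asdict

@dataclass
class ImageContext:
    """External metadata for one image; every field name is hypothetical."""
    image_id: str
    catchline: str            # heading of the containing section or appendix
    structural_path: tuple    # full structural location of the document
    text_before: str = ""     # text immediately preceding the image
    text_after: str = ""      # text immediately following the image
    topic_terms: list = field(default_factory=list)  # from a topic model
    part_terms: list = field(default_factory=list)   # assigned to the Part
    sibling_count: int = 0    # other images in the same document
    page_views: int = 0       # web traffic to the containing document

# Illustrative values only.
example = ImageContext(
    image_id="img-0001",
    catchline="Example appendix heading",
    structural_path=("Title N", "Part M", "Appendix A"),
    text_before="See the following figure.",
    sibling_count=12,
)
row = asdict(example)  # plain dict, ready to merge with image features
```
      </preformat>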
    </sec>
    <sec id="sec-9">
      <title>Caveats and Conclusions</title>
      <p>As mentioned earlier, in the pilot study, the greatest impediment
to training a model proved to be some subtle and some
not-so-subtle differences between the type of classification needed to
support professional workflow and the type of classification that
would support automated extraction. Because our preferences for
populating the queue in this instance were determined by the
volume of traffic and co-location of images within a section,
several types of content were not distinguished in the initial
sorting. For example, where multi-page forms appeared, images
containing entirely textual content (such as full pages of
instructions) were not distinguished from the form pages for
which they provided guidance. Other images, such as tables,
typically contained three sections: a caption, the data table, and a
set of footnotes. In order to produce useful decision-support tools,
training data would best be annotated granularly, identifying
features within each image.</p>
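      <p>Granular annotation of this kind could be represented as labeled regions within each image. The sketch below is one possible layout, not the format used in the pilot; the class <monospace>Region</monospace>, the helper <monospace>labels_present</monospace>, and all coordinates are invented for illustration.</p>
      <preformat>
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    """One labeled rectangle inside an image, in pixel coordinates."""
    label: str
    left: int
    top: int
    width: int
    height: int

def labels_present(regions):
    """Return the sorted distinct labels, usable as a multi-label target."""
    return sorted({region.label for region in regions})

# A table image split into the three parts named in the text.
# Coordinates are invented for illustration.
table_image = [
    Region("caption", 0, 0, 800, 60),
    Region("data_table", 0, 60, 800, 500),
    Region("footnotes", 0, 560, 800, 120),
]
targets = labels_present(table_image)
```
      </preformat>
      <p>The same records serve both purposes discussed above: the label set supports workflow sorting, while the rectangles support extraction of each part.</p>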
      <p>
        Law-and-AI researchers who work on public administration
should be aware that the Access Board estimated day-forward
web-accessibility compliance resources for the federal
government at 5% of web development, software development,
and audio-visual production costs, plus an additional 1.25% for
evaluation. Should comprehensive conformance become a
requirement, the costs will increase accordingly. The Office for
Civil Rights of the U.S. Department of Education has, of late,
included web accessibility in its enforcement of Section 504 of the
Rehabilitation Act, which requires comprehensive equal access to
educational services for recipients of federal funding; this means
that, as a rule, universities are scrambling to bring their websites
into conformance with WCAG 2.0 level AA. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] Finally, the
number of ADA lawsuits treating websites as public
accommodations has increased dramatically during the past few
years, and a public accommodations case is currently pending
before the United States Supreme Court. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] Reducing data
impoverishment in the publication process should limit the need
for such work to addressing the challenge of converting
non-born-digital images. The combination of labor required and urgency of
need makes AI-enhanced automation a timely and valuable
avenue for research. In addition, an increased focus on document
accessibility can create a virtuous circle in which artificial
intelligence applications will both help create, and benefit from,
the availability of more machine-readable data.
      </p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>Our thanks to the LII development team, Sylvia Kwakye, Nic
Ceynowa, Ayham Boucher, and Jim Phillips; to students Mason
Roth, Evelyn Hudson, Charu Murugesan and Jiali Liu; to Point.B
Studios and Public.Resource.Org; and to Justia, Inc.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Directive (EU) 2016/2102 of the European Parliament and of the Council of 26 October 2016 on the accessibility of the websites and mobile applications of public sector bodies</article-title>
          . ELI: http://data.europa.eu/eli/dir/2016/2102/oj.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>Electronic and information technology</source>
          . 29 U.S.C. § 794d. Retrieved from https://www.law.cornell.edu/uscode/text/29/794d.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          Pub. L. 93-112, 87 Stat. 355 (enacted September 26,
          <year>1973</year>
          ), codified as 29 U.S.C. § 701 et seq. https://www.govinfo.gov/content/pkg/STATUTE-87/pdf/STATUTE-87-Pg355.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          Architectural and Transportation Barriers Compliance Board.
          <source>Electronic and Information Technology Accessibility Standards</source>
          .
          <year>2000</year>
          . https://www.federalregister.gov/documents/2000/12/21/00-32017/electronic-and-information-technology-accessibility-standards.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          Architectural and Transportation Barriers Compliance Board.
          <source>Information and Communication Technology (ICT) Standards and Guidelines (Final Rule)</source>
          .
          <year>2017</year>
          . 82 FR 5790. https://www.federalregister.gov/documents/2017/01/18/2017-00395/information-and-communication-technology-ict-standards-and-guidelines.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          W3C.
          <source>Web Content Accessibility Guidelines (WCAG) 2.0</source>
          .
          <year>2008</year>
          . https://www.w3.org/TR/WCAG20/#intro-layers-guidance.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          United States Department of Health and Human Services.
          <source>508 Web Compliance and Remediation Framework</source>
          .
          <year>2008</year>
          . Retrieved by the Internet Archive on 2/6/2018. https://web.archive.org/web/20180206161308/https://www.hhs.gov/web/section-508/compliance-and-remediation/framework/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          Special Counsel's Office.
          <source>Report on the Investigation into Russian Interference in the 2016 Presidential Election</source>
          .
          <year>2019</year>
          . https://www.justice.gov/storage/report.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ankit</given-names>
            <surname>Rohatgi</surname>
          </string-name>
          .
          <source>WebPlotDigitizer. Version 4.2</source>
          .
          <year>2019</year>
          . https://automeris.io/WebPlotDigitizer.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Shaomei</given-names>
            <surname>Wu</surname>
          </string-name>
          , Jeffrey Wieland, Omid Farivar, and
          <string-name>
            <given-names>Julie</given-names>
            <surname>Schiller</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Automatic Alt-text: Computer-generated Image Descriptions for Blind Users on a Social Network Service</article-title>
          .
          <source>In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW '17)</source>
          . ACM, New York, NY, USA,
          <fpage>1180</fpage>
          -
          <lpage>1192</lpage>
          . DOI: https://doi.org/10.1145/2998181.2998364.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.G.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Elmqvist</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Visualizing for the Non-Visual: Enabling the Visually Impaired to Use Visualization</article-title>
          .
          <source>Eurographics Conference on Visualization (EuroVis)</source>
          <year>2019</year>
          , Computer Graphics Forum, Vol.
          <volume>38</volume>
          , No. 3. http://users.umiacs.umd.edu/~elm/projects/vis4nonvisual/vis4nonvisual.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Lindsay</given-names>
            <surname>McKenzie</surname>
          </string-name>
          .
          <article-title>Feds Prod Universities to Address Website Accessibility Complaints</article-title>
          . 11/16/2018.
          <source>Inside Higher Education</source>
          . https://www.insidehighered.com/news/2018/11/06/universities-still-struggle-make-websites-accessible-all.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Lindsay</given-names>
            <surname>McKenzie</surname>
          </string-name>
          .
          <article-title>50 Colleges Hit With ADA Lawsuits</article-title>
          . 12/10/2018. https://www.insidehighered.com/news/2018/12/10/fifty-colleges-sued-barrage-ada-lawsuits-over-web-accessibility.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>