AIAS’19, June, 2019, Montreal, QC Canada

Artificial Intelligence and Accessibility for Administrative Applications

Sara Frug
Legal Information Institute
Cornell University
Ithaca NY United States
sara@liicornell.org

Thomas Bruce
Legal Information Institute
Cornell University
Ithaca NY United States
tom@liicornell.org

ABSTRACT
In this paper, we suggest that accessibility is an emerging, underfulfilled legal requirement that presents not only a potential locus for activity but also an avenue for research. We describe a proof-of-concept use of machine-learning-based image classification as a managerial support tool for accessibility enhancement, and suggest directions for further research. Although this discussion focuses on the government information landscape in the United States, the adoption of the Web Content Accessibility Guidelines in the European Union extends its applicability.

CCS CONCEPTS
• Accessibility • Assistive technologies • People with disabilities

KEYWORDS
Accessibility, Artificial Intelligence, Regulations

ACM Reference format:
In: Proceedings of the First Workshop on AI in the Administrative State, June 17, 2019, Montreal, QC, Canada.
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Published at http://ceur-ws.org

1 Information Accessibility and Government Administration

The availability of government information is well accepted as a requirement for efficient public administration. Machine-readability of administrative information, although frequently acknowledged as a goal, is often neglected. As a basis for accessibility for the disabled, it receives even less attention. This discussion focuses on web accessibility, although it views web accessibility as a consequence of document accessibility. Although this discussion focuses on the United States, the adoption of the Web Content Accessibility Guidelines in the European Union [1] extends its applicability.

1.1 Regulatory Requirements

In the United States, the 1998 amendments [2] to the Rehabilitation Act of 1973 [3] explicitly require that federal electronic and information technology (EIT) be accessible to people with disabilities. The regulations promulgated under the 1998 amendments required adoption of standards consistent with (but not identical to) the Web Content Accessibility Guidelines Version 1.0 Level A. [4] In 2017, the regulations were refreshed to incorporate by reference the Web Content Accessibility Guidelines Version 2.0. [5]

1.2 Document Accessibility and Web Content Accessibility Guidelines (WCAG)

The Web Content Accessibility Guidelines provide both specific requirements and a general framework for understanding what makes a document accessible. The acronym “POUR” (Perceivable, Operable, Understandable, Robust) summarizes these requirements, the most fundamental of which ensure that information (e.g., words) not be locked in a medium (e.g., a picture PDF) that cannot be perceived by a person with a disability (e.g., blindness). [6]

1.3 Non-Compliance

In 2008 (ten years after the 1998 amendments), the Digital Communications Division of the Department of Health and Human Services (HHS) wrote:

“Section 508 requires that Web sites and associated content created with federal funding, whether internal or external, government- or contractor-hosted, are accessible to persons with disabilities. The law has been in effect since June 21, 2001. Federal compliance – including that of HHS -- has lagged.” [7]

By that point, the 2.0 version of the Web Content Accessibility Guidelines was about to be released. HHS’s compliance timetable put project completion at 2013.

In 2018, WCAG 2.0 became the standard for Federal websites. The safe harbor provision, however, protected legacy content.

“This safe harbor provision applies on an ‘element-by-element’ basis in that each component or portion of existing ICT is assessed separately. In specifying ‘components or portions’ of existing ICT, the safe harbor provision independently exempts those aspects of ICT that comply with the existing 508 Standards from mandatory upgrade or modification after the final rule takes effect. This means, for example, that if two paragraphs of text are changed on an agency Web page, only the altered paragraphs are required to comply with the Revised 508 Standards; the rest of the Web page can remain ‘as is’ so long as otherwise compliant with the existing 508 Standards.” [5]

As of this writing, even Section508.gov and 18F’s Accessibility Guide yield accessibility errors.

Beyond the protection of the safe harbor, government agencies persist in publishing new, non-accessible content. Most prominently, on April 18, 2019, the U.S. Department of Justice released the much-anticipated so-called Mueller Report as an image-PDF, downloadable from a web page that displayed the following notice:

“The Department recognizes that these documents may not yet be in an accessible format. If you have a disability and the format of any material on the site interferes with your ability to access some information, please email the Department of Justice webmaster. To enable us to respond in a manner that will be of most help to you, please indicate the nature of the accessibility problem, your preferred format (electronic format (ASCII, etc.), standard print, large print, etc.), the web address of the requested material, and your full contact information, so we can reach you if questions arise while fulfilling your request.” [8]

Although it is the most high-profile, this is far from the only example of new, non-compliant content published on federal agency websites.

1.4 Publication Practices

The Mueller Report is a good example of a general data impoverishment phenomenon in government publishing, which deserves attention from all communities that consume government information. The Mueller Report could not have been drafted as a set of pictures of words; rather, the original, machine-readable document had to have been converted for publication into a set of pictures. This data-impoverishment process is not unique to this document; it can be observed throughout the Code of Federal Regulations. Documents that had to have been authored electronically are converted to pictures for publication, leaving the data consumers to “unscramble the egg” and convert them back into machine-readable data formats.

2 Artificial Intelligence and Document Accessibility

Although there is promising work, notably from Rohatgi [9], Wu et al. [10], and Choi et al. [11], to support extraction of machine-readable data from images of charts, graphs, and other data artifacts, for researchers and application developers, common image types have not been addressed systematically.

2.1 Pilot Project: Workflows, Experimentation, and Decision Support

LII has begun a pilot project to establish a data conversion workflow and support automation efforts for data-de-impoverishment. The approach has been four-pronged: 1) manually sort and convert figures to SVG and images of equations to MML; 2) annotate SVG images with descriptions of their content; 3) research machine-readable sources for data represented as pictures; 4) apply machine-learning techniques to provide decision support for human annotation and conversion.

The pilot project involved collaboration from a specialist in graphics conversion, law and computer science students, and LII’s text specialist. The graphics conversion specialist analyzed 14,486 images from the Code of Federal Regulations and sorted them into categories, such as math (6255), diagrams (1410), data tables (1238), maps (3194), forms (1892), labels (351) and logos (77) (some outlier categories, such as photographs, were discovered in the process). Images transformed prior to this project (1149) were sorted into math (241) and non-math (908) and set aside for testing. The images were grouped according to which areas of the CFR they appeared in and prioritized according to how much web traffic each containing document (section or appendix) received on the LII website. As of this writing, the graphics conversion specialist has converted 2913 math elements to MML and 1005 diagrams to SVG format. Also as of this writing, law students have located alternate sources for 2706 images, most notably over 90 images of pages from the 1991 Standards for Accessible Design as Originally Published on July 26, 1991. The data that has been gathered and generated in this process will be reusable for other such endeavors.

In the process of planning our accessibility project, LII discovered the following problems. First, manual annotation of images has proceeded quite slowly compared with other tasks. As of this writing, fewer than 100 image annotations have been completed. Second, math conversion is much faster than SVG conversion. Third, sorting for the purposes of identifying good candidates for SVG conversion produces a different categorization than sorting for purposes of distinguishing similar content.

Because LII wished to deploy newly accessible content as quickly as possible, we focused on techniques that would let us prepopulate a queue with mathematical content, which is easy both to classify and to convert. At the same time, the classification process provides additional clues to aid in re-sorting non-mathematical images for further treatment. Using Keras and OpenCV, we trained a classifier on the eCFR images for the purpose of identifying math. Initial results yielded precision 0.86 and recall 0.88. In practical terms, this approach immediately identified 215 out of 243 math images for conversion and incorrectly identified only 35 out of 875 non-math images. This enables us to speed deployment by prepopulating a work queue through automation.

2.2 Future Work

The initial proof-of-concept effort simplified the task to address identification of mathematical images and non-mathematical images. This pre-sorting is adequate for cost estimation purposes and makes it feasible to generate machine-readable data before comprehensive sorting is complete.

Because conversion projects frequently include tabular data, forms, and textual images, training the model using additional categories would be quite valuable. Because images may contain mixed content, feature identification and multi-label classification are natural areas for further work.

The initial proof-of-concept effort deliberately eschewed image preprocessing. Characteristics of the images suggest techniques for producing more robust and comprehensive models. For example, basic case-insensitive extraction detected image labels, that is, variants of the terms “figure” (1395), “illustration” (19), “plate” (240), or “legend” (410), in approximately 14% of the training-set images. Because the choice to annotate within the image rather than within the text surrounding the image should be arbitrary, and because images classified as equations almost never have a legend, it seems worthwhile to purge the image legend before training.

Finally, LII has not yet taken advantage of metadata external to the images themselves. Because the images in question are embedded within documents that are published on the web, several additional variables could be made available to the model. The training data could include the catchline for the section or appendix within which the image appears; the full structural location of that document; the text, if any, immediately preceding or following the image; terms assigned to the containing document from an unsupervised topic model; and terms assigned to the containing Part by the Office of the Federal Register. Even variables such as co-location within a single document or volume of web traffic to the containing document could prove relevant to image type and could be worth testing.

3 Caveats and Conclusions

As mentioned earlier, in the pilot study, the greatest impediment to training a model proved to be some subtle and some not-so-subtle differences between the type of classification needed to support professional workflow and the type of classification that would support automated extraction. Because our preferences for populating the queue in this instance were determined by the volume of traffic and co-location of images within a section, several types of content were not distinguished in the initial sorting. For example, where multi-page forms appeared, images containing entirely textual content (such as full pages of instructions) were not distinguished from the form pages for which they provided guidance. Other images, such as tables, typically contained three sections: a caption, the data table, and a set of footnotes. In order to produce useful decision-support tools, training data would best be annotated granularly, identifying features within each image.

Law-and-AI researchers who work on public administration should be aware that the Access Board estimated day-forward web-accessibility compliance resources for the federal government at 5% of web development, software development, and audio-visual production costs, plus an additional 1.25% for evaluation. Should comprehensive conformance become a requirement, the costs will increase accordingly. The Office for Civil Rights of the U.S. Department of Education has, of late, included web accessibility in its enforcement of Section 504 of the Rehabilitation Act, which requires comprehensive equal access to educational services for recipients of federal funding; this means that, as a rule, universities are scrambling to bring their websites into conformance with WCAG 2.0 level AA. [12] Finally, the number of ADA lawsuits treating websites as public accommodations has increased dramatically during the past few years, and a public accommodations case is currently pending before the United States Supreme Court. [13] Reducing data impoverishment in the publication process should limit the need for such work to addressing the challenge of converting non-born-digital images. The combination of labor required and urgency of need makes AI-enhanced automation a timely and valuable avenue for research. Finally, an increased focus on document accessibility can create a virtuous circle in which artificial intelligence applications will both help create, and benefit from, the availability of more machine-readable data.

ACKNOWLEDGMENTS
Our thanks to the LII development team, Sylvia Kwakye, Nic Ceynowa, Ayham Boucher, and Jim Phillips; to students Mason Roth, Evelyn Hudson, Charu Murugesan and Jiali Liu; to Point.B Studios and Public.Resource.Org; and to Justia, Inc.

REFERENCES
[1] Directive (EU) 2016/2102 of the European Parliament and of the Council of 26 October 2016 on the accessibility of the websites and mobile applications of public sector bodies. ELI: http://data.europa.eu/eli/dir/2016/2102/oj.
[2] Electronic and information technology. 29 U.S.C. § 794d. Retrieved from https://www.law.cornell.edu/uscode/text/29/794d.
[3] Pub.L. 93–112, 87 Stat. 355 (enacted September 26, 1973), codified as 29 U.S.C. § 701 et seq. https://www.govinfo.gov/content/pkg/STATUTE-87/pdf/STATUTE-87-Pg355.pdf.
[4] Architectural and Transportation Barriers Compliance Board. Electronic and Information Technology Accessibility Standards. 2000. https://www.federalregister.gov/documents/2000/12/21/00-32017/electronic-and-information-technology-accessibility-standards.
[5] Architectural and Transportation Barriers Compliance Board. Information and Communication Technology (ICT) Standards and Guidelines. (Final Rule). 2017. 82 FR 5790. https://www.federalregister.gov/documents/2017/01/18/2017-00395/information-and-communication-technology-ict-standards-and-guidelines.
[6] W3C. Web Content Accessibility Guidelines (WCAG) 2.0. 2008. https://www.w3.org/TR/WCAG20/#intro-layers-guidance.
[7] United States Department of Health and Human Services. 508 Web Compliance and Remediation Framework. 2008. Retrieved by the Internet Archive on 2/6/2018. https://web.archive.org/web/20180206161308/https://www.hhs.gov/web/section-508/compliance-and-remediation/framework/index.html.
[8] Special Counsel’s Office. Report on the Investigation into Russian Interference in the 2016 Presidential Election. 2019. https://www.justice.gov/storage/report.pdf.
[9] Ankit Rohatgi. WebPlotDigitizer. Version 4.2. 2019. https://automeris.io/WebPlotDigitizer.
[10] Shaomei Wu, Jeffrey Wieland, Omid Farivar, and Julie Schiller. 2017. Automatic Alt-text: Computer-generated Image Descriptions for Blind Users on a Social Network Service. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW '17). ACM, New York, NY, USA, 1180-1192. DOI: https://doi.org/10.1145/2998181.2998364.
[11] J. Choi, S. Jung, D.G. Park, J. Choo, and N. Elmqvist. 2019. Visualizing for the Non-Visual: Enabling the Visually Impaired to Use Visualization. Eurographics Conference on Visualization (EuroVis) 2019, Computer Graphics Forum, Vol. 38, No. 3. http://users.umiacs.umd.edu/~elm/projects/vis4nonvisual/vis4nonvisual.pdf.
[12] Lindsay McKenzie, Feds Prod Universities to Address Website Accessibility Complaints. 11/16/2018. Inside Higher Education. https://www.insidehighered.com/news/2018/11/06/universities-still-struggle-make-websites-accessible-all.
[13] Lindsay McKenzie, 50 Colleges Hit With ADA Lawsuits. 12/10/2018. https://www.insidehighered.com/news/2018/12/10/fifty-colleges-sued-barrage-ada-lawsuits-over-web-accessibility.
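
Worked check of the classifier figures in Section 2.1. The precision and recall reported there follow from the counts given in the same paragraph. The sketch below reproduces that arithmetic, assuming that the 35 misidentified non-math images are the only false positives and that the math images not identified (243 minus 215) are the only false negatives. The function and variable names are illustrative and are not taken from the pilot project's code.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw classification counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Counts reported in Section 2.1 for the math / non-math test images.
math_total = 243          # math images in the evaluation set
math_identified = 215     # math images correctly flagged (true positives)
non_math_total = 875      # non-math images in the evaluation set
non_math_flagged = 35     # non-math images wrongly flagged (false positives)

# Assumption: every math image that was not flagged counts as a false negative.
math_missed = math_total - math_identified

precision, recall = precision_recall(math_identified, non_math_flagged, math_missed)
false_positive_rate = non_math_flagged / non_math_total

print(f"precision = {precision:.2f}")                      # precision = 0.86
print(f"recall = {recall:.2f}")                            # recall = 0.88
print(f"false positive rate = {false_positive_rate:.2f}")  # false positive rate = 0.04
```

Read this way, a work queue prepopulated by the classifier would contain 250 images, of which 215 are genuine math images; the 28 math images that were missed would still have to be found by manual sorting.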