<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Keynote: Evaluation of NLP Tools for Hairy RE Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel M. Berry</string-name>
          <email>dberry@uwaterloo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Waterloo</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Natural language processing (NLP) has been used since the 1980s to construct tools for performing natural language (NL) requirements engineering (RE) tasks. While these NL RE tasks are not inherently difficult for humans, on the scale of the collection of NL artifacts for the development of a typical large-scale computer-based system (CBS), these tasks become unmanageable, i.e., hairy. Because these hairy tasks are difficult for humans to do thoroughly, we try to build tools to assist humans in doing these tasks. The RE field has often adopted information retrieval (IR) algorithms for use in implementing these NL RE tools. Traditionally, the methods for evaluating an NL RE tool have also been inherited from the IR field without considering whether they make sense from the viewpoint of the requirements for NLP RE tools for hairy tasks. That is, even though the main goal of an NLP RE tool is to assist a human in doing a hairy task, which is hard for a human to do completely, many RE tool builders consider precision to be at least as important as recall, if not more important. This talk briefly surveys requirements for NLP tools for hairy NL RE tasks and what they imply about evaluation of the tools. It then describes the data that must be gathered during the evaluation of a tool and how to use them to do the evaluation. The talk walks through several example tools and typical data for them and justifies various different conclusions about these tools on the basis of these data.</p>
      </abstract>
    </article-meta>
  </front>
  <body />
  <back>
    <ref-list />
  </back>
</article>