<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Specification For CS Education Dataset Documentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samiha Marwan</string-name>
          <email>samarwan@ncsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Austin C. Bart</string-name>
          <email>acbart@udel.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas W. Price</string-name>
          <email>twprice@ncsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>North Carolina State University</institution>
          ,
          <addr-line>NC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Delaware</institution>
          ,
          <addr-line>DE</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sharing datasets has many benefits, such as enabling study replication and supporting secondary analysis. However, many publicly available datasets in computing education lack comprehensive documentation or omit key contextual information. For example, missing classroom context (such as classroom demographics) or missing descriptions of the instructional interventions used make datasets difficult to interpret and therefore limit the usefulness of shared data. While there is work on standardizing data formats for computing education research (e.g., the ProgSnap2 format), there is no standard for describing the contextual metadata that is vital for drawing high-quality scientific insights. In this paper, we propose a documentation practice to support CS Ed researchers in creating comprehensive, consistent dataset documentation. We also call on researchers to collaborate in adopting the proposed documentation approach to allow convenient and reliable use of educational datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
Gebru et al. proposed a ‘datasheet’ design that explains the dataset motivation, usage, and collection process [4]. Pushkarna
et al. presented ‘Data Cards’, a structured summary of essential information about ML
datasets that cannot be inferred directly from the data (such as data training methods and intended use
cases) [5]. While the ‘datasheet’ and ‘Data Cards’ ideas are relevant to our proposed work, they do
not align with the specific documentation needs of CS Ed datasets (such as descriptions of instructional
strategies and classroom context). In this preliminary work, we propose a documentation design
embedded within a single, guided file that prompts researchers with targeted questions to describe
key elements of their dataset (e.g., classroom setting, assessment structure, table summaries). Such
a documentation design can reduce cognitive load by eliminating the need to decide where or how to
begin the documentation process.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>
        Due to space limitations, we provide a template for our proposed documentation design in Appendix A,
which future users can use. This design has been revised by several CS Ed researchers. For each element
in the documentation design, we provide a question designed to illustrate the information needed (discussed
in detail below). We used the following design principles to create the proposed documentation design:
(1) Balance Effort with Completeness: The proposed design guides users to include the most critical
elements of documentation, while allowing for more comprehensive input when time and resources
permit. (2) Ease of Creation without Redundancy: The documentation design provides one place to
document any useful resources, either by adding links or by uploading files, without requiring
duplicate content. (3) Guidance for Newcomers: The proposed design is usable by those unfamiliar
with dataset documentation norms, yet flexible and useful for experienced users.
      </p>
      <p>The proposed documentation workflow is implemented as a structured set of questions (see
Appendix A). Researchers complete it by simply answering each question, resulting in a
standardized file that serves as the documentation for their dataset. The documentation design begins with a
‘Documentation Objective’, which acts as a motivating introduction highlighting the value of clear
and complete documentation. This is followed by a section on ‘Author Contact Information’, which
is essential when future users have questions or need access to other resources. The documentation is
then organized into four categories of dataset information. The ‘Dataset Overview’ section collects
general information about the dataset, such as its name and relevant published papers.
The ‘Dataset Log Data Attributes’ section collects details about the dataset structure and content,
such as file formats, tables, and their attributes. The ‘Dataset Contextual Questions’ section collects
metadata, that is, any information about the context of the classroom or lab study, such
as the programming environments used, participants’ demographics, or any relevant contextual variables
that help in interpreting the data accurately. The ‘Dataset Resources’ section documents any instructional
or assessment materials that would help in understanding evidence resulting from secondary
data analysis, for example, the problems used, the assessment questions,
assessment policies, and final exams. These resources can be added directly to the documentation or
provided as links. If any of the above information cannot be shared publicly,
the ‘Author Contact Information’ section enables users to request access to this data.</p>
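      <p>The section structure described above can be sketched as a small data structure that renders a guided file. This is only a minimal illustration under stated assumptions: the names Section, TEMPLATE, and render are hypothetical and not part of the proposed specification, each section lists only a few abbreviated questions, and the dataset name used in the example is invented.</p>
      <preformat>
```python
from dataclasses import dataclass, field

# Hypothetical names throughout: Section, TEMPLATE, and render are
# illustrative only and are not defined by the proposed specification.


@dataclass
class Section:
    title: str
    questions: list[str] = field(default_factory=list)


# Section titles follow the template in Appendix A; the questions are
# abbreviated and incomplete to keep the sketch short.
TEMPLATE = [
    Section("Author Contact Information", ["Name", "Email address", "Backup Contact"]),
    Section("Dataset Overview", ["Dataset Name", "Overview Description", "Relevant Published Papers"]),
    Section("Dataset Log Data Attributes", ["Dataset Format", "Dataset Tables", "Dataset Properties"]),
    Section("Dataset Contextual Questions", ["Classroom/Lab Settings", "Participants' Demographics"]),
    Section("Dataset Resources", ["Programming Problems", "Help Resources", "De-Identified Syllabus"]),
]


def render(template, answers):
    """Render the guided file; unanswered questions default to 'N/A'."""
    lines = []
    for section in template:
        lines.append(section.title)
        for number, question in enumerate(section.questions, start=1):
            answer = answers.get((section.title, question), "N/A")
            lines.append(f"{number}. {question}: {answer}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


# "Example CS1 logs" is an invented dataset name used only for illustration.
doc = render(TEMPLATE, {("Dataset Overview", "Dataset Name"): "Example CS1 logs"})
print(doc)
```
      </preformat>
      <p>Keeping every question in one structure mirrors the single guided file: a researcher supplies answers only, and anything left unanswered is rendered as ‘N/A’, as the template instructs.</p>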
      <sec id="sec-2-1">
        <p>In our future work, we plan to release example documentation for two datasets and to implement
this documentation design so that some fields, such as table names or column names, can be autocompleted
by extracting them automatically from the uploaded data. Additionally, we call on the CER community
to incentivize dataset documentation, encouraging dataset owners to
document their data, which is crucial for enabling reuse and secondary analysis of their datasets.</p>
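        <p>One way such autocompletion might work is sketched below, assuming the uploaded data is a CSV file with a header row. The function name prefill_table_entry, the table name, and the sample rows are hypothetical; the dictionary keys mirror the ‘Dataset Tables’ questions in the template, and descriptions are left as ‘[answer]’ for the dataset owner to write.</p>
        <preformat>
```python
import csv
import io


def prefill_table_entry(table_name, csv_text):
    """Build a partially completed 'Dataset Tables' entry from a CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an entry with no columns
    return {
        "Name": table_name,
        "Description": "[answer]",  # still written by the dataset owner
        "No. of Columns": len(header),
        "Columns": {column: "[answer]" for column in header},
    }


# Hypothetical one-row extract of an event table, used only for illustration.
sample = "EventID,SubjectID,EventType\n1,S01,Run.Program\n"
entry = prefill_table_entry("MainTable", sample)
print(entry["No. of Columns"], list(entry["Columns"]))
```
        </preformat>
        <p>Only the structural fields are filled in automatically; the meaning of each column cannot be recovered from the header alone and still has to be described by hand.</p>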
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This material is based upon work supported by the National Science Foundation under grant #2213792.</p>
    </sec>
    <sec id="sec-4">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Grammar and spelling check, and Paraphrase. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-5">
      <title>Appendix A</title>
      <sec id="sec-5-1">
        <title>A.1. Dataset Documentation Objective</title>
        <p>There are multiple benefits to documenting data. For example, thoughtful dataset communication can facilitate and enable data reuse, support data analysis replication, allow collaboration between researchers and educators, and mitigate risks and incorrect results caused by inaccurate data.</p>
        <p>Please answer as many questions as you can. If a question has no answer, write ‘N/A’.</p>
      </sec>
      <sec id="sec-5-2">
        <title>A.2. Author Contact Information</title>
        <p>1. Name:</p>
        <p>2. Email address [preferably one that is not linked to an institution]:</p>
        <p>3. Backup Contact [if applicable]:</p>
      </sec>
      <sec id="sec-5-3">
        <title>A.3. Dataset Overview</title>
        <p>This section provides an overview of the dataset.</p>
        <p>1. What is the Dataset Name? [answer]</p>
        <p>2. Please Provide an Overview Description of the Dataset: [answer]</p>
        <p>3. If available, Please Add Relevant Published Papers: [answer]</p>
      </sec>
      <sec id="sec-5-4">
        <title>A.4. Dataset Log Data Attributes</title>
        <p>This section includes all information about the dataset tables.</p>
        <sec id="sec-5-4-1">
          <title>A.4.1. Dataset Format</title>
          <p>1. Provide an overview of the dataset format and, if available, add a reference for this format: The data is stored in the [answer] format, ….</p>
        </sec>
        <sec id="sec-5-4-2">
          <title>A.4.2. Dataset Tables</title>
          <p>1. Name:</p>
          <p>2. Description: [answer]</p>
          <p>3. No. of Columns:</p>
          <p>4. Columns: [write down each column name, a description for it, and, if applicable, the possible values of this column]</p>
          <p>1. Name:</p>
          <p>2. Description: [answer]</p>
          <p>3. No. of Columns:</p>
          <p>4. Columns: [write down each column name, a description for it, and, if applicable, the possible values of this column]</p>
        </sec>
        <sec id="sec-5-4-3">
          <title>A.4.3. Dataset Properties</title>
          <p>1. Are there any unique properties for this dataset? [answer]</p>
        </sec>
      </sec>
      <sec id="sec-5-5">
        <title>A.5. Dataset Contextual Questions</title>
        <p>This section contains questions designed to provide any relevant details that help in understanding and interpreting the data.</p>
        <sec id="sec-5-5-1">
          <title>A.5.1. Classroom/Lab Settings</title>
          <p>As applicable, please provide:</p>
          <p>1. Programming Language Used:</p>
          <p>2. Programming Environments Used:</p>
          <p>3. Number of Instructors:</p>
          <p>4. Duration of Activities, or Lab Work:</p>
        </sec>
        <sec id="sec-5-5-2">
          <title>A.5.2. Participants’ Demographics</title>
          <p>As applicable, please provide a paper that includes the student/participant demographics, or type the population’s:</p>
          <p>1. Population Size:</p>
          <p>2. Age:</p>
          <p>3. Grade Level:</p>
          <p>4. Gender:</p>
          <p>5. Race/Ethnicity of Students:</p>
          <p>6. Prior CS Experience:</p>
        </sec>
        <sec id="sec-5-5-3">
          <p>1. What kinds of assignments did the students have (e.g., quizzes, projects, multiple choice questions, …)? [answer]</p>
          <p>2. What are the Grading Policies (e.g., late policies, penalties, …)? [answer]</p>
          <p>3. What topics were covered before the data was collected?</p>
          <p>4. What is the grade percentage of each assessment? [answer]</p>
          <p>5. How are the assessments graded (if applicable, can you provide the rubric)? [answer]</p>
        </sec>
      </sec>
      <sec id="sec-5-6">
        <title>A.6. Dataset Resources</title>
        <p>This section collects additional file resources to give more context to the log data. For each question, you can either type your answer or upload a file. If not applicable, write N/A.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kiesler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Impagliazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Biernacka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kazmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Ramagoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Taneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Where's the data? exploring datasets in computing education</article-title>
          ,
          <source>in: Proceedings of the ACM Conference on Global Computing Education</source>
          Vol
          <volume>2</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovemeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rivers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Bart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kazerouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gusukuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Edwards</surname>
          </string-name>
          , et al.,
          <article-title>Progsnap2: A flexible format for programming process data</article-title>
          ,
          <source>in: Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>356</fpage>
          -
          <lpage>362</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sanders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Richards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Moström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Almstrum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fincher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gunion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hanks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lonergan</surname>
          </string-name>
          , et al.,
          <article-title>Dcer: sharing empirical computer science education data</article-title>
          ,
          <source>in: Proceedings of the Fourth international Workshop on Computing Education Research</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morgenstern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vecchione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
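          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé III</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crawford</surname>
          </string-name>
          ,
          <article-title>Datasheets for datasets</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>64</volume>
          (
          <year>2021</year>
          )
          <fpage>86</fpage>
          -
          <lpage>92</lpage>
          . doi:10.1145/3458723.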
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pushkarna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaldivar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kjartansson</surname>
          </string-name>
          ,
          <article-title>Data cards: Purposeful and transparent dataset documentation for responsible ai</article-title>
          ,
          <source>in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1776</fpage>
          -
          <lpage>1826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>1. What are the Programming Problems Text Given to Students? The assignments for the course, their descriptions, and solutions, can be found in …</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>2. What are the Help Resources Accessible to Students during Practices (e.g. feedback type, or other interventions)? [answer]</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>3. If possible, Please Include a De-Identified Syllabus: [answer]</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>4. If possible, Please Provide the Assessments Questions (e.g. Midterms, Final Exams, Pretest, Posttest): [answer]</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>5. If you have any Additional Data Resources, Please Upload it Here: [answer]</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>