=Paper= {{Paper |id=None |storemode=property |title=An Applied Approach to Data Curation Training at the Inter-university Consortium for Political and Social Research (ICPSR) |pdfUrl=https://ceur-ws.org/Vol-1016/paper14.pdf |volume=Vol-1016 |dblpUrl=https://dblp.org/rec/conf/digcurv/LyleVCN13 }} ==An Applied Approach to Data Curation Training at the Inter-university Consortium for Political and Social Research (ICPSR)== https://ceur-ws.org/Vol-1016/paper14.pdf
An Applied Approach to Data Curation Training at the Inter-university Consortium for Political and Social Research (ICPSR)

Jared Lyle, Mary Vardigan
ICPSR, University of Michigan
Ann Arbor, MI, U.S.A.
{lyle, vardigan}@umich.edu

Jacob Carlson
Purdue University
West Lafayette, IN, U.S.A.
jakecarlson@purdue.edu

Ron Nakao
Stanford University
Stanford, CA, U.S.A.
ronbo@stanford.edu



Abstract: ICPSR recently developed two new training initiatives in digital curation: a week-long applied data curation workshop where participants learn the theories and methods of data curation using the ICPSR “processing pipeline” as a framework, and an ongoing virtual working group of data librarians that discusses similar core data curation topics while giving participants independent access to curate their own data using ICPSR’s processing environment and tools. This paper discusses the background, structure, and lessons learned from these new training initiatives.

Keywords: Digital curation, data curation, training, curriculum.

I. OVERVIEW

The Inter-university Consortium for Political and Social Research (ICPSR), a research center in the Institute for Social Research at the University of Michigan and the world’s largest archive of social science data, recently developed two new training initiatives in digital curation. The first initiative is a week-long applied data curation workshop offered as part of the ICPSR Summer Program in Quantitative Methods, where participants learn the theories and methods of data curation using the ICPSR “processing pipeline” as a framework. The second initiative is an ongoing virtual working group of data librarians that discusses similar core data curation topics while giving participants independent access to curate their own data using ICPSR’s processing environment and tools. This paper discusses the background, structure, and lessons learned from these new training initiatives.

II. DATA CURATION WORKSHOP

As data multiply in sheer quantity and become increasingly important in the research process, the demand for data curation knowledge rises. What are the best practices for curating research data? How does one apply them to daily practice? What tools can assist in curation efforts? In 2011, ICPSR began planning a data curation workshop to address these questions.

A. Background

The workshop was intended for individuals interested in or actively engaged in the management and curation of research data, particularly data scientists, data managers and analysts, librarians, archivists, and data stewards and curators. The initial goal of the workshop was to “raise awareness about the benefits of life cycle principles for data management, including how to create, comply with, and evaluate required data management plans, how to encourage and trace re-use, and how to manage data from its inception through archiving and beyond.”

We believed, and continue to feel, that ICPSR is uniquely positioned to offer a course on data curation. First, ICPSR plays a central role in many social science data curation standards and activities, including serving as the home office for the Data Documentation Initiative (DDI) and as a founding member of the Data Preservation Alliance for the Social Sciences (Data-PASS). DDI has become an international standard for metadata in the social sciences. ICPSR and many other data archives use DDI XML to document information about the data in our repositories; the ICPSR online catalog is also built on DDI metadata, allowing structured searching across the entire repository at the variable and even the value level. Data-PASS is a voluntary partnership of organizations created to archive, catalog, and preserve data used for social science research. The Data-PASS partners collaborate on best practices for data archiving and have a shared digital preservation strategy.

Second, ICPSR has established workflows for curating, preserving, and providing access to data. These workflows, described as the “ICPSR Pipeline Process” (Fig. 1), have been developed and refined over 50 years of archiving more than 8,000 research collections from across all social science disciplines, and are informed by the Reference Model for an Open Archival Information System (OAIS) for the preservation of digital objects as well as other community-based best practices. The workflow segments, which are broken into digestible portions, make it easier for students to follow and learn curation processes.




Fig. 1. ICPSR Pipeline Process.

Third, ICPSR has an established Summer Program in Quantitative Methods that offers more than 70 courses every summer. The program provides an instructional infrastructure readily accessible for curation instruction. For the past several years, for instance, we have offered a course for data librarians called “Providing Social Science Data Services: Strategies for Design and Operation.” More recently, a course on confidential data, “Assessment and Mitigation of Disclosure Risk in Data: Essentials for Social Science,” was offered.

Finally, ICPSR is committed to global leadership in the area of digital curation, especially through instruction. Direction 1 of the ICPSR Strategic Plan reads: “Through global leadership and strong partnerships, set standards for excellence in data curation and in the ethics of data access and protection for the social sciences and related disciplines.” The ICPSR Council, which is elected by the Consortium membership and provides overall guidance, strongly encourages our participation in initiatives to promote digital curation. We are eager to share our experience and knowledge. We also recognize and appreciate the benefits from the course: increased connection with front-line curators, improved understanding of the needs and workflows of the community, and new opportunities to influence the curation of data further upstream in the data lifecycle (i.e., closer to the original production of the data).

B. Structure

The workshop, titled “Applied Data Science: Managing Research Data for Re-Use,” was held July 23-27, 2012 in Ann Arbor, Michigan. ICPSR teamed with the University of Michigan School of Information to host the workshop. The core instructors were Mary Vardigan and Jared Lyle from ICPSR,
Kathleen Fear from the UM School of Information, and Jake Carlson from Purdue University.

Twenty-five participants attended, representing diverse institutions from the United States and Canada, as well as a range of disciplines, including engineering, chemistry, physics, the physical sciences, and the social sciences. Participants came to the workshop with a wide variety of interests. Many participants were interested in broad-based training. Others were establishing or expanding their own repositories and needed “shovel ready” plans for curating data. Still others came with very specific questions in mind, such as how to manage confidential data or how to address copyright questions.

The workshop was grouped into five themed days that followed an ICPSR dataset across the data life cycle through creation, deposit, data processing, dissemination, preservation, and reuse [1]. Day 1 provided an overview of the research life cycle stages and data curation. Day 2 covered data management planning and acquisitions. Day 3 highlighted metadata. Day 4 covered data processing, confidential data management, and repository requirements. Day 5 addressed dissemination, preservation, and tracking reuse.

Throughout the workshop, guest speakers provided insight on a wide variety of curation topics, such as managing video data, geospatial data, provenance, and repository assessment. Case studies and hands-on curation activities designed to help participants apply the material presented were woven throughout the workshop. Examples of hands-on activities included creating study- and variable-level metadata, reviewing unprocessed data within Google Refine, and checking a dataset for confidentiality issues.

C. Lessons Learned

Overall, the participants had very positive comments about the workshop. Most rated it as “exceptional” or “above average” when compared to other graduate-level courses they had taken.

Expertise, breadth of subject material, and applicability were the main strong points mentioned in the course evaluations. “This workshop provided an insider’s view of the data curation process,” wrote one participant, adding that “having presenters that specialize in key parts of the process was very valuable.” Another participant noted, “The ‘pipeline’ served as an excellent framework.” Yet another appreciated “the hands-on aspects of the course and the various print-based handouts.”

As this was the first time this workshop was offered, we were particularly active in gathering feedback. We surveyed the participants at the end of each day of the course and applied the feedback we received to adjust the course pace and content for the subsequent days. At the end of the course, the Summer Program also conducted an official, proctored evaluation. This feedback now informs our future development. Some of the shortcomings of the workshop that were identified, along with plans to address them, include:

1) Covering Too Much Content: While many participants enjoyed the broad range of curation topics discussed, we also heard comments like “Almost too much material...difficult to digest in short space of time” and “Too many briefings that tried to cover too much material in a short presentation.” We intend to remedy this by discussing fewer topics but diving more deeply. Instead of discussing, for instance, the many possible data types in detail, leaving small chunks of time to each, we intend to provide a quick but broad overview of the subject and then spend quite a bit of time discussing the specifics of one or two examples with hands-on activities.

2) More Discussion and Collaboration: A few of our days were especially long on lectures and short on discussion. We wanted to impart as much of our knowledge as possible, along with that of our invited experts. What the participants really wanted was a mixture of learning from experts and discussion among their peers. “Would have liked more opportunity to share challenges/solutions with participants,” wrote one attendee. Another said, “A forum for discussing individual situations, problem-solving suggestions for next steps, etc. would be helpful.” As a solution, we are building more discussion time into the schedule, including structured thirty-minute blocks each morning and afternoon and a longer lunch break. We are exploring building peer-to-peer collaboration into the exercises as well. We intend to better capitalize on the expertise and knowledge that many workshop participants bring with them.

3) Applied, Applied, Applied: Though we tried to pair applied examples and exercises with each lecture, workshop participants wanted more. Many participants mentioned there are quite a few opportunities to learn about curation, but few chances for hands-on active learning and interaction. While we feel applied interaction is one of the strengths of our workshop, we are looking to fine-tune the exercises that worked well and add others.

4) More Science in the Curriculum: Because ICPSR is a social science data archive, the curation material that we discussed naturally emphasized methods and content from just one slice of the research data spectrum. Our participants recognized the applicability of social science data curation to all types and formats of data, and we did include some examples from the ‘hard sciences.’ That said, the participants wanted to “cover a wider array of data types and the unique management issues for each.” While we will continue to highlight our own data and methods from the social sciences, we can attempt to better
diversify the types of data covered in the exercises and discussions. One option, for example, would be to offer participants a choice of the types of data to work with during exercises.

III. DATA CURATION WORKING GROUP

Shortly before the start of the summer data curation workshop, ICPSR discussed with Ron Nakao, Stanford University, some possible mechanisms to provide more hands-on, localized data curation training to librarians, especially the Official Representatives at member institutions who assist faculty, staff, and students with ICPSR resources. Many librarians have limited experience with data management and curation. In addition, as budgets are increasingly tightening, librarians may not have the chance to travel for week-long training. Even the more experienced data librarians do not have the tools or resources that ICPSR can provide. Although multiple venues exist to meet and discuss data curation topics -- from listservs to conferences -- few opportunities arise for data curators to engage in personalized but collaborative hands-on work using the tools of an established domain repository.

A. Background

We proposed a virtual data curation working group where participants would apply curation theories to practice through actual data processing, interact with and ask questions of other data specialists within a working environment, and gain first-hand experience using ICPSR’s internal tools and procedures for curation. The course would last approximately four months, with one virtual meeting of 1½ hours approximately every other week.

ICPSR would benefit from the group as well. By opening our processing environment and tools to outsiders, we would learn more about the tools and services data librarians want and need, and the suitability of expanding the use of ICPSR’s own curation tools to a broader community. This interest coincides with our work in an IMLS National Leadership Grant (LG-05-09-0084-09) to investigate tools and services to assist librarians with specialized tasks in the archiving and dissemination of social science data. Another benefit of the working group would be that more data would be curated and archived, benefiting the ICPSR membership and the entire social science community.

B. Structure

The working group first met -- virtually -- in September 2012. Participants hailed from Emory, Duke, UCLA, and UC Berkeley, along with Jared Lyle from ICPSR as facilitator and Ron Nakao as the chair. Participants received access to the ICPSR secure processing environment and brought their own data to curate. Bi-weekly discussions focused on topics similar to those found in the summer data curation workshop: acquisition (gathering information from the data producer, legal agreements, and appraisal), review (quality and disclosure review), processing (data cleaning, ensuring data integrity, and quality checking), metadata (standards, and variable- and study-level metadata), dissemination (final packaging, delivery mechanisms), and preservation (policies and actions).

At this time, the working group is still active. Participants have access to the ICPSR secure data processing environment through September 2013.

C. Lessons Learned

As in the workshop, participants were generally excited to be learning about and practicing data curation. “This was a fantastic opportunity,” wrote one participant. “The most useful/informative aspect has been applying the ICPSR’s workflows and practices to an actual data collection and seeing what’s involved in getting the data in sync with those workflows and practices.”

Since the group is ongoing, and since group members are still processing and curating their data, we anticipate learning more about the successes and challenges of this training format. In the meantime, we offer a few in-progress lessons learned.

1) Bring Your Own Data: All working group participants brought their own data to process and curate. As a result, the participants were highly invested and motivated; the questions and discussions raised were timely and relevant rather than purely theoretical.

2) Hands-on Activities Were Key: Similar to bringing their own data, hands-on activities using ICPSR’s processing environment and tools helped the group members understand and experience the core work of curation instead of just talking through what can seem like generalized concepts. As one participant mentioned, “...The real work was with going through the data and documentation and seeing things like discrepancies in variable names and the need to flesh out citations to make them more informative. That was both interesting in its own right and illuminating to provide a sense of what data curation actually consists of in practice.”

3) Scheduling Issues: Virtual meetings have distinct benefits, including saving time and money, and allowing participants to practice methods and tools in between group discussions. However, many in our group experienced one big drawback: scheduling conflicts. As one member lamented, “I guess the only real ‘problem’ with the group was that scheduling/timing issues were such that we had to do a lot of the work during the semester, when other demands on my time made it hard to focus on the project in a sustained manner.” Another member
expressed similar frustration. “Unfortunately, my schedule shifted pretty dramatically this semester, and it was often difficult to fit in the call and prep work needed to make the call most useful.” Because participants did not leave their physical work environments, it was increasingly challenging for them to carve curation time away from everyday job demands and expectations.

IV. SUMMARY

As part of ICPSR’s commitment to global leadership in the area of digital curation, especially through instruction, we will offer the data curation summer workshop again in July 2013. Likewise, the data curation working group is running through September 2013.

We see continued demand by professionals to learn about curation, especially through applied learning, and feel we can play a role in helping educate the research and digital curation community through teaching and discussing the curation experiences and processes that have shaped our 50 years as a data archive. As we do this, we recognize and appreciate the benefits: increased connection with front-line curators, improved understanding of the needs and workflows of the community, and new opportunities to influence the curation of data further upstream in the data lifecycle.

ACKNOWLEDGMENT

We wish to acknowledge Nancy McGovern, ICPSR Digital Preservation Officer from 2007-2012, who played a leading role in developing the initial draft and goals of the workshop. We thank the workshop and working group participants for their feedback and participation. We also thank Dan Meisler for his edits. IMLS National Leadership Grant (LG-05-09-0084-09) supported data curation working group activities that identified services to assist with archiving and disseminating social science data.

REFERENCES

[1] J. Carlson, K. Fear, J. Lyle, and M. Vardigan. “Applied Data Science: Managing Research Data for Reuse,” Workshop Syllabus. ICPSR Summer Program in Quantitative Methods, July 2012. http://www.icpsr.umich.edu/files/sumprog/biblio/2012/Applied%20Data%20Science%20Managing%20Research%20Data%20for%20Reuse.pdf