=Paper=
{{Paper
|id=None
|storemode=property
|title=An Applied Approach to Data Curation Training at the Inter-university Consortium for Political and Social Research (ICPSR)
|pdfUrl=https://ceur-ws.org/Vol-1016/paper14.pdf
|volume=Vol-1016
|dblpUrl=https://dblp.org/rec/conf/digcurv/LyleVCN13
}}
==An Applied Approach to Data Curation Training at the Inter-university Consortium for Political and Social Research (ICPSR)==
Jared Lyle and Mary Vardigan (ICPSR, University of Michigan, Ann Arbor, MI, U.S.A.; {lyle, vardigan}@umich.edu); Jacob Carlson (Purdue University, West Lafayette, IN, U.S.A.; jakecarlson@purdue.edu); Ron Nakao (Stanford University, Stanford, CA, U.S.A.; ronbo@stanford.edu)

Abstract: ICPSR recently developed two new training initiatives in digital curation: a week-long applied data curation workshop where participants learn the theories and methods of data curation using the ICPSR “processing pipeline” as a framework, and an ongoing virtual working group of data librarians that discusses similar core data curation topics while giving participants independent access to curate their own data using ICPSR’s processing environment and tools. This paper discusses the background, structure, and lessons learned from these new training initiatives.

Keywords: Digital curation, data curation, training, curriculum.

===I. OVERVIEW===
The Inter-university Consortium for Political and Social Research (ICPSR), a research center in the Institute for Social Research at the University of Michigan and the world’s largest archive of social science data, recently developed two new training initiatives in digital curation. The first initiative is a week-long applied data curation workshop offered as part of the ICPSR Summer Program in Quantitative Methods, where participants learn the theories and methods of data curation using the ICPSR “processing pipeline” as a framework. The second initiative is an ongoing virtual working group of data librarians that discusses similar core data curation topics while giving participants independent access to curate their own data using ICPSR’s processing environment and tools. This paper discusses the background, structure, and lessons learned from these new training initiatives.

===II. DATA CURATION WORKSHOP===
As data multiply in sheer quantity and become increasingly important in the research process, the demand for data curation knowledge rises. What are the best practices for curating research data? How does one apply them to daily practice? What tools can assist in curation efforts? In 2011, ICPSR began planning a data curation workshop to address these questions.

====A. Background====
The workshop was intended for individuals interested in or actively engaged in the management and curation of research data, particularly data scientists, data managers and analysts, librarians, archivists, and data stewards and curators. The initial goal of the workshop was to “raise awareness about the benefits of life cycle principles for data management, including how to create, comply with, and evaluate required data management plans, how to encourage and trace re-use, and how to manage data from its inception through archiving and beyond.”

We believed, and continue to feel, that ICPSR is uniquely positioned to offer a course on data curation. First, ICPSR plays a central role in many social science data curation standards and activities, including serving as the home office for the Data Documentation Initiative (DDI) and as a founding member of the Data Preservation Alliance for the Social Sciences (Data-PASS). DDI has become an international standard for metadata in the social sciences. ICPSR and many other data archives use DDI XML to document information about the data in our repositories; the ICPSR online catalog is also built on DDI metadata, allowing structured searching across the entire repository at the variable- and even the value-level. Data-PASS is a voluntary partnership of organizations created to archive, catalog, and preserve data used for social science research. The Data-PASS partners collaborate on best practices for data archiving and have a shared digital preservation strategy.

Second, ICPSR has established workflows for curating, preserving, and providing access to data. These workflows, described as the “ICPSR Pipeline Process” (Fig. 1), have been developed and refined over 50 years of archiving more than 8,000 research collections from across all social science disciplines, and are informed by the Reference Model for an Open Archival Information System (OAIS) for the preservation of digital objects as well as other community-based best practices. The workflow segments, which are broken into digestible portions, make it easier for students to follow and learn curation processes.

Figure 1. ICPSR Pipeline Process.

Third, ICPSR has an established Summer Program in Quantitative Methods that offers more than 70 courses every summer. The program provides an instructional infrastructure readily accessible for curation instruction. For the past several years, for instance, we have offered a course for data librarians called “Providing Social Science Data Services: Strategies for Design and Operation.” More recently, a course on confidential data, “Assessment and Mitigation of Disclosure Risk in Data: Essentials for Social Science,” was offered.

Finally, ICPSR is committed to global leadership in the area of digital curation, especially through instruction. Direction 1 of the ICPSR Strategic Plan reads: “Through global leadership and strong partnerships, set standards for excellence in data curation and in the ethics of data access and protection for the social sciences and related disciplines.” The ICPSR Council, which is elected by the Consortium membership and provides overall guidance, strongly encourages our participation in initiatives to promote digital curation. We are eager to share our experience and knowledge. We also recognize and appreciate the benefits from the course: increased connection with front-line curators, improved understanding of the needs and workflows of the community, and new opportunities to influence the curation of data further upstream in the data lifecycle (i.e., closer to the original production of the data).

====B. Structure====
The workshop, titled “Applied Data Science: Managing Research Data for Re-Use,” was held July 23-27, 2012 in Ann Arbor, Michigan. ICPSR teamed with the University of Michigan School of Information to host the workshop. The core instructors were Mary Vardigan and Jared Lyle from ICPSR, Kathleen Fear from the UM School of Information, and Jake Carlson from Purdue University.

Twenty-five participants attended, representing diverse institutions from the United States and Canada, as well as a range of disciplines, including engineering, chemistry, physics, the physical sciences, and the social sciences. Participants came to the workshop with a wide variety of interests. Many participants were interested in broad-based training. Others were establishing or expanding their own repositories and needed “shovel ready” plans for curating data. Still others came with very specific questions in mind, such as how to manage confidential data or how to address copyright questions.

The workshop was grouped into five themed days that followed an ICPSR dataset across the data life cycle through creation, deposit, data processing, dissemination, preservation, and reuse [1]. Day 1 provided an overview of the research life cycle stages and data curation. Day 2 covered data management planning and acquisitions. Day 3 highlighted metadata. Day 4 covered data processing, confidential data management, and repository requirements. Day 5 addressed dissemination, preservation, and tracking reuse.

Throughout the workshop, guest speakers provided insight on a wide variety of curation topics, such as managing video data, geospatial data, provenance, and repository assessment. Case studies and hands-on curation activities designed to help participants apply the material presented were woven throughout the workshop. Examples of hands-on activities included creating study- and variable-level metadata, reviewing unprocessed data within Google Refine, and checking a dataset for confidentiality issues.

====C. Lessons Learned====
Overall, the participants had very positive comments about the workshop. Most rated it as “exceptional” or “above average” when compared to other graduate-level courses they have taken. Expertise, breadth of subject material, and applicability were the main strong points mentioned in the course evaluations. “This workshop provided an insider’s view of the data curation process,” wrote one participant, adding that “having presenters that specialize in key parts of the process was very valuable.” Another participant noted, “The ‘pipeline’ served as an excellent framework.” Yet another appreciated “the hands-on aspects of the course and the various print-based handouts.”

As this was the first time this workshop was offered, we were particularly active in gathering feedback. We surveyed the participants at the end of each day of the course and applied the feedback we received to adjust the course pace and content for the subsequent days. At the end of the course, the Summer Program also conducted an official, proctored evaluation. This feedback now informs our future development. Some of the shortcomings of the workshop that were identified, along with plans to address them, include:

1) Covering Too Much Content: While many participants enjoyed the broad range of curation topics discussed, we also heard comments like “Almost too much material...difficult to digest in short space of time” and “Too many briefings that tried to cover too much material in a short presentation.” We intend to remedy this by discussing fewer topics but diving more deeply. Instead of discussing, for instance, the many possible data types in detail, leaving small chunks of time to each, we intend to provide a quick but broad overview of the subject and then spend quite a bit of time discussing the specifics of one or two examples with hands-on activities.

2) More Discussion and Collaboration: A few of our days were especially long on lectures and short on discussion. We wanted to impart as much of our knowledge as possible, along with that of our invited experts. What the participants really wanted was a mixture of learning from experts and discussion among their peers. “Would have liked more opportunity to share challenges/solutions with participants,” wrote one attendee. Another said, “A forum for discussing individual situations, problem-solving suggestions for next steps, etc. would be helpful.” As a solution, we are building more discussion time into the schedule, including structured thirty-minute blocks each morning and afternoon and a longer lunch break. We are exploring building peer-to-peer collaboration into the exercises as well. We intend to better capitalize on the expertise and knowledge that many workshop participants bring with them.

3) Applied, Applied, Applied: Though we tried to pair applied examples and exercises with each lecture, workshop participants wanted more. Many participants mentioned there are quite a few opportunities to learn about curation, but few chances for hands-on active learning and interaction. While we feel applied interaction is one of the strengths of our workshop, we are looking to fine-tune the exercises that worked well and add others.

4) More Science in the Curriculum: As a social science data archive, the curation material that we discussed naturally emphasized methods and content from just one slice of the research data spectrum. Our participants recognized the applicability of social science data curation to all types and formats of data, and we did include some examples from the ‘hard sciences.’ That said, the participants wanted to “cover a wider array of data types and the unique management issues for each.” While we will continue to highlight our own data and methods from the social sciences, we can attempt to better diversify the types of data covered in the exercises and the discussions. One option, for example, would be to offer participants a choice of the types of data to work with during exercises.

===III. DATA CURATION WORKING GROUP===
Shortly before the start of the summer data curation workshop, ICPSR discussed with Ron Nakao, Stanford University, some possible mechanisms to provide more hands-on, localized data curation training to librarians, especially the Official Representatives at member institutions who assist faculty, staff, and students with ICPSR resources. Many librarians have limited experience with data management and curation. In addition, as budgets are increasingly tightening, librarians may not have the chance to travel for week-long training. Even the more experienced data librarians do not have the tools or resources that ICPSR can provide. Although multiple venues exist to meet and discuss data curation topics -- from listservs to conferences -- few opportunities arise for data curators to engage in personalized but collaborative hands-on work using the tools of an established domain repository.

====A. Background====
We proposed a virtual data curation working group where participants would apply curation theories to practice through actual data processing, interact with and ask questions of other data specialists within a working environment, and gain first-hand experience using ICPSR’s internal tools and procedures for curation. The course would last approximately four months, with one virtual meeting of 1 ½ hours approximately every other week.

ICPSR would benefit from the group as well. By opening our processing environment and tools to outsiders, we would learn more about the tools and services data librarians want and need, and the suitability of expanding the use of ICPSR’s own curation tools to a broader community. This interest coincides with our work in an IMLS National Leadership Grant (LG-05-09-0084-09) to investigate tools and services to assist librarians with specialized tasks in the archiving and dissemination of social science data. Another benefit of the working group would be that more data would be curated and archived, benefiting the ICPSR membership and the entire social science community.

====B. Structure====
The working group first met -- virtually -- in September 2012. Participants hailed from Emory, Duke, UCLA, and UC Berkeley, along with Jared Lyle from ICPSR as facilitator and Ron Nakao as the chair. Participants received access to the ICPSR secure processing environment and brought their own data to curate. Bi-weekly discussions focused on topics similar to those found in the summer data curation workshop: acquisition (gathering information from the data producer, legal agreements, and appraisal), review (quality and disclosure review), processing (data cleaning, ensuring data integrity, and quality checking), metadata (standards, and variable- and study-level metadata), dissemination (final packaging, delivery mechanisms), and preservation (policies and actions).

At this time, the working group is still active. Participants have access to the ICPSR secure data processing environment through September 2013.

====C. Lessons Learned====
As in the workshop, participants were generally excited to be learning about and practicing data curation. “This was a fantastic opportunity,” wrote one participant. “The most useful/informative aspect has been applying the ICPSR’s workflows and practices to an actual data collection and seeing what’s involved in getting the data in sync with those workflows and practices.”

Since the group is ongoing, and since group members are still processing and curating their data, we anticipate learning more about the successes and challenges of this training format. In the meantime, we offer a few in-progress lessons learned.

5) Bring Your Own Data: All working group participants brought their own data to process and curate. As a result, the participants were highly invested and motivated; the questions and discussions raised were timely and relevant rather than purely theoretical.

6) Hands-on Activities Were Key: Similar to bringing their own data, hands-on activities using ICPSR’s processing environment and tools helped the group members understand and experience the core work of curation instead of just talking through what can seem like generalized concepts. As one participant mentioned, “...The real work was with going through the data and documentation and seeing things like discrepancies in variable names and the need to flesh out citations to make them more informative. That was both interesting in its own right and illuminating to provide a sense of what data curation actually consists of in practice.”

7) Scheduling Issues: Virtual meetings have distinct benefits, including saving time and money, and allowing participants to practice methods and tools in between group discussions. However, many in our group experienced one big drawback: scheduling conflicts. As one member lamented, “I guess the only real ‘problem’ with the group was that scheduling/timing issues were such that we had to do a lot of the work during the semester, when other demands on my time made it hard to focus on the project in a sustained manner.” Another member expressed similar frustration: “Unfortunately, my schedule shifted pretty dramatically this semester, and it was often difficult to fit in the call and prep work needed to make the call most useful.” Because participants did not leave their physical work environments, it was increasingly challenging for them to carve curation time away from everyday job demands and expectations.

===IV. SUMMARY===
As part of ICPSR’s commitment to global leadership in the area of digital curation, especially through instruction, we will offer the data curation summer workshop again in July 2013. Likewise, the data curation working group is running through September 2013.

We see continued demand by professionals to learn about curation, especially through applied learning, and feel we can play a role in helping educate the research and digital curation community through teaching and discussing the curation experiences and processes that have shaped our 50 years as a data archive. As we do this, we recognize and appreciate the benefits: increased connection with front-line curators, improved understanding of the needs and workflows of the community, and new opportunities to influence the curation of data further upstream in the data lifecycle.

===ACKNOWLEDGMENT===
We wish to acknowledge Nancy McGovern, ICPSR Digital Preservation Officer from 2007-2012, who played a leading role in developing the initial draft and goals of the workshop. We thank the workshop and working group participants for their feedback and participation. We also thank Dan Meisler for his edits. IMLS National Leadership Grant (LG-05-09-0084-09) supported data curation working group activities that identified services to assist with archiving and disseminating social science data.

===REFERENCES===
[1] J. Carlson, K. Fear, J. Lyle, and M. Vardigan. “Applied Data Science: Managing Research Data for Reuse,” Workshop Syllabus. ICPSR Summer Program in Quantitative Methods, July 2012. http://www.icpsr.umich.edu/files/sumprog/biblio/2012/Applied%20Data%20Science%20Managing%20Research%20Data%20for%20Reuse.pdf.