HITL IRL: 12 Reflections on Expertise Finding and Engagement for a Large Data Curation Team

Brendan Coon¹,∗
¹ Spotify, 3 Center Plaza, Boston, MA 02108
∗ Corresponding author: bcoon@spotify.com (B. Coon)

Proceedings of the CIKM 2022 Workshops, 2022

Abstract

As ML and AI increasingly shape product development, the need for a rigorous humans-in-the-loop approach for quality control increases in importance. Impactful Data Curation teams are responsible for understanding and assessing the quality of the training data feeding into models and algorithms, and are able to package their evaluations in a consumable and actionable format. This paper covers some of the necessary steps to build a successful Data Curation team that can continuously deliver value, even as your core business or academic use case evolves. By providing an overview of what has worked during my 9 years on the team, I aim to provide an essential guide to building a new team or improving an existing one. My contention is that the unique perspective contained in this paper can help the several disciplines that might be looking after a Data Curation team as part of their remit—researchers, ML engineers, product managers—get high-integrity data and algorithm evaluations from the experts they engage. Building and maintaining a Data Curation team will directly impact any product team's ability to "identify issues with usability and comprehensibility associated most closely with content quality and with the user experience." [1] It is important that you find the right people and retain them — this paper lays out how to do both. Some key takeaways the reader might acquire from this paper are how to find and identify the right experts, how to support and work with those experts, and how to retain and engage those experts. They are mostly pulled from my experience in a business environment, but can apply to an academic setting as well.

Keywords

humans in the loop, data curation, annotation, ML evaluation, subject matter expertise, curator engagement

1. Introduction

The goal of this paper is to help guide anyone working in a product development environment who needs to build or improve a Data Curation team they're responsible for. This responsibility does not always fall on an individual as a single, dedicated task. Often the job goes to a lead researcher, ML engineer, or Product Manager, despite often requiring the energy and attention of a full-time, dedicated leader who may have even been an individual contributor Data Curator themselves. This isn't necessarily the wrong organizational structure, but it can limit the amount of exposure and time the responsible party has to build and run a Data Curation team when it is only part of their remit.

This paper covers how to find the human subject matter experts, encourage retention, and enable high performance — it does not go into technical details about the process of integrating data or similar experimental subjects. We know that immediate or early ML output is often wrong, unintuitive, or off-brand, and can vary wildly from end-user to end-user, but a well constructed and maintained Data Curation team can point product teams in the direction of improving that output quickly and consistently. This paper may be interesting background for those curious about how to work with a Data Curation team, but it is particularly targeted at those looking for key steps to actually find and engage the subject matter experts on a Data Curation team itself.

2. Background

In 2013, I was hired as one of the first four Data Curators at a music start-up called The Echo Nest. We worked remotely and part-time, validating data mapping via a web crawler on the order of 10k or 15k entities over several months. This project and team workflow, and others like it — experts in music and music in culture confirming computational results — proved valuable to Research and Development as they iterated on algorithms valued by multiple B2B customers.
By 2015, a while after being acquired by Spotify, the team became full-time and began branching out from label confirmation and correction to the corresponding work of heuristic evaluation. The types of work required of our team started fairly simply — evaluating one or two playlist concepts at a time over several rounds of review. But our remit eventually expanded, including but not limited to: evaluation of personalized music playlists; natural language processing (NLP) results; image quality assessment; search query fulfillment; podcast show, episode and clip recommendation analysis; track transition programming; as well as the building of a scalable taxonomy for music culture training data. Over many years, we have developed our own bespoke frameworks to package lots of nuanced analysis into actionable insights. We have collaborated weekly with music editors on discovery playlists that break up-and-coming artists. We have strategically shaped what should (and should not) go into music culture-centric marketing campaign data stories. This list only scratches the surface of what the Data Curation team has done in our 7 years of being full time. I have led the team since 2016, and during my leadership we have moved into the product insights part of the company, and grown from an east coast-based team of 5 to an international team of 25 subject matter experts, with some expansion yet to come over the next few years. You may have experienced some of Spotify's personalized products, so chances are someone on my team had something to do with your experience from their role "in the loop."

3. 12 Reflections

It is possible to share much more than 12 points about how to build and maintain a Data Curation team, but I've identified these lessons as the most helpful, actionable, and applicable to a variety of Data Curation team scenarios regardless of domain.

4. Expertise Finding

4.1. Determine expertise areas

As you start building or as you inherit your team, you must determine the specific areas of expertise you will absolutely need. This may sound obvious, but the way you build a team based on identified needs can impact how flexible you're able to be as your use case needs evolve. For example, when I took over hiring for my team, we were just starting to understand how we might work effectively with Natural Language Processing, and podcasts had not even been mentioned on a product roadmap yet. Once it became clear that the role of our Data Curators was going to evolve beyond "just" music expertise, adjustments were made to the hiring process to attract and screen for a broader pool of expertise. The benefit of this has been that while we maintain a core group of music experts, we are also able to provide value for the company's increasing scope. If your company's mission is made up of multiple verticals, think of the team you're building as a platform to share and serve the workload for that growth. Otherwise you can end up with several islands of Data Curators spread out due to institutional history rather than intentional alignment, and those teams might miss the opportunity to share knowledge, tooling, or even a consistent career development framework.

4.2. Simultaneously scope those areas

At the same time, you should accept that the scope of the expertise you're able to provide to the company must always have some appropriate limits, and that you should prioritize the knowledge that will likely improve the user experience for the most end-users. For example, if you're looking for music experts, you might find a candidate who is an authority on every recording ever committed to wax cylinder by the Edison Concert Band, but that knowledge is not practically valuable in today's music streaming market. A candidate who is integrally aware of the performers featured in XXL's latest freshman class and can apply that awareness to a recommender system evaluation is arguably of more value to your business case than someone with a PhD who can identify every 78 produced by the Victor Talking Machine Company. Prioritize the expertise you need based on the market and customer base you're serving — not necessarily at the expense of the Edison Concert Band fans, but within a proper balance that favors your users.
4.3. Hire from diverse backgrounds

Your strength as a Data Curation team is proportionate to the level of diversity you're able to acquire, so you should hire a diverse team to meet whatever your needs are. If you need experts in a range of cultures or languages, do not hesitate to venture outside of a particular candidate profile. Consider a multitude of different professional backgrounds — do not exclude any academic majors or previous career paths. For example, we have had very successful members of our Data Curation team with academic backgrounds from music schools, but also business, political science, statistics, theater and English. We have hired people from companies similar to ours, but also from the DJ community, education, retail, nonprofit, and real estate. The subject matter experts you are looking for are not always the most obvious candidates jumping out of your hiring pipeline, and you will find that the strength and quality of your work will benefit from being open minded about your candidate pool.

4.4. Find knowledge lovers who can leverage that knowledge

Your curators should love acquiring knowledge, doing research, and applying both in a machine learning or iterative product environment. There are extremely capable professionals who have and can develop much of the knowledge your problem space might require, but they may not be the same individuals who are able to apply their knowledge in an actionable way. Conversely, you may find stellar project managers who are efficient at organizing a task against a deadline, but simply have too much of a domain knowledge gap to be a fit for your team. Personality types vary of course, and this isn't an obligatory requirement, but some of the ideal candidates are people who are already participating in activities like the job they're applying for in their free time.
For example, if someone you are considering is already updating online assets with sources, or painstakingly curating their own music library with what are essentially track attributes, these are very promising signs. If you do not interrogate how much your potential hire appreciates research and data improvement, you may end up with an expert who does not appreciate the application of their expertise that they are now professionally responsible for. Ensure that your hires can appreciate the glory in what others might find mundane.

4.5. Develop unique screening exercises

When hiring, develop smart, non-punitive screening exercises aimed at testing knowledge, as well as the ability to speak fluidly about thorny concepts (e.g., music genres). These hiring tests should simulate the work so that both the candidate and employer know what they are getting into, but they should also help to assess curiosity, detail awareness and, of course, domain knowledge. For example, if you envision the candidate will be largely responsible for annotating descriptions of tracks in a particular language, test their ability to complete this work for the music or culture they have already communicated is within their area of expertise, and do this right alongside tracks they may be less familiar with. Even the best experts have to do work outside of their comfort zone, so you will want to see how a candidate handles what might be unfamiliar data to them, and ask how they might start their research if this was part of a real work project in their first week of employment. This will tell you a lot about what kind of learning mindset your candidate is likely to maintain, and how satisfied that learning is likely to make them.

4.6. Balance benchmarking with bespoke investigation

When developing these tests, there are two points I want to suggest you remain vigilantly aware of:

4.6.1. False Claims

It is important that the hiring process exposes exaggerated or false claims made in a candidate's application regarding their expertise, so it is critical that you tailor some interview materials to examine these bespoke claims, while also designing identical tasks every candidate must complete for proper benchmarking. For example, if a candidate states that they have expertise in hip hop, make sure to ask them about it several times, specifically.

4.6.2. Untenable Snobbery

Simultaneously, some subject matter experts can be detrimentally snobby, so you have to investigate their professional flexibility. For example: "As part of the hiring process, some editors had to make a playlist for Susan Boyle fans to prove they could pick songs that do not necessarily align with their own taste. 'Even if it is done by a super expert, it's still for a general audience,' says Jessica Suarez, a product marketing manager at Google who serves as one of Play Music's editors. 'We're trying to reach as many people as possible.'" [2] I highly recommend this sort of assessment, as any Data Curator will eventually have to annotate or evaluate data they do not personally like or find interesting with the same level of professionalism they apply to the data they are more naturally passionate and knowledgeable about.

5. Engagement

5.1. Take on imposter syndrome head on

Recognize and embrace the imposter syndrome that is often felt by subject matter experts who are part of a Data Curation team, especially those who are joining one for the first time. Working with engineers, scientists and product managers comes with a potential learning curve that can be intimidating. A Data Curator does not necessarily have to understand Python, active learning concepts, or cluster analysis. Although some curators will want to learn more about these related areas, it is not part of their required skill set or how they necessarily add the most value to your use case. Nevertheless, Data Curators have often shared with me that when compared with their counterparts in engineering and other disciplines, they often feel like they don't necessarily "deserve their positions." This natural but misguided sentiment must be countered directly and regularly. For example, I and the other managers on my team loudly make the point that our work enables those engineers to iterate, those scientists to test various iterations, and those product managers to judge whether or not user needs are being met. So in fact, Data Curators are the integral glue that all of those disciplines require for ground truth and quality measurement. Curators are often able to get very close to what an actual user experience is like, and their ideas about what is not working in that experience can often expose product teams to specific examples of user pain points. If a Data Curator feels intimidated because they cannot speak authoritatively about causal inference or a similar technical concept, we try to remind them about something they do uniquely know and can apply — like maybe knowing all nine official members of Wu-Tang Clan. This sort of knowledge — the type Data Curators often take for granted, given what disciplines they are comparing themselves to — is just as valuable when doing the majority of our work (i.e., annotation and evaluation), and you must coach Data Curators to treat their own knowledge with respect and value.

5.2. Frame the work as memorable

Data Curators can be ground truth oracles for heuristic or model training data, expert tuners of algorithms, or evaluators for algorithmic output, and are quite often all three. But your Data Curators, particularly when they are just joining, don't necessarily have this context or nomenclature. To keep this simple, try to frame most of the work encompassed in this diverse set of tasks as something memorable. Your Data Curation team should eventually learn more about precision and recall and the many related topics, but it's important that they're immediately able to connect their work with how it might be affecting models and, subsequently, end users. For example, we talk about "the 3 T's":

5.2.1. Training

Humans annotate data with labels or free text. This ground truth or "golden data" gives models high quality and high volume training data. There is more than one approach to machine learning (ML), but typically ML algorithms learn to make decisions from this training data, depending on the particular corpus(es) a use case involves. Typically this is the part of the process people are referring to when the term "humans-in-the-loop" is used.

5.2.2. Tuning

Humans tune the model in various ways, but mostly by scoring data to track things like the limiting of accurate predictions due to overfitting, edge cases a model/classifier has not seen yet, or new categories and attributes in a schema that a model needs.

5.2.3. Testing

Humans test, validate and evaluate a model by scoring its outputs, especially in places where an algorithm has low confidence about a correct judgment or high confidence about an incorrect judgment. This is usually done with test sets to make the model robust and less likely to overfit or retain biases.
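To make the 3 T's concrete, here is a minimal sketch of one such round, assuming a scikit-learn-style text classifier. The toy data and the curator_review() placeholder are illustrative assumptions, not our production pipeline.

```python
# A minimal sketch of one "3 T's" round with a scikit-learn classifier.
# The toy data and curator_review() are hypothetical stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training: curator-annotated "golden data" seeds the model.
texts = ["gritty boom bap drums", "lush ambient pads",
         "trap hi-hats and 808s", "drifting synth textures"]
labels = ["hip hop", "ambient", "hip hop", "ambient"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Testing: score fresh outputs and surface low-confidence judgments
# for human review.
candidates = ["dusty jazz loop with rapping", "field recordings of rain"]
probabilities = model.predict_proba(candidates)
uncertain = [text for text, p in zip(candidates, probabilities)
             if p.max() < 0.6]

def curator_review(items):
    """Placeholder for the human step: experts supply true labels."""
    return [(item, "hip hop") for item in items]  # illustrative only

# Tuning: corrected labels flow back into the training set and the
# model is refit, closing the loop.
for text, label in curator_review(uncertain):
    texts.append(text)
    labels.append(label)
model.fit(texts, labels)
```

Everything around the review step is plumbing; the review step itself is where the team's subject matter expertise lives.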
5.3. Make tenets and live by them

Your Data Curation team is not just employed to do data clean-up work as an afterthought — they are there to practice a tangible, measurable and integral discipline. Most legitimate disciplines have tenets, and in Data Curation you must have bold tenets. For example:

5.3.1. Tenet 1

Every user should feel like our product gets them, regardless of who they are, where they are from, where they live, or what they like.

5.3.2. Tenet 2

Global growth is dependent on understanding cultural nuances within our products.

5.3.3. Tenet 3

Personalization is not just our products — it is truly the end-to-end user journey.

5.3.4. Tenet 4

Subject matter expertise cannot be automated, and the success of our products depends on alignment with collaborative influence.

5.3.5. Tenet 5

We reject the false dichotomy of human vs. machine and embrace the necessary and powerful collaboration of that relationship.

5.4. Develop tools and make it fun

Always be willing to develop and maintain tools and best practices that are easy for Data Curators to use, based on sound best practices from human-computer interaction research. These tools should be dependable and flexible — do not just use spreadsheets for work your Data Curation team will be repeating regularly. For example, spreadsheets work fine for many tasks, but as an annotation and evaluation tool they are incomplete interfaces. In our case, we developed an internal tool that integrates with spreadsheets, but adds a number of benefits, and is self service. The tool sets up each would-be spreadsheet row as a "card" (the tool is amusingly called "cardi" in tribute to one of our favorite rappers). It can adapt to any schema, handle enriched URIs for content playback, and produce on-the-fly analytics to track progress or trends from an evaluation. By all measures available, investing the time in this tool tripled our productivity, because its features were sourced from its Data Curating practitioners directly. Without the right tool, either purchased or developed, you will always be leaving some time, data and quality on the table.

Also, applying bespoke best practices can be fun! There is no harm in finding relevant and creative ways to visualize important concepts germane to the work you are doing as a team, as shown in Figure 1.

Figure 1: Design memorable ways to aid your team's understanding of statistical concepts in a manner relevant to their subject matter expertise. Here we see a fun way to remember the difference between Type 1 and Type 2 errors relevant to the domain the experts are working in. Created by the author, using photographs from his own collection and via the Library of Congress, William P. Gottlieb Collection [Public domain] (https://loc.gov/item/gottlieb.00151).
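The internal tool itself is not public, but a minimal sketch of the underlying idea, a schema-agnostic "card" per row with running progress analytics, might look like the following; every name and field here is a hypothetical illustration, not the actual tool's API.

```python
# A schema-agnostic "card" abstraction in the spirit of the internal
# tool described above. All names and fields are hypothetical sketches.
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class Card:
    """One would-be spreadsheet row, plus playback and a judgment."""
    fields: Dict[str, Any]             # adapts to whatever schema the task uses
    content_uri: Optional[str] = None  # enriched URI for in-card playback
    judgment: Optional[str] = None     # filled in by the curator

class Deck:
    """A task's cards, with on-the-fly progress analytics."""

    def __init__(self, rows, uri_key="uri"):
        self.cards = [Card(fields=row, content_uri=row.get(uri_key))
                      for row in rows]

    def annotate(self, index: int, judgment: str) -> None:
        self.cards[index].judgment = judgment

    def progress(self) -> float:
        done = sum(1 for card in self.cards if card.judgment is not None)
        return done / len(self.cards)

# Usage: rows could come from any spreadsheet export.
deck = Deck([{"uri": "spotify:track:123", "genre_hint": "ambient"},
             {"uri": "spotify:track:456", "genre_hint": "hip hop"}])
deck.annotate(0, "coherent")
print(f"{deck.progress():.0%} complete")  # 50% complete
```

The design point is the separation: the schema lives in the data, not the tool, which is what lets one interface adapt to any evaluation task.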
5.5. KPIs aren't always obvious but are always necessary

KPIs can be hard to come by and are often contextual when it comes to a Data Curation team. You can use raw counts of annotations in a database, connections made in a graph, or rates of project completion over time. Yet we have found that the better metric is something closer to the number of tests that launched over a quarter because of our team's work. When possible, any corresponding positive movement on numbers like consumption or retention is nice, but our mandate is to unlock the potential for those improvements — it is the responsibility of the product team to actually improve their code and the resulting product consumption. You can always learn a lot about how much value you are adding and where you can have the biggest impact by staying close to product development, so test launch measurement is a helpful quantification.

For example, when a product was in development, a Data Curation team "Performed a heuristic review, where (they) reviewed a number of (examples) with a variety of taste overlap scores." [1] The KPI the Data Curation team cared about in the evaluation was getting something test-ready by "identifying issues." This sort of focus on the value of the work proves critical to Data Curation engagement — it is the "why" the team is often looking for, and it can add energy to team morale and motivation.

5.6. Use the right evaluation framework

Having the right evaluation framework provides Data Curation teams with a formal and interoperable set of attributes that both focuses the feedback Data Curators generate and provides clear reporting of that feedback to stakeholders. For example, our Data Curation team has developed a "Content Recommendation Scorecard" for evaluating products or listening experiences against acceptable quality levels. Given the cognitive complexity of trying to leverage subject matter expertise in an objective way, the framework allows the team to rate a playlist or a track using several dimensions of quality: attributes like coherence or representation. When Data Curators and product teams are speaking an overlapping language, curators can ensure that they are evaluating systems consistently, and product teams can determine takeaways like "the new approach more strongly met our criteria in terms of the attributes we wanted to optimize for." [1] A detailed framework might take time to construct and fine-tune, as a healthy level of inquiry should be applied within whatever dimensions you deem appropriate. Before you develop a more rigorous evaluation framework, you can keep it simple with the three dimensions below (a minimal sketch of how they might be recorded follows the list):

5.6.1. Personal Relevance

Does the recommendation match user tastes and personal preferences?

5.6.2. Cultural Relevance

Does the recommendation account for the current cultural or localized context, like contemporary trends or appropriate language?

5.6.3. Expert Artisanship

Does the recommendation feel brilliant - made by someone who knows the material inside and out and its relation to user taste?
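As promised above, here is a minimal sketch of how a starter scorecard built on these three dimensions might be recorded and rolled up. The rating scale and aggregation are hypothetical illustrations, not the team's actual Content Recommendation Scorecard.

```python
# A minimal starter scorecard using the three dimensions above.
# The 1-5 scale and mean aggregation are hypothetical sketches.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

DIMENSIONS = ("personal_relevance", "cultural_relevance", "expert_artisanship")

@dataclass
class ScorecardEntry:
    item_uri: str           # the playlist or track under review
    scores: Dict[str, int]  # dimension -> rating on a shared 1-5 scale

def summarize(entries: List[ScorecardEntry]) -> Dict[str, float]:
    """Roll curator ratings up into per-dimension averages for reporting."""
    return {d: mean(entry.scores[d] for entry in entries) for d in DIMENSIONS}

# Usage: two curators judging the same playlist.
entries = [
    ScorecardEntry("spotify:playlist:abc",
                   {"personal_relevance": 4, "cultural_relevance": 5,
                    "expert_artisanship": 3}),
    ScorecardEntry("spotify:playlist:abc",
                   {"personal_relevance": 3, "cultural_relevance": 4,
                    "expert_artisanship": 4}),
]
print(summarize(entries))
# {'personal_relevance': 3.5, 'cultural_relevance': 4.5, 'expert_artisanship': 3.5}
```

Keeping the dimensions in one shared list is what makes the framework interoperable: curators rate against the same attributes that stakeholders see in reporting.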
These tasks require thoughtful work and consistent standards. Without sampling actual user segments across our most important cohorts to see and hear what various product experiences are surfacing to them, you are always sort of guessing. Data Curation removes some of that guesswork, enabling stakeholders with directional analysis that leads to beneficial action.

6. Conclusions

Some key takeaways from this paper center around how to find and identify the right experts, how to support and work with those experts, and how to keep them engaged in order to retain them. They are often pulled from my time in a business environment, but can also apply to an academic one. Building and maintaining a Data Curation team will directly impact any product team that leverages their expertise. Finding the right talent and engaging that talent to retain them is an important consideration, and as I have articulated in this paper, there are specific steps anyone responsible for a Data Curation team can take to optimize for both.
Acknowledgments

Thanks to my entire Data Curation team, past and current, and my colleagues in Spotify's Insights and Research communities, especially Sam Way, Claudia Huff, Aditya Ponnada, Ang Li, Praveen Ravichandran, Mounia Lalmas-Roelleke, Henriette Cramer, and Laura Lake, for your guidance and support. This paper would not exist without all of your generously shared wisdom.

References

[1] J. Lamere, A look behind Blend: The personalized playlist for you... and you, 2021. URL: https://engineering.atspotify.com/2021/12/a-look-behind-blend-the-personalized-playlist-for-you-and-you/.

[2] V. Luckerson, These are the people picking your next internet radio song, 2015. URL: https://time.com/3947080/streaming-music-human-curators/.