HITL IRL: 12 Reflections on Expertise Finding and Engagement for a Large Data Curation Team

Brendan Coon¹,∗
¹ Spotify, 3 Center Plaza, Boston, MA 02108
∗ Corresponding author: bcoon@spotify.com (B. Coon)

Proceedings of the CIKM 2022 Workshops, 2022

Abstract

As ML and AI increasingly shape product development, the need for a rigorous humans-in-the-loop approach for quality control increases in importance. Impactful Data Curation teams are responsible for understanding and assessing the quality of the training data feeding into models and algorithms, and are able to package their evaluations in a consumable and actionable format. This paper covers some of the necessary steps to build a successful Data Curation team that can continuously deliver value, even as your core business or academic use case evolves. By providing an overview of what has worked during my 9 years on the team, I aim to provide an essential guide to building a new team or improving an existing one. My contention is that the unique perspective contained in this paper can help the several disciplines that might be looking after a Data Curation team as part of their remit—researchers, ML engineers, product managers—get high-integrity data and algorithm evaluations from the experts they engage. Building and maintaining a Data Curation team will directly impact any product team's ability to "identify issues with usability and comprehensibility associated most closely with content quality and with the user experience." [1] It is important that you find the right people and retain them — this paper lays out how to do both. Some key takeaways the reader might acquire from this paper are how to find and identify the right experts, how to support and work with those experts, and how to retain and engage those experts. They are mostly pulled from my experience in a business environment, but can apply to an academic setting as well.

Keywords

humans in the loop, data curation, annotation, ML evaluation, subject matter expertise, curator engagement

1. Introduction

The goal of this paper is to help guide anyone working in a product development environment who needs to build or improve a Data Curation team they're responsible for. This responsibility does not always fall on an individual as a single, dedicated task. Often the job goes to a lead researcher, ML engineer, or Product Manager, despite often requiring the energy and attention of a full-time, dedicated leader who may have even been an individual contributor Data Curator themselves. This isn't necessarily the wrong organizational structure, but it can limit the amount of exposure and time the responsible party has to build and run a Data Curation team when it is only part of their remit.

This paper covers how to find the human subject matter experts, encourage retention, and enable high performance — it does not go into technical details about the process of integrating data or similar experimental subjects. We know that immediate or early ML output is often wrong, unintuitive, or off-brand, and can vary wildly from end-user to end-user, but a well constructed and maintained Data Curation team can point product teams in the direction of improving that output quickly and consistently. This paper may be interesting background for those curious about how to work with a Data Curation team, but it is particularly targeted at those looking for key steps to actually find and engage the subject matter experts on a Data Curation team itself.

2. Background

In 2013, I was hired as one of the first four Data Curators at a music start-up called The Echo Nest. We worked remotely and part-time, validating data mapping via a web crawler on the order of 10k or 15k entities over several months. This project and team workflow, and others like it — experts in music and music in culture confirming computational results — proved valuable to Research and Development as they iterated on algorithms valued by multiple B2B customers.
By 2015, a while after being acquired by Spotify, the team became full-time and began branching out from label confirmation and correction to the corresponding work of heuristic evaluation. The types of work required of our team started fairly simply — evaluating one or two playlist concepts at a time over several rounds of review. But our remit eventually expanded, including but not limited to: evaluation of personalized music playlists; natural language processing (NLP) results; image quality assessment; search query fulfillment; podcast show, episode and clip recommendation analysis; track transition programming; as well as the building of a scalable taxonomy for music culture training data. Over many years, we have developed our own bespoke frameworks to package lots of nuanced analysis into actionable insights. We have collaborated weekly with music editors on discovery playlists that break up-and-coming artists. We have strategically shaped what should (and should not) go into music culture-centric marketing campaign data stories. This list only scratches the surface of what the Data Curation team has done in our 7 years of being full time. I have led the team since 2016, and during my leadership we have moved into the product insights part of the company, and grown from an east coast-based team of 5 to an international team of 25 subject matter experts, with some expansion yet to come over the next few years. You may have experienced some of Spotify's personalized products, so chances are someone on my team had something to do with your experience from their role "in the loop."

3. 12 Reflections

It is possible to share much more than 12 points about how to build and maintain a Data Curation team, but I've identified these lessons as the most helpful, actionable, and applicable to a variety of Data Curation team scenarios regardless of domain.

4. Expertise Finding

4.1. Determine expertise areas

As you start building or as you inherit your team, you must determine the specific areas of expertise you will absolutely need. This may sound obvious, but the way you build a team based on identified needs can impact how flexible you're able to be as your use case needs evolve. For example, when I took over hiring for my team, we were just starting to understand how we might work effectively with Natural Language Processing, and podcasts had not even been mentioned on a product roadmap yet. Once it became clear that the role of our Data Curators was going to evolve beyond "just" music expertise, adjustments were made to the hiring process to attract and screen for a broader pool of expertise. The benefit of this has been that while we maintain a core group of music experts, we are also able to provide value for the company's increasing scope. If your company's mission is made up of multiple verticals, think of the team you're building as a platform to share and serve the workload for that growth. Otherwise you can end up with several islands of Data Curators spread out due to institutional history rather than intentional alignment, and those teams might miss the opportunity to share knowledge, tooling, or even a consistent career development framework.

4.2. Simultaneously scope those areas

At the same time, you should accept that the scope of the expertise you're able to provide to the company must always have some appropriate limits, and that you should prioritize the knowledge that will likely improve the user experience for the most end-users. For example, if you're looking for music experts, you might find a candidate who is an authority on every recording ever committed to wax cylinder by the Edison Concert Band, but that knowledge is not practically valuable in today's music streaming market. A candidate who is integrally aware of the performers featured in XXL's latest freshman class and can apply that awareness to a recommender system evaluation is arguably of more value to your business case than someone with a PhD who can identify every 78 produced by the Victor Talking Machine Company. Prioritize the expertise you need based on the market and customer base you're serving — not necessarily at the expense of the Edison Concert Band fans, but within a proper balance that favors your users.
4.3. Hire from diverse backgrounds

Your strength as a Data Curation team is proportionate to the level of diversity you're able to acquire, so you should hire a diverse team to meet whatever your needs are. If you need experts in a range of cultures or languages, do not hesitate to venture outside of a particular candidate profile. Consider a multitude of different professional backgrounds — do not exclude any academic majors or previous career paths. For example, we have had very successful members of our Data Curation team with academic backgrounds from music schools, but also business, political science, statistics, theater and English. We have hired people from companies similar to ours, but also from the DJ community, education, retail, nonprofit, and real estate. The subject matter experts you are looking for are not always the most obvious candidates jumping out of your hiring pipeline, and you will find that the strength and quality of your work will benefit from being open minded about your candidate pool.

4.4. Find knowledge lovers who can leverage that knowledge

Your curators should love acquiring knowledge, doing research, and applying both in a machine learning or iterative product environment. There are extremely capable professionals who have and can develop much of the knowledge your problem space might require, but they may not be the same individuals who are able to apply their knowledge in an actionable way. Conversely, you may find stellar project managers who are efficient at organizing a task against a deadline, but simply have too much of a domain knowledge gap to be a fit for your team. Personality types vary of course, and this isn't an obligatory requirement, but some of the ideal candidates are people who are already participating in activities like the job they're applying for in their free time.
For example, if someone you are considering is already updating online assets with sources, or painstakingly curating their own music library with what are essentially track attributes, these are very promising signs. If you do not interrogate how much your potential hire appreciates research and data improvement, you may end up with an expert who does not appreciate the application of their expertise that they are now professionally responsible for. Ensure that your hires can appreciate the glory in what others might find mundane.

4.5. Develop unique screening exercises

When hiring, develop smart, non-punitive screening exercises aimed at testing knowledge, as well as the ability to speak fluidly about thorny concepts (e.g., music genres). These hiring tests should simulate the work so that both the candidate and employer know what they are getting into, but they should also help to assess curiosity, detail awareness and, of course, domain knowledge. For example, if you envision the candidate will be largely responsible for annotating descriptions of tracks in a particular language, test their ability to complete this work for the music or culture they have already communicated is within their area of expertise, and do this right alongside tracks they may be less familiar with. Even the best experts have to do work outside of their comfort zone, so you will want to see how a candidate handles what might be unfamiliar data to them, and ask how they might start their research if this was part of a real work project in their first week of employment. This will tell you a lot about what kind of learning mindset your candidate is likely to maintain, and how satisfied that learning is likely to make them.

4.6. Balance benchmarking with bespoke investigation

When developing these tests, there are two points I want to suggest you remain vigilantly aware of:

4.6.1. False Claims

It is important that the hiring process exposes exaggerated or false claims made in a candidate's application regarding their expertise, so it is critical that you tailor some interview materials to examine these bespoke claims, while also designing identical tasks every candidate must complete for proper benchmarking. For example, if a candidate states that they have expertise in hip hop, make sure to ask them about it several times, specifically.

4.6.2. Untenable Snobbery

Simultaneously, some subject matter experts can be detrimentally snobby, so you have to investigate their professional flexibility. For example: "As part of the hiring process, some editors had to make a playlist for Susan Boyle fans to prove they could pick songs that do not necessarily align with their own taste. 'Even if it is done by a super expert, it's still for a general audience,' says Jessica Suarez, a product marketing manager at Google who serves as one of Play Music's editors. 'We're trying to reach as many people as possible.'" [2] I highly recommend this sort of assessment, as any Data Curator will eventually have to annotate or evaluate data they do not personally like or find interesting with the same level of professionalism they apply to the data they are more naturally passionate and knowledgeable about.

5. Engagement

5.1. Take on imposter syndrome head on

Recognize and embrace the imposter syndrome that is often felt by subject matter experts who are part of a Data Curation team, especially those who are joining one for the first time. Working with engineers, scientists and product managers comes with a potential learning curve that can be intimidating. A Data Curator does not necessarily have to understand Python, active learning concepts, or cluster analysis. Although some curators will want to learn more about these related areas, it is not part of their required skill set or how they necessarily add the most value to your use case. Nevertheless, Data Curators have often shared with me that when compared with their counterparts in engineering and other disciplines, they often feel like they don't necessarily "deserve their positions." This natural but misguided sentiment must be countered directly and regularly. For example, I and the other managers on my team loudly make the point that our work enables those engineers to iterate, those scientists to test various iterations, and those product managers to judge whether or not user needs are being met. So in fact, Data Curators are the integral glue that all of those disciplines require for ground truth and quality measurement. Curators are often able to get very close to what an actual user experience is like, and their ideas about what is not working in that experience can often expose product teams to specific examples of user pain points. If a Data Curator feels intimidated because they cannot speak authoritatively about causal inference or a similar technical concept, we try to remind them about something they do uniquely know and can apply — like maybe knowing all nine official members of Wu-Tang Clan. This sort of knowledge — the type Data Curators often take for granted, given what disciplines they are comparing themselves to — is just as valuable when doing the majority of our work (i.e., annotation and evaluation), and you must coach Data Curators to treat their own knowledge with respect and value.

5.2. Frame the work as memorable

Data Curators can be ground truth oracles for heuristic or model training data, expert tuners of algorithms, or evaluators for algorithmic output, and are quite often all three. But your Data Curators, particularly when they are just joining, don't necessarily have this context or nomenclature. To keep this simple, try to frame most of the work encompassed in this diverse set of tasks as something memorable. Your Data Curation team should eventually learn more about precision and recall and the many related topics, but it's important that they're immediately able to connect their work with how it might be affecting models and, subsequently, end users. For example, we talk about "the 3 T's":

5.2.1. Training

Humans annotate data with labels or free text. This ground truth or "golden data" gives models high quality and high volume training data. There is more than one approach to machine learning (ML), but typically ML algorithms learn to make decisions from this training data, depending on the particular corpus(es) a use case involves. Typically this is the part of the process people are referring to when the term "humans-in-the-loop" is used.

5.2.2. Tuning

Humans tune the model in various ways, but mostly by scoring data to track things like the limiting of accurate predictions due to overfitting, edge cases a model/classifier has not seen yet, or new categories and attributes in a schema that a model needs.

5.2.3. Testing

Humans test, validate and evaluate a model by scoring its outputs, especially in places where an algorithm has low confidence about a correct judgment or high confidence about an incorrect judgment. This is usually done with test sets to make the model robust and less likely to overfit or retain biases.
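To make the 3 T's concrete, here is a minimal sketch of one such round, assuming a scikit-learn-style text classifier. The toy data and the curator_review() placeholder are illustrative assumptions, not our production pipeline.

```python
# A minimal sketch of one "3 T's" round with a scikit-learn classifier.
# The toy data and curator_review() are hypothetical stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training: curator-annotated "golden data" seeds the model.
texts = ["gritty boom bap drums", "lush ambient pads",
         "trap hi-hats and 808s", "drifting synth textures"]
labels = ["hip hop", "ambient", "hip hop", "ambient"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Testing: score fresh outputs and surface low-confidence judgments
# for human review.
candidates = ["dusty jazz loop with rapping", "field recordings of rain"]
probabilities = model.predict_proba(candidates)
uncertain = [text for text, p in zip(candidates, probabilities)
             if p.max() < 0.6]

def curator_review(items):
    """Placeholder for the human step: experts supply true labels."""
    return [(item, "hip hop") for item in items]  # illustrative only

# Tuning: corrected labels flow back into the training set and the
# model is refit, closing the loop.
for text, label in curator_review(uncertain):
    texts.append(text)
    labels.append(label)
model.fit(texts, labels)
```

Everything around the review step is plumbing; the review step itself is where the team's subject matter expertise lives.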
5.3. Make tenets and live by them

Your Data Curation team is not just employed to do data clean-up work as an afterthought — they are there to practice a tangible, measurable and integral discipline. Most legitimate disciplines have tenets, and in Data Curation you must have bold tenets. For example:

5.3.1. Tenet 1

Every user should feel like our product gets them, regardless of who they are, where they are from, where they live, or what they like.

5.3.2. Tenet 2

Global growth is dependent on understanding cultural nuances within our products.

5.3.3. Tenet 3

Personalization is not just our products — it is truly the end-to-end user journey.

5.3.4. Tenet 4

Subject matter expertise cannot be automated, and the success of our products depends on alignment with collaborative influence.

5.3.5. Tenet 5

We reject the false dichotomy of human vs. machine and embrace the necessary and powerful collaboration of that relationship.

5.4. Develop tools and make it fun

Always be willing to develop and maintain tools and best practices that are easy for Data Curators to use, based on sound best practices from human-computer interaction research. These tools should be dependable and flexible — do not just use spreadsheets for work your Data Curation team will be repeating regularly. For example, spreadsheets work fine for many tasks, but as an annotation and evaluation tool they are incomplete interfaces. In our case, we developed an internal tool that integrates with spreadsheets, but adds a number of benefits, and is self service. The tool sets up each would-be spreadsheet row as a "card" (the tool is amusingly called "cardi" in tribute to one of our favorite rappers). It can adapt to any schema, handle enriched URIs for content playback, and produce on-the-fly analytics to track progress or trends from an evaluation. By all measures available, investing the time in this tool tripled our productivity, because its features were sourced from its Data Curating practitioners directly. Without the right tool, either purchased or developed, you will always be leaving some time, data and quality on the table.

Also, applying bespoke best practices can be fun! There is no harm in finding relevant and creative ways to visualize important concepts germane to the work you are doing as a team, as shown in Figure 1.

Figure 1: Design memorable ways to aid your team's understanding of statistical concepts in a manner relevant to their subject matter expertise. Here we see a fun way to remember the difference between Type 1 and Type 2 errors relevant to the domain the experts are working in. Created by the author, using photographs from his own collection and via the Library of Congress, William P. Gottlieb Collection [Public domain] (https://loc.gov/item/gottlieb.00151).
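The internal tool itself is not public, but a minimal sketch of the underlying idea, a schema-agnostic "card" per row with running progress analytics, might look like the following; every name and field here is a hypothetical illustration, not the actual tool's API.

```python
# A schema-agnostic "card" abstraction in the spirit of the internal
# tool described above. All names and fields are hypothetical sketches.
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class Card:
    """One would-be spreadsheet row, plus playback and a judgment."""
    fields: Dict[str, Any]             # adapts to whatever schema the task uses
    content_uri: Optional[str] = None  # enriched URI for in-card playback
    judgment: Optional[str] = None     # filled in by the curator

class Deck:
    """A task's cards, with on-the-fly progress analytics."""

    def __init__(self, rows, uri_key="uri"):
        self.cards = [Card(fields=row, content_uri=row.get(uri_key))
                      for row in rows]

    def annotate(self, index: int, judgment: str) -> None:
        self.cards[index].judgment = judgment

    def progress(self) -> float:
        done = sum(1 for card in self.cards if card.judgment is not None)
        return done / len(self.cards)

# Usage: rows could come from any spreadsheet export.
deck = Deck([{"uri": "spotify:track:123", "genre_hint": "ambient"},
             {"uri": "spotify:track:456", "genre_hint": "hip hop"}])
deck.annotate(0, "coherent")
print(f"{deck.progress():.0%} complete")  # 50% complete
```

The design point is the separation: the schema lives in the data, not the tool, which is what lets one interface adapt to any evaluation task.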
5.5. KPIs aren't always obvious but are always necessary

KPIs can be hard to come by and are often contextual when it comes to a Data Curation team. You can use raw counts of annotations in a database, connections made in a graph, or rates of project completion over time. Yet we have found that the better metric is something closer to the number of tests that launched over a quarter because of our team's work. When possible, any corresponding positive movement on numbers like consumption or retention is nice, but our mandate is to unlock the potential for those improvements — it is the responsibility of the product team to actually improve their code and the resulting product consumption. You can always learn a lot about how much value you are adding and where you can have the biggest impact by staying close to product development, so test launch measurement is a helpful quantification.

For example, when a product was in development, a Data Curation team "Performed a heuristic review, where (they) reviewed a number of (examples) with a variety of taste overlap scores." [1] The KPI the Data Curation team cared about in the evaluation was getting something test-ready by "identifying issues." This sort of focus on the value of the work proves critical to Data Curation engagement — it is the "why" the team is often looking for, and it can add energy to team morale and motivation.

5.6. Use the right evaluation framework

Having the right evaluation framework provides Data Curation teams with a formal and interoperable set of attributes that both focuses the feedback Data Curators generate and provides clear reporting of that feedback to stakeholders. For example, our Data Curation team has developed a "Content Recommendation Scorecard" for evaluating products or listening experiences against acceptable quality levels. Given the cognitive complexity of trying to leverage subject matter expertise in an objective way, the framework allows the team to rate a playlist or a track using several dimensions of quality: attributes like coherence or representation. When Data Curators and product teams are speaking an overlapping language, curators can ensure that they are evaluating systems consistently, and product teams can determine takeaways like "the new approach more strongly met our criteria in terms of the attributes we wanted to optimize for." [1] A detailed framework might take time to construct and fine-tune, as a healthy level of inquiry should be applied within whatever dimensions you deem appropriate. Before you develop a more rigorous evaluation framework, you can keep it simple with the three dimensions below (a minimal sketch of how they might be recorded follows the list):

5.6.1. Personal Relevance

Does the recommendation match user tastes and personal preferences?

5.6.2. Cultural Relevance

Does the recommendation account for the current cultural or localized context, like contemporary trends or appropriate language?

5.6.3. Expert Artisanship

Does the recommendation feel brilliant - made by someone who knows the material inside and out and its relation to user taste?
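As promised above, here is a minimal sketch of how a starter scorecard built on these three dimensions might be recorded and rolled up. The rating scale and aggregation are hypothetical illustrations, not the team's actual Content Recommendation Scorecard.

```python
# A minimal starter scorecard using the three dimensions above.
# The 1-5 scale and mean aggregation are hypothetical sketches.
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

DIMENSIONS = ("personal_relevance", "cultural_relevance", "expert_artisanship")

@dataclass
class ScorecardEntry:
    item_uri: str           # the playlist or track under review
    scores: Dict[str, int]  # dimension -> rating on a shared 1-5 scale

def summarize(entries: List[ScorecardEntry]) -> Dict[str, float]:
    """Roll curator ratings up into per-dimension averages for reporting."""
    return {d: mean(entry.scores[d] for entry in entries) for d in DIMENSIONS}

# Usage: two curators judging the same playlist.
entries = [
    ScorecardEntry("spotify:playlist:abc",
                   {"personal_relevance": 4, "cultural_relevance": 5,
                    "expert_artisanship": 3}),
    ScorecardEntry("spotify:playlist:abc",
                   {"personal_relevance": 3, "cultural_relevance": 4,
                    "expert_artisanship": 4}),
]
print(summarize(entries))
# {'personal_relevance': 3.5, 'cultural_relevance': 4.5, 'expert_artisanship': 3.5}
```

Keeping the dimensions in one shared list is what makes the framework interoperable: curators rate against the same attributes that stakeholders see in reporting.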
These tasks require thoughtful work and consistent standards. Without sampling actual user segments across our most important cohorts to see and hear what various product experiences are surfacing to them, you are always sort of guessing. Data Curation removes some of that guesswork, enabling stakeholders with directional analysis that leads to beneficial action.

6. Conclusions

Some key takeaways from this paper center around how to find and identify the right experts, how to support and work with those experts, and how to keep them engaged in order to retain them. They are often pulled from my time in a business environment, but can also apply to an academic one. Building and maintaining a Data Curation team will directly impact any product team that leverages their expertise. Finding the right talent and engaging that talent to retain them is an important consideration, and as I have articulated in this paper, there are specific steps anyone responsible for a Data Curation team can take to optimize for both.
Acknowledgments

Thanks to my entire Data Curation team, past and current, and my colleagues in Spotify's Insights and Research communities, especially Sam Way, Claudia Huff, Aditya Ponnada, Ang Li, Praveen Ravichandran, Mounia Lalmas-Roelleke, Henriette Cramer, and Laura Lake, for your guidance and support. This paper would not exist without all of your generously shared wisdom.

References

[1] J. Lamere, A look behind Blend: The personalized playlist for you... and you, 2021. URL: https://engineering.atspotify.com/2021/12/a-look-behind-blend-the-personalized-playlist-for-you-and-you/.

[2] V. Luckerson, These are the people picking your next internet radio song, 2015. URL: https://time.com/3947080/streaming-music-human-curators/.