A Realist Representation of Social Identity Data

Amanda Hicks, Ph.D.
Department of Health Outcomes and Policy
University of Florida
Gainesville, USA
aehicks@ufl.edu

This work was supported in part by the NIH/NCATS Clinical and Translational Science Awards to the University of Florida UL1 TR000064 and by award CDRN-1501-26692 from the Patient Centered Outcomes Research Institute (PCORI). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIH/NCATS or PCORI.

Abstract: Social identities merit special treatment in realist ontologies. Their ontological status is unsettled, so we should model them in a manner that is agnostic with respect to their ontological status. Nevertheless, there is a clear criterion for determining whether a specific person has a particular identity, namely, whether that person asserts that they do. This social act forms the basis for a realist representation, not of social identities themselves, but of data about social identities. We report the representation of social identities in the Ontology of Medically Related Social Entities and show that it supports data integration and retrieval.

Keywords: data integration; demographic information; ethnicity; gender identity; identity; Ontology of Medically Related Social Entities; race

I. INTRODUCTION

Demographic information is widely used in information systems. In medical and health information systems it supports a variety of biomedical and informatics tasks such as cohort discovery, statistical comparison of groups of people, and record linkage [2]. Common demographic data collected in medical settings include birth date, preferred language, race, ethnicity, and sex or gender. In 2011 the Institute of Medicine recommended collecting information on sexual orientation and gender identity (as distinct from biological sex) in electronic health records [3], and Stage 3 for Meaningful Use requires that electronic health records (EHR) certified for meaningful use have fields for collecting information on sexual identity by 2018 [4-6]. It is, therefore, increasingly important to semantically represent gender identity and other social identities coherently to support data retrieval and integration. Hogan et al. [2] discuss previous work on realist representations of demographic information in general in the Ontology of Medically Related Social Entities (OMRSE).

This paper treats social identities as a special subset of demographic information and describes a realist representation of social identities that supports the integration and retrieval of data about people according to their social identities. For the purpose of this paper, social identities include (but are not limited to) race, ethnicity, and gender identity.

Section Two describes the background assumptions and hypothesis of this paper. Section Three provides background on data collection for gender identity, sexual orientation, race, and ethnicity, drawing important distinctions for understanding the semantics of terms used to describe these types of social identities. Section Four describes a framework for ontologically representing social identities in OMRSE to support semantic integration of demographic data. Section Five describes the results of the validation of our representation with competency questions. Section Six discusses results and future work.

II. BACKGROUND ASSUMPTIONS AND HYPOTHESES

Hogan et al. [2] note that demographic data are about a heterogeneous group of things; they may include data about preferred language, biological sex, gender identity, race, date of birth, and marital status. The ontological status of some of these entities is clear. Biological sex is a quality of an organism [7]; date of birth is a time interval; and marital status is the result of a contractual act. However, the ontological status of race, ethnicity, and gender identity is controversial [8, 9]. For this reason, this paper does not attempt to answer the question, what kind of things are race, ethnicity, and gender identities? Instead, it places the process of asserting an identity at the center of a realist representation of social identity data in OMRSE.

We begin our work with the assumption that there is a difference between demographic data such as gender identity, race, and ethnicity, on the one hand, and sex, birth date, and marital status on the other. Although the latter group is heterogeneous, its members do share something significant in common: statements about each can be verified as inter-subjective facts about the world. Although we often gather data about a person by asking questions such as "Are you male or female?", "What is your birth date?", and "Are you married?", biological sex, birth date, and marital status refer to inter-subjective features of the world. If by 'sex' we mean karyotypic or phenotypic sex, we can perform genetic testing to determine a person's karyotype or a physical examination to determine phenotype. While we cannot directly observe the date of a person's birth once the event is completed, a birth date is something that multiple people observe and come to consensus on. We can determine that a person is married by producing a marriage certificate; if there is no marriage certificate, there is no marriage. In this sense, reports of one's own sex, birth date, and marital status are corrigible in the face of facts about the inter-subjective world. However, reports of one's own gender identity, race, and ethnicity are not similarly corrigible. That is, if Jane says that she is a black, Latina woman, she has already provided all the information we can hope to acquire to determine and verify her race, ethnicity, and gender identity. There is nothing in either the physical or the social world that we can consult to
verify the truth of these claims unless it is to return to Jane herself and ask her to verify these statements.

Nevertheless, it seems that it is possible for Jane to provide misinformation about at least some aspects of her identity. For example, one might object that if Jane has white, non-Latino parents who insist that Jane herself is neither black nor Latina, this constitutes inter-subjective evidence that her claims are false. This scenario underscores the importance of the context of data collection for determining the meaning of the data collected. As we will see in the next section, the race and ethnicity data collection practices and guidelines prevalent in the U.S. healthcare system explicitly rule out defining race and ethnicity in terms of "blood" quotas or other inclusion criteria. Furthermore, the definitions that do exist for these terms are seldom presented to respondents. The result is that the data that are currently and routinely collected only tell us how the person actually identifies themselves. Notice how this affects the case where Jane's parents are white and non-Latino. In the absence of clear inclusion and exclusion criteria for "white" and "Latino", all we know is that Jane's parents identify themselves as white and non-Latino. This does not rule out Jane having reasons to identify some other way. Finally, we may be concerned that Jane has deliberately provided misinformation about her identity. There are two things to note about this scenario. First, no ontology can get around the problem of potential dishonesty or bad data collection practices, nor is any ontology intended to. Second, even in the broader context of data management we do not regard this as a pressing issue, since we have no reason to suspect that providing deliberately misleading information about one's identity is a common enough practice to affect the results of data quality and data analysis significantly.

Our hypothesis was that representing social identity data with respect to the process of identifying rather than in terms of identities themselves can support data integration and retrieval in a realist framework while avoiding controversial ontological commitments.

III. DATA COLLECTION FOR GENDER IDENTITY, RACE, AND ETHNICITY

For the purpose of this work we have adopted the definition and characterization of gender identity in [1]. For race and ethnicity we use the Office of Management and Budget (OMB) definitions and guidelines [10] since this standard is already widely used in biomedicine. Most medical terminologies, coding schemes, and surveys use terms that are intended to comply with the OMB minimum set of categories for race and ethnicity [11, 12].

A. Gender identity

Table 1 contains definitions of terms related to sex and gender as presented in [1]. These definitions have been influential in shaping the discussion of the collection of data about gender identity [11] and conform to standard usage where the distinctions between (a) sex and gender and (b) gender expression and gender identity are observed.

TABLE I. DEFINITIONS FROM THE IOM 2011 REPORT ON THE HEALTH OF LGBT PEOPLE

TERM | DEFINITION
Sex | a biological construct, referring to the genetic, hormonal, anatomical, and physiological characteristics on whose basis one is labeled at birth as either male or female
Gender | the cultural meanings of patterns of behavior, experience, and personality that are labeled masculine or feminine
Gender Expression | the manifestation of characteristics in one's personality, appearance, and behavior that are culturally defined as masculine or feminine
Gender Identity | a person's subjective sense of his or her gender

By examining these definitions we can see that the verification criterion for gender identity is the individual's own subjective report of their identity rather than an objective or inter-subjective criterion.

Gender identity does not refer to biological and physiological characteristics since it is distinct from biological sex. Furthermore, gender identity cannot be ascertained or verified by gender expression. Consider two cases. 1) Some trans individuals have not socially transitioned to their perceived identity. A biological male who lives as a man but has a subjective sense of being a woman may have a masculine gender expression that would not be indicative of their feminine gender identity. 2) Some people adopt the cultural norms associated with a particular gender expression, but identify differently. For example, a non-binary person may have a masculine gender expression without identifying as a man.

B. Race and Ethnicity

The Office of Management and Budget (OMB) has defined a minimal set of categories for collecting data on race and ethnicity in the U.S. Census. These categories are also used in health care settings and health research in the U.S. [11, 12]. It is important to note that, while the OMB defines the minimum race and ethnicity categories partially in terms of genealogy, it explicitly does not regard the categories as naturalistic, anthropological, or scientific, but instead as social constructs. Furthermore, it encourages self-identification in the data collection process wherever possible [11].

The OMB definitions for race characterize racial categories on the basis of their descent from the original peoples of some geographic region (Table 2). This characterization poses problems for a realist representation. First, the criterion is ambiguous insofar as it does not define 'original peoples'. At what point in human history are original peoples determined? Second, the criterion is not applied consistently. 'American Indian or Alaska Native' is defined as a person who has origins in any of the original peoples of North and South America (including Central America), and maintains tribal affiliation or community attachment (emphasis added). This is the only race category that has the extra requirement of a social relationship, which renders the categories not exhaustive. For example, Mexican-Americans who have origins in the original peoples of South or Central America but do not maintain a tribal affiliation or community attachment do not fit any of the OMB categories for race.

However, despite the genealogical criterion in the definitions of these terms, the OMB guidelines stress interpreting statements about race as socio-cultural characteristics that involve ancestry rather than as biological or genetic characteristics. This connection to ancestry suggests that the verification criterion for an OMB-based statement about racial identity is a historical fact, since ancestry is determined by inter-subjective criteria. However, this contrasts with additional guidelines for data collection that indicate that the verification criterion is the subject's response to OMB questions about race:

• "Respect for individual dignity should guide the processes and methods for collecting data on race and ethnicity; ideally, respondent self-identification should be facilitated to the greatest extent possible, recognizing that in some data collection systems observer identification is more practical."

• "do not establish criteria or qualifications (such as blood quantum levels) that are to be used in determining a particular individual's racial or ethnic classification." (original emphasis)

• "do not tell an individual who he or she is, or specify how an individual should classify himself or herself." (original emphasis) [11].

TABLE II. DEFINITIONS FOR THE OFFICE OF MANAGEMENT AND BUDGET MINIMUM CATEGORIES FOR RACE

OMB CATEGORY | OMB DEFINITION
American Indian or Alaska Native | A person having origins in any of the original peoples of North and South America (including Central America), and who maintains tribal affiliation or community attachment.
Asian | A person having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent including, for example, Cambodia, China, India, Japan, Korea, Malaysia, Pakistan, the Philippine Islands, Thailand, and Vietnam.
Black or African American | A person having origins in any of the black racial groups of Africa. Terms such as "Haitian" or "Negro" can be used in addition to "Black or African American."
Native Hawaiian or Other Pacific Islander | A person having origins in any of the original peoples of Hawaii, Guam, Samoa, or other Pacific Islands.
White | A person having origins in any of the original peoples of Europe, the Middle East, or North Africa.

Similarly to race, the OMB's definition of ethnicity also invokes genealogy. The term 'Hispanic' refers to persons who trace their origin or descent to Mexico, Puerto Rico, Cuba, Central and South America, and other Spanish cultures.

However, the same caveats that were discussed for race apply to ethnicity, namely, 1) 'ethnicity' should not be interpreted as referring to biological or genetic characteristics, but rather as referring to ancestry, and 2) the verification criterion for OMB-based statements about ethnicity is the subject's response to OMB-based questions about ethnicity.

Finally, we should not expect existing data on race and ethnicity to reflect a consistent, genealogical criterion since most patients are not presented with definitions of racial and ethnic terms during the intake process at a clinic or on a survey and because the language used to describe these categories may vary at the discretion and preference of the person(s) designing the form. For example, 'black', 'African American', and 'black or African American' can all be used to describe the same racial category.

In short, the ontological types of things that a race and ethnicity datum might be about are heterogeneous, and to make matters worse, there is often not a single type that is common to all of them that would provide either necessary or sufficient conditions. Furthermore, these categories are not historically stable and stem from contingent circumstances. Even if an ontologist were confident that there are universals for social identities, the historical contingency of identity categories makes ontologically representing these social identities as stable universals impractical. Nevertheless, ontologists can provide a realistic representation of how people actually identify when asked to do so. The lack of inter-subjective verification criteria for identity statements, in tandem with the stress on self-identification in the instructions, provides a principled basis for representing social identity data differently from data with an inter-subjective or objective verification criterion such as birth date and diagnosis.

IV. A REALIST REPRESENTATION OF IDENTIFICATION PROCESSES AND IDENTITY DATA

In light of the fact that it is not clear what kinds of things identities are, OMRSE does not model identities as such. However, we do know how identity data are collected and that their verification criterion involves the process of identifying. For this reason, we make the processes of asserting an identity, rather than identities themselves, central to representing social identity data. An identification process is a planned process that might utilize a specific vocabulary or common data model, such as the OMB minimal categories for race and ethnicity. However, some identification processes might not use a common vocabulary or common data elements. For example, some may only utilize a free text field. Identification processes, as we represent them here, are planned processes that record an identity statement about an individual person. They should not be confused with the private and internal mental or emotional processes that involve or give rise to a subjective sense of one's identity. Identification processes, as we describe them here, are planned, social, and result in identity data. OMRSE represents these data as information content entities that are the outputs of identification processes. Conversely, all identity data are the specified outputs of an identification process. Fig. 1 illustrates the representation of identity data and identification processes in OMRSE.

[Figure not reproduced. Node labels: Racial Identity Datum; Racial Identification Process.]
Fig. 1. Representation of Identification Data and Identification Processes in OMRSE.

Subclasses of identification process include racial identification process, ethnic identification process, and gender identification process. Identification processes that use a particular set of terms or coding scheme can be the basis of further descendant classes of identification process. For example, OMB racial identification process and PCORnet racial identification process are subclasses of racial identification process (Fig. 2). The latter represents racial identification used in the PCORnet Common Data Model (CDM), a data standard for representing clinical patient data from clinical sites across the US for use in the National Patient-Centered Clinical Research Network (PCORnet) [13].

[Figure not reproduced. Node labels: OMB Racial Identity Datum; OMB Racial Identification Process; OMB Asian Identity Datum; PCORnet Racial Identity Datum; PCORnet Racial Identification Process; PCORnet Asian Identity Datum. Relation label: has specified output.]
Fig. 2. An example of how to represent heterogeneous social identity data.

Table 3 contains definitions related to OMB Asian as an example of how identities that employ a common data model or common vocabulary are represented with this approach.

TABLE III. SAMPLE DEFINITIONS FOR REPRESENTING RACIAL IDENTITY DATA

TERM | ONTOLOGICAL DEFINITION
OMB racial identity datum | A racial identity that is the output of a racial identification process that uses OMB terminology for race or terminology that is mapped to the OMB race terms.
OMB Asian identity datum | An OMB racial identity datum about a person who is identified as having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent.
Subject of an OMB Asian identity datum | A human being who is the subject of an OMB Asian identity datum.
Subject of a self-identified OMB Asian identity | A human being who is the subject of an OMB Asian identity datum and who is the agent of the planned process for which that identity is a specified output.

A. Extended categories

The OMB guidelines for race and ethnicity allow data collectors to use a larger number of race and ethnicity categories as long as they are extensions of and mappable to the OMB minimum categories, i.e., as long as they do not introduce new categories but are equivalent to, or subcategories of, those in the minimal set [10]. In cases where the expanded set includes subcategories of OMB classes, corresponding identity data can be introduced as a subclass of the appropriate OMB datum. For example, Fig. 3 shows CDC Spanish Basque datum as a subclass of OMB Hispanic or Latino datum.

[Figure not reproduced. Node labels: OMB Hispanic or Latino datum; CDC Spanish Basque datum; CDC Ethnic Identification Process; Homo sapiens; instances EI1, HS1, USCSP1. Relation labels: is about; is specified output of; has participant.]
Fig. 3. Representation of Instance Level Social Identity Data.

B. Integrating Heterogeneous Data

Despite using similar categories and identical definitions, the PCORnet CDM and the OMB racial categories describe different classes of people. The OMB guidelines allow people to select more than one race [14]. The PCORnet CDM does not. Instead, the PCORnet CDM has a class for multiple race. Consider a person who identifies as both Black and Asian according to the OMB definitions. Under the OMB guidelines, in which a person can select more than one race, that person would be retrieved by a query for people who identified as Black, by a query for people who identified as Asian, and by a query for people who identified as both. If the same person were filling out a medical intake form using the PCORnet CDM guidelines, they would be instructed to choose only one race. They could, therefore, choose either Black or Asian or multiple race, but they could not choose both Black and Asian. With OMB standards, the classes of people who identify as Black and who identify as Asian can overlap. For the PCORnet CDM, they are disjoint. Therefore, the class of people who can identify with OMB Asian is not identical with the class of people who can identify with PCORnet Asian but is actually a superclass of it. It is worth noting that transforming OMB-compliant racial data into the PCORnet CDM results in an irretrievable loss of information. Namely, persons who have identified with multiple OMB races will be indicated as identifying with the semantically less rich category "multiple races" in the PCORnet CDM. This loss of information is revealed by accurately representing the semantics of these coding schemes, but in such cases not even a good ontology can recover information that has not been stored.

We developed a strategy for representing social identity data that supports integrating OMB and PCORnet CDM data. This strategy is not idiosyncratic to these data models, but is generalizable. This representation involves articulating the relations among classes of people, for example those who identify with OMB Asian and those who identify with PCORnet Asian. The OMB category Asian means the person has declared some Asian descent. The PCORnet CDM category Asian means the person has declared only Asian descent. Fig. 2 illustrates how identification processes and identification data that result from these two heterogeneous coding schemes are related. Notice that PCORnet racial identity datum is not a subclass of OMB racial identity datum. Since the PCORnet racial identity categories actually have a different meaning from the OMB racial identity categories, it would be inappropriate to use subclass relations to connect them. We are currently considering using SKOS:broader and SKOS:narrower to describe the relations between the intensional meanings of the terms, but it is not clear that this will support data retrieval.

V. VALIDATION AND RESULTS

Competency questions are frequently used to validate modeling decisions in ontologies. They are questions that reflect the needs of the end user and that the ontology ought to be able to support. We partially validated the suitability of this representation for data retrieval and data integration with the competency questions below. This validation is only partial since there are outstanding competency questions that require additional modelling decisions. We generated an OWL file with synthetic individuals and constructed Description Logic queries that answered three out of four of the competency questions. These queries are listed below in Manchester syntax. The OWL file with synthetic individuals is available at https://github.com/ufbmi/socid.

1. Which people are racially identified as Asian according to the OMB criteria?

inverse 'is about' some 'Asian identity'

2. Which people are racially identified with multiple races according to OMB criteria?

inverse 'is about' min 2 'OMB racial identity'

3. Which people are racially identified with more than one race in either OMB or PCORnet CDM?

inverse 'is about' min 2 'OMB racial identity' or inverse 'is about' some 'PCORnet multiple race identity'

4. Which people are racially identified only as Asian according to OMB or PCORnet criteria?

Competency Question 4 requires indicating that the OMB race categories are distinct from one another. For example, we must decide whether the classes OMB Asian identity datum and OMB Alaska Native or Native American datum are disjoint. Adding a disjointness axiom would rule out the possibility of a single identity datum that indicates that a person has both identities, but may support this competency question. Future work will focus on the best way to represent this situation.

We have included this representation of identity data in OMRSE, available at www.github.com/ufbmi/omrse.

VI. DISCUSSION

This proposal diverges from traditional realist approaches insofar as it advocates representing social identities in terms of their verification criteria rather than according to their ontological properties. This approach has the advantage of supporting data integration and retrieval according to realist principles, without making dubious ontological commitments. It also does not sacrifice clear semantics, interoperability of data, or data retrieval. While our competency questions only address racial identity, they do show that different types of social identity data that have been gathered according to different criteria can be adequately represented according to the general ontological principles described in this paper. Analogous questions involving ethnicity and gender identity can be expected to be handled by this approach since they have the same logical form.

Future work includes representing relations between types of identity data to handle the remaining competency question, developing a set of gender identity terms to include in OMRSE, and querying real patient data to assess the impact of this representation on cohort discovery tasks that include race and ethnicity.

VII. CONCLUSIONS

Our hypothesis was that representing social identity data with respect to processes of identifying rather than identities themselves can support data integration and retrieval in a realist framework while avoiding controversial ontological commitments.

We have produced a BFO-based representation of race and ethnicity identities and developed strategies for semantically integrating social identity data that have been collected using a) the OMB minimal categories for race and ethnicity, b) extensions of the OMB minimal categories for race and ethnicity, and c) common data models such as the PCORnet CDM whose semantics differ from the OMB minimum categories due to pick one/pick many discrepancies. We have added this representation to OMRSE and produced a synthetic data set in an OWL file to test our competency questions. Our representation to date handles three out of four of our competency questions.

ACKNOWLEDGMENTS

Thanks to William R. Hogan for reviewing and commenting on the manuscript and to the Clinical and Translational Science Ontology Group for providing feedback on a presentation of earlier work at the Charleston, SC meeting in September 2015.

REFERENCES

[1] Institute of Medicine (US) Committee on Lesbian, Gay, Bisexual, and Transgender Health Issues and Research Gaps and Opportunities. The health of lesbian, gay, bisexual, and transgender people: Building a foundation for better understanding. Washington (DC): National Academies Press (US); Buffalo, New York: 2011.
[2] Hogan WR, Garimalla S, Tariq SA, editors. Representing the reality underlying demographic data. International Conference on Biomedical Ontologies (ICBO); 2011.
[3] Institute of Medicine (US) Committee on Lesbian, Gay, Bisexual, and Transgender Health Issues and Research Gaps and Opportunities. Collecting sexual orientation and gender identity data in electronic health records: Workshop summary. Washington DC: The National Academies Press, 2013. ISBN 0309268044, 9780309268042.
[4] Cahill SR, Baker K, Deutsch MB, Keatley J, Makadon HJ. Inclusion of sexual orientation and gender identity in Stage 3 Meaningful Use Guidelines: A huge step forward for LGBT health. LGBT Health. 2015.
[5] Department of Health and Human Services, Centers for Medicare and Medicaid Services. 42 CFR Parts 412 and 495 [CMS-3310-FC and CMS-3311-FC], RINs 0938-AS26 and 0938-AS58. Medicare and Medicaid programs; Electronic Health Record Incentive Program-Stage 3 and modifications to Meaningful Use in 2015 through 2017. 2015 October 7.
[6] Department of Health and Human Services, Centers for Medicare and Medicaid Services. 45 CFR Part 170, RIN 0991-AB93. 2015 edition health information technology (Health IT) certification criteria, 2015 edition base electronic health record (EHR) definition, and ONC Health IT certification program modifications. 2015 October 6.
[7] Smith B, Ceusters W. Ontological realism: A methodology for coordinated evolution of scientific ontologies. Applied Ontology. 2010;5(3-4):139.
[8] James M. Race 2016 [updated March 16, 2016; cited 2016 April 19]. Available from: http://plato.stanford.edu/archives/spr2016/entries/race/.
[9] Mikkola M. Feminist perspectives on sex and gender 2016 [updated January 29, 2016; cited 2016 April 19]. Available from: http://plato.stanford.edu/archives/spr2016/entries/feminism-gender/.
[10] Revisions to the standards for the classification of federal data on race and ethnicity, (1997).
[11] Helsing K, editor. Capturing social and behavioral domains and measures in electronic health records. 143rd APHA Annual Meeting and Exposition (October 31-November 4, 2015); 2015: APHA.
[12] Racial and ethnic categories and definitions for NIH diversity programs and for other reporting purposes, NOT-OD-15-089 (2015).
[13] PCORnet Common Data Model (CDM) [updated December 18, 2015; cited 2015 April 25]. Available from: http://www.pcornet.org/pcornet-common-data-model/.
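
As a supplementary illustration of the pick one/pick many discrepancy discussed in Section IV.B and of competency questions 1 to 3 in Section V, the following is a minimal Python sketch. It is not the OWL representation in OMRSE and does not reproduce the Description Logic queries. The `IdentityDatum` class, the helper functions, the person identifiers `p1` to `p4`, and the label "Multiple race" are hypothetical names introduced here for illustration only. The sketch also counts distinct categories under a closed-world simplification, whereas an OWL reasoner evaluating `min 2` needs the data to be known to be distinct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentityDatum:
    """Hypothetical stand-in for an identity datum: the specified output of
    one identification process about one person."""
    person: str    # identifier of the person the datum is about
    scheme: str    # coding scheme of the identification process ("OMB" or "PCORnet")
    category: str  # category the person was identified with


def subjects(data, scheme, category):
    """People who are the subject of at least one datum with this scheme and category."""
    return {d.person for d in data if d.scheme == scheme and d.category == category}


def omb_categories(data):
    """Map each person to the set of OMB categories they were identified with."""
    result = {}
    for d in data:
        if d.scheme == "OMB":
            result.setdefault(d.person, set()).add(d.category)
    return result


def multiple_omb(data):
    """People with two or more distinct OMB racial identity data (closed-world count)."""
    return {p for p, cats in omb_categories(data).items() if len(cats) >= 2}


def omb_to_pcornet(data):
    """Convert OMB data to a pick-one scheme: several OMB races collapse to one
    'Multiple race' datum, so the individual selections are no longer stored."""
    cats_by_person = omb_categories(data)
    converted = []
    for person in sorted(cats_by_person):
        cats = cats_by_person[person]
        category = next(iter(cats)) if len(cats) == 1 else "Multiple race"
        converted.append(IdentityDatum(person, "PCORnet", category))
    return converted


# Synthetic individuals (assumed for illustration, not taken from the socid file).
data = [
    IdentityDatum("p1", "OMB", "Asian"),
    IdentityDatum("p1", "OMB", "Black or African American"),
    IdentityDatum("p2", "OMB", "Asian"),
    IdentityDatum("p3", "PCORnet", "Multiple race"),
    IdentityDatum("p4", "PCORnet", "Asian"),
]

cq1 = subjects(data, "OMB", "Asian")                                # question 1
cq2 = multiple_omb(data)                                            # question 2
cq3 = multiple_omb(data) | subjects(data, "PCORnet", "Multiple race")  # question 3

# After conversion, p1 can no longer be retrieved by a query for Asian.
converted = omb_to_pcornet(data)
asian_after_conversion = subjects(converted, "PCORnet", "Asian")
```

In this sketch, the person who selected both Asian and Black under the OMB scheme is returned by the OMB Asian query but, once converted to the pick-one scheme, is returned only by a query for the multiple race category, which mirrors the information loss described in Section IV.B.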