<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Are knowledge graph embedding models biased, or is it the data that they are trained on?</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Wessel</forename><surname>Radstok</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Data Intensive Systems Group</orgName>
								<orgName type="institution">Utrecht University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Melisachew</forename><forename type="middle">Wudage</forename><surname>Chekol</surname></persName>
							<affiliation key="aff0">
								<orgName type="laboratory">Data Intensive Systems Group</orgName>
								<orgName type="institution">Utrecht University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mirko</forename><forename type="middle">Tobias</forename><surname>Schäfer</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Media and Culture Studies</orgName>
								<orgName type="institution">Utrecht University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Are knowledge graph embedding models biased, or is it the data that they are trained on?</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">902E897186496B486C83E207B271E099</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Recent studies on bias analysis of knowledge graph (KG) embedding models focus primarily on altering the models such that sensitive features are dealt with differently from other features. The underlying implication is that the models cause bias, or that it is their task to solve it. In this paper we argue that the problem is caused not by the models but by the data, and that it is the responsibility of the expert to ensure that the data is representative of the intended goal. To support this claim, we experiment with two different knowledge graphs and show that the bias is present not only in the models, but also in the data. Next, we show that by adding new samples to balance the distribution of facts with regard to specific sensitive features, we can reduce the bias in the models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>For several days in early July 2018, Google and Apple's search assistants wrongfully reported that the man behind the Marvel comic books, Stan Lee, had passed away.<ref type="foot" target="#foot_1">4</ref> It did not take long for news articles to start popping up noting the unjustified death declaration. Although Google and Apple never officially reported on this issue, its source can likely be traced back to Wikidata. On June 27th, a Wikidata user ran their own script made to parse data from Wikipedia and insert the results as claims into Wikidata. This script mistakenly pronounced Stan Lee dead. Other users soon corrected the error, which resulted in an edit war so severe that the page had to be temporarily locked against vandalism. This is not the only occurrence of incorrect information in knowledge graphs causing issues in downstream search queries. In the second half of 2018, the former Guantanamo Bay detainee Omar Khadr was incorrectly returned by Google search for the query 'Canadian Soldiers'. Again, the cause was the script written by the aforementioned user. Although the issue was quickly resolved after online outrage, it cropped up twice more over a period of several months. It eventually led Google to take manual action to fix the knowledge graph.<ref type="foot" target="#foot_2">5</ref> In addition to the presence of incorrect information in knowledge graphs, introduced either by an error in the KG construction or intentionally by content curators, KGs can also be incomplete. As an example, in Freebase <ref type="bibr" target="#b1">[2]</ref>, over 70% of person entities have no known place of birth and over 75% have no known nationality <ref type="bibr" target="#b7">[8]</ref>. 
In Wikidata <ref type="bibr" target="#b13">[14]</ref>, we observe similar behavior: for instance, over 97% of humans have no known religion and over 83% of humans have no known spoken, written or signed languages. Subsets of both Wikidata and Freebase have been widely used for testing knowledge graph completion models. However, these subsets do not take the incompleteness of the KGs into account and are prepared in a way to test solely the accuracy of models. If the subsets are incomplete (or unbalanced), the models trained on them can be biased. For instance, the Wikidata12K <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b5">6]</ref> dataset contains 80% male and 20% female politicians. Clearly, this dataset is unbalanced and a model trained on it will likely overrepresent men in its predictions.</p><p>Indeed, this is shown in our experiments using the TransE <ref type="bibr" target="#b2">[3]</ref> model: when asked to predict the people most likely to be politicians, the top 100 ranked answers contain just 12.4% women, while the remaining 87.6% are male. In order to mitigate such biased predictions, there has recently been a growing effort towards adapting/extending KG completion models <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref>. These studies on bias analysis of KG embedding models focus primarily on altering the models such that sensitive features (such as gender, sexual orientation, etc.) are dealt with differently from other features. The underlying implication is that the models cause bias, or that it is their task to solve it. However, we found that the datasets on which the models are trained are biased/unbalanced. Although algorithms for the automatic balancing of data do exist <ref type="bibr" target="#b4">[5]</ref>, these are not trivial to apply to graph datasets. 
Our experiments showed unsatisfactory results using these methods.</p><p>Furthermore, adapting models to remove bias means that the resulting embeddings will only be bias-neutral with regard to the strength of the model used. That is, removing bias requires a bias detection model, and the extent of the bias removed depends on how much bias is detected. As a result, the embeddings are not truly neutral: a more powerful model might still be able to detect biases. Therefore, we argue that a domain expert must remain in the loop.</p><p>In this work, we address the problem by working directly on the data rather than altering KG embedding models. In other words, we investigate a new approach to balance a given dataset (and thereby mitigate bias): we automatically extend a dataset by extracting additional facts to complete missing values of sensitive features. Moreover, to motivate the proposed approach, we carried out a comprehensive analysis of the distribution of sensitive features in Wikidata, highlighting various skewed data distributions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>We group the related work into two classes of bias analysis: (i) knowledge graphs and (ii) embedding models.</p><p>Bias analysis of knowledge graphs. <ref type="bibr" target="#b6">[7]</ref> proposes methods to trace the provenance of crowdsourced fact checking to enable bias transparency, rather than aiming at eliminating bias from a KG. Furthermore, they investigate how paid crowdsourcing can be used to understand contributors' implicit bias. Specifically, they recruit click workers to verify controversial facts and study them as they do so, i.e., they track which search engines are used and at which position the URL used for validation was ranked in the results page. An example verification task is the question of whether Catalonia is a part of Spain or an independent country. The paper proposes adding both facts to the knowledge graph, with a statement testifying how much support there is for each fact.</p><p>[15] introduces ProWD, a framework and tool for profiling the completeness of Wikidata. Its completeness measure is based on Class-Facet-Attribute (CFA) profiles. For example, one could compare how often attributes such as "educated at" or "date of birth" are present for male, German computer scientists versus female, Indonesian computer scientists.</p><p>Bias analysis of embedding models. Bourli et al. <ref type="bibr" target="#b3">[4]</ref> present an analysis method for investigating gender bias with regard to occupation in entity embeddings. Specifically, they subtract the male embedding from the female entity embedding to obtain a bias vector. Projecting an occupation embedding onto this vector then gives them the bias in this occupation. 
Furthermore, they introduce a de-biasing approach that generates new de-biased embedding vectors from the existing ones by subtracting their component along the bias vector.</p><p>[10] conduct experiments on Wikidata and Freebase, and show that harmful social biases related to professions are encoded in the embeddings with respect to gender, religion, ethnicity and nationality. They first explain how traditional word embedding metrics do not apply to KG embeddings due to the transformations applied. They then provide a method for evaluating bias. Their method operates by increasing/decreasing an entity's score for a sensitive attribute (e.g., making it more male and less female) and then recording how the likelihood of a certain target triple being true changes (e.g., whether they are a nurse or a lawyer). As a follow-up, the authors present a novel approach to KG embedding where embeddings are trained to be neutral with respect to sensitive features using an adversarial loss function <ref type="bibr" target="#b8">[9]</ref>. To achieve this, they add a neural-network based classifier to the scoring function: scores are penalized when this classifier can predict the value of the sensitive attribute from the existing embedding. However, this means that the embeddings are only neutral with respect to the power of the model: a more powerful model might be able to infer the sensitive values.</p><p>These (and other) initiatives indicate that there is growing attention to bias in knowledge graphs, and efforts to make bias visible. As knowledge graphs are often collaborative repositories, it is relevant to provide users with accessible means for identifying possible bias. The examples above are helpful but limited in two ways: they are either tied to a specific knowledge graph, and/or cover only a limited number of attributes. A general framework might provide more possibilities to map bias in knowledge graphs and enable users to become aware of the distribution of items and attributes in a given knowledge graph. With their own subject-specific expertise, these users can then decide which bias is problematic, and how to address it.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: How many humans have at least one occurrence of a given property?</figDesc></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Wikidata Completeness Analysis</head><p>Wikidata is a large, open knowledge graph that acts as central storage for the structured data of other Wikimedia projects such as Wikipedia. Data is stored as claims, or triples, containing a subject item, a property and a value. Values are entities or literals such as a quantity, a string or even a coordinate. Items are identified through URIs starting with 'Q' (e.g., Q22686 for Donald Trump) and properties are identified through URIs starting with 'P' (e.g., P40 for child). Claims can be contextualized with additional data such as sources (for the data), ranks (in case of multiple values for a property) and qualifiers (e.g., to note that a fact was true at a specific point in time, or that a fact is disputed). A claim and its additional data are collectively referred to as a statement.</p></div>
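The claim-plus-context structure described above can be illustrated with a small sketch. The dictionary layout below is a simplification for illustration, not Wikidata's actual JSON schema; the Q/P identifiers are real Wikidata IDs (Q22686, P31, Q5 from the text, plus P585 "point in time" as an example qualifier):

```python
# A simplified view of a Wikidata statement: a (subject, property, value)
# claim plus optional context (rank, qualifiers, references).
claim = {
    "subject": "Q22686",   # Donald Trump
    "property": "P31",     # instance of
    "value": "Q5",         # human
}
statement = {
    "claim": claim,
    "rank": "normal",                       # used when a property has several values
    "qualifiers": {"P585": "+2019-01-01"},  # e.g. point in time
    "references": [],                       # sources backing the claim
}

def is_human(statements):
    """An item is a human if it carries the claim (x, P31, Q5)."""
    return any(
        s["claim"]["property"] == "P31" and s["claim"]["value"] == "Q5"
        for s in statements
    )

print(is_human([statement]))  # True
```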
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Completeness</head><p>We investigated how several properties are distributed among the class of humans in Wikidata. An item x is a human when it is an instance of (P31) human (Q5), i.e., item x must have the claim (x, P31, Q5). Using the Wikidata dump from 2021/03/31, we extracted 9,028,271 such items. We will now give a brief overview of some of our preliminary findings.</p><p>To begin, for each item in the subset we counted whether or not each property occurs among its claims. This gives us an overview of how often a property occurs at least once. The result is displayed in Figure <ref type="figure">1</ref>. Ignoring the predicate instance of, which per our definition is present on all humans, the most frequently occurring predicates are sex or gender (P21), occupation (P106), and given name (P735). These occur on 7,079,543 (78%), 6,359,256 (70%) and 5,635,238 (62%) humans respectively.</p><p>Additionally, we counted the number of languages each item had a label in. This gives us an overview of how complete Wikidata is across several languages. The result is displayed in Figure <ref type="figure" target="#fig_0">2</ref>. Expectedly, the most common language is English, with 8,517,283 (94%) humans having an English label. More unexpected, however, is that the second most common label language is Dutch, with 7,785,518 (86%) humans having a Dutch label, even though the Netherlands is a small country with only 17 million inhabitants.</p><p>Next, we can look at the distribution of object entities for a given predicate. I.e., given a predicate such as place of death (P20), we can count how many people have object values such as Moscow or Paris. From this data, we have created a bar graph for a selection of predicates in Figure <ref type="figure" target="#fig_2">3</ref>.</p><p>Looking at this data, it is immediately clear that it is not representative of the general population. 
For instance, the most common occupation by far is researcher (20%). Yet in reality, even in the USA only around 2% of the population has a PhD. <ref type="foot" target="#foot_3">6</ref> We of course understand that an encyclopedia covers persons of interest and not the general population. Hence it is logical that there is a bias. However, the problematic bias is not the overrepresentation of scholars but the overrepresentation of white male scholars at western universities. If we want to know to what extent the population of researchers in Wikidata is skewed, we need to inquire about the presence of other occupations for persons of interest for an encyclopedia, such as athletes, activists, politicians, engineers and inventors.</p><p>We hypothesize that there are two main sources of bias present in this data. The first is availability bias, i.e., much of the data present in Wikidata is there because it could be easily imported, for instance through the use of bots. The second is interest bias, where the interests of the people who work on Wikidata end up deciding what content will dominate the dataset. Examples of this bias are the most common occupation being researcher (imported through scholarly articles) and the second most common place of death being a concentration camp.</p></div>
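The property-presence counting described in this section can be sketched in a few lines (a toy illustration with hypothetical item IDs Q101–Q103, not the real 9-million-item dump; P21, P31 and P106 are the real Wikidata properties discussed above):

```python
from collections import Counter

def property_presence(items):
    """For each property, count how many items carry it at least once.

    `items` maps an item ID to its list of (property, value) claims.
    """
    counts = Counter()
    for claims in items.values():
        # A set, so each property is counted at most once per item.
        for prop in {p for p, _ in claims}:
            counts[prop] += 1
    return counts

# Toy subset of the human class:
humans = {
    "Q101": [("P31", "Q5"), ("P21", "Q6581097"), ("P106", "Q1650915")],
    "Q102": [("P31", "Q5"), ("P21", "Q6581072")],
    "Q103": [("P31", "Q5"), ("P106", "Q1650915"), ("P106", "Q82955")],
}
counts = property_presence(humans)
print(counts["P31"], counts["P21"], counts["P106"])  # 3 2 2
```

Note that an item with two occupation claims (Q103 above) still counts only once towards P106, matching the "at least one occurrence" statistic of Figure 1.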
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Spatiotemporal Analysis</head><p>Temporal information in Wikidata presents itself in two ways. Firstly, predicates can directly have timestamps as their object value, for instance the date of birth of a person. All predicates that can have a timestamp as object value must be instances of (P31) Wikidata property with datatype 'time' (Q18636219). There are 34 such predicates. Secondly, temporal information can be included in any other predicate through the use of qualifiers, i.e., the qualifiers start time (P580) and end time (P582) can be applied to a triple through reification to add temporal information to that triple.</p><p>Since we are interested in how humans are represented in Wikidata, we restrict the spatiotemporal analysis to the human class. Specifically, we ground data in space by looking at a person's place of birth (P19) and in time by looking at the date of birth (P569). Through this we can analyse the completeness of Wikidata over time. Some results are displayed in Figure <ref type="figure" target="#fig_4">4</ref>. We observe that the further we go back in time, the fewer distinct countries are observed in Wikidata, i.e., facts are increasingly concentrated in a few countries. Additionally, we investigate the occurrences of the most common ethnic groups listed in Wikidata. Interestingly, the use of ethnic group seems to have fallen out of favour for people born more recently. In the 18th century the most common ethnic group was Greeks with over 400 occurrences, whereas in the 21st century the most common ethnic group is African American with just over 100 occurrences.</p></div>
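Grounding people in time reduces to bucketing birth years by century and counting distinct countries per bucket. A minimal sketch, with hypothetical helper names and toy data standing in for the (P569, P19-derived country) pairs extracted from the dump:

```python
from collections import defaultdict

def century(year):
    """Map a year to its century: 1701-1800 is the 18th, 2001-2100 the 21st."""
    return (year - 1) // 100 + 1

def countries_per_century(people):
    """Count distinct birth countries per century.

    `people` is a list of (birth_year, birth_country) pairs.
    """
    seen = defaultdict(set)
    for year, country in people:
        seen[century(year)].add(country)
    return {c: len(countries) for c, countries in sorted(seen.items())}

# Toy data: country coverage thins out the further back we go.
people = [(1750, "France"), (1760, "France"), (2001, "USA"), (2005, "Netherlands")]
print(countries_per_century(people))  # {18: 1, 21: 2}
```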
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Bias Analysis of Knowledge Graph Embedding Models</head><p>In this section we perform a bias analysis of knowledge graph embedding models. Specifically, we analyze the effect of balancing the data on link prediction performance. For this task we utilize two popular models, TransE <ref type="bibr" target="#b2">[3]</ref> and DistMult <ref type="bibr" target="#b15">[16]</ref>. We perform our experiments on two well-known knowledge graphs. The first is Wikidata12k, a subset of Wikidata extracted by <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b5">6]</ref>. The second is DBP15k, a subset of DBpedia <ref type="bibr" target="#b0">[1]</ref> originally created by <ref type="bibr" target="#b12">[13]</ref> to test entity-alignment models. As we are interested in link prediction rather than entity alignment, we select a single instance of the dataset (the English version) and perform our experiments on it. All of our code is available on GitHub.<ref type="foot" target="#foot_4">7</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Embedding Models</head><p>For a triple (s, p, o), let (e s , e p , e o ) denote its embedding vectors. Taking a KG and a random initialization of the vectors as input, a vector representation of the KG is gradually learned using a scoring function φ(s, p, o). The scoring function should reflect how well the embedding captures the semantics of the KG. The learned embeddings can be used in tasks such as classification, clustering, and link prediction. In this work, we focus on the last. Link prediction is the task of predicting the most likely element given a tuple where one element is missing, e.g., given a triple (s, p, ?), to predict the most likely object entity.</p><p>The most popular embedding model is TransE (Translating embeddings). Its scoring function is based on the intuition that the subject and object vectors should be close together after adding the predicate vector. It is written as φ(s, p, o) = ||e s + e p − e o || 1,2 . While being very powerful, it has limited expressiveness due to its simplicity. Therefore, we also perform experiments with DistMult, a multiplicative model. Its scoring function is the trilinear product φ(s, p, o) = ⟨e s , e p , e o ⟩, i.e., the sum over the element-wise product of the three vectors. In our experiments we do not use pre-trained models and instead train the embeddings from scratch.</p></div>
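Both scoring functions, and the link prediction query they support, can be sketched in a few lines of NumPy. This is an illustration only: the function names are ours, and the actual experiments learn the vectors by gradient descent on these scores rather than using fixed ones:

```python
import numpy as np

def transe_score(e_s, e_p, e_o, ord=1):
    # TransE: e_s + e_p should land close to e_o, so a LOWER
    # distance means a more plausible triple (L1 or L2 norm).
    return float(np.linalg.norm(e_s + e_p - e_o, ord=ord))

def distmult_score(e_s, e_p, e_o):
    # DistMult: trilinear product, the sum over the element-wise
    # product; a HIGHER score means a more plausible triple.
    return float(np.sum(e_s * e_p * e_o))

def predict_object(e_s, e_p, entities, k=3):
    """Link prediction for (s, p, ?): rank candidate objects by
    TransE distance, best (smallest) first."""
    scores = {name: transe_score(e_s, e_p, vec) for name, vec in entities.items()}
    return sorted(scores, key=scores.get)[:k]

e_s = np.zeros(2)
e_p = np.array([1.0, 0.0])
entities = {"A": np.array([1.0, 0.0]), "B": np.array([0.0, 1.0]), "C": np.array([2.0, 0.0])}
print(predict_object(e_s, e_p, entities, k=2))  # ['A', 'C']
```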
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">SMOTE</head><p>One way of balancing datasets is to use the Synthetic Minority Over-sampling TEchnique (SMOTE) <ref type="bibr" target="#b4">[5]</ref>. SMOTE is an over-sampling technique that constructs new examples of a given class based on existing examples in order to address imbalances in the dataset; an example would be oversampling female football players. However, SMOTE is not intended for graph datasets and as such is not trivial to apply to knowledge graphs while maintaining the underlying structure.</p><p>One way to apply SMOTE to graph data is by embedding the graph first. We use this approach to evaluate how well SMOTE is suited for our scenario. Our method is as follows. After obtaining the embeddings, we create a categorical variable with a category for each possible combination of sensitive features. In the case of 5 occupations and 2 genders, this implies 10 categories. Each embedding vector is then paired with the categorical value of the entity it represents. Finally, using the Python imbalanced-learn library <ref type="bibr" target="#b11">[12]</ref>, we instruct SMOTE to oversample every category up to the size of the largest one, i.e., given that there are 1850 male association football players, we create examples of every other category (e.g., male and female physicists) until there are 1850 of each.</p><p>However, preliminary experiments found that this method did not suffice for generating a balanced set of embeddings. Applying our evaluation method to datasets produced by the above procedure did not yield balanced predictions. We hypothesize that this is because SMOTE generates new examples based on existing biases. By interpolating new 'female' examples from existing female embeddings, we are only creating new examples in the same cluster. That means that locations in the embedding space which are already female become much more so. 
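The interpolation at the heart of SMOTE can be sketched without the library. This is a minimal NumPy illustration of the core idea only (`smote_like` is a hypothetical name; the experiments themselves use the full SMOTE implementation from imbalanced-learn):

```python
import numpy as np

def smote_like(X, n_new, k=2, seed=0):
    """Generate n_new synthetic samples for the minority class X by
    interpolating between a sample and one of its k nearest
    neighbours, as SMOTE does."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = int(rng.integers(len(X)))
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                   # exclude the sample itself
        neighbours = np.argsort(dist)[:k]  # k nearest neighbours
        j = int(rng.choice(neighbours))
        u = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X[i] + u * (X[j] - X[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples can never leave the minority cluster, which is exactly the failure mode hypothesized above.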
Therefore, we instead extend the datasets by sampling additional triples from the original knowledge graphs. To ensure that the new triples are well connected to the rest of the graph, this is done in a three-step process. Firstly, all women with the required occupations are selected from the complete knowledge graph. Secondly, from this selection, the women who have the largest number of predicates that also appear in the original dataset are picked. Finally, we select the women whose object values are already in the graph. The last step ensures that we do not add object entities which occur only a few times, and only with women.</p></div>
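The three-step selection above can be sketched as follows. This is a simplified illustration with hypothetical data structures and a hypothetical function name; the actual selection runs over the full Wikidata/DBpedia dumps. P21 ("sex or gender"), Q6581072 ("female") and P106 ("occupation") are the real Wikidata identifiers:

```python
def select_women_to_add(kg_all, dataset, occupations, n):
    """Three-step selection of female entities to add (sketch).

    `kg_all` maps an entity to its (predicate, object) pairs in the
    full KG; `dataset` is the training subset as (s, p, o) triples.
    """
    dataset_preds = {p for _, p, _ in dataset}
    dataset_objs = {o for _, _, o in dataset}

    # Step 1: all women with one of the required occupations.
    candidates = [
        e for e, pairs in kg_all.items()
        if ("P21", "Q6581072") in pairs
        and any(p == "P106" and o in occupations for p, o in pairs)
    ]
    # Step 2: prefer women whose predicates already occur in the dataset.
    def overlap(e):
        return sum(p in dataset_preds for p, _ in kg_all[e])
    candidates.sort(key=overlap, reverse=True)
    # Step 3: keep only women whose object values are already in the
    # graph, so we do not add objects occurring only with women.
    chosen = [
        e for e in candidates
        if all(o in dataset_objs for p, o in kg_all[e] if p != "P21")
    ]
    return chosen[:n]
```

Running it on a toy graph where one woman ("QA") links only to objects already in the dataset and another ("QC") links to an unseen object would keep "QA" and drop "QC".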
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Wikidata12k</head><p>Wikidata12k does not contain any information about gender or occupation. However, we can look up this data by querying the original Wikidata knowledge graph. As Wikidata12k is originally a temporal knowledge graph, we strip out the temporal information and remove any duplicate triples that this process may create.</p><p>The five most common occupations are association football player Q937857 (1867), politician Q82955 (918), actor Q33999 (211), writer Q36180 (184) and physicist Q169470 (143). These occupations are not uniformly distributed with regard to gender: there are only a handful of female football players, and there is not a single female physicist in the entire dataset.</p><p>In total, we add around 10,000 triples with female entities as subject to the Wikidata12k knowledge graph, resulting in a new graph of over 50,000 triples. This increases the average number of mentions as subject (i.e., the average number of outlinks) per female entity from 3.49 in the original graph to 4.74 in the balanced graph. However, the number of outlinks still falls short of that of men, which is 5.41.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">DBP15k</head><p>DBP15k is a subset of DBpedia <ref type="bibr" target="#b0">[1]</ref> created by <ref type="bibr" target="#b12">[13]</ref> to test entity-alignment models. The majority of predicates in DBP15k have very few triples associated with them. To prevent the graph from being too sparse for an embedding model to learn, we delete all predicates which occur fewer than 50 times. DBpedia does not store any information about people's sex or gender in a structured way, i.e., although a person can be of rdf:type Man or Woman, manual inspection of the data did not reveal that this information was consistently present. However, most entities do contain their Wikidata identifiers. Since Wikidata does list people's genders, we determine a person's gender by querying Wikidata for the given identifier.</p><p>The five most common occupations are OfficeHolder (2508), Athlete (1436), Royalty (1002), SportsManager (288), and Scientist (282). Like Wikidata12k, the male/female ratio in these occupations is unbalanced, skewing heavily towards men. In addition to balancing the data by adding additional samples, we remove some male entities and their triples to create the balanced dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Evaluation</head><p>To evaluate whether an embedding model contains bias with regard to gender and occupation, we perform the following procedure. Firstly, we count the fraction of men and women that have a certain occupation (P106) x. Then, we ask the model to predict the n most likely entities for the query (?, P106, x). If the fraction of men or women returned is consistently larger than the fraction present in the data, the model is biased. Specifically, when more men are predicted the model is biased against women, and vice versa. If this bias is only present in the unbalanced dataset and not the balanced datasets, then the model reflects the data it has been trained on. However, if the bias is present in both scenarios, the models are either inherently biased or manage to pick up some form of bias in the data which is not reflected in our analysis.</p></div>
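The evaluation procedure can be condensed into a small helper (`bias_report` is a hypothetical name). Plugging in the Royalty row of Table 2, with 569 men and 235 women in the data and 40 women among the top-100 predictions, reproduces the reported 10.8 percentage-point difference:

```python
def bias_report(data_counts, predicted_genders, n=100):
    """Difference (in percentage points) between the share of women
    among the model's top-n predictions and their share in the data.

    `data_counts` is (num_men, num_women) holding occupation x in the
    data; `predicted_genders` lists the genders ('m'/'f') of the top-n
    predicted entities for the query (?, P106, x).
    """
    men, women = data_counts
    pct_data = 100.0 * women / (men + women)
    pct_pred = 100.0 * predicted_genders[:n].count("f") / n
    # Positive: women overrepresented in the predictions relative to
    # the data; negative: the model amplifies the bias against women.
    return round(pct_pred - pct_data, 1)

# Royalty row of Table 2: 569 men / 235 women in the data,
# 40 women among the top-100 predictions.
print(bias_report((569, 235), ["f"] * 40 + ["m"] * 60))  # 10.8
```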
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.7">Results</head><p>Our results are displayed in Tables <ref type="table" target="#tab_2">2 and 3</ref> for DBpedia and Wikidata12k respectively using TransE <ref type="bibr" target="#b2">[3]</ref>, and in Tables <ref type="table" target="#tab_3">4 and 5</ref> using DistMult <ref type="bibr" target="#b15">[16]</ref>. We observe that in both original datasets, the percentage of women predicted is very low and close to the percentage of women in the dataset. The largest difference is observed for the occupation Royalty in the DBpedia15k dataset, where the difference is just over 10 percentage points.</p><p>When we extend our view to the balanced datasets, we find that the percentage of women predicted has moved upwards with the percentage of women in the dataset. Balancing the datasets thus helps to improve the representation of minority classes in the model output. However, we do observe that the absolute differences between the number of men in the dataset and the number of men predicted (and likewise for women) have increased, suggesting that the model has become less accurate.</p><p>Even so, we believe a more likely explanation to be that the larger number of entities predicted induces more variance in the predictions. This explanation is strengthened by the fact that the difference is smaller when using DistMult (see Table <ref type="table">5</ref>, which compares the male/female distribution and the resulting DistMult model predictions in the original Wikidata12k dataset and our balanced dataset; its Difference column contains the difference, in percentage points, between the % of women in the data and the % of women predicted), which is a more expressive model and can thus model the information more accurately.</p><p>Another point of note is the observation that on both datasets and for almost all occupations, the sign of the difference between the percentage of women predicted and the percentage of women in the dataset is mostly positive. This means that women are actually overrepresented in the model's predictions, indicating that the model is actually less biased than the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper we proposed a new approach to mitigate bias in knowledge graph embedding models by leveraging the distribution of the datasets on which the models are trained. Specifically, rather than adapting models to mitigate bias, we instead analyze and augment the data that is fed into the model. We carried out several experiments using state-of-the-art embedding models (namely, TransE and DistMult) and two knowledge graphs (namely, DBpedia and Wikidata) and showed that balancing the data with regard to specific sensitive features (e.g., gender and occupation) improves the representation of minority classes in the models' predictions. Additionally, to motivate our work, we have carried out a completeness analysis of Wikidata using a number of sensitive features.</p><p>As future work, we will extend the proposed approach to build a system that takes as input a dataset and a selection of sensitive features and automatically balances the data with respect to the given features.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: How many humans have a label specified in the given language.</figDesc><graphic coords="5,203.93,115.84,207.48,119.73" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>(a) Most common place of birth (left) and place of death (right) entities. (b) Most common ethnic groups (left) and languages spoken (right) entities. (c) Most common occupations (left) and religions (right) entities.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 :</head><label>3</label><figDesc>Fig. 3: Overview of the most common values for several predicates in Wikidata.</figDesc><graphic coords="6,134.77,357.86,169.45,99.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>(a) Number of countries with at least one fact in the given century. (b) Comparison between the most common ethnic groups listed in Wikidata in the 18th (left) and 21st (right) century.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 4 :</head><label>4</label><figDesc>Fig. 4: Overview of selected temporal metrics of Wikidata.</figDesc><graphic coords="7,307.29,229.73,169.45,81.69" type="bitmap" /></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Sampling process</head><p>We enrich the original knowledge graphs by adding female triples, i.e., extra triples with female entities as subject. The data is enriched in such a way that the number of men and women associated with each of the top 5 most common occupations becomes approximately equal. The triples are obtained from the complete Wikidata and DBpedia datasets.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Dataset statistics</figDesc><table><row><cell>Dataset</cell><cell># Triples</cell><cell># Entities</cell><cell># Pred.</cell><cell># Men</cell><cell># Women</cell></row><row><cell>Wikidata12k (original)</cell><cell>38,970</cell><cell>12,848</cell><cell>25</cell><cell>4,905</cell><cell>717</cell></row><row><cell>Wikidata12k (balanced)</cell><cell>51,682</cell><cell>15,957</cell><cell>25</cell><cell>4,905</cell><cell>3,610</cell></row><row><cell>DBpedia15k (original)</cell><cell>92,746</cell><cell>18,716</cell><cell>206</cell><cell>6,767</cell><cell>1,087</cell></row><row><cell>DBpedia15k (balanced)</cell><cell>95,827</cell><cell>27,459</cell><cell>206</cell><cell>5,916</cell><cell>5,917</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Comparison between the male/female distribution and the resulting TransE model predictions in the original DBpedia15k dataset (top) and our balanced dataset (bottom). The Difference column contains the difference (in percentage points) between the percentage of women in the predictions and the percentage of women in the data.</figDesc><table>
<row><cell></cell><cell cols="3">Data</cell><cell cols="4">Prediction</cell></row>
<row><cell></cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Diff (p.p.)</cell></row>
<row><cell>Officeholder</cell><cell>1803</cell><cell>180</cell><cell>9.1%</cell><cell>85</cell><cell>15</cell><cell>15.0%</cell><cell>6.9</cell></row>
<row><cell>Athlete</cell><cell>1142</cell><cell>6</cell><cell>0.5%</cell><cell>100</cell><cell>0</cell><cell>0.0%</cell><cell>-0.5</cell></row>
<row><cell>Royalty</cell><cell>569</cell><cell>235</cell><cell>29.2%</cell><cell>60</cell><cell>40</cell><cell>40.0%</cell><cell>10.8</cell></row>
<row><cell>Sportsmanager</cell><cell>225</cell><cell>0</cell><cell>0.0%</cell><cell>100</cell><cell>0</cell><cell>0.0%</cell><cell>0.0</cell></row>
<row><cell>Scientist</cell><cell>216</cell><cell>6</cell><cell>2.7%</cell><cell>95</cell><cell>5</cell><cell>5.0%</cell><cell>2.3</cell></row>
<row><cell>Total</cell><cell>3955</cell><cell>427</cell><cell>9.7%</cell><cell>440</cell><cell>60</cell><cell>12.0%</cell><cell>-</cell></row>
<row><cell></cell><cell cols="3">Data</cell><cell cols="4">Prediction</cell></row>
<row><cell></cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Diff (p.p.)</cell></row>
<row><cell>Officeholder</cell><cell>1498</cell><cell>1770</cell><cell>54.2%</cell><cell>16</cell><cell>83</cell><cell>83.8%</cell><cell>29.7</cell></row>
<row><cell>Athlete</cell><cell>990</cell><cell>1320</cell><cell>57.1%</cell><cell>55</cell><cell>45</cell><cell>45.0%</cell><cell>-12.1</cell></row>
<row><cell>Royalty</cell><cell>472</cell><cell>596</cell><cell>55.8%</cell><cell>25</cell><cell>74</cell><cell>74.7%</cell><cell>18.9</cell></row>
<row><cell>Sportsmanager</cell><cell>188</cell><cell>31</cell><cell>14.2%</cell><cell>85</cell><cell>15</cell><cell>15.0%</cell><cell>0.8</cell></row>
<row><cell>Scientist</cell><cell>195</cell><cell>225</cell><cell>53.6%</cell><cell>32</cell><cell>67</cell><cell>67.7%</cell><cell>14.1</cell></row>
<row><cell>Total</cell><cell>3343</cell><cell>3942</cell><cell>54.1%</cell><cell>213</cell><cell>284</cell><cell>57.1%</cell><cell>-</cell></row>
</table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Comparison between the male/female distribution and the resulting TransE model predictions in the original Wikidata12k dataset (top) and our balanced dataset (bottom). The Difference column contains the difference (in percentage points) between the percentage of women in the predictions and the percentage of women in the data.</figDesc><table>
<row><cell></cell><cell cols="3">Data</cell><cell cols="4">Prediction</cell></row>
<row><cell></cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Diff (p.p.)</cell></row>
<row><cell>Politician</cell><cell>619</cell><cell>96</cell><cell>13.4%</cell><cell>80</cell><cell>20</cell><cell>20.0%</cell><cell>6.6</cell></row>
<row><cell>Writer</cell><cell>100</cell><cell>33</cell><cell>24.8%</cell><cell>74</cell><cell>25</cell><cell>25.3%</cell><cell>0.5</cell></row>
<row><cell>Actor</cell><cell>67</cell><cell>89</cell><cell>57.1%</cell><cell>38</cell><cell>61</cell><cell>61.6%</cell><cell>4.6</cell></row>
<row><cell>Football player</cell><cell>1465</cell><cell>14</cell><cell>0.9%</cell><cell>98</cell><cell>2</cell><cell>2.0%</cell><cell>1.1</cell></row>
<row><cell>Physicist</cell><cell>108</cell><cell>2</cell><cell>1.8%</cell><cell>98</cell><cell>2</cell><cell>2.0%</cell><cell>0.2</cell></row>
<row><cell>Total</cell><cell>2359</cell><cell>234</cell><cell>9.0%</cell><cell>388</cell><cell>110</cell><cell>22.1%</cell><cell>-</cell></row>
<row><cell></cell><cell cols="3">Data</cell><cell cols="4">Prediction</cell></row>
<row><cell></cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Diff (p.p.)</cell></row>
<row><cell>Politician</cell><cell>644</cell><cell>331</cell><cell>33.9%</cell><cell>53</cell><cell>47</cell><cell>47.0%</cell><cell>13.1</cell></row>
<row><cell>Writer</cell><cell>111</cell><cell>59</cell><cell>34.7%</cell><cell>25</cell><cell>74</cell><cell>74.7%</cell><cell>40.0</cell></row>
<row><cell>Actor</cell><cell>71</cell><cell>93</cell><cell>56.7%</cell><cell>33</cell><cell>66</cell><cell>66.7%</cell><cell>10.0</cell></row>
<row><cell>Football player</cell><cell>1455</cell><cell>1427</cell><cell>49.5%</cell><cell>8</cell><cell>91</cell><cell>91.9%</cell><cell>42.4</cell></row>
<row><cell>Physicist</cell><cell>122</cell><cell>44</cell><cell>26.5%</cell><cell>56</cell><cell>44</cell><cell>44.0%</cell><cell>17.5</cell></row>
<row><cell>Total</cell><cell>2403</cell><cell>1954</cell><cell>44.8%</cell><cell>175</cell><cell>322</cell><cell>64.8%</cell><cell>-</cell></row>
</table></figure>
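The Difference column described in the table captions is plain percentage-point arithmetic. As a worked check against the Royalty row of Table 2 (original DBpedia15k), with the helper name `women_share` being illustrative only:

```python
def women_share(men, women):
    """Percentage of women among the entities counted."""
    return 100 * women / (men + women)

# Royalty row, Table 2 (original DBpedia15k): 569 men / 235 women in the data,
# 60 men / 40 women among the model's top predictions.
data_pct = women_share(569, 235)   # ~29.2%
pred_pct = women_share(60, 40)     # 40.0%
diff_pp = pred_pct - data_pct      # ~10.8 percentage points
```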
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>Comparison between the male/female distribution and the resulting DistMult model predictions in the original DBpedia15k dataset (top) and our balanced dataset (bottom). The Difference column contains the difference (in percentage points) between the percentage of women in the predictions and the percentage of women in the data.</figDesc><table>
<row><cell></cell><cell cols="3">Data</cell><cell cols="4">Prediction</cell></row>
<row><cell></cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Diff (p.p.)</cell></row>
<row><cell>Officeholder</cell><cell>1803</cell><cell>180</cell><cell>9.1%</cell><cell>87</cell><cell>13</cell><cell>13.0%</cell><cell>3.9</cell></row>
<row><cell>Athlete</cell><cell>1142</cell><cell>6</cell><cell>0.5%</cell><cell>100</cell><cell>0</cell><cell>0.0%</cell><cell>0.5</cell></row>
<row><cell>Royalty</cell><cell>569</cell><cell>235</cell><cell>29.2%</cell><cell>68</cell><cell>32</cell><cell>32.0%</cell><cell>2.8</cell></row>
<row><cell>Sportsmanager</cell><cell>225</cell><cell>0</cell><cell>0.0%</cell><cell>100</cell><cell>0</cell><cell>0.0%</cell><cell>0.0</cell></row>
<row><cell>Scientist</cell><cell>216</cell><cell>6</cell><cell>2.7%</cell><cell>96</cell><cell>4</cell><cell>4.0%</cell><cell>1.3</cell></row>
<row><cell>Total</cell><cell>3955</cell><cell>427</cell><cell>9.7%</cell><cell>451</cell><cell>49</cell><cell>9.8%</cell><cell>-</cell></row>
<row><cell></cell><cell cols="3">Data</cell><cell cols="4">Prediction</cell></row>
<row><cell></cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Diff (p.p.)</cell></row>
<row><cell>Officeholder</cell><cell>1498</cell><cell>1770</cell><cell>54.2%</cell><cell>47</cell><cell>53</cell><cell>53.0%</cell><cell>-1.2</cell></row>
<row><cell>Athlete</cell><cell>990</cell><cell>1320</cell><cell>57.1%</cell><cell>78</cell><cell>22</cell><cell>22.0%</cell><cell>-35.1</cell></row>
<row><cell>Royalty</cell><cell>472</cell><cell>596</cell><cell>55.8%</cell><cell>39</cell><cell>61</cell><cell>61.0%</cell><cell>5.2</cell></row>
<row><cell>Sportsmanager</cell><cell>188</cell><cell>31</cell><cell>14.2%</cell><cell>88</cell><cell>12</cell><cell>12.0%</cell><cell>-2.2</cell></row>
<row><cell>Scientist</cell><cell>195</cell><cell>225</cell><cell>53.6%</cell><cell>51</cell><cell>49</cell><cell>49.0%</cell><cell>-4.6</cell></row>
<row><cell>Total</cell><cell>3343</cell><cell>3942</cell><cell>54.1%</cell><cell>303</cell><cell>197</cell><cell>39.4%</cell><cell>-</cell></row>
</table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 :</head><label>5</label><figDesc>Comparison between the male/female distribution and the resulting DistMult model predictions in the original Wikidata12k dataset (top) and our balanced dataset (bottom). The Difference column contains the difference (in percentage points) between the percentage of women in the predictions and the percentage of women in the data.</figDesc><table>
<row><cell></cell><cell cols="3">Data</cell><cell cols="4">Prediction</cell></row>
<row><cell></cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Diff (p.p.)</cell></row>
<row><cell>Politician</cell><cell>619</cell><cell>96</cell><cell>13.4%</cell><cell>83</cell><cell>17</cell><cell>17.0%</cell><cell>3.6</cell></row>
<row><cell>Writer</cell><cell>100</cell><cell>33</cell><cell>24.8%</cell><cell>72</cell><cell>28</cell><cell>28.0%</cell><cell>3.2</cell></row>
<row><cell>Actor</cell><cell>67</cell><cell>89</cell><cell>57.1%</cell><cell>50</cell><cell>50</cell><cell>50.0%</cell><cell>-7.1</cell></row>
<row><cell>Football player</cell><cell>1465</cell><cell>14</cell><cell>0.9%</cell><cell>99</cell><cell>1</cell><cell>1.0%</cell><cell>0.1</cell></row>
<row><cell>Physicist</cell><cell>108</cell><cell>2</cell><cell>1.8%</cell><cell>95</cell><cell>5</cell><cell>5.0%</cell><cell>3.2</cell></row>
<row><cell>Total</cell><cell>2359</cell><cell>234</cell><cell>9.0%</cell><cell>399</cell><cell>101</cell><cell>20.2%</cell><cell>-</cell></row>
<row><cell></cell><cell cols="3">Data</cell><cell cols="4">Prediction</cell></row>
<row><cell></cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Men</cell><cell>Women</cell><cell>Women (%)</cell><cell>Diff (p.p.)</cell></row>
<row><cell>Politician</cell><cell>644</cell><cell>331</cell><cell>33.9%</cell><cell>51</cell><cell>49</cell><cell>49.0%</cell><cell>15.1</cell></row>
<row><cell>Writer</cell><cell>111</cell><cell>59</cell><cell>34.7%</cell><cell>39</cell><cell>61</cell><cell>61.0%</cell><cell>26.3</cell></row>
<row><cell>Actor</cell><cell>71</cell><cell>93</cell><cell>56.7%</cell><cell>37</cell><cell>63</cell><cell>63.0%</cell><cell>6.3</cell></row>
<row><cell>Football player</cell><cell>1455</cell><cell>1427</cell><cell>49.5%</cell><cell>50</cell><cell>50</cell><cell>50.0%</cell><cell>0.5</cell></row>
<row><cell>Physicist</cell><cell>122</cell><cell>44</cell><cell>26.5%</cell><cell>42</cell><cell>58</cell><cell>58.0%</cell><cell>31.5</cell></row>
<row><cell>Total</cell><cell>2403</cell><cell>1954</cell><cell>44.8%</cell><cell>219</cell><cell>281</cell><cell>56.2%</cell><cell>-</cell></row>
</table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">https://www.cinemablend.com/news/2444550/siri-is-telling-people-stan-lee-died-yesterday</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">https://www.cbc.ca/news/technology/ omar-khadr-google-search-knowledge-graph-scheer-russia-1.4999775</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">https://data.worldbank.org/indicator/SE.TER.CUAT.DO.ZS?locations=US</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">https://github.com/wradstok/KGE-bias-analyzer</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">DBpedia - a crystallization point for the web of data</title>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kobilarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Becker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cyganiak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hellmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="154" to="165" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Freebase: a collaboratively created graph database for structuring human knowledge</title>
		<author>
			<persName><forename type="first">K</forename><surname>Bollacker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Evans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Paritosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sturge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2008 ACM SIGMOD international conference on Management of data</title>
				<meeting>the 2008 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1247" to="1250" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Translating embeddings for modeling multi-relational data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Usunier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garcia-Duran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Yakhnenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Bias in knowledge graph embeddings</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bourli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pitoura</surname></persName>
		</author>
		<idno type="DOI">10.1109/ASONAM49781.2020.9381459</idno>
		<ptr target="https://doi.org/10.1109/ASONAM49781.2020.9381459" />
	</analytic>
	<monogr>
		<title level="m">IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="6" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">SMOTE: synthetic minority over-sampling technique</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">V</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Bowyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">O</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Kegelmeyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="321" to="357" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">HyTE: Hyperplane-based temporally aware knowledge graph embedding</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Dasgupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">N</forename><surname>Ray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Talukdar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 conference on empirical methods in natural language processing</title>
				<meeting>the 2018 conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2001" to="2011" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Implicit bias in crowdsourced knowledge graphs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Demartini</surname></persName>
		</author>
		<idno type="DOI">10.1145/3308560.3317307</idno>
		<ptr target="https://doi.org/10.1145/3308560.3317307" />
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of The 2019 World Wide Web Conference</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="624" to="630" />
		</imprint>
	</monogr>
	<note>WWW &apos;19</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Knowledge vault: A web-scale approach to probabilistic knowledge fusion</title>
		<author>
			<persName><forename type="first">X</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gabrilovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Horn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Strohmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="601" to="610" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Debiasing knowledge graph embeddings</title>
		<author>
			<persName><forename type="first">J</forename><surname>Fisher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mittal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Palfrey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Christodoulopoulos</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.595</idno>
		<ptr target="https://www.aclweb.org/anthology/2020.emnlp-main.595" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
				<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020-11">Nov 2020</date>
			<biblScope unit="page" from="7332" to="7345" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Fisher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Palfrey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Christodoulopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mittal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.02761</idno>
		<title level="m">Measuring social bias in knowledge graph embeddings</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Deriving validity time in knowledge graph</title>
		<author>
			<persName><forename type="first">J</forename><surname>Leblay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Chekol</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the The Web Conference</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1771" to="1776" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Lemaître</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K</forename><surname>Aridas</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v18/16-365" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">17</biblScope>
			<biblScope unit="page" from="1" to="5" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Cross-lingual entity alignment via joint attribute-preserving embedding</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web - ISWC 2017</title>
				<editor>
			<persName><forename type="first">C</forename><surname>d'Amato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Fernandez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Tamma</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Lecue</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Cudré-Mauroux</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Sequeda</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Lange</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Heflin</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="628" to="644" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Wikidata: a free collaborative knowledgebase</title>
		<author>
			<persName><forename type="first">D</forename><surname>Vrandečić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krötzsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="78" to="85" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Wikidata completeness profiling using ProWD</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wisesa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Darari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krisnadhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Nutt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Razniewski</surname></persName>
		</author>
		<idno type="DOI">10.1145/3360901.3364425</idno>
		<ptr target="https://doi.org/10.1145/3360901.3364425" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th International Conference on Knowledge Capture</title>
				<meeting>the 10th International Conference on Knowledge Capture<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="123" to="130" />
		</imprint>
	</monogr>
	<note>K-CAP &apos;19</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Embedding entities and relations for learning and inference in knowledge bases</title>
		<author>
			<persName><forename type="first">B</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6575</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
