<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">RDF2Vec Light - A Lightweight Approach for Knowledge Graph Embeddings</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jan</forename><forename type="middle">Philipp</forename><surname>Portisch</surname></persName>
							<email>jan.portisch@sap.com</email>
							<affiliation key="aff0">
								<orgName type="department">Data and Web Science Group</orgName>
								<orgName type="institution">University of Mannheim</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">SAP SE Product Engineering Financial Services</orgName>
								<address>
									<settlement>Walldorf</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Michael</forename><surname>Hladik</surname></persName>
							<email>michael.hladik@sap.com</email>
							<affiliation key="aff1">
								<orgName type="institution">SAP SE Product Engineering Financial Services</orgName>
								<address>
									<settlement>Walldorf</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">RDF2Vec Light - A Lightweight Approach for Knowledge Graph Embeddings</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">848B70057766BFF6013607828FCC3708</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>RDF2Vec</term>
					<term>knowledge graph embeddings</term>
					<term>knowledge graphs</term>
					<term>data mining</term>
					<term>scalability</term>
					<term>resource efficient embeddings</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Knowledge graph embedding approaches represent nodes and edges of graphs as mathematical vectors. Current approaches focus on embedding complete knowledge graphs, i.e. all nodes and edges. This leads to very high computational requirements on large graphs such as DBpedia or Wikidata. However, for most downstream application scenarios, only a small subset of concepts is of actual interest. In this paper, we present RDF2Vec Light, a lightweight embedding approach based on RDF2Vec which generates vectors for only a subset of entities. To that end, RDF2Vec Light only traverses and processes a subgraph of the knowledge graph. Due to a significantly lower runtime and significantly reduced hardware requirements, our method enables the use of embeddings of very large knowledge graphs in scenarios where such embeddings were not feasible before.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Public knowledge graphs (KGs), such as DBpedia or Wikidata, provide deep background knowledge that can be exploited for downstream tasks such as question answering or recommender systems <ref type="bibr" target="#b2">[3]</ref>. KG embeddings (KGEs) represent vertices and, depending on the approach, also edges of a KG as numeric vectors. This representation is easily consumable by most algorithms and can be exploited in downstream tasks. Advantages of KGEs, once they have been trained, include simple applicability, fast runtime, good performance on multiple tasks, and reusability in downstream applications. On the downside, KGEs produce very large models (for example, the 200-dimensional DBpedia RDF2Vec embedding model available at KGvec2go <ref type="bibr" target="#b4">[5]</ref> requires more than 10 GB of disk storage), and they are very expensive to train and re-train in the case of evolving knowledge bases. For very large knowledge graphs, such as Wikidata, computing a complete embedding typically takes a day or longer <ref type="bibr" target="#b1">[2]</ref>.</p><p>In this paper, we address the scalability aspect of knowledge graph embeddings: Our novel approach, RDF2Vec Light, allows training partial, task-specific models with only a fraction of the computation requirements of other embedding approaches, while retaining high performance on multiple tasks. The resulting models contain vectors only for the entities of interest. Internally, RDF2Vec Light traverses only a subset of the underlying knowledge graph, which leads to processing times that are much shorter than those of the original RDF2Vec approach, which always processes an entire knowledge graph. Moreover, the resulting models are much smaller. <ref type="foot" target="#foot_0">4</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RDF2Vec Light</head><p>RDF2Vec is based on performing random walks on a graph <ref type="bibr" target="#b5">[6]</ref>. The underlying idea of RDF2Vec Light embeddings is to generate only local walks for the entities of interest given a predefined task. After the walk generation has been completed, the training of vectors can be performed as in the original approach.</p><p>Rather than starting random walks at every entity of the graph, walks are generated only for the entities of interest: at each depth iteration, it is randomly decided whether to go backwards, i.e. to one of the node's predecessors, or forwards, i.e. to one of the node's successors (line 9 of Algorithm 1). As a result, an entity of interest can be at the beginning, at the end, or in the middle of a walk, which better captures the context of the entity. This generation process is described in Algorithm 1. Both the RDF2Vec method and the RDF2Vec Light extension have been implemented in Java and Python. <ref type="foot" target="#foot_1">5</ref> The implementation can handle various RDF formats such as N-Triples, RDF/XML, Turtle, and HDT <ref type="bibr" target="#b0">[1]</ref>. In addition, a REST API has been implemented and is provided at http://www.kgvec2go.org.</p></div>
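The backward/forward walk generation described above can be sketched in Python as follows. This is a minimal sketch: the adjacency dictionaries and helper names are assumptions for illustration, not the authors' jRDF2Vec API, and edge labels are omitted for brevity.

```python
import random

def generate_light_walks(ingoing, outgoing, entities_of_interest, depth, num_walks):
    """Sketch of RDF2Vec Light walk generation (cf. Algorithm 1).

    `ingoing` / `outgoing` map a node to its predecessor / successor
    nodes (illustrative assumption). At each step the walk is randomly
    extended either backwards via a predecessor or forwards via a
    successor, so the entity of interest may end up at the start,
    middle, or end of a walk.
    """
    walks = set()
    for v in entities_of_interest:
        for _ in range(num_walks):
            walk = [v]
            pred = list(ingoing.get(v, []))
            succ = list(outgoing.get(v, []))
            while len(walk) < depth:
                cand = pred + succ
                if not cand:
                    break  # dead end in both directions
                elem = random.choice(cand)
                if elem in pred:
                    walk.insert(0, elem)                 # grow backwards
                    pred = list(ingoing.get(elem, []))
                else:
                    walk.append(elem)                    # grow forwards
                    succ = list(outgoing.get(elem, []))
            walks.add(tuple(walk))
    return walks
```

Because only the neighbourhoods of the entities of interest are ever touched, the traversal cost depends on the size of this local subgraph rather than on the full knowledge graph.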
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Evaluation</head><p>In order to evaluate the approach presented in this paper, the classification and regression experiments, as well as the entity and document relatedness experiments, of Ristoski et al. <ref type="bibr" target="#b5">[6]</ref> have been repeated. The evaluation follows the setup defined in <ref type="bibr" target="#b3">[4]</ref>. Six classic and six light embedding spaces have been trained, each with the following parameters held constant: window size = 5, negative samples = 25. The parameters that were varied are the generation mode (CBOW and SG) as well as the dimension of the embedding space (50, 100, 200). All walks have been generated with 500 walks per entity and a depth of 4. For the evaluation, the DBpedia knowledge graph as of 2016-10<ref type="foot" target="#foot_2">6</ref> has been used.</p><p>For the classification and regression tasks, we follow the same setup as in the original RDF2Vec paper <ref type="bibr" target="#b6">[7]</ref>: For the classification tasks, four classifiers have been evaluated: Naïve Bayes, C4.5 (a decision tree algorithm), k-NN with k = 3, and Support Vector Machines (SVM) with C ∈ {10⁻³, 10⁻², 0.1, 1, 10, 10², 10³}, where the best C is chosen. A 10-fold cross-validation has been used to calculate the performance statistics. For the regression tasks, three approaches have been evaluated: linear regression, k-NN, and M5Rules. For the sake of brevity, we only report results for the best performing approaches (SVM and LR). <ref type="foot" target="#foot_3">7</ref> Table <ref type="table">2</ref>. Results on the document relatedness task (LP50), reporting the harmonic mean of Pearson correlation and Spearman rank correlation, and on entity relatedness (KORE), using cosine similarity. The best value of each comparison group is highlighted in bold. The overall best value is additionally underlined. 
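The SVM model-selection protocol (pick the best C from the grid, then score with 10-fold cross-validation) can be sketched with scikit-learn. The toy features below merely stand in for the learned entity vectors and gold labels; they are not the paper's data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

# Toy stand-in features: in the paper, X would hold the learned entity
# vectors and y the gold labels of e.g. the Cities dataset.
rng = np.random.RandomState(0)
X = rng.randn(60, 10)
y = (X[:, 0] > 0).astype(int)

# Choose the best C from the paper's grid, then report accuracy via
# 10-fold cross-validation, mirroring the evaluation protocol.
param_grid = {"C": [1e-3, 1e-2, 0.1, 1, 10, 1e2, 1e3]}
grid = GridSearchCV(SVC(), param_grid, cv=10)
grid.fit(X, y)
best_c = grid.best_params_["C"]
accuracy = cross_val_score(SVC(C=best_c), X, y, cv=10).mean()
```

The same skeleton applies to the regression tasks by swapping in a regressor and reporting RMSE instead of accuracy.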
In the results tables, strategy refers to the configuration with which the embeddings have been obtained. The structure can be read as follows: &lt;mode&gt;_&lt;number_of_walks_per_entity&gt;_&lt;walk_depth&gt;_&lt;training_mode&gt;_&lt;dimension&gt;, where mode is either Light or Classic. For example, Light_500_4_CBOW_100 refers to RDF2Vec Light embeddings with 500 walks per entity, a walk depth of 4, CBOW configuration, and an embedding space dimensionality of 100.</p><p>For classification and regression, we can observe that, except for the Cities dataset, the difference between the two approaches is rather marginal. For entity and document relatedness, the results are less conclusive. Here, we see that the RDF2Vec Light approach is on par with the classic approach for the CBOW variant, but the results are reversed when looking at the SG variant, which also yields the best results globally.</p><p>In order to analyze those results more deeply, and to distinguish the cases where RDF2Vec Light is on par with classic RDF2Vec from those where it is clearly inferior, we looked at the linkage degree of the entities at hand, as well as the homogeneity of the entities of interest.</p><p>For the linkage degree, we can observe that a higher degree of the entities of interest leads to a worse performance of RDF2Vec Light. This can be seen in the inferior performance of RDF2Vec Light for the Cities datasets in classification and regression. Cities are among the most strongly interlinked entities in DBpedia <ref type="bibr" target="#b2">[3]</ref>. At the same time, the document and entity similarity datasets contain a larger number of strongly interlinked head entities.</p><p>While for classification and regression problems, the set of entities is rather homogeneous (i.e., all entities are cities, albums, etc.), the homogeneity is lower for document and entity relatedness, where the entities of interest are scattered across many classes. 
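As a small illustration, the strategy labels used in the results tables can be decoded mechanically. The helper below is hypothetical, for illustration only, and not part of the paper's tooling:

```python
def parse_strategy(strategy: str) -> dict:
    """Decode a results-table label such as 'Light_500_4_CBOW_100'
    into its five components (mode_walks_depth_training_dimension)."""
    mode, walks, depth, training, dim = strategy.split("_")
    return {
        "mode": mode,                  # 'Light' or 'Classic'
        "walks_per_entity": int(walks),
        "walk_depth": int(depth),
        "training_mode": training,     # 'CBOW' or 'SG'
        "dimension": int(dim),
    }
```

For example, `parse_strategy("Light_500_4_CBOW_100")` yields mode `Light`, 500 walks per entity, depth 4, CBOW training, and dimension 100.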
Both degree and homogeneity contribute to the density of the considered subgraphs, as depicted in Fig. <ref type="figure" target="#fig_1">1</ref>. From the plots, we can observe a correlation between RDF2Vec Light performance and the density of the graph spanned by the random walks: the denser the graph (i.e., the fewer head entities there are and the more homogeneous the entity set at hand), the better the performance of RDF2Vec Light.</p><p>The runtime of RDF2Vec Light is linear in the number of entities of interest. On commodity hardware, the runtime is roughly 1 minute per 10 nodes. In comparison, training RDF2Vec on the full DBpedia graph takes a few days.</p></div>
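The density of the graph spanned by the random walks can be computed directly from the generated walks. This sketch assumes a directed edge between each pair of consecutive walk nodes and uses the standard directed-graph density |E| / (|V| · (|V| − 1)); it is an illustrative helper, not taken from the paper's implementation.

```python
def walk_graph_density(walks):
    """Density of the directed graph spanned by a set of walks,
    where each walk is a sequence of node identifiers."""
    nodes, edges = set(), set()
    for walk in walks:
        nodes.update(walk)
        edges.update(zip(walk, walk[1:]))  # consecutive nodes form edges
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0
```

Higher values indicate a denser local subgraph, which is the regime in which RDF2Vec Light performs closest to classic RDF2Vec.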
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and Outlook</head><p>In this paper, we presented RDF2Vec Light, an approach for learning latent representations of knowledge graph entities that requires only a fraction of the computing power of other embedding approaches. Rather than embedding the whole knowledge graph, RDF2Vec Light trains vectors for only a few entities of interest and their context. For this approach, the walk generation algorithm has been adapted to better represent the context of the entities. Our experiments show that the results achieved with RDF2Vec Light are comparable to those obtained with standard RDF2Vec, while requiring only a fraction of the runtime.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Algorithm 1.</head><label>1</label><figDesc>Walk generation algorithm for RDF2Vec Light. Data: G = (V, E): RDF graph; V_I: vertices of interest; d: walk depth; n: number of walks. Result: W_G: set of walks. W_G = ∅; for each vertex v ∈ V_I, n walks are generated: a walk w is initialized with v; while w.length() &lt; d, a candidate set cand = pred ∪ succ is built from the ingoing and outgoing edges, an element elem = pickRandomElementFrom(cand) is drawn, and if elem ∈ pred, elem is added at the beginning of w with pred = getIngoingEdges(elem); otherwise, elem is added at the end of w with succ = getOutgoingEdges(elem); finally, w is added to W_G.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1.</head><label>1</label><figDesc>Depiction of the graphs which were assembled using the generated walks.</figDesc><graphic coords="4,134.77,262.49,345.80,167.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Classification (accuracy) and regression (RMSE) results with RDF2Vec Classic and RDF2Vec Light. The best classic and light results are highlighted.</figDesc><table><row><cell></cell><cell>Cities</cell><cell cols="2">Movies</cell><cell cols="2">Albums</cell><cell>AAUP</cell><cell>Forbes</cell></row><row><cell>Strategy</cell><cell>SVM</cell><cell>LR SVM</cell><cell cols="2">LR SVM</cell><cell cols="2">LR SVM</cell><cell>LR SVM</cell><cell>LR</cell></row><row><cell>Light_500_4_CBOW_50</cell><cell cols="7">52.56 19.23 73.31 19.75 72.44 12.52 61.93 68.35 60.79 34.62</cell></row><row><cell cols="8">Classic_500_4_CBOW_50 49.36 16.95 55.25 22.77 51.70 14.06 55.33 70.59 57.28 36.64</cell></row><row><cell cols="8">Light_500_4_CBOW_100 71.78 21.16 73.50 19.90 72.43 12.35 63.21 65.85 60.81 34.96</cell></row><row><cell cols="8">Classic_500_4_CBOW_100 49.36 22.15 58.21 22.94 57.44 14.17 55.00 73.33 57.36 42.32</cell></row><row><cell cols="8">Light_500_4_CBOW_200 71.61 54.86 73.93 19.60 73.34 12.50 61.74 67.65 59.11 35.97</cell></row><row><cell cols="8">Classic_500_4_CBOW_200 49.36 99.73 58.79 23.54 59.18 14.24 56.83 80.29 57.57 45.76</cell></row><row><cell>Light_500_4_SG_50</cell><cell cols="7">75.90 19.39 74.15 19.34 76.49 12.00 65.46 67.66 61.56 34.58</cell></row><row><cell>Classic_500_4_SG_50</cell><cell cols="7">80.57 12.95 72.81 19.89 76.42 11.80 68.04 64.85 61.08 34.89</cell></row><row><cell>Light_500_4_SG_100</cell><cell cols="7">73.99 20.89 74.89 19.21 76.98 11.89 64.54 66.59 61.38 34.48</cell></row><row><cell>Classic_500_4_SG_100</cell><cell cols="7">79.01 15.26 72.72 19.61 76.51 11.57 64.72 65.50 60.42 35.26</cell></row><row><cell>Light_500_4_SG_200</cell><cell cols="7">73.81 44.38 74.58 19.45 76.35 12.16 62.83 70.13 60.26 36.73</cell></row><row><cell>Classic_500_4_SG_200</cell><cell cols="7">77.06 28.34 73.85 19.71 75.66 11.92 66.74 67.96 61.82 36.93</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">RDF2Vec Light models are typically only a few kilobytes in size, compared to multiple gigabytes of disk space required to persist classic embedding models.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">https://github.com/dwslab/jRDF2Vec</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2">https://wiki.dbpedia.org/downloads-2016-10</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3">The complete result tables are available at http://www.rdf2vec.org/rdf2vec_light</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Binary RDF representation for publication and exchange (HDT)</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Martínez-Prieto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gutiérrez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polleres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Arias</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Web Semantics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="22" to="41" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">OpenKE: An open toolkit for knowledge embedding</title>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EMNLP</title>
				<meeting>EMNLP</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Knowledge graphs on the web-an overview</title>
		<author>
			<persName><forename type="first">N</forename><surname>Heist</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hertling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ringler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Graphs for eXplainable Artificial Intelligence: Foundations, Applications and Challenges</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Tiddi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Lécué</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Hitzler</surname></persName>
		</editor>
		<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A configurable evaluation framework for node embedding techniques</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Pellegrino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cochez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Garofalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ristoski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: ESWC 2019 Satellite Events</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="156" to="160" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">KGvec2go -knowledge graph embeddings as a service</title>
		<author>
			<persName><forename type="first">J</forename><surname>Portisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hladik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>LREC</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">RDF2Vec: RDF graph embeddings and their applications</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ristoski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rosati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">D</forename><surname>Noia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">D</forename><surname>Leone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic Web</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="721" to="752" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ristoski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K D</forename><surname>De Vries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ISWC</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="186" to="194" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
