<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Topic Modeling for RDF Graphs</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jennifer</forename><surname>Sleeman</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Computer Science and Electrical Engineering</orgName>
								<orgName type="institution">University of Maryland, Baltimore County</orgName>
								<address>
									<postCode>21250</postCode>
									<settlement>Baltimore</settlement>
									<region>MD</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tim</forename><surname>Finin</surname></persName>
							<email>finin@cs.umbc.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science and Electrical Engineering</orgName>
								<orgName type="institution">University of Maryland, Baltimore County</orgName>
								<address>
									<postCode>21250</postCode>
									<settlement>Baltimore</settlement>
									<region>MD</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anupam</forename><surname>Joshi</surname></persName>
							<email>joshi@cs.umbc.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science and Electrical Engineering</orgName>
								<orgName type="institution">University of Maryland, Baltimore County</orgName>
								<address>
									<postCode>21250</postCode>
									<settlement>Baltimore</settlement>
									<region>MD</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Topic Modeling for RDF Graphs</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">24E0900146E9F885E03E37E866425AB7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T09:12+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>dbp:Alan_Turing dbpo:award dbp:Order_of_the_British_Empire</term>
					<term>dbp:Alan_Turing dbpo:birthDate &quot;1912-06-23+02:00&quot;^^xsd:date</term>
					<term>dbp:Alan_Turing dbpo:birthPlace dbp:Paddington</term>
					<term>dbp:Alan_Turing dbpo:field dbp:Computer_science</term>
					<term>dbp:Alan_Turing rdfs:label &quot;Alan Turing&quot; dbp:Alan_Turing rdf:type dbpo:Scientist</term>
					<term>dbp:Alan_Turing rdf:type foaf:Person</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Topic models are widely used to thematically describe a collection of text documents and have become an important technique for systems that measure document similarity for classification, clustering, segmentation, entity linking and more. While they have been applied to some non-text domains, their use for semi-structured graph data, such as RDF, has been less explored. We present a framework for applying topic modeling to RDF graph data and describe how it can be used in a number of linked data tasks. Since topic modeling builds abstract topics using the co-occurrence of document terms, sparse documents can be problematic, presenting challenges for RDF data. We outline techniques to overcome this problem and the results of experiments in using them. Finally, we show preliminary results of using Latent Dirichlet Allocation generative topic modeling for several linked data use cases.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Data presented as RDF triples can be problematic for tasks that involve identifying entities, finding entities that are the same, finding communities of entities and aligning ontological information. Data describing a resource can be sparse, making it harder to distinguish one resource from other similar resources. Data describing a resource can also be noisy, containing excessive data that is not relevant to the resource. Finally, there can be large volumes of data, which may contribute to an increase in noise, errors, and ambiguities.</p><p>When data originating from multiple sources is used, combining and resolving resource information can be challenging. For example, when aligning attributes from one ontology to another, often there are attributes that simply are not alignable <ref type="bibr" target="#b23">[24]</ref>. In this paper, we show how topic modeling can be used to support tasks such as ontology alignment, type recognition, community detection and resolving resources that are the same. Though topic modeling can be challenged by problems related to sparseness and noise, we show ways to overcome these problems.</p><p>Topic modeling has quickly become a popular method for modeling large document collections for a variety of natural language processing tasks. Topic modeling is based on statistics of the co-occurrence of terms (typically words) and establishes topics, which are groupings of terms, to describe documents. It has been used to describe <ref type="bibr" target="#b1">[2]</ref> and classify documents <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b0">1]</ref>, as a feature selection method <ref type="bibr" target="#b7">[8]</ref>, for sentiment analysis <ref type="bibr" target="#b15">[16]</ref> and as a tool for clustering things of interest.</p><p>Topic modeling is a statistical method that produces abstract categories or topics from the processing of a set of documents; Figure <ref type="figure">1</ref> shows the graphical model for LDA. Several methods have been developed for generating topics. Early work by Deerwester et al. <ref type="bibr" target="#b4">[5]</ref> introduced the concept of Latent Semantic Analysis (LSA), which uses singular value decomposition to find the semantic structure of documents and improve indexing and retrieval.</p><p>Hofmann <ref type="bibr" target="#b12">[13]</ref> later used the concept of Probabilistic Latent Semantic Indexing (pLSI) to introduce a probabilistic generative approach. More recent work that has grown in popularity is Latent Dirichlet Allocation (LDA) <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b1">2]</ref>, which is also a probabilistic approach but differs from pLSI in its introduction of a conjugate Dirichlet prior; it uses variational and sampling-based methods to estimate posterior probabilities. The LDA graphical model is typically conveyed by a plate diagram, as can be seen in Figure <ref type="figure">1</ref>, where W represents the words, β 1..k are the topics, θ d,k is the proportion of topic k in document d, and Z d,k are the topic assignments.</p><p>With LDA, the terms in the collection of documents produce a vocabulary that is then used to generate the latent topics. Documents are treated as a mixture of topics, where a topic is a probability distribution over this set of terms. Each document is then seen as a probability distribution over the set of topics. We can think of the data as coming from a generative process that is defined by the joint probability distribution over what is observed and what is hidden <ref type="bibr" target="#b1">[2]</ref>. 
This generative process is defined as follows.</p><p>For each document: (1) choose a distribution over topics; (2) for each word in the document: (a) select a topic from the document's distribution over topics and (b) select a term from the associated distribution over terms.</p><p>The computational portion of LDA involves learning the topic distributions by means of inference. Though there are a number of variational and sampling-based methods for performing the inference, Gibbs sampling <ref type="bibr" target="#b11">[12]</ref> is frequently used.</p><p>We describe how one might use topic modeling for RDF data and explore its application to several problems faced by the Semantic Web community. RDF data is atypical in terms of the documents that are used to create a topic model, but since this model typically works on a bag of words, we will show how RDF data can be used. Topic modeling was originally used to characterize relatively long documents, such as newswire articles or scientific papers. More recently, researchers have outlined successful strategies <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22]</ref> for using topic modeling for short texts such as tweets and SMS messages. We build on these ideas to establish an approach to using topic modeling with RDF data. Figure <ref type="figure">2</ref> shows a simple set of triples making up an RDF "document" and the word-like tokens extracted from them for our topic modeling system. There are several issues in applying topic models to short texts <ref type="bibr" target="#b27">[28]</ref>. The first is the discriminative problem, where words in short documents do not discriminate as well as in longer ones. The second is that short documents provide much less context than longer ones. 
RDF data shares both of these problems and adds a third: none of its serializations resembles any natural, human language.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Topic models and RDF graphs</head></div>
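The two-step generative story described above can be illustrated with a small, self-contained sketch. This is our own toy example, not the implementation used in this work; the vocabulary, topic distributions, and function names are invented for illustration.

```python
import random

def sample_dirichlet(alpha, k):
    """Draw a k-dimensional sample from a symmetric Dirichlet(alpha)
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, items):
    """Pick one item according to the given probability vector."""
    r, acc = random.random(), 0.0
    for p, item in zip(probs, items):
        acc += p
        if r < acc:
            return item
    return items[-1]

def generate_document(topics, vocab, n_words, alpha=0.1):
    """LDA's generative story for one document:
    (1) choose a distribution over topics; then for each word,
    (2a) select a topic and (2b) select a term from that topic."""
    theta = sample_dirichlet(alpha, len(topics))  # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, list(range(len(topics))))  # topic assignment
        words.append(sample_categorical(topics[z], vocab))       # term from topic z
    return words

# Two toy topics, each a distribution over a 4-term vocabulary.
vocab = ["election", "vote", "software", "compiler"]
topics = [[0.5, 0.5, 0.0, 0.0],   # a "politics" topic
          [0.0, 0.0, 0.5, 0.5]]   # a "computing" topic
doc = generate_document(topics, vocab, n_words=20)
```

Inference (e.g., Gibbs sampling) runs this story in reverse, estimating the hidden mixtures from the observed words.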
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Topic models for text</head><p>To shed light on these problems, we describe how topic models are used with text documents, how text documents differ from RDF documents, and how we apply topic models to RDF graphs.</p><p>Although there are a number of algorithms for defining and using topic models, they share several common aspects. A topic model uses a fixed set of K topics to describe documents in a corpus. K varies with the application and is usually between 100 and 1000. A text document could be anything from a tweet to a 30-page scientific article, but typically contains at least several paragraphs of text. The mixture of topics in a document is represented as a vector of real numbers between 0 and 1, where the kth number specifies the amount of topic k that the document exhibits. Using topic vectors makes it easy to define the "semantic" distance between two documents (often using the cosine similarity).</p><p>The K topics making up a topic model are not specified in advance, but learned by a statistical process that discovers the 'hidden thematic structure in a document collection' <ref type="bibr" target="#b1">[2]</ref>. This structure captures the probability that a word will appear in a document about a given topic, which leads to an effective way to compute the topic model vector for a document given its bag of words.</p><p>One common problem is that many of the automatically induced topics in a model may not correspond to concepts that are easy for people to identify. For topic models over text documents, the best that can be done is to list the most frequent words associated with each topic. This is often sufficient to recognize that topic number 32 has something to do with politics and elections whereas topic number 126 seems to be about software and computer applications. (Fig. <ref type="figure">3</ref>: Small and Large RDF Graphs.) 
However, there are typically topics that are difficult or impossible to associate with familiar concepts.</p><p>Once a topic model has been learned or trained from a document collection, it can be used to infer a document's topic vector from its bag of words. These vectors can then be used for a number of different tasks, such as classifying, clustering or recommending documents.</p></div>
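Comparing documents by their topic vectors is straightforward; a minimal sketch of the cosine similarity mentioned above follows (the vectors are invented for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two topic vectors: 1.0 means
    identical topic mixtures, 0.0 means no topics in common."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 4-topic vectors for three documents.
doc_a = [0.70, 0.20, 0.05, 0.05]   # mostly topic 0
doc_b = [0.60, 0.30, 0.05, 0.05]   # mixture similar to doc_a
doc_c = [0.05, 0.05, 0.20, 0.70]   # mostly topic 3

# doc_a is semantically closer to doc_b than to doc_c.
assert cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c)
```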
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Topic models for RDF</head><p>While topic models were originally defined for text documents, they have been applied to other kinds of data (e.g., images and generic sequences) and can be used with RDF graphs. To do this, we must define what we will mean by a "document" and the word-like elements within them and also how to compile large collections of those "documents" to train our topic modeling system. For natural language, topic models sometimes exploit linguistic concepts like part-of-speech tags, stop words, and word lemmas and also apply normalization operations (e.g., downcasing, punctuation removal, abbreviation expansion, etc.) to improve performance or accuracy, so we might consider analogs to these notions for RDF data.</p><p>What's a document? We assume that a knowledge base is represented by triples, where a triple has a subject s, predicate p, and an object o, forming a triple t(s, p, o) with the following definitions.</p><formula xml:id="formula_0">s ∈ (URI ∪ Blank), p ∈ URI and o ∈ (URI ∪ Blank ∪ Literal)</formula><p>We define an 'entity' as the set of triples t1 ... tn ∈ T associated with a common subject URI s. In our current model, we treat a document as the set of triples that describe a single 'entity'. We experiment with this definition of a document by working with different parts of the triple, supplementing the triples with additional data, and including 1-hop in-bound and out-bound links.</p><p>Alternatively, we could define it as the set of triples in which a given node is either the subject or the object. If we consider a large dataset like DBpedia to be a document collection, we probably want to further restrict the nodes in the graph that we will consider to be documents. A node like dbp:Alan Turing makes a good subject but T-box nodes like owl:sameAs or dbpo:birthDate probably do not. 
Similarly, structural nodes such as Freebase's compound value type nodes or nodes that link a measurement with a unit and a numeric value may not be suitable subjects for documents.</p><p>What's a word? The "words" in a document are extracted from the subjects, predicates and objects of each of its triples and the extractions are treated as bags of words. The words related to an entity (given by a URI) are produced by extracting all of the triples related to that URI from a triple store, tokenizing them, and removing the namespace paths from subjects, predicates and objects. Literals, i.e., strings, are first sanitized and have stop words removed. Figure <ref type="figure">2</ref> shows an example of a set of triples forming a simple document and its associated word-like tokens.</p></div>
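A minimal sketch of this tokenization step, assuming '/'- and '#'-delimited local names and a small stop-word list of our own (this is an illustration, not the exact pipeline used here):

```python
STOP_WORDS = {"the", "of", "a", "an", "and", "in"}

def strip_path(term):
    """Drop the namespace path from a URI, keeping the local name,
    and turn underscores into spaces for tokenization."""
    local = term.rstrip("/").split("/")[-1].split("#")[-1]
    return local.replace("_", " ")

def triple_to_words(s, p, o):
    """Turn one t(s, p, o) triple into word-like tokens: namespace
    paths are removed from URIs, literals are sanitized, and
    stop words are dropped."""
    words = []
    for term in (s, p, o):
        text = strip_path(term) if term.startswith("http") else term
        for token in text.replace('"', " ").split():
            token = token.strip(".,;").lower()
            if token and token not in STOP_WORDS:
                words.append(token)
    return words

bag = triple_to_words(
    "http://dbpedia.org/resource/Alan_Turing",
    "http://dbpedia.org/ontology/field",
    "http://dbpedia.org/resource/Computer_science")
# bag holds tokens such as "alan", "turing", "field", "computer", "science"
```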
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">RDF Short Text Problem</head><p>Short text suffers from two distinct problems: sparseness affects how well the model can discriminate and the lack of context affects word senses <ref type="bibr" target="#b27">[28]</ref>. If a word has multiple meanings, context can often be used to identify the correct one. RDF data also suffers from "unnatural" language: since RDF data is represented as triples, the natural structural clues found in human languages are not present.</p><p>Sparseness. RDF data can suffer from sparseness. If we choose to think of a document as the set of triples associated with a resource defined by a URI, the set can be large, resulting in a larger, more context-enriched bag of words, or small, offering very little information at all, as shown in Figure <ref type="figure">3</ref>. Even with a large set of triples, the data that can actually be used in the bag of words after pre-processing may amount to a small set of words.</p><p>Lack of Context. Context can be particularly problematic for RDF data: words with multiple meanings are common, and given the potential sparseness of RDF data in addition to its unnatural language characteristic, it can be hard to determine the intended meaning. For example, the description Alternative rock contains the word rock; without additional context, this word could be interpreted in multiple ways.</p><p>Unnatural Language. RDF data suffers from unnatural language issues. Since RDF data is graph-based, the natural structure of a sentence does not exist. Often the components of a sentence provide additional context for understanding words which may be polysemous or homonymous. In addition, the text is more prone to error during pre-processing. For example, it is not uncommon to find parts of a triple that have unexpected characters, unusual letter casing, pointers to another resource and data that is simply hard to parse. 
We show some of these examples in Table <ref type="table" target="#tab_0">1</ref>.</p><p>The short text problems in RDF. Researchers tend to take two approaches to overcoming short-text-related problems: they either supplement the text or create modified versions of LDA to support their specific problem. We currently take the approach of supplementing the text using a set of baseline techniques. Our future work will include additional techniques for supplementing text and a modified LDA algorithm for RDF graphs. We show ways to supplement RDF data in Figure <ref type="figure">4</ref>. We could simply use the object literals of the triples; for example, "University of North Carolina" is an object literal for the resource "University of North Carolina at Greensboro". We could also use the predicates in addition to the object literals. For example, the predicate "http://dbpedia.org/property/name" may be the predicate for the triple with the object literal "University of North Carolina". We may also choose to use Wordnet to supplement the RDF data. For example, for the word "Boston", if we take a subset of synsets and the definition, we enrich the word "Boston" with the following data: [capital of Massachusetts, state capital and largest city of Massachusetts; a major center for banking and financial services, Beantown, Bean Town, Boston, Hub of the Universe]. We also looked at using 1-hop in-links and 1-hop out-links. For example, "Boston" may refer to a mayor; with 1-hop out-links we could consume the triples related to the mayor of Boston. We could do this similarly with in-links.</p></div>
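The supplement variations above can be sketched as follows. The triples, prefixes, and the tiny WORDNET dictionary are all hypothetical stand-ins; a real system would query WordNet itself for synsets and definitions.

```python
# Hypothetical triples for one entity (the subject is left out of the bag).
triples = [
    ("dbp:UNC_Greensboro", "dbpp:name", "University of North Carolina"),
    ("dbp:UNC_Greensboro", "dbpp:city", "Greensboro"),
]

# Stand-in for WordNet: maps a word to a few synset lemmas/definitions.
WORDNET = {
    "boston": ["capital of Massachusetts", "Beantown", "Hub of the Universe"],
}

def bag_of_words(triples, use_predicates=False, use_wordnet=False):
    """Build a bag of words from object literals, optionally adding
    predicate local names and WordNet supplements for sparse graphs."""
    words = []
    for _, p, o in triples:
        words.extend(o.lower().split())
        if use_predicates:
            words.append(p.split(":")[-1].lower())   # local name of predicate
        if use_wordnet:
            for w in o.lower().split():
                words.extend(t.lower()
                             for syn in WORDNET.get(w, [])
                             for t in syn.split())
    return words

objects_only = bag_of_words(triples)
with_preds = bag_of_words(triples, use_predicates=True)
```

The same entity thus yields progressively richer bags, which is the lever we use against sparseness.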
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Related Work</head><p>Work by Hong et al. <ref type="bibr" target="#b14">[15]</ref> focuses on Twitter data and takes the approach of both developing a modified version of LDA and defining a number of modeling schemes to be used. They infer a topic mixture for messages and authors, where each word in a document is associated with an author latent variable and a topic latent variable. In our work we are not proposing a modification to LDA but rather a way to supplement RDF triples such that the data is better suited for LDA modeling.</p><p>Since topic modeling works on the co-occurrences of terms, sparse documents can be problematic. Work by Yan et al. <ref type="bibr" target="#b27">[28]</ref> brings light to this problem in terms of 'short text'. As described in this work, researchers often aggregate short text documents or customize the topic model to train on aggregated data. Others make assumptions as to how documents relate to topics. Yan et al. take the approach of a generative model specifically for 'biterms', where a biterm is an 'unordered word-pair co-occurrence'. (Fig. <ref type="figure">4</ref>: Bag of Words Variations.) Again, they specifically address short text by modifying LDA. Though we think this work has merit, in this paper we specifically look at how to modify the data itself.</p><p>Work by Phan et al. <ref type="bibr" target="#b20">[21]</ref> describes how external data sources can supplement short text. They describe a framework for classifying short, sparse text which includes collecting large data sets that are used to create hidden topics. These hidden topics are then used in conjunction with the small data set to support classification. In this approach they were able to address the data sparseness problem and expand their training set to be more representative. 
This approach differs from ours in that they supplement the short text with large data sets to which they apply topic modeling, whereas we supplement the RDF itself and then apply topic modeling.</p><p>Work by Dietz et al. <ref type="bibr" target="#b5">[6]</ref> uses topic modeling for bibliographical data and presents the results as RDF data. However, this work does not address the problem of using topic modeling directly on RDF data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Applying Topic Modeling</head><p>Given our description of topic modeling and how it could be used with RDF data, we have outlined a number of ways topic modeling could be applied to research tasks within the Semantic Web community.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Predicting entity types</head><p>Often there is a need to associate type information with entities that are defined within RDF data <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b19">20]</ref>. For example, it is not clear from its name what the resource 'http://dbpedia.org/resource/City of Golden Shadow' refers to; however, the types book, WrittenWork and Creative Work are associated with the resource. Predicting type information when it does not exist provides additional information about the entity, supporting tasks such as knowledge base population, entity coreference resolution and entity linking.</p><p>We use topic modeling to support entity type recognition by creating a topic model from a sample of data which contains known type information. We use the model to associate topics with the types. Given new data with missing type information, we then infer topics for new entities. Using KL divergence, for each entity with an unknown type, we measure the divergence between its topic vector and the topic vectors of each known type. Based on this measure we assign known types to new entities.</p></div>
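A sketch of this assignment step: the type names and topic vectors below are invented, and the KL divergence is smoothed so a zero probability does not produce log(0).

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two topic distributions, with smoothing."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def predict_type(entity_vec, type_vecs):
    """Assign the known type whose topic vector diverges least
    from the entity's inferred topic vector."""
    return min(type_vecs, key=lambda t: kl_divergence(entity_vec, type_vecs[t]))

# Hypothetical topic vectors over 3 topics for two known types.
type_vecs = {
    "dbpo:WrittenWork": [0.8, 0.1, 0.1],
    "dbpo:Settlement":  [0.1, 0.1, 0.8],
}
entity = [0.7, 0.2, 0.1]   # topics inferred for an untyped resource
assert predict_type(entity, type_vecs) == "dbpo:WrittenWork"
```

Note that KL divergence is asymmetric; we measure it from the entity's vector to each type's vector.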
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Entity Disambiguation</head><p>The need to match instances across different data sets or to link new instance information with existing knowledge base instances is common <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b24">25]</ref>. This method usually involves taking information from each instance and applying a matching algorithm to identify which instances are likely the same. Topic modeling supports creating clusters of entities that are closely related, which can be used as a preprocessing step for matching instances or disambiguating entities. In our work, we assume an existing knowledge base and create a topic model from its data. With this, we can compute topic vectors for new entities to be integrated into the knowledge base. We use cosine similarity to compare the new entity topic vectors to vectors for existing KB entities. We treat this approach as a candidate selection method, where the entities that have similar topic vectors should be evaluated for similarity.</p></div>
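The candidate selection step can be sketched as a top-k ranking by cosine similarity. The knowledge base entities and their topic vectors below are invented for illustration.

```python
def candidate_entities(new_vec, kb_vecs, k=3):
    """Rank KB entities by cosine similarity to a new entity's topic
    vector and keep the top k as candidates for full matching."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0
    ranked = sorted(kb_vecs, key=lambda e: cosine(new_vec, kb_vecs[e]), reverse=True)
    return ranked[:k]

# Hypothetical 3-topic vectors for a tiny knowledge base.
kb = {
    "dbp:Falling_in_Love_with_Jazz": [0.8, 0.1, 0.1],
    "dbp:Alan_Turing":               [0.1, 0.8, 0.1],
    "dbp:Boston":                    [0.1, 0.1, 0.8],
}
new_entity = [0.75, 0.15, 0.10]   # vector inferred for an incoming record
top = candidate_entities(new_entity, kb, k=2)
```

Only the surviving candidates are then passed to a full (and more expensive) instance matching algorithm.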
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Ontology alignment -Class and Property Alignment</head><p>Ontology alignment <ref type="bibr" target="#b22">[23]</ref> can include classes, properties and instances. For example, from the OAEI initiative <ref type="bibr" target="#b18">[19]</ref> oaei 101#author<ref type="foot" target="#foot_0">1</ref> from ontology 1 aligns with oaei 103#author from ontology 3. We build a topic model from one ontology and then infer topic vectors for the second ontology, making properties and classes our 'entities' of interest. We use cosine similarity to directly align properties and classes. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Community detection</head><p>Community detection approaches <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b8">9]</ref> can be categorized as topological or topical <ref type="bibr" target="#b6">[7]</ref>. We address topical detection in this work. By examining RDF graphs, communities can be detected from highly connected vertices. As an example, we took a set of resources from a DBpedia <ref type="bibr" target="#b3">[4]</ref> sample, clustered resources that are fiction literature, and produced communities based on sharing the same publishers and the same genre. A community of authors can be seen in Figure <ref type="figure" target="#fig_0">5</ref>.</p><p>This topical approach can be performed using topic modeling. For example, authors of the Fantasy Literature and Penguin Publishing community might have more topics in common than authors of other genres associated with different publishing companies.</p><p>In our work we build a topic model from a data set that we identify as having communities of interest. From this model we associate topic vectors with each entity. We then look for entities which have a number n of topics in common. To find such entities, we create a histogram from the topic probabilities for each entity. Assuming a topic defines a sub-community, we use the histogram to tell us where the most density is among topics for the entity and set a threshold so that we only consider the topics which are most relevant to the entity. From this we assign entities to topic sub-communities. We then find entities that have n sub-communities in common.</p></div>
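A minimal sketch of this thresholding scheme follows; the entities, probabilities, and threshold are invented for illustration, whereas the real experiments use DBpedia-scale data.

```python
def topic_communities(entity_topics, threshold=0.2, n_common=1):
    """Assign each entity to the topic 'sub-communities' whose
    probability exceeds a threshold, then pair up entities that
    share at least n_common sub-communities."""
    membership = {
        e: {k for k, p in enumerate(probs) if p >= threshold}
        for e, probs in entity_topics.items()
    }
    pairs = []
    entities = sorted(membership)
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if len(membership[a] & membership[b]) >= n_common:
                pairs.append((a, b))
    return membership, pairs

# Hypothetical topic probabilities over 4 topics.
entity_topics = {
    "dbp:Bruce_Springsteen": [0.6, 0.3, 0.05, 0.05],
    "dbp:Vini_Lopez":        [0.5, 0.1, 0.3, 0.1],
    "dbp:Alan_Turing":       [0.05, 0.05, 0.1, 0.8],
}
membership, pairs = topic_communities(entity_topics)
```

Here the two musicians share a high-probability topic and form a pair, while the unrelated entity does not.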
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments and Results</head><p>We experimented with different ways to build a bag of words from the RDF data based on the approaches in Figure <ref type="figure">4</ref>. For each problem we outlined previously, we used LDA without any modifications to the algorithm itself. Our goal with this work was to show how to supplement RDF data to overcome issues related to sparseness, lack of context and the use of unnatural language. We saw improvement when supplementing the RDF data with repetition of key words and with a limited set of synsets and definitions from Wordnet. Specifically, when working with large graphs, using the object literals alone may be sufficient, but we have observed better results when including either the predicate or the predicate and the subject. We have also found that where the graphs are particularly sparse, using Wordnet <ref type="bibr" target="#b17">[18]</ref> synsets and definitions can improve performance. Using 1-hop in-links and 1-hop out-links often increased the noise factor, which negatively impacted performance. We limited the data we incorporated from in-links and out-links to predicates that were of type name and label. This approach reduced the noise but we didn't see significant improvements in performance. Our future work will include exploring links more, possibly by examining graph similarities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Predicting entity types</head><p>We used DBpedia data and created two different randomly selected data sets. One, with 6000 unique resources, was used to build the topic model. The second data set had 100 unique resources. We associated topics with known types and then used the model to infer types for each entity in the second data set. We then used KL divergence to compare topic vectors. From this we mapped types from the first data set to entities in the second data set. We tested with 200 topics and 400 topics with resources that had an average of seven types that should be recognized. Our ground truth in this case was the type definitions in the DBpedia data set. We removed the type definitions from our test data set and then evaluated our predictions against what was defined by DBpedia. Though our test set was relatively small, we were able to see how precision changed based on data variations. As can be seen in Figure <ref type="figure">6</ref>, we saw the highest precision using predicates and objects, and in Figure <ref type="figure">7</ref> we saw the highest precision using predicates and objects that included the Wordnet synsets and definitions. Though it was clear that objects alone did not perform as well as including the predicate, future work will further explore the relationship between supplemental data and the number of topics chosen for the model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Entity Disambiguation</head><p>For this experiment we took a subset of DBpedia data as our knowledge base, including 300 unique entities with an average of 19 triples per entity, and used this data set to build a topic model. We created a second data set with 100 unique entities obtained from the same data, except we obfuscated the subjects such that subjects could not be directly matched. For example, the unobfuscated subject "Falling in Love with Jazz" became "Jxiiwhw wh Uaka kwki Uxhh". We used this approach to create a ground truth for entity matching, with a lookup table correlating the obfuscated and unobfuscated subjects so we could evaluate our approach. We associated topics with each entity in our knowledge base. We then took our obfuscated data set and inferred topics for each entity. From this we used cosine similarity to compare entities and tried to match entities from the two data sets. (Fig. <ref type="figure">6</ref>: Entity types with 200 topics; Fig. <ref type="figure">7</ref>: Entity types with 400 topics.) Though topic modeling is too coarse to use directly to match instances, it does provide a way to significantly reduce the number of candidates that need to be matched. Our experiments showed topic modeling was a reasonable approach for candidate selection, reducing on average the number of candidates from 1000s to 100s. However, more work is required to show how this method could be used in conjunction with an entity matching algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Ontology alignment</head><p>We tested two different alignments that included aligning properties and classes. We used the data from oaei 101 and oaei 103, where the class and property names are identical. We also used the data from oaei 101 and oaei 205, where the class and property names to be aligned are not spelled the same. For example, oaei 101#Booklet and oaei 205#Brochure should be recognized as alignable. We used the OAEI reference alignments to evaluate our approach. This alignment document indicates which properties and classes should be aligned. We excluded instance alignments for this evaluation; we did, however, extract the instance data to generate a topic model. Our evaluation examined how well we aligned properties and classes. We tested with 25, 50, 100 and 200 topics and saw the best performance with 50 topics. We exercised the different variations for the RDF data. The ontologies are good examples of sparseness; by using repetition and supplemental data we were able to get approximately 80% precision, where we selected the top N candidates of either an attribute or class match. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Community detection</head><p>We took a sample of the DBpedia data set and performed topic modeling using 50, 100, and 200 topics. From this we associated a set of topics with each entity and then looked for entities that had n topics in common. Commonality is determined by first identifying, for each entity, the topics that are most relevant given their probabilities, and then comparing entities based on this subset of topics. Our data set did not include ground truth for this evaluation. However, as seen in figure <ref type="figure" target="#fig_2">10</ref>, our preliminary results found interesting communities, such as one that included Vini Lopez and Bruce Springsteen, who are related because they played in the same band. Future work will perform more comprehensive experiments to evaluate this method further.</p></div>
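The shared-topic criterion above can be sketched as follows. This is a minimal sketch under stated assumptions: the entity names and topic distributions are made up for illustration, and a full pipeline would take the distributions from the trained LDA model and cluster the resulting edges into communities.

```python
from itertools import combinations

def top_topics(dist, k=3):
    """Indices of the k highest-probability topics for one entity."""
    return set(sorted(range(len(dist)), key=lambda i: dist[i],
                      reverse=True)[:k])

def communities(entity_topics, min_shared=2, k=3):
    """Link two entities when their top-k topic sets share at least
    min_shared topics; the returned edges induce candidate communities."""
    tops = {e: top_topics(d, k) for e, d in entity_topics.items()}
    return [(a, b) for a, b in combinations(tops, 2)
            if len(tops[a] & tops[b]) >= min_shared]
```

Connected components over these edges (e.g. via union-find) would then give the communities themselves.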
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>We described a framework for applying topic modeling to RDF graph data and showed how it can be used in a number of linked data tasks, including predicting entity types, instance matching, ontology alignment, context identification and community detection. By supplementing RDF data we can address the problems related to sparseness, lack of context and unnatural language. We have used different problems in Semantic Web research to exercise LDA modeling. For preliminary results over a small amount of data, topic modeling shows promise for a number of tasks. Repetition and Wordnet supplemental data improve performance. More work is needed to determine how we could use in-links and out-links to supplement the data without increasing the noise. Our results, though preliminary, provide some insight into how a basic LDA model might perform given RDF graph data.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 5 :</head><label>5</label><figDesc>Fig. 5: Example of Fantasy Literature Penguin Publishing Author Community</figDesc><graphic coords="9,139.95,115.84,335.45,148.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 8 :</head><label>8</label><figDesc>Fig. 8: Ontology alignment (101-103) with 50 topics</figDesc><graphic coords="12,203.93,421.89,207.49,207.49" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 10 :</head><label>10</label><figDesc>Fig. 10: Community Detection</figDesc><graphic coords="14,212.58,115.84,190.20,143.81" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="7,152.06,115.84,311.24,219.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Example objects extracted from triples.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">We use the abbreviations oaei 101, oaei 103, and oaei 205 for http://oaei.ontologymatching.org/tests/101/onto.rdf, http://oaei.ontologymatching.org/tests/103/onto.rdf and http://oaei.ontologymatching.org/tests/205/onto.rdf, respectively.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgment. This work was supported by NSF grants 0910838 and 1228673.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Latent Dirichlet allocation in web spam filtering</title>
		<author>
			<persName><forename type="first">I</forename><surname>Bíró</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Szabó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Benczúr</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">4th int. Workshop on Adversarial Information Retrieval on the Web</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="29" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Probabilistic topic models</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Comm. of the ACM</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="77" to="84" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<ptr target="http://dbpedia.org/Datasets" />
		<title level="m">DBpedia: Dbpedia data set</title>
				<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Indexing by latent semantic analysis</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C</forename><surname>Deerwester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">K</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Furnas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Harshman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">JASIS</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="391" to="407" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Utilize probabilistic topic models to enrich knowledge bases</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stewart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the ESWC 2006 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation</title>
				<meeting>of the ESWC 2006 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Community detection: Topological vs. topical</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ding</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Informetrics</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="498" to="514" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Feature selection for sentiment analysis based on content and syntax models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Duric</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Song</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Decision Support Systems</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="704" to="711" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A state of the art on social network analysis and its applications on a semantic web</title>
		<author>
			<persName><forename type="first">G</forename><surname>Erétéo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Buffa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gandon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Grohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Leitzelman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sander</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">7th Int. Semantic Web Conference</title>
				<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Data linking for the semantic web</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ferraram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Scharffe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Semantic Web: Ontology and Knowledge Base Enabled Tools, Services and Applications</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Community detection in graphs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Fortunato</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Physics Reports</title>
		<imprint>
			<biblScope unit="volume">486</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="75" to="174" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Griffiths</surname></persName>
		</author>
		<ptr target="http://bit.ly/1IA88Pc" />
		<title level="m">Gibbs sampling in the generative model of latent Dirichlet allocation</title>
				<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Probabilistic latent semantic indexing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hofmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="50" to="57" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimmermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Umbrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polleres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Decker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="76" to="110" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Empirical study of topic modeling in twitter</title>
		<author>
			<persName><forename type="first">L</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">D</forename><surname>Davison</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Social Media Analytics</title>
				<meeting>the First Workshop on Social Media Analytics</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="80" to="88" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Joint sentiment/topic model for sentiment analysis</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">18th ACM Conf. on Information and Knowledge Management</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="375" to="384" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Typifier: Inferring the type semantics of structured data</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Bicer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">29th Int. Conf. on Data Engineering</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="206" to="217" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">WordNet: a lexical database for English</title>
		<author>
			<persName><forename type="first">G</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CACM</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="39" to="41" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<ptr target="http://oaei.ontologymatching.org/2014/" />
		<title level="m">Ontology alignment evaluation initiative -OAEI 2014 campaign</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Type inference on noisy RDF data</title>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bizer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Learning to classify short and sparse text &amp; web with hidden topics from large-scale data collections</title>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">H</forename><surname>Phan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Horiguchi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">17th WWW Conf</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="91" to="100" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Characterizing microblogs with topic models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ramage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Liebling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICWSM</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1" to="1" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Ontology matching: state of the art and future challenges</title>
		<author>
			<persName><forename type="first">P</forename><surname>Shvaiko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Euzenat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="158" to="176" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note>IEEE Transactions on</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Opaque attribute alignment</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sleeman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Alonso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pope</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Badia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 3rd Int. Workshop on Data Engineering Meets the Semantic Web</title>
				<meeting>3rd Int. Workshop on Data Engineering Meets the Semantic Web</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Computing FOAF co-reference relations with rules and machine learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sleeman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Finin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">3rd Workshop on Social Data on the Web</title>
				<imprint>
			<publisher>ISWC</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Entity type recognition for heterogeneous semantic graphs</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sleeman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Finin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joshi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">AI Magazine</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="75" to="86" />
			<date type="published" when="2015-03">March 2015</date>
			<publisher>AAAI Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Domain-independent entity coreference for linking ontology instances</title>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heflin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Data and Information Quality</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">7</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">A biterm topic model for short texts</title>
		<author>
			<persName><forename type="first">X</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">22nd Int. Conf. on the World Wide Web</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1445" to="1456" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Probabilistic community discovery using hierarchical latent gaussian mixture model</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Giles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Foley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="663" to="668" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
