<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Identifying Information Needs by Modelling Collective Query Patterns</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Khadija</forename><surname>Elbedweihy</surname></persName>
							<email>k.elbedweihy@dcs.shef.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science</orgName>
								<orgName type="laboratory">OAK Group</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Suvodeep</forename><surname>Mazumdar</surname></persName>
							<email>s.mazumdar@dcs.shef.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science</orgName>
								<orgName type="laboratory">OAK Group</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Amparo</forename><forename type="middle">E</forename><surname>Cano</surname></persName>
							<email>a.cano@dcs.shef.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science</orgName>
								<orgName type="laboratory">OAK Group</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Stuart</forename><forename type="middle">N</forename><surname>Wrigley</surname></persName>
							<email>s.wrigley@dcs.shef.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science</orgName>
								<orgName type="laboratory">OAK Group</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><surname>Ciravegna</surname></persName>
							<email>f.ciravegna@dcs.shef.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Computer Science</orgName>
								<orgName type="laboratory">OAK Group</orgName>
								<orgName type="institution">University of Sheffield</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Identifying Information Needs by Modelling Collective Query Patterns</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">106DDAF36510492CE21BD0307A41ECE0</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>linked data</term>
					<term>information visualisation</term>
					<term>semantic query log analysis</term>
					<term>information needs</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With individuals, organisations and governments releasing large amounts of linked data, users now have access to an immense repository of highly structured data, ready to be queried and reasoned upon. However, it is important at this stage to ask questions such as: what do Linked Data users look for, and how do they search for information? Understanding the information needs of users accessing such data could be invaluable to researchers, developers and linked data providers and consumers. In this paper, we present an approach to formalising query log analysis and describe how we consume the results of such analysis. We present SEMLEX, a visualisation interface that facilitates exploration of users' information needs by analysing queries issued to a public dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Over the last two decades traditional search engines have improved accuracy by adapting their processing to address the information needs of web users. Part of this progress has been possible thanks to the analysis and interpretation of query logs <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b5">6]</ref>. These studies addressed statistics such as query length, term analysis and topic classification <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b12">13]</ref>, as well as the identification of changes in users' search behaviour over time <ref type="bibr" target="#b6">[7]</ref>. However, the structure and information captured in traditional query logs limit the analysis to a set of timestamped keywords and URIs, which lack structure and semantic context.</p><p>The movement from the 'web of documents' towards structured and linked data has made significant progress in recent years. Semantic Web gateways (e.g., Sindice <ref type="bibr" target="#b13">[14]</ref>) expose SPARQL endpoints, which allow users or software agents to perform more complex querying and reasoning over the 'web of data'. Although the use of these gateways has built up a rich semantic trail of users' information needs in the form of semantic query logs, little research has been done on the interpretation of query logs as clues for analysing and predicting information needs at the semantic level.</p><p>Previous studies have focused on metadata statistics derived from Semantic Web search engines (e.g., <ref type="bibr" target="#b8">[9]</ref>). In this work, we investigate the size of the semantic gap between supply and demand within the Semantic Web by analysing the semantic content of query logs. For our analysis, we define information needs as the set of concepts and properties users refer to while using SPARQL queries. 
Consider: PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt; SELECT ?manufacturer WHERE { &lt;http://dbpedia.org/resource/Acura_ZDX&gt; dbo:manufacturer ?manufacturer. }</p><p>This query shows a user looking for the manufacturer of a particular car. The user's information needs would be represented as http://dbpedia.org.../Automobile and dbo:manufacturer. The concept Automobile would be inferred by querying the linked data endpoint.</p><p>The contributions of this paper are as follows: 1. We provide a new approach for analysing semantic query logs. 2. We describe a set of methods for extracting patterns in semantic query logs. 3. We implement these methods in an interactive tool which enables the exploration of information needs revealed by the semantic query log analysis. We use a DBpedia query log dataset as a case study for testing our methodology. In this study, we explore aspects such as what information individuals or software agents commonly look for and the manner in which they perform the query. Such analyses can give an insight into the coverage and distribution of queries over the data and whether users and agents are making use of the whole or just a small portion of a dataset. Our visualisation tool supports the identification of interesting trends or hidden patterns.</p><p>This paper is structured as follows: Section 2 presents a review of the current state of the art in analysing query logs. Section 3 discusses our approach to analysing query logs by modelling log entries and describes the subsequent analysis results. Section 4 describes the dataset we have used for our analysis. Section 5 presents our approach to consuming our analysis results and some observations. Section 6 concludes the paper and discusses the next stages of our research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>With the size of the Semantic Web currently approaching 40 billion triples, there has been a growing interest in studying different aspects related to its use and characteristics. Two recent studies <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b3">4]</ref> investigated whether casual web users can satisfy their information needs using the Semantic Web. The first study focused on extracting the main objects and attributes users were interested in from query logs, which were then compared with Wikipedia templates to examine whether the schema of structured data on the web matched the users' needs, as a key indicator of the success of semantic search. In the second study, Halpin <ref type="bibr" target="#b3">[4]</ref> used a named entity recogniser to identify names of people and places together with WordNet <ref type="bibr" target="#b2">[3]</ref> to identify abstract concepts found in the users' queries. To investigate whether the Semantic Web provided answers to these queries, Falcons <ref type="bibr" target="#b1">[2]</ref> was used as the Semantic Web search engine and the results of executing the queries were analysed. On average, 1,339 URIs were returned for entity queries, while 26,294 URIs were returned for concept queries. The authors interpreted this finding as showing that semantic search engines similar to Falcons contain information of interest to ordinary users. Möller et al. <ref type="bibr" target="#b9">[10]</ref> were the first to address the usage patterns of Linked Open Data (LOD). Unlike previous studies which had a primary focus on the content of the queries, this study had a broader view of web usage mining: it answered the questions of who is using LOD and how it is being used. The agents issuing the requests are classified into semantic and conventional based on their ability to process structured data. 
Additionally, the study investigated the relevance of a dataset according to how its usage statistics are affected by events of public interest such as conferences or political events. Similarly, Kirchberg et al. <ref type="bibr" target="#b7">[8]</ref> used query logs provided by the USEWOD2011 data challenge <ref type="foot" target="#foot_0">1</ref> to analyse the traffic of queries to Linked Data resources and whether different time frames influence this traffic.</p><p>The work done by Arias et al. <ref type="bibr" target="#b0">[1]</ref> builds on <ref type="bibr" target="#b9">[10]</ref> and performs further analysis on the nature of the SPARQL queries. The structure of the queries was examined to identify the most frequent pattern types and joins, as well as SPARQL features such as OPTIONAL and UNION. This information is valuable in a number of ways, including query optimisation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Query Logs Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Modelling Query Logs</head><p>In order to identify concepts and relations of interest from user queries, there is a need to formalise individual query logs into a structured and standardised representation. We propose the QLog (QueryLog) ontology<ref type="foot" target="#foot_1">2</ref> to represent the main concepts and relations that can be extracted from a query log entry and from its subsequent analysis stages. The ontology has been developed by identifying the concepts of a log entry that follows the Combined Log Format (CLF) <ref type="foot" target="#foot_2">3</ref>. Fig. <ref type="figure" target="#fig_0">1</ref> shows an example of a CLF log entry.</p><p>A query log entry is parsed to identify its properties, including date and time, response size, response code, agent and query string (including the SPARQL query). In addition to the concepts that were identified from a CLF log entry, the QLog ontology also contains concepts to describe our analysis of the query log entry itself. The query string (identified as Request String in a CLF log entry) is further parsed and analysed to identify which concepts and relations have been queried for. The SPARQL query is also analysed to identify derived properties such as the types and number of triple patterns, joins and filters. Fig. <ref type="figure" target="#fig_1">2</ref> shows the proposed QLog ontology.</p></div>
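<div xmlns="http://www.tei-c.org/ns/1.0"><p>The extraction of log entry properties described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the regular expression, the function name parse_clf_entry and the sample log line are assumptions made for illustration, and the sketch handles only the standard CLF fields, assuming that the SPARQL query is passed in a URL parameter named query.</p><p><![CDATA[
```python
import re
from urllib.parse import urlsplit, parse_qs

# Combined Log Format:
# host ident authuser [date] "request" status bytes "referrer" "agent"
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_clf_entry(line):
    """Return a dict of CLF fields plus the decoded SPARQL query, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request string has the form: METHOD target PROTOCOL
    parts = entry["request"].split(" ")
    target = parts[1] if len(parts) >= 2 else ""
    # Assumption: the query is sent as the URL parameter "query".
    params = parse_qs(urlsplit(target).query)
    entry["sparql"] = params.get("query", [None])[0]
    return entry


# Hypothetical log line, invented for illustration (not from the dataset).
sample = (
    '192.0.2.1 - - [01/Jan/2011:00:00:01 +0000] '
    '"GET /sparql?query=SELECT+%3Fo+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D HTTP/1.1" '
    '200 512 "-" "curl/7.21.0"'
)
entry = parse_clf_entry(sample)
print(entry["date"], entry["status"], entry["agent"])
print(entry["sparql"])
```
]]></p><p>Entries whose sparql field is empty correspond to requests for web pages or RDF resources rather than to the SPARQL endpoint, so the same field could also serve as a simple filter.</p></div>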
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Analysing Query Logs</head><p>Fig. <ref type="figure" target="#fig_2">3</ref> shows the steps carried out in the analysis of the query logs. Since a web server log includes requests to particular web pages, RDF resources or to its SPARQL endpoint, the first step in the analysis is to filter the dataset and extract the requests issued to the SPARQL endpoint. The properties associated with a log entry, as shown in the QLog ontology, are extracted first. These include the agent type and IP address, the request date as well as the response code, referrer and result size. The IP address can be used in different user-based studies that require identifying requests coming from the same user. The request date was used in the study carried out in <ref type="bibr" target="#b7">[8]</ref> to investigate the relationship between Linked Data resources and traffic of requests to these resources over different time windows. Agents requesting resources can be browsers (human usage), bots (machine usage), as well as tools (curl, wget, etc.) and data-services <ref type="bibr" target="#b9">[10]</ref>. Identifying kinds of agents requesting resources and their distribution is useful for designers of Linked Data tools to understand what information is being accessed and how.</p><p>The next step was to verify the correctness of each SPARQL query before extracting its properties. Queries were parsed using Jena<ref type="foot" target="#foot_3">4</ref> and those producing parsing errors were excluded. For each successfully-parsed query, its type was first identified. The type can be either SELECT, ASK, CONSTRUCT or DESCRIBE. In this analysis, we only considered SELECT queries since they accounted for almost 97% of the query logs <ref type="bibr" target="#b0">[1]</ref>. 
A SPARQL query can have one or more triple patterns, solution modifiers such as LIMIT and DISTINCT, pattern matching constructs such as OPTIONAL and UNION as well as FILTERs for restricting the solution space. These query parts are identified and triple patterns are analysed to extract the properties associated with the query and the triples.</p><p>A triple pattern consists of three components: a subject, a predicate and an object, with each component being either bound (having a specific value) or unbound (a variable). There are 8 types of triple patterns according to the positions of variables and constants. The most general one is &lt;?S, ?P, ?O&gt; which is used to retrieve everything in the queried data. More specific ones include patterns having 1 variable such as &lt;S, P, ?O&gt; which retrieves the object values given a subject and a predicate, or 2 variables such as &lt;S, ?P, ?O&gt; retrieving all predicates and their values for a given subject. Finally, the most specific triple pattern &lt;S, P, O&gt; does not ask for any data to be returned. After excluding the most general and specific triple patterns, the other types were identified when used in a query.</p><p>Two triple patterns used in a query can be joined by using the same unbound component in both of them. For instance ?x hasName ?y and ?x hasAge ?z are joined using the unbound subject ?x. Using this approach, six different join types were identified according to the position of the common variable in both patterns. For instance, the Subject-Subject join is one in which the common variable is found in the Subject position in both triple patterns. The other types are Subject-Predicate, Subject-Object, Predicate-Predicate, Predicate-Object and Object-Object.</p></div>
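<div xmlns="http://www.tei-c.org/ns/1.0"><p>The classification of triple patterns and joins described above can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions, not the Jena-based analysis used in this study: triple patterns are assumed to be already available as tuples of three strings, and the function names is_variable, pattern_type and join_types are hypothetical.</p><p><![CDATA[
```python
def is_variable(term):
    """SPARQL variables start with '?' or '$'."""
    return term.startswith("?") or term.startswith("$")


def pattern_type(triple):
    """Label a (subject, predicate, object) tuple, e.g. 'S P ?O'."""
    labels = ("S", "P", "O")
    return " ".join(
        ("?" + label) if is_variable(term) else label
        for label, term in zip(labels, triple)
    )


def join_types(first, second):
    """Return the join types formed by variables shared by two patterns."""
    names = ("Subject", "Predicate", "Object")
    order = {name: index for index, name in enumerate(names)}
    found = set()
    for i, a in enumerate(first):
        for j, b in enumerate(second):
            if is_variable(a) and a == b:
                # Sort the positions so that an Object/Subject match is
                # reported under the single name Subject-Object.
                left, right = sorted((names[i], names[j]), key=order.get)
                found.add(left + "-" + right)
    return found


# The example patterns from the text above.
t1 = ("?x", "hasName", "?y")
t2 = ("?x", "hasAge", "?z")
print(pattern_type(t1))    # ?S P ?O
print(join_types(t1, t2))  # {'Subject-Subject'}
```
]]></p><p>Because each of the three positions is either bound or unbound, pattern_type can produce exactly the 8 labels mentioned above, and sorting the positions in join_types yields the six unordered join types.</p></div>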
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Dataset Analysis</head><p>DBpedia was the founding dataset of the LOD cloud and remains one of its largest; indeed, in 2009, it was shown that almost 83% of all Semantic Web queries related to DBpedia <ref type="bibr" target="#b3">[4]</ref>. Its knowledge base currently describes more than 3.4 million things spanning multiple domains such as People, Places and Species.</p><p>The data used in this study is made available by the USEWOD2011 data challenge <ref type="foot" target="#foot_4">5</ref>. The query logs follow the combined log format. The challenge data, however, included two additional fields, namely Country code and Hash of original IP, to support both location and user-based analyses. The logs contained around 5 million queries issued to DBpedia over a time period of almost 4 months. Table <ref type="table" target="#tab_0">1</ref> shows the basic statistics of the query logs.</p><p>In order to count the number of unique triple patterns used in the queries, the variables found in the patterns were first normalised. In that sense, the two triple patterns '...dbpedia...resource/X hasPage ?page' and '...dbpedia...resource/X hasPage ?homepage' were considered to be equivalent since the same information is being requested. The large difference between the number of unique subjects and objects supports the findings of <ref type="bibr" target="#b0">[1]</ref>, as they showed that the most frequent triple pattern is &lt;S P ?O&gt;. This means that most of the queries request the value of the object, given a specific subject and predicate; the object is given as a variable and thus not counted.</p><p>In a similar way to analysing complexity of keyword queries on the Web of Documents in terms of query length, the first metric for Linked Data queries is the number of triple patterns used in a query. 
Almost 65% of the queries contained only 1 triple pattern, 18% contained 2 triple patterns while 15% contained 3 triple patterns. This suggests that queries follow a power-law distribution in which most of the queries are simple and lie at the head of the distribution, while more complicated queries with triple pattern counts ranging from 4 to 20 lie at the tail of the distribution. After excluding the most general and specific types of triple patterns (?S,?P,?O and S,P,O), the distribution of the other types is shown in Table <ref type="table" target="#tab_1">2</ref>. Consistent with <ref type="bibr" target="#b0">[1]</ref>, the analysis shows that the most frequent triple pattern is &lt;S P ?O&gt;. This means that for almost 50% of all queries, the information need is very specific: the value of a specific predicate for a given resource is required. Indeed, since the second most frequent pattern is &lt;S ?P ?O&gt;, over 75% of queries are about a specific resource. Some Linked Data querying approaches build indexes to identify the relevant sources for answering a query or even use them to obtain the answer itself. In this sense, the identification of the most frequent triple patterns is valuable for optimising the indexes, which in turn would improve the search performance.</p><p>Additionally, Table <ref type="table" target="#tab_1">2</ref> shows that around 86% of the queries are simple with no joins. The number of joins then increases from 1 to 20, varying inversely with the percentage of queries. An interesting finding of the analysis is that more than half of the joins (54%) were of type Subject-Subject and almost 32% were of type Subject-Object.</p><p>Knowing this information is valuable for query planning and optimisation during the query execution process.</p><p>In addition to the basic graph patterns, there are three other constructs that can be used: OPTIONAL, UNION and FILTER. Only FILTER occurred in more than half of the queries (55%). 
It is used to restrict the results according to given criteria. The most frequently observed use was with LANG, which restricts the results to the specified language. The OPTIONAL feature increases flexibility: it allows information to be returned if found but does not reject the solution when part of the query does not have matches in the data. However, it is arguably the most expensive operator in query evaluation <ref type="bibr" target="#b10">[11]</ref>. It is interesting to find that it occurred in only 15% of the observed queries. Although this low rate will be beneficial to search engines, it raises the question of why it is not used more frequently in Linked Data queries. One explanation could be the knowledge and experience of the query language required to use such a construct effectively.</p><p>Finally, the UNION construct combines graph patterns in the same way as OR is used in SQL, and occurred in only 9.5% of queries.</p><p>The number of variables found in the SELECT part of a SPARQL query shows how many data items the user needs in the results. These can be instances, concepts or relations between them. This was found to range between 1 and 13, with a variable count of 2 being the most frequent, followed by 1 and 3. Using SELECT * indicates either a lack of knowledge regarding the structure of the data or a broad and non-specific information need (e.g., data exploration). Interestingly, this accounted for only 9.5% of the queries; thus, more than 90% of queries had a specific information need and knowledge of the data structure.</p></div>
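<div xmlns="http://www.tei-c.org/ns/1.0"><p>The variable normalisation used in this section to count unique triple patterns can be sketched as follows. This is one possible reading of that step rather than the procedure actually applied in the study: variables are renamed in order of first appearance, the prefixed name dbr:X is a hypothetical stand-in for the abbreviated resource URI in the example, and the regular expression is a simplification that would also rewrite a question mark occurring inside a URI or literal.</p><p><![CDATA[
```python
import re

# Simplification: any '?name' or '$name' token is treated as a variable.
VARIABLE = re.compile(r"[?$](\w+)")


def normalise_variables(pattern):
    """Rename variables to ?v0, ?v1, ... in order of first appearance."""
    mapping = {}

    def rename(match):
        name = match.group(1)
        if name not in mapping:
            mapping[name] = "?v" + str(len(mapping))
        return mapping[name]

    return VARIABLE.sub(rename, pattern)


# Hypothetical patterns; the first two differ only in the variable name.
patterns = [
    "dbr:X hasPage ?page",
    "dbr:X hasPage ?homepage",
    "dbr:X ?p ?o",
]
unique = {normalise_variables(p) for p in patterns}
print(len(unique))  # 2
```
]]></p><p>After normalisation the first two patterns become the same string, so counting unique patterns reduces to counting the elements of a set.</p></div>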
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Consuming Query Log Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Visualisation of Query Logs</head><p>Analysis of query log entries can provide great insights into how individuals and software agents consume Linked Data. Making such analysis efforts available using a formalised representation is valuable since it facilitates a generic approach to consume such data. For example, experts can gain an understanding of the information needs that emerge from a dataset. Visualisation tools and interfaces can consume such data thereby providing a quick means to identify emerging trends and patterns from collective information needs. Fig. <ref type="figure" target="#fig_3">4</ref> shows how we make use of our analysis to provide visualisations to users.</p><p>In order to consume the query log analysis findings, we have developed software to visualise query log analysis data captured using the QLog ontology described above. It provides two different types of information:</p><p>1. Concept Graph: concepts according to query frequency (A, in Fig. <ref type="figure" target="#fig_3">4</ref>) 2. Predicate Order Tree: query predicate order (B, in Fig. <ref type="figure" target="#fig_3">4</ref>)</p><p>The query log analysis process described in Fig. <ref type="figure" target="#fig_2">3</ref> results in RDF triples that are stored in a local triplestore ('KB' in Fig. <ref type="figure" target="#fig_3">4</ref>). In order to relate the information needs with concepts in the dataset, the Linked Data endpoint is initially queried to identify the types of the instances being queried for (A1). For example, querying the DBpedia endpoint for the type of the instance 'Acura ZDX' returns http://dbpedia.../Automobile. Once a type has been determined for a particular instance, the endpoint is queried again to understand how many instances in the data are associated with that type (A2). In this example, DBpedia will be queried again for how many instances of Automobiles exist. 
This process continues until all the instances and classes have been analysed. The resulting information is then assimilated into data tables (A3). A further interesting feature that can be identified by analysing SPARQL query logs is how users query for information, especially when using multiple predicates to connect individual triple patterns. We refer to a predicate order as the order of the predicates that are observed in a query when the triple patterns use different predicates to identify different subsets of the data. Consider:</p><p>PREFIX dbo: &lt;http://dbpedia.org/ontology/&gt; PREFIX foaf: &lt;http://xmlns.com/foaf/0.1/&gt; SELECT ?name ?place WHERE { ?person dbo:birthPlace ?place. ?person foaf:name ?name. }</p><p>Here, the user's predicate of interest moves from dbo:birthPlace to foaf:name. In essence, the user is initially interested in looking at birthplaces of persons and then looking at their names. This can now be collectively studied after analysing all of the formalised query logs. Studying such patterns can provide insights into how users' information needs spread over different predicates and how these predicates are used together.</p><p>The process for visualising predicate order involves identifying the predicates that users have used. This can be retrieved by querying the KB for triple predicate instances, which provides the predicate order (B1). The triples are instantiated according to the order in which they appeared in the query. This ensures that the predicate orders remain consistent. The orders for all query log entries are then assimilated to construct a matrix (B2), which is then converted to data tables (B3). The data tables generated </p></div>
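<div xmlns="http://www.tei-c.org/ns/1.0"><p>The construction of the predicate order matrix (B2) can be illustrated with a short Python sketch. It assumes that the predicates of each query have already been extracted in their order of appearance, counts only adjacent pairs of different predicates, and uses invented example queries; the function name predicate_order_matrix is hypothetical and the sketch does not reproduce the KB-based implementation described above.</p><p><![CDATA[
```python
from collections import Counter


def predicate_order_matrix(queries):
    """Count how often one predicate is directly followed by another."""
    matrix = Counter()
    for predicates in queries:
        for current, following in zip(predicates, predicates[1:]):
            # Assumption: only moves between different predicates count.
            if current != following:
                matrix[(current, following)] += 1
    return matrix


# Each inner list holds the predicates of one query, in order of appearance.
# The queries are invented for illustration.
queries = [
    ["dbo:birthPlace", "foaf:name"],
    ["dbo:birthPlace", "foaf:name"],
    ["dbp:starring", "rdfs:label"],
]
matrix = predicate_order_matrix(queries)
print(matrix[("dbo:birthPlace", "foaf:name")])  # 2
print(matrix[("foaf:name", "dbo:birthPlace")])  # 0
```
]]></p><p>The matrix is kept sparse as a counter keyed by ordered pairs, so the direction of the move between predicates is preserved and unseen pairs simply count as zero.</p></div>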
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">SEMLEX - Exploring Information Needs</head><figure><head>Fig. 6.</head><label>6</label><figDesc>Fig. 6. Exploring information needs of DBpedia users (Predicate Order Tree). This shows, for a particular property, which predicates are most likely to be used in a single query.</figDesc></figure><p>SEMLEX (SEMantic Logs EXplorer) was designed to explore and present analysis on large query log datasets such as that described in Section 4. The current implementation of SEMLEX provides the user with two visualisations: Concept Graph and Predicate Order Tree, though several other visualisations will be included in the future. Concept Graph essentially visualises the underlying ontologies, visually encoding nodes with information such as amount of data and query frequency. Fig. <ref type="figure" target="#fig_4">5</ref> shows the relationship between query concept frequency and dataset concept frequency. In this example, the ontology classes have been visually encoded using two sets of information: size (to represent how many instances of the concept exist within the dataset) and colour (to represent how many times the concept has been queried). The larger the size of a class, the greater the number of instances. Similarly, the darker the colour for a class, the more frequently it has been observed in queries. For example, the concept 'wrestler' has been queried more times than 'soccer player' (Fig. <ref type="figure" target="#fig_4">5</ref>, top right), even though the number of instances of wrestlers is fewer than for soccer players. While aggregating all the queries to identify which concepts are most queried can provide an insight to data providers on which sections of an ontology are more 'interesting' to all users, it may also be useful to explore how users are querying the dataset. 
We found that the most commonly queried concepts in DBpedia were as follows: 'Person, Work, Organisation, Artist, Film, Place, PopulatedPlace, MusicalArtist, Settlement, Drug, Company, Software, Band, Actor, Athlete, MusicalWork, EducationalInstitution, Album, OfficeHolder, RadioStation, Country, Species, Politician, City, SoccerPlayer'.</p><p>SEMLEX also enables users to see how predicates have been used along with other predicates in individual queries. The tool accumulates all the predicates to build a matrix, which records which predicate has been used with the next and in which order. This matrix is rendered as a tree. Fig. <ref type="figure">6</ref> shows an example in which a user explores the most commonly used predicates associated with http://dbpedia.org/property/starring. The subtrees of the node are arranged according to their usage, with label being used most often and budget less frequently. In our example, we focus on how users have queried for individuals who have starred in movies and then focused their search on IMDB entries. However, it seems that more users have looked for individuals who have starred in movies and then queried for the movies they have starred in or the movie directors. Observations such as this can be interesting to other applications such as automatic query suggestions, recommender systems, search tools, etc. Fig. <ref type="figure" target="#fig_5">7</ref> shows the relationship between the information available in the dataset (instances) and the queries requesting this information. As expected, we observed a direct relationship between the number of instances of a concept and the number of times they were queried. We interpret this as users more often querying for concepts for which there is a larger amount of information in the dataset.</p><p>However, the graph also shows some interesting anomalies. 
For instance, point A (shown in the figure) refers to the concept 'Continent', which has only 10 instances but appeared in almost 10,000 queries. The same concept appears in Fig. <ref type="figure" target="#fig_4">5</ref> only as a small node. In contrast, point B relates to the concept 'AutomobileEngine', which exhibited lower than expected interest given the amount of available information.</p><p>Being aware of such a distribution (and dataset-specific points of interest) is valuable both for producers of Linked Data, who can improve the structure of their data to better suit their users' information needs, and for consumers such as designers of semantic search and visualisation tools, who can better support their users when they know more about their needs in advance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions and Future Work</head><p>This paper has presented an approach which can advance our understanding of the information needs of Linked Data consumers and help Linked Data providers match these needs. We described the analysis of semantic query logs and how the subsequent results can be represented in a formalised model. We presented a visualisation tool, SEMLEX, that demonstrates how these analyses can be consumed to explore trends and patterns in the queries.</p><p>Using DBpedia as a case study, we followed our proposed approach to analyse a sample of its query logs (around 5 million queries) from the USEWOD2011 data challenge. However, our proposed approach, model and visualisation tool are independent of any dataset and can, therefore, be used for any similar analysis of SPARQL query logs. Nevertheless, this study provides a useful insight into the information needs of Linked Data users by highlighting patterns and trends inherent in their queries. This reveals great potential for different applications consuming Linked Data. For instance, a semantic search tool could benefit from having advance knowledge of the most queried categories and the associated search patterns followed by users.</p><p>In future work, we intend to apply our approach to examine other datasets with different features such as SWDogFood as a domain-specific dataset targeting Semantic Web researchers. We further intend to study query logs that span multiple datasets such as the ones in the Linked Open Data Cloud Cache <ref type="foot" target="#foot_5">6</ref>. This could present a more representative view of Linked Data queries in terms of size and domain coverage. Additionally, it would show how the query exchange between different datasets in the cloud occurs and whether the Linked Data principle of connecting datasets is being used in real-world queries. 
Furthermore, we intend to study how user queries evolve over time.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. An example of a combined log format entry <ref type="bibr" target="#b9">[10]</ref></figDesc><graphic coords="3,134.77,116.83,362.48,62.56" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. The Query Log (QLog) Ontology. CLF concepts appear on the left and analysis concepts on the right.</figDesc><graphic coords="4,136.49,115.84,342.35,180.84" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Query Logs analysis process diagram</figDesc><graphic coords="5,167.88,115.83,279.60,94.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. Consumption of QueryLog analysis results</figDesc><graphic coords="8,150.43,115.84,314.50,165.02" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. Exploring information needs of DBpedia users (Concept Graph). Node size represents the number of instances (larger nodes represent more instances); colour represents the amount of user interest (darker nodes represent more interest).</figDesc><graphic coords="9,134.77,115.84,358.20,230.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 7 .</head><label>7</label><figDesc>Fig. 7. Distribution of number of queries referring to a concept versus number of instances of that concept in the dataset.</figDesc><graphic coords="11,134.77,116.83,345.84,194.04" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="10,134.77,115.84,346.50,152.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Statistics summarising the query logs</figDesc><table><row><cell>Number of analysed queries</cell><cell>4951803</cell></row><row><cell>Number of unique triple patterns</cell><cell>2641098</cell></row><row><cell>Number of unique subjects</cell><cell>1168945</cell></row><row><cell>Number of unique predicates</cell><cell>2003</cell></row><row><cell>Number of unique objects</cell><cell>196221</cell></row><row><cell>Number of unique vocabularies</cell><cell>323</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Distribution of triple pattern and join types in the queries.</figDesc><table><row><cell>TP Type</cell><cell>Queries Percentage</cell><cell>No. Queries</cell><cell>No. Joins</cell><cell>Queries Percentage</cell><cell>No. Queries</cell></row><row><cell>S P ?O</cell><cell>49.55%</cell><cell>3760649</cell><cell>0</cell><cell>85.8%</cell><cell>4242899</cell></row><row><cell>S ?P ?O</cell><cell>25.94%</cell><cell>1968511</cell><cell>1</cell><cell>9.8%</cell><cell>485307</cell></row><row><cell>?S P ?O</cell><cell>12.84%</cell><cell>974882</cell><cell>2</cell><cell>2.6%</cell><cell>132128</cell></row><row><cell>?S P O</cell><cell>9.51%</cell><cell>722091</cell><cell>3</cell><cell>0.8%</cell><cell>37646</cell></row><row><cell>S ?P O</cell><cell>1.17%</cell><cell>88679</cell><cell>5</cell><cell>0.6%</cell><cell>30539</cell></row><row><cell>?S ?P O</cell><cell>0.97%</cell><cell>73888</cell><cell>6</cell><cell>0.07%</cell><cell>3560</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://data.semanticweb.org/usewod/2011/challenge.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The QLog ontology and a video of SEMLEX are available at http://oak.dcs.shef.ac.uk/QLogAnalysis/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">http://httpd.apache.org/docs/1.3/logs.html#combined</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">http://jena.sourceforge.net/index.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://data.semanticweb.org/usewod/2011/challenge.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">http://lod.openlinksw.com/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>Elbedweihy and Wrigley are funded by the EU FP7 Project SEALS (Semantic Evaluation at Large Scale, FP7-238975); Cano is funded by CONACyT, grant 175203; Mazumdar is funded by SAMULET, a project supported by Rolls Royce Plc and the UK Government Technology Strategy Board.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">An empirical study of real-world SPARQL queries</title>
		<author>
			<persName><forename type="first">M</forename><surname>Arias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Martínez-Prieto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>De La Fuente</surname></persName>
		</author>
		<idno>CoRR, abs/1103.5043</idno>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>informal publication</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Falcons: searching and browsing entities on the semantic web</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceeding of the 17th international conference on World Wide Web, WWW &apos;08</title>
				<meeting>the 17th international conference on World Wide Web, WWW &apos;08<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1101" to="1102" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">WordNet: An Electronic Lexical Database</title>
		<author>
			<persName><forename type="first">C</forename><surname>Fellbaum</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>MIT Press</publisher>
			<pubPlace>Cambridge, MA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">A query-driven characterization of linked data</title>
		<author>
			<persName><forename type="first">H</forename><surname>Halpin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Web Search Behavior of Internet Experts and Newbies</title>
		<author>
			<persName><forename type="first">C</forename><surname>Hölscher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Strube</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Networks</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">1-6</biblScope>
			<biblScope unit="page" from="337" to="346" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">An analysis of web searching by European AlltheWeb.com users</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Spink</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing and Management: an International Journal</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="361" to="381" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A temporal comparison of AltaVista web searching: Research articles</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Spink</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pedersen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Am. Soc. Inf. Sci. Technol</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="559" to="570" />
			<date type="published" when="2005-04">April 2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">From linked data to relevant data -time is the essence</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kirchberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K L</forename><surname>Ko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Lee</surname></persName>
		</author>
		<idno>CoRR, abs/1103.5046</idno>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Investigating the demand side of semantic search through query log analysis</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Meij</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SemSearch</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Learning from linked open data usage: Patterns and metrics</title>
		<author>
			<persName><forename type="first">K</forename><surname>Möller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hausenblas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cyganiak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Grimnes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the WebSci10: Extending the Frontiers of Society On-Line</title>
				<meeting>the WebSci10: Extending the Frontiers of Society On-Line</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Semantics and complexity of SPARQL</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pérez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Arenas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gutierrez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Database Syst</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">16</biblScope>
			<biblScope unit="page">45</biblScope>
			<date type="published" when="2009-09">September 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Analysis of a very large web search engine query log</title>
		<author>
			<persName><forename type="first">C</forename><surname>Silverstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Marais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Henzinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Moricz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGIR Forum</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="6" to="12" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">U.S. versus European web searching trends</title>
		<author>
			<persName><forename type="first">A</forename><surname>Spink</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ozmutlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">C</forename><surname>Ozmutlu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Jansen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGIR Forum</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="32" to="38" />
			<date type="published" when="2002-09">September 2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Sindice.com: Weaving the open linked data</title>
		<author>
			<persName><forename type="first">G</forename><surname>Tummarello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Oren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Delbru</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007)</title>
				<meeting>the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007)<address><addrLine>Busan, South Korea; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="2007-11">November 2007</date>
			<biblScope unit="volume">4825</biblScope>
			<biblScope unit="page" from="547" to="560" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
