<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Comparative Study on Storing and Retrieval of URIs for Life Sciences Databases</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Atsuko</forename><surname>Yamaguchi</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Tokyo City University</orgName>
								<address>
									<settlement>Setagaya</settlement>
									<region>Tokyo</region>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yasunori</forename><surname>Yamamoto</surname></persName>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Database Center for Life Science</orgName>
								<address>
									<settlement>Kashiwa</settlement>
									<region>Chiba</region>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Comparative Study on Storing and Retrieval of URIs for Life Sciences Databases</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">FE963362BAC53C3B30ECB73FD1DAD9C6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Semantic Web</term>
					<term>Resource Description Framework</term>
					<term>URI</term>
					<term>Information retreival</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the field of life sciences, numerous databases are publicly available in the form of graph structures called RDF (Resource Description Framework). Each node in these graphs is linked to a unique URI representing entities such as genes and proteins. Consequently, these databases contain vast amounts of graph-structured data with numerous URIs. By establishing links to URIs in other databases, these databases become interconnected through links, ultimately forming a massive knowledge graph.</p><p>As a result, it is essential to develop a foundational infrastructure for storing and searching a significant number of URIs from life science data in this vast knowledge graph, utilizing database IDs and keywords. Presently, Front Coding is commonly employed for URI compression, but it suffers from a limitation in its inability to perform prefix matching searches. Trie is frequently used as an alternative approach capable of retrieving prefixes. This study assess the compression ratio and search efficiency by comparing the utilization of Front Coding, Trie, and our proposed methods. The evaluation is conducted using a dataset of URIs extracted from life-science RDF datasets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In the life science field, a large number of heterogeneous data is produced. To facilitate their integration, RDF (Resource Description Framework) is frequently employed, These RDF datasets exhibit graph structures, with URIs serving as universal identifiers. The utilization of large RDF datasets necessitates the development of an efficient method for storing and retrieving a comprehensive set of URIs that appear within these datasets.</p><p>A previous study by Martínez-Prieto et al. <ref type="bibr" target="#b0">[1]</ref> delved into a comparative analysis of Hashing, Front Coding, and FM-index, using dictionaries extracted from RDF datasets, focusing on URIs and literals. They also introduced a method capable of swiftly locating strings within the dataset, achieving compression ratios of 22% to 66%. However, it's noteworthy that their investigation did not explore the potential of Trie, a technique with the capability to compress sets of strings and retrieve not only the strings themselves but also their prefixes. Efficient federated search hinges on the ability to identify servers offering datasets with specific prefixes in an efficient IJCKG 2023 * Corresponding author. Envelope atsuko@tcu.ac.jp (A. Yamaguchi); yy@dbcls.rois.ac.jp (Y. Yamamoto) Orcid 0000-0001-7538-5337 (A. Yamaguchi); 0000-0002-6943-6887 (Y. Yamamoto) manner. In this poster, we undertake a comparative study of Front Coding, Trie, and two novel methods we propose. Our assessment is based on the average time required for locating, as well as the compression ratio achieved by these methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Computational Experiment</head><p>Front Coding is a technique employed to compress a set of sorted strings by comparing each string with its preceding string and encoding the shared substring to its length. To facilitate efficient location, the strings are subdivided into subsets, known as buckets. In this study, we employed a bucket size of 128 for Front Coding. Trie, on the other hand, is a tree structure designed for locating a specific string within a set of strings. In a Trie, each node corresponds to a character in a string, and each path in the Trie corresponds to a string within the set. These two methods are often used for locating a string from a set of strings.</p><p>However, we observed a distinctive feature among the URIs present in life-science RDF datasets: they tend to exhibit a limited number of prefixes and an abundance of suffixes. Based on this observation, we devised an innovative approach for URI location within a set of URIs. We propose breaking down URIs by "/" delimiter, with each substring corresponding to a node in a tree structure. As depicted in Fig. <ref type="figure" target="#fig_0">1</ref>, the number of leaves in this tree can sometimes be substantial. To address this, we divided the search method into two phases. In the first phase, we search for the URI from the root to a leaf, akin to a Trie. If a node possesses a significant number of children, we employ a secondary method in the second phase. In this context, we present two distinct methods for the second phase: the first (Method 1) involves binary search for the children, while the second method (Method 2) entails constructing Tries for each child node. We utilized four life-science RDF datasets sourced from the RDF portal <ref type="bibr" target="#b1">[2]</ref>. Due to variations in the magnitude of URIs, four datasets were chosen. Table <ref type="table" target="#tab_0">1</ref> provides an overview of these datasets, including their names, the count of URIs, and the size of the URI lists, with Quanto being the smallest and KERO the largest. For the four datasets, it has been observed that typical structures of life science RDF URIs exhibit an explosive increase in the number of children from a certain depth.</p><p>It's worth noting that the tree structures utilized in both the Trie and our proposed methods are implemented using LOUDS (Level-Order Unary Degree Sequence) methodology <ref type="bibr" target="#b2">[3]</ref>. This choice of implementation is made due to its smaller space usage in comparison to other methods, such as the double array. In our comparative analysis, we evaluated the performance of Front Coding, Trie, and our two methods (Method 1 and Method 2) using the RDF datasets. We focused on assessing the compression ratio and the average time required for URI location. To calculate the average time for URI location, we randomly selected 100 samples from the URI lists for each dataset. All of the methods were implemented in C++. The computational experiments were conducted on an Ubuntu 20.04.4 machine equipped with an Intel(R) Xeon(R) 2.40GHz CPU and 200GB of memory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Result</head><p>Table <ref type="table" target="#tab_1">2</ref> presents the compressed sizes and compression ratios. The compressed sizes encompass the necessary indexes for efficient location in the case of Trie and our methods. Notably, among the four methods evaluated, both Trie and Method 2 exhibit superior performance, with Method 1 showing the least favorable results, albeit achieving compression ratios ranging from 0.31 to 0.37. Table <ref type="table" target="#tab_2">3</ref> illustrates the average time required for URI location based on 100 trials. Front Coding emerges as the top performer, closely followed by Method 1. While the compression ratios of Trie and Method 2 are remarkably similar, Method 2 stands out as being two to three times faster than Trie. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion and Conclusion</head><p>Based on the findings of our computational experiments, we can deduce that the choice of method should be contingent on specific application requirements. If expediting the time of URI location is of paramount importance, particularly in cases where prefix finding is unnecessary, Front Coding emerges as the preferred method. Conversely, when there is a demand for prefix finding and a preference for a lower compression ratio, Method 2 should be the method of choice. For scenarios where both prefix finding and the time of URI location are top priorities, Method 1 offers the most suitable solution.</p><p>As delineated in Section 2, URIs commonly encountered in life science datasets exhibit a distinctive pattern of limited prefix variety and extensive suffix diversity. Consequently, our proposed methods are designed with a dual-phase approach that focuses on traversing paths corresponding to the strings from the prefix perspective. In instances where a node has a substantial number of children, Method 1 resorts to binary search to locate the children, refraining from compressing the strings corresponding to them. This non-compression strategy may result in a relatively lower compression ratio. In contrast, Method 2 takes a different approach, constructing a Trie for each node with a large number of children, effectively compressing the strings related to these children, and consequently yielding the smallest size of compressed URIs.</p><p>In the context of this poster, we employed LOUDS for implementing tree structures. However, it is worth noting that the double array implementation of Trie is recognized for its speed, even though it entails larger compression sizes compared to LOUDS. An important avenue for future exploration remains a comparative analysis with Trie utilizing a double array implementation.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A typical exemple of URIs in a life science RDF dataset. While there is a single path corresponding to the string from the root to the "refex" node, there are 4,430,897 leaves.</figDesc><graphic coords="2,172.63,415.62,250.01,97.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Four datasets used for computational experiment</figDesc><table><row><cell>Name</cell><cell cols="2">The count of URIs The size (byte)</cell></row><row><cell>Quanto</cell><cell>1 995 998</cell><cell>113 026 882</cell></row><row><cell>jPOST</cell><cell>11 412 967</cell><cell>510 998 241</cell></row><row><cell>RefEx</cell><cell>19 828 641</cell><cell>832 801 751</cell></row><row><cell>KERO</cell><cell cols="2">290 338 001 19 051 677 994</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Compressed size and compression ratio</figDesc><table><row><cell>Name</cell><cell>Quanto</cell><cell>jPOST</cell><cell>RefEx</cell><cell>KERO</cell></row><row><cell cols="5">Original(bit) 904 215 056 4 087 985 928 6 662 414 008 152 413 423 952</cell></row><row><cell cols="2">Front Coding 174 363 760</cell><cell>328 912 984</cell><cell>541 609 480</cell><cell>11 110 721 176</cell></row><row><cell></cell><cell>(0.1928)</cell><cell>(0.08045)</cell><cell>(0.08129)</cell><cell>(0.07289)</cell></row><row><cell>Trie</cell><cell>172 591 599</cell><cell>147 453 553</cell><cell>220 324 169</cell><cell>6 738 005 419</cell></row><row><cell></cell><cell>(0.1908)</cell><cell>(0.03606)</cell><cell>(0.03306)</cell><cell>(0.04420)</cell></row><row><cell>Method 1</cell><cell cols="3">285 735 820 1 443 548 986 2 103 835 010</cell><cell>57 241 426 238</cell></row><row><cell></cell><cell>(0.3160)</cell><cell>(0.3531)</cell><cell>(0.3157)</cell><cell>(0.3755)</cell></row><row><cell>Method 2</cell><cell>172 591 344</cell><cell>147 452 340</cell><cell>222 547 974</cell><cell>6 738 004 412</cell></row><row><cell></cell><cell>(0.1908)</cell><cell>(0.03606)</cell><cell>(0.03340)</cell><cell>(0.04420)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>The average time (µs) to locate.</figDesc><table><row><cell>Name</cell><cell cols="4">Quanto jPOST RefEx KERO</cell></row><row><cell>Front Coding</cell><cell>4.37</cell><cell>4.26</cell><cell>4.21</cell><cell>7.57</cell></row><row><cell>Trie</cell><cell>61.91</cell><cell>56.05</cell><cell cols="2">38.99 125.70</cell></row><row><cell>Method1</cell><cell>5.03</cell><cell>4.66</cell><cell>5.70</cell><cell>11.32</cell></row><row><cell>Method2</cell><cell>21.59</cell><cell>18.41</cell><cell>10.16</cell><cell>72.48</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Acknowledgments</head><p>This work was supported by ROIS-DS-JOINT 2020 and 2021 from Research Organization of Information and Systems (ROIS) and by JSPS KAKENHI grant number 21K12148.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Compression of rdf dictionaries</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Martínez-Prieto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Fernández</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cánovas</surname></persName>
		</author>
		<idno type="DOI">10.1145/2245276.2245343</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC &apos;12</title>
				<meeting>the 27th Annual ACM Symposium on Applied Computing, SAC &apos;12</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="340" to="347" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">NBDC RDF portal: a comprehensive repository for semantic data in life sciences</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kawashima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Katayama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hatanaka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kushida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Takagi</surname></persName>
		</author>
		<idno type="DOI">10.1093/database/bay123</idno>
	</analytic>
	<monogr>
		<title level="j">Database</title>
		<imprint>
			<biblScope unit="volume">2018</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Space-efficient static trees and graphs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jacobson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 30th FOCS, FOCS 1989</title>
				<meeting>the 30th FOCS, FOCS 1989</meeting>
		<imprint>
			<date type="published" when="1989">1989</date>
			<biblScope unit="page" from="549" to="554" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
