<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Semantic Search for Scientific Publications Based on Rhetorical Structure</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lan</forename><surname>Huang</surname></persName>
							<email>huanglan@jlu.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">College of Computer Science and Technology</orgName>
								<orgName type="institution">Jilin University</orgName>
								<address>
									<addrLine>Qianjin Street 2699</addrLine>
									<settlement>Changchun</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kai</forename><surname>Feng</surname></persName>
							<email>fengkai15@mails.jlu.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">College of Computer Science and Technology</orgName>
								<orgName type="institution">Jilin University</orgName>
								<address>
									<addrLine>Qianjin Street 2699</addrLine>
									<settlement>Changchun</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hao</forename><surname>Xu</surname></persName>
							<email>xuhao@jlu.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">College of Computer Science and Technology</orgName>
								<orgName type="institution">Jilin University</orgName>
								<address>
									<addrLine>Qianjin Street 2699</addrLine>
									<settlement>Changchun</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Semantic Search for Scientific Publications Based on Rhetorical Structure</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9C5CF338D83BF08AF2AFA4B117D66CEA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T16:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>semantic search</term>
					<term>rhetorical structure</term>
					<term>semantic annotation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Most scientific papers follow a rhetorical structure, with modules such as background, problem and discussion, that is deeply rooted in the minds of both authors and readers. However, most existing search engines for scientific publications make little use of this semantic information. Different readers are interested in different semantic modules of a paper, and a concept or entity carries a different meaning depending on the module in which it is mentioned. In this paper, we design and implement a platform that provides semantic search for scientific publications based on rhetorical structure. To provide better results, we start from a semantic model of scientific papers, so that readers can direct a search at the semantic module they care about.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Authors always have a logical structure in mind when they write scientific papers, and each paper has its own rhetorical structure, such as research background, problem statement, solution and future work. To help readers read in a more targeted way, some publishers even require authors to provide a structured abstract. Nevertheless, traditional search has no strong connection with this semantic information or with the numerous metadata of articles. Instead, the search engine locates documents according to the words or phrases that users enter, and it treats these words as ordinary character strings rather than as concepts.</p><p>Semantic search is an application of the Semantic Web <ref type="bibr" target="#b0">[1]</ref>. Naturally, semantic search for scientific papers needs a large database of publications and metadata. The data model is fundamental to semantic search, because structured data can be read by machines and makes it convenient to compute the relations among data items <ref type="bibr" target="#b4">[5]</ref>. Several models have been devised to label the rhetorical structure within papers. The Harmsze model proposes that a paper consists of metadata, positioning, methods, results, interpretation and outcome <ref type="bibr" target="#b1">[2]</ref>. Another model for scientific publications, ABCDE, comprises annotations, background, contribution, discussion and entities <ref type="bibr" target="#b5">[6]</ref>. Most papers can be labeled with these two coarse-grained models. In general, a paper consists of modules for metadata, background, problem, solution and discussion, and the main problem is detecting these rhetorical components. With structured data and relationships stored in a database, semantic search services can be provided. ClaimFinder was a research prototype that delivered search services based on the original data <ref type="bibr" target="#b2">[3]</ref>. Its home page lets users run a keyword search and shows the matching concept together with some relations linked to it. Mimir, an open-source semantic search framework, supports complex queries by means of natural language processing and is built on a cloud storage platform <ref type="bibr" target="#b3">[4]</ref>. It stores annotations, tokens and indexes of all the basic data, among other things, and can therefore provide better results than traditional search.</p><p>For scientific publications, the meaning of a concept varies when it appears in different semantic modules, and different people may pay attention to different parts of a paper. In view of that, we design and implement a semantic search platform based on rhetorical structure, natural language processing and semantic technologies. The platform extracts keywords from the different semantic modules of papers. Semantic search then uses these keywords, and readers can choose the semantic module they prefer. The platform searches within the chosen rhetorical module and retrieves a more accurate list of papers. Our goal is to provide more efficient and effective search services.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">System Design</head><p>We divide the work into three parts: semantic annotation, concept/entity detection and semantic search. Semantic annotation adds rhetorical-structure labels to scientific papers. Concept and entity detection extracts the keywords under each rhetorical label and stores them. Semantic search processes the query keywords and computes the results based on semantic modules.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the structure of the system. The data for each article comprise the title, authors, year and full text, which are processed by semantic annotation and by concept and entity detection. The system then stores the structured data of each article, including the keywords of every semantic module and the title of the article, in the database. Finally, the database accepts requests through the search server and returns the results.</p><p>In this paper, the system provides two main functions. The first is searching papers based on rhetorical structure. The second is enabling readers to annotate the semantic modules of scientific papers as they see fit. The system then uses statistics over these annotations to determine the final semantic modules.</p></div>
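<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the storage step described above, the structured data of an article could be kept as one database row that holds the keywords of every semantic module and the title. The table name article_keywords, the column names, the helper functions and the use of SQLite are assumptions made for illustration; the paper does not specify its database or schema.</p>
<p><![CDATA[
```python
import sqlite3

# Hypothetical schema: one row per article, one keyword column per module.
SCHEMA = """
CREATE TABLE IF NOT EXISTS article_keywords (
    article_id     INTEGER PRIMARY KEY,
    key_background TEXT,
    key_problem    TEXT,
    key_solution   TEXT,
    title          TEXT NOT NULL
)
"""

# Maps a semantic module to its (assumed) column name.
COLUMNS = {
    "background": "key_background",
    "problem": "key_problem",
    "solution": "key_solution",
}


def open_store(path=":memory:"):
    """Open the database and create the table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_article(conn, article_id, title, background, problem, solution):
    """Store one article; each keyword list is comma-joined in ranked order."""
    conn.execute(
        "INSERT INTO article_keywords VALUES (?, ?, ?, ?, ?)",
        (article_id, ",".join(background), ",".join(problem),
         ",".join(solution), title),
    )
    conn.commit()


def load_keywords(conn, article_id, module):
    """Return the ranked keyword list of one module, or [] if nothing is stored."""
    column = COLUMNS[module]
    row = conn.execute(
        f"SELECT {column} FROM article_keywords WHERE article_id = ?",
        (article_id,),
    ).fetchone()
    return row[0].split(",") if row and row[0] else []
```
]]></p>
<p>Joining each list in ranked order preserves the position of every keyword, which the search step relies on when it scores hits.</p></div>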
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Semantic annotation</head><p>Before semantic annotation, we defined a basic pattern of the rhetorical structure of scientific publications, which can be modified or extended. The experiment in this paper focuses on the general rhetorical structure of scientific papers. For convenience of discussion, we name its modules background, problem and solution. There are two ways to make semantic annotations. The first relies on the help of readers. As Figure <ref type="figure" target="#fig_1">2</ref> shows, once the button is clicked, the webpage copies the selected text and sends it to the server. The system then extracts the keywords using the information it has stored. The second way is based on LaTeX, which can manage text with labels. We have designed labels such as "\background", "\problem" and "\solution". If authors use them, the system can extract the keywords through the labels.</p></div>
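<div xmlns="http://www.tei-c.org/ns/1.0"><p>The LaTeX-based route can be illustrated with a short sketch. It assumes that the labels are written as commands with a braced argument, for example \background{...}; the paper names the labels but not their exact syntax, so this form and the function extract_modules are hypothetical.</p>
<p><![CDATA[
```python
import re

MODULES = ("background", "problem", "solution")

# Matches a label command followed by its opening brace (assumed syntax).
LABEL = re.compile(r"\\(" + "|".join(MODULES) + r")\s*\{")


def extract_modules(tex):
    """Return a dict mapping each module name to the list of labelled texts."""
    found = {name: [] for name in MODULES}
    pos = 0
    while True:
        match = LABEL.search(tex, pos)
        if match is None:
            break
        depth, i = 1, match.end()
        # Walk forward to the brace that closes this command, allowing nesting.
        while i < len(tex) and depth != 0:
            if tex[i] == "{":
                depth += 1
            elif tex[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop rather than guess
        found[match.group(1)].append(tex[match.end():i - 1].strip())
        pos = i
    return found
```
]]></p>
<p>The brace counter keeps nested commands such as \emph{...} inside the labelled text. Escaped braces are not handled, so a production parser would need more care.</p></div>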
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Concept and entity detection</head><p>Occurrences of the same concept can be distinguished by the different meanings they carry in different rhetorical modules. The system should therefore guarantee that concept and entity detection relies on accurate rhetorical structures.</p><p>To extract keywords, the system uses Jieba, a Java program for word segmentation. The program relies on the traditional TF-IDF technique, a simple but effective way to extract keywords. According to the term frequency and inverse document frequency of the words appearing in a document, the system filters out common words and preserves important ones. Once the system has the keywords, it stores them in the database with a link pointing to the article, as Figure <ref type="figure" target="#fig_2">3</ref> shows. The first column is an identifier for each article. The second column holds the keywords of the "background module", the third those of the "problem module" and the fourth those of the "solution module". The fifth column stores the title of the article. Within each semantic module, the keywords are listed in descending order of the score that ranks their statistical significance. Each semantic module of each article has fifteen to twenty keywords.</p></div>
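<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following sketch illustrates TF-IDF keyword ranking for one semantic module. It is a plain-Python stand-in rather than the Jieba implementation: the regular-expression tokenizer, the function names and the unsmoothed logarithmic IDF are assumptions chosen for brevity.</p>
<p><![CDATA[
```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphabetic tokens; a stand-in for a real word segmenter.
    return re.findall(r"[a-z][a-z\-]*", text.lower())


def top_keywords(module_texts, index, k=20):
    """Rank the terms of module_texts[index] by TF-IDF against the whole list."""
    docs = [tokenize(text) for text in module_texts]
    n_docs = len(docs)

    # Document frequency: in how many texts does each term occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    counts = Counter(docs[index])
    total = sum(counts.values()) or 1
    scores = {}
    for term, count in counts.items():
        tf = count / total
        idf = math.log(n_docs / doc_freq[term])  # 0 for terms found in every text
        scores[term] = tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, score in ranked[:k] if score > 0]
```
]]></p>
<p>Terms that occur in every text receive an IDF of zero and are dropped, which is how this sketch filters out common words.</p></div>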
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Semantic Search</head><p>Semantic search provides insight into the unstructured documents stored in the database by extracting the relevant keywords and index statistics. It is also used to identify these keywords and to index similar or related documents.</p><p>Using the keywords extracted earlier, the platform can retrieve the keywords in the semantic modules. In this paper, the score takes both the keyword weight and the publication year into account, since the keywords of a semantic module of an article are not equally important. Specifically, a hit on the keyword in the first position is assigned three points, a hit on the second is assigned two points, and a hit on any following keyword is assigned one point. In addition, an article published within the past decade scores two points for its year, and an article published more than ten years ago scores one point. The search server computes these scores and then returns the list of articles. Figure <ref type="figure" target="#fig_3">4</ref> shows the communication of data between readers and the database. First, readers input the keywords. The search server sends a query request to the database. The database then returns the data to the search server. Finally, after processing by the server, an ordered list is returned to readers.</p><p>Readers can search for a keyword and decide which semantic module they prefer. When the platform shows the result, it focuses on the scientific papers that hit the keywords in the selected semantic module. For instance, when readers search for keywords under the "solution module", semantic search looks for the keywords in the solution column (key solution) of the database and returns all the papers that hit the keywords in their solutions. As Figure <ref type="figure" target="#fig_4">5</ref> shows, the platform lists all the papers that use clustering to solve some problem. The platform can also search in another way. With the help of semantic matching, the platform can calculate the relationships among keywords, including "more general" and "less general" relationships. In this experiment, when the platform searches for "data mining", an auxiliary program for semantic matching finds all the words related to "data mining". Figure <ref type="figure" target="#fig_5">6</ref> shows the result when readers search for "data mining". Since "k-means" is less general than "data mining", some papers about k-means are returned.</p></div>
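<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring rules above can be sketched as follows. The paper does not state how the keyword points and the year points are combined, so the sketch simply adds them; it also assumes that "the past decade" means at most ten years old and that the "less general" relations come from a lookup table. The function names and the dictionary layout of an article are hypothetical.</p>
<p><![CDATA[
```python
def position_points(ranked_keywords, query_term):
    """3 points for a hit in first position, 2 for second, 1 for any later hit."""
    if query_term not in ranked_keywords:
        return 0
    return {0: 3, 1: 2}.get(ranked_keywords.index(query_term), 1)


def year_points(year, current_year):
    # Assumption: "the past decade" means at most ten years old.
    return 2 if current_year - year <= 10 else 1


def expand_query(query_terms, narrower):
    """Add the 'less general' terms of each query term (hypothetical table)."""
    expanded = list(query_terms)
    for term in query_terms:
        for child in narrower.get(term, []):
            if child not in expanded:
                expanded.append(child)
    return expanded


def rank_articles(articles, query_terms, module, current_year):
    """Return (score, title) pairs for articles that hit the query in the module."""
    results = []
    for article in articles:
        keywords = article["keywords"][module]
        hits = sum(position_points(keywords, term) for term in query_terms)
        if hits == 0:
            continue  # no query keyword found in the selected module
        # Assumption: keyword points and year points are summed.
        score = hits + year_points(article["year"], current_year)
        results.append((score, article["title"]))
    # Highest score first; ties are broken by title for determinism.
    return sorted(results, key=lambda pair: (-pair[0], pair[1]))
```
]]></p>
<p>A query for "data mining" could first be passed through expand_query with a table that lists "k-means" as less general, so that papers about k-means are also scored.</p></div>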
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and Future Work</head><p>Traditional search returns the correct results but also returns too many articles that are not relevant. Semantic search narrows the list of papers through semantic annotation. From this point of view, searching scientific articles based on rhetorical structure becomes faster and more accurate.</p><p>The platform is a preliminary experiment. The next phase of work is to label entities and to implement a cross-language platform. With entities tagged, readers can understand the concepts involved more easily and find articles more accurately.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Overall architecture of the system.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Readers complete the semantic annotation.</figDesc><graphic coords="3,151.77,293.21,311.81,212.14" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. The attributes to store the keywords.</figDesc><graphic coords="4,165.95,222.87,283.47,155.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. The process of query requirement.</figDesc><graphic coords="5,194.29,152.70,226.78,57.07" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. The result semantic search returned.</figDesc><graphic coords="5,164.83,377.77,283.47,117.47" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Fig. 6 .</head><label>6</label><figDesc>Fig. 6. The semantic search result.</figDesc><graphic coords="6,151.77,115.83,311.81,174.86" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Acknowledgements</head><p>This work is supported by the National Natural Science Foundation of China (No. 61300147), China Postdoctoral Science Foundation (No. 2014M551185), and Science and Technology Program of Changchun (No. 14GH014).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Semantic search</title>
		<author>
			<persName><forename type="first">R</forename><surname>Guha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>McCool</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International World Wide Web Conference</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="700" to="709" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A P</forename><surname>Harmsze</surname></persName>
		</author>
		<title level="m">A modular structure for scientific articles in an electronic environment</title>
				<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
		<respStmt>
			<orgName>University of Amsterdam</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Ágnes Sándor: Scientific discourse on the semantic web: A survey of models and enabling technologies</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Shum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Groza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Handschuh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Semantic Web Journal Interoperability Usability Applicability</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Mimir: An open-source semantic search framework for interactive information seeking and discovery</title>
		<author>
			<persName><forename type="first">V</forename><surname>Tablan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bontcheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cunningham</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Web Semantics Science Services &amp; Agents on the World Wide Web</title>
		<imprint>
			<biblScope unit="page" from="52" to="68" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Ontology-based interpretation of keywords for semantic search</title>
		<author>
			<persName><forename type="first">T</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cimiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rudolph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Studer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Lecture Notes in Computer Science</title>
		<imprint>
			<biblScope unit="volume">4825</biblScope>
			<biblScope unit="page">523</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">The ABCDE format enabling semantic conference proceedings</title>
		<author>
			<persName><forename type="first">A</forename><surname>De Waard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tel</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>SemWiki</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
