<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Searching for explanation of difficult scientific terms</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Majda</forename><surname>Ennaciri</surname></persName>
							<email>ennacirimajda@gmail.com</email>
						</author>
						<title level="a" type="main">Searching for explanation of difficult scientific terms</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">EFDDFEF1F6C040C4FB22FF42BDAB7AE5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>text simplification</term>
					<term>reading comprehension</term>
					<term>automatic natural language processing</term>
					<term>information search</term>
					<term>simplified text</term>
					<term>difficult scientific terms</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Understanding scientific texts is an essential skill for successful learning in school and throughout life (project...). However, non-experts (scientists who are interested in scientific documents from disciplines in which they are not experts) encounter significant difficulties in understanding them. This is due to key words that are difficult to understand without prior knowledge, linguistic complexity, structure and length of scientific articles. In this work, our goal is to generate an explanation for a given difficult scientific term in order to help a user understand the text (definitions, examples, use cases,...). First, we trained an AI model to predict a context to a given term. Then, we compared these results with different baselines.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The lack of basic knowledge can become an obstacle to reading comprehension and reduce access to information. Simplification of scientific texts can then appear as an aid because its objective is to make complex content more easily understandable by establishing links with the basic lexicon. Traditional methods of simplifying texts can eliminate complex concepts and constructions. Furthermore, simplification is about reducing the complexity of a text while retaining the original information and meaning.</p><p>As part of the CLEF 2022 SimpleText lab <ref type="bibr" target="#b0">[1]</ref> competition, which aims to simplify scientific texts, we solved task 2, which is the search for difficult words that make it difficult to understand the texts. Despite much research, these terms remain a difficult task to define. In general, these terms can be defined as the concepts of a domain. But, such a definition leaves room for several questions about the nature of the terms, the problem lies in practical aspects such as the length of the terms and theoretical considerations about the difference between words and terms. This leads to many problems, from data collection to extraction and evaluation.</p><p>First, among the methods that do term extraction, there are linguistic methods that rely on linguistic information such as POS models and chunking. In addition, statistical methods, which use frequencies compared to a reference corpus, and hybrid methods, which combine the two methods just mentioned. They tend to be better in terms of performance compared to other methods.These methods select candidate terms based on their POS model and rank them using statistical metrics. Secondly, the advancement of machine learning techniques has made it more complicated to classify such simple methodologies mentioned before <ref type="bibr" target="#b1">[2]</ref>. Finally, Jurassic-1 <ref type="bibr" target="#b2">[3]</ref> language models, with 178B parameters for J1-Jumbo capable of transforming existing text, e.g., in the case of summarization or prediction of difficult words.</p><p>The following sections begin with a brief overview of the methodology used (IDF, PMI) to perform difficult word extraction and the datasets used. In addition, the next section contains the results obtained. The last section is devoted to a discussion and conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methods</head><p>In order to determine the terms that require explanation and contextualization to help a reader understand a complex scientific text, for example, in relation to a query, the terms that need to be contextualized (with a definition, an example and/or a use case). To do this, we computed the IDF score (which gives us the frequency of use of a word and from which we ranked these words by order of difficulty) and the PMI scoreon on a dataset that we will present next.</p><p>After computing the scores, we set a threshold from which we extracted the difficult words. That is, words that have an IDF score greater than or equal to this threshold (we set this score from the training set) are considered difficult words. Moreover, we have two types of difficult words: term or sentence, in order to extract the complex sentences, we computed the PMI of each bigram and extracted those with the highest PMI in a given content.</p><p>We classified them into 3 (difficulty scores are: 1, 2 and 3) and 5 (difficulty scores are: 1, 2, 3, 4 and 5). In addition, the sentences whose scores do not exceed the thresholds are considered easy to understand.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">IDF (Inverse Document Frequency) score</head><p>Inverse Document Frequency (IDF) is a measure of how often a word is used, i.e. how much information the word provides. The higher its score, the more important the word is. It is the inverse fraction of documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient). IDF of a term t is computed as:</p><formula xml:id="formula_0">𝐼𝐷𝐹 (𝑡) = ln( 𝑁 𝑁 𝑡 )</formula><p>where N is the total number of documents in the corpus, and 𝑁 𝑡 is the number of documents containing the term 𝑡.</p><p>[4]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">The Pointwise Mutual Information (PMI) Criterion</head><p>Pointwise Mutual Information (PMI) has proven to be a useful association measure in many natural language processing applications, such as collocation extraction and word space models.</p><p>The idea of PMI is to quantify the probability of co-occurrence of two words. The algorithm computes the (logarithmic) probability of co-occurrence,divided by the product of the single occurrence probability, as follows:</p><formula xml:id="formula_1">𝑃 𝑀 𝐼(𝑎, 𝑏) = log( 𝑃 (𝑎, 𝑏) 𝑃 (𝑎)𝑃 (𝑏) )</formula><p>With a and b are the terms that we want to calculate their PMI. knowing that, when 'a' and 'b' are independent, their joint probability is equal to the product of their marginal probabilities, when the ratio is equal to 1 (so the logarithm is equal to 0), this means that the two words together do not form a unique concept: they co-occur by chance. <ref type="bibr" target="#b4">[5]</ref> 2.3. Dataset</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.1.">Train Dataset</head><p>The data are extracted under the two topics: Medicine and Computer Science, as these two areas are the most popular on the forums. As in 2021, for computer science, they use the scientific abstracts from the Citation Network dataset: DBLP+Citation, ACM Citation network.</p><p>A student who is proficient in technical writing and translation manually annotated each sentence extracting difficult words.[6]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.2.">Test Dataset</head><p>To construct the test data, 116,763 sentences were extracted from DBLP summaries with the following queries:</p><p>• Input and output formats. The input for the train and the test data was provided in JSON and CSV formats with the following fields: • snt id a unique passage (sentence) identifier.</p><p>• source snt passage text • doc id a unique source document identifier.</p><p>• query id a query ID.</p><p>• query text difficult terms should be extracted from sentences with regard to this query. <ref type="bibr" target="#b5">[6]</ref> Input examples (CSV format):</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results and Evaluation</head><p>To evaluate our statistical model, we applied it on the training data, since we already have the actual hard terms. We found that the results obtained from the predictive model are similar to the real results. In fact, the following table shows that there are sentences that can have two difficult terms, which is similar to what we obtained. the result that we found from our predictive the true result</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>After evaluating our statistical model (IDF), we found that it performs very well for difficult term extraction. However, the field of difficult term extraction from complex content is currently being improved by trying the latest deep learning methodologies that have been successfully used in other natural language processing tasks and by updating more traditional methodologies.</p><p>In addition, we are interested in making comparisons between the scores obtained by IDF and Jurassic-1 and making comparisons with evaluation metrics.</p></div>		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of the CLEF 2022 SimpleText Lab: Automatic Simplification of Scientific Texts, Experimental IR</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ermakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bellot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kamps</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nurbakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ovchinnikova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sanjuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mathurin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hannachi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Araujo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association</title>
				<meeting><address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022. 2022</date>
			<biblScope unit="page">13390</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Auditory distraction and short-term memory: Phenomena and practical implications</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Banbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">J</forename><surname>Macken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tremblay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Jones</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Human factors</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="page" from="12" to="29" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">O L B S Y</forename><surname>Lieber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename></persName>
		</author>
		<title level="m">Jurassic-1: Technical details and evaluation</title>
				<imprint>
			<biblScope unit="volume">9</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Kavita</surname></persName>
		</author>
		<title level="m">What is inverse document frequency (idf)?</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Understanding pointwise mutual information in nlp</title>
		<author>
			<persName><forename type="first">V</forename><surname>Alto</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Automatic simplification of scientific texts: Simpletext lab at clef-2022</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">P K J N D O I S E M E A S H R H S P N</forename><surname>Ermakova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-99739-746</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-99739-746" />
	</analytic>
	<monogr>
		<title level="m">advances in information retrieval</title>
				<editor>
			<persName><surname>Setty</surname></persName>
		</editor>
		<imprint>
			<publisher>springer international publishing</publisher>
			<date type="published" when="2022">(2022</date>
			<biblScope unit="volume">13186</biblScope>
			<biblScope unit="page" from="364" to="373" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
