Searching for explanation of difficult scientific terms

Majda Ennaciri
ennacirimajda@gmail.com

ISSN 1613-0073

Keywords: text simplification, reading comprehension, automatic natural language processing, information search, simplified text, difficult scientific terms

Understanding scientific texts is an essential skill for successful learning in school and throughout life (project...). However, non-experts (including scientists interested in documents from disciplines outside their own expertise) encounter significant difficulties in understanding them. These difficulties stem from key terms that are hard to understand without prior knowledge, and from the linguistic complexity, structure and length of scientific articles. In this work, our goal is to generate an explanation (definitions, examples, use cases, ...) for a given difficult scientific term in order to help a user understand the text. First, we trained an AI model to predict a context for a given term. Then, we compared these results with different baselines.

Introduction

A lack of background knowledge can become an obstacle to reading comprehension and reduce access to information. Simplifying scientific texts can help, because its objective is to make complex content easier to understand by establishing links with the basic lexicon. Traditional text simplification methods can eliminate complex concepts and constructions. More generally, simplification is about reducing the complexity of a text while retaining the original information and meaning.

As part of the CLEF 2022 SimpleText lab [1], which aims to simplify scientific texts, we addressed Task 2: finding the difficult words that hinder understanding of the texts. Despite much research, such terms remain difficult to define. In general, terms can be defined as the concepts of a domain. However, such a definition leaves several questions open about the nature of terms: the problems lie both in practical aspects, such as the length of terms, and in theoretical considerations about the difference between words and terms. This leads to many problems, from data collection to extraction and evaluation.

First, among term extraction methods, there are linguistic methods, which rely on linguistic information such as POS patterns and chunking; statistical methods, which use frequencies compared to a reference corpus; and hybrid methods, which combine the two. Hybrid methods tend to perform better than the others: they select candidate terms based on their POS pattern and rank them using statistical metrics. Secondly, the advancement of machine learning techniques has made it harder to classify approaches into the simple categories mentioned above [2]. Finally, there are the Jurassic-1 [3] language models, with 178B parameters for J1-Jumbo, which are capable of transforming existing text, e.g., for summarization or for the prediction of difficult words.

The following sections begin with a brief overview of the methodology used (IDF, PMI) to extract difficult words and of the datasets used. The next section presents the results obtained. The last section is devoted to a discussion and conclusions.

Methods

Our aim is to determine the terms that require explanation and contextualization (with a definition, an example and/or a use case) to help a reader understand a complex scientific text, for example in relation to a query. To do this, we computed the IDF score, which reflects how rarely a word is used and from which we ranked the words by order of difficulty, and the PMI score, on a dataset that we present below.

After computing the scores, we set a threshold above which words are extracted as difficult: words whose IDF score is greater than or equal to this threshold (which we set from the training set) are considered difficult. Moreover, we distinguish two types of difficult items: terms and sentences. To extract the complex sentences, we computed the PMI of each bigram and extracted those with the highest PMI in a given content.
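The bigram step can be illustrated with a minimal Python sketch, using the PMI measure defined in the PMI section of this paper. The function name `rank_bigrams_by_pmi`, the `top_k` parameter and the choice to estimate probabilities from raw counts over pre-tokenized sentences are assumptions made for illustration; the paper does not specify these details.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(sentences, top_k=3):
    """Rank adjacent word pairs by PMI; `sentences` is a list of token lists."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi          # estimated joint probability P(a, b)
        p_a = unigrams[a] / n_uni    # estimated marginal probability P(a)
        p_b = unigrams[b] / n_uni    # estimated marginal probability P(b)
        scored.append(((a, b), math.log(p_ab / (p_a * p_b))))

    # Highest PMI first (assumed ranking; ties keep insertion order).
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Because PMI estimated from raw counts tends to favour very rare pairs, a practical variant might ignore bigrams seen fewer than some minimum number of times; that filter is not part of this sketch.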

We classified the extracted terms on two difficulty scales: a 3-level scale (difficulty scores 1, 2 and 3) and a 5-level scale (difficulty scores 1, 2, 3, 4 and 5). In addition, the sentences whose scores do not exceed the thresholds are considered easy to understand.
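The paper does not state how a score is mapped to a difficulty level. As one possible scheme, the sketch below uses equal-width bins between the threshold and the maximum observed score. The function `difficulty_level`, its parameters, the equal-width binning and the use of 0 for "easy" are all assumptions for illustration, not the method used in this work.

```python
def difficulty_level(score, threshold, max_score, n_levels=3):
    """Map a score to a level in 1..n_levels, or 0 if it is below the threshold.

    Equal-width binning is an assumed scheme, not one stated in the paper.
    """
    if score < threshold:
        return 0  # below the threshold: treated as easy
    if max_score <= threshold:
        return n_levels  # degenerate range: everything at or above is hardest
    width = (max_score - threshold) / n_levels
    level = int((score - threshold) / width) + 1
    return min(level, n_levels)  # the maximum score falls in the top bin
```

Calling the function with `n_levels=3` or `n_levels=5` yields the two scales mentioned above.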

IDF (Inverse Document Frequency) score

Inverse Document Frequency (IDF) is a measure of how rare a word is across a collection of documents, i.e. how much information the word provides. The higher its score, the rarer and more informative the word is. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient). The IDF of a term t is computed as:

$$\mathrm{IDF}(t) = \ln\left(\frac{N}{N_t}\right)$$

where $N$ is the total number of documents in the corpus, and $N_t$ is the number of documents containing the term $t$ [4].
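The IDF formula and the threshold rule described in the Methods section can be sketched in Python as follows. The function names `idf_scores` and `difficult_terms`, the representation of documents as token lists, and the treatment of unseen words as easy (IDF 0) are assumptions for illustration only.

```python
import math


def idf_scores(documents):
    """Return IDF(t) = ln(N / N_t) per term; `documents` is a list of token lists."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def difficult_terms(tokens, idf, threshold):
    """Keep the tokens whose IDF is greater than or equal to the threshold."""
    # Assumption: a token missing from `idf` gets score 0.0 and is treated as easy.
    return [t for t in tokens if idf.get(t, 0.0) >= threshold]
```

A word that appears in every document gets an IDF of 0, so it can never pass a positive threshold.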

The Pointwise Mutual Information (PMI) Criterion

Pointwise Mutual Information (PMI) has proven to be a useful association measure in many natural language processing applications, such as collocation extraction and word space models.

The idea of PMI is to quantify how much more often two words co-occur than would be expected if they were independent. It computes the logarithm of the probability of co-occurrence divided by the product of the individual occurrence probabilities, as follows:

$$\mathrm{PMI}(a, b) = \log\left(\frac{P(a, b)}{P(a)\,P(b)}\right)$$

where a and b are the two terms whose PMI is computed. When a and b are independent, their joint probability is equal to the product of their marginal probabilities, so the ratio is equal to 1 and the logarithm is equal to 0. This means that the two words together do not form a unique concept: they co-occur by chance [5].

Dataset

Train Dataset

The data are extracted under two topics, Medicine and Computer Science, as these two areas are the most popular on the forums. As in 2021, for computer science, the organizers use the scientific abstracts from the Citation Network dataset: DBLP+Citation, ACM Citation network.

A student who is proficient in technical writing and translation manually annotated each sentence, extracting the difficult words [6].

Test Dataset

To construct the test data, 116,763 sentences were extracted from DBLP abstracts with the following queries:

Input and output formats

The input for the train and the test data was provided in JSON and CSV formats with the following fields:

• snt_id: a unique passage (sentence) identifier.
• source_snt: passage text.
• doc_id: a unique source document identifier.
• query_id: a query ID.
• query_text: difficult terms should be extracted from sentences with regard to this query. [6]

Input examples (CSV format):
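To illustrate the record layout listed above, the following minimal sketch parses a small CSV snippet with Python's standard `csv` module. The two rows are invented placeholders, not examples from the SimpleText data. The underscore spelling of the field names, the comma delimiter, the presence of a header row and the helper name `read_passages` are assumptions for illustration.

```python
import csv
import io

# Invented placeholder rows, not taken from the SimpleText data.
SAMPLE_CSV = """snt_id,source_snt,doc_id,query_id,query_text
S1,"A placeholder sentence about neural networks.",D1,Q1,placeholder query
S2,"Another placeholder sentence, with a comma.",D1,Q1,placeholder query
"""

# Field names as listed above (underscore spelling is an assumption).
FIELDS = ["snt_id", "source_snt", "doc_id", "query_id", "query_text"]


def read_passages(handle):
    """Read passages from an open CSV file and check the expected columns."""
    reader = csv.DictReader(handle)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


passages = read_passages(io.StringIO(SAMPLE_CSV))
```

Each element of `passages` is a dictionary keyed by field name, so the sentence text of a record is available as `record["source_snt"]`.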

Results and Evaluation

To evaluate our statistical model, we applied it to the training data, for which the actual difficult terms are known. We found that the results obtained from the predictive model are similar to the annotated results. In particular, the following table shows that some sentences contain two difficult terms in the annotations, and our model likewise extracts two terms for them.

[Table not preserved; columns: "Result obtained from our predictive model" and "True result"]
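The comparison between predicted and annotated terms is described qualitatively here. One possible way to quantify it, sketched below, is per-sentence precision and recall over term sets. These metrics and the function name `precision_recall` are an illustrative assumption; they are not results or measures reported in this paper.

```python
def precision_recall(predicted, gold):
    """Compare the predicted and annotated term sets for one sentence."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)  # terms found by both the model and the annotator
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

Averaging these values over all training sentences would give a single pair of scores for a given IDF threshold.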

Conclusion

After evaluating our statistical model (IDF), we found that it performs very well for difficult term extraction. However, difficult term extraction from complex content is still being improved, both by trying the latest deep learning methodologies that have been successfully used in other natural language processing tasks and by updating more traditional methodologies.

In addition, we are interested in comparing the scores obtained by IDF with those obtained by Jurassic-1, and in assessing both with evaluation metrics.

References

[1] L. Ermakova, P. Bellot, J. Kamps, D. Nurbakova, I. Ovchinnikova, E. SanJuan, E. Mathurin, R. Hannachi, S. Huet, S. Araujo, Overview of the CLEF 2022 SimpleText Lab: Automatic Simplification of Scientific Texts, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), vol. 13390, 2022.

[2] S. P. Banbury, W. J. Macken, S. Tremblay, D. M. Jones, Auditory distraction and short-term memory: Phenomena and practical implications, Human Factors 43 (2001).

[3] O. Lieber, O. Sharir, B. Lenz, Y. Shoham, Jurassic-1: Technical details and evaluation.

[4] G. Kavita, What is inverse document frequency (IDF)?, 2022.

[5] V. Alto, Understanding pointwise mutual information in NLP, 2020.

[6] L. Ermakova, P. Bellot, J. Kamps, D. Nurbakova, I. Ovchinnikova, E. SanJuan, E. Mathurin, S. Araujo, R. Hannachi, S. Huet, N. Poinsu, Automatic simplification of scientific texts: SimpleText lab at CLEF-2022, in: Advances in Information Retrieval, vol. 13186, Springer International Publishing, 2022. doi:10.1007/978-3-030-99739-7_46.