Temporal Pattern Extraction in Arabic language Hajer Omri 1, Zeineb Neji 2, Mariem Ellouze 3 Faculty of Economics and management, Tunisia, Sfax Computer department, Miracl laboratory, University of Sfax 1 hajer.omri2010@gmail.com 2 zeineb.neji@gmail.com 3 mariem.ellouze@planet.tn Abstract. Despite the importance of temporal inference in several domains es- pecially the question answering systems it remains still in its departure com- pared to other languages. This article deals with the automatic co-construction of patterns of temporal relations for the question answering systems. We have implemented this approach in temporal inference called TPE: Temporal Pattern Extraction. Keywords: inference; Question answering system; temporal inference; Arabic language. 1 Introduction In previous years the main objective of researchers is to build machines that can learn, communicate, see and manipulate objects and essentially reason because it is considered one of the biggest stakes in different fields. Although reasoning or infer- ring has always been peculiar to the human being and will not be easy to reproduce, it constitutes a research objective and a motivation to continue imitating the functions of the human brain. Inference is a mental operation that allows the reader to deduce the unspoken or implicit elements in a text by drawing on his knowledge of the world in his "personal encyclopedia". Making an inference means producing new information based on available information. There are many kinds, all researchers don’t agree on a single definition and classify inference according to non-mutually exclusive categories. Among the categories of inferences we distinguish temporal inference. This inference as called also temporal reasoning makes it possible to deduce temporal relations. Example (all the examples in Arabic language are transliterated with Buckwalter 1): 1 http://www.qamus.org/transliteration.htm adfa, p. 1, 2011. © Springer-Verlag Berlin Heidelberg 2011 ‫ في سالزبورغ بالنمسا مؤلف موسيقي نمساوي يعتبر من أشهر العباقرة‬6271 ‫ يناير‬72 ‫" ولد موزارت في‬ ‫ عاما ً بعد أن نجح في‬57 ‫ فقد مات عن عمر يناهز الـ‬،‫المبدعين في تاريخ الموسيقى رغم أن حياته كانت قصيرة‬ ‫ عمل موسيقي‬171 ‫إنتاج‬ " “Mozart was born on 27 January 1756 in Salzburg, Austria, an Austrian composer who is considered one of the most famous geniuses in the history of music, although his life was short. He died at the age of 35 after producing 626 musical works. ” Question: “‫متى ولد موزارت؟‬/ mtY wld mwzArt? / when was Mozart born?” We need a smart analysis here to get the right answer. This intelligent analysis is called inference more particularly temporal inference since one is processing temporal information. The temporal inference covers several domains and disciplines because of its im- portance. It is presented strongly in question answering systems, which is concerned with building systems that automatically answer questions in a natural language by extracting a precise answer from of a corpus of documents. Any temporal information can be clearly expressed (explicit) or referred to as an unspoken (implicit) and which the interlocutor must understand by himself. A speaker may wish to pass over some temporal information and if we speak of a machine that extracts a response that is not clearly expressed, we encounter several difficulties, hence the need for a system that makes the extraction of any temporal information implicitly represented. 2 Related works In this section, we present the previous work on temporal inference. Despite ex- tensive research in Arabic and the volume of Arabic textual data has started growing on the Web in the last decade, it is considered as a starting point for the work of other languages such as English. Several criteria go into slower progress at Arabic research levels. To understand that the information X is deduced from the information Y, is a simple deduction for the human being, but for the machine it is quite different. That’s why the researchers proposed several approaches to solve this problem. The latter are classified into: 2.1 Rules-based methods These methods, which are based on rules, are the oldest among the other types of extraction methods. The principle of this method is that the system designer manually establishes a set of rules for locating and extracting the desired data. These rules are extraction patterns, often implemented using automata, but the creation of these pat- terns is a long and costly job. Among the researchers who have made systems based on rules are: Reasoning [1] about time at different granularities while assuring the modeling of imprecise, gradual and intuitive relationships such as “just before” or “almost touch- es”. To deduce from the new relations it uses not only the classical operators but also its new operators of ascending granular conversion “↑” and descending “↓” which allows the conversion of one granularity to another. Expresses temporal information [2] on different levels of granularity as well preci- sion. It integrates it with other inferences, uses a uniform memory for declarative, episodic, and procedural knowledge. It distinguishes temporal inference by several characteristics: the use of a temporal window, temporal chaining, and interval manip- ulation, with projection, eternisassions and Anticipation. 2.2 Semantic methods HUTO [3] is an ontology which provides a conceptual model in RDFS for model- ing temporal expressions and annotating RDF resources. It proposes a set of the rules allowing standardizing the representation of the temporal data, but also rules of infer- ences and implications, expressed in the form of CONSTRUCT requests in SPARQL in order to deduce and explain the maximum temporal information so that to allow reasoning on the data. CHRONOS [4]: is a system of reasoning on temporal information for the OWL ontologies. The latter represents both qualitative and quantitative temporal infor- mation. Based on Allen's relationships CHRONOS makes it possible to deduce the implicit relations and to detect the inconsistencies while retaining the solidity, the exhaustiveness and traceability on the whole of the supported relations. 2.3 Hybrid methods Temporal inference has increased in recent years in several areas. Among the works are researchers who focus on the clinical field as the team of [5]. He develops a hybrid method for adapting the extraction of temporal expressions in a corpus of pa- tient clinical records. Hybridization takes place between a symbolic approach which is a manual enrichment of the rules of the HeidelTime tool specific to the clinical field. A supervised approach to sequence prediction based on CRF (conditional ran- dom fields). 3 Particularity of Arabic language and time constraints Arabic is a very rich language; However, this richness needs special manipulation which makes regular NLP systems, designed for other languages are unable to man- age it. Arabic is a spoken language by nearly 300 million people in the world and it is the religious language for more than a billion people. It imposed itself with the Quranic revelation which conferred its status as a sacred language. Its unique charac- ter and beauty have forgotten the admiration of Muslims, beyond ethnic and geo- graphical disparities. Among the manifestations of the richness of this language is the fact that the names, notions and concepts benefit from a very wide palette of nuances which al- lows to be expressed with extreme precision. Citing the example for the designation of the months of the year when one can note a significant variety of this word that’s why we need a system of equivalence between the representations set which desig- nates the same temporal information to resolve any ambiguity. Table 1. The name of the solar months Table 2. The name of the lunar months Example of ambiguity: for temporal information 03/12/2000 we find several repre- sentations:  03-12-2000  ‫ هجري‬6276 ‫الثالث من دجنبر‬/AlvAlv mn djnbr 1421 hjry  ‫ هجري‬6276 ‫الثالث من ذو الحجة‬/AlvAlv mn *w AlHjp 1421 hjry  ‫ ميالدي‬7222‫اليوم الثالث من شهر ديسمبر‬/Alywm AlvAlv mn $hr dysmbr 2000 mylA- dy For the word "‫ديسمبر‬/ December / dysmbr " we also find the following words which are equivalent « ‫دجنبر‬/ djnbr, ‫ ذو الحجة‬/ *w AlHjp »and for the year 2000 we can also find the year ‫ هجري‬1421/1421 hjry /1421 Hijri or ‫ ميالدي‬2000/2000 mylAdy /2000 gregorian. 4 Proposed approach The proposed method presented in this section aims at automating the construction of temporal relationship patterns for question answering systems. This method is con- sidered as a rule-based method and it’s composed of three modules as shown in the previous figure (Fig1). The first module in this method consists of the question analysis, which makes it possible to extract the various named entities as well as the verbs. In the second mod- ule, we proceeded to the construction of our corpus by automatically downloading the articles corresponding to the named entities already acquired through Wikipedia. Af- ter a set of corpus pre-processing us go on to the last module which consists in ex- tracting the candidate sentences, which leads to a set of relevant sentences that is used to construct the patterns of temporal relations. In the following we detail the various steps and the phases that constitute them. Fig. 1. Proposed approach 4.1 Question Analysis This step consists in analyzing a question in the Arabic language that solicits tem- poral information only. This constraint must be highlighted at the level of our pro- gram. Indeed, our starting point is a question bearing temporal information only; the other types of questions are not the subject of our research. A question is called temporal if it begins with temporal signals. To find these tem- poral signals we have used the list of questions produced in TERQAS Workshop 2 (illustrated in Table 2) then this list has undergone an Arabic translation in order to have possible temporal signals. This first stage contains two phases to be detailed. Table 3. Temporal signals Extraction of named entities. We proceed at this level to the extraction of the named entities by [9] that remains in each question for the purpose of building an EN base that we use for the construc- tion of our corpus granting the corresponding articles. Extraction of verbs. Here it is a question of decomposing the question in order to extract the verbs that exist by [8]. The extraction of verbs will be useful for the following modules. More details will then be given in the following sections. 4.2 Construction of the corpus Downloading articles. From this phase, we put our first step for the construction of our corpus. This phase consists of hosting articles from the online encyclopedia Wikipedia. In fact, we will automatically download the corresponding articles to the extracted EN from the pre- vious step in XML format. 2 TERQAS was an ARDA Workshop focusing on Temporal and Event Recognition for Ques- tion Answering Systems, www.cs.brandeis.edu/_jamesp/arda/time/readings.html Pre-treatment of articles. The structuring of Wikipedia articles requires a pre-treatment of gender: elimina- tion of parentheses and words that are not in the Arabic language and links and imag- es. In a first phase, we extract the textual content of the articles downloaded automatical- ly to have our corpus. During this phase, we will retrieve the Infobox, when it exists because we will use it for verification in the following steps. We retrieve the raw text from the article. The corpus becomes after this step of format TXT. Segmentation. In a third phase, we proceed to the segmentation [7] of the articles. The latter rep- resents, in linguistics, a pre-processing of one or more textual documents in order to be able to subsequently process them (a morphological analysis, semantics, etc.).This operation is sensitive to each language because each has its own specificities that must be taken into account. It is considered to be important to locate segments con- taining the information. The result of this stage will serve as input for the step of extracting the so-called temporal or candidate sentences. Extraction of temporal sentences. This is to get rid of unnecessary information and access those that are considered relevant to anticipate and act as quickly as possible in decision-making. Once the articles, text part of the article precisely, cleaned up is segmented, we proceed to a selection to keep only those sentences that contain temporal information (relevant) [8]. Part-of-speech Tagging. This stage consists in identifying the morphological characteristics [6] of the words of each temporal sentence of our corpus. What really interests us in this morphologi- cal analysis is to locate the verbs. Let us return here to the first phase in which the verb of each question analyzed was detected. A comparison of the verbs of the identified phrases and the detected verb of the question will take place. 4.3 Construction of patterns Extraction of synonyms and antonyms. In this step we will extract the list of synonyms and antonyms for the verbs detect- ed from our starting time questions from Arabic Wordnet (AWN). We went through a coding phase for this extraction; in fact AWN is codified with Bluckwalter so we used a codification to have synonyms and antonyms in Arabic. The antonyms serve us for temporal questions concerning duration. For Example: ‫كم دامت الحرب العالمية االولى‬ How long did the First World War last? Km dAmt AlHrb AlEAlmyp AlAwlY We find ourselves in front of two situations:  We can have a direct answer from a relation of synonymy: ‫استمرت الحرب العالمية االولى لمدة أربعة سنوات‬ The First World War continued for four years Astmrt AlHrb AlEAlmyp AlAwlY lmdp >rbEp snwAt [‫دام‬/ lasted /dAm =/‫إستمر‬/ continued /Astmr].  Or we can extract the response from an antonymic relation: ‫إنتهى لهيب الحرب العالمية االولى بعد أربعة سنوات من جحيم‬ The flames of First World War ended after four years of hell rbEp snwAt mn jHym [‫ إنتهى‬/‫>اسم> <تاريخ بدأ> <تاريخ انتهاء‬  < ‫بدأت <اسم> سنة<تاريخ بدأ> و انتهت سنة <تاريخ انتهاء‬ 5 Evaluation The aim of the evaluation is analyzing the detailed capabilities of our proposed method cited in the previous section. In this section we present their evaluation results. As a first evaluation, we collected a corpus (in Arabic language) composed of set of 100 temporal and heterogeneous questions related to several domains at the begin- ning and the number of questions was increased each time to evaluate the results of our system TPE. Our corpus was extracted from the corpus of TREC international conference3 (Text REtrieval Conference) for the years from 1999 to 2003 and from a list of questions produced in TERQAS Workshop. Once the patterns are extracted, and for more precision we have asked the help of an expert in the domain to judge the semantics of the patterns. Table 4. Experiment results 6 Conclusion The question of identifying temporal relations using a pattern approach is particu- larly on interesting entry point in several areas such as question-and-answer systems. 3 http://trec.nist.gov/ The work that we have presented in this article is part of the work of the identifica- tion of temporal relations. In this context, we proposed a method for the identification of temporal relations based on a semantic approach based on patterns. We began this article with an overview of temporal inferences. Next, we proposed a method for the automatic extraction of patterns for the identification of temporal relations. Then, we presented our system "TPE" which presents the result of devel- opment of the proposed method. This system allows defining the temporal patterns from a corpus of texts. In this work, we aim at extending the temporal information base in order to build a specific time dictionary that can be useful in different domains. Acknowledgment We would like to record our appreciation to all people that involve in writing this article. First of all, our appreciation goes to Computer department for all the guidance especially my advisors Madam Mariem Ellouze and Zeineb Neji for guiding and as- sistaning us until we complete this article. References 1. Quentin Cohen-Solal et al, “Une algèbre des relations temporelles granulaires pour le rai- sonnement qualitatif ”, 2015. 2. Patrick Hammer et al, “The OpenNARS implementation of the Non-Axiomatic Reasoning System”, 2015. 3. Papa Fary Diallo et al, “ HuTO: une Ontologie Temporelle Narrative pour les Applications du Web Sémantique”, 2015. 4. Eleftherios Anagnostopoulos et al, “CHRONOS: A Reasoning Engine for Qualitative Temporal Information in OWL”, 2014. 5. Mike Donald Tapi Nzali, Aurélie Névéol, Xavier Tannier, “analyse d'expressions tempo- relles dans les dossiers électroniques patients”, 2015. 6. Spence Green and Christopher D. Manning, “Better Arabic Parsing: Baselines, Evalua- tions, and Analysis”, 2010. 7. Lamia Hadrich Belguith, Leila Baccour et Ghassan Mourad, “Segmentation de textes arabes basée sur l’analyse contextuelle des signes de ponctuations et de certaines particules”, 2005. 8. Max Silberztein, Nooj: A Linguistic Annotation System for Corpus Processing, 2005. 9. Héla Fehri et al, “Reconnaissance et traduction d'entités nommées en arabe avec NooJ en utilisant un nouveau modèle de représentation”, 2011.