Demo: Text Titling Application

Cédric Lopez, LIRMM, Univ. Montpellier 2, France, lopez@lirmm.fr
Violaine Prince, LIRMM, Univ. Montpellier 2, France, prince@lirmm.fr
Mathieu Roche, LIRMM, Univ. Montpellier 2, France, mroche@lirmm.fr

ABSTRACT
This paper presents an application for the automatic titling of texts. The process consists of four stages: corpus acquisition, determination of candidate sentences for titling, extraction of noun phrases from the candidate sentences, and finally the choice of the title.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]

1. INTRODUCTION
In this paper, we present an application that automatically provides a title for a document, following an approach that meets the characteristics of human-written titles. When a title is absent, for instance in e-mails without a subject line or when a file name must be chosen for saving, the described method saves the user time by conveying the content at a single glance. In addition, it is designed to meet at least one of the criteria of the W3C standard. Indeed, titling web pages is one of the key aspects of web page accessibility, as defined by associations for the disabled. The goal is to enhance page readability. Moreover, a relevant title matters to the webmaster because it improves the indexing of web pages. Note that titling should not be confused with automatic summarization, text compression, or indexing, although it has several points in common with them. This is detailed in the 'related work' section.

2. RELATED WORK
While many applications have emerged in the NLP domain, it seems that none has been built to title textual documents automatically. For articles, it has been observed that elements appearing in the title are often present in the body of the text [6]. Recent work [1] supports this idea and shows that the coverage rate of the words present in titles is very high in the first sentences of a text.

A title is not exactly the smallest possible abstract. While a summary, the most condensed form of a text, has to give an outline of the text contents that respects the text structure, a title indicates the subject treated in the text without revealing all the content. Text compression could be of interest for titling if a strong compression could be performed, resulting in a single relevant word group. Text compression methods (e.g. [5]) could be used to choose a word group obeying the constraints on titles. However, compression results have to be heavily pruned to select the relevant group [4].

A title is not an index: a title does not necessarily contain keywords (whereas indexes are keywords), and it may present a partial or total reformulation of the text (which an index does not).

Finally, a title is thus a full entity with its own functions, and titling has to be sharply distinguished from summarizing and indexing.

3. THE AUTOMATIC TITLING APPLICATION

3.1 The process
The global process for automatically titling a document is composed of the following steps:

• Step 0: Corpus Acquisition. Determining the characteristics of the texts to be titled.

• Step 1: Candidate Sentences Determination. We assume that any text contains at least a few sentences that would provide the relevant material for titling. In this article, we suppose that the terms used in the title can be located in the first sentences of the text [1].

• Step 2: Extracting Candidate Noun Phrases (NP) for Titling. This step consists in selecting, among the NP extracted from the candidate sentences, the most relevant one. A first pre-selection keeps the longest NP, similarly to [2], namely those whose lengths equal Lmax and Lmax − 1, where Lmax is the length of the longest local candidate. This technique avoids pruning interesting candidates too quickly. These candidates are called NPmax.
• Step 3: Selecting a Title by the ChTitres Approach, based on the use of the TF-IDF measure [3]. This measure evaluates how important a word is to a text or corpus. The importance of a word increases proportionally to the number of times the word appears in the document (TF) but is offset by the frequency of the word in the corpus (IDF). We use the TF-IDF measure to calculate the score of every NPmax. This score can be either the maximal TF-IDF obtained for a word of the NP (TMAX) or the sum of the TF-IDF of all the words of the NP (TSUM). If a Named Entity is located among the NPmax, then our approach favors selecting this NP as the title. Other methods are detailed in the online application (see Fig. 1).

Figure 1: Screen shot: Application interface.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

rejected NP with selected ones will allow, in particular, the estimation of the selection process accuracy.

For every 'candidate' title, the expert has to assess its relevance to the document contents on the following scale: Very relevant (C1), Relevant (C2), I don't know (C3), Not very relevant (C4), Not relevant at all (C5). Each of these Cn judgements is assigned a numerical value: −2 for C5, −1 for C4, 0 for C3, +1 for C2 and +2 for C1. The final score obtained for a title is the mean of the grades given by the experts (see Fig. 2). So, the higher the value, the more accurate the NP is as a title.

Figure 2: Scale of relevance for Titling.

Based on this expert evaluation, the experiments show that the e-mail titles returned by our tool are relevant (0