
Demo: Text Titling Application

Cédric Lopez

lopez@lirmm.fr

Violaine Prince

prince@lirmm.fr

Mathieu Roche

mroche@lirmm.fr

LIRMM, Univ. Montpellier 2, France

This paper presents an application for the automatic titling of texts. The process consists of four stages: corpus acquisition, determination of candidate sentences for titling, extraction of noun phrases from the candidate sentences, and finally the choice of the title.

1. INTRODUCTION

In this paper, we present an application that automatically provides a title for a document, while respecting the different characteristics of titles written by humans. When a title is absent, for instance in e-mails without a subject line or when a file name has to be chosen for saving, the described method saves the user time by conveying the content at a single glance. In addition, it is designed to meet at least one of the criteria of the W3C standard. Indeed, titling web pages is one of the key fields of web page accessibility, as defined by associations for the disabled. The goal is to enhance page readability. Moreover, a relevant title matters to the webmaster because it improves the indexing of web pages. Note that titling should not be confused with automatic summarization, text compression, or indexing, although it has several points in common with them. This is detailed in the 'related work' section.

2. RELATED WORK

While many applications have emerged in the NLP domain, it seems that no application has been built to automatically title textual documents. For articles, it has been noticed that elements appearing in the title are often present in the body of the text [6]. Recent work [1] supports this idea and shows that the coverage rate of the words present in titles is very high in the first sentences of a text.


A title is not exactly the smallest possible abstract. While a summary, the most condensed form of a text, has to give an outline of the text contents that respects the text structure, a title indicates the subject treated in the text without revealing all the content. Text compression could be interesting for titling if a strong compression could be undertaken, resulting in a single relevant word group. Text compression methods (e.g. [5]) could be used to choose a word group that obeys the constraints of titles. However, the compression results have to be heavily pruned to select the relevant group [4].

A title is not an index: a title does not necessarily contain key words (and indexes are key words), and it might present a partial or total reformulation of the text (which an index does not).

A title is thus an entity in its own right, with its own functions, and titling has to be sharply distinguished from summarizing and indexing.

3. THE AUTOMATIC TITLING APPLICATION

3.1 The process

The overall process for automatically titling a document consists of the following steps:

Step 0: Corpus Acquisition. Determining the characteristics of the texts to be titled.

Step 1: Candidate Sentences Determination. We assume that any text contains at least a few sentences that can provide relevant material for titling. In this article, we suppose that the terms used in the title can be located in the first sentences of the text [1].

Step 2: Extracting Candidate Noun Phrases (NP) for Titling. Noun phrases are extracted from the candidate sentences, and the most relevant one must then be selected from this list. A first preselection keeps the longest NPs, similarly to [2], with lengths equal to Lmax and Lmax - 1, where Lmax is the length of the longest local candidate. This technique avoids pruning interesting candidates too quickly. These candidates are called NPmax.

Step 3: Selecting a Title by the ChTitres Approach. This approach is based on the TF-IDF measure [3], which evaluates how important a word is to a text in a corpus. The importance of a word increases proportionally to the number of times it appears in the document (TF) but is offset by the frequency of the word in the corpus (IDF). We use the TF-IDF measure to calculate the score of every NPmax. This score can be either the maximal TF-IDF obtained for a word of the NP (TMAX) or the sum of the TF-IDF values of all the words of the NP (TSUM). If a named entity is located among the NPmax, then our approach favors selecting this NP as the title. Other methods are detailed in the online application (see Fig. 1). Subtitles are determined with the same methods as titles. The difference lies in the calculation of the TF-IDF measure: the IDF is not computed from the frequency of the word across the documents of the corpus but from the frequency of the word across the segments of the document. This method can thus be seen as a local titling process.
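Steps 2 and 3 can be illustrated with a minimal sketch. It is not the ChTitres implementation: it assumes that noun phrases are already extracted and tokenized, that NP length is counted in words, and that TF-IDF is computed as raw term frequency times log(N/df), which is only one common variant since the exact weighting is not specified here. The preference for named entities is left out. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def preselect_np_max(noun_phrases):
    """Keep the NPs whose length (in words) is Lmax or Lmax - 1."""
    if not noun_phrases:
        return []
    l_max = max(len(phrase) for phrase in noun_phrases)
    return [phrase for phrase in noun_phrases if len(phrase) >= l_max - 1]


def tf_idf(word, document, corpus):
    """Raw count of the word in the document times log(N / df).

    Assumed variant; the weighting used by the application may differ.
    """
    tf = Counter(document)[word]
    df = sum(1 for doc in corpus if word in doc)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def score_np(phrase, document, corpus, method="TMAX"):
    """Score an NP by the max (TMAX) or the sum (TSUM) of its word TF-IDFs."""
    weights = [tf_idf(word, document, corpus) for word in phrase]
    if method == "TMAX":
        return max(weights)
    if method == "TSUM":
        return sum(weights)
    raise ValueError(f"unknown method: {method}")


def choose_title(noun_phrases, document, corpus, method="TMAX"):
    """Return the best-scoring NP among the preselected NPmax candidates."""
    candidates = preselect_np_max(noun_phrases)
    return max(candidates, key=lambda p: score_np(p, document, corpus, method))


# Toy data (hypothetical): each document is a list of tokens.
document = ["automatic", "titling", "of", "texts", "titling", "tool"]
corpus = [
    document,
    ["automatic", "summarization", "of", "texts"],
    ["text", "compression", "tool"],
]
noun_phrases = [("automatic", "titling", "tool"), ("titling", "tool"), ("texts",)]

print(" ".join(choose_title(noun_phrases, document, corpus, method="TSUM")))
```

For subtitles, the same functions could be called with the segments of the document passed as `corpus`, which mirrors the local IDF described above.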

3.2 The application

The application is available in English and French: http://www.lirmm.fr/ lopez/. This online application, developed with HTML, PHP, and MySQL, has a user-friendly interface. On the screenshot (see Fig. 1), each part of the interface is annotated with a letter.

A: Lets the user choose which method to apply in order to title a given document (TMAX, TSUM, and other methods). Clicking the name of a method displays an explanation of it.

B: Lets the user choose which method will be applied for subtitling.

C: The text area receives the text to be titled. Text blocks must be separated by 'carriage returns' in order to be subtitled.

D: Some application examples (specialized texts) are offered in a list.

E: A link to switch to French mode.

The result page returns all the titles (and subtitles) produced by the methods chosen on the interface page.

Ten experts evaluated our methods (240 titles were evaluated over 3 corpora). For every text, six titles were suggested (identical titles obtained with different approaches are not given). These include the titles determined by the methods TMAX and TSUM, as well as the real title TR. Three other titles (A1, A2, A3) are drawn at random from the list of noun phrases that were extracted but rejected by the process. Comparing the evaluation of rejected NPs with that of selected ones allows, in particular, the accuracy of the selection process to be estimated.

For every 'candidate' title, the expert has to assess its relevance to the document contents on the following scale: Very relevant (C1), Relevant (C2), I don't know (C3), Not very relevant (C4), Not relevant at all (C5). A numerical value is assigned to each of these Cn judgements: -2 for C5, -1 for C4, 0 for C3, +1 for C2, and +2 for C1. The final score obtained for a title is the mean of the grades given by the experts (see Fig. 2). So, the higher the value, the more accurate the NP is as a title.
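The scoring scheme can be illustrated with a short sketch. The mapping follows the scale above, with C4 and C5 taken as negative values. The function name `title_score` and the sample judgements are hypothetical and are not taken from the evaluation data.

```python
from statistics import mean

# Numerical value assigned to each judgement on the five-point scale.
GRADE_VALUES = {"C1": 2, "C2": 1, "C3": 0, "C4": -1, "C5": -2}


def title_score(judgements):
    """Return the mean grade given by the experts to one candidate title."""
    return mean(GRADE_VALUES[j] for j in judgements)


# Hypothetical judgements from ten experts for a single candidate title.
sample = ["C1", "C2", "C2", "C3", "C2", "C4", "C1", "C2", "C3", "C2"]
print(title_score(sample))
```

A score above 0 means that the experts, on average, found the candidate relevant; a score below 0 means that they did not.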

Based on this expert evaluation, the experiments show that the e-mail titles returned by our tool are relevant (0 < score < 1, see Table 1). In particular, the results show that automatic e-mail titles (0.61) are judged more relevant than the real titles (0.57) when TMAX is used.

Table 1

CORPORA        TR     TMAX   TSUM
E-mails        0.57   0.61   0.52
Fora           1.15   0.42   0.58
Mailing lists  1.8    0.81   0.43

4. CONCLUSION

The experts have confirmed that the titles built by our automatic titling tool are relevant. A possible benefit of an automatic method is that it might build a more relevant title than a 'real' one, and it saves time for people who write many e-mails.

5. REFERENCES

[1] Belhaoues. Titrage automatique de pages web. Master Thesis, University Montpellier II, France, 2009.

[2] Bourigault. Lexter, un logiciel d'extraction de terminologie. Application à l'acquisition des connaissances à partir de textes. PhD thesis, 1994.

[3] Salton and Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management 24, pages 513-523, 1988.

[4] Teufel and Moens. Sentence extraction and rhetorical classification for flexible abstracts. In AAAI Spring Symposium on Intelligent Text Summarisation, pages 16-25, 2002.

[5] Yousfi-Monod and Prince. Sentence compression as a step in summarization or an alternative path in text shortening. In Coling'08, UK, pages 139-142, 2008.

[6] Zajic, Dorr, and Schwartz. Automatic headline generation for newspaper stories. Workshop on Text Summarization (ACL 2002 and DUC 2002 meeting on Text Summarization), Philadelphia, 2002.