Demo: Text Titling Application

                     Cédric Lopez                                    Violaine Prince                         Mathieu Roche
            LIRMM, Univ. Montpellier 2,                       LIRMM, Univ. Montpellier 2,              LIRMM, Univ. Montpellier 2,
                    France                                            France                                   France
                   lopez@lirmm.fr                                   prince@lirmm.fr                        mroche@lirmm.fr


ABSTRACT
This paper presents an application for the automatic titling of texts. The
process consists of four stages: corpus acquisition, determination of
candidate sentences for titling, extraction of noun phrases from the
candidate sentences, and finally the choice of the title.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1.     INTRODUCTION
  In this paper, we present an application that automatically provides a
title for a document and that meets the characteristics of human-written
titles. When a title is absent, for instance in e-mails without a subject
line or when a file name has to be chosen for saving, the described method
saves the user time by conveying the content at a single glance. In
addition, it is designed to meet at least one of the criteria of the W3C
standard. Indeed, titling web pages is one of the key aspects of web page
accessibility, as defined by associations for the disabled. The goal is to
enhance page readability. Moreover, a relevant title matters to webmasters
who want to improve the indexing of their web pages. Note that titling
should not be confused with automatic summarization, text compression, or
indexing, although it has several points in common with them. This is
detailed in the ‘Related Work’ section.

2.     RELATED WORK
   While many applications have emerged in the NLP domain, it seems that no
application has been built to title textual documents automatically. For
articles, it has been noticed that elements appearing in the title are often
present in the body of the text [6]. Recent work [1] supports this idea and
shows that the coverage rate of the words present in titles is very high in
the first sentences of a text.
   A title is not exactly the smallest possible abstract. While a summary,
the most condensed form of a text, has to give an outline of the text
contents that respects the text structure, a title indicates the subject
treated in the text without revealing all the content. Text compression
could be interesting for titling if a strong compression could be
undertaken, resulting in a single relevant word group. Text compression
methods (e.g. [5]) could be used to choose a word group obeying the
constraints of titles. However, the compression results have to be heavily
pruned to select the relevant group [4].
   A title is not an index: a title does not necessarily contain key words
(and indexes are key words), and might present a partial or total
reformulation of the text (which an index does not).
   Finally, a title is thus a full entity with its own functions, and
titling has to be sharply distinguished from summarizing and indexing.

3.    THE AUTOMATIC TITLING APPLICATION

3.1     The process
  The global process for automatically titling a document is composed of the
following steps:

   • Step 0: Corpus Acquisition. Determining the characteristics of the
     texts to be titled.

   • Step 1: Candidate Sentences Determination. We assume that any text
     contains at least a few sentences from which a relevant title can be
     drawn. In this article, we suppose that the terms used in the title
     can be located in the first sentences of the text [1].

   • Step 2: Extracting Candidate Noun Phrases (NP) for Titling. This step
     consists in selecting, among the list of extracted NPs, the most
     relevant ones. A first preselection keeps the longest NPs, similarly
     to [2], namely those of length Lmax and Lmax − 1, where Lmax is the
     length of the longest local candidate. This technique avoids pruning
     interesting candidates too quickly. These candidates are called NPmax.

   • Step 3: Selecting a Title by the ChTitres Approach, based on the
     TF-IDF measure [3]. This measure evaluates how important a word is to
     a document in a corpus. The importance of a word increases
     proportionally to the number of times the word appears in the
     document (TF) but is offset by the frequency of the
     word in the corpus (IDF). We use the TF-IDF measure to calculate the
     score of every NPmax. This score can be either the maximal TF-IDF
     obtained for a word of the NP (TMAX) or the sum of the TF-IDF of all
     the words of the NP (TSUM). If a Named Entity is located among the
     NPmax, then our approach favors selecting this NP as a title. Other
     methods are detailed in the online application (see Fig. 1).

     Subtitles are determined with the same methods as titles. The
     difference lies in the calculation of the TF-IDF measure: the IDF is
     not computed from the frequency of the word in the different documents
     of the corpus but from the frequency of the word in the segments of
     the document. This method can therefore be seen as a local titling
     process.

      Figure 1: Screenshot of the application interface.

3.2     The application
  The application is available in English and French:
http://www.lirmm.fr/~lopez/. This online application, developed with HTML,
PHP, and MySQL, has a user-friendly interface. On the screenshot (see
Fig. 1), all parts of the interface are annotated with a letter.

   • A: Lets the user choose which method to apply in order to title a
     given document (TMAX, TSUM, and other methods). By clicking the name
     of a method, the user finds an explanation of it.

   • B: Lets the user choose which method will be applied for subtitling.

   • C: The text area receives the text to title. Text blocks must be
     separated by ‘carriage returns’ in order to be subtitled.

   • D: Some application examples (specialized texts) are proposed in a
     list.

   • E: Link for switching to French mode.

  The result page returns all the titles (and subtitles) according to the
methods chosen on the interface page.
  Ten experts have evaluated our methods (240 titles have been evaluated on
3 corpora). For every text, six titles were suggested (1): the titles
determined by the methods TMAX and TSUM, as well as the real title TR. Three
other titles (A1, A2, A3) are drawn at random from the list of noun phrases
that were rejected by the process. Comparing the evaluation of rejected NPs
with that of selected ones allows, in particular, the accuracy of the
selection process to be estimated.

(1) Identical titles obtained with different approaches are not given.

  For every ‘candidate’ title, the expert has to assess its relevance to the
document contents on the following scale: very relevant (C1), relevant (C2),
I don’t know (C3), not very relevant (C4), not relevant at all (C5). Each of
these Cn judgements is assigned a numerical value: −2 for C5, −1 for C4,
0 for C3, +1 for C2, and +2 for C1. The final score obtained for a title is
the mean of the grades given by the experts (see Fig. 2). So, the higher the
value, the more relevant the NP is as a title.

      Figure 2: Scale of relevance for titling.

  Based on this expert evaluation, the experiments show that the e-mail
titles returned by our tool are relevant (0<score<1, see Table 1). In
particular, the evaluation results show that the automatic e-mail titles
obtained with TMAX (0.61) are rated as more relevant than the real titles
(0.57).

      CORPORA          TR      TMAX    TSUM    A1      A2      A3
      E-mails          0.57    0.61    0.52    -1.44   -0.55   -0.64
      Fora             1.15    0.42    0.58    -1.00   -0.74   -0.79
      Mailing lists    1.8     0.81    0.43    -1.57   -1.03   -0.58

      Table 1: Average scores for each score computation method.

4.   CONCLUSION
  The experts have confirmed that the titles built by our automatic titling
tool are relevant. A possible benefit of an automatic method is that it
might build a more relevant title than a ‘real’ one, and it saves time for
people who write many e-mails.

5.   REFERENCES
[1] M. Belhaoues. Titrage automatique de pages web. Master Thesis,
    University Montpellier II, France, 2009.
[2] D. Bourigault. Lexter, un logiciel d’extraction de terminologie.
    Application à l’acquisition des connaissances à partir de textes.
    PhD thesis, 1994.
[3] G. Salton and C. Buckley. Term-weighting approaches in automatic text
    retrieval. Information Processing and Management 24, pages 513–523,
    1988.
[4] S. Teufel and M. Moens. Sentence extraction and rhetorical
    classification for flexible abstracts. In AAAI Spring Symposium on
    Intelligent Text Summarisation, pages 16–25, 2002.
[5] M. Yousfi-Monod and V. Prince. Sentence compression as a step in
    summarization or an alternative path in text shortening. In Coling’08,
    UK, pages 139–142, 2008.
[6] D. Zajic, B. Dorr, and R. Schwartz. Automatic headline generation for
    newspaper stories. Workshop on Text Summarization (ACL 2002 and DUC
    2002 meeting on Text Summarization), Philadelphia, 2002.
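
As an illustration of Steps 2 and 3 of Section 3.1, the following Python
sketch shows one way the preselection of NPmax and the TMAX and TSUM scores
could be computed. It is not the code of the online application (which is
described as developed with PHP): the function names, the tokenized input
format, and the TF-IDF variant (raw term count multiplied by log(N/df)) are
assumptions made for the example, and the preference for Named Entities is
left out.

```python
import math
from collections import Counter


def preselect_np_max(noun_phrases):
    """Keep the NPs of length Lmax or Lmax - 1 (the NPmax candidates)."""
    if not noun_phrases:
        return []
    l_max = max(len(phrase) for phrase in noun_phrases)
    return [phrase for phrase in noun_phrases if len(phrase) >= l_max - 1]


def tf_idf_table(document, corpus):
    """Map each word of `document` to an assumed TF-IDF weight.

    TF is the raw count of the word in the document. IDF is log(N / df),
    where N is the number of documents in the corpus and df the number of
    documents containing the word. This variant is an assumption.
    """
    tf = Counter(document)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        # df is 0 only if the document itself is not part of the corpus.
        weights[word] = count * math.log(n_docs / df) if df else 0.0
    return weights


def score_np(phrase, weights, method="TMAX"):
    """TMAX: best word weight of the NP. TSUM: sum of its word weights."""
    scores = [weights.get(word, 0.0) for word in phrase]
    return max(scores) if method == "TMAX" else sum(scores)


def choose_title(noun_phrases, document, corpus, method="TMAX"):
    """Return the NPmax candidate with the highest score."""
    candidates = preselect_np_max(noun_phrases)
    weights = tf_idf_table(document, corpus)
    return max(candidates, key=lambda p: score_np(p, weights, method))


if __name__ == "__main__":
    # Toy data, invented for the example.
    document = ["titling", "tool", "for", "e-mail", "titling"]
    corpus = [document, ["tool", "for", "parsing"], ["for", "e-mail", "sorting"]]
    phrases = [("titling", "tool"), ("e-mail",), ("tool", "for", "e-mail")]
    print(" ".join(choose_title(phrases, document, corpus, "TMAX")))
```

On this toy input the one-word NP is discarded by the preselection, since
Lmax is 3, and the remaining candidates are ranked by their TMAX score.
Passing "TSUM" as the method ranks them by the sum of the word weights
instead.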