Unsupervised Metadata Extraction in Scientific Digital Libraries
Using A-Priori Domain-Specific Knowledge

Alexander Ivanyukovich
Department of Information and Communication Technology
University of Trento
38100 Trento, Italy
Email: a.ivanyukovich@dit.unitn.it

Maurizio Marchese
Department of Information and Communication Technology
University of Trento
38100 Trento, Italy
Email: maurizio.marchese@unitn.it

   Abstract: Information extraction from unstructured sources is a crucial
step in the semantic annotation of content. The challenge is in supporting
a high-quality automatic (or at least semi-automatic) approach in order to
sustain the scalability of the semantic-enabled services of the future.
Unsupervised information extraction encompasses a number of underlying
research problems, such as natural language processing, heterogeneous
sources integration, knowledge representation, and others that have been,
and still are, under investigation. In this paper we concentrate on the
problem of unsupervised metadata extraction in the Digital Libraries
domain. We propose and present a novel approach focusing on the improvement
of metadata extraction quality without involving external information
sources (oracles, manually prepared databases, etc.), relying instead on
the information present in the document itself and in its corresponding
context. More specifically, we focus on quality improvements of metadata
extraction from scientific papers (mainly in the computer science domain)
collected from various sources over the Internet. Finally, we compare the
results of our approach with the state of the art in the domain and discuss
future work.

                         I. INTRODUCTION

   The continuing expansion of the Internet has opened many new
possibilities for information creation and exchange in general and in the
academic world in particular: electronic publishing, digital libraries,
electronic proceedings, self-publishing and, more recently, blogs and
scientific news streaming are rapidly expanding the amount of available
scholarly digital content. Recently, we have also witnessed a major shift
in the landscape of scientific publishing with projects like the Open
Access Initiative^1. In fact the number of open access journals is rising
steadily, and new publishing models are rapidly evolving to test new ways
to increase readership and access. Such new channels for academic
communication are complementing and sometimes competing with traditional
authorities like journals, books and conference proceedings. The existence
of such variety and size of scholarly content, as well as its increasing
accessibility, opens the way to the development of useful semantic-enabled
services (like author profiling^2, scientometrics [1], automatic science
domain mapping [2], scientific social network analysis, etc.). However, to
implement such semantic-aware services, the accumulated and available
scholarly content first needs to be annotated with proper, high-quality
semantic information.

  1 http://www.openarchives.org/
  2 http://www.rexa.info//

   In the specific domain of scholarly literature, the structure of the
published scientific information still follows, in the majority of cases,
a number of established communication approaches and patterns, i.e. a
certain number of structural elements, such as title, author list,
abstract, body, references, etc., are always present. This fact allows
adopting existing information processing techniques for both traditional
and Internet-based sources, contributing to the processing and creation of
structured information within this content type.
   In this paper we focus on the intersection of the Information Retrieval
(IR) and Digital Libraries (DL) research domains to address the problem of
high-quality automatic information extraction from digital scientific
documents. This is a first and crucial step towards the semantic annotation
of the raw digital content, in a kind of knowledge supply chain, as
indicated in [3].
   In describing information extraction within DL we will use the term
metadata to refer to the structured information obtained from text-based
documents that includes but is not limited to title, authors, affiliations,
year of publication, publishing source (journal, conference, etc.),
publishing authority (such as ACM, IEEE, Elsevier, etc.) and the list of
references - each including previously mentioned work. A number of
standards are available describing and categorizing bibliographic and
publishing metadata: for instance Dublin Core [4] and the Bib-1 Attribute
Set from ANSI/NISO Z39.50-2003 (ISO 23950) [5]. In the present work we
limit our investigation of metadata extraction to a significant subset of
such standards. In fact, here we want to describe and evaluate our
approach; extension to other instances of metadata is only quantitative and
not conceptual.
   The two different information sources of scientific content (traditional
and Internet sources) present important differences in the approach to
metadata retrieval: traditional sources are usually based on manually
prepared information (from certified authorities such as professional
associations like ACM,
IEEE and commercial publishers, such as Elsevier, Springer, etc.). In this
case either all records are manually processed or processing results are
manually revised. This is possible because traditional sources usually
belong to single authorities with their internal standards on information
storage. On the other hand, Internet-based sources usually belong to large
open communities (single researchers, groups of researchers, institutions)
and do not follow specific strict standards. For instance, an academic
paper that is stored in the Digital Library of the IEEE Computer Society^3
contains the appropriate metadata to support navigation through related
papers (search and sort by author, by publication date, etc.). The same
paper can be found on the homepage of the author or in the digital
repository of the affiliated academic institution. In this case, most often
the metadata is not separated from the paper, or it is not structured. It
needs either extraction or separate processing.

  3 http://www.computer.org/portal/site/csdl/

   The problem of metadata extraction in the specific context of scientific
Digital Libraries can be summarized as
  1) identification of logical structures within single documents (header,
     abstract, introduction, body, references section, etc.)
  2) entity recognition (author, title, reference, etc.) within a single
     document
  3) metadata recognition within a single entity.
A general assumption in current metadata extraction techniques is that
there is a limited number of formats to structure an academic paper and to
represent references. This is particularly true in the Computer Science
domain, where an even smaller number of formats is in active use (ACM
format, IEEE format, etc.). This information is particularly helpful for
point (1) above, but one can still obtain low-quality results because of
differences in formatting due to a number of reasons such as (a) authors
not following the pattern, (b) specifics of text representation in columns
in PDF/PS formats, (c) text pagination, (d) presence of headnotes and
footnotes, etc. Obviously similar problems can be found in (2) and (3) as
well, among them human errors, technical details of text extraction from
PDF/PS formats and initial low-quality results after step (1). As a result,
the overall metadata quality is not sufficient for everyday use in popular
academic literature systems like CiteSeer.IST^4, Google Scholar^5 and
Windows Academic Live^6.

  4 http://citeseer.ist.psu.edu/
  5 http://scholar.google.com/
  6 http://academic.live.com/

   The main contribution of this paper is a novel method for unsupervised
metadata extraction based on a-priori domain-specific knowledge. Our method
does not rely on any external information sources and is solely based on
the existing information in the document itself as well as in the overall
set of documents currently present in a given digital archive. This
includes both a-priori domain-specific information and information obtained
in previous processing steps. The proposed method can be used in automated
end-to-end information retrieval and processing systems, supporting the
control and elimination of any error-prone human intervention in the
process.
   The remainder of the paper is organized as follows. In Section II we
describe in detail the proposed approach to improve the quality of metadata
extraction from scientific corpora: in particular we describe a two-step
procedure based on (1) pattern-based metadata extraction using a Finite
State Machine (FSM) and (2) statistical correction using a-priori
domain-specific knowledge. In Section III we describe how the proposed
approach has been applied to a large set of documents (ca. 120K) and we
provide a preliminary comparison with the state of the art in the domain.
In Section IV we discuss related work. Section V summarizes the results and
discusses our future work.

                         II. THE APPROACH

   The proposed approach consists of two major steps, namely
  1) pattern-based metadata extraction using a Finite State Machine (FSM)
     and
  2) statistical correction using a-priori domain-specific knowledge
   For the first step we have analyzed, tested and customized for our
specific application an existing state-of-the-art implementation of a
specialized FSM-based lexical grammar parser for fast text processing [6],
[7]. For the second step we have developed and investigated statistical
methods that allow metadata correction and enrichment without the need to
access external information sources. In the following subsections we
describe each of the two steps in detail.

A. Pattern-based Metadata Extraction Using Finite State Machine

   Pattern-based metadata extraction contributes to initial metadata
retrieval in our approach. In contrast to the classical Information
Retrieval (IR) goal, we mainly focus on the quality and not on the overall
quantity of information. The core idea is to emphasize quality improvement
over several subsequent steps, even at the cost of a limited decrease in
overall metadata quantity.

[Figure: diagram with the stages Cleaned Text, Normalized Text, Formatted
Text, Basic Markup, Recognized Authors, Recognized Titles, Borders
Adjusted, Metadata]
                Fig. 1.    Metadata Extraction Steps

   This first metadata extraction step consists of a number of interim
phases (see Figure 1), each implemented as a single FSM. Application of the
FSM model allows simple
and formal model verification, avoiding most of the human mistakes commonly
involved in these tasks.
   We have constructed an initial set of patterns for each grammar based on
a number of small training sets (typically ca. 50 documents) from the
target document collection. Each set was manually labeled before processing
and the processing results were manually evaluated. Within a limited number
(ca. 10) of pattern adjustment loops, we were able to obtain appropriate
recognition quality to correctly process most of the relevant entity
formats in the complete target collection (more than 120K documents). This
finding corroborated the initial assumption about the presence of a limited
number of formats in a given scientific collection. According to our
original idea of step-by-step quality improvement, the trade-off between
metadata quality and completeness of recognition coverage in this step was
shifted towards quality. To this end, we allow our procedure to discard
badly-formatted input, finally retaining only high-quality content and
related metadata.
   The major steps of metadata extraction include (see Figure 1):
  1) Text normalization and special symbols removal. This covers removal of
     extra spaces, new lines and tabs, as well as handling of non-printable
     symbols and normalization of the references section. Moreover, it
     includes: text flow recognition, detection of collateral elements
     (indexes, tables of contents, page headers and footers, etc.), and
     hyphen correction regardless of the text's language. These 1st-level
     pre-processing activities in our information extraction process,
     although conceptually simple, provide a number of important benefits
     that contribute to the overall quality of the subsequent information
     extraction. Namely:
       • Text pre-processing contributes to more accurate acquisition of
         textual information, i.e. correctly identified text flow (page
         ordering), removal of the repeated elements that do not contribute
         to the structural content (headers and footers) and removal of
         text delimiters inside structural elements (footnotes and page
         numbers inside a single reference, hyphens in the authors' names
         and titles in references, etc.)
       • Text structuring contributes to the correct identification of the
         major structural elements within a text, e.g. the Introduction
         section and the reference to the Introduction section in a table
         of contents of the article should be correctly distinguished and
         handled appropriately
     An extended presentation of the techniques used in this step can be
     found in [8].
  2) Initial text tagging: separating header, abstract and references
     parts. This allows us to process each section separately, contributing
     to the improved processing speed of the subsequent FSMs.
  3) References separation and initial items recognition within each single
     reference.
  4) Authors recognition (see Figure 2) and title recognition using the
     "invariants first" method proposed in [9]. In brief, this method
     states that the subfields of a reference that have relatively uniform
     syntax, position, and composition, given all previous parsing, are
     parsed first.
  5) Borders adjustments. We constructed heuristics for smart border
     shifting based on the number of lexical constructions from the grammar
     in the marked (recognized) and unmarked (not recognized) regions of
     the reference.

    define program
        [head]
        [uninteresting]
        [opt references]
      | [empty]
    end define

    define head
        [head_begin_tag] [newline]
        [opt head_line]
        [repeat other_line]
        [head_end_tag] [newline]
    end define

    define head_line
        [author] [repeat separator_author] [delimiter]
        [repeat token_not_newline+] [newline]
    end define

    define other_line
        [author] [repeat headseparator_author] [repeat trash] [newline]
      | [line]
    end define

    define headseparator_author
        [opt space] [headseparator] [opt space] [author]
    end define
    ...

                Fig. 2.   FSM: Authors recognition step

   At present, our FSM application is context-free, i.e. we do not compare
obtained metadata with already existing ones (such as a partially
recognized corpus or external sources). Moreover, we have designed and
constructed the steps in such a way that grammar application is linear in
the size of the processed document. Both properties - context-freeness and
linearity - contribute significantly to the overall processing speed. Other
information (a partially recognized corpus or external sources) could be
used in a successive step to improve overall quality, but at the expense of
performance.

B. Statistical Correction Using A-Priori Domain-Specific Knowledge

   The metadata obtained with the patterns of the previous step can have
satisfactory quality, but in general it lacks recognition coverage. For
example we can have a reference with a 100%-correctly recognized title, but
with partially recognized authors. We can still query this metadata, but we
cannot use it for the next knowledge processing level, such as document
clustering based on authors or document interlinking
based on their references. To tackle this issue, we developed and tested
several methods for extending recognition coverage that combine (1) the
partially incomplete metadata obtained from the previous step and (2)
a-priori domain-specific knowledge.
   To this end we have analyzed a large sample - several hundred - of
publications in computer science and extracted a limited number of usage
patterns that seem to be common for the whole research domain. In
particular:
  • it is common to find self-citation in one author's publications - this
    information can be used to correct author as well as topic
    identification.
  • it is common that in one document's reference section there are several
    publications with the same author - this information can be used for
    improving author identification (separation from other authors and from
    title).
  • it is common to find references to the same authors within the same
    domain on the Internet - i.e. same parent URL - (publications of the
    same author, home pages of authors that belong to the same
    organization, publications of the same institution, publications on the
    same event/conference). It is therefore possible to use correlation
    within a community for identification of authors.
  • it is common that titles in the references section belong to the same
    topic area as the paper. Therefore, it is possible to use already
    recognized titles for the identification and correction of other
    titles.
  • it is common to find references on the same topic or number of topics
    within the same domain on the Internet (examples are the same as those
    given above for author identification). Also here, it is possible to
    use such correlation within a community for title identification and
    correction.
   We have used each of these assumptions for identification and correction
of the corresponding metadata and we have been able to statistically prove
their correctness for the selected domain. For the sake of presentation of
the proposed heuristic and statistical approach, we detail in the following
the main procedures for (1) authors identification and correction and (2)
title correction. But the same reasoning and approach can be used for the
identification and correction of other metadata present in the document,
like: affiliation, keywords, type of publication (Journal, Proceedings,
Workshop,...), project, event, etc.
   From an operational point of view, authors identification and correction
can be summarized in four major steps (see Figure 3):
  1) Construction of document and community dictionaries. All recognized
     authors within the same document (i.e. both paper authors and cited
     authors) are combined in a local dictionary (document dictionary). All
     recognized authors within all documents within the same domain and/or
     URL are combined in another dictionary (community dictionary).
  2) Co-authors dictionary building. For each author, co-authors are
     checked and grouped within a separate co-author dictionary.
  3) Normalization of authors. Authors' entries within each dictionary are
     normalized to the following forms: "Name Surname" and "Initials
     Surname". This provides a first level of disambiguation, essentially
     using self-citation patterns. Then, we iterate the normalization step
     using the co-author dictionary associated with each author. This adds
     another level of disambiguation, using community writing patterns, to
     resolve authors' initials in the case of identical surnames.
  4) Authors identification and correction. This last step is based on the
     collection of dictionaries (document, co-author and community) and
     aims to solve the remaining ambiguous cases using the whole knowledge
     present in the collection.

[Figure: diagram with the elements Citation, Recognized Authors Processed,
Co-Authors Checked, Authors Normalized, Author Position Checked, Corrected
Authors]
       Fig. 3.   Correction steps (correlation between authors)

   For the same reasons as described in the previous section, i.e. simple
and formal verification, the authors' correction procedure was developed as
an FSM. Figure 4 shows a fragment of the FSM used for the implemented
authors' correction procedure.
   We have accomplished title correction in a similar way (see Figure 5);
however, special heuristics for title border adjustment needed to be
introduced, as well as a number of concepts that we use in this procedure.
In particular:
  • concept of "lexical formula"
  • concept of "document topic"
  • concept of "community topic"
   Here, we define a lexical formula as the frequency of lexical constructs
within a selected logical element, i.e. the weighted set (by frequency) of
words in a selected metadata field. For simplicity, this definition does
not consider punctuation or the ordering of lexical constructs. However, we
think that on large datasets (millions of documents) taking them into
account can result in better precision.
   We then call "topic" a group of lexical formulas within the same
document that can be merged into a larger lexical formula based on the
frequencies. This larger lexical formula will be referred to in the
following as the document topic.
   Similarly we define the community topic as a group of lexical formulas
that can be merged together based on the frequencies of the incoming
element. The difference from the document topic is that here the initial
formulas belong to a set of documents originating from the same domain on
Internet
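
   The definitions of lexical formula and topic given above can be
illustrated with a minimal Python sketch. It is not the authors'
implementation: the paper only says that formulas are merged "based on the
frequencies", so the cosine-similarity criterion, the `threshold` value and
all function names below are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def lexical_formula(field_text):
    """Weighted set of words (by frequency) in one metadata field.

    As in the definition in the text, punctuation and word order are ignored.
    """
    return Counter(re.findall(r"[a-z0-9]+", field_text.lower()))


def cosine(a, b):
    """Cosine similarity of two formulas (an assumed merge criterion)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_into_topics(formulas, threshold=0.2):
    """Greedily merge similar formulas into larger formulas ("topics").

    `threshold` is a hypothetical parameter, not a value from the paper.
    """
    topics = []
    for formula in formulas:
        for topic in topics:
            if cosine(topic, formula) >= threshold:
                topic.update(formula)  # add the word frequencies together
                break
        else:
            topics.append(Counter(formula))
    return topics


# Document topics: merge the formulas of titles found in one document.
titles = [
    "Metadata extraction from scientific papers",
    "Automatic metadata extraction for digital libraries",
    "Finite state machines in lexical parsing",
]
document_topics = merge_into_topics([lexical_formula(t) for t in titles])
```

   Under the same assumptions, community topics would be obtained by calling
`merge_into_topics` on the formulas collected from all documents that share
the same parent URL, rather than from a single document.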
[Figure residue: diagram labels Citation, Recognized Titles Processed,
Title's Borders Detected, Corrected]

  Identify the place where new author was found:
  - If the substring does not belong to any tag - mark it
  - If the substring partially belongs to single Author
    tag - check for the following situations:
    - If the substring partially belongs to Author tag and Title
      tag - check for the following situations:
    -