Unsupervised Metadata Extraction in Scientific
   Digital Libraries Using A-Priori Domain-Specific
                      Knowledge
                    Alexander Ivanyukovich                                                Maurizio Marchese
 Department of Information and Communication Technology Department of Information and Communication Technology
                    University of Trento                                   University of Trento
                    38100 Trento, Italy                                    38100 Trento, Italy
            Email: a.ivanyukovich@dit.unitn.it                     Email: maurizio.marchese@unitn.it


   Abstract— Information extraction from unstructured sources         mapping [2], scientific social networks analysis etc.). However
is a crucial step in the semantic annotation of content. The          for the implementation of such semantic-aware services first
challenge is in supporting an high quality automatic approach         the accumulated and available scholarly content need to be
(or at least semi-automatic) in order to sustain the scalability of
the semantic-enabled services of the future. Unsupervised infor-      annotated with proper and high quality semantic information.
mation extraction encompasses a number of underlying research            In the specific domain of scholarly literature, the structure
problems, such as natural language processing, heterogeneous          of the published scientific information still follows, in the
sources integration, knowledge representation, and others that        majority of the cases, a number of established communication
are under past and current investigation. In this paper we            approach and patterns, i.e. a certain number of structure
concentrate on the problem of unsupervised metadata extraction
in the Digital Libraries domain. We propose and present a novel       information, such as title, author’s list, abstract, body, ref-
approach focusing on the improvement in the metadata extraction       erences et al., are always present. This fact allows adopting
quality without involving external information sources (oracles,      existing information processing techniques for both traditional
manually prepared databases, etc), but relying on the information     and Internet-based sources, contributing to the processing and
present in the document itself and in its corresponding context.      creation of structured information within this content type.
More specifically, we focus on quality improvements of metadata
extraction from scientific papers (mainly in computer science            In this paper we will focus on the intersection of Information
domain) collected from various sources over the Internet. Finally,    Retrieval (IR) and Digital Libraries (DL) research domains
we compare the results of our approach with the state of the art      to address the problem of quality automatic information ex-
in the domain and discuss future work.                                traction from digital scientific documents. This is a first and
                      I. I NTRODUCTION                                crucial step towards the semantic annotation of the raw digital
                                                                      content, in a kind of knowledge supply chain, as indicated in
   The continuing expansion of the Internet has opened
                                                                      [3].
many new possibilities for information creation and exchange
                                                                         In describing information extraction within DL we will use
in general and in the academic world in particular: elec-
                                                                      the term metadata to refer to the structured information ob-
tronic publishing, digital libraries, electronic proceedings, self-
                                                                      tained from text-based documents that includes but is not lim-
publishing and more recently blogs and scientific news stream-
                                                                      ited to title, authors, affiliations, year of publication, publishing
ing are rapidly expanding the amount of available scholarly
                                                                      source (journal, conference, etc), publishing authority (such as
digital content. Recently, we have also witnessed a major
                                                                      ACM, IEEE, Elsevier, etc) and the list of references - each
shift in the landscape of scientific publishing with projects
                                                                      including previously mentioned work. A number of standards
like the Open Access Initiative 1 . In fact the number of open
                                                                      are available describing and categorizing bibliographic and
access journals is rising steadily, and new publishing models
                                                                      publishing metadata: for instance Dublin Core [4] and Bib-
are rapidly evolving to test new ways to increase readership
                                                                      1 Attribute Set from ANSI/NISO Z39.50-2003 (ISO 23950)
and access. Such new channels for academic communications
                                                                      [5]. In the present work we limit our investigation on metadata
are complementing and sometimes competing with traditional
                                                                      extraction to a significant subset of such standards. In fact,
authorities like journals, books and conferences proceedings.
                                                                      here we want to describe and evaluate our approach; extension
The existence of such variety and size of scholarly content
                                                                      to other instances of metadata is only quantitative and not
as well as its increasing accessibility opens the way to the
                                                                      conceptual.
development of useful semantic-enable services (like author’s
                                                                         The two different information sources of scientific content
profiling 2 , scientometrics [1], automatic science domains
                                                                      (traditional and Internet sources) present important differences
  1 http://www.openarchives.org/                                      in the approach for metadata retrieval: traditional sources are
  2 http://www.rexa.info//                                            usually based on manually prepared information (from certi-
fied authorities such as professional associations like ACM,
                                                                                    Normalized          Basic        Recognized
IEEE and commercial publishers, such as Elsevier, Springer,                            Text            Markup          Titles

etc.). In this case either all records are manually processed
or processing results are manually revised. This is possible           Cleaned
because traditional sources usually belong to single authorities         Text
                                                                                                                                  Metadata


with their internal standards on information storage. On the                        Formatted
                                                                                      Text
                                                                                                      Recognized
                                                                                                       Authors
                                                                                                                      Borders
                                                                                                                      Adjusted
other hand Internet-based sources usually belong to large
open communities (single researchers, group of researcher,
institutions) and do not follow specific strict standards. For                         Fig. 1.    Metadata Extraction Steps
instance, an academic paper that is stored in the Digital Library
of the IEEE Computer Society 3 contains the appropriate
metadata to support navigation through related papers (search        method can be used in automated end-to-end information
and sort by author, by publication date, etc). The same paper        retrieval and processing systems, supporting the control and
can be found on the homepage of the author or in the digital         elimination of any error-prone human intervention in the
repository of the affiliated academic institution. In this case,     process.
most often the metadata is not separated from the paper, or it is       The remainder of the paper is organized as follows. In
not structured. It needs either extraction or separate processing.   Section II we describe in detail the proposed approach to
   The problem of metadata extraction in the specific context        improve the quality of metadata extraction from scientific
of scientific Digital Libraries can be summarized as                 corpora: in particular we describe a two step procedure based
  1) identification of logical structures within single docu-        on (1) pattern-based metadata extraction using Finite State
     ments (header, abstract, introduction, body, references         Machine (FSM) and (2) statistical correction using a-priori
     section, etc.)                                                  domain-specific knowledge. In Section III we describe how
  2) entity recognition (author, title, reference, etc.) within      the proposed approach has been applied to a large set of
     single document                                                 documents (ca. 120 K) and we provide preliminary comparison
  3) metadata recognition within single entity.                      with the state of the art in the domain. In Section IV we discuss
                                                                     related work. Section V summarizes the results and discusses
A general assumption, in current metatada extraction tech-
                                                                     our future work.
niques, is based on the fact that there is a limited number
of formats to structure an academic paper and to represent                                      II. T HE A PPROACH
references. This is particularly true in Computer Science
domain where even a fewer number of formats are in active               The proposed approach consists of two major steps, namely
use (ACM format, IEEE format, etc). This information is                 1) pattern-based metadata extraction using Finite State Ma-
particularly helpful for point (1) above, but nevertheless one             chine (FSM) and
can achieve low quality results because of differences in               2) statistical correction using a-priori domain-specific
formatting due to a number of reasons such as (a) authors not              knowledge
following the pattern, (b) specifics of text representation in          For the first step we have analyzed, tested and personalized
columns in PDF/PS formats, (c) text pagination, (d) presence         for the specific application, existing state-of-the-art imple-
of headnotes and footnotes, etc. Obviously similar problems          mentation of specialized FSM-based lexical grammar parser
could be found in (2) and (3) as well, among them human              for fast text processing [6], [7]. For the second step we
errors, text extraction technical details from PDF/PS formats        have developed and investigated statistical methods that allow
and initial low quality results after step (1). As a results the     metadata correction and enrichment without the need to access
overall metadata quality won’t be sufficient for everyday use        external information sources. In the subsequent subsection we
in popular academic literature systems like CiteSeer.IST 4 ,         will describe each of the two steps in details.
Google Scholar 5 and Windows Academic Live 6 .
   The main contribution of this paper is a novel method for         A. Patterns-based Metadata Extraction Using Finite State
unsupervised metadata extraction based on a-priori domain-           Machine
specific knowledge. Our method does not rely on any external            Patterns-based metadata extraction contributes to initial
information sources and is solely based on the existing infor-       metadata retrieval in our approach. In contrast to the classical
mation in the document itself as well as in the overall set of       Information Retrieval (IR) goal, we mainly focus on the
documents currently present in a given digital archive. This         quality and not on the overall quantity of information. Core
includes both a-priori domain-specific information and infor-        idea is in the emphasis on quality improvement within several
mation obtained on previous processing steps. The proposed           subsequent steps, even at the cost of a limited decrease in
  3 http://www.computer.org/portal/site/csdl/
                                                                     overall metadata quantity.
  4 http://citeseer.ist.psu.edu/                                        This first metadata extraction step consists of a number
  5 http://scholar.google.com/                                       of interim phases (see Figure 1), each implemented as a
  6 http://academic.live.com/                                        single FSM. Application of the FSM model allows simple
and formal model verification avoiding most of the human              define program
                                                                          [head]
mistakes commonly involved in these tasks.
                                                                          [uninteresting]
   We have constructed an initial set of patterns for each                [opt references]
grammar based on a number of small training sets (typically             | [empty]
ca. 50 documents) from the target documents collection. Each          end define
set was manually labeled before processing and the processing         define head
results were manually evaluated. Within a limited number (ca.           [head_begin_tag][newline]
10) of patterns’ adjustments loops, we were able to obtain              [opt head_line]
appropriate recognition quality for correct processing most of          [repeat other_line]
                                                                        [head_end_tag] [newline]
the relevant entities formats in the complete target collection       end define
(more than 120K of documents). This finding corroborated
the initial assumption about the presence of a limited number         define head_line
of formats in a given scientific collection. According to our            [author][repeat separator_author][delimiter]
                                                                         [repeat token_not_newline+][newline]
original idea of step-by-step quality improvement, the trade-         end define
off between metadata quality and completeness of recognition
coverage on this step was shifted to the quality aspect. To           define other_line
this end, we allow our procedure to discard badly-formatted               [author][repeat headseparator_author][repeat trash][newline]
                                                                        | [line]
input, finally retaining only high-quality content and related        end define
metadata.
   The major steps of metadata extraction include (see Figure         define headseparator_author
                                                                          [opt space][headseparator][opt space][author]
1):
                                                                      end define
  1) Text normalization and special symbols removal. This             ...
     covers extra spaces, new lines and tabs removal, as
     well as non-printable symbols handling, and references’                        Fig. 2.   FSM: Authors recognition step
     section normalization. Moreover, it includes: text flow
     recognition, collateral elements detection (indexes, ta-
     bles of content, pages header and footer, etc), and               4) Authors recognition (see Figure 2) and title recognition
     hyphens correction regardless of the text’s language.                 using invariants first method proposed in [9]. iIn brief,
     These 1st-level pre-processing activities in our informa-             this method denotes that subfields of a reference that
     tion extraction process, although conceptually simple,                have relatively uniform syntax, position, and composi-
     provide a number of important values that contributes                 tion given all previous parsing, are the first to be parsed
     to the overall quality of the subsequent information                  subsequently.
     extraction. Namely:                                               5) Borders adjustments. We constructed heuristics for smart
                                                                           borders shift based on the number of lexical construc-
       • Text pre-processing contribute to the more accurate               tions from the grammar in the marked (recognized) and
         textual information acquisition, i.e. correctly iden-             unmarked (not recognized) reference’s region.
         tified text flow (pages ordering), removal of the
                                                                       At present, our FSM application is context-free:, i.e. we do
         repeated elements that do not contribute to the struc-
                                                                    not compare obtained metadata with already existing ones (as
         tural content (headers and footers) and removal of
                                                                    partially recognized corpus or external sources). Moreover, we
         text delimiters inside structural elements (footnotes
                                                                    have design and constructed the steps in a way that grammar
         and page numbers inside single reference, hyphens
                                                                    application is linear to the processing document’s size. Both
         in the authors’ names and titles in references, etc)
                                                                    properties - context-free and linearity - contribute significantly
       • Text structuring contributes to the correct identi-
                                                                    to the overall processing speed. The use of other information
         fication of the major structural elements within a
                                                                    (partially recognized corpus or external sources) could be used
         text, i.e. Introduction section and reference to the
                                                                    in a successive step to improve overall quality, but at the
         Introduction section in a Table of content section
                                                                    expenses of performance.
         of the article should be correctly distinguished and
         handled appropriately                                      B. Statistical Correction Using A-Priori Domain-Specific
     An extended presentation of the techniques used in this        Knowledge
     step can be found in [8].                                         The metadata obtained using previous step patterns can have
  2) Initial text tagging: separating header, abstract and refer-   satisfactory quality, but in general they lack in recognition
     ences parts. This allows us to process each section sepa-      coverage. For example we can have a reference with 100%-
     rately, contributing to the improved processing speed of       correctly recognized title, but with partially recognized au-
     the subsequent FSMs.                                           thors. We still can query this metadata, but we cannot use it
  3) References separation and initial items recognition            for the next knowledge processing level, like for instance doc-
     within each single reference.                                  uments clustering based on authors or documents interlinking
                        Recognized
                                                                           From an operational level view, authors identification and
                         Authors
                                          Co-Authors
                                           Checked                       correction can be summarized within four major steps (see
                        Processed
                                                                         Figure 3):
                                                                           1) Construction of document and community dictionaries.
      Citation
                                                             Corrected
                                                              Authors
                                                                               All recognized authors within the same document (i.e.
                                            Author                             both paper authors and cited authors) are combined in
                         Authors
                        Normalized
                                           Position
                                           Checked
                                                                               a local dictionary (document dictionary). All recognized
                                                                               authors within all documents within the same domain
                                                                               and/or URL are combined in another dictionary (com-
         Fig. 3.   Correction steps (correlation between authors)              munity dictionary).
                                                                           2) Co-authors dictionary building. For each author, co-
                                                                               authors are checked and grouped within a separate co-
based on their references. To tackle this issue, we developed                  author dictionary.
and tested several methods for extending recognition coverage              3) Normalization of authors. Authors’ entries within each
that combines (1) the partially incomplete metadata, obtained                  dictionary are normalized to the following forms: ”Name
from previous step and (2) a-priori domain-specific knowledge.                 Surname” and ”Initials Surname”. This provides a first
   To this end we have analyzed a large sample - several                       level of disambiguation essentially using self-citation
hundreds - of publication in computer science and extracted a                  patterns. Then, we iterate the normalization step using
limited number of usage patterns that seem to be common for                    the co-author dictionary associated to each author. This
the whole research domain. In particular:                                      adds another level of disambiguation, using community
                                                                               writing patterns, within authors’ initials in case of iden-
  • it is common to find self-citation in one author’s publi-
                                                                               tical surname.
    cations - this information can be used to correct author
                                                                           4) Authors identification and correction This last step is
    as well as topic identification.
                                                                               based on the collection of dictionaries (document, co-
  • it is common that in one document’s reference section
                                                                               author and community) and aims to solve the remaining
    there are several publications with the same author - this
                                                                               ambiguous cases using the whole knowledge present in
    information can be used for improving author identifica-
                                                                               the collection.
    tion (separation from other authors and from title).
  • it is common to find references to the same authors                    For the same reasons as described in previous section- i.e.
    within the same domain on Internet - i.e. same parent                simple and formal verification - authors’ correction procedure
    URL - (publications of the same author, home pages of                was developed as FSM. Figure 4 shows a fragment of the
    authors that belong to the same organization, publica-               FSM used for the implemented authors’ correction procedure.
    tions of the same institution, publications on the same
    event/conference). It is therefore possible to use correla-             We have accomplished titles correction in a similar way
    tion within a community for identification of authors.               (see Figure 5), however special heuristics for title borders
  • it is common that titles in references section belong to the         adjustments needed to be introduced as well as a number of
    same topic area of the paper. Therefore, it is possible to           concepts that we use in this procedure. In particular:
    use already recognized titles for other titles identification           • concept of ”lexical formula”
    and correction.                                                         • concept of ”document topic”
  • it is common to find references on the same topic or                    • concept of ”community topic”
    number of topics within the same domain on Internet                     Here, we define a lexical formula as a lexical constructs’
    (examples are the same as those in the item before for               frequency within selected logical element, i.e. the weighted
    author’s identification). Also here, it is possible to use           set (by frequency) of words in a selected metadata field. This
    such correlation within a community for titles identifica-           definition does not consider - for simplicity - any punctuation
    tion and correction.                                                 constructions and lexical constructs ordering . However, we
   We have use each of these assumptions for identification              think that on large datasets (millions of documents) this can
and correction of the corresponding metadata extraction and              result in better precision.
we have been able to statistically prove their correctness for the          We then call ”topic” a group of lexical formulas within
selected domain. For the sake of presentation of the proposed            the same document that can be merged within larger lexical
heuristic and statistical approach, we detail in the following the       formula based on the frequencies. This larger lexical formula
main procedures for (1) authors identification and correction            will be referenced in the following as document topic.
and (2) title corrections. But the same reasoning and approach              Similarly we define community topic as a group of lexical
can be used for the identification and correction of other meta-         formulas that can be merged together based on the frequencies
data present in the document, like: affiliation, keywords, type          of the incoming element. The difference from the document
of publication (Journal, Proceedings, Workshop,...), project,            topic is that here the initial formulas belong to a set of
event, etc.                                                              documents originating from the same domain on Internet
  Identify the place where new author was found:
                                                                                                  Recognized           Title’s
  - If the substring does not belong to any tag - mark it                                           Titles            Borders
  - If the substring partially belongs to single Author                                           Processed           Detected
    tag - check for the following situations:
    - If the substring partially belongs to Author tag and Title
      tag - check for the following situations:                                                                                        Corrected
                                                                                Citation
    - <Author>Fausto Giunchiglia Semantic</Author>                                                                                       Title
      <Title>Web</Title>                                                                           Dictionary
                                                                                                                      Overall
    - <Author>C. Lee </Author><Title>Giles CiteSeer.IST</Title>                                                       Borders
                                                                                                   Compiled
                                                                                                                     Corrected
      - check if <Title> is too short
      - check if last author is still OK
    - <Author>Fausto </Author> Giunchiglia. ->
       <Author>Fausto Giunchiglia</Author>                                          Fig. 5.   Correction steps (correlation between topics)
    - <Author>Fausto</Author> Giunchiglia. Semantic Web ->
      <Author>Fausto Giunchiglia</Author>. Semantic Web
    - <Author>Fausto Giunchiglia. Semantic Web</Author> ->                               III. P RELIMINARY E VALUATION
      <Author>Fausto Giunchiglia</Author>. Semantic Web
    - <Author>C. Lee Giles Fausto</Author> Giunchiglia. ->                   Since we are working with sets of hundreds of thousands
      <Author>C. Lee Giles</Author>                                       of documents, manual quality evaluation for the whole col-
      <Author>Fausto Giunchiglia</Author>.                                lection is not feasible. This fact has been already reported in
    - Fausto <Author>Giunchiglia C. Lee Giles</Author> ->
                                                                          related works [10], [11], [12] and other methodologies were
      <Author>Fausto Giunchiglia</Author>
      <Author>C. Lee Giles</Author>                                       proposed, based on the specifics of each concrete dataset.
  - If the substring partially belongs to different Author tags – check      In our experiments we compute quality evaluation criteria
    for situations:                                                       from preliminary results (extracted metadata) obtained with
  - <Author>Fausto </Author>
                                                                          the proposed approach with results used in existing state-
    <Author>Giunchiglia C. Lee Giles.</Author>
  - <Author>Fausto Giunchiglia C. </Author>                               of-the-art system. In order to build a consistent reference
    <Author>Lee Giles.</Author>                                           dataset for our evaluation, we have processed results of the
  ...                                                                     CiteSeer.IST Autonomous Citation Indexing (ACI) system,
                                                                          publicly available within its Open Archive Initiative (OAI).
                   Fig. 4.   FSM: Authors Correction                      The results contain metadata for 570K documents collected
                                                                          within CiteSeer.IST project within the last several years. We
                                                                          have used this metadata collection as an ”ideal” metadata. Here
                                                                          ”ideal” means that this metadata was processed using external
(typically same parent URL).
                                                                          manually verified datasets (DBLP) and afterwards manually
   Experimentally we found out that it is sufficient to use only          re-checked and corrected by users. Therefore, for our purpose,
middle dense part of a lexical formula, omitting all top-weight           it is our ”ideal” reference set.
constructs (usually they are represented by articles, prepo-                 We have processed list of URLs available within the ”ideal”
sitions and conjunctions) as well as low-weight constructs                collection and were able to retrieve from the Internet around
(usually they are represented by random constructs that are               120K of documents (other 450K documents have disappeared
similar to random noise and do not contribute to the topic’s              from their original location after they were collected by Cite-
definition).                                                              Seer.IST - in 3-4 years time). In this subset of documents, we
  The main operational steps for titles correction include:               were able to process, correctly identify and extract metadata
                                                                          from more than 90K of documents. This set corresponds to our
  1) Recognized titles cross-check and correction. This step              complete ”ideal” collection used in our preliminary evaluation.
     includes normalized titles presentation using dictionary                It is important to note that not all relevant metadata were
     with weights (lexical formula) and lexical formula                   present in the ”ideal” collection: for instance, the references’
     matching within complete collection of references lo-                section in each document contains only records that corre-
     cated in single document.                                            spond to the documents that are present in the overall collec-
  2) Construction of dictionaries of communities’ and docu-               tion. Practically this means that a large part of the references
     ments’ topics. The step includes communities and doc-                are missing. We were able to overcome this limitation by
     ument topics identification based on the formulas from               retrieving missing information from the corresponding static
     previous step and their clustering within communities.               pages within CiteSeer.IST project. This includes complete
  3) Titles’ borders detection using dictionaries of topics.              number of references and references themselves for each
     This includes topics’ formulas application to the refer-             document, however, without clear logical constructs separation
     ences and exact borders identification based on the same             within single reference.
     patterns used for initial metadata extraction.                          For our evaluation task, we have used standard quality
  4) Title-authors and title-publishing authority borders cor-            measure criteria, namely:
     rection. The step includes utilization of the invariants
     first method [9].                                                                                           A
                                                                                               Precision =                                         (1)
                                                                                                                A+C
                                                                                                  TABLE I
                                                                                      P RELIMINARY E VALUATION R ESULTS
                                A
                    Recall =                                 (2)
                               A+B                                                              Precision     87,7%
                                                                                                  Recall      88,5%
                     2 × Precision × Recall                                                    Fmeasure       88,1%
          Fmeasure =                                         (3)
                       Precision + Recall
   where:
   • A is the number of true positive samples predicted as
                                                                       In this domain, a powerful example is the XIP parsing
      positive; in our case A is the number of references that      system [17] a modular, declarative and XML-empowered
      we have recognized in a document that is not more than        linguistic analyzer and annotator: the system takes XML-
      the number of references in the corresponding document        based documents as input, linguistically analyzes their textual
      in the ideal set.                                             content (robust parsing) and produces the set of annotations
   • B is the number of true positive samples predicted as
                                                                    in an XML format as output. XIP robust parsing provides
      negative; in our case it is the difference between number     mechanisms for identifying Named Entity (NE) expressions,
      of references present in the ideal set and number of          and extracting relations between words or group of words, e.g.
      reference in corresponding document processed with our        relations between NE expressions.
      approach - but not less than zero.                               Other approaches used for metadata extraction include
   • C is the number of true negative samples predicted as
                                                                    grammar induction, hierarchical structuring and ontology-
      positive; in our case it is the difference between results    based approaches [18], [19].
      obtained with our approach and the ideal set - but not           Our approach aims to extend the state-of-the-art by con-
      less than zero.                                               tributing with a novel approach for metadata quality and cov-
   In Table I we report the results of our evaluation tests.        erage improvements. In distinction to the existing approaches,
These preliminary results show that our procedure is capable        we do not use any external information repositories, while we
of achieving the quality level that is comparable with the one      emphasize the exploitation of the knowledge available within
present in the ideal set but without usage of external manually     the available documents’ collection.
verified datasets and without human supervision (which is the
case for the ideal set).                                                                      V. C ONCLUSIONS
                     IV. R ELATED W ORK                                In this paper we have presented a novel method for unsu-
                                                                    pervised metadata extraction based on a-priori domain-specific
   The problem of unsupervised and quality metadata extrac-
                                                                    knowledge. The method does not rely on any external infor-
tion is under intense research activity. The first successful
                                                                    mation sources and is solely based on the existing information
Automated Citation Indexing (ACI) system - CiteSeer.IST -
                                                                    in the document and in the document’s context (set of docu-
has tackled this problem using a combination of patterns-
                                                                    ments). Combined with existing external knowledge sources
based approach (regular expressions) and manually prepared
                                                                    the approach can further improve overall metadata quality
external databases (DBLP and others) [9]. This provided high-
                                                                    and coverage. High quality automatic metadata extraction is a
quality metadata within known references. Subsequent meta-
                                                                    crucial step in order to move from linguistic entities to logical
data quality improvements in CiteSeer.IST were accomplished
                                                                    entities, relation information and logical relations and therefore
involving human-based information corrections. Application
                                                                    to the semantic level of Digital Library usability. This, in
of statistical models like Hidden Markov Model (HMM) [13]
                                                                    turn, creates the opportunity for value-added services within
and Dual and Variable-length output Hidden Markov Model
                                                                    existing and future semantic-enabled Digital Library systems.
(DVHMM) [14] are reported to have nearly 90% accuracy
however, the training set size used by the authors has the                               VI. ACKNOWLEDGMENTS
same magnitude as a processing corpora. Further metadata
methods development turned to the Support Vector Machines             We acknowledge C. Lee Giles for useful comments and
(SVM) usage. Numerous experiments involving SVM have                advice during initial brainstorming on the system architecture
been accomplished within this task [11], [15], demonstrating        as well as Fausto Giunchiglia for his advice and continuous
high accuracy, recall and precision of the results obtained.        support to the research project.
However there were reported several problems connected with                                      R EFERENCES
flexibility of such systems that could be the case for preventing
wide method application in the real-world systems.                   [1] A. V. Raan, “Scientometrics: State-of-the-art,” Scientometrics, vol. 38,
                                                                         no. 1, pp. 205–218, 2002.
   Natural Language Processing techniques belong to a dif-           [2] M. L. E.C.M. Noyons, H.F. Moed, “Combining mapping and citation
ferent but widely used approach for metadata extraction.                 analysis for evaluative bibliometric purposes: A bibliometric study,”
Experiments using Part-of-Speech (PoS) tagging [16], [13]                Journal of the American Society for Information Science, vol. 50, no. 2,
                                                                         pp. 115–131, 1999.
have proven capable to provide sufficient accuracy,; however         [3] R. Stecher, C. Niedere, P. Bouquet, and et al., “Enabling a knowledge
large manually-labeled corpora is usually required for training.         supply chain: From content resources to ontologies,” in Proc. of the.
 [4] A. Powell, “Guidelines for implementing dublin core in xml,”
     Dublin Core Metadata Initiative Recommendation, published at
     www.dublincore.org, 2003.
 [5] NISO, “Information retrieval (z39.50): Application service definition and
     protocol specification,” NISO Press, 2003.
 [6] N. Kiyavitskaya, N. Zeni, J. R. Cordy, L. Mich, and J. Mylopoulos,
     “Semi-automatic semantic annotations for next generation information
     systems,” in Proceedings of the 18th Conference on Advanced Infor-
     mation Systems Engineering, ser. Lecture Notes in Computer Science.
     Springer, 2006.
 [7] J. Cordy, “Txl a language for programming language tools and appli-
     cations,” in In Proceedings of 4th International Workshop on Language
     Descriptions, Tools and Applications, ser. Electronic Notes inTheoretical
     Computer Science, vol. 110. Elsevier Science, 2004.
 [8] A. Ivanyukovich and M. Marchese, “Unsupervised free-text processing
     and structuring in digital archives,” in Ist International Conference on
     Multidisciplinay Information Sciences and Technologies, 2006, accepted
     for publication.
 [9] S. Lawrence, C. L. Giles, and K. Bollacker, “Digital libraries and
     autonomous citation indexing,” IEEE Computer, vol. 32, no. 6, pp. 67–
     71, 1999.
[10] V. Petricek, I. J. Cox, H. Han, I. G. Councill, and C. L. Giles,
     “A comparison of on-line computer science citation databases,” in In
     Proceedings of the European Conference on Digital Libraries, ser.
     Lecture Notes in Computer Science. Springer, 2005.
[11] H. Han, C. L. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. A.
     Fox, “Automatic document metadata extraction using support vector
     machines,” in Proceedings of the 3rd ACM/IEEE-CS joint conference
     on Digital libraries. IEEE Computer Society, 2003, pp. 37–48.
[12] E. Agichtein and L. Gravano, “Snowball: Extracting relations from large
     plain-text collections,” in Proceedings of the 5th ACM International
     Conference on Digital Libraries. ACM, 2000.
[13] S. Cucerzan and D. Yarowsky, “Language independent, minimally
     supervised induction of lexical probabilities,” in Proceedings of the
     38th Annual Meeting on Association for Computational Linguistics.
     Association for Computational Linguistics, 2000, pp. 270–277.
[14] A. Takasu, “Bibliographic attribute extraction from erroneous references
     based on a statistical model,” in Proceedings of the 2003 Joint Confer-
     ence on Digital Libraries. IEEE, 2003.
[15] Y. Hu, H. Li, Y. Cao, D. Meyerzon, and Q. Zheng, “Automatic extraction
     of titles from general documents using machine learning,” in Proceesings
     of the JCDL’05, 2005.
[16] D. Besagni and A. Belaid, “Citation recognition for scientific publica-
     tions in digital libraries,” in First International Workshop on Document
     Image Analysis for Libraries (DIAL’04), 2004.
[17] S. Ait-Mokhtar, J. Chanod, and C. Roux, “Robustness beyond shallow-
     ness: incremental deep parsing,” Natural Language Engineering, vol. 8,
     pp. 121–144, 2002.
[18] P. Cimiano, S. Handschuh, and S. Staab, “Towards the self-annotating
     web,” in Proceedings of the WWW2004, 2004.
[19] M.-Y. Day, T.-H. Tsai, C.-L. Sung, C.-W. Lee, S.-H. Wu, C.-S. Ong,
     and W.-L. Hsu, “A knowledge-based approach to citation extraction,” in
     Proceedings of Information Reuse and Integration Conference, IRI-2005,
     2005.