Unsupervised Metadata Extraction in Scientific Digital Libraries Using A-Priori Domain-Specific Knowledge

Alexander Ivanyukovich
Department of Information and Communication Technology
University of Trento
38100 Trento, Italy
Email: a.ivanyukovich@dit.unitn.it

Maurizio Marchese
Department of Information and Communication Technology
University of Trento
38100 Trento, Italy
Email: maurizio.marchese@unitn.it
Abstract— Information extraction from unstructured sources is a crucial step in the semantic annotation of content. The challenge is to support a high-quality automatic (or at least semi-automatic) approach in order to sustain the scalability of the semantic-enabled services of the future. Unsupervised information extraction encompasses a number of underlying research problems, such as natural language processing, heterogeneous sources integration, knowledge representation, and others that are under past and current investigation. In this paper we concentrate on the problem of unsupervised metadata extraction in the Digital Libraries domain. We propose and present a novel approach focusing on the improvement of metadata extraction quality without involving external information sources (oracles, manually prepared databases, etc.), relying instead on the information present in the document itself and in its corresponding context. More specifically, we focus on quality improvements of metadata extraction from scientific papers (mainly in the computer science domain) collected from various sources over the Internet. Finally, we compare the results of our approach with the state of the art in the domain and discuss future work.

I. INTRODUCTION

The continuing expansion of the Internet has opened many new possibilities for information creation and exchange in general and in the academic world in particular: electronic publishing, digital libraries, electronic proceedings, self-publishing and, more recently, blogs and scientific news streaming are rapidly expanding the amount of available scholarly digital content. Recently, we have also witnessed a major shift in the landscape of scientific publishing with projects like the Open Access Initiative^1. In fact the number of open access journals is rising steadily, and new publishing models are rapidly evolving to test new ways to increase readership and access. Such new channels for academic communication are complementing and sometimes competing with traditional authorities like journals, books and conference proceedings. The existence of such variety and size of scholarly content, as well as its increasing accessibility, opens the way to the development of useful semantic-enabled services (like author's profiling^2, scientometrics [1], automatic science domains mapping [2], scientific social networks analysis, etc.). However, for the implementation of such semantic-aware services, the accumulated and available scholarly content first needs to be annotated with proper, high-quality semantic information.

^1 http://www.openarchives.org/
^2 http://www.rexa.info//

In the specific domain of scholarly literature, the structure of the published scientific information still follows, in the majority of cases, a number of established communication patterns, i.e. a certain set of structural information, such as title, author list, abstract, body, references, etc., is always present. This fact allows adopting existing information processing techniques for both traditional and Internet-based sources, contributing to the processing and creation of structured information within this content type.

In this paper we focus on the intersection of the Information Retrieval (IR) and Digital Libraries (DL) research domains to address the problem of high-quality automatic information extraction from digital scientific documents. This is a first and crucial step towards the semantic annotation of the raw digital content, in a kind of knowledge supply chain, as indicated in [3].

In describing information extraction within DL we use the term metadata to refer to the structured information obtained from text-based documents that includes, but is not limited to, title, authors, affiliations, year of publication, publishing source (journal, conference, etc.), publishing authority (such as ACM, IEEE, Elsevier, etc.) and the list of references, each including previously mentioned work. A number of standards are available describing and categorizing bibliographic and publishing metadata: for instance Dublin Core [4] and the Bib-1 Attribute Set from ANSI/NISO Z39.50-2003 (ISO 23950) [5]. In the present work we limit our investigation on metadata extraction to a significant subset of such standards. In fact, here we want to describe and evaluate our approach; extension to other instances of metadata is only quantitative and not conceptual.
The two different information sources of scientific content (traditional and Internet sources) present important differences in the approach to metadata retrieval: traditional sources are usually based on manually prepared information (from certified authorities such as professional associations like ACM and IEEE and commercial publishers such as Elsevier, Springer, etc.). In this case either all records are manually processed or processing results are manually revised. This is possible because traditional sources usually belong to single authorities with their own internal standards for information storage. On the other hand, Internet-based sources usually belong to large open communities (single researchers, groups of researchers, institutions) and do not follow specific strict standards. For instance, an academic paper that is stored in the Digital Library of the IEEE Computer Society^3 contains the appropriate metadata to support navigation through related papers (search and sort by author, by publication date, etc.). The same paper can be found on the homepage of the author or in the digital repository of the affiliated academic institution. In this case, most often the metadata is not separated from the paper, or it is not structured. It needs either extraction or separate processing.

[Fig. 1. Metadata Extraction Steps; pipeline stages: Cleaned Text, Normalized Text, Formatted Text, Basic Markup, Recognized Titles, Recognized Authors, Borders Adjusted, Metadata]

The problem of metadata extraction in the specific context of scientific Digital Libraries can be summarized as:
1) identification of logical structures within single documents (header, abstract, introduction, body, references section, etc.)
2) entity recognition (author, title, reference, etc.) within a single document
3) metadata recognition within a single entity.

A general assumption in current metadata extraction techniques is that there is a limited number of formats to structure an academic paper and to represent references. This is particularly true in the Computer Science domain, where an even smaller number of formats is in active use (ACM format, IEEE format, etc.). This information is particularly helpful for point (1) above, but nevertheless one can achieve low-quality results because of differences in formatting due to a number of reasons, such as (a) authors not following the pattern, (b) specifics of text representation in columns in PDF/PS formats, (c) text pagination, (d) presence of headnotes and footnotes, etc. Obviously similar problems can be found in (2) and (3) as well, among them human errors, technical details of text extraction from PDF/PS formats and initial low-quality results after step (1). As a result, the overall metadata quality won't be sufficient for everyday use in popular academic literature systems like CiteSeer.IST^4, Google Scholar^5 and Windows Academic Live^6.

The main contribution of this paper is a novel method for unsupervised metadata extraction based on a-priori domain-specific knowledge. Our method does not rely on any external information sources and is solely based on the existing information in the document itself as well as in the overall set of documents currently present in a given digital archive. This includes both a-priori domain-specific information and information obtained in previous processing steps. The proposed method can be used in automated end-to-end information retrieval and processing systems, supporting the control and elimination of any error-prone human intervention in the process.

^3 http://www.computer.org/portal/site/csdl/
^4 http://citeseer.ist.psu.edu/
^5 http://scholar.google.com/
^6 http://academic.live.com/

The remainder of the paper is organized as follows. In Section II we describe in detail the proposed approach to improve the quality of metadata extraction from scientific corpora: in particular we describe a two-step procedure based on (1) pattern-based metadata extraction using Finite State Machines (FSM) and (2) statistical correction using a-priori domain-specific knowledge. In Section III we describe how the proposed approach has been applied to a large set of documents (ca. 120 K) and we provide a preliminary comparison with the state of the art in the domain. In Section IV we discuss related work. Section V summarizes the results and discusses our future work.

II. THE APPROACH

The proposed approach consists of two major steps, namely
1) pattern-based metadata extraction using Finite State Machines (FSM), and
2) statistical correction using a-priori domain-specific knowledge.
For the first step we have analyzed, tested and customized for our specific application an existing state-of-the-art implementation of a specialized FSM-based lexical grammar parser for fast text processing [6], [7]. For the second step we have developed and investigated statistical methods that allow metadata correction and enrichment without the need to access external information sources. In the subsequent subsections we describe each of the two steps in detail.

A. Pattern-based Metadata Extraction Using Finite State Machines

Pattern-based metadata extraction provides the initial metadata retrieval in our approach. In contrast to the classical Information Retrieval (IR) goal, we mainly focus on the quality and not on the overall quantity of information. The core idea is to emphasize quality improvement across several subsequent steps, even at the cost of a limited decrease in overall metadata quantity.

This first metadata extraction step consists of a number of interim phases (see Figure 1), each implemented as a single FSM. Application of the FSM model allows simple and formal model verification, avoiding most of the human mistakes commonly involved in these tasks.
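To make the phase organization concrete, the sketch below (our own illustration, not the actual implementation) shows one way such a chain of single-pass phases can be wired together in Python; the phase names mirror Figure 1, while the function bodies are placeholders.

    from typing import Callable, List

    Phase = Callable[[List[str]], List[str]]  # each phase maps document lines to document lines

    def normalize_text(lines: List[str]) -> List[str]:
        # Placeholder for the normalization phase: collapse runs of whitespace.
        return [" ".join(line.split()) for line in lines]

    def tag_sections(lines: List[str]) -> List[str]:
        # Placeholder for the tagging phase: mark header/abstract/references regions.
        return lines

    def recognize_references(lines: List[str]) -> List[str]:
        # Placeholder for the reference separation phase.
        return lines

    PIPELINE: List[Phase] = [normalize_text, tag_sections, recognize_references]

    def run_pipeline(lines: List[str]) -> List[str]:
        # Each phase is applied exactly once, in order, as an independent pass.
        for phase in PIPELINE:
            lines = phase(lines)
        return lines

Because every phase is a single pass over the document, adding or removing a phase does not affect the linearity of the overall processing cost discussed later in this section.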
We have constructed an initial set of patterns for each grammar based on a number of small training sets (typically ca. 50 documents) from the target document collection. Each set was manually labeled before processing and the processing results were manually evaluated. Within a limited number (ca. 10) of pattern adjustment loops, we were able to obtain a recognition quality appropriate for correctly processing most of the relevant entity formats in the complete target collection (more than 120K documents). This finding corroborated the initial assumption about the presence of a limited number of formats in a given scientific collection. According to our original idea of step-by-step quality improvement, the trade-off between metadata quality and completeness of recognition coverage in this step was shifted towards the quality aspect. To this end, we allow our procedure to discard badly-formatted input, finally retaining only high-quality content and related metadata.

The major steps of metadata extraction include (see Figure 1):
1) Text normalization and special symbols removal. This covers removal of extra spaces, new lines and tabs, as well as non-printable symbols handling and references' section normalization. Moreover, it includes text flow recognition, collateral elements detection (indexes, tables of content, page headers and footers, etc.) and hyphen correction regardless of the text's language (a simplified sketch of this normalization is given after this list). These 1st-level pre-processing activities in our information extraction process, although conceptually simple, provide a number of important values that contribute to the overall quality of the subsequent information extraction. Namely:
   • Text pre-processing contributes to more accurate textual information acquisition, i.e. correctly identified text flow (page ordering), removal of repeated elements that do not contribute to the structural content (headers and footers) and removal of text delimiters inside structural elements (footnotes and page numbers inside a single reference, hyphens in authors' names and titles in references, etc.).
   • Text structuring contributes to the correct identification of the major structural elements within a text, i.e. the Introduction section and a reference to the Introduction section in the Table of Contents of the article should be correctly distinguished and handled appropriately.
   An extended presentation of the techniques used in this step can be found in [8].
2) Initial text tagging: separating header, abstract and references parts. This allows us to process each section separately, contributing to the improved processing speed of the subsequent FSMs.
3) References separation and initial items recognition within each single reference.
4) Authors recognition (see Figure 2) and title recognition using the "invariants first" method proposed in [9]. In brief, this method prescribes that subfields of a reference that have relatively uniform syntax, position and composition, given all previous parsing, are parsed first.
5) Borders adjustments. We constructed heuristics for smart border shifts based on the number of lexical constructions from the grammar in the marked (recognized) and unmarked (not recognized) regions of a reference.
define program
    [head]
    [uninteresting]
    [opt references]
    | [empty]
end define

define head
    [head_begin_tag][newline]
    [opt head_line]
    [repeat other_line]
    [head_end_tag] [newline]
end define

define head_line
    [author][repeat separator_author][delimiter]
    [repeat token_not_newline+][newline]
end define

define other_line
    [author][repeat headseparator_author][repeat trash][newline]
    | [line]
end define

define headseparator_author
    [opt space][headseparator][opt space][author]
end define

...

Fig. 2. FSM: Authors recognition step

At present, our FSM application is context-free, i.e. we do not compare obtained metadata with already existing metadata (such as a partially recognized corpus or external sources). Moreover, we have designed and constructed the steps in such a way that grammar application is linear in the processed document's size. Both properties - context-freeness and linearity - contribute significantly to the overall processing speed. Other information (a partially recognized corpus or external sources) could be used in a successive step to improve overall quality, but at the expense of performance.
expenses of performance.
of the article should be correctly distinguished and
handled appropriately B. Statistical Correction Using A-Priori Domain-Specific
An extended presentation of the techniques used in this Knowledge
step can be found in [8]. The metadata obtained using previous step patterns can have
2) Initial text tagging: separating header, abstract and refer- satisfactory quality, but in general they lack in recognition
ences parts. This allows us to process each section sepa- coverage. For example we can have a reference with 100%-
rately, contributing to the improved processing speed of correctly recognized title, but with partially recognized au-
the subsequent FSMs. thors. We still can query this metadata, but we cannot use it
3) References separation and initial items recognition for the next knowledge processing level, like for instance doc-
within each single reference. uments clustering based on authors or documents interlinking
To tackle this issue, we developed and tested several methods for extending recognition coverage that combine (1) the partially incomplete metadata obtained from the previous step and (2) a-priori domain-specific knowledge. To this end we have analyzed a large sample - several hundred - of publications in computer science and extracted a limited number of usage patterns that seem to be common to the whole research domain. In particular:
• it is common to find self-citations in one author's publications - this information can be used to correct author as well as topic identification.
• it is common that in one document's reference section there are several publications by the same author - this information can be used for improving author identification (separation from other authors and from the title).
• it is common to find references to the same authors within the same domain on the Internet - i.e. the same parent URL - (publications of the same author, home pages of authors that belong to the same organization, publications of the same institution, publications of the same event/conference). It is therefore possible to use correlation within a community for identification of authors.
• it is common that titles in the references section belong to the same topic area as the paper. Therefore, it is possible to use already recognized titles for the identification and correction of other titles.
• it is common to find references on the same topic or set of topics within the same domain on the Internet (the examples are the same as those in the previous item for author identification). Also here, it is possible to use such correlation within a community for title identification and correction.

We have used each of these assumptions for the identification and correction of the corresponding extracted metadata and we have been able to statistically prove their correctness for the selected domain. For the sake of presentation of the proposed heuristic and statistical approach, we detail in the following the main procedures for (1) authors identification and correction and (2) title correction. But the same reasoning and approach can be used for the identification and correction of other metadata present in the document, such as affiliation, keywords, type of publication (Journal, Proceedings, Workshop, ...), project, event, etc.

[Fig. 3. Correction steps (correlation between authors); stages: Recognized Authors, Co-Authors Checked, Citation Processed, Author Position Checked, Authors Normalized, Corrected Authors]

From an operational point of view, authors identification and correction can be summarized in four major steps (see Figure 3; a simplified sketch follows the list):
1) Construction of document and community dictionaries. All recognized authors within the same document (i.e. both paper authors and cited authors) are combined in a local dictionary (document dictionary). All recognized authors within all documents from the same domain and/or URL are combined in another dictionary (community dictionary).
2) Co-authors dictionary building. For each author, co-authors are checked and grouped within a separate co-author dictionary.
3) Normalization of authors. Authors' entries within each dictionary are normalized to the following forms: "Name Surname" and "Initials Surname". This provides a first level of disambiguation, essentially using self-citation patterns. Then, we iterate the normalization step using the co-author dictionary associated to each author. This adds another level of disambiguation, using community writing patterns, within authors' initials in the case of identical surnames.
4) Authors identification and correction. This last step is based on the collection of dictionaries (document, co-author and community) and aims to solve the remaining ambiguous cases using the whole knowledge present in the collection.
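The sketch below, our own minimal illustration rather than the actual FSM-based procedure, shows how the document, community and co-author dictionaries of steps 1-3 can be built and how name variants can be grouped; the data layout (author strings grouped per document) and the normalization to an (initials, surname) key are simplifying assumptions.

    from collections import defaultdict
    from typing import Dict, List, Set, Tuple

    def normalize(author: str) -> Tuple[str, str]:
        # Reduce an author string to an (initials, surname) key, e.g. both
        # "Alexander Ivanyukovich" and "A. Ivanyukovich" map to ("A", "Ivanyukovich").
        parts = author.replace(".", " ").split()
        surname = parts[-1]
        initials = "".join(p[0] for p in parts[:-1]).upper()
        return initials, surname

    def build_dictionaries(docs: Dict[str, List[List[str]]]):
        # docs maps a document id to its author groups: the paper's own author list
        # plus the author list of every recognized reference.
        document_dict: Dict[str, Set[str]] = {}
        community_dict: Set[str] = set()
        coauthors: Dict[Tuple[str, str], Set[Tuple[str, str]]] = defaultdict(set)
        for doc_id, author_groups in docs.items():
            names = {a for group in author_groups for a in group}
            document_dict[doc_id] = names          # document dictionary (step 1)
            community_dict |= names                # community dictionary (step 1)
            for group in author_groups:            # co-author dictionary (step 2)
                keys = [normalize(a) for a in group]
                for key in keys:
                    coauthors[key] |= {k for k in keys if k != key}
        return document_dict, community_dict, coauthors

    def merge_variants(names: Set[str]) -> Dict[Tuple[str, str], Set[str]]:
        # First disambiguation level of step 3: group name variants that normalize
        # to the same (initials, surname) key.
        groups: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        for name in names:
            groups[normalize(name)].add(name)
        return groups

Applying merge_variants to a community dictionary groups, for instance, "A. Ivanyukovich" and "Alexander Ivanyukovich" under the same key, which corresponds to the first disambiguation level described in step 3.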
For the same reasons described in the previous section - i.e. simple and formal verification - the authors' correction procedure was developed as an FSM. Figure 4 shows a fragment of the FSM used for the implemented authors' correction procedure.

We have accomplished title correction in a similar way (see Figure 5); however, special heuristics for title border adjustments needed to be introduced, as well as a number of concepts that we use in this procedure. In particular:
• the concept of "lexical formula"
• the concept of "document topic"
• the concept of "community topic"
Here, we define a lexical formula as the lexical constructs' frequency within a selected logical element, i.e. the weighted set (by frequency) of words in a selected metadata field. This definition does not consider - for simplicity - any punctuation constructions or lexical construct ordering. However, we think that on large datasets (millions of documents) taking these into account can result in better precision.

We then call a "topic" a group of lexical formulas within the same document that can be merged into a larger lexical formula based on the frequencies. This larger lexical formula will be referred to in the following as the document topic.

Similarly, we define a community topic as a group of lexical formulas that can be merged together based on the frequencies of the incoming elements. The difference from the document topic is that here the initial formulas belong to a set of documents originating from the same domain on the Internet
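To make the definitions above concrete, a lexical formula can be represented as a word-frequency counter over one metadata field, and a document or community topic as the merge of several such formulas; the sketch below is our own minimal rendering under these simplifying assumptions (plain lowercase tokenization, frequency summation as the merge policy).

    import re
    from collections import Counter
    from typing import Iterable

    def lexical_formula(field_text: str) -> Counter:
        # Weighted set of words (by frequency) for one metadata field, ignoring
        # punctuation and word order, as in the definition above.
        words = re.findall(r"[a-z]+", field_text.lower())
        return Counter(words)

    def merge_formulas(formulas: Iterable[Counter]) -> Counter:
        # Merge several lexical formulas into a larger one by summing frequencies;
        # applied to the formulas of one document this yields the document topic,
        # applied across documents of one Internet domain, the community topic.
        merged: Counter = Counter()
        for formula in formulas:
            merged.update(formula)
        return merged

    # Example: the document topic of a paper from its recognized reference titles.
    titles = ["Unsupervised metadata extraction", "Metadata extraction in digital libraries"]
    document_topic = merge_formulas(lexical_formula(t) for t in titles)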
[Figure: fragment of the authors' correction FSM (Figure 4)]
Identify the place where new author was found:
  - If the substring does not belong to any tag - mark it
  - If the substring partially belongs to single Author tag - check for the following situations:
  - If the substring partially belongs to Author tag and Title tag - check for the following situations:
  -

[Figure residue (title correction steps): Recognized Titles, Title's Borders Detected, Title Corrected, Citation ...]