Unsupervised Metadata Extraction in Scientific Digital Libraries Using A-Priori Domain-Specific Knowledge Alexander Ivanyukovich Maurizio Marchese Department of Information and Communication Technology Department of Information and Communication Technology University of Trento University of Trento 38100 Trento, Italy 38100 Trento, Italy Email: a.ivanyukovich@dit.unitn.it Email: maurizio.marchese@unitn.it Abstract— Information extraction from unstructured sources mapping [2], scientific social networks analysis etc.). However is a crucial step in the semantic annotation of content. The for the implementation of such semantic-aware services first challenge is in supporting an high quality automatic approach the accumulated and available scholarly content need to be (or at least semi-automatic) in order to sustain the scalability of the semantic-enabled services of the future. Unsupervised infor- annotated with proper and high quality semantic information. mation extraction encompasses a number of underlying research In the specific domain of scholarly literature, the structure problems, such as natural language processing, heterogeneous of the published scientific information still follows, in the sources integration, knowledge representation, and others that majority of the cases, a number of established communication are under past and current investigation. In this paper we approach and patterns, i.e. a certain number of structure concentrate on the problem of unsupervised metadata extraction in the Digital Libraries domain. We propose and present a novel information, such as title, author’s list, abstract, body, ref- approach focusing on the improvement in the metadata extraction erences et al., are always present. This fact allows adopting quality without involving external information sources (oracles, existing information processing techniques for both traditional manually prepared databases, etc), but relying on the information and Internet-based sources, contributing to the processing and present in the document itself and in its corresponding context. creation of structured information within this content type. More specifically, we focus on quality improvements of metadata extraction from scientific papers (mainly in computer science In this paper we will focus on the intersection of Information domain) collected from various sources over the Internet. Finally, Retrieval (IR) and Digital Libraries (DL) research domains we compare the results of our approach with the state of the art to address the problem of quality automatic information ex- in the domain and discuss future work. traction from digital scientific documents. This is a first and I. I NTRODUCTION crucial step towards the semantic annotation of the raw digital content, in a kind of knowledge supply chain, as indicated in The continuing expansion of the Internet has opened [3]. many new possibilities for information creation and exchange In describing information extraction within DL we will use in general and in the academic world in particular: elec- the term metadata to refer to the structured information ob- tronic publishing, digital libraries, electronic proceedings, self- tained from text-based documents that includes but is not lim- publishing and more recently blogs and scientific news stream- ited to title, authors, affiliations, year of publication, publishing ing are rapidly expanding the amount of available scholarly source (journal, conference, etc), publishing authority (such as digital content. Recently, we have also witnessed a major ACM, IEEE, Elsevier, etc) and the list of references - each shift in the landscape of scientific publishing with projects including previously mentioned work. A number of standards like the Open Access Initiative 1 . In fact the number of open are available describing and categorizing bibliographic and access journals is rising steadily, and new publishing models publishing metadata: for instance Dublin Core [4] and Bib- are rapidly evolving to test new ways to increase readership 1 Attribute Set from ANSI/NISO Z39.50-2003 (ISO 23950) and access. Such new channels for academic communications [5]. In the present work we limit our investigation on metadata are complementing and sometimes competing with traditional extraction to a significant subset of such standards. In fact, authorities like journals, books and conferences proceedings. here we want to describe and evaluate our approach; extension The existence of such variety and size of scholarly content to other instances of metadata is only quantitative and not as well as its increasing accessibility opens the way to the conceptual. development of useful semantic-enable services (like author’s The two different information sources of scientific content profiling 2 , scientometrics [1], automatic science domains (traditional and Internet sources) present important differences 1 http://www.openarchives.org/ in the approach for metadata retrieval: traditional sources are 2 http://www.rexa.info// usually based on manually prepared information (from certi- fied authorities such as professional associations like ACM, Normalized Basic Recognized IEEE and commercial publishers, such as Elsevier, Springer, Text Markup Titles etc.). In this case either all records are manually processed or processing results are manually revised. This is possible Cleaned because traditional sources usually belong to single authorities Text Metadata with their internal standards on information storage. On the Formatted Text Recognized Authors Borders Adjusted other hand Internet-based sources usually belong to large open communities (single researchers, group of researcher, institutions) and do not follow specific strict standards. For Fig. 1. Metadata Extraction Steps instance, an academic paper that is stored in the Digital Library of the IEEE Computer Society 3 contains the appropriate metadata to support navigation through related papers (search method can be used in automated end-to-end information and sort by author, by publication date, etc). The same paper retrieval and processing systems, supporting the control and can be found on the homepage of the author or in the digital elimination of any error-prone human intervention in the repository of the affiliated academic institution. In this case, process. most often the metadata is not separated from the paper, or it is The remainder of the paper is organized as follows. In not structured. It needs either extraction or separate processing. Section II we describe in detail the proposed approach to The problem of metadata extraction in the specific context improve the quality of metadata extraction from scientific of scientific Digital Libraries can be summarized as corpora: in particular we describe a two step procedure based 1) identification of logical structures within single docu- on (1) pattern-based metadata extraction using Finite State ments (header, abstract, introduction, body, references Machine (FSM) and (2) statistical correction using a-priori section, etc.) domain-specific knowledge. In Section III we describe how 2) entity recognition (author, title, reference, etc.) within the proposed approach has been applied to a large set of single document documents (ca. 120 K) and we provide preliminary comparison 3) metadata recognition within single entity. with the state of the art in the domain. In Section IV we discuss related work. Section V summarizes the results and discusses A general assumption, in current metatada extraction tech- our future work. niques, is based on the fact that there is a limited number of formats to structure an academic paper and to represent II. T HE A PPROACH references. This is particularly true in Computer Science domain where even a fewer number of formats are in active The proposed approach consists of two major steps, namely use (ACM format, IEEE format, etc). This information is 1) pattern-based metadata extraction using Finite State Ma- particularly helpful for point (1) above, but nevertheless one chine (FSM) and can achieve low quality results because of differences in 2) statistical correction using a-priori domain-specific formatting due to a number of reasons such as (a) authors not knowledge following the pattern, (b) specifics of text representation in For the first step we have analyzed, tested and personalized columns in PDF/PS formats, (c) text pagination, (d) presence for the specific application, existing state-of-the-art imple- of headnotes and footnotes, etc. Obviously similar problems mentation of specialized FSM-based lexical grammar parser could be found in (2) and (3) as well, among them human for fast text processing [6], [7]. For the second step we errors, text extraction technical details from PDF/PS formats have developed and investigated statistical methods that allow and initial low quality results after step (1). As a results the metadata correction and enrichment without the need to access overall metadata quality won’t be sufficient for everyday use external information sources. In the subsequent subsection we in popular academic literature systems like CiteSeer.IST 4 , will describe each of the two steps in details. Google Scholar 5 and Windows Academic Live 6 . The main contribution of this paper is a novel method for A. Patterns-based Metadata Extraction Using Finite State unsupervised metadata extraction based on a-priori domain- Machine specific knowledge. Our method does not rely on any external Patterns-based metadata extraction contributes to initial information sources and is solely based on the existing infor- metadata retrieval in our approach. In contrast to the classical mation in the document itself as well as in the overall set of Information Retrieval (IR) goal, we mainly focus on the documents currently present in a given digital archive. This quality and not on the overall quantity of information. Core includes both a-priori domain-specific information and infor- idea is in the emphasis on quality improvement within several mation obtained on previous processing steps. The proposed subsequent steps, even at the cost of a limited decrease in 3 http://www.computer.org/portal/site/csdl/ overall metadata quantity. 4 http://citeseer.ist.psu.edu/ This first metadata extraction step consists of a number 5 http://scholar.google.com/ of interim phases (see Figure 1), each implemented as a 6 http://academic.live.com/ single FSM. Application of the FSM model allows simple and formal model verification avoiding most of the human define program [head] mistakes commonly involved in these tasks. [uninteresting] We have constructed an initial set of patterns for each [opt references] grammar based on a number of small training sets (typically | [empty] ca. 50 documents) from the target documents collection. Each end define set was manually labeled before processing and the processing define head results were manually evaluated. Within a limited number (ca. [head_begin_tag][newline] 10) of patterns’ adjustments loops, we were able to obtain [opt head_line] appropriate recognition quality for correct processing most of [repeat other_line] [head_end_tag] [newline] the relevant entities formats in the complete target collection end define (more than 120K of documents). This finding corroborated the initial assumption about the presence of a limited number define head_line of formats in a given scientific collection. According to our [author][repeat separator_author][delimiter] [repeat token_not_newline+][newline] original idea of step-by-step quality improvement, the trade- end define off between metadata quality and completeness of recognition coverage on this step was shifted to the quality aspect. To define other_line this end, we allow our procedure to discard badly-formatted [author][repeat headseparator_author][repeat trash][newline] | [line] input, finally retaining only high-quality content and related end define metadata. The major steps of metadata extraction include (see Figure define headseparator_author [opt space][headseparator][opt space][author] 1): end define 1) Text normalization and special symbols removal. This ... covers extra spaces, new lines and tabs removal, as well as non-printable symbols handling, and references’ Fig. 2. FSM: Authors recognition step section normalization. Moreover, it includes: text flow recognition, collateral elements detection (indexes, ta- bles of content, pages header and footer, etc), and 4) Authors recognition (see Figure 2) and title recognition hyphens correction regardless of the text’s language. using invariants first method proposed in [9]. iIn brief, These 1st-level pre-processing activities in our informa- this method denotes that subfields of a reference that tion extraction process, although conceptually simple, have relatively uniform syntax, position, and composi- provide a number of important values that contributes tion given all previous parsing, are the first to be parsed to the overall quality of the subsequent information subsequently. extraction. Namely: 5) Borders adjustments. We constructed heuristics for smart borders shift based on the number of lexical construc- • Text pre-processing contribute to the more accurate tions from the grammar in the marked (recognized) and textual information acquisition, i.e. correctly iden- unmarked (not recognized) reference’s region. tified text flow (pages ordering), removal of the At present, our FSM application is context-free:, i.e. we do repeated elements that do not contribute to the struc- not compare obtained metadata with already existing ones (as tural content (headers and footers) and removal of partially recognized corpus or external sources). Moreover, we text delimiters inside structural elements (footnotes have design and constructed the steps in a way that grammar and page numbers inside single reference, hyphens application is linear to the processing document’s size. Both in the authors’ names and titles in references, etc) properties - context-free and linearity - contribute significantly • Text structuring contributes to the correct identi- to the overall processing speed. The use of other information fication of the major structural elements within a (partially recognized corpus or external sources) could be used text, i.e. Introduction section and reference to the in a successive step to improve overall quality, but at the Introduction section in a Table of content section expenses of performance. of the article should be correctly distinguished and handled appropriately B. Statistical Correction Using A-Priori Domain-Specific An extended presentation of the techniques used in this Knowledge step can be found in [8]. The metadata obtained using previous step patterns can have 2) Initial text tagging: separating header, abstract and refer- satisfactory quality, but in general they lack in recognition ences parts. This allows us to process each section sepa- coverage. For example we can have a reference with 100%- rately, contributing to the improved processing speed of correctly recognized title, but with partially recognized au- the subsequent FSMs. thors. We still can query this metadata, but we cannot use it 3) References separation and initial items recognition for the next knowledge processing level, like for instance doc- within each single reference. uments clustering based on authors or documents interlinking Recognized From an operational level view, authors identification and Authors Co-Authors Checked correction can be summarized within four major steps (see Processed Figure 3): 1) Construction of document and community dictionaries. Citation Corrected Authors All recognized authors within the same document (i.e. Author both paper authors and cited authors) are combined in Authors Normalized Position Checked a local dictionary (document dictionary). All recognized authors within all documents within the same domain and/or URL are combined in another dictionary (com- Fig. 3. Correction steps (correlation between authors) munity dictionary). 2) Co-authors dictionary building. For each author, co- authors are checked and grouped within a separate co- based on their references. To tackle this issue, we developed author dictionary. and tested several methods for extending recognition coverage 3) Normalization of authors. Authors’ entries within each that combines (1) the partially incomplete metadata, obtained dictionary are normalized to the following forms: ”Name from previous step and (2) a-priori domain-specific knowledge. Surname” and ”Initials Surname”. This provides a first To this end we have analyzed a large sample - several level of disambiguation essentially using self-citation hundreds - of publication in computer science and extracted a patterns. Then, we iterate the normalization step using limited number of usage patterns that seem to be common for the co-author dictionary associated to each author. This the whole research domain. In particular: adds another level of disambiguation, using community writing patterns, within authors’ initials in case of iden- • it is common to find self-citation in one author’s publi- tical surname. cations - this information can be used to correct author 4) Authors identification and correction This last step is as well as topic identification. based on the collection of dictionaries (document, co- • it is common that in one document’s reference section author and community) and aims to solve the remaining there are several publications with the same author - this ambiguous cases using the whole knowledge present in information can be used for improving author identifica- the collection. tion (separation from other authors and from title). • it is common to find references to the same authors For the same reasons as described in previous section- i.e. within the same domain on Internet - i.e. same parent simple and formal verification - authors’ correction procedure URL - (publications of the same author, home pages of was developed as FSM. Figure 4 shows a fragment of the authors that belong to the same organization, publica- FSM used for the implemented authors’ correction procedure. tions of the same institution, publications on the same event/conference). It is therefore possible to use correla- We have accomplished titles correction in a similar way tion within a community for identification of authors. (see Figure 5), however special heuristics for title borders • it is common that titles in references section belong to the adjustments needed to be introduced as well as a number of same topic area of the paper. Therefore, it is possible to concepts that we use in this procedure. In particular: use already recognized titles for other titles identification • concept of ”lexical formula” and correction. • concept of ”document topic” • it is common to find references on the same topic or • concept of ”community topic” number of topics within the same domain on Internet Here, we define a lexical formula as a lexical constructs’ (examples are the same as those in the item before for frequency within selected logical element, i.e. the weighted author’s identification). Also here, it is possible to use set (by frequency) of words in a selected metadata field. This such correlation within a community for titles identifica- definition does not consider - for simplicity - any punctuation tion and correction. constructions and lexical constructs ordering . However, we We have use each of these assumptions for identification think that on large datasets (millions of documents) this can and correction of the corresponding metadata extraction and result in better precision. we have been able to statistically prove their correctness for the We then call ”topic” a group of lexical formulas within selected domain. For the sake of presentation of the proposed the same document that can be merged within larger lexical heuristic and statistical approach, we detail in the following the formula based on the frequencies. This larger lexical formula main procedures for (1) authors identification and correction will be referenced in the following as document topic. and (2) title corrections. But the same reasoning and approach Similarly we define community topic as a group of lexical can be used for the identification and correction of other meta- formulas that can be merged together based on the frequencies data present in the document, like: affiliation, keywords, type of the incoming element. The difference from the document of publication (Journal, Proceedings, Workshop,...), project, topic is that here the initial formulas belong to a set of event, etc. documents originating from the same domain on Internet Identify the place where new author was found: Recognized Title’s - If the substring does not belong to any tag - mark it Titles Borders - If the substring partially belongs to single Author Processed Detected tag - check for the following situations: - If the substring partially belongs to Author tag and Title tag - check for the following situations: Corrected Citation - Fausto Giunchiglia Semantic Title Web Dictionary Overall - C. Lee Giles CiteSeer.IST Borders Compiled Corrected - check if is too short - check if last author is still OK - <Author>Fausto </Author> Giunchiglia. -> <Author>Fausto Giunchiglia</Author> Fig. 5. Correction steps (correlation between topics) - <Author>Fausto</Author> Giunchiglia. Semantic Web -> <Author>Fausto Giunchiglia</Author>. Semantic Web - <Author>Fausto Giunchiglia. Semantic Web</Author> -> III. P RELIMINARY E VALUATION <Author>Fausto Giunchiglia</Author>. Semantic Web - <Author>C. Lee Giles Fausto</Author> Giunchiglia. -> Since we are working with sets of hundreds of thousands <Author>C. Lee Giles</Author> of documents, manual quality evaluation for the whole col- <Author>Fausto Giunchiglia</Author>. lection is not feasible. This fact has been already reported in - Fausto <Author>Giunchiglia C. Lee Giles</Author> -> related works [10], [11], [12] and other methodologies were <Author>Fausto Giunchiglia</Author> <Author>C. Lee Giles</Author> proposed, based on the specifics of each concrete dataset. - If the substring partially belongs to different Author tags – check In our experiments we compute quality evaluation criteria for situations: from preliminary results (extracted metadata) obtained with - <Author>Fausto </Author> the proposed approach with results used in existing state- <Author>Giunchiglia C. Lee Giles.</Author> - <Author>Fausto Giunchiglia C. </Author> of-the-art system. In order to build a consistent reference <Author>Lee Giles.</Author> dataset for our evaluation, we have processed results of the ... CiteSeer.IST Autonomous Citation Indexing (ACI) system, publicly available within its Open Archive Initiative (OAI). Fig. 4. FSM: Authors Correction The results contain metadata for 570K documents collected within CiteSeer.IST project within the last several years. We have used this metadata collection as an ”ideal” metadata. Here ”ideal” means that this metadata was processed using external (typically same parent URL). manually verified datasets (DBLP) and afterwards manually Experimentally we found out that it is sufficient to use only re-checked and corrected by users. Therefore, for our purpose, middle dense part of a lexical formula, omitting all top-weight it is our ”ideal” reference set. constructs (usually they are represented by articles, prepo- We have processed list of URLs available within the ”ideal” sitions and conjunctions) as well as low-weight constructs collection and were able to retrieve from the Internet around (usually they are represented by random constructs that are 120K of documents (other 450K documents have disappeared similar to random noise and do not contribute to the topic’s from their original location after they were collected by Cite- definition). Seer.IST - in 3-4 years time). In this subset of documents, we The main operational steps for titles correction include: were able to process, correctly identify and extract metadata from more than 90K of documents. This set corresponds to our 1) Recognized titles cross-check and correction. This step complete ”ideal” collection used in our preliminary evaluation. includes normalized titles presentation using dictionary It is important to note that not all relevant metadata were with weights (lexical formula) and lexical formula present in the ”ideal” collection: for instance, the references’ matching within complete collection of references lo- section in each document contains only records that corre- cated in single document. spond to the documents that are present in the overall collec- 2) Construction of dictionaries of communities’ and docu- tion. Practically this means that a large part of the references ments’ topics. The step includes communities and doc- are missing. We were able to overcome this limitation by ument topics identification based on the formulas from retrieving missing information from the corresponding static previous step and their clustering within communities. pages within CiteSeer.IST project. This includes complete 3) Titles’ borders detection using dictionaries of topics. number of references and references themselves for each This includes topics’ formulas application to the refer- document, however, without clear logical constructs separation ences and exact borders identification based on the same within single reference. patterns used for initial metadata extraction. For our evaluation task, we have used standard quality 4) Title-authors and title-publishing authority borders cor- measure criteria, namely: rection. The step includes utilization of the invariants first method [9]. A Precision = (1) A+C TABLE I P RELIMINARY E VALUATION R ESULTS A Recall = (2) A+B Precision 87,7% Recall 88,5% 2 × Precision × Recall Fmeasure 88,1% Fmeasure = (3) Precision + Recall where: • A is the number of true positive samples predicted as In this domain, a powerful example is the XIP parsing positive; in our case A is the number of references that system [17] a modular, declarative and XML-empowered we have recognized in a document that is not more than linguistic analyzer and annotator: the system takes XML- the number of references in the corresponding document based documents as input, linguistically analyzes their textual in the ideal set. content (robust parsing) and produces the set of annotations • B is the number of true positive samples predicted as in an XML format as output. XIP robust parsing provides negative; in our case it is the difference between number mechanisms for identifying Named Entity (NE) expressions, of references present in the ideal set and number of and extracting relations between words or group of words, e.g. reference in corresponding document processed with our relations between NE expressions. approach - but not less than zero. Other approaches used for metadata extraction include • C is the number of true negative samples predicted as grammar induction, hierarchical structuring and ontology- positive; in our case it is the difference between results based approaches [18], [19]. obtained with our approach and the ideal set - but not Our approach aims to extend the state-of-the-art by con- less than zero. tributing with a novel approach for metadata quality and cov- In Table I we report the results of our evaluation tests. erage improvements. In distinction to the existing approaches, These preliminary results show that our procedure is capable we do not use any external information repositories, while we of achieving the quality level that is comparable with the one emphasize the exploitation of the knowledge available within present in the ideal set but without usage of external manually the available documents’ collection. verified datasets and without human supervision (which is the case for the ideal set). V. C ONCLUSIONS IV. R ELATED W ORK In this paper we have presented a novel method for unsu- pervised metadata extraction based on a-priori domain-specific The problem of unsupervised and quality metadata extrac- knowledge. The method does not rely on any external infor- tion is under intense research activity. The first successful mation sources and is solely based on the existing information Automated Citation Indexing (ACI) system - CiteSeer.IST - in the document and in the document’s context (set of docu- has tackled this problem using a combination of patterns- ments). Combined with existing external knowledge sources based approach (regular expressions) and manually prepared the approach can further improve overall metadata quality external databases (DBLP and others) [9]. This provided high- and coverage. High quality automatic metadata extraction is a quality metadata within known references. Subsequent meta- crucial step in order to move from linguistic entities to logical data quality improvements in CiteSeer.IST were accomplished entities, relation information and logical relations and therefore involving human-based information corrections. Application to the semantic level of Digital Library usability. This, in of statistical models like Hidden Markov Model (HMM) [13] turn, creates the opportunity for value-added services within and Dual and Variable-length output Hidden Markov Model existing and future semantic-enabled Digital Library systems. (DVHMM) [14] are reported to have nearly 90% accuracy however, the training set size used by the authors has the VI. ACKNOWLEDGMENTS same magnitude as a processing corpora. Further metadata methods development turned to the Support Vector Machines We acknowledge C. Lee Giles for useful comments and (SVM) usage. Numerous experiments involving SVM have advice during initial brainstorming on the system architecture been accomplished within this task [11], [15], demonstrating as well as Fausto Giunchiglia for his advice and continuous high accuracy, recall and precision of the results obtained. support to the research project. However there were reported several problems connected with R EFERENCES flexibility of such systems that could be the case for preventing wide method application in the real-world systems. [1] A. V. Raan, “Scientometrics: State-of-the-art,” Scientometrics, vol. 38, no. 1, pp. 205–218, 2002. Natural Language Processing techniques belong to a dif- [2] M. L. E.C.M. Noyons, H.F. Moed, “Combining mapping and citation ferent but widely used approach for metadata extraction. analysis for evaluative bibliometric purposes: A bibliometric study,” Experiments using Part-of-Speech (PoS) tagging [16], [13] Journal of the American Society for Information Science, vol. 50, no. 2, pp. 115–131, 1999. have proven capable to provide sufficient accuracy,; however [3] R. Stecher, C. Niedere, P. Bouquet, and et al., “Enabling a knowledge large manually-labeled corpora is usually required for training. supply chain: From content resources to ontologies,” in Proc. of the. [4] A. Powell, “Guidelines for implementing dublin core in xml,” Dublin Core Metadata Initiative Recommendation, published at www.dublincore.org, 2003. [5] NISO, “Information retrieval (z39.50): Application service definition and protocol specification,” NISO Press, 2003. [6] N. Kiyavitskaya, N. Zeni, J. R. Cordy, L. Mich, and J. Mylopoulos, “Semi-automatic semantic annotations for next generation information systems,” in Proceedings of the 18th Conference on Advanced Infor- mation Systems Engineering, ser. Lecture Notes in Computer Science. Springer, 2006. [7] J. Cordy, “Txl a language for programming language tools and appli- cations,” in In Proceedings of 4th International Workshop on Language Descriptions, Tools and Applications, ser. Electronic Notes inTheoretical Computer Science, vol. 110. Elsevier Science, 2004. [8] A. Ivanyukovich and M. Marchese, “Unsupervised free-text processing and structuring in digital archives,” in Ist International Conference on Multidisciplinay Information Sciences and Technologies, 2006, accepted for publication. [9] S. Lawrence, C. L. Giles, and K. Bollacker, “Digital libraries and autonomous citation indexing,” IEEE Computer, vol. 32, no. 6, pp. 67– 71, 1999. [10] V. Petricek, I. J. Cox, H. Han, I. G. Councill, and C. L. Giles, “A comparison of on-line computer science citation databases,” in In Proceedings of the European Conference on Digital Libraries, ser. Lecture Notes in Computer Science. Springer, 2005. [11] H. Han, C. L. Giles, E. Manavoglu, H. Zha, Z. Zhang, and E. A. Fox, “Automatic document metadata extraction using support vector machines,” in Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries. IEEE Computer Society, 2003, pp. 37–48. [12] E. Agichtein and L. Gravano, “Snowball: Extracting relations from large plain-text collections,” in Proceedings of the 5th ACM International Conference on Digital Libraries. ACM, 2000. [13] S. Cucerzan and D. Yarowsky, “Language independent, minimally supervised induction of lexical probabilities,” in Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2000, pp. 270–277. [14] A. Takasu, “Bibliographic attribute extraction from erroneous references based on a statistical model,” in Proceedings of the 2003 Joint Confer- ence on Digital Libraries. IEEE, 2003. [15] Y. Hu, H. Li, Y. Cao, D. Meyerzon, and Q. Zheng, “Automatic extraction of titles from general documents using machine learning,” in Proceesings of the JCDL’05, 2005. [16] D. Besagni and A. Belaid, “Citation recognition for scientific publica- tions in digital libraries,” in First International Workshop on Document Image Analysis for Libraries (DIAL’04), 2004. [17] S. Ait-Mokhtar, J. Chanod, and C. Roux, “Robustness beyond shallow- ness: incremental deep parsing,” Natural Language Engineering, vol. 8, pp. 121–144, 2002. [18] P. Cimiano, S. Handschuh, and S. Staab, “Towards the self-annotating web,” in Proceedings of the WWW2004, 2004. [19] M.-Y. Day, T.-H. Tsai, C.-L. Sung, C.-W. Lee, S.-H. Wu, C.-S. Ong, and W.-L. Hsu, “A knowledge-based approach to citation extraction,” in Proceedings of Information Reuse and Integration Conference, IRI-2005, 2005.