=Paper=
{{Paper
|id=Vol-3220/paper2
|storemode=property
|title=A Blocking-Based Approach to Enhance Large-Scale Reference Linking
|pdfUrl=https://ceur-ws.org/Vol-3220/paper2.pdf
|volume=Vol-3220
|authors=Tarek Saier,Meng Luan,Michael Färber
|dblpUrl=https://dblp.org/rec/conf/jcdl/SaierL022
}}
==A Blocking-Based Approach to Enhance Large-Scale Reference Linking==
Tarek Saier, Meng Luan and Michael Färber

Karlsruhe Institute of Technology (KIT), Institute AIFB, Kaiserstr. 89, 76133 Karlsruhe, Germany

Abstract: Analyses and applications based on bibliographic references are of ever increasing importance. However, reference linking methods described in the literature are only able to link around half of the references in papers. To improve the quality of reference linking in large scholarly data sets, we propose a blocking-based reference linking approach that utilizes a rich set of reference fields (title, author, journal, year, etc.) and is independent of a target collection of paper records to be linked to. We evaluate our approach on a corpus of 300,000 references. Relative to the original data, we achieve a 90% increase in papers linked through references, a five-fold increase in bibliographic coupling, and a nine-fold increase in in-text citations covered. The newly established links are of high quality (85% F1). We conclude that our proposed approach demonstrates a way towards better quality scholarly data.

Keywords: entity resolution, references, blocking, bibliometrics, scholarly data, digital libraries

1. Introduction

Scholarly data is becoming increasingly important and with it its quality and coverage. Connections between publications in the form of literature references are of particular importance, as they are used as a basis for various analyses, decision making, and applications. Some examples are research output quantification [1], trend detection [2], summarization [3], and recommendation [4, 5]. However, reference linking methods¹ described in the literature are only able to link around half of the references contained in the original papers to the cited publications [6, 7].
This lack of coverage especially affects references to non-English publications [8], which are in general underrepresented in scholarly data [9, 10, 11, 12], along with publications in the humanities [13, 14]. We see the reason for this lack of linked references in two key shortcomings of current methods. First, references are linked using simple string similarity measures that often rely only on publications' title and author information (which is not always contained in references; see Figure 1). Second, references are exclusively linked to a target collection of paper records, usually a large metadata set like DBLP (https://dblp.org/) or OpenAlex (https://openalex.org/), or a set of IDs like DOIs or PMIDs. This means references to literature which is not contained in the target collection, as well as to non-source items [15], cannot be linked (see "?" markers in Figure 2).

ULITE2022: Understanding Literature References in Academic Full Text, JCDL 2022, Cologne, Germany, June 24, 2022
tarek.saier@kit.edu (T. Saier); lm19940625@163.com (M. Luan); michael.faerber@kit.edu (M. Färber)
ORCID: 0000-0001-5028-0109 (T. Saier); 0000-0001-5458-8645 (M. Färber)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
¹ We use "link[ing/ed] references" with respect to connections to cited papers rather than in-text citation markers.

[10] I. Bonalde et al., Phys. Rev. Lett. 85, 4775 (2000).
[25] Bonalde I, Yanoff B D, Salamon M B, Van Harlingen D J, Chia E M E, Mao Z Q and Maeno Y 2000 Phys. Rev. Lett. 85 4775
[4] Jaume, S.C. and Sykes, L.R., Pure and Applied Geophysics 155, 279-305.
Jaume, S.C. and L.R. Sykes, Evolving Towards a Critical Point: A Review of Accelerating Seismic Moment/Energy Release Prior to Large and Great Earthquakes, Pure Appl. Geophys., 155, 279, 1999.

Figure 1: Examples of challenging reference pairs from our evaluation that were successfully matched. Top: references from arXiv:cond-mat/0503317 (no title, first author only) and arXiv:cond-mat/0104493 (no title, all authors). Bottom: references from arXiv:cond-mat/0104341 (no title, full venue, page range, no year) and arXiv:physics/0504218 (with title, venue abbreviation, start page only, with year).

Figure 2: Schematic depiction of the use case. A corpus of full text papers, where some references are already linked to a target collection (blue), and some are not (orange, pink, green). At (1) we apply our blocking and matching approach to identify all references that point to the same publication. In doing so, we establish new links in the form of (2) bibliographic coupling and (3) links to the target collection.

Linking references can be seen as a task of entity resolution (ER) [16], which is concerned with identifying entities referring to the same object within or between large data sets. Because the task requires a one-to-one comparison between each of the involved entities, it is inherently of quadratic complexity. To make approaches scalable, entities are assigned into groups of likely matching candidates prior to comparison, a technique called blocking [17]. While blocking-based approaches are used in the domain of scholarly data to, for example, identify duplicate paper records [18, 19, 6] (where information such as abstracts is used) and authors [20], they are not utilized for bibliographic references.
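To make the blocking idea concrete, the following is a minimal, self-contained sketch (not the paper's implementation; the reference strings and helper names are made up for illustration): every record is assigned to one block per token it contains, and only records that share at least one block become candidate pairs, avoiding the quadratic all-pairs comparison.

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(records):
    """Assign each record index to one block per distinct token."""
    blocks = defaultdict(set)
    for idx, text in enumerate(records):
        for token in set(text.lower().split()):
            blocks[token].add(idx)
    return blocks

def candidate_pairs(blocks):
    """Collect the unique record pairs that share at least one block."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

refs = [
    "I. Bonalde et al., Phys. Rev. Lett. 85, 4775 (2000)",
    "Bonalde I et al 2000 Phys. Rev. Lett. 85 4775",
    "Jaume, S.C. and Sykes, L.R., Pure Appl. Geophys. 155, 279 (1999)",
]
pairs = candidate_pairs(token_blocking(refs))
# Records 0 and 1 share tokens such as "bonalde" and "4775", so only the
# pair (0, 1) is compared; brute force would have compared all 3 pairs.
```

Note how the two title-less Bonalde references still end up co-blocked via shared author and number tokens, which is exactly the schema-agnostic robustness argued for above.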
We therefore address both of the aforementioned problems with current reference linking approaches, (1) the use of simple matching methods based on title and authors, as well as (2) the reliance on a target collection of paper records, by proposing a blocking and matching process that (1) utilizes seven reference fields (title, author, journal, year, etc.) and (2) operates within the set of bibliographic references of a corpus, and is thereby independent of a target collection of papers (see marker "(1)" in Figure 2). We showcase the feasibility and benefits of our approach by implementing a pre-processing, blocking, and matching pipeline and evaluating it on a corpus containing 300,000 references. We show that relative to the original data, our approach yields a 90% increase in papers linked to the target collection, a five-fold increase in bibliographically coupled [21] papers (see marker "(2)" in Figure 2), and a nine-fold increase in in-text citation markers covered (with the "coverage" of in-text citation markers we refer to markers associated with linked references, relative to markers belonging to unlinked references). The new links are furthermore of high quality (85% F1). This paves the way towards higher quality scholarly data, especially regarding the coverage of so far underrepresented literature and non-source items. In summary, we make the following contributions.

• We propose a blocking-based approach for matching bibliographic references that is independent of a target collection of paper records.
• We perform a large-scale evaluation showing that our approach results in a manifold increase in high quality reference links.
• We make our data and code publicly available (see https://github.com/IllDepence/ulite2022).

2. Related Work

Blocking-based approaches have been used in the domain of scholarly data, though to the best of our knowledge not for bibliographic references. We therefore report on (1) exemplary uses of blocking in the scholarly domain for entities other than references, and (2) approaches to linking bibliographic references using methods other than blocking.

Simonini et al. [18] develop BLAST (Blocking with Loosely-Aware Schema Techniques), which adapts Locality-Sensitive Hashing. Among data sets from other domains, they also evaluate their approach on the task of linking 2,600 DBLP paper records to the ACM Digital Library (https://dl.acm.org/) and Google Scholar (https://scholar.google.com/). Sefid [19] proposes several models to match paper records utilizing the papers' title, header, and citation information. The models are evaluated in three scenarios matching 1,000 paper records from CiteSeerX [22] to IEEE, DBLP, and Web of Science. Lastly, Färber et al. [20] detect duplicates among 243 million author records in the Microsoft Academic Knowledge Graph [23] and evaluate their approach using ORCiD IDs.

Lo et al. [6] introduced the data set S2ORC, which contains 9.6 million open access papers and has recently seen extensive use in the area of scholarly document processing. The authors link references to papers within their data set using a heuristic similarity measure based on n-grams and the Jaccard similarity, which only uses the paper title. Using this method, 26 million out of 50 million references (52%) are successfully linked. The authors report that the low number is "due to large numbers of papers (mostly in the field of physics) for which the bibliography entries are formatted without paper titles." Saier et al. [7] introduce unarXive, a data set created from papers' LaTeX sources containing over 1 million publications. Bibliographic references in the data set are linked to the Microsoft Academic Graph [24, 25]. The linking procedure is based on string similarity of papers' titles and author information. With this procedure, 17 million out of 40 million references (42%) are successfully linked.
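The title-only linking used by these data sets can be approximated with a small sketch (an illustrative stand-in, not the original S2ORC or unarXive heuristic): Jaccard similarity over character n-grams of the title. It also makes the failure mode plain: a reference with no title yields no n-grams and can never be linked this way.

```python
def ngrams(text, n=3):
    """Set of lowercased character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def title_jaccard(a, b, n=3):
    """Jaccard similarity of the two titles' character n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

sim = title_jaccard(
    "Evolving Towards a Critical Point: A Review of Accelerating Seismic Moment",
    "Evolving towards a critical point: a review of accelerating seismic moment",
)
# Identical up to casing, so sim is 1.0.
no_title = title_jaccard("", "Some Title")
# A title-less reference scores 0.0 against everything.
```

This is the limitation motivating the multi-field approach proposed in this paper.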
Lastly, CiteSeerX [26, 22] is another large data set containing paper records. Similar to S2ORC, references are linked to paper records within the data set itself. In the case of CiteSeerX the linking is performed through a heuristic assignment based on title and author information. We are not aware of information on the percentage of references that are successfully linked in CiteSeerX.

3. Approach

Our approach consists of the following three steps: (1) pre-processing to convert references into a normalized, structured format, (2) blocking to allow us to process large amounts of references, and (3) matching. These steps are explained in more detail below.

Pre-processing. References as they appear in papers are hard to match for several reasons, such as the variety of citation styles, variants of author names, venue abbreviations, sparsity of information, and typing errors [27] (see Figure 1). To mitigate these issues, we pre-process references in three steps: first, we apply GROBID's [28] reference string parsing module (see https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocesscitation; GROBID was chosen according to the results of [29]), then we expand journal and conference abbreviations, and lastly all strings are lowercased and Unicode normalized. For the abbreviation expansion we use a mapping of 47.6k journal titles provided by JabRef (see https://github.com/JabRef/abbrv.jabref.org) and 2.6k conference titles crawled from various web sources. Following [30], we select seven reference fields for the blocking step: title, author, year, volume, journal, booktitle, and pages.

Blocking. Following [31], we build our blocking pipeline from components for (1) block building, (2) block cleaning, and (3) comparison cleaning. As shown in Figure 2, we use token blocking, block purging, and meta-blocking, respectively, for these steps. Token blocking is chosen for the block building step because it is schema-agnostic and therefore robust against the varying level of information contained in or missing from bibliographic references. In this step, references are assigned to blocks based on all tokens (i.e., words) contained in the identified and normalized reference fields. As a result, references at this point are associated with multiple blocks, which leads to a high level of redundancy.

Block purging [32] removes oversized blocks based on a comparison cardinality metric, which we determine heuristically and set to 0.01. Intuitively, the removed blocks originate from common tokens, meaning that matching reference strings within them are highly likely to also share smaller blocks. Purging therefore reduces the number of overall comparisons with minimal effect on the final result quality.

Meta-blocking [33], our comparison cleaning step, reduces unnecessary comparisons within blocks by generating a weighted graph of entities (references in our case) based on their shared blocks, removing edges based on a pruning scheme, and lastly creating a new block collection based on the reduced graph. For both the weighting and the pruning of edges, several schemes exist. In Section 4 we describe how we determined the most suitable combination of schemes for our use case. Here, we briefly mention the schemes involved. Available graph weighting schemes include the Common Blocks Scheme (CBS), the Enhanced Common Blocks Scheme (ECBS), the Aggregate Reciprocal Comparisons Scheme (ARCS), and the Jaccard Scheme (JS). For graph pruning, we consider Cardinality Node Pruning (CNP), which relies on cardinality to select the top edges for each node, as well as Weight Edge Pruning (WEP), which removes edges based on their assigned weight.

Matching. To determine which references within a block refer to the same publication, we utilize a weighted average of Jaccard similarities across our seven reference fields.
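A minimal sketch of such a matcher follows, using the field weights and match threshold reported in this paper. One caveat: the handling of missing fields (skipping fields absent from both references and renormalizing the weights) is our assumption for illustration; the paper does not spell this detail out.

```python
FIELD_WEIGHTS = {  # weights as reported in the paper
    "title": 8, "author": 6, "journal": 5, "booktitle": 5,
    "year": 3, "volume": 3, "pages": 2,
}
MATCH_THRESHOLD = 0.405  # threshold as reported in the paper

def jaccard(a, b):
    """Token-level Jaccard similarity of two field strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def reference_similarity(ref_a, ref_b):
    """Weighted average of per-field Jaccard similarities.

    Fields absent from both references are skipped and the remaining
    weights are renormalized (an assumption, see lead-in above).
    """
    score, total = 0.0, 0
    for field, weight in FIELD_WEIGHTS.items():
        va, vb = ref_a.get(field, ""), ref_b.get(field, "")
        if not va and not vb:
            continue
        score += weight * jaccard(va, vb)
        total += weight
    return score / total if total else 0.0

# Two title-less parsed references in the style of Figure 1 (made up).
a = {"author": "bonalde i yanoff b d salamon m b", "journal": "phys rev lett",
     "year": "2000", "volume": "85", "pages": "4775"}
b = {"author": "i bonalde et al", "journal": "phys rev lett",
     "year": "2000", "volume": "85", "pages": "4775"}
is_match = reference_similarity(a, b) >= MATCH_THRESHOLD
```

Despite the weak author overlap, the agreement on journal, year, volume, and pages pushes the pair over the threshold, which is the point of using fields beyond title and author.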
Based on [34] as well as preliminary experiments, we set the weights for title, author, journal, booktitle, year, volume, and pages to 8, 6, 5, 5, 3, 3, and 2 respectively, and set the threshold for a match to 0.405.

4. Evaluation

We use a large corpus of scholarly publications to perform two types of evaluations: (1) a large-scale evaluation utilizing the corpus' existing reference links as ground truth, and (2) a manual evaluation to also assess the correctness of newly created reference links. In the following, we describe the data used, evaluations performed, and results obtained.

Data. For our evaluation we use the data set unarXive [7]. We chose this data set over similar data sets such as S2ORC [6] because it not only contains papers' full text with annotated in-text citation markers, but also a dedicated database of all raw references in plain text. From unarXive we sample the 300,000 most recent references to conduct our evaluation. The 300,000 references originate from 9,917 papers from the disciplines of physics (7,347), mathematics (1,686), computer science (789), and other STEM fields (95). The publications cited through the references cover publication years from 1743 up to 2020. Four examples of references used in the evaluation are shown in Figure 1.

Large-Scale Evaluation. Our large-scale evaluation is performed in two steps. First, we determine the most suitable configuration of graph weighting and pruning schemes for our meta-blocking step, then we apply our pipeline to the evaluation corpus and determine the number of additionally linked entities.

Table 1: Performance of five graph weighting and graph pruning scheme combinations for meta-blocking. Metrics: RR = Reduction Ratio, PC = Pair Completeness, PQ = Pairs Quality. Weighting schemes: CBS = Common Blocks Scheme, ECBS = Enhanced Common Blocks Scheme, ARCS = Aggregate Reciprocal Comparisons Scheme, JS = Jaccard Scheme. Pruning schemes: CNP = Cardinality Node Pruning, WEP = Weight Edge Pruning.

{| class="wikitable"
! Weighting scheme !! Pruning scheme !! #Comparisons !! #Matches !! RR (%) !! PC (%) !! PQ (%)
|-
| CBS || CNP || 39,050 || 3,053 || 99.96 || 54.47 || 7.82
|-
| ECBS || CNP || 39,050 || 3,201 || 99.96 || 57.11 || 8.20
|-
| ARCS || CNP || 39,050 || 2,890 || 99.96 || 51.56 || 7.40
|-
| ARCS || WEP || 24,175 || 1,285 || 99.98 || 22.93 || 5.32
|-
| JS || WEP || 42,919 || 2,272 || 99.96 || 40.54 || 5.29
|}

Table 2: Number of linked papers, references, and in-text citations given in the original corpus and newly created through the application of our approach. Note that the combined entity counts are not simply the sum of the numbers above, because a single entity can be linked in both ways.

{| class="wikitable"
! !! !! #Papers !! #References !! #In-text Citations
|-
! rowspan="2" | Linked to target collection
| Given || 1,590 || 13,975 || 23,707
|-
| New || 1,443 || 2,442 || 7,824
|-
! rowspan="2" | Linked through bibliographic coupling
| Given || – || – || –
|-
| New || 8,895 || 53,940 || 219,630
|-
! rowspan="2" | Combined (linked in either way)
| Given || 1,590 || 13,975 || 23,707
|-
| New || 8,931 || 55,197 || 227,454
|}

To choose a graph weighting and pruning scheme, we use the 13,976 references in our corpus which are already linked to the target collection as ground truth. Following [33], we select five combinations of schemes to evaluate. The combinations are evaluated using the metrics pair completeness (PC), which expresses the ratio of detected matches with respect to all true matches, pairs quality (PQ), which estimates the portion of true matches within all executed comparisons in the block collection, and reduction ratio (RR), which measures the number of unnecessary comparisons saved through blocking. Table 1 shows the results of our evaluation. We achieve the best results using ECBS weighting and CNP pruning.
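The three metrics just defined can be stated compactly. The sketch below uses toy numbers, not the paper's data, and follows the common definitions from the blocking literature:

```python
def blocking_metrics(executed_pairs, true_pairs, n_entities):
    """Reduction ratio (RR), pair completeness (PC), pairs quality (PQ)."""
    executed, truth = set(executed_pairs), set(true_pairs)
    detected = executed & truth
    all_pairs = n_entities * (n_entities - 1) // 2  # brute-force baseline
    rr = 1 - len(executed) / all_pairs       # comparisons saved
    pc = len(detected) / len(truth) if truth else 1.0    # recall of matches
    pq = len(detected) / len(executed) if executed else 0.0  # precision of comparisons
    return rr, pc, pq

# Toy example: 6 entities, 2 true matches, 3 comparisons left after blocking.
rr, pc, pq = blocking_metrics(
    executed_pairs={(0, 1), (2, 3), (4, 5)},
    true_pairs={(0, 1), (1, 2)},
    n_entities=6,
)
# rr = 1 - 3/15 = 0.8; pc = 1/2; pq = 1/3
```

As in Table 1, a pipeline can have a very high RR while PC and PQ remain the quantities that actually trade off against each other.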
Accordingly, we apply our pipeline with this configuration to the full evaluation corpus of 300k references, where our approach performs 496,051 comparisons after blocking and identifies 71,826 matches. As shown earlier in Figure 2, we can use the matches identified by our pipeline to create two types of new links: first, new links to the target collection, and second, links between references created through bibliographic coupling. New links to the target collection are established whenever a reference with no existing link is matched to a reference with an existing link (see marker "(3)" in Figure 2). In cases where neither of the references in a match has an existing link, we create a bibliographic coupling (see marker "(2)" in Figure 2). In Table 2 we show on the level of papers, references, and in-text citations how many links were already given in our corpus and how many new links we are able to establish. Regarding links to the target collection, we are able to link 1,443 new papers (90.75% increase) through 2,442 references (17.47% increase), which are connected to 7,824 in-text citation markers (33.00% increase). As for bibliographic coupling, we connect 8,895 papers through 53,940 references connected to 219,630 in-text citation markers. Comparing the number of given links to the combined number of new links, we see a 90% increase in papers linked to the target collection, a five-fold increase in bibliographically coupled papers, and a nine-fold increase in in-text citation markers covered.

Manual Evaluation. To assess the quality of our newly linked references, we take a random sample of 500 reference comparisons from the matching procedure and manually verify whether our approach correctly labeled each pair as a match or non-match. This is done by inspecting both original reference strings (prior to pre-processing) and determining whether they refer to the same publication or not.
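The two ways of creating new links described in this section can be sketched as follows. The reference IDs are made up, and the single-pass propagation is an illustrative simplification (chains of matches would additionally require transitive closure, which this sketch does not perform):

```python
def derive_links(matches, existing_links):
    """Split matched reference pairs into new target-collection links and
    bibliographic couplings, mirroring markers (3) and (2) in Figure 2."""
    new_links, couplings = {}, []
    for a, b in matches:
        la, lb = existing_links.get(a), existing_links.get(b)
        if la and not lb:
            new_links[b] = la         # b inherits a's linked paper record
        elif lb and not la:
            new_links[a] = lb
        elif not la and not lb:
            couplings.append((a, b))  # both unlinked: bibliographic coupling
    return new_links, couplings

matches = [("r1", "r2"), ("r3", "r4")]
existing = {"r1": "paper_42"}  # r1 already linked to the target collection
new_links, couplings = derive_links(matches, existing)
# new_links == {"r2": "paper_42"}; couplings == [("r3", "r4")]
```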
Because in some disciplines such as physics it is common to see references without a title given, this process involves looking up and verifying publications' details online (for further details see https://github.com/IllDepence/ulite2022/tree/master/5_manual_evaluation). Examples of two reference pairs are shown in Figure 1. Comparing our predicted matches with the manually established ground truth, we measure a precision of 93.20% and a recall of 79.34%. Accordingly, the F1-score is 85.71%. This shows us that our newly established links are of good quality, suggesting our approach facilitates the creation of more accurate scholarly data and, accordingly, higher quality analyses and downstream applications based on scholarly data sets.

5. Discussion and Future Work

To improve the quality of reference linking in large scholarly data sets, we proposed a blocking-based reference linking approach that is independent of a target collection of paper records. In a large-scale evaluation, we first determined the most suitable meta-blocking scheme for our particular application case. Subsequently applying our approach to a corpus of 300,000 references, we saw a manifold increase in linked papers, references, and in-text citation markers. The newly established links are of high precision and have a high recall, which we confirmed through a manual evaluation on a sample of our results. This demonstrates the benefits and quality of our approach. Key limitations of the work presented are (1) the size and discipline coverage of the evaluation corpus, (2) the usage of a comparatively basic blocking technique, and (3) the lack of a thorough evaluation of time performance.
In the future we want to address these points by expanding our work through using more advanced blocking methods such as progressive blocking [35, 36], using larger evaluation corpora such as the whole unarXive data set, including data from more diverse disciplines such as the humanities, and evaluating the time performance of our approach. Because references in our evaluation corpus are linked to in-text citation markers, we furthermore plan to explore application scenarios utilizing the paper full texts.

Author Contributions

Tarek Saier: Conceptualization, Data curation (support), Formal analysis, Investigation (support), Methodology (support), Software (final evaluation), Visualization, Supervision, Writing – original draft (lead), Writing – review & editing. Meng Luan: Data curation, Formal analysis, Investigation, Methodology, Software, Writing – original draft (support). Michael Färber: Supervision, Writing – review & editing.

References

[1] J. E. Hirsch, An index to quantify an individual's scientific research output, Proceedings of the National Academy of Sciences 102 (2005) 16569–16572.
[2] C. Chen, CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature, Journal of the American Society for Information Science and Technology 57 (2006) 359–377. doi:10.1002/asi.20317.
[3] A. Elkiss, S. Shen, A. Fader, G. Erkan, D. States, D. Radev, Blind men and elephants: What do citation summaries tell us about a research article?, Journal of the American Society for Information Science and Technology 59 (2008) 51–62.
[4] S. Ma, C. Zhang, X. Liu, A review of citation recommendation: from textual content to enriched context, Scientometrics 122 (2020) 1445–1472.
[5] M. Färber, A. Jatowt, Citation recommendation: approaches and datasets, International Journal on Digital Libraries 21 (2020) 375–405. doi:10.1007/s00799-020-00288-2.
[6] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. Weld, S2ORC: The Semantic Scholar Open Research Corpus, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, pp. 4969–4983.
[7] T. Saier, M. Färber, unarXive: a large scholarly data set with publications' full-text, annotated in-text citations, and links to metadata, Scientometrics (2020). doi:10.1007/s11192-020-03382-z.
[8] T. Saier, M. Färber, T. Tsereteli, Cross-Lingual Citations in English Papers: A Large-Scale Analysis of Prevalence, Formation, and Ramifications, International Journal on Digital Libraries (2021). doi:10.1007/s00799-021-00312-z.
[9] M.-A. Vera-Baceta, M. Thelwall, K. Kousha, Web of Science and Scopus language coverage, Scientometrics 121 (2019) 1803–1813.
[10] X. Liu, X. Chen, CJK Languages or English: Languages Used by Academic Journals in China, Japan, and Korea, Journal of Scholarly Publishing 50 (2019) 201–214.
[11] H. F. Moed, V. Markusova, M. Akoev, Trends in Russian research output indexed in Scopus and Web of Science, Scientometrics 116 (2018) 1153–1180.
[12] O. Moskaleva, M. Akoev, Non-English language publications in Citation Indexes - quantity and quality, in: Proceedings of the 17th International Conference on Scientometrics & Informetrics, volume 1, Edizioni Efesto, Italy, 2019, pp. 35–46.
[13] G. Colavizza, M. Romanello, Citation Mining of Humanities Journals: The Progress to Date and the Challenges Ahead, Journal of European Periodical Studies 4 (2019) 36–53.
[14] C. Kellsey, J. E. Knievel, Global English in the humanities? A longitudinal citation study of foreign-language use by humanities scholars, College & Research Libraries 65 (2004) 194–204.
[15] P.-S. Chi, Which role do non-source items play in the social sciences? A case study in political science in Germany, Scientometrics 101 (2014) 1195–1213. doi:10.1007/s11192-014-1433-1.
[16] V. Christophides, V. Efthymiou, K. Stefanidis, Entity Resolution in the Web of Data, Synthesis Lectures on the Semantic Web: Theory and Technology 5 (2015) 1–122. doi:10.2200/S00655ED1V01Y201507WBE013.
[17] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and Filtering Techniques for Entity Resolution: A Survey, ACM Computing Surveys 53 (2020) 31:1–31:42. doi:10.1145/3377455.
[18] G. Simonini, S. Bergamaschi, H. V. Jagadish, BLAST: a loosely schema-aware meta-blocking approach for entity resolution, Proc. VLDB Endow. 9 (2016) 1173–1184. doi:10.14778/2994509.2994533.
[19] A. Sefid, Record Linkage Between CiteSeerX and Scholarly Big Datasets, Master's thesis, The Pennsylvania State University, 2019.
[20] M. Färber, L. Ao, The Microsoft Academic Knowledge Graph Enhanced: Author Name Disambiguation, Publication Classification, and Embeddings, Quantitative Science Studies 3 (2022) 51–98. doi:10.1162/qss_a_00183.
[21] K. W. Boyack, R. Klavans, Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately?, Journal of the American Society for Information Science and Technology 61 (2010) 2389–2404. doi:10.1002/asi.21419.
[22] J. Wu, K. Kim, C. L. Giles, CiteSeerX: 20 Years of Service to Scholarly Big Data, in: Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR '19, 2019. doi:10.1145/3359115.3359119.
[23] M. Färber, The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data, in: Proceedings of the 18th International Semantic Web Conference, ISWC '19, 2019, pp. 113–129. doi:10.1007/978-3-030-30796-7_8.
[24] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, K. Wang, An Overview of Microsoft Academic Service (MAS) and Applications, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion, ACM, 2015, pp. 243–246. doi:10.1145/2740908.2742839.
[25] K. Wang, Z. Shen, C. Huang, C.-H. Wu, D. Eide, Y. Dong, J. Qian, A. Kanakia, A. Chen, R. Rogahn, A Review of Microsoft Academic Services for Science of Science Studies, Frontiers in Big Data 2 (2019) 45. doi:10.3389/fdata.2019.00045.
[26] J. Wu, K. M. Williams, H.-H. Chen, M. Khabsa, C. Caragea, S. Tuarob, A. G. Ororbia, D. Jordan, P. Mitra, C. L. Giles, CiteSeerX: AI in a Digital Library Search Engine, AI Magazine 36 (2015) 35–48. doi:10.1609/aimag.v36i3.2601.
[27] P. Christen, Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Springer Science & Business Media, 2012. doi:10.1007/978-3-642-31164-2.
[28] P. Lopez, GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications, in: Research and Advanced Technology for Digital Libraries, 2009, pp. 473–474.
[29] D. Tkaczyk, A. Collins, P. Sheridan, J. Beel, Machine Learning vs. Rules and Out-of-the-Box vs. Retrained: An Evaluation of Open-Source Bibliographic Reference and Citation Parsers, in: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, JCDL '18, ACM, New York, NY, USA, 2018, pp. 99–108. doi:10.1145/3197026.3197048.
[30] H.-K. Koo, T. Kim, H.-W. Chun, D. Seo, H. Jung, S. Lee, Effects of unpopular citation fields in citation matching performance, in: 2011 International Conference on Information Science and Applications, 2011, pp. 1–7. doi:10.1109/ICISA.2011.5772372.
[31] G. Papadakis, J. Svirsky, A. Gal, T. Palpanas, Comparative analysis of approximate blocking techniques for entity resolution, Proc. VLDB Endow. 9 (2016) 684–695. doi:10.14778/2947618.2947624.
[32] G. Papadakis, E. Ioannou, C. Niederee, P. Fankhauser, Efficient entity resolution for large heterogeneous information spaces, 2011, pp. 535–544. doi:10.1145/1935826.1935903.
[33] G. Papadakis, G. Koutrika, T. Palpanas, W. Nejdl, Meta-blocking: Taking entity resolution to the next level, IEEE Transactions on Knowledge and Data Engineering 26 (2014). doi:10.1109/TKDE.2013.54.
[34] Y. Foufoulas, L. Stamatogiannakis, H. Dimitropoulos, Y. Ioannidis, High-pass text filtering for citation matching, in: Research and Advanced Technology for Digital Libraries, Springer International Publishing, Cham, 2017, pp. 355–366.
[35] G. Simonini, G. Papadakis, T. Palpanas, S. Bergamaschi, Schema-Agnostic Progressive Entity Resolution, IEEE Transactions on Knowledge and Data Engineering 31 (2019) 1208–1221. doi:10.1109/TKDE.2018.2852763.
[36] S. Galhotra, D. Firmani, B. Saha, D. Srivastava, Efficient and effective ER with progressive blocking, The VLDB Journal 30 (2021) 537–557. doi:10.1007/s00778-021-00656-7.