<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying LLM to Library Metadata: Mapping Geography and Language in the Library of Congress Collection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hongyu Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raf Guns</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Dobreski</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim C. E. Engels</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Antwerp</institution>
          ,
          <addr-line>Antwerp</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Cambridge</institution>
          ,
          <addr-line>Cambridge</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Tennessee</institution>
          ,
          <addr-line>Knoxville</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <kwd-group>
        <kwd>Library Metadata</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Knowledge Geography</kwd>
        <kwd>Cataloging</kwd>
        <kwd>Cultural Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Extended Abstract</title>
      <p>Books are among the most enduring forms of cultural and scholarly communication, yet they remain largely
invisible in quantitative analyses of knowledge production. Unlike journal articles, which are easily captured
in citation databases, books are embedded in library catalog systems whose rich metadata have rarely been
exploited for large-scale research. In this paper, we demonstrate how large language models (LLMs) can enhance
the research potential of such metadata and, in doing so, provide new empirical insights into the geography and
language of global knowledge.</p>
      <p>We assemble and process more than 6.4 million non-fiction bibliographic records from the U.S. Library of
Congress (LC) catalog spanning 1970 to 2018. From the MARC-format metadata, we extract three analytical
dimensions: (i) subject geography derived from Library of Congress Subject Headings (LCSH), (ii) publication
locations, and (iii) languages of publication. A custom LLM-based normalization pipeline processes over 250,000
unique geographic strings, ranging from cities to historical regions, and maps them to contemporary countries
and territories using ISO 3166 identifiers. Validation against the official MARC 043 geographic area codes achieves
over 96% concordance, demonstrating the accuracy and scalability of LLM-assisted geographic classification for
bibliometric research [1].</p>
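      <p>As an illustration of the validation step, the following minimal sketch computes a concordance rate between LLM-assigned ISO 3166-1 alpha-2 codes and codes derived from MARC 043 geographic area codes. It is not the pipeline used in this study: the lookup table GAC_TO_ISO is a hypothetical, partial mapping, the function names are invented for this example, and concordance is assumed here to mean that the two sets of country codes for a record overlap.</p>
      <preformat>
```python
# Hypothetical, partial lookup from MARC 043 geographic area codes (GACs)
# to ISO 3166-1 alpha-2 codes. A real table would cover the full GAC list.
GAC_TO_ISO = {
    "n-us": "US",
    "e-fr": "FR",
    "a-ja": "JP",
    "a-cc": "CN",
    "s-bl": "BR",
}


def gac_to_iso(gac):
    """Map a MARC 043 code such as 'n-us---' to an ISO code, or None."""
    # GACs are padded with hyphens to seven characters; strip the padding.
    # Codes missing from the partial table (e.g. sub-national ones) give None.
    return GAC_TO_ISO.get(gac.rstrip("-"))


def concordance(records):
    """Share of comparable records where LLM and MARC 043 countries overlap.

    Each record is a (llm_iso_codes, marc_043_codes) pair. Records whose
    043 codes cannot be mapped are skipped rather than counted as errors
    (an assumption of this sketch).
    """
    agree = 0
    comparable = 0
    for llm_codes, gacs in records:
        marc_codes = {gac_to_iso(g) for g in gacs} - {None}
        if not marc_codes:
            continue
        comparable += 1
        if marc_codes.intersection(llm_codes):
            agree += 1
    return agree / comparable if comparable else float("nan")


# Invented toy records, for illustration only.
sample = [
    ({"US"}, ["n-us---"]),
    ({"FR"}, ["e-fr---"]),
    ({"JP"}, ["a-cc---"]),  # LLM and catalog disagree
    ({"BR"}, ["s------"]),  # continent-level code, unmapped here, skipped
]
print(concordance(sample))
```
      </preformat>
      <p>How unmapped or continent-level 043 codes are treated changes the denominator, so any reported concordance figure depends on that choice.</p>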
      <p>This integration of LLM-driven text normalization with curated library metadata highlights subtle but
systematic differences between machine-inferred and human-assigned geographic classifications. The
LLM-based approach can be instructed to align more closely with contemporary geopolitical boundaries and naming
conventions, whereas librarian-curated metadata may exhibit inconsistencies or temporal lag in reflecting political
or territorial shifts [2]. More broadly, this demonstrates how LLMs enable a new, adaptive form of knowledge
organization that complements traditional cataloging by dynamically updating representations of place, language,
and culture. Rather than replacing human curation, such systems can extend the bibliographic infrastructure of
libraries into a continuously evolving framework for mapping global knowledge.</p>
      <p>Using this LLM-enhanced dataset, we trace how the LC’s global representation has evolved over the past
five decades. Three empirical patterns emerge. First, the LC collection has undergone substantial geographic
diversification: the share of books about North America declined from more than 30 percent in the 1970s to under
20 percent by the 2010s, while East Asia, Latin America, and Eastern Europe expanded rapidly, mirroring the
globalization of publishing [3]. Second, geographic and linguistic dimensions are tightly aligned, with over 80
percent of books about a country published in its official language and 81 percent sharing the same publication and
subject country, indicating that catalog metadata accurately reflect the spatial organization of book knowledge.
Third, the linguistic composition of the collection has shifted markedly: English-language titles fell from over 50
percent to around one-third, while Chinese, Spanish, and other non-English languages rose steadily, signaling
the emergence of a more multilingual and globally distributed archive.</p>
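      <p>The decade-level shares reported above follow from a simple aggregation, sketched below under stated assumptions. The function name, the region labels, and the toy records are invented for this example and do not reproduce the study's data or results; each record is also assumed to carry a single subject region, which simplifies records that have several.</p>
      <preformat>
```python
from collections import Counter, defaultdict


def regional_shares_by_decade(records):
    """Return {decade: {region: share}} from (year, region) pairs."""
    counts = defaultdict(Counter)
    for year, region in records:
        decade = year // 10 * 10  # e.g. 1978 falls in the 1970 decade
        counts[decade][region] += 1
    shares = {}
    for decade, regions in counts.items():
        total = sum(regions.values())
        shares[decade] = {r: n / total for r, n in regions.items()}
    return shares


# Invented toy records, for illustration only.
toy = [
    (1972, "North America"),
    (1975, "North America"),
    (1978, "East Asia"),
    (1979, "Latin America"),
    (2011, "North America"),
    (2013, "East Asia"),
    (2015, "East Asia"),
    (2016, "Latin America"),
    (2018, "Eastern Europe"),
]
shares = regional_shares_by_decade(toy)
for decade in sorted(shares):
    print(decade, shares[decade])
```
      </preformat>
      <p>The same aggregation applies to language shares by substituting a language code for the region label.</p>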
      <p>Taken together, these results portray the Library of Congress not merely as a passive repository but as an
active infrastructure of global knowledge representation. Its evolving catalog mirrors the United States’ shifting
intellectual engagement with the world and the diffusion of publishing capacity beyond the Western core. More
broadly, our findings demonstrate how LLMs can transform long-standing bibliographic systems, originally
designed for human catalogers, into computational data sources for mapping global information flows.</p>
      <p>By linking geography, language, and publication metadata at scale, and by reconciling librarian-curated
and LLM-derived classifications, this study contributes a methodological advance in the AI-assisted science of
science. It shows how the combination of controlled vocabularies and generative models can illuminate hidden
cultural and geopolitical dynamics in the global production of knowledge, offering new pathways for research on
linguistic diversity, cultural equity, and the spatial organization of scholarship.</p>
      <p>Declaration on Generative AI
The authors used large language models (LLMs) to assist with grammar and style editing, as well as
text normalization within the research methodology. LLMs were employed to standardize geographic
entities and harmonize metadata extracted from the Library of Congress catalog. No figures were
generated using generative AI. All AI-assisted outputs were reviewed and verified by the authors, who
take full responsibility for the content of this publication.</p>
      <p>for book history, Book History 1 (1998) 11–31.</p>
      <p>Bloomsbury Publishing USA, 2015.</p>
      <p>of relations among cities and institutes, Journal of the American Society for Information Science</p>
      <p>†These authors contributed equally to this work.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>