<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Construction of a Patent Term Thesaurus with Fine-Tuned ChatGPT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hidetsugu Nanba</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kohei Iwakuma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Satoshi Fukuda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chuo University</institution>
          ,
          <addr-line>1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <fpage>84</fpage>
      <lpage>91</lpage>
      <abstract>
<p>Technical terms in patent documents are expressed with highly variable, domain-specific language, hindering cross-lingual prior-art search and knowledge discovery. Existing automatic thesaurus construction pipelines either rely on handcrafted patterns, which suffer from low recall, or on graph-augmented representation learning, which is accurate but complex and largely monolingual. We present a lightweight three-stage framework that: (1) filters candidate term pairs with off-the-shelf embeddings, (2) assigns fine-grained semantic relations via a ChatGPT-4o model fine-tuned on 36k English patent pairs, and (3) enforces cross-lingual consistency through fixed-expression hypernym seeds automatically aligned between Japanese and English. The final output is written directly into an incrementally updateable multilingual thesaurus graph. On the Google Patent Phrase Similarity Dataset, our fine-tuned LLM attains 0.762 Pearson / 0.738 Spearman, outperforming strong baselines (SBERT, Patent-BERT) and the recent graph-based RA-Sim model by up to 0.14 correlation points.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Term Relation Extraction</kwd>
        <kwd>Thesaurus Construction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This paper makes the following contributions:
• It presents a workflow that organises the relations predicted by the LLM into a graph whose nodes are terms and edges are relations, and then expands this graph recursively to build an automatically updatable multilingual patent thesaurus.
• It introduces an evaluation procedure that combines pattern-based hypernym candidates extracted independently from Japanese and English patents with their translation alignments, enabling the community to verify whether the LLM’s predictions remain consistent across languages.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Automatic prediction of semantic relations—synonymy, hypernymy, meronymy, and the like—between
technical terms has long underpinned knowledge acquisition and high-recall retrieval. Historical
approaches fall into three broad families: symbolic pattern rules, distributional or embedding methods,
and, most recently, large language models (LLMs). Below we survey their evolution in chronological
order, emphasising patent-specific work and highlighting how our study differs.</p>
      <p>
        Early research relied on explicit lexico-syntactic patterns. Hearst’s seminal paper introduced templates
such as “X is a kind of Y ” to harvest thousands of hypernym–hyponym pairs at negligible cost [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The
same idea was later applied to Japanese patent corpora: Nanba et al. mined the pattern “A nado no B” (B
such as A), aligned the resulting pairs with English equivalents, and built a bilingual thesaurus with 78%
F1 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Building on this, their subsequent study translated scholarly terms into patent terminology by
combining citation analysis with an automatically constructed thesaurus, significantly broadening the
candidate space [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Their scope, however, is restricted to hypernym–hyponym relations only, whereas
the present study predicts a full spectrum of relations—including synonymy, meronymy, and graded
similarity—across languages. Symbolic methods moreover demand handcrafted patterns for every
language and domain; even in English, Roller et al. revisited Hearst rules with modern corpora to
boost accuracy, yet still faced recall limits when wording drifted from canonical templates [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Patents
exacerbate this problem: identical concepts are phrased idiosyncratically (“soccer ball” vs. “spherical
recreational device”), so surface patterns alone capture only a fraction of true relations. A broader survey
of how rule-based and other NLP techniques transfer—or fail to transfer—between patent sub-genres
is given by Andersson et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Complementary work by Judea et al. shows that figure references
themselves can be harvested as symbolic cues, yielding fully unsupervised, high-quality training data
for patent terminology extraction [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Distributional approaches learn continuous vectors from large corpora. Word2Vec [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and GloVe [<xref ref-type="bibr" rid="ref10">10</xref>]
established that words with similar contexts occupy nearby positions in an embedding space. Jana et al.
projected a distributional thesaurus into such a space and achieved strong co-hyponym detection by
clustering context-similar terms [<xref ref-type="bibr" rid="ref11">11</xref>]. However, plain similarity cannot distinguish relation type (synonymy
versus hypernymy). Subsequent work trained classifiers or added constraints; Liu et al. prompted
BERT with masked templates (“X is a type of __”) to recover hypernyms more robustly [<xref ref-type="bibr" rid="ref12">12</xref>].
Contextual models improved further with Transformer pre-training: BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and its Siamese variant
Sentence-BERT (SBERT) [<xref ref-type="bibr" rid="ref13">13</xref>] achieved state-of-the-art semantic similarity. Yet domain adaptation
proved essential—Patent-BERT, trained on claim corpora, vastly outperformed general BERT on patent
relation benchmarks.
      </p>
      <p>The advent of LLMs enabled direct reasoning over relations. Models such as ChatGPT-4 store vast
world knowledge and can generate definitions, synonyms, or hypernyms with minimal prompting.
Recent reports show ChatGPT-4 successfully deriving taxonomic links for multilingual cultural terms,
indicating latent cross-lingual competence unavailable to earlier systems. In the patent realm, Peng
and Yang combined a contextual encoder with a citation-derived phrase graph; their self-supervised
method captured global evidence beyond local context and raised similarity correlation by seven points
[<xref ref-type="bibr" rid="ref14">14</xref>]. Such hybrids improve accuracy but demand heavy pipelines (citation crawling, graph learning)
and remain monolingual.</p>
      <p>
        Cross-domain evaluation has been invigorated by resources tailored to patents. The Google Patent
Phrase Similarity Dataset supplies 50k phrase pairs with graded similarity and relation labels [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ];
Kaggle competitions around it confirmed SBERT-style models as strongest baselines and revealed the
benefit of patent-specific pre-training. Yet most entries handled English only and did not automate
thesaurus induction.
      </p>
      <p>Our study departs from prior art in three ways. First, we retain a lightweight embedding filter but
rely on a minimally fine-tuned ChatGPT-4o to infer relations, avoiding bespoke citation graphs or rule
sets. Second, we enforce cross-lingual consistency via pattern-harvested bilingual seed pairs, allowing
the same model to populate a thesaurus in Japanese and English without extra translation resources.
Third, the LLM’s output is written directly into an incrementally expandable graph, turning relation
inference into immediate thesaurus construction rather than a separate post-processing step. In doing
so, we address the lingering gaps of multilingual coverage, domain knowledge acquisition, and pipeline
complexity that earlier approaches left open.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>Our framework builds a multilingual patent thesaurus through two alternative relation–inference
strategies plus a multilingual verification step: (i) an embedding-based similarity inference, (ii) an
LLM-based explicit-label inference, and (iii) pattern-driven multilingual enrichment. Stages
(i) and (ii) pursue the same objective—predicting the semantic relation of a term pair—but differ in
the signal they exploit: dense vectors vs. generative reasoning. Stage (iii) then enforces cross-lingual
consistency and incrementally expands the thesaurus graph.</p>
      <sec id="sec-3-1">
        <title>3.1. Embedding-Based Similarity Inference</title>
        <p>Given a term pair (s, t), we obtain vectors e_s, e_t ∈ ℝ^d from either OpenAI Embeddings (d = 1536) or multilingual-e5-large (d = 1024). Their cosine similarity,</p>
        <p>sim(s, t) = (e_s · e_t) / (‖e_s‖ ‖e_t‖),</p>
        <p>serves as a proxy score for semantic relatedness. Pairs whose score exceeds a threshold θ (0.35 for OpenAI, 0.30 for e5) are tentatively regarded as related (synonym or taxonomic) and forwarded to the multilingual verification in Stage (iii). This embedding view offers a fast, language-agnostic approximation that requires no fine-tuning.</p>
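        <p>A minimal sketch of this stage is given below, assuming the multilingual-e5-large model accessed through the sentence-transformers library; the checkpoint identifier is the public Hugging Face name and is used here as a plausible stand-in, while the thresholds follow the values stated above.</p>
        <preformat>
# Stage (i) sketch: embedding-based similarity filter.
# Assumes the sentence-transformers library; a production e5 setup would also
# add the model's recommended "query: " prefixes before encoding.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
THRESHOLD = 0.30  # 0.35 for OpenAI embeddings, 0.30 for e5 (Section 3.1)

def is_related(term_a: str, term_b: str) -> bool:
    """True if the cosine similarity of the two term vectors exceeds the threshold."""
    e_a, e_b = model.encode([term_a, term_b])
    cos = float(np.dot(e_a, e_b) / (np.linalg.norm(e_a) * np.linalg.norm(e_b)))
    return cos > THRESHOLD

print(is_related("photocopier", "reading machine"))
        </preformat>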
      </sec>
      <sec id="sec-3-2">
        <title>3.2. LLM-Based Explicit Relation Inference</title>
        <p>Alternatively, the same pair can be passed to ChatGPT-4o mini, fine-tuned on the Google Patent Phrase
Similarity Dataset. The prompt asks:</p>
        <p>Based on ’reading machine’, what is the relationship of ’photocopier’? Please choose the most
appropriate one from the following:
1: ’Not related.’
2: ’Other high level domain match.’
3: ’Holonym (a whole of).’
4: ’Meronym (a part of).’
5: ’Antonym.’
6: ’Structural match.’
7: ’Hypernym (narrow-broad match).’
8: ’Hyponym (broad-narrow match).’
9: ’Highly related.’
10: ’Very highly related.’
The model chooses a single label from Table 1; we map it to a numerical score in
{1.00, 0.75, 0.50, 0.25, 0.00}. Compared with Stage (i), the LLM returns an explicit relation
type (e.g., Hyponym, Meronym) rather than a scalar similarity.</p>
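        <p>A hedged sketch of this stage follows, using the OpenAI chat completions API. The fine-tune identifier is a hypothetical placeholder, the prompt is abridged, and only three entries of the label-to-score mapping are shown; the full mapping is defined by Table 1.</p>
        <preformat>
# Stage (ii) sketch: explicit relation labelling with a fine-tuned chat model.
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini:example::patent-relations"  # hypothetical fine-tune ID

# Illustrative fragment of the label-to-score mapping; the paper's Table 1
# maps all ten labels onto {1.00, 0.75, 0.50, 0.25, 0.00}.
SCORE = {"Not related.": 0.00, "Highly related.": 0.75, "Very highly related.": 1.00}

def relation(anchor: str, target: str) -> str:
    """Ask the fine-tuned model for the relation label of a term pair."""
    prompt = (
        f"Based on '{anchor}', what is the relationship of '{target}'? "
        "Please choose the most appropriate one from the following:\n"
        "1: 'Not related.' ... 10: 'Very highly related.'"  # list abridged
    )
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

print(relation("reading machine", "photocopier"))
        </preformat>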
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Pattern-Driven Multilingual Enrichment</title>
        <p>1. Seed extraction:
• Japanese: phrases matching “A nado no B”
• English: phrases matching “B such as A”
These patterns produce provisional hyponym (A) / hypernym (B) pairs (a minimal extraction sketch appears at the end of this subsection).
2. Translation alignment: English pairs are machine-translated into Japanese using ChatGPT and intersected with the Japanese set; the intersection yields high-confidence bilingual pairs.
3. Cross-lingual verification: each pair is checked by either Stage (i) or (ii); only pairs whose Japanese and English predictions agree are accepted.
4. Thesaurus graph update: accepted pairs become edges, labelled with the relation type, between term nodes. The graph updates automatically as new pairs arrive.</p>
        <p>By offering two complementary inference routes—fast embedding similarity or explicit LLM labelling—and a verification layer that fuses them across languages, our method achieves multilingual coverage with minimal fine-tuning while avoiding complex citation graphs or handcrafted rules. Experimental details follow in Section 4.</p>
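        <p>The seed-extraction step (Step 1) can be sketched as follows. The single-token regular expressions are a deliberate simplification of our pipeline: a real extractor would chunk noun phrases to capture multiword terms.</p>
        <preformat>
# Step 1 sketch: harvesting provisional (hyponym, hypernym) seed pairs.
import re

# Single-token captures only; multiword terms need noun-phrase chunking.
EN_PATTERN = re.compile(r"(\w[\w-]*)\s+such\s+as\s+(\w[\w-]*)")
JA_PATTERN = re.compile(r"([^\s、。]+)などの([^\s、。]+)")  # "A nado no B"

def en_seeds(sentence):
    """Yield (hyponym, hypernym) pairs from 'B such as A'."""
    for m in EN_PATTERN.finditer(sentence):
        hypernym, hyponym = m.group(1), m.group(2)
        yield hyponym, hypernym

print(list(en_seeds("polymers such as polyethylene are widely used")))
# -> [('polyethylene', 'polymers')]
        </preformat>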
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>Datasets. For the English task we adopt the Google Patent Phrase Similarity Dataset, using 36,473 pairs for training and 9,232 for validation and testing.</p>
        <p>
          Alternatives.
• Embedding models: Word2Vec, GloVe, BERT, SBERT, Patent-BERT (baselines reported by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]), OpenAI Embeddings (text-embedding-3-large), and multilingual-e5-large.
• Graph + encoder: the phrase-graph embeddings released with RA-Sim (a baseline reported by [<xref ref-type="bibr" rid="ref14">14</xref>]).
• LLMs: ChatGPT-4o and ChatGPT-4o mini in their pretrained form, plus versions fine-tuned on the English training split.
        </p>
        <p>Metrics. For English we report Pearson and Spearman correlation between predicted similarity scores and gold scores.</p>
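        <p>As a small usage sketch, both metrics can be computed with SciPy; the score lists below are illustrative placeholders, not actual system output.</p>
        <preformat>
# Evaluation sketch: Pearson and Spearman correlation (Section 4.1).
from scipy.stats import pearsonr, spearmanr

gold = [1.00, 0.50, 0.25, 0.00, 0.75]  # gold similarity scores (placeholder)
pred = [0.75, 0.50, 0.25, 0.25, 1.00]  # predicted scores (placeholder)

print(f"Pearson:  {pearsonr(gold, pred)[0]:.3f}")
print(f"Spearman: {spearmanr(gold, pred)[0]:.3f}")
        </preformat>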
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>The results are shown in Table 2. The fine-tuned ChatGPT-4o attains the strongest correlation (Pearson
0.762), outperforming the graph-augmented RA-Sim by 0.14 Pearson / 0.09 Spearman.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Discussion</title>
        <p>To verify the effectiveness of fine-tuning, we compared similarity scores before and after adaptation. Table 3 shows that scores improved for 42% of pairs with ChatGPT-4o and 52% with ChatGPT-4o mini, while only 10% deteriorated. The overall distribution shifted toward values closer to the gold standard, indicating that fine-tuning successfully supplements the model's domain knowledge and yields more accurate similarity estimates.</p>
        <p>Because the LLM classifies each pair into ten semantic relations, we can compute precision and recall for every class. Tables 5a and 5b list the fine-tuned ChatGPT-4o and ChatGPT-4o mini results, respectively. Both models excel at Not related, Antonym, and the high-similarity classes, while Holonym, Meronym, and Structural match remain challenging, mainly because of their scarcity in the training data. We therefore target these weaker relations when constructing the multilingual thesaurus with the method proposed in Section 3.3.</p>
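        <p>The per-class figures can be reproduced with a standard classification report; the label lists below are illustrative placeholders for the ten-way gold and predicted relations.</p>
        <preformat>
# Per-class precision/recall sketch (Section 4.3), via scikit-learn.
from sklearn.metrics import classification_report

y_true = ["Not related", "Hyponym", "Antonym", "Meronym", "Hyponym"]
y_pred = ["Not related", "Hyponym", "Antonym", "Hyponym", "Hyponym"]
print(classification_report(y_true, y_pred, zero_division=0))
        </preformat>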
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Automatic Construction of a Multilingual Thesaurus Using</title>
    </sec>
    <sec id="sec-6">
      <title>Cross-Lingual Verification</title>
      <p>We automatically construct a multilingual thesaurus from the full text of Japanese and US patents
published between 1993 and 2023. Our main objective is to extract hypernym-hyponym relationships,
but we also extract other relationships in the process. The procedure is described below.
1. Using the expressions “A nado no B” (Japanese) and “B such as A” (English), we extracted 613,251
Japanese and 518,166 English candidate pairs and kept 42,784 bilingual pairs after translation
alignment using ChatGPT.
2. ChatGPT-4o mini (fine-tuned) predicted relations for both languages; only pairs with matching
labels were retained (21,673 pairs).</p>
      <p>In Step 2 we used the fine-tuned ChatGPT-4o mini: its accuracy is comparable to that of ChatGPT-4o, the best-performing model in Table 2, while processing this volume of data with the larger model would be prohibitively expensive.</p>
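      <p>A compact sketch of Step 2 and the subsequent thesaurus-graph update is shown below: bilingual pairs are kept only when the Japanese and English relation predictions agree, and accepted pairs become labelled edges. The predict_relation callable is a stand-in for the fine-tuned ChatGPT-4o mini classifier, and the graph library choice is an assumption.</p>
      <preformat>
# Sketch of Step 2 (cross-lingual agreement) and the thesaurus-graph update.
import networkx as nx

def build_thesaurus(bilingual_pairs, predict_relation):
    """bilingual_pairs: iterable of ((en_hypo, en_hyper), (ja_hypo, ja_hyper))."""
    g = nx.DiGraph()
    for (en_a, en_b), (ja_a, ja_b) in bilingual_pairs:
        label_en = predict_relation(en_b, en_a)  # e.g. "Hyponym"
        label_ja = predict_relation(ja_b, ja_a)
        if label_en == label_ja:                 # keep only agreeing pairs
            g.add_edge(en_a, en_b, relation=label_en, translation=(ja_a, ja_b))
    return g
      </preformat>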
      <p>Tables 6a and 6b show the distribution of labels obtained by classifying the top and bottom candidates
in English and Japanese from Step 1 using ChatGPT-4o mini (fine-tuned). Additionally, Table 6 shows
the distribution of labels for the results where English and Japanese agree.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>We introduced a three-stage pipeline that combines a lightweight embedding filter, a minimally fine-tuned ChatGPT-4o, and pattern-driven cross-lingual verification to build a continuously expandable multilingual patent thesaurus. Experiments on the Google Patent Phrase Similarity Dataset demonstrated that the proposed LLM surpasses both embedding baselines and the recent graph-augmented RA-Sim model (Pearson 0.762 vs. 0.622). On 42,784 automatically aligned Japanese-English hypernym pairs, the pattern + LLM strategy achieved 97% accuracy.</p>
      <p>The framework requires no citation crawling, no external knowledge base, and no language-specific
rules beyond a handful of fixed expressions, yet delivers state-of-the-art accuracy while remaining fully
incremental. These traits make it attractive for industry settings where frequent thesaurus updates and
multilingual coverage are essential.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <sec id="sec-8-1">
        <title>The author(s) have not employed any Generative AI tools.</title>
        <p>Conference on Computational Linguistics: Technical Papers, Dublin City University and the
Association for Computational Linguistics, Dublin, Ireland, 2014, pp. 290–300.
[10] J. Pennington, R. Socher, C. Manning, Glove: Global vectors for word representation, in:
Proceedings of EMNLP 2014, 2014, pp. 1532–1543.
[11] A. Jana, N. R. Varimalla, P. Goyal, Using distributional thesaurus embedding for co-hyponymy
detection, in: Proceedings of LREC 2020, 2020, pp. 5766–5771.
[12] C. Liu, T. Cohn, L. Frermann, Seeking clozure: Robust hypernym extraction from bert with
anchored prompts, in: Proceedings of *SEM 2023, 2023, pp. 193–206.
[13] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in:</p>
        <p>Proceedings of EMNLP–IJCNLP 2019, 2019, pp. 3982–3992.
[14] Z. Peng, Y. Yang, Connecting the dots: Inferring patent phrase similarity with retrieved phrase
graphs, in: Proceedings of NAACL 2024, 2024, pp. 1877–1890.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <article-title>Automatic acquisition of hyponyms from large text corpora</article-title>
          ,
          <source>in: Proceedings of COLING '92</source>
          ,
          <year>1992</year>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>545</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of NAACL 2019</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Aslanyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Wetherbee</surname>
          </string-name>
          ,
          <article-title>Patents phrase to phrase semantic matching dataset</article-title>
          ,
          <source>arXiv preprint arXiv:2208.01171</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nanba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mayumi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Takezawa</surname>
          </string-name>
          ,
          <article-title>Automatic construction of a bilingual thesaurus using citation analysis</article-title>
          ,
          <source>in: Proceedings of the PaIR'11 Workshop</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Nanba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kamaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Takezawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Okumura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shinmori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tanigawa</surname>
          </string-name>
          ,
          <article-title>Automatic translation of scholarly terms into patent terms</article-title>
          ,
          <source>in: Proceedings of the 2nd International Workshop on Patent Information Retrieval (PaIR '09)</source>
          , Association for Computing Machinery, Hong Kong, China,
          <year>2009</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Roller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <article-title>Hearst patterns revisited: Automatic hypernym detection from large text corpora</article-title>
          ,
          <source>in: Proceedings of ACL 2018</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>358</fpage>
          -
          <lpage>363</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Andersson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rauber</surname>
          </string-name>
          ,
          <article-title>The portability of three types of text mining techniques into the patent text genre</article-title>
          , in: Mihai Lupu, Katja Mayer, John Tait, Anthony J.
          <string-name>
            <surname>Trippe</surname>
          </string-name>
          (Eds.),
          <source>Current Challenges in Patent Information Retrieval</source>
          , volume
          <volume>37</volume>
          <source>of The Information Retrieval Series</source>
          , Springer, Berlin / Heidelberg,
          <year>2017</year>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>280</lpage>
          . doi:10.1007/978-3-662-53817-3_9.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Judea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brügmann</surname>
          </string-name>
          ,
          <article-title>Unsupervised training set generation for automatic acquisition of technical terminology in patents</article-title>
          ,
          <source>in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers</source>
          , Dublin City University and the Association for Computational Linguistics, Dublin, Ireland,
          <year>2014</year>
          , pp.
          <fpage>290</fpage>
          -
          <lpage>300</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>GloVe: Global vectors for word representation</article-title>
          ,
          <source>in: Proceedings of EMNLP 2014</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Varimalla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <article-title>Using distributional thesaurus embedding for co-hyponymy detection</article-title>
          ,
          <source>in: Proceedings of LREC 2020</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>5766</fpage>
          -
          <lpage>5771</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Frermann</surname>
          </string-name>
          ,
          <article-title>Seeking clozure: Robust hypernym extraction from BERT with anchored prompts</article-title>
          ,
          <source>in: Proceedings of *SEM 2023</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>193</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of EMNLP-IJCNLP 2019</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Connecting the dots: Inferring patent phrase similarity with retrieved phrase graphs</article-title>
          ,
          <source>in: Proceedings of NAACL 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1890</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>