<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>What is Special about Patent Information Extraction?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liang Chen†</string-name>
          <email>25565853@qq.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zheng Wang</string-name>
          <email>wangz@istic.ac.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuo Xu</string-name>
          <email>xushuo@bjut.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chao Wei</string-name>
          <email>weichaolx@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weijiao Shang</string-name>
          <email>shangwj490@163.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haiyun Xu</string-name>
          <email>xuhy@clas.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chengdu Library and Information Center, Chinese Academy of Sciences</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country>China P.R.</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Economics and Management</institution>
          ,
          <institution>Beijing University of Technology</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country>China P.R.</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Scientific and Technical Information of China</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country>China P.R.</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Research Institute of Forestry Policy and Information, Chinese Academy of Forestry</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country>China P.R.</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>63</fpage>
      <lpage>72</lpage>
      <abstract>
        <p>Information extraction is the fundamental technique for text-based patent analysis in the era of big data. However, the specialty of patent text causes the performance of general information-extraction methods to degrade noticeably. To solve this problem, an in-depth exploration has to be conducted to clarify what is particular about patent information extraction, and thus to point out directions for further research. In this paper, we discuss the particularity of patent information extraction in three aspects: (1) What is special about labeled patent datasets? (2) What is special about word embeddings in patent information extraction? (3) What kind of method is more suitable for patent information extraction?</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>CCSInformation systemsInformation
tasks and goalsInformation extraction
retrievalRetrieval</p>
      <sec id="sec-1-1">
        <title>1 Introduction</title>
        <p>Keywords: patent information extraction, deep learning, word embedding.</p>
        <p>
          As an important source of technical intelligence, patents cover
more than 90% of the world's latest technical information, of which
80% is never published in any other form [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. There are two
traditional ways to obtain technical intelligence, either by
analyzing structured data with bibliometric methods or by experts
reading patent texts. However, with the rapid growth of patent
documents, the second way is facing more and more challenges.
        </p>
        <p>Information extraction is an important technology for
machine understanding of text; it aims to extract structured data
from free text and thereby resolve the ambiguity inherent in free
text. In recent years, with the tremendous advances in machine
learning, especially the rise of deep learning, research in
information extraction has made great progress. However, the
particularity of patent text causes the performance of general
information-extraction tools to degrade greatly. Therefore, it is
necessary to explore patent information extraction in depth and
provide ideas for subsequent research.</p>
        <p>Information extraction is a big topic, but it is far from being
well explored in the IP (Intellectual Property) field. To the best of our
knowledge, there are only three labeled datasets publicly available
in the literature that contain annotations of named entities and
semantic relations. Therefore, we choose NER (Named Entity
Recognition) and RE (Relation Extraction) for discussion in the three
aspects below.</p>
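        <p>To make the two sub-tasks concrete, a minimal sketch follows; the sentence, the BIO tags and the relation label are toy illustrations, not drawn from TFH-2020 or any schema in this paper:</p>
        <preformat>
```python
# Illustrative sketch: NER as BIO sequence labeling, RE as triple extraction.
# The sentence and all labels are toy examples, not taken from any dataset.
tokens = ["The", "magnetic", "head", "is", "mounted", "on", "the", "slider"]

# NER output: one BIO tag per token ("B-COMP"/"I-COMP" are hypothetical tags
# for a "component" entity type).
bio_tags = ["O", "B-COMP", "I-COMP", "O", "O", "O", "O", "B-COMP"]

def decode_entities(tokens, tags):
    """Collect (start, end, text) spans from BIO tags."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                entities.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "O" and start is not None:
            entities.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:
        entities.append((start, len(tokens), " ".join(tokens[start:])))
    return entities

entities = decode_entities(tokens, bio_tags)
# RE output: a relation label for an entity pair ("spatial-relation" is a
# hypothetical label).
relations = [(entities[0][2], "spatial-relation", entities[1][2])]

print(entities)   # [(1, 3, 'magnetic head'), (7, 8, 'slider')]
print(relations)  # [('magnetic head', 'spatial-relation', 'slider')]
```
        </preformat>
        <p>NER thus produces typed spans over the token sequence, and RE assigns a relation label to pairs of those spans; the rest of the paper examines how patent text affects both steps.</p>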
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 What is special about labeled patent dataset?</title>
      <p>
        Since supervised learning methods represent the state of the art
in information extraction, it is necessary to clarify the
particularity of labeled patent datasets for further improvement of
information extraction in IP. To this end, a comparative analysis is
conducted on seven labeled datasets of three
categories: (1) news corpora consisting of Conll-2003 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
NYT-2010 (New York Times corpus) [3], (2) encyclopedia
corpus consisting of Wikigold [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ] and LIC-2019 (the annotated
dataset of the 2019 Language and Intelligence Challenge) [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ], (3)
patent corpus consisting of CPC-2014 (Chemical Patent Corpus)
[
        <xref ref-type="bibr" rid="ref5">6</xref>
        ], CGP-2017 (The CEMP and GPRO Patents Tracks) [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ],
TFH-2020 (Thin Film Head annotated dataset) [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ].
      </p>
      <p>Each labeled dataset contains two parts: (1) an
information schema that defines the label types, and (2) a set of
labeled texts. Taking TFH-2020 as an example, the schemas
of named entities and semantic relations are shown in Tables 1 and
2 of the Appendix, and a labeled text is shown in Fig. 1.</p>
      <p>In order to analyze these datasets, 8 indicators are proposed,
as shown in Table 3 of the Appendix. It is worth noting that (1) in
CGP-2017, Conll-2003 and Wikigold, only entities are annotated,
not semantic relations, and (2) all datasets are in English except
LIC-2019, which is in Chinese. As a consequence, some
indicators are not calculated for certain datasets. The final results are
shown in Table 4 of the Appendix. (Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).)</p>
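      <p>As an illustration, several of these indicators can be computed with a few lines of code; the corpus structure below is an assumed toy format, not the actual format of any of the seven datasets:</p>
      <preformat>
```python
# Sketch: computing some of the proposed dataset indicators on a toy labeled
# corpus. The corpus format (a list of sentences with entity and relation
# annotations) is an assumed structure for illustration only.
corpus = [
    {"tokens": ["The", "magnetic", "head", "reads", "the", "disk"],
     "entities": ["magnetic head", "disk"],
     "relations": [("magnetic head", "in-manner-of", "disk")]},
    {"tokens": ["The", "slider", "carries", "the", "magnetic", "head"],
     "entities": ["slider", "magnetic head"],
     "relations": [("slider", "part-of", "magnetic head")]},
]

n = len(corpus)
avg_sentence_length = sum(len(s["tokens"]) for s in corpus) / n
entities_per_sentence = sum(len(s["entities"]) for s in corpus) / n
all_entities = [e for s in corpus for e in s["entities"]]
words_per_entity = sum(len(e.split()) for e in all_entities) / len(all_entities)
# Entity repetition rate: mentions divided by distinct entities.
entity_repetition_rate = len(all_entities) / len(set(all_entities))
# Percentage of multi-word (ngram) entities.
ngram_pct = 100 * sum(len(e.split()) > 1 for e in all_entities) / len(all_entities)

print(avg_sentence_length, entities_per_sentence, words_per_entity)
print(entity_repetition_rate, ngram_pct)
```
      </preformat>
      <p>The remaining indicators (relation repetition rate, entity association rate) follow the same counting pattern over relation mentions and shared words.</p>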
      <p>From the statistics in Table 4, we can observe the following
facts:
(1) in terms of average sentence length, there is no clear
distinction between patent text and generic text;
(2) in terms of the count of entities per sentence, a single sentence
of patent text contains more entities than one of generic text;
(3) as to the remaining indicators, TFH-2020 shows clear distinctions from
the other patent datasets and all generic datasets.</p>
      <p>In summary, significant distinctions exist not only
between patent text and generic text, but also between patent texts
from different technical domains. In our opinion, the latter
distinctions are two-fold. Firstly, they come from the unique
characteristics of different technical domains: for example, plenty
of sequences and chemical structures are mentioned when describing
innovations in chemistry and biotechnology (Hunt et al., 2012),
while the most frequent entities in the field of hard disk drives
concern components, location and function (Chen et al., 2020);
since they describe inventions with different materials and mechanisms,
patents from different domains follow different writing styles.
Secondly, they come from the concerns of experts in different
domains: e.g., 17 types of entities are designed for the TFH-2020
dataset, whereas in the CGP-2017 dataset only 3 types of entities
(chemical, gene-n, gene-y) are considered and all other entity
types are out of scope.</p>
    </sec>
    <sec id="sec-3">
      <title>3 What is special about patent word embeddings?</title>
      <p>So far deep learning techniques have achieved state-of-the-art
performance in information extraction. As the foundation of deep
learning techniques in NLP, word embedding refers to a class of
techniques where words or phrases from the vocabulary are
mapped to vectors of real numbers.</p>
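      <p>The idea can be made concrete with a count-based sketch: build a word co-occurrence matrix and factorize it with truncated SVD. This is a simpler relative of predictive methods such as Skip-gram, shown only to illustrate how words are mapped to real-valued vectors:</p>
      <preformat>
```python
import numpy as np

# Minimal count-based word embedding: build a word-word co-occurrence matrix
# over a context window, then factorize it with truncated SVD. Skip-gram
# learns comparable dense vectors predictively instead of by counting.
corpus = [
    "the magnetic head reads the disk".split(),
    "the magnetic head writes the disk".split(),
    "the slider carries the head".split(),
]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 2
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1.0

# Truncated SVD: keep the top-k components as the embedding dimensions.
k = 3
U, S, _ = np.linalg.svd(C, full_matrices=False)
embeddings = U[:, :k] * S[:k]   # each row is a k-dimensional word vector

print(embeddings[idx["head"]].shape)   # (3,)
```
      </preformat>
      <p>Real systems use far larger corpora and dimensions (100 in this paper), but the output object is the same: a dense vector per vocabulary word.</p>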
      <p>
        There are two ways to obtain word embeddings, (1) by
training on a corpus via word embedding algorithm, such as
Skip-gram [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ] and the like; (2) by directly downloading a
pre-trained word embedding file from the Internet, like GloVe
[
        <xref ref-type="bibr" rid="ref9">10</xref>
        ]. Risch and Krestel [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ] suggested obtaining word
embeddings by training specifically on patent documents in all
fields to improve the semantic representation of patent language.
In fact, such a suggestion is based on automatic classification of
patents in all fields, which is quite different from information
extraction from patents in a specific domain. In order to explore
which word embedding is preferable for patent information
extraction, four types of word embedding with the same
dimension of 100 are prepared as follows:
(1) Word embeddings of GloVe provided by Stanford NLP group.
According to the different training corpora, there are four release
versions of GloVe [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ]. We choose the one trained on Wikipedia
2014 and Gigaword 5, as it provides word embeddings of 100
dimensions. The version trained on Twitter also has 100-dimension
word embeddings, but it is excluded because our training corpus
does not follow the patterns of short texts such as tweets;
(2) Word embeddings provided by Risch and Krestel [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ], which
are trained on the full text of 5.4 million patents granted by the
USPTO from 1976 to 2016. Risch and Krestel released three
versions of word embeddings with 100/200/300 dimensions. The
100-dimension version is chosen and referred to as
USPTO-5M;
(3) Word embeddings trained on the full text (abstract, claims
and description) of the 1,010 patents mentioned in this paper;
these word embeddings are referred to as TFH-1010;
(4) Word embeddings trained on the abstracts of 46,302 patents
regarding the magnetic head in hard disk drives; these word
embeddings are referred to as MH-46K.
      </p>
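      <p>Whatever their training corpus, all four embedding sets reduce to a word-to-vector lookup once loaded. The sketch below parses the plain-text format used by GloVe releases (one token per line followed by its vector components) and compares words by cosine similarity; the in-memory toy file and its 3-dimensional vectors are made up for illustration:</p>
      <preformat>
```python
import io
import numpy as np

# Sketch: loading word vectors in the GloVe text format and comparing two
# words by cosine similarity. The embedded toy "file" and its 3-d vectors
# are invented for illustration; a real file would be 100-dimensional.
glove_text = io.StringIO(
    "head 0.1 0.3 0.5\n"
    "slider 0.2 0.2 0.6\n"
    "banana 0.9 -0.4 0.0\n"
)

def load_vectors(fh):
    vectors = {}
    for line in fh:
        parts = line.split()
        vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

vecs = load_vectors(glove_text)
print(cosine(vecs["head"], vecs["slider"]))   # close to 1: similar vectors
print(cosine(vecs["head"], vecs["banana"]))   # near 0: dissimilar vectors
```
      </preformat>
      <p>Swapping one embedding set for another in the experiments amounts to replacing this lookup table while keeping the downstream models unchanged.</p>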
      <p>On the basis of these word embeddings, two deep-learning
models, BiLSTM-CRF and BiGRU-HAN, are used for entity
identification and semantic relation extraction, respectively.
Specifically, BiLSTM-CRF (Fig. 2) takes sentences as
input and represents every word as a word embedding; during
training, these word embeddings pass through the layers of
BiLSTM-CRF, which outputs the predicted named entities in the
sentence. The basic idea of BiGRU-HAN (Fig. 3) is to recognize
the occurrence patterns of different semantic relations with a
recurrent neural network, BiGRU, and then to leverage a
hierarchical attention mechanism, consisting of a word-level
attention layer and a sentence-level attention layer, to further
improve the model's prediction accuracy.</p>
      <p>Fig. 3 The structure of BiGRU-HAN model.</p>
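      <p>The CRF layer of BiLSTM-CRF can be sketched in isolation: given per-token emission scores (as the BiLSTM would produce) and tag-transition scores, Viterbi decoding returns the best-scoring tag sequence. All numbers below are arbitrary toy values, not trained parameters:</p>
      <preformat>
```python
import numpy as np

# Viterbi decoding as used by the CRF layer of BiLSTM-CRF: given emission
# scores per token and transition scores between tags, find the
# highest-scoring tag sequence. All scores here are arbitrary toy numbers.
tags = ["O", "B", "I"]                 # a minimal BIO-style tag set
emissions = np.array([                 # shape: (sequence_len, num_tags)
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.2],
    [0.3, 0.4, 2.2],
])
transitions = np.array([               # transitions[i, j]: score of tag i to j
    [0.5, 0.5, -2.0],                  # O to I is discouraged
    [0.2, -0.5, 1.0],
    [0.2, 0.3, 0.8],
])

def viterbi(emissions, transitions):
    n, m = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, m), dtype=int)
    for t in range(1, n):
        # Score of reaching each tag j at step t from every tag i at t-1.
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(emissions, transitions))   # ['O', 'B', 'I']
```
      </preformat>
      <p>The transition matrix is what lets the CRF rule out invalid tag sequences (such as I following O), which per-token classification alone cannot enforce.</p>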
      <p>
        From Table 5 and Table 6, we can see that the results produced
by the four types of word embedding are almost identical.
However, Risch and Krestel [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ] observed a considerable
improvement when replacing Wikipedia word embeddings with
USPTO-5M word embeddings in a patent classification task. In our
opinion, the main reason lies in the huge difference between
automatic classification of patents in all fields and
information extraction from patents in a specific domain. In other
words, when one confronts a task in a specific domain,
word embeddings trained on a corpus from the same domain should
be preferred.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 What is special about methods of patent information extraction?</title>
      <p>
        As far as supervised learning methods are concerned, there are
two main ways to perform information extraction, namely the
pipeline way and the joint way, shown in Fig. 4. The former extracts
the entities first and then identifies the relationships between them;
this separated strategy makes the information extraction task easy to
deal with, and each component can be more flexible. In contrast, the
latter uses a single model to extract entities and relations at one time.
Although Zheng et al. [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ] claimed that the joint method is capable of
integrating the information of entities and relations, and can thus
improve NER and RE performance in a mutually reinforcing way,
in our opinion the biggest advantage it brings is the elimination of
entity pair generation, which produces a large number of
entity pairs with no relation type, as shown in Fig. 5.
      </p>
      <p>Fig. 4 Two patterns of information extraction: (a) the pipeline method, in which NER, entity pair generation and RE are applied to the patent dataset in sequence; (b) the joint method, in which a single model predicts the (subject, predicate, object) structure directly.</p>
      <p>The joint method thus seems to be a better solution for extracting
patent information, but what is the actual situation?</p>
      <p>
        To verify this, we prepare a pipeline baseline and a joint
baseline, namely BiLSTM-CRF [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ] &amp; BiGRU-HAN [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ] and
Hybrid Structure of Pointer and Tagging [
        <xref ref-type="bibr" rid="ref14">15</xref>
        ] (Fig. 6) for an
experiment on the TFH-2020 dataset. Since, after entity pair
generation, the proportion of no-relation pairs in TFH-2020 is much
larger than that of generic text, two sets of results are provided for the
pipeline baseline, one including and one excluding the no-relation
pairs; they are shown in the 1st and 2nd rows of Table 7, and the
result of the Hybrid Structure of Pointer and Tagging is shown in the
3rd row. In order to highlight the performance of the two baselines on
different types of relation, the precision, recall, and F1-value for
each type of relation are shown in Fig. 7 and Fig. 8, with each type of
relation denoted by its first 3 letters (cf. Table 2).
      </p>
      <p>Fig. 5 The procedure of information extraction in the pipeline way. The raw sentence "In one embodiment the offset portion of the first magnetic layer is disposed within a recess in the substrate" first passes through NER; entity pair generation then yields six candidate pairs with the following gold-standard relation types: (offset portion, first magnetic layer): part-of; (offset portion, recess): spatial relation; (offset portion, substrate): no relation; (first magnetic layer, recess): no relation; (first magnetic layer, substrate): no relation; (recess, substrate): attribution. The labeled pairs are then used for model training.</p>
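      <p>The entity pair generation step of Fig. 5 can be sketched directly; with the 4 entities of the example sentence, 6 candidate pairs are produced, half of which carry no relation. This is exactly the blow-up that the joint method avoids:</p>
      <preformat>
```python
from itertools import combinations

# Sketch of the entity pair generation step of the pipeline method (Fig. 5):
# every unordered pair of recognized entities becomes a candidate for the RE
# model, so n entities yield n*(n-1)/2 pairs, many of which have no relation.
entities = ["offset portion", "first magnetic layer", "recess", "substrate"]
gold = {  # gold-standard relation types from the Fig. 5 example
    ("offset portion", "first magnetic layer"): "part-of",
    ("offset portion", "recess"): "spatial relation",
    ("recess", "substrate"): "attribution",
}

pairs = list(combinations(entities, 2))
labeled = [(a, b, gold.get((a, b), "no relation")) for a, b in pairs]

print(len(pairs))                                          # 6
print(sum(rel == "no relation" for _, _, rel in labeled))  # 3
```
      </preformat>
      <p>Because the pair count grows quadratically with the entity count, entity-dense patent sentences produce a disproportionate share of no-relation training instances.</p>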
      <p>
        As can be seen, the experimental results in this paper
contradict the observation from information extraction
competition in LIC 2019 [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ], where joint methods outperformed
pipeline methods by a large margin. In our opinion, there are two
reasons behind this: (1) like the pipeline method, the performance
of the joint method is severely affected by the number of entities in
sentences; (2) the joint model requires a much larger training set
than the pipeline model. To verify the 2nd
reason, we take the LIC-2019 dataset as an example to
demonstrate how the size of the training set affects the
performance of the Hybrid Structure of Pointer and Tagging.
      </p>
      <sec id="sec-4-6">
        <title>Effect of training-set size</title>
        <p>Fig. 6 The structure of the Hybrid Structure of Pointer and Tagging.</p>
        <p>Fig. 9 The performance of the joint model with different sizes of
training set.</p>
        <p>As shown in Fig. 9, as the size of the training set increases
from 1000 to 50000, the performance of Hybrid Structure of
Pointer and Tagging increases rapidly, and then it enters a stable
state near 0.78/0.51/0.63 in terms of weighted-average precision
/recall /F1-value.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>In this paper, we discuss the particularity of patent information
extraction in three aspects:
(1) Labeled datasets: through comparative analysis, it is found that
there are differences not only between labeled patent datasets and
labeled generic datasets, but also between labeled patent datasets
from different technical fields. This means patent information
extraction is a domain-specific task, and a series of processing
steps, from feature engineering to model building, should be
customized for better performance.
(2) Word embeddings: word embedding is the foundation of deep
learning methods in information extraction. Although Risch and
Krestel suggested obtaining word embeddings by training specifically
on patent documents in all fields to improve the semantic
representation of patent language, our experiments show that when
one confronts a task in a specific domain, word embeddings trained
on a corpus from the same domain should be preferred.
(3) Organization of sub-tasks in information extraction:
although the joint method achieves state-of-the-art performance in
information extraction, this excellent performance comes at the
expense of a large labeled dataset. When the dataset is limited, one
should take a series of factors into consideration, such as model
characteristics, computing resources and actual performance,
and then choose an optimal method.</p>
      <p>We realize that some conclusions in this paper are drawn from
only a few sample datasets using simple metrics. However, given the
scarcity of labeled patent datasets for information extraction, this
is what the available data can support so far. In the future, we hope
more people will participate in the construction of labeled patent
datasets and in research on patent information extraction, not only
because of the lack of labeled datasets, but also because there are
valuable tasks waiting to be explored, such as: how can large-scale
patent annotation datasets be generated at low cost? And how can the
particularity of patent text be used to improve the performance of
information extraction on patent text?</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research was financially supported by the National Natural
Science Foundation of China under grant number 71704169, the
National Key Research and Development Program of China under
grant number 2019YFA0707202, and the Social Science Foundation
of Beijing Municipality under grant number 17GLB074. Our
gratitude also goes to the anonymous reviewers for their valuable
suggestions and comments.</p>
    </sec>
    <sec id="sec-7">
      <title>Appendix:</title>
      <p>Example sentences illustrating the annotations include: "The etchant solution has a suitable solvent additive such as glycerol or methyl cellulose"; "A camera using a film having a magnetic surface for recording magnetic data thereon"; "Conductor is utilized for producing writing flux in magnetic yoke"; "The curing step takes place at the substrate temperature less than 200.degree"; "The legs are thinner near the pole tip than in the back gap region"; "The MR elements are biased to operate in a magnetically unsaturated mode"; "Magnetic disk system permits accurate alignment of magnetic head with spaced tracks"; "A magnetic head having highly efficient write and read functions is thereby obtained"; "Recess is filled with non-magnetic material such as glass"; "A pole face of yoke is adjacent edge of element remote from surface"; "This prevents the slider substrate from electrostatic damage"; "A digital recording system utilizing a magnetoresistive transducer in a magnetic recording head"; "Interlayer may comprise material such as Ta"; "Peak intensity ratio represents an amount hydrophilic radical"; "Pressure distribution across air bearing surface is substantially symmetrical side".</p>
      <p>Table 3: the 8 proposed indicators. (1) Average length of sentence: how many words a sentence contains on average. (2) Number of entities per sentence: how many entities a sentence contains on average. (3) Number of words per entity: how many words an entity contains on average. (4) Number of relations per sentence: how many relation mentions a sentence contains on average. (5) Entity repetition rate: how many times an entity appears in the corpus on average, i.e., the number of entity mentions divided by the number of entities after deduplication. (6) Relation repetition rate: how many times a relation mention appears in the corpus on average, i.e., the number of relation mentions divided by the number of relation mentions after deduplication. (7) Percentage of ngram entities: the proportion of multi-word (ngram) entities among all entities. (8) Entity association rate: the connection between entities measured by the co-word mechanism, e.g., "thin film head" and "ferrite head" are connected as they share the common word "head".</p>
      <p>Table 4: corpus descriptions of the seven datasets. CPC-2014 (EN): patent full text regarding biology and chemistry; CGP-2017 (EN): patent abstracts regarding biomedical science; TFH-2020 (EN): patent abstracts regarding thin film head techniques; Conll-2003 (EN): Reuters news stories; NYTC (EN): New York Times Corpus; LIC-2019 (CN): search results of Baidu Search as well as Baidu Zhidao; Wikigold (EN): Wikipedia.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Zha</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Study on early warning of competitive technical intelligence based on the patent map</article-title>
          .
          <source>Journal of Computers</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ).
          <source>doi:10.4304/jcp.5.2</source>
          .
          <fpage>274</fpage>
          -
          <lpage>281</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Sang</surname>
            <given-names>E. F. T. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meulder F.</surname>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition</article-title>
          . arXiv preprint arXiv:cs/0306050.
          [3]
          <string-name>
            <surname>Riedel</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            <given-names>A</given-names>
          </string-name>
          . (
          <year>2010</year>
          )
          <article-title>Modeling Relations and Their Mentions without Labeled Text</article-title>
          . In:
          <string-name>
            <surname>Balcázar</surname>
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonchi</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gionis</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebag</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>(eds) Machine Learning and Knowledge Discovery in Databases</article-title>
          .
          <source>ECML PKDD 2010. Lecture Notes in Computer Science</source>
          , vol
          <volume>6323</volume>
          . Springer, Berlin, Heidelberg
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Balasuriya</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ringland</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nothman</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>J. R.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Named Entity Recognition in Wikipedia</article-title>
          ,
          <source>Proceedings of the 2009 Workshop on the People's Web Meets NLP, ACL-IJCNLP</source>
          <year>2009</year>
          , pages
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Report of 2019 language &amp; Intelligence technique evaluation</article-title>
          .
          <source>Baidu Corporation</source>
          . http://tcci.ccf.org.cn/summit/2019/dlinfo/1101-wh.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Akhondi</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klenner</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tyrchan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manchala</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boppana</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagarlapudi</surname>
            ,
            <given-names>S. A. R. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sayle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kors</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Muresan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Annotated Chemical Patent Corpus: A Gold Standard for Text Mining</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>9</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Pérez-Pérez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pérez-Rodríguez</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vazquez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fdez-Riverola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oyarzabal</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valencia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lourenço</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Evaluation of Chemical and Gene/Protein Entity Recognition Systems at BioCreative V.5: The CEMP and GPRO Patents Tracks</article-title>
          .
          <source>In Proceedings of the BioCreative V.5 Challenge Evaluation Workshop</source>
          ,
          <fpage>11</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A deep learning based method for extracting semantic information from patent documents</article-title>
          .
          <source>Scientometrics</source>
          . doi:10.1007/s11192-020-03634-y.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Efficient estimation of word representations in vector Space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing</source>
          (pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Risch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Krestel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Domain-specific word embeddings for patent classification</article-title>
          .
          <source>Data Technologies and Applications</source>
          ,
          <volume>53</volume>
          (
          <issue>1</issue>
          ),
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hao</surname>
            ,
            <given-names>Y.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme</article-title>
          .
          <source>arXiv preprint arXiv:1706.05075</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          .
          <source>arXiv preprint arXiv:1508.01991</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>OpenNRE: An Open and Extensible Toolkit for Neural Relation Extraction</article-title>
          .
          <source>arXiv preprint arXiv:1909.13078</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Hybrid Structure of Pointer and Tagging for Relation Extraction: A Baseline</article-title>
          . https://github.com/bojone/kg-2019-baseline.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>