<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fusing Multi-label Classification and Semantic Tagging</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorg Kindermann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katharina Beckh</string-name>
          <email>katharina.beckhg@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Competence Center for Machine Learning Rhine-Ruhr</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Companies have an increasing demand for enriching documents with metadata. In an applied setting, we present a three-part workflow for the combination of multi-label classification and semantic tagging using a collection of key-phrases. The workflow is illustrated on the basis of patent abstracts with the CPC scheme. The key-phrases are drawn from a training set collection of documents without manual interaction. The union of CPC labels and key-phrases provides a label set on which a multi-label classifier model is generated by supervised training. We show learning curves for both key-phrases and classification categories, and a semantic graph generated from cosine similarities. We conclude that, given sufficient training data, the number of label categories is highly scalable.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-label classification, prediction-based embedding spaces, patents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>For strategic developments, businesses and research organizations have an
interest in identifying competences or trends in their respective organization and
in comparison to competing institutions. Extracting this information manually
from heterogeneous data is time-consuming, and is further complicated by
different underlying classification schemes, e.g. from patents or publications.
Therefore, there is an increasing demand for metadata [8] that combines
categories from classification schemes with semantic tags.</p>
      <p>The automatic single-label classification of documents is well-researched [21],
[1], while multi-label classification with large numbers of labels is still a challenge
[16]. The combination of classification and semantic tagging is also less explored.
Advances in the distributed representation of words have provided the necessary
basis for this combination [14], and recent work achieves both steps
together in a single document processing workflow [18].</p>
      <p>To tackle the fusion of classification and semantic tagging in an applied
setting, we introduce a basic workflow which classifies and tags documents
at once. We start by introducing the tools, namely the model, data
and evaluation metrics (Section 3). Subsequently, we put the approach into
context by describing a use case within the Fraunhofer society that aims to extract
information from existing data sources (Section 4.1). As patent data is an
important base for innovation research, and because it exhibits one of the largest
and most prominent classification schemes, we employ it to demonstrate the workings
of our approach.</p>
      <p>Following the use case, we describe the three-part workflow in detail (Section
4.2). A set of key-phrases is collected in an unsupervised procedure from a
training set of documents. The union of category labels and key-phrases provides a
label set on which a multi-label classifier model is trained. Following the model
training, we furthermore describe how to extract embedding vectors to visually
represent classification categories and key-phrases together in a semantic graph.
We depict learning curves with appropriate metrics and a cutout of the semantic
graph. We conclude that the workflow scales to a larger amount of documents
and can be applied to documents in various domains.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Multi-label classification with a large number of categories has been notoriously
difficult. A first break-through that made classification of texts possible
without relying on manually designed features was the Support Vector Machine [5],
[10]. However, the computational effort grows considerably with the number of
labels, making the training of classification problems with thousands of labels
intractable. Semantic tagging, i.e. the assignment of key-phrases to a text, in an
unsupervised way was achieved by applications of the Latent Dirichlet Allocation
topic model [3].</p>
      <p>Both steps, multi-label classification and semantic tagging, in a document
processing workflow could recently be combined with the advent of the StarSpace
algorithm [18], which is based on embedding vector spaces. This algorithm implements the
concept of prediction-based embedding spaces.</p>
      <p>Since Elman's seminal paper [7] on recurrent neural networks and their
training on sequences, in particular sentences as sequences of words, there have been
many efforts to improve the storage capacity and reduce the computational
complexity of such systems. The Word2Vec algorithms [14] were a path-breaking
invention in this direction, which for the first time made it possible to represent
semantic properties of words derived from their actual usage in large quantities
of text. This algorithm exceeded the capacities of previously known systems by orders of
magnitude. Levy and Goldberg [12] showed that the Word2Vec algorithms are
closely related to counting-based vector representations by matrix-factorization
mappings. An example is a vector space based on PMI (point-wise mutual
information) values. Because of this close relationship to PMI-based representations,
this finding supports confidence in the semantic properties
of prediction-based embedding spaces, such as the StarSpace model, which are
explored by cosine similarity. Important follow-up developments of Word2Vec were GloVe [15]
and FastText [4].</p>
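      <p>The PMI-based vector representation mentioned above can be made concrete with a toy computation. The following sketch is illustrative only (not the authors' code; pmi_vectors is a hypothetical helper): it builds count-based PMI vectors from co-occurrence counts over a tiny corpus.

```python
import math
from collections import Counter

def pmi_vectors(corpus, window=2):
    # point-wise mutual information vectors from word co-occurrence counts;
    # each word gets one vector component per vocabulary word
    words = [w for sent in corpus for w in sent]
    vocab = sorted(set(words))
    word_count = Counter(words)
    cooc = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    cooc[(w, sent[j])] += 1
    total_pairs = sum(cooc.values())
    n = sum(word_count.values())
    vecs = {}
    for w in vocab:
        vecs[w] = [
            math.log((cooc[(w, c)] / total_pairs)
                     / ((word_count[w] / n) * (word_count[c] / n)))
            if cooc[(w, c)] > 0 else 0.0
            for c in vocab
        ]
    return vocab, vecs
```

Words that co-occur more often than chance get positive components, which is the counting-based structure that prediction-based embeddings implicitly factorize.
</p>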
      <p>Recent applications of StarSpace have been published in the areas of
ontologies [9] and knowledge graphs [20] that are related to our use case. Regarding
other recent work, transformer-based architectures [6] are also suitable for
multi-label classification.</p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <sec id="sec-3-1">
        <title>StarSpace</title>
        <p>We chose StarSpace [18], a general-purpose neural embedding model which can
be used for multi-label classification and tagging. It is based on a bag-of-entities
representation. Entities can be texts, labels, meta-data like authors, source URLs,
etc. StarSpace thus is capable of learning relations between items of various types
and origins. The bag-of-entities representation is a high-dimensional vector in
an embedding space which may include labels. The actual learning algorithm is
a stochastic gradient descent optimization of a special loss function
∑_{(a,b) ∈ E⁺, bᵢ⁻ ∈ E⁻} L_batch(sim(a, b), sim(a, b₁⁻), ..., sim(a, bₖ⁻))   (1)
where entities a and b are drawn from the set E⁺ of positive examples, and
entities bᵢ⁻ are drawn from the set E⁻ of negative examples. In our use case
(Section 4.1) the entities are the patent abstracts and their labels and key-phrases.
The k-negative sampling strategy of [14] is used. The similarity function can be
chosen from {cosine, dot product}. The loss function L_batch has two
implementations:
- margin ranking loss: max(0, μ − sim(a, b)) with margin parameter μ
- the negative log loss of the softmax function: −log(e^{yᵢ} / ∑ⱼ e^{yⱼ})</p>
        <p>During the optimization run, the similarity function sim(·, ·) is "learned".
It can subsequently be used to measure the similarity between entities. For
classification, a label is predicted for a given input a as max_b̂ sim(a, b̂) over the
set of possible labels b̂. This feature can be used to output a ranking of labels
according to their similarity, implementing multi-label classification.</p>
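        <p>The prediction rule just described, a ranking of labels by similarity to the input, can be sketched as follows (our own illustrative code, not part of StarSpace; the vectors stand in for learned embeddings):

```python
import math

def cosine(u, v):
    # cosine similarity between two dense embedding vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def rank_labels(doc_vec, label_vecs):
    # output all labels sorted by decreasing similarity to the document
    # embedding, implementing multi-label classification as a ranking
    return sorted(label_vecs,
                  key=lambda label: cosine(doc_vec, label_vecs[label]),
                  reverse=True)
```

Taking the top-ranked label reproduces single-label prediction; keeping the whole ranking is what the multi-label evaluation below operates on.
</p>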
      </sec>
      <sec id="sec-3-2">
        <title>Data</title>
        <p>In our experiments, we employ a sample of patent abstracts from the United
States Patent and Trademark Office (USPTO)3 from the month of January 2020
which amounts to 22,000 abstracts. The classification scheme that we use is the
Cooperative Patent Classification (CPC). The CPC hierarchy is illustrated in
Fig. 1 and consists of section, class, subclass, maingroup and subgroup.</p>
        <sec id="sec-3-2-1">
          <title>3 https://developer.uspto.gov/product/patent-grant-full-text-dataxml</title>
          <p>Fig. 1: The CPC hierarchy: Section, Class, Subclass, Maingroup, Subgroup.</p>
          <p>We focus on the first three levels, namely section, class and subclass. The
data contains a Main-CPC which serves as the main category of the patent
and Further-CPC categories which are also applicable categories (see Fig. 5(b)
for examples). We selected a subset of all possible labels with respect to the
number of examples available in our data collection. Table 1 shows the numbers
of selected labels in both categories. For the category key-phrases see Section 4.2.</p>
          <p>As evaluation metrics we use the F1 value and the coverage-rank. The coverage-rank
counts how many steps have to be taken to move down the ranked label list
to cover all the relevant labels of the example. The coverage-rank was used to
assess the performance on the Further-CPC labels and key-phrases. It seems
better suited to multi-label classification than the F1 value. Another
important reason is that we want to train the model on a semantic tagging
task, which would be thwarted by an exclusive optimization according to F1
values. Semantic tagging is expected to tag a document with
key-phrases that are not literally contained in the document
but are nevertheless highly relevant to its content and topic. This
desired behavior would, however, degrade the F1 value because such tags
would be counted as false positives.</p>
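          <p>A minimal sketch of the coverage-rank as described above (illustrative; coverage_rank is a hypothetical helper operating on an already ranked label list):

```python
def coverage_rank(ranked_labels, relevant_labels):
    # number of steps down the ranked label list needed to cover
    # all relevant labels of the example (lower is better)
    relevant = set(relevant_labels)
    covered = set()
    for rank, label in enumerate(ranked_labels, start=1):
        covered.add(label)
        if relevant.issubset(covered):
            return rank
    return None  # some relevant label never appears in the ranking
```

Unlike F1, this score does not punish extra high-ranked tags as false positives; it only measures how deep the ranking must be cut to recover every relevant label.
</p>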
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Use Case</title>
        <p>Here, we first describe the applied benefit of our approach in the context of a
current project. Within the project "Fraunhofer Digital" a data hub has been
created which will cover a variety of datasets, ranging from publications and
patents to project descriptions. All the datasets contain valuable information
about the competence landscape and, in particular, patent data is important for
the strategic technology and innovation management within Fraunhofer.</p>
        <p>One key challenge is that patents are only mapped to a patent classification
system. There is no basis for linking the classification to information outside of
the scheme. In this use case it is desired to find similarities between patents and,
at a glance, to identify the most suitable key-phrases. This makes it, for
example, easier to determine current technologies and technology trends.</p>
        <p>Our approach is to extract and assign information inherent in the patents
that exceeds the common patent classification. We achieve this by employing
key-phrase extraction. By providing key-phrases on top of the classification, the
model provides comprehensible information for readers and therefore serves as
a basis to facilitate work for employees. In the "Fraunhofer Digital" use case we
also apply this approach to publication data, using more data to create several
classification models. For this paper, we narrow our focus to patent samples. In
the following, we describe the workflow in more detail.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Workflows</title>
        <p>Key-phrase Extraction. We collect a list of key-phrases from the pool of
training documents using the RAKE (Rapid Automatic Keyword Extraction)
algorithm [17]. We chose RAKE because it does not depend on sophisticated
preprocessing operations such as named-entity recognition and the training of neural
networks as in [13]. RAKE operates in an unsupervised manner on individual
documents. It identifies key-phrases by extracting phrases between stopwords (e.g.
"the", "a") and by analyzing the frequency of word appearance and word
co-occurrence.</p>
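        <p>The candidate-extraction and scoring ideas behind RAKE can be sketched as follows (a simplified toy version with a tiny stopword list, not the reference implementation of [17]):

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}  # toy list

def candidate_phrases(text):
    # RAKE-style candidates: maximal runs of content words
    # between stopwords and punctuation
    words = re.findall(r"[a-z]+", text.lower())
    phrases, current = [], []
    for word in words:
        if word in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases

def rake_scores(phrases):
    # word score = degree / frequency; a phrase scores the sum of its word scores
    freq, degree = Counter(), defaultdict(int)
    for phrase in phrases:
        words = phrase.split()
        for word in words:
            freq[word] += 1
            degree[word] += len(words)
    return {p: sum(degree[w] / freq[w] for w in p.split()) for p in phrases}
```

Because scoring favors words that frequently appear inside longer candidates, multi-word technical phrases tend to rank above isolated common words.
</p>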
        <p>Because RAKE works on single documents, the frequent extraction of
non-informative standard key-phrases like section headings ("Related Work", etc.)
is expected. It can be avoided by detecting and eliminating those phrases based
on an information-theoretic measure like TF-IDF (Term Frequency - Inverse
Document Frequency) [2] or Importance Weight [11]: We chose TF-IDF and keep
only those phrases which contain at least one term with a value above a certain
threshold (to be set as a hyper-parameter). The resulting list usually is still
too large. Therefore, we select the n most frequent phrases. In the experiment
described here, we chose 200 key-phrases (see Table 1). Examples from this set
of key-phrases are "search engine" or "application programming interface"; more
are depicted in Fig. 5. The selected key-phrases define the gold standard
for F1 value optimization.</p>
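        <p>The TF-IDF filtering step might be sketched like this (illustrative only; the whitespace tokenization and the threshold handling are assumptions, not the authors' exact procedure):

```python
import math

def max_tfidf(term, tokenized_docs):
    # highest TF-IDF value the term reaches in any document of the collection
    n = len(tokenized_docs)
    df = sum(1 for toks in tokenized_docs if term in toks)
    if df == 0:
        return 0.0
    idf = math.log(n / df)
    return max(toks.count(term) for toks in tokenized_docs) * idf

def filter_phrases(phrases, docs, threshold):
    # keep a phrase only if at least one of its terms exceeds the TF-IDF threshold
    tokenized = [doc.lower().split() for doc in docs]
    return [p for p in phrases
            if any(max_tfidf(t, tokenized) > threshold for t in p.split())]
```

Terms that occur in nearly every document (as section headings do) get a near-zero IDF and are filtered out, while collection-specific terms survive.
</p>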
        <p>Model Training. The key-phrases together with the Main-CPC and Further-CPC
labels define the set of StarSpace labels to be trained (see Fig. 5(b) for
examples). Taking the abstracts and the labels, the StarSpace model is trained
(Fig. 2 top) with a pre-determined number of iterations on the training set. From
the trained model we export the embedding vectors of the labels and construct
a semantic graph that represents the cosine-similarity based k-nearest-neighbor
relations of the labels (Fig. 2 bottom). This graph serves as a human-readable
quality reference of the model. It is not directly used for the prediction workflow.</p>
        <p>Fig. 2: The training workflow. Patent abstracts, CPC labels and key-phrases enter
model training of the StarSpace model; from the model, embedding vectors are
extracted, nearest neighbors are computed and visualized as a semantic graph.</p>
        <p>Fig. 3: The prediction workflow. Patent abstracts are fed into the StarSpace
model which computes CPC categories and tags.</p>
        <p>To optimize hyper-parameters we used a fixed training dataset of 13,000
documents and a test set of 8,800 documents (a 60%/40% split). We evaluated
model performance for the CPC scheme from level 1 (Section) to level 4
(Maingroup) (see Fig. 1). Results are reported exclusively for level 3 (Subclass), because
this was the most detailed level for which we could achieve satisfactory results.</p>
        <p>The StarSpace algorithm has several hyper-parameters4 which need to be
explored in separate evaluations. We optimized 9 of them (see Table 2).</p>
        <p>Table 2: Optimized StarSpace hyper-parameters.
iterations: the number of training iterations; an iteration includes n minibatches.
minCount: the minimum frequency of terms; less frequent terms are eliminated.
ngrams: ngrams of up to n terms.
dim: the dimension of the embedding vectors.
lr: the learning rate; learning rates are set to &lt;= 0.05.
batchSize: the number of items in a minibatch.
loss: the loss function; hinge (i.e. margin ranking) or softmax.
similarity: the similarity measure; cosine similarity or dot product of embedding vectors.
adagrad: the stochastic gradient optimizer adagrad can be switched on or off.</p>
        <p>Model Prediction. New documents (without CPC-label) are assigned their
CPC-labels and key-phrases by the trained StarSpace model (see Fig. 3). For
each test document the model outputs a weight for each of the labels. Therefore,
we need another hyper-parameter, weight-threshold, to cut off the list of output
labels sorted decreasingly by weight, to achieve adequate F1 values.</p>
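        <p>The weight-threshold cut-off can be sketched as follows (illustrative; predict_labels is a hypothetical helper, not part of the StarSpace API):

```python
def predict_labels(label_weights, weight_threshold):
    # sort labels by decreasing model weight and cut off at the threshold
    ranked = sorted(label_weights.items(), key=lambda item: item[1], reverse=True)
    return [label for label, weight in ranked if weight >= weight_threshold]
```

Raising the threshold trades recall for precision, which is why it is tuned like any other hyper-parameter against the F1 value.
</p>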
      </sec>
      <sec id="sec-4-3">
        <title>Results</title>
        <p>Attainable Model Performance. Figure 4 shows a typical development of
F1 and coverage-rank values during a training run of 640 iterations, a
weight-threshold of 0.35 and otherwise optimal StarSpace parameters.</p>
        <sec id="sec-4-3-1">
          <title>4 see https://github.com/facebookresearch/StarSpace</title>
          <p>We see that
optimal values of F1 and coverage-rank occur in the same range of iterations. Note
that large F1 values but small coverage-rank values are better. The overall F1
values are not very competitive. This is partly due to the limited number of
documents we use. Moreover, optimizing the F1 value is only a secondary goal.
It only makes sense for the Main-CPC values, because they are single-label
categories. For the Further-CPC labels, and a fortiori for the key-phrases, we cannot
define the F1 measure in a fully consistent way. This would require a predefined
ordering on the multi-label categories, which is not given. Overall, the behavior
of the different label sets is as expected: the single-label Main-CPC categories
show better performance with respect to F1 compared to the multi-label
categories Further-CPC and key-phrases.</p>
          <p>The more important evaluation criterion is the coverage-rank, because it gives
an estimate of the precision of the output of non-sorted multi-labels. Here we see
the Main-CPC labels again performing best, as expected. The second-best
performance of the key-phrases, and the rather large distance of the Further-CPC values
from the other two cases, is not expected and needs an explanation: All Further-CPC
labels are drawn from the same category system as the Main-CPC labels. The
most relevant of them is the Main-CPC label, and all others are Further-CPC
labels. The sequence of CPC categories may thus be different for thematically
closely related patent abstracts and result in different Main/Further-CPC label
sets. This seems to be more difficult for a model to learn than categorizations
from disjoint label sets. The fact that we have more Further-CPC labels than
keywords may also add to the performance differences.</p>
          <p>Semantic Tagging. A trained StarSpace model contains exportable embedding
vectors for both the terms occurring in the training documents and all category
labels. This allows us to define a k-nearest-neighbor relation on the labels via the
cosine-similarity of their embedding vectors. A similar relation exists between the
label embeddings and document texts based on the bag-of-ngrams representation
of the documents5. This allows us to assign k-nearest-neighbor key-phrase labels as
semantic tags to documents. It is difficult to rate the appropriateness of such
tagging directly. We therefore display a k-nearest-neighbor graph of labels from
all three categories in Fig. 5.</p>
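          <p>The construction of such a k-nearest-neighbor label graph from exported embedding vectors might look like this (a sketch under the assumption that the embeddings are available as a mapping from label to vector; not the authors' visualization code):

```python
import math

def knn_edges(embeddings, k):
    # directed k-nearest-neighbor edges between labels,
    # weighted by cosine similarity of their embedding vectors
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    edges = []
    for src, u in embeddings.items():
        sims = sorted(((cos(u, v), dst) for dst, v in embeddings.items()
                       if dst != src), reverse=True)
        for sim, dst in sims[:k]:
            edges.append((src, dst, sim))
    return edges
```

The edge list can then be handed to any graph-drawing tool; since each node points to its own top-k neighbors, the relation is directed, which matches the edges in Fig. 5.
</p>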
          <p>This sub-graph is centered around the Main-CPC level 3 category "G06F
electric digital data processing" and shows the neighboring color-coded
Main-CPC (red), Further-CPC (light blue) and key-phrase (cyan) labels6. The
complete graph contains all 550 labels as nodes. The directed edges in the graph
encode the cosine similarity between the label embeddings. More similar labels are
connected by stronger edges. Note that the linear distance of labels in this graph
therefore is not an indicator of their embedding similarity. The edge color is set
by its source label. In particular, we can observe that the Main-CPC labels and
the Further-CPC labels of identical categories (for example G06F) are strongly
connected in both directions, as one would expect.</p>
          <p>Semantic tagging now works as follows: if a document is classified, for example,
as M G06F, it gets assigned the Further-CPC labels G06F and H04L, as well as
the key-phrases "search engine", "client system", "operating system", "computer
processor" and possibly more key-phrases that are not displayed in this graph
cutout. This tagging behavior is a major difference from other tagging algorithms
in that it may assign key-phrases to a document that are not contained in the
document itself.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>Limitations and Recommendations</title>
        <p>The classification and tagging workflow presented here has some intrinsic
limitations which we will shortly discuss in this section.</p>
        <p>- Specificity of key-phrases: We advise to investigate the specificity of the
key-phrases that are extracted by the RAKE algorithm followed by TF-IDF
filtering. Depending on the particular properties of a training collection,
many of the key-phrases may occur in a large number of multi-label
categories. It is up to the experimenter to create a mix of more frequent and
more specific key-phrases if required.
- Number of labels: Though scalable in a large range, there surely exist
upper limits on the number of labels in a multi-label classification regime.
These limits are related to the number of documents in the training set, but
also to the skewedness of label distributions. We did not run quantitative
investigations on this topic, but from our general experience with StarSpace</p>
        <sec id="sec-4-4-1">
          <title>5 For details see https://github.com/facebookresearch/StarSpace</title>
        </sec>
        <sec id="sec-4-4-2">
          <title>6 For details see https://www.cooperativepatentclassification.org/cpcSchemeAndDefinitions/table</title>
          <p>Fig. 5(a): CPC category descriptions.
G06F: Electric digital data processing;
H04: Electric communication technique;
H04B: Transmission;
H04H: Broadcast communication;
H04L: Transmission of digital information;
H04N: Pictorial communication;
H04W: Wireless communication networks.</p>
          <p>models in several domains we would state the following: The number of
labels should not exceed 1-2% of the number of training data, and with
respect to skewedness of distribution the frequency ratio of the least frequent
and the most frequent label should not exceed 0.01. One way to circumvent
the limit on label numbers would be to split labels into subsets and train
several StarSpace models, one on each subset. Doing this, one has to take
into account that the label weights in the model output cannot be compared
across models. Therefore it makes sense to define subsets accordingly, for
example category labels, frequent key-phrases, and specific key-phrases.
- Model and processing resources: StarSpace models can be very large
with large numbers of training data and large n for the ngram parameter.
Model sizes of more than 10 GB are common, which also require
corresponding RAM sizes to process. The StarSpace program is thread-parallel, but
training wall-clock times can nevertheless exceed a day for large training
sets and many training iterations. Compared to training times, the
prediction time of a single document is small, in the range of milliseconds.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We presented a detailed three-part workflow that combines multi-label
classification with semantic tagging, demonstrated on patent abstracts with more
than 200 CPC categories. A large annotated training set is needed to accomplish
good results. The semantic tagging is based on a set of key-phrases extracted
by an unsupervised algorithm from a training set. The predicted key-phrases do
not have to occur literally in the tagged document. The number of labels and
key-phrases is highly scalable, given sufficient training data.</p>
      <p>For future work, we plan to test our approach by replacing StarSpace with
a deep neural network architecture. We already performed preliminary
experiments with Transformer architectures, i.e. BERT [6], on the patent dataset and
also on other textual datasets with different classification systems. The results
on the patent dataset suggest that the performance of BERT is significantly
worse than that of StarSpace with this amount of data, while tests of both StarSpace
and BERT on much larger datasets resulted in equal performance. We are planning
to consolidate this hypothesis in more experiments.</p>
      <p>Acknowledgements. We thank the project team of Fraunhofer Digital for the
opportunity, and Sven Giesselbach for helpful comments. This research has been
funded by the Federal Ministry of Education and Research of Germany as part
of the competence center for machine learning ML2R (01IS18038B).</p>
      <p>References
2. Aizawa, A.: An information-theoretic perspective of tf-idf measures. Information Processing &amp; Management 39(1), 45-65 (2003)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan), 993-1022 (2003)
4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135-146 (2017)
5. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273-297 (1995)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
7. Elman, J.L.: Finding structure in time. Cognitive Science 14(2), 179-211 (1990)
8. Hirschmeier, S., Schoder, D.: Combining word embeddings with taxonomy information for multi-label document classification. In: Proceedings of the ACM Symposium on Document Engineering 2019. pp. 1-4 (2019)
9. Jimenez-Ruiz, E., Agibetov, A., Chen, J., Samwald, M., Cross, V.: Dividing the ontology alignment task with semantic embeddings and logic-based modules. arXiv preprint arXiv:2003.05370 (2020)
10. Joachims, T.: SVMlight: Support vector machine. http://svmlight.joachims.org/, University of Dortmund 19(4) (1999)
11. Leopold, E., Kindermann, J.: Text categorization with support vector machines: How to represent texts in input space? Machine Learning 46(1-3), 423-444 (2002)
12. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing Systems. pp. 2177-2185 (2014)
13. Mahata, D., Kuriakose, J., Shah, R., Zimmermann, R.: Key2Vec: Automatic ranked keyphrase extraction from scientific articles using phrase embeddings. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). pp. 634-639 (2018)
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111-3119 (2013)
15. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532-1543 (2014)
16. Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., Varma, M.: Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In: Proceedings of the 2018 World Wide Web Conference. pp. 993-1002 (2018)
17. Rose, S., Engel, D., Cramer, N., Cowley, W.: Automatic keyword extraction from individual documents. Text Mining: Applications and Theory 1, 1-20 (2010)
18. Wu, L.Y., Fisch, A., Chopra, S., Adams, K., Bordes, A., Weston, J.: StarSpace: Embed all the things! In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
19. Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8), 1819-1837 (2013)
20. Zhang, Q., Sun, Z., Hu, W., Chen, M., Guo, L., Qu, Y.: Multi-view knowledge graph embedding for entity alignment. arXiv preprint arXiv:1906.02390 (2019)
21. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. pp. 649-657. NIPS'15 (2015)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Adhikari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ram</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          : Docbert:
          <article-title>BERT for document classification</article-title>
          . arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>08398</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>