=Paper=
{{Paper
|id=Vol-2266/T3-7
|storemode=property
|title=idrbt-team-a@IECSIL-FIRE-2018 : Relation Categorization for Social Media News Text
|pdfUrl=https://ceur-ws.org/Vol-2266/T3-7.pdf
|volume=Vol-2266
|authors=N. Satya Krishna,S. Nagesh Bhattu,D. V. L. N. Somayajulu
|dblpUrl=https://dblp.org/rec/conf/fire/KrishnaBS18
}}
==idrbt-team-a@IECSIL-FIRE-2018 : Relation Categorization for Social Media News Text==
idrbt-team-a@IECSIL-FIRE-2018: Relation Categorization for Social Media News Text

N. Satya Krishna (2,3), S. Nagesh Bhattu (1), and D. V. L. N. Somayajulu (3)

1 NIT Tadepalligudem, West Godavari District, Andhra Pradesh, India, nageshbhattu@nitandhra.ac.in
2 IDRBT, Road No.1 Castle Hills, Masab Tank, Hyderabad, Telangana, India, satya.krishna.nunna@gmail.com
3 NIT Warangal, Telangana, India, soma@nitw.ac.in

Abstract. This working note presents a statistical classifier for text classification that uses the entity-relationship information present in the input text. We observed that parts-of-speech (POS) tags and named-entity information help predict the relationship between entities. We also present the procedure for predicting POS tags and named entities, which we treat as the sources of information about entity relationships. These features (POS tags and named entities), together with the words of the input sentence, are used as input features to classify the sentence into one of the predefined relationship classes. The note also reports the experimental details and performance of this classifier on five Indian language datasets: Hindi, Kannada, Malayalam, Tamil and Telugu.

Keywords: Relation Extraction · Parts-Of-Speech tagging · NER · Logistic regression · IR

1 Introduction

Relation Extraction (RE) is an important subtask in the Natural Language Processing (NLP) pipeline that converts unstructured data into structured data by extracting the relationship information among the entities in natural text. The main goal of an RE task is to identify and extract the relationships between two or more entities in the given unstructured data and to classify the given text based on the extracted relationship.

With the increase in internet usage, unstructured digital data has been growing exponentially in the form of blogs, research documents, posts, tweets, news articles, and question-answering forums. This unstructured digital data contains important information in hidden form. The objective of Information Retrieval (IR) is to develop tools for extracting this information automatically. Doing so requires converting the unstructured digital data into structured form by predicting the named entities and the relationships that exist among those entities.

For example, consider the sentence shown in Table 1 along with its POS tags and named entities. The sentence has three entities: person, occupation, and organization. We can extract this entity information using NER tools. With the entity information alone, an RE algorithm identifies which entities exist in the given text, but not the relationships among them. In this example there are two relationships among the three entities. The first relationship, working as, holds between john (person) and assistant professor (occupation). The second relationship, working in, holds between john (person) and IITD (organization).

Table 1. Example sentence-1

Sentence 1:      john    working  as     an     assistant   professor   in     IITD
POS tags:        NNP     V_VM     PSP    DT     JJ          NN          PSP    NNP
Named entities:  person  other    other  other  occupation  occupation  other  organization
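To make this representation concrete, the following minimal sketch (ours, not part of the shared-task code) shows how a tagged sentence such as the one in Table 1 can be held as parallel word, POS and named-entity lists and flattened into combined tokens; the word|POS|NE joining scheme is an illustrative assumption.

```python
# Minimal sketch (illustrative assumption, not the authors' implementation):
# represent the tagged sentence of Table 1 as parallel lists and flatten each
# position into a combined word|POS|NE feature token.

words = ["john", "working", "as", "an", "assistant", "professor", "in", "IITD"]
pos_tags = ["NNP", "V_VM", "PSP", "DT", "JJ", "NN", "PSP", "NNP"]
ne_tags = ["person", "other", "other", "other",
           "occupation", "occupation", "other", "organization"]

def to_feature_tokens(words, pos_tags, ne_tags):
    """Combine each word with its POS tag and named-entity tag."""
    return ["|".join(triple) for triple in zip(words, pos_tags, ne_tags)]

print(to_feature_tokens(words, pos_tags, ne_tags))
# ['john|NNP|person', 'working|V_VM|other', ..., 'IITD|NNP|organization']
```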
The named entities in the second example, shown in Table 2, are person, location, event and datenum. The corresponding POS tags and words are the sources used to find the relationship among these entities. In this example two relationships, discovered and discovered in, exist. The discovered relationship holds between Columbus (person) and America (location); the second relationship, discovered in, holds between America (location) and 1492 (datenum).

Table 2. Example sentence-2

Sentence 2:      Columbus  discovered  America   in     1492
POS tags:        NNP       V_VM        NNS       PSP    CD
Named entities:  person    event       location  other  datenum

As the above examples show, our approach considers POS tags along with named entities, since they are a good source of information for relationship extraction and text classification. Because verbs, prepositions, and verbs followed by prepositions most often carry the relationship information, in this work we built a machine learning based classifier that extracts a generic relationship pattern using POS tags and named entities along with the words of each training example.

RE has many applications. The bio-medical tool EDGAR, presented in [8], extracts relationship information between drugs and genes associated with cancer; it is implemented using NLP tools that generate POS tags. A recent survey on relation extraction from unstructured text [7] describes the different types of RE models implemented for different applications, along with their datasets and working procedures, categorizing them by the type of classifier. RE is also a pivotal subtask in several natural language understanding applications, such as question answering [4]. The RE task was initially formulated as an essential information extraction task by the seventh Message Understanding Conference (MUC-7) [3]. Miller et al. [5] proposed a generative model that converts unstructured text to structured form by extracting entity-relationship information using parts-of-speech tags and named-entity information. Ru et al. [9] proposed a convolutional neural network (CNN) model for relation classification that takes core dependency phrases as input; these phrases are first computed with a Jaccard similarity score between the relation phrases in a knowledge base and the dependency phrases between two entities.

2 Overview of the Approach

This section describes the overall approach we followed to classify sentences according to the relationship information that exists among their entities. For implementation we divided the approach into two stages. In the first stage, as shown in Figure 1, we extracted for each sentence in the train and test sets the list of features that carries the entity and relationship information; these features are the parts-of-speech tags and named entities of the sentence. In the second stage we train a sentence-level classifier by feeding it the training sentences along with the features extracted in the previous stage. We then apply the trained classifier to classify the test sentences using their POS and entity features.

Fig. 1. Overview of approach
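As a concrete illustration of this two-stage pipeline, the sketch below (our assumption, not the authors' released code) vectorizes pre-computed word|POS|NE feature strings and trains a scikit-learn logistic regression on them; the helper names are hypothetical.

```python
# Sketch of the two-stage approach, assuming stage 1 has already produced one
# space-separated string of word|POS|NE tokens per sentence. The function names
# and the use of scikit-learn are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def train_relation_classifier(train_feature_strings, train_labels):
    # Keep each word|POS|NE token intact instead of splitting on punctuation.
    vectorizer = CountVectorizer(token_pattern=r"\S+")
    X_train = vectorizer.fit_transform(train_feature_strings)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)
    return vectorizer, clf

def predict_relations(vectorizer, clf, test_feature_strings):
    # Apply the trained model to the feature strings of the test sentences.
    return clf.predict(vectorizer.transform(test_feature_strings))
```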
3 Experiments

We experimented on five Indian language datasets using different classifiers. This section describes the problem definition, the feature extraction and classification details, and the datasets together with the experimental results.

3.1 Problem statement

Given a large collection of sentences C, where each sentence is represented as a sequence of words w_0, w_1, w_2, ..., w_n containing a set of entities and a relation among those entities, classify each sentence into one of the class labels given in the training data, using the entity-relationship information present in that sentence.

3.2 POS tagging and NER

As shown in the examples in the introduction, the adverb, verb, conjunction and preposition type words in a sentence are the sources of information for extracting the relation among the existing entities. Hence, we predicted the POS tag for each word in a sentence using different POS taggers for different languages. These POS tagging tools are freely available on the IIIT Hyderabad website (http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php). Although POS taggers are provided for these Indian languages, we used only the Hindi, Tamil and Telugu taggers; their details and source code are available on the same site (http://ltrc.iiit.ac.in/download.php). We predicted the named entities using a deep-learning-based recurrent neural network, a bi-directional Long Short-Term Memory (Bi-LSTM). A detailed description of this model is given in our companion working note submitted to the Arnekt-IECSIL@FIRE2018 shared-task workshop.

3.3 Relationship Extraction

As described in the previous section, we applied different types of machine learning based text classifiers to predict the entity-relationship class labels. We first compared the performance of these classifiers on the Telugu dataset. Since the logistic regression classifier gave the highest accuracy (76.35%) compared with NBEM (67.5%) and MLP (75.98%), we applied it to the remaining datasets.

Naive Bayes EM (NBEM) classifier: NBEM is a semi-supervised classifier [6] that takes the test data along with the training data in order to exploit the information in the unlabeled data when predicting class labels. The training and test datasets are denoted D_l = {(x_0, y_0), (x_1, y_1), (x_2, y_2), ..., (x_l, y_l)} and D_u = {x_{l+0}, x_{l+1}, x_{l+2}, ..., x_{l+u}} respectively. The NBEM classifier learns on the dataset D_n = D_l ∪ D_u by maximizing the objective function in equation (1), where n = l + u, l denotes the number of sentences in the training data, u denotes the number of sentences in the test data, and x_j and y_j denote the j-th input and its corresponding output label:

maximize \sum_{j=1}^{|D_l|} \log \phi(x_j, y_j) + \sum_{k=1}^{|D_u|} \log \phi(x_k, \hat{y}_k)    (1)

The model parameters are learned with the Expectation-Maximization (EM) algorithm: in each iteration it predicts the label of each test sample in the Expectation step (E-step) and updates the parameters in the Maximization step (M-step).
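A minimal hard-label sketch of this EM procedure, assuming scikit-learn's MultinomialNB as the underlying Naive Bayes model (our assumption, not the authors' code), is shown below.

```python
# Hard-EM sketch of NBEM (equation 1): initialise Naive Bayes on the labelled
# data, then alternate between labelling the unlabelled data (E-step) and
# refitting on D_l ∪ D_u (M-step). Illustrative only.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def nb_em(X_labeled, y_labeled, X_unlabeled, n_iter=10):
    clf = MultinomialNB().fit(X_labeled, y_labeled)   # fit on labelled data only
    for _ in range(n_iter):
        y_hat = clf.predict(X_unlabeled)              # E-step: predict labels for unlabelled data
        X_all = vstack([X_labeled, X_unlabeled])      # D_n = D_l ∪ D_u
        y_all = np.concatenate([y_labeled, y_hat])
        clf = MultinomialNB().fit(X_all, y_all)       # M-step: re-estimate parameters
    return clf
```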
Logistic Regression classifier: In logistic regression [10] we use a discriminative probabilistic method to build the classifier. Considering the training data D_n = {(x_0, y_0), (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, the classifier learns its parameters by maximizing the objective function in equation (2), where \phi is the conditional probability of label y_j given the input observation x_j. We compute this probability with the softmax function in equation (3), where L denotes the set of output labels in the given dataset and \theta_{y_l} is a parameter vector whose size equals the length of the input:

maximize \sum_{j=1}^{|D_n|} \log \phi(y_j \mid x_j)    (2)

\phi(y_j \mid x_j) = \frac{\exp(\theta_{y_j}^{T} x_j)}{\sum_{l=1}^{|L|} \exp(\theta_{y_l}^{T} x_j)}    (3)

Multilayer Perceptron: The MLP is a feedforward neural network that uses a non-linear activation function in each neuron of the hidden and output layers. We used the back-propagation algorithm to learn the MLP classifier on the training data: in the training phase, the error made for each input observation is propagated backwards, and the weights are adjusted by minimizing the objective function in equation (4), where E_i(x_n) is the error at the i-th output node for the n-th input observation. We applied gradient descent to update the weights.

E_i(x_n) = \frac{1}{2} \sum_{n} (y_i^n - \hat{y}_i^n)^2    (4)

3.4 Dataset

The Arnekt-IECSIL@FIRE2018 shared task [2] provides five Indian language datasets [1]: Hindi, Kannada, Malayalam, Tamil and Telugu. Each language dataset consists of three files with text in the corresponding language script: a training file and two test files (test-1 and test-2). All files contain one input sequence per line; only the training file carries a label for each input sequence. Each test file holds 20% of the overall dataset. The training files have different numbers of distinct labels: 16, 14, 13, 17 and 14 class labels for Hindi, Kannada, Malayalam, Tamil and Telugu respectively. Table 3 lists the number of sentences, number of words, and vocabulary size of each file in every dataset.

Table 3. Corpus description

Dataset    File type  # Sentences  # Words    Vocabulary size
Hindi      Train      56775        1134673    75808
           Test-1     18925        380574     37770
           Test-2     18926        374433     37218
Kannada    Train      6637         123104     35961
           Test-1     2213         40013      15859
           Test-2     2213         40130      15862
Malayalam  Train      28287        439026     85043
           Test-1     2492         147259     39319
           Test-2     2492         144712     38826
Tamil      Train      64833        882996     130501
           Test-1     21611        294115     61754
           Test-2     21612        292308     61366
Telugu     Train      37039        494500     76697
           Test-1     12347        163343     34984
           Test-2     12347        163074     35092

3.5 Results

The model is evaluated with the following metrics, as specified in the shared-task guidelines:

Accuracy = \frac{\text{No. of sentences assigned the correct label}}{\text{No. of sentences in the dataset}}    (5)

Precision (P_i) = \frac{\text{No. of sentences correctly labeled with label } i}{\text{No. of sentences labeled with label } i}    (6)

Recall (R_i) = \frac{\text{No. of sentences correctly labeled with label } i}{\text{Total no. of sentences with label } i \text{ in the test data}}    (7)

F-score (F_i) = \frac{2 P_i R_i}{P_i + R_i}    (8)

Overall F-score (F) = \frac{1}{|L|} \sum_{i \in L} F_i    (9)
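The metrics in equations (5)-(9) can be computed, for example, with scikit-learn as in the sketch below; the macro-averaged F1 corresponds to the overall F-score of equation (9).

```python
# Sketch of the evaluation metrics (5)-(9) using scikit-learn.
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)                 # equation (5)
    p, r, f, _ = precision_recall_fscore_support(             # equations (6)-(8), one value per label
        y_true, y_pred, average=None, zero_division=0)
    overall_f = f1_score(y_true, y_pred, average="macro")     # equation (9)
    return accuracy, p, r, f, overall_f
```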
Tables 4 and 5 summarize the accuracy and F-scores of the model on test sets 1 and 2 for the five languages. In both the pre-evaluation and the final evaluation, this model achieved the highest accuracy on the Kannada dataset compared with the other models presented in this competition; for the remaining four languages the model ranked second. The same ordering holds for its F-score performance.

Table 4. Accuracy of the model in Pre-Evaluation and Final-Evaluation

                              Hindi  Kannada  Malayalam  Tamil  Telugu
Pre-Evaluation (Testset-1)    79.92  57.98    59.43      78.43  76.35
Final-Evaluation (Testset-2)  79.21  57.34    57.86      78.44  76.13

Table 5. F1-scores of the model in Final-Evaluation

                              Hindi  Kannada  Malayalam  Tamil  Telugu
Final-Evaluation (Testset-2)  31.64  33.5     28.31      51.59  45.86

According to our analysis, the comparatively low F-scores are due to errors in the predicted POS tags and named entities, which caused the model to misclassify the given sentences.

4 Conclusion

We presented an approach that uses POS tag information for relation categorization. The approach is general and applicable to all the languages used in the shared task. We could not obtain POS tags for Malayalam and Kannada; the results could be further improved with this additional information, which is crucial for success in the shared task.

References

1. Barathi Ganesh, H.B., Soman, K.P., Reshma, U., Mandar, K., Prachi, M., Gouri, K., Anitha, K., Anand Kumar, M.: Information extraction for conversational systems in Indian languages - Arnekt IECSIL. In: Forum for Information Retrieval Evaluation (2018)
2. Barathi Ganesh, H.B., Soman, K.P., Reshma, U., Mandar, K., Prachi, M., Gouri, K., Anitha, K., Anand Kumar, M.: Overview of Arnekt IECSIL at FIRE-2018 track on information extraction for conversational systems in Indian languages. In: FIRE (Working Notes) (2018)
3. Chinchor, N.: Proceedings of the 7th Message Understanding Conference. Columbia, MD: Science Applications International Corporation (SAIC) (1998)
4. Hazrina, S., Sharef, N.M., Ibrahim, H., Murad, M.A.A., Noah, S.A.M.: Review on the advancements of disambiguation in semantic question answering system. Information Processing & Management 53(1), 52-69 (2017)
5. Miller, S., Fox, H., Ramshaw, L., Weischedel, R.: A novel use of statistical parsing to extract information from text. In: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pp. 226-233. NAACL 2000, Association for Computational Linguistics, Stroudsburg, PA, USA (2000), http://dl.acm.org/citation.cfm?id=974305.974335
6. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text classification from labeled and unlabeled documents using EM. Machine Learning 39(2-3), 103-134 (2000)
7. Pawar, S., Palshikar, G.K., Bhattacharyya, P.: Relation extraction: A survey. arXiv preprint arXiv:1712.05191 (2017)
8. Rindflesch, T.C., Tanabe, L., Weinstein, J.N., Hunter, L.: EDGAR: Extraction of drugs, genes and relations from the biomedical literature. In: Biocomputing 2000, pp. 517-528. World Scientific (1999)
9. Ru, C., Tang, J., Li, S., Xie, S., Wang, T.: Using semantic similarity to reduce wrong labels in distant supervision for relation extraction. Information Processing & Management 54(4), 593-608 (2018)
10. Walker, S.H., Duncan, D.B.: Estimation of the probability of an event as a function of several independent variables. Biometrika 54(1-2), 167-179 (1967)