=Paper=
{{Paper
|id=Vol-1690/paper36
|storemode=property
|title=Towards Building Open Knowledge Base From Programming Question-Answering Communities
|pdfUrl=https://ceur-ws.org/Vol-1690/paper36.pdf
|volume=Vol-1690
|authors=Wei Emma Zhang,Ermyas Abebe,Quan Z. Sheng,Kerry Taylor
|dblpUrl=https://dblp.org/rec/conf/semweb/ZhangAST16
}}
==Towards Building Open Knowledge Base From Programming Question-Answering Communities==
Towards Building Open Knowledge Base From Programming Question-Answering Communities Wei Emma Zhang1 , Ermyas Abebe2 , Quan Z. Sheng1 , and Kerry Taylor3 1 School of Computer Science, The University of Adelaide, Australia 2 IBM Research Australia, Australia 3 Research School of Computer Science, Australian National University, Australia Abstract. In this paper, we propose the first system, so-called Open Programming Knowledge Extraction (OPKE), to automatically extract knowledge from programming Question-Answering (QA) communities. OPKE is the first step of building a programming-centric knowledge base. Data mining and Natural Language Processing techniques are leveraged to identify duplicate questions and construct structured information. Preliminary evaluation shows the effectiveness of OPKE. 1 Introduction Recent research on Open Information Extraction (Open IE) aims to automati- cally extract machine-understandable structured information from unstructured natural language texts [6, 4]. Existing works extract knowledge from web corpus holding text from, for example, news and articles. No effort has been made on ex- tracting knowledge from programming Question-Answering (QA) communities, e.g. Stack Overflow (SO)1 . SO is a code-centric QA community that provides working code solutions to natural language queries. It serves as a knowledge resource for software developers. In this paper, we provide the first work that extracts questions and answers from SO and outputs structured triples. Our method is called Open Programming Knowledge Extraction (OPKE). It is the initial step of building a programming-centric knowledge base. OPKE extracts questions and answers for a post and form a triple formatted as h“question”; “answer”; “tag1”,“tag2”,...i. The tags can be easily obtained and the answers can be obtained by choosing the accepted answers marked by the questioner. But it is challenging to extract information from questions. The key challenge is that questions with similar meanings/purpose exist. This is because questioner fail to search out answers from previous questions, so they post a new one. In Stack Overflow, such questions are marked as “Duplicate”. However this marking is manually performed which could not identify all the duplicate questions and not all communities provide duplicate identification. Recent works (e.g.,[1]) perform binary classification on post pairs by considering the features of these pairs. However, these works consider every two posts as a pair, which 1 http://stackoverflow.com/ 2 Wei Emma Zhang et al. lead to high computation cost. Instead, we propose to use features of posts them- selves to identify the duplicates. Specifically, we transform every post title to a numerical vector and perform clustering on a set of vectors. Then we consider the posts that are closest to a master post (i.e., the original post) as the dupli- cate posts. After identifying and removing duplicate posts, we extract the key information from master posts and rewrite them to the “question” component of knowledge triples. This can be done by using NLP techniques. Finally, triples are constructed by given question and answer, as well as the tags. 2 Building Programming Knowledge Base The input of OPKE is the post contents of programming QA communities and the output is a set of triples. Figure 1 illustrates the extraction process which contains three main steps: Preprocessing, Question-answer Extraction and Triple Generation described as following. Preprocessing Question-Answer TripleExtraction Generation • Extract web page • Identify • Form triples contents paraphrased … with questions • Prune invalid posts questions and answers • Obtain language • Select answer specific tag Fig. 1. Knowledge Extraction Process from Programming Community 2.1 Preprocessing In the preprocessing step, OPKE parses the post content of the target program- ming community and extracts all the questions and answers. Invalid posts which have no accepted answers are pruned. For SO, post contents can be retrieved through Stack Exchange API2 . The accepted answer and tags can be obtained by considering two parameters ‘AcceptedAnswerId ’ and ‘Tags’ respectively. 2.2 Question-Answer Extraction OPKE extracts the titles of the posts as the questions. It first identifies duplicate posts, then extract question information using Part Of Speech (POS) tagging and dependency parsing. The details are as follows: – Identify duplicate questions. OPKE firstly transforms all questions to vectors leveraging word embeddings [2]. Then it performs clustering based on these vectors. We use the batch solution of K-means for processing large scale post data [5]. Then for each cluster, a K-D tree is built. We mark the K nearest neighbors of every master post as duplicate. Duplicate questions are pruned. 2 https://api.stackexchange.com/ Building Programming Community Knowledge Base 3 – Parse master questions. We parse the master questions and rewrite it to capture the meaning of the questions. Our method is inspired by Open IE work [6], where dependency parsing has been applied to detect query mean- ing. We discuss OPKE parsing process using an example question “Java - How do I convert from int to String?”: i) remove language specific words if exist (e.g. Java), ii) parse dependencies of questions (using method in [3], see Figure 2(a)), iii) identify subject (I ), if subject is first person pronoun, then parse object part under the root in following steps (see the circled part in Figure 2(a)), iv) identify root action (covert), v) identify relation- ship (hconvert; f romint, tostringi) and vi) rewrite question (“convert int to string”). For more details, please refer to [6]. convert Final Triple for “Java - How do I convert int to string”: How from to <“convert int to string”; pobj pobj “Normal ways would be Integer.toString(i) or String.valueOf(i)…….”; do I “java”> int string (a) Dependency Parse Tree (b) Generated Tuple Fig. 2. Parse Questions and Generate Triples OPKE simply uses the accepted answer as the final answer for a post. Al- though it might not be the best solution with highest votes, we believe that the questioner has the judgment on the solutions. 2.3 Triple Generation After obtaining the questions and answers from previous steps, OPKE gener- ate triples with the format of h“question”; “answer”; “tag1”,“tag2”,...i where question and answers are obtained from Section 2.2 and tags are obtained from Section 2.1. So for the example question, the triple is depicted as in Figure 2(b). 3 Experiment The performance of OPKE system has been evaluated by performing extraction on Stack Overflow dump datasets3 (the version is 2016-06-13). We choose valid posts with non-empty titles and have accepted answers as the master posts. We obtain the duplicate and linked posts of these master posts. These result in 3 https://archive.org/details/stackexchange 4 Wei Emma Zhang et al. 339,990 posts in total. We define the evaluation metrics, clustering recall, as the number of correctly labeled posts (i.e., duplicate or not) divided by the total posts. Figure 3 reports the clustering recall rate on different number of clusters for duplicate and linked questions. Duplicate has higher recall rate than linked ones because linked posts do not always represent the same meaning with the master questions. When the cluster number is the 1/2000 of the total number of posts, the recall rate achieves the best. Ƶ Ɖ ů ŝ Đ Ă ƚ Ğ > ŝ Ŷ Ŭ ϰ Ϭ ϯ ϴ й ϯ ϲ й ϯ ϲ й ϯ ϱ й ϯ ϰ й ϯ ϯ й ϯ Ϯ й ϯ Ϭ ϯ Ϭ й Z Ğ Đ Ă ů ů Z Ă ƚ Ğ Ϯ ϰ й Ϯ Ϯ й Ϯ Ϭ ϭ Ϭ Ϭ E ͬ ϱ Ϭ Ϭ E ͬ ϭ Ϭ Ϭ Ϭ E ͬ Ϯ Ϭ Ϭ Ϭ E ͬ ϱ Ϭ Ϭ Ϭ E ͬ ϭ Ϭ Ϭ Ϭ Ϭ E Ƶ ŵ ď Ğ ƌ Ž Ĩ ů Ƶ Ɛ ƚ Ğ ƌ Ɛ ; E с Ŷ Ƶ ŵ ď Ğ ƌ Ž Ĩ ƚ Ž ƚ Ă ů Ɖ Ž Ɛ ƚ Ɛ Ϳ Fig. 3. Clustering Recall Rate on Various Number of Clusters 4 Conclusion and Future Works In this work, we discuss our attempt to extract structured information from programming QA communities. The method, which combines data mining and NLP techniques, can also be applied to other QA communities. In the future, we will develop more fine-grained methods to identify paraphrased questions and consider the evolvement of programming language/tools. A programming-centric knowledge base and corresponding QA system will be built. References 1. M. Ahasanuzzaman, M. Asaduzzaman, C. K. Roy, and K. A. Schneider. Mining Duplicate Questions in Stack Overflow. In Proc. of MSR 2016. 2. Q. V. Le and T. Mikolov. Distributed Representations of Sentences and Documents. In Proc. of ICML 2014. 3. M.-C. D. Marneffe, B. MacCartney, and C. D. Manning. Generating Typed Depen- dency Parses From Phrase Structure Parses. In Proc. of LREC 2006. 4. S. Nam, Y. Hahm, S. Nam, and K. Choi. SRDF: korean open information extraction using singleton property. In Proc. of ISWC 2015 (Posters & Demonstrations Track). 5. D. Sculley. Web-Scale K-Means Clustering. In Proc. of WWW 2010. 6. P. Yin, N. Duan, B. Kao, J. Bao, and M. Zhou. Answering Questions with Complex Semantic Constraints on Open Knowledge Bases. In Proc. of CIKM 2015.