Predict Closed Questions on StackOverflow

Galina E. Lezina (galina.lezina@gmail.com)
Artem M. Kuznetsov (whoisnexta@gmail.com)
Ural Federal University

Proceedings of the Ninth Spring Researcher's Colloquium on Database and Information Systems, Kazan, Russia, 2013

Abstract

Millions of programmers use StackOverflow to get high-quality answers to their programming questions every day, and an effective culture of moderation has evolved to safeguard it. More than six thousand new questions are asked on StackOverflow¹ every weekday, and currently about 6% of all new questions end up "closed". The goal of this paper is to build a classifier that predicts whether or not a question will be closed given the question as submitted, along with the reason that the question was closed.

1 Introduction

In recent years question-answering services like StackOverflow have become more and more popular. The amount of knowledge stored in such services has been growing steadily, so moderating them requires ever more resources, and some automation of this process would ease the task. The problem solved in this paper is a small step in this direction. The task was a contest organized by kaggle². It had two stages, public and private, with a public and a private dataset accordingly. In the first stage a user who submitted a solution could see the results immediately, but could submit no more than twice a day. In the private stage users made predictions for the private dataset, and the results became visible only after the competition ended. The results of all participants can be seen on the public leaderboard, but those who submitted after the deadline are not shown there. We submitted our solution after the deadline, and the best position we obtained is 5th with a score of 0.31467.

StackOverflow is a service where users ask questions about programming; it belongs to the StackExchange network, which contains many thematic websites. Questions on StackOverflow can be closed as off topic (OT), not constructive (NC), not a real question (NRQ), too localized (TL), or exact duplicate. The exact duplicate reason was excluded from the competition because it depends on post history. Post history is in fact present in the StackOverflow database dump, but its size is about 6GB in XML format, which requires many resources to analyze.

Off topic is a question that is not on topic for the site or is related to another site in the Stack Exchange network.

Example: Is there a way to turn off the automatic text translation at the MSDN library pages? I do prefer English text but due to having a German IP address Microsoft activates the automatic translation on every new page load which gives me a yellow box with a German translation of the text I am currently hovering over with the mouse. This happens regardless what language is initially set in the right upper corner and regardless of whether I am logged in or not. I can't tell how annoying this is!! Any ideas, anyone?

Too localized is a question that is unlikely to be helpful for anyone in the future; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet.

Example: Is it time to start using HTML5? Someone has to start sometime but is now the time? Is it possible to use the new HTML5 tags and code in such a way as to degrade gracefully?

Not constructive is a question that is not a good fit for the Q&A format. It is expected that answers generally involve facts, references, or specific expertise; such a question will likely solicit opinion, debate, arguments, polling, or extended discussion.

Example: What is the best comment in source code you have ever encountered?

Not a real question is a question where it is difficult to tell what is being asked. Such a question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form.

Example: For a few days I've tried to wrap my head around the functional programming paradigm in Haskell. I've done this by reading tutorials and watching screencasts, but nothing really seems to stick. Now, in learning various imperative/OO languages (like C, Java, PHP), exercises have been a good way for me to go. But since I don't really know what Haskell is capable of and because there are many new concepts to utilize, I haven't known where to start. So, how did you learn Haskell? What made you really "break the ice"? Also, any good ideas for beginning exercises?

The process of question closing involves user voting: users with a certain reputation can vote for a question to be closed with one of the reasons above. When a question gains enough close votes, it is closed by a moderator. This process could therefore be automated if it were possible to predict which questions will be closed.

2 Dataset

The data for this task was provided by kaggle and includes the full train data, which contains 3,664,927 posts, and a sample train data consisting of 178,351 posts. The distribution of the full and sample train data over close reasons is shown in Table 1.

Table 1: Training data distribution over categories

Dataset  NRQ    NC     OT     Open     TL
Train    38622  20897  20865  3575678  8910
Sample   38622  20897  20865  89337    8910

The StackOverflow database dump of August 2012 was also available. The database dump contains information about all users and posts, including the history of post editing, comments, and much more.

3 Related work

User interaction analysis in social media. As mentioned above, questions on StackOverflow are closed by user voting, so user feedback is a very important component and a valuable source of information about post quality. In [1] user relationships were analyzed to obtain a significant amount of quality information. The authors applied link-analysis algorithms to propagate quality scores; the main idea was that "good" answerers write "good" answers. This idea can be carried over to questions: people who do not ask "bad" questions are unlikely to do so in the future. In the process of link analysis a user-user graph was built to represent these relationships. The graph can be denoted G = (V, E), in which the vertex set V stands for the set of users and the edge set E represents the relationships between users.

In [4] the authors classified questions as conversational or informational. In their work they divided people into two categories: answer people, who answer many questions, and discussion people, who interact often with other discussion people. To do so they also analyzed the ego-network of a user's questions and answers. The authors of [7] showed that almost the same user-interaction features are significant when classifying a question as social or non-social.

Text content quality analysis. In [1] features representing grammatical properties of the text were presented. The authors also take into account punctuation and typos, as well as syntactic and semantic complexity. This matters because the content is generated by the users. Their features for text quality analysis were helpful for us because one of the close reasons, not constructive, is essentially a conversational question.

4 Used methods

We compared three methods during our research.

4.2 Liblinear

Linear SVMs have been shown to be highly effective at traditional text categorization [5]. We chose this method because of the amount of data: as mentioned above, there are slightly less than 4 million samples, and we did not balance the data as we did for the random forest classifier. Liblinear does not use kernels and is trained very quickly. It also provides an option to select the regularization parameter C; the value of C that we found to be optimal for our dataset is 1.

4.3 Vowpal Wabbit

Vowpal Wabbit (VW) is a library and a set of learning algorithms developed at Yahoo! Research by John Langford. VW streams the examples to an online-learning algorithm [6], in contrast to parallelizing a batch-learning algorithm over many machines. The default learning algorithm is a variant of online gradient descent; the main difference from vanilla online gradient descent is the fast and correct handling of large importance weights. Various extensions, such as conjugate gradient (CG), mini-batch, and data-dependent learning rates, are included, but we found that the default algorithm works much better on our dataset. We trained VW with the samples in chronological order and, for comparison, in shuffled order; the result for the shuffled data was much worse: 0.3340 versus 0.31467 for the ordered dataset, with the same feature set used for both.

As mentioned above, the algorithm used in Vowpal Wabbit is a modified stochastic gradient descent. Unlike typical online-learning algorithms, which keep at least one weight for every feature, the truncated-gradient approach used in VW makes it possible to induce sparsity in the learned feature weights. Its main idea is to apply a simple rounding rule to the weights: most such methods round coefficients smaller than a threshold to zero after a specified number of steps, while in the truncated gradient the amount of shrinkage is controlled by a gravity parameter g_i. The weights are updated according to the rule

  f(w_i) = T1(w_i − η∇_1 L(w_i, z_i), η g_i, θ)

where T1(v, α, θ) = [T1(v_1, α, θ), ..., T1(v_d, α, θ)] with

  T1(v_j, α, θ) = max(0, v_j − α)  if v_j ∈ [0, θ]
                  min(0, v_j + α)  if v_j ∈ [−θ, 0]
                  v_j              otherwise

Here θ is a threshold and g_i is the gravity parameter: g_i = 0 if i/K is not an integer and g_i = Kg if i/K is an integer, where K is the number of steps after which the weights are shrunk and g ≥ 0 is the base gravity. Together, g_i and θ control the sparsity.

¹ http://stackoverflow.com
² https://www.kaggle.com
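To make the update concrete, the truncated-gradient step above can be sketched in a few lines of NumPy (illustrative code, not the authors' implementation; parameter names follow the formula):

```python
import numpy as np

def t1(v, alpha, theta):
    """Elementwise T1: shrink coefficients inside [-theta, theta] toward zero by alpha."""
    out = np.array(v, dtype=float)
    pos = (out >= 0) & (out <= theta)
    neg = (out < 0) & (out >= -theta)
    out[pos] = np.maximum(0.0, out[pos] - alpha)
    out[neg] = np.minimum(0.0, out[neg] + alpha)
    return out

def truncated_gradient_step(w, grad, eta, i, K, g, theta):
    """One online update: gradient step, then shrinkage on every K-th step (g_i = K*g)."""
    g_i = K * g if i % K == 0 else 0.0
    return t1(w - eta * grad, eta * g_i, theta)

w = np.array([0.05, -0.02, 1.5])
w = truncated_gradient_step(w, np.zeros(3), eta=0.1, i=10, K=5, g=0.02, theta=0.5)
# small weights are pulled toward zero; 1.5 lies outside [-theta, theta] and is untouched
print(w)
```

Shrinking only every K-th step (rather than on every update) is what makes the rounding cheap in a streaming setting.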
η_e is a step size and is calculated as η_e = λ·d^(n−1)·(i/(i+P))^p
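For the liblinear classifier of Section 4.2, a comparable setup can be sketched with scikit-learn's LinearSVC, which wraps liblinear. The toy data below is hypothetical and the authors' actual feature set is not described in this excerpt; only the regularization value C = 1 comes from the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for question texts and their close labels.
texts = [
    "how do i reverse a list in python",
    "what is your favourite code comment ever",
    "is it time to start using html5 yet",
    "cannot bind click handler in jquery after ajax load",
]
labels = ["open", "NC", "TL", "open"]

# LinearSVC trains via liblinear (no kernels); C=1.0 matches the value
# the paper reports as optimal for its dataset.
clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
clf.fit(texts, labels)
print(clf.predict(["best source code comment you have seen"]))  # one of the training labels
```

A bag-of-words/tf-idf representation is only one plausible featurization; any sparse feature matrix can be fed to the same classifier.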
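The counts in Table 1 also show why class balance was a concern for the random forest but less so for liblinear: the full train data is heavily skewed toward open questions. A quick check of the proportions (all numbers taken from Table 1):

```python
# Per-category counts for the full train data, copied from Table 1.
train = {"NRQ": 38622, "NC": 20897, "OT": 20865, "Open": 3575678, "TL": 8910}

total = sum(train.values())
closed = total - train["Open"]
print(f"closed questions in full train data: {closed} ({closed / total:.2%})")
for reason, n in train.items():
    print(f"  {reason:>4}: {n / total:.2%}")
```

In the sample train data, by contrast, open questions are downsampled (89,337 posts), so the closed classes make up roughly half of the sample.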