Predict Closed Questions on StackOverflow

Galina E. Lezina (galina.lezina@gmail.com)
Artem M. Kuznetsov (whoisnexta@gmail.com)
Ural Federal University

Proceedings of the Ninth Spring Researcher's Colloquium on Database and Information Systems, Kazan, Russia, 2013

Abstract

Millions of programmers use StackOverflow to get high-quality answers to their programming questions every day, and an effective culture of moderation has evolved to safeguard it. More than six thousand new questions are asked on StackOverflow¹ every weekday, and currently about 6% of all new questions end up "closed". The goal of this paper is to build a classifier that predicts whether or not a question will be closed given the question as submitted, along with the reason that the question was closed.

1 Introduction

In recent years question-answering services like StackOverflow have become more and more popular. The amount of knowledge stored in such services has been growing steadily, so moderating them requires ever more resources, and some automation of this process would ease the task. The problem solved in this paper is a small step in this direction. The task was a contest organized by kaggle². It had two stages, public and private, with a public and a private dataset accordingly. In the first stage a user who submitted a solution could see the results immediately, but could submit no more than twice a day. In the private stage users made predictions for the private dataset, and the results became visible only after the competition ended. The results of all participants can be seen on the public leaderboard, but those who submitted after the deadline are not shown there. We submitted our solution after the deadline, and the best position we obtained is 5th with a score of 0.31467.

StackOverflow is a service where users ask questions about programming; it belongs to the StackExchange network, which contains many thematic websites. Questions on StackOverflow can be closed as off topic (OT), not constructive (NC), not a real question (NRQ), too localized (TL), or exact duplicate. The exact duplicate reason was excluded from the competition because it depends on post history. Post history is in fact present in the StackOverflow database dump, but its size is about 6GB in XML format, which requires many resources to analyze.

Off topic is a question that is not on topic for the site or is related to another site in the Stack Exchange network.

Example: Is there a way to turn off the automatic text translation at the MSDN library pages? I do prefer English text but due to having a German IP address Microsoft activates the automatic translation on every new page load which gives me a yellow box with a German translation of the text I am currently hovering over with the mouse. This happens regardless what language is initially set in the right upper corner and regardless of whether I am logged in or not. I can't tell how annoying this is!! Any ideas, anyone?

Too localized is a question that is unlikely to be helpful for anyone in the future; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet.

Example: Is it time to start using HTML5? Someone has to start sometime but is now the time? Is it possible to use the new HTML5 tags and code in such a way as to degrade gracefully?

Not constructive is a question that is not a good fit for the Q&A format. It is expected that answers generally involve facts, references, or specific expertise; such a question will likely solicit opinion, debate, arguments, polling, or extended discussion.

Example: What is the best comment in source code you have ever encountered?

Not a real question is a question where it is difficult to tell what is being asked. Such a question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form.

Example: For a few days I've tried to wrap my head around the functional programming paradigm in Haskell. I've done this by reading tutorials and watching screencasts, but nothing really seems to stick. Now, in learning various imperative/OO languages (like C, Java, PHP), exercises have been a good way for me to go. But since I don't really know what Haskell is capable of and because there are many new concepts to utilize, I haven't known where to start. So, how did you learn Haskell? What made you really "break the ice"? Also, any good ideas for beginning exercises?

The process of question closing involves user voting: users with a certain reputation can vote for a question to be closed with one of the reasons above. When a question gains enough close votes, it is closed by a moderator. This process could therefore be automated if it were possible to predict which questions will be closed.

2 Dataset

The data for this task was provided by kaggle and includes the full train data, which contains 3,664,927 posts, and a sample train data consisting of 178,351 posts. The distribution of the full and sample train data over close reasons is shown in Table 1.

Table 1: Training data distribution over categories

Dataset  NRQ    NC     OT     Open     TL
Train    38622  20897  20865  3575678  8910
Sample   38622  20897  20865  89337    8910

The StackOverflow database dump of August 2012 was also available. The database dump contains information about all users and posts, including the history of post editing, comments, and much more.

3 Related work

User interaction analysis in social media. As mentioned above, questions on StackOverflow are closed by user voting, so user feedback is a very important component and a valuable source of information about post quality. In [1] user relationships were analyzed to obtain a significant amount of quality information. The authors applied link-analysis algorithms to propagate quality scores; the main idea was that "good" answerers write "good" answers. This idea can be carried over to questions: people who do not ask "bad" questions are unlikely to do so in the future. In the process of link analysis a user-user graph was built to represent these relationships. The graph can be denoted G = (V, E), in which the vertex set V stands for the set of users and the edge set E represents the relationships between users.

In [4] the authors classified questions as conversational or informational. In their work they divided people into two categories: answer people, who answer many questions, and discussion people, who interact often with other discussion people. To do so they also analyzed the ego-network of a user's questions and answers. The authors of [7] showed that almost the same user-interaction features are significant when classifying a question as social or non-social.

Text content quality analysis. In [1] features representing grammatical properties of the text were presented. The authors also take into account punctuation and typos, as well as syntactic and semantic complexity. This matters because the content is generated by the users. Their features for text quality analysis were helpful for us because one of the close reasons, not constructive, is essentially a conversational question.

4 Used methods

We compared three methods during our research.

4.2 Liblinear

Linear SVMs have been shown to be highly effective at traditional text categorization [5]. We chose this method because of the amount of data: as mentioned above, there are slightly less than 4 million samples, and we did not balance the data as we did for the random forest classifier. Liblinear does not use kernels and is trained very quickly. It also provides an option to select the regularization parameter C; the value of C that we found to be optimal for our dataset is 1.

4.3 Vowpal Wabbit

Vowpal Wabbit (VW) is a library and a set of learning algorithms developed at Yahoo! Research by John Langford. VW streams the examples to an online-learning algorithm [6], in contrast to parallelizing a batch-learning algorithm over many machines. The default learning algorithm is a variant of online gradient descent; the main difference from vanilla online gradient descent is the fast and correct handling of large importance weights. Various extensions, such as conjugate gradient (CG), mini-batch, and data-dependent learning rates, are included, but we found that the default algorithm works much better on our dataset. We trained VW with the samples in chronological order and, for comparison, in shuffled order; the result for the shuffled data was much worse: 0.3340 versus 0.31467 for the ordered dataset, with the same feature set used for both.

As mentioned above, the algorithm used in Vowpal Wabbit is a modified stochastic gradient descent. Unlike typical online-learning algorithms, which keep at least one weight for every feature, the truncated-gradient approach used in VW makes it possible to induce sparsity in the learned feature weights. Its main idea is to apply a simple rounding rule to the weights: most such methods round coefficients smaller than a threshold to zero after a specified number of steps, while in the truncated gradient the amount of shrinkage is controlled by a gravity parameter g_i. The weights are updated according to the rule

  f(w_i) = T1(w_i − η∇_1 L(w_i, z_i), η g_i, θ)

where T1(v, α, θ) = [T1(v_1, α, θ), ..., T1(v_d, α, θ)] with

  T1(v_j, α, θ) = max(0, v_j − α)  if v_j ∈ [0, θ]
                  min(0, v_j + α)  if v_j ∈ [−θ, 0]
                  v_j              otherwise

Here θ is a threshold and g_i is the gravity parameter: g_i = 0 if i/K is not an integer and g_i = Kg if i/K is an integer, where K is the number of steps after which the weights are shrunk and g ≥ 0 is the base gravity. Together, g_i and θ control the sparsity.

¹ http://stackoverflow.com
² https://www.kaggle.com
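To make the update concrete, the truncated-gradient step above can be sketched in a few lines of NumPy (illustrative code, not the authors' implementation; parameter names follow the formula):

```python
import numpy as np

def t1(v, alpha, theta):
    """Elementwise T1: shrink coefficients inside [-theta, theta] toward zero by alpha."""
    out = np.array(v, dtype=float)
    pos = (out >= 0) & (out <= theta)
    neg = (out < 0) & (out >= -theta)
    out[pos] = np.maximum(0.0, out[pos] - alpha)
    out[neg] = np.minimum(0.0, out[neg] + alpha)
    return out

def truncated_gradient_step(w, grad, eta, i, K, g, theta):
    """One online update: gradient step, then shrinkage on every K-th step (g_i = K*g)."""
    g_i = K * g if i % K == 0 else 0.0
    return t1(w - eta * grad, eta * g_i, theta)

w = np.array([0.05, -0.02, 1.5])
w = truncated_gradient_step(w, np.zeros(3), eta=0.1, i=10, K=5, g=0.02, theta=0.5)
# small weights are pulled toward zero; 1.5 lies outside [-theta, theta] and is untouched
print(w)
```

Shrinking only every K-th step (rather than on every update) is what makes the rounding cheap in a streaming setting.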
η_e is a step size and is calculated as η_e = λ·d^(n−1)·(i/(i+P))^p
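For the liblinear classifier of Section 4.2, a comparable setup can be sketched with scikit-learn's LinearSVC, which wraps liblinear. The toy data below is hypothetical and the authors' actual feature set is not described in this excerpt; only the regularization value C = 1 comes from the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for question texts and their close labels.
texts = [
    "how do i reverse a list in python",
    "what is your favourite code comment ever",
    "is it time to start using html5 yet",
    "cannot bind click handler in jquery after ajax load",
]
labels = ["open", "NC", "TL", "open"]

# LinearSVC trains via liblinear (no kernels); C=1.0 matches the value
# the paper reports as optimal for its dataset.
clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
clf.fit(texts, labels)
print(clf.predict(["best source code comment you have seen"]))  # one of the training labels
```

A bag-of-words/tf-idf representation is only one plausible featurization; any sparse feature matrix can be fed to the same classifier.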
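The counts in Table 1 also show why class balance was a concern for the random forest but less so for liblinear: the full train data is heavily skewed toward open questions. A quick check of the proportions (all numbers taken from Table 1):

```python
# Per-category counts for the full train data, copied from Table 1.
train = {"NRQ": 38622, "NC": 20897, "OT": 20865, "Open": 3575678, "TL": 8910}

total = sum(train.values())
closed = total - train["Open"]
print(f"closed questions in full train data: {closed} ({closed / total:.2%})")
for reason, n in train.items():
    print(f"  {reason:>4}: {n / total:.2%}")
```

In the sample train data, by contrast, open questions are downsampled (89,337 posts), so the closed classes make up roughly half of the sample.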