=Paper=
{{Paper
|id=Vol-1917/paper08
|storemode=property
|title=None
|pdfUrl=https://ceur-ws.org/Vol-1917/paper08.pdf
|volume=Vol-1917
}}
==None==
Predicting Tags for Stack Overflow Questions
Sayan Pal, Ammar Shaker, and Eyke Hüllermeier
Intelligent Systems Group
Paderborn University
Stack Overflow (https://stackoverflow.com/) is one of the major community-
driven Question and Answer (Q&A) websites, focusing on topics related to
computer programming. It has nearly 7 million users, who ask more than 6700
questions every day. Each question can be associated with up to five different
tags, which serve as metadata to facilitate information retrieval.
In this paper, we consider the problem of supporting this process through
automatic tagging. From a machine learning point of view, we are facing a problem
of Extreme Multi-Label Classification (XMLC), as Stack Overflow allows for
choosing from several thousands of tags. Besides, instead of learning in a standard
batch mode, it is desirable to learn incrementally on the continuous stream of
questions entering the system, with the capability to capture changes and drifts
in the data; for example, many tags (such as ‘facebook’) have a lifetime, first
gaining popularity, then reaching a peak and eventually diminishing.
Thus, we end up with an extremely challenging problem of XMLC for data
streams with a non-stationary set of labels. To tackle this problem, we build on
an XMLC method based on probabilistic label trees (PLT), which has recently
been proposed in [1]. We extend this approach in two directions. First, instead
of specifying the entire PLT beforehand, we develop an adaptive version that
starts with only a single node and expands the tree whenever a new label is
observed in a training example. As a second contribution, we further improve
adaptive PLTs through stream-based boosting [2]. More specifically, we apply the
online boosting method by Oza and Russell, which we tailor for minimizing the
F-measure as a performance metric (instead of 0/1 loss, which is not appropriate
in the context of XMLC).
In addition to the methodological contributions, we present empirical results
based on extensive experiments with real data from Stack Overflow. Our ex-
perimental setting is focused on evaluating the usefulness of the extensions we
proposed for PLTs, i.e., the adaptive handling of labels and online boosting.
References
1. Kalina Jasinska, Krzysztof Dembczynski, Róbert Busa-Fekete, Karlson Pfannschmidt,
Timo Klerx, and Eyke Hüllermeier. Extreme F-measure maximization using sparse
probability estimates. In Proc. ICML 2016, New York City, NY, USA, pages
1435–1444, 2016.
2. Nikunj C. Oza and Stuart Russell. Online bagging and boosting. In In Artificial
Intelligence and Statistics 2001, pages 105–112. Morgan Kaufmann, 2001.
3. Gene Smith. Tagging : people-powered metadata for the social web. New Riders,
Berkeley, CA, 2008.
Copyright © 2017 by the paper’s authors. Copying permitted only for private and academic purposes.
In: M. Leyer (Ed.): Proceedings of the LWDA 2017 Workshops: KDML, FGWM, IR, and FGDB.
Rostock, Germany, 11.-13. September 2017, published at http://ceur-ws.org