Lifting user generated comments to SIOC

Julien Subercaze and Christophe Gravier
LT2C, Télécom Saint-Étienne, Université Jean Monnet
10 rue Tréfilerie, F-42000 France
{julien.subercaze,christophe.gravier}@univ-st-etienne.fr
http://portail.univ-st-etienne.fr

Abstract. HTML boilerplate code is used on webpages as presentation directives for a browser to display data to a human end user. For machines, our community has made tremendous efforts to provide querying endpoints using agreed-upon schemas, protocols, and principles since the advent of the Semantic Web. These data lifting efforts have been some of the primary materials for bootstrapping the Web of data. Data lifting usually involves an original data structure from which the semantic architect has to produce a mapper to RDF vocabularies. Less effort has been made to lift data produced by a Web mining process, due to the difficulty of providing an efficient and scalable solution. Nonetheless, the Web of documents is mainly composed of natural language twisted in HTML boilerplate code, and few data schemas can be mapped into RDF. In this paper, we present CommentsLifter, a system that is able to lift SIOC data from user-generated comments in Web 2.0.

Keywords: Data extraction, Frequent subtree mining, user comments

1 Introduction

The SIOC ontology [5] has been defined to represent user social interaction on the Web. It aims at interconnecting online communities. Nowadays, SIOC data are mostly produced by exporters (see http://sioc-project.org/exporters). These exporters are plugins to existing frameworks such as blog platforms (Wordpress, DotClear, ...), content management systems (Drupal) and bulletin boards. Unfortunately, these exporters are not yet default installation plugins for these frameworks, except for Drupal 7. Therefore few administrators enable them and, as a consequence, SIOC data production remains rare (see http://www.w3.org/wiki/SIOC/EnabledSites). There also exist numerous closed-source platforms that support online communities, among them online newspapers that allow commenting on articles and Q&A systems. This subset of the user-generated content on the Web will never be unlocked using exporters.

In this paper we present CommentsLifter, a Web-mining approach that aims at extracting users' comments directly from HTML pages, in order to circumvent the exporter issue. Comments are identified in webpages by mining frequent induced subtrees from the DOM, and heuristics are then used to discriminate the different fields of a comment (username, date, ...). This approach does not require any a priori knowledge of the webpage. We empirically evaluated our approach and obtained very good results.

The paper is structured as follows. The next section presents related work on both structuring data into Semantic Web formats and Web mining approaches. Section 3 presents a formalisation of the problem and recalls some theoretical tree mining results. Section 4 details the different steps of CommentsLifter, followed by experimental results. Finally, Section 6 concludes.

2 Related works

Converting existing formats of data into RDF is a cornerstone of the success of the Semantic Web. The W3C maintains a list of available RDFizers on its website (http://www.w3.org/wiki/ConverterToRdf). Input data can be either structured or unstructured. In the former case, if the semantics of the data can be extracted, then the conversion can take place without human intervention [14]; otherwise the user needs to manually specify the semantics of the data.
Sesame contains an API called SAIL (Storage And Inference Layer) that can be used to wrap existing data formats into RDF. The BIO2RDF project [3] uses this API to build bioinformatics knowledge systems. Van Assem presented a method for converting thesauri to RDF/OWL [2] that has been successfully applied to biological databases [3]. In order to convert a mailing list archive into an RDF format, the authors of [11] developed SWAML, a Python script that reads a collection of messages in a mailbox and generates an RDF description based on the SIOC ontology. Since the input data are already structured (i.e. emails follow RFC 4155), the conversion is straightforward. On the other hand, there exist several approaches that aim at automatically or semi-automatically addressing the case where user intervention is usually required. Text-To-Onto [19] is a framework to learn an ontology from text using text mining techniques, as is its successor Text2Onto [9]. However, none of these papers provides a sound evaluation of the quality of the learnt ontologies. This is due to the very nature of ontology modeling, in which no ground truth can be assessed, as there exist as many models as one could imagine for describing the same thing.

In [4], Berendt details relationships between Web mining and Semantic Web mining. The different cases (ontology learning, mapping, mining the Semantic Web, ...) are detailed. From this categorization, the purpose of our research falls into the category of instance mining, which focuses on populating instances for existing semantics. For this purpose, learning techniques have been proposed for Web-scale extraction with a few semantic concepts [10] and presented promising results at the time of publication. TextRunner [22] also learns to perform Web-scale information extraction, presenting good precision but a very low recall. Concerning non-learning techniques, automatic modelling of user profiles has been performed in [12], using term recognition and OpenCalais for named entity recognition.

Several techniques have been developed for Web extraction. In [6], the authors simulate how a user visually understands Web layout structure. The authors of [17] divide the page based on the type of tags. Many recent research works exploit text density to extract content from pages [16, 21, 15]. This approach presents good results regarding article content extraction. For this purpose, the boilerpipe library (http://code.google.com/p/boilerpipe/), based on the work of Kohlschütter [16, 15], is widely used. For a more detailed survey on the different Web data extraction techniques, we encourage the reader to refer to [7]. Among other techniques, DEPTA [23] (an extension of the work done in [18]) presents a hybrid approach of visual and tree analysis. It uses a tag tree alignment algorithm combined with visual information. In a first step, DEPTA processes the page using a rendering engine (Internet Explorer) to get the boundary information of each HTML element. Then the algorithm detects rectangles that are contained in another rectangle, and thus builds a tag tree in which the parent relationship indicates containment in the rendered page. DEPTA then uses a string edit distance to cluster similar nodes into regions. Since each data region in a page contains multiple data records, extracted tag trees must be aligned to produce a coherent database table. A tree edit distance (as in [20]) is then defined and used to merge trees. However, DEPTA is not able to extract nested comments.
We use a different approach, which only requires a DOM parsing technique and is suitable for analyzing huge amounts of pages. Our approach is based on a theoretical tree mining background, presented in the next section.

3 Problem Definition

The purpose of our work is to provide a solution for leveraging Linked Data, using the SIOC schema, from user-generated comments on webpages, without any a priori knowledge of the webpage. Our main assumption is that comments on a given webpage (even at website scale) are embedded in the same HTML pattern. This assumption is well fulfilled in practice since comments are usually stored in a relational database and exposed into HTML after an automatic processing step. Therefore our goal is to automatically determine the HTML pattern that is used to expose the comments and then to identify the relevant information in the content to fill SIOC instances. Basically, a comment is a sioc:Post contained in a sioc:Forum container. We identified the following subset of the core-ontology properties of sioc:Post to be relevant for the extraction (we mark mandatory properties and relationships with *):

sioc:content*: text-only representation of the content
dcterms:title: title of the comment
dcterms:created: creation date

and relationships:

sioc:has_creator*: points to a sioc:UserAccount resource
sioc:has_container*: indicates a sioc:Container object
sioc:reply_of: links to a sioc:Post item
sioc:has_reply: links to sioc:Post items

The sioc:Container pointed to by sioc:has_container can be of different types in our case. We consider extraction from user reviews (sioct:ReviewArea), posts on forums (sioc:Forum), comments from blogs (sioct:Weblog), Q&A answers (sioct:), or generally newspaper discussions (sioc:Thread). However, distinguishing these different subclasses of sioc:Container would require classifying the webpages we intend to perform extraction on. This is out of the scope of this paper; we therefore uniformly consider the container as an instance of sioc:Container. Similarly, there exist subclasses of sioc:Post for each container. As for the container, our algorithm will output sioc:Post items.

To summarize, our problem is the following: in order to generate SIOC data from raw HTML, we must identify the different items (sioc:Post) and their conversational relationships (sioc:reply_of and sioc:has_reply). For each item we must identify the user (sioc:UserAccount), the content of the post (sioc:content) and, when possible, the date and the title.
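Section 5.1 notes that Jena is used to generate the SIOC output. As a minimal sketch of the target representation, and not the authors' actual code, the following Java fragment shows how one extracted comment could be serialized as a sioc:Post carrying the properties listed above; the resource URIs and literal values are hypothetical.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

public class SiocExample {
    static final String SIOC = "http://rdfs.org/sioc/ns#";
    static final String DCTERMS = "http://purl.org/dc/terms/";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        m.setNsPrefix("sioc", SIOC);
        m.setNsPrefix("dcterms", DCTERMS);

        // Hypothetical URIs minted for one extracted comment.
        Resource container = m.createResource("http://example.org/page#comments",
                m.createResource(SIOC + "Container"));
        Resource user = m.createResource("http://example.org/user/john",
                m.createResource(SIOC + "UserAccount"));
        Resource post = m.createResource("http://example.org/page#comment-1",
                m.createResource(SIOC + "Post"));

        // Mandatory properties and relationships identified above.
        post.addProperty(m.createProperty(SIOC, "content"), "Great article!");
        post.addProperty(m.createProperty(SIOC, "has_creator"), user);
        post.addProperty(m.createProperty(SIOC, "has_container"), container);
        // Optional fields, when the extractor finds them.
        post.addProperty(m.createProperty(DCTERMS, "created"), "2012-05-25T17:00:00");
        post.addProperty(m.createProperty(DCTERMS, "title"), "Re: the article");

        m.write(System.out, "TURTLE");
    }
}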
3.1 Frequent Subtree Mining

In the case of product listing extraction, the goal is to extract frequent subtrees that are identical in the page. For this purpose, a bottom-up subtree mining objective is sufficient. For selecting a feature among mined items, for example title and price, a filter on the leaf nodes can be applied. In the case of comment extraction, the patterns can be nested. Assuming that the pattern we are looking for is [a[p; br; p; div]], if we encounter a reply to a comment, i.e. nested instances of the pattern, the tag tree for the comment and its answer could be as follows: [a[p; br; p; div; [a[p; br; p; div]]]]. In the case of a single comment we will encounter our pattern [a[p; br; p; div]]. We observe that nodes can be skipped horizontally along with their descendants.

In frequent subtree mining, three types of subtrees are distinguished: bottom-up, induced and embedded. Figure 1 depicts these different subtrees. For more details on frequent subtree mining, we refer the reader to [8].

Fig. 1. Different types of subtrees from tree (a): bottom-up subtree (b); induced subtree (c); embedded subtree (d)

We observed empirically that instances are nested in the way we described previously: the direct parenting relation is preserved. Consequently, induced subtrees are a sufficient type of subtree for our purpose. Embedded subtree mining could also be used; however, since algorithm complexity grows with the complexity of the pattern to mine, induced subtree mining is definitely more appropriate. The main advantage of subtree mining over existing works is that it provides a mine-once, extract-many approach. In order to mine a large website, one would only need to mine the pattern from one webpage of the site and could later extract data by simple pattern matching on other pages, thus saving large amounts of computational resources and time. In the next section, we present CommentsLifter, our approach to comment extraction based on frequent subtree mining.

4 CommentsLifter

We now start the description of CommentsLifter. This section presents the different steps of our algorithm. The underlying idea of CommentsLifter is to use simple observations on Web page structures to reduce candidate set generation. This allows us to minimize the error rate while selecting the winning pattern, and contributes to the runtime performance optimization objective. To extract comments from a Web page, CommentsLifter uses seven steps: document preprocessing, frequent subtree extraction, clustering, merging, pattern expansion, winner selection, and field and data extraction. In the next subsections we detail the process followed in each step of our algorithm.

4.1 Preprocessing

Recommendations issued by the W3C aim at specifying the languages of the World Wide Web, with various versions of HTML and CSS. While pages following these recommendations produce clean DOM trees, they represent only 4.13 % of Web pages (http://arstechnica.com/Web/news/2008/10/opera-study-only-4-13-of-the-Web-is-standards-compliant.ars). The remaining pages are made up of wrongly formatted HTML code that is often referred to as "tag soup" (http://en.wikipedia.org/wiki/Tag_soup). Due to the large portion that these pages represent, Web browser engines are able to handle malformed HTML tags, improperly nested elements, and other common mistakes. In our algorithm, the first step is to take any input HTML document (malformed or not) and to output a well-formed document as a DOM tree. For this purpose, any dedicated library (Jsoup, http://jsoup.org/; JTidy, http://jtidy.sourceforge.net/) or browser engine can be used.

4.2 Frequent subtree extraction

The next step of our algorithm consists of the generation of a first candidate set. For this purpose we extract frequent depth-two trees from the DOM and store them in a cardinality map (an example is given in Table 1). The basis of our approach is to select patterns with a depth of two that will be expanded into larger patterns (empirical results in Section 5 show that the average depth of a comment pattern is 4.58). For this purpose, a tag tree is generated from the preprocessed DOM. CommentsLifter traverses the tree in a top-down fashion and stores the encountered depth-two trees (each node that has children, along with its children).
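A minimal sketch of this step, assuming Jsoup is used for the preprocessing of Section 4.1; the pattern encoding (parent tag followed by the bracketed list of child tags, as in Table 1) is taken from the paper, while the class and method names are illustrative rather than the authors' implementation.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DepthTwoMiner {

    /** Encodes a node and its direct children, e.g. "div[a;br;i;p]". */
    static String depthTwoPattern(Element e) {
        String children = e.children().stream()
                .map(Element::tagName)
                .collect(Collectors.joining(";"));
        return e.tagName() + "[" + children + "]";
    }

    public static void main(String[] args) {
        String html = "<html>...</html>";              // raw page, possibly malformed
        Document doc = Jsoup.parse(html);              // step 4.1: well-formed DOM

        Map<String, Integer> counts = new HashMap<>();          // cardinality map (Table 1)
        Map<String, List<Element>> instances = new HashMap<>(); // pattern -> parent nodes

        for (Element e : doc.getAllElements()) {       // top-down traversal
            if (e.children().isEmpty()) continue;      // only nodes that have children
            String p = depthTwoPattern(e);
            counts.merge(p, 1, Integer::sum);
            instances.computeIfAbsent(p, k -> new ArrayList<>()).add(e);
        }
        // Candidates occurring fewer than twice are discarded, as described next.
        counts.entrySet().removeIf(en -> en.getValue() < 2);
    }
}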
The results are sorted, as presented in Table 1. Candidates with fewer than two occurrences are discarded, since we assume that there are at least two comments. The same assumption is made, for instance, by [23]. For each pattern occurrence, the encountered instance is stored in a multimap (in fact we only store the label of the parent node).

Count  Tree
13     div[a;br;i;p]
12     ul[li;li]
10     div[p;p]
8      div[p;p;p]

Table 1. Example of depth-two candidates

At this stage our candidate set is initialized and contains the instances of comments we are looking for.

4.3 Clustering

Comments in a Web page appear in a continuous manner, so it is very unlikely that comments are stored in different branches of the DOM tree. We did not encounter the case of comments split over different subtrees during our experiments. In fact, comments are organized in a tree structure, where the beginning of the comments block is the root of the tree. Since comments are located in the same subtree, we proceed to a clustering phase of the occurrences of each pattern.

In this step, we aim at clustering co-located instances of the same pattern. In other words, the algorithm builds pairs (pattern, Instances), where each pattern is associated with a set of instances matching it that are located in the same subtree and close to each other. Consequently, one pattern can be associated with several sets of instances, each instance belonging to exactly one of the sets associated with that pattern. This means that each pair (pattern, Instances) is split into different pairs (pattern, Instances) where the instances in Instances belong to the same subtree of the DOM. The basis of our algorithm is distance-based clustering. For each given pattern, the algorithm sorts its instances by their depth in the tree. At each depth, we check for each instance whether it has a parent in a previously found (pattern, Instances) pair. The remaining instances are clustered using a classical node distance in trees: d(a, b) ∈ N is the length of the shortest path between the nodes a and b. After building the set of (pattern, Instances) pairs, we remove elements where the cardinality of the instance set is equal to one. For example, running our algorithm on the tree of Figure 2 with the pattern ul[li; li] would produce the two pairs (pattern, Instances) that are depicted with dotted and dashed boxes in the same figure.

Fig. 2. Example output for Section 4.3 with the pattern ul[li; li]

At the end of this step, CommentsLifter holds the set of pairs (pattern, Instances) that matches a pattern to a set of co-located instances in the DOM tree. In the next step, we try to identify mergeable patterns from this data structure.
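Before moving on, here is a minimal sketch of the distance-based clustering just described, assuming a generic Node type with a parent pointer; the distance threshold maxDist is an assumption, since the paper does not state the exact cut-off used to decide that two instances are close.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InstanceClustering {

    interface Node {
        Node parent();
        int depth();
    }

    /** Classical node distance in a tree: length of the shortest path between a and b. */
    static int distance(Node a, Node b) {
        Map<Node, Integer> up = new HashMap<>();
        for (Node n = a; n != null; n = n.parent()) up.put(n, a.depth() - n.depth());
        for (Node n = b; n != null; n = n.parent()) {
            Integer fromA = up.get(n);
            if (fromA != null) return fromA + (b.depth() - n.depth()); // meet at the common ancestor
        }
        return Integer.MAX_VALUE; // nodes from different trees
    }

    /** Groups co-located instances of one pattern; maxDist is an assumed closeness threshold. */
    static List<List<Node>> cluster(List<Node> instances, int maxDist) {
        List<List<Node>> clusters = new ArrayList<>();
        for (Node inst : instances) {
            List<Node> home = null;
            for (List<Node> c : clusters) {
                if (distance(c.get(0), inst) <= maxDist) { home = c; break; }
            }
            if (home == null) { home = new ArrayList<>(); clusters.add(home); }
            home.add(inst);
        }
        // Clusters holding a single instance are removed, as in the paper.
        clusters.removeIf(c -> c.size() < 2);
        return clusters;
    }
}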
4.4 Merging

HTML patterns that contain comments often have a depth greater than two (see Section 5 for more details). Consequently, our candidate set may at this step contain pairs (pattern, Instances) that belong to the same global pattern, which we have not discovered yet because the pattern expansion process occurs later. Instead of discovering the same pattern from different candidates, we aim at merging these candidates beforehand for optimization. For this purpose we perform a pairwise comparison between the sets of instances. We merge two sets of instances when every element of a set S1 is topped by an element of the other set. An element is topped by another if the latter is a parent of the former. If a set of instances is topped by another set, we discard this set and restart the process, until no further updates to the candidate set are performed. Algorithm 1 presents this merging process. After this step, the candidate set is again drastically pruned, since we have eliminated all potentially duplicate patterns before the expansion phase. This process, together with the previous clustering process, constitutes the key phase of our algorithm, since they discard both duplicate and irrelevant candidates.

Algorithm 1: MergePatterns: merge similar patterns
Data: a collection Input of pairs (Pattern, Instances) where all patterns are different;
      minDepth, the minimum depth of the instances in the DOM tree
Result: a collection Candidates of (Pattern, Instances) pairs where similar pairs have been merged

Candidates ← ∅
for i : 0 ... Input.size do
    for j : i ... Input.size do
        // depth of the highest common element
        commonDepth ← depth(HighestCommon(Input[i].Instances, Input[j].Instances))
        // save useless computation
        if commonDepth < minDepth then
            continue
        end
        Small ← setWithLessInstances(Input[i], Input[j])
        Large ← setWithMoreInstances(Input[i], Input[j])
        if ∀k ∈ Small.Instances, ∃l ∈ Large.Instances : isParentOf(l, k) then
            Candidates ← Input \ Small
            // restart the merging process
            MergePatterns(Candidates)
            // cut the current call
            break
        else
            Candidates ← Candidates ∪ Large ∪ Small
        end
    end
end
return Candidates

4.5 Pattern expansion

Since the candidate set contains simple patterns (i.e. of depth two), we perform a pattern expansion to discover the fully matching patterns. Patterns may be expanded in both directions, towards the top and the bottom of the tree. For each candidate (pattern, Instances), we distinguish two cases. In the first case, every instance is at the same depth in the tree; this is the case of product listing extraction, which we call the flat case, and it corresponds to bottom-up subtree mining (see Figure 1). The second case, also called the nested case, is the one where instances belong to the same subtree but at different depths; this corresponds to induced subtrees in Figure 1.

Top expansion is straightforward: we check whether the type of the parent node (i.e. its HTML markup tag) is the same for every instance; in this case we expand the pattern with the new parent node and update the instances accordingly. This process is repeated until the instances no longer share the same tag as parent node, or until the same node (in the sense of a label in the tree, not an HTML markup tag) is the parent of all instances. This process is the same for both the nested and flat cases.

Fig. 3. Pattern expansion process: top expansion until the same node (top div node) is shared by both instances, then bottom expansion for induced subtrees (the rightmost div is skipped)
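The top-expansion loop just described can be sketched as follows, using a generic Node abstraction over DOM elements; the names (Node, tag, parent) are ours, and the stopping conditions follow the description above.

import java.util.List;
import java.util.stream.Collectors;

public class TopExpansion {

    interface Node {
        Node parent();
        String tag();
    }

    /**
     * Expands the pattern towards the top: while every instance has a parent carrying
     * the same tag, the pattern grows by one level and the instances are replaced by
     * their parents. The loop stops when all instances share the very same parent node
     * (common ancestor reached) or when the parents disagree on the tag.
     */
    static String expandTop(String pattern, List<Node> instances) {
        while (true) {
            List<Node> parents = instances.stream().map(Node::parent).collect(Collectors.toList());
            if (parents.contains(null)) return pattern;          // reached the document root
            Node first = parents.get(0);
            if (parents.stream().allMatch(p -> p == first)) return pattern;            // same node
            if (!parents.stream().allMatch(p -> p.tag().equals(first.tag()))) return pattern;
            pattern = first.tag() + "[" + pattern + "]";          // e.g. ul[li;li] -> div[ul[li;li]]
            instances = parents;                                  // move instances one level up
        }
    }
}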
Once a pattern has been expanded in the top direction, the bottom expansion takes place. In the flat case, this bottom expansion is, like the top expansion, simple, since top-down subtrees are easy to extract. One just needs to traverse the instance trees in a top-down, leftmost direction, looking for nodes that are shared by all instances. Once a node is not present in every instance, the algorithm backtracks to select the next sibling, and then continues its process.

However, the nested case is not trivial, since we are looking for induced subtrees. Standard algorithms such as AMIOT [13] and FREQT [1] are dedicated to this task but, as mentioned in Section 3.1, they performed poorly on the full Web page: either the run was excessively long (several minutes) or the program ended with an exception. To make efficient use of these algorithms, we take advantage of the specificities of comment extraction. Apart from the aforementioned issues, the biggest problem we encountered using AMIOT or FREQT was that we were unable to adaptively select the support (in fact, a percentage) for a Web page. However, in our pattern expansion case, we know how many instances are present for each pair (pattern, Instances). Thus, we modified the algorithms (see Section 5.1 for further details) to discard candidate patterns not on a minimum support basis, but on a strict occurrence equality basis. Finally, instead of using the complete Web page as the data tree, we construct a tree by adding all the instances to a tree. More precisely, since the instances are stored using only their top nodes (the whole instance is retrieved by applying the pattern to this node in the data tree), we build the data tree by adding the subtrees under the instances' top nodes as children of the root node. Therefore, our data tree contains only the relevant instances. Consequently, the set of candidate patterns is drastically reduced compared with using the whole DOM page. To drive AMIOT in the right direction, the candidate pattern set is initialized with the current depth-two pattern, thus avoiding useless candidate generation in the first stage.

Figure 3 depicts this expansion process. In this figure we present the instances as they appear in the DOM tree. Our starting pattern is ul[li; li], and its instances are represented within the dotted boxes. Both instances have a div node with a different label as parent; consequently the pattern is expanded towards the top: div[ul[li; li]]. Next, both instances again have a div node as parent, but in this case it is the same node, so the top expansion process finishes. For the bottom expansion, we consider this subtree as the data tree. AMIOT (resp. FREQT) performs a left-to-right expansion that adds the node a to the pattern: div[ul[li; li]; a]. The rightmost div node is discarded since it does not belong to the left instance (its occurrence is one whereas the algorithm expects two). Then AMIOT adds the node p, and the pattern becomes div[ul[li; li]; a[p]]. The figure does not show the part under the rightmost div; in this part we could find, for example, another instance of div[ul[li; li]; a[p]], resulting in a nested comment.
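As an illustration of the reduced data tree and the strict-support criterion described above, here is a short sketch assuming Jsoup elements for the DOM nodes; the method names are ours and are not part of AMIOT or FREQT.

import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

import java.util.List;

public class ReducedDataTree {

    /**
     * Builds the pruned data tree used for bottom expansion: a synthetic root whose
     * children are (deep copies of) the subtrees rooted at the instances' top nodes.
     */
    static Element buildDataTree(List<Element> instanceTopNodes) {
        Element root = new Element(Tag.valueOf("root"), "");
        for (Element top : instanceTopNodes) {
            root.appendChild(top.clone());   // only the relevant subtrees are kept
        }
        return root;
    }

    /** Strict support: a candidate survives only if it occurs exactly once per instance. */
    static boolean keepCandidate(int occurrences, int instanceCount) {
        return occurrences == instanceCount;
    }
}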
4.6 Winner selection

Recurring structures competing with the comment pattern in the candidate set are usually menu elements or links to other articles. In [15, 16], Kohlschütter developed a densitometric approach based on simple text features that presents excellent results (F1-score of 95 %) for news article boilerplate removal. User-generated comments also differ from menu elements in their text features: comments have varying text lengths, and their link density is low, since comments are not part of a link, in contrast to menu items. We developed simple heuristics, based on our observations, to discard irrelevant candidates and to rank the remaining candidates.

Our experimentation showed that instances with a link density greater than 0.5 (Kohlschütter found 0.33 for news articles) are always boilerplate. Short comments in very complex HTML boilerplate patterns can produce instances with a fairly high link density; we empirically observed that links are on the username or on its avatar and point to a profile page. Consequently, we discard candidates whose average link density is above 0.5. For the remaining candidates, the score is given by the following formula:

Score(p, I) = lgt(I) \times \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} \left( lgt(I) - lgt(I_j) \right)^2}    (1)

where n is the number of instances in I, lgt(I_j) is the text length of instance I_j, and lgt(I) is the average text length over the instances in I. The formula thus computes the average text length times its standard deviation. This heuristic promotes candidates whose instances have longer text of variable length. Finally, we did not use heuristics based on a lower bound on the number of words (as in [15, 16]). Once again, comment extraction differs from traditional boilerplate removal, since some comments are just one word (e.g. lol, +1, first) or may be longer than the article they comment on. The standard deviation is therefore very useful for eliminating menu items, whose text length is often very close among their instances. The (pattern, Instances) pair with the highest score is promoted as the winner.

At this step, the algorithm outputs the tree of instances, i.e. we have the structure of the conversation, but we still need to structure the conversation further by identifying the different fields in the pattern.
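A sketch of this winner-selection step: candidates whose average link density exceeds 0.5 are discarded, and the remaining ones are ranked with Equation 1. The Candidate structure and the way text statistics are gathered are assumptions made for illustration.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class WinnerSelection {

    /** One candidate (pattern, Instances) with per-instance text statistics. */
    static class Candidate {
        String pattern;
        double[] textLengths;     // lgt(I_j) for each instance
        double avgLinkDensity;    // linked characters over total characters, averaged
    }

    /** Equation 1: average text length times the standard deviation of the text lengths. */
    static double score(Candidate c) {
        int n = c.textLengths.length;
        double avg = 0;
        for (double l : c.textLengths) avg += l;
        avg /= n;
        double var = 0;
        for (double l : c.textLengths) var += (avg - l) * (avg - l);
        var /= (n - 1);
        return avg * Math.sqrt(var);
    }

    static Optional<Candidate> selectWinner(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.avgLinkDensity <= 0.5)        // boilerplate cut-off
                .max(Comparator.comparingDouble(WinnerSelection::score));
    }
}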
4.7 Structuring the content

Once the HTML pattern containing the comments has been selected, the next step of our algorithm consists of extracting the related SIOC fields described in Section 3. Two fields are always present in a comment: the username and the content of the comment. From our observations on various websites (online press, Q&A, blogs, reviews), two other fields appear often. Firstly, the date when the comment was posted occurs in 97.95 % of the cases (see Table 3). Comments less often have a title, but the percentage we measured (24.48 %) remains high enough to be of interest. We voluntarily skipped the extraction of rare fields (votes on comments, account creation date) because of their few occurrences and the fact that they complicate the whole field identification process.

From our observations, we noticed that content and title are always contained in their own HTML markup tags, i.e. it is very unlikely to see the title and the content in the same div. However, we noticed that username and date often occur within the same markup tag; for example, we often encountered comment fields such as <div>John, May 25, 2012 at 5:00 p.m</div>. Therefore we apply the following procedure: first, we identify the date field within the pattern. We store the location of the date in the pattern and remove the date from every instance. For instance, the previous example becomes <div>John</div>.
At this point we are sure that the fields we are looking for will be in distinct nodes of the pattern.

Date parsing. Date parsing libraries such as JodaTime (http://joda-time.sourceforge.net/) and JChronic (https://github.com/samtingleff/jchronic) perform well at extracting dates from messy data. These libraries take a String as input and return a Date object. However, they do not offer the feature of returning the original text without the substring containing the date. Therefore we developed our own library, using features and heuristics from both of the above-mentioned libraries, that offers the date string removal feature.

Fields selection. Our heuristic is again based on simple observations. We know that every comment contains at least a username and a content, with date and title being optional. For each leaf of our pattern, we compute the following measures over its instances: percentage of dates found, average text length, standard deviation of text length, text entropy, average word count and standard deviation of word count. Using these values, we build two candidate sets: the first one for the date, and the other one for title, content and username. We distinguish these fields using the fact that title, content and username are unstructured text, whereas dates have a particular structure (containing year, days, ...). A node in the pattern is a candidate for the date field if its percentage of dates found is over 0.7 (to take into account the fact that date extraction is not perfect), it has more than two words, and its coefficient of variation on the word count is below 0.2. This latter condition requires that the number of words be very close from one instance to another. We did not set the value to zero to avoid discarding fields where the date has a variable length, for example one hour ago and yesterday. From these candidates, we pick the field with the highest entropy in order to discard constant fields that may have been recognized as a date. If the set is empty, then no dates are specified for the comments; in practice this happens very rarely.

The second candidate set should contain only nodes whose instances hold textual data. For this purpose we discard the nodes in the pattern where the variation of text entropy is equal to zero (constant text in every instance). Since we know that the node containing the content must be present within this set, we aim at identifying this field first. Luckily, the content is very simple to identify since it contains the most words and has a very variable length and word count: in practice, selecting the node with the highest average word count is sufficient. Once the content has been removed from this candidate set, we check its size. If the size is equal to one, the remaining candidate matches the username. In the case where two candidates are present, we have to distinguish between username and title. Usernames are very short, usually one or two words, and are thus on average shorter than titles. If two fields are present, the shorter one is identified as the username and the longer one as the title.
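The field-selection heuristic can be sketched as follows, operating on per-leaf statistics; the thresholds (date ratio above 0.7, coefficient of variation below 0.2) come from the description above, while the LeafStats structure and field names are illustrative.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FieldSelection {

    /** Statistics computed over the instances of one leaf node of the winning pattern. */
    static class LeafStats {
        String leafId;
        double dateRatio;        // fraction of instances in which a date was parsed
        double avgWordCount;
        double wordCountCv;      // coefficient of variation of the word count
        double textEntropy;
        double entropyVariation; // zero when the text is constant across instances
        double avgTextLength;
    }

    static String pickDateField(List<LeafStats> leaves) {
        return leaves.stream()
                .filter(l -> l.dateRatio > 0.7 && l.avgWordCount > 2 && l.wordCountCv < 0.2)
                .max(Comparator.comparingDouble(l -> l.textEntropy)) // discard constant fields
                .map(l -> l.leafId).orElse(null);                    // null: no date field found
    }

    /** Returns, in order: content, username, and optionally title. */
    static List<String> pickTextFields(List<LeafStats> leaves) {
        List<LeafStats> textual = new ArrayList<>();
        for (LeafStats l : leaves) if (l.entropyVariation > 0) textual.add(l);

        textual.sort(Comparator.comparingDouble((LeafStats l) -> l.avgWordCount).reversed());
        List<String> fields = new ArrayList<>();
        LeafStats content = textual.remove(0);   // highest average word count; assumed present
        fields.add(content.leafId);

        textual.sort(Comparator.comparingDouble(l -> l.avgTextLength));
        if (!textual.isEmpty()) fields.add(textual.get(0).leafId);                    // username: shortest
        if (textual.size() > 1) fields.add(textual.get(textual.size() - 1).leafId);   // title: longest
        return fields;
    }
}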
5 Mining experiments

This section presents the evaluation protocol of CommentsLifter as well as the results of our experiments. We first detail the experimental setup, which is somewhat particular for comment extraction due to the widespread use of AJAX. Finally we present global and detailed results for both the flat and nested cases.

5.1 Setup

Many Web pages handle comments using AJAX; consequently, downloading raw HTML along with manual ground truth construction is not sufficient for building the dataset. To circumvent this issue we developed two components:

Firefox extension: we implemented a Firefox extension that sends the current DOM (after browser-side JavaScript processing) to a Web server.
Web server: the server receives the DOM from the browser through a POST request on a servlet, then runs CommentsLifter and presents an evaluation form along with the extracted comments.

The user is asked to evaluate the pertinence of the extraction. We distinguish three cases for the result. We used Jena (http://jena.apache.org/) to generate the SIOC output.

5.2 Evaluation

The results obtained from our evaluation are given in Table 2 and Table 3. Table 2 presents the extraction results of the pattern mining part (steps 1 to 6 of the algorithm), whereas Table 3 presents the field identification results (step 7).

         Pages   Ground Truth   True Positive   False Positive
Global   100     2323           2121            153
Flat     77      1837           1674            125
Nested   23      486            447             26

         Precision   Recall    F1
Global   93.3 %      91.3 %    92.3 %
Flat     93.1 %      91.1 %    92.1 %
Nested   94.5 %      91.97 %   93.22 %

Table 2. Evaluation: steps one to six

                         Content   Username   Date    Title
Occurrence (%)           100       100        97.59   24.48
Correct extraction (%)   100       87.75      83.67   81.63

Table 3. Evaluation, step seven: field identification

The first six steps of the algorithm present very good results, with a global F1 score over 92 %. Concerning field identification, we first report the occurrence of the different fields over our dataset. Username and content are always present, while the date is not far from being present in every comment. However, titles are only found in one quarter of the comments. The evaluation is consistent with the heuristics we describe in Section 4.7. Content extraction is a straightforward task, since it is easy to "measure" its differences with the other fields; our algorithm performs perfectly at this task. Date parsing is no easy task, but our algorithm still performs well, with an identification rate of 83.67 %. However, we note that while the rest of the process is language agnostic, date parsing libraries are designed to work with Western languages (English, German, French, Spanish, ...) but may fail with other languages, especially those with non-Latin alphabets.

6 Conclusions and future work

In this paper, we presented CommentsLifter, an algorithm that extracts users' comments and outputs SIOC data. Our algorithm combines mining induced subtrees from the DOM with simple yet robust heuristics to select the pattern containing the comments as well as to identify several fields within the pattern. The empirical evaluation presents very good results, for both extraction and field identification. We successfully extracted comments, without a priori knowledge, from various types of Web sites such as online newspapers, forums, user reviews and blogs, and we were able to reconstruct the conversations. Further research will focus on refining the category of the extracted container, in order to determine whether the discussion takes place in a forum, a Q&A site, a blog or a review area.

References

1. Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroshi Sakamoto, and Setsuo Arikawa. Efficient substructure discovery from large semi-structured data. pages 158–174, 2002.
2. Mark Van Assem, Maarten R. Menken, Guus Schreiber, Jan Wielemaker, and Bob Wielinga. A method for converting thesauri to RDF/OWL. In Proc. of the 3rd Intl Semantic Web Conf. (ISWC04), number 3298 in Lecture Notes in Computer Science, pages 17–31. Springer-Verlag, 2004.
3. F. Belleau, M.A. Nolin, N. Tourigny, P. Rigault, and J. Morissette. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, 2008.
4. Bettina Berendt, Andreas Hotho, and Gerd Stumme. Towards semantic web mining. In Intl Semantic Web Conf. (ISWC02), pages 264–278. Springer, 2002.
5. J. Breslin, A. Harth, U. Bojars, and S. Decker. Towards semantically-interlinked online communities. The Semantic Web: Research and Applications, pages 71–83, 2005.
6. D. Cai, S. Yu, J.R. Wen, and W.Y. Ma. Extracting content structure for web pages based on visual representation. In Proc. of the 5th Asia-Pacific Conference on Web Technologies and Applications, pages 406–417. Springer-Verlag, 2003.
7. C.H. Chang, M. Kayed, R. Girgis, and K.F. Shaalan. A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering, 18(10):1411–1428, 2006.
8. Y. Chi, R.R. Muntz, S. Nijssen, and J.N. Kok. Frequent subtree mining - an overview. Fundamenta Informaticae, 66(1-2):161, 2005.
9. P. Cimiano and J. Völker. Text2Onto. Natural Language Processing and Information Systems, pages 257–271, 2005.
10. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the world wide web. Artificial Intelligence, 118(1):69–113, 2000.
11. S. Fernández, D. Berrueta, and J.E. Labra. Mailing lists meet the semantic web. In Proc. of the Workshop on Social Aspects of the Web, Poznan, Poland, 2007.
12. A. Gentile, V. Lanfranchi, S. Mazumdar, and F. Ciravegna. Extracting semantic user networks from informal communication exchanges. The Semantic Web - ISWC 2011, pages 209–224, 2011.
13. S. Hido and H. Kawano. AMIOT: induced ordered tree mining in tree-structured databases. In Proceedings of the Fifth IEEE International Conference on Data Mining, 2005.
14. D. Huynh, S. Mazzocchi, and D. Karger. Piggy Bank: experience the semantic web inside your web browser. The Semantic Web - ISWC 2005, pages 413–430, 2005.
15. C. Kohlschütter, P. Fankhauser, and W. Nejdl. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 441–450. ACM, 2010.
16. C. Kohlschütter and W. Nejdl. A densitometric approach to web page segmentation. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 1173–1182. ACM, 2008.
17. S.H. Lin and J.M. Ho. Discovering informative content blocks from web documents. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 588–593. ACM, 2002.
18. B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 601–606. ACM, 2003.
19. A. Maedche and S. Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2):72–79, 2001.
20. D.D.C. Reis, P.B. Golgher, A.S. Silva, and A.F. Laender. Automatic web news extraction using tree edit distance. In Proceedings of the 13th International Conference on World Wide Web, pages 502–511. ACM, 2004.
21. F. Sun, D. Song, and L. Liao. DOM based content extraction via text density. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 245–254. ACM, 2011.
22. A. Yates, M. Cafarella, M. Banko, O. Etzioni, M. Broadhead, and S. Soderland. TextRunner: open information extraction on the web. In Proceedings of Human Language Technologies, pages 25–26. ACL, 2007.
23. Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proceedings of the 14th International Conference on World Wide Web, pages 76–85. ACM, 2005.