                      DPRL Systems in the
                    CLEF 2020 ARQMath Lab

          Behrooz Mansouri,1 Douglas W. Oard,2 and Richard Zanibbi1
          1 Rochester Institute of Technology (USA)
                 {bm3302,rxzvcs}@rit.edu
          2 University of Maryland, College Park (USA)
                        oard@umd.edu



        Abstract. This paper describes the participation of the Document and
        Pattern Recognition Lab from the Rochester Institute of Technology in
        the CLEF 2020 ARQMath lab. There are two tasks defined for ARQ-
        Math: (1) Question Answering, and (2) Formula Retrieval. Four runs
        were submitted for Task 1 using systems that take advantage of text and
        formula embeddings. For Task 2, three runs were submitted: one uses
        only formula embedding, another uses formula and text embeddings,
        and the final one uses formula embedding followed by re-ranking results
        by tree-edit distance. The Task 2 runs yielded strong results, while the
        Task 1 results were less competitive.


Keywords: Community Question Answering (CQA), Mathematical Informa-
tion Retrieval, Math-aware search, Math formula search


1     Introduction
The ARQMath lab at CLEF 2020 has two tasks [3]. In the first task, Answer
Retrieval, participants were given mathematical questions and asked to return a
set of relevant answers for each question. In the second task, Formula Retrieval,
participants were given a mathematical formula as the query and asked to
retrieve a set of relevant formulas.
    The Document and Pattern Recognition Lab (DPRL) from the Rochester
Institute of Technology (RIT, USA) participated in both tasks. For Task 1, we
used a text and formula embedding approach, where each question and answer
is represented by a vector of floating-point numbers. We submitted four runs for this task.
    For the second task, we submitted 3 runs, all based on a mathematical for-
mula embedding model, Tangent-CFT [4]. This embedding model uses both
Symbol Layout Tree (SLT) and Operator Tree (OPT) formula representations.
The SLT representation encodes the appearance of the formula, while the OPT
representation encodes its operator syntax (i.e., the operator tree). Figure 1
    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25
    September 2020, Thessaloniki, Greece.
Fig. 1. Formula $x^2 = 4$ represented as (a) Symbol Layout Tree and (b) Operator Tree.


shows the SLT and OPT representations for the formula $x^2 = 4$. In the SLT rep-
resentation, the edge labels show the spatial relationship between the formula
elements. For instance, the edge label ‘a’ shows that the number ‘2’ is located
above the variable ‘x’, while the edge label ‘n’ shows that the operator ‘=’ is located
next (to the right) after ‘x’. In the OPT representation, the edge labels for non-commutative
operators indicate argument position. For more details, refer to Mansouri et
al. [4].
While our results for Task 1 were not as good as those of some other partic-
ipating teams, our Task 2 results were noteworthy for including a system that
was statistically indistinguishable from the baseline (which was a strong base-
line relative to submitted systems). In this paper, we first introduce our runs for
Task 2 because some techniques from Task 2 were used in our Task 1 systems.


2     Formula Retrieval (Task 2)

For Task 2, we built three systems and submitted one run from each. All three
systems use formula embeddings produced from strings generated after parsing
formulas with Tangent-S [2], which produces both SLT and OPT representations.
Tangent-S and Tangent-CFT are publicly available.3
    In Tangent-S, candidate formula SLTs and OPTs are first retrieved indepen-
dently, using tuples representing the relative paths between pairs of symbols. The
top-2000 results from the separate OPT and SLT result lists are reranked after
aligning candidates to the query formula, using an ordered list based on three
features (the second and third features are used to break ties when ranking): 1) a
unified structural matching score (Maximum Subtree Similarity (MSS)), 2) pre-
cision for candidate symbols matched only by unification (treated as a penalty),
and 3) symbol recall based on exact matching of candidate and query symbols.
The reranked candidates are then rescored again using a linear combination of
the three-element score vectors associated with each retrieved formula’s SLT
and OPT. This combines the SLT and OPT score vectors for each candidate
formula into a single rank score. The baseline uses a weight vector learned on
the NTCIR-12 Wikipedia Formula Browsing task test collection [2].
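To make that score combination concrete, the following is a minimal sketch (not
the Tangent-S implementation; the weight values are illustrative placeholders,
not those learned on NTCIR-12):

```python
import numpy as np

# Illustrative only: combine a candidate's SLT and OPT score vectors
# (MSS, unification precision, symbol recall) into one rank score with a
# learned weight vector. The weights below are placeholders.
w = np.array([0.7, -0.1, 0.2, 0.5, -0.05, 0.15])

def rank_score(slt_scores, opt_scores, weights=w):
    """Linear combination of the SLT and OPT score vectors for one candidate."""
    return float(np.dot(weights, np.concatenate([slt_scores, opt_scores])))

print(rank_score(np.array([0.9, 0.1, 0.8]), np.array([0.7, 0.0, 0.9])))
```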
3 https://www.cs.rit.edu/~dprl/software.html
    In the following, we describe our formula retrieval models, all of which use the
Tangent-S SLT and OPT tree encodings. Note that our systems do not use any
of the retrieval models of Tangent-S: only the SLT and OPT string encodings,
which represent depth-first traversals on OPTs and SLTs.


2.1   Tangent-CFT

Tangent-CFT [4] is an embedding model for mathematical formulas. This model
uses both SLT and OPT representations of formulas to consider both the appear-
ance and the syntax of formulas. The indexing steps of Tangent-CFT are summarized
below (for the complete description, please see the paper; a brief code sketch follows the list):

 – Tuple Extraction: Using Tangent-S, LaTeX formulas are converted to Pre-
   sentation and Content MathML, from which the internal SLT and OPT
   formula representations are produced. Depth-first traversals of these OPT
   and SLT trees are used to produce strings consisting of a sequence of tuples.
   Each tuple has three elements $(N_i, N_j, E)$. $N_i$ and $N_j$ are node values
   of the form Type!Value, where the type can take values such as Variable (V)
   or Number (N), and the value is the variable name or numeric value. $E$ is
   the sequence of edge labels connecting the two nodes.
 – Tuple Encoding: Each tuple is then tokenized and enumerated. The tok-
   enization is based on type and value. For example, the tuple (V!x, N!2, a),
   which represents $x^2$, is tokenized to {‘V’, ‘x’, ‘N’, ‘2’, ‘a’}.
 – Training Embedding Models with fastText: After tokenizing each tu-
   ple, each token is considered as a character and the whole tuple as a word.
   The resulting sequence of tuples is considered as a sentence in which the
   words (tuples) appear. As shown in Mansouri et al. [5], mathematical for-
   mulas can be rather infrequent, so the fastText [1] n-gram embedding model
   is used to get vector representations for each tuple. The formula vector rep-
   resentation is then obtained by averaging the vector representations for the
   individual tuples.
 – Combining the Vector Representations: Tangent-CFT uses three rep-
   resentations of the formulas: SLT, OPT, and SLT-Type. In SLT-Type, each
   node represents only the type, and not the value. This can help with the par-
   tial matching of formulas especially for the cases where the variable names
   are not critical. For each representation, a separate embedding model is
   trained and a vector representation is obtained. The final formula vector is
   obtained by adding the 3 vectors together.
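The following is a minimal sketch of these steps, assuming the gensim fastText
implementation; the tuple tokenization, toy corpus, and hyper-parameter values
are simplified assumptions rather than the released Tangent-CFT code:

```python
import numpy as np
from gensim.models import FastText  # one possible fastText implementation

def tokenize_tuple(n_i, n_j, e):
    """E.g. ('V!x', 'N!2', 'a') -> 'V|x|N|2|a', treated as one fastText word."""
    return "|".join(n_i.split("!") + n_j.split("!") + [e])

# Toy SLT tuple sequence for x^2 = 4; each formula is one "sentence".
formulas = [[("V!x", "N!2", "a"), ("V!x", "O!=", "n"), ("O!=", "N!4", "n")]]
sentences = [[tokenize_tuple(*t) for t in f] for f in formulas]

# Small bucket keeps this toy example light; the real model uses larger settings.
model = FastText(sentences, vector_size=300, window=5, min_count=1, sg=1,
                 epochs=10, bucket=1000)

def formula_vector(tuples, model):
    """Average the tuple (word) vectors to obtain one formula vector."""
    return np.mean([model.wv[tokenize_tuple(*t)] for t in tuples], axis=0)

# Tangent-CFT trains separate models for SLT, OPT, and SLT-Type tuples and
# sums the three resulting vectors; here only a single representation is shown.
vec = formula_vector(formulas[0], model)
```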

We trained Tangent-CFT on Math Stack Exchange formulas, using the de-
fault parameters which were obtained by training the model on the NTCIR-12
Wikipedia Formula Browsing task. SLT and OPT representations are sometimes
not available for specific formulas, and Tangent-CFT will only retrieve formulas
for which at least one of those representations is available. For this task, if one
of the formula representations is not available, a vector of zeros is used for that
embedding.
Fig. 2. Overview of Tangent-CFT. The sets of SLT, OPT, and SLT-Type tuples are
extracted from the formula query using Tangent-S. These tuples are then encoded using
the Tangent-CFT encoder, and compared against vector representations for formulas
in the collection using cosine similarity.


2.2   Tangent-CFT-Plus
As described in [3], when deciding relevance, assessors might consider the
question (context) in which the formula appeared. Tangent-CFT only consid-
ers the structural similarity of formulas when embedding them. Tangent-CFT-
Plus is an extension of the Tangent-CFT embedding model in which each formula
has two vector representations:
 – Formula Vector : Vector representation obtained by Tangent-CFT. The vec-
   tor size is 300.
 – Text Vector : Vector representation obtained by treating the formula as a word
   and training the fastText model on the words surrounding the formula.
   We used a vector size of 100, which is the fastText default value.
    Each formula is represented as the concatenation of Formula and Text vec-
tors.
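A minimal sketch of this representation, assuming the two component vectors
are already available as numpy arrays:

```python
import numpy as np

def cft_plus_vector(formula_vec, text_vec):
    """Concatenate the 300-d Tangent-CFT vector with the 100-d text vector."""
    assert formula_vec.shape == (300,) and text_vec.shape == (100,)
    return np.concatenate([formula_vec, text_vec])  # 400-d Tangent-CFT-Plus vector
```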

2.3   Tangent-CFT with Tree Edit Distance
Experimental results on the NTCIR-12 Wikipedia Formula Browsing task test
collection [7] showed that Tangent-CFT does better at partial matching than at
full matching. To address that issue, we extended Tangent-CFT by re-ranking
the retrieval results based on tree-edit distance.
    Tree-edit distance (TED) is the minimum cost of converting one tree to
another. Three edit operations were considered: insertion, deletion and substi-
tution. Each of these operations can have different weights when calculating the
tree-edit distance cost. The ranking score for each retrieved formula is calculated
as:
$$\mathrm{sim}(T_1, T_2) = \frac{1}{\mathrm{TED}(T_1, T_2) + 1}. \qquad (1)$$
We tuned the weights for each edit operation over the range [0, 1] with step size
0.05, on the NTCIR-12 test collection using leave-one-out cross-validation. In the
NTCIR-12 Formula Browsing Task test collection [7] there are 20 formula queries
without wildcards. We learn operation weights for each query independently, and
report results as averages over the 20 queries using the weights learned for each query.
Our goal was to maximize the harmonic mean bpref score, which balances full
and partial bpref scores. As there are two tree representations for formulas (SLT
and OPT), the re-ranking is done on each representation separately, and the
results are then linearly combined. The linear combination weight is calculated in
the same way as the edit operation weights, using leave-one-out cross-validation.

Table 1. DPRL runs for Task 2, results averaged over 45 topics and computed over
deduplicated ranked lists of visually distinct formulas. For MAP′ and P@10, relevance
was binarized as H+M (High or Medium).

                                                   Evaluation Measures
        Run                 Data     Primary      nDCG′     MAP′      P@10
        Baseline
        Tangent-S           Math       (X)        (0.506)   (0.288)   (0.478)
        DPRL
        Tangent-CFTED       Math        X          0.420     0.258     0.502
        Tangent-CFT         Math                   0.392     0.219     0.396
        Tangent-CFT-Plus    Both                   0.135     0.047     0.207
    To use the tree-edit distance for formula retrieval, the following steps are
taken (a brief code sketch follows the list):

 1. Retrieve the top-1000 results with the Tangent-CFT model.
 2. Re-rank the results with tree-edit distance using SLT and OPT representa-
    tions. For SLTs, the edit operation costs were set to 0.85, 0.15, and 0.54 for
    deletion, substitution, and insertion respectively, and for OPTs
    they were set to 0.28, 0.185, and 0.225.
 3. The re-ranked results with OPT and SLT representations are linearly com-
    bined as:

               $$\mathrm{Score}_q(f) = \alpha \cdot \mathrm{SLT\text{-}Score}_q(f) + (1 - \alpha) \cdot \mathrm{OPT\text{-}Score}_q(f) \qquad (2)$$

      with α = 0.95. The tree-edit distance values are calculated using a publicly-
      available Python package for the APTED algorithm [6].
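A minimal sketch of this re-ranking score, assuming the interface of the publicly-
available apted package; the bracket-notation trees are toy stand-ins for the
parsed SLTs and OPTs:

```python
from apted import APTED, Config
from apted.helpers import Tree

class SLTCosts(Config):
    """Per-operation costs for SLTs (0.85 delete, 0.15 substitute, 0.54 insert)."""
    def delete(self, node):
        return 0.85
    def insert(self, node):
        return 0.54
    def rename(self, node1, node2):  # substitution
        return 0.0 if node1.name == node2.name else 0.15

def ted_score(query_tree, candidate_tree, config):
    """Eq. (1): sim(T1, T2) = 1 / (TED(T1, T2) + 1)."""
    ted = APTED(query_tree, candidate_tree, config).compute_edit_distance()
    return 1.0 / (ted + 1.0)

def combined_score(slt_score, opt_score, alpha=0.95):
    """Eq. (2): linear combination of the SLT- and OPT-based scores."""
    return alpha * slt_score + (1 - alpha) * opt_score

# Toy trees standing in for two parsed SLTs.
q = Tree.from_text("{eq{sup{x}{2}}{4}}")
c = Tree.from_text("{eq{sup{x}{2}}{9}}")
print(combined_score(ted_score(q, c, SLTCosts()), opt_score=0.5))
```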


2.4     Results

Table 1 shows the results for our runs and for the baseline (Tangent-S), taken
from the lab overview paper [8]. For all three evaluation measures, the apparent
differences between Tangent-CFTED and Tangent-S (which, for P@10, are about 5%
relatively higher for Tangent-CFTED) are not statistically significant by a two-
tailed paired t-test (at p < 0.05).
    We analyzed our runs against the baseline and against each other. Our first
analysis was comparing the Tangent-S system against Tangent-CFTED to inves-
tigate how many relevant formulas are retrieved by both systems, and how many
are retrieved by one system and not the other. We considered different relevance
degrees, first treating only high as relevant, then high and medium as relevant,
and finally all three levels as relevant. Figure 3 shows Venn diagrams of the
retrieval results for Tangent-S (the right circles, with dashed blue outline)
and Tangent-CFTED (the left circles, with grey background).
    The largest difference in P@10 values between Tangent-CFTED and Tangent-
S was 0.5, for the formula
$$(x + y)^k \geq x^k + y^k \qquad (3)$$
(Topic B.48). Half of the top-10 formulas retrieved by Tangent-CFTED are anno-
tated as highly relevant and the other half are annotated as medium. Tangent-S
retrieved 4 formulas annotated as high/medium, and 5 as low. Formulas such
as $(x + y)^p \equiv x^p + y^p \pmod{p}$ with a low relevance rating get a higher rank in
Tangent-S, while formulas such as $(x + y)^a \geq x^a + y^a$ and $(x + y)^s \geq x^s + y^s$ that
are highly relevant get lower ranks and are not in the top-10 unique formulas.
However, these last two formulas appear at ranks 6 and 3 in the Tangent-CFTED
search results, respectively.
    Another topic where Tangent-CFTED did better than the baseline is the
formula
$$\sum_{k=0}^{n} \binom{n}{k} k \qquad (4)$$

(Topic B.4) where P@10 for the baseline was 0, but 0.4 for Tangent-CFTED.
The top-5 unique formulas retrieved by Tangent-S and Tangent-CFTED for this
query are shown in Table 2.
   For 8 Task 2 topics, the P@10 for Tangent-CFTED was 0 (i.e., no formulas
with medium or high relevance ratings were found), and for 4 of these topics,
no relevant formulas (at any relevance level) were retrieved in the top-10 results.
The highest P@10 achieved by Tangent-S for these 8 topics was 0.3. Some of
these topics were more complex than other queries, and had fewer relevant hits.
For example, for the formula
$$\exists p\,\Big(p \text{ is prime} \rightarrow \forall x\,(x \text{ is prime})\Big) \qquad (5)$$
(Topic B.56) there are only 5 formulas with a relevance rating of medium or
high in the collection. For this formula, the Tangent-CFT model fails to retrieve


[Figure 3: three Venn diagrams, one per relevance threshold (High; High and Medium; High, Medium and Low).]

Fig. 3. Venn Diagrams for all Visually Distinct Relevant formulas in Task 2 (45 Top-
ics). Each diagram indicates how many relevant formulas were retrieved by Tangent-S
(blue circles) and Tangent-CFTED (grey circles) within their top-1000 results. Rele-
vant formulas missed by both systems are at top-left in each diagram. In total, there
are 1,397 relevant formulas in the collection (509 high, 387 medium, and 501 low).
                 Table 2. Top-5 unique formulas for query $\sum_{k=0}^{n} \binom{n}{k} k$.

                        Tangent-S                                            Tangent-CFTED
 Rank   Formula                                          Relevance       Formula                              Relevance
  1.    $\sum\limits_{k=0}^{n} \binom{n}{k} k\,2^{n-k}$  Non-Relevant    $\sum_{k=0}^{n} \binom{n}{k} k$      High
  2.    $\sum_{k=0}^{n} \binom{n}{k} k\,2^{n-k}$         Non-Relevant    $\sum_{k=0}^{n} \binom{n}{k} M_k$    Non-Relevant
  3.    $\sum_{k=0}^{n} \binom{n}{k} k!\,D_{n-k}$        Non-Relevant    $\sum_{k=0}^{n} \binom{n}{k} x^k$    Medium
  4.    $\sum_{k=0}^{n} \binom{n}{k} k^p$                Low             $\sum_{k=0}^{n} \binom{n}{k} x^k$    Medium
  5.    $\sum_{k=0}^{n} \binom{n}{k} x^k (1-x)^{n-k} k$  Non-Relevant    $\sum_{k=0}^{n} \binom{n}{k} k^p$    Low



any relevant formulas in its top-10 results, and Tangent-CFTED has only one
relevant formula, $\exists p\,(p \text{ is prime and } p^2 \mid n)$, which has a low relevance rating at
rank 5. It is possible that the larger depth used for re-ranking by Tangent-S
(i.e., top-2000, vs. top-1000 for Tangent-CFTED) may be contributing to the
difference in performance for these topics.
    Comparing Tangent-CFT-Plus with our other two runs, there were four
queries for which Tangent-CFT-Plus had higher P@10. One of these cases was
the query
$$\frac{df}{dx} = f(x + 1) \qquad (6)$$
where the P@10 for our other two runs was 0, but for Tangent-CFT-Plus was
0.4. Tangent-CFT only retrieves 6 relevant formulas for this query, and Tangent-
CFTED fails to re-rank them better than Tangent-CFT. The Tangent-CFT-Plus
model takes advantage of surrounding text. Terms such as “differential equations”
appear frequently around the related formulas. Therefore, Tangent-CFT-Plus
gives higher ranks to formulas such as $\frac{df(x)}{dx} = f(x)$, which was the first formula
retrieved by this system for the query $\frac{df}{dx} = f(x + 1)$.
    Performance. Table 3 shows the minimum, maximum, mean, and standard
deviation of retrieval times for our three runs in Task 2. Our runs were done
on a machine with 4 NVIDIA GeForce GPU cards with 11 GB of memory, 528
GB of RAM, and an Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz (32 cores). For
calculating the cosine similarity of vectors we used the PyTorch framework. As can
be seen, the retrieval time for Tangent-CFT is lower than that of Tangent-CFT-Plus
due to its smaller vector size. As Tangent-CFTED has an additional re-ranking
step, its retrieval time is longer than that of Tangent-CFT.
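As an illustration of the scoring step, cosine similarities between one query vector
and all indexed vectors can be computed in a single PyTorch call (the tensors
below are hypothetical stand-ins for our index):

```python
import torch
import torch.nn.functional as F

# Hypothetical index of formula vectors (one 300-d row per indexed formula).
index = torch.randn(100_000, 300)
query = torch.randn(300)

# Cosine similarity of the query against every indexed vector, then top-1000.
scores = F.cosine_similarity(query.unsqueeze(0), index, dim=1)
top_scores, top_ids = torch.topk(scores, k=1000)
```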
                 Table 3. Retrieval Times in Seconds for Task 2.

                                         Run Time (Seconds)
               System              Min (Topic)    Max (Topic)    (Avg, StDev)
               Tangent-CFT         0.653 (B.4)    0.669 (B.41)   (0.658, 0.004)
               Tangent-CFT-Plus    0.848 (B.69)   0.857 (B.54)   (0.851, 0.002)
               Tangent-CFTED       0.672 (B.3)    7.052 (B.26)   (1.752, 1.622)

    Post-Task Experiments. After the submission of our runs, we found an
error in our Tangent-CFT-Plus run, which had used only partial data for retrieval:
results were returned only from answers written between 2010 and 2015. Therefore,
we ran the model again after assessment was complete (i.e., after pooling). With
correct use of the dataset, Tangent-CFT-Plus achieved an nDCG′ of 0.305, a MAP′
of 0.155, and a P@10 of 0.342.
    In another experiment, we examined the Tangent-CFTED model to check
the effect of the α value used to combine SLT and OPT results. This value was set
to 0.95, which was learned on the NTCIR-12 dataset and leans heavily toward the SLT
representation. We conducted another experiment by setting α to 0.5,
weighting SLTs and OPTs equally. With this setting, the effectiveness measures
were an nDCG′ of 0.424, a MAP′ of 0.253, and a P@10 of 0.488. This produces
small changes in the nDCG′ and MAP′ scores (≤ 0.5%) and a decrease in P@10 of
about two percent compared to Tangent-CFTED with α = 0.95, so
giving greater weight to OPTs did not improve the performance of the model.
    Note that the definition of formula relevance in NTCIR-12 was different from
that in ARQMath, and therefore in future work we will do parameter tuning using
the ARQMath dataset.


3     Answer Retrieval (Task 1)
In this section we present our system design, runs, and results for Task 1.

3.1   System Design
For Task 1, our approach was to use text and formula embeddings to find relevant
answers for each topic. Our runs differ in how the embedding models are used
for retrieval. Each post (question or answer) in our models is represented as a
vector with dimension 400: the first 100 elements being the text embedding of
the post and the next 300 being the formula embedding.
    We trained fastText [1] to get text embeddings for each post. To tune the
fastText parameters, we used as training questions 1,003 questions published in
2018 that had duplicate or related questions. The duplicate and related questions
were determined by Math Stack Exchange moderators, and were provided as
data by the ARQMath organizers. The answers given to the duplicate or related
questions were considered relevant answers for the training questions. For each
training question, all the answers from duplicate and related questions were
collected and sorted by their scores. The first quarter of the answers was then
labeled as high relevance, the second quarter as medium, and the third quarter
as low.
    Using the training questions, we tuned the fastText parameters with our
first proposed retrieval system, aiming to maximize precision at 10 on the training
questions. After hyper-parameter tuning, our final setting for the fastText model
was a skip-gram model with negative sampling (15 negative samples), a context
window of size 5, character n-grams of lengths 2 to 5, and a vector size of 100. The
fastText model was trained on all the questions, answers, and comments in the
collection for 10 epochs. As fastText provides embeddings for words, we averaged
the vector representations of the words inside a post to get a post embedding.
For questions, the vector representation is the average of the vector
representations of the title and the body of the question.
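The sketch below illustrates this text-embedding step, assuming the gensim
fastText implementation and using the hyper-parameters reported above; the toy
corpus and whitespace tokenization are simplified stand-ins for the actual
collection processing:

```python
import numpy as np
from gensim.models import FastText  # one possible fastText implementation

# Toy stand-in for the Math Stack Exchange questions, answers, and comments.
all_posts = [
    "how do i solve x^2 = 4",
    "take the square root of both sides",
]
corpus = [post.lower().split() for post in all_posts]

# Skip-gram, 15 negative samples, window 5, char n-grams 2-5, 100-d, 10 epochs.
# The small bucket keeps this toy example light.
text_model = FastText(corpus, vector_size=100, window=5, sg=1, negative=15,
                      min_n=2, max_n=5, min_count=1, epochs=10, bucket=1000)

def post_text_vector(post_text, model=text_model):
    """Average the word vectors of a post's tokens (100-d)."""
    return np.mean([model.wv[t] for t in post_text.lower().split()], axis=0)
```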
    The second part of a post vector is the formula embedding. We
trained the formula embedding model, Tangent-CFT [4], with the same param-
eters that this model used for the NTCIR-12 formula browsing task [7]. The
model is described in the previous section. As each post can have multiple formulas,
the formula vector is the unweighted average of all the formula vectors in three
of our runs. In one run (DPRL3), we used a weighted average of the formula vectors,
using the length of each formula as its weight.
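Putting the two parts together, a post's 400-element vector can be built as in the
sketch below (inputs are hypothetical numpy arrays; the length-weighted branch
corresponds to the DPRL3 run described in the next section):

```python
import numpy as np

def post_vector(text_vec, formula_vecs, latex_lengths=None):
    """100-d text embedding followed by the 300-d (averaged) formula embedding."""
    if not formula_vecs:                          # post with no formulas
        formula_part = np.zeros(300)
    elif latex_lengths is None:                   # unweighted average (DPRL1/2/4)
        formula_part = np.mean(formula_vecs, axis=0)
    else:                                         # weight by LaTeX string length (DPRL3)
        formula_part = np.average(formula_vecs, axis=0, weights=latex_lengths)
    return np.concatenate([text_vec, formula_part])  # shape (400,)
```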

3.2   Submitted Runs
Here we describe our four runs for Task 1, which are summarized in Table 4.
    DPRL1. This was our primary run, in which the answers in the collection
are represented as the summation of two vector representations: one for the an-
swer and one for the question to which the answer was given. Both the answer
and question vector representations are obtained as described before, by con-
catenating the text and formula vectors. For each topic, the cosine similarities
of the topic’s vector and all the answers’ vectors in the collection are computed
and used to rank the answers.
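A minimal sketch of this scoring step, with hypothetical lists of 400-dimensional
numpy vectors as inputs:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def dprl1_rank(topic_vec, answer_vecs, parent_question_vecs):
    """Each answer is represented by its own vector plus its parent question's
    vector, then ranked by cosine similarity to the topic vector."""
    reps = [a + q for a, q in zip(answer_vecs, parent_question_vecs)]
    scores = [cosine(topic_vec, r) for r in reps]
    return sorted(range(len(reps)), key=lambda i: scores[i], reverse=True)
```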
    DPRL2. This alternate run is similar to our primary run, with the difference
being that the answer vector is just the vector representation of the answer.
Compared to our primary run, the question to which the answer was given is
ignored. For each topic, the cosine similarity of the topic vector and answer
vector is the similarity score.
    DPRL3. This alternate run is again similar to our primary run, with the
difference being that the formula part of the vector is calculated differently. As
mentioned earlier, there could be multiple formulas in a post. In this run, instead
of using the unweighted average of vector representations of formulas, we used
the weighted average of vector representations by considering the length of each
formula as the weight. Our assumption was that larger formulas are more
important; we used the length of the LaTeX string representation of each formula for
this. As with the primary run, the similarity of topic and answer is the cosine
similarity of their vector representations.
    DPRL4. This alternate run first finds questions that are similar to the topic
and then ranks the answers to those similar questions. To perform this ranking,
after finding similar questions to the topic, we compute the cosine similarity
score of the topic vector and the vector for the similar question with which each
answer is associated, and we then multiply that by the score assigned to that
answer by Math Stack Exchange users, and we then sort the answers in decreasing
order of that product. Finally, we perform a reranking step that makes use of the
tags assigned (by the people asking the questions) to the topic question and to the
question posts that had been found: answers associated with found questions that
have at least one tag exactly matching a tag of the topic question are moved ahead
of all other answer posts, with their relative order maintained. A brief code sketch
of this scoring step is given after Table 5.

Table 4. Summary of DPRL Submissions for ARQMath Answer Retrieval Task. Each
Answer has a vector representation of size 400, where the first 100 elements belong to
the text embedding, and the final 300 elements are the formula embedding.

                            Embedding Vector (400 elements)
        Run        Text (100)           Formula (300)
        DPRL1      Answer + Question    Tangent-CFT (unweighted average)
        DPRL2      Answer               Tangent-CFT (unweighted average)
        DPRL3      Answer + Question    Tangent-CFT (weighted average)
        DPRL4      Question             Tangent-CFT (unweighted average)

Table 5. Task 1 (CQA) results, averaged over 77 topics. P indicates a primary run, M
indicates a manual run, and (X) indicates a baseline pooled at the primary run depth.
For P@10 and MAP′, H+M (High or Medium) binarization was used. The best
baseline results are in parentheses. * indicates that one baseline did not contribute to
judgment pools.

                                         Run Type      Evaluation Measures
       Run                      Data     P    M        nDCG′    MAP′     P@10
       Baselines
       Linked MSE posts         n/a     (X)            (0.303)  (0.210)  (0.417)
       Approach0 *              Both          X         0.250    0.100    0.062
       TF-IDF + Tangent-S       Both    (X)             0.248    0.047    0.073
       TF-IDF                   Text    (X)             0.204    0.049    0.073
       Tangent-S                Math    (X)             0.158    0.033    0.051
       DPRL4                    Both                    0.060    0.015    0.020
       DPRL2                    Both                    0.054    0.015    0.029
       DPRL1                    Both     X              0.051    0.015    0.026
       DPRL3                    Both                    0.036    0.007    0.016
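The sketch below illustrates the DPRL4 scoring and tag-based reordering; the
candidate data structure is a hypothetical stand-in for our actual indexing:

```python
import numpy as np

def _cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def dprl4_rank(topic_vec, topic_tags, candidates):
    """candidates: dicts with 'question_vec' (vector of the found similar question),
    'answer_votes' (Math Stack Exchange score of the answer), and 'question_tags'."""
    for c in candidates:
        c["score"] = _cosine(topic_vec, c["question_vec"]) * c["answer_votes"]
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    # Stable partition: answers whose found question shares a tag with the topic
    # move ahead of all others, keeping relative order within each group.
    shares_tag = lambda c: bool(set(c["question_tags"]) & set(topic_tags))
    return [c for c in ranked if shares_tag(c)] + [c for c in ranked if not shares_tag(c)]
```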


3.3   Results

For the answer retrieval task, three evaluation measures were considered: nDCG′,
MAP′, and P@10. Our results for this task were not promising. Table 5 shows
the baselines and our run results for this task, taken from the lab overview
paper [8]. Our runs mostly did better for topics in the category “Computation”
than for topics in the categories “Concept” or “Proof”. For instance, all the topics for
which our DPRL2 run (which had our highest P@10) was able to find relevant
answers were in the category “Computation”. For two topics, the DPRL2 run
had P@10 higher than 0.5 (note that P@10 was calculated by considering the
relevance degrees of high and medium as relevant). Figure 4 shows topic A.5, for
which DPRL2 achieved a P@10 of 0.9, along with the first answer retrieved for
this topic, which is highly relevant.

Fig. 4. Topic A.5 (“A family has two children. Given that one of the children is a boy,
what is the probability that both children are boys?”) and the top answer retrieved by
the DPRL2 run. The highest P@10 (0.9) was achieved for this topic.

                 Table 6. Retrieval Times in Seconds for Task 1.

                                        Run Times (in seconds)
               System              Min (Topic)     Max (Topic)    (Avg, StDev)
               DPRL 1              0.055 (A.84)    0.061 (A.41)   (0.056, 0.0001)
               DPRL 2              0.056 (A.54)    0.057 (A.49)   (0.056, 0.0006)
               DPRL 3              0.056 (A.1)     0.057 (A.18)   (0.056, 0.0001)
               DPRL 4              0.901 (A.26)    1.143 (A.30)   (0.985, 0.0564)
    Another observation was that the DPRL4 run, which has the highest
nDCG′ value among our runs, was able to find relevant answers for topics in the
Proof and Concept categories. For instance, for topic A.38 (in the Concept category, with
difficulty Hard), this run achieved a P@10 of 0.4.
    As mentioned earlier, we did the hyper-parameter tuning for our proposed
models based on MSE scores. Comparing these scores with the relevance
annotations, there is no correlation; that is to say, not all relevant answers have
high scores and not all non-relevant answers have low scores. Therefore, in our
future work, we will do the hyper-parameter tuning using the annotated data.
    Performance. Table 6 shows the minimum, maximum, and average retrieval
time for each of our proposed models in Task 1. We used the same machine as for
Task 2. As can be seen, our first three runs have similar retrieval times, as they
just use the cosine similarity of document vector representations. However, our
last run has an extra re-ranking step, which results in longer retrieval
times.


4   Conclusion
In this paper, we described the DPRL runs for the ARQMath lab at CLEF 2020.
For Formula Retrieval (Task 2), DPRL submitted three runs. Our run that uses
formula embedding and then re-ranks by tree-edit distance achieved the highest
P@10 value and the second-best nDCG′ after the baseline system. For the Answer
Retrieval task (Task 1), we used fastText to embed text, and Tangent-CFT to
embed formulas, concatenating text and formula vectors to represent each post
(answer or question). Our runs differed from each other in how the
vector embeddings were used for answer retrieval. Our best run found questions
similar to the topics and used the answers given to those similar questions.
In future work, we plan to conduct further analysis of our results and to further
enhance our systems for both tasks.

Acknowledgements. This material is based upon work supported by the National
Science Foundation (USA) under Grant No. IIS-1717997 and the Alfred P. Sloan
Foundation under Grant No. G-2017-9827.


References
1. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
   subword information. Transactions of the Association for Computational Linguistics
   5, 135–146 (2017)
2. Davila, K., Zanibbi, R.: Layout and semantics: Combining representations for math-
   ematical formula search. In: Proceedings of the 40th International ACM SIGIR
   Conference on Research and Development in Information Retrieval. pp. 1165–1168
   (2017)
3. Mansouri, B., Agarwal, A., Oard, D., Zanibbi, R.: Finding old answers to new math
   questions: The ARQMath lab at CLEF 2020. In: European Conference on Information
   Retrieval (2020)
4. Mansouri, B., Rohatgi, S., Oard, D.W., Wu, J., Giles, C.L., Zanibbi, R.: Tangent-
   CFT: An embedding model for mathematical formulas. In: Proceedings of the 2019
   ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR).
   pp. 11–18 (2019)
5. Mansouri, B., Zanibbi, R., Oard, D.W.: Characterizing searches for mathematical
   concepts. In: Joint Conference on Digital Libraries (2019)
6. Pawlik, M., Augsten, N.: Tree edit distance: Robust and memory-efficient. Informa-
   tion Systems 56, 157–173 (2016)
7. Zanibbi, R., Aizawa, A., Kohlhase, M., Ounis, I., Topic, G., Davila, K.: NTCIR-12
   MathIR task overview. In: NTCIR (2016)
8. Zanibbi, R., Oard, D.W., Agarwal, A., Mansouri, B.: Overview of ARQMath 2020:
   CLEF lab on answer retrieval for questions on math (2020)