=Paper= {{Paper |id=Vol-2456/paper20 |storemode=property |title=Multi-Query Optimization in RDF Q/A System |pdfUrl=https://ceur-ws.org/Vol-2456/paper20.pdf |volume=Vol-2456 |authors=Jie Jiao,Shujun Wang,Xiaowang Zhang,Zhiyong Feng |dblpUrl=https://dblp.org/rec/conf/semweb/JiaoWZF19 }} ==Multi-Query Optimization in RDF Q/A System== https://ceur-ws.org/Vol-2456/paper20.pdf
Multi-Query Optimization in RDF Q/A System

           Jie Jiao, Shujun Wang, Xiaowang Zhang⋆ , and Zhiyong Feng

    College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
    Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
                   ⋆
                      Corresponding author: xiaowangzhang@tju.edu.cn



        Abstract. In this paper, we present an optimization to answer a ques-
        tion with multiple queries by detecting all common subqueries of that
        question. Moreover, we apply mutual information in reducing ambiguity
        of queries of a question to improve the quality of common subqueries. Fi-
        nally, to improve the accuracy of SPARQL query generated, we evaluate
        the semantic importance of each word in a question via TF-IDF dur-
        ing the generating queries process. Experiments show that our proposal
        ourperforms those single query execution.


1     Introduction
SPARQL is a standard way to access RDF data, it remains tedious for users.
Hence, qustion/answering(Q/A) based on Knowledge Graph has received wide
attention in both natural language processing and database areas.
    Generally, there are two significant challenges in RDF Q/A systems:question
understanding (Question → SPARQL) and query evaluation. A great deal
of research has been done to address these two challenges. For example, Zou et
al[2] proposed two frameworks to build the semantic query graph. To overcome
the second challenge, single-machine RDF systems, like gStore[5] and many dis-
tributed SPARQL query engines have been introduced.
    Phrase Linking is a huge challenge in Question Understanding stage. It refers
to a word wi in a question may have several meanings, e.g. the word “Paul
Anderson in question “Which movies are directed by Paul Anderson” can map to
three items(⟨Paul S. Anderson⟩ ,⟨Paul Anderson⟩ and ⟨Paul W. S. Anderson⟩)
in RDF dataset. However, these matches are not all we need. Therefore, we need
a way to remove as many failed matches as possible.
    The ambiguity in the Phrase Linking stage cannot be completely eliminated,
therefore, a question may still correspond to multiple SPARQL queries. The
current solution is to execute SPARQL queries one by one, but this method is
not efficient enough.
    In addition, picking which words in the question to generate a SPARQL
query is also an important challenge. At present, the method based on statistical
information is adopted, while the information provided by statistical methods is
relatively limited.
*
    Copyright 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
2       J. Jiao et al.

2    Overview of Our Approach
Mutual Information Disambiguation Consider a question “Which movies
are directed by Paul Anderson? ”. When we do phrase linking to the word “Paul
Anderson”, we may see that the RDF dataset contains a large number of entities
about the word “Paul Anderson”.
    ⟨Paul Anderson⟩ and ⟨Paul W S Anderson⟩ may be directors, but ⟨Paul S
Anderson⟩ is a teacher. According to the semantics of question, we can actually
know that although there are three “Paul Anderson” in RDF dataset, Movie-
related ⟨Paul Anderson⟩ and ⟨Paul W. S. Anderson⟩ are what we really need.
Hence, we can delete ⟨Paul S Anderson⟩.
    The above example illustrates the intuition of our approach, we can deal with
ambiguity by counting the number of predicates between entities in advance.
However, there are too many entities in RDF graph. In this paper, we point out
that collecting two types of lightweight information in RDF graphs:
    1. Count the number of relationships between different types of entities.
    2. Count the number of relationships between specific entities and different
types of entities.

Word Core Measurement In the challenge of Question Understanding, we
try to use the SPARQL Q to accurately express the semantics of the question
N. In fact, the different words in N are different in the importance of generating
Q. However, there is currently no way to measure the importance of each word
in N for generating Q. Question Understanding stage can be expressed by the
formula:f (N ) → Q.
    The natural language question N is composed of the word wi , and the S-
PARQL is composed of triple pattern pj , hence, we can convert the f (N ) → Q
into f (w1 , w2 , · · · , wn ) → (p1 , p2 , · · · , pm ), and then by vectorizing wi and pj we
can get the following formula:

                         f (−
                            →1 , −
                            w    →2 , · · · , −
                                 w            w→      −
                                                      → −   →            −→
                                               n ) → ( p1 , p2 , · · · , pm )             (1)

we defined a loss function as follows.
                                        ∑∑ m      ∑
                                                  n
                                 L=      (   −
                                             →
                                             pj −   −
                                                    →i )
                                                    w                                     (2)
                                              j=1         i=1

We choose transH[4] to vectorize triple patterns in SPARQL queries. By formula
2, we make the overall ⟨Ni , Qj ⟩ in the dataset as equal as possible.


3    Multi-Query Optimization
We can divide all phrases in N into two categories:
   (1) “The United States” in Figure 1 (c) only corresponds to the entity
⟨United States⟩ in Figure 1(d). We use symbol wo to represent this type of
phrases.
                                                 Multi-Query Optimization in RDF Q/A System                                                                  3

         Dependency tree                          Nodes                      Super Semantic                        SPARQL Query Graph
               Which actors                                                   Query Graph
                                                                                                                           
                                              Which actors                  Which movies
                                                                                                             ?actors     
                           in
        born
                                                                                                                      
                                                                                    were born in
                                                                                                         ?actors    
                         Transformers
      were          in                  Transformers The United States                                                     
                                                                     Transformers   The United States        ?actors    
                    The United States

                 (a)                                 (b)                              (c)                                      (d)


                            Fig. 1. An example of Question Understanding.


    (2) The phrase “Transformers” in Figure 1 (c) corresponds to multiple entities
⟨Transformers (1,2,3)⟩ in Figure 1(d). We use symbol wm to represent these
ambiguous phrases.
    From Figure 1, we can see that subquery {?actors ⟨birthPlace⟩ ⟨United States⟩}
is a common subquery among all SPARQL queries. Hence, we can conclude that
the SPARQL queries generated by phrases without ambiguity in a question are
unique and common.
    Besides, due to the large amount of data in RDF Graph, it is very likely
that a phrase in N corresponds to many entities. In this case, too many SPAR-
QL queries are generated. Hence, we present a scoring mechanism for words in
question. Pay more attention to phrases that have important implications.
    We introduce TF-IDF to measure the semantic importance of different words
in a question:
                                        |wi |             |N |
                     Score(wi ) = T F (       ) · IDF (           )            (3)
                                         |w|            |N (wi )|
   |wi | indicates the number of occurrences of the word wi , |w| denotes the total
number of words. |N | denotes the number of questions in the corpus, |N (wi )|
denotes the number of questions that contain the word wi .
   If the importance of a word wi is not high, but wi corresponds to a lot of
items in RDF graph. We can ignore wi . In this way, we can slightly reduce the
accuracy, but in return for a great increase in efficiency.


4   Experiments and Evaluation

From the data shown in Table 1, we can see that the accuracy of Our Approach is
slightly higher than the original work gAnswer[2] because of the better selection
of words wi in natural language question N .
    As can be seen from Figure 2, the efficiency of processing multiple SPAR-
QL queries can be improved by finding common structures between SPARQL
queries. Because it avoids redundant execution of common subquery. On this
basis, our method can further improve the execution efficiency of multiple S-
PARQL queries. Because our method can determine the common subquery in
the Question Understanding phase. That is, all phrases in the question that
4                   J. Jiao et al.

               90
                                         Serial Execution
               80                        Common Subquery

               70
                                         Our Approach                          Processed Right
               60
                                                                Our Approach      100     70
                                                                   gAnswer        100     68
    Time(ms)


               50

               40
                                                                    RFF           100     40
               30
                                                                 KWGAnswer        100     52
               20
                                                                    Aqqu          100     36
               10
                                                            Table 1. Evaluating QALD Testing Ques-
                0
                     1      2        3   4         5        tions

               Fig. 2. Question Processing Time


do not contain ambiguity will generate a common subquery after the question
understanding stage.


5               Conclusion
In this paper, we addressed the issue of multiple query optimization in RDF
Q/A system. We introduce machine learning algorithm to select the useful part
of question for SPARQL query generation. At the same time, we give a spe-
cific application of multiple SPARQL queries optimization. We hope that our
work can inspire other RDF system designers to apply machine learning more
in system design.


Acknowledgments
This work is supported by the National Key Research and Development Program
of China (2017YFC0908401) and the National Natural Science Foundation of
China (61972455,61672377). Xiaowang Zhang is supported by the Peiyang Young
Scholars in Tianjin University (2019XRX-0032).


References
1. Bidoit N., Herschel M., Tzompanaki A.: Efficient computation of polynomial expla-
   nations of why-not questions. In Proc. of CIKM 2015, pp.713–722.
2. Hu S., Zou L., Yu J.X., Wang H., Zhao D.: Answering natural language questions by
   subgraph matching over knowledge graphs (extended abstract). In Proc. of ICDE
   2018, pp.1815–1816.
3. Ren X., Wang J.: Multi-query optimization for subgraph isomorphism search.
   PVLDB,10(3):121–132 (2016).
4. Wang Z., Zhang J., Feng J., Chen Z.: Knowledge graph embedding by translating
   on hyperplanes. In Proc. of AAAI 2014, pp.1112–1119.
5. Zou L., Özsu M.T., Chen L., Shen X., Huang R., Zhao D.: gStore: A graph-based
   SPARQL query engine. VLDB J., 23(4):565–590 (2014).