=Paper=
{{Paper
|id=Vol-2456/paper20
|storemode=property
|title=Multi-Query Optimization in RDF Q/A System
|pdfUrl=https://ceur-ws.org/Vol-2456/paper20.pdf
|volume=Vol-2456
|authors=Jie Jiao,Shujun Wang,Xiaowang Zhang,Zhiyong Feng
|dblpUrl=https://dblp.org/rec/conf/semweb/JiaoWZF19
}}
==Multi-Query Optimization in RDF Q/A System==
Multi-Query Optimization in RDF Q/A System
Jie Jiao, Shujun Wang, Xiaowang Zhang⋆ , and Zhiyong Feng
College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
⋆
Corresponding author: xiaowangzhang@tju.edu.cn
Abstract. In this paper, we present an optimization to answer a ques-
tion with multiple queries by detecting all common subqueries of that
question. Moreover, we apply mutual information in reducing ambiguity
of queries of a question to improve the quality of common subqueries. Fi-
nally, to improve the accuracy of SPARQL query generated, we evaluate
the semantic importance of each word in a question via TF-IDF dur-
ing the generating queries process. Experiments show that our proposal
ourperforms those single query execution.
1 Introduction
SPARQL is a standard way to access RDF data, it remains tedious for users.
Hence, qustion/answering(Q/A) based on Knowledge Graph has received wide
attention in both natural language processing and database areas.
Generally, there are two significant challenges in RDF Q/A systems:question
understanding (Question → SPARQL) and query evaluation. A great deal
of research has been done to address these two challenges. For example, Zou et
al[2] proposed two frameworks to build the semantic query graph. To overcome
the second challenge, single-machine RDF systems, like gStore[5] and many dis-
tributed SPARQL query engines have been introduced.
Phrase Linking is a huge challenge in Question Understanding stage. It refers
to a word wi in a question may have several meanings, e.g. the word “Paul
Anderson in question “Which movies are directed by Paul Anderson” can map to
three items(⟨Paul S. Anderson⟩ ,⟨Paul Anderson⟩ and ⟨Paul W. S. Anderson⟩)
in RDF dataset. However, these matches are not all we need. Therefore, we need
a way to remove as many failed matches as possible.
The ambiguity in the Phrase Linking stage cannot be completely eliminated,
therefore, a question may still correspond to multiple SPARQL queries. The
current solution is to execute SPARQL queries one by one, but this method is
not efficient enough.
In addition, picking which words in the question to generate a SPARQL
query is also an important challenge. At present, the method based on statistical
information is adopted, while the information provided by statistical methods is
relatively limited.
*
Copyright 2019 for this paper by its authors. Use permitted under Creative Com-
mons License Attribution 4.0 International (CC BY 4.0).
2 J. Jiao et al.
2 Overview of Our Approach
Mutual Information Disambiguation Consider a question “Which movies
are directed by Paul Anderson? ”. When we do phrase linking to the word “Paul
Anderson”, we may see that the RDF dataset contains a large number of entities
about the word “Paul Anderson”.
⟨Paul Anderson⟩ and ⟨Paul W S Anderson⟩ may be directors, but ⟨Paul S
Anderson⟩ is a teacher. According to the semantics of question, we can actually
know that although there are three “Paul Anderson” in RDF dataset, Movie-
related ⟨Paul Anderson⟩ and ⟨Paul W. S. Anderson⟩ are what we really need.
Hence, we can delete ⟨Paul S Anderson⟩.
The above example illustrates the intuition of our approach, we can deal with
ambiguity by counting the number of predicates between entities in advance.
However, there are too many entities in RDF graph. In this paper, we point out
that collecting two types of lightweight information in RDF graphs:
1. Count the number of relationships between different types of entities.
2. Count the number of relationships between specific entities and different
types of entities.
Word Core Measurement In the challenge of Question Understanding, we
try to use the SPARQL Q to accurately express the semantics of the question
N. In fact, the different words in N are different in the importance of generating
Q. However, there is currently no way to measure the importance of each word
in N for generating Q. Question Understanding stage can be expressed by the
formula:f (N ) → Q.
The natural language question N is composed of the word wi , and the S-
PARQL is composed of triple pattern pj , hence, we can convert the f (N ) → Q
into f (w1 , w2 , · · · , wn ) → (p1 , p2 , · · · , pm ), and then by vectorizing wi and pj we
can get the following formula:
f (−
→1 , −
w →2 , · · · , −
w w→ −
→ − → −→
n ) → ( p1 , p2 , · · · , pm ) (1)
we defined a loss function as follows.
∑∑ m ∑
n
L= ( −
→
pj − −
→i )
w (2)
j=1 i=1
We choose transH[4] to vectorize triple patterns in SPARQL queries. By formula
2, we make the overall ⟨Ni , Qj ⟩ in the dataset as equal as possible.
3 Multi-Query Optimization
We can divide all phrases in N into two categories:
(1) “The United States” in Figure 1 (c) only corresponds to the entity
⟨United States⟩ in Figure 1(d). We use symbol wo to represent this type of
phrases.
Multi-Query Optimization in RDF Q/A System 3
Dependency tree Nodes Super Semantic SPARQL Query Graph
Which actors Query Graph
Which actors Which movies
?actors
in
born
were born in
?actors
Transformers
were in Transformers The United States
Transformers The United States ?actors
The United States
(a) (b) (c) (d)
Fig. 1. An example of Question Understanding.
(2) The phrase “Transformers” in Figure 1 (c) corresponds to multiple entities
⟨Transformers (1,2,3)⟩ in Figure 1(d). We use symbol wm to represent these
ambiguous phrases.
From Figure 1, we can see that subquery {?actors ⟨birthPlace⟩ ⟨United States⟩}
is a common subquery among all SPARQL queries. Hence, we can conclude that
the SPARQL queries generated by phrases without ambiguity in a question are
unique and common.
Besides, due to the large amount of data in RDF Graph, it is very likely
that a phrase in N corresponds to many entities. In this case, too many SPAR-
QL queries are generated. Hence, we present a scoring mechanism for words in
question. Pay more attention to phrases that have important implications.
We introduce TF-IDF to measure the semantic importance of different words
in a question:
|wi | |N |
Score(wi ) = T F ( ) · IDF ( ) (3)
|w| |N (wi )|
|wi | indicates the number of occurrences of the word wi , |w| denotes the total
number of words. |N | denotes the number of questions in the corpus, |N (wi )|
denotes the number of questions that contain the word wi .
If the importance of a word wi is not high, but wi corresponds to a lot of
items in RDF graph. We can ignore wi . In this way, we can slightly reduce the
accuracy, but in return for a great increase in efficiency.
4 Experiments and Evaluation
From the data shown in Table 1, we can see that the accuracy of Our Approach is
slightly higher than the original work gAnswer[2] because of the better selection
of words wi in natural language question N .
As can be seen from Figure 2, the efficiency of processing multiple SPAR-
QL queries can be improved by finding common structures between SPARQL
queries. Because it avoids redundant execution of common subquery. On this
basis, our method can further improve the execution efficiency of multiple S-
PARQL queries. Because our method can determine the common subquery in
the Question Understanding phase. That is, all phrases in the question that
4 J. Jiao et al.
90
Serial Execution
80 Common Subquery
70
Our Approach Processed Right
60
Our Approach 100 70
gAnswer 100 68
Time(ms)
50
40
RFF 100 40
30
KWGAnswer 100 52
20
Aqqu 100 36
10
Table 1. Evaluating QALD Testing Ques-
0
1 2 3 4 5 tions
Fig. 2. Question Processing Time
do not contain ambiguity will generate a common subquery after the question
understanding stage.
5 Conclusion
In this paper, we addressed the issue of multiple query optimization in RDF
Q/A system. We introduce machine learning algorithm to select the useful part
of question for SPARQL query generation. At the same time, we give a spe-
cific application of multiple SPARQL queries optimization. We hope that our
work can inspire other RDF system designers to apply machine learning more
in system design.
Acknowledgments
This work is supported by the National Key Research and Development Program
of China (2017YFC0908401) and the National Natural Science Foundation of
China (61972455,61672377). Xiaowang Zhang is supported by the Peiyang Young
Scholars in Tianjin University (2019XRX-0032).
References
1. Bidoit N., Herschel M., Tzompanaki A.: Efficient computation of polynomial expla-
nations of why-not questions. In Proc. of CIKM 2015, pp.713–722.
2. Hu S., Zou L., Yu J.X., Wang H., Zhao D.: Answering natural language questions by
subgraph matching over knowledge graphs (extended abstract). In Proc. of ICDE
2018, pp.1815–1816.
3. Ren X., Wang J.: Multi-query optimization for subgraph isomorphism search.
PVLDB,10(3):121–132 (2016).
4. Wang Z., Zhang J., Feng J., Chen Z.: Knowledge graph embedding by translating
on hyperplanes. In Proc. of AAAI 2014, pp.1112–1119.
5. Zou L., Özsu M.T., Chen L., Shen X., Huang R., Zhao D.: gStore: A graph-based
SPARQL query engine. VLDB J., 23(4):565–590 (2014).