=Paper=
{{Paper
|id=Vol-1688/paper-18
|storemode=property
|title=Dish Discovery via Word Embeddings on Restaurant Reviews
|pdfUrl=https://ceur-ws.org/Vol-1688/paper-18.pdf
|volume=Vol-1688
|authors=Chih-Yu Chao,Yi-Fan Chu,Yi Ho,Chuan-Ju Wang,Ming-Feng Tsai
|dblpUrl=https://dblp.org/rec/conf/recsys/ChaoCHWT16
}}
==Dish Discovery via Word Embeddings on Restaurant Reviews==
Chih-Yu Chao (1), Yi-Fan Chu (1), Yi Ho (2), Chuan-Ju Wang (1,3), and Ming-Feng Tsai (4)

(1) Department of Computer Science, University of Taipei, Taipei 100, Taiwan
(2) Department of Engineering Science and Ocean Engineering, National Taiwan University, Taipei 106, Taiwan
(3) Research Center for Information Technology Innovation, Academia Sinica, Taipei 115, Taiwan
(4) Department of Computer Science, National Chengchi University, Taipei 116, Taiwan
ABSTRACT
This paper proposes a novel framework for automatic dish discovery via word embeddings on restaurant reviews. We collect a dataset of user reviews from Yelp and parse the reviews to extract dish words. We then use the processed reviews as training texts to learn the embedding vectors of words via the skip-gram model. A nearest-neighbor-like score function is proposed to rank the dishes based on their learned representations. We briefly analyze the preliminary experiments and present a web-based visualization at http://clip.csie.org/yelp/.

Keywords
dish discovery, word embeddings, dish-word extraction

1. BACKGROUND
With the growth of social media, corporations such as Yelp have accumulated a large amount of user-generated content (UGC). Some studies in the literature aim to find critical information hidden in this content [2]. While much work has addressed accurate sentiment interpretation of reviews and recommendation, little has focused on dish-level analysis [4]. In this paper, therefore, we aim to provide a novel framework for automatic dish discovery from restaurant reviews via embedding techniques. We first employ regular expressions to parse restaurant reviews and extract dish words, and then use the processed reviews as training texts to learn an embedding vector for each word via the skip-gram model [3]. In addition, a nearest-neighbor-like score function is proposed to rank the dishes via their learned representations. Preliminary experiments are conducted on a real-world restaurant review dataset collected from Yelp Data Challenge.

2. METHODOLOGY
Our methodology consists of three main parts: 1) dish-word recognition, 2) word embedding learning, and 3) dish score calculation. As alluded to earlier, UGC usually contains a degree of noise and varied language usage; extracting dish names from user reviews is therefore a complicated task. For example, we observe in the dataset that users tend not to write the full name of a dish in their reviews; instead, they often write only the last word or the last two words. To deal with this issue, we use regular expressions (regexps) to extract dish names from the user reviews. However, this gives rise to another issue: a dish in one restaurant may share its name with dishes in other restaurants, which introduces ambiguity and lowers the accuracy of matching the correct dish name. We therefore attach the restaurant name to each dish name to resolve the ambiguity.

We then use the collection of processed reviews as training texts to learn an embedding of each word in the reviews via a continuous space language model, the skip-gram model. After the training phase, each word (including every dish) is represented by an n-dimensional vector (called the embedding of this word). Inspired by the k-nearest neighbors algorithm, we define the score for every dish d as:

$$S(d) = \sum_{k=1}^{m} \lambda_k f_k(d), \qquad (1)$$

where $f_k(d) = \frac{k}{\sum_{i=1}^{k} \lVert w_d - w_{s_i} \rVert}$, m is the total number of positive sentiment words considered, and $\lambda_i$ ($i = 1, \dots, m$) is a weighting parameter. In addition, $s_i$ denotes the i-th nearest positive sentiment word of the given dish d, and $w_d, w_{s_i} \in \mathbb{R}^n$ are the vector representations of the dish d and the sentiment word $s_i$, respectively.

In extreme case (1), with $\lambda_m = 1$ and $\lambda_i = 0$ for $i = 1, \dots, m-1$, the score function implements the concept of the average Euclidean distance between a dish and all the positive sentiment words (the score is the reciprocal of that average); in case (2), with $\lambda_1 = 1$ and $\lambda_i = 0$ for $i = 2, \dots, m$, the score is obtained from the single closest positive sentiment word to the dish.
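As a rough illustration of the score function in Eq. (1), the following Python sketch computes S(d) for a single dish from plain lists of coordinates. The function name `dish_score`, the argument names, and the toy two-dimensional vectors are hypothetical and are not taken from the authors' implementation; the sketch also assumes that the dish vector does not coincide with any sentiment word, so every distance is nonzero.

```python
import math
from itertools import accumulate


def dish_score(dish_vec, sentiment_vecs, weights):
    """Sketch of Eq. (1): S(d) = sum_k weights[k-1] * f_k(d).

    f_k(d) = k / (sum of the k smallest Euclidean distances between
    the dish vector and the positive sentiment word vectors).
    """
    if len(weights) != len(sentiment_vecs):
        raise ValueError("need exactly one weight per sentiment word")
    # Distances from the dish to every sentiment word, nearest first.
    dists = sorted(math.dist(dish_vec, s) for s in sentiment_vecs)
    # running[k-1] is the sum of the k smallest distances.
    running = accumulate(dists)
    return sum(
        lam * k / total
        for k, (lam, total) in enumerate(zip(weights, running), start=1)
    )


# Toy data (made up for illustration): distances to the dish are 5, 1 and 2.
dish = [0.0, 0.0]
words = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]

# Extreme case (1): only the last weight is 1 (average over all m words).
score_avg = dish_score(dish, words, [0.0, 0.0, 1.0])
# Extreme case (2): only the first weight is 1 (single nearest word).
score_min = dish_score(dish, words, [1.0, 0.0, 0.0])
```

With these toy vectors the sorted distances are 1, 2, and 5, so case (1) gives 3 / 8 and case (2) gives 1 / 1. Intermediate weightings would blend the two behaviours, which the paper leaves to future work.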
3. EXPERIMENTS
Our preliminary experiments involve a real-world restaurant review dataset collected from Yelp Data Challenge (https://www.yelp.com/dataset_challenge). We first choose the top 100 restaurants with the most reviews in the Las Vegas area and then manually parse the menu of each restaurant from its official website. From those 100 restaurants, we keep the ones with a complete menu and use their reviews and menus as our dataset. In summary, there are 69 restaurants and 95,578 reviews in total after the filtering; the average number of words per review is about 147, and the vocabulary size is 46,017.

To preprocess the reviews and identify each dish, we demonstrate the matching rule with the example dish Housemade Country Pate; its regexp can be set as:

(Housemade*|Country*)+Pat[a-z]+(s|es|ies)?,

which is set to match Country Pate, Housemade Pate, or Housemade Country Pate. If a match is found, we replace the matched text with the full name of the dish and append the name of the restaurant after an underscore symbol, yielding Housemade-Country-Pate_Mon-Ami-Gabi. After this replacement, the score of each dish d is calculated via the score function defined in Eq. (1), where the positive sentiment words are selected from the lexicon provided in [1], and only the 200 most frequent sentiment words in our dataset are used. For representation learning, we adopt the word2vec toolkit (https://code.google.com/archive/p/word2vec/) and the skip-gram model, with the context (window) size set to 5 and the dimensionality of the word vectors set to 200.

Table 1 tabulates the top-3 dishes ranked by the proposed approach for the restaurant Sushisamba Las Vegas. In the table, the dishes in each column are the top-3 results ranked by (a) their number of occurrences, (b) the score based on average distance, and (c) the score based on minimum distance; (a), (b), and (c) correspond to the three numbers in the parentheses. From the table, it can be observed that none of the top-3 most frequently mentioned dishes occurs in the lists ranked by our method (both cases (1) and (2)), because these highly frequent dishes might not be surrounded by positive words and sometimes appear in negative reviews. For example, the dataset contains a review of Peruvian Corn with the comment “The Peruvian Corn was awful”. This phenomenon indicates that the dish mentioned most frequently in the reviews may not be the dish most recommended by users. In addition, the proposed method is capable of finding dishes that might not frequently occur in reviews, e.g., Soft Shell Crab, and thus can provide more diverse results.

{| class="wikitable"
|+ Table 1: Top-3 dishes of Sushisamba Las Vegas. Each entry lists (number of occurrences, average-distance score, minimum-distance score).
! Precedence !! Frequency !! Case (1): average distance !! Case (2): minimum distance
|-
| 1 || Sea Bass (364, 0.706, 0.787) || Soft Shell Crab (4, 0.737, 0.899) || Seaweed Salad (25, 0.706, 0.910)
|-
| 2 || Peruvian Corn (125, 0.713, 0.809) || Lamb Chop (11, 0.735, 0.858) || Soft Shell Crab (4, 0.737, 0.899)
|-
| 3 || Spicy Tuna (81, 0.702, 0.787) || Samba Sushi (14, 0.735, 0.845) || Green Bean Tempura (20, 0.703, 0.877)
|}

Figure 1: 2-D visualization of the top-3 recommended dishes and positive words. [Scatter plot for restaurant #20, Sushisamba Las Vegas; both axes range from -2.0 to 2.0. The legend distinguishes words with positive sentiments, dishes ranked by case (1) (average), and dishes ranked by case (2) (minimum). The labeled dishes are Seaweed Salad, Samba Sushi, Soft Shell Crab, Lamb Chop, and Green Bean Tempura; a circled cluster contains good, best, great, etc.]

Figure 1 visualizes the positive sentiment words and the top-3 dishes ranked by the proposed method based on the learned representations. From the figure, we can observe that words with similar meanings are usually close to each other, such as the circled words including good, best, and great. Furthermore, in extreme case (1), dishes close to the centroid of all the positive words tend to have higher scores, and their contents in the reviews may be more diverse. In case (2), on the other hand, the top-ranked dishes are close to a particular sentiment word; for example, the dish Seaweed Salad is top-ranked in case (2) and far from the centroid, but its score based on the average distance is lower than those of the top-3 dishes in case (1).

4. CONCLUSIONS AND FUTURE WORK
This paper proposes a novel framework for dish discovery from restaurant reviews via word embedding techniques. This framework can be of great help in discovering or recommending dishes using only the review texts, based on the proposed score function. Although we have not conducted a quantitative evaluation in this preliminary work, the given example and the visualization results demonstrate the novelty and the potential of the proposed approach.

In the current work, we only consider two extreme cases of the score function; hence, considering different settings of the score function and quantitatively analyzing the corresponding results is an important direction for future work. Also, a food-oriented lexicon will be considered in the future. Most importantly, the size of the collected texts is vital to representation learning algorithms, so we are now collecting more data from Yelp and plan to conduct our experiments on a much larger dataset.

5. REFERENCES
[1] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proc. of ACM KDD, pages 168–177, 2004.
[2] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proc. of ACM RecSys, pages 165–172, 2013.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] M. Trevisiol, L. Chiarandini, and R. Baeza-Yates. Buon appetito: recommending personalized menus. In Proc. of ACM HT, pages 327–329, 2014.

Copyright held by the author(s).
RecSys 2016 Poster Proceedings, September 15-19, 2016, USA, Boston.