=Paper=
{{Paper
|id=Vol-1688/paper-18
|storemode=property
|title=Dish Discovery via Word Embeddings on Restaurant Reviews
|pdfUrl=https://ceur-ws.org/Vol-1688/paper-18.pdf
|volume=Vol-1688
|authors=Chih-Yu Chao,Yi-Fan Chu,Yi Ho,Chuan-Ju Wang,Ming-Feng Tsai
|dblpUrl=https://dblp.org/rec/conf/recsys/ChaoCHWT16
}}
==Dish Discovery via Word Embeddings on Restaurant Reviews==
<pdf width="1500px">https://ceur-ws.org/Vol-1688/paper-18.pdf</pdf>
<pre>
Dish Discovery via Word Embeddings on Restaurant Reviews

Chih-Yu Chao (1), Yi-Fan Chu (1), Yi Ho (2), Chuan-Ju Wang (1,3), and Ming-Feng Tsai (4)

(1) Department of Computer Science, University of Taipei, Taipei 100, Taiwan
(2) Department of Engineering Science and Ocean Engineering, National Taiwan University, Taipei 106, Taiwan
(3) Research Center for Information Technology Innovation, Academia Sinica, Taipei 115, Taiwan
(4) Department of Computer Science, National Chengchi University, Taipei 116, Taiwan


ABSTRACT
This paper proposes a novel framework for automatic dish discovery via word embeddings on
restaurant reviews. We collect a dataset of user reviews from Yelp and parse the reviews to
extract dish words. Then, we utilize the processed reviews as training texts to learn the
embedding vectors of words via the skip-gram model. A nearest-neighbor-like score function is
proposed to rank the dishes based on their learned representations. We briefly analyze the
preliminary experiments and present a web-based visualization at http://clip.csie.org/yelp/.

Keywords
dish discovery, word embeddings, dish-word extraction

1. BACKGROUND
With the growth of social media, corporations such as Yelp have accumulated a great amount of
user-generated content (UGC). In the literature, some studies have sought to find critical
information hidden in such content [2]. While much has been proposed on accurate sentiment
interpretation of reviews and on recommendation, little has focused on dish-level analysis [4].
In this paper, therefore, we aim to provide a novel framework for automatic dish discovery from
restaurant reviews via embedding techniques. We first employ regular expressions to parse
restaurant reviews and extract dish words, and then utilize the processed reviews as training
texts to learn the embedding vector of each word via the skip-gram model [3]. In addition, a
nearest-neighbor-like score function is proposed to rank the dishes via their learned
representations. Preliminary experiments are conducted on a real-world restaurant review
dataset collected from the Yelp Data Challenge.

2. METHODOLOGY
Our methodology mainly consists of three parts: 1) dish-word recognition, 2) word embedding
learning, and 3) dish score calculation. As alluded to earlier, UGC usually incorporates a
degree of noise and varied language usage; therefore, extracting dish names from user reviews
is a complicated task. For example, we observe in the dataset that users tend not to write the
full name of a dish in their reviews; instead, they often write only the last word or the last
two words. To grapple with this issue, we use regular expressions (regexps) to extract dish
names from the user reviews. However, this gives rise to another issue: a dish in one
restaurant may have the same name as a dish in other restaurants, which introduces ambiguity
and lowers the accuracy of matching the correct dish name. We therefore attach the restaurant
name to each dish name to resolve the ambiguity.

We then utilize the collection of processed reviews as training texts to learn an embedding of
each word in the reviews via a continuous-space language model, the skip-gram model. After the
training phase, each word (including every dish) is represented by an n-dimensional vector
(called the embedding of this word). Inspired by the k-nearest neighbors algorithm, we define
the score for every dish d as:

    S(d) = \sum_{k=1}^{m} \lambda_k f_k(d),                                        (1)

where f_k(d) = \frac{k}{\sum_{i=1}^{k} \| w_d - w_{s_i} \|}, m is the total number of positive
sentiment words considered, and \lambda_i (i = 1, \dots, m) is a weighting parameter. In
addition, s_i denotes the i-th nearest positive sentiment word of the given dish d, and
w_d, w_{s_i} \in R^n are the vector representations of the dish d and the sentiment word s_i,
respectively.

In the extreme case (1), with \lambda_m = 1 and \lambda_i = 0 for i = 1, \dots, m-1, this score
function implements the concept of the average Euclidean distance between a dish and all the
positive sentiment words; in the case (2), with \lambda_1 = 1 and \lambda_i = 0 for
i = 2, \dots, m, the score is determined by the closest positive sentiment word to the dish.
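
To make Eq. (1) concrete, the following is a minimal sketch of the score function in plain
Python. The function name dish_score and the toy 2-D vectors are illustrative assumptions and
do not come from the paper; the sketch also assumes that the dish vector does not coincide with
any sentiment-word vector, so no distance sum is zero.

```python
import math


def dish_score(dish_vec, sentiment_vecs, weights):
    """Compute S(d) = sum_k lambda_k * f_k(d), where
    f_k(d) = k / (sum of distances from d to its k nearest sentiment words)."""
    if len(weights) != len(sentiment_vecs):
        raise ValueError("one weight per sentiment word is expected")
    # Sort distances so that position i holds the i-th nearest sentiment word.
    dists = sorted(math.dist(dish_vec, s) for s in sentiment_vecs)
    score = 0.0
    running_sum = 0.0
    for k, (lam, dist) in enumerate(zip(weights, dists), start=1):
        running_sum += dist
        if lam:  # skip zero weights, which contribute nothing
            score += lam * k / running_sum
    return score


# Toy example (hypothetical 2-D embeddings, m = 3 sentiment words).
dish = (0.0, 0.0)
sentiment = [(3.0, 4.0), (0.0, 1.0), (6.0, 8.0)]  # distances 5, 1, 10

case_average = dish_score(dish, sentiment, [0, 0, 1])  # case (1): lambda_m = 1
case_minimum = dish_score(dish, sentiment, [1, 0, 0])  # case (2): lambda_1 = 1
```

In this toy setting, case (1) yields the reciprocal of the mean distance to all three words,
and case (2) yields the reciprocal of the distance to the single nearest word.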

3. EXPERIMENTS
Our preliminary experiments involve a real-world restaurant review dataset collected from the
Yelp Data Challenge.^1 We first choose the 100 restaurants with the most reviews in the Las
Vegas area and then manually parse the menu of each restaurant from its official website. Out
of those 100 restaurants, we keep the restaurants with a complete menu and use the reviews of
those restaurants and their menus as our dataset. In summary, there are 69 restaurants and
95,578 reviews in total after the filtering; the average number of words per review is about
147, and the vocabulary size is 46,017.

^1 https://www.yelp.com/dataset_challenge

Copyright held by the author(s).
RecSys 2016 Poster Proceedings, September 15-19, 2016, USA, Boston.

Table 1: Top-3 dishes of Sushisamba Las Vegas. Each entry lists the dish followed by
(number of occurrences, score by average distance, score by minimum distance).

Rank | Frequency           | Case (1): Average distance | Case (2): Minimum distance
-----+---------------------+----------------------------+----------------------------
1    | Sea Bass            | Soft Shell Crab            | Seaweed Salad
     | (364, 0.706, 0.787) | (4, 0.737, 0.899)          | (25, 0.706, 0.910)
2    | Peruvian Corn       | Lamb Chop                  | Soft Shell Crab
     | (125, 0.713, 0.809) | (11, 0.735, 0.858)         | (4, 0.737, 0.899)
3    | Spicy Tuna          | Samba Sushi                | Green Bean Tempura
     | (81, 0.702, 0.787)  | (14, 0.735, 0.845)         | (20, 0.703, 0.877)

[Figure 1: 2-D visualization of the top-3 recommended dishes and positive words. Panel title:
"#20: Sushisamba Las Vegas". Legend: words with positive sentiments; Dish - case (1): average;
Dish - case (2): minimum. Both axes range from -2.0 to 2.0. A circled cluster of positive words
is labeled "good, best, great, etc."]
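
The corpus statistics quoted in this section (restaurants ranked by review count, average words
per review, vocabulary size) could be computed along the following lines. This is only a sketch
under stated assumptions: the record fields business_id and text, the lowercasing, and the
whitespace tokenization are illustrative choices, since the paper does not describe its
tokenization or data handling.

```python
from collections import Counter


def top_restaurants(reviews, n):
    """Return the ids of the n restaurants with the most reviews."""
    counts = Counter(r["business_id"] for r in reviews)
    return [rid for rid, _ in counts.most_common(n)]


def corpus_stats(reviews):
    """Return (average words per review, vocabulary size).

    Assumes lowercased whitespace tokens; a different tokenizer
    would give different numbers.
    """
    vocab = set()
    total_tokens = 0
    for r in reviews:
        tokens = r["text"].lower().split()
        total_tokens += len(tokens)
        vocab.update(tokens)
    average = total_tokens / len(reviews) if reviews else 0.0
    return average, len(vocab)


# Tiny hypothetical sample standing in for the real review records.
sample = [
    {"business_id": "A", "text": "Great sea bass"},
    {"business_id": "A", "text": "The corn was awful"},
    {"business_id": "B", "text": "Great sushi"},
]
best = top_restaurants(sample, 1)
avg_len, vocab_size = corpus_stats(sample)
```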
For preprocessing the reviews to identify each dish, we demonstrate the matching rule with the
example dish Housemade Country Pate; its regexp can be set as:

    (Housemade*|Country*)+Pat[a-z]+(s|es|ies)?,

which is set to match Country Pate, Housemade Pate, or Housemade Country Pate. If a match of
the dish is found, we replace the matched text with the full name of the dish and append the
name of the restaurant after an underscore symbol, yielding
Housemade-Country-Pate_Mon-Ami-Gabi. After the modification and replacement, the score of each
dish d is calculated via the score function defined in Eq. (1), where the positive sentiment
words are selected from the lexicon provided in [1], and only the 200 most frequent sentiment
words in our dataset are adopted. For the representation learning, the word2vec toolkit^2 and
the skip-gram model are adopted; the context (window) size for the skip-gram model is set to 5
and the dimensionality of the word vectors is set to 200.

^2 https://code.google.com/archive/p/word2vec/

Table 1 tabulates the top-3 dishes ranked by the proposed approach for the restaurant
Sushisamba Las Vegas. In the table, the dishes in each column are the top-3 results ranked by
(a) their number of occurrences, (b) the score based on average distance, and (c) the score
based on minimum distance; (a), (b), and (c) correspond to the three numbers in the
parentheses. From the table, it can be observed that none of the top-3 most frequently
mentioned dishes occurs in the lists ranked by our method (both cases (1) and (2)), because
these highly frequent dishes might not be surrounded by positive words and sometimes appear in
negative reviews. For example, the dataset contains the comment “The Peruvian Corn was awful”.
This phenomenon indicates that the dish most frequently mentioned in the reviews may not be the
dish most recommended by users. In addition, the proposed method is capable of finding dishes
that might not frequently occur in reviews, e.g., Soft Shell Crab, and thus can provide more
diverse results.

Figure 1 visualizes the positive sentiment words and the top-3 dishes ranked by the proposed
method based on the learned representations. From the figure, we can observe that words with
similar meanings are usually close to each other, such as the words in the circle, including
good, best, and great. Furthermore, for the extreme case (1), the dishes close to the centroid
of all the positive words tend to have higher scores, and their contents in the reviews may be
more diverse. On the other hand, for case (2), the top-ranked dishes are close to a certain
sentiment word; for example, the dish Seaweed Salad is top-ranked and far from the centroid in
case (2), but its score based on the average distance is lower than those of the top-3 dishes
in case (1).

4. CONCLUSIONS AND FUTURE WORK
This paper proposes a novel framework for dish discovery from restaurant reviews via word
embedding techniques. This framework can be of great help in discovering or recommending dishes
using only the review texts, based on the proposed score function. Although we have not
conducted a quantitative evaluation in this preliminary work, the given example and the
visualization results demonstrate the novelty and the potential of the proposed approach.

In the current work, we only consider two extreme cases of the score function; hence,
considering different settings of the score function and quantitatively analyzing the
corresponding results will be an important part of our future work. Also, a food-oriented
lexicon will be considered in the future. Most importantly, the size of the collected texts is
vital to representation learning algorithms, so we are now collecting more data from Yelp and
plan to conduct our experiments on a much larger dataset.

5. REFERENCES
[1] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proc. ACM KDD, pages 168–177,
    2004.
[2] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating
    dimensions with review text. In Proc. ACM RecSys, pages 165–172, 2013.
[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations
    in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] M. Trevisiol, L. Chiarandini, and R. Baeza-Yates. Buon appetito: recommending personalized
    menus. In Proc. of ACM HT, pages 327–329, 2014.
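
As a supplementary illustration of the dish-matching rule from Section 3, the sketch below
rewrites partial dish mentions into a single token of the form Full-Dish-Name_Restaurant-Name.
It is a simplified adaptation, not the authors' exact regexp: the helper names dish_pattern and
tag_dishes are hypothetical, the pattern requires at least one leading name word followed by
the last word of the dish name (with an optional s or es plural), and matching is
case-insensitive by assumption.

```python
import re


def dish_pattern(full_name):
    """Build a pattern matching partial mentions of a dish name.

    For a multi-word name, one or more of the leading words must precede
    the last word; a single-word name is matched on its own.
    """
    *prefix, last = full_name.split()
    tail = re.escape(last) + r"(?:s|es)?"
    if not prefix:
        return re.compile(rf"\b{tail}\b", re.IGNORECASE)
    alternatives = "|".join(re.escape(word) for word in prefix)
    return re.compile(rf"\b(?:(?:{alternatives})\s+)+{tail}\b", re.IGNORECASE)


def tag_dishes(review, dish_names, restaurant):
    """Replace each dish mention with Full-Dish-Name_Restaurant-Name."""
    suffix = restaurant.replace(" ", "-")
    for name in dish_names:
        token = name.replace(" ", "-") + "_" + suffix
        review = dish_pattern(name).sub(token, review)
    return review


menu = ["Housemade Country Pate"]
tagged = tag_dishes("The country pate was great", menu, "Mon Ami Gabi")
```

Because the rewritten token contains no spaces, a word-embedding toolkit that splits on
whitespace would treat each restaurant-specific dish as one vocabulary item.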

</pre>