<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dish Discovery via Word Embeddings on Restaurant Reviews</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chih-Yu Chao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi-Fan Chu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Ho</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chuan-Ju Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming-Feng Tsai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, National Chengchi University</institution>
          ,
          <addr-line>Taipei 116</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Taipei</institution>
          ,
          <addr-line>Taipei 100</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Engineering Science and Ocean Engineering, National Taiwan University</institution>
          ,
          <addr-line>Taipei 106</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Research Center for Information Technology Innovation</institution>
          ,
          <addr-line>Academia Sinica, Taipei 115</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
<p>This paper proposes a novel framework for automatic dish discovery via word embeddings on restaurant reviews. We collect a dataset of user reviews from Yelp and parse the reviews to extract dish words. Then, we utilize the processed reviews as training texts to learn the embedding vectors of words via the skip-gram model. In the paper, a nearest-neighbor-like score function is proposed to rank the dishes based on their learned representations. We briefly analyze the preliminary experiments and present a web-based visualization at http://clip.csie.org/yelp/.</p>
      </abstract>
      <kwd-group>
        <kwd>dish discovery</kwd>
        <kwd>word embeddings</kwd>
        <kwd>dish-word extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>BACKGROUND</title>
      <p>With the growth of social media, corporations such as
Yelp have accumulated a great amount of user-generated
content (UGC). In the literature, some studies have been
conducted with a perspective of finding critical information
hidden in the content [2]. While much has been proposed
on accurate sentiment interpretation for reviews and
recommendation, little has focused on dish-level analysis [4].
In this paper, therefore, we aim to provide a novel framework
for automatic dish discovery from restaurant reviews via
embedding techniques. We employ regular expressions to
first parse restaurant reviews to extract dish words, and then
utilize the processed reviews as training texts to learn the
embedding vector of each word via the skip-gram model [3]. In
addition, a nearest-neighbor-like score function is proposed
to rank the dishes via their learned representations.
Preliminary experiments are conducted on a real-world restaurant
review dataset collected from the Yelp Dataset Challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY</title>
      <p>Our methodology mainly consists of three parts: 1)
dish-word recognition, 2) word embedding learning, and 3) dish
score calculation. As alluded to earlier, UGC usually
incorporates a degree of noise and different language usages;
therefore, extracting dish names from user reviews is a
complicated task. For example, as observed in the dataset, users
tend not to write the full name of a dish in their reviews;
instead, they often write only the last word or the last two
words of the name. To grapple with this issue, we use regular
expressions (regexps) to extract dish names from the user
reviews. However, this also gives rise to another issue: a
certain dish in one restaurant may share its name with dishes
in other restaurants, which induces ambiguity and
lowers the accuracy of matching the correct dish name.
We therefore append the restaurant name to each dish name
to resolve this ambiguity.</p>
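As a concrete illustration, the partial-name matching and restaurant-name suffixing described above might be sketched as follows. This is a minimal sketch, not the paper's actual regexps; the helper names and the exact pattern shape are assumptions.

```python
import re


def make_dish_pattern(full_name):
    """Build a regexp matching the full dish name or shortened forms in
    which leading words are dropped, e.g. 'Housemade Country Pate' also
    matches 'Country Pate' or 'Housemade Pate'.  (Simplification: the
    bare last word, e.g. 'Pate', matches too.)"""
    words = full_name.split()
    # Make every word except the last one optional.
    optional = "".join(rf"(?:{re.escape(w)}\s+)?" for w in words[:-1])
    return re.compile(rf"\b{optional}{re.escape(words[-1])}\b", re.IGNORECASE)


def canonicalize(review, full_name, restaurant):
    """Replace any (partial) mention of the dish with its full name joined
    to the restaurant name, resolving cross-restaurant ambiguity."""
    token = full_name.replace(" ", "-") + "_" + restaurant.replace(" ", "-")
    return make_dish_pattern(full_name).sub(token, review)


print(canonicalize("The country pate was divine.",
                   "Housemade Country Pate", "Mon Ami Gabi"))
# -> The Housemade-Country-Pate_Mon-Ami-Gabi was divine.
```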
      <p>
        We then utilize the collection of processed reviews as
training texts to learn the embedding of each word in the reviews
via a continuous-space language model, the skip-gram model.
After the training phase, each word (including every dish)
is represented by an n-dimensional vector (called the
embedding of this word). Inspired by the k-nearest-neighbors
algorithm, we define the score for every dish d as
      </p>
      <p>S(d) = Σ_{k=1}^{m} λ_k f_k(d),  (1)</p>
      <p>where f_k(d) = k / Σ_{i=1}^{k} ||w_d − w_{s_i}||, m is the total
number of positive sentiment words considered, and λ_i
(i = 1, …, m) is a weighting parameter. In addition, s_i denotes
the i-th nearest positive sentiment word of the given dish d,
and w_d, w_{s_i} ∈ R^n are the vector representations of the
dish d and the sentiment word s_i, respectively.</p>
      <p>
        In the extreme case of λ_m = 1 and λ_i = 0 for
i = 1, …, m−1 (case (1)), this score function implements the concept of
the average Euclidean distance between a dish and all the
positive sentiment words, while in the case of λ_1 = 1 and
λ_i = 0 for i = 2, …, m (case (2)), the score is obtained with the
closest positive sentiment word to the dish.
      </p>
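The score and its two extreme cases can be sketched as follows, with toy 2-D vectors standing in for the learned embeddings (the vectors and the function name are illustrative, not values from the paper):

```python
import numpy as np


def dish_score(w_d, sentiment_vecs, weights):
    """S(d) = sum_k lambda_k * f_k(d), where
    f_k(d) = k / sum_{i=1}^{k} ||w_d - w_{s_i}|| and the s_i are the
    positive sentiment words sorted by distance to the dish (nearest first)."""
    dists = np.sort(np.linalg.norm(sentiment_vecs - w_d, axis=1))
    cumulative = np.cumsum(dists)                  # sum_{i=1}^{k} ||w_d - w_{s_i}||
    f = np.arange(1, len(dists) + 1) / cumulative  # f_k(d)
    return float(np.dot(weights, f))


m = 4
dish = np.array([0.0, 0.0])
sentiments = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]])

# Case (1): lambda_m = 1, all others 0 -> inverse of the average distance.
avg_case = dish_score(dish, sentiments, np.eye(m)[m - 1])
# Case (2): lambda_1 = 1, all others 0 -> inverse of the nearest distance.
min_case = dish_score(dish, sentiments, np.eye(m)[0])
print(avg_case, min_case)  # 0.4 1.0  (average distance 2.5, nearest 1.0)
```

Higher scores thus correspond to dishes lying closer to the positive sentiment words in the embedding space, under either weighting.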
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTS</title>
      <p>Our preliminary experiments involve a real-world
restaurant review dataset collected from the Yelp Dataset Challenge.1
We first choose the top-100 restaurants containing the most
reviews in the area of Las Vegas and then manually parse
the menu of each restaurant from its official website. Out
of those 100 restaurants, we extract the restaurants with
a complete menu, setting the reviews of those restaurants
and their menus as our dataset. In summary, there are 69
restaurants and 95,578 reviews in total after the filtering; the
average number of words per review is about 147, and the
vocabulary size is 46,017.</p>
      <p>1 https://www.yelp.com/dataset_challenge</p>
      <p>
        For preprocessing the reviews to identify each dish, here
we demonstrate the matching rule via the example dish
Housemade Country Pate; its regexps can be set to match
Country Pate, Housemade Pate, or Housemade Country Pate.
If a match of the dish is found, we replace the matched name
with the full name of the dish and append the name of the
restaurant after an underscore symbol, modifying it to
Housemade-Country-Pate_Mon-Ami-Gabi. After the
modification and replacement, the score of each dish d is
calculated via the score function defined in Eq. (1), where
the positive sentiment words are selected from the lexicon
provided in [<xref ref-type="bibr" rid="ref1">1</xref>], and
only the top-200 most frequent sentiment words in our dataset
are adopted. For the representation learning, the word2vec
toolkit2 and the skip-gram model are adopted, in which the
context (window) size for the skip-gram model was set to 5
and the dimensionality of the word vectors was set to 200.
      </p>
      <p>
        Table 1 tabulates the top-3 dishes ranked by the proposed
approach for the restaurant Sushisamba Las Vegas. In the
table, the dishes in each column are the top-3 results ranked
by (a) their number of occurrences, (b) the score based
on the average distance, and (c) the score based on the minimum
distance; (a), (b), and (c) correspond to the three numbers
in the parentheses. From the table, it can be observed that
none of the top-3 most frequently mentioned dishes occurs
in the lists ranked by our method (in both cases (1) and (2)),
which is due to the fact that these highly frequent dishes
might not be surrounded by positive words and are sometimes
mentioned in negative reviews. For example, there is a review
for Peruvian Corn containing the comment "The Peruvian Corn
was awful" in the dataset. This phenomenon indicates that
the dish most frequently mentioned in the reviews may not
be the dish most recommended by users. In addition, the
proposed method is capable of finding dishes that might not
frequently occur in reviews, e.g., Soft Shell Crab, and thus
can provide more diverse results.
      </p>
      <p>Figure 1 visualizes the positive sentiment words and the
top-3 dishes ranked by the proposed method based on the
learned representations. [Figure 1: 2-D visualization of the
learned embeddings for restaurant #20, Sushisamba Las Vegas,
showing words with positive sentiments (e.g., good, best,
great) together with the top-3 dishes under case (1) (average)
and case (2) (minimum); the labeled dishes include Green Bean
Tempura, Soft Shell Crab, Samba Sushi, Lamb Chop, and Seaweed
Salad.] From the figure, we can observe that the words with
similar meanings are usually close to each other, such as the
words in the circle including good, best, and great.
Furthermore, for the extreme case (1), the dishes close to
the centroid of all the positive words tend to have higher
scores, and their contents in the reviews may be more diverse.
On the other hand, for case (2), the top-ranked dishes are
close to a certain sentiment word; for example, the dish
Seaweed Salad is top-ranked and far from the centroid in
case (2), but its score based on the average distance is
rather low compared with the other top-3 dishes in case (1).</p>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>This paper proposes a novel framework for dish
discovery from restaurant reviews via word embedding techniques.
This framework can be of great help in discovering or
recommending dishes from only the review texts, based on the
proposed score function. Although in this preliminary work we
have not conducted a quantitative evaluation of our experiments,
the given example and the visualization results demonstrate
the novelty and the potential of the proposed approach.</p>
      <p>In the current work, we only consider two extreme cases
of the score function; hence, considering different settings
of the score function and quantitatively analyzing the
corresponding results will be an important part of our future
work. Also, a food-oriented lexicon will be considered in the
future. Most importantly, the size of the collected texts is
vital to representation learning algorithms, so we are now
collecting more data from Yelp and plan to conduct our
experiments on a much larger dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Mining and summarizing customer reviews</article-title>
          .
          <source>In Proc. ACM KDD</source>
          , pages
          <volume>168</volume>
          –
          <fpage>177</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Hidden factors and hidden topics: understanding rating dimensions with review text</article-title>
          .
          <source>In Proc. ACM RecSys</source>
          , pages
          <volume>165</volume>
          –
          <fpage>172</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Trevisiol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiarandini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          .
          <article-title>Buon appetito: recommending personalized menus</article-title>
          .
          <source>In Proc. of ACM HT</source>
          , pages
          <volume>327</volume>
          –
          <fpage>329</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>