<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Dish Discovery via Word Embeddings on Restaurant Reviews</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Chih-Yu</forename><surname>Chao</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Taipei</orgName>
								<address>
									<postCode>100</postCode>
									<settlement>Taipei</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yi-Fan</forename><surname>Chu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Taipei</orgName>
								<address>
									<postCode>100</postCode>
									<settlement>Taipei</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yi</forename><surname>Ho</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Engineering Science and Ocean Engineering</orgName>
								<orgName type="institution">National Taiwan University</orgName>
								<address>
									<postCode>106</postCode>
									<settlement>Taipei</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chuan-Ju</forename><surname>Wang</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Taipei</orgName>
								<address>
									<postCode>100</postCode>
									<settlement>Taipei</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Research Center for Information Technology Innovation</orgName>
								<orgName type="institution">Academia Sinica</orgName>
								<address>
									<postCode>115</postCode>
									<settlement>Taipei</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ming-Feng</forename><surname>Tsai</surname></persName>
							<affiliation key="aff3">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">National Chengchi University</orgName>
								<address>
									<postCode>116</postCode>
									<settlement>Taipei</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Dish Discovery via Word Embeddings on Restaurant Reviews</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">43E7A73F4825E85197D408ED94B843AB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T15:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>dish discovery</term>
					<term>word embeddings</term>
					<term>dish-word extraction</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper proposes a novel framework for automatic dish discovery via word embeddings on restaurant reviews. We collect a dataset of user reviews from Yelp and parse the reviews to extract dish words. We then use the processed reviews as training text to learn word embedding vectors with the skip-gram model. A nearest-neighbor-like score function is proposed to rank the dishes based on their learned representations. We briefly analyze the preliminary experiments and present a web-based visualization at http://clip.csie.org/yelp/.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">BACKGROUND</head><p>With the growth of social media, companies such as Yelp have accumulated a large amount of user-generated content (UGC). Several studies have sought to uncover critical information hidden in such content <ref type="bibr" target="#b2">[2]</ref>. While much work has addressed accurate sentiment interpretation of reviews and recommendation, little has focused on dish-level analysis <ref type="bibr" target="#b4">[4]</ref>. In this paper, therefore, we aim to provide a novel framework for automatic dish discovery from restaurant reviews via embedding techniques. We first employ regular expressions to parse restaurant reviews and extract dish words, and then use the processed reviews as training text to learn an embedding vector for each word via the skip-gram model <ref type="bibr" target="#b3">[3]</ref>. In addition, a nearest-neighbor-like score function is proposed to rank the dishes via their learned representations. Preliminary experiments are conducted on a real-world restaurant review dataset collected from the Yelp Data Challenge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">METHODOLOGY</head><p>Copyright held by the author(s).</p><p>RecSys 2016 Poster Proceedings, September 15-19, 2016, USA, Boston.</p><p>Our methodology consists of three main parts: 1) dish-word recognition, 2) word embedding learning, and 3) dish score calculation. As noted earlier, UGC is usually noisy and varies in language usage, so extracting dish names from user reviews is a complicated task. For example, we observe in the dataset that users tend not to write the full name of a dish in their reviews; instead, they often write only its last word or last two words. To address this issue, we use regular expressions (regexps) to extract dish names from the user reviews. However, a dish in one restaurant may share its name with dishes in other restaurants, which introduces ambiguity and lowers the accuracy of matching the correct dish name. We therefore attach the restaurant name to each dish name to resolve the ambiguity.</p><p>We then use the collection of processed reviews as training text to learn an embedding of each word in the reviews via a continuous space language model, the skip-gram model. After the training phase, each word (including every dish) is represented by an n-dimensional vector (called the embedding of the word). Inspired by the k-nearest neighbors algorithm, we define the score for every dish d as:</p><formula xml:id="formula_0">S(d) = \sum_{k=1}^{m} \lambda_k f_k(d),<label>(1)</label></formula><p>where</p><formula xml:id="formula_1">f_k(d) = \frac{k}{\sum_{i=1}^{k} \lVert w_d - w_{s_i} \rVert}</formula><p>, m is the total number of positive sentiment words considered, and λ_i</p><formula xml:id="formula_2">(i = 1, \dots, m)</formula><p>is a weighting parameter. 
In addition, s_i denotes the i-th nearest positive sentiment word of the given dish d, and w_d, w_{s_i} ∈ R^n are the vector representations of the dish d and the sentiment word s_i, respectively. In the extreme case (1), where λ_m = 1 and λ_i = 0 for i = 1, ..., m − 1, the score reduces to f_m(d), the reciprocal of the average Euclidean distance between the dish and all m positive sentiment words. In case (2), where λ_1 = 1 and λ_i = 0 for i = 2, ..., m, the score reduces to f_1(d), which depends only on the positive sentiment word closest to the dish. In both cases, a smaller distance yields a higher score.</p></div>
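<div xmlns="http://www.tei-c.org/ns/1.0"><p>The score in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes that f_k(d) is k divided by the sum of the Euclidean distances from the dish to its k nearest positive sentiment words, and the function names euclidean and dish_score, as well as the toy 2-D vectors, are hypothetical.</p>
<eg><![CDATA[
```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dish_score(dish_vec, sentiment_vecs, weights):
    """Compute S(d) = sum_k weights[k-1] * f_k(d).

    f_k(d) is assumed to be k divided by the sum of the distances from
    the dish to its k nearest sentiment words (a reading of Eq. (1)).
    """
    if len(weights) != len(sentiment_vecs):
        raise ValueError("one weight is needed per sentiment word")
    # Sorting the distances puts the i-th nearest word at position i.
    distances = sorted(euclidean(dish_vec, s) for s in sentiment_vecs)
    score = 0.0
    running = 0.0  # sum of the k smallest distances so far
    for k, (dist, lam) in enumerate(zip(distances, weights), start=1):
        running += dist
        if lam == 0:
            continue
        if running == 0.0:
            raise ValueError("dish vector coincides with a sentiment word vector")
        score += lam * k / running
    return score


# Toy data (hypothetical): distances from the dish are 5, 1, and 10.
dish = (0.0, 0.0)
positive_words = [(3.0, 4.0), (0.0, 1.0), (6.0, 8.0)]

case_1 = dish_score(dish, positive_words, [0, 0, 1])  # average distance
case_2 = dish_score(dish, positive_words, [1, 0, 0])  # minimum distance
print(case_1)  # 0.1875
print(case_2)  # 1.0
```
]]></eg>
<p>In this toy setting, the weights (0, 0, 1) give the reciprocal of the mean distance, 3/16, while the weights (1, 0, 0) give the reciprocal of the distance to the nearest word, 1/1. Intermediate weightings blend the two behaviors.</p></div>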
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXPERIMENTS</head><p>Our preliminary experiments use a real-world restaurant review dataset collected from the Yelp Data Challenge. <ref type="foot" target="#foot_0">1</ref> We first choose the 100 restaurants with the most reviews in the Las Vegas area and then manually parse the menu of each restaurant from its official website. Out of those 100 restaurants, we keep the restaurants with a complete menu and use their reviews and menus as our dataset. In summary, there are 69 restaurants and 95,578 reviews in total after the filtering; the average number of words per review is about 147, and the vocabulary size is 46,017.</p><p>To preprocess the reviews and identify each dish, we demonstrate the matching rule with the example dish Housemade Country Pate; its regexp can be set as:</p><p>(Housemade*|Country*)+Pat[a-z]+(s|es|ies)?, which is set to match Country Pate, Housemade Pate, or Housemade Country Pate. If a match of the dish is found, we replace the matched text with the full name of the dish and append the name of the restaurant after an underscore, yielding Housemade-Country-Pate_Mon-Ami-Gabi. After this replacement, the score of each dish d is calculated via the score function defined in Eq. (<ref type="formula" target="#formula_0">1</ref>), where the positive sentiment words are selected from the lexicon provided in <ref type="bibr" target="#b1">[1]</ref>, and only the 200 most frequent sentiment words in our dataset are adopted. For representation learning, we adopt the word2vec toolkit and the skip-gram model, with the context (window) size set to 5 and the dimensionality of the word vectors set to 200.</p><p>Table <ref type="table" target="#tab_0">1</ref> tabulates the top-3 dishes ranked by the proposed approach for the restaurant Sushisamba Las Vegas. 
In the table, the dishes in each column are the top-3 results ranked by (a) their number of occurrences, (b) the score based on average distance, and (c) the score based on minimum distance; (a), (b), and (c) correspond to the three numbers in the parentheses. From the table, it can be observed that none of the three most frequently mentioned dishes appears in the lists ranked by our method (in either case (1) or case (<ref type="formula">2</ref>)). This is because these frequently mentioned dishes are not necessarily surrounded by positive words and sometimes appear in negative reviews. For example, one review in the dataset says of Peruvian Corn, "The Peruvian Corn was awful." This phenomenon indicates that the dish mentioned most frequently in the reviews may not be the one users recommend most. In addition, the proposed method is capable of finding dishes that occur infrequently in reviews, e.g., Soft Shell Crab (4 occurrences), and can thus provide more diverse results.</p><p>Figure <ref type="figure">1</ref> visualizes the positive sentiment words and the top-3 dishes ranked by the proposed method based on the learned representations. From the figure, we can observe that words with similar meanings are usually close to each other, such as the words in the circle, including good, best, and great. Furthermore, for the extreme case (1), the dishes close to the centroid of all the positive words tend to have higher scores, and the review content about them may be more diverse. On the other hand, for case (2), the top-ranked dishes are close to a certain sentiment word; for example, the dish Seaweed Salad is top-ranked in case (<ref type="formula">2</ref>) yet far from the centroid, and its score based on the average distance (0.706) is lower than those of the top-3 dishes in case (1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSIONS AND FUTURE WORK</head><p>This paper proposes a novel framework for dish discovery from restaurant reviews via word embedding techniques. This framework can be of great help in discovering or recommending dishes from review texts alone, based on the proposed score function. Although we have not yet conducted a quantitative evaluation in this preliminary work, the given example and the visualization results demonstrate the novelty and the potential of the proposed approach.</p><p>In the current work, we only consider two extreme cases of the score function; hence, considering different settings of the score function and quantitatively analyzing the corresponding results will be an important part of our future work. Also, a food-oriented lexicon will be considered in the future. Most importantly, the size of the collected texts is vital to representation learning algorithms, so we are now collecting more data from Yelp and plan to conduct our experiments on a much larger dataset.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>2-D visualization of the top-3 recommended dishes and positive words.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Top-3 dishes of Sushisamba Las Vegas. Each entry lists (number of occurrences, average-distance score, minimum-distance score).</figDesc><table><row><cell></cell><cell cols="3">Ranking methods</cell></row><row><cell>Rank</cell><cell>Frequency</cell><cell>Case (1): Average distance</cell><cell>Case (2): Minimum distance</cell></row><row><cell>1</cell><cell>Sea Bass (364, 0.706, 0.787)</cell><cell>Soft Shell Crab (4, 0.737, 0.899)</cell><cell>Seaweed Salad (25, 0.706, 0.910)</cell></row><row><cell>2</cell><cell>Peruvian Corn (125, 0.713, 0.809)</cell><cell>Lamb Chop (11, 0.735, 0.858)</cell><cell>Soft Shell Crab (4, 0.737, 0.899)</cell></row><row><cell>3</cell><cell>Spicy Tuna (81, 0.702, 0.787)</cell><cell>Samba Sushi (14, 0.735, 0.845)</cell><cell>Green Bean Tempura (20, 0.703, 0.877)</cell></row></table></figure>
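<div xmlns="http://www.tei-c.org/ns/1.0"><p>The dish-matching step described in Section 3 can be illustrated with a short Python sketch. It is a hedged illustration rather than the exact procedure used in the experiments: the pattern below adapts the example regexp by assuming explicit whitespace between words and case-insensitive matching, and the helper names make_token and normalize_review, as well as the sample review sentence, are hypothetical.</p>
<eg><![CDATA[
```python
import re

# Adapted pattern (an assumption): one or more of the optional leading
# words, each followed by whitespace, then a word starting with "Pat".
DISH_PATTERN = re.compile(
    r"\b(?:(?:Housemade|Country)\s+)+Pat[a-z]+\b", re.IGNORECASE
)


def make_token(dish_name, restaurant_name):
    """Hyphenate both names and join them with an underscore."""
    dish = "-".join(dish_name.split())
    restaurant = "-".join(restaurant_name.split())
    return dish + "_" + restaurant


def normalize_review(text, pattern, dish_name, restaurant_name):
    """Replace each partial or full dish mention with a single token."""
    token = make_token(dish_name, restaurant_name)
    # A function replacement avoids any escape processing of the token.
    return pattern.sub(lambda match: token, text)


# Hypothetical review sentence with a partial and a full mention.
review = (
    "The country pate was rich, but the Housemade Country Pate "
    "at lunch was better."
)
normalized = normalize_review(
    review, DISH_PATTERN, "Housemade Country Pate", "Mon Ami Gabi"
)
print(normalized)
```
]]></eg>
<p>Both mentions are rewritten to Housemade-Country-Pate_Mon-Ami-Gabi, so the skip-gram model sees each dish of each restaurant as one vocabulary item. A bare mention such as "pate" is left untouched by this adapted pattern, which is one reason the matching rule would need tuning per dish.</p></div>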
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.yelp.com/dataset_challenge</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Mining and summarizing customer reviews</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ACM KDD</title>
				<meeting>ACM KDD</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="168" to="177" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Hidden factors and hidden topics: understanding rating dimensions with review text</title>
		<author>
			<persName><forename type="first">J</forename><surname>McAuley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leskovec</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ACM Recsys</title>
				<meeting>ACM Recsys</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="165" to="172" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Buon appetito: recommending personalized menus</title>
		<author>
			<persName><forename type="first">M</forename><surname>Trevisiol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chiarandini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Baeza-Yates</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ACM HT</title>
				<meeting>ACM HT</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="327" to="329" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
