<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>vExplorer: A Search Method to Find Relevant YouTube Videos for Health Researchers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hillol Sarker</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Murtaza Dhuliawala</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicholas Fay</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amar Das</string-name>
        </contrib>
        <aff>IBM Research, Cambridge</aff>
      </contrib-group>
      <abstract>
        <p>Patients and caregivers increasingly use online video-sharing services, such as YouTube, to document their experiences with a health condition. Despite the richness of the shared information and social commentary in such videos, few systematic studies have focused on the health content of this medium. Finding related videos on YouTube can be challenging for researchers because of inadequate search and ranking methods. In this paper, we present initial work on an unsupervised information retrieval method that supports the mental model of a researcher who is exploring a topic area and lets the user examine extracted metadata to identify relevant videos. In an experimental comparison with YouTube on a search for videos of autism personal stories, our approach, using the title, description, and tags of the videos, produced more relevant videos than those that YouTube suggested.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2 Related Works</title>
      <p>Machine learning relies heavily on the quantity and the quality of the labeled dataset. Collection of this dataset can
incur significant costs in time and money. In many cases, researchers have shared their annotated datasets to help the
research community. These datasets include annotated text<sup>2–4</sup>, images<sup>5–7</sup>, audio<sup>8–10</sup>, video<sup>11, 12</sup>, and wearable sensor data<sup>13, 14</sup>.
These datasets are static, however, and limited to a specific problem space. User-generated content on online
platforms, such as YouTube, is a dynamic source of data. Users are continuously adding new and diverse content.
Active user communities augment this content by providing feedback, such as comments and likes. We propose a new
approach that creates datasets from the metadata of the video content and uses that metadata in a coherent way for
search.</p>
      <p>Clustering of text-based data for users undertaking video search has been studied widely. In most cases,
researchers start with a given corpus, and the problem then reduces to categorizing its documents.
Wu et al.<sup>15</sup> proposed a hierarchical approach that clusters nearly identical videos based on the content of
the video, overlooking the user comments on the online video. Another method is the scatter/gather<sup>16</sup> approach,
which iterates over two steps. The first step scatters the entire corpus into clusters. The second step gathers only those
documents that are relevant to the current concept. This approach is relatively fast and well suited to browsing a
large database<sup>17</sup>. Odijk et al.<sup>18</sup> proposed a Markov decision process method that mines the subtitles of live television
broadcasts and suggests relevant video content to the user. Another approach is agglomerative
hierarchical clustering<sup>17, 19</sup>, which starts with each document as an individual cluster. Based on a best-case, average-case, or
worst-case pairwise similarity score, the method merges two clusters into one at a time. At the
end, the entire corpus becomes one big cluster, and the sequence of merges provides an explainable and intuitive hierarchy of clusters. In cases
where the number of clusters is known in advance, k-means or k-medoids is a candidate clustering approach.
However, this approach is sensitive to the selection of initial seeds. In addition, its running time is hard to predict
because the number of iterations needed to converge is not known in advance. Similarly, when the number of topics in
the corpus is known in advance, topic-model-based approaches can be used. Latent Dirichlet Allocation (LDA)<sup>20</sup> and similar algorithms
can generate the distribution of words in topics and the distribution of topics in a corpus. They provide, as a result,
a human-interpretable distribution of words and topics. All of these prior approaches start with the assumption that
the corpus is given and that the computational process proceeds bottom-up. In contrast, vExplorer is a top-down
approach that develops the corpus in a hierarchical and iterative manner.</p>
      <p>Other works related to our evaluation of videos for ASD research have attempted to discover similar types of YouTube
videos to assess a research hypothesis. Fusaro et al.<sup>21</sup> tested the hypothesis that it is feasible to use unstructured
home videos for the early detection of autism outside of the clinical setting. As a potential source of home video
data, the authors used autism-related YouTube videos. They applied their domain knowledge to manually search
YouTube, identify the related videos, and develop a dataset. Non-clinical raters were able to identify the cases
of autism with high accuracy, supporting the hypothesis. Our approach can augment this related work<sup>21</sup> by
discovering related videos as well as the search strings in an unsupervised way.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Method</title>
      <p>In Figure 1, we provide a brief overview of the proposed system from the perspective of the user. The user first selects
a YouTube video that he or she finds relevant to the topic of interest and for which related videos are sought. The system
uses this video as a seed video, finds the most important descriptor tokens for the video, searches YouTube on behalf
of the user, scores the search results, and provides the user with a list of relevant videos. In this section, we discuss the
computational methods used in our system.</p>
      <p>Search Token Extraction: The search token extraction begins with the seed video. The system calls the
YouTube Data API to obtain the title, description, tags, comments, and replies of the seed video. We tokenize
the text and apply Penn Treebank (PTB) part-of-speech tags<sup>22</sup>. We then retain nouns, verbs, and
adjectives in all their forms and filter out common stop words and slang terms. The remaining terms
are lemmatized, and we create a bag-of-words representation (C). We also keep a mapping from each token to its source zone
(title, description, tags, or comments) so that we can evaluate their relative importance. Each video,
including the seed video, is represented as a vector of token frequencies (TF). The token frequencies in the vector
are normalized using the formula
<disp-formula><tex-math>\mathrm{Augmented\ TF}(t, Z) = 0.5 + 0.5 \cdot \frac{\mathrm{TF}(t)}{\max\{\mathrm{TF}(t') \mid t' \in Z\}},</tex-math></disp-formula>
where Z is the set of terms in
one of the four possible zones (title, description, tags, and comments). The coefficient of each token is the mean of its four
Augmented TF values, one for each zone.</p>
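      <p>The per-zone normalization and averaging described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the zone labels, and the choice to let a token that is absent from a zone contribute 0 to the mean are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Zone labels are illustrative names for the four metadata sources.
ZONES = ("title", "description", "tags", "comments")


def augmented_tf(tokens):
    """Return {token: 0.5 + 0.5 * TF(t) / max TF} for one zone's token list."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_tf = max(counts.values())
    return {tok: 0.5 + 0.5 * tf / max_tf for tok, tf in counts.items()}


def token_coefficients(zone_tokens):
    """Average the augmented TF of every token over the four zones.

    Assumption: a token that does not occur in a zone contributes 0 there.
    """
    per_zone = {zone: augmented_tf(zone_tokens.get(zone, [])) for zone in ZONES}
    vocab = set().union(*(weights.keys() for weights in per_zone.values()))
    return {
        tok: sum(per_zone[zone].get(tok, 0.0) for zone in ZONES) / len(ZONES)
        for tok in vocab
    }


if __name__ == "__main__":
    example = {
        "title": ["autism", "story"],
        "description": ["autism", "autism", "family"],
        "tags": ["autism"],
        "comments": [],
    }
    for token, weight in sorted(token_coefficients(example).items()):
        print(token, weight)
```
]]></preformat>
      <p>In this toy input, "autism" is the most frequent token in three zones and so receives the largest coefficient, while a token seen in only one zone is scaled down by the averaging step.</p>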
      <p>Scoring (Similarity Measure): Based on the normalized frequency of the tokens in the seed video, we identify the n
most significant tokens. The system creates different combinations of these n tokens to form candidate search strings.
The number of such combinations grows exponentially with n. We therefore prioritize the exploration process by considering the
most relevant search string first, based on a similarity score. A YouTube API call returns 50 videos for a given
search string. Each video (including the seed video) is represented as a vector of the normalized term frequencies of its 100
most important tokens. We compute the pairwise cosine similarity between the seed video vector (SVV) and each of the
search result video vectors (RVV). Because SVV and RVV are each limited to 100 tokens, the two vectors often contain many tokens that they do not share.
We take the union of the two token sets and form a new pair of vectors, assigning a coefficient of 0 when a token is not present in the
corresponding vector. We use the mean cosine similarity score as the overall score for the search string. This similarity
measure helps us explore the more relevant search paths first.</p>
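      <p>The scoring step can be illustrated with a short sketch. It assumes each video is already available as a dictionary that maps tokens to normalized weights; the function names and the handling of empty vectors are hypothetical choices for this example and are not taken from the vExplorer code.</p>
      <preformat><![CDATA[
```python
import math


def cosine_similarity(svv, rvv):
    """Cosine similarity of two sparse vectors given as {token: weight} dicts.

    Summing over the union of both vocabularies with a default of 0.0 is the
    same as zero-padding each vector to the shared vocabulary.
    """
    vocab = set(svv) | set(rvv)
    dot = sum(svv.get(t, 0.0) * rvv.get(t, 0.0) for t in vocab)
    norm_s = math.sqrt(sum(w * w for w in svv.values()))
    norm_r = math.sqrt(sum(w * w for w in rvv.values()))
    if norm_s == 0.0 or norm_r == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_s * norm_r)


def top_tokens(vector, k=100):
    """Keep the k highest-weighted tokens of a {token: weight} vector."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def search_string_score(seed_vector, result_vectors, k=100):
    """Mean cosine similarity between the seed video and each result video."""
    if not result_vectors:
        return 0.0
    svv = top_tokens(seed_vector, k)
    sims = [cosine_similarity(svv, top_tokens(rv, k)) for rv in result_vectors]
    return sum(sims) / len(sims)
```
]]></preformat>
      <p>Representing the vectors as dictionaries avoids building the padded vectors explicitly, because a token missing from one side simply contributes a zero product.</p>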
      <p>Expansion of the Search Space: Exploration of the video search space uses a greedy approach. Initially each token is
treated as a one-word search string, and we insert these strings into a priority queue. In each iteration, we pick the top-scoring search
string, concatenate each of the other one-word tokens with it, perform the search with the YouTube API, compute
the mean similarity score from the search results, and push each new search string back onto the priority queue. We make
sure that the token sequences for which we are searching are unique and that no token is repeated within a search string
as we expand the search. The process terminates when we either run out of possible combinations of search tokens
or have explored a user-specified number of search strings. Throughout the process, we keep track of the search strings that
yield the best set of videos for the user. An ordered list of search strings and corresponding videos is
returned to the user.</p>
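      <p>The greedy expansion can be sketched with a binary heap as the priority queue. In this sketch, <monospace>score_fn</monospace> is a hypothetical stand-in for the YouTube search followed by the mean similarity computation, and treating two strings with the same tokens in a different order as duplicates is an assumption chosen to match the 2<sup>15</sup> − 1 count of combinations for 15 tokens.</p>
      <preformat><![CDATA[
```python
import heapq


def explore(tokens, score_fn, max_searches=100):
    """Greedy best-first expansion of search strings.

    tokens: the seed video's top tokens (one-word search strings).
    score_fn: hypothetical callable taking a tuple of tokens and returning
        the mean similarity of the results for that search string.
    max_searches: user-specified cap on the number of strings explored.
    Returns (search_string, score) pairs, best first.
    """
    seen = set()    # token combinations already searched
    heap = []       # max-heap emulated by negating scores
    scored = []     # every search string explored, with its score

    def search(candidate):
        key = frozenset(candidate)  # assumption: word order is ignored
        if key in seen or len(scored) >= max_searches:
            return
        seen.add(key)
        score = score_fn(candidate)
        scored.append((score, candidate))
        heapq.heappush(heap, (-score, candidate))

    for token in tokens:
        search((token,))
    while heap and max_searches > len(scored):
        _, best = heapq.heappop(heap)
        for token in tokens:
            if token not in best:  # never repeat a token in a string
                search(best + (token,))

    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(" ".join(candidate), score) for score, candidate in scored]
```
]]></preformat>
      <p>Calling <monospace>explore(tokens, score_fn, max_searches=100)</monospace> stops either when the heap is empty, meaning that every combination has been tried, or when the cap is reached, which mirrors the two termination conditions described above.</p>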
    </sec>
    <sec id="sec-4">
      <title>4 Experimental Setup</title>
      <p>Our approach allows us to consider the merit of including various types of extracted metadata in the search process.
We designed an experiment to evaluate which configurations provided the most relevant search results and how these
results compared to those returned from YouTube. We first searched YouTube using the search string “autism personal
stories.” We used the browser’s incognito mode and disabled location sharing so that YouTube would
not personalize the search results. Based on manual inspection of the top 10 search result videos, we observed
that each of them was relevant to the search string. We later used these 10 videos for the validation of the proposed
solution. As the baseline condition, for each of these 10 videos, we used the YouTube API to obtain a list of 10 other
recommended videos. We maintained a system-wide circular queue of 10 YouTube API keys. Each YouTube API
call picks an API key from this circular queue and moves the pointer to the next key. We believe that this scheme reduced
a potential confounder: YouTube may store cookies or other kinds of tracking information to
personalize the search results. To show the efficacy of the proposed vExplorer system, we designed our experiment to
test the following hypothesis.</p>
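      <p>A circular queue of API keys of the kind described above could be implemented along the following lines. The class name, the placeholder key strings, and the use of a lock for thread safety are assumptions made for this sketch.</p>
      <preformat><![CDATA[
```python
import itertools
import threading


class ApiKeyRotator:
    """Hand out API keys in round-robin order from a fixed pool."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)  # endless circular iterator
        self._lock = threading.Lock()        # keep the pointer consistent

    def next_key(self):
        """Return the current key and advance the pointer to the next one."""
        with self._lock:
            return next(self._cycle)


if __name__ == "__main__":
    # Placeholder strings; real keys would come from configuration.
    rotator = ApiKeyRotator(["KEY_%d" % i for i in range(10)])
    print([rotator.next_key() for _ in range(12)])
```
]]></preformat>
      <p>Each API call would request a key through <monospace>next_key()</monospace>, so consecutive calls are spread evenly across the pool.</p>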
      <p>H01. Videos suggested by vExplorer are more relevant to a seed video than videos recommended by YouTube.</p>
      <p>From the previous list of 10 selected YouTube videos, we picked one video at a time and set it as the seed video. We
generated the 15 most relevant tokens based on normalized term frequency in three experimental settings. First, we
used only the title and description, as a baseline condition for vExplorer. Second, in addition to the title and description, we
included the tags provided by the video uploader. Third, we also included user comments and their replies. Fifteen
tokens can generate 2<sup>15</sup> − 1 search strings, which may take significant time to explore exhaustively. Because we use a greedy
approach to generate search strings, we believe that the vExplorer system can reach the desired search string quickly. We
limited our system to 100 iterations (see Figure 5). We obtained 10 relevant videos suggested by vExplorer in
each of the three settings. Three experimental settings and 10 seed videos, each yielding 10 related videos, produce
300 videos in total. In addition, there are 10 related videos suggested by YouTube for each of the 10 seed videos, making
a total of 100 videos. Overall, we created a set of 400 related videos that were suggested by YouTube and vExplorer.</p>
      <p>We took the union of the videos recommended by YouTube and those recommended by our system (n=400) across all three
experimental settings. Because duplicate videos appeared across the different experimental settings, this set contained only 200 unique
YouTube videos. Before human raters assessed the relevance of the returned videos, we randomized
the ordering of the videos to remove potential bias. Two independent raters, not involved in the development of
vExplorer, were recruited to rate the relevance of the autism videos and were given a list of the 10 initial seed videos. They
were asked to watch these videos and were told the initial search string, “autism personal stories.” After the
raters were confident that they understood the health-related topic presented in the initial seed videos, they
independently rated each video in the randomized list as “relevant,” “not relevant,” or
“unsure.” In cases of disagreement between the two raters, a third reviewer, not involved in the development of the
system, was brought in to adjudicate the discrepancy.</p>
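      <p>The counts quoted in this section can be checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are those stated in the text.</p>
      <preformat><![CDATA[
```python
from math import comb

N_TOKENS = 15
# Non-empty subsets of 15 tokens: sum over subset sizes k = 1..15,
# which equals 2**15 - 1.
n_search_strings = sum(comb(N_TOKENS, k) for k in range(1, N_TOKENS + 1))

SETTINGS, SEEDS, VIDEOS_PER_SEED = 3, 10, 10
vexplorer_videos = SETTINGS * SEEDS * VIDEOS_PER_SEED  # 300
youtube_videos = SEEDS * VIDEOS_PER_SEED               # 100
total_videos = vexplorer_videos + youtube_videos       # 400

if __name__ == "__main__":
    print(n_search_strings, total_videos)
```
]]></preformat>
      <p>With more than thirty thousand candidate strings per seed video, the cap of 100 iterations means that only a small fraction of the search space is explored.</p>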
    </sec>
    <sec id="sec-5">
      <title>5 Results</title>
      <p>We present the results of our evaluation, in which each of the 10 selected seed videos produced 15 tokens based on the
normalized token frequency. In the first experimental setting, we use the title and description of the videos to find the 15
most informative tokens. The title and description are also used to compute the similarity score between a seed video
and any given video. Figure 3 shows the mean (± SD) normalized score for each of the 15 selected tokens from the
10 seed videos. We observe that autism, story, family, child, life, and documentary are the most informative tokens.
The system then tries different combinations of these tokens to form candidate search strings. Table 1 shows a list of 5
example search strings suggested by vExplorer. Although “dillan” and “voice” are not, independently, among the most
common tokens, we observe that “dillan voice” is the highest-scoring search string based on the content it produces
when searched on YouTube. Figure 3 lists the tokens generated when tags are also taken into consideration, while Figure 4
considers all four types of metadata.</p>
      <p>In this paper, we propose a novel search system to help researchers discover videos of health-related behaviors based
on an initial seed video of interest on YouTube. Our system helps researchers avoid common challenges in this type of
search. For example, the user may not know which keywords provide optimal results in a manual search for related
videos. In addition, certain tags on videos cannot be used directly as search terms on the public YouTube video page.
To address these challenges, we have developed an unsupervised information retrieval approach that extracts and uses
tags and other metadata to search and rank videos based on relevance to a seed video.</p>
      <p>Our evaluation of different metadata configurations of vExplorer found that adding comments to the search strategy
did not improve the relevance of the results. This unexpected finding for social media data may be attributed to several
factors. First, the presence of slang terms and inflammatory trolling in user comments may make it hard for the proposed
system to capture the core concepts of a specific video. Detection of trolling<sup>23</sup> has the potential to improve
the performance of our system. Second, our search space is limited to the token list available in the initial seed
video. While computing the similarity score, our current system may overlook relevant tokens that are missing from a seed
video. This is likely when the seed video’s title, description, tags, and comments are not
sufficiently verbose. Incorporating a thesaurus may improve the performance of the proposed system. We could
also generate closed-caption text for each video and include that content in the bag-of-words representation. In addition,
we may be able to improve the performance further by including speech and video frame-based features. Third,
YouTube provides related videos based on analytics built into its native platform (e.g., its database), whereas each call to the YouTube
API adds network latency. Our system is therefore significantly slower than a YouTube search. Systems like
vExplorer could be integrated into these online video services to provide a better user experience. Precomputed
hierarchies of related videos may also reduce the search time significantly. Fourth, our evaluation was confined to the
concept of “autism personal stories,” for which only a limited set of YouTube videos is available. Many videos did not
have any comments, and for some the uploader had disabled user comments. In summary, we plan to expand
our current work to address these challenges and to conduct more comprehensive evaluations of the system in different
health domains.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. David T Fetzer and O Clark West.
          <article-title>The HIPAA privacy rule and protected health information</article-title>.
          <source>Academic Radiology</source>, <volume>15</volume>(<issue>3</issue>):<fpage>390</fpage>-<lpage>395</lpage>, <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Alec Go, Richa Bhayani, and Lei Huang.
          <article-title>Twitter sentiment classification using distant supervision</article-title>.
          <source>CS224N Project Report</source>, Stanford, <volume>1</volume>(2009):<fpage>12</fpage>, <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Jure Leskovec and Julian J McAuley.
          <article-title>Learning to discover social circles in ego networks</article-title>.
          In <source>Advances in neural information processing systems</source>, pages <fpage>539</fpage>-<lpage>547</lpage>, <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel.
          <article-title>Image-based recommendations on styles and substitutes</article-title>.
          In <source>Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>, pages <fpage>43</fpage>-<lpage>52</lpage>. ACM, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Florian Schroff, Dmitry Kalenichenko, and James Philbin.
          <article-title>Facenet: A unified embedding for face recognition and clustering</article-title>.
          In <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, pages <fpage>815</fpage>-<lpage>823</lpage>, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Lior Wolf, Tal Hassner, and Itay Maoz.
          <article-title>Face recognition in unconstrained videos with matched background similarity</article-title>.
          In <source>Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on</source>, pages <fpage>529</fpage>-<lpage>534</lpage>. IEEE, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman.
          <article-title>Project adam: Building an efficient and scalable deep learning training system</article-title>.
          In <source>OSDI</source>, volume <volume>14</volume>, pages <fpage>571</fpage>-<lpage>582</lpage>, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Justin Salamon, Christopher Jacoby, and Juan Pablo Bello.
          <article-title>A dataset and taxonomy for urban sound research</article-title>.
          In <source>Proceedings of the 22nd ACM international conference on Multimedia</source>, pages <fpage>1041</fpage>-<lpage>1044</lpage>. ACM, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere.
          <article-title>The million song dataset</article-title>.
          In <source>ISMIR</source>, volume <volume>2</volume>, page <fpage>10</fpage>, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Betul Erdogdu Sakar, M Erdem Isenkul, C Okan Sakar, Ahmet Sertbas, Fikret Gurgen, Sakir Delil, Hulya Apaydin, and Olcay Kursun.
          <article-title>Collection and analysis of a parkinson speech dataset with multiple types of sound recordings</article-title>.
          <source>IEEE Journal of Biomedical and Health Informatics</source>, <volume>17</volume>(<issue>4</issue>):<fpage>828</fpage>-<lpage>834</lpage>, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan.
          <article-title>Youtube-8m: A large-scale video classification benchmark</article-title>.
          <source>arXiv preprint arXiv:1609.08675</source>, <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li.
          <article-title>Yfcc100m: The new data in multimedia research</article-title>.
          <source>Communications of the ACM</source>, <volume>59</volume>(<issue>2</issue>):<fpage>64</fpage>-<lpage>73</lpage>, <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge L Reyes-Ortiz.
          <article-title>Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine</article-title>.
          In <source>International workshop on ambient assisted living</source>, pages <fpage>216</fpage>-<lpage>223</lpage>. Springer, <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Timo Sztyler and Heiner Stuckenschmidt.
          <article-title>On-body localization of wearable devices: An investigation of position-aware activity recognition</article-title>.
          In <source>Pervasive Computing and Communications (PerCom), 2016 IEEE International Conference on</source>, pages <fpage>1</fpage>-<lpage>9</lpage>. IEEE, <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Xiao Wu, Alexander G Hauptmann, and Chong-Wah Ngo.
          <article-title>Practical elimination of near-duplicates from web video search</article-title>.
          In <source>Proceedings of the 15th ACM international conference on Multimedia</source>, pages <fpage>218</fpage>-<lpage>227</lpage>. ACM, <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. Douglass R Cutting, David R Karger, Jan O Pedersen, and John W Tukey.
          <article-title>Scatter/gather: A cluster-based approach to browsing large document collections</article-title>.
          In <source>Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval</source>, pages <fpage>318</fpage>-<lpage>329</lpage>. ACM, <year>1992</year>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. Nachiketa Sahoo, Jamie Callan, Ramayya Krishnan, George Duncan, and Rema Padman.
          <article-title>Incremental hierarchical clustering of text documents</article-title>.
          In <source>Proceedings of the 15th ACM international conference on Information and knowledge management</source>, pages <fpage>357</fpage>-<lpage>366</lpage>. ACM, <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. Daan Odijk, Edgar Meij, Isaac Sijaranamual, and Maarten de Rijke.
          <article-title>Dynamic query modeling for related content finding</article-title>.
          In <source>Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>, pages <fpage>33</fpage>-<lpage>42</lpage>. ACM, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. Maria-Florina Balcan, Yingyu Liang, and Pramod Gupta.
          <article-title>Robust hierarchical clustering</article-title>.
          <source>The Journal of Machine Learning Research</source>, <volume>15</volume>(<issue>1</issue>):<fpage>3831</fpage>-<lpage>3871</lpage>, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. David M Blei, Andrew Y Ng, and Michael I Jordan.
          <article-title>Latent dirichlet allocation</article-title>.
          <source>Journal of Machine Learning Research</source>, <volume>3</volume>(Jan):<fpage>993</fpage>-<lpage>1022</lpage>, <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. Vincent A Fusaro, Jena Daniels, Marlena Duda, Todd F DeLuca, Olivia D'Angelo, Jenna Tamburello, James Maniscalco, and Dennis P Wall.
          <article-title>The potential of accelerating early detection of autism through content analysis of youtube videos</article-title>.
          <source>PLoS ONE</source>, <volume>9</volume>(<issue>4</issue>):e93533, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22. Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini.
          <article-title>Building a large annotated corpus of english: The penn treebank</article-title>.
          <source>Computational linguistics</source>, <volume>19</volume>(<issue>2</issue>):<fpage>313</fpage>-<lpage>330</lpage>, <year>1993</year>.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23. Patxi Galán-García, José Gaviria de la Puerta, Carlos Laorden Gómez, Igor Santos, and Pablo García Bringas.
          <article-title>Supervised machine learning for the detection of troll profiles in twitter social network: Application to a real case of cyberbullying</article-title>.
          <source>Logic Journal of the IGPL</source>, <volume>24</volume>(<issue>1</issue>):<fpage>42</fpage>-<lpage>53</lpage>, <year>2016</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>