<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Real-Time Game Highlight Detection for Data-driven League of Legends Coaching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rosana Valero</string-name>
          <email>rosanavalero5@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cesar O. Diaz</string-name>
          <email>cesar@omashu.gg</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Sanchez-Riera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff2">
          <label>2</label>
          <institution>OMASHU</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut de Robòtica i Informàtica Industrial</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Autonoma de Barcelona</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>211</fpage>
      <lpage>223</lpage>
      <abstract>
        <p>League of Legends is one of the most popular e-Sports games, with its highly competitive gameplay demanding both strategic precision and real-time decision-making. Analyzing high-impact events is key for coaching, match analysis, and content creation. This study presents a real-time highlight detection system that identifies impactful moments by fusing visual and audio cues. Visual indicators are extracted using optical flow and color intensity, while audio excitement is captured from caster commentary using pitch and volume analysis. Experiments conducted on professional match footage demonstrate the system's effectiveness and its potential for e-Sports analytics, coaching, and automated workflows. Future work will explore integrating player facial expressions, voice communication, and emotional context to better understand high-pressure moments and enhance highlight interpretation.</p>
      </abstract>
      <kwd-group>
        <kwd>League of Legends</kwd>
        <kwd>e-Sports Analytics</kwd>
        <kwd>Highlight Detection</kwd>
        <kwd>Multi-modal Analysis</kwd>
        <kwd>Optical Flow</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid growth of e-Sports, particularly titles like League of Legends (LoL), has created new
opportunities for performance analytics, content automation, and strategic insight. In competitive environments,
such as tournaments and regional leagues, matches often hinge on
high-impact events like teamfights, objective captures, or turret destructions. Automatically detecting these
highlights is essential for real-time broadcasting, post-match analysis, and training tools.</p>
      <p>LoL is a multiplayer online battle arena (MOBA) game featuring two teams of five players who
compete to destroy the enemy Nexus. Matches are characterized by bursts of intense activity interspersed
with calmer strategic play. These high-action moments are usually accompanied by visual cues (skill
animations, explosions, map effects) and audio cues (e.g., commentary from casters, the live commentators
who narrate and analyze the match for the audience, or crowd noise), making them ideal candidates for
highlight detection.</p>
      <p>
        This work introduces a multimodal system that detects key gameplay moments by combining two
main sources:
• Visual cues: fast motion, color bursts, and flashy effects during kills, turret destructions, or
objective captures, tracked using optical flow and color scoring.
• Audio cues: spikes in pitch and volume in the casters’ commentary that signal excitement.
      </p>
      <p>
        Problem Definition: E-Sports has become a global industry with massive audiences and competitive
stakes on par with traditional sports [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a trend further accelerated by the COVID-19 pandemic [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
For instance, the 2018 LoL World Championship reached more than 43 million average concurrent
viewers, making it the third most-watched championship event worldwide, surpassing many traditional
sports finals and trailing only the FIFA World Cup and the NFL Super Bowl in audience size [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
While traditional sports benefit from well-established methods for identifying key moments (e.g., goals,
touchdowns), e-Sports requires more advanced techniques to identify moments of high importance.
Manual curation is time-consuming and subjective, highlighting the need for automated, data-driven
approaches.
      </p>
      <p>Working Hypothesis: This work hypothesizes that integrating visual and audio signals enables
the detection of significant moments in competitive League of Legends. The primary objectives
of this research are:
• To detect gameplay highlights based on visual motion intensity and caster vocal excitement.
• To validate the system on professional match footage and evaluate its utility for strategic analysis.</p>
      <p>By leveraging lightweight, real-time analysis, this work offers a scalable solution for highlight
detection. Beyond its direct applications in broadcasting and coaching, it also lays the groundwork for
future extensions, such as integrating player reactions via webcam or voice communication, to better
understand how players experience high-pressure moments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the art</title>
      <p>The growth of e-Sports has triggered increasing interest in both gameplay analysis and the human
factors that influence competitive performance. In fast-paced games like League of Legends, detecting
and analyzing key events, such as teamfights or objective captures, has high value for broadcasting,
coaching, and content creation. This section reviews related work in automatic highlight detection,
with emphasis on audio-visual signal processing and real-time inference.</p>
      <sec id="sec-2-1">
        <title>2.1. Gameplay Highlights: Data Processing and Feature Extraction</title>
        <p>
          Detecting gameplay highlights in real time requires analyzing both visual and audio signals. On
the audio side, features such as pitch, tone, and volume, often extracted via Mel-Frequency Cepstral
Coefficients (MFCCs) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], are used to capture spikes in caster excitement, which often align with key
in-game events [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. To isolate relevant audio sources, some approaches apply source separation tools
such as Spleeter to filter out background game sounds and emphasize commentary [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Visually, motion-based techniques like optical flow and color scoring are used to detect rapid changes
on screen, such as skill effects, explosions, or animations during objective captures [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. These are
particularly effective in dynamic games like LoL, where the camera constantly moves and multiple
elements overlap.
        </p>
        <p>
          For highlight detection to be practically useful, real-time inference is critical. Prior studies have
demonstrated the feasibility of applying lightweight models for in-game event tracking or outcome
prediction. For instance, Yao (2021) applied deep learning to recognize basketball actions in real time,
while Junior and Campelo (2023) achieved over 80% accuracy in predicting LoL match outcomes
mid-game using models like LightGBM and logistic regression [
          <xref ref-type="bibr" rid="ref10 ref9">10, 9</xref>
          ]. These examples highlight the growing
potential of scalable e-Sports analytics systems.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Datasets</title>
        <p>Research in highlight detection and e-Sports analysis depends on datasets that reflect the complexity of
gameplay and human response. Two main categories are relevant:
Audio Cue Datasets. Although originally designed for emotion classification, several speech datasets
provide high-quality vocal data with expressive variations in pitch, tone, and intensity, features that
are useful for detecting vocal excitement in caster commentary.</p>
        <p>
          Among the most relevant: RAVDESS provides labeled emotional speech across multiple intensities,
useful for training models to detect vocal stress or excitement [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. IEMOCAP contains dyadic
conversations annotated with vocal expressions that align well with momentary spikes in pitch and
energy [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. MELD extends this by providing contextual, multi-speaker dialogues extracted from real
scenarios, enabling training on vocal patterns that vary based on scene intensity or speaker reactions
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          League of Legends Gameplay Datasets. For event detection and match analysis in LoL:
DeepLeague provides labeled minimap frames and in-game event sequences for training deep learning
models [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. LoL-V2T links gameplay video to natural language annotations, supporting highlight
summarization and video-text modeling [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>This section presents our pipeline for real-time highlight detection in League of Legends e-Sports
matches. The system processes both gameplay video and caster audio to identify high-intensity events.
Figure 1 illustrates the overall architecture.</p>
      <sec id="sec-3-1">
        <title>3.1. Audio-Based Event Detection</title>
        <p>Audio processing is performed on caster commentary, leveraging their expressive reactions as implicit
signals of key moments.</p>
        <p>
          Model Architecture: We trained a CNN-based classifier inspired by [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], originally designed for
speech emotion recognition. Although our goal is not to classify emotional categories, we leverage the
RAVDESS dataset and the associated architecture as a proxy to model vocal excitement, a key signal in
caster commentary. The assumption is that vocal expressions labeled as anger, joy, surprise, etc., exhibit
pitch and energy dynamics similar to those present during gameplay highlights. The architecture
includes three convolutional layers (64, 128, 128 filters; kernel size 3 × 3), each followed by Batch
Normalization, MaxPooling, and Dropout (rate 0.3). During training, we used a softmax output layer to
optimize for emotion classification. However, for highlight detection, we discard the final predictions
and instead use intermediate features (e.g., MFCC activations, pitch, and volume patterns) as indicators
of vocal intensity.
        </p>
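        <p>The described architecture can be sketched in a few lines of PyTorch. This is a minimal approximation, not the authors' code: the filter counts (64, 128, 128), kernel size, and Dropout rate come from the text, while the input shape (one channel, 40 MFCCs, roughly 128 time frames), the ReLU activations, and the eight-class RAVDESS head are assumptions.</p>

```python
import torch
import torch.nn as nn

class ExcitementCNN(nn.Module):
    """Sketch of the CNN described in the text: three conv blocks
    (64, 128, 128 filters, 3x3 kernels), each followed by BatchNorm,
    MaxPool, and Dropout(0.3). Input shape (1 x 40 x 128) and the
    8-class RAVDESS head are assumptions, not values from the paper."""
    def __init__(self, n_classes: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.3),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.3),
            nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(2), nn.Dropout(0.3),
        )
        # 40x128 input halved three times -> 5x16 feature maps
        self.head = nn.Linear(128 * 5 * 16, n_classes)

    def forward(self, x):
        feats = self.features(x)            # intermediate features reused for detection
        return self.head(feats.flatten(1))  # softmax applied via the training loss
```

At inference time the paper discards this head and keeps only the intermediate features as vocal-intensity indicators.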
        <p>Feature Extraction Using MFCCs: We extract 40 Mel-Frequency Cepstral Coefficients (MFCCs)
per audio segment using librosa, along with pitch and volume information. Moments with pitch
above 110 Hz and volume louder than -8 dB are flagged as acoustically intense and likely to correspond
to gameplay highlights.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Visual Gameplay Frame Analysis</title>
        <p>To evaluate the gameplay frames, we employed a two-step process: (1) gameplay frame classification
and (2) motion analysis via optical flow and color scoring.</p>
        <sec id="sec-3-2-1">
          <title>1. Frame Analysis</title>
          <p>
            To distinguish gameplay from non-gameplay frames and detect elements like replays or key
events, we applied the following techniques:
Gameplay Binary Classification Development: We trained a classifier on a custom-labeled
dataset with two categories: gameplay (in-game action) and non-gameplay (casters, audience, etc.).
For this, we used the ResNeXt-50 32x4d architecture [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], known for its grouped convolutions and
strong classification performance. The model, pre-trained on ImageNet, was fine-tuned on our
custom dataset. We replaced the original classification head with a binary output layer, allowing
the model to output probabilities for the two classes: gameplay and non-gameplay. Standard
preprocessing (resizing, normalization) was first applied to all frames, and the convolutional
backbone was used for hierarchical feature extraction.
          </p>
          <p>
            Replay Detection Using Optical Character Recognition (OCR): We used Tesseract OCR [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]
to detect “Replay” text overlays, commonly displayed during replays in e-Sports broadcasts. This
ensures we avoid duplicate highlight detections from repeated footage.
          </p>
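          <p>The replay check reduces to searching OCR output for the overlay text. A sketch with hypothetical helper names; only the image_to_string call itself follows the pytesseract API:</p>

```python
import re

def looks_like_replay(ocr_text: str) -> bool:
    """True if OCR output from a frame contains a 'Replay' overlay word."""
    return re.search(r"\breplay\b", ocr_text, re.IGNORECASE) is not None

def frame_is_replay(frame) -> bool:
    """Run Tesseract on a frame (numpy image array) and check for the overlay.
    pytesseract is imported lazily so the text helper stays dependency-free."""
    import pytesseract  # requires the Tesseract binary to be installed
    return looks_like_replay(pytesseract.image_to_string(frame))
```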
          <p>Player and Event Text Extraction: OCR was also used to extract on-screen game text (e.g.,
player names, event phrases like “has slain” or “double kill”). To improve matching accuracy, we
compared detected phrases against a dynamic list built from JSON-based team rosters, organized by
year. This list includes player names and game-specific terms (e.g., “turret destroyed”, “shutdown”,
“Baron Nashor”).</p>
          <p>Matching this information to frame-level text allows us to contextualize each gameplay segment,
identifying what is happening and which players were involved in specific highlights.</p>
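          <p>The roster matching can be sketched with fuzzy string comparison against the year-indexed list; the JSON structure, names, and cutoff below are illustrative placeholders, not the actual roster files:</p>

```python
import difflib
import json

# Hypothetical JSON roster, organized by year as described in the text.
ROSTERS = json.loads(
    '{"2023": {"players": ["Faker", "Zeus", "Keria"],'
    ' "terms": ["has slain", "double kill", "turret destroyed",'
    ' "shutdown", "Baron Nashor"]}}')

def match_ocr_phrase(phrase: str, year: str = "2023", cutoff: float = 0.8):
    """Map a noisy OCR phrase to the closest known player name or event term,
    or None when nothing is similar enough."""
    vocab = ROSTERS[year]["players"] + ROSTERS[year]["terms"]
    hits = difflib.get_close_matches(phrase, vocab, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```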
        </sec>
        <sec id="sec-3-2-2">
          <title>2. Color Score Evaluation Using Optical Flow</title>
          <p>
            To assess visual intensity, we measured motion using optical flow and calculated a frame-wise
color score:
Optical Flow-Based Motion Detection: We apply the Farneback algorithm [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] to compute
dense optical flow between consecutive frames, capturing pixel-wise displacement. This allows
us to detect movement patterns associated with key gameplay events like kills, turret dives, or
objectives.
          </p>
          <p>To reduce noise from camera panning (the camera in League of Legends is constantly
moving across the map), we apply an overlay mask focused on the in-game map (Summoner’s Rift),
excluding irrelevant elements (e.g., player cams, scoreboard). This isolates meaningful gameplay
actions such as duels, tower destructions, or jungle invades.</p>
          <p>Color Score Analysis for Visual Dynamics: Once the optical flow was computed, we applied a
threshold to filter out low-motion areas, ensuring that we focus on the most significant movements.
To quantify the intensity of this visual activity, we compute a color score for each masked frame
as follows:
• Identifying Dominant Colors: K-means clustering identifies and suppresses dominant (static)
colors in the masked region, typically associated with the Summoner’s Rift map background.
• Applying a Weight Mask: A Gaussian weight mask emphasizes central regions of the frame
where action usually takes place.
• Score Computation: Remaining pixel values are weighted and summed to compute a final
color score, reflecting the frame’s visual intensity and correlating with potential highlights.
This fusion of optical flow-based motion detection and color scoring enhances background
subtraction and focuses attention on the most dynamic gameplay segments.</p>
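          <p>The three steps above can be sketched as follows. This is a sketch under stated assumptions: the cluster count, Gaussian width, and normalization are illustrative choices that the paper does not report.</p>

```python
import numpy as np
from sklearn.cluster import KMeans

def color_score(masked_frame: np.ndarray, n_clusters: int = 3) -> float:
    """Frame-wise color score: suppress the dominant (static) color cluster,
    weight the remaining pixels by a central Gaussian, and sum."""
    h, w, _ = masked_frame.shape
    pixels = masked_frame.reshape(-1, 3).astype(np.float32)
    labels = KMeans(n_clusters=n_clusters, n_init=4,
                    random_state=0).fit_predict(pixels)
    dominant = np.bincount(labels).argmax()        # background (map) cluster
    keep = (labels != dominant).reshape(h, w)      # non-background pixels
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = 0.25 * min(h, w)                       # central-emphasis width (assumed)
    gauss = np.exp(-(((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * sigma ** 2)))
    intensity = masked_frame.mean(axis=2) / 255.0
    return float((intensity * gauss * keep).sum() / (h * w))
```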
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Highlight Detection via Multimodal Fusion</title>
        <p>We classify a segment as a highlight only when both visual (color score) and audio excitement (pitch
and volume) exceed empirical thresholds.</p>
        <p>Visual Threshold: Frames with a color score above 0.009 are flagged as potential highlights, signaling
moments of high visual activity. This threshold was set based on empirical observations of impactful
in-game events (e.g., skill animations, explosions).</p>
        <p>Audio Threshold: Audio cues, specifically pitch and volume, are extracted from the casters’
commentary. Moments exceeding 110 Hz in pitch and -8 dB in volume (including background accompaniment) are
considered acoustically intense. These spikes often correspond to key events like kills or objectives and
reflect both excitement and crowd reactions during these critical moments.</p>
        <p>Combining both cues helps ensure robustness and reduces false positives from noise in a single
modality. This multimodal approach serves as the core of the highlight detection system.</p>
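        <p>The fusion rule reduces to a conjunction of threshold tests; the thresholds come from the text, while the function name is illustrative:</p>

```python
# Thresholds from the text: a segment is a highlight only when the visual
# score and both audio signals all clear their empirical thresholds.
COLOR_THR = 0.009   # color score
PITCH_THR = 110.0   # Hz
VOLUME_THR = -8.0   # dB

def is_highlight(color_score: float, pitch_hz: float, volume_db: float) -> bool:
    """Multimodal fusion: all three cues must exceed their thresholds."""
    return (color_score > COLOR_THR
            and pitch_hz > PITCH_THR
            and volume_db > VOLUME_THR)
```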
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>This section presents the datasets, training setup, evaluation metrics, and performance of our highlight
detection system.</p>
        <p>We used several datasets, each focused on a specific component of the system.</p>
        <p>
RAVDESS for Audio-Based Excitement Modeling. The RAVDESS dataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] includes 1,440
speech recordings from 24 professional actors expressing eight emotions (e.g., happy, sad, angry) at
two intensity levels. Its balanced, high-quality audio makes it suitable for training models that capture
expressive vocal patterns, which we use to model caster excitement based on pitch and volume variations.
We used the speech portion to train a CNN model on Mel-Frequency Cepstral Coefficients (MFCCs),
leveraging CNNs’ strength in capturing spatial features within audio signals.
        </p>
        <p>Custom Gameplay Detection Dataset. For accurate highlight detection, we built a custom
frame-level dataset using gameplay-only footage from the 2023 Worlds competition. We excluded any
non-gameplay visuals, such as player cams, casters, or audience shots, to ensure that the model focused
solely on in-game events. The final dataset comprises 3146 labeled frames, of which 74.8% are gameplay
and 25.2% are non-gameplay. This distribution reflects the natural prevalence of gameplay segments in
professional broadcasts rather than an artificially balanced dataset. The dataset includes:
• Gameplay Frames: Representing the in-game action, where the core gameplay is visible,
including battles, character movements, and game environment changes.
• Non-Gameplay Frames: Including player POVs, replays, audience, commentator speaking, or
even breaks between gameplay.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training Setup</title>
        <p>4.2.1. Audio-Based Peak Detection:
The audio classifier was trained on the RAVDESS dataset using MFCCs as input features. The model, a
CNN with three convolutional layers and a softmax output, was trained for 250 epochs, monitoring
both training and validation performances.</p>
        <p>(Figure: training and validation accuracy and loss curves for the audio classifier.)</p>
        <p>4.2.2. Gameplay Binary Classification:
A ResNeXt-50 model, pre-trained on ImageNet, was fine-tuned to classify frames as gameplay or
non-gameplay. The dataset contained 2,340 gameplay and 785 non-gameplay frames, manually labeled from
Worlds 2023 footage. Data augmentation was applied, and training lasted 25 epochs using cross-entropy
loss and the Adam optimizer with an initial learning rate of 0.001.</p>
        <p>(Figure: training and validation accuracy and loss curves for the gameplay classifier.)</p>
        <p>The model quickly reached low loss values, reflecting the clear visual distinction between the two
classes. Figure 4 illustrates typical frame examples.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Groundtruth and Highlight Categorization:</title>
        <p>To evaluate highlight detection, we constructed a groundtruth dataset from the 2023 Worlds Championship,
using event data scraped from the LoL Fandom wiki
(https://lol.fandom.com/wiki/2023_Season_World_Championship/Main_Event). Events such as kills, objectives, and turret
destructions were timestamped and grouped into highlights within a 30-second window.</p>
        <p>(Figure 4 panels: (a) Gameplay, (b) Non-Gameplay.)</p>
        <p>The event data extracted from these matches were formatted into a CSV file with attributes such
as match_id, event_type, and timestamp. Each event was categorized based on its type (e.g.,
CHAMPION_KILL, BUILDING_KILL) and grouped into highlights if they occurred within a 30-second
window.</p>
        <p>Each highlight was categorized by importance:
• High: Multi-kills, Baron/Dragon objectives, or aces.
• Moderate: Single kills with some strategic impact.
• Low: Minor actions like turret hits or isolated events.</p>
        <p>We tested the detection model on full Worlds 2023 matches, comparing detected highlights with
groundtruth events using a ± 2 second tolerance to account for possible timestamp mismatches. This
evaluation tested the system’s ability to identify key gameplay moments in real match conditions.</p>
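        <p>The tolerance-based comparison can be sketched as a greedy one-to-one matching of timestamps; the greedy pairing strategy is an assumption, since the text only specifies the ±2-second window:</p>

```python
def match_highlights(detected, groundtruth, tol=2.0):
    """Match detected vs groundtruth highlight timestamps (seconds) within
    +/- tol, one-to-one and greedily; returns (TP, FP, FN)."""
    gt = sorted(groundtruth)
    used = [False] * len(gt)
    tp = 0
    for d in sorted(detected):
        for i, g in enumerate(gt):
            if not used[i] and abs(d - g) <= tol:
                used[i] = True   # each groundtruth event matched at most once
                tp += 1
                break
    return tp, len(detected) - tp, len(gt) - tp
```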
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation Metrics:</title>
        <p>Performance was assessed using standard classification metrics:
• Precision: the proportion of detected highlights that matched groundtruth highlights, calculated
as:</p>
        <p>Precision = TP / (TP + FP)   (1)</p>
        <p>• Recall: the proportion of groundtruth highlights that were correctly detected by the model,
calculated as:</p>
        <p>Recall = TP / (TP + FN)   (2)</p>
        <p>• F1-Score: the harmonic mean of precision and recall, providing a balanced view of the model’s
performance, calculated as:</p>
        <p>F1 = 2 · (Precision · Recall) / (Precision + Recall)   (3)</p>
        <p>where TP, FP, and FN refer to true positives, false positives, and false negatives in highlight
detection.</p>
        <p>Given that the most critical moments in LoL matches are often high-importance events such as
multi-kills, Baron steals, or game-ending plays, we placed special emphasis on evaluating the model’s
ability to detect high-importance highlights. True positives (TP), false positives (FP), and false negatives
(FN) were calculated specifically for high-importance events to assess how well the model performed in
identifying these key moments.</p>
        <p>
          Exploratory step: We also experimented with a highlight importance classifier using a small custom
dataset of video clips labeled with key in-game events. The model combined ResNet-18 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] for spatial
features and an LSTM for temporal dynamics.
        </p>
        <p>Despite regularization techniques like dropout and gradient clipping, results were unsatisfactory due
to limited and imbalanced training data. In-game complexity (e.g., overlapping animations, occlusions)
further reduced reliability. Improving this model would require a larger, curated dataset and refined
annotations. Future work could focus on building a comprehensive dataset to improve event detection
accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This section presents the evaluation of our highlight detection system using 12 full matches from the
2023 League of Legends World Championship.</p>
      <sec id="sec-5-1">
        <title>5.1. Feature Score Analysis</title>
        <p>Figure 6 presents the normalized feature scores used in highlight detection (color score, commentator
pitch, and background audio volume) plotted over time. Peaks in the color score indicate visually
intense gameplay moments, such as team fights or objective captures, where rapid changes in frame
content (e.g., explosions or ability flashes) occur. The pitch of commentators’ voices tends to spike
during high-stakes scenarios, reflecting their heightened reactions and serving as a strong indicator
of audience-relevant highlights. Background audio volume, which includes in-game sound effects,
commentary intensity, and occasional crowd noise, further reinforces these cues by marking acoustically
rich segments often linked to key events. Together, these features allow the system to identify moments
that are not only strategically relevant but also contextually rich for audience engagement, ensuring
robust and accurate highlight detection.</p>
        <p>5.1.1. Qualitative Case: Baron Nashor Detection.</p>
        <p>Figures 7a and 7b capture consecutive gameplay moments where the highlight is happening, while Figure
7c shows the computed optical flow, used to identify motion and pinpoint high-intensity actions. Figure
7d applies background subtraction, removing static elements and isolating key gameplay movement.
Finally, Figure 7e uses color analysis to refine the focus on dynamic areas, as explained in the Method section,
generating a color score of 0.0180, which helps quantify the intensity of the action. This combination of
optical flow and color analysis effectively highlights critical in-game moments.</p>
        <p>(Figure 7 panels: (a) Gameplay Frame 1, (b) Gameplay Frame 2, (c) Optical Flow, (d) Background Subtraction, (e) Color Analysis.)</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Highlight Detector Analysis</title>
        <p>To assess the alignment between modalities, we computed the cross-correlation between the color score
and pitch. Figure 8 shows a strong peak near zero lag, indicating that visual and audio signals often
occur simultaneously, validating their joint use in the detection pipeline.</p>
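        <p>The lag analysis can be reproduced with a normalized cross-correlation. A minimal numpy-based sketch, using illustrative signal shapes rather than the paper's data:</p>

```python
import numpy as np

def peak_lag(a: np.ndarray, b: np.ndarray) -> int:
    """Lag (in samples) at which the cross-correlation of two equal-length
    signals peaks; 0 means the signals are synchronized."""
    a = (a - a.mean()) / (a.std() + 1e-9)   # zero-mean, unit-variance
    b = (b - b.mean()) / (b.std() + 1e-9)
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(a) - 1)
```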
        <p>We focused on high-importance events like multi-kills and major objectives. Table 1 summarizes
system performance.</p>
        <p>The recall is relatively high, meaning the system successfully captures a large number of
high-importance moments. However, the relatively lower precision brings the F1-score down, emphasizing
that while the model is good at detecting key events, it tends to overestimate and classify less significant
moments as important highlights.</p>
        <p>Beyond detecting high-importance highlights, we evaluated whether the detected highlights included
any in-game events such as kills, turret destructions, or objectives (Baron Nashor/Dragon/Rift Herald).
This additional evaluation ensures that even if the model misclassifies a highlight’s importance, it still
detects relevant gameplay moments. As shown in Table 2, out of the 157 detected highlights, 97 were
true positives ("high-importance highlights"), while 60 were false positives, and 32 important moments
were missed. Additionally, 121 detected events were correctly identified (they included at least one
in-game event), while 36 highlights contained no relevant events, further reinforcing the importance of
refining the system to better handle such cases.</p>
        <p>False Positives: Often caused by flashy ability usage (e.g., against Krugs), crowd noise, or excited caster voice
during less impactful actions. An example of such a misclassification is shown in Figure 9. The model
mistakenly identified the killing of Krugs as a highlight: the scene exhibits high light intensity
caused by the abilities the champion uses to kill them.</p>
        <p>(Figure 9 panels: five consecutive frames of the misclassified Krug kill.)</p>
        <p>Despite the promising results, some false positives were caused by moments with crowd noise or
rapid camera movements, which the model mistakenly interpreted as high-importance highlights.
Additionally, large quantities of light effects in the game also disturbed the model, leading to further
misclassifications.</p>
        <p>However, the model consistently captured high-importance moments, such as multi-kills, Baron
steals, and team fights, confirming its effectiveness in detecting the most critical moments of a match.</p>
      <p>By refining the model to better differentiate between subtle in-game events and non-relevant moments,
and incorporating additional data to enhance its understanding of various highlight categories, future
iterations could further improve the detection of these high-importance events.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This work presented a multi-modal highlight detection system for League of Legends, combining visual
dynamics and caster audio to identify and segment key in-game moments. By leveraging video cues
such as optical flow and color intensity, alongside peaks in pitch and volume from commentary, the
system effectively pinpointed events of strategic significance.</p>
      <p>The system showed strong performance, particularly in recall (75.19%), capturing most high-impact
events such as team fights and objective captures. However, its lower precision (61.78%) revealed
a tendency to flag minor moments as highlights. With an F1-score of 67.83%, the system reliably
identifies key events while underscoring the need for further refinement to reduce false positives.</p>
      <p>Audio and visual cues complemented each other well, with cross-correlation analysis confirming their
synchronization. Casters’ pitch and volume provided emotional signals that helped identify moments
of excitement, even when visual intensity was low. However, the reliance on casters introduced bias,
as their reactions sometimes exaggerated the significance of events. For instance, casters might raise
their voices simply to build excitement when two enemies head toward the same point, even if nothing
significant happens. Tailoring audio models more specifically to e-Sports could mitigate this issue.</p>
      <p>Overall, the system shows strong potential for advancing e-Sports analytics. With further
improvements in precision, audio modeling, fine-tuning, and dataset expansion, it could support applications
such as content creation, post-game review, and automated match summarization across competitive
titles.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Future Work</title>
      <p>While this work focuses on highlight detection using video and caster audio, future research could
include information coming directly from the players, such as webcam footage or team voice chat,
especially during high-pressure moments of a match.</p>
      <p>Since highlights often represent the most intense and emotionally charged segments of a match,
they offer a natural opportunity to analyze how players respond under pressure. Capturing facial
expressions, tone of voice, or other behavioral cues during these moments could help reveal how
individuals manage stress, make decisions, and collaborate as a team.</p>
      <p>This kind of analysis could support coaching and performance optimization, helping teams better
understand emotional resilience, stress responses, and player tendencies in critical scenarios. Although
this study focused on League of Legends, the proposed multimodal approach could be generalized to
other competitive video games where the combination of video and audio cues might enable automatic
detection of crucial moments.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
  </back>
</article>