Support Vector Machine-Based Segmentation for Accurate Crowd Density Detection in Urban Spaces

Gourav Kalra1,†, Rajeev Yadav2,†, Satish Kumar Alaria3,*,†

1M. Tech. Scholar, Department of CSE, Arya College of Engineering, Jaipur, Rajasthan
2Professor, Department of CSE, Arya College of Engineering, Jaipur, Rajasthan
3Computer Instructor, Education Department, Government of Rajasthan, Rajasthan

Abstract

Estimating crowd density has become increasingly important in fields like public safety, event management, and urban planning. Accurate detection of crowd density helps in making informed decisions and ensuring safety in crowded areas. This study proposes a novel method for crowd density detection using segmentation and classification based on a Support Vector Machine (SVM). The method involves two key steps: crowd segmentation and density categorization. During segmentation, advanced image processing techniques like background removal and region-based segmentation extract crowd sections from input images or video frames. These segmented areas are then classified using an SVM model, known for handling complex data. The model is trained on a diverse dataset containing images with varying crowd densities. The approach captures crucial spatial and contextual information, and extensive testing on various datasets has demonstrated its accuracy and resilience in dynamic crowd scenarios. The proposed SVM-based method can be implemented in real time, making it valuable for applications requiring quick decisions. This technique offers a reliable and efficient solution for crowd density detection, with significant implications for event management, public safety, and urban planning in congested environments.

Keywords

Crowd density detection, Support Vector Machine, crowd segmentation, image processing, real-time detection, region-based segmentation, urban planning, machine learning.

1. Introduction

The world has undergone rapid urbanization over the past two decades, leading to a significant increase in city populations. As cities become more crowded, the need for effective surveillance systems has grown, particularly to monitor people's movements and behaviors in public spaces, ensuring the safety and security of individuals and their possessions. Surveillance has become an integral part of maintaining public safety, with both public and private entities worldwide regularly employing video cameras for this purpose. However, traditional surveillance systems rely heavily on human operators, whose effectiveness can vary depending on their alertness and the available manpower. Given these limitations, modern surveillance is transitioning towards smart systems equipped with advanced technologies like intelligent video analysis, which enable automated decision-making without continuous human intervention.

Smart surveillance systems can be broadly categorized into two types: visual-based and multimodal. Visual-based systems utilize computer vision algorithms to process video data from cameras and drones in real time, offering solutions like facial recognition and license plate identification. On the other hand, multimodal systems integrate various data sources, including motion and audio sensors, alongside video data to provide comprehensive real-time insights. Companies like IBM and Intel have pioneered technologies that can detect traffic incidents, optimize routes, and even identify crime-related events using these advanced surveillance systems.
SCCTT-2024: International Symposium on Smart Cities, Challenges, Technologies and Trends, 29th Nov 2024, Delhi, India
∗ Corresponding author.
† These authors contributed equally.
gkalra144@gmail.com (Gourav Kalra); rajeevtpo@gmail.com (Dr. Rajeev Yadav); Satish.alaria@gmail.com (Satish Kumar Alaria)
0009-0008-4926-1929 (Gourav Kalra); 0000-0002-1976-4065 (Dr. Rajeev Yadav); 0000-0001-8298-1364 (Satish Kumar Alaria)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In today's world, smart surveillance plays a critical role, particularly in monitoring crowds. This becomes especially relevant during large public gatherings, where the potential for disasters, accidents, or criminal activity increases. Effective crowd control is vital in these scenarios, as seen in airports, concert venues, and religious gatherings. As crime, terrorism, and natural disasters rise, smart surveillance systems must rely on robust algorithms to manage and predict crowd behavior.

The analysis of crowd behavior is a key focus of this paper. It begins by defining different types of crowds, highlighting their unique characteristics and behaviors in various contexts. A deeper exploration of collective crowd behavior from a psychological standpoint follows, offering insights into how crowds react in specific situations. From there, the discussion shifts to the challenges of analyzing crowd behavior through video footage, including the complexities involved in cognitive modeling for crowd behavior analysis. Ultimately, this discussion sets the stage for understanding the motivations behind this research and the primary contributions of the proposed approach.

Figure 1: Common applications of smart surveillance

A crowd is a large group of people gathered in one location, exhibiting a range of behaviors and attitudes. Based on movement patterns, crowds can generally be divided into two categories: dynamic and stationary. Dynamic crowds are constantly in motion and can be either organized or unstructured. In organized crowds, such as marathons or rallies, individuals move in the same direction, maintaining consistent behavior over time. In unstructured crowds, such as those seen in airports or stadiums, individuals move in various directions with varying spatiotemporal characteristics. Stationary crowds, on the other hand, include audiences at rallies, concerts, or plays, where people remain in one place for a period of time.

Figure 2: Statistics of crowd disasters

The characteristics of a crowd, such as its size, density, location, and time, are critical in understanding its behavior. Crowds can also be categorized into active and passive groups based on the behavior of their participants. While passive crowds primarily observe without engaging in activities, active crowds may exhibit behaviors ranging from aggression to panic or expressive actions, such as cheering at a concert or participating in religious events. Analyzing crowd behavior is essential for smart surveillance systems, as it helps authorities understand crowd dynamics, develop control measures, and prevent crowd-related disasters. The behavior of a crowd is often influenced by the context in which it forms. For instance, in a shopping district, people might move peacefully alongside one another, while in a stadium, fans may express intense emotions in response to the game.
These varying behaviors highlight the need for smart surveillance systems capable of monitoring and analyzing different crowd scenarios in real time. Crowd behavior is inherently complex, as it depends on the context and setting. Monitoring and understanding collective crowd behavior in both regular and emergency situations is challenging, particularly when individual identification is difficult in dense crowds.

Over time, psychologists and sociologists have proposed numerous theories to explain crowd behavior. One of the earliest and most popular is Le Bon's Group Mind Theory, which suggests that crowd members lose their individual identity and are easily influenced by a leader. Freud's theories support the notion that individuals in a crowd open their unconscious minds, yet maintain control over their actions. McPhail's Pre-Disposition hypothesis posits that aggressive behavior in crowds is influenced by individual dispositions toward antisocial behavior. In contrast, the Emergent-Norm hypothesis suggests that crowds consist of people with common interests, leading to distinctive behavior patterns. These collective behaviors can often become impulsive, unpredictable, and volatile. Understanding them is crucial for developing smart surveillance systems that can anticipate and prevent crowd-related issues. Such systems must account for the social and psychological components of group behavior, including how crowd members concentrate their attention on a common cause, exchange ideas rapidly, and form homogenous groups based on shared beliefs and behaviors.

Machine learning, particularly Support Vector Machines (SVM), is a key technology used in crowd behavior analysis. SVM models create distinct classes from input data features, enabling the classification of various crowd behaviors. Deep learning, especially Convolutional Neural Networks (CNN), is another powerful tool for crowd behavior research. CNN mimics the structure of neurons in the human visual cortex, allowing for the hierarchical processing of input data. Long Short-Term Memory (LSTM) networks, which resemble the brain's short-term memory, are also used to analyze and predict crowd behavior based on past events. These advanced AI models enable the system to learn from past examples, making it more effective in predicting crowd behaviors and detecting anomalies.

This research is motivated by the need to develop smart surveillance systems capable of detecting crowd anomalies, evaluating behaviors in real time, and providing timely alerts. The COVID-19 pandemic has also highlighted the importance of monitoring crowd behavior to ensure public safety, especially in terms of enforcing social distancing and detecting free-standing conversation groups. By combining video, audio, and other sensor data, this study aims to develop a comprehensive crowd behavior analysis system that can operate effectively in a variety of challenging scenarios. In conclusion, the introduction of cognitive modeling and AI technologies into surveillance systems offers the potential to greatly improve crowd management, enhancing public safety and preventing disasters in crowded settings.

Related Works

Crowd behavior evaluation through computer vision techniques has been explored in various research studies, each contributing to a broader understanding of how anomalies and movement patterns in large groups can be detected and analyzed.
A review of these works highlights both advancements in this domain and gaps that future research must address. For instance, a framework for video event identification [1] proved essential for high-level video indexing and retrieval. This framework addressed challenges such as skewed data distribution and loose video structure, automating the determination of crucial thresholds that were typically set manually in conventional Association Rule Mining (ARM) techniques. The reduction in manual intervention in video analysis was a critical advancement towards fully autonomous video content analysis.

The Trajectory Segmentation and Multi-Instance Learning (TRASMIL) framework allowed for precise and adaptable local anomaly detection. This three-step method was found to outperform existing techniques in identifying trajectories with local abnormalities [2]. TRASMIL emphasized the importance of trajectory-based anomaly detection for accurately understanding crowd movement and behaviors. Similarly, a semantic video segmentation method [3] relied on One-Class Classification (OCC) techniques for identifying events through frame-by-frame processing. That work highlighted the effectiveness of OCC in detecting unsupervised events, particularly through the use of Temporal Self-Similarity Maps (TSSMs), which were evaluated on a publicly available thermal video dataset. The use of OCC for unsupervised event detection opened new avenues for handling video data with minimal prior knowledge of the scene.

A dynamic time interval segmentation technique was introduced to improve item anomaly detection; the approach dynamically validated the time interval length, grouping successive attack ratings [4]. While effective, it was observed [5] that the robustness of anomaly detection methods had received limited attention in terms of accuracy and consistency, pointing to a gap that future research must address. Meanwhile, an unsupervised method was proposed [6] for scene analysis and anomaly detection in traffic video data recorded by stationary security cameras. By using local Hierarchical Dirichlet Process (HDP) models, Kaltsa et al. were able to achieve improved accuracy with lower computational costs, emphasizing the need for efficient solutions in processing large amounts of traffic video data.

Other researchers have approached the problem from a probabilistic standpoint. A probabilistic framework for identifying local spatiotemporal anomalies [7] allowed for a more refined decision-making process by identifying ideal decision-making procedures based on score functions obtained from nearby neighbors' distances. The work emphasized the importance of spatiotemporal scales in accurately identifying anomalies. Spatiotemporal anomaly detection was also addressed through scalable aggregation and geolocated text visualization [8]: a cluster analysis technique automatically discovered anomalies and presented the findings through a global map depiction, demonstrating how scalable visualization could assist analysts in categorizing and evaluating event candidates on a global scale. Social media data were visualized with a visual analytics technique [9] that allowed users to extract significant subjects from a chosen collection of communications. By applying Latent Dirichlet Allocation (LDA) and visualizing topic time series, analysts could better understand abnormal events by identifying peaks and outliers in the data.
A probabilistic methodology [10] placed temporal and geographical constraints on video volumes, allowing for the identification of abnormal video configurations. The approach, which avoided the need for motion estimation or background removal, proved particularly efficient for detecting rare events in video data. In a related development, an anomaly detection method [11] incorporated both spatial and temporal contexts. It introduced a region-based descriptor called Motion Context, which proved to be more reliable than statistical models when dealing with small training datasets; the use of compact random projections sped up the search process, further enhancing the efficiency of the method.

A spatiotemporal Laplacian eigenmap technique [12] was used to model crowd behavior and detect anomalies. The method, which identified both local and global anomalies, showcased the potential of regular crowd behavior modeling in accurately detecting abnormal crowd behaviors. A different approach developed a Structural Context Descriptor (SCD) [13] to define crowd individuals, utilizing the potential energy function of particles from solid-state physics. The SCD method used the 3-D Discrete Cosine Transform (DCT) to compute crowd SCD fluctuations and pinpoint issues through these variations.

Anomaly detection in complex crowd settings was also addressed [15] using a hierarchical activity-pattern discovery framework. That work factored in both local and global spatiotemporal contexts, creating an anomaly energy function that could quantify the abnormality of motion patterns. This method was particularly useful for detecting abnormal activity in densely packed crowds [16].

Continuing with anomaly detection in video monitoring, an unsupervised statistical learning framework [17] was proposed for monitoring crowded environments. The method, which relied on clustering and sparse coding to learn global and local activity patterns, utilized a multi-scale analysis approach to ensure precise anomaly localization. These techniques were advanced by a novel crowd video anomaly detection method [18] based on spatiotemporal texture analysis; designed for real-time applications, it simplified machine learning procedures and demonstrated improved flexibility and efficiency compared to existing systems. A spatiotemporal architecture for anomaly detection [19] combined spatial feature representation with temporal changes in spatial features and proved effective for detecting anomalies in videos of crowded scenes.

An intrusion detection technique [20] detected disturbances of normal behavior, signaling potential intentional or unintentional attacks; the work explored both supervised and unsupervised methods for anomaly detection, emphasizing the importance of detecting disruptions in normal behavior patterns. An anomaly detection approach [21] utilized a reliable anomaly degree measure to increase the separability between anomaly pixels and background pixels. This method divided pixels into potential anomaly sections and background sections, followed by discriminative information learning, highlighting the significance of feature extraction for accurate anomaly detection. A fresh approach to anomaly detection used a difference-of-convex-functions algorithm [22], building a hidden Markov anomaly detector that extended the One-Class SVM and demonstrated improved performance across various datasets.
A sparse reconstruction-based method [23] detected aberrant behavior by combining low-level visual features with causality analysis; by analyzing individual and group behaviors, it detected abnormal interactions in multi-object settings. Image classification performance was improved through convolutional neural network (CNN) ensembles [24], which outperformed both single CNN models and regular perceptrons in detecting abnormalities. An unsupervised Fully Convolutional Network (FCN) [25] was applied to anomaly detection in videos; the approach relied on temporal data and cascaded outlier detection, lowering computational complexity and improving both speed and accuracy. A machine learning-based anomaly detection approach was developed [26] for detecting fraudulent traffic in Modbus and Transmission Control Protocol (TCP) connections, where the use of SVM, Random Forest, K-NN, and K-means clustering allowed for effective anomaly detection in an industrial scenario.

Deep learning was applied to behavior detection [27], using a bag of visual words and the Agglomerative Information Bottleneck technique to compress vocabulary and minimize feature dimensions; this sparse representation approach increased detection precision for deviant behavior. Deep learning was also leveraged in social multimedia to detect suspect flows, tested on a large-scale Carnegie Mellon University (CMU) dataset [28]. The Inception-V3 neural network was used for feature extraction and classification [29], with its performance compared against traditional models like K-Nearest Neighbor, random forest, and SVM, while another technique focused on maximizing the area under the ROC curve for hierarchical abnormal behavior detection, eliminating the need for manual labeling and offering a semi-supervised approach [30].

The literature on crowd behavior analysis demonstrates the continuous evolution of methods aimed at enhancing surveillance through anomaly detection. From trajectory-based techniques to deep learning and probabilistic models, researchers have developed increasingly sophisticated approaches to ensure real-time, accurate detection of abnormal behavior in crowds. These advancements have laid the groundwork for further research into the robustness and scalability of anomaly detection methods, while also identifying key areas for future exploration, such as improving computational efficiency and addressing issues like occlusion and multi-camera data integration.

Mathematical Modeling & Proposed Methodology

In the realm of image processing, feature extraction is pivotal for enhancing tasks like pattern recognition, face detection, and image classification. Features can broadly be divided into two categories: general features such as color, texture, and shape, and domain-specific features like object detection or human face recognition. The efficiency of image annotation frameworks hinges on the ability to represent semantic concepts through low-level image features, which form the foundation of multimedia information retrieval, object recognition, and image annotation. In both Content-Based Image Retrieval (CBIR) and Automatic Image Annotation (AIA), key image features such as color, texture, and shape are employed to extract meaningful data. While CBIR primarily focuses on visual aspects of an image, AIA incorporates high-level concepts that better reflect the image content, addressing the challenge of locating images in large datasets.
Hence, this research integrates both low-level features and high-level semantic concepts to improve image retrieval, focusing particularly on texture and shape as central features for efficient image annotation. Feature extraction is a dimensionality reduction process in which the image is transformed into a feature set representing its high-level characteristics. By condensing the image data into a feature vector, the system can quickly and accurately identify patterns within an image. For computational efficiency, a robust feature extraction system is required, and combining low-level and high-level semantic concepts provides better retrieval accuracy.

The proposed system uses fused feature extraction, employing texture and shape features to enhance the accuracy of image retrieval and reduce system complexity. This methodology combines multiple features to provide more accurate image information, avoiding the errors that might arise from relying on a single feature. In this study, the Haralick and Tamura texture features are fused with shape features, significantly improving image retrieval performance and reducing processing time.

Image feature extraction forms the backbone of image retrieval systems, with features classified into two main categories: general features and domain-specific features. General features, including color, texture, and shape, describe the overall content of the image, while domain-specific features, such as face recognition or object detection, require specialized knowledge and fine-tuning. Low-level features like color and texture represent the visual aspects of an image, while high-level features correspond to semantic keywords or concepts.

Figure 3: Multi-Class SVM classifier

In CBIR systems, visual similarity is calculated using distance measurements between the feature vectors of the query image and images in the database. The user provides a query image, and the system ranks the database images based on similarity, which often leads to incorrect results when only low-level features are considered. To overcome this issue, AIA systems incorporate semantic concepts based on visual content, enabling more accurate retrieval of relevant images.

Pre-processing is crucial for pattern recognition and image classification, as it enhances the quality of input images by removing noise, resizing, and adjusting image features. In this research, the images are normalized through rescaling to 128 × 128 pixels, ensuring uniformity across datasets and improving computational efficiency, as shown. Additionally, color conversion to grayscale reduces the inherent complexity of the images, facilitating edge detection and pixel-based processing.

In this research, edge-based segmentation is employed, relying on intensity differences and content. Edge detection using techniques such as the Sobel, Prewitt, and Canny operators helps identify object boundaries by detecting intensity contrasts. Canny edge detection, in particular, is favored for its ability to produce sharp and fine edges, as demonstrated. The performance of various segmentation techniques is evaluated using metrics such as Root Mean Square Error (RMSE), Signal-to-Noise Ratio (SNR), and Peak Signal-to-Noise Ratio (PSNR). RMSE measures the average difference between the original image and the segmented image, with a higher value indicating greater differences. SNR quantifies the noise present in an image, with higher values representing cleaner, noise-free images.
PSNR is commonly used to measure the quality of edge detection between the original and segmented image, with higher values indicating better segmentation accuracy; here $R$ denotes the maximum possible pixel value of the image. The performance evaluation results indicate that the Canny operator outperforms other edge detection techniques in terms of RMSE, SNR, and PSNR values.

In this section, we provide detailed mathematical expressions related to the proposed methodology, including image pre-processing, feature extraction, classification, and evaluation techniques. Each expression is explained to illustrate its role in the overall image annotation and retrieval system.

To normalize the size of images for consistent processing, we perform rescaling. If the original image has dimensions $W \times H$ (width $W$ and height $H$), and we want to resize it to a fixed size $w_0 \times h_0$, the rescaling factors $S_x$ and $S_y$ in the $x$ and $y$ directions can be expressed as:

$S_x = \frac{w_0}{W}, \qquad S_y = \frac{h_0}{H}$  (1)

This ensures the image is resized uniformly for further processing. To convert a color image to a gray-scale image, a weighted sum of the red, green, and blue (RGB) components is used:

$I_{\text{gray}} = 0.2989 \cdot R + 0.5870 \cdot G + 0.1140 \cdot B$  (2)

where $R$, $G$, and $B$ are the intensities of the red, green, and blue components of the image, respectively. This formula accounts for the different contributions of each color channel to perceived brightness.

Thresholding is a simple segmentation technique used to separate objects from the background by converting an image into a binary format. Given a threshold value $T$, the binary image $I_{\text{binary}}(x, y)$ is computed as:

$I_{\text{binary}}(x, y) = \begin{cases} 1 & \text{if } I(x, y) > T \\ 0 & \text{if } I(x, y) \le T \end{cases}$  (3)

where $I(x, y)$ represents the intensity of the pixel at location $(x, y)$.

Canny edge detection uses gradients to detect edges. The gradient magnitude $G$ at each pixel is calculated from the partial derivatives in the $x$- and $y$-directions, $G_x$ and $G_y$:

$G = \sqrt{G_x^2 + G_y^2}$  (4)

The direction of the edge $\theta$ is calculated as:

$\theta = \tan^{-1}\!\left(\frac{G_y}{G_x}\right)$  (5)

After calculating the gradient magnitude and direction, non-maximum suppression and double thresholding are applied to finalize the edge map.

The GLCM is a statistical measure used to describe texture features. For two pixels separated by a distance $d$ in a specific direction $\theta$, the GLCM element $p(i, j)$ is defined as:

$p(i, j) = \sum_{x=1}^{N} \sum_{y=1}^{N} \begin{cases} 1 & \text{if } I(x, y) = i \text{ and } I(x + \Delta x,\, y + \Delta y) = j \\ 0 & \text{otherwise} \end{cases}$  (6)

where $I(x, y)$ is the intensity of the pixel at $(x, y)$, $(\Delta x, \Delta y)$ is the displacement determined by $d$ and $\theta$, and $i$ and $j$ represent gray-level values.

The contrast, a texture feature that describes the intensity contrast between a pixel and its neighbor over the whole image, is computed as:

$\text{Contrast} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} (i - j)^2 \cdot p(i, j)$  (7)

where $p(i, j)$ is the element of the GLCM corresponding to the gray-level co-occurrence between $i$ and $j$.

Entropy measures the randomness or complexity of the texture, and is given by:

$\text{Entropy} = -\sum_{i=0}^{N-1} \sum_{j=0}^{N-1} p(i, j) \cdot \log p(i, j)$  (8)

This value indicates the level of disorder or unpredictability in the texture of the image.
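To ground the pre-processing, edge-detection, and texture-feature steps above, the following minimal Python sketch computes them with OpenCV and scikit-image. These libraries, the file name crowd.jpg, and the function names are illustrative assumptions; the paper itself does not specify an implementation. It also includes the RMSE and PSNR quality measures discussed above (formalized below as Equations (13) and (14)).

```python
import cv2
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def preprocess(path):
    """Rescale to 128x128 (Eq. 1) and convert to grayscale (Eq. 2)."""
    img = cv2.imread(path)                       # hypothetical input file
    img = cv2.resize(img, (128, 128))            # fixed size w0 x h0
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

def glcm_features(gray, d=1, angle=0.0):
    """Contrast (Eq. 7) and entropy (Eq. 8) from the GLCM of Eq. (6)."""
    glcm = graycomatrix(gray, distances=[d], angles=[angle],
                        levels=256, symmetric=True, normed=True)
    contrast = graycoprops(glcm, 'contrast')[0, 0]
    p = glcm[:, :, 0, 0]
    entropy = -np.sum(p * np.log(p + 1e-12))     # epsilon avoids log(0)
    return contrast, entropy

def rmse(original, processed):
    """Root mean square error between two images (Eq. 13)."""
    diff = original.astype(np.float64) - processed.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))

def psnr(original, processed, r=255.0):
    """Peak signal-to-noise ratio with R = 255 for 8-bit images (Eq. 14)."""
    mse = np.mean((original.astype(np.float64)
                   - processed.astype(np.float64)) ** 2)
    return 10.0 * np.log10(r ** 2 / mse)

gray = preprocess('crowd.jpg')
edges = cv2.Canny(gray, 100, 200)   # double thresholding inside the Canny detector
print(glcm_features(gray), rmse(gray, edges), psnr(gray, edges))
```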
Coarseness measures the texture's roughness, where large differences in pixel intensities indicate coarser textures. The coarseness feature is calculated as:

$C = 2^{k_{\text{opt}}}, \qquad k_{\text{opt}} = \arg\max_k \sum_{x=1}^{N} \sum_{y=1}^{N} \left| A\!\left(x + 2^{k-1}, y\right) - A(x, y) \right|$  (9)

where $A(x, y)$ is the intensity at pixel $(x, y)$ and $k_{\text{opt}}$ is the scale that maximizes the intensity difference.

In Support Vector Machines (SVM), the goal is to find a hyperplane that separates data points of different classes. For a linear SVM, the decision boundary is given by:

$w \cdot x + b = 0$  (10)

where $w$ is the weight vector, $x$ is the input feature vector, and $b$ is the bias term. The hyperplane is defined such that it maximizes the margin between the two classes. The margin $M$ is the distance between the hyperplane and the closest data points, and is defined as:

$M = \frac{2}{\|w\|}$  (11)

The objective is to maximize $M$, which is equivalent to minimizing $\frac{\|w\|^2}{2}$. For non-linearly separable data, kernel functions transform the input space into a higher-dimensional space. The polynomial kernel is given by:

$K(x_i, x_j) = (x_i \cdot x_j + 1)^d$  (12)

where $d$ is the degree of the polynomial, and $x_i$ and $x_j$ are input vectors.

RMSE measures the difference between the original and predicted values and is often used in evaluating edge detection. RMSE is computed as:

$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (O_i - E_i)^2}$  (13)

where $O_i$ is the original image, $E_i$ is the processed (e.g., edge-detected) image, and $N$ is the total number of pixels. PSNR is used to measure the quality of an image after compression or transformation. It is defined as:

$\text{PSNR} = 10 \log_{10}\!\left(\frac{R^2}{\text{MSE}}\right)$  (14)

where $R$ is the maximum pixel value (e.g., 255 for 8-bit images) and MSE is the Mean Squared Error between the original and processed image. These mathematical expressions and their explanations provide a foundation for understanding the various components of the proposed image annotation and retrieval system, from feature extraction to classification and evaluation. Each formula plays a critical role in enhancing the accuracy and efficiency of the overall system.

In the proposed methodology, the focus is on automatic image annotation using machine learning, specifically the Multi-Class Support Vector Machine (MCSVM) classifier. Automatic image annotation is a classification task where an image is automatically labeled with semantic keywords based on its visual content. Traditional binary SVM classifiers have limitations in handling multi-class problems, which are common in image annotation tasks. MCSVM extends the binary SVM approach to handle multiple classes by training classifiers for each class and combining their outputs to classify new images. The proposed system incorporates the Semantic Keyword Transfer (SKT) algorithm to bridge the gap between low-level image features and high-level semantic concepts.

Image classification involves training a model to recognize patterns in labeled images and applying this model to classify new images. Classification techniques such as Minimum Distance Classifier (MDC), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Artificial Neural Networks (ANN), and Decision Trees (DT) are commonly used in image processing. The SVM classifier is particularly effective in high-dimensional data classification due to its ability to create optimal class boundaries by maximizing the margin between classes. In the context of image annotation, MCSVM is used to classify images with multiple objects or regions. The proposed methodology for automatic image annotation combines fused features (texture and shape) with the MCSVM classifier and SKT algorithm.
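As a concrete sketch of the MCSVM classifier described above, the following Python fragment trains a one-vs-rest multi-class SVM with the polynomial kernel of Equation (12). scikit-learn is assumed purely for illustration, and the fused feature matrix here is a random placeholder standing in for the real Haralick, Tamura, and shape features.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder fused feature vectors and semantic class labels; in the real
# system each row would hold Haralick + Tamura + shape features of one image.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))       # 300 images, 16 fused features
y = rng.integers(0, 5, size=300)     # 5 semantic keyword classes

# K(xi, xj) = (xi . xj + 1)^d (Eq. 12); scikit-learn's polynomial kernel is
# (gamma * <xi, xj> + coef0)^degree, so gamma=1 and coef0=1 match that form.
mcsvm = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(SVC(kernel='poly', degree=3, gamma=1.0, coef0=1.0)),
)
mcsvm.fit(X, y)
print(mcsvm.predict(X[:5]))          # predicted semantic classes
```

One binary SVM is trained per class and their outputs are combined, mirroring the one-classifier-per-class scheme the MCSVM description relies on.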
The proposed approach bridges the semantic gap between low-level image features and high-level semantic concepts, resulting in improved image retrieval accuracy. The integration of Haralick and Tamura texture features with shape features provides a comprehensive representation of image content, while the MCSVM classifier efficiently handles multi-class image annotation tasks. The evaluation results demonstrate that the proposed system outperforms existing methods in terms of retrieval accuracy, making it a promising solution for automatic image annotation and retrieval tasks.

Results and Analysis

This research proposes and examines a simple algorithm to perform this crowd behavior analysis. Given an aerial image of a crowd, the algorithm segments the image into crowd and non-crowd regions. On a large scale, we expect a crowd to contain some repetitive visual elements or textures that are significantly different from those of a non-crowd region. The proposed algorithm uses multiple Gabor filters to capture these different textures in an image and uses improved pre-processing and support vector machines to segment the image into two groups corresponding to crowd and non-crowd regions.

This research attempts to detect crowds of humans in still images. Given an image, the proposed algorithm segments out the regions that the crowd occupies. The data set consists of 1200 aerial images of crowds taken from the internet. Each image is tagged with a range of five properties. By testing the algorithm on a range of images with varying properties, this research aims to choose a good set of parameters that can detect crowds well despite the diverse characteristics of crowds.

The ratio $\sigma/\lambda$ determines the spatial frequency bandwidth and hence the number of parallel excitatory and inhibitory stripes in the Gabor filter. The half-response spatial frequency bandwidth $b$ (in octaves) is related to the ratio $\sigma/\lambda$ as follows:

$b = \log_2 \frac{\frac{\sigma}{\lambda}\pi + \sqrt{\frac{\ln 2}{2}}}{\frac{\sigma}{\lambda}\pi - \sqrt{\frac{\ln 2}{2}}}, \qquad \frac{\sigma}{\lambda} = \frac{1}{\pi} \sqrt{\frac{\ln 2}{2}} \cdot \frac{2^b + 1}{2^b - 1}$  (15)

In order to capture the repetitive texture of a crowd from many perspectives, we use 6 orientations with an orientation separation angle of $d\theta = 30°$:

$\theta \in \{0°, 30°, 60°, 90°, 120°, 150°\}$  (16)

We also use a range of wavelengths, evenly spaced in $\log_2$-space, ranging from some minimum wavelength to the radius of the image (or half its diagonal length). The choice of the minimum wavelength is adjusted when we apply the algorithm to some initial images. The general formula for the chosen wavelengths is:

$\lambda = \lambda_{\min} \times 2^k, \quad k \in \mathbb{N}$  (17)

For example, if we choose both $\lambda_{\min}$ and $r_\lambda$ equal to 2 for a 288 × 512 image, there would be a total of 42 Gabor filters from 6 orientations and 7 wavelengths. In this work we set the value of the bandwidth $b$ by default to 1 octave. In that case, the equation gives the approximation:

$\sigma = 0.5 \times \lambda$  (18)

For each filtered image, we use a Gaussian smoothing function given by:

$g(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$  (19)

where $\sigma$ is the standard deviation that determines the window size. The ratio $\sigma/\sigma_g$ (where $\sigma_g$ is the standard deviation parameter of the Gabor filter) is estimated and adjusted when we apply the algorithm to some initial images. We first test the algorithm with minimum wavelength $\lambda_{\min} = 3$ and a Gaussian-to-Gabor standard deviation ratio $\sigma/\sigma_g = 3$. The resulting segmentation is shown in Figure 5.
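The filter-bank construction just described can be sketched as follows. scikit-image and SciPy are assumed here for illustration (the original experiments used MATLAB filters), and gabor_bank_responses is a hypothetical helper name.

```python
import numpy as np
from scipy import ndimage
from skimage.filters import gabor_kernel, gaussian

def gabor_bank_responses(gray, lam_min=2.0, n_orient=6, smooth_ratio=1.6):
    """Gabor responses over 6 orientations (Eq. 16) and dyadic wavelengths
    lam_min * 2^k up to half the image diagonal (Eq. 17)."""
    gray = gray.astype(np.float64)
    radius = 0.5 * np.hypot(*gray.shape)          # half the diagonal length
    responses = []
    lam = lam_min
    while lam <= radius:
        sigma_g = 0.5 * lam                       # 1-octave bandwidth (Eq. 18)
        for i in range(n_orient):
            theta = i * np.pi / n_orient          # 0, 30, ..., 150 degrees
            kern = gabor_kernel(frequency=1.0 / lam, theta=theta,
                                sigma_x=sigma_g, sigma_y=sigma_g)
            resp = ndimage.convolve(gray, np.real(kern), mode='nearest')
            # Gaussian smoothing of Eq. (19); sigma/sigma_g is the tunable
            # ratio discussed in the text (3 in the first trial, 1.6 later).
            responses.append(gaussian(np.abs(resp), sigma=smooth_ratio * sigma_g))
        lam *= 2.0
    return np.stack(responses, axis=-1)           # H x W x (orientations x scales)
```

Note that mode='nearest' replicates edge pixels rather than assuming zero intensity outside the image, which is one possible way around the boundary defect of the MATLAB average filters discussed later in the results.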
Figure 5: Test image of moderate crowd scenario
Figure 6: Labeling of images in moderate scenario
Figure 7: Segmentation of crowd scenario
Figure 8: Test image of high crowd scenario
Figure 9: Plot of confusion matrix

Table 1: Analysis of performance parameters

Scenario   Precision   Recall   F1-Score
1          1.00        0.98     0.99
2          0.98        0.95     0.97
3          1.00        0.98     1.00
4          1.00        0.99     1.00
5          0.97        0.95     0.95

The algorithm does decently well with both pictures. For both images, it pinpoints the correct regions where the crowds of people are. In the first image, it slightly overestimates the size of each crowd on the left and right, but the crosswalk stripes do not seem to confuse the algorithm. With the second image, the algorithm does a slightly worse job: the shadow makes it overestimate the regions that the crowd occupies, and there are quite a few people who are not captured as belonging to the crowd.

Table 2: Comparative analysis of proposed methodology

Parameter               Previous Work           Proposed Work
Type of Detection       Segmentation            Segmentation and Classification
Type of Analysis        Single-Level Scenario   Multiple Scenarios
Performance Parameter   F Score                 Precision, Recall and F Score
Implementation          Complex                 Simple
Computational Time      Average                 Faster

In order to lessen the algorithm's overestimation and to detect more people in a scattered crowd, we reduce the values of both the minimum wavelength and the standard deviation ratio. The goal is for the algorithm to pick up smaller details in the picture and thus segment more precisely all the regions of the crowd. In the second trial, we change the minimum wavelength to 2 and the standard deviation ratio to 1.6. The algorithm improves for both images. For the first image, the overestimation is reduced, although the algorithm confuses a tiny part of the crosswalk stripes as part of the crowd. For the second image, the algorithm no longer includes the majority of the shadow as part of the crowd, and only one or two people are not included as belonging to the crowd. As a result, we choose a minimum wavelength of 2 and a standard deviation ratio of 1.6 as the parameters for our algorithm, in addition to the other parameters described above.

There are some defects inherent in MATLAB average filters such as Gabor and Gaussian. In particular, they assume that pixels outside the image have an intensity of 0, so the algorithm may not work well for pixels at the circumference of images. This problem did not arise with the 16 images in this data set, but it may need to be dealt with when the algorithm is applied to more images in different circumstances. The program ran reasonably fast, needing from 20.84 to 31.54 seconds for each image of size 288 × 512. However, the time does add up when we process all the images multiple times while testing different parameters.

Crowd image segmentation and detection play a significant role in various computer vision applications, including crowd monitoring, crowd behavior analysis, and public safety. This work presents a comprehensive study on the use of Gabor filters and Support Vector Machines (SVM) for crowd image segmentation and detection. The Gabor filter is employed to extract discriminative features from crowd images, and SVM is used as a classifier to distinguish between crowd and non-crowd regions. The results demonstrate the effectiveness of this approach in accurately segmenting and detecting crowds in complex visual scenes.
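Putting the pieces together, a minimal end-to-end sketch of the Gabor-plus-SVM segmentation pipeline could look as follows. This is an illustrative reading of the method, not the authors' code: gabor_bank_responses is the hypothetical helper sketched earlier, all images are assumed to share one size (e.g., 288 × 512) so the per-pixel feature dimension is constant, and the binary crowd masks stand in for the labeled images.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_crowd_segmenter(gray_images, crowd_masks, lam_min=2.0, ratio=1.6):
    """Fit a linear SVM on per-pixel Gabor feature vectors.
    crowd_masks are binary: 1 = crowd pixel, 0 = non-crowd pixel."""
    X, y = [], []
    for gray, mask in zip(gray_images, crowd_masks):
        feats = gabor_bank_responses(gray, lam_min=lam_min, smooth_ratio=ratio)
        X.append(feats.reshape(-1, feats.shape[-1]))   # one row per pixel
        y.append(mask.reshape(-1))
    clf = LinearSVC()
    clf.fit(np.vstack(X), np.concatenate(y))
    return clf

def segment(clf, gray, lam_min=2.0, ratio=1.6):
    """Predict a binary crowd/non-crowd mask for a new image."""
    feats = gabor_bank_responses(gray, lam_min=lam_min, smooth_ratio=ratio)
    labels = clf.predict(feats.reshape(-1, feats.shape[-1]))
    return labels.reshape(gray.shape)                  # 1 = crowd, 0 = non-crowd
```

Per-scenario precision, recall, and F1 scores such as those in Table 1 can then be computed by comparing predicted masks against the labeled masks, for example with sklearn.metrics.precision_recall_fscore_support.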
This research concludes by discussing the potential applications of crowd image segmentation and detection using Gabor filters and SVM in real-world scenarios.

Conclusion

This research presents a novel approach to crowd behavior analysis using a combination of Gabor filters and Support Vector Machines (SVM) to detect and segment crowds in still images. The algorithm effectively segments an image into crowd and non-crowd regions by identifying repetitive textures that differentiate the crowd from the background. Through the use of multiple Gabor filters, the method captures various orientations and scales of these textures, enhancing the detection of crowd-specific characteristics. The SVM classifier is used to cluster the regions based on these features, ensuring that crowd regions are distinguished from non-crowd areas.

The ability to detect crowds in public spaces is crucial for preventing congestion, ensuring safety, and enforcing social distancing measures. This research successfully demonstrates that crowd segmentation is a vital preprocessing step for more complex tasks such as crowd density estimation and behavior analysis. The algorithm's robustness is tested on a dataset of 1200 aerial images with varying properties, including crowd density, background variation, and lighting conditions, resulting in reliable crowd detection. Despite some limitations, such as overestimation in regions affected by shadows, the proposed methodology improves the precision and accuracy of crowd detection. By adjusting key parameters like the minimum wavelength and standard deviation ratio, the algorithm's performance was optimized, providing precise crowd segmentation. This research highlights the potential for further advancements in crowd detection, with applications in public safety, event management, and urban planning, offering a foundation for real-time crowd analysis systems in diverse environments.

REFERENCES

[1] Aditya, CSK, Hani'ah, M, Bintana, RR & Suciati, N 2015, 'Batik classification using neural network with gray level co-occurrence matrix and statistical color feature extraction', in 2015 International Conference on Information & Communication Technology and Systems (ICTS), pp. 163-8.
[2] Ahmed, M, Mahmood, AN & Hu, J 2016, 'A survey of network anomaly detection techniques', Journal of Network and Computer Applications, vol. 60, pp. 19-31.
[3] Anton, SD, Kanoor, S, Fraunholz, D & Schotten, HD 2018, 'Evaluation of machine learning-based anomaly detection algorithms on an industrial modbus/tcp data set', in Proceedings of the 13th International Conference on Availability, Reliability and Security, pp. 1-9.
[4] Au, CE, Skaff, S & Clark, JJ 2006, 'Anomaly detection for video surveillance applications', in 18th International Conference on Pattern Recognition (ICPR'06), vol. 4, pp. 888-91.
[5] Babenko, B, Yang, M-H & Belongie, S 2010, 'Robust object tracking with online multiple instance learning', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1619-32.
[6] Belousov, A, Verzakov, S & Von Frese, J 2002, 'A flexible classification approach with optimal generalisation performance: support vector machines', Chemometrics and Intelligent Laboratory Systems, vol. 64, no. 1, pp. 15-25.
[7] Benabbas, Y, Ihaddadene, N & Djeraba, C 2011, 'Motion pattern extraction and event detection for automatic visual surveillance', EURASIP Journal on Image and Video Processing, vol. 2011, pp. 1-15.
[8] Bertini, M, Del Bimbo, A & Seidenari, L 2012, 'Multi-scale and real-time non-parametric approach for anomaly detection and localization', Computer Vision and Image Understanding, vol. 116, no. 3, pp. 320-9.
[9] Bezdek, JC, Ehrlich, R & Full, W 1984, 'FCM: The fuzzy c-means clustering algorithm', Computers & Geosciences, vol. 4, no. 10, pp. 191-203.
[10] Brassil, J 2009, 'Technical challenges in location-aware video surveillance privacy', in Protecting Privacy in Video Surveillance, Springer, pp. 91-113.
[11] Brutzer, S, Höferlin, B & Heidemann, G 2011, 'Evaluation of background subtraction techniques for video surveillance', in CVPR 2011, pp. 1937-44.
[12] Castiglione, A, Cepparulo, M, De Santis, A & Palmieri, F 2010, 'Towards a lawfully secure and privacy preserving video surveillance system', in International Conference on Electronic Commerce and Web Technologies, pp. 73-84.
[13] Chae, J, Thom, D, Bosch, H, Jang, Y, Maciejewski, R, Ebert, DS & Ertl, T 2012, 'Spatiotemporal social media analytics for abnormal event detection and examination using seasonal-trend decomposition', in 2012 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 143-52.
[14] Chandola, V, Banerjee, A & Kumar, V 2009, 'Anomaly detection: A survey', ACM Computing Surveys (CSUR), vol. 41, no. 3, pp. 1-58.
[15] Chang, C-I & Chiang, S-S 2002, 'Anomaly detection and classification for hyperspectral imagery', IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 6, pp. 1314-25.
[16] Chapelle, O, Scholkopf, B & Zien, A 2009, 'Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]', IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542.
[17] Chen, M, Chen, S-C & Shyu, M-L 2007, 'Hierarchical temporal association mining for video event detection in video databases', in 2007 IEEE 23rd International Conference on Data Engineering Workshop, pp. 137-45.
[18] Cheng, K-W, Chen, Y-T & Fang, W-H 2015, 'Video anomaly detection and localization using hierarchical feature representation and Gaussian process regression', in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2909.
[19] Cho, S-B & Park, H-J 2003, 'Efficient anomaly detection by modeling privilege flows using hidden Markov model', Computers & Security, vol. 22, no. 1, pp. 45-55.
[20] Choi, Y-S 2009, 'Least squares one-class support vector machine', Pattern Recognition Letters, vol. 30, no. 13, pp. 1236-40.
[21] Chong, YS & Tay, YH 2017, 'Abnormal event detection in videos using spatiotemporal autoencoder', in International Symposium on Neural Networks, pp. 189-96.
[22] Coello, CAC, Pulido, GT & Lechuga, MS 2004, 'Handling multiple objectives with particle swarm optimization', IEEE Transactions on Evolutionary Computation, vol. 8, no. 3, pp. 256-79.
[23] Cong, Y, Yuan, J & Tang, Y 2013, 'Video anomaly search in crowded scenes via spatio-temporal motion context', IEEE Transactions on Information Forensics and Security, vol. 8, no. 10, pp. 1590-9.
[24] Dasarathi, S 2015, 'Parametrization of Convolutional Neural Network for Image Classification', Dublin, National College of Ireland.
[25] Davies, AC, Yin, JH & Velastin, SA 1995, 'Crowd monitoring using image processing', Electronics & Communication Engineering Journal, vol. 7, no. 1, pp. 37-47.
[26] Davis, JW & Sharma, V 2005, 'Fusion-based background-subtraction using contour saliency', in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)-Workshops, pp. 11.
[27] Du, B & Zhang, L 2014, 'A discriminative metric learning based anomaly detection method', IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 11, pp. 6844-57.
[28] Du, B, Zhao, R, Zhang, L & Zhang, L 2016, 'A spectral-spatial based local summation anomaly detection method for hyperspectral images', Signal Processing, vol. 124, pp. 115-.
[29] Duan, L-Y, Xu, M, Tian, Q, Xu, C-S & Jin, JS 2005, 'A unified framework for semantic shot classification in sports video', IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1066-83.
[30] Feizi, A 2020, 'Hierarchical detection of abnormal behaviors in video surveillance through modeling normal behaviors based on AUC maximization', Soft Computing, vol. 24, no. 14, pp. 10401-13.