Panic Detection in Crowded Scenes

A crowd is a gathering of a large number of individuals in a confined area. Early detection of unusual behaviors, such as panic, in crowded scenes is very important. Panic detection consists of modeling normal scene behaviors and then detecting and identifying behaviors that do not match them. However, panic detection and recognition is a very difficult problem, especially across diverse scenes. Many methods proposed to cope with this problem have limited robustness as the density of the crowd varies. To handle this challenge, this paper proposes the integration of different features into a unified model. Discriminant binary patterns and neighborhood information are used to model complex and unique motion patterns in order to characterize different levels of features for diverse types of crowd scenes, focusing in particular on the detection of panic and non-pedestrian entities. The proposed method was evaluated on two benchmark datasets and outperformed five


I. INTRODUCTION
Crowd scene analysis is a major problem in computer vision. Its challenges arise in gatherings that attract many people, such as sports events, festivals, and social, political, and religious gatherings. Large crowds raise important safety and security concerns. Dealing with crowd scenes is therefore a very challenging problem in the subfield of video surveillance, due to the complex behaviors of individuals. A method developed for this purpose could be deployed for detecting critical crowd levels, counting people, and detecting anomalies in such environments. Detecting panic situations in such scenes is extremely important for crowd control and safety. Furthermore, in public gatherings, it is important to know the ongoing situation and the behavior of people attending an event, as it can provide useful information for future event planning and public space design. This study aims at detecting panic and non-pedestrian entities, which represent abnormal behaviors in crowd scenes, by integrating robust features into a unified model. The proposed method combines the Discriminant Binary Pattern (DBP) [1] with neighborhood orientation information [2]. These features are combined and classified using a multi-class support vector machine [3]. In the crowd scene context, integrating different features could represent individuals, while patterns that reflect the collective behavior of the crowd arise from the interaction of individuals. This kind of modeling deals with the challenges of crowded scenes in terms of various distributions, sparseness, consistency and inconsistency of the scene.
Many computer vision based methods have been proposed to detect panic and other anomalies in crowded scenes. However, most of them are modeled to address a specific scene, while different representations of motion and appearance are analyzed with different techniques. The proposed method explores the properties of abnormal situations, considering anomalies in terms of both panic and non-pedestrian entities. The method computes dense motion trajectories from input videos. Based on the computed trajectories, different features are combined into a unified hybrid model. Furthermore, the underlying complex motion information is explored, which can formulate complex events in the crowd. This information consists of mid-level characteristics that bridge the gap between low-level and high-level features for representing anomalous situations. The proposed hybrid features do not limit the type of features or scenes, enabling the extension of the technique to broader research fields. For experimental evaluation, the method is applied to two widely used benchmark datasets. Additionally, the proposed technique is compared with five existing methods, namely the Spatio-Temporal Texture Model (STTM) [4], the Abnormal Crowd Behavior Model (ACBM) [5], and the spatio-temporal volume based methods: Mixture of Dynamic Texture (MDT) [6], Data Mining for Anomaly (DMA) [7], and Optical Flow and Texture for Anomaly (OFTA) [8].

II. LITERATURE REVIEW
Recently, deep learning methods have shown promising results in different computer vision applications. Accordingly, different computer vision and deep learning based methods have been successfully employed to solve problems related to crowded scenes in real time [9][10][11]. Crowded scenes can be categorized by gathering location, e.g., congested urban areas, historical events organized in museums, music concerts, arrival and departure zones in airports, and sports events in stadiums. The proposed methods can generally help safety and security staff and administrators in identifying security measures. Different crowd analysis methods consider the input data in different ways. For example, some methods process videos from stored archives. The methods introduced in [12,13] extract frames from CCTV camera video. Some researchers introduced unified frameworks for crowd congestion analysis, individual counting, crowd flow analysis on pedestrian pathways, crowd fight analysis, and panic and bottleneck detection at exits and entrances [14][15][16]. In [17], a hybrid method was used to analyze crowds from different angles, such as crowds moving in different directions, statistics of a static crowd, the area covered by a potential crowd, and the crowd's social behavior in groups or as a whole. This technique could be deployed to take into account different aspects in public locations within a unified framework. Moreover, the techniques introduced in [18][19][20] could be used to direct crowd flow to avoid stampedes or other adverse events. In addition, crowd analysis approaches driven by tracking techniques [21][22][23] can be exploited to track suspicious individuals in a congested area in order to ensure security. The tracking-driven technique in [24] could be used to understand how individuals behave and move in crowded scenes.
These techniques can be used to divert crowd flows in public places to avoid congestion and stampedes. In general, crowd analysis methods can be divided into two categories. In the first, each entity in the crowd is considered individually, and the entities are then combined to model crowd behavior. However, these methods are applicable only when the scene is not occupied by a very dense crowd. In the second category, methods treat the whole crowd as a single entity. These techniques perform very well in densely crowded scenes. The techniques in [25][26][27] fall in the first category, and they can be utilized to track individual people in the crowd, calculate the waiting time of a static crowd, compute the length of a queue of individuals, count the number of people in crowded scenes, and analyze people's engagement in crowd scenes. The techniques presented in [28][29][30] fall in the second category, and they can help security staff and administrators to gain insight into a crowd scene and capture the essence of its behavior as a whole.
Different methods have been proposed in dealing with the problems of crowd analysis considering different factors such as individual tracking, abnormal situation detection, flow analysis, and situation awareness. The method proposed in [31] combines different data from different sensors to get more accurate localization of the target in the crowded scene, introducing a modified ensemble of Kalman filters to deal with a high degree of nonlinearity. The method in [6] detects and recognizes anomalies in crowded scenes using features related to the appearance and the dynamics of the crowd, considering both temporal and spatial information to compute mixtures of dynamic textures. The model presented in [32] uses a multiscale deep convolutional neural network to count the number of individuals in a single image with arbitrary crowd density and perspective. Authors in [33] presented an aggregation of ensembles considering pre-trained ConvNets and a pool of classifiers in order to detect anomalies in crowded scenes. Their approach was inspired by the concept that a set of different fine-tuned CNNs represents various levels of semantic characterization, and therefore encodes a very rich set of robust features. The method in [34] was an agent-based crowd simulation model that exploited a path planning strategy. Different elements, including traveling distance and turning angle, can be used efficiently for path planning in crowded scenes. An inspiration of momentum from Physics was exploited in [35], integrating foreground information with an object's motion, based on background subtraction, feature computation, behavior identification, and anomaly detection. Authors in [36] discussed various approaches for the analysis of both high and low density crowds to facilitate personal mobility, safety, and security, enabling assistive robotics in crowded scenes. 
This study demonstrated the main challenges and solutions for the analysis of unified behaviors, in order to explore interpersonal relations and social interaction of people in crowd scenes. Authors in [37] investigated a confined set of socio-cognitive crowd behaviors, discovering the interrelated connection between the movements of individuals to analyze different crowd behaviors. This was a layered approach segmenting visual analysis and semantic crowd behaviors.

Ref.   Publication date   Features/Model               Dataset
[9]    2017               Statistical                  UCD
[11]   2018               Nested motion                PETS2009, UMN
[12]   2019               Spatio-temporal              Collected data
[13]   2019               Intra-frame classification   WWW crowd
[19]   2019               Social model                 CUHK crowd
[20]   2019               Depth information            UCF
[23]   2019               Texture features             PNNL parking
[24]   2019               Spatio-temporal              UCF crowd
[26]   2016               Simulation features          Mall dataset
[27]   2019               Entropy features             Crowd dataset
[29]   2019               Probabilistic model          UMN crowd
[6]    2010               Dynamic texture              UCSD
[33]   2020               Aggregation of ensembles     Collected data
[35]   2019               Histogram features           UCSD, PETS
[7]    2018               Visual features              WWW crowd
[8]    2019               Optical flow features        CUHK crowd
[38]   2016               Holistic features            PNNL parking
[39]   2011               Low-level features           Collected data
[40]   2011               Adaptive features            UMN
[4]    2019               Spatio-temporal              CUHK crowd

A method based on artificial bacteria colony was investigated in [7], where the optical flow of frames was exploited to get the foreground segments with entity motions as layers. Surveillance problems using image processing and machine learning techniques were studied in [8], exploring algorithms of identifying abnormal crowd behaviors, proposing a unified method for anomaly detection, considering both crowd motion and texture-based analysis. A deep learning method was proposed in [41], calculating crowd density from individual images representing different crowd densities, exploiting both deep and shallow fully convolutional networks to estimate the density map of a crowd image. This method is good in encoding both high level semantic information and low level features. Authors in [42] analyzed stationary crowd by computing the duration of a static foreground pixel. For this purpose, they used dynamic constraints, spatial and temporal features, and mixed partials. Authors in [38] used holistic features for crowd anomaly detection, fusing together different features including crowd collectiveness, conflict, density and mean motion speed.
Authors in [43] mapped a given crowd scene to its density while coping with occlusion of people in dense crowds, foreground and background similarities, and large changes in camera viewpoint. The methods in [44][45] produce a robust map of crowd density and people count by using both global and local contextual features. Recently, some advanced deep learning methods [46,47] were presented, which could improve the performance of crowd analysis techniques. The deep Elman neural network architecture in [46], used to model complex crowd scenes, could better extract the complex structure of a crowd and emulate such dynamic systems.

III. PROPOSED METHOD
Panic detection and non-pedestrian entity localization are challenging tasks due to their associated complexities. For this purpose, discriminative features are explored [1]. There is a lack of research that efficiently assesses the underlying structure of crowd scenes and investigates the most discriminant pattern representation of abnormal situations. The purpose of this study is to exploit the connection between motion feature extraction and feature discriminability, proposing an integrated model characterized by the discriminative power of combining different features. This framework provides insight into optimal feature extraction for anomaly detection. This study avoids traditional features that use a single-type representation and features unable to encode motion and crowd structure information. Moreover, this study exploits local neighborhood features. A refined feature [2] incorporates more information about the intrinsic structure and is effective for crowd scene analysis. The method's flow diagram is shown in Figure 1. The effectiveness of the selected features is not limited to abnormal situation detection; they could also be widely used in many other applications in computer vision and machine learning. The proposed method computes repetitions of different orientations in localized areas of a crowd video. This framework draws some inspiration from edge orientation histograms [48], scale-invariant feature transform descriptors [49], and shape contexts. A significant difference lies in the unification of completely different features, as the model is formulated on a dense grid of all pixels in each frame. The intuition behind the feature unification is that localized crowd pattern appearance and layout within a video frame can be described by the distribution of intensity and orientation information. Previous studies [50][51][52] extracted different types of orientation features.
However, this study investigated the important discriminability of feature unification for generalized purposes, exploring a model to characterize the discriminative power of various types of orientations, so that more discriminative features can be fused together. The proposed method is effective and compact for abnormal situation detection. Hence, it demonstrates complex and unique motion patterns, and it could be considered as a mid-level hierarchy to integrate the space between low and high-level information in order to capture complex crowd events. Features represent motion information in the neighborhood of the region under observation. The most significant ability of this method is the unification of features in the local neighborhood. Crowd scene layout (i.e. appearance structure of the scene) captures reliable information in a reasonable neighborhood, discovering important hints to recognize various events of interest in crowds.
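The idea of counting how often each orientation occurs inside a localized area can be illustrated with a small sketch. It is not the authors' implementation: the use of central-difference gradients, the choice of eight bins, and the names `orientation_histogram`, `cell`, and `n_bins` are all assumptions made for illustration, in the spirit of the edge orientation histograms cited above.

```python
import math


def orientation_histogram(cell, n_bins=8):
    """Count quantized gradient orientations inside one local cell.

    `cell` is a list of rows of pixel intensities (hypothetical input format).
    Border pixels are skipped because central differences need both neighbors.
    """
    hist = [0] * n_bins
    rows, cols = len(cell), len(cell[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]  # horizontal central difference
            gy = cell[r + 1][c] - cell[r - 1][c]  # vertical central difference
            if gx == 0 and gy == 0:
                continue  # flat pixel: there is no orientation to count
            angle = math.atan2(gy, gx) % (2 * math.pi)  # wrapped to [0, 2*pi)
            bin_index = int(angle / (2 * math.pi) * n_bins) % n_bins
            hist[bin_index] += 1
    return hist
```

Applying such a function to every cell of a dense grid and concatenating the histograms would give one orientation descriptor per frame; how the original method weights or normalizes these counts is not stated in the text.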
Conventional techniques extract gradient locations and orientation information for feature modeling. The proposed method differs from these in that it considers the difference between each pixel and the central one, within a local window of its eight neighbors, to produce different orientation features, whereas conventional techniques are confined to the horizontal and vertical orientations only. Conventional modeling therefore cannot be applied effectively, because it makes no distinction in the significance and influence of different orientations. The proposed refined features improve the local scene description ability, enhancing the intrinsic structure of the crowd scenes. In this method, each frame is extracted from the crowd video. It is worth noting that a class label is considered for each event in the scene. Abnormal situations in each crowd video consist of multiple abnormal events. Inspired by [1], the abnormal situation is extracted through the Discriminant Binary Pattern (DBP). DBP can model the scene changes and implicitly formulate the multiple dominant features through a function that extracts direction information, given in (1). The potential feature will maximize this response, and it can be considered as the prominent direction feature of the abnormal situation. Equation (1) depends on the standard deviations of the data surrounding the positions x and y. The convolution between this function and a video frame I, denoted by the operator *, yields the convolved results for each pixel of the video frame. In order to identify individual pixels related to abnormal situations in the crowded scene, the orientation of the pixel under observation is then modeled. The neighborhood feature is defined over the central pixel and its neighboring pixels. The intensities of non-unit neighboring pixels are obtained by arithmetic operations, using their adjacent pixel values.
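The eight-neighbor differencing described above can be sketched as follows. This is a simplified illustration rather than the authors' code: it uses the plain 3×3 square neighborhood (so no interpolation of non-unit neighbors is needed), and the names `NEIGHBOR_OFFSETS`, `neighbor_differences`, and `frame_difference_vector` are hypothetical.

```python
# Offsets of the eight neighbors in a 3x3 window, clockwise from the top-left.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                    (1, 1), (1, 0), (1, -1), (0, -1)]


def neighbor_differences(frame, r, c):
    """Return the eight differences (neighbor minus center) at pixel (r, c)."""
    center = frame[r][c]
    return [frame[r + dr][c + dc] - center for dr, dc in NEIGHBOR_OFFSETS]


def frame_difference_vector(frame):
    """Concatenate the neighbor differences of all interior pixels of a frame.

    `frame` is a list of rows of intensities; border pixels are skipped
    because they do not have a full set of eight neighbors.
    """
    vector = []
    for r in range(1, len(frame) - 1):
        for c in range(1, len(frame[0]) - 1):
            vector.extend(neighbor_differences(frame, r, c))
    return vector
```

Keeping the signed differences, instead of thresholding them into bits as a classic binary pattern would, preserves the magnitude of each directional change; whether the original method thresholds them is not recoverable from the text.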
Next, the intensity difference between each neighboring pixel and the central pixel is computed. The differences are concatenated into a difference vector that represents the scene. For each frame, this information is extracted densely for all 9×9 neighborhoods. The features in (3) and (4) are combined into a unified framework in (5). This equation considers all the features of the scene and gathers the statistics of local regions to embed more local information and strengthen the robustness of representing complicated crowd motion patterns for abnormal situation detection. The classification of a crowded scene into normal and abnormal situations exploits a multi-class support vector machine [3]. Two benchmark datasets were used, and their training parts cover panic situations and non-pedestrian entity detection. Therefore, a function that shows at most ε deviation from the actual labels of the training data of both datasets is required. At the same time, this function is intended to be as flat as possible. In other words, errors are disregarded as long as they are less than ε. This matters because the error should lie within an acceptable margin during the classification of different abnormal events. To start the classification procedure with the multi-class support vector machine, a linear function is considered as:

$$f(x) = \langle w, x \rangle + b \qquad (6)$$

where $\langle \cdot, \cdot \rangle$ represents the dot product. Flatness in (6) corresponds to a small $w$, which is obtained by minimizing its norm. This leads to the convex optimization problem:

$$\text{minimize } \lVert w \rVert \quad \text{subject to } \begin{cases} y - \langle w, x \rangle - b \le \varepsilon \\ \langle w, x \rangle + b - y \le \varepsilon \end{cases} \qquad (7)$$

According to (7), convex optimization is possible if enough training data are available.
For this reason, two benchmark datasets were used to train the multi-class support vector machine, taking full advantage of the convex optimization process. Standalone features without integration are not very effective, since the physical structure of a crowd scene changes over a temporal window. In addition, a single type of feature is very sensitive to background and illumination changes, scale variations, and crowd flow direction. To handle these problems, the integration of different robust and reliable features should be explored. For instance, entities scattered across a scene with a still background constitute a very challenging environment for modeling different events. A method based on a single type of feature therefore suffers if the uncertainty in crowd flow increases over a large scale, such as when a uniform crowd flow becomes a random one. The outcome of traditional, single-type features would be unreliable in such situations. Moreover, if the crowd flows in one direction and then changes randomly due to a panic situation or a non-pedestrian entity, the same features would behave differently. Therefore, a unified model for different features is needed, one that models the random flow of the crowd and is robust to different variations. The proposed method can effectively cope with these issues.
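The role of ε in the formulation above can be made concrete with a short sketch. It only evaluates the linear function and checks the two ε-constraints for given parameters; it does not solve the optimization problem, and the names `linear_function`, `within_epsilon`, and `weight_norm` are hypothetical helpers introduced for illustration.

```python
import math


def linear_function(w, x, b):
    """Evaluate the dot product of w and x, plus the offset b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def within_epsilon(w, b, samples, epsilon):
    """Check both constraints for every (x, y) training pair.

    Returns True only if y - f(x) <= epsilon and f(x) - y <= epsilon hold
    for all pairs, i.e. no prediction deviates from its label by more
    than epsilon.
    """
    for x, y in samples:
        f = linear_function(w, x, b)
        if y - f > epsilon or f - y > epsilon:
            return False
    return True


def weight_norm(w):
    """Euclidean norm of w, the quantity minimized to keep the function flat."""
    return math.sqrt(sum(wi * wi for wi in w))
```

Among all parameter pairs for which `within_epsilon` holds, the optimization selects the one with the smallest `weight_norm`; in practice an off-the-shelf SVM solver would be used for that step.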

IV. RESULTS
The experimental analysis used the widely used benchmark datasets UCSD [53] and UMN [54]. These datasets are properly annotated benchmarks for the analysis of anomaly detection and localization in crowded scenes. In UCSD, anomalies are represented by non-pedestrian entities in the scenes. The videos of this dataset were captured with a CCTV camera fixed at an elevation, at a resolution of 238×158 at 10fps, overlooking people in pedestrian pathways. Non-person objects in pathways and anomalous pedestrian motion patterns are treated as abnormal situations. In this dataset, the abnormal entities are bikers, skaters, small carts, and people walking across a pathway or in the park. The video clips in the dataset are divided into 2 subsets, Ped1 and Ped2, each associated with a different crowd scene. Videos recorded from each scene were split into clips of about 200 frames each. Ped1 consists of 34 training and 36 testing videos. Ped2 consists of 16 training and 14 testing videos. In each video, the ground truth annotation includes a binary flag per frame, showing whether an anomalous entity is present. Moreover, there is a subset of 10 videos with manually produced pixel-level binary masks, which mark the parts or sub-parts containing abnormal events with clear boundaries. This subset was generated to evaluate anomaly localization and segmentation. Sample images from the UCSD dataset are depicted in Figure 2. In the top row, non-pedestrian entities are shown, which are not allowed in these pedestrian pathways. Table II presents the area under the ROC curve (AUC) of the tested methods, where a larger AUC score indicates better classification results. As can be noted, the proposed method achieves competitive results against the five reference methods. Moreover, experimental analysis in the form of Equal Error Rate (EER) was performed.
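For readers unfamiliar with the two metrics, the following sketch shows one common way to compute the AUC and an approximate EER from per-frame anomaly scores and the binary ground-truth flags. It is an illustration under stated assumptions, not the evaluation code used in the paper: higher scores are assumed to mean "more abnormal", the EER is approximated by the ROC point where the false positive and false negative rates are closest, and all function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs; labels are 1 for abnormal frames, 0 for normal.

    Each distinct score is used once as a decision threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def equal_error_rate(points):
    """Approximate EER: the ROC point where FPR and FNR (1 - TPR) are closest."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (fpr + (1.0 - tpr)) / 2.0
```

A larger AUC and a smaller EER both indicate better separation of normal and abnormal frames, which is how the tables and graphs in this section should be read.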
Figure 4 shows the results in graphs for the five reference methods and the proposed method. Blue and red graphs represent the results for the Ped1 and Ped2 subsets, respectively. The proposed method has a smaller EER than the five reference methods. The UMN dataset presents both normal and abnormal crowd video sequences. This dataset comprises three indoor and outdoor scenes, showing 11 different scenarios of panic events. The dataset consists of 7739 frames in total, with a resolution of 320×240 pixels. Each video starts with normal human behaviors, such as walking or standing. Figure 5 depicts sample images from the UMN dataset.

People walking and standing. Images from [54], © University of Minnesota

Qualitative results for the UMN dataset are depicted in Figure 6, where the top row shows sample frames taken from four video sequences. The bottom row shows anomalous entities annotated and highlighted in light blue. Panic is detected and highlighted when pedestrians start running in different directions. Table III shows the area under the ROC curve (AUC) of the five reference methods and the proposed method. Again, the proposed method has a larger AUC score than the reference methods.

UMN Dataset: Normal frames and panic detection on them. Images from [54], © University of Minnesota

Experimental analysis in the form of EER was also performed on the UMN dataset. Figure 7 shows the results for the five reference methods and the proposed method. Blue and red graphs represent the results for the first two and the following two video sequences of the UMN dataset, respectively. As can be noted, the proposed method has a smaller EER than the five reference methods.

www.etasr.com
Ullah & Altamimi: Panic Detection in Crowded Scenes

Many diverse approaches have been proposed to solve the problem of panic detection and anomaly identification. However, most of them are designed to work on specific scenes, where different representations of motion and appearance are analyzed with different models. In this study, the properties of abnormal situations were considered specifically as anomalies in terms of panic and, more generally, as anomalies in terms of non-pedestrian entities. Abnormal entities are rare things with unexpected appearance or motion patterns. For anomalous situation detection including panic and non-pedestrian entities, a novel technique was proposed, where dense motion trajectories were computed from the crowd input videos. Based on the computed trajectories, a set of robust features was designed considering HOG, HOF, and MBH. Motion atoms were explored for compact encoding of motion patterns in crowded scenes, representing distinguished motion patterns of crowds. Motion atoms are mid-level characteristics that bridge the gap between low-level and high-level features for capturing anomalous situations. Since an anomalous situation is described from the view of a feature set, the proposed method can be utilized in different surveillance scenes. Moreover, hybrid features and motion atoms do not limit the type of features or the type of scenes, which helps in extending the proposed technique to broader research fields. The experimental results demonstrated that the proposed approach is effective for real crowd videos containing various types of normal and abnormal activities, in terms of panic situations and the existence of non-pedestrian entities.
The proposed method is independent of crowd flow variability and density over temporal windows. Furthermore, the method is not sensitive to crowd concentration in different locations of different scenes. Experimental evaluation was performed considering two benchmark datasets and the proposed method outperformed five known methods both quantitatively and qualitatively.