Abstract: When people are stressed, small altercations can escalate into full-blown aggression. The best way to prevent negative emotions from escalating is to detect them as early as possible and de-escalate the tension immediately. With recent advances in multimedia and artificial intelligence, a surveillance system can be trained to detect the traits of aggression and help stop it while it is happening. However, even though such a system can perform reasonably in tests, the limited range of situations it can be trained on is a major hurdle to real-life deployment. The work in [1] addresses this by training a system on different contexts: it devises various scenarios and captures them in audio-video recordings. Because recording real aggressive situations raises ethical and privacy issues, the system could only be trained on data sets obtained by enacting realistic scenes in which aggression is likely. The system is trained and cross-validated on these contexts to make it as diverse and capable as possible. Experiments were carried out using audio, video and combined audio-video features to see which would produce the best results for a real-life system. It was found that combining multiple contexts for training did not provide the expected improvement, that training on a single context performed better, and that audio-video features performed similarly to audio features alone.
I. INTRODUCTION
The focus of this paper is on recognizing negative human-human interactions from the perspective of video surveillance. It is very important to identify such situations and to implement means of de-escalating them before they lead to violence. Often the path of such an escalation starts with negative emotions and stressful behavior, accompanied by different levels of verbal and non-verbal behavioral characteristics.
Detecting negative interactions has a wide range of applications in settings where public services involve a lot of human-human interaction. Examples include service desks, call centres, hospitals and clinics, group meetings, discussion forums and general public surveillance. It can also be applied in virtual reality anger therapy systems, to calm people down in the event of provoking or negative situations and to give them timely feedback on how to control themselves.
Real-life situations are usually very complex and diverse, and training material can usually capture only a subset of them. Hence it is only possible to train the system for a limited set of possible situations, with the effect that the model will not generalize well. Studies have been done on cross-corpus video-based action and emotion detection, and it has been observed that intra-corpus recognition accuracies are superior to cross-corpus accuracies.
The interpretation of behavior is not fixed; it varies with the context. It is therefore important to study context-based interactions further, and work on context-sensitive systems is progressing. It should be noted that the more widely the context can vary, the more this limits generalization. The behaviors that humans exhibit in different contexts are diverse and hard to predict. An argument can occur anywhere two human beings interact, since conflicts of interest are a basic phenomenon common to everyone. Hence negative interactions may occur anywhere in a public place, for example at a service desk or at a public vending machine. Expected behaviors, likely sources of conflict, length of the interaction, expected movements and the number of people in the scene are some of the traits that vary across contexts. For audio analysis, traits such as language, noise levels and room acoustics will vary. For visual analysis, traits such as lighting conditions, angle of view and occlusion of the view can vary. The major challenge, however, is the complexity and diversity of human behavior, which makes it highly difficult to understand the context. This gets harder when there is high emotional content, as it leads to more data sparsity.
This work focuses on answering the following questions: (1) How does a trained system perform when it is exposed to a new context? (2) Which modality is the most robust when the context changes? (3) Is it possible to obtain better results when data from multiple contexts is merged and, if not, what is the performance loss?
The aim of this paper is to arrive at quantified performance measures for recognizing negative interactions between humans in different contexts. A service desk, a vending machine, a locker area and a cafeteria are selected as contexts in this work. Audio-visual recordings are made in each context. The scenarios are designed so that they lead to a negative interaction between the participants in the conversation. The participants are assigned roles and are made to interact with each other for a short period of time. The situations are not scripted: the participants speak impromptu, which brings the recordings close to real-world situations. The conversations are set up so that they escalate naturally, depending on the reactions of the participants.
The recorded audio-visual data is used to determine whether an escalation has occurred. Non-verbal behavior in the data is used to identify the occurrence of a negative interaction. During such an escalation, people are expected to show agitated expressions, strong and rapidly changing body language, tense actions and so on, which can be identified from the video. For audio, too, we focus on non-verbal rather than verbal characteristics, namely changes in pitch, voice quality and intensity. The audio and video features are used in combination to analyze these modulations. The system is therefore expected to generalize better than one that relies on action recognition alone to identify negative interactions. Such a multimodal approach adds depth to the understanding of the nature of the situation.
The performance of the system was analyzed for the selected audio-visual features, and also for audio and video features separately, to find out which modality works best for identifying negative situations. The features are analyzed in intra-context and cross-context schemes. Experiments are also carried out to find out whether merging different contexts yields a system that generalizes better. To this end, the data from three training contexts is merged, and the system trained on the merged data is then tested with cross-context audio-visual features.
The remainder of the paper is organized as follows. Section II describes how the data was collected, including the data content, recording procedure and annotations. Section III covers data processing and the audio and video features. Section IV describes the experimental setup and the classification approach. Finally, the results are discussed in Section V and conclusions are drawn in Section VI.
II. DATA COLLECTION
To recognize the traits of aggression and de-escalate it, we need data sets that help us develop a system with a high recognition rate, and that are as realistic as possible so that the system can generalize from the learnt behavior and predict well in new situations. One major problem in collecting such data is the ethical dilemma of using recordings of actual escalations versus recordings of acted situations. Acted data can be obtained by creating realistic scenarios in which actors pretend to be in a certain situation and express acted emotions. The problem is that real human emotions are varied and sporadic, so actors may exaggerate or understate a situation compared with real aggression, since it is all part of a script and the actors know the outcome. In real life, an act of aggression can sometimes start out of nowhere. For example, a person greets a very busy salesperson in a supermarket and requests specific information; the salesperson, because of the number of other people requesting similar services and his own stress, ignores the request; a short-tempered individual can then easily become aggressive simply because they felt ignored. Real emotions of this kind are therefore rare, sometimes unpredictable, and may be determined by many factors, such as the physiological state of an individual and their emotional and stress levels.
The best data would be collected by recording real negative emotions, but this would have to be done in an uncontrolled environment and would push the boundaries of the law; it would raise ethical issues, making it very challenging and controversial.
We can try to balance the advantages and disadvantages of acted and real-life recordings by taking a middle route, as was done for the IEMOCAP dataset [1], in order to create a system that is more suitable for general application.
A. Content and recording Protocol
To obtain relevant data, the paper creates four scenarios in which aggression is most likely to occur. Aggression is very likely at a service desk when tempers are running high, at a vending machine when an individual is not satisfied, in a cafeteria when customers’ expectations are not met, or at a locker when someone tries to gain unauthorized access. The paper therefore uses actors to simulate these scenarios and create situations that generate negative interactions.
The setup consisted of a group of nine professional actors from multicultural backgrounds, four male and five female [1]. They were given specific roles to enact in the four scenarios mentioned above [1]. To keep the interaction as natural as possible, the actors received no script, only a brief description of the cause of the conflict and of their role in the specific scenario. They were free to react as they deemed suitable, i.e. as they would normally react if faced with that situation, so that the recorded data would be as close to real experience as possible. Most interactions were between two people, but some scenarios involved up to five people, some of whom spoke Dutch and some English; the language spoken was based on the preferences of the actors. Each interaction was recorded from multiple angles (mostly two, due to resource constraints) using two high-definition cameras and used for the proposed study. The actors’ voices were recorded with microphones clipped to their clothing to obtain the best possible audio input [1].
i. Service-desk (SD)
The first scenario enacted was the service desk. The actors were divided into employees and customers requiring the employees’ services, and had to play their roles using just a brief description of the role and scenario. Four situations were enacted. In the first, a visitor has a meeting and requires the assistance of an employee, but the employee is, for some reason, very slow in providing the service. In the second, a visitor tries to find a location on a map; when he cannot locate it, he asks an employee to take him there, but the employee refuses. In the third, an employee is leaving for lunch when a visitor approaches him for assistance, and the employee refuses to help because he is on his break. In the fourth, people want to reach the service desk for assistance, but an employee who is on the phone is blocking their access. Each of these situations was enacted twice to obtain a more diverse data set.
ii. Lockers (LK)
In the locker scenario, someone tries to gain unauthorized access to a locker, entering the code multiple times while pretending to have forgotten it or that it does not work. An employee comes over, notices something suspicious and confronts the individual; as expected, tempers flare and the situation escalates. This scenario was recorded twice for a more diverse data set.
iii. Vending machine (VM)
In this scenario, an actor plays someone trying to buy an item from a vending machine: he makes the payment but, for some reason, the product does not drop. The customer is visibly angry and irritated when an employee passes by and enquires what the issue is; when the customer does not get prompt help, tempers flare. This scenario was recorded four times with various actors.
iv. Cafeteria (CF)
Finally, in this scenario, customers at a cafeteria run into payment issues. In one situation, a customer tries to pay in cash, but the cafeteria employee refuses to accept cash and demands payment by card, which leads to a confrontation. In a second situation, a slow customer tries to pay with his card, but the process takes longer than it should while a queue builds up behind him, and the other customers become frustrated by the unnecessary wait. These situations were recorded four times in total.
The selected scenarios may not match real life exactly, since conflict situations vary with several factors, such as the individuals involved: some people are more patient than others, and some are more willing to compromise to avoid an escalation. However, recording real-life interactions that result in aggression is very challenging and would raise serious ethical and privacy issues. The selected scenarios are therefore intended to approximate real outcomes as closely as possible while avoiding these ethical and privacy concerns.
The contexts used in the experiment were the situations described above, at the service desk, the lockers, the vending machine and the cafeteria, in which tempers flared and an aggressive situation arose. In the experiment carried out in [1], each scene was rated for its overall stress level on a scale of 1 to 5. A stress level of 1 meant that the scene was not stressful, a level of 2 or 3 meant that the stress in the scene was moderate, and a level of 4 or 5 meant a very stressful situation that led to an act of aggression. The stress level was determined from the recordings, which included both the audio and the video data of the enacted scenes, and the inter-annotator agreement measured by Krippendorff’s alpha was 0.71 [1].
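The grouping of the five-point stress scale described above can be written down as a small lookup. The sketch below is only an illustration: the class names "low", "moderate" and "high" and the function name are hypothetical, and whether the experiments use exactly these three classes is an assumption.

```python
def stress_class(level: int) -> str:
    """Map a 1-5 stress annotation to a coarse class.

    The class names are hypothetical; the grouping (1 / 2-3 / 4-5)
    follows the description of the annotation scale in the text.
    """
    if level not in (1, 2, 3, 4, 5):
        raise ValueError(f"stress level must be between 1 and 5, got {level}")
    if level == 1:
        return "low"        # scene is not stressful
    if level <= 3:
        return "moderate"   # moderate stress in the scene
    return "high"           # very stressful, leads to aggression


if __name__ == "__main__":
    for level in range(1, 6):
        print(level, stress_class(level))
```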
III. DATA PROCESSING
The aim is to work with unbalanced data so that it stays close to the real-world occurrence of events. The frequency of negative interactions is relatively low in most cases, so such data is kept sparse. In the experiment it is assumed that we cannot know in advance when a negative interaction is likely to occur. We therefore analyze segments of equal length, namely 2 seconds, for the presence of utterances. If a single segment spans multiple utterances, it takes the label of the utterance that covers the largest part of those 2 seconds. There are 971 samples from the service desk recordings, 267 from the lockers, 340 from the cafeteria and 472 from the vending machine, giving 2050 samples in total.
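The segment labelling rule can be sketched as follows. This is a minimal illustration under stated assumptions: utterances are given as (start, end, label) tuples in seconds, the function name `label_segments` is hypothetical, and segments without any utterance receive `None`. Only the 2-second segment length and the maximum-overlap rule come from the text.

```python
import math

SEGMENT_LENGTH = 2.0  # seconds, as stated in the text


def label_segments(utterances, total_duration, segment_length=SEGMENT_LENGTH):
    """Assign one label to every fixed-length segment of a recording.

    utterances: list of (start, end, label) tuples, times in seconds.
    A segment takes the label of the utterance that overlaps it the longest;
    segments that contain no utterance are labelled None (an assumption).
    """
    n_segments = math.ceil(total_duration / segment_length)
    labels = []
    for index in range(n_segments):
        seg_start = index * segment_length
        seg_end = seg_start + segment_length
        best_label, best_overlap = None, 0.0
        for utt_start, utt_end, utt_label in utterances:
            # Length of the intersection between segment and utterance.
            overlap = min(seg_end, utt_end) - max(seg_start, utt_start)
            if overlap > best_overlap:
                best_label, best_overlap = utt_label, overlap
        labels.append(best_label)
    return labels


if __name__ == "__main__":
    # Two annotated utterances with stress levels 1 and 3.
    example = [(0.0, 1.5, 1), (1.5, 5.0, 3)]
    print(label_segments(example, total_duration=8.0))
```

In this example the first segment spans both utterances and takes the label of the first one, which covers 1.5 of its 2 seconds.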
B. Acoustic features extraction
Acoustic features of the samples are used to identify what kind of interaction is taking place in the conversation, and to capture the connection between negative emotions, stress and aggression. Suprasegmental traits of emotion are explored using approaches that are popular in speech recognition: statistical functionals are applied to the individual frame-level features, and regression or classification is then applied to the resulting feature sets. The descriptive statistical functionals used to explore suprasegmental traits are typically low-order moments or extrema of the frame-level features. Another popular way to generate a feature set is a brute-force approach, which can generate up to 50000 features. Feature selection is then needed to reduce the high dimensionality, but a limitation is that the selected features depend on the chosen corpus.
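As an illustration of applying statistical functionals to a frame-level contour, the sketch below summarizes one contour (for example pitch per frame) with low-order moments, extrema and a least-squares slope. The function name and the particular set of functionals are assumptions chosen for the example, not the exact set used in this work.

```python
import statistics


def functionals(contour):
    """Summarize a frame-level contour with simple statistical functionals.

    contour: list of per-frame values (e.g. pitch in Hz), at least two frames.
    The slope is the least-squares slope of the value against the frame index.
    """
    if len(contour) < 2:
        raise ValueError("need at least two frames")
    n = len(contour)
    mean = statistics.fmean(contour)
    index_mean = (n - 1) / 2
    numerator = sum((i - index_mean) * (x - mean) for i, x in enumerate(contour))
    denominator = sum((i - index_mean) ** 2 for i in range(n))
    return {
        "mean": mean,
        "std": statistics.pstdev(contour),   # population standard deviation
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
        "slope": numerator / denominator,    # change per frame
    }


if __name__ == "__main__":
    rising_pitch = [100.0, 110.0, 120.0, 130.0]
    print(functionals(rising_pitch))
```

Each segment then contributes one fixed-length vector of such values per contour, regardless of how many frames it contains.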
Because this work aims at a small and generic feature set, a feature set that showed stable performance in a similar cross-corpus study of negative interactions was chosen. The features were extracted with the software tool Praat. The features are: speech duration (without pauses); the mean, deviation, range and mean slope (with and without octave jumps) of pitch; the mean, maximum, slope, standard deviation and range of intensity; the harmonics-to-noise ratio (HNR); the centre of gravity and skewness of the spectrum; the average slope of the spectrum; the means and bandwidths of the first four formants (F1-F4); shimmer; jitter; and high-frequency energy (HF500, HF1000).
C. Video features extraction
This work aims at low-level video features that are good enough to distinguish normal situations from stressful or negative ones; a more detailed, high-level video analysis is not performed. Facial expression recognition is currently not implemented, but it is a clear possibility for further improving performance. Clearly visible changes, such as vigorous actions or spatial changes like how suddenly people move, are considered key factors that help identify a negative situation or the presence of aggression.
Motion can be considered the most relevant cue for identifying a stressful situation. In this work, video segments, which are scenes involving motion, are represented using space-time interest points (STIP), and the resulting features are used to identify actions. STIP features have been used successfully to estimate degrees of stress and aggression. Space-time interest points are computed at multiple spatio-temporal scales. Two types of descriptors are used, namely histograms of oriented gradients (HOG), which capture appearance, and histograms of optical flow (HOF), which capture movement. HOGs and HOFs are computed for the patch corresponding to each space-time interest point. A bag-of-words approach is applied to these descriptors. Specialized codebooks were computed in a supervised way using a random forest with 30 trees and 32 nodes. K-means can also be applied here to compute the feature vectors. Finally, correlation-based feature subset selection was applied to reduce the resulting feature set.
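The bag-of-words step can be illustrated with the simpler nearest-centroid variant, i.e. the kind of codebook that K-means would produce, rather than the supervised random forest codebooks used in this work. In the sketch below the codebook is assumed to be given, and the function names and toy two-dimensional descriptors are hypothetical.

```python
def nearest_codeword(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    best_index, best_distance = 0, float("inf")
    for index, centre in enumerate(codebook):
        # Squared Euclidean distance is enough for finding the nearest centre.
        distance = sum((d - c) ** 2 for d, c in zip(descriptor, centre))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def bag_of_words(descriptors, codebook):
    """Turn a variable number of descriptors into one normalized histogram."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)  # segment without interest points
    return [count / total for count in counts]


if __name__ == "__main__":
    # Toy 2-D descriptors; real HOG/HOF descriptors have many more dimensions.
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    descriptors = [[0.1, 0.0], [0.9, 1.2], [1.0, 0.8], [0.2, 0.1]]
    print(bag_of_words(descriptors, codebook))
```

The histogram has one bin per codeword, so every video segment yields a feature vector of the same length no matter how many interest points it contains.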
IV. CLASSIFICATION METHODOLOGY
This section explains the experimental setup: the classification approach, the audio and video features used, and the statistical oversampling method adopted.
A. Experiment setup
This work aims at analyzing the performance of the system in cross-context scenarios and at finding ways to improve it. We investigate whether training on a combination of multiple contexts helps to detect negative interactions, and which modality is the most robust. First, the test context is kept fixed and the system is trained on that context and on each of the other contexts. The training contexts are then merged and the system is trained again. For comparison, 5-fold cross-validation is used to measure intra-corpus performance. These experiments are performed on audio features, on video features and on feature-level fusion, in which the audio and visual features are combined for classification.
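The resulting set of experimental conditions can be enumerated as in the sketch below. It is an illustration under assumptions: the tuple layout and function names are hypothetical, the merged condition is taken to pool the three contexts other than the test context, and feature-level fusion is shown as plain concatenation of the two feature vectors.

```python
CONTEXTS = ["SD", "VM", "LK", "CF"]  # service desk, vending machine, lockers, cafeteria
MODALITIES = ["A", "V", "AV"]        # audio, video, feature-level fusion


def experiment_conditions(contexts=CONTEXTS, modalities=MODALITIES):
    """List (scheme, training contexts, test context, modality) tuples."""
    conditions = []
    for test_context in contexts:
        others = [c for c in contexts if c != test_context]
        for modality in modalities:
            # Intra-context baseline: 5-fold cross-validation on the test context.
            conditions.append(("intra", (test_context,), test_context, modality))
            # Cross-context: train on one other context, test on the fixed one.
            for train_context in others:
                conditions.append(("cross", (train_context,), test_context, modality))
            # Merged: train on the other contexts pooled together (assumption).
            conditions.append(("merged", tuple(others), test_context, modality))
    return conditions


def fuse(audio_features, video_features):
    """Feature-level fusion: concatenate the audio and video feature vectors."""
    return list(audio_features) + list(video_features)


if __name__ == "__main__":
    for condition in experiment_conditions():
        print(condition)
```

With four contexts and three modalities this gives five training configurations per test context and modality.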
A random forest classifier with 100 trees, as implemented in Weka, is used for classification. For each feature type, the audio and video features are normalized per corpus to zero mean and unit standard deviation, to account for inter-corpus variation. The unweighted average accuracy is used as the evaluation measure in all cases.
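The per-corpus normalization and the evaluation measure can be sketched as follows. The sketch assumes that the unweighted average accuracy is the mean of the per-class recalls, which is the usual definition for unbalanced data, and that the population standard deviation is used for normalization; the function names are hypothetical.

```python
import statistics


def z_normalize(column):
    """Normalize one feature column of one corpus to zero mean and unit std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumption)
    if std == 0:
        return [0.0 for _ in column]  # constant feature carries no information
    return [(value - mean) / std for value in column]


def unweighted_average_accuracy(true_labels, predicted_labels):
    """Mean of the per-class recalls, so every class counts equally."""
    correct, total = {}, {}
    for truth, prediction in zip(true_labels, predicted_labels):
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == prediction else 0)
    return sum(correct[label] / total[label] for label in total) / len(total)


if __name__ == "__main__":
    truth = [0, 0, 0, 0, 1, 1]
    predicted = [0, 0, 0, 0, 1, 0]
    # Plain accuracy would be 5/6; the minority class lowers the unweighted score.
    print(unweighted_average_accuracy(truth, predicted))
    print(z_normalize([1.0, 2.0, 3.0]))
```

Because each class contributes equally, a classifier that always predicts the majority class scores poorly on this measure even when its plain accuracy is high.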
B. Statistical oversampling
Unbalanced data makes it challenging to obtain good results from classifiers. One approach is to run the classification once, identify the under-represented classes, and re-sample those classes to achieve a better balance. In this experiment the synthetic minority oversampling technique (SMOTE) is used, which generates new artificial samples of the minority classes from the existing minority samples. SMOTE is applied only to the training set. The percentage of new data is a parameter whose value has to be set. Oversampling with a precomputed value of this parameter is applied to the initial label distribution of the training data to even out the distributions of the two least represented classes.
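For illustration, the core of SMOTE in its standard formulation is sketched below: each synthetic sample is placed at a random position on the line between a minority sample and one of its nearest minority neighbours. This is a simplified sketch, not the implementation used in the experiments; the function name, the number of neighbours `k` and the fixed random seed are assumptions, and `n_new` plays the role of the amount of new data to generate.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority-class feature vectors.

    minority: list of feature vectors (lists of floats) of one minority class.
    Each new sample interpolates between a randomly chosen minority sample
    and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself.
        neighbours = sorted(
            (sample for sample in minority if sample is not base),
            key=lambda sample: sum((a - b) ** 2 for a, b in zip(base, sample)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for sample in smote(minority_class, n_new=3, k=2):
        print(sample)
```

As in the text, such oversampling would be applied to the training set only, so that the test data keeps its original, unbalanced distribution.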
V. RESULTS AND DISCUSSION
The system was trained and cross-validated using the data collected from all four contexts discussed above. Although we would expect the system to perform better when trained on all four contexts, the results in [1] show that it performed best when trained on a single context, especially the vending machine context, rather than on a combination of contexts.
It was also observed that the system detected aggression better when relying on audio features than when using video features. When audio and video features were combined, the result was not very different from using the audio features alone, in both the training and the cross-validation experiments.
Figure x: System trained and tested on all four contexts: Service Desk (SD), Vending Machine (VM), Locker (LK) and Cafeteria (CF). The green line corresponds to the audio features (A), the blue line to the video features (V) and the yellow line to the combined audio-visual features (AV).
The most likely reason that the system performed better with audio than with video is the complexity of visible human behavior, which has a much wider range than audible behavior.
This can be seen more clearly by looking at the scenarios used to generate the visual data. At the service desk there is a lot of complaining and explaining, which most likely involves many hand gestures, worried and anxious looks, and other gestures typical of someone in a hurry. In the cafeteria the video data contains a lot of motion, since many people are walking and moving around. At the vending machine the visual data includes actions such as an individual hitting the machine to force it to drop his purchase, while at the lockers it shows an individual acting sneakily and trying to avoid appearing suspicious.
This system performed satisfactorily when tested under the provided circumstances and has the potential to be deployed in a real situation.
However, some challenges may still need to be handled in a real deployment of the system. In a real cafeteria with many people, for example, the audio and video processing would be challenged, as it would need to capture the audio and the actions of several people and zero in on the one pointing towards aggression by filtering out the other information. The vending machine scenario, which may involve vandalism, would contain sharp and quick movements, and the audio characteristics would vary with the kind of environment; the system would need the capacity and processing power to handle this. The recording would need to be done with a very powerful device, which would probably also need to perform intelligent close-up analysis of the individual to get the best input and output, and this is not always possible.