CHAPTER – 1 INTRODUCTION



Audio signals, which include speech, music and environmental sounds, are important types of media. The problem of classifying audio signals into these different types is thus becoming increasingly significant. A human listener can easily distinguish between different audio types by listening to just a short segment of an audio signal. However, solving this problem with computers has proven to be very difficult. Nevertheless, many systems with modest accuracy have been implemented.
Audio segmentation and classification have applications in a wide range of areas. For instance, content-based audio classification and retrieval is broadly used in the entertainment industry, audio archive management, commercial music services, surveillance, etc. There are many digital audio databases on the World Wide Web nowadays, where audio segmentation and classification are needed for audio searching and indexing. Recently, there has been a great deal of interest in monitoring broadcast news programs; in this case, classification of speech data in terms of speaker could help in efficient navigation through broadcast news archives.
Like many other pattern classification tasks, audio classification is made up of two main sections: a signal processing section and a classification section. The signal processing part deals with the extraction of features from the audio signal, using the various methods of time-frequency analysis, in many cases originally developed for speech processing. The classification part deals with classifying data based on the statistical information extracted from the signals.
With the growing market for portable digital audio players, the number of digital music files on personal computers has increased. It can be difficult to choose and classify songs when one wants to listen to a specific mood of music, such as happy, sad or angry music. Not only must consumers classify their music, but online distributors must classify thousands of songs in their databases for their consumers to browse.
In music psychology and music education, emotion has been recognized as the component most strongly associated with music expressivity. Music information behavior studies have also identified mood as an important criterion used by people in music seeking, indexing and storage. However, music mood is difficult to classify because it is highly subjective, although there appears to be a very strong connection between the music (the audio) and the mood of the listener.
There are many entities that explicitly shape the music we hear while listening. Rhythm, tempo, instruments and musical scales are some such entities. One very important entity is the lyrics, which directly affect our minds. Identifying audible words from lyrics and classifying music accordingly is a difficult problem, as it involves complex issues of digital signal processing. We have instead explored a simpler approach of understanding music on the basis of audio patterns. This resembles the classical problem of pattern recognition, and we have made an effort to extract these patterns from the audio as audio features.

With an artificial neural network, a computer can, in a limited sense, model human responses: in one well-known demonstration, the gyrations of a dot on screen, although directed by a computer, were in effect carrying out the orders of the test subject. The artificial neural network technique is far more than a laboratory stunt, since computers can solve extraordinarily complex problems with incredible speed. Psychological tests, in turn, are standardized measures of behavior.
Most psychological tests fall under two broad categories: mental ability tests and personality scales. Mental ability tests include intelligence tests, which are designed to measure general mental ability, and aptitude tests, which measure more specific mental abilities. Personality measures are usually called scales, rather than tests, as there are no right or wrong answers. Personality tests measure a variety of motives, interests, values, and attitudes. The experimental results establish the effectiveness of the ANN-MFCC based system.

The process of identifying the correct genre of audio is a natural one for our mind. The human brain processes and classifies audio naturally, based on a long history of learning and experience. Here we emulate the same methodology (inspired by the biological nervous system) by training the system and using the knowledge gained to classify audio with artificial neural network techniques.
In this dissertation, we have identified a set of computable audio features and have developed methods to understand them. These features are generic in nature and are extracted from audio streams; therefore, they do not require any object detection, tracking or classification. They include time domain, pitch based, frequency domain, sub band energy and MFCC features.
We have developed an audio classifier that can parse a given audio clip into predefined genre categories using the extracted audio features of the clip. The classifier is designed using a multi-layer feedforward neural network with the back-propagation learning algorithm and has been tested for characterization of audio into sports, news and music.
Music genre is further classified into three categories:
• Happy Songs,
• Angry Songs and
• Sad Songs.
The overall schematic diagram for audio classifier system is shown in figure 5.1.

It is evident from the literature described above that considerable work has been done in the field of audio characterization. The present work describes low level audio features based audio characterization.
1. Chapter 1: In this chapter, we provide a brief introduction to our research problem, proposed methodology and outline the contents of this thesis.
2. Chapter 2: This chapter explores the related work already done in this field by various researchers.
3. Chapter 3: In this chapter, we first extract the audio waveform from the audio clip and then extract a set of low-level audio features for characterizing genres of movie clips.
The extracted features come under the following categories:
• Time Domain Features
• Pitch Based Features
• Frequency Domain Features
• Sub Band Energy Features

4. Chapter 4: In this chapter, we describe artificial neural networks and their various types, the learning methodology of neural networks, and the neural network classifier for audio characterization.
5. Chapter 5: This chapter presents the outcomes of the various extracted audio features. It also describes the implementation of the audio classifier along with experimental results.
6. Chapter 6: In this chapter, we draw our conclusions and indicate possible future works.


2.1 Review of Work Already Done

The auditory system of humans and animals can efficiently extract the behaviorally relevant information embedded in natural acoustic environments. Evolutionary adaptation of the neural computations and representations has probably facilitated the detection of such signals at low SNR over natural, coherently fluctuating background noises [23, 24]. It has been argued [47] that the statistical analysis of natural sounds – vocalizations, in particular – could reveal the neural basis of acoustical perception. Insights into auditory processing could then be exploited in engineering applications for efficient sound identification, e.g. discriminating speech from music, animal vocalizations and environmental noises.
Similar to a Fourier analyzer, our auditory system maps the one-dimensional sound waveform to a two-dimensional time-frequency representation through the cochlea, the hearing organ of the inner ear [22]. In mammals, the cochlea is a fluid-filled coiled tube, about one cubic centimetre in volume, which resembles the shell of a sea-snail. The eardrum transmits the incoming sounds to the cochlear fluids as pressure oscillations. These oscillations, in turn, deflect a membrane of graded mechanical properties which runs along the cochlear spiral: the basilar membrane (BM).

Stiffness grading of the basilar membrane serves to analyze incoming sounds into their component frequencies from 20 to 20,000 Hz. A different resonant-like frequency response characterizes each place along the membrane: peak frequencies and bandwidths are highest near the spiral’s broad mouth (where the BM is stiffest), and lowest further up the tube near the spiral’s apex [22, 61].
The tube in addition carries neurosensory hair cells that fire in response to vibrations in the cochlear fluid; nerve impulses are then sent to the brain as electrical signals. Manoussaki et al. [13, 14] have found that the energy of the sound waves traveling in the cochlea is not evenly distributed along the tube; the spiral shape of the cochlea focuses wave energy towards the outer wall, especially further up the tube, where lower frequencies (bass sounds) are detected.
This concentration of energy at the spiral’s outer edge amplifies the sensitivity of membrane cells to vibrations. The amplification corresponds to a 20 dB boost of lower frequencies relative to the higher frequencies detected at the outer face of the spiral. Mammals may have developed the spiral structure for communication and survival, since low-frequency sound waves can travel further [13, 14].
Recent findings from auditory physiology, psychoacoustics, and speech perception suggest that the auditory system re-encodes the acoustic spectrum in terms of spectral and temporal modulations.
The perceptual role of very low frequency modulations resembles that of inaudible “message-bearing waves” modulating higher-frequency carriers. The existence of modulations in speech has been evidenced in [2, 50]. Dynamic information provided by the modulation spectrum includes fast and slower time-varying quantities such as pitch, phonetic and syllabic rates of speech, tempo of music, etc. [37]. Moreover, speech intelligibility depends on the integrity of the slow spectro-temporal energy modulations.
Accordingly, the auditory models proposed by various researchers are based on two-stage processing: the early stage consists of spectrum estimation, whereas in the second, “cortical” stage, spectrum analysis is performed in order to estimate low-frequency modulations. Various models of modulation representations have been proposed: a two-dimensional Fourier transform of the autocorrelation matrix of the sound spectrogram [47], a two-dimensional wavelet transform of the auditory spectrogram [36], a joint acoustic/modulation frequency representation [37], a combination of the latter representation with cepstrum [37], etc.
Auditory models are perhaps too complex for successful inversion, which would be required for coding, speech enhancement, compression, etc. For recognition and classification systems, however, robustness in noise and utility for segregation in cluttered acoustic environments might be more important than invertibility.
In order for these features to describe natural signals like speech, we might have to consider fine resolution in all the subspaces involved. Simply retaining more features from the original space is not satisfactory, because the computational cost becomes very high and the generalization of the classification algorithm is not guaranteed: the curse of dimensionality [55, 73] hinders the classification task. There is a trade-off between compactness and quality of a representation – measured in terms of redundancy and relevance of features, respectively. The relevance-redundancy trade-off can be explored using an information-theoretic concept, mutual information [68].
In the early 2000s, existing speech recognition research was expanded upon in hopes of finding practical music labeling applications. Commonly used metrics such as Mel-frequency cepstral coefficients (MFCCs), a succinct representation of a smoothed audio spectrum, were quickly adapted for music. These timbral features proved to be quite effective when applied to segmented audio samples.
However, given the complexities of music, additional features were necessary for sufficient genre analysis. Rhythmic and harmonic features (e.g. tempo, dominant pitches) were developed in response. Some went a step further to utilize quantization of higher-level descriptors like emotion and categorization of audio. The effectiveness of these types of extracted features has led some researchers to ponder their applications for automatic playlist generation, music recommendation algorithms, or even assisting musicologists in determining how humans define genres and otherwise classify music [9].
Upon extracting features from music samples, relationships were mapped out among various genres. For example, rock music tends to have a higher beat strength than classical music, whereas hip-hop usually contains greater low-frequency (bass) energy than jazz. Statistical models were trained on feature sets by genre before finally being applied to test sets of audio for classification.
One of the most well-known (and perhaps the earliest successful) genre classification papers is that of Tzanetakis and Cook [20]. Together, they spearheaded the initial usage of speech recognition feature sets for genre-related purposes. Timbre features utilized include spectral centroid, spectral rolloff, spectral flux, zero-crossings, MFCCs, and low-energy. Additional rhythmic and pitch-content features were also developed using beat and pitch histograms. Using Gaussian mixture model and k-nearest neighbor approaches, a 61% genre classification success rate was achieved. These encouraging results not only validated the feature-extraction and machine learning methodology, but provided a solid base for others to expand on and a corresponding benchmark to compare against. Li et al. [66] utilized the exact same feature set as Tzanetakis and Cook with the addition of Daubechies wavelet coefficient histograms (DWCHs). Two further machine learning methods were compared: support vector machines (SVM) and linear discriminant analysis. Findings were positive, with an overall accuracy of 80%.
McKinney and Breebaart [42] suggested the usage of psychoacoustic features, e.g. roughness, loudness, and sharpness. Additional low-level features such as RMS and pitch strength were incorporated as well. Features were computed for four different frequency bands before being applied to a Gaussian mixture model, resulting in a 74% success rate.
Lidy and Rauber [67] analyzed audio from a psycho-acoustic standpoint as well, deciding to further transform the audio in the process of feature extraction. Using SVMs with pairwise classification, they were able to reach an accuracy of 75%.
Pohle et al. [69] supplemented Tzanetakis and Cook’s feature set with MPEG-7 low-level descriptors (LLDs), including power, spectrum spread, and harmonicity. K-nearest neighbors, naïve Bayes, C4.5 (a decision tree learner), and SVM were all utilized for machine learning. Although the original focus of the study was to classify music into perceptual categories such as mood (happy vs. sad) and emotion (soft vs. aggressive), their algorithm was also adapted for genre classification with a 70% rate of success.
Burred and Lerch [29] chose to use MPEG-7 LLDs as well, but added features like beat strength and rhythmic regularity. Using an intriguing hierarchical decision-making approach, they were able to achieve a classification accuracy of around 58%.
As mentioned previously, these studies often failed to depict electronic music in a manner accurately representative of the state of the genre. If included at all, terms used ranged from “Techno” [66] and “Disco” [20] to “Techno/Dance” [29] and “Eurodance” [17]. Many of these labels are now antiquated or nonexistent, so audio features and machine learning strategies were selected from previous studies for application on samples from current electronic music subgenres that better embody the style of music.
Saunders [32] published one of the first studies on speech/music discrimination, in hopes of isolating the music portions of FM radio broadcasts. Based on analysis of the temporal zero-crossing rate (ZCR) of audio signals, a Gaussian classifier was developed from a training set of samples. A remarkable 98.4% segmentation accuracy was achieved with real-time performance, proving not only the viability of speech/music discriminators in general, but also the effectiveness of this type of two-step feature extraction and machine learning process for signal classification purposes.
This work has been extended in several ways. The use of more complex features (e.g. spectral, rhythmic, and harmonic) is beneficial for expanding the types of classes analyzed [20]. Scheirer and Slaney [18] proposed a low-energy metric, exploiting a characteristic energy peak that appears in speech signals around 4 Hz [30]. Spectral features such as spectral rolloff, spectral centroid, and spectral flux were also included in their feature vector, yielding a maximum accuracy of 98.6%.
Mel-frequency cepstral coefficients proved to be valuable for speech/music discrimination applications as well, resulting in 98.8% [43] and 93% [48] accuracies in different experimental setups. Linear prediction coefficients have also been used with some success in several studies [27, 34]. A more in-depth summary of prior speech/music discrimination studies can be found in table form in [75]. Noteworthy is the fact that the lack of a standard corpus makes classification rates hard to compare directly.
Throughout this wealth of experimentation, the focus has been on discriminating pure speech from pure music. In reality, much broadcast and internet audio cannot be classified as pure speech or pure music, with various mixtures of speech, singing, musical instruments, environmental noise, etc. occurring quite commonly. Even in radio applications, DJs speak over background music and news anchors talk over introduction themes or lead-in “bumpers.” Being able to identify these types of mixtures could be beneficial, for example, to determine transitions between types of audio for segmentation purposes. Didiot et al. [16] addressed the problem of multi-class audio by classifying audio samples twice, first as music or non-music, then speech or non-speech. Using the resulting four final categories, maximum accuracies of 92%, 85%, and 74% were obtained for three different test sets. Lu, Zhang and Li [40] took a very similar approach, with the same final four categories (music, speech, speech + music, noise), and with comparable accuracy (92.5% after post-processing). A similar setup was implemented by Zhang and Kuo [70], likewise resulting in over 90% of the test audio being labeled correctly.

The authors of [12] chose seven categories: silence, single-speaker speech, music, environmental noise, multiple-speaker speech, simultaneous speech and music, and speech and noise. Using a wide array of features such as MFCCs and LPCs, over 90% of the samples were classified correctly. Razik et al. [31] mixed speech and music at ratios of 5, 10, and 15 dB to determine the effect on classification. Results indicated that lower mixing ratios caused a greater likelihood for the samples to be misclassified as instrumental music or music with vocals. Around 75% of the test set was classified correctly in a three-way (speech/music/mixed) application, using MFCCs and spectral features.
Music mood classification is the process of assigning labels such as happy, angry and sad. Different pieces of music in the same category are thought to share the same “basic musical language”. The most common of the categorical approaches to emotion modeling is that of Paul Ekman’s [49] basic emotions, which encompasses the emotions of anger, fear, sadness, happiness, and disgust. A categorical approach is one that consists of several distinct classes that form the basis for all other possible emotional variations. Categorical approaches are most applicable to goal-oriented situations.
Dimensional approaches classify emotions along several axes, such as valence (pleasure), arousal (activity), and potency (dominance). Such approaches include James Russell’s two-dimensional bipolar space (valence-arousal) [25]; Robert Thayer’s energy-stress model [53, 54], where contentment is defined as low energy/low stress, depression as low energy/high stress, exuberance as high energy/low stress, and anxious/frantic as high energy/high stress; and Albert Mehrabian’s three-dimensional PAD representation (pleasure-arousal-dominance) [3].
One of the publications on emotion detection in music is credited to Feng, Zhuang, and Pan, who employ Computational Media Aesthetics to detect mood for music information retrieval tasks [74]. The two dimensions of tempo and articulation are extracted from the audio signal and are mapped to one of four emotional categories: happiness, sadness, anger, and fear. This categorization is based on Juslin’s theory [51], where the two elements of slow or fast tempo and staccato or legato articulation adequately convey emotional information from the performer to the audience. The time-domain energy of the audio signal is used to determine articulation, while tempo is determined using Dixon’s beat detection algorithm [59].
Single-modal and multi-modal mood classification has been performed by various researchers. Kate Hevner’s Adjective Circle [35] consists of 66 adjectives that are divided into 8 groups (clusters of moods). Chetan et al. [11] chose emotional states based on Hevner’s circle for their emotion-based music visualization using photos. The eight classes, in the order of their numbers, are called: sublime, sad, touching, easy, light, happy, exciting and grand.
Farnsworth modified Hevner’s concept and arranged the moods in ten groups [52]. Rigg et al.’s [44, 45] experiments include four categories of emotion: lamentation, joy, longing, and love. Categories are assigned several musical features; for example, ‘joy’ is described as having iambic rhythm (staccato notes), fast tempo, high register, major mode, simple harmony, and loud dynamics (forte). Watson uses fifteen adjective groups in conjunction with the musical attributes pitch (low-high), volume (soft-loud), tempo (slow-fast), sound (pretty-ugly), dynamics (constant-varying), and rhythm (regular-irregular). Watson’s research reveals many important relationships between these musical attributes and the perceived emotion of a musical excerpt. As such, Watson’s contribution has provided music emotion researchers with a large body of relevant data that they can use to gauge the results of their experiments.
Automatic audio classification for music is a comparatively common technique. The used musical attributes are typically divided into two groups, timbre-based attributes and rhythmic or tempo-based attributes. The tempo-based attributes can be represented by e.g. an Average Silence Ratio or a Beats per Minute value.
Lu [38] uses, amongst others, Rhythm Strength, Average Correlation Peak, Average Tempo and Average Onset Frequency to represent rhythmic attributes. Frequency-spectrum based features like Mel-Frequency Cepstral Coefficients (MFCC), Spectral Centroid, Spectral Flux or Spectral Rolloff are also used.
Wu and Jeng [65] use a complex mixture of various features: Rhythmic Content, Pitch Content, Power Spectrum Centroid, Inter-channel Cross Correlation, Tonality, Spectral Contrast and Daubechies Wavelet Coefficient Histograms. For the classification step in the music domain, Support Vector Machines (SVM) [60] and Gaussian Mixture Models (GMM) [39] are typically applied. Liu et al. [7] utilize a nearest-mean classifier. The comparison of classification results of different algorithms is difficult because every publication uses an individual test set or ground truth. For example, the algorithm of Wu and Jeng [65] reaches an average classification rate of 74.35% for 8 different audios, with the additional difficulty that the results of the system and the ground truth contain mood histograms which are compared by a quadratic cross-similarity. Jadon et al. [57, 58] have extracted time domain, pitch, frequency domain, sub band energy, and MFCC based audio features.
Another integral emotion detection project is Li and Ogihara’s content-based music similarity search [64]. Their original work on emotion detection in music [46] utilized Farnsworth’s ten adjective groups [44]. Li and Ogihara’s system extracts relevant audio descriptors using MARSYAS [21] and then classifies them using Support Vector Machines (SVM). The 2004 research utilized Hevner’s eight adjective groups to address the problem of music similarity search and emotion detection in music. Daubechies Wavelet Coefficient Histograms are combined with timbral features, again extracted with MARSYAS, and SVMs were trained on these features to classify their music database.
Implementing Tellegen, Watson, and Clark’s three-layer dimensional model of emotion [33], Yang and Lee developed a system to disambiguate music emotion using software agents [15]. This platform makes use of acoustical audio features and lyrics, as well as cultural metadata, to classify music. The emotional model focuses on negative affect, and includes the axes of high/low positive affect and high/low negative affect. Tempo is estimated through the autocorrelation of energy extracted from different frequency bands. Timbral features such as spectral centroid, spectral rolloff, spectral flux, and kurtosis are also used to measure emotional intensity.
Another implementation of Thayer’s dimensional model of emotion is Tolos, Tato, and Kemp’s mood-based navigation system for large collections of musical data [46]. In this system a user can select a mood from a two-dimensional mood plane, and the mood of a song is extracted automatically. Tolos, Tato, and Kemp use Thayer’s model of mood, which comprises the axes of quality (x-axis) and activation (y-axis).

This results in four audio classes: aggressive, happy, calm, and melancholic. Building on the work of Li and Ogihara, Wieczorkowska, Synak, Lewis, and Ras conducted research to automatically recognize emotions in music through the parameterization of audio data [5]. They implemented a k-NN classification algorithm to determine the mood of a song. Timbre and chords are used as the primary features for parameterization. Their system implements single labeling of classes by a single subject, with the idea of expanding the research to multiple labeling and multi-subject assessments in the future.
This labeling resulted in six classes: happy and fanciful; graceful and dreamy; pathetic and passionate; dramatic, agitated, and frustrated; sacred and spooky; and dark and bluesy.
Lastly, an emerging source of information relating to emotion detection in music is the Music Information Retrieval Evaluation eXchange’s (MIREX) [62] annual competition, which will for the first time include an audio mood classification category [5]. The MIR community has recognized the importance of mood as a relevant and salient category for music classification. They believe that this contest will help to solidify the area of audio classification and provide valuable ground-truth data.
At the moment, two approaches to the music taxonomy are being considered. The first is based on music perception, such as Thayer’s two-dimensional model. It has been found that fewer categories result in more accurate classifications.
The second model comes from music information practices, such as AllMusic and similar services, which use mood labels to classify their music databases.
Social tagging of music, as on Last.FM, is also being considered a valuable resource for music information retrieval and music classification.
A content-based audio classification system can effectively classify songs by mood. Until now, no application has been available that can classify the mood of a song, though certain websites display songs according to their moods; in this research we have attempted to do so using audio classification and feature extraction.

Audio feature extraction is the process of converting an audio signal into a sequence of feature vectors carrying characteristic information about the signal. These vectors are used as the basis for various types of audio analysis algorithms. It is typical for audio analysis algorithms to be based on features computed on a window basis. These window-based features can be considered a short-time description of the signal for that particular moment in time. The performance of a set of features depends on the application; the design of descriptive features for a specific application is hence the main challenge in building audio classification systems. Audio in fact tells a lot about a clip: the music component, the noise, the fast or slow pace. The human brain, too, can classify a clip on the basis of audio alone.
In this chapter, we first extract the audio waveform from the audio clip and then extract a set of low-level audio features for characterizing genres of audio clips. A digitized audio waveform is segmented into clips as shown in figure 3.1, which may or may not overlap with previous clips and cover constant or variable duration.
The overall process of feature extraction converts the waveform into a sequence of parameter blocks, which we call vectors. The continuous waveform is first discretized into a bit representation (e.g. 16-bit or 32-bit, with more bits giving higher quality), and the sample vector is then encoded into parameter vectors; the number of samples collected in one second is called the sampling rate.

Figure 3.1: An audio clip used in audio characterization.

Figure 3.2: Concept of overlapping and window frames
A frame is a collection of sample vectors with a constant frame window size. Frames may overlap one another; clip-level features are computed from frame-level features. Figure 3.2 shows the concept of overlapping and window frames.
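The framing described above can be sketched in a few lines of Python; the function name and the frame/hop sizes below are illustrative choices, not part of the system described:

```python
def frame_signal(samples, frame_size, hop_size):
    """Split a sample sequence into frames of frame_size samples.

    A hop_size smaller than frame_size yields overlapping frames,
    as in figure 3.2.
    """
    frames = []
    start = 0
    while start + frame_size <= len(samples):
        frames.append(samples[start:start + frame_size])
        start += hop_size
    return frames

# 10 samples, frames of 4 with 50% overlap -> frames start at 0, 2, 4, 6
frames = frame_signal(list(range(10)), frame_size=4, hop_size=2)
```

With a hop size of half the frame size, consecutive frames overlap by 50%; frame-level features are then computed per frame and aggregated into clip-level features.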
A wide range of audio features exist for classification tasks. These fall under the following categories:
• Time Domain Features
• Pitch Based Features
• Frequency Domain Features
• Energy Features
Volume is a reliable indicator for silence detection; therefore, it can be used to segment an audio sequence and determine clip boundaries [28, 71]. Volume is commonly perceived as loudness, since natural sounds are pressure waves with different amounts of power pushing on our ears. In electronic sound, the physical quantity is amplitude, which is characterized by the sample value in digital signals. Therefore, volume is often calculated as the Root-Mean-Square (RMS) of amplitude [19, 63, 76]. The volume of the nth frame is calculated by the following formula:
V(n) = √( (1/N) Σi=1..N Sn(i)² ) ———(3.1)
V(n) is the volume,
Sn(i) is the ith sample in the nth frame audio signal,
N is the total number of samples in the frame.
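Equation (3.1) translates directly into code; the following is a minimal Python sketch (the function name is our own, for illustration only):

```python
import math

def rms_volume(frame):
    """Root-Mean-Square volume of one frame (equation 3.1)."""
    n = len(frame)
    return math.sqrt(sum(s * s for s in frame) / n)

# A square wave of amplitude 0.5 has an RMS volume of 0.5
v = rms_volume([0.5, -0.5, 0.5, -0.5])
```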
Let us consider three different audio clips (sports, music, and news) for experimental purposes. Figure 3.3 shows the waveforms of these clips. The volumes of these audio clips have different distributions.

(a) News

(b) Music

(c) Sports
Figure 3.3: Waveforms of Different Audio Clips

3.1.1 Volume Standard Deviation (VSD)
The standard deviation of the distribution of volume (of each frame) is calculated, and is treated as VSD. Thus, it is a clip level feature rather than a frame level feature.
σ = √( Σ(A − μ)² / (n − 1) ) ———(3.2)
where σ is the standard deviation, μ is the mean, A ranges over the per-frame volumes and n is the number of frames in the clip. The sports clips have higher values of VSD.
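Equation (3.2) over a clip's per-frame volumes can be sketched as follows (function and variable names are illustrative):

```python
import math

def volume_std(volumes):
    """Sample standard deviation of per-frame volumes (equation 3.2)."""
    n = len(volumes)
    mu = sum(volumes) / n  # mean volume over the clip
    return math.sqrt(sum((v - mu) ** 2 for v in volumes) / (n - 1))

vsd = volume_std([1.0, 2.0, 3.0])
```

Note the (n − 1) denominator, matching the sample (rather than population) standard deviation in equation (3.2).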
3.1.2 Volume Dynamic Range (VDR)
In audio with action in the background, the volume of the frames does not change much, while in non-action audio there are silent periods between the speech segments, and hence VDR is expected to be higher. VDR is calculated as
VDR = (MAX(v) – MIN(v)) / MAX(v) ———(3.3)
Where MIN(v) and MAX(v) represent the minimum and maximum volume within a clip respectively.
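Equation (3.3) is a one-line computation over a clip's frame volumes; a small Python sketch (illustrative names):

```python
def volume_dynamic_range(volumes):
    """VDR = (MAX(v) - MIN(v)) / MAX(v) over a clip (equation 3.3)."""
    return (max(volumes) - min(volumes)) / max(volumes)

# A clip containing near-silent frames yields a VDR close to 1
vdr = volume_dynamic_range([0.2, 0.5, 1.0])
```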
3.1.3 Zero Crossing Rate (ZCR)
ZCR indicates the number of times that an audio waveform crosses the zero axis. ZCR is the most effective indicator for detecting unvoiced speech. By combining ZCR and volume, misclassification of low-energy, unvoiced speech frames as silent frames can be avoided. Specifically, unvoiced speech has low volume but high ZCR [41, 76]. Generally, non-action audio samples have a lower ZCR. The definition of ZCR in the discrete case is as follows:

ZCR = ( Σ_{n=1..N} | sgn(S(n)) − sgn(S(n−1)) | ) · fs / (2N) ———(3.4)

sgn(S(n)) =  1,  S(n) > 0
             0,  S(n) = 0
            −1,  S(n) < 0
S(n) is the input audio signal,
N is the number of signal samples in a frame,
sgn( ) is the sign function and
fs is the sampling rate of the audio signal.
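Eq. (3.4) can be sketched as follows; the frame is assumed to be a list of samples and fs the sampling rate in Hz:

```python
def sgn(x):
    """Sign function used in Eq. (3.4)."""
    return 1 if x > 0 else (-1 if x < 0 else 0)

def zero_crossing_rate(frame, fs):
    """ZCR of one frame, Eq. (3.4): half the summed sign changes between
    consecutive samples, scaled by fs / N to give crossings per second."""
    n = len(frame)
    changes = sum(abs(sgn(frame[i]) - sgn(frame[i - 1])) for i in range(1, n))
    return changes * fs / (2.0 * n)
```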
Figure 3.4 shows the ZCR of the audio clips. From these plots, we see that the ZCR of these audio clips has different distributions.

(a) Sports

(b) Music
(c) News
Figure 3.4: ZCR of Different Genres of Audio Clips

3.1.4 Silence Ratio
The Volume Root Mean Square (VRMS) and Volume Standard Deviation of each frame are used to calculate the Silence Ratio (SR). SR is not computed per frame but per clip, so it is a clip-level feature rather than a frame-level one. The silence ratio is calculated as follows:
SR(n) = sr/n ———(3.5)
Where ‘sr’ is initially zero and is incremented by one for each frame whose VRMS is less than half of the VSD, and ‘n’ is the total number of frames in the clip. In music and sports clips there is always some noise in the background, which results in a low silence ratio. On the other hand, the silence ratio is much higher in clips of other genres.
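A sketch of Eq. (3.5), assuming the per-frame RMS volumes and the clip's VSD have already been computed:

```python
def silence_ratio(frame_volumes, vsd):
    """SR of a clip, Eq. (3.5): fraction of frames whose RMS volume falls
    below half of the clip's volume standard deviation."""
    sr = sum(1 for v in frame_volumes if v < vsd / 2.0)
    return sr / len(frame_volumes)
```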
3.2 Pitch Based Features
Pitch serves as an important characteristic of audio for robust classification. Pitch information helps derive three features, which enable a more accurate classification than the other features. The Average Magnitude Difference Function (AMDF) is used to determine the pitch of a frame, and is defined as:

Am(n) = (1/N) Σ_{i=1..N−n} | sm(i + n) − sm(i) | ———(3.6)

Am(n) = AMDF at lag n (the nth sample offset) in the mth frame
N = number of samples in a frame
sm(i) = ith sample in the mth frame of the audio signal
AMDF is calculated for every sample of the frame. The following is the pitch determination algorithm for a given frame:
(i) Find the global minimum value Amin (Amin is the minimum value of AMDF function for a given frame)
(ii) Find all local minimum points ni such that A(ni) < Amin + δ (where δ is determined on the basis of the data). These values of ni are the sample indices at which the AMDF has local minimum values.
(iii) The clearness of each local minimum is checked. Each ni collected in step (ii) is checked for clearness: the difference between A(ni) and the average of A(ni−4), A(ni−3), A(ni−2), A(ni−1), A(ni+1), A(ni+2), A(ni+3), A(ni+4) is compared with a certain threshold (again decided on the basis of the data), and points that fail the test are said not to be clear and are removed. After this step, only those sample indices remain for which the AMDF has a clear local minimum.
(iv) From the remaining local minimum points (i.e. those left after steps (ii) and (iii) have been applied), choose the smallest ni value as the pitch period.
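The four steps above can be sketched as follows. The searched lag range, the exact form of the clearness test (here a candidate is kept when the average of up to 4 neighbours on each side exceeds it by at least clear_thresh), and the parameter values are assumptions, not the chapter's tuned choices:

```python
def amdf(frame, lag):
    """Eq. (3.6): average magnitude difference of the frame at the given lag."""
    n = len(frame)
    return sum(abs(frame[i + lag] - frame[i]) for i in range(n - lag)) / n

def pitch_period(frame, delta, clear_thresh):
    """Steps (i)-(iv) of the pitch determination algorithm; returns the pitch
    period in samples, or None if no clear minimum is found."""
    max_lag = len(frame) // 2
    a = [amdf(frame, lag) for lag in range(1, max_lag)]   # a[k] holds lag k + 1
    a_min = min(a)                                        # step (i): global minimum
    cleared = []
    for k in range(1, len(a) - 1):
        # step (ii): local minima close to the global minimum
        if a[k] <= a[k - 1] and a[k] <= a[k + 1] and a[k] < a_min + delta:
            # step (iii): clearness against up to 4 neighbours on each side
            neigh = [a[j] for j in range(max(0, k - 4), min(len(a), k + 5)) if j != k]
            if sum(neigh) / len(neigh) - a[k] >= clear_thresh:
                cleared.append(k + 1)
    return min(cleared) if cleared else None              # step (iv): smallest cleared lag
```

On a perfectly periodic frame the AMDF dips to zero at the true period, which the algorithm recovers as the smallest cleared lag.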

Figure 3.5: AMDF of one News Frame of an Audio Clip

(a) Sport

(b) Music

Figure 3.6: Pitch Contours of Three Different Audio Clips
Thus, the pitch for a frame is calculated by applying the AMDF to each sample offset in the frame. Figure 3.5 shows the AMDF of one news frame. Figure 3.6 gives the pitch tracks of the three different audio clips. After computing the pitch for each frame, the following clip-level features are extracted.
3.2.1 Pitch Standard Deviation (PSD)
The pitch standard deviation is calculated from the pitch of every frame, and it turns out to be a much better distinguishing factor than the other features. Having obtained a pitch for every frame in a given clip, we calculate the standard deviation using the standard statistical formula.
3.2.2 Voice or Music Ratio (VMR)
The pitch information of each frame helps decide whether the frame contains speech, music, or neither. For a speech/music frame, the pitch stays relatively smooth. We compare the pitch of a frame with that of 5 of its neighbouring frames, and if all of them are close to each other (decided by a threshold set on the basis of the data), the frame is classified as speech/music. After classifying each frame (i.e. whether it is voice/music or not), we compute the VMR as the ratio of speech/music frames to the total length of the clip:
VMR = (Number of frames classified as speech/music) / (Total number of frames) ———(3.7)
3.2.3 Noise or Unvoice Ratio (NUR)
The NUR is calculated for a clip rather than a frame, by classifying each frame as either speech or noise/unvoiced. A frame for which no pitch has been detected and which is not silent (the criterion used for the silence ratio is also applied here) is declared noise/unvoiced. The NUR for the entire clip is calculated as the ratio of noise/unvoiced frames to the entire length of the clip (in number of frames):
NUR = (Number of frames classified as noise/unvoiced) / (Total number of frames) ———(3.8)
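Assuming per-frame pitch estimates are already available (with None marking frames where no pitch was detected), Eqs. (3.7) and (3.8) can be sketched as follows; the neighbourhood of two frames on each side and the closeness tolerance stand in for the chapter's 5-neighbour threshold test:

```python
def voice_music_ratio(pitches, tol, k=2):
    """VMR, Eq. (3.7): fraction of frames whose pitch stays within tol of all
    of its k neighbours on either side (a smooth, speech/music-like contour)."""
    n = len(pitches)
    voiced = 0
    for i in range(n):
        if pitches[i] is None:
            continue
        neigh = [pitches[j] for j in range(max(0, i - k), min(n, i + k + 1)) if j != i]
        if all(p is not None and abs(p - pitches[i]) <= tol for p in neigh):
            voiced += 1
    return voiced / n

def noise_unvoice_ratio(pitches, silent):
    """NUR, Eq. (3.8): fraction of frames with no detected pitch that are not silent."""
    hits = sum(1 for p, s in zip(pitches, silent) if p is None and not s)
    return hits / len(pitches)
```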
3.3 Frequency Domain Features
The time domain does not show the frequency components and frequency distribution of a sound signal [72]. To obtain frequency domain features, the spectrogram of an audio clip, in the form of the short-time Fourier transform, is therefore calculated for each audio frame.
Any sound signal can be expressed in the form of a frequency spectrum, a graph of frequency versus amplitude; it shows the average amplitude of the various frequency components in the audio signal [6]. The spectrogram is used for the extraction of two features, namely the frequency centroid and the frequency bandwidth. Figure 3.7 gives the spectrograms of the three audio clips:

(a) Sports (b) Music

(c) News
Figure 3.7: Frequency Spectra of Three Different Audio Clips
3.3.1 Frequency Centroid
An important feature that can be derived from the energy distribution of a sound is the frequency centroid (also called brightness), which is the midpoint of the spectral energy distribution. The frequency centroid of the ith frame is calculated as:

FCi = Σ_w w · |Si(w)|² / Σ_w |Si(w)|² ———(3.9)

where Si(w) represents the short-time Fourier transform of the ith frame. We then calculate the clip-level feature by taking the mean of the frame frequency centroids.
3.3.2 Frequency Bandwidth
Frequency bandwidth indicates the frequency range of a sound. It is normally calculated as the difference between the highest and the lowest frequency of the non-zero spectrum components. Here it is calculated as the spread of the spectrum around the centroid:

BWi = sqrt( Σ_w (w − FC)² · |Si(w)|² / Σ_w |Si(w)|² ) ———(3.10)

where FC is the frequency centroid and Si(w) represents the short-time Fourier transform of the ith frame. We then calculate the clip-level feature by taking the mean of the frame frequency bandwidths.
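As a sketch of Eqs. (3.9) and (3.10), assuming the frame's magnitude spectrum and the frequency of each bin are given as parallel lists:

```python
import math

def frequency_centroid(mags, freqs):
    """Eq. (3.9): power-weighted mean frequency of one frame's spectrum."""
    power = [m * m for m in mags]
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)

def frequency_bandwidth(mags, freqs):
    """Eq. (3.10): power-weighted spread of the spectrum around the centroid."""
    power = [m * m for m in mags]
    fc = frequency_centroid(mags, freqs)
    return math.sqrt(sum((f - fc) ** 2 * p for f, p in zip(freqs, power)) / sum(power))
```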
3.3.3 Frequency Rolloff:
Frequency rolloff indicates the flatness of the sound: the decrease in energy with increasing frequency, ideally described at the sound source as 12 dB per octave. The frequency rolloff is defined as the frequency below which 85% of the energy in the spectrum lies. It is often used as an indicator of the skew of the frequencies present in a window.
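A sketch of the rolloff computation over a frame's power spectrum, returning the bin index at which the cumulative energy first reaches the 85% threshold:

```python
def frequency_rolloff(power, fraction=0.85):
    """Smallest bin index k such that the cumulative spectral energy up to
    and including k reaches `fraction` of the total energy."""
    target = fraction * sum(power)
    cum = 0.0
    for k, p in enumerate(power):
        cum += p
        if cum >= target:
            return k
    return len(power) - 1
```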

3.3.4 Frequency Flux:
Frequency flux measures changes in spectral energy (variation of the harmonics) between a window of samples and the preceding window, and is a good measure of the amount of spectral change in a signal. It is calculated by first taking the difference between the value of each magnitude spectrum bin in the current window and the corresponding value in the previous window; each of these differences is then squared, and the result is the sum of the squares.
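The description above amounts to a sum of squared bin-wise differences, assuming the two magnitude spectra are given as equal-length lists:

```python
def frequency_flux(prev_mags, cur_mags):
    """Spectral flux: sum of squared bin-wise magnitude differences between
    the current window's spectrum and the previous window's spectrum."""
    return sum((c - p) ** 2 for p, c in zip(prev_mags, cur_mags))
```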
3.4 Energy Features
The energy distribution in different frequency bands also varies quite significantly among different types of audio signals. The entire frequency spectrum is divided into four sub-bands at equal intervals of 1 kHz. Each sub-band consists of six critical bands, which represent the cochlear filters in the human auditory model [56]. The sub-band energy is defined as:
Ei = Σ_{w = WiL..WiH} |S(w)|² ——— (3.11)
Here WiL and WiH are the lower and upper bounds of sub-band i. Ei is then normalized by the total spectral energy:
Ei′ = Ei / Σj Ej ——— (3.12)
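A sketch of Eqs. (3.11) and (3.12), assuming the power spectrum is a list of bin values and each sub-band is given as a (lo, hi) pair of bin indices:

```python
def subband_energy_ratios(power, bands):
    """Eqs. (3.11)-(3.12): energy of each sub-band, normalized by the
    total spectral energy of the frame."""
    total = sum(power)
    return [sum(power[lo:hi]) / total for lo, hi in bands]
```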
Figure 3.8: Energy for News Signal

Figure 3.9: Energy for Music Signal

MFCCs are short-term spectral features. MFCC features are frequently used by researchers for speech recognition and for the music/speech classification problem. A block diagram showing the steps taken in computing MFCCs can be seen in figure 3.10. Each step in the process of creating Mel Frequency Cepstral Coefficients is motivated by computational or perceptual considerations.

Figure 3.10 Block diagram showing the steps for computing MFCCs
The first step in this process is to block the continuous audio signal into frames. The purpose here is to model small sections of the audio signal that are statistically stationary. Each frame consists of n samples, with adjacent frames separated by m samples. The second frame starts m samples after the first sample and overlaps it by (n − m) samples; similarly, the third frame starts m samples after the second frame and overlaps it by (n − m) samples. Typical values for n and m are 256 and 100 respectively.
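The framing step can be sketched directly from this description, using the typical n = 256 and m = 100:

```python
def frame_signal(x, n=256, m=100):
    """Block a signal into frames of n samples whose start points are m samples
    apart, so adjacent frames overlap by n - m samples."""
    return [x[i:i + n] for i in range(0, len(x) - n + 1, m)]
```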
The next step is to use a window function on each individual frame in order to minimise discontinuities at the beginning and end of each frame. Typically the window function used is the Hamming window and has the following form:

w(n) = 0.54 − 0.46 cos( 2πn / (N − 1) ),  0 ≤ n ≤ N − 1 ——— (3.13)

Given the above window function and assuming that there are N samples in each frame, we will obtain the following signal after windowing.

y(n) = x(n) · w(n),  0 ≤ n ≤ N − 1 ——— (3.14)
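Assuming the standard Hamming window coefficients 0.54 and 0.46, the windowing step of Eqs. (3.13) and (3.14) can be sketched as:

```python
import math

def hamming(N):
    """Eq. (3.13): w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)) for 0 <= n <= N - 1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window_frame(frame):
    """Eq. (3.14): multiply each sample by the window value, y(n) = x(n) w(n)."""
    w = hamming(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```

The window tapers to 0.08 at the frame edges and reaches 1.0 at the centre, which is what suppresses the discontinuities at the frame boundaries.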

The next step is to convert each frame of N samples from the time domain to the frequency domain by taking the Discrete Fourier Transform of each frame. We use the FFT algorithm, which is computationally efficient, to implement the DFT. As the amplitude of the spectrum is much more important than the phase, we retain only the amplitude spectrum. The Discrete Fourier Transform of a set of N samples is defined as follows [33]:
X(k) = Σ_{n=0..N−1} x(n) e^(−j2πkn/N),  k = 0, 1, …, N − 1 ——— (3.15)
The next step is the transformation of the real frequency scale to the mel frequency scale. A mel is a unit of measure of the perceived pitch or frequency of a tone [45]. The mel frequency scale is based on the nonlinear human perception of the frequencies of audio signals: it is linear below 1 kHz and logarithmic above this frequency. By taking the pitch of the 1 kHz tone as a reference, assigning it 1000 mels, and then making test measurements based on human perception of audio signals, it is possible to derive a model for an approximate mapping of a given real frequency to the mel frequency scale. The following is an approximation of the mel frequency based on such experiments:
Mel(f) = 2595 · log10(1 + f/700) ——— (3.16)
where f is the physical frequency in Hz and Mel(f) is the perceived frequency in mels.
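Eq. (3.16) in code; note that the mapping keeps the 1 kHz reference tone at approximately 1000 mels:

```python
import math

def hz_to_mel(f):
    """Eq. (3.16): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)
```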

Figure 3.11: Mel Scale Mapping

Figure 3.12: Plot of a News Signal as a Function of Time

Figure 3.13: Plot of the MFCCs for the News Signal

Figure 3.14: Plot of a Music Signal as a Function of Time

Figure 3.15: Plot of the MFCCs for the Music Signal

Entropy is a measure of the disorder, or relative degree of randomness, of a system; in thermodynamics it quantifies the energy not available for work and the tendency of a process. For signals, higher entropy means greater randomness: entropy increases with noise and bandwidth, and it determines the maximum attainable data rate in bits per second. Entropy also refers to the disorder deliberately added to data in certain encryption processes.



Recent interest in artificial neural networks has motivated a large number of applications covering a wide range of research fields. The ability of neural networks to learn provides an interesting alternative to conventional methods. The different problem domains where neural networks may be used include pattern association, pattern classification, regularity detection, image processing, speech analysis and simulation.
We have designed a neural classifier using artificial neural networks for our content-based audio classification task. In this chapter, we first describe neural networks and their various types, then the learning methodology of neural networks, and finally the design of the neural network classifier.
A neural network is an artificial representation of the human brain that tries to simulate its learning process. The term "artificial" means that neural nets are implemented as computer programs that are able to handle the large number of calculations required during the learning process [1, 10, 39]. The human brain consists of a large number (more than a billion) of neural cells that process information. Each cell works like a simple processor, and only the massive interaction between all cells and their parallel processing makes the brain's abilities possible.
Figure 4.1 shows a sketch of such a neural cell, called a neuron. A neuron consists of a core, dendrites for incoming information, and an axon with dendrites for outgoing information that is passed to connected neurons. Information is transported between neurons in the form of electrical stimulation along the dendrites. Incoming information that reaches the neuron's dendrites is added up and then delivered along the neuron's axon to the dendrites at its end, where it is passed on to other neurons. If the stimulation has exceeded a certain threshold, the neuron is said to be activated; if the incoming stimulation is too low, the information is not transported any further and the neuron is said to be inhibited.
Like the human brain, a neural net also consists of neurons and connections between them. The neurons transport incoming information along their outgoing connections to other neurons. In neural net terms these connections are called weights. The "electrical" information is simulated with specific values stored in these weights, and simply changing these weight values simulates a change in the connection structure. Figure 4.2 shows an idealized neuron of a neural net. Information (called the input) is sent to the neuron on its incoming weights. This input is processed by a propagation function that adds up the values of all incoming weights. The resulting value is compared with a certain threshold value by the neuron's activation function: if the input exceeds the threshold, the neuron is activated, otherwise it is inhibited. If activated, the neuron sends an output on its outgoing weights to all connected neurons, and so on.

Figure 4.1: Structure of a neural cell in the human brain

Figure 4.2: Structure of a neuron in a neural net

In a neural net, the neurons are grouped in layers, called neuron layers. Usually each neuron of one layer is connected to all neurons of the preceding and the following layer (except for the input layer and the output layer of the net). The information given to a neural net is propagated layer by layer from the input layer to the output layer through none, one or more hidden layers. Depending on the learning algorithm, it is also possible for information to be propagated backwards through the net. Figure 4.3 shows a neural net with three neuron layers. This is not the only structure of a neural net: some neural net types have no hidden layers, or their neurons are arranged as a matrix. But common to all neural net types is the presence of at least one weight matrix and the connections between two neuron layers.

Figure 4.3: Neural net with three neuron layers
There are many different types of neural networks, each having special properties. Generally it can be said that neural nets are very flexible systems for problem-solving purposes. One ability should be mentioned explicitly: the error tolerance of neural networks. That is, if a neural net has been trained for a specific problem, it will be able to recall correct results even if the problem to be solved is not exactly the same as the one already learned. Although neural nets are able to find solutions for difficult problems, the results cannot be guaranteed to be perfect or even correct; they are only approximations of a desired solution, and a certain error is always present.

4.2.1 Perceptron
The perceptron was first introduced by F. Rosenblatt in 1958. It is a very simple neural net type with two neuron layers that accepts only binary input and output values (0 or 1). The learning process is supervised and the net is used for pattern classification purposes. Figure 4.4 shows the simple Perceptron:
4.2.2 Multi Layer Perceptron
The Multi-Layer-Perceptron was first introduced by M. Minsky and S. Papert in 1969. It is an extended Perceptron and has one or more hidden neuron layers between its input and output layers, as shown in figure 4.5.

Figure 4.4: Simple Perceptron

Figure 4.5: Multi-Layer-Perceptron

4.2.3 Back Propagation Network
The Back Propagation Network was first introduced by D.E. Rumelhart, G.E. Hinton and R.J. Williams in 1986 and is one of the most powerful neural net types. It has the same structure as the Multi-Layer-Perceptron and uses the back propagation learning algorithm, as shown in figure 4.6.

Figure 4.6: Back propagation Net
In the human brain, information is passed between the neurons in the form of electrical stimulation along the dendrites. If a certain amount of stimulation is received by a neuron, it generates an output to all connected neurons, and so information takes its way to its destination, where some reaction will occur. If the incoming stimulation is too low, no output is generated by the neuron and the information's further transport will be blocked. Explaining how the human brain learns certain things is quite difficult, and nobody knows exactly how it works. It is supposed that during the learning process the connection structure among the neurons is changed, so that certain stimulations are only accepted by certain neurons. This means there exist firm connections between the neural cells that have once learned a specific fact, enabling the fast recall of this information.

If some related information is acquired later, the same neural cells are stimulated and will adapt their connection structure according to this new information. On the other hand, if specific information is not recalled for a long time, the established connection structure between the responsible neural cells grows weaker. This is what has happened when someone "forgets" a once-learned fact or can only remember it vaguely.
Unlike the biological model, a neural net has an unchangeable structure, built of a specified number of neurons and a specified number of connections between them (called "weights"), which have certain values. What changes during the learning process are the values of those weights. Compared to the original, this means: incoming information "stimulates" (exceeds a specified threshold value of) certain neurons that pass the information to connected neurons or prevent further transportation along the weighted connections. The value of a weight will be increased if information should be transported and decreased if not. While learning different inputs, the weight values are changed dynamically until they are balanced, so that each input leads to the desired output. The training of a neural net results in a matrix that holds the weight values between the neurons. Once a neural net has been trained correctly, it will probably be able to find the desired output for a given input that has been learned, by using these matrix values. Very often there is a certain error left after the learning process, so the generated output is only a good approximation of the perfect output in most cases.
4.3.1 Supervised Learning
Supervised learning is based on the system trying to predict outcomes for known examples and is a commonly used training method. It compares its predictions to the target answers and "learns" from its mistakes. The data start as inputs to the input layer neurons. The neurons pass the inputs along to the next nodes. As inputs are passed along, the weights are applied, and when the inputs reach the next node, the weighted inputs are summed and either intensified or weakened. This continues until the data reach the output layer, where the model predicts an outcome. In a supervised learning system, the predicted output is compared to the actual output for that case. If the predicted output is equal to the actual output, no change is made to the weights in the system. But if the predicted output is higher or lower than the actual outcome in the data, the error is propagated back through the system and the weights are adjusted accordingly. This feeding of errors backwards through the network is called "back-propagation". Both the Multi-Layer Perceptron and the Radial Basis Function are supervised learning techniques; the Multi-Layer Perceptron uses back-propagation, while the Radial Basis Function is a feed-forward approach that trains on a single pass of the data [11].
4.3.2 Unsupervised Learning
Neural networks which use unsupervised learning are most effective for describing data rather than predicting it. The neural network is not shown any outputs or answers as part of the training process; in fact, there is no concept of output fields in this type of system. The primary unsupervised technique is the Kohonen network. The main uses of Kohonen and other unsupervised neural systems are in cluster analysis, where the goal is to group "like" cases together.
4.3.3 Back Propagation Learning Algorithm
Back propagation is a supervised learning algorithm and is mainly used by Multi-Layer Perceptrons to change the weights connected to the net's hidden neuron layer(s). The back propagation algorithm uses a computed output error to change the weight values in the backward direction. To obtain this net error, a forward propagation phase must have been carried out first. During forward propagation, the neurons are activated using the sigmoid activation function.
Following are the steps of back propagation algorithm:
(i) Initialize the weights and biases:
• The weights in the network are initialized to random numbers from the interval [−1, 1].
• Each unit has a bias associated with it.
• The biases are similarly initialized to random numbers from the interval [−1, 1].
(ii) Feed the training sample:
Each training sample is processed by the steps given below.
(iii) Propagate the inputs forward:
We compute the net input and output of each unit in the hidden and output layers.
• For unit j in the input layer, its output is equal to its input; that is, Oj = Ij for input unit j.
• The net input to each unit in the hidden and output layers is computed as follows:

Figure 4.7: Propagation through the Hidden Layer
o Given a unit j in a hidden or output layer, the net input is:

Ij = Σi wij Oi + θj

where
wij is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer;
θj is the bias of the unit.
• Each unit in the hidden and output layers takes its net input and then applies an activation function. The function symbolizes the activation of the neuron represented by the unit. It is also called a logistic, sigmoid, or squashing function.
• Given a net input Ij to unit j, then
Oj = f(Ij), ———-(4.3)
the output of unit j, is computed using the sigmoid function as

Oj = 1 / (1 + e^(−Ij))
(iv) Back propagate the error:
• When the output layer is reached, the error is computed and propagated backwards.
• For a unit k in the output layer, the error is computed by the formula:

Errk = Ok (1 − Ok) (Tk − Ok)

Ok – actual output of unit k (computed by the activation function),
Tk – true output based on the known class label of the given training sample,
Ok(1 − Ok) – derivative (rate of change) of the activation function.
• The error is propagated backwards by updating weights and biases to reflect the error of the network classification.
• For a unit j in the hidden layer, the error is computed by the formula:

Errj = Oj (1 − Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to unit k in the next higher layer, and Errk is the error of unit k.
(v) Update weights and biases to reflect the propagated errors:
• Weights are updated by the following equations, where l is a constant between 0.0 and 1.0 reflecting the learning rate; the learning rate is fixed in our implementation.

Δwij = (l) Errj Oi
wij = wij + Δwij
• Biases are updated by the following equations:

Δθj = (l) Errj
θj = θj + Δθj
• We update the weights and biases after the presentation of each sample. This is called case updating.
• Epoch: One iteration through the training set is called an epoch.
• Epoch updating: Alternatively, the weight and bias increments could be accumulated in variables and the weights and biases updated after all of the samples of the training set have been presented.
(vi) Terminating conditions:
Stop the training when:
• All Δwij in the previous epoch were below some threshold, or
• The percentage of samples misclassified in the previous epoch is below some threshold, or
• A pre specified number of epochs have expired.
• In practice, several hundred thousand epochs may be required before the weights converge.
The algorithm ends when all output patterns match their target patterns.
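Steps (i) through (v) can be sketched with a deliberately tiny 2-2-1 network (a hypothetical illustration, not the larger classifier itself); the sigmoid activation, the [−1, 1] initialization, the error formulas and the case-updating rule follow the algorithm above, while the network size and learning rate are assumptions:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyMLP:
    """A 2-2-1 multi-layer perceptron trained with the back propagation
    rules of steps (i)-(v)."""

    def __init__(self, seed=1):
        rnd = random.Random(seed)
        r = lambda: rnd.uniform(-1.0, 1.0)       # step (i): init in [-1, 1]
        self.w_h = [[r(), r()], [r(), r()]]      # input -> hidden weights
        self.b_h = [r(), r()]                    # hidden biases
        self.w_o = [r(), r()]                    # hidden -> output weights
        self.b_o = r()                           # output bias

    def forward(self, x):
        # step (iii): I_j = sum_i w_ij O_i + theta_j, then O_j = sigmoid(I_j)
        self.h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w_h, self.b_h)]
        self.o = sigmoid(sum(w * h for w, h in zip(self.w_o, self.h)) + self.b_o)
        return self.o

    def train_case(self, x, t, lr=0.5):
        o = self.forward(x)
        err_o = o * (1 - o) * (t - o)            # step (iv): output-layer error
        err_h = [h * (1 - h) * err_o * w         # hidden-layer error
                 for h, w in zip(self.h, self.w_o)]
        # step (v): case updating of weights and biases
        self.w_o = [w + lr * err_o * h for w, h in zip(self.w_o, self.h)]
        self.b_o += lr * err_o
        for j in range(2):
            self.w_h[j] = [w + lr * err_h[j] * xi
                           for w, xi in zip(self.w_h[j], x)]
            self.b_h[j] += lr * err_h[j]
```

Training this net on a simple target such as the logical AND function drives the squared error down over a few thousand epochs of case updating.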

We have designed an audio classifier using a layered feedforward network, a Multi-Layer Perceptron (MLP), as shown in figure 4.8. The network consists of one input layer with fourteen input neurons, one hidden layer of seven neurons, and an output layer of three output classes. Audio clips are first classified into News, Sports and Music. The Music genre is further classified into three categories: happy, angry and sad. The classifier is trained using the supervised back propagation learning algorithm. The input data to the classifier are the extracted low-level audio features. The data are divided into training data and testing data. All data must be normalized (i.e. all attribute values in the database are mapped into the interval [0, 1] or [−1, 1]). There are two basic data normalization techniques:
(i) Min-Max normalization: Min-Max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of attribute A. Min-Max normalization maps a value v of A to v′ in the range [new_minA, new_maxA] by computing:

v′ = ((v − minA) / (maxA − minA)) · (new_maxA − new_minA) + new_minA
(ii) Decimal Scaling Normalization: Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A:

v′ = v / 10^j

where j is the smallest integer such that max |v′| < 1.
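Both normalization techniques can be sketched as follows; the new range defaults of [0, 1] are an assumption:

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly map v from [min_a, max_a]
    to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def decimal_scale(values):
    """Decimal scaling: divide every value by the smallest power of 10
    that brings all of them into the interval (-1, 1)."""
    j = 0
    while any(abs(v) / 10 ** j >= 1 for v in values):
        j += 1
    return [v / 10 ** j for v in values]
```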