Greg Leung, Juan David Dominguez, and Ben Sandeen

EECS 352
Prof. Bryan Pardo
Northwestern University

View the Project on GitHub BenSandeen/surveillance_sound_classifier

Contact the creators:

Greg
GregoryLeung2016@u.northwestern.edu

Juan
JuanDominguez2017@u.northwestern.edu

Ben
BenjaminSandeen2016@u.northwestern.edu

Surveillance system sound classifier

The goal of our project is to develop a system that can recognize and distinguish between threatening/threat-indicating and non-threatening/non-threat-indicating sounds. This can be useful for surveillance systems, helping them effectively monitor their surroundings and report sounds that are often indicative of malfeasance. This is important because it can help keep a lone security officer from falling into a stupor while staring at handfuls of monitors and potentially listening to sound feeds as well. One can imagine that such a job, day after day, almost invariably becomes dreadfully boring, leading to reduced performance. This is where our system should be able to contribute the most.

We envision our system acting as a filter of sorts. It constantly monitors incoming audio feeds and analyzes the sounds it detects. Should the system detect a sound, say, a scream, that often indicates someone is in danger, it alerts a human security worker, who can analyze the situation further and, should the detection indeed prove to be a real issue, determine the best course of action.

This system uses a decision tree to classify segments of sound based on their features, such as spectral composition, attack (how quickly a single sound reaches its peak volume), MFCCs (Mel Frequency Cepstral Coefficients), spectral centroid, root mean square energy, chromagram, average chroma value, and standard deviation of chroma values. The decision tree algorithm is a good fit for our purposes because it tends to implicitly ignore less useful features. The result is a system that takes an unknown file, segments it, and then gives a classification for each of the segments. This is useful because a recording may not contain a scream for its full duration, so it pays to classify parts of it individually.
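The sketch below shows roughly how such a per-segment pipeline can be built, assuming librosa for feature extraction and scikit-learn's DecisionTreeClassifier; it is an illustration rather than the repository's exact code, and the file name and silence threshold are our own placeholders.

```python
import numpy as np
import librosa
from sklearn.tree import DecisionTreeClassifier

def file_features(path, win_s=0.040, hop_s=0.020):
    """Return one feature row per 40 ms segment of a recording."""
    y, sr = librosa.load(path, sr=44100)
    n_fft = int(win_s * sr)                      # 1764 samples per window
    hop = int(hop_s * sr)                        # 882 samples (50 percent overlap)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr,
                                                 n_fft=n_fft, hop_length=hop)
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr,
                                         n_fft=n_fft, hop_length=hop)
    # Columns: 13 MFCCs, spectral centroid, RMS energy,
    # mean chroma value, standard deviation of chroma values
    return np.vstack([mfcc, centroid, rms,
                      chroma.mean(axis=0, keepdims=True),
                      chroma.std(axis=0, keepdims=True)]).T

feats = file_features("train_scream_001.wav")    # hypothetical training file
feats = feats[feats[:, 14] > 1e-3]               # drop near-silent segments (col 14 = RMS)
clf = DecisionTreeClassifier()                   # hyperparameters discussed below
# clf.fit(X, y_labels)  # X, y_labels gathered across all labelled training files
```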

Audio features were extracted from 40 millisecond segments of each recording, with a 20 millisecond hop size (50 percent overlap), from 44100 Hz WAV files (1764 and 882 samples, respectively). Segments that contained little to no sound were not used to train the classifier. The following image is a summary of how our system works.


Below are several chromagrams generated from each of our different sound source types (bangs and other manmade/artificial sounds, screams, singing, and birds/other natural sounds). Each row consists of five randomly selected sound files from the same category.

Below are several MFCCs (Mel Frequency Cepstral Coefficients), generated analogously to the chromagrams. Note that the plots in the corresponding positions between the chromagrams and MFCCs are most likely NOT plots of the same sound file.
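For reference, a chromagram and an MFCC plot like the ones above can be generated roughly as follows; this is a minimal sketch assuming librosa and matplotlib, with a placeholder file name and plotting choices that are not necessarily those used for our figures.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("scream_example.wav", sr=44100)   # hypothetical file name
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
librosa.display.specshow(chroma, y_axis="chroma", x_axis="time", ax=ax1)
ax1.set_title("Chromagram")
librosa.display.specshow(mfcc, x_axis="time", ax=ax2)
ax2.set_title("MFCC")
plt.tight_layout()
plt.show()
```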

We also hope to incorporate the attack, or rapidity of onset, of sounds across these different sources. Note that the attack is effectively how steep the initial incline of the blue area is. A high slope indicates a sound that very rapidly reaches its peak loudness; a good example of such a sound is a percussionist hitting a drum. As our preliminary results clearly show, our category "bangs and other manmade sounds" consists largely of sounds with a very rapid attack. Below is another wall of plots displaying the attack of randomly selected sounds.
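As a rough proxy for what we mean by attack, one could measure the time from the onset of appreciable energy to the peak of the RMS envelope; the sketch below is a simplified illustration, not the exact computation behind our plots, and the 10% energy threshold is an assumption.

```python
import numpy as np
import librosa

def attack_time(path):
    """Approximate attack as the time from sound onset to peak RMS energy."""
    y, sr = librosa.load(path, sr=44100)
    rms = librosa.feature.rms(y=y, frame_length=1764, hop_length=882)[0]
    onset = np.argmax(rms > 0.1 * rms.max())   # first frame with appreciable energy
    peak = np.argmax(rms)                      # frame of peak loudness
    return (peak - onset) * 882 / sr           # seconds; smaller = sharper attack
```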

Another feature, which we did not end up using in our classifier, is the autocorrelation. This measures a signal's self-similarity over many time delays. A sound that is quite self-similar over time will tend to have a higher autocorrelation (in our plots, this means there will be more blue area). The closest thing we used to capture this was the root mean square energy of the signal, which is an acceptable approximation given the short time frame.
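For completeness, the autocorrelation of a single segment could be computed along these lines; this is a sketch, with a placeholder file name and a normalization choice of our own.

```python
import numpy as np
import librosa

y, sr = librosa.load("bang_example.wav", sr=44100)   # hypothetical file name
segment = y[:1764]                                   # one 40 ms segment
ac = librosa.autocorrelate(segment)                  # self-similarity at each lag
ac = ac / ac[0]                                      # normalize so lag 0 == 1
# The feature we actually used instead compresses the segment to one number:
rms = np.sqrt(np.mean(segment ** 2))                 # root mean square energy
```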



We trained our model on 23618 individual segments from about 750 sound files (evenly divided between those that contain screams and those that do not). Using 7-fold cross-validation, we trained several models with varying maximum depths and minimum leaf population percentages; both parameters were tuned to avoid overfitting. We ultimately chose a model with a maximum depth of 5 and a minimum leaf population of 15% of the training samples. Its 7-fold cross-validation produced an average correct classification rate of 0.8135 with a standard deviation of 0.0620.
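The parameter sweep can be reproduced roughly as follows with scikit-learn. The grid values below are illustrative assumptions, apart from the chosen maximum depth of 5 and minimum leaf fraction of 0.15; the random arrays stand in for the real feature matrix and labels so the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(23618, 17))            # placeholder for the real feature matrix
y_labels = rng.integers(0, 2, size=23618)   # placeholder scream / non-scream labels

for depth in (3, 5, 7, 10):                 # illustrative grid values
    for min_leaf in (0.05, 0.10, 0.15, 0.20):
        clf = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=min_leaf)
        scores = cross_val_score(clf, X, y_labels, cv=7)
        print(depth, min_leaf, scores.mean(), scores.std())

# The model we kept: max depth 5, minimum leaf population of 15% of the samples
best = DecisionTreeClassifier(max_depth=5, min_samples_leaf=0.15).fit(X, y_labels)
```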

The following shows the output of our system for an 8-second file containing 4 seconds of scream followed by 4 seconds of silence. Each 40 millisecond segment is represented on the horizontal axis, with the confidence rating of scream versus non-scream on the vertical axis.
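A confidence trace like that can be read off the fitted tree's class probabilities, roughly as follows; this sketch reuses the hypothetical file_features helper and fitted classifier from the earlier snippets, and the file name is a placeholder.

```python
import matplotlib.pyplot as plt

feats = file_features("scream_then_silence.wav")   # hypothetical 8-second test file
confidence = best.predict_proba(feats)[:, 1]       # P(scream) for each 40 ms segment

plt.plot(confidence)
plt.xlabel("40 ms segment index")
plt.ylabel("Confidence that the segment is a scream")
plt.ylim(0, 1)
plt.show()
```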