Sound Detection
Derek Hoiem
Rahul Sukthankar (mentor)
August 24, 2004

Objective
• Learn model of sound object from few (10-20) examples and distinguish from all other sounds
• Examples of sound classes:
  - Gunshots, screams, laughter, car horns, meow, dog bark, etc.

Applications
• “Tell me if you hear a gunshot.” (monitoring)
• “Get me video clips containing dogs barking.” (search and retrieval)
• “What’s going on?” (scene understanding)

Why it's difficult
• Sound classes have large variations
• Sounds are often ambiguous without context
• Overlaid “noise” obscures sound

Sound or not?
Which of these sounds are not from their named classes?
• Car horn
• Dog bark
• Laser gun

Previous work
• Sound Classification (Wold 1996, Casey 2001, etc.)
  - Categorize short sound clips
  - Reasonable accuracy (5-20% error)
• Sound Detection (Dufaux 2000, Piamsa-nga 1999)
  - Localize and recognize sound objects in long clips
  - Poor performance or assumption of unrealistic conditions (e.g., very quiet background)

Detection via Windowed Search
[Diagram: Long Track → Clip 1, Clip 2, …, Clip N → Clip Classifier]
• Break audio track into short overlapping clips
• Independently classify short clips as object or non-object
• Return locations of detected sound object
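
The three steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `windowed_search`, the parameters `clip_len` and `hop`, and the toy energy-based classifier `loud` are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def windowed_search(
    track: Sequence[float],
    clip_len: int,
    hop: int,
    classify: Callable[[Sequence[float]], bool],
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of the clips the classifier accepts."""
    if clip_len <= 0 or hop <= 0:
        raise ValueError("clip_len and hop must be positive")
    detections = []
    # Choosing hop < clip_len makes consecutive clips overlap.
    for start in range(0, len(track) - clip_len + 1, hop):
        clip = track[start:start + clip_len]
        if classify(clip):  # every clip is classified independently
            detections.append((start, start + clip_len))
    return detections


# Hypothetical stand-in classifier: "object" means high mean energy.
def loud(clip: Sequence[float]) -> bool:
    return sum(x * x for x in clip) / len(clip) > 0.5


track = [0.0] * 8 + [1.0] * 4 + [0.0] * 8
hits = windowed_search(track, clip_len=4, hop=2, classify=loud)
print(hits)
```

In practice the stand-in `loud` would be replaced by the trained clip classifier, and overlapping detections would likely be merged before reporting locations.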

Representation
[Figure labels: Raw, Representation, Features; examples shown: meows, phone rings]
• Time-frequency analysis: windowed Fourier transform
• Extract power percentage in each band over time and total power over time
• Compute features used for classification
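
A rough sketch of the time-frequency step, assuming a Hann window, a naive DFT, and equal-width frequency bands. None of these choices are stated on the slide, and the names `frame_power_spectrum` and `band_power_over_time` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def frame_power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power at each non-negative frequency bin of one Hann-windowed frame."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, x in enumerate(frame)
    ]
    power = []
    for k in range(n // 2 + 1):  # naive DFT, fine for a small illustration
        coeff = sum(
            w * cmath.exp(-2j * math.pi * k * i / n) for i, w in enumerate(windowed)
        )
        power.append(abs(coeff) ** 2)
    return power


def band_power_over_time(
    signal: Sequence[float], frame_len: int, hop: int, n_bands: int
) -> Tuple[List[List[float]], List[float]]:
    """Per frame: fraction of power in each band, and total power."""
    fractions, totals = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        power = frame_power_spectrum(signal[start:start + frame_len])
        edges = [b * len(power) // n_bands for b in range(n_bands + 1)]
        bands = [sum(power[edges[b]:edges[b + 1]]) for b in range(n_bands)]
        total = sum(bands)
        fractions.append([p / total if total > 0 else 0.0 for p in bands])
        totals.append(total)
    return fractions, totals


# A low-frequency tone: nearly all power should land in the lowest band.
tone = [math.sin(2 * math.pi * 2 * i / 32) for i in range(64)]
fractions, totals = band_power_over_time(tone, frame_len=32, hop=16, n_bands=4)
```

A real system would use an FFT routine rather than the quadratic-time DFT loop shown here.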

Classification Features
• Diverse feature set: Different sound classes are distinctive in different ways
  - Means and standard deviations of power at different frequencies
  - Bandwidth, peaks, loudness, etc.
  - 138 features in all
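
To make the first two feature types concrete, here is a small sketch that summarizes a clip's per-frame band fractions and total power into a handful of named features. The feature names and the toy input are illustrative assumptions; the actual 138 features are not enumerated in the slides.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize_clip(
    band_fractions: List[List[float]], total_power: List[float]
) -> Dict[str, float]:
    """Mean and standard deviation of each band over time, plus loudness statistics."""
    features: Dict[str, float] = {}
    n_bands = len(band_fractions[0])
    for b in range(n_bands):
        series = [frame[b] for frame in band_fractions]
        features[f"band{b}_mean"] = mean(series)
        features[f"band{b}_std"] = pstdev(series)
    features["loudness_mean"] = mean(total_power)
    features["loudness_std"] = pstdev(total_power)
    features["loudness_peak"] = max(total_power)
    return features


# Toy clip: three frames, two frequency bands (values are made up).
clip_bands = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]
clip_power = [1.0, 3.0, 2.0]
feats = summarize_clip(clip_bands, clip_power)
```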

Classification by Decision Trees
• Try to find simple rules that discriminate object from non-object
• Each decision is based on a threshold of a feature value
• Assign confidence based on likelihood of data for object and non-object classes at each leaf node
[Figure labels: Decision nodes, Leaf nodes]
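
A sketch of how such a tree could score a clip: each decision node thresholds one feature, and each leaf turns its training counts into a confidence. The smoothed log-odds formula, the class names `Leaf` and `Decision`, and the example tree are assumptions for illustration; the slides do not give the exact confidence formula.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    n_object: int  # training clips of the target class that reached this leaf
    n_other: int   # training clips of all other sounds that reached this leaf

    def confidence(self) -> float:
        # Assumed formula: log-odds with add-one smoothing.
        return math.log((self.n_object + 1) / (self.n_other + 1))


@dataclass
class Decision:
    feature: int      # index into the feature vector
    threshold: float
    below: "Node"     # followed when x[feature] <= threshold
    above: "Node"     # followed when x[feature] > threshold


Node = Union[Leaf, Decision]


def score(node: Node, x: Sequence[float]) -> float:
    """Walk from the root to a leaf and return that leaf's confidence."""
    while isinstance(node, Decision):
        node = node.below if x[node.feature] <= node.threshold else node.above
    return node.confidence()


# Hypothetical tree: x[0] = low-frequency power fraction, x[1] = amplitude range.
tree = Decision(0, 0.3, below=Decision(1, 0.5, Leaf(2, 10), Leaf(12, 1)), above=Leaf(1, 30))
object_like = score(tree, [0.1, 0.9])   # positive: leaf dominated by object clips
background = score(tree, [0.8, 0.9])    # negative: leaf dominated by other sounds
```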

Boosted Trees
• Problem: One decision tree by itself may not be a great classifier
• Solution: Use several trees, with each one focusing on the mistakes of previously learned trees
• AdaBoost:
  - Weight training data uniformly
  - Learn a decision tree classifier on weighted data
  - Re-weight data, giving more weight to incorrectly classified examples
• Final classification based on linear combination of confidences from all learned decision trees
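
The AdaBoost loop above can be sketched as follows. To keep it short, this version is a simplification: it uses one-level trees (decision stumps) that vote +1 or -1 rather than full trees with leaf confidences, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def fit_stump(X, y, w) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(
                    wi for xi, yi, wi in zip(X, y, w) if stump_predict(stump, xi) != yi
                )
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X, y, rounds: int) -> List[Tuple[float, Stump]]:
    n = len(X)
    w = [1.0 / n] * n  # weight training data uniformly
    ensemble = []
    for _ in range(rounds):
        stump, err = fit_stump(X, y, w)  # learn on the weighted data
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Re-weight: misclassified examples gain weight, correct ones lose it.
        w = [
            wi * math.exp(-alpha * yi * stump_predict(stump, xi))
            for xi, yi, wi in zip(X, y, w)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def ensemble_score(ensemble, x) -> float:
    """Linear combination of the individual classifiers' votes."""
    return sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)


# Toy data no single threshold can separate: the negatives sit in the middle.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
```

On this toy set a single stump misclassifies at least two points, while the three-round combination separates all six, which is the behavior the "Solution" bullet describes.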

Examples of Decision Trees
• Meow: Low percentage of power in low frequencies in mid-time of sound
• Gunshot: High power amplitude range; very high power amplitude range
• Gunshot: More complex tree that focuses on examples misclassified by tree above

Cascade of Classifiers
• Goal: eliminate false positives with few false negatives in early stages
• Advantages:
  - Allows use of large set of negative training examples
  - Improves classification speed
• Dangers: cannot recover from false negatives
[Diagram: Sound Clip → Stage 1 → Stage 2 → Stage 3; each stage either passes or fails the clip. Pass labels in the figure: 5%, 2%, 0.005%]
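
A minimal sketch of cascade evaluation, assuming each stage is a scoring function with a pass threshold. The three stages below are made-up stand-ins, not the boosted classifiers from the talk; `run_cascade` is a hypothetical name.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[Sequence[float]], float]
Stage = Tuple[Scorer, float]  # (scoring function, minimum score needed to pass)


def run_cascade(stages: List[Stage], features: Sequence[float]) -> Tuple[bool, int]:
    """Return (accepted, number of stages evaluated)."""
    for index, (score, threshold) in enumerate(stages, start=1):
        if score(features) < threshold:
            return False, index  # rejected here; later stages never run
    return True, len(stages)


# Hypothetical stages: a cheap, loose test first, stricter tests afterwards.
stages: List[Stage] = [
    (lambda f: f[0], 0.2),
    (lambda f: f[0] + f[1], 1.0),
    (lambda f: min(f), 0.4),
]

accepted = run_cascade(stages, [0.5, 0.8])      # passes every stage
early_reject = run_cascade(stages, [0.1, 5.0])  # fails stage 1 despite a high stage-2 score
late_reject = run_cascade(stages, [0.3, 0.9])   # survives two stages, fails the third
```

The `early_reject` case shows both the speed advantage (only one stage is evaluated) and the danger named above: a clip rejected early is never reconsidered.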

Results: Classification Error
(Slide legend: Best Performance, Worst Performance)

              stage 1          stage 2          stage 3
              pos     neg      pos     neg      pos     neg
meow          0.0%    1.4%     0.0%    1.2%     2.2%    0.8%
phone         0.0%    0.4%     4.3%    0.1%     5.9%    0.0%
car horn      0.0%    3.9%     0.6%    2.2%     3.6%    1.3%
swords        6.1%    1.3%     6.7%    0.1%     6.7%    0.0%
scream        0.3%    5.5%     2.7%    1.4%     5.3%    1.1%
dog bark      0.7%    1.0%     6.0%    0.3%     7.7%    0.2%
laser gun     0.0%    6.8%     4.4%    5.1%     6.7%    0.9%
light saber   4.8%    6.8%     9.7%    1.0%    13.9%    0.2%
gunshot       8.1%    6.1%    12.5%    2.3%    14.5%    1.1%
close door    7.9%    7.8%    14.5%    4.8%    17.6%    2.3%
male laugh    4.3%   14.7%     9.5%    9.7%    13.3%    7.0%
average       2.9%    4.4%     6.0%    2.2%     8.5%    1.1%

Rows with one value missing in the source (values in source order; column alignment unknown):
door bell     1.4%  2.1%  0.4%  6.3%  0.1%
explosion     4.1%  5.2%  7.5%  12.0%  0.5%

Results: ROC curves
Note: to approximate negative error rate, divide FP by 25,000
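
The conversion in the note is a single division. In the sketch below the false-positive count of 50 is purely illustrative and is not a result from the talk; only the divisor 25,000 comes from the slide.

```python
NEGATIVE_DIVISOR = 25_000  # divisor given in the slide note


def negative_error_rate(false_positives: int, divisor: int = NEGATIVE_DIVISOR) -> float:
    """Approximate negative error rate from a raw false-positive count."""
    return false_positives / divisor


rate = negative_error_rate(50)  # 50 is a made-up count for illustration
print(f"{rate:.1%}")
```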

Results: Anecdotal
• Gunshots
• Female Laugh
• Swords
• Male Laugh
• Scream