Hearing without Listening
Collaborators: Manas Pathak (CMU), Shantanu Rane (MERL), Paris Smaragdis (UIUC), Madhusudana Shashanka (UTC), Jose Portelo (INESC)
Bhiksha Raj
Pittsburgh, PA 15213
21 April 2011
About Me
• Bhiksha Raj
• Assoc. Prof. in LTI
• Areas:
  o Computational audition
  o Speech recognition
• Was at Mitsubishi Electric Research prior to CMU
  o Led research effort on speech processing
A Voice Problem
• A call center handles calls relating to various company services and products
• The call center records customer responses
  o Thousands of hours of customer calls
  o Must be mined for improved response to market conditions and improved customer response
• Audio records carry confidential information
  o Social security numbers
  o Names, addresses
  o Cannot be revealed!
Mining Audio
• Can be dealt with as a problem of keyword spotting
  o Must find words/phrases such as “problem”, “disk controller”
  o Data holder willing to provide a list of terms to search for
• Must not discover anything that was not specified!
  o E.g. social security numbers, names, addresses, other identifying information
• Trivial solution: ship the keyword spotter to the data holder
  o Unsatisfactory
  o Does not prevent the operator of the spotter from having access to private information
• Work on encrypted data?
A more important problem
• ABC News, Oct 2008: “Inside Account of U.S. Eavesdropping on Americans”
• “Despite pledges by President George W. Bush and American intelligence officials to the contrary, hundreds of US citizens overseas have been eavesdropped on as they called friends and family back home...”
The Problem
• Security:
  o NSA must monitor calls for public safety
    § Caller may be a known miscreant
    § Call may relate to planning subversive activity
• Privacy:
  o Terrible violation of individual privacy
• The gist of the problem:
  o NSA is possibly looking for key words or phrases
    § Did we hear “bomb the pentagon”?
  o Or for whether some key people are calling in
    § Was that Ayman al Zawahiri’s voice?
  o But must have access to all audio to do so
• Similar to the mining problem
Abstracting the problem
• Data holder willing to provide encrypted data
  o A locked box
• Mining entity willing to provide encrypted keywords
  o Sealed packages
• Must find if the keywords occur in the data
  o Find if the contents of the sealed packages are also present in the locked box
    § Without unsealing the packages or opening the box!
• Data are spoken audio
Privacy Preserving Voice Processing
• These problems are examples of the need for privacy-preserving voice processing algorithms
• The majority of the world’s communication has historically been done by voice
• Voice is considered a very private mode of communication
  o Eavesdropping is a social no-no
  o Many places that allow recording of images in public disallow recording of voice!
• Yet little current research on how to maintain the privacy of voice...
The History of Privacy Preserving Technologies for Voice
The History of Counter Espionage Techniques against Private Voice Communication
Voice Database Privacy: Not Covered
• Can “link” uploaders of videos to Flickr, YouTube etc., by matching the audio recordings
  o “Matching the uploaders of videos across accounts”
    § Lei, Choi, Janin, Friedland, ICSI Berkeley
• More generally, can identify common “anonymous” speakers across databases
  o And, through their links to other identified subjects, can possibly identify them as well...
• Topic for another day
Automatic Speech Processing Technologies
• Lexical content comprehension
  o Recognition
    § Determining the sequence of words that was spoken
  o Keyterm spotting
    § Detecting if specified terms have occurred in a recording
• Biometrics and non-lexical content recognition
  o Identification
    § Identifying the speaker
  o Verification/Authentication
    § Confirming that a speaker is who he/she claims to be
Audio Problems involve Pattern Matching
• Biometrics:
  o Speaker verification/authentication:
    § Binary classification: does the recording belong to speaker X or not?
  o Speaker identification:
    § Multi-class classification: who (if anyone) from a given known set is the speaker?
• Recognition:
  o Key-term spotting:
    § Binary classification: is the current segment of audio an instance of the keyword?
  o Continuous speech recognition:
    § Multi- or infinite-class classification
Simplistic View of Voice Privacy: Learning what is not intended
• User is willing to permit the system to have access to the intended result, but not anything else
• Recognition/spotting system allowed access to the “text” content of the message
  o But the system should not be able to identify the speaker, his/her mood, etc.
• Biometric system allowed access to the identity of the speaker
  o But should not be able to learn the text content as well
    § E.g. discover a passphrase
Signal Processing Solutions to the simple problems
• Voice is produced by a source-filter mechanism
  o Lungs are the source
  o Vocal tract is the filter
• Mathematical models allow us to separate the two
• The source contains much of the information about the speaker
• The filter contains information about the sounds
Eliminating Speaker Identity
• Modifying the pitch
  o Lowering it makes the voice somewhat unrecognizable
  o Of course, one can undo this
• Voice morphing
  o Map the entire speech (source and filter) to a neutral third voice
  o Even so, the cadence gives identity away
Keep the voice, corrupt the content
• Much tougher
  o The “excitation” also actually carries spoken content
• Cannot destroy the content without corrupting the excitation
Signal Processing is not the Solution
• Often no “acceptable to expose” data
  o Voice recognition:
    § Medical transcription: HIPAA rules
    § Court records
    § Police records
  o Keyword spotting:
    § Signal processing inapplicable
    § Cannot modify only the undesired portions of the data
  o Biometrics:
    § Identification: data holder may not know who is being identified (NSA)
    § Verification/Authentication: many problems
What follows
• A related problem:
  o Song matching
• Two voice matching problems
  o Keyword spotting
  o Voice authentication
    § Cryptographic approach
    § An alternative approach
• Issues and questions
• Caveat: there will be squiggles (cannot be avoided)
  o Will be kept to a minimum
A related problem
• Alice has just found a short piece of music on the web
  o Possibly from an illegal site!
• She likes it. She would like to find out the name of the song
Alice and her song
• Bob has a large, organized catalogue of songs
• Simple solution:
  o Alice sends her song snippet to Bob
  o Bob matches it against his catalogue
  o Bob returns the ID of the song that has the best match to the snippet
What Bob does
[Figure: Bob’s catalogue of songs S1, S2, S3, …, SM, with snippet W slid across each song]
• Bob uses a simple correlation-based procedure
• Receives snippet W = w[0]..w[N]
• For each song S, at each lag t:
  o Finds Correlation(W, S, t) = Σ_i w[i] s[t+i]
  o Score(S) = maximum correlation: max_t Correlation(W, S, t)
• Returns the song with the largest correlation
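Bob's matching loop can be sketched in plain (non-private) Python; the function and variable names here are illustrative, not from the talk:

```python
# Plaintext sketch of Bob's correlation-based matcher: slide the snippet W
# across each song at every lag, score each song by its best-lag
# correlation, and return the best-scoring song's ID.

def correlation(w, s, t):
    """Correlation(W, S, t) = sum_i w[i] * s[t+i]."""
    return sum(w[i] * s[t + i] for i in range(len(w)))

def score(w, s):
    """Score(S) = max over lags t of Correlation(W, S, t)."""
    return max(correlation(w, s, t) for t in range(len(s) - len(w) + 1))

def best_match(w, catalogue):
    """Return the ID of the song whose best-lag correlation is largest."""
    return max(catalogue, key=lambda song_id: score(w, catalogue[song_id]))
```

This is exactly the computation that the protocols below re-express over encrypted data.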
Relationship to Speech Recognition
• Words are like Alice’s snippets
• Word spotting: find locations in a recording where patterns that match the word occur
  o Computing a “match” score is analogous to finding the correlation with an example (or the best of many examples) of the word
• Continuous speech recognition: repeated word spotting
Alice has a problem
• Her snippet may have been illegally downloaded
• She may go to jail if Bob sees it
  o Bob may be the DRM police...
Solving Alice’s Problem
• Alice could encrypt her snippet and send it to Bob
• Bob could work entirely on the encrypted data
  o And obtain an encrypted result to return to Alice!
• A job for SFE (secure function evaluation)
  o Except we won’t use a circuit
Correlation is the root of the problem
• The basic operation performed is a correlation
  o Corr(W, S, t) = Σ_i w[i] s[t+i]
• This is actually a dot product:
  o W = [w[0] w[1] .. w[N]]^T
  o S_t = [s[t] s[t+1] .. s[t+N]]^T
  o Corr(W, S, t) = W · S_t
• Bob can compute Encrypt[Corr(W, S, t)] if Alice sends him Encrypt[W]
• Assumption: all data are integers
  o Encryption is performed over large enough finite fields to hold all the numbers
Paillier Encryption
• Alice uses Paillier encryption
  o Random primes P, Q; N = PQ; G = N+1
  o L = (P−1)(Q−1); M = L^−1 mod N
  o Public key = (N, G), private key = (L, M)
  o Encrypt message m from (0…N−1), with a random r:
    § Enc[m] = G^m r^N mod N^2
    § Dec[u] = { [ (u^L mod N^2) − 1 ] / N } M mod N
• Important property:
  o Enc[m] behaves like G^m (up to the randomizer r^N)
  o Enc[m] · Enc[v] = G^m G^v · (…) = Enc[m+v]
  o (Enc[m])^v = Enc[vm]
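A toy version of the scheme can be written straight from these formulas. The tiny hard-coded primes below are for illustration only; real Paillier requires large random primes, and this sketch omits all the care a real implementation needs:

```python
import random
from math import gcd

# Toy Paillier with the simplified generator G = N+1, as on the slide.
# Tiny fixed primes -- illustration only, NOT secure.
P, Q = 293, 433
N = P * Q
N2 = N * N
G = N + 1
L = (P - 1) * (Q - 1)          # private
M = pow(L, -1, N)              # M = L^-1 mod N (private)

def encrypt(m):
    """Enc[m] = G^m * r^N mod N^2, for a random r coprime to N."""
    r = random.randrange(1, N)
    while gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(G, m % N, N2) * pow(r, N, N2)) % N2

def decrypt(u):
    """Dec[u] = { [ (u^L mod N^2) - 1 ] / N } * M mod N."""
    return ((pow(u, L, N2) - 1) // N) * M % N
```

The homomorphic properties from the slide hold directly: multiplying ciphertexts adds plaintexts, and raising a ciphertext to a power multiplies the plaintext.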
Solving Alice’s Problem
• Alice generates public and private keys
• She encrypts her snippet using her public key and sends it to Bob:
  o Alice → Bob: Enc[W] = Enc[w[0]], Enc[w[1]], … Enc[w[N]]
• To compute the dot product with any segment s[t]..s[t+N]:
  o Bob computes Π_i Enc[w[i]]^s[t+i] = Enc[Σ_i w[i]s[t+i]] = Enc[Corr(W, S, t)]
• Bob can compute the encrypted correlations from Alice’s encrypted input without ever holding the private key
• The above technique is the Secure Inner Product (SIP) protocol
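The SIP step can be sketched with a toy Paillier (tiny illustrative primes; assumes nonnegative integer samples small enough to fit in the field):

```python
import random
from math import gcd

# Toy Paillier parameters (illustration only, NOT secure).
P, Q = 293, 433
N = P * Q; N2 = N * N; G = N + 1
L = (P - 1) * (Q - 1); M = pow(L, -1, N)

def encrypt(m):
    r = random.randrange(1, N)
    while gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(G, m % N, N2) * pow(r, N, N2)) % N2

def decrypt(u):
    return ((pow(u, L, N2) - 1) // N) * M % N

def encrypt_snippet(w):
    """Alice encrypts her snippet W element by element."""
    return [encrypt(wi) for wi in w]

def secure_correlation(enc_w, s, t):
    """Bob, holding only ciphertexts, computes
       prod_i Enc[w[i]]^s[t+i] = Enc[sum_i w[i]*s[t+i]] = Enc[Corr(W, S, t)]."""
    acc = 1
    for i, ew in enumerate(enc_w):
        acc = (acc * pow(ew, s[t + i], N2)) % N2
    return acc
```

Bob performs only multiplications and exponentiations modulo N^2, and at no point learns Alice's samples.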
Solving Alice’s Problem - 2
• Given the encrypted snippet Enc[W] from Alice
• Bob computes Enc[Corr(S, W, t)] for all S and all t
  o i.e. he computes the correlation of Alice’s snippet with all of his songs at all lags
  o The correlations are encrypted, however
• Bob only has to find which Corr(S, W, t) is the highest
  o And send the metadata for the corresponding S to Alice
• Except Bob cannot find argmax_S max_t Corr(S, W, t)
  o He only has encrypted correlation values
Bob Solves the Problem
• Bob ships the encrypted correlations back to Alice
  o She can decrypt them all, find the maximum value, and send the index back to Bob
  o Bob retrieves the song corresponding to the index
  o Problem: Bob effectively sends his catalogue over to Alice
• Problems for Bob
  o Alice may be a competitor pretending to be a customer
    § She is phishing to find out what Bob has in his catalogue
  o She can find out what songs Bob has by comparing the correlations against those from a catalogue of her own
    § Even if Bob sends the correlations in permuted order
The SMAX protocol
• Alice sends her public key to Bob
• Bob adds a random number to each correlation
  o Enc[Corr_noisy(W, S, t)] = Enc[Corr(W, S, t)] · Enc[r] = Enc[Corr(W, S, t) + r]
  o A different random number corrupts each correlation
• Bob permutes the noisy correlations and sends them to Alice
  o Bob → Alice: P(Enc[Corr_noisy(W, S, t)]) for all S, t
    § P is a permutation
• Alice can decrypt them to get P(Corr_noisy(W, S, t))
• Bob retains P(−r), i.e. the permuted set of random numbers
The SMAX protocol
• Bob has P(−r), Alice has P(Corr(W, S, t) + r)
• Bob generates a public/private key pair, encrypts P(−r), and sends it along with his public key to Alice
  o Bob → Alice: Enc_Bob[P(−r)]
• Alice encrypts the noisy correlations she has and “adds” them to what she got from Bob:
  o Enc_Bob[P(Corr(W, S, t) + r)] · Enc_Bob[P(−r)] = P(Enc_Bob[Corr(W, S, t)])
• Alice now has encrypted and permuted correlations
The SMAX protocol
• Alice now has encrypted and permuted correlations
  o P(Enc_Bob[Corr(W, S, t)])
• Alice adds a random number q:
  o P(Enc_Bob[Corr(W, S, t)]) · Enc_Bob[q] = P(Enc_Bob[Corr(W, S, t) + q])
  o The same number q corrupts everything
• Alice permutes everything with her own permutation scheme and ships it all back to Bob:
  o Alice → Bob: P_Alice(P(Enc_Bob[Corr(W, S, t) + q]))
The SMAX protocol
• Bob has: P_Alice(P(Enc_Bob[Corr(W, S, t) + q]))
• Bob decrypts them and finds the index of the largest value, I_max
  o The common offset q shifts every value equally, so it does not change which entry is largest
• Bob unpermutes I_max and sends the result to Alice:
  o Bob → Alice: P^−1(I_max)
• Alice unpermutes it with her own permutation to get the true index of the song in Bob’s catalogue
  o P_Alice^−1(P^−1(I_max)) = Idx
• The above technique is insecure as stated; in reality, the process is more involved
• Alice finally has the index in Bob’s catalogue for her song
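The blind-and-permute bookkeeping behind SMAX can be illustrated without any cryptography. This sketch drops Alice's second permutation and stands in plain subtraction for the homomorphic removal of Bob's noise; it only shows why per-item noise plus a common offset q still preserves the argmax:

```python
import random

# Non-cryptographic sketch of the SMAX idea: Bob blinds each value with its
# own random r and permutes, so Alice never sees true correlations; Alice's
# common offset q shifts everything equally, so Bob can still find the
# argmax after the noise is (homomorphically) removed.

def smax_sketch(correlations):
    n = len(correlations)
    # Bob: different random noise per value, then a secret permutation.
    r = [random.randrange(1000) for _ in range(n)]
    perm = list(range(n))
    random.shuffle(perm)
    noisy = [correlations[perm[i]] + r[perm[i]] for i in range(n)]
    # Alice: Bob's noise is removed under encryption in the real protocol;
    # plain subtraction stands in for that step here. Then one common q.
    q = random.randrange(1000)
    masked = [noisy[i] - r[perm[i]] + q for i in range(n)]
    # Bob: argmax is unchanged by the common offset q.
    i_max = max(range(n), key=lambda i: masked[i])
    return perm[i_max]          # undo Bob's permutation: true index
```

Whatever random values are drawn, the returned index is always the index of the true maximum correlation.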
Retrieving the Song
• Alice cannot simply send the index of the best song to Bob
  o It will tell Bob which song it is
    § The song may not be available for public download
    § i.e. Alice’s snippet is illegal
Retrieving Alice’s song: OT
• Alice sends Enc[Idx] to Bob (along with her public key)
  o Alice → Bob: Enc[Idx]
• For each song S_i, Bob picks a random r_i and computes
  o Enc[M(S_i)] · (Enc[−i] · Enc[Idx])^r_i = Enc[M(S_i) + r_i(Idx − i)]
  o M(S_i) is the metadata for S_i
  o Bob sends this to Alice for every song
    § Bob → Alice: Enc[M(S_i) + r_i(Idx − i)]
• Alice only decrypts the Idx-th entry to get M(S_Idx)
  o Decrypting the rest is pointless
    § Decrypting any other entry will give her M(S_i) + r_i(Idx − i), masked by the random r_i
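The retrieval formula can be exercised with the same toy Paillier (tiny illustrative primes; song metadata is encoded as small integers here, and all names are illustrative):

```python
import random
from math import gcd

# Toy Paillier parameters (illustration only, NOT secure).
P, Q = 293, 433
N = P * Q; N2 = N * N; G = N + 1
L = (P - 1) * (Q - 1); Minv = pow(L, -1, N)

def encrypt(m):
    r = random.randrange(1, N)
    while gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(G, m % N, N2) * pow(r, N, N2)) % N2

def decrypt(u):
    return ((pow(u, L, N2) - 1) // N) * Minv % N

def bob_response(enc_idx, metadata):
    """For each i, Bob computes Enc[M(S_i)] * (Enc[-i] * Enc[Idx])^r_i
       = Enc[M(S_i) + r_i*(Idx - i)], without learning Idx."""
    out = []
    for i, m in enumerate(metadata):
        r_i = random.randrange(1, N)
        blinded = (encrypt(-i) * enc_idx) % N2        # Enc[Idx - i]
        out.append((encrypt(m) * pow(blinded, r_i, N2)) % N2)
    return out
```

At i = Idx the blinding term r_i(Idx − i) vanishes, so that entry decrypts to the true metadata; every other entry decrypts to a value masked by r_i.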
Unrealistic solution
• The proposed method is computationally unrealistic
  o Public-key encryption is very slow
  o Repeated shipping of the entire catalogue
    § Can be simplified to one shipping of the catalogue
    § Still inefficient
• Illustrates a few primitives
  o SIP: secure inner product; computing dot products of private data
  o SMAX: finding the index of the largest entry without revealing the entries
    § Also possible: SVAL, finding the largest entry (and nothing else)
  o OT as a mechanism for retrieving information
• Have ignored a few issues...
Security Picture
• Assumption: honest but curious
• What do they gain by being malicious?
  o Bob: manipulating numbers
    § Can prevent Alice from learning what she wants
    § Bob learns nothing of Alice’s snippet
  o Alice:
    § Can prevent herself from learning the identity of her song
• Key assumption:
  o Sufficient if malicious activity does not provide immediate benefit to the agent
Pattern matching for voice: Basics
[Figure: audio → Feature Computation → Pattern Matching]
• The basic approach includes
  o A feature computation module
    § Derives features from the audio
  o A pattern matching module
    § Stores “models” for the various classes
• Assumption: feature computation is performed by the entity holding the voice data
Pattern matching in Audio: The basic tools
• Gaussian mixtures
  o Class distributions are generally modelled as Gaussian mixtures
  o “Score” computation is equivalent to computing log likelihoods
    § Rather than computing correlations
• Hidden Markov models
  o Time series models are usually hidden Markov models
  o With Gaussian mixture distributions as state output densities
    § Conceptually not substantially more complex than Gaussian mixtures
Computing a log Gaussian
• The Gaussian likelihood for a d-dimensional vector X, given a Gaussian with mean m and covariance Q, is
  o P(X) = (2π)^(−d/2) |Q|^(−1/2) exp( −(1/2)(X − m)^T Q^(−1) (X − m) )
• The log Gaussian is
  o log P(X) = −(1/2) log( (2π)^d |Q| ) − (1/2)(X − m)^T Q^(−1) (X − m)
• With a little arithmetic manipulation, the quadratic can be expanded into a weighted sum of the products X_i X_j, the components X_i, and a constant
Computing a log Gaussian
• Which can be further modified to log P(X) = W′ · X′, where
  o W′ is a vector containing all the Gaussian parameters (derived from m and Q)
  o X′ is a vector containing all pairwise products of the components of X, the components themselves, and a constant 1
• In other words, the log Gaussian is an inner product
  o If Bob has the parameters W′ and Alice has the data X′, they can compute the log Gaussian without exposing the data, using SIP
  o By extension, log likelihoods of mixtures of Gaussians can be computed privately
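The inner-product rewriting can be checked numerically. This sketch (names illustrative) builds the extended data vector X′ of pairwise products, components, and a 1, the matching parameter vector W′, and compares their dot product against the direct log Gaussian formula:

```python
import numpy as np

def gaussian_expansion(x):
    """X' = [all pairwise products x_i*x_j, the components x_i, 1]."""
    return np.concatenate([np.outer(x, x).ravel(), x, [1.0]])

def gaussian_weights(m, Q):
    """W' such that W' . X' = log N(x; m, Q)."""
    Qinv = np.linalg.inv(Q)
    d = len(m)
    const = (-0.5 * (m @ Qinv @ m)
             - 0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Q))))
    return np.concatenate([(-0.5 * Qinv).ravel(), Qinv @ m, [const]])

def log_gaussian(x, m, Q):
    """Direct evaluation of the log Gaussian, for comparison."""
    d = len(m)
    diff = x - m
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Q))
                   + diff @ np.linalg.inv(Q) @ diff)
```

In the private setting, Bob would hold W′ and Alice X′, and the dot product would be evaluated with SIP instead of in the clear.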
Keyword Spotting
• Very similar to song matching
• The “snippet” is now the keyword
• A “song” is now a recording
• Instead of finding a best match, we must find every recording in which a “correlation” score exceeds a threshold
• Complexity is comparable to song retrieval
  o Actually worse
    § We use HMMs instead of sample sequences, and correlations are replaced by best-state-sequence scores
A Biometric Problem: Authentication
• Passwords and text-based forms of authentication are physically insecure
  o Passwords may be stolen, guessed, etc.
• Biometric authentication is based on “non-stealable” signatures
  o Fingerprints
  o Iris
  o Voice
    § We will ignore the problem of voice mimicry...
Voice Authentication
• User identifies himself
  o May also authenticate himself/herself via text-based authentication
• System issues a challenge
  o User often asked to say a chosen sentence
  o Alternately, user may say anything
• System compares the voiceprint to the stored model for the user
  o And authenticates the user if the print matches
Authentication
• Most common approach
  o Gaussian mixture distribution for the user
    § Learned from enrollment data recorded by the user
  o Gaussian mixture distribution for the “background”
    § Learned from a large number of speakers
  o Compare the log likelihood under the “speaker” Gaussian mixture with that under the “background” Gaussian mixture to authenticate
• Privately done:
  o Compute all log likelihoods on encrypted data Enc[X]
  o Engage in a protocol to determine which is higher
    § To authenticate the user without “seeing” the user’s data
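The plaintext version of this likelihood-ratio test is easy to sketch (no encryption here; diagonal covariances, and the threshold and parameter shapes are illustrative assumptions):

```python
import numpy as np

def log_gmm(X, weights, means, variances):
    """Sum over frames x of log sum_k w_k N(x; m_k, diag(v_k))."""
    total = 0.0
    for x in X:
        comps = [np.log(w)
                 - 0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
                 for w, m, v in zip(weights, means, variances)]
        total += np.logaddexp.reduce(comps)
    return total

def authenticate(X, speaker_gmm, background_gmm, threshold=0.0):
    """Accept iff log P(X | speaker) - log P(X | background) > threshold."""
    return log_gmm(X, *speaker_gmm) - log_gmm(X, *background_gmm) > threshold
```

The private variant computes the same two log likelihoods on Enc[X] via the inner-product form and then compares them under a secure comparison protocol.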
Private Voice Authentication
• Protecting the user’s voice is perhaps less critical
  o Actually opens up security issues not present in the “open” mechanism
    § OT can be manipulated and must be protected from a malicious user
  o May be useful for surveillance-type scenarios
    § The NSA problem
• Bigger problem
  o The “authenticating” site may be phishing
    § Using the learned speaker models to synthesize fake recordings of the speaker
    § To spoof the speaker at other voice authentication sites
Private Voice Authentication
• Assumption:
  o User is using a client on a computer or a smartphone
    § Which can be used for encryption and decryption
• Protocol for a modified learning procedure:
  o Learn the user model from encrypted recordings
    § The system finally obtains a set of encrypted models
    § Cannot decrypt the models: unable to phish
• Assumption 2:
  o Motivated user: will not provide fake data during enrollment
Authenticating with encrypted models
• The authentication procedure changes
  o The system can no longer compute dot products on encrypted data
    § Since the models are themselves encrypted
• The user must perform some of the operations
  o Without being allowed to observe partial results
• Malicious user
  o An imposter who has stolen Alice’s smartphone
    § Has her encryption keys!
    § Can manipulate data sent by Alice during the shared computation
Malicious User
• User intends to break into the system
  o Is happy if the probability of false acceptance increases even just slightly
    § Improves his chance of breaking in
• System:
  o Is satisfied if manipulating partial results of the computation results in a false acceptance rate no greater than that obtained with random data
Protecting from the Malicious User
• Modify the likelihood computation
• Instead of computing
  o L(X, W_speaker) and L(X, W_background)
• Compute
  o J(X, g(W_speaker, W_background))
  o J(X, h(W_speaker, W_background))
  o Such that P(L(X, W_speaker) > L(X, W_background)) is comparable to that for random X if the user manipulates intermediate results
• The system “unravels” these to obtain Enc[L(X, W_speaker)] and Enc[L(X, W_background)]
• Finally engages in a millionaire protocol designed for malicious users to decide whether the user is authentic
Speaker Authentication
• Ignoring many issues w.r.t. the malicious user
• Computation overhead is large
• Signal processing techniques: simple, but insecure
• Cryptographic methods: secure, but complex
  o Is there a golden mean?
An Efficient Mechanism
• Find a method to convert a voice recording to a sequence of symbols
• The sequence must be such that:
  o The sequence s(X) for any recording X must belong to a small set S of sequences with high probability if the speaker is the target speaker
  o The sequence must not belong to S with high probability if the speaker is not the target speaker
• Reduces to the conventional password approach
  o S is the set of passwords for the speaker
    § Any of which can authenticate him
Converting speaker’s audio to symbol sequence
• Is as secure as a conventional password system
  o Speaker’s client converts voice to a symbol sequence
  o Symbol sequence shipped to the system as a password
  o Speaker’s “client” maintains no record of the valid set of passwords
    § Cannot be hacked
  o System only stores encrypted passwords
    § Cannot voice-phish
• Can improve false alarm/false rejection rates by using multiple mappings...
Converting the Speaker’s voice to Symbol Sequences
• Technique 1:
  o A generic HMM that represents all speech
  o Most likely state sequences are the “password” set for the speaker
• Learn model parameters to be optimal for the speaker
Converting the Speaker’s voice to Symbol Sequences
• Technique 2:
  o Locality sensitive hashing
• Hashes are “passwords”
  o Hashing must be learned to have the desired password properties
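A minimal LSH sketch: random hyperplane hashing maps nearby feature vectors to the same bit string with high probability. The talk's scheme learns the hashing per speaker; here an unlearned random projection with illustrative dimensions stands in:

```python
import numpy as np

# Random-projection LSH: one bit per hyperplane, recording which side of
# the plane the feature vector x falls on. Nearby voiceprints then agree
# on most bits, so the bit string can serve as a "password" symbol sequence.

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 8))   # 16 hash bits over 8-dim features

def lsh_hash(x, planes):
    """Bit string: '1' if x lies on the positive side of each plane."""
    return ''.join('1' if p @ x >= 0 else '0' for p in planes)
```

Note that the hash depends only on the direction of x relative to each hyperplane, so positive rescalings of x leave it unchanged.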
Other voice problems
• Problems currently being addressed:
  o Keyword detection (for mining)
  o Authentication
  o Speaker identification
• Speech recognition is a much tougher problem
  o Infinite classification
  o Requires privacy-preserving graph search
Work in progress
• Efficient/practical authentication schemes
• Efficient/practical computational schemes for voice processing
  o On the side: investigations into privacy-preserving classification schemes
  o Differential privacy / combining DP and SMC
• Better signal processing schemes
• Database privacy: issues and solutions...

QUESTIONS?