IEEE International Conference on Data Engineering ICDE 2019


IEEE International Conference on Data Engineering (ICDE), 2019.
Partitioned Data Security on Outsourced Sensitive and Nonsensitive Data
Sharad Mehrotra (1), Shantanu Sharma (1), Jeffrey D. Ullman (2), and Anurag Mishra (1)
(1) University of California, Irvine, USA
(2) Stanford University, USA

Secure Data Outsourcing
• Use cryptographic mechanisms to protect sensitive data in the cloud
• Can we design an outsourcing solution that is simultaneously
  • Efficient: significantly better than downloading the encrypted data, and
  • Secure: comparable to downloading the data and processing it locally?

Roadmap
• State of the art in secure data outsourcing
• Partitioned computing & corresponding security properties
• Binning algorithm to achieve partitioned security
• Performance results

Data/Computation Outsourcing over the Years
• Keyword Search over Encrypted Documents [IEEE SP 2000, ACNS 04, Crypto 08, Crypto 09, …]
• SQL over Encrypted Data [ICDE 02, SIGMOD 02, VLDB 04, Eurocrypt 03, SIGMOD 04, Crypto 11, STOC 09, SOSP 11, …]
• MPC and Secret Sharing [CACM 79, Eurocrypt 14, 15, 17, VLDB 17, Tech 19]
  • More secure, but orders of magnitude worse in performance than plaintext processing
• Secure Hardware [CIDR 13, Usenix Security 15, IEEE SP 15, 17, NSDI 18]
  • Not secure, and software techniques to make such solutions secure are inefficient
  • Coarse-grain page faults, branch shadow, cache-line attacks
• Solutions represent points in the spectrum of possibilities
  • Explore tradeoffs between generality, security, and efficiency
[Figure: a trusted enclave (entered and exited via Ecall/Ocall) next to Process 1 and Process 2, with the OS, page table, cache, and encrypted data. The adversary can observe the cache lines and the page table accesses.]

Cryptographic Techniques: Security Threats & Performance
• DSSE: Distributed Searchable Symmetric Encryption (PULSAR by Stealth)
• MPC: Multi-party computation (Jana by Galois)
• Opaque: SGX-based solution [Zheng et al., NSDI, 2017]
• Legend: a mark represents that a technique is resilient to a given attack.
• Cryptographic overheads: selecting a single row from the TPC-H Customer table of 1.5M rows and 8 columns
  • Searchable encryption: ~2 orders of magnitude
  • Secure hardware: ~3-4 orders of magnitude
  • MPC-based solution: ~5-6 orders of magnitude

Data Sensitivity & Outsourcing
• Organizational data is often only partially sensitive [refs in paper]
• Sensitivity is dictated by policies
• Sensitivity dictates what data is outsourced and in what form
  • E.g., general office emails are possibly not sensitive (hence outsourced)
  • Information related to a sensitive project is sensitive (hence not outsourced in plaintext)
• Can we exploit the partially sensitive nature of data to scale cryptographic solutions without compromising the security of sensitive data?
• Commercial encrypted database solutions (e.g., Jana by Galois) are beginning to explore such solutions

Key Insight: Partial Sensitivity of Data (1)
• Data about entry/exit from buildings is possibly sensitive (inference about time spent at work)
• Location within an office building is possibly not sensitive
• Surveillance video is sensitive if a visitor prefers not to be monitored (OK to know the visitor is not in the frame, but not if the visitor is in the frame!)
• Partial sensitivity is also true for other domains
  • http://cybersecurity.ieee.org/blog/2015/11/13/identify-sensitive-data-and-how-they-should-be-handled/
  • https://digitalguardian.com/
• Can we exploit partial sensitivity to develop efficient (yet secure) solutions to scale secure computing and/or data sharing?

Key Insight: Partial Sensitivity of Data (2)
• Existing work on data classification
  • Inference detection using graph-based semantic data modeling [Hinke, IEEE SP, 88]
  • User-defined relationships between sensitive and non-sensitive data [Smith, IEEE SP, 90]
  • Sensitive patterns hiding using sanitization matrix [Lee et al., COMPSAC, 2004]
  • Common knowledge-based association rules [Li et al., DASFAA, 2007]
  • Constraints-based mechanisms
• Objectives of finding data sensitivity
  • Data sharing while keeping sensitive data at the trusted user
  • Multi-level secure data access
  • Allowing data to be used for mining while also preserving the confidentiality of the data

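Partitioned Computations
Sensitive Data Ds:
  Tuple   Name       Department
  t1      E(Adam)    E(Defense)
  t2      E(John)    E(Security)
  t3      E(Clark)   E(Crypto)
  t4      E(Lisa)    E(Defense)
Non-sensitive Data Dns:
  Tuple   Name       Department
  t5      Adam       Testing
  t6      John       Testing
  t7      Lisa       Design
  t8      Clark      Design
[Figure: a query Q is split into Qs, executed over Ds, and Qns, executed over Dns; the partial answers (As is the answer from the sensitive side) are combined into the answer A.]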
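The split-and-merge flow on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `E` is a string placeholder that only mimics the slide's E(...) notation and performs no encryption, and the names `run_qs`, `run_qns`, and `run_q` are hypothetical.

```python
def E(value):
    """Placeholder that mirrors the slide's E(...) notation; NOT real encryption."""
    return f"E({value})"

# Sensitive relation Ds (outsourced encrypted) and non-sensitive relation Dns (plaintext).
DS = {
    "t1": (E("Adam"), E("Defense")),
    "t2": (E("John"), E("Security")),
    "t3": (E("Clark"), E("Crypto")),
    "t4": (E("Lisa"), E("Defense")),
}
DNS = {
    "t5": ("Adam", "Testing"),
    "t6": ("John", "Testing"),
    "t7": ("Lisa", "Design"),
    "t8": ("Clark", "Design"),
}

def run_qs(name):
    """Qs: selection over Ds; stands in for any cryptographic search technique."""
    token = E(name)
    return [tid for tid, (enc_name, _) in DS.items() if enc_name == token]

def run_qns(name):
    """Qns: ordinary plaintext selection over Dns."""
    return [tid for tid, (plain_name, _) in DNS.items() if plain_name == name]

def run_q(name):
    """Q: run both sub-queries and merge the partial answers into A."""
    return run_qs(name) + run_qns(name)

print(run_q("John"))  # ['t2', 't6']
```

Only `run_qs` would pay the cryptographic overhead; `run_qns` runs at plaintext speed, which is the source of the performance gain.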

Leakage due to Partitioned Computing…
Data: the same Ds (t1 to t4, encrypted) and Dns (t5 to t8, plaintext) tables as in Partitioned Computations.
Query: retrieve John's rows
Adversarial view:
  Query value   Tuples retrieved from sensitive side   Tuples retrieved from non-sensitive side
  John          t2                                     t6
The adversary learns that t2 is John's row.

What if we use access-pattern-hiding techniques?
Data: the same Ds (t1 to t4, encrypted) and Dns (t5 to t8, plaintext) tables as in Partitioned Computations.
Query: retrieve John's rows
Adversarial view:
  Query value   Tuples retrieved from sensitive side   Tuples retrieved from non-sensitive side
  John          E(…)                                   t6
The output size reveals that one of John's records is sensitive.
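Both leakage cases can be replayed with a small sketch. It assumes the adversary sees the plaintext query value on the non-sensitive side and, on the sensitive side, either the retrieved tuple ids or only their count; the function names and dictionaries below are hypothetical.

```python
# Tuple ids per name, taken from the example tables on the slides.
SENSITIVE = {"Adam": ["t1"], "John": ["t2"], "Clark": ["t3"], "Lisa": ["t4"]}
NON_SENSITIVE = {"Adam": ["t5"], "John": ["t6"], "Lisa": ["t7"], "Clark": ["t8"]}

def adversarial_view(name, hide_access_pattern=False):
    """What the cloud observes for one naive partitioned query on `name`."""
    s_ids = SENSITIVE.get(name, [])
    ns_ids = NON_SENSITIVE.get(name, [])
    # With access-pattern hiding, only the number of sensitive tuples is visible.
    sensitive_side = len(s_ids) if hide_access_pattern else list(s_ids)
    return {
        "query_value": name,  # visible because Qns is sent in plaintext
        "sensitive_side": sensitive_side,
        "non_sensitive_side": list(ns_ids),
    }

def inference(view):
    """What the adversary can conclude from a single view."""
    name = view["query_value"]
    side = view["sensitive_side"]
    if isinstance(side, int):
        return f"{side} encrypted tuple(s) belong to {name}"
    return [f"{tid} is a row of {name}" for tid in side]

print(inference(adversarial_view("John")))
print(inference(adversarial_view("John", hide_access_pattern=True)))
```

Hiding the access pattern removes the tuple-level link but still leaks the output size, which motivates retrieving whole bins instead of exact matches.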

Partitioned Data Security
• Non-Linkability
  • The adversary does not learn the relationship between any encrypted value and any plaintext value
• Ciphertext Indistinguishability
  • The adversary does not learn any relationship between encrypted values
  • unless the underlying cryptographic technique allows such relationships to be learnt (e.g., OPE)

Secure Partitioned Computation (1)
• Data is partitioned into bins
  • Non-sensitive data is partitioned into non-sensitive bins (NSB)
  • Sensitive data is partitioned into sensitive bins (SB)
• A query Q for a value y is mapped to all values in the bins corresponding to y
  • Retrieves all data in NSB(y) over non-sensitive data
  • Retrieves all data in SB(y) over sensitive data
[Figure: Ds is divided into sensitive bins SB(x), SB(y), SB(z) holding E(x), E(y), E(z); Dns is divided into non-sensitive bins NSB(x), NSB(y), NSB(z) holding x, y, z.]
Adversarial view:
  Query value   Tuples retrieved from sensitive side   Tuples retrieved from non-sensitive side
  John          SB(y)                                  NSB(y)

Secure Partitioned Computation (2)
[Figure: Ds is divided into sensitive bins SB(x), SB(y), SB(z) holding E(x), E(y), E(z); Dns is divided into non-sensitive bins NSB(x), NSB(y), NSB(z) holding x, y, z.]
• Bins are created such that for each pair of sensitive and non-sensitive bins s & ns, there exists a value v such that s = SB(v) and ns = NSB(v)
• The adversarial view does not allow learning linkability between sensitive and non-sensitive records
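The pairing condition on this slide can be checked mechanically. The sketch below assumes bins are identified by integers and that `sb_of` and `nsb_of` (hypothetical names) map each queryable value v to SB(v) and NSB(v); it is an illustration of the stated property, not code from the paper.

```python
from itertools import product

def covers_all_bin_pairs(sb_of, nsb_of):
    """True if every (sensitive bin, non-sensitive bin) pair equals
    (SB(v), NSB(v)) for at least one value v."""
    values = sb_of.keys() & nsb_of.keys()
    seen = {(sb_of[v], nsb_of[v]) for v in values}
    all_pairs = set(product(set(sb_of.values()), set(nsb_of.values())))
    return seen == all_pairs

# Layout where each sensitive bin is paired with each non-sensitive bin.
good_sb = {"a": 0, "b": 0, "c": 1, "d": 1}
good_nsb = {"a": 0, "b": 1, "c": 0, "d": 1}

# Layout where SB 0 is only ever fetched with NSB 0, and SB 1 with NSB 1.
bad_nsb = {"a": 0, "b": 0, "c": 1, "d": 1}

print(covers_all_bin_pairs(good_sb, good_nsb))  # True
print(covers_all_bin_pairs(good_sb, bad_nsb))   # False
```

In the second layout a sensitive bin is always co-retrieved with the same non-sensitive bin, so the adversary can narrow down which plaintext values the encrypted bin corresponds to.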

Secure Partitioned Computation (3)
[Figure: Ds is divided into sensitive bins SB(x), SB(y), SB(z) holding E(x), E(y), E(z); Dns is divided into non-sensitive bins NSB(x), NSB(y), NSB(z) holding x, y, z.]
• Associating each sensitive bin with each non-sensitive bin prevents
  • Leakage through joint access of data
  • Output-size attacks
• Workload-skew attacks can be prevented through (careful) addition of (minimal) fake queries

Query Binning
• Assumptions
  • Equal number of sensitive and non-sensitive attribute values
  • Each distinct attribute value appears in at most one tuple in the sensitive data and at most one tuple in the non-sensitive data
  • The number of values is a product of approximately equal factors
*** The paper relaxes all these assumptions

The Algorithm: One Tuple Per Value
Bin Creation: Inputs: S and NS
• Permute all sensitive values
• Find approximately square factors of |NS| = x * y such that x ≥ y
• Create x sensitive bins, each containing at most y values
• Create |NS|/x non-sensitive bins
• Assign the ith sensitive value to the (i mod x)th sensitive bin
• Assigning non-sensitive values: if the ith sensitive bin holds a sensitive value at its jth position, assign the corresponding non-sensitive value to the ith position of the jth non-sensitive bin
  • NSB[j][i] ← allocateNS(SB[i][j])
• Fill the remaining NS values into the free positions
Example (|S| = 6, |NS| = 6, x = 3, y = 2):
  S = {S1, S2, S3, S4, S5, S6}
  NS = {NS1, NS2, NS3, NS4, NS6, NS7}
  Sensitive bins: SB1 = {S1, S4}, SB2 = {S2, S5}, SB3 = {S3, S6}
  Non-sensitive bins: NSB1 = {NS1, NS2, NS3}, NSB2 = {NS4, NS7, NS6}
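The bin-creation steps can be sketched as follows. This is a minimal sketch under assumptions: the input list of sensitive values is treated as already permuted, `counterpart` (a hypothetical parameter) maps a sensitive value to its associated non-sensitive value when one exists, and the bin contents used in the example are inferred from the bin layouts and sizes shown on the slides.

```python
import math

def approx_square_factors(n):
    """Return (x, y) with x * y == n, x >= y, and y as close to sqrt(n) as possible."""
    y = math.isqrt(n)
    while n % y != 0:
        y -= 1
    return n // y, y

def create_bins(sensitive, non_sensitive, counterpart):
    x, y = approx_square_factors(len(non_sensitive))
    assert len(sensitive) <= x * y, "each sensitive bin may hold at most y values"

    # The k-th sensitive value (0-based) goes to sensitive bin k mod x.
    sb = [[] for _ in range(x)]
    for k, s in enumerate(sensitive):
        sb[k % x].append(s)

    # NSB[j][i] receives the non-sensitive counterpart of SB[i][j].
    nsb = [[None] * x for _ in range(y)]
    placed = set()
    for i, sensitive_bin in enumerate(sb):
        for j, s in enumerate(sensitive_bin):
            ns = counterpart.get(s)
            if ns is not None:
                nsb[j][i] = ns
                placed.add(ns)

    # Fill the free positions with the remaining non-sensitive values.
    leftovers = [ns for ns in non_sensitive if ns not in placed]
    for row in nsb:
        for i in range(x):
            if row[i] is None and leftovers:
                row[i] = leftovers.pop(0)
    return sb, nsb

S = ["S1", "S2", "S3", "S4", "S5", "S6"]
NS = ["NS1", "NS2", "NS3", "NS4", "NS6", "NS7"]
PAIRS = {"S1": "NS1", "S2": "NS2", "S3": "NS3", "S4": "NS4", "S6": "NS6"}

sb, nsb = create_bins(S, NS, PAIRS)
print(sb)   # [['S1', 'S4'], ['S2', 'S5'], ['S3', 'S6']]
print(nsb)  # [['NS1', 'NS2', 'NS3'], ['NS4', 'NS7', 'NS6']]
```

In a deployment the permutation of the sensitive values would have to stay secret with the data owner; the identity order is used here only so the output matches the slide's example.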

The Algorithm: One Tuple Per Value
Bin Retrieval: Input: Query(w)
• If w is in a sensitive bin SB[i][j], then
  • Retrieve the ith sensitive bin and the jth non-sensitive bin
• If w is in a non-sensitive bin NSB[i][j], then
  • Retrieve the ith non-sensitive bin and the jth sensitive bin
Example (|S| = 6, |NS| = 6, x = 3, y = 2):
  S = {S1, S2, S3, S4, S5, S6}
  NS = {NS1, NS2, NS3, NS4, NS6, NS7}
  Sensitive bins: SB1 = {S1, S4}, SB2 = {S2, S5}, SB3 = {S3, S6}
  Non-sensitive bins: NSB1 = {NS1, NS2, NS3}, NSB2 = {NS4, NS7, NS6}
  Query: S2 → SB2, NSB1
  Query: NS7 → NSB2, SB2
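The retrieval rule can be sketched directly from the two cases on this slide. The bin contents below are an assumption reconstructed from the example layout (SB1 = {S1, S4}, SB2 = {S2, S5}, SB3 = {S3, S6}), and `bins_for_query` is a hypothetical name.

```python
SB = [["S1", "S4"], ["S2", "S5"], ["S3", "S6"]]
NSB = [["NS1", "NS2", "NS3"], ["NS4", "NS7", "NS6"]]

def bins_for_query(w):
    """Return the (sensitive bin, non-sensitive bin) labels to fetch for value w."""
    # Case 1: w sits at position j of sensitive bin i -> fetch SB i and NSB j.
    for i, sensitive_bin in enumerate(SB):
        if w in sensitive_bin:
            j = sensitive_bin.index(w)
            return f"SB{i + 1}", f"NSB{j + 1}"
    # Case 2: w sits at position j of non-sensitive bin i -> fetch NSB i and SB j.
    for i, non_sensitive_bin in enumerate(NSB):
        if w in non_sensitive_bin:
            j = non_sensitive_bin.index(w)
            return f"SB{j + 1}", f"NSB{i + 1}"
    raise KeyError(f"{w} is not assigned to any bin")

print(bins_for_query("S2"))   # ('SB2', 'NSB1')
print(bins_for_query("S3"))   # ('SB3', 'NSB1')
print(bins_for_query("NS3"))  # ('SB3', 'NSB1')
```

A query for S3 and a query for NS3 fetch the same pair of bins, so the retrieved bins alone do not tell the adversary which value in them was requested.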

Query Execution Cost on Outsourced Data
  Technique                                  Time
  SGX                                        10500x
  Query Binning + SGX (60% sensitivity)      8929x
  Multi-party computation (Jana)             954363x
  Query Binning + Jana (60% sensitivity)     680131x
• Resilient-to-attacks columns: Size, Workload-skew, Access-patterns. A mark in a column shows that a technique is resilient to the given attack.
• x is the time to search a predicate in cleartext.
• Experiments are conducted over 1.5M rows.

Experimental Results (Selection Query)
• X-axis = data sensitivity (1%, 20%, 40%, 60%)
• Y-axis = time
[Plot: SGX Opaque + partitioned computing vs. SGX Opaque; data set size = 6M rows]
[Plot: Jana MPC + partitioned computing vs. Jana MPC; data set size = 1M rows]

Analytical Model
• When is query binning (QB) better than a pure cryptographic approach?
• Quantities in the model:
  • Ratio of the cost of QB to the cost of the crypto-only approach
  • Ratio of sensitive data
  • Average query selectivity
  • Ratio of the computation cost of cryptographic techniques to that of plaintext processing, per tuple
  • Ratio of cryptographic computation cost to communication cost, per tuple (typically much greater than 1 for strong cryptographic techniques)
• After several rounds of simplifications (see paper) and under ideal assumptions, QB is better than the cryptographic-only solution if a condition on these quantities holds (see paper)

Query Binning Extensions
• What if there are no approximately square factors?
  • Select the nearest square number
• What if there is no 1-to-1 mapping between sensitive and non-sensitive values, and the values differ in size?
  • Bin-packing algorithm
• What about range queries?
  • With the help of a modified B-tree created over non-sensitive bins
• What about join queries?
  • Keep pseudo-sensitive data with sensitive data
• What about aggregation queries?
  • Execute like a selection query without tuple fetching

Distinct Values are not a Product of Approximately Square Factors (1)
• What happens when the number of distinct values is not a product of approximately square factors?
  • The communication cost increases
• For example, 82 non-sensitive values result in 41 sensitive bins and 2 non-sensitive bins
  • Sensitive bins: SB1 = {E(s1)}, SB2 = {E(s2)}, …, SB41 = {E(s41)} (at most 1 value in a sensitive bin)
  • Non-sensitive bins: NSB1 = {ns1, ns2, …, ns41}, NSB2 = {ns42, ns43, …, ns82} (at most 41 values in a non-sensitive bin)
  • Communication cost = 42

Distinct Values are not a Product of Approximately Square Factors (2)
• Reducing communication cost by finding the nearest square number
• In the case of 82 non-sensitive values (and 41 sensitive values), 81 is the nearest square number
• Thus, create 9 sensitive bins and 9 non-sensitive bins
  • Sensitive bins: SB1 = {… E(x) …}, SB2 = {… E(y) …}, …, SB9 = {… E(z) …} (at most 5 values in a sensitive bin)
  • Non-sensitive bins: NSB1 = {ns1, ns2, …, ns10}, NSB2 = {ns11, ns12, …, ns19}, …, NSB9 = {ns74, ns75, …, ns82} (at most 10 values in a non-sensitive bin)
  • Communication cost = 15
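The two communication costs can be reproduced with a short calculation. The sketch assumes, as in the slide's example, that the cost of one query is the largest number of values in one sensitive bin plus the largest number of values in one non-sensitive bin; the function names are hypothetical.

```python
import math

def exact_factors(n):
    """Return (x, y) with x * y == n, x >= y, and y as close to sqrt(n) as possible."""
    y = math.isqrt(n)
    while n % y:
        y -= 1
    return n // y, y

def nearest_square_root(n):
    """Return r such that r * r is the square number closest to n."""
    lo = math.isqrt(n)
    return lo if n - lo * lo <= (lo + 1) ** 2 - n else lo + 1

def communication_cost(num_s, num_ns, num_sb, num_nsb):
    """Values fetched per query: one full sensitive bin plus one full non-sensitive bin."""
    return math.ceil(num_s / num_sb) + math.ceil(num_ns / num_nsb)

num_s, num_ns = 41, 82

# Exact factorization: 82 = 41 * 2, giving 41 sensitive and 2 non-sensitive bins.
x, y = exact_factors(num_ns)
print(x, y, communication_cost(num_s, num_ns, x, y))  # 41 2 42

# Nearest square: 81 = 9 * 9, giving 9 sensitive and 9 non-sensitive bins.
r = nearest_square_root(num_ns)
print(r, communication_cost(num_s, num_ns, r, r))     # 9 15
```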

The Algorithm: General Case: Multiple Tuples per Value (1)
• What happens if values have different numbers of tuples?
  • The size of each sensitive bin is now different
  • Assumption: a value with more non-sensitive tuples also has more associated sensitive tuples
  • From the number of tuples retrieved, the adversary learns which bin contains the sensitive value corresponding to a non-sensitive value
  • E.g., retrieval of SB1 and NSB1 reveals that S1 is allocated to SB1
Example (|S| = 6, |NS| = 6, x = 3, y = 2), number of tuples per value:
  S1 = 10, S2 = 2, S3 = 1, S4 = 15, S5 = 2, S6 = 1
  NS1 = 200, NS2 = 20, NS3 = 10, NS4 = 150, NS5 = 10, NS7 = 10
  Bin    Values           Size of bin
  SB1    S1, S4           25
  SB2    S2, S5           4
  SB3    S3, S6           2
  NSB1   NS1, NS2, NS3    230
  NSB2   NS4, NS7, NS6    170

The Algorithm: General Case: Multiple Tuples per Value (2)
• What happens if values have different numbers of tuples?
  • Solution: simply add fake tuples to the sensitive bins. We add 44 fake tuples to the sensitive data
  • Problem: too many fake tuples, which increases the communication cost
  • So how do we overcome this problem?
Example (|S| = 6, |NS| = 6, x = 3, y = 2), number of tuples per value:
  S1 = 10, S2 = 2, S3 = 1, S4 = 15, S5 = 2, S6 = 1
  NS1 = 200, NS2 = 20, NS3 = 10, NS4 = 150, NS5 = 10, NS7 = 10
  Bin    Values           Size of bin   Added fake tuples
  SB1    S1, S4           25            0
  SB2    S2, S5           4             21
  SB3    S3, S6           2             23
  NSB1   NS1, NS2, NS3    230
  NSB2   NS4, NS7, NS6    170
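The 44 fake tuples follow from padding every sensitive bin up to the size of the largest one. A minimal sketch, assuming the bin contents SB1 = {S1, S4}, SB2 = {S2, S5}, SB3 = {S3, S6} implied by the bin sizes 25, 4, and 2 (`fake_tuples_per_bin` is a hypothetical name):

```python
# Number of sensitive tuples per value, from the slide.
TUPLES = {"S1": 10, "S2": 2, "S3": 1, "S4": 15, "S5": 2, "S6": 1}
BINS = [["S1", "S4"], ["S2", "S5"], ["S3", "S6"]]

def fake_tuples_per_bin(bins, tuples):
    """Pad every bin to the size of the largest bin; return the padding per bin."""
    sizes = [sum(tuples[v] for v in b) for b in bins]
    target = max(sizes)
    return [target - size for size in sizes]

fake = fake_tuples_per_bin(BINS, TUPLES)
print(fake, sum(fake))  # [0, 21, 23] 44
```

Because the two largest values (S4 and S1) share a bin, the other bins need heavy padding, which is what the bin-packing variant avoids.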

The Algorithm: General Case: Multiple Tuples per Value (3)
• What happens if values have different numbers of tuples?
  • Solution: bin-packing-based approach
    • Sorting: sort all the values in decreasing order of the number of tuples
    • Allocate sensitive values
    • Add fake tuples
    • Allocate non-sensitive values as shown previously
  • We add fewer fake tuples than the simple solution: 17 vs. 44 fake tuples
Example (|S| = 6, |NS| = 6, x = 3, y = 2), number of tuples per value:
  S1 = 10, S2 = 2, S3 = 1, S4 = 15, S5 = 2, S6 = 1
  NS1 = 200, NS2 = 20, NS3 = 10, NS4 = 150, NS5 = 10, NS7 = 10
  After sorting: S4 = 15, S1 = 10, S2 = 2, S5 = 2, S3 = 1, S6 = 1
  Bin    Values           Size before adding fake tuples   Added fake tuples
  SB1    S4, S6           16                               0
  SB2    S1, S3           11                               5
  SB3    S2, S5           4                                12
  NSB1   NS7, NS1, NS2
  NSB2   NS6, NS3, NS5
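One way to reproduce the slide's allocation is a greedy rule: after sorting, place each value into the currently least-loaded sensitive bin that still has a free position. This rule is an assumption chosen because it matches the example; the paper's exact packing procedure may differ, and `pack_sensitive_bins` is a hypothetical name.

```python
# Number of sensitive tuples per value, from the slide.
TUPLES = {"S1": 10, "S2": 2, "S3": 1, "S4": 15, "S5": 2, "S6": 1}

def pack_sensitive_bins(tuples, x, y):
    """Greedy packing into x bins of at most y values each (needs len(tuples) <= x * y)."""
    bins = [[] for _ in range(x)]
    loads = [0] * x
    # Sort values in decreasing order of their number of tuples (stable for ties).
    for value in sorted(tuples, key=lambda v: -tuples[v]):
        open_bins = [b for b in range(x) if len(bins[b]) < y]
        target = min(open_bins, key=lambda b: loads[b])  # least-loaded open bin
        bins[target].append(value)
        loads[target] += tuples[value]
    # Pad every bin up to the largest load with fake tuples.
    fake = [max(loads) - load for load in loads]
    return bins, loads, fake

bins, loads, fake = pack_sensitive_bins(TUPLES, x=3, y=2)
print(bins)             # [['S4', 'S6'], ['S1', 'S3'], ['S2', 'S5']]
print(loads)            # [16, 11, 4]
print(fake, sum(fake))  # [0, 5, 12] 17
```

Spreading the largest values across different bins lowers the padding target from 25 to 16 tuples per bin, which is why the total drops from 44 to 17 fake tuples.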

Range Queries
• A full binary tree is constructed over all non-sensitive values
• Bins are created for each level of the tree, except the root node
• Bins are retrieved based on least-matching
• For example, a range query from ns8 to ns12
[Figure: bins for each node at each level of the tree; for the example query, bins as per nodes ns23 and ns8]

Conclusion
• Existing cryptographic techniques are orders of magnitude slower than cleartext processing
• Differentiating between sensitive and non-sensitive data can make cryptographic techniques faster
  • By avoiding expensive cryptographic operations on non-sensitive data
• However, naïve query execution on partitioned data can lead to information leakage
  • Partitioned security
• Query binning (QB)
  • Implements partitioned security
  • While ensuring efficiency
• Interesting side effect of QB:
  • Makes existing cryptographic techniques more secure