
1 OPTIMIZATION WITH PARITY CONSTRAINTS: FROM BINARY CODES TO DISCRETE INTEGRATION
Stefano Ermon*, Carla P. Gomes*, Ashish Sabharwal+, and Bart Selman*
*Cornell University, +IBM Watson Research Center
UAI 2013

2 High-dimensional integration
• High-dimensional integrals arise in statistics, ML, and physics:
  • Expectations / model averaging
  • Marginalization
  • Partition function / rank models / parameter learning
• Curse of dimensionality: on an n-dimensional hypercube with L grid points per dimension, quadrature involves a weighted sum over L^n items (e.g., units of volume), i.e., exponentially many.

3 Discrete Integration
• We are given:
  • A set of 2^n items
  • Non-negative weights w
• Goal: compute the total weight
• The weight function is compactly specified in factored form (Bayes net, factor graph, CNF, ...)
• (Figure: 2^n items whose size visually represents their weight.)
• Example 1: n = 2 variables, sum over 4 items with weights 5, 0, 2, 1. Goal: compute 5 + 0 + 2 + 1 = 8
• Example 2: n = 100 variables, sum over 2^100 ≈ 10^30 items (intractable by enumeration)
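To make the setup concrete, here is a minimal brute-force sketch of Example 1; the factored weight function (`f1`, `f2`) is a hypothetical stand-in chosen so that the four assignments carry weights 5, 0, 2, 1.

```python
from itertools import product

# Hypothetical factored weight: w(x1, x2) = f1(x1) * f2(x1, x2),
# chosen so the four assignments have weights 5, 0, 2, 1 (total 8) as in Example 1.
f1 = {0: 1, 1: 1}
f2 = {(0, 0): 5, (0, 1): 0, (1, 0): 2, (1, 1): 1}

def weight(x):
    x1, x2 = x
    return f1[x1] * f2[(x1, x2)]

# Brute-force discrete integration: sum over all 2^n items.
total = sum(weight(x) for x in product([0, 1], repeat=2))
print(total)  # 8 -- exact, but the cost grows as 2^n
```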

4 Hardness
• 0/1 weights case:
  • Is there at least one "1"? SAT (NP-complete)
  • How many "1"s are there? #SAT (#P-complete, much harder)
• General weights:
  • Find the heaviest item (combinatorial optimization, MAP)
  • Sum the weights (discrete integration)
• (Figure: complexity hierarchy from easy to hard: P, NP, PH, P^#P, PSPACE, EXP.)
• [ICML-13] WISH: Approximate Discrete Integration via Optimization, e.g., partition function via MAP inference
• MAP inference is often fast in practice: relaxations / bounds, pruning

5 WISH: Integration by Hashing and Optimization
• Outer loop over the n variables
• Inner step: MAP inference on the model augmented with random parity constraints, repeated log(n) times
• Aggregate the MAP inference solutions
• Augmented model: the original graphical model over σ ∈ {0, 1}^n (n binary variables) plus parity check nodes enforcing Aσ = b (mod 2)
• The algorithm requires only O(n log n) MAP queries to approximate the partition function within a constant factor

6 Visual working of the algorithm
• How it works: the function to be integrated is augmented with 1 random parity constraint, then 2, then 3, ..., up to n random parity constraints; each level is repeated log(n) times.
• Estimate: mode M0 + median M1 × 1 + median M2 × 2 + median M3 × 4 + ... (a sketch of this scheme follows below)
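A minimal sketch of the scheme on slides 5 and 6, assuming a generic MAP oracle `map_solver(A, b)` that returns max w(x) subject to Ax = b (mod 2); the oracle, the repetition count T, and the helper names are placeholders, not the authors' implementation.

```python
import random
import statistics

def random_parity_constraints(i, n):
    """Draw i random parity constraints A x = b (mod 2) over n binary variables."""
    A = [[random.randint(0, 1) for _ in range(n)] for _ in range(i)]
    b = [random.randint(0, 1) for _ in range(i)]
    return A, b

def wish_estimate(map_solver, n, T):
    """WISH-style estimate of the partition function Z.

    map_solver(A, b) is assumed to return max_x w(x) s.t. A x = b (mod 2);
    with no constraints it returns the unconstrained mode M0.
    """
    M = []
    for i in range(n + 1):                       # outer loop: 0..n constraints
        vals = []
        for _ in range(T):                       # repeat T = O(log n) times
            A, b = random_parity_constraints(i, n)
            vals.append(map_solver(A, b))
        M.append(statistics.median(vals))
    # Aggregate as on slide 6: M0 + M1*1 + M2*2 + M3*4 + ...
    return M[0] + sum(M[i] * 2 ** (i - 1) for i in range(1, n + 1))
```

Any MAP oracle can be plugged in; for tiny n, a brute-force maximization over the feasible assignments (analogous to the brute-force summation above) suffices to try the aggregation.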
7 Accuracy Guarantees
• Theorem [ICML-13]: With probability at least 1 - δ (e.g., 99.9%), WISH computes a 16-approximation of the partition function (discrete integral) by solving Θ(n log n) MAP inference queries (optimization).
• Theorem [ICML-13]: The approximation factor can be improved to (1 + ε) by adding extra variables and factors. Example: a factor-2 approximation using 4n variables.
• Remark: faster than enumeration only when the combinatorial optimization is efficient.
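In symbols, the first theorem on this slide says (writing Ẑ for the WISH output and Z for the true partition function):

```latex
\Pr\!\left[\tfrac{1}{16}\, Z \;\le\; \hat{Z} \;\le\; 16\, Z\right] \;\ge\; 1 - \delta,
\qquad \text{using } \Theta(n \log n) \text{ MAP queries.}
```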

8 Summary of contributions
• Introduction and previous work:
  • WISH: Approximate Discrete Integration via Optimization
  • Partition function / marginalization via MAP inference
  • Accuracy guarantees
• MAP inference subject to parity constraints:
  • Tractable cases and approximations
  • Integer Linear Programming formulation
  • New family of polynomial-time (probabilistic) upper and lower bounds on the partition function that can be iteratively tightened (eventually reaching a constant-factor approximation)
• Sparsity of the parity constraints:
  • Techniques to improve solution time and bounds quality
  • Experimental improvements over variational techniques

9 MAP INFERENCE WITH PARITY CONSTRAINTS
Hardness, approximations, and bounds

10 Making WISH more scalable
• Would approximations to the optimization (MAP inference with parity constraints) be useful? YES
• Bounds on MAP (optimization) translate to bounds on the partition function Z (discrete integral):
  • Lower bounds on MAP (local search) → lower bounds on Z
  • Upper bounds on MAP (LP, SDP relaxations) → upper bounds on Z
  • Constant-factor approximations of MAP → constant-factor approximations of Z (a sketch of the translation follows below)
• Question: Are there classes of problems where we can efficiently approximate the optimization (MAP inference) in the inner loop of WISH?
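As a sketch of how the translation could look, using the WISH aggregation from slide 6: `lower` and `upper` are hypothetical arrays bounding the per-level optima M_0, ..., M_n (e.g., from local search and an LP relaxation, respectively).

```python
def z_bounds_from_map_bounds(lower, upper):
    """Plug per-level MAP bounds into the WISH aggregation.

    lower[i] and upper[i] bound the i-constraint optimum M_i, so replacing each
    M_i by its bound gives a (probabilistic) lower/upper bound on the Z estimate.
    """
    n = len(lower) - 1
    z_lo = lower[0] + sum(lower[i] * 2 ** (i - 1) for i in range(1, n + 1))
    z_hi = upper[0] + sum(upper[i] * 2 ** (i - 1) for i in range(1, n + 1))
    return z_lo, z_hi
```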

11 Error correcting codes
• Communication over a noisy channel: Alice sends x = 0100|1, Bob receives y = 0110|1
• Redundant parity check bit appended by Alice: 0 XOR 1 XOR 0 XOR 0 = 1
• Bob's check on the received string: 0 XOR 1 XOR 1 XOR 0 = 0 ≠ 1, the received check bit
• Bob: there has been a transmission error! What was the message actually sent by Alice?
  • It must be a valid codeword
  • It should be as close as possible to the received message y
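The parity check from this slide in a few lines of Python; the bit strings are the ones shown in the figure.

```python
def parity(bits):
    # XOR of all message bits
    p = 0
    for b in bits:
        p ^= b
    return p

sent = [0, 1, 0, 0]       # Alice's message; appended check bit = parity(sent) = 1
received = [0, 1, 1, 0]   # after the noisy channel; the check bit 1 arrives intact
print(parity(sent), parity(received))  # 1 0 -> mismatch with check bit 1: error detected
```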

12 Decoding a binary code
• Maximum-likelihood (ML) decoding: x = 0100|1 is sent over the noisy channel, y = 0110|1 is received; the ML-decoding graphical model contains parity check nodes, and the transmitted string must be a codeword.
• MAP inference for ML decoding is NP-hard to approximate within any constant factor [Stern, Arora, ...], yet LDPC decoding is routinely solved in practice: 10GBase-T Ethernet, Wi-Fi 802.11n, digital TV, ...
• Our more general case: a more complex probabilistic model with parity check nodes; max w(x) subject to Ax = b (mod 2), equivalent to MAP inference on the augmented model. (A brute-force sketch of this formulation follows below.)
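To pin down the formulation "max w(x) subject to Ax = b (mod 2)", here is a brute-force sketch for tiny n; the weight function is a hypothetical stand-in for the graphical model.

```python
from itertools import product

def map_with_parity(w, A, b, n):
    """Return the best value and assignment maximizing w(x) among x with A x = b (mod 2)."""
    best, best_x = None, None
    for x in product([0, 1], repeat=n):
        if all(sum(a * xi for a, xi in zip(row, x)) % 2 == bi for row, bi in zip(A, b)):
            val = w(x)
            if best is None or val > best:
                best, best_x = val, x
    return best, best_x

# Example: hypothetical weights over n = 3 variables, one parity constraint x1 + x3 = 1 (mod 2)
w = lambda x: 1 + 2 * x[0] + 3 * x[1] + x[2]
print(map_with_parity(w, A=[[1, 0, 1]], b=[1], n=3))  # (6, (1, 1, 0))
```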

13 Decoding via Integer Programming
• MAP inference subject to parity constraints encoded as an Integer Linear Program (ILP):
  • Standard MAP encoding
  • Compact (polynomial-size) encoding of the parity polytope due to Yannakakis for the parity constraints
• LP relaxation: relax the integrality constraints → polynomial-time upper bounds
• ILP solving strategy: cuts + branching + LP relaxations
  • Solve a sequence of LP relaxations
  • Upper and lower bounds that improve over time
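For illustration only, here is one common way to put a single parity constraint into an ILP: the simple auxiliary-integer-variable encoding sum_j x_j - 2k = b. This is not the Yannakakis parity-polytope encoding referenced on the slide, and it assumes the PuLP modeling library with its default solver is available; the objective is a hypothetical stand-in for the MAP encoding.

```python
import pulp  # assumed available; any ILP modeling layer would do

n = 4
prob = pulp.LpProblem("map_with_parity", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(n)]

# Hypothetical linear objective standing in for the MAP encoding of the model.
weights = [3, -1, 2, 5]
prob += pulp.lpSum(w * xj for w, xj in zip(weights, x))

# Parity constraint x0 + x1 + x2 + x3 = 1 (mod 2), encoded with an auxiliary
# integer variable k:  sum_j x_j - 2k = 1  (simple encoding, not the parity polytope).
k = pulp.LpVariable("k", lowBound=0, upBound=n // 2, cat="Integer")
prob += pulp.lpSum(x) - 2 * k == 1

prob.solve()
print([int(v.value()) for v in x])  # e.g. [1, 0, 1, 1]: odd number of ones
```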

14 Iterative bound tightening
• Polynomial-time upper and lower bounds on MAP that are iteratively tightened over time
• Recall: bounds on the optimization (MAP) give (probabilistic) bounds on the partition function Z. This yields a new family of bounds.
• WISH: when MAP is solved to optimality (lower bound = upper bound), we obtain a guaranteed constant-factor approximation of Z

15 SPARSITY OF THE PARITY CONSTRAINTS
Improving solution time and bounds quality

16 Inducing sparsity
• Observations:
  • Problems with sparse Ax = b (mod 2) are empirically easier to solve (similar to Low-Density Parity Check codes)
  • The quality of the LP relaxation depends on A and b, not just on the solution space: elementary row operations (e.g., summing two equations) do not change the solution space but do affect the LP relaxation
• Sparsification procedure:
  1) Reduce Ax = b (mod 2) to row-echelon form with Gaussian elimination (linear equations over a finite field); see the sketch after this list
  2) Greedily apply elementary row operations
• Result: an equivalent but sparser set of parity check nodes
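A minimal sketch of step 1: Gaussian elimination of [A | b] over GF(2), where row operations are XORs and therefore preserve the solution set of Ax = b (mod 2).

```python
def gf2_row_echelon(A, b):
    """In-place Gaussian elimination of [A | b] over GF(2); rows are Python lists of 0/1."""
    rows, cols = len(A), len(A[0])
    r = 0
    for c in range(cols):
        # find a pivot row with a 1 in column c
        pivot = next((i for i in range(r, rows) if A[i][c] == 1), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        b[r], b[pivot] = b[pivot], b[r]
        # eliminate column c from the rows below (XOR = addition mod 2)
        for i in range(r + 1, rows):
            if A[i][c] == 1:
                A[i] = [ai ^ aj for ai, aj in zip(A[i], A[r])]
                b[i] ^= b[r]
        r += 1
    return A, b

A = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
b = [1, 0, 1]
print(gf2_row_echelon(A, b))  # ([[1, 1, 0], [0, 1, 1], [0, 0, 0]], [1, 1, 0])
```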

17 Improvements from sparsity
• The quality of the LP relaxations significantly improves
• Integer solutions are found faster (better lower bounds); without sparsification, the solver fails to find integer solutions (no lower bound)
• (Figure: upper-bound improvement from sparsification using the IBM CPLEX ILP solver on a 10x10 Ising grid.)

18 Generating sparse constraints
• WISH is based on universal hashing: randomly generate A in {0, 1}^{i×n} and b in {0, 1}^i, and optimize over the solutions of Ax = b (mod 2) (parity constraints). The map x ↦ Ax + b (mod 2) is then:
  • Uniform over {0, 1}^i
  • Pairwise independent: given distinct variable assignments x and y, the events Ax = b (mod 2) and Ay = b (mod 2) are independent
• Suppose we instead generate a sparse matrix A with at most k variables per parity constraint (up to k ones per row of A):
  • Ax + b (mod 2) is still uniform, but no longer pairwise independent
  • E.g., for k = 1, Ax = b (mod 2) is equivalent to fixing i variables, which introduces a lot of correlation (knowing Ax = b tells us a lot about Ay = b)
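A small sketch of both constructions: a dense random A (uniform and pairwise independent) and a k-sparse variant with exactly k ones per row (a special case of the at-most-k rows above), which stays uniform but loses pairwise independence.

```python
import random

def random_parity(i, n, k=None):
    """Random A in {0,1}^{i x n} and b in {0,1}^i defining the constraints A x = b (mod 2).

    k=None: dense rows (each entry 1 with prob. 1/2) -> uniform and pairwise independent.
    k:      rows with exactly k ones -> still uniform, but not pairwise independent.
    """
    if k is None:
        A = [[random.randint(0, 1) for _ in range(n)] for _ in range(i)]
    else:
        A = []
        for _ in range(i):
            row = [0] * n
            for j in random.sample(range(n), k):
                row[j] = 1
            A.append(row)
    b = [random.randint(0, 1) for _ in range(i)]
    return A, b

def satisfies(x, A, b):
    """Check whether assignment x satisfies all parity constraints A x = b (mod 2)."""
    return all(sum(a * xi for a, xi in zip(row, x)) % 2 == bi for row, bi in zip(A, b))
```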

19 Using sparse parity constraints
• Theorem: With probability at least 1 - δ (e.g., 99.9%), WISH with sparse parity constraints computes an approximate lower bound on the partition function.
• PRO: "easier" MAP inference queries; for example, random parity constraints of length 1 (i.e., on a single variable) are equivalent to MAP with some variables fixed.
• CON: we lose the upper-bound part; the output can underestimate the partition function.
• CON: no constant-factor approximation anymore.

20 MAP with sparse parity constraints
• Evaluation of MAP inference with sparse constraints on a 10x10 attractive Ising grid and a 10x10 mixed Ising grid
• ILP and Branch & Bound outperform message passing (BP, MP, and MPLP)

21 Experimental results • ILP provides probabilistic upper and lower bounds that improve over time and are often tighter than variational methods (BP, MF, TRW)

22 Experimental results (2) • ILP provides probabilistic upper and lower bounds that improve over time and are often tighter than variational methods (BP, MF, TRW)
23 Conclusions
• [ICML-13] WISH: discrete integration reduced to a small number of optimization instances (MAP)
  • Strong (probabilistic) accuracy guarantees
  • MAP inference is still NP-hard
• Scalability: approximations and bounds
  • Connection with maximum-likelihood decoding
  • ILP formulation + sparsity (Gaussian sparsification & uniform hashing)
  • New family of probabilistic, polynomial-time computable upper and lower bounds on the partition function that can be iteratively tightened (eventually reaching a constant-factor approximation)
• Future work:
  • Extension to continuous integrals and variables
  • Sampling from high-dimensional probability distributions

24 Extra slides