Efficient training in high-dimensional weight space
Christoph Bunzmann, Robert Urbanczik, Michael Biehl
Theoretische Physik und Astrophysik / Computational Physics, Julius-Maximilians-Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany, http://theorie.physik.uni-wuerzburg.de/~biehl
Intelligent Systems, Wiskunde & Informatica, Rijksuniversiteit Groningen, Postbus 800, NL-9718 DD Groningen, The Netherlands, biehl@cs.rug.nl, www.cs.rug.nl/~biehl
Efficient training in high-dimensional weight space
· Learning from examples
· A model situation: layered neural networks, student-teacher scenario
· The dynamics of on-line learning: on-line gradient descent; delayed learning, plateau states
· Efficient training of multilayer networks: learning by Principal Component Analysis (idea, analysis, results)
· Summary, outlook: selected further topics, prospective projects
Learning from examples
Choice of adjustable parameters in adaptive information processing systems:
· based on example data, e.g. input/output pairs in supervised learning: classification tasks, time series prediction, regression problems
· parameterizes a hypothesis, e.g. for an unknown classification or regression task
· guided by the optimization of an appropriate objective or cost function, e.g. the performance with respect to the example data
· results in generalization ability, e.g. the successful classification of novel data
Theory of learning processes
· general results, independent of the specific task, the statistical properties of the data, the details of the training procedure, ...: e.g. performance bounds
· typical properties of model scenarios, for a given network architecture, statistics of data and noise, and learning algorithm: e.g. learning curves; understanding and prediction of relevant phenomena, algorithm design
· description of specific applications: a given real-world problem, a particular training scheme, a special set of example data, ...: e.g. handwritten digit recognition
trade-off: general validity vs. applicability
A two-layered network: the soft committee machine
· input data ξ, adaptive input-to-hidden weights w_j, K hidden units, fixed hidden-to-output weights
· input/output relation: σ(ξ) = Σ_j g(w_j · ξ), with sigmoidal hidden activation, e.g. g(x) = erf(a x)
· SCM with adaptive thresholds: universal approximator
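As an illustration, a minimal sketch of the SCM input/output relation, assuming the hidden-to-output weights are fixed to 1 and the activation g(x) = erf(x/√2) (the gain a is not specified on the slide; all names are illustrative):

```python
import numpy as np
from scipy.special import erf

def g(x):
    """Sigmoidal hidden activation, here g(x) = erf(x / sqrt(2))."""
    return erf(x / np.sqrt(2.0))

def scm_output(W, xi):
    """Soft committee machine: K hidden units with adaptive weight vectors
    (rows of W, shape (K, N)) and fixed hidden-to-output weights of 1."""
    local_fields = W @ xi           # one local field per hidden unit
    return np.sum(g(local_fields))  # committee output: sum of activations
```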
Student-teacher scenario
· the rule is given by its (best) parameterization: a teacher network with M hidden units
· the adaptive student network has K hidden units
· K < M: unlearnable rule; K > M: over-sophisticated student (interesting effects, relevant cases)
· ideal situation: perfectly matching complexity, K = M
Training and generalization
· examples of the unknown function or rule: input/output pairs {ξ^μ, τ(ξ^μ)}, here with reliable teacher outputs
· training: based on the performance with respect to the example data, e.g. the sum of single-example errors
· evaluation after training: the generalization error ε_g, the expected error for a novel input, with respect to the density of inputs or a set of test inputs
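In formulas (a standard quadratic-error convention for this scenario; the slide itself does not spell out the error measure), with student output σ(ξ) and teacher output τ(ξ):

```latex
\varepsilon(\xi) = \tfrac{1}{2}\left[\sigma(\xi)-\tau(\xi)\right]^2, \qquad
E = \sum_{\mu=1}^{P}\varepsilon(\xi^{\mu}), \qquad
\varepsilon_g = \left\langle \varepsilon(\xi)\right\rangle_{\xi}.
```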
Statistical Physics approach
· consider large systems in the thermodynamic limit N → ∞, where N is the dimension of the input data and the number of adjustable parameters grows with N (K, M « N)
· perform averages over the stochastic training process and over the randomized example data (quenched disorder); technically simplest case: reliable teacher outputs, isotropic input density with independent components of zero mean and unit variance
· description in terms of macroscopic quantities, e.g. overlap parameters as student/teacher similarity measures
· evaluate typical properties, e.g. the learning curve; considered next: the generalization error ε_g
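Written out, the overlap parameters referred to above are (student weight vectors w_j, teacher weight vectors B_m; the same symbols R, Q, T appear on the following slides):

```latex
R_{jm} = \mathbf{w}_j\cdot\mathbf{B}_m, \qquad
Q_{ij} = \mathbf{w}_i\cdot\mathbf{w}_j, \qquad
T_{mn} = \mathbf{B}_m\cdot\mathbf{B}_n .
```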
The generalization error
· the local fields x_j = w_j · ξ and y_m = B_m · ξ are sums of many random numbers; by the Central Limit Theorem they become correlated Gaussians for large N
· their first and second moments are given by the overlap parameters, so averages over the input density reduce to Gaussian integrals
· ε_g depends on the KN microscopic weights only through the ½(K²+K) + KM macroscopic overlaps {Q_ij, R_jm}
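A sketch of the resulting expression in code form, assuming g(x) = erf(x/√2) and the quadratic error above; for this activation the Gaussian two-point average has the well-known arcsine form, so ε_g becomes an explicit function of Q, R, T only:

```python
import numpy as np

def pair_term(c12, c11, c22):
    """<g(u) g(v)> for zero-mean Gaussians with the given covariances,
    valid for g(x) = erf(x / sqrt(2))."""
    return (2.0 / np.pi) * np.arcsin(c12 / np.sqrt((1.0 + c11) * (1.0 + c22)))

def eg_from_overlaps(Q, R, T):
    """Generalization error of the soft committee machine as a function of
    the overlap matrices only (Q: student-student, R: student-teacher,
    T: teacher-teacher)."""
    K, M = R.shape
    eg = 0.0
    for i in range(K):
        for j in range(K):
            eg += pair_term(Q[i, j], Q[i, i], Q[j, j])
    for m in range(M):
        for n in range(M):
            eg += pair_term(T[m, n], T[m, m], T[n, n])
    for i in range(K):
        for m in range(M):
            eg -= 2.0 * pair_term(R[i, m], Q[i, i], T[m, m])
    return 0.5 * eg
```

Once the overlaps are known as functions of the training time, this expression yields the learning curve.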
Dynamics of on-line gradient descent
· presentation of single examples: weights w_j^μ after presentation of μ examples; in each step a novel, random example ξ^μ is drawn
· on-line learning step: gradient descent on the single-example error, w_j^{μ+1} = w_j^μ − (η/N) ∇_{w_j} ε(ξ^μ)
· the number of examples μ plays the role of a discrete learning time
· practical advantages: no explicit storage of the full data set required, little computational effort per example
· mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples, yielding coupled ODEs for {R_jm, Q_ij} in the continuous time α = P/(KN)
From recursions to ODEs
· the projections R_jm = w_j · B_m and Q_ij = w_i · w_j obey recursions in the example number, e.g. after each learning step
· for large N: average over the latest example (the local fields are Gaussian); the mean recursions become coupled ODEs in the continuous time α = P/(KN), i.e. the number of examples per weight serves as the training time
· the solution {R_jm(α), Q_ij(α)} yields the learning curve ε_g(α); a minimal simulation sketch is given below
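A minimal simulation sketch of the stochastic dynamics (assumed quadratic error, g(x) = erf(x/√2), learning rate scaled with 1/N as above; function names are illustrative). It records the overlap trajectories whose N → ∞ limit is described by the ODEs:

```python
import numpy as np
from scipy.special import erf

def g(x):
    return erf(x / np.sqrt(2.0))

def g_prime(x):
    return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x**2)

def online_gd(B, K, eta, n_examples, rng):
    """On-line gradient descent for a soft committee student with K hidden
    units learning a teacher with weight vectors B (shape (M, N)).
    Returns the overlap matrices (R, Q) after every learning step."""
    M, N = B.shape
    W = rng.normal(size=(K, N)) / np.sqrt(N)    # random init: weak initial overlaps
    history = []
    for mu in range(n_examples):
        xi = rng.normal(size=N)                 # isotropic input: zero mean, unit variance
        x, y = W @ xi, B @ xi                   # student and teacher local fields
        delta = np.sum(g(x)) - np.sum(g(y))     # output error sigma - tau
        W -= (eta / N) * delta * np.outer(g_prime(x), xi)  # single-example gradient step
        history.append((W @ B.T, W @ W.T))      # overlaps R = W B^T, Q = W W^T
    return history
```

Evaluating ε_g from the recorded overlaps (with the function of the previous sketch) and plotting it against α = μ/(KN) should give learning curves of the type shown on the following slides.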
Example: learning curve for K = M = 2, η = 1.5, R_ij(0) ≈ 0
[Figure: ε_g versus α = P/(KN) from 0 to 300; fast initial decrease, then a quasi-stationary plateau, finally the approach to perfect generalization]
· quasi-stationary plateau states with unspecialized student weights dominate the learning process
Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769
Example: K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0: evolution of the overlap parameters
[Figure: R_11, R_22, Q_11, Q_22, Q_12 = Q_21, and R_12, R_21 as functions of α from 0 to 300]
· the plateau reflects the permutation symmetry of the branches in the student network
Self-averaging
· Monte Carlo simulations confirm that macroscopic quantities such as Q_jm are self-averaging: the fluctuations about the mean vanish with increasing system size (variance ∝ 1/N), so for large N a single run follows the mean trajectory
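A sketch of how this can be checked numerically, reusing online_gd from the sketch above (the monitored quantity Q_11, the orthonormal teacher, and the system sizes are arbitrary choices for illustration):

```python
import numpy as np

def selfaveraging_check(system_sizes, K, eta, alpha, n_runs, seed=0):
    """Run-to-run standard deviation of the overlap Q_11 at fixed rescaled
    time alpha = P/(K N), for several input dimensions N."""
    rng = np.random.default_rng(seed)
    result = {}
    for N in system_sizes:
        B = np.eye(K, N)                         # orthonormal teacher, T_mn = delta_mn
        q11 = []
        for _ in range(n_runs):
            R, Q = online_gd(B, K, eta, int(alpha * K * N), rng)[-1]
            q11.append(Q[0, 0])
        result[N] = np.std(q11)                  # expected to shrink with growing N
    return result
```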
Plateau length
· the plateau length diverges exactly if all overlaps R_jm are equal; a randomized initialization of the weight vectors yields only weak initial overlaps of order 1/√N
· hence the number of examples needed for successful learning grows with N; hidden unit specialization from macroscopic initial overlaps would require a priori knowledge
· is the plateau a property of the learning scenario (a necessary phase of training) or an artifact of the training prescription?
S. J. Hanson, in: Y. Chauvin & D. Rumelhart (Eds.), Backpropagation: Theory, Architectures, and Applications
Training by Principal Component Analysis
· problem: delayed specialization in the (K · N)-dimensional weight space
· idea: A) identification (approximation) of the subspace spanned by the teacher vectors; B) actual training within this low-dimensional space
· example: soft committee teacher (K = M), isotropic input density
· a modified correlation matrix of the data has a characteristic eigenvalue spectrum: 1 separated eigenvector, (K-1) further separated eigenvectors, and (N-K) bulk eigenvectors; the separated eigenvectors approximate the teacher space
A) empirical estimate of the correlation matrix from a limited data set; determine the (K-1) smallest eigenvalues and the 1 largest eigenvalue together with the corresponding eigenvectors
B) specialization in the K-dimensional space spanned by these eigenvectors:
· representation of the student weights as linear combinations of the K eigenvectors
· optimization of the K² expansion coefficients with respect to E (K² « KN coefficients; number of examples P = αKN » K²)
note: the required memory is of order N² and does not increase with P; a hedged sketch of both steps is given below
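The slides do not spell out how the correlation matrix is modified, so the following sketch uses a generic output-dependent weighting phi as a placeholder (np.abs is purely illustrative, not the choice of the original work); the eigenvector selection and the low-dimensional training follow steps A) and B) as described above:

```python
import numpy as np
from scipy.special import erf

def g(x):
    return erf(x / np.sqrt(2.0))

def g_prime(x):
    return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x**2)

def estimate_teacher_subspace(XI, TAU, K, phi=np.abs):
    """Step A): estimate the K-dimensional teacher subspace from the data.
    XI: (P, N) inputs, TAU: (P,) observed teacher outputs.
    phi: output-dependent weighting of each example (placeholder choice).
    Returns a (K, N) basis: the eigenvector with the largest eigenvalue
    plus the (K-1) eigenvectors with the smallest eigenvalues."""
    P, N = XI.shape
    C = (XI * phi(TAU)[:, None]).T @ XI / P        # modified correlation matrix, N x N
    evals, evecs = np.linalg.eigh(C)               # eigenvalues in ascending order
    picked = np.concatenate([np.arange(K - 1), [N - 1]])
    return evecs[:, picked].T

def train_in_subspace(V, XI, TAU, K, eta=0.05, epochs=500, rng=None):
    """Step B): student weights w_j = sum_k c[j, k] V[k]; batch gradient
    descent on the K*K expansion coefficients only."""
    rng = rng or np.random.default_rng(0)
    c = 0.1 * rng.normal(size=(K, K))
    XV = XI @ V.T                                  # project data once: (P, K)
    for _ in range(epochs):
        x = XV @ c.T                               # student local fields, (P, K)
        delta = np.sum(g(x), axis=1) - TAU         # output errors, (P,)
        c -= eta * (delta[:, None] * g_prime(x)).T @ XV / len(TAU)
    return c @ V                                   # student weights back in R^N
```

Note that C is an N x N matrix, so, as stated above, the required memory does not grow with the number of examples P.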
Typical properties of the procedure
· given a random set of P = αKN examples: formal partition sum, quenched free energy via the replica trick, saddle point integration in the limit N → ∞
· A) the typical overlap of the identified subspace with the teacher weights measures the success of the teacher space identification
· B) given this overlap, determine the optimal ε_g achievable by a linear combination of the identified eigenvectors
Results: K = 3, Statistical Physics theory and Monte Carlo simulations, N = 400 and N = 1600 (•)
[Figure: theory and simulation results for A) the teacher space identification from P = αKN examples and B) the resulting optimal ε_g]
· specialization sets in at a critical number of examples: α_c(K=2) = 4.49, α_c(K=3) = 8.70; large-K theory: α_c(K) ≈ 2.94 K (independent of N!)
· specialization without a priori knowledge: the transition from unspecialized to specialized students occurs at an α_c that does not grow with N
Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)
Potential application: model selection
[Figure: spectrum of the matrix C_P for a teacher with M = 7 hidden units; the K-1 = 6 smallest eigenvalues split off from the bulk]
· the algorithm requires no prior knowledge of M
· the PCA spectrum hints at the required model complexity
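A sketch of the corresponding model-selection heuristic: count the eigenvalues of C_P that split off from the bulk (the threshold rule below is purely illustrative and not taken from the slides):

```python
import numpy as np

def estimate_hidden_units(evals, n_sigma=4.0):
    """Estimate the number of teacher hidden units M from the spectrum of
    the modified correlation matrix: (M-1) small outliers below the bulk
    plus one large outlier above it."""
    evals = np.sort(evals)
    bulk = evals[len(evals) // 4 : 3 * len(evals) // 4]   # central half as bulk estimate
    lo = np.median(bulk) - n_sigma * np.std(bulk)
    n_small = int(np.sum(evals < lo))                     # split-off small eigenvalues
    return n_small + 1                                    # plus the single large outlier
```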
Summary
· model situation, supervised learning: the soft committee machine, student-teacher scenario, randomized training data
· statistical physics inspired approach: large systems, thermal (training) and disorder (data) averages, typical macroscopic properties
· dynamics of on-line gradient descent: delayed learning due to the necessary symmetry-breaking specialization processes
· efficient training: a PCA-based learning algorithm reduces the dimensionality of the problem; specialization without a priori knowledge
Further topics
· perceptron training (single layer): optimal stability classification, dynamics of learning
· unsupervised learning: principal component analysis; competitive learning, clustered data
· non-trivial statistics of data: learning from noisy data, time-dependent rules
· dynamics of on-line training: perceptron, unsupervised learning, two-layered feed-forward networks
· specialization processes: discontinuous learning curves, delayed learning, plateau states
· algorithm design: variational method, optimal algorithms, construction algorithms
Selected prospective projects
· application-relevant architectures and algorithms: Local Linear Model Trees, Learning Vector Quantization, Support Vector Machines
· unsupervised learning: density estimation, feature detection, clustering, (Learning) Vector Quantization, compression, self-organizing maps
· algorithm design: variational optimization, e.g. alternative correlation matrices
· model selection: estimate the complexity of a rule or mixture density