Towards CI Foundations
Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
Google: W. Duch
WCCI'08 Panel Discussion

Questions
• Nature of CI
• Current state of CI
• Promoting CI
• CI and Smart Adaptive Systems
• CI and Nature-inspiration
• Future of CI

CI definition
Computational Intelligence. An International Journal (1984) + 10 other journals with "Computational Intelligence" in the title. D. Poole, A. Mackworth & R. Goebel, Computational Intelligence – A Logical Approach (OUP 1998) is a GOFAI book about logic and reasoning.
CI should:
• be problem-oriented, not method-oriented;
• cover all that the CI community is doing now and is likely to do in the future;
• include AI – they also think they are CI...
CI: the science of solving (effectively) non-algorithmizable problems. This is a problem-oriented definition, firmly anchored in computer science/engineering.
AI: focused on problems requiring higher-level cognition; the rest of CI is more focused on problems related to perception/action/control.

Are we really so good?
Surprise! Almost nothing can be learned using current CI tools! Examples: complex logic; natural language; natural perception.

How much can we learn?
Linearly separable or almost separable problems are relatively simple – deform the data or add dimensions to make it separable. But how do we define "slightly non-separable"? There is only separable and the vast realm of the rest. An example of the added-dimension trick is sketched below.
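A minimal sketch (assuming NumPy) of the "add dimensions" trick: XOR is not linearly separable in 2D, but appending the product x1·x2 as a third feature makes a single plane sufficient.

```python
# XOR made separable by adding one dimension: append x1*x2 as a third feature
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

X3 = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])  # lift to 3D
w = np.array([1.0, 1.0, -2.0])  # the plane w . x = 0.5 separates in 3D
print(X3 @ w)  # [0. 1. 1. 0.] -> thresholding at 0.5 reproduces y exactly
```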

Boolean functions
n=2: 16 functions, 14 separable, 2 not separable.
n=3: 256 functions, 104 separable (41%), 152 not separable.
n=4: 64K = 65,536 functions, only 1880 separable (3%).
n=5: 4G functions, but << 1% separable... bad news!
Existing methods may learn some non-separable functions, but most functions cannot be learned! Example: the n-bit parity problem; many papers in top journals. No off-the-shelf systems are able to solve such problems. For parity problems SVM may go below the base rate! Such problems are solved only by special neural architectures or special classifiers – if the type of function is known. But parity is still trivial... solved by a suitable projection, as the k-D case on the next slide shows.
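These counts can be checked directly for small n. A sketch (assuming NumPy and SciPy are available) that poses linear separability as an LP feasibility problem – find w, b with y_i(w·x_i + b) ≥ 1 for every vertex – and recovers 104/256 for n=3 (14/16 for n=2):

```python
# count linearly separable Boolean functions of n bits via LP feasibility
import itertools
import numpy as np
from scipy.optimize import linprog

n = 3
X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def is_linearly_separable(labels):
    # labels: tuple of 0/1 outputs for the 2^n input vertices
    y = np.where(np.array(labels) == 1, 1.0, -1.0)
    # variables [w_1..w_n, b]; constraints -y_i*(w.x_i + b) <= -1
    A = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(n + 1), A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.success  # feasible <=> separable

count = sum(is_linearly_separable(f)
            for f in itertools.product([0, 1], repeat=2 ** n))
print(f"linearly separable Boolean functions for n={n}: {count} / {2**(2**n)}")
# expected: 104 / 256 for n=3 (and 14 / 16 for n=2)
```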

k-D case
3-bit functions: X = [b1, b2, b3], from [0,0,0] to [1,1,1]. f(b1,b2,b3) and ¬f(b1,b2,b3) are symmetric (color change). 8 cube vertices, 2^8 = 256 Boolean functions. With 0 to 8 red vertices there are 1, 8, 28, 56, 70, 56, 28, 8, 1 functions.
For an arbitrary direction W, the projection W·X gives:
• k=1 in 2 cases: all 8 vectors in 1 cluster (all black or all white);
• k=2 in 14 cases: 8 vectors in 2 clusters (linearly separable);
• k=3 in 42 cases: clusters B R B or W R W;
• k=4 in 70 cases: clusters R W R W or W R W R.
Symmetrically, k = 5–8 in 70, 42, 14, 2 cases.
Most logical functions have 4- or 5-separable projections. Learning = find the best projection for each function. The number of k = 1 to 4-separable functions is 2, 102, 126 and 26; 126 of all functions may be learned using 3-separability.
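A small sketch (NumPy) of the counting idea: project the 8 vertices onto a direction w and count the homogeneous intervals along the line; parity with w = (1,1,1) gives k = 4.

```python
# k of a projection: number of pure-label intervals that w . x induces
import itertools
import numpy as np

def k_of_projection(w, f_values, n=3):
    X = np.array(list(itertools.product([0, 1], repeat=n)))
    proj = X @ np.asarray(w, dtype=float)
    labels = np.asarray(f_values)
    blocks = []
    for p in np.unique(proj):          # group vertices with equal projection
        group = labels[proj == p]
        if group.min() != group.max(): # a mixed group: w does not k-separate f
            return None
        blocks.append(group[0])
    # k = number of maximal runs of equal labels along the line
    return 1 + sum(b1 != b0 for b0, b1 in zip(blocks, blocks[1:]))

parity = [bin(i).count("1") % 2 for i in range(8)]
print(k_of_projection([1, 1, 1], parity))  # 4: blocks 0|1|0|1 on the diagonal
```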

RBF for XOR
Is an RBF solution with 2 hidden Gaussian nodes possible? Typical architecture: 2 inputs – 2 Gaussians – 1 linear output. Maximum-likelihood training gives 50% errors, yet there is perfect separation – just not a linear separation! The network knows the answer, but cannot say it... A single Gaussian output node may solve the problem. The output weights provide reference hyperplanes (the red and green lines), not separating hyperplanes as in the case of MLP. A hidden-space sketch follows.
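A sketch (NumPy) of the hidden-space image of XOR; the Gaussian centers, one vertex from each class, are an illustrative assumption, not the ML solution from the slide. The two classes map onto mirror-image pairs across the diagonal h1 = h2: perfectly separated clusters, but no single line splits them.

```python
# hidden-space image of XOR under two Gaussian nodes
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                      # XOR labels
centers = np.array([[0.0, 0.0], [0.0, 1.0]])    # one vertex from each class
sigma = 1.0

# Gaussian hidden activations: h_j(x) = exp(-||x - c_j||^2 / (2 sigma^2))
H = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
           / (2 * sigma**2))
for h, label in zip(H, y):
    print(h.round(3), "class", label)
# each class-0 point is the mirror (across h1 = h2) of a class-1 point:
# distinct clusters, but provably not linearly separable in (h1, h2)
```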

3-bit parity in 2D and 3D
The output is mixed, errors are at the base rate (50%), but in the hidden space... Conclusion: separability in the hidden space is perhaps too much to ask for...
• inspection of the clusters is sufficient for perfect classification;
• add a second Gaussian layer to capture this activity;
• train a second RBF on the data (stacking), reducing the number of clusters.
A sketch of the stacking idea follows this list.
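A minimal sketch (NumPy; centers at the two opposite cube corners are an assumption) of the point: two Gaussian nodes map 3-bit parity onto 4 pure clusters along a curve – not linearly separable, yet a second Gaussian layer centered on the odd clusters classifies perfectly.

```python
# 3-bit parity through 2 Gaussian nodes, then a stacked second Gaussian layer
import itertools
import numpy as np

X = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
y = X.sum(1).astype(int) % 2                    # parity labels
c = np.array([[0, 0, 0], [1, 1, 1]], dtype=float)

def gauss_layer(P, centers, sigma=1.0):
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

H = gauss_layer(X, c)                  # 8 points -> 4 pure clusters (B R B R)
odd_clusters = np.unique(H[y == 1], axis=0)
out = gauss_layer(H, odd_clusters, sigma=0.1).sum(1)  # second Gaussian layer
pred = (out > 0.5).astype(int)
print("accuracy:", (pred == y).mean())  # 1.0
```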

Spying on networks
After the initial transformation, what still needs to be done? Conclusion: separability in the hidden space is perhaps too much to ask for... use rules, similarity or linear separation, depending on the case.

Parity n=9
Simple gradient learning; quality index shown below.
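The slide's quality index is not reproduced here. As a stand-in, a minimal sketch (NumPy) of gradient learning for 9-bit parity with a single projection and a periodic output, y = cos(π w·x), which is exact when w reaches the diagonal (1,…,1); plain MSE gradient runs from random starts often stall in local minima, which is what a good quality index is meant to avoid.

```python
# 9-bit parity with one projection and a periodic output node:
# y(x) = cos(pi * w . x) is exact for w = (1, ..., 1).
# Plain MSE gradient descent -- an illustration, not the slide's quality index.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 9
X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
t = 1 - 2 * (X.sum(1).astype(int) % 2)  # targets: +1 even parity, -1 odd

w = rng.normal(0.0, 0.5, n)  # random start; convergence is not guaranteed
for epoch in range(5000):
    z = X @ w
    y = np.cos(np.pi * z)
    # gradient of mean (y - t)^2 / 2 with respect to w
    w -= 0.05 * (((y - t) * (-np.pi * np.sin(np.pi * z))) @ X) / len(X)

acc = np.mean(np.sign(np.cos(np.pi * X @ w)) == t)
print(f"accuracy: {acc:.3f}")  # often < 1.0 from random starts: local minima
```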

More meta-learning
Meta-learning: learning how to learn, replacing experts who search for the best models by running many experiments. The search space of models is too large to explore exhaustively, so design the system architecture to support knowledge-based search:
• abstract view, uniform I/O, uniform results management;
• directed acyclic graphs (DAG) of boxes representing scheme placeholders and particular models, interconnected through I/O;
• a configuration level for meta-schemes, expanded at the runtime level.
An exercise in software engineering for data mining! A toy sketch of the DAG-of-boxes idea follows.
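A toy sketch of the idea in Python; all names here (Box, execute, the placeholder) are hypothetical, not the actual Intemi API. Boxes are connected through I/O, and a placeholder box can be swapped for a concrete model at runtime.

```python
# boxes connected through I/O; a placeholder is expanded (swapped for a
# concrete model) at runtime -- hypothetical names, not the Intemi API
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Box:
    name: str
    run: Callable            # concrete model or scheme placeholder
    inputs: tuple = ()       # upstream boxes (a DAG, here a simple chain)

def execute(box: Box, data):
    for parent in box.inputs:          # evaluate upstream boxes first
        data = execute(parent, data)
    return box.run(data)

standardize = Box("standardize", lambda X: (X - X.mean(0)) / X.std(0))
classifier = Box("classifier?", lambda X: X)   # placeholder to be filled in
scheme = Box("output", lambda X: X, inputs=(standardize, classifier))

# at runtime a search procedure swaps in a concrete model:
classifier.run = lambda X: (X.sum(1) > 0).astype(int)  # e.g. a trivial rule
print(execute(scheme, np.array([[1.0, 2.0], [3.0, 4.0]])))  # [0 1]
```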

Intemi, Intelligent Miner
Meta-schemes: templates with placeholders.
• May be nested; the role is decided by the input/output types.
• Machine-learning generators based on meta-schemes.
• The granulation level allows creating novel methods.
• Complexity control: length + log(time), sketched after this list.
• A unified meta-parameters description...
Intemi, the intelligent miner, coming "soon".
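The complexity control on this slide, written out as a scoring rule; this is a sketch, and the units (bits for length, seconds for time) are an assumption.

```python
# rank candidate models by description length plus log of runtime (a sketch;
# units -- bits for length, seconds for time -- are an assumption)
import math

def complexity(description_length: float, runtime: float) -> float:
    return description_length + math.log(runtime)

candidates = {"small-slow": (10, 100.0), "big-fast": (40, 0.1)}
ranked = sorted(candidates, key=lambda m: complexity(*candidates[m]))
print(ranked)  # length dominates: log compresses even large runtime gaps
```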

Biological justification
• Cortical columns may learn to respond to stimuli with complex logic, resonating in different ways.
• A second column will learn without problems that such different reactions have the same meaning: the inputs xi and training targets yj are the same => Hebbian learning, ΔWij ~ xi·yj => identical weights.
• Effect: the same projection line y = W·X, but inhibition turns off one perceptron when the other is active.
• Simplest solution: oscillators based on a combination of two neurons, s(W·X − b) − s(W·X − b′), give localized projections! We have used them in the MLP2LN architecture for extraction of logical rules from data.
• Note: k-separability learning is not a multistep output neuron; the targets are not known, and same-class vectors may appear in different intervals! We need to learn how to find the intervals and how to assign them to classes; new algorithms are needed to learn this! A sketch of the two-neuron window follows.
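A sketch (NumPy) of the two-neuron oscillator: the difference of two sigmoids responds only when the projection W·X falls inside a window – exactly the kind of localized interval that k-separability learning needs.

```python
# localized response on a projection: s(z - b) - s(z - b') is ~1 only for
# z in the window [b, b'] (s = logistic sigmoid; slope sharpens the edges)
import numpy as np

def s(t):
    return 1.0 / (1.0 + np.exp(-t))

def window(z, b, b_prime, slope=10.0):
    return s(slope * (z - b)) - s(slope * (z - b_prime))

z = np.linspace(-1.0, 4.0, 11)
print(window(z, 0.5, 1.5).round(2))
# e.g. for 3-bit parity on z = W . X with W = (1,1,1), two such windows
# centered on the odd counts (z near 1 and z near 3) sum to the parity output
```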