Additional Material: Mahalanobis Distance
Prof. Dr. Rudolf Kruse
Interpretation of a Covariance Matrix
- A univariate normal distribution has the density function
  f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
- A multivariate normal distribution has the density function
  f_{\vec{X}}(\vec{x}) = \frac{1}{\sqrt{(2\pi)^m |\boldsymbol{\Sigma}|}} \exp\left(-\frac{1}{2}(\vec{x}-\vec{\mu})^\top \boldsymbol{\Sigma}^{-1} (\vec{x}-\vec{\mu})\right)
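The two density functions can be evaluated numerically; a minimal NumPy sketch (the multivariate formula reduces to the familiar univariate one for m = 1):

```python
import numpy as np

def mvn_density(x, mu, cov):
    """Density of an m-dimensional normal distribution N(mu, cov) at point x."""
    m = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** m * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

# For m = 1, mu = 0, sigma = 1 this is the standard normal density at x = 0.5
print(mvn_density(np.array([0.5]), np.array([0.0]), np.array([[1.0]])))
```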
Variance and Standard Deviation
Univariate normal/Gaussian distribution: the variance/standard deviation provides information about the height of the mode and the width of the curve.
Interpretation of a Covariance Matrix
- The variance/standard deviation relates the spread of the distribution to the spread of a standard normal distribution.
- The covariance matrix relates the spread of the distribution to the spread of a multivariate standard normal distribution.
- Example: bivariate normal distribution.
- Question: is there a multivariate analog of standard deviation?
Eigenvalue Decomposition
- Yields an analog of standard deviation.
- Let S be a symmetric, positive definite matrix (e.g. a covariance matrix). Then S can be written as S = R diag(λ_1, …, λ_m) R^⊤, where R is orthogonal and the λ_i > 0 are the eigenvalues of S. The matrix T = R diag(√λ_1, …, √λ_m) satisfies T T^⊤ = S and thus plays the role of a "square root" of S: the multivariate analog of standard deviation.
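The eigenvalue decomposition of a covariance matrix yields such a "square root" T with T T^⊤ = S; a sketch with NumPy (the example matrix S is made up):

```python
import numpy as np

# Symmetric positive definite covariance matrix (illustrative values)
S = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# S = R diag(lambda_1, ..., lambda_m) R^T with orthogonal R
lam, R = np.linalg.eigh(S)

# Analog of standard deviation: T = R diag(sqrt(lambda_i)), so that T T^T = S
T = R @ np.diag(np.sqrt(lam))

assert np.allclose(T @ T.T, S)
```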
Cluster-Specific Distance Functions
The similarity of a data point to a prototype depends on their distance.
- If the cluster prototype is a simple cluster center, a general distance measure can be defined on the data space. In this case the Euclidean distance is most often used due to its rotation invariance. It leads to (hyper-)spherical clusters.
- However, more flexible clustering approaches (with size and shape parameters) use cluster-specific distance functions. The most common approach is to use a Mahalanobis distance with a cluster-specific covariance matrix. The covariance matrix comprises shape and size parameters. The Euclidean distance is the special case that results for the unit matrix, Σ = 1.
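A minimal sketch of a cluster-specific Mahalanobis distance (the example point and center are made up); with Σ = 1 it reduces to the Euclidean distance:

```python
import numpy as np

def mahalanobis(x, center, cov):
    """Cluster-specific distance: d(x, c) = sqrt((x-c)^T Sigma^{-1} (x-c))."""
    diff = x - center
    return np.sqrt(diff @ np.linalg.solve(cov, diff))

x = np.array([1.0, 2.0])
c = np.array([0.0, 0.0])

# With Sigma = unit matrix the Mahalanobis distance equals the Euclidean distance
assert np.isclose(mahalanobis(x, c, np.eye(2)), np.linalg.norm(x - c))
```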
Additional Material: Neuro-Fuzzy Systems
Example: Automatic Transmission
Task: improve the VW automatic transmission
- no additional sensors
- individual adaptation of the shifting behavior
Idea (1995): the vehicle "observes" the driver and classifies him/her by sportiness
- calm, normal, sporty: determination of a sportiness factor from [0, 1]
- nervous: calming of the driver
Test vehicle:
- different drivers, classification by experts (co-drivers)
- simultaneous measurements: speed, position, speed of the accelerator pedal, angle of the steering wheel, … (14 attributes)
Modeling Imprecise Information with Fuzzy Sets
Degree of membership
Example: Continuously Adapting Gear Shift Schedule in the VW New Beetle
- Fuzzy controller with 7 rules (AG4 transmission)
- Optimized program
- Runtime 80 ms; a new sportiness factor is determined 12 times per second
- In series production within the VW group
- Learning of rule systems with the help of artificial neural networks, optimization with evolutionary algorithms
Example: Fuzzy Database
Successors for top management positions (top management – successors – talent bank – management)
Example: Automated Sensor-Based Landing
Neuro-Fuzzy Systems
- Building a fuzzy system requires
  - prior knowledge (fuzzy rules, fuzzy sets)
  - manual tuning: time-consuming and error-prone
- Therefore: support this process by learning
  - learning fuzzy rules (structure learning)
  - learning fuzzy sets (parameter learning)
- Approaches from neural networks can be used
Example: Prognosis of the Daily Proportional Changes of the DAX at the Frankfurt Stock Exchange (Siemens)
- Database: time series from 1986–1997
Fuzzy Rules in Finance
- Trend rule: IF DAX = decreasing AND US-$ = decreasing THEN DAX prediction = decrease WITH high certainty
- Turning point rule: IF DAX = decreasing AND US-$ = increasing THEN DAX prediction = increase WITH low certainty
- Delay rule: IF DAX = stable AND US-$ = decreasing THEN DAX prediction = decrease WITH very high certainty
- In general: IF x1 is μ1 AND x2 is μ2 THEN y = η WITH weight k
Classical Probabilistic Expert Opinion Pooling Method
- The decision maker (DM) analyzes each source (human expert, data + forecasting model) in terms of (1) statistical accuracy and (2) informativeness, by asking the source to assess quantities (quantile assessment)
- DM obtains a "weight" for each source
- DM "eliminates" bad sources
- DM determines the weighted sum of the source outputs
- Determination of "return on investment"
- E experts, R quantiles for N quantities: each expert has to assess R·N values
- statistical accuracy: how well the expert's stated quantiles match the observed realizations (calibration)
- information score: how concentrated the expert's assessed distributions are
- weight for expert e: combines statistical accuracy and information score
- output: weighted sum of the source outputs
- roi: return on investment achieved with the pooled prediction
Formal Analysis
- Sources of information:
  R1: rule set given by expert 1
  R2: rule set given by expert 2
  D: data set (time series)
- Operator schema:
  fuse(R1, R2): fuse two rule sets
  induce(D): induce a rule set from D
  revise(R, D): revise a rule set R by D
Formal Analysis
- Strategies:
  - fuse(fuse(R1, R2), induce(D))
  - revise(fuse(R1, R2), D)
  - fuse(revise(R1, D), revise(R2, D))
- Technique: neuro-fuzzy systems
  - Nauck, Klawonn, Kruse: Foundations of Neuro-Fuzzy Systems, Wiley, 1997
  - SENN (commercial neural network environment, Siemens)
Neuro-Fuzzy Architecture
From Rules to Neural Networks
1. Evaluation of membership degrees
2. Evaluation of rules (rule activity)
3. Accumulation of rule outputs and normalization
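The three evaluation steps can be sketched as follows; the triangular membership functions and the two-rule base are illustrative assumptions, not the architecture used at Siemens:

```python
def tri(x, a, b, c):
    """Triangular membership function with corners a <= b <= c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical rule base: antecedent fuzzy sets per input and crisp consequents
rules = [
    {"sets": [(0.0, 0.0, 0.5), (0.0, 0.0, 0.5)], "out": -1.0},  # both small -> decrease
    {"sets": [(0.5, 1.0, 1.0), (0.5, 1.0, 1.0)], "out": +1.0},  # both large -> increase
]

def infer(inputs):
    # 1. evaluate membership degrees, 2. rule activity (min),
    # 3. accumulate rule outputs and normalize
    acts = [min(tri(x, *mf) for x, mf in zip(inputs, r["sets"])) for r in rules]
    total = sum(acts)
    return sum(a * r["out"] for a, r in zip(acts, rules)) / total if total else 0.0
```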
The Semantics-Preserving Learning Algorithm
Reduction of the dimension of the weight space:
1. Membership functions of different inputs share their parameters.
2. Membership functions of the same input variable are not allowed to pass each other; they must keep their original order.
Benefits:
- the optimized rule base can still be interpreted
- the number of free parameters is reduced
Return-on-Investment Curves of the Different Models
Validation data from March 01, 1994 until April 1997
Neuro-Fuzzy Systems in Data Analysis
- Fuzzy system:
  - system of linguistic rules (fuzzy rules)
  - not rules in a logical sense, but function approximation
  - fuzzy rule = vague prototype / sample
- Neuro-fuzzy system:
  - adds a learning algorithm inspired by neural networks
  - feature: local adaptation of parameters
A Neuro-Fuzzy System
- is a fuzzy system trained by heuristic learning techniques derived from neural networks
- can be viewed as a 3-layer neural network with fuzzy weights and special activation functions
- is always interpretable as a fuzzy system
- uses constrained learning procedures
- is a function approximator (classifier, controller)
Learning Fuzzy Rules
- Cluster-oriented approaches: find clusters in the data; each cluster is a rule
- Hyperbox-oriented approaches: find clusters in the form of hyperboxes
- Structure-oriented approaches: use predefined fuzzy sets to structure the data space and pick rules from grid cells
Hyperbox-Oriented Rule Learning
- Search for hyperboxes in the data space
- Create fuzzy rules by projecting the hyperboxes
- Fuzzy rules and fuzzy sets are created at the same time
- Usually very fast
Hyperbox-Oriented Rule Learning
- Detect hyperboxes in the data (example: XOR function)
- Advantages over fuzzy cluster analysis:
  - no loss of information when hyperboxes are represented as fuzzy rules
  - not all variables need to be used; don't-care variables can be discovered
- Disadvantage: each fuzzy rule uses individual fuzzy sets, i.e. the rule base is complex
Structure-Oriented Rule Learning
- Provide initial fuzzy sets for all variables; the data space is partitioned by a fuzzy grid
- Detect all grid cells that contain data (approach by Wang/Mendel 1992)
- Compute best consequents and select best rules (extension by Nauck/Kruse 1995, NEFCLASS model)
Structure-Oriented Rule Learning
- Simple: rule base available after two cycles through the training data
  - 1st cycle: discover all antecedents
  - 2nd cycle: determine best consequents
- Missing values can be handled
- Numeric and symbolic attributes can be processed at the same time (mixed fuzzy rules)
- Advantage: all rules share the same fuzzy sets
- Disadvantage: fuzzy sets must be given
Learning Fuzzy Sets
- Gradient descent procedures are only applicable if differentiation is possible, e.g. for Sugeno-type fuzzy systems
- Otherwise: special heuristic procedures that do not use gradient information
- The learning algorithms are based on the idea of backpropagation
Learning Fuzzy Sets: Constraints
- Mandatory constraints:
  - fuzzy sets must stay normal and convex
  - fuzzy sets must not exchange their relative positions (they must not "pass" each other)
  - fuzzy sets must always overlap
- Optional constraints:
  - fuzzy sets must stay symmetric
  - degrees of membership must add up to 1.0
- The learning algorithm must enforce these constraints
Example: Medical Diagnosis
- Results from patients tested for breast cancer (Wisconsin Breast Cancer Data)
- Decision support: do the data indicate a malignant or a benign case?
- A surgeon must be able to check the classification for plausibility
- We are looking for a simple and interpretable classifier: knowledge discovery
Example: WBC Data Set
- 699 cases (16 cases have missing values)
- 2 classes: benign (458), malignant (241)
- 9 attributes with values from {1, …, 10} (ordinal scale, but usually interpreted as a numerical scale)
- Experiment: x3 and x6 are interpreted as nominal attributes
- x3 and x6 are usually seen as "important" attributes
Applying NEFCLASS-J
- Tool for developing neuro-fuzzy classifiers
- Written in Java
- Free version available for research
- Project started at the Neuro-Fuzzy Group of the University of Magdeburg, Germany
NEFCLASS: Neuro-Fuzzy Classifier
Network structure (top to bottom):
- output variables (class labels)
- unweighted connections
- fuzzy rules
- fuzzy sets (antecedents)
- input variables (attributes)
NEFCLASS: Features
- Automatic induction of a fuzzy rule base from data
- Training of several forms of fuzzy sets
- Processing of numeric and symbolic attributes
- Treatment of missing values (no imputation)
- Automatic pruning strategies
- Fusion of expert knowledge and knowledge obtained from data
Representation of Fuzzy Rules
Example: 2 rules
R1: if x is large and y is small, then class is c1.
R2: if x is large and y is large, then class is c2.
The connections x → R1 and x → R2 are linked: the fuzzy set large is a shared weight. That means the term large has always the same meaning in both rules.
1. Training Step: Initialisation
Specify initial fuzzy partitions for all input variables.
2. Training Step: Rule Base
Algorithm:
    for (all patterns p) do
        find antecedent A such that A(p) is maximal;
        if (A ∉ L) then add A to L;
    end;
    for (all antecedents A ∈ L) do
        find best consequent C for A;
        create rule base candidate R = (A, C);
        determine the performance of R;
        add R to B;
    end;
    select a rule base from B;
Variations: fuzzy rule bases can also be created by using prior knowledge, fuzzy cluster analysis, fuzzy decision trees, genetic algorithms, …
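The two loops above can be sketched in Python; the `membership` function and the grid of fuzzy-set labels are placeholders for the actual fuzzy partitions, and rule selection by performance is omitted:

```python
from collections import Counter, defaultdict

def best_antecedent(pattern, fuzzy_sets, membership):
    """Antecedent (one fuzzy set per input) with maximal degree of fulfilment."""
    return tuple(max(fuzzy_sets[i], key=lambda s: membership(i, s, x))
                 for i, x in enumerate(pattern))

def learn_rules(data, labels, fuzzy_sets, membership):
    # 1st cycle: discover all antecedents that cover data
    # 2nd cycle: choose the best consequent (most frequent class) per antecedent
    votes = defaultdict(Counter)
    for pattern, label in zip(data, labels):
        votes[best_antecedent(pattern, fuzzy_sets, membership)][label] += 1
    return {ante: counts.most_common(1)[0][0] for ante, counts in votes.items()}

# Toy usage: one input in [0, 1] with two fuzzy sets "small" and "large"
fuzzy_sets = [["small", "large"]]
membership = lambda i, s, x: 1 - x if s == "small" else x
print(learn_rules([[0.1], [0.2], [0.9]], ["A", "A", "B"], fuzzy_sets, membership))
```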
Selection of a Rule Base
• Order rules by performance.
• Either select the best r rules or the best r/m rules per class.
• r is either given or is determined automatically such that all patterns are covered.
Rule Base Induction
NEFCLASS uses a modified Wang–Mendel procedure.
Computing the Error Signal
The fuzzy error is computed for each output (jth output); from it, a rule error is derived for every rule and propagated back to the antecedent fuzzy sets.
3. Training Step: Fuzzy Sets
Example: triangular membership function; parameter updates for an antecedent fuzzy set.
Training of Fuzzy Sets
Heuristics: a fuzzy set is moved away from x (towards x) and its support is reduced (enlarged), in order to reduce (enlarge) the degree of membership of x.
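The heuristic can be sketched as follows for a triangular fuzzy set (a, b, c); the concrete update rule and the learning rate `eta` are illustrative assumptions, not the exact NEFCLASS formulas:

```python
def update_triangle(a, b, c, x, error, eta=0.1):
    """Heuristic update of a triangular fuzzy set (a, b, c) for a pattern x.

    error > 0: the membership degree of x should be enlarged -> move the
    set towards x and enlarge its support; error < 0: the opposite.
    """
    shift = eta * error * (c - a) * (1 if x > b else -1)
    b += shift                       # move the peak towards / away from x
    widen = eta * error * (c - a)
    a -= widen                       # enlarge (or reduce) the support
    c += widen
    return a, b, c

# Enlarge the membership of x = 1.5 for the fuzzy set (0, 1, 2)
print(update_triangle(0.0, 1.0, 2.0, x=1.5, error=0.5))
```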
Training of Fuzzy Sets
Algorithm:
    repeat
        for (all patterns) do
            accumulate parameter updates;
            accumulate error;
        end;
        modify parameters;
    until (no change in error);
Variations:
- adaptive learning rate
- online/batch learning
- optimistic learning (n-step look-ahead)
- observing the error on a validation set (the simple stopping criterion can end in a local minimum)
Constraints for Training Fuzzy Sets
- Valid parameter values
- Non-empty intersection of adjacent fuzzy sets
- Keep relative positions
- Maintain symmetry
- Complete coverage (degrees of membership add up to 1 for each element)
Correcting a partition after modifying the parameters.
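A sketch of such a correction step for a partition of triangular fuzzy sets; the concrete repair rules are assumptions, chosen only to illustrate enforcing valid parameters, preserved peak order, and overlap of neighbours:

```python
def correct_partition(triangles):
    """Re-establish constraints after a parameter update.

    triangles: list of (a, b, c) tuples in their original peak order.
    """
    fixed = []
    prev_peak = prev_right = None
    for a, b, c in triangles:
        a, b, c = sorted((a, b, c))          # valid parameter values
        if prev_peak is not None:
            b = max(b, prev_peak)            # fuzzy sets must not pass each other
            a = min(a, prev_right)           # non-empty intersection with neighbour
        fixed.append((a, b, c))
        prev_peak, prev_right = b, c
    return fixed

# The second set drifted left past the first one; the correction restores order
print(correct_partition([(0.0, 1.0, 2.0), (0.5, 0.8, 3.0)]))
```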
4. Training Step: Pruning
Goal: remove variables, rules and fuzzy sets, in order to improve interpretability and generalisation.
Pruning
Algorithm:
    repeat
        select pruning method;
        repeat
            execute pruning step;
            train fuzzy sets;
            if (no improvement) then undo step;
        until (no improvement);
    until (no further method);
Pruning methods:
1. Remove variables (use correlations, information gain, etc.)
2. Remove rules (use rule performance)
3. Remove terms (use degree of fulfilment)
4. Remove fuzzy sets (use fuzziness)
WBC Learning Result: Fuzzy Rules
R1: if uniformity of cell size is small and bare nuclei is fuzzy0 then benign
R2: if uniformity of cell size is large then malignant
WBC Learning Result: Classification Performance
Estimated performance on unseen data (cross validation):
- NEFCLASS-J: 95.42%
- NEFCLASS-J (numeric): 94.14%
- Discriminant analysis: 96.05%
- Multilayer perceptron: 94.82%
- C4.5: 95.10%
- C4.5 rules: 95.40%
WBC Learning Result: Fuzzy Sets
NEFCLASS-J
Resources
Detlef Nauck, Frank Klawonn & Rudolf Kruse: Foundations of Neuro-Fuzzy Systems. Wiley, Chichester, 1997, ISBN 0-471-97151-0
Neuro-fuzzy software (NEFCLASS, NEFCON, NEFPROX): http://www.neuro-fuzzy.de
Beta version of NEFCLASS-J: http://www.neuro-fuzzy.de/nefclassj
Download NEFCLASS-J
Download the free version of NEFCLASS-J at http://fuzzy.cs.uni-magdeburg.de
Conclusions
- Neuro-fuzzy systems can be useful for knowledge discovery.
- Interpretability enables plausibility checks and improves acceptance.
- (Neuro-)fuzzy systems exploit the tolerance for sub-optimal solutions.
- Neuro-fuzzy learning algorithms must observe constraints in order not to jeopardise the semantics of the model.
- They are not an automatic model creator; the user must work with the tool.
- Simple learning techniques support explorative data analysis.
Information Mining
- Information mining is the non-trivial process of identifying valid, novel, potentially useful, and understandable information and patterns in heterogeneous information sources.
- Information sources are databases, expert background knowledge, textual descriptions, images, sounds, …
Example: Line Filtering
- Extraction of edge segments (Burns' operator)
- Production net: edges → lines → long lines → parallel lines → runways
Example: Line Filtering
- Problems:
  - extremely many lines due to distorted images
  - long execution times of the production net
Example: Line Filtering
- Only few lines are used for runway assembly
- Approach:
  - extract textural features of lines
  - identify and discard superfluous lines
Example: Line Filtering
- Several classifiers: minimum distance, k-nearest neighbor, decision trees, NEFCLASS
- Problems: classes are overlapping and extremely unbalanced
- Result with a modified NEFCLASS:
  - all lines needed for runway construction were found
  - reduction to 8.7% of the edge segments
Surface Quality Control: Two Approaches
- Today's approach: surface quality control is currently done manually; an experienced worker treats the exterior surfaces with a grindstone. The experts classify surface form deviations by means of linguistic descriptions. This is cumbersome, subjective, error-prone, and time-consuming.
- The proposed approach: digitize the exterior body panel surface with an optical measuring system, and characterize the form deviation by mathematical properties that are close to the subjective properties the experts use in their linguistic descriptions.
Topometric 3-D Measuring System
Triangulation and grating projection; miniaturized projection technique; pixel coding (Gray code + phase shift).
Properties:
- high point density
- fast data collection
- high measurement accuracy
- contactless and non-destructive
Data Processing
1. 3-D data acquisition: 3-D point cloud z(x, y)
2. Data processing: approximation by a polynomial surface z̃(x, y); difference Δz(x, y); colour-coded visualization
3. Post-processing: detection of form deviation features
4. Analysis: feature calculation; classification (data mining)
Color-Coded Visualization: Result of Grinding
3-D Visualization of Local Surface Defects
- Uneven surface (several sink marks in series or adjoined)
- Sink mark (slight flat-based depression inward)
- Press mark (local smoothing of the (micro-)surface)
- Waviness (several heavier wrinklings in series)
Data Characteristics
- We analysed 9 master pieces with a total of 99 defects
- For each defect we calculated 42 features
- The defect types are rather unbalanced
- We discarded the rare classes
- We discarded some of the extremely correlated features (31 features left)
- We ranked the 31 features by importance
- We use stratified 4-fold cross validation for the experiment
Application and Results
The rule base for NEFCLASS.
Classification accuracy:
- Train set: NBC 89.0%, DTree 94.7%, NN 90%, NEFCLASS 81.6%, DC 46.8%
- Test set: 75.6%, 85.5%, 79.9%, 46.8%