Lattice Representation of Data
Dr. Alex Pogel
Physical Science Laboratory, New Mexico State University

Basic Idea
Replace tabular representation by lattice representation in order to reveal hierarchical structure
1. Basic definitions
2. Information in the lattice
3. Carving up epidemiological data
Ganter & Wille: Formal Concept Analysis (FCA)
Barwise & Seligman: Information Flow

Input data
Base data structure is a {0, 1}-table:
• a set G of objects (represented by rows) and
• a set M of attributes (represented by columns);
• an entry of 1 indicates that object g has attribute m.

Input data, mathematically
Mathematically speaking: a binary relation I from G to M, that is, a subset of G × M, interpreted as an indication of which objects g have which attributes m, via (g, m) ∈ I.
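The step from the {0, 1}-table to the relation I can be sketched in a few lines of Python. The objects, attributes and table entries below are a hypothetical toy context chosen for illustration; they are not the animal data used later in the slides, and the helper name has_attribute is likewise an assumption.

```python
# Hypothetical toy context (not the animal data from the slides).
G = ["sparrow", "trout", "dog"]                               # objects (rows)
M = ["flies", "lives in water", "fourlegged", "airbreather"]  # attributes (columns)

table = [
    [1, 0, 0, 1],  # sparrow
    [0, 1, 0, 0],  # trout
    [0, 0, 1, 1],  # dog
]

# The binary relation I as a subset of G x M: keep the pairs whose entry is 1.
I = {(g, m) for g, row in zip(G, table) for m, bit in zip(M, row) if bit == 1}


def has_attribute(g, m):
    """Return True when object g has attribute m, i.e. (g, m) is in I."""
    return (g, m) in I


print(sorted(I))
print(has_attribute("sparrow", "flies"))
```

Storing I as a set of pairs keeps the code close to the mathematical definition; a real tool would more likely use bit vectors per row.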

Key Definitions
The notion of “formal concept” is based on natural mappings that arise from the binary relation I [interpret G and M as before]:
• to each subset H of G, we associate the set a(H) of all attributes that the objects in H satisfy in common
  a: P(G) → P(M)
• to each subset N of M, we associate the set o(N) of all objects satisfying every attribute in N
  o: P(M) → P(G)
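A minimal sketch of the two mappings a and o, written directly from the definitions above. The context is again a hypothetical toy example, not the data from the slides.

```python
# Hypothetical toy context (illustration only).
G = {"sparrow", "trout", "dog"}
M = {"flies", "lives in water", "fourlegged", "airbreather"}
I = {
    ("sparrow", "flies"), ("sparrow", "airbreather"),
    ("trout", "lives in water"),
    ("dog", "fourlegged"), ("dog", "airbreather"),
}


def a(H):
    """Attributes shared by every object in H (all of M when H is empty)."""
    return {m for m in M if all((g, m) in I for g in H)}


def o(N):
    """Objects that satisfy every attribute in N (all of G when N is empty)."""
    return {g for g in G if all((g, m) in I for m in N)}


print(a({"sparrow", "dog"}))   # attributes common to sparrow and dog
print(o({"airbreather"}))      # objects that are airbreathers
```

Note that both mappings reverse inclusion: a larger set of objects has fewer attributes in common, and a larger set of attributes is satisfied by fewer objects.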

Key Definitions
The attribute subsets N of M such that a(o(N)) = N are called formal concepts in FCA, and are called closed sets in mathematics, as a(o(–)) is a closure operator on M.
A formal concept can be identified geometrically within a data table by reshuffling rows and columns such that
1. object-attribute relations are maintained and
2. a maximal rectangle of 1s appears.
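The closed sets can be enumerated by brute force: apply a(o(–)) to every subset of M and collect the distinct results. The sketch below does this for a hypothetical toy context; it is exponential in the number of attributes and is meant only to illustrate the definition, not as the algorithm used by any FCA tool.

```python
from itertools import combinations

# Hypothetical toy context, given as object -> set of attributes.
objects = {
    "sparrow": {"flies", "airbreather"},
    "trout": {"lives in water"},
    "dog": {"fourlegged", "airbreather"},
}
M = set().union(*objects.values())


def o(N):
    """Objects satisfying every attribute in N."""
    return {g for g, attrs in objects.items() if N <= attrs}


def a(H):
    """Attributes common to all objects in H (all of M when H is empty)."""
    return set(M).intersection(*(objects[g] for g in H))


def closure(N):
    """The closure a(o(N)) of an attribute set N."""
    return frozenset(a(o(set(N))))


# Close every subset of M; the distinct results are exactly the closed sets.
closed_sets = {
    closure(N)
    for r in range(len(M) + 1)
    for N in combinations(sorted(M), r)
}

for N in sorted(closed_sets, key=lambda s: (len(s), sorted(s))):
    # Each closed set N pairs with the objects o(N); together they form
    # a maximal rectangle of 1s in the reshuffled table.
    print(sorted(o(set(N))), sorted(N))
```

In this toy context {flies} is not closed, because every object that flies is also an airbreather, so its closure is {flies, airbreather}.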

Animal Context


Shuffling Reveals a Concept


BIRD is the (formal) concept


Closure System Arises
Taking all closed sets together we obtain a closure system [aka a topped intersection structure, in Davey-Priestley], which is always a complete lattice [an ordered set for which every subset has both a supremum and an infimum in the set].
Examples:
• the extended reals R ∪ {−∞, +∞} with ≤ (R alone is a lattice but not complete, since unbounded subsets have no supremum or infimum in R),
• P(S) with inclusion,
• any topology with inclusion, …
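The closure system can also be built from the bottom up, as the set M together with all intersections of object rows. The sketch below does this for a hypothetical toy context and shows how infimum and supremum are computed in the resulting lattice (ordered here by inclusion of attribute sets). The function names are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical attribute sets of three objects (illustration only).
M = frozenset({"flies", "lives in water", "fourlegged", "airbreather"})
object_intents = [
    frozenset({"flies", "airbreather"}),       # sparrow
    frozenset({"lives in water"}),             # trout
    frozenset({"fourlegged", "airbreather"}),  # dog
]


def closure_system(generators, top):
    """The top set plus every intersection of one or more generators."""
    closed = {top}
    for r in range(1, len(generators) + 1):
        for combo in combinations(generators, r):
            closed.add(frozenset.intersection(*combo))
    return closed


closed_sets = closure_system(object_intents, M)


def infimum(family):
    """Greatest lower bound: plain intersection (M for the empty family)."""
    return M.intersection(*family)


def supremum(family):
    """Least upper bound: the smallest closed set containing the union."""
    union = frozenset().union(*family)
    return frozenset.intersection(*[c for c in closed_sets if union <= c])


sparrow, trout, dog = object_intents
print(sorted(infimum([sparrow, dog])))   # shared attributes
print(sorted(supremum([sparrow, dog])))  # smallest closed set above both
```

The supremum is generally larger than the union: the union of two closed sets need not be closed, so it has to be closed again.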

Focus on attribute logic


Full list: difficult, redundant
All implications that hold for the data, with up to three attributes in their premise; 125 with positive support.

Duquenne-Guigues Basis
20 implications generate the full list and serve as a basis (analogy with linear algebra); ordered by support value.
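What it means for an implication to hold for the data, and to have positive support, can be sketched as follows. The slides do not define support; the sketch assumes it is the fraction of objects that satisfy the premise. The context is a hypothetical toy example, and the sketch only checks single implications; it does not compute the Duquenne-Guigues basis.

```python
# Hypothetical toy context, given as object -> set of attributes.
objects = {
    "sparrow": {"flies", "airbreather"},
    "trout": {"lives in water"},
    "dog": {"fourlegged", "airbreather"},
}


def extent(P):
    """Objects that have every attribute in the premise P."""
    return {g for g, attrs in objects.items() if set(P) <= attrs}


def holds(P, Q):
    """P implies Q holds when every object with all of P also has all of Q."""
    return all(set(Q) <= objects[g] for g in extent(P))


def support(P):
    """Assumed definition: fraction of objects that satisfy the premise P."""
    return len(extent(P)) / len(objects)


print(holds({"flies"}, {"airbreather"}), support({"flies"}))
print(holds({"airbreather"}, {"flies"}))
# A premise that no object satisfies makes the implication hold vacuously,
# with support 0; the "positive support" filter removes such implications.
print(holds({"flies", "lives in water"}, {"fourlegged"}),
      support({"flies", "lives in water"}))
```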

Full list, basis, and original data


Implication Reads Upwards
At top right: warm-blooded implies airbreather.
1st in basis: high support indicated in lime green.

A Subinterval of the lattice
• fourlegged implies airbreather
• pet implies warm-blooded (iguana?)
• fur implies fourlegged and warm-blooded (platypus?)

Original data preserved
Animals 26 and 27 share the attributes “lives in water”, “is warm-blooded” and “is an airbreather”.


Color-coded support
The similarity in color between “livestock” and the concept node below it yields the association rule livestock implies fur, with 79% confidence and 11% support (bottom).
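Confidence and support of an association rule can be computed from object counts, as in the sketch below. The five animals are hypothetical and do not reproduce the 79% and 11% figures from the slide. The definitions used are the usual ones: confidence is the share of objects satisfying the premise that also satisfy the conclusion, and support is the share of all objects that satisfy both.

```python
# Hypothetical toy data, given as object -> set of attributes.
objects = {
    "cow": {"livestock", "fur"},
    "sheep": {"livestock", "fur"},
    "hen": {"livestock"},
    "cat": {"fur"},
    "trout": set(),
}


def count(attrs):
    """Number of objects that have every attribute in attrs."""
    return sum(1 for row in objects.values() if attrs <= row)


def confidence(P, Q):
    """Share of objects with P that also have Q (requires count(P) > 0)."""
    return count(P | Q) / count(P)


def support(P, Q):
    """Share of all objects that have both P and Q."""
    return count(P | Q) / len(objects)


rule = ({"livestock"}, {"fur"})
print(f"confidence = {confidence(*rule):.2f}, support = {support(*rule):.2f}")
```

An implication in the earlier sense is the special case of an association rule with 100% confidence.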

Visual Vocabulary
Small subdiagrams (specifically, meet-subsemilattices) can be recognized as complex sentences.

3 unordered attribute concepts: a, b, c
Note: the top element is really irrelevant, but adding it makes everything we’ll look at a lattice instead of just a meet semilattice (definition: an ordered structure closed under finite meet (glb)).

Here’s the best known outcome
No non-trivial implications among a, b, c.

W over V
a & c implies b

Diamond in diamond
Under condition c, a and b are equivalent.

Convergence
Any two of a, b, c imply the third.

Two Complex Sentences
So, we can read that:
• for nocturnal animals and pets, the attributes fourlegged and warm-blooded are equivalent, and
• the only implication between the attributes “nocturnal”, “fur” and “pet” is: pet and nocturnal implies fur.

The Hague, Netherlands


Before Freese improvement


After Freese improvement


Apparent Splits


Eliminating Light Smokers


Why no object names?


Lung Cancer and Smoking
Nearly half of these 30+ year smokers have lung cancer.

Bird-keeping and Smoking
Association rules involving bird-keeping and smoking

Limitations as KDD Process
• Needs attention to data preparation
• Needs more built-in verification of discovered rules
• No domain-specific constructions (advantage?)
• Does not scale without clustering (universal?)

Epidemiological functions
Plan to add odds ratio calculation, via click

                 Lung Cancer   No Lung Cancer
BirdKeep Yes          33             34
BirdKeep No           16             64

OR = 3.9
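The odds ratio on this slide follows from the standard cross-product formula for a 2 × 2 table, OR = (33 × 64) / (34 × 16). A minimal sketch of the calculation is given below; the function and parameter names are illustrative assumptions, not part of the tool described in the slides.

```python
def odds_ratio(exposed_cases, exposed_noncases,
               unexposed_cases, unexposed_noncases):
    """Cross-product odds ratio of a 2 x 2 table (a*d) / (b*c)."""
    return (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases
    )


# Counts from the slide: bird keepers 33 with and 34 without lung cancer,
# non-keepers 16 with and 64 without lung cancer.
or_birds = odds_ratio(33, 34, 16, 64)
print(round(or_birds, 1))  # 3.9
```

The function divides by zero when either off-diagonal cell is empty; a production version would need to handle that case.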

Clustering for too large lattices


Support for improvement
Traditional diagram improvement algorithms are based solely upon the order structure.
We are now moving towards the inclusion of support values in these algorithms.
I will talk about this topic in detail in July, here at DIMACS, as part of the Applications of Lattice Theory workshop.
END