Analysing Large Data Sets using Formal Concept Lattices


Simon Andrews and Constantinos Orphanides
{s.andrews, c.orphanides}@shu.ac.uk
Conceptual Structures Research Group
Communication and Computing Research Centre

Acknowledgement
This work is part of the CUBIST project ("Combining and Uniting Business Intelligence with Semantic Technologies"), funded by the European Commission's 7th Framework Programme of ICT, under topic 4.3: Intelligent Information Management.

Data Sets
• A variety of data sets can be converted into formal contexts:
  – Data Discretization
  – Data Booleanization
• However, issues arise:
  – Data of modest size can contain hundreds (of thousands) of formal concepts, resulting in unmanageable and unreadable concept lattices.
  – Density of, and noise in, a context: factors that increase the number of formal concepts.
  – Computation of formal concepts cannot be carried out, by much of the existing software, on a large scale.
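As a rough illustration of Booleanization, the sketch below turns a small many-valued table into a formal context by creating one Boolean attribute per attribute/value pair. The function name `booleanize` and the three sample records are hypothetical and are not taken from the slides or from FcaBedrock.

```python
def booleanize(rows, attributes):
    """Nominal scaling: one Boolean formal attribute per (attribute, value) pair."""
    return {
        obj: {f"{a}={record[a]}" for a in attributes}
        for obj, record in rows.items()
    }

# Hypothetical many-valued records (illustrative only).
rows = {
    "m1": {"habitat": "woods", "population": "several"},
    "m2": {"habitat": "grasses", "population": "several"},
    "m3": {"habitat": "woods", "population": "solitary"},
}

context = booleanize(rows, ["habitat", "population"])
formal_attributes = sorted(set().union(*context.values()))

for obj, attrs in context.items():
    print(obj, sorted(attrs))
```

Two many-valued attributes with two values each become four formal attributes, which is why even modest data sets can yield wide contexts.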

Tools By The Authors
• FcaBedrock (Formal Context Creator)
  – Creates sub-contexts by restricting the conversion of the data to information of interest.
• In-Close (Fast Concept Miner)
  – Removes relatively small concepts from a context to reduce "noise".
⇨ Production of readable, yet still meaningful, concept lattices.

FcaBedrock - Overview
• A Formal Context Creator for Formal Concept Analysis, developed by the authors.
• Free and open-source at SourceForge.
• Input files supported: flat-file CSV and three-column CSV (triples).
• Output files supported: Burmeister (.cxt) and FIMI (.dat).
• User-guided automation: the user has the final say on how to interpret a data set.
• Attributes supported: Categorical (aka many-valued, nominal/ordinal), Boolean and Continuous.
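For readers unfamiliar with the Burmeister format, here is a minimal sketch of a writer. It assumes the commonly described layout (a "B" line, a context name line, the object and attribute counts, a blank line, the object names, the attribute names, then one row of X and . characters per object). That layout is an assumption stated here, not something taken from the slides, and `to_burmeister` is a hypothetical helper rather than FcaBedrock code.

```python
def to_burmeister(objects, attributes, incidence, name=""):
    """Serialise a formal context in the (assumed) Burmeister .cxt layout."""
    lines = ["B", name, str(len(objects)), str(len(attributes)), ""]
    lines += objects
    lines += attributes
    for obj in objects:
        # One character per attribute: X if the object has it, . otherwise.
        lines.append(
            "".join("X" if (obj, a) in incidence else "." for a in attributes)
        )
    return "\n".join(lines) + "\n"

text = to_burmeister(
    ["o1", "o2"],
    ["a", "b", "c"],
    {("o1", "a"), ("o1", "c"), ("o2", "b")},
)
print(text)
```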

FcaBedrock - Overview
• Auto-detection of metadata, directly from the data set, if desired.
• Support of both discrete (0-10, 10-20, …) and progressive (>10, >20, …) scaling for continuous attributes.
• Ability to exclude attributes from the analysis.
• Ability to restrict the analysis to user-specified attribute values.
• Metadata of each conversion/analysis saved and stored for subsequent conversions.
• Repetition of metadata for similar attributes.
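The difference between discrete and progressive scaling can be sketched as follows. With discrete scaling a value falls into exactly one range, whereas with progressive scaling it satisfies every threshold it exceeds. The function names and the half-open treatment of range boundaries are assumptions made for this illustration; the slides do not say how FcaBedrock handles boundary values.

```python
def discrete_scale(value, boundaries):
    """Return the single range attribute containing value (low <= value < high)."""
    for low, high in zip(boundaries, boundaries[1:]):
        if low <= value < high:
            return {f"{low}-{high}"}
    return set()  # value lies outside all ranges

def progressive_scale(value, thresholds):
    """Return every threshold attribute that value exceeds."""
    return {f">{t}" for t in thresholds if value > t}

print(discrete_scale(25, [0, 10, 20, 30]))
print(progressive_scale(25, [10, 20, 30]))
```

Progressive scaling produces nested attributes (everything that is >20 is also >10), which shows up as a chain in the resulting lattice.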

In-Close - Overview
• A fast Concept Miner for Formal Concept Analysis, developed by one of the authors.
• Free and open-source at SourceForge.
• Input files supported: Burmeister (.cxt).
• Minimum support for intent and extent.
• Output of analysis data and concepts.
• Output of sub-context ("noise" reduction).
• Fast computation of formal concepts:
  – Mining 1 million concepts per second.
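To make "minimum support for intent and extent" concrete, the sketch below enumerates the formal concepts of a tiny context and keeps only those meeting both minimums. It uses a naive intersection-based enumeration, not the In-Close algorithm, and the context and the name `mine_concepts` are hypothetical.

```python
def mine_concepts(context, attributes, min_intent=0, min_extent=0):
    """Naive concept enumeration with minimum-support filtering."""
    # Every intent is an intersection of object intents (or the full attribute set).
    intents = {frozenset(attributes)}
    for obj_attrs in context.values():
        intents |= {intent & obj_attrs for intent in intents}

    concepts = []
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        if len(intent) >= min_intent and len(extent) >= min_extent:
            concepts.append((extent, intent))
    return concepts

# Hypothetical three-object context.
context = {
    "o1": frozenset({"a", "b"}),
    "o2": frozenset({"b", "c"}),
    "o3": frozenset({"a", "b", "c"}),
}
all_concepts = mine_concepts(context, {"a", "b", "c"})
supported = mine_concepts(context, {"a", "b", "c"}, min_intent=2, min_extent=2)
print(len(all_concepts), len(supported))
```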

Analysis of Sub-Contexts: Agaricus-Lepiota
• Data Set: Agaricus-Lepiota (aka Mushroom)
  – From UCI Machine Learning Repository
  – 8,124 objects (mushrooms)
  – 23 attributes (mushroom properties)
    • e.g. stalk shape, cap color, edible/poisonous…
  – Attribute types: Categorical, Boolean
  – Processed by In-Close: 220,000+ concepts

Analysis of Sub-Contexts: Agaricus-Lepiota
• Let us say we are interested in the relationship between mushroom habitat and population type.
• Using FcaBedrock:
  – Create a sub-context by only converting the habitat and population type attributes.
⇨ Down to 33 formal concepts (from 220,000+) and 13 formal attributes (from 125).

Visualisation of the sub-context in ConExp

Analysis of Sub-Contexts: Census Income
• Data Set: Census Income (aka Adult)
  – From UCI Machine Learning Repository
  – 32,561 objects (adults)
  – 14 attributes (census data)
    • e.g. age, sex, education, employment type…
  – Attribute types: Categorical, Boolean, Continuous
  – Processed by In-Close: 100,000+ concepts

Analysis of Sub-Contexts: Census Income
• Let us say we are interested in comparing how pay is affected by gender in adults who have had a higher education.
• Using FcaBedrock:
  – Create a sub-context by only converting the sex, class and education attributes.
  – Convert only those objects (adults) with the education attribute value Bachelors, Masters or Doctorate.
⇨ Down to 7,941 objects and 37 formal concepts.
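The two restrictions used here (converting only some attributes, and converting only objects with particular attribute values) can be sketched as a single filtering step. The records below are invented for illustration and are not rows from the Census Income data set; `sub_context` is a hypothetical helper, not part of FcaBedrock.

```python
def sub_context(records, keep_attributes, allowed_values):
    """Keep objects matching allowed_values, converting only keep_attributes."""
    context = {}
    for obj, record in records.items():
        # Object restriction: every constrained attribute must hold an allowed value.
        if all(record[a] in values for a, values in allowed_values.items()):
            # Attribute restriction: only the attributes of interest are converted.
            context[obj] = {f"{a}={record[a]}" for a in keep_attributes}
    return context

# Invented records (illustrative only).
records = {
    "a1": {"sex": "Male", "class": ">50K", "education": "Masters", "age": 41},
    "a2": {"sex": "Female", "class": "<=50K", "education": "HS-grad", "age": 23},
    "a3": {"sex": "Female", "class": ">50K", "education": "Doctorate", "age": 52},
}

higher_ed = sub_context(
    records,
    keep_attributes=["sex", "class", "education"],
    allowed_values={"education": {"Bachelors", "Masters", "Doctorate"}},
)
print(sorted(higher_ed))
```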

Visualisation of the sub-context in ConExp

In-Close: Concept Reduction
• Using FcaBedrock's context reduction:
  – Attributes of no particular interest can be excluded from the analysis (attribute exclusion).
  – We can convert only those objects with specific attribute values (object exclusion).
• Introducing In-Close's concept reduction:
  – Using the well-known idea of minimum support
    • Specifying a minimum number of objects and/or attributes for a concept.
⇨ Reduction of 'noise' in a context.

In-Close: Concept Reduction
• 'Noise': concepts whose number of attributes or objects is smaller than the user-defined minimums.
• Reduction of 'noise' achieved by:
  – Semi-automated form of lattice 'iceberging'.
    • Complete hierarchy maintained in the lattice.
  – Mining a context for concepts that satisfy a minimum support and then re-writing the context using only those concepts.
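One plausible reading of "re-writing the context using only those concepts" is that an object/attribute pair survives only if it is covered by at least one concept that meets the minimums. The sketch below follows that reading; it is an assumption about the procedure, not In-Close's code, and the hand-written concept list and the name `quiet_incidence` are hypothetical.

```python
def quiet_incidence(concepts, min_intent, min_extent):
    """Rebuild the incidence relation from concepts that meet both minimums."""
    kept = [
        (extent, intent)
        for extent, intent in concepts
        if len(intent) >= min_intent and len(extent) >= min_extent
    ]
    # A cell (object, attribute) survives if some kept concept covers it.
    return {(o, a) for extent, intent in kept for o in extent for a in intent}

# A few hand-written concepts as (extent, intent) pairs (illustrative only).
concepts = [
    ({"o1", "o3"}, {"a", "b"}),
    ({"o2", "o3"}, {"b", "c"}),
    ({"o3"}, {"a", "b", "c"}),
    ({"o1", "o2", "o3"}, {"b"}),
    ({"o4"}, {"d"}),  # a small, "noisy" concept
]

quiet = quiet_incidence(concepts, min_intent=2, min_extent=2)
print(sorted(quiet))
```

In this toy case the isolated pair (o4, d) disappears from the rewritten context, while every cell belonging to a sufficiently large concept is kept.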

A Student Survey Example
• Student survey data
  – Demographic and 'problem' data from 587 university undergraduates.
  – Yes/No responses to 36 problems that a student may have experienced during their studies:
    • missing lectures, low performance, etc.
• Noisy data set:
  – 145 formal attributes
  – Processed by In-Close: 22,760,243 concepts!

A Student Survey Example
• Let us say we are only interested in analysing the 'problem' data.
• Using FcaBedrock:
  – Convert only these attributes, exclude demographics.
  – Remaining concepts: 339,672
    • Significant reduction, but still too many!
• Adding In-Close to the equation:
  – Set minimum size of intent to 4 and minimum size of extent to 80.
⇨ Remaining concepts: 32!

Visualisation of the sub-context in ConExp

Comparing Quiet Sub-Contexts
• Data Set: Agaricus-Lepiota (aka Mushroom)
• Using FcaBedrock:
  – Create two sub-contexts: one for edible mushrooms and one for poisonous mushrooms.
• Using In-Close (for each sub-context):
  – Set minimum size of intent to 10.
⇨ 2,848 objects and 17 concepts for the edible sub-context; 3,344 objects and 14 concepts for the poisonous sub-context.

Comparing Quiet Sub-Contexts
• Similarities between the two lattices:
  – Attributes expressed in both lattices were moved to the right of each lattice.
• Differences between the two lattices:
  – Attributes expressed in only one lattice were moved to the left.
⇨ Clear visualisation for comparison.
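Deciding which attributes go to the right (shared) and which to the left (unique) amounts to a set comparison of the attributes that appear in each lattice. The sketch below shows that comparison only; the attribute names are illustrative placeholders rather than the attributes in the authors' lattices, and the layout itself would still be done by hand in the lattice viewer.

```python
def compare_attribute_sets(first, second):
    """Split attributes into shared, only-in-first and only-in-second."""
    return first & second, first - second, second - first

# Placeholder attribute sets (illustrative only).
edible_attrs = {"veil-type=partial", "gill-attachment=free", "odor=none"}
poisonous_attrs = {"veil-type=partial", "gill-attachment=free", "odor=foul"}

shared, only_edible, only_poisonous = compare_attribute_sets(
    edible_attrs, poisonous_attrs
)
print("shared:", sorted(shared))
print("edible only:", sorted(only_edible))
print("poisonous only:", sorted(only_poisonous))
```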

Edible mushroom lattice in ConExp

Poisonous mushroom lattice in ConExp

Conclusion
• Large data sets may be difficult to deal with computationally, but:
  – It is the number of formal concepts derived from a data set that is the key factor in determining if a concept lattice will be useful as a visualisation.
• Readable lattices can be produced with a straightforward process of:
  – creating sub-contexts
  – reducing noise
• Freely available software.
• Burmeister (.cxt) format used to successfully interoperate between the three FCA tools.

Thank you very much. Questions?
