DATA TYPES AND QUANTITATIVE DATA ANALYSIS PRESENTED TO THIRD-TRIMESTER YEAR 1


DATA
• Information expressed qualitatively or quantitatively
• Data are measurements of characteristics
• Measurements are functions that assign values in qualitative or quantitative form
• Characteristics are referred to as variables, e.g. height, weight, sex, tribe, etc.

VARIABLES AND DATA TYPES
• Variable as characterization of an event
• Classification of variables
  – Qualitative: usually categorical; values/members fall into one of a set of mutually exclusive and collectively exhaustive classes, e.g. sex, crop variety, animal breed, source of water, type of house
  – Quantitative: numeric values possessing an inherent order
    – Discrete: e.g. number of children/farmers/animals, etc.
    – Continuous: e.g. height, weight, distance, etc.
  – Random and fixed

Data Types
• Scales of measurement
  – Nominal
  – Ordinal
  – Interval
  – Ratio
• Levels of measurement are distinguished on the basis of the following criteria:
  – Magnitude or size
  – Direction
  – Distance or interval
  – Origin
  – Equality of points
  – Ratios of intervals
  – Ratio of points

NOMINAL DATA
• Example: Sex (Gender) coded M, F or 0, 1
• 'Numbers' simply identify, classify, categorize or distinguish; the score has no size or magnitude
• Score has equality, because two subjects are similar (equal) if they have the same number
• Weakest (poorest) level of measurement
• Arithmetic operations CANNOT be performed on nominal data types

ORDINAL DATA
• Associated with qualitative random variables
• Generated from ranked responses (or from a counting process)
• Have the properties of nominal data, in addition to DIRECTION
• Numeric or non-numeric
• Next to nominal in terms of weakness; arithmetic operations must be avoided
• Examples: knowledge (low, average, high), socio-economic status, attitude, opinion (like, dislike, strongly dislike), etc.

INTERVAL and RATIO
INTERVAL
– Numeric; have magnitude or size, direction, distance or interval, and an origin
– The interval scale has no absolute zero: its zero point depends on the system of measurement [0 °C is not the same temperature as 0 °F]
– E.g. temperature in degrees Fahrenheit or Celsius
RATIO
– E.g. weight of cassava in kilograms or pounds
– Numeric; have magnitude or size, direction, distance or interval, and an origin
– An absolute origin exists and is not system dependent
– All arithmetic operations can be performed on such data types

DATA COLLECTION PROCESSES
• Processes include (not mutually exclusive):
  – Routine records
  – Survey data
  – Experimental data

ROUTINE (MONITORING) DATA
• Data periodically recorded, essentially for administrative use of the establishment and for studying trends or patterns
• Examples: medical records, meteorological data
• Some statistical analysis of the data is possible, for description and prescription
• Cheap data, but planning could be haphazard

EXPERIMENTAL DATA
• Treatments are the investigated factors of variation
• Treatments are controlled by the designer
• Treatment levels may be fixed, random, qualitative or quantitative
• Comparative experimental data require inductive analysis
• Emphasis is on inference, including estimation of effects and tests of hypotheses

SURVEY DATA COLLECTION
• Information on characteristics, opinions, attitudes, tendencies, activities or operations of the individual units of the population
• Based on a small subset of the population
• Can be planned; preference is for random surveys
• Researcher or investigator has no (or must not exercise) control over the respondent or data

Which procedure to use?
• Depends on study objectives
• All 3 procedures are possible while in the community
• Monitoring and Survey procedures will be most used during the first year
• We discuss SURVEY further

SAMPLING (SURVEY) METHODS
• Ensure units of the population have the same chance of being in the sample
Sampling Types
• Probability sampling: the selection of sampling units is according to a probability (random & non-random) scheme
• Non-probability sampling: the selection of samples is not objectively made, but is influenced a great deal by the sampler. Examples: haphazard selection and use of volunteers
• Preference is for probability sampling, but the situation may determine otherwise

SYSTEMATIC SAMPLING
Procedure
• Sampling units are selected according to a pre-determined pattern
• For instance, a sampling intensity of 10% from a population of 100 numbered trees or units (strips, etc.) might require observing 1 out of every 10 trees (units, strips) in an ordered manner or sequence

Selection in Systematic Procedure
• E.g. if by some process, random or non-random, the 3rd tree (unit or strip) is selected first, then the 13th, 23rd, 33rd, 43rd, ..., 93rd trees (units, strips) will accordingly be selected. Strictly, this type of selection, as illustrated with the population of 100 trees (units), involves only one sample.
• Improve by selecting the 1st unit randomly from 1 to 10, or 1 to 100, and by MULTIPLE random starts
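
The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function names `systematic_sample` and `multiple_random_starts` are hypothetical, and the seed is fixed merely so that the run is repeatable.

```python
import random


def systematic_sample(population_size, interval, start=None, rng=random):
    """Return the 1-based unit numbers picked at a fixed interval.

    `start` is the first unit selected; if omitted, it is drawn at random
    from 1..interval, which is the improvement the slide recommends.
    """
    if start is None:
        start = rng.randint(1, interval)
    if not 1 <= start <= interval:
        raise ValueError("start must lie between 1 and interval")
    return list(range(start, population_size + 1, interval))


def multiple_random_starts(population_size, interval, n_starts, rng=random):
    """Draw several systematic samples, each from a distinct random start."""
    starts = rng.sample(range(1, interval + 1), n_starts)
    return [systematic_sample(population_size, interval, s) for s in sorted(starts)]


# 10% sampling intensity from 100 numbered trees, starting at tree 3.
fixed_start = systematic_sample(100, 10, start=3)
print(fixed_start)  # [3, 13, 23, 33, 43, 53, 63, 73, 83, 93]

# Two independent random starts (seeded only so the run is repeatable).
two_starts = multiple_random_starts(100, 10, n_starts=2, rng=random.Random(1))
print(two_starts)
```

With a single start the ten trees form one sample only; drawing several starts gives several samples, which is what makes an estimate of sampling error possible.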

Applications of Systematic Sampling
• Population is unknown
• Baseline studies on spatial distribution patterns of a population
• Baseline studies on extent/distribution of pests, pathogens, etc.
• Mapping purposes
• Regeneration studies

Advantages of Systematic Sampling
• Easy to set up
• Relative speed in data collection
• Total coverage of the population is assured
• Good base for future designs, as the position of characters can easily be mapped (with known coordinates)
• Demarcation of units is not necessary, as sampling units are defined by the first unit

Disadvantages of Systematic Sampling
• With only one random observation, the sampling error is not valid
• Unknown trend(s) in the population can influence results adversely [Examples: topography, season, sampling interval]

Avoiding the disadvantages
• The first major disadvantage, on sampling error, can be rectified by introducing multiple random starts through stratification of the population
• The second problem, of trend, is more difficult, but simply relates to the choice of the sampling interval

Simple/Unrestricted Random Sampling
• Unlike systematic sampling, sampling units need not be equally spaced
• We shall define this as the sampling procedure which ensures equal probability for all samples of the same size (without any restriction imposed on the selection process)

Illustration of SRS
• Given a population of size N from which a sample of size n will be drawn, the number of possible ways of obtaining the sample is N C n = N!/{(N - n)! n!}
• Suppose a population is known to have 5 units, and a sample of size 3 is required. From this population of 5 units, there are 10 possible ways of obtaining a sample of size 3. [The formula is 5 C 3 = 5!/{(5 - 3)! 3!} = 10]
• Each of these combinations is unique and has the same chance (1/10) of being selected
• Thus SRS is a random sampling procedure where each sample of size n has the same probability of selection

SRS selection process
• (i) Select randomly one 'sample combination' from the numbers 1 to 10 (as there are 10 possible combinations)
• (ii) Use the table of random numbers to select 3 numbers from 1 to 5, or select three numbers from a 'hat' containing all five numbers. This option seems easier and more practicable than (i)
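
As a rough illustration of the 5-unit example, the sketch below counts the possible samples and then carries out both selection options. Python's pseudo-random generator stands in for the table of random numbers or the 'hat'; that substitution, the variable names and the seed are assumptions made for the example.

```python
import itertools
import math
import random

units = [1, 2, 3, 4, 5]   # population of N = 5 numbered units
n = 3                      # required sample size

# Number of distinct samples: 5 C 3 = 5! / {(5 - 3)! 3!}
n_possible = math.comb(len(units), n)
all_samples = list(itertools.combinations(units, n))
print(n_possible)          # 10

rng = random.Random(42)    # seeded only so the run is repeatable

# Option (i): pick one of the 10 listed combinations at random.
option_i = rng.choice(all_samples)

# Option (ii): draw 3 distinct numbers "from the hat", without replacement.
option_ii = sorted(rng.sample(units, n))

print(option_i, option_ii)
```

Option (ii) never needs the full list of combinations, which is why it stays practicable when N is large.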

Summary - SRS
• Application: applied when the population is known to be homogeneous. The procedure is suitable for units defined by plot sizes
• Advantage: easy to apply, though not as easy as the systematic procedure
• Disadvantage: requires knowledge of all the units in the population (construction of the frame is necessary)

STRATIFIED RANDOM SAMPLING
• Requires dividing the population into non-overlapping homogeneous units, which are called STRATA
• SRS is then applied to each stratum, hence stratified random sampling (STRS)
• Examples of strata types or criteria are ages of plantation, species types, aspect, topography/altitude, farm types, habitat
• Dividing the population into such homogeneous units usually leads to better estimates of the desired population parameters

Where/when to apply Stratified RS
• Very suitable for heterogeneous areas (or units) that can be identified and classified into homogeneous entities
• Supplementary information, e.g. remote sensing and aerial photographs, is useful for stratification
• Choice of strata should ensure that the variation between units within strata is less than the variation between strata

Advantages/Disadvantages of STRS
Advantages
• Estimates are more precise
• Separate estimates and inferences for strata are possible
Disadvantages
• Sample size depends on the type of allocation to be used
• Sampling is likely to be more efficient in some strata than in others
• Errors in strata classification affect the overall estimate
• Frame construction for each stratum is required

Allocation of units (n) to strata
• Equal allocation: an equal (same) number of units is collected from each stratum
• Proportional allocation: the number of units per stratum is proportional to the size of the stratum
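
The two allocation rules can be sketched as follows. The strata names and sizes are invented purely for illustration, and the largest-remainder rounding used for proportional allocation is one reasonable choice, not something the slides prescribe.

```python
def equal_allocation(n, strata_sizes):
    """Give every stratum the same number of units (remainder to the first strata)."""
    base, extra = divmod(n, len(strata_sizes))
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(strata_sizes)}


def proportional_allocation(n, strata_sizes):
    """Allocate n units in proportion to stratum size: n_h = n * N_h / N."""
    total = sum(strata_sizes.values())
    exact = {name: n * size / total for name, size in strata_sizes.items()}
    alloc = {name: int(value) for name, value in exact.items()}
    # Hand any units lost to rounding down to the largest fractional parts.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical strata (number of plots in each); not taken from the slides.
strata = {"young plantation": 200, "mature plantation": 300, "natural forest": 500}

equal = equal_allocation(60, strata)
proportional = proportional_allocation(60, strata)
print(equal)         # 20 units in every stratum
print(proportional)  # 12, 18 and 30 units respectively
```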

ANALYSING QUALITATIVE DATA
• Qualitative data are essentially labels of a categorical variable
• Statistical analyses involve totals, percentages and conversion to pie charts and bar charts (bar graphs)
• Sophisticated analyses include categorical modelling

EXAMPLE

House    Frequency   Percent   Degrees (of 360)
A = 1    36          72%       259.2
B = 2    10          20%       72.0
C = 3    4           8%        28.8
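
The percentage and pie-chart angle for each category follow directly from the frequencies. The short sketch below computes them exactly, so the angles may differ slightly from values that have been rounded to whole degrees; the variable names are illustrative.

```python
frequencies = {"A": 36, "B": 10, "C": 4}   # house categories from the example
total = sum(frequencies.values())          # 50 houses in all

rows = []
for label, count in frequencies.items():
    percent = 100 * count / total          # share of the total
    degrees = 360 * count / total          # slice angle on a pie chart
    rows.append((label, count, percent, degrees))
    print(f"{label}: {count:>3}  {percent:5.1f}%  {degrees:6.1f} degrees")
```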

You can have multiple bar graphs (i.e., more than one variable can be illustrated on a bar chart).
[Figure: example of a multiple bar chart]

Contingency Table
This involves count summaries for 2 or more categories placed in row-column format.
Example of a 2 by 3 contingency table:

                Group
Gender      A     B     C
Male        36    10    4
Female      34    28    2

Assess association between Gender & Group
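
One common way to assess the association is Pearson's chi-square test of independence. The slides do not name a test, so treat the choice as an assumption. The sketch below computes the statistic by hand and compares it with the tabulated 5% critical value for 2 degrees of freedom (5.991).

```python
observed = {
    "Male": [36, 10, 4],
    "Female": [34, 28, 2],
}
groups = ["A", "B", "C"]

row_totals = {gender: sum(counts) for gender, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

chi_square = 0.0
min_expected = float("inf")
for gender, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected count if Gender and Group were independent.
        expected = row_totals[gender] * col_totals[j] / grand_total
        min_expected = min(min_expected, expected)
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(groups) - 1)
CRITICAL_5PCT_DF2 = 5.991  # tabulated chi-square value for df = 2 at the 5% level

print(f"chi-square = {chi_square:.2f} on {df} degrees of freedom")
print("association indicated at 5%:", chi_square > CRITICAL_5PCT_DF2)
print(f"smallest expected count = {min_expected:.2f}")
```

The expected counts in Group C fall below 5, so the chi-square approximation is rough here and the result is best read as indicative.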

ANALYSING QUANTITATIVE DATA
• Basic analyses involve determining the CENTRE and SPREAD of data
• Inferential analyses: probability and non-probability based

Measuring Centre
Statistics include
– MODE (most frequently occurring observation)
– MEDIAN (observation lying at the centre of ordered data); best for INCOME data
– MEAN (a sufficient, consistent, unbiased statistic, utilising ALL observations)

EXAMPLE
• Consider that we selected RANDOMLY 10 houses out of 50, and observed the number of school-aged children who do not go to school as follows:
  1 2 4 4 1 1 6 0 5 2
• Find MEDIAN, MODE, MEAN

• MODE: 1, as it appeared most often (the most common count is 1 child of school-going age not in school per household)
• MEDIAN: after ordering the data, the centremost observation lies between the 5th and 6th values, i.e. between 2 and 2 (= 2)
  0 1 1 1 2 2 4 4 5 6
  Interpretation: 50% of the sampled households have up to 2 children of school-going age not in school
• MEAN: we use the arithmetic mean = sum of data divided by number of observations = (0+1+1+1+2+2+4+4+5+6)/10 = 2.6
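
The three measures of centre can be checked with Python's standard `statistics` module; this is just a quick verification of the worked example.

```python
import statistics

# Number of school-aged children not in school, for the 10 sampled houses.
data = [1, 2, 4, 4, 1, 1, 6, 0, 5, 2]

mode = statistics.mode(data)      # most frequently occurring observation
median = statistics.median(data)  # average of the 5th and 6th ordered values
mean = statistics.mean(data)      # sum of data / number of observations

print(sorted(data))               # [0, 1, 1, 1, 2, 2, 4, 4, 5, 6]
print(mode, median, mean)
```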

Measuring Spread
Statistics include
– MINIMUM, MAXIMUM (i.e. EXTREME data)
– RANGE (a single statistic calculated as MAXIMUM minus MINIMUM value)
– MEAN ABSOLUTE DEVIATION (mean of the absolute deviations from the mean)
– STANDARD DEVIATION (SD; use the divisor n-1, not n as in most calculators)
– STANDARD ERROR

EXAMPLE
• Consider that we selected RANDOMLY 10 houses out of 50, and observed the number of school-aged children who do not go to school as follows:
  1 2 4 4 1 1 6 0 5 2
• Find STANDARD DEVIATION, STANDARD ERROR and CONFIDENCE LIMITS

CALCULATING SPREAD: STANDARD DEVIATION

X        Deviation   Square Dev
1        -1.6        2.56
2        -0.6        0.36
4         1.4        1.96
4         1.4        1.96
1        -1.6        2.56
1        -1.6        2.56
6         3.4        11.56
0        -2.6        6.76
5         2.4        5.76
2        -0.6        0.36
Total 26             36.4

Standard Deviation: SD = sqrt(36.4/(10 - 1)) = 2.01
Approximate SD = Range/4 = (6 - 0)/4 = 1.5 (valid if sample is large and distribution is normal)
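
The deviation table can be reproduced step by step as below. The sketch uses the n-1 divisor and cross-checks the result against `statistics.stdev`; variable names are illustrative.

```python
import math
import statistics

data = [1, 2, 4, 4, 1, 1, 6, 0, 5, 2]
n = len(data)
mean = sum(data) / n                                   # 2.6

deviations = [x - mean for x in data]                  # X minus mean
squared_deviations = [d ** 2 for d in deviations]      # squared deviations
sum_sq = sum(squared_deviations)                       # about 36.4

sd = math.sqrt(sum_sq / (n - 1))                       # divisor n - 1, not n
approx_sd = (max(data) - min(data)) / 4                # range / 4 rule of thumb

print(f"sum of squared deviations = {sum_sq:.1f}")
print(f"SD = {sd:.2f}, approximate SD = {approx_sd}")
print(f"statistics.stdev agrees: {statistics.stdev(data):.2f}")
```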

Sampling fraction (f) and Finite Population Correction Factor (fpc)
• Sampling fraction = f = n/N = 10/50 = 0.20 (represents the proportion of the population that is sampled, i.e. observed)
• If f < 0.05, the fpc is ignored. In our case f > 0.05 (indeed it equals 0.20), so the fpc must be calculated and used in the sampling error computation
  fpc = (N - n)/N = 1 - n/N = 1 - 0.20 = 0.80

CALCULATING SPREAD: STANDARD ERROR
SE = sqrt((SD^2/n) x fpc) = sqrt((2.01^2/10) x 0.80) = 0.57
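
A minimal sketch of the standard error calculation follows, assuming the usual formula SE = sqrt((s^2/n) x fpc) with fpc = 1 - n/N. That formula reproduces the 0.57 quoted for this example; the uncorrected value is shown only for comparison.

```python
import math
import statistics

data = [1, 2, 4, 4, 1, 1, 6, 0, 5, 2]
N = 50                      # houses in the population
n = len(data)               # houses sampled

f = n / N                   # sampling fraction
fpc = 1 - f                 # finite population correction
variance = statistics.variance(data)   # sample variance, divisor n - 1

se_with_fpc = math.sqrt(variance / n * fpc)
se_without_fpc = math.sqrt(variance / n)

print(f"f = {f:.2f}, fpc = {fpc:.2f}")
print(f"SE with fpc = {se_with_fpc:.2f}, without fpc = {se_without_fpc:.2f}")
```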

Confidence (Fiducial) Limits
• Given a level of significance of 5%, we can obtain 95% confidence limits on the mean number of non-school-going children by multiplying the SE by 1.96, that is:
  P(2.6 - 1.96*0.57 < true number < 2.6 + 1.96*0.57) = 1 - 0.05 = 0.95
  P(1.5 < true number per household < 3.7) = 0.95
• Interpretation: we are 95% certain that the true number of children per household in the community who are of school age but at home is between 1.5 (1) and 3.7 (4)
• OR, after multiplying by the total of 50 households, we can conclude that 75 to 185 school-aged children in the community are not in school
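
The confidence limits can be reproduced as below, using the 1.96 multiplier from the slide and the fpc-corrected standard error. Because the code carries unrounded values, the community totals come out near 74 and 186 rather than the 75 and 185 obtained from the rounded limits 1.5 and 3.7.

```python
import math
import statistics

data = [1, 2, 4, 4, 1, 1, 6, 0, 5, 2]
N = 50                                  # households in the community
n = len(data)

mean = statistics.mean(data)
se = math.sqrt(statistics.variance(data) / n * (1 - n / N))

z = 1.96                                # multiplier for 95% limits
lower = mean - z * se
upper = mean + z * se

# Scale the per-household limits up to the whole community.
total_lower = N * lower
total_upper = N * upper

print(f"95% limits per household: {lower:.1f} to {upper:.1f}")
print(f"95% limits for the community: {total_lower:.0f} to {total_upper:.0f}")
```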

Combining Spread and Centre
[Figures: BOX PLOT and HISTOGRAM]

Further Analysis of Quantitative Data
• Histograms give an idea of the distribution of the data; very useful for quantitative data
• An excellent alternative to the histogram is the stem-and-leaf diagram
• Measures of association: correlation analysis, dependence (cause-effect) relations (regression procedures)
2006/2007

DATA ANALYSIS IS ENDLESS!!!
• ENJOY YOUR TIME DURING TTFPP
• END
KS Nokoe, PT Birteeb, IK Addai, M Agbolosu, L Kyei,