Biostatistics course Part 3 Data summary and presentation
Dr. en C. Nicolas Padilla Raygoza
Department of Nursing and Obstetrics, Division of Health Sciences and Engineering
Campus Celaya Salvatierra, University of Guanajuato, Mexico
Biosketch
- Medical Doctor, Autonomous University of Guadalajara.
- Pediatrician, certified by the Mexican Council of Certification on Pediatrics.
- Postgraduate Diploma in Epidemiology, London School of Hygiene and Tropical Medicine, University of London.
- Master of Science in Epidemiology, Atlantic International University.
- Doctor of Science in Epidemiology, Atlantic International University.
- Full-time Professor (Titular A), University of Guanajuato.
- Level 1, National Researcher System.
- padillawarm@gmail.com
- raygosan@ugto.mx
Competencies
- The reader will describe the types of variables.
- He or she will analyze how each type of variable is summarized.
- He or she will calculate measures of central tendency and locate them in charts.
- He or she will calculate measures of dispersion and locate them in charts.
Definitions
- Data are collected on specific characteristics of each subject, and groups are formed to be compared.
- These characteristics are called variables because they can differ from one subject to another.
- A variable is recorded for one of two reasons:
  - It is a result of interest: the dependent variable.
  - It explains the dependent variable: a risk factor, or independent variable.
Type of data
Classification by measurement scale:
- Qualitative
  - Binary (dichotomous)
  - Ordinal
  - Nominal
- Quantitative
  - Discrete
  - Continuous
Type of data - Examples
- Qualitative
  - Dichotomous (binary)
    - Gender: male or female.
    - Employment status: employed or unemployed.
  - Ordinal
    - Socioeconomic level: high, medium, low.
  - Nominal
    - Place of residence: Center, North, South, East, West.
    - Marital status: single, married, widowed, divorced, free union.
- Quantitative
  - Discrete
    - Number of offspring: 1, 2, 3, 4.
  - Continuous
    - Blood glucose level: 110 mg/dl, 145 mg/dl.
Data summary
- Generally, we want to show the data in summary form.
- We are interested in the number of times each value or event occurs, because this shows the distribution of the variable.
- A frequency list can be generated for both quantitative and qualitative data.
Summary of categorical data
- We can obtain the frequencies of categorical data and summarize them in a table or chart.
- Example: 21 agents of parasitic disease were isolated from children.
Giardia lamblia, Entamoeba histolytica, Ascaris lumbricoides, Enterobius vermicularis, Ascaris lumbricoides, Enterobius vermicularis, Giardia lamblia
Summary of categorical data
- The list of detected parasites gives an idea of how frequent each parasite is, but the picture is not clear.
- If we order the list, the picture becomes clearer.
Giardia lamblia, Ascaris lumbricoides, Enterobius vermicularis, Giardia lamblia, Ascaris lumbricoides, Entamoeba histolytica, Giardia lamblia, Enterobius vermicularis, Entamoeba histolytica, Ascaris lumbricoides, Enterobius vermicularis, Entamoeba histolytica
Summary of categorical data
- We can show the results in a frequency distribution.
Frequency distribution of intestinal parasites detected in children from CAISES Celaya, n=21
Parasite                   n
Giardia lamblia            6
Ascaris lumbricoides       6
Enterobius vermicularis    6
Entamoeba histolytica      3
Total                     21
Source: Laboratory report
Summary of categorical data
- It is useful to show the frequency of each category as a percentage of the total frequency.
- This is called the distribution of relative frequencies.
Frequency distribution of intestinal parasites detected in children from CAISES Celaya, n=21
Parasite                   n        %
Giardia lamblia            6    28.57
Ascaris lumbricoides       6    28.57
Enterobius vermicularis    6    28.57
Entamoeba histolytica      3    14.29
Total                     21   100.00
Source: Laboratory report
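The relative frequencies in this table can be reproduced with a few lines of Python. This is only a minimal sketch: the counts are copied from the table, and the function name `relative_frequencies` is a label chosen for illustration.

```python
# Counts copied from the frequency table of intestinal parasites (n = 21).
counts = {
    "Giardia lamblia": 6,
    "Ascaris lumbricoides": 6,
    "Enterobius vermicularis": 6,
    "Entamoeba histolytica": 3,
}


def relative_frequencies(freq):
    """Return each category's share of the total frequency, as a percentage."""
    total = sum(freq.values())
    return {name: 100 * n / total for name, n in freq.items()}


percentages = relative_frequencies(counts)

# Print one row per category: name, count, percentage with two decimals.
for name, pct in percentages.items():
    print(f"{name:<25} {counts[name]:>3} {pct:7.2f}")
```

Each percentage is the category count divided by the total (21) and multiplied by 100, so the percentages add up to 100.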
Summary of categorical data
- Sometimes there are too many categories and their number should be reduced, for example by merging the less frequent ones into an "Other" category.
Distribution by cause of death in Celaya, Gto, during 2012
Cause of death                      n        %
Cardiovascular disease         12,525    21.96
Cancer                         10,321    18.10
Lower respiratory infections    8,745    15.34
Other                          25,435    44.60
Total                          57,026   100.00
Source: Certification of deaths
Frequency distributions for quantitative data
- With quantitative data, we need to group the data before showing them in a table of frequencies or relative frequencies.
Distribution of frequencies in students of FEOC that have smoked at least once. n=354
Age (years)     n        %
19             52    14.70
20             32     9.00
21             46    12.99
22             67    18.94
23             26     7.35
24             77    21.76
25             54    15.26
Total         354   100.00
Source: Health survey
Frequency distributions for quantitative data
- With quantitative data, it is useful to calculate the cumulative frequency.
Distribution of frequencies in students of Campus that have smoked at least once. n=354
Age (years)     n        %   Cumulative %
19             52    14.70          14.70
20             32     9.00          23.70
21             46    12.99          36.69
22             67    18.94          55.63
23             26     7.35          62.98
24             77    21.76          84.74
25             54    15.26         100.00
Total         354   100.00
Source: Health survey
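A cumulative frequency is a running total of the counts. The sketch below uses the age counts from the table; the counts listed add up to 354, which is the denominator used here. Because the percentages are computed from the unrounded running totals, they may differ in the second decimal from a column built by adding already rounded percentages.

```python
from itertools import accumulate

# Ages and counts copied from the table of students who have smoked at least once.
ages = [19, 20, 21, 22, 23, 24, 25]
counts = [52, 32, 46, 67, 26, 77, 54]

total = sum(counts)

# Running total of the counts, then expressed as a percentage of the total.
cumulative_counts = list(accumulate(counts))
cumulative_pct = [100 * c / total for c in cumulative_counts]

for age, n, cum in zip(ages, counts, cumulative_pct):
    print(f"{age:>3} {n:>4} {100 * n / total:6.2f} {cum:7.2f}")
```

The last cumulative percentage is always 100, which is a quick check that no category was left out.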
Distributions of frequencies for grouped quantitative data
- Quantitative data often take many distinct values, and we have to define intervals to group them.
Distribution of frequencies of ages of children with acute streptococcal pharyngotonsillitis
Age (years)     n        %
<1              2     0.51
1               8     2.03
2              13     3.30
3              29     7.36
4              37     9.39
5              44    11.17
6              51    12.94
7              50    12.69
8              49    12.44
9              32     8.12
10             25     6.35
11             22     5.58
12             14     3.55
13              9     2.28
14              7     1.78
15              2     0.51
Total         394   100.00
Source: Padilla N, Moreno M. Comparison between clarithromycin, azithromycin and propicillin in the management of acute streptococcal pharyngotonsillitis in children. Archivos de Investigación Pediátrica de México 2005; 8: 5-11. (In Spanish)
Distribution of frequencies for grouped quantitative data
Distribution of frequencies of ages of children with acute streptococcal pharyngotonsillitis
Age (years)     n        %
<1 - 3         52    13.20
4 - 6         132    33.50
7 - 9         131    33.25
10 - 12        61    15.48
13 - 15        18     4.57
Total         394   100.00
Source: Padilla N, Moreno M. Comparison between clarithromycin, azithromycin and propicillin in the management of acute streptococcal pharyngotonsillitis in children. Archivos de Investigación Pediátrica de México 2005; 8: 5-11. (In Spanish)
To group data
Guide
- Obtain the minimum and maximum values and decide the number of intervals.
- Use between 5 and 15 intervals.
- Define the interval limits clearly, so that each value falls into exactly one interval.
- Make sure all intervals have the same width.
- Avoid leaving the first or last interval open-ended.
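As an illustration of this guide, the sketch below groups the single-year age counts from the pharyngotonsillitis example into five intervals. It rests on two assumptions that are not part of the original tables: age 0 stands for the "<1" row, and the first interval is written as 0 to 3 so that it absorbs that row. The function name `group_counts` is also only illustrative.

```python
# Counts per single year of age; age 0 stands for the "<1" row (assumption).
counts_by_age = {
    0: 2, 1: 8, 2: 13, 3: 29, 4: 37, 5: 44, 6: 51, 7: 50,
    8: 49, 9: 32, 10: 25, 11: 22, 12: 14, 13: 9, 14: 7, 15: 2,
}

# Closed, non-overlapping intervals (lower limit, upper limit).
intervals = [(0, 3), (4, 6), (7, 9), (10, 12), (13, 15)]


def group_counts(counts, bins):
    """Add up the counts of all values that fall inside each closed interval."""
    grouped = {}
    for low, high in bins:
        grouped[(low, high)] = sum(
            n for value, n in counts.items() if low <= value <= high
        )
    return grouped


grouped = group_counts(counts_by_age, intervals)
total = sum(grouped.values())

for (low, high), n in grouped.items():
    print(f"{low:>2} - {high:<2} {n:>4} {100 * n / total:6.2f}")
```

Because the limits do not overlap, every child is counted once and the grouped total equals the ungrouped total of 394.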
Charts
- Categorical data
  - Bar chart
  - Pie chart
- Quantitative data
  - Histogram
  - Frequency polygon
Bar chart
- The frequency or relative frequency of a categorical variable can be shown easily in a bar chart. It is used with categorical or discrete numerical data.
- Each bar represents one category, and its height is the frequency or relative frequency.
- Bars should be separated.
- It is very important that the Y axis begins at 0.
Bar chart
Grouped bar chart
- If we have a nominal categorical variable divided into two groups, we can show the data with a grouped bar chart.
- It allows easy comparison between groups.
Grouped bar chart
Pie chart
- It is an alternative way to show a categorical variable.
- Each slice of the pie corresponds to the frequency or relative frequency of one category of the variable.
- Each pie chart shows only one variable.
- If we want to make comparisons, we need to build two or more pie charts.
Pie chart
Pie chart
Distribution of frequency charts: histograms
- It is useful for quantitative variables.
- There are no spaces between bars.
- The area of each bar, not its height, represents its frequency.
- The X axis should be continuous.
- The Y axis should begin at 0.
- The width of each bar represents the interval for that group.
Distribution of frequency charts: histograms
Distribution of frequency charts: frequency polygon
- It is another way to show the frequency distribution of a numerical variable.
- It is built by joining the midpoints of the tops of the histogram bars.
- We should take into account the width of each bar.
- We can plot more than one polygon in the same chart to make comparisons.
Distribution of frequency charts: polygon of frequencies
Distribution of frequencies: cumulative histogram
- We can plot it directly from a table of cumulative frequencies.
- It is not necessary to adjust the height of the bars, because each cumulative frequency represents the total frequency up to and including the upper limit of the interval.
Distribution of frequencies: cumulative histogram
Distribution of frequencies: cumulative frequency polygon
- We use it to see the proportion of data below or above a given point on the curve.
- We can read the median and percentiles directly.
- If the distribution is symmetrical, the curve has a symmetrical S shape.
- If the distribution is skewed to the right or to the left, the curve is flattened on that side.
Distribution of frequencies: cumulative polygon of frequencies
Other charts: stem-and-leaf plot
- We use it to show quantitative data directly, or as a preliminary step in building a frequency distribution. We organize the data by deciding the number of divisions (5-15).
- We draw a vertical line and put the first digit of each value to the left of the line (stem) and the second digit to the right of the line (leaves).
Other charts: stem-and-leaf plot
Patient   Age
1          54
2          35
3          49
4          61
5          58
6          64
7          32
8          57
9          43
10         42

Stem | Leaves
3 | 52
4 | 932
5 | 487
6 | 14
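The plot for these ten patients can be rebuilt programmatically. The following is a minimal sketch for two-digit values, where the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is chosen for illustration, and leaves are kept in the order the patients appear.

```python
from collections import defaultdict

# Ages of the ten patients, in patient order.
ages = [54, 35, 49, 61, 58, 64, 32, 57, 43, 42]


def stem_and_leaf(values):
    """Map each tens digit (stem) to its list of units digits (leaves)."""
    plot = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    # Sort by stem so the plot reads from the smallest to the largest values.
    return dict(sorted(plot.items()))


plot = stem_and_leaf(ages)

for stem, leaves in plot.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Sorting the leaves within each stem is a common refinement that makes the median easier to read off the plot.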
Other charts: box plot (box and line)
- We draw a vertical line that represents the range of the distribution.
- We draw one horizontal line at the third quartile and another at the first quartile; together they form the box.
- The middle point of the distribution (the median) is shown as a horizontal line inside the box.
Other charts: box plot (box and line)
[Figure: box plot; vertical axis from 500 to 5500]
Location measures
- For categorical variables: percentage
- For quantitative variables:
  - Central tendency measures
    - Mean
    - Median
    - Mode
  - Dispersion measures
    - Standard deviation
    - Percentiles
    - Range
Central tendency measures
Mean
- It is the conventional arithmetic mean.
- If we have n observations with values $x_i$, the mean is:
$$\bar{X} = \frac{\sum x_i}{n}$$
Central tendency measures in a frequency distribution
- If each data value ($x_i$) occurs with a frequency ($f_i$), then:
$$\bar{X} = \frac{\sum x_i f_i}{n}$$
- In a grouped distribution, we use the midpoint of each interval as the value of $x_i$.
Central tendency measures in a frequency distribution
Interval   Midpoint   Frequency (fi)
1 - 3          2           18
4 - 6          5           27
7 - 9          8           34
10 - 12       11           22
13 - 15       14           13
Total                     114
Example of the mean for a grouped distribution:
$$\bar{X} = \frac{(2 \times 18) + (5 \times 27) + (8 \times 34) + (11 \times 22) + (14 \times 13)}{18 + 27 + 34 + 22 + 13} = \frac{36 + 135 + 272 + 242 + 182}{114} = \frac{867}{114} = 7.61$$
Mean = 7.61 years
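The same calculation can be checked in Python. This is a small sketch using the intervals and frequencies from the table; the function name `grouped_mean` is illustrative, and the midpoint of each interval is taken as the average of its two limits.

```python
# Interval limits and frequencies copied from the grouped table.
intervals = [(1, 3), (4, 6), (7, 9), (10, 12), (13, 15)]
frequencies = [18, 27, 34, 22, 13]


def grouped_mean(bins, freqs):
    """Mean of grouped data: sum of midpoint times frequency, divided by n."""
    midpoints = [(low + high) / 2 for low, high in bins]
    weighted_sum = sum(x * f for x, f in zip(midpoints, freqs))
    return weighted_sum / sum(freqs)


mean_age = grouped_mean(intervals, frequencies)
print(round(mean_age, 2))
```

The result is an approximation of the true mean, because every observation in an interval is treated as if it sat at the midpoint.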
Central tendency measures
Median
- It is the value that divides the ordered distribution into two equal parts.
- If the number of observations is even, the two central values are added and divided by two.
51.2, 53.5, 55.6, 65.0, 74.2: the median is the middle value, thus Median = 55.6
51.2, 53.5, 55.6, 61.4, 65.0, 74.2: Median = (55.6 + 61.4) / 2 = 58.5
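Both cases, odd and even numbers of observations, can be handled by one short function. The sketch below is written from the definition on this slide rather than taken from any particular library, and the names `odd_sample` and `even_sample` are illustrative.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# The two samples used in the example above.
odd_sample = [51.2, 53.5, 55.6, 65.0, 74.2]
even_sample = [51.2, 53.5, 55.6, 61.4, 65.0, 74.2]

print(median(odd_sample))
print(median(even_sample))
```

The data must be sorted first; applying the rule to unsorted values would pick an arbitrary observation instead of the middle one.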
Central tendency measures for frequency distributions
Median
- It is the value at which the cumulative relative frequency reaches 50%.
Central tendency measures
Mode
- It is the value that occurs most frequently.
Interval   Midpoint   Frequency (fi)
1 - 3          2           18
4 - 6          5           27
7 - 9          8           34
10 - 12       11           22
13 - 15       14           13
Total                     114
In this table the modal interval is 7 - 9 (midpoint 8), which has the highest frequency, 34.
Central tendency measures
Properties
- The mean is sensitive to extreme values in the tails; the median and the mode are not.
- The mode can be affected by small changes in the data; the median and the mean are not.
- The mode and the median can be found in a chart.
- The three measures are equal in a Normal distribution.
Central tendency measures
Which measure should we use?
- For skewed distributions, we use the median.
- For statistical analysis or inference, we use the mean.
Dispersion measures
Range
- It shows the minimum and maximum values and the difference between them.
51.2, 53.5, 55.6, 61.4, 65.0, 74.2
The range of this distribution is 51.2 - 74.2 kg. However, the extreme values are far from the center of the distribution, and the range obscures the fact that most of the data lie between 53.5 and 65 kg.
Dispersion measures
Percentiles
- A percentile, or centile, is the value below which a given percentage of the data falls.
See the distribution of stature in this population. What are the range, the median, and the 25th and 75th percentiles?
Stature (cm)     n   Relative frequency (%)   Cumulative frequency (%)
151              2            0.7                      0.7
152              3            1.1                      1.8
153              6            2.2                      4.0
154             12            4.5                      8.5
155             27           10.0                     18.5
157             29           10.8                     29.3
158             26            9.7                     39.0
159             33           12.3                     51.3
163             37           13.8                     65.1
164             16            5.9                     71.0
165             24            8.9                     79.9
168             18            6.7                     86.6
169             14            5.2                     91.8
171              6            2.2                     94.0
174              7            2.6                     96.6
175              1            0.4                     97.0
177              4            1.5                     98.5
179              2            0.7                     99.2
184              1            0.4                     99.6
185              1            0.4                    100.0
Total          269          100.0
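One way to answer the question is to walk down the cumulative frequency column until the requested percentage is reached. The sketch below applies that rule to the stature table. Other percentile definitions that interpolate between values exist and can give slightly different answers. The third stature value is assumed to be 153 cm, since the table is in ascending order; this assumption does not affect the results.

```python
# Stature values (cm) and counts from the table; the third value is assumed to be 153.
heights = [151, 152, 153, 154, 155, 157, 158, 159, 163, 164,
           165, 168, 169, 171, 174, 175, 177, 179, 184, 185]
counts = [2, 3, 6, 12, 27, 29, 26, 33, 37, 16,
          24, 18, 14, 6, 7, 1, 4, 2, 1, 1]


def percentile(values, freqs, p):
    """Smallest value whose cumulative relative frequency reaches p percent."""
    total = sum(freqs)
    running = 0
    for value, f in zip(values, freqs):
        running += f
        if 100 * running / total >= p:
            return value
    return values[-1]


data_range = (min(heights), max(heights))
p25 = percentile(heights, counts, 25)
p50 = percentile(heights, counts, 50)
p75 = percentile(heights, counts, 75)

print(data_range, p25, p50, p75)
```

With this rule the range is 151 to 185 cm, the median is 159 cm, and the 25th and 75th percentiles are 157 cm and 165 cm, which matches reading the cumulative column of the table by eye.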
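Dispersion measures
Standard deviation
- It is the most common way to quantify the variability of a distribution.
- It measures the distance between each value and the mean.
Subject   Height   Deviation from the mean
1          1.6          -0.1
2          1.7           0
3          1.8          +0.1
$$\bar{X} = 1.7$$
$$\text{Mean deviation} = \frac{\sum (x_i - \bar{X})}{n} = \frac{(-0.1) + (0) + (+0.1)}{3} = 0$$
Positive and negative deviations cancel each other, so the mean deviation is always 0 and cannot be used to measure variability.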
Dispersion measures
Standard deviation
- We are interested in the magnitude of the deviations, not in their sign. If we square each deviation, all values become positive. If we divide the sum of the squared deviations by n - 1, we obtain the variance, and its square root is the standard deviation.
Subject   Height   Squared deviation
1          1.6          0.01
2          1.7          0
3          1.8          0.01
$$\bar{X} = 1.7$$
$$\text{Standard deviation} = \sqrt{\frac{\sum (x_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{0.02}{2}} = 0.1$$
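The steps described on this slide (square the deviations, divide by n - 1, take the square root) can be sketched in Python with the three heights from the example. The function name `sample_sd` is illustrative. For these values the sum of squared deviations is 0.02, so the computation gives a standard deviation of about 0.1, in the same units as the heights.

```python
import math

# Heights of the three subjects in the example.
heights = [1.6, 1.7, 1.8]


def sample_sd(values):
    """Sample standard deviation: sqrt of squared deviations summed over n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    variance = squared_deviations / (n - 1)
    return math.sqrt(variance)


sd = sample_sd(heights)
print(round(sd, 3))
```

Python's standard library offers `statistics.stdev`, which uses the same n - 1 divisor and can be used to cross-check the result.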
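Dispersion measures for grouped data
Standard deviation
- It uses the midpoint of each interval as $x_i$, weighted by the interval frequency $f_i$.
$$\text{Standard deviation} = \sqrt{\frac{\sum f_i (x_i - \bar{X})^2}{\sum f_i - 1}}$$
It can also be expressed as:
$$\text{Standard deviation} = \sqrt{\frac{\sum f_i x_i^2 - \left(\sum f_i x_i\right)^2 / \sum f_i}{\sum f_i - 1}}$$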
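To see that the two expressions agree, the sketch below applies both to the grouped age table used earlier for the mean (midpoints 2, 5, 8, 11, 14 with frequencies 18, 27, 34, 22, 13). Applying the formulas to that table is a choice made here for illustration, and the function names are hypothetical.

```python
import math

# Midpoints and frequencies from the grouped table used for the mean example.
midpoints = [2, 5, 8, 11, 14]
frequencies = [18, 27, 34, 22, 13]


def grouped_sd_definition(xs, fs):
    """Grouped SD from squared deviations around the grouped mean."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    squared = sum(f * (x - mean) ** 2 for x, f in zip(xs, fs))
    return math.sqrt(squared / (n - 1))


def grouped_sd_shortcut(xs, fs):
    """Grouped SD from the sums of f*x and f*x^2 (computational form)."""
    n = sum(fs)
    sum_fx = sum(f * x for x, f in zip(xs, fs))
    sum_fx2 = sum(f * x * x for x, f in zip(xs, fs))
    return math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))


sd_definition = grouped_sd_definition(midpoints, frequencies)
sd_shortcut = grouped_sd_shortcut(midpoints, frequencies)
print(round(sd_definition, 2), round(sd_shortcut, 2))
```

Both functions return roughly 3.69 years for this table. The second form needs only running sums, which is why it was preferred for hand calculation.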
Dispersion measures for grouped data
For data with a Normal distribution:
- Around 68% of the data lie between -1 and +1 standard deviations from the mean.
- Around 95% of the data lie between -2 and +2 standard deviations.
- Around 99.7% of the data lie between -3 and +3 standard deviations.
- The standard deviation is a measure of the width of the distribution. If the standard deviation changes, the shape of the distribution changes as well.
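These percentages follow from the Normal distribution itself and can be verified numerically. The sketch below uses the error function from Python's `math` module: for a Normal variable, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)). The function name `coverage_within` is illustrative.

```python
import math


def coverage_within(k):
    """Probability that a Normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * coverage_within(k):.1f}%")
```

The computed values are close to 68%, 95% and 99.7%; the familiar limit of 1.96 standard deviations is the one that covers 95% more exactly.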