Using Secondary Datasets
Lee Cheng, MD, MSc
Joint Primary Care Fellowship
Department of Family & Community Medicine
The University of Texas Medical School at Houston
2005

Definition of Secondary Data
• Data collected by other studies
  • For the first researcher, they are primary data
  • For the second researcher, they are secondary data
• Data collected by federal, state, and local government agencies and by nationally representative surveys
  • Health and health care
  • Social and behavioral data
  • Policies

Benefits
• Cost-effective way to test specific hypotheses
• Create new research questions
• Demonstrate new and improved research designs and analytic approaches

Potential Drawbacks
• Only as good as the research that produced them
• Must assume what the authors meant by the terms they used
• Data may be neither valid nor reliable
• Instruments or data collection methods may have changed over time
• Data may already have been modified by the original researcher (e.g., weighted)

Potential Drawbacks (cont’d)
• Poor documentation of the secondary data set
• Limited access to the data (e.g., on-site only)
• Substantial purchase cost

Start with Research Questions
• Begin with well-defined research questions and testable hypotheses
• Identify the variables needed
• Identify the most appropriate data source available to achieve the study's aims

Steps in Using Data

Verify the Data
Check for the following:
• Proper documentation
• The correct number of observations or cases
• The correct number of variables
• The correct coding scheme
• Reproducibility of the original summary statistics
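The checks above can be sketched in a few lines of Python. This is only an illustration: the tiny extract, the variable names, and the "documented" values are all hypothetical stand-ins for a real data file and its documentation.

```python
import csv
import io
import statistics

# Hypothetical extract; in practice this would be the downloaded data file.
RAW = """id,age,sex,smoker
1,34,1,0
2,51,2,1
3,47,2,0
4,29,1,9
"""

# Values the documentation is assumed to report (all hypothetical).
DOCUMENTED = {
    "n_cases": 4,
    "variables": ["id", "age", "sex", "smoker"],
    "codes": {"sex": {1, 2}, "smoker": {0, 1, 9}},  # 9 = missing
    "mean_age": 40.25,
}

def verify(raw_text, doc, tol=0.01):
    """Return a list of discrepancies between the data and its documentation."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    problems = []
    # Correct number of observations or cases
    if len(rows) != doc["n_cases"]:
        problems.append(f"expected {doc['n_cases']} cases, found {len(rows)}")
    # Correct number (and names) of variables
    found_vars = list(rows[0].keys()) if rows else []
    if found_vars != doc["variables"]:
        problems.append(f"variable list differs: {found_vars}")
    # Correct coding scheme: every observed code must be documented
    for var, allowed in doc["codes"].items():
        if var not in found_vars:
            continue
        seen = {int(r[var]) for r in rows}
        if not seen <= allowed:
            problems.append(f"{var}: undocumented codes {sorted(seen - allowed)}")
    # Original summary statistics are reproducible
    if rows and "age" in found_vars:
        mean_age = statistics.mean(int(r["age"]) for r in rows)
        if abs(mean_age - doc["mean_age"]) > tol:
            problems.append(f"mean age {mean_age} != documented {doc['mean_age']}")
    return problems

problems = verify(RAW, DOCUMENTED)
print(problems or "all checks passed")
```

An empty list means the file matches its documentation on these points; any entry is a reason to stop and re-read the documentation before analysis.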

Determine Variables of Interest
Identify the variables needed:
• Independent variables
• Dependent variables
• Confounding variables

Study Design of Data
• Cross-sectional
• Longitudinal
• Examine trends over time

Codebooks
• Variable name
• Descriptive label
• Position in the record
• Code type (numeric or character)
• Applicable field input format
• Questionnaire item number that generated the measure
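As an illustration of how the codebook fields are used, the sketch below reads fixed-width ASCII records using each variable's position and code type. The codebook entries and the two sample records are hypothetical; a real codebook would supply the actual names, columns, and widths.

```python
# Hypothetical codebook: name, descriptive label, start column (1-based),
# field width, and code type ("num" or "char").
CODEBOOK = [
    ("CASEID", "Case identifier", 1, 4, "char"),
    ("AGE", "Age in years", 5, 2, "num"),
    ("SEX", "Sex (1=male, 2=female)", 7, 1, "num"),
]

def parse_record(line, codebook):
    """Cut one fixed-width record into named fields using codebook positions."""
    record = {}
    for name, _label, start, width, kind in codebook:
        field = line[start - 1 : start - 1 + width]
        if kind == "num":
            # A blank numeric field is treated as missing.
            record[name] = int(field) if field.strip() else None
        else:
            record[name] = field.strip()
    return record

lines = ["0001341", "0002 72"]  # hypothetical raw records
records = [parse_record(line, CODEBOOK) for line in lines]
for rec in records:
    print(rec)
```

Keeping the codebook as data rather than hard-coding column numbers makes it easier to compare the layout against the documentation line by line.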

Get the Data
• Some public use files can be downloaded directly from the web as either ASCII files or SAS transport files
• CD-ROM
• Restricted use files

Survey Design
• Ask how the sampling frame defines the population represented, and which population groups are excluded
• Ask what the sampling unit is
  • Individuals
  • Households
  • Institutions
• Ask whether the survey allows cross-sectional and longitudinal analyses, as well as serial cross-sectional analyses to examine trends
• Ask whether the survey allows regional or state-level comparisons

Survey Sampling Design

Multistage Cluster Probability Sample
• An efficient strategy for data collection
• Otherwise, it would be inordinately expensive to travel across the entire United States to interview a random sample of individuals or households

Multistage Cluster Probability Sample
States (strata) → Counties (PSUs) → Enumeration areas (SSUs) → Households (EUs) → Respondents
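A minimal sketch of the idea, simplified to two stages (counties within states, then households within counties) and assuming simple random sampling at each stage. The frame, the names, and the sample sizes are hypothetical; real surveys usually add more stages and unequal selection probabilities.

```python
import random

def two_stage_sample(frame, n_psu, n_hh, seed=0):
    """Draw PSUs within each stratum, then households within each sampled PSU.

    frame: {stratum: {psu: [household ids]}}
    Returns one row per sampled household with its base weight, i.e. the
    inverse of its overall selection probability.
    """
    rng = random.Random(seed)
    sample = []
    for stratum, psus in frame.items():
        chosen = rng.sample(sorted(psus), n_psu)       # stage 1: counties
        for psu in chosen:
            households = psus[psu]
            take = min(n_hh, len(households))
            p = (n_psu / len(psus)) * (take / len(households))
            for hh in rng.sample(households, take):    # stage 2: households
                sample.append({"stratum": stratum, "psu": psu,
                               "household": hh, "weight": 1 / p})
    return sample

# Hypothetical frame: 5 counties of 20 households, 4 counties of 10 households.
frame = {
    "State A": {f"County A{i}": [f"A{i}-{j}" for j in range(20)] for i in range(5)},
    "State B": {f"County B{i}": [f"B{i}-{j}" for j in range(10)] for i in range(4)},
}
sample = two_stage_sample(frame, n_psu=2, n_hh=5)
print(len(sample), sum(row["weight"] for row in sample))
```

Interviewers only need to visit the sampled counties, which is where the cost saving comes from. In this equal-sized toy frame the weights add back up to the 140 households in the population.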

Information and variables needed in the statistical analysis

Primary Sampling Unit (PSU)
The first unit that is sampled in the design. For example, school districts in Texas may be sampled first, and then schools within those districts. The school district would be the PSU.

Stratification (Strata)
A method of dividing the population into different groups, often by demographic variables such as gender, race, or socioeconomic status (SES).

Weight Variable
• The weight variable reflects information about the sampling design, including
  • the probability of being sampled
  • adjustments for non-response
  • post-stratification adjustments
• A weight variable is already included in the data set to be analyzed
• Care must be taken to use the correct weight variable
• Survey documentation serves as the guide for selecting the appropriate weight
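To show why the weight matters, the sketch below compares an unweighted and a weighted proportion. The respondents and weights are hypothetical, and `final_weight` illustrates only one common way the three components are combined; the survey documentation remains the authority on how a given weight was built.

```python
def final_weight(p_selection, response_rate, poststrat_factor=1.0):
    """Illustrative weight: inverse selection probability, inflated for
    non-response, then scaled by a post-stratification factor."""
    return (1.0 / p_selection) * (1.0 / response_rate) * poststrat_factor

def weighted_proportion(values, weights):
    """Weighted share of respondents with value 1."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical respondents: 1 = has the condition, 0 = does not.
has_condition = [1, 0, 0, 1, 0]
weights = [100.0, 400.0, 400.0, 50.0, 50.0]  # weight variable from the file

unweighted = sum(has_condition) / len(has_condition)
weighted = weighted_proportion(has_condition, weights)
print(f"unweighted={unweighted:.2f} weighted={weighted:.2f}")
```

Here the two respondents with the condition carry small weights, so the weighted estimate (0.15) is far below the unweighted one (0.40). Ignoring the weight would overstate prevalence in this toy example.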

Cluster
• A naturally occurring unit or grouping within the population (e.g., enumeration areas, cities, universities, provinces, hospitals)
• A unit for which the administrative level has clear, non-overlapping boundaries
• Useful because it avoids having to compile an exhaustive list of every person in the population

Panel Design
• Participants (or survey subjects) are followed over multiple survey rounds for a specified period of time
• New panels are created at designated intervals
• A certain percentage of subjects is rotated out and replaced by newly sampled subjects each round

Understanding the Potential Sources of Error
• Survey non-response
• Item non-response
• Sampling frame undercoverage
• Measurement error
• Instrument error

Software for Survey Data
• Designed to analyze complex survey data
• Can account for important design parameters
  • Survey weights
  • Cluster and stratified sampling (PSU)
• Software: SAS, Stata, SUDAAN
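To make concrete what "accounting for design parameters" involves, here is a rough sketch of a weighted mean with a standard error based on Taylor linearization and between-PSU variation within strata. It assumes PSUs are treated as sampled with replacement and applies no finite population correction; it is a textbook-style illustration, not a reproduction of any package's procedure, and the data rows are hypothetical.

```python
from collections import defaultdict
import math

def weighted_mean_with_se(rows):
    """rows: dicts with keys 'stratum', 'psu', 'weight', 'y'.

    Returns (weighted mean, design-based standard error).
    """
    w_total = sum(r["weight"] for r in rows)
    mean = sum(r["weight"] * r["y"] for r in rows) / w_total

    # Sum the linearized contributions of each PSU within its stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        z = r["weight"] * (r["y"] - mean) / w_total
        psu_totals[r["stratum"]][r["psu"]] += z

    # Variance comes from variation between PSU totals inside each stratum.
    variance = 0.0
    for stratum, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least 2 PSUs")
        t_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((t - t_bar) ** 2 for t in totals.values())
    return mean, math.sqrt(variance)

# Hypothetical analysis file: two strata, two PSUs each.
rows = [
    {"stratum": "S1", "psu": "P1", "weight": 10.0, "y": 1.0},
    {"stratum": "S1", "psu": "P1", "weight": 10.0, "y": 0.0},
    {"stratum": "S1", "psu": "P2", "weight": 10.0, "y": 1.0},
    {"stratum": "S2", "psu": "P3", "weight": 4.0, "y": 0.0},
    {"stratum": "S2", "psu": "P4", "weight": 4.0, "y": 1.0},
]
mean, se = weighted_mean_with_se(rows)
print(f"mean={mean:.3f} se={se:.3f}")
```

The function needs exactly the variables listed earlier: the stratum, the PSU, and the weight. Analyzing the same file as if it were a simple random sample would generally give a different, often too small, standard error.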

Power Calculations
• Make sure that there are sufficient data to test the hypotheses
• Or show that the relevant statistics did not occur simply by chance
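One way to check whether an existing data set is large enough is an approximate power calculation. The sketch below uses the normal approximation for comparing two proportions and, as an added assumption, divides the sample size by a design effect (`deff`) to reflect clustering. The proportions, sample sizes, and design effect shown are hypothetical.

```python
from statistics import NormalDist
import math

def power_two_proportions(p1, p2, n_per_group, alpha=0.05, deff=1.0):
    """Approximate power of a two-sided test comparing two proportions.

    deff is the design effect; the effective sample size is n / deff.
    Uses the normal approximation and ignores the negligible opposite tail.
    """
    n_eff = n_per_group / deff
    se = math.sqrt(p1 * (1 - p1) / n_eff + p2 * (1 - p2) / n_eff)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p1 - p2) / se
    return NormalDist().cdf(z_effect - z_crit)

# Hypothetical scenario: 20% vs 25% prevalence, 800 respondents per group.
srs_power = power_two_proportions(0.20, 0.25, 800)
clustered_power = power_two_proportions(0.20, 0.25, 800, deff=2.0)
print(f"power ignoring design: {srs_power:.2f}")
print(f"power with deff=2:     {clustered_power:.2f}")
```

Because complex survey designs usually reduce the effective sample size, a data set that looks adequately powered under simple random sampling assumptions may not be once the design is taken into account.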

Databases for your use

The United States Department of Health & Human Services (HHS)
http://aspe.hhs.gov/statinfo/
• Major federal health and human services surveys and data systems, and state agency data sites
• Information on completed and in-progress program evaluations and policy research
• Links to published scientific literature in human services and health

Behavioral Risk Factor Surveillance System
State data available at: http://apps.nccd.cdc.gov/brfss/index.asp
English and Spanish questionnaires available at: http://www.cdc.gov/brfss/brfsquesqustionnairesesp.htm
Evaluations of survey questionnaires that could be adapted for BRFSS can be found at: www.schs.state.nc.us/SCHS/about/programs/brfss/surveys/ratings.html

Youth Risk Behavior Surveillance System
http://apps.nccd.cdc.gov/YRBSS/index.asp
• Developed in 1990 to monitor priority health risk behaviors that contribute markedly to the leading causes of death, disability, and social problems among youth and adults in the United States
• These behaviors are often established during childhood and early adolescence (e.g., tobacco use, unhealthy dietary behaviors)

Cancer Registries
http://www.cdc.gov/cancer/dbdata.htm
Age-adjusted mortality rates and numbers of new cases for lung, colorectal, breast, and prostate cancer
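For readers unfamiliar with age adjustment, the sketch below shows direct standardization: age-specific rates are averaged using the age distribution of a standard population. All counts are hypothetical and the two age groups are only for illustration; registries document which standard population they actually use.

```python
def age_adjusted_rate(deaths, population, standard_pop, per=100_000):
    """Directly standardized rate per `per` persons.

    deaths, population, standard_pop: parallel lists, one entry per age group.
    """
    std_total = sum(standard_pop)
    rate = 0.0
    for d, n, s in zip(deaths, population, standard_pop):
        rate += (d / n) * (s / std_total)  # age-specific rate x standard share
    return rate * per

# Hypothetical area with two age groups (younger, older).
deaths = [10, 40]
population = [10_000, 20_000]
standard_pop = [500, 500]  # hypothetical standard: equal shares

crude = sum(deaths) / sum(population) * 100_000
adjusted = age_adjusted_rate(deaths, population, standard_pop)
print(f"crude={crude:.1f} adjusted={adjusted:.1f} per 100,000")
```

The adjusted rate differs from the crude rate because this area has relatively more people in the older, higher-risk group than the standard population does. That is why age-adjusted rates are preferred when comparing areas or years.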

Other Sources of Demographic, Socioeconomic and Health Data
www.census.gov
National Association of County and City Health Officials
Community Health Status Indicators CD-ROM ($75)
Area Resource File http://www.arfsys.com/ ($500)

Health and Medical Care Archive of the Inter-university Consortium for Political and Social Research (ICPSR)
www.icpsr.umich.edu/
• Funded by the Robert Wood Johnson Foundation
• Datasets are organized by: Health Care Providers, Cost/Access to Health Care, Substance Abuse & Health, Chronic Health Conditions, Other
• Students, faculty, and staff can download any data file on the Web site to authorized IP addresses

Other Federal Data Systems
http://www.fedstats.gov/
• Local VAMC databases: the Decentralized Hospital Computer Program (DHCP/VISTA), http://www.virec.research.med.va.gov/ or Austin Help Desk at 512-326-6780

Meetings and Training
http://www.resdac.umn.edu/Index.asp
• The Research Data Assistance Center (ResDAC) provides assistance to researchers and gives workshops and seminars on how to analyze Medicare and Medicaid data
http://www.jpsm.umd.edu/
• Joint Program in Survey Methodology

Expectation
A brief report on using secondary data with the following format:
• Brief introduction (rationale)
• Hypothesis
• Methods
• Data sources
• Variables needed
• Major analysis steps
• Timeline from accessing the data to finishing the draft manuscript