Dexter The Missouri Census Data Centers Data Extraction

  • Slides: 75
Download presentation
Dexter The Missouri Census Data Center’s Data Extraction Utility John Blodgett: OSEDA, University of

Dexter The Missouri Census Data Center’s Data Extraction Utility John Blodgett: OSEDA, University of Missouri Rev. 14 May 2007, jgb

What Is Dexter? A web utility for performing simple data queries, or extracts. An

What Is Dexter? A web utility for performing simple data queries, or extracts. An integral part of the MCDC’s Uexplorebased data exploration/access system. Written in SAS© to access data stored in SAS datasets but requires no knowledge of nor access to SAS.

Who Uses Dexter? Anyone interested in accessing the MCDC data archive, especially anyone who

Who Uses Dexter? Anyone interested in accessing the MCDC data archive, especially anyone who wants to directly access and manipulate the data. Not (directly) intended for the very casual data user. Has a small but non-trivial learning curve. Understanding the mechanics of Dexter is easy compared to understanding the data to be extracted.

Dexter’s Role Within Uexplore Dexter accepts parameters that identify a database file/table from which

Dexter’s Role Within Uexplore Dexter accepts parameters that identify a database file/table from which data are to be extracted. Uexplore provides the navigation tools to help locate and understand the content of datasets. Uexplore hyperlinks actually invoke uex 2 dex, the dexter preprocessor, which in turn invokes dexter.

Uexplore Page With Hyperlinks

Uexplore Page With Hyperlinks

The URL Used to Invoke Dexter On the previous screen the dataset name (usccflows.

The URL Used to Invoke Dexter On the previous screen the dataset name (usccflows. sas 7 bdat) is a hyperlink. The URL associated with it is: http: //mcdc 2. missouri. edu/cgibin/broker? _PROGRAM=websas. uex 2 dex. sa s&_SERVICE=appdev 9&path=/pub/data/mig 2 000&dset=usccflows&view=0 It calls a program named uex 2 dex, written in SAS, and passes parms to ID the data table to be queried.

Dexter and Census Data Dexter doesn’t really know much about the datasets from which

Dexter and Census Data Dexter doesn’t really know much about the datasets from which it extracts data. It is not American Fact. Finder. It is just a generic extraction tool. It uses only very basic metadata tools. Other tools must be used to assist users in navigating the database.

Dexter and the MCDC Data Archive Technically, there is nothing inherent in Dexter that

Dexter and the MCDC Data Archive Technically, there is nothing inherent in Dexter that ties it to this archive. In practice, however, the collection of public data files that we call the “MCDC Data Archive” is what Dexter was created for. It is very probable the only reason you’re reading this is because you want to access something in that archive.

How Do You Invoke Dexter? Most people will start at the uexplore home page

How Do You Invoke Dexter? Most people will start at the uexplore home page - http: //mcdc. missouri. edu/applications/uexplore. shtml You navigate the data collection by choosing “filetype” directories and at some point (…yada) you wind up selecting (clicking on) a file that is a data table. Clicking on the data table invokes the uex 2 dex preprocessor. You fill out the form which uex 2 dex generates and click on an “Extract Data” button to actually invoke Dexter.

Accessing Uexplore (Home Page) From the MCDC home page (or any page with the

Accessing Uexplore (Home Page) From the MCDC home page (or any page with the navy blue navigation bar) click on “MCDC Data Archive”. Or enter the URL: http: //mcdc 2. missouri. edu/applications/uexplore. shtml

Choose Major Category (from the links in teal box)

Choose Major Category (from the links in teal box)

Scroll Within the Filetype Descriptions to Find the Type (mig 2000)

Scroll Within the Filetype Descriptions to Find the Type (mig 2000)

Click on the Filetype Name (links to uexplore for that directory/filetype) In this case

Click on the Filetype Name (links to uexplore for that directory/filetype) In this case we want to click on the mig 2000 filetype. The text tells us what kind of data we can expect to find in this directory.

Uexplore Page - mig 2000 Filetype This page is all about hyperlinks (all the

Uexplore Page - mig 2000 Filetype This page is all about hyperlinks (all the blue text). Before proceeding to the Dexter-invocation links we want to

(Back to) The Uexplore/Dexter Home Page

(Back to) The Uexplore/Dexter Home Page

The Archive Directory (on the Uexplore/Dexter home page) The teal box contains links to

The Archive Directory (on the Uexplore/Dexter home page) The teal box contains links to 8 major data categories (2000 Census thru Compendia) The rest of the page consists mostly of descriptions of and hyperlinks to the archive’s data categories (which we refer to as filetypes. ) Filetypes within the major categories are sorted in descending order of what we think will be their popularity. Sf 32000 x is our most popular filetype.

What’s In the Archive? Very important question. But not the focus of this tutorial.

What’s In the Archive? Very important question. But not the focus of this tutorial. Some day we’ll do a separate, long tutorial just on that topic. Not all filetypes are created equal. We spend 90% of our resources on maybe 10% of our data directories. Filetypes that are in bold are the MCDC “house specialties”.

The Data Archive – General Info We keep the data table files (the things

The Data Archive – General Info We keep the data table files (the things Dexter accesses) in the same directories along with other related files (metadata, spreadsheets, csv files, Readme. html files, etc. ) Each filetype directory has a special Tools subdirectory where we keep program code and other tool modules related to the data. Subdirectories & Files starting with uppercase letters are listed first and are usually worth looking at. Dexter-accessible table files (“SAS datasets”) have extensions of sas 7 bdat or sas 7 bvew.

Exercise The Bureau of Economic Analysis disseminates its REIS data with key economic indictors

Exercise The Bureau of Economic Analysis disseminates its REIS data with key economic indictors for US geography down to the county level. Locate the filetype corresponding to this data collection and navigate to the directory page. What’s the major category?

Uexplore Data Directory Page What you see when you click on the beareis link

Uexplore Data Directory Page What you see when you click on the beareis link on the Uexplore home page. It displays a list of files within the directory. The “File” column entries are hyperlinks. With a few exceptions the files are displayed in alphabetical order. Datasets. html is a special file providing enhanced navigation of the data files in this dir. It displays just the data-table files, but in a more logical order and with additional metadata.

Datasets. html page

Datasets. html page

Datasets. html Columns The Name column is also a link to uex 2 dex

Datasets. html Columns The Name column is also a link to uex 2 dex / dexter. Label is a short description of the dataset. #Rows (# of observations) and #Cols (# of columns/variables) are taken from the datasets metadata set. As are the Geographic Universe and Units. Link to Details is the most important column.

Universe and Units The majority of datasets in the archive contain summary data for

Universe and Units The majority of datasets in the archive contain summary data for geographic areas. For example, a dataset in the popests directory might contain the latest estimates for all counties in the state of Missouri. The geographic universe is Missouri, and the units are counties. When we have many datasets in a directory it’s usually because we have many different combinations of universe and units.

Common Universes Missouri (the state of) is by far the most common universe for

Common Universes Missouri (the state of) is by far the most common universe for the MCDC archive. United States is second – we have quite a number of national datasets. Illinois and Kansas are also very common since we routinely download and convert census files for these key neighbor states. A common sort order for files on Datasets. html pages is Missouri files first, then US, then IL/KS and then other states.

Rows & Columns The rows of the data tables are typically geographic entities: a

Rows & Columns The rows of the data tables are typically geographic entities: a state, a county, a city, etc Most of the columns in the data tables are summary stats for the entity: e. g. the 2000 pop count, the latest estimated pop, the change and percent change, etc. Other columns (“variables”) are identifiers with names such as sumlev, geocode and areaname.

Numeric vs. Character Variables SAS© stores data as character strings or as numerics. We

Numeric vs. Character Variables SAS© stores data as character strings or as numerics. We store all identifiers (geographic codes, etc) as character strings even if they are made up of numeric digits. So the value of the state code for CT is “ 09”, not 9. The leading “ 0” matters. Unfortunately, Excel ignores the distinction when importing csv files.

Dataset Naming Conventions All filetype names are 8 characters or less. Dataset names were

Dataset Naming Conventions All filetype names are 8 characters or less. Dataset names were limited to 8 characters by the software until recently. The first characters of the dataset name often correspond to the universe – e. g. “mo”, “il”, “us”. The geo units are often part of the ds-name – e. g. “motracts”, “uszips”. For time series data the name usually ends with a time indicator – e. g. “uscom 03” contains data thru 2003.

Variable Naming Conventions Not as rigorously applied as we might like, esp. for older

Variable Naming Conventions Not as rigorously applied as we might like, esp. for older datasets (conventions used for 1980 datasets differ a little from 2 K and 1990 sets, for example) Certain names appear on many datasets and are consistent. These are mostly identifier variables, the ones used in creating filters and for merging data from different files.

Consistency With Census Bureau Data Dictionary Names The Bureau often distributes data dictionary files

Consistency With Census Bureau Data Dictionary Names The Bureau often distributes data dictionary files with their data that include suggested names for the fields. Their name for the field (variable) containing the name of the geographic area being summarized is ANPSADPI. We decided to go with Area. Name instead. But in most cases we try to use the same name as in the data dictionary.

Common ID Variables Sum. Lev: Geographic summary level codes as used in 2 K

Common ID Variables Sum. Lev: Geographic summary level codes as used in 2 K census. (3 -char) State: 2 -char state FIPS code. County: 5 -char county FIPS code, incl. the state. Geocode: A composite code to id a geographic area. E. g. the value for a census tract might be “ 29019 -0010. 00”. Area. Name: Name of the area.

Common ID Variables (cont) Tract: census tract in tttt. ss format, always 7 characters

Common ID Variables (cont) Tract: census tract in tttt. ss format, always 7 characters with leading 0 s and 00 suffixes. E. g. “ 0012. 00”. Esriid: Similar to geocode but intended to use as a key for linking to shape files from ESRI (the Arc. Info people). When geocode=“ 29019 -0010. 00” the value of esriid=“ 29019001000”.

SAS Formats Some variables have custom formats associated with them, which cause them to

SAS Formats Some variables have custom formats associated with them, which cause them to display a name instead of their actual value. E. g. the variable County may have a value of “ 29019” but displays as “Boone MO” using the format. Most Dexter output has the formatted values. Click the “View qmeta Metadata report” option at the end of Section II on the Dexter form to see which variables have formats associated.

More About the MCDC Data Archive http: //mcdc 2. missouri. edu/tutorials/ mcdc_data_archive. ppt

More About the MCDC Data Archive http: //mcdc 2. missouri. edu/tutorials/ mcdc_data_archive. ppt

Details Page We get here by clicking on the Details link on Datasets. html

Details Page We get here by clicking on the Details link on Datasets. html page. Lots of content here – but will vary. Key variables is often extremely useful when doing filters. Note the direct link to Dexter under Access the dataset near the bottom.

Increase Text Size to Read Fine Print

Increase Text Size to Read Fine Print

Exercise – Navigate to Dataset Earlier we were looking at datasets in the 2000

Exercise – Navigate to Dataset Earlier we were looking at datasets in the 2000 Census category, filetype mig 2000. Go to the Uexplore home page and navigate to this filetype. Use the Datasets. html page to display the datasets within the directory. Find the row for the usccflows data table and click on the Details link for this table. From the Details page click on the keyvals link for the variable State.

Key Variables Report: State Tells you that the variable State has a value of

Key Variables Report: State Tells you that the variable State has a value of 01 (for “Alabama”) in 22137 rows of this dataset. This can be very helpful when doing a data filter in Dexter.

Finally… Time to See Dexter

Finally… Time to See Dexter

Dexter Input Page (Top) Sec. I. Output Format(s): csv file (into Excel) most common.

Dexter Input Page (Top) Sec. I. Output Format(s): csv file (into Excel) most common. Sec. II is where the work is. Only 2 of 5 rows shown here. User fills out the entire form before using Extract Data button to invoke Dexter.

Dexter Section II

Dexter Section II

Filters “A filter is a logical condition that references values of columns within a

Filters “A filter is a logical condition that references values of columns within a row. For each row, the condition is evaluated and, if true, the row is selected for output. (If not, the row is omitted, or "filtered". ) To keep all rows, just skip this section. The filter being created here can consist of up to 5 logical segments, each referencing a data set Variable, a relational Operator, and a data Value (or values) -- constants that the user must type in. The segments are evaluated as true or false. Logical operators (which default to And appear between the segment specification rows) relate the segments when more than one is specified, creating a compound logical condition. ” If this explanation makes sense to you then you are going to have an easy time with Dexter. If not, follow through the examples and then try reading it again.

Example of Defining a Filter: What We Want Assuming we are running dexter to

Example of Defining a Filter: What We Want Assuming we are running dexter to access the mig 2000. usccflows dataset we want to select only those rows that: – have Missouri as the anchor state, and – have at least 100 gross flows. We’ll just assume you’ve read the descriptions and have some clue regarding what an anchor state and a gross flow are. (People interested in population migration would be likely to know this. )

Select Variable for Filter Click on the Variable/Column drop-down menu in the 1 st

Select Variable for Filter Click on the Variable/Column drop-down menu in the 1 st row and select State.

Select Comparison Operator Select “Equal to” as the Operator from drop down menu in

Select Comparison Operator Select “Equal to” as the Operator from drop down menu in the middle column.

Enter Value to Complete Row Remember the Key Values report showing all the values

Enter Value to Complete Row Remember the Key Values report showing all the values for the variable State? If you did not know the code for Missouri you could find it there.

What We Have So Far We have created a logical condition that can be

What We Have So Far We have created a logical condition that can be evaluated for each row of the dataset: State = ’ 29’ According the key values report for State we know that this condition is true for 38, 316 rows in the dataset. The filter we are building will select just those 38, 316 rows out of the 1. 1+ million in the full dataset.

Adding a Second Condition But we do not want all the cases pertaining to

Adding a Second Condition But we do not want all the cases pertaining to Missouri as the anchor state. We only want those where we have at least 100 gross flows (whatever those are). So we need to fill out a second row, adding this condition. We select Gross. Mig as the variable, Greater Than or Equal To as the Operator and enter 100 in the Value field. We leave the logical operator radio button set to “And” to indicate that this is an additional necessary condition.

The Completed Filter You are now ready to scroll down to Section III.

The Completed Filter You are now ready to scroll down to Section III.

Section III: Choose Variables Conceptually simple section: just select the variables you want on

Section III: Choose Variables Conceptually simple section: just select the variables you want on your output from scrollable (if needed) menu lists. Identifiers (character type variables) are listed separate from numerics. Important MCDC Data Archive convention. Typing names instead of selecting is possible but not recommended. Here we select all variables except State.

Section IV: Title & Sort Order Entirely optional, typically not used section. Title value

Section IV: Title & Sort Order Entirely optional, typically not used section. Title value is used as report title if you asked for one, which we did not in this example. Sort specs are handy. Note use of minus sign (hyphen) to indicate a descending sort. Another Extract Data button to use to run query.

Dexter Output Page The first output you see is this results “index” page. Always

Dexter Output Page The first output you see is this results “index” page. Always a link to a Summary Log page Additional links depend on output formats requested.

Dexter Summary Log This file always generated. Important for documenting the query. Indicates what

Dexter Summary Log This file always generated. Important for documenting the query. Indicates what file(s), when run, as well as any filter and the variables kept. Output directory details can usually be ignored.

Select Output File(s) Click on Delimited File Link What happens when you click on

Select Output File(s) Click on Delimited File Link What happens when you click on this file depends on how your browser is configured. The file referenced has a. csv extension which IE usually associates with the Excel plugin. Clicking this link will typically invoke Excel.

Viewing. csv Output in Excel The csv file is read into Excel. Rows 1

Viewing. csv Output in Excel The csv file is read into Excel. Rows 1 & 2 have names & labels. Other rows contain the selected data. Note sort order.

Some Key Points So Far Navigation tools such as the uexplore home page, the

Some Key Points So Far Navigation tools such as the uexplore home page, the uexplore directory navigator and Datasets. html reference pages are used to make accessing data with Dexter easier. You get to select rows (“filter”) and columns as well as the format(s) of your extracted data. Filtering often requires knowledge of code values. These can sometimes be accessed from the Key Values reports on the Details page referenced by a Datasets. html page. The query generated is summarized on a Summary. log page.

Pop Quiz 1. Can Dexter be used to access an xls file? 2. How

Pop Quiz 1. Can Dexter be used to access an xls file? 2. How are the files sorted on a directory page 3. 4. 5. 6. displayed by uexplore? What does the uex 2 dex interface app do? What is the fastest way to tell how many rows were selected by your query? Which of the 5 sections of the Dexter query form must be filled out to have a valid request? What’s a filetype? What does it mean when one is displayed in bold on the Uexplore home (Archive Directory) page?

Sample Query 2: What We Want We want data from the 2000 Census, Summary

Sample Query 2: What We Want We want data from the 2000 Census, Summary File 3 regarding poverty in Missouri – in cities and counties. We want the number and the % of poor persons, as well as the median household income. We only want the data for cities of at least 5000 persons, but for all counties and for the state as a whole. We want output as an HTML file sorted by the type of geography (state, county, city) and then by descending poverty rate.

What You Need to Know You need to know where these kinds of data

What You Need to Know You need to know where these kinds of data are stored. It is 2000 census data, but where among all those different summary files? Read the brief descriptions on the uexplore home page. The sf 32000 filetype looks good, but it turns out that it is too big. The standard extract version, sf 32000 x, has what we need. An alternate way by which users may arrive here is via links on the MCDC Demographic Profile reports.

A Demographic Profile Report A link at the bottom of this report page invokes

A Demographic Profile Report A link at the bottom of this report page invokes Dexter with the appropriate dataset selected. Follow the link (in title of this page) and try it.

(Back on uexplore home page) Click on sf 32000 x to Start Descriptions with

(Back on uexplore home page) Click on sf 32000 x to Start Descriptions with links from the uexplore home page.

The sf 32000 x Directory (As seen by uexplore) Subdirectories & files with upcased

The sf 32000 x Directory (As seen by uexplore) Subdirectories & files with upcased first letters are shown first. Index. html, Readme. html and, of course, Datasets. html are required reading (browsing). Files are in alphabetical (not logical) order.

(sf 32000 x) Readme. html

(sf 32000 x) Readme. html

The Datasets. html Page (for the sf 32000 x filetype)

The Datasets. html Page (for the sf 32000 x filetype)

Details Page -- sf 32000 x. moi Lots of info here. Most important is

Details Page -- sf 32000 x. moi Lots of info here. Most important is perhaps the Key variables link for variable Sum. Lev (geographic summary level).

Key Variables Report for Sum. Lev (stf 32000 x. moi)

Key Variables Report for Sum. Lev (stf 32000 x. moi)

Filters Based on Sum. Lev Var Sum. Lev Operator Value Equals 040 Results State

Filters Based on Sum. Lev Var Sum. Lev Operator Value Equals 040 Results State Level Summary (only 1 row selected) 140 Census Tract Summaries – 1320 rows selected. 040: 050: 160 1 State level , 115 County level & 972 Place level rows. Sum. Lev Equals Sum. Lev In List

Sample Query 2: What We Want (Repeated in case you forgot) We want data

Sample Query 2: What We Want (Repeated in case you forgot) We want data from the 2000 Census, Summary File 3 regarding poverty in Missouri cities and counties. We want the number and the % of poor persons, as well as the median household income. We only want the data for cities of at least 5000 persons, but for all counties and for the state. We want output as an HTML file sorted by the type of geography (state, county, city) and then by descending poverty rate.

A Complex Filter

A Complex Filter

The Filter Explained There are 2 logical parts to the filter: 1. Sum. Lev

The Filter Explained There are 2 logical parts to the filter: 1. Sum. Lev In (‘ 040’, ’ 050’) 2. Sumlev = ‘ 160’ and Tot. Pop >= 5000 The parentheses checkboxes are used to group the 2 nd & 3 rd lines. The and between lines 2 and 3 is executed before the or between lines 1 and 2.

The Filter Explained, cont. The SAS© code generatd by these menu choices : where

The Filter Explained, cont. The SAS© code generatd by these menu choices : where sumlev in (‘ 040’, ’ 050’) or (sumlev=‘ 160’ and totpop >=5000); The “in” operator (called “In List” on Operator pull-down menu) allows specifying that the value of a variable should be one of a list of values. Those values are entered separated by : ’s in the Value column of the filter specs form.

Completing the Query: Parts 3 & 4

Completing the Query: Parts 3 & 4

HTML Output We see that Pemiscot has the highest poverty rate of any county.

HTML Output We see that Pemiscot has the highest poverty rate of any county. How do we know this? Why don’t we see any data for cities?

Exercise Access the same dataset as in the example: sf 32000 x. moi Select

Exercise Access the same dataset as in the example: sf 32000 x. moi Select census tract summaries in Greene co… … with a poverty rate of at least 10%. Keep all identifiers necessary to identify the tract, and all variables related to poverty. Generate a csv file and load it into a spreadsheet (probably Excel).

Exercise 2 Repeat the previous exercise except do it for all counties (instead of

Exercise 2 Repeat the previous exercise except do it for all counties (instead of census tracts) in the states of Arkansas and Oklahoma. Sort the results by descending poverty rate and generate output in pdf format as well as a csv file. Hint: A good place is start is with the Datasets. html page.

End of Show Questions and Comments to: blodgettj@missouri. edu

End of Show Questions and Comments to: blodgettj@missouri. edu