[GIS] Using the US census to analyze data nationally

censusdata

I'm trying to make a table to relate the STFID and LOGRECNO values for the US 2000 census at the block group level, nationally. I realize that the LOGRECNO value is a state specific unique variable, so I am joining it to the state ID code.

The steps that I have done so far are:

Importing all the national geo files into a database (e.g. akgeo.txt,…. to wygeo.txt)

Extracting all the blockgroup values using the following SQL command:

SELECT SF1GEO.LOGRECNO, SF1GEO.REGION, SF1GEO.STATE, SF1GEO.COUNTY, SF1GEO.TRACT, SF1GEO.BLKGRP
FROM SF1GEO
WHERE SF1GEO.BLOCK IS NULL 
AND
SF1GEO.BLKGRP IS NOT NULL
AND 
SF1GEO.TRACT IS NOT NULL
AND 
SF1GEO.COUNTY IS NOT NULL;

I want to ignore all the block level information, and only take the relevant LOGRECNO for each blockgroup (BLKGRP). Then to produce the STFID I concatenate the variables:

STATE + COUNTY + TRACT + BLKGRP 

which gives me a 12 character code (STATE is 2 characters, COUNTY is 3 characters, TRACT is 6 characters, BLKGRP is 1 character). I also combine the STATE and LOGRECNO values for future analysis, based on advice from census.gov as I want to perform national analysis.

  LOGRECNO REGION STATE COUNTY  TRACT BLKGRP        STFID LOG_STATE
1  0000038      3    01    001 021100      2 010010211002 010000038
2  0000077      3    01    001 021100      3 010010211003 010000077
3  0000107      3    01    001 021100      1 010010211001 010000107
4  0000169      3    01    001 021100      2 010010211002 010000169
5  0000231      3    01    001 021100      3 010010211003 010000231
6  0000308      3    01    001 021000      2 010010210002 010000308

At this point I have 546293 values, but only 208144 unique STFID values. One problem I have is that I can't figure out why all my STFID values are not unique.

Then, when I try to merge this table with a shapefile of blockgroups it only matches on 72972 polygons. I also can't figure out why the STFID does not match all. This is what the map looks like:

enter image description here

I'm not sure if there is an obvious pattern to the blockgroups that are included/excluded.


EDIT: I used Sean's suggestion of selecting by SUMLEV asthat makes a lot more sense than what I was doing earlier. So my SQL query is now:

SELECT SF1GEO.LOGRECNO, SF1GEO.STATE, SF1GEO.COUNTY, SF1GEO.TRACT, SF1GEO.BLKGRP
FROM SF1GEO
WHERE (((SF1GEO.SUMLEV)="150"));

However, I'm still mystified when I join this to the shapefile as I end up with the same map as above, and the same number of joins. I am manually making the code to join within the shapefile using the information within by concatenating STATE + COUNTY + TRACT + BLKGRP. What I can't understand is why 72972 values successfully join, and the rest do not.

Best Answer

I don't know what an STFID is and I have pretty intimate knowledge of the Census counts.

The geography file has a record for every geographical unit. There's one for the entire state, for every county, for every tract, for every block etc. Each is uniquely identified by a LOGRECNO. The LOGRECNO will be unique in the geo file.

The other files have all of the counts. It's really one big table split into many files. Each record has, you guessed it, an associated LOGRECNO. The LOGRECNO will not be unique among data files.

You join on the LOGRECNO.

The quickest way to pull out Block Groups is by Summary Level. The Summary Level for Block Groups is 150. So, go through the geo file and pull out every record with a summary level of 150. Those are your Block Groups.

Block Groups are uniquely identified by Block Group number combined with state, county and tract as you have figured out. The corresponding field in the Block Group layer of TIGER is GEOID10. Join your data using GEOID10 and the STATE+COUNTY+TRACT+BG from the summary files.