2. Loc 6: Aggregate Demand¶

ISE 754, Fall 2024

Package Used: Functions from the following package are used in this notebook for the first time:

  • CSV: Documentation

Note: Download the latest ufl_heuristics.jl from 5. UFL Heuristics before starting since it was changed on the day after class.

1. Ex: Create 2-digit ZCTA Aggregate Data¶

In [1]:
using Logjam.DataTools, Logjam.MapTools
using GeoMakie, DataFrames

df = uszcta3()
Out[1]:
884×7 DataFrame
859 rows omitted
RowZCTA3LATLONPOPALANDAWATERISCUS
Int64Float64Float64Int64Float64Float64Bool
11042.2125-72.57324728391276.1338.492true
21142.1064-72.550417176040.9461.716true
31242.4562-73.2286128186902.05416.827true
41342.589-72.488282872786.42827.535true
51442.5801-71.7824224307474.26814.954true
61542.2158-71.7974398528796.70134.589true
71642.2679-71.8121425856.6882.064true
81742.328-71.4539412856377.36814.444true
91842.6359-71.2041779194393.49316.015true
101942.5725-70.908510471343.03154.709true
112042.1476-71.1507427635386.26725.638true
122142.3402-71.06821414778152.94315.389true
132242.3477-71.0918327441.8740.635true
⋮⋮⋮⋮⋮⋮⋮⋮
87398248.2761-122.2759840035441.9369.439true
87498347.4039-122.5568181345243.77282.374true
87598447.2011-122.475450094160.54812.763true
87698546.9764-123.0235341225358.61355.761true
87798645.793-122.5826653353831.0195.88true
87898847.5845-119.88924195210324.0186.32true
87998946.6216-120.4493014725330.9755.173true
88099047.6799-117.3731686912223.1226.286true
88199147.6773-117.70912101411778.0232.138true
88299247.6847-117.374385114336.7641.33true
88399346.2929-119.0444046367537.07155.382true
88499446.3593-117.29522324664.8714.416true
In [2]:
df.ZCTA2 = floor.(Int64, df.ZCTA3/10)
df
Out[2]:
884×8 DataFrame
859 rows omitted
RowZCTA3LATLONPOPALANDAWATERISCUSZCTA2
Int64Float64Float64Int64Float64Float64BoolInt64
11042.2125-72.57324728391276.1338.492true1
21142.1064-72.550417176040.9461.716true1
31242.4562-73.2286128186902.05416.827true1
41342.589-72.488282872786.42827.535true1
51442.5801-71.7824224307474.26814.954true1
61542.2158-71.7974398528796.70134.589true1
71642.2679-71.8121425856.6882.064true1
81742.328-71.4539412856377.36814.444true1
91842.6359-71.2041779194393.49316.015true1
101942.5725-70.908510471343.03154.709true1
112042.1476-71.1507427635386.26725.638true2
122142.3402-71.06821414778152.94315.389true2
132242.3477-71.0918327441.8740.635true2
⋮⋮⋮⋮⋮⋮⋮⋮⋮
87398248.2761-122.2759840035441.9369.439true98
87498347.4039-122.5568181345243.77282.374true98
87598447.2011-122.475450094160.54812.763true98
87698546.9764-123.0235341225358.61355.761true98
87798645.793-122.5826653353831.0195.88true98
87898847.5845-119.88924195210324.0186.32true98
87998946.6216-120.4493014725330.9755.173true98
88099047.6799-117.3731686912223.1226.286true99
88199147.6773-117.70912101411778.0232.138true99
88299247.6847-117.374385114336.7641.33true99
88399346.2929-119.0444046367537.07155.382true99
88499446.3593-117.29522324664.8714.416true99
In [3]:
gdf = groupby(df, :ZCTA2)
Out[3]:

GroupedDataFrame with 98 groups based on key: ZCTA2

First Group (10 rows): ZCTA2 = 1
RowZCTA3LATLONPOPALANDAWATERISCUSZCTA2
Int64Float64Float64Int64Float64Float64BoolInt64
11042.2125-72.57324728391276.1338.492true1
21142.1064-72.550417176040.9461.716true1
31242.4562-73.2286128186902.05416.827true1
41342.589-72.488282872786.42827.535true1
51442.5801-71.7824224307474.26814.954true1
61542.2158-71.7974398528796.70134.589true1
71642.2679-71.8121425856.6882.064true1
81742.328-71.4539412856377.36814.444true1
91842.6359-71.2041779194393.49316.015true1
101942.5725-70.908510471343.03154.709true1

⋮

Last Group (5 rows): ZCTA2 = 99
RowZCTA3LATLONPOPALANDAWATERISCUSZCTA2
Int64Float64Float64Int64Float64Float64BoolInt64
199047.6799-117.3731686912223.1226.286true99
299147.6773-117.70912101411778.0232.138true99
399247.6847-117.374385114336.7641.33true99
499346.2929-119.0444046367537.07155.382true99
599446.3593-117.29522324664.8714.416true99
In [4]:
centlat(w, y) = sum(w .* y) / sum(w)
centlon(w, x, y) = sum(w .* x .* cos.(y .* π / 180)) / sum(w .* cos.(y .* π / 180))
df2 = combine(gdf, [:POP,:LAT] => centlat => :LAT, [:POP,:LON,:LAT] => centlon => :LON, 
    :POP => sum => :POP, :ALAND => sum => :ALAND, :AWATER => sum => :AWATER, 
    :ISCUS => unique => :ISCUS)
Out[4]:
98×7 DataFrame
73 rows omitted
RowZCTA2LATLONPOPALANDAWATERISCUS
Int64Float64Float64Int64Float64Float64Bool
1142.419-71.703833952715447.11221.345true
2242.0493-71.122847320113352.84460.276true
3343.1541-71.431214301019484.45377.158true
4444.3167-69.7307130973524587.32652.4true
5544.1033-72.82626430779217.25397.935true
6641.4949-72.878436063664846.42206.495true
7740.7527-74.220551932002492.56111.426true
8840.0368-74.681640957944820.29317.837true
91040.8892-73.944553398591417.9851.747true
101140.7151-73.696880665461350.71228.059true
111242.5722-73.9132211888916234.2455.445true
121343.0723-75.834176407215858.6407.325true
131442.8672-78.1191291122911976.1139.941true
⋮⋮⋮⋮⋮⋮⋮⋮
878832.8653-105.46665006455103.3106.359true
888937.018-116.162310448649129.5279.842true
899033.9508-118.2485938874887.584264.029true
909133.9964-118.0453804762461.4816.59true
919233.5846-117.311969471522412.7125.732true
929335.6544-119.568503866330851.1320.212true
939437.7889-122.20357147543949.84162.761true
949538.1673-121.603734577630429.5430.718true
959640.3846-121.80242301317312.4313.918true
969744.7525-122.58423709765611.8971.924true
979847.3647-122.226660349237915.01566.16true
989947.1451-118.032110177922539.9419.552true
In [5]:
all(df2.ISCUS .== true)
Out[5]:
true
In [6]:
fig, ax = makemap(region=:CUS)
scatter!(ax, df2.LON, df2.LAT; marker='.', markersize=24, color=:red)
fig
Out[6]:
No description has been provided for this image

2. Area Adjustment for Demand Points at Centroid¶

In [7]:
function dgc(xy₁, xy₂; unit=:mi)
    length(xy₁) == length(xy₂) == 2 || error("Inputs must have length 2.")
    unit in [:mi, :km] || error("Unit must be :mi or :km")

    Δx, Δy = xy₂[1] - xy₁[1], xy₂[2] - xy₁[2]
    a = sind(Δy / 2)^2 + cosd(xy₁[2]) * cosd(xy₂[2]) * sind(Δx / 2)^2
    2 * asin(min(sqrt(a), 1.0)) * (unit == :mi ? 3958.75 : 6371.00)
end

Dgc(X₁, X₂) = [dgc(i, j) for i in eachrow(X₁), j in eachrow(X₂)]

Daa(X, Xa, a) = [max(dgc(i, j), (2/3)*sqrt(k)/pi) 
    for i in eachrow(X), (j, k) in zip(eachrow(Xa), a)]
Out[7]:
Daa (generic function with 1 method)
In [8]:
@doc uscounty
Out[8]:
uscounty() -> DataFrame

Returns DataFrame containing U.S. county-level data.

Geographic and population data for each U.S. county, including latitude-longitude coordinates representing the center of population of the county. Does not include U.S. territories.

Columns¶

  • STFIP: Integer representing state FIPS (Federal Information Processing Standards) code.
  • COFIP: Integer representing county FIPS code.
  • NAME: String containing name of county.
  • ST: String representing state abbreviation (e.g., AL for Alabama).
  • LAT: Float representing latitude of county center of population.
  • LON: Float representing longitude of county center of population.
  • POP: Integer representing population of county.
  • ALAND: Float representing land area of county in square miles.
  • AWATER: Float representing water area of county in square miles.
  • CBSA: Integer or None representing Core-Based Statistical Area code associated with county.

Sources¶

Area data from [1], population and center of population data from [2], and CBSA data from [3].

  1. U.S. Census Bureau, 2020 Gazetteer Files, 2020Gazcounty_national.txt
  2. U.S. Census Bureau, Centers of Population, CenPop2020MeanCO.txt
  3. U.S. Census Bureau, Core based statistical areas (CBSAs), metropolitan divisions, and combined statistical areas (CSAs), list1_2023.xls

Counties in NC west of longitude -82:

In [9]:
df = filter(r -> (r.STFIP == st2fips(:NC)) && (r.LON < -82.0), uscounty())
Out[9]:
15×10 DataFrame
RowSTFIPCOFIPNAMESTLATLONPOPALANDAWATERCBSA
Int64Int64String31String3Float64Float64Int64Float64Float64Int64?
13721BuncombeNC35.5853-82.5518269452656.4973.45311700
23787HaywoodNC35.5189-82.943162089553.560.94248200
33789HendersonNC35.3416-82.4716116281372.9492.13611700
437111McDowellNC35.6674-82.019944578439.9545.39732000
537115MadisonNC35.839-82.636121193449.6151.86911700
637175TransylvaniaNC35.2101-82.743132986378.3581.96614820
73739CherokeeNC35.1014-84.022628774455.54111.125missing
83743ClayNC35.0363-83.805711089214.9835.798missing
93775GrahamNC35.3321-83.78718030291.9669.675missing
103799JacksonNC35.3329-83.19943109491.1073.766missing
1137113MaconNC35.1569-83.377337014515.5794.094missing
1237121MitchellNC35.9601-82.11414903221.2510.63missing
1337149PolkNC35.2584-82.176919328237.6870.758missing
1437173SwainNC35.4378-83.41214117527.72812.516missing
1537199YanceyNC35.9109-82.294618470312.5920.589missing
In [10]:
# Est act distances without area adjustment
P = hcat(df.LON, df.LAT)
D = Dgc(P, P)*1.2            # mi
Out[10]:
15×15 Matrix{Float64}:
   0.0     26.9638   20.916    36.4865  …   37.0979  59.3279   32.0686
  26.9638   0.0      35.0806   63.4489      56.1158  32.3641   54.4255
  20.916   35.0806    0.0      40.7317      21.1055  64.0585   48.6857
  36.4865  63.4489   40.7317    0.0         35.5253  95.8125   27.3705
  21.7893  33.6424   42.7053   43.838       57.2466  61.9657   23.7057
  33.6868  28.9576   21.3649   61.8361  …   38.5528  49.0308   65.5071
 107.255   80.8233  106.92    143.273      125.754   49.8623  134.558
  96.2775  70.7831   93.8753  131.61       111.952   42.6517  125.175
  86.0269  59.0895   88.9782  122.481      109.138   26.8294  111.451
  48.4611  23.1699   49.2031   84.2826      69.4451  16.8191   77.5383
  66.1573  41.9963   63.2078  101.019   …   81.7578  23.4127   96.1518
  42.8126  66.7179   56.655    25.0776      58.3283  97.5358   12.7934
  37.0979  56.1158   21.1055   35.5253       0.0     84.8407   54.68
  59.3279  32.3641   64.0585   95.8125      84.8407   0.0      84.867
  32.0686  54.4255   48.6857   27.3705      54.68    84.867     0.0
In [11]:
# Area adjustment
Da = Daa(P, P, df.ALAND)*1.2
Out[11]:
15×15 Matrix{Float64}:
   6.52464  26.9638    20.916     36.4865   …   37.0979   59.3279    32.0686
  26.9638    5.99132   35.0806    63.4489       56.1158   32.3641    54.4255
  20.916    35.0806     4.91773   40.7317       21.1055   64.0585    48.6857
  36.4865   63.4489    40.7317     5.34126      35.5253   95.8125    27.3705
  21.7893   33.6424    42.7053    43.838        57.2466   61.9657    23.7057
  33.6868   28.9576    21.3649    61.8361   …   38.5528   49.0308    65.5071
 107.255    80.8233   106.92     143.273       125.754    49.8623   134.558
  96.2775   70.7831    93.8753   131.61        111.952    42.6517   125.175
  86.0269   59.0895    88.9782   122.481       109.138    26.8294   111.451
  48.4611   23.1699    49.2031    84.2826       69.4451   16.8191    77.5383
  66.1573   41.9963    63.2078   101.019    …   81.7578   23.4127    96.1518
  42.8126   66.7179    56.655     25.0776       58.3283   97.5358    12.7934
  37.0979   56.1158    21.1055    35.5253        3.92593  84.8407    54.68
  59.3279   32.3641    64.0585    95.8125       84.8407    5.84986   84.867
  32.0686   54.4255    48.6857    27.3705       54.68     84.867      4.50224
In [12]:
# UFL
include("ufl_heuristics.jl")

f = df.POP'                  # person
r = 1/1000                   # $/person-mi
C = r*f.*D                   # $
k = fill(3000, size(C, 1))   # $

y, TC = ufl(k, C)
  Add: 18110.84145958251
 Xchg: 18110.84145958251
  Add: 18110.84145958251
 Drop: 18110.84145958251
Out[12]:
([11, 1, 3], 18110.84145958251)
In [13]:
Ca = r*f.*Da                  # $ with area adjustment

ya, TCa = ufl(k, Ca)
  Add: 20230.627283430036
 Xchg: 20230.627283430036
  Add: 20230.627283430036
 Drop: 20230.627283430036
Out[13]:
([11, 1], 20230.627283430036)

3. Data Wrangling¶

It is often said that preparing data for subsequent processing (a.k.a. data wrangling) represents 80% of most data analysis efforts. Various tools are available in the DataFrames package to simplify the effort.

Tabular Data¶

Most data wrangling tools are designed to operate on tabular data:

  • Cell = Values: The raw data represented in the table are the values in each cell. Some cells may be missing data.
  • Columns = Variables: Each table column represents a different attribute of the data, represented as a different variable.
  • Rows = Observations: Each row of the table represents all of the values of the attributes of a single entity represented as a different observation.

Ex: Array vs. Tabular Representations¶

Three variables: var1 with possible values {A,B}; var2 with values {a,b}; var3 with values {1,2,3,4}:

Array:

a b
A 1 2
B 3 4

Table:

var1 var2 var3
A a 1
A b 2
B a 3
B b 4

Array representation is concise and often used to efficiently represent data for up to three variables, but it can be hard to manipulate for more than three variables. For example, the three variables needed to represent a network (i, j, w) can be represented as an adjacency matrix array (using an efficient sparse matrix) compared to a tabular arc list representation. For more than three variables, a dictionary could be used, but it is easier to manipulate tabular data; for example, if a new variable, var4, needs to be added in Ex 1, above, only another column needs to be added to the table, while the array would need to become 3-D.

Missing Data: Skip vs. Dropping vs. Imputation¶

For various reasons, some cells in a table may be missing data. In Julia, missing data can be identified with the type missing. All subsequent results that use the missing data will themselves be identified as missing and the data type of the column will be the union of missing and the type of the non-missing data (e.g., Union{missing, Int64}) and the type of column in a data frame with include ? (e.g.. Int64?); as a result, it is necessary to either skip, remove, or replace the missing data:

  • Skip: if possible, process data while skipping any missing values
  • Dropping: removing any observations that contain cells with missing data
  • Imputation: replacing any cells with missing data with a representative value

For large data sets with few missing values, any missing data is typically dropped; for smaller data sets, where the missing data occur in columns containing relatively unimportant variables, it may be possible to provide a reasonable value to replace the missing cell. For example, for numerical values, the replacement value could be the average of the non-missing values for the same variable; for categorical values, the replacement value could be the most likely or mode value of the non-missing values.

Ex: Per-Capita Income¶

Create a data frame that contains the per-capita income for all of the census tracts in Raleigh-Durham-Cary CSA:

In [14]:
using Logjam.DataTools, DataFrames

df = uscsa()
Out[14]:
180×7 DataFrame
155 rows omitted
RowCSANAMELATLONPOPALANDAWATER
Int64String?Float64Float64Int64Float64Float64
1101Abilene-Sweetwater, TX32.4453-99.79811913173655.5116.142
2104Albany-Schenectady, NY42.8411-73.830211907276043.36203.794
3105Albuquerque-Santa Fe-Los Alamos, NM35.2446-106.485116252323811.570.887
4106Allentown-Bethlehem-East Stroudsburg, PA-NJ40.7388-75.378910302162061.3431.219
5107Altoona-Huntingdon, PA40.4717-78.27781669141399.9315.887
6108Amarillo-Borger, TX35.2075-101.8272893086038.440.328
7118Appleton-Oshkosh-Neenah, WI44.2032-88.44174148771390.5229.752
8120Asheville-Waynesville-Brevard, NC35.5067-82.59785020012410.9810.366
9122Atlanta--Athens-Clarke County--Sandy Springs, GA-AL33.8526-84.3149697617113330.2255.544
10132Baton Rouge-Hammond, LA30.4545-90.972910037265161.87254.856
11142Birmingham-Cullman-Talladega, AL33.5239-86.740613610337402.23149.233
12144Bloomington-Bedford, IN39.1093-86.5562060501228.9622.119
13145Bloomington-Pontiac, IL40.5645-88.88872067692226.885.32
⋮⋮⋮⋮⋮⋮⋮⋮
169538Tulsa-Bartlesville-Muskogee, OK36.1116-95.881711341257496.37228.174
170539Tupelo-Corinth, MS34.4643-88.60221669541797.7915.395
171540Tyler-Jacksonville, TX32.2529-95.28592838911974.4337.466
172542Union City-Martin, TN36.3431-88.9217636891125.212.029
173544Victoria-Port Lavaca, TX28.7782-96.95131184372241.05539.812
174545Virginia Beach-Chesapeake, VA-NC36.8807-76.311818575423811.982572.54
175548Washington-Baltimore-Arlington, DC-MD-VA-WV-PA39.0344-77.06791002833113200.62014.1
176554Wausau-Stevens Point-Wisconsin Rapids, WI44.6928-89.72952825973139.1169.298
177555Weatherford-Elk City, OK35.4389-99.116618472893.5421.015
178556Wichita-Arkansas City-Winfield, KS37.665-97.29676821595273.8139.948
179558Williamsport-Lock Haven, PA41.2223-77.09691516382116.923.975
180566Youngstown-Warren-Salem, OH41.0699-80.72735324681561.5134.977
In [15]:
df = filter(r -> startswith(r[:NAME], "Ral"), uscsa())
Out[15]:
1×7 DataFrame
RowCSANAMELATLONPOPALANDAWATER
Int64String?Float64Float64Int64Float64Float64
1450Raleigh-Durham-Cary, NC35.8042-78.746522423244978.9109.583
In [16]:
df.CSA
Out[16]:
1-element Vector{Int64}:
 450
In [17]:
df = filter(r -> !ismissing(r[:CSA]) && r[:CSA] in df.CSA, uscbsa())
Out[17]:
5×10 DataFrame
RowCBSANAMELATLONPOPALANDAWATERM_MSACSAISCUS
Int64StringFloat64Float64Int64Float64Float64String31Int64?Bool
120500Durham-Chapel Hill, NC35.9804-78.99175889111758.154.038Metropolitan Statistical Area450true
239580Raleigh-Cary, NC35.7731-78.624314139822118.4128.823Metropolitan Statistical Area450true
311360Anderson Creek, NC35.3541-78.8674133568594.9336.291Micropolitan Statistical Area450true
441820Sanford, NC35.4538-79.173663285255.0624.141Micropolitan Statistical Area450true
525780Henderson, NC36.3328-78.40342578252.39616.29Micropolitan Statistical Area450true
In [18]:
df.CBSA
Out[18]:
5-element Vector{Int64}:
 20500
 39580
 11360
 41820
 25780
In [19]:
df = filter(r -> !ismissing(r[:CBSA]) && r[:CBSA] in df.CBSA, uscounty())
Out[19]:
10×10 DataFrame
RowSTFIPCOFIPNAMESTLATLONPOPALANDAWATERCBSA
Int64Int64String31String3Float64Float64Int64Float64Float64Int64?
13737ChathamNC35.756-79.208476285681.67927.25220500
23763DurhamNC35.9862-78.9017324833286.52511.33820500
33769FranklinNC36.0442-78.35468573491.8022.7739580
43785HarnettNC35.3541-78.8674133568594.9336.29111360
537101JohnstonNC35.5706-78.4191215999792.0173.62739580
637105LeeNC35.4538-79.173663285255.0624.14141820
737135OrangeNC35.9801-79.0823148696397.5593.420500
837145PersonNC36.372-78.970439097392.3412.04820500
937181VanceNC36.3328-78.40342578252.39616.2925780
1037183WakeNC35.7953-78.681129410834.58722.42639580
In [20]:
df = filter(r -> r[:STFIP] in df.STFIP && r[:COFIP] in df.COFIP, uscentract())
Out[20]:
491×10 DataFrame
466 rows omitted
RowSTFIPCOFIPTRFIPSTLATLONPOPALANDAWATERISCUS
Int64Int64Int64String3Float64Float64Int64Float64Float64Bool
1373720103NC35.8423-79.0516552010.9920.044true
2373720104NC35.7866-79.0695608215.8110.329true
3373720105NC35.831-79.110979448.7480.044true
4373720107NC35.8447-79.1709264319.5080.244true
5373720108NC35.8029-79.1364194910.880.268true
6373720201NC35.7996-79.454300759.1150.484true
7373720202NC35.7977-79.2612338247.9040.375true
8373720203NC35.7311-79.3019240952.5160.115true
9373720300NC35.6923-79.4761274255.8490.279true
10373720401NC35.7218-79.460358357.0590.04true
11373720402NC35.7378-79.4674461010.2110.145true
12373720500NC35.5849-79.4629360582.1910.257true
13373720600NC35.6116-79.28835199117.1460.912true
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
4803718354224NC35.9754-78.500346511.5820.059true
4813718354303NC35.8826-78.3859278321.1170.116true
4823718354304NC35.86-78.3206562815.010.075true
4833718354305NC35.831-78.317835514.7590.07true
4843718354306NC35.8137-78.3207363510.8720.087true
4853718354402NC35.769-78.394820115.6760.039true
4863718354403NC35.8356-78.4153415912.9910.12true
4873718354404NC35.7991-78.3756692711.2450.03true
4883718354501NC35.7417-78.660257631.4110.013true
4893718354502NC35.7528-78.648638983.6490.015true
49037183980100NC35.8708-78.782327.2390.152true
49137183980200NC35.8655-78.74908.2520.128true
In [21]:
using GeoMakie, Logjam.MapTools

fig, ax = makemap(df.LON, df.LAT)
scatter!(ax, df.LON, df.LAT; marker='.', markersize=24, color=:red)
fig
Out[21]:
No description has been provided for this image
In [22]:
using DataFrames, CSV

df2 = DataFrame(CSV.File("ACSDT5Y2022.B19301-Data.csv"))
Out[22]:
2673×5 DataFrame
2648 rows omitted
RowGEO_IDNAMEB19301_001EB19301_001MColumn5
String31StringStringStringMissing
1GeographyGeographic Area NameEstimate!!Per capita income in the past 12 months (in 2022 inflation-adjusted dollars)Margin of Error!!Per capita income in the past 12 months (in 2022 inflation-adjusted dollars)missing
21400000US37001020100Census Tract 201; Alamance County; North Carolina363846984missing
31400000US37001020200Census Tract 202; Alamance County; North Carolina169992944missing
41400000US37001020301Census Tract 203.01; Alamance County; North Carolina195622140missing
51400000US37001020302Census Tract 203.02; Alamance County; North Carolina207045129missing
61400000US37001020400Census Tract 204; Alamance County; North Carolina200152418missing
71400000US37001020501Census Tract 205.01; Alamance County; North Carolina248502746missing
81400000US37001020502Census Tract 205.02; Alamance County; North Carolina192462246missing
91400000US37001020601Census Tract 206.01; Alamance County; North Carolina611149256missing
101400000US37001020602Census Tract 206.02; Alamance County; North Carolina5158914215missing
111400000US37001020701Census Tract 207.01; Alamance County; North Carolina388176114missing
121400000US37001020702Census Tract 207.02; Alamance County; North Carolina298865312missing
131400000US37001020801Census Tract 208.01; Alamance County; North Carolina258624446missing
⋮⋮⋮⋮⋮⋮
26621400000US37197050200Census Tract 502; Yadkin County; North Carolina325433742missing
26631400000US37197050300Census Tract 503; Yadkin County; North Carolina343567804missing
26641400000US37197050401Census Tract 504.01; Yadkin County; North Carolina239025581missing
26651400000US37197050402Census Tract 504.02; Yadkin County; North Carolina333355934missing
26661400000US37197050502Census Tract 505.02; Yadkin County; North Carolina344284901missing
26671400000US37197050503Census Tract 505.03; Yadkin County; North Carolina3896712531missing
26681400000US37197050504Census Tract 505.04; Yadkin County; North Carolina260435041missing
26691400000US37199960101Census Tract 9601.01; Yancey County; North Carolina274754463missing
26701400000US37199960102Census Tract 9601.02; Yancey County; North Carolina240593132missing
26711400000US37199960200Census Tract 9602; Yancey County; North Carolina372636344missing
26721400000US37199960300Census Tract 9603; Yancey County; North Carolina273704168missing
26731400000US37199960400Census Tract 9604; Yancey County; North Carolina336625335missing
In [23]:
df2 = DataFrame(CSV.File("ACSDT5Y2022.B19301-Data.csv"; skipto=3))
println(nrow(df2))
first(df2, 5)
2672
Out[23]:
5×5 DataFrame
RowGEO_IDNAMEB19301_001EB19301_001MColumn5
String31StringString7String7Missing
11400000US37001020100Census Tract 201; Alamance County; North Carolina363846984missing
21400000US37001020200Census Tract 202; Alamance County; North Carolina169992944missing
31400000US37001020301Census Tract 203.01; Alamance County; North Carolina195622140missing
41400000US37001020302Census Tract 203.02; Alamance County; North Carolina207045129missing
51400000US37001020400Census Tract 204; Alamance County; North Carolina200152418missing
In [24]:
df2.GEO_ID = replace.(df2.GEO_ID, r"1400000US" => "")
println(nrow(df2))
first(df2, 5)
2672
Out[24]:
5×5 DataFrame
RowGEO_IDNAMEB19301_001EB19301_001MColumn5
StringStringString7String7Missing
137001020100Census Tract 201; Alamance County; North Carolina363846984missing
237001020200Census Tract 202; Alamance County; North Carolina169992944missing
337001020301Census Tract 203.01; Alamance County; North Carolina195622140missing
437001020302Census Tract 203.02; Alamance County; North Carolina207045129missing
537001020400Census Tract 204; Alamance County; North Carolina200152418missing
In [25]:
rename!(df2, names(df2)[3] => :INCOME)
println(nrow(df2))
first(df2, 5)
2672
Out[25]:
5×5 DataFrame
RowGEO_IDNAMEINCOMEB19301_001MColumn5
StringStringString7String7Missing
137001020100Census Tract 201; Alamance County; North Carolina363846984missing
237001020200Census Tract 202; Alamance County; North Carolina169992944missing
337001020301Census Tract 203.01; Alamance County; North Carolina195622140missing
437001020302Census Tract 203.02; Alamance County; North Carolina207045129missing
537001020400Census Tract 204; Alamance County; North Carolina200152418missing
In [26]:
df2.INCOME = parse.(Int64, df2.INCOME)
ArgumentError: input string is empty or only contains whitespace

Stacktrace:
  [1] tryparse_internal(::Type{Int64}, s::String7, startpos::Int64, endpos::Int64, base_::Int64, raise::Bool)
    @ Base .\parse.jl:115
  [2] parse(::Type{Int64}, s::String7; base::Nothing)
    @ Base .\parse.jl:254
  [3] parse
    @ .\parse.jl:253 [inlined]
  [4] _broadcast_getindex_evalf
    @ .\broadcast.jl:709 [inlined]
  [5] _broadcast_getindex
    @ .\broadcast.jl:692 [inlined]
  [6] getindex
    @ .\broadcast.jl:636 [inlined]
  [7] macro expansion
    @ .\broadcast.jl:1004 [inlined]
  [8] macro expansion
    @ .\simdloop.jl:77 [inlined]
  [9] copyto!
    @ .\broadcast.jl:1003 [inlined]
 [10] copyto!
    @ .\broadcast.jl:956 [inlined]
 [11] copy
    @ .\broadcast.jl:928 [inlined]
 [12] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(parse), Tuple{Base.RefValue{Type{Int64}}, Vector{String7}}})
    @ Base.Broadcast .\broadcast.jl:903
 [13] top-level scope
    @ In[26]:1
In [27]:
sum(ismissing.(df2.INCOME))
Out[27]:
0
In [28]:
df2 = filter(r -> try parse(Int, r.INCOME); true catch; false end, df2)
println(nrow(df2))
first(df2, 5)
2647
Out[28]:
5×5 DataFrame
RowGEO_IDNAMEINCOMEB19301_001MColumn5
StringStringString7String7Missing
137001020100Census Tract 201; Alamance County; North Carolina363846984missing
237001020200Census Tract 202; Alamance County; North Carolina169992944missing
337001020301Census Tract 203.01; Alamance County; North Carolina195622140missing
437001020302Census Tract 203.02; Alamance County; North Carolina207045129missing
537001020400Census Tract 204; Alamance County; North Carolina200152418missing
In [29]:
df2.INCOME = parse.(Int64, df2.INCOME)
println(nrow(df2))
first(df2, 5)
2647
Out[29]:
5×5 DataFrame
RowGEO_IDNAMEINCOMEB19301_001MColumn5
StringStringInt64String7Missing
137001020100Census Tract 201; Alamance County; North Carolina363846984missing
237001020200Census Tract 202; Alamance County; North Carolina169992944missing
337001020301Census Tract 203.01; Alamance County; North Carolina195622140missing
437001020302Census Tract 203.02; Alamance County; North Carolina207045129missing
537001020400Census Tract 204; Alamance County; North Carolina200152418missing
In [30]:
df.GEO_ID = string.(df.STFIP, pad=2) .* string.(df.COFIP, pad=3) .* string.(df.TRFIP, pad=6)
println(nrow(df))
first(df, 5)
491
Out[30]:
5×11 DataFrame
RowSTFIPCOFIPTRFIPSTLATLONPOPALANDAWATERISCUSGEO_ID
Int64Int64Int64String3Float64Float64Int64Float64Float64BoolString
1373720103NC35.8423-79.0516552010.9920.044true37037020103
2373720104NC35.7866-79.0695608215.8110.329true37037020104
3373720105NC35.831-79.110979448.7480.044true37037020105
4373720107NC35.8447-79.1709264319.5080.244true37037020107
5373720108NC35.8029-79.1364194910.880.268true37037020108
In [31]:
df3 = leftjoin(df, df2[!, [:GEO_ID, :INCOME]], on=:GEO_ID)
println(nrow(df3))
first(df3, 5)
491
Out[31]:
5×12 DataFrame
RowSTFIPCOFIPTRFIPSTLATLONPOPALANDAWATERISCUSGEO_IDINCOME
Int64Int64Int64String3Float64Float64Int64Float64Float64BoolStringInt64?
1373720103NC35.8423-79.0516552010.9920.044true3703702010395434
2373720104NC35.7866-79.0695608215.8110.329true3703702010473654
3373720105NC35.831-79.110979448.7480.044true3703702010562005
4373720107NC35.8447-79.1709264319.5080.244true3703702010758697
5373720108NC35.8029-79.1364194910.880.268true3703702010847862
In [32]:
sum(ismissing.(df3.INCOME))
Out[32]:
3
In [33]:
df3 = filter(r -> !ismissing(r.INCOME), df3)
df3.INCOME = Int64.(df3.INCOME)
println(nrow(df3))
first(df3, 5)
488
Out[33]:
5×12 DataFrame
RowSTFIPCOFIPTRFIPSTLATLONPOPALANDAWATERISCUSGEO_IDINCOME
Int64Int64Int64String3Float64Float64Int64Float64Float64BoolStringInt64
1373720103NC35.8423-79.0516552010.9920.044true3703702010395434
2373720104NC35.7866-79.0695608215.8110.329true3703702010473654
3373720105NC35.831-79.110979448.7480.044true3703702010562005
4373720107NC35.8447-79.1709264319.5080.244true3703702010758697
5373720108NC35.8029-79.1364194910.880.268true3703702010847862
In [34]:
CSV.write("RaleighCSA-TR-INCOME.csv", df3)
Out[34]:
"RaleighCSA-TR-INCOME.csv"
In [35]:
df = DataFrame(CSV.File("RaleighCSA-TR-INCOME.csv"))
println(nrow(df))
first(df, 5)
488
Out[35]:
5×12 DataFrame
RowSTFIPCOFIPTRFIPSTLATLONPOPALANDAWATERISCUSGEO_IDINCOME
Int64Int64Int64String3Float64Float64Int64Float64Float64BoolInt64Int64
1373720103NC35.8423-79.0516552010.9920.044true3703702010395434
2373720104NC35.7866-79.0695608215.8110.329true3703702010473654
3373720105NC35.831-79.110979448.7480.044true3703702010562005
4373720107NC35.8447-79.1709264319.5080.244true3703702010758697
5373720108NC35.8029-79.1364194910.880.268true3703702010847862