When working with spatial data, one is rarely interested in working with only one source of data. This tutorial will introduce a set of tools for linking vector data with other data sources. It begins by introducing how to link spatial vector data with non-spatial data in table format, then turns to the problem of linking multiple sources of spatial data through spatial joins and intersects.
This tutorial uses the sp
, rgdal
, and raster
libraries from the RGIS1
tutorial. If you have not yet installed those, please revisit that tutorial for directions. In addition, this tutorial will also make use of the rgeos
library, installation of which is discussed in part0_setup
.
An attribute join combines tabular data with a Spatial* object by associating each observation in a table with a GIS object (a polygon, line, or point). If you have done attribute joins of shapefiles in GIS software like ArcGIS or QGis, or merged two datasets in Stata or R, this process is analogous – in an Attribute Join, a Spatial*Dataframe
(be that a SpatialPolygonsDataFrame
, SpatialPointsDataFrame
, or SpatialLinesDataFrame
) is merged with a table (an R data.frame
) using a common unique identifier.
Assume we have:
SpatialPolygons
object named worldCountries, andwhere:
Then we can just merge worldCountries with countryData on those variables, using the merge
method in sp
:
In this case, that means our merge would look like the following:
worldCountries <- merge(worldCountries, countryData, by.x = "id-number", by.y = "countryID")
Important Note: Always pass the merge
command your Spatial*DataFrame object. Never merge the data.frame
associated with your Spatial*DataFrame object directly, as in the following snippet:
# DON'T DO THIS!
worldCountries@data <- merge(worldCountries@data, countryData, by.x = "id-number", by.y = "countryID")
If merge
is passed two data.frames
instead of a Spatial*DataFrame
and a data.frame
, it may jumble the order of rows in the output data, corrupting your data.
That’s it!
Download and unzip the RGIS2_Data
folder.
Load the CSV table district_vote_shares.csv
from the data loaded above into a dataframe in R and name it vote_shares
.
Load the shapefile congressional_districts
from the folder shapefiles
and call it districts
.
Check out the column names of vote_shares
and of districts
to determine which one might contain the unique identifier for the join. Hint: use the names()
command.
Join the vote_shares
data frame with districts
using merge
as described above. Use the names()
command to see if the join was successful.
Now we could plot one of the variables we just joined - try democratic vote share!
spplot(districts, "dem_vote_share")
Combining different Spatial* data sources is one of the most common things you will do in GIS research. Consider the following examples:
Most of the time, you answer these questions using some form of Spatial Join. Unlike the “Attribute Joins” described above, a spatial join in your first purely spatial operation. In a spatial join, observations from different datasets are joined based not on a variable, but by their relationship in space.
To combine two Spatial* datasets, the first thing you have to do is make sure they have the same CRS. If you try and work with two Spatial* objects in R that are not in the same CRS, you will get results, but those results will be nonsense! Note that this is very different from programs like ArcGIS that will take care of this problem for you! So before we can join Spatial* objects, we need to learn how to re-project different data sources into a common CRS.
Make sure you remember the differences between defining a CRS and re-projecting:
Spatial*
object. For example, if data comes from a GPS device, one must tell the computer those x-y coordinates are longitudes and latitudes. It does not change the values of the numbers, just how the computer interprets them.proj4string
associated with an object, but also all the actual x and y coordinates.Re-projecting vector data requires two tools from the sp
and rgdal
packages:
CRS
object with the new CRS you wish to applyspTransform()
methodAs previously noted, a CRS object includes all the information needed to project a spatial object, generally including both a Geographic Coordinate System (the model of the Earth used to create the data) and a projection (a way of converting points on the three-dimensional Earth onto a two-dimensional plane).
Through package rgdal
, the CRS() function has access to a large library of coordinate systems and transformations, so you just need to know the code for the CRS you want. Codes – often called a “projection strings” – can be found online here.
Once you’ve found the projection you want, you can create the appropriate CRS object using one of two codes – the proj4
string, or the EPSG
code. These are equivalent, but look a little different.
For example, you can create a CRS object for UTM zone 33N (EPSG:32633) by either passing the full proj4 code:
MyNewProjection <- CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")
or the EPSG code:
MyNewProjection <- CRS("+init=EPSG:32633")
Once you’ve defined a CRS
object with the new CRS you want to use, all you have to do is execute the spTransform
function on the object you want to re-project. If, for example, we have an object called MyCity and we want to reproject this into a new CRS called MyNewCRS, we would type:
MyNewCRS <- CRS("definition of projection goes here as string")
MyCity.reprojected <- spTransform(MyCity, MyNewCRS)
Note that you can also retrieve the CRS from an existing Spatial*
object with the proj4string()
command! So if you have two files – file.a
and file.b
– a common idiom for reprojecting file.b
into the CRS of file.a
is:
common.crs <- CRS(proj4string(file.a))
file.b.reprojected <- spTransform(file.b, common.crs)
If you haven’t already, create a directory R_Workshop
on your Desktop and unzip RGIS2_Data
into that folder.
Load the rgdal
, and sp
packages.
If you haven’t already, read in the congressional_districts
shapefile into R and name it districts
.
Read the federal_grants
shapefile into R and name it grants
.
What is the CRS of districts
? What is the CRS of grants
?
Reproject grants
so it matches the projection of districts
and assign it to a new object called grants.newproj
.
You can use the range()
command from the R base package to compare the coordinates before and after reprojection and confirm that you actually have transformed them. range()
simply returns the min and max value of a vector of numbers that you give it. So you can check with:range(coordinates(grants))
andrange(coordinates(grants.newproj))
You can also compare them visually with:
par(mfrow = c(1, 2))
plot(grants, axes = TRUE)
plot(grants.newproj, axes = TRUE)
over
CommandThey primary tool for doing spatial joins is the over
command from the sp
library.
The exact behavior of over
depends on the inputs being used, but the basic idea is: “For each item of first position (the SOURCE), over
returns information about items of second argument (TARGET) that intersect”.
For example, if for every grant (SOURCE) I wanted to get information about their district (TARGET), I could run:
library(maptools)
grants.districts <- over(grants.newproj, districts) # Get district data
grants.districts
DISTRICT NAME Shape_Leng Shape_Area
1 12 San Francisco 70727.76 105358679
2 14 Peninsula 200487.79 706830475
3 14 Peninsula 200487.79 706830475
4 14 Peninsula 200487.79 706830475
5 18 Palo Alto 308447.51 1815914163
6 18 Palo Alto 308447.51 1815914163
# Recombine!
grants.newproj <- spCbind(grants.newproj, grants.districts)
grants.newproj
class : SpatialPointsDataFrame
features : 6
extent : 553582.3, 574813.7, 4141618, 4182618 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
variables : 7
names : GrantBudge, JobsCreate, ProjectNam, DISTRICT, NAME, Shape_Leng, Shape_Area
min values : 210000, 0, Air Traffic Control Retrofit, 12, Palo Alto, 70727.76, 105358679
max values : 120000000, 32, Wetland Preservation, 18, San Francisco, 308447.51, 1815914163
Note the use of the spCbind
command from the maptools
library for merging the datasets – this can also be accomplished without the maptools
library by simple assignment of each vector of data.
Note that because districts
is a SpatialPolygonDataFrame
, over
returned the relevant row of data for each point. If districts
did not have any data, we would just get the index of the intersecting polygon for each grant. We can see this behavior if we use the geometry()
function to strip away the DataFrame:
grants.with.districts2 <- over(grants.newproj, geometry(districts))
head(grants.with.districts2)
1 2 3 4 5 6
2 3 3 3 1 1
A few caveats: * By default, over
will return the first item in the TARGET that intersects with an item in SOURCE if there are multiple items. * over
can only handle intersection of two SpatialPolygons
objects after package rgeos
has been loaded. * more details are found in this document
Multiple Intersections
By default, when there are multiple TARGET observations that intersect a SOURCE observation, over
will just return the first one. There are two ways to address this.
returnList = TRUE
over
normally returns a vector, which can only have one item per row. However, if you pass the argument returnList = TRUE
, over
will return a named list, where: * The name for each list entry is the index of the SOURCE observation * The contents are the indices of all items in the TARGET that intersect the SOURCE observation
Once you have this list, you can compute on it with tools like sapply
.
Here’s an example with just indices of intersecting TARGET observations:
over.list <- over(districts, geometry(grants.newproj), returnList = TRUE)
over.list
$`0`
[1] 5 6
$`1`
[1] 1
$`2`
[1] 2 3 4
Then you can process these down (say, to get the number of each) and merge back in:
library(maptools)
num.grants <- sapply(over.list, length)
districts <- spCbind(districts, num.grants)
districts
class : SpatialPolygonsDataFrame
features : 3
extent : 498739.1, 610118, 4089593, 4191546 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
variables : 5
names : DISTRICT, NAME, Shape_Leng, Shape_Area, num.grants
min values : 12, Palo Alto, 70727.76, 105358679, 1
max values : 18, San Francisco, 308447.51, 1815914163, 3
Here’s an example where one gets back the relevant data.frame
rows of intersecting TARGET observations.
over(districts, grants.newproj, returnList = TRUE)
$`0`
GrantBudge JobsCreate ProjectNam DISTRICT NAME
5 210000 0 NSF Graduate Fellowship 18 Palo Alto
6 120000000 12 DARPA Robotics Grant 18 Palo Alto
Shape_Leng Shape_Area
5 308447.5 1815914163
6 308447.5 1815914163
$`1`
GrantBudge JobsCreate ProjectNam DISTRICT NAME
1 7000000 5 Emergency Starbucks 12 San Francisco
Shape_Leng Shape_Area
1 70727.76 105358679
$`2`
GrantBudge JobsCreate ProjectNam DISTRICT NAME
2 10000000 22 Runway Repairs 14 Peninsula
3 750000 32 Wetland Preservation 14 Peninsula
4 270000 3 Air Traffic Control Retrofit 14 Peninsula
Shape_Leng Shape_Area
2 200487.8 706830475
3 200487.8 706830475
4 200487.8 706830475
fn
or aggregate
However, if the second argument has a data.frame
associated with it, over
can be instructed to aggregate the variables associated with intersecting TARGET observations.
For example, we can use this to get the average value of grants in each district:
over(districts, grants.newproj[, "GrantBudge"], fn = mean)
GrantBudge
0 60105000
1 7000000
2 3673333
Since we are now aggregating GrantBudge
values in grants.newproj
by districts
, we can also use the aggregate
method in sp
for spatial aggregation, which in addition to the over
command above returns a Spatial*
object:
aggregate(grants.newproj[, "GrantBudge"], districts, mean)
class : SpatialPolygonsDataFrame
features : 3
extent : 498739.1, 610118, 4089593, 4191546 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
variables : 1
names : GrantBudge
min values : 3673333.33333333
max values : 60105000
If we had other information on food trucks – like their value – we could also ask for things like their mean value by passing the mean
function.
Note that over
selects anything that intersects, including polygons that only touch each other. When working with two polygon sets, it makes often more sense to apply area weighting, where weights are proportional to the amount of overlap. This is obtained by
aggregate(polygons_A, polygons_B, mean, areaWeighted = TRUE)
Answer code at bottom of exercise, but no cheating!
Now that we’ve covered the basics of GIS in R, we can start working with real data. In the following exercises, we will work with data on US drone strikes in Pakistan collected by the Bureau of Investigative Journalism. Save your code from this exercise – we’ll use these shapefiles more.
Load the pk_districts
shapefile from the shapefiles
folder and save as pk_dist
.
Load the pk_drone_strikes
shapefile from the shapefiles
folder and save as strikes
.
Plot the two shapefiles together using the plot
command. The result should look something like the following. If it does not, consider whether you skipped an IMPORTANT STEP to combining different Spatial* objects.
Look at the pk_drone_strikes
data. What information is included?
Calculate the number of drone strikes per district in Pakistan. (Think: what is your SOURCE and what is your TARGET for this operation? Is it 1-1 or Many to 1) Where are drone strikes concentrated?
What is the average fatality rate per district?
Finally, sometimes we have a SpatialPolygons, and want to know something about the properties within the polygons based on information in a raster dataset. For example, we might have the polygons of electoral districts in Africa, and want to know the average level of light as seen by satellites.
In these situations, we use the extract
tool, which returns the values of all the raster cells that fall within each polygon. (More specifically, it returns a list of numeric vectors, where the vector in position N corresponds to the raster cell values inside polygon N.) You can then compute on these values as you wish!
As with Spatial*
objects, the first step is to make sure everything has a common projection.
Raster objects can be reprojected using the projectRaster
function. However, be aware that reprojecting rasters is not quite as costless as reprojecting vector data. Rasters must also have a regular grid pattern, and since different projections will not necessarily preserve that feature, reprojecting rasters means creating a new grid and computing each value based on overlapping cells from the old grid. Thus it is computationally difficult and can lead to losses of precision. Thus you are probably better off reprojecting your SpatialPolygon
object rather than your raster.
pollution <- raster("RGIS2_Data/pollution.tif")
raster.crs <- CRS(projection(pollution))
districts.reprojected <- spTransform(districts, raster.crs)
extracted.values <- extract(pollution, districts.reprojected)
# A few example outputs -- low index polygons are in south SF and don't
# actually have values, so jumping to top.
extracted.values
[[1]]
[1] 6.8 8.0 6.6 6.6 5.9 5.7 5.9 5.6 5.5 6.0 5.9 6.2 5.5 5.4 4.7 6.4 6.0
[[2]]
5.4
[[3]]
[1] 6.1 6.2 5.9 9.5 6.1 5.9
You can get average values using the sapply()
function:
# A subsets of outputs
sapply(extracted.values, mean)
[1] 6.041176 5.400000 6.616667
Answers below, but no cheating!
Load the population raster from the RGIS2_Data folder and call it pk.pop
with something like the following command (depending on your working directory): pk.pop <- raster("pakp00g")
. This file is a raster where the value of each cell is an estimate of the number of people living in that area. (Note that you are pointing R at a folder with the raster components – raster
, unlike rgdal
, is smart enough to figure out what you’re asking!)
Use the extract command to estimate the population of each Pakistani district.
Compute the number of drone-strikes per capita for each district. .
Answers to Exercise 2
pk.dist <- readOGR(dsn = "RGIS2_data/shapefiles", layer = "pk_districts")
strikes <- readOGR(dsn = "RGIS2_data/shapefiles", layer = "pk_drone_strikes")
# Re-project
dist.crs <- CRS(proj4string(pk.dist))
strikes.projected <- spTransform(strikes, dist.crs)
# Plot
plot(pk.dist)
par(new = T)
plot(strikes.projected, type = ".", col = "blue", add = T)
par(new = F)
# Look at data
head(strikes.projected)
# Number per district
over.list <- over(pk.dist, geometry(strikes.projected), returnList = TRUE)
num.strikes <- sapply(over.list, length)
pk.dist <- spCbind(pk.dist, num.strikes)
pk.dist[pk.dist$num.strikes != 0, ]
# Avg Fatality
avg.fatality <- over(pk.dist, strikes.projected[c("Killed", "CiviliansK")],
fn = mean)
pk.dist <- spCbind(pk.dist, avg.fatality)
pk.dist[!is.na(pk.dist$Killed), ]
Answers to Exercise 3
pk.pop <- raster("RGIS2_Data/pakp00g")
new.crs <- CRS(projection(pk.pop))
pk.dist.rasterproj <- spTransform(pk.dist, new.crs)
pops <- extract(pk.pop, pk.dist.rasterproj)
pops.summed <- sapply(pops, sum)
pk.dist.rasterproj <- spCbind(pk.dist.rasterproj, pops.summed)
pk.dist.rasterproj$strikes.percap <- pk.dist.rasterproj$num.strikes/pk.dist.rasterproj$pops.summed
head(pk.dist.rasterproj[pk.dist.rasterproj$num.strikes > 0, ])
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.