When working with spatial data, one is rarely interested in working with only one source of data. This tutorial will introduce a set of tools for linking vector data with other data sources. It begins by introducing how to link spatial vector data with non-spatial data from in table format, then turns to the problem of linking multiple sources of spatial data through spatial joins and intersects.
This tutorial uses the sp
, rgdal
, and raster
libraries from the RGIS1
tutorial. If you have not yet installed those, please revisit that tutorial for directions. In addition, this tutorial will also make use of the rgeos
library, installation of which is discussed in part0_setup
.
The over
command will cover a remarkable number of situations, but it is really only interesting if two shapes intersect exactly. In practice, however, we are often interested in sligthly more flexible questions: how many cities are within 10km of a drone strike? Or how many people live close to a government project?
When we want to do fancer geometric operations, we use the rgeos
library. rgeos
(which stands for “R interface to the Geometry Engine - Open Source library”) is a set of tools for geometric operations. GEOS
a huge library, and one that underlies many spatial tools (not just in R). Whenever you’re thinking about a geometric operation, rgeos
is the first place to look.
rgeos
tools can be broadly divided into three camps. Here some of the most commonly used ones.
over
can do.The syntax for these tools varies both across tools and depending on the input types, there are three concepts that come up constantly.
rgeos
isn’t a spatial library – it just does geometry. It takes x-y coordinates and applies geometric formula. Thus the units will always be the units of the x-y coordinates, which come from your projection. So here, we can check the projection to find our units:
proj4string(districts)
[1] "+proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"
The units (+units=m
) are meters.
byid
is an option on most rgeos
commands. If byid=TRUE
, each observation in a Spatial* object is handled separately; if byid=FALSE
, a Spatial* object is treated as one big geometry. So if one were working with a SpatialPolygons
object of US states, when byid=TRUE
, the analysis would be conducted for each state; if byid=FALSE
, it would essentially run the analysis against the United States as a whole.
Intuitively, the distinction can often be thought of as the difference between asking whether about a spatial relationship between each observation in the Spatial* object on which the tool is being executed, or any observation in the Spatial* polygon.
To see this illustrated, let’s consider our data we used before on the location of government grants. Suppose we want to identify regions that are within 7km of government grants. We could either ask for the area within 7km of each government project (which would yield a different answer for each grant), or the area within 7km of any government grant (which would yield one large area). You can see this distinction in the following plots:
buffered.grants.byidTRUE <- gBuffer(grants.newproj, width = 7000, byid = TRUE)
buffered.grants.byidFALSE <- gBuffer(grants.newproj, width = 7000, byid = FALSE)
par(mfcol = c(1, 2))
plot(districts, main = "byid TRUE")
plot(buffered.grants.byidTRUE, col = "blue", add = T)
plot(districts, main = "byid FALSE")
plot(buffered.grants.byidFALSE, col = "blue", add = T)
As seen in the figures, when byid=TRUE
, a separate buffer polygon is created around each grant, creating a large number of (overlapping) polygons. When byid=FALSE
, there are no discrete polygons, just a single feature that covers all points within 7km of any government grant. Indeed, if we look at the summary of these buffers, we see this as well – in the first case, you see that under features
, the buffers created with byid=TRUE
have 6 distinct features (observations), while the buffers created with byid=FALSE
have only 1.
buffered.grants.byidTRUE
class : SpatialPolygonsDataFrame
features : 6
extent : 546582.3, 581813.7, 4134618, 4189618 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
variables : 3
names : GrantBudge, JobsCreate, ProjectNam
min values : 210000, 0, Air Traffic Control Retrofit
max values : 120000000, 32, Wetland Preservation
buffered.grants.byidFALSE
class : SpatialPolygons
features : 1
extent : 546582.3, 581813.7, 4134618, 4189618 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
The output of almost all rgeos
commands will be organized by id
. For example, with gArea
, each area has a label: 0, 1, and 2. When working with Spatial*DataFrame objects, id
will be the rowname for a given observation, just like with tools we’ve worked with before.
To see an object’s id
s, you can use the combination of the names
command and the geometry
command (if you don’t use the geometry
command you’ll get column names for objects with associated DataFrames:
names(geometry(districts))
[1] "0" "1" "2"
Things get a little more complicated when rgeos
creates new polygons using tools like gBuffer
or gIntersect
. rgeos
tends to try and do smart things, but the behavior does vary across tools, so always be careful.
Here’s an example – let’s create buffers (polygons of fixed radius) around our grants the buffers 7km, so we get back polygons for all points within 7km of each project.
proj4string(grants.newproj) # check units!
[1] "+proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"
buffered.grants <- gBuffer(grants.newproj, width = 7000, byid = TRUE)
names(geometry(buffered.grants))
[1] "1" "2" "3" "4" "5" "6"
plot(districts)
plot(buffered.grants, col = "blue", add = T)
Note that points has given rise to a new polygon, many of them overlapping, and they were given simple names based on the names of the points that generated them. However, if we want a polygon of points within 7km of any grant, we can pass byid=FALSE
. Now we have one polygon (named buffer
) instead of 5.
buffered.grants <- gBuffer(grants.newproj, width = 7000, byid = FALSE)
names(geometry(buffered.grants))
[1] "buffer"
plot(districts)
plot(buffered.grants, col = "blue", add = T)
Answers below, but no cheating!
Let’s try and figure out what percentage of residents of Pakistans Federally Administered Tribal Areas (FATA, the regions with the most drone strikes) live within 7km of a drone strike.
Subset the pk.dist data for the Province of Fata.
Create a buffer of 7km with the gBuffer
command to help us estimate the total area that is within 7km of any drone strike (what value of byid
should you use?).
Plot your buffers over the districts so you can see them.
rgeos
can also do set operations on polygons, like returning areas in which different polygons coincide, don’t coincide.
This can be useful for lots of things. For example, the figure of grant buffers above looks a little odd because the buffered polygons spread into the ocean. We can fix this by “clipping” the buffered polygons using gIntersection
– essentially keeping only the part of the buffer that also coincides with districts.
intersection <- gIntersection(buffered.grants, districts, byid = TRUE)
plot(districts)
plot(intersection, col = "blue", add = T)
Interestingly, note that while buffered.grants
was previously one feature, because we passed byid=TRUE
, the intersection with the districts executed the intersection for each congressional district, creating three distinct polygons.
intersection
class : SpatialPolygons
features : 3
extent : 546582.3, 581813.7, 4134618, 4187377 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
One challenge with set operations is that the number of features generated by these functions has no mechanical relationship to the number of features that were passed to the set function as inputs. As a result, we can’t just cbind
results back into the original data frame as we’ve been able to do in past tutorials because outputs may be of entirely different lengths than the inputs!
rgeos
tries to address this problem by generating new id
s for the output of set operations that are somewhat intuitive. In general, new id
s are concatenations of the id
s of the objects that gave rise to a new observation. For example, if you ran gIntersection
and one output polygon was created from the overlap of a polygon with the id
of 8
from the first Spatial* object passed and a polygon with the id
z
from the second Spatial* object, that new polygon would have the id
: 8 z
.
We can see this behavior in the id
s of polygons generated from the intersection we just executed between buffered.grants
and districts
:
names(geometry(buffered.grants))
[1] "buffer"
names(geometry(districts))
[1] "0" "1" "2"
names(geometry(intersection))
[1] "buffer 0" "buffer 1" "buffer 2"
The challenge with this is that these concatenated id
variables aren’t always easy to merge with the DataFrames of your original Spatial* objects. For example, say we wanted to measure the share of each electoral district that is within 7km of a government grant. To do so, we would want to divide the area within 7km of each grant in each district by the total area of each district. The problem is that if we measure the area within 7km of each grant – the area of the shapes in the intersection
object – we get back a list organized by these concatenated id
s.
gArea(intersection, byid = TRUE)
buffer 0 buffer 1 buffer 2
178460991 63473835 250339443
The following code snippet will convert lists like this into a DataFrame, and will split these concatenated id
variables into two columns – one for each component of the id
– to make merging easier. This is something you chould be able to copy-and-paste after almost any set operation.
# Run calculation
result <- gArea(intersection, byid = TRUE)
# Convert from list to DataFrame and name result
result <- as.data.frame(result)
colnames(result) <- c("area_near_grants")
# ids are stored as row-names, so make into a column.
ids <- as.character(rownames(result))
# Split the id into two columns based on the space in the middle. Note the
# one place this could break is if the original names had spaces...
new.id.columns <- t(as.data.frame(strsplit(ids, " ")))
colnames(new.id.columns) <- c("id1", "id2")
# Now your results are a data.frame with separate columns for each id
# component!
result <- cbind(result, new.id.columns)
result
area_near_grants id1 id2
buffer 0 178460991 buffer 0
buffer 1 63473835 buffer 1
buffer 2 250339443 buffer 2
Now you can use this reformated data to merge with the original data on districts:
# Move the rownames (which form the ids) into a column for merging
districts$id2 <- rownames(as.data.frame(districts))
districts <- merge(districts, result, by = "id2", type = "left")
as.data.frame(districts)
id2 DISTRICT NAME Shape_Leng Shape_Area area_near_grants id1
1 0 18 Palo Alto 308447.51 1815914163 178460991 buffer
2 1 12 San Francisco 70727.76 105358679 63473835 buffer
3 2 14 Peninsula 200487.79 706830475 250339443 buffer
Now we can calculate the share of each district within 7km!
districts$area_near_grants/districts$Shape_Area
[1] 0.09827612 0.60245474 0.35417183
byid
IssuesOne thing to be aware of is that rgeos
sometimes struggles when byid
is set to FALSE
. For example, if we want to clip our buffered polygon by the extent of all electoral districts – but not add cuts where the polygon crosses the lines between districts – one might try the following code:
The following code:
gIntersection(buffered.grants, districts, byid = FALSE)
But oddly, this generates the error: Error in RGEOSBinTopoFunc(spgeom1, spgeom2, byid, id, drop_lower_td, "rgeos_intersection") :
TopologyException: no outgoing dirEdge found at 570113.57942913717 4148852.3204373871
This kind of problem will come up from time to time – my general solution is to first dissolve the layer we want to consider as one polygon into a single polygon using gUnionCascaded
, then execute the second command:
merged.districts <- gUnionCascaded(districts)
plot(merged.districts)
intersection2 <- gIntersection(buffered.grants, merged.districts, byid = FALSE)
intersection2
class : SpatialPolygons
features : 1
extent : 546582.3, 581813.7, 4134618, 4187377 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
plot(districts)
plot(intersection2, col = "blue", add = T)
Now the intersection is only one part!
rgeos
rgeos
is a very large and powerful library, and as a result you may be feeling a little overwhelmed at the moment. This is normal – because rgeos
has so many tools and each one behaves somewhat differently, it is not possible to provide comprehensive training for everything in a single tutorial. With that in mind, the aim of this tutorial has been to give you a sense of what (a) what the rgeos
library has to offer so you know where to turn when you run into a problem, and (b) to highlight aspects of rgeos
that are common to most tools so you’re prepared to figure out new tools within the least difficulty possible. But if you don’t feel quite as comfortable as you have after other tutorials, that’s prefectly normal.
Answers below, but no cheating!
Previously we created polygons for areas within 7km of a drone strike. Now let’s try and estimate the share of each district in FATA that is within 7km of a drone strike. (Note: districts are the administrative level below provinces. There are 13 districts within the province of FATA.)
Intersect the buffers with Districts using gIntersection
.
Measure the share of each district within 7km of any drone strike using the gArea
command for each district, and for the buffered strikes in each district. When you merge the areas of your intersections back in with districts, be sure to look at that table – why are there missing values? Do you need to worry about them?
Run this analysis just for strikes in 2012. If you’ve used ArcGIS before, how would you re-run this analysis in ArcGIS? Would it be easier or harder?
Answers to Exercise 1
pk.dist <- readOGR(dsn = "RGIS2_data/shapefiles", layer = "pk_districts")
strikes.wrongproj <- readOGR(dsn = "RGIS2_data/shapefiles", layer = "pk_drone_strikes")
dist.crs <- CRS(proj4string(pk.dist))
strikes <- spTransform(strikes.wrongproj, dist.crs)
# Subset to Fata
fata <- pk.dist[pk.dist$PROVINCE == "Fata", ]
fata$OBJECTID <- NULL # Drop column called objectid -- came from shapefile. Just confuses things.
# Buffer strikes
proj4string(strikes) # Check projection units
strike.buffer <- gBuffer(strikes, width = 7000, byid = FALSE)
strike.buffer.inter <- gIntersection(strike.buffer, fata, byid = TRUE)
Answers to Exercise 2
####### Merge intersection areas with original areas
# Now to merge back in with Districts based on row numbers
result <- gArea(strike.buffer.inter, byid = TRUE)
# Convert from list to DataFrame
result <- as.data.frame(result)
colnames(result) <- c("area_near_strikes")
# ids are stored as row-names, so make into a column.
ids <- as.character(rownames(result))
# Split the id into two columns based on the space in the middle. Note the
# one place this could break is if the original names had spaces...
new.id.columns <- t(as.data.frame(strsplit(ids, " ")))
colnames(new.id.columns) <- c("id1", "id2")
# Now your results are a data.frame with separate columns for each id
# component!
result <- cbind(result, new.id.columns)
fata$id2 <- rownames(as.data.frame(fata))
fata <- merge(fata, result, by = "id2", type = "left")
####### Calculate share of area within strike buffer
# Note first that in places where no part of a district was within 7km of a
# strike, there were no `intersection` polygons. Thus when we merge our
# data, we get missing for area near strikes. These need to be zeros.
fata[is.na(fata$area_near_strikes), "area_near_strikes"] <- 0
fata$share_covered <- fata$area_near_strikes/fata$Shape_Area
fata$share_covered
# Only for 2012: add: strikes <- strikes[strikes$year == 2012,] To start of
# code and just rerun!
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.