Computational Geometry Tools
As I've been reading about the recent developments in Pennsylvania's congressional map, I started to wonder if there was some kind of algorithmic solution to this problem. If there is (surely there is), I haven't quite found it, but I'd like to share some of the problems and solutions I've uncovered along the way.
First, a caveat: As this is a mathematical exercise and not a political one, all of the following analysis is done purely with geographic population data from the US Census Bureau. It ignores all political and other demographic information that might be used in practice when drawing congressional maps.
There are three important properties that any set of fair congressional districts should have:
- They should contain roughly the same number of people
- They should be contiguous
- They should be as close to convex as possible -- convexity will serve as a simple, well-defined proxy for the vague notion of "compactness" often employed during redistricting.
This last property cannot (and, in practice, should not necessarily) always be adhered to, but it seems like a good rule of thumb to avoid pathologically odd shapes, and to keep demographically similar groups together.
As a first approximation, I ran a simple k-means clustering on the population data. The state (in this case Massachusetts) could then be divided into districts using a Voronoi diagram with the final k means as the seeds of the cells. 2D Voronoi diagrams have the property that all cells are convex. This resulted in the following maps:
Note: throughout this project, I have made extensive use of the scipy.spatial library. This includes functions for drawing the Voronoi diagrams above as well as the convex hull and Delaunay triangulation objects that I will eventually use.
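This first pass can be sketched in a few lines. The snippet below is my own minimal version, using scipy.cluster.vq.kmeans2 as a stand-in k-means implementation and made-up test points in place of the census data:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import Voronoi

def voronoi_districts(points, k):
    """First-pass districting: k-means on the population points,
    then a Voronoi diagram seeded with the k final means."""
    means, labels = kmeans2(points, k, minit='++', seed=0)
    vor = Voronoi(means)  # every cell of a 2D Voronoi diagram is convex
    return means, labels, vor

# stand-in for census population points
pts = np.random.default_rng(1).random((500, 2))
means, labels, vor = voronoi_districts(pts, 9)
```

The Voronoi object can then be drawn with scipy.spatial's plotting helpers to produce maps like the ones above.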
The maps are different because of the inherent randomness in initializing the means for the k-means clustering. These maps have a few nice properties:
- Convex districts
- Districts in more populous areas are smaller
- The Cape and The Islands are grouped together
But, ultimately, they fail to provide good guidelines for drawing districts because the most populous district has 10-12 times the population of the least populous district. Standard k-means clustering does not include a mechanism to ensure that each cluster has an equal number of points. My solution to this problem was to first run a standard k-means clustering, and then go through the clusters from least populous to most populous, adding the nearest non-member points to that cluster. This shrinks populous areas and allows points to flow to less populous areas. The code can be found here. Below are some of the results of running this process on test data:
In the figure above, the left panel shows the data after clustering and before equalization, and the right shows the data after equalization. You can see that the simple k-means clusters do in fact match up exactly with the Voronoi cells superimposed on them. On the right, we have sacrificed convexity for equal size, which is more important in this context.
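A stripped-down sketch of that equalization pass is below. The real code is linked above; the names here are my own, and the tie-breaking and locking details are simplified:

```python
import numpy as np

def equalize(points, labels, k):
    """Visit clusters from least to most populous; each steals its
    nearest unlocked non-member points until it reaches n/k members,
    then its members are locked so clusters can't trade points forever."""
    n = len(points)
    target = n // k
    labels = labels.copy()
    locked = np.zeros(n, dtype=bool)
    for c in np.argsort(np.bincount(labels, minlength=k)):
        members = labels == c
        need = target - members.sum()
        if need > 0 and members.any():
            centroid = points[members].mean(axis=0)
            candidates = np.flatnonzero(~members & ~locked)
            dists = np.linalg.norm(points[candidates] - centroid, axis=1)
            labels[candidates[np.argsort(dists)[:need]]] = c
        locked[labels == c] = True
    return labels
```

Because points are locked once their cluster has been processed, the final clusters all end up within a point or two of n/k, at the cost of convexity.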
In practice, we do not want to plot every single data point, as there may be millions. It would be preferable to draw a bounding shape, say by connecting the boundary points for each cluster. For a convex set of points, this would be the "convex hull" of the set: the smallest convex shape that can encompass the set of points. The convex hull can be quickly and efficiently calculated, and there are plenty of existing implementations.
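For example, scipy's ConvexHull (a wrapper around Qhull) computes this directly; the toy points below stand in for one cluster:

```python
import numpy as np
from scipy.spatial import ConvexHull

pts = np.random.default_rng(0).random((200, 2))  # stand-in for one cluster
hull = ConvexHull(pts)
# hull.vertices indexes the boundary points in counterclockwise order,
# so pts[hull.vertices] traces the smallest convex polygon around pts
boundary = pts[hull.vertices]
```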
Not surprisingly, the convex hulls (drawn in black) do not perfectly delineate the non-convex clusters resulting from equalization. In fact, while there is a unique convex hull for any set of points, there is no unique way to define a non-convex boundary of a point set because, essentially, it depends on a scale that we must choose arbitrarily. That is -- we must decide how much space we will allow between two boundary points before some intervening point behind them is also considered part of the boundary. This idea is formalized with the concept of the alpha shape.
To understand the alpha shape, we must first discuss the Delaunay triangulation. In short, the Delaunay triangulation of a set of points is the set of triangles with those points as vertices such that no point is within the circle circumscribing any triangle within the triangulation. That's all very abstract, but suffice it to say that taking the Delaunay triangulation of a set of points gives you a triangle mesh of those points.
I will illustrate this with a non-convex set of points in the shape of an 'H' -- you can find the code for this part here.
What we want is something that takes in some set of points like the one on the left, and returns a set of points like the one on the right. We achieve this in four steps:
- Get the Delaunay triangulation of the input set
- Find the alpha complex of the Delaunay triangulation -- this part depends on an arbitrary parameter 'alpha'. The alpha complex is essentially a subset of the Delaunay triangulation that throws out triangles whose circumradii are larger than 1/alpha.
- Find the boundary simplices (...triangles) of the alpha complex. The Scipy implementation of the Delaunay triangulation has a helpful attribute "neighbors" that tells us which triangles neighbor other triangles within the complex. In general, a triangle can share an edge with up to three other triangles, and the boundary triangles are precisely those that share an edge with fewer than three.
- Find the boundary points -- I only did this very roughly. A decent first approximation of the boundary points would be those points that are members of exactly two boundary triangles. It turns out that this last step was not particularly useful or necessary for my ultimate purpose. It is not useful because there is no guarantee that the boundary points returned will be in counterclockwise order -- this would be required to uniquely define a polygon from the boundary points. It is not necessary because we can get a sense of the boundary simply by drawing the boundary triangles.
The results of the first three steps above are shown below. The result of the last step is, of course, shown in the right panel above.
The full Delaunay triangulation is shown in black, the alpha complex in magenta, and the boundary simplices in cyan.
Finally, this allows us to draw boundaries of non-convex shapes, and we can produce images like the following:
Unfortunately, as a consequence of the equalization part of the clustering algorithm, we sometimes end up with non-contiguous clusters. This is because of a mechanism I included to lock points into their clusters. The lock is needed to prevent two clusters from trading the same points back and forth forever, but it also means that, occasionally, there are no nearby points to add to a small cluster, which must then claim points far away, creating an exclave, like so:
Here, the green cluster was boxed in by points that were locked in the yellow, black, and orange clusters, and it formed two exclaves. When this algorithm is run on the population data from Massachusetts, we generally end up with one or two non-contiguous districts:
These maps were created from the same data using different distance functions. The left-hand map uses the Manhattan, or taxicab, norm; the middle uses the standard Euclidean norm; and the right uses the p-norm with p=3. (The taxicab norm is the p-norm with p=1, and the Euclidean norm is the p-norm with p=2, hence the arrangement of these images.)
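For reference, all three distances are instances of the Minkowski p-norm. A hand-rolled version (numpy.linalg.norm's ord parameter does the same thing for vectors):

```python
import numpy as np

def p_dist(a, b, p):
    # Minkowski distance: p=1 is taxicab, p=2 is Euclidean
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p)
```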
As you can see, each map has at least one non-contiguous district. Unfortunately, I have not fully solved this problem, but I believe it can be solved with the help of the "neighbors" attribute of the Delaunay object, mentioned above. If all you have to work with is a bare set of points, it is difficult to define what it means for them to be adjacent -- that is, how can you say that one set of points is bounded by a contiguous region and another is not? If we work with the alpha complex instead of the points directly, we have a built-in concept of adjacency, which comes from the "neighbors" attribute of the Delaunay object. Additionally, each adjacent triangle added to a complex contributes only one new vertex, so adding triangles to a subcomplex of the alpha complex is roughly the same as adding points to a cluster.
The idea here is to choose k non-adjacent triangles from the alpha complex (say, the ones closest to the k-means). These triangles act as seeds from which these sub-complexes can grow, and since they can only grow by adding adjacent triangles, the sub-complexes will always be contiguous. The code for this can be found here in the class called AlphaCluster. Each growing subcomplex of the alpha complex is an instance of the AlphaCluster class. The result of growing the AlphaClusters from their seeds is shown below:
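I haven't reproduced AlphaCluster here, but the growth loop might look something like this sketch over a generic triangle-adjacency list (round-robin over the clusters so the k sub-complexes grow at comparable rates):

```python
from collections import deque

def grow_subcomplexes(neighbors, seeds):
    """Grow k contiguous sub-complexes from k non-adjacent seed triangles.
    `neighbors[t]` lists the triangles sharing an edge with triangle t.
    Each triangle is claimed exactly once, so every sub-complex stays
    edge-connected as it grows."""
    owner = {s: c for c, s in enumerate(seeds)}
    frontiers = [deque([s]) for s in seeds]
    grew = True
    while grew:
        grew = False
        for c, frontier in enumerate(frontiers):  # round-robin growth
            while frontier:
                t = frontier.popleft()
                free = [n for n in neighbors[t] if n not in owner]
                if free:
                    owner[free[0]] = c
                    frontier.append(t)       # t may have more free neighbors
                    frontier.append(free[0])
                    grew = True
                    break
    return owner
```

On a chain of six triangles seeded at the two ends, for instance, each seed claims the three triangles nearest to it.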
This is a much more computationally demanding process than either k-means or equalized k-means because there are far more triangles in the alpha complex than there are points (each point is a member of many triangles). I will have to optimize the code quite a bit before it can be run on a dataset the size of the population of Massachusetts (or even a tenth that size) in any reasonable amount of time, but I believe this is a step in the right direction.
I would, of course, have preferred to make this post after all the kinks were ironed out; however, I felt that I had encountered and solved enough problems along the way that others might find it interesting as is. I will update this when I can.
One last mathematical curiosity: The Voronoi diagram of a set of points is the dual graph of its Delaunay triangulation -- the graph you get if you make the faces of one graph the vertices of another while preserving adjacency. In this specific case, the Voronoi diagrams and Delaunay triangulations presented were not duals, as the point set used to draw the Voronoi diagrams consisted only of the k means of the full data set used in the Delaunay triangulation. But it is, nonetheless, an interesting connection. For a cool proof of Euler's characteristic formula (V - E + F = 2) that makes use of graph duality, check out this YouTube video: Euler's Formula and Graph Duality.
All of my code for this post can be found on GitHub.