I am working on a broader project concerning the history of slavery in the United States post-1964, especially focusing on the birth and spread of hyper-incarceration.
To support that history project, I want an accurate understanding of the state of the prison system in the United States in 2025. Perhaps unsurprisingly, this is not a readily available resource. The carceral system of the United States is decentralized, intentionally so. Because of this, I had to make my own dataset.
The ideal result is a dataset that includes every prison, jail, and other carceral facility in the US, with associated statistics such as bed count, population, and more.
There are several ways I can go about this. I could find the data gradually, state by state, building a list and scraping sites as I go. This is doable, but how could I ever know that I had found everything?
A better approach is to build a system to do the finding for me, and that is what I did. It isn’t perfect, and I still have some gaps to fill, but it’s off to an excellent start. To accomplish the task I used Python, Google Earth Engine, NAIP imagery, shapefiles, and the Google Places API.
Initial Data Collection: Google Places API & Geospatial Search
The Google Places API is pretty powerful, but not free. Luckily, they hand out free credits like candy with fresh accounts. With those free credits we can take a coordinate pair, latitude/longitude in this case, and perform a keyword search for terms like ‘prison’, ‘jail’, ‘detention center’, and so on. The Places API gives us back results within a 50,000-meter radius of the input coordinates.
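As a concrete illustration, here is a minimal sketch of one such query against the legacy Nearby Search endpoint. The helper name, keyword list, and pagination handling are mine, not the project’s actual code:

```python
import time
import requests

SEARCH_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
KEYWORDS = ["prison", "jail", "detention center", "correctional facility"]

def search_point(lat: float, lng: float, keyword: str, api_key: str) -> list[dict]:
    """Keyword-search for carceral facilities within 50 km of one grid point."""
    params = {
        "location": f"{lat},{lng}",
        "radius": 50000,  # meters; 50 km is the API's maximum search radius
        "keyword": keyword,
        "key": api_key,
    }
    results = []
    while True:
        resp = requests.get(SEARCH_URL, params=params, timeout=30).json()
        results.extend(resp.get("results", []))
        token = resp.get("next_page_token")
        if not token:
            break
        time.sleep(2)  # the page token takes a moment to become valid
        params = {"pagetoken": token, "key": api_key}
    return results
```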
Useful and cool in its own right, but we can do much better. Imagine distributing 50 km-radius circles across the U.S., slightly overlapping so that full coverage is achieved. We can create a table where each row contains the latitude/longitude pair for the center of one circle. Once we have that, it is as simple as writing a Python program that iterates through the coordinate pairs and stores the results in a new table (sketched below). The start of our dataset is born!
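Here is a rough sketch of that grid-and-iterate step, assuming a simple square lattice over a continental-US bounding box (Alaska and Hawaii would need their own boxes). It reuses search_point() and KEYWORDS from the sketch above, assumes API_KEY is defined, and the SQLite schema is illustrative:

```python
import math
import sqlite3

# Rough continental-US bounding box in degrees -- an illustrative assumption.
LAT_MIN, LAT_MAX = 24.5, 49.5
LNG_MIN, LNG_MAX = -125.0, -66.9

RADIUS_KM = 50.0
# On a square grid, circles of radius r leave no gaps when centers sit
# r * sqrt(2) apart; step slightly under that to guarantee overlap.
STEP_KM = RADIUS_KM * math.sqrt(2) * 0.95
KM_PER_DEG_LAT = 111.0

def grid_centers():
    """Yield (lat, lng) centers of overlapping 50 km circles over the box."""
    lat = LAT_MIN
    while lat <= LAT_MAX:
        # A degree of longitude shrinks with latitude, so widen the step.
        step_lng = STEP_KM / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
        lng = LNG_MIN
        while lng <= LNG_MAX:
            yield round(lat, 4), round(lng, 4)
            lng += step_lng
        lat += STEP_KM / KM_PER_DEG_LAT

# Iterate the grid and store raw hits, deduplicating on place_id.
con = sqlite3.connect("facilities_raw.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS hits "
    "(place_id TEXT PRIMARY KEY, name TEXT, lat REAL, lng REAL)"
)
for lat, lng in grid_centers():
    for keyword in KEYWORDS:
        for hit in search_point(lat, lng, keyword, API_KEY):
            loc = hit["geometry"]["location"]
            con.execute(
                "INSERT OR IGNORE INTO hits VALUES (?, ?, ?, ?)",
                (hit["place_id"], hit.get("name"), loc["lat"], loc["lng"]),
            )
con.commit()
```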
Next up I cleaned the data, then used an LLM to filter the results and remove non-carceral facilities. After this I generated a NAIP image of each facility.
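For the imagery step, here is a sketch of how one might pull a NAIP chip per facility with the earthengine-api client; the buffer size, date window, and function name are my assumptions:

```python
import ee

ee.Initialize()  # assumes an authenticated Earth Engine account

def naip_thumb_url(lat: float, lng: float, buffer_m: int = 500) -> str:
    """Return a thumbnail URL for a recent NAIP chip centered on a facility."""
    point = ee.Geometry.Point([lng, lat])
    region = point.buffer(buffer_m).bounds()
    image = (
        ee.ImageCollection("USDA/NAIP/DOQQ")  # NAIP aerial imagery collection
        .filterBounds(point)
        .filterDate("2020-01-01", "2025-01-01")
        .mosaic()
        .clip(region)
    )
    return image.getThumbURL({
        "region": region,
        "dimensions": 1024,
        "format": "png",
        "bands": ["R", "G", "B"],  # drop the near-infrared band for display
        "min": 0,
        "max": 255,
    })
```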
Visualization
Here is what that data looks like. Pro-tip: click the little box icon in the top-left of the map to go full screen. NOTE: this map may take a few moments to load; it is fairly information-dense.
Click a marker to see facility details and a satellite image.
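For anyone curious how a map like this can be produced, here is a sketch using folium, which also provides the fullscreen control mentioned above; the facility field names are illustrative:

```python
import folium
from folium.plugins import Fullscreen

def build_map(facilities: list[dict], out_path: str = "facilities_map.html") -> None:
    """Render facilities as clickable markers with details and a NAIP image."""
    m = folium.Map(location=[39.5, -98.35], zoom_start=5)  # centered on the US
    Fullscreen(position="topleft").add_to(m)  # the little box icon
    for f in facilities:
        html = (
            f"<b>{f['name']}</b><br>{f.get('address', '')}<br>"
            f"<img src='{f['image_url']}' width='280'>"
        )
        folium.Marker(
            location=[f["lat"], f["lng"]],
            popup=folium.Popup(html, max_width=300),
        ).add_to(m)
    m.save(out_path)
```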
Current Status
You may notice that each facility has an LLM description of its satellite photo. I passed each satellite image to a multimodal LLM to categorize it as a rural or urban setting (as well as plenty of other datapoints) using structured JSON outputs. This is to prepare for training a YOLO model to recognize prison facilities.
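A sketch of what that labeling call can look like, assuming the OpenAI Python SDK and a JSON-mode response; the schema keys here are illustrative, not the project’s actual schema:

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are labeling an aerial photo of a suspected carceral facility. "
    "Return JSON with keys: setting ('rural' or 'urban'), "
    "visible_perimeter (bool), description (string)."
)

def describe_image(png_path: str) -> dict:
    """Ask a multimodal model for a structured description of one image chip."""
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force valid JSON back
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)
```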
This project is young and ongoing. As it develops I will share the code and data. I am actively developing the next steps of this system, and will gradually expand the content here.
Next Steps
- Train a YOLO model to recognize carceral facilities. (ACTIVE - ONGOING)
- Associate each facility with MajorTom embeddings to learn what carceral facilities look like in vector space, and run inference on global embedding datasets. (ACTIVE - ONGOING)
- Adapt a deep research agent to do basic research on each carceral facility, and store results in both natural language and JSON. (ACTIVE - ONGOING)
This is a very technically complex project that requires a significant amount of work, but I am just one person. If you would like to sponsor this project to get it done faster, please reach out; I’d have no problem making this my full-time job.