I am working on a broader project concerning the history of slavery in the United States post-1964, focusing especially on the birth and spread of hyper-incarceration.

To better support that history project, I want an accurate understanding of the state of the United States prison system in 2025. Perhaps unsurprisingly, no such resource is readily available. The carceral system of the United States is decentralized, intentionally so. Because of this, I had to build my own dataset.

The ideal result is a dataset that includes every prison, jail, and other carceral facility in the US, with associated statistics such as bed count, population, and more.

There are several ways I can go about this. I could find the data gradually, state by state, building a list and scraping sites as I go. This is doable, but how could I ever know that I had found everything?

A better approach is to build a system to do the finding for me, and that is what I did. It isn't perfect, and I still have some gaps to fill, but it has given me an excellent start. To accomplish my task I used Python, Google Earth Engine, NAIP images, shapefiles, and the Google Places API.

The Google Places API is pretty powerful, but not free. Luckily, they hand out free credits like candy with fresh accounts. With those free credits we can take a coordinate pair, latitude/longitude in this case, and perform a keyword search for terms like 'prison', 'jail', 'detention center', and so on. The Places API will give us back results within a 50,000-meter radius of the input coordinates.
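Here is a minimal sketch of one such search in Python, using the legacy Nearby Search endpoint. The coordinates and keyword list are illustrative, and a single search returns at most 60 results across three pages:

```python
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; use your own Places API key
NEARBY_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"

def keyword_search(lat: float, lng: float, keyword: str) -> list[dict]:
    """Collect all Nearby Search pages for one keyword around one point."""
    params = {
        "location": f"{lat},{lng}",
        "radius": 50000,  # meters; 50 km is the API maximum
        "keyword": keyword,
        "key": API_KEY,
    }
    results = []
    while True:
        resp = requests.get(NEARBY_URL, params=params).json()
        results.extend(resp.get("results", []))
        token = resp.get("next_page_token")
        if not token:
            return results
        time.sleep(2)  # the next-page token takes a moment to activate
        params = {"pagetoken": token, "key": API_KEY}

hits = []
for kw in ("prison", "jail", "detention center", "correctional facility"):
    hits.extend(keyword_search(38.9072, -77.0369, kw))  # example: Washington, DC
```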

Useful and cool in its own right, but we can do much better. Imagine distributing 50 km-radius circles across the U.S., slightly overlapped so full coverage is achieved. We can create a table where each row contains the latitude/longitude coordinate pair for the center of one circle. Once we do this, it is as simple as writing a Python program that iterates through the coordinate pairs, runs the searches, and stores the results in a new table. The start of our dataset is born!
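As a rough sketch of the grid generation: on a square grid, 50 km circles achieve full coverage when their centers sit at most 50·√2 ≈ 70.7 km apart. The bounding box below is only an approximation of the contiguous US; in practice, the shapefiles mentioned earlier can clip the grid to actual borders:

```python
import csv
import math

RADIUS_KM = 50.0
# On a square grid, circles of radius r cover the plane when centers are
# at most r*sqrt(2) apart; 70 km leaves a little overlap as a margin.
SPACING_KM = 70.0
KM_PER_DEG_LAT = 111.0

# Rough bounding box for the contiguous US (illustrative values; nudge the
# bounds outward slightly to guarantee coverage at the edges).
LAT_MIN, LAT_MAX = 24.5, 49.5
LNG_MIN, LNG_MAX = -125.0, -66.5

rows = []
lat = LAT_MIN
while lat <= LAT_MAX:
    # Longitude degrees shrink toward the poles, so respace each row.
    km_per_deg_lng = KM_PER_DEG_LAT * math.cos(math.radians(lat))
    lng = LNG_MIN
    while lng <= LNG_MAX:
        rows.append((round(lat, 5), round(lng, 5)))
        lng += SPACING_KM / km_per_deg_lng
    lat += SPACING_KM / KM_PER_DEG_LAT

with open("grid_centers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["lat", "lng"])
    writer.writerows(rows)
```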

Next up I cleaned the data, then used an LLM to filter through the results and remove non-carceral facilities. After this I generated an NAIP image of each facility.
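Roughly, the NAIP chip generation looks like this via the Earth Engine Python API and the public USDA/NAIP/DOQQ collection; the buffer size, date window, and output dimensions here are illustrative:

```python
import ee
import requests

ee.Initialize()  # assumes prior `earthengine authenticate`

def naip_chip(lat: float, lng: float, out_path: str, buffer_m: int = 500) -> None:
    """Save a NAIP image chip centered on a facility coordinate."""
    region = ee.Geometry.Point([lng, lat]).buffer(buffer_m).bounds()
    image = (
        ee.ImageCollection("USDA/NAIP/DOQQ")  # public NAIP collection
        .filterBounds(region)
        .filterDate("2018-01-01", "2024-12-31")  # NAIP flight years vary by state
        .sort("system:time_start")  # latest imagery ends up on top of the mosaic
        .mosaic()
        .select(["R", "G", "B"])
    )
    url = image.getThumbURL({"region": region, "dimensions": 1024, "format": "png"})
    with open(out_path, "wb") as f:
        f.write(requests.get(url).content)

naip_chip(30.2672, -97.7431, "facility_0001.png")  # example coordinate
```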

Visualization

Here is what that data looks like. Pro-tip: click the little box icon in the top-left of the map to go full screen. NOTE: This map may take a few moments to load; it is fairly information-dense.

Click a marker to see facility details and satellite image.
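For reference, a marker map like this one can be built with a library such as folium; the sketch below assumes a cleaned CSV with hypothetical columns name, lat, lng, and image_path:

```python
import csv
import folium

# Center the view roughly on the contiguous US.
m = folium.Map(location=[39.5, -98.35], zoom_start=5, tiles="OpenStreetMap")
group = folium.FeatureGroup(name="facilities")

with open("facilities_clean.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Popup shows the facility name plus its satellite image chip.
        html = (
            f"<b>{row['name']}</b><br>"
            f"<img src='{row['image_path']}' width='256'>"
        )
        folium.Marker(
            location=[float(row["lat"]), float(row["lng"])],
            popup=folium.Popup(html, max_width=300),
        ).add_to(group)

group.add_to(m)
m.save("facility_map.html")
```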

Current Status & Next Steps

You may notice that each facility's satellite photo comes with an LLM-generated description. I passed each image to a multi-modal LLM, using structured JSON outputs, to categorize its setting as rural or urban (along with plenty of other datapoints). This is preparation for training a YOLO model to recognize prison facilities.
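The categorization step, sketched with the OpenAI Python SDK in JSON mode; the model name and schema keys below are illustrative rather than the exact fields in my pipeline:

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are labeling aerial images of carceral facilities. "
    "Respond in JSON with keys: setting ('rural' or 'urban'), "
    "visible_perimeter (bool), description (one sentence)."
)

def categorize(image_path: str) -> dict:
    """Send one image chip to a multi-modal LLM and parse the JSON reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any multi-modal model with JSON mode works
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)

print(categorize("facility_0001.png"))
```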

This project is young and ongoing. As it develops I will share the code and data. I am actively developing the next steps of this system, and will gradually expand the content here.

Next Steps
  1. Train a YOLO model to recognize carceral facilities. (ACTIVE - ONGOING)
  2. Associate each facility with MajorTom embeddings to learn what carceral facilities look like in vector space, and run inference on global embedding datasets. (ACTIVE - ONGOING)
  3. Adapt a deep research agent to do basic research on each carceral facility, and store results in both natural language and JSON. (ACTIVE - ONGOING)

This project is technically complex and requires a significant amount of work, but I am just one person. If you would like to sponsor this project to get it done faster, please reach out; I'd have no problem making this my full-time job.