TLDR: I built a fully automated system using LLMs to generate news articles every 15 minutes from real-time GDELT data (a global event database), showcasing the current capabilities and prevalence of AI content generation.
NOTE: I took the site down and stopped the service, as it was costing too much in inference fees and electricity usage each month.
Below you can find the nerdy technical details of my creation, or you can view the code directly on my GitHub.
If you would like me to build something similar for you or your organization, you can contact me here.
The Age of AI-Generated Content
These days the internet seems full of questionable content and trash writing. Though large language models are getting smarter and more capable every day, their writing skills are still lacking, or at the very least carry an ‘air’ of artificiality. Day to day, we are influenced by artificially authored text without realizing it, and the internet fills up with more of it every passing moment.
Whether it is nudging you toward clicks under some SEO business model or steering you as part of a political campaign, LLM propaganda is now EVERYWHERE. It is essentially unavoidable at this point, and it isn’t limited to shadowy sites under obscure domains with redacted WHOIS records: LLM content is mainstream.
Mainstream Adoption
BuzzFeed was an early adopter and openly admits to using LLM-generated content; the announcement reportedly sent its stock price up 150% (Source: CNN Archive). CNET was caught red-handed publishing LLM-generated articles attributed to “CNET Money Staff” (Source: PBS NewsHour Archive). You can bet your bottom dollar that many, many more large and mainstream outlets are serving you LLM-generated content without telling you, and it is likely that just as many are serving it while outright denying it.
Project Motivation and Goal
It is 2025, we are in the new age of internet influence, and journalism is officially dead (article forthcoming). With this in mind, I set out to create my own automated propaganda site. My goal was a system that is fully automated and completely hands-off after deployment. It generates a new LLM-written article every fifteen minutes, around the clock, based on content from the preceding 15 minutes. As of writing, the site has been live for a little over two months, with around 5,760 articles up and ready for viewing.
Data Source: GDELT
For source material the site uses the Global Database of Events, Language, and Tone (GDELT), a massive open database that continuously ingests print, broadcast, and web news media from around the globe in over 100 languages. My site uses the GDELT 2.0 Events and Mentions dataset, which is updated continuously and offers its most recent content via data dumps every 15 minutes. GDELT purports to analyze all of this content en masse, offering extracted details such as the actors involved (countries, political entities, et cetera), event types (of which there are dozens), geolocation data, a few types of sentiment analysis, and source data.
GDELT Limitations and Utility
The unfortunate reality is that GDELT’s sentiment analysis and event classification are WOEFULLY inaccurate, verging on a joke, but it is still a wildly useful and rich resource if put to use properly. A lot of interesting stuff can be done with a couple hundred thousand neatly labeled URLs every day; it is these URLs that my system ingests and processes.
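To give a concrete sense of what the system works with, here is a minimal sketch of pulling the most recent 15-minute events dump. The `lastupdate.txt` URL and its three-field line format (size, MD5, URL) follow GDELT’s published conventions; the column handling is simplified and purely illustrative.

```python
import io
import zipfile

import pandas as pd
import requests

# GDELT v2 publishes a pointer file listing the latest 15-minute dump
# (one line each for the export, mentions, and GKG archives).
LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

resp = requests.get(LASTUPDATE_URL, timeout=30)
resp.raise_for_status()

# Each line looks like "<size> <md5> <url>"; grab the events export archive.
export_url = next(
    line.split()[-1]
    for line in resp.text.splitlines()
    if line.endswith(".export.CSV.zip")
)

archive = requests.get(export_url, timeout=60)
with zipfile.ZipFile(io.BytesIO(archive.content)) as zf:
    tsv_name = zf.namelist()[0]
    # Despite the .CSV name, the dumps are tab-separated with no header row.
    events = pd.read_csv(zf.open(tsv_name), sep="\t", header=None, low_memory=False)

print(f"{len(events)} events in the latest 15-minute window")
```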
System Workflow Example (TLDR)
Here is the TLDR. In the example below, I search the 15-minute data dumps for information related to the new Trump administration, sort and filter to eliminate all sources written in English, scrape the resulting pages, extract and translate their titles, and send the titles to an LLM, which selects 15 of the pages most likely to portray Trump in a negative light. These 15 articles are then locally machine-translated into English, and all 15 translations are passed back to an LLM for summarization. Finally, the translated articles, their summaries, titles, and source info are passed to an LLM yet again to generate an article. The newly generated article is placed on the main page alongside a series of Grafana visualizations, and the old article is moved to the archive. This cycle runs 24/7, posting an article every 15 minutes.
In total, this system costs me on average about 1 USD per day to operate, processing anywhere from 3 to 8 million tokens every 24 hours. The writing quality is quite low, owing to the use of a budget-friendly LLM (google/gemini-2.0-flash-001).
The Nerdy Details
Overview
- Ingest: Automatically fetch GDELT v2 event and mention data (both English and translated sources) every 15 minutes.
- Store: Process and store the ingested data in a PostgreSQL database, including lookup tables for CAMEO codes and FIPS country codes.
- Analyze:
  - Query the database for specific event mentions (e.g., related to ‘Trump’).
  - Crawl the source URLs of selected mentions to retrieve full article content.
  - Utilize Large Language Models (LLMs) via OpenRouter for:
    - Selecting relevant articles based on titles.
    - Summarizing content and assessing sentiment towards the U.S.
    - Generating a consolidated article based on the summaries.
  - Translate non-English content using LibreTranslate.
- Visualize: Display key metrics and data trends using Grafana dashboards.
- Present: Serve a web frontend (React/Vite/Tailwind) showing the latest generated article, archived articles, and embedded Grafana panels.
- Notify: Update the frontend in real-time when a new article is generated using WebSockets.
- Deploy: Orchestrate all components using Docker Compose, with public access secured via Cloudflare Tunnel.
This project demonstrates the construction of a complex data processing pipeline focused on analyzing global news sentiment related to the United States, using the GDELT dataset as its primary source. The goal was to create a system that automatically ingests, processes, analyzes, and presents insights from near real-time global event data.
Architecture and Design:
The system is built using a microservice architecture, containerized with Docker and orchestrated using Docker Compose. This approach was chosen for modularity and scalability. Key services include data ingestion, database management, web crawling, language translation, LLM-based analysis and summarization, real-time updates via WebSockets, and frontend presentation with embedded data visualizations.
PostgreSQL serves as the central data store, holding both the raw GDELT event/mention data and various lookup tables for data enrichment (CAMEO codes, FIPS codes). Kafka acts as the messaging backbone, decoupling services and enabling asynchronous workflows. For instance, the data ingestion service publishes a message upon completion, triggering the summarization service, which in turn publishes an update message consumed by the WebSocket server to notify the frontend.
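As a concrete illustration of that handoff, here is a minimal sketch using kafka-python. The topic name (`database_status`) and message contents follow the description above; the broker address, consumer group, and the `run_summarization_cycle` entry point are assumptions for the sake of the example.

```python
from kafka import KafkaConsumer, KafkaProducer

BROKER = "kafka:9092"  # assumed service address from Docker Compose networking

# Ingestion side: announce that the 15-minute load has finished.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: v.encode("utf-8"),
)
producer.send("database_status", "database populated")
producer.flush()

# Summarizer side: block until the announcement arrives, then kick off a cycle.
consumer = KafkaConsumer(
    "database_status",
    bootstrap_servers=BROKER,
    group_id="summerizer",  # illustrative group id
    value_deserializer=lambda v: v.decode("utf-8"),
    auto_offset_reset="latest",
)
for message in consumer:
    if message.value == "database populated":
        run_summarization_cycle()  # hypothetical placeholder for the analysis pipeline
```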
Data Flow and Processing:
The core pipeline begins with the `gdelt_connect` service fetching GDELT data every 15 minutes. A significant design choice was to completely refresh the `events` and `mentions` tables during each cycle rather than attempting incremental updates, simplifying state management given the nature of GDELT’s update mechanism.
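A minimal sketch of that refresh-then-reload pattern with psycopg2 is shown below. The connection settings, column names, and use of `execute_values` are illustrative; the real schema has many more columns, and the project also maintains translated variants of both tables.

```python
import psycopg2
from psycopg2.extras import execute_values

def refresh_events(rows):
    """Wipe the events table and bulk-insert the latest 15-minute dump."""
    conn = psycopg2.connect(
        host="postgres", dbname="gdelt", user="gdelt", password="secret"  # assumed credentials
    )
    try:
        with conn, conn.cursor() as cur:
            cur.execute("DELETE FROM events;")  # full refresh, no incremental merge
            execute_values(
                cur,
                "INSERT INTO events (global_event_id, event_date, actor1, actor2) VALUES %s",
                rows,
                page_size=1000,
            )
    finally:
        conn.close()
```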
Once new data is loaded, the `summerizer` service initiates a multi-stage analysis process. It queries the database for relevant mentions, uses a dedicated `crawler` service (leveraging Postlight Parser for robust content extraction) to fetch source articles, and then employs LLMs for several tasks: filtering articles based on titles, summarizing relevant content, assessing sentiment, and finally, generating a cohesive narrative article. The use of concurrent processing (`ThreadPoolExecutor`) helps manage the potentially large number of crawling and LLM tasks efficiently.
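Here is a minimal sketch of that concurrent fan-out with `concurrent.futures.ThreadPoolExecutor`. The crawler endpoint (`http://crawler:3000/crawl`) and its request/response shape are assumptions for illustration, not the project's actual API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Optional

import requests

CRAWLER_URL = "http://crawler:3000/crawl"  # hypothetical endpoint

def crawl(url: str) -> Optional[dict]:
    """Ask the crawler service for the extracted content of one source URL."""
    try:
        resp = requests.post(CRAWLER_URL, json={"url": url}, timeout=60)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return None  # crawling is fragile; drop failures and move on

def crawl_all(urls: list, workers: int = 10) -> list:
    """Crawl many URLs concurrently, keeping only successful extractions."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(crawl, u): u for u in urls}
        for future in as_completed(futures):
            article = future.result()
            if article is not None:
                results.append(article)
    return results
```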
Language barriers are addressed by integrating the `libretranslate` service for both language detection and translation, ensuring content can be processed and analyzed consistently in English.
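Below is a minimal sketch of that detect-then-translate flow against LibreTranslate's HTTP API (`/detect` and `/translate`). The host/port and the absence of an API key are assumptions based on a default self-hosted instance.

```python
import requests

LIBRETRANSLATE = "http://libretranslate:5000"  # assumed in-cluster address

def to_english(text: str) -> str:
    """Detect the language of `text` and translate it to English if needed."""
    detections = requests.post(
        f"{LIBRETRANSLATE}/detect", json={"q": text[:1000]}, timeout=30
    ).json()
    source_lang = detections[0]["language"] if detections else "en"

    if source_lang == "en":
        return text

    translated = requests.post(
        f"{LIBRETRANSLATE}/translate",
        json={"q": text, "source": source_lang, "target": "en", "format": "text"},
        timeout=120,
    ).json()
    return translated["translatedText"]
```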
Frontend and Visualization:
The frontend is a React application built with Vite and styled using Tailwind CSS. It presents the latest AI-generated article, provides access to archived articles, and embeds Grafana panels for data visualization. Real-time updates are achieved through a WebSocket connection managed by a dedicated Node.js service (`ws`), ensuring the displayed article reflects the latest analysis cycle without requiring manual refreshes. Grafana connects directly to the PostgreSQL database, allowing for flexible and dynamic visualization of the underlying event data.
Deployment and Access:
The entire system is containerized, simplifying deployment and dependency management. Public access to the web interface is provided securely using Cloudflare Tunnel, avoiding the need to expose ports directly on the host machine.
Challenges and Considerations:
- Data Volume and Schema: Handling the large volume and complex schema of GDELT data required careful planning for database operations, including batch inserts and schema definition management.
- Web Crawling Robustness: Web crawling is inherently fragile. Using Postlight Parser improves extraction success, but handling failures and varying website structures remains a challenge.
- LLM Integration: Integrating multiple LLM calls requires careful prompt engineering and handling of potential API errors or unexpected response formats. The use of JSON schema enforcement in LLM calls helps mitigate some of these issues (see the sketch after this list).
- Resource Management: Running multiple services, including databases, messaging systems, LLMs (via API), and translation services, requires adequate system resources (CPU, RAM, network bandwidth).
- Cost: Relying on external APIs like OpenRouter incurs costs based on usage.
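Expanding on the JSON schema point above, here is a minimal sketch of a schema-constrained call to OpenRouter through the OpenAI-compatible client. The schema and prompts are simplified stand-ins rather than the project's actual SUMMARY_PROMPT, and whether `response_format` with a JSON schema is honored depends on the model.

```python
import json
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Simplified stand-in for the project's summary schema.
summary_schema = {
    "type": "object",
    "properties": {
        "relevant": {"type": "boolean"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "summary": {"type": "string"},
        "quote": {"type": "string"},
    },
    "required": ["relevant", "sentiment", "summary", "quote"],
    "additionalProperties": False,
}

article_text = "..."  # placeholder: the (translated) article body from the crawler

completion = client.chat.completions.create(
    model="google/gemini-2.0-flash-001",
    messages=[
        {"role": "system", "content": "Summarize the article and assess its sentiment toward the U.S."},
        {"role": "user", "content": article_text},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "article_summary", "strict": True, "schema": summary_schema},
    },
)

result = json.loads(completion.choices[0].message.content)
```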
This project serves as a practical example of integrating various technologies – data ingestion, databases, messaging queues, web crawling, AI/LLM processing, real-time communication, and data visualization – to build a sophisticated, automated information analysis system.
Temporal Workflow
This outlines the sequence of events when the system starts up for the first time and runs through its first two 15-minute cycles.
Initial Startup:
1. Docker Compose Up: The `docker-compose up` command initiates the startup of all defined services: `postgres`, `kafka`, `grafana`, `web`, `cloudflared`, `libretranslate`, `crawler`, `summerizer`, `ws`, `gdelt_connect`, and `create_tables`.
2. Service Initialization:
   - `postgres` starts and initializes the database based on environment variables.
   - `kafka` starts the broker in KRaft mode.
   - `grafana` starts, connects to `postgres` for its own configuration storage, and provisions datasources/dashboards.
   - `libretranslate` starts and potentially downloads language models.
   - `crawler`, `summerizer`, `ws`, `web`, and `cloudflared` start their respective servers/processes.
   - `create_tables` waits for `postgres` and `kafka` to be healthy.
   - `gdelt_connect` waits for `postgres` and `kafka` to be healthy.
Cycle 0: Database Preparation
3. Table Creation (`create_tables`):
   - Once `postgres` and `kafka` are ready, the `create_tables` service runs its main script (`create_tables.py`).
   - It connects to the `postgres` database.
   - It executes a series of Python scripts (`create_*.py`) sequentially. Each script:
     - Drops the target table (e.g., `events`, `mentions`, `cameo_country`) if it exists.
     - Recreates the table based on schema definitions (from `Events.csv`, `eventMentions.tsv`, or hardcoded in the script).
     - Populates lookup tables (like `cameo_country`, `fips_country`, `cameo_event_code`, etc.) with data from `codes/` directory files or hardcoded lists.
   - After all table creation scripts complete successfully, it sends a `database_prepared` message to the `database_status` Kafka topic using `kafka-python` (see the sketch below).
   - The `create_tables` container then exits (as configured with `restart: "no"`).
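As an illustration of what one of those `create_*.py` scripts does, here is a minimal sketch: drop and recreate a lookup table, load it from a file in `codes/`, and (in the orchestrating script) announce completion on Kafka. The lookup file's column layout and the connection settings are assumptions.

```python
import csv

import psycopg2
from kafka import KafkaProducer

conn = psycopg2.connect(host="postgres", dbname="gdelt", user="gdelt", password="secret")  # assumed

with conn, conn.cursor() as cur:
    # Drop-and-recreate keeps each run idempotent.
    cur.execute("DROP TABLE IF EXISTS cameo_country;")
    cur.execute("CREATE TABLE cameo_country (code TEXT PRIMARY KEY, label TEXT);")

    # Assumed two-column, tab-separated layout: CAMEO code, country name.
    with open("codes/cameo_country.txt", newline="") as fh:
        for code, label in csv.reader(fh, delimiter="\t"):
            cur.execute(
                "INSERT INTO cameo_country (code, label) VALUES (%s, %s);", (code, label)
            )

conn.close()

# Orchestrator: signal the rest of the pipeline that the schema is ready.
producer = KafkaProducer(bootstrap_servers="kafka:9092", value_serializer=str.encode)
producer.send("database_status", "database_prepared")
producer.flush()
```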
Cycle 1: First Data Ingestion and Summarization (Minutes 0-15)
4. Wait for Preparation (`gdelt_connect`): The `gdelt_connect` service, having started earlier, listens to the `database_status` Kafka topic. It blocks until it receives the `database_prepared` message.
5. First Data Fetch (`gdelt_connect`):
   - Immediately after receiving `database_prepared`, `gdelt_connect` fetches the `lastupdate.txt` and `lastupdate-translation.txt` files from GDELT.
   - It downloads the relevant `.export.CSV.zip` and `.mentions.CSV.zip` files listed in those text files.
   - It extracts the TSV data from the downloaded ZIP archives in memory.
6. First Data Load (`gdelt_connect`):
   - Connects to `postgres`.
   - Deletes all existing rows from the `events`, `mentions`, `events_translated`, and `mentions_translated` tables.
   - Parses the extracted TSV data, validates rows (padding if necessary), and inserts the data into the corresponding tables in batches.
   - After successfully loading all data, it sends a `database populated` message to the `database_status` Kafka topic.
7. First Summarization Trigger (`summerizer`): The `summerizer` service listens to the `database_status` topic. Upon receiving the `database populated` message:
   - It queries `postgres` for mentions matching the criteria in `config.py` (e.g., containing ‘Trump’ or ‘DOGE’).
   - It filters these mentions based on the `BLOCKED_DOMAINS` list.
   - It sends the URLs of the filtered mentions concurrently to the `crawler` service.
8. Crawling (`crawler`): For each URL received:
   - It uses `postlight-parser` to extract the main content and title.
   - It uses `libretranslate` to detect the language of the content.
   - If the language is not English, it uses `libretranslate` to translate the title to English.
   - It returns the original content, original title, translated title, and detected language to the `summerizer`.
9. LLM Processing (`summerizer`):
   - Selection: Gathers titles (translated if necessary) from successful crawls and sends them to the OpenRouter LLM using `CRAWLER_SELECTION_PROMPT` to get a list of ~15 relevant article indices.
   - Translation & Summarization: For each selected article:
     - If the content is not English, it calls `libretranslate` to translate the content.
     - It sends the (potentially translated) content to the OpenRouter LLM using `SUMMARY_PROMPT` to get a JSON object containing relevance, sentiment, a summary, and a quote.
   - Article Generation: Aggregates the summaries and quotes from relevant articles and sends them to the OpenRouter LLM (potentially a different model specified by `OPENROUTER_ARTICLE_MODEL`) using `ARTICLE_PROMPT`. The LLM returns a JSON object containing the final article in Markdown format.
10. Article Archival & Writing (`summerizer`):
    - Calls `archive_article`. Since `content/article.md` is likely empty or non-existent on the first run, this step might effectively do nothing or archive an empty file into `content/ancients.md`.
    - Calls `write_article` to save the newly generated Markdown article to `content/article.md`.
11. First Update Notification (`summerizer` & `ws`):
    - `summerizer` sends an `article_update` message to the `article_update` Kafka topic.
    - The `ws` (WebSocket) service, listening to this topic, receives the message and broadcasts it to all connected frontend clients.
12. Frontend Update (Browser):
    - The React app (`App.tsx`) running in the user’s browser receives the WebSocket message.
    - It triggers a re-fetch of `content/article.md`.
    - The updated content is parsed with `marked` and displayed.
13. Wait Interval (`gdelt_connect`): The `gdelt_connect` service calculates the time elapsed since the start of its cycle and sleeps for the remainder of the 15-minute interval (see the fetch-and-sleep sketch after this list).
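Steps 5, 6, and 13 together amount to a simple fetch → load → announce → sleep loop. A minimal sketch of that loop is below; the `load_into_postgres` helper is a hypothetical placeholder for the batch-insert logic, and the sleep simply pads each cycle out to 15 minutes.

```python
import io
import time
import zipfile

import requests
from kafka import KafkaProducer

CYCLE_SECONDS = 15 * 60
LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

producer = KafkaProducer(bootstrap_servers="kafka:9092", value_serializer=str.encode)

def fetch_latest_tsvs() -> dict:
    """Download the archives listed in lastupdate.txt and return their TSV payloads."""
    listing = requests.get(LASTUPDATE_URL, timeout=30).text
    tsvs = {}
    for line in listing.splitlines():
        url = line.split()[-1]
        if url.endswith((".export.CSV.zip", ".mentions.CSV.zip")):
            archive = requests.get(url, timeout=120).content
            with zipfile.ZipFile(io.BytesIO(archive)) as zf:
                name = zf.namelist()[0]
                tsvs[name] = zf.read(name)  # keep everything in memory
    return tsvs

while True:
    cycle_start = time.monotonic()

    tsvs = fetch_latest_tsvs()
    load_into_postgres(tsvs)  # hypothetical: full table refresh + batch inserts

    producer.send("database_status", "database populated")
    producer.flush()

    # Sleep for whatever is left of the 15-minute window.
    elapsed = time.monotonic() - cycle_start
    time.sleep(max(0.0, CYCLE_SECONDS - elapsed))
```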
Cycle 2: Second Data Ingestion and Summarization (Minutes 15-30)
14. Second Data Fetch (`gdelt_connect`): After the sleep interval, `gdelt_connect` wakes up and repeats step 5, fetching the latest GDELT update files.
15. Second Data Load (`gdelt_connect`): Repeats step 6: deletes all data from the `events`/`mentions` tables and loads the newly fetched data. Sends `database populated` to Kafka.
16. Second Summarization Trigger (`summerizer`): Repeats step 7: receives `database populated` and starts the summarization process based on the new data in the database.
17. Crawling (`crawler`): Repeats step 8 for the URLs identified in the second cycle.
18. LLM Processing (`summerizer`): Repeats step 9, potentially selecting, summarizing, and generating a different article based on the new set of mentions and crawled content.
19. Article Archival & Writing (`summerizer`):
    - Calls `archive_article`. This time, it reads the content of the first generated article from `content/article.md` and prepends it (wrapped in `<details>` tags with a timestamp) to `content/ancients.md` (see the sketch after this list).
    - Calls `write_article` to save the second generated Markdown article to `content/article.md`, overwriting the first one.
20. Second Update Notification (`summerizer` & `ws`): Repeats step 11: `summerizer` sends `article_update`, `ws` broadcasts it.
21. Frontend Update (Browser): Repeats step 12: the React app receives the message, fetches the second article from `content/article.md`, and updates the display. It also re-fetches `content/ancients.md` to update the “Previous Entries” section.
22. Wait Interval (`gdelt_connect`): Repeats step 13, waiting for the next 15-minute cycle.
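The archive-then-overwrite behavior in step 19 can be sketched roughly as follows. The function names mirror the ones mentioned above, but the file-handling details (timestamp format, `<details>` summary text) are assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path

ARTICLE = Path("content/article.md")
ANCIENTS = Path("content/ancients.md")

def archive_article() -> None:
    """Prepend the current article, wrapped in a collapsible block, to the archive."""
    if not ARTICLE.exists() or not ARTICLE.read_text().strip():
        return  # nothing to archive on the very first cycle

    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    wrapped = (
        f"<details>\n<summary>{stamp}</summary>\n\n{ARTICLE.read_text()}\n</details>\n\n"
    )
    previous = ANCIENTS.read_text() if ANCIENTS.exists() else ""
    ANCIENTS.write_text(wrapped + previous)  # newest entries first

def write_article(markdown: str) -> None:
    """Overwrite the live article with the newly generated one."""
    ARTICLE.write_text(markdown)
```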
This cycle of data ingestion, processing, summarization, and notification continues every 15 minutes.
Tech Stack
- Backend:
- Python 3.9 / 3.10 / 3.11 (FastAPI, psycopg2-binary, pandas, requests, kafka-python, openai, python-dotenv, concurrent.futures)
- Node.js 18 (ws, kafkajs, @postlight/parser)
- Frontend:
- React 18
- TypeScript
- Vite
- Tailwind CSS (with JIT, Typography plugin)
- Marked.js (Markdown rendering)
- Database: PostgreSQL (latest)
- Messaging: Kafka (confluentinc/cp-kafka:7.6.0 - KRaft mode)
- Translation: LibreTranslate
- Crawling: Postlight Parser, Requests, Readability-lxml
- LLM Integration: OpenRouter API
- Data Visualization: Grafana (latest)
- Web Server / Proxy: Nginx (alpine)
- Containerization: Docker, Docker Compose
- Tunneling: Cloudflare Tunnel (cloudflare/cloudflared:latest)
- OS (Containers): Linux (Debian Slim, Alpine)