TLDR: I built a fully automated system using LLMs to generate news articles every 15 minutes from real-time GDELT data (a global event database), showcasing the current capabilities and prevalence of AI content generation.

NOTE: I took the site down and stopped the service as it was costing too much in inference fees and electricity usage each month.

Below you can find the nerdy technical details of my creation, or you can view the code directly on my GitHub.

If you would like me to build something similar for you or your organization, you can contact me here.

The Age of AI-Generated Content

These days it seems like the internet is full of questionable content and trash writing. Though large language models are getting smarter and more capable every day, their writing skills are still lacking, or at the very least have an ‘air’ of artificiality. On a day-to-day basis we are consistently influenced by artificially authored text without realizing it; the internet is becoming more and more full of it with each passing moment.

Whether you are being influenced to garner clicks under some sort of SEO business model, or influenced as the result of some sort of political campaign, LLM propaganda is now EVERYWHERE. It is essentially unavoidable at this point, and it isn’t just shadowy sites under obscure domains with redacted WHOIS records; LLM content is mainstream.

Mainstream Adoption

BuzzFeed was an early adopter and openly admits to using LLM-generated content; their admission reportedly caused a 150% jump in their stock price (Source: CNN Archive). CNET was caught red-handed publishing LLM-generated articles attributed to “CNET Money Staff” (Source: PBS NewsHour Archive). You can bet your bottom dollar that many, many more large and mainstream outlets are serving you LLM-generated content without telling you, and it is likely that just as many are serving you LLM content and lying about it.

Project Motivation and Goal

It is 2025, we are in the new age of internet influence, and journalism is officially dead (article forthcoming). With this in mind, I set out to create my own automated propaganda site. My goal with this project was a system that is fully automated and completely hands-off after deployment. As of now, it generates a new LLM-written article every fifteen minutes, around the clock, based on content from the previous 15 minutes. The site has been live for a little over two months, with around 5,760 articles up and ready for viewing.

Data Source: GDELT

For source material the site uses the Global Database of Events, Language, and Tone (GDELT), a massive open database that continuously ingests print, broadcast, and web news media from around the globe in over 100 languages. My site uses the GDELT 2.0 Events and Mentions dataset, which is consistently updated and offers its most recent content via data dumps every 15 minutes. GDELT purports to analyze all of this content en masse, offering extracted details such as the actors involved (countries, political entities, et cetera), event types (of which there are dozens), geolocation data, a few types of sentiment analysis, and source data.
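
To make the ingestion step concrete, here is a minimal sketch of pulling the most recent 15-minute dump. It relies only on the public lastupdate.txt listing (one file per line: size, MD5 hash, URL); the helper itself is illustrative rather than the project's exact code:

```python
# Minimal sketch: download and unpack the latest GDELT v2 15-minute dumps.
# lastupdate.txt lists one file per line as "<bytes> <md5> <url>"; the
# translated-source listing (lastupdate-translation.txt) works the same way.
import csv
import io
import zipfile

import requests

LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

def fetch_latest_dumps() -> dict:
    """Return {filename: rows} for the most recent export and mentions dumps."""
    listing = requests.get(LASTUPDATE_URL, timeout=30).text
    tables = {}
    for line in listing.strip().splitlines():
        _size, _md5, url = line.split()
        if ".export.CSV.zip" not in url and ".mentions.CSV.zip" not in url:
            continue  # skip the GKG dump
        payload = requests.get(url, timeout=60).content
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            name = archive.namelist()[0]
            with archive.open(name) as fh:
                rows = list(csv.reader(io.TextIOWrapper(fh, "utf-8"), delimiter="\t"))
        tables[name] = rows
    return tables

if __name__ == "__main__":
    for name, rows in fetch_latest_dumps().items():
        print(name, len(rows), "rows")
```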

GDELT Limitations and Utility

The unfortunate reality is that GDELT’s sentiment analysis and event classification are WOEFULLY inaccurate, verging on a joke, but it is still a wildly useful and rich resource if put to use properly. A lot of interesting stuff can be done with a couple hundred thousand neatly labeled URLs every day, and it is these URLs that my system ingests and processes.

System Workflow Example (TLDR)

Here is the TLDR. In the example below, I search through the 15-minute data dumps for information related to the new Trump administration, sort and filter to eliminate all sources written in English, scrape the resulting pages, extract and translate their titles, and send the titles to an LLM, which selects 15 of the pages most likely to portray Trump in a negative light. These 15 articles are then locally machine-translated into English, and all 15 translations are passed back to an LLM for summarization. Finally, the translated articles, their summaries, titles, and source info are passed to an LLM yet again to generate an article. The newly generated article is placed on the main page alongside a series of Grafana visualizations, and the old article is moved to the archive. This cycle runs 24/7, posting an article every 15 minutes.

In total, this system costs me on average about 1 USD per day to operate, processing anywhere from 3 to 8 million tokens every 24 hours. The writing quality is quite low, owing to the use of a budget-friendly LLM (google/gemini-2.0-flash-001).

The Nerdy Details

Overview

  • Ingest: Automatically fetch GDELT v2 event and mention data (both English and translated sources) every 15 minutes.
  • Store: Process and store the ingested data in a PostgreSQL database, including lookup tables for CAMEO codes and FIPS country codes.
  • Analyze:
    • Query the database for specific event mentions (e.g., related to ‘Trump’).
    • Crawl the source URLs of selected mentions to retrieve full article content.
    • Utilize Large Language Models (LLMs) via OpenRouter for:
      • Selecting relevant articles based on titles.
      • Summarizing content and assessing sentiment towards the U.S.
      • Generating a consolidated article based on the summaries.
    • Translate non-English content using LibreTranslate.
  • Visualize: Display key metrics and data trends using Grafana dashboards.
  • Present: Serve a web frontend (React/Vite/Tailwind) showing the latest generated article, archived articles, and embedded Grafana panels.
  • Notify: Update the frontend in real-time when a new article is generated using WebSockets.
  • Deploy: Orchestrate all components using Docker Compose, with public access secured via Cloudflare Tunnel.

This project demonstrates the construction of a complex data processing pipeline focused on analyzing global news sentiment related to the United States, using the GDELT dataset as its primary source. The goal was to create a system that automatically ingests, processes, analyzes, and presents insights from near real-time global event data.

Architecture and Design:

The system is built using a microservice architecture, containerized with Docker and orchestrated using Docker Compose. This approach was chosen for modularity and scalability. Key services include data ingestion, database management, web crawling, language translation, LLM-based analysis and summarization, real-time updates via WebSockets, and frontend presentation with embedded data visualizations.

PostgreSQL serves as the central data store, holding both the raw GDELT event/mention data and various lookup tables for data enrichment (CAMEO codes, FIPS codes). Kafka acts as the messaging backbone, decoupling services and enabling asynchronous workflows. For instance, the data ingestion service publishes a message upon completion, triggering the summarization service, which in turn publishes an update message consumed by the WebSocket server to notify the frontend.
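
A rough sketch of that hand-off using kafka-python, with the topic and message names taken from this write-up (the broker address and the exact payload shape are assumptions):

```python
# Rough sketch of the Kafka hand-off between services, using kafka-python.
# Topic and status names follow the write-up; broker address and payload
# structure are assumptions for illustration.
import json
from kafka import KafkaConsumer, KafkaProducer

BROKER = "kafka:9092"  # assumed docker-compose service name

# Producer side (e.g., gdelt_connect after a successful load)
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("database_status", {"status": "database populated"})
producer.flush()

# Consumer side (e.g., summerizer waiting for that signal)
consumer = KafkaConsumer(
    "database_status",
    bootstrap_servers=BROKER,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)
for message in consumer:
    if message.value.get("status") == "database populated":
        print("database refreshed, start the summarization pass")
        # run_summarization_cycle()  # hypothetical entry point
```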

Data Flow and Processing:

The core pipeline begins with the gdelt_connect service fetching GDELT data every 15 minutes. A significant design choice was to completely refresh the events and mentions tables during each cycle rather than attempting incremental updates, simplifying state management given the nature of GDELT’s update mechanism.
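
In practice that full-refresh choice boils down to a delete-then-batch-insert pass each cycle. A minimal sketch using psycopg2's execute_values for the batching; the connection details and column handling are simplified assumptions rather than the project's actual schema:

```python
# Illustrative shape of the full-refresh load: wipe the event/mention tables,
# then batch-insert the freshly parsed TSV rows. Column lists are omitted and
# connection details are assumptions, not the project's actual schema.
import psycopg2
from psycopg2.extras import execute_values

def refresh_tables(conn, events_rows, mentions_rows):
    with conn, conn.cursor() as cur:
        # Full refresh: clear the previous 15-minute cycle before loading the new dump.
        cur.execute("DELETE FROM events")
        cur.execute("DELETE FROM mentions")
        # Batched inserts keep round-trips low for a few hundred thousand rows.
        execute_values(cur, "INSERT INTO events VALUES %s", events_rows, page_size=1000)
        execute_values(cur, "INSERT INTO mentions VALUES %s", mentions_rows, page_size=1000)

# Usage (assumed credentials):
# conn = psycopg2.connect(host="postgres", dbname="gdelt", user="gdelt", password="secret")
# refresh_tables(conn, parsed_events, parsed_mentions)
```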

Once new data is loaded, the summerizer service initiates a multi-stage analysis process. It queries the database for relevant mentions, uses a dedicated crawler service (leveraging Postlight Parser for robust content extraction) to fetch source articles, and then employs LLMs for several tasks: filtering articles based on titles, summarizing relevant content, assessing sentiment, and finally, generating a cohesive narrative article. The use of concurrent processing (ThreadPoolExecutor) helps manage the potentially large number of crawling and LLM tasks efficiently.
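
The crawl fan-out might look roughly like this, with ThreadPoolExecutor bounding the number of in-flight requests; the crawler service's endpoint URL and response shape are assumptions for illustration:

```python
# Sketch of fanning crawl requests out concurrently with ThreadPoolExecutor.
# The crawler endpoint URL and response shape are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Optional

import requests

CRAWLER_URL = "http://crawler:3000/parse"  # assumed internal endpoint

def crawl(url: str) -> Optional[dict]:
    try:
        resp = requests.post(CRAWLER_URL, json={"url": url}, timeout=60)
        resp.raise_for_status()
        return resp.json()  # expected fields: content, title, translated title, language
    except requests.RequestException:
        return None  # crawling is fragile; failed URLs are simply skipped

def crawl_all(urls: list, workers: int = 10) -> list:
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(crawl, u) for u in urls]
        for future in as_completed(futures):
            result = future.result()
            if result is not None:
                results.append(result)
    return results
```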

Language barriers are addressed by integrating the libretranslate service for both language detection and translation, ensuring content can be processed and analyzed consistently in English.
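
These calls follow LibreTranslate's standard /detect and /translate HTTP endpoints. A minimal sketch, assuming the docker-compose service name for the base URL:

```python
# Minimal sketch of the LibreTranslate calls used for detection and translation.
# The base URL assumes the docker-compose service name; the request/response
# shapes follow LibreTranslate's standard /detect and /translate endpoints.
import requests

LT_URL = "http://libretranslate:5000"  # assumed internal address

def detect_language(text: str) -> str:
    resp = requests.post(f"{LT_URL}/detect", json={"q": text[:1000]}, timeout=30)
    resp.raise_for_status()
    return resp.json()[0]["language"]  # best guess, e.g. "de"

def translate_to_english(text: str, source_lang: str) -> str:
    resp = requests.post(
        f"{LT_URL}/translate",
        json={"q": text, "source": source_lang, "target": "en", "format": "text"},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["translatedText"]
```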

Frontend and Visualization:

The frontend is a React application built with Vite and styled using Tailwind CSS. It presents the latest AI-generated article, provides access to archived articles, and embeds Grafana panels for data visualization. Real-time updates are achieved through a WebSocket connection managed by a dedicated Node.js service (ws), ensuring the displayed article reflects the latest analysis cycle without requiring manual refreshes. Grafana connects directly to the PostgreSQL database, allowing for flexible and dynamic visualization of the underlying event data.

Deployment and Access:

The entire system is containerized, simplifying deployment and dependency management. Public access to the web interface is provided securely using Cloudflare Tunnel, avoiding the need to expose ports directly on the host machine.

Challenges and Considerations:

  • Data Volume and Schema: Handling the large volume and complex schema of GDELT data required careful planning for database operations, including batch inserts and schema definition management.
  • Web Crawling Robustness: Web crawling is inherently fragile. Using Postlight Parser improves extraction success, but handling failures and varying website structures remains a challenge.
  • LLM Integration: Integrating multiple LLM calls requires careful prompt engineering and handling of potential API errors or unexpected response formats. The use of JSON schema enforcement in LLM calls helps mitigate some of these issues (a sketch follows this list).
  • Resource Management: Running multiple services, including databases, messaging systems, LLMs (via API), and translation services, requires adequate system resources (CPU, RAM, network bandwidth).
  • Cost: Relying on external APIs like OpenRouter incurs costs based on usage.
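
As flagged in the LLM Integration point above, here is a hedged sketch of one such structured call: the OpenAI Python client pointed at OpenRouter, with a JSON schema constraining the response. The schema fields mirror the summary output described earlier (relevance, sentiment, summary, quote), but the prompt text and schema wording are simplified placeholders, not the project's real prompts:

```python
# Hedged sketch of one summarization call via OpenRouter with JSON schema
# enforcement. The schema and prompt here are illustrative placeholders.
import json
import os
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.environ["OPENROUTER_API_KEY"])

SUMMARY_SCHEMA = {
    "name": "article_summary",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "relevant": {"type": "boolean"},
            "sentiment": {"type": "string"},
            "summary": {"type": "string"},
            "quote": {"type": "string"},
        },
        "required": ["relevant", "sentiment", "summary", "quote"],
        "additionalProperties": False,
    },
}

def summarize(article_text: str) -> dict:
    response = client.chat.completions.create(
        model="google/gemini-2.0-flash-001",
        messages=[
            {"role": "system", "content": "Summarize the article and assess its sentiment toward the U.S."},
            {"role": "user", "content": article_text},
        ],
        response_format={"type": "json_schema", "json_schema": SUMMARY_SCHEMA},
    )
    return json.loads(response.choices[0].message.content)
```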

This project serves as a practical example of integrating various technologies – data ingestion, databases, messaging queues, web crawling, AI/LLM processing, real-time communication, and data visualization – to build a sophisticated, automated information analysis system.

Temporal Workflow

This outlines the sequence of events when the system starts up for the first time and runs through its first two 15-minute cycles.

Initial Startup:

  1. Docker Compose Up: The docker-compose up command initiates the startup of all defined services: postgres, kafka, grafana, web, cloudflared, libretranslate, crawler, summerizer, ws, gdelt_connect, and create_tables.
  2. Service Initialization:
    • postgres starts and initializes the database based on environment variables.
    • kafka starts the broker in KRaft mode.
    • grafana starts, connects to postgres for its own configuration storage, and provisions datasources/dashboards.
    • libretranslate starts and potentially downloads language models.
    • crawler, summerizer, ws, web, and cloudflared start their respective servers/processes.
    • create_tables waits for postgres and kafka to be healthy.
    • gdelt_connect waits for postgres and kafka to be healthy.

Cycle 0: Database Preparation

  1. Table Creation (create_tables):
    • Once postgres and kafka are ready, the create_tables service runs its main script (create_tables.py).
    • It connects to the postgres database.
    • It executes a series of Python scripts (create_*.py) sequentially. Each script:
      • Drops the target table (e.g., events, mentions, cameo_country) if it exists.
      • Recreates the table based on schema definitions (from Events.csv, eventMentions.tsv, or hardcoded in the script).
      • Populates lookup tables (like cameo_country, fips_country, cameo_event_code, etc.) with data from codes/ directory files or hardcoded lists.
    • After all table creation scripts complete successfully, it sends a database_prepared message to the database_status Kafka topic using kafka-python.
    • The create_tables container then exits (as configured with restart: "no").
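
A condensed illustration of that Cycle 0 pattern for a single lookup table; the column layout and file name are assumptions, and the real create_*.py scripts cover many more tables:

```python
# Condensed sketch of the Cycle 0 pattern: drop, recreate, and repopulate one
# lookup table. The cameo_country columns and the codes/ file name are
# simplified assumptions.
import psycopg2

conn = psycopg2.connect(host="postgres", dbname="gdelt", user="gdelt", password="secret")  # assumed credentials
with conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS cameo_country")
    cur.execute("CREATE TABLE cameo_country (code TEXT PRIMARY KEY, label TEXT)")
    with open("codes/cameo_country.txt", encoding="utf-8") as fh:  # hypothetical file name
        for line in fh:
            code, label = line.rstrip("\n").split("\t", 1)
            cur.execute(
                "INSERT INTO cameo_country (code, label) VALUES (%s, %s)",
                (code, label),
            )
# After every table script succeeds, create_tables publishes "database_prepared"
# to the database_status topic (see the Kafka sketch earlier).
```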

Cycle 1: First Data Ingestion and Summarization (Minutes 0-15)

  1. Wait for Preparation (gdelt_connect): The gdelt_connect service, having started earlier, listens to the database_status Kafka topic. It blocks until it receives the database_prepared message.
  2. First Data Fetch (gdelt_connect):
    • Immediately after receiving database_prepared, gdelt_connect fetches the lastupdate.txt and lastupdate-translation.txt files from GDELT.
    • It downloads the relevant .export.CSV.zip and .mentions.CSV.zip files listed in those text files.
    • It extracts the TSV data from the downloaded ZIP archives in memory.
  3. First Data Load (gdelt_connect):
    • Connects to postgres.
    • Deletes all existing rows from events, mentions, events_translated, and mentions_translated tables.
    • Parses the extracted TSV data, validates rows (padding if necessary), and inserts the data into the corresponding tables in batches.
    • After successfully loading all data, it sends a database populated message to the database_status Kafka topic.
  4. First Summarization Trigger (summerizer): The summerizer service listens to the database_status topic. Upon receiving the database populated message:
    • It queries postgres for mentions matching the criteria in config.py (e.g., containing ‘Trump’ or ‘DOGE’).
    • It filters these mentions based on the BLOCKED_DOMAINS list.
    • It sends the URLs of the filtered mentions concurrently to the crawler service.
  5. Crawling (crawler): For each URL received:
    • It uses postlight-parser to extract the main content and title.
    • It uses libretranslate to detect the language of the content.
    • If the language is not English, it uses libretranslate to translate the title to English.
    • It returns the original content, original title, translated title, and detected language to the summerizer.
  6. LLM Processing (summerizer):
    • Selection: Gathers titles (translated if necessary) from successful crawls and sends them to the OpenRouter LLM using CRAWLER_SELECTION_PROMPT to get a list of ~15 relevant article indices.
    • Translation & Summarization: For each selected article:
      • If the content is not English, it calls libretranslate to translate the content.
      • It sends the (potentially translated) content to the OpenRouter LLM using SUMMARY_PROMPT to get a JSON object containing relevance, sentiment, a summary, and a quote.
    • Article Generation: Aggregates the summaries and quotes from relevant articles and sends them to the OpenRouter LLM (potentially a different model specified by OPENROUTER_ARTICLE_MODEL) using ARTICLE_PROMPT.
    • The LLM returns a JSON object containing the final article in Markdown format.
  7. Article Archival & Writing (summerizer):
    • Calls archive_article. Since content/article.md is likely empty or non-existent on the first run, this step might effectively do nothing or archive an empty file into content/ancients.md.
    • Calls write_article to save the newly generated Markdown article to content/article.md.
  8. First Update Notification (summerizer & ws):
    • summerizer sends an article_update message to the article_update Kafka topic.
    • The ws (WebSocket) service, listening to this topic, receives the message and broadcasts it to all connected frontend clients.
  9. Frontend Update (Browser):
    • The React app (App.tsx) running in the user’s browser receives the WebSocket message.
    • It triggers a re-fetch of content/article.md.
    • The updated content is parsed with marked and displayed.
  10. Wait Interval (gdelt_connect): The gdelt_connect service calculates the time elapsed since the start of its cycle and sleeps for the remainder of the 15-minute interval.
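
The wait interval in step 10 is just an elapsed-time calculation. A minimal sketch, with the actual fetch/load/notify work hidden behind a hypothetical run_ingestion_cycle placeholder:

```python
# Sketch of the fixed 15-minute cadence from step 10: do the cycle's work, then
# sleep for whatever is left of the interval.
import time

INTERVAL_SECONDS = 15 * 60

def run_ingestion_cycle():
    """Placeholder for: fetch GDELT dumps, reload tables, publish to Kafka."""
    pass

def run_forever():
    while True:
        cycle_start = time.monotonic()
        run_ingestion_cycle()
        elapsed = time.monotonic() - cycle_start
        # Sleep only for the remainder, so each cycle starts on a 15-minute boundary.
        time.sleep(max(0, INTERVAL_SECONDS - elapsed))
```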

Cycle 2: Second Data Ingestion and Summarization (Minutes 15-30)

  1. Second Data Fetch (gdelt_connect): After the sleep interval, gdelt_connect wakes up and repeats Cycle 1 step 2, fetching the latest GDELT update files.
  2. Second Data Load (gdelt_connect): Repeats Cycle 1 step 3: deletes all data from the events/mentions tables and loads the newly fetched data. Sends database populated to Kafka.
  3. Second Summarization Trigger (summerizer): Repeats Cycle 1 step 4: receives database populated and starts the summarization process based on the new data in the database.
  4. Crawling (crawler): Repeats Cycle 1 step 5 for the URLs identified in the second cycle.
  5. LLM Processing (summerizer): Repeats Cycle 1 step 6, potentially selecting, summarizing, and generating a different article based on the new set of mentions and crawled content.
  6. Article Archival & Writing (summerizer):
    • Calls archive_article. This time, it reads the content of the first generated article from content/article.md and prepends this content (wrapped in <details> tags with a timestamp) to content/ancients.md (a sketch of this step follows the list).
    • Calls write_article to save the second generated Markdown article to content/article.md, overwriting the first one.
  7. Second Update Notification (summerizer & ws): Repeats Cycle 1 step 8: summerizer sends article_update, ws broadcasts it.
  8. Frontend Update (Browser): Repeats Cycle 1 step 9: the React app receives the message, fetches the second article from content/article.md, and updates the display. It also re-fetches content/ancients.md to update the “Previous Entries” section.
  9. Wait Interval (gdelt_connect): Repeats Cycle 1 step 10, waiting for the next 15-minute cycle.
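
As noted in step 6, the archival step wraps the outgoing article and prepends it to the archive file. An approximate shape of archive_article and write_article, where the exact <details> markup and timestamp format are assumptions:

```python
# Approximate shape of the archival step: wrap the outgoing article in a
# <details> block with a timestamp and prepend it to the archive file. File
# paths come from the write-up; the wrapper markup is an assumption.
from datetime import datetime, timezone
from pathlib import Path

ARTICLE_PATH = Path("content/article.md")
ARCHIVE_PATH = Path("content/ancients.md")

def archive_article():
    if not ARTICLE_PATH.exists():
        return  # first cycle: nothing to archive yet
    old = ARTICLE_PATH.read_text(encoding="utf-8").strip()
    if not old:
        return
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"<details>\n<summary>{stamp}</summary>\n\n{old}\n\n</details>\n\n"
    existing = ARCHIVE_PATH.read_text(encoding="utf-8") if ARCHIVE_PATH.exists() else ""
    ARCHIVE_PATH.write_text(entry + existing, encoding="utf-8")

def write_article(markdown: str):
    ARTICLE_PATH.write_text(markdown, encoding="utf-8")
```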

This cycle of data ingestion, processing, summarization, and notification continues every 15 minutes.

Tech Stack

  • Backend:
    • Python 3.9 / 3.10 / 3.11 (FastAPI, psycopg2-binary, pandas, requests, kafka-python, openai, python-dotenv, concurrent.futures)
    • Node.js 18 (ws, kafkajs, @postlight/parser)
  • Frontend:
    • React 18
    • TypeScript
    • Vite
    • Tailwind CSS (with JIT, Typography plugin)
    • Marked.js (Markdown rendering)
  • Database: PostgreSQL (latest)
  • Messaging: Kafka (confluentinc/cp-kafka:7.6.0 - KRaft mode)
  • Translation: LibreTranslate
  • Crawling: Postlight Parser, Requests, Readability-lxml
  • LLM Integration: OpenRouter API
  • Data Visualization: Grafana (latest)
  • Web Server / Proxy: Nginx (alpine)
  • Containerization: Docker, Docker Compose
  • Tunneling: Cloudflare Tunnel (cloudflare/cloudflared:latest)
  • OS (Containers): Linux (Debian Slim, Alpine)