Astro-News-Bot: Building an AI-Driven Automated News Aggregation and Publishing System
During the initial architecture design, I followed several core principles:
- Modularity: Each function (crawling, deduplication, AI processing, publishing) should be an independent module, easy to maintain and replace
- Automation: The entire process requires no manual intervention, achieving "set once, run forever"
- Idempotency: Repeated task execution produces no side effects
- Scalability: Easy to add new news sources or processing steps
Technical Architecture
Based on these principles, I designed a linear data processing pipeline:
RSS Sources β Fetcher β Vector Dedup β AI Summary β Markdown Generation β Git Publishing β Blog Deployment
Core module structure:
astro-news-bot/
βββ news_bot/
β βββ fetcher.py # News fetching
β βββ dedup.py # Vector deduplication
β βββ summarizer.py # AI summarization
β βββ writer.py # Markdown generation
β βββ selector.py # News filtering
β βββ publisher.py # Git publishing
β βββ job.py # Workflow orchestration
βββ config.json # Configuration file
βββ requirements.txt # Dependencies
βββ run_daily_news.sh # Execution script
Data Flow Design
- News Fetching β Multi-source crawling β
raw_{date}.json - Vector Deduplication β Semantic similarity filtering β
dedup_{date}.json - AI Summarization β GPT-4o generates Chinese summaries β
summary_{date}.json - Markdown Generation β Organized by category β
news_{date}.md - Git Publishing β Push to blog repository β Trigger automatic deployment
Core Technical Implementation
1. Vector Deduplication: Beyond Simple Title Matching
In the early stages, I used article titles or URLs for deduplication, but quickly discovered issues:
- Different news sources use different titles for the same event
- URLs may differ due to tracking parameters
The solution is semantic-based vector deduplication:
# dedup.py core implementation
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
class NewsDeduplicator:
def __init__(self, similarity_threshold=0.85):
self.model = SentenceTransformer('all-MiniLM-L6-v2')
self.threshold = similarity_threshold
def deduplicate(self, articles):
if not articles:
return []
# Extract title text
titles = [article['title'] for article in articles]
# Generate vector embeddings
embeddings = self.model.encode(titles)
# Calculate similarity matrix
similarity_matrix = cosine_similarity(embeddings)
# Deduplication logic
to_keep = []
for i, article in enumerate(articles):
is_duplicate = False
for j in to_keep:
if similarity_matrix[i][j] > self.threshold:
is_duplicate = True
break
if not is_duplicate:
to_keep.append(i)
return [articles[i] for i in to_keep]
This method effectively identifies articles that "say the same thing in different ways," far more accurate than keyword matching.
2. AI Summarization and Classification: The Art of Prompt Engineering
The quality of AI summarization and classification directly determines the final output value. The key isn't which LLM to choose, but how to design Prompts:
# summarizer.py core Prompt
SUMMARY_PROMPT = """
You are a professional tech news editor specializing in organizing news summaries for developers.
Please generate for the following news:
1. A Chinese summary of no more than 100 words, summarizing core information
2. Select the most appropriate category from: Artificial Intelligence, Mobile Technology, Autonomous Driving, Cloud Computing, Chip Technology, Venture Capital, Cybersecurity, Blockchain, Scientific Research, Other Tech
News content:
Title: {title}
Description: {description}
Source: {source}
Please return in JSON format:
{{
"summary": "Summary content",
"category": "Category name",
"tags": ["Tag1", "Tag2", "Tag3"]
}}
"""
class NewsSummarizer:
def __init__(self):
self.client = OpenAI()
def process_article(self, article):
prompt = SUMMARY_PROMPT.format(**article)
response = self.client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
max_tokens=500,
temperature=0.3
)
return json.loads(response.choices[0].message.content)
Through clear roles, detailed instructions, and output format requirements, we achieve stable, high-quality AI output.
3. GitOps Publishing: Reliable Automated Deployment
Why choose Git instead of API operations on the blog backend?
- Atomicity and Traceability: Every content update is a Git commit, providing clear change records
- Decoupling and Security: The bot only needs Git repository write permissions, no need to expose blog backend credentials
- Leverage Existing CI/CD: Reuse Git-triggered CI/CD from platforms like Vercel/Netlify
# publisher.py core implementation
import subprocess
import os
class NewsPublisher:
def __init__(self, blog_repo_path):
self.repo_path = blog_repo_path
def publish(self, commit_message):
try:
# Switch to blog directory
os.chdir(self.repo_path)
# Pull latest code
subprocess.run(['git', 'pull'], check=True)
# Add new files
subprocess.run(['git', 'add', '.'], check=True)
# Check for changes
result = subprocess.run(['git', 'diff', '--cached', '--exit-code'],
capture_output=True)
if result.returncode == 0:
print("No changes to commit")
return
# Commit and push
subprocess.run(['git', 'commit', '-m', commit_message], check=True)
subprocess.run(['git', 'push'], check=True)
print(f"Successfully published: {commit_message}")
except subprocess.CalledProcessError as e:
print(f"Git operation failed: {e}")
Astro Blog Integration Modifications
To achieve seamless integration between astro-news-bot and the Astro blog, minimal but crucial modifications to the blog project are needed:
1. Define News Content Collection
// src/content/config.ts
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';
const news = defineCollection({
loader: glob({ base: './src/content/news', pattern: '**/*.md' }),
schema: z.object({
title: z.string(),
description: z.string().optional(),
date: z.string().optional(),
pubDate: z.string().optional(),
tags: z.array(z.string()).optional(),
layout: z.string().optional(),
}),
});
export const collections = {
'news': news,
// ... other collections
};
2. Create LatestNews Component
---
// src/components/LatestNews.astro
import { getCollection } from 'astro:content';
const newsEntries = await getCollection('news');
let latestNews = null;
if (newsEntries && newsEntries.length > 0) {
latestNews = newsEntries
.filter(entry => entry.data && (entry.data.date || entry.data.pubDate))
.sort((a, b) => {
const dateA = new Date(a.data.date || a.data.pubDate);
const dateB = new Date(b.data.date || b.data.pubDate);
return dateB.getTime() - dateA.getTime();
})
.slice(0, 1)[0];
}
---
{latestNews && (
<div class="latest-news">
<h3>π° Latest News</h3>
<div class="news-item">
<h4>{latestNews.data.title}</h4>
<p>{latestNews.data.description}</p>
<a href={`/news/${latestNews.data.date || latestNews.data.pubDate}`}>
Read More β
</a>
</div>
</div>
)}
<style>
.latest-news {
border: 1px solid #e1e5e9;
border-radius: 8px;
padding: 1.5rem;
margin: 1rem 0;
background: #f8fafc;
}
.news-item h4 {
margin: 0 0 0.5rem 0;
color: #1a1a1a;
}
.news-item p {
color: #666;
margin: 0 0 1rem 0;
}
.news-item a {
color: #2563eb;
text-decoration: none;
font-weight: 500;
}
</style>
3. Fix Dynamic Route Rendering
Due to structural changes in entry objects with glob loader in Astro 5.x, adaptation is needed:
---
// src/pages/news/[date].astro
import { getCollection, getEntry } from 'astro:content';
export async function getStaticPaths() {
const newsEntries = await getCollection('news');
return newsEntries
.filter(entry => entry.data && (entry.data.date || entry.data.pubDate))
.map(entry => ({
params: {
date: entry.data.date || entry.data.pubDate
},
props: {
entryId: entry.id, // Use entry.id instead of slug
dateParam: entry.data.date || entry.data.pubDate
}
}));
}
const { entryId } = Astro.props;
const post = await getEntry('news', entryId);
if (!post) {
throw new Error(`No news entry found for entryId: ${entryId}`);
}
---
<html>
<body>
<main>
<h1>{post.data.title}</h1>
<!-- Use pre-rendered content -->
<div set:html={post.rendered.html}></div>
</main>
</body>
</html>
Key modification points:
- Use
entry.idinstead ofentry.slug(slug is undefined in glob loader) - Use
post.rendered.htmlto get pre-rendered content - Get complete entry object via
getEntryduring page rendering
Diverse Execution Methods
To adapt to different deployment environments, I designed multiple execution methods:
1. Direct Execution (Development & Debugging)
# Complete workflow
python -m news_bot.job --date $(date +%Y-%m-%d)
# Dry run mode (skip publishing)
python -m news_bot.job --date 2025-07-25 --dry-run
2. Shell Script Execution
#!/bin/bash
# run_daily_news.sh
cd "$(dirname "$0")"
source .env
# Create log directory
mkdir -p ~/logs
# Execute news processing workflow
DATE=$(date +%Y-%m-%d)
echo "=== Starting news processing for $DATE ===" >> ~/logs/news_bot.log
python -m news_bot.job --date $DATE >> ~/logs/news_bot.log 2>&1
echo "=== Completed at $(date -Iseconds) ===" >> ~/logs/news_bot.log
3. Daemon Mode (Recommended)
# Start daemon (complete background execution)
./start_daemon.sh start
# Check running status
./start_daemon.sh status
# View logs
./start_daemon.sh logs
# Graceful stop
./stop_daemon.sh
4. Cron Scheduled Tasks
# Edit crontab
crontab -e
# Execute daily at 8:05
5 8 * * * /Users/geyuxu/repo/astro-news-bot/run_daily_news.sh
Operations Experience and Best Practices
Configuration Management
{
"output_config": {
"blog_content_dir": "/Users/geyuxu/repo/blog/geyuxu.com/src/content/news",
"filename_format": "news_{date}.md",
"use_blog_dir": true
},
"git_config": {
"target_branch": "gh-pages",
"auto_switch_branch": true,
"push_to_remote": true
},
"news_config": {
"max_articles_per_day": 6,
"token_budget_per_day": 4000,
"similarity_threshold": 0.85
},
"llm_config": {
"model": "gpt-4o",
"max_tokens": 500,
"temperature": 0.3
},
"scheduler_config": {
"enabled": true,
"timezone": "Asia/Shanghai",
"cron_expression": "0 8 * * *"
}
}
Cost Control
- Daily processing: ~6 articles
- Estimated token consumption: ~4000 tokens/day
- OpenAI cost: ~$0.01-0.05/day
Log Management
# View real-time logs
tail -f logs/daemon.log
# View scheduler logs
tail -f logs/scheduler.log
# View today's execution records
grep "$(date +%Y-%m-%d)" ~/logs/news_bot.log
Project Value and Results
Test Validation Results
Latest test (2025-07-26):
- β Fetcher: Retrieved 31 tech news articles (RSS sources)
- β Deduplicator: Vector deduplication, retained 31 unique articles
- β Summarizer: AI summary generation, used 10,681 tokens
- β Writer: Generated 188-line Markdown with 7 tech categories
- β Publisher: Successfully committed and pushed to blog repository
News Classification System
The system automatically categorizes news into 9 tech domains:
- π€ Artificial Intelligence
- π± Mobile Technology
- π Autonomous Driving
- βοΈ Cloud Computing
- πΎ Chip Technology
- π° Venture Capital
- π Cybersecurity
- βοΈ Blockchain
- π¬ Scientific Research
Output Format Example
---
title: Daily News Digest Β· 2025-07-26
pubDate: '2025-07-26'
description: In 2025, the US semiconductor market experienced significant changes...
tags: [News, Daily, Chip Technology, Autonomous Driving, Mobile Technology]
layout: news
---
## Chip Technology
- **A timeline of the US semiconductor market in 2025**
In 2025, the US semiconductor market experienced significant changes, including leadership transitions at traditional semiconductor companies and volatile chip export policies.
*Tags: Semiconductor Β· US Market Β· Policy Changes*
[Read Original](https://techcrunch.com/2025/07/25/...) | Source: TechCrunch
## Autonomous Driving
- **Tesla is reportedly bringing robotaxi service to San Francisco**
Tesla plans to launch a limited version of its autonomous taxi service in San Francisco. Unlike the Austin service, this test will have employees in the driver's seat for safety.
*Tags: Tesla Β· Autonomous Driving Β· Taxi Service*
[Read Original](https://techcrunch.com/2025/07/25/...) | Source: TechCrunch
Future Plans
- Smarter Source Discovery: Enable the bot to automatically discover and recommend new high-quality news sources
- Trend Analysis & Topic Aggregation: Identify hot topics within specific time periods and aggregate related articles
- User Feedback Loop: Collect user feedback data for fine-tuning AI models
- Open Source Plan: Clean up code and open source to help others build their own AI news bots
Conclusion
astro-news-bot is a typical "use technology to solve your own problems" project. It organically combines AI, automation scripts, modern web development frameworks (Astro), and DevOps principles (GitOps) to build an elegant automated system.
This project not only solves my information overload problem but also serves as an excellent testbed for practicing LLM applications, vector databases, GitOps, and other emerging technologies. If you want to build a similar system, I hope this article provides some inspiration and reference.
Key Tech Stack:
- Backend: Python + OpenAI API + SentenceTransformers
- Frontend: Astro + TypeScript + Content Collections
- Deployment: GitOps + Shell Scripts + Cron Jobs
- Data: JSON + Markdown + Git
The entire system embodies modern AI application development best practices: modular design, vectorized processing, automated deployment, and continuous operations.