Updated: The dataset source now uses RSS feeds from global news outlets in multiple languages. The feeds are aggregated and streamed into Kafka for real-time translation.
Updated Project Pipeline
1 & 2. Data Collection and Storage
- Data Source: Use RSSHub to aggregate multilingual news feeds from international news outlets (e.g., BBC, CNN, Le Monde, El País).
- Data Fetching:
- Set up a Python script to periodically fetch RSS feed entries for each language via RSSHub.
- Each feed entry includes the article’s title, summary, publication date, and link.
- Data Formatting:
- Convert fetched news articles to JSON format with fields such as title, summary, published, language, and link.
- Streaming to Kafka:
- Send each formatted article to a specific Kafka topic based on its language (e.g., news_english, news_spanish).
- Batch Processing and Storage:
- Optional: Store JSON batches locally or upload to AWS S3 for backup or batch processing.
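The fetch, format, and routing steps above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, assuming the feed is plain RSS 2.0. The language-to-topic mapping, the function names, and the sample feed are hypothetical, and the actual HTTP fetch from RSSHub and the Kafka send call are left out.

```python
from __future__ import annotations

import json
import xml.etree.ElementTree as ET

# Hypothetical mapping from language code to Kafka topic, following the
# news_<language> naming pattern described above.
TOPIC_BY_LANGUAGE = {"en": "news_english", "es": "news_spanish"}


def parse_rss(xml_text: str, language: str) -> list[dict]:
    """Convert the <item> elements of an RSS 2.0 document into article dicts."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": (item.findtext("title") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
            "language": language,
            "link": (item.findtext("link") or "").strip(),
        })
    return articles


def to_kafka_record(article: dict) -> tuple[str, bytes]:
    """Pick the topic for an article and serialize it as UTF-8 JSON bytes."""
    topic = TOPIC_BY_LANGUAGE[article["language"]]
    value = json.dumps(article, ensure_ascii=False).encode("utf-8")
    return topic, value


# Made-up sample feed standing in for a response from RSSHub.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item>
    <title>Titular de ejemplo</title>
    <description>Resumen breve.</description>
    <pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate>
    <link>https://example.com/articulo</link>
  </item>
</channel></rss>"""

articles = parse_rss(SAMPLE_FEED, language="es")
topic, value = to_kafka_record(articles[0])
print(topic, value.decode("utf-8"))
```

In a real producer, the (topic, value) pair would be handed to a Kafka client's send or produce call, and the script would repeat the fetch on a timer for each language feed.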
3. Real-Time Processing and Translation Pipeline (Kafka Consumers)
- Kafka Producer:
- A producer script fetches articles from RSSHub, formats them, and streams them into the appropriate Kafka topic.
- Kafka Topics:
- Each language has a dedicated Kafka topic (e.g., news_english, news_spanish), allowing for easy language-based processing.
- Kafka Consumers:
- Each consumer reads from a specific language-based Kafka topic.
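As a rough sketch of the consumer side, the loop below decodes each message from a language topic and passes the text fields to a translation callable. It assumes messages arrive as (topic, bytes) pairs containing the JSON fields listed earlier. The function names and the source_topic field are hypothetical, and the stand-in translator only tags the text so the example runs without a Kafka broker or a model.

```python
from __future__ import annotations

import json
from typing import Callable, Iterable, Iterator


def translate_stream(
    messages: Iterable[tuple[str, bytes]],
    translate: Callable[[str, str], str],
) -> Iterator[dict]:
    """Yield a translated copy of each article read from a language topic."""
    for topic, raw in messages:
        article = json.loads(raw.decode("utf-8"))
        source_lang = article["language"]
        translated = dict(article)  # leave the original record untouched
        translated["title"] = translate(article["title"], source_lang)
        translated["summary"] = translate(article["summary"], source_lang)
        translated["source_topic"] = topic  # hypothetical bookkeeping field
        yield translated


def fake_translate(text: str, source_lang: str) -> str:
    """Stand-in for a call to the translation service; it only tags the text."""
    return f"[{source_lang}->en] {text}"


# One made-up message, shaped like what a news_spanish consumer would receive.
sample = {
    "title": "Titular de ejemplo",
    "summary": "Resumen breve.",
    "published": "Mon, 01 Jan 2024 08:00:00 GMT",
    "language": "es",
    "link": "https://example.com/articulo",
}
messages = [("news_spanish", json.dumps(sample).encode("utf-8"))]

results = list(translate_stream(messages, fake_translate))
print(results[0]["title"])
```

In a deployment, messages would come from a Kafka consumer client subscribed to one language topic, and the translate argument would wrap an HTTP request to the translation service. Passing the translator in as a callable keeps the message handling testable without either dependency.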
4. Connect Hugging Face with FastAPI & Kubernetes Deployment and Autoscaling
- Translation Service:
- Each Kafka consumer forwards its messages to the translation model, which is hosted as a FastAPI microservice on Kubernetes.
- The model processes each message, translating it in real time.
- Translated Output:
- The translated messages are returned to Kafka or stored in a database for downstream applications (e.g., monitoring, analytics).
- Microservice Deployment:
- The FastAPI translation service is containerized and deployed on Google Kubernetes Engine (GKE).
- Horizontal Pod Autoscaling (HPA):
- HPA dynamically adjusts the number of translation service instances based on observed load (e.g., CPU utilization or a custom traffic metric) to ensure efficiency and reliability during high-traffic periods.
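To make the autoscaling behavior concrete, the sketch below reproduces the replica calculation described in the Kubernetes HPA documentation: the desired count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The function name, the replica bounds, and the example numbers are assumptions for illustration, not settings from this project.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # hypothetical lower bound
    max_replicas: int = 10,   # hypothetical upper bound
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA rule: ceil(current * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# 3 pods averaging 90% CPU against a 60% target: scale out.
scale_out = desired_replicas(3, current_metric=90, target_metric=60)
# 4 pods averaging 30% CPU against a 60% target: scale in.
scale_in = desired_replicas(4, current_metric=30, target_metric=60)
print(scale_out, scale_in)
```

The real controller also handles details this sketch ignores, such as pods that are not yet ready and stabilization windows, so the numbers are only a first approximation of what a cluster would do.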