Version 1.2: Created a custom dataset of timestamped English, Spanish, and French entries. I will use it to simulate real-time streaming.
How to Generate a Custom Dataset (for Testing)
🌍 Project Pipeline in Detail
1. Dataset Preparation
- The dataset now contains three entries per language (English, Spanish, French), with a fixed 30-second interval between consecutive entries.
- This dataset can simulate real-time processing: each entry streams at its interval as if it were arriving live. A generation sketch follows below.
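The sketch below shows one way such a dataset could be generated. The `tweets_dataset.jsonl` file name, the JSONL layout, and the sample texts are illustrative assumptions, not the project's actual data:

```python
# A minimal dataset-generator sketch; file name, field names, and sample
# texts are assumptions for illustration only.
import json
from datetime import datetime, timedelta

SAMPLES = {
    "en": ["Good morning!", "The weather is nice.", "See you soon."],
    "es": ["¡Buenos días!", "El clima está agradable.", "Hasta pronto."],
    "fr": ["Bonjour !", "Il fait beau.", "À bientôt."],
}

def build_dataset(path="tweets_dataset.jsonl", interval_s=30):
    """Write three entries per language, spaced interval_s seconds apart."""
    ts = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary start time
    with open(path, "w", encoding="utf-8") as f:
        for lang, texts in SAMPLES.items():
            for text in texts:
                entry = {"timestamp": ts.isoformat(), "lang": lang, "text": text}
                f.write(json.dumps(entry, ensure_ascii=False) + "\n")
                ts += timedelta(seconds=interval_s)

if __name__ == "__main__":
    build_dataset()
```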
2. Data Loader and Kafka Producer
- Script Setup: Write a Python script to read from the dataset file and send each entry to Kafka.
- Streaming Simulation: Use `time.sleep(30)` to delay each entry by 30 seconds, ensuring that each message is sent in sequence to simulate real-time arrival.
- Kafka Integration: Connect this script to a Kafka producer that pushes each entry to the Kafka topic `tweets_raw`. See the producer sketch after this list.
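A minimal producer sketch, assuming the kafka-python client, a broker at `localhost:9092`, and the `tweets_dataset.jsonl` file from the step above:

```python
# Producer sketch using kafka-python; broker address and dataset file
# name are assumptions carried over from the generator sketch.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),
)

with open("tweets_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        producer.send("tweets_raw", entry)  # push to the tweets_raw topic
        producer.flush()                    # make sure it leaves the buffer
        print(f"sent: {entry['text']}")
        time.sleep(30)                      # simulate real-time arrival
```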
3. AWS S3 Storage and Lambda Trigger (Optional)
- You could additionally store batches of messages in AWS S3 if you want persistence or backup of raw data.
- An AWS Lambda function with an S3 trigger can be set up to activate on each upload, performing initial processing (e.g., language verification or extraction); a handler sketch follows below.
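If the Lambda route is used, the handler might look like the sketch below. The JSONL batch format and the per-entry `lang` field are assumptions carried over from the earlier sketches:

```python
# Sketch of an S3-triggered Lambda handler; batch layout and the "lang"
# field are assumptions, not the project's confirmed schema.
import json
import boto3

ALLOWED_LANGS = {"en", "es", "fr"}
s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Runs on each S3 upload; does a first-pass language check on the batch."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        entries = [json.loads(line) for line in body.splitlines() if line]
        unknown = [e for e in entries if e.get("lang") not in ALLOWED_LANGS]
        print(f"{key}: {len(entries)} entries, {len(unknown)} with unexpected language")
    return {"statusCode": 200}
```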
4. Real-Time Processing and Translation Pipeline
- Kafka Consumers: Set up Kafka consumers to read messages from `tweets_raw`. Each message can be sent to a Hugging Face translation model or similar service (see the consumer sketch after this list).
- Translation: Deploy the Hugging Face translation model as a FastAPI microservice. The Kafka consumer will forward each message to this service for translation.
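A consumer sketch, again assuming kafka-python; the service URL and `/translate` endpoint are assumptions that match the FastAPI sketch further below:

```python
# Consumer sketch forwarding each message to the translation service;
# URL, endpoint, and group id are illustrative assumptions.
import json
import requests
from kafka import KafkaConsumer

TRANSLATE_URL = "http://localhost:8000/translate"

consumer = KafkaConsumer(
    "tweets_raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="translators",
)

for message in consumer:
    entry = message.value
    resp = requests.post(TRANSLATE_URL, json=entry, timeout=30)
    resp.raise_for_status()
    print(f"{entry['lang']}: {entry['text']} -> {resp.json()['translation']}")
```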
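And a sketch of the FastAPI microservice itself. The Helsinki-NLP/opus-mt models are one plausible choice for Spanish/French-to-English translation, not necessarily the ones used in this project; with the file saved as `translator.py`, it can be served with `uvicorn translator:app`:

```python
# Translation microservice sketch; model choices are assumptions, and
# English entries simply pass through untranslated.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load one translation model per source language at startup.
translators = {
    "es": pipeline("translation", model="Helsinki-NLP/opus-mt-es-en"),
    "fr": pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en"),
}

class Entry(BaseModel):
    timestamp: str
    lang: str
    text: str

@app.post("/translate")
def translate(entry: Entry):
    if entry.lang == "en":
        return {"translation": entry.text}
    result = translators[entry.lang](entry.text)
    return {"translation": result[0]["translation_text"]}
```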