Version 1.2: Created a custom dataset of timestamped English, Spanish, and French entries. I will use it to simulate real-time streaming.
How to Generate a Custom Dataset (for Testing)
🌍 Project Pipeline in Detail
1. Dataset Preparation
- The dataset now contains three entries per language (English, Spanish, French), with a fixed 30-second interval between consecutive entries.
- This dataset can simulate real-time processing: each entry streams at its interval as if it were arriving live. A generation sketch follows below.
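The sketch below shows one way such a dataset could be generated. The `tweets_dataset.jsonl` file name, the JSONL layout, and the sample texts are illustrative assumptions, not the project's actual data:

```python
# A minimal dataset-generator sketch; file name, field names, and sample
# texts are assumptions for illustration only.
import json
from datetime import datetime, timedelta

SAMPLES = {
    "en": ["Good morning!", "The weather is nice.", "See you soon."],
    "es": ["¡Buenos días!", "El clima está agradable.", "Hasta pronto."],
    "fr": ["Bonjour !", "Il fait beau.", "À bientôt."],
}

def build_dataset(path="tweets_dataset.jsonl", interval_s=30):
    """Write three entries per language, spaced interval_s seconds apart."""
    ts = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary start time
    with open(path, "w", encoding="utf-8") as f:
        for lang, texts in SAMPLES.items():
            for text in texts:
                entry = {"timestamp": ts.isoformat(), "lang": lang, "text": text}
                f.write(json.dumps(entry, ensure_ascii=False) + "\n")
                ts += timedelta(seconds=interval_s)

if __name__ == "__main__":
    build_dataset()
```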
2. Data Loader and Kafka Producer
- Script Setup: Write a Python script to read from the dataset file and send each entry to Kafka.
- Streaming Simulation: Use `time.sleep(30)` to delay each entry by 30 seconds, ensuring that each message is sent in sequence to simulate real-time arrival.
- Kafka Integration: Connect this script to a Kafka producer that pushes each entry to the Kafka topic `tweets_raw`. See the producer sketch after this list.
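A minimal producer sketch, assuming the kafka-python client, a broker at `localhost:9092`, and the `tweets_dataset.jsonl` file from the step above:

```python
# Producer sketch using kafka-python; broker address and dataset file
# name are assumptions carried over from the generator sketch.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),
)

with open("tweets_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        producer.send("tweets_raw", entry)  # push to the tweets_raw topic
        producer.flush()                    # make sure it leaves the buffer
        print(f"sent: {entry['text']}")
        time.sleep(30)                      # simulate real-time arrival
```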
3. AWS S3 Storage and Lambda Trigger (Optional)
- You could additionally store batches of messages in AWS S3 if you want persistence or backup of raw data.
- An AWS Lambda function with an S3 trigger can be set up to activate on each upload, performing initial processing (e.g., language verification or extraction); a handler sketch follows below.
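If the Lambda route is used, the handler might look like the sketch below. The JSONL batch format and the per-entry `lang` field are assumptions carried over from the earlier sketches:

```python
# Sketch of an S3-triggered Lambda handler; batch layout and the "lang"
# field are assumptions, not the project's confirmed schema.
import json
import boto3

ALLOWED_LANGS = {"en", "es", "fr"}
s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Runs on each S3 upload; does a first-pass language check on the batch."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        entries = [json.loads(line) for line in body.splitlines() if line]
        unknown = [e for e in entries if e.get("lang") not in ALLOWED_LANGS]
        print(f"{key}: {len(entries)} entries, {len(unknown)} with unexpected language")
    return {"statusCode": 200}
```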
4. Real-Time Processing and Translation Pipeline
- Kafka Consumers: Set up Kafka consumers to read messages from `tweets_raw`. Each message can be sent to a Hugging Face translation model or similar service (see the consumer sketch after this list).
- Translation: Deploy the Hugging Face translation model as a FastAPI microservice. The Kafka consumer will forward each message to this service for translation.
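A consumer sketch, again assuming kafka-python; the service URL and `/translate` endpoint are assumptions that match the FastAPI sketch further below:

```python
# Consumer sketch forwarding each message to the translation service;
# URL, endpoint, and group id are illustrative assumptions.
import json
import requests
from kafka import KafkaConsumer

TRANSLATE_URL = "http://localhost:8000/translate"

consumer = KafkaConsumer(
    "tweets_raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="translators",
)

for message in consumer:
    entry = message.value
    resp = requests.post(TRANSLATE_URL, json=entry, timeout=30)
    resp.raise_for_status()
    print(f"{entry['lang']}: {entry['text']} -> {resp.json()['translation']}")
```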
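And a sketch of the FastAPI microservice itself. The Helsinki-NLP/opus-mt models are one plausible choice for Spanish/French-to-English translation, not necessarily the ones used in this project; with the file saved as `translator.py`, it can be served with `uvicorn translator:app`:

```python
# Translation microservice sketch; model choices are assumptions, and
# English entries simply pass through untranslated.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load one translation model per source language at startup.
translators = {
    "es": pipeline("translation", model="Helsinki-NLP/opus-mt-es-en"),
    "fr": pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en"),
}

class Entry(BaseModel):
    timestamp: str
    lang: str
    text: str

@app.post("/translate")
def translate(entry: Entry):
    if entry.lang == "en":
        return {"translation": entry.text}
    result = translators[entry.lang](entry.text)
    return {"translation": result[0]["translation_text"]}
```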