This project aims to create a scalable, real-time language operations pipeline that leverages AWS services to manage multilingual Twitter data in a batch-processing workflow. The pipeline collects Twitter data based on specific hashtags, saves it in batches, and processes it through a series of AWS services.
Data Preparation Step-By-Step (Option 1)
Twitter API Data Fetching:
Query the Twitter API for tweets matching the target hashtags, capturing text, timestamp, language, user_info, and any additional metadata necessary for downstream tasks; a minimal fetching sketch follows.
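A minimal fetching sketch, assuming tweepy v4 and a Twitter API bearer token; the hashtag query format and the specific user fields are illustrative choices, not fixed requirements:

import tweepy

# Hypothetical credential; supply your own bearer token.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

def fetch_tweet_batch(hashtag: str, max_results: int = 100) -> list[dict]:
    """Fetch recent tweets for one hashtag and normalize the fields we need."""
    response = client.search_recent_tweets(
        query=f"#{hashtag} -is:retweet",
        tweet_fields=["created_at", "lang", "author_id"],
        expansions=["author_id"],
        user_fields=["username", "location"],
        max_results=max_results,
    )
    # Map expanded user objects by id so each tweet can carry its user_info.
    users = {u.id: u for u in (response.includes or {}).get("users", [])}
    batch = []
    for tweet in response.data or []:
        user = users.get(tweet.author_id)
        batch.append({
            "text": tweet.text,
            "timestamp": tweet.created_at.isoformat(),
            "language": tweet.lang,
            "user_info": {
                "id": tweet.author_id,
                "username": getattr(user, "username", None),
                "location": getattr(user, "location", None),
            },
        })
    return batch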
Local JSON Storage:
For each batch of tweets, store the data locally in JSON format, naming files with timestamps or unique identifiers (e.g., tweets_batch_YYYYMMDD_HHMMSS.json) to prevent overwriting; see the sketch after the file-structure example below.
File Structure:
[
{
"text": "Sample tweet text",
"timestamp": "2024-11-05T12:34:56Z",
"language": "en",
"user_info": {...},
...
},
...
]
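One way to implement this naming scheme, sketched in Python; the tweet_batches output directory is an assumption:

import json
from datetime import datetime, timezone
from pathlib import Path

def save_batch_locally(batch: list[dict], out_dir: str = "tweet_batches") -> Path:
    """Write one batch to a timestamped JSON file so batches never overwrite each other."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = Path(out_dir) / f"tweets_batch_{stamp}.json"
    with path.open("w", encoding="utf-8") as f:
        json.dump(batch, f, ensure_ascii=False, indent=2)
    return path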
Batch Upload to AWS S3:
Upload each local JSON batch file to a designated S3 bucket so the raw data is durably stored and available to downstream AWS services; a boto3 sketch follows.
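A sketch using boto3; the bucket name tweets-raw-batches and the raw/ key prefix are placeholders for your own configuration:

from pathlib import Path
import boto3

s3 = boto3.client("s3")

def upload_batch_to_s3(local_path: Path, bucket: str = "tweets-raw-batches",
                       prefix: str = "raw/") -> str:
    """Upload one local JSON batch file to S3, keyed by its file name."""
    key = prefix + local_path.name
    s3.upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"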
Kafka Producer:
Configure a Kafka producer within the data loader script to send each tweet entry or batch to a specific Kafka topic (e.g., tweets_raw); a producer sketch follows the message format below.
Kafka Message Format:
{
"text": "Sample tweet text",
"timestamp": "2024-11-05T12:34:56Z",
"language": "en",
"user_info": {...},
...
}
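A producer sketch using the kafka-python client; the broker address localhost:9092 is a placeholder:

import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each tweet dict to UTF-8 JSON, matching the format above.
    value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),
)

def publish_batch(batch: list[dict], topic: str = "tweets_raw") -> None:
    """Send each tweet in the batch to the tweets_raw topic."""
    for tweet in batch:
        producer.send(topic, value=tweet)
    producer.flush()  # block until all messages are acknowledged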
Each message should include text, language, timestamp, and the other fields shown above.
Kafka Consumer:
Set up a Kafka consumer to subscribe to the tweets_raw topic and forward messages to the translation model, as in the sketch below.
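A consumer sketch, again with kafka-python; translate_text() is a hypothetical stand-in for the project's actual translation-model call, and skipping English-language tweets is an assumption about the pipeline's intent:

import json
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "tweets_raw",
    bootstrap_servers="localhost:9092",  # placeholder broker address
    group_id="translation-workers",      # assumed consumer-group name
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

def translate_text(text: str, source_lang: str) -> str:
    raise NotImplementedError  # placeholder for the real translation call

for message in consumer:
    tweet = message.value
    if tweet.get("language") != "en":
        tweet["translated_text"] = translate_text(tweet["text"], tweet["language"])
    # Forward the (possibly translated) tweet downstream here,
    # e.g., to another topic or to storage.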