This project aims to create a scalable, real-time language operations pipeline that leverages AWS services to manage multilingual Twitter data in a batch-processing workflow. The pipeline collects Twitter data based on specific hashtags, saves it in batches, and processes it through a series of AWS services.
Data Preparation Step-By-Step (Option 1)
Twitter API Data Fetching:
Query the Twitter API for tweets matching the target hashtags, capturing text, timestamp, language, user_info, and any additional metadata necessary for downstream tasks; a minimal fetching sketch follows.
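A minimal fetching sketch, assuming tweepy v4 and a Twitter API bearer token; the hashtag query format and the specific user fields are illustrative choices, not fixed requirements:

import tweepy

# Hypothetical credential; supply your own bearer token.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

def fetch_tweet_batch(hashtag: str, max_results: int = 100) -> list[dict]:
    """Fetch recent tweets for one hashtag and normalize the fields we need."""
    response = client.search_recent_tweets(
        query=f"#{hashtag} -is:retweet",
        tweet_fields=["created_at", "lang", "author_id"],
        expansions=["author_id"],
        user_fields=["username", "location"],
        max_results=max_results,
    )
    # Map expanded user objects by id so each tweet can carry its user_info.
    users = {u.id: u for u in (response.includes or {}).get("users", [])}
    batch = []
    for tweet in response.data or []:
        user = users.get(tweet.author_id)
        batch.append({
            "text": tweet.text,
            "timestamp": tweet.created_at.isoformat(),
            "language": tweet.lang,
            "user_info": {
                "id": tweet.author_id,
                "username": getattr(user, "username", None),
                "location": getattr(user, "location", None),
            },
        })
    return batch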
Local JSON Storage:
For each batch of tweets, store the data locally in JSON format, naming files with timestamps or unique identifiers (e.g., tweets_batch_YYYYMMDD_HHMMSS.json) to prevent overwriting; see the sketch after the file-structure example below.
File Structure:
[
{
"text": "Sample tweet text",
"timestamp": "2024-11-05T12:34:56Z",
"language": "en",
"user_info": {...},
...
},
...
]
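One way to implement this naming scheme, sketched in Python; the tweet_batches output directory is an assumption:

import json
from datetime import datetime, timezone
from pathlib import Path

def save_batch_locally(batch: list[dict], out_dir: str = "tweet_batches") -> Path:
    """Write one batch to a timestamped JSON file so batches never overwrite each other."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = Path(out_dir) / f"tweets_batch_{stamp}.json"
    with path.open("w", encoding="utf-8") as f:
        json.dump(batch, f, ensure_ascii=False, indent=2)
    return path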
Batch Upload to AWS S3:
Upload each local JSON batch file to a designated S3 bucket so the raw data is durably stored and available to downstream AWS services; a boto3 sketch follows.
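A sketch using boto3; the bucket name tweets-raw-batches and the raw/ key prefix are placeholders for your own configuration:

from pathlib import Path
import boto3

s3 = boto3.client("s3")

def upload_batch_to_s3(local_path: Path, bucket: str = "tweets-raw-batches",
                       prefix: str = "raw/") -> str:
    """Upload one local JSON batch file to S3, keyed by its file name."""
    key = prefix + local_path.name
    s3.upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"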
Kafka Producer:
Configure a Kafka producer within the data loader script to send each tweet entry or batch to a specific Kafka topic (e.g., tweets_raw); a producer sketch follows the message format below.
Kafka Message Format:
{
"text": "Sample tweet text",
"timestamp": "2024-11-05T12:34:56Z",
"language": "en",
"user_info": {...},
...
}
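A producer sketch using the kafka-python client; the broker address localhost:9092 is a placeholder:

import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each tweet dict to UTF-8 JSON, matching the format above.
    value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),
)

def publish_batch(batch: list[dict], topic: str = "tweets_raw") -> None:
    """Send each tweet in the batch to the tweets_raw topic."""
    for tweet in batch:
        producer.send(topic, value=tweet)
    producer.flush()  # block until all messages are acknowledged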
Each message should include text, language, timestamp, and the other fields shown above.
Kafka Consumer:
Set up a Kafka consumer to subscribe to the tweets_raw topic and forward messages to the translation model, as in the sketch below.
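A consumer sketch, again with kafka-python; translate_text() is a hypothetical stand-in for the project's actual translation-model call, and skipping English-language tweets is an assumption about the pipeline's intent:

import json
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "tweets_raw",
    bootstrap_servers="localhost:9092",  # placeholder broker address
    group_id="translation-workers",      # assumed consumer-group name
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

def translate_text(text: str, source_lang: str) -> str:
    raise NotImplementedError  # placeholder for the real translation call

for message in consumer:
    tweet = message.value
    if tweet.get("language") != "en":
        tweet["translated_text"] = translate_text(tweet["text"], tweet["language"])
    # Forward the (possibly translated) tweet downstream here,
    # e.g., to another topic or to storage.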