1. Dataset Preparation

This step ensures that your dataset is formatted correctly for real-time simulation. Entries are spaced at fixed 30-second intervals to simulate the timing of real-world data arrival.

Organize Dataset Columns: The dataset should have columns such as:
- text: The actual tweet content.
- timestamp: A unique timestamp for each entry to simulate time-based data.
- language: The tweet's language (e.g., 'en' for English, 'es' for Spanish, 'fr' for French).
- user_info: Optional additional information, such as user ID or username.
Verify Format and Intervals:
- Ensure the timestamps are stored as strings in YYYY-MM-DD HH:MM:SS format.
- Consecutive entries should be exactly 30 seconds apart to mimic a live feed.
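The two checks above can be automated. The following is a minimal sketch using only the Python standard library; the function name find_interval_errors and the constants are hypothetical, chosen for illustration. It assumes the timestamps have already been read into a list of strings in file order.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:MM:SS
EXPECTED_GAP = timedelta(seconds=30)    # spacing required between entries


def find_interval_errors(timestamps):
    """Return (index, message) pairs for malformed or mis-spaced timestamps.

    Hypothetical helper: `timestamps` is a list of strings in dataset order.
    """
    errors = []
    previous = None
    for i, raw in enumerate(timestamps):
        try:
            current = datetime.strptime(raw, TIMESTAMP_FORMAT)
        except (ValueError, TypeError):
            errors.append((i, f"bad format: {raw!r}"))
            previous = None  # the next entry cannot be compared to this one
            continue
        # strptime tolerates missing zero padding, so round-trip to be strict.
        if current.strftime(TIMESTAMP_FORMAT) != raw:
            errors.append((i, f"not zero-padded: {raw!r}"))
        if previous is not None and current - previous != EXPECTED_GAP:
            errors.append((i, f"gap is {current - previous}, expected {EXPECTED_GAP}"))
        previous = current
    return errors
```

An empty result means every timestamp parsed and each consecutive pair is 30 seconds apart. Because a malformed entry resets the comparison, the entry that follows it is checked for format only.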
Save the Dataset:
- Save this dataset as a CSV file (e.g., real_time_dataset.csv), which the Kafka producer script will read.
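As an illustration of the whole step, here is a minimal sketch that assigns 30-second timestamps to a list of tweets and writes them to CSV with the columns listed above. It uses only the standard library; build_rows, save_dataset, and the tuple layout of the input are assumptions made for this example, not part of any existing script.

```python
import csv
from datetime import datetime, timedelta

COLUMNS = ["text", "timestamp", "language", "user_info"]


def build_rows(tweets, start, step_seconds=30):
    """Give each (text, language, user_info) tuple a timestamp step_seconds apart.

    Hypothetical helper: `start` is the datetime assigned to the first entry.
    """
    rows = []
    for i, (text, language, user_info) in enumerate(tweets):
        stamp = start + timedelta(seconds=i * step_seconds)
        rows.append({
            "text": text,
            "timestamp": stamp.strftime("%Y-%m-%d %H:%M:%S"),
            "language": language,
            "user_info": user_info,  # optional column; may be an empty string
        })
    return rows


def save_dataset(rows, path="real_time_dataset.csv"):
    """Write the rows to a CSV file with a header line."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

For example, calling save_dataset(build_rows(tweets, datetime(2024, 1, 1, 12, 0, 0))) with a list of tuples would produce real_time_dataset.csv with the first entry stamped 2024-01-01 12:00:00 and each later entry 30 seconds after the previous one. The start date here is a placeholder.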