Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch | End-to-End Data Engineering Project
This project is a detailed guide to building an end-to-end data engineering pipeline using a TCP/IP socket, Apache Spark, OpenAI's LLM, Kafka, and Elasticsearch. It covers every stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and indexing in Elasticsearch.
The project is designed with the following components:
- Data Source: We use the yelp.com dataset for the pipeline.
- TCP/IP Socket: Streams the data over the network in chunks (see the socket-server sketch after this list).
- Apache Spark: Processes the data with its master and worker nodes.
- Confluent Kafka: A managed Kafka cluster running in the cloud.
- Control Center and Schema Registry: Monitor the Kafka streams and manage their schemas.
- Kafka Connect: Connects Kafka to Elasticsearch (a sink-connector sketch follows the technology list below).
- Elasticsearch: Indexes the enriched records for querying.
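For illustration, here is a minimal sketch of the TCP/IP socket source: a small Python server that reads the Yelp reviews file line by line and sends each record over the socket as a newline-delimited JSON chunk. The host, port, file path, and throttle interval are assumptions for the sketch, not values taken from this repository.

```python
import json
import socket
import time

def stream_yelp_reviews(host="localhost", port=9999,
                        path="data/yelp_academic_dataset_review.json"):
    """Stream the Yelp reviews file over a TCP socket, one JSON record per line."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    print(f"Listening on {host}:{port}...")

    conn, addr = server.accept()
    print(f"Client connected: {addr}")
    try:
        with open(path, "r") as f:
            for line in f:
                record = json.loads(line)             # one review per line in the dataset
                payload = json.dumps(record) + "\n"   # newline-delimited chunk
                conn.sendall(payload.encode("utf-8"))
                time.sleep(1)                         # throttle to simulate a live stream
    finally:
        conn.close()
        server.close()

if __name__ == "__main__":
    stream_yelp_reviews()
```

Newline-delimited JSON is a convenient framing here because Spark's socket source splits incoming bytes on newlines, so each line arrives downstream as one complete record.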
What you'll learn:

- Setting up a data pipeline with TCP/IP
- Real-time data streaming with Apache Kafka
- Data processing techniques with Apache Spark (a combined Spark-plus-sentiment sketch follows this list)
- Real-time sentiment analysis with OpenAI ChatGPT
- Synchronising data from Kafka to Elasticsearch
- Indexing and querying data on Elasticsearch
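Below is a minimal sketch of the Spark side: a Structured Streaming job that reads the newline-delimited JSON from the socket, calls ChatGPT to classify each review's sentiment, and produces the enriched records to a Kafka topic. The model name, schema fields, topic name, and broker address are assumptions; a real Confluent Cloud cluster also needs SASL credentials, and the job must be submitted with the spark-sql-kafka package on the classpath.

```python
from openai import OpenAI
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, struct, to_json, udf
from pyspark.sql.types import StringType, StructField, StructType

def analyze_sentiment(text):
    """Ask ChatGPT to label a review POSITIVE, NEGATIVE, or NEUTRAL."""
    if not text:
        return "NEUTRAL"
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": ("Classify the sentiment of this review as POSITIVE, "
                        f"NEGATIVE, or NEUTRAL. Reply with one word only.\n\n{text}"),
        }],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    spark = SparkSession.builder.appName("YelpSentiment").getOrCreate()

    # Only the fields this sketch needs; the full Yelp schema has more columns
    schema = StructType([
        StructField("review_id", StringType()),
        StructField("text", StringType()),
    ])

    # Read newline-delimited JSON records from the TCP socket source
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
    reviews = lines.select(from_json(col("value"), schema).alias("r")).select("r.*")

    sentiment_udf = udf(analyze_sentiment, StringType())
    scored = reviews.withColumn("sentiment", sentiment_udf(col("text")))

    # Produce the enriched records to a Kafka topic as JSON strings
    query = (scored.select(to_json(struct("*")).alias("value"))
             .writeStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("topic", "yelp_reviews")
             .option("checkpointLocation", "/tmp/checkpoints")
             .start())
    query.awaitTermination()
```

Note that the UDF makes one API call per record, which is the simplest thing that works for a demo; a production job would batch requests and reuse the client instead of constructing it on every call.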
Technologies used:

- Apache Spark
- Confluent Kafka
- Docker
- Elasticsearch
- Python
- TCP/IP
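To sync the Kafka topic into Elasticsearch, the pipeline relies on Kafka Connect's Elasticsearch sink. Here is a hedged sketch of registering such a connector through the Kafka Connect REST API; the connector name, topic, and URLs are assumptions for a local setup, and a Confluent Cloud deployment would add authentication settings.

```python
import requests

# Configuration for the Confluent Elasticsearch Sink connector.
# key.ignore/schema.ignore suit plain JSON records without embedded schemas.
connector = {
    "name": "elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "yelp_reviews",
        "connection.url": "http://localhost:9200",
        "key.ignore": "true",
        "schema.ignore": "true",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
    },
}

# Register the connector with the Kafka Connect REST API (default port 8083)
resp = requests.post(
    "http://localhost:8083/connectors",
    json=connector,
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())
```

With this in place, every record produced to the topic is indexed into an Elasticsearch index named after the topic, with no custom consumer code to maintain.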
Getting started:

1. Clone the repository:

   ```bash
   git clone https://github.com/ManojGowda27/Realtime_Data_Streaming.git
   ```

2. Navigate to the project directory:

   ```bash
   cd src
   ```

3. Run Docker Compose to spin up the Spark cluster:

   ```bash
   docker-compose up
   ```
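Once the pipeline is running, the enriched reviews can be queried straight from Elasticsearch. A small sketch using the elasticsearch-py 8.x client; the index and field names are assumptions that mirror the sketches above.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch a handful of reviews that ChatGPT classified as negative
resp = es.search(
    index="yelp_reviews",
    query={"match": {"sentiment": "NEGATIVE"}},
    size=5,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["review_id"], hit["_source"]["sentiment"])
```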
