This project uses Apache Spark (via PySpark) to analyze tweets on three topics: Covid, the Grammys, and finance. The application is Dockerized and can be run with Docker Compose.
- Docker
Clone the repository to your local machine.
git clone [email protected]:drwoj/tweets-pyspark.git
- Build the Docker images:
docker-compose build
- Run the Docker containers:
docker-compose up -d
- Submit the Spark application:
docker-compose exec spark-master spark-submit --master spark://spark-master:7077 src/main.py
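The build, run, and submit steps above can also be chained from a small script so that a failure in one step stops the rest. The following is a minimal sketch, not part of the repository: `STEPS` and `run_steps` are hypothetical names, and it assumes `docker-compose` is on your `PATH` and that the script is started from the repository root.

```python
import subprocess

# Commands copied from the steps above, in the order they must run.
STEPS = [
    ["docker-compose", "build"],
    ["docker-compose", "up", "-d"],
    ["docker-compose", "exec", "spark-master", "spark-submit",
     "--master", "spark://spark-master:7077", "src/main.py"],
]


def run_steps(steps=STEPS, dry_run=False):
    """Run each command in order and return the list of commands handled.

    With dry_run=True the commands are only printed, not executed.
    """
    handled = []
    for cmd in steps:
        print("$ " + " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps are skipped when an earlier one fails.
            subprocess.run(cmd, check=True)
        handled.append(cmd)
    return handled
```

Call `run_steps(dry_run=True)` to preview the commands and `run_steps()` to execute them. When the script runs without a terminal attached (for example in CI), `docker-compose exec` may need the `-T` flag to disable TTY allocation.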
To stop the application and remove the containers defined in docker-compose.yml, run:
docker-compose down
While the containers are running, you can monitor the cluster through the Spark web UI. Port 9090, specified in docker-compose.yml, is exposed on your host machine, so you can reach the Spark Master UI by navigating to localhost:9090 in your web browser.
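The containers can take a few seconds to start, so the UI may not respond immediately after `docker-compose up -d`. As a rough sketch (the helper name `wait_for_ui` and its parameters are hypothetical, not part of this project), the following polls the address above until it answers with HTTP 200 or a timeout expires:

```python
import time
import urllib.request


def wait_for_ui(url="http://localhost:9090", timeout=60.0, interval=2.0,
                opener=urllib.request.urlopen):
    """Return True once url answers with HTTP 200, False after timeout seconds.

    opener is injectable so the function can be exercised without a network.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError and HTTPError are OSError subclasses: the UI is not
            # reachable (or not healthy) yet, so fall through and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_for_ui()` can be called before submitting the Spark application to confirm the master container is reachable from the host.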