This data analytics project uses Python libraries to cluster customers, explore the characteristics of each cluster, and provide insights for marketing strategies.

Data Source

The data source is a CSV file that contains information about 2000 customers, such as their age, gender, annual income, spending score, and segment. The data was obtained from Kaggle: Mall Customer Segmentation Data.

The data analysis process consists of the following steps:
- Data cleaning: check for missing values, outliers, and duplicates, and handle them accordingly.
- Data exploration: perform descriptive statistics and visualizations to understand the distribution and relationship of the variables.
- Data preprocessing: scale the numerical variables and encode the categorical variables for clustering.
- Clustering: apply the K-Means algorithm, choose the number of clusters using the elbow method and silhouette score, and assign each customer to a cluster.
- Cluster interpretation: analyze the characteristics of each cluster and provide insights for marketing strategies.
- Prediction: train classifiers such as KNN, Random Forest, and Decision Tree (J48).
- Analysis: evaluate the predictions with a confusion matrix, accuracy score, recall, precision, and RMSE (the square root of the mean squared error).
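
The preprocessing, clustering, and prediction steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it uses synthetic data in place of the Kaggle CSV, the column names (`age`, `annual_income`, `spending_score`) are hypothetical, and it assumes the classifiers are trained to predict the K-Means cluster label, which the description above does not state. scikit-learn's `DecisionTreeClassifier` implements CART, so it stands in for J48 (Weka's C4.5) only approximately.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, silhouette_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customer CSV (hypothetical columns and values):
# three well-separated groups of 50 customers each.
rng = np.random.default_rng(0)
centers = np.array([[25, 30, 80], [45, 90, 20], [60, 50, 50]])
rows = np.vstack([c + rng.normal(scale=3.0, size=(50, 3)) for c in centers])
df = pd.DataFrame(rows, columns=["age", "annual_income", "spending_score"])

# Preprocessing: scale so each variable contributes equally to distances.
X = StandardScaler().fit_transform(df)

# Clustering: record inertia (for an elbow plot) and silhouette score per k.
inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, model.labels_)

# Pick the k with the highest silhouette score and assign each customer.
best_k = max(silhouettes, key=silhouettes.get)
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0)
df["cluster"] = final_model.fit_predict(X)

# Cluster interpretation: per-cluster means in the original units.
print(df.groupby("cluster").mean().round(1))

# Prediction (assumption: the target is the cluster label).
y = df["cluster"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),  # CART, not J48
}

# Analysis: confusion matrix, accuracy, and macro-averaged precision/recall.
accuracies = {}
for name, clf in classifiers.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print("accuracy:", accuracies[name])
    print("precision:", precision_score(y_test, y_pred, average="macro"))
    print("recall:", recall_score(y_test, y_pred, average="macro"))
```

Plotting `inertias` against `k` gives the elbow curve. On real customer data the elbow and the silhouette maximum may disagree, in which case the choice of `k` is a judgment call.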
The project uses the following Python libraries:
- pandas: for data manipulation and analysis
- numpy: for numerical computation
- matplotlib: for data visualization
- seaborn: for data visualization
- sklearn: for data preprocessing and clustering
To run the project, you need to have Python 3 and the above-mentioned libraries installed. You can use any Python IDE or notebook environment, such as Jupyter Notebook, to open and run the Clustering_Linda_Z.ipynb file. Alternatively, you can clone or download the GitHub repository and run the file from your local machine.
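
Before opening the notebook, a quick check like the one below can confirm that the required libraries are importable. It is only a convenience sketch: the `required` list mirrors the libraries named above, and `sklearn` is the import name of the scikit-learn package.

```python
import importlib.util

# Import names of the libraries this project lists as requirements.
required = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn"]

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```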