This work involves two subtasks: assessing clustering results using all input variables and applying PCA for dimensionality reduction to improve understanding of multi-dimensional problems.

LuciferDIot/Vehicle-Clustering-and-Dimensionality-Reduction

Vehicle Clustering and Dimensionality Reduction

1st Subtask Objectives:

a. Pre-processing: Before running k-means, scale the data and detect and remove outliers. The order of these two steps matters. Outlier removal is not covered in the tutorials, so you need to explore it yourself.
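As a rough illustration of the pre-processing step, the sketch below drops rows outside the 1.5 × IQR fences and then standardises the remaining rows. The IQR rule, the helper name `inside_fences`, the choice to remove outliers before scaling, and the use of the built-in `iris` data as a stand-in for "vehicles.xls" are all assumptions made for this example, not necessarily what the repository's scripts do.

```r
# Stand-in data: the numeric columns of the built-in iris dataset
# (assumption; the repository works on vehicles.xls instead).
x <- iris[, 1:4]

# Hypothetical helper: TRUE for rows where every column lies inside
# [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
inside_fences <- function(df) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(df)) {
    q <- quantile(df[[col]], c(0.25, 0.75))
    fence <- 1.5 * (q[2] - q[1])
    keep <- keep & df[[col]] >= q[1] - fence & df[[col]] <= q[2] + fence
  }
  keep
}

x_clean <- x[inside_fences(x), ]

# Scale after removing outliers so that extreme values do not distort
# the column means and standard deviations used for standardisation.
x_scaled <- scale(x_clean)

nrow(x) - nrow(x_clean)        # number of rows dropped
round(colMeans(x_scaled), 10)  # each column is centred on 0
```

Removing outliers first means the scaling parameters are estimated from the cleaned rows only, which is one reason the order of the two steps can change the result.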

b. Determining the number of cluster centers: Use four automated tools (NbClust, the elbow method, the gap statistic, and the silhouette method) to determine the optimal number of clusters. Provide the related R outputs and your discussion of these outcomes in the report.
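Two of the four tools can be sketched with base R and the recommended `cluster` package alone. The example below computes the elbow curve by hand and the gap statistic with `clusGap()`. NbClust is not shown because it needs an extra package. The stand-in data, the range of k, the seed, and the `Tibs2001SEmax` selection rule are assumptions for illustration.

```r
library(cluster)  # recommended package shipped with R; provides clusGap()

set.seed(42)
x_scaled <- scale(iris[, 1:4])  # stand-in data (assumption)

# Elbow method: total within-cluster sum of squares for k = 1..8.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(x_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")

# Gap statistic: compares log(WSS) with its expectation under a
# reference distribution, estimated from B bootstrap samples.
gap <- clusGap(x_scaled, FUNcluster = kmeans, K.max = 8, B = 50,
               nstart = 25, verbose = FALSE)

# Pick k with the rule proposed by Tibshirani et al. (2001).
k_gap <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
k_gap
```

The elbow is read off the plot by eye, whereas `maxSE()` returns a single k, so the two tools may well disagree; that disagreement is what the discussion in the report is meant to address.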

c. K-means clustering investigation: Perform k-means analysis using all input variables and the most favored "k" value from the automated tools. Show the related R-based kmeans output, including information for the centers, clustered results, and the ratio of Between-Cluster Sums of Squares (BSS) over Total Sum of Squares (TSS). Calculate and illustrate the BSS and Within-Cluster Sums of Squares (WSS) indices as internal evaluation metrics.
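The quantities named above are all available on the object returned by base R's `kmeans()`. A minimal sketch, assuming stand-in data and an illustrative k = 3 (not the value chosen in the report):

```r
set.seed(42)
x_scaled <- scale(iris[, 1:4])  # stand-in data (assumption)
k <- 3                          # illustrative value only

fit <- kmeans(x_scaled, centers = k, nstart = 25)

fit$centers         # cluster centres in the scaled variable space
table(fit$cluster)  # number of observations assigned to each cluster

bss <- fit$betweenss     # between-cluster sum of squares
wss <- fit$tot.withinss  # total within-cluster sum of squares
tss <- fit$totss         # total sum of squares (BSS + WSS)

bss / tss  # share of total variation explained by the clustering
```

A ratio closer to 1 means the clusters account for more of the total variation, but it also rises as k grows, so it is best read alongside the other indices.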

d. Silhouette plot: Provide the silhouette plot, which displays the closeness of each point in one cluster to points in neighboring clusters. Include the average silhouette width score and your discussion on the quality of the obtained clusters.
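A silhouette plot and its average width can be produced with `silhouette()` from the `cluster` package. The sketch below again assumes stand-in data and k = 3 purely for illustration.

```r
library(cluster)  # provides silhouette()

set.seed(42)
x_scaled <- scale(iris[, 1:4])  # stand-in data (assumption)
fit <- kmeans(x_scaled, centers = 3, nstart = 25)

# One silhouette width per observation, based on Euclidean distances.
sil <- silhouette(fit$cluster, dist(x_scaled))
plot(sil, main = "Silhouette plot (stand-in data)")

# Average silhouette width over all observations.
avg_width <- mean(sil[, "sil_width"])
avg_width
```

Widths near 1 indicate points that sit well inside their cluster, values near 0 indicate points on a boundary, and negative values suggest a point may be assigned to the wrong cluster.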

2nd Subtask Objectives:

e. PCA method: Apply Principal Component Analysis (PCA) to the vehicle dataset and show all R outputs related to the analysis, including the eigenvalues and eigenvectors and the cumulative score per principal component (PC). Create a new "transformed" dataset with the principal components as attributes, keeping the PCs that together give a cumulative score greater than 92%. Provide a brief discussion of your choice.
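The PCA step might look like the sketch below, which uses base R's `prcomp()`. The stand-in data and the variable names are assumptions; the 92% threshold comes from the task description.

```r
x <- iris[, 1:4]  # stand-in data (assumption)

# Centre and scale each variable before extracting components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

eigenvalues  <- pca$sdev^2    # variance carried by each PC
eigenvectors <- pca$rotation  # loadings, one column per PC

# Cumulative proportion of variance explained.
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of PCs whose cumulative score exceeds 92%.
n_pc <- which(cum_var > 0.92)[1]

# Transformed dataset with the retained PCs as attributes.
x_pca <- as.data.frame(pca$x[, 1:n_pc, drop = FALSE])

summary(pca)  # standard deviation and cumulative proportion per PC
head(x_pca)
```

Using `drop = FALSE` keeps `x_pca` two-dimensional even if a single component is enough to pass the threshold.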

f. Determining the number of clusters for the PCA-based dataset: Apply the four automated tools to the new PCA-based dataset. Provide the related R-outputs and your discussion on the outcomes.

g. K-means clustering on the PCA-based dataset: Perform k-means analysis using the most favored k from the automated tools. Show the related R-based kmeans output, including information for the centers, clustered results, and the ratio of BSS over TSS. Calculate and illustrate the BSS and WSS indices as internal evaluation metrics.

h. Silhouette plot on the PCA-based dataset: Provide the silhouette plot for evaluating the quality of the obtained clusters. Include the average silhouette width score and your discussion on the plot.

i. Calinski-Harabasz Index: Implement and illustrate the Calinski-Harabasz Index as another internal evaluation metric. Provide a brief discussion on the outcome of this index.
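The Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = (BSS / (k - 1)) / (WSS / (n - k)). A hand-rolled sketch follows; the helper name `calinski_harabasz`, the stand-in data, the choice of two PCs, and the range of k are assumptions for illustration.

```r
# Hypothetical helper: CH index from a fitted kmeans object and the
# number of observations n.
calinski_harabasz <- function(fit, n) {
  k <- nrow(fit$centers)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

set.seed(42)
# Stand-in PCA-based dataset: first two PCs of the scaled iris data.
x_pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)$x[, 1:2]

# CH index for k = 2..6; higher values indicate better-separated,
# more compact clusters.
ch <- sapply(2:6, function(k) {
  calinski_harabasz(kmeans(x_pca, centers = k, nstart = 25), nrow(x_pca))
})
names(ch) <- 2:6
ch

plot(2:6, ch, type = "b",
     xlab = "Number of clusters k", ylab = "Calinski-Harabasz index")
```

The index is undefined for k = 1 because the numerator divides by k - 1, which is why the loop starts at 2.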

Usage

  1. Install R and RStudio on your machine.
  2. Clone this repository to your local system or download the code files.
  3. Open the RStudio project for this repository.
  4. Ensure that the "vehicles.xls" dataset is available in the working directory.
  5. Run the code file "1st_Subtask.R" to execute the tasks related to the 1st subtask.
  6. Review the generated outputs, including the pre-processing tasks, automated tools results, k-means analysis, and silhouette plot.
  7. Proceed to the 2nd subtask by running the code file "2nd_Subtask.R".
  8. Examine the outputs related to PCA analysis, automated tools for determining the number of clusters, k-means analysis on the PCA-based dataset, silhouette plot, and Calinski-Harabasz Index.
  9. Review the report, which includes the discussion and interpretation of the results for each subtask.

Appendix

The appendix section of the report provides the full code developed for all the tasks mentioned above. It includes the necessary functions, libraries, and step-by-step instructions for executing the analysis.

Dependencies

The code in this repository has the following dependencies:

  • R (version 4.2.2, Windows build)
  • RStudio (version 2023.03.0+386 "Cherry Blossom")
  • Required R packages (list the required packages and versions)

Make sure to install the required dependencies before running the code.

License

This project is licensed under the MIT License. Feel free to explore, modify, and use the code in this repository according to the terms of the license.

© 2023 Pasindu Geevinda. All rights reserved.
