K-means clustering is a popular method for categorizing data into clusters based on similarity. Its efficacy can be influenced by various factors, one of which could be missing data. Understanding how missing data affects the K-means algorithm is crucial for its application in real-world scenarios where complete data might not always be available.

GabrielJobert/Simulation_paper---Effect_of_missing_data_on_K-means_performance---MATH60603A_STATISTICAL_LEARNING

Simulation Paper on Statistical Learning

Author

Gabriel Jobert

Date

2023-11-05

Description

This repository contains the R Markdown file and associated resources for a simulation study. The paper uses a Monte Carlo simulation to evaluate how missing data affects the performance and stability of statistical learning methods, with a focus on K-means clustering.
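To make the Monte Carlo idea concrete, here is a minimal sketch in base R. It is not the code from the paper. The cluster layout, the missing-completely-at-random mechanism, mean imputation, the missing rates, and the helper names `adjusted_rand` and `run_once` are all assumptions chosen for illustration.

```r
# Minimal Monte Carlo sketch (base R only). All settings below are illustrative
# assumptions, not the design used in the paper.
set.seed(42)

# Adjusted Rand index between two label vectors, from their contingency table.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  sum_cells <- sum(choose(tab, 2))
  sum_rows  <- sum(choose(rowSums(tab), 2))
  sum_cols  <- sum(choose(colSums(tab), 2))
  expected  <- sum_rows * sum_cols / choose(length(a), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# One replication: simulate 3 Gaussian clusters, delete values completely at
# random, impute column means, run K-means, and score against the true labels.
run_once <- function(missing_rate, n_per_cluster = 50) {
  centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
  truth <- rep(1:3, each = n_per_cluster)
  x <- centers[truth, ] + matrix(rnorm(2 * length(truth)), ncol = 2)

  # Hypothetical missingness mechanism: each cell is dropped independently.
  mask <- matrix(runif(length(x)) < missing_rate, nrow = nrow(x))
  x[mask] <- NA

  # Simple mean imputation (an assumed strategy, chosen for brevity).
  for (j in seq_len(ncol(x))) {
    x[is.na(x[, j]), j] <- mean(x[, j], na.rm = TRUE)
  }

  fit <- kmeans(x, centers = 3, nstart = 10)
  adjusted_rand(truth, fit$cluster)
}

# Average the score over 20 replications for each missing rate.
rates <- c(0, 0.1, 0.3)
mean_ari <- sapply(rates, function(r) mean(replicate(20, run_once(r))))
names(mean_ari) <- rates
print(round(mean_ari, 3))
```

Comparing the averaged adjusted Rand index across missing rates gives a rough picture of how clustering quality changes as more values are removed. A fuller study would also vary the imputation method and the missingness mechanism.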

Objective

The primary objective of this study is to provide insights into how missing data can affect the outcome and reliability of different statistical learning techniques.

Methods and Libraries Used

  • R Programming: The analysis and simulations are conducted using R.
  • Libraries:
    • MASS: For support with statistical functions.
    • caret: For data splitting, pre-processing, feature selection, etc.
    • cluster & mclust: For clustering analysis.
    • infotheo: For information theory-based methods.
    • RANN: For approximate nearest neighbor techniques.
    • ggplot2: For creating elegant data visualisations.
    • dplyr: For data manipulation.
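Before running the study, it may help to check which of the listed packages are already installed. The snippet below is a small sketch for that; the variable names are illustrative and the installation line is left commented out so that nothing is installed without your consent.

```r
# Packages listed in this README.
required <- c("MASS", "caret", "cluster", "mclust",
              "infotheo", "RANN", "ggplot2", "dplyr")

# Compare against the packages available in the current R library paths.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(required, installed)

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All listed packages are installed.")
}
```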

File Structure

  • Simulation Paper.Rmd: The main R Markdown file containing the study, code, and documentation.
  • HTML version: The paper rendered to HTML from the Rmd file (recommended over the PDF version).
  • PDF version: The paper rendered to PDF.

How to Use

  1. Ensure you have R and the necessary libraries installed.
  2. Clone this repository.
  3. Open the Simulation Paper.Rmd file in an R environment like RStudio.
  4. Run the code chunks sequentially to reproduce the analysis and view the results.
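As an alternative to running the chunks one by one, the whole document can usually be rendered in a single call. This is a sketch under two assumptions that this README does not state: that the rmarkdown package is installed, and that the working directory is the root of the cloned repository.

```r
# Path to the main document, relative to the repository root (assumed).
rmd_file <- "Simulation Paper.Rmd"

# Render only when the file and the rmarkdown package are both available.
if (file.exists(rmd_file) && requireNamespace("rmarkdown", quietly = TRUE)) {
  rmarkdown::render(rmd_file, output_format = "html_document")
} else {
  message("Skipping render: check the working directory and that rmarkdown is installed.")
}
```

Rendering runs every chunk in a fresh session, which is a quick way to confirm that the analysis reproduces from start to finish.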

This README is a brief overview of the 'Simulation Paper on Statistical Learning'. For more detailed information, please refer to the Rmd file itself.
