K-means-Clustering-Using-Mapreduce

In this project we model a simple unsupervised machine learning algorithm, k-means clustering, using MapReduce for letter recognition. Although k-means is unsupervised, we frame the problem as a semi-supervised task by cross-checking the test data labels against the labels assigned by the clustering model. We use the publicly available Letter Recognition Data, which comprises 16 manually engineered features extracted from 20,000 images of English letters (A-Z).
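
The cross-check described above can be sketched as follows. This is a minimal illustration, not the repository's evaluation code: it assumes each cluster is labeled with the majority letter among the training points assigned to it, and that accuracy is the fraction of test points whose cluster letter matches their true letter. The function names (`label_clusters`, `accuracy`) and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, true_labels):
    """Give each cluster the most common true label among its members (assumed rule)."""
    members = defaultdict(Counter)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid][label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}

def accuracy(cluster_ids, true_labels, cluster_to_label):
    """Fraction of points whose cluster label equals their true label."""
    hits = sum(cluster_to_label.get(cid) == label
               for cid, label in zip(cluster_ids, true_labels))
    return hits / len(true_labels)

# Toy data: cluster ids from a hypothetical clustering run, plus known letters.
train_clusters = [0, 0, 0, 1, 1, 2]
train_letters = ["A", "A", "B", "B", "B", "C"]
mapping = label_clusters(train_clusters, train_letters)

test_clusters = [0, 1, 2, 2]
test_letters = ["A", "B", "C", "A"]
score = accuracy(test_clusters, test_letters, mapping)
print(mapping, score)
```

With 26 clusters and 26 letters, several clusters may end up with the same majority letter under this rule, which is one reason a confusion matrix is a useful companion to the accuracy figure.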

What is this project?

We create 26 clusters from the given n% training data using the k-means clustering algorithm, then evaluate the clusters through a classification task: predicting the class labels (letters) of the test data. The following is a pictorial representation of the entire pipeline:

About Files:

sample.py: Generates the training and test data from the given data, along with the 26 initial cluster centroids. The program takes 'n' as input, where 'n' is the percentage of the data to use for training. It writes centroids.txt, train.txt, and test.txt to the output directory 'Datasets', which the user must create before running the program. (Only the training and centroid data are used in the k-means clustering algorithm.)
mapper.py: The mapper function maps each data instance to its cluster label.
reducer.py: The reducer function aggregates the cluster memberships of the data instances and computes the updated cluster centroid coordinates.
u_reducer.py: The u_reducer function aggregates the cluster memberships of the data instances and computes the updated cluster centroid coordinates along with updated labels.
model.sh: The shell script that drives the pipeline. Our k-means clustering algorithm runs mapper.py and reducer.py for I iterations, then runs mapper.py and u_reducer.py for I iterations. A single MapReduce job performs only one iteration, so the shell script runs the whole pipeline and avoids manual re-execution.
c_mapper.py and c_reducer.py: These use the test data to obtain cluster labels and cluster centroid coordinates.
evaluation.py: Running this file reports the accuracy (%) and a heatmap of the confusion matrix based on the predictions.
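
To make the mapper, reducer, and driver roles above concrete, here is a minimal sketch of k-means written in a map/reduce style. It runs in plain Python in memory rather than as a MapReduce job, and it is not the code from mapper.py, reducer.py, or model.sh. The function names, the use of Euclidean distance, and the toy 2-D data are assumptions made for illustration.

```python
import math
from collections import defaultdict

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance, assumed)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def map_phase(points, centroids):
    """Mapper role: emit (cluster_id, point) for every data instance."""
    for point in points:
        yield nearest_centroid(point, centroids), point

def reduce_phase(pairs, centroids):
    """Reducer role: average the points of each cluster to get updated centroids."""
    groups = defaultdict(list)
    for cid, point in pairs:
        groups[cid].append(point)
    updated = list(centroids)  # clusters with no members keep their old centroid
    for cid, members in groups.items():
        updated[cid] = tuple(sum(col) / len(members) for col in zip(*members))
    return updated

def kmeans(points, centroids, iterations):
    """Driver role: one map pass and one reduce pass per iteration."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids

# Toy 2-D example; the real data has 16 features and 26 centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
initial = [(1.0, 1.0), (9.0, 9.0)]
final = kmeans(points, initial, iterations=3)
print(final)
```

Because each pass needs the centroids produced by the previous pass, the loop has to live outside the map and reduce steps, which is the job the shell script does for the real pipeline.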

Dataset link : https://archive.ics.uci.edu/ml/datasets/Letter+Recognition

yoginim/K-means-Clustering-Using-Mapreduce