Group Members (in alphabetical order):
Jiahua Guo
Jiachen Song
Xinyuan Wang
Jiawei Zhuang
We choose MPI + OpenMP/OpenACC/CUDA as our heterogeneous computing environment.
Many huge data sets are now publicly available, and there are several ways to turn such large amounts of data into useful knowledge. Here we focus on exploratory data analysis, or unsupervised machine learning, which means finding structure in the data without labels or other prior knowledge.
Among unsupervised learning methods, k-means is one of the most commonly used algorithms. It partitions the observations into k clusters so that each observation belongs to the cluster with the nearest mean. Finding the global minimum of the k-means cost function is an NP-hard problem in general; it remains NP-hard even for k = 2 clusters in arbitrary dimension, and for dimension d = 2 with arbitrary k. Several heuristic methods have been developed to find a local minimum, but the process is still highly computationally intensive, especially with huge data sets. We want to implement a parallel version of a k-means heuristic method on a cluster of machines, to significantly reduce the computing time of the clustering process without reducing the accuracy of the clustering model.
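To make the idea concrete, here is a minimal serial sketch in Python of one common k-means heuristic, Lloyd's algorithm. The choice of Lloyd's algorithm is an assumption for illustration, since the heuristic has not been fixed yet, and all function names and parameters (kmeans, partial_sums, n_chunks, and so on) are hypothetical. The data is split into chunks that stand in for workers: each chunk produces per-cluster partial sums and counts, and these are combined before the centroids are updated. In a distributed version, that combining step is where an MPI reduction could plausibly go.

```python
import random


def assign(point, centroids):
    """Return the index of the centroid nearest to point (squared Euclidean)."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_d:
            best, best_d = j, d
    return best


def partial_sums(chunk, centroids):
    """Work done on one chunk: per-cluster coordinate sums and point counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        j = assign(point, centroids)
        counts[j] += 1
        for a in range(dim):
            sums[j][a] += point[a]
    return sums, counts


def kmeans(points, k, n_chunks=4, max_iter=100, seed=0):
    """Lloyd's algorithm with chunked partial sums (hypothetical sketch)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumption: initialize centroids with k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    # Each chunk stands in for the data owned by one worker.
    chunks = [points[i::n_chunks] for i in range(n_chunks)]

    for _ in range(max_iter):
        total_sums = [[0.0] * dim for _ in range(k)]
        total_counts = [0] * k
        for chunk in chunks:
            sums, counts = partial_sums(chunk, centroids)
            # Combine partial results; a reduction across workers in parallel code.
            for j in range(k):
                total_counts[j] += counts[j]
                for a in range(dim):
                    total_sums[j][a] += sums[j][a]
        # New centroid = mean of assigned points; keep the old one if cluster is empty.
        new_centroids = [
            [s / total_counts[j] for s in total_sums[j]]
            if total_counts[j] else centroids[j]
            for j in range(k)
        ]
        if new_centroids == centroids:  # no change: a local minimum was reached
            break
        centroids = new_centroids

    labels = [assign(p, centroids) for p in points]
    return centroids, labels
```

Calling kmeans(points, 2) on a list of coordinate tuples returns the final centroids and one cluster label per point. Because only the small per-cluster sums and counts need to be exchanged, and not the points themselves, communication cost in this scheme would depend on k and d rather than on the size of the data set.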
(Preliminary plan. Might change in the future.)
Hubway system data:
https://www.thehubway.com/system-data
Airbnb data:
https://data.beta.nyc/dataset/inside-airbnb-data