This is the repository for the Integer Optimization class at MIT. It has been developed by Marco Antonioli and Mattia Ravasio.
Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation $x_i \in \mathbb{R}^d$ is a $d$-dimensional real vector, the clustering problem asks to partition the $n$ observations into $k$ sets $S = \{S_1, \ldots, S_k\}$ so as to minimize the within-cluster sum of squared distances:

$$\min_{S} \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2,$$

where $\mu_j$ is the mean (centroid) of the points in $S_j$.
One of the most widely used methods to solve this problem is the K-means algorithm. It starts with a random placement of $k$ centroids, then enters a loop in which each point is assigned to the nearest centroid in the current set and each centroid is recomputed as the mean of the points assigned to it, until convergence.
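For reference, here is a minimal NumPy sketch of the K-means loop just described; the function name and parameters are ours:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm: random centroid init, then alternating
    assignment and centroid-update steps until the centroids stabilize."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (kept unchanged if the cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids
```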
Given the random initialization, the algorithm lacks provable optimality and presents stability issues. For this reason, we reformulate the problem as a Mixed Integer Optimization (MIO) problem and propose a method to solve it.
Using a data matrix $X \in \mathbb{R}^{n \times d}$ and binary assignment variables $y_{ij}$ (equal to 1 if point $i$ is assigned to cluster $j$), the clustering problem can be written as:

$$\min_{y} \sum_{i=1}^{n} \sum_{j=1}^{k} y_{ij} \left\lVert x_i - \frac{\sum_{l=1}^{n} y_{lj}\, x_l}{\sum_{l=1}^{n} y_{lj}} \right\rVert^2 \quad \text{s.t.} \quad \sum_{j=1}^{k} y_{ij} = 1 \;\; \forall i, \qquad y_{ij} \in \{0,1\}.$$

This formulation of the clustering problem presents non-linearities and a fractional objective, since each centroid appears as a ratio of decision variables. Introducing explicit centroid variables $c_j \in \mathbb{R}^d$ and distance variables $d_{ij} \geq 0$, the linearization of the problem becomes:

$$\min \sum_{i=1}^{n} \sum_{j=1}^{k} d_{ij} \quad \text{s.t.} \quad d_{ij} \geq \lVert x_i - c_j \rVert^2 - M(1 - y_{ij}) \;\; \forall i,j, \qquad \sum_{j=1}^{k} y_{ij} = 1 \;\; \forall i,$$

with $M$ a sufficiently large constant.
Note that this problem still presents quadratic terms, but they can easily be handled by the Gurobi optimizer using second-order cones.
This formulation takes care of the non-linearities within the problem by adding many variables. More specifically, we added $n \cdot k$ continuous distance variables and $k \cdot d$ centroid variables, together with $n \cdot k$ big-M constraints, so the model grows quickly with the number of points and clusters, which raises scalability concerns.
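As an illustration, the linearized model above could be written in gurobipy roughly as follows; the variable names, the big-M choice, and the exact placement of the quadratic terms are our assumptions, a sketch rather than a prescribed implementation:

```python
import gurobipy as gp
from gurobipy import GRB

def cluster_mio(X, k, time_limit=60):
    """Big-M linearization sketch: binary y[i,j] assigns point i to cluster j,
    c[j] are centroid coordinates, dist[i,j] bounds the active squared distance."""
    n, d = X.shape
    # Centroids are means of data points, so the squared distance from any point
    # to its centroid is bounded by the largest pairwise squared distance.
    M = max(((X[i] - X[l]) ** 2).sum() for i in range(n) for l in range(n))

    m = gp.Model("clustering_mio")
    m.Params.TimeLimit = time_limit
    y = m.addVars(n, k, vtype=GRB.BINARY, name="y")
    c = m.addVars(k, d, lb=-GRB.INFINITY, name="c")
    dist = m.addVars(n, k, lb=0.0, name="dist")

    # Every point belongs to exactly one cluster.
    m.addConstrs(y.sum(i, "*") == 1 for i in range(n))
    # dist[i,j] >= ||x_i - c_j||^2 whenever y[i,j] = 1; these convex quadratic
    # constraints are what Gurobi handles through second-order cones.
    m.addConstrs(
        dist[i, j]
        >= gp.quicksum((X[i, f] - c[j, f]) * (X[i, f] - c[j, f]) for f in range(d))
        - M * (1 - y[i, j])
        for i in range(n) for j in range(k)
    )
    m.setObjective(dist.sum(), GRB.MINIMIZE)
    m.optimize()
    return m
```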
To tackle the scalability issue we developed a set partitioning reformulation of the problem, in which each partition is a cluster. Let $\mathcal{S}$ denote the collection of all possible clusters (non-empty subsets of the $n$ points), let $c_S$ be the cost of cluster $S$ (the sum of squared distances from its points to its centroid), and let $z_S \in \{0,1\}$ indicate whether cluster $S$ is selected. We minimize the total cost:

$$\min_{z} \sum_{S \in \mathcal{S}} c_S z_S \quad \text{s.t.} \quad \sum_{S \in \mathcal{S} : \, i \in S} z_S = 1 \;\; \forall i, \qquad \sum_{S \in \mathcal{S}} z_S = k, \qquad z_S \in \{0,1\}.$$
This reformulation, which looks deceptively simple, hides its complexity in the set $\mathcal{S}$: the number of possible clusters is exponential in $n$ ($2^n - 1$ non-empty subsets).
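A sketch of the restricted version of this master problem (its LP relaxation, whose duals we need for column generation below) might look as follows in gurobipy; the data layout, a list of index sets with precomputed costs, is our assumption:

```python
import gurobipy as gp
from gurobipy import GRB

def solve_master(clusters, costs, n, k):
    """LP relaxation of the restricted master problem: pick clusters from a
    given pool so every point is covered exactly once and k clusters are used.
    `clusters` is a list of frozensets of point indices, `costs` their
    within-cluster sums of squared distances."""
    m = gp.Model("restricted_master")
    m.Params.OutputFlag = 0
    z = m.addVars(len(clusters), lb=0.0, name="z")  # relaxed z_S >= 0
    cover = m.addConstrs(
        (gp.quicksum(z[s] for s, S in enumerate(clusters) if i in S) == 1
         for i in range(n)),
        name="cover",
    )
    card = m.addConstr(z.sum() == k, name="cardinality")
    m.setObjective(gp.quicksum(costs[s] * z[s] for s in range(len(clusters))),
                   GRB.MINIMIZE)
    m.optimize()
    duals = [cover[i].Pi for i in range(n)]  # lambda_i
    sigma = card.Pi                          # dual of the cardinality constraint
    return m, duals, sigma
```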
This formulation presents an exponential number of variables, so to solve it we propose a column generation algorithm. We first solve the linear relaxation of the Master Problem (MP) over a restricted set of candidate clusters $\mathcal{S}' \subset \mathcal{S}$, then use the resulting dual values to search for new clusters (columns) with negative reduced cost, adding them to $\mathcal{S}'$ and repeating until no improving column is found.
In this context, the pricing subproblem is:

$$\min_{S \subseteq \{1,\ldots,n\},\, S \neq \emptyset} \; \sum_{i \in S} \lVert x_i - \mu_S \rVert^2 \; - \sum_{i \in S} \lambda_i \; - \; \sigma,$$

where $\mu_S$ is the centroid of $S$, $\lambda_i$ are the duals of the covering constraints, and $\sigma$ is the dual of the cardinality constraint.
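Concretely, given the duals of the restricted master LP, the reduced cost of any candidate cluster can be evaluated in closed form; here is a small sketch (names are ours):

```python
import numpy as np

def reduced_cost(S, X, duals, sigma):
    """Reduced cost of a candidate cluster S (a set of point indices):
    c_S - sum_{i in S} lambda_i - sigma. A negative value means the column
    would improve the current restricted master LP."""
    pts = X[list(S)]
    mu = pts.mean(axis=0)          # cluster centroid
    c_S = ((pts - mu) ** 2).sum()  # within-cluster squared error
    return c_S - sum(duals[i] for i in S) - sigma
```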
The subproblem formulation is itself as complicated as the full linearization of the MIO problem P. For this reason, we developed a heuristic method that obtains "good" clustering solutions very efficiently and adds them iteratively to the restricted set of clusters $\mathcal{S}'$.
At every step we run the k-means algorithm with different random initializations to generate a large number of valid columns (each initialization with $k$ clusters yields $k$ candidate columns).
This heuristic is fast and generates valid columns, but it sacrifices the optimality guarantee of solving the subproblem exactly. On the other hand, by generating clusters with k-means we naturally avoid producing strictly dominated columns, such as obviously bad clustering decisions. The effectiveness of the heuristic lies in generating a large number of valid columns at every iteration, so that the restricted master problem can choose among many reasonable candidate clusters even though no single column is provably optimal.
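A sketch of this heuristic pricing step, using scikit-learn's KMeans together with the `reduced_cost` helper above, could look like this; the number of initializations and the filtering rule are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_columns(X, k, duals, sigma, n_inits=20, seed=0):
    """Heuristic pricing: run k-means from many random initializations and
    keep every resulting cluster with negative reduced cost as a new column."""
    new_cols = []
    for it in range(n_inits):
        km = KMeans(n_clusters=k, init="random", n_init=1,
                    random_state=seed + it).fit(X)
        for j in range(k):  # each run yields k candidate clusters
            S = frozenset(np.where(km.labels_ == j)[0])
            if S and reduced_cost(S, X, duals, sigma) < 0:
                new_cols.append(S)
    return new_cols
```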
Over the course of the experiments we saw that our heuristic algorithm tends to outperform K-means on a variety of instances. A central factor is the standard deviation of the clusters: increasing the standard deviation produces more difficult clustering problems, and it is in those scenarios that the proposed column generation heuristic outperforms the traditional approach by the widest margin.
The computational time taken by our algorithm depends on the difficulty of the instance, and only on very complex instances does our method take up to 5 times as long as K-means. Given the very low computational time of K-means, our algorithm converges to a solution in under 2 seconds even for instances with 1000 data points, 5 clusters, and high standard deviation. Below is an example of an instance with 1000 points, 3 clusters, and a standard deviation of 1.5, where our formulation outperforms K-means:
As can be seen, the solution proposed by K-means is clearly not the best one, while our algorithm correctly separates the data into what is visually the right clustering assignment. Below is another example where our formulation outperforms K-means, with 1000 points, 4 clusters, and a standard deviation of 1.5. In this second example we can see a different clustering choice: our algorithm generates two clusters in the middle blob, while K-means does so in the lower-left blob.
We now focus our attention on the performance of our column generation algorithm against K-means across different problem instances. We experimented with standard deviations of 1.5, 3, and 4.5; numbers of points of 100, 300, 500, 700, and 1000; and numbers of clusters of 2, 3, 4, and 5. For every combination of parameters we ran 20 different problems and averaged the results to better understand the behaviour of the heuristic under the different scenarios. Over all of our tests we saw an average decrease of 2% in the cost. The tables below present the results: each set of tables corresponds to a different standard deviation value, the columns are the number of points, and the rows are the number of clusters. Each value is the sum of squared distances between the points and their assigned centroids, averaged across the 20 runs with the same setting.
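For reproducibility, the K-means baseline for one table cell could be generated along these lines; we use scikit-learn's `make_blobs` as a stand-in for the instance generator, which is our assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def benchmark_cell(n_points, k, std, n_runs=20, seed=0):
    """Average k-means cost over n_runs random instances with the given
    number of points, number of clusters, and standard deviation."""
    costs = []
    for r in range(n_runs):
        X, _ = make_blobs(n_samples=n_points, centers=k,
                          cluster_std=std, random_state=seed + r)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed + r).fit(X)
        costs.append(km.inertia_)  # sum of squared distances to assigned centroids
    return float(np.mean(costs))

# e.g. benchmark_cell(1000, 3, 1.5) gives the K-means baseline for one cell
```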