
Change output type of clusters-kmeans #850

Merged (8 commits) on Jul 21, 2017
Conversation

@DenisCarriere (Member) commented Jul 20, 2017

Updates to @turf/clusters-kmeans

  • Outputs simple FeatureCollection (with additional properties cluster)
  • Dropped the "numberOfClusters can't be greater than the number of points" error (numberOfClusters now falls back to the point count instead of throwing, since the user might be passing multiple FeatureCollections and the total number of features is unknown/variable)
  • Update typescript definition
  • Update tests
  • Dropped making centroids in index.js
  • Added clone to prevent any input mutation
  • Add Centroid & Concave polygons to test outputs using clusterEach (visual/testing purposes only)
  • Added Input mutation tests & add mutate param to allow input to be mutated
  • (Optional) Allow user input to be mutated via parameter (mutation should be disabled by default)?

JSDocs

/**
 * Takes a set of {@link Point|points} and partitions them into clusters using the
 * [k-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering).
 *
 * @name clustersKmeans
 * @param {FeatureCollection<Point>} points - points to be clustered
 * @param {number} [numberOfClusters=Math.sqrt(numberOfPoints/2)] - number of clusters that will be generated
 * @param {boolean} [mutate=false] - allows GeoJSON input to be mutated (significant performance increase if true)
 * @returns {FeatureCollection<Point>} clustered points with two additional properties attached to each Feature:
 * - {number} cluster - the associated clusterId
 * - {[number, number]} centroid - Centroid of the cluster [Longitude, Latitude]
 * @example
 * // create 100 random points within a bounding box
 * var points = turf.random('point', 100, {
 *   bbox: [0, 30, 20, 50]
 * });
 * var numberOfClusters = 7;
 * var clustered = turf.clustersKmeans(points, numberOfClusters);
 *
 * //addToMap
 * var addToMap = clustered;
 */
module.exports = function (points, numberOfClusters, mutate) {
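For context, the behavior the JSDoc describes can be sketched end to end. This is not the actual @turf/clusters-kmeans source (which delegates the clustering to the skmeans library and uses @turf/clone); it is a self-contained approximation that, like the current PR, seeds the centroids with the first numberOfClusters points and uses a JSON round-trip in place of clone:

```javascript
// Hedged sketch of the documented behavior, not the real module internals.
function clustersKmeans(points, numberOfClusters, mutate) {
    var count = points.features.length;
    // default documented above: Math.sqrt(numberOfPoints / 2)
    if (numberOfClusters === undefined) numberOfClusters = Math.round(Math.sqrt(count / 2));
    // fall back to the point count rather than throwing
    if (numberOfClusters > count) numberOfClusters = count;
    // clone to prevent input mutation unless the caller opts out
    if (mutate !== true) points = JSON.parse(JSON.stringify(points));

    var coords = points.features.map(function (f) { return f.geometry.coordinates; });
    // non-random seeding: first k points, as the PR currently does
    var centroids = coords.slice(0, numberOfClusters).map(function (c) { return c.slice(); });
    var labels = new Array(count);

    for (var iter = 0; iter < 100; iter++) { // fixed iteration cap, no convergence test
        // assignment step: nearest centroid by squared euclidean distance
        coords.forEach(function (c, i) {
            var best = 0, bestDist = Infinity;
            centroids.forEach(function (ctr, k) {
                var d = Math.pow(c[0] - ctr[0], 2) + Math.pow(c[1] - ctr[1], 2);
                if (d < bestDist) { bestDist = d; best = k; }
            });
            labels[i] = best;
        });
        // update step: move each centroid to the mean of its members
        var sums = centroids.map(function () { return [0, 0, 0]; });
        coords.forEach(function (c, i) {
            var s = sums[labels[i]];
            s[0] += c[0]; s[1] += c[1]; s[2]++;
        });
        centroids = sums.map(function (s, k) {
            return s[2] ? [s[0] / s[2], s[1] / s[2]] : centroids[k];
        });
    }

    // attach the two output properties described in the JSDoc
    points.features.forEach(function (f, i) {
        f.properties.cluster = labels[i];
        f.properties.centroid = centroids[labels[i]];
    });
    return points;
}
```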

Example

[screenshot of the example output]

@DenisCarriere (Member, Author)

@morganherlocker These changes might help in understanding the other modules (getCluster, clusterEach, clusterReduce); those methods might have been confusing if you looked at them without these changes.

@DenisCarriere DenisCarriere self-assigned this Jul 20, 2017
@stebogit (Collaborator)
I love the colors in the test output, so festive! 🎉 😄

@stebogit (Collaborator) commented Jul 20, 2017

@DenisCarriere although the module's output might still be acceptable, there is an issue I have been thinking about in the last few days.
Currently the choice of the initial centroids is not random as it's supposed to be in the actual K-means algorithm; in fact I set them arbitrarily as the first numberOfClusters points of the input. This guarantees a consistent output for each input (thus all the tests pass).
This algorithm is known to be extremely fast (compared to other algorithms like DBSCAN, for instance) but also likely to converge to a local optimum (basically a "wrong" clustering). In practice this means the output is generally not the same from run to run.
There are some solutions that attempt to improve the probability of the algorithm reaching the global optimum by selecting an appropriate set of initial centroids, so I suggested the author of skmeans add the K-means++ algorithm (which seems to be one of the most popular and effective) to the library. That feature is now available, although it doesn't seem to bring significant improvements in our test cases (maybe because the implementation has not been extensively tested, or maybe because our tests are not complex enough to show its benefit).

All the above to say that I believe we should at least use the default random selection of initial centroids (provided by skmeans), or use the optional k-means++ algorithm, to truly implement the algorithm this Turf module is named after. However, this would require a smarter (and less pretty 😞 😅 ) way to test the output, since it's most likely to vary every time. I believe the new @turf/clusters module will come in handy for this purpose.
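The k-means++ seeding referenced here can be sketched in isolation: each new initial centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, which makes convergence to the global optimum more likely than purely uniform random seeds. The helper name kmeansppSeeds below is made up for illustration; skmeans's own implementation may differ in detail:

```javascript
// Illustrative k-means++ seeding, not skmeans code. `random` is injectable
// so the sketch can be exercised deterministically.
function kmeansppSeeds(coords, k, random) {
    random = random || Math.random;
    // first centroid: a uniformly random point
    var centroids = [coords[Math.floor(random() * coords.length)].slice()];
    while (centroids.length < k) {
        // squared distance from each point to its nearest chosen centroid
        var d2 = coords.map(function (c) {
            return Math.min.apply(null, centroids.map(function (ctr) {
                return Math.pow(c[0] - ctr[0], 2) + Math.pow(c[1] - ctr[1], 2);
            }));
        });
        var total = d2.reduce(function (a, b) { return a + b; }, 0);
        // sample an index with probability proportional to d2
        var r = random() * total;
        var i = 0;
        while (r > d2[i] && i < coords.length - 1) { r -= d2[i]; i++; }
        centroids.push(coords[i].slice());
    }
    return centroids;
}
```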

We might define as valid an output that:

  • has the expected number of centroids
  • has centroids that (if I'm not mistaken) coincide with the centers of mass of the corresponding clusters
  • labels all the points

which is basically all you can/should expect from k-means.

@stebogit (Collaborator) left a review

neat! 👍

if (numberOfClusters > count) numberOfClusters = count;

// Clone points to prevent any mutations
points = clone(points, true);
@stebogit (Collaborator):

Shouldn't we let the user decide whether the input should change or (by default) not, like we did for the transform modules? I believe it's likely the user wants the clustering properties added to the input.

@DenisCarriere (Member, Author):

We could always add a mutate parameter (to allow input mutations).

It would significantly increase performance if the user doesn't mind the input being mutated.

We might be able to handle this without the use of clone, but it was added as a safety measure.
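The opt-in mutation pattern under discussion (clone by default, skip the clone only on explicit request, as in the transform modules) can be sketched generically. withOptionalClone is a hypothetical helper, and a JSON round-trip stands in for @turf/clone:

```javascript
// Sketch of the mutate-parameter pattern: safe by default, fast on opt-in.
function withOptionalClone(geojson, mutate, work) {
    // deep clone unless the caller explicitly allows mutation
    if (mutate !== true) geojson = JSON.parse(JSON.stringify(geojson));
    work(geojson); // modifies the (possibly cloned) object in place
    return geojson;
}

var input = {type: 'Feature', properties: {}, geometry: {type: 'Point', coordinates: [0, 0]}};
var out = withOptionalClone(input, false, function (f) { f.properties.cluster = 0; });
// input.properties stays empty; out.properties.cluster === 0
```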

Benchmark results

without clone

fiji x 260,987 ops/sec ±1.03% (87 runs sampled)
many-points x 173 ops/sec ±7.86% (77 runs sampled)
points-with-properties x 245,562 ops/sec ±1.54% (86 runs sampled)
points1 x 71,572 ops/sec ±0.86% (88 runs sampled)
points2 x 40,784 ops/sec ±1.06% (86 runs sampled)

with clone (~6 times slower)

fiji x 41,566 ops/sec ±1.18% (89 runs sampled)
many-points x 145 ops/sec ±2.42% (79 runs sampled)
points-with-properties x 38,453 ops/sec ±1.22% (83 runs sampled)
points1 x 12,598 ops/sec ±1.09% (91 runs sampled)
points2 x 8,124 ops/sec ±0.97% (89 runs sampled)

@stebogit (Collaborator):

Yes, that's what I had in mind 👍

@DenisCarriere (Member, Author):

👍 Yep, I didn't realize how "slow" it was after adding clone; that JSON.parse + JSON.stringify really takes a toll on the performance benchmarks.
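A rough way to see the toll being described: time a JSON round-trip deep clone of a modest FeatureCollection against plain in-place writes on the same object. Absolute numbers are machine-dependent; only the gap between the two matters:

```javascript
// Micro-timing sketch of the clone overhead; not the project's benchmark
// suite (which uses Benchmark.js), just an order-of-magnitude illustration.
var fc = {
    type: 'FeatureCollection',
    features: Array.from({length: 1000}, function (_, i) {
        return {type: 'Feature', properties: {}, geometry: {type: 'Point', coordinates: [i, i]}};
    })
};

function time(label, fn, runs) {
    var start = Date.now();
    for (var i = 0; i < runs; i++) fn();
    console.log(label + ': ' + (Date.now() - start) + 'ms for ' + runs + ' runs');
}

time('with JSON clone', function () { JSON.parse(JSON.stringify(fc)); }, 200);
time('without clone', function () { fc.features[0].properties.touched = true; }, 200);
```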

points: points,
centroids: featureCollection(centroids)
};
return points;
@stebogit (Collaborator) commented Jul 20, 2017:

I see the reasons why we want to output GeoJSON objects; however, I believe the centroids in this particular clustering method are relevant, and it would be beneficial for the user to have them already available in the output (as is the common pattern in all the available k-means libraries).
Couldn't we add them as a centroids property of the output? It would still be valid GeoJSON, wouldn't it?

@DenisCarriere (Member, Author):

👍 How about... we add [number, number] as a new centroid property, that way it doesn't deviate too much from the GeoJSON specification.
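The compromise proposed here, recording each cluster's centroid as a [longitude, latitude] pair on every member feature's properties, could look like the following sketch. attachCentroids is a hypothetical helper; it recomputes each centroid as the cluster's center of mass rather than reusing skmeans's output:

```javascript
// Sketch: keep the output a plain FeatureCollection<Point>, but stamp each
// feature with its cluster's centroid in properties (valid GeoJSON, since
// properties may hold arbitrary JSON).
function attachCentroids(fc) {
    var sums = {};
    // accumulate [sumLng, sumLat, count] per cluster id
    fc.features.forEach(function (f) {
        var id = f.properties.cluster;
        var s = sums[id] || (sums[id] = [0, 0, 0]);
        s[0] += f.geometry.coordinates[0];
        s[1] += f.geometry.coordinates[1];
        s[2]++;
    });
    // write each cluster's mean position onto its member features
    fc.features.forEach(function (f) {
        var s = sums[f.properties.cluster];
        f.properties.centroid = [s[0] / s[2], s[1] / s[2]];
    });
    return fc;
}
```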

@DenisCarriere (Member, Author)

@stebogit Added the centroid back to the test cases using the (new) centroid property.

[screenshot of the updated test output]

@DenisCarriere DenisCarriere merged commit ab7b472 into master Jul 21, 2017
@DenisCarriere DenisCarriere deleted the update-clusters-kmeans branch July 21, 2017 04:01
@stebogit (Collaborator)

@DenisCarriere did you read my comment? Just curious about your point of view.

@DenisCarriere (Member, Author)

That was a big paragraph... :)

You might not have noticed it, but I started adding bullet points of to-dos in the first comment #850 (comment)

Things I might have missed:

suggested the author of skmeans to add the K-means++ algorithm (which seems one of the most popular and effective) to the library. That feature is now available, although it doesn't seem to bring significant improvements in our test cases

It would be interesting to see the difference K-means++ makes; I personally don't know the difference, so I'd have to dig deeper into this.

👍 We can include the improved K-means++

default random selection of initial centroids
However this would require a different smart (and less pretty 😞 😅 ) way to test the output since it's most likely to vary all the time.

We should use the default random selection of centroids if it's recommended by K-means to do so.

As for the test cases, I'm sure we can figure out a clever way to test this. We can still output the results using write.sync(), except we don't need to test for an exact match; we could do the following:

  • Test that clustered has features (clustered.features.length > 0)
  • Test that the first feature is not undefined
  • Test that it contains a cluster property

And the ones you mentioned:

  • have the expected number of centroids
  • have centroids that (if I'm not mistaken) coincide with the centers of mass of the corresponding clusters
  • labels all the points

Those should be enough tests to make sure the module is working correctly.
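Taken together, the checks listed above could be sketched as a single structural validator that passes regardless of random seeding. validateKmeansOutput is a hypothetical name (Turf's actual tests use tape and fixture files), and it treats "expected amount of centroids" as "expected number of distinct cluster ids":

```javascript
// Property-based validation sketch: assert the structure of the output
// rather than an exact fixture match, so randomized k-means still passes.
function validateKmeansOutput(clustered, numberOfClusters) {
    var features = clustered.features;
    if (!(features.length > 0)) throw new Error('no features in output');

    var clusters = {};
    features.forEach(function (f) {
        // every point must be labeled with a cluster id...
        if (typeof f.properties.cluster !== 'number') throw new Error('unlabeled point');
        clusters[f.properties.cluster] = true;
        // ...and carry a [lng, lat] centroid
        if (!Array.isArray(f.properties.centroid)) throw new Error('missing centroid');
    });
    // expected number of distinct clusters
    if (Object.keys(clusters).length !== numberOfClusters) {
        throw new Error('unexpected number of clusters');
    }
    return true;
}
```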

Try to summarize them in a concise "to-do" list and we can tackle them in a new PR.

@stebogit (Collaborator)

Sorry @DenisCarriere, I didn't notice your changes.
I'll think about how to approach the tests and create the PR with a summary of the above. 👍
