
Clustering - DBSCAN #86

Open · wants to merge 15 commits into base: master
Conversation


@Wei-1 Wei-1 commented Sep 12, 2019

Description of Changes

Add clustering algorithm, DBSCAN.
Please help check the formatting and the corresponding tests.

Includes
  • Code changes
  • Tests
  • Documentation
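
For readers new to the algorithm, a rough sketch of the DBSCAN idea in plain Scala follows (illustrative only; all names and types here are made up for this sketch and do not match the PR's actual code):

```scala
import scala.collection.mutable

// Illustrative sketch of the DBSCAN idea only -- names and types here are
// made up for this example and do not match the actual PR code.
object DbscanSketch {
  val Noise = -1
  type Point = (Double, Double)

  def dist(a: Point, b: Point): Double =
    math.hypot(a._1 - b._1, a._2 - b._2)

  def dbscan(pts: IndexedSeq[Point], eps: Double, minSamples: Int): Array[Int] = {
    val labels = Array.fill(pts.length)(Noise)
    val visited = Array.fill(pts.length)(false)
    // eps-neighborhood of point i (includes i itself, scikit-learn style)
    def neighbors(i: Int): Seq[Int] =
      pts.indices.filter(j => dist(pts(i), pts(j)) <= eps)

    var cluster = -1
    for (i <- pts.indices if !visited(i)) {
      visited(i) = true
      val neigh = neighbors(i)
      if (neigh.size >= minSamples) { // i is a core point: start a new cluster
        cluster += 1
        labels(i) = cluster
        val queue = mutable.Queue(neigh: _*)
        while (queue.nonEmpty) {
          val j = queue.dequeue()
          if (labels(j) == Noise) labels(j) = cluster
          if (!visited(j)) {
            visited(j) = true
            val jn = neighbors(j)
            if (jn.size >= minSamples) queue ++= jn // j is also core: expand
          }
        }
      }
    }
    labels
  }
}
```

Points never reached from any core point keep the `Noise` label (-1), matching the convention used in the tests below.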

@inejc inejc added the awaits review and enhancement (New feature or request) labels Sep 12, 2019
Member

@matejklemen matejklemen left a comment


Thanks for the PR 🙂
I've left some comments regarding changes that need to be made. I've only familiarized myself with the algorithm this weekend, so please let me know if you feel that I've made a mistake in any of the comments regarding the algorithm's behavior.

Besides these comments, please try to limit the use of mutable constructs (not at all costs, of course) - I'll let you know if I can think of some concrete improvements along the way.

@picnicml picnicml deleted a comment Sep 15, 2019
Author

@Wei-1 Wei-1 left a comment


I have been thinking about the mutable parameters,
but I haven't thought of a good way to avoid them yet.
Please also leave a comment if you have an idea on that.

@picnicml picnicml deleted a comment Sep 18, 2019
Member

@inejc inejc left a comment


Hi @Wei-1, thanks for the PR, it's much appreciated 🙂. I only skimmed through the code and will do a more detailed review over the weekend. One thing that stands out is the calls to the findNeighbors function; distances between the same points are unnecessarily calculated many times. We have at least two better ways of doing this IMHO:

  • construct a pairwise distance matrix before the clustering and use it in findNeighbors
  • construct a kd-tree (or a ball-tree) before the clustering and use it in findNeighbors

The second approach is what we eventually want to get to as it's faster and more memory efficient, but for a first solution I'm happy with the first method as it already avoids recalculating the same distances.
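
As a rough illustration of the first option (a minimal sketch only; `findNeighbors`, the point representation, and the helper names are placeholders for this comment, not the project's actual types):

```scala
// Minimal sketch of precomputing pairwise distances once and looking them up
// in findNeighbors; all names here are placeholders, not the project's API.
object DistanceSketch {
  type Point = Array[Double]

  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)

  // Each unordered pair (i, j) with i < j is computed exactly once.
  def pairwiseDistances(points: IndexedSeq[Point]): Map[(Int, Int), Double] =
    (for {
      i <- points.indices
      j <- (i + 1) until points.length
    } yield (i, j) -> euclidean(points(i), points(j))).toMap

  // Neighbors of idx within eps, looked up from the precomputed distances.
  def findNeighbors(idx: Int, n: Int, dist: Map[(Int, Int), Double], eps: Double): Set[Int] =
    (0 until n)
      .filter(j => j != idx && dist(if (idx < j) (idx, j) else (j, idx)) <= eps)
      .toSet
}
```

Storing only the upper triangle keeps memory at n(n-1)/2 entries while still allowing O(1) symmetric lookup.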

@Wei-1
Author

Wei-1 commented Sep 19, 2019

Add the Distance Map as stated by @inejc

@Wei-1 Wei-1 requested a review from inejc September 19, 2019 01:09
@picnicml picnicml deleted a comment Sep 19, 2019
@Wei-1
Author

Wei-1 commented Sep 20, 2019

Modified as stated. @matejklemen

@picnicml picnicml deleted a comment Sep 20, 2019
@matejklemen
Member

I'm just leaving a comment with some sample code here to start a bit of discussion (the following should not be integrated into the code in its current state).
I've been playing around a bit with the code to see which of the mutable things could be removed and I've come up with this:

// current seed points -> new seed points
LazyList.iterate(groupQueue)((seedPoints: Set[Int]) => {
  // iterate over each seed point
  seedPoints.foldLeft(Set[Int]())(
    (newSeedPoints: Set[Int], currSeedPoint: Int) => {
      label(currSeedPoint) = groupId
      val neighsOfNeigh = findNeighbors(currSeedPoint, x, model.eps)
      // check if neighbourhood of neighbor contains at least minSamples
      if (neighsOfNeigh.size + 1 >= model.minSamples) {
        neighsOfNeigh.foldLeft(newSeedPoints: Set[Int])((seeds, currNeighPoint) => {
          label(currNeighPoint) match {
            case NOISE =>
              label(currNeighPoint) = groupId
              seeds
            case UNASSIGNED =>
              label(currNeighPoint) = groupId
              seeds + currNeighPoint
            case _ => seeds
          }
        })
      }
      else
        newSeedPoints
  })
}).takeWhile(_.nonEmpty).foreach(_ => {})

This would be a replacement for the while loop.
It could probably be trimmed down a bit (I'm guessing the type annotations are not needed in all places), but even so, I'm not completely certain this is readable.

I'd like to hear your thoughts and/or suggestions on improving this (though it's really not a high priority). @Wei-1 @inejc

@inejc inejc added this to Review in progress in Kanban Board for doddle-model Sep 20, 2019
@picnicml picnicml deleted a comment Sep 23, 2019
@inejc
Member

inejc commented Sep 23, 2019

Notes:

1. in all the algorithms I referenced, there is no space between `private` and `(eps: ...`

2. I am not able to use DenseVector[Double] in the output

3. getOrBreak doesn't seem to work
  1. I fixed this in linear models (thanks!) so we should also fix it here
  2. I will provide an example for that
  3. @matejklemen's suggestion should work

@Wei-1 Wei-1 requested a review from inejc September 23, 2019 23:23
@picnicml picnicml deleted a comment Sep 23, 2019
@inejc
Member

inejc commented Sep 24, 2019

Hey @Wei-1, I went ahead and fixed the Clusterer typeclass, improved the calculation of distances a bit, and fixed the tests. I'm copying all three files here to avoid committing into your branch. All that is left is the actual algorithm, which I will work on in the next few days. Let me know if this makes sense to you (I also changed scaladoc stuff, moved things around, etc.). If you have any questions about the proposed changes, please also let me know :).

Clusterer.scala

package io.picnicml.doddlemodel.typeclasses

import breeze.linalg.DenseVector
import io.picnicml.doddlemodel.data.Features

trait Clusterer[A] extends Estimator[A] {

  def fit(model: A, x: Features): A = {
    require(!isFitted(model), "Called fit on a model that is already fitted")
    fitSafe(copy(model), x)
  }

  def fitPredict(model: A, x: Features): DenseVector[Double] = {
    require(!isFitted(model), "Called fit on a model that is already fitted")
    labelsSafe(fitSafe(copy(model), x))
  }

  /** A function that creates an identical clusterer. */
  protected def copy(model: A): A

  /** A function that is guaranteed to be called on a non-fitted model. */
  protected def fitSafe(model: A, x: Features): A

  def labels(model: A): DenseVector[Double] = {
    require(isFitted(model), "Called labels on a model that is not fitted yet")
    labelsSafe(model)
  }

  /** A function that is guaranteed to be called on a fitted model. */
  protected def labelsSafe(model: A): DenseVector[Double]
}

DBSCAN.scala

package io.picnicml.doddlemodel.cluster

import breeze.linalg.DenseVector
import breeze.linalg.functions.euclideanDistance
import cats.syntax.option._
import io.picnicml.doddlemodel.data.Features
import io.picnicml.doddlemodel.syntax.OptionSyntax._
import io.picnicml.doddlemodel.typeclasses.Clusterer

import scala.collection.mutable

/** An immutable DBSCAN clustering model.
  *
  * @param eps: the maximum distance between two datapoints to be considered in a common neighborhood
  * @param minSamples: the minimum number of datapoints in a neighborhood for a point to be considered the core point
  */
case class DBSCAN private (eps: Double, minSamples: Int, private val labels: Option[DenseVector[Double]])

object DBSCAN {

  def apply(eps: Double = 0.5, minSamples: Int = 5): DBSCAN = {
    require(eps > 0.0, "Maximum distance eps needs to be larger than 0")
    require(minSamples > 0, "Minimum number of samples needs to be larger than 0")
    DBSCAN(eps, minSamples, none)
  }

  implicit lazy val ev: Clusterer[DBSCAN] = new Clusterer[DBSCAN] {

    override protected def copy(model: DBSCAN): DBSCAN =
      model.copy()

    override def isFitted(model: DBSCAN): Boolean = model.labels.isDefined

    override protected def fitSafe(model: DBSCAN, x: Features): DBSCAN = {
      val distances = computeDistances(x)
      println(distances)
      ???
    }

    private def computeDistances(x: Features): Distances = {
      val distanceMatrix = mutable.AnyRefMap[(Int, Int), Double]()
      (0 until x.rows).combinations(2).foreach { case rowIndex0 +: rowIndex1 +: IndexedSeq() =>
        distanceMatrix((rowIndex0, rowIndex1)) = euclideanDistance(x(rowIndex0, ::).t, x(rowIndex1, ::).t)
      }
      new Distances(distanceMatrix)
    }

    override protected def labelsSafe(model: DBSCAN): DenseVector[Double] = model.labels.getOrBreak
  }

  private class Distances(private val distanceMatrix: mutable.AnyRefMap[(Int, Int), Double]) {
    def get(x: Int, y: Int): Double = if (x > y) distanceMatrix((y, x)) else distanceMatrix((x, y))
  }
}

DBSCANTest.scala

package io.picnicml.doddlemodel.cluster

import breeze.linalg.{DenseMatrix, DenseVector}
import io.picnicml.doddlemodel.TestingUtils
import io.picnicml.doddlemodel.cluster.DBSCAN.ev
import org.scalactic.{Equality, TolerantNumerics}
import org.scalatest.{FlatSpec, Matchers}

class DBSCANTest extends FlatSpec with Matchers with TestingUtils {

  implicit val doubleTolerance: Equality[Double] = TolerantNumerics.tolerantDoubleEquality(1e-4)

  private val x = DenseMatrix(
    List(1.0, 1.0),
    List(0.0, 2.0),
    List(2.0, 0.0),
    List(8.0, 1.0),
    List(7.0, 2.0),
    List(9.0, 0.0)
  )

  "DBSCAN" should "cluster the datapoints" in {
    val model = DBSCAN(eps = 3.0, minSamples = 1)
    breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 0.0, 0.0, 1.0, 1.0, 1.0))
  }

  it should "cluster each datapoint into its own group when eps is too small" in {
    val model = DBSCAN()
    breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 1.0, 2.0, 3.0, 4.0, 5.0))
  }

  it should "cluster all data points into a single group when eps is too large" in {
    val model = DBSCAN(eps = 10.0)
    breezeEqual(ev.fitPredict(model, x), DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
  }

  it should "label all points as outliers when min samples is too large" in {
    val model = DBSCAN(minSamples = 7)
    breezeEqual(ev.fitPredict(model, x), DenseVector(-1.0, -1.0, -1.0, -1.0, -1.0, -1.0))
  }

  it should "cluster all datapoints into a single group when eps equals the distance between points" in {
    val smallX = DenseMatrix(
      List(0.0, 0.0),
      List(3.0, 0.0)
    )
    val model = DBSCAN(eps = 3.0)
    breezeEqual(ev.fitPredict(model, smallX), DenseVector(0.0, 0.0))
  }

  it should "cluster all datapoints into a single group" in {
    val d1X = DenseMatrix(
      List(0.0, 12.0),
      List(0.0, 9.0),
      List(0.0, 6.0),
      List(0.0, 3.0),
      List(0.0, 0.0)
    )
    val model = DBSCAN(eps = 3.0, minSamples = 3)
    breezeEqual(ev.fitPredict(model, d1X), DenseVector(0.0, 0.0, 0.0, 0.0, 0.0))
  }

  it should "prevent the usage of negative eps" in {
    an [IllegalArgumentException] shouldBe thrownBy(DBSCAN(eps = -0.5))
  }

  it should "prevent the usage of negative min samples" in {
    an [IllegalArgumentException] shouldBe thrownBy(DBSCAN(minSamples = -1))
  }
}

@inejc
Member

inejc commented Sep 24, 2019

The changes are available in this branch which I will delete later.

@Wei-1
Author

Wei-1 commented Sep 24, 2019

These are beautiful!
I will follow your modifications after you are done.

@inejc
Member

inejc commented Sep 25, 2019

@Wei-1 @matejklemen the proposed implementation for DBSCAN is available in this file. Typeclass is available here. The tests need to be fixed because I changed default values of parameters (I suggest that we come up with a few examples produced by the scikit-learn implementation).

Anyway, let me know what you think, there might be things there that should be changed as I haven't tested this (and it probably needs some refactoring too).

@Wei-1 should we rebase your branch with doddle-model/Wei-1-master, or will you copy the changes and commit them directly once we agree on them 🤔?

@inejc
Member

inejc commented Sep 25, 2019

Also todo: fix 2.11/2.12 compatibility because of mutable.Queue.enqueueAll.

@inejc
Member

inejc commented Sep 26, 2019

@matejklemen I also looked at your proposal for the replacement of the while loop and I think it's good (I initially came up with a similar example but then decided to go back to the loop for some reason :D). What are your preferences around that?

@Wei-1
Author

Wei-1 commented Sep 26, 2019

@inejc, I think it would be easier to simply close this PR and start another one from doddle-model/Wei-1-master

@inejc
Member

inejc commented Sep 26, 2019

@Wei-1 I can try to push my changes to your fork or alternatively, you can try to pull doddle-model/Wei-1-master into your own fork and open a PR from there.

@matejklemen
Member

@inejc Well if we don't have a clean way to do it in a functional style, I'm fine with leaving a while loop in (my solution is not more readable than a simple while loop IMO).

@Wei-1
Author

Wei-1 commented Sep 27, 2019

Updated this PR with new commits from @inejc

@picnicml picnicml deleted a comment Sep 27, 2019
@inejc
Member

inejc commented Sep 27, 2019

Thanks @Wei-1. Will you work on 2.11 and 2.12 support and fixing the tests or should I do it?

@Wei-1
Author

Wei-1 commented Sep 27, 2019

Before that, note that the current code fails 2 unit tests.

[info] DBSCANTest:
[info] DBSCAN
[info] - should cluster each datapoint into it's own group when eps is too small *** FAILED ***
[info]   false did not equal true (DBSCANTest.scala:30)
[info] - should cluster all datapoints into a single group when eps equals the distance between points *** FAILED ***
[info]   false did not equal true (DBSCANTest.scala:50)

I checked and found that all data points are labeled as NOISE.
Can you please help find out what went wrong in the algorithm? @inejc

@inejc
Member

inejc commented Sep 27, 2019

Can you please help find out what went wrong in the algorithm? @inejc

I will check it out, the tests might be failing due to changed default parameters for minSamples and eps but need to confirm. As I said, I will do that by coming up with some examples and then running the scikit-learn implementations to get the clusters and make tests based on that.

@inejc
Member

inejc commented Sep 27, 2019

By 2.11 and 2.12 I mean scala versions and the fact that the current code doesn't compile for them. It compiles for 2.13 though. We have this for 2.11/2.12 and this for 2.13 interop.

@Wei-1
Author

Wei-1 commented Sep 28, 2019

Interesting. I thought we always needed to compile against a specific Scala version.
Do we have to change anything so that this mechanism can support the clustering code?

@inejc
Member

inejc commented Sep 29, 2019

Interesting. I thought we always needed to compile against a specific Scala version.
Do we have to change anything so that this mechanism can support the clustering code?

Compatibility files allow for compilation on 2.11, 2.12 and 2.13, one at a time, i.e. one still has to choose a specific Scala version. The current code only compiles on 2.13, though, due to mutable.Queue.enqueueAll, which is not available on 2.11 and 2.12, so this needs to be fixed.
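
For reference, one way to sidestep `enqueueAll` is to use `++=`, which `mutable.Queue` inherits from `Growable` on 2.11, 2.12 and 2.13 alike (a minimal sketch; the `QueueCompat`/`enqueueMany` names are made up for this comment):

```scala
import scala.collection.mutable

// Sketch of a cross-version replacement for mutable.Queue.enqueueAll, which
// only exists on Scala 2.13: ++= (from Growable) is available on 2.11,
// 2.12 and 2.13. QueueCompat/enqueueMany are hypothetical names.
object QueueCompat {
  def enqueueMany[A](queue: mutable.Queue[A], elems: Seq[A]): mutable.Queue[A] = {
    queue ++= elems // same effect as queue.enqueueAll(elems) on 2.13
    queue
  }
}
```

This avoids the need for a separate compat source file just for the queue call.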

@Wei-1
Author

Wei-1 commented Oct 22, 2019

@inejc I can't seem to find why your implementation is not working as intended for those unit tests.
