Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1438] Adding SDML loss function #17298

Merged
merged 8 commits into from
Jan 17, 2020
Merged

Conversation

anjishnu
Copy link
Contributor

Description

(Brief description on what this PR is about)

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html

Changes

  • Added new loss for information retrieval - Smoothed Deep Metric Learning Loss with relevant API doc.

Comments

@anjishnu anjishnu requested a review from szha as a code owner January 14, 2020 03:31
@haojin2
Copy link
Contributor

haojin2 commented Jan 14, 2020

@anjishnu Please address the sanity errors: http:https://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/sanity/job/PR-17298/1/display/redirect . Also can you randomize your unit test to ensure that we're covering more numerically different cases?

@anjishnu
Copy link
Contributor Author

anjishnu commented Jan 14, 2020

@haojin2 Sure will address the sanity cases.

Can you give an example of a unit test that is appropriately randomized so I can base it on that?

@anjishnu
Copy link
Contributor Author

anjishnu commented Jan 15, 2020

It looks a little tricky to port this into the 'fit' and 'score' paradigm since this is a retrieval specific loss function which uses the other elements in a batch as implicit negative samples - and I'm not sure how cleanly it fits into the Module API for this kind of test. Specially since the loss computation needs to know the shape of the minibatch which doesn't seem to be possible in the symbol API.

The loss only guarantees that associated pairs will be closer in the chosen metric space after learning as compared to the non-associated pairs.

Maybe I can write something equivalent using the gluon API, to train a small network and ensure it learns the right associations. I'll come up with a proposal shortly.

@haojin2
Copy link
Contributor

haojin2 commented Jan 15, 2020

@anjishnu I was actually only asking for getting the input data randomized with the original version of test code untouched.

@anjishnu
Copy link
Contributor Author

anjishnu commented Jan 15, 2020

@haojin2 if I randomized the input data in the original test code the losses would would have different values during each run (SDML loss imposes a distribution over the relative distances of data points in a minibatch) - so I would not be able to compare the output against precomputed loss values any more - thus the original unit test procedure cannot be reused.

That's why I added a test that fits a toy model to some toy data instead. The current test was running in ~50 ms on my machine on CPU. Would love to hear your thoughts on how to improve on this.

@anjishnu
Copy link
Contributor Author

Verifying that mention of this PR in #17292 is a typo - author meant to refer to #17208 - we are not breaking horovod

@anjishnu
Copy link
Contributor Author

@haojin2 Are there any serious concerns with the new unit test?

@haojin2
Copy link
Contributor

haojin2 commented Jan 16, 2020

@anjishnu I'm only concerned about if there's any flakiness in such a test, to verify, please try
MXNET_TEST_COUNT=500 nosetests tests/python/unittest/test_loss.py:test_sdml_loss on your end and lemme know if it passes.

@anjishnu
Copy link
Contributor Author

@haojin2
With the original confirguration I was getting around 1 in 100 or 1 in 200 failures, which is not acceptable for a unit test.

Changing the hyperparameters (e.g. increasing dimensionality, lowering N) does make it more robust (tried up to MXNET_TEST_COUNT=2000 runs) but there's always a chance of failure

Other options:

  • Go with the original precomputed values approach, but maybe over a larger range of precomputed values.

  • Set up a similar test ... but use correlated data instead of using purely random data:

    • Generate the positive samples that are actually correlated to the paired data rather than purely random.
    • Always use some set of samples that are known to converge successfully
  • Add a test to see loss reduction (e.g. to <0.05 as in many of the other tests)

Which of these would be your preferred approach - I have tested them all to MXNET_TEST_COUNT=2000

@anjishnu
Copy link
Contributor Author

@haojin2 could you take a look at the latest version - the new unit test (which looks at loss reduction) is passing MXNET_TEST_COUNT=2000

@haojin2
Copy link
Contributor

haojin2 commented Jan 17, 2020

@anjishnu it seems ok. I've re-triggered your failed website build, I think it's good for merge after that passes.

@anjishnu
Copy link
Contributor Author

@haojin2 ok, thanks! sounds good!

@haojin2 haojin2 merged commit 11a2dcb into apache:master Jan 17, 2020
@haojin2
Copy link
Contributor

haojin2 commented Jan 17, 2020

@anjishnu Merged, thx for the contribution.

szhengac pushed a commit to szhengac/mxnet that referenced this pull request Jan 21, 2020
* Added loss function

* cleaning out loss function

* added loss fn

* added loss function and unit test

* fixed linter issues and updated unit test to train net and eval

* fixed documentation

* made unit tests more robust
szhengac pushed a commit to szhengac/mxnet that referenced this pull request Jan 21, 2020
* Added loss function

* cleaning out loss function

* added loss fn

* added loss function and unit test

* fixed linter issues and updated unit test to train net and eval

* fixed documentation

* made unit tests more robust
szhengac pushed a commit to szhengac/mxnet that referenced this pull request Jan 21, 2020
* Added loss function

* cleaning out loss function

* added loss fn

* added loss function and unit test

* fixed linter issues and updated unit test to train net and eval

* fixed documentation

* made unit tests more robust
distances = self._compute_distances(x1, x2)
log_probabilities = F.log_softmax(-distances, axis=1)
# multiply for the number of labels to obtain the correct loss (gluon kl_loss averages instead of sum)
return self.kl_loss(log_probabilities, labels.as_in_context(distances.context)) * batch_size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't work in sym. 1.x will need to be fixed while 2.0 will be switched to deferred compute mode

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants