Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Threshold metrics calculation fix when unseen labels are present #293

Merged
merged 84 commits into from
Apr 22, 2019
Merged
Changes from 1 commit
Commits
Show all changes
84 commits
Select commit Hold shift + click to select a range
a0f5bc7
label metadata after data cut
gerashegalov Mar 31, 2019
0a7d872
idempotent trim and special-case DataCutter
gerashegalov Apr 1, 2019
8f900a1
ignore previous labels in metadata
gerashegalov Apr 1, 2019
0b520ee
more logging
gerashegalov Apr 1, 2019
0a775eb
num_vals instead of vals
gerashegalov Apr 2, 2019
9da41b2
Revert "num_vals instead of vals"
gerashegalov Apr 3, 2019
3e73aaa
fix metadata labels
gerashegalov Apr 3, 2019
bde7cf0
sync labels
gerashegalov Apr 3, 2019
c5d3272
add mkstring
gerashegalov Apr 3, 2019
2ad6876
cleanup and fixes
gerashegalov Apr 3, 2019
e084d4e
WIP
gerashegalov Apr 4, 2019
937dc17
Revert "cleanup and fixes"
gerashegalov Apr 4, 2019
a803644
rewwork labels dropped
gerashegalov Apr 4, 2019
ae1b58d
restore accidental revert
gerashegalov Apr 5, 2019
4f7e2ef
re-add num_vals
gerashegalov Apr 5, 2019
30f961d
WIP
gerashegalov Apr 5, 2019
81f5ee9
WIP2
gerashegalov Apr 5, 2019
acff707
tests passing
gerashegalov Apr 7, 2019
aee7bbe
debug for leah
gerashegalov Apr 8, 2019
f140aa9
if inverted
gerashegalov Apr 8, 2019
1b3bc27
LR repro'ed
gerashegalov Apr 8, 2019
5074ff5
WIP
gerashegalov Apr 9, 2019
bcb93fd
test vectors
gerashegalov Apr 9, 2019
8c6d685
test both indexed and non-idexed data with data cutter
gerashegalov Apr 9, 2019
f1bb79d
Leah's comments
gerashegalov Apr 11, 2019
da400d4
param for number dropped lables to log
gerashegalov Apr 12, 2019
5a5b19d
WIP
gerashegalov Apr 15, 2019
74e9dd1
WIP
gerashegalov Apr 15, 2019
4a4171f
WIP
gerashegalov Apr 15, 2019
2e467c1
WIP
gerashegalov Apr 15, 2019
2d64f14
formatting
gerashegalov Apr 15, 2019
c43259d
Added metrics calculation to existing test to reproduce error seen in…
Jauntbox Apr 16, 2019
5e99cb7
Fixed bug in threshold metrics when labels unseen at training time ar…
Jauntbox Apr 17, 2019
e89b647
Removed debug logging from test
Jauntbox Apr 17, 2019
cb17f96
Merge branch 'master' of github.com:salesforce/TransmogrifAI into km/…
Jauntbox Apr 17, 2019
27ff0b9
label metadata after data cut
gerashegalov Mar 31, 2019
453c1e1
idempotent trim and special-case DataCutter
gerashegalov Apr 1, 2019
6b31bad
ignore previous labels in metadata
gerashegalov Apr 1, 2019
2f0af21
more logging
gerashegalov Apr 1, 2019
78acc4c
num_vals instead of vals
gerashegalov Apr 2, 2019
3ee8149
Revert "num_vals instead of vals"
gerashegalov Apr 3, 2019
4415245
fix metadata labels
gerashegalov Apr 3, 2019
86e5425
sync labels
gerashegalov Apr 3, 2019
c3bda19
add mkstring
gerashegalov Apr 3, 2019
6000be7
cleanup and fixes
gerashegalov Apr 3, 2019
1fdbe3e
WIP
gerashegalov Apr 4, 2019
4c5aa83
Revert "cleanup and fixes"
gerashegalov Apr 4, 2019
896dcb0
rewwork labels dropped
gerashegalov Apr 4, 2019
71d0b60
restore accidental revert
gerashegalov Apr 5, 2019
dd75509
re-add num_vals
gerashegalov Apr 5, 2019
9af4edf
WIP
gerashegalov Apr 5, 2019
62c489d
WIP2
gerashegalov Apr 5, 2019
e70a621
tests passing
gerashegalov Apr 7, 2019
af77fda
debug for leah
gerashegalov Apr 8, 2019
e545f8b
if inverted
gerashegalov Apr 8, 2019
19c91be
LR repro'ed
gerashegalov Apr 8, 2019
1948507
WIP
gerashegalov Apr 9, 2019
0ee4ffb
test vectors
gerashegalov Apr 9, 2019
9f592ac
test both indexed and non-idexed data with data cutter
gerashegalov Apr 9, 2019
2841359
Leah's comments
gerashegalov Apr 11, 2019
cf55a49
param for number dropped lables to log
gerashegalov Apr 12, 2019
5701b08
WIP
gerashegalov Apr 15, 2019
b9eb963
WIP
gerashegalov Apr 15, 2019
ae3e5c5
WIP
gerashegalov Apr 15, 2019
a44707b
WIP
gerashegalov Apr 15, 2019
831e114
formatting
gerashegalov Apr 15, 2019
a1f37c3
Replace cacheValidatedDFForTesting with the extended DataCutter class…
gerashegalov Apr 17, 2019
b3db9c4
mt feedback WIP
gerashegalov Apr 18, 2019
dd16900
build failure fix
gerashegalov Apr 18, 2019
0dd9fb6
whitespace
gerashegalov Apr 18, 2019
37b2dce
column index constants
gerashegalov Apr 18, 2019
cc2d4f0
Param rename
gerashegalov Apr 18, 2019
e973f46
CodeFactor issue
gerashegalov Apr 18, 2019
27992db
Merge branch 'master' into gera/factorlabeltrim
tovbinm Apr 18, 2019
3029177
Merged with Gera's branch
Jauntbox Apr 18, 2019
8197332
Fixed merge
Jauntbox Apr 18, 2019
6bf7d72
Merge branch 'master' into km/label-trim-metrics
leahmcguire Apr 19, 2019
0d8941b
Merge with master
Jauntbox Apr 19, 2019
a13cf54
Merge branch 'km/label-trim-metrics' of github.com:salesforce/Transmo…
Jauntbox Apr 19, 2019
6970775
Put blank line back in
Jauntbox Apr 19, 2019
9307140
Put blank line back in, but not with space
Jauntbox Apr 19, 2019
69fad18
Address comment
Jauntbox Apr 19, 2019
14b442d
Merge branch 'master' of github.com:salesforce/TransmogrifAI into km/…
Jauntbox Apr 19, 2019
321bc7e
Merge branch 'master' of github.com:salesforce/TransmogrifAI into km/…
Jauntbox Apr 22, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Revert "num_vals instead of vals"
This reverts commit d6d9ded
  • Loading branch information
gerashegalov committed Apr 14, 2019
commit 9da41b226ede5279432074330cde892dd7f1e9bb
Original file line number Diff line number Diff line change
Expand Up @@ -99,19 +99,6 @@ class DataCutter(uid: String = UID[DataCutter]) extends Splitter(uid = uid) with
summary
}


private def getNumValuesFromMetadata(data: DataFrame): Int = {
val labelSF: StructField = data.schema.head
val labelColMetadata = labelSF.metadata
log.info(s"Label column metadata ${labelSF.name} $labelColMetadata")
Try(labelColMetadata.getLong("num_vals").toInt)
.recover { case nonFatal =>
log.warn(s"Recovering non-fatal failure to extract num_vals with -1", nonFatal)
-1
}
.get
}

/**
* Removes labels that should not be used in modeling
*
Expand All @@ -120,27 +107,41 @@ class DataCutter(uid: String = UID[DataCutter]) extends Splitter(uid = uid) with
*/
override def validationPrepare(data: Dataset[Row]): Dataset[Row] = {
val dataPrep = super.validationPrepare(data)
val labelSet = getLabelsToKeep.toSet
log.info(s"Dropping rows with columns not in $labelSet")

val labelColName = dataPrep.columns(0)

val metadata = NominalAttribute.defaultAttr
.withName(labelColName)
.withNumValues(labelSet.size)
.toMetadata()

val labelCol = dataPrep
.col(labelColName)
.as("_", metadata)

val res = dataPrep
.filter(r => labelSet.contains(r.getDouble(0)))
.withColumn(labelColName, labelCol)

assert(labelSet.size == getNumValuesFromMetadata(res))

res
Try {
val labelSF: StructField = dataPrep.schema.head
val labelColMeta = labelSF.metadata
log.info(s"Label column metadata ${labelSF.name} $labelColMeta")
labelColMeta
.getMetadata("ml_attr")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could keep those magic values as constants?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wsuchy This PR is very confusing because it's based and contains commits from an older version of #263. @Jauntbox's change is actually a small commit. @Jauntbox please open a PR stacked on mine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gerashegalov I don't know how to make this PR against your branch since it's from a branch on your fork. This should be up to date w/ your branch now. I was hoping we could merge your branch first and then this PR should reduce down to ~10 lines when I merge it with master.

Copy link
Contributor

@gerashegalov gerashegalov Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can fetch branches from other forks by git remote add other remotes first, branch off of it, and push to another remote, and and create a cross-fork PR. Here is a demo:

git remote add mtovbin-github [email protected]:tovbinm/TransmogrifAI.git
git fetch mtovbin-github 
git checkout mt/xgboost 
git checkout -b gera/crossfork-mt-xgboost
vim README.md 
git commit -a -m test 
git push gera-github gera/crossfork-mt-xgboost 

https://github.com/tovbinm/TransmogrifAI/pull/1/files

.getStringArray("vals")
.map(_.toDouble)
}
.map(_.toSet)
.recover { case nonFatal =>
log.warn(s"Recovering non-fatal exception: $nonFatal", nonFatal)
Set.empty
}
.map { valSet =>

val labelSet = getLabelsToKeep.toSet
if (valSet == labelSet) {
log.info("Label sets identical in dataset metadata and datacutter, skipping label trim")
dataPrep
} else {
log.info(s"Dropping rows with columns not in $labelSet")

val labelColName = dataPrep.columns(0)
val labelCol = dataPrep.col(labelColName)
val metadata = NominalAttribute.defaultAttr
.withName(labelColName)
.withValues(getLabelsToKeep.map(_.toString))
.toMetadata()
dataPrep
.filter(r => labelSet.contains(r.getDouble(0)))
.withColumn(labelColName, labelCol.as("_", metadata))
}
}
.get
}

/**
Expand Down