Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Length Row Level Results #465

Merged
merged 7 commits into from
Apr 13, 2023

Conversation

eycho-am
Copy link
Contributor

Issue #, if available:
N/A

Description of changes:
Building on this PR to add more analyzers that support row level results.

This PR address this for MinLength, as MaxLength was completed in the PR mentioned above.

In addition, this PR adds an AnalzyerOptions case class that gives an option to convert null values to result to failing the check. This option is not set by default, which means that for a MinLength analyzer, the check would succeed even if null values were present.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -249,6 +255,12 @@ case class NumMatchesAndCount(numMatches: Long, count: Long, override val fullCo
}
}

case class AnalyzerOptions(convertNull: Boolean) {
def getConvertNull(): Boolean = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather have this be an enum or other option, and also named more clearly since we put it in Analyzer.

I see two problems with it:

  1. The name isn't clear. I don't know when nulls will be converted
  2. The result isn't clear. I don't know (and can't control) what nulls will be converted to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to use an enum NullBehavior that gives users 3 options

  1. Ignore (Default behavior)
  2. Empty (Treat as 0 or empty string)
  3. Fail (Always fail nulls)

src/main/scala/com/amazon/deequ/analyzers/Analyzer.scala Outdated Show resolved Hide resolved
private def criterion(convertNull: Boolean): Column = {
if (convertNull) {
val colLengths: Column = length(conditionalSelection(column, where)).cast(DoubleType)
conditionalSelectionForLength(colLengths, Option(s"${column} IS NULL"), Double.MinValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we replacing after the length has been computed? I would expect to replace all null strings with "" and then compute the length. Is this approach better?

Also, min value is going to be negative, is that what we want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now with the NullBehavior enum, users can choose to replace with "" or treat as false (fixed to Double.MaxValue).

Copy link
Contributor

@rdsharma26 rdsharma26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left some minor comments.

@@ -453,6 +471,20 @@ private[deequ] object Analyzers {
conditionalSelection(col(selection), where)
}

def conditionalSelection(selection: Column, where: Option[String], replaceWith: Double): Column = {
val conditionColumn = where.map { expression => expr(expression) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: where.map(expr)

}

def conditionalSelection(selection: Column, where: Option[String], replaceWith: String): Column = {
val conditionColumn = where.map { expression => expr(expression) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: where.map(expr)

Comment on lines 260 to 262
def getNullBehavior(): NullBehavior = {
nullBehavior
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required.

scala> case class Options(iLikeNulls: Boolean)
defined class Options

scala> val opt = Options(iLikeNulls = false)
opt: Options = Options(false)

scala> opt.iLikeNulls
res2: Boolean = false


object NullBehavior extends Enumeration {
type NullBehavior = Value
val Ignore, Empty, Fail = Value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does Empty mean for numbers? If it is only applicable for Strings, should we call it EmptyString ?

case _ => length(conditionalSelection(column, where)).cast(DoubleType)
}
}
private def getNullBehavior(): NullBehavior = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The () can be removed. In idiomatic Scala, (), called an empty paren(theses) method, denotes that function performs some side effect. Like calling a service or even just logging. Here, we are just extracting data from an object, and therefore, we can remove this () from the method.

@@ -41,4 +48,22 @@ case class MinLength(column: String, where: Option[String] = None)
}

override def filterCondition: Option[String] = where

@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed if we are using package private modifier and have the test in the appropriate package?

Comment on lines 235 to 236


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Excess whitespace

@eycho-am eycho-am merged commit 067710b into awslabs:master Apr 13, 2023
eycho-am added a commit that referenced this pull request Apr 13, 2023
- Add row level results for MinLength in addition to MaxLength
- Add AnalzyerOptions case class with options to change null behavior (Ignore, EmptyString, Fail)
mentekid pushed a commit that referenced this pull request Apr 14, 2023
- Add row level results for MinLength in addition to MaxLength
- Add AnalzyerOptions case class with options to change null behavior (Ignore, EmptyString, Fail)
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
- Add row level results for MinLength in addition to MaxLength
- Add AnalzyerOptions case class with options to change null behavior (Ignore, EmptyString, Fail)
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
- Add row level results for MinLength in addition to MaxLength
- Add AnalzyerOptions case class with options to change null behavior (Ignore, EmptyString, Fail)
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
- Add row level results for MinLength in addition to MaxLength
- Add AnalzyerOptions case class with options to change null behavior (Ignore, EmptyString, Fail)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants