Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changed to look specifically for part files rather than listing the files and searching through them #505

Merged
merged 8 commits into from
Sep 14, 2020

Conversation

leahmcguire
Copy link
Collaborator

Related issues
Refer to issue(s) addressed in this pull request from Issues page.

Describe the proposed solution
A clear and concise description of what the changes are.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context about the changes here.

@tovbinm
Copy link
Collaborator

tovbinm commented Sep 2, 2020

Is it faster? What if another compression technique is being used, i.e. 7Zip, then we would have .7z extension?

@codecov
Copy link

codecov bot commented Sep 2, 2020

Codecov Report

Merging #505 into master will decrease coverage by 0.70%.
The diff coverage is 58.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #505      +/-   ##
==========================================
- Coverage   86.74%   86.03%   -0.71%     
==========================================
  Files         347      347              
  Lines       11859    11860       +1     
  Branches      388      607     +219     
==========================================
- Hits        10287    10204      -83     
- Misses       1572     1656      +84     
Impacted Files Coverage Δ
...cala/com/salesforce/op/OpWorkflowModelReader.scala 91.20% <58.33%> (-5.46%) ⬇️
...e/op/stages/impl/selector/RandomParamBuilder.scala 0.00% <0.00%> (-94.45%) ⬇️
...main/scala/com/salesforce/op/dsl/RichFeature.scala 50.00% <0.00%> (-50.00%) ⬇️
...mpl/regression/OpGeneralizedLinearRegression.scala 50.00% <0.00%> (-26.93%) ⬇️
...op/stages/impl/regression/OpXGBoostRegressor.scala 13.46% <0.00%> (-13.47%) ⬇️
...tages/impl/preparators/SanityCheckerMetadata.scala 84.45% <0.00%> (-5.41%) ⬇️
...s/impl/preparators/DerivedFeatureFilterUtils.scala 88.67% <0.00%> (-4.41%) ⬇️
...com/salesforce/op/utils/stages/FitStagesUtil.scala 89.47% <0.00%> (-3.95%) ⬇️
...rce/op/stages/impl/preparators/SanityChecker.scala 88.93% <0.00%> (-1.64%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d46181...3d036d2. Read the comment docs.

}
val finalPath = partPath.getOrElse(path)

val partFile = new Path(pathString, "part-00000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better configure the part file name on the write call path so we don't have to guess.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will always be this but the compression codec will determine the extension


val partFile = new Path(pathString, "part-00000")
val partZipped = new Path(pathString, "part-00000.gz")
val finalPath = if (fs.exists(partZipped)) partZipped else if (fs.exists(partFile)) partFile else path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exists add redundant filesystem calls that will be even more expensive with S3

we can just call open directly and catch FileNotFound to try another filename

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gerashegalov fixed to just try loading

@leahmcguire
Copy link
Collaborator Author

@gerashegalov suggested this for faster reading

@gerashegalov
Copy link
Contributor

Is it faster? What if another compression technique is being used, i.e. 7Zip, then we would have .7z extension?

@tovbinm it should be faster on object stores since you have fast path lookups but slow entry list performance.

@leahmcguire
Copy link
Collaborator Author

@tovbinm this covers no compression and our default settings as well as passing in the exact full path. Do you think we need to cover more possibilities?

Copy link
Contributor

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor comment

}
}
} match {
case Failure(e) => throw new RuntimeException(s"Failed to load workflow because of $e")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us keep e as the cause exception as well, i.e. provided as the second parameter to the RTE constructor.

@gerashegalov
Copy link
Contributor

this covers no compression and our default settings as well as passing in the exact full path. Do you think we need to cover more possibilities?

I don't think we need to make it that complex. At most I would make Compression Codec configurable instead of doing any guesses at all.

@leahmcguire leahmcguire merged commit be60275 into master Sep 14, 2020
nicodv added a commit that referenced this pull request Sep 16, 2020
…ng the files and searching through them (#505)"

This reverts commit be60275
@tovbinm tovbinm deleted the lm/loadSpecificFiles branch January 18, 2021 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants