Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove or document Direct Output Committers limitations #54

Closed
gerashegalov opened this issue Aug 14, 2018 · 3 comments
Closed

Remove or document Direct Output Committers limitations #54

gerashegalov opened this issue Aug 14, 2018 · 3 comments

Comments

@gerashegalov
Copy link
Contributor

Describe the bug
Direct output committer and vanilla committers are susceptible to list inconsistency on s3. It only works with layers on top of s3 such as S3Guard or EMRFS as long as one relies on _SUCCESS files.

To Reproduce
Run on pure s3 :)

Expected behavior
Use the default committer. Just refer users to options out there and how to configure them

Logs or screenshots
N/A

Additional context
N/A

@ajayborra
Copy link
Contributor

Hi @gerashegalov, Was able to reproduce this issues and following are the observations.

  • When we use paths with (s3|s3n):https://<object key> format, code crashes with No FileSystem for scheme: (s3|s3n).
    |- Seems like it is possible to detect this prefixes early on and raise an error indicating the fact and recommend using s3a here and document this fact.
  • Default implementation for s3 is configured as s3a org.apache.hadoop.fs.s3a.S3AFileSystem
    |- Seems like we can still use org.apache.hadoop.fs.s3native.NativeS3FileSystem even though its in maintanace mode. So we can document this fact and how to configure it. (An alternative could be, we can provide a cli level --s3-scheme flag to choose s3a vs s3n scheme and set these properties programatically -- Let me know your thoughs on this).

Also, Can you please expand a bit more on the Expected behavior section. Are you refering to chainging the behaviour of DirectOutputCommitter class in some way ?

@gerashegalov
Copy link
Contributor Author

gerashegalov commented Aug 23, 2018

Hi @ajayborra,

thank you for checking out the issue.

just for the reference: s3n is deprecated in Hadoop 2.x and removed in Hadoop 3.x . I would like to point out that the issue is related to this but not about FileSystem URI's. And yes S3 conf is missing in the default Hadoop libraries outside EMR.

this issue is primarilyabout hadoop output committers. I propose dropping DOC classes from these project. We should definitely not hardcode them like in

jobConf.setOutputCommitter(classOf[DirectOutputCommitter])
because it introduces inconsistency windows on HDFS and other filesystems.

The expected behavior I think should be the default output committer as in Hadoop 3, FileOutputCommitter v2: https://docs.databricks.com/spark/latest/faq/append-slow-with-spark-2.0.0.html

FileOutputCommitter class will already be used when we stop overriding it via
spark.hadoop.mapred.output.committer.class and we only need to set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

@ajayborra
Copy link
Contributor

@gerashegalov Thanks for clarifying on this. +1 on leaving it to the users to choose the committer class. Raised a PR for this #86. Please review it when you get a chance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants