feat: Partitioned execution #409

Open
3 of 13 tasks
bjchambers opened this issue Jun 1, 2023 · 1 comment

bjchambers commented Jun 1, 2023

Summary

The current "execution plan" is close to, but not quite suited for, describing the steps necessary to perform a computation.
Ideally, we would introduce a physical plan more similar to those used by relational query engines, allowing us to leverage existing techniques and opening the option of running on existing systems.

For now, the plan is to introduce physical plans and move execution towards running them directly, and then (separately) to work towards compiling queries directly to physical plans.

  • Introduce serde-compatible physical plan structs.
  • Introduce a "pipeline scheduler" that puts scheduling information in the physical plans.
  • Introduce new expression executors
  • Introduce a partitioned, pipeline executor
  • Add additional transforms (select, lookup, etc.)
  • Add merge-based pipelines (repartition, merge)
  • Figure out how to incorporate aggregations
  • Connect to "real" sources
  • Connect to real sinks
  • Convert the existing execution plan to a physical plan.
  • Update tests to run (with a flag) using the converted plans
  • Introduce a protobuf encoding for the physical plan. It may be possible to share this with a serialization of logical plans, since they will have a similar structure. Ideally, this would allow for source/sink extensibility.
  • "partitioned" parquet reader + "partitioned" prepare
@bjchambers bjchambers added the enhancement New feature or request label Jun 1, 2023
@bjchambers bjchambers self-assigned this Jun 1, 2023
bjchambers (Collaborator, Author) commented

#407 was a refactoring to move the ScalarValue into a more accessible location for the physical plans. For now, the intention is to use the ScalarValue within the physical plan to represent literal values. In the long term we may want to revisit that and use an encoding better aligned with logical plans, but we can do so once the basic plumbing is laid out a bit better.

bjchambers added a commit that referenced this issue Jun 1, 2023
bjchambers added a commit that referenced this issue Jun 1, 2023
bjchambers added a commit that referenced this issue Jun 2, 2023
bjchambers added a commit that referenced this issue Jun 2, 2023
This is part of #409.

Introduces `Pipeline` information to the physical plan. This indicates
which steps are part of a linear sequence, and should (ideally) be
executed together.

Also implements a pipeline "scheduler" to determine the pipeline for
each step, in a new `sparrow-backend` crate. As the physical plan is
built-up, the code should go in this "compiler backend" package, which
can own optimization and conversion of logical plans to physical plans.
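
The grouping described above can be sketched as follows. This is a minimal illustration, not the `sparrow-backend` implementation: `Step` and `assign_pipelines` are hypothetical names, and the rule used here (a step joins its input's pipeline only when it has exactly one input and that input has exactly one consumer) is an assumption about what "linear sequence" means.

```rust
/// Hypothetical, simplified physical plan step. Only the indices of the
/// steps it reads from are modeled; a real plan would carry much more.
#[derive(Debug, Clone)]
pub struct Step {
    pub name: &'static str,
    pub inputs: Vec<usize>,
}

/// Assigns a pipeline index to every step.
///
/// Assumes `steps` is topologically ordered, so every input index is
/// smaller than the index of the step that consumes it.
pub fn assign_pipelines(steps: &[Step]) -> Vec<usize> {
    // Count how many steps consume the output of each step.
    let mut consumers = vec![0usize; steps.len()];
    for step in steps {
        for &input in &step.inputs {
            consumers[input] += 1;
        }
    }

    let mut pipelines: Vec<usize> = Vec::with_capacity(steps.len());
    let mut next_pipeline: usize = 0;
    for step in steps {
        let pipeline = match step.inputs.as_slice() {
            // Exactly one input, which feeds only this step: the two steps
            // form a linear sequence, so stay in the input's pipeline.
            [input] if consumers[*input] == 1 => pipelines[*input],
            // Sources, multi-input steps, and steps after a fan-out each
            // start a new pipeline.
            _ => {
                let id = next_pipeline;
                next_pipeline += 1;
                id
            }
        };
        pipelines.push(pipeline);
    }
    pipelines
}
```

For a plan of the shape scan, project, select, a second scan, a two-input merge, and a final project, this yields `[0, 0, 0, 1, 2, 2]`: the first three steps share a pipeline, the second scan gets its own, and the merge starts a third that the final project continues.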
bjchambers added a commit that referenced this issue Jun 3, 2023
@bjchambers bjchambers changed the title from "feat: Introduce physical plans suitable for partitioned & distributed execution" to "feat: Partitioned execution" Jul 21, 2023
github-merge-queue bot pushed a commit that referenced this issue Jul 25, 2023
This introduces the key components of partitioned execution.

- `sparrow-scheduler` provides functionality for managing the separate
pipelines within the query plan and morsel-driven parallelism. It
manages a thread-pool of workers pinned to specific CPUs, pulling tasks
from local queues.
- `sparrow-transforms` will provide implementations of the "transforms"
(project, select, etc.) and a pipeline for executing the transforms.
- `sparrow-execution` will pull everything together to provide
partitioned execution.

This is part of #409.
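
The idea of workers pulling morsels from their own local queues can be sketched with the standard library alone. This is an illustrative sketch under several assumptions, not the `sparrow-scheduler` API: `Morsel` and `sum_by_partition` are hypothetical names, each worker's local queue is modeled as an `mpsc` channel, morsels are routed by `partition % num_workers`, and CPU pinning is omitted because it needs platform-specific support outside the standard library.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical morsel: a small batch of values belonging to one partition.
pub struct Morsel {
    pub partition: usize,
    pub values: Vec<i64>,
}

/// Sums the values of each partition using `num_workers` threads.
///
/// Every worker owns one queue, and all morsels of a given partition are
/// routed to the same worker, so no partition is processed by two threads.
pub fn sum_by_partition(morsels: Vec<Morsel>, num_workers: usize) -> HashMap<usize, i64> {
    assert!(num_workers > 0, "need at least one worker");

    let mut queues = Vec::with_capacity(num_workers);
    let mut workers = Vec::with_capacity(num_workers);
    for _ in 0..num_workers {
        // The channel acts as this worker's local queue.
        let (tx, rx) = mpsc::channel::<Morsel>();
        queues.push(tx);
        workers.push(thread::spawn(move || {
            let mut sums: HashMap<usize, i64> = HashMap::new();
            // The loop ends once the sending side of the queue is dropped.
            for morsel in rx {
                let batch_sum: i64 = morsel.values.iter().sum();
                *sums.entry(morsel.partition).or_insert(0) += batch_sum;
            }
            sums
        }));
    }

    // Route each morsel to the worker that owns its partition.
    for morsel in morsels {
        let worker = morsel.partition % num_workers;
        queues[worker].send(morsel).expect("worker hung up early");
    }
    // Closing the queues lets the workers drain them and finish.
    drop(queues);

    // Partitions are disjoint across workers, so merging is a plain union.
    let mut totals = HashMap::new();
    for worker in workers {
        totals.extend(worker.join().expect("worker panicked"));
    }
    totals
}
```

A production scheduler would typically also let idle workers take work from busy ones and would keep per-pipeline state; both are left out here to keep the routing idea visible.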