[CORE][VL] Add OffloadProject to offload project having input_file_name's support considered #6200

Merged
merged 7 commits into from
Jun 28, 2024

@gaoyangxiaozhu gaoyangxiaozhu commented Jun 24, 2024

What changes were proposed in this pull request?

This PR improves the logic that tries to offload a project containing input_file_name to native for the Velox backend.
It basically follows @zhztheplayer's suggestion to extend the ProjectExec offload logic:

Add a rule OffloadInputFileName (or something) to:
  1. Match on the Project(input_file_name) + Scan pattern
  2. Check validity
  3. If eligible, replace the pattern with ProjectTransformer(input_file_name) + ScanTransformer(input_file_name)
  4. If not eligible, do nothing

The full decision logic of OffloadProject is:

  1. If a ProjectExec is transformable, it can be offloaded.
  2. If a ProjectExec is not transformable:
    • If it has no input_file_name expression, it can't be offloaded.

    • If it has an input_file_name expression:

      • If it is still not transformable after input_file_name is removed, it can't be offloaded.

      • If it is transformable after input_file_name is removed:

        • If it has no scan child, or the scan child is not transformable, it can't be offloaded.

        • If it has a scan child and the scan child is transformable, it can be offloaded after the project + scan are rewritten so that input_file_name is replaced with the related _metadata column.
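
The decision tree above can be modeled as a small predicate. This is only an illustrative sketch in Python, not the Scala code in this PR: the `Scan` and `Project` classes and every field name are hypothetical stand-ins for the validation results that the real rule would compute on Spark plan nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scan:
    # Hypothetical flag: whether the scan passes native validation.
    transformable: bool


@dataclass
class Project:
    # Hypothetical flags standing in for validation results.
    transformable: bool                           # project validates as written
    has_input_file_name: bool                     # project uses input_file_name
    transformable_without_input_file_name: bool   # validates once the expression is removed
    scan_child: Optional[Scan] = None             # the scan child, if there is one


def can_offload(project: Project) -> bool:
    """Mirror the numbered decision list, one branch per bullet."""
    if project.transformable:
        return True
    if not project.has_input_file_name:
        return False
    if not project.transformable_without_input_file_name:
        return False
    scan = project.scan_child
    return scan is not None and scan.transformable
```

For example, `can_offload(Project(False, True, True, Scan(True)))` takes the last branch: the project only fails validation because of input_file_name, and its scan child is transformable.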

(Fixes: #6157)

How was this patch tested?

Manually and with unit tests (UT).

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

Run Gluten Clickhouse CI

@gaoyangxiaozhu gaoyangxiaozhu changed the title [VL][WIP] Introduce offloadProject [VL] Introduce offloadProject to offload signle project node Jun 25, 2024
gaoyangxiaozhu commented Jun 25, 2024

@zhztheplayer I checked the Spark code: if a project node has an input_file_name expression, it can have at most one data source scan child node: https://github.com/apache/spark/blob/e459674127e7b21e2767cc62d10ea6f1f941936c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala#L506. That makes it easy for us to implement a single rule/method that decides whether to offload or fall back a project node that has input_file_name.

I have followed your suggestion to modify OffloadSingleNode. The difference is that I don't add a new offload path only for the project + input_file_name scenario. Instead, I extend the ProjectExec offload logic with a code path for the case where a project is not transformable but matches the Project(input_file_name) + Scan pattern and the scan is transformable; in that case we do the conversion and still offload the project.
Please help review, thanks.
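
A rough illustration of the conversion step described above, written as a Python sketch and not taken from the Gluten code. Expressions are modeled as plain strings, the `Scan` and `Project` classes are hypothetical, and `METADATA_COLUMN` is a placeholder name for whatever column the scan actually exposes.

```python
from dataclasses import dataclass, replace
from typing import Tuple

INPUT_FILE_NAME = "input_file_name()"
# Placeholder name, an assumption for this sketch only.
METADATA_COLUMN = "_metadata_file_name"


@dataclass(frozen=True)
class Scan:
    output: Tuple[str, ...]   # columns produced by the scan


@dataclass(frozen=True)
class Project:
    exprs: Tuple[str, ...]    # projected expressions, modeled as strings
    child: Scan               # the single scan child


def rewrite(project: Project) -> Project:
    """Swap input_file_name for a column that the scan itself produces."""
    if INPUT_FILE_NAME not in project.exprs:
        return project  # nothing to convert
    scan = project.child
    if METADATA_COLUMN not in scan.output:
        # Ask the scan to emit the file-name column as an extra output.
        scan = replace(scan, output=scan.output + (METADATA_COLUMN,))
    exprs = tuple(
        METADATA_COLUMN if e == INPUT_FILE_NAME else e for e in project.exprs
    )
    return Project(exprs, scan)
```

Because the project is known to have at most one scan child, the sketch rewrites the pair in one step and leaves projects without input_file_name untouched.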

@zhztheplayer zhztheplayer left a comment

Almost in nice shape. Thank you for working on this.

@gaoyangxiaozhu
Contributor Author

kindly ping @zhztheplayer / @xumingming

@gaoyangxiaozhu
Contributor Author

any new comments @zhztheplayer / @xumingming thanks ?

@zhztheplayer
Member

> any new comments @zhztheplayer / @xumingming thanks ?

In case you missed the latest comments #6200 (comment)

@zhztheplayer zhztheplayer left a comment

Thanks. Waiting for CI.

@gaoyangxiaozhu
Contributor Author

It looks like the CH CI still fails after re-triggering. I just synced the latest code to re-run the CI. @PHILO-HE / @zhztheplayer

@xumingming
Contributor

nit: There is a typo in the PR title; better to fix it: Introduce offloadProject to offload signle project node -> Introduce OffloadProject to offload single project node

@gaoyangxiaozhu gaoyangxiaozhu changed the title [VL] Introduce offloadProject to offload signle project node [VL] Introduce OffloadProject to offload single project node Jun 27, 2024
gaoyangxiaozhu commented Jun 27, 2024

@zhztheplayer could you help merge ?

@zhztheplayer
Member

> @zhztheplayer could you help merge ?

Sure. Let's wait for remaining CI.

@gaoyangxiaozhu
Contributor Author

@zhztheplayer the CI has passed.

@zhztheplayer zhztheplayer merged commit 86449d0 into apache:main Jun 28, 2024
40 checks passed
@zhztheplayer zhztheplayer changed the title [VL] Introduce OffloadProject to offload single project node [CORE][VL] Add OffloadProject to offload project having input_file_name's support considered Jun 28, 2024
@GlutenPerfBot
Contributor

===== Performance report for TPCDS SF2000 with Velox backend, for reference only ====

query log/native_master_06_29_2024_time.csv log/native_master_06_27_2024_22dc4fdcb_time.csv difference percentage
q1 14.68 14.69 0.011 100.07%
q2 15.29 19.29 3.998 126.14%
q3 4.25 4.34 0.093 102.20%
q4 63.47 63.59 0.117 100.18%
q5 7.83 6.46 -1.363 82.58%
q6 3.61 3.58 -0.034 99.06%
q7 5.84 4.12 -1.715 70.63%
q8 3.61 5.08 1.472 140.81%
q9 17.56 18.54 0.980 105.58%
q10 11.33 11.30 -0.022 99.81%
q11 39.56 36.94 -2.618 93.38%
q12 2.42 2.41 -0.002 99.91%
q13 5.32 5.68 0.357 106.70%
q14a 43.67 41.94 -1.731 96.04%
q14b 40.67 44.07 3.400 108.36%
q15 2.91 2.84 -0.068 97.67%
q16 40.21 39.50 -0.710 98.23%
q17 6.07 4.81 -1.256 79.30%
q18 6.83 6.24 -0.587 91.41%
q19 2.35 2.39 0.046 101.98%
q20 1.34 1.42 0.081 106.03%
q21 1.08 5.68 4.593 524.09%
q22 8.88 8.70 -0.183 97.94%
q23a 82.19 87.00 4.816 105.86%
q23b 101.09 102.16 1.070 101.06%
q24a 81.75 75.38 -6.366 92.21%
q24b 74.22 76.71 2.490 103.35%
q25 4.48 5.95 1.477 132.99%
q26 2.81 2.97 0.156 105.56%
q27 3.11 3.30 0.187 106.02%
q28 20.98 23.56 2.575 112.27%
q29 7.13 7.70 0.571 108.01%
q30 4.36 6.47 2.115 148.56%
q31 6.77 6.23 -0.531 92.16%
q32 1.10 1.10 0.002 100.17%
q33 4.84 4.84 0.003 100.07%
q34 4.07 3.54 -0.532 86.93%
q35 6.58 6.47 -0.113 98.28%
q36 3.48 3.30 -0.181 94.79%
q37 5.53 3.95 -1.579 71.45%
q38 11.66 11.86 0.206 101.77%
q39a 3.30 3.42 0.124 103.75%
q39b 2.90 2.87 -0.026 99.12%
q40 3.75 3.75 -0.003 99.93%
q41 0.63 0.62 -0.013 98.00%
q42 2.41 0.92 -1.485 38.33%
q43 3.91 3.52 -0.391 90.02%
q44 12.72 8.25 -4.469 64.87%
q45 3.62 3.67 0.049 101.36%
q46 4.78 3.31 -1.470 69.22%
q47 14.26 14.71 0.443 103.11%
q48 4.59 4.57 -0.015 99.68%
q49 11.02 9.74 -1.280 88.38%
q50 22.73 23.70 0.961 104.23%
q51 8.77 8.40 -0.364 95.85%
q52 1.12 1.08 -0.036 96.75%
q53 2.07 2.03 -0.042 97.97%
q54 3.27 3.45 0.178 105.43%
q55 1.02 1.02 -0.002 99.81%
q56 4.45 4.46 0.012 100.27%
q57 9.30 8.48 -0.823 91.15%
q58 2.92 2.69 -0.231 92.10%
q59 13.93 13.96 0.026 100.18%
q60 4.80 6.49 1.684 135.06%
q61 5.53 5.72 0.199 103.60%
q62 4.74 4.55 -0.188 96.02%
q63 2.08 2.33 0.251 112.09%
q64 48.99 49.04 0.051 100.10%
q65 14.70 13.81 -0.892 93.93%
q66 3.48 5.59 2.112 160.67%
q67 346.30 349.42 3.125 100.90%
q68 3.73 3.45 -0.280 92.48%
q69 9.68 6.40 -3.280 66.13%
q70 7.60 13.50 5.906 177.74%
q71 2.31 2.02 -0.293 87.35%
q72 190.50 187.82 -2.680 98.59%
q73 2.17 4.45 2.284 205.26%
q74 21.74 22.19 0.454 102.09%
q75 23.35 23.03 -0.316 98.64%
q76 9.11 9.14 0.032 100.35%
q77 2.24 2.19 -0.046 97.92%
q78 39.38 41.33 1.955 104.96%
q79 3.52 3.57 0.048 101.37%
q80 11.16 11.02 -0.144 98.71%
q81 5.19 5.29 0.101 101.95%
q82 6.97 9.55 2.575 136.93%
q83 1.44 1.60 0.160 111.06%
q84 2.87 2.78 -0.089 96.89%
q85 6.78 7.01 0.231 103.40%
q86 3.22 3.21 -0.014 99.56%
q87 13.57 12.23 -1.340 90.13%
q88 28.30 27.33 -0.968 96.58%
q89 3.25 3.36 0.107 103.29%
q90 4.42 10.11 5.687 228.74%
q91 2.62 2.62 -0.003 99.90%
q92 1.35 1.35 0.005 100.34%
q93 30.79 28.71 -2.080 93.24%
q94 21.30 21.28 -0.016 99.93%
q9 81.43 85.02 3.599 104.42%
q5 3.72 3.53 -0.188 94.95%
q96 12.31 12.38 0.072 100.59%
q97 1.85 2.16 0.313 116.92%
q98 9.55 8.93 -0.617 93.54%
q99 9.55 8.93 -0.617 93.54%
total 1910.38 1930.27 19.886 101.04%

@GlutenPerfBot
Contributor

===== Performance report for TPCDS SF2000 with Velox backend, for reference only ====

query log/native_master_06_30_2024_time.csv log/native_master_06_29_2024_86449d0c8_time.csv difference percentage
q1 14.83 14.68 -0.153 98.97%
q2 14.85 15.29 0.437 102.94%
q3 4.21 4.25 0.034 100.80%
q4 63.51 63.47 -0.036 99.94%
q5 7.38 7.83 0.448 106.07%
q6 2.32 3.61 1.299 156.10%
q7 6.03 5.84 -0.195 96.77%
q8 5.14 3.61 -1.537 70.13%
q9 21.68 17.56 -4.118 81.01%
q10 11.58 11.33 -0.254 97.81%
q11 36.42 39.56 3.135 108.61%
q12 2.59 2.42 -0.180 93.08%
q13 9.04 5.32 -3.723 58.83%
q14a 41.82 43.67 1.856 104.44%
q14b 38.81 40.67 1.860 104.79%
q15 3.62 2.91 -0.707 80.47%
q16 39.65 40.21 0.553 101.40%
q17 5.07 6.07 0.999 119.71%
q18 6.20 6.83 0.633 110.21%
q19 2.34 2.35 0.005 100.20%
q20 1.40 1.34 -0.060 95.72%
q21 1.02 1.08 0.060 105.91%
q22 7.79 8.88 1.089 113.97%
q23a 84.02 82.19 -1.833 97.82%
q23b 101.13 101.09 -0.040 99.96%
q24a 69.39 81.75 12.355 117.81%
q24b 72.20 74.22 2.016 102.79%
q25 4.35 4.48 0.129 102.96%
q26 2.85 2.81 -0.043 98.49%
q27 3.15 3.11 -0.035 98.90%
q28 28.74 20.98 -7.755 73.01%
q29 7.11 7.13 0.023 100.32%
q30 4.19 4.36 0.164 103.91%
q31 6.70 6.77 0.062 100.92%
q32 2.13 1.10 -1.031 51.51%
q33 4.74 4.84 0.099 102.09%
q34 3.65 4.07 0.426 111.67%
q35 6.57 6.58 0.014 100.21%
q36 3.35 3.48 0.127 103.80%
q37 4.05 5.53 1.480 136.52%
q38 11.85 11.66 -0.195 98.35%
q39a 5.69 3.30 -2.388 58.02%
q39b 3.27 2.90 -0.378 88.46%
q40 3.73 3.75 0.019 100.51%
q41 0.62 0.63 0.007 101.19%
q42 0.94 2.41 1.467 256.11%
q43 3.89 3.91 0.025 100.65%
q44 8.62 12.72 4.104 147.64%
q45 3.40 3.62 0.222 106.52%
q46 3.22 4.78 1.554 148.24%
q47 14.40 14.26 -0.133 99.07%
q48 4.58 4.59 0.007 100.16%
q49 9.80 11.02 1.216 112.41%
q50 24.23 22.73 -1.497 93.82%
q51 8.51 8.77 0.252 102.96%
q52 1.04 1.12 0.081 107.80%
q53 2.01 2.07 0.059 102.93%
q54 3.23 3.27 0.045 101.39%
q55 0.99 1.02 0.030 103.00%
q56 4.53 4.45 -0.087 98.07%
q57 9.28 9.30 0.020 100.22%
q58 2.59 2.92 0.329 112.69%
q59 16.43 13.93 -2.499 84.79%
q60 4.83 4.80 -0.025 99.48%
q61 5.67 5.53 -0.140 97.53%
q62 4.76 4.74 -0.018 99.62%
q63 2.32 2.08 -0.239 89.69%
q64 49.99 48.99 -1.008 97.98%
q65 13.70 14.70 0.998 107.28%
q66 4.25 3.48 -0.770 81.88%
q67 355.08 346.30 -8.779 97.53%
q68 3.65 3.73 0.077 102.12%
q69 6.25 9.68 3.427 154.80%
q70 8.60 7.60 -1.006 88.30%
q71 2.45 2.31 -0.139 94.32%
q72 188.05 190.50 2.449 101.30%
q73 2.26 2.17 -0.092 95.94%
q74 21.26 21.74 0.478 102.25%
q75 23.20 23.35 0.148 100.64%
q76 9.42 9.11 -0.314 96.66%
q77 2.21 2.24 0.026 101.20%
q78 41.97 39.38 -2.586 93.84%
q79 3.51 3.52 0.015 100.42%
q80 11.16 11.16 0.004 100.03%
q81 5.19 5.19 -0.005 99.91%
q82 10.14 6.97 -3.171 68.74%
q83 1.55 1.44 -0.111 92.86%
q84 2.68 2.87 0.189 107.06%
q85 6.54 6.78 0.245 103.75%
q86 3.21 3.22 0.009 100.29%
q87 12.09 13.57 1.475 112.19%
q88 24.76 28.30 3.537 114.28%
q89 3.25 3.25 0.002 100.06%
q90 9.55 4.42 -5.130 46.27%
q91 5.39 2.62 -2.771 48.59%
q92 1.40 1.35 -0.048 96.58%
q93 28.51 30.79 2.283 108.01%
q94 21.60 21.30 -0.300 98.61%
q9 83.91 81.43 -2.483 97.04%
q5 3.78 3.72 -0.063 98.32%
q96 12.34 12.31 -0.038 99.69%
q97 1.96 1.85 -0.108 94.49%
q98 9.18 9.55 0.371 104.04%
q99 9.18 9.55 0.371 104.04%
total 1914.13 1910.38 -3.749 99.80%

@GlutenPerfBot
Contributor

===== Performance report for TPCH SF2000 with Velox backend, for reference only ====

query log/native_master_06_30_2024_time.csv log/native_master_06_25_2024_524434826_time.csv difference percentage
q1 35.24 37.64 2.398 106.81%
q2 23.80 23.31 -0.493 97.93%
q3 40.97 40.76 -0.213 99.48%
q4 32.74 32.85 0.115 100.35%
q5 70.02 72.82 2.803 104.00%
q6 7.93 10.28 2.353 129.69%
q7 82.49 80.76 -1.732 97.90%
q8 83.59 85.49 1.898 102.27%
q9 118.63 118.88 0.255 100.22%
q10 49.82 44.99 -4.824 90.32%
q11 20.68 20.46 -0.224 98.92%
q12 25.51 24.64 -0.874 96.58%
q13 38.43 39.04 0.602 101.57%
q14 17.87 19.01 1.145 106.41%
q15 30.75 33.57 2.815 109.15%
q16 14.20 14.18 -0.020 99.86%
q17 104.33 103.92 -0.405 99.61%
q18 148.98 148.88 -0.099 99.93%
q19 14.66 15.08 0.421 102.88%
q20 29.28 31.11 1.825 106.23%
q21 261.77 259.47 -2.303 99.12%
q22 12.54 12.59 0.043 100.34%
total 1264.22 1269.71 5.488 100.43%

Successfully merging this pull request may close these issues.

[VL] Improve the implementation of Spark function input_file_name
4 participants