
Evaluation stage cannot read the model output? #5662

Open
FancyXun opened this issue Jul 11, 2024 · 11 comments

Comments

@FancyXun commented Jul 11, 2024

The YAML file is as follows:

dag:
  parties:
  - party_id: ['9999']
    role: guest
  - party_id: ['10000']
    role: host
  party_tasks:
    guest_9999:
      parties:
      - party_id: ['9999']
        role: guest
      tasks:
        reader_0:
          parameters: {name: breast_hetero_guest, namespace: experiment}
    host_10000:
      parties:
      - party_id: ['10000']
        role: host
      tasks:
        reader_0:
          parameters: {name: breast_hetero_host, namespace: experiment}
  stage: train
  tasks:
    eval_0:
      component_ref: evaluation
      dependent_tasks: [sbt_0]
      inputs:
        data:
          input_data:
            task_output_artifact:
            - output_artifact_key: train_data_output
              parties:
              - party_id: ['9999']
                role: guest
              producer_task: sbt_0
      parameters:
        label_column_name: null
        metrics: [auc]
        predict_column_name: null
      parties:
      - party_id: ['9999']
        role: guest
      stage: default

    psi_0:
      component_ref: psi
      dependent_tasks: [reader_0]
      inputs:
        data:
          input_data:
            task_output_artifact:
              output_artifact_key: output_data
              parties:
              - party_id: ['9999']
                role: guest
              - party_id: ['10000']
                role: host
              producer_task: reader_0
      parameters: {}
      stage: default
    reader_0:
      component_ref: reader
      parameters: {}
      stage: default
    sbt_0:
      component_ref: hetero_secureboost
      dependent_tasks: [psi_0]
      inputs:
        data:
          cv_data:
            task_output_artifact:
              output_artifact_key: output_data
              parties:
              - party_id: ['9999']
                role: guest
              - party_id: ['10000']
                role: host
              producer_task: psi_0
        model: {}
      parameters:
        cv_param: {n_splits: 3}
        gh_pack: true
        goss: false
        goss_start_iter: 0
        he_param: {key_length: 1024, kind: paillier}
        hist_sub: true
        l1: 0
        l2: 0.1
        learning_rate: 0.3
        max_bin: 32
        max_depth: 2
        min_child_weight: 1
        min_impurity_split: 0.01
        min_leaf_node: 1
        min_sample_split: 2
        num_class: 2
        num_trees: 2
        objective: binary:bce
        other_rate: 0.1
        split_info_pack: true
        top_rate: 0.2
      stage: cross_validation
schema_version: 2.0.0

After submitting:

[Screenshot: Screen Shot 2024-07-11 at 16 06 48]

The error is as follows:

[ERROR][2024-07-11 07:54:39,269][585110][_wraps.run][line:92]: Get data artifacts failed: {'job_id': '202407110739162543270', 'role': 'guest', 'party_id': '9999', 'task_name': 'sbt_0', 'output_key': 'train_data_output'}, response: {"code":2005,"message":"failed"}

It looks as if sbt produced no model output, so fate-server could not find it? I generated this YAML based on https://github.com/FederatedAI/FATE/blob/v2.0.0/examples/pipeline/hetero_secureboost/test_hetero_sbt_binary.py

@talkingwallace
Contributor

sbt does output a model. What does the job file you submitted look like? This looks like a misconfigured input/output setting.

@FancyXun
Author

> sbt does output a model. What does the job file you submitted look like? This looks like a misconfigured input/output setting.

What I submitted is exactly the YAML file I posted above @talkingwallace

@talkingwallace
Contributor

I see you are running in cv mode; cv mode produces no data output.

@FancyXun
Author

If I don't run cv mode, I hit the following bug; it looks like a problem introduced after I upgraded fate-spark to 2.1.
[Screenshot: Screen Shot 2024-07-12 at 12 05 55]

@FancyXun
Author

Because there is no 2.1 k8s image yet, but 2.0 has the Spark RDD partitioning problem, I swapped in part of the code that fixes that RDD partitioning issue; details here: #5656 @talkingwallace

@FancyXun
Author

> If I don't run cv mode, I hit the following bug; it looks like a problem introduced after I upgraded fate-spark to 2.1. [Screenshot: Screen Shot 2024-07-12 at 12 05 55]

The corresponding YAML file is as follows:

dag:
  parties:
  - party_id: ['9999']
    role: guest
  - party_id: ['10000']
    role: host
  party_tasks:
    guest_9999:
      parties:
      - party_id: ['9999']
        role: guest
      tasks:
        reader_0:
          parameters: {name: breast_hetero_guest, namespace: experiment}
    host_10000:
      parties:
      - party_id: ['10000']
        role: host
      tasks:
        reader_0:
          parameters: {name: breast_hetero_host, namespace: experiment}
  stage: train
  tasks:
    eval_0:
      component_ref: evaluation
      dependent_tasks: [sbt_0]
      inputs:
        data:
          input_data:
            task_output_artifact:
            - output_artifact_key: train_data_output
              parties:
              - party_id: ['9999']
                role: guest
              producer_task: sbt_0
      parameters:
        label_column_name: null
        metrics: [auc]
        predict_column_name: null
      parties:
      - party_id: ['9999']
        role: guest
      stage: default

    psi_0:
      component_ref: psi
      dependent_tasks: [reader_0]
      inputs:
        data:
          input_data:
            task_output_artifact:
              output_artifact_key: output_data
              parties:
              - party_id: ['9999']
                role: guest
              - party_id: ['10000']
                role: host
              producer_task: reader_0
      parameters: {}
      stage: default
    reader_0:
      component_ref: reader
      parameters: {}
      stage: default
    sbt_0:
      component_ref: hetero_secureboost
      dependent_tasks: [ psi_0 ]
      inputs:
        data:
          train_data:
            task_output_artifact:
              output_artifact_key: output_data
              parties:
                - party_id: [ '9999' ]
                  role: guest
                - party_id: [ '10000' ]
                  role: host
              producer_task: psi_0
        model: { }
      parameters:
        gh_pack: true
        goss: true
        goss_start_iter: 0
        he_param: { key_length: 1024, kind: paillier }
        hist_sub: true
        l1: 0
        l2: 0.1
        learning_rate: 0.3
        max_bin: 32
        max_depth: 2
        min_child_weight: 1
        min_impurity_split: 0.01
        min_leaf_node: 1
        min_sample_split: 2
        num_class: 2
        num_trees: 2
        objective: binary:bce
        other_rate: 0.1
        split_info_pack: true
        top_rate: 0.2
schema_version: 2.0.0

@FancyXun
Author

FancyXun commented Jul 12, 2024

To add more detail, I printed some intermediate results:

_balance_block_func = functools.partial(

block_table = block_table.mapPartitions(_balance_block_func, use_previous_behavior=False)

The RDD partition keys inside block_table are 0-7, but block_order_mappings only contains 0-5, so the lookup presumably raises a KeyError.

block_table RDD keys:
7                                                                               
0
4
2
5
6
3
1
block_order_mappings:
{0: {'start_index': 0, 'end_index': 123, 'start_block_id': 0, 'end_block_id': 0}, 1: {'start_index': 124, 'end_index': 206, 'start_block_id': 1, 'end_block_id': 1}, 2: {'start_index': 207, 'end_index': 295, 'start_block_id': 2, 'end_block_id': 2}, 3: {'start_index': 296, 'end_index': 362, 'start_block_id': 3, 'end_block_id': 3}, 4: {'start_index': 363, 'end_index': 428, 'start_block_id': 4, 'end_block_id': 4}, 5: {'start_index': 429, 'end_index': 568, 'start_block_id': 5, 'end_block_id': 6}}
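The mismatch above can be reproduced with a minimal sketch (not FATE's actual code, and the lookup function is hypothetical): assuming the balance step indexes block_order_mappings by each RDD partition key, any key the mapping never assigned fails immediately.

```python
# Minimal sketch of the reported KeyError: the balance step looks up each
# RDD partition key in block_order_mappings, so keys 6 and 7, which the
# mapping never assigned, cannot be resolved.
block_order_mappings = {
    0: {'start_block_id': 0, 'end_block_id': 0},
    1: {'start_block_id': 1, 'end_block_id': 1},
    2: {'start_block_id': 2, 'end_block_id': 2},
    3: {'start_block_id': 3, 'end_block_id': 3},
    4: {'start_block_id': 4, 'end_block_id': 4},
    5: {'start_block_id': 5, 'end_block_id': 6},
}
rdd_keys = [7, 0, 4, 2, 5, 6, 3, 1]  # partition keys observed in block_table

def balance_block(key, mappings):
    # Hypothetical stand-in for the real balance function: a plain dict
    # lookup, which raises KeyError for any unmapped partition key.
    return mappings[key]

missing = [k for k in rdd_keys if k not in block_order_mappings]
print(sorted(missing))  # partition keys that would trigger the KeyError
```

So the error is not about sbt's outputs at all: the block table was repartitioned into more partitions than block_order_mappings accounts for.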

@dylan-fan
Collaborator

In 2.1 spark mode your code was not replaced correctly. Replace the entire arch directory, or use the 2.1.1 image, which is available now.

@FancyXun
Author

> In 2.1 spark mode your code was not replaced correctly. Replace the entire arch directory, or use the 2.1.1 image, which is available now.

I am using this image: https://hub.docker.com/r/federatedai/fateflow-spark/tags , which is still at 2.0.0. Which one are you suggesting I switch to? Should the fateflow-spark image be replaced with https://hub.docker.com/r/federatedai/fateflow/tags ?

@FancyXun
Author

> In 2.1 spark mode your code was not replaced correctly. Replace the entire arch directory, or use the 2.1.1 image, which is available now.

I had already replaced that arch directory before; I will try the latest image then. Thanks!

@dylan-fan
Collaborator

Remember to restart the service. The error you hit above does exist on 2.0 and was fixed in 2.1; you can check the arch commit history on GitHub.
