
Question about updating the Actor #3

Open
Coastchb opened this issue Jul 4, 2023 · 2 comments

Comments

@Coastchb

Coastchb commented Jul 4, 2023

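I'd like to ask about a few details:
1. Above Equation 15 in the paper, there is the following description:
Before the TD error δ converges to threshold ε, we update the current actor networks parameterized by θ_k through the gradients back-propagation of loss function for each tower layer after the forward process of each batch transitions from B.
Does this mean that while the Critic is being updated with the TD algorithm, the Actor is also updated at the same time using Equation 15?
However, the implementation details in Section 3.2 say: "In order to improve data efficiency, we initialize the Actor with pretrained parameters from the MTL models and keep them frozen until the Critics converge." So the pretrained Actor parameters stay frozen until the Critics converge? Don't these two passages contradict each other? Could you explain the exact procedure?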

2. The implementation details in Section 3.2 say: "Then we multiply the critic value in the total loss and retrain the MTL model."
Does this mean plugging the Q-values into Equation 10 and fine-tuning the Actor?
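To make the reading in question 2 concrete, here is a minimal sketch of a critic-weighted multi-task loss. It assumes per-task binary cross-entropy losses (for example click and conversion) and one critic value per item and task that simply multiplies that item's loss term. The function names, task keys and per-item granularity are hypothetical, and whatever transformation Equation 10 may apply to Q before multiplying is not reproduced here.

```python
import math
from typing import Dict, List


def bce(label: int, prob: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy of one prediction; prob is clipped to avoid log(0)."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def critic_weighted_loss(
    labels: Dict[str, List[int]],
    probs: Dict[str, List[float]],
    q_values: Dict[str, List[float]],
) -> float:
    """Sum over tasks of the mean per-item BCE, each item's term scaled by its critic value.

    All three dicts map a (hypothetical) task name such as "click" or "convert"
    to per-item lists of equal length.
    """
    total = 0.0
    for task, ys in labels.items():
        ps, qs = probs[task], q_values[task]
        total += sum(q * bce(y, p) for y, p, q in zip(ys, ps, qs)) / len(ys)
    return total
```

With every critic value set to 1.0 this reduces to the ordinary unweighted multi-task BCE, which makes it easy to check that only the weighting changes when the critic is plugged in.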

Looking forward to your reply, thanks!

@LZR-S
Collaborator

LZR-S commented Jul 7, 2023

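Hi, thanks for your interest in RMTL. For the first question: in practice, we first train the actor network, then freeze the actor's parameters and train the critic network until it converges. For the second question, your understanding is correct: we fine-tune the actor using the updated total loss.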
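The three phases in this answer (train the actor, freeze it and train the critic to convergence, then fine-tune the actor with the updated loss) can be walked through on a toy problem. This sketch uses a one-weight logistic actor and a tabular critic evaluated with TD(0), stopping once the mean |δ| drops below a threshold ε. Everything concrete here is an assumption for illustration, not RMTL's actual networks or Equation 10: the reward (negative BCE of the frozen actor), the discount 0.9, the learning rates, and the hypothetical mapping 1 − λ·V from critic value to loss weight.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce(y: int, p: float, eps: float = 1e-7) -> float:
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def train_actor(w, sessions, lr=0.5, epochs=200, weights=None):
    """Gradient descent on (optionally weighted) mean BCE for a one-weight logistic actor.

    weights: optional per-(session, step) loss weights, e.g. derived from the critic.
    """
    for _ in range(epochs):
        grad, n = 0.0, 0
        for i, sess in enumerate(sessions):
            for t, (x, y) in enumerate(sess):
                c = 1.0 if weights is None else weights[i][t]
                grad += c * (sigmoid(w * x) - y) * x  # dBCE/dw for a logistic model
                n += 1
        w -= lr * grad / n
    return w


def train_critic(w, sessions, gamma=0.9, lr=0.5, eps=1e-6, max_iters=10_000):
    """TD(0) evaluation of the frozen actor w; returns v[i][t] and the final mean |delta|.

    w is only read here, never updated: this is the "actor frozen" phase.
    """
    v = [[0.0] * len(sess) for sess in sessions]
    mean_abs_delta = float("inf")
    for _ in range(max_iters):
        total, n = 0.0, 0
        for i, sess in enumerate(sessions):
            for t, (x, y) in enumerate(sess):
                r = -bce(y, sigmoid(w * x))  # hypothetical reward: negative BCE of frozen actor
                nxt = v[i][t + 1] if t + 1 < len(sess) else 0.0  # 0 after the last item
                delta = r + gamma * nxt - v[i][t]
                v[i][t] += lr * delta
                total += abs(delta)
                n += 1
        mean_abs_delta = total / n
        if mean_abs_delta < eps:  # stop once the TD error falls below the threshold
            break
    return v, mean_abs_delta


def staged_training(sessions, lam=0.1):
    w = train_actor(0.0, sessions)                # phase 1: train the actor alone
    v, final_delta = train_critic(w, sessions)    # phase 2: actor frozen, critic to convergence
    # Hypothetical mapping from critic value to loss weight (not taken from the paper).
    weights = [[1.0 - lam * q for q in row] for row in v]
    w_ft = train_actor(w, sessions, epochs=50, weights=weights)  # phase 3: fine-tune actor
    return w, v, final_delta, w_ft
```

Calling `staged_training` on a couple of toy sessions, each a list of `(feature, click_label)` pairs, returns the pretrained actor weight, the critic table, the final mean TD error and the fine-tuned weight. Phase 2 never changes `w`, which is what freezing the actor amounts to in this sketch.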

@Coastchb
Author

@LZR-S Thanks for the reply.

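A few more detailed questions:
(1) Is each sample structured like {user-side attribute features such as gender, age, ...; attribute features of the 1st item in the session; attribute features of the 2nd item in the session; ...; the user's feedback on each item in the session (clicked or not, converted or not)}?
(2) If the model input does not use sequential features (the user's historical behavior sequence), are the model inputs for all predictions within the same session then exactly identical?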
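As a hypothetical illustration of the layout proposed in (1), the sketch below stores one session as user-side features, a list of per-item features, and per-item click and conversion labels. It assumes each step's model input is just the user-side features concatenated with that item's features, with no behavior-sequence features (the setting of question (2)). The class, field and feature names are illustrative and not taken from the RMTL code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SessionSample:
    """One session in the layout described in (1); all names are hypothetical."""
    user_features: Dict[str, float]        # e.g. gender, age, ...
    item_features: List[Dict[str, float]]  # one dict per item, in session order
    clicks: List[int]                      # per-item click label (0/1)
    conversions: List[int]                 # per-item conversion label (0/1)

    def __post_init__(self):
        n = len(self.item_features)
        if len(self.clicks) != n or len(self.conversions) != n:
            raise ValueError("each item needs exactly one click and one conversion label")


def step_inputs(sample: SessionSample) -> List[List[float]]:
    """Per-item model input: user features followed by that item's features.

    Keys are sorted so the feature order is deterministic. With no sequence
    features, the user part is identical at every step and only the item part changes.
    """
    user_vec = [sample.user_features[k] for k in sorted(sample.user_features)]
    return [user_vec + [item[k] for k in sorted(item)] for item in sample.item_features]
```

Under this assumption, the user-side half of the input repeats across a session, so whether two steps receive identical inputs depends on whether per-item features are fed to the model.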
