
How Paper input matches the code state s(t)? #141

Open
ahmad-hl opened this issue Dec 9, 2021 · 3 comments

Comments


ahmad-hl commented Dec 9, 2021

Dear Hongzi,

I was trying to figure out how the RL agent's state s(t) in the code matches the input described in the paper.

Input: After the download of each chunk t, Pensieve's learning agent takes state inputs st = (xt, τt, nt, bt, ct, lt) to its neural networks. xt is the network throughput measurements for the past k video chunks; τt is the download time of the past k video chunks; nt is a vector of m available sizes for the next video chunk; bt is the current buffer level; ct is the number of chunks remaining in the video; and lt is the bitrate at which the last chunk was downloaded.
First of all, which code package do we need to look at, multi-video_sim or sim?

When I look at sim, I see in def agent that the input state is:

0: last quality ?
1: buffer_size (bt)
2: chunk_size ?
3: delay ? is it the download time (τt)?
4: next_chunk_sizes (nt)
5: remain_chunks (ct)

Could you please illustrate the matching, and the actor & critic networks (Figure 5), if possible?

@hongzimao (Owner)

multi-video sim is for agents that can generalize to videos with different numbers and levels of bitrate encodings.

From what you wrote above, your understanding of the code and the paper looks correct to me.
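
For reference, here is roughly how the state matrix gets filled after each chunk download in sim (a paraphrased sketch rather than the exact code; the constant names and values below, e.g. BUFFER_NORM_FACTOR, M_IN_K, CHUNK_TIL_VIDEO_END_CAP, follow the repo and may differ in your upgraded version):

import numpy as np

# Assumed constants, following the sim code (values may differ in your copy).
S_INFO, S_LEN, A_DIM = 6, 8, 6                        # state rows, history length k, number of bitrates
VIDEO_BIT_RATE = [300, 750, 1200, 1850, 2850, 4300]   # Kbps
BUFFER_NORM_FACTOR = 10.0                             # seconds
M_IN_K = 1000.0
CHUNK_TIL_VIDEO_END_CAP = 48.0

def fill_state(state, bit_rate, buffer_size, video_chunk_size, delay,
               next_video_chunk_sizes, video_chunk_remain):
    state = np.roll(state, -1, axis=1)  # shift the k-chunk history left by one slot
    state[0, -1] = VIDEO_BIT_RATE[bit_rate] / float(np.max(VIDEO_BIT_RATE))  # lt: last chunk's bitrate
    state[1, -1] = buffer_size / BUFFER_NORM_FACTOR                          # bt: current buffer level
    state[2, -1] = float(video_chunk_size) / float(delay) / M_IN_K           # xt: throughput of last chunk
    state[3, -1] = float(delay) / M_IN_K / BUFFER_NORM_FACTOR                # τt: download time of last chunk
    state[4, :A_DIM] = np.array(next_video_chunk_sizes) / M_IN_K / M_IN_K    # nt: sizes of the next chunk
    state[5, -1] = np.minimum(video_chunk_remain, CHUNK_TIL_VIDEO_END_CAP) \
                   / float(CHUNK_TIL_VIDEO_END_CAP)                          # ct: chunks remaining
    return state

So rows 0-5 of the state correspond to lt, bt, xt, τt, nt, and ct from the paper, with k = S_LEN past chunks kept per row.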


ahmad-hl commented Dec 14, 2021

I have upgraded the code to work on Python 3.8 and used cooked_traces to train the multi-agent RL model in the sim dir.
Given that I'm using a computer with 2 GPUs and TensorBoard to monitor training, how much time is required for the model to converge?
How do you know whether the model has converged?

Can you also explain the main components in the objective function?

# Compute the objective (log action_vector and entropy)
self.obj = tf.reduce_sum(tf.multiply(
               tf.log(tf.reduce_sum(tf.multiply(self.out, self.acts),
                                    axis=1, keepdims=True)),
               -self.act_grad_weights)) \
           + ENTROPY_WEIGHT * tf.reduce_sum(tf.multiply(self.out, tf.log(self.out + ENTROPY_EPS)))

@hongzimao (Owner)

Thanks again for upgrading the codebase. The training wall time really depends on your physical hardware. You can monitor the learning curve and see when the performance on the validation set stabilizes. To determine whether the model has converged, you can use a heuristic such as the relative performance not improving much over the past xxx iterations. At the time, we just eyeballed it.
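
A minimal sketch of that kind of stopping rule, assuming you record the mean validation reward once per testing epoch (the function name and thresholds here are made up for illustration):

def has_converged(val_rewards, window=100, tol=0.01):
    # Heuristic: the mean validation reward over the last `window` epochs
    # improved by less than `tol` (relative) compared to the window before it.
    if len(val_rewards) < 2 * window:
        return False
    prev = sum(val_rewards[-2 * window:-window]) / window
    curr = sum(val_rewards[-window:]) / window
    return (curr - prev) / (abs(prev) + 1e-8) < tol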

The main objective is just the policy gradient expression (the expression after the gradient operator). It's basically log pi_t * (R_t - baseline_t) + an entropy regularizer, summed over the training batch.
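
Written out with NumPy, that objective looks roughly like this (a sketch for clarity only, not the training code; out is the batch of action distributions from the policy, acts the one-hot actions taken, and adv stands in for R_t - baseline_t):

import numpy as np

def pg_objective(out, acts, adv, entropy_weight=0.5, eps=1e-6):
    # log pi(a_t | s_t): log-probability of the action actually taken
    log_pi = np.log(np.sum(out * acts, axis=1, keepdims=True) + eps)
    # policy gradient term: log pi_t * (R_t - baseline_t), summed over the batch
    pg_term = np.sum(log_pi * adv)
    # entropy regularizer that encourages exploration
    entropy = -np.sum(out * np.log(out + eps))
    return pg_term + entropy_weight * entropy

Note that the TensorFlow snippet above carries the opposite sign (-self.act_grad_weights and +out * log(out)), so minimizing that expression is equivalent to maximizing this one.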

Hope these help.
