XLA compilation error of a TARF payoff with MC #73
Hi @arthurpham, thanks for reaching out! The issue you are referring to is a TensorFlow (TF) issue, as you implicitly have a while loop inside a vectorized map. I just checked internally and it seems that someone is on it, but feel free to file the bug with the TF team directly. Could you please give me edit rights for the colab? In the meantime, I have fiddled a bit with your my_function, adding a batched version of the calculation. This makes the code a bit less efficient on a CPU device but clearly improves GPU performance. It is also XLA-compatible now (runs in < 70 ms for me on a T4 device).

def my_function_new(paths):
  # Shape [num_timesteps, num_samples]
  paths = tf.transpose(paths)
  cur_spot = paths[0]
  total = tf.zeros_like(cur_spot)
  discounted_payoff = tf.zeros_like(cur_spot)
  df = tf.constant(1.0, dtype=tf.float64)
  is_active = tf.ones([num_samples], dtype=tf.bool)
  cashflow = tf.zeros_like(cur_spot)
  i = tf.constant(0, dtype=tf.int32)

  # Explicitly define the while_loop.
  def cond(i, is_active, cashflow, total, discounted_payoff):
    return i < num_timesteps

  def body(i, is_active, cashflow, total, discounted_payoff):
    # Here Tensors are of shape `[num_samples]`.
    cur_spot = paths[i]
    new_is_active = K_knockout > cur_spot
    # A cashflow is fixed whenever the spot leaves [K_lower, K_upper).
    add_cashflow = tf.logical_or(K_upper <= cur_spot, cur_spot < K_lower)
    new_cashflow = tf.where(
        K_upper <= cur_spot,
        cur_spot - strike,
        cashflow)
    new_cashflow = tf.where(
        cur_spot < K_lower,
        step_up_ratio * (cur_spot - strike),
        new_cashflow)
    new_is_active = tf.where(
        add_cashflow,
        tf.where(total + new_cashflow >= tarf_target,
                 False, new_is_active),
        new_is_active)
    new_cashflow = tf.where(
        add_cashflow,
        tf.where(total + new_cashflow >= tarf_target,
                 tarf_target - total, new_cashflow),
        new_cashflow)
    new_total = tf.where(add_cashflow, total + new_cashflow, total)
    new_discounted_payoff = tf.where(
        add_cashflow,
        discounted_payoff + df * new_cashflow,
        discounted_payoff)
    # Update values only if still active.
    new_cashflow = tf.where(is_active, new_cashflow, cashflow)
    new_total = tf.where(is_active, new_total, total)
    new_discounted_payoff = tf.where(is_active,
                                     new_discounted_payoff,
                                     discounted_payoff)
    new_is_active = tf.where(is_active, new_is_active, is_active)
    return (i + 1, new_is_active, new_cashflow,
            new_total, new_discounted_payoff)

  _, is_active, cashflow, total, discounted_payoff = tf.while_loop(
      cond, body, (i, is_active, cashflow, total, discounted_payoff))
  return discounted_payoff

Then there is no need for the vectorized map; just call:

payoffs = my_function_new(reshaped_paths)

Please let me know if that makes sense. Would you be interested in cleaning up the code and contributing it to the library? I have not checked the maths, so testing is needed.
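Since the maths has not been checked, one way to test the batched TF version is to mirror its logic sequentially in NumPy and compare outputs on the same paths. The sketch below is not from the thread; its parameters mirror the globals used in the TF code above (strike, K_lower, K_upper, K_knockout, step_up_ratio, tarf_target), which is an assumption about their intended meaning:

```python
import numpy as np

def tarf_payoff_np(paths, strike, K_lower, K_upper, K_knockout,
                   step_up_ratio, tarf_target, df=1.0):
    """Sequential NumPy mirror of my_function_new, for cross-checking.

    paths: array of shape [num_samples, num_timesteps].
    Returns the discounted payoff per path.
    """
    num_samples, num_timesteps = paths.shape
    total = np.zeros(num_samples)
    payoff = np.zeros(num_samples)
    cashflow = np.zeros(num_samples)
    is_active = np.ones(num_samples, dtype=bool)
    for i in range(num_timesteps):
        s = paths[:, i]
        new_active = K_knockout > s
        # A cashflow is fixed whenever the spot leaves [K_lower, K_upper).
        fixing = (K_upper <= s) | (s < K_lower)
        cf = np.where(K_upper <= s, s - strike, cashflow)
        cf = np.where(s < K_lower, step_up_ratio * (s - strike), cf)
        # Cap the running total at the TARF target and deactivate.
        hit_target = total + cf >= tarf_target
        new_active = np.where(fixing & hit_target, False, new_active)
        cf = np.where(fixing & hit_target, tarf_target - total, cf)
        new_total = np.where(fixing, total + cf, total)
        new_payoff = np.where(fixing, payoff + df * cf, payoff)
        # Freeze paths that were already knocked out.
        cashflow = np.where(is_active, cf, cashflow)
        total = np.where(is_active, new_total, total)
        payoff = np.where(is_active, new_payoff, payoff)
        is_active = np.where(is_active, new_active, is_active)
    return payoff
```

Feeding both functions the same simulated paths (with the same globals) should produce matching payoffs up to floating-point noise.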
Thank you for the help. I reused your payoff and that worked for the price.

@tf.function(jit_compile=True,
             input_signature=[tf.TensorSpec([], dtype=tf.float64),
                              tf.TensorSpec([], dtype=tf.float64),
                              tf.TensorSpec([], dtype=tf.float64)])
def delta_fn_xla(strikes, spot, sigma):
  fn = lambda spot: price_eu_options_xla(strikes, spot, sigma)
  return tff.math.fwd_gradient(fn, spot, use_gradient_tape=True)

This fails with:

InternalError: Propagate: Cannot find body function while_body_5075_grad_16381_grad_16947_const_0 for While node gradients/PartitionedCall_2/gradients/gradients/PartitionedCall_grad/PartitionedCall_15_grad/PartitionedCall/gradients/gradients/while_grad/while_grad_grad/gradients/while_grad/while_grad_grad [Op:__inference_delta_fn_xla_17607]

Also, for the non-XLA version, is there anything I can do to improve the performance?

t = time.time()
tarf_price = price_eu_options_xla(strikes, spot, sigma)
tarf_delta = delta_fn(strikes, spot, sigma)
tarf_vega = vega_fn(strikes, spot, sigma)
time_tqf = time.time() - t

With fwd_gradients:

TQF gpu TARF price+delta+vega
wall time + tracing: 31.278723001480103
options per second + tracing: 0.03197061465561366
wall time: 7.026975631713867
options per second: 0.142308733146425
------------------------
price tf.Tensor(-246.927357070608, shape=(), dtype=float64)
delta tf.Tensor(21.37942877385511, shape=(), dtype=float64)
vega tf.Tensor(-331.98229464223164, shape=(), dtype=float64)

With gradients:

TQF gpu TARF price+delta+vega
wall time + tracing: 15.258108139038086
options per second + tracing: 0.0655389246745136
wall time: 5.396306037902832
options per second: 0.1853119509857581
------------------------
price tf.Tensor(-246.927357070608, shape=(), dtype=float64)
delta tf.Tensor(21.37942877385511, shape=(), dtype=float64)
vega tf.Tensor(-331.98229464223175, shape=(), dtype=float64)

For reference, here is the same TARF payoff implemented in C++, using Antoine Savine's library from the book "Modern Computational Finance: AAD and Parallel Simulations".
Thank you, Arthur! The issue you are having is nested jit_compilation: a jit-compiled function is calling another jit-compiled function. Here you'd need to make a few changes:
As for the performance, just for reference, here is the GPU performance I get. We can try to improve performance separately. Does the 1020 ms include the greek computation? If so, what is the pricing speed separately? Could you please point me to the source code?
Also, do you need Euler sampling for the geometric Brownian motion? This can be sampled more efficiently with the designated sampler.
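A note on why a dedicated sampler helps: geometric Brownian motion has a closed-form solution, S_t = S_0 · exp((mu − sigma²/2)·t + sigma·W_t), so the fixing dates can be sampled exactly with one normal draw per date instead of many small Euler steps. A minimal NumPy sketch of the idea (illustrative only, not the library's implementation):

```python
import numpy as np

def sample_gbm_exact(s0, mu, sigma, times, normals):
    """Exact GBM sampling at the given times.

    Uses S_t = s0 * exp((mu - sigma^2/2) * t + sigma * W_t), so one
    normal draw per fixing date suffices -- no Euler sub-stepping.

    normals: standard normal draws, shape [num_samples, len(times)].
    Returns paths of shape [num_samples, len(times)].
    """
    times = np.asarray(times, dtype=float)
    dt = np.diff(np.concatenate([[0.0], times]))
    # Brownian increments with the correct variance per step.
    dw = normals * np.sqrt(dt)
    log_paths = np.cumsum((mu - 0.5 * sigma ** 2) * dt + sigma * dw,
                          axis=1)
    return s0 * np.exp(log_paths)
```

With 53 weekly fixings this needs exactly 53 draws per path, and the sampling error comes only from the Monte Carlo noise, not from time discretization.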
The 1020 ms includes the price and the greeks (first derivatives with respect to the spot, vol, risk-free rate, and dividend yield), computed serially (no parallel computation). I will apply your suggestions and measure the timing again.
Running on Google Colab Pro with a Tesla T4 GPU, I get 1.27 s. With a Tesla P100 GPU, I get 780 ms. What kind of GPU do you use to get 130 ms?
I am using a Tesla T4. So, given that it is working, I looked a bit into the performance details. I would still recommend using the geometric Brownian motion model directly, as it has a different implementation than GenericItoProcess. The latter relies on TensorArrays (basically a list) to store location values, so differentiating through it is a bit slow. Nevertheless,
This improves things quite a bit. On the public colab with a Tesla T4, here is what I get. If needed, I can try to further optimize CPU performance.
Also, could you please report the random number generation speed? I assume you'd need [200_000, 53] samples from Sobol that you then convert to normals. Could you please let me know how long that takes?
I have also changed the Sobol sequence to use
On a Tesla T4, here is what I seem to compute. The colab CPUs are a bit weak. On my 2020 Intel MacBook Pro (with enforced single-threading), I get:
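For anyone wanting to reproduce the Sobol-to-normals measurement outside TF, here is a rough equivalent using SciPy's QMC module (an assumption on my part: scipy >= 1.7 with scipy.stats.qmc available; the thread itself uses the library's own Sobol implementation):

```python
import time

import numpy as np
from scipy.special import ndtri
from scipy.stats import qmc

def sobol_normals(num_samples, dim, seed=42):
    """Scrambled Sobol points mapped to standard normals via the
    inverse normal CDF -- the usual quasi-Monte Carlo recipe."""
    sobol = qmc.Sobol(d=dim, scramble=True, seed=seed)
    u = sobol.random(num_samples)  # uniforms in (0, 1)
    return ndtri(u)

t0 = time.time()
z = sobol_normals(200_000, 53)
print(f"[200000, 53] Sobol -> normals took {time.time() - t0:.3f}s")
```

Scrambling keeps the uniforms strictly inside (0, 1), so the inverse CDF never produces infinities (the first point of an unscrambled Sobol sequence is exactly 0).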
Ok, that's surprising to see that.

Another question: https://colab.research.google.com/github/arthurpham/google_colab/blob/f02d64e65e9ba59ca9ce33048f3459560cef7fa5/TARF_MC_Performance_TQF.ipynb

Let's assume I'm trying to determine a realistic latency and cost (assuming a $/sec rate for GPU) of a pricing service relying on TQF. Say I receive a new trade with different inputs (the requests arrive at different times, so no batching is possible). I should be able to reuse the optimized function (maybe the input_signature of the tf.function is not configured properly?), but I don't observe that. So, is it possible to avoid retracing with XLA when changing the inputs (assuming no change in shapes)?

for spot in [15.0, 18.2, 20.0, 25.0]:
  spot = tf.constant(spot, dtype=dtype)
  t = time.time()
  tarf_price, tarf_greeks = greeks_fn_xla(strikes, spot, sigma, rate, dividend)
  time_tqf0 = time.time() - t
  t = time.time()
  tarf_price, tarf_greeks = greeks_fn_xla(strikes, spot, sigma, rate, dividend)
  time_tqf = time.time() - t

When I call
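For context on why retracing happens at all: tf.function bakes Python numbers into the trace as constants, so every new float value can trigger a retrace, while tensor arguments are matched on shape and dtype only. A toy model of that cache rule (an illustration of the concept, not TF's actual implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tensor:
    """Stand-in for a TF tensor: a value plus a shape/dtype spec."""
    value: float
    shape: tuple = ()
    dtype: str = "float64"

def trace_key(arg):
    # Python scalars are baked into the trace as constants, so the
    # cache key includes their value; tensors are keyed by spec only.
    if isinstance(arg, Tensor):
        return ("Tensor", arg.shape, arg.dtype)
    return ("const", arg)

class ToyFunction:
    """Minimal model of tf.function's trace cache."""
    def __init__(self, fn):
        self.fn = fn
        self.traces = {}

    def __call__(self, *args):
        key = tuple(trace_key(a) for a in args)
        if key not in self.traces:
            self.traces[key] = self.fn  # "compile" once per key
        return self.traces[key](
            *[a.value if isinstance(a, Tensor) else a for a in args])

price = ToyFunction(lambda spot: spot * 2.0)

for s in [15.0, 18.2, 20.0]:
    price(s)            # Python floats: one trace per distinct value
assert len(price.traces) == 3

price.traces.clear()
for s in [15.0, 18.2, 20.0]:
    price(Tensor(s))    # tensors: a single shared trace
assert len(price.traces) == 1
```

This is why converting the spot with tf.constant (or tf.convert_to_tensor) before the call, as in the loop above, is the usual way to hit the same trace for every request.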
The library that I forked to add the payoff is not mine; it comes from a book and the code is shared on GitHub. I don't have an easy way to measure what you are asking, but if I comment out the path generation and the payoff, I think the random number generation takes roughly 40% of the total time in serial mode:

auto cRng = rng.clone();
cRng->init(cMdl->simDim());
// Iterate through paths
for (size_t i = 0; i < nPath; i++)
{
    // Next Gaussian vector, dimension D
    cRng->nextG(gaussVec);
    // Generate path, consume Gaussian vector
    ////cMdl->generatePath(gaussVec, path);
    // Compute result
    ////prd.payoffs(path, results[i]);
}
Hi Arthur, for TF I checked that the random numbers also take roughly 40% of the compute time.
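A back-of-the-envelope way to measure that kind of split outside TF is to time the normal generation and the rest of the pipeline separately (pure NumPy here, so the absolute numbers will differ from the GPU figures in the thread):

```python
import time

import numpy as np

def time_rng_fraction(num_paths=100_000, num_steps=53, seed=0):
    """Rough split of serial Monte Carlo cost between drawing normals
    and everything else (path construction + payoff)."""
    rng = np.random.default_rng(seed)

    t0 = time.perf_counter()
    z = rng.standard_normal((num_paths, num_steps))
    t_rng = time.perf_counter() - t0

    t0 = time.perf_counter()
    # Toy GBM paths and a vanilla payoff, standing in for "the rest".
    dt, sigma, mu, s0, strike = 1.0 / num_steps, 0.2, 0.05, 100.0, 100.0
    log_paths = np.cumsum((mu - 0.5 * sigma ** 2) * dt
                          + sigma * np.sqrt(dt) * z, axis=1)
    payoff = np.maximum(s0 * np.exp(log_paths[:, -1]) - strike, 0.0)
    t_rest = time.perf_counter() - t0

    return t_rng / (t_rng + t_rest), payoff.mean()

frac, price = time_rng_fraction()
print(f"RNG share of total compute: {frac:.0%}")
```

The exact fraction depends heavily on the payoff complexity and the hardware; the point is only that RNG is a non-trivial fixed cost in any MC pricer.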
I've pushed a change so that
In my example above, the spot was converted with:

for spot in [tf.convert_to_tensor(x, tf.float64) for x in [15.0, 18.2, 20.0, 25.0]]:
  t = time.time()
  tarf_price, tarf_greeks = greeks_fn_xla(strikes, spot, sigma, rate, dividend)
  time_tqf0 = time.time() - t
  t = time.time()
  tarf_price, tarf_greeks = greeks_fn_xla(strikes, spot, sigma, rate, dividend)
  time_tqf = time.time() - t

Thank you for the quick fix; it seems to make a difference from a quick test.
You need to remove it. Just in case, here is a version of the pricer incorporating all the suggestions above:
Now I can define
and try running for different spot values:
This works as expected for me.
Yes, that was the problem. Thanks a lot for being patient.
I'm trying to replicate the TARF payoff from this paper.
So far, I've been able to write a payoff that can be evaluated with MC and GPU, but I get an error with the XLA compilation.
https://colab.research.google.com/github/arthurpham/google_colab/blob/7d5715c2b0f988c8d7d544fea0ff741e63272056/TARF_MC_Performance_TQF.ipynb
I understand that the payoff code style might not be the best, but I wasn't able to find an example of a path-dependent payoff with MC.
Any pointer would be helpful.