
HardSoftmax missing? #4

Closed
bartbussmann opened this issue Sep 30, 2019 · 5 comments

Comments

@bartbussmann

Hello, thanks for the interesting paper and codebase.

In the paper (Section 4.2) you state that:

After training network N_j, we apply our straightforward semi-binarization function HardSoftmax that truncates all attention scores that fall below a threshold τ_j to zero.

However, in your code, you seem to omit this and use regular soft attention instead. This might have an impact on the Causal Validation process. Am I missing something?

@M-Nauta
Owner

M-Nauta commented Sep 30, 2019

During network training, we indeed use soft attention since hard attention is not differentiable. This is also described in Section 4.2 of the paper:

We therefore first use the soft attention approach by applying the Softmax function s to each a in each training epoch

However, in the function findcauses in TCDF.py you can see that we subsequently apply a threshold and only consider time series whose attention score exceeds it. That's why we talk about semi-binarization: all time series with attention scores below the threshold are discarded as potential causes, while all other time series are kept.
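
For illustration, a minimal sketch of this semi-binarization step (this is not the actual findcauses code; the scores and the threshold value below are made up):

```python
import numpy as np

def softmax(a):
    # Soft attention used during training, since it is differentiable
    e = np.exp(a - a.max())
    return e / e.sum()

def potential_causes(attention_scores, tau):
    # Semi-binarization: keep only the time series whose attention score exceeds tau
    return [i for i, score in enumerate(attention_scores) if score > tau]

# Hypothetical attention logits for four input time series
scores = softmax(np.array([0.1, 2.3, 0.5, 1.8]))
print(potential_causes(scores, tau=0.15))  # -> [1, 3]
```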

@bartbussmann
Author

Ah okay, I see! From the paper, I understood that the HardSoftmax was used in the Causal Validation step as well (such that only the 'potential causes' are used to predict the value of interest during PIVM). Although it's probably not a big difference, it might be interesting to experiment with.

@M-Nauta
Owner

M-Nauta commented Sep 30, 2019

I'm not sure if I understand you correctly, but we indeed apply PIVM only to the potential causes. This is described in Section 4.3.1 of the paper. To clarify, let's take an example. Suppose we have four time series: the prices of apples, butter, cheese and milk. For predicting the price of cheese, the attention mechanism will probably find that apples are not related to cheese. By using a threshold, butter and milk might be selected as potential causes. Secondly, we apply PIVM: we shuffle the values of both the price of butter and the price of milk and predict the price of cheese again. The result will probably be that only milk is a true cause of cheese.
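
To make the shuffling step concrete, here is a rough sketch of the PIVM idea (the `predict_loss` callable and the column layout are hypothetical, not the TCDF API):

```python
import numpy as np

def pivm(data, potential_causes, predict_loss, rng=np.random.default_rng(0)):
    # Permutation Importance: permute each potential cause and measure the loss increase
    base_loss = predict_loss(data)  # loss with the original, unshuffled inputs
    increase = {}
    for j in potential_causes:
        permuted = data.copy()
        permuted[:, j] = rng.permutation(permuted[:, j])  # destroy the temporal order of series j
        increase[j] = predict_loss(permuted) - base_loss
    return increase  # a clear loss increase suggests series j is a true cause

# Columns: 0 = apples, 1 = butter, 2 = cheese (target), 3 = milk
# pivm(data, potential_causes=[1, 3], predict_loss=my_cheese_predictor)
```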

@bartbussmann
Author

bartbussmann commented Sep 30, 2019

I'm not sure if I understand you correctly, but we indeed apply PIVM only to the potential causes. This is described in Section 4.3.1 of the paper. To clarify, let's take an example. Suppose we have four time series: the prices of apples, butter, cheese and milk. For predicting the price of cheese, the attention mechanism will probably find that apples are not related to cheese. By using a threshold, butter and milk might be selected as potential causes.

Yes, great example!

Secondly, we apply PIVM: we shuffle the values of both the price of butter and the price of milk and predict the price of cheese again.

This is where the confusion lies. Suppose we want to find out whether butter is a true cause. We then shuffle the values of butter and use these shuffled values, together with the 'original' past values of cheese, milk, and apples, to predict the price of cheese again. So, in this prediction, the soft attention value of apples is used, instead of the HardSoftmax value.

@M-Nauta
Owner

M-Nauta commented Sep 30, 2019

Ah, I get your point. It would indeed be a small change (and maybe an improvement!) to exclude the non-causal time series (in this case, the apple time series). However, the disadvantage is that you would then need to re-train a neural network with only the potentially causal time series, since the trained network still expects the time series of apples as input. So this would increase the computational cost substantially. But it's definitely an interesting experiment.

@M-Nauta M-Nauta closed this as completed Oct 1, 2019