More detectors #15
Backdoors
There are two things I've been trying to find: reported ASR for detection methods, and how well cited the different methods are.
List of backdoor detectors in order of citations:
ASSET table
The best overview of performance I've found is this table, from this ASSET paper. Unsure how representative it is. The top 3 according to that paper are:
Confusion Training table
Seemingly good methods according to that table:
Existing collections of implementations
BackdoorBox looks a lot more promising than BackdoorBench and might just be wrappable.
I do think it would be nice to have a method that works against WaNets, which apparently rules out quite a few.
Wrapping it would be nice if it doesn't require too many hacks and workarounds. Though looking at the commit history, it doesn't seem like they're adding methods very actively, so if we only want one or two of the methods they have implemented, I'm not sure whether wrapping is easier than reimplementing them. (I haven't looked at their code; let me know if it would be useful for me to get a sense of how hard integrating it would be!)
ASSET is interesting even apart from performance. Summary of the idea from their paper:
This sounds very similar to what Paul has written about here, and I've experimented a bit with something like this in an abstraction setting (except that this is more like the "finetuning-version" of that idea). Unlike normal finetuning, this could also work for detecting adversarial examples or broken spurious correlations. Particularly interesting is this part:
This was sometimes a problem in practice for me too; it might be worth trying their approach. It might not be the easiest to implement and could be fiddly to get working, though, so overall I'm not sure how much of a priority this should be---if we can get faster results from other methods, that's also a consideration.
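For concreteness, here's roughly the shape of the "finetuning-version" of the idea that I mean (a minimal sketch with made-up function names and hyperparameters, not ASSET's actual algorithm): fine-tune a copy of the model on a small trusted clean set and flag the untrusted inputs whose predictions shift the most.

```python
import copy
import torch
import torch.nn.functional as F

def prediction_shift_scores(model, clean_loader, untrusted_inputs, steps=200, lr=1e-4):
    """Hypothetical sketch: fine-tune a copy of `model` on trusted clean data
    and score each untrusted input by how much its prediction moves."""
    device = next(model.parameters()).device
    untrusted_inputs = untrusted_inputs.to(device)

    def log_probs(m):
        m.eval()
        with torch.no_grad():
            return F.log_softmax(m(untrusted_inputs), dim=-1)

    before = log_probs(model)

    finetuned = copy.deepcopy(model)
    opt = torch.optim.Adam(finetuned.parameters(), lr=lr)
    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, y = next(data_iter)
        finetuned.train()
        opt.zero_grad()
        F.cross_entropy(finetuned(x.to(device)), y.to(device)).backward()
        opt.step()

    after = log_probs(finetuned)
    # Anomaly score per sample: KL(after || before). The hope is that inputs
    # relying on a backdoor/spurious feature shift more under clean finetuning.
    return F.kl_div(before, after, log_target=True, reduction="none").sum(-1)
```

Since it only compares predictive distributions, the same scoring could in principle be applied to adversarial examples at deployment, where we don't have labels for the untrusted inputs.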
Remaining would be:
I don't think wrapping would be easier. Wrapping could be useful for having one less standard, but let's skip that for now.
I'm a bit sceptical of methods that need the original poisoned training data, in addition to the clean datasets, to train the detector. The main exception is if the detector generalizes to attacks other than the ones it was trained on (e.g. if trained on backdoors it can detect adversarial attacks, or vice versa). But it might still be worth investigating.
Adversarial attacks
I haven't found a good survey comparing methods yet; here are some of the techniques I've found. That said, some surveys are:
List of some adversarial detectors in order of citations
I would roughly split these into three groups:
Resources
I'll probably start with Spectral/Spectre and then perhaps MagNet after that. They seem easy enough, though I haven't looked at the details yet. ASSET requires some more work, so I'll hold off on that for now.
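Since Spectral is first in line, the core scoring is small enough to sketch here (illustrative only; it assumes we've already extracted per-class penultimate-layer features from the trained model):

```python
import torch

def spectral_signature_scores(features: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of spectral-signature scoring.

    `features`: (n_samples, d) penultimate-layer representations for the
    samples of a single class. Returns one outlier score per sample; the
    highest-scoring samples are the candidate poisons."""
    centered = features - features.mean(dim=0, keepdim=True)
    # Top right-singular vector of the centered feature matrix.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    top_direction = vh[0]
    # Score = squared projection onto that direction.
    return (centered @ top_direction) ** 2
```

As I understand the original paper, this is applied per class and some fraction of the highest-scoring samples is removed before retraining.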
I think such methods can still be interesting if individual failures aren't too bad, and we mainly want to avoid consistently making mistakes in the long run (what Paul calls the "low-stakes" setting). In that case, we can use deployment inputs as the mixed set of clean+poisoned---early on, we won't be able to distinguish them yet, but at some point we'll have seen enough poisoned data. A potential problem is if the poisoning rate is too low, in which case we might keep making mistakes forever with most detection methods. But I think even that is fine in many applications. For example, if deceptive AIs can only take treacherous actions in 1% of cases, that makes me more optimistic about preventing takeover (though we'd still have to prevent certain forms of coordination between AIs). Ironically, I'm less sure how helpful this is in the backdoor detection case---intuitively it does feel like you'll only see a very low poison rate during deployment for many applications, and it's still bad. But I'm sure there are deployment scenarios where you're only worried about high rates of attempted attacks.
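To spell out the loop I'm imagining (an illustrative skeleton only; `fit_detector` stands in for whichever detection method we end up using):

```python
from typing import Callable, Iterable, List, Sequence

def low_stakes_loop(
    clean_data: Sequence,
    deployment_stream: Iterable,
    fit_detector: Callable,  # hypothetical: (clean, untrusted) -> scoring function
    threshold: float,
    refit_every: int = 1000,
):
    """Illustrative skeleton: deployment inputs double as the untrusted mixed
    set, and the detector is refit as that pool grows. Early individual
    mistakes are tolerated; the goal is to stop making them consistently."""
    untrusted_pool: List = []
    score = fit_detector(clean_data, untrusted_pool)  # early on: weak detector
    flagged, accepted = [], []
    for t, x in enumerate(deployment_stream, start=1):
        (flagged if score(x) > threshold else accepted).append(x)
        untrusted_pool.append(x)
        if t % refit_every == 0:
            # With a non-trivial poisoning rate, the pool eventually contains
            # enough attacked inputs for mixture-style detectors to separate
            # them from clean data; at a very low rate it may never catch up.
            score = fit_detector(clean_data, untrusted_pool)
    return flagged, accepted
```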
I think a guiding principle for choosing which ones to implement is whether they're specific to e.g. image adversarial attacks, or could also be applied to many other tasks in our benchmark. Input augmentations will often be pretty specific I assume, but generic methods like MagNet seem promising. Activation statistics could be pretty general---my sense is that they are often designed specifically with adversarial examples in mind, but could still be interesting to try them, especially ones that seem reasonable a priori for other tasks. I think we might care less about adversarial training, since it's unclear how you'd apply that to cases like preventing measurement tampering. (And on the other hand, for deception, the main problem would be that finding the adversarial inputs might be very hard, which is a pretty separate topic from what we're doing.)
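To make "activation statistics" concrete, this is the kind of generic detector I have in mind (illustrative, not any specific paper's method): fit a Gaussian to activations on trusted clean data and score inputs by Mahalanobis distance. It only assumes we can extract some feature vector, so it isn't tied to image classifiers.

```python
import torch

def mahalanobis_scores(clean_activations, test_activations, eps=1e-6):
    """Sketch of a generic activation-statistics detector: Gaussian fit on
    clean activations, squared Mahalanobis distance as the anomaly score."""
    mean = clean_activations.mean(dim=0)
    centered = clean_activations - mean
    cov = centered.T @ centered / (clean_activations.shape[0] - 1)
    cov = cov + eps * torch.eye(cov.shape[0])  # regularize so it's invertible
    precision = torch.linalg.inv(cov)
    diff = test_activations - mean
    # Squared Mahalanobis distance per test sample; larger = more anomalous.
    return torch.einsum("nd,de,ne->n", diff, precision, diff)
```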
Sounds good, I agree with the decision to start with easier-to-implement methods.
Looking at what collections of results there are to compare against (besides BackdoorBench combinations):
ASSET table:
Confusion Training table:
Looking at this, it seems like what we have covered now and can compare against is:
Which seems a bit limited. On the other hand, these papers don't seem to run into memory issues for statistical detectors, so we should be able to get numbers for those as well.
We currently have 3 detectors. In this issue I will investigate some possible new additions.
Top candidates: