Use hardware-accelerated AES CryptoServiceProvider #865
Well that's pretty good!
Here is the code I used:
byte[] CTRArrayXOR(byte[] counter, byte[] data, int offset, int length)
{
    for (int loopOffset = 0; length > 0; length -= Vector<byte>.Count)
    {
        if (length >= Vector<byte>.Count)
        {
            // Full vector: XOR Vector<byte>.Count bytes at once.
            var v = new Vector<byte>(counter, loopOffset) ^ new Vector<byte>(data, offset + loopOffset);
            v.CopyTo(counter, loopOffset);
            loopOffset += Vector<byte>.Count;
        }
        else
        {
            // Tail: fewer bytes than a full vector remain, so XOR them one by one
            // (constructing a Vector<byte> here would read past the end of the arrays).
            for (int i = 0; i < length; i++, loopOffset++)
            {
                counter[loopOffset] ^= data[offset + loopOffset];
            }
        }
    }
    return counter;
}
I've tested your snippet by replacing the XOR in my AES-CTR benchmark, which processes 32KB chunks of random data (similar to what happens when downloading a file). Note that my code copies the byte[] to uint[] so that the XOR is not done byte by byte. Does your benchmark take that into account? Is the "original" my PR code (accelerated), or the existing SSH.NET code?
I see that Vector<T> is likely optimized by the compiler to use CPU registers. Even so, it seems slower than my current code.
I recompiled the benchmark for NetCore 5.0 and added 3 versions of CTR:
Running it as Release instead of Debug also boosted performance a lot and changed the ranking. So here are the results:
So Vector is indeed the fastest, but available only since NetStandard 2.1 (like Span). BlockCopy is not that bad considering it's available for all platforms, and it's still 13x faster than base. We can add a separate PR with Vector/Span later on if needed. Here's the snippet for SpanXOR - it assumes the arrays are 4-byte aligned for simplification:
private byte[] SpanXOR(byte[] data, int offset, byte[] output)
{
    int uOffset = offset / 4;
    Span<uint> uData = MemoryMarshal.Cast<byte, uint>(data);
    Span<uint> uOut = MemoryMarshal.Cast<byte, uint>(output);
    for (int i = 0; i < uOut.Length; i++)
        uOut[i] = uOut[i] ^ uData[i + uOffset];
    return output;
}
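The 4-byte simplification mentioned above can be lifted fairly cheaply. The following is a minimal sketch, not code from this PR: `XorSketch` and `XorInto` are hypothetical names, and it assumes a platform that tolerates unaligned 32-bit loads (x86/x64 and ARM64 do). Slicing the input span replaces the `offset / 4` arithmetic, and a short byte loop handles any trailing bytes.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper for illustration only.
public static class XorSketch
{
    // XORs the first output.Length bytes of data into output, in place.
    // data must be at least as long as output.
    public static void XorInto(ReadOnlySpan<byte> data, Span<byte> output)
    {
        if (data.Length < output.Length)
            throw new ArgumentException("data is shorter than output", nameof(data));

        // Number of bytes that can be processed as whole 32-bit words.
        int wordBytes = output.Length / sizeof(uint) * sizeof(uint);

        ReadOnlySpan<uint> uData = MemoryMarshal.Cast<byte, uint>(data.Slice(0, wordBytes));
        Span<uint> uOut = MemoryMarshal.Cast<byte, uint>(output.Slice(0, wordBytes));
        for (int i = 0; i < uOut.Length; i++)
            uOut[i] ^= uData[i];

        // Remaining 0..3 bytes are XORed one at a time.
        for (int i = wordBytes; i < output.Length; i++)
            output[i] ^= data[i];
    }
}
```

A call such as `XorSketch.XorInto(data.AsSpan(offset), output)` would then cover the same case as `SpanXOR(data, offset, output)` without requiring `offset` or the lengths to be multiples of 4.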
@drieseng This looks fine. I've been using it in production for a couple of months without any side effects. With 10x performance improvement, I propose to merge this before the next release.
I have tested this and it works!
Reduces CPU usage dramatically, allowing more performance on slower machines
Fix padding for non-AES blockciphers Fix IV exception for non-AES blockciphers
It looks like the legacy code doesn't correctly remove padding, so this code needs to do the same.
restructure AES CSP code into its own class
9efeb24 to 086ad43
I've rebased this PR, please review and consider merging.
I'm very much in favour of this change, but I think the design needs some work. In fact, I think it can be much simpler as a starting point: I think we can just replace the implementation of That way, we still use the given If you have any thoughts on this approach, or would like to work on it, let me know. Otherwise, I will play with it.
All AES tests pass (including the new ones from #1232). The failed test is unrelated, seems spurious:
Is there a way to force a re-run of the test suite without doing a dummy commit?
I've merged #1232. Can you update this PR and check whether all tests pass?
DST messed up the order of comments, my comment above was actually made after Wojciech's.
@Rob-Hague With the removal of the legacy frameworks perhaps we can get rid of the legacy code and just use this new one. However, I think we should first merge this in, and then rework and remove the legacy code in a separate PR.
Let's get it right in this PR. My biggest concern is the subversion of logic. When passing a I have run some experiments with the following changes to the benchmarks:
diff --git a/test/Renci.SshNet.Benchmarks/Security/Cryptography/Ciphers/AesCipherBenchmarks.cs b/test/Renci.SshNet.Benchmarks/Security/Cryptography/Ciphers/AesCipherBenchmarks.cs
index ff414cc4..b91e3b8a 100644
--- a/test/Renci.SshNet.Benchmarks/Security/Cryptography/Ciphers/AesCipherBenchmarks.cs
+++ b/test/Renci.SshNet.Benchmarks/Security/Cryptography/Ciphers/AesCipherBenchmarks.cs
@@ -15,7 +15,7 @@ namespace Renci.SshNet.Benchmarks.Security.Cryptography.Ciphers
         {
             _key = new byte[32];
             _iv = new byte[16];
-            _data = new byte[256];
+            _data = new byte[32 * 1024];
             Random random = new(Seed: 12345);
             random.NextBytes(_key);
@@ -34,5 +34,29 @@ namespace Renci.SshNet.Benchmarks.Security.Cryptography.Ciphers
         {
             return new AesCipher(_key, new CbcCipherMode(_iv), null).Decrypt(_data);
         }
+
+        [Benchmark]
+        public byte[] Encrypt_CFB()
+        {
+            return new AesCipher(_key, new CfbCipherMode(_iv), null).Encrypt(_data);
+        }
+
+        [Benchmark]
+        public byte[] Decrypt_CFB()
+        {
+            return new AesCipher(_key, new CfbCipherMode(_iv), null).Decrypt(_data);
+        }
+
+        [Benchmark]
+        public byte[] Encrypt_CTR()
+        {
+            return new AesCipher(_key, new CtrCipherMode(_iv), null).Encrypt(_data);
+        }
+
+        [Benchmark]
+        public byte[] Decrypt_CTR()
+        {
+            return new AesCipher(_key, new CtrCipherMode(_iv), null).Decrypt(_data);
+        }
     }
 }
Results on develop branch 826222f:
Results on my suggestion to replace the
Results on your branch zybexXL@a9f68fb
So my suggestion gets moderate gains but clearly we can do a lot better with an approach to delegate the entire encryption to the BCL (rather than block-by-block). So here is what I propose: We keep the behaviour of the existing constructor on We add a new constructor which will allow delegating to the BCL for the entire encryption process. This will achieve the much better performance and separation from the existing behaviour.
public AesCipher(
    System.Security.Cryptography.CipherMode cipherMode,
    System.Security.Cryptography.PaddingMode paddingMode,
    bool ctrMode = false)
{ }
Since In terms of the design, we define nested classes in
public sealed class AesCipher : BlockCipher
{
    private readonly BlockCipher _impl;
    public AesCipher(byte[] key, CipherMode mode, CipherPadding padding)
        : base(key, 16, mode, padding)
    {
        _impl = new SshNetCipherModeImpl(key, mode, padding);
    }
    public AesCipher(
        byte[] key,
        byte[] iv,
        System.Security.Cryptography.CipherMode cipherMode,
        System.Security.Cryptography.PaddingMode paddingMode,
        bool ctrMode = false)
    {
        _impl = new BclImpl(/* ... */);
    }
    // AesCipher just forwards all implementation to _impl
    public override int EncryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
    {
        return _impl.EncryptBlock(inputBuffer, inputOffset, inputCount, outputBuffer, outputOffset);
    }
    public override int DecryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
    {
        return _impl.DecryptBlock(inputBuffer, inputOffset, inputCount, outputBuffer, outputOffset);
    }
    public override byte[] Encrypt(byte[] input, int offset, int length)
    {
        return _impl.Encrypt(input, offset, length);
    }
    public override byte[] Decrypt(byte[] input)
    {
        return _impl.Decrypt(input);
    }
    public override byte[] Decrypt(byte[] input, int offset, int length)
    {
        return _impl.Decrypt(input, offset, length);
    }
    // Implementations
    // The block-by-block implementation using instantiated SshNet.CipherMode, SshNet.CipherPadding objects
    private sealed class SshNetCipherModeImpl : BlockCipher
    {
        private readonly Aes _aes;
        private ICryptoTransform _encryptor;
        private ICryptoTransform _decryptor;
        public SshNetCipherModeImpl(byte[] key, CipherMode mode, CipherPadding padding)
            : base(key, 16, mode, padding)
        {
            // Initialise _aes in ECB mode
        }
        public override int EncryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
        {
            _encryptor ??= _aes.CreateEncryptor();
            return _encryptor.TransformBlock(inputBuffer, inputOffset, inputCount, outputBuffer, outputOffset);
        }
        public override int DecryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
        {
            _decryptor ??= _aes.CreateDecryptor();
            return _decryptor.TransformBlock(inputBuffer, inputOffset, inputCount, outputBuffer, outputOffset);
        }
    }
    private sealed class BclImpl : BlockCipher
    {
        // This overrides Encrypt/Decrypt for full BCL perf.
        public override byte[] Decrypt(byte[] input)
        {
            throw new NotImplementedException();
        }
        public override byte[] Decrypt(byte[] input, int offset, int length)
        {
            throw new NotImplementedException();
        }
        public override int DecryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
        {
            throw new NotImplementedException();
        }
        public override byte[] Encrypt(byte[] input, int offset, int length)
        {
            throw new NotImplementedException();
        }
        public override int EncryptBlock(byte[] inputBuffer, int inputOffset, int inputCount, byte[] outputBuffer, int outputOffset)
        {
            throw new NotImplementedException();
        }
    }
}
Comments appreciated. Most important to me is that we do not subvert provided
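To make the "delegate the entire encryption to the BCL" idea more concrete, here is a minimal sketch of what a CBC path might look like. It is not the PR's implementation: `BclCbcSketch` is a hypothetical stand-alone class (it does not derive from `BlockCipher`), and it assumes no padding and inputs that are a multiple of the 16-byte block size. The encryptor is created once and reused, so the CBC chaining state carries over from one call to the next, which is what an SSH packet stream needs.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical illustration; names are not from SSH.NET.
public sealed class BclCbcSketch : IDisposable
{
    private const int BlockSize = 16;
    private readonly Aes _aes;
    private readonly ICryptoTransform _encryptor;

    public BclCbcSketch(byte[] key, byte[] iv)
    {
        _aes = Aes.Create();
        _aes.Mode = CipherMode.CBC;
        _aes.Padding = PaddingMode.None; // caller supplies whole blocks
        _aes.Key = key;
        _aes.IV = iv;
        // One long-lived transform keeps the chaining state between calls.
        _encryptor = _aes.CreateEncryptor();
    }

    // Encrypts all blocks of the input in a single BCL call.
    public byte[] Encrypt(byte[] input, int offset, int length)
    {
        if (length % BlockSize != 0)
            throw new ArgumentException("length must be a multiple of 16", nameof(length));

        byte[] output = new byte[length];
        if (length > 0)
        {
            // TransformBlock (not TransformFinalBlock) so the state is not reset.
            _encryptor.TransformBlock(input, offset, length, output, 0);
        }
        return output;
    }

    public void Dispose()
    {
        _encryptor.Dispose();
        _aes.Dispose();
    }
}
```

With this shape, encrypting 32 bytes in one call should give the same ciphertext as encrypting the two 16-byte halves in consecutive calls on the same instance.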
One other thought: because the proposed new constructor is a bit awkward for CTR mode, I am in principle OK with subverting our own
@zybexXL are you going to update this PR? I see this PR as one of the most important for the next release.
@WojciechNagorski Yes, I've just been busy this week. I'll update it today or tomorrow.
Factor out the implementations and re-add the existing constructor
Add tests for stream cipher state preservation
@WojciechNagorski PR is now updated, ready to merge.
Nice. My only "important" feedback is on the naming of AesCipherMode. The others are not necessary to address.
Appveyor failed on unrelated test (ForwardedPortShouldAcceptNewConnections). I've noticed that some tests always fail if there are 2 or more Appveyor sessions running at the same time, usually because some port is already in use. Maybe there's a way to queue the appveyor sessions so that only one runs at any given time? Edit - consider adding this to appveyor.yml:
I'm not talking about this particular failure. The fact is, whenever I push to multiple branches/PRs, Appveyor testing fails. I've seen port-related issues at least twice.
To be honest, I prefer GitHub Actions but we can't use it because there is a limit.
@WojciechNagorski Please merge this PR, I think all issues are resolved.
@WojciechNagorski, can this one go in for you?
LGTM! This is an amazing contribution.
@Rob-Hague I know you had some comments about AesCipherMode, but to be honest, it doesn't block merging. Please feel free to open the next PR. We still have some time until the next release.
This feature reduces CPU usage dramatically, allowing higher SFTP performance on slower machines. It does so by using the AesCryptoServiceProvider for AES CBC/CTR/ECB encryption/decryption, which taps into AES-NI accelerated OS functions.
SFTP download performance test cases:
Test case A: 4-year-old i5-7300U 2-core/4-thread laptop, 250Mbps internet connection:
Test case B: 4-year-old i7-7700 4-core/8-thread, 500Mbps internet connection:
Note: all values above were taken with PR #866 also merged in. Tested with sftpClient.DownloadFile()
SSH.NET vs AES-NI benchmarks
SSH.NET implements AES as pure managed code. CPUs nowadays have hardware acceleration for AES via the AES-NI instruction set, and .NET provides APIs to the OS functions that make use of it. There are 3 .NET providers for AES:
AesManaged: older managed-only implementation, not accelerated. Still faster than the current SSH.NET
AesCryptoServiceProvider: calls into the OS and makes use of AES-NI if available
AesCng (Cryptography Next Generation): newer crypto API, available since 2018 in Win10. Also makes use of AES-NI and has less overhead
Here are some benchmarks (executed on the dual-core laptop...):
As you can see, the accelerated APIs are about 25x faster than the current managed code!
Since AesCng is too new, I chose to use AesCryptoServiceProvider. This supports CBC and ECB natively, but not CFB or OFB (which are not listed in the SSH.NET protocol list anyway). For CTR (used by AWS SFTP Transfer Family), my code uses ECB followed by XOR. This would be extremely fast in C/C++ (or with the unsafe keyword in C# to allow memory pointers), but in pure C# I had to do a few Buffer.BlockCopy calls to convert between byte[] and uint[], so the result is not ideal. However, it's still about 10x faster than the previous code.
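For readers unfamiliar with the "ECB followed by XOR" construction, the sketch below shows the general idea under stated assumptions. It is not the code from this PR: `AesCtrSketch` and `Transform` are hypothetical names, the counter is treated as a 128-bit big-endian integer starting at the IV, and the XOR is done byte by byte for clarity rather than through the uint[] conversion described above. Successive counter values are encrypted with AES-ECB in one call to produce a keystream, which is then XORed with the data.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical illustration of CTR mode built from ECB + XOR.
public static class AesCtrSketch
{
    private const int BlockSize = 16;

    // CTR is symmetric: the same call encrypts and decrypts.
    public static byte[] Transform(byte[] key, byte[] iv, byte[] data)
    {
        using var aes = Aes.Create();
        aes.Mode = CipherMode.ECB;
        aes.Padding = PaddingMode.None;
        aes.Key = key;
        using ICryptoTransform ecb = aes.CreateEncryptor();

        // Lay out one counter block per 16 bytes of data (rounded up).
        int blocks = (data.Length + BlockSize - 1) / BlockSize;
        byte[] counters = new byte[blocks * BlockSize];
        byte[] counter = (byte[])iv.Clone();
        for (int b = 0; b < blocks; b++)
        {
            Buffer.BlockCopy(counter, 0, counters, b * BlockSize, BlockSize);
            // Big-endian increment: carry into the next byte while one wraps to 0.
            for (int i = BlockSize - 1; i >= 0 && ++counter[i] == 0; i--) { }
        }

        // Encrypt all counter blocks in a single call to get the keystream.
        byte[] keystream = new byte[counters.Length];
        if (counters.Length > 0)
            ecb.TransformBlock(counters, 0, counters.Length, keystream, 0);

        // XOR the keystream with the data; the last block may be partial.
        byte[] output = new byte[data.Length];
        for (int i = 0; i < data.Length; i++)
            output[i] = (byte)(data[i] ^ keystream[i]);
        return output;
    }
}
```

Because only the keystream generation goes through the BCL, the hardware-accelerated part is the ECB call; the XOR loop is where the Vector, Span, or BlockCopy variants discussed in this thread would slot in.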
Here are the new AES benchmarks with this PR:
These CTR and CBC values are now the new maximum theoretical SFTP performance of SSH.NET 😎