Extremely slow performance when processing virtual terminal sequences #10362

cmuratori · 2021-06-08T06:06:17Z

Windows Terminal version (or Windows build number)

1.8.1521.0

Other Software

No response

Steps to reproduce

Using any command line utility that produces virtual terminal sequences for setting the colors of individual characters, the performance of the terminal drops by a factor of around 40.

To measure this effect precisely, you can use the F2 key in termbench and observe the performance difference between color-per-character output and single-color output:

https://github.com/cmuratori/termbench/releases/tag/V1

Expected Behavior

Despite the increased parsing load, modern CPUs should not have a problem parsing per-character color escape codes quickly. I would expect the performance of the terminal to be able to sustain roughly the same frame rate with per-character color codes as without, and if there was a performance drop, I wouldn't expect it to be anything close to 40x.

Actual Behavior

The speed of per-character color output is 40x slower than the speed of single-color output.

skyline75489 · 2021-06-08T06:39:49Z

Thanks for the amazing benchmark tool! I'm sure @miniksa will be interested in trying it out.

Yeah, the current performance of colored output is not as fast as that of non-colored output. However, the bottleneck is not in parsing but rather in rendering, in my opinion. Well, to be specific, when it comes to the terminal, there are a lot of things that may hurt the performance, for example ConPTY, DxRenderer, memory allocation, etc. Anyway this is a good tool to measure the performance of both conhost & terminal.

cmuratori · 2021-06-08T07:09:45Z

Can you explain what you think is the slow part of rendering characters in multiple colors (for those of us unfamiliar with how the renderer of your console works)?

skyline75489 · 2021-06-08T08:28:42Z

If it means anything, here's a sample WPR trace running your benchmark tool:

As you can see, roughly 70% of CPU time is consumed by RenderThread, which in this case (Windows Terminal) relies on DxRenderer for the actual rendering job. The OutputThread (where the VT parsing & related work reside) only takes 10% ~ 20% of the CPU time. Note that this pattern of CPU usage can be seen in almost every output-heavy program that uses color (cmatrix, for example).

Regarding colored vs non-colored text, my initial observation is that drawing a frame of colored text takes longer than drawing a non-colored one, which leaves less CPU for the OutputThread and causes the FPS to drop.

skyline75489 · 2021-06-08T08:45:46Z

Deep down in the RenderThread, we use DirectX to do the actual drawing. In a perfect world, we'd like to see RenderThread take most of the CPU time, which means we are not wasting too much CPU on VT processing. But even if we managed to make RenderThread consume 90% of the CPU, this still wouldn't give you rendering performance for colored text anywhere close to that of non-colored text.

Another example that might help: if we cat a very long file, which gives us non-colored output, the WPR trace usually indicates about 60%~70% of CPU is consumed by OutputThread and only 30% of CPU is consumed by RenderThread. See where this is going? Rendering non-colored text is simply a much cheaper operation for the renderer, which helps the FPS a lot.

In conclusion, I know there's a lot of room for performance tuning in the Windows Terminal, but I honestly can't expect the performance of colored text to be close to that of non-colored text. Even with a hardware-accelerated solution like DirectX, we still face performance bottlenecks from the rendering stack & GPUs.

cmuratori · 2021-06-08T08:55:19Z

Two things:

1. Can you expand that entire Render::PaintFrame trace? I assume it has more detailed attribution of time there?
2. I'm not sure I understand what you're suggesting. Are you actually saying that you think a GPU will slow down substantially if it has to use a different color for each character?

mmozeiko · 2021-06-08T08:55:44Z

I cannot profile Terminal .exe as there are no symbols for it available on Microsoft pdb servers.
But I wrote a dumb VT terminal using CreatePseudoConsole that does not render anything - all it does is sit in a loop doing ReadFile from the application output. Basically for (;;) { ReadFile(pipe, ...); } and it completely ignores the output. So there is no VTE parsing, no rendering, no formatting. Nothing. The only thing that runs is conhost.exe.

In task manager I see that there is ~4MB/s traffic to my "terminal" from Casey's termbench.exe. And I get around 250-300ms per frame which seems pretty slow. Task manager shows that conhost.exe is bottleneck. It is using 100% of one core (3.12% on 32x core Ryzen):

When I run ETW on conhost.exe for my "terminal" I see following things:

A lot of time is spent in std::vector resizing:

A lot of time is spent in std::stringstream:

A lot of time is spent in RenderThread:

The confusing part to me in this RenderThread call stack are all those string formatting functions like _SetGraphicsRenditionRGBColor. Why is conhost.exe formatting VT sequences like this? Shouldn't my "terminal" receive direct bytes in pipe from what termbench application is sending?

Again - there is zero rendering happening in my terminal. There are no DirectX calls at all. Only conhost.exe is bottleneck here.

skyline75489 · 2021-06-08T09:02:33Z

@cmuratori To answer your question:

The deepest call that consumes the CPU is

terminal/src/renderer/dx/CustomTextRenderer.cpp

Line 798 in fb597ed

d2dContext->DrawGlyphRun(baselineOrigin, glyphRun, glyphRunDescription, brush, measuringMode);

This is essentially what Windows Terminal uses to draw text.
I'm seeing that drawing colored text consumes more CPU time in the renderer. I can't say whether it's because of DirectX or the GPU, or both. I'm no DX expert so I can only guess both.

skyline75489 · 2021-06-08T09:06:20Z

@mmozeiko To help you understand how things work under the hood, the RenderThread you see is actually used by conhost.exe to "render" VT sequences for the Windows Terminal. That's why you see all the text-related things. This is part of the ConPTY mechanism. A more detailed introduction can be found here.

Shouldn't my "terminal" receive direct bytes in pipe from what termbench application is sending?

Unfortunately no. There's currently no way to bypass the ConPTY layer. This does hurt the performance but it's necessary at the moment for the terminal to work properly. There's discussion about ConPTY "passthrough" in #1173.

cmuratori · 2021-06-08T09:19:02Z

1. The deepest call that consumes the CPU is https://github.com/microsoft/terminal/blob/fb597ed304ec6eef245405c9652e9b8a029b821f/src/renderer/dx/CustomTextRenderer.cpp#L798

Is it not possible to post the entire trace?

skyline75489 · 2021-06-08T09:25:46Z

@cmuratori it's possible. Just me being lazy about it because I've seen too many of those traces.

The WPR traces actually vary, depending on the content being drawn, the font and probably also the GPU performance. Check out the screenshot here #6206 (comment) if you're interested. This PR helps the performance with Cacafire, but for cmatrix (or more practically, vim) it does not make much difference.

mmozeiko · 2021-06-08T09:44:29Z

@skyline75489
I cannot test this, because I have no idea how to rebuild conhost.exe, but from my benchmark it is visible that all this terminal stuff can be sped up a lot by improving string processing in the ConPTY layer. Nothing needs to be changed in DirectX rendering. We just need to avoid expensive string allocations and operations. This will help all terminal applications - not only Windows Terminal. Currently if somebody wants to implement more efficient rendering they really cannot, because they will be bottlenecked by these issues in the ConPTY layer.

For example, _SetGraphicsRenditionRGBColor should be changed to something like this (probably can do something even better, but I wrote this in github comment):

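HRESULT VtEngine::_SetGraphicsRenditionRGBColor(const COLORREF color, const bool fIsForeground) noexcept
{
    DWORD const r = GetRValue(color);
    DWORD const g = GetGValue(color);
    DWORD const b = GetBValue(color);

    // Appends the decimal digits of x (0-255) at ptr, without leading zeros.
#define FMT_BYTE(x)                                              \
    if (x >= 100) *ptr++ = static_cast<char>((x/100) + '0');     \
    if (x >= 10)  *ptr++ = static_cast<char>(((x/10)%10) + '0'); \
    *ptr++ = static_cast<char>((x%10) + '0');

    // Worst case "\x1b[38;2;255;255;255m" is 19 bytes: 10 fixed + 3 digits per channel.
    char buffer[10+3+3+3];

    char* ptr = buffer;
    *ptr++ = '\x1b';
    *ptr++ = '[';
    *ptr++ = fIsForeground ? '3' : '4';
    *ptr++ = '8';
    *ptr++ = ';';
    *ptr++ = '2';
    *ptr++ = ';';
    FMT_BYTE(r);
    *ptr++ = ';';
    FMT_BYTE(g);
    *ptr++ = ';';
    FMT_BYTE(b);
    *ptr++ = 'm';

#undef FMT_BYTE

    // Explicit cast: ptrdiff_t to size_t would be a narrowing conversion inside braces.
    return _Write({ buffer, static_cast<size_t>(ptr - buffer) });
}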

No std::string and no vsnprintf functions. Rest of file uses too much of std::string just for trivial constant string literals used in formatter string.
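
For anyone who wants to try the digit-by-digit approach outside the conhost tree, here is a minimal standalone sketch of the same idea in portable C++. The names `format_rgb_sgr`, `append_byte_decimal` and `kMaxRgbSgrLength` are made up for this example and are not part of the terminal codebase; it only illustrates formatting into a caller-provided fixed buffer with no std::string and no printf-style call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Appends the decimal form of one byte (0-255) at `out`, without leading
// zeros, and returns the new end pointer.
inline char* append_byte_decimal(char* out, std::uint8_t value) {
    if (value >= 100) *out++ = static_cast<char>('0' + value / 100);
    if (value >= 10)  *out++ = static_cast<char>('0' + (value / 10) % 10);
    *out++ = static_cast<char>('0' + value % 10);
    return out;
}

// Longest output is ESC[38;2;255;255;255m: 7 + 3 + 1 + 3 + 1 + 3 + 1 bytes.
constexpr std::size_t kMaxRgbSgrLength = 19;

// Writes ESC[38;2;r;g;bm (foreground) or ESC[48;2;r;g;bm (background) into
// `out`, which must have room for kMaxRgbSgrLength bytes.
// Returns the number of bytes written; no terminating NUL is added.
inline std::size_t format_rgb_sgr(char* out, std::uint8_t r, std::uint8_t g,
                                  std::uint8_t b, bool is_foreground) {
    assert(out != nullptr);
    char* p = out;
    *p++ = '\x1b';
    *p++ = '[';
    *p++ = is_foreground ? '3' : '4';
    *p++ = '8';
    *p++ = ';';
    *p++ = '2';
    *p++ = ';';
    p = append_byte_decimal(p, r);
    *p++ = ';';
    p = append_byte_decimal(p, g);
    *p++ = ';';
    p = append_byte_decimal(p, b);
    *p++ = 'm';
    return static_cast<std::size_t>(p - out);
}
```

A caller would keep a `char buf[kMaxRgbSgrLength]` on the stack and pass the returned length along with the pointer to whatever write function it uses.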

skyline75489 · 2021-06-08T09:55:12Z

@mmozeiko Yeah, you have a good point here actually. What I want to say is that there's a lot more going on under the hood than just "processing terminal sequences". The performance of the ConPTY layer is also very important for the console subsystem; as you mentioned, it helps all terminal applications.

Do understand that some of the tuning tricks are not used, for the sake of both readability and maintainability of the project. That being said, if we find something worth investigating, I think we'd all be happy to squeeze as much CPU as we can to improve the performance.

superninjakiwi · 2021-06-08T10:20:31Z

If all terminals have to go through this code to function on windows, I feel like performance should be more important than it seems to be treated. It's one thing if only the default terminal suffers performance issues, but if an entire classification of applications on windows suffers negative effects due to the way strings are handled, and gets none of the benefits, I'm not sure that's the best way to prioritize things.

skyline75489 · 2021-06-08T11:51:34Z

Here’s something interesting. I tried to port the benchmark to Linux and I’m seeing an even larger performance gap between colored and non-colored text on Linux (roughly 60x - 80x). The overall rendering performance is better on Windows, of course.

Will see if I can port it to macOS tomorrow.

vaualbus · 2021-06-08T11:53:12Z

So the overall result is: rendering colored text in a console is not easy and requires a supercomputer to reach a good frame rate.

skyline75489 · 2021-06-08T12:04:40Z

Oops. Accidentally closed this.

@vaualbus haha I see what you mean. Basically rendering colored text will be slower than non-colored text. But IMO it’s still fast enough for daily usage, be it on Linux or Windows.

forksnd · 2021-06-08T12:17:48Z

Basically rendering colored text will be slower than non-colored text.

@skyline75489 Modern AAA games can render millions of polygons and do ray-traced lighting at 60 fps, while Windows Terminal is able to render some text at 2-3 fps. Clearly, there is some part of this terminal emulator that is pushing the available hardware beyond its limits, either intentionally or not.

Can you (or someone else familiar with this code base) explain what part of the code needs this much computing power, so that the community can see about potentially filing a pull request to fix this performance issue?

But IMO it’s still fast enough for daily usage, be it on Linux or Windows.

With all due respect, if Windows console infrastructure is not (and has never been) performant, then you do not know what daily-usage applications you are missing, which are impossible to write currently due to the pervasive slowness.

jfhs · 2021-06-08T12:37:58Z

I've just benchmarked running Terminal + OpenConsole with @mmozeiko's change, and see 3x improvement from TermMarkV1.
Here are raw numbers. Before:

Glyphs: 9k  Bytes: 335kb  Frame: 66  Prep: 0ms  Write: 75ms  Read: 0ms  Total: 75ms  
TermMarkV1: 48kcg/s  (Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz Win32 VTS)

After:

Glyphs: 9k  Bytes: 364kb  Frame: 187  Prep: 0ms  Write: 53ms  Read: 0ms  Total: 53ms 
TermMarkV1: 153kcg/s  (Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz Win32 VTS)

Given such significant improvement (even if in benchmark app), and functional equivalence, I think you should consider that change (and similar changes for other trivial formatting code paths), despite some detriment to readability.

skyline75489 · 2021-06-08T12:48:05Z

@forksnd I totally get your point. I for one have been trying to improve the rendering performance of the terminal since the year 2019. I feel eligible to say a few words here.

For those who don’t quite get how text rendering works, it may seem unreasonable. But one major performance bottleneck comes from text layout & rendering. I have filed several PRs trying to minimize the impact of text layout. As of now, in order to use all the fancy Unicode features that people expect from a modern terminal (emojis, CJK languages, RTL, etc.), a significant amount of time is needed for text layout. This gets worse when the text is colored because it forces us to split the text into different runs according to their color.

Modern AAA games can render millions of polygons and do ray-traced lighting at 60 fps,

If you understand what I mean about text layout, you’d understand that drawing text is a very different task from drawing polygons & the like. I may sound naive but I don’t actually know of an AAA game that draws millions of characters of text. The only kind of applications that (I know of) draw a lot of text are terminal applications. I mentioned DirectX and maybe this is a bit misleading. The terminal needs DirectX for both rendering (Direct2D) and text layout (DirectWrite). So the technologies behind games and terminals are not exactly the same, nor do they have the same performance metrics, IMO.

which are impossible to write currently due to the pervasive slowness

It may surprise you but up until now we’ve been mostly targeting Linux applications & terminals as some sort of benchmarking standard, since people have been using the Linux tools since forever. Also with the help of WSL, this can be easily conducted without firing up a VM. After 3 years of open-source development, the Windows Terminal can handle applications like cmatrix & cacafire easily. I wouldn’t call it a perfect example but I think it can be seen as an indicator of how performant the terminal currently is.

superninjakiwi · 2021-06-08T13:09:22Z

I can understand your position here, and I don't think you've taken an unreasonable position. I do think that, considering the vast amount of string handling you do with terminals, the poor performance of std::string hits this kind of application harder than anyone else, and more than any other program, I believe that the terminal would benefit from at least considering alternatives to the current string handling, if it can be shown to be a significant enough performance boost, even if the current string handling is judged to be a bit more readable.

skyline75489 · 2021-06-08T13:22:16Z

@superninjakiwi thanks for the kind words and the suggestions. I will see if there’s anything I can do in the future.

Excuse me for being wordy here. I swear this is my last comment for the day. Most people tend to underestimate how hard text layout is. Turns out it’s really, really hard. When it comes to text layout, it’s really, really hard to even be correct, let alone be performant. In this modern world where you can almost find anything in Unicode, you’ll be surprised how many things are needed to correctly layout complicated text. It’s so hard that it requires a dedicated framework for just text layout (DirectWrite on Windows, CoreText on iOS/macOS)

Again I’m no expert in text layout. One thing I found that is crucial when it comes to the performance of text layout is that there isn’t much room for parallelism, because sometimes (maybe most of the time) you have to do it sequentially (think about ligatures, for example). So not much of the modern multi-core compute power can be used, be it on CPU or GPU.

When we talk about the challenges in performance of the terminal, this here is just the tip of the iceberg. I’m super happy that so many people are interested in the project and in this particular area. I’m glad I can explain things so people can better understand what stage we are at, and hopefully where we’re headed.

cmuratori · 2021-06-08T16:31:13Z

Most people tend to underestimate how hard text layout is.
Turns out it’s really, really hard. When it comes to text layout,
it’s really, really hard to even be correct, let alone be performant.
In this modern world where you can almost find anything in
Unicode, you’ll be surprised how many things are needed to
correctly layout complicated text. It’s so hard that it requires
a dedicated framework for just text layout (DirectWrite on
Windows, CoreText on iOS/macOS)

Can you be more specific here about what you are talking about? While text output can be "hard", for some not particularly hard definition of "hard", it is usually because of things that terminals don't do, such as aesthetically-pleasing justification; preventing rivers, widows, and orphans; precisely aligning internal character features with other characters; proper ligatures; etc. So I guess I'm not sure I know where the "hard" part would come in for rasterizing a monospace font into a fixed character grid?

Can you point me to or post some examples of "difficult text output" I can make happen in the Windows Terminal so I can see what you mean?

DHowett · 2021-06-08T17:36:47Z

Sorry -- I'm going to try to corral this thread before it gets further out of hand.

1. The translation from the console buffer--which we need to keep for compatibility reasons--to VT does too much string math to be properly performant.
2. Per #10362 (comment) the bottleneck identified here is not in the DirectWrite renderer
   - Our DirectWrite renderer is somewhat inefficient and causes more command list flushes than should be necessary. I think "text rendering is hard" (#10362 (comment)) because we've made it hard, not because of some intrinsic quality of the universe.
3. Measuring throughput here is somewhat annoying because everything goes through conhost/OpenConsole first (the "VT Renderer") and then through Terminal second (the "DirectWrite renderer")
   - Both of these impact perceived performance, but per #10362 (comment) (again) the effect is compounded since Terminal requires both; other ConPTY consumers only require one.

Is this an acceptable summary?

Notes

1. In response to #10362 (comment): compiling conhost is relatively easy: from OpenConsole.sln, set Host.EXE as the startup project and hit F5. Make sure that you're running Release/x64 or something that matches your local architecture, as conhost is sensitive to the architecture of the kernel.
2. In response to #10362 (comment): We should be publishing PDBs, and I'm sorry for the miss here. When I find a thread that's asking for them I usually upload them, but we don't have an automatic process in place for making them publicly available.
   - I'd like to inch us ever closer to "reproducible builds", but that's a long-term goal.

All in all, this sounds like a more general case of #410. Chafa does exactly this ("change colors a lot, try to render as fast as possible") and we are now, at least, profiling and optimizing using it as a test case (PR #10071).

cmuratori · 2021-06-08T18:05:13Z

I think "text rendering is hard" because we've made it hard,
not because of some intrinsic quality of the universe.

That sounds much more sensible, yes.

DHowett · 2021-06-08T18:19:30Z

I will take your terse response as accepting the summary. Thanks!

cmuratori · 2021-06-08T19:52:53Z

I apologize for the terseness, but I don't feel like I am in any position to accept or reject a summary, since most of what you were replying to was other people's comments (@mmozeiko, for example, was the person posting about how the processing is slow right now due to unnecessary string manipulation). They would be the ones to accept or reject a summary :)

In general, all I wanted to have open with this particular bug report was "color text rendering should not be slower than uncolored text rendering". While it may be true that architectural decisions made in how Windows Terminal works could mean that it will always be slow in this regard, that is a different thing from it being slow because the actual processing is substantial. The processing required here is definitely insubstantial.

Parsing a 1mb buffer of control codes and outputting the GPU buffer necessary to encode ~30k colored fixed-width glyphs is something I would expect to run in the thousands of frames per second on a modern machine, not five frames per second as it does currently. So the difference between the reasonably expected performance and the realized performance here is several orders of magnitude, which would at least suggest to me that a great deal of improvement could be made to the performance of the product if one were so inclined.
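
To give a sense of how little work the parsing half of that involves, here is a minimal sketch of a parser for the kind of stream in question. It is an illustration under stated assumptions, not how Windows Terminal or conhost parse VT: it handles only single-byte characters plus the two truecolor SGR forms (ESC[38;2;r;g;bm and ESC[48;2;r;g;bm), skips every other CSI sequence, and the names `Cell` and `parse_cells` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// One output cell: a character plus the colors in effect when it was seen.
struct Cell {
    char ch;
    std::uint32_t fg;  // 0xRRGGBB
    std::uint32_t bg;  // 0xRRGGBB
};

// Single pass over the input. Recognizes ESC [ <params> <final byte>; only
// the "m" final byte with exactly "38;2;r;g;b" or "48;2;r;g;b" changes state.
// Every byte outside a CSI sequence is emitted as a cell.
std::vector<Cell> parse_cells(std::string_view in) {
    std::vector<Cell> cells;
    std::uint32_t fg = 0xFFFFFF;
    std::uint32_t bg = 0x000000;
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '\x1b' && i + 1 < in.size() && in[i + 1] == '[') {
            std::uint32_t params[5] = {0, 0, 0, 0, 0};
            std::size_t index = 0;  // which parameter is being accumulated
            i += 2;
            // Parameter bytes run until a final byte in the range 0x40-0x7E.
            while (i < in.size() && !(in[i] >= 0x40 && in[i] <= 0x7E)) {
                if (in[i] == ';') {
                    ++index;
                } else if (in[i] >= '0' && in[i] <= '9' && index < 5) {
                    params[index] = params[index] * 10 +
                                    static_cast<std::uint32_t>(in[i] - '0');
                }
                ++i;
            }
            const bool is_sgr = i < in.size() && in[i] == 'm';
            if (i < in.size()) ++i;  // consume the final byte
            if (is_sgr && index == 4 && params[1] == 2) {
                const std::uint32_t rgb = ((params[2] & 0xFF) << 16) |
                                          ((params[3] & 0xFF) << 8) |
                                          (params[4] & 0xFF);
                if (params[0] == 38) fg = rgb;
                else if (params[0] == 48) bg = rgb;
            }
        } else {
            cells.push_back({in[i], fg, bg});
            ++i;
        }
        assert(i <= in.size());  // the cursor never runs past the input
    }
    return cells;
}
```

The resulting vector is the kind of per-cell data a renderer could upload as-is; Unicode decoding, cursor movement and the rest of the VT command set are deliberately left out.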

That may not be a priority, however, which is fine, and you are welcome to close this report as not being something you're interested in fixing, etc.

DHowett · 2021-06-08T20:06:29Z

Parsing a 1mb buffer of control codes and outputting the GPU buffer necessary to encode ~30k colored fixed-width glyphs is something I would expect to run in the thousands of frames per second ... that a great deal of improvement could be made to the performance of the product if one were so inclined.

I completely agree. I'll use this report as the tracking issue for any performance improvements we make here.

Thanks for raising this -- and I'm excited to get termbench going.

EDIT: And, that's fair, the bit about the summary. Sorry. 😄

cmuratori · 2021-06-08T22:22:14Z

I completely agree. I'll use this report as the tracking issue
for any performance improvements we make here.

Awesome! Let me know if you need me to modify termbench to test other things at some point. It is obviously very simple at the moment because many (most?) terminals already struggle with its current output.

skyline75489 · 2021-06-08T22:45:32Z

Parsing a 1mb buffer of control codes and outputting the GPU buffer necessary to encode ~30k colored fixed-width glyphs is something I would expect to run in the thousands of frames per second

This is what I see on Linux. You’re totally right about the terminal not being that performant, because of the existence of ConPTY & co. But on Linux non-colored text is also way faster than colored text.

Man, we need this on Linux. I’ll send a PR later.

cmuratori · 2021-06-08T22:56:53Z

Man, we need this on Linux. I’ll send a PR later.

I will go ahead and post a Linux version as well, if that is useful.

cmuratori · 2021-06-08T23:33:48Z

The translation from the console buffer--which we need to keep for compatibility reasons--
to VT does too much string math to be properly performant.

Before I forget, I just wanted to mention: this part was a little confusing to me, because my understanding was that up until recently, you could not even use VT codes in Windows terminal - hence the need to set the console mode to ENABLE_VIRTUAL_TERMINAL_PROCESSING (which doesn't even exist in Windows 8). So, is there a reason you couldn't just bypass the entire pipeline when the person on the other end sets that flag? Because then you know they aren't expecting any backwards compatibility, because it is obviously a new app?

Maybe I'm missing something here, but I just thought I'd mention it, because it seemed odd. It seems like you have an explicit flag that tells you the person doesn't need the old console behavior (or that you can easily define to be that, because it is a brand new flag), and that seems like that might solve the entire string processing problem that is currently going on in the conduit?

skyline75489 · 2021-06-08T23:38:01Z

@cmuratori that is explained in #1173, which is also a very lengthy thread and requires some background knowledge.

stephc-int13 · 2021-06-09T00:55:50Z

@skyline75489 I understand that text layout can be difficult and that support for emoji, ligatures, etc. is needed, but I also think that it should not be too difficult to process the VT stream in chunks and have a quick path for the common cases when the layout is a simple fixed grid to avoid paying for unnecessary processing all the time.

cmuratori · 2021-06-09T02:04:37Z

@cmuratori that is explained in #1173, which is also a very lengthy thread and requires some background knowledge.

Reading through that, ENABLE_PASSTHROUGH_MODE actually sounds like it would improve the termbench performance substantially without anyone needing to optimize the current VT-to-non-VT-and-back-again problems that @mmozeiko was observing. Is this still a planned feature? I would love to add a toggle for it in termbench if it becomes a reality.

mmozeiko · 2021-06-09T02:20:18Z

Just to see how fast terminal can go, I patched conhost source to drop all incoming data - so there will be no VT string parsing & processing happening. I commented out call to ProcessString on this line: https://github.com/microsoft/terminal/blob/v1.8.1444.0/src/host/_stream.cpp#L972

All the terminal will now show is black window, but we could see how fast terminal application like termbench can run.

What I got is 2-3 msec per frame, TermMark shows 7800 score, and termmark.exe is sending data with ~270MB/s to OpenConsole.exe.
Compared to the previous 300msec, TermMark=80 and 3MB/s.
All three numbers show ~100x speedup.
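
Spelled out with the figures quoted above (taking 3 ms as the new frame time; with 2 ms the first ratio would be 150):

```latex
\[
\frac{300\ \text{ms}}{3\ \text{ms}} = 100,
\qquad
\frac{7800}{80} = 97.5,
\qquad
\frac{270\ \text{MB/s}}{3\ \text{MB/s}} = 90
\]
```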

CPU usage also dropped - before conhost was using 100% of one core and termmark.exe was almost idle at 0%. Now both are doing some work at 50% of core, so CPU is not bottleneck anymore for conhost.

Here's the screenshot in task manager:

What this means is that with good incoming text parsing code and good rendering code which should really take no time for 27K characters I'm using, the terminal could easily render 60fps or even upwards of 100fps. Not only that - but it would also save power & battery for laptop users because of lower CPU usage.

ped7g · 2021-06-09T05:44:22Z

One thing made me curious - just a mental exercise: colored vs non-colored throughput. Shouldn't the colored text be actually faster in terms of processed bandwidth? (I guess when comparing amount of characters, it's fair to assume colored will be much slower, but some comments made it sound as actual throughput is slower)

Let's say we have 4MB of input data for the terminal to render. If it is colored, the number of glyphs to render will be considerably lower. Rendering glyphs may be complex, because of RTL/ligatures/... (all the nice Unicode stuff). Processing a color code, on the other hand, means "just" parsing the color code and modifying the color of further rendering, but there's no pixel-length calculation or layout positioning of glyphs.

So if you think about it like this, then processing the same amount of input should be faster with color codes, because there are far fewer actual characters to render?

/end of mental exercise

nico-abram · 2021-06-09T05:50:55Z

because of RTL

Does terminal actually handle RTL? I was under the impression it didn't

cmuratori · 2021-06-09T07:18:30Z

One thing made me curious - just a mental exercise: colored vs non-colored
throughput. Shouldn't the colored text be actually faster in terms of
processed bandwidth?

"Shouldn't" is not really something you can say definitively about this particular situation because it would depend on the implementation details.

If the processing for the input is significantly slower than the glyph rendering, then you would expect the FPS vs. input footprint to be the same or worse for colored vs non-colored, because your performance will be entirely dependent on the input processing, which costs more proportional to the footprint.

On the other hand, if the glyph rendering is significantly slower than the input processing, then you would expect the FPS vs. input footprint to improve substantially for colored glyphs because the performance would stay the same but the footprint would increase, leading to a faster "score" by your metric.
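
The two cases can be made concrete with a toy linear cost model. This is only a sketch: the model (frame time = per-byte cost times input bytes, plus per-glyph cost times glyph count) and the names `FrameCost`, `frames_per_second` and `input_throughput` are assumptions for illustration, and any constants plugged in are invented rather than measured.

```cpp
#include <cassert>

// Hypothetical per-unit costs of the two stages, in seconds.
struct FrameCost {
    double seconds_per_input_byte;  // input processing (parsing etc.)
    double seconds_per_glyph;       // glyph rendering
};

// Frames per second for a frame made of `input_bytes` bytes of input that
// produce `glyphs` glyphs on screen, under the linear model above.
double frames_per_second(const FrameCost& cost, double input_bytes, double glyphs) {
    const double seconds = cost.seconds_per_input_byte * input_bytes +
                           cost.seconds_per_glyph * glyphs;
    assert(seconds > 0.0);
    return 1.0 / seconds;
}

// The "processed bandwidth" metric from the question: input bytes per second.
double input_throughput(const FrameCost& cost, double input_bytes, double glyphs) {
    return input_bytes * frames_per_second(cost, input_bytes, glyphs);
}
```

With the per-glyph cost set to zero, bytes per second comes out the same for a 30 KB plain frame and a 1 MB colored frame that show the same 30k glyphs. With the per-byte cost set to zero, the frame rate is identical for both and the colored frame's bytes per second is higher simply because each frame carries more bytes.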

The reason nobody is concerned about "processed bandwidth" in this particular thread is because the memory bandwidth necessary to retrieve the input is insubstantial in both the colored and non-colored cases. The terminal would have to be several orders of magnitude faster before you would be looking at input bandwidth as a metric.

The largest VT-coded input in question is around 1mb of data for a full screen of color-per-glyph output. On a modern machine you would expect to read a cold 1mb buffer at ~20gb/s, or a hot one (which this would be, at least partially) at ~80gb/s, so a single core would expect to read the input somewhere between twenty and eighty thousand times a second. Since the observed frame rate was around five frames per second, we know that input memory bandwidth is not implicated in the performance problems.
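
The arithmetic behind those figures, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and the ~1 MB per frame quoted above:

```latex
\[
\frac{20\ \text{GB/s}}{1\ \text{MB/frame}} = 2 \times 10^{4}\ \text{frames/s},
\qquad
\frac{80\ \text{GB/s}}{1\ \text{MB/frame}} = 8 \times 10^{4}\ \text{frames/s}
\]
\[
\frac{2 \times 10^{4}}{5} = 4000,
\qquad
\frac{8 \times 10^{4}}{5} = 16000
\]
```

So against an observed rate of about five frames per second, simply reading the input leaves a headroom factor of roughly 4,000x to 16,000x.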

(And note that when I say "input memory bandwidth", I am talking only about the bandwidth necessary to get the data from the application to the terminal. Obviously we know the terminal itself is taking a long time to process the data, so that processing may itself be generating large amounts of unnecessary memory traffic which then implicates memory bandwidth as a bottleneck, etc., etc.)

Not sure if that is what you were asking, but hopefully that provides enough information to answer the question.

skyline75489 · 2021-06-10T03:19:29Z

Is (ENABLE_PASSTHROUGH_MODE) this still a planned feature?

I think it's the right direction but considering the amount of all backlog items & the limited developer time, I wouldn't really expect to see it implemented before the year 2023. We'll have to live with the ConPTY layer for a reasonably long time.

cmuratori · 2021-06-10T03:20:40Z

I wouldn't really expect to see it implemented before the year 2023.

Ouch.

DHowett · 2021-06-10T15:44:31Z

I wouldn't really expect to see it implemented before the year 2023.

Ouch.

The hang-up is that this needs OS changes and, while we do contribute the console host code from this repository back into Windows, the OS moves much slower than this project does. 😄

lhecker · 2021-06-16T23:41:27Z

FYI We investigated this today and the slowdown likely occurs because we draw each run of consecutive characters with identical text attributes (colors, etc.) at once. If the background color changes for each character, each character will be drawn independently, which makes rendering slow. This affects us more than other terminals, as our parsing and rendering loops still work in sync - the former can't proceed until the latter is finished.

The situation of the submitter of this issue will vastly improve with #6193.
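
To illustrate why per-character colors hurt under that scheme, here is a small sketch of splitting one row into runs of identical attributes. It is not the actual renderer code: `Cell`, `Run` and `split_into_runs` are hypothetical names, and only foreground and background color are compared, whereas a real renderer would look at more attributes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cell {
    char32_t codepoint;
    std::uint32_t fg;  // 0xRRGGBB
    std::uint32_t bg;  // 0xRRGGBB
};

// Half-open range [begin, end) of cells that share the same colors.
struct Run {
    std::size_t begin;
    std::size_t end;
};

// Splits a row into maximal runs of consecutive cells with identical colors.
// If each run costs one draw call, the run count is the draw-call count.
std::vector<Run> split_into_runs(const std::vector<Cell>& row) {
    std::vector<Run> runs;
    std::size_t start = 0;
    for (std::size_t i = 1; i <= row.size(); ++i) {
        const bool boundary = i == row.size() ||
                              row[i].fg != row[start].fg ||
                              row[i].bg != row[start].bg;
        if (boundary) {
            runs.push_back({start, i});
            start = i;
        }
    }
    assert(runs.empty() == row.empty());
    return runs;
}
```

An 80-column row in a single color yields one run, while the same row with a different background per cell yields 80 runs, so the number of draw calls grows with the number of color changes rather than with the amount of text.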

cmuratori · 2021-06-17T00:29:46Z

For what it's worth, #6193 sounds like a step in the wrong direction. Drawing something in multiple passes that could have been drawn in a single pass wastes GPU render target bandwidth.

Drawing a monospace terminal display is straightforward. You have two textures that encode your data. You have a pixel shader that div-floors the screen coordinate to figure out a cell index then looks up into the first texture. It encodes one background color, one foreground color, and one cell-glyph index per terminal cell.

The cell-glyph index is then used for a single dependent texture fetch which loads a per-pixel glyph out of the second texture, which is a glyph atlas encoding the cell-glyph coverage in whatever way makes it easiest to compute your ClearType blending values. Combine the background and foreground color using the ClearType algorithm and blending values, output final pixel color, done. (I am assuming the terminal has to support ClearType - if it doesn't, you just blend with a regular coverage value directly and it's even easier).

There would only be one dispatch for the entire terminal display, which is a single full-window quad. Note also that I say "cell-glyph", not glyph, because obviously if you want glyphs that span two cells, you split those into two cell-glyphs accordingly (but the renderer doesn't care).
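As a rough illustration of the per-pixel logic described above, here is a CPU-side sketch in Python rather than an actual pixel shader. It assumes plain grayscale coverage instead of ClearType, a tiny hypothetical 2x2 cell size, and hypothetical names (`shade_pixel`, `cells`, `atlas`); none of this comes from an existing implementation.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]
CELL_W, CELL_H = 2, 2  # hypothetical cell size in pixels


def shade_pixel(x: int, y: int,
                cells: List[Tuple[Color, Color, int]],
                atlas: List[List[List[float]]],
                columns: int) -> Color:
    """Compute one output pixel the way the described shader would.

    cells: one (background, foreground, cell_glyph_index) entry per cell.
    atlas: cell_glyph_index -> CELL_H x CELL_W coverage values in 0.0..1.0.
    """
    # Div-floor the screen coordinate to find the cell.
    cell_x, cell_y = x // CELL_W, y // CELL_H
    bg, fg, glyph = cells[cell_y * columns + cell_x]
    # Dependent fetch: coverage of this pixel inside the cell-glyph.
    coverage = atlas[glyph][y % CELL_H][x % CELL_W]
    # Blend background toward foreground by coverage (grayscale, not ClearType).
    return tuple(round(b + (f - b) * coverage) for b, f in zip(bg, fg))


atlas = [
    [[0.0, 0.0], [0.0, 0.0]],  # glyph 0: blank
    [[1.0, 0.0], [0.0, 1.0]],  # glyph 1: a diagonal
]
cells = [
    ((0, 0, 0), (255, 255, 255), 1),  # white diagonal on black
    ((0, 0, 255), (255, 0, 0), 0),    # blank cell on blue
]
print(shade_pixel(0, 0, cells, atlas, columns=2))  # (255, 255, 255)
print(shade_pixel(2, 0, cells, atlas, columns=2))  # (0, 0, 255)
```

Every pixel is computed independently from the two tables, which is why the cost does not depend on how many color changes a row contains.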

That's it, right? I mean that is the entire renderer. It'd be a very short pixel shader, modulo the fact that you have a couple different ClearType patterns, so you'd need a few different conditional compilations of the shader.

This would render at thousands of frames per second. The only bandwidth to the card would be downloading texture updates. The parser outputs these - one texture update to change the cell contents in the cell contents texture, and then occasional texture updates to add glyphs to the cell-glyph coverage atlas whenever the parser detects a codepoint that has not previously been rasterized (in normal usage this would happen only at the beginning, and then all relevant glyphs would soon be in the atlas and you'd never need any more updates to it).
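The "rasterize only when a codepoint has not been seen before" idea can be sketched as a small cache. This is a hypothetical illustration: `GlyphAtlas`, `slot_for`, and the `rasterize` callback are assumed names, and the callback stands in for whatever font rasterizer would actually be used.

```python
from typing import Callable, Dict, List

Bitmap = List[List[float]]


class GlyphAtlas:
    """Hypothetical on-demand atlas: each codepoint is rasterized at most once."""

    def __init__(self, rasterize: Callable[[int], Bitmap]) -> None:
        self._rasterize = rasterize        # assumed callback: codepoint -> coverage
        self._slots: Dict[int, int] = {}   # codepoint -> atlas slot index
        self.bitmaps: List[Bitmap] = []    # slot index -> coverage bitmap
        self.uploads = 0                   # number of atlas texture updates

    def slot_for(self, codepoint: int) -> int:
        """Return the atlas slot for a codepoint, rasterizing it on first use."""
        slot = self._slots.get(codepoint)
        if slot is None:
            slot = len(self.bitmaps)
            self.bitmaps.append(self._rasterize(codepoint))
            self._slots[codepoint] = slot
            self.uploads += 1
        return slot


# Stand-in rasterizer that returns a one-pixel bitmap for any codepoint.
demo_atlas = GlyphAtlas(lambda codepoint: [[1.0]])
slots = [demo_atlas.slot_for(ord(ch)) for ch in "abcabc"]
print(slots)               # [0, 1, 2, 0, 1, 2]
print(demo_atlas.uploads)  # 3
```

After the first occurrence of each character, lookups hit the cache and no further atlas updates are needed.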

Am I missing something? Why is all this stuff with "runs of characters" happening at all? Why would you ever need to separate the background from the foreground for performance reasons? It really seems like most of the code in the parser/renderer part of the terminal is unnecessary and just slows things down. What this code needs to do is extremely simple and it seems like it has been massively overcomplicated.

DHowett · 2021-06-17T01:29:37Z

I believe what you’re doing is describing something that might be considered an entire doctoral research project in performant terminal emulation as “extremely simple” somewhat combatively. I am not aware of the body of work around performant GPU terminal emulation, but I’m somewhat surprised that other accelerated terminals aren’t already doing this (as I imagine we would have heard about it before now had they done so.)

Is there not a significant startup cost to this? Rendering the entire glyph closure available from the font and all of its fallbacks to a texture seems prohibitively expensive, but if you’re removing a stage from the pipeline that determines exactly what glyphs to shape and where you’ll need to do that—as well as reimplement a large portion of a text shaper, no?

I expect that DirectWrite does incredible optimizations on its own, and that we are impeding it from doing so by not intelligently commanding it, but I don’t believe that it’s quite that advanced.

Setting the technical merits of your suggestion aside though: peppering your comments with clauses like “it’s that simple” or “extremely simple” and, somewhat unexpectedly “am I missing something?” can be read as impugning the reader. Some folks may be a little put off by your style here. I certainly am, but I am still trying to process exactly why that is.

DHowett · 2021-06-17T01:38:09Z

To address Leonard’s specific reason for calling out background rendering: right now, we don’t have a single stage pipeline that uses a pixel shader to pull cell-glyphs from a texture. What we have instead is a rendering pipeline that emits up to 7,200 individual draw calls, and we’re talking about reducing that[1]. I’m not aiming for instant perfection, but simply trying to converge on a better solution. I can’t justify taking somebody offline for the months it would take to retool the entire renderer and then further justify dealing with the inevitable globalization issues that will follow to push thousands of frames per second when decoupling the renderer from the output pipeline gets the major performance bottleneck out of the way and better local draw call batching can get us in throwing distance of hundreds of fps.

[1]: at the very least, introducing a stage specifically for rendering backgrounds lets us better batch draw calls and get the CPU and our drawing pipeline stalls out of the way.
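For readers wondering what decoupling the renderer from the output pipeline buys, here is a toy model. It is not a measurement and not the Terminal's design: the function names, the arrival times, and the 16 ms frame interval are all hypothetical. In the synchronous case every output chunk forces a paint; in the decoupled case the renderer paints at most once per frame tick, and only if something arrived since the previous tick.

```python
import math
from typing import List


def frames_synchronous(arrival_times_ms: List[float]) -> int:
    """Synchronous model: every chunk of output triggers one paint."""
    return len(arrival_times_ms)


def frames_decoupled(arrival_times_ms: List[float], frame_interval_ms: float) -> int:
    """Decoupled model: a chunk marks the screen dirty and the next frame tick
    paints it, so chunks arriving within one interval share a single paint."""
    dirty_ticks = {math.ceil(t / frame_interval_ms) for t in arrival_times_ms}
    return len(dirty_ticks)


# Hypothetical workload: 100 chunks arriving one millisecond apart.
arrivals = [float(t) for t in range(100)]
print(frames_synchronous(arrivals))      # 100
print(frames_decoupled(arrivals, 16.0))  # 8
```

The parser never waits for a paint in the second model, so the number of paints is bounded by the frame rate instead of by the output rate.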

cmuratori · 2021-06-17T01:43:53Z

When we're at the stage when something that can be implemented in a weekend is described as "a doctoral research project", and then I am accused of "impugning the reader" for describing something as simple that is extremely simple, we're done. Consider the bug report closed.

lhecker · 2021-06-17T01:52:54Z

Discussion may continue here: #10461
I deeply apologize for the condescending comment below.

Unedited original comment

@cmuratori Apart from what Dustin said, frankly, you seem misguided about how text rendering with DirectWrite works. When you call [`DrawGlyphRun`](https://docs.microsoft.com/en-us/windows/win32/api/dwrite/nf-dwrite-idwritebitmaprendertarget-drawglyphrun) it lays down glyphs in your "texture", _by using a backing glyph atlas internally already_. Basically the thing you suggest us to do, is already part of the framework we use.

Now obviously there's a difference between whether you do thousands of glyph layouts or just a few dozen.
Calling DrawGlyphRun doesn't equate to a full render stage in your GPU either. In fact your GPU is barely involved in text rendering!

Side note: DirectWrite doesn't necessarily cache glyphs between renderings. This is indeed something we could consider doing, but it just absolutely isn't worth it when the problem you have is caused by the number of calls and not by the complexity of laying out a couple of ASCII letters.

👉 Also, ClearType can't be trivially alpha blended, making it impossible to render into a separate glyph atlas.
👉 Finally, Firefox used to use alpha blending, but they moved away from it towards the DirectWrite style of things, because... you guessed it... the use of alpha blending was an absolute nightmare of complexity and was unmaintainable. In fact, it was not something that was created on a weekend.

If you don't believe me I invite you to cat this file in a WSL2 instance. It'll finish drawing the entire 6MB file within about a second or two. From that I can already estimate that after we implemented the issue I linked, your program will render at about ~30 FPS. Significantly more than the current performance, right?

Lastly, I can only suggest that everyone read: https://gankra.github.io/blah/text-hates-you/
You were overly confident in your opinion, but I hope this website helps you understand that it's actually really damn hard.
The reason your program shows a high FPS under other terminal emulators is simply that their rendering pipeline works independently of VT ingestion. Gnome Terminal is not laying out text faster than your display refresh rate either. And of course, again, this is something WT will probably do as well in the future... but this project is nowhere near as old as Gnome Terminal is.

ghost added the Needs-Triage (It's a new issue that the core contributor team needs to triage at the next triage meeting) and Needs-Tag-Fix (Doesn't match tag requirements) labels Jun 8, 2021

skyline75489 added the Area-Performance Performance-related issue label Jun 8, 2021

skyline75489 closed this as completed Jun 8, 2021

skyline75489 reopened this Jun 8, 2021

This was referenced Jun 10, 2021

Should we have a ThrottledFunc that works without a dispatcher? #10393

Closed

Throttle cursor redrawing in outputStream.cpp #10394

Merged

skyline75489 mentioned this issue Jun 15, 2021

Prefer FMT_COMPILE for string formatting in VtRenderer #10426

Merged

cmuratori closed this as completed Jun 17, 2021

microsoft locked as off-topic and limited conversation to collaborators Jun 17, 2021
