November 29, 2019
On Friday, 29 November 2019 at 10:12:32 UTC, Ethan wrote:
> On Friday, 29 November 2019 at 05:16:08 UTC, Jab wrote:
>> IIRC GPUs are limited in what they can do in parallel, so if you only need to do one thing for a specific job the rest of the GPU isn't really being fully utilized.
>
> Yeah, that's not how GPUs work. They have a number of shader units that execute on outputs in parallel. It used to be an explicit split between vertex and pixel pipelines in the early days, where it was very easy to underutilise the vertex pipeline. But shader units have been unified for a long time. Queue up a bunch of outputs and the driver and hardware will schedule it properly.

For those that care about this stuff, mesh shaders are the way forward.

https://devblogs.microsoft.com/directx/dev-preview-of-new-directx-12-features/

https://devblogs.microsoft.com/directx/coming-to-directx-12-mesh-shaders-and-amplification-shaders-reinventing-the-geometry-pipeline/

https://devblogs.nvidia.com/introduction-turing-mesh-shaders/

November 29, 2019
On Friday, 29 November 2019 at 09:00:20 UTC, Gregor Mückl wrote:
> These 2D rendering engines in Qt, cairo, Skia... contain proper implementations for primitives like arcs, polylines with various choices for joints and end caps, filled polygons with correct self-intersection handling, gradients, fill patterns, ... All of these things can be done on GPUs (most of it has been), but I highly doubt that it would be that much faster. You need lots of different shaders for these primitives, and switching state while rendering is expensive.

Well, the whole topic is quite complex. One important aspect is that a UI should leave as many resources as possible to the main program (memory, computational power) and keep power consumption as low as possible. Another aspect is that users have very different hardware, so what works well on the GPU on one machine and with one application might not work well with another. Games can make assumptions that UI frameworks cannot...

But I think the most important aspect is that hardware changes over time, so what is available in the standard API may not reflect what you can do with extensions or a proprietary API. As far as I can tell, there is a push for more RISC-like setups (more simple cores) with more flexibility, but I don't know how far that will go. Maybe the whole coprocessor market will split into multiple incompatible lanes with a software API layer on top (like we see with OpenGL over Metal on OS X, just with more distance between what the API delivers and what the hardware supports). AI and raytracing could lead to a departure in coprocessor tech...

Having a software renderer (or a CPU/GPU hybrid) is clearly the most portable approach; going GPU-only seems to be a risky venture that would require lots of developers to target new architectures as they appear.

November 29, 2019
On Friday, 29 November 2019 at 10:19:58 UTC, rikki cattermole wrote:
> On 29/11/2019 10:55 PM, Basile B. wrote:
>> 
>> Back to the original topic. What people don't realize is that a 100% D GUI would be a more complex project than the D compiler itself. Just the text features is a huge thing in itself: unicode, BDI.
>
> Assuming you mean BIDI, yes. Text layout is a real pain.

Yes, of course, sorry. I think it was actually you who said, maybe here or on IRC, something like "it's a full project in itself".

> Mostly because it needs an expert in Unicode to do right. But it's all pretty well defined, with tests described. So it shouldn't be considered out of scope. Font rasterization, on the other hand... ugh


November 29, 2019
On Friday, 29 November 2019 at 02:42:28 UTC, Gregor Mückl wrote:
> Yes, compositors are implemented using 3D rendering APIs these days

I read on a blog by a browser developer that making your own compositor is wasteful, because the windowing system then has to do an additional copy; applications should instead use the windowing system's compositor in order to save resources.

Seems like they did just that in Firefox recently:

https://mozillagfx.wordpress.com/2019/10/22/dramatically-reduced-power-usage-in-firefox-70-on-macos-with-core-animation/

November 29, 2019
On Friday, 29 November 2019 at 11:16:32 UTC, Ola Fosheim Grøstad wrote:
> https://mozillagfx.wordpress.com/2019/10/22/dramatically-reduced-power-usage-in-firefox-70-on-macos-with-core-animation/

Btw, one of the comments in that thread points out that Safari uses CoreAnimation for all its composition. Hm.

I guess one advantage of that is that your program gets updated automatically when the user installs a new version of the OS. Apple tends to sync OS releases with new hardware.

November 29, 2019
On Friday, 29 November 2019 at 10:08:59 UTC, Ethan wrote:
> On Friday, 29 November 2019 at 02:42:28 UTC, Gregor Mückl wrote:
>> They don't concern themselves with how the contents of these quads came to be.
>
> Amazing. Every word of what you just said is wrong.
>

I doubt this, but I am open to discussion. Let's try to remain civil and calm.

> What, you think stock Win32 widgets are rendered with CPU code with the Aero and later compositors?
>

Win32? Probably still are. WPF and later? No. Those have always had a DirectX rendering backend. And WPF at least has a reputation for being sluggish. I haven't had performance issues with either so far, though.

> You're treating custom user CPU rasterisation on pre-defined bounds as the entire rendering paradigm. And you can be assured that your code is reading from and writing to a quarantined section of memory that will later be composited by the layout engine.
>
> If you're going to bring up examples, study WPF and UWP. Entirely GPU driven WIMP APIs.
>
> But I guess we still need homework assignments.
>

OK, I'll indulge you in the interest of a civil discussion.

> 1) What is a Z buffer?
>

OK, back to basics. When rendering a 3D scene with opaque surfaces, the resulting image only contains the surfaces nearest to the camera. The rest is occluded. Solutions like depth sorting the triangles and rendering back to front are possible (see e.g. the DOOM engine and its BSP traversal for rendering), but they have drawbacks. E.g. even a set of three triangles may be mutually overlapping in a way that no consistent z ordering of the whole primitives exists. You need to split primitives to make that work. And you still need to guarantee sorted input.

A z buffer solves that problem by storing, for each pixel, the minimum z value encountered so far. When drawing a new primitive over that pixel, the primitive's z value is first compared to the stored value, and when it's further away, it is discarded.
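
In code, the core of it is just this per-pixel comparison. This is a CPU-side toy, not how the hardware is wired, and the Framebuffer/writeFragment names are made up:

    #include <cstdint>
    #include <vector>

    // Hypothetical framebuffer: one color and one depth value per pixel.
    struct Framebuffer {
        int width, height;
        std::vector<uint32_t> color;
        std::vector<float>    depth;  // cleared to the farthest value

        Framebuffer(int w, int h)
            : width(w), height(h), color(w * h, 0), depth(w * h, 1.0f) {}
    };

    // The core of the z test: keep the fragment only if it is nearer
    // than what is already stored for that pixel.
    void writeFragment(Framebuffer& fb, int x, int y, float z, uint32_t rgba)
    {
        const std::size_t i = static_cast<std::size_t>(y) * fb.width + x;
        if (z < fb.depth[i]) {   // "less" test; GPUs let you pick the comparison
            fb.depth[i] = z;     // remember the new nearest depth
            fb.color[i] = rgba;  // and overwrite the color
        }
        // otherwise the fragment is occluded and discarded
    }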

Of course, a hardware z buffer can be configured in various other interesting ways. E.g. restricting the z value range to half of the NDC space, alternating half spaces and simultaneously flipping between min and max tests is an old trick to skip clearing the z buffer between frames.

There's still more to this topic: transformation of stored z values to retain precision in 24-bit integer z buffers, hierarchical z buffers, early z testing... I'll just cut it short here.

> 2) What is a frustum? What does "orthographic" mean in relation to that?
>

The view frustum is the volume that is mapped to NDC. For perspective projection, it's a truncated four-sided pyramid. For orthographic projection, it's a cuboid. Fun fact: for correct stereo rendering to a flat display, you need asymmetrical perspective frustums; doing it with symmetric frustums rotated towards the vergence point leads to distortions.
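
For the orthographic case, that mapping is just a per-axis scale and translation. A sketch of the usual OpenGL-style matrix (the function name is made up; row-major here):

    #include <array>

    // Maps x in [l,r] and y in [b,t] to [-1,1], and camera-space z in [-n,-f]
    // to [-1,1] (OpenGL conventions, right-handed, camera looking down -z).
    std::array<float, 16> orthoProjection(float l, float r, float b, float t,
                                          float n, float f)
    {
        return {
            2.0f / (r - l), 0.0f,            0.0f,            -(r + l) / (r - l),
            0.0f,           2.0f / (t - b),  0.0f,            -(t + b) / (t - b),
            0.0f,           0.0f,           -2.0f / (f - n),  -(f + n) / (f - n),
            0.0f,           0.0f,            0.0f,             1.0f
        };
    }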

> 3) Comparing the traditional and Aero+ desktop compositors, which one has the advantage with redraws of any kind? Why?
>

I'm assuming that by traditional you mean a CPU compositor. In that case, the GPU compositor has the full image of every top-level window cached as a texture. All it needs to do is render these to the screen as textured quads. This is fast and, in simple terms, it can be done in sync with the vertical scanout of the image to the screen to avoid tearing. Because the window contents are cached, applications don't need to redraw their contents when the z order changes (goodbye, damage events!), and as a side effect, moving and repositioning top-level windows is smooth.

> 4) Why does ImGui's code get so complicated behind the scenes? And what advantage does this present to a programmer who wishes to use the API?
>

One word: batching. I'll briefly describe the Vulkan rendering process of ImGui, as far as I remember it off the top of my head: it creates a single big vertex buffer for all draw operations with a fairly uniform vertex layout, regardless of the primitive involved. All drawing state that doesn't need pipeline changes goes into the vertex buffer (world space coords, UV coords, vertex color...). It also retains a memory of the pipeline state required to draw the current set of primitives. All high level primitives are broken down into triangles, even lines and Bézier curves. This trick reduces the number of draw calls later.

The renderer retains a list of spans in the vertex buffer and their associated pipeline state. Whenever the higher level drawing code does something that requires a state change, the current span is terminated and a new one for the new pipeline state is started. As far as I remember, the code only has two pipelines: one for solid, untextured primitives, and one for textured primitives that is used for text rendering.

In this model, the higher level rendering code can just emit draw calls for individual primitives, but these are only recorded and not executed immediately. In a second pass, the vertex buffer is uploaded in a single transfer and the list of vertex buffer spans is processed, switching pipelines, setting descriptors and emitting the draw call for the relevant vertex buffer range for each span in order.

The main reason why this works is a fundamental ordering guarantee given by the Vulkan API: primitives listed in a vertex buffer must be rendered in such a way that the result is as if the primitives were processed in the order given in the buffer. For example, when primitives overlap, the last one in the buffer is the one that covers the overlap region in the resulting image.
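
To make that concrete, here is a rough sketch of the recording pattern I described. This is not ImGui's actual code; the Vertex/Span/DrawList types and the backend methods are all made up:

    #include <cstdint>
    #include <vector>

    // One vertex layout for everything.
    struct Vertex {
        float    x, y;   // position
        float    u, v;   // texture coordinates (unused for solid fills)
        uint32_t rgba;   // per-vertex color
    };

    // The only state that forces a new span: which pipeline to bind.
    enum class Pipeline { SolidColor, Textured };

    // A contiguous range of the shared vertex buffer drawn with one pipeline.
    struct Span {
        Pipeline    pipeline;
        std::size_t firstVertex;
        std::size_t vertexCount;
    };

    class DrawList {
    public:
        // Record a triangle; start a new span only if the pipeline changes.
        void addTriangle(Pipeline p, Vertex a, Vertex b, Vertex c)
        {
            if (spans.empty() || spans.back().pipeline != p)
                spans.push_back({p, vertices.size(), 0});
            vertices.push_back(a);
            vertices.push_back(b);
            vertices.push_back(c);
            spans.back().vertexCount += 3;
        }

        // Second pass: one buffer upload, then one draw call per span, in order.
        // The backend functions stand in for the real Vulkan calls.
        template <typename Backend>
        void flush(Backend& gpu)
        {
            gpu.uploadVertexBuffer(vertices.data(), vertices.size());
            for (const Span& s : spans) {
                gpu.bindPipeline(s.pipeline);
                gpu.draw(s.firstVertex, s.vertexCount);
            }
            vertices.clear();
            spans.clear();
        }

    private:
        std::vector<Vertex> vertices;
        std::vector<Span>   spans;
    };

Usage is then just addTriangle(...) during UI traversal and one flush(backend) per frame.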

> 5) Using a single untextured quad and a pixel shader, how would you rasterise a curve?
>

I'll let Jim Blinn answer that one for you:

https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch25.html

I'd seriously mess up the math if I were to try to explain in detail. Bezier curves aren't my strong suit. I'm solving rendering equations for a living.

> (I've written UI libraries and 3D scene graphs in my career as a console engine programmer, so you're going to want to be *very* thorough if you attempt to answer all these.)
>
> On Friday, 29 November 2019 at 08:45:30 UTC, Gregor Mückl wrote:
>> GPUs are vector processors, typically 16 wide SIMD. The shaders and compute kernels for them are written from a single-"threaded" perspective, but this is converted to SIMD with one "thread" really being a single value in the 16 wide register. This has all kinds of implications for things like branching and memory accesses. This forum is not the place to go into them.
>
> No, please, continue. Let's see exactly how poorly you understand this.
>

Where is this wrong? Have you looked at CUDA or compute shaders? I'm honestly willing to listen and learn.

I've talked about GPUs in these terms with other experts (Intel and nVidia R&D guys, among others) and this is a common model for how GPUs work. So I'm frankly puzzled by your response.
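
To make the model concrete, here is roughly what I mean, simulated on the CPU. This is an illustration of the execution model (lanes, masked branches), not vendor-accurate hardware behaviour:

    #include <array>
    #include <cstdio>

    constexpr int kLanes = 16; // one "thread" of the shader per lane

    // Both sides of a divergent branch are executed by the whole unit;
    // a per-lane mask decides which lanes actually keep the result.
    void divergentKernel(std::array<float, kLanes>& data)
    {
        std::array<bool, kLanes> mask;
        for (int lane = 0; lane < kLanes; ++lane)
            mask[lane] = data[lane] < 0.5f;   // the branch condition, per lane

        // "if" side: runs for the whole unit, written only where mask is set
        for (int lane = 0; lane < kLanes; ++lane)
            if (mask[lane]) data[lane] = data[lane] * 2.0f;

        // "else" side: also runs, written only where mask is clear
        for (int lane = 0; lane < kLanes; ++lane)
            if (!mask[lane]) data[lane] = data[lane] - 0.5f;
    }

    int main()
    {
        std::array<float, kLanes> data{};
        for (int lane = 0; lane < kLanes; ++lane)
            data[lane] = lane / float(kLanes);
        divergentKernel(data);
        for (float v : data) std::printf("%.3f ", v);
        std::printf("\n");
    }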

> On Friday, 29 November 2019 at 09:00:20 UTC, Gregor Mückl wrote:
>> All of these things can be done on GPUs (most of it has been), but I highly doubt that it would be that much faster. You need lots of different shaders for these primitives, and switching state while rendering is expensive.
>
> When did you last use a GPU API? 1999?
>

Last weekend, in fact. I'm bootstrapping a Vulkan/RTX raytracer as pet project. I want to update an OpenGL based real time room acoustics rendering method that I published a while ago.

> Top-end gaming engines can output near-photorealistic complex scenes at 60FPS. How many state changes do you think they perform in any given scene?
>

As few as possible. They *do* take time, although they have become cheaper. Batching by shader is still a thing. Don't take my word for it. See the "Pipelines" section here:

https://devblogs.nvidia.com/vulkan-dos-donts/

And that's with an API that puts pipeline state creation up front!
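
In practice this usually boils down to something like the following: give each draw a sort key with the pipeline in the most significant bits, sort once per frame, and the expensive switches collapse to one per pipeline. A sketch, not any particular engine's scheme; the IDs are assumed to fit their bit fields and the backend calls are placeholders:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Draw {
        uint32_t pipelineId; // most expensive thing to switch
        uint32_t materialId; // cheaper (descriptor sets etc.)
        uint32_t meshId;
    };

    // Pack the most expensive state into the most significant bits so that
    // sorting groups draws that share a pipeline next to each other.
    uint64_t sortKey(const Draw& d)
    {
        return (uint64_t(d.pipelineId) << 40) |
               (uint64_t(d.materialId) << 20) |
                uint64_t(d.meshId);
    }

    void submitFrame(std::vector<Draw>& draws)
    {
        std::sort(draws.begin(), draws.end(),
                  [](const Draw& a, const Draw& b) { return sortKey(a) < sortKey(b); });

        uint32_t boundPipeline = ~0u;
        for (const Draw& d : draws) {
            if (d.pipelineId != boundPipeline) {
                // bindPipeline(d.pipelineId);  // the expensive switch, as rare as possible
                boundPipeline = d.pipelineId;
            }
            // bindMaterial(d.materialId); drawMesh(d.meshId);  // hypothetical backend calls
        }
    }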

I don't have hard numbers for state changes and draw calls in recent games, unfortunately. The only number I remember is roughly 2000 draw calls for a frame in Ashes of the Singularity. While that game shows masses of units, I don't find the graphics particularly impressive. There's next to no animation on the units. The glitz is mostly decals and particle effects. There's also not a lot of screen space post processing going on. So I don't consider that to be representative.

> It's all dependent on API, driver, and even operating system. The WDDM introduced in Vista made breaking changes with XP, splitting a whole ton of the stuff that would traditionally be costly with a state change out of kernel space code and into user space code. Modern APIs like DirectX 12, Vulkan, Metal etc. go one step further and move that responsibility from the driver into user code.
>

OK, this is some interesting information. I've never had to care about where user/kernel mode transitions happen in the driver stack. I guess I've been lucky that I've been able to file that under generic driver overhead so far.

Phew, this has become a long reply, and it has taken me a lot of time to write it. I hope it shows that I generally know what I'm writing about. I could point to my history as some additional proof, but I'd rather let this response stand for what it is.

November 29, 2019
On Friday, 29 November 2019 at 13:27:17 UTC, Gregor Mückl wrote:
>>> GPUs are vector processors, typically 16 wide SIMD. The shaders and compute kernels for them are written from a

[…]

> Where is this wrong? Have you looked at CUDA or compute shaders? I'm honestly willing to listen and learn.

Out of curiosity, what is being discussed? The abstract machine, the concrete microcode, or the concrete VLSI pipeline (electric pathways)?

If the latter, then I guess it all depends? But I believe a trick to save real estate is to have a wide ALU that is partitioned into various word widths with gates preventing "carry". I would expect there to be a mix (i.e. I would expect 1/x to be implemented in a less efficient, but less costly manner).

However, my understanding is that VLIW caused too many bubbles in the pipeline for compute shaders and that they moved to a more RISC-like architecture where things like branching became less costly. However, these are just generic statements found in various online texts, so how that is made concrete in terms of VLSI design, well... that is less obvious. Though it seems reasonable that they would pick a microcode representation that was more granular (flexible).

> Last weekend, in fact. I'm bootstrapping a Vulkan/RTX raytracer as pet project. I want to update an OpenGL based real time room acoustics rendering method that I published a while ago.

Cool!  :-D Maybe you do some version of overlap-add convolution in the frequency domain, or is it in the time domain?  Reading up on Laplace transforms right now...

I remember when the IRCAM workstation was state-of-the-art, a funky NeXT cube with lots of DSPs. Things have come a long way in that realm since the 90s, at least on the hardware side.


November 29, 2019
On Friday, 29 November 2019 at 10:12:32 UTC, Ethan wrote:
> On Friday, 29 November 2019 at 05:16:08 UTC, Jab wrote:
>> IIRC GPUs are limited in what they can do in parallel, so if you only need to do one thing for a specific job the rest of the GPU isn't really being fully utilized.
>
> Yeah, that's not how GPUs work. They have a number of shader units that execute on outputs in parallel. It used to be an explicit split between vertex and pixel pipelines in the early days, where it was very easy to underutilise the vertex pipeline. But shader units have been unified for a long time. Queue up a bunch of outputs and the driver and hardware will schedule it properly.

I wasn't really talking about different processes like that. You can't run thousands of different kernels in parallel. In graphics terms, it'd be like queuing up a command to draw a single pixel: while that single pixel is drawing, the rest of the hardware allocated to that kernel sits idle and can't be utilized by another kernel.

November 29, 2019
On Fri, Nov 29, 2019 at 11:19:58PM +1300, rikki cattermole via Digitalmars-d wrote:
> On 29/11/2019 10:55 PM, Basile B. wrote:
> > 
> > Back to the original topic. What people don't realize is that a 100% D GUI would be a more complex project than the D compiler itself. Just the text features is a huge thing in itself: unicode, BDI.
> 
> Assuming you mean BIDI, yes. Text layout is a real pain. Mostly because it needs an expert in Unicode to do right. But it's all pretty well defined, with tests described. So it shouldn't be considered out of scope. Font rasterization, on the other hand... ugh

Text layout is non-trivial even with the Unicode specs, because the Unicode specs leave a lot of things to the implementation, mostly because they're out of scope (like the actual length of a piece of text in pixels, because that depends on rasterization, kerning, and font-dependent complexities -- some scripts like Arabic require what's essentially running a DSL in order to get the variant glyphs right -- not to mention optional behaviours that the Unicode spec explicitly says are up to user preferences). You can't really do layout as something apart from rasterization, because things like where/how to wrap a line of text will change depending on how the glyphs are rasterized.

Layout, in my mind, also involves things like algorithms that avoid running streams of whitespace in paragraphs, i.e., LaTeX-style O(n^3) line-breaking algorithms. This is not specified by the Unicode line-breaking algorithm, BTW, which is a misnomer: it doesn't break lines, it only finds line-breaking *opportunities*, and leaves it up to the application to decide where to actually break lines. Part of the reason is that Unicode doesn't deal with font details. It's up to the text rendition code to handle that properly.
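
For illustration, the "actually break lines" part can be as simple as a greedy first-fit pass over those opportunities, assuming some other component has already shaped the text and measured the pixel width of each segment. A toy sketch (made-up names; real implementations also weigh hyphenation, justification, rivers, etc.):

    #include <cstddef>
    #include <vector>

    // One segment between two consecutive break opportunities, after shaping.
    // UAX #14 only tells you where you *may* break; the width comes from the
    // font/rasterization side, which is exactly why layout can't ignore it.
    struct Segment {
        double width;   // advance width in pixels, measured after shaping
    };

    // Greedy first-fit breaking: put segments on a line until the next one
    // would overflow, then start a new line. (TeX-style global optimization
    // would instead minimize badness over the whole paragraph.)
    std::vector<std::size_t> breakLines(const std::vector<Segment>& segments,
                                        double lineWidth)
    {
        std::vector<std::size_t> breaks; // index of the first segment of each new line
        double used = 0.0;
        for (std::size_t i = 0; i < segments.size(); ++i) {
            if (used > 0.0 && used + segments[i].width > lineWidth) {
                breaks.push_back(i);
                used = 0.0;
            }
            used += segments[i].width;
        }
        return breaks;
    }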


T

-- 
It only takes one twig to burn down a forest.

November 29, 2019
On Friday, 29 November 2019 at 15:29:20 UTC, Ola Fosheim Grøstad wrote:
> On Friday, 29 November 2019 at 13:27:17 UTC, Gregor Mückl wrote:
>>>> GPUs are vector processors, typically 16 wide SIMD. The shaders and compute kernels for them are written from a
>
> […]
>
>> Where is this wrong? Have you looked at CUDA or compute shaders? I'm honestly willing to listen and learn.
>
> Out of curiosity, what is being discussed? The abstract machine, the concrete micro code, or the concrete VLSI pipeline (electric pathways)?
>

What I wrote is a very abstract view of GPUs that is useful for programming. I may not have done a good job of summarizing it, now that I read that paragraph again. This is a fairly recent presentation that gives a gentle introduction to that model:

https://aras-p.info/texts/files/2018Academy%20-%20GPU.pdf

This presentation is of course a simplification of what is going on in a GPU, but it gets the core idea across. AMD and nVidia do have a lot of documentation that goes into some more detail, but at some point you're going to hit a wall. A lot of low level details are hidden behind NDAs and that's quite frustrating.

> If the latter, then I guess it all depends? But I believe a trick to save real estate is to have a wide ALU that is partitioned into various word widths with gates preventing "carry". I would expect there to be a mix (i.e. I would expect 1/x to be implemented in a less efficient, but less costly manner).
>
> However, my understanding is that VLIW caused too many bubbles in the pipeline for compute shaders and that they moved to a more RISC-like architecture where things like branching became less costly. However, these are just generic statements found in various online texts, so how that is made concrete in terms of VLSI design, well... that is less obvious. Though it seems reasonable that they would pick a microcode representation that was more granular (flexible).
>

I don't have good information on that. A lot of the details of the actual ALU designs are kept under wraps. But when you want to cram a few hundred cores, each doing 16-wide floating point SIMD processing, onto a single die, simpler is better. And throughput trumps latency for graphics.

>> Last weekend, in fact. I'm bootstrapping a Vulkan/RTX raytracer as pet project. I want to update an OpenGL based real time room acoustics rendering method that I published a while ago.
>
> Cool!  :-D Maybe you do some version of overlap-add convolution in the frequency domain, or is it in the time domain?  Reading up on Laplace transforms right now...
>

The convolutions for auralization are done in the frequency domain. Room impulse responses are quite long (up to several seconds), so time domain convolution is barely feasible even offline. The only feasible way is to use the convolution theorem: transform everything into frequency space, multiply it there, and transform things back... while encountering the pitfalls of FFT in a continuous signal context along the way. There are a lot of pitfalls. I'm doing all of the convolution on the CPU because the output buffer is read from main memory by the sound hardware. Audio buffer updates are not in lockstep with screen refreshes, so you can't reliably copy the next audio frame to the GPU, convolve it there and read it back in time, because the GPU is on its own schedule.
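
The overall structure looks roughly like this. It's a toy, single-threaded sketch with a naive recursive FFT just to show the overlap-add idea; a real implementation would use a proper FFT library and partition the impulse response itself so the latency is one block rather than the full IR length:

    #include <algorithm>
    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    using cd = std::complex<double>;

    // Minimal radix-2 Cooley-Tukey FFT (size must be a power of two).
    // invert = true gives the inverse transform; the 1/N scaling is applied by the caller.
    void fft(std::vector<cd>& a, bool invert)
    {
        const std::size_t n = a.size();
        if (n == 1) return;
        std::vector<cd> even(n / 2), odd(n / 2);
        for (std::size_t i = 0; i < n / 2; ++i) {
            even[i] = a[2 * i];
            odd[i]  = a[2 * i + 1];
        }
        fft(even, invert);
        fft(odd, invert);
        const double pi  = std::acos(-1.0);
        const double ang = 2.0 * pi / n * (invert ? -1.0 : 1.0);
        for (std::size_t k = 0; k < n / 2; ++k) {
            cd w = std::polar(1.0, ang * k) * odd[k];
            a[k]         = even[k] + w;
            a[k + n / 2] = even[k] - w;
        }
    }

    // Overlap-add: convolve a long signal with an impulse response block by block.
    // Each block is zero-padded so its linear convolution fits without wrap-around,
    // multiplied with the precomputed spectrum of the IR, transformed back, and the
    // tails of consecutive blocks are summed where they overlap.
    std::vector<double> overlapAdd(const std::vector<double>& signal,
                                   const std::vector<double>& ir,
                                   std::size_t blockSize)
    {
        std::size_t fftSize = 1;
        while (fftSize < blockSize + ir.size() - 1) fftSize *= 2;

        std::vector<cd> irSpectrum(fftSize);
        for (std::size_t i = 0; i < ir.size(); ++i) irSpectrum[i] = ir[i];
        fft(irSpectrum, false);

        std::vector<double> out(signal.size() + ir.size() - 1, 0.0);
        for (std::size_t start = 0; start < signal.size(); start += blockSize) {
            std::vector<cd> block(fftSize);
            const std::size_t len = std::min(blockSize, signal.size() - start);
            for (std::size_t i = 0; i < len; ++i) block[i] = signal[start + i];

            fft(block, false);
            for (std::size_t i = 0; i < fftSize; ++i) block[i] *= irSpectrum[i];
            fft(block, true);

            for (std::size_t i = 0; i < fftSize && start + i < out.size(); ++i)
                out[start + i] += block[i].real() / fftSize;  // inverse FFT scaling
        }
        return out;
    }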

The OpenGL part of my method is for actually propagating sound through the scene and computing the impulse response from that. That is typically so expensive that it's also run asynchronously to the audio processing and mixing. Only the final impulse response is moved to the audio processing thread. Perceptually, it seems that you can get away with a fairly low update rate for the reverb in many cases.

> I remember when the IRCAM workstation was state-of-the-art, a funky NeXT cube with lots of DSPs. Things have come a long way in that realm since the 90s, at least on the hardware side.

Yes, they have! I suspect that GPUs could make damn fine DSPs with their massive throughput. But they aren't linked well to audio hardware in Intel PCs. And those pesky graphics programmers want every ounce of GPU performance all to themselves and never share! ;)