
GPU General Thread

Will you be getting a 2080 or are you skipping it?

  • Skipping it

    Votes: 9 100.0%
  • Buy buy buy

    Votes: 0 0.0%

  • Total voters
    9

Vyor

Well-known member
NVIDIA has recently unveiled their next GPU architecture.


It is called Turing.

Turing is an evolution of Volta, essentially being Volta but with Raytracing Cores, or Engines. It features higher clockspeeds and a smaller die but also fewer cores. Where the Volta GPU has 5,120 CUDA cores, the top end Turing chip, which starts at $6,300, only has 4,608 CUDA cores. It is expected that these cores will be clocked at around 1,730MHz based on the computational performance they have: the same as the Volta GPUs.
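That clock estimate is just FLOPS arithmetic, by the way. Quick sketch, assuming the top Turing chip really does match GV100's roughly 15.7 TFLOPS of FP32 (that's the whole assumption here; numbers are approximate):

```python
# Back out the implied Turing boost clock from Volta-level FP32 throughput.
# Assumption: full top-end Turing delivers roughly the same peak FP32 as GV100 (~15.7 TFLOPS).
volta_tflops = 15.7           # GV100 peak FP32, approximate
turing_cuda_cores = 4608      # full top-end Turing chip
flops_per_core_per_clock = 2  # one FMA counts as 2 FLOPs

clock_ghz = volta_tflops * 1e12 / (turing_cuda_cores * flops_per_core_per_clock) / 1e9
print(f"Implied boost clock: ~{clock_ghz * 1000:.0f} MHz")  # ~1700 MHz, in line with the ~1,730MHz estimate
```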

The die itself is also rather large, over 700mm^2. This means the top end chip will be very, very power hungry, especially with the clockspeed it is reaching. It's on TSMC's 12nm node, so it won't pull less power than Pascal at similar core counts and clockspeeds. Indeed, it should pull more thanks to the Tensor Cores and Raytracing Engines on the chips.

With this information, we can assume that the RTX 2080 will be based on the Quadro RTX 5000. If it is, it will have 3072 CUDA cores and 384 Tensor cores. This is fewer cores than the 1080ti, but it will likely clock higher if NVIDIA stops caring about power consumption as much (expect a return to Kepler and Maxwell levels of power consumption). It will perform much better in Vulkan and DX12 titles, but be otherwise about on par with the 1080ti unless Raytracing is involved.

Now, Raytracing will not be a major feature in games. NVIDIA is currently using it like they have HairWorks and PhysX: to make their older cards appear worse than they are and to cripple AMD performance. It won't be a major part of any game released in the next 2 or 3 years; the number of people that can use it will be way too small for it to be worth it. Raytracing will also hurt Turing game performance of course, along with increasing power consumption massively, but it will only hurt it to the tune of 30% while absolutely crippling everything else.


So, all of that said, anyone excited for the new GPUs?
 
Ok, now I'm pissed off. Exceedingly so.

1,000 fucking dollars for the 2080ti, a GPU that, outside of very specific applications, will perform perhaps 20% better than the 1080ti, which launched at 700 fucking dollars.

And the founders edition 2080ti? 1,200 FUCKING DOLLARS.

AND THE RAYTRACING LOOKS LIKE SHIT

AAAAAAAAGH
 
Yeah that is super-expensive. Stupid stupid stupid.
 
Yeah that is super-expensive. Stupid stupid stupid.

And when raytracing is used it can't get 60fps.

At 1080p.
https://www.pcgamesn.com/nvidia-rtx-2080-ti-hands-on
But still, it's tough not to be a little concerned when the ultra-expensive, ultra-enthusiast RTX 2080 Ti isn't able to hit 60fps at 1080p in Shadow of the Tomb Raider. We weren't able to see what settings the game was running at as the options screens were cut down in the build we were capturing, but GeForce Experience was capturing at the game resolution and the RTX footage we have is 1080p.

With the FPS counter on in GFE we could see the game batting between 33fps and 48fps as standard throughout our playthrough and that highlights just how intensive real-time ray tracing can be on the new GeForce hardware. We were playing on an external, day-time level, though with lots of moving parts and lots of intersecting shadows.

This is a disaster.
 
Welp, their whitepaper has dropped.
https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf

Here are my thoughts:

I am not an expert, but I do know graphics tech and API tech so it's more than a layman's understanding. If you see anything I am mistaken on, please call it out. That said, I shall start with their first real point.

New Streaming Multiprocessor (SM)
Turing introduces a new processor architecture, the Turing SM, that delivers a dramatic boost in shading efficiency, achieving 50% improvement in delivered performance per CUDA Core compared to the Pascal generation. These improvements are enabled by two key architectural changes. First, the Turing SM adds a new independent integer datapath that can execute instructions concurrently with the floating-point math datapath. In previous generations, executing these instructions would have blocked floating-point instructions from issuing. Second, the SM memory path has been redesigned to unify shared memory, texture caching, and memory load caching into one unit. This translates to 2x more bandwidth and more than 2x more capacity available for L1 cache for common workloads.

This is bullshit. Not only are they changing the definition of a CUDA Core for this paragraph (here they're really referring to the SM, which is more like a traditional core, rather than the ALUs and FPUs they usually mean), but splitting the INT and FP units will only theoretically give a 50% improvement. That can only be the case when both are used to 100% capacity at all times... which isn't going to happen. Even Unreal Engine only has 30% of the workload being INT shaders at most (and even there it's an ancillary workload, usually done before or after the FP workload is finished simply because that's when it needs to be done). The L1 changes are only there to enable more fine-grained control over the shaders (the execution units) themselves and to allow them to get the data they need to do work at all; it does not change the theoretical limits.

In practice this change might net maybe 5-10% more performance per SM in the best case. Good, but not revolutionary, and not on par with the Kepler-Maxwell leap nor the Tesla-Fermi leap.

Then they talk about tensors and DLSS for a bit, meh. Then raytracing, we've heard all this before so I'm skipping it.

Mesh Shading
Mesh shading advances NVIDIA's geometry processing architecture by offering a new shader model for the vertex, tessellation, and geometry shading stages of the graphics pipeline, supporting more flexible and efficient approaches for computation of geometry. This more flexible model makes it possible, for example, to support an order of magnitude more objects per scene, by moving the key performance bottleneck of object list processing off of the CPU and into highly parallel GPU mesh shading programs. Mesh shading also enables new algorithms for advanced geometric synthesis and object LOD management.

This tells us literally nothing, but it looks like something akin to Primitive Shaders on Vega. It helps Vega performance because AMD is extremely Geometry Engine limited... nvidia really, really, really isn't. They have so many on their GPUs that they can't really feed them, even on the P100 (this is why they can tank tessellation, which hammers the geometry pipeline, so well). LOD changes might help performance some, it might make CPU bottlenecks a little better but... overall another "meh" from me.

Variable Rate Shading (VRS)
VRS allows developers to control shading rate dynamically, shading as little as once per sixteen pixels or as often as eight times per pixel. The application specifies shading rate using a combination of a shading-rate surface and a per-primitive (triangle) value. VRS is a very powerful tool that allows developers to shade more efficiently, reducing work in regions of the screen where full resolution shading would not give any visible image quality benefit, and therefore improving frame rate. Several classes of VRS-based algorithms have already been identified, which can vary shading work based on content level of detail (Content Adaptive Shading), rate of content motion (Motion Adaptive Shading), and for VR applications, lens resolution and eye position (Foveated Rendering).

Decent for VR when eye tracking is available, shit everywhere else. Utter and complete shit. Next.

Texture-Space Shading
With texture-space shading, objects are shaded in a private coordinate space (a texture space) that is saved to memory, and pixel shaders sample from that space rather than evaluating results directly. With the ability to cache shading results in memory and reuse/resample them, developers can eliminate duplicate shading work or use different sampling approaches that improve quality.

And now we see the real reason those INT cores are there. This could help some, those INT cores are what's doing everything here, but it won't help much. Pixel shaders aren't too performance intensive for the most part. This might net another 5% on average, maybe 15% in the absolute best case.

Then they talk a bit about more AI bullshit, some memory stuff, and nvlink. Pointless for 80% of people, so I'll skip it here.

The Turing TU102 GPU is the highest performing GPU of the Turing GPU line and the focus of this section. The TU104 and TU106 GPUs utilize the same basic architecture as TU102, scaled down to different degrees for different usage models and market segments. Details of TU104 and TU106 chip architectures and target usages/markets are provided in Appendix A, Turing TU104 GPU and Appendix B, Turing TU106 GPU.

Now that is interesting, no mention of TU108 or TU100. Keep that in mind for the future, they might rebrand Pascal for the low end or release a variant of Volta.

TURING TU102 GPU
The TU102 GPU includes six Graphics Processing Clusters (GPCs), 36 Texture Processing Clusters (TPCs), and 72 Streaming Multiprocessors (SMs). (See Figure 2 for an illustration of the TU102 full GPU with 72 SM units.) Each GPC includes a dedicated raster engine and six TPCs, with each TPC including two SMs. Each SM contains 64 CUDA Cores, eight Tensor Cores, a 256 KB register file, four texture units, and 96 KB of L1/shared memory which can be configured for various capacities depending on the compute or graphics workloads. Ray tracing acceleration is performed by a new RT Core processing engine within each SM (RT Core and ray tracing features are discussed in more depth in Turing Ray Tracing Technology starting on page 26). The full implementation of the TU102 GPU includes the following:

  • 4,608 CUDA Cores
  • 72 RT Cores
  • 576 Tensor Cores
  • 288 texture units
  • 12 32-bit GDDR6 memory controllers (384-bits total)

Tied to each memory controller are eight ROP units and 512 KB of L2 cache. The full TU102 GPU consists of 96 ROP units and 6144 KB of L2 cache. See the Turing TU102 GPU in Figure 3. Table 1 compares the GPU features of the Pascal GP102 to the Turing TU102.

Ok, so Turing is rather close to Maxwell in SM layout in terms of FP32. Nothing else is at all similar bar ROPs.
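If you want to sanity-check the configuration math in that quote, it all multiplies out cleanly from the per-SM and per-controller counts they give (quick sketch, nothing here beyond the numbers above):

```python
# Sanity check of the TU102 configuration quoted above.
gpcs = 6
tpcs_per_gpc = 6
sms_per_tpc = 2
sms = gpcs * tpcs_per_gpc * sms_per_tpc            # 72 SMs

print("CUDA cores:   ", sms * 64)                  # 4,608
print("RT cores:     ", sms * 1)                   # 72 (one RT core per SM)
print("Tensor cores: ", sms * 8)                   # 576
print("Texture units:", sms * 4)                   # 288

mem_controllers = 12                               # 12 x 32-bit GDDR6 = 384-bit
print("ROPs:         ", mem_controllers * 8)       # 96
print("L2 cache (KB):", mem_controllers * 512)     # 6144
```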

And the table shows something very interesting. Pascal and Turing have the same GPC count, which means they have the same number of Raster Engines and Geometry Engines and have more SMs per Engine as compared to Pascal. This is why they added the new shader model: they wanted to lower power consumption, and to do so they kept the Raster and Geometry Engine count the same. That means there is the possibility of being geometry bottlenecked in some scenes like, say, when Hairworks is used. The issue is that not all games will use this new shading model; it's not nearly as easy to implement as DLSS after all. This could cause some issues for them in the future.

Still better off than fucking AMD though, Nvidia has way more engines and those engines are better than AMD's (this is why GCN doesn't scale, on top of the utter shit that is its scheduling system). Nvidia is still miles ahead in this department and I don't see that changing until 2020 at the earliest unless Navi clocks a whole hell of a lot higher than I think it will.

And same ROP count as Pascal, because of course it is. Pascal had too many already anyway (bar the low end, which doesn't have enough).

Clockspeed is taking a notable hit (about 40MHz on the boost); seems they screwed up the power curve somewhere along the line, as the FE card is pulling a rather impressive 260W, 10W over "reference" despite only a 90MHz clock bump. This thing is going to guzzle power when OC'd. Base clock takes an even larger hit, 1350MHz as compared to the 1080ti's 1480MHz. Not a good sign.

More NvLink talk; be warned that using this will either increase power draw or cut clocks. The bus is wide and it runs at a high clockspeed, and there's a reason the Pascal NvLink cards had a much higher power draw than their non-link cousins despite only around a 100MHz increase in clockspeed.

Now for the really important bit, the deep dive into the SM itself. Let's see how honest they are!

TURING STREAMING MULTIPROCESSOR (SM) ARCHITECTURE
The Turing architecture features a new SM design that incorporates many of the features introduced in our Volta GV100 SM architecture. Two SMs are included per TPC, and each SM has a total of 64 FP32 Cores and 64 INT32 Cores. In comparison, the Pascal GP10x GPUs have one SM per TPC and 128 FP32 Cores per SM. The Turing SM supports concurrent execution of FP32 and INT32 operations (more details below), independent thread scheduling similar to the Volta GV100 GPU.

Two SMs per TPC is... not good. The rest is fine, especially that last line, which makes Async Compute really nice for Turing, their first architecture that really benefits from it (and it's not because it just has more cores than Pascal so it needs this, no, Pascal was shit at async).

Turing implements a major revamping of the core execution datapaths. Modern shader workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler instructions such as integer adds for addressing and fetching data, floating point compare or min/max for processing results, etc. In previous shader architectures, the floating-point math datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second parallel execution unit next to every CUDA core that executes these instructions in parallel with floating point math. Figure 5 shows that the mix of integer pipe versus floating point instructions varies, but across several modern applications, we typically see about 36 additional integer pipe instructions for every 100 floating point instructions. Moving these instructions to a separate pipe translates to an effective 36% additional throughput possible for floating point.

They are being honest for once! Sorta. Again, most of those INT instructions need the FP cores to have done something already (good fucking luck running an FP compare before the FP work it depends on has finished), so you know. That 36% number is impossible in practice, but it's more honest than the 50% figure listed above.

Their own slide shows this as well, with only their own games showing anywhere near 50% (2 games show exactly that, both RTX optimized) and Witcher showing a measly 17% or so. Cherry picking at its finest, I guess.
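Their 36% figure is literally just this arithmetic, by the way, and it's a ceiling that assumes every one of those INT instructions overlaps perfectly with FP work:

```python
# Best-case throughput gain from moving INT instructions onto a separate pipe.
# Assumes every INT instruction overlaps perfectly with FP work (it won't).
def max_speedup(int_per_100_fp):
    old_issue_slots = 100 + int_per_100_fp       # everything fights over one datapath
    new_issue_slots = max(100, int_per_100_fp)   # INT hides behind the FP stream
    return old_issue_slots / new_issue_slots

print(max_speedup(36))  # 1.36 -> the "36% additional throughput" claim
print(max_speedup(17))  # 1.17 -> roughly what their Witcher data point works out to
```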

Figure 6 shows how the new combined L1 data cache and shared memory subsystem of the Turing SM significantly improves performance while also simplifying programming and reducing the tuning required to attain at or near-peak application performance. Combining the L1 data cache with the shared memory reduces latency and provides higher bandwidth than the L1 cache implementation used previously in Pascal GPUs. Overall, the changes in SM enable Turing to achieve 50% improvement in delivered performance per CUDA core. Figure 7 shows the results across a set of shader workloads from current gaming applications.

Well that was short lived. Of course, they also never define what "50% more performance" actually means. Even their own marketing numbers don't show it after all.
[Image: rtx2080-performance.jpg]

Note that it only breaches a 50% performance uplift when either HDR or a next gen API is used. HDR hurts Pascal by 10% (it isn't native on Pascal, it is native on Turing), so you can safely discount those games from having a 50% shading boost. And that's against the 1080 in raw performance! The 2080 has 16% more cores! So even in those games that are left it still doesn't translate to a 50% bump in shading performance per core. And it's comparing a factory overclocked 2080 FE with 2 fans vs a stock clocked and throttling 1080 FE card! So that's another 5-10% hit! On top of that the 1080 doesn't have the memory bandwidth to properly support 4k! It sees a further 5-10% performance dip against the 1080ti (as compared to 1440p numbers), so at more reasonable resolutions you have to discount that too (for some games anyway)!

Please nvidia, don't try and bullshit in your own fucking white paper. The people reading these normally know what they're talking about, and we don't have the memory of goldfish, thanks.
(and those DLSS numbers show that they were either running the game at a lower resolution and upscaling it to 4k or, more likely, were running with a high level of SSAA on the non-DLSS runs, that's why the performance jump is there).

Don't get me wrong, Turing will be faster than Pascal for the same part of the product stack... but that isn't 50% faster per SM, nvidia. It might technically be 50% better in shading... but that doesn't mean it won't run into other bottlenecks related to your stupid ROP and TPC decisions, let alone the same number of Raster Engines.


Oh well, moving on.

Let's see... tensors... more tensors... datacenter... tensors... video codec shit... Ah, memory system. Let's see if they're honest here.

Memory subsystem performance is crucial to application acceleration. Turing improves main memory, cache memory, and compression architectures to increase memory bandwidth and reduce access latency. Improved and enhanced GPU compute features help accelerate both games and many computationally intensive applications and algorithms. New display and video encode/decode features support higher resolution and HDR-capable displays, more advanced VR displays, increasing video streaming requirements in the datacenter, 8K video production, and other video-related applications. The following features are discussed in detail:

Proof that Turing has HDR as a native system, backing up my previous claim that Pascal didn't have it as such. Always good to have proper proof.

GDDR6 talk... meh... Ohhh, ROPs and L2!
Turing GPUs add larger and faster L2 caches in addition to the new GDDR6 memory subsystem. The TU102 GPU ships with 6 MB of L2 cache, double the 3 MB of L2 cache that was offered in the prior generation GP102 GPU used in the TITAN Xp. TU102 also provides significantly higher L2 cache bandwidth than GP102. Like prior generation NVIDIA GPUs, each ROP partition in Turing contains eight ROP units and each unit can process a single-color sample. A full TU102 chip contains 12 ROP partitions for a total of 96 ROPs.

... They said they'd go into detail. This isn't going into detail. L2 cache increase is nice, needed to feed the extra cores and RT shit. ROP count still shit, moving on.

Turing Memory Compression
NVIDIA GPUs utilize several lossless memory compression techniques to reduce memory bandwidth demands as data is written out to frame buffer memory. The GPU's compression engine has a variety of different algorithms which determine the most efficient way to compress the data based on its characteristics. This reduces the amount of data written out to memory and transferred from memory to the L2 cache and reduces the amount of data transferred between clients (such as the texture unit) and the frame buffer. Turing adds further improvements to Pascal's state-of-the-art memory compression algorithms, offering a further boost in effective bandwidth beyond the raw data transfer rate increases of GDDR6. As shown in Figure 12, the combination of raw bandwidth increases, and traffic reduction translates to a 50% increase in effective bandwidth on Turing compared to Pascal, which is critical to keep the architecture balanced and support the performance offered by the new Turing SM architecture.

... but why tho. You... you don't need that. Why did you waste R&D on this. You have GDDR6 now. And faster Cache. And more of it.

nvidia why u do this

Turing GPUs include an all-new display engine designed for the new wave of displays, supporting higher resolutions, faster refresh rates, and HDR. Turing supports DisplayPort 1.4a allowing 8K resolution at 60 Hz and includes VESA's Display Stream Compression (DSC) 1.2 technology, providing higher compression that is visually lossless. Table 3 shows the DisplayPort support in the Turing GPUs.

Ohhh, fancy!

Turing's new NVDEC decoder has also been updated to support decoding of HEVC YUV444 10/12b HDR at 30 fps, H.264 8K, and VP9 10/12b HDR. Turing improves encoding quality compared to prior generation Pascal GPUs and compared to software encoders. Figure 13 shows that on common Twitch and YouTube streaming settings, Turing's video encoder exceeds the quality of the x264 software-based encoder using the fast encode settings, with dramatically lower CPU utilization. 4K streaming is too heavy a workload for encoding on typical CPU setups, but Turing's encoder makes 4K streaming possible.

I'll believe that when I see it. If true though it's lovely.

Even if I think that simply putting an 8-core ARM SoC or dedicated RISC-V cores on there would do the same with less hassle and potentially similar power draw. Meh. Still, bold claims.

Also... they're using 60k bitrate for 4k streaming.

Twitch doesn't even support that. And somehow the CPU isn't maxed out in usage. 6k bitrate for 1080p is... a bit much, but it's reasonable enough if you have a decent CPU and net connection.

VirtualLink stuff... meh.

NvLink... more meh. Still needs dev support, might help with that, might not.

Raytracing! I don't care!

Fun fact: gigarays still means literally nothing! It's a completely meaningless metric (this according to people who actually write raytracing applications, one of which can actually run almost in real time on a high-core-count CPU).

Let's see... NGX, related to raytracing denoising. Meh.

DLSS!
Figure 22 shows a sampling of results on the UE4 Infiltrator demo. DLSS provides image quality that is similar to TAA, with much improved performance. The much faster raw rendering horsepower of RTX 2080 Ti, combined with the performance uplift from DLSS and Tensor Cores, enables RTX 2080 Ti to achieve 2x the performance of GTX 1080 Ti.

Oh hey, I was right.

Again.

Well mostly, it's a replacement for TAA, not SSAA. Still a meh thing to run at 4k, but not as meh. So... that's an improvement I guess. And I see that the performance comparison between the 2080 and 1080 holds true for the 2080ti and 1080ti as well, makes sense considering the similar (exact fucking same) core count increase.

The key to this result is the training process for DLSS, where it gets the opportunity to learn how to produce the desired output based on large numbers of super-high-quality examples. To train the network, we collect thousands of "ground truth" reference images rendered with the gold standard method for perfect image quality, 64x supersampling (64xSS). 64x supersampling means that instead of shading each pixel once, we shade at 64 different offsets within the pixel, and then combine the outputs, producing a resulting image with ideal detail and anti-aliasing quality. We also capture matching raw input images rendered normally. Next, we start training the DLSS network to match the 64xSS output frames, by going through each input, asking DLSS to produce an output, measuring the difference between its output and the 64xSS target, and adjusting the weights in the network based on the differences, through a process called back propagation. After many iterations, DLSS learns on its own to produce results that closely approximate the quality of 64xSS, while also learning to avoid the problems with blurring, disocclusion, and transparency that affect classical approaches like TAA.

In addition to the DLSS capability described above, which is the standard DLSS mode, we provide a second mode, called DLSS 2X. In this case, DLSS input is rendered at the final target resolution and then combined by a larger DLSS network to produce an output image that approaches the level of the 64x super sample rendering - a result that would be impossible to achieve in real time by any traditional means. Figure 23 shows DLSS 2X mode in operation, providing image quality very close to the reference 64x super-sampled image.

Wait no, I guess I was 100% right. It's a replacement for both TAA and SSAA. Why the fuck would you need any SSAA at 4k, let alone 64 fucking times SSAA. That's just... You can make an argument for TAA, a pretty good one even... but SSAA!? Haha, no.
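For what it's worth, the training setup they describe is bog-standard supervised learning against 64xSS targets. Stripped way, way down it looks something like this (my own toy sketch, not NVIDIA's actual network or data):

```python
import torch

# Toy sketch of the DLSS training loop described above: a network learns to map
# normally-rendered frames onto 64x supersampled "ground truth" frames via backprop.
# The model, loss, and data below are placeholders, not anything NVIDIA has published.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.L1Loss()  # "measuring the difference between its output and the 64xSS target"

# Stand-in data: pairs of (normally rendered frame, 64xSS reference frame).
training_pairs = [(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)) for _ in range(8)]

for raw_frame, frame_64xss in training_pairs:
    output = model(raw_frame)
    loss = loss_fn(output, frame_64xss)
    optimizer.zero_grad()
    loss.backward()      # "adjusting the weights in the network... through back propagation"
    optimizer.step()
```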

And I can't make out any difference in the TAA and DLSS example they provide in figure 24. Anyone have input on this one? My screen might not be large enough or high-resolution enough to see the difference (it's 1080p).


Then they talk about more AI... gee, I wonder why they made Turing and Volta?

AI super-resolution can run at around 30FPS when scaling from 1080p to 4k... and can't be used for games so I, once again, called it. Moving on.

MESH SHADING
The real world is a visually rich, geometrically complex place. Outdoor scenes in particular can be composed of hundreds of thousands of elements (rocks, trees etc.). CAD models present similar challenges. Today's graphics pipeline with vertex, tessellation, and geometry shaders is very effective at rendering the details of a single object, but still has limitations. Each object requires its own unique draw call from the CPU and the shader model is a per-thread model which limits the types of algorithms that can be used. Mesh Shading introduces a new, more flexible model that enables developers to eliminate CPU draw call bottlenecks and use more efficient algorithms for producing triangles. Visually rich images, like those shown in Figure 27, have too many unique complex objects to render in real time with today's graphics pipeline.

Back to more bullshit! Anyone that's played a modern game with a modern GPU knows you can render those scenes real time. Nvidia, why lie?

And no, no they aren't limited that way. You want to know how many threads a Pascal SM has (that's what they mean by per thread)? A few thousand. Ya. Volta has even more and they're all independent.

The draw call part is true though, but... also meaningless. It'll help slightly with CPU bottlenecks, not much more.

And guys! It supports mipmapping!!!!!! Isn't that exciting! Something in fucking minecraft.

Seems like nvidia looked at AMD's Primitive Shaders and went "that's a good idea! Let's use it!" Bah. Doesn't even help performance any, just makes some things a bit simpler to code for if you even bother to use the feature. Fucking pointless.

Variable Rate Shading...
Overall, with Turing's VRS technology, a scene can be shaded with a mixture of rates varying between once per visibility sample (super-sampling) and once per sixteen visibility samples. The developer can specify shading rate spatially (using a texture) and using a per-primitive shading rate attribute. As a result, a single triangle can be shaded using multiple rates, providing the developer with fine-grained control.

Translation: We make the game look worse so it runs better. Buy our shit.

Moving on...
In Content Adaptive Shading, shading rate is simply lowered by considering factors like spatial and temporal (across frames) color coherence. The desired shading rate for different parts of the next frame to be rendered are computed in a post-processing step at the end of the current frame. If the amount of detail in a particular region was relatively low (sky or a flat wall etc.), then the shading rate can be locally lowered in the next frame. The output of the post-process analysis is a texture specifying a shading rate per 16 x 16 tile, and this texture is used to drive shading rate in the next frame. A developer can implement content-based shading rate reduction without modifying their existing pipeline, and with only small changes to their shaders.

All of my hate. You buy a high refresh rate monitor, and a high res one at that, to fucking see the detail on fast moving objects, you fucks. Stop fucking cheating and making the game look fucking worse to make yourself look better! And for fucks sake, some "simple" objects need high res rendering just for the aesthetics.
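And the annoying part is how mechanically simple it is. A rough sketch of the per-tile analysis they describe, with thresholds and rate values entirely made up by me:

```python
import numpy as np

# Rough sketch of Content Adaptive Shading's post-process step: measure how much detail
# each 16x16 tile of the finished frame has, then pick a shading rate for that tile in
# the NEXT frame. The thresholds and rate encoding here are invented for illustration.
def shading_rate_texture(frame, tile=16):
    h, w, _ = frame.shape
    rates = np.empty((h // tile, w // tile), dtype=np.uint8)
    for ty in range(h // tile):
        for tx in range(w // tile):
            block = frame[ty * tile:(ty + 1) * tile, tx * tile:(tx + 1) * tile]
            detail = block.std()          # crude stand-in for "spatial color coherence"
            if detail < 2.0:              # flat wall, sky, etc.
                rates[ty, tx] = 4         # shade once per 4x4 pixels
            elif detail < 10.0:
                rates[ty, tx] = 2         # once per 2x2
            else:
                rates[ty, tx] = 1         # full rate
    return rates  # becomes the shading-rate surface driving the next frame
```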

The second application of Variable Rate Shading exploits object motion. Our eyes are designed to track moving objects linearly, so that we can see their details even when in motion. However, objects on LCD screens do not move smoothly or continuously. Rather, they jump from one location to the next with each 60 Hz frame update. From the perspective of our eye, which is trying to smoothly track the object, it looks like it is wiggling back and forth on the retina as its location moves ahead and behind of the path the eye is tracking. The net result is that we cannot see the full detail of the object, instead we see a somewhat lower resolution/blurred version. Figure 34 illustrates this scenario.

The third example application is Foveated Rendering. Foveated Rendering is based on the observation that the resolution that our eye can perceive depends on viewing angle. We have maximum visual resolution for objects in the center of our field of view, but much lower visual resolution for objects in the periphery. Therefore, if the viewer's eye position is known (via eye tracking in either a VR or non-VR system), this can be used to adjust shading rates appropriately. We can shade at lower rates in the periphery, and higher rates in the center of the field of view.

Well, they gave cases where it should be used, so I'll give 'em a pass on this one. Fair enough.

TSS... What the fuck do you actually do.
One example use case for TSS is improving the efficiency of VR rendering. Figure 35 shows an example use case for TSS in VR rendering. In VR, a stereo pair of images is rendered, with almost all of the elements visible in the left eye also showing up in the right eye view. With TSS, we can shade the full left-eye view, and then render the right eye view by sampling from the completed left-eye view. The right eye view only has to shade new texels in the case that no valid sample was found (for example a background object that was obscured from view from the left-eye perspective but is visible to the right eye).

Ohhh... that's what you do. Ya you ain't getting a simple explanation, let's just say it makes VR easier to do on the GPU and leave it at that. This shit is complicated and actually really fucking cool. Bravo nvidia, you've managed to put something in that actually helps performance by... quite a lot. Fuck if anyone knows by how much, nvidia might but they ain't talkin. I'd guess... 50-60% or more. It's really nice.

In VR only.

Multi-View Rendering (MVR) allows developers to efficiently draw a scene from multiple viewpoints or even draw multiple instances of a character in varying poses, all in a single pass. Turing hardware supports up to four views per pass, and up to 32 views are supported at the API level. By fetching and shading geometry only once, Turing optimally processes triangles and their associated vertex attributes while rendering multiple versions. When accessed via the D3D12 View Instancing API, the developer simply uses the variable SV_ViewID to index different transformation matrices, reference different blend weights, or control any shader behavior they like, that varies depending on which view they are processing.

Ya this is pointless for most people, doesn't increase performance much. Neat though; single pass rendering for something like that won't help performance (hurt it a lot really) but can be useful in more professional work cases like 3d modeling.

RESOURCE MANAGEMENT AND BINDING MODEL
DX12 introduced the ability to allow resource views to be directly accessed by shader programs without requiring an explicit resource binding step. Turing extends our resource support to include bindless Constant Buffer Views and Unordered Access Views, as defined in Tier 3 of DX12's Resource Binding Specification. Turing's more flexible memory model also allows for multiple different resource types (such as textures and vertex buffers) to be co-located within the same heap, simplifying aspects of memory management for the app. Turing supports Tier 2 of resource heaps.

It actually supports DX12 and Vulkan now.

Yay.
(this should have been in fucking pascal oh my god)

And the conclusion:
Graphics has just been reinvented. The new NVIDIA Turing GPU architecture is the most advanced and efficient GPU architecture ever built. Turing implements a new Hybrid Rendering model that combines real-time ray tracing, rasterization, AI, and simulation. Teamed with the next generation graphics APIs, Turing enables massive performance gains and incredibly realistic graphics for PC games and professional applications.

Technically true.

Ahh, and the appendix is giving some actually important info for fucking once.
Launching alongside the Turing TU102 GPU is the Turing TU104. The TU104 GPU incorporates all of the new Turing features found in TU102, including the RT Cores, Turing Tensor Cores, and the architectural changes made to the Turing SM. The full TU104 chip contains six GPCs, 48 SMs, and eight 32-bit memory controllers (256-bit total). In TU104, each GPC includes a raster unit and four TPCs. Each TPC contains a PolyMorph Engine and two SMs.

See those "polymorph engines"? Those are the geometry engines. Same amount in each one as Pascal as I recall. Useful to know and backs up my earlier statements that it might be geometry bottlenecked a small amount.

And ouch that 2080 TDP. Ya, they must have done something to murder the clockspeed curve. 20% increase in power draw for lower clockspeeds and only 16% more cores. Ouch. The factory OC is even worse, 25% higher power draw for a 4% higher boost clock.

The GeForce RTX 2070 is based on the full implementation of the TU106 GPU, which contains three GPCs, 36 SMs, and eight 32-bit memory controllers (256-bit total). In the TU106, each GPC includes a raster unit and six TPCs. Each TPC contains a PolyMorph Engine and two SMs. Figure 42 shows the Turing TU106 full-chip diagram.

Ok... So they want an increasing ratio of PolyMorph Engine to SMs down the stack. I... Do not know why. You should want the opposite at worst. Ok. Let's... move on from that head scratcher.

2070 has 20% more cores than the 1070, 4% lower clockspeed, and 16% higher power draw. Literally the only chip in the entire stack that comes out ahead here. Fuck if I know why.

If the TDP is from base clock it's still bad though, at a large 6% clockspeed drop. The FE edition is still bad too, a 23% increase in power draw for a... 1.6% increase in clockspeed. And it still has fewer shaders than the 1080 (a deficit of 11%, a clockspeed deficit of .7%, and a power TDP difference that has the 1080 with a 2.8% higher TDP... ouch).
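The percentages above are just spec-sheet division, for anyone wanting to check my math. Here's the 2070 vs 1070 case worked out with approximate public specs (so treat the exact decimals loosely):

```python
# Spec-sheet arithmetic behind the 2070 vs 1070 comparison (approximate public numbers).
rtx2070    = {"cores": 2304, "boost_mhz": 1620, "tdp_w": 175}  # reference card
rtx2070_fe = {"cores": 2304, "boost_mhz": 1710, "tdp_w": 185}  # Founders Edition
gtx1070    = {"cores": 1920, "boost_mhz": 1683, "tdp_w": 150}

def pct(new, old):
    return (new / old - 1) * 100

print(f"Cores:            +{pct(rtx2070['cores'], gtx1070['cores']):.0f}%")          # ~+20%
print(f"Boost clock:       {pct(rtx2070['boost_mhz'], gtx1070['boost_mhz']):.1f}%")  # ~-3.7%
print(f"TDP:              +{pct(rtx2070['tdp_w'], gtx1070['tdp_w']):.1f}%")          # ~+16.7%
print(f"FE TDP vs 1070:   +{pct(rtx2070_fe['tdp_w'], gtx1070['tdp_w']):.1f}%")       # ~+23.3%
print(f"FE boost vs 1070: +{pct(rtx2070_fe['boost_mhz'], gtx1070['boost_mhz']):.1f}%")  # ~+1.6%
```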

Pascal is, for the most part, more efficient than Turing it seems.


Aaaaand RTX-Ops are still fucking nonsense.
For example, RTX-OPS = TENSOR * 20% + FP32 * 80% + RTOPS * 40% + INT32 * 28%.

Figure 44 shows an illustration of the peak operations of each type for GTX 2080 Ti. Plugging in those peak operation counts results in a total RTX-OPs number of 78. For example, 14 * 80% + 14 * 28% + 100 * 40% + 114 * 20%.

What the fuck are you saying!?
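To be fair, the arithmetic does at least add up to their 78 number; it's the weights themselves that are pulled out of thin air with zero justification:

```python
# Plugging the whitepaper's own peak-operation numbers into their RTX-OPS formula.
# The weights (80%, 28%, 40%, 20%) are NVIDIA's; no justification is given for them.
fp32_tops, int32_tops, rt_ops, tensor_tops = 14, 14, 100, 114

rtx_ops = fp32_tops * 0.80 + int32_tops * 0.28 + rt_ops * 0.40 + tensor_tops * 0.20
print(rtx_ops)  # 77.92 -> the "78 RTX-OPs" figure they quote for the 2080 Ti
```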

And the rest is useless (for y'all) raytracing stuff, and the people that could use it already know it.


So that's it. I'm done here and with this damn whitepaper. I'm going to go murder a puppy or something. Have a good day.
 
The Windows update that brings DXR (the part of DirectX 12 that allows for ray-tracing) can now be downloaded, which allows Battlefield V to use ray-tracing. Previous demos have shown turning RTX on to tank framerate hard, with even a 2080Ti struggling to reach 60fps at 1080p. However, some people suggested that maybe the developers just weren't used to the API yet, or Nvidia's drivers just weren't up to snuff yet, or other things that might make it better when it finally got released. So, were any of those excuses true?

Ha ha haaa NOPE! The thumbnail says it all!



Holy shit, he even says that turning RTX on lowers the power draw because the RT cores are bottlenecking the normal CUDA cores that hard!

Seriously, NV, I know y'all were working on this for ten years, but I'd rather have waited for a while longer to get something that actually worked the way it was supposed to. I am very glad that my recent GPU upgrade was to a 1080Ti rather than any of the RTX cards, and I really hope that the next generation solves that problem, because goddamn.

I also hope that when AMD eventually gets around to implementing ray-tracing on their own GPUs, they learn from the mistakes NV made here. Yeesh.
 
Could I ask for some practical advice here?

I'm currently using a GTX 750ti.

What older GPUs are still decent? Is it worth getting a GTX 960 for about $120?

Why is the R9 290 supposedly still so good for a 2013 card? Looking at charts it still outperforms the GTX 1050ti. It's a monster for heat and power consumption though. I could get one for just under $100 but I'm worried it would blow my PSU. What's its more modern equivalent in AMD's 4xx lineup? (I mean, performance-wise, not price and segmentation).
 
Could I ask for some practical advice here?

I'm currently using a GTX 750ti.

What older GPUs are still decent? Is it worth getting a GTX 960 for about $120?

Why is the R9 290 supposedly still so good for a 2013 card? Looking at charts it still outperforms the GTX 1050ti. It's a monster for heat and power consumption though. I could get one for just under $100 but I'm worried it would blow my PSU. What's its more modern equivalent in AMD's 4xx lineup? (I mean, performance-wise, not price and segmentation).

I'd go with a 570 or some such
 
Mah budget is less than $200. Which leads me to ask the next question, is it fine to buy an RX 570 that was used for mining for about 3 months?
 
Mah budget is less than $200. Which leads me to ask the next question, is it fine to buy an RX 570 that was used for mining for about 3 months?

You can probably find a new 570 for less than $200.

Top one is an 8 gig version, bottom is 4 gig. Both are good. Both are faster than a 970.
 
You can probably find a new 570 for less than $200.

Top one is an 8 gig version, bottom is 4 gig. Both are good. Both are faster than a 970.
Huh. That is interesting. And the 8 GB version in newegg.ph is even cheaper than the 4GB version for some reason.

Thanks for the link!
 