APRIL 24, NEW YORK – The excitement was palpable as Will Willis, Senior PR Manager for ATI Technologies, Inc., opened the door into the suite at the W Hotel in Times Square where the company
was giving a preview presentation under the ostentatious title, "The Future of PC
Gaming – The Possibilities of DirectX 10." Dominating the view as the door opened was a simply mammoth flat-screen HDTV hanging on the wall, with the oh-so-familiar
Windows XP desktop on display. Palpable anticipation…
I turn to receive introductions and warm handshakes from Guennardi Rigeur, a Senior ISV Engineer, and Bob Drevin, who holds the fascinating title "ATI Fellow" and has the aura of an evangelist
– which I quickly discover is about right. Bob explains that he helps craft ATI's technology directions and is a representative for the company to Microsoft's DirectX architectural group.
Guennardi then tells me that he and other ISV Engineers often spend time with game developers on site at their studios, working on production code and helping them squeeze every last bit of
performance out of the company's GPUs. Which provides the ideal segue into the substance of this session.
We begin with a discussion of the limitations inherent in DirectX 9, specifically in Direct3D. Guennardi explains that the API as it currently stands is "batch limited." "You can only perform so
many operations within the allotted frame time, because of the amount of overhead inherent in state changes, texture unit access and so on, as well as the organization of the vertex and pixel
shaders." By way of explanation, he shows me a slide in which the vertex shader is heavily loaded while performing geometry operations but the pixel shader is virtually idle, and then the load is
inverted when framebuffer operations are being performed. These fundamental limits on access to computing resources, he explains, constrain what developers can do – forcing them
to continue to bring a "false reality" to life.
Guennardi is referring to the variety of mathematically inaccurate models for environmental effects and objects that are used as "good enough" approximations for the real thing, due to the
inability of current hardware to keep up with full simulations. Or, at least, that's what I thought.
Bob takes over, talking about ATI's objective of eliminating the DX 9 constraints and solving the small batch problem – reducing the overhead associated with operations such that more
operations can be performed in each batch, to the point where accurate mathematical models can power our simulations. The challenge, he explains, is balancing the diverse needs of vertex and pixel
processing, which seem to be orthogonal at best, if not anti-parallel. In a sense, Bob expounds, "things are about to get worse… in the hopes of getting better."
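To put rough numbers on the "batch limited" complaint, here is a back-of-the-envelope sketch in Python. The function name, the per-batch overhead figures and the share of the frame given over to submission are all my own illustrative assumptions, not measurements from ATI or Microsoft. It only shows the shape of the arithmetic: a fixed frame time divided by a fixed cost per batch.

```python
def max_batches_per_frame(target_fps: float,
                          overhead_us_per_batch: float,
                          cpu_share: float = 1.0) -> int:
    """Upper bound on the batches that fit into one frame.

    target_fps            -- desired frame rate
    overhead_us_per_batch -- assumed fixed cost (microseconds) of issuing
                             one batch: state changes, driver work, etc.
    cpu_share             -- assumed fraction of the frame the application
                             can spend on submission (hypothetical knob)
    """
    frame_time_us = 1_000_000.0 / target_fps
    budget_us = frame_time_us * cpu_share
    return int(budget_us // overhead_us_per_batch)


if __name__ == "__main__":
    # Hypothetical overheads: halving the cost per batch doubles the budget.
    for overhead in (40.0, 20.0, 10.0):
        batches = max_batches_per_frame(60.0, overhead, cpu_share=0.5)
        print(f"{overhead:5.1f} us/batch -> {batches} batches per frame")
```

Under these assumed figures, cutting the per-batch overhead is the only way to fit more work into the frame, which is the lever Bob describes.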
And then he really gets into his element, letting out his inner technology evangelist, and before I know it I've drunk the Kool-Aid and I'm seeing happy-happy joy-joy images of developers
frolicking in glorious rendered fields of Direct3D 10 goodness. Bob introduces me to the Unified Shader Architecture.
"Current shader architectures use fairly different approaches for the vertex and pixel shader units, and this is reflected in their supporting different operations and sometimes requiring
different techniques to program. With Direct3D 10, all shader units support the same basic operations and use the same syntax." In addition, Guennardi chips in to tell me about optimizations to the
low-level graphics driver such that shader development no longer necessitates the use of assembly language. "Everything can be done in HLSL," he gushes. "Almost everything," Bob corrects. I
make the analogy to contemporary use of high(er)-level languages like C or C++ with only occasional use of assembly for machine-specific extensions (such as SSE3) and they both eagerly seize on it
and run with it. The driver analyzes the bytecode produced by the runtime and generates optimized opcodes for the specific hardware, a process I compare to just-in-time (JIT) compilation, a
comparison Guennardi is quite pleased with.
Bob continues, "This is the Unified Shader Architecture, and it enables us to do a lot of really cool things that were just really difficult and tedious before, stuff that was either being
bus-limited or CPU-limited but is now possible because of how we've been able to reorganize the GPU's internal architecture." Now all shader units can fetch textures or access vertex memory, and some
operations can be shifted from one unit to another – some operations are, in essence, shader unit agnostic. As a consequence, Arbitration, an executive process running on the GPU that decides
what gets to execute next, can avoid stalls by determining that the next several operations do not depend on the result of a blocked unit, one that is perhaps waiting on I/O. I say that it's like
having a spare CPU, except that it's sort of running on the GPU.
Bob likes the analogy. "We actually have the unified shader architecture running on a production system already – in the Xbox 360, with the custom GPU we designed for that. It's allowed
developers for the 360 to do all sorts of cool stuff, and we'll get into that in a minute." He takes a moment to point out, however, that the Unified Shader Architecture is not a requirement of the
Direct3D 10 specification. Rather, "the specification is written in such a way that encourages and is compatible with the Unified Shader Architecture. This is just ATI's take – and just a first
take at that, and you'll see some of the amazing stuff we've been able to do with it. Essentially, the Direct3D 10 'refresh' of the API presents an opportunity for a more natural mapping to the
capabilities of the underlying hardware."
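The load-balancing argument behind the slide Guennardi showed earlier can be illustrated with a toy model. This is a sketch under heavy assumptions: the unit counts and workloads are invented, the function names are mine, and real hardware has dependencies and scheduling costs that this ignores. It compares a design with separate vertex and pixel pools against one pool of interchangeable units.

```python
import math


def frame_cycles_dedicated(vertex_work: int, pixel_work: int,
                           vertex_units: int, pixel_units: int) -> int:
    """Cycles needed when each pool can only run its own kind of work.

    The two pools run side by side, so the slower one sets the pace
    while the other sits partly idle.
    """
    return max(math.ceil(vertex_work / vertex_units),
               math.ceil(pixel_work / pixel_units))


def frame_cycles_unified(vertex_work: int, pixel_work: int,
                         total_units: int) -> int:
    """Cycles needed when any unit can take any work item (idealized)."""
    return math.ceil((vertex_work + pixel_work) / total_units)


if __name__ == "__main__":
    # Hypothetical chip: 8 vertex + 24 pixel units, or 32 unified units.
    workloads = {"geometry-heavy": (3200, 800), "pixel-heavy": (200, 4800)}
    for label, (v, p) in workloads.items():
        split = frame_cycles_dedicated(v, p, 8, 24)
        unified = frame_cycles_unified(v, p, 32)
        print(f"{label}: dedicated={split} cycles, unified={unified} cycles")
```

With these made-up numbers the geometry-heavy case takes 400 cycles on the split design and 125 on the unified one, and the pixel-heavy case takes 200 versus 157. The gap is largest exactly where one pool would otherwise sit idle.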
And now for some demos. Guennardi takes over the keyboard and mouse from Will and first tempers my expectations by reminding me that what I will be shown is actually running on current-generation
hardware. "It's an X1900. What we've done is take the pixel shader unit and run everything in a Direct3D 10-like fashion on it – vertex shader, geometry shader, pixel shader."
"Essentially, you're emulating the Unified Shader Architecture on just the pixel shader?" I ask.
Returning to a point made earlier, one of the real upsides to the D3D 10 API refresh, and to the Unified Shader Architecture, at least in Bob's and Guennardi's eyes, is that it offloads a number
of tasks from the CPU, making more cycles available for artificial intelligence and to otherwise create more immersive, interactive worlds. "With the Unified Shader Architecture, and specifically
with the geometry shader, we can do a lot of things purely on the GPU that used to require CPU involvement, like collision detection." I believe I took a minute here to gather my jaw from the floor.
Collision detection on the GPU?! I am incredulous, and Guennardi clearly enjoys my enthusiasm. I may even remember Will and Bob high-fiving each other in the background. Or not.
Guennardi shows me a prairie-like environment with shrubbery, a fire and smoke billowing and being blown about by the wind. He pans and zooms around, showing how the smoke particles interact with
the polygons of the vegetation without experiencing planar clipping. The visual simulation is virtually seamless, and it's all running on the GPU.
Next, he shows a large space populated with a crowd of about 10,000 textured and animated characters milling about, performing somersaults and flying kicks, running around and generally getting
their activity on. "Imagine, for example, the battles in Lord of the Rings," Guennardi says, "and being able to have thousands and thousands of individually articulated and driven characters on the
field." "RTS programmers are going to love this!" is my only reply. "By the way, we're using only two meshes. With instancing, which is significantly improved, we scale them and apply different
textures and animations to them, giving us a rich visual environment with very few resources."
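The bookkeeping behind "only two meshes" can be sketched on the CPU side in Python. This is not the demo's code and does not use the Direct3D API. The class names, value ranges and counts are hypothetical, chosen only to show that each character is a small record pointing at shared geometry.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Mesh:
    name: str
    vertex_count: int


@dataclass(frozen=True)
class Instance:
    mesh: Mesh           # shared reference, never a copy of the geometry
    scale: float         # per-instance size variation
    texture_id: int      # which texture to apply
    animation_id: int    # which animation to play
    position: tuple      # (x, y, z) placement in the scene


def build_crowd(meshes, count, textures=8, animations=6, seed=0):
    """Create `count` lightweight instances that all reuse `meshes`."""
    rng = random.Random(seed)
    return [
        Instance(
            mesh=rng.choice(meshes),
            scale=rng.uniform(0.9, 1.1),
            texture_id=rng.randrange(textures),
            animation_id=rng.randrange(animations),
            position=(rng.uniform(-100, 100), 0.0, rng.uniform(-100, 100)),
        )
        for _ in range(count)
    ]


def stored_vertices(crowd):
    """Vertices kept in memory: each distinct mesh is counted once."""
    return sum(mesh.vertex_count for mesh in {inst.mesh for inst in crowd})


def drawn_vertices(crowd):
    """Vertices processed per frame: every instance draws its mesh."""
    return sum(inst.mesh.vertex_count for inst in crowd)


if __name__ == "__main__":
    meshes = [Mesh("fighter", 1000), Mesh("acrobat", 1500)]  # invented sizes
    crowd = build_crowd(meshes, 10_000)
    print(len(crowd), stored_vertices(crowd), drawn_vertices(crowd))
```

The point of the design is that storage stays fixed at the size of the two base meshes while the number of characters grows, so the per-character cost is just the small instance record.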
Bob pipes up, reiterating a distinction we talked about earlier. "Essentially, we are moving from game rendering, again, to game computing." Guennardi takes control of a slider oriented vertically
along the right side of the screen and drags it up and down, and I see the number of on-screen characters swell and shrink in relation. At around 100,000 characters, the simulation bogs down and the
frame rate begins to stutter sharply. Pulling back to a comfortable number, he points out that the scene is fully interactive. "We can collide with each and every character," he says, as he drives
the camera through the crowd with the mouse and characters go flying, bouncing off the collision volume created around the camera. "And all that code is running on the GPU, all done using
Switching demonstrations again, Guennardi shows me an undulating ocean and a beautifully rendered sky. The sky is a skybox, and is of no interest to us. Hitting a key, he shows me the wireframe
mesh, and points out that the wave effects are being computed using the geometry shader – again, running on the GPU. Pointing to the distant areas of the ocean surface, where the mesh devolves
into a sea of solid color, I recall another difference we touched on earlier. "Yes. In DirectX 9, that area would heavily load the vertex shader but underutilize the pixel shader because the mesh was
resulting in very few pixels drawn. In Direct3D 10, and in particular with the Unified Shader Architecture, the arbitration balances the load and prevents stalling and other forms of performance
Grabbing yet another slider, he modifies the size of the waves by changing the velocity of the wind blowing over the ocean's surface. "All the physics behind the wave generation is being computed
on the GPU, using the geometry shader." He then brings up a menu and enables CPU computation instead of using the GPU (remember, this is all being emulated using solely the pixel shader unit). The
simulation goes choppy. The wind speed has to be brought way down before the simulation runs at all smoothly, and even then at framerates under 50. With CPU computation switched off again, the framerate
immediately jumps back over 97. "We're seeing nearly double the performance, even on current-generation hardware, using this approach."
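ATI did not describe the wave model behind the demo, so the following Python sketch is purely my own illustration of the kind of per-vertex work involved. It sums a few sine waves over a grid, using the standard deep-water relation between wavelength and frequency. The wave components, the base amplitude and the choice to scale amplitude with the square of wind speed are all assumptions.

```python
import math

GRAVITY = 9.81  # m/s^2

# (wavelength in metres, direction in radians, relative amplitude)
# Hypothetical components chosen only for illustration.
COMPONENTS = [(40.0, 0.0, 1.0), (17.0, 0.6, 0.5), (7.0, -0.4, 0.25)]


def wave_height(x, z, t, wind_speed, reference_wind=10.0, base_amplitude=0.5):
    """Height of the surface at (x, z) and time t for a given wind speed.

    Assumption: amplitude grows with the square of wind speed relative
    to `reference_wind`; this is an illustrative choice, not the demo's.
    """
    gain = base_amplitude * (wind_speed / reference_wind) ** 2
    height = 0.0
    for wavelength, direction, rel_amp in COMPONENTS:
        k = 2.0 * math.pi / wavelength       # wavenumber
        omega = math.sqrt(GRAVITY * k)       # deep-water dispersion relation
        phase = k * (x * math.cos(direction) + z * math.sin(direction)) - omega * t
        height += gain * rel_amp * math.sin(phase)
    return height


def displace_grid(size, spacing, t, wind_speed):
    """Heights for a size x size mesh; one independent evaluation per vertex."""
    return [[wave_height(i * spacing, j * spacing, t, wind_speed)
             for i in range(size)]
            for j in range(size)]


if __name__ == "__main__":
    calm = displace_grid(4, 5.0, t=1.0, wind_speed=5.0)
    stormy = displace_grid(4, 5.0, t=1.0, wind_speed=20.0)
    print(calm[1][2], stormy[1][2])
```

Every vertex is computed independently of its neighbours, which is why this sort of workload parallelizes well and why moving the slider only changes one input to the same calculation.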
"And the CPU isn't even loaded. Can I see the CPU utilization?" I ask. Guennardi points out that it'll be at 100%, simply because Direct3D and OpenGL both require that the application loop be
within 3 frames of the video refresh. "The CPU is running an empty loop, basically, as fast as it can, which would result in 100% utilization. But it is by no means indicative of actual processing
load or availability."
Giddy like a kid on a sugar high, I lean back as Guennardi ends the demonstrations. Asked what I think, I enthuse about getting back into graphics programming myself. Bob and Guennardi rattle off
a list of tasks that are ideally suited for the sort of parallelization that the Unified Shader Architecture and the friendly Direct3D 10 API enable. I remark on the huge opportunity for PC
developers of all stripes, from AAA to casual, given that Windows Vista will ship with Direct3D 10 included, and on PCs that can exploit all its basic features. As we wrap up the preview, we muse as
to what creative uses developers will find for the capabilities made available to them.
"Basically," Bob concludes, "our objective is to eliminate some of the current constraints faced by developers and push them further back, so developers have more room to grow and explore." Yes,