Oxide Games Partners Dan Baker and Tim Kipp will show you how to build a high throughput renderer using the Mantle API in this AMD technology presentation from the 2014 Game Developers Conference in San Francisco March 17-21. Also view this and other presentations on our developer website at https://developer.amd.com/resources/documentation-articles/conference-presentations/
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AMD at GDC14
2. 2 | Nitrous and Mantle | 19 March 2014
PRE-REQUISITE MOTIVATIONAL SLIDE
MODERN APIS ARE STARTING TO FEEL RATHER DATED
BUT HOW MUCH BETTER CAN WE BE?
3. 3 | Nitrous and Mantle | 19 March 2014
PRE-REQUISITE MOTIVATIONAL SLIDE
TURNS OUT… A WHOLE LOT FASTER
4. 4 | Nitrous and Mantle | 19 March 2014
HONEY, DOES THIS DRESS MAKE ME LOOK FAT?
…
5. 5 | Nitrous and Mantle | 19 March 2014
STATE OF THE ART TODAY: WHAT’S GOING ON?
Lots of little things add up
2 major problems require rearchitecture
– Functional threading model throws a wrench into task-based systems
– Implicit hazard tracking and synchronization
API tries to hide the async nature of the GPU
Lots of little things: memory model, binding model, etc.
Analysis of features like instancing indicates that it is unreliable and tends to speed up only the fastest frames; correlation between batches and driver perf is casual
Can’t retrofit old APIs
6. 6 | Nitrous and Mantle | 19 March 2014
DIVING INTO NITROUS
Nitrous = Oxide’s custom engine
Specifically designed for high throughput
Core neutral. Main thread acts only as a lightweight sequencer
All work divided up into small jobs, which are in the microsecond range
Can produce lots of jobs, in the 10,000+ range per frame
7. 7 | Nitrous and Mantle | 19 March 2014
STAR SWARM
Nitrous Engine demo
Free to download, experiment
Proof of concept for modern API design
Represents 2 AI opponents, thus application CPU load is realistic
10,000 units possible
100,000+ batches possible
8. 8 | Nitrous and Mantle | 19 March 2014
SECRETS BEHIND STAR SWARM
Much of what is required for high performance isn’t specific to Mantle
Star Swarm originally not based on Mantle
If engine is structured in certain ways, Mantle support is straightforward and intuitive. Maybe even fun.
Work done to restructure engine will have benefits outside of Mantle support
9. 9 | Nitrous and Mantle | 19 March 2014
ADDING NITROUS TO THE ENGINE
Rendering broken into jobs which generate autonomous command buffers
CPU to GPU data streamlined – constants, texture updates go into GPU frame memory
Shader bindings standardized
Shaders, state, bundled into blocks
Resources grouped into sets
Graphics commands streamlined, restricted bind points
Stateless command format
Expensive state transitioned rarely
Much attention paid to cache usage, lockless data structures
All hazards detangled, all buffers considered non persistent
10. 10 | Nitrous and Mantle | 19 March 2014
MULTI-CORE CPU BASICS
Be Wary, There Is A Lot Of Very Bad Advice In The Wild
Spawning threads to handle tasks
Relying on the OS preemptive scheduler, heavyweight OS synchronization primitives
Functional threading in general
Your Survival Guide
OK: Multi-thread read of same location
OK: Multi-thread write to different locations
OK: Multi-thread write to same location in ‘stamp’ mode
CAUTION: Atomic instructions
STOP: Multi-thread read/write to same location
STOP: Multi-thread write to same CACHE line
11. 11 | Nitrous and Mantle | 19 March 2014
NITROUS AND MANTLE
Nitrous is NOT built around Mantle
The reverse is more true: Mantle adapts well to Nitrous’ internal concepts
The concepts are what make the engine fast
Results are astounding: driver time reduced up to 50x
Mantle is the harbinger of future API design, not just in graphics
12. 12 | Nitrous and Mantle | 19 March 2014
TASK BASED SYSTEM
Idea is that the workload is a constructed graph of much smaller nuggets
Many advantages
– Scales well, 32+ cores
– Easy to balance workload
– More power efficient – more, slower cores are just as good
Already seeing CPUs dynamically slowing clock speed
– If enough similar work items are queued, can execute the same code on cores
Cache hit rate much higher
– End up generating a larger number of command buffers to prevent thread serialization
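The idea of many small jobs that each fill their own command buffer can be sketched as follows. This is a minimal illustration, not Nitrous code: `CommandBuffer`, `RunRenderJobs`, and the fake command values are hypothetical, and a plain pool of `std::thread` workers stands in for the engine's job scheduler.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a recorded command buffer.
struct CommandBuffer {
    std::vector<std::uint32_t> commands;
};

// Runs jobCount small jobs on workerCount threads. Workers pull the next job
// index from a shared atomic counter, and job i publishes only to buffers[i],
// so no locks are needed and no two jobs write to the same buffer.
inline std::vector<CommandBuffer> RunRenderJobs(std::size_t jobCount, unsigned workerCount) {
    assert(workerCount > 0);
    std::vector<CommandBuffer> buffers(jobCount);
    std::atomic<std::size_t> nextJob{0};

    auto worker = [&buffers, &nextJob, jobCount] {
        for (;;) {
            const std::size_t job = nextJob.fetch_add(1, std::memory_order_relaxed);
            if (job >= jobCount) return;
            // Record into a local buffer first, then publish it with one move,
            // which keeps writes to neighbouring slots to a minimum.
            CommandBuffer local;
            for (std::uint32_t c = 0; c < 3; ++c) {
                // Placeholder work: three fake commands tagged with the job index.
                local.commands.push_back(static_cast<std::uint32_t>(job) * 10 + c);
            }
            buffers[job] = std::move(local);
        }
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) workers.emplace_back(worker);
    for (auto& t : workers) t.join();
    return buffers;
}
```

Because each job owns its output, the number of command buffers grows with the number of jobs, which is the trade-off the last bullet describes.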
14. 14 | Nitrous and Mantle | 19 March 2014
NITROUS COMMAND FORMATS
In reality, diagram is oversimplified
Nitrous has its own internal command format
– Small, efficient commands
– Stateless, each command contains references to all needed state
– Inheritance unneeded
– Separates internal graphics system from any particular API
Being stateless, commands can be generated completely out of order
Entire frame is queued up in internal command format
Frame is translated to GPU commands via Mantle
– Nitrous command buffers are translated into Mantle command buffers in one section
Gets more optimal use out of instruction cache and data cache
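A stateless command format with a separate translation pass might look like the sketch below. Everything here is assumed for illustration: the `DrawCommand` fields, the `sortKey` ordering, and the `ApiCall` list that stands in for real Mantle calls are not taken from Nitrous.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stateless command: it names every piece of state it needs,
// so it does not depend on whatever command was recorded before it.
struct DrawCommand {
    std::uint32_t sortKey;          // desired position in the frame
    std::uint32_t shaderBlock;
    std::uint32_t primitive;
    std::uint32_t resourceSet;
    std::uint32_t constantsOffset;  // reference into frame memory
};

// Stand-in for the API-specific command stream produced by translation.
enum class ApiOp : std::uint8_t { BindShaderBlock, BindPrimitive, BindResourceSet, Draw };
struct ApiCall {
    ApiOp op;
    std::uint32_t arg;
};

// Translation pass: order the commands, then emit API calls, skipping binds
// that would not change the currently bound state.
inline std::vector<ApiCall> Translate(std::vector<DrawCommand> cmds) {
    std::stable_sort(cmds.begin(), cmds.end(),
                     [](const DrawCommand& a, const DrawCommand& b) { return a.sortKey < b.sortKey; });

    const std::uint32_t kNone = 0xFFFFFFFFu;
    std::uint32_t shader = kNone, primitive = kNone, resources = kNone;
    std::vector<ApiCall> out;
    for (const DrawCommand& c : cmds) {
        if (c.shaderBlock != shader) {
            out.push_back({ApiOp::BindShaderBlock, c.shaderBlock});
            shader = c.shaderBlock;
        }
        if (c.primitive != primitive) {
            out.push_back({ApiOp::BindPrimitive, c.primitive});
            primitive = c.primitive;
        }
        if (c.resourceSet != resources) {
            out.push_back({ApiOp::BindResourceSet, c.resourceSet});
            resources = c.resourceSet;
        }
        out.push_back({ApiOp::Draw, c.constantsOffset});
    }
    assert(out.size() >= cmds.size());  // every command yields at least its draw
    return out;
}
```

Since the commands carry no inherited state, jobs can record them in any order, and only this one pass needs to know about the target API.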
15. 15 | Nitrous and Mantle | 19 March 2014
BUILDING AROUND ASYNCHRONICITY: HOW NITROUS THINKS OF A FRAME
Entire app should be exposed to the concept of asynchronicity
The concept of a frame:
– A set of commands which will be executed on the GPU
– A set of data which will be read by the GPU
– This concept is fundamental in Nitrous, regardless of API
[Diagram: a Frame holds commands (CMD CMD CMD CMD) and Frame Data; Persistent data holds Textures, Big Transfer Buffers, and Resource Sets]
16. 16 | Nitrous and Mantle | 19 March 2014
CREATING A FRAME, USING FRAME DATA
Create 2 copies of our frame data
One will be read by the GPU, while the other is being written to by the CPU
Must use a fence to make sure the CPU doesn’t get ahead
More complex situations could be explored
Frame data includes
– Constant Data
– Small texture updates
[Diagram: Even Frame and Odd Frame buffers; the GPU reads one while the CPU writes the other]
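The even/odd scheme can be sketched with a simulated fence. This is only an illustration under assumptions: `Fence`, `FrameData`, and `FrameRing` are hypothetical names, the fence is a plain flag rather than a Mantle fence object, and `BeginFrame` returns `nullptr` where a real engine would wait.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a GPU fence: true once the GPU is done with the frame.
struct Fence {
    bool signaled = true;
};

struct FrameData {
    std::vector<std::uint8_t> constants;  // constant data and small texture updates
    Fence gpuDone;
};

class FrameRing {
public:
    // Returns the copy the CPU may write for this frame, or nullptr if the GPU
    // is still reading it (a real engine would block on the fence instead).
    FrameData* BeginFrame(std::uint64_t frameNumber) {
        FrameData& f = frames_[frameNumber & 1];
        if (!f.gpuDone.signaled) return nullptr;
        f.constants.clear();
        return &f;
    }

    // Hands the frame to the GPU; it stays off limits until the fence signals.
    void Submit(std::uint64_t frameNumber) {
        FrameData& f = frames_[frameNumber & 1];
        assert(f.gpuDone.signaled);  // must not resubmit a frame that is in flight
        f.gpuDone.signaled = false;
    }

    // Called when the (simulated) GPU has finished reading the frame.
    void OnGpuFinished(std::uint64_t frameNumber) {
        frames_[frameNumber & 1].gpuDone.signaled = true;
    }

private:
    std::array<FrameData, 2> frames_;  // even frame and odd frame
};
```

With two copies the CPU can run at most one frame ahead of the GPU; frame N+2 cannot start until the fence for frame N has signaled.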
18. 18 | Nitrous and Mantle | 19 March 2014
HINT
Use the memory heap that has the highest cpuWritePerfRating
In debug builds, rather than copying directly to GPU memory, allocate CPU memory
– Or use pinned Mantle memory
Then use the OS call VirtualProtect with PAGE_NOACCESS on any data that affects the frame while the frame is being accessed by the GPU, or could be being translated by the CPU
If any part of the system inadvertently writes to the memory, it will throw an exception
19. 19 | Nitrous and Mantle | 19 March 2014
SOME EXTRA STUFF WE WILL NEED
Because we track hazards, we will want a few more buffers
A delete queue – objects are not deleted, but placed in the delete queue
– One queue per frame; once that frame is complete, items will be deleted
A state transition queue
– Used only when a resource is created, to transition it to the desired initial state
An Unordered Command Queue
– Gets flushed before the main frame’s command queue
– Useful for preparing resources for first-time use (e.g. initialization)
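A per-frame delete queue can be sketched as below. The class name, the use of `std::function` as the destroy callback, and the ring of queues indexed by frame number are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Deferred deletion: an object "deleted" during frame N is only destroyed
// after the GPU has finished frame N, because the GPU may still reference it.
class DeferredDeleter {
public:
    explicit DeferredDeleter(std::size_t framesInFlight) : queues_(framesInFlight) {
        assert(framesInFlight > 0);
    }

    // Instead of destroying now, remember how to destroy once the frame retires.
    void QueueDelete(std::uint64_t frame, std::function<void()> destroy) {
        queues_[frame % queues_.size()].push_back(std::move(destroy));
    }

    // Call once the GPU has completed `frame`. Returns how many items were freed.
    std::size_t RetireFrame(std::uint64_t frame) {
        std::vector<std::function<void()>>& queue = queues_[frame % queues_.size()];
        const std::size_t count = queue.size();
        for (std::function<void()>& destroy : queue) destroy();
        queue.clear();
        return count;
    }

private:
    std::vector<std::vector<std::function<void()>>> queues_;  // one queue per frame in flight
};
```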
20. 20 | Nitrous and Mantle | 19 March 2014
INTERNAL COMMAND FORMAT
Nitrous has its own internal command format
Persistent state:
– Resource Sets
– Shader Blocks
– Various pipeline state
Frame State, primary construct is a batch set
– Contains primitives, batches and shader sets
– Batches, which reference primitives and shader sets
– Constant references are made into our frame memory
Each one of these has a different, natural change frequency
21. 21 | Nitrous and Mantle | 19 March 2014
NITROUS MEMORY POOLS
Resources used together, created together
Multiple resource sets are often pooled
Simplifies memory management, fewer than 1000 total allocations
[Figure: “Orange Team Unit’s Memory” pool holding four resource sets (FIGHTER 1, CAR. REAR, CAR. FOR, CARRIER MAIN), each containing (0) Albedo, (1) Material Mask, (2) Ambient Occlusion, (3) Normal Map, (4) Weathering Map]
22. 22 | Nitrous and Mantle | 19 March 2014
NITROUS MEMORY POOLS
GPU resource allocation is a little tricky – we don’t know ahead of time how big something might be
2-step process: first calculate the size of the resource, then allocate a pool based on that size
Does not map 1:1 to Mantle memory allocations
Instead, a pool is created with a default page size
When a new resource is added, either it is placed inside the current allocation or, if the resource is bigger than the page size, a new allocation is created that fits the resource
A memory pool in Nitrous = a list of allocations in Mantle
If able to size ahead of time, only 1 allocation
[Figure: Unit Textures pool (Diffuse, Specular, Mask, AO, Normal) backed by three Mantle allocations]
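The pool-as-list-of-allocations idea can be sketched as follows. This is a simplified model, not the engine's allocator: `Allocation` only tracks sizes instead of wrapping a Mantle memory object, the 256-byte default alignment is an arbitrary assumption, and a resource that no longer fits in the current page simply opens a new allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for one API-level memory allocation owned by the pool.
struct Allocation {
    std::size_t size;
    std::size_t used;
};

// Where a resource ended up: which allocation, and at what offset.
struct Placement {
    std::size_t allocationIndex;
    std::size_t offset;
};

class MemoryPool {
public:
    explicit MemoryPool(std::size_t pageSize) : pageSize_(pageSize) { assert(pageSize > 0); }

    // Places the resource in the current allocation if it fits; otherwise
    // creates a new allocation of at least the page size, or of the resource
    // size when the resource is bigger than a page.
    Placement Add(std::size_t size, std::size_t alignment = 256) {
        assert(alignment > 0);
        if (!allocations_.empty()) {
            Allocation& current = allocations_.back();
            const std::size_t offset = AlignUp(current.used, alignment);
            if (offset + size <= current.size) {
                current.used = offset + size;
                return {allocations_.size() - 1, offset};
            }
        }
        allocations_.push_back({std::max(pageSize_, size), size});
        return {allocations_.size() - 1, 0};
    }

    std::size_t AllocationCount() const { return allocations_.size(); }

private:
    static std::size_t AlignUp(std::size_t value, std::size_t alignment) {
        return (value + alignment - 1) / alignment * alignment;
    }

    std::size_t pageSize_;
    std::vector<Allocation> allocations_;
};
```

If the total size is known up front, constructing the pool with that size as the page size yields the single-allocation case.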
24. 24 | Nitrous and Mantle | 19 March 2014
SOME EXTRA MANAGEMENT REQUIRED
Creating a Resource slightly more involved
When a resource creation call occurs, check to see if it targets a GPU heap
If so, there is no way to directly map memory and upload the resource, so
– 1) Allocate, or recycle, a CPU-visible heap object
– 2) Create the resource and map it into this heap
– 3) Create the resource on the GPU in the specified heap (it will be uninitialized)
– 4) Issue a copy command in our Unordered Command Queue
– 5) Place the temp resource in a deletion queue
For any resource, we allow a default state to be specified
– At the beginning of the frame, before we execute the main commands, issue any state transition queues to place resources from the default state into the desired state
25. 25 | Nitrous and Mantle | 19 March 2014
RESOURCE SETS
In real world, textures are grouped
Nitrous has 5 bind points
– 2 for batch
– 2 for shader
– 1 for primitive
VB is just a resource set
Nitrous does not allow binding of individual textures
Clearly, maps 1:1 to a descriptor
Space Fighter 1
(0) Albedo
(1) Material Mask
(2) Ambient Occlusion
(3) Normal Map
(4) Weathering Map
26. 26 | Nitrous and Mantle | 19 March 2014
VERTEX BUFFERS
Nitrous does not use Vertex Buffers
Instead, Resource Set acts as VB, but with more programmatic control
Vastly simplifies engine side management
– VBs can be saved as DDS files
– Do not require a huge amount of loading code for slightly different Vertex Formats
– Can fold Displacement maps and other geometry modifiers into Primitive Resource Set
Have not seen strong evidence on any hardware that this causes a performance issue
27. 27 | Nitrous and Mantle | 19 March 2014
CONSTANT BUFFERS
Nitrous does not have concept of constant buffers
Instead, all constant data is thrown out every frame
– When we render an object, CPU will generate the constants needed for that frame
– Grab a piece of the Frame Memory and write to it
Constant bindings are just references into our frame memory
But… be careful! CPU is writing straight to GPU memory. Do NOT read it back!
Evidence suggests no performance advantage to persisting constants across frames; regenerating every frame is amply fast. 100k+ batches not a problem
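Grabbing a piece of frame memory for each object's constants can be sketched with a linear allocator. This is an assumption-laden illustration: `FrameMemory` uses an ordinary CPU buffer in place of mapped GPU memory, the 16-byte alignment is arbitrary, and a single atomic offset is used so that several jobs can allocate concurrently.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Per-frame constant storage: write-only from the CPU side, reset every frame.
class FrameMemory {
public:
    static constexpr std::size_t kInvalid = static_cast<std::size_t>(-1);

    explicit FrameMemory(std::size_t capacity) : storage_(capacity) {}

    // Copies `size` bytes into the frame and returns their offset, which serves
    // as the constant binding. Returns kInvalid when the frame memory is full.
    std::size_t WriteConstants(const void* data, std::size_t size, std::size_t alignment = 16) {
        assert(alignment > 0);
        const std::size_t padded = (size + alignment - 1) / alignment * alignment;
        const std::size_t offset = head_.fetch_add(padded, std::memory_order_relaxed);
        if (offset + padded > storage_.size()) return kInvalid;
        std::memcpy(storage_.data() + offset, data, size);  // write only, never read back
        return offset;
    }

    // Everything written last time is simply thrown away.
    void Reset() { head_.store(0, std::memory_order_relaxed); }

private:
    std::vector<unsigned char> storage_;  // stand-in for mapped GPU frame memory
    std::atomic<std::size_t> head_{0};
};
```

The only shared state is one atomic counter, touched once per allocation; each job then writes to its own disjoint range.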
28. 28 | Nitrous and Mantle | 19 March 2014
A BATCH IN NITROUS CONSISTS OF 4 PARTS
[Diagram: the four parts of a batch and how they relate]
Batch Set: Batches, Primitives, Shaders, RTs, Blend State
Batch: Primitive, Shader, Resources (2), Constants (2)
Primitive: IB, Resources, Tri info
Shader: Resources (2), Constants (2), Shader Block
Example Batch Set: Prim 0, Prim 1, Prim 2; Shader 0, Shader 1; Batch 0, Batch 1, Batch 2, Batch 3, Batch 4
29. 29 | Nitrous and Mantle | 19 March 2014
DESCRIPTOR TABLE LAYOUT FOR NITROUS
Descriptor 0
*Batch Resource Set 0
*Batch Resource Set 1
Batch Constants 1
Batch Constants 2
*Shader Resource Set 0
*Shader Resource Set 1
Shader Constants 0
Shader Constants 1
*UAV
*Samplers (only 1 global bank)
Descriptor 1
*Primitive VB
Dynamic Const
Batch Constants 0
30. 30 | Nitrous and Mantle | 19 March 2014
DESCRIPTOR BINDING STRATEGY
Remember: descriptors are just structures in GPU memory, so they need to be double buffered as well
Create 1 giant descriptor table, start updating at the beginning of the frame
Recognize that we have a resource bind vector of only 9 items
Each bind vector can be built into a descriptor table, but we don’t need a unique one
Check to see if this bind vector has been built before (during this frame), e.g. is resident in a small cache; if so, just reference it
If not, build a new descriptor table, and place it in the cache
Dynamic constants, batch constant 0, use grCmdBindDynamicMemoryView
– Usually, this will change every call (e.g. some part of the batch is changing, or else it’s the same batch)
Using grCmdBindDynamicMemoryView, for 100k batches, only about 5-10k descriptors actually need to get built per frame
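The per-frame cache of bind vectors can be sketched like this. The names are hypothetical, resource handles are modeled as plain integers, and a `std::map` stands in for whatever small cache the engine actually uses; building a descriptor is reduced to handing out the next slot in the frame's table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// The 9 resource bind points of a batch, modeled as integer handles.
using BindVector = std::array<std::uint32_t, 9>;

class DescriptorCache {
public:
    explicit DescriptorCache(std::uint32_t tableCapacity) : capacity_(tableCapacity) {}

    // Returns the slot of this bind vector in the frame's descriptor table,
    // building a new entry only if the same vector was not seen this frame.
    std::uint32_t GetOrBuild(const BindVector& binds) {
        const auto found = slots_.find(binds);
        if (found != slots_.end()) return found->second;
        assert(nextSlot_ < capacity_);  // the giant table must not overflow
        const std::uint32_t slot = nextSlot_++;
        // A real engine would write the descriptors into GPU memory here.
        slots_.emplace(binds, slot);
        return slot;
    }

    std::size_t BuiltThisFrame() const { return slots_.size(); }

    // Descriptors are double buffered, so the cache starts empty each frame.
    void BeginFrame() {
        slots_.clear();
        nextSlot_ = 0;
    }

private:
    std::uint32_t capacity_;
    std::uint32_t nextSlot_ = 0;
    std::map<BindVector, std::uint32_t> slots_;
};
```

Per-batch data that changes on nearly every call stays out of the key, which is why the number of distinct bind vectors is far smaller than the number of batches.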
31. 31 | Nitrous and Mantle | 19 March 2014
TRACKING RESOURCE USAGE
App’s responsibility to track what resources get used
Simple strategy: stamp a frame number on each memory pool any time it is bound
Traverse the complete resource list; anything which matches the current frame must be resident
Quick as long as we keep the # of heaps reasonable
Important: frame # should be padded into a cache line to avoid serialization
Heap description         Last Frame Used
UI Textures intro        2401
UI Textures in Game      17204
Orange Faction Units     17204
Purple Faction Units     17204
Weapon effects           16392
Post Process RTs         17204
Terrain Heightmap        17204
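The frame-stamp strategy, including the cache-line padding, can be sketched as follows. The 64-byte line size is an assumption (typical for x86), the class and method names are hypothetical, and C++17 is assumed so that `std::vector` honors the over-aligned element type.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stamp per memory pool, padded to an assumed 64-byte cache line so that
// jobs stamping different pools never write to the same line.
struct alignas(64) PoolStamp {
    std::atomic<std::uint64_t> lastFrameUsed{0};
};
static_assert(sizeof(PoolStamp) == 64, "stamp should fill exactly one cache line");

class ResidencyTracker {
public:
    explicit ResidencyTracker(std::size_t poolCount) : stamps_(poolCount) {}

    // Called from any job whenever a pool is bound. All jobs in a frame write
    // the same value, so this is a 'stamp' style write with no read-modify-write.
    void MarkBound(std::size_t pool, std::uint64_t frame) {
        assert(pool < stamps_.size());
        stamps_[pool].lastFrameUsed.store(frame, std::memory_order_relaxed);
    }

    // Walks the complete list once; pools stamped with `frame` must be resident.
    std::vector<std::size_t> ResidentPools(std::uint64_t frame) const {
        std::vector<std::size_t> resident;
        for (std::size_t i = 0; i < stamps_.size(); ++i) {
            if (stamps_[i].lastFrameUsed.load(std::memory_order_relaxed) == frame) {
                resident.push_back(i);
            }
        }
        return resident;
    }

private:
    std::vector<PoolStamp> stamps_;
};
```

In the table above, a traversal for frame 17204 would report every heap except "UI Textures intro" and "Weapon effects".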
32. 32 | Nitrous and Mantle | 19 March 2014
DEALING WITH STATE TRANSITIONS
Most important, and most difficult, part of Mantle
Must understand any time a resource is getting used in a different way:
– Read After Write
– Write After Write
33. 33 | Nitrous and Mantle | 19 March 2014
SHADER BLOCKS
Shader Blocks
– Group of shaders with identical resources
– Key point: all shader stages grouped together
– All resources are bound to all stages
– For Mantle, need to add some extra data
Can we blend?
What back buffer formats might be used?
What z buffer formats might be used?
– Create a matrix of pipeline objects based on specified modes
The right pipeline object is selected based on current RT state
RTs and blend state already chunked, no extra state changes introduced
ShaderGroup SimpleShader
{
    ResourceSetPrimitive = VertexData;
    ConstantSetDynamic[0] = DynamicData;
    ResourceSetBatch[1] = UserTS;
    ConstantSetShader[0] = Globals;
    RenderTargetFormats = R16G16B16A16_FLOAT, R11G11B10_FLOAT;
    BlendStates = BlendOff;
    DepthTargetFormats = D32_FLOAT;
    Methods
    {
        main:
            CodeBlocks = SimpleShaders;
            VertexShader = SimpleVSShader;
            PixelShader = SimplePSShader;
        zprime:
            CodeBlocks = SimpleShaders;
            VertexShader = SimpleVSShader;
            PixelShader = BlankSimplePSShader;
    }
}
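The matrix of pipeline objects can be sketched as a lookup over the declared render target formats, blend states, and depth formats. This is illustrative only: pipelines are modeled as integer handles rather than Mantle pipeline objects, the enum values mirror the shader group above, and `BlendAlpha` is a hypothetical extra state added to show a failed lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

enum class Format : std::uint8_t { R16G16B16A16_FLOAT, R11G11B10_FLOAT, D32_FLOAT };
enum class Blend : std::uint8_t { BlendOff, BlendAlpha };  // BlendAlpha is hypothetical

class PipelineMatrix {
public:
    // Builds one pipeline per (render target format, blend state, depth format)
    // combination declared by the shader block.
    PipelineMatrix(std::vector<Format> rtFormats, std::vector<Blend> blends,
                   std::vector<Format> depthFormats)
        : rtFormats_(std::move(rtFormats)),
          blends_(std::move(blends)),
          depthFormats_(std::move(depthFormats)) {
        assert(!rtFormats_.empty() && !blends_.empty() && !depthFormats_.empty());
        const std::size_t count = rtFormats_.size() * blends_.size() * depthFormats_.size();
        for (std::size_t i = 0; i < count; ++i) {
            pipelines_.push_back(static_cast<int>(i));  // stand-in for a created pipeline
        }
    }

    // Picks the pipeline matching the current RT state, or -1 if the shader
    // block never declared that combination.
    int Select(Format rt, Blend blend, Format depth) const {
        const int r = IndexOf(rtFormats_, rt);
        const int b = IndexOf(blends_, blend);
        const int d = IndexOf(depthFormats_, depth);
        if (r < 0 || b < 0 || d < 0) return -1;
        const std::size_t index =
            (static_cast<std::size_t>(r) * blends_.size() + static_cast<std::size_t>(b)) *
                depthFormats_.size() +
            static_cast<std::size_t>(d);
        return pipelines_[index];
    }

    std::size_t PipelineCount() const { return pipelines_.size(); }

private:
    template <typename T>
    static int IndexOf(const std::vector<T>& values, T wanted) {
        for (std::size_t i = 0; i < values.size(); ++i) {
            if (values[i] == wanted) return static_cast<int>(i);
        }
        return -1;
    }

    std::vector<Format> rtFormats_;
    std::vector<Blend> blends_;
    std::vector<Format> depthFormats_;
    std::vector<int> pipelines_;
};
```

For the `SimpleShader` group above, two render target formats, one blend state, and one depth format give a matrix of two pipelines per method.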
34. 34 | Nitrous and Mantle | 19 March 2014
CREATING SHADER BLOCKS IN MANTLE
Translate HLSL byte code to Mantle IC
– All done at compile time, have a Mantle-specific executable
Creating a Mapping Table
– Batch has 5 bind points
– Shader has 4 bind points
– Batch Set has 1 bind point
– Primitive has 1 bind point
– Global Samplers have 1 bind point
Set up our IC so all pipeline objects use exactly the same top-level descriptor
35. 35 | Nitrous and Mantle | 19 March 2014
WHAT ABOUT THAT PRESENT?
Unlike with other APIs, we do not need to, nor should we, block on the present on the main thread
Instead we spawn a job, which we block against on the next present
void PresentJob()
{
    …
    // Submit this frame's command buffers; the fence signals when the GPU is done
    result = grQueueSubmit(g_UniversalQueue, g_cCommandBuffers, g_CommandBuffers,
                           cMemRefs, MemRef, g_FrameFences[g_uSubmittingFrameBuffer]);

    uint32 PresentFlags = 0;
    if (g_bVSync)
        PresentFlags = GR_WSI_WIN_PRESENT_FLIP_DONOTWAIT;

    // Instruct the GPU to present the backbuffer in the application's window
    GR_WSI_WIN_PRESENT_INFO presentInfo =
    {
        g_hWnd, g_MasterResourceList.Images[DR_BACKBUFFER],
        GR_WSI_WIN_PRESENT_MODE_BLT, 0, PresentFlags
    };
    result = grWsiWinQueuePresent(g_UniversalQueue, &presentInfo);

    // Let the next frame's submission know this present job has finished
    SignalProcessAndPresentDone(pInfo);
}
36. 36 | Nitrous and Mantle | 19 March 2014
WHAT OUR FRAME SUBMISSION LOOKS LIKE
1) Block on last frame’s present job (i.e. NOT the fence, the actual job we spawned)
2) Process any pending resource transitions from newly created resources
3) Generate all pending unordered commands, by generating into 1 or more cmd buffers
4) Send signals to the issuers of unordered commands, to notify them the commands are submitted
5) Begin translation of Nitrous cmds into Mantle cmds – usually 100-500 jobs across all cores
6) Flush the deletion queues for this frame (likely a few frames old at this point)
7) Any item in our master deletion queue, add to the now empty deletion queue for this frame
8) Handle memory readbacks
9) Spawn Present job
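Steps 1 and 9 above, blocking on the previous present job rather than on the present itself, can be sketched as follows. This is an illustration only: `PresentPipeline` is a hypothetical name, and `std::async` stands in for spawning a job on the engine's own job system, which would not create a thread per present.

```cpp
#include <cassert>
#include <future>
#include <utility>

class PresentPipeline {
public:
    // Step 1: wait for last frame's present job (the job, not a GPU fence).
    // Step 9: spawn this frame's present job and return without waiting for it.
    template <typename Job>
    void SubmitFrame(Job&& presentJob) {
        if (pending_.valid()) pending_.get();
        pending_ = std::async(std::launch::async, std::forward<Job>(presentJob));
    }

    // Waits for the last outstanding present job, e.g. at shutdown.
    void Drain() {
        if (pending_.valid()) pending_.get();
        assert(!pending_.valid());  // nothing is left in flight
    }

private:
    std::future<void> pending_;
};
```

The main thread only ever waits for a present that was issued a full frame earlier, so it rarely stalls in practice.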
37. 37 | Nitrous and Mantle | 19 March 2014
FUTURE WORK
Now have explicit control over Multi GPU
Can write better MGPU solutions, like split screen which will not increase latency
– We just got rid of a bunch of latency, don’t want to add it back!
Asymmetric GPU use situations are doable – e.g. using integrated graphics in tandem with a discrete GPU
38. 38 | Nitrous and Mantle | 19 March 2014
RESULTS
Star Swarm surprised both Oxide and AMD
– We were not expecting to see cases where the application was 300-400% faster; still room for optimizations
– Right now, we are clearly GPU bound; will release an update soon that increases CPU utilization a little bit to optimize GPU, expecting 10-20% more performance out of Mantle on high-end GPUs
Driver overhead very consistent, well correlated to number of calls made
About 2 man-months of work
– For an alpha API; likely 1 month if final version
Especially telling on slower CPUs; surprising number of cases with high-end GPUs paired with old CPUs
Try for yourself: Star Swarm is free to download on Steam!