
About this blog

It really was Fruny

Entries in this blog

Washu
In our previous entry we started rendering a UI overlay on our application. We added some basic interactivity in that we can update a health bar by sending javascript commands. However, what if we want the UI to be able to notify us of actions? That's what we're going to cover in this entry. In addition, we're going to add mouse input forwarding to Awesomium, so that it can respond to events that can be triggered through mouse input.

Introduction




Adding mouse input forwarding is fairly simple: you need to trap the WM_MOUSEMOVE, WM_LBUTTONDOWN, and WM_LBUTTONUP Windows messages. These messages cover pretty much our entire use case, which is the ability to click buttons and detect mouse over events. Handling these events is fairly trivial, as you can see from the snippet below for WM_MOUSEMOVE:

    LRESULT OnMouseMove(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) {
        int xPos = GET_X_LPARAM(lParam);
        int yPos = GET_Y_LPARAM(lParam);
        if (m_view) {
            // Bitwise AND: test whether the left button is held during the move.
            if (wParam & MK_LBUTTON)
                m_view->InjectMouseDown(Awesomium::kMouseButton_Left);
            m_view->InjectMouseMove(xPos, yPos);
        }
        return 0;
    }

If you've been using a nice UI stylesheet, then it should automatically start highlighting things now, and you should be able to tell when you have given focus to things like buttons. More importantly, you can now click various links, and if you're using the HTML from the previous post, you can now show and hide the quest tracker.
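To make the message decoding above a little more concrete, here is a minimal, portable sketch of what the handler extracts from wParam and lParam. It deliberately avoids windows.h: `kLeftButtonFlag` stands in for MK_LBUTTON (0x0001), and `MouseMove` and `DecodeMouseMove` are hypothetical names invented for this illustration, not part of Win32 or Awesomium.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the Win32 MK_LBUTTON flag (0x0001), redefined here so the
// sketch compiles without <windows.h>.
constexpr std::uintptr_t kLeftButtonFlag = 0x0001;

struct MouseMove {
    int x;
    int y;
    bool leftButtonDown;
};

// Hypothetical helper: unpacks the wParam/lParam pair of a mouse-move message.
MouseMove DecodeMouseMove(std::uintptr_t wParam, std::intptr_t lParam) {
    MouseMove result;
    // The low and high 16-bit words hold signed client coordinates, which is
    // what GET_X_LPARAM and GET_Y_LPARAM extract.
    result.x = static_cast<std::int16_t>(static_cast<std::uint16_t>(lParam & 0xFFFF));
    result.y = static_cast<std::int16_t>(static_cast<std::uint16_t>((lParam >> 16) & 0xFFFF));
    // A bitwise AND tests a single flag. A logical AND would be true for any
    // non-zero wParam, for example when only the right button is held.
    result.leftButtonDown = (wParam & kLeftButtonFlag) != 0;
    return result;
}
```

The signed 16-bit conversion matters once the mouse is captured and dragged outside the client area, where coordinates can go negative.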

Getting Feedback From the UI


One of our goals is to allow the UI to tell the game things. For instance, if the user clicks on the skill button at the bottom, we expect it to execute whatever skill is bound to that button. To do this we need to bind a global javascript object to the Awesomium WebView, and then map the C++ functions we desire to call onto javascript functions we add to the global object.

We do this fairly simply, using a map of ID and javascript function name to std::function objects:

    m_jsApp = m_view->CreateGlobalJavascriptObject(Awesomium::WSLit("app"));
    Awesomium::JSObject & appObject = m_jsApp.ToObject();
    appObject.SetCustomMethod(Awesomium::WSLit("skill"), false);
    JsCallerKey key(appObject.remote_id(), Awesomium::WSLit("skill"));
    m_jsFunctions[key] = std::bind(&MainWindow::OnSkill, this, std::placeholders::_1, std::placeholders::_2);

In this case we're binding the OnSkill non-static member function to the javascript function "skill" in the "app" object. We could have also used a lambda here, or a static function as well.

Of course, since there's no actual relationship between the javascript function name and the C++ function, we need to build a binding system. Thankfully, Awesomium comes with a method handler which allows it to notify us whenever a javascript function is invoked on our global object. In our case, for simplicity, we implement the interface on the MainWindow class. In general, however, I would recommend implementing this on a separate object entirely.

    m_view->set_js_method_handler(this);

After this we just have to implement the two methods it requires, OnMethodCall and OnMethodCallWithReturnValue, and have them query our map for any functions that match the object ID and function name specified. If the function is found, we invoke it with the expected parameters:

    void OnMethodCall(Awesomium::WebView * caller, unsigned remoteObjectId, Awesomium::WebString const & methodName, Awesomium::JSArray const & args) {
        JsCallerKey key(remoteObjectId, methodName);
        auto itor = m_jsFunctions.find(key);
        if (itor != m_jsFunctions.end()) {
            itor->second(caller, args);
        }
    }

With this in place, and our app.skill function bound, our HTML can trivially invoke it, for instance by calling app.skill(1) from a button's onclick handler.

We now have the capability to allow the Awesomium UI to communicate with our game in a meaningful and event driven manner.
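The binding map is easier to see in isolation. The following is a rough, self-contained sketch of the same lookup-and-invoke idea, with the Awesomium types swapped out for standard ones so it compiles on its own: `MethodName`, `Args`, and `MethodDispatcher` are hypothetical stand-ins, and a real handler would receive Awesomium::WebString and Awesomium::JSArray instead.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-ins for the Awesomium types: a method name is a std::string and the
// argument array is a vector of ints.
using MethodName = std::string;
using Args = std::vector<int>;

// The key is (remote object ID, method name), mirroring JsCallerKey.
using CallerKey = std::pair<unsigned, MethodName>;
using Callback = std::function<void(Args const &)>;

class MethodDispatcher {
public:
    // Associates a C++ callback with a method name on a specific object.
    void Bind(unsigned objectId, MethodName const & name, Callback callback) {
        m_callbacks[CallerKey(objectId, name)] = std::move(callback);
    }

    // Returns true if a callback was bound for this object/method pair and
    // has been invoked, false if the call was not recognised.
    bool Dispatch(unsigned objectId, MethodName const & name, Args const & args) const {
        auto itor = m_callbacks.find(CallerKey(objectId, name));
        if (itor == m_callbacks.end())
            return false;
        itor->second(args);
        return true;
    }

private:
    std::map<CallerKey, Callback> m_callbacks;
};
```

Keeping the dispatcher in its own class like this is one way to follow the advice of not implementing the handler interface on the window itself.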

More Efficient Rendering


One of the other problems we're going to encounter is determining when input should be directed to the UI layer, and when input should be directed to the game systems.

Along with this we also find ourselves in a position to do a bit of optimization of our rendering. In our previous code we were using the UpdateSubresource call to update portions of our texture (created with D3D11_USAGE_DEFAULT). This has several issues:

  • It creates a copy of the memory passed into it.
  • We cannot later query for information about the UI overlay.
  • The source and destination textures must be in the same format.

Now, we're not going to be changing the backing format (although you might want to for various reasons). However, by switching to a better method we can reduce our overall overhead, query the texture for information, and gain the ability to change said formats.

Our methodology will be to use a staging texture with the D3D11_CPU_ACCESS_READ and D3D11_CPU_ACCESS_WRITE flags. Why read? The simple answer is: we will eventually want to know when a pixel of the UI is transparent. This way we can determine if the mouse is currently over a UI element, or if the mouse is in the gameplay view.
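As a rough illustration of how that transparency query can drive input routing, here is a small CPU-only sketch. It assumes the overlay pixels are available as 4-byte B, G, R, A values with a row pitch in bytes (as a mapped staging texture would provide). `OverlayPixels`, `InputTarget`, and `RouteMouse` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class InputTarget { UI, Game };

// CPU-side view of the overlay: 4 bytes per pixel in B, G, R, A order.
struct OverlayPixels {
    int width;
    int height;
    std::size_t rowPitch;  // bytes per row, which may exceed width * 4
    std::vector<std::uint8_t> data;
};

// Sends the mouse to the UI only when the pixel under it is not fully
// transparent. Positions outside the overlay go to the game.
InputTarget RouteMouse(OverlayPixels const & overlay, int x, int y) {
    if (x < 0 || y < 0 || x >= overlay.width || y >= overlay.height)
        return InputTarget::Game;
    // Alpha is the fourth byte of each pixel in this layout.
    std::size_t alphaIndex = overlay.rowPitch * static_cast<std::size_t>(y)
                           + static_cast<std::size_t>(x) * 4 + 3;
    return overlay.data[alphaIndex] != 0 ? InputTarget::UI : InputTarget::Game;
}
```

A window procedure could call something like this before deciding whether to inject a click into the web view or hand it to the game systems.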

For updating the rendered texture, we simply map our staging resource, and then we run through a series of memcpy calls to copy each changed row of the texture over:

    D3D11_MAPPED_SUBRESOURCE resource;
    m_context->Map(m_staging, 0, D3D11_MAP_WRITE, 0, &resource);
    // 4 bytes per pixel; srcRowSpan and RowPitch are already in bytes.
    auto srcStartingOffset = srcRowSpan * srcRect.y + srcRect.x * 4;
    uint8_t * srcPtr = srcBuffer + srcStartingOffset;
    auto dstStartingOffset = resource.RowPitch * destRect.y + destRect.x * 4;
    uint8_t * dataPtr = reinterpret_cast<uint8_t *>(resource.pData) + dstStartingOffset;
    for (int i = 0; i < destRect.height; ++i) {
        memcpy(dataPtr + resource.RowPitch * i, srcPtr + srcRowSpan * i, destRect.width * 4);
    }
    m_context->Unmap(m_staging, 0);

Once that's complete, we can simply ask Direct3D11 to copy the updated portion of the staging texture over to our rendered texture:

    m_context->CopySubresourceRegion(m_texture, 0, destRect.x, destRect.y, 0, m_staging, 0, &box);

With this in hand, we can also map our staging texture in for reading, and simply ask it if a particular pixel (at an X,Y position) is fully transparent:

    bool IsUIPixel(unsigned x, unsigned y) {
        D3D11_MAPPED_SUBRESOURCE resource;
        m_context->Map(m_staging, 0, D3D11_MAP_READ, 0, &resource);
        // Rows are RowPitch bytes apart, which may be larger than m_width * 4.
        // In a B8G8R8A8 texture the alpha channel is the fourth byte of the pixel.
        auto startingOffset = resource.RowPitch * y + x * 4 + 3;
        uint8_t * dataPtr = reinterpret_cast<uint8_t *>(resource.pData) + startingOffset;
        bool result = *dataPtr != 0;
        m_context->Unmap(m_staging, 0);
        return result;
    }

This function will return true if the pixel queried has any opaqueness to it (i.e. it is not fully transparent).
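The row-by-row copy is the part most likely to go wrong, since the source and destination rows are usually different distances apart. Below is a small sketch of the same loop on plain byte buffers, without Direct3D. `Rect`, `CopyRect`, and `kBytesPerPixel` are hypothetical names for this illustration, and the sketch assumes both rectangles have the same size and 4 bytes per pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Rect { int x, y, width, height; };

constexpr std::size_t kBytesPerPixel = 4;

// Copies a rectangle of pixels between two buffers whose rows are
// srcPitch and dstPitch bytes apart respectively.
void CopyRect(std::uint8_t const * src, std::size_t srcPitch, Rect const & srcRect,
              std::uint8_t * dst, std::size_t dstPitch, Rect const & dstRect) {
    // Assumption of this sketch: no scaling, the rectangles match in size.
    assert(srcRect.width == dstRect.width && srcRect.height == dstRect.height);

    // Byte offset of the first pixel of each rectangle.
    std::uint8_t const * srcPtr = src + srcPitch * srcRect.y + srcRect.x * kBytesPerPixel;
    std::uint8_t * dstPtr = dst + dstPitch * dstRect.y + dstRect.x * kBytesPerPixel;

    // Only the changed pixels of each row are copied. Each buffer advances by
    // its own pitch, never by the rectangle width.
    for (int row = 0; row < dstRect.height; ++row) {
        std::memcpy(dstPtr + dstPitch * row, srcPtr + srcPitch * row,
                    dstRect.width * kBytesPerPixel);
    }
}
```

With a mapped staging texture, `dst` and `dstPitch` would correspond to pData and RowPitch, and `src` and `srcPitch` to the buffer and row span passed to Paint.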

Full Sample


    #define NOMINMAX#include #include #include #include #include #include #include #include #include #include #pragma comment(lib, "d3d11.lib")#pragma comment(lib, "awesomium.lib")#include #include #include #include #include #include #include #include #include #ifdef UNICODEtypedef wchar_t tchar;typedef std::wstring tstring;templatetstring to_string(T t) { return std::to_wstring(t);}#elsetypedef char tchar;typedef std::string tstring;templatetstring to_string(T t) { return std::to_string(t);}#endifstruct Vertex { float position[4]; float color[4]; float texCoord[2]; static const unsigned Stride = sizeof(float) * 10; static const unsigned Offset = 0;};void ThrowIfFailed(HRESULT result, std::string const & text) { if (FAILED(result)) throw std::runtime_error(text + "");}class RenderTarget {public: RenderTarget(ID3D11Texture2D * texture, bool hasDepthBuffer) : m_texture(texture) { CComPtr device; texture->GetDevice(&device); auto result = device->CreateRenderTargetView(m_texture, nullptr, &m_textureRTV); ThrowIfFailed(result, "Failed to create back buffer render target."); m_viewport = CD3D11_VIEWPORT(m_texture, m_textureRTV); result = device->CreateTexture2D(&CD3D11_TEXTURE2D_DESC(DXGI_FORMAT_D32_FLOAT, static_cast(m_viewport.Width), static_cast(m_viewport.Height), 1, 1, D3D11_BIND_DEPTH_STENCIL), nullptr, &m_depthBuffer); ThrowIfFailed(result, "Failed to create depth buffer."); result = device->CreateDepthStencilView(m_depthBuffer, nullptr, &m_depthView); ThrowIfFailed(result, "Failed to create depth buffer render target."); } void Clear(ID3D11DeviceContext * context, float color[4], bool clearDepth = true) { context->ClearRenderTargetView(m_textureRTV, color); if (clearDepth && m_depthView) context->ClearDepthStencilView(m_depthView, D3D11_CLEAR_DEPTH, 1.0f, 0); } void SetTarget(ID3D11DeviceContext * context) { context->OMSetRenderTargets(1, &m_textureRTV.p, m_depthView); context->RSSetViewports(1, &m_viewport); }private: D3D11_VIEWPORT m_viewport; CComPtr 
m_depthBuffer; CComPtr m_depthView; CComPtr m_texture; CComPtr m_textureRTV;};class GraphicsDevice {public: GraphicsDevice(HWND window, int width, int height) { D3D_FEATURE_LEVEL levels[] = { D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0, }; DXGI_SWAP_CHAIN_DESC desc = { { width, height, { 1, 60 }, DXGI_FORMAT_R8G8B8A8_UNORM, DXGI_MODE_SCANLINE_ORDER_UNSPECIFIED, DXGI_MODE_SCALING_UNSPECIFIED }, { 1, 0 }, DXGI_USAGE_BACK_BUFFER | DXGI_USAGE_RENDER_TARGET_OUTPUT, 1, window, TRUE, DXGI_SWAP_EFFECT_DISCARD, 0 }; auto result = D3D11CreateDeviceAndSwapChain( nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, D3D11_CREATE_DEVICE_DEBUG | D3D11_CREATE_DEVICE_BGRA_SUPPORT, levels, sizeof(levels) / sizeof(D3D_FEATURE_LEVEL), D3D11_SDK_VERSION, &desc, &m_swapChain, &m_device, &m_featureLevel, &m_context ); ThrowIfFailed(result, "Failed to create D3D11 device."); } void Resize(int width, int height) { if (m_renderTarget) m_renderTarget.reset(); auto result = m_swapChain->ResizeBuffers(1, width, height, DXGI_FORMAT_UNKNOWN, 0); ThrowIfFailed(result, "Failed to resize back buffer."); CComPtr backBuffer; result = m_swapChain->GetBuffer(0, __uuidof(ID3D11Texture2D), reinterpret_cast(&backBuffer)); ThrowIfFailed(result, "Failed to retrieve back buffer surface."); m_renderTarget = std::make_unique(backBuffer, true); } void SetAndClearTarget() { static float color[] { 0, 0, 0, 0}; if (!m_renderTarget) return; m_renderTarget->Clear(m_context, color); m_renderTarget->SetTarget(m_context); } void Present() { m_swapChain->Present(0, 0); } ID3D11Device * GetDevice() { return m_device; } ID3D11DeviceContext * GetDeviceContext() { return m_context; }private: D3D_FEATURE_LEVEL m_featureLevel; CComPtr m_device; CComPtr m_context; CComPtr m_swapChain; std::unique_ptr m_renderTarget;};class D3DSurface : public Awesomium::Surface {public: D3DSurface(ID3D11DeviceContext * context, Awesomium::WebView * view, int width, int height) : m_context(context), 
m_view(view), m_width(width), m_height(height) { CComPtr device; context->GetDevice(&device); auto result = device->CreateTexture2D(&CD3D11_TEXTURE2D_DESC(DXGI_FORMAT_B8G8R8A8_UNORM, width, height, 1, 1), nullptr, &m_texture); result = device->CreateShaderResourceView(m_texture, nullptr, &m_textureView); result = device->CreateTexture2D(&CD3D11_TEXTURE2D_DESC(DXGI_FORMAT_B8G8R8A8_UNORM, width, height, 1, 1, 0, D3D11_USAGE_STAGING, D3D11_CPU_ACCESS_READ | D3D11_CPU_ACCESS_WRITE), nullptr, &m_staging); } virtual void Paint(unsigned char *srcBuffer, int srcRowSpan, const Awesomium::Rect &srcRect, const Awesomium::Rect &destRect) { auto box = CD3D11_BOX(destRect.x, destRect.y, 0, destRect.x + destRect.width, destRect.y + destRect.height, 1); D3D11_MAPPED_SUBRESOURCE resource; m_context->Map(m_staging, 0, D3D11_MAP_WRITE, 0, &resource); auto srcStartingOffset = srcRowSpan * srcRect.y + srcRect.x * 4; uint8_t * srcPtr = srcBuffer + srcStartingOffset; auto dstStartingOffset = resource.RowPitch * destRect.y + destRect.x * 4; uint8_t * dataPtr = reinterpret_cast(resource.pData) + dstStartingOffset; for (int i = 0; i < destRect.height; ++i) { memcpy(dataPtr + resource.RowPitch * i, srcPtr + srcRowSpan * i, destRect.width * 4); } m_context->Unmap(m_staging, 0); m_context->CopySubresourceRegion(m_texture, 0, destRect.x, destRect.y, 0, m_staging, 0, &box); } virtual void Scroll(int dx, int dy, const Awesomium::Rect &clipRect) { auto box = CD3D11_BOX(clipRect.x, clipRect.y, 0, clipRect.x + clipRect.width, clipRect.y + clipRect.height, 1); m_context->CopySubresourceRegion(m_texture, 0, clipRect.x + dx, clipRect.y + dy, 0, m_texture, 0, &box); } void Bind() { m_context->PSSetShaderResources(0, 1, &m_textureView.p); } bool IsUIPixel(unsigned x, unsigned y) { D3D11_MAPPED_SUBRESOURCE resource; m_context->Map(m_staging, 0, D3D11_MAP_READ, 0, &resource); auto startingOffset = (m_width * y + x) * 4; uint8_t * dataPtr = reinterpret_cast(resource.pData) + startingOffset; bool result = 
*dataPtr != 0; m_context->Unmap(m_staging, 0); return result; } virtual ~D3DSurface() { }private: CComPtr m_textureView; CComPtr m_texture; CComPtr m_staging; ID3D11DeviceContext * m_context; Awesomium::WebView * m_view; int m_width; int m_height;};class D3DSurfaceFactory : public Awesomium::SurfaceFactory {public: D3DSurfaceFactory(ID3D11DeviceContext * context) : m_context(context) { } virtual Awesomium::Surface * CreateSurface(Awesomium::WebView * view, int width, int height) { return new D3DSurface(m_context, view, width, height); } virtual void DestroySurface(Awesomium::Surface * surface) { delete surface; }private: ID3D11DeviceContext * m_context;};class MainWindow : public CWindowImpl, public Awesomium::JSMethodHandler {public: DECLARE_WND_CLASS_EX(ClassName, CS_OWNDC | CS_HREDRAW | CS_VREDRAW, COLOR_BACKGROUND + 1); MainWindow(Awesomium::WebCore * webCore) : m_webCore(webCore), m_view(nullptr, [](Awesomium::WebView * ptr) { ptr->Destroy(); }), m_isMaximized(false), m_surface(nullptr) { RECT rect = { 0, 0, 800, 600 }; AdjustWindowRectEx(&rect, GetWndStyle(0), FALSE, GetWndExStyle(0)); Create(nullptr, RECT{ 0, 0, rect.right - rect.left, rect.bottom - rect.top }, WindowName); ShowWindow(SW_SHOW); UpdateWindow(); } void Run() { MSG msg; while (true) { if (PeekMessage(&msg, 0, 0, 0, PM_REMOVE)) { if (msg.message == WM_QUIT) break; TranslateMessage(&msg); DispatchMessage(&msg); } else { Update(); } } } void Update() { auto context = m_device->GetDeviceContext(); m_webCore->Update(); if (m_view->IsLoading()) { m_isLoading = true; } else if (m_isLoading) { m_isLoading = false; UpdateBossHealth(); m_webCore->Update(); m_surface = static_cast(m_view->surface()); } m_device->SetAndClearTarget(); context->OMSetBlendState(m_blendState, nullptr, ~0); context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST); context->IASetIndexBuffer(nullptr, (DXGI_FORMAT)0, 0); context->IASetVertexBuffers(0, 1, &m_vertexBuffer.p, &Vertex::Stride, &Vertex::Offset); 
context->IASetInputLayout(m_inputLayout); context->VSSetShader(m_triangleVS, nullptr, 0); context->PSSetShader(m_trianglePS, nullptr, 0); context->Draw(3, 0); context->IASetVertexBuffers(0, 0, nullptr, nullptr, nullptr); context->IASetInputLayout(nullptr); context->VSSetShader(m_vertexShader, nullptr, 0); context->PSSetShader(m_pixelShader, nullptr, 0); context->PSSetSamplers(0, 1, &m_sampler.p); if (m_surface) m_surface->Bind(); context->Draw(3, 0); m_device->Present(); } void OnMethodCall(Awesomium::WebView * caller, unsigned remoteObjectId, Awesomium::WebString const & methodName, Awesomium::JSArray const & args) { JsCallerKey key(remoteObjectId, methodName); auto itor = m_jsFunctions.find(key); if (itor != m_jsFunctions.end()) { itor->second(caller, args); } } Awesomium::JSValue Awesomium::JSMethodHandler::OnMethodCallWithReturnValue(Awesomium::WebView * caller, unsigned remoteObjectId, Awesomium::WebString const & methodName, Awesomium::JSArray const & args) { JsCallerKey key(remoteObjectId, methodName); auto itor = m_jsFunctionsWithRetValue.find(key); if (itor != m_jsFunctionsWithRetValue.end()) { return itor->second(caller, args); } return Awesomium::JSValue(); }private: BEGIN_MSG_MAP(MainWindow) MESSAGE_HANDLER(WM_DESTROY, [](unsigned messageId, WPARAM wParam, LPARAM lParam, BOOL & handled) { PostQuitMessage(0); return 0; }); MESSAGE_HANDLER(WM_CREATE, OnCreate); MESSAGE_HANDLER(WM_SIZE, OnSize); MESSAGE_HANDLER(WM_EXITSIZEMOVE, OnSizeFinish); MESSAGE_HANDLER(WM_KEYUP, OnKeyUp); MESSAGE_HANDLER(WM_MOUSEMOVE, OnMouseMove); MESSAGE_HANDLER(WM_LBUTTONDOWN, OnMouseLButtonDown); MESSAGE_HANDLER(WM_LBUTTONUP, OnMouseLButtonUp); END_MSG_MAP()private: LRESULT OnCreate(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { try { RECT rect; GetClientRect(&rect); m_device = std::make_unique(m_hWnd, rect.right, rect.bottom); tstring filename(MAX_PATH, 0); GetModuleFileName(GetModuleHandle(nullptr), &filename.front(), filename.length()); filename = 
filename.substr(0, filename.find_last_of('\\')); SetCurrentDirectory(filename.c_str()); CreateD3DResources(); m_surfaceFactory = std::make_unique(m_device->GetDeviceContext()); m_webCore->set_surface_factory(m_surfaceFactory.get()); CreateWebView(rect.right, rect.bottom); m_device->Resize(rect.right, rect.bottom); } catch (std::runtime_error & ex) { std::cout << ex.what() << std::endl; return -1; } return 0; } LRESULT OnSizeFinish(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { try { RECT clientRect; GetClientRect(&clientRect); m_device->Resize(clientRect.right, clientRect.bottom); if (m_view->IsLoading()) m_view->Stop(); m_surface = nullptr; CreateWebView(clientRect.right, clientRect.bottom); } catch (std::runtime_error & ex) { std::cout << ex.what() << std::endl; } return 0; } LRESULT OnSize(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { if (wParam == SIZE_MAXIMIZED) { m_isMaximized = true; return OnSizeFinish(message, wParam, lParam, handled); } else { if (m_isMaximized) { m_isMaximized = false; return OnSizeFinish(message, wParam, lParam, handled); } } return 0; } LRESULT OnKeyUp(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { if (wParam == 'A' && m_view) { --m_bossHealth; UpdateBossHealth(); } return DefWindowProc(message, wParam, lParam); } LRESULT OnMouseMove(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { int xPos = GET_X_LPARAM(lParam); int yPos = GET_Y_LPARAM(lParam); if (m_view) { if (wParam && MK_LBUTTON) m_view->InjectMouseDown(Awesomium::kMouseButton_Left); m_view->InjectMouseMove(xPos, yPos); } return 0; } LRESULT OnMouseLButtonDown(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { if (m_view) { m_view->InjectMouseDown(Awesomium::kMouseButton_Left); } return 0; } LRESULT OnMouseLButtonUp(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { if (m_view) { m_view->InjectMouseUp(Awesomium::kMouseButton_Left); } return 0; }private: void 
CreateD3DResources() { auto device = m_device->GetDevice(); std::vector vs(std::istreambuf_iterator(std::ifstream("FullScreenTriangleVS.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); auto result = device->CreateVertexShader(&vs.front(), vs.size(), nullptr, &m_vertexShader); ThrowIfFailed(result, "Could not create vertex shader."); std::vector ps(std::istreambuf_iterator(std::ifstream("FullScreenTrianglePS.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); result = device->CreatePixelShader(&ps.front(), ps.size(), nullptr, &m_pixelShader); ThrowIfFailed(result, "Could not create pixel shader."); result = device->CreateSamplerState(&CD3D11_SAMPLER_DESC(CD3D11_DEFAULT()), &m_sampler); ThrowIfFailed(result, "Could not create sampler state."); vs.assign(std::istreambuf_iterator(std::ifstream("TriangleVS.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); result = device->CreateVertexShader(&vs.front(), vs.size(), nullptr, &m_triangleVS); ThrowIfFailed(result, "Could not create vertex shader."); ps.assign(std::istreambuf_iterator(std::ifstream("TrianglePS.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); result = device->CreatePixelShader(&ps.front(), ps.size(), nullptr, &m_trianglePS); ThrowIfFailed(result, "Could not create pixel shader."); std::vector inputElementDesc = { { "SV_POSITION", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 0 }, { "COLOR", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 4 * sizeof(float) }, }; result = device->CreateInputLayout(&inputElementDesc.front(), inputElementDesc.size(), &vs.front(), vs.size(), &m_inputLayout); ThrowIfFailed(result, "Unable to create input layout."); // Hard coded triangle. Tis a silly idea, but works for the sample. 
Vertex vertices[] = { { { 0.0f, 0.5f, 0.5f, 1.0f }, { 1.0f, 0.0f, 0.0f, 1.0f }, { 0.5f, 1.0f } }, { { 0.5f, -0.5f, 0.5f, 1.0f }, { 0.0f, 1.0f, 0.0f, 1.0f }, { 0.0f, 0.0f } }, { { -0.5f, -0.5f, 0.5f, 1.0f }, { 0.0f, 0.0f, 1.0f, 1.0f }, { 1.0f, 0.0f } }, }; D3D11_BUFFER_DESC desc = { sizeof(vertices), D3D11_USAGE_DEFAULT, D3D11_BIND_VERTEX_BUFFER }; D3D11_SUBRESOURCE_DATA data = { vertices }; result = device->CreateBuffer(&desc, &data, &m_vertexBuffer); ThrowIfFailed(result, "Failed to create vertex buffer."); D3D11_BLEND_DESC blendDesc; blendDesc.AlphaToCoverageEnable = false; blendDesc.IndependentBlendEnable = false; blendDesc.RenderTarget[0] = { true, D3D11_BLEND_SRC_ALPHA, D3D11_BLEND_INV_SRC_ALPHA, D3D11_BLEND_OP_ADD, D3D11_BLEND_ONE, D3D11_BLEND_ZERO, D3D11_BLEND_OP_ADD, D3D11_COLOR_WRITE_ENABLE_ALL }; device->CreateBlendState(&blendDesc, &m_blendState); } void UpdateBossHealth() { auto javascript = std::string("$('#progressbar').progressbar({ value: ") + std::to_string(m_bossHealth) + "}); "; m_view->ExecuteJavascript(Awesomium::ToWebString(javascript), Awesomium::WSLit("")); } void CreateWebView(int width, int height) { m_view.reset(m_webCore->CreateWebView(width, height, nullptr, Awesomium::kWebViewType_Offscreen)); CreateAndSetJSFunctions(); m_view->SetTransparent(true); Awesomium::WebURL url(Awesomium::WSLit(URL)); m_view->LoadURL(url); } void CreateAndSetJSFunctions() { m_view->set_js_method_handler(this); m_jsApp = m_view->CreateGlobalJavascriptObject(Awesomium::WSLit("app")); Awesomium::JSObject & appObject = m_jsApp.ToObject(); appObject.SetCustomMethod(Awesomium::WSLit("skill"), false); JsCallerKey key(appObject.remote_id(), Awesomium::WSLit("skill")); m_jsFunctions[key] = std::bind(&MainWindow::OnSkill, this, std::placeholders::_1, std::placeholders::_2); } Awesomium::JSValue OnSkill(Awesomium::WebView * view, Awesomium::JSArray const & args) { if (args.size() == 0) return Awesomium::JSValue(); Awesomium::JSValue const & arg = args[0]; if 
(!arg.IsInteger()) return Awesomium::JSValue(); switch (arg.ToInteger()) { case 1: --m_bossHealth; UpdateBossHealth(); break; default: break; } return Awesomium::JSValue(); }private: typedef std::pair JsCallerKey; typedef std::function JsFunction; std::unique_ptr m_device; std::unique_ptr m_surfaceFactory; std::unique_ptr m_view; Awesomium::WebCore * m_webCore; D3DSurface * m_surface; CComPtr m_pixelShader; CComPtr m_vertexShader; CComPtr m_sampler; CComPtr m_blendState; CComPtr m_vertexBuffer; CComPtr m_trianglePS; CComPtr m_triangleVS; CComPtr m_inputLayout; Awesomium::JSValue m_jsApp; std::map m_jsFunctions; std::map m_jsFunctionsWithRetValue; int m_bossHealth = 100; bool m_isLoading; bool m_isMaximized; private: static const tchar * ClassName; static const tchar * WindowName; static const char * URL;};const tchar * MainWindow::WindowName = _T("DX Window");const tchar * MainWindow::ClassName = _T("GameWindowClass");const char * MainWindow::URL = "file : ///./Resources/UIInterface.html";int main() { Awesomium::WebCore * webCore = Awesomium::WebCore::Initialize(Awesomium::WebConfig()); { MainWindow window(webCore); window.Run(); } Awesomium::WebCore::Shutdown();}
Washu
In the previous entry we built up a basic sample that loads a web page and uploads it to a texture, which we then rendered to a full screen triangle. In this entry we're going to work on optimizing that process a bit, and making it so that our texture updates whenever the source updates.

Introduction


One of the problems with our current mechanism is that we are not handling page updates. That is, if the page has animations, images that load after some time, or other dynamic content, our image will not be updated with the new information. This poses a problem for us if we're going to use something like Awesomium for a game UI.

The solution is to not stop updating the web client, and to update the image every time it changes. There are, however, a few problems with this:

  • Updating an entire image is slow, especially if the image takes up the entire screen.
  • Rarely does the entire page change, so why update the entire image when only a small portion of it has changed?

We can solve this easily enough by simply knowing which parts of the image need to be changed, which is where Awesomium::Surface and Awesomium::SurfaceFactory come into play.

The Surface Factory


Awesomium lets us supply a custom surface for it to render to. The rendering code then simply calls to the surface and asks it to blit certain rectangles of data, which we can then translate into the appropriate texture updates. In order for Awesomium to construct one of our surfaces, it needs us to provide it with a factory instance capable of constructing the surface. This is where our D3DSurfaceFactory comes in.

The D3DSurfaceFactory is a simple factory: we create an instance of it and pass it the Direct3D context to use for updating textures created by the factory. As you can see below, the implementations of the creation and release methods are fairly trivial, being mostly there to pass through any necessary state:

    virtual Awesomium::Surface * CreateSurface(Awesomium::WebView * view, int width, int height) {
        return new D3DSurface(m_context, view, width, height);
    }

    virtual void DestroySurface(Awesomium::Surface * surface) {
        delete surface;
    }

With this state passed through we can move onto the meat of our changes... the D3DSurface.

The D3DSurface


The D3DSurface is our implementation of the Awesomium::Surface interface. The Surface interface expects us to provide two methods: Paint, which is called whenever a rectangle of the surface needs to be updated, and Scroll, which is used to scroll a portion of the view. In our case we actually don't care about scrolling, and so we'll leave this method blank. Our paint method, as can be seen below, is fairly trivial:

    virtual void Paint(unsigned char *srcBuffer, int srcRowSpan, const Awesomium::Rect &srcRect, const Awesomium::Rect &destRect) {
        auto box = CD3D11_BOX(destRect.x, destRect.y, 0, destRect.x + destRect.width, destRect.y + destRect.height, 1);
        // 4 bytes per pixel, srcRowSpan is already in bytes.
        auto startingOffset = srcRowSpan * srcRect.y + srcRect.x * 4;
        m_context->UpdateSubresource(m_texture, 0, &box, srcBuffer + startingOffset, srcRowSpan, 0);
    }

All this really does is convert the destination rectangle into a box, and then calculate the appropriate starting byte in the source buffer. We then simply pass this on through to UpdateSubresource, which does the bulk work of copying our data and uploading it to the GPU.
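For readers who want to check the arithmetic in Paint, here is a tiny standalone sketch of the two conversions it performs. `Rect` and `Box` are hypothetical plain structs standing in for Awesomium::Rect and D3D11_BOX (whose right, bottom, and back edges are exclusive), and `ToBox` and `SourceOffset` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>

struct Rect { int x, y, width, height; };

// Same field order as D3D11_BOX. The right, bottom, and back edges are
// exclusive, so a 2D region uses front = 0 and back = 1.
struct Box { unsigned left, top, front, right, bottom, back; };

// Converts a position-plus-size rectangle into an edge-based box.
Box ToBox(Rect const & r) {
    return Box{
        static_cast<unsigned>(r.x),
        static_cast<unsigned>(r.y),
        0u,
        static_cast<unsigned>(r.x + r.width),
        static_cast<unsigned>(r.y + r.height),
        1u
    };
}

// Byte offset of the rectangle's first pixel in a buffer with 4 bytes per
// pixel and srcRowSpan bytes per row.
std::size_t SourceOffset(Rect const & srcRect, std::size_t srcRowSpan) {
    return srcRowSpan * static_cast<std::size_t>(srcRect.y)
         + static_cast<std::size_t>(srcRect.x) * 4;
}
```

For example, a rectangle at (2, 3) in a buffer with 400-byte rows starts 3 * 400 + 2 * 4 = 1208 bytes in.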

Miscellaneous other Bits


As a final set of pieces for this demo, we want to render our HTML over something. In this case, that colorful triangle from the first entry of this series.

To do this we need to change how our Awesomium view renders, and also set up a blend state so that our back buffer is blending data instead of overwriting it. Configuring Awesomium views to render transparent turns out to be trivial:

    m_view->SetTransparent(true);

Configuring a blend state is equally simple:

    blendDesc.AlphaToCoverageEnable = false;
    blendDesc.IndependentBlendEnable = false;
    blendDesc.RenderTarget[0] = {
        true,
        D3D11_BLEND_SRC_ALPHA,
        D3D11_BLEND_INV_SRC_ALPHA,
        D3D11_BLEND_OP_ADD,
        D3D11_BLEND_ONE,
        D3D11_BLEND_ZERO,
        D3D11_BLEND_OP_ADD,
        D3D11_COLOR_WRITE_ENABLE_ALL
    };
    device->CreateBlendState(&blendDesc, &m_blendState);

With these in place we can now render our colorful triangle:

    context->OMSetBlendState(m_blendState, nullptr, ~0);
    context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    context->IASetIndexBuffer(nullptr, (DXGI_FORMAT)0, 0);
    context->IASetVertexBuffers(0, 1, &m_vertexBuffer.p, &Vertex::Stride, &Vertex::Offset);
    context->IASetInputLayout(m_inputLayout);
    context->VSSetShader(m_triangleVS, nullptr, 0);
    context->PSSetShader(m_trianglePS, nullptr, 0);
    context->Draw(3, 0);

and then our UI:

    context->IASetVertexBuffers(0, 0, nullptr, nullptr, nullptr);
    context->IASetInputLayout(nullptr);
    context->VSSetShader(m_vertexShader, nullptr, 0);
    context->PSSetShader(m_pixelShader, nullptr, 0);
    context->PSSetSamplers(0, 1, &m_sampler.p);
    if (m_surface) m_surface->Bind();
    context->Draw(3, 0);

Simple enough eh?
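If the blend state settings look opaque, it may help to see the arithmetic they select. The sketch below reproduces, on the CPU, what the configuration above asks the output merger to compute per pixel: color is source times source alpha plus destination times one minus source alpha, and alpha is taken from the source alone. `Color` and `BlendOver` are hypothetical names for this illustration, and the sketch ignores the clamping and quantization a real UNORM render target applies.

```cpp
#include <cassert>

struct Color { float r, g, b, a; };

// Mirrors SrcBlend = SRC_ALPHA, DestBlend = INV_SRC_ALPHA, BlendOp = ADD for
// color, and SrcBlendAlpha = ONE, DestBlendAlpha = ZERO, BlendOpAlpha = ADD
// for alpha.
Color BlendOver(Color const & src, Color const & dst) {
    float const invSrcAlpha = 1.0f - src.a;
    return Color{
        src.r * src.a + dst.r * invSrcAlpha,
        src.g * src.a + dst.g * invSrcAlpha,
        src.b * src.a + dst.b * invSrcAlpha,
        src.a * 1.0f + dst.a * 0.0f
    };
}
```

A fully transparent UI pixel (alpha 0) therefore leaves the triangle's color untouched, while a half-transparent one mixes the two evenly.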

    An Example Game UI


    With our triangle rendered and our UI being displayed over it, it is time to come up with a UI. For this particular case I'm using jQuery and the jQuery UI kit. I've built a pretty simple HTML page to showcase what you can do: at the top is a progress bar that indicates something, perhaps the health of a boss monster. To the right is a set of quest goals for the currently selected quest. And lastly, at the bottom we have a button. It doesn't do anything yet, and perhaps we'll touch on getting that working next time.
    jQuery UI Progressbar - Default functionality
    • A quest goal!
    • Another quest goal!
    Lastly, we can send a message to our view any time we like using JavaScript. This allows us to update our boss health when the player hits a key; the progress bar changes its value, redrawing only that portion of the screen:

        LRESULT OnKeyUp(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) {
            if (wParam == 'A' && m_view) {
                --m_bossHealth;
                UpdateBossHealth();
            }
            return DefWindowProc(message, wParam, lParam);
        }

        void UpdateBossHealth() {
            auto javascript = std::string("$('#progressbar').progressbar({ value: ") + std::to_string(m_bossHealth) + "}); ";
            m_view->ExecuteJavascript(Awesomium::ToWebString(javascript), Awesomium::WSLit(""));
        }

    Now, every time we hit the 'A' key, the boss's health will decrease.
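
One thing the handler doesn't guard against is the health dropping below zero if you keep mashing the key. A small sketch of how the script string could be built with clamping follows; MakeProgressScript is a hypothetical helper, and the clamp to [0, 100] is an assumption of this sketch rather than something the sample does.

```cpp
#include <algorithm>
#include <string>

// Builds the jQuery UI call used to update the progress bar. Clamping to
// [0, 100] is an addition for this sketch; the sample passes the value
// through unchanged.
std::string MakeProgressScript(int health) {
    const int clamped = std::max(0, std::min(health, 100));
    return "$('#progressbar').progressbar({ value: "
         + std::to_string(clamped) + " });";
}
```

The resulting string would then be handed to ExecuteJavascript in the same way UpdateBossHealth does.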

    Full Sample


    #define NOMINMAX#include #include #include #include #include #include #include #include #include #include #pragma comment(lib, "d3d11.lib")#pragma comment(lib, "awesomium.lib")#include #include #include #include #include #include #include #ifdef UNICODEtypedef wchar_t tchar;typedef std::wstring tstring;templatetstring to_string(T t) { return std::to_wstring(t);}#elsetypedef char tchar;typedef std::string tstring;templatetstring to_string(T t) { return std::to_string(t);}#endifstruct Vertex { float position[4]; float color[4]; float texCoord[2]; static const unsigned Stride = sizeof(float) * 10; static const unsigned Offset = 0;};void ThrowIfFailed(HRESULT result, std::string const & text) { if (FAILED(result)) throw std::runtime_error(text + "");}class RenderTarget {public: RenderTarget(ID3D11Texture2D * texture, bool hasDepthBuffer) : m_texture(texture) { CComPtr device; texture->GetDevice(&device); auto result = device->CreateRenderTargetView(m_texture, nullptr, &m_textureRTV); ThrowIfFailed(result, "Failed to create back buffer render target."); m_viewport = CD3D11_VIEWPORT(m_texture, m_textureRTV); result = device->CreateTexture2D(&CD3D11_TEXTURE2D_DESC(DXGI_FORMAT_D32_FLOAT, static_cast(m_viewport.Width), static_cast(m_viewport.Height), 1, 1, D3D11_BIND_DEPTH_STENCIL), nullptr, &m_depthBuffer); ThrowIfFailed(result, "Failed to create depth buffer."); result = device->CreateDepthStencilView(m_depthBuffer, nullptr, &m_depthView); ThrowIfFailed(result, "Failed to create depth buffer render target."); } void Clear(ID3D11DeviceContext * context, float color[4], bool clearDepth = true) { context->ClearRenderTargetView(m_textureRTV, color); if (clearDepth && m_depthView) context->ClearDepthStencilView(m_depthView, D3D11_CLEAR_DEPTH, 1.0f, 0); } void SetTarget(ID3D11DeviceContext * context) { context->OMSetRenderTargets(1, &m_textureRTV.p, m_depthView); context->RSSetViewports(1, &m_viewport); }private: D3D11_VIEWPORT m_viewport; CComPtr m_depthBuffer; CComPtr 
m_depthView; CComPtr m_texture; CComPtr m_textureRTV;};class GraphicsDevice {public: GraphicsDevice(HWND window, int width, int height) { D3D_FEATURE_LEVEL levels[] = { D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0, }; DXGI_SWAP_CHAIN_DESC desc = { { width, height, { 1, 60 }, DXGI_FORMAT_R8G8B8A8_UNORM, DXGI_MODE_SCANLINE_ORDER_UNSPECIFIED, DXGI_MODE_SCALING_UNSPECIFIED }, { 1, 0 }, DXGI_USAGE_BACK_BUFFER | DXGI_USAGE_RENDER_TARGET_OUTPUT, 1, window, TRUE, DXGI_SWAP_EFFECT_DISCARD, 0 }; auto result = D3D11CreateDeviceAndSwapChain( nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, D3D11_CREATE_DEVICE_DEBUG | D3D11_CREATE_DEVICE_BGRA_SUPPORT, levels, sizeof(levels) / sizeof(D3D_FEATURE_LEVEL), D3D11_SDK_VERSION, &desc, &m_swapChain, &m_device, &m_featureLevel, &m_context ); ThrowIfFailed(result, "Failed to create D3D11 device."); } void Resize(int width, int height) { if (m_renderTarget) m_renderTarget.reset(); auto result = m_swapChain->ResizeBuffers(1, width, height, DXGI_FORMAT_UNKNOWN, 0); ThrowIfFailed(result, "Failed to resize back buffer."); CComPtr backBuffer; result = m_swapChain->GetBuffer(0, __uuidof(ID3D11Texture2D), reinterpret_cast(&backBuffer)); ThrowIfFailed(result, "Failed to retrieve back buffer surface."); m_renderTarget = std::make_unique(backBuffer, true); } void SetAndClearTarget() { static float color[] { 0, 0, 0, 0}; if (!m_renderTarget) return; m_renderTarget->Clear(m_context, color); m_renderTarget->SetTarget(m_context); } void Present() { m_swapChain->Present(0, 0); } ID3D11Device * GetDevice() { return m_device; } ID3D11DeviceContext * GetDeviceContext() { return m_context; }private: D3D_FEATURE_LEVEL m_featureLevel; CComPtr m_device; CComPtr m_context; CComPtr m_swapChain; std::unique_ptr m_renderTarget;};class D3DSurface : public Awesomium::Surface {public: D3DSurface(ID3D11DeviceContext * context, Awesomium::WebView * view, int width, int height) : m_context(context), m_view(view), 
m_width(width), m_height(height) { CComPtr device; context->GetDevice(&device); auto result = device->CreateTexture2D(&CD3D11_TEXTURE2D_DESC(DXGI_FORMAT_B8G8R8A8_UNORM, width, height, 1, 1), nullptr, &m_texture); result = device->CreateShaderResourceView(m_texture, nullptr, &m_textureView); } virtual void Paint(unsigned char *srcBuffer, int srcRowSpan, const Awesomium::Rect &srcRect, const Awesomium::Rect &destRect) { auto box = CD3D11_BOX(destRect.x, destRect.y, 0, destRect.x + destRect.width, destRect.y + destRect.height, 1); // 4 bytes per pixel, srcRowSpan is already in bytes. auto startingOffset = srcRowSpan * srcRect.y + srcRect.x * 4; m_context->UpdateSubresource(m_texture, 0, &box, srcBuffer + startingOffset, srcRowSpan, 0); } virtual void Scroll(int dx, int dy, const Awesomium::Rect &clip_rect) { } void Bind() { m_context->PSSetShaderResources(0, 1, &m_textureView.p); } virtual ~D3DSurface() { }private: CComPtr m_textureView; CComPtr m_texture; ID3D11DeviceContext * m_context; Awesomium::WebView * m_view; int m_width; int m_height;};class D3DSurfaceFactory : public Awesomium::SurfaceFactory {public: D3DSurfaceFactory(ID3D11DeviceContext * context) : m_context(context) { } virtual Awesomium::Surface * CreateSurface(Awesomium::WebView * view, int width, int height) { return new D3DSurface(m_context, view, width, height); } virtual void DestroySurface(Awesomium::Surface * surface) { delete surface; }private: ID3D11DeviceContext * m_context;};class MainWindow : public CWindowImpl {public: DECLARE_WND_CLASS_EX(ClassName, CS_OWNDC | CS_HREDRAW | CS_VREDRAW, COLOR_BACKGROUND + 1); MainWindow(Awesomium::WebCore * webCore) : m_webCore(webCore), m_view(nullptr, [](Awesomium::WebView * ptr) { ptr->Destroy(); }), m_isMaximized(false), m_surface(nullptr) { RECT rect = { 0, 0, 800, 600 }; AdjustWindowRectEx(&rect, GetWndStyle(0), FALSE, GetWndExStyle(0)); Create(nullptr, RECT{ 0, 0, rect.right - rect.left, rect.bottom - rect.top }, WindowName); ShowWindow(SW_SHOW); 
UpdateWindow(); } void Run() { MSG msg; while (true) { if (PeekMessage(&msg, 0, 0, 0, PM_REMOVE)) { if (msg.message == WM_QUIT) break; TranslateMessage(&msg); DispatchMessage(&msg); } else { Update(); } } } void Update() { auto context = m_device->GetDeviceContext(); m_webCore->Update(); if (m_view->IsLoading()) { m_isLoading = true; } else if (m_isLoading) { m_isLoading = false; UpdateBossHealth(); m_webCore->Update(); m_surface = static_cast(m_view->surface()); } m_device->SetAndClearTarget(); context->OMSetBlendState(m_blendState, nullptr, ~0); context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST); context->IASetIndexBuffer(nullptr, (DXGI_FORMAT)0, 0); context->IASetVertexBuffers(0, 1, &m_vertexBuffer.p, &Vertex::Stride, &Vertex::Offset); context->IASetInputLayout(m_inputLayout); context->VSSetShader(m_triangleVS, nullptr, 0); context->PSSetShader(m_trianglePS, nullptr, 0); context->Draw(3, 0); context->IASetVertexBuffers(0, 0, nullptr, nullptr, nullptr); context->IASetInputLayout(nullptr); context->VSSetShader(m_vertexShader, nullptr, 0); context->PSSetShader(m_pixelShader, nullptr, 0); context->PSSetSamplers(0, 1, &m_sampler.p); if (m_surface) m_surface->Bind(); context->Draw(3, 0); m_device->Present(); }private: BEGIN_MSG_MAP(MainWindow) MESSAGE_HANDLER(WM_DESTROY, [](unsigned messageId, WPARAM wParam, LPARAM lParam, BOOL & handled) { PostQuitMessage(0); return 0; }); MESSAGE_HANDLER(WM_CREATE, OnCreate); MESSAGE_HANDLER(WM_SIZE, OnSize); MESSAGE_HANDLER(WM_EXITSIZEMOVE, OnSizeFinish); MESSAGE_HANDLER(WM_KEYUP, OnKeyUp); END_MSG_MAP()private: LRESULT OnCreate(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { try { RECT rect; GetClientRect(&rect); m_device = std::make_unique(m_hWnd, rect.right, rect.bottom); auto device = m_device->GetDevice(); tstring filename(MAX_PATH, 0); GetModuleFileName(GetModuleHandle(nullptr), &filename.front(), filename.length()); filename = filename.substr(0, filename.find_last_of('\\')); 
SetCurrentDirectory(filename.c_str()); std::vector vs(std::istreambuf_iterator(std::ifstream("FullScreenTriangleVS.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); auto result = device->CreateVertexShader(&vs.front(), vs.size(), nullptr, &m_vertexShader); ThrowIfFailed(result, "Could not create vertex shader."); std::vector ps(std::istreambuf_iterator(std::ifstream("FullScreenTrianglePS.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); result = device->CreatePixelShader(&ps.front(), ps.size(), nullptr, &m_pixelShader); ThrowIfFailed(result, "Could not create pixel shader."); result = device->CreateSamplerState(&CD3D11_SAMPLER_DESC(CD3D11_DEFAULT()), &m_sampler); ThrowIfFailed(result, "Could not create sampler state."); vs.assign(std::istreambuf_iterator(std::ifstream("TriangleVS.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); result = device->CreateVertexShader(&vs.front(), vs.size(), nullptr, &m_triangleVS); ThrowIfFailed(result, "Could not create vertex shader."); ps.assign(std::istreambuf_iterator(std::ifstream("TrianglePS.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); result = device->CreatePixelShader(&ps.front(), ps.size(), nullptr, &m_trianglePS); ThrowIfFailed(result, "Could not create pixel shader."); std::vector inputElementDesc = { { "SV_POSITION", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 0 }, { "COLOR", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 4 * sizeof(float) }, }; result = device->CreateInputLayout(&inputElementDesc.front(), inputElementDesc.size(), &vs.front(), vs.size(), &m_inputLayout); ThrowIfFailed(result, "Unable to create input layout."); // Hard coded triangle. Tis a silly idea, but works for the sample. 
Vertex vertices[] = { { { 0.0f, 0.5f, 0.5f, 1.0f }, { 1.0f, 0.0f, 0.0f, 1.0f }, { 0.5f, 1.0f } }, { { 0.5f, -0.5f, 0.5f, 1.0f }, { 0.0f, 1.0f, 0.0f, 1.0f }, { 0.0f, 0.0f } }, { { -0.5f, -0.5f, 0.5f, 1.0f }, { 0.0f, 0.0f, 1.0f, 1.0f }, { 1.0f, 0.0f } }, }; D3D11_BUFFER_DESC desc = { sizeof(vertices), D3D11_USAGE_DEFAULT, D3D11_BIND_VERTEX_BUFFER }; D3D11_SUBRESOURCE_DATA data = { vertices }; result = device->CreateBuffer(&desc, &data, &m_vertexBuffer); ThrowIfFailed(result, "Failed to create vertex buffer."); D3D11_BLEND_DESC blendDesc; blendDesc.AlphaToCoverageEnable = false; blendDesc.IndependentBlendEnable = false; blendDesc.RenderTarget[0] = { true, D3D11_BLEND_SRC_ALPHA, D3D11_BLEND_INV_SRC_ALPHA, D3D11_BLEND_OP_ADD, D3D11_BLEND_ONE, D3D11_BLEND_ZERO, D3D11_BLEND_OP_ADD, D3D11_COLOR_WRITE_ENABLE_ALL }; device->CreateBlendState(&blendDesc, &m_blendState); m_surfaceFactory = std::make_unique(m_device->GetDeviceContext()); m_webCore->set_surface_factory(m_surfaceFactory.get()); m_view.reset(m_webCore->CreateWebView(rect.right, rect.bottom, nullptr, Awesomium::kWebViewType_Offscreen)); m_view->SetTransparent(true); Awesomium::WebURL url(Awesomium::WSLit(URL)); m_view->LoadURL(url); m_device->Resize(rect.right, rect.bottom); } catch (std::runtime_error & ex) { std::cout << ex.what() << std::endl; return -1; } return 0; } LRESULT OnSizeFinish(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { try { RECT clientRect; GetClientRect(&clientRect); m_device->Resize(clientRect.right, clientRect.bottom); if (m_view->IsLoading()) m_view->Stop(); m_surface = nullptr; m_view.reset(m_webCore->CreateWebView(clientRect.right, clientRect.bottom, nullptr, Awesomium::kWebViewType_Offscreen)); m_view->SetTransparent(true); Awesomium::WebURL url(Awesomium::WSLit(URL)); m_view->LoadURL(url); } catch (std::runtime_error & ex) { std::cout << ex.what() << std::endl; } return 0; } LRESULT OnSize(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { if (wParam == 
SIZE_MAXIMIZED) { m_isMaximized = true; return OnSizeFinish(message, wParam, lParam, handled); } else { if (m_isMaximized) { m_isMaximized = false; return OnSizeFinish(message, wParam, lParam, handled); } } return 0; } LRESULT OnKeyUp(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { if (wParam == 'A' && m_view) { --m_bossHealth; UpdateBossHealth(); } return DefWindowProc(message, wParam, lParam); } void UpdateBossHealth() { auto javascript = std::string("$('#progressbar').progressbar({ value: ") + std::to_string(m_bossHealth) + "}); "; m_view->ExecuteJavascript(Awesomium::ToWebString(javascript), Awesomium::WSLit("")); }private: std::unique_ptr m_device; std::unique_ptr m_surfaceFactory; std::unique_ptr m_view; Awesomium::WebCore * m_webCore; D3DSurface * m_surface; CComPtr m_pixelShader; CComPtr m_vertexShader; CComPtr m_sampler; CComPtr m_blendState; CComPtr m_vertexBuffer; CComPtr m_trianglePS; CComPtr m_triangleVS; CComPtr m_inputLayout; int m_bossHealth = 100; bool m_isLoading; bool m_isMaximized;private: static const tchar * ClassName; static const tchar * WindowName; static const char * URL;};const tchar * MainWindow::WindowName = _T("DX Window");const tchar * MainWindow::ClassName = _T("GameWindowClass");const char * MainWindow::URL = ""; // Was: file : ///./Resources/UIInterface.html (remove spaces)int main() { Awesomium::WebCore * webCore = Awesomium::WebCore::Initialize(Awesomium::WebConfig()); { MainWindow window(webCore); window.Run(); } Awesomium::WebCore::Shutdown();}
Washu
With our familiarity with Direct2D now maximized, it's time to move on to other methods of rendering UI and text! At this juncture you're now familiar with Direct2D and its power (well, maybe not. But let's pretend you are!). However, what if we wanted to render a UI to the screen? In this entry we'll take a look at Awesomium, a Chromium-based HTML UI engine.

Introduction





For rendering UI to the screen you have a few options:

  1. You can go the Direct2D route, drawing images to a buffer and then drawing those buffers to the screen, or drawing directly to the back buffer.
  2. You can render quads to the screen placed where your buttons will be.
  3. You can use some form of a UI library to assist you in drawing such things!

We're going to go for option 3. Now, the framework we're going to be using is called Awesomium. It is not great, but it will do what we need. Some of the problems with it include a lack of x64 support, some weird behaviors relating to memory management, and some API quirks that are rather annoying. Another option, although a paid one, is Coherent. That aside, Awesomium is fairly easy to get up and running in no time at all. It is based on Chromium, and hence uses WebKit.

Initializing Awesomium


Initialization is a fairly trivial process. However, the pointer you get out of it needs to be made available to everyone who is going to be using Awesomium. Furthermore, Awesomium doesn't handle threading very well; as such, you should only access it from a single thread. Ever.

    int main() {
        Awesomium::WebCore * webCore = Awesomium::WebCore::Initialize(Awesomium::WebConfig());
        {
            MainWindow window(webCore);
            window.Run();
        }
        Awesomium::WebCore::Shutdown();
    }

As you can see, it's quite trivial.

Loading a Web Page


Our next step will be to load up some HTML and get it rendering. What better to render than the previous post, of course!

    m_view.reset(m_webCore->CreateWebView(rect.right, rect.bottom, nullptr, Awesomium::kWebViewType_Offscreen));
    Awesomium::WebURL url(Awesomium::WSLit("http://www.gamedev.net/blog/32/entry-2260630-sweet-snippets-more-text-rendering-with-directwritedirect2d-and-direct3d11/"));
    m_view->LoadURL(url);

Now we're getting into one of those quirks. Awesomium has its own string class, and so you need to use the WSLit function to create an instance of it. Additionally, each web view has a size associated with it, which means that any time we change our texture size we'll need to recreate the web view. We tell it to use an offscreen view as we actually want to use this as a texture.

Rendering the Page


When the page is done rendering we can get the surface associated with it, and then upload that data to our Direct3D texture using UpdateSubresource.

    if (m_view->IsLoading()) {
        m_isLoading = true;
        m_webCore->Update();
    }
    else if (m_isLoading) {
        m_isLoading = false;
        m_webCore->Update();
        Awesomium::BitmapSurface * surface = static_cast<Awesomium::BitmapSurface *>(m_view->surface());
        context->UpdateSubresource(m_texture, 0, nullptr, surface->buffer(), surface->width() * 4, 0);
    }

We keep track of whether it was previously loading, so as not to keep updating the texture every time we go through the update loop. After this we simply render a full screen triangle with the updated texture bound as a shader resource:

    context->IASetVertexBuffers(0, 0, nullptr, nullptr, nullptr);
    context->IASetIndexBuffer(nullptr, (DXGI_FORMAT)0, 0);
    context->IASetInputLayout(nullptr);
    context->VSSetShader(m_vertexShader, nullptr, 0);
    context->PSSetShader(m_pixelShader, nullptr, 0);
    context->PSSetSamplers(0, 1, &m_sampler.p);
    context->PSSetShaderResources(0, 1, &m_textureView.p);
    context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    context->Draw(3, 0);

This snippet uses a convenient trick: the vertex ID is used to generate a full screen triangle (with texture coordinates) without needing a vertex buffer or an index buffer. See this presentation for more information.
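
The shader itself isn't shown in this entry, so as a rough illustration here is one common formulation of the vertex ID trick, modeled on the CPU. FullScreenVertex and FullScreenTriangleVertex are hypothetical names, and the exact bit arithmetic is an assumption about how such a shader is typically written, not a copy of the sample's HLSL.

```cpp
#include <cassert>

struct FullScreenVertex {
    float x, y; // clip-space position
    float u, v; // texture coordinate
};

// Derives one corner of an oversized triangle from the vertex ID alone.
// The three vertices land at (-1, 1), (3, 1) and (-1, -3), so the triangle
// covers the whole [-1, 1] clip-space square, and the texture coordinates
// run from 0 to 1 across the visible part.
FullScreenVertex FullScreenTriangleVertex(unsigned id) {
    assert(id < 3);
    const float u = static_cast<float>((id << 1) & 2); // 0, 2, 0
    const float v = static_cast<float>(id & 2);        // 0, 0, 2
    return { u * 2.0f - 1.0f, 1.0f - v * 2.0f, u, v };
}
```

Because everything is derived from the ID, Draw(3, 0) with no vertex buffer bound is enough to cover the screen.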

Other Things of Note


The only additional thing to note is our handling of WM_SIZE and WM_EXITSIZEMOVE. Due to the fact that Awesomium creates a new process for each view (although you can avoid this by using child views), resizing the window becomes a chore. As such, we ignore all WM_SIZE messages except SIZE_MAXIMIZED ones, or if the window was previously maximized. This allows us to only recreate the view when the window is maximized, restored from a maximized state, or resizing has finished.

    LRESULT OnSize(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) {
        if (wParam == SIZE_MAXIMIZED) {
            m_isMaximized = true;
            return OnSizeFinish(message, wParam, lParam, handled);
        }
        else {
            if (m_isMaximized) {
                m_isMaximized = false;
                return OnSizeFinish(message, wParam, lParam, handled);
            }
        }
        return 0;
    }

When this is the case we release our texture, shader view, and web-view, then we recreate them at the new size. Finally we issue a new load request to the same URL as before. When this completes the update loop will properly update the texture with the new data.
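
The decision of when to recreate the view can be pulled out of the window class and looked at on its own. The sketch below models it as a tiny state machine; ResizeFilter and the SizeType enum are hypothetical names for this illustration, with the enum values chosen to mirror the Win32 WM_SIZE constants so the code compiles without windows.h.

```cpp
// Values mirror the Win32 WM_SIZE wParam constants (SIZE_RESTORED,
// SIZE_MINIMIZED, SIZE_MAXIMIZED), redefined so the sketch is portable.
enum SizeType { kRestored = 0, kMinimized = 1, kMaximized = 2 };

class ResizeFilter {
public:
    // Returns true when a WM_SIZE message should trigger recreating the
    // view: on maximize, and on the first size message after a maximize.
    bool OnSize(int sizeType) {
        if (sizeType == kMaximized) {
            m_isMaximized = true;
            return true;
        }
        if (m_isMaximized) {
            m_isMaximized = false;
            return true;
        }
        return false; // ordinary drag-resize step: ignore it
    }

    // WM_EXITSIZEMOVE means the user let go of the border: always recreate.
    bool OnExitSizeMove() const { return true; }

private:
    bool m_isMaximized = false;
};
```

All the intermediate WM_SIZE messages during a drag return false, so the expensive view recreation happens once per resize rather than once per pixel.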

Full Sample

#define NOMINMAX#include #include #include #include #include #include #include #include #include #include #pragma comment(lib, "d3d11.lib")#pragma comment(lib, "awesomium.lib")#include #include #include #include #include #include #include #ifdef UNICODEtypedef wchar_t tchar;typedef std::wstring tstring;#elsetypedef char tchar;typedef std::string tstring;#endifvoid ThrowIfFailed(HRESULT result, std::string const & text) { if (FAILED(result)) throw std::runtime_error(text + "");}class RenderTarget {public: RenderTarget(ID3D11Texture2D * texture, bool hasDepthBuffer) : m_texture(texture) { CComPtr device; texture->GetDevice(&device); auto result = device->CreateRenderTargetView(m_texture, nullptr, &m_textureRTV); ThrowIfFailed(result, "Failed to create back buffer render target."); m_viewport = CD3D11_VIEWPORT(m_texture, m_textureRTV); result = device->CreateTexture2D(&CD3D11_TEXTURE2D_DESC(DXGI_FORMAT_D32_FLOAT, static_cast(m_viewport.Width), static_cast(m_viewport.Height), 1, 1, D3D11_BIND_DEPTH_STENCIL), nullptr, &m_depthBuffer); ThrowIfFailed(result, "Failed to create depth buffer."); result = device->CreateDepthStencilView(m_depthBuffer, nullptr, &m_depthView); ThrowIfFailed(result, "Failed to create depth buffer render target."); } void Clear(ID3D11DeviceContext * context, float color[4], bool clearDepth = true) { context->ClearRenderTargetView(m_textureRTV, color); if (clearDepth && m_depthView) context->ClearDepthStencilView(m_depthView, D3D11_CLEAR_DEPTH, 1.0f, 0); } void SetTarget(ID3D11DeviceContext * context) { context->OMSetRenderTargets(1, &m_textureRTV.p, m_depthView); context->RSSetViewports(1, &m_viewport); }private: D3D11_VIEWPORT m_viewport; CComPtr m_depthBuffer; CComPtr m_depthView; CComPtr m_texture; CComPtr m_textureRTV;};class GraphicsDevice {public: GraphicsDevice(HWND window, int width, int height) { D3D_FEATURE_LEVEL levels[] = { D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0, }; 
DXGI_SWAP_CHAIN_DESC desc = { { width, height, { 1, 60 }, DXGI_FORMAT_R8G8B8A8_UNORM, DXGI_MODE_SCANLINE_ORDER_UNSPECIFIED, DXGI_MODE_SCALING_UNSPECIFIED }, { 1, 0 }, DXGI_USAGE_BACK_BUFFER | DXGI_USAGE_RENDER_TARGET_OUTPUT, 1, window, TRUE, DXGI_SWAP_EFFECT_DISCARD, 0 }; auto result = D3D11CreateDeviceAndSwapChain( nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, D3D11_CREATE_DEVICE_DEBUG | D3D11_CREATE_DEVICE_BGRA_SUPPORT, levels, sizeof(levels) / sizeof(D3D_FEATURE_LEVEL), D3D11_SDK_VERSION, &desc, &m_swapChain, &m_device, &m_featureLevel, &m_context ); ThrowIfFailed(result, "Failed to create D3D11 device."); } void Resize(int width, int height) { if (m_renderTarget) m_renderTarget.reset(); auto result = m_swapChain->ResizeBuffers(1, width, height, DXGI_FORMAT_UNKNOWN, 0); ThrowIfFailed(result, "Failed to resize back buffer."); CComPtr backBuffer; result = m_swapChain->GetBuffer(0, __uuidof(ID3D11Texture2D), reinterpret_cast(&backBuffer)); ThrowIfFailed(result, "Failed to retrieve back buffer surface."); m_renderTarget = std::make_unique(backBuffer, true); } void SetAndClearTarget() { static float color[] { 0, 0, 0, 0}; if (!m_renderTarget) return; m_renderTarget->Clear(m_context, color); m_renderTarget->SetTarget(m_context); } void Present() { m_swapChain->Present(0, 0); } ID3D11Device * GetDevice() { return m_device; } ID3D11DeviceContext * GetDeviceContext() { return m_context; }private: D3D_FEATURE_LEVEL m_featureLevel; CComPtr m_device; CComPtr m_context; CComPtr m_swapChain; std::unique_ptr m_renderTarget;};class MainWindow : public CWindowImpl {public: DECLARE_WND_CLASS_EX(ClassName, CS_OWNDC | CS_HREDRAW | CS_VREDRAW, COLOR_BACKGROUND + 1); MainWindow(Awesomium::WebCore * webCore) : m_webCore(webCore), m_view(nullptr, [](Awesomium::WebView * ptr) { ptr->Destroy(); }), m_isMaximized(false) { RECT rect = { 0, 0, 800, 600 }; AdjustWindowRectEx(&rect, GetWndStyle(0), FALSE, GetWndExStyle(0)); Create(nullptr, RECT{ 0, 0, rect.right - rect.left, rect.bottom - 
rect.top }, WindowName); ShowWindow(SW_SHOW); UpdateWindow(); } void Run() { MSG msg; while (true) { if (PeekMessage(&msg, 0, 0, 0, PM_REMOVE)) { if (msg.message == WM_QUIT) break; TranslateMessage(&msg); DispatchMessage(&msg); } else { Update(); } } } void Update() { auto context = m_device->GetDeviceContext(); if (m_view->IsLoading()) { m_isLoading = true; m_webCore->Update(); } else if (m_isLoading) { m_isLoading = false; m_webCore->Update(); Awesomium::BitmapSurface * surface = static_cast(m_view->surface()); context->UpdateSubresource(m_texture, 0, nullptr, surface->buffer(), surface->width() * 4, 0); } m_device->SetAndClearTarget(); context->IASetVertexBuffers(0, 0, nullptr, nullptr, nullptr); context->IASetIndexBuffer(nullptr, (DXGI_FORMAT)0, 0); context->IASetInputLayout(nullptr); context->VSSetShader(m_vertexShader, nullptr, 0); context->PSSetShader(m_pixelShader, nullptr, 0); context->PSSetSamplers(0, 1, &m_sampler.p); context->PSSetShaderResources(0, 1, &m_textureView.p); context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST); context->Draw(3, 0); m_device->Present(); }private: BEGIN_MSG_MAP(MainWindow) MESSAGE_HANDLER(WM_DESTROY, [](unsigned messageId, WPARAM wParam, LPARAM lParam, BOOL & handled) { PostQuitMessage(0); return 0; }); MESSAGE_HANDLER(WM_CREATE, OnCreate); MESSAGE_HANDLER(WM_SIZE, OnSize); MESSAGE_HANDLER(WM_EXITSIZEMOVE, OnSizeFinish); END_MSG_MAP()private: LRESULT OnCreate(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { try { RECT rect; GetClientRect(&rect); m_device = std::make_unique(m_hWnd, rect.right, rect.bottom); tstring filename(MAX_PATH, 0); GetModuleFileName(GetModuleHandle(nullptr), &filename.front(), filename.length()); filename = filename.substr(0, filename.find_last_of('\\')); SetCurrentDirectory(filename.c_str()); std::vector vs(std::istreambuf_iterator(std::ifstream("VertexShader.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); auto result = 
m_device->GetDevice()->CreateVertexShader(&vs.front(), vs.size(), nullptr, &m_vertexShader); ThrowIfFailed(result, "Could not create vertex shader."); std::vector ps(std::istreambuf_iterator(std::ifstream("PixelShader.cso", std::ios_base::in | std::ios_base::binary)), std::istreambuf_iterator()); result = m_device->GetDevice()->CreatePixelShader(&ps.front(), ps.size(), nullptr, &m_pixelShader); ThrowIfFailed(result, "Could not create pixel shader."); result = m_device->GetDevice()->CreateSamplerState(&CD3D11_SAMPLER_DESC(CD3D11_DEFAULT()), &m_sampler); ThrowIfFailed(result, "Could not create sampler state."); m_device->GetDevice()->CreateTexture2D(&CD3D11_TEXTURE2D_DESC(DXGI_FORMAT_B8G8R8A8_UNORM, rect.right, rect.bottom, 1, 1), nullptr, &m_texture); ThrowIfFailed(result, "Could not create web texture."); m_device->GetDevice()->CreateShaderResourceView(m_texture, nullptr, &m_textureView); ThrowIfFailed(result, "Could not create shader view."); m_view.reset(m_webCore->CreateWebView(rect.right, rect.bottom, nullptr, Awesomium::kWebViewType_Offscreen)); Awesomium::WebURL url(Awesomium::WSLit("http://www.gamedev.net/blog/32/entry-2260630-sweet-snippets-more-text-rendering-with-directwritedirect2d-and-direct3d11/")); m_view->LoadURL(url); m_device->Resize(rect.right, rect.bottom); } catch (std::runtime_error & ex) { std::cout << ex.what() << std::endl; return -1; } return 0; } LRESULT OnSizeFinish(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { try { RECT clientRect; GetClientRect(&clientRect); m_device->Resize(clientRect.right, clientRect.bottom); if (m_texture) m_texture.Release(); if (m_textureView) m_textureView.Release(); m_device->GetDevice()->CreateTexture2D(&CD3D11_TEXTURE2D_DESC(DXGI_FORMAT_B8G8R8A8_UNORM, clientRect.right, clientRect.bottom, 1, 1), nullptr, &m_texture); m_device->GetDevice()->CreateShaderResourceView(m_texture, nullptr, &m_textureView); if (m_view->IsLoading()) m_view->Stop(); 
m_view.reset(m_webCore->CreateWebView(clientRect.right, clientRect.bottom, nullptr, Awesomium::kWebViewType_Offscreen)); Awesomium::WebURL url(Awesomium::WSLit("http://www.gamedev.net/blog/32/entry-2260630-sweet-snippets-more-text-rendering-with-directwritedirect2d-and-direct3d11/")); m_view->LoadURL(url); } catch (std::runtime_error & ex) { std::cout << ex.what() << std::endl; } return 0; } LRESULT OnSize(unsigned message, WPARAM wParam, LPARAM lParam, BOOL & handled) { if (wParam == SIZE_MAXIMIZED) { m_isMaximized = true; return OnSizeFinish(message, wParam, lParam, handled); } else { if (m_isMaximized) { m_isMaximized = false; return OnSizeFinish(message, wParam, lParam, handled); } } return 0; }private: std::unique_ptr m_device; Awesomium::WebCore * m_webCore; std::unique_ptr m_view; CComPtr m_pixelShader; CComPtr m_vertexShader; CComPtr m_sampler; CComPtr m_texture; CComPtr m_textureView; bool m_isLoading; bool m_isMaximized;private: static const tchar * ClassName; static const tchar * WindowName;};const tchar * MainWindow::WindowName = _T("DX Window");const tchar * MainWindow::ClassName = _T("GameWindowClass");int main() { Awesomium::WebCore * webCore = Awesomium::WebCore::Initialize(Awesomium::WebConfig()); { MainWindow window(webCore); window.Run(); } Awesomium::WebCore::Shutdown();}
Washu
Previously we built ourselves a short little application that displayed some text over a triangle, rendered using Direct3D11, DirectWrite, and Direct2D. There were a few problems with the sample though, and so I've decided to do a followup which shows some changes which fix those issues.

Introduction


When you initially get text rendering with Direct2D, using DrawText and DirectWrite, it feels rather powerful at first. You're able to render text, with a brush of your choosing, to a texture or the screen. But you will quickly find that DrawText is actually not that great of a function. Hence we have the IDWriteTextLayout interface.

This interface allows us the capability to build much more complex text objects, and in fact is used internally by DrawText. The interface provides a great deal of functionality, and so we shall now harness it to enhance the previous example.

But first, we need a goal. Goals are important in every field, including software development. Without an end goal in mind, code quickly begins to wander, and you soon find yourself in dark alleys best not trod. Thus our goal: To be able to render text that includes hyperlinks. These links will render in a fixed width font, with a different color, and when the mouse moves over them we expect our cursor to change from an arrow to a hand. Furthermore, when we click a link we expect it to open the default browser to the URL the link points to.

Using IDWriteTextLayout


The IDWriteTextLayout interface is fairly simple: you construct it by providing the text you want to lay out, the bounds of the text, and a default format (which provides font information).

    auto result = m_factory->CreateTextLayout(m_text.c_str(), m_text.length(), m_defaultFormat, size.x, size.y, &m_textLayout);

As you can see from the snippet above, it's quite trivial to use. But this does raise the question: how does this help us to format our text with links? Well, in our sample we are actually building m_text from smaller bits of text. We use two functions, AppendText and AppendLink, to fill the m_text string. However, each time we call AppendLink we also store a few other bits of information: the starting position of the link and its length, along with the URL associated with this range. As can be seen below:

    void AppendText(tstring const & str) {
        m_text.append(str);
        m_dirty = true;
    }

    void AppendLink(tstring const & str, tstring const & link) {
        DWRITE_TEXT_RANGE linkRange = { m_text.size(), str.length() };
        m_linkRanges.push_back(std::make_pair(linkRange, link));
        m_text.append(str);
        m_dirty = true;
    }

We also set a dirty flag, which we use to determine if the text layout object needs to be recreated.
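
To see the bookkeeping without any DirectWrite in the way, here is a small portable sketch of the same idea. TextRange stands in for DWRITE_TEXT_RANGE, and LinkedText and its LinkAt lookup are hypothetical names for this illustration; the lookup in particular is an assumption about how the stored ranges might be queried.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for DWRITE_TEXT_RANGE.
struct TextRange {
    std::size_t startPosition;
    std::size_t length;
};

class LinkedText {
public:
    void AppendText(const std::string &str) {
        m_text.append(str);
        m_dirty = true;
    }

    // Records where the link starts and how long it is *before* appending,
    // so the range describes the link's characters in the final string.
    void AppendLink(const std::string &str, const std::string &url) {
        m_linkRanges.emplace_back(TextRange{ m_text.size(), str.length() }, url);
        m_text.append(str);
        m_dirty = true;
    }

    // Returns the URL whose range contains `pos`, or nullptr if none does.
    const std::string *LinkAt(std::size_t pos) const {
        for (const auto &p : m_linkRanges) {
            const TextRange &r = p.first;
            if (pos >= r.startPosition && pos < r.startPosition + r.length)
                return &p.second;
        }
        return nullptr;
    }

    const std::string &Text() const { return m_text; }
    bool IsDirty() const { return m_dirty; }

private:
    std::string m_text;
    std::vector<std::pair<TextRange, std::string>> m_linkRanges;
    bool m_dirty = false;
};
```

Appending "See ", then a link "GameDev", then " for more." gives a single string with one recorded range starting at position 4 with length 7.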

Once we've built up our text we "compile" it into a text layout object. Once we have our IDWriteTextLayout object created with our text, we need to tell it how to format the text to our liking. In our case, we need to tell it about the links in our text and how we desire to have them rendered.

To do this, we simply iterate over the previously saved ranges (from AppendLink) and tell the text layout interface that we want those ranges of characters drawn differently. In our case we're going to render them in a fixed-width font (Consolas), underlined, and in a nice powder blue color:

    for (auto const & p : m_linkRanges) {
        m_textLayout->SetFontFamilyName(_T("Consolas"), p.first);
        m_textLayout->SetUnderline(true, p.first);
        m_textLayout->SetDrawingEffect(m_linkBrush, p.first);
    }

Drawing the Text and More


Drawing couldn't be simpler. In fact, it's actually simpler than drawing text using DrawText. Since we've done all the work up front to format our text, all we really have to do is pass it off to Direct2D, along with where on the screen (or texture) we desire to render it.
    renderTarget->DrawTextLayout(pos, m_textLayout, m_defaultBrush);

Of course, this is not where we're going to stop. Now that our text renders nicely to the screen, we want to detect whether the user has the mouse over one of the links in our text. To do this we need two things: first, we need to trap the WM_MOUSEMOVE and WM_LBUTTONUP Win32 mouse messages; second, we need some way to determine where the cursor is in relation to our various links. Thankfully, DirectWrite helps out here too!
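Both mouse messages deliver the cursor position packed into lParam, with x in the low 16 bits and y in the high 16 bits. The sample extracts them with GET_X_LPARAM and GET_Y_LPARAM rather than LOWORD and HIWORD because the coordinates are signed and can be negative. Here is a portable sketch of that unpacking; MousePoint and UnpackMousePosition are made-up names, and std::intptr_t stands in for LPARAM.

```cpp
#include <cassert>
#include <cstdint>

struct MousePoint {
    int x;
    int y;
};

// Portable equivalent of GET_X_LPARAM / GET_Y_LPARAM: the low and high
// 16-bit words of the packed value are each reinterpreted as signed.
inline MousePoint UnpackMousePosition(std::intptr_t lParam) {
    auto const bits = static_cast<std::uint32_t>(lParam);
    auto const lowWord = static_cast<std::uint16_t>(bits & 0xFFFFu);
    auto const highWord = static_cast<std::uint16_t>((bits >> 16) & 0xFFFFu);
    return { static_cast<std::int16_t>(lowWord), static_cast<std::int16_t>(highWord) };
}
```

Treating the words as unsigned instead would turn an x of -5 into 65531, which is exactly the kind of bug that only shows up once the cursor leaves the client area.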

Since DirectWrite is a text layout engine, it can tell us a lot about the text it's laying out, including such things as "where is a point in relation to the characters of this IDWriteTextLayout object?" The test is quite trivial:

    m_textLayout->HitTestPoint(pos.x, pos.y, &isTrailingHit, &isInside, &hitTestMetrics);

From the booleans this function returns we know whether the point hits the trailing edge of a character and whether it's inside the text area at all, and the metrics structure carries several other details of the hit test. Note that HitTestPoint expects the point relative to the top-left corner of the layout box; the sample draws the layout at (0, 0), so the mouse position can be passed straight through. In our case we use the isInside boolean to decide whether to continue testing, and then the textPosition member of DWRITE_HIT_TEST_METRICS to determine the nearest character to the cursor. With that information in hand it's a simple task to iterate over our links (stored previously by AppendLink) and check whether textPosition falls within the range of characters represented by a link:

    for (auto const & p : m_linkRanges) {
        if (hitTestMetrics.textPosition >= p.first.startPosition &&
            hitTestMetrics.textPosition < p.first.startPosition + p.first.length) {
            // Callers that only want a yes/no answer may pass nullptr.
            if (linkText != nullptr) {
                *linkText = p.second;
            }
            return true;
        }
    }

We can then use the information from the hit test to perform various actions, such as using ShellExecute to open the browser to the link location:

    LRESULT OnMouseUp(unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) {
        tstring link;
        if (m_textSection->IsOverLink(D2D1::Point2F((float)GET_X_LPARAM(lParam), (float)GET_Y_LPARAM(lParam)), &link)) {
            ShellExecute(NULL, _T("open"), link.c_str(), NULL, NULL, SW_SHOWNORMAL);
        }
        return 0;
    }

The rest of what you can do is really only limited by your imagination.
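The range check at the heart of IsOverLink is worth isolating, since off-by-one mistakes are easy to make there. The sketch below treats every link as the half-open interval [start, start + length). TextRange is a stand-in for DWRITE_TEXT_RANGE, and FindLinkAt is a made-up helper, not part of the sample.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in for DWRITE_TEXT_RANGE.
struct TextRange {
    std::uint32_t startPosition;
    std::uint32_t length;
};

using LinkPair = std::pair<TextRange, std::string>;

// Returns the URL of the link covering textPosition, or nullptr if none does.
// Each range is treated as the half-open interval [start, start + length).
inline std::string const * FindLinkAt(std::vector<LinkPair> const & links,
                                      std::uint32_t textPosition) {
    for (auto const & link : links) {
        auto const start = link.first.startPosition;
        // Subtracting first avoids any overflow in start + length.
        if (textPosition >= start && textPosition - start < link.first.length) {
            return &link.second;
        }
    }
    return nullptr;
}
```

For a link that starts at character 4 and is 4 characters long, positions 4 through 7 hit the link while 3 and 8 do not.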

Full Sample


#define NOMINMAX#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #pragma comment(lib, "d3d11.lib")#pragma comment(lib, "d2d1.lib")#pragma comment(lib, "dwrite.lib")#pragma comment(lib, "d3dcompiler.lib")#ifdef UNICODEtypedef std::wstring tstring;typedef wchar_t tchar;#elsetypedef std::string tstring;typedef char tchar;#endifstruct Vertex { float position[4]; float color[4]; float texCoord[2];};class Exception : public std::runtime_error {public: Exception(std::string const & error, HRESULT result) : std::runtime_error(error + "\nError was: " + std::to_string(result)) { }};class TextSection { typedef std::pair LinkPair;public: TextSection(CComPtr factory) : m_factory(factory), m_dirty(true) { auto result = m_factory->CreateTextFormat(_T("Calibri"), nullptr, DWRITE_FONT_WEIGHT_NORMAL, DWRITE_FONT_STYLE_NORMAL, DWRITE_FONT_STRETCH_NORMAL, 14.0f, _T(""), &m_defaultFormat); if (FAILED(result)) throw Exception("Failed to create text format.", result); } void SetDefaultColorBrush(CComPtr defaultBrush) { m_defaultBrush = defaultBrush; } void SetLinkColorBrush(CComPtr linkBrush) { m_linkBrush = linkBrush; } void AppendText(tstring const & str) { m_text.append(str); m_dirty = true; } void AppendLink(tstring const & str, tstring const & link) { DWRITE_TEXT_RANGE linkRange = { m_text.size(), str.length() }; m_linkRanges.push_back(std::make_pair(linkRange, link)); m_text.append(str); m_dirty = true; } void Compile(D2D1_POINT_2F const & size) { if (!m_defaultBrush || !m_linkBrush) { throw Exception("Default and link color brushes must be set first.", E_FAIL); } if (m_textLayout) { m_textLayout.Release(); } auto result = m_factory->CreateTextLayout(m_text.c_str(), m_text.length(), m_defaultFormat, size.x, size.y, &m_textLayout); if (FAILED(result)) { throw Exception("Unable to create text layout.", result); } for (auto const & p : m_linkRanges) { 
m_textLayout->SetFontFamilyName(_T("Consolas"), p.first); m_textLayout->SetUnderline(true, p.first); m_textLayout->SetDrawingEffect(m_linkBrush, p.first); } m_dirty = false; } void Release() { m_defaultBrush.Release(); m_linkBrush.Release(); m_textLayout.Release(); } void Draw(CComPtr renderTarget, D2D1_POINT_2F const & pos) { if (m_dirty || !m_linkBrush || !m_defaultBrush || !m_textLayout) { return; } renderTarget->DrawTextLayout(pos, m_textLayout, m_defaultBrush); } bool IsOverLink(D2D1_POINT_2F const & pos, tstring * linkText) { BOOL isTrailingHit; BOOL isInside; DWRITE_HIT_TEST_METRICS hitTestMetrics; m_textLayout->HitTestPoint(pos.x, pos.y, &isTrailingHit, &isInside, &hitTestMetrics); if (!isInside) return false; for (auto const & p : m_linkRanges) { if (hitTestMetrics.textPosition >= p.first.startPosition && hitTestMetrics.textPosition < p.first.startPosition + p.first.length) { if (linkText != nullptr) { *linkText = p.second; } return true; } } return false; }private: CComPtr m_factory; CComPtr m_defaultFormat; CComPtr m_textLayout; CComPtr m_defaultBrush; CComPtr m_linkBrush; tstring m_text; std::vector m_linkRanges; bool m_dirty;};class MainWindow : public CWindowImpl, public CIdleHandler {public: MainWindow() { m_handCursor = LoadCursor(nullptr, IDC_HAND); m_arrowCursor = LoadCursor(nullptr, IDC_ARROW); RECT bounds = { 0, 0, 800, 600 }; AdjustWindowRect(&bounds, WS_OVERLAPPEDWINDOW, false); bounds = { 0, 0, bounds.right - bounds.left, bounds.bottom - bounds.top }; Create(nullptr, bounds, _T("D3DSample Window"), WS_OVERLAPPEDWINDOW); ShowWindow(SW_SHOW); } virtual BOOL OnIdle() { Present(); return true; } void Present() { static float clearColor[] = { 0, 0, 0, 1 }; { m_deviceContext->OMSetRenderTargets(1, &m_backBufferRTV.p, nullptr); m_deviceContext->ClearRenderTargetView(m_backBufferRTV, clearColor); size_t stride = sizeof(Vertex); size_t offsets = 0; m_deviceContext->IASetVertexBuffers(0, 1, &m_vertexBuffer.p, &stride, &offsets); 
m_deviceContext->IASetInputLayout(m_inputLayout); m_deviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST); m_deviceContext->VSSetShader(m_vertexShader, nullptr, 0); m_deviceContext->PSSetShader(m_pixelShader, nullptr, 0); } { m_deviceContext->Draw(3, 0); } { m_d2dRenderTarget->BeginDraw(); m_textSection->Draw(m_d2dRenderTarget, D2D1::Point2F(0, 0)); m_d2dRenderTarget->EndDraw(); } { m_swapChain->Present(0, 0); } }public: BEGIN_MSG_MAP(MainWindow) MESSAGE_HANDLER(WM_DESTROY, [](unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) { PostQuitMessage(0); return 0; }); MESSAGE_HANDLER(WM_SIZE, OnSize); MESSAGE_HANDLER(WM_CREATE, OnCreate); MESSAGE_HANDLER(WM_MOUSEMOVE, OnMouseMove); MESSAGE_HANDLER(WM_LBUTTONUP, OnMouseUp); END_MSG_MAP()private: void CreateD3DVertexAndShaders() { tstring processFilename(MAX_PATH, _T('\0')); std::vector vertexShader; std::vector pixelShader; GetModuleFileName(GetModuleHandle(nullptr), &processFilename.front(), processFilename.length()); SetCurrentDirectory(processFilename.substr(0, processFilename.find_last_of(_T("\\/"))).data()); { std::ifstream vertexfin("VertexShader.cso", std::ios_base::binary | std::ios_base::in); std::copy(std::istreambuf_iterator(vertexfin), std::istreambuf_iterator(), std::back_inserter(vertexShader)); auto result = m_device->CreateVertexShader(&vertexShader.front(), vertexShader.size(), nullptr, &m_vertexShader); if (FAILED(result)) { throw Exception("Failed to create vertex shader.", result); } } { std::ifstream pixelfin("PixelShader.cso", std::ios_base::binary | std::ios_base::in); std::copy(std::istreambuf_iterator(pixelfin), std::istreambuf_iterator(), std::back_inserter(pixelShader)); auto result = m_device->CreatePixelShader(&pixelShader.front(), pixelShader.size(), nullptr, &m_pixelShader); if (FAILED(result)) { throw Exception("Failed to create pixel shader.", result); } } CComPtr inputLayoutBlob; auto result = D3DGetInputSignatureBlob(&vertexShader.front(), 
vertexShader.size(), &inputLayoutBlob); if (FAILED(result)) { throw Exception("Failed to get input layout.", result); } // Hard coded triangle. Tis a silly idea, but works for the sample. Vertex vertices[] = { { { 0.0f, 0.5f, 0.5f, 1.0f }, { 1.0f, 0.0f, 0.0f, 1.0f }, { 0.5f, 1.0f } }, { { 0.5f, -0.5f, 0.5f, 1.0f }, { 0.0f, 1.0f, 0.0f, 1.0f }, { 0.0f, 0.0f } }, { { -0.5f, -0.5f, 0.5f, 1.0f }, { 0.0f, 0.0f, 1.0f, 1.0f }, { 1.0f, 0.0f } }, }; D3D11_BUFFER_DESC desc = { sizeof(vertices), D3D11_USAGE_DEFAULT, D3D11_BIND_VERTEX_BUFFER }; D3D11_SUBRESOURCE_DATA data = { vertices }; result = m_device->CreateBuffer(&desc, &data, &m_vertexBuffer); if (FAILED(result)) { throw Exception("Failed to create vertex buffer.", result); } D3D11_INPUT_ELEMENT_DESC inputElementDesc[] = { { "SV_POSITION", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 0 }, { "COLOR", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 4 * sizeof(float) }, { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 0, 4 * sizeof(float) }, }; result = m_device->CreateInputLayout(inputElementDesc, sizeof(inputElementDesc) / sizeof(D3D11_INPUT_ELEMENT_DESC), inputLayoutBlob->GetBufferPointer(), inputLayoutBlob->GetBufferSize(), &m_inputLayout); if (FAILED(result)) { throw Exception("Failed to create input layout.", result); } } void CreateD3DResources() { D3D_FEATURE_LEVEL featureLevels[] = { D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0, }; // We only want to draw to the portion of the window that is the client rect. // This will also work for dialog / borderless windows. 
RECT clientRect; GetClientRect(&clientRect); DXGI_SWAP_CHAIN_DESC swapChainDesc = { { clientRect.right, clientRect.bottom, { 60, 1 }, DXGI_FORMAT_R8G8B8A8_UNORM, DXGI_MODE_SCANLINE_ORDER_UNSPECIFIED, DXGI_MODE_SCALING_UNSPECIFIED }, { 1, 0 }, DXGI_USAGE_BACK_BUFFER | DXGI_USAGE_RENDER_TARGET_OUTPUT, 1, m_hWnd, true, DXGI_SWAP_EFFECT_DISCARD, DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH }; // At the moment we don't actually care about what feature level we got back, so we don't keep this around just yet. D3D_FEATURE_LEVEL featureLevel; auto result = D3D11CreateDeviceAndSwapChain( nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, // BGRA Support is necessary for D2D functionality. D3D11_CREATE_DEVICE_BGRA_SUPPORT | D3D11_CREATE_DEVICE_DEBUG, // D2D works with all of our feature levels, so we don't actually care which oen we get. featureLevels, sizeof(featureLevels) / sizeof(D3D_FEATURE_LEVEL), D3D11_SDK_VERSION, &swapChainDesc, &m_swapChain, &m_device, &featureLevel, &m_deviceContext ); if (FAILED(result)) { throw Exception("Failed to create D3D device and DXGI swap chain.", result); } // And lets create our D2D factory and DWrite factory at this point as well, that way if any of them fail we'll fail out completely. auto options = D2D1_FACTORY_OPTIONS(); options.debugLevel = D2D1_DEBUG_LEVEL_INFORMATION; result = D2D1CreateFactory(D2D1_FACTORY_TYPE_MULTI_THREADED, options, &m_d2dFactory); if (FAILED(result)) { throw Exception("Failed to create multithreaded D2D factory.", result); } result = DWriteCreateFactory(DWRITE_FACTORY_TYPE_SHARED, __uuidof(IDWriteFactory), reinterpret_cast(&m_dwFactory)); if (FAILED(result)) { throw Exception("Failed to create DirectWrite Factory.", result); } } void CreateBackBufferTarget() { CComPtr backBuffer; // Get a pointer to our back buffer texture. 
auto result = m_swapChain->GetBuffer(0, IID_PPV_ARGS(&backBuffer)); if (FAILED(result)) { throw Exception("Failed to get back buffer.", result); } // We acquire a render target view to the entire surface (no parameters), with nothing special about it. result = m_device->CreateRenderTargetView(backBuffer, nullptr, &m_backBufferRTV); if (FAILED(result)) { throw Exception("Failed to create render target view for back buffer.", result); } } void CreateD2DResources() { CComPtr bufferSurface; // Get a DXGI surface for D2D use. auto result = m_swapChain->GetBuffer(0, IID_PPV_ARGS(&bufferSurface)); if (FAILED(result)) { throw Exception("Failed to get DXGI surface for back buffer.", result); } // Proper DPI support is very important. Most applications do stupid things like hard coding this, which is why you, // can't use proper DPI on most monitors in Windows yet. float dpiX; float dpiY; m_d2dFactory->GetDesktopDpi(&dpiX, &dpiY); // DXGI_FORMAT_UNKNOWN will cause it to use the same format as the back buffer (R8G8B8A8_UNORM) auto d2dRTProps = D2D1::RenderTargetProperties(D2D1_RENDER_TARGET_TYPE_DEFAULT, D2D1::PixelFormat(DXGI_FORMAT_UNKNOWN, D2D1_ALPHA_MODE_PREMULTIPLIED), dpiX, dpiY); // Wraps up our DXGI surface in a D2D render target. 
result = m_d2dFactory->CreateDxgiSurfaceRenderTarget(bufferSurface, &d2dRTProps, &m_d2dRenderTarget); if (FAILED(result)) { throw Exception("Failed to create D2D DXGI Render Target.", result); } result = m_d2dRenderTarget->CreateSolidColorBrush(D2D1::ColorF(D2D1::ColorF::White), &m_defaultColorBrush); if (FAILED(result)) { throw Exception("Failed to create D2D color brush.", result); } result = m_d2dRenderTarget->CreateSolidColorBrush(D2D1::ColorF(D2D1::ColorF::PowderBlue), &m_linkColorBrush); if (FAILED(result)) { throw Exception("Failed to create D2D color brush.", result); } } void CreateTextResources() { m_textSection = std::make_unique(m_dwFactory); m_textSection->SetDefaultColorBrush(m_defaultColorBrush); m_textSection->SetLinkColorBrush(m_linkColorBrush); m_textSection->AppendText(_T("Tutorials are a horrible way to learn.\n\nI've covered that before though, and so have others, so I won't go into a great deal of depth on the subject, but suffice it to say that tutorials don't have the depth nor breadth to cover a subject in any sufficient detail to be terribly useful. 
If you don't learn to program, and if you don't learn to learn, then you'll always be stuck in ruts like this...\n\nThat being said, I have written a ")); m_textSection->AppendLink(_T("sweet little snippet to demonstrate exactly how to render text to the screen using Direct2D"), _T("http://www.gamedev.net/blog/32/entry-2260628-sweet-snippets-rendering-text-with-directwritedirect2d-and-direct3d11/")); m_textSection->AppendText(_T(".")); RECT clientRect;; GetClientRect(&clientRect); m_textSection->Compile(D2D1::Point2F(clientRect.right / 2.0f, clientRect.bottom / 2.0f)); }private: LRESULT OnMouseUp(unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) { tstring link; if (m_textSection->IsOverLink(D2D1::Point2F((float)GET_X_LPARAM(lParam), (float)GET_Y_LPARAM(lParam)), &link)) { ShellExecute(NULL, _T("open"), link.c_str(), NULL, NULL, SW_SHOWNORMAL); } return 0; } LRESULT OnMouseMove(unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) { if (m_textSection->IsOverLink(D2D1::Point2F((float)GET_X_LPARAM(lParam), (float)GET_Y_LPARAM(lParam)), nullptr)) { SetCursor(m_handCursor); } else { SetCursor(m_arrowCursor); } return 0; } LRESULT OnSize(unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) { // We need to release everything that may be holding a reference to the back buffer. // This includes D2D interfaces as well, as they hold a reference to the DXGI surface. m_textSection->Release(); m_linkColorBrush.Release(); m_defaultColorBrush.Release(); m_d2dRenderTarget.Release(); m_backBufferRTV.Release(); // And we make sure that we do not have any render tarvets bound either, which could // also be holding references to the back buffer. m_deviceContext->ClearState(); int width = LOWORD(lParam); int height = HIWORD(lParam); auto result = m_swapChain->ResizeBuffers(1, width, height, DXGI_FORMAT_UNKNOWN, DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH); if (FAILED(result)) { std::cout << "Failed to resize swap chain." 
<< std::endl; std::cout << "Error was: " << std::hex << result << std::endl; return -1; } try { // We need to recreate those resources we disposed of above, including our D2D interfaces CreateBackBufferTarget(); CreateD2DResources(); m_textSection->SetDefaultColorBrush(m_defaultColorBrush); m_textSection->SetLinkColorBrush(m_linkColorBrush); m_textSection->Compile(D2D1::Point2F(width / 2.0f, height / 2.0f)); } catch (Exception & ex) { std::cout << ex.what() << std::endl; } D3D11_VIEWPORT viewport = { 0.0f, 0.0f, static_cast(width), static_cast(height), 0.0f, 1.0f }; // We setup our viewport here as the size of the viewport is known at this point, WM_SIZE will be sent after a WM_CREATE. m_deviceContext->RSSetViewports(1, &viewport); return 0; } LRESULT OnCreate(unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) { try { CreateD3DResources(); CreateBackBufferTarget(); CreateD3DVertexAndShaders(); CreateD2DResources(); CreateTextResources(); } catch (Exception & ex) { std::cout << ex.what() << std::endl; return -1; } return 0; }private: CComPtr m_swapChain; CComPtr m_device; CComPtr m_deviceContext; CComPtr m_backBufferRTV; CComPtr m_vertexBuffer; CComPtr m_inputLayout; CComPtr m_vertexShader; CComPtr m_pixelShader; CComPtr m_d2dFactory; CComPtr m_d2dRenderTarget; CComPtr m_defaultColorBrush; CComPtr m_linkColorBrush; CComPtr m_dwFactory; CComPtr m_dwFormat; std::unique_ptr m_textSection; HCURSOR m_handCursor; HCURSOR m_arrowCursor;};int main() { CAppModule appModule; CMessageLoop messageLoop; MainWindow window; appModule.Init(nullptr, GetModuleHandle(nullptr)); appModule.AddMessageLoop(&messageLoop); messageLoop.AddIdleHandler(&window); messageLoop.Run(); appModule.Term(); return 0;}
Washu
At one point there was a series called Sweet Snippets. I don't remember where, but I think it was in the C++ Magazine (when such a thing still existed). Anyway, this is not an attempt to resurrect it; however, I feel that certain questions are sometimes best answered with a simple, sweet snippet of code that demonstrates one concept in its entirety. Thus this post (and hopefully more follow-up ones).

If all you care about is the code (i.e. you're a copy and paste coder), please feel free to skip to the end where the full source is posted.

Introduction


D3D-DWrite-Sample.png
DirectWrite is a Microsoft technology for rendering text and glyphs, as a replacement for GDI. Direct2D is a hardware-accelerated 2D rendering technology which can be used in conjunction with DirectWrite to render text to the screen. The combination of the two provides a very powerful mechanism for rendering properly formatted text with minimal effort. Plugging it into your 3D application (game, level editor, etc.) gives you a very powerful set of tools for producing text (and other 2D graphics) that can be rendered to a texture and presented in your game on virtual computer screens, projective textures, etc.

For the purposes of this snippet we will be rendering a basic triangle to the screen along with some text that will be drawn over it in a nice lime green.

Initializing Direct3D11


When you're initializing Direct3D11 with Direct2D support in mind, you need to indicate during device creation that you want to support the surface formats Direct2D uses. Specifically, Direct2D requires BGRA support, as that is the same format GDI uses. We can indicate this to Direct3D11 during device creation by passing in the D3D11_CREATE_DEVICE_BGRA_SUPPORT flag.

Other than that the device creation is quite straightforward:
    auto result = D3D11CreateDeviceAndSwapChain(
        nullptr,
        D3D_DRIVER_TYPE_HARDWARE,
        nullptr,
        // BGRA Support is necessary for D2D functionality.
        D3D11_CREATE_DEVICE_BGRA_SUPPORT | D3D11_CREATE_DEVICE_DEBUG,
        // D2D works with all of our feature levels (10.0 - 11.1), so we don't actually care which one we get.
        featureLevels,
        sizeof(featureLevels) / sizeof(D3D_FEATURE_LEVEL),
        D3D11_SDK_VERSION,
        &swapChainDesc,
        &m_swapChain,
        &m_device,
        &featureLevel,
        &m_deviceContext);
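One thing to keep in mind about the flags above: D3D11_CREATE_DEVICE_DEBUG needs the Direct3D SDK layers to be installed, and device creation fails on machines that lack them. A reasonable approach is to always request BGRA support and only add the debug flag when you actually want it. The sketch below shows that composition. The enum is a stand-in that repeats the two flag values from d3d11.h so the snippet builds without the Windows SDK, and DeviceCreationFlags is a made-up helper.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for two D3D11_CREATE_DEVICE_FLAG values from d3d11.h,
// redeclared here only so the sketch compiles without the Windows SDK.
enum CreateDeviceFlag : std::uint32_t {
    CreateDeviceDebug = 0x2,        // D3D11_CREATE_DEVICE_DEBUG
    CreateDeviceBgraSupport = 0x20  // D3D11_CREATE_DEVICE_BGRA_SUPPORT
};

// BGRA support is always requested because Direct2D needs it; the debug
// layer is requested only when the caller asks for it.
inline std::uint32_t DeviceCreationFlags(bool enableDebugLayer) {
    std::uint32_t flags = CreateDeviceBgraSupport;
    if (enableDebugLayer) {
        flags |= CreateDeviceDebug;
    }
    return flags;
}
```

In a real build you would most likely key enableDebugLayer off something like the _DEBUG macro, so release builds never depend on the SDK layers being present.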

Getting Going With Direct2D


Direct2D is not capable of talking directly to a Direct3D11 texture; instead you need to use a DXGI surface. Thankfully, all D3D textures (since 10) are DXGI surfaces, so we can simply QueryInterface for an IDXGISurface on the appropriate texture or, in the case of the back buffer (as in this sample), request the IDXGISurface from the swap chain.

    CComPtr<IDXGISurface> backBufferSurface;

    // Get a DXGI surface for D2D use.
    auto result = m_swapChain->GetBuffer(0, IID_PPV_ARGS(&backBufferSurface));
    if (FAILED(result)) {
        std::cout << "Failed to get DXGI surface for back buffer." << std::endl;
        std::cout << "Error was: " << std::hex << result << std::endl;
        return result;
    }

    // dpiX and dpiY come from m_d2dFactory->GetDesktopDpi (see the full sample).
    // DXGI_FORMAT_UNKNOWN will cause it to use the same format as the back buffer (R8G8B8A8_UNORM)
    auto d2dRTProps = D2D1::RenderTargetProperties(D2D1_RENDER_TARGET_TYPE_DEFAULT, D2D1::PixelFormat(DXGI_FORMAT_UNKNOWN, D2D1_ALPHA_MODE_PREMULTIPLIED), dpiX, dpiY);

    // Wraps up our DXGI surface in a D2D render target.
    result = m_d2dFactory->CreateDxgiSurfaceRenderTarget(backBufferSurface, &d2dRTProps, &m_d2dRenderTarget);
    if (FAILED(result)) {
        std::cout << "Failed to create D2D DXGI Render Target." << std::endl;
        std::cout << "Error was: " << std::hex << result << std::endl;
        return result;
    }

At this point, with a Direct2D render target in our hands, we're ready to do pretty much anything Direct2D can do, except render text. We can, however, create brushes, draw shapes, etc.
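Because the render target is created with the desktop DPI, the coordinates you hand to Direct2D are device-independent pixels (DIPs, 1/96 of an inch) rather than physical pixels. Anything that arrives in pixels, such as a client rectangle or a mouse position, needs converting on displays that are not running at 96 DPI. A minimal sketch of that conversion follows; PointF, PixelsToDips and DipsToPixels are made-up names, not Direct2D API.

```cpp
#include <cassert>

struct PointF {
    float x;
    float y;
};

// Direct2D measures in device-independent pixels (DIPs), 1/96 of an inch.
inline PointF PixelsToDips(PointF pixels, float dpiX, float dpiY) {
    assert(dpiX > 0.0f && dpiY > 0.0f);
    return { pixels.x * 96.0f / dpiX, pixels.y * 96.0f / dpiY };
}

// The inverse mapping, from DIPs back to physical pixels.
inline PointF DipsToPixels(PointF dips, float dpiX, float dpiY) {
    return { dips.x * dpiX / 96.0f, dips.y * dpiY / 96.0f };
}
```

At 96 DPI both functions are the identity, which is why code that skips the conversion looks fine on a default setup. At 144 DPI (150% scaling), pixel (150, 300) corresponds to DIP (100, 200).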

DirectWrite


DirectWrite is not specifically a standalone API. It works in conjunction with other APIs such as Direct2D to properly format text and glyphs for display. It has a great many tools, including the ability to build text layout objects which describe text that has multiple formatting characteristics, and then properly render that layout to the screen with such niceties as word wrapping and breaking (hyphenation), proper character spacing, Unicode handling, etc.

For us, and with such a simple sample in mind, we're going to do the bare minimum necessary to get text onto the screen. That calls for us to simply create a text format, which includes information about the font to use, font size, the weight and style, any stretching information, and the locale.
    auto result = m_dwFactory->CreateTextFormat(L"Consolas", nullptr, DWRITE_FONT_WEIGHT_NORMAL, DWRITE_FONT_STYLE_NORMAL, DWRITE_FONT_STRETCH_NORMAL, 14.0f, L"", &m_dwFormat);
    if (FAILED(result)) {
        std::cout << "Failed to create DirectWrite text format." << std::endl;
        std::cout << "Error was: " << std::hex << result << std::endl;
        return result;
    }
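A small gotcha with the 14.0f above: CreateTextFormat takes the font size in DIPs (1/96 of an inch), not in typographic points (1/72 of an inch), so 14.0f is a 10.5 point font. If you think in points, a tiny helper does the conversion. PointsToDips is a made-up name, not part of DirectWrite.

```cpp
#include <cassert>

// One typographic point is 1/72 inch; one DIP is 1/96 inch.
inline float PointsToDips(float points) {
    assert(points >= 0.0f);
    return points * 96.0f / 72.0f;
}
```

With this, asking for a 12 point font means passing PointsToDips(12.0f), i.e. 16 DIPs, as the size argument.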

Rendering Text


At this point we're ready to start rendering to our back buffer. The question is, do we want our text to render in front of whatever is on the screen, or behind it? This is something you have to determine on a case-by-case basis depending on what the text actually is (for instance, if it's on the screen of a computer in the game, whatever you're holding might obscure it).

In our case we desire the text to be topmost, so we render our text as the last thing in the rendering chain before presenting.
    {
        m_deviceContext->ClearRenderTargetView(m_backBufferRTV, clearColor);

        // Draw our triangle first
        m_deviceContext->Draw(3, 0);

        // Then render our text over it.
        m_d2dRenderTarget->BeginDraw();
        m_d2dRenderTarget->DrawText(m_text.c_str(), m_text.length(), m_dwFormat, D2D1::RectF(0, 0, 512, 512), m_d2dSolidBrush);
        m_d2dRenderTarget->EndDraw();

        m_swapChain->Present(0, 0);
    }

Full Sample


#include #include #include #include #include #include #include #include #include #include #pragma comment(lib, "d3d11.lib")#pragma comment(lib, "d2d1.lib")#pragma comment(lib, "dwrite.lib")#pragma comment(lib, "d3dcompiler.lib")#ifdef UNICODEtypedef std::wstring tstring;typedef wchar_t tchar;#elsetypedef std::string tstring;typedef char tchar;#endifstruct Vertex { float position[4]; float color[4];};class MainWindow : public CWindowImpl {public: MainWindow() { RECT bounds = { 0, 0, 800, 600 }; AdjustWindowRect(&bounds, WS_OVERLAPPEDWINDOW, false); bounds = { 0, 0, bounds.right - bounds.left, bounds.bottom - bounds.top }; Create(nullptr, bounds, _T("D3DSample Window"), WS_OVERLAPPEDWINDOW); ShowWindow(SW_SHOW); // A traditional text. For a traditional time. m_text = _T("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. 
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."); } bool ProcessMessages() { MSG msg; while (PeekMessage(&msg, nullptr, 0, 0, PM_REMOVE) != 0) { if (msg.message == WM_QUIT) return false; TranslateMessage(&msg); DispatchMessage(&msg); } return true; } void Present() { static float clearColor[] = { 0, 0, 0, 1 }; { m_deviceContext->OMSetRenderTargets(1, &m_backBufferRTV.p, nullptr); m_deviceContext->IASetInputLayout(m_inputLayout); m_deviceContext->VSSetShader(m_vertexShader, nullptr, 0); m_deviceContext->PSSetShader(m_pixelShader, nullptr, 0); m_deviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST); size_t stride = sizeof(Vertex); size_t offsets = 0; m_deviceContext->IASetVertexBuffers(0, 1, &m_vertexBuffer.p, &stride, &offsets); } { m_deviceContext->ClearRenderTargetView(m_backBufferRTV, clearColor); // Draw our triangle first m_deviceContext->Draw(3, 0); // Then render our text over it. m_d2dRenderTarget->BeginDraw(); m_d2dRenderTarget->DrawText(m_text.c_str(), m_text.length(), m_dwFormat, D2D1::RectF(0, 0, 512, 512), m_d2dSolidBrush); m_d2dRenderTarget->EndDraw(); m_swapChain->Present(0, 0); } }public: BEGIN_MSG_MAP(MainWindow) MESSAGE_HANDLER(WM_DESTROY, [](unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) { PostQuitMessage(0); return 0; }); MESSAGE_HANDLER(WM_SIZE, OnSize); MESSAGE_HANDLER(WM_CREATE, OnCreate); END_MSG_MAP()private: HRESULT CreateD3DVertexAndShaders() { // Hard coded shaders, not a great idea, but works for the sample. 
std::string vertexShader = "struct VS_IN { float4 pos : POSITION; float4 col : COLOR; }; struct PS_IN { float4 pos : SV_POSITION; float4 col : COLOR; }; PS_IN main( VS_IN input ) { PS_IN output = (PS_IN)0; output.pos = input.pos; output.col = input.col; return output; }"; std::string pixelShader = "struct VS_IN { float4 pos : POSITION; float4 col : COLOR; }; struct PS_IN { float4 pos : SV_POSITION; float4 col : COLOR; }; float4 main( PS_IN input ) : SV_Target { return input.col; }"; // If compilation fails, we don't report the errors, just that it failed. CComPtr vsBlob; CComPtr vsError; auto result = D3DCompile(vertexShader.c_str(), vertexShader.length() * sizeof(tchar), nullptr, nullptr, nullptr, "main", "vs_5_0", 0, 0, &vsBlob, &vsError); if (FAILED(result)) { std::cout << "Failed to compile vertex shader." << std::endl; std::cout << "Error was: " << std::hex << result << std::endl; return result; } // If compilation fails, we don't report the errors, just that it failed. CComPtr psBlob; CComPtr psError; result = D3DCompile(pixelShader.c_str(), pixelShader.length() * sizeof(tchar), nullptr, nullptr, nullptr, "main", "ps_5_0", 0, 0, &psBlob, &psError); if (FAILED(result)) { std::cout << "Failed to compile pixel shader." << std::endl; std::cout << "Error was: " << std::hex << result << std::endl; return result; } CComPtr inputLayoutBlob; result = D3DGetInputSignatureBlob(vsBlob->GetBufferPointer(), vsBlob->GetBufferSize(), &inputLayoutBlob); if (FAILED(result)) { std::cout << "Failed to get input layout." << std::endl; std::cout << "Error was: " << std::hex << result << std::endl; return result; } // Hard coded triangle. Tis a silly idea, but works for the sample. 
        Vertex vertices[] = {
            { 0.0, 0.5, 0.5, 1.0, 1.0, 0.0, 0.0, 1.0 },
            { 0.5f, -0.5f, 0.5f, 1.0, 0.0, 1.0, 0.0, 1.0 },
            { -0.5f, -0.5f, 0.5f, 1.0, 0.0, 0.0, 1.0, 1.0 }
        };

        D3D11_BUFFER_DESC desc = {
            sizeof(vertices),
            D3D11_USAGE_DEFAULT,
            D3D11_BIND_VERTEX_BUFFER
        };
        D3D11_SUBRESOURCE_DATA data = { vertices };

        result = m_device->CreateBuffer(&desc, &data, &m_vertexBuffer);
        if (FAILED(result)) {
            std::cout << "Failed to create vertex buffer." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        D3D11_INPUT_ELEMENT_DESC inputElementDesc[] = {
            { "POSITION", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 0 },
            { "COLOR", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 16 }
        };

        result = m_device->CreateInputLayout(inputElementDesc, sizeof(inputElementDesc) / sizeof(D3D11_INPUT_ELEMENT_DESC), inputLayoutBlob->GetBufferPointer(), inputLayoutBlob->GetBufferSize(), &m_inputLayout);
        if (FAILED(result)) {
            std::cout << "Failed to create input layout." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        result = m_device->CreateVertexShader(vsBlob->GetBufferPointer(), vsBlob->GetBufferSize(), nullptr, &m_vertexShader);
        if (FAILED(result)) {
            std::cout << "Failed to create vertex shader." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        result = m_device->CreatePixelShader(psBlob->GetBufferPointer(), psBlob->GetBufferSize(), nullptr, &m_pixelShader);
        if (FAILED(result)) {
            std::cout << "Failed to create pixel shader." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        return S_OK;
    }

    HRESULT CreateD3DResources() {
        D3D_FEATURE_LEVEL featureLevels[] = {
            D3D_FEATURE_LEVEL_11_1,
            D3D_FEATURE_LEVEL_11_0,
            D3D_FEATURE_LEVEL_10_1,
            D3D_FEATURE_LEVEL_10_0,
        };

        // We only want to draw to the portion of the window that is the client rect.
        // This will also work for dialog / borderless windows.
        RECT clientRect;
        GetClientRect(&clientRect);

        DXGI_SWAP_CHAIN_DESC swapChainDesc = {
            {
                clientRect.right,
                clientRect.bottom,
                { 60, 1 },
                DXGI_FORMAT_R8G8B8A8_UNORM,
                DXGI_MODE_SCANLINE_ORDER_UNSPECIFIED,
                DXGI_MODE_SCALING_UNSPECIFIED
            },
            { 1, 0 },
            DXGI_USAGE_BACK_BUFFER | DXGI_USAGE_RENDER_TARGET_OUTPUT,
            1,
            m_hWnd,
            true,
            DXGI_SWAP_EFFECT_DISCARD,
            DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH
        };

        // At the moment we don't actually care about what feature level we got back, so we don't keep this around just yet.
        D3D_FEATURE_LEVEL featureLevel;

        auto result = D3D11CreateDeviceAndSwapChain(
            nullptr,
            D3D_DRIVER_TYPE_HARDWARE,
            nullptr,
            // BGRA support is necessary for D2D functionality.
            D3D11_CREATE_DEVICE_BGRA_SUPPORT | D3D11_CREATE_DEVICE_DEBUG,
            // D2D works with all of our feature levels, so we don't actually care which one we get.
            featureLevels,
            sizeof(featureLevels) / sizeof(D3D_FEATURE_LEVEL),
            D3D11_SDK_VERSION,
            &swapChainDesc,
            &m_swapChain,
            &m_device,
            &featureLevel,
            &m_deviceContext
        );
        if (FAILED(result)) {
            std::cout << "Failed to create D3D device and DXGI swap chain." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        // Let's create our D2D factory and DWrite factory at this point as well, that way if any of them fail we'll fail out completely.
        auto options = D2D1_FACTORY_OPTIONS();
        options.debugLevel = D2D1_DEBUG_LEVEL_INFORMATION;
        result = D2D1CreateFactory(D2D1_FACTORY_TYPE_MULTI_THREADED, options, &m_d2dFactory);
        if (FAILED(result)) {
            std::cout << "Failed to create multithreaded D2D factory." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        result = DWriteCreateFactory(DWRITE_FACTORY_TYPE_SHARED, __uuidof(IDWriteFactory), reinterpret_cast<IUnknown**>(&m_dwFactory));
        if (FAILED(result)) {
            std::cout << "Failed to create DirectWrite Factory." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        return S_OK;
    }

    HRESULT CreateBackBufferTarget() {
        CComPtr<ID3D11Texture2D> backBuffer;

        // Get a pointer to our back buffer texture.
        auto result = m_swapChain->GetBuffer(0, IID_PPV_ARGS(&backBuffer));
        if (FAILED(result)) {
            std::cout << "Failed to get back buffer." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        // We acquire a render target view to the entire surface (no parameters), with nothing special about it.
        result = m_device->CreateRenderTargetView(backBuffer, nullptr, &m_backBufferRTV);
        if (FAILED(result)) {
            std::cout << "Failed to create render target view for back buffer." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        return S_OK;
    }

    HRESULT CreateD2DResources() {
        CComPtr<IDXGISurface> backBufferSurface;

        // Get a DXGI surface for D2D use.
        auto result = m_swapChain->GetBuffer(0, IID_PPV_ARGS(&backBufferSurface));
        if (FAILED(result)) {
            std::cout << "Failed to get DXGI surface for back buffer." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        // Proper DPI support is very important. Most applications do stupid things like hard coding this,
        // which is why you can't use proper DPI on most monitors in Windows yet.
        float dpiX;
        float dpiY;
        m_d2dFactory->GetDesktopDpi(&dpiX, &dpiY);

        // DXGI_FORMAT_UNKNOWN will cause it to use the same format as the back buffer (R8G8B8A8_UNORM)
        auto d2dRTProps = D2D1::RenderTargetProperties(D2D1_RENDER_TARGET_TYPE_DEFAULT, D2D1::PixelFormat(DXGI_FORMAT_UNKNOWN, D2D1_ALPHA_MODE_PREMULTIPLIED), dpiX, dpiY);

        // Wraps up our DXGI surface in a D2D render target.
        result = m_d2dFactory->CreateDxgiSurfaceRenderTarget(backBufferSurface, &d2dRTProps, &m_d2dRenderTarget);
        if (FAILED(result)) {
            std::cout << "Failed to create D2D DXGI Render Target." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        // This is the brush we will be using to render our text. It does not need to be a solid color,
        // we could use any brush we wanted. In this case we chose a nice solid lime green brush.
        result = m_d2dRenderTarget->CreateSolidColorBrush(D2D1::ColorF(D2D1::ColorF::LimeGreen), &m_d2dSolidBrush);
        if (FAILED(result)) {
            std::cout << "Failed to create solid color brush." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        return S_OK;
    }

    HRESULT CreateDWriteResources() {
        auto result = m_dwFactory->CreateTextFormat(L"Consolas", nullptr, DWRITE_FONT_WEIGHT_NORMAL, DWRITE_FONT_STYLE_NORMAL, DWRITE_FONT_STRETCH_NORMAL, 14.0f, L"", &m_dwFormat);
        if (FAILED(result)) {
            std::cout << "Failed to create DirectWrite text format." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return result;
        }

        return S_OK;
    }

private:
    LRESULT OnSize(unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) {
        // We need to release everything that may be holding a reference to the back buffer.
        // This includes D2D interfaces as well, as they hold a reference to the DXGI surface.
        m_backBufferRTV.Release();
        m_d2dRenderTarget.Release();
        m_d2dSolidBrush.Release();

        // And we make sure that we do not have any render targets bound either, which could
        // also be holding references to the back buffer.
        m_deviceContext->ClearState();

        int width = LOWORD(lParam);
        int height = HIWORD(lParam);

        auto result = m_swapChain->ResizeBuffers(1, width, height, DXGI_FORMAT_UNKNOWN, DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH);
        if (FAILED(result)) {
            std::cout << "Failed to resize swap chain." << std::endl;
            std::cout << "Error was: " << std::hex << result << std::endl;
            return -1;
        }

        // We need to recreate those resources we disposed of above, including our D2D interfaces
        if (FAILED(CreateBackBufferTarget()))
            return -1;
        if (FAILED(CreateD2DResources())) {
            return -1;
        }

        D3D11_VIEWPORT viewport = {
            0.0f,
            0.0f,
            static_cast<float>(width),
            static_cast<float>(height),
            0.0f,
            1.0f
        };

        // We set up our viewport here as the size of the viewport is known at this point, WM_SIZE will be sent after a WM_CREATE.
        m_deviceContext->RSSetViewports(1, &viewport);

        return 0;
    }

    LRESULT OnCreate(unsigned msg, WPARAM wParam, LPARAM lParam, BOOL & bHandled) {
        if (FAILED(CreateD3DResources()))
            return -1;
        if (FAILED(CreateBackBufferTarget()))
            return -1;
        if (FAILED(CreateD3DVertexAndShaders()))
            return -1;
        if (FAILED(CreateD2DResources()))
            return -1;
        if (FAILED(CreateDWriteResources()))
            return -1;
        return 0;
    }

private:
    CComPtr<IDXGISwapChain> m_swapChain;
    CComPtr<ID3D11Device> m_device;
    CComPtr<ID3D11DeviceContext> m_deviceContext;
    CComPtr<ID3D11RenderTargetView> m_backBufferRTV;

    CComPtr<ID3D11Buffer> m_vertexBuffer;
    CComPtr<ID3D11InputLayout> m_inputLayout;
    CComPtr<ID3D11VertexShader> m_vertexShader;
    CComPtr<ID3D11PixelShader> m_pixelShader;

    CComPtr<ID2D1Factory> m_d2dFactory;
    CComPtr<ID2D1RenderTarget> m_d2dRenderTarget;
    CComPtr<ID2D1SolidColorBrush> m_d2dSolidBrush;

    CComPtr<IDWriteFactory> m_dwFactory;
    CComPtr<IDWriteTextLayout> m_dwLayout;
    CComPtr<IDWriteTextFormat> m_dwFormat;

    tstring m_text;
};

int main() {
    MainWindow window;
    float clearColor[] = { 0, 0, 0, 0 };

    while (true) {
        if (!window.ProcessMessages())
            break;
        window.Present();
    }

    return 0;
}
Washu

C++ Quiz #4

This is a test of your knowledge of C++, not of your compiler's knowledge of C++. Using a compiler during this test will likely give you the wrong answers, or at least incomplete ones.




  1. What is the value of i after the first numbered line is evaluated?

  2. What do you expect the second numbered line to print out?

  3. What is the value of p->c after the third numbered line is evaluated?

  4. What does the fourth numbered line print?



    struct C;
    void f(C* p);

    struct C {
        int c;
        C() : c(1) {
            f(this);
        }
    };

    const C obj;
    void f(C* p) {
        int i = obj.c << 2; //1
        std::cout << p->c << std::endl; //2
        p->c = i; //3
        std::cout << obj.c << std::endl; //4
    }

  5. What should you expect the compiler to do on the first numbered line? Why?


  6. What should you expect the value of j to be after the second numbered line is evaluated? Why?



    struct X {
        operator int() {
            return 314159;
        }
    };
    struct Y {
        operator X() {
            return X();
        }
    };
    Y y;
    int i = y; //1
    int j = X(y); //2

  7. What should you expect the compiler to do on the first and second numbered lines? Why?



    struct Z {
        Z() {}
        explicit Z(int) {}
    };
    Z z1 = 1; //1
    Z z2 = static_cast<Z>(1); //2

  8. What should you expect the behavior of each of the numbered lines, irrespective of the other lines, to be?



    struct Base {
        virtual ~Base() {}
    };
    struct Derived : Base {
        ~Derived() {}
    };
    typedef Base Base2;
    Derived d;
    Base* p = &d;
    void f() {
        d.Base::~Base(); //1
        p->~Base(); //2
        p->~Base2(); //3
        p->Base2::~Base(); //4
        p->Base2::~Base2(); //5
    }



Washu

C++ Quiz #3

This is a test of your knowledge of C++, not of your compiler's knowledge of C++. Using a compiler during this test will likely give you the wrong answers, or at least incomplete ones.



Given the following code:



class Base {
public:
    virtual ~Base() {}
    virtual void DoSomething() {}
    void Mutate();
};

class Derived : public Base {
public:
    virtual void DoSomething() {}
};

void Base::Mutate() {
    new (this) Derived; // 1
}
void f() {
    void* v = ::operator new(sizeof(Base) + sizeof(Derived));
    Base* p = new (v) Base();
    p->DoSomething(); // 2
    p->Mutate(); // 3
    void* vp = p; // 4
    p->DoSomething(); // 5
}



  1. Does the first numbered line result in defined behavior? (Yes/No)

  2. What should the first numbered line do?

  3. Do the second and third numbered lines produce defined behavior? (Yes/No)

  4. Does the fourth numbered line produce defined behavior? If so, why? If not, why?

  5. Does the fifth numbered line produce defined behavior? If so, why? If not, why?


  6. What is the behavior of calling void exit(int);?




Given the following code:



struct T {};
struct B {
    ~B();
};

void h() {
    B b;
    new (&b) T; // 1
    return; // 2
}



  1. Does the first numbered line result in defined behavior?

  2. Is the behavior of the second line defined? If so, why? If not, why is the behavior not defined?


  3. What is the behavior of int& p = *(int*)0;? Why does it have that behavior? Is this a null reference?


  4. What is the behavior of p->I::~I(); if I is defined as typedef int I; and p is defined as I* p;?




Washu

C++ Quiz #2

This is a test of your knowledge of C++, not of your compiler's knowledge of C++. Using a compiler during this test will likely give you the wrong answers, or at least incomplete ones.




  1. Using the code below as a reference, explain what behavior should be expected of each of the commented lines. Please keep your answers very short.



    struct Base {
        virtual void Arr();
    };
    struct SubBase1 : virtual Base { };
    struct SubBase2 : virtual Base {
        SubBase2(Base*, SubBase1*);
        virtual void Arr();
    };

    struct Derived : SubBase1, SubBase2 {
        Derived() : SubBase2((SubBase1*)this, this) { }
    };
    SubBase2::SubBase2(Base* a, SubBase1* b) {
        typeid(*a); //1
        dynamic_cast<SubBase2*>(a); //2
        typeid(*b); //3
        dynamic_cast<SubBase1*>(b); //4
        a->Arr(); //5
        b->Arr(); //6
    }

  2. Using the code below as a reference, explain what behavior should be expected of each of the commented lines.



    template<class T> class X {
        X<T>* p; //1
        X<T> a; //2
    };



Washu
So far I've covered how SlimGen works and the difficulties in doing what it does, including the calling convention issues you must be aware of when writing replacement methods for use with SlimGen.
So the next question arises: just how much of a difference can using SlimGen make? A lot of that will depend on the developer and their skill level. We were pretty curious about this too, so we slapped together a test sample that runs through a series of matrix multiplications and times them. It uses three arrays to perform the multiplications: two of the arrays contain 100,000 randomly generated matrices each, and the third is used as the destination for the results. Both matrix multiplications (the SlimGen one and the .Net one) assume that a source can also be used as a destination, and so they are overlap safe.
The timing results will vary, of course, from machine to machine depending on the processor, how much RAM you have, and what else the machine is doing at the time. Running the test on my Phenom 9850 I get:

    Total Matrix Count Per Run: 100,000
    Multiply Total Ticks: 2,001,059
    SlimGenMultiply Total Ticks: 1,269,200
    Improvement: 36.57 %

While when I run it on my T8300 Core2 Duo laptop I get:

    Total Matrix Count Per Run: 100,000
    Multiply Total Ticks: 2,175,380
    SlimGenMultiply Total Ticks: 1,621,830
    Improvement: 25.45 %
Still, a 25-37% improvement over the FPU based multiply is quite significant. Since x64 support hasn't been fully hammered out (in that it "works" but hasn't been sufficiently verified as working), those numbers are unavailable at the moment. However, they should be available in the near future as we finalize error handling and ensure that there are no bugs in the x64 assembly handling.
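For reference, the improvement figures above are simply the reduction in ticks relative to the .Net baseline. A minimal C++ sketch of that arithmetic follows; the function name ImprovementPercent is made up for illustration and is not part of the test sample.

```cpp
#include <cassert>
#include <cstdint>

// Percentage reduction in elapsed ticks relative to the baseline run.
// ImprovementPercent is a hypothetical helper, not part of SlimGen.
double ImprovementPercent(std::uint64_t baselineTicks, std::uint64_t optimizedTicks) {
    assert(baselineTicks > 0);
    const double saved = static_cast<double>(baselineTicks) -
                         static_cast<double>(optimizedTicks);
    return 100.0 * saved / static_cast<double>(baselineTicks);
}
```

Calling ImprovementPercent(2001059, 1269200) reproduces the Phenom figure, and ImprovementPercent(2175380, 1621830) reproduces the Core2 Duo figure, to within rounding.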
So why the great difference in performance? Part of it is the method size. The .Net method is 566 bytes of pure code; that's over half a kilobyte of code that has to be brought into the instruction cache on the CPU and executed. The SSE2 method is around half that size, at 266 bytes. The smaller your footprint in the I-cache, the fewer hits you take and the more likely your code is to actually be IN the I-cache. Then there are the instructions themselves: SSE2 has been around for a while, so CPU manufacturers have had plenty of time to tune it for optimal performance. There's also the memory access pattern: the SSE2 based code touches memory a minimal number of times, which, except in a few cases, reduces the chance of cache misses after the first read/write.
Finally there's how each version deals with storage of the temporary results. The .Net FPU based version allocates a Matrix type on the stack, calls the constructor (which zero-initializes it), and then proceeds to overwrite those entries one by one with the results of each set of dot products. At the end of the method it does what amounts to a memcpy, and copies the temporary matrix over the result matrix. The SSE2 version, however, doesn't bother with initializing the stack and only stores three of the result rows on the stack, opting to write the final row directly to the destination. The three other rows are then moved back into XMM registers and then back out to the destination.
The SSE2 source code is shown first, followed by the .Net source code. Note that both are functionally equivalent:

    start:
    mov eax, [esp + 4]
    movups xmm4, [edx]
    movups xmm5, [edx + 0x10]
    movups xmm6, [edx + 0x20]
    movups xmm7, [edx + 0x30]
    movups xmm0, [ecx]
    movaps xmm1, xmm0
    movaps xmm2, xmm0
    movaps xmm3, xmm0
    shufps xmm0, xmm1, 0x00
    shufps xmm1, xmm1, 0x55
    shufps xmm2, xmm2, 0xAA
    shufps xmm3, xmm3, 0xFF
    mulps xmm0, xmm4
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7
    addps xmm0, xmm2
    addps xmm1, xmm3
    addps xmm0, xmm1
    movups [esp - 0x20], xmm0 ; store row 0 of new matrix
    movups xmm0, [ecx + 0x10]
    movaps xmm1, xmm0
    movaps xmm2, xmm0
    movaps xmm3, xmm0
    shufps xmm0, xmm0, 0x00
    shufps xmm1, xmm1, 0x55
    shufps xmm2, xmm2, 0xAA
    shufps xmm3, xmm3, 0xFF
    mulps xmm0, xmm4
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7
    addps xmm0, xmm2
    addps xmm1, xmm3
    addps xmm0, xmm1
    movups [esp - 0x30], xmm0 ; store row 1 of new matrix
    movups xmm0, [ecx + 0x20]
    movaps xmm1, xmm0
    movaps xmm2, xmm0
    movaps xmm3, xmm0
    shufps xmm0, xmm0, 0x00
    shufps xmm1, xmm1, 0x55
    shufps xmm2, xmm2, 0xAA
    shufps xmm3, xmm3, 0xFF
    mulps xmm0, xmm4
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7
    addps xmm0, xmm2
    addps xmm1, xmm3
    addps xmm0, xmm1
    movups [esp - 0x40], xmm0 ; store row 2 of new matrix
    movups xmm0, [ecx + 0x30]
    movaps xmm1, xmm0
    movaps xmm2, xmm0
    movaps xmm3, xmm0
    shufps xmm0, xmm0, 0x00
    shufps xmm1, xmm1, 0x55
    shufps xmm2, xmm2, 0xAA
    shufps xmm3, xmm3, 0xFF
    mulps xmm0, xmm4
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7
    addps xmm0, xmm2
    addps xmm1, xmm3
    addps xmm0, xmm1
    movups [eax + 0x30], xmm0 ; store row 3 of new matrix
    movups xmm0, [esp - 0x40]
    movups [eax + 0x20], xmm0
    movups xmm0, [esp - 0x30]
    movups [eax + 0x10], xmm0
    movups xmm0, [esp - 0x20]
    movups [eax], xmm0
    ret 4
The .Net matrix multiplication source code:

    public static void Multiply(ref Matrix left, ref Matrix right, out Matrix result) {
        Matrix r;
        r.M11 = (left.M11 * right.M11) + (left.M12 * right.M21) + (left.M13 * right.M31) + (left.M14 * right.M41);
        r.M12 = (left.M11 * right.M12) + (left.M12 * right.M22) + (left.M13 * right.M32) + (left.M14 * right.M42);
        r.M13 = (left.M11 * right.M13) + (left.M12 * right.M23) + (left.M13 * right.M33) + (left.M14 * right.M43);
        r.M14 = (left.M11 * right.M14) + (left.M12 * right.M24) + (left.M13 * right.M34) + (left.M14 * right.M44);
        r.M21 = (left.M21 * right.M11) + (left.M22 * right.M21) + (left.M23 * right.M31) + (left.M24 * right.M41);
        r.M22 = (left.M21 * right.M12) + (left.M22 * right.M22) + (left.M23 * right.M32) + (left.M24 * right.M42);
        r.M23 = (left.M21 * right.M13) + (left.M22 * right.M23) + (left.M23 * right.M33) + (left.M24 * right.M43);
        r.M24 = (left.M21 * right.M14) + (left.M22 * right.M24) + (left.M23 * right.M34) + (left.M24 * right.M44);
        r.M31 = (left.M31 * right.M11) + (left.M32 * right.M21) + (left.M33 * right.M31) + (left.M34 * right.M41);
        r.M32 = (left.M31 * right.M12) + (left.M32 * right.M22) + (left.M33 * right.M32) + (left.M34 * right.M42);
        r.M33 = (left.M31 * right.M13) + (left.M32 * right.M23) + (left.M33 * right.M33) + (left.M34 * right.M43);
        r.M34 = (left.M31 * right.M14) + (left.M32 * right.M24) + (left.M33 * right.M34) + (left.M34 * right.M44);
        r.M41 = (left.M41 * right.M11) + (left.M42 * right.M21) + (left.M43 * right.M31) + (left.M44 * right.M41);
        r.M42 = (left.M41 * right.M12) + (left.M42 * right.M22) + (left.M43 * right.M32) + (left.M44 * right.M42);
        r.M43 = (left.M41 * right.M13) + (left.M42 * right.M23) + (left.M43 * right.M33) + (left.M44 * right.M43);
        r.M44 = (left.M41 * right.M14) + (left.M42 * right.M24) + (left.M43 * right.M34) + (left.M44 * right.M44);
        result = r;
    }
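For readers who would rather see the same row-at-a-time idea in C++ than in raw assembly, here is a minimal sketch using SSE intrinsics. It is not SlimGen's code: Matrix4, MultiplySse and MultiplyScalar are hypothetical names, the row-major float[16] layout is an assumption, and the partial sums are added in a slightly different order than in the assembly. Each result row is built by broadcasting one element of the left row across a register, multiplying it by the matching row of the right matrix, and summing the four products.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics; x86/x64 targets only

// Hypothetical stand-in for a 4x4 matrix, stored row-major.
struct Matrix4 {
    float m[16];
};

// Plain scalar reference, equivalent in spirit to the .Net version.
void MultiplyScalar(const Matrix4& left, const Matrix4& right, Matrix4& result) {
    Matrix4 tmp;
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += left.m[row * 4 + k] * right.m[k * 4 + col];
            }
            tmp.m[row * 4 + col] = sum;
        }
    }
    result = tmp;  // copying at the end keeps the function overlap safe
}

// SSE version: one broadcast, multiply and add per element of the left row.
void MultiplySse(const Matrix4& left, const Matrix4& right, Matrix4& result) {
    // Load the four rows of the right-hand matrix once (unaligned loads).
    const __m128 r0 = _mm_loadu_ps(right.m + 0);
    const __m128 r1 = _mm_loadu_ps(right.m + 4);
    const __m128 r2 = _mm_loadu_ps(right.m + 8);
    const __m128 r3 = _mm_loadu_ps(right.m + 12);

    Matrix4 tmp;
    for (int row = 0; row < 4; ++row) {
        const float* l = left.m + row * 4;
        __m128 acc = _mm_mul_ps(_mm_set1_ps(l[0]), r0);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(l[1]), r1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(l[2]), r2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(l[3]), r3));
        _mm_storeu_ps(tmp.m + row * 4, acc);
    }
    result = tmp;  // result may alias left or right
}
```

Writing into a temporary and copying at the end is the simplest way to stay overlap safe in a sketch like this; the hand-written assembly gets the same effect more cheaply by parking only three rows on the stack.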

Washu
The question does arise though, when using SlimGen and writing your SSE replacement methods, what kind of calling convention does the CLR use?
The CLR uses a version of fastcall. On x86 processors this means that the first two parameters (that are DWORD or smaller) are passed in ECX and EDX. However, and this is where the CLR differs from standard fastcall, the parameters after the first two are pushed onto the stack from left to right, not right to left. This is important to remember, especially for functions that take a variable number of arguments. So a call like X('c', 2, 3.0f, "Hello"); becomes:

    X('c', 2, 3.0f, "Hello");
    00000025 push 40400000h ; 3.0f
    0000002a push dword ptr ds:[03402088h] ;Address of "Hello"
    00000030 mov edx,2
    00000035 mov ecx,63h ;'c'
    0000003a call FFB8B040

The situation is the same for member functions as well, except with this being passed in ECX, which leaves only EDX to hold an additional parameter. The rest are passed on the stack as before:

    p.Y(2, 3.0f);
    0000006d push 40400000h ; 3.0f
    00000072 mov ecx,dword ptr [ebp-40h] ;this
    00000075 mov edx,2
    0000007c call FFA1B048
So this all seems clear enough, but it's important to note these differences, especially when you're poking around in the low level bowels of the CLR or when you're doing what SlimGen does: replacing actual method bodies.
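To make the x86 rules above concrete, here is a small C++ model of where each argument ends up. This is only a sketch of the behavior described in this post, not CLR code: ArgKind, X86Placement and PlaceArgumentsX86 are hypothetical names, and the model assumes that only integer or pointer arguments of DWORD size or smaller are eligible for ECX and EDX.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified argument categories for this sketch (hypothetical, not a CLR type).
enum class ArgKind { IntOrPointer32, Float32, LargeValue };

struct X86Placement {
    std::vector<std::string> location;   // "ECX", "EDX" or "stack", one per argument
    std::vector<std::size_t> pushOrder;  // indices of stack arguments, in push order
};

// Models the convention described above: the first two eligible arguments go
// in ECX and EDX, everything else is pushed in left-to-right order.
X86Placement PlaceArgumentsX86(const std::vector<ArgKind>& args, bool hasThis) {
    static const char* const registers[] = { "ECX", "EDX" };
    std::size_t nextRegister = hasThis ? 1 : 0;  // 'this' consumes ECX

    X86Placement out;
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] == ArgKind::IntOrPointer32 && nextRegister < 2) {
            out.location.push_back(registers[nextRegister++]);
        } else {
            out.location.push_back("stack");
            out.pushOrder.push_back(i);  // pushed in source order, left to right
        }
    }
    assert(out.location.size() == args.size());
    return out;
}
```

For the X('c', 2, 3.0f, "Hello") call this yields ECX, EDX, stack, stack with the float pushed before the string reference, which matches the disassembly. For p.Y(2, 3.0f) with hasThis set, the 2 lands in EDX and the float on the stack.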
This raises the question: what about the x64 platform? Well, again, the calling convention is fastcall with a few differences. The first four parameters are passed in RCX, RDX, R8 and R9 (or their smaller sub-registers), unless those parameters are floating point types, in which case they are passed using XMM registers.

    Z('c', 2, 3.0f, "Hello", 1.0, pa);
    000000c0 mov r9,124D3100h
    000000ca mov r9,qword ptr [r9] ; "Hello"
    000000cd mov rax,qword ptr [rsp+38h] ;pa (IntPtr[])
    000000d2 mov qword ptr [rsp+28h],rax ;pa - stack spill
    000000d7 movsd xmm0,mmword ptr [00000118h] ;1.0
    000000df movsd mmword ptr [rsp+20h],xmm0 ;1.0 - stack spill
    000000e5 movss xmm2,dword ptr [00000110h] ;3.0f
    000000ed mov edx,2 ;int (2)
    000000f2 mov cx,63h ;'c'
    000000f6 call FFFFFFFFFFEC9300
Whew, that looks pretty nasty, doesn't it? But if you look closely, each of the first four parameters to that function is passed in a register. The fifth and sixth are written to stack slots at [rsp+20h] and [rsp+28h], just above the 32 bytes of stack that the calling convention reserves so that register parameters can be spilled into memory (or read back from memory) when the register needs to be used. Calling an instance method follows pretty much the same rules, except the this pointer is passed in RCX first.

    p.Q(~0L, ~1L, ~2L, ~3);
    0000010a mov rcx,qword ptr [rsp+30h] ; this pointer
    0000010f mov qword ptr [rsp+20h],0FFFFFFFFFFFFFFFCh ;~3L, spilled to stack
    00000118 mov r9,0FFFFFFFFFFFFFFFDh ;~2L
    0000011f mov r8,0FFFFFFFFFFFFFFFEh ;~1L
    00000126 mov rdx,0FFFFFFFFFFFFFFFFh ;~0L
    0000012d call FFFFFFFFFFEC9310
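The x64 listings suggest that registers are assigned by argument position rather than by counting each kind separately: the float in the third position went to XMM2, not XMM0. Here is a small C++ model of that observation. It is a sketch under that assumption, with hypothetical names (ArgKind64, PlaceArgumentsX64), and it ignores details such as large structures, which are passed as a pointer to a copy and so behave like an integer argument here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified argument categories for this sketch (hypothetical, not a CLR type).
enum class ArgKind64 { Integer, Float };

// Models position-based register assignment: slot N uses either the Nth
// integer register or the Nth XMM register; slots past the fourth use the stack.
std::vector<std::string> PlaceArgumentsX64(const std::vector<ArgKind64>& args, bool hasThis) {
    static const char* const intRegs[] = { "RCX", "RDX", "R8", "R9" };
    static const char* const fpRegs[] = { "XMM0", "XMM1", "XMM2", "XMM3" };

    std::vector<std::string> location;
    std::size_t slot = hasThis ? 1 : 0;  // 'this' occupies the first slot (RCX)
    for (ArgKind64 kind : args) {
        if (slot < 4) {
            location.push_back(kind == ArgKind64::Float ? fpRegs[slot] : intRegs[slot]);
        } else {
            location.push_back("stack");
        }
        ++slot;
    }
    assert(location.size() == args.size());
    return location;
}
```

Feeding it the shape of Z('c', 2, 3.0f, "Hello", 1.0, pa) gives RCX, RDX, XMM2, R9, stack, stack, and the shape of p.Q(~0L, ~1L, ~2L, ~3) with hasThis set gives RDX, R8, R9, stack, both of which line up with the disassembly above.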
Calling a function and passing something larger than a register can store does pose an interesting problem. The CLR deals with it by copying the entire value onto the stack and passing that copy (hence call by value):

    var v = new Vector();
    p.R(v);
    00000169 lea rcx,[rsp+40h]
    0000016e mov rax,qword ptr [rcx]
    00000171 mov qword ptr [rsp+50h],rax
    00000176 mov rax,qword ptr [rcx+8]
    0000017a mov qword ptr [rsp+58h],rax
    0000017f lea rdx,[rsp+50h]
    00000184 mov rcx,r8
    00000187 call FFFFFFFFFFEC9318

As you can see, it copies the data from the vector onto the stack, puts the address of that copy in RDX, stores the this pointer in RCX, and then calls the function. This is why pass by reference is the preferred method (for fast code) to move around structures that are non-trivial.
All of this goes into writing our matrix multiplication method (which assumes the output is not one of the inputs):

    BITS 32
    ORG 0x59f0
    ; void Multiply(ref Matrix, ref Matrix, out Matrix)
    start:
    mov eax, [esp + 4]
    movups xmm4, [edx]
    movups xmm5, [edx + 0x10]
    movups xmm6, [edx + 0x20]
    movups xmm7, [edx + 0x30]
    movups xmm0, [ecx]
    movaps xmm1, xmm0
    movaps xmm2, xmm0
    movaps xmm3, xmm0
    shufps xmm0, xmm1, 0x00
    shufps xmm1, xmm1, 0x55
    shufps xmm2, xmm2, 0xAA
    shufps xmm3, xmm3, 0xFF
    mulps xmm0, xmm4
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7
    addps xmm0, xmm2
    addps xmm1, xmm3
    addps xmm0, xmm1
    movups [eax], xmm0 ; Calculate row 0 of new matrix
    movups xmm0, [ecx + 0x10]
    movaps xmm1, xmm0
    movaps xmm2, xmm0
    movaps xmm3, xmm0
    shufps xmm0, xmm0, 0x00
    shufps xmm1, xmm1, 0x55
    shufps xmm2, xmm2, 0xAA
    shufps xmm3, xmm3, 0xFF
    mulps xmm0, xmm4
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7
    addps xmm0, xmm2
    addps xmm1, xmm3
    addps xmm0, xmm1
    movups [eax + 0x10], xmm0 ; Calculate row 1 of new matrix
    movups xmm0, [ecx + 0x20]
    movaps xmm1, xmm0
    movaps xmm2, xmm0
    movaps xmm3, xmm0
    shufps xmm0, xmm0, 0x00
    shufps xmm1, xmm1, 0x55
    shufps xmm2, xmm2, 0xAA
    shufps xmm3, xmm3, 0xFF
    mulps xmm0, xmm4
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7
    addps xmm0, xmm2
    addps xmm1, xmm3
    addps xmm0, xmm1
    movups [eax + 0x20], xmm0 ; Calculate row 2 of new matrix
    movups xmm0, [ecx + 0x30]
    movaps xmm1, xmm0
    movaps xmm2, xmm0
    movaps xmm3, xmm0
    shufps xmm0, xmm0, 0x00
    shufps xmm1, xmm1, 0x55
    shufps xmm2, xmm2, 0xAA
    shufps xmm3, xmm3, 0xFF
    mulps xmm0, xmm4
    mulps xmm1, xmm5
    mulps xmm2, xmm6
    mulps xmm3, xmm7
    addps xmm0, xmm2
    addps xmm1, xmm3
    addps xmm0, xmm1
    movups [eax + 0x30], xmm0 ; Calculate row 3 of new matrix
    ret 4

Washu
So previously we delved into one of the nastier performance corners of the .Net framework. Today I'm going to introduce you to a tool, currently in development, that allows you to take those slow math functions of yours and replace them with high performance SSE optimized methods.
We've called it SlimGen, which although not exactly accurate, does fit nicely in with the other Slim projects currently underway including SlimTune, and the flagship that started it all, SlimDX.
So what does SlimGen do? Well, you pass it a .Net assembly and it replaces the native method bodies, which are generated using NGEN, with replacement ones written in assembly (for now). This modified assembly then replaces the original assembly that was stored in the native image store. SlimGen can operate on signed and unsigned assemblies alike, as the native image is not signed. More on this later.
Managed PE files contain a great deal of metadata stored in tables. You can enumerate these tables and parse them yourself, for instance if you were writing your own CLR. Thankfully though, the .Net framework comes with several COM interfaces that are very helpful in accessing these tables without having to manually parse them out of the PE file. This is especially useful since the table rows are not a fixed format. Specifically, indexes in the tables can be either 2 bytes or 4 bytes in size depending on the size of the dataset indexed. In the case of SlimGen we use the IMetaDataImport2 interface for accessing the metadata.
Of course, the managed metadata does not contain all of the information we need. NGEN manipulates the managed assembly and introduces pre-jitted versions of the functions contained within the assembly. However, their managed counterparts remain in the assembly and are what the metadata tables refer to. So how does one go from a managed method and its IL to the associated unmanaged code? Well, the CLR header of a PE file does contain a pointer to a table for a native header. However, the exact format of that table is undocumented, which makes it hard to parse and find the information we need. Therefore we have to use an alternative method...
When you load up an assembly the CLR generates, using the metadata and other information found in the PE file, a set of runtime tables that it uses to track where things are in memory and their current state. For instance, it can tell whether it has jitted a method or not. When you load up an assembly that's been NGENed, it checks the native images for an associated copy and, assuming your assembly validates, loads the NGENed assembly and parses the appropriate information out of that. Therefore we need some way of gaining access to these runtime generated tables. Enter the debugger.
The .Net framework exposes debugging interfaces that are quite trivial to implement, but more importantly, they give you access to all of the runtime information available to the CLR. In the case of SlimGen what we do is load your assembly (without running it) into a host process and then simply have the host process execute a debugger breakpoint. The SlimGen analyzer first initializes itself as a debugger and then executes the host process with itself attached as the debugger. When the breakpoint is hit, it breaks into the analyzer, which can then begin the work of processing the loaded assemblies. Since SlimGen knows which assembly it fed to the host, it is able to filter out all of the other assemblies that have been loaded and focus in on the one we care about. First we check whether a native version of the assembly has been loaded, for if one hasn't been loaded there is no point in continuing; in that case we simply report an error and clean up. Assuming there is a native version of the assembly loaded, we use the aforementioned metadata interfaces to walk the assembly and find all of the methods that have been marked for replacement. Each method is examined to ensure that it has a native counterpart, and if it doesn't, a warning is issued and the method is skipped.
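As an illustration of why the rows are not a fixed format, here is a minimal C++ sketch of the index sizing rule as I understand it from the ECMA-335 metadata specification: a simple index into one table is 2 bytes when that table has fewer than 2^16 rows, and a coded index (which spends a few low bits on a tag selecting the target table) is 2 bytes only when the largest candidate table still fits in the remaining bits. The function names are made up for this example and are not part of SlimGen or the metadata COM interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Size in bytes of a simple index into a single metadata table.
unsigned SimpleIndexSize(std::uint32_t rowCount) {
    return rowCount < 65536u ? 2u : 4u;
}

// Size in bytes of a coded index. tagBits is the number of low bits used to
// select the target table; rowCounts lists the row counts of the candidates.
unsigned CodedIndexSize(unsigned tagBits, std::initializer_list<std::uint32_t> rowCounts) {
    assert(tagBits < 16);
    std::uint32_t maxRows = 0;
    for (std::uint32_t rows : rowCounts) {
        if (rows > maxRows) {
            maxRows = rows;
        }
    }
    return maxRows < (1u << (16 - tagBits)) ? 2u : 4u;
}
```

So a column that indexes a table with 65,535 rows is 2 bytes wide, but one more row widens that column to 4 bytes in every row that uses it, which is exactly the kind of bookkeeping the COM interfaces spare you.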
Now comes the annoying part. In .Net 1.x the framework had each method exist within a singular code chunk, which made extracting that code quite easy. However, in .Net 2.x and forward the framework allows a method to have multiple code chunks, each with a different base address and length. This is theoretically to allow an optimizer to work its magic, but it does make extracting methods harder. SlimGen will generate an assembly file per chunk, and all of the associated binaries for each chunk, generated from the assembly files, must be present for the method to be replaced. No dangling chunks please. The SlimGen analyzer extracts the base address of each chunk, along with the module base address. Using that information we can then calculate the relative virtual address of each method's native counterpart within the NGENed file.
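The address arithmetic in that last step is just a subtraction per chunk. A minimal C++ sketch, assuming each chunk is reported as an absolute address plus a length; the CodeChunk and ChunkRva types and the ToRvas function are hypothetical and do not reflect SlimGen's actual data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One native code chunk as seen in the loaded process (hypothetical type).
struct CodeChunk {
    std::uint64_t address;  // absolute address of the chunk in memory
    std::uint32_t length;   // size of the chunk in bytes
};

// The same chunk expressed relative to the start of the native image.
struct ChunkRva {
    std::uint32_t rva;
    std::uint32_t length;
};

// Converts absolute chunk addresses into relative virtual addresses by
// subtracting the module base address.
std::vector<ChunkRva> ToRvas(std::uint64_t moduleBase, const std::vector<CodeChunk>& chunks) {
    std::vector<ChunkRva> result;
    result.reserve(chunks.size());
    for (const CodeChunk& chunk : chunks) {
        assert(chunk.address >= moduleBase);  // a chunk must lie inside its module
        const std::uint64_t offset = chunk.address - moduleBase;
        result.push_back({ static_cast<std::uint32_t>(offset), chunk.length });
    }
    return result;
}
```

With a made-up module base of 0x10000000 and a chunk at 0x10001234, the resulting RVA is 0x1234, and that offset is what locates the method body inside the native image.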
Using that information the SlimGen client simply walks a copy of the native image performing the replacement of each method, and then when done (and assuming no errors), copies it back over the original NGEN image. Tada, you now have your highly optimized SSE code running in a managed application with no managed -> unmanaged transitions in sight.

Washu
Imagine you could have the safety of managed code, and the speed of SIMD all in one? Sounds like one of those weird dreams Trent has, or perhaps you are already thinking of using C++/CLI to wrap SIMD methods to help reduce the unmanaged transition overhead. You might also be thinking about pinvoking DLL methods such as those used in the D3DX framework to take advantage of its SIMD capabilities.
While all of those are quite possible, and for sufficiently large problems quite efficient too, they also have a relatively high cost of invocation. Managed to unmanaged transitions, even in the best of cases, cost a pretty penny. Registers have to be saved, marshalling of non-fundamental types has to be performed, and in many cases an interop thunk has to be created/jitted. This is a case where the best option is to do as much work as you can in one area before transitioning to the next.
But you can't always do tons of work at once; a prime example is that of managing your game state. You'll have discrete transformations of objects, but batching up those transformations to perform them all at once becomes a management nightmare. You have to craft special data structures to avoid marshalling and use pinned arrays. In general you end up doing a lot of work maintaining the two sides, you will spend plenty of time debugging your interface, and you may still not gain anything speed-wise.
If you're wondering just how bad the interop transition is, you can take a look at my previous entries, where I explored the topic in some detail.
In the .Net framework, most code runs almost as fast as, as fast as, or faster than its native counterpart. There are cases where the framework is significantly faster, and cases where it loses out by about 10% in the worst case. 10% isn't a horrible loss, and it's not a consistent loss either. The cost will vary depending on factors such as: is JITing required, is memory allocation performed, are you doing FPU math that would be vectorized in native code?
In fact, that 10% figure isn't accurate either. If a method requires JITing the first time it is called, which could cost you 10% on the first invocation, future invocations will not need JITing and so the cost may end up being the same as its native counterpart henceforth. If the method is called a thousand times, then that's only an additional 0.01% cost over the entire set of invocations.
The only real area where the .Net framework seriously loses out to unmanaged code is in the math department. The inability to use vectorization can significantly increase the cost of managed math over that of unmanaged math code, and that 10% figure rears its ugly head here. On the integer math side of things managed code is almost on equal footing with unmanaged code. There are some vectorized operations you can perform that will enhance integer operations quite significantly, but in general the two add up to be about the same. However, when it comes to floating point performance managed code loses out due to its dependency on the FPU or single float SSE instructions. The ability to vectorize large chunks of floating point math can work wonders for unmanaged code.
Well, all is not lost for those of us who love the managed world... SlimGen is here. Exactly what SlimGen is will be delved into later, but here's a sample preview of what it can do:

    SlimDX.Matrix.Multiply(SlimDX.Matrix ByRef, SlimDX.Matrix ByRef, SlimDX.Matrix ByRef)
    Begin 5a856e64, size 293
    5A856E64 8B442404 mov eax,dword ptr [esp+4]
    5A856E68 0F1022 movups xmm4,xmmword ptr [edx]
    5A856E6B 0F106A10 movups xmm5,xmmword ptr [edx+10h]
    5A856E6F 0F107220 movups xmm6,xmmword ptr [edx+20h]
    5A856E73 0F107A30 movups xmm7,xmmword ptr [edx+30h]
    5A856E77 0F1001 movups xmm0,xmmword ptr [ecx]
    5A856E7A 0F28C8 movaps xmm1,xmm0

Washu
As noted previously there are some cases where the performance of unmanaged code can beat that of the managed JIT. In the previous case it was the matrix multiplication function. We do have some other possible performance benefits we can give to our .NET code; specifically, we can NGEN it. NGEN is an interesting utility: it can perform heavy optimizations that would not be possible in the standard runtime JIT (as we shall see). The question before us is: will it give us enough of a boost to be able to surpass the performance of our unmanaged matrix multiplication?

An Analysis of Existing Code
We haven't yet looked at the code that was produced for our previous tests, so I feel that it is time we gave it a look to see what we have. To keep this shorter we'll only look at the inner product function. The code produced for the matrix multiplication suffers from the same problems and benefits from the same extensions. For the purposes of this writing we'll only consider the x64 platform. First up we'll look at our unmanaged inner product, which as we may recall is an SSE2 version. There are some things we should note: this method cannot be inlined into the managed code, and there are no frame pointers (they got optimized out).

    00000001`800019c3 0f100a movups xmm1,xmmword ptr [rdx]
    00000001`800019c6 0f59c8 mulps xmm1,xmm0
    00000001`800019c9 0f28c1 movaps xmm0,xmm1
    00000001`800019cc 0fc6c14e shufps xmm0,xmm1,4Eh
    00000001`800019d0 0f58c8 addps xmm1,xmm0
    00000001`800019d3 0f28c1 movaps xmm0,xmm1
    00000001`800019d6 0fc6c11b shufps xmm0,xmm1,1Bh
    00000001`800019da 0f58c1 addps xmm0,xmm1
    00000001`800019dd f3410f1100 movss dword ptr [r8],xmm0
    00000001`800019e2 c3 ret
The code used to produce the managed version shown below has undergone a slight modification. No longer does the method return a float; instead it has an out parameter to a float, which ends up holding the result of the operation. This change was made to eliminate some compilation issues in both the managed and unmanaged versions. In the case of the managed version below, without the out parameter the store operation (at 00000642`801673b3) would have required a conversion to a double and back to a single again. The new versions are shown at the end of this post. Examining the managed inner product we get a somewhat worse picture:

00000642`8016732f 4c8b4908 mov r9,qword ptr [rcx+8]
00000642`80167333 4d85c9 test r9,r9
00000642`80167336 0f8684000000 jbe 00000642`801673c0
00000642`8016733c f30f104110 movss xmm0,dword ptr [rcx+10h]
00000642`80167341 488b4208 mov rax,qword ptr [rdx+8]
00000642`80167345 4885c0 test rax,rax
00000642`80167348 7676 jbe 00000642`801673c0
00000642`8016734a f30f104a10 movss xmm1,dword ptr [rdx+10h]
00000642`8016734f f30f59c8 mulss xmm1,xmm0
00000642`80167353 4983f901 cmp r9,1
00000642`80167357 7667 jbe 00000642`801673c0
00000642`80167359 f30f105114 movss xmm2,dword ptr [rcx+14h]
00000642`8016735e 483d01000000 cmp rax,1
00000642`80167364 765a jbe 00000642`801673c0
00000642`80167366 f30f104214 movss xmm0,dword ptr [rdx+14h]
00000642`8016736b f30f59c2 mulss xmm0,xmm2
00000642`8016736f f30f58c1 addss xmm0,xmm1
00000642`80167373 4983f902 cmp r9,2
00000642`80167377 7647 jbe 00000642`801673c0
00000642`80167379 f30f105118 movss xmm2,dword ptr [rcx+18h]
00000642`8016737e 483d02000000 cmp rax,2
00000642`80167384 763a jbe 00000642`801673c0
00000642`80167386 f30f104a18 movss xmm1,dword ptr [rdx+18h]
00000642`8016738b f30f59ca mulss xmm1,xmm2
00000642`8016738f f30f58c8 addss xmm1,xmm0
00000642`80167393 4983f903 cmp r9,3
00000642`80167397 7627 jbe 00000642`801673c0
00000642`80167399 f30f10511c movss xmm2,dword ptr [rcx+1Ch]
00000642`8016739e 483d03000000 cmp rax,3
00000642`801673a4 761a jbe 00000642`801673c0
00000642`801673a6 f30f10421c movss xmm0,dword ptr [rdx+1Ch]
00000642`801673ab f30f59c2 mulss xmm0,xmm2
00000642`801673af f30f58c1 addss xmm0,xmm1
00000642`801673b3 f3410f114040 movss dword ptr [r8+40h],xmm0
...
00000642`801673bd f3c3 rep ret
00000642`801673bf 90 nop
00000642`801673c0 e88b9f8aff call mscorwks!JIT_RngChkFail (00000642`7fa11350)
Wow! Lots of conditionals there. It's not vectorized either, but we don't expect it to be: automatic vectorization is a hit-and-miss affair even with most optimizing compilers (like the Intel one), and vectorizing in the runtime JIT would take up far too much time. This method is inlined for us (thankfully), but we see that it is littered with conditionals and jumps. So where are they jumping to? Well, they all end up just after the end of the method. Note the nop instruction that causes the jump destination to be paragraph aligned; that is intentional. As you can probably guess from the name of the call at the jump destination, those conditionals are checking the array lengths, held in r9 and rax, against the indices being used. Those jumps aren't actually that friendly for branch prediction. For the most part they won't hamper the speed of this method much, but they are an additional cost. Unfortunately, they are rather problematic for the matrix version, and tend to cost quite a bit in performance.
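To make the pattern in the listing easier to follow, here is a minimal C++ sketch of what the JIT-generated code is doing: every element load is preceded by a comparison of the array length against the index, with a jump to a shared failure routine. The names range_check_fail, checked_load and inner_product_checked are hypothetical stand-ins of mine, and std::vector merely plays the role of a managed float[]; this is an illustration, not the runtime's actual implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the shared failure path that every range check
// jumps to (the call to JIT_RngChkFail in the listing).
[[noreturn]] inline void range_check_fail() {
    throw std::out_of_range("index outside array bounds");
}

// One checked load: compare the length against the index, bail out if
// length <= index (the "cmp len, i / jbe fail" pair), otherwise load.
inline float checked_load(const std::vector<float>& a, std::size_t index) {
    if (a.size() <= index) {
        range_check_fail();
    }
    return a[index];
}

// Four-component inner product with a check in front of every access,
// interleaved with the arithmetic just like the JIT output.
inline void inner_product_checked(const std::vector<float>& v1,
                                  const std::vector<float>& v2,
                                  float& result) {
    float sum = checked_load(v1, 0) * checked_load(v2, 0);
    for (std::size_t i = 1; i < 4; ++i) {
        sum += checked_load(v1, i) * checked_load(v2, i);
    }
    result = sum;
}
```

Eight checks for four multiplies is the overhead being complained about; a native compiler working on raw pointers performs none of them.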
We can also see that in x64 mode the JIT will use SSE2 for floating point operations. This is quite nice, but it does have some interesting consequences. For instance, comparing floating point numbers generated using the FPU against those generated using SSE2 will more than likely fail, EVEN IF you truncate them to their appropriate sizes. The reason for this is that the XMM registers (when using the single versions of the instructions and not the double ones) store the floating point values as exactly 32 bit floats. The FPU, however, will expand them to 80 bit floats, which means that operations on those 80 bit floats before truncation can change the lower bits of the 32 bit result. If you are wondering when this might become an issue, imagine the problems of running a managed networked game where you have 64 bit and 32 bit clients all sending packets to the server. This is just another reason why you should be using deltas for comparison of floats. Another thing to note is that with the addition of SSE2 support came the ability to use instructions that save us loads and stores, such as the cvtss2sd and cvtsd2ss instructions, which perform single to double and double to single conversions respectively.

Examining the Call Stack
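Before turning to the call stacks, here is a minimal sketch of the delta comparison recommended for floats. The function name nearly_equal and both tolerance defaults are my own illustrative choices, not values from these tests; a real application would pick tolerances based on the magnitudes it works with.

```cpp
#include <algorithm>
#include <cmath>

// Compare two floats using a delta instead of exact equality.
// abs_eps covers values close to zero; rel_eps scales with magnitude.
// Both defaults are illustrative assumptions.
inline bool nearly_equal(float a, float b,
                         float abs_eps = 1e-6f, float rel_eps = 1e-5f) {
    const float diff = std::fabs(a - b);
    if (diff <= abs_eps) {
        return true;  // close enough in absolute terms
    }
    const float largest = std::max(std::fabs(a), std::fabs(b));
    return diff <= largest * rel_eps;  // close enough relative to size
}
```

With something like this, a value computed through the x87 FPU and the same value computed through SSE2 compare as equal even when their lowest bits differ.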
Of course, there is also the question of exactly what our program goes through to call our unmanaged methods. First off, the JIT will have to generate several marshalling stubs (to deal with any non-blittable types, although in this case all of the passed types are blittable), along with the security demands. The total number of machine instructions for these stubs is around 10-30; nevertheless, they aren't inlinable and end up having to be created at runtime. The extra overhead of these calls can add up to quite a bit. First up we'll look at the pinvoke and the delegate stacks:

000006427f66bd14 ManagedMathLib!matrix_mul
0000064280168b85 mscorwks!DoNDirectCall__PatchGetThreadCall+0x78
0000064280168ccc ManagedMathLib!DomainBoundILStubClass.IL_STUB(Single[], Single[], Single[])+0xb5
0000064280168a0f PInvokeTest!SecurityILStubClass.IL_STUB(Single[], Single[], Single[])+0x5c
000006428016893e PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__0()+0x1f
0000064280167ca1 PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x6e
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0x591

000006427f66bd14 ManagedMathLib!matrix_mul
0000064280168465 mscorwks!DoNDirectCall__PatchGetThreadCall+0x78
00000642801685c1 ManagedMathLib!DomainBoundILStubClass.IL_STUB(Single[], Single[], Single[])+0xb5
0000064280168945 PInvokeTest!SecurityILStubClass.IL_STUB(Single[], Single[], Single[])+0x51
0000064280167d59 PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x75
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0x649

We can see the two stubs that were created, along with this last method called DoNDirectCall__PatchGetThreadCall that actually does the work of calling to our unmanaged function. Exactly what it does is probably what the name says, although I haven't actually dug in and tried to find out what's going on in the internals of it. One important thing to notice is the PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__0() call, which is actually a delegate used to call to our unmanaged method (passed in to TimeTest). By using the delegate to call the matrix multiplication function, the JIT was able to eliminate the calls entirely. Other than that, the contents of the two sets of stubs are practically identical. The security stub actually asserts that we have the right to call to unmanaged code; as this is a security demand and can change at runtime, it cannot be eliminated. Calling to our unmanaged function from the managed DLL is up next, and it turns out that this is also the most direct call:

000006427f66bf32 ManagedMathLib!matrix_mul
0000064280169601 mscorwks!DoNDirectCallWorker+0x62
00000642801694ef ManagedMathLib!ManagedMathLib.ManagedMath.MatrixMul(Single[], Single[], Single[])+0xd1
0000064280168945 PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__3()+0x1f
0000064280167ecf PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x75
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0x7bf
As we can see, the only real work that is done to call our unmanaged method is the call to DoNDirectCallWorker. Digging around in that method we find that it is basically a wrapper that saves registers, sets up some registers and then dispatches to the unmanaged function. Upon returning it restores the registers and returns to the caller. There is no dynamic method construction, nor does this require any extra overhead on our end. In fact, one could say that the code is about as fast as we can expect it to be for a managed to unmanaged transition. Looking at the difference between the original unmanaged inner product call and the new version (which takes a pointer to the destination float), being made from the managed DLL, we can see a huge difference:

000006427f66bf32 ManagedMathLib!inner_product
0000064280169bd0 mscorwks!DoNDirectCallWorker+0x62
0000064280169acf ManagedMathLib!ManagedMathLib.ManagedMath.InnerProduct(Single[], Single[], Single ByRef)+0xc0
0000064280168955 PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__7()+0x1f
00000642801681c5 PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x75
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0xab5

000006427f66bd14 ManagedMathLib!inner_product
0000064280169ca3 mscorwks!DoNDirectCall__PatchGetThreadCall+0x78
0000064280169ba0 ManagedMathLib!DomainBoundILStubClass.IL_STUB(Single*, Single*)+0x43
0000064280169b00 ManagedMathLib!ManagedMathLib.ManagedMath.InnerProduct(Single[], Single[])+0x50
000006428016893e PInvokeTest!PInvokeTest.Program+<>c__DisplayClass8.b__7()+0x20
00000642801681c5 PInvokeTest!PInvokeTest.Program.TimeTest(TestMethod, Int32)+0x6e
000006427f66c5e2 PInvokeTest!PInvokeTest.Program.Main(System.String[])+0xab5
Notice that the second call stack has the marshalling stub (also note the parameters to the stub). Returning value types has all sorts of interesting consequences. By changing the signature to write out to a float (in the case of the managed DLL it uses an out parameter), we eliminate the marshalling stub entirely. This improves performance by a decent bit, but nowhere near enough to make up for the cost of the call in the first place. The managed inner product is still significantly faster.

And then came NGEN
So, we've gone through and optimized our managed application, but yet it still is running too slow. We contemplate the necessity of moving some code over to the unmanaged world and shudder at the implications. Security would be shot, bugs abound...what to do! But then we remember that there's yet one more option, NGEN!
Running NGEN on our test executable prejitted the whole thing, even methods that eventually ended up being inlined. So, what did it do to our managed inner product? Well, first we'll look at the actual method that got prejitted:

PInvokeTest.Program.InnerProduct2(Single[], Single[], Single ByRef)
Begin 0000064288003290, size b0
00000642`88003290 4883ec28 sub rsp,28h
00000642`88003294 4c8bc9 mov r9,rcx
00000642`88003297 498b4108 mov rax,qword ptr [r9+8]
00000642`8800329b 4885c0 test rax,rax
00000642`8800329e 0f8696000000 jbe PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032a4 33c9 xor ecx,ecx
00000642`880032a6 488b4a08 mov rcx,qword ptr [rdx+8]
00000642`880032aa 4885c9 test rcx,rcx
00000642`880032ad 0f8687000000 jbe PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032b3 4533d2 xor r10d,r10d
00000642`880032b6 483d01000000 cmp rax,1
00000642`880032bc 767c jbe PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032be 41ba01000000 mov r10d,1
00000642`880032c4 4883f901 cmp rcx,1
00000642`880032c8 7670 jbe PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032ca 41ba01000000 mov r10d,1
00000642`880032d0 483d02000000 cmp rax,2
00000642`880032d6 7662 jbe PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032d8 41ba02000000 mov r10d,2
00000642`880032de 4883f902 cmp rcx,2
00000642`880032e2 7656 jbe PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032e4 483d03000000 cmp rax,3
00000642`880032ea 764e jbe PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032ec b803000000 mov eax,3
00000642`880032f1 4883f903 cmp rcx,3
00000642`880032f5 7643 jbe PInvokeTest_ni!COM+_Entry_Point (PInvokeTest_ni+0x333a) (00000642`8800333a)
00000642`880032f7 f30f104a14 movss xmm1,dword ptr [rdx+14h]
00000642`880032fc f3410f594914 mulss xmm1,dword ptr [r9+14h]
00000642`88003302 f30f104210 movss xmm0,dword ptr [rdx+10h]
00000642`88003307 f3410f594110 mulss xmm0,dword ptr [r9+10h]
00000642`8800330d f30f58c8 addss xmm1,xmm0
00000642`88003311 f30f104218 movss xmm0,dword ptr [rdx+18h]
00000642`88003316 f3410f594118 mulss xmm0,dword ptr [r9+18h]
00000642`8800331c f30f58c8 addss xmm1,xmm0
00000642`88003320 f30f10421c movss xmm0,dword ptr [rdx+1Ch]
00000642`88003325 f3410f59411c mulss xmm0,dword ptr [r9+1Ch]
00000642`8800332b f30f58c8 addss xmm1,xmm0
00000642`8800332f f3410f1108 movss dword ptr [r8],xmm1
00000642`88003334 4883c428 add rsp,28h
00000642`88003338 f3c3 rep ret
00000642`8800333a e811e0a0f7 call mscorwks!JIT_RngChkFail (00000642`7fa11350)
00000642`8800333f 90 nop
Interesting results, eh? First off, all of the checks are right up front, and ignoring the stack frame setup we can see exactly what will be inlined. This method looks a lot better than before, with all of the branches right up at the top where one would assume branch prediction can best deal with them (the registers never change and are being compared to constants). Nevertheless, there are some oddities in this code; for instance, there appear to be some extraneous instructions like mov eax,3. Yeah, don't ask me. Even so, the code is clearly superior to its previous form, and in fact the matrix version is equally improved, with the range checks being spaced out significantly more (and a bunch are done right up front as well). Of course, the question now is: how much does this help our performance? First up we'll examine some results from the new code base, and then some from the NGEN results on the same code base.

Count: 50
PInvoke MatrixMul : 00:00:07.6456226 Average: 00:00:00.1529124
Delegate MatrixMul: 00:00:06.6500307 Average: 00:00:00.1330006
Managed MatrixMul: 00:00:05.5783511 Average: 00:00:00.1115670
Internal MatrixMul: 00:00:04.5377141 Average: 00:00:00.0907542
PInvoke Inner Product: 00:00:05.4466987 Average: 00:00:00.1089339
Delegate Inner Product: 00:00:04.5001885 Average: 00:00:00.0900037
Managed Inner Product: 00:00:00.5535891 Average: 00:00:00.0110717
Internal Inner Product: 00:00:02.2694728 Average: 00:00:00.0453894

Count: 10
PInvoke MatrixMul : 00:00:01.5706254 Average: 00:00:00.1570625
Delegate MatrixMul: 00:00:01.2689247 Average: 00:00:00.1268924
Managed MatrixMul: 00:00:01.1501118 Average: 00:00:00.1150111
Internal MatrixMul: 00:00:00.9302144 Average: 00:00:00.0930214
PInvoke Inner Product: 00:00:01.0198933 Average: 00:00:00.1019893
Delegate Inner Product: 00:00:00.8538827 Average: 00:00:00.0853882
Managed Inner Product: 00:00:00.0987369 Average: 00:00:00.0098736
Internal Inner Product: 00:00:00.4287660 Average: 00:00:00.0428766
All in all, our performance changes have helped out the managed inner product a decent amount, although even the unmanaged calls managed to get a bit of a boost. Now for the NGEN results:

Count: 50
PInvoke MatrixMul : 00:00:07.5788052 Average: 00:00:00.1515761
Delegate MatrixMul: 00:00:06.2202549 Average: 00:00:00.1244050
Managed MatrixMul: 00:00:04.0376665 Average: 00:00:00.0807533
Internal MatrixMul: 00:00:04.5778189 Average: 00:00:00.0915563
PInvoke Inner Product: 00:00:05.2785764 Average: 00:00:00.1055715
Delegate Inner Product: 00:00:04.1814388 Average: 00:00:00.0836287
Managed Inner Product: 00:00:00.5579279 Average: 00:00:00.0111585
Internal Inner Product: 00:00:02.2419279 Average: 00:00:00.0448385

Count: 10
PInvoke MatrixMul : 00:00:01.3822036 Average: 00:00:00.1382203
Delegate MatrixMul: 00:00:01.1436108 Average: 00:00:00.1143610
Managed MatrixMul: 00:00:00.7386742 Average: 00:00:00.0738674
Internal MatrixMul: 00:00:00.8427460 Average: 00:00:00.0842746
PInvoke Inner Product: 00:00:00.9507331 Average: 00:00:00.0950733
Delegate Inner Product: 00:00:00.7428082 Average: 00:00:00.0742808
Managed Inner Product: 00:00:00.1005084 Average: 00:00:00.0100508
Internal Inner Product: 00:00:00.4025611 Average: 00:00:00.0402561
So, now we can see that our unmanaged matrix multiplication doesn't offer any advantage over the managed version; in fact it's actually SLOWER than the managed version! We can also see that the unmanaged invocations benefitted from the NGEN process, as their managed calls were also optimized somewhat, although the stub wrappers are still there and hence still add their overhead. Another thing to note is that the inner product function appears to have slowed down just a bit. This might be nothing, it might be due to machine load, or it might genuinely be slower. I'm tempted to say that it's actually slower now, though.

Conclusion
You may recall that this was all sparked by a discussion I had way back when about comparing managed and unmanaged benchmarks and the disadvantages of just setting the /clr flag. I've gone a bit past that though, in looking at our managed resources and optimized unmanaged resources and when it is actually beneficial to call into unmanaged code. It is still beneficial to do so, but only for operations that are sufficiently taxing to bother with. In this case the native matrix code, which clearly beat the JIT-produced code in a pure JIT situation, gets beaten by the NGENed managed version. So what is sufficiently taxing then? Well, set processing might be taxing enough. That is: applying a set of vectorized operations to a collection of objects. But the reality is, you MUST profile first before you can be sure that optimizations of that sort are anywhere near what you need; if you just assume they are, you're probably mistaken.
On a final note, the x86 version also performs better when NGENed than the native version, although in a surprise jump, the delegates actually cost significantly more:

Count: 50
PInvoke MatrixMul : 00:00:07.9897235 Average: 00:00:00.1597944
Delegate MatrixMul: 00:00:27.2561396 Average: 00:00:00.5451227
Managed MatrixMul: 00:00:03.5224029 Average: 00:00:00.0704480
Internal MatrixMul: 00:00:04.5232549 Average: 00:00:00.0904650
PInvoke Inner Product: 00:00:05.5799834 Average: 00:00:00.1115996
Delegate Inner Product: 00:00:29.5660003 Average: 00:00:00.5913200
Managed Inner Product: 00:00:00.5755690 Average: 00:00:00.0115113
Internal Inner Product: 00:00:01.8218949 Average: 00:00:00.0364378
Exactly why this is I haven't investigated, and perhaps I will next time.
Sources for the new inner product functions:

// Unmanaged SSE version: writes the result through a pointer instead of returning it.
void __declspec(dllexport) inner_product(float const* v1, float const* v2, float* out) {
    __m128 a = _mm_mul_ps(_mm_loadu_ps(v1), _mm_loadu_ps(v2));
    a = _mm_add_ps(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 0, 3, 2)));
    _mm_store_ss(out, _mm_add_ps(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3))));
}

// C++/CLI wrapper: pins the managed arrays and the out parameter, then calls the native function.
static void InnerProduct(array<float>^ v1, array<float>^ v2, [Runtime::InteropServices::Out] float% result) {
    pin_ptr<float> pv1 = &v1[0];
    pin_ptr<float> pv2 = &v2[0];
    pin_ptr<float> out = &result;
    inner_product(pv1, pv2, out);
}

// Pure managed (C#) version with an out parameter.
public static void InnerProduct2(float[] v1, float[] v2, out float f) {
    f = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2] + v1[3] * v2[3];
}

Source
Washu
Integrating unmanaged code into the managed platform is one of the problem areas of the managed world. Oftentimes the exact costs of calling into unmanaged code are unknown. This obviously leads to some confusion as to when it is appropriate to mix in unmanaged code to help improve the performance of our application.

PInvoke
There are three ways to access an unmanaged function from managed code. The first is to use the PInvoke capabilities of the language. In C# this is done by declaring a method with external linkage and indicating (using the DllImportAttribute attribute) in which DLL the method may be found. The second way is to obtain a pointer to the function (using LoadLibrary/GetProcAddress/FreeLibrary), and marshal that pointer to a managed delegate using Marshal.GetDelegateForFunctionPointer. Finally, you can write a managed wrapper around the function, using C++/CLI, and invoke that managed method, which will in turn call the unmanaged method.
For the purposes of this post we'll be using two mathematical sample functions. The first is the standard inner product (aka the dot product) on four-component vectors, and the second is a 4x4 matrix multiplication. We'll be comparing two implementations of each: the first is a trivial managed implementation, and the second is an SSE2 optimized version. Thanks must be given to Arseny Kapoulkine for the SSE2 version of the matrix multiplication.
First up are the implementations of the inner product functions. It should be noted that I'll be doing the profiling in x64 mode; however, the results are similar (albeit a bit slower) for x86.

public static float InnerProduct2(float[] v1, float[] v2) {
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2] + v1[3] * v2[3];
}

float __declspec(dllexport) inner_product(float const* v1, float const* v2) {
    float result;
    __m128 a = _mm_mul_ps(_mm_loadu_ps(v1), _mm_loadu_ps(v2));
    a = _mm_add_ps(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 0, 3, 2)));
    _mm_store_ss(&result, _mm_add_ps(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3))));
    return result;
}
One thing that should be noted about these implementations is that they both operate solely on arrays of floats. InnerProduct2 is inlineable since it's only 23 bytes long and takes reference types as parameters. The unmanaged inner product could also be implemented using the SSE3 haddps instruction; however, I decided to keep it as processor neutral as possible by using only SSE2 instructions.
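Since the two shuffle/add steps in the unmanaged inner product are a little cryptic, here is a scalar C++ model of them. Vec4, shuffle, add, mul and inner_product_model are names I made up for this illustration; shuffle mimics the lane selection of _mm_shuffle_ps with the arguments of _MM_SHUFFLE in the same order, but none of this uses real SSE registers.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Scalar model of _mm_shuffle_ps(a, b, _MM_SHUFFLE(z, y, x, w)):
// the two low lanes are picked from a, the two high lanes from b.
inline Vec4 shuffle(const Vec4& a, const Vec4& b, int z, int y, int x, int w) {
    return Vec4{a[w], a[x], b[y], b[z]};
}

// Lane-wise add and multiply, standing in for _mm_add_ps and _mm_mul_ps.
inline Vec4 add(const Vec4& a, const Vec4& b) {
    return Vec4{a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

inline Vec4 mul(const Vec4& a, const Vec4& b) {
    return Vec4{a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}

// Same sequence of steps as the unmanaged inner_product.
inline float inner_product_model(const Vec4& v1, const Vec4& v2) {
    Vec4 a = mul(v1, v2);                           // four products
    a = add(a, shuffle(a, a, 1, 0, 3, 2));          // swap halves: lanes hold p0+p2, p1+p3, ...
    const Vec4 r = add(a, shuffle(a, a, 0, 1, 2, 3)); // reverse: every lane holds the full sum
    return r[0];                                    // _mm_store_ss writes lane 0
}
```

The first shuffle swaps the upper and lower halves of the register, the second reverses it, so after two adds every lane contains the sum of all four products and only lane 0 needs to be stored.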
The implementations of the matrix multiplication vary quite significantly as well. The managed version is the trivial implementation, but its expansion into machine code is quite long. The unmanaged version is an SSE2 optimized one, and the raw performance boost of using it is quite significant.

public static void MatrixMul2(float[] m1, float[] m2, float[] o) {
    o[0] = m1[0] * m2[0] + m1[1] * m2[4] + m1[2] * m2[8] + m1[3] * m2[12];
    o[1] = m1[0] * m2[1] + m1[1] * m2[5] + m1[2] * m2[9] + m1[3] * m2[13];
    o[2] = m1[0] * m2[2] + m1[1] * m2[6] + m1[2] * m2[10] + m1[3] * m2[14];
    o[3] = m1[0] * m2[3] + m1[1] * m2[7] + m1[2] * m2[11] + m1[3] * m2[15];
    o[4] = m1[4] * m2[0] + m1[5] * m2[4] + m1[6] * m2[8] + m1[7] * m2[12];
    o[5] = m1[4] * m2[1] + m1[5] * m2[5] + m1[6] * m2[9] + m1[7] * m2[13];
    o[6] = m1[4] * m2[2] + m1[5] * m2[6] + m1[6] * m2[10] + m1[7] * m2[14];
    o[7] = m1[4] * m2[3] + m1[5] * m2[7] + m1[6] * m2[11] + m1[7] * m2[15];
    o[8] = m1[8] * m2[0] + m1[9] * m2[4] + m1[10] * m2[8] + m1[11] * m2[12];
    o[9] = m1[8] * m2[1] + m1[9] * m2[5] + m1[10] * m2[9] + m1[11] * m2[13];
    o[10] = m1[8] * m2[2] + m1[9] * m2[6] + m1[10] * m2[10] + m1[11] * m2[14];
    o[11] = m1[8] * m2[3] + m1[9] * m2[7] + m1[10] * m2[11] + m1[11] * m2[15];
    o[12] = m1[12] * m2[0] + m1[13] * m2[4] + m1[14] * m2[8] + m1[15] * m2[12];
    o[13] = m1[12] * m2[1] + m1[13] * m2[5] + m1[14] * m2[9] + m1[15] * m2[13];
    o[14] = m1[12] * m2[2] + m1[13] * m2[6] + m1[14] * m2[10] + m1[15] * m2[14];
    o[15] = m1[12] * m2[3] + m1[13] * m2[7] + m1[14] * m2[11] + m1[15] * m2[15];
}

void __declspec(dllexport) matrix_mul(float const* m1, float const* m2, float* out) {
    __m128 r;
    __m128 col1 = _mm_loadu_ps(m2);
    __m128 col2 = _mm_loadu_ps(m2 + 4);
    __m128 col3 = _mm_loadu_ps(m2 + 8);
    __m128 col4 = _mm_loadu_ps(m2 + 12);

    __m128 row1 = _mm_loadu_ps(m1);
    r = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row1, row1, _MM_SHUFFLE(0, 0, 0, 0)), col1),
        _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row1, row1, _MM_SHUFFLE(1, 1, 1, 1)), col2),
        _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row1, row1, _MM_SHUFFLE(2, 2, 2, 2)), col3),
                   _mm_mul_ps(_mm_shuffle_ps(row1, row1, _MM_SHUFFLE(3, 3, 3, 3)), col4))));
    _mm_storeu_ps(out, r);

    __m128 row2 = _mm_loadu_ps(m1 + 4);
    r = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row2, row2, _MM_SHUFFLE(0, 0, 0, 0)), col1),
        _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row2, row2, _MM_SHUFFLE(1, 1, 1, 1)), col2),
        _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row2, row2, _MM_SHUFFLE(2, 2, 2, 2)), col3),
                   _mm_mul_ps(_mm_shuffle_ps(row2, row2, _MM_SHUFFLE(3, 3, 3, 3)), col4))));
    _mm_storeu_ps(out + 4, r);

    __m128 row3 = _mm_loadu_ps(m1 + 8);
    r = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row3, row3, _MM_SHUFFLE(0, 0, 0, 0)), col1),
        _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row3, row3, _MM_SHUFFLE(1, 1, 1, 1)), col2),
        _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row3, row3, _MM_SHUFFLE(2, 2, 2, 2)), col3),
                   _mm_mul_ps(_mm_shuffle_ps(row3, row3, _MM_SHUFFLE(3, 3, 3, 3)), col4))));
    _mm_storeu_ps(out + 8, r);

    __m128 row4 = _mm_loadu_ps(m1 + 12);
    r = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row4, row4, _MM_SHUFFLE(0, 0, 0, 0)), col1),
        _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row4, row4, _MM_SHUFFLE(1, 1, 1, 1)), col2),
        _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row4, row4, _MM_SHUFFLE(2, 2, 2, 2)), col3),
                   _mm_mul_ps(_mm_shuffle_ps(row4, row4, _MM_SHUFFLE(3, 3, 3, 3)), col4))));
    _mm_storeu_ps(out + 12, r);
}
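The structure of the SSE2 routine above is easier to see in scalar form. The sketch below is my own restatement, with hypothetical names Mat4 and matrix_mul_rows: each output row is built by taking one element of the m1 row at a time, broadcasting it (what the _MM_SHUFFLE(k, k, k, k) shuffles do), and scaling the four floats loaded from m2 + 4 * k, which the routine calls col1 to col4. It assumes out does not alias m2, and it sums the four terms in a different order than the intrinsic version, so results may differ in the last bits for non-trivial inputs.

```cpp
#include <array>
#include <cstddef>

using Mat4 = std::array<float, 16>;  // 4x4 matrix stored as 16 consecutive floats

// Scalar restatement of the broadcast-and-accumulate scheme:
// out row i = sum over k of m1[i][k] * (the four floats at m2 + 4 * k).
inline void matrix_mul_rows(const Mat4& m1, const Mat4& m2, Mat4& out) {
    for (std::size_t i = 0; i < 4; ++i) {
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (std::size_t k = 0; k < 4; ++k) {
            const float s = m1[i * 4 + k];       // the "broadcast" element
            for (std::size_t j = 0; j < 4; ++j) {
                acc[j] += s * m2[k * 4 + j];     // one mulps/addps pair per k
            }
        }
        for (std::size_t j = 0; j < 4; ++j) {
            out[i * 4 + j] = acc[j];             // one 4-float store per row
        }
    }
}
```

Written this way it computes the same sixteen sums as MatrixMul2, just grouped so that each group of four maps onto one 128 bit register.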
It is trivially obvious that the managed version of the matrix multiplication cannot be inlined. The overhead of the function call is really the least of your worries though (it is the smallest cost of the entire method really). The unmanaged version is a nicely optimized SSE2 method, and requires only a minimal number of loads and stores from main memory, and the loads and stores are reasonably cache friendly (P4 will prefetch 128 bytes of memory).

PInvoke
Of course, the question is: how do these perform against each other when called from a managed application? The profiling setup is quite simple. It runs the methods against a set of matrices and vectors (randomly generated) a million times. It repeats those tests several more times (100 in this case), and averages the results. Full optimizations were turned on for both the unmanaged and managed tests. The Internal calls are made from a managed class that directly calls to the unmanaged methods. Both the managed wrapper and the unmanaged methods are hosted in the same DLL (source for the full DLL at the end of this entry).

PInvoke MatrixMul : 00:00:15.0203285 Average: 00:00:00.1502032
Delegate MatrixMul: 00:00:13.1004306 Average: 00:00:00.1310043
Managed MatrixMul: 00:00:10.2809715 Average: 00:00:00.1028097
Internal MatrixMul: 00:00:08.8992407 Average: 00:00:00.0889924
PInvoke Inner Product: 00:00:10.6779944 Average: 00:00:00.1067799
Delegate Inner Product: 00:00:09.3359882 Average: 00:00:00.0933598
Managed Inner Product: 00:00:01.3460812 Average: 00:00:00.0134608
Internal Inner Product: 00:00:05.6842336 Average: 00:00:00.0568423
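For readers who want to reproduce this kind of measurement, the setup described above (run a method a million times, repeat that a number of times, report total and average) can be sketched as follows. This is a C++ approximation under my own assumptions: the actual harness is managed code and is not shown in this entry, and TimingResult, time_test and the calls_per_repeat parameter are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>

struct TimingResult {
    std::chrono::duration<double> total;    // seconds spent across all repeats
    std::chrono::duration<double> average;  // total divided by the repeat count
};

// Call `method` calls_per_repeat times, repeat that `repeats` times,
// and report the total and per-repeat average wall-clock time.
template <typename Fn>
TimingResult time_test(Fn&& method, std::size_t repeats,
                       std::size_t calls_per_repeat = 1000000) {
    using clock = std::chrono::steady_clock;
    std::chrono::duration<double> total{0.0};
    for (std::size_t r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        for (std::size_t c = 0; c < calls_per_repeat; ++c) {
            method();
        }
        total += clock::now() - start;
    }
    const std::chrono::duration<double> average =
        repeats > 0 ? total / static_cast<double>(repeats) : total;
    return TimingResult{total, average};
}
```

A call such as time_test([&] { matrix_mul(a, b, out); }, 100) would then correspond to one row of the tables, with the Average column being the time for one million calls. In a native benchmark you would also need to make sure the optimizer cannot discard the work being measured.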
The first thing to note is that the PInvoke calls for both the matrix multiplication and inner product were the slowest. The delegate calls were only slightly faster than the PInvoke calls. As we move into the managed territory we find that the results begin to diverge. The managed matrix multiplication is slower than the internal matrix multiplication; however, the managed inner product is several times faster than the internal one.
Part of the reason behind this divergence is the invocation framework. There is a cost to calling unmanaged methods from managed code, as each method must be wrapped to perform operations such as fixing any managed resources, performing marshalling for non-blittable types, and finally calling the actual native method. After the method returns, further marshalling of the return type may be required, along with checks on the condition of the stack and exception checks (SEH exceptions are caught and wrapped in the SEHException class). Even the internal calls to the unmanaged method require some amount of this, although the actual marshalling requirements are avoided, as are some of the other costs. The result is that the costs add up over time, and in the case of the inner product the additional cost outweighed the work done by the method itself (which is fairly trivial). The case, on average, is different for the matrix multiplication. The additional costs of the call do not add a significant amount of overhead compared to that of the body of the method, which executes faster than that of the managed matrix multiplication due to vectorization.
Performing further testing with counts of 50, 25, 5 and 1 reveals similar results, although the managed matrix multiplication begins to approach the performance of the internal one. However, even at a count of 1 (that's one million matrix multiplications), the internal matrix multiplication is faster than the managed version.

Count = 50
PInvoke MatrixMul : 00:00:07.4730356 Average: 00:00:00.1494607
Delegate MatrixMul: 00:00:06.4519274 Average: 00:00:00.1290385
Managed MatrixMul: 00:00:05.1662482 Average: 00:00:00.1033249
Internal MatrixMul: 00:00:04.3371530 Average: 00:00:00.0867430
PInvoke Inner Product: 00:00:05.3891030 Average: 00:00:00.1077820
Delegate Inner Product: 00:00:04.7625597 Average: 00:00:00.0952511
Managed Inner Product: 00:00:00.6791549 Average: 00:00:00.0135830
Internal Inner Product: 00:00:02.6719175 Average: 00:00:00.0534383

Count = 25
PInvoke MatrixMul : 00:00:03.7432932 Average: 00:00:00.1497317
Delegate MatrixMul: 00:00:03.2074834 Average: 00:00:00.1282993
Managed MatrixMul: 00:00:02.6200096 Average: 00:00:00.1048003
Internal MatrixMul: 00:00:02.2144342 Average: 00:00:00.0885773
PInvoke Inner Product: 00:00:02.8778559 Average: 00:00:00.1151142
Delegate Inner Product: 00:00:02.0178957 Average: 00:00:00.0807158
Managed Inner Product: 00:00:00.3385675 Average: 00:00:00.0135427
Internal Inner Product: 00:00:01.4391529 Average: 00:00:00.0575661

Count = 5
PInvoke MatrixMul : 00:00:00.7642981 Average: 00:00:00.1528596
Delegate MatrixMul: 00:00:00.6407667 Average: 00:00:00.1281533
Managed MatrixMul: 00:00:00.5231416 Average: 00:00:00.1046283
Internal MatrixMul: 00:00:00.4458765 Average: 00:00:00.0891753
PInvoke Inner Product: 00:00:00.5702666 Average: 00:00:00.1140533
Delegate Inner Product: 00:00:00.4122217 Average: 00:00:00.0824443
Managed Inner Product: 00:00:00.0683842 Average: 00:00:00.0136768
Internal Inner Product: 00:00:00.2899304 Average: 00:00:00.0579860

Count = 1
PInvoke MatrixMul : 00:00:00.1476958 Average: 00:00:00.1476958
Delegate MatrixMul: 00:00:00.1337818 Average: 00:00:00.1337818
Managed MatrixMul: 00:00:00.1155993 Average: 00:00:00.1155993
Internal MatrixMul: 00:00:00.0919538 Average: 00:00:00.0919538
PInvoke Inner Product: 00:00:00.1155769 Average: 00:00:00.1155769
Delegate Inner Product: 00:00:00.0906768 Average: 00:00:00.0906768
Managed Inner Product: 00:00:00.0155480 Average: 00:00:00.0155480
Internal Inner Product: 00:00:00.0653527 Average: 00:00:00.0653527

Conclusion
Clearly we should reserve unmanaged operations for longer running methods where the cost of the managed wrappers is negligible compared to the cost of the method. Even heavily optimized methods cost significantly in the wrapping code, and so trivial optimizations are easily overshadowed by that cost. It is best to use unmanaged operations wrapped in a C++/CLI wrapper (and preferably the wrapper will be part of the library that the operations are in). Next time we'll look at the assembly produced by the JIT for these methods under varying circumstances.
Source for Managed DLL:
#pragma managed(push, off)
extern "C" {
    float __declspec(dllexport) inner_product(float const* v1, float const* v2) {
        float result;
        __m128 a = _mm_mul_ps(_mm_loadu_ps(v1), _mm_loadu_ps(v2));
        a = _mm_add_ps(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 0, 3, 2)));
        _mm_store_ss(&result, _mm_add_ps(a, _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3))));
        return result;
    }

    void __declspec(dllexport) matrix_mul(float const* m1, float const* m2, float* out) {
        __m128 r;
        __m128 col1 = _mm_loadu_ps(m2);
        __m128 col2 = _mm_loadu_ps(m2 + 4);
        __m128 col3 = _mm_loadu_ps(m2 + 8);
        __m128 col4 = _mm_loadu_ps(m2 + 12);

        __m128 row1 = _mm_loadu_ps(m1);
        r = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row1, row1, _MM_SHUFFLE(0, 0, 0, 0)), col1),
            _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row1, row1, _MM_SHUFFLE(1, 1, 1, 1)), col2),
            _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row1, row1, _MM_SHUFFLE(2, 2, 2, 2)), col3),
                       _mm_mul_ps(_mm_shuffle_ps(row1, row1, _MM_SHUFFLE(3, 3, 3, 3)), col4))));
        _mm_storeu_ps(out, r);

        __m128 row2 = _mm_loadu_ps(m1 + 4);
        r = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row2, row2, _MM_SHUFFLE(0, 0, 0, 0)), col1),
            _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row2, row2, _MM_SHUFFLE(1, 1, 1, 1)), col2),
            _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row2, row2, _MM_SHUFFLE(2, 2, 2, 2)), col3),
                       _mm_mul_ps(_mm_shuffle_ps(row2, row2, _MM_SHUFFLE(3, 3, 3, 3)), col4))));
        _mm_storeu_ps(out + 4, r);

        __m128 row3 = _mm_loadu_ps(m1 + 8);
        r = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row3, row3, _MM_SHUFFLE(0, 0, 0, 0)), col1),
            _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row3, row3, _MM_SHUFFLE(1, 1, 1, 1)), col2),
            _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row3, row3, _MM_SHUFFLE(2, 2, 2, 2)), col3),
                       _mm_mul_ps(_mm_shuffle_ps(row3, row3, _MM_SHUFFLE(3, 3, 3, 3)), col4))));
        _mm_storeu_ps(out + 8, r);

        __m128 row4 = _mm_loadu_ps(m1 + 12);
        r = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row4, row4, _MM_SHUFFLE(0, 0, 0, 0)), col1),
            _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row4, row4, _MM_SHUFFLE(1, 1, 1, 1)), col2),
            _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(row4, row4, _MM_SHUFFLE(2, 2, 2, 2)), col3),
                       _mm_mul_ps(_mm_shuffle_ps(row4, row4, _MM_SHUFFLE(3, 3, 3, 3)), col4))));
        _mm_storeu_ps(out + 12, r);
    }
}
#pragma managed(pop)

using namespace System;

namespace ManagedMathLib {
    public ref class ManagedMath {
    public:
        static IntPtr InnerProductPtr = IntPtr(inner_product);
        static IntPtr MatrixMulPtr = IntPtr(matrix_mul);

        static float InnerProduct(array<float>^ v1, array<float>^ v2) {
            pin_ptr<float> pv1 = &v1[0];
            pin_ptr<float> pv2 = &v2[0];
            return inner_product(pv1, pv2);
        }

        static void MatrixMul(array<float>^ m1, array<float>^ m2, array<float>^ out) {
            pin_ptr<float> pm1 = &m1[0];
            pin_ptr<float> pm2 = &m2[0];
            pin_ptr<float> outp = &out[0];
            matrix_mul(pm1, pm2, outp);
        }
    };
}
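One way to act on the advice about reserving unmanaged calls for longer running work is to cross the managed/unmanaged boundary once per batch instead of once per vector. What follows is only a minimal sketch in plain C++, under the assumption of a hypothetical inner_product_batch entry point that is not part of the library above. It uses scalar arithmetic to stay portable; the SSE body could be dropped into the loop instead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical batch entry point (not part of the original library).
// Computes `count` four-component inner products in one call, so the cost
// of a managed-to-unmanaged transition is paid once per batch.
// v1 and v2 each hold count * 4 floats; out holds count floats.
extern "C" void inner_product_batch(float const* v1, float const* v2,
                                    float* out, std::size_t count) {
    assert(v1 != nullptr && v2 != nullptr && out != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        float const* a = v1 + i * 4;
        float const* b = v2 + i * 4;
        out[i] = a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
    }
}
```

A managed wrapper would pin the three arrays once and make a single call, so the fixed wrapper cost is spread over every vector in the batch.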

Source
Washu
Previously I discussed various potential issues the x86 JIT had with inlining non-trivial methods and functions taking or returning value types. In this entry I hope to cover some potential pitfalls facing would-be optimizers, along with discussing some unexpected optimizations that do take place.
Optimizations That Aren't
It is not uncommon to see people advocating the use of unsafe code as a means of producing "optimized" code in the managed environment. The idea is a simple one: by getting down to the metal with pointers and all that fun stuff, you can somehow produce code that will be "optimized" in ways that typical managed code cannot be.
Unsafe code does not allow you to manipulate pointers to managed objects in whatever manner you please. Certain steps have to be taken to ensure that your operations are safe with regard to the managed heap. Just because your code is marked as "unsafe" doesn't mean that it is free to do what it wants. For example, you cannot assign a pointer the address of a managed object without first pinning the object. Pointers to objects are not tracked by the GC, so should you obtain a pointer to an object and then attempt to use the pointer, you could end up accessing a now collected region of memory. What can also happen is that you could obtain a pointer to an object, but when the GC runs your object could be shuffled around on the heap. This shuffling would invalidate your pointer, but since pointers are not tracked by the GC it would not be updated (while references to objects are updated). Pinning objects solves this problem, which is why you are only allowed to take the address of an object that has been pinned. In essence, a pinned object can be neither moved nor collected by the GC until it is unpinned. This is typically done through the use of the fixed keyword in C# or the GCHandle structure.
Much like how a fixed object cannot be moved by the GC, a pointer to a fixed object cannot be reassigned. This makes it difficult to traverse primitive arrays, as you end up needing to create other temporary pointers, or limiting the size of the fixed area to a small segment. Fixed objects, and unsafe code, increase the overall size of the produced IL by a fairly significant margin. While an increase in the IL is not indicative of the size of the produced machine code, it does prevent the runtime from inlining such methods. As an example, the two following snippets reveal the difference between a safe Magnitude method and an unsafe one; note that the unsafe version uses a fixed size buffer.
public float Magnitude() {
    return (float)Math.Sqrt(X * X + Y * Y + Z * Z);
}
.method public hidebysig instance float32 Magnitude() cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldfld float32 PerformanceTests.Vector3::X
    L_0006: ldarg.0
    L_0007: ldfld float32 PerformanceTests.Vector3::X
    L_000c: mul
    L_000d: ldarg.0
    L_000e: ldfld float32 PerformanceTests.Vector3::Y
    L_0013: ldarg.0
    L_0014: ldfld float32 PerformanceTests.Vector3::Y
    L_0019: mul
    L_001a: add
    L_001b: ldarg.0
    L_001c: ldfld float32 PerformanceTests.Vector3::Z
    L_0021: ldarg.0
    L_0022: ldfld float32 PerformanceTests.Vector3::Z
    L_0027: mul
    L_0028: add
    L_0029: conv.r8
    L_002a: call float64 [mscorlib]System.Math::Sqrt(float64)
    L_002f: conv.r4
    L_0030: ret
}
public float Magnitude() {
    fixed (float* p = V) {
        return (float)Math.Sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]);
    }
}
.method public hidebysig instance float32 Magnitude() cil managed
{
    .maxstack 4
    .locals init (
        [0] float32& pinned singleRef1,
        [1] float32 single1)
    L_0000: ldarg.0
    L_0001: ldflda PerformanceTests.Unsafe.Vector3/<V>e__FixedBuffer0 PerformanceTests.Unsafe.Vector3::V
    L_0006: ldflda float32 PerformanceTests.Unsafe.Vector3/<V>e__FixedBuffer0::FixedElementField
    L_000b: stloc.0
    L_000c: ldloc.0
    L_000d: conv.i
    L_000e: ldind.r4
    L_000f: ldloc.0
    L_0010: conv.i
    L_0011: ldind.r4
    L_0012: mul
    L_0013: ldloc.0
    L_0014: conv.i
    L_0015: ldc.i4.4
    L_0016: conv.i
    L_0017: add
    L_0018: ldind.r4
    L_0019: ldloc.0
    L_001a: conv.i
    L_001b: ldc.i4.4
    L_001c: conv.i
    L_001d: add
    L_001e: ldind.r4
    L_001f: mul
    L_0020: add
    L_0021: ldloc.0
    L_0022: conv.i
    L_0023: ldc.i4.8
    L_0024: conv.i
    L_0025: add
    L_0026: ldind.r4
    L_0027: ldloc.0
    L_0028: conv.i
    L_0029: ldc.i4.8
    L_002a: conv.i
    L_002b: add
    L_002c: ldind.r4
    L_002d: mul
    L_002e: add
    L_002f: conv.r8
    L_0030: call float64 [mscorlib]System.Math::Sqrt(float64)
    L_0035: conv.r4
    L_0036: stloc.1
    L_0037: leave.s L_0039
    L_0039: ldloc.1
    L_003a: ret
}
Note that neither of these two appears to be a candidate for inlining, both being well over the 32 byte IL limit. The produced IL, while not directly indicative of the assembly produced by the JIT compiler, does tend to give an overall idea of how much larger we should expect this method to be when reproduced in machine code. Fixed length buffers have other issues that need addressing: you cannot access a fixed length buffer outside of a fixed statement. They are also an unsafe construct, and so you must indicate that the type is unsafe. Finally, they produce temporary types at compilation time that can throw off serialization and other reflection based mechanisms.
In the end, unsafe code does not increase performance, and the reliance upon platform structures to ensure safety, such as the fixed construct, introduces more problems than it solves. Furthermore, even the smallest method that might be inlined tends to bloat up to the point where inlining by the JIT is no longer possible.
Surprising Developments and JIT Optimizations
Previously I noted that the JIT compiler can only inline a method that is a maximum of 32 bytes of IL in length. However, I wasn't completely honest with you. In some cases the JIT compiler will inline chunks of code that are longer than 32 bytes of IL. I have not dug in depth into the reasons for this, nor into when these conditions may arise, so this information is presented as an informal experimental result. In the case of a function returning the result of an intrinsic operation, the function may be inlined even though it exceeds the limit. Two examples of this behavior will be shown. Note that in both cases the function used is an intrinsic math function and that neither is passed value types (which would prevent inlining). The first is the Magnitude function, which we saw above. Calling it results in it being inlined and produces the following assembly:
00220164 D945D4 fld dword ptr [ebp-2Ch]
00220167 D8C8 fmul st,st(0)
00220169 D945D8 fld dword ptr [ebp-28h]
0022016C D8C8 fmul st,st(0)
0022016E DEC1 faddp st(1),st
00220170 D945DC fld dword ptr [ebp-24h]
00220173 D8C8 fmul st,st(0)
00220175 DEC1 faddp st(1),st
00220177 DD5D9C fstp qword ptr [ebp-64h]
0022017A DD459C fld qword ptr [ebp-64h]
0022017D D9FA fsqrt
We note that this is the optimal form for the magnitude function, with a minimal number of memory reads and the majority of the work taking place on the FPU stack. Compare this to the unsafe version, shown next, and you can clearly see how much worse the unsafe code is:
007A0438 55 push ebp
007A0439 8BEC mov ebp,esp
007A043B 57 push edi
007A043C 56 push esi
007A043D 53 push ebx
007A043E 83EC10 sub esp,10h
007A0441 33C0 xor eax,eax
007A0443 8945F0 mov dword ptr [ebp-10h],eax
007A0446 894DF0 mov dword ptr [ebp-10h],ecx
007A0449 D901 fld dword ptr [ecx]
007A044B 8BF1 mov esi,ecx
007A044D D80E fmul dword ptr [esi]
007A044F 8BF9 mov edi,ecx
007A0451 D94704 fld dword ptr [edi+4]
007A0454 8BD1 mov edx,ecx
007A0456 D84A04 fmul dword ptr [edx+4]
007A0459 DEC1 faddp st(1),st
007A045B 8BC1 mov eax,ecx
007A045D D94008 fld dword ptr [eax+8]
007A0460 8BD8 mov ebx,eax
007A0462 D84B08 fmul dword ptr [ebx+8]
007A0465 DEC1 faddp st(1),st
007A0467 DD5DE4 fstp qword ptr [ebp-1Ch]
007A046A DD45E4 fld qword ptr [ebp-1Ch]
007A046D D9FA fsqrt
007A046F D95DEC fstp dword ptr [ebp-14h]
007A0472 D945EC fld dword ptr [ebp-14h]
007A0475 8D65F4 lea esp,[ebp-0Ch]
007A0478 5B pop ebx
007A0479 5E pop esi
007A047A 5F pop edi
007A047B 5D pop ebp
007A047C C3 ret
Next up is a fairly ubiquitous utility function which obtains the angle between two unit length vectors. Note that acos cannot be produced directly as a single machine instruction; nonetheless, it is considered an intrinsic function. As we see below, this produces a nicely optimized set of instructions, with only a single call to a function (which computes acos).
public static float AngleBetween(ref Vector3 lhs, ref Vector3 rhs) {
    return (float)Math.Acos(lhs.X * rhs.X + lhs.Y * rhs.Y + lhs.Z * rhs.Z);
}
.method public hidebysig static float32 AngleBetween(PerformanceTests.Vector3& lhs, PerformanceTests.Vector3& rhs) cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldfld float32 PerformanceTests.Vector3::X
    L_0006: ldarg.1
    L_0007: ldfld float32 PerformanceTests.Vector3::X
    L_000c: mul
    L_000d: ldarg.0
    L_000e: ldfld float32 PerformanceTests.Vector3::Y
    L_0013: ldarg.1
    L_0014: ldfld float32 PerformanceTests.Vector3::Y
    L_0019: mul
    L_001a: add
    L_001b: ldarg.0
    L_001c: ldfld float32 PerformanceTests.Vector3::Z
    L_0021: ldarg.1
    L_0022: ldfld float32 PerformanceTests.Vector3::Z
    L_0027: mul
    L_0028: add
    L_0029: conv.r8
    L_002a: call float64 [mscorlib]System.Math::Acos(float64)
    L_002f: conv.r4
    L_0030: ret
}
007A01D9 8D55D4 lea edx,[ebp-2Ch]
007A01DC 8D4DC8 lea ecx,[ebp-38h]
007A01DF D902 fld dword ptr [edx]
007A01E1 D809 fmul dword ptr [ecx]
007A01E3 D94204 fld dword ptr [edx+4]
007A01E6 D84904 fmul dword ptr [ecx+4]
007A01E9 DEC1 faddp st(1),st
007A01EB D94208 fld dword ptr [edx+8]
007A01EE D84908 fmul dword ptr [ecx+8]
007A01F1 DEC1 faddp st(1),st
007A01F3 83EC08 sub esp,8
007A01F6 DD1C24 fstp qword ptr [esp]
007A01F9 E868A5AF79 call 7A29A766 (System.Math.Acos(Double), mdToken: 06000b28)
Finally there is the issue of SIMD instruction sets. While the x86 JIT will not vectorize code using SIMD instructions, it will use individual SSE instructions for certain other operations. One common case is the conversion of floating point numbers to integers. In .NET 2.0 the JIT will optimize this to use the SSE2 cvttsd2si instruction. For instance, the following snippet of code results in the assembly dump that follows it.
int n = (int)r.NextDouble();
002A02FB 8BCB mov ecx,ebx
002A02FD 8B01 mov eax,dword ptr [ecx]
002A02FF FF5048 call dword ptr [eax+48h]
002A0302 DD5DA0 fstp qword ptr [ebp-60h]
002A0305 F20F1045A0 movsd xmm0,mmword ptr [ebp-60h]
002A030A F20F2CF0 cvttsd2si esi,xmm0
While not quite as optimal as it could be if the JIT were using the full SSE2 instruction set, this minor optimization can go a long way.
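To make the semantics of that conversion concrete, here is a small C++ sketch of the same truncating cast. The function name truncate_to_int is my own, chosen for illustration; the point is only that the cast discards the fractional part, which is the same rounding behavior cvttsd2si provides.

```cpp
#include <cassert>
#include <cmath>

// Mirrors the C# cast `(int)d`: the fractional part is discarded
// (rounding toward zero), which is what cvttsd2si does in hardware.
// The name is hypothetical and used only for this illustration.
int truncate_to_int(double d) {
    assert(std::isfinite(d));
    return static_cast<int>(d);
}
```

Since Random.NextDouble returns a value in the range [0, 1), the cast in the snippet above always yields 0; it serves purely as a vehicle for showing the generated instructions.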
So what is left to visit? Well, there's obviously the x64 platform, which is growing in popularity. The x64 platform presents new opportunities to explore, including certain guarantees and performance benefits that aren't available on the x86 platform. Amongst them are a whole new set of optimizations and available instruction sets that the JIT can take advantage of. Finally there is the case of calling into unmanaged code for highly performance intensive operations: hand optimized SIMD code, and the potential performance benefits or hazards that calling an unmanaged function can incur.

Introduction
.NET has been getting some interesting press recently, even to the point where an article in Game Developer Magazine was published advocating the usage of managed code for rapid development of components. However, I did raise some issues with the author with regard to the performance metric he used. Thus I have decided to cover some issues with .NET performance, future benefits, and hopefully even a few solutions to some of the problems I'll be posing.
Ultimately the performance of your application will be determined by the algorithms and data structures you use. No amount of micro-optimization can hope to account for the huge performance differences that can crop up between different choices of algorithms. Thus the most important tool you can have in your arsenal is a decent profiler. Thankfully there are many good profilers available for the .NET platform. Some of the profiling tools are specific to certain areas of managed coding, such as the CLR Profiler, which is useful for profiling the allocation patterns of your managed application. Others, like DevPartner, allow you to profile the entire application, identifying performance bottlenecks in both managed and unmanaged code. Finally there are the low level profiling tools, such as the SOS Debugging Tools. These tools give you extremely detailed information about the performance of your systems but are hard to use.
Applications designed and built towards a managed platform tend to have different design decisions behind them than unmanaged applications. Even such fundamental things as memory allocation patterns are usually quite a bit different. With object lifetimes being non-deterministic, one has to apply different patterns to ensure the timely release of resources. Allocation patterns are also different, partly due to the inability to allocate objects on the stack, but also due to the ease of allocation on the managed heap. Allocating on an unmanaged heap typically requires a heap walk to find a block of free space that is at least the size of the block requested. The managed allocator typically allocates at the end of the heap, resulting in significantly faster allocation times (constant time, for the most part). These changes to the underlying assumptions that drive the system typically have large sweeping changes on the overall design of the systems.
Future Developments
Theoretically a JIT compiler can outperform a standard compiler simply because it can target the platform in ways that traditional compilation cannot. Traditionally, to target different instruction sets, you would have to compile a binary for each instruction set. For instance, targeting SSE2 would require you to build a separate binary from that of your non-SSE2 branch. You could, of course, do this through the use of DLLs, or by custom writing your SSE2 code and using function pointers to dictate which branch to choose.
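The function pointer approach mentioned above might look something like the following C++ sketch. Everything here is illustrative: SumFn, sum_scalar, sum_sse2 and select_sum are names I made up, and the cpu_has_sse2 flag is assumed to come from a CPUID check performed elsewhere. The SSE2 path is only compiled when the compiler advertises SSE2 support; otherwise the selector falls back to the scalar routine.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Signature shared by every implementation of the routine.
using SumFn = double (*)(double const* values, std::size_t count);

// Portable fallback branch.
double sum_scalar(double const* values, std::size_t count) {
    assert(values != nullptr || count == 0);
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) total += values[i];
    return total;
}

#if SKETCH_HAVE_SSE2
// SSE2 branch: adds two doubles per instruction, then folds the two lanes.
double sum_sse2(double const* values, std::size_t count) {
    __m128d acc = _mm_setzero_pd();
    std::size_t i = 0;
    for (; i + 2 <= count; i += 2) {
        acc = _mm_add_pd(acc, _mm_loadu_pd(values + i));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double total = lanes[0] + lanes[1];
    for (; i < count; ++i) total += values[i];  // odd trailing element, if any
    return total;
}
#endif

// Picks the branch once; callers then go through the returned pointer.
// `cpu_has_sse2` is assumed to come from a CPUID check done elsewhere.
SumFn select_sum(bool cpu_has_sse2) {
#if SKETCH_HAVE_SSE2
    if (cpu_has_sse2) return &sum_sse2;
#else
    (void)cpu_has_sse2;
#endif
    return &sum_scalar;
}
```

The selection happens once at startup, after which every call pays only for an indirect call. This is the manual bookkeeping a JIT compiler could, in principle, do on your behalf.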
Hand written SIMD code is often faster than compiler generated SIMD, due to the ability to manually vectorize the data, thus enabling true SIMD to take place. Some compilers, like the Intel C++ Compiler, can perform automatic vectorization. However, such a compiler is unable to guarantee the accuracy of the resulting binary, and extensive testing typically has to be done in order to ensure that the functionality was correctly generated. While most compilers have the option to target SIMD instruction sets, they usually use it to replace standard floating point operations where they can, as the scalar SIMD instructions are generally faster than their FPU counterparts.
The JIT compiler could target any SIMD instruction set supported by its platform, along with any other hardware specific optimizations it knew about. While automatic vectorization is not likely to be in a JIT release anytime soon, even using the non-vectorized SIMD instruction sets can help to parallelize your processing. As an example, multiple independent SIMD operations can typically run in parallel (that is, an add and a multiplication could both run simultaneously). Furthermore, the JIT can allow any .NET application to target any system it supports, provided the libraries it uses are also available on that system. This means that, provided you aren't doing anything highly non-portable such as assuming that a pointer is 32 bits wide, your application could be JIT compiled to target a 64-bit platform and run natively that way.
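The pointer size assumption is easy to illustrate. The sketch below is in C++ rather than a .NET language, and the two helper names are hypothetical; it simply shows a pointer being stored in an integer type that is guaranteed to be wide enough on both 32-bit and 64-bit targets, in the same spirit as using IntPtr instead of int in managed code.

```cpp
#include <cassert>
#include <cstdint>

// Round-trips a pointer through an integer without assuming its width.
// std::uintptr_t is wide enough on 32-bit and 64-bit targets alike, whereas
// a fixed 32-bit integer would truncate addresses on a 64-bit system.
std::uintptr_t pointer_to_integer(void const* p) {
    assert(p != nullptr);  // this sketch only deals with real addresses
    return reinterpret_cast<std::uintptr_t>(p);
}

void const* integer_to_pointer(std::uintptr_t value) {
    return reinterpret_cast<void const*>(value);
}
```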
Another area of potential advancement includes the realm of Profile Guided Optimization. Currently POGO is restricted to the arena of unmanaged applications, as it requires the ability to generate raw machine code and to perform instruction reordering. In essence you instrument an application with a POGO profiler; then you use the application normally to allow the profiler to collect usage data and to find the hotspots. Finally you run the optimizer on the solution, which will rebuild the solution, using the profiling data it gathered to optimize the heavily utilized sections of your application. A JIT compiler could instrument a managed program on first launch and watch its usage, while in another thread it could be optimizing the machine code using the profiling data that it gathers. The resulting cached binary image would be optimized on the next launch (excepting those areas that had not been accessed, and thus the JIT hadn't compiled yet). This would be especially effective on systems with multiple cores.
JIT Compilation for the x86
The JIT compiler for the x86 platform, as of .NET 2.0, does not support SIMD instruction sets. It will generate occasional MMX or SSE instructions for some integral and floating point promotions, but otherwise it will not utilize SIMD instruction sets. Inlining poses its own problems for the JIT compiler. Currently the JIT compiler will only inline functions that are 32 bytes of IL or smaller. Because the JIT compiler runs under an extremely tight time constraint, it is forced to make sacrifices in the optimizations it can make. Inlining is typically an expensive operation because it requires shuffling around the addresses of everything that comes after the inlined code (which requires interpreting the IL, then determining if its address is before or after the inlined code, then making the appropriate adjustments...). Because of this, all but the smallest of methods will not be inlined. Here's a sample of a method that will not be inlined, and the IL that accompanies it:
public float SquareMagnitude() {
    return X * X + Y * Y + Z * Z;
}
.method public hidebysig instance float32 SquareMagnitude() cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldfld float32 Performance_Tests.Vector3::X
    L_0006: ldarg.0
    L_0007: ldfld float32 Performance_Tests.Vector3::X
    L_000c: mul
    L_000d: ldarg.0
    L_000e: ldfld float32 Performance_Tests.Vector3::Y
    L_0013: ldarg.0
    L_0014: ldfld float32 Performance_Tests.Vector3::Y
    L_0019: mul
    L_001a: add
    L_001b: ldarg.0
    L_001c: ldfld float32 Performance_Tests.Vector3::Z
    L_0021: ldarg.0
    L_0022: ldfld float32 Performance_Tests.Vector3::Z
    L_0027: mul
    L_0028: add
    L_0029: ret
}
This method, as you can tell, is 42 bytes long, counting the return instruction. Clearly this is over the 32 byte IL limit. However, the resulting assembly compiles down to just 25 bytes:
002802C0 D901 fld dword ptr [ecx]
002802C2 D9C0 fld st(0)
002802C4 DEC9 fmulp st(1),st
002802C6 D94104 fld dword ptr [ecx+4]
002802C9 D9C0 fld st(0)
002802CB DEC9 fmulp st(1),st
002802CD DEC1 faddp st(1),st
002802CF D94108 fld dword ptr [ecx+8]
002802D2 D9C0 fld st(0)
002802D4 DEC9 fmulp st(1),st
002802D6 DEC1 faddp st(1),st
002802D8 C3 ret
Methods that use this one, however, such as the Magnitude method, may be candidates for inlining, in which case they typically reduce to a call to the SquareMagnitude method followed by an fsqrt instruction.
Another area where the JIT has issues deals with value-types and inlining. Methods that take value-type parameters are not currently considered for inlining. There is a fix in the pipeline for this, as it is considered a bug. An example of this behavior can be seen in the following function, which, although far below the 32 bytes of IL limit, will not be inlined.
static float WillNotInline32(float f) {
    return f * f;
}
.method private hidebysig static float32 WillNotInline32(float32 f) cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldarg.0
    L_0002: mul
    L_0003: ret
}
The resulting call to this function and the assembly code of the function look as follows:
0087008F FF75F4 push dword ptr [ebp-0Ch]
00870092 FF154C302A00 call dword ptr ds:[002A304Ch]
----
003F01F8 D9442404 fld dword ptr [esp+4]
003F01FC DCC8 fmul st(0),st
003F01FE C20400 ret 4
Clearly the x86 JIT requires a lot more work before it will be able to produce machine code approaching that of a good optimizing compiler. However, the news isn't all grim. Interop between .NET and unmanaged code allows for you to write those methods that need to be highly optimized in a lower level language.

Depends, but probably yes.
A beginning programmer should be focusing on learning to program. That is: the process of taking a concept and turning it into an application. Problem solving, in other words. Learning to program is not the same thing as learning a programming language. Learning a programming language is about learning the syntax and standard library that come with said programming language. It may involve the process of problem solving, but that is not its primary concern.
Given that, one can quickly see that the best way to introduce a beginning programmer to programming is to get them to use a language that is quick and easy to get up and running in. There are many languages which are quick and easy to get up and running with. Python and Ruby are two prime examples, both of which have a very simple language syntax which allows for a lot of leeway for the programmer, without all the extra clutter that many other languages have (C++ cough). Another good choice, in my opinion, is C# which, when combined with Microsoft Visual C#, provides a very robust but easy to learn language. These languages all have many key features which make them easy to learn and use: All of them are generally garbage collected, they all have fairly simple syntax with few (if any) corner cases, and all of them have huge standard libraries that provide for a great deal of quick and easy to use functionality with minimal programmer effort.
C++ has almost none of those things. While there are useful tools in many IDEs, such as Visual Studio, IntelliSense and similar auto-completion tools are not perfect, even with the help of tools like WholeTomato's VAX. The C++ standard library is very small, dealing mainly with stream based IO, some minimal containers, threading and algorithms for operating on iterators. The rest of the work is left up to the developer. This means that for any sufficiently complex project you will either end up implementing a majority of the behaviors needed yourself, or having to dig up third party libraries and APIs for said behavior. Even the recent C++11 work hasn't really alleviated the problem. Then you have the language complexity, on which I've commented previously.
However, the C++ standard library does provide some features that should be in every developer's pocketbook... such as std::string. std::string behaves a lot more like a beginning programmer expects a primitive type to behave. They've learned that you can add integers and floats together, so why can't they add strings together? Well, with std::string they can, but with c-strings they can't. They've learned to compare integers and floats using the standard == operator, so why can't they do that with strings? With std::string they can, but with c-strings they can't (well, they "can", but the behavior is not what they want). They've learned how to read in integers and floats from std::cin, so why can't they do the same with strings? They can with std::string, but with c-strings they have to be careful of the length and also that they've pre-allocated it, which has hazards of its own... such as stack space issues when they try to create a char array of 5000 characters.
C-strings do not behave intuitively. They have no inherent length, instead preferring to use null terminators to indicate the end of the string. They cannot be trivially concatenated, instead requiring the user to ensure that appropriate space is available, and then they have to use various function calls to copy the string, and then they have to ensure that those string functions had the space required to copy the null terminator (which strncpy and other functions MAY omit if there isn't enough space in the destination). Comparison requires the use of functions like strcmp, which doesn't return true/false, but instead returns an integer indicating the string difference, with 0 being no differences. In a language where the user has been taught that 0/null generally means failure, remembering to test for 0 in that one off corner case is rather strange.
For a beginner, all that strangeness doesn't equate to extra power or better performance. Instead it equates to extra confusion, and strange crashes. Had they been taught std::string first, they would have been free and clear, able to use the familiar operators they are used to, while being safe and secure in the bosom that is std::string. In fact, it generally gets worse than that, as c-strings are usually taught before pointers! This makes it even more confusing for the poor beginner, because then they're introduced to arrays and pointers (instead of say std::vector), and now have a whole slew of new functionality to basically kill themselves with.
Thus, in conclusion, if you see a c-string in a beginner's code, it probably means they have a bug somewhere in their code.
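A short side by side sketch makes the contrast concrete. The helper names below (join_names, same_text, same_text_c) are mine, invented for this illustration; the standard library calls are the real ones.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// std::string: concatenation and comparison use the ordinary operators.
std::string join_names(std::string const& first, std::string const& last) {
    return first + " " + last;
}

bool same_text(std::string const& a, std::string const& b) {
    return a == b;
}

// C-string equivalent of same_text: strcmp returns 0 when the strings match,
// so the result has to be compared against 0 explicitly. Writing `a == b`
// here would compare the two addresses, not the characters.
bool same_text_c(char const* a, char const* b) {
    assert(a != nullptr && b != nullptr);
    return std::strcmp(a, b) == 0;
}
```

The std::string versions read the way a beginner already expects arithmetic types to read, while the c-string version needs both a library call and the inverted "zero means equal" test.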


A Simple C++ Quiz

This is a very basic C++ quiz. It mainly tests a wee bit of knowledge that I've found some people who profess to have a mastery of C++ to be lacking. The answers should all be based on the C++ standard, and not your compiler's implementation.
Use the following code snippet to answer questions 1 through 3:
int* p = new int[10];
int* j = p + 11;
int* k = p + 10;

  1. Is the second line well defined behavior?
  2. If the second line is well defined, where does the pointer j point to after the second line is executed?

  3. What are some of the legal operations that can be performed on the pointer k?

  4. What output should the following lines of code produce?
     int a = 10;
     std::cout<

  5. Assuming the function called in the following block of code has no default parameters, and that no operators are overloaded, how many parameters does it take? Which objects are passed to it?
     f((a, b, c), d, e, ((g, h), i));


So if you saw my last entry, it was a basic Win32 app that created a D3D11 device and cleared the display.

Well, I've expanded upon it a wee bit more. If you're familiar with the SlimDX MiniTri11 sample this should look, erm... similar?

Added some Macros to help me twiddle around in COM easier.
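For anyone who doesn't read MASM, here is roughly what the INVOKEC macro in the listing does, sketched in C++. This is not real COM: FakeObject, Method and invoke_slot are hypothetical stand-ins, and the methods take no extra arguments. The macro works with byte offsets into the vtable (4 bytes per slot on 32-bit x86), whereas this sketch uses slot indices so that it stays portable.

```cpp
#include <cassert>
#include <cstddef>

// A stand-in for a COM object: its first member points at a table of
// function pointers. These types are hypothetical, not real COM headers.
struct FakeObject;
using Method = unsigned long (*)(FakeObject* self);

struct FakeObject {
    Method const* vtable;    // what `mov eax, [ecx]` loads
    unsigned long refCount;
};

unsigned long fake_query(FakeObject* self) { return self->refCount; }
unsigned long fake_add_ref(FakeObject* self) { return ++self->refCount; }
unsigned long fake_release(FakeObject* self) { return --self->refCount; }

// Slot order mirrors IUnknown: QueryInterface, AddRef, Release.
Method const kFakeVtable[] = { fake_query, fake_add_ref, fake_release };

// Equivalent of INVOKEC for a method without extra arguments: fetch the
// vtable from the object, pick the slot, and pass the object as `this`.
unsigned long invoke_slot(FakeObject* object, std::size_t slot) {
    assert(object != nullptr && object->vtable != nullptr);
    return object->vtable[slot](object);
}
```

With that picture in mind, IUnknown_Release EQU 8h in the listing is simply slot 2 of the table expressed as a byte offset.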


.586
.model flat, stdcall
option casemap :none

includelib
includelib

@ArgRev MACRO arglist:REQ
LOCAL txt, arg
txt TEXTEQU <>
% FOR arg, <arglist>
txt CATSTR <arg>, <,>, txt
ENDM

txt SUBSTR txt, 1, @SizeStr( %txt ) - 1
txt CATSTR <!<>, txt, <!>>
EXITM txt
ENDM

INVOKEC MACRO comObject, methodNumber, args:VARARG
LOCAL txt
IFNB <args>
% FOR arg, @ArgRev( <args> )
push arg
ENDM
ENDIF
mov ecx, dword ptr [comObject]
push ecx
mov eax, [ecx]
mov edx, dword ptr [eax + methodNumber];
call edx
ENDM

RELEASE MACRO comObject
LOCAL skip
cmp comObject, 0
jz skip
INVOKEC comObject, IUnknown_Release
skip:
ENDM

DEFINEGUID MACRO name, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11
name DD _1
DW _2
DW _3
DB _4
DB _5
DB _6
DB _7
DB _8
DB _9
DB _10
DB _11
ENDM

ExitProcess PROTO stdcall :DWORD
MessageBoxA PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD
RegisterClassExA PROTO stdcall :DWORD
GetModuleHandleA PROTO stdcall :DWORD
DefWindowProcA PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD
CreateWindowExA PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD
PostQuitMessage PROTO stdcall :DWORD
ShowWindow PROTO stdcall :DWORD, :DWORD
UpdateWindow PROTO stdcall :DWORD
PeekMessageA PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD, :DWORD
TranslateMessage PROTO stdcall :DWORD
DispatchMessageA PROTO stdcall :DWORD
D3D11CreateDeviceAndSwapChain PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD
D3DCompile PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD
D3DGetInputSignatureBlob PROTO stdcall :DWORD, :DWORD, :DWORD

POINT STRUC 4
x DWORD ?
y DWORD ?
POINT ENDS

MSG STRUC 4
hwnd DWORD ?
message DWORD ?
wparam DWORD ?
lparam DWORD ?
time DWORD ?
point POINT <>
MSG ENDS

WNDCLASSEX STRUC 4
cbSize DWORD SIZEOF(WNDCLASSEX)
style DWORD CW_VREDRAW OR CW_HREDRAW OR CW_OWNDC
lpfnWndProc DWORD WndProc
cbClsExtra DWORD 0
cbWndExtra DWORD 0
hInstance DWORD ?
hIcon DWORD 0
hCursor DWORD 0
hbrBackground DWORD COLOR_BACKGROUND
lpszMenuName DWORD 0
lpszClassName DWORD className
hIconSm DWORD 0
WNDCLASSEX ENDS

DXGI_RATIONAL STRUC 4
numerator DWORD ?
denominator DWORD ?
DXGI_RATIONAL ENDS

DXGI_MODE_DESC STRUC 4
w DWORD 800
h DWORD 600
refreshRate DXGI_RATIONAL <60, 1>
format DWORD DXGI_FORMAT_R8G8B8A8_UNORM
scanLineOrder DWORD 0
scaling DWORD 0
DXGI_MODE_DESC ENDS

DXGI_SWAP_CHAIN_DESC STRUC 4
bufferDesc DXGI_MODE_DESC <>
sampleDesc DXGI_RATIONAL <1, 0>
usage DWORD 20h OR 40h
bufferCount DWORD 2
window DWORD ?
windowed DWORD 1
swapEffect DWORD 0
flags DWORD 0
DXGI_SWAP_CHAIN_DESC ENDS

D3D11_BUFFER_DESC STRUC 4
byteWidth DWORD ?
usage DWORD ?
bindFlags DWORD ?
cpuAccessFlags DWORD ?
miscFlags DWORD ?
structByteStride DWORD ?
D3D11_BUFFER_DESC ENDS

D3D11_SUBRESOURCE_DATA STRUC 4
srcData DWORD ?
scrap1 DWORD 0
scrap2 DWORD 0
D3D11_SUBRESOURCE_DATA ENDS

D3D11_INPUT_ELEMENT_DESC STRUC 4
semanticName DWORD ?
semanticIndex DWORD ?
format DWORD ?
slot DWORD ?
byteOffset DWORD ?
slotClass DWORD ?
dataStepRate DWORD ?
D3D11_INPUT_ELEMENT_DESC ENDS

D3D11_VIEWPORT STRUC 4
topLeftX REAL4 0.0
topLeftY REAL4 0.0
windowWidth REAL4 800.0
windowHeight REAL4 600.0
minDepth REAL4 0.0
maxDepth REAL4 1.0
D3D11_VIEWPORT ENDS

MB_OK EQU 0
SW_SHOW EQU 5
CW_VREDRAW EQU 1
CW_HREDRAW EQU 2
CW_OWNDC EQU 20h
COLOR_BACKGROUND EQU 1
WS_OVERLAPPED EQU 00000000h
WS_MINIMIZEBOX EQU 00020000h
WS_MAXIMIZEBOX EQU 00010000h
WS_SYSMENU EQU 00080000h
WS_THICKFRAME EQU 00040000h
WS_CAPTION EQU 00C00000h
WS_OVERLAPPEDWINDOW EQU WS_OVERLAPPED OR WS_MINIMIZEBOX OR WS_MAXIMIZEBOX OR WS_SYSMENU OR WS_THICKFRAME OR WS_CAPTION
WM_QUIT EQU 0012h

D3D_DRIVER_TYPE_UNKNOWN EQU 0
D3D_DRIVER_TYPE_HARDWARE EQU ( D3D_DRIVER_TYPE_UNKNOWN + 1 )
D3D11_CREATE_DEVICE_SINGLETHREADED EQU 1
D3D_FEATURE_LEVEL_9_1 EQU 09100h
D3D_FEATURE_LEVEL_9_2 EQU 09200h
D3D_FEATURE_LEVEL_9_3 EQU 09300h
D3D_FEATURE_LEVEL_10_0 EQU 0a000h
D3D_FEATURE_LEVEL_10_1 EQU 0a100h
D3D_FEATURE_LEVEL_11_0 EQU 0b000h
DXGI_FORMAT_R8G8B8A8_UNORM EQU 28
DXGI_FORMAT_R32G32B32A32_FLOAT EQU 2
D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST EQU 4

IUnknown_QueryInterface EQU 0h
IUnknown_AddRef EQU 4h
IUnknown_Release EQU 8h

ID3D10Blob_GetBufferData EQU IUnknown_Release + 4
ID3D10Blob_GetBufferSize EQU IUnknown_Release + 8

IDXGISwapChain_GetBuffer EQU 24h
IDXGISwapChain_Present EQU 20h

ID3D11Device_CreateRTView EQU 24h
ID3D11Device_CreateVS EQU 30h
ID3D11Device_CreatePS EQU 3Ch
ID3D11Device_CreateBuffer EQU 0Ch
ID3D11Device_CreateLayout EQU 2Ch

ID3D11Context_ClearRTView EQU 0C8h
ID3D11Context_OMSetRT EQU 84h
ID3D11Context_IASetLayout EQU 44h
ID3D11Context_VSSetShader EQU 2Ch
ID3D11Context_PSSetShader EQU 24h
ID3D11Context_IASetPrimTop EQU 60h
ID3D11Context_RSSetViewport EQU 0B0h
ID3D11Context_IASetVertBuff EQU 48h
ID3D11Context_Draw EQU 34h

.data
msgTitle DB "Oh shit son!", 0
regFail DB "Failed to register window class.", 0
cwFail DB "Failed to create window.", 0
showFail DB "Failed to show window.", 0
updateFail DB "Failed to update window.", 0
d3d11Fail1 DB "Failed to create D3D11 device.", 0
d3d11Fail2 DB "Failed to get back buffer from swap chain.", 0
d3d11Fail3 DB "Failed to create render target.", 0
d3d11Fail4 DB "Failed to compile pixel shader.", 0
d3d11Fail5 DB "Failed to compile vertex shader.", 0
d3d11Fail6 DB "Failed to get vertex shader layout.", 0
d3d11Fail7 DB "Failed to create vertex buffer.", 0
d3d11Fail8 DB "Failed to create input layout.", 0
d3d11Fail9 DB "Failed to create shader resource.", 0

messageIdx DD regFail, cwFail, showFail, updateFail, d3d11Fail1, d3d11Fail2, d3d11Fail3, d3d11Fail4, d3d11Fail5, d3d11Fail6, d3d11Fail7, d3d11Fail8, d3d11Fail9

className DB "TestClass", 0
windowTitle DB "My Window", 0
wndClass WNDCLASSEX <>
swapDesc DXGI_SWAP_CHAIN_DESC <>
featureLvl DD D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0, D3D_FEATURE_LEVEL_9_3
swapChain DD 0
device DD 0
newFeatLvl DD 0
context DD 0

blackColor DD 0, 0, 0, 0
backBuffer DD 0
view DD 0
DEFINEGUID ID3D11Texture2D_GUID, 6f15aaf2h, 0d208h, 04e89h, 09ah, 0b4h, 048h, 095h, 035h, 0d3h, 04fh, 09Ch

viewport D3D11_VIEWPORT <>

psSource DB "struct VS_IN { float4 pos : POSITION; float4 col : COLOR; }; struct PS_IN { float4 pos : SV_POSITION; float4 col : COLOR; }; float4 PS( PS_IN input ) : SV_Target { return input.col; }", 0
psSourceLen DD $ - OFFSET psSource
psTarget DB "ps_5_0", 0
psEntry DB "PS", 0

vsSource DB "struct VS_IN { float4 pos : POSITION; float4 col : COLOR; }; struct PS_IN { float4 pos : SV_POSITION; float4 col : COLOR; }; PS_IN VS( VS_IN input ) { PS_IN output = (PS_IN)0; output.pos = input.pos; output.col = input.col; return output; }", 0
vsSourceLen DD $ - OFFSET vsSource
vsTarget DB "vs_5_0", 0
vsEntry DB "VS", 0

vertData REAL4 0.0, 0.5, 0.5, 1.0, 1.0, 0.0, 0.0, 1.0
REAL4 0.5, -0.5, 0.5, 1.0, 0.0, 1.0, 0.0, 1.0
REAL4 -0.5, -0.5, 0.5, 1.0, 0.0, 0.0, 1.0, 1.0

vertDesc D3D11_BUFFER_DESC < 96, 0, 1, 0, 0, 0 >
vertSub D3D11_SUBRESOURCE_DATA < vertData >
vertDescPos DB "POSITION", 0
vertDescCol DB "COLOR", 0
vertLayout D3D11_INPUT_ELEMENT_DESC < OFFSET vertDescPos, 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 0, 0, 0 >
D3D11_INPUT_ELEMENT_DESC < OFFSET vertDescCol, 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 16, 0, 0 >
vertStride DD 32
vertOffset DD 0

vertBuff DD 0
inputLayout DD 0
psShader DD 0
psBlob DD 0
psBlobErr DD 0
vsShader DD 0
vsBlob DD 0
vsBlobErr DD 0
layoutBlob DD 0
vsError DD 0
blobPtr DD 0
blobSize DD 0

.code
WndProc PROC stdcall hwnd:DWORD, msg:DWORD, wparam:DWORD, lparam:DWORD
cmp msg, 2 ; WM_DESTROY
jnz done
INVOKE PostQuitMessage, 0
xor eax, eax
ret
done:
INVOKE DefWindowProcA, hwnd, msg, wparam, lparam
ret;
WndProc ENDP

main PROC
LOCAL hwnd:DWORD
LOCAL hmodule:DWORD
LOCAL msg:MSG

INVOKE GetModuleHandleA, 0
mov hmodule, eax
mov wndClass.hInstance, eax

INVOKE RegisterClassExA, OFFSET wndClass
cmp eax, 0
mov eax, 0
jz error

INVOKE CreateWindowExA, 0, OFFSET className, OFFSET windowTitle, WS_OVERLAPPEDWINDOW, 0, 0, 800, 600, 0, 0, hmodule, 0
mov hwnd, eax
cmp eax, 0
mov eax, 1
jz error

INVOKE ShowWindow, hwnd, SW_SHOW
INVOKE UpdateWindow, hwnd
cmp eax, 0
mov eax, 3
jz error

mov eax, hwnd
mov swapDesc.window, eax

INVOKE D3D11CreateDeviceAndSwapChain, 0, D3D_DRIVER_TYPE_HARDWARE, 0, 0, OFFSET featureLvl, 4, 7, OFFSET swapDesc, OFFSET swapChain, OFFSET device, OFFSET newFeatLvl, OFFSET context
cmp eax, 0
mov eax, 4
jnz error

INVOKEC swapChain, IDXGISwapChain_GetBuffer, 0, OFFSET ID3D11Texture2D_GUID, OFFSET backBuffer
cmp eax, 0
mov eax, 5
jnz error

INVOKEC device, ID3D11Device_CreateRTView, dword ptr [backBuffer], 0, OFFSET view
cmp eax, 0
mov eax, 6
jnz error

INVOKE D3DCompile, OFFSET psSource, psSourceLen, 0, 0, 0, OFFSET psEntry, OFFSET psTarget, 0, 0, OFFSET psBlob, OFFSET psBlobErr
cmp eax, 0
mov eax, 7
jnz error

INVOKE D3DCompile, OFFSET vsSource, vsSourceLen, 0, 0, 0, OFFSET vsEntry, OFFSET vsTarget, 0, 0, OFFSET vsBlob, OFFSET vsBlobErr
cmp eax, 0
mov eax, 8
jnz error

INVOKEC vsBlob, ID3D10Blob_GetBufferData
mov blobPtr, eax

INVOKEC vsBlob, ID3D10Blob_GetBufferSize
mov blobSize, eax

INVOKE D3DGetInputSignatureBlob, blobPtr, blobSize, OFFSET layoutBlob
cmp eax, 0
mov eax, 9
jnz error

INVOKEC device, ID3D11Device_CreateBuffer, OFFSET vertDesc, OFFSET vertSub, OFFSET vertBuff
cmp eax, 0
mov eax, 10
jnz error

INVOKEC layoutBlob, ID3D10Blob_GetBufferSize
mov blobSize, eax

INVOKEC layoutBlob, ID3D10Blob_GetBufferData

INVOKEC device, ID3D11Device_CreateLayout, OFFSET vertLayout, 2, eax, blobSize, OFFSET inputLayout
cmp eax, 0
mov eax, 11
jnz error

INVOKEC vsBlob, ID3D10Blob_GetBufferData
mov blobPtr, eax

INVOKEC vsBlob, ID3D10Blob_GetBufferSize

INVOKEC device, ID3D11Device_CreateVS, blobPtr, eax, 0, OFFSET vsShader
cmp eax, 0
mov eax, 12
jnz error

INVOKEC psBlob, ID3D10Blob_GetBufferData
mov blobPtr, eax

INVOKEC psBlob, ID3D10Blob_GetBufferSize

INVOKEC device, ID3D11Device_CreatePS, blobPtr, eax, 0, OFFSET psShader
cmp eax, 0
mov eax, 12
jnz error

RELEASE psBlob
RELEASE vsBlob
RELEASE layoutBlob

INVOKEC context, ID3D11Context_OMSetRT, 1, OFFSET view, 0
INVOKEC context, ID3D11Context_IASetLayout, inputLayout
INVOKEC context, ID3D11Context_VSSetShader, vsShader, 0, 0
INVOKEC context, ID3D11Context_PSSetShader, psShader, 0, 0
INVOKEC context, ID3D11Context_IASetPrimTop, D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST
INVOKEC context, ID3D11Context_RSSetViewport, 1, OFFSET viewport
INVOKEC context, ID3D11Context_IASetVertBuff, 0, 1, OFFSET vertBuff, OFFSET vertStride, OFFSET vertOffset
gameLoop:
msgLoop:
INVOKE PeekMessageA, ADDR msg, 0, 0, 0, 1
cmp eax, 0
jz gameNext
cmp msg.message, WM_QUIT
jz done
INVOKE TranslateMessage, ADDR msg
INVOKE DispatchMessageA, ADDR msg
jmp msgLoop
gameNext:
INVOKEC context, ID3D11Context_ClearRTView, view, OFFSET blackColor
INVOKEC context, ID3D11Context_Draw, 3, 0
INVOKEC swapChain, IDXGISwapChain_Present, 0, 0
jmp gameLoop

done:
RELEASE view
RELEASE backBuffer
RELEASE swapChain
RELEASE device
RELEASE context
RELEASE vsShader
RELEASE psShader
RELEASE inputLayout
RELEASE vertBuff
INVOKE ExitProcess, eax
ret
error:
mov eax, messageIdx[eax*4]
INVOKE MessageBoxA, 0, eax, ADDR msgTitle, MB_OK
jmp done
main ENDP
END main
Washu

When Boredom Strikes

Washu does crazy things. In this case I wrote this up...


.586
.model flat, stdcall
option casemap :none

@ArgRev MACRO arglist:REQ
LOCAL txt, arg
txt TEXTEQU <>
% FOR arg, <arglist>
txt CATSTR <arg>, <,>, txt
ENDM

txt SUBSTR txt, 1, @SizeStr( %txt ) - 1
txt CATSTR <!<>, txt, <!>>
EXITM txt
ENDM

INVOKEC MACRO comObject, methodNumber, args:VARARG
LOCAL txt
IFNB <args>
% FOR arg, @ArgRev( <args> )
push arg
ENDM
ENDIF
mov ecx, dword ptr [comObject]
push ecx
mov eax, [ecx]
mov edx, dword ptr [eax + methodNumber];
call edx
ENDM

RELEASE MACRO comObject
LOCAL skip
cmp comObject, 0
jz skip
INVOKEC comObject, IUnknown_Release
skip:
ENDM

DEFINEGUID MACRO name, _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11
name DD _1
DW _2
DW _3
DB _4
DB _5
DB _6
DB _7
DB _8
DB _9
DB _10
DB _11
ENDM

ExitProcess PROTO stdcall :DWORD
MessageBoxA PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD
RegisterClassExA PROTO stdcall :DWORD
GetModuleHandleA PROTO stdcall :DWORD
DefWindowProcA PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD
CreateWindowExA PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD
PostQuitMessage PROTO stdcall :DWORD
ShowWindow PROTO stdcall :DWORD, :DWORD
UpdateWindow PROTO stdcall :DWORD
PeekMessageA PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD, :DWORD
TranslateMessage PROTO stdcall :DWORD
DispatchMessageA PROTO stdcall :DWORD
D3D11CreateDeviceAndSwapChain PROTO stdcall :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD, :DWORD

POINT STRUC 4
x DWORD ?
y DWORD ?
POINT ENDS

MSG STRUC 4
hwnd DWORD ?
message DWORD ?
wparam DWORD ?
lparam DWORD ?
time DWORD ?
point POINT <>
MSG ENDS

WNDCLASSEX STRUC 4
cbSize DWORD SIZEOF(WNDCLASSEX)
style DWORD CW_VREDRAW OR CW_HREDRAW OR CW_OWNDC
lpfnWndProc DWORD WndProc
cbClsExtra DWORD 0
cbWndExtra DWORD 0
hInstance DWORD ?
hIcon DWORD 0
hCursor DWORD 0
hbrBackground DWORD COLOR_BACKGROUND
lpszMenuName DWORD 0
lpszClassName DWORD className
hIconSm DWORD 0
WNDCLASSEX ENDS

DXGI_RATIONAL STRUC 4
numerator DWORD ?
denominator DWORD ?
DXGI_RATIONAL ENDS

DXGI_MODE_DESC STRUC 4
w DWORD 800
h DWORD 600
refreshRate DXGI_RATIONAL <60, 1>
format DWORD DXGI_FORMAT_R8G8B8A8_UNORM
scanLineOrder DWORD 0
scaling DWORD 0
DXGI_MODE_DESC ENDS

DXGI_SWAP_CHAIN_DESC STRUC 4
bufferDesc DXGI_MODE_DESC <>
sampleDesc DXGI_RATIONAL <1, 0>
usage DWORD 20h OR 40h
bufferCount DWORD 2
window DWORD ?
windowed DWORD 1
swapEffect DWORD 0
flags DWORD 0
DXGI_SWAP_CHAIN_DESC ENDS

MB_OK EQU 0
SW_SHOW EQU 5
CW_VREDRAW EQU 1
CW_HREDRAW EQU 2
CW_OWNDC EQU 20h
COLOR_BACKGROUND EQU 1
WS_OVERLAPPED EQU 00000000h
WS_MINIMIZEBOX EQU 00020000h
WS_MAXIMIZEBOX EQU 00010000h
WS_SYSMENU EQU 00080000h
WS_THICKFRAME EQU 00040000h
WS_CAPTION EQU 00C00000h
WS_OVERLAPPEDWINDOW EQU WS_OVERLAPPED OR WS_MINIMIZEBOX OR WS_MAXIMIZEBOX OR WS_SYSMENU OR WS_THICKFRAME OR WS_CAPTION
WM_QUIT EQU 0012h

D3D_DRIVER_TYPE_UNKNOWN EQU 0
D3D_DRIVER_TYPE_HARDWARE EQU ( D3D_DRIVER_TYPE_UNKNOWN + 1 )
D3D11_CREATE_DEVICE_SINGLETHREADED EQU 1
D3D_FEATURE_LEVEL_11_0 EQU 0b000h
DXGI_FORMAT_R8G8B8A8_UNORM EQU 28

IUnknown_QueryInterface EQU 0h
IUnknown_AddRef EQU 4h
IUnknown_Release EQU 8h

IDXGISwapChain_GetBuffer EQU 24h
IDXGISwapChain_Present EQU 20h

ID3D11Device_CreateRTView EQU 24h

ID3D11Context_ClearRTView EQU 0C8h

.data
msgTitle DB "Oh shit son!", 0
regFail DB "Failed to register window class.", 0
cwFail DB "Failed to create window.", 0
showFail DB "Failed to show window.", 0
updateFail DB "Failed to update window.", 0
d3d11Fail1 DB "Failed to create D3D11 device.", 0
d3d11Fail2 DB "Failed to get back buffer from swap chain.", 0
d3d11Fail3 DB "Failed to create render target.", 0

messageIdx DD regFail, cwFail, showFail, updateFail, d3d11Fail1, d3d11Fail2, d3d11Fail3

className DB "TestClass", 0
windowTitle DB "My Window", 0
wndClass WNDCLASSEX <>
swapDesc DXGI_SWAP_CHAIN_DESC <>
featureLvl DD D3D_FEATURE_LEVEL_11_0
swapChain DD 0
device DD 0
newFeatLvl DD 0
context DD 0

blackColor DD 0, 0, 0, 0
backBuffer DD 0
view DD 0
DEFINEGUID ID3D11Texture2D_GUID, 6f15aaf2h, 0d208h, 04e89h, 09ah, 0b4h, 048h, 095h, 035h, 0d3h, 04fh, 09Ch

.code
WndProc PROC stdcall hwnd:DWORD, msg:DWORD, wparam:DWORD, lparam:DWORD
cmp msg, 2 ; WM_DESTROY
jnz done
INVOKE PostQuitMessage, 0
xor eax, eax
ret
done:
INVOKE DefWindowProcA, hwnd, msg, wparam, lparam
ret;
WndProc ENDP

main PROC
LOCAL hwnd:DWORD
LOCAL hmodule:DWORD
LOCAL msg:MSG

INVOKE GetModuleHandleA, 0
mov hmodule, eax
mov wndClass.hInstance, eax

INVOKE RegisterClassExA, ADDR wndClass
cmp eax, 0
mov eax, 0
jz error

INVOKE CreateWindowExA, 0, ADDR className, ADDR windowTitle, WS_OVERLAPPEDWINDOW, 0, 0, 800, 600, 0, 0, hmodule, 0
mov hwnd, eax
cmp eax, 0
mov eax, 1
jz error

INVOKE ShowWindow, hwnd, SW_SHOW
INVOKE UpdateWindow, hwnd
cmp eax, 0
mov eax, 3
jz error

mov eax, hwnd
mov swapDesc.window, eax

INVOKE D3D11CreateDeviceAndSwapChain, 0, D3D_DRIVER_TYPE_HARDWARE, 0, 0, ADDR featureLvl, 1, 7, ADDR swapDesc, ADDR swapChain, ADDR device, ADDR newFeatLvl, ADDR context
cmp eax, 0
mov eax, 4
jnz error

lea eax, [backBuffer]
INVOKEC swapChain, IDXGISwapChain_GetBuffer, 0, OFFSET ID3D11Texture2D_GUID, eax
cmp eax, 0
mov eax, 5
jnz error

lea eax, [view]
INVOKEC device, ID3D11Device_CreateRTView, dword ptr [backBuffer], 0, eax
cmp eax, 0
mov eax, 6
jnz error

gameLoop:
msgLoop:
INVOKE PeekMessageA, ADDR msg, 0, 0, 0, 1
cmp eax, 0
jz gameNext
cmp msg.message, WM_QUIT
jz done
INVOKE TranslateMessage, ADDR msg
INVOKE DispatchMessageA, ADDR msg
jmp msgLoop
gameNext:
INVOKEC context, ID3D11Context_ClearRTView, view, OFFSET blackColor
INVOKEC swapChain, IDXGISwapChain_Present, 0, 0
jmp gameLoop

done:
RELEASE view
RELEASE backBuffer
RELEASE swapChain
RELEASE device
RELEASE context
INVOKE ExitProcess, eax
ret
error:
mov eax, messageIdx[eax*4]
INVOKE MessageBoxA, 0, eax, ADDR msgTitle, MB_OK
RELEASE view
RELEASE backBuffer
RELEASE swapChain
RELEASE device
RELEASE context
INVOKE ExitProcess, eax
main ENDP
END main


If you're asking yourself, "Is that a Win32 application that creates a D3D11 device, clears the render target, and then presents it?", then yes, it is.
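For anyone who finds the INVOKEC macro opaque, here is a rough C++ picture of what it does: load the table pointer stored at the start of the object, pick a function pointer by byte offset, and call it with the object as the first argument. This is only a sketch. FakeObject, its three functions, and InvokeByOffset are made-up stand-ins rather than real COM types, and the 4-byte slot size mirrors the 32-bit listing.

```cpp
#include <cstddef>

// Hypothetical stand-in for a COM object: its first field points at a table
// of function pointers, which is the layout INVOKEC relies on.
struct FakeObject;
typedef unsigned long (*Method)(FakeObject* self);

struct FakeObject {
    Method const* vtable;    // what "mov eax, [ecx]" loads
    unsigned long refCount;
};

unsigned long FakeQueryInterface(FakeObject*) { return 0; }
unsigned long FakeAddRef(FakeObject* self) { return ++self->refCount; }
unsigned long FakeRelease(FakeObject* self) { return --self->refCount; }

// Slot order mirrors IUnknown: QueryInterface, AddRef, Release.
Method const kFakeVtable[] = { FakeQueryInterface, FakeAddRef, FakeRelease };

// The listing uses byte offsets into a table of 4-byte pointers.
std::size_t const kSlotSize = 4;
std::size_t const kAddRefOffset = 0x4;
std::size_t const kReleaseOffset = 0x8;

// Equivalent of INVOKEC with no extra arguments.
unsigned long InvokeByOffset(FakeObject* object, std::size_t byteOffset) {
    Method method = object->vtable[byteOffset / kSlotSize]; // mov edx, [eax + methodNumber]
    return method(object);                                  // push ecx / call edx
}
```

Calling InvokeByOffset(&object, kReleaseOffset) on an object whose vtable field points at kFakeVtable is the moral equivalent of the RELEASE macro, minus the null check.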
Washu

The New C++ - functions

C++11 significantly expanded the C++ standard library with a number of new libraries and functionality, which isn't actually all that new if you've used boost. These libraries add a great deal of needed functionality, although it still doesn't compare to the standard library you get with many other languages.

Two of the new extensions to the standard library are function wrappers and arbitrary function binding, both located in the <functional> header.

Function and Bind in <functional>


The <functional> header has been expanded with a whole slew of new capabilities. Two that stand out as particularly useful are the function binding capabilities and the generic function container.

std::function


The std::function class allows you to hold an arbitrary callable object. That includes function pointers, functors, and lambdas. The class takes the signature of the function it expects to contain as a template parameter, and overloads the function call operator (and thus is itself a functor) to present a callable interface.

At first glance you might wonder what the use is, but if you think about it for a short spell, many possible uses of a type-safe arbitrary function container will pop out. We shall use one of them in particular to demonstrate many of the concepts: an event dispatcher, which allows arbitrary event handlers to be registered. The implementation will be somewhat generic for demonstration purposes, but a concrete case you could consider using it for is Win32 events.

We'll start off with a basic event arguments object, which will hold the arguments we wish to pass to our event; an example would be the location of the mouse during a mouse click event. Then we'll define a basic multicast event, which will hold a vector of handlers that have registered to receive notification of that event. We'll also provide a method to remove all registered handlers. An implementation that removes specific handlers will be covered later, since it requires a few tricks to get around issues that we'll get to in a bit.
#include <functional>
#include <iostream>
#include <vector>

struct EventArgs {};

class MulticastEvent {
public:
typedef std::function<void (EventArgs const&)> EventSignature;

public:
void AddHandler(EventSignature const& f) {
functions.push_back(f);
}

void Clear() {
functions.clear();
}

void Invoke(EventArgs const& arg) {
//for each(auto const& f in functions) in Visual Studio
for(auto const& f : functions) {
f(arg);
}
}

private:
std::vector<EventSignature> functions;
};

void handler(EventArgs const&) {
std::cout<<"Handler function."<<std::endl;
}

struct EventHandler {
void operator()(EventArgs const&) {
std::cout<<"EventHandler::operator()."<<std::endl;
}
};

int main() {
MulticastEvent myEvent;
myEvent.AddHandler(handler);
myEvent.AddHandler(EventHandler());
myEvent.AddHandler([](EventArgs const& arg) { std::cout<<"Lambda function."<<std::endl; });
myEvent.Invoke(EventArgs());
}

We can see the usefulness of this simple block of code, as we can now implement any arbitrary event dispatcher. It is useful to note that std::function treats functors (that overload operator()) the same as functions. We'll see how to bind arbitrary member functions in a bit. We can also see that lambdas can be encapsulated in std::function objects, thus enabling you to provide a very generic, yet typesafe, callback mechanism.

std::bind and placeholders


In addition to the function wrapper class above, new binding capabilities were added to the standard library as well. You might be familiar with bind1st and bind2nd, two functions in C++98 that returned functors with a value bound to the first or second argument, respectively, of a binary function. In C++11 we get fully generic parameter binding through the std::bind function.

The std::bind function takes a set of arguments whereby the first argument is the function to apply the binding to, and the remaining arguments are either values to bind to the argument at that position, or placeholders for the resulting functor's arguments. The placeholders are found in the std::placeholders namespace. An example will show this off best:
void BoundHandler(int i, EventArgs const&) {
std::cout<<"Bound Handler function "<<i<<std::endl;
}

struct EventHandler {
void MemberFunction(EventArgs const&) {
std::cout<<"EventHandler::MemberFunction."<<std::endl;
}
};
int main() {
MulticastEvent myEvent;
myEvent.AddHandler(std::bind(BoundHandler, 1, std::placeholders::_1));

EventHandler handler;
myEvent.AddHandler(std::bind(&EventHandler::MemberFunction, handler, std::placeholders::_1));
myEvent.Invoke(EventArgs());
}

As can be seen, the first handler has its first parameter bound to the literal integer 1, while its second parameter is bound to the placeholder _1, which then gets replaced with the event argument passed to Invoke(). The second handler passes in a non-static member function with its "this" parameter bound to the instance we wish it to be invoked against. Note that std::bind copies its bound arguments, so as written the functor carries its own copy of 'handler'. Had we bound a pointer or reference to the instance instead (&handler or std::ref(handler)), then once 'handler' went out of scope we would end up in a nasty situation where our MulticastEvent contains a std::function whose functor refers to an object that no longer exists. Thus we now need to work on adding removal to our MulticastEvent.
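Whether a bound object is copied or referenced is easy to get wrong, so here is a small sketch that makes the difference observable. The Counter type and CallsSeenByOriginal are hypothetical names invented for this illustration; only std::bind, std::ref, std::function and the placeholders come from the standard library.

```cpp
#include <functional>

// Hypothetical counter type, not part of the event code above.
struct Counter {
    Counter() : calls(0) {}
    void Hit(int) { ++calls; }
    int calls;
};

// Binds Counter::Hit to a local Counter, invokes the functor once, and
// returns how many calls the ORIGINAL object recorded.
int CallsSeenByOriginal(bool byReference) {
    Counter counter;
    std::function<void (int)> f;
    if (byReference) {
        // std::ref stores a reference, so the call reaches 'counter' itself.
        f = std::bind(&Counter::Hit, std::ref(counter), std::placeholders::_1);
    } else {
        // Passing the object stores a copy inside the functor.
        f = std::bind(&Counter::Hit, counter, std::placeholders::_1);
    }
    f(42);
    return counter.calls;
}
```

The by-value form is safe to keep around after the original dies but never touches it; the std::ref form touches the original and is only valid while the original is alive.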

Implementing Event Deregistration


The main problem with removing a specific handler is that std::function objects cannot be compared with one another (the only comparison provided is against nullptr), because, simply put, you cannot be sure that the two contained "function objects" are even comparable. An example is a stateless functor that you simply construct when you create the std::function object, as shown in the first example.

Thus we need to provide an alternative method for removing function objects from our MulticastEvent. Perhaps the simplest mechanism to use is to simply return a handle which can later be used to find and remove the specific event. Doing this will require a few minor changes to our setup: We'll want to provide a mapping mechanism from handle to std::function, so we'll wish to change our container from a std::vector to a std::map (or similar container such as std::unordered_map). We'll also need to provide a handle type for returning, which in our case will be a simple integral ID. Lastly we'll need a function that, when given a handle, removes the appropriate event handler.
class MulticastEvent {
public:
typedef std::function<void (EventArgs const&)> EventSignature;
typedef int EventHandle;

public:
MulticastEvent() : eventHandle(0) {}

EventHandle AddHandler(EventSignature const& f) {
functions.insert(std::make_pair(++eventHandle, f));
return eventHandle;
}

void RemoveHandler(EventHandle handle) {
functions.erase(handle);
}

void Clear() {
functions.clear();
}

void Invoke(EventArgs const& arg) {
//for each(auto const& f in functions) in Visual Studio
for(auto const& f : functions) {
f.second(arg);
}
}

private:
EventHandle eventHandle;
std::map<EventHandle, EventSignature> functions;
};

This proves to be pretty simple, and wrapping up the integer in a typedef allows us to replace the handle with a structure that provides some form of scoping (and perhaps implicit release on destruction).
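As a rough idea of what such a scoping structure could look like, here is a sketch of a wrapper that registers a handler in its constructor and removes it in its destructor. ScopedHandler and CountScopedCalls are names I made up for the illustration, the MulticastEvent is a condensed copy of the class above so the snippet compiles on its own, and the whole thing assumes the event outlives every ScopedHandler attached to it.

```cpp
#include <functional>
#include <map>
#include <utility>

struct EventArgs {};

// Condensed copy of the handle-based MulticastEvent from above.
class MulticastEvent {
public:
    typedef std::function<void (EventArgs const&)> EventSignature;
    typedef int EventHandle;

    MulticastEvent() : eventHandle(0) {}
    EventHandle AddHandler(EventSignature const& f) {
        functions.insert(std::make_pair(++eventHandle, f));
        return eventHandle;
    }
    void RemoveHandler(EventHandle handle) { functions.erase(handle); }
    void Invoke(EventArgs const& arg) {
        for (auto const& f : functions) { f.second(arg); }
    }
private:
    EventHandle eventHandle;
    std::map<EventHandle, EventSignature> functions;
};

// Hypothetical RAII wrapper: registers on construction, deregisters on destruction.
class ScopedHandler {
public:
    ScopedHandler(MulticastEvent& event, MulticastEvent::EventSignature const& f)
        : event(event), handle(event.AddHandler(f)) {}
    ~ScopedHandler() { event.RemoveHandler(handle); }
private:
    ScopedHandler(ScopedHandler const&);            // declared only: non-copyable
    ScopedHandler& operator=(ScopedHandler const&); // declared only: non-copyable
    MulticastEvent& event;
    MulticastEvent::EventHandle handle;
};

// Invokes the event once while a scoped handler is alive and once after it
// has been destroyed, and returns how many times the handler ran.
int CountScopedCalls() {
    MulticastEvent event;
    int calls = 0;
    {
        ScopedHandler scoped(event, [&calls](EventArgs const&) { ++calls; });
        event.Invoke(EventArgs());
    }
    event.Invoke(EventArgs());
    return calls;
}
```

Making the wrapper non-copyable keeps two objects from trying to remove the same handle; a movable version would be a natural next step.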

Additional Bits


In addition to the major pieces above, the <functional> header has a few other new additions. Of particular interest to us is the std::hash class, which the unordered associative containers use to hash their keys. There is also std::reference_wrapper, a class that wraps up a reference, allowing for copying and assignment (unlike a plain reference). A std::reference_wrapper is usually constructed using two new utility functions, std::ref and std::cref, which return a wrapper to a reference or a constant reference, respectively. Finally, two type traits are provided: std::is_bind_expression, whose static value member is true when the type is a bind expression, and std::is_placeholder, whose static value member is the placeholder's number (1 for _1, 2 for _2, and so on) when the type is a placeholder and 0 otherwise.
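A short sketch exercising each of these pieces may help. Everything called here is standard library; the helper function names (Increment, IncrementThroughBind, SumThroughWrappers, HashIsStable) are made up for the example.

```cpp
#include <functional>
#include <string>
#include <type_traits>
#include <vector>

void Increment(int& value) { ++value; }

// std::ref lets std::bind store a reference instead of a copy.
int IncrementThroughBind() {
    int value = 0;
    std::function<void ()> f = std::bind(Increment, std::ref(value));
    f();
    return value;
}

// reference_wrapper is copyable and assignable, so it can live in a container.
int SumThroughWrappers() {
    int a = 1, b = 2;
    std::vector<std::reference_wrapper<int>> refs;
    refs.push_back(std::ref(a));
    refs.push_back(std::ref(b));
    a = 10; // the change is visible through the wrapper
    return refs[0].get() + refs[1].get();
}

// The two traits can be checked at compile time.
typedef std::decay<decltype(std::placeholders::_1)>::type FirstPlaceholder;
static_assert(std::is_placeholder<FirstPlaceholder>::value == 1, "_1 is placeholder number 1");
static_assert(std::is_placeholder<int>::value == 0, "int is not a placeholder");
static_assert(std::is_bind_expression<decltype(std::bind(Increment, std::placeholders::_1))>::value,
              "the result of std::bind is a bind expression");
static_assert(!std::is_bind_expression<int>::value, "int is not a bind expression");

// std::hash is the default hasher used by the unordered containers.
bool HashIsStable() {
    std::hash<std::string> hasher;
    return hasher("event") == hasher(std::string("event"));
}
```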

Conclusion


I mentioned earlier the ability to handle Win32 events, and the above MulticastEvent is certainly capable of this. However, with one minor change we can make it a lot more capable than it currently is. Specifically, if we template the MulticastEvent so that we can specify the type of the argument we wish to pass, we can eliminate the EventArgs base class entirely. We can also add some operator overloading to clean up the addition and removal of handlers, along with the invocation of the MulticastEvent object. Thus we end up with the following basic piece of code for handling arbitrary events, including (as demonstrated) mouse click events:
#include <functional>
#include <iostream>
#include <map>

template<class EventArgType>
class MulticastEvent {
public:
typedef std::function<void (EventArgType const&)> EventSignature;
typedef int EventHandle;

public:
MulticastEvent() : eventHandle(0) {}

EventHandle AddHandler(EventSignature const& f) {
functions.insert(std::make_pair(++eventHandle, f));
return eventHandle;
}

void RemoveHandler(EventHandle handle) {
functions.erase(handle);
}

void Clear() {
functions.clear();
}

MulticastEvent& operator+=(EventSignature const& f) {
AddHandler(f);
return *this;
}

MulticastEvent& operator-=(EventHandle handle) {
RemoveHandler(handle);
return *this;
}

void operator()(EventArgType const& arg) {
//for each(auto const& f in functions) in Visual Studio
for(auto const& f : functions) {
f.second(arg);
}
}

private:
EventHandle eventHandle;
std::map<EventHandle, EventSignature> functions;
};

struct MouseClickEventArgs {
MouseClickEventArgs(int x, int y, int button) : x(x), y(y), button(button) {}
int x, y, button;
};

void HandleMouseClick(MouseClickEventArgs const& args) {
std::cout<<"Mouse click at: ("<<args.x<<", "<<args.y<<")"<<std::endl;
}

int main() {
MulticastEvent<MouseClickEventArgs> myEvent;
myEvent += HandleMouseClick;
myEvent(MouseClickEventArgs(16, 37, 1));
}
Washu

The New C++ - lambdas

Ah, lambdas. If you've used any functional languages, Python, Ruby, or C# (or many other languages), you are probably familiar with the concept of lambdas. However, if you've been doing C++ for a while and haven't used Boost's lambda library, then let me be the first to introduce you to the concept... A lambda is, essentially, an unnamed (or anonymous) function. You can use lambdas as a simplification for many common scenarios that crop up frequently throughout code. Examples include a sort predicate, a complex search predicate, or an event system.

Lambda Basics


Probably the most common type of lambda you'll encounter is one that simply takes a set of parameters and returns a result based on them. One of the simplest examples of this is the factorial function. The factorial function returns the product of all integers from 1 to N, and is typically represented as N! (read as "N factorial"). It grows extremely quickly, and will typically outgrow the range of an integer in a few steps. If we take the code from our previous entries and use that as a starting point, we can use the v1 vector of integers and compute the factorial of 10:
auto factorial = [](int lhs, int rhs) { return lhs * rhs; };
auto result = std::accumulate(v1.begin(), v1.end(), 1, factorial);
std::cout<<result<<std::endl;
The lambda portion of the previous statements is clearly the definition of the factorial variable. The lambda syntax in C++ leaves much to be desired, but I shall endeavor to explain the syntax...

We can break it up into four basic parts: the capture statement, the parameters, the return type, and the body of the lambda:
[ capture-statement ] ( parameters ) /* optional if empty */ -> return-type /*optional*/ { body }
Capture statements and return types need some explaining; the parameters and body, though, are pretty self-explanatory. The only thing I will state about the parameters is that if the lambda takes no arguments then the parentheses are optional. I tend to insert them anyway, just for clarity.
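To make the four parts a little more concrete, here is a small sketch with one lambda per combination. The function and variable names are invented for the example; std::to_string is the only library call.

```cpp
#include <string>

// Capture list and body only: the parameter list is omitted because it is empty.
int NoParameterList() {
    auto answer = [] { return 42; };
    return answer();
}

// Capture list, parameters and body; the return type is deduced as int.
int DeducedReturn() {
    auto add = [](int lhs, int rhs) { return lhs + rhs; };
    return add(2, 3);
}

// All four parts: capture, parameters, explicit return type, body.
std::string AllFourParts() {
    std::string prefix = "id-";
    auto label = [&prefix](int id) -> std::string { return prefix + std::to_string(id); };
    return label(7);
}
```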

The Capture Statement


The capture statement is used to capture variables that are not local to the body of the lambda. You can capture variables either by value or by reference, and you can also capture all variables available in a scope by value or by reference. Again, the syntax is a bit odd:

auto push_back_c_v1_byref = [&v1](int i) { v1.push_back(i); };
auto push_back_c_all_byref = [&](int i) { v1.push_back(i); };
auto print_c_v1_byval = [v1]() { for each(auto x in v1) { std::cout<<x<<std::endl; } };
auto print_c_all_byval = [=]() { for each(auto x in v1) { std::cout<<x<<std::endl; } };
auto print_c_all_byval_v2_byref = [=,&v2]() { for each(auto x in v1) { std::cout<<x<<std::endl; } };
push_back_c_v1_byref(11);
push_back_c_all_byref(11);
print_c_v1_byval();
print_c_all_byval();

The first two capture statements are reference captures. The first one, push_back_c_v1_byref, captures only the v1 vector by reference and nothing else. Since we're capturing v1 by reference we can manipulate v1 in various ways, such as (in this case) inserting an item into the vector. The second capture statement is known as a "default capture by reference" and makes everything in scope available by reference. Thus we're able to manipulate v1 without having to explicitly capture it. This might look like an efficiency concern, but a default capture only captures the variables the lambda body actually uses, so nothing is captured for objects the lambda never touches.

The second two captures show by-value capturing. The first one, print_c_v1_byval, takes a copy of v1, which does result in a copy constructor invocation. As such, for something like this print method, it's not terribly efficient. For small value types, or for objects you want to ensure the lambda doesn't modify, capturing by value can be an advantage. The second capture uses the "default capture by value" and, much like the "default capture by reference", it only copies the variables the lambda body actually uses. We'll get into the for-each syntax later, but it's a Visual Studio 2010 (and VS11) extension to the language. The C++11 version is for(T v : container).

The last capture statement is an interesting one. We declare the default capture to be by-value, but then explicitly state that we wish to capture v2 by reference. The reverse is also possible: you can capture everything by reference except those items you explicitly designate to be by-value.
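One way to convince yourself of the difference between the capture modes is to change the original after the lambda has been created and see which version notices. The sketch below does that using the standard range-based for rather than the Visual Studio for each extension; all function names are made up for the example.

```cpp
#include <cstddef>
#include <vector>

// The by-value lambda copies v1 when it is created, so it never sees the
// element pushed afterwards.
std::size_t SizeSeenByValueCapture() {
    std::vector<int> v1(3, 0);
    auto size_by_value = [v1]() { return v1.size(); };
    v1.push_back(4);
    return size_by_value();
}

// The by-reference lambda looks at the original vector when it is called.
std::size_t SizeSeenByReferenceCapture() {
    std::vector<int> v1(3, 0);
    auto size_by_ref = [&v1]() { return v1.size(); };
    v1.push_back(4);
    return size_by_ref();
}

// Default capture by value, with v2 explicitly captured by reference:
// v1 is read from a copy while v2 itself is modified.
std::size_t MixedCapture() {
    std::vector<int> v1(3, 0);
    std::vector<int> v2;
    auto copy_into_v2 = [=, &v2]() {
        for (int x : v1) { v2.push_back(x); }
    };
    copy_into_v2();
    return v2.size();
}
```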

The Return Type


The return type is an optional component of the lambda declaration (as you may have noticed in the examples above, I've omitted the return type for all of them). The return type is only mandatory when the type cannot be inferred from the body of the lambda, or when the inferred type is not the type you want the lambda to return. Here's a simple example:
auto print_hello_cstr = []() { return "hello"; };
auto print_hello_stdstr = []() -> std::string { return "hello"; };

std::cout<<print_hello_cstr()<<std::endl;
std::cout<<print_hello_stdstr()<<std::endl;
Now, clearly the first one returns a char const*, as that's what a string literal decays to when it is returned. The second one, however, returns a std::string, invoking the appropriate constructor of the string type. There are other cases, though, where you will find that you have to explicitly specify the result type:
auto return_something = [](int a) { if(a > 40) return 4.5f; return 3; };

This lambda poses a bit of a problem; in fact there are two problems here. The first is that the return type is not clear (it could be float or int), and the conditional does nothing to clarify it. The second is that simply changing the literal integer to a literal float doesn't solve the issue, because the standard requires you to specify the return type if the body of the lambda is more than just a single return statement. We can fix this in one of two ways. In the first we reduce the body to a single return statement:
auto return_something = [](int a) { return (a > 40) ? 4.5f : 3; };
While in the second method we simply indicate the expected return type:
auto return_something = [](int a) -> float { if(a > 40) return 4.5f; return 3; };
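If you want to check what the compiler actually deduced, decltype together with std::is_same from <type_traits> will tell you at compile time. This is just a sketch; the lambda names here are my own and chosen so they don't clash with the ones above.

```cpp
#include <string>
#include <type_traits>

auto hello_cstr = []() { return "hello"; };
auto hello_stdstr = []() -> std::string { return "hello"; };
auto conditional = [](int a) { return (a > 40) ? 4.5f : 3; };
auto explicit_float = [](int a) -> float { if (a > 40) return 4.5f; return 3; };

// The string literal decays to a pointer when the return type is deduced.
static_assert(std::is_same<decltype(hello_cstr()), char const*>::value,
              "deduced from the literal");
static_assert(std::is_same<decltype(hello_stdstr()), std::string>::value,
              "taken from the explicit return type");
// The conditional operator converts both branches to a common type: float.
static_assert(std::is_same<decltype(conditional(0)), float>::value,
              "common type of 4.5f and 3");
static_assert(std::is_same<decltype(explicit_float(0)), float>::value,
              "taken from the explicit return type");
```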

Applying Our Knowledge


We've gone through all of this effort to reach this point. Going back to our original piece of code, which we've updated with auto and decltype, we end up with the following:
#include <iostream>
#include <utility>
#include <vector>

template<class Sequence1, class Sequence2, class MatchBinaryFunctor>
auto find_first_pair(Sequence1 const& seq1, Sequence2 const& seq2, MatchBinaryFunctor match) -> decltype(std::make_pair(seq1.end(), seq2.end()))
{
for(auto itor1 = seq1.begin(); itor1 != seq1.end(); ++itor1) {
for(auto itor2 = seq2.begin(); itor2 != seq2.end(); ++itor2) {
if(match(*itor1, *itor2)) {
return std::make_pair(itor1, itor2);
}
}
}

return std::make_pair(seq1.end(), seq2.end());
}

bool is_equal_and_odd(int lhs, int rhs) {
return lhs == rhs && lhs % 2 != 0;
}

int main() {
int v1ints[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int v2ints[] = {13, 4, 5, 7};

std::vector<int> v1(v1ints, v1ints + 10);
std::vector<int> v2(v2ints, v2ints + 4);

auto p = find_first_pair(v1, v2, is_equal_and_odd);

std::cout<<*p.first<<":"<<*p.second<<std::endl;
}

The only real outstanding issue, at the moment, is the function is_equal_and_odd. This function is exactly the kind of thing lambdas were designed to help eliminate. Thus if we apply our knowledge of lambdas we can come up with a quick replacement to eliminate that entire function, inlining it into the function that's doing the work:
int main() {
int v1ints[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
int v2ints[] = {13, 4, 5, 7};

std::vector<int> v1(v1ints, v1ints + 10);
std::vector<int> v2(v2ints, v2ints + 4);

auto p = find_first_pair(v1, v2, [](int lhs, int rhs) { return lhs == rhs && lhs % 2 != 0; });

std::cout<<*p.first<<":"<<*p.second<<std::endl;
}

Which is, in my opinion, quite a bit cleaner and simpler than having a separate function definition hanging around.
Washu
Working with templates is annoying. Working with return types deduced from template arguments is even more annoying. Things quickly tend to spiral out of control, frequently requiring the introduction of typedefs just to make the code marginally more readable. More importantly, though, it's very difficult to get the TYPE of an expression out of an expression without deducing it statically. With the introduction of auto this isn't actually much of a problem for variables. The problem comes in, though, when you wish to declare the return type of a function dependent upon the result of an expression.

Motivational Studies



This provided one of the main motivating factors for the introduction of the decltype operator in C++11. Another reason for the existence of decltype is to support perfect forwarding, which is a necessity for rvalue references and move semantics. It is also useful in other ways, though. A very trivial example of the problems with forwarding without decltype is as follows:

int foo(int const& i) { return i; }
float const& foo(float const& f) { return f; }

template<class T>
T do_something(T const& t) {
return foo(t);
}

The problem here is simple: do_something isn't forwarding the return type properly. For the float version of foo, do_something should be returning a constant reference to a float, whereas it simply returns a float. You cannot solve this by changing the return type to a constant reference either, because then you're returning a reference to an unnamed temporary when you call the integer version of foo.
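The mismatch can be made visible with a couple of compile-time checks. The sketch below repeats foo and do_something from above so it compiles on its own; WrapperCopies is a helper name I made up for the illustration.

```cpp
#include <type_traits>

int foo(int const& i) { return i; }
float const& foo(float const& f) { return f; }

template<class T>
T do_something(T const& t) {
    return foo(t);
}

// foo hands back a reference for floats, but the wrapper returns a plain float.
static_assert(std::is_same<decltype(foo(1.0f)), float const&>::value,
              "foo(float) returns a reference");
static_assert(std::is_same<decltype(do_something(1.0f)), float>::value,
              "the wrapper returns a copy");

// Returns true when the wrapper hands back a different object than foo does.
bool WrapperCopies() {
    float value = 1.0f;
    float const& direct = foo(value);
    float const& wrapped = do_something(value); // binds to a temporary copy
    return &direct == &value && &wrapped != &value;
}
```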

The question thus becomes: How can we write a do_something such that it returns the exact same type as the function it forwards to? Well, to start off with, we need to change up the format of the function a wee bit. The biggest issue we have at the moment is that the return value is clearly going to be determined by the invocation of foo against the parameter passed to do_something, however at the point that the return type is specified we do not even yet KNOW of the parameter to do_something:

template<class T>
T const& do_something(T const& t) {
^                     ^--- We do not know about this
|----------------- At this point


So we clearly need some way to re-arrange it so that we can know about things like the parameters to the function. Here enters the new result type specification, which the standard calls a trailing return type (earlier drafts called it a late-specified return type)...

template<class T>
UNKNOWN do_something(T const& t) -> T const& {

Clearly, though, we're still missing something, as we need to specify a placeholder for the return type. Thankfully we have such a something; it's called auto.


template<typename T>
auto do_something(T const& t) -> T {

Ok, so we've gone to all this work of re-arranging our return type so that we can know about the function parameters... we still don't yet have a way to figure out what the return type should actually be, and now we bring in decltype.

The decltype operator returns the exact type of an expression. It's important to understand that the type returned is a fully qualified type, down to cv-qualifiers. Much like the sizeof operator, the expression passed to the decltype operator is not evaluated except statically to determine its resulting type. This is very useful, because you can pass parameters to get the type you require out of an expression without having to worry about ensuring that the parameters are actually valid. An example would be passing in null pointers to a function which expects non-null pointers. Thus we can now redefine do_something as such:

template<typename T>
auto do_something(T const& t) -> decltype(foo(t)) {
    return foo(t);
}

Which now gives the appropriate perfect forwarding, as expected. More importantly, functions that return references or constant references will appropriately forward their return types.
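
As a quick sanity check of that claim, here is a minimal sketch that repeats the two foo overloads from above and uses static_assert to confirm what do_something returns for each. The use of std::is_same from <type_traits> is my own addition for illustration, not something the original example relied on.

```cpp
#include <type_traits>

int foo(int const& i) { return i; }
float const& foo(float const& f) { return f; }

// Same definition as above: the return type is whatever foo(t) yields.
template<typename T>
auto do_something(T const& t) -> decltype(foo(t)) {
    return foo(t);
}

// decltype does not evaluate its operand, so nothing is called here.
static_assert(std::is_same<decltype(do_something(1)), int>::value,
              "the int overload is forwarded by value");
static_assert(std::is_same<decltype(do_something(1.0f)), float const&>::value,
              "the float overload is forwarded as a const reference");
```

If either forwarding case were wrong, the file would simply fail to compile, which is a handy property when experimenting with return types.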

Building up to a Cleaner Algorithm



Given what we know now about decltype we can go back to our algorithm from the previous post. You may remember we had just finished cleaning it up, and it looked much like so:

template<typename Sequence1, typename Sequence2, typename MatchBinaryFunctor>
auto find_first_pair(Sequence1 const& seq1, Sequence2 const& seq2, MatchBinaryFunctor match) -> std::pair<typename Sequence1::const_iterator, typename Sequence2::const_iterator>
{
    for(auto itor1 = seq1.begin(); itor1 != seq1.end(); ++itor1) {
        for(auto itor2 = seq2.begin(); itor2 != seq2.end(); ++itor2) {
            if(match(*itor1, *itor2)) {
                return std::make_pair(itor1, itor2);
            }
        }
    }

    return std::make_pair(seq1.end(), seq2.end());
}

Auto has clearly made a big difference, and it is much easier to read, however we still have the small problem of that return type, which is ridiculously long and rather hard to actually read. Not to mention, we're hard coding the iterator type we're returning, something that's usually not very useful. With the trivial application of decltype we can eliminate a large chunk of unreadable code by reducing it down to a simple expression:

template<typename Sequence1, typename Sequence2, typename MatchBinaryFunctor>
auto find_first_pair(Sequence1 const& seq1, Sequence2 const& seq2, MatchBinaryFunctor match) -> decltype(std::make_pair(seq1.end(), seq2.end()))
{
    for(auto itor1 = seq1.begin(); itor1 != seq1.end(); ++itor1) {
        for(auto itor2 = seq2.begin(); itor2 != seq2.end(); ++itor2) {
            if(match(*itor1, *itor2)) {
                return std::make_pair(itor1, itor2);
            }
        }
    }

    return std::make_pair(seq1.end(), seq2.end());
}


It is important to note that decltype is not just restricted to return types. In fact, decltype can be used to declare variables, although it generally makes more sense to use auto... unless you need the exact result type of the expression. More on that when we get to rvalue-references though.
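
To make the variable case and the "not evaluated" point a little more concrete, here is a small sketch. The function measure and the helper last_element_or_zero are hypothetical names I made up for the illustration; measure is deliberately declared without a definition to show that decltype never calls it.

```cpp
#include <type_traits>
#include <vector>

// Hypothetical function: declared, never defined, and never called.
double measure(int const* must_not_be_null);

// decltype only inspects the expression, so passing nullptr is harmless.
typedef decltype(measure(nullptr)) measure_result;
static_assert(std::is_same<measure_result, double>::value,
              "measure returns double");

// Hypothetical helper contrasting auto and decltype for local variables.
inline int last_element_or_zero(std::vector<int> const& values) {
    auto copy = values;               // std::vector<int>: const and & are dropped
    decltype(values) alias = values;  // std::vector<int> const&: kept exactly
    static_assert(std::is_same<decltype(copy), std::vector<int> >::value,
                  "auto makes a copy");
    static_assert(std::is_same<decltype(alias), std::vector<int> const&>::value,
                  "decltype keeps the declared type");
    return copy.empty() ? 0 : alias.back();
}
```

The difference between copy and alias is the "exact result type" distinction mentioned above: auto deduces a fresh value type, while decltype hands back the declared type unchanged.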
Washu

The New C++, Part 1 - auto

In this short series of entries I'm going to cover some of the new things that have been introduced into the C++ programming language with the latest standard. I will not be covering everything, as much of the new functionality has yet to be adopted or implemented in existing compilers, or are things that you probably shouldn't be playing with unless you have a lot of experience with the language, such as variadic templates. This first entry is going to focus on one of the more useful things added to the new C++ standard, auto. It will also give you a preview of the new function definition syntax, which will come in handy in future episodes.

How We Got Here



It has been nearly 14 years since C++ was first standardized. That makes the C++ standard pretty young and at the same time pretty old. Especially when you consider that, from a language perspective, the core standard has not changed any in those 14 years.
There have been a couple of updates to the standard through those 14 years, but they were all optional updates. If you wished to implement a compliant C++ compiler, you needed to only implement the standard as described in the original document. Unfortunately, that's not as easy as it sounds. The C++ language is a language of corner cases, and it is actually not hard to see why when you grasp one of the underlying thought concepts of the C++ language committee: Reserved words are bad.

When you look through the C++ language you find that there are approximately 73 reserved words, and if you look closer you will see that many of those reserved words get reused in various contexts throughout the standard. For a quick and simple example, the "class" keyword changes its behavior depending on if it's used in a template declaration or to define/declare a type. That, of course, is not the only reason the language is hard to parse, but it is part of the reason.

There have been several updates to the standard that have clarified various parts of it, and the C++ Technical Report 1 (TR1) added a whole bunch of new libraries to the C++ standard library, including tuple types, array, various hashed containers, regular expressions, a whole slew of mathematical functions, and a whole bunch of new random number facilities. Of course, since these were library extensions none of them were mandatory, but some compilers attempted to implement most or all of TR1.

Working with an Example



For the purposes of this entry we'll be working with some basic code, and implementing various bits and pieces using the new C++11 standard. I'll be using the Visual Studio 2010 compiler for this, although GCC should also work. Clang will work with most of the examples, however they do not yet have lambda support in, so the final pieces on lambdas will not compile in clang. Yet.


#include <iostream>
#include <utility>
#include <vector>

template<typename Sequence1, typename Sequence2, typename MatchBinaryFunctor>
std::pair<typename Sequence1::const_iterator, typename Sequence2::const_iterator> find_first_pair(Sequence1 const& seq1, Sequence2 const& seq2, MatchBinaryFunctor match)
{
    for(typename Sequence1::const_iterator itor1 = seq1.begin(); itor1 != seq1.end(); ++itor1) {
        for(typename Sequence2::const_iterator itor2 = seq2.begin(); itor2 != seq2.end(); ++itor2) {
            if(match(*itor1, *itor2)) {
                return std::make_pair(itor1, itor2);
            }
        }
    }

    return std::make_pair(seq1.end(), seq2.end());
}

bool is_equal_and_odd(int lhs, int rhs) {
    return lhs == rhs && lhs % 2 != 0;
}

int main() {
    int v1ints[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int v2ints[] = {13, 4, 5, 7};

    std::vector<int> v1(v1ints, v1ints + 10);
    std::vector<int> v2(v2ints, v2ints + 4);

    std::pair<std::vector<int>::const_iterator, std::vector<int>::const_iterator> p = find_first_pair(v1, v2, is_equal_and_odd);

    std::cout<<*p.first<<":"<<*p.second<<std::endl;
}

This is a fairly simple piece of code that really does only one thing: It searches two containers passed to it for the first entries that the match functor returns true on, and then returns a pair of iterators to those two elements. Otherwise it returns iterators to the end of each container.
If you run this piece of code you get the output "5:5", as that's clearly the first entries that match our function is_equal_and_odd.

Auto - The Formerly Most Useless Keyword Ever



Let's talk a bit about auto. Auto, prior to C++11, was one of those keywords you never saw, and for good reason too. A variable declared auto had "automatic storage duration." Interestingly enough though, local variables NOT declared static or extern ALSO had automatic storage duration. Which meant that the difference between "auto int a;" and "int a;" was literally nothing. So why did it exist? Because it was in C, and C++ inherited it from C.

When it came time for work to start on C++11 one of the things desired was to add static type inference. Type inference is the ability for the TYPE of a variable to be figured out based on its context. Static type inference is static, thus the type of the variable doesn't change and it can be computed at compile time. Thus auto was repurposed to be used in this manner. Of course, when you now declare an auto variable you no longer provide a type, but you must provide an initializer expression whose type is known at compile time. Thus you can now say auto a = 0; whose type will be int (as that is the type of the literal 0). Likewise, auto str = std::string("Hello world") clearly declares a variable of type std::string.
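
A few declarations make the deduction rules easier to see. This is only a sketch: the auto_demo namespace and the variable names other than a and str are mine, and the static_assert checks lean on std::is_same from <type_traits> purely to show which type the compiler picked.

```cpp
#include <string>
#include <type_traits>

namespace auto_demo {  // hypothetical namespace, only to keep the names tidy

auto a = 0;                             // int, the type of the literal 0
auto ratio = 0.5;                       // double
auto str = std::string("Hello world");  // std::string

int const limit = 10;
auto copy_of_limit = limit;             // int: auto drops the top-level const

}  // namespace auto_demo

static_assert(std::is_same<decltype(auto_demo::a), int>::value, "a is an int");
static_assert(std::is_same<decltype(auto_demo::ratio), double>::value, "ratio is a double");
static_assert(std::is_same<decltype(auto_demo::str), std::string>::value, "str is a std::string");
static_assert(std::is_same<decltype(auto_demo::copy_of_limit), int>::value, "const is not deduced");
```

The last line is the one that tends to surprise people: the initializer is const, but the deduced variable is a plain, modifiable int.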

Unfortunately, you cannot use auto everywhere you might like to. Function parameters cannot be auto; templates are for that. Furthermore, when we get into them, lambda parameters also cannot be declared auto. That last one there is perhaps one of my biggest issues with the new standard.

With that bit of knowledge under our belt and the above code to work with we can clearly see several areas where just some small changes using auto can make the code significantly simpler and easier to read. If we replace the iterators in the for loops, and the declaration of p in the main method with auto we get the following piece of code, which I'm sure you'll agree is much simpler:

#include <iostream>
#include <utility>
#include <vector>

template<typename Sequence1, typename Sequence2, typename MatchBinaryFunctor>
std::pair<typename Sequence1::const_iterator, typename Sequence2::const_iterator> find_first_pair(Sequence1 const& seq1, Sequence2 const& seq2, MatchBinaryFunctor match)
{
    for(auto itor1 = seq1.begin(); itor1 != seq1.end(); ++itor1) {
        for(auto itor2 = seq2.begin(); itor2 != seq2.end(); ++itor2) {
            if(match(*itor1, *itor2)) {
                return std::make_pair(itor1, itor2);
            }
        }
    }

    return std::make_pair(seq1.end(), seq2.end());
}

bool is_equal_and_odd(int lhs, int rhs) {
    return lhs == rhs && lhs % 2 != 0;
}

int main() {
    int v1ints[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int v2ints[] = {13, 4, 5, 7};

    std::vector<int> v1(v1ints, v1ints + 10);
    std::vector<int> v2(v2ints, v2ints + 4);

    auto p = find_first_pair(v1, v2, is_equal_and_odd);

    std::cout<<*p.first<<":"<<*p.second<<std::endl;
}

It is important to note that auto does not automatically handle referencing. If you desire a reference auto type, you must specify that you desire a reference. As a trivial example:

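auto itor = seq1.begin(); // by value: itor is its own copy of the iterator.
auto const& itor_ref = seq1.begin(); // by reference: const is needed here, because begin() returns a temporary.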
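
The practical consequence shows up as soon as you write through the variable. Here is a minimal sketch using a vector element rather than an iterator; the function names are hypothetical and both functions assume a non-empty vector.

```cpp
#include <cassert>
#include <vector>

// Writes 42 through a plain auto variable; the vector itself is untouched.
// Assumes v is not empty.
inline int write_through_copy(std::vector<int> v) {
    auto first = v.front();   // int: a copy of the element
    first = 42;
    return v.front();
}

// Writes 42 through an auto& variable; the vector element changes.
// Assumes v is not empty.
inline int write_through_reference(std::vector<int> v) {
    auto& first = v.front();  // int&: an alias for the element
    first = 42;
    return v.front();
}

// Small self-check of the two behaviours.
inline void check_auto_reference() {
    std::vector<int> ones(3, 1);
    assert(write_through_copy(ones) == 1);
    assert(write_through_reference(ones) == 42);
}
```

Forgetting the & is an easy way to end up modifying a copy by accident, so it is worth being deliberate about which one you want.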

A Newly Functional Way

Of course, reading this over you can see that the code is still pretty hard to read. I mean, where does the return type end and the function name start? It's quite far in, which makes it not so easy for the brain to parse, eh? One of the new tidbits that came with C++11 was a little thing called decltype, which we'll get into later. As part of decltype, though, came something else quite useful: a new style of function declaration. For the first time you can now declare the return type of the function AFTER the parameter portion of the function. Of course, you must still provide something to prefix the function with... hello auto. The syntax is a bit wonky though, I must warn you.

template<typename Sequence1, typename Sequence2, typename MatchBinaryFunctor>
auto find_first_pair(Sequence1 const& seq1, Sequence2 const& seq2, MatchBinaryFunctor match) -> std::pair<typename Sequence1::const_iterator, typename Sequence2::const_iterator>
{
    for(auto itor1 = seq1.begin(); itor1 != seq1.end(); ++itor1) {
        for(auto itor2 = seq2.begin(); itor2 != seq2.end(); ++itor2) {
            if(match(*itor1, *itor2)) {
                return std::make_pair(itor1, itor2);
            }
        }
    }

    return std::make_pair(seq1.end(), seq2.end());
}

This is, surprisingly, a lot easier to read. You can immediately find the return type (everything after the arrow), and you can find the function declaration as well. What would really make this a lot nicer is if we could somehow get rid of that horrible template mess in the return type. Ahh well, more on that later.
Washu
Have you seen the new UI for the next version of Visual Studio 11? No? Go take a look, then come back.

Terrible isn't it? Makes you wonder just what the heck they were thinking when they were designing the UI.

The thing is, it's not that they had a bad idea. Cleaning up the UI, making it more user friendly and removing a lot of the wasted space are all things that I can appreciate and support. It's the removal of all the colors and other similar indicators that I object to. Things like tabs no longer having separators between them, thus not allowing you to easily distinguish between where one inactive tab ends and another inactive tab starts.

But before we get into that, let's go into a little Visual Studio history...

Visual C++ 6



This was perhaps the version of Visual Studio (then called Visual C++, since they hadn't integrated the UIs for Visual Basic and Visual C++ together) that haunts them still. It is not that Visual C++ 6 was a bad product, quite the opposite. It was a great product that brought C++ to a whole generation of people. But it wasn't quite the C++ most other people were used to.

Visual C++ 6 came out in 1997, just one year before the C++ standard was finalized. During those intervening months several changes to the C++ standard drafts occurred which altered the C++ language in ways that Visual C++ 6 was incompatible with. This ended up producing a compiler with two rather glaring issues (amongst many others) that we can still see the effects of today.

For Loop Scoping


You might have seen this particular snippet of code somewhere and wondered what the purpose was:
#define for if(0) {} else for
Well, the purpose was to fix an issue in Visual C++ 6, specifically that, given the following piece of code:

for(int i = 0; i < 10; ++i)
    std::cout<<i<<std::endl;
for(i = 0; i < 10; ++i)
    std::cout<<i<<std::endl;


In Visual C++ 6 this is valid code, because the first for loop introduces the index variable into the scope that houses the for loop, rather than restricting it to the body of the for loop. However, in the C++ standard, the decision was made that the variables introduced in the for loop declaration would be scoped to the for loop itself. The aforementioned macro fixed this issue in Visual C++ 6 by making the else branch the enclosing scope of the for loop, since the else branch is itself a scope.
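
For anyone who never had to deal with this, the following sketch shows both sides on a conforming compiler. The function names are hypothetical, and the second function writes the macro's expansion out by hand rather than actually redefining for, which is not something you would want to do in modern code.

```cpp
#include <cassert>

// Standard scoping: each loop declares its own i, so reusing the name is fine.
// Under the VC++6 rule the second "int i" would have been a redefinition.
inline int sum_with_standard_scoping() {
    int total = 0;
    for (int i = 0; i < 10; ++i) total += i;
    for (int i = 0; i < 10; ++i) total += i;
    return total;
}

// What the macro expands each loop to: the for statement lives inside the
// else branch, so its variable cannot leak past the end of that branch.
inline int sum_with_expanded_macro() {
    int total = 0;
    if (0) {} else for (int i = 0; i < 10; ++i) total += i;
    if (0) {} else for (int i = 0; i < 10; ++i) total += i;
    return total;
}

// Both variants add 0..9 twice.
inline void check_for_scoping() {
    assert(sum_with_standard_scoping() == 90);
    assert(sum_with_expanded_macro() == 90);
}
```

The if(0) {} branch never runs; it exists only so that the for loop can be hung off the else.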

The C++ Standard Library



The other major issue Visual C++ 6 had was in its standard library, which was terrible. It was buggy, only partly complete, and had a bunch of extra headers that were not part of C++, iostream.h for instance. The result was that programmers who learned and grew up on VC++6 grew to fear the standard library. This was later addressed through the introduction of STLPort, but that didn't really solve the issue, as you had to actually download and set up STLPort before you could use it. The result was a whole generation of programmers who think the standard library is a buggy, chunky, and slow set of libraries. None of which is actually true in today's world.

Visual Studio 2003



This came out nearly six years after VC++6, and was a giant leap forward for the Windows development world. While the open source world had GCC, which had continued to mature, VC++6 had remained pretty much stagnant through its entire six-year life span.

Visual Studio 2003 introduced a whole new version of the standard library, which was both significantly more complete and a lot less buggy. You no longer needed to install STLPort just to have a decent standard library that would work. It also fixed the for loop scoping issue, while maintaining a compiler flag to allow VC++6 code that relied on the scoping bug to continue building (albeit with warnings).

VS2003 also had several other compiler enhancements, which allowed you to actually reliably use template meta-programming in your code. Something VC++6 had always had issues with. It wasn't perfect, but it was a giant leap forward for Microsoft.

Visual Studio 2005 and 2008



Visual Studio ended up on a much shorter release schedule than it had been before, and as such its improvement was accelerated. Visual Studio 2005 and 2008 introduced compilers that were significantly more standards compliant and powerful, along with support for newer instruction sets (SSE3) and optimization techniques (POGO).

Visual Studio 2008 also came with an implementation of some of the C++ TR1 headers, such as unordered_map. It also had compiler intrinsics for the use of type traits.

Visual Studio 2010



Visual Studio 2010 was a leap forward, and backward, at the same time. It introduced preliminary support for C++11 features, such as decltype, lambdas, and auto. It came with a brand new text editor and parsing system, which allowed it to parse your C++ code in real time and provide diagnostics without compiling. However, at the same time, it ended up being significantly slower. The text editor was chunky, and even today you can still out-type it in large projects.

But things were looking good for it, with the introduction of C++11 features, it showed that Microsoft understood the desire of C++ programmers to be able to focus on modern features of C++ to write safe fast C++ code.

Visual Studio 11



As is Microsoft's habit, they've opted to rebrand Visual Studio again, going back to using the version number instead of the year. Exactly why this is, who knows.

Initially, from a C++ viewpoint, VS11 looked great. Herb Sutter, who sits on the C++ standard committee, was an architect on it. Microsoft was hyping up the C++ renaissance, and how VS11 was going to provide a whole new world of C++. Basically, things were looking good.

Then they released this post, which showed how much of C++11 they were actually implementing. Furthermore, the developer preview focused mainly on a "new" version of C++ they were showcasing, specifically C++/CX. Exactly where was this C++ revolution, the C++ programmers asked, and the response from Microsoft was, to paraphrase... "If you wanted more C++ support, you should have said so."

From a customer relations perspective, that was a terrible mistake. The backlash was quite substantial, as you can see from the comments on that post. The assumption had been, based on Microsoft's touting of a C++ renaissance, that they were actually focusing on C++11. Not some bastardized version of C++/CLI.

That spawned a couple of petitions, such as these two on the Visual Studio feature request site. Both of which I strongly suggest you vote for:
Speed up work on VC++
Support C++11 features

Then they recently released a nice little blog post detailing all the fancy changes they've made to the UI. Oh boy.

If you look at the changes then a few major things will be apparent: Firstly, they've removed pretty much ALL COLOR from the UI. Which is terrible, because it's hard to tell where one thing ends and another starts. Secondly, for all their claims about "reclaiming wasted space", this pair of screenshots pretty much says it all:
[Screenshots: dev10projprops (Visual Studio 2010) and dev11projprops (Visual Studio 11)]
Yep, sure saved a bunch of wasted space there now didn't we. Oh, wait.

Then we get to the things that really start to tick me off, they've added a quick search for visual studio commands. "Sounds great," you say, "no more ctrl+f and then search my source code and then find the search window and look through it... wait, quick search for visual studio commands? What the FUCK?" Indeed:
[Screenshot: dev11quicklaunch]
and it really only goes downhill from there: Notice on the left, you can't really tell where one inactive tab ends and the other starts. Also notice that the names of the tabs are all upper case, because, you know, we've learned that people can read mostly uppercase words very easily. Oh, wait. No we haven't. In fact there are whole studies on how mixed casing helps reading comprehension.

Then you get to something like the source navigation pane (the drop downs right above the source file):
[Screenshot: dev11toolbars]
Look at that, you can't even tell if it's a class, a structure, or a namespace anymore without staring at the icon going "wait..what the hell is that?" for a second. Whereas before it was a simple matter of identifying the color locality. Worse yet, the class viewer is most likely the same, which means you can't tell a method from a field from a property from a class from a structure from a namespace WITHOUT STARING AT THE ICON, since there are no COLOR CUES (hint: field was light blue, method was magenta, class had lines between squares, structure was just three squares).