Silverlan

Member Since 02 May 2013
Offline Last Active Aug 19 2016 12:50 PM

#5291693 Vulkan render-call performance drain

Posted by on 15 May 2016 - 08:27 AM

My Vulkan program is running extremely slowly, and I'm trying to figure out why. I've noticed that even a few draw calls cost far more performance than they should.

For instance, here's an extract (pseudocode) for rendering a few meshes:

int32_t numCalls = 0;
int32_t numIndices = 0;
for(auto &mesh : meshes)
{
	auto vertexBuffer = mesh.GetVertexBuffer();
	auto indexBuffer = mesh.GetIndexBuffer();

	vk::DeviceSize offset = 0;
	drawCmd.bindVertexBuffers(0,1,&vertexBuffer,&offset); // drawCmd = CommandBuffer for all drawing commands (single thread)
	drawCmd.bindIndexBuffer(indexBuffer,offset,vk::IndexType::eUint16);

	drawCmd.drawIndexed(mesh.GetIndexCount(),1,0,0,0);

	numIndices += mesh.GetIndexCount();
	++numCalls;
}

There are 238 meshes being rendered, with a total vertex index count of 52050. The GPU is definitely not overburdened (The shaders are extremely cheap).

If I run my program with the code above, the frame takes approximately 46ms to render. Without it, it takes a mere 9ms.

I'm using FIFO present mode with 2 swapchain images. There is only a primary command buffer at this time (no secondary or pre-recorded command buffers), and the same buffer is used for all frames.

 

My problem is, I don't really know what to look for. These few rendering calls should barely make a dent, so the source of the problem must be somewhere else.

Can anyone give me any hints on how I should tackle this? Are there any profilers around for Vulkan already?

I just need a nudge in the right direction.

 

// EDIT:

So, it looks like vkDeviceWaitIdle takes about 32ms to execute, if all 238 meshes are rendered. (If none are rendered, it's < 1ms).

Most of the stalling stems from there, but I still don't know what to do about it.
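
A per-call measurement like the one above can be taken with a small CPU-side timer. The following is only a sketch: `measureMs` is a made-up helper (not part of Vulkan), and it measures wall-clock time on the CPU around a call, so it shows how long the CPU waits, not how long the GPU actually works.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs fn once and returns the elapsed
// wall-clock time in milliseconds (CPU side only).
double measureMs(const std::function<void()> &fn)
{
	auto start = std::chrono::steady_clock::now();
	fn();
	auto end = std::chrono::steady_clock::now();
	return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Wrapping the wait-idle call (or the submit and present calls) in a lambda passed to `measureMs` gives one number per call site, which at least shows which call the CPU is blocked in.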




#5282813 [Vulkan] Descriptor binding point confusion / Uniform buffer memory barriers...

Posted by on 23 March 2016 - 01:02 AM

I'm still struggling with compressed images.

Here's what the specification says about that:

 

Compressed texture images stored using the S3TC compressed image formats are represented as a collection of 4×4 texel blocks, where each block contains 64 or 128 bits of texel data. The image is encoded as a normal 2D raster image in which each 4×4 block is treated as a single pixel.

Source: https://www.khronos.org/registry/dataformat/specs/1.1/dataformat.1.1.html#S3TC

 

 

For images created with linear tiling, rowPitch, arrayPitch and depthPitch describe the layout of the subresource in linear memory. For uncompressed formats, rowPitch is the number of bytes between texels with the same x coordinate in adjacent rows (y coordinates differ by one). arrayPitch is the number of bytes between texels with the same x and y coordinate in adjacent array layers of the image (array layer values differ by one). depthPitch is the number of bytes between texels with the same x and y coordinate in adjacent slices of a 3D image (z coordinates differ by one). Expressed as an addressing formula, the starting byte of a texel in the subresource has address:

// (x,y,z,layer) are in texel coordinates

address(x,y,z,layer) = layer*arrayPitch + z*depthPitch + y*rowPitch + x*texelSize + offset

For compressed formats, the rowPitch is the number of bytes between compressed blocks in adjacent rows. arrayPitch is the number of bytes between blocks in adjacent array layers. depthPitch is the number of bytes between blocks in adjacent slices of a 3D image.

// (x,y,z,layer) are in block coordinates

address(x,y,z,layer) = layer*arrayPitch + z*depthPitch + y*rowPitch + x*blockSize + offset;

arrayPitch is undefined for images that were not created as arrays. depthPitch is defined only for 3D images.

For color formats, the aspectMask member of VkImageSubresource must be VK_IMAGE_ASPECT_COLOR_BIT. For depth/stencil formats, aspect must be either VK_IMAGE_ASPECT_DEPTH_BIT or VK_IMAGE_ASPECT_STENCIL_BIT. On implementations that store depth and stencil aspects separately, querying each of these subresource layouts will return a different offset and size representing the region of memory used for that aspect. On implementations that store depth and stencil aspects interleaved, the same offset and size are returned and represent the interleaved memory allocation.

 

Source: https://www.khronos.org/registry/vulkan/specs/1.0/xhtml/vkspec.html#resources-images
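
The block addressing formula quoted above can be written out as a small helper. This is a sketch under assumptions: the function names are made up, the pitches and offset are expected to come from VkSubresourceLayout, and (per the quoted text) the formula describes images created with linear tiling.

```cpp
#include <cassert>
#include <cstdint>

// Number of blocks needed to cover `texels` texels along one axis.
// Rounds up; S3TC/BC formats use 4x4 texel blocks.
constexpr uint32_t blocksAlong(uint32_t texels, uint32_t blockDim = 4)
{
	return (texels + blockDim - 1) / blockDim;
}

// Byte address of block (x, y, z, layer), all in block coordinates:
// layer*arrayPitch + z*depthPitch + y*rowPitch + x*blockSize + offset
constexpr uint64_t blockAddress(uint32_t x, uint32_t y, uint32_t z, uint32_t layer,
                                uint64_t offset, uint64_t rowPitch, uint64_t depthPitch,
                                uint64_t arrayPitch, uint64_t blockSize)
{
	return layer * arrayPitch + z * depthPitch + y * rowPitch + x * blockSize + offset;
}
```

For example, a 256x256 BC1 image (8 bytes per block) has 64x64 blocks, so a tightly packed row would be 512 bytes; the rowPitch reported by the driver may be larger than that.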

 

I'm using GLI to load the dds-data (Which is supposed to work with Vulkan, but I've also tried other libraries).

Here's my code for loading and mapping the data:

struct dds load_dds(const char *fileName)
{
    auto tex = gli::load_dds(fileName);
    auto format = tex.format();
    VkFormat vkFormat = static_cast<VkFormat>(format);
    auto extents = tex.extent();
    auto r = dds {};
    r.texture = new gli::texture(tex);
    r.width = extents.x;
    r.height = extents.y;
    r.format = vkFormat;
    return r;
}
void map_data_dds(struct dds *r,void *imgData,VkSubresourceLayout layout)
{
    auto &tex = *static_cast<gli::texture*>(r->texture);
    gli::storage storage {tex.format(),tex.extent(),tex.layers(),tex.faces(),tex.levels()};

    auto *srcData = static_cast<uint8_t*>(tex.data(0,0,0));
    auto *destData = static_cast<uint8_t*>(imgData); // Pointer to mapped memory of VkImage
    destData += layout.offset; // layout = VkSubresourceLayout of the image subresource
    auto extents = tex.extent();
    auto w = extents.x;
    auto h = extents.y;
    auto blockSize = storage.block_size();
    auto blockCount = storage.block_count(0);
    //auto blockExtent = storage.block_extent();

    auto method = 0; // All methods have the same result
    if(method == 0)
    {
        for(auto y=decltype(blockCount.y){0};y<blockCount.y;++y)
        {
            auto *rowDest = destData +y *layout.rowPitch;
            auto *rowSrc = srcData +y *(blockCount.x *blockSize);
            for(auto x=decltype(blockCount.x){0};x<blockCount.x;++x)
            {
                auto *pxDest = rowDest +x *blockSize;
                auto *pxSrc = rowSrc +x *blockSize; // 4x4 image block
                memcpy(pxDest,pxSrc,blockSize); // blockSize bytes per block (64 bits for BC1, 128 bits for BC2/BC3)
                //memset(pxDest,128,blockSize); // Fill the block with a constant for testing
            }
        }
    }
    else if(method == 1)
        memcpy(destData,srcData,storage.size());
    else
    {
        memcpy(destData,tex.data(0,0,0),tex.size(0)); // Just one layer for now
        //destData += tex.size(0);
    }
}

Here's my code for initializing the texture (Which is 1:1 the same as the cube demo from the SDK, except for the dds-code):

static void demo_prepare_texture_image(struct demo *demo, const char *filename,
                                       struct texture_object *tex_obj,
                                       VkImageTiling tiling,
                                       VkImageUsageFlags usage,
                                       VkFlags required_props) {
    VkResult U_ASSERT_ONLY err;
    bool U_ASSERT_ONLY pass;
   /* const VkFormat tex_format = VK_FORMAT_R8G8B8A8_UNORM;
    int32_t tex_width;
    int32_t tex_height;
    if (!loadTexture(filename, NULL, NULL, &tex_width, &tex_height)) {
        printf("Failed to load textures\n");
        fflush(stdout);
        exit(1);
    }
    */
    tiling = VK_IMAGE_TILING_OPTIMAL;
    struct dds ddsData = load_dds("C:\\VulkanSDK\\1.0.5.0\\Demos\\x64\\Debug\\iron01.dds");

    VkFormat tex_format = ddsData.format;
    int32_t tex_width = ddsData.width;
    int32_t tex_height = ddsData.height;

    tex_obj->tex_width = tex_width;
    tex_obj->tex_height = tex_height;

    const VkImageCreateInfo image_create_info = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
        .pNext = NULL,
        .imageType = VK_IMAGE_TYPE_2D,
        .format = tex_format,
        .extent = {tex_width, tex_height, 1},
        .mipLevels = 1,
        .arrayLayers = 1,
        .samples = VK_SAMPLE_COUNT_1_BIT,
        .tiling = tiling,
        .usage = usage,
        .flags = 0,
        .initialLayout = VK_IMAGE_LAYOUT_PREINITIALIZED,
    };

    VkMemoryRequirements mem_reqs;

    err =
        vkCreateImage(demo->device, &image_create_info, NULL, &tex_obj->image);
    assert(!err);

    vkGetImageMemoryRequirements(demo->device, tex_obj->image, &mem_reqs);

    tex_obj->mem_alloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    tex_obj->mem_alloc.pNext = NULL;
    tex_obj->mem_alloc.allocationSize = mem_reqs.size;
    tex_obj->mem_alloc.memoryTypeIndex = 0;

    pass = memory_type_from_properties(demo, mem_reqs.memoryTypeBits,
                                       required_props,
                                       &tex_obj->mem_alloc.memoryTypeIndex);
    assert(pass);

    /* allocate memory */
    err = vkAllocateMemory(demo->device, &tex_obj->mem_alloc, NULL,
                           &(tex_obj->mem));
    assert(!err);

    /* bind memory */
    err = vkBindImageMemory(demo->device, tex_obj->image, tex_obj->mem, 0);
    assert(!err);

    if (required_props & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) {
        const VkImageSubresource subres = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .mipLevel = 0,
            .arrayLayer = 0,
        };
        VkSubresourceLayout layout;
        void *data;

        vkGetImageSubresourceLayout(demo->device, tex_obj->image, &subres,
                                    &layout);

        err = vkMapMemory(demo->device, tex_obj->mem, 0,
                          tex_obj->mem_alloc.allocationSize, 0, &data);
        assert(!err);

        // DDS
        map_data_dds(&ddsData,data,layout);
        //

       // if (!loadTexture(filename, data, &layout, &tex_width, &tex_height)) {
       //     fprintf(stderr, "Error loading texture: %s\n", filename);
        //}

        vkUnmapMemory(demo->device, tex_obj->mem);
    }

    tex_obj->imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    demo_set_image_layout(demo, tex_obj->image, VK_IMAGE_ASPECT_COLOR_BIT,
                          VK_IMAGE_LAYOUT_PREINITIALIZED, tex_obj->imageLayout,
                          VK_ACCESS_HOST_WRITE_BIT);
    /* setting the image layout does not reference the actual memory so no need
     * to add a mem ref */
}

I've uploaded the entire demo here. The only things I've changed from the cube demo from the Vulkan SDK are the functions above.

I've tried various different images, with different compressions (BC1/2/3), none of them work.

 

Examples:

#1:

i_view32_2016-03-18_14-35-51.png

turns into:

tri_2016-03-18_14-37-03.png

(Not the cube demo, but same principle)

 

#2:

metalbare2.png

turns into:

cube_2016-03-22_18-05-44.png

 

 

Any hints would be much appreciated.




#5257191 GLSL Error C1502 (Nvidia): "index must be constant expression"

Posted by on 14 October 2015 - 05:13 AM

I have a uniform block in my shader, which I'm accessing within a loop:

#version 330 core

const int MAX_LIGHTS = 8; // Maximum amount of lights
uniform int numLights; // Actual amount of lights (Cannot exceed MAX_LIGHTS)
layout (std140) uniform LightSourceBlock
{
    vec3 position;
    [...]
} LightSources[MAX_LIGHTS]; // Light Data

void Test()
{
    for(int i=0;i<numLights;i++)
    {
        vec3 pos = LightSources[i].position; // Causes "index must be constant expression" error on Nvidia cards
        [...]
    }
}

This works fine on my AMD card, but on an Nvidia card it generates the error "index must be constant expression".

I've tried changing the shader to this:

#version 330 core

const int MAX_LIGHTS = 8; // Maximum amount of lights
uniform int numLights; // Actual amount of lights (Cannot exceed MAX_LIGHTS)
layout (std140) uniform LightSourceBlock
{
    vec3 position;
    [...]
} LightSources[MAX_LIGHTS]; // Light Data

void Test()
{
    for(int i=0;i<MAX_LIGHTS;i++)
    {
        if(i >= numLights)
        	break;
        vec3 pos = LightSources[i].position; // Causes "index must be constant expression" error on Nvidia cards
        [...]
    }
}

I figured this way it might consider "i" to be a constant, but the error remains.

 

So how can I access "LightSources" with a non-const index, without having to unroll the loop by hand and paste the same code several times in a row?
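
One possible restructuring, sketched under assumptions: in GLSL 3.30 an array of uniform blocks can only be indexed with constant integral expressions, while an array declared inside a single block can be indexed with a loop variable. The shader side would then be one block containing something like `LightSource lights[MAX_LIGHTS];`. The C++ below shows a CPU-side layout that would match such a block under std140, assuming (hypothetically) that each light contains only `position`; the elided members are not reproduced, and the struct names are made up.

```cpp
#include <cassert>
#include <cstddef>

constexpr int MAX_LIGHTS = 8; // Same limit as in the shader

// One array element as std140 lays it out: a struct containing a vec3
// is aligned and padded to a multiple of 16 bytes.
struct alignas(16) LightSourceStd140
{
	float position[3]; // vec3 position
	float pad0;        // explicit padding up to 16 bytes
};

// Hypothetical single block: all lights live in one uniform buffer.
struct LightSourceBlockStd140
{
	LightSourceStd140 lights[MAX_LIGHTS];
};

static_assert(sizeof(LightSourceStd140) == 16, "unexpected element size");
static_assert(sizeof(LightSourceBlockStd140) == 16 * MAX_LIGHTS, "unexpected block size");

// Byte offset of light i inside the buffer backing the block.
constexpr std::size_t lightByteOffset(std::size_t i)
{
	return i * sizeof(LightSourceStd140);
}
```

With more members per light, the padding has to be worked out again from the std140 rules, so the sizes asserted here only hold for this reduced struct.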




#5253424 Bad performance when rendering medium amount of meshes

Posted by on 22 September 2015 - 05:27 AM

Well, I've run into another impasse.

I've decided to add the indices to the same buffer as the vertex data, so the structure of the global buffer now looks like this:

V1|N1|UV1|V2|N2|UV2|V3|N3|UV3|I1|I2|I3|I4|...

 

This works just fine.

 

However, some meshes require additional vertex data besides the positions, normals and uv coordinates. All vertices in the global buffer need to have the same structure, otherwise I run into problems when rendering shadows (which skip the normal + uv data and don't need to know about the additional data, except in a few special cases).

 

My initial idea was that I could keep the format of the global buffer (positions, normals, UVs and indices), and create a separate buffer for each mesh that requires additional data. This would result in more buffer changes during rendering, but since these types of meshes are much less common than regular meshes, it wouldn't be a problem.

 

So, basically all regular vertex data is still stored in the global buffer.

All meshes with additional data have an additional buffer, which contains said data.

 

This is fine in theory, however the last parameter of "glDrawElementsBaseVertex" basically makes that impossible from what I can tell.

I'd need the basevertex to only affect the global buffer, but not the additional buffer (Because the additional buffer only contains data for the mesh that is currently being rendered). Is that in any way possible?
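
To make the conflict concrete: as far as the documentation for glDrawElementsBaseVertex goes, basevertex is added to each index before any enabled attribute array is read, regardless of which buffer backs that array. The sketch below only does that index arithmetic; the helper names and the numbers in the example are made up.

```cpp
#include <cassert>
#include <cstdint>

// Element read from every enabled vertex array for one index-buffer value
// when drawing with a base vertex.
constexpr int64_t fetchedElement(uint32_t indexValue, int32_t baseVertex)
{
	return static_cast<int64_t>(indexValue) + baseVertex;
}

// True if that element lies inside a buffer holding elementCount elements.
constexpr bool inRange(int64_t element, uint32_t elementCount)
{
	return element >= 0 && element < static_cast<int64_t>(elementCount);
}
```

With a mesh stored at base vertex 1000 in a global buffer of 1200 vertices, index 2 reads element 1002, which is valid there. A per-mesh buffer with only 24 elements would be asked for element 1002 as well, which is outside of it.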

 

If not, what are my options?

Do I have to separate these types of meshes from the global buffer altogether, and just use my old method?




#5252367 Bad performance when rendering medium amount of meshes

Posted by on 15 September 2015 - 10:15 AM

Thank you, but I'm still unclear on a couple of things.

 

I've switched the data order to:

V1|N1|UV1|V2|N2|UV2|V3|N3|UV3

 

But what about the indices? Is it not possible to just append them to the same buffer (i.e. V1|N1|UV1|V2|N2|UV2|V3|N3|UV3|I1|I2|I3|I4|I5|I6), or is an element buffer absolutely required?

 

Either way, I've created a test-scenario with just one object and no vao.

There are two buffers, the vbo with the data as described above, and the element buffer with the vertex indices.

 

During rendering I then use:

glBindBuffer(GL_ARRAY_BUFFER,dataBuffer); // vbo
// Vertex Data
glEnableVertexAttribArray(0);
glVertexAttribPointer(
	0,
	3, // 3 Floats
	GL_FLOAT,
	GL_FALSE,
	sizeof(float) *5, // Offset between vertices is sizeof(normal) +sizeof(uv)
	(void*)0 // First vertex starts at the beginning
);
//

// Normal Data
glEnableVertexAttribArray(1);
glVertexAttribPointer(
	1,
	3, // 3 Floats
	GL_FLOAT,
	GL_FALSE,
	sizeof(float) *5, // Offset between normals is sizeof(uv) +sizeof(vertex)
	(void*)(sizeof(float) *3) // First normal starts after first vertex
);
//

// UV Data
glEnableVertexAttribArray(2);
glVertexAttribPointer(
	2,
	2, // 2 Floats
	GL_FLOAT,
	GL_FALSE,
	sizeof(float) *6, // Offset between uvs is sizeof(vertex) +sizeof(normal)
	(void*)(sizeof(float) *6) // First uv starts after first normal
);
//
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER,indexBuffer); // index/element buffer
glDrawElementsBaseVertex(
    GL_TRIANGLES,
    numTriangles,
    GL_UNSIGNED_INT,
    (void*)0, // For testing purposes; Index buffer contains only one mesh, which starts at index 0
    0 // Not sure about this one? VBO vertex #0 is located at position 0 in the data buffer
);

(I know this isn't efficient code, I'm doing it this way to help me understand. I'll optimize it once I've got it working.)

 

The mesh is rendered, however not correctly (Vertices, normals and uv coordinates are wrong).
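
For comparison, this is how the stride and offsets of a V|N|UV interleaved layout work out if every vertex is 3 + 3 + 2 tightly packed floats. That packing is an assumption, and `Vertex` and the constant names are hypothetical; the sketch only computes the numbers and makes no GL calls.

```cpp
#include <cassert>
#include <cstddef>

// One interleaved vertex: position, normal, uv, with no padding in between.
struct Vertex
{
	float position[3];
	float normal[3];
	float uv[2];
};

// The stride is the distance from one vertex to the next, i.e. the size of
// the whole vertex, and is the same for all three attributes.
constexpr std::size_t kStride = sizeof(Vertex);

// Byte offsets of each attribute within a vertex.
constexpr std::size_t kPositionOffset = offsetof(Vertex, position);
constexpr std::size_t kNormalOffset = offsetof(Vertex, normal);
constexpr std::size_t kUvOffset = offsetof(Vertex, uv);

static_assert(kStride == 8 * sizeof(float), "Vertex is expected to be tightly packed");
```

Under that assumption the stride would be 32 bytes for every attribute, with offsets 0, 12 and 24. Separately, the count parameter of glDrawElementsBaseVertex is the number of indices to draw, not the number of triangles.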




#5252345 Bad performance when rendering medium amount of meshes

Posted by on 15 September 2015 - 06:01 AM

You don't need one buffer per attribute, you can put them all in the same buffer (either interleaved or separate).

Hm... I don't think I understand how that's supposed to work.

So, I create a single buffer, and push all of my vertex, normal, uv and index data into that buffer:

 

V = Vertex

N = Normal

I = Index

|x| = 4 Bytes

Buffer Data: ...|V1|V1|V1|V2|V2|V2|V3|V3|V3|V4|V4|V4|N1|N1|N1|N2|N2|N2|N3|N3|N3|N4|N4|N4|UV1|UV1|UV2|UV2|UV3|UV3|UV4|UV4|I1|I2|I3|I4|I5|I6|...

 

Then, during rendering, I can use glDrawElementsBaseVertex to point it to the first index (I1) and draw the mesh:

offsetToFirstIndex = grabOffset()

glDrawElementsBaseVertex(GL_TRIANGLES,2,GL_UNSIGNED_INT,(void*)0,offsetToFirstIndex)

 

But what about the normals and uv coordinates? I'd still have to use glVertexAttribPointer for both to specify their respective offsets, which means I'd still need a VAO for each mesh.

 

What am I missing?




#5252330 Bad performance when rendering medium amount of meshes

Posted by on 15 September 2015 - 04:43 AM

That way you can just pack all your static meshes in one big buffer, managing the offsets yourself (which is fun) and have only a couple VAO switches. Since you're essentially doing memory management there, you need to have in mind things like memory fragmentation (ie, what happens if you pack 500 meshes then remove 200 randomly from the same buffer, things get fragmented), so beware.

 

So, basically I need 3 "global" buffers (1 for vertices, 1 for normals, 1 for uv coordinates), then pack all static (Why just static? My dynamic meshes have the same format, can't I just include them as well?) mesh data in those three. During rendering I then just bind these three buffers once at the beginning (=1 vao switch) and use glDrawElementsBaseVertex for each mesh with the appropriate offset.

Is that about right?
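
The bookkeeping for such a packed buffer could look roughly like this. It is a sketch only: `MeshPacker` and `MeshRange` are made-up names, 32-bit indices are assumed, and the packer is append-only, so it does not deal with removing meshes or with the fragmentation mentioned in the quote.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Where one mesh lives inside the shared vertex and index data.
struct MeshRange
{
	uint32_t baseVertex; // Value for the basevertex parameter
	uint32_t firstIndex; // Position of the mesh's first index
	uint32_t indexCount; // Number of indices to draw
};

// Hypothetical append-only packer: hands out consecutive ranges.
class MeshPacker
{
public:
	MeshRange append(uint32_t vertexCount, uint32_t indexCount)
	{
		MeshRange range{m_vertexCount, m_indexCount, indexCount};
		m_vertexCount += vertexCount;
		m_indexCount += indexCount;
		m_ranges.push_back(range);
		return range;
	}

	// Byte offset of the mesh's first index, assuming 32-bit indices.
	static std::size_t indexByteOffset(const MeshRange &range)
	{
		return range.firstIndex * sizeof(uint32_t);
	}

private:
	uint32_t m_vertexCount = 0;
	uint32_t m_indexCount = 0;
	std::vector<MeshRange> m_ranges;
};
```

Each draw would then pass the range's indexCount, the byte offset of its first index, and its baseVertex, while the buffers themselves stay bound.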

 

 

 

How are you measuring time? I'm guessing that's total CPU per frame?

No, it's just the time for the render loop (The pseudo code). I've used std::chrono::high_resolution_clock to measure it, so it's just the CPU time. I'll give ARB_timer_query a try.

According to the profiler "Very Sleepy", the main CPU bottleneck is with "DrvPresentBuffers". I'm not sure if that means it's the GPU itself, or the synchronization/data transfer from CPU to GPU.

 

If your problem is that your GPU time per frame is the bottleneck, then you'll have to optimize your shaders / data formats / overdraw / etc.
If your problem is that your CPU time per frame is the bottleneck, then it's a more traditional optimization problem. Measure your CPU-side code to see where the time is going.

I'm pretty sure the shader isn't the problem, the fps stay the same even if I simply discard all fragments and deactivate the vertex shader.

Changing the resolution also changes nothing (I've tried switching between 640x480 and 1920x1080, fps is the same), so I think I can also throw out overdraw as a possible candidate?




#5130598 GPU Gems 3 - Samples and source code?

Posted by on 11 February 2014 - 01:34 PM

The book is available for free on the Nvidia website. A lot of the chapters refer to samples and source code on the accompanying DVD, but I can't for the life of me find a download for that.

 

The book, including the DVD, is available for purchase, but the price is ludicrous (466 euros on the German Amazon, and that's not a typo). The Kindle version, which is a whole lot cheaper, does not include the DVD, so I'm somewhat stumped.

 

Maybe I'm just blind, but does anyone know if the DVD content is available for download on the Nvidia website as well?

If not, does anyone know a place where it can be purchased for a reasonable price within Germany?

 

The reason I need the DVD content is because a lot of the articles are somewhat difficult to follow without the source code at hand.



