# PCI Express Throughput


Here (https://en.wikipedia.org/wiki/PCI_Express) is a nice table outlining speeds for PCI Express. I have a GF 660 GTX in a motherboard with PCI Express 3.0 x16. As a test, I wrote a simple D3D11 app that downloads a 1920x1080x32 (8 MB) image from the GPU to the CPU. The whole operation takes 8 ms. Per second this adds up to around 1 GB of data, which corresponds almost exactly to PCI Express 3.0 x1. Is this how it is supposed to work? Does all CopyResource/Map data go through just one of the 16 lanes?
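For reference, the arithmetic behind that comparison can be sketched as below. The function names are made up for illustration. The per-lane figures are the theoretical one-direction rates implied by the raw transfer rate (2.5, 5 and 8 GT/s for generations 1 to 3) and the line encoding (8b/10b, then 128b/130b from generation 3), so they are upper bounds rather than anything a real copy will reach.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of a width x height image with the given bits per pixel.
constexpr std::uint64_t imageBytes(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t bitsPerPixel) {
    return width * height * bitsPerPixel / 8;
}

// Throughput in GB/s (1 GB = 1e9 bytes) for a copy that took `milliseconds`.
inline double effectiveGBps(std::uint64_t bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return (static_cast<double>(bytes) / 1e9) / (milliseconds / 1e3);
}

// Theoretical one-direction bandwidth of a PCIe link in GB/s.
// Only generations 1-3 are covered in this sketch.
inline double pcieLinkGBps(int generation, int lanes) {
    assert(generation >= 1 && generation <= 3 && lanes > 0);
    const double gigatransfersPerSec[] = {2.5, 5.0, 8.0};
    // Gen 1/2 use 8b/10b encoding, gen 3 uses 128b/130b.
    const double encoding = (generation <= 2) ? 8.0 / 10.0 : 128.0 / 130.0;
    // One transfer carries one bit per lane, hence the division by 8.
    return gigatransfersPerSec[generation - 1] * encoding / 8.0 * lanes;
}
```

With the numbers from the post, `effectiveGBps(imageBytes(1920, 1080, 32), 8.0)` comes out slightly above 1 GB/s, while `pcieLinkGBps(3, 1)` is a little under 1 GB/s and `pcieLinkGBps(3, 16)` is close to 16 GB/s.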

##### Share on other sites

I've never heard this described in detail, but I would imagine that while the interface may have 1, 2, 4, 8 or 16 lanes, the card and driver determine how the data is transmitted. I would assume that if the data can fit within a single lane, a single lane would be used.

Edited by MarkS

##### Share on other sites

The question is what "if the data can fit" means. I would like to copy data back from the GPU to the CPU as fast as possible, and since no other data goes that way except for my one texture download, I would ideally like to utilize all 16 lanes. If that's possible, of course.

##### Share on other sites

> The question is what "if the data can fit" means. I would like to copy data back from the GPU to the CPU as fast as possible, and since no other data goes that way except for my one texture download, I would ideally like to utilize all 16 lanes. If that's possible, of course.

You are not streaming data to the monitor. You are telling the card, through the driver, how much data is to be transferred and the card and driver make the appropriate decisions as to how that happens.

You have to understand that you have absolutely no control over what the graphics card and driver do in this matter. I'm not 100% convinced that even the driver has control over this, and if it doesn't, the user never will.

Out of curiosity, why is this important to you? Have you found yourself bottlenecked by the number of lanes used, or are you looking at potential issues?

Edited by MarkS

##### Share on other sites

I'm just looking at potential uses. I'm aware that GPU->CPU traffic should be avoided as much as possible, but for some tests I needed to do this, and to make those tests reliable I wanted to utilize the full transfer potential.

On a side note, uploading data (CPU -> GPU) takes 3-5 ms, around twice as fast as the other way around.

##### Share on other sites

> Is this how it is supposed to work? Is it like all CopyResource/Map data goes through one of the 16 lanes?

No, that's not how it's supposed to work. If it is set up for PCIe 3.0 x16, then all 16 lanes should be transferring at the same time. Maybe your video card isn't in the x16 slot, or maybe it's misconfigured.

##### Share on other sites

In addition, just because your motherboard supports PCI-E 3.0 doesn't mean that your graphics card supports PCI-E 3.0, and even if the graphics card specification states 3.0 support, I would still be wary. The GPU may fall back to a lower speed if certain conditions are not met, so unless you have all the low-level specifications for the GPU in question, all we are dealing with is speculation.

##### Share on other sites

I'm now testing my work computer, which is brand new, with a GeForce 1080 GTX. See the detailed spec in this picture: https://postimg.org/image/hwhuntpn5/

PCI-E is bidirectional, and all sources I've found claim the transfer rate in both directions should be identical, which is not true in my case.

##### Share on other sites

Were you doing anything with the GPU at the same time as the transfer?

##### Share on other sites

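    uint64 bef = TickCount();

    // Queue the GPU -> staging copy.
    deviceContext->CopyResource(stagingCopy.texture, gbufferDiffuseRT.texture);

    // Map the staging texture for CPU reads; this blocks until the copy is done.
    D3D11_MAPPED_SUBRESOURCE mappedSubresource;
    deviceContext->Map(stagingCopy.texture, 0, D3D11_MAP_READ, 0, &mappedSubresource);
    // Note: a single memcpy assumes RowPitch equals the tightly packed row size.
    memcpy(mydata, mappedSubresource.pData, sizeof(mydata));
    deviceContext->Unmap(stagingCopy.texture, 0);

    uint64 aft = TickCount();
    cout << aft - bef << endl;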


As for my home GeForce 660 GTX, I've just checked in the HWINFO app that it's plugged into PCI-E 2.0, hence the slower speed compared to my work computer.

Nevertheless I presume the 8 GB/s and 3 GB/s should be bigger. And identical.

##### Share on other sites

I can think of two things in regards to the uneven transfer bandwidth.

1. The texture might be in Morton order or tiled in some fashion, and might have to be untiled before being transferred.

2. There is some sort of arbitrator that deprioritizes read accesses from the CPU to video memory. But since you aren't doing anything else at the time, why would it limit bandwidth?

##### Share on other sites

The benchmark you posted is flawed and will stall. Period.

You need to give the GPU time between your call to CopyResource and your call to Map. I'd suggest using 3 staging buffers: call CopyResource(stagingBuffer[frameCount % 3], ...) and then call Map(stagingBuffer[(frameCount + 1) % 3], ...);

That is, each frame you will be mapping the texture you started copying 2 frames ago.

What you are measuring right now is how long it takes for the CPU to ask the GPU to begin the copy, plus the tasks that the GPU has pending before the copy, plus the time it takes for the GPU to transfer the data to the CPU (your CopyResource call), plus the time it takes for the CPU to copy from one region of system memory to another (your memcpy).
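The buffer rotation described above can be sketched as plain index arithmetic. This is an illustration only: `kStagingCount`, `copyIndex` and `mapIndex` are hypothetical names, and the actual CopyResource/Map calls are left out so the snippet stays independent of D3D11.

```cpp
#include <cassert>
#include <cstdint>

// Number of staging buffers in the ring (3, as suggested above).
constexpr std::uint32_t kStagingCount = 3;

// Index of the staging buffer that receives this frame's CopyResource.
inline std::uint32_t copyIndex(std::uint64_t frame) {
    return static_cast<std::uint32_t>(frame % kStagingCount);
}

// Index of the staging buffer to Map this frame: the oldest one in the ring,
// i.e. the one written (kStagingCount - 1) = 2 frames ago.
// Only meaningful once that many frames have been submitted.
inline std::uint32_t mapIndex(std::uint64_t frame) {
    assert(frame >= kStagingCount - 1);
    return static_cast<std::uint32_t>((frame + 1) % kStagingCount);
}
```

In a render loop this would be used roughly as: copy into `stagingBuffer[copyIndex(frame)]`, then map `stagingBuffer[mapIndex(frame)]` starting from frame 2. The buffer being mapped is never the one just written, so the GPU has two frames to finish the copy before the CPU asks for it.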

Edited by Matias Goldberg

##### Share on other sites

> Here (https://en.wikipedia.org/wiki/PCI_Express) is a nice table outlining speeds for PCI Express. I have a GF 660 GTX in a motherboard with PCI Express 3.0 x16. As a test, I wrote a simple D3D11 app that downloads a 1920x1080x32 (8 MB) image from the GPU to the CPU. The whole operation takes 8 ms. Per second this adds up to around 1 GB of data, which corresponds almost exactly to PCI Express 3.0 x1. Is this how it is supposed to work? Does all CopyResource/Map data go through just one of the 16 lanes?

The bus is not the limiting factor, not even remotely close. First of all, there are maximum bandwidths of the CPU, GPU, and RAM. Second, there's the question of who is actually doing the transfer and when. Is it a DMA operation? Is the driver buffering or doing prep work? That sort of thing. Third, 8 MB is a very small copy size to try and benchmark that bus, so I would not consider your timing to be valid in the first place. Fourth, because you're taking CPU timestamps before initiating and after completing the transfer, you're capturing extra work happening inside the driver that deals with correcting data formats and layouts. Fifth, who said the driver wants to give you maximum bandwidth in the first place? It has other things going on, including the entire WDDM to manage.

That you got a number comparable to one lane is pure coincidence.

Again, the bus has jack all to do with these speeds. What are the maximum bandwidths of the respective CPU and GPU memories? Both are DMA transfers, and the GPU may have much more capable DMA hardware than the CPU, especially since graphics memory bandwidth is so much higher than system memory bandwidth. Not to mention you're also capturing internal data format conversions.

Edited by Promit

##### Share on other sites

> The bus is not the limiting factor, not even remotely close. First of all, there are maximum bandwidths of the CPU, GPU, and RAM. Second, there's the question of who is actually doing the transfer and when.

This!

Doing a transfer over PCIe is very much like reading data from disk. Once it actually happens, even a slow disk delivers over 100MB/s, but it takes some 8-10 milliseconds before the head has even moved to the correct track and the platter has spun far enough for the sector to be read.

Very similarly, the actual PCIe transfer happens with stunning speed, once it happens. But it may be an eternity before the GPU is switched from "render" to "transfer". Some GPUs can do both at the same time, but not all, and some have two controllers for simultaneous up/down transfers. Nvidia in particular did not support transfer during render prior to -- I believe -- Maxwell (could be wrong, could be Kepler?).
Note that PCIe uses the same lanes for data and control on the physical layer, so while a transfer (which is uninterruptible) is going on, it is even impossible to switch the GPU to something different. Plus, there is a non-trivial control-flow and transaction-control protocol in place. Which, of course, adds some latency.

So, it is very possible that a transfer operation does "nothing" for quite some time, and then suddenly happens blazingly fast, with a speed almost rivalling memcpy.
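A toy model makes that point concrete: if a fixed delay precedes a fast transfer, a stopwatch around the whole operation reports a throughput far below the link rate, and the smaller the payload, the worse it looks. The function names and any numbers fed into them are purely illustrative assumptions, not measurements of any real GPU or driver.

```cpp
#include <cassert>

// Total wall time in ms: a fixed delay before the transfer starts,
// plus the time the payload itself needs at the given link rate.
inline double transferMs(double bytes, double overheadMs, double linkGBps) {
    assert(linkGBps > 0.0 && overheadMs >= 0.0);
    return overheadMs + bytes / (linkGBps * 1e9) * 1e3;
}

// Throughput in GB/s that a timer wrapped around the whole operation reports.
inline double apparentGBps(double bytes, double overheadMs, double linkGBps) {
    return (bytes / 1e9) / (transferMs(bytes, overheadMs, linkGBps) / 1e3);
}
```

For example, with a hypothetical 7 ms delay on an 8 GB/s link, `apparentGBps(8e6, 7.0, 8.0)` reports about 1 GB/s for an 8 MB copy, whereas `apparentGBps(800e6, 7.0, 8.0)` reports roughly 7.5 GB/s for an 800 MB copy. The link is the same in both cases; only the share of time lost to the fixed delay changes.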

In addition to that, using GetTickCount for something in the single-digit (or less) millisecond range is somewhat bound to fail anyway.
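On the timer point: GetTickCount typically has a resolution of around 10 to 16 ms, which is coarser than the interval being measured here. A minimal sketch of a finer-grained alternative using the standard `std::chrono::steady_clock` follows; `averageMs` is a hypothetical helper name. It only fixes the timer resolution. It still measures CPU-side wall time, so everything said above about stalls and driver work applies unchanged.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` `iterations` times and returns the average wall time per run
// in milliseconds, measured with a monotonic high-resolution clock.
template <typename Work>
double averageMs(Work&& work, int iterations) {
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

Averaging over many iterations also smooths out one-off costs such as the first copy after resource creation.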

##### Share on other sites

Just wanted to let you know that I made a test with CUDA to measure the memory transfer rate, and it peaked at around 12 GB/s.

Also, measuring CopyResource time with D3D11 queries results in very similar throughput.
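For anyone repeating the query-based measurement: D3D11 timestamp queries (`D3D11_QUERY_TIMESTAMP`) return raw GPU tick counts, and the tick frequency comes from the `Frequency` field of `D3D11_QUERY_DATA_TIMESTAMP_DISJOINT`. The sketch below covers only the conversion step, with hypothetical helper names. Issuing the queries and reading them back with GetData is omitted, and this is not necessarily how the test above was written.

```cpp
#include <cassert>
#include <cstdint>

// Converts two GPU timestamp readings to elapsed milliseconds.
// `frequency` is in ticks per second, as reported by
// D3D11_QUERY_DATA_TIMESTAMP_DISJOINT::Frequency.
inline double gpuTicksToMs(std::uint64_t beginTicks, std::uint64_t endTicks,
                           std::uint64_t frequency) {
    assert(frequency > 0 && endTicks >= beginTicks);
    return static_cast<double>(endTicks - beginTicks) * 1000.0 /
           static_cast<double>(frequency);
}

// Throughput in GB/s (1 GB = 1e9 bytes) for `bytes` copied in `gpuMs`.
inline double copyGBps(std::uint64_t bytes, double gpuMs) {
    assert(gpuMs > 0.0);
    return (static_cast<double>(bytes) / 1e9) / (gpuMs / 1e3);
}
```

Results should be discarded whenever the disjoint query reports `Disjoint` as true, since the timestamps from that interval are not reliable.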

Edited by maxest
