
Ohforf sake

Member Since 04 Mar 2008
Offline Last Active May 24 2015 02:33 PM

#5222568 C++ cant find a match for 16 bit float and how to convert 32 bit float to 16...

Posted by Ohforf sake on 11 April 2015 - 03:00 AM

For large amounts of data, there are also SIMD intrinsics that can do this:

half -> float: _mm_cvtph_ps and _mm256_cvtph_ps
float -> half: _mm_cvtps_ph and _mm256_cvtps_ph
see https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Oh, I just noticed you aren't doing this on a PC. But some ARM processors support similar conversion functions. See for example: https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html

#5222283 Relation between TFLOPS and Threads in a GPU?

Posted by Ohforf sake on 09 April 2015 - 01:13 PM

Peak performance (in FLoating point OPerations per Second = FLOPS) is the theoretical upper limit on how many computations a device can sustain per second. If a Titan X were doing nothing other than computing 1 + 2 * 3, it could do that 3,072,000,000,000 times per second, and since there are two operations in there (an addition and a multiplication) this amounts to 6,144,000,000,000 FLOPS, or about 6.144 TFLOPS. But you only get that speed if you never read any data, never write back any results, and never do anything other than a multiply followed by an addition.


A "thread" (and Krohm rightfully warned of its use as a marketing buzzword) is generally understood to be an execution context: when a device executes a program, this refers to the current state, such as the current position in the program, the current values of the local variables, and so on.


Threads and peak performance are two entirely different things!


Some compute devices (some Intel CPUs, some AMD CPUs, Sun Niagara CPUs, and most GPUs) can store more than one execution context, aka "thread", on the chip so that they can interleave the execution of both/all of them. This sometimes falls under the term "hardware threads", at least for CPUs, and it is done for performance reasons. But it does not affect the theoretical peak performance of the device, only how much of that peak you can actually use. And the direct relationship between the maximum number of hardware threads, the number of hardware threads actually used, and the achieved performance is very complicated. It depends on lots of different factors like memory throughput, memory latency, access patterns, the actual algorithm, and so on.

So if this is what you are asking about, then you might have to look into how GPUs work and how certain algorithms make use of that.

#5203246 How can sample rate be 44000Hz?

Posted by Ohforf sake on 10 January 2015 - 06:26 AM

There is more to time-discretization than just the Nyquist theorem.
When you time-discretize a continuous signal, you essentially turn it into a stream of impulses, one impulse per sample. To turn this stream of impulses back into a time-continuous signal, you need to low-pass filter it (at least mathematically speaking). Imagine the low-pass filter blurring out all the spikes of the impulses while keeping the general form of the signal intact.
You can show that if the highest frequencies in the original signal were below half the sampling rate, then all the additional frequencies due to the spiky impulses are above half the sampling rate. So (again mathematically speaking) the low-pass filter used for perfect reconstruction must let everything below half the sampling frequency pass undisturbed, but completely filter out everything above it. If you had such a filter (you can't build it) and if you had an infinitely long sample stream (the filter is non-causal and has an infinite response, so you need an infinitely long sample stream), then you could perfectly reconstruct everything, provided the original signal truly never exceeded half the sampling frequency. As Olof Hedman already pointed out, exactly half the sampling frequency is the point where it breaks apart: at that point, you can no longer distinguish between phase and amplitude. But if the frequency is a smidge lower, then thanks to the infinite number of samples you can perfectly reconstruct it.

In practice, you can't build a perfect low-pass filter (except, maybe, if the signal is periodic?). This means the filter actually being used will have, roughly speaking, three frequency regions: a low-frequency region that gets through undisturbed, a middle region where the amplitudes get damped, and a high-frequency region that the filter blocks. Depending on the "width" of the middle region, you must keep a margin between the highest frequencies in your original signal and half the sampling rate (essentially what Aressera already said).

Also note that sampling of a continuous signal has nothing to do with the cycles in a synchronous circuit.

#5201743 Linking Words as Conditional Statments

Posted by Ohforf sake on 04 January 2015 - 08:51 AM

You have to keep in mind that natural languages are rather universal in their purpose. They can be used to give orders, but they can also be used to explain things and transfer knowledge. (Most) programming languages serve only a single purpose: giving orders. You don't need to explain to the CPU why it should do something.

Hence, many of those linking words have no purpose in a programming language. In a language that describes knowledge (see ontology) things might be different.

#5182972 Normal map from height map

Posted by Ohforf sake on 25 September 2014 - 02:34 PM

Maybe these
tex2D(image, uv + off.xy).x
should be more along the lines of
tex2D(image, uv + off.xy * float2(1.0/textureWidth, 1.0/textureHeight)).x
at least if you are using normalized texture coordinates.

Also you need to output bump.xyz * 0.5 + 0.5 to get the colors of that image.

#5179652 And this is why you don't change names to lowercase

Posted by Ohforf sake on 11 September 2014 - 12:49 PM


Whether you're looking for a long and thin [...] or a thick dark mahogany [...] we have just the one for you.

We Specialize In Wood!

We have been hand-crafting [...] for nearly three decades and our designs have won multiple awards. From single [...] to bulk orders, virgin timber or reclaimed barn wood.

(sry, couldn't resist)

#5178576 Beginners Observation: Fundamental Lack of Source Code Examples?

Posted by Ohforf sake on 06 September 2014 - 01:32 PM

From the perspective of a noob pretty much anyone, actual production codes wouldn't help in learning at all. When a noob anyone will see the code, he won't understand half of the things going on and he will be like "what is this sorcery?!" Production codes will scare any beginner -one away.

There I fixed that for you.

#5178343 GOTO, why are you adverse to using it

Posted by Ohforf sake on 05 September 2014 - 11:01 AM

I can skip a thousand lines of code with a single command and a single cache hit.

If you have functions that are 1000 lines long, that is probably equally bad.

Something that I have seen, and I wonder if it's prejudice or actually founded, is that "older" programmers tend to distrust the compiler, specifically its optimization capabilities. It's probably because they have seen the really bad first generations of high-level language compilers. But somehow this seems to stick. I still see people trying to reuse local variables to save the compiler the trouble of expanding and reducing the stack frame (yes, authors of "Numerical Recipes", I'm looking at you). You don't need to do that anymore.
Similarly, I'm pretty sure I have seen a compiler reduce a series of break; statements into a single jump.

This means that you can actually use high level constructs like classes without having to fear immediate performance penalties. And with that comes the realization, as others pointed out, that there are a ton of different mechanisms to choose from so that GOTO simply isn't necessary anymore. I think the only place I ever used it was for error handling in pure C, similarly to what chingo wrote.

#5178277 One Buffer Vs. Multiple Buffers...

Posted by Ohforf sake on 05 September 2014 - 06:07 AM

And there are crazy things happening, f. ex. sometimes it's faster to reserve shared memory without using it.

This is actually not that uncommon. The problem is that the cores only have a limited amount of register space (64k per SMx core) which gets divided up by however many threads are running in parallel. So if you are running 1024 threads per SMx, every thread can use up to 64 registers. If you are running the maximum of 2048 threads, every thread only gets to use 32 registers. If more local variables are needed than registers are available, some registers are spilled onto the stack similarly to how it's done on the CPU. But contrary to the CPU, the GPU memory latencies are incredibly high so spilling stuff that is often needed onto the stack can increase the runtime.

Now, shared memory is also a restricted resource (64 KB per SMX on Kepler), but one that can't be spilled. So, if every block needs less than 2 KB, you can get the maximum of 32 resident blocks per SMX. But if you increase the amount of reserved shared memory, let's say to just below 4 KB, then you can only have 16 resident blocks. Halving the number of resident blocks also halves the total number of resident threads, so each thread has twice as many registers at its disposal.

So, increasing the amount of reserved shared memory can decrease the number of resident blocks/threads, which increases the number of registers each thread can use, which can reduce register spilling and costly loads from the stack. I don't know about compute shaders, but for cuda I believe the profiler can check for this.

#5178132 Beginners Observation: Fundamental Lack of Source Code Examples?

Posted by Ohforf sake on 04 September 2014 - 01:46 PM

I recently had to use SDL2 and to bootstrap things I googled for a minimal example to get things started. Turns out like 90% of all the example code out there is for SDL1 and outdated. A friend of mine, who started to work with OpenGL, had pretty much the same problem. He would search for tutorials/examples and also find them, but only later on realize that they were for OpenGL 1.2 and severely outdated.

The problem is that writing good, clean, and self-contained example code, especially for the more advanced stuff, takes a lot of time. And then a couple of years later, the world has moved on and all that work goes to waste.

#5178005 glut glew freeglut, what is diffrence?

Posted by Ohforf sake on 04 September 2014 - 03:02 AM

sorry but i have more questions. you said modern coding is about using glew library but i have worked on some project that only used glut or freeglut. and working on that was much easier. what i lose if i stop using glew.

Without glew, or some other library/code that does the same job, you "just" lose every advance in computer graphics of the last 16 years.
Oftentimes, if you just need a quick visualization of something with a couple of lines, points, and triangles, then the stuff from 16 years ago is completely sufficient.

I think whether or not you *need* the new stuff is irrelevant. Rendering has changed significantly over the last 16 years, and if you invest precious time into learning, you might as well not waste it on something that is dead and buried.

#5178003 my SIMD implementation is very slow :(

Posted by Ohforf sake on 04 September 2014 - 02:50 AM

Instead of using double-indirection [...]

It is actually a quadruple (4x) indirection: 1. idToIntersect[] 2. m_triBuffer[] 3. LocalTri->indiceX 4. m_vertices[]
This is incredibly bad because it "amplifies" your memory problems. CPUs always try to execute instructions either in parallel or at least overlapping if they don't depend on each other. For float addition, you need at least 3 independent (vector) additions (with AVX that is 3x8) to fully saturate an ivybridge ALU. For multiplication it's 5 independent ones. The problem with pointer chasing like the above indirection is that everything depends on the result of a long chain of operations, where in turn every operation depends on the previous one. Computations can not start until step (4), loading m_vertices[], has completed. That however can only start, when the index is known, which means that loading LocalTri->indiceX must fully complete. That again can only start after loading from m_triBuffer[] has fully completed, and so on. Until this chain of operations is done, most of your CPU is idle because there is nothing to execute in parallel.
If you are lucky, everything is in the L1 cache and every load can be serviced in 4 cycles. Then it takes a total of 16 cycles before actual computations start, and remember with AVX 16 cycles are worth 128 floating point operations. Now lets assume that your triangle sizes increase and all the stuff no longer fits into the L1 cache, but has to be loaded from the L2 cache. The L2 latency is 10 cycles, so just a 6 cycle increase, thats not a big deal. But since you have 4x indirection, you actually get 4x that 6 cycle increase. Assuming, that everything is in L2 of course. If your triangle sizes increase even more, you might have to load stuff from the L3. I don't know the latencies for the L3 but lets assume they are just 30 cycles. If all those loads hit the L3, then it takes you 4x30cycles = 120 cycles before you even start computing. That is almost 1000 floating point operations wasted.

Of course some of those loads will probably always hit the L1, but pointer chasing / indirection is extremely bad, and it can severely amplify the effects of cache misses. Getting rid of that would be even higher on my priority list than reducing the size of the data structures.

As a side note, wouldn't it be easier to not simd-vectorize the vector operations, but instead to perform the computations for 8 or 16 rays simultaneously?

#5177655 For-loop-insanity

Posted by Ohforf sake on 02 September 2014 - 08:05 AM

Since we are trying to come up with new, obscure, and complicated ways for counting up, how about this:
#include <stdio.h>

template<typename Type>
class Range {
public:
        Range(const Type &first, const Type &last) : m_first(first), m_last(last) { }

        class const_iterator {
        public:
                const_iterator(Type curr) : m_current(curr) { }
                inline const Type &operator*() const { return m_current; }
                inline bool operator!=(const const_iterator &other) const { return m_current != other.m_current; }
                inline const_iterator &operator++() { m_current++; return *this; }
        private:
                Type m_current;
        };

        inline const_iterator begin() const {
                return const_iterator(m_first);
        }
        inline const_iterator end() const {
                return const_iterator(m_last + 1);
        }

private:
        Type m_first, m_last;
};

int main()
{
        for (const auto i : Range<int>(0, 255))
                printf("%i\n", i);
        return 0;
}

#5177236 glut glew freeglut, what is diffrence?

Posted by Ohforf sake on 31 August 2014 - 11:50 AM

As a rule of thumb, when you want to use a static library you have to do three things:
1. Tell the compiler in which directory it can find the header files
2. Tell the linker in which directory it can find the library files
3. Tell the linker which library files to use.

Some tool chains allow 2+3 to be combined. Some allow the library files to be specified inside the header files via #pragma so that step 3 can be omitted. But again, as a rule of thumb, those are the three steps.

When the compiler complains that it can't find a header file, you missed step 1. When the linker complains that it can't find a library file, you either missed step 2 or messed up the library name. When the linker complains that it can't find certain symbols, as in your case, you probably missed step 3.

Looking at the current version of glew, the library files for VisualStudio are under lib/ in the glew-1.11.0-win32.zip file. There are different versions depending, amongst other things, on whether you are going for a 32-bit or 64-bit program.

Modern OpenGL refers to the newer versions of OpenGL. The newer versions have new API functions that you can call, but accessing them is a bit tricky and glew helps with that. Independent of which OpenGL version you are going for, using the newest version of glew is probably the best choice.

#5176854 STL List Interator, Whats Happens When You DO This obj *ptr = (obj*)&inter

Posted by Ohforf sake on 29 August 2014 - 01:31 AM

I fully agree with Washu but for the sake of understanding, here is what's going on.

pod = (cDropPodSate*)&iter;//this is bad but the complier does not complain
This is indeed bad. iter is of type std::list<cDropPodSate>::iterator, not of type cDropPodSate. You then take the address of it, &iter, which means that the compiler will place the iterator object on the stack and return its address. Then you cast that address, which is of type std::list<cDropPodSate>::iterator*, to an address of type cDropPodSate* and use it to access the memory region on the stack where the iterator resides, as if it were a cDropPodSate. This will of course give you nothing but junk. The reason the compiler doesn't complain, even though this is really bad, is the explicit cast (cDropPodSate*): it tells the compiler not to worry because you really, really want this and you know what you are doing. Whenever you use such a cast on a pointer (and if it happens often, you really should rethink your design), you should make absolutely sure that you actually do know what you are doing.

What you probably wanted is this:
pod = &(*iter);
iter is again of type std::list<cDropPodSate>::iterator. The * operator of iterator, *iter or iter.operator*(), is a custom overload that will give you a reference to the object that the iterator is referring to, which is a cDropPodSate. &(*iter) then takes the address of the object, that iter is referring to.

Also, you should always preincrement iterators (++iter;) instead of postincrementing them (iter++;), because postincrement has to create a copy of the iterator before incrementing it.