Vilem Otte

Member Since 11 May 2006

#5229787 win32 cpu render bottleneck

Posted by Vilem Otte on 19 May 2015 - 05:06 AM


The only reason one writes a software rasterizer is:
#1: Learning.
#2: Mega-advanced occlusion culling.
#3: It is the year 1993 and hardware-accelerated graphics aren’t mainstream yet.

 

#4: You're doing it in OpenCL/CUDA in a massively parallel way as a proof-of-concept for some epic new hi-tech technique (does that count as a software rasterizer?)




#5229545 no Vsync? why should you

Posted by Vilem Otte on 18 May 2015 - 01:28 AM

Because in many games I never hit the monitor's refresh rate.

My monitor is 60 Hz. My framerate is usually somewhere between 40 and 50. With vsync enabled that would drop to 30, possibly even less when a stop-and-wait for the next vblank happens.
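To illustrate why, here is a tiny sketch of the frame-time quantization that plain double-buffered vsync causes (my own illustration; the function name is made up):

// With double-buffered vsync a frame that misses a 60 Hz vblank has to wait for the
// next one, so every render time between 16.7 ms and 33.3 ms is displayed at 30 fps.
static double vsyncedFps(double renderMs, double refreshHz) {
    double interval = 1000.0 / refreshHz;                  // ~16.67 ms at 60 Hz
    double intervalsUsed = Math.ceil(renderMs / interval); // frame waits for the next vblank
    return 1000.0 / (intervalsUsed * interval);
}
// e.g. a 22 ms frame (~45 fps without vsync) ends up displayed at ~30 fps with vsync on.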



Keep in mind that there are some incredibly low spec machines. Such as Atoms. Vidcards with 64 bit DDR3 buffers for 30 bucks are still on the shelves and they do sell. Dudes on budget running Intel GMA.

 

I'd also note that low performance doesn't necessarily mean running on a budget. This matters more in other software businesses than in games in general - but I've already run into situations where we used a low-power-consumption machine (no active cooler at all!) that was quite expensive ~ and that machine was a lot slower than similarly priced machines with active cooling and over 10 times the power consumption.

 

Although I'm not really sure whether such low-power machines matter much in PC gaming today (for mobile platforms the situation is different!).




#5229294 VSM depth squared

Posted by Vilem Otte on 16 May 2015 - 05:21 AM

I bet you haven't heard about Chebyshev's inequality so far - that is the part you are missing (and why you don't see where the depth squared comes from). Let me try to explain:

 

Start with the definitions:

 

Mean (Expected Value)

is by definition:

$$\mu = E[X] = x_1 p_1 + x_2 p_2 + \dots + x_k p_k$$

Note that X is our random variable taking x1 with probability p1, x2 with probability p2 and so on.

 

Standard Deviation

is by definition:

$$\sigma = \sqrt{E[(X - \mu)^2]} = \dots = \sqrt{E[X^2] - (E[X])^2}$$

It measures the amount of dispersion of the values around the mean: a low standard deviation means the random values stay close to the mean, a high one means they spread further out.

 

Variance

is by definition:

$$Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2 = \sigma^2$$

Variance measures how far the values of X spread out from \mu.

 

Chebyshev's Inequality

the equation used in VSM (there are more equivalent variants of this inequality) is:

$$P(x \geq t) \leq \frac{\sigma^2}{\sigma^2 + (t - E[X])^2}$$

when t > E[X]

 

Applying this to shadow mapping:

Let our E[X] be the average depth value over the filter region.

Let our E[X^2] be the average squared depth value over the filter region (this is where the depth squared comes in).

Then sigma^2 = E[X^2] - (E[X])^2 is the variance.

And our t is the real distance from the light.

 

We can use this bound as the amount of light reaching the pixel - i.e. the probability of not being in shadow (note that this is what creates the smooth shadow edges). However, the inequality only holds for t > E[X]; for fully lit areas (t <= E[X]) we want the result to be 1, which is why we actually compute:

$$\mathcal{P} = \max\left(\frac{\sigma^2}{\sigma^2 + (t - E[X])^2},\ s\right), \quad s = \begin{cases}1 & t \leq E[X]\\ 0 & \text{otherwise}\end{cases}$$

 

So - the reason why we use the depth squared should be clear now (it has its roots in probability theory). I haven't checked (mathematically) whether storing linear z along with z would work; it would need a proof to be useful (or in case you have tried it with some success, please share the results smile.png).
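For completeness, here is the clamped bound written out as a small function - a minimal sketch in plain Java for illustration only (in practice this runs in a shader), with m1/m2 standing for the filtered moments E[X] and E[X^2]:

static float chebyshevUpperBound(float m1, float m2, float t) {
    if (t <= m1) return 1.0f;                 // fully lit: the inequality only holds for t > E[X]
    float variance = m2 - m1 * m1;            // sigma^2 = E[X^2] - (E[X])^2
    variance = Math.max(variance, 1e-5f);     // guard against zero variance on flat regions
    float d = t - m1;
    return variance / (variance + d * d);     // upper bound on the lit fraction
}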

 





#5228670 VSM depth squared

Posted by Vilem Otte on 12 May 2015 - 05:05 PM

Alright, let me try to explain the purpose of VSM (and why it was created). With standard shadow mapping you have pretty much everything you need to cast curved shadows on curved surfaces - although there is still a bit of a problem: percentage-closer filtering (PCF, used to smooth out the shadow result) is quite expensive, because it can't be made separable.

 

Now, what PCF is trying to do is the following (see the sketch after the list):

  • For each pixel take NxN samples from the shadow map (with the currently shaded pixel's texel in the center of this area)
  • For each of these samples determine whether its depth is further than the single depth value of the shaded pixel, and mark which samples are further and which are not
  • The percentage of samples (out of the total number of samples) that are further determines the shade at the given pixel (e.g. it will "blur" the shadow at the edges)
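A minimal sketch of that filter in plain Java (array-based, my own illustration - shadowMap is a width x height depth buffer, (px, py) the texel under the shaded pixel, receiverDepth the pixel's depth from the light, n the filter width):

static float pcf(float[] shadowMap, int width, int height,
                 int px, int py, float receiverDepth, int n) {
    int half = n / 2;
    int lit = 0, total = 0;
    for (int y = -half; y <= half; y++) {
        for (int x = -half; x <= half; x++) {
            int sx = Math.min(width - 1, Math.max(0, px + x));
            int sy = Math.min(height - 1, Math.max(0, py + y));
            // A sample that is further than the receiver does not shadow the pixel.
            if (shadowMap[sy * width + sx] >= receiverDepth) lit++;
            total++;
        }
    }
    return (float)lit / (float)total;   // percentage of samples further away
}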

VSM takes a very similar approach: since we don't care about the individual samples, only about the percentage that is further than the single depth value, we can compute this from the stored moments using Chebyshev's inequality ~ actually it gives us an upper bound, not the actual percentage (as opposed to real PCF).

 

The advantage of VSM is that it can use hardware 2x2 bilinear filtering and a separable Gaussian blur (instead of the non-separable PCF). Yet, as we are only calculating a bound, not the actual value, it can cause a lot of trouble (read: light bleeding ~ it is solvable, but solving it degrades shadow quality a little bit, at least in my opinion).

 

 

 

 

I recommend reading the original paper & there is also a short NVIDIA paper (just google them).

 

 

 

All in all, I've tried VSM (and also some others like ESM/EVSM) ~ and so far I think that bilinear PCF still looks the best for the generic case* (note that it is quite hard to do plausible, "real-like" shadows for area-like lights, where the shadow is hard where the object touches the ground and gets blurrier further away ... the same should hold for sunlight and such). VSM doesn't look so good when your scene geometry is complex (light bleeding ... and the darkening fix makes your shadow edges look ugly instead of giving a nice blur), so I don't really recommend it for sunlight (and we're also not using it for sunlight here).

 

For "point lights" (actually we use area lights only) the situation is different, performing good bilinear PCF on cubemap is quite expensive, so I'm just fine with using VSM there and a bit of darkening (actually I'm still trying to figure out how to make area lights in interior look even better). PCSS (Percentage Closer Soft Shadows) work awesome for them, yet they are quite computation heavy.

 

*By generic case I mean fairly complex scenes with a lot of objects, where VSM in general creates a lot of light bleeding ~ which describes the scenes we use here.




#5222578 First person wall sliding, what am I missing?

Posted by Vilem Otte on 11 April 2015 - 05:42 AM

I see that you're working in 2D, so in the following equations I will work in 2D space only (denoting the axes X (horizontal) and Y (vertical) as per math convention). This technique can be extended to 3D and still work properly.

 

Let us define a scene - we have a moving object (hereafter the player) and a static object (hereafter the wall). The player is bounded by a circle of radius r and center C. The wall is defined by 2 points (A and B). The player moves with some speed in a direction defined by the vector S (the length of the vector is the actual speed value).

 

Let us start by defining a line-circle collision.

 

Given the equation of the line (note that we first express the direction of the line, then express the line using an origin and direction, and restrict the range of t for which the segment exists) and the circle:

 

$$\textbf{V} = \textbf{B} - \textbf{A}$$

$$\boldsymbol{P_{line}} = \textbf{A} + t\cdot \textbf{V}, t \in [0, 1]$$

$$r^2 = (x - C_x)^2 + (y - C_y)^2$$

 

Given these equations we are searching for the set of t values where the line and circle equations are satisfied simultaneously. Note that the first equation actually represents 2 equations (one per axis) ... i.e. in this case we are searching for the hit point. Substituting the line into the circle equation gives:

 

$$r^2 = ((A_x + t \cdot V_x) - C_x)^2 + ((A_y + t \cdot V_y) - C_y)^2$$

 

This ultimately leads to a quadratic equation; we report an intersection when t (to be precise, either of the two t values - each quadratic equation is either unsolvable in the real domain = no intersection, or returns two values) is between 0.0 and 1.0, and no intersection otherwise. Note that we're not handling the situation where our circle is larger than the wall and contains it (which can be handled too - one t would be negative and the other positive).
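For illustration, a minimal sketch of this test in plain Java (the coefficients come directly from expanding the equation above; the function name and the flat parameter list are my own):

// Does the circle with center C and radius r overlap the segment A->B?
// Substitute P(t) = A + t*V into |P - C|^2 = r^2 and solve the quadratic for t.
static boolean circleIntersectsSegment(double ax, double ay, double bx, double by,
                                       double cx, double cy, double r) {
    double vx = bx - ax, vy = by - ay;          // V = B - A
    double fx = ax - cx, fy = ay - cy;          // A - C
    double a = vx * vx + vy * vy;               // |V|^2
    double b = 2.0 * (fx * vx + fy * vy);
    double c = fx * fx + fy * fy - r * r;
    double disc = b * b - 4.0 * a * c;
    if (disc < 0.0) return false;               // no real roots -> no intersection
    disc = Math.sqrt(disc);
    double t0 = (-b - disc) / (2.0 * a);
    double t1 = (-b + disc) / (2.0 * a);
    // Hit if either root lies on the segment, or the segment is fully inside the circle.
    return (t0 >= 0.0 && t0 <= 1.0) || (t1 >= 0.0 && t1 <= 1.0) || (t0 < 0.0 && t1 > 1.0);
}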

 

Of course we can also compute how far our circle penetrates the line (actually, for testing whether there is any collision at all this alone is enough - you just have to compare the distance against the radius) - simply by computing the distance between the player's center and the wall:

 

$$\textbf{F} = \textbf{B} - \textbf{C}$$

$$d = {{|V_x F_y - V_y F_x|}\over{|\textbf{V}|}}$$

 

One could get the wrong idea that by just pushing the sphere back by this penetration, in the direction orthogonal to the line and pointing towards the sphere center, we would get perfect sliding. Actually, for small penetration values d this should be correct, where:

 

$$d < r$$

 

Assuming our tests have perfect (or at least very good) numeric precision! Which is sadly not the case (seriously, we are taking square roots and using divisions - precision hurts here). What can accidentally happen? Our C could end up behind the wall in a single computation step (effectively being pushed in the wrong direction), or it can even jump through the wall without any collision being detected at all (this can especially happen when our time step is non-constant!).

 

An idea for a solution - given a timestep t, why not perform two steps using half the timestep? This helps, but the problem is still there (it is just pushed a bit further away). Okay, so instead of 2 sub-steps, let's do 1000. Yeah, this will be okay for almost all scenarios (and brutally inefficient), but the problem is still there. To get rid of it completely, we would need an infinite number of samples.

 

First simplification: we treat the sphere as its center only; if the distance between this point and the line is less than r, we have a collision. Now we just need to do infinitely many point-vs-line tests (simpler, yet still unsolvable).

 

Second simplification: the infinitely many points starting at C and moving in the constant direction S form a line segment (hereafter the player-line)! So all we need is to test two line segments for distance; if it is less than the radius r, we find the two closest points, one on line AB and one on the player-line. We put another circle of radius r onto the closest point on line AB and test this circle against the player-line. That way we get the point X where circle C hits line AB.

 

So, how do we work from there - we move our player to the point X where we know it collides with the wall. Now we know how much of the "time between frames we have already used" and also "how much time is left". We then simply move along the wall (in the direction orthogonal to the wall normal that continues the direction of movement, i.e. the one that doesn't go against the movement) for the rest of the time between frames.

 

This way is much more precise and properly handles the cases where our player would otherwise jump through the wall.
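A minimal sketch of that collide-and-slide step in plain Java (all names are my own, not from the post above; it assumes a swept test has already produced tHit in [0, 1], the fraction of the frame at which the player circle first touches the wall):

// Returns the new player center {x, y} after moving to the contact point and
// sliding along the wall for the remaining fraction of the frame.
static double[] moveAndSlide(double cx, double cy,        // player center C
                             double sx, double sy,        // frame displacement S
                             double ax, double ay,        // wall endpoint A
                             double bx, double by,        // wall endpoint B
                             double tHit) {
    // 1) Advance to the contact position X = C + tHit * S.
    double px = cx + sx * tHit;
    double py = cy + sy * tHit;

    // 2) Spend the remaining time sliding along the wall direction.
    double remaining = 1.0 - tHit;
    double wx = bx - ax, wy = by - ay;
    double len = Math.sqrt(wx * wx + wy * wy);
    wx /= len;  wy /= len;                                 // unit wall direction
    // Keep only the component of the leftover motion that runs along the wall.
    double along = (sx * wx + sy * wy) * remaining;
    return new double[] { px + wx * along, py + wy * along };
}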

 

 

 

Of course this approach can be extended to 3D - most physics engines use such swept intersection tests so they can properly model the collision response.

 

EDIT: Just a note I needed to add: never undo movement - always use swept tests, they are more precise (especially when divisions and square roots (or inverse square roots) are involved in calculating distances - and you have to use them).




#5222383 A practical real time radiosity for dynamic scene

Posted by Vilem Otte on 10 April 2015 - 04:31 AM

 

And unfortunately to do it "right" is hard, as in probably provably NP Hard if you want classical computing terms for such.

 

Actually it is not that hard - there are "two" common unbiased rendering algorithms - path tracing (and its variants) and progressive photon mapping. Both are, well, quite simple in principle.

 

I'm not sure whether the problem is NP-hard in terms of classical computation (but you are probably right); the thing is that these problems are solved using randomized algorithms. These algorithms take a Monte Carlo approach - their probabilistic error is O(1/sqrt(N)), respectively O((log N)^k / N) for the quasi-Monte Carlo method.




#5221056 GCC -malign-double

Posted by Vilem Otte on 02 April 2015 - 09:29 PM

Hello,

 

I'm not sure why you need that setting with physics, but let me explain what it does:

 

When you allocate memory it always lives at some virtual address (which maps to some physical memory page while you're working with it; it can be swapped in and out between physical memory and the hard drive) ... these virtual pages tend to have a size of 4 KiB (because 4 KiB is the base physical page size on x86), or basically a multiple of 4 KiB - just FYI.

 

Now, every modern CPU has a so-called FPU & SIMD unit - the Floating Point Unit & Single-Instruction-Multiple-Data unit ~ you have multiple values in a single register and perform a single operation over all of them (4x float in SSE is the well-known case ~ 4x float ~ a 4D vector).

 

Let's continue (I will only describe the SIMD details, as I don't remember the FPU specifics by heart). When you're reading data from memory into a register, there is one instruction that quickly loads data from memory into a SIMD register and one that stores it - opcodes 0F 28 and 0F 29, written as MOVAPS in assembly. These instructions perform a fast 16-byte load from a memory address (or another register) or a fast store, with just one condition - the memory address must be aligned to a 16-byte boundary (physically!).

 

This could become a bit problematic with virtual memory, but there is one nice property - each virtual page definitely begins on a 16-byte boundary (because it is backed by a physical page ~ which always begins on a 16-byte boundary), and it always covers the whole page. I.e. when a virtual address X modulo 16 is 0, the physical address used during the computation will definitely also be 0 modulo 16 (the modulo operation here just means that the given address is 16-byte aligned).

 

Now, what you as a programmer need to know - all the allocations (both stack- and heap-based) must land on 16-byte boundaries ~ i.e. each such address must be equal to 0 after a 'mod 16' operation. There are OS-specific functions to handle heap allocation correctly (_aligned_malloc under Windows, posix_memalign under POSIX-based OSes, etc.); stack-based allocations must be hand-specified to the compiler (using __declspec(align(16)) under MSVC or __attribute__((aligned(16))) under GCC).

 

The same applies to double & long long (although they want an 8-byte boundary, not a 16-byte one ... FYI, there are also 32-byte registers and even larger ones on some CPU architectures). The mentioned compiler flag -malign-double forces all double and long long values to be aligned on an 8-byte boundary (so that the aligned variants of the load/store instructions can be used, resulting in better performance).

 

Nothing is free though - if you have a structure containing an 8-byte double (aligned on an 8-byte boundary) and 1 extra byte, 7 unused bytes of padding have to be added (i.e. in general your memory usage can increase).

 

My apologies if I went a bit too deep into the hardware ~ but I wanted to explain the concepts behind why it works like this.

 

EDIT: So in general it isn't that bad (it can actually be good and yield better performance), yet there can be trouble (and crashes) when using memory alignment without knowing what is going on behind it.




#5220881 Is my triangle intersection code correct?

Posted by Vilem Otte on 02 April 2015 - 03:57 AM

In fact there are a lot of triangle intersection algorithms; out of all of them, the most interesting are:

 

Moller-Trumbore http://www.google.cz/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0CCwQFjAC&url=http%3A%2F%2Fpublic-digital-library.googlecode.com%2Fsvn%2Ftrunk%2FCG%2FGlobal%2520Illumination%2FRay%2520Tracing%2FFast%2520Minimum%2520Storage%2520Ray-Triangle%2520Intersection.pdf&ei=dg8dVeauKIKAUYb6gKgI&usg=AFQjCNF0PrfmXHhQeqCGY63JhQ56bY4DSw&bvm=bv.89744112,d.d24&cad=rja

 

which is just a standard single-sided or double-sided barycentric test. It doesn't need any additional storage and is in general very fast on both GPUs and CPUs.
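For reference, a minimal sketch of the double-sided Möller-Trumbore test written as plain Java (the helper names are mine, not taken from the paper):

// Returns the ray parameter t for the hit, or Double.POSITIVE_INFINITY on a miss.
// v0/v1/v2 are the triangle vertices, orig/dir the ray origin and direction (xyz arrays).
static double intersectTriangle(double[] orig, double[] dir,
                                double[] v0, double[] v1, double[] v2) {
    double[] e1 = sub(v1, v0), e2 = sub(v2, v0);
    double[] p = cross(dir, e2);
    double det = dot(e1, p);
    if (Math.abs(det) < 1e-12) return Double.POSITIVE_INFINITY; // ray parallel to triangle
    double invDet = 1.0 / det;
    double[] t = sub(orig, v0);
    double u = dot(t, p) * invDet;
    if (u < 0.0 || u > 1.0) return Double.POSITIVE_INFINITY;
    double[] q = cross(t, e1);
    double v = dot(dir, q) * invDet;
    if (v < 0.0 || u + v > 1.0) return Double.POSITIVE_INFINITY;
    double dist = dot(e2, q) * invDet;
    return dist > 0.0 ? dist : Double.POSITIVE_INFINITY;
}

static double[] sub(double[] a, double[] b) { return new double[]{ a[0]-b[0], a[1]-b[1], a[2]-b[2] }; }
static double   dot(double[] a, double[] b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static double[] cross(double[] a, double[] b) {
    return new double[]{ a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0] };
}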

 

Woop http://jcgt.org/published/0002/01/05/paper.pdf

 

I find it very fast on the CPU. It needs more storage compared to the MT test (plus, for shading you will still need the original vertex coordinates), so I find it to behave a bit worse (on some GPUs) compared to MT.

 

Out of the others - the Plücker test is one of the most robust, the Badouel (projection) test behaves very fast with SIMD SSE on the CPU, and so on. I'm also linking a (slightly older) good summary of the algorithms and their speed comparisons - http://gggj.ujaen.es/docs/grapp14_jimenez.pdf

 

I hope this helps. I could link you to my GPU ray tracer implementation on GitHub, but that repo is heavily WIP and I haven't pushed for a few weeks (I have modifications locally), plus the setup is a bit hard as of now (but I'm working on it!). https://github.com/Zgragselus/OpenTracer ... there is obsolete CPU code with a variation of the barycentric test, and the Woop test in OpenCL (which is used at runtime).




#5220136 From scratch vs Unity

Posted by Vilem Otte on 30 March 2015 - 03:21 AM

I don't know if one should really write anything from a scratch, since there are a lot of open source projects around. Maybe rather than making one yourself, modify already existing ones or at least assemble one from already existing projects.

 

No.....! In most cases (and I've worked on such projects), a project where large parts are taken from other projects (open source, etc.) ends up as a huge mess. As the first programmer on such a project, I was trying to get away ASAP.

 

"The management" automatically states that when you grab code from another open-source project, you know EVERYTHING about it (this is wrong - but hey, try to explain them when they don't listen). So, after few weeks of messages from management (where they wanted large scale changes that would take days in project you written on your own, they wanted them in days in source you just used), where I tried to explain that it can't be done that fast (and no human being is able to know everything about 5 or 6 large open-source projects you downloaded & build 2 weeks ago & be able to do large scale modifications); I ended up quitting.

 

I'm not saying you shouldn't use open-source at all, but explain this to management (and hope they will listen)




#5202181 Writing my own pathtracer on mobile device

Posted by Vilem Otte on 06 January 2015 - 03:50 AM

About saving the image and your issue.

 

There are multiple reasons why you can get the issue:

1.) Due to the asynchronous and highly parallel nature of the code, the accumulator buffer is being re-written while you are reading from it inside the loop. See the following diagram:

 

| Thread 1 | Thread 2 |
|----------|          |
|  Render  |          |
|     |    |          |
|     |    |----------|
|     |    | GetBitmap|
|     |    |     |    |
|     |    |     |    |
|----------|     |    |
|  Render  |     |    |
|     |    |     |    |
|     |    |     |    |
|     |    |----------|
|     |    | saveImage|
|     |    |     |    |
|----------|     |    |

 

Now, GetBitmap doesn't read the correct data (the data is re-written while GetBitmap reads it). The proper solution is to use a mutex lock on the shared resources (the accumulator buffer and the sample count - as both can be changed during the rendering phase).

 

First of all you need a mutex object to synchronize your code (Java doesn't have a class literally called Mutex, but a java.util.concurrent.Semaphore with a single permit - or a ReentrantLock - does the job); it must be accessible from both the render and getBitmap functions.
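A minimal declaration for that (the field name 'mutex' just matches the code below):

import java.util.concurrent.Semaphore;

// One permit = mutual exclusion over the accumulator buffer and the sample count.
private final Semaphore mutex = new Semaphore(1);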

 

Your render and getBitmap will then look like:

public void render() {
    try {
        mutex.acquire();
        try {
            ... // The code of render will be here (or at least the part of code where you write into accumulator and samples - assuming you write to samples too)
        }
        finally {
            mutex.release();
        }
    }
    catch (InterruptedException ex) {
        ...
    }
}

...

public Bitmap getBitmap() {
    ... // Here you create bitmap object
    try {
        mutex.acquire();
        try {
            ... // The nested loop from getBitmap goes here
        }
        finally {
            mutex.release();
        }
    }
    catch (InterruptedException ex) {
        ...
    }
    // Here you return rBitmap
}

This way (using the mutex), you synchronize your code to work like this:

 

 

| Thread 1 | Thread 2 |
|----------|          |
|  Render  |          |
|     |    |          |
|     |    |          |
|     |    |          |
|     |    |          |
|     |    |          |
|----------|          |
|          |----------|
|          | GetBitmap|
|          |     |    |
|          |     |    |
|          |     |    |
|          |     |    |
|          |     |    |
|          |----------|
|----------| saveImage|
|  Render  |     |    |
|     |    |     |    |
|     |    |     |    |
|     |    |     |    |

 

Of course this is only going to help you when those functions are actually executed on different threads (and I assume they are).

 

Another possible reason would be out-of-range colors (if your code runs sequentially, the issue is most likely related to the divisions or type casts), in this part of the code:
int r = (int)(accumulator[i*3+0] / (float)samples);
int g = (int)(accumulator[i*3+1] / (float)samples);
int b = (int)(accumulator[i*3+2] / (float)samples);

Make sure that r, g and b really are in the range 0 to 255 (just Log.v them to see whether something weird is happening). Although that would most likely affect the whole image, not just parts of it.
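If they can go out of range, a simple defensive clamp before packing the pixel is enough - a sketch (android.graphics.Color is just my assumption about how the bitmap pixel gets packed):

// Clamp each channel to [0, 255] before packing, so one out-of-range sample
// can't wrap around and corrupt the pixel.
int r = Math.min(255, Math.max(0, (int)(accumulator[i * 3 + 0] / (float)samples)));
int g = Math.min(255, Math.max(0, (int)(accumulator[i * 3 + 1] / (float)samples)));
int b = Math.min(255, Math.max(0, (int)(accumulator[i * 3 + 2] / (float)samples)));
int pixel = android.graphics.Color.rgb(r, g, b);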




#5201596 Writing my own pathtracer on mobile device

Posted by Vilem Otte on 03 January 2015 - 02:05 PM

My apologies, man - I made a stupid mistake there.

 

What we want when computing dx and dy are the differences between two rays along the X, respectively Y, axis (on the viewport). Now, why do we do that: we take the base vector, which is the one pointing directly at the centre of the screen, and then we take the ray to the pixel one step to the right of the centre and the ray to the pixel one step above the centre. The code doesn't do that.

float xone = (((float)width * 0.5f + 1.0f) / (float)width) * 2.0f - 1.0f;
float yone = ((((float)height * 0.5f + 1.0f) / (float)height) * 2.0f - 1.0f) * aspect;

These are the fixed xone and yone values. I intentionally kept the computation more verbose (you can actually optimise it to 2/width, respectively (2/height)*aspect).

 

Now it should work properly.

 

And btw. Happy New Year to whole community here smile.png




#5200808 Writing my own pathtracer on mobile device

Posted by Vilem Otte on 30 December 2014 - 09:01 AM

Hm, still the same result.

 

Did you mean this:

Vector3D direction = new Vector3D(xdir + r1 * dxSize, ydir + r2 * dySize, zdir);

instead of this:

Vector3D direction = new Vector3D(xdir + r1 * dxSize, ydir + r2 * dxSize, zdir);

? Even if you did, it is also not working.

 

 

Actually...

Vector3D direction = new Vector3D(xdir + r1 * dxSize, ydir + r2 * dySize, zdir);

direction.normalize();

 

But I think I forgot some division (that is why your rays are "spread" that much - they shouldn't be).




#5200547 Writing my own pathtracer on mobile device

Posted by Vilem Otte on 29 December 2014 - 08:25 AM

Uh.... please normalize this direction vector:

Vector3D direction = new Vector3D(xdir + r1 * dxSize, ydir + r2 * dxSize, zdir);

 

I gotta go to work right now, I'll check the math later today (maybe there was some mistake in the code I posted).




#5200502 What is better? TransformFeedback or OpenCL Kernel?

Posted by Vilem Otte on 29 December 2014 - 02:08 AM

+1 for samoth, although...

 

The synchronization (if done correctly) works for all the devices from major vendors and is not that expensive. Generally...

 

If you need to do something more complex with the data before you output them, go with OpenCL, otherwise using transform feedback might be perfectly fine.

 

To your question exactly:

What would be faster? It depends. If you use a constant number of matrices (weights) per vertex, then the OpenGL version might be faster. For a variable number of weights (theoretically unbounded), the computation itself can be made highly efficient in OpenCL (it won't be a simple implementation though, because if you mess up one thing there it will get horribly slow), most likely better than doing it through OpenGL (but I won't guarantee that, as you would need to benchmark ... you might also get different results on different devices - well, I guess you know how this works).

 

To your second question: OpenCL can work directly with OpenGL vertex buffer objects (and textures). You just need to "acquire" them before using them in OpenCL and "release" them before using them again in OpenGL. I've been using OpenCL kernels for quite a long time now and I haven't seen any performance hit from this (if done properly). And yes, you can also use double-buffering to hide the cost - you work on the first buffer in OpenCL while displaying the second one (generated in the previous frame); that way you most likely won't see any performance hit at all ... it will cost you more memory though!




#5200500 Writing my own pathtracer on mobile device

Posted by Vilem Otte on 29 December 2014 - 01:54 AM

Antialiasing

Actually this is not as hard as it seems when you think about it - for primary rays each ray covers approximately the same "pyramid" - so you just need to get dx and dy (these vectors point along your camera's right, respectively up, vector, and their length is the size of this pyramid along those vectors - which are basically X and Y if you projected them into screen space).

 

Their calculation is straightforward. Your ray direction is computed as:

// You can initialize all these values before your rendering loop, they don't change for different primary rays
float xone = (1.0 / (float)width) * 2.0f - 1.0f;
float yone = ((1.0 / (float)height) * 2.0f - 1.0f) * aspect;
float zdir = 1.0f / (float)tan(fov);
Vector3D base = new Vector3D(0, 0, zdir).normalize();
Vector3D xvec = new Vector3D(xone, 0, zdir).normalize();
Vector3D yvec = new Vector3D(0, yone, zdir).normalize();
Vector3D dx = xvec - base;
Vector3D dy = yvec - base;

// As we know our camera right vector and up - this can be further simplified to
float dxSize = dx.length();
float dySize = dy.length();

// Inside your rendering loop
// And for the AA rays you actually work like this
for (int aa = 0; aa < AA_SAMPLES; aa++) {
    float r1 = Random().GetFloat() * 0.5; // Random().GetFloat() here returns random value between -1.0 and 1.0
    float r2 = Random().GetFloat() * 0.5; // Random().GetFloat() here returns random value between -1.0 and 1.0
    Vector3D direction = new Vector3D(xdir + r1 * dxSize, ydir + r2 * dxSize, zdir);
    Ray ray = new Ray(Camera, direction);
    color.add(Trace(ray, 1));
}
return color.divide(AA_SAMPLES);

Explicit Path Tracing

To answer questions:

1.) The 4D vectors are specific to my code (I took the example directly from my code); you don't need them. That code is written in OpenCL (which works really close to the actual hardware), and the thing is that on that level you care about your type sizes, memory alignment, etc. (i.e. these things are all about optimization) - something you can't really do in Java, so feel free to use your Vector3D instead.

 

2.) EPS is a constant here - it prevents your hit point from ending up inside the intersected surface (you had that problem - it is the one with numeric precision - and solved it with 0.99f). In this code I've used the same value, just via a preprocessor directive (in Java you would use something like public static final float EPS = 0.99f; ... if you know C or the OpenCL language, my equivalent is #define EPS 0.99f - they are not entirely the same from a compiler or language-theory perspective, but they will work the same here).

 

EDIT:

If anyone reads the post now, there is a stupid mistake I noted a few posts below; it should be:

float xone = (((float)width * 0.5f + 1.0f) / (float)width) * 2.0f - 1.0f;
float yone = ((((float)height * 0.5f + 1.0f) / (float)height) * 2.0f - 1.0f) * aspect;

For the explanation, see a few posts below. My apologies again.





