Jump to content

  • Log In with Google      Sign In   
  • Create Account


Member Since 28 Feb 2011
Online Last Active Today, 05:20 AM

#5166663 good random permutation?

Posted by Bacterius on 13 July 2014 - 09:56 PM

Here's my (heavily optimized) implementation of the algorithm I mentioned above, as a Java class with some test/benchmark code:


package permutation;

import java.util.*;

 * Provides a deterministic pseudorandom permutation of [0, 2^31).
 * WARNING: heavily tuned for the specific 2^31 - 1 case - do not edit
 *          the prime and expect it to work (you will at least have to
 *          update the reduce() method and the cofactor array and also
 *          make sure that the modPow() 2-ary algorithm still works)!
public class Permutation {
    private static final int p = 2147483647;         /* Prime = 2^31 - 1  */
    private static final int s = 31;                 /* Mersenne exponent */
    private static final int[] cofactors = {         /* List of cofactors */
        (p - 1) / 2,
        (p - 1) / 3,
        (p - 1) / 7,
        (p - 1) / 11,
        (p - 1) / 31,
        (p - 1) / 151,
        (p - 1) / 331

     * Computes the value of x mod p, where p = 2^s - 1 is prime.
     * This is a division-free optimization of the usual x % p
     * method, with performance improvements of around 10-20%.
     * Only works with p = 2147483647 (it can be implemented for
     * smaller Mersenne primes but it becomes less efficient, as
     * one needs to check more carefully for modular overflow).
    private static long reduce(long x) {
        long r = (x & p) + (x >> s);
        return r >= p ? r - p : r;

     * Computes the value of g^e mod p, using precomputed 2-ary
     * modular exponentiation. The generator array must contain
     * the 2-ary powers of the generator, beginning at g^1.
    private static int modPow(int[] generator, int e) {
        long r = 1;
        int k = 0;
        while (e != 0) {
            if ((e & 1) != 0) {
                r = reduce(r * generator[k]);

            e >>= 1;

        return (int)r;

     * Returns whether g (as a 2-ary power table) is a generator
     * modulo p. Does this efficiently via the factors of p - 1.
    private static boolean isGenerator(int[] g) {
        if (g[0] <= 1) return false;

        for (int cofactor : cofactors) {
           if (modPow(g, cofactor) == 1)
               return false;

        return true;

     * Deterministically converts a seed to a generator, and
     * returns it as a 2-ary power table (starting at g^1).
     * Note: a sizeable fraction of values in [2, p - 2] are
     *       generators of p so this will terminate quickly.
    private static int[] seedToGenerator(int seed) {
        Random random = new Random(seed);
        int[] powers = new int[31];

        while (true) {
            long g = powers[0] = 2 + random.nextInt(p - 3);

            for (int k = 1; k < 31; ++k) {
                powers[k] = (int)(g = reduce(g * g));

            if (isGenerator(powers))
                return powers;
     * The lower bound of the permutation (inclusive).
    public static final int Lower = 0;

     * The upper bound of the permutation (inclusive).
    public static final int Upper = p;

     * Precomputed values of g^(2^k) mod p, used for modular
     * exponentiation. The actual generator can be read from
     * generator[0] = g^1.
    private int[] generator;

     * Instantiates a new permutation of [0, p] using
     * the given seed (same seed = same permutation).
    public Permutation(int seed) {
        generator = seedToGenerator(seed);

     * Permutes the input in [0, p] and returns the
     * corresponding output - will throw an illegal
     * argument exception if x is out of bounds.
    public int permute(int x) {
        if ((x < Lower) || (x > Upper))
            throw new IllegalArgumentException();
        else {
            /* This is for randomizing the fixed points at 0 and p. Remember
             * that the composition of two permutations is a permutation. */
            x ^= generator[0];

            if ((x == 0) || (x == p))
                return x;
                return modPow(generator, x);

     * Little test driver for the class.
    public static void main(String[] args) {
        final int benchTrials = 5, benchSamples = 10000000;
        final int dupTrials   = 5, dupSamples   = 10000000;
        final int arbitrarySeed = 0xDEADBEEF;

        System.out.printf("Instantiating permutation on [%d, %d] (seed %d)\n",
                          Permutation.Lower, Permutation.Upper, arbitrarySeed);

        Permutation perm = new Permutation(arbitrarySeed);
        System.out.println("\nDisplaying first few values:");

        for (int x = 0; x < 20; ++x)
            System.out.printf("%d => %d\n", x, perm.permute(x));

        System.out.println("\nBenchmarking performance:");

        for (int t = 0; t < benchTrials; ++t) {
            long begin = System.nanoTime();

            for (int x = t * benchSamples; x < (t + 1) * benchSamples; ++x)

            double elapsed = (System.nanoTime() - begin) / 1000000000.0;
            System.out.printf(" [+] %.1f perms/s\n", benchSamples / elapsed);

        System.out.println("\nChecking for duplicates:");

        for (int t = 0; t < dupTrials; ++t) {
            Set<Integer> dups = new HashSet<Integer>();
            int dupCount = 0;

            for (int x = t * dupSamples; x < (t + 1) * dupSamples; ++x) {
                int output = perm.permute(x);

                if (!dups.contains(output))

            if (dupCount > 0)
                System.out.printf(" [!] Found %d duplicates!\n", dupCount);
                System.out.printf(" [+] No duplicates\n");


It meets all of your requirements:

- with the added xor operation the fixed points are gone, and the permutation can now be called pseudorandom (it is at least as good as an LCG, and probably much better - someone else can weigh in here)

- it can give you the output permuted integer for any input in the [0, 2^31 - 1] interval in constant time/memory

- it is rather fast (on the PC I am on I am getting between 15.5 and 17.5 million lookups per second across the entire input range, in Java)

- it is deterministic ("stable") and lets you specify a seed (note: not all seeds produce a unique permutation, but most do, i.e. you have around 1 billion seeds to play with)

- it is, in fact, a permutation (so each output corresponds to one, and only one, input)


You don't have to use it, but that's okay, I really enjoyed writing it anyway :)

#5166625 good random permutation?

Posted by Bacterius on 13 July 2014 - 04:00 PM

I don't think a stateless PRNG makes any sense. Is there not any way to store the state of a PRNG with the terrain generator?

Usually the go-to choice for alternative RNGs (if not already provided by the language's libraries) is usually mersenne twister.


He needs a random-access permutation, which is a different problem. But, yes, unless the seed is hardcoded, there is going to be state (by definition).


One possible candidate I can think of is the function:

\[f(x) = \begin{cases} x ~ ~ &\text{if} ~ ~ x = 0 ~ ~ \text{or} ~ ~ x = 2^{31} - 1 \\ g^x \bmod{2^{31} - 1} ~ ~ &\text{otherwise} \end{cases}\]

which has pretty good pseudorandom properties and is a permutation (there are a couple obvious fixed points by construction, which can be eliminated by e.g. multiplying the input with a suitable constant beforehand). Here g is supposed to be a generator of 2^31 - 1, the smallest one is 7 and about 25% of all possible bases are generators, so you won't run out of seeds (it is easy to test if you have a generator, which means you can efficiently find one by instantiating any PRNG with your seed and looking for a generator that way). It is also reasonably fast, because modular exponentiation can be implemented efficiently using modular square and multiply, and, furthermore, reducing an integer modulo 2^31 - 1 thankfully doesn't require a division. I might write up a full implementation this afternoon if anyone is interested (or just for fun)

#5165150 Is this memory leakage?

Posted by Bacterius on 06 July 2014 - 05:39 PM

System.in is really standard input, and your program only gets one of these. Once you close it, that's it - you can't get any further input from the user from standard input ("closing" a scanner closes the underlying stream). So, you generally should not close it if you are going to be using it again later, and it is not a memory leak, as it is closed when you shut down your program anyway. This is one of the exceptions to the rule of "close what you open", because System.in actually belongs to your process.

#5164879 Normals

Posted by Bacterius on 05 July 2014 - 08:30 AM

EDIT: Since you posted in graphics theory, you did mean a plane with 4 vertices and not a mathematical plane given the equation f(x) = a*x + b*x + c*x + d; right?


Those are called quads, not planes... though the method still works, of course, any three non-colinear points on a plane (aka not in a straight line, aka a proper triangle) are sufficient to determine the plane's normal (up to handedness, which is probably going to be defined depending on the rest of your geometry, so that all the normals correctly point either "in" or "out" - your modelling tool should be able to do this automatically for you using "unify normals" or similar which basically rotates the vertices of all the triangles to have a coherent winding order - then you only need to decide on "the right one" and stick to it).


And you do need to normalize, as even if the two vectors in your plane are unit length the cross product will not be unless they are also orthogonal. If your vectors are u and v then the length of the cross product will be equal to |u| |v| sin(theta) where theta is the angle between the two vectors, between 0 and pi. This also means that you typically do not need to normalize the two vectors before taking the cross product as its direction is independent of the magnitude of either vector, which can save some operations smile.png

#5164859 encryption my password

Posted by Bacterius on 05 July 2014 - 05:31 AM

This is a solved problem. Don't try to invent your own clever changes. You _will_ end up making an insecure and easily-cracked system. Follow existing practice as laid out in articles like http://www.codeproject.com/Articles/704865/Salted-Password-Hashing-Doing-it-Right, which includes C# sample source.


The sample code needs to do far more iterations to offer any kind of protection (and the salt is too long, 16 bytes will do just fine). Other than that, decent article, and covers the main points. And, yes, sending the raw password over SSL is probably the right thing to do. You could argue that someone with access to the server could listen in to the passwords, but if they have access to that they can already get all the data they need from the server itself without needing your password. And if you are concerned about the attacker knowing your password beyond the contents stored on that server, it means you're reusing it somewhere else, which you should know by now is a very good thing to do (and since most websites do not do it, there is little advantage of doing it and it's just one more thing that can go wrong).


SSL might not be pretty or confidence-inspiring but it's reasonably secure when used correctly and is still the best there is in terms of browser-supported standards, and as we all know, standards are a good thing, because it's good when things work properly all the time without needing further work. So, until a new improved standard comes along, be industry-standard like everybody else and just use it. Doing that also covers your ass in case your users' credentials are leaked, though this may not be as relevant for a game server.

#5164477 simple SDL framerate?

Posted by Bacterius on 02 July 2014 - 09:48 PM

Basically, wrap the C++ class members (startTicks, etc...) into a struct, and have each function take a pointer to such a struct and operate on it (akin to the "this" parameter in C++). This is the simplest example of an "object" in C.

#5163912 very strange bug (when runing c basic arrays code)

Posted by Bacterius on 30 June 2014 - 01:50 PM

Okay. Not wanting to learn to use a debugger is the straw that broke the camel's back. I will not be replying to any of your threads in the future, fir - I've been patient, but I'm just about through with you. If other people somehow still want to help you after reading through your recent threads, it is their time to waste. I am not wishing you good luck, for you obviously will not need it, being too cool for debuggers and, in fact, anyone's advice that goes against your preconceptions. Do not bother replying to this post.

#5163809 very strange bug (when runing c basic arrays code)

Posted by Bacterius on 30 June 2014 - 06:15 AM

Check that your loop limit isn't overflowing (the int type is only technically guaranteed to hold values from -32768 to +32767, and you did say you were using 32-bit XP in a previous thread). Set compiler warnings to maximum. Print the loop counter every iteration. Run through the code with a debugger and see which iteration fails. Is the bug consistent, does it always crash at the same place? If it sometimes succeeds, does it print the right answer or just garbage? What happens if you decrease the number of iterations? You know, the usual stuff. There's nothing wrong that I can spot with the code except the potential for overflow, and, indeed, it works just fine for me.


By the way, there is a difference between "doesn't crash" and "prints the right answer" - differentiating the two in your diagnostics usually helps. And also, please try to avoid tagging your thread "C language" when you are really compiling with a C++ compiler. The two languages are different and go by (often subtly) different rules - you will get into trouble eventually thinking they are interchangeable. Make up your mind on a language, be it C, C++, or C with classes, but please don't say you are using one language and then compile your code as another, that's just misleading for everyone involved.

#5163735 Can someone help me write a program?

Posted by Bacterius on 29 June 2014 - 06:07 PM

If you use Python and the terminal you can use the colorama library for coloring the different letters, which also happens to be cross-platform. For C# this functionality is built into the Console class, for C and C++ you can use rlutil, there's probably a library for it in Java, etc... if using a graphical window, color should be something very easy to add. Hopefully this helps you set up a prototype, remember to start small: first hardcode the colors for each letter, then think on how you could make them editable, and keep iterating.

#5163714 SDL.h: no such directory found

Posted by Bacterius on 29 June 2014 - 03:58 PM

Guys thanks I fixed the issue


Care to share, just so this thread isn't totally useless and other people coming across it in the future can hope to fix their problem too?

#5163601 Need more than 1 value to unpack?

Posted by Bacterius on 29 June 2014 - 06:23 AM

You need to delete the "python test11.py etc.." line from your script, since it doesn't belong there.

#5163443 decompositing and recompositing color (pixel)

Posted by Bacterius on 28 June 2014 - 08:15 AM



color = (red <<16) + (green<<8) + blue;

this strikes me both as an ugly and probably inefficient..

Yeah, me too...
You should really be using bitwise-or:
color = (red<<16) | (green<<8) | blue;
There we go -- much better.


why much? iznt add one cycle and well optymized?



No, there is no time difference, addition and bitwise OR take the same time in most hardware (see the Intel docs on throughput/latency of both instructions, they are identical and quite fast indeed). On (very) old hardware bitwise OR could even be slightly faster since you don't need to carry bits, but good luck measuring that. There is also no runtime difference as long as red, green and blue are no larger than a byte. But it's slightly more readable, because when packing bytes into a single word you are not really doing any addition in the usual sense, you're just.. packing bits. So in this sense bitwise OR is better than addition, not that it matters much (both will give wrong answers if red, green and blue are wider than 8 bits anyway).


And who knows? If you write it with | instead of + a compiler might actually recognize what you're trying to do and use a special CPU instruction that can pack bytes very quickly (unlikely, but perhaps on DSP's - digital signal processors). When writing C or C++ code, you're talking to the compiler, not the CPU. Without resorting to manually written assembly, your code will be going through the compiler, so if your goal is to make your code fast, you had better make your code as clear and to-the-point as possible, so that the compiler can understand your intent better (and, yes, it does try - compilers have many heuristics that recognize common code patterns). Writing convoluted code will just cause the compiler to give up and emit suboptimal code. As an added bonus, compiler-friendly code is also often human-friendly code. Yes, there are exceptions, in some cases you can produce faster code by writing code a certain way in bottleneck situations, and intrinsics are a nice middle ground between standard code and full-on assembly which can boost performance immensely if you use them just right, but to be blunt, by going over your snippets in your various threads, you are really not at this stage yet.


How can you claim with a straight face that you've properly profiled your code "100x more" and identified likely bottlenecks when you are still questioning in this very thread whether bitwise OR is less "optymized" than addition? You keep getting tons of very useful advice that you really should follow, but you keep brushing it off as "propaganda" as if you were too good for it. It's getting very repetitive. If you think you know better, why are you asking for advice? If you are not looking for help, why are you making threads?


My final advice to you is: get off your high horse and face the possibility that you actually might not know everything (or anything) about optimization. Then try and modify your code and see what changes in the resulting assembly to learn what your compiler does and does not do. Read up a bit on how CPU hardware works, and get familiar with at least the basics of your own architecture (probably x86 Pentium 3 or Core 2). Find existing C/C++ code on github or whatever. There have to be dozens of software rasterizers online - you could study a few and see how they implemented various parts of their pipeline. Learn from other people's code, compare it to yours. It is hard work, yes. But asking vague questions on a forum unfortunately only gets you so far - to learn to write fast code, you must work at it. There's no secret. If you don't want to take this advice, your loss. I will have only wasted 15 minutes writing it.

#5163126 Need sugestion to optimize realtime bitblt->StretchBlt (scaling takes to...

Posted by Bacterius on 26 June 2014 - 06:32 PM

Scaling is expensive. This is why, for instance, Fraps only offers fullscreen or halfscreen resolution when it captures the screen, so that scaling is either unnecessary or very easy. It's just too costly to handle cases where you're not scaling down to a power of two of the original size, because you have to handle filtering of multiple overlapping pixels, which also blows your cache because the resulting memory access patterns are.. suboptimal to say the least. There is one thing you can try, which drastically helped me back when I was on my old laptop and trying to capture stuff, which is to set your screen resolution to 16-bit colors.


Otherwise, what I would try is instead get a copy of the screen in a GPU-bound texture, using your favorite graphics API, scale it using the pixel shader (which ought to be fast enough, even with bilinear filtering) and read it back on the CPU. But this might incur significant capture latency, and might not be feasible, and you said you don't want to do this anyway, so...


I'm not sure what your specific needs are, but have you considered simply getting one of those screen recording devices? They don't have any system overhead and you might be able to get one to feed back into the computer to deliver the screen contents in realtime.

#5162966 Using 3rd party libraries or code your own?

Posted by Bacterius on 26 June 2014 - 02:46 AM

On big-endian architectures you can actually just do the final comparison for 4/8 byte chunks just as you would for a byte-by-byte approach (of course, most consumer hardware is little-endian). I would also imagine that unless your buffers are in cache to begin with, your implementation is likely going to be memory-bound. In short, memcmp is not the first function I would look at when micro-optimizing, compared to, say, memory allocation patterns, excessive use of virtual functions, poor cache usage, etc... but, yes, it will probably make you a better programmer to really understand how it is typically implemented on a variety of architectures.

#5162522 Why does this matrix multiplication order matter?

Posted by Bacterius on 24 June 2014 - 06:34 AM

in words a column vector time M time V time P equals to row vector time transpose(M time V time P)



That is almost true (up to transposition, e.g. you'll get the same result, just as either a row or column vector, obviously, and you must know that a column vector is to be multiplied from the right i.e. matrix * column vector, not from the left like a row vector is). I get what you are saying. But it is not equivalent to your following statement that



P*(V*(M*p)) = p*P*V*M

and this means

P*(V*(M*p)) =!= P*V*M*p



Which is what we are desperately trying to tell you! What you really want to say is that if \(p\) is the vector in column notation, and \(p^T\) is the vector in row notation, where the \(^T\) indicates matrix transposition, then:


\( \left ( P \cdot V \cdot M \cdot p \right )^T = p^T \cdot \left ( P \cdot V \cdot M \right )^T \)


Which is trivially true from the properties of the matrix transpose. But note this is not at all what you end up saying... and what people have been trying to tell you. Now if you'd actually stopped and read what people were saying, you might have noticed your equations were missing the transpose bit that makes them true (or even just meaningful). I get column/row vector semantics don't mean much to most programmers, but that doesn't make it right - factually, your equations are incorrect and misleading, and the second one (literally) contradicts associativity. I hope you understand now that clearly distinguishing between row and column vectors is essential or you can easily arrive to major contradictions (here, to be clear, the mistake is that you use "p" for both the row and column vector).


And finally, you will notice this is clearly not the problem polyfrag is having, as has been explained on page one of this thread.