
Member Since 18 Jan 2008

#5316088 Preventing unintentional pvp (AOE , missiles etc.)

Posted by on Yesterday, 09:42 AM

To me, the possibility of accidentally hitting another player (including friends) with AoE is an important feature, not something one should try to block. Losing karma or being open to unrestricted retaliation quickly teaches the player to be more considerate. But that's just me :)

#5316086 Moving from plain "ascii" to XML?

Posted by on Yesterday, 09:28 AM

Keeping things simple would mean not using an XSD at all. What is it good for, really? You use the XML to store game asset data or game settings, right? So either it goes through a tool in your asset pipeline, which will detect whether everything is good, or the XML is program-generated anyway (so it had better be correct!). You're not writing an academic paper, are you?


Note that parsers which validate against an XSD come as roughly 10x larger libraries, and they are much slower, too.


I've used TinyXML in the past, which works nicely. There is also an in-place parser written by some Russian guy which is much faster (I forget the name!)... used that too, works fine.


But... do you really want to use XML? It is one of those things that somehow, at some point, gained huge popularity that nobody can quite explain (I guess because Java used it for RPC?), but that is actually not all that great. A lot of people who are or have been using it ask themselves: how could I ever have done that? (I'm asking myself that.)


Pretty much everything else is better than XML. Consider JSON if nothing else, or even invent your own data description language (it doesn't need to be able to do everything, you know... it just has to work, and be unambiguous).
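To illustrate the "invent your own data description language" point: a line-based "key = value" format is unambiguous, trivial to parse, and often all a settings file needs. This is purely a sketch; the format and all names here are made up for the example.

    #include <cstdio>
    #include <map>
    #include <sstream>
    #include <string>

    // Parse a toy "key = value" settings format into a map.
    // Lines without '=' (blanks, comments) are simply skipped.
    static std::map<std::string, std::string> Parse(const std::string& text) {
        std::map<std::string, std::string> out;
        std::istringstream in(text);
        std::string line;
        while (std::getline(in, line)) {
            const auto eq = line.find('=');
            if (eq == std::string::npos) continue;
            auto trim = [](std::string s) {
                const auto a = s.find_first_not_of(" \t");
                const auto b = s.find_last_not_of(" \t");
                return a == std::string::npos ? std::string()
                                              : s.substr(a, b - a + 1);
            };
            out[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
        }
        return out;
    }

    int main() {
        auto cfg = Parse("width = 1920\nheight = 1080\ntitle = My Game\n");
        std::printf("%s x %s: %s\n", cfg["width"].c_str(),
                    cfg["height"].c_str(), cfg["title"].c_str());
    }

Twenty-odd lines, no dependencies, and every failure mode is obvious. That's the trade-off against XML in a nutshell.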

#5315214 Running both server and client on one PC causes lag despite ping < 1 ms.

Posted by on 14 October 2016 - 10:15 AM

I hope there aren't any floating point errors that can accumulate and mess up my timers with time

No worries there. Floating point is accurate enough here, with no precision issues -- none that matter to you, anyway. (Read as: you have different things to worry about.)


As stated earlier, SDL's tick count function is not the most precise or predictable/reliable thing in the world. Depending on how SDL was configured, you get one of these three:

  • the result of GetTickCount
  • the result of timeGetTime where timeBeginPeriod(1) has been called at initialization time
  • the result of QueryPerformanceCounter

GetTickCount always runs at 15.6 ms precision, regardless of what scheduler granularity you set. timeGetTime runs at 1 ms precision, but this costs you about 5% in overall performance due to 15x more frequent context switches and scheduler runs. QueryPerformanceCounter is, in theory, much more accurate, but in practice SDL internally multiplies and truncates the value so it has 1 ms resolution (admittedly much closer to a true 1 ms precision than timeGetTime, though).


Therefore, whatever small time deltas you add up, you will not have a precise (or usable) result. Floating point is the least of your concerns here. Adding up timers that way is bound to fail, simply because 10 ms and 10.75 ms are exactly the same, and depending on which timer you have, 10 ms and 15 ms are exactly the same, too.
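To see how fast that truncation error accumulates, here is a small illustration (not SDL code): a 1 ms tick counter reports each 10.75 ms frame as 10 ms, and the error grows linearly. The 10.75 ms frame time is a made-up example value.

    #include <cstdint>
    #include <cstdio>

    int main() {
        double true_sum_ms = 0.0;
        double tick_sum_ms = 0.0;
        for (int i = 0; i < 1000; ++i) {
            const double true_delta_ms = 10.75;          // actual frame duration
            true_sum_ms += true_delta_ms;
            tick_sum_ms += (std::int64_t)true_delta_ms;  // counter sees only 10
        }
        std::printf("true: %.0f ms, from ticks: %.0f ms, drift: %.0f ms\n",
                    true_sum_ms, tick_sum_ms, true_sum_ms - tick_sum_ms);
    }

Three quarters of a second of drift after a thousand frames -- and that's before any floating point rounding enters the picture at all.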


Note that under POSIX, SDL uses gettimeofday, which is somewhat better and more reliable. Microsecond resolution and precision on reasonably recent mainstream hardware.

#5314836 Running both server and client on one PC causes lag despite ping < 1 ms.

Posted by on 12 October 2016 - 07:26 AM

Uh, wait, wait, wait... This is not at all how you do it correctly.


Try rather something like:

    while((packet = peer.Receive()) != 0)



Never receive based on some timer, in particular not a low-resolution timer from which you add up small increments expecting them to be accurate. They are not; but even if they were, you do not want to delay. Why would you?


Whenever there is something to be received (technically it has already been received, you're just pulling it from the queue), by all means, do receive it. Now, not later. You have nothing to gain, and everything to lose from waiting.


Normally, one calls a function like select or poll for that matter, or uses epoll / a completion port / kqueue, or whatever (on a server, anyway; the client probably has only one socket, so you can just as well read it directly, as long as you set the non-blocking flag... saves one system call).


Now, RakNet doesn't expose that functionality (it probably uses one of the above functions internally to communicate with the network stack, though). Instead, you can call Receive(), which either returns a packet or a null pointer. That's OK, this is actually just what you want.


As long as there is at least one packet available, you will get it (usually your game loop runs much faster than packets arrive, so there will be either zero packets or exactly one, but sometimes there may be more). Once you get back a null pointer, you know there is nothing interesting available, so there is no point in wasting CPU time polling any further -- move on.
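The drain-until-null pattern looks like this. Note this is a self-contained mock: Receive() below mirrors RakNet's "returns a packet or a null pointer" contract, but the queue and all names are illustrative stand-ins, not RakNet's real API.

    #include <cstdio>
    #include <queue>
    #include <string>

    // Stand-in for the network layer's already-received packet queue.
    static std::queue<std::string> g_incoming;

    // Mirrors the contract: returns the next packet, or null when empty.
    static const std::string* Receive() {
        static std::string current;
        if (g_incoming.empty()) return nullptr;
        current = g_incoming.front();
        g_incoming.pop();
        return &current;
    }

    int main() {
        g_incoming.push("move");
        g_incoming.push("shoot");
        int handled = 0;
        // Drain everything already queued; stop as soon as we get a null.
        while (const std::string* packet = Receive()) {
            std::printf("handling %s\n", packet->c_str());
            ++handled;
        }
        std::printf("handled %d packets this frame\n", handled);
    }

No timers anywhere: the loop costs nothing when the queue is empty, and never leaves a received packet waiting until the next frame.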


The same goes for sending. If you have something to send (e.g. user pressed a button) there is usually no good reason to hold it back for some milliseconds. That's the opposite of "low latency". Do it right away, not after some timer is above some threshold. RakNet will do the actual send asynchronously anyway.

#5314824 Running both server and client on one PC causes lag despite ping < 1 ms.

Posted by on 12 October 2016 - 06:04 AM

But basically you are right, UNRELIABLE_SEQUENCED is better than RELIABLE_ORDERED, but just slightly.

Not just slightly, for a "fast paced" game.


RELIABLE_ORDERED is basically TCP (it is not -- it's still UDP -- but it behaves the same as TCP). This is great if you need data delivered, and in the correct order, too. This is rarely what one wants in a game, unless performance doesn't really matter all that much (if performance is secondary, knowing that stuff will eventually arrive correctly, and in order, is a big win!).


UNRELIABLE_SEQUENCED is basically UDP with a counter added. Most packets, most of the time, arrive in correct order even though it's not guaranteed. If any datagram is received where the counter is not higher than the previous datagram's counter, it is silently dropped. End of story. This is what most people who use UDP do.
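The counter-based drop rule is only a few lines. This is a sketch of the idea, not RakNet's implementation; a real one also has to handle counter wraparound, which this toy version deliberately skips.

    #include <cstdint>
    #include <cstdio>

    // Accept a packet only if its sequence counter is newer than the
    // last accepted one; anything older (or a duplicate) is dropped.
    struct SequencedChannel {
        std::uint32_t last = 0;
        bool any = false;
        bool Accept(std::uint32_t seq) {
            if (any && seq <= last) return false;  // stale: silently drop
            any = true;
            last = seq;
            return true;
        }
    };

    int main() {
        SequencedChannel ch;
        // Packets 1 and 3 arrive, then 2 arrives late and is dropped.
        const int a = ch.Accept(1);
        const int b = ch.Accept(3);
        const int c = ch.Accept(2);
        std::printf("%d %d %d\n", a, b, c);
    }

One comparison per packet, no buffering, no waiting -- the whole appeal of sequenced-unreliable in a nutshell.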


Hell, no. Why would one want to do this? That's crazy!


Although UDP is "unreliable", it is in a way no more or less reliable than TCP. Both build upon IP, which is unreliable. It's just that TCP resends packets after a while when no ACKs come in, assuming them lost. Any of the RELIABLE modes do just that, too. That sounds great, but it really isn't. What it means is that in addition to waiting for a packet that doesn't arrive, you now wait for a resent packet that comes in maybe 50-100 ms later, and you don't get to look at any other data that comes in during that time, because the network layer holds it back. That, and you possibly lose another packet because it interferes with the resent packet. Plus, there's a chance you now get two copies (which the network layer will silently drop, but you still expend bandwidth on them).


All that when, in fact, you no longer care about that state. It's old and obsolete; you have meanwhile received 3 or 4 more recent packets. You could just as well take the next packet, or only the most recent one, and move on. Forget about the rest. Well... if only you could, but you have no way of doing that, because the network layer holds the newer data back.


This is the one main reason why people (often, not always) choose UDP over TCP for games. If something doesn't arrive, you just forget about it and move on.


Packets being lost is a rare, but perfectly normal condition. You regularly lose a packet or two to congestion control (this is not a failure, it's as designed), but otherwise packets are delivered >99% reliably pretty much all the time. Well, maybe not over WiFi... but as long as cables of some sort are involved, yes. That's true except when something happens, such as a router being flooded so that you suddenly have 100% packet loss for half a second (or longer, if your roommate stumbled over your network cable, or lightning struck the DSLAM box, or whatever). But in that case, anything you try fails equally: resending doesn't help when the resends are dropped, too. Time, however, goes on and on, and on, and the state being resent and resent gets older and older.


Also, please re-read the quote about HIGH_PRIORITY. There is indeed something you can do here as well: use IMMEDIATE_PRIORITY, which will send packets at twice the rate when competing with HIGH_PRIORITY, but, more importantly, will not queue your packets for 10 ms before sending them.

This 10ms queueing is basically a kind of "Nagle's Algorithm" built into RakNet. This is fine for some applications because it may reduce network load, but it's not what you want for realtime updates.

#5314801 Running both server and client on one PC causes lag despite ping < 1 ms.

Posted by on 12 October 2016 - 02:46 AM

peer->Send( &bsOut, HIGH_PRIORITY, RELIABLE_ORDERED, 0, serverGUID, false );

Not sure if that contributes to the problem, but have you looked at the documentation?


For every 2 IMMEDIATE_PRIORITY messages, 1 HIGH_PRIORITY will be sent. Messages at this priority and lower are buffered to be sent in groups at 10 millisecond intervals to reduce UDP overhead and better measure congestion control.



This message is reliable and will arrive in the order you sent it. Messages will be delayed while waiting for out of order messages. Same overhead as UNRELIABLE_SEQUENCED. Sequenced and ordered messages sent on the same channel will arrive in the order sent.



Neither is exactly the kind of thing that guarantees low lag and low latency.


Note that the effect of the second flag is not visible right now; you only feel it in the presence of packet loss or reordering (that is, on a real-life server on a real network, not with 3 clients over loopback). It is something to be generally aware of, though -- if low latency matters for your game.

#5314673 Fast Approximation to memcpy()

Posted by on 11 October 2016 - 08:42 AM

making a good memcpy or memset is excruciatingly hard (huh, did I spell that right?) mostly because you don't know in advance.

Wait, you don't know? What is it you don't know? Well... everything.


You don't know on which CPU your code will run, you don't know how large the block will be (not at the time of writing the memcpy function anyway!) and you don't know what access pattern is going to be used (nor is there a way to communicate this over that API).


  • On some CPUs, something like REP STOS is the fastest way up to about half a kilobyte or a kilobyte. On others, a simple 4-times unrolled loop in C copying a uint64_t at a time is faster. At some point, it turns around.
  • On some CPUs, the SSE4.1 string instructions are fastest; on some they are not, and on some they don't even exist.
  • For blocks greater than so-and-so-many bytes, going through SSE registers either way (even SSE2) beats general-purpose registers.
  • Depending on whether you will soon read the data back (e.g. when filling a buffer handed to the GPU), write-combining writes are faster and more efficient, but in every other case they are slower.
  • On some CPUs, you must use several threads to saturate memory bandwidth.
  • Prefetching is advantageous, sometimes. And sometimes it is only extra instructions that make things slower overall. Usually -- almost always -- your intuition tells you the wrong thing, and you would have been better off not prefetching at all.
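For reference, the "4-times unrolled uint64_t copy" variant can be sketched like this. It assumes the size is a multiple of 32 bytes and the buffers do not overlap; a real memcpy must also handle tails, alignment, and overlap rules (and the char-to-uint64_t access, while fine in practice here, skirts strict aliasing).

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Copy 4 x 8 bytes per loop iteration; size must be a multiple of 32.
    static void copy64x4(void* dst, const void* src, std::size_t size) {
        auto* d = static_cast<std::uint64_t*>(dst);
        auto* s = static_cast<const std::uint64_t*>(src);
        for (std::size_t i = 0; i < size / 8; i += 4) {
            d[i + 0] = s[i + 0];
            d[i + 1] = s[i + 1];
            d[i + 2] = s[i + 2];
            d[i + 3] = s[i + 3];
        }
    }

    int main() {
        alignas(8) char src[64];
        alignas(8) char dst[64] = {};
        for (int i = 0; i < 64; ++i) src[i] = char(i);
        copy64x4(dst, src, sizeof src);
        std::printf("%s\n",
                    std::memcmp(dst, src, sizeof src) == 0 ? "equal" : "differ");
    }

Ten lines of copy loop -- and whether it beats REP STOS or an SSE path still depends entirely on the CPU it lands on, which is exactly the point above.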


SSE-optimized versions may or may not overwrite past the end without causing a fault if you are careless, with no easy way of detecting the problem. Admittedly, non-SSE code can do that as well, but there you still have a good chance of getting a crash (which is a good thing!). SSE writes are pretty much guaranteed to hide the issue.

Why? Because your SSE writes happen to always be properly aligned, and thus always fit snugly into a page (since 4096 is a larger power of two than 16), so you don't get a segmentation fault if your last write accidentally garbled some bytes past the end. All you get is garbled data, and maybe a crash in a different, unrelated piece of code much later -- but never a segfault where it should happen.


I would not be surprised if there even existed systems where remapping pages and thus pulling a page from the operating system's zero page pool was the fastest possible way of zeroing memory (and I wouldn't be surprised if there existed systems which abused DMA in some way (network card, graphics card? disk controller!?) to do memset or memcpy in hardware).


On almost all systems, for all "not terribly huge" sizes, the badly predictable branches necessary to find the optimal solution for each case cost almost as much as they save. So unless you are operating on a small compile-time-known size, you're kind of lost... and in that case, the compiler would replace the vanilla library call with an optimized inline version anyway. Any decent compiler in a release build would, at least.


So, all in all... it's a bit of a totally-no-fun endeavour. I just use the C standard library's function, which, hopefully, is always good enough.

#5312081 Leveling up through mini-quests?

Posted by on 23 September 2016 - 05:19 AM

The problem that I have with these mini-quests is that someone has to generate them, and they will inevitably get boring (there are only so many things you can give as quests).

I somewhat like the way stats work in NetHack. Every time you do XYZ, a stat is "exercised", and every so-and-so-many turns a check is made whether you have exercised, with a small (and diminishing) chance that the exercised stat will go up.

Now, of course, NetHack wouldn't be NetHack if there wasn't "abusing" a stat, too. Which you can easily do, and there are like 10,000 things you need to keep in mind, or you'll regret it.

Basically, remember to carry around enough weight to get stressed (exercises strength), but avoid getting hungry (abuses constitution), and do not carry so much weight that you get strained, or even overloaded, since dexterity (and eventually constitution) will be abused.

But yeah, in general, that's a cool concept. Only a tad too complex for me, with a few too many pitfalls (I guess that's just the challenge for a hardcore NetHack player, though).

#5310332 fast sin approx algo not faster than sin() ?

Posted by on 11 September 2016 - 05:43 AM

Time to test:

    #include "iacaMarks.h"
    #include <math.h>

    double fsin(double x) {
        return 1.27323954474 * x + -0.40528473456 * x * fabs(x);
    }

    int main(int argc, char** argv) {
        double d = 0.0;
        for (int i = 1; i < 1000; ++i)
            d += fsin(argc);
        return (int)d;
    }

(drum roll)
X:\iaca\iaca-win64>gcc -O3 -march=skylake test.cpp -o test.exe && iaca -64 test.exe
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - test.exe
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
Block Throughput: 3.25 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
| Cycles | 2.5    0.0  | 2.5  | 1.5    1.5  | 1.5    1.5  | 0.0  | 2.0  | 2.0  | 0.0  |

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
|   0*   |           |     |           |           |     |     |     |     |    | vxorpd xmm1, xmm1, xmm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | mov eax, 0x3e7
|   2    |           | 1.0 |           |           |     | 1.0 |     |     |    | vcvtsi2sd xmm1, xmm1, ebx
|   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmulsd xmm0, xmm1, qword ptr [rip+0x1305]
|   0*   |           |     |           |           |     |     |     |     |    | vmovapd xmm2, xmm1
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |     |     |    | vandpd xmm2, xmm2, xmmword ptr [rip+0x1309]
|   1    | 0.9       | 0.1 |           |           |     |     |     |     |    | vmulsd xmm0, xmm0, xmm2
|   2^   | 0.6       | 0.4 | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vfmadd132sd xmm1, xmm0, qword ptr [rip+0x130c]
|   0*   |           |     |           |           |     |     |     |     |    | vxorpd xmm0, xmm0, xmm0
|   0*   |           |     |           |           |     |     |     |     |    | nop dword ptr [rax+rax*1], eax
|   1    |           | 1.0 |           |           |     |     |     |     |    | vaddsd xmm0, xmm0, xmm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | sub eax, 0x1
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff9
Total Num Of Uops: 12

Well, coming 100% for free, and being so darn easy to use, I'd say this is a really awesome tool. Only Skylake does not seem to be supported yet; Haswell is the latest arch this version handles.

#5310331 fast sin approx algo not faster than sin() ?

Posted by on 11 September 2016 - 05:27 AM

You may be able to get the number of CPU cycles each instruction costs and explain on a machine code level why they are the speed that they are without even running the code.

This has been nearly impossible to do since at least the time of the Pentium II. The only way to tell how long a piece of code takes is to run it, and even then the exact way in which you run it (cache usage, predictability of branches, whether the result can be computed at compile time, whether the function call can be inlined...) matters a lot.
Well, there is IACA, which does just that. Unfortunately, Intel wouldn't be Intel if their website actually worked and you could simply download it.

Gah... I figured it out. Intel... Khaaaaan!

Looking at where the redirection loop gets you (if you paste the link into wget, it spits out the redirections), you land on a page that requires you to be logged in with an Intel developer account.

Which would be great if only someone told you! There is an "accept EULA, download here" page, and when you click the link, it just keeps redirecting your browser in an endless loop until the browser gives up.

So... the solution is simple: register an account (use a mailinator address or whatever), log in prior to going to the IACA download page, and it works just fine. :)

#5310121 Could you recommend a lightweight database?

Posted by on 09 September 2016 - 09:17 AM

Postgresql is the best free all-rounder. MySQL is also okay.

Is this founded on concrete experience, or even hard evidence? I'm asking because I hear "Postgres is better" a lot, but the only real downside of MySQL that I see is that it's not 100% compliant with every little bit of SQL pedantry.

I've used MySQL for decades, and it has never bitten me. Not in an unexpected way, anyway... sure, if you use a non-ACID engine, you may lose some data... that happened maybe 3 or 4 times in my life. But this is something you know in advance. I have yet to see a failure with InnoDB.

On the other hand, even getting something up and running is (in my opinion) a pain with Postgres, and when I tried, it felt quite a bit sluggish compared to what I was used to. But maybe that was just because I know nothing about Postgres and did something wrong...

About the actual question: does that huge amount of data have to be saved so often anyway? Items change slots maybe a few dozen times during an entire game session, and they change owners only rarely. Why not write trades to a local journal just to be sure you can recover, and coalesce all item updates to the database once per minute (or even less often)?

Similarly, all player stats can easily be held in RAM only, with important things logged to a journal just in case, and regularly flushed. If your server crashes so often that this may be an issue, you have a different problem to look after! Otherwise, if this happens maybe once every other month, no user will be angry about being restored to full health and full mana after a crash... as long as they don't lose the Sword of the Almighty that they won a minute earlier (...which the journal prevents, to the best of your ability).
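A toy model of that "journal now, coalesce later" idea: every item change is appended to a cheap local journal immediately, while the database write is batched so each item is written at most once per flush. All names here are illustrative, not any real persistence API.

    #include <cstdio>
    #include <map>

    struct ItemStore {
        std::map<int, int> dirty;   // itemId -> latest slot since last flush
        int journalWrites = 0;
        int dbWrites = 0;

        void OnItemMoved(int itemId, int slot) {
            ++journalWrites;        // append-only journal entry (for recovery)
            dirty[itemId] = slot;   // only the latest state needs the database
        }
        void FlushToDatabase() {    // run this once per minute, say
            dbWrites += static_cast<int>(dirty.size());
            dirty.clear();
        }
    };

    int main() {
        ItemStore store;
        for (int i = 0; i < 100; ++i)
            store.OnItemMoved(42, i % 10);  // one busy item, moved 100 times
        store.OnItemMoved(7, 3);
        store.FlushToDatabase();
        std::printf("journal=%d db=%d\n", store.journalWrites, store.dbWrites);
    }

A hundred and one cheap sequential journal appends turn into two database writes, and the journal still lets you replay everything after a crash.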

#5309820 Faster Sin and Cos

Posted by on 07 September 2016 - 09:38 AM

So you can guess how much I must have been bored during the last hour. I actually moved all those tweakables to the top of the file, and while at it added support for Linux (using clock_gettime with thread cpu time), and for sorting tests by execution time. :lol:
If you reduce the number of iterations so it doesn't run against the quota, you can run it nicely on Coliru: http://coliru.stacked-crooked.com/a/9bd33fd63db7edd5

(I am unsure what architecture Coliru builds for, but looking at the mere 6-7x performance improvement, I guess it's definitely not something recent with SSE/AVX/FMA support)

Now, please do not mention "faster sin" to me ever again in my life.

#5309788 Faster Sin and Cos

Posted by on 07 September 2016 - 05:25 AM

Almost everything in the test code is kinda easily configurable.


The window for the benchmark and for the max/average/rms error run is set by defaulted parameters. If you want something different, you can just call the test function with whatever you like (note that I'm using a macro to stringify the function name because I'm too lazy to type it twice, so it's probably easiest to just add two params to the macro).


I thought that for benchmarking, +/-3.14 was entirely appropriate. Indeed, I was tempted to only do +/- 1 at first because the execution time should not depend on the input for these approximations (it might, but does not, affect the C lib's execution time).


For the error calculation, it would probably have been slightly more correct to use a greater range. Mea culpa. When writing this, I just copy/pasted the function, edited the name, and changed the body accordingly. I think it's still 99.9% good, though.


If a value >1 or <-1 is encountered, the function increments a counter, then outputs how many offenders it encountered, and then follows up by bracketing all the extrema (I made the assumption that no approximation returns complete garbage, so if there are values outside the correct range, they must be at the extrema). The bracketing is done by bisection, starting with a window 0.1 wide both to the left and to the right of each extremum. This is, for a change, not easily configurable... but 0.1 on each side is quite big; I think it should be enough. An algorithm requiring something larger would mean more than 6% of its output values lie outside the permitted range! An algorithm that does that is complete rubbish.


The "simple" sanity tests are done at precise values (well, as precise as I'm getting them from math.h) for +/- PI, +/- PI/2, and 0.


The monotonicity tests are done by selecting random samples within an interval that starts at the exact location of one extremum and ends at the location of the next extremum, minus eps. The test then calculates f(t) and f(t+eps) and checks that, depending on whether the function is supposed to be rising or falling in this interval, the results compare correctly.


You can change the value of eps (there's a constexpr with that name inside sanity_impl); in the posted tests, I used 0.001 (tried smaller values as well, no big difference -- algorithms that "pass" with one constant pass with another too, and when they "fail", they fail with the other as well... no observable difference on my end, other than the number of incidents being different).

A few lines below, in the same function, there is report_threshold (3 in my tests), which you can change if you want more output lines. The test reports up to report_threshold incidents verbosely, giving location, expected value, and actual value, and only reports the total count for the rest.
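The monotonicity check described above boils down to a few lines. This sketch uses std::sin as a stand-in for an approximation under test (the original harness tests the approximations, of course), with the same eps as in the post:

    #include <cmath>
    #include <cstdio>
    #include <random>

    int main() {
        constexpr double eps = 0.001;               // same eps as in the post
        constexpr double half_pi = 1.5707963267948966;
        std::mt19937 rng(12345);
        // sin is rising on [-pi/2, pi/2]; stop eps short of the right extremum.
        std::uniform_real_distribution<double> dist(-half_pi, half_pi - eps);
        int failures = 0;
        for (int i = 0; i < 10000; ++i) {
            const double t = dist(rng);
            if (std::sin(t + eps) < std::sin(t)) ++failures;
        }
        std::printf("failures=%d\n", failures);
    }

For the real sin this reports zero failures; an approximation that wiggles near its extrema would rack up incidents here.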


Ah heck, I should probably have made all those constants global scope and moved them to the top of the file...

#5309782 fast sin approx algo not faster than sin() ?

Posted by on 07 September 2016 - 04:48 AM

The volatile keyword basically prevents any optimization in the loop.

That's right, but volatile is not supposed to be in the loop.


It is, however, very important. It acts as a sink. The sin function, or any of its approximations, is a pure function: its effects depend only on its inputs, and its execution does not influence any state other than its return value. In other words, calling or not calling the function is the same; nobody but the consumer of the return value can tell the difference -- no externally observable effect.

The compiler will thus, according to the as-if rule, translate this:

    int main()
    {
        for (int i = 0; i < 1000000; ++i)
            expensive_pure_function();
    }

to just this:

    int main()
    {
    }

The use of volatile to consume each result prevents it from performing that optimization:

    int main()
    {
        double sum{};

        for (int i = 0; i < 1000000; ++i)
            sum += expensive_pure_function();   // no volatile here

        volatile double discard = sum;   // sink
    }

Alternatively, you could just print the result (but printing a meaningless number is ugly and confusing), or cast the accumulated result to int and return it from main, or you could omit initializing the accumulating variable. Returning the value from main is somewhat clumsy due to the cast, and will give a narrowing or "loses precision" warning; and the last option uses an indeterminate value, which, although in practice it mostly "works" (the compiler cannot determine what the result will be, so it has no choice but to execute everything), is technically undefined behavior. The thing with undefined behavior is, you just never know for sure. The compiler might just as well show you the middle finger and optimize out the entire main function because of the UB.


With volatile, you are telling the compiler, in an easy and well-defined way, that it is required to set the value of discard at the end (even though it's not otherwise used). It is not allowed to reorder that store, or to omit it altogether.

In order to do that, the compiler must necessarily know the value, and therefore cannot optimize out the calculation. It could, admittedly, run the complete set of calculations at compile time, and for iteration counts below 500 or so, this indeed happens. But for iteration counts in the millions, no chance.
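Putting it all together, a self-contained version of the pattern looks like this. fsin is the parabolic approximation from this thread; the iteration count and the input sweep are arbitrary choices for the example.

    #include <chrono>
    #include <cmath>
    #include <cstdio>

    // Pure function under test: the approximation posted in this thread.
    static double fsin(double x) {
        return 1.27323954474 * x + -0.40528473456 * x * std::fabs(x);
    }

    int main() {
        const auto start = std::chrono::steady_clock::now();
        double sum = 0.0;
        for (int i = 0; i < 1000000; ++i)
            sum += fsin(i * 1e-6 - 0.5);
        volatile double discard = sum;  // sink: forces the loop to really run
        (void)discard;
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::printf("sum consumed, loop took %lld us\n", (long long)us);
    }

Remove the volatile line and a release build is free to delete the entire loop, and your "benchmark" measures nothing.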

#5309700 fast sin approx algo not faster than sin() ?

Posted by on 06 September 2016 - 11:15 AM

Unless every single CPU cycle counts, I recommend you use Spiro's version (text-replace float with double). If that's too slow, use Garrett's.

The reason is that while your version is about 0.5 nanoseconds faster than Garrett's (and nearly twice as fast as Spiro-C11 in double) on average, over 500 million samples (which, frankly, means nothing at all!), it is abysmal when it comes to max error, average error, and root-mean-square error. It also fails every single one of the sanity tests, on every value. In other words, it loses on every metric.

That may not matter at all; it can still be a perfectly usable approximation for your case. But there may be cases (which you cannot even foresee now) where it just might matter, too. And using something that is vastly better doesn't really cost you anything: same timing, more or less, and the code is readily available -- just copy and paste.