For the past 20 years, CPU capacity has increased exponentially. Meanwhile, RAM throughput has increased only linearly. The same goes for other forms of storage.
Today, on commodity hardware, all algorithms are IO bound. Put differently: even RAM is no longer capable of providing data fast enough for the CPU to consume. Or: given enough optimization, every algorithm today ends up IO bound.
This trend is unlikely to reverse, since many of these limits are physical: distances to RAM, voltages, heat, ... Especially when talking about real-time work, where the entire state might need to be visited once every 40ms or faster.
The situation with large data sets is so bad that using languages which run 1000 to 1 million times slower than native code has no impact on the final running time (see: the web and petabyte databases). The CPUs just sit there, waiting for data to return.
It would be possible to design computing machines with a different physical architecture, but that immediately makes costs prohibitive. In the real world, trading against storage is not a viable option on current hardware.
The dereferencing itself is the source of the overhead. In most practical cases, though, it is negligible.
Except, in most cases, when dealing with math: large numbers of vectors, coordinate transforms, graphics... where nullable doesn't even make sense. A point or coordinate simply is.
Even when it adds no overhead in code, it's effectively a nonsensical situation. A matrix where one column is null doesn't make sense. A null matrix has no use. A null vector has no use. It just adds an unneeded complication.
It sure would be sweet if the profiler could just say "Hey dude, this function is 60% slower than it could be!
This type of optimization problem can be tackled with genetic programming, where existing code is mutated and profiled repeatedly. The only thing such a search can tell you is whether a faster solution exists within the given search space, and it may get stuck in a local minimum. Obviously, such a test would either need to run forever or have a well-defined search space that signals termination.
You need to really do something here!" ... but then again I guess this is akin to wishing my compiler would just write my program for me. :-)
It's quite doable. If the program's inputs are specified as unit tests, then applying the above method (given truly vast computational resources) could evolve actual code.
P.S. I've been searching the internet for tutorials that can give advice on how to use the various metrics to spot potential problems in a program, but I haven't had much success so far.
Is there a problem? No? Leave it alone.
Does a click of a button take 600ms? Run the profiler on all code paths starting from the button's event handler.
Does the renderer run at 7 FPS? Run the profiler on anything rendering related.
Regardless, Unreal is not an MMO engine out of the box. It requires heavy modification to be used as one.
And PHP is a crappy language that can only be used to build trivial web sites. It's what HTML+JS offers to user that matters, not how it's generated.
You also don't get the source code with the indie version so it's very unlikely anyone could make a MMO with the indie license.
It's impossible to build a 3D AAA-grade MMO as an indie. The upfront cost is simply too high given the competition, and tools like UE and its peers cannot grow organically. With this type of engine, one cannot start with stick figures, progress to simple polygons, and use metrics to iterate toward what works. It's full-featured from day one or bust. Even big studios have fallen into this trap, launching with content for 20 levels out of 50 and hoping to catch up.
Meanwhile, Farmville, Habbo and similar games started small, saw what floated and what didn't, killed features fast and iterated furiously. The same goes for other kinds of startups.
UE and similar engines are Big Iron. They work because they are standardized: one can quickly scale to 10x the head count if needed, since new hires have been trained externally and can jump straight into existing scaffolding and management structures.
Unfortunately, the same goes for the big names, who do exactly the same, but with a $1 billion budget and much more efficiently.
I would think many indie developers would be dropping bricks right about now.
Indie and other small teams grow organically. They focus on the niche specialties each individual has and maximize those. The cost of learning a complete vendor-locked vertical stack is often detrimental in this setting, since it polarizes the available talent (similar to Java vs. .NET shops, Ruby vs. Python, ...). And this type of fashion statement matters a lot on a shoestring budget.
An indie team also does not have the capacity to produce the AAA-grade product such an engine implies, and anything less will come out sub-par.
The real customers are teams that secure funding: teams that can afford not only to hire arbitrary talent, but also to purchase any cost-beneficial technology along with the needed oversight and management.
In this regard, this release has no impact. It's probably more of a response to Unreal, which definitely does not have the "they have yet to launch a game" problem.
Original post by maspeir I'm working on a program unrelated to games. I need to calculate the bi-weekly pay periods for this year and several years out. I can do this easily enough in a spreadsheet, but I need to do it in code.
Oh you poor thing. You have my sympathy. And in C of all the languages... Why are you being tortured this way? Can't you appeal to the Geneva Convention? Or request asylum in some faraway land?
Let's say the pay period starts today. In Excel or OpenOffice, I would have a cell (say, A1) with "5/9/2010". In the next column, I would have a cell with "=A1+13", which would give me the correct ending date of 5/22/2010. I can calculate all of the previous and future pay periods in the same fashion. Again, this is easy in a spreadsheet.
Well, that works as long as the locale doesn't change. But consider that holidays depend on the country you are in, that time adjustments depend on geographic location, that there are leap years and other adjustments, and that you may need to allow for arbitrary corrections to any of the above...
How do I go about this in code? It isn't a simple matter of adding days to the date. For portability and ease of coding, I'm using ANSI C date functions.
Next, either start reading incredibly carefully through all the documentation on C date functions *for your platform and compiler version* and work around the potentially missing functionality.
Alternatively, use a saner platform, such as Java with the JodaTime library. The sheer effort needed to get this right is worth switching platforms for.
Yes - it really is such a horribly convoluted problem, since it's not mathematically pure but needs to take into account potentially different historical timelines as agreed upon by various international committees.
Original post by SoldierX But now I planning to start a rather advanced game project, and I want it both to be cross-platform (Win+Linux+Mac, that is) and have high performance (stunning 3d graphics, yadda yadda). So, C++ fulfills these requirements. But what alternatives do I have?
Did you get the library verified by the lawyer to ensure it is compatible with your project's and your company's release policies? Did they verify that the implementation does not violate any patents? Did you verify that the author of the library, or any third parties associated with it, are not residents of a country banned from trade in any of the markets where you wish to ship? Has all of the above been documented and filed? Has the code in the library been verified as original work, or has the author signed a work-for-hire agreement or some other transfer of ownership to your company?
Did you submit the library to your build master for integration with the build system? Have the integration and other tests been written?
Did you coordinate the introduction of new third-party library with the project manager for inclusion into the project life-cycle? Has risk analysis with regard to third-party library lifecycle been integrated into your product's life cycle?
All of the above are issues that will need to be cleared in any reasonably-sized company. This 1 minute download just caused 50 hour workload for your company across multiple departments.
Imagine that the library happens to contain something that violates some patent (perhaps it uses a linked list), and it gets integrated into your flagship product. Suddenly, you have IBM suing you for hundreds of millions of dollars.
This is why open source matters only in some rare cases; the rest of the code lying around on source-hosting sites simply doesn't exist. It's too expensive.
As Microsoft once said, Linux is only free if your time is worthless. Most people do not understand that and laugh at Microsoft. And for a startup with no revenue it's true: FOSS is great. But as soon as money is involved, things get very complicated.
The benefit of built-in libraries, or those provided by software vendors, is that they are safe from most of the above. Hence, writing something from scratch, despite potential bugs, is a considerably better choice in such a setting.
CPUs are really good at data-parallel operations (a crappy explanation, admittedly). SIMD and GPUs are all about that, on beefier hardware. A tutorial.
SIMD is just a CPU-friendly way to formulate an algorithm as data parallel.
Many problems cannot be expressed this way. Some are naturally data parallel; some are even embarrassingly parallel. For those that can be, intrinsics are a way of encoding such an algorithm in the most hardware-friendly way.
What exactly is the reason, that x87 code is often faster
- The algorithm is not suitable for data parallelism
- The algorithm isn't applied properly
Code which relies heavily on conditional execution, or which has highly non-deterministic flow, is not suitable for most forms of data parallelism, with the exception of certain types of hardware or an abundance of memory. SIMD has neither; GPUs are only barely better.
that i just call one function thousands of times
The overhead here is the killer.
The point of SIMD (especially the Streaming part) is to call the function once and have the intrinsics process a million elements in that same call. Same for GPUs and batching: minimize the number of calls, and especially data conversions. Ideally, you compute the square roots of the entire data set in a single pass.
Think Formula 1. It needs to be brought to the race track by truck, it needs to be tuned, ... but then it keeps going 300 mph for the 3 hours it is on the track. That is SIMD: it takes time to prepare, but then it's bitchin' fast. Calling the function for each element is a Formula 1 in downtown New York. 50 mph, red light, 50 mph, red light, 50 mph, red light.
Original post by Icebone1000 Its still not very clear to me, you explained more what is cache 1, I want know more why stack data are most likely to be cached.
Every memory access goes through the cache. The cache is small (~64KB for L1, perhaps 512KB for L2), so when it's full, the least recently used cache line must be evicted and replaced with new data. In addition, the cache is organized in lines: a 64KB cache with 64-byte cache lines can only hold 1024 different mappings.
For this reason it makes sense for all memory accesses to be as local as possible. Rather than reading from addresses 17, 543 and 1024 (three separate 16-byte loads, consuming 3 cache slots), it's better to read from 17, 18 and 19 (one 16-byte load), which all fall in the same cache line and consume a single slot.
Stack memory access, by definition, has the best cache locality. The stack either grows continuously upward or downward, but rarely makes large jumps.
Typical function call (trivialized):
// stack pointer is 1200
push 10   // offset 0, obligatory cache miss: 1200 is not yet in cache, loads bytes 0..16 into cache
push 20   // 1204, offset 4, in cache
push 30   // 1208, offset 8, in cache
// stack pointer is now 1212
call foo
foo:
mov c, sp-12   // load 10 from offset 0 (stack pointer - 12) into c, from cache
pop a          // pop 30 from offset 8, from cache
pop b          // pop 20 from offset 4, from cache
// do something with a and b
ret -4         // return and discard the last variable, which was read directly above
This makes passing parameters and all call-stack operations very cache friendly and as cheap as they can be when registers aren't available.
Note that the stack pointer is handled implicitly. If an array were used instead, the code would look like this:
...
c = arr[offset - 12];
offset -= 4;
a = arr[offset];
offset -= 4;
b = arr[offset];
This is the work done by the compiler and the push/pop operations, and it is considerably more efficient than manually managed stacks, vectors or arrays. It is like a std::vector or std::stack, but one that the CPU can manage on its own.
While globals are more likely to appear randomly spread over the code.
There are no globals as far as the CPU is concerned; it only sees addresses (0x0324122) *each time* a memory access is made. Caching is done based on addresses, not content or context.
If the memory at a global's address, or the first x bytes from that address, is accessed frequently enough, it will remain in cache.
with new values, so let's say the CPU already got a working set of local hard-coded variables in the cache... is that the behavior?
Hard-coded applies to the memory addresses of stack values. Rather than typical data structures, which need to chase pointers and determine offsets, members on the stack can be addressed directly relative to the stack pointer. It always points to the top of the stack, so something like [sp-16] inside a function means "access a specific member of the stack frame", which the compiler has generated to mean c in the function's parameter list.
"The stack is the most efficient place to store data because the same range of memory addresses is reused again and again."
Due to the implied order of operations, the most frequently accessed data on the stack is almost certainly always in L1 cache.
The cost of stack allocation is zero, since it is unchecked and most of the time evaluated at compile time and hard-coded.
Explain me that. Why the hell it will be mirrored? Just because is the same used again and again range of addresses?
L1 cache has an access latency of a few cycles. Accessing memory in L2, L3 or DRAM has many times higher latency, up to hundreds of cycles.
So you read one byte, and the CPU waits 500 cycles for the memory bus to provide the data. Then you read the next byte, and the CPU waits another 500 cycles.
Instead, when you read the first byte, the CPU loads the next 16 (or 64, or however many) bytes into L1 cache in a single read. So the first access pays the 500-cycle penalty, but the next 100 accesses cost only a couple of cycles each. The design of the stack fits this pattern well in typical usage.
L1 cache is very fast; it is physically located inside the CPU (to avoid speed-of-light delays), but it overheats and is very expensive to produce, so it is small (on the order of 64KB). The better this cache is utilized, the faster memory accesses are.
DRAM is huge and cheap, and doesn't overheat or draw much power, but it is located far away and the bus accessing it runs at only a fraction of the CPU's frequency, both of which mean high latency when accessing such data.
Both of these favor bulk data accesses, so locality of reference is preferred.
Who cares about the addresses, if data is different it will have to be updated every time...
Writes are indeed more complicated. On a single-core CPU, writes can still be done directly into cache and eventually be evicted to main memory.
Each thread has its own stack, and threads are the most common way to distribute code across cores, so there are no write conflicts. The stack, in this respect, is a shared-nothing design.
Multiple cores accessing shared memory do run into this problem, and the code must either be reorganized to schedule writes in a way that masks the write latency via pipelining, or accept the stall.
What am I missing?
The fundamentals of hierarchical storage architecture and modern CPU design.