About DerTroll

Personal Information

  • Role
    Amateur / Hobbyist
  1. DerTroll

    C++ 3 quick ways to calculate the square root in c++

    Okay, no chance for me to win the smart-ass battle here. I yield 😛! My assembly skills are rather limited. I know enough to understand most of what the compiler produces and to make some performance assessments, but I am far from being able to optimize pure assembly myself. However, I am quite familiar with the intrinsic functions for writing vectorized code, which brings me to your example with the dot product. Maybe I was a little too focused on things like inlining and removing unnecessary operations when I wrote my answer. Unfortunately, auto-vectorization isn't working so well, I agree. Otherwise, I wouldn't be messing around with SSE and AVX myself. Greetings
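Since hand-vectorized dot products with intrinsics came up here, this is a minimal sketch of what such a 4-element dot product can look like. The helper name and the SSE-only horizontal add are my own illustrative choices, not code from this thread:

```cpp
#include <x86intrin.h>

// Illustrative helper: dot product of two 4-element float arrays with SSE.
// Written against baseline SSE/SSE2 so it compiles on any x86-64 target.
float DotProductSSE(const float* a, const float* b)
{
    __m128 va = _mm_loadu_ps(a);          // load 4 floats (unaligned)
    __m128 vb = _mm_loadu_ps(b);
    __m128 prod = _mm_mul_ps(va, vb);     // element-wise products

    // Horizontal add of the 4 partial products:
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(prod, shuf); // [p0+p1, p0+p1, p2+p3, p2+p3]
    shuf = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 0, 3, 2));
    sums = _mm_add_ps(sums, shuf);        // total sum in every lane
    return _mm_cvtss_f32(sums);           // extract lane 0
}
```

For longer vectors one would accumulate the `_mm_mul_ps` results in a register inside the loop and do the horizontal add only once at the end.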
  2. DerTroll

    C++ 3 quick ways to calculate the square root in c++

    Okay, you are technically right, but I think a little nitpicky here. What I meant is that there is no built-in fast approximation of the square root that I know of. Also: "std::sqrt is required by the IEEE standard to be exact." (source). So even if an approximation formula is used to get to the result, the result is exact (in terms of floating-point precision). If you have a look into the Intel intrinsics guide, they also talk about an approximation in the documentation of the fast reciprocal square root (_mm_rsqrt_ss), while they don't for _mm_sqrt_ss. So the term "approximation" refers to the result and not to the method.

    No, I couldn't, because this is the "exact" (see above) and "slow" solution, which should result in the same timings as std::sqrt. I haven't tested it, but I think any smart compiler would use it under the hood if it were faster than the non-vector-register version. Also, compare the latencies of the involved instructions; they lead to the same assessment (see below).

    What I was aiming for here was to calculate the approximate sqrt result as fast as possible by sacrificing accuracy (as intended by the OP). There only exists a built-in fast reciprocal square root but no fast square root (at least none that I know of). Dividing by the fast inverse square root gives an "approximate" result for the square root. However, this will only be faster than the "exact" square root (_mm_sqrt_ss) if you also use another approximation to calculate the reciprocal. That is what I was doing. To give you numbers from the intrinsics guide, here are some latencies for a Skylake:

        _mm_sqrt_ss:                13
        _mm_rsqrt_ss + _mm_div_ss:  4 + 11 = 15
        _mm_rsqrt_ss + _mm_rcp_ss:  4 + 4  = 8

    This pretty much agrees with the results from the benchmark.

    Well, I don't know how much experience you have with benchmarking this low-level stuff, but nowadays compilers are super smart and aggressive when it comes to optimizations. I haven't looked at the produced assembly, but I am pretty sure the function gets inlined automatically. I mean, without using the "DoNotOptimize" function of the benchmark library I couldn't even produce any timings, since the compiler removes the whole operation, probably because it can calculate the result at compile time (or for whatever other reason). Also, seeing that the timings agree well with the latencies from the intrinsics guide, I guess you can expect that the compiler eliminated every possible overhead. In my experience, there is often no point in trying to help the compiler produce more performant code; compilers are so advanced that they can perform most optimizations without any help. I don't want to say that there is no room for optimizations in the presented benchmarks, since I am just a passionate hobbyist and not a professional full-time programmer. However, the things I said are based on the experience I gained by writing a lot of benchmarks and trying to optimize my code for raw performance. If you have made different experiences, feel free to share them. Greetings
  3. Just to clarify: my answer wasn't targeting you in any way. To be honest, I didn't even fully read your post, since the way code is formatted on my smartphone is horrible. I just responded to @Bregma's answer, because it sounds like "singletons are always bad", and I don't agree with that. In my opinion, the pattern is just often overused in a way where it causes more pain than it solves. The same goes for OOP, unsigned ints, and probably a lot of other options C++ provides. Worst thing: most of the time there is no clear indication of which approach is the best, but people get quite religious about dos and don'ts. You can say that about many other design patterns too. But what is the right tool? Singletons are one possible tool. Whether they are the best one is often opinion-based, since the weighting of the pros and cons is also opinion-based. As I said before, I think a memory manager is a good example where it can make sense to use a singleton. You can surely program one without using a singleton, but the only other solutions I know of or have thought of are hard to work with. If you can show me a good alternative to get rid of the singleton without other extra annoyances, I would appreciate it (actually, I have a pretty raw idea floating through my mind; maybe I can figure out something by myself), because singletons are a pain in terms of unit testing. However, for everybody interested, here is a quite interesting StackOverflow question where you get all the pros, cons, and different opinions to make up your own mind: https://stackoverflow.com/questions/137975/what-is-so-bad-about-singletons It was closed because: surprise, surprise ---> opinion-based. Greetings
  4. Like always: use the right tool for the right purpose. Singletons are quite useful in some cases. I use one for my memory management system, and I have seen other memory management systems that did it quite similarly. However, I agree that singletons can cause some serious problems. They are one of many C++ 'features' that tend to be overused. A singleton often provides a fast way to get around function interfaces and removes the need for careful class, function, and program design. If you get unlucky, this might backfire once the singleton is already an integral part of your program. Greetings
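As a concrete illustration of the kind of singleton discussed above, here is a minimal Meyers-singleton sketch for a hypothetical memory manager. All names are illustrative; this is not the actual system from the post:

```cpp
#include <cstddef>

// Minimal Meyers-singleton sketch for a hypothetical memory manager.
class MemoryManager
{
public:
    // Single instance, lazily constructed; thread-safe since C++11.
    static MemoryManager& Instance()
    {
        static MemoryManager instance;
        return instance;
    }

    // Non-copyable: there must never be a second instance.
    MemoryManager(const MemoryManager&) = delete;
    MemoryManager& operator=(const MemoryManager&) = delete;

    void NoteAllocation(std::size_t bytes) { mAllocated += bytes; }
    std::size_t TotalAllocated() const { return mAllocated; }

private:
    MemoryManager() = default; // only Instance() may construct it
    std::size_t mAllocated = 0;
};
```

Every subsystem reaches the same state via `MemoryManager::Instance()`, which is exactly the global-access property that makes unit testing awkward, as discussed above.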
  5. DerTroll

    C++ C++ size_t for everything?

    Don't worry, you didn't. I thought so too. Sure, working with unsigned types can lead to additional errors, but in the end, if you avoid the mentioned errors, it doesn't matter in most cases whether you use unsigned or signed ints. Greetings
  6. DerTroll

    C++ C++ size_t for everything?

    You are right. I got a little confused by the fact that the unsigned version always needs to start with an index increased by 1 compared to the signed version. However, the signed version has a -1 after the size:

        for (int i = a.size() - 1; i >= 0; --i)
            std::cout << a.at(i) << std::endl;

    So the unsigned version is as you said:

        for (size_t i = a.size(); i-- > 0;)
            std::cout << a.at(i) << std::endl;

    I think I am getting old ... -.- The thing is that talking about this stuff always confuses me more than actually working with it xD. I will continue using unsigned integers for values that are inherently non-negative. Usually, the compiler warns me about those signed/unsigned issues. Greetings
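The two loop variants above can be wrapped into small helpers to verify that they visit exactly the same indices (the helper names are mine, for illustration):

```cpp
#include <cstddef>
#include <vector>

// Reverse iteration with an unsigned index: the decrement happens inside
// the condition, so i never needs to hold -1.
std::vector<std::size_t> ReverseIndicesUnsigned(std::size_t size)
{
    std::vector<std::size_t> visited;
    for (std::size_t i = size; i-- > 0;)
        visited.push_back(i);
    return visited;
}

// Reverse iteration with a signed index: start at size - 1, stop at 0.
std::vector<std::size_t> ReverseIndicesSigned(std::size_t size)
{
    std::vector<std::size_t> visited;
    for (int i = static_cast<int>(size) - 1; i >= 0; --i)
        visited.push_back(static_cast<std::size_t>(i));
    return visited;
}
```

Both produce size-1, ..., 0 and both handle the empty case correctly, which is where a naive `i >= 0` condition on an unsigned type would loop forever.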
  7. DerTroll

    C++ C++ size_t for everything?

    The "+1" is correct because "i" is decreased inside the loop's condition, after the comparison is evaluated and before the body is executed. In the signed version, "i" is decreased after the loop's body is executed. WRONG --- see answers below Greetings
  8. DerTroll

    C++ 3 quick ways to calculate the square root in c++

    I hadn't seen that. I just copied the functions without paying much attention to the implementation details. I changed it and reran the benchmarks: no effect for GCC; Clang actually got even slower, no clue why. However, I think modern compilers are quite good at optimizing copy-by-value/ref function signatures, but that's just a guess. I have not tested it, but in my experience, every "obvious" optimization is already performed by the compiler. Well, ICC and Clang are smart enough to optimize the _mm_set1_ps away. I also tried using _mm_set_ss, but it didn't change much, except that GCC replaced the shuffle with an insert. However, if you compare the timings of GCC and Clang, you see no real effect on performance. That's a little confusing to me, since 5 flops (latency of inv sqrt + shuffle) vs 4 flops should have a clear impact (25%), but I guess the compilers are inlining the function in the benchmarks and might perform some other optimizations in this context. I don't really know; microbenchmarking such low-level stuff is always a fight against the aggressive compiler xD. Greetings
  9. DerTroll

    C++ 3 quick ways to calculate the square root in c++

    Okay, hate to say it, since you probably put a lot of effort into research and testing, but it seems like @Hodgman is right and there is absolutely no reason to use the functions you presented. I benchmarked it using google benchmark. Here is the code so you can test it yourself:

        #include <benchmark/benchmark.h>
        #include <cmath>
        #include <x86intrin.h>

        // Functions -----------------------------------------------------------------

        float FastInvSqrt(float val)
        {
            __m128 reg = _mm_set1_ps(val);
            return _mm_cvtss_f32(_mm_rsqrt_ss(reg));
        }

        float FastSqrt(float val)
        {
            __m128 reg = _mm_set1_ps(val);
            return _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(reg)));
        }

        float Sqrt1(const float& n)
        {
            static union {
                int i;
                float f;
            } u;
            u.i = 0x5F375A86 - (*(int*)&n >> 1);
            return (int(3) - n * u.f * u.f) * n * u.f * 0.5f;
        }

        float Sqrt2(const float& n)
        {
            union {
                int i;
                float f;
            } u;
            u.i = 0x1FB5AD00 + (*(int*)&n >> 1);
            u.f = n / u.f + u.f;
            return n / u.f + u.f * 0.25f;
        }

        float Sqrt3(const float& n)
        {
            static union {
                int i;
                float f;
            } u;
            u.i = 0x2035AD0C + (*(int*)&n >> 1);
            return n / u.f + u.f * 0.25f;
        }

        // Actual Benchmarks ---------------------------------------------------------

        static void SquareRoot(benchmark::State& state)
        {
            float a = 100000000.0;
            for (auto _ : state)
                benchmark::DoNotOptimize(a = std::sqrt(a));
        }
        BENCHMARK(SquareRoot);

        static void ReciprocalSquareRoot(benchmark::State& state)
        {
            float a = 100000000.0;
            for (auto _ : state)
                benchmark::DoNotOptimize(a = 1. / std::sqrt(a));
        }
        BENCHMARK(ReciprocalSquareRoot);

        static void SquareRootSSE(benchmark::State& state)
        {
            float a = 100000000.0;
            for (auto _ : state)
                benchmark::DoNotOptimize(a = FastSqrt(a));
        }
        BENCHMARK(SquareRootSSE);

        static void ReciprocalSquareRootSSE(benchmark::State& state)
        {
            float a = 100000000.0;
            for (auto _ : state)
                benchmark::DoNotOptimize(a = FastInvSqrt(a));
        }
        BENCHMARK(ReciprocalSquareRootSSE);

        static void CustomSquareRoot1(benchmark::State& state)
        {
            float a = 100000000.0;
            for (auto _ : state)
                benchmark::DoNotOptimize(a = Sqrt1(a));
        }
        BENCHMARK(CustomSquareRoot1);

        static void CustomSquareRoot2(benchmark::State& state)
        {
            float a = 100000000.0;
            for (auto _ : state)
                benchmark::DoNotOptimize(a = Sqrt2(a));
        }
        BENCHMARK(CustomSquareRoot2);

        static void CustomSquareRoot3(benchmark::State& state)
        {
            float a = 100000000.0;
            for (auto _ : state)
                benchmark::DoNotOptimize(a = Sqrt3(a));
        }
        BENCHMARK(CustomSquareRoot3);

        BENCHMARK_MAIN();

    If you are not familiar with google benchmark (documentation), the command `benchmark::DoNotOptimize` doesn't prevent compiler optimizations on the benchmarked expression itself; it just forces a 'flush' to memory (see the linked documentation). Benchmarking these low-level functions is not as easy as one might think, since nowadays compilers optimize quite aggressively to achieve maximum performance. However, let's get to the results.
    My processor: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

    GCC release build with -O3 gives:

        ---------------------------------------------------------------
        Benchmark                     Time           CPU    Iterations
        ---------------------------------------------------------------
        SquareRoot                   5 ns          5 ns     145729163
        ReciprocalSquareRoot         8 ns          8 ns      91433916
        SquareRootSSE                3 ns          3 ns     207794538
        ReciprocalSquareRootSSE      2 ns          2 ns     298096066
        CustomSquareRoot1            8 ns          8 ns      84971507
        CustomSquareRoot2           11 ns         11 ns      64788053
        CustomSquareRoot3            7 ns          7 ns      98092852

    Clang release build with -O3 gives:

        ---------------------------------------------------------------
        Benchmark                     Time           CPU    Iterations
        ---------------------------------------------------------------
        SquareRoot                   5 ns          5 ns     113007689
        ReciprocalSquareRoot         7 ns          7 ns      93402035
        SquareRootSSE                3 ns          3 ns     207083227
        ReciprocalSquareRootSSE      2 ns          2 ns     295002686
        CustomSquareRoot1            8 ns          8 ns      83792728
        CustomSquareRoot2           10 ns         10 ns      70934801
        CustomSquareRoot3            6 ns          6 ns     117465124

    `SquareRoot` is just `std::sqrt` and `ReciprocalSquareRoot` is `1./std::sqrt`. `ReciprocalSquareRootSSE` is the fast vectorized version I showed you in my previous post. `SquareRootSSE` uses an approximation function to calculate the reciprocal of the reciprocal square root, since there is no approximation function for the square root itself (at least none that I know of). Just using `1.f / ReciprocalSquareRootSSE` produced slower results than the accurate square root. Using two approximations might increase the error additionally; however, in my test, the errors partially canceled each other, but I guess that was just a lucky test case. The other functions are the ones from your starting post.

    As you can see, all of your proposed functions are slower than `std::sqrt`. Only the last one is close to `std::sqrt`, but with lower accuracy. The fast SSE versions are both faster than the accurate versions, but the error is not too small:

        Results for sqrt(5)       ---> accurate: 2.23607  - fast: 2.23633
        Results for 1.f / sqrt(5) ---> accurate: 0.447214 - fast: 0.447144

    I didn't test the errors of your functions, since they already failed the benchmarks. Even though it seems like the gain of the SSE square root isn't really that high, it might get much faster if you calculate multiple square roots at once, since `_mm_sqrt_ps` has a throughput of only 4 while the approximations have a throughput of 1. In this case, some of your functions might get a little speed boost compared to the accurate version, but looking at the dependencies of the involved operations, I doubt it will be much. I think the problem is that modern processors have hardware-optimized functionality to calculate square roots. Unless you find a brand new, extremely simple formula to calculate the square root, you have no chance of beating hardware optimizations. So there is no point in programming functions that are already provided by your hardware. Greetings
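To illustrate the point about packed throughput, here is a small sketch (mine, not part of the benchmark above) that computes four square roots with a single `_mm_sqrt_ps`:

```cpp
#include <x86intrin.h>

// Compute four square roots at once with one packed instruction. When many
// roots are needed, the per-root cost drops compared to four scalar calls.
void Sqrt4(const float* in, float* out)
{
    __m128 v = _mm_loadu_ps(in);        // load 4 floats (unaligned)
    _mm_storeu_ps(out, _mm_sqrt_ps(v)); // 4 exact square roots
}
```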
  10. DerTroll

    C++ C++ size_t for everything?

    Nope, just use:

        for (size_t i = x + 1; i-- > 0;)
            ...

    as an alternative for:

        for (int64_t i = x; i >= 0; --i)
            ...

    Regarding the topic: what I don't like about size_t is that it has a varying memory footprint depending on your build setup. As a programming control freak, I only use size_t when it is necessary, mostly when messing around with STL container sizes and raw memory. For everything else, I think about the expected value range and pick one of the following typedefs:

        #include <cstdint>

        using I8  = std::int8_t;
        using I16 = std::int16_t;
        using I32 = std::int32_t;
        using I64 = std::int64_t;
        using U8  = std::uint8_t;
        using U16 = std::uint16_t;
        using U32 = std::uint32_t;
        using U64 = std::uint64_t;

    If there is no special requirement, I mostly stick to the 32-bit integers (loops, indexing, etc.). The value range is large enough in most cases, and they are twice as fast as 64-bit integers (or size_t in a 64-bit environment) if your compiler is able to vectorize your code. However, for the things you mentioned, I think there is no real drawback in using size_t. Only if memory footprint or (vectorized) performance becomes important should you think about alternatives. Greetings
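The footprint argument can be checked at compile time: the fixed-width aliases above have guaranteed sizes, while `size_t` has none (a small sketch based on the typedefs from the post):

```cpp
#include <cstddef>
#include <cstdint>

using I32 = std::int32_t;
using I64 = std::int64_t;
using U32 = std::uint32_t;

// Fixed-width types have the same footprint on every platform ...
static_assert(sizeof(I32) == 4, "int32_t is 4 bytes everywhere");
static_assert(sizeof(I64) == 8, "int64_t is 8 bytes everywhere");
static_assert(sizeof(U32) == 4, "uint32_t is 4 bytes everywhere");

// ... while size_t only guarantees it can index the largest possible object;
// it is typically 8 bytes on a 64-bit build and 4 bytes on a 32-bit one, so
// a static_assert(sizeof(std::size_t) == 8) would fail on 32-bit targets.
```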
  11. DerTroll

    C++ 3 quick ways to calculate the square root in c++

    I quickly reread everything without paying much attention to details, so sorry in advance if I missed something. You shouldn't forget that this is a forum for game developers; a topic like yours is better suited for a computer science forum. As I wrote in my previous posts, I doubt that it is really relevant for most games, since the number of square root calculations is usually comparatively low. If it gets relevant, the fast-math compiler flag is most certainly the chosen solution for such a low-level problem, especially if the alternatively presented ways violate C++ coding rules. I don't think that any game developer nowadays will spend a lot of time optimizing a square root calculation. They will probably rely on the functionality of the utilized math library, where math experts have already pulled off every possible trick. Only if a benchmark shows that the square root is the reason for a major FPS hit might they get interested in alternative formulations. But in that case, they are probably looking for ways to remove the square root entirely.

    Also, what your post is missing to make it interesting for many people here are hard numbers. You give some advantages and disadvantages, but that doesn't tell much. Pick some specific examples, benchmark them, and compare the resulting floating-point error. This is what most people here are interested in (if they are in the first place) and what they can work with. It will enable them to decide whether one of the presented methods is a true alternative for them. Otherwise, they are just some formulas that aren't written in a C++ standard-compliant way. By the way, this explains why you got the answers you got. Don't get me wrong, I like this low-level stuff, but I think there are just a few people here who can actually provide code for this problem and are also interested in this low-level topic. Most people can only comment on the presented code, since they have never tackled this specific problem because it was never necessary. Computer games are such complex beasts that you avoid spending time on such topics unless you really have to address them. Since there is the standard solution, which is pretty damned fast, why should one even bother, except out of pure curiosity?

    Just take a look here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_rsqrt_ps&expand=4803 The reciprocal square root for vector registers has a latency of just 4 cycles on a Skylake processor. That's the same time as for a single addition or multiplication. Why should a game developer even try to optimize such a fast operation unless he makes a game about calculating square roots?

    Long story short: the problem you are addressing is not really a concern for game developers nowadays. It is more an interesting scientific problem. Hence, the chances that somebody here can really discuss this problem with you in the way you want are rather low. Greetings and good luck

    EDIT: An additional piece of information: http://www.tommesani.com/index.php/component/content/article/2-simd/61-sse-reciprocal.html I just had a short look, but it seems that the SSE reciprocal square root is hardware accelerated ("These instructions are implemented via hardware lookup tables"). So you probably have no real chance to get much faster than that. Only a better accuracy-to-performance ratio might be a reason to pick an alternative function.

    EDIT2: You can compare the performance and accuracy of your implementations against this function:

        #include <x86intrin.h>

        float FastInvSqrt(float val)
        {
            __m128 reg = _mm_set1_ps(val);
            return _mm_cvtss_f32(_mm_rsqrt_ss(reg));
        }

    It uses the hardware-accelerated vector register function to calculate the reciprocal square root. I haven't benchmarked it myself (maybe I'll do it later), but I think you will see that there is not really a point in implementing your own square root (at least for the reciprocal) as a game developer. If you need accuracy, you will use the standard `sqrt`. If you need raw speed, you will probably use the hardware-accelerated version, unless it is not available on your processor.
  12. DerTroll

    Khronos Releases Vulkan Unified Samples Repository

    Many thanks for sharing the link!
  13. DerTroll

    C++ 3 quick ways to calculate the square root in c++

    Check the latencies at the link. This is for vectorized code, but I guess the results are quite similar for scalar instructions. `_mm_sqrt_ps` is the standard square root, while `_mm_rsqrt_ps` calculates the reciprocal square root using an approximation formula as described by the OP. It is about 3.25 times faster (on a Skylake processor) but less accurate. If you perform a lot of square root calculations, this might be beneficial. However, there are many situations where you can simply avoid calculating the square root. For example, when you want to know which vector's length is bigger, you can simply compare the squared lengths instead of calculating the lengths using the square root. No square root is faster than a fast square root 😛 You should generally avoid fast square roots if the result is used in complex subsequent calculations (physics simulations, linear solvers like LLT or QR). The error can lead to unwanted numerical instabilities, especially if you are only using 32-bit floats. However, if accuracy is not an issue, you might get a performance boost. I am not sure if it really matters, since the relative number of square root calculations per frame shouldn't be too high, but that depends on your game/program. Greetings
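The squared-length comparison mentioned above can be sketched as follows (the `Vec3` type and the helper names are mine, for illustration):

```cpp
// Minimal 3D vector for the example.
struct Vec3
{
    float x, y, z;
};

// Squared length: no square root involved.
float SquaredLength(const Vec3& v)
{
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

// Since sqrt is monotonically increasing, |a| < |b| holds exactly when
// |a|^2 < |b|^2, so the comparison works without ever calling sqrt.
bool IsShorter(const Vec3& a, const Vec3& b)
{
    return SquaredLength(a) < SquaredLength(b);
}
```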
  14. DerTroll

    A Dev Team Needed

    click me. I think you have to provide a little more information if you want to find people. Which engine/programming language are you using? What is already there? Is your code on GitHub? Etc.
  15. DerTroll

    move ship

    And the question is ... ?