OneTrickPony82

  1. It's because the tree includes random elements (cards appearing on the table), and the optimal representation is different for every set of cards on the table. The table only serves to map indexes, but it's best to use a different order for every situation. This is probably too domain-specific already :)

Not only are there 2000 of them, they are different for every tree we solve.

We only compile for x86-64. x86 is pointless in our case because the trees we solve wouldn't fit in available RAM with 32-bit pointers.

The dist arrays are different every time the function is executed (so a million different ones per iteration, and then they change again in the next iteration).
  2. >>You could port some performance sensitive parts of your code from C/GCC to ISPC

Thanks, I hadn't heard about it. I tried the Intel C compiler but it produced slower code than GCC 4.8 for me (though that could be the result of me developing/benchmarking under GCC 4.8 in the first place).

>>1) sum is a tempting target. Could you calculate sum when you calculate dist, instead of in this loop? If you use the same dist array for more than one run, that could save a little time.

Yes, it's actually more than one run, but not massively more (usually 2 or 3 per iteration). I think it will save some cycles. Thanks.

>>2) In the second loop, you add sum to every element of matchups. Could your algorithm just assume that matchups is normalized to start at sum? Or maybe apply sum when you access the matchups array later, instead of adding it every time?

It adds it because of the inclusion/exclusion principle (the result is the sum of all, minus those which overlap at one card, plus those which overlap at two cards, which is only the value at the given position). I could actually avoid adding sum right now in some places (but not all), but this addition just isn't that tempting to optimize away.

>>3) You are having trouble getting your compiler to vectorize loops for you. I'm not good at this either. But you could try a couple of things. One would be to try a different compiler. I'm completely out of date here. But certainly you could try Visual C++. Intel's C compiler used to be known for optimization. Even if your target platform isn't supported you could "borrow" the assembler output.

I develop for Windows and I switched to GCC because VS is in my view a crappy compiler (it doesn't support C99, only supports OpenMP 2.0, and produces slow code). Intel is decent but produces slower code for my case as well. I've found this is a very controversial subject and a lot of programmers better than me swear that either VS or Intel is better, so I am not looking for a religious war.
>>4) Or, split your loops into multiple loops until the compiler vectorizes at least one of the loops. matchups = sum + dist certainly should vectorize on its own, I would think.

Yes, I've tried it but sadly it's slower.

>>I somehow doubt it can get very much faster though.. are those 'table' and 'dist' arrays very large and only used once, or are they used multiple times?

There are many table arrays. When we go over a huge decision tree (millions of nodes) we pick the proper one; there are about 2000 of those arrays. Dist is different at every leaf of the tree (it's the probability distribution at the current point). As table repeats over many leaves (but dist does not), I am pretty sure it's in cache already. Sadly I never figured out how to measure that on Windows, and as my customer base is 95% Windows and 5% Mac (but they have Parallels/Bootcamp anyway), there wasn't enough incentive to port to Linux yet. Having access to better profiling tools might be one, though.

>>Are you sure you get any vectorization at all from GCC?

Yes, I am sure. Overall my program gains about 15% from enabling AVX alone. Maybe I write very simple code, but I've found that GCC's auto-vectorizer really does a decent job.

>>I pasted your code into VC++ 2015 and it didn't vectorize anything.

VC is in my view a crappy compiler, but again, a lot of people disagree (often people I have a lot of respect for). I code in C, so I have no opinion on how it handles all the C++ madness, but on numerical code it is just slower on everything I've ever tested.

>>(Also I was mistaken about scatter, only gather is currently available, scatter is only in Xeon Phi, though you don't need it for the second loop, but in order to vectorize the first efficiently you would need scatter).

I can't wait for a new Xeon Phi. I have zero experience with the older ones, but what they say about them in the presentations looks like a perfect fit for my problem (way better than GPUs).
Gather alone would speed it up significantly, I believe (and even more so the other similar function I have).

Thanks for your help guys. I was really mostly interested in the "array in the registers" part, but I've learnt a few more things on the way :)
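Suggestion 1 above (hoisting sum out of the first loop) can be sketched as follows. This is only an illustration: fill_dist_and_sum is a made-up name, and the dist values are placeholders, since the real ones come from the tree walk.

```c
/* Accumulate sum wherever dist is produced, so the kernel's first
   loop drops one add per item. Placeholder weights for illustration. */
float fill_dist_and_sum(float *dist, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        dist[i] = 1.0f / (float)(i + 1);  /* placeholder weight */
        sum += dist[i];                   /* fused accumulation */
    }
    return sum;
}
```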
  3. >>Low level optimization is only useful after high level optimization is exhausted.

I find low level optimization a very interesting problem in itself. There is surely more to be won in high level things, and we are working on them as well. Still, no matter what kind of algorithmic improvement we make, those functions will always be a bottleneck. So yeah, I find it both important and interesting.

>>Replacing or tuning high level algorithms can reach many orders of magnitude of difference

We are already doing something faster than the publicly known "state of the art". Surely there is far more to be done (and I know some people in my domain got significant high level improvements), but getting there is very difficult as well.

>>I'm sorry you feel that is condescending, but it is the truth.

Low level optimization is still interesting. It can still bring very big improvements. High level algorithms are important, but it's not like we (and other people) are not working on them. It's just not the topic of this conversation.

>>No matter how much you tune your 18 speed bike it will never race around the track like an F1 racer. No matter how fast you get your F1 racer to run on land speed records, rockets will reach faster speeds.

This can be used to argue you should never do low level optimization. It's just meaningless. Low level things matter to some people; I am one of them.

>>When looking for major performance improvements you must first look at the biggest practical high level changes. Only after that is done do you look to low level tunings.

I have been working on it for more than a year, and I based my work on what other people in the field achieved over many years of work. You may say it's "after that" already, even though there are for sure a lot of high level improvements to make as well.

>>That's why people are asking about your algorithm choices, your actual performance needs, the scale of your work.
Why can't we just discuss an interesting low level optimization problem? I find low level optimization interesting in itself. I am not very good at it and I hope to learn more about it.

>>If you get a newer CPU with AVX2 I think you can make the second loop significantly shorter when vectorized to do 8 floats at a time as there are scatter/gather load/store ops.. so by tweaking the table to be laid out as a proper mask that could be quite quick.

Wow, this is very interesting. I wonder if GCC will be able to apply those gather/scatter instructions by itself. Unfortunately right now I am working on an Ivy Bridge CPU, but I think I am going to get a newer one just to see.

>>(I'm assuming all combinations of 52 cards are possible for the 2 picked, if you could do some high-level changes that divided the problem in parts where each inner-loop only handled part of that range at a time, then that would probably help a lot).

card1 is always different from card2, and I can force card1 to be bigger (or smaller) than card2, but that's unfortunately all.

>>Do you need full floating point accuracy, or would a smaller fixed-point format do?

Hard to say, but probably I need full floating point here, as I am already suffering a bit from having only 32 bits (those numbers get summed up over a huge tree).

>>Whats the oldest CPU you need it to run on?

It's already fast enough for most use cases. Making it faster matters for people who can afford very powerful hardware, so even an improvement which only works on, say, Haswell will be worth something to me.

>>Do you already make use of multithreading at a higher level of the algorithm?

The algorithm is very easily parallelizable and I can basically use as many cores as I am given by splitting the task well before those functions are called (basically every thread gets its own independent branch of the tree).

>>Have you tried partially unrolling the loops? e.g. use [i..i+9] within the loop body and increment the counter by 10 at a time. That will increase your register usage while also helping out the CPU's internal scheduler and prefetcher.

I tried and the time was the same. That's why I am pretty sure reads from L1 are the bottleneck: even with unrolling, the values are written to memory and then read from it in the 2nd part.

>>You could always port your entire algorithm to the GPU instead of just this one part

It's not an easy problem. While the whole thing can be reduced to linear equations, that's not very efficient (the resulting problem is very big). So far most people in the field think CPUs are better suited for it, but we never know.

>>Just a quick remark: even if it's unfeasible to keep everything into registers, it may be way more feasible to try to keep things in the cache, right? Since accessing the actual RAM is a severe performance penalty, keeping everything you need into the cache would definitely make for a huge speed improvement.

Sadly, it's the reads from L1 which are the problem already.
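The AVX2 gather suggestion for the second loop could look roughly like this. It is a sketch under two assumptions the original code does not make: the card indexes are pre-widened to int32 and split out of the Elem structs into two flat arrays (idx1, idx2), and the machine running it supports AVX2 (Haswell or newer, hence the target attribute).

```c
#include <immintrin.h>

/* Sketch: second loop vectorized 8-wide with AVX2 gathers.
   idx1/idx2 are assumed pre-split, int32 card indexes. */
__attribute__((target("avx2")))
void matchups_avx2(float *matchups, const float *dist, const float *dupli,
                   const int *idx1, const int *idx2, float sum, int n)
{
    __m256 vsum = _mm256_set1_ps(sum);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i i1 = _mm256_loadu_si256((const __m256i *)(idx1 + i));
        __m256i i2 = _mm256_loadu_si256((const __m256i *)(idx2 + i));
        __m256 d1 = _mm256_i32gather_ps(dupli, i1, 4);  /* dupli[card1] */
        __m256 d2 = _mm256_i32gather_ps(dupli, i2, 4);  /* dupli[card2] */
        __m256 v  = _mm256_add_ps(vsum, _mm256_loadu_ps(dist + i));
        v = _mm256_sub_ps(v, _mm256_add_ps(d1, d2));
        _mm256_storeu_ps(matchups + i, v);
    }
    for (; i < n; i++)  /* scalar tail */
        matchups[i] = sum + dist[i] - dupli[idx1[i]] - dupli[idx2[i]];
}
```

Whether this beats the scalar loop depends on gather throughput on the given microarchitecture, so it would need measuring.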
  4. >>No expert either, but I'm pretty sure counting cycles is a fairly meaningless metric on today's technology.

It's quite meaningful. I am measuring stuff in wall-clock time as well, but I've found measuring cycles more convenient and very reliable. Calling rdtsc is cheaper than calling an OS API function to get the time, so there is that.

>>Take a step back. Can you find any heuristics? Divide and conquer? Or parallelize?
>>This reeks of http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem

Guys, I have been programming this thing for more than a year. I am working on it full-time and the project is quite successful. I talk regularly to other people who program in the same domain, and we all agree those functions are a bottleneck. Posting "the big picture" here is pointless, as it would take you at least several weeks to get up to speed (and that's assuming you are 10x more talented than I am, which I am sure you are). I find the attitude a bit condescending. I've asked a well defined question about what I consider an interesting optimization problem, and you are assuming I am missing something in the big picture. Maybe I am or maybe I am not; it isn't meaningful. I am very interested in whether it's possible to optimize this piece of code beyond what I was able to do myself. It's an interesting problem in itself if anything.

There are no magic shortcuts in what I am doing. There are places which could be optimized, and this is one of them. The code is already highly parallel, divide and conquer doesn't apply to algorithms of this class, and heuristics may or may not apply, but that's a completely different issue.

I've posted here because I know some game developers are very good at low level optimization. I am not very good at it and have no experience in assembly, so I am sure there are tricks I am unaware of. I really don't know why condescending answers assuming I don't know what I am doing get upvoted on this forum.
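The rdtsc approach mentioned above can be done with compiler intrinsics. This is a minimal sketch: cycles_now, toy_work, and measure_cycles are made-up names for illustration, and on modern CPUs the TSC ticks at a fixed reference rate rather than the actual core clock, so the counts are best treated as relative measurements.

```c
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t cycles_now(void)
{
    _mm_lfence();   /* keep earlier instructions from reordering past the read */
    return __rdtsc();
}

static volatile float sink;  /* keeps the toy workload from being optimized out */

void toy_work(void)
{
    float s = 0.0f;
    for (int i = 0; i < 1000; i++)
        s += (float)i;
    sink = s;
}

uint64_t measure_cycles(void (*fn)(void))
{
    uint64_t start = cycles_now();
    fn();
    return cycles_now() - start;
}
```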
  5. >>How slow is too slow? 8 thousand cycles (is that right?) is small enough to be measured in microseconds. How fast does it need to be?

The faster the better. This function and one very similar to it (more complicated but with a similar structure) are the bottleneck. The algorithm works on a huge decision tree and calls those functions hundreds of thousands or sometimes millions of times per iteration.

>>It's often good practice to perform all reads from memory at the top of a block, do compute in the middle, then perform writes to memory at the bottom. e.g. you can be sure that this block won't constrain the optimizer:

Thanks, this is a useful suggestion, but in this case it doesn't change anything. When I measure only parts of the function, I get that the first loop takes 4.4 cycles per item and the 2nd part 3.5 cycles per item. That makes sense, I guess, seeing how there are 3 loads and 2 writes in the 1st loop while the 2nd one has 3 loads and 1 write.

The thing is that in this function we first add numbers to 52 values in memory just to load them again in the 2nd part. My idea was that there is enough space in SIMD registers to keep 52 floats, so maybe something like:

1) read card1, card2
2) load a dist
3) retrieve the values of the registers corresponding to card1/card2, add dist there, and store again

Then, when iterating again, we have those in registers as well, and maybe retrieving them and doing operations on them would be faster than reading from cache. I realize you can't just add to, say, the 3rd float of the 6th SIMD register, but you can first extract it, then add, then store it again (I think?).

Anyway, thanks very much for your help. I've been looking at this function for a year now and had basically lost all hope already, but sometimes I wake up with an idea and can't sleep until someone debunks it :)
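The "extract, add, store again" step is where this idea runs into trouble: x86 SIMD has no register-indexed lane access, so updating a lane chosen at runtime (like one picked by card1/card2) round-trips through memory anyway, which is the same L1 traffic the idea tries to avoid. A minimal sketch of that round-trip, with hypothetical helper names and assuming an AVX-capable CPU:

```c
#include <immintrin.h>

/* Add x to lane `lane` of an 8-float register, where lane is only
   known at runtime: spill to the stack, update, reload. */
__attribute__((target("avx")))
float lane_add(__m256 *v, int lane, float x)
{
    float tmp[8] __attribute__((aligned(32)));
    _mm256_store_ps(tmp, *v);   /* spill the register to the stack */
    tmp[lane] += x;             /* scalar update of one lane */
    *v = _mm256_load_ps(tmp);   /* reload into the register */
    return tmp[lane];
}

/* tiny demo: add 2.5 to lane 3 of a register full of 1.0s */
__attribute__((target("avx")))
float lane_add_demo(void)
{
    __m256 v = _mm256_set1_ps(1.0f);
    return lane_add(&v, 3, 2.5f);
}
```

Compile-time-constant lanes can instead use fixed-index extract/insert intrinsics, but card1/card2 are data, not constants.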
  6. >>What is it your program is doing, anyway?

Solving a popular card game.

>>GPU would be faster than CPU for that size of array, but not if you have to ship the data over and back.

The problem is that the algorithm I am using (like all the fast known algorithms in my domain) is sequential in nature, so the CPU is better at the other parts. I would need to copy back and forth here.

>>It's irrelevant any way. If there is a bottle neck, big if, it will be in getting the items from memory into cache.

This is incorrect. The 2nd line of the 2nd loop takes 3.3 cycles out of the 8 the whole thing takes. Reading into cache is not a bottleneck here. The problem is that in those loops you need to write (in the 1st half) and read (in the 2nd half) from a short local array. You need to write/read 2 elements at a time, which makes it not vectorizable. This array is in L1, but the reads from L1 are still a bottleneck.

There are enough registers in the CPU to hold 52 floats. It kind of looks like it should be possible to pull off, but I am not good enough at assembly.
  7. >>AFAIK, there's no such thing as an array of registers, other than memory

Yes, I know, but here I have 52 elements only. If it were 2 elements, it would be easy:

if (card1 == 0)
    write_to_register1;
else
    write_to_register2;

Unfortunately, putting a 52-element if-else chain in there isn't too appealing. If it were somehow possible to keep the name of the register in memory, though, then it would be doable.

>>you may want to consider using a GPU.

The thing is that those arrays are relatively short (1100 elements or less) and I am skeptical about a GPU being faster at this than the CPU (because the memory copying alone would already be quite costly relative to the total execution time). It really is a bottleneck; going down to, say, 6 cycles per item would make everything way better.

>>An array of 1000 items doesn't sound very useful. Maybe if you add the next step too, and just compute the items you really need?

This is a probability distribution and the results are needed for every item. I REALLY need all of them. This is not part of the problem; the code is part of an algorithm which calculates results for all possible items. "The items I really need" doesn't make sense in this context.
  8. Hello all,

There is the following problem:

1) you get an array of, say, 1000 floats which represent a probability distribution over 1000 possible items
2) every item is made of two different elements which are drawn (without replacement) from a pool of, say, 50 elements
3) at any given point in a game you and your opponent each hold one of the items according to the distribution in 1)

Now the task is to calculate, for every item, how likely you are to hold it, or in other words in how many ways it's possible to construct a pair of (our_item, opponent's_item). The answer isn't just the distribution in 1), because as the elements are drawn from a common pool, the opponent's distribution influences ours.

To be a bit less general, it's best to think about a card game. Any card game would do, but let's stick to a traditional deck of 52 cards and hands made of two cards. If, for example, our distribution of hands is AsAh:0.5, KsKh:0.5 and our opponent's distribution of hands is AsAd:1, our real probabilities are 0 for AsAh and 1 for KsKh (as AsAh is not possible for us to hold).

The algorithm to calculate it in linear time is this (it's based on the inclusion/exclusion principle):

Elem* table = retrieve_proper_table(game_state);
float dupli[52] = {0};
float matchups[1000] = {0};
float sum = 0.0f;
for (int i = 0; i < 1000; i++){
    dupli[table[i].card1] += dist[i];
    dupli[table[i].card2] += dist[i];
    sum += dist[i];
}
for (int i = 0; i < 1000; i++){
    matchups[i] += sum + dist[i]; /* inclusion/exclusion principle */
    matchups[i] -= dupli[table[i].card1] + dupli[table[i].card2];
}

Where table is an array of structs like this:

typedef struct{
    uint16 card1;
    uint16 card2;
} Elem;

which represents the elements (in this case cards) the item at the n'th position is made from. dist is the array of floats from 1).

This algorithm takes about 8 CPU cycles per item in the array to execute.
This is because you first need to read all the elements of the distribution, add the weights at corresponding positions to an array of duplicates ("dupli"), and then iterate again, this time subtracting the corresponding elements. This is a lot of writes and reads from memory. Even if it's all in L1 cache it's too slow. Especially painful are the reads from the dupli array in the 2nd iteration, as they make vectorizing the loop impossible (the GCC vectorizer says "data not suitable for gather", and it's no surprise, as we need to read 2 elements there).

The thing is, in my case there are only 52 possible elements the items are made from. I am thinking that maybe it would be faster if we could keep all the 52 required floats in registers instead of an array. The problem is I don't know how to go about it. I can change the way the Elem struct looks and keep whatever info is needed there. Is it possible to keep names of registers there? I imagine something like that should be possible.

The way it would work: we iterate over the distribution; in the corresponding array of Elems we have the names of the registers we want to keep the sums for the given cards in; then we use the same registers when iterating the 2nd time to subtract. That would spare a lot of reads from L1 in the 2nd iteration, as we wouldn't need to read from the dupli array anymore (it would be in registers) and would only read from the table array.

I realize this is quite a domain-specific problem, but maybe it's interesting for someone, and maybe it's applicable to different card games or some other problems. I've spent quite a lot of time experimenting with and optimizing it, and I am unable to beat the times I am getting (again, about 8 CPU cycles per item on an i7 3770 compiled with GCC 4.8).
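For anyone who wants to play with the kernel, here is a self-contained toy-sized version, checked against a brute-force double loop. It is a sketch, not the real code: the table is built locally instead of coming from retrieve_proper_table, and the sizes are shrunk (6 cards, all 15 two-card items) so the check is cheap.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t card1;
    uint16_t card2;
} Elem;

/* The two-loop inclusion/exclusion kernel from the post, as a function.
   dupli[c] accumulates the total weight of items containing card c;
   matchups[i] is then the weight of all items sharing no card with item i. */
void matchups_kernel(float *matchups, const float *dist,
                     const Elem *table, int n)
{
    float dupli[52];
    float sum = 0.0f;
    memset(dupli, 0, sizeof dupli);
    for (int i = 0; i < n; i++) {
        dupli[table[i].card1] += dist[i];
        dupli[table[i].card2] += dist[i];
        sum += dist[i];
    }
    for (int i = 0; i < n; i++) {
        matchups[i] = sum + dist[i]            /* inclusion/exclusion */
                    - dupli[table[i].card1]
                    - dupli[table[i].card2];
    }
}

/* Brute-force reference: total weight of items sharing no card with item i. */
float matchups_brute(const float *dist, const Elem *table, int n, int i)
{
    float expect = 0.0f;
    for (int j = 0; j < n; j++) {
        int clash = table[j].card1 == table[i].card1 ||
                    table[j].card1 == table[i].card2 ||
                    table[j].card2 == table[i].card1 ||
                    table[j].card2 == table[i].card2;
        if (!clash)
            expect += dist[j];
    }
    return expect;
}
```

The dist[i] term in the kernel adds back the one item counted twice by the two dupli subtractions, namely item i itself, which is the only item overlapping at both cards.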