Archived

This topic is now archived and is closed to further replies.

glacai

Sorting multi dimensional structures

Recommended Posts

Hi there, I'm looking for the quickest way of sorting a kind of multi dimensional list with a few parameters to sort by, I'll try and explain as good as I can! struct { int structused; int id; float weight; int occurences; int positions[48]; } data[MAXSTRUCTS][500] Occurences could be any number > 0, the first 48 (or less) occurences are recorded as positions ie. "hello there you there" = "there" 2 occurences, position 2+4 (well 1+3, starting at 0) MAXSTRUCTS would be > 1 but at the moment < 6 but never > 10. Hope that made sense. 500 different ids is the max but there could be less (ie. structused). Say I had three lists: ---------- data[0][0] = id(100), weight(0.9873), occurences(3), positions(0, 2, 8) data[0][1] = id(56), weight(0.8946), occurences(2), positions(3, 10) ----------- data[1][0] = id(34), weight(0.9983), occurences(3), positions(1, 3, 6) data[1][1] = id(56), weight(0.9000), occurences(2), positions(4, 13) ----------- data[2][0] = id(34), weight(0.9999), occurences(4), positions(2, 9, 14) data[2][1] = id(100), weight(0.9873), occurences(3), positions(1, 3, 9) data[2][2] = id(56), weight(0.7635), occurences(1), positions(5, 10) ----------- A perfect match would be id(56) as its in all structures plus in order, positions [0]3, [1]4, [2]5, then id(34) as it has a higher combined weight than id(100) plus id(34) has a match of [1]1, [2]2. The more structs the id appears in the better, if more than one appears in all, the positions and weights come into play with order then weight as importance. Phew, that was difficult to explain I hope someone understands! any help would be appreciated. Edit: Forgot to say each data[x] is already sorted by weight order. Regards.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Do you have a good reason of sticking to C? When working with comprex data structures, C++ and STL would be of great help.

Share this post


Link to post
Share on other sites
You describe a total order thus it's a one dimensional case. Any conventionnal sort algorithm applies. In fact you can create a specific "number" or "string" n for each id. Then just sort all your ids with quicksort if you want (std C lib). You can also use a hash table to accelerate. But I suppose, given the size of data, it's not really worth optimizing it too much.

For n, for instance, simply concatenate the bytes or ints in a string :

id occurences in the structs (don't mix it with the occurences of the word in the sentences)
+ weight
+ positions.

Create a struct like that for each id :

- int id;
- char *string;

then it's only sorting strings ...

Well I replied quickly and not certain I understood your pb perfectly ...

[edited by - Charles B on July 4, 2003 5:52:45 PM]

Share this post


Link to post
Share on other sites
Hi Charles B,

Thanks for your help but yeah I think you misunderstood, I wish I did only have to sort by ids!

I need to sort by weight order, then by word position order over multiple structures which may contain a matched id (which is what I want) but may not.

If an id appears in all data[thisdimension][] structures the sort goes to positions, where a perfect match would be:
data[0][x].pos = 1, data[0][x].id = 100
data[1][x].pos = 2, data[1][x].id = 100
data[2][x].pos = 3, data[2][x].id = 100
with the highest combined weight.
x might not be the same number but data[][x].id is as the lists are already sorted by weight.

Head ache material trying to explain that

Cheers anyway.

[edited by - glacai on July 4, 2003 6:34:18 PM]

Share this post


Link to post
Share on other sites
K let's try again ... step by step this time so that you can correct the points if I misunderstood something

1) Your output is an ordered list of structures (not ids as I first thought). So this means there is a total order : if a < b and b < c then a < c (no cycle possible).

2) Priority 1 is the number of occurences of the structure id in all tables. For instance. All structs with id 56 are before structures with id 100.

3) Priority 2 is if two ids appear a same number of occurences in your tables, then you consider the sum of (combined) weights to know wich id wins.

4) Priority 3 is when two structs share the same id, you order them with positions. It's an alphabetic order that compares strings of numbers (instead of characters). Though you did not mention it I suppose 0,2 is lower than 0,2,3 (right?).

So my starting idea may be readaptated.

Then first sort all your ids.

First build an array that contains id structs like this :

- id
- occurences in struct tables (MAXSTRUCT) // BTW it's a bit confusing why not call it MAXSTRUCTARRAYS, as each array contains structs.
- combined weight.

Parsing your structures, for each new id you meet create an id struct. Store your current index. Continue and each time you meet the id again update the occurence and combined weight counts. BTW if your ids were consecutive it could be much faster. Remark (a) : You would simply have to access the table by indexing with id. If you have only a few holes use a fake id ( 0 ? ) that will remain inactive and not really sorted later. So you could parse your structs linearily (n) and not with a double loop ( n*(n-1)/2 )

Then sort your table by occurences, then weight. BTW if your weights were integers it would be simple to concatenate occurences in most significant bits (msb) and combined weight in lsb. Struct occurences in 8 bytes and weight in fixed point 4.20 will be accurate enough right ? If your number of ids is small then no need to use a super algorithm.

Now that the ids are sorted, for each id in the odered list parse every structure and push each structure with that id in your output table (push only a pointer to the struct probably ... I don't know your application details).

Once it's done for each id, then in each same-id sub table, sort each structure by position (string). If your number of structs per id is small then the basic n square sorts will be good enough.

This time I suppose I am right. It's the naive approach. However I am certain the whole process can be optimised. Still as the sort is split into two passes, it's not as inefficient as it seems. It's like a hashing algorithm in first order.

I think the complexity can be optimal much better than n log(n) in practice using remark a).

Damn you pushed my IQ and english level close to its boundaries ROFL. That's why I lost time to reply. Thanks God no head ache yet.


[edited by - Charles B on July 4, 2003 8:41:12 PM]

Share this post


Link to post
Share on other sites
Nice one Charles!!!!

That is so much appreciated you wouldn''t believe, in the last couple of hours I have nearly implemented exactly as you are suggesting! But its nice for someone else to recommend something similar as I had thought, slightly reassuring.

I agree there''s probably a better way but hopefully this will do for now, I should know in the next hour or so.

I really appreciate your time, thanks!

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Given you can define an ordering rules, as said as before, a < b, b < c -> a < c. You can reduce your problem to a one dimensional sorting problem.

you can use the qsort() function, passing it a function that compares 2 of your structures.

if would be something like:

int compare(void *va, void * vb)
{
struct {
int structused;
int id;
float weight;
int occurences;
int positions[48];
} * a = va, * b = vb;
if(a->id < b->id) return -1;
else if(a->id > b->id) return 1;
....
}


Given that you can sort your array pretty easily.
It seens that the main problem here is that you cannot explain correctly your problem to us.

Share this post


Link to post
Share on other sites
quote:
Given that you can sort your array pretty easily.
It seens that the main problem here is that you cannot explain correctly your problem to us.


Well it seems Charles B understood, thanks anyway.

Share this post


Link to post
Share on other sites
In fact this morning (same in France and UK I think) I saw a straightforward optimisation.

One remark first. There is no way to modify your structs so that a single quicksort call can be used. The total order is not explicit but a result of intrinsic rules. Even if someone tried to use a sort on the whole structs at once this would be highly unoptimised as the id order needs to be done in a separate pass anyway. So the other poster did not undertand your issue.

You understood that pass 1 can be linear with trick a).

I think you understood but more precisely :

You parse all your structures just once, a single loop. You access ids-structs with id being considered as the index in an id-struct array. So you don't have to seek where the id is in the table when you meet the id in another struct again. The table creation can't be optimised.

The sort of these ids can now use a hash table based on id occurence I suppose these occurences are fairly limited by your MAXSTRUCT constant. If you don't see how to optimize such a hash I can explain it to you in more details. It's based on an histogram that you transform into a list of consecutive smaller arrays (ptr, size) of ids with same occurences. This is all linear. Then you only have to sort the ids in these small arrays. Which should be quasi linear as these are very small lists.

Then back to pass 2 the sort of structs that share a same id. The problem in my first implementation is that you need to parse the whole structs for each id. which is some kind of n-squared. How to make it linear and access each struct once or twice only again ?

Simple ! In pass 1/ array construction create a single linked list (SLL) of structs in each id struct. So you just have to add a pNextWithSameID pointer in your structs and a pHead, pTail pointer for the SLL management in each id-struct.

this would look like :

for( pStruct = &StructArray[0][0]; pStruct <= pLastStruct; pStruct++ )
{
int id = pStruct->id;
mySTRUCT *pPrev = IDArray[id]->pTail;
if(pPrev==0) // Empty SLL test.
{
IDArray[id].pHead = pStruct;
IDArray[id].pTail = pStruct; // Start the SLL with one element.
}
else
{
pTail->pNextWithSameID = pStruct;
idArray[id]->pTail = pStruct; // Push the structure in the SLL.
}
}


Now you still have to sort your structures for each id. Simply parse your id-struct list, parse the SLL, push struct pointers in an array of 500*MAXSTRUCT elements, keep the starting point of the sub-array for each id and keep a count of those ids. Then sort the sub arrays of struct pointers with the standard quicksort routine for instance. Or use a hash table to accelerate again.

When you have finished this all your structs are ordered by their pointers in an array.

There is another solution where you dont need SLLs, but only arrays. That's more efficient for the cache. You just need to count the number of structs per ID, an histogram (ID, count). Iterate and create an array of pointers from the histogram, which build a list of sub arrays inside your 500*MAXSTRUCT array of struct pointers. Then parse the structs again and push their pointers in the sub arrays treated as stacks. Then reorder order each sub array with quicksort.

Probably all you need to be quicker to solve your own problems is to gather enough culture in maths and general algorithms to be able to select words (vulgar or scientific) that carry meaningful concepts. Then the solutions appear more quickly when the problems are slightly rewritten. For instance simply speaking of total order and priorities helped me to see it all much more clearly. But that's only experience and scientific culture. Don't worry I was really messy in the past too and you found a solution your own way which is good.

BTW ... something else. Don't believe all what your politicians or papers said about France. They are all liars, in France too. Chirac=Blair=Bush=liars. I still enjoy visiting England when I can and don't care much about considerations for these intellectual crooks.


[edited by - Charles B on July 5, 2003 6:14:34 AM]

Share this post


Link to post
Share on other sites
Hi Charles,

quote:
BTW ... something else. Don''t believe all what your politicians or papers said about France. They are all liars, in France too. Chirac=Blair=Bush=liars. I still enjoy visiting England when I can and don''t care much about considerations for these intellectual crooks.

Personally I like France too and where you come from certainly isn''t a consideration to me, you took time and effort to help which I appreciate very much.

Anyway I''ve got it going and it seems to be working well, in a similar way to what you suggested (I think). I don''t need to order by id at any point, I was thinking of using id as an index into an array but the id could be upto 32 bit number, I know hashing and it''s been used extensively in other parts of the program so was trying to avoid anymore.

As the max is only 5x500 at the moment I decided to order the struct as read from the file, inserting into a doubly linked list into position, if id already exists update a hit for each id match and combine the weight, reshuffle the list if updated weighs higher than previous. It all seems to be working in an instant so that will do me for now I''ve still got to implement the positioning though.

btw, MAXSTRUCT was an example not what I use, thought it would make it easier to explain its not a set size.

quote:
Don''t worry I was really messy in the past too

I hope its not too messy, its a small section out of 10000 odd lines!!

Thanks for your time, regards.

Share this post


Link to post
Share on other sites
K I think your implementation is the normal way of doing it and theorically normally efficient. But when it comes to optimize I try to remove such things as double linked lists that give poor cache, data bursting and adressing anticipation (AGIs). That's why I gave this tedious solution as your question used the word quickest. Maybe you meant quicker to write not fastest code. Still you are right if the data size remains as you say (500*10 structs), on modern PCs your solution will be surely 10000 times good enough.

One of my deafults as a coder is I am so trained to optimize code in terms of CPU (3D stuff) that I always tend to propose more complex but more optimized solutions even for the first version. Rare exceptions is when it does not pay the candel at all and adds too much work for nothing (rare in most 3D stuff). Sometimes it's a bit obscure for ppl that work with me. But as I don't loose much time doing that way ... I like to have always fast running code even for prototypes and try to comment and insert a pseudo code more straightforward to understand.

Just a remark I used a gallicism (a french distortion in english), that's where nationality counts

vulgar (latin origin) is vulgaire in french which means both vulgar and popular. I meant popular words of course.


[edited by - Charles B on July 5, 2003 2:02:24 PM]

Share this post


Link to post
Share on other sites
quote:
Maybe you meant quicker to write not fastest code.

I did mean fastest code not quicker to write, I''d be interested in whatever you think is the fastest most efficient however complicated.

quote:
One of my deafults as a coder is I am so trained to optimize code in terms of CPU (3D stuff)

Thats why I considered this site first as the best place to ask, code optimization is second nature to game programmers.

Maybe I''m not up to your standards, but its fun trying

Share this post


Link to post
Share on other sites
K I'll try to think more deeply about your DLList algo and mentally examine the drawbacks in terms of optimality. I will reedit and fill this mail later. (some work to do right now.)

But first I'd like to know if I guessed your method well :

Your sort is a single pass insert sort whith four steps in the main loop. :

- find the last element appended with the "same-id"
- continue till you find the right position for the struct in the "same id" sub list.
- recombine the weights (increment). Do you update the combined weight in all the structs with same id ? Or did you find a trick to update only the first struct, like a list of all sublist heads ?
- check if the count of structs or combined weights maintain the "same-id" sub-list in position or translate it all closer to the head of the current complete list.

Am I right ?

Then it's grossly a n square algo (insert sort) and even a bit worse.

[edited by - Charles B on July 6, 2003 10:30:09 AM]

Share this post


Link to post
Share on other sites
Thanks Charles,

Exact struct I'm working with:

struct DATA_BANKS {
bool used;
DOCID docid; // unsigned char[4]
unsigned short int cnt; // cnt of positions, maybe higher than MAXPOSTIONS
unsigned short int pos[MAXPOSITIONS];
float weight;
} bank[MAXWORDS][MAXBANKS]; // 10 x 500 (I may up it)

As the data is read from a file into a data bank (seperare file for each word), file contains:
docid weight count pos pos pos pos docid weight count pos pos ... ...
count being how many positions there is for that docid, its inserted into:

typedef struct TOP_DATA {
DOCID docid;
int hits; // incemented each time docid is found, max MAXWORDS
int bank[MAXWORDS][2]; // holds which bank is from, for testing information
float weight; // f - total weight of docid matched words
float average; // f - average of total weight, not used yet
struct TOP_DATA *prev;
struct TOP_DATA *next;
} TDATA, *pTDATA;

- First I see if the docid is already listed (done badly, just traverses the list looking)
- If docid matched, hit and weight is added, bank info set, then I check if hit or weight is higher than previous, if it is swap, check previous, repeat (just thought thats really bad as well, I need to move to position and swap don't know what I was thinking!)
- If docid not matched, traverse from head to hits/weight > next, insert docid, hit, weight etc.
- This is where I will (haven't yet) implement the positioning sort, looking for the closest bank[1, 2, 3] == pos 1, 2, 3
- Then I got a function that returns the head, freeing head and moves to next, ready to return next highest, which is perfect as only a set amount of results are needed at a time.

Thanks Charles, don't worry if you're busy!

[edited by - glacai on July 6, 2003 11:47:51 AM]

[edited by - glacai on July 6, 2003 11:48:50 AM]

Share this post


Link to post
Share on other sites