
EvilCrap

Fast String


hi, i am writing a proggie that does lots of string compares rapidly... so, i wrote a StrCmp function that is like 5+ times faster than strcmp. im wondering if anyone knows of an even faster way. this is my method: i created a templated class that accepts T, a data type like int or double. i cast my char* to a T*, and use the T* to step through the string and make comparisons.
  
//example: sizeof(int) == 4, so 4 chars are compared per step
const char* pc = "joe";              //string literal is char[4]: 'j','o','e','\0'
const int* pi = (const int*)pc;      //same bytes viewed as an int
const int* test = (const int*)"joe"; //constant to compare against
if ( *pi == *test ) ...              //compares sizeof(T) chars at a time

  
it only works if both strings in the compare are CFAST_STRINGs, because the strings need to have their end padded out to a multiple of sizeof(T) with null chars, so that there is no garbage data at the end. from what i can tell, this is the fastest sort of way. anyone know of a faster way to compare char*s?! thanks
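
A minimal sketch of that padding idea, for equality compares only. The original CFAST_STRING interface isn't shown in the post, so the FastString name and its members below are made up for illustration; it pads the buffer with '\0' to a multiple of sizeof(int) so the word compares never touch garbage, and it relies on the same int-type-punning as the code above.

#include <cstring>
#include <vector>

// Hypothetical stand-in for the poster's CFAST_STRING: pads the buffer with
// '\0' so its length is a multiple of sizeof(int), then compares word by word.
struct FastString
{
    std::vector<char> buf;   // padded storage, always a multiple of sizeof(int) bytes

    explicit FastString(const char* s)
    {
        std::size_t len = std::strlen(s) + 1;                      // include the terminator
        std::size_t padded = (len + sizeof(int) - 1) & ~(sizeof(int) - 1);
        buf.assign(padded, '\0');                                  // zero-fill the tail
        std::memcpy(&buf[0], s, len);
    }

    // Equality only: compares sizeof(int) chars per step, as in the post.
    // Assumes the vector's storage is int-aligned (true for the default allocator).
    bool equals(const FastString& other) const
    {
        if (buf.size() != other.buf.size())
            return false;
        const int* a = reinterpret_cast<const int*>(&buf[0]);
        const int* b = reinterpret_cast<const int*>(&other.buf[0]);
        for (std::size_t i = 0; i < buf.size() / sizeof(int); ++i)
            if (a[i] != b[i])
                return false;
        return true;
    }
};

With FastString a("joe"), b("joe"), a.equals(b) does one 4-byte compare instead of four 1-byte ones, which is where the speedup over a byte-wise strcmp comes from.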

quote:
Original post by EvilCrap
from what i can tell, this is the fastest sort of way.
anyone know of a faster way to compare char*s?!


Do they change a lot, or is there a consistent set you're comparing against? Like a script interpreter, etc...

You can perform a hashing algorithm on the strings, and store them in a BST (allocate the memory for it as one contiguous chunk, and just point stuff into it). You start by just comparing the hashes, then you only need to check the actual strings if the hashes match. This will only be faster if you can pre-compute the majority of the hashes _and_ have a number of strings to compare against (I'd guesstimate the break-even point is on the order of 100 strings to compare to).
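
A minimal sketch of the hash-then-verify idea. The names HashString, HashTable and Contains are made up for the example, and std::multimap stands in for the hand-rolled BST (the contiguous-allocation detail is left out for brevity).

#include <cstring>
#include <map>
#include <string>
#include <utility>

// Simple string hash (djb2); any decent hash works here.
unsigned long HashString(const char* s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + static_cast<unsigned char>(*s++);
    return h;
}

// Table of known strings keyed by their precomputed hash.
typedef std::multimap<unsigned long, std::string> HashTable;

bool Contains(const HashTable& table, const char* s)
{
    unsigned long h = HashString(s);   // one pass over s
    std::pair<HashTable::const_iterator, HashTable::const_iterator> range = table.equal_range(h);
    for (HashTable::const_iterator it = range.first; it != range.second; ++it)
        if (std::strcmp(it->second.c_str(), s) == 0)   // full compare only on a hash match
            return true;
    return false;
}

The table would be filled once up front (table.insert(std::make_pair(HashString("if"), std::string("if"))) and so on), so each lookup hashes the query string once and only falls back to strcmp when hashes collide.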

For the completely general case, you'll be hard pressed to beat strcmp. As you add restrictions, you can take advantage of them (like CFAST_STRINGs all being a multiple of 4 bytes in size...). As long as you always zero out the dangling unused part of the string, and only compare CFAST_STRINGs to other CFAST_STRINGs, that should work. However, it won't be too useful: to use the optimized compare function, you'd have to copy a normal string into a zeroed CFAST_STRING buffer - which would be slower than just comparing the two strings with strcmp.


I'm pretty sure the MSVC strcmp function handles alignment at the front and end, and compares 4 bytes at a time in the middle.
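
This is not MSVC's actual implementation, just a rough sketch of that approach. It assumes a 4-byte unsigned int and that reading the aligned word containing the terminator is safe (the usual argument being that an aligned word can't cross a page boundary); like the code above, it relies on type-punning that a real CRT would do in assembly.

#include <cstddef>

// Returns <0, 0 or >0 like strcmp. Byte-compares until both pointers share
// 4-byte alignment, then compares a word at a time until a word differs or
// contains a zero byte, then finishes byte by byte.
int WordStrCmp(const char* a, const char* b)
{
    // Word compares only help if both strings can be aligned together.
    if (((std::size_t)a & 3) == ((std::size_t)b & 3))
    {
        // Step bytes until 'a' (and therefore 'b') is 4-byte aligned.
        while (((std::size_t)a & 3) != 0)
        {
            if (*a != *b || *a == '\0')
                return (unsigned char)*a - (unsigned char)*b;
            ++a; ++b;
        }
        // Compare 4 bytes at a time; stop when the words differ or one holds a '\0'.
        const unsigned int* wa = (const unsigned int*)a;
        const unsigned int* wb = (const unsigned int*)b;
        while (*wa == *wb && ((*wa - 0x01010101U) & ~*wa & 0x80808080U) == 0)
        {
            ++wa; ++wb;
        }
        a = (const char*)wa;
        b = (const char*)wb;
    }
    // Tail (or unalignable case): plain byte-by-byte compare.
    while (*a != '\0' && *a == *b)
    {
        ++a; ++b;
    }
    return (unsigned char)*a - (unsigned char)*b;
}

The (*wa - 0x01010101U) & ~*wa & 0x80808080U expression is the standard trick for detecting a zero byte inside a 32-bit word without testing each byte separately.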

Edited by - Magmai Kai Holmlor on January 16, 2002 1:41:47 AM

If you're comparing strings against a large set of strings that don't change a lot (e.g. a dictionary), then you'll want to look into suffix trees (or a related data structure). These work well when the size of the string is small in comparison to what it's being compared against (in the dictionary example, the length of any one word is insignificant compared to the total length of the dictionary).
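
Not a full suffix tree, but a minimal sketch of a related structure (a plain trie over the dictionary words) to show the idea of preprocessing the stable set so each lookup only walks the length of the query string. TrieNode, Insert and Contains are names invented for this example, and the sketch never frees its nodes, for brevity.

#include <map>

// One node per character position; children keyed by the next character.
struct TrieNode
{
    std::map<char, TrieNode*> children;
    bool isWord;
    TrieNode() : isWord(false) {}
};

// Insert a word into the trie (done once, up front, for the whole dictionary).
void Insert(TrieNode& root, const char* word)
{
    TrieNode* node = &root;
    for (; *word; ++word)
    {
        TrieNode*& child = node->children[*word];
        if (!child)
            child = new TrieNode();   // never freed in this sketch
        node = child;
    }
    node->isWord = true;
}

// Lookup cost depends only on the length of 'word', not on dictionary size.
bool Contains(const TrieNode& root, const char* word)
{
    const TrieNode* node = &root;
    for (; *word; ++word)
    {
        std::map<char, TrieNode*>::const_iterator it = node->children.find(*word);
        if (it == node->children.end())
            return false;
        node = it->second;
    }
    return node->isWord;
}

After Insert has been called for every dictionary word, Contains(root, token) touches at most strlen(token) nodes regardless of how many thousands of words are in the dictionary.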

Cunning optimisation (assuming it isn't already in the MSVCRT), but I expect that, given random strings, less than 1 in 100 ever compare past the first character. Even when you start using single-case alphabetical strings exclusively, I'd be surprised if more than 1 in 10 ever need to compare past the first character. Of course, maybe you have an application where this is important, but I'd not recommend that everyone goes around implementing this sort of thing.

[ MSVC Fixes | STL | SDL | Game AI | Sockets | C++ Faq Lite | Boost ]

VC++7 beta 2's strcmp() does precisely this anyway (well, it does it better, because it copes with strings that aren't on 4-byte boundaries).

There's a problem with your code, in that you can't readily determine whether the second string is greater or less than the first. The endianness of x86 processors means that the int that's numerically greater may be lexically less, although on other processors the two orderings agree ("abcd" is lexically less than "dcba", but when both are treated as little-endian 4-byte integers, "abcd" has the higher value. On big-endian platforms, the integer ordering matches the lexical ordering).
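
A small demonstration of that point (the program is just for illustration). On a little-endian machine it shows the byte-wise ordering and the integer ordering disagreeing.

#include <cstdio>
#include <cstring>

int main()
{
    const char a[4] = { 'a', 'b', 'c', 'd' };
    const char b[4] = { 'd', 'c', 'b', 'a' };

    unsigned int ia, ib;
    std::memcpy(&ia, a, 4);   // reinterpret the 4 bytes in native byte order
    std::memcpy(&ib, b, 4);

    // memcmp gives the lexical (byte-by-byte) ordering: "abcd" < "dcba".
    std::printf("memcmp says abcd %s dcba\n", std::memcmp(a, b, 4) < 0 ? "<" : ">");

    // On little-endian x86, ia = 0x64636261 and ib = 0x61626364, so ia > ib:
    // the integer ordering is the reverse of the lexical one. On a big-endian
    // machine the two orderings agree.
    std::printf("as ints,   abcd %s dcba\n", ia < ib ? "<" : ">");
    return 0;
}

So an int-wise compare can answer "equal or not?" cheaply, but to get a strcmp-style less/greater result on x86 you'd have to byte-swap the words (or fall back to byte comparison) at the first difference.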
