Archived

This topic is now archived and is closed to further replies.

felonius

An efficient implementation of std::string?

Recommended Posts

Hi, I (and my collegues) are getting pretty much fed up with the standard implementations of std::string. On the Playstation 2 it is not easy to view in debuggers and on PC it is hard to read the string source code when stepping through the code. So I decided it was time to build my own easy to read and debug implementation of std::string. I have have considered different approaches to the implementation and decided this: - The size of the string should be stored explicitly in a member field in the string so size() can be implemented efficiently. We use size() a lot so this matters for us. - If the string size is below some small limit (e.g. 16) it is stored locally instead of using new. When it grows beyond this the data is moved out into a dynamically allocated char-array. - The external data is increased in size with equally sized blocks so a realloc isn''t needed every time the string grows. - The use of char arrays that are null-terminated are used for fast conversion to C-style strings. - The user can, in the constructor, give the minimum size of the string so we don''t allocate and deallocate a lot fo times if the string is expected to grow and shrink a lot. This gives are data structure like this (the methods are omitted for simplicity here):
  
class string
  // The actual size of the str in bytes - including 

  // the null-terminating character.

  int iSize; 

  // The minimum size that may be allocated at any time. This is

  // only set to a high value if the user knows the str will

  // grow and shrink a lot.

  int iMinSize;

  union {
    // The str is stored externally.

    struct {
      // The external str data.

      char* cData; 
      // The amount of space in data.

      int iReservedSize;
    };

    // The str is stored internally.

    char cLocalData[STRINGLOCALSIZE];
  };
};
  
Does this seem to be a good idea or is there better ways to implement a string? Of other questions: 1. Is it a good idea to use equally sized blocks. One might argue that the block size should increase with the size of the the string. 2. When should the allocated memory be reduced? Everytime the string shrinks? Any suggestions or links are welcome. Jacob Marner, M.Sc. Console Programmer, Deadline Games

Share this post


Link to post
Share on other sites
quote:
Original post by felonius
I (and my collegues) are getting pretty much fed up with the standard implementations of std::string.

Just to cover all bases, have you looked at all publicly available implementations (especially STLPort)?

On the Playstation 2 it is not easy to view in debuggers and on PC it is hard to read the string source code when stepping through the code.
You shouldn''t need to read the string source code as it has been tested by a large number of developers over a fairly long period. You should be able to refer to it as a black box. I wonder if you refer as well to the massive debug symbols MSVC (and perhaps other compilers) spit out - complete template instantiations instead of compact common references. I remember reading of a solution (it involved intervening somewhere between the macro processor, the compiler and the debugger) in the C/C++ Users Journal... You might be able to find the article on their website, http://www.cuj.com.

The size of the string should be stored explicitly in a member field in the string so size() can be implemented efficiently. We use size() a lot so this matters for us.
Just about every current implementation should do this, since std::strings aren''t necessarily null-terminated.

If the string size is below some small limit (e.g. 16) it is stored locally instead of using new. When it grows beyond this the data is moved out into a dynamically allocated char-array.
This doesn''t help average-case performance in string-intensive algorithms.

The external data is increased in size with equally sized blocks so a realloc isn''t needed every time the string grows.
A realloc isn''t needed currently every time the string grows (if your implementation does that, I suggest looking into STLPort). The common technique is to double capacity when a realloc must occur.

The use of char arrays that are null-terminated are used for fast conversion to C-style strings.
Logical, if you only plan to use your std::string class for strings of ASCII characters. Remember that std::string is really a typedef for std::basic_string< char >. If you simply intend to replace that single class, then the scheme you suggest is logical (remember to remove or comment out the typedef in the string header file).

The user can, in the constructor, give the minimum size of the string so we don''t allocate and deallocate a lot fo times if the string is expected to grow and shrink a lot.
Adding it to the constructor is an interesting approach; I can''t say off the cuff how that may affect STL semantics since it is advisable to maintain consistent behavior with other sequence containers. std::string already provides reserve() though... (inelegant, but functional).

1. Is it a good idea to use equally sized blocks. One might argue that the block size should increase with the size of the the string.
Double the current size (ie, logarithmic growth rather than linear), which will reduce the overall number of relocations as the string grows. Testing on a particular platform, however, may reveal that it''s better in your case to use uniform-sized blocks, in which case it''s obvious what is the superior choice.

2. When should the allocated memory be reduced? Everytime the string shrinks?
Never. Instead encourage your programmers to keep objects around only as long as necessary so that the memory is released when the object goes out of scope (or is deleted if dynamically allocated).

Share this post


Link to post
Share on other sites
Hi Oluseyi,

Thanks for the input. I think I neclected to point out that we on purpose only want to support a subset of the full STL since we are only use some of it anyway (the containers mostly). So whether or not we mess up with the STL semantics is not of high concern - we just need to make sure that our engine still runs after the class replacement. Also, but removing features from STL (such as the ability to compare the contents of an array with a vector or the hierarchy involved in the iterator types) we can simplify the implementation a lot and gain speed.

Just to cover all bases, have you looked at all publicly available implementations (especially STLPort)?

Yes. STLPort does not work well with the Playstation 2 (we tried and has a lot of trouble compiling it) and furthermore the source is extremely hard to read since they don''t use variable names that any mortal can discern. Besides, we like to be able tune the implementation for our specific engine.

The use of local data is a clear example of this. We have a lot of strings that are very short so although this may not be an improvement in string intensive algorithms I am pretty sure it will help in our case. But your claim is interesting; why wouldn''t it help? People often use small short-lived strings in normal programming and avoiding a malloc() here should certainly be worth the effort of doing a test each time a string access method is called. Could you explain why you think it is so?

Just about every current implementation should do this, since std::strings aren''t necessarily null-terminated.
The PS2 default implementation does not do this - so when I look at a string in the debugger I also see all the garbage that follows the string. Very annoying.

A realloc isn''t needed currently every time the string grows (if your implementation does that, I suggest looking into STLPort). The common technique is to double capacity when a realloc must occur.
Well a realloc is needed when the capacity runs out (unless of course the strings is allowed to contain pointers - but requires mroe logic during traversal and access) so it is a matter of guessing how much the capacity should be. I knew that vectors does a doubling in capacity - but I am not so sure that it is a good idea for strings - maybe another smaller exponential growth is more appropriate to avoid memory waste?

std::string already provides reserve() though... (inelegant, but functional).
The big problem with reserve() is that it calls the constructor first and then reserves afterwards. That way code is executed that is not needed anyway.

Never. Instead encourage your programmers to keep objects around only as long as necessary so that the memory is released when the object goes out of scope (or is deleted if dynamically allocated).
Interesting. That also allows me to remove the iMinSize field from the class. It seems reasonable that programmers should be aware of the fact when a string shrinks a lot. Hmm, maybe it would be a good idea to add a method so programmers explictly can tell the system to make the allocated space smaller - that way they can preserve space in the cases a stringh has shrunk and stays like that for a long time.

But any other suggestions for improving performance is also welcome - not just comments to the approach I made.

Jacob Marner, M.Sc.
Console Programmer, Deadline Games

Share this post


Link to post
Share on other sites
quote:
Original post by felonius
Thanks for the input.

My pleasure (and insomnia, since I should be asleep right now!)

quote:

I think I neclected to point out that we on purpose only want to support a subset of the full STL since we are only use some of it anyway (the containers mostly). So whether or not we mess up with the STL semantics is not of high concern - we just need to make sure that our engine still runs after the class replacement. Also, but removing features from STL (such as the ability to compare the contents of an array with a vector or the hierarchy involved in the iterator types) we can simplify the implementation a lot and gain speed.

Ah, I see. While that is a benefit for the short-term (implementation time, required robustness), taking a little time and effort to preserve compatibility as much as possible may end up very useful in the future, My suggestion would be to leave out any functionality you don''t currently need (so it can be added/simulated later if necessary), but not to break any existing STL features.

quote:

Yes. STLPort does not work well with the Playstation 2 (we tried and has a lot of trouble compiling it) and furthermore the source is extremely hard to read since they don''t use variable names that any mortal can discern. Besides, we like to be able tune the implementation for our specific engine.

Yes, ungodly identifiers seems to be popular with STL implementors (Plauger/Dinkumware do the same). Sorry to hear that.

quote:

The use of local data is a clear example of this. We have a lot of strings that are very short so although this may not be an improvement in string intensive algorithms I am pretty sure it will help in our case. But your claim is interesting; why wouldn''t it help? People often use small short-lived strings in normal programming and avoiding a malloc() here should certainly be worth the effort of doing a test each time a string access method is called. Could you explain why you think it is so?

I was looking at average-case performance for most application programming. In that case, strings tend to have very large size variations and the cost benefits of storing the data in an array will rapidly be outweighed by the decision logic for accessing the string data via pointer vs via array (this can be eliminated by setting the pointer to the beginning of the array when storing small strings, but means you can''t use a union in there). In the specific instance of your programming at your company, having the local storage option seems viable because of the ways you use strings on site.

quote:

The PS2 default implementation does not do this - so when I look at a string in the debugger I also see all the garbage that follows the string. Very annoying.

You''re seeing the garbage because the string isn''t null-terminated - ie, it''s length-delimited. Except the debugger isn''t aware of this. You can choose to null-terminate, length-delineate or both as suits your purposes.

quote:


Well a realloc is needed when the capacity runs out (unless of course the strings is allowed to contain pointers - but requires mroe logic during traversal and access) so it is a matter of guessing how much the capacity should be. I knew that vectors does a doubling in capacity - but I am not so sure that it is a good idea for strings - maybe another smaller exponential growth is more appropriate to avoid memory waste?

I believe some implementations do a one-and-a-half times increment (as opposed to doubling). You could even provide an extension that allows the developer to specify the rate (and whether fixed or proportional) of capacity increment.

quote:

The big problem with reserve() is that it calls the constructor first and then reserves afterwards. That way code is executed that is not needed anyway.

Yeah, but how critical is performance during construction - generally speaking, of course! Like I said earlier, spending time and effort making sure you don''t break any existing functionality (even if you don''t support all functionality - and even if you add new, non-standard functionality) may payoff later.

quote:

Interesting. That also allows me to remove the iMinSize field from the class. It seems reasonable that programmers should be aware of the fact when a string shrinks a lot. Hmm, maybe it would be a good idea to add a method so programmers explictly can tell the system to make the allocated space smaller - that way they can preserve space in the cases a stringh has shrunk and stays like that for a long time.

Excellent idea. Would have been my next suggestion (no, really! )

quote:

But any other suggestions for improving performance is also welcome - not just comments to the approach I made.

I''ll be lending the matter some thought. As a C++ programmer with a touch of C experience, and with the languages'' reliance on null-terminated chars at the fundamental level, it''s hard to move away from that representation. Maybe someone versed in another language (such as a text manipulation language/tool like awk or sed) can provide really interesting alternatives.

Share this post


Link to post
Share on other sites
Thanks. It seems we agree pretty much here.

I certainly agree that STL compatibility is worth preserving - who knows we might need to use 3rd party library that uses STL and hell is loose if we can't support it.

I think the one and a half idea is excellent so I will do that. It seems more reasonable for us. Hmm, it seems a bit of an overkill to info to the system about the growth rate - but I will keep it in mind.

With regard to the local storage og data I get your point, so I might very well do two implementations and do a performance analysis for our applications. it is also a matter of tuning the size of the data stored locally to get most of this scheme. Since we use strings mostly for local variables the fact that the strings structure increases in size is not a matter of high concern.

I'll be lending the matter some thought.

Any thoughts will be appreciated. (And bad me, sitting here and coding for my company for fun in a weekend. )

As a C++ programmer with a touch of C experience, and with the languages' reliance on null-terminated chars at the fundamental level, it's hard to move away from that representation. Maybe someone versed in another language (such as a text manipulation language/tool like awk or sed) can provide really interesting alternatives.

Hmm, I think it is important to stay close to the C-definition of strings. There are still a lot of libraries out there that only use C-style strings (all the one we use for instance) so fast conversion to those are extremely important.


Jacob Marner, M.Sc.
Console Programmer, Deadline Games

[edited by - felonius on October 19, 2002 8:26:37 AM]

Share this post


Link to post
Share on other sites
Hmm, I have been doing some thinking about some of the previous arguments. I said that
a) That our strings mostly are shortlived and on the stack so the struct isn''t that much of an issue.
b) Having local data in a union causes more logic to differentiate between them with a loss in performance as the consequence.

This leads to another better solution. I could do like this:


  
class string {
// The actual size of the str in bytes - excluding

// the null-terminating character.

int iSize;

// The external str data.

char* cData;

// The amount of space in data.

int iReservedSize;

// The str is stored internally.

char cLocalData[STRINGLOCALSIZE];
};


And then always use cData to access the string. If however, the string is less than STRINGLOCALSIZE in size then cData points to cLocalData.

In effect I remove all the performance overhead of embedded strings, while avoided malloc() for small strings. The only drawback is that sizeof(string) grows with STRINGLOCALSIZE - but I think it is worth it - if STRINGLOCALSIZE is set to something like 16 it shouldn''t be too bad.

comments?

Jacob Marner, M.Sc.
Console Programmer, Deadline Games

Share this post


Link to post
Share on other sites
quote:
2. When should the allocated memory be reduced? Everytime the string shrinks?

Never; if I have a string that''s around for ages that will benefit from being reduced (which is unusual) I can shrink it myself.

I, the user of the string, knows when it''s useful to reduce the allocated memory.

You, the writer of the string class, do not.

If my string had 1024 bytes allocated and then I reduced the actual used part to 10 bytes, so that I could then write another 1000 bytes to it, I don''t want you to deallocate in the meantime or anything like that.

But if I know that I''d rather have the memory, I can reduce the storage used by copy constructing and swapping.

Share this post


Link to post
Share on other sites
quote:
Original post by felonius
- If the string size is below some small limit (e.g. 16) it is stored locally instead of using new. When it grows beyond this the data is moved out into a dynamically allocated char-array.


This is how MSVC''s string works (15 bytes I believe).

Share this post


Link to post
Share on other sites
There was an article at CUJ that implemented std::string by Andrei Alexandrescu:
Generic Programming: A Policy-Based basic_string Implementation . It provides what you need, as well as code. There is an locally stored string when the string is less then a given amount which is what you wanted..and even if you do re-implement std::string again you would only have to implmement the policy class. Have a look at it, it should be useful.

[edited by - risingdragon3 on October 19, 2002 4:48:06 PM]

[edited by - risingdragon3 on October 19, 2002 4:48:45 PM]

Share this post


Link to post
Share on other sites
DrPizza,

I agree. Oluseyi, convinced me that this is the right way to go.

---

Stoffel,

Interesting. That means it can''t be that bad an approach. But if you read the risingdragon3 shows he actual talks about this approach (called Small Object Storage) and uses a union and a tag as I did. But I think the idea I got that allows this to happen without a tag (see my other post above) is much more efficient since I can eliminate the tag testing logic.

----

risingdragon3,

thanks for the link. it was interesting reading. In Andrei''s words what I aim for is a implementation for single threaded clients that does not use Copy-on-write (copy on write works best with size delimited strings - not null terminated strings).

For the purpose of my implementation I know that the basic_string will only used as basic_string so I have on purpose not implemented basic_string, but only string (not as a template). So for the sake of readability and ease of debugging I will implement it from the ground up.

Why, you all may ask? It is simply because that I (and my collegies for that matter) think that templates should only be used when one actually intend to use varying parameter values. And this is not something we will do in this case.

As a side note, we are actually not to happy about the coding style of Andrei Alexandrescu. I have only browsed his Modern C++ book but read the summarizing article in Game Programming Gems 1. The problem is that his methods (metaprogramming that is) are very hard on compilers and debuggers and to make them behave as intended very ugly code results - code that very often are not portable. In my own opinion metaprogramming is to use templates for something that are not fully intended - namely specialization. Instead I think that the way to go is to partial specialization tools (they are not that good yet) that can process your code to generate specialized versions. I know that people might not agree with this opinion of metaprogramming, but I truely think that code should be written simply as possible.

BTW, would any one here be interested in my implementation when it is done? (I think it will be done sometime during tomorrow)

Jacob Marner, M.Sc.
Console Programmer, Deadline Games

Share this post


Link to post
Share on other sites
I did some research on copy-on-write algorithms. A good introduction is available here:

http://www.cuj.com/experts/1905/henney.htm

I thought at first that this was the holy grail but when you read it closely you find that copy-on-write might in fact slow things down if the copying of strings take only a short while - i.e. if the strings are short - which they are in my case. So copy-on-write is not something I want to use. It was good to take it into consideration though.


Jacob Marner, M.Sc.
Console Programmer, Deadline Games

Share this post


Link to post
Share on other sites
quote:
Original post by felonius
...And then always use cData to access the string. If however, the string is less than STRINGLOCALSIZE in size then cData points to cLocalData.
...
comments?

Exactly what I was (covertly) suggesting.

quote:

Why, you all may ask? It is simply because that I (and my collegies for that matter) think that templates should only be used when one actually intend to use varying parameter values. And this is not something we will do in this case.

In this case your analysis is correct (since the properties of a string are immutable - until you consider Kanji characters and other languages that require UNICODE/MBCS charcters for computer transliteration). In general, however, it''s usually best to keep options open, because good software lends itself to unexpected uses. If a class can conceptually be templated without adding significant complexity, doing so may turn out useful in the long run.

quote:

As a side note, we are actually not to happy about the coding style of Andrei Alexandrescu. I have only browsed his Modern C++ book but read the summarizing article in Game Programming Gems 1. The problem is that his methods (metaprogramming that is) are very hard on compilers and debuggers and to make them behave as intended very ugly code results - code that very often are not portable. In my own opinion metaprogramming is to use templates for something that are not fully intended - namely specialization.

I''d disagree with this. Template metaprogramming is a way to get the compiler to write code for you - just like any other form of template programming. The difference is that metaprogramming uses templates to supply more than just type definitions; it provides process variation.

quote:

BTW, would any one here be interested in my implementation when it is done? (I think it will be done sometime during tomorrow)

Write a GDNet article. It may be insightful for a number of people, especially the concept of refactoring/reimplementing a design for a set of given constraints. It''ll take a while to get to the front page since the queue is quite backed up.

Share this post


Link to post
Share on other sites
quote:

In this case your analysis is correct (since the properties of a string are immutable - until you consider Kanji characters and other languages that require UNICODE/MBCS charcters for computer transliteration). In general, however, it's usually best to keep options open, because good software lends itself to unexpected uses. If a class can conceptually be templated without adding significant complexity, doing so may turn out useful in the long run.



Ahh, but this is where we differ in opinion. I am fan of the principle from Extreme Programming (and many other Agile programming methods for that matter) that states that you should not add complexity to a program before you are sure it is needed - in which case you shoukd refactoring to let it in. This way the code is kept simple and clean as long as possible. The idea is that people use to much time making complex code that later turns out to be hard to change.

In this case a transformation from string to basic_string is fairly trivial but it does complicate the code - and since I am fairly certain it will not ever be used this complexity this complexity is unneccesary and just increases compile time.

quote:

I'd disagree with this. Template metaprogramming is a way to get the compiler to write code for you - just like any other form of template programming. The difference is that metaprogramming uses templates to supply more than just type definitions; it provides process variation.



I think the way to achieve specialization (that supports proces variation) is to use partial evaluation. Here is an introduction to what it is:

http://www.rolemaker.dk/encyclop.pdf

Try for instance to look at this tool (C only):

http://www.diku.dk/research-groups/topps/activities/cmix/

Other ones problably exist too. However, they are really not that good, but I don't think the use of templates for specialization is that good either. You really have to press them to get them to do their specialization - often so much it was much easier just to write the specialized version by hand.

Anyway, we are delving into religion here.

quote:

Write a GDNet article. It may be insightful for a number of people, especially the concept of refactoring/reimplementing a design for a set of given constraints. It'll take a while to get to the front page since the queue is quite backed up.

I don't think so. I have written enough article stuff for a long time. See my thesis at www.rolemaker.dk


Jacob Marner, M.Sc.
Console Programmer, Deadline Games

[edited by - felonius on October 20, 2002 3:44:57 AM]

[edited by - felonius on October 20, 2002 3:59:53 AM]

Share this post


Link to post
Share on other sites