Sign in to follow this  
zennehoy

How to implement string index bounding?

Recommended Posts

Hi all! I'm in the process of implementing my own string class, mostly because I'm not a big fan of the built-in C++ string, but also just for the experience. Dynamic resizing and all the other basics work great, but I'd like some feedback on how to implement indexing of characters within the string. There's no problem when the specified index falls within the current length of the string, but I'm undecided on how to implement an index outside of the valid range. Take the following code for example:
char c;
ztring z = "Hello World!"; // Just create a string with some text.
z[2] = 'a';                // z is now "Healo World!", no problem
c = z[6];                  // c is 'W'
c = z[42];                 // c is '\0', what should happen to z?
z[42] = '?';               // what should this do?

The simplest approach would be to make this illegal, as in C++ or Java. Instead, I'm considering increasing the size of z to make room for the indexed value. The problem is that this would have to occur even when just reading the value, since I can't (as far as I know) differentiate between a read and a write. Thus, any reference to z[534] would make a shorter string 535 bytes long, with a lot of '\0'-space in between. Basically I'm looking for feedback as to what others would consider a comprehensible solution. So, what behavior would you expect from a string? Thanks! Zen

Share this post


Link to post
Share on other sites
whats wrong with std::string? its great.

okay your problem:

ask yourself, when people indexing a string index it out of bounds, what are they trying to do. if they want to expand the string, add a function that does that.

otherwise throw an exception or something. thats the easiest way to handle it.

my guess is for every programmer using a large value to index the array out of bounds to expand it, theres about 400 with a bug in their code.

or you could do the std::string thing an say that operator[]( int ) is unchecked, it is the clients responsibility to control the index variable.

Share this post


Link to post
Share on other sites
Quote:
Original post by zennehoy
Hi all!
I'm in the process of implementing my own string class, mostly because I'm not a big fan of the built-in C++ string

Why not?

Quote:
but also just for the experience.

Fair enough, but realise that std::string will be a better general purpose string than anything you write.

Quote:
Instead, I'm considering increasing the size of z to make room for the indexed value. The problem is that this would have to occur even when just reading the value, since I can't (as far as I know) differentiate between a read and a write. Thus, any reference to z[534] would make a shorter string 535 bytes long, with a lot of '\0'-space in between.

Basically I'm looking for feedback as to what others would consider a comprehensible solution. So, what behavior would you expect from a string?
Thanks!
Zen

So you're wanting something that behaves so that reading an character out of bounds is.. undefined? an error? returns a default value? something else? And writing a character out of bounds resizes the string. You could do something like:
// I call it silly string because I consider any attempt to write a
// general purpose string class in C++ for any purpose other than learning to be
// silly. Just use std::string
#include <iostream>
#include <string>

class SillyString
{

private:

class CharacterProxy;

public:

SillyString(std::string const & s);
unsigned int size() const;
operator std::string() const;
CharacterProxy operator[](unsigned int index);

private:

void resize(int size);

// for this exercise I've just wrapped std::string for ease of
// implementation
std::string data_;

class CharacterProxy
{

public:

CharacterProxy(SillyString & string, unsigned int index);
operator char const() const;
CharacterProxy & operator=(char c);

private:

SillyString & string_;
unsigned int const index_;

};

friend class CharacterProxy;

};

SillyString::SillyString(std::string const & s)
:
data_(s)
{
}

unsigned int SillyString::size() const
{
return data_.size();
}

SillyString::operator std::string() const
{
return data_;
}

SillyString::CharacterProxy SillyString::operator[](unsigned int index)
{
return SillyString::CharacterProxy(*this, index);
}

void SillyString::resize(int size)
{
data_.insert(data_.end(), size + 1 - data_.size(), '\0');
}

SillyString::CharacterProxy::CharacterProxy(SillyString & string, unsigned int index)
:
string_(string),
index_(index)
{
}

SillyString::CharacterProxy::operator char const() const
{
if (index_ >= string_.size())
{
return '\0';
}
return string_.data_[index_];
}

SillyString::CharacterProxy & SillyString::CharacterProxy::operator=(char c)
{
if (index_ >= string_.size())
{
string_.resize(index_);
}
string_.data_[index_] = c;
return *this;
}

int main()
{
SillyString str("hello");
std::cout << str[3] << '\n';
std::cout << str[72] << '\n';
str[0] = 'H';
std::cout << static_cast< std::string >(str) << '\n';
str[17] = '!';
std::cout << static_cast< std::string >(str) << '\n';
}

Enigma

Share this post


Link to post
Share on other sites
Alright, I honestly hope you didn't just type all that up...
I suppose using a "proxy"-class like that is the only way to differentate a read from a write? I've used the same approach in a project before, but was hoping to avoid it in this case. How much overhead does it introduce? Otherwise that's exactly the sort of functionality I was imagining, resize on write, default value on read.

As to why: it started with the inability to do str += int. I haven't used std::string much, but I always felt I had to be carefull when passing strings as parameters. The key goal for my own ztring class has always been to make the syntax of its usage identical to that of an int (ignoring additional features like []). Duplication of data is also minimized (and I mean minimized), even across assignments and such.
Really though it was an excuse to play around with C++...

Anyways, thanks for your replies!
Zen

ps. Just out of curiosity, what does std::string do with something like this?
std::string s1="hello",s2=" ",s3="world",s4="!",s5;
s5 = s1 + s2 + s3 + s4;
How many data buffers are created? 5 should be enough, 1 for each initialized string and 1 for the entire series of concatenations. A simple solution would create 7, 1 for each string and another for each addition.

Share this post


Link to post
Share on other sites
Quote:
Original post by zennehoy
Alright, I honestly hope you didn't just type all that up...
I suppose using a "proxy"-class like that is the only way to differentate a read from a write? I've used the same approach in a project before, but was hoping to avoid it in this case. How much overhead does it introduce? Otherwise that's exactly the sort of functionality I was imagining, resize on write, default value on read.

In terms of overhead I would guess on a modern optimising compiler you'd be looking at something between "nothing" and "not much".

Quote:
... I haven't used std::string much, but I always felt I had to be carefull when passing strings as parameters.

The only thing you need to be careful about is passing by value verses passing by const reference and that's not a string specific issue, it applies to many C++ types. Even if you make the wrong choice a profiler will usually tell you and it's easy to fix.

Quote:
The key goal for my own ztring class has always been to make the syntax of its usage identical to that of an int (ignoring additional features like []).

I'm not sure what you mean by this. In what way is std::string unlike an int?

Quote:
...Really though it was an excuse to play around with C++...

That's fine, just don't let me catch you crawling back to the forums with an error you can't track down in six months time because you're using your own string class and it has a subtle bug in it [lol].

Quote:
ps. Just out of curiosity, what does std::string do with something like this?
std::string s1="hello",s2=" ",s3="world",s4="!",s5;
s5 = s1 + s2 + s3 + s4;
How many data buffers are created? 5 should be enough, 1 for each initialized string and 1 for the entire series of concatenations. A simple solution would create 7, 1 for each string and another for each addition.

If you're talking about dynamic memory allocations (i.e. operator new) then I believe the standard allows implementations to make any number of allocations in such a situation, although "normal" implementations will probably make between none and nine (yes, none is possible using the short string optimisation). In practice it looks like modern compilers make eight dynamic allocations (not sure about VC++ 2005, I haven't gotten round to installing that yet).

The question though is "is this a problem with std::string"? The answer is almost certainly no. Is it possible to write operator+ and operator= for some string class such that you can guarantee that no more than X dynamic allocations for the expression string s1;string s2 = ...;string s3 = ...;string s4 = ...;...;string sX = ...;s1 = s2 + s3 + s4 + ... + sX;? As far as I'm aware it is not (excluding techniques that could just as easily be applied to std::string). Not even using proxy classes and clever techniques.

So lets look at the question slightly differently. Is is possible to write code for std::string such as:
string s1;
string s2 = ...;
string s3 = ...;
string s4 = ...;
...
string s
X = ...;
something;
such that s1 equals the concatentation of s2..sX and can guarantee no more than X + 1 memory allocations? (I'll allow plus one for the implementation to dynamically allocate a small buffer for s1 initially, even though I would expect most implementations not to do so). The answer is, unsurprisingly, yes, to an extent. And it's not even particularly hard. Start with the basic case for X = 5:
void concatenate(std::string & target, std::string const & s2, std::string const & s3, std::string const & s4, std::string const & s5)
{
target.reserve(s2.size() + s3.size() + s4.size() + s5.size());
target += s2;
target += s3;
target += s4;
target += s5;
}

Then you can either provide versions of the above for varying numbers of parameters, or use preprocessor metaprogramming and a template (preprocessor template, not a C++ template) to generate them automatically.

Enigma

Share this post


Link to post
Share on other sites
Quote:
Original post by Enigma
The only thing you need to be careful about is passing by value verses passing by const reference and that's not a string specific issue, it applies to many C++ types. Even if you make the wrong choice a profiler will usually tell you and it's easy to fix.
Quote:
The key goal for my own ztring class has always been to make the syntax of its usage identical to that of an int (ignoring additional features like []).

I'm not sure what you mean by this. In what way is std::string unlike an int?

If you pass a std::string by value, does its data get recreated? If so, that would mean you can't pass a std::string by value (can't as in, well, you know). That would be one difference to a regular int. My ztring class passes by value as expected; changes made in the function don't effect the value of the ztring in the calling function. Pass it as a reference and changes effect both. sizeof(ztring) is 4, so it's no different than passing a pointer.

Quote:
don't let me catch you crawling back to the forums with an error you can't track down in six months time because you're using your own string class and it has a subtle bug in it [lol].

Alright, if I get any string-related errors, I promise to try it with std:string first ;)

Quote:
"normal" implementations will probably make between none and nine (yes, none is possible using the short string optimisation).

What is short string optimisation?

As far as memory allocations are concerned, I was talking about the number of buffers created to actually hold the string data. When running a string of additions (result=s1+s2+s3+...+sn) you don't need to store the data for each sub-sum, you can just create a single data buffer containing s1+s2, then simply append all further strings. Its identical to result=s1+s2; result+=s3; result+=s4; etc., but written as a single line with + instead of +=.

I'll give the proxy-class idea a go, unless someone has a cleaner (ie no extra class) idea to accomplish the same. Thanks for your input!
Zen

Share this post


Link to post
Share on other sites
Quote:
Original post by zennehoy
... sizeof(ztring) is 4, so it's no different than passing a pointer.


Not exactly. Make an std::vector with a million elements in it and the sizeof will be the same as an empty vector of the same type, while passing the million element vector by value would be very expensive

sizeof doesn't account for dynamic memory that the class allocates. Passing a string by value is somewhat expensive because it news another dynamic buffer to hold a copy of the string for the life of the function. const reference is preferred in most cases for this reason.

Share this post


Link to post
Share on other sites
Some std::string implementations use the copy-on-write semantic - that is, copying a string object does not duplicate the buffer holding the characters until you modify the string - then if it's buffer is shared, it will recreate it and do the modification to that.

Why don't you read the standard? The standard for C++ outlines exactly what a std::string implementation needs to guarantee in order to be a conforming std::string implementation. Even though you're 'writing your own', you should take a look at the list of requirements - one of the primary goals of C++ is efficiency - in fact, often in the C++ standard library efficiency wins over safety - that is, range-checking is less efficient than not range checking, so it's not normally done. However, there are ways (like the at() method) to do range checking, and many debug implementations do range checking and such so that you're in the clear, but you don't suffer many performance hits. For some reason people have this idea that "objects are bloated and std::string takes up gigabytes of memory just to store 'hello' and it runs really slow and you can't pass it by value and I can write better code using character arrays and blah blah blah..."

Share this post


Link to post
Share on other sites
Quote:
Original post by zennehoy
If you pass a std::string by value, does its data get recreated? If so, that would mean you can't pass a std::string by value (can't as in, well, you know). That would be one difference to a regular int. My ztring class passes by value as expected; changes made in the function don't effect the value of the ztring in the calling function. Pass it as a reference and changes effect both. sizeof(ztring) is 4, so it's no different than passing a pointer.

Passing a std::string by value means that any change made to the string inside the function does not affect the value of the original string, just like an int. Whether this is achieved by an immediate deep copy or by Copy-On-Write or some other technique is up to the implementation.

Quote:
What is short string optimisation?

Some implementations of std::string make the actual string object slightly larger so that they can store short strings within the actual string memory itself instead of having to use dynamically allocated memory. For more details see Scott Meyers' Effective STL. (In fact even if you don't want any more details read the book and the two others in the series - Effective C++ and More Effective C++ - anyway because they're excellent books).

Quote:
As far as memory allocations are concerned, I was talking about the number of buffers created to actually hold the string data. When running a string of additions (result=s1+s2+s3+...+sn) you don't need to store the data for each sub-sum, you can just create a single data buffer containing s1+s2, then simply append all further strings. Its identical to result=s1+s2; result+=s3; result+=s4; etc., but written as a single line with + instead of +=.

I know what you meant. It sounds easy and worthwhile on the face of it, but try implementing it. I think you'll find it's not as easy/possible as you thought in the general case.

Enigma

Share this post


Link to post
Share on other sites
sizeof( std::string ) is 4 on my compiler (some gcc version )

and if you pass a std::string to a function and modify it, it will not modify the original... so other than the expand on write what are you gaining?

string += int can be added as a external operator function

you could probably just inherit the zstring class from std::string to make it act like you want.

void func( std::string str )
{
str += "zxw";
std::cout << str << std::endl;
}

std::string& operator+=( std::string& str, int val )
{
str += intToString( val ); // uses stringstreams and such
return str;
}

int main( int argc, char **argv )
{
std::cout << "size of string == " << sizeof( std::string ) << std::endl;
std::string str = "xxx";
func( str );
cout << str << std::endl;
str += 10;
cout << str << std::endl;
}




outputs

size of string == 4
xxxzxw
xxx
xxx10

EDIT: im slow :(

Share this post


Link to post
Share on other sites
Quote:
Original post by DrEvil
Not exactly. Make an std::vector with a million elements in it and the sizeof will be the same as an empty vector of the same type, while passing the million element vector by value would be very expensive

Yes, I took that into account. The data isn't duplicated unless it is changed. Thus you can pass a ztring by value, copying only the 4 bytes actually belonging to the ztring object onto the stack, and the data buffer doesn't get duplicated unless modified within the function. No matter the calling mechanism, if you want to modify the buffer within the function without affecting the value of the buffer in the calling function, you have no choice but to duplicate the buffer.

Quote:

sizeof doesn't account for dynamic memory that the class allocates. Passing a string by value is somewhat expensive because it news another dynamic buffer to hold a copy of the string for the life of the function.

Passing a string by value calls the string constructor with a string argument. If the string's data is duplicated, it is duplicated explicitly by the string's constructor.

Zen

Share this post


Link to post
Share on other sites
Quote:
Original post by RDragon1
Some std::string implementations use the copy-on-write semantic - that is, copying a string object does not duplicate the buffer holding the characters until you modify the string - then if it's buffer is shared, it will recreate it and do the modification to that.


Good to know. Though I try to avoid features that are implementation specific and stick with pass by reference or const reference in most cases, because the behavior will be consistant across implementations.

Share this post


Link to post
Share on other sites
Hehe, too slow by half ;)

Quote:
Original post by Enigma
I know what you meant. It sounds easy and worthwhile on the face of it, but try implementing it. I think you'll find it's not as easy/possible as you thought in the general case.

Well, it's implemented and it works, so I'm happy...

Quote:
Original post by RDragon1
Some std::string implementations use the copy-on-write semantic - that is, copying a string object does not duplicate the buffer holding the characters until you modify the string - then if it's buffer is shared, it will recreate it and do the modification to that.

The trouble with "some implementations" is that it isn't very portable. With my ztring class I can be sure that this is the case.

Quote:
Original post by rip-off
and if you pass a std::string to a function and modify it, it will not modify the original... so other than the expand on write what are you gaining?

If a std::string duplicates the string data even when the data isn't modified by the function, any function that uses the string data as read-only must have the string passed by reference. Int-like behavior doesn't have this limitation. Not a big deal I admit, but if you have the tendency to forget '&'s like me, it's nice not having to worry about it.

Quote:

string += int can be added as a external operator function

True, I didn't know this was possible when I started coding though. At least I learned something from all this...

No alternatives to a proxy indexing class yet though?
Zen

Share this post


Link to post
Share on other sites
Quote:
Original post by zennehoy
Quote:
Original post by Enigma
I know what you meant. It sounds easy and worthwhile on the face of it, but try implementing it. I think you'll find it's not as easy/possible as you thought in the general case.

Well, it's implemented and it works, so I'm happy...

Would you mind posting that implementation? I'd be interested to see what it looks like.

Enigma

Share this post


Link to post
Share on other sites
Quote:
Original post by Enigma
Would you mind posting that implementation? I'd be interested to see what it looks like.

The solution I adopted is rather spread out in the code, and the code is rather long (not sure what code windows here can take). Rather than make you dig through all that, basically what I did was create a global pointer (yeah yeah, it's not thread safe blah blah) that gets set to the result of any binary operation (+). If another binary operation follows, with the first argument identical to the global pointer stored previously, data is appended to the previous result rather than created new. As soon as the global pointer is assigned to anything, the global pointer is cleared so the next operation will start over.

I've tested it pretty thoroughly, and couldn't find any loopholes on the algorithmic level either, so I've decided it works. If someone can think of a scenario that would break things, I'd be glad to hear it!
Zen

Edit: Come to think about it, it might actually be thread-safe. Just not thread-efficient (the global pointer is overwritten by one thread, when it could potentially still have been used by another). Not something I would count on though...

Share this post


Link to post
Share on other sites
Just gotta say this (even though you've allready admitted this is for educational purpouses)...

This is not your bottleneck.

Well, it's pretty likely it's not, considering last time I checked, simple console i/o was about 100x (yes, that much) slower than a quick:
std::string text = str( boost::format( "Hello, %1%" ) % user );

Call. This was checked using GCC 3.4.2, which I believe uses COW strings. So really, pass your strings by "const string_type & reference" and you should be fine - or you'll likely be using a different optimization than COW. Did I note that this probably does a lot of string (de)constructing? Even passing by value, it's not likely going to be your bottleneck.

Plus, by forcing your user to use COW, you may take a preformance hit on multithreaded programs - for average strings, I believe COW is outpreformed by the traditional copy-on-copy mechanism.

See http://www.gotw.ca/publications/optimizations.htm

Take note of some of the benchmark results:

               Plain   1726ms
Plain_FastAlloc 642ms
COW_Unsafe 1629ms basic cow: barely an improvement
COW_AtomicInt 1949ms thread safe cow: costs more
COW_CritSec 10660ms
COW_Mutex 33298ms


As for operator[], I suggest you do as the standard library does - let out of bound access equate to undefined behavior, and let std::string::at( size_t index ) throw if it is used to access something out of bounds.

Share this post


Link to post
Share on other sites
Quote:
Original post by MaulingMonkey
See http://www.gotw.ca/publications/optimizations.htm

Awesome link, thanks! I've honestly done very little with multi-threading, so this was an excellent heads-up.
Looks like ztring'll be for non-threaded use only, though at a first glance the acceptably efficient method using atomic operations shouldn't be a problem to implement.
I'll keep it in mind, but not worry about it untill I actually need a thread-safe implementation,
Zen

Share this post


Link to post
Share on other sites
Quote:
Original post by zennehoy
The solution I adopted is rather spread out in the code, and the code is rather long (not sure what code windows here can take). Rather than make you dig through all that, basically what I did was create a global pointer (yeah yeah, it's not thread safe blah blah) that gets set to the result of any binary operation (+). If another binary operation follows, with the first argument identical to the global pointer stored previously, data is appended to the previous result rather than created new. As soon as the global pointer is assigned to anything, the global pointer is cleared so the next operation will start over.

I've tested it pretty thoroughly, and couldn't find any loopholes on the algorithmic level either, so I've decided it works. If someone can think of a scenario that would break things, I'd be glad to hear it!
Zen

I'm afraid I don't see how a simple global pointer helps you. Unless you significantly over-allocate memory you'll likely need to make a new dynamic allocation and copy on nine concatenations out of ten anyway. And if you do significantly over-allocate memory then you risk hurting spatial locality and possibly lose more performance than you gained. Also, how does this interact with your COW implementation. What happens if you do something like:
class Test
{

public:

Test(String s)
:
s_(s)
{
}

String s() const
{
return s_;
}

private:

String s_;

};

int main()
{
String s1("Hello ");
String s2("World.");
String s3(s1 + s1); // operator+ stores the resulting string globally
Test t(s3); // t copies the resulting string, no writes
std::cout << (s3 + s2); // further concatenation onto the stored string
std::cout << t.s() << '\n'; // is t correctly unchanged?
}

Or am I misunderstanding your implementation somehow?

Enigma

Share this post


Link to post
Share on other sites
Quote:
Original post by DrEvil
Quote:
Original post by RDragon1
Some std::string implementations use the copy-on-write semantic - that is, copying a string object does not duplicate the buffer holding the characters until you modify the string - then if it's buffer is shared, it will recreate it and do the modification to that.


Good to know. Though I try to avoid features that are implementation specific and stick with pass by reference or const reference in most cases, because the behavior will be consistant across implementations.


I did not at all intend for you to rely on it, I was just stating that things are usually 'better than you might think', and there are all kinds of different optimizations in different library implementations.

Share this post


Link to post
Share on other sites
Quote:
Original post by Enigma
I'm afraid I don't see how a simple global pointer helps you. Unless you significantly over-allocate memory you'll likely need to make a new dynamic allocation and copy on nine concatenations out of ten anyway. And if you do significantly over-allocate memory then you risk hurting spatial locality and possibly lose more performance than you gained. Also, how does this interact with your COW implementation. What happens if you do something like:
*** Source Snippet Removed ***
Or am I misunderstanding your implementation somehow?
Enigma


True, if you concatenate strings beyond the initial data allocation size, you have to re-allocate your data. That's no different in a series of += calls though, which is what we're trying to emulate.

The global pointer only helps if we have more than a single addition at once:

int main() {
String s1("Hello ");
String s2("World.");
String s3(s1 + s1 + s2); // 2 concatenations, g1=s1+s1 and s3=g1+s2
}


Rather than allocating new data for the second addition, we'd like it to perform the same as a += operation:
s3 = s1 + s1; s3 += s2
Thus, a global pointer to the result of the first addition is stored (g1). The second operation notices that the first parameter is identical with the global pointer, and appends g1 with s2. s3 then takes over the data pointed to by g1, and having been assigned, g1 is set to NULL.
Zen

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this