(C++) std::string::c_str() truncation issue

Started by
5 comments, last by Zahlman 16 years, 2 months ago
A variable 'buffer' is a std::string that holds a lot of characters, not just the usual "ABC123" that one would find in normal printing (any character at all, including '\n' and '\r'). I'm having serious issues getting at the characters stored in buffer using anything involving std::string::substr(). I can do it without issue using buffer.at(location), but this seems costly in terms of speed (and I am doing operations on hundreds of kilobytes of characters). Here's an example of the problem:
printf("By index: '%c%c%c%c'",buffer.at(3),buffer.at(4),buffer.at(5),buffer.at(6));
OUTPUTS: '½⌂ '
printf("by substring: '%s'\n",buffer.substr(3,4).c_str());
OUTPUTS: '½⌂' //note the missing whitespace, null, or whatever the case may be
This truncated version is useless to me because I need to know the precise value of all four characters (in this case it's just four, in others it's thousands). I'm also wary of just assuming those values have a certain meaning. Is std::string::c_str() truncating at the first null, thus leaving all following values indeterminate? Or does it simply shave off all the trailing null characters, so that everything after the visible text can safely be assumed null? And, perhaps more importantly, is null the only character that causes this truncation? If not, then all I can tell from a missing character is that it may be one of those. The only way around this I see is
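int pos,len;//starting position and length, as in std::string::substr()
char* t_data;//temporary data storage
t_data = new char[len+1];//one extra char for the terminating null that atoi() needs
for(int i=0;i<len;i++)
	t_data[i] = buffer.at(pos+i);//write into element i, not over the pointer itself
t_data[len] = '\0';
//use t_data, in this case it's actually an integer so atoi(t_data)
delete[] t_data;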
but this appears slower and with much more overhead. Is there no other way? Or am I somehow misunderstanding std::string::substr()?
It's implementation defined. c_str() returns a const char*, usually to the internal character array. If there are any null characters in that, printf() can't display them, because a null character means "end of string". My implementation (VC2005) of std::string uses the C string functions internally for all sorts of stuff, and to those a null character means "end of string". I've seen some weird errors from using null characters in std::strings.

As far as I know, the null character is the only thing that should cause problems, but I wouldn't be at all surprised if values < 0 (since chars are (usually) signed) cause oddities too.

As soon as you start using buffers containing 0x00 bytes with std::string, you're entering a world of pain. Is there a reason you don't use a std::vector<unsigned char> instead?
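A minimal sketch of the std::vector<unsigned char> suggestion, assuming the data currently sits in a std::string. The helper name to_bytes is made up for illustration and does not come from the thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: copies len characters starting at pos into a byte
// vector. Embedded 0x00 bytes are copied like any other value.
std::vector<unsigned char> to_bytes(const std::string& s,
                                    std::string::size_type pos,
                                    std::string::size_type len) {
    // One explicit range check up front; the copy itself is unchecked.
    assert(pos <= s.size() && len <= s.size() - pos);
    return std::vector<unsigned char>(s.begin() + pos, s.begin() + pos + len);
}
```

Because the elements are unsigned char, values above 127 come out as 128..255 rather than as negative numbers, which sidesteps the signed-char concern raised above.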
Quote:Original post by Zouflain
Here's an example of the problem:
printf("By index: '%c%c%c%c'",buffer.at(3),buffer.at(4),buffer.at(5),buffer.at(6));

OUTPUTS: '½⌂ '


Looks like the contents of your string are screwed if you were expecting something different. BTW, you can use [] indexing if you're sure that the index is in range.

Quote:
printf("by substring: '%s'\n",buffer.substr(3,4).c_str());

OUTPUTS: '½⌂'//note the missing whitespace, null, or whatever the case may be
This truncated version is useless to me because I need to know the precise value of all four characters (in this case it's just four, in others it's thousands).

The problem is with printf. It expects a C-style 0-terminated string. It's going to stop when it hits an embedded null.
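A small sketch of that difference, assuming a made-up four-character sample with a null at index 2. The names c_visible_length and make_sample are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as a C-style consumer such as printf("%s", ...) sees it:
// counting stops at the first '\0'.
std::size_t c_visible_length(const std::string& s) {
    const std::size_t n = std::strlen(s.c_str());
    assert(n <= s.size());  // the C view can only be shorter, never longer
    return n;
}

// Hypothetical sample: four characters with a null at index 2.
std::string make_sample() {
    const char raw[] = {'A', 'B', '\0', 'D'};
    return std::string(raw, sizeof raw);  // explicit length keeps the null
}
```

Here make_sample().size() is 4 while c_visible_length(make_sample()) is 2: the string still holds all four characters, and only the C-style view of it ends early.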


Quote:I'm also wary of just assuming those values have a certain meaning - is std::string::c_str() truncating at the first null, thus leaving all following values indeterminate?

If you're passing it to printf() via a %s, it won't make any difference. printf() stops looking at the first 0. But FYI, a zero is added to the end of the buffer, in effect, when calling c_str().

Quote:The only way around this I see is
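int pos,len;//starting position and length, as in std::string::substr()
char* t_data;//temporary data storage
t_data = new char[len+1];//one extra char for the terminating null that atoi() needs
for(int i=0;i<len;i++)
	t_data[i] = buffer.at(pos+i);//write into element i, not over the pointer itself
t_data[len] = '\0';
//use t_data, in this case it's actually an integer so atoi(t_data)
delete[] t_data;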
but this appears slower and with much more overhead. Is there no other way? Or am I somehow misunderstanding std::string::substr()?


I still don't fully understand what problem you're trying to solve. But it looks like the root of your problem is with printf(), not with std::string.

If you want to print a range of characters from a string to the console, then perhaps do something like:

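#include <algorithm>
#include <cassert>
#include <iostream>
#include <iterator>
#include <string>

typedef std::string::size_type sz_t;

// Writes the characters in [begin, end) to out one by one, nulls included.
void dump_string_section(const std::string &s,
                         sz_t begin, sz_t end,
                         std::ostream &out = std::cout)
{
    typedef std::ostream_iterator<char> out_iter;
    assert(end >= begin);
    std::copy(s.begin() + begin, s.begin() + end, out_iter(out));
}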


[Edited by - the_edd on February 12, 2008 9:27:19 AM]
Yes, vector<char> would be a good idea, although I believe it should work with std::string as well.

c_str() simply returns a pointer to the null-terminated string data. If the string contains any 0x00 characters, this may be a problem for printf, since printf assumes 0x00 marks the end of the string.

There is also the data() member function, which likewise returns a pointer to the data of the string. That data is not necessarily null-terminated, though.
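One way to use data() without relying on a terminator is to pass an explicit length along with the pointer. The sketch below does that with std::ostream::write; write_range and capture_range are hypothetical helper names.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Writes up to len raw characters starting at pos. The length is passed
// explicitly, so embedded 0x00 bytes do not end the output early.
void write_range(std::ostream& out, const std::string& s,
                 std::string::size_type pos, std::string::size_type len) {
    assert(pos <= s.size());
    if (len > s.size() - pos) len = s.size() - pos;  // clamp, as substr() does
    out.write(s.data() + pos, static_cast<std::streamsize>(len));
}

// Convenience wrapper that returns what write_range emitted, so the result
// can be inspected in memory instead of on the console.
std::string capture_range(const std::string& s,
                          std::string::size_type pos,
                          std::string::size_type len) {
    std::ostringstream os;
    write_range(os, s, pos, len);
    return os.str();
}
```

Calling write_range(std::cout, buffer, 3, 4) would send all four characters to the console, whatever their values.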

And substr() should return a substring no matter what the contents of the actual string are, even when there are 0x00 characters in it.

You can test this by making a string which contains 0x00 (or maybe a string of only 0x00) and then taking a substring, e.g. sub = s.substr(1, 4). sub.length() or sub.size() should then be 4, even though printf would not print anything.
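That test might look like the following sketch, which assumes a string of eight 0x00 characters. The function name check_substr_keeps_nulls is made up.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Checks that substr() returns every requested character even when all of
// them are 0x00, while the C-style view of the result is empty.
void check_substr_keeps_nulls() {
    const std::string zeros(8, '\0');            // eight 0x00 characters
    const std::string sub = zeros.substr(1, 4);  // four of them

    assert(sub.size() == 4);                 // std::string kept all four
    assert(std::strlen(sub.c_str()) == 0);   // a %s consumer sees none
}
```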

And for performance reasons you should avoid the at() member function (like buffer.at()) since this always does a bounds check. You can simply use the overloaded [] operator if you know your index is in range.
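A rough sketch of the two access styles, with hypothetical helper names at_throws and sum_range: at() checks every index and throws std::out_of_range on a bad one, whereas the loop below checks the range once and then uses operator[].

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Returns true if s.at(i) rejects the index by throwing std::out_of_range.
bool at_throws(const std::string& s, std::string::size_type i) {
    try {
        (void)s.at(i);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Sums the character values in [pos, pos + len) with a single up-front
// range check and unchecked operator[] inside the loop.
unsigned long sum_range(const std::string& s,
                        std::string::size_type pos,
                        std::string::size_type len) {
    assert(pos <= s.size() && len <= s.size() - pos);
    unsigned long total = 0;
    for (std::string::size_type i = 0; i < len; ++i) {
        total += static_cast<unsigned char>(s[pos + i]);
    }
    return total;
}
```

Whether the difference is measurable depends on the compiler and build settings, so it is worth profiling before assuming at() is the bottleneck.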
The truncation is due to the way that printf works (not std::string). The end of a C-style string is denoted quite literally by a null character. Part of the slowness of your new method is going to come from the at() function, which performs bounds checking on the indices. The accessor operator (string[index]) will certainly be faster in this case. Why do you want to see the characters in the file when you have (I assume) binary data? You could eliminate a lot of this hassle if you just printed out the numeric values of the characters. Another viable option would be to use std::copy from <algorithm> to copy each character to the output stream.
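Printing numeric values could be sketched as below, assuming two-digit uppercase hex separated by spaces is an acceptable format. The name to_hex is hypothetical.

```cpp
#include <cassert>
#include <string>

// Formats the characters in [pos, pos + len) as two-digit hex values
// separated by single spaces, so every byte is visible, 0x00 included.
std::string to_hex(const std::string& s,
                   std::string::size_type pos,
                   std::string::size_type len) {
    assert(pos <= s.size() && len <= s.size() - pos);
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < len; ++i) {
        // Go through unsigned char so values above 127 do not turn negative.
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

On an ASCII system, to_hex(std::string("A\0\n", 3), 0, 3) yields "41 00 0A", which makes the null byte visible where '%s' would have stopped.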
if (buffer.length() >= 7) {
  std::cout << "By index: " << buffer.at(3) << buffer.at(4) << buffer.at(5) << buffer.at(6);
} else {
  std::cout << "Index out of bounds";
}
Quote:Original post by Antheus
if (buffer.length() >= 7) {
  std::cout << "By index: " << buffer.at(3) << buffer.at(4) << buffer.at(5) << buffer.at(6);
} else {
  std::cout << "Index out of bounds";
}


QFE. Except that just indexing in with operator[] makes more sense here, since you already have an explicit bounds check.

