Pascal-string from file to std::string

Nyarlath

I have to read a Pascal-string (a sequence of characters preceded by its length) from a file into a string. What I do:
#include <fstream>
#include <string>
#include <boost/scoped_array.hpp>

std::string ReadPascalString(const std::string& filename) {
  std::ifstream file(filename.c_str(), std::ios_base::in | std::ios_base::binary);
  int size;
  file >> size;                            // length prefix
  boost::scoped_array<char> buffer(new char[size]);
  file.read(buffer.get(), size);           // read into a temporary buffer...
  return std::string(buffer.get(), size);  // ...then copy into the string
}

This way I make an extra copy of all the characters (possibly several megabytes). I can certainly live with that, but the faster the better, and the cleaner the better. So I tried to read the data directly into the string:
  std::string result(size, '\0');
  file >> std::setw(size) >> result;
...but this stops at the first whitespace found (at most size chars are read). Any suggestions are appreciated!


I would use the std::istream::read function to read the raw bytes into a stack array (for speed) and then assign them to the string.

#include <fstream>
#include <string>

std::ifstream f("file.bin", std::ios::in | std::ios::binary);
std::string result;

// read the length prefix (a single byte for a classic Pascal string)
unsigned char sz;
f.read((char*)&sz, sizeof(sz));

// read the data into a stack buffer, then assign it to the string
char s0[256];
f.read(s0, sz);
result.assign(s0, s0 + sz);

By way of explanation, my first thought would be to use iterators after getting the length, e.g.

std::istream_iterator<char> start(f);
std::copy(start, start + sz, std::back_inserter(result));  // doesn't work - see the points below



but
1) istream_iterator isn't a random access iterator, so you can't add an offset to it to fix the number of bytes you want to read.
2) alternatively you could use std::copy_n to fix the distance, but it seems to be deprecated.
3) and even if the file is opened with ios::binary, whitespace is skipped by istream_iterator, so it would drop valuable data. [edit] you can use std::istreambuf_iterator<char> start(f); instead, but it's still not random access (see the sketch just after this list).
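
To illustrate point 3, a small self-contained sketch (the "ab cd" payload is just an assumed example, not anything from the thread):

#include <iterator>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("ab cd");    // stand-in for the file stream

    // istream_iterator uses formatted extraction, so it skips whitespace:
    std::string s1((std::istream_iterator<char>(in)),
                   std::istream_iterator<char>());       // yields "abcd" - the space is lost

    in.clear();                        // rewind the stream for the second pass
    in.seekg(0);

    // istreambuf_iterator reads raw characters, spaces included:
    std::string s2((std::istreambuf_iterator<char>(in)),
                   std::istreambuf_iterator<char>());    // yields "ab cd" - intact
}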


My colleague says that Pascal string lengths are always encoded as a single byte. If they aren't, then maybe use an oversized stack array for small Pascal strings and fall back to an allocation with std::vector for large ones, something like this:





unsigned sz; // if the length was encoded as a 32-bit integer
f.read((char*)&sz, sizeof(sz));

const unsigned small_sz = 256;
if (sz < small_sz) {
    // small string: use a stack buffer for speed
    char s0[small_sz];
    f.read(s0, sz);
    result.assign(s0, s0 + sz);
}
else {
    // otherwise allocate a dynamic buffer (needs <vector>), then assign to the string
    std::vector<char> s0(sz);
    f.read(&s0[0], s0.size());
    result.assign(s0.begin(), s0.end());
}




Note that you have to be careful with endianness issues.
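
For instance, a minimal sketch of decoding the length byte-by-byte so it doesn't depend on the host's byte order (this assumes the file stores a 32-bit little-endian prefix; uint32_t comes from <cstdint>, or boost/cstdint.hpp on pre-C++11 compilers):

#include <cstdint>
#include <istream>

// Assemble a little-endian 32-bit length regardless of the machine's byte order.
std::uint32_t read_le32(std::istream& is)
{
    unsigned char b[4];
    is.read(reinterpret_cast<char*>(b), 4);
    return  static_cast<std::uint32_t>(b[0])
         | (static_cast<std::uint32_t>(b[1]) << 8)
         | (static_cast<std::uint32_t>(b[2]) << 16)
         | (static_cast<std::uint32_t>(b[3]) << 24);
}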


I don't know that there's all that much you can do with a std::string. Personally I'd be wary of storing multi-megabyte texts inside a string to begin with; it's just not what the container was designed for, and you'll have very little wiggle room for optimizations such as this.

Anyway, perhaps the best solution performance-wise, without resorting to other containers and file access methods, would be to reserve the string space ahead of time, then read the data in small chunks (something on the order of a kilobyte or so) and insert these into the string. There'll still be plenty of copying going on inside the stream and the operating system for buffering purposes, but at least that way you'll minimize the overhead and be cache friendly.
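
A minimal sketch of that chunked approach (the 1 KB chunk size and the function name are assumptions, not anything from the thread):

#include <cstddef>
#include <fstream>
#include <string>

// Reads 'size' bytes from an already-opened stream in fixed-size chunks,
// appending into a string whose capacity was reserved up front.
std::string ReadChunked(std::ifstream& file, std::size_t size)
{
    std::string result;
    result.reserve(size);                  // avoid repeated reallocations

    char chunk[1024];                      // ~1 KB stack buffer
    std::size_t remaining = size;
    while (remaining > 0) {
        const std::size_t want = remaining < sizeof(chunk) ? remaining : sizeof(chunk);
        file.read(chunk, want);
        result.append(chunk, static_cast<std::size_t>(file.gcount()));
        if (!file)
            break;                         // bail out on a short read
        remaining -= want;
    }
    return result;
}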

Either that or throw up a memory mapped file and return a set of pointers as iterators for the string. That way you'd avoid all unnecessary copying, remove one pass over the file, and speed up swapping. But I rather doubt that's necessary.
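
Purely for illustration, a rough POSIX sketch of that memory-mapped idea (the function name, the 32-bit length prefix, and the omission of error handling and munmap are all assumptions made to keep the sketch short):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <utility>

// Maps the whole file and returns [begin, end) pointers into the mapping
// instead of copying into a std::string. The mapping is never unmapped here,
// and all error checks are skipped.
std::pair<const char*, const char*> MapPascalString(const char* path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const char* base = static_cast<const char*>(
        mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    close(fd);                               // the mapping stays valid

    std::uint32_t len;
    std::memcpy(&len, base, sizeof(len));    // binary length prefix
    const char* begin = base + sizeof(len);
    return std::make_pair(begin, begin + len);
}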

There's a question here: if there is a standard format for a Pascal string in a file, I apologise, but I can't find anything about it on Google.

Is the string saved in text or binary format? I.e. is the length prefix in binary, immediately followed by the text data, or is the length prefix encoded as text, followed by a space (or a colon, or whatever), followed by the text data?

Nyarlath's code is treating it as text, chairthrower's as binary.

(Nyarlath - a stream's >> operator expects text-formatted data in the stream, even if the stream is opened in ios::binary mode.)

I am aware that in memory a Pascal string is a binary length followed by the characters, but in a file, unless there is a standard I am not aware of, it depends on how the file was written, and you need to know which it is before you can write code to read it in.
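
To make the two interpretations concrete, a small sketch (the "11 hello world" example and the 32-bit binary prefix are assumptions, not anything from the thread):

#include <cstdint>
#include <fstream>

// Two alternative ways of reading the length prefix - only one applies to
// any given file format; they are shown together purely for comparison.
void read_length_both_ways(std::ifstream& file)
{
    // Text-encoded prefix, e.g. "11 hello world": operator>> parses decimal
    // digits and stops at whitespace, even on a stream opened with ios::binary.
    int textLen = 0;
    file >> textLen;
    file.ignore(1);                       // skip the separator character

    // Binary-encoded prefix, e.g. four raw bytes: read them verbatim.
    std::uint32_t binLen = 0;
    file.read(reinterpret_cast<char*>(&binLen), sizeof(binLen));
}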

Quote:

2) alternatively you could use std::copy_n to fix the distance, but it seems to be deprecated.


If I'm not mistaken, it is deprecated only by the VC++ compiler. Of course any function that assumes the target buffer is large enough is unsafe, but if you use back_inserter that is not a problem.

Quote:
Original post by implicit

Either that or throw up a memory mapped file and return a set of pointers as iterators for the string. That way you'd avoid all unnecessary copying, remove one pass over the file, and speed up swapping. But I rather doubt that's necessary.


Some would be very annoyed if you do that.

One suggestion, which I expect will cause people to weep and then complain, but which works on all Visual Studio STL implementations so far (though it certainly isn't guaranteed to work at all), is to abuse std::string's internal layout:

std::string ReadPascalString(const std::string& filename) {
  std::ifstream file(filename.c_str(), std::ios_base::in | std::ios_base::binary);
  int size;
  file >> size;
  std::string strRet;
  strRet.resize(size);
  file.read(&strRet[0], size);   // write straight into the string's internal buffer
  return strRet;
}

It might be a little safer to use the data() member function, but it's still implementation dependent. But it should be fast anyway...

/me sits back and waits for the shouting [smile]

Quote:
Original post by Antheus
Quote:
Original post by implicit
Either that or throw up a memory mapped file and return a set of pointers as iterators for the string. That way you'd avoid all unnecessary copying, remove one pass over the file, and speed up swapping. But I rather doubt that's necessary.

Some would be very annoyed if you do that.
That's a rather odd criticism of memory mapped files. I agree that they're often ascribed near-magical properties, but complaining about poor cache behavior assumes that the OS is brain-dead and ignores the existence of hints like FILE_FLAG_SEQUENTIAL_SCAN.
Also, recommending reading the whole file in a single scan since memory mapping might block the process is just plain bizarre. A single bulk read call wouldn't allow any data processing to proceed until the whole file is read, unlike memory mapped files which are arguably the most straightforward mechanism for reading and processing data simultaneously.
Sure, asynchronous I/O can occasionally be useful to avoid blocking a UI thread or starving the other users of a server, but between the mess of reliably implementing such a scheme and the prevalence of threading it's not something I'd expect to see very often.
Finally, he's ignoring a big benefit of memory mapping: the RAM backing a mapping is simply a cache, and the OS knows it. So assuming your system is sane (that is, you're not running Win9x...) the pages can be purged automatically when memory runs low.

That's not to say there aren't issues with memory mapped files. But I'd focus on running out of address space, the problems of dealing with I/O errors, the unnecessary locking-down of files, or the fact that just because it might look like any other memory block doesn't mean you can jump around freely in it.

Oh, and just for the record, I'm still not recommending using memory mapped files in this case. Not unless you're seeing some real performance problems anyway.

Quote:
Original post by visitor
Quote:

2) alternatively you could use std::copy_n to fix the distance, but it seems to be deprecated.


If I'm not mistaken it is deprecated only by the VC++ compiler. Of course any function that assumes that the target buffer is large enough is unsafe, but if you use back_inserter that is not a problem.


In fact I only tried against gcc. For gcc 4.1.2, grep -R "copy_n[^a-zA-Z_]" /usr/include/c++/ indicates that it's not C++98:

/4.1.2/ext/algorithm: // copy_n (not part of the C++ standard)

So although probably undesirable, you could use it by including <ext/algorithm> and then prefixing the GNU extension namespace, e.g. __gnu_cxx::copy_n(...).

Boost uses copy_n in its detail code, but I think it always provides local definitions.
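
A sketch of that usage, assuming the SGI-style signature that libstdc++ ships (copy_n(first, count, result)) and that the stream still holds at least sz bytes:

#include <ext/algorithm>   // __gnu_cxx::copy_n - a non-standard libstdc++ extension
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

std::string read_n(std::ifstream& f, std::size_t sz)
{
    std::string result;
    result.reserve(sz);
    // istreambuf_iterator keeps whitespace; back_inserter grows the string.
    __gnu_cxx::copy_n(std::istreambuf_iterator<char>(f), sz,
                      std::back_inserter(result));
    return result;
}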

Thanks all for the replies.
I have spent some time rating the options, but I think I will stick with my solution: abusing a vector or a string doesn't give a big advantage over the plain buffer. I have to read null-terminated strings and Pascal strings with prefixed sizes of type int8, int16 and int32 (the latter not well suited for a constant preallocation).

I will look at copy_n, thank you again for the hint.

As a last whine I would say that, regarding the STL, the file I/O part is the least intuitive, worst documented, and least supported by VS2005's IntelliSense.

Don't worry, it will hopefully be only another year before C++0x is out, with a bunch of fixes to the standard library design, and only a few more years before it's well supported. ;)

Just for fun, here is another way to get around the non-standardness of copy_n. This avoids an explicit copy or repeated memory allocation and doesn't depend on the string being contiguous, although it will waste time default-initializing a bunch of memory.


#include <algorithm>
#include <ios>
#include <istream>
#include <string>

struct overwriteWith {
    std::istream& is;
    explicit overwriteWith(std::istream& s) : is(s) {}
    char operator()(char /*toOverwrite*/) {
        char result;
        // Throw the same type of exception that would be thrown if the exceptions
        // flag is set on the stream. This avoids having to save and restore the
        // exceptions state ;)
        if (!is.get(result)) { throw std::ios::failure("unexpected end of stream"); }
        return result;
    }
};

int length = readLengthFrom(file);
std::string result(length, 0);
std::transform(result.begin(), result.end(), result.begin(), overwriteWith(file));


I know, evil :)


To further belabor the point (this is C++, after all), another possibility would be to write the algorithm rather than adapting a functor to an existing algorithm. Here n is implied by the distance of the target iterator range, there are no double copies or double allocations, and istreambuf_iterator means it's fine with binary data.

I suspect that copy_n would never work very well (even if/when it's standard), because it would obscure what happens if you unexpectedly hit EOF (this cannot be determined up front, due to the streaming iterators and the lack of std::distance), and hitting EOF is exactly where you want an exception.

#include <fstream>
#include <string>
#include <iterator>

template <class IT>
void copy_istream(std::istream& is, IT start, const IT end)
{
    // iterator_traits also handles plain pointer iterators
    typedef typename std::iterator_traits<IT>::value_type v_type;
    typedef std::istreambuf_iterator<v_type> itr_type;

    itr_type ip(is);
    while (start != end) {
        if (ip == itr_type()) throw std::ios::failure("ughh");   // ran out of input
        *start++ = *ip++;
    }
}

int main()
{
    std::ifstream file("file.bin", std::ios::binary);
    std::string s(10, ' ');                   // e.g. allocate a string of 10 bytes
    copy_istream(file, s.begin(), s.end());   // copy exactly 10 bytes into the string
}

Multi-megabyte Pascal string? I thought Pascal strings were limited to 255 characters: the length is represented by a single byte at the start of the string. Assuming the byte is 8 bits, you can't go beyond 255 characters.

Quote:
Original post by Oxyd
Multi-megabyte Pascal string? I thought Pascal strings were limited to 255 characters: the length is represented by a single byte on the start of the string. Assuming the byte is 8 bits, you can't get further than 255 characters.


For real Pascal strings, sure, but nothing stops one from using the same technique with a larger length field.

I realize those C++ snippets were meant as academic exercises rather than optimizations, but reading the file character by character rather than buffering is a terrible idea from a performance standpoint. I've yet to see even an inlined getc() macro in an unrolled loop generate decent code, and throwing stream iterators and standard algorithms around doesn't exactly help.

In this case I still recommend reading small chunks from the input stream into a decently sized stack-allocated buffer, and copying them into a string whose final size has been reserved ahead of time. I suppose you might be able to avoid some unnecessary copying by disabling the file buffer (stream.rdbuf()->pubsetbuf(NULL, 0) after opening an ifstream, I believe), but I'd profile it carefully first, especially if small buffers are common.

(This is an interesting benchmark if anyone cares.)

