Pascal-string from file to std::string

I have to read a Pascal-string (a sequence of characters preceded by their count) from a file into a string. What I do:

std::string ReadPascalString(const std::string& filename) {
  std::ifstream file(filename.c_str(), std::ios_base::in | std::ios_base::binary);
  int size;
  file >> size;
  boost::scoped_array<char> buffer(new char[size]);
  file.read(buffer.get(), size);
  return std::string(buffer.get(), size);
}

This way I make an extra copy of all the characters (it may even be some megabytes). I can certainly live with that, but the faster, the better. And the cleaner, the better. So I tried to read the data directly into the string:
  std::string result(size, ' ');
  file >> std::setw(size) >> result;
...but this stops at the first space found (at most size chars are read). Any suggestion is appreciated!

I would use the std::istream::read function to just read raw bytes into a stack array (for speed) and then assign them to the string.

std::ifstream f( "file.bin", std::ios::in | std::ios::binary);
std::string result;

// read len
unsigned char sz;
f.read( (char *) &sz, sizeof( sz));

// read data
char s0[ 256];
f.read( s0, sz);
result.assign( s0, s0 + sz);




By way of explanation, my first thought would be to use iterators after getting the length, eg

std::istream_iterator< char> start( f);
std::copy( start, start + sz, std::back_inserter( result));


but
1) istream_iterator isn't a random access iterator, so you can't add the fixed offset (the number of bytes you want to read) to it.
2) alternatively you could use std::copy_n to fix the distance, but it seems to be deprecated.
3) and even if the file is opened with ios::binary, whitespace is skipped by istream_iterator, so it would drop valuable data. [edit] you can use std::istreambuf_iterator< char> start( f); instead, but it's still not random access (a sketch follows below).
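For illustration, here is a minimal sketch of that istreambuf_iterator route, written as a hypothetical helper (the name read_n_chars is made up) that takes the already-open binary stream and the length; it reads exactly sz characters and never skips whitespace:

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

std::string read_n_chars(std::ifstream& f, std::size_t sz)
{
    std::istreambuf_iterator<char> it(f), end;   // raw characters, no whitespace skipping
    std::string result;
    result.reserve(sz);
    for (std::size_t i = 0; i < sz && it != end; ++i, ++it)
        result.push_back(*it);                   // append one character at a time
    return result;
}

The per-character loop is the price of not having random access, but it stays within standard iterators.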


My colleague says that Pascal string lengths are always encoded in one byte. If they aren't, then maybe use an oversized stack array for small Pascal strings and fall back to an allocation with std::vector for large ones, something like this:



unsigned sz;  // if len was encoded using 32 bit integer
f.read( (char *) &sz, sizeof( sz));

const unsigned small_sz = 256;
if( sz < small_sz) {
    // if string is small then use stack buffer for speed
    char s0[ small_sz];
    f.read( s0, sz);
    result.assign( s0, s0 + sz);
}
else {
    // otherwise allocate dynamic buffer then assign to string
    std::vector< char> s0( sz);
    f.read( &s0[ 0], s0.size());
    result.assign( s0.begin(), s0.end());
}



Note that you have to be careful with endianness issues.
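As a hedged illustration of that caveat, here is one way to read a 32-bit length that was written little-endian, independent of the host's byte order (read_le_u32 is a made-up helper name; <cstdint> is assumed, or boost's cstdint typedefs on an older compiler):

#include <cstdint>
#include <istream>

std::uint32_t read_le_u32(std::istream& f)
{
    unsigned char b[4];
    f.read(reinterpret_cast<char*>(b), 4);
    // assemble the bytes in little-endian order, whatever the host endianness is
    return  static_cast<std::uint32_t>(b[0])
         | (static_cast<std::uint32_t>(b[1]) << 8)
         | (static_cast<std::uint32_t>(b[2]) << 16)
         | (static_cast<std::uint32_t>(b[3]) << 24);
}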

[Edited by - chairthrower on October 9, 2008 10:12:50 AM]
I don't know that there's all that much you can do with a std::string. Personally I'd be wary of storing multi-megabyte texts inside a string to begin with, it's just not what the container was designed for, and you'll have very little wiggle room for optimizations such as this.

Anyway, perhaps the best solution performance-wise, without resorting to other containers and file access methods, would be to reserve the string space ahead of time, then read in the data in small chunks (something on the order of a kilobyte or so) and append these to the string. There'll still be plenty of copying going on inside the stream and in the operating system for buffering purposes, but at least that way you'll minimize the overhead and be cache friendly.
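A rough sketch of what that could look like, assuming the size prefix has already been read; the chunk size and the helper name read_chunked are arbitrary:

#include <cstddef>
#include <fstream>
#include <string>

std::string read_chunked(std::ifstream& f, std::size_t size)
{
    std::string result;
    result.reserve(size);                    // one allocation up front
    char chunk[1024];                        // ~1 KB working buffer
    std::size_t remaining = size;
    while (remaining > 0 && f) {
        std::size_t want = remaining < sizeof(chunk) ? remaining : sizeof(chunk);
        f.read(chunk, static_cast<std::streamsize>(want));
        std::size_t got = static_cast<std::size_t>(f.gcount());
        if (got == 0) break;                 // stop on a failed or truncated read
        result.append(chunk, got);           // append what was actually read
        remaining -= got;
    }
    return result;
}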

Either that or throw up a memory mapped file and return a set of pointers as iterators for the string. That way you'd avoid all unnecessary copying, remove one pass over the file, and speed up swapping. But I rather doubt that's necessary.
There's a question here, and if there is a standard format for a Pascal string in a file, I apologise, but I can't find anything about it on Google.

Is the string saved in text or binary format? I.e. is the length prefix in binary immediately followed by the text data, or is the length prefix encoded as text, followed by a space (or a colon, or a whatever), followed by the text data?

Nyarlath's code is treating it as text, chairthrower's as binary.

(Nyarlath - a stream's >> operator expects text-formatted data in the stream, even if the stream is opened in ios::binary mode.)

I am aware that in memory a Pascal string has a binary length followed by the characters, but in a file, unless there is a standard I am not aware of, it depends on how the file was written, and you need to know which it is to write code to read it in.
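To make the distinction concrete, here is a hedged sketch of both readings for a hypothetical file layout (the separator in the text case is assumed to be a single character, and both helpers write through &s[0], which carries the caveats discussed elsewhere in the thread):

#include <cstddef>
#include <fstream>
#include <string>

// Text prefix, e.g. "13 Hello, world!": the length as formatted text, then one separator.
std::string read_text_prefixed(std::ifstream& f)
{
    std::size_t len = 0;
    f >> len;                                 // formatted extraction of the decimal length
    f.ignore(1);                              // skip the assumed single separator character
    std::string s(len, '\0');
    if (len > 0)
        f.read(&s[0], static_cast<std::streamsize>(len));
    return s;
}

// Binary prefix: one raw length byte immediately followed by the characters.
std::string read_binary_prefixed(std::ifstream& f)
{
    unsigned char len = 0;
    f.read(reinterpret_cast<char*>(&len), 1);
    std::string s(len, '\0');
    if (len > 0)
        f.read(&s[0], len);
    return s;
}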
Quote:
2) alternatively you could use std::copy_n to fix the distance, but it seems to be deprecated.


If I'm not mistaken it is deprecated only by the VC++ compiler. Of course any function that assumes that the target buffer is large enough is unsafe, but if you use back_inserter that is not a problem.
Quote:Original post by implicit

Either that or throw up a memory mapped file and return a set of pointers as iterators for the string. That way you'd avoid all unnecessary copying, remove one pass over the file, and speed up swapping. But I rather doubt that's necessary.


Some would be very annoyed if you do that.
One suggestion, which I expect will cause people to weep and then complain, but which works on all Visual Studio STL implementations so far (though it certainly isn't guaranteed to work at all), is to abuse std::string's internal layout:
std::string ReadPascalString(const std::string& filename) {
  std::ifstream file(filename.c_str(), std::ios_base::in | std::ios_base::binary);
  int size;
  file >> size;
  std::string strRet;
  strRet.resize(size);
  file.read(&strRet[0], size);
  return strRet;
}
It might be a little safer to use the data() member function, but it's still implementation dependent. It should be fast anyway...

/me sits back and waits for the shouting [smile]
Quote:Original post by Antheus
Quote:Original post by implicit
Either that or throw up a memory mapped file and return a set of pointers as iterators for the string. That way you'd avoid all unnecessary copying, remove one pass over the file, and speed up swapping. But I rather doubt that's necessary.

Some would be very annoyed if you do that.
That's a rather odd criticism of memory mapped files. I agree that they're often ascribed near-magical properties, but complaining about poor cache behavior assumes that the OS is brain-dead and ignores the existence of hints like FILE_FLAG_SEQUENTIAL_SCAN.
Also, recommending reading the whole file in a single scan since memory mapping might block the process is just plain bizarre. A single bulk read call wouldn't allow any data processing to proceed until the whole file is read, unlike memory mapped files which are arguably the most straightforward mechanism for reading and processing data simultaneously.
Sure, asynchronous I/O can occasionally be useful to avoid blocking a UI thread or starving the other users of a server, but between the mess of reliably implementing such a scheme and the prevalence of threading it's not something I'd expect to see very often.
Finally, he's ignoring a big benefit of memory mapped files: the RAM backing them is simply a cache, and the OS knows it. So assuming your system is sane (that is, you're not running Win9x...) the pages can be purged automatically when memory runs low.

That's not to say there aren't issues with memory mapped files. But I'd focus on running out of address space, the problems of dealing with I/O errors, the unnecessary locking-down of files, or the fact that just because it might look like any other memory block doesn't mean you can jump around freely in it.

Oh, and just for the record, I'm still not recommending using memory mapped files in this case. Not unless you're seeing some real performance problems anyway.
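For completeness only, a hedged POSIX sketch of the memory-mapping idea (it still copies the bytes once into the string, the function name is made up, and Windows would use CreateFileMapping/MapViewOfFile instead of mmap):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

std::string read_via_mmap(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return std::string();

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return std::string(); }

    void* p = mmap(0, static_cast<size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                    // the mapping keeps the pages alive
    if (p == MAP_FAILED) return std::string();

    const char* bytes = static_cast<const char*>(p);
    std::string result(bytes, bytes + st.st_size);    // still one copy into the string
    munmap(p, static_cast<size_t>(st.st_size));
    return result;
}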
Quote:Original post by visitor
Quote:
2) alternatively you could use std::copy_n to fix the distance, but it seems to be deprecated.


If I'm not mistaken it is deprecated only by the VC++ compiler. Of course any function that assumes that the target buffer is large enough is unsafe, but if you use back_inserter that is not a problem.


In fact I only tried against GCC. For GCC (4.1.2), grep -R 'copy_n[^a-zA-Z_]' /usr/include/c++/ indicates that it's not C++98:

/4.1.2/ext/algorithm: // copy_n (not part of the C++ standard)

So although probably undesirable, you could use it by including <ext/algorithm> and then prefixing with the GNU extension namespace, e.g. __gnu_cxx::copy_n( ...).

Boost uses copy_n in its detail headers, but I think it always provides local definitions.
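A hedged sketch of that route, assuming GCC with <ext/algorithm> available (the helper name is made up; later standards added std::copy_n to <algorithm> in C++11):

#include <ext/algorithm>   // __gnu_cxx::copy_n (SGI extension, not standard C++98)
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

std::string read_n_with_copy_n(std::ifstream& f, std::size_t sz)
{
    std::string result;
    result.reserve(sz);
    // istreambuf_iterator avoids whitespace skipping; back_inserter grows the string
    __gnu_cxx::copy_n(std::istreambuf_iterator<char>(f), sz,
                      std::back_inserter(result));
    return result;
}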

Thanks all for the replies.
I have spent some time rating, but I think I will stick with my solution: abusing a vector or a string doesn't give a big advantage over the buffer. I have to read null-terminated strings and Pascal strings with a prefixed size of type int8, int16 and int32 (the latter not well suited for constant preallocation).
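As an illustration of handling those different prefix widths with one function, here is a hedged sketch of a reader templated on the length type (the names are made up and the prefix is assumed to be in host byte order):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

template <typename LenT>
std::string ReadPrefixedString(std::ifstream& f)
{
    LenT len = 0;
    f.read(reinterpret_cast<char*>(&len), sizeof(len));     // raw binary prefix
    std::vector<char> buf(static_cast<std::size_t>(len));
    if (!buf.empty())
        f.read(&buf[0], static_cast<std::streamsize>(buf.size()));
    return std::string(buf.begin(), buf.end());
}

For example, ReadPrefixedString<unsigned char>(f) would cover the int8 prefix, ReadPrefixedString<unsigned short>(f) the int16 one, and ReadPrefixedString<unsigned int>(f) the int32 one.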

I will look at copy_n, thank you again for the hint.

As a last whine, I would say that within the STL, the file I/O part is the least intuitive, worst documented, and least supported by VS2005's IntelliSense.

