vector for reading bytes from a file

8 comments, last by phil_t 17 years, 5 months ago
There are cases when I have to read a bunch of bytes from a source that doesn't use streams... basically, I need to pass it a buffer to fill, plus the buffer size (which is known ahead of time). If I use a vector<BYTE>, I need to resize or reserve it to that size first. resize results in a perf hit though, since it initializes every element to zero. reserve doesn't, but the vector is then not really in a consistent state, since it doesn't think it has any items in it. I can't use "auto_ptr<BYTE> buf(new BYTE[size]);" because auto_ptr uses delete, not delete[]. Is there something else available that is suited to this purpose? It would be trivial to write a class that new[]'s an array of bytes and uses delete[] in its destructor, but I'm wondering if there is already something else.
I think Boost has a shared pointer made specifically for arrays.
http://www.boost.org/libs/smart_ptr/shared_array.htm
Thanks, I'll try that... for some reason I've been reluctant to install boost :-)

But actually - the main reason I'm using Win32 file APIs instead of streams is that I need robust error checking (e.g. knowing the reasons a file couldn't be opened), which fstream doesn't supply. But I have a feeling boost's filesystem classes would suit me there too....
If you make your vector/buffer a static size, and only reserve/resize it once at app startup, the performance hit becomes unimportant.

From there, you can read the file in chunks.

If you are using a C++ filestream, you can use .read() and .gcount() to do all the work.

The value Win32's ReadFile stores through its lpNumberOfBytesRead parameter is the equivalent of the return value of .gcount().
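A minimal sketch of the chunked approach described above, assuming a stream source. The names read_all_chunked and chunk_size are made up for illustration and do not come from the thread. The buffer is sized once and reused for every chunk, and .gcount() reports how many bytes the last .read() actually delivered.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Illustrative helper (not from the thread): reads `in` to the end through
// one reusable buffer and returns everything that was read.
std::string read_all_chunked(std::istream& in, std::size_t chunk_size)
{
    std::string result;
    if (chunk_size == 0) {
        return result;                       // nothing sensible to do with an empty buffer
    }

    std::vector<char> buffer(chunk_size);    // sized (and zero-filled) exactly once
    while (in) {
        in.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
        std::streamsize got = in.gcount();   // bytes delivered by this read()
        result.append(&buffer[0], static_cast<std::size_t>(got));
    }
    return result;
}
```

Because the function takes a std::istream&, an std::ifstream opened in binary mode can be passed in directly. The final, short read sets eofbit and failbit, which ends the loop after the partial chunk has been appended.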
I think I've gotten around this by using a custom allocator. The allocator simply had empty overrides for construct() and destroy(). Eliminating the whole loop that calls these empty functions might depend on compiler optimizations, though.
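template <typename T>
class c_Allocator_Unfilled : public std::allocator<T>
{
public:
	// 'pointer' is a dependent name, so it has to be pulled in from the base explicitly.
	typedef typename std::allocator<T>::pointer pointer;

	// Intentionally empty: elements are neither initialized nor destroyed.
	void construct(pointer P, const T& V) {}
	void destroy(pointer P) {}
};

//....

std::vector<BYTE, c_Allocator_Unfilled<BYTE> > Buffer;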
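For comparison, here is a hedged sketch of how the same idea is usually written against a C++11 or later standard library, where containers call construct() through std::allocator_traits and resize(n) passes no value argument. The names DefaultInitAllocator and RawByteVector are assumptions made for this example, and unsigned char stands in for BYTE. The single-argument construct() default-initializes the element, which for byte types means no zero fill.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Illustrative allocator (name is an assumption): identical to std::allocator
// except that value-less construction leaves trivial types uninitialized.
template <typename T>
struct DefaultInitAllocator : std::allocator<T>
{
    // Make sure rebinding yields this allocator, not plain std::allocator.
    template <typename U>
    struct rebind { typedef DefaultInitAllocator<U> other; };

    DefaultInitAllocator() = default;

    template <typename U>
    DefaultInitAllocator(const DefaultInitAllocator<U>&) noexcept {}

    // Used by resize(n): default-initialization, so bytes are not zeroed.
    template <typename U>
    void construct(U* p)
    {
        ::new (static_cast<void*>(p)) U;
    }

    // Used whenever a value is supplied, e.g. push_back(x) or vector(n, x).
    template <typename U, typename... Args>
    void construct(U* p, Args&&... args)
    {
        ::new (static_cast<void*>(p)) U(std::forward<Args>(args)...);
    }
};

typedef std::vector<unsigned char, DefaultInitAllocator<unsigned char> > RawByteVector;
```

With this, RawByteVector buf; buf.resize(size); gives a vector whose size() is correct but whose contents are indeterminate until the read call fills them, so nothing should read an element before it has been written.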
"We should have a great fewer disputes in the world if words were taken for what they are, the signs of our ideas only, and not for things themselves." - John Locke
Quote:Original post by taby
If you make your vector/buffer a static size, and only reserve/resize it once at app startup, the performance hit becomes unimportant.

From there, you can read the file in chunks.

If you are using a C++ filestream, you can use .read() and .gcount() to do all the work.

The value Win32's ReadFile stores through its lpNumberOfBytesRead parameter is the equivalent of the return value of .gcount().


The data I'm reading is generally unbounded, but preceded by a byte count, so the simplest thing is just to have a buffer of the right size. Also I have multiple threads, so I'd need some way to synchronize access to a global buffer.

It's really unfortunate that file streams don't return any information about failures (other than 'end of data'). I was hoping to make my code more portable by ridding myself of win32 filesystem calls, but I have a lot of error checking code (for things like sharing violations, access denied, etc...) that communicates information to the user, and I don't want to lose that.
First off, .reserve() is fine in general. There's no such thing as an "inconsistent state" for a standard library container, unless you invoked undefined behaviour somewhere. That's the whole idea with classes, you know: maintaining an invariant.

When you .reserve() an empty vector, some uninitialized memory is requested, the capacity is set to that value, and the size remains 0. The fact that the memory beyond the size is uninitialized *doesn't matter*, because you will never read that data (e.g. by operator[]) until after it has been written (e.g. by .push_back), assuming correct program logic.
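A small sketch of that invariant, using made-up names (ReserveReport, reserve_then_report): after .reserve() the capacity has grown but the size is still 0, and only operations that go through the vector, such as .push_back(), change the size.

```cpp
#include <cstddef>
#include <vector>

// Illustrative record of what the vector reports about itself.
struct ReserveReport
{
    std::size_t size_after_reserve;
    std::size_t capacity_after_reserve;
    std::size_t size_after_push;
};

ReserveReport reserve_then_report(std::size_t n)
{
    std::vector<unsigned char> v;
    v.reserve(n);                       // requests storage, constructs no elements

    ReserveReport r;
    r.size_after_reserve = v.size();    // still 0
    r.capacity_after_reserve = v.capacity();

    v.push_back(0x2A);                  // writes through the vector, so size grows
    r.size_after_push = v.size();
    return r;
}
```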

... Oh, you wanted to read in using a direct read() call to that pointer (such that the vector doesn't "know about" the change)? Well, don't do that and *then* resize(), because that's undefined behaviour AFAIK (although I can't think of conditions in which it wouldn't work).

Seriously, in that case, just .resize(). You only have to pay the initialization cost once for the whole file. Well, multiple times if you don't know the maximum chunk size ahead of time and it's not the first one, but in total, the number of initializations is equal to the maximum chunk size (plus whatever happens during copying for resizing).

None of this is realistically going to be your bottleneck. Hint: You are doing I/O.

Just write simple code.
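As a rough illustration of the simple version: size the vector first, hand its storage to the fill-style call, then shrink to what actually arrived. MemorySource is a hypothetical stand-in for the non-stream source in the original question, and read_block is likewise an invented name.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a non-stream source that fills a caller-supplied buffer.
struct MemorySource
{
    const unsigned char* data;
    std::size_t remaining;

    // Copies up to n bytes into dst and returns how many were copied.
    std::size_t read(unsigned char* dst, std::size_t n)
    {
        std::size_t count = n < remaining ? n : remaining;
        if (count != 0) {
            std::memcpy(dst, data, count);
        }
        data += count;
        remaining -= count;
        return count;
    }
};

std::vector<unsigned char> read_block(MemorySource& src, std::size_t size)
{
    std::vector<unsigned char> buf(size);   // zero-filled once, size() is correct
    std::size_t got = buf.empty() ? 0 : src.read(&buf[0], buf.size());
    buf.resize(got);                        // shrinking keeps the bytes already read
    return buf;
}
```

The order matters here. Calling resize() after writing into storage that was only reserved would construct fresh, value-initialized elements over those bytes, so the buffer is sized before the read and only ever shrunk afterwards.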

Quote:Original post by phil_t
Also I have multiple threads, so I'd need some way to synchronize access to a global buffer.


Why can't you have a buffer per thread?
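One way this could look, assuming a C++11 compiler (thread_local did not exist when this thread was written): each thread gets its own reusable buffer, so no locking is needed. The function name thread_buffer is invented for the example.

```cpp
#include <cstddef>
#include <vector>

// Illustrative helper: returns the calling thread's own buffer, grown on
// demand so that it holds at least `size` bytes.
std::vector<unsigned char>& thread_buffer(std::size_t size)
{
    thread_local std::vector<unsigned char> buffer;  // one instance per thread
    if (buffer.size() < size) {
        buffer.resize(size);   // pays the zero-fill only when the buffer grows
    }
    return buffer;
}
```

The returned reference is only valid on the thread that asked for it, and the buffer may be larger than requested, so callers should track the byte count of the current read separately.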

Quote:It's really unfortunate that file streams don't return any information about failures (other than 'end of data').


That's not true. In addition to the 'eofbit', there is a 'badbit' indicating a corrupted stream, and a 'failbit' indicating a failed attempt to use an insertion or extraction operator (or, IIRC, an attempt to read from an output-only stream, or write to an input-only stream).
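To make the three bits concrete, here is a small sketch that turns a stream's state into text. describe_state is a made-up helper, and the labels it returns are chosen for this example only.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Illustrative helper: reports which state bits are set on a stream.
std::string describe_state(const std::istream& in)
{
    if (in.bad()) {
        return "badbit";            // the stream itself is unusable
    }
    if (in.eof() && in.fail()) {
        return "eofbit+failbit";    // tried to read past the end
    }
    if (in.eof()) {
        return "eofbit";            // hit the end, but the last read succeeded
    }
    if (in.fail()) {
        return "failbit";           // the last operation could not complete
    }
    return "good";
}
```

Note that these bits say that something went wrong, not why. They carry no operating-system error code.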

Exactly what kind of information are you looking for? If a file can't be opened, why should you care why not? Different systems have different reasons, anyway. Good luck with portability (unless you use boost::filesystem).
Quote:Original post by Zahlman
That's not true. In addition to the 'eofbit', there is a 'badbit' indicating a corrupted stream, and a 'failbit' indicating a failed attempt to use an insertion or extraction operator (or, IIRC, an attempt to read from an output-only stream, or write to an input-only stream).


'failbit' would be a problem with the calling code, so really the only error information you might possibly have is 'badbit'. In what scenarios would you ever get a corrupted stream? To sum up: filestreams won't return any "real world" error information.

IMO, this is a pretty serious lack in the standard. Maybe boost::filesystem will be part of the next standard?

Quote:
Exactly what kind of information are you looking for? If a file can't be opened, why should you care why not? Different systems have different reasons, anyway. Good luck with portability (unless you use boost::filesystem).


Why would I care why not? Because I care about usability!

In my scenario, as I mentioned, access denied, sharing violations, file not found, are actually pretty common situations. As a user, I'd much rather have an actionable error message that says "Couldn't open foo: access denied", or "Couldn't open foo: another process has the file open", rather than just "Couldn't open foo.". I hate programs that do that!
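A hedged sketch of one semi-portable way to get at the reason: call std::fopen and inspect errno on failure. ISO C does not require fopen to set errno, but POSIX does, so treat this as an assumption about the platform. The function name try_open is invented for the example. Windows-specific conditions such as sharing violations would most likely still need GetLastError() for full detail.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Illustrative helper: returns an empty string if `path` can be opened for
// reading, otherwise a message that includes the reason when one is available.
std::string try_open(const char* path)
{
    errno = 0;
    std::FILE* f = std::fopen(path, "rb");
    if (f != 0) {
        std::fclose(f);
        return std::string();
    }

    int err = errno;   // copy immediately, later calls may overwrite it
    std::string message = "Couldn't open ";
    message += path;
    message += ": ";
    message += (err != 0) ? std::strerror(err) : "unknown error";
    return message;
}
```

The wording of the reason comes from std::strerror and therefore differs between platforms and locales.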
Quote:Original post by Zahlman
None of this is realistically going to be your bottleneck. Hint: You are doing I/O.

Just write simple code.


Well, you do have a good point there, and this may be a case of too-early-optimization. Allocations/copying a few thousand extra bytes won't be noticeable in the face of disk I/O.

However, I do have some similar situations where I need arrays of bytes, and no I/O is being done, so it may apply there.

I'll just use a scoped_array.
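For readers on C++11 or later, std::unique_ptr<T[]> fills the same role as boost::scoped_array without needing Boost: it owns a new[] allocation and releases it with delete[]. A minimal sketch, with make_raw_buffer as an invented name and unsigned char standing in for BYTE:

```cpp
#include <cstddef>
#include <memory>

// Illustrative helper: allocates `size` bytes that are NOT zero-filled and
// hands ownership to a smart pointer that will call delete[].
std::unique_ptr<unsigned char[]> make_raw_buffer(std::size_t size)
{
    // `new unsigned char[size]` (no trailing parentheses) default-initializes,
    // so the contents are indeterminate until something writes to them.
    return std::unique_ptr<unsigned char[]>(new unsigned char[size]);
}
```

Usage is along the lines of: auto buf = make_raw_buffer(size); then pass buf.get() and size to the fill call. Unlike a vector, the pointer does not remember the size, so it has to be carried alongside.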

