Battery of Tests For IO Library

I've been having a hard time thinking up a battery of tests for an IO library, and any advice or ideas would be appreciated. For a point of reference, it shares a lot in common with the C standard IO library. Ideally, I'd like to test that writing and reading work, that files grow when written past the end (if permissions allow it), that read/write permissions are respected, that the EOF flag is set properly, and that seeking and telling work as expected.

One of the problems I run into is that it is hard to test reading or writing without using the opposite functionality: I can't verify that what I've written is correct without reading it back in, and I can't seem to find a way around that. One important point is that though the IO stream can be implemented in any number of ways, they all have the same interface, so if any implementation is faulty, I expect at least one of these tests to fail; getting the tests right is therefore of great importance.

What I have so far, in terms of tests (a rough sketch in code follows the list):
-Create a new file
-Write a sequence of bytes (or a few sequences, to rule out a fluke)
-Seek to the beginning
-Read back the written data (if a problem occurs here, how can I prove whether it was the reading or the writing?)
-Try to read again, to make sure that I get the EOF flag
-Explicitly check to see if stream has the EOF flag set, similarly to feof()
-Put a single byte
-Get a single byte (same caveat with the reading and the writing)
-Try getting the current file position
-Try seeking to the beginning
-Try seeking to the stored position
-Try to get the file position again, and ensure that they're the same
-Try to seek to the end of the stream
-Get the file position at the end, then seek to the beginning and back to the stored end position, to see if I still wind up at the end
-Seek to the first stored position and flush the stream, and ensure that the position did not change
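
Here is a minimal sketch of that sequence in code, using C's stdio as a stand-in for my library's interface (the real API names differ, but the semantics are meant to match):

    #include <stdio.h>
    #include <string.h>

    /* Returns 0 on success, or the number of the first failing step. */
    int run_basic_io_tests(const char *path)
    {
        static const char data[] = "test pattern 0123456789";
        char buf[sizeof data];
        long pos, end_pos;
        FILE *f = fopen(path, "w+b");
        if (f == NULL) return 1;

        /* Write a sequence of bytes, seek back, read it in again. */
        if (fwrite(data, 1, sizeof data, f) != sizeof data) return 2;
        if (fseek(f, 0L, SEEK_SET) != 0) return 3;
        if (fread(buf, 1, sizeof buf, f) != sizeof buf) return 4;
        if (memcmp(buf, data, sizeof data) != 0) return 5;

        /* Reading past the end must fail and set the EOF flag. */
        if (fread(buf, 1, 1, f) != 0) return 6;
        if (!feof(f)) return 7;

        /* Tell, seek away, seek back to the stored position, tell again. */
        if (fseek(f, 0L, SEEK_SET) != 0) return 8;
        pos = ftell(f);
        if (fseek(f, 0L, SEEK_END) != 0) return 9;
        end_pos = ftell(f);
        if (fseek(f, pos, SEEK_SET) != 0 || ftell(f) != pos) return 10;
        if (fseek(f, end_pos, SEEK_SET) != 0 || ftell(f) != end_pos) return 11;

        /* Flushing must not move the file position. */
        if (fseek(f, pos, SEEK_SET) != 0) return 12;
        if (fflush(f) != 0) return 13;
        if (ftell(f) != pos) return 14;

        fclose(f);
        return 0;
    }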

I'm drawing a blank on the rest so far.

Additionally, should the function for getting a string from the stream explicitly call the function to get a single character until a newline is encountered, or should the string getting function be independent of the character get function for speed, and depend upon the stream's implementation?
I might try adding some more "error case" tests, like attempting to read/write a negative number of bytes or something for example. It's good to make sure your libraries behave nicely even when given bad input.
I personally enjoy randomized testing against a reference library (such as C's stdio or C++'s iostream) when I have the luxury of having such a reference. The idea is to perform a series of completely random operations, even operations which are meaningless and should result in errors, in parallel on both implementations, and as soon as you get divergent results which are not a feature of your library (hopefully you can enumerate those), you've failed and you go to the code to fix it. This can be executed fairly quickly and eventually the entire search space will be covered. When everything works, you're probably ready to launch the library into the real world and iron out any remaining issues through bug reports.
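
A sketch of what such a harness could look like; here both sides wrap stdio so the sketch compiles on its own, but in real use one vtable would wrap the library under test (all names are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A stream interface the harness can drive; h is the implementation's
     * handle (a FILE * in the stdio wrappers below). */
    struct io_vtbl {
        void *h;
        size_t (*read)(void *h, void *p, size_t n);
        size_t (*write)(void *h, const void *p, size_t n);
        int (*seek)(void *h, long off, int whence);
        long (*tell)(void *h);
    };

    static size_t f_read(void *h, void *p, size_t n) { return fread(p, 1, n, (FILE *)h); }
    static size_t f_write(void *h, const void *p, size_t n) { return fwrite(p, 1, n, (FILE *)h); }
    static int f_seek(void *h, long off, int whence) { return fseek((FILE *)h, off, whence); }
    static long f_tell(void *h) { return ftell((FILE *)h); }

    /* Drives both streams through the same random operations; returns the
     * iteration of the first divergence, or 0 if none. Every read/write is
     * preceded by a seek so stdio's update-stream sequencing rules hold. */
    long differential_test(struct io_vtbl *x, struct io_vtbl *y, long iters)
    {
        unsigned char a[64], b[64];
        long i;
        for (i = 1; i <= iters; ++i) {
            long off = rand() % 4096;
            size_t n = (size_t)(rand() % (int)sizeof a) + 1;
            int j;
            switch (rand() % 3) {
            case 0: /* seek then read: counts and bytes must match */
                if (x->seek(x->h, off, SEEK_SET) != y->seek(y->h, off, SEEK_SET))
                    return i;
                memset(a, 0, sizeof a);
                memset(b, 0, sizeof b);
                if (x->read(x->h, a, n) != y->read(y->h, b, n)) return i;
                if (memcmp(a, b, sizeof a) != 0) return i;
                break;
            case 1: /* seek then write identical random bytes */
                if (x->seek(x->h, off, SEEK_SET) != y->seek(y->h, off, SEEK_SET))
                    return i;
                for (j = 0; j < (int)n; ++j) a[j] = (unsigned char)(rand() % 256);
                if (x->write(x->h, a, n) != y->write(y->h, a, n)) return i;
                break;
            case 2: /* file positions must agree */
                if (x->tell(x->h) != y->tell(y->h)) return i;
                break;
            }
        }
        return 0;
    }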

I might try adding some more "error case" tests, like attempting to read/write a negative number of bytes or something for example. It's good to make sure your libraries behave nicely even when given bad input.

Hm... all of my quantities are unsigned, both to prevent that issue and to allow larger amounts to be read and written at once without overflow. I could try passing zeroes and null pointers, and values that would seek past the beginning or end of the stream. However, most of these would be caught by the highest level of the stream during input validation, before they ever reach the stream's actual implementation. These tests will still be useful, but unless I change the interface, most of them can no longer fail, even if the underlying implementation is faulty.
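
For the degenerate inputs whose behaviour is defined even in stdio, the checks are straightforward; a small sketch (assuming a freshly opened stream with no flags set):

    #include <stdio.h>

    /* Zero-size reads and writes must return 0 and leave the position and
     * the error/EOF flags untouched. Returns 0 on success. */
    int zero_size_test(FILE *f)
    {
        char c;
        long before = ftell(f);
        if (fread(&c, 1, 0, f) != 0) return 1;  /* zero elements */
        if (fwrite(&c, 0, 1, f) != 0) return 2; /* zero-sized elements */
        if (ftell(f) != before) return 3;
        if (ferror(f) || feof(f)) return 4;
        return 0;
    }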

I personally enjoy randomized testing against a reference library (such as C's stdio or C++'s iostream) when I have the luxury of having such a reference. The idea is to perform a series of completely random operations, even operations which are meaningless and should result in errors, in parallel on both implementations, and as soon as you get divergent results which are not a feature of your library (hopefully you can enumerate those), you've failed and you go to the code to fix it. This can be executed fairly quickly and eventually the entire search space will be covered. When everything works, you're probably ready to launch the library into the real world and iron out any remaining issues through bug reports.

Randomized tests sound good. I've more than learned by now that a test program passing does not mean the code is bug free; my previous memory allocator worked perfectly until the program allocated a certain amount of memory relative to the size of the pool, and I only found the bug when an actual application requested large amounts of memory. However, it does leave open the features that C's standard IO does not support but mine does.

I will implement some sort of randomized testing; I never really thought of that. Most of the tests will be aimed at making sure the underlying stream implementation performs as expected, because there are a lot of stream types, and testing each of them bit by bit would be troublesome.

Additionally, should the function for getting a string from the stream explicitly call the function to get a single character until a newline is encountered, or should the string getting function be independent of the character get function for speed, and depend upon the stream's implementation?

Well, in my opinion I'd mark the single-character function inline (as it's probably quite small) so that calling it in a loop won't be so costly, instead of duplicating code, but it really depends on how the streams are implemented. It may make more sense to buffer and seek the newline in memory if you're not already caching I/O operations (and it might be faster, though working with textual data is always going to be slow in any case).
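
For illustration, the portable fallback built on the single-character function might look like this (fgetc stands in for your get-character function, and the name is made up):

    #include <stdio.h>

    /* Reads up to n-1 bytes through the first newline (kept, as fgets()
     * does), NUL-terminates, and returns s, or NULL on immediate EOF. */
    char *my_get_string(char *s, int n, FILE *f)
    {
        int i = 0, c;
        if (n <= 1) return NULL;
        while (i < n - 1 && (c = fgetc(f)) != EOF) {
            s[i++] = (char)c;
            if (c == '\n') break;
        }
        if (i == 0) return NULL; /* nothing read before EOF */
        s[i] = '\0';
        return s;
    }

Each back-end could then override this generic version with something faster where it makes sense.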

And yes, randomized testing won't help you test stuff that the reference library doesn't do, so you'll need to resort to conventional testing for those features.

Well, in my opinion I'd mark the single-character function inline (as it's probably quite small) so that calling it in a loop won't be so costly, instead of duplicating code,

I can't guarantee that I have inline support, since it is written in C89. For this reason, they are all normal functions; the parts that absolutely must be inlined are safely written macros.


but it really depends on how the streams are implemented. It may make more sense to buffer and seek the newline in memory if you're not already caching I/O operations (and it might be faster, though working with textual data is always going to be slow in any case).

This is one of the major reasons that I have it done through the stream back-end itself; for instance, the memory map stream can do it much faster by iterating through a block of memory, as opposed to a general function that reads character by character. However, it seems like I'd be saving a lot of time and strain by making the get-string function higher level, doing it as explicitly stated in the C standard. Caching sounds like a great choice, but with a buffer of n bytes it would mean an (n - 1)/n chance of reading too far and needing to seek backward; if the newline is not exactly at the end of the buffer, I'd have to seek the file pointer back, and seeking backward is something I avoid at all costs.
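
To make the trade-off concrete, the buffered variant I'm describing would look something like this (sketched with stdio calls; this is the version that needs the backward seek):

    #include <stdio.h>
    #include <string.h>

    /* Reads one line into out (cap bytes), returning its length or -1.
     * Overshoot past the newline is undone with a backward seek. */
    long read_line_buffered(char *out, size_t cap, FILE *f)
    {
        char chunk[256];
        size_t used = 0;
        for (;;) {
            size_t got = fread(chunk, 1, sizeof chunk, f);
            char *nl;
            if (got == 0) break; /* EOF or error */
            nl = memchr(chunk, '\n', got);
            if (nl != NULL) {
                size_t keep = (size_t)(nl - chunk) + 1;
                /* Seek back over the bytes read past the newline. */
                if (fseek(f, -(long)(got - keep), SEEK_CUR) != 0) return -1;
                got = keep;
            }
            if (used + got >= cap) return -1; /* caller's buffer too small */
            memcpy(out + used, chunk, got);
            used += got;
            if (nl != NULL) break;
        }
        out[used] = '\0';
        return (long)used;
    }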

On the other hand, streams that cannot function without seeking backward to maintain the appearance of a normal stream (like an encrypted stream using a block cipher) would be horrendously slow to read one character at a time: the current implementation re-reads and decrypts the whole block for every character requested. Over time, I could optimize it to skip re-reading the encrypted block when the file position is still within the same block.
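
The optimization I have in mind is just a one-block cache, something like this (the struct and callback are illustrative, not my actual types):

    #include <stddef.h>

    #define BLK 16 /* cipher block size, e.g. AES */

    struct enc_stream {
        unsigned char cache[BLK]; /* plaintext of the cached block */
        long cached_blk;          /* index of the cached block, -1 if none */
        long pos;                 /* logical read position in bytes */
        /* Reads and decrypts block idx into out; returns 0 on success. */
        int (*decrypt_block)(struct enc_stream *s, long idx, unsigned char *out);
    };

    /* Returns the next plaintext byte, or -1 on error/EOF. Repeated calls
     * within one block decrypt it only once. */
    int enc_getc(struct enc_stream *s)
    {
        long blk = s->pos / BLK;
        if (blk != s->cached_blk) { /* cache miss: fetch and decrypt */
            if (s->decrypt_block(s, blk, s->cache) != 0) return -1;
            s->cached_blk = blk;
        }
        return s->cache[s->pos++ % BLK];
    }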

Basically, I'm torn between being fast, or being simple and easy to debug.


And yes, randomized testing won't help you test stuff that the reference library doesn't do, so you'll need to resort to conventional testing for those features.


To resolve this issue, I plan to ensure that the most basic file stream is bug free, and use that as a control in the test; the file stream calls whichever OS's file routines in its back-end, so it should be pretty safe.
Caching sounds like a great choice, but with a buffer of n bytes it would mean an (n - 1)/n chance of reading too far and needing to seek backward; if the newline is not exactly at the end of the buffer, I'd have to seek the file pointer back, and seeking backward is something I avoid at all costs.
Well, then since you don't want to seek backward, I don't see a solution other than incrementing the file position by one byte repeatedly until you hit the newline. Is it really that bad to seek backwards? It does cut out a large class of possible optimizations and design methods.

On the other hand, streams that cannot function without seeking backward to maintain the appearance of a normal stream (like an encrypted stream using a block cipher) would be horrendously slow to read one character at a time...
This is off-topic but just out of curiosity, how are you implementing encrypted streams? Are you using CTR mode for random-access encryption/decryption? And yes, I think in this situation caching would be worthwhile, as it is common to conditionally read binary streams (e.g. read the first byte, then read the next four if the first byte has some value). But of course, too much caching is very bad as well; it's difficult to make the perfect library suited to every task at such a low level.
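
For reference, the reason CTR mode gives cheap random access is that the keystream for block i depends only on i, so you can decrypt at any byte offset without touching other blocks. A sketch, where keystream_block() is a deliberately fake placeholder (a real one would compute E_K(nonce || i)):

    #include <stddef.h>

    #define BLK 16

    /* Placeholder keystream generator -- NOT a real cipher. */
    static void keystream_block(unsigned long i, unsigned char ks[])
    {
        int j;
        for (j = 0; j < BLK; ++j)
            ks[j] = (unsigned char)((i * 2654435761UL + j * 40503UL) >> 7);
    }

    /* Decrypts len bytes of CTR ciphertext in place, starting at an
     * arbitrary byte offset off; no other block is ever computed. */
    void ctr_decrypt_at(unsigned long off, unsigned char *buf, size_t len)
    {
        unsigned char ks[BLK];
        size_t k;
        for (k = 0; k < len; ++k) {
            unsigned long byte = off + k;
            if (k == 0 || byte % BLK == 0) /* entering a new block */
                keystream_block(byte / BLK, ks);
            buf[k] ^= ks[byte % BLK];
        }
    }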


Basically, I'm torn between being fast, or being simple and easy to debug.

Honestly I recommend a bit of both, leaning towards maintainability. Having something completely self-documenting but pathetically slow is just as bad as having something super-fast but unreadable. Code naturally gets faster with time, but it doesn't get any cleaner.

Well, then since you don't want to seek backward, I don't see a solution other than incrementing the file position by one byte repeatedly until you hit the newline.

That would be the way explicitly specified by the C standard; the other way would have the underlying implementation bend the rules a little behind the scenes, like a memory stream scanning its entire buffer without moving the file pointer, or a buffered stream scanning its buffer without performing any underlying reads. Each stream knows the fastest way to scan its data without seeking backwards (or, if a stream knows it can seek backwards, it can do so without the caller knowing).
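
For example, a memory stream's specialization might be nothing more than one memchr() and a single forward move of its position (the struct here is illustrative):

    #include <string.h>

    struct mem_stream {
        const char *data;
        size_t size, pos;
    };

    /* Copies up to cap-1 bytes through the newline into out and
     * NUL-terminates; the position only ever moves forward. */
    size_t mem_get_string(struct mem_stream *m, char *out, size_t cap)
    {
        size_t avail = m->size - m->pos;
        const char *nl = memchr(m->data + m->pos, '\n', avail);
        size_t n = (nl != NULL) ? (size_t)(nl - (m->data + m->pos)) + 1 : avail;
        if (cap == 0) return 0;
        if (n >= cap) n = cap - 1;
        memcpy(out, m->data + m->pos, n);
        out[n] = '\0';
        m->pos += n;
        return n;
    }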


Is it really that bad to seek backwards? It does cut out a large class of possible optimizations and design methods.

I know that it won't always be possible to avoid seeking backwards; a lot of my streams require that the target stream beneath them be backward-seekable for the functionality to work, but I'd like the parts of code that actually use the stream to avoid seeking backward, to allow things like pipes to work.

This is off-topic but just out of curiosity, how are you implementing encrypted streams? Are you using CTR mode for random-access encryption/decryption?

It's implemented in such a way that the encryption algorithm and mode don't make a difference. The stream makes use of my encryption module, and ensures that block ciphers and stream ciphers both work through the same code path. For instance, AES in ECB mode and RC4 can both be used through the same stream implementation (though not at the same time); all it requires is that you set up the encryption information correctly and use it to attach an encrypted stream to the data. The encryption stream requires seeking backward, so if the underlying stream cannot seek, it must be cached to a seekable stream type.
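
Roughly speaking, the stream only sees a descriptor like the following; the layout and names are made up for illustration, but they capture the idea:

    #include <stddef.h>

    /* What the encryption stream sees: a block size (1 for stream ciphers
     * such as RC4, 16 for AES, ...) plus opaque state and callbacks. */
    struct cipher_desc {
        size_t block_size;
        void *state; /* cipher-specific key/state, opaque to the stream */
        void (*encrypt)(void *state, unsigned char *buf, size_t nblocks);
        void (*decrypt)(void *state, unsigned char *buf, size_t nblocks);
    };

    /* The stream rounds every request up to whole blocks and calls through
     * the descriptor; it never knows which cipher is underneath. */
    size_t blocks_needed(const struct cipher_desc *c, size_t nbytes)
    {
        return (nbytes + c->block_size - 1) / c->block_size;
    }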

Granted, I don't have all of the block cipher modes implemented, but once they are, it wouldn't be difficult to make them function properly, and the code for different ciphers doesn't affect the others in the general encryption stream; the stream only calls functions for optional bookkeeping, plus general encrypt and decrypt functions, based on the information it has.


Honestly I recommend a bit of both, leaning towards maintainability. Having something completely self-documenting but pathetically slow is just as bad as having something super-fast but unreadable. Code naturally gets faster with time, but it doesn't get any cleaner.

Yeah. In the name of reasonable speed, it may be best to leave it alone until I can find a way to make all of the streams fast enough that the generic version becomes a reasonable option.

One of the problems I run into is that it is hard to test reading or writing without using the opposite functionality: I can't verify that what I've written is correct without reading it back in.

Refactor to allow reading from and writing to abstract sources and sinks. A simple implementation of the source interface can then be trivially implemented to allow reading from character buffers for testing purposes. The same can be done for sinks/writing.

You can also have sources/sinks that fail at specific points for testing failure cases deterministically, or at random points for fuzz testing, should that be required.
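
A sketch of such a source, with failure injection built in (names are illustrative):

    #include <stddef.h>
    #include <string.h>

    /* The abstract source interface: anything that can be read from. */
    struct source {
        size_t (*read)(struct source *s, void *p, size_t n);
    };

    /* A memory-backed source that stops returning data after fail_after
     * bytes, so failures happen at a deterministic point. Set fail_after
     * to size for a source that never fails. */
    struct mem_source {
        struct source base; /* must be first, so the cast below is valid */
        const unsigned char *data;
        size_t size, pos, fail_after;
    };

    static size_t mem_read(struct source *s, void *p, size_t n)
    {
        struct mem_source *m = (struct mem_source *)s;
        size_t avail = m->size - m->pos;
        if (m->pos >= m->fail_after) return 0; /* injected failure */
        if (n > m->fail_after - m->pos) n = m->fail_after - m->pos;
        if (n > avail) n = avail;
        memcpy(p, m->data + m->pos, n);
        m->pos += n;
        return n;
    }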
