Battery of Tests For IO Library



#1 Ectara   Crossbones+   -  Reputation: 2970


Posted 01 December 2012 - 08:43 PM

I've been having a hard time thinking up a battery of tests for an IO library, and any advice or ideas would be appreciated. For a point of reference, it shares a lot in common with the C standard IO library. Ideally, I'd like to test that writing and reading work, that files can be grown when written past the end (permissions allowing), that read/write permissions are enforced, that the EOF flag is set properly, and that seeking and telling work as expected.

One of the problems that I run into is that it is hard to test whether reading or writing works without using the opposite functionality; it is hard to test whether or not what I've written is correct if I don't read it back in to test it. I can't seem to find a way around it. One important point is that although the IO stream can be implemented in any number of ways, the implementations all share the same interface. I do expect at least one of these tests to fail for some of the streams, so getting them right is of great importance.

What I have so far, in terms of tests:
-Create a new file
-Write a sequence of bytes (or a few sequences, to rule out a fluke)
-Seek to the beginning
-Read in the written data (if a problem occurs, how can I prove whether it was the reading or the writing that failed?)
-Try to read again, to make sure that I get the EOF flag
-Explicitly check to see if stream has the EOF flag set, similarly to feof()
-Put a single byte
-Get a single byte (same caveat with the reading and the writing)
-Try getting the current file position
-Try seeking to the beginning
-Try seeking to the stored position
-Try to get the file position again, and ensure that they're the same
-Try to seek to the end of the stream
-Get the file position at the end, and try seeking to the beginning and to the end's stored position to see if I still wind up at the end
-Seek to the first stored position and flush the stream, and ensure that the position did not change
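
Roughly, the first several of those steps might look like the sketch below. The io_write/io_read/io_seek/io_tell/io_eof names, the IO_SEEK_* constants, and the io_stream type are placeholders I'm using for illustration, not my actual interface:

```c
#include <assert.h>
#include <string.h>

static void test_round_trip(io_stream *s)
{
    const char data[] = "hello, world";
    char buffer[sizeof data];
    long pos;

    /* Write a known sequence, seek back, and read it in again. */
    assert(io_write(s, data, sizeof data) == sizeof data);
    assert(io_seek(s, 0, IO_SEEK_SET) == 0);
    assert(io_read(s, buffer, sizeof buffer) == sizeof buffer);
    assert(memcmp(data, buffer, sizeof data) == 0);

    /* A further read should report end-of-stream and set the EOF flag. */
    assert(io_read(s, buffer, 1) == 0);
    assert(io_eof(s));

    /* Tell/seek round trip: a stored position must be reproducible. */
    assert(io_seek(s, 0, IO_SEEK_SET) == 0);
    pos = io_tell(s);
    assert(io_seek(s, 0, IO_SEEK_END) == 0);
    assert(io_seek(s, pos, IO_SEEK_SET) == 0);
    assert(io_tell(s) == pos);
}
```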

I'm drawing a blank on the rest so far.

Additionally, should the function for getting a string from the stream explicitly call the function to get a single character until a newline is encountered, or should the string-getting function be independent of the character-getting function for speed, and depend upon the stream's implementation?


#2 Samith   Members   -  Reputation: 2256


Posted 01 December 2012 - 09:42 PM

I might try adding some more "error case" tests, like attempting to read/write a negative number of bytes, for example. It's good to make sure your libraries behave nicely even when given bad input.

#3 Bacterius   Crossbones+   -  Reputation: 8880


Posted 02 December 2012 - 12:24 AM

I personally enjoy randomized testing against a reference library (such as C's stdio or C++'s iostream) when I have the luxury of having such a reference. The idea is to perform a series of completely random operations, even operations which are meaningless and should result in errors, in parallel on both implementations, and as soon as you get divergent results which are not a feature of your library (hopefully you can enumerate those), you've failed and you go to the code to fix it. This can be executed fairly quickly and eventually the entire search space will be covered. When everything works, you're probably ready to launch the library into the real world and iron out any remaining issues through bug reports.
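
Something along these lines, with the io_* calls standing in for whatever your library names its functions (completely made-up names, just to show the shape of it):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

/* One step of the random walk: apply the same randomly chosen operation
 * to a reference FILE* and to the library stream, then compare whatever
 * is observable. */
static void fuzz_step(FILE *ref, io_stream *mine)
{
    unsigned char data[64], out_ref[64], out_mine[64];
    size_t n = (size_t)(rand() % 64) + 1;
    size_t i, got_ref, got_mine;
    long off;

    for (i = 0; i < n; i++)
        data[i] = (unsigned char)(rand() & 0xff);

    switch (rand() % 3) {
    case 0: /* read, then compare only the bytes actually returned */
        got_ref  = fread(out_ref, 1, n, ref);
        got_mine = io_read(mine, out_mine, n);
        assert(got_ref == got_mine);
        assert(memcmp(out_ref, out_mine, got_ref) == 0);
        break;
    case 1: /* write the same random bytes to both */
        assert(fwrite(data, 1, n, ref) == io_write(mine, data, n));
        break;
    default: /* seek both to the same offset and compare positions */
        off = rand() % 1024;
        fseek(ref, off, SEEK_SET);
        io_seek(mine, off, IO_SEEK_SET);
        assert(ftell(ref) == io_tell(mine));
        break;
    }
}
```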



#4 Ectara   Crossbones+   -  Reputation: 2970


Posted 02 December 2012 - 07:17 AM

I might try adding some more "error case" tests, like attempting to read/write a negative number of bytes, for example. It's good to make sure your libraries behave nicely even when given bad input.

Hm... all of my quantities are unsigned, both to prevent that issue and to allow larger quantities to be read and written at once without overflow. I could try passing zeroes and null pointers, and also values that would seek past the beginning or end of the stream. However, most of these would be caught at the highest level of the stream, during input validation, before they ever reach the actual implementation. These tests will still be useful, but unless I change the interface, most of them would never fail, even if the underlying implementation is faulty.

I personally enjoy randomized testing against a reference library (such as C's stdio or C++'s iostream) when I have the luxury of having such a reference. The idea is to perform a series of completely random operations, even operations which are meaningless and should result in errors, in parallel on both implementations, and as soon as you get divergent results which are not a feature of your library (hopefully you can enumerate those), you've failed and you go to the code to fix it. This can be executed fairly quickly and eventually the entire search space will be covered. When everything works, you're probably ready to launch the library into the real world and iron out any remaining issues through bug reports.

Randomized tests sound good. I've more than learned by now that a test program passing does not mean the code is bug free; my previous memory allocator worked perfectly until the program allocated a certain amount of memory relative to the size of the pool, and I only found the bug when actually using it in an application that requested large amounts of memory. However, testing against a reference does leave untested the features that my library supports but C's standard IO does not.

I will implement some sort of randomized tests; I never really thought of that. Most of the tests will be aimed at making sure that the underlying stream's implementation performs as expected, because there are a lot of stream types, and testing all of them bit by bit would be troublesome.

#5 Ectara   Crossbones+   -  Reputation: 2970


Posted 02 December 2012 - 03:43 PM

Additionally, should the function for getting a string from the stream explicitly call the function to get a single character until a newline is encountered, or should the string-getting function be independent of the character-getting function for speed, and depend upon the stream's implementation?



#6 Bacterius   Crossbones+   -  Reputation: 8880


Posted 02 December 2012 - 03:55 PM

Additionally, should the function for getting a string from the stream explicitly call the function to get a single character until a newline is encountered, or should the string-getting function be independent of the character-getting function for speed, and depend upon the stream's implementation?

Well, in my opinion I'd mark the single-character function inline (as it's probably quite small) so that calling it in a loop won't be so costly, instead of duplicating code, but it really depends on how the streams are implemented. It may make more sense to buffer and seek the newline in memory if you're not already caching I/O operations (and it might be faster, though working with textual data is always going to be slow in any case).
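
For the first option, the string function reduces to a loop over the character function, something like this sketch (io_getc, IO_EOF, and io_stream are invented names here, and the EOF convention is assumed):

```c
/* Read at most size-1 characters, up to and including a newline,
 * NUL-terminate the result, and return buf; return NULL if end of
 * stream was hit before anything was read.  Mirrors fgets(). */
char *io_gets(io_stream *s, char *buf, size_t size)
{
    size_t i = 0;
    int c = IO_EOF;

    if (size == 0)
        return NULL;

    while (i + 1 < size && (c = io_getc(s)) != IO_EOF) {
        buf[i++] = (char)c;
        if (c == '\n')
            break;
    }

    if (i == 0 && c == IO_EOF)
        return NULL;

    buf[i] = '\0';
    return buf;
}
```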

And yes, randomized testing won't help you test stuff that the reference library doesn't do, so you'll need to resort to conventional testing for those features.



#7 Ectara   Crossbones+   -  Reputation: 2970


Posted 02 December 2012 - 04:42 PM

Well, in my opinion I'd mark the single-character function inline (as it's probably quite small) so that calling it in a loop won't be so costly, instead of duplicating code,

I can't guarantee that I have inline support, since it is written in C89. For this reason, they are all normal functions; the parts that absolutely must be inlined are safely written macros.

but it really depends on how the streams are implemented. It may make more sense to buffer and seek the newline in memory if you're not already caching I/O operations (and it might be faster, though working with textual data is always going to be slow in any case).

This is one of the major reasons that I have it done through the stream back-end itself; for instance, the memory map stream can do it much faster by iterating through a block of memory, as opposed to a general function that reads character by character. However, it seems like I'd be saving a lot of time and strain by making the get-string function higher level and doing it exactly as the C standard describes. Caching sounds like a great choice, but for a buffer size of n bytes it would leave me with an (n - 1) / n chance of reading too far; if the newline is not at the end of the buffer, I'd have to seek the file pointer back, and seeking backward is something I avoid at all costs.

On the other hand, streams that cannot maintain a normal stream appearance without seeking backward (like an encrypted stream using a block cipher) would be horrendously slow when reading one character at a time: the current implementation re-reads and decrypts the containing block for every character requested (over time, I could optimize it to skip the re-read if the file position is still within the same block).
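
The optimization I have in mind would look roughly like this; the enc_stream fields and the helper names are only for the sake of the sketch:

```c
/* Return one plaintext byte at the current position, re-reading and
 * decrypting the underlying block only when the position has moved
 * outside the block that is already cached. */
static int enc_getc(enc_stream *es)
{
    unsigned long block = es->pos / es->block_size;
    unsigned long off   = es->pos % es->block_size;

    if (!es->cache_valid || block != es->cached_block) {
        if (read_raw_block(es, block, es->cipher_buf) != 0)
            return IO_EOF;
        decrypt_block(es, es->cipher_buf, es->plain_buf);
        es->cached_block = block;
        es->cache_valid  = 1;
    }

    es->pos++;
    return es->plain_buf[off];
}
```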

Basically, I'm torn between being fast, or being simple and easy to debug.

And yes, randomized testing won't help you test stuff that the reference library doesn't do, so you'll need to resort to conventional testing for those features.


To resolve this issue, I plan to ensure that the most basic file stream is bug free, and use that as a control in the test; the file stream calls whichever OS's file routines in its back-end, so it should be pretty safe.

#8 Bacterius   Crossbones+   -  Reputation: 8880


Posted 02 December 2012 - 05:08 PM

This is one of the major reasons that I have it done through the stream back-end itself; for instance, the memory map stream can do it much faster by iterating through a block of memory, as opposed to a general function that reads character by character. However, it seems like I'd be saving a lot of time and strain by making the get-string function higher level and doing it exactly as the C standard describes. Caching sounds like a great choice, but for a buffer size of n bytes it would leave me with an (n - 1) / n chance of reading too far; if the newline is not at the end of the buffer, I'd have to seek the file pointer back, and seeking backward is something I avoid at all costs.

Well, then since you don't want to seek backward, I don't see a solution other than incrementing the file position by one byte repeatedly until you hit the newline. Is it really that bad to seek backwards? It does cut out a large class of possible optimizations and design methods.

On the other hand, streams that cannot maintain a normal stream appearance without seeking backward (like an encrypted stream using a block cipher) would be horrendously slow when reading one character at a time: the current implementation re-reads and decrypts the containing block for every character requested (over time, I could optimize it to skip the re-read if the file position is still within the same block).

This is off-topic but just out of curiosity, how are you implementing encrypted streams? Are you using CTR mode for random-access encryption/decryption? And yes, I think in this situation caching would be worthwhile, as it is common to conditionally read binary streams (e.g. read first byte, then read the next four if the first byte has some value). But of course, too much caching is very bad as well - it's difficult to make the perfect library suited to every task at such a low level.

Basically, I'm torn between being fast, or being simple and easy to debug.

Honestly I recommend a bit of both, leaning towards maintainability. Like having something completely self-documenting, but pathetically slow is just as bad as having a super-fast but unreadable implementation. Code naturally gets faster with time, but it doesn't get any cleaner.



#9 Ectara   Crossbones+   -  Reputation: 2970


Posted 03 December 2012 - 09:33 AM

Well, then since you don't want to seek backward, I don't see a solution other than incrementing the file position by one byte repeatedly until you hit the newline.

That would be the way explicitly specified by the C standard; the other way would have the underlying implementation bend the rules a little bit behind the scenes, like a memory stream being able to read through the entire buffer without moving the file pointer, or a buffered stream being able to scan its buffer without performing any actions. Each stream knows the fastest way to scan its data without seeking backwards (or, if the stream knows it can seek backwards, it can do so without the caller knowing).
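
For the memory stream, for example, the back-end version could be as simple as this sketch (mem_stream and its fields are illustrative only):

```c
#include <string.h>

/* Scan for the newline directly in the backing buffer instead of
 * fetching one character at a time; the position only moves forward. */
static size_t mem_get_line(mem_stream *ms, char *out, size_t size)
{
    const char *start = ms->data + ms->pos;
    size_t avail = ms->len - ms->pos;
    const char *nl;
    size_t n;

    if (size == 0)
        return 0;

    nl = memchr(start, '\n', avail);
    n = nl ? (size_t)(nl - start) + 1 : avail;
    if (n > size - 1)
        n = size - 1;

    memcpy(out, start, n);
    out[n] = '\0';
    ms->pos += n;
    return n;
}
```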

Is it really that bad to seek backwards? It does cut out a large class of possible optimizations and design methods.

I know that it won't always be possible to avoid seeking backwards; a lot of my streams require that the target stream beneath them be backward-seekable for the functionality to work, but I'd like the parts of code that actually use the stream to avoid seeking backward, to allow things like pipes to work.

This is off-topic but just out of curiosity, how are you implementing encrypted streams? Are you using CTR mode for random-access encryption/decryption?

It's implemented in such a way that the encryption algorithm and mode don't make a difference. The stream makes use of my encryption module, and makes sure that block ciphers and stream ciphers both can work through the same code path. For instance, AES ECB and RC4 can both be used through the same stream implementation (not at the same time); all it requires is that you set up the encryption information correctly, and use it to attach an encrypted stream to the data. The encryption stream requires seeking backward, so if it cannot be seeked, it must be cached to a seekable stream type.

Granted, I don't have all of the block cipher modes implemented, but once I do, it wouldn't be that difficult to make them function properly without the code for different ciphers affecting each other in the general encryption stream; the stream only calls functions for optional bookkeeping, plus general encrypt and decrypt functions, based on the information it has.
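
Conceptually, the stream just works through a small table of callbacks, something like this (the names are made up for the example, not my actual structures):

```c
/* A cipher description that lets block ciphers and stream ciphers share
 * the same encrypted-stream code path. */
typedef struct {
    size_t block_size;   /* 1 for stream ciphers such as RC4 */
    void (*encrypt)(void *state, const unsigned char *in,
                    unsigned char *out, size_t len);
    void (*decrypt)(void *state, const unsigned char *in,
                    unsigned char *out, size_t len);
    void (*reset)(void *state);   /* optional bookkeeping hook */
} cipher_ops;
```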

Honestly I recommend a bit of both, leaning towards maintainability. Like having something completely self-documenting, but pathetically slow is just as bad as having a super-fast but unreadable implementation. Code naturally gets faster with time, but it doesn't get any cleaner.

Yeah. In the name of reasonable speed, it may be best to leave it until I can find a way to make all of the streams fast enough to make the generic version a reasonable option.

Edited by Ectara, 03 December 2012 - 10:36 AM.


#10 edd   Members   -  Reputation: 2105


Posted 03 December 2012 - 11:56 AM

One of the problems that I run into is that it is hard to test whether reading or writing works without using the opposite functionality; it is hard to test whether or not what I've written is correct if I don't read it back in to test it.

Refactor to allow reading from and writing to abstract sources and sinks. A simple source that reads from a character buffer can then be implemented trivially for testing purposes. The same can be done for sinks/writing.

You can also have sources/sinks that fail at specific points for testing failure cases deterministically, or at random points for fuzz testing, should that be required.
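
In outline, something like this (the names are only illustrative):

```c
#include <stddef.h>
#include <string.h>

/* An abstract source: anything that can produce bytes. */
typedef struct source {
    size_t (*read)(struct source *self, void *buf, size_t n);
} source;

/* A trivial in-memory source for tests. */
typedef struct {
    source base;
    const unsigned char *data;
    size_t len, pos;
} mem_source;

static size_t mem_read(source *self, void *buf, size_t n)
{
    mem_source *m = (mem_source *)self;
    size_t avail = m->len - m->pos;

    if (n > avail)
        n = avail;
    memcpy(buf, m->data + m->pos, n);
    m->pos += n;
    return n;
}

/* A source that succeeds until a chosen byte count and then reports
 * failure, so error paths can be exercised deterministically. */
typedef struct {
    source base;
    source *inner;
    size_t fail_after, delivered;
} failing_source;

static size_t failing_read(source *self, void *buf, size_t n)
{
    failing_source *f = (failing_source *)self;

    if (f->delivered + n > f->fail_after)
        return 0;   /* simulate an I/O error */
    f->delivered += n;
    return f->inner->read(f->inner, buf, n);
}
```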

#11 Bacterius   Crossbones+   -  Reputation: 8880


Posted 03 December 2012 - 06:54 PM

It's implemented in such a way that the encryption algorithm and mode don't make a difference. The stream makes use of my encryption module, and makes sure that block ciphers and stream ciphers both can work through the same code path. For instance, AES ECB and RC4 can both be used through the same stream implementation (not at the same time); all it requires is that you set up the encryption information correctly, and use it to attach an encrypted stream to the data. The encryption stream requires seeking backward, so if it cannot be seeked, it must be cached to a seekable stream type.

Wouldn't that be relatively inefficient, though? Consider a stateful mode like CBC: to encrypt at position N you'd need to encrypt everything before position N. How does your stream deal with this? Do you forbid seeking ahead?



#12 Ectara   Crossbones+   -  Reputation: 2970


Posted 04 December 2012 - 11:20 AM

Refactor to allow reading from and writing to abstract sources and sinks. A simple source that reads from a character buffer can then be implemented trivially for testing purposes. The same can be done for sinks/writing.

I'm not quite sure what you mean, as my streams are already as abstract as it gets. I have a stream type that reads from and writes to a block of memory. The only problem is, I need to test that, too.

The streams are not always simple file descriptors. If I'm testing an AES or compression stream, it's incredibly time consuming to verify every output for correctness by hand if I simply check the written bytes in memory. For the most part the streams are working, but I need to test edge cases.

You can also have sources/sinks that fail at specific points for testing failure cases deterministically, or at random points for fuzz testing, should that be required.


Having tests fail on purpose is useful, though. Perhaps a target stream that deliberately returns values that don't make sense, to see how the higher-level parts of the code deal with it.

Wouldn't that be relatively inefficient, though?

Inefficient, yes. However, the goal of the stream interface is to abstract away the implementation and provide one uniform way of accessing and storing data. Doing it this way ensures that one code path works for all sources or destinations (like being able to load an image from anywhere through one interface); if callers need the utmost efficiency, they can use the encryption module directly, assume that input is encrypted in the way they expect, and handle the data and the encryption any way they choose. The goal of the encryption stream is just to get the job done in a way that conforms to the stream interface, after all.

Consider a stateful mode like CBC: to encrypt at position N you'd need to encrypt everything before position N. How does your stream deal with this? Do you forbid seeking ahead?

Well, it's a little simple. The streams function similar to C's standard IO, as that was my inspiration when I started. In order to write at position N in a C FILE, you'd need to have N - 1 bytes already written to the file; files can't be grown without writing data to extend the file. Thus, to write at position N, you must have written or seeked past N - 1 existing bytes. The data in my encrypted streams are already encrypted; if you attach an encryption stream to another stream, you are making the assumption that the existing data is encrypted in the way that you expect; if the data is not yet encrypted, then you need to read the data and write it through the encryption stream again in order to encrypt it.

So, first, you'd find the block where byte N resides. Then, you'd take the previous ciphertext (or the IV, if it is the first block), decrypt the current block, and XOR the two to get the plaintext. You'd then modify the plaintext, then XOR it with the previous ciphertext or the IV, and encrypt it again. It will, however, require re-encryption of all following blocks, due to the nature of CBC.

Since CBC uses the ciphertext of the previous block, to read or write at an arbitrary point only needs examining the ciphertext of the previous block. However, modifying the plaintext requires re-encrypting all following blocks until the end of the stream (or until you find one that encrypts to the same block, which is extraordinarily unlikely). When I do implement different block cipher modes, it wouldn't be too much trouble to make it agnostic to the actual cipher, and add in more optional bookkeeping depending on the mode.
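
In code, that patch-and-re-encrypt pass would look roughly like this; the helper names and the enc_stream fields are invented for the sketch, and MAX_BLOCK is just the largest block size supported:

```c
#include <string.h>

/* Replace the plaintext of the block containing the edit, then re-encrypt
 * every following block against the new CBC chain. */
static void cbc_rewrite_from(enc_stream *es, unsigned long idx,
                             const unsigned char *new_plain)
{
    unsigned char old_prev[MAX_BLOCK], new_prev[MAX_BLOCK];
    unsigned char cur[MAX_BLOCK], plain[MAX_BLOCK];
    unsigned long i;
    size_t b;

    /* Ciphertext of the block before idx (or the IV for the first block);
     * it is unchanged by the edit. */
    if (idx == 0)
        memcpy(old_prev, es->iv, es->block_size);
    else
        read_cipher_block(es, idx - 1, old_prev);
    memcpy(new_prev, old_prev, es->block_size);

    for (i = idx; i < es->block_count; i++) {
        read_cipher_block(es, i, cur);

        if (i == idx) {
            /* The caller supplies the new plaintext for this block. */
            memcpy(plain, new_plain, es->block_size);
        } else {
            /* Recover the old plaintext: P_i = D(C_i) XOR old C_{i-1}. */
            block_decrypt(es, cur, plain);
            for (b = 0; b < es->block_size; b++)
                plain[b] ^= old_prev[b];
        }
        memcpy(old_prev, cur, es->block_size);  /* remember old C_i */

        /* Re-encrypt on the new chain: C_i = E(P_i XOR new C_{i-1}). */
        for (b = 0; b < es->block_size; b++)
            plain[b] ^= new_prev[b];
        block_encrypt(es, plain, cur);
        write_cipher_block(es, i, cur);
        memcpy(new_prev, cur, es->block_size);
    }
}
```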

As another example, when seeking backwards, it has optional bookkeeping to reset stream ciphers to their initial states, then quickly burn through iterations until it reaches the new position, so that the higher-level code can seek an RC4 stream backwards without knowing the difference, for instance.
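
That bookkeeping amounts to something like this (rc4_init/rc4_skip and the fields are placeholder names for the sketch):

```c
/* Emulate a backward seek on a stream cipher by resetting the keystream
 * to its initial state and burning through it up to the new position. */
static void stream_cipher_seek(enc_stream *es, unsigned long new_pos)
{
    if (new_pos < es->pos) {
        rc4_init(&es->cipher, es->key, es->key_len);
        rc4_skip(&es->cipher, new_pos);
    } else {
        rc4_skip(&es->cipher, new_pos - es->pos);
    }
    es->pos = new_pos;
}
```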

There will be inefficiencies, but people who take care to avoid them need never hit them, because these inefficiencies only arise from emulating functionality that other stream types take for granted. Of course, they don't really occur when you read or write straight through from beginning to end without changing direction or modifying existing data, which is the most likely way to serialize data through a stream. Again, the stream's goal is to provide an interface to simply get the job done the way the caller needs it, which is immensely useful.



