Separate Functions to Compress From Memory and From a File

First, if my assumption about you doing buffering for files or anything else was wrong, please ignore the first 'byte_source' interface I mentioned. There was no code to analyse, so I had to guess. Concentrate on the second byte_source interface.


[quote name='Ectara']
[quote name='e?dd' timestamp='1354050082' post='5004670']Turn the input interface on its head. Rather than having an abstraction for buffer filling,[/quote]
I don't understand. I have an abstraction for buffer filling?
[/quote]
I was asking that, really (though this wasn't obvious -- apologies). Again, had to guess.


[quote name='Ectara']I have a function that compresses input as it reads from a stream; if I already have the buffer, then I'm basically memory mapping. If I'm reading into a buffer through an abstract interface,[/quote]
I'm saying that you don't have to (manually) read into a buffer at all. You could argue memory mapping a file is doing this, but it should be doing it (near) optimally and it's handled for you, for the most part.

[quote name='Ectara']how is this different than using a buffered stream?[/quote]
There's no need for buffers! You can implement a buffered stream derived from the second interface if you want, but it's not necessary for (memory-mapped) files or raw memory, which seem to be your primary data sources.
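For instance, here is a minimal sketch of such a buffered implementation over C stdio, written against the byte_source/byte_range interface shown below (stdio_byte_source and its buffer size are my own hypothetical choices, not your stream API):

#include <cstdio>

// Sketch only: a byte_source that refills a scratch buffer from a
// stdio FILE, for sources that cannot expose their bytes directly.
class stdio_byte_source : public byte_source
{
public:
    explicit stdio_byte_source(std::FILE *f) : file(f) { }

    byte_range next_range()
    {
        // Refill the buffer; an empty range signals end-of-file.
        std::size_t n = std::fread(buffer, 1, sizeof buffer, file);
        return byte_range(buffer, buffer + n);
    }

private:
    std::FILE *file;
    unsigned char buffer[64 * 1024]; // arbitrary 64KB scratch buffer
};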


[quote name='Ectara']If I use an abstract interface that could be a memory buffer, or a file, and I use functions to read from it, how is this different than the streams I already have?[/quote]

Let me attempt to clarify by implementing the byte_source interface for a region of memory:


class memory_byte_source : public byte_source
{
public:
    memory_byte_source(const unsigned char *begin, const unsigned char *end)
        : begin(begin), end(end) { }

    // The first call hands back the entire range; every subsequent
    // call returns an empty range, signalling exhaustion.
    byte_range next_range()
    {
        byte_range ret(begin, end);
        begin = end;
        return ret;
    }

private:
    const unsigned char *begin;
    const unsigned char *end;
};


The first call to next_range() gives you the first and only byte_range (no copies, the range is the memory). The second call returns an empty range, telling you that the underlying data source is exhausted. The implementation for a memory mapped file will look very similar, except for the part where you create the mapping (in the constructor, perhaps).
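For reference, the byte_range and byte_source declarations assumed above might look something like this (a sketch; the member names are my guess, only byte_source, byte_range, and next_range come from this thread):

struct byte_range
{
    const unsigned char *first; // start of readable bytes
    const unsigned char *last;  // one past the last readable byte

    byte_range(const unsigned char *f, const unsigned char *l)
        : first(f), last(l) { }

    bool empty() const { return first == last; }
};

class byte_source
{
public:
    virtual ~byte_source() { }

    // Returns the next contiguous chunk of bytes, or an empty range
    // once the underlying source is exhausted.
    virtual byte_range next_range() = 0;
};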

The compression algorithm may of course have to be adapted for the new interface, but I don't imagine that would be too hard as you're now reading bytes through a pointer, regardless of the interface's implementation.
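To illustrate that adaptation, here is a sketch of how the compression driver loop might consume the interface (compress_byte is a hypothetical stand-in for handing one byte to the actual algorithm):

void compress_byte(unsigned char b); // hypothetical compressor entry point

void compress(byte_source &src)
{
    for (byte_range r = src.next_range(); !r.empty(); r = src.next_range())
    {
        // The inner loop works on raw pointers, so it is identical
        // whether the bytes live in plain memory or a memory-mapped file.
        for (const unsigned char *p = r.first; p != r.last; ++p)
            compress_byte(*p);
    }
}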

[quote name='e?dd']I'm saying that you don't have to (manually) read into a buffer at all. You could argue memory mapping a file is doing this, but it should be doing it (near) optimally and it's handled for you, for the most part.[/quote]

I already don't read into a buffer. I pass the stream to a compression function, and it reads from it directly, and writes to an output stream directly.


[quote name='e?dd']There's no need for buffers! You can implement a buffered stream derived from the second interface if you want, but it's not necessary for (memory-mapped) files or raw memory, which seem to be your primary data sources.[/quote]

I have a buffered stream interface, but I'm not using it in this case. I'm also not memory mapping any files; I've decided that I will not be doing that, because it unnecessarily restricts the file size, and if I have to allocate it the naive way due to having no OS support for it, I risk not having enough memory.


[quote name='e?dd']The compression algorithm may of course have to be adapted for the new interface, but I don't imagine that would be too hard as you're now reading bytes through a pointer, regardless of the interface's implementation.[/quote]

How is this different than using streams? Through my way, both sources are represented as files; through this way, both are represented as an array of bytes in memory. It seems to be more or less how I already did it, except that in mine, type A is converted to type B before compression, and in the method presented, type B is converted to type A. Through my way, the data is read from the array through function calls, like a file, creating a bottleneck. Through this way, the data is read into an array either beforehand, or during the read somehow, creating a bottleneck. Each method has one key flaw, and since I feel that memory-to-memory compression is less likely than any-stream-to-any-stream (including memory streams), I choose the flaw that is least likely to be encountered.

[quote name='Ectara']I already don't read into a buffer. I pass the stream to a compression function, and it reads from it directly, and writes to an output stream directly.[/quote]


The interface I've provided enables optimal access to both memory and files (note that when you memory-map a file, you can do it, say, 16MB at a time. You don't have to map the whole lot at once).
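To make the chunked mapping concrete, here is a sketch of a POSIX byte_source that maps one 16MB window at a time (error handling omitted for brevity; mmap offsets stay page-aligned because the window size is a multiple of the page size):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sketch only: maps a file one fixed-size window at a time, so the
// whole file never has to occupy address space at once.
class mapped_file_byte_source : public byte_source
{
public:
    explicit mapped_file_byte_source(const char *path)
        : fd(open(path, O_RDONLY)), file_size(0), offset(0),
          view_size(0), view(0)
    {
        struct stat st;
        fstat(fd, &st); // error handling omitted
        file_size = st.st_size;
    }

    ~mapped_file_byte_source()
    {
        if (view)
            munmap(view, view_size);
        close(fd);
    }

    byte_range next_range()
    {
        if (view) // release the previous window
        {
            munmap(view, view_size);
            view = 0;
        }
        if (offset >= file_size)
            return byte_range(0, 0); // exhausted

        off_t remaining = file_size - offset;
        view_size = remaining < window ? remaining : window;
        view = mmap(0, view_size, PROT_READ, MAP_PRIVATE, fd, offset);
        offset += view_size;

        const unsigned char *p = static_cast<const unsigned char *>(view);
        return byte_range(p, p + view_size);
    }

private:
    static const off_t window = 16 * 1024 * 1024; // 16MB per window
    int fd;
    off_t file_size;
    off_t offset;
    size_t view_size;
    void *view;
};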

Specifically, what I've presented solves this problem of yours:


[quote name='Ectara']My point is, all of my algorithms have been rewritten to take stream sources and destinations, and I could make a stream from a buffer in memory. However, this would be much slower than accessing the memory directly in the special case of memory to memory compression/decompression.[/quote]

It solves this problem because the interface still allows direct access to memory, where it can be provided. For file access, it also allows either manual reads into a buffer, or memory-mapping, whichever works best or is faster. In either case, it's the thinnest and fastest interface providing access to a range of bytes.

If that's not what you were asking for, I'm afraid I don't understand what you're after.

Or phrased another way: how about you present your actual interface and point to what you specifically don't like about it? Otherwise I fear we'll continue to talk past each other.

[quote name='Ectara']Through this way, the data is read into an array either beforehand, or during the read somehow, creating a bottleneck.[/quote]
There's no bottleneck with my method (beyond the unavoidable cost of disk access).

[quote name='Ectara' timestamp='1354052947' post='5004693']For OS's that do support memory mapping a file without actually having it reside in memory, I'm still limited by address space. If I can't map even 2GB of contiguous memory on a 32-bit machine, then there is no way that I can handle the maximum file size if I were to compress an entire file at once.[/quote]


Any POSIX system supports memory mapping. Windows supports memory mapping.

This is the point at which you need to make decisions. Do you actually need to support other OSs? Do you actually need to support files > 2GB? Are you just putting artificial obstacles in your own way? Are you hitting a point at which you're looking for a theoretical best-case solution at the expense of being pragmatic about stuff? Are you focussing on edge cases that will never happen at the expense of just getting the job done?

They're questions only you can answer, but I suspect that for all practical use cases memory mapping is going to work just fine.



[quote name='e?dd']The interface I've provided enables optimal access to both memory and files (note that when you memory-map a file, you can do it, say, 16MB at a time. You don't have to map the whole lot at once).[/quote]


I concede that your method will likely result in higher speed. However, it now requires a whole new interface just for one module, whereas streams are used all over my library. Additionally, the library also targets platforms that don't support memory mapping, so it is emulated during the few times it is needed; I'd like for this not to be one of them.

[quote]
[quote name='Ectara' timestamp='1354052947' post='5004693']For OS's that do support memory mapping a file without actually having it reside in memory, I'm still limited by address space. If I can't map even 2GB of contiguous memory on a 32-bit machine, then there is no way that I can handle the maximum file size if I were to compress an entire file at once.[/quote]
Any POSIX system supports memory mapping. Windows supports memory mapping.

This is the point at which you need to make decisions. Do you actually need to support other OSs? Do you actually need to support files > 2GB? Are you just putting artificial obstacles in your own way? Are you hitting a point at which you're looking for a theoretical best-case solution at the expense of being pragmatic about stuff? Are you focussing on edge cases that will never happen at the expense of just getting the job done?

They're questions only you can answer, but I suspect that for all practical use cases memory mapping is going to work just fine.
[/quote]

I choose compatibility.

[quote name='Ectara']For OS's that do support memory mapping a file without actually having it reside in memory, I'm still limited by address space. If I can't map even 2GB of contiguous memory on a 32-bit machine, then there is no way that I can handle the maximum file size if I were to compress an entire file at once.[/quote]

You do know you can map the file in multiple stages - you don't have to open the entire file at once. That's what all the arguments you are used to setting to zero do :)
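For example, on Windows the normally-zeroed arguments to CreateFileMapping and MapViewOfFile are the view's offset and length. A sketch (error handling omitted; map_window is a hypothetical helper, and the window is assumed to lie within the file):

#include <windows.h>

// Sketch: map one 16MB window of an already-open file, starting at
// 'offset'. The offset must be a multiple of the allocation granularity.
void *map_window(HANDLE file, unsigned long long offset)
{
    HANDLE mapping = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);

    ULARGE_INTEGER off;
    off.QuadPart = offset;
    void *view = MapViewOfFile(mapping,
                               FILE_MAP_READ,
                               off.HighPart,      // offset, high 32 bits (the "usually zero" args)
                               off.LowPart,       // offset, low 32 bits
                               16 * 1024 * 1024); // bytes to map; 0 would mean "to end of file"
    CloseHandle(mapping); // the view keeps the mapping object alive
    return view;          // caller releases it later with UnmapViewOfFile(view)
}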



[quote]
[quote name='Ectara' timestamp='1354067836' post='5004784']For OS's that do support memory mapping a file without actually having it reside in memory, I'm still limited by address space. If I can't map even 2GB of contiguous memory on a 32-bit machine, then there is no way that I can handle the maximum file size if I were to compress an entire file at once.[/quote]
You do know you can map the file in multiple stages - you don't have to open the entire file at once. That's what all the arguments you are used to setting to zero do :)
[/quote]
I'm well aware that I can map blocks of it at a time. I've read the documentation.

Since it doesn't look like anyone has any suggestions other than memory mapping, which I can't/won't use (how do I do memory mapping for an encrypted file stream in an archive through the OS?), I consider this thread complete until someone has a new suggestion. I've thought long and hard about all of these things. While it wasn't through the OS, I previously used memory mapping for all streams in order to compress them, resulting in cruft like reading the file into a memory map, compressing it to a new buffer, and writing it out, rather than the new "read from a stream, write to a stream". It added too many steps and took up too much memory, because when it comes down to it, native memory mapping solves only one problem: reading from a file descriptor given by the OS. For any other stream type, I'd have to memory map it myself, the naive way, and that puts me right back where I was before.

So my options are: stay where I am now, at post #5; or memory map a file when I see one and allocate memory to hold the contents of every other stream type, so that they all use memory like the poor design I had before; or create a new type of stream on top of the stream I already have, so that a stream (which could be a block of memory acting like a file stream) becomes another stream that acts like a block of memory.

I appreciate that people are trying to offer solutions other than what I'm already doing, but I'm not doing things the way I am now for lack of education on the subject; I sat down and thought these through, then came here to get a fresh pair of eyes on the project and see if there are any alternatives that I missed.

[quote name='Ectara']How do I do memory mapping for an encrypted file stream in an archive through the OS?[/quote]


Let me get this straight. You want to use your compression algorithm on an encrypted 2GB file inside another compressed archive?

I can only assume that you're writing some kind of library that has to account for literally every possibility.

In an attempt to be at least somewhat useful, let me just recommend that you not try to re-invent the wheel. These problems have been solved at a lower level and to a higher degree of efficiency than you or I can likely achieve.

[quote]Let me get this straight. You want to use your compression algorithm on an encrypted 2GB file inside another compressed archive?[/quote]

No, the archive need not be compressed. And one usually compresses before encrypting, but that isn't the point. The stream does not provide information on how it is implemented. These solutions require knowing the backend in order to write a whole new stream interface, while only solving two types of streams. This will not work for my purposes.


[quote]In an attempt to be at least somewhat useful, let me just recommend that you not try to re-invent the wheel. These problems have been solved at a lower level and to a higher degree of efficiency than you or I can likely achieve.[/quote]

You're right. I'm going to drop support for all of these features, remove several modules of my libraries, and use all features that aren't supported on all of the platforms I own. I was waiting for when the first "abandon the project" post would appear.

