noob memory use question related to number storage

Hola guys,
I am going to be storing a potentially large number of values in an array. These values will always arrive in increasing order (possibly staying the same) and will always be greater than zero. What I am wondering is whether it is possible to store them in a data structure in such a way that values below a certain threshold use the smallest size of memory needed to hold them. For example, 0-255 would be stored as bytes, 256-65,535 would be stored as shorts, etc. It seems like this could save a lot of memory for very large data sets.

I'm probably just going to implement it with longs for now, since this is very much a premature optimization, but I was wondering if there is a way to do this or whether it would cause more trouble than it's worth.

More than likely the simpler way to do it would come down to compressing my huge number of points into fewer points, but I thought it would be illuminating to know the answer to this question :)
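Something along these lines is what I have in mind; just a rough sketch (the class name and layout are made up), relying on the fact that the values only ever increase, so each size tier stays a contiguous prefix of the sequence:

using System.Collections.Generic;

// Sketch only: values are assumed non-negative and non-decreasing, so once a
// value needs a wider type, everything after it does too. The three lists
// below therefore form contiguous runs, and a flat index can be mapped across them.
class TieredSequence
{
    private readonly List<byte> _small = new List<byte>();      // 0..255
    private readonly List<ushort> _medium = new List<ushort>(); // 256..65,535
    private readonly List<long> _large = new List<long>();      // everything bigger

    public int Count
    {
        get { return _small.Count + _medium.Count + _large.Count; }
    }

    public void Add(long value)
    {
        if (value <= byte.MaxValue) _small.Add((byte)value);
        else if (value <= ushort.MaxValue) _medium.Add((ushort)value);
        else _large.Add(value);
    }

    public long Get(int index)
    {
        if (index < _small.Count) return _small[index];
        index -= _small.Count;
        if (index < _medium.Count) return _medium[index];
        return _large[index - _medium.Count];
    }
}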
What language? I would personally use an ADT array but that's just me :P
Given those constraints, you could certainly do that. Different languages would make that easier or harder.

You could also store only the delta between each number if you can guarantee that to be small and don't need random access. For example:


0 2 0 4 3


instead of


0 2 2 6 9
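
A rough sketch of that in C#, assuming each gap fits in a ushort (random access then costs a walk over the deltas):

using System;
using System.Collections.Generic;

// Sketch: store only the gaps between successive values. This works because
// the values never decrease; reading an element back means summing a prefix.
class DeltaSequence
{
    private readonly List<ushort> _deltas = new List<ushort>();
    private long _last;

    public void Add(long value)
    {
        long delta = value - _last;
        if (delta < 0 || delta > ushort.MaxValue)
            throw new ArgumentOutOfRangeException("value");
        _deltas.Add((ushort)delta);
        _last = value;
    }

    public long Get(int index)
    {
        long sum = 0;
        for (int i = 0; i <= index; i++)
            sum += _deltas[i];
        return sum;
    }
}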

Given those constraints, you could certainly do that. Different languages would make that easier or harder.

C# is the language

You could also store only the delta between each number if you can guarantee that to be small and don't need random access. For example:


0 2 0 4 3


instead of


0 2 2 6 9

I like this. At the very least I should be able to cut the data in half because the differences should be fairly small.
Out of curiosity, how much is "a lot"?

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]


Out of curiosity, how much is "a lot"?


hundreds of thousands to millions.
Millions of longs isn't a big deal on modern desktops. Hell, it's probably not that big of a deal on most mobile devices either once it's in memory.

Millions of longs isn't a big deal on modern desktops. Hell, it's probably not that big of a deal on most mobile devices either once it's in memory.


Yea. I'm aware it shouldn't be a huge deal, but it got me curious how one might go about doing it efficiently.
Telastyn's suggestion is probably the best. If you can constrain the difference between any two successive numbers to a given range, you can compact it to store, say, a byte or short per delta; otherwise, you could probably do some kind of variable-length encoding magic to get that down even further. I don't know if you really want to mess with bit-stream variable encoding in C#, though.
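
If you did want to go that route, the usual trick is the 7-bits-per-byte scheme (low seven bits of the value per byte, top bit set while more bytes follow); rough sketch, not any particular library's API:

using System.IO;

// Sketch of a variable-length unsigned integer: small values take one byte,
// bigger values take more. Combined with delta encoding, most gaps in a
// slowly-growing sequence would cost a single byte each.
static class VarInt
{
    public static void Write(Stream output, ulong value)
    {
        while (value >= 0x80)
        {
            output.WriteByte((byte)((value & 0x7F) | 0x80)); // low 7 bits + continuation bit
            value >>= 7;
        }
        output.WriteByte((byte)value);
    }

    public static ulong Read(Stream input)
    {
        ulong result = 0;
        int shift = 0;
        while (true)
        {
            int b = input.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            result |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }
}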

What really matters, though, is the usage pattern of this data:
  • Are you adding values on the fly? If so, how often?
  • How do you retrieve values and what do you do with them?
  • What access patterns to the value list do you need?


For instance, if you know you'll only ever access the last 25 numbers, just use a circular buffer of size 25 and voila, no memory problems at all!
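
e.g. a tiny ring buffer along these lines (sketch only):

// Fixed-size ring buffer: keeps only the most recent 'capacity' values,
// overwriting the oldest slot once the write position wraps around.
class LastN
{
    private readonly long[] _items;
    private int _next;

    public LastN(int capacity)
    {
        _items = new long[capacity];
    }

    public int Count { get; private set; }

    public void Add(long value)
    {
        _items[_next] = value;
        _next = (_next + 1) % _items.Length;
        if (Count < _items.Length) Count++;
    }
}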

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]


What really matters, though, is the usage pattern of this data:
  • Are you adding values on the fly? If so, how often?
  • How do you retrieve values and what do you do with them?
  • What access patterns to the value list do you need?


It's the histogram thingy that I made another thread about. Values can be added on the fly to a database, and I grab the new stuff from the db periodically. Access patterns are something I'm still working out, since some of the histogram algorithms I've seen need access to all the data at some point, in different patterns (usually either binary search or in-order). I think what I'm going to end up doing is reading in all my data over whatever range and storing it in a kind of hi-res simple histogram that's then used to generate the more dynamic, scalable histogram.

Kind of like this:
1. 100,000-1,000,000s of data points.
2. Downscale to 5,000-10,000 weighted points (this takes some time at startup, but afterwards it's just incrementing the last container's weight with new data, and if it runs out of space it merges neighbors).
3. Use 5,000 weighted points to generate a ~200 point histogram that can zoom/scale/pan/update fairly quickly.

I was looking at the visual usefulness of my graph, and anything over 200 points seems to be a waste for the size of the graph I'm using.

The problem is mostly the panning and zooming, which forces the 200-point final histogram to access more detailed data than it contains itself if I want to keep it balanced. I suppose I could also have a fixed-depth tree that does something similar to the above, except the 5,000 weighted points would just be the lowest level of the tree. I'll think about this some more, actually, because I think I've just sold myself on doing it that way.
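
Roughly what I'm picturing for step 2; just a sketch (class and details made up), keeping at most maxBins weighted containers and merging the two closest neighbors whenever it runs out of room:

using System.Collections.Generic;

// Sketch: each bin is (center, weight). New data lands at the end because the
// incoming values never decrease; when the bin count exceeds the cap, the two
// neighboring bins with the smallest gap between centers are merged into one
// weight-averaged bin.
class WeightedHistogram
{
    private readonly int _maxBins;
    private readonly List<KeyValuePair<double, long>> _bins = new List<KeyValuePair<double, long>>();

    public WeightedHistogram(int maxBins)
    {
        _maxBins = maxBins;
    }

    public void Add(double value)
    {
        int last = _bins.Count - 1;
        if (last >= 0 && _bins[last].Key == value)
        {
            // Repeat of the last value: just bump its weight.
            _bins[last] = new KeyValuePair<double, long>(value, _bins[last].Value + 1);
            return;
        }

        _bins.Add(new KeyValuePair<double, long>(value, 1));
        if (_bins.Count > _maxBins)
            MergeClosestNeighbors();
    }

    private void MergeClosestNeighbors()
    {
        int best = 0;
        for (int i = 1; i < _bins.Count - 1; i++)
        {
            if (_bins[i + 1].Key - _bins[i].Key < _bins[best + 1].Key - _bins[best].Key)
                best = i;
        }

        var a = _bins[best];
        var b = _bins[best + 1];
        long weight = a.Value + b.Value;
        double center = (a.Key * a.Value + b.Key * b.Value) / weight;
        _bins[best] = new KeyValuePair<double, long>(center, weight);
        _bins.RemoveAt(best + 1);
    }
}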

