Ordering Byte Pairs by Frequency In A File

Started by
10 comments, last by Ectara 11 years, 5 months ago
When enough memory is not available to do a count for the whole thing in one pass, I'd probably try maintaining K highest frequency values together with their frequency, take as much extra memory as I can get (M bytes) and then run 65536*4/M + 1 passes of the file, with each pass counting the frequencies of M/4 different byte pair values. In between passes the K value-frequency pairs have to be updated if the last value block had any higher frequencies.
Advertisement

When enough memory is not available to do a count for the whole thing in one pass, I'd probably try maintaining K highest frequency values together with their frequency, take as much extra memory as I can get (M bytes) and then run 65536*4/M + 1 passes of the file, with each pass counting the frequencies of M/4 different byte pair values. In between passes the K value-frequency pairs have to be updated if the last value block had any higher frequencies.

So, in layman's terms, for however much memory I can hold, count as many different pairs as I can, and keep track of the highest ones between passes.

I think I could make that work to scale how much reading it has to do based on how much memory it has available to it. This scale-able approach to the obvious solution seems like my best bet so far, but I'm open to any other suggestions.

This topic is closed to new replies.

Advertisement