Ordering Byte Pairs by Frequency In A File
When there isn't enough memory to count everything in one pass, I'd maintain the K highest-frequency values together with their frequencies, take as much extra memory as I can get (M bytes), and then run 65536*4/M + 1 passes over the file, with each pass counting the frequencies of M/4 different byte-pair values. Between passes, the K value-frequency pairs are updated if the last block of values produced any higher frequencies.
So, in layman's terms, for however much memory I can hold, count as many different pairs as I can, and keep track of the highest ones between passes.
I think I could make that work so it scales the amount of reading it has to do based on how much memory is available. This scalable take on the obvious solution seems like my best bet so far, but I'm open to any other suggestions.
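A minimal sketch of the multi-pass idea described above (Python for illustration; the function name, K, and M values are my own assumptions, and counters are assumed to be 4 bytes each as in the pass-count estimate):

```python
import heapq

def top_k_byte_pairs(path, k=10, mem_bytes=4096):
    # Each counter takes ~4 bytes, so one pass can count mem_bytes // 4
    # distinct pair values out of the 65536 possible byte pairs.
    block = max(1, mem_bytes // 4)
    best = []  # min-heap of (count, pair) holding the K highest seen so far
    for lo in range(0, 65536, block):
        hi = lo + block
        counts = [0] * min(block, 65536 - lo)
        with open(path, "rb") as f:
            prev = None
            while chunk := f.read(65536):
                for b in chunk:
                    if prev is not None:
                        pair = (prev << 8) | b
                        # Only count pairs belonging to this pass's block.
                        if lo <= pair < hi:
                            counts[pair - lo] += 1
                    prev = b
        # Between passes: merge this block's counts into the running top-K.
        for i, c in enumerate(counts):
            if c:
                heapq.heappush(best, (c, lo + i))
                if len(best) > k:
                    heapq.heappop(best)
    return sorted(best, reverse=True)  # [(count, pair), ...] highest first
```

With mem_bytes = 65536*4 this degenerates to the single-pass count; shrinking mem_bytes just adds passes without changing the result, which is the scaling behavior described.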