Removing data from data structures based on age

3 comments, last by Antheus 14 years, 1 month ago
I am reading data off a network connection. My data always contains a timestamp and a source id. I am storing this in a dictionary (.NET) using the id as the key. As new data arrives for a specific id, I have to either discard it based on some rules, or replace the current data if it is older than X amount of time.

I am looking for a decent solution to discard the old data automatically. Currently it just sits in memory forever if it is never referenced again. There is only ever 1 thread writing to the data structure currently, and the same thread is also the only reader. I do not like the idea of having to lock the entire structure in another thread to iterate through it to find old data (we are actually a bit behind our performance goals at the moment, so introducing a lot of contention on a heavily used data structure may cause issues, though I do know I will have to expect some), and we are running into some memory problems.

Thought I would list some other constraints/guarantees:
- The data structure has to be checked per packet.
- The rate at which data with a unique id updates is highly variable: the norm could be 0-10 seconds, but it could disappear for a while before we see the data again, or we may never see data for that id again.
- As I said above, currently the data structure is only used by a single thread.
- There can only ever be 1 billion ids, if that helps...
- Keeping data moving through the system with minimal delay is my highest priority.

Is there some way I could organize the data that would make this cleanup an easier operation?
The easiest solution without changing all your data structures is to periodically iterate the dictionary and remove old data. You can even amortize the cost of this operation if there is a lot of data by spreading the iteration over several frames.
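The amortized sweep could look something like this. The sketch below is illustrative (in Java rather than .NET, and the `ExpiringStore` name, `Entry` record, and `maxChecks` budget are all assumptions, not the poster's actual code): a cursor is kept across calls so each frame only inspects a bounded number of entries.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: sweep a bounded number of entries per call so the
// cleanup cost is amortized across frames instead of paid all at once.
class ExpiringStore {
    record Entry(long timestampMillis, String payload) {}

    private final Map<Integer, Entry> data = new HashMap<>();
    private Iterator<Map.Entry<Integer, Entry>> cursor;

    void put(int id, Entry e) {
        data.put(id, e);
        cursor = null; // map was modified; restart the sweep cursor
    }

    Entry get(int id) {
        return data.get(id);
    }

    // Examine at most maxChecks entries, removing any older than maxAgeMillis.
    void sweep(long nowMillis, long maxAgeMillis, int maxChecks) {
        if (cursor == null || !cursor.hasNext()) {
            cursor = data.entrySet().iterator();
        }
        for (int i = 0; i < maxChecks && cursor.hasNext(); i++) {
            if (nowMillis - cursor.next().getValue().timestampMillis > maxAgeMillis) {
                cursor.remove();
            }
        }
    }
}
```

Note that a real .NET `Dictionary` invalidates its enumerator on any modification, so the cursor-reset on `put` (or an equivalent strategy) is essential there too.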

The other solution is to use a list and keep the data sorted by age, so that the oldest data is at the front of the list. Then every frame you only have to check that front element to see if it's too old and erase it. Whenever new data comes in, you find it in the list and move it to the back.

The third solution is to combine both approaches, and have both a dictionary for keying by ID and a list for eliminating old data. The list would only contain the ID and time, so that it's only an indirect reference to data in the dictionary and doesn't have to duplicate it.
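The combined approach can be sketched with Java's `LinkedHashMap`, which gives you both the hash lookup by id and the age-ordered "list" in one structure: re-inserting an entry moves it to the back of the iteration order, so the oldest entry is always at the front. The `TimestampedStore` name and `Entry` record are assumptions for illustration, and this assumes updates arrive in roughly timestamp order so insertion order tracks age.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the combined approach: LinkedHashMap provides
// O(1) lookup by id AND preserves insertion order, so the oldest entry
// is always first in iteration order.
class TimestampedStore {
    record Entry(long timestampMillis, String payload) {}

    private final LinkedHashMap<Integer, Entry> data = new LinkedHashMap<>();

    // Remove-then-put moves the id to the back of the iteration order.
    void update(int id, Entry e) {
        data.remove(id);
        data.put(id, e);
    }

    // Pop expired entries from the front; stop at the first fresh one,
    // so each call does only as much work as there is expired data.
    void expire(long nowMillis, long maxAgeMillis) {
        Iterator<Map.Entry<Integer, Entry>> it = data.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue().timestampMillis > maxAgeMillis) {
                it.remove();
            } else {
                break;
            }
        }
    }

    Entry get(int id) {
        return data.get(id);
    }

    int size() {
        return data.size();
    }
}
```

In .NET there is no direct `LinkedHashMap` equivalent, so there you would pair a `Dictionary` with a `LinkedList` of (id, time) nodes, storing each node reference in the dictionary entry to make the move-to-back O(1).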

All of these approaches have their advantages and disadvantages. However, the underlying issue is that you have two things you want to "key" data by, time and ID, so you have to decide whether you want to use one data structure and optimize for a single-key lookup, or use two data structures to optimize both lookups but eat the extra memory.
Is this, by any chance, managing the pseudo-connections for a UDP protocol? You have a ban and DoS list which rejects IPs instantly, and "older" means that the connection has timed out?
I have no idea what you are talking about, Antheus, but I am going to look it up.

I am working on a method for filtering duplicate data from multiple sources. C transmits data; A and B receive it; A and B forward it to D for storage; D tries to determine whether it is the same data. A and B do not know each other exist, and the id is generated by C.
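D's duplicate filter described above could be sketched roughly like this (a hypothetical illustration in Java, not the poster's code; the `DuplicateFilter` name and `windowMillis` parameter are assumptions): the first packet carrying a given id within the expiry window wins, and later copies from the other relay are dropped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the first packet seen for an id within the window
// is accepted; a second copy (relayed via the other receiver) is dropped.
class DuplicateFilter {
    private final Map<Long, Long> seen = new HashMap<>(); // id -> first-seen time
    private final long windowMillis;

    DuplicateFilter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if the packet is new and should be stored,
    // false if it is a duplicate within the window.
    boolean accept(long id, long nowMillis) {
        Long first = seen.get(id);
        if (first != null && nowMillis - first <= windowMillis) {
            return false; // duplicate copy within the window
        }
        seen.put(id, nowMillis); // first sighting (or window expired)
        return true;
    }
}
```

The `seen` map here has exactly the unbounded-growth problem this thread is about, so it would need one of the expiry strategies discussed above applied to it as well.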
Use Hadoop. Really - map/reduce was built for precisely this type of task.
