Sign in to follow this  
theOcelot

Nondestructively reading and modifying generated data

Recommended Posts

I want to implement a data language like the example below, in a way that it can be read from and also modified and reserialized with all the generative structures in place. Please pardon the slightly weird syntax.

[code]-- a boolean function. most will be more complicated.
f = function(n)
n * n == 16
end

-- Control structures return streams of key-value pairs,
-- which are automatically added to the map, in order.
my_map = {
foo = 'default value' -- a simple assignment creates a string key

for key in ['bar' 'baz' 'foo'] do
-- '@' introduces a new key in the map
-- for ex. @'foo' = 'bar' is equivalent to foo = bar
-- but @ allows non-strings and values of variables to be used as keys
@key = 'another value'
end

if f(4) do
foo = 'yet another value'
end
}[/code]

What's the value of my_map['foo'], if later definitions of foo override earlier ones? It's not too complicated to just evaluate it all and see it's 'yet another value', but I have to evaluate all the control structures and merge them into the map to be sure. But I might want to keep all of them intact if I reserialize the data. Further suppose I want to allow users to modify this structure, or change the input variables, and show the updated value of foo in real time. I want to dig the correct value out of the control structures without hurting them.

The brute-force solution would be to re-evaluate the entire data structure (basically the parse tree) into a separate copy every time it's changed, and query the resulting simplified structure. But that seems inelegant at best, and for larger, more complicated structures it would become inefficient. Caching and updating based on a dirty flag would work, but I don't know if it's the best solution.

So basically, the data I want may be locked up in control structures I don't want to destroy. I'm sure this isn't a super hard problem; a typical spreadsheet does something similar with its formulas. What techniques are usually used in situations where you have interconnected data that you want to read and modify?

This will be implemented in Python at first, but I hope to do it in C later, and maybe others. Thanks for reading.

Share this post


Link to post
Share on other sites
A couple of points here:

- Any time you have imperative control flow, it is impossible even in theory to reliably compute a value's result from just examining the code. This is a fundamental limitation of computability. In essence, we can reduce your situation to a variant of the Halting Problem, which is uncomputable; therefore, you cannot guarantee without harsh limitations on your language that you will be able to recover the value of your "foo" [i]without[/i] executing the entire code segment associated with the map containing foo.

- Spreadsheets don't actually do what you suggest: any given cell only has [i]one[/i] definition of its value. Computing a cell from many other cells' values is a nontrivial task, but computable. Any time you change the formula computed by a cell, you destructively eradicate all of its previous formulae, and effectively reduce it to a single point in the computation space at a time. In other words, it'd be analogous to forbidding "foo" to be assigned more than once in a given map, in your language.

- You should strongly consider making your computation format purely functional in nature, i.e. not imperative. This gives you spreadsheet-like capabilities for caching values, reducing recomputation of already-known values, and so forth. It also lets you query a single key's value trivially, by evaluating it as a function of other keys and their values. If you aren't familiar with functional programming, I highly recommend you play with a functional language for a while before settling on a design for this project. Of course, this suggestion is made in vacuum - I don't know the actual precise requirements for your language or why you need it, so I'm extrapolating a bit; but nevertheless, I think you could benefit a lot from looking into functional designs.

Share this post


Link to post
Share on other sites
I see what you mean about spreadsheets being different. I think what I had in mind with spreadsheets was stuff like:
[code]a = 3 b=4 c = 5
foo = a + b
bar = foo - c[/code]
...and getting the value of bar while preserving the relationships, blah blah. It's much more straightforward to represent everything as functions here. I had kind of forgotten about the halting problem here. Oops.

Generally, I did have something functional-ish in mind, but I want the flexibility to do things like the first example, as well. Does that mean caching (and recalculating based on a flag) is going to be the only shortcut, at least for objects like that? I think I should be able to use some of the spreadsheet type tricks too. Are there any specific topics/algorithms I should look up?

P.S. I don't want to get hung up on spreadsheets. That's just the only application I could immediately think of with that kind of logic.

Share this post


Link to post
Share on other sites
Some type of caching will probably be mandatory to get good performance, yes.

You might consider building a list of dependencies in the execution layer when computing a value. For instance, if foo = bar + 6, have the execution layer note that bar had to be read when computing foo. Then, store a listing of all dependencies and what they affect. So if someone goes and changes bar from 3 to 9, you know you have to recompute foo as well.

This might become memory-intensive in large systems, but should work nicely for mid-size applications.

Share this post


Link to post
Share on other sites
I was planning on leaving it in a format pretty close to a parse tree, so keeping track of dependencies shouldn't be too hard. But if a name is suddenly defined in an inner scope (further in) or otherwise changes where it points, it would be hard to detect that and update. NameReference's are their own nodes. To mark themselves dirty, they would have to detect when any of their containing scopes changes, and the list of containing scopes itself might change.

I guess there aren't a lot of shortcuts. Thanks for your help.

Share this post


Link to post
Share on other sites
Well, that depends on how you build your scope tree. I use a linked list to maintain the parent scope hierarchy, so I can always follow a symbol from its assignment point back to its declaring scope. This allows for easy tracing of what operations contribute to the value of a very specific variable.

Of course, Epoch doesn't permit identifier shadowing, so YMMV.

Share this post


Link to post
Share on other sites
So far, every tree node just has a self.scope member (which may as well be self.parent). When it sees a name reference, it calls self.scope.reference(the_name), which either returns the name or calls self.scope.reference recursively until there's either a value or a name-not-found exception. This system naturally allows identifier shadowing.

Share this post


Link to post
Share on other sites
Shadowing shouldn't be too hard to support; just traverse upwards in the scope tree until you find a declaring scope, then stop. The bottom-most (inner-most/most-local) scope in the tree should always correspond to the defining scope of the identifier.

For uniquely identifying shadowed symbols, simply prepend a scope ID of some kind, e.g. foobarfunction_forloop1_ifcondition32_bletch. This gives you a program-unique ID of every symbol regardless of its shadowed state. You can then use my suggestion from earlier to track global dependencies even in the presence of shadowing.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this