
The Bag of Holding

Spaces for all the things, and everything in its space

Posted 28 September 2014 · 526 views
So Epoch has managed to self-host (as of an embarrassingly long time ago) and the Era IDE is slowly turning into something actually worth using. Over the past few days I got a rudimentary symbol lookup feature done: press F11 to process a project, then press F12 with the text caret over a symbol to jump immediately to that symbol's definition. This makes navigating the compiler source a lot easier, though it's still far from ideal.
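For what it's worth, the data structure behind this kind of jump-to-definition feature is just a symbol index built during the project parse. Here's a minimal Python sketch of the idea (not Epoch, and all names here are hypothetical):

```python
# Minimal symbol index: maps symbol name -> (file, line) of its definition.
# A real IDE would populate this from the parser; here we fake a tiny project.

def build_index(parsed_definitions):
    """parsed_definitions: iterable of (symbol, file, line) tuples."""
    index = {}
    for symbol, filename, line in parsed_definitions:
        index[symbol] = (filename, line)   # last definition wins
    return index

def jump_to_definition(index, symbol_under_caret):
    """The F12 half: return the location to navigate to, or None."""
    return index.get(symbol_under_caret)

# Pretend the F11 project parse produced these definitions:
index = build_index([
    ("Parse",       "parser.epoch", 120),
    ("SuccessTask", "tasks.epoch",    3),
])

print(jump_to_definition(index, "SuccessTask"))   # ('tasks.epoch', 3)
```

A real implementation also has to cope with overloads and scoping, but the lookup itself really is this simple once the parse has happened.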

As often happens with this sort of project, needs in Era fuel features in the compiler, and vice versa. At the moment my biggest annoyance with Epoch is the lack of good encapsulation support - there is no module system, no namespace system, no object system, nothing.

I've written here before about the ideas I have for solving this, but it's worth reiterating, because the notion has evolved a bit in my head since the last time I was really prolifically talking about it.

Essentially, the unit of organization in an Epoch program right now is a function. The difficulty is that there is no mechanism for grouping functions into higher-level constructs.

My idea for solving this is the task. A task is an abstract collection of functions, data structures, and/or variables. It acts like a namespace, but also acts like a class in that you can instantiate a task and send messages to the instance. Functions in a task can be static, meaning that (as with most languages using the term) you can invoke them as messages without an instance of the task.

For example, here's the unit test for basic tasks in the compiler suite right now:

// Compiler test for a very, very simple task

task SuccessTask
	succeed : [static]

entrypoint :
	SuccessTask => succeed()
The entrypoint function, as always, is where the program begins. All this program does is send the "succeed" message to the SuccessTask task. Since "succeed" is static, it can be handled without a task instance, which simplifies the code a bit. Once the succeed message is received, the handler calls the unit test harness function and notes that the test has passed, and that's the end of that.

So why is this a big deal? Because tasks don't have to be confined to a single execution unit. A task, as an abstract notion, might actually represent a coroutine, or a thread, or another process, or even an RPC interface to a machine across the world. All of them get uniform syntax and semantics from the language, and all execution infrastructure is transparent. A task can be turned into a thread or whatnot just by applying the suitable tag to its definition. The rest Just Works.

The idea is nothing new; it's heavily based on CSP, the actor model, Alan Kay's "objects", Erlang's concurrency model, and so on. What's important is that all grouping of code and data in Epoch programs follows a single pattern.

This works for everything from small globs that would typically be classes in most languages, up through modules, all the way to entire systems interacting; because it's all uniform message passing, none of the guts matter to the programmer unless he wants to dig into them and configure the way tasks are executed.

Tasks, of course, can have no shared state, and one of the things I look forward to doing with this feature is destroying the global{} block once and for all. This makes it trivially easy to spin up tasks on multiple threads (or machines, or GPUs, or whatever) and know for certain that they can't corrupt each other's state - the boundaries are always clearly defined by the message APIs exposed by each task.
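To illustrate the execution transparency, here's a rough Python sketch of the pattern - not Epoch, and a drastic simplification of the design above, but it shows how the sender's code stays identical whether a task handles its mailbox inline or on its own thread:

```python
import queue
import threading

class Task:
    # A task owns its state and a mailbox; the only way in is a message.
    def __init__(self, threaded=False):
        self.count = 0                      # private state, never shared
        self.mailbox = queue.Queue()
        self.threaded = threaded
        if threaded:
            self.worker = threading.Thread(target=self._run, daemon=True)
            self.worker.start()

    def send(self, message):
        # Uniform interface: the caller can't tell how the task executes.
        if self.threaded:
            self.mailbox.put(message)
        else:
            self._handle(message)

    def _run(self):
        # Thread-backed flavor: drain the mailbox until told to stop.
        while True:
            msg = self.mailbox.get()
            if msg == "stop":
                break
            self._handle(msg)

    def _handle(self, message):
        if message == "increment":
            self.count += 1

inline = Task()                 # runs in the caller's execution unit
inline.send("increment")

background = Task(threaded=True)  # same API, different infrastructure
background.send("increment")
background.send("stop")
background.worker.join()

print(inline.count, background.count)   # 1 1
```

In Epoch the "threaded=True" knob would be a tag on the task definition rather than a constructor argument, and the runtime would own the plumbing; the point is that the two send() calls look the same either way.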

A lot of the details remain to be worked out, but this is the general direction I'm heading, and so far I really like it. I'm kind of firing first and aiming later, so expect some course correction and refinement as I home in on the final setup. Most of this is being informed by actually writing the compiler (and Era) in the task model, so the decisions I make will be based on real programs and real concerns instead of abstract theoretical ones.

Anyways. To get all this to work, I first need to build the concept of a self-contained namespace into the compiler. So that will be my concrete project for probably the next few weeks, if the experience retrofitting the C++ implementation to use namespacing is any indicator. I hope Epoch proves to be more productive than C++ for this sort of change, but we shall see.

Once namespaces exist, I just need to build the idea of a task which owns a namespace, and the idea of sending messages into tasks. After that I'll build the notion of instantiating tasks, and then see what strikes my fancy from there.

Oh... and all of this is subject to take a long-ass time, because I'm currently prone to being hounded for belly-rubs by this giant fuzzball:

Attached Image

Playing with colors!

Posted 22 September 2014 · 361 views
Been playing around with a dark theme for Era:

Attached Image

Syntax highlighting still needs some love, but it's getting there.

Note that the IDE now highlights structure type names and function names. This is currently activated by pressing F12, which triggers a parse of the entire project. The parse stores off relevant identifiers for use in syntax highlighting. Obviously that's a painfully manual process, but I'll get it more seamless someday.

I'm not sure I care enough about local variables to bother figuring out a way to highlight them.

Sooner or later I'll actually have enough of this dumb IDE working that I'll feel comfortable doing heavy-lifting code work in it all the time. I think the main thing I want is symbol lookup, and with the parser now integrated and doing its thing, that shouldn't be too hard.

Argh, bitrot.

Posted 19 September 2014 · 416 views
Turns out that leaving a project alone for six months is a great way to discover that it's full of mysterious bugs you don't remember having.

I noticed some weird behavior with Era, the Epoch IDE, earlier this evening, and started poking around looking for explanations. It turns out that in some weird combination of circumstances, the Epoch program (Era in this case) can get partway through executing a Windows message handler, crash internally, and then continue on merrily as if nothing had ever happened (except without executing the remainder of the WndProc or whatever else the proc had called out to).

My best guess is that there is some Structured Exception Handling magic going on in the LLVM library that causes the crash to partially unwind the call stack and then continue executing from some weird spot. I don't get reliable enough call stacks to prove this just yet, because the JITted Epoch code is pretty hard to untangle in a debugger.

So the goal of the day is to find out what the actual crash is, hopefully by capturing it under a debugger. But this apparently happens a lot more frequently than I'd realized, because attaching a debugger to the running Era process turns up all kinds of crashes and weird behavior. Something in the Epoch runtime is seriously borked.

At this point, it's 2332 hours (yayy palindromic times) and I'm not liable to get much sleep tonight. This is going to bug the shit out of me. So I grab a fresh beverage and settle in for a heavy debug session.

Initially, I turn on all of the CRT memory debugging features I know of, and fire up Era. Well, I try. Turns out that _CRTDBG_CHECK_ALWAYS_DF really does murder performance - to the point that Era, which normally starts up and has my code visible in under a second, has been trying to load for the better part of 15 minutes now.

LLVM's optimizer apparently allocates memory like a freaking madman. Obviously not written by game developers.

Meanwhile, Era has pegged a core of my laptop's already-warm CPU and is showing no signs of being ready any time soon. Maybe that beverage needs a little ethanol in it...

At 2352, the IDE still shows no signs of finishing the loading process. The LLVM optimizer is hard at work burning through trillions of allocations and also murdering my poor CPU. If it were anywhere near this painful in Release, I'd seriously consider offering to write them some better allocation strategies so I can stop wasting my youth waiting for the dumb thing to finish calling operator new().

Periodic checking of the progress in the debugger indicates that, yes, we are making progress - it seems that the optimizer is finally working through the last few passes. This is at 0002 hours, so basically 40 minutes have elapsed just waiting for the IDE to load.

I might not whine about Visual Studio's startup time for a little while.

Nah, I'll still whine.

Anyways... of course once optimizations are done, we still have to generate machine code. Turns out this is even worse in terms of millions of tiny allocations. Quick back-of-the-cocktail-napkin estimates show the IDE loading at sometime around 8 AM. Screw this.

Sure enough, a couple of minutes of poking with the per-allocation checking turned off yields pay dirt. Looks like I had an off-by-one error in the library routine for converting strings between UTF-8 and UTF-16. DERP.
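The actual bug isn't shown here, but a classic way UTF-8/UTF-16 conversions go off by one is sizing the output by code points and forgetting that anything outside the Basic Multilingual Plane takes two UTF-16 code units. A quick Python illustration of the mismatch (the real library routine is in Epoch/C++; this just demonstrates the encoding fact):

```python
# Characters outside the Basic Multilingual Plane need a surrogate *pair*
# in UTF-16, so sizing a buffer as "one unit per code point" is off by one
# for every such character -- a classic conversion bug.

def utf16_units(s):
    # Number of 16-bit code units the string really needs (no BOM).
    return len(s.encode("utf-16-le")) // 2

ascii_text = "Era"
musical    = "G\U0001D11E"   # 'G' plus U+1D11E MUSICAL SYMBOL G CLEF

print(len(ascii_text), utf16_units(ascii_text))   # 3 3  -- counts agree
print(len(musical),    utf16_units(musical))      # 2 3  -- off by one!
```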

Fixing that leads to more interesting crashes, this time somewhere in the garbage collector. It's verging on 0045 and I'm wondering how much more I've got in the tank... but this is too compelling to pass up.

The faulting code looks innocent enough: loop through a giant std::map of allocated string handles, and prune out all the ones that don't have any outstanding references. For some reason, though, std::wstring is barfing deep in a destructor call, apparently because the "this" pointer is something wonky.
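The prune pass itself is straightforward to state; here's a Python sketch of the algorithm (the real code walks a C++ std::map, and these names are made up), including the collect-then-delete discipline that keeps the loop itself innocent:

```python
# Garbage-collect a string table: keep only handles that still have
# outstanding references. If the real (C++) version crashes in here,
# something corrupted the table *before* this loop ran -- the loop
# itself does nothing exotic.

def prune_string_table(string_table, live_handles):
    """string_table: dict of handle -> string; live_handles: set of
    handles still referenced by the running program."""
    dead = [h for h in string_table if h not in live_handles]
    for handle in dead:            # collect first, then delete: never
        del string_table[handle]   # mutate a container while iterating it
    return len(dead)

table = {1: "entrypoint", 2: "temp##37", 3: "SuccessTask"}
freed = prune_string_table(table, live_handles={1, 3})

print(freed, sorted(table))   # 1 [1, 3]
```

Which is exactly why a crash inside such a loop smells like a memory stomp rather than a logic error in the collector.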

My first guess, of course, is that I have mismatched code - something compiled in one way while linking to (or sharing objects with) something from another compilation setup. Time to go spelunking in the Visual Studio project settings...

Sadly, probing into the compiler/linker settings yields no obvious discrepancies. Time for the good ol' magical Clean Rebuild.

No joy. Next attempt is to disable the deletion of garbage strings... it'll murder my memory footprint, but it might also reveal what else is interfering with the string table. This causes the crashes to stop for the most part, even with Application Verifier enabled, which is pesky. I do, however, get a crash when exiting the program - i.e. not while garbage collection is destroying strings, but during the actual teardown process.

This indicates a memory stomp to me... which is slightly terrifying. Something, somewhere, seems to be clobbering entries on the string table. I haven't yet discerned a pattern to the data that gets written on top of the table entries, so it isn't entirely clear what's doing the stomping.

It's 0111 and I'm seriously tired. My best guess is that the stomp originates from the string marshaling code that interacts with external C APIs, specifically in this case the Win32 API. I suspect that I'm doing some evil cast someplace that confuses pointer types and causes chaos, but I'm far too hazy to pinpoint that as the cause for certain tonight.

0116 - time to cash in. We'll see how this goes next time!

A Quick Introduction to Sampler-Based Profiling

Posted 17 September 2014 · 2,473 views
Sampler-Based Profiling: The Quick Version
So you're happily working on some code, and suddenly it happens: everything is just too damn slow! Something is eating up all your performance, but it's not immediately obvious what to do about it.

One of the first things any experienced programmer will tell you is to profile. In a nutshell, this is a grizzled beard-wearing programmer shorthand for measure things and be scientific about your approach to performance. Even some of the most brilliant programmers in the world find it hard to intuitively find the real performance bottlenecks in a complex piece of code. So don't rely on voodoo and ritualism to find your slow points - measure your code and act accordingly.

Picking a Profiler
There are fundamentally two approaches to profiling. One is instrumentation, and the other is sampling. Instrumentation means adding some magic to your code that times how long the program spends executing various functions. It can be as simple as subtracting a couple of timer values, or as complex as invasive changes to the entire program that do things like automatically record call stacks and such. The "industrial strength" profilers generally support this mode, although in practice I find it very hard to use for things like games, for one simple reason: instrumentation dramatically changes the timing behavior of your program, which defeats the purpose when the program needs to run in near real-time.

So I will suggest, if you're new to profiling, that you start with a sampling-based profiler. By now you have all the keywords you need to find one on the Internet, so I won't recommend anything in particular (people can be very dogmatic about what their favorite tools are).

Sampling and How it Works
The basic idea behind sampling profilers is simple statistics. First, start by taking a program that's already running. Freeze it in place, like hitting the Pause button, and note what code was running when you stopped the world. Optionally, you can record a callstack as well to get contextual information. Once the snapshot is taken, un-pause the program and let it continue running. Do this thousands of times, and you will - statistically speaking - slowly build up a picture of where and why your program is slow.
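As a toy demonstration - not a real profiler, just the statistics at work - here's a Python sketch that samples the main thread's stack from a background thread, counting both the top frame (time spent in the function itself) and every frame on the stack (time spent in the function plus everything it calls):

```python
import collections
import sys
import threading
import time

def sampler(target_thread_id, interval, stop, counts):
    # The freeze-frame loop: snapshot the target thread's stack,
    # tally what's running, sleep, repeat.
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            counts["exclusive"][frame.f_code.co_name] += 1
            while frame is not None:          # walk the whole stack
                counts["inclusive"][frame.f_code.co_name] += 1
                frame = frame.f_back
        time.sleep(interval)

def profile(func, interval=0.001):
    counts = {"exclusive": collections.Counter(),
              "inclusive": collections.Counter()}
    stop = threading.Event()
    worker = threading.Thread(
        target=sampler,
        args=(threading.get_ident(), interval, stop, counts))
    worker.start()
    func()                                    # run the workload under watch
    stop.set()
    worker.join()
    return counts

def hot():                                    # deliberately slow inner loop
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def workload():
    for _ in range(20):
        hot()

counts = profile(workload)
print(counts["exclusive"].most_common(1)[0][0])   # hot
```

Production profilers sample from outside the process and at much lower overhead, but the principle is the same: hot() dominates the samples because it dominates the runtime, and workload() shows up in the inclusive counts because it is always on the stack beneath hot().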

It is important to remember two things about sample-based profilers. First, they are statistical estimates only; they will give you slightly different results each time you run the program. Therefore, gathering as many samples as possible is in your best interests - more data makes for a more accurate statistical picture. To give you a ballpark idea, I usually run for something like 10,000 to 20,000 samples.

Second, sampling will tend to downplay the importance of very tiny pieces of code. Statistically, it is easy to see why this has to be true: since the piece of code is tiny, the odds of freezing the program exactly in that spot are correspondingly slim. You might land just before it, or just after it, for example. If your program is largely built up of equally slow tiny bits of code, sampling might make it impossible to find a bottleneck.

That said, sampling is still a great tool, and an easy way to get started with profiling. So let's talk about how to use it.

Using Sampling
Typically, profilers will show you two crucial stats about a given piece of code: how often it was seen in the freeze-frame snapshots (sample count), and a rough guess at how much time was spent in that chunk of code across the entire profiling session (aggregate time). Some profilers distinguish between time spent in just the code (exclusive time) versus time spent in the code and everything that code calls (inclusive time). These are useful measurements in various situations so don't get too used to focusing on a single statistic.

For introductory purposes, often the easiest way to fix performance issues is to look at the combination of inclusive time and sample count. This tells you (roughly) what fraction of a program's life is spent chugging through a given piece of code. Any decent profiler will have a sorted breakdown that lets you view your worst offenders in some easy format.

Once you have identified a slow piece of code, it takes some analysis of the program itself to understand why it is slow. Sometimes, you might find a piece of code can't be optimized, but chews up a lot of profiling samples - this is common with high-throughput inner loops, for instance. So don't be afraid to use different stats (especially exclusive time) and look further down the list for potential wins.

Pitfalls and Gotchas
There are a few things worth mentioning yet; I'll only hit them at a high level, though, because the details will vary a lot based on OS, compiler, and so on.
  • Memory allocations can hide in weird places. Look for mysterious samples in system code, for example. Learn to recognize the signs of memory allocation and deallocation - even if you're in a garbage-collected language, these can be important.
  • Multithreading is the enemy of sample-based profiling, because blocking a thread looks an awful lot like chewing up CPU on the thread. Note that some good profilers can tell the difference, which is nice.
  • Statistics are lies. If your profiler is telling you something profoundly confusing, seek advice on whether you're missing something in the code, or if the profiler is just giving you an incomplete statistical picture of reality.
  • Practice is important, but so is realistic work. Why didn't I include an example program and profiling screenshots? Because real programs are much harder to profile than simple examples. If you want practice, work on real programs whenever possible, because you'll learn a lot more.

Brain dump: considerations for organizing code

Posted 08 September 2014 · 593 views
No structure or real nice formatting will be found in this post. This is a stream-of-consciousness blathering process wherein I contemplate how to organize code in a way that escapes the limitations of the file paradigm.

Considerations for organizing code
Main goal: aid in discoverability and navigation of complex code bases. Secondary benefit could be that complicated architectures will be correspondingly "messy" to look at, encouraging cleaner design.

Files suck because:

- They force revision history to occur at the file level instead of the unit-of-code (function/etc.) level. Move a function into a new file and POOF goes its revision history. This is lame.

- They imply a single view of code which may not be the most expedient way to look at code "right now". Working on functionality that spans multiple files is a classic example; I have to change files all the time when the chunk of code I'm conceptually editing has nothing to do with these dumbass file divisions.

- Files as modules or other semantic implications of file separation are FUCKING EVIL (Java I'm looking at you). There should be no forced correlation between the code storage mechanism on disk and the conceptual organizationS (emphatically plural) of the code at a high level.

- Where the fuck did that function go? Sigh, time for grep. This is lame if you misspell the function's name or can't remember exactly what it's called. grep cannot search for IDEAS. Files are not directly to blame for this, but the forced implied taxonomy of code based on files exacerbates the problem significantly.

- They unnecessarily complicate code reuse. I should be able to reuse just that function, or that entire group of related code, or whatever - independent of shuffling files around. This should tie in to revision history too so I can see the history of a function when I steal it for another project.

Other assorted thoughts:

Avoid the sin of only being able to see tiny fragments of code at a time, à la early Visual Basic; ensure that a comfortable volume of code is always at hand to be edited and browsed.

Would be nice to have some kind of graphical visualization of the code taxonomies created by the programmer.

I want a way to select groups of functions to view together; checkboxes are lame; lasso?

"Search for code unit by name" as first-class IDE feature, including full language parse?

How do we minimize reparsing without having to write a standalone incremental parser?

Parsing != semantic checks; maybe the former can be done fast enough for realtime work, and maybe that justifies keeping the semantic work as semi-modal operations?

The Right Taxonomy of Code

Posted 07 September 2014 · 510 views
I've written before about how much I want to get away from the "code goes in files" model of programming. The more code I write, and the larger the project, the less it makes sense to organize everything strictly by file names.

Yes, folder hierarchies can be one reasonable way to group related code... but they're still file-based, and they still assume that you can find The One Right Taxonomy to describe your entire codebase. This is a dumb assumption and we should make it go away.

Right now, the Epoch IDE is set up to store code in... yes, sigh, flat files. However, files mean nothing - they are all lumped together equally during compilation. There is no implicit file-to-module relationship like in virtually every other language ever.

As an experiment, I'm thinking about removing this from the IDE entirely. I want to see what it would feel like to work on a large project - like the Epoch compiler, or the Era IDE itself - without the restriction of thinking in terms of files.

I can hear the protests already... how does one find code if there is no file grouping?

Well, one option - and certainly not the only or "best" option - is tagging. Each function or other top-level construct (like a structure or type definition) appears in its own universe by default. You can join this universe into other universes to make pockets of code, simply by tagging a function with some label. The IDE then displays a list of labels, and you can open them up as if the tagged content existed inside a single file. Behind the scenes, each "island" of code is a single file on disk.

I kind of like this for the sake of version control, because it allows naive version control software (i.e. basically all of it) to actually do function-level histories. I kind of also hate it because it means a large project may consist of tens of thousands of files when a few dozen would suffice.
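Mechanically, the tagging scheme is just a many-to-many mapping between code units and labels, with the "open a tag as if it were a file" view generated on demand. A Python sketch of that index (names hypothetical):

```python
import collections

class TagIndex:
    # Each code unit (function, structure, type definition...) lives in
    # its own universe by default; tags join universes into browsable
    # groups, replacing file divisions entirely.
    def __init__(self):
        self.tags_by_unit = collections.defaultdict(set)
        self.units_by_tag = collections.defaultdict(set)

    def tag(self, unit, label):
        self.tags_by_unit[unit].add(label)
        self.units_by_tag[label].add(unit)

    def open_view(self, label):
        # The IDE's "pretend this tag is a file" view.
        return sorted(self.units_by_tag[label])

index = TagIndex()
index.tag("ParseExpression", "parser")
index.tag("ParseStatement",  "parser")
index.tag("ParseExpression", "hot-paths")   # a unit can join many universes

print(index.open_view("parser"))   # ['ParseExpression', 'ParseStatement']
```

Note that nothing here forces one taxonomy: the same function can appear in as many views as it has tags, which is exactly what the one-file-one-module model can't do.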

I'll try and think of other options, but I'm open to suggestions. Basically I want to make the code-to-file mapping a little more modern and a little less stuck in 1970.

Time sure flies...

Posted 03 September 2014 · 501 views
It's been well over a year since I embarked on the monumental project of self-hosting the Epoch language compiler. In all that time, there have been a whopping ZERO releases of the language or any of its accompanying tools/examples/etc.

I'd been taking some time off from Epoch for a number of reasons, but this week I found myself with the inevitable itch to work on it... mostly because I'm yet again frustrated up to my eyes with C++ and really want an alternative. Which is suitable, since that's pretty much why the language exists in the first place.

Things are starting out gently, with some updates and polish to the Era IDE. It's mostly minor conveniences and visual improvements, but slowly but surely Era is starting to look like a real development environment. You can even compile projects now, although support for building/testing individual code files is still missing - I intend to build a proper REPL-type thing at some point. Ha.

Anyways, the point is, even with a tremendous amount of progress since Release 14 (including self-hosting), there have been no publicly visible changes. If you care enough to sync the Google Code repository you can play with the bleeding-edge stuff, but to the best of my knowledge nobody does that.

So this really comes down to a fundamental tension between two halves of my personality.

On the one hand, I really like the idea of constantly pushing out updates - it sends a strong message that things are still being worked on, and encourages people to follow progress more closely. The downside is that this runs headlong into my perfectionism, and makes me really uncomfortable. I hate shipping stuff that I know is missing important functionality or has huge bugs in it.

So while I love the thought of "release often", I kind of hate the idea of "release notes have tons of known bugs listed."

I might just wind up sucking it up though, and shipping Release 15 soon. It's a huge landmark and I want to have it out there before an entire year goes by between the self-hosting success point and the first time anyone actually uses the damn compiler.
