Preprocessor

Started by
11 comments, last by WitchLord 19 years, 7 months ago
I've just uploaded a new version of the preprocessor I posted in 'Binding somethingorother'. It now supports function macros. It's passed all my tests, but I would appreciate it if anyone could run some macro code through it and tell me if it works as expected! You'll get a complete listing of the script, and a list of all the macros defined, output to the console right before the script runs.

Check the test script (script.txt) for a minor quirk in the syntax. Because my lexer strips whitespace, I can't detect whether the () in a function macro immediately follows the macro's name ( #define macro(a) is valid, #define macro (a) is not! ). Instead, I had to include the # character before the (). It MUST go #define NAME #(args) ... Note the placement of spaces.

Keep in mind that this is a work in progress. The code is very messy. It supports everything I need for my own project now, so I probably won't be adding new features unless I see a demand for them. I will, however, be cleaning up the code. Specifically, I'm going to be adding a file-loader functor argument, and I still need to figure out how to support relative paths for #include.

Clicky!

[edit]They replaced UBB with HTML, didn't they? :/[/edit]
Thank you very much for this great contribution to the AngelScript community. I'm sure many people will find it very useful.

I will upload this as soon as possible. I'm a bit swamped at work right now so I don't have much time, but I'm sure it will clear up in a couple of days, and I'll get back to work on AngelScript as normal.

AngelCode.com - game development and more - Reference DB - game developer references
AngelScript - free scripting library - BMFont - free bitmap font generator - Tower - free puzzle game

No rush. You'll just have to replace it anyway. :D!
Hello Deyja
I have also made a preprocessor as part of my project that uses AngelScript.
As far as the problem of relative and default paths for #include is concerned, I have solved that. So if you want, I can help you in this regard. (You have to deal with GetCurrentDirectory() and SetCurrentDirectory().)

The area where I am stuck is also macros, and I cannot use my own convention as you have (# before the ()) because I have to parse C++ header files. Simple definitions such as
#define INT int
are not difficult (and I have coped with those).

I would really appreciate any solution (even partial) and the identified problems in this regard.

Regards
Rizwan Khlid
The simplest solution (and the one I will probably employ) is to NOT strip whitespace during the lex phase, but instead generate a lexem of type 'whitespace'. I'm already doing this with newlines. You have to deal with all the extra tokens later on, though. I did this at one point, but it was much easier to just use my nasty syntax than to deal with the excessive whitespace!
Another way would be to lex on the fly, not all at once. You then end up with something very close to an actual compiler, and it can change the lexer state before lexing the define name so that it can check for that space in there. I won't be doing this, because it makes the splicing operations I do nearly impossible.

I am VERY interested in your relative-path code. I'm already tinkering with a simple way of making the algorithm recursive, so that includes in included files are relative to the included file, not the 'root' file. That probably didn't make sense.
File A includes file B/C. File C includes file D. File D is actually at B/D relative to A, but is just D relative to C. With my current system, it will look for it in the same directory as A, NOT in B.
The only hitch is 'adding' the paths. Given the path A/B/C.txt and ../D.txt, the result should be A/D.txt. I'm sure I could do it if I just sat down and worked it out.
It's very interesting to hear about the progress in your preprocessors.

I think I'll try to expose some way for a preprocessor to adjust the line and cursor position in the code it outputs. That way any error that AngelScript reports would still be given the correct line and column even if the preprocessor has added extra code.

I'll probably do this as a special token that will be treated as whitespace by the compiler but that can be detected when the stream position is converted into line and column number.

I'm not sure when this will be done though. I'll have to analyze it some more first.


I'm going to have to look into the AngelScript source myself. If I can bypass AngelScript's lexer, or work directly with the lexem stream it produces, I imagine I will gain substantial speed benefits. At least I can avoid dumping it all back into a raw buffer just for AngelScript to lex it again.

Right now, my preprocessor doesn't do much of any error reporting. It generally just fails silently and bludgeons on through the rest of the script. I haven't yet found an error state that won't eventually lead AngelScript to complain, though. If you have a bad define but don't use it, everything is fine. If you do, AngelScript will complain - chances are the define won't be expanded, and it will result in an unknown identifier.
As for error reporting inside AngelScript: defines can't have newlines in them, so they won't change the number of lines when they are removed. The only problem is included files. I can easily preserve the number of lines, and the positions of lines, in a single script. Because of the way AngelScript works, I don't actually have to splice included files into a single chunk of source code. I could load them all into a module, using each filename as the section name. I'd merely have to preprocess all the files with a single define table!
If I decide to implement the / preprocessor operator, I might have some trouble. In the meantime, don't worry about it. You've already supplied all the tools we need!
Wow. I don't think I've ever seen a recursive-descent tokenizer before. Any particular reason you didn't use a state machine?
No special reason, it was just the first working solution I came up with, and since it worked very well I didn't feel the need to try something else.

Would a state machine provide much improvement?

My tokenizer doesn't produce a lexem stream like you're looking for, it simply identifies the first token in a string. The parser manually moves the position in the character stream to identify each token. I thought that was a better solution than producing an intermediate lexem stream.


For most applications, it is probably much faster. I produce a lexem stream so that I can manipulate it in the form of a linked list. It makes inserting and removing lexems much easier, and faster. Whenever I work on it again, I'm going to make changes to preserve line numbers. I'm not so sure about column numbers, though. I can preserve whitespace exactly (though it's going to make parsing things a real bitch) but, of course, expanding a define immediately screws the column numbers to hell. Anything you can add to allow us to correct column numbers would be great. The # character isn't used anywhere in the script, so you could use it to signify some sort of column-changing command. It fits well with the preprocessor, too, and the preprocessor will strip out any that script writers put in. It has to be relative, though. '#>5' could add five to the column number; '#<5' could subtract five.

It seems like a lot of hassle for error messages.

This topic is closed to new replies.
