Sign in to follow this  
  • entries
    73
  • comments
    131
  • views
    54686

SILex & lexing

Sign in to follow this  

368 views

I've been playing around with SILex recently. It's a very sweet lexical analyzer, essentially a port of Lex to Scheme. Now, I'm no experienced lexer (never even used Lex), so I'm not going to compare Lex and SILex. But I do want to write about something cool SILex let me do.

I need to write an FFI (foreign function interface) for each C library I want to use from Scheme. Possibly the most important one is OpenGL. To give you a taste of the FFI I'm working with in Gambit-C Scheme, here's an example:


(define glRotatef
(c-lambda (GLfloat GLfloat GLfloat GLfloat) void "glRotatef"))


This lets me call `glRotatef' from Scheme. Now, if you think writing this for every single C function would be monotonous, repetitive, and error-prone... you're right! There's no reason why we can't automatically generate this. Well, there are some difficulties with FFI's in general, but we can get close.

The most straight-forward way is to parse the OpenGL header, gl.h, and generate something of worth from it. Technically, we will first "lexically analyze" the text in the file, and then "parse" the tokens into something we can use.

Lexical Analysis



Lets use SILex for some lexing. There's a postscript manual for SILex in the download if you want to know more about it. From what I can tell, however, it's remarkably like Lex in the way it works, so you'll recognize this if you're familiar with Lex. Lex's manual describes the basic process.

You give SILex a specification file which contains a mapping of regular expressions to tokens. It then uses this to generate a "lexer" program, or something that takes a program as input and produces a set of tokens.

In order to focus more on lexing/parsing, lets simply try to parse all of the preprocessor constants defined in the OpenGL header gl.h. You should be able to parse all of the function definitions too, but that involves a few tricky nuances that would just be distracting.

I knew that Chicken Scheme already has a general FFI generator, and I noticed that it's using SILex as well. So I grabbed it's SILex specification file which tokenizes C and parts of C++/Objective-C and started there. Here it is:


; easyffi.l -*- Scheme -*-
;
; Copyright (c) 2000-2004, Felix L. Winkelmann
; All rights reserved.
;
; Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
; conditions are met:
;
; Redistributions of source code must retain the above copyright notice, this list of conditions and the following
; disclaimer.
; Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
; disclaimer in the documentation and/or other materials provided with the distribution.
; Neither the name of the author nor the names of its contributors may be used to endorse or promote
; products derived from this software without specific prior written permission.
;
; THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
; OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
; AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
; CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
; CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
; SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
; THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
; OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
; POSSIBILITY OF SUCH DAMAGE.
;
; Send bugs, suggestions and ideas to:
;
; felix@call-with-current-continuation.org
;
; Felix L. Winkelmann
; Unter den Gleichen 1
; 37130 Gleichen
; Germany


letter [a-zA-Z]
digit [0-9]
digit16 [0-9a-fA-F]
digit8 [0-7]
space [ ]

%%

\\[ ] (yycontinue)
\\?\13?\n (if pp-mode
(begin
(set! pp-mode #f) 'pp-end)
(yycontinue) )
{space}+ (yycontinue)
\9+ (yycontinue)
\13+ (yycontinue)
\12+ (yycontinue)
"//" (let loop ()
(let ([c (yygetc)])
(if (or (eq? 'eof c) (char=? #\newline c))
(begin
(if pp-mode
(begin
(set! pp-mode #f)
'pp-end)
(yycontinue) ) )
(loop) ) ) )
"/*" (let loop ([c (yygetc)])
(cond [(eq? 'eof c) (parsing-error "unexpected end of comment")]
[(char=? #\newline c) (loop (yygetc))]
[(char=? c #\*)
(let ([c2 (yygetc)])
(if (eq? #\/ c2)
(yycontinue)
(loop c2) ) ) ]
[else (loop (yygetc))] ) )
"enum" 'enum
"typedef" 'typedef
"extern" 'extern
"static" 'static
"___fixnum" 'fixnum
"___number" 'number
"___symbol" 'symbol
"___bool" 'bool
"___pointer" 'pointer
"___u32" 'u32
"___s32" 's32
"___s64" 's64
"int64_t" 's64
"__int64" 's64
"bool" 'bool
"___safe" 'callback
"___declare" 'declare
"___scheme_value" 'scheme-value
"___scheme_pointer" 'scheme-pointer
"___byte_vector" 'byte-vector
"C_word" 'scheme-value
"___abstract" 'abstract
"___specialize" 'specialize
"___byte" 'byte
"___discard" 'discard
"___in" 'in
"___out" 'out
"___inout" 'inout
"___mutable" 'mutable
"___length" 'length
"size_t" 'size_t
"int" 'int
"unsigned" 'unsigned
"signed" 'signed
"float" 'float
"double" 'double
"short" 'short
"long" 'long
"char" 'char
"void" 'void
"struct" 'struct
"union" 'union
"const" 'const
"class" 'class
"public" 'public
"protected" 'protected
"private" 'private
"volatile" 'volatile
"namespace" 'namespace
"virtual" 'virtual
"explicit" 'explicit
"inline" 'inline
"using" 'using
"@interface" 'interface
"@implementation" 'implementation
"@end" 'end
"@class" 'objc-class
"@protocol" 'protocol
"@public" 'objc-public
"@protected" 'objc-protected
"@private" 'objc-private
"@encode" (list 'id "@encode")
"@defs" (list 'id "@defs")
"@selector" (list 'id "@selector")
"..." 'dots
^[ \t]*#[ ]*define (begin (set! pp-mode 'define) 'pp-define)
^[ \t]*#[ ]*include (begin (set! pp-mode 'include) 'pp-include)
^[ \t]*#[ ]*import (begin (set! pp-mode 'import) 'pp-import)
^[ \t]*#[ ]*ifdef (begin (set! pp-mode #t) 'pp-ifdef)
^[ \t]*#[ ]*ifndef (begin (set! pp-mode #t) 'pp-ifndef)
^[ \t]*#[ ]*elif (begin (set! pp-mode #t) 'pp-elif)
^[ \t]*#[ ]*if (begin (set! pp-mode #t) 'pp-if)
^[ \t]*#[ ]*else (begin (set! pp-mode #t) 'pp-else)
^[ \t]*#[ ]*pragma (begin (set! pp-mode #t) 'pp-pragma)
^[ \t]*#[ ]*endif (begin (set! pp-mode #t) 'pp-endif)
^[ \t]*#[ ]*error (begin (set! pp-mode #t) 'pp-error)
^[ \t]*#[ ]*undef (begin (set! pp-mode #t) 'pp-undef)
# '(op "#")
"if" 'if
"else" 'else
@?\" (let loop ([cs '()])
(let ([c (yygetc)])
(cond [(eq? 'eof c)
(parsing-error "unexpected end of string constant")]
[(char=? c #\\) (loop (cons (yygetc) cs))]
[(char=? c #\")
(list 'string (list->string (reverse cs))) ]
[else (loop (cons c cs))] ) ) )
\'\\{digit}{digit}{digit}\' (list 'char (string->number (substring yytext 3 5) 8))
\'\\0\' '(char #\nul)
\'\\a\' '(char #\alarm)
\'\\b\' '(char #\backspace)
\'\\f\' '(char #\page)
\'\\n\' '(char #\newline)
\'\\r\' '(char #\return)
\'\\t\' '(char #\tab)
\'\\v\' '(char #\vtab)
\'\\.\' (list 'char (string-ref yytext 2))
\'.\' (list 'char (string-ref yytext 1))
({letter}|_)({letter}|_|{digit})* (list 'id yytext)
0(x|X){digit16}+ (list 'num (string->number (substring yytext 2 (string-length yytext)) 16))
0{digit8}+ (list 'num (string->number (substring yytext 1 (string-length yytext)) 8))
[-+]?{digit}+(\.{digit}*)?([eE][-+]?{digit}+)?
(list 'num (string->number yytext))
"<" (if (eq? pp-mode 'include)
(let loop ([s '()])
(let ([c (yygetc)])
(cond [(eq? 'eof c) (parsing-error "unexpected end of include file name")]
[(char=? #\> c)
(set! pp-mode #f)
`(i-string ,(list->string (reverse s))) ]
[else (loop (cons c s))] ) ) )
`(op "<") )
"(" 'open-paren
")" 'close-paren
"[" 'open-bracket
"]" 'close-bracket
"{" 'open-curly
"}" 'close-curly
"," 'comma
";" 'semicolon
"*" 'star
"."|"+="|"-="|">>="|"<<="|"*="|"/="|"%="|"%"|"&="|"|="|"^="|"+"|"-"|"/"|">="|"<="|"=="|"<<"|">>"|"&&"|"||"|"&"|"|"|">"|"<"|"^"|"~"|"?"|"::"|":"|"="|"!="|"!"
(list 'op yytext)
<> (begin (set! pp-mode #f) 'stop)
<> (lexer-error (yygetc))






The chunks of code may be a bit confusing. The second statement of each entry is a piece of Scheme code which, when evaluated at a specific point in the text, will produce the appropriate token.

So now we're ready to lex! First, we need to generate our lexing program.


Gambit v4.3.2

> (include "silex.scm")
> (lex "autoffi.l" "autoffi.scm")
#t


This took in the specification file "autoffi.l" and generated "autoffi.scm". Now we can use this program for lexing C code, such as gl.h!

Here's a simple program using our lexer:


(include "autoffi.scm")

;; Token generators depend on these definitions
(define pp-mode #t)
(define (lexer-error c)
(display "*** ERROR *** invalid token: ")
(write c)
(newline)
(exit 1))

(define (lex-gl output-port)
(lexer-init 'port (open-file "gl.h"))
(let loop ()
(let ((tok (lexer)))
(write tok output-port)
(newline)
(if (not (eq? tok 'stop))
(loop)))))

(lex-gl (current-output-port))







It basically opens the file "gl.h" and passes it off to the C lexer. Then we read all of the tokens and write them to standard output. The output is this:


pp-ifndef
(id "__gl_h_")
pp-define
(id "__gl_h_")
pp-ifdef
(id "__cplusplus")
extern
(string "C")
open-curly
pp-endif
pp-end
typedef
unsigned
int
(id "GLenum")
semicolon
typedef
unsigned
char
(id "GLboolean")
semicolon
typedef
unsigned
int

... snip ...

pp-define
(id "GL_POLYGON_MODE")
(num 2880)
pp-define
(id "GL_POLYGON_SMOOTH")
(num 2881)
pp-define
(id "GL_POLYGON_STIPPLE")
(num 2882)
pp-define

... snip ...

(id "glGetTexParameteriv")
open-paren
(id "GLenum")
(id "target")
comma
(id "GLenum")
(id "pname")
comma
(id "GLint")
star
(id "params")
close-paren
semicolon
extern
void
(id "glHint")
open-paren
(id "GLenum")
(id "target")
comma
(id "GLenum")
(id "mode")
close-paren
semicolon

... (thousands of more lines, of course)







That's great! We have "lexed" the OpenGL header gl.h (I just facepalmed myself for using the word "lexed").

Parsing



Now we need to write a parser which will take the tokens and construct somewhat of an AST for us. Since we're only interested in the preprocessor constants, it's pretty easy. Here's what I came up with. This program will pick out most preprocessor statements (and typedefs too, just for fun):



(define (parser-error err)
(display err)
(exit 1))

(define (parse input-port output-port)
(define (writer node)
(write node output-port)
(newline))
(let loop ((mode #f) (tokens '()))
(let ((t (read input-port)))
(case t
((pp-end)
(if (pair? tokens)
(writer (reverse tokens))
(parser-error "invalid preprocessor statement: pp-end"))
(loop #f '()))
((pp-define pp-include pp-if
pp-ifdef pp-ifndef
pp-else pp-endif
pp-undef pp-import
pp-pragma pp-error)
(loop 'pp (list t)))
((typedef)
(loop 'typedef (list t)))
((semicolon)
(if (not (null? tokens))
(writer (reverse tokens)))
(loop #f '()))
((stop)
#t)
(else
(loop mode (if mode
(cons t tokens)
tokens)))))))

(parse (current-input-port)
(current-output-port))






Running this program, given the input of our lexing program, spits out:


(pp-ifndef (id "__gl_h_"))
(pp-define (id "__gl_h_"))
(pp-ifdef (id "__cplusplus"))
(pp-endif)
(typedef unsigned int (id "GLenum"))
(typedef unsigned char (id "GLboolean"))
(typedef unsigned int (id "GLbitfield"))
(typedef signed char (id "GLbyte"))
(typedef short (id "GLshort"))
(typedef int (id "GLint"))
(typedef int (id "GLsizei"))
(typedef unsigned char (id "GLubyte"))
(typedef unsigned short (id "GLushort"))
(typedef unsigned int (id "GLuint"))
(typedef float (id "GLfloat"))
(typedef float (id "GLclampf"))
(typedef double (id "GLdouble"))
(typedef double (id "GLclampd"))
(typedef void (id "GLvoid"))
(typedef long (id "GLintptr"))
(typedef long (id "GLsizeiptr"))
(pp-ifndef (id "GL_TYPEDEFS_2_0"))
(pp-define (id "GL_TYPEDEFS_2_0"))
(typedef char (id "GLchar"))
(pp-endif)
(pp-ifndef (id "GL_GLEXT_LEGACY"))
(pp-endif)
(pp-define (id "GL_LOGIC_OP") (id "GL_INDEX_LOGIC_OP"))
(pp-define (id "GL_TEXTURE_COMPONENTS") (id "GL_TEXTURE_INTERNAL_FORMAT"))
(pp-define (id "GL_VERSION_1_1") (num 1))
(pp-define (id "GL_VERSION_1_2") (num 1))
(pp-define (id "GL_VERSION_1_3") (num 1))
(pp-define (id "GL_VERSION_1_4") (num 1))
(pp-define (id "GL_VERSION_1_5") (num 1))
(pp-define (id "GL_VERSION_2_0") (num 1))
(pp-define (id "GL_VERSION_2_1") (num 1))
(pp-define (id "GL_ACCUM") (num 256))
(pp-define (id "GL_LOAD") (num 257))
(pp-define (id "GL_RETURN") (num 258))
(pp-define (id "GL_MULT") (num 259))
(pp-define (id "GL_ADD") (num 260))
(pp-define (id "GL_NEVER") (num 512))
(pp-define (id "GL_LESS") (num 513))
(pp-define (id "GL_EQUAL") (num 514))
(pp-define (id "GL_LEQUAL") (num 515))
(pp-define (id "GL_GREATER") (num 516))
(pp-define (id "GL_NOTEQUAL") (num 517))
(pp-define (id "GL_GEQUAL") (num 518))

... snip ...







Now we're getting somewhere. This isn't quite an AST, but lets call it that. Now we need to analyze the AST and generate our preprocessor constants interface. We'll need a program that identifies tokens and their sub-parts, and is able to pull data out of each token. Here's this program:



(define (parser-error . args)
(for-each
(lambda (x)
(if (not (string? x))
(write x)
(display x)))
args)
(newline)
#f)

(define (num-token? token)
(and (pair? token)
(eq? (car token) 'num)
(eq? (length token) 2)))

(define (make-num token)
(cadr token))

(define (id-token? token)
(and (pair? token)
(eq? (car token) 'id)
(eq? (length token) 2)))

(define (make-id token)
(string->symbol (cadr token)))

(define (constant-token? token)
(and (pair? token)
(eq? (car token) 'pp-define)
(eq? (length token) 3)))

(define (make-constant-expr token)
(let ((id-token (cadr token))
(val-token (caddr token)))
(if (not (id-token? id-token))
(parser-error "invalid id: " id-token)
(let ((id (make-id id-token))
(val (cond
((id-token? val-token) (make-id val-token))
((num-token? val-token) (make-num val-token))
(else (parser-error
"invalid constant value: "
val-token)))))
(and val
`(define ,id ,(if (symbol? val)
`(lambda () ,val)
val)))))))

(define (parse input-port output-port)
(let loop ()
(let ((token (read input-port)))
(if (constant-token? token)
(begin
(write (make-constant-expr token) output-port)
(newline)))
(if (not (eq? token #!eof))
(loop)))))

(parse (current-input-port)
(current-output-port))






Now, give the output of our parser to this, and this generates:


(define GL_LOGIC_OP (lambda () GL_INDEX_LOGIC_OP))
(define GL_TEXTURE_COMPONENTS (lambda () GL_TEXTURE_INTERNAL_FORMAT))
(define GL_VERSION_1_1 1)
(define GL_VERSION_1_2 1)
(define GL_VERSION_1_3 1)
(define GL_VERSION_1_4 1)
(define GL_VERSION_1_5 1)
(define GL_VERSION_2_0 1)
(define GL_VERSION_2_1 1)
(define GL_ACCUM 256)
(define GL_LOAD 257)
(define GL_RETURN 258)
(define GL_MULT 259)
(define GL_ADD 260)
(define GL_NEVER 512)
(define GL_LESS 513)
(define GL_EQUAL 514)
(define GL_LEQUAL 515)
(define GL_GREATER 516)
(define GL_NOTEQUAL 517)
(define GL_GEQUAL 518)
(define GL_ALWAYS 519)

... snip ...







And we have a list of OpenGL preprocessor constants available to us in Scheme! Notice how we aren't actually using any of the FFI mechanisms. Since all boundaries between Scheme and C land are typed, we can't really pull out preprocessor constants from C. We aren't trying to make this FFI compatible across versions of header files either. The point of automatically generating these interfaces is so that it's easier to re-generate them against any version/platform.

I hope someone else found this as interesting as I did. I will be working on parsing typedef's and function declarations as well, which is a natural extension to the above programs. Having such technology will make it almost painless to be working in Scheme, where I'm usually cut off from all of the math/graphics/etc. C libraries out there.
Sign in to follow this  


0 Comments


Recommended Comments

There are no comments to display.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now