# C++ to HTML Syntax Highlighter [Complete]

Well, I've been working on my own little C++ to HTML syntax highlighter in preparation for making a new web site. I know there are a lot out there already, but I wanted to learn how to do it properly myself. Anyway, if anyone wants to try it out, I have put together a demo package. Everything you need to know is in the short "ReadMe.txt", so take a look at that first. I'm really interested in whether anyone can find any bugs or any source files that break the parser. I feel fairly confident in it already; I've tested it against MMGR.cpp and so far it seems pretty good. I have recently found and fixed a few bugs in operations involving strings, numbers, and comments, but something might have broken that I have yet to see. I've also been adding a few features here and there, such as wrapping and comment stripping. OK, thanks for any help! [smile] Any comments, feedback, or suggestions are welcome.

[Edited by - Drew_Benton on January 19, 2006 12:01:54 PM]

##### Share on other sites
looking good. i tried it out on two of my files, works a charm.
good work.

as for suggestions, if it could handle more than one colour (eg, preprocessor is coloured differently to keywords, etc ), that would be very nice.

##### Share on other sites
Quote:
 Original post by rip-off
 as for suggestions, if it could handle more than one colour (eg, preprocessor is coloured differently to keywords, etc ), that would be very nice.

* Preprocessor now has a separate list and its own style
* Logic added so keywords that appear in preprocessor lines are not colored
* Added a custom list with its own style
* Search order is now preprocessor -> keywords -> custom

##### Share on other sites
works great. can't think of anything you could add to it...

##### Share on other sites
Thanks! I've added one more main update: the program now produces its output in a more web-friendly form. For example, here is the mmgr.cpp file in the new format. OK, that's it for now. I'll be adding various stuff later on and making more improvements and optimizations.

##### Share on other sites
I found and fixed a little bug in the generated HTML output. Before, it was writing out "&lt;" and "&gt;" for "<" and ">", but when you copied the text over to your web page, it was still interpreted as the "<" and ">" characters. It now writes "&amp;lt;" and "&amp;gt;", so you will see the correct "&lt;" and "&gt;" when you copy and paste into a web page.
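
For anyone curious what that kind of escaping looks like, here is a minimal sketch. It is not the tool's actual code; the function name `EscapeHtml` is made up for illustration, and it assumes only `&`, `<`, and `>` need replacing.

```cpp
#include <string>

// Hypothetical helper (not taken from the tool): replace the characters
// HTML treats specially so that source code is displayed literally.
std::string EscapeHtml(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text)
    {
        switch (c)
        {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}
```

Running the result through the helper a second time gives the doubly escaped form, so `EscapeHtml(EscapeHtml("<"))` yields `&amp;lt;`, which is what survives a copy and paste into another page's markup.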

For all you PHP users, you can easily create a new .php file, paste the HTML for your code into it, then do something like the following:
echo '<div class="Source">';
echo '<pre>';
include("post1.php");
echo '</pre>';
echo '</div>';

Here you have defined a specific layout for the DIV, and you get nice, neat source code snippets on your blog or page without cluttering up the actual file. This way, you can change the snippets as needed without ever touching your main page, which is another great feature of PHP [wink]

##### Share on other sites
Big Update! I've gone through and made the design a bit more flexible as well as added some speed optimizations.

First off, there is only one main keyword file now. You can add as many 'sections' of specific words to color as you want. The only naming limitation is that you currently have to use "preprocessor" to denote a preprocessor section. This will be improved in the future, but for now it is like that. You can also opt not to use the keyword list, in which case only strings, comments, and numbers will be highlighted according to your CSS settings.

Second, I used one central vector for all the words, with its own comparison and equality operators, so when searching for a specific word, a binary search is used to check whether the word exists, which improves performance.
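
A rough sketch of that idea, for anyone who wants to try it. This is an assumption about how it could be laid out, not the tool's source: the `WordEntry` type, its fields, and the function names are all hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical entry type: a word plus the section (style) it belongs to.
struct WordEntry
{
    std::string word;
    std::string section;
};

// Ordering and equality look only at the word itself.
bool operator<(const WordEntry& a, const WordEntry& b) { return a.word < b.word; }
bool operator==(const WordEntry& a, const WordEntry& b) { return a.word == b.word; }

// Sort once after loading so that binary search is valid afterwards.
void SortWords(std::vector<WordEntry>& words)
{
    std::sort(words.begin(), words.end());
}

// Binary search for a word; returns nullptr when it is not in the list.
const WordEntry* FindWord(const std::vector<WordEntry>& words, const std::string& word)
{
    const WordEntry key{word, ""};
    auto it = std::lower_bound(words.begin(), words.end(), key);
    if (it != words.end() && *it == key)
        return &*it;
    return nullptr;
}
```

With the vector sorted once up front, each lookup costs O(log n) comparisons instead of a linear scan, and the returned entry tells you which section's style to apply.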

The next minor change was to get rid of the config file altogether; a better option should be working soon. I made a few little changes here and there as well to make things smoother. So for the time being, the input file is hardcoded and must be named "input.txt". The program will then produce an "output.html" file. The CSS file it reads from is "style.css" and the keyword list has to be "keywords.txt". This is just temporary, but if you use this, make sure to keep it in mind.

Now for a new example, here's a colored version of Lazy Foo's Tutorial 20. For this example, I added a few SDL-specific keywords (defines/data types) in a new section and added the CSS entry for it.

I've also made a new readme.txt file that explains all of this. If anyone has any suggestions, please share! Also if any bugs are found, I'd like to hear about those as well as any other feedback! Thanks!

edit 1: Fixed a bug where a closing bracket followed by a period, then any character that can be used for hex, or f, u, or l, would be parsed as a number.
edit 2: Used 'reserve' for the final string that contains the output file to eliminate continuous string reallocations.
edit 3: Fixed another minor parsing bug caused by /*/
edit 4: Strings in preprocessor lines are now colored as strings

[Edited by - Drew_Benton on January 12, 2006 6:03:04 PM]

##### Share on other sites
results are pretty good, tried it with a couple of smaller sources of mine. is the lexer handwritten or do you use a lexer generator?

##### Share on other sites
Quote:
 Original post by marzec
 results are pretty good, tried it with a couple of smaller sources of mine. is the lexer handwritten or do you use a lexer generator?

Awesome, glad it worked fine for you. Here is a rough outline of what I do:
Load words to color
Load file to color into a string
Loop through each character in the string
    Process the characters and keep track of specific states based on occurrences
    Apply coloring where applicable
Write out the final string to the html file

It's pretty simple; minus all of my commenting, it's around 200-250 total lines of actual code. The trick for me was doing it on a char-by-char basis and using a state system rather than trying to tokenize. I had 3 days' worth of failures when trying to tokenize and color like that [lol]. I'm still not done with this yet.
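
To make the state idea concrete, here is a heavily cut-down sketch. It is not the tool's source: it only handles comments and double-quoted strings (no keywords, numbers, or character literals), and the state names, function names, and CSS class names are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>

enum class State { Code, LineComment, BlockComment, String };

// Append one character, escaping the ones HTML treats specially.
void AppendEscaped(std::string& out, char c)
{
    switch (c)
    {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;";  break;
        case '>': out += "&gt;";  break;
        default:  out += c;       break;
    }
}

// Walk the source one character at a time, opening and closing spans
// as the state changes. Class names "comment" and "string" are made up.
std::string Highlight(const std::string& src)
{
    std::string out;
    out.reserve(src.size() * 4); // rough guess to limit reallocations
    State state = State::Code;

    for (std::size_t i = 0; i < src.size(); ++i)
    {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';

        switch (state)
        {
        case State::Code:
            if (c == '/' && next == '/') {
                out += "<span class=\"comment\">//";
                state = State::LineComment;
                ++i; // both characters consumed
            } else if (c == '/' && next == '*') {
                out += "<span class=\"comment\">/*";
                state = State::BlockComment;
                ++i; // consuming the '*' keeps "/*/" from closing itself
            } else if (c == '"') {
                out += "<span class=\"string\">\"";
                state = State::String;
            } else {
                AppendEscaped(out, c);
            }
            break;

        case State::LineComment:
            if (c == '\n') {
                out += "</span>\n";
                state = State::Code;
            } else {
                AppendEscaped(out, c);
            }
            break;

        case State::BlockComment:
            if (c == '*' && next == '/') {
                out += "*/</span>";
                state = State::Code;
                ++i;
            } else {
                AppendEscaped(out, c);
            }
            break;

        case State::String:
            if (c == '\\' && next != '\0') {
                // Copy an escape sequence whole so \" does not end the string.
                AppendEscaped(out, c);
                AppendEscaped(out, next);
                ++i;
            } else if (c == '"') {
                out += "\"</span>";
                state = State::Code;
            } else {
                AppendEscaped(out, c);
            }
            break;
        }
    }

    if (state != State::Code)
        out += "</span>"; // close a span left open at end of input
    return out;
}
```

The point of the state variable is that a `//` inside a string, or a quote inside a comment, never starts a new span, because those checks only run while in the `Code` state.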

I'm still looking for features to add and still bug testing. For me, though, it runs really fast, and I'd like to hear how fast it runs for you all as well. I am using a 124kb test file for regression testing, and it generates the 500kb html file in only 250ms in a release build. I guess it should be relatively fast since I preload the file to be parsed, and then it's just going through the characters and processing here and there. I'll use a high resolution timer in the next release to get a more accurate time.

Thanks for trying it out! I need to find some web people that would have a use for this to get feedback on the style of the text generated, since this is for that purpose, but maybe I can make it for something else as well...

edit 1: Added the high resolution timer, and the timings from GetTickCount were correct: it takes very few milliseconds for smaller files, and an 834kb input file took 6 seconds to finish.
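
If you want to time your own runs, something like the following works as a portable stand-in. It is only a sketch: it uses `std::chrono::steady_clock` from the C++11 standard library rather than whatever timer the tool itself uses, and the helper name `TimeMilliseconds` is made up.

```cpp
#include <chrono>

// Hypothetical helper: run any callable once and return the elapsed
// wall-clock time in milliseconds, measured with a monotonic clock.
template <typename Func>
double TimeMilliseconds(Func&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

You would wrap the whole highlighting pass in a lambda and pass it in, for example `TimeMilliseconds([&] { /* highlight the loaded file here */ })`.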

##### Share on other sites
heh, that reminds me of the "lexer" i made for a simple script language once. tedious and hell to debug if your state machine is complex.

anyhow, if you are worried about performance then i recommend reading the data from the file in big chunks and operating in memory instead of reading one character at a time. but you most probably do that already i guess :)

edit: ok, my mistake, didn't properly read your last message :p