# C++ to HTML Syntax Highlighter [Complete]


Well, I've been working on my own little C++ to HTML syntax highlighter in preparation for making a new web site. I know there are a lot out there already, but I wanted to learn how to do it properly myself. Anyways, if anyone wants to try it out, I have put together a demo package. Everything that you need to know is in the short "ReadMe.txt", so take a look at that first.

I'm really interested if anyone can find any bugs or any source files that break the parser. I feel fairly confident in it already; I've tested against MMGR.cpp and so far it seems pretty good. I have recently found and fixed a few bugs involving strings, numbers, and comments, but something might have broken that I have yet to see. I've also been adding a few features here and there, such as wrapping and comment stripping.

Ok, thanks for any help! [smile] Any comments, feedback, or suggestions are welcome.

[Edited by - Drew_Benton on January 19, 2006 12:01:54 PM]

##### Share on other sites
looking good. i tried it out on two of my files, works a charm.
good work.

as for suggestions, if it could handle more than one colour (eg, preprocessor is coloured differently to keywords, etc ), that would be very nice.

##### Share on other sites
Quote:
 Original post by rip-off
 as for suggestions, if it could handle more than one colour (eg, preprocessor is coloured differently to keywords, etc ), that would be very nice.

* Preprocessor now has a separate list and its own style
* Logic added so keywords that appear in preprocessor lines do not color
* Added a custom list and its own style
* Order of searching is now preprocessor->keywords->custom

##### Share on other sites
works great. can't think of anything you could add to it...

##### Share on other sites
Thanks! I've added one more main update: the program now produces its output in a more web-friendly form. For example, here is the mmgr.cpp file in the new format. Ok, that's it for now. I'll be adding various stuff later on and making more improvements and optimizations.

##### Share on other sites
I found and fixed a little bug in the generated HTML output. Before, it was writing out "&lt;" and "&gt;" for "<" and ">", but when you copied the text over to your web page, it was still interpreted as the "<" and ">" characters. It now writes "&amp;lt;" and "&amp;gt;" instead, so you will see the correct "&lt;" and "&gt;" when you copy and paste into a web page.
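For anyone curious what that escaping step might look like, here is a minimal sketch. The function name `escapeHtml` is made up for illustration and is not taken from the actual program; it only assumes the usual three entities.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the original program): replaces the
// characters that HTML treats specially with their entity names.
// '&' is handled too, otherwise existing entities would be ambiguous.
std::string escapeHtml(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}
```

Calling it once gives the form a browser renders as the original character. Calling it twice, as in `escapeHtml(escapeHtml("<"))`, gives the "&amp;lt;" form that survives being pasted into another page as visible markup.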

For all you PHP users, you can create a new .php file, paste the HTML for your code into it, then do something like the following:

    echo '<div class="Source">';
    echo '<pre>';
    include("post1.php");
    echo '</pre>';
    echo '</div>';

Here you have defined a specific layout for the DIV, so you get nice, neat blog/page source code snippets without cluttering up the actual file. This way, you can change the source code snippets as you need and never have to touch your main page, which is another great feature of PHP [wink]

##### Share on other sites
Big Update! I've gone through and made the design a bit more flexible as well as added some speed optimizations.

First off, there is only one main keyword file now. You will be able to add as many 'sections' as you want of specific words to color. The only naming limitation is that you currently have to use "preprocessor" to denote a preprocessor section. This will be improved in the future, but for now it is like that. You can also opt not to use the keyword list, in which case only strings, comments, and numbers will be highlighted according to your CSS settings.

Second, I used one central vector for all the words, with its own comparison and equality operators, so that a binary search can be used when looking up a specific word, which improves performance.
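A rough sketch of that lookup idea, assuming a sorted `std::vector` and `std::lower_bound`. The `Word` struct, its fields, and the function names are hypothetical; the real program's types are not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical word entry: the text plus the CSS class of its section.
struct Word {
    std::string text;
    std::string cssClass;
};

// Ordering and equality look only at the text, so lookups ignore the section.
bool operator<(const Word& a, const Word& b) { return a.text < b.text; }
bool operator==(const Word& a, const Word& b) { return a.text == b.text; }

// Sort once after loading so a binary search can be used afterwards.
void prepare(std::vector<Word>& words)
{
    std::sort(words.begin(), words.end());
}

// Returns the matching entry, or nullptr if the word is not in the list.
const Word* findWord(const std::vector<Word>& words, const std::string& text)
{
    const Word key{text, ""};
    auto it = std::lower_bound(words.begin(), words.end(), key);
    return (it != words.end() && *it == key) ? &*it : nullptr;
}
```

Each lookup costs O(log n) comparisons instead of a linear scan, which matters when every word in a large source file has to be checked against the list.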

The next minor change was to get rid of the config file altogether; I'll get a better option working soon. I made a few little changes here and there as well to make things smoother. So for the time being, the input file is hardcoded and must be named "input.txt". The program will then make an "output.html" file. The CSS file it reads from is "style.css" and the keyword list has to be "keywords.txt". This is just temporary though, so if you use this, make sure to keep it in mind.

Now for a new example, here's a colored version of Lazy Foo's Tutorial 20. For this example, I added a few SDL-specific keywords (defines/data types) in a new section and added the CSS entry for that.

I've also made a new readme.txt file that explains all of this. If anyone has any suggestions, please share! Also if any bugs are found, I'd like to hear about those as well as any other feedback! Thanks!

edit 1: Fixed a bug where a closing bracket followed by a period, then any character that can be used for hex, f, u, and l would be parsed as a number.
edit 2: Used 'reserve' for the final string that contains the output file to eliminate continuous string reallocations.
edit 3: Fixed another minor parsing bug caused with /*/
edit 4: Strings in preprocessor lines are now colored as strings

[Edited by - Drew_Benton on January 12, 2006 6:03:04 PM]

##### Share on other sites
results are pretty good, tried it with a couple of smaller sources of mine. is the lexer handwritten or do you use a lexer generator?

##### Share on other sites
Quote:
 Original post by marzec
 results are pretty good, tried it with a couple of smaller sources of mine. is the lexer handwritten or do you use a lexer generator?

Awesome, glad it worked fine for you, here is a rough little procedure for what I do:
Load words to color
Load file to color into a string
Loop through each character in the string
    Process the characters and keep track of specific states based on occurrences
    Apply coloring where applicable
Write out the final string to the html file

It's pretty simple; minus all of my comments, it's around 200-250 total lines of actual code. The trick for me was doing it on a char by char basis and using a state system rather than trying it with tokenizing. I had 3 days' worth of failures when trying to tokenize and color like that [lol]. I'm still not done with this yet.
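To make the char by char idea a bit more concrete, here is a stripped-down sketch that only handles comments and string literals. It uses a single enum for the state rather than separate flags, and the function name and CSS class names are assumptions for illustration, not the program's actual code. Keywords, numbers, char literals and HTML escaping are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class State { Code, LineComment, BlockComment, String };

// Hypothetical sketch: wraps comments and string literals in <span> tags.
std::string colorize(const std::string& src)
{
    std::string out;
    out.reserve(src.size() * 2);   // avoid repeated reallocations
    State state = State::Code;

    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';

        switch (state) {
        case State::Code:
            if (c == '/' && next == '/') {
                out += "<span class=\"comment\">//";
                state = State::LineComment;
                ++i;
            } else if (c == '/' && next == '*') {
                out += "<span class=\"comment\">/*";
                state = State::BlockComment;
                ++i;               // consume '*' so "/*/" cannot close itself
            } else if (c == '"') {
                out += "<span class=\"string\">\"";
                state = State::String;
            } else {
                out += c;
            }
            break;

        case State::LineComment:
            if (c == '\n') {
                out += "</span>\n";
                state = State::Code;
            } else {
                out += c;
            }
            break;

        case State::BlockComment:
            if (c == '*' && next == '/') {
                out += "*/</span>";
                state = State::Code;
                ++i;
            } else {
                out += c;
            }
            break;

        case State::String:
            if (c == '\\' && next != '\0') {
                out += c;
                out += next;       // copy the escaped character untouched
                ++i;
            } else if (c == '"') {
                out += "\"</span>";
                state = State::Code;
            } else {
                out += c;
            }
            break;
        }
    }
    if (state != State::Code) out += "</span>";   // close an unterminated span
    return out;
}
```

For example, `colorize("int x; // hi\n")` wraps only the `// hi` part in a comment span and leaves the rest of the line alone. Because the state is a single value, the parser can never be "in a string" and "in a comment" at once, which is the kind of bug that separate flags make easy to introduce.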

I'm still looking for features to add and still bug testing. For me though, it runs really fast; I'd like to hear how fast it runs for you all as well. I am using a 124kb test file to regression test against, and it generates the 500kb html file in only 250ms in a release build. I guess it should be relatively fast since I preload the file to be parsed, then it's just going through the characters and processing here and there. I'll use a high resolution timer in the next release to get a more accurate time.

Thanks for trying it out! I need to find some web people that would have a use for this to get feedback on the style of the text generated, since this is for that purpose, but maybe I can make it for something else as well...

edit 1: Added in the high resolution timer, and the performance numbers from GetTickCount were correct: it takes very low time in ms for smaller files, and an 834kb input file took 6 seconds to finish.

##### Share on other sites
he that reminds me of the "lexer" i made for a simple script language once. tedious and a hell to debug if your state machine is complex.

anyhow, if you are worried about performance then i recommend reading in the data from the file in big chunks and operate in memory instead of reading in one character at the time. but you most probably do that already i guess :)

edit: ok me's a retard, didn't properly read you last message :p

##### Share on other sites
In your example http://www.utdallas.edu/~dbb033000/mmgr.html the "new" in "new_handler" is highlighted as a keyword.

##### Share on other sites
Quote:
 Original post by marzec
 he that reminds me of the "lexer" i made for a simple script language once. tedious and a hell to debug if your state machine is complex.
 edit: ok me's a retard, didn't properly read you last message :p

[smile] Yea, reading it again, low does look like slow at first glance. The main technique I used for debugging mine was testing case by case with smaller input files, usually 1-3 lines, so I could walk through and see if there was a problem. Most of the issues I've had were relatively simple though, such as the /*/ bug, where I just added in a check to make sure the second from last char wasn't /.

Of course the biggest thing is to make sure you do regression testing to ensure you didn't break anything. I had an awesome 900kb file I put together that should have tested about everything, but I accidentally lost it when I made it simpler, forgot to undo, then exited VS last night. Oh well [lol] I can make another one.

If there is one thing that I'd like to do better, it is the actual handling of states. Right now I have 10 or so bool states to help track where I'm at. I'd definitely like a better design for that, as well as to consolidate the logic for comments into a more general solution to allow for other languages, but I'm not sure how that will work out.

Quote:
 Original post by kwackers
 In your example http://www.utdallas.edu/~dbb033000/mmgr.html the "new" in "new_handler" is highlighted as a keyword.

Thanks for pointing that out. I actually fixed that a while back. The bug was that I was only checking whether the next character was an alpha/digit, and not also checking for _, to determine if I was still in a word or not. I just forgot to update that link, so I went ahead and reuploaded the most recent version. Good catch!
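The word boundary check described here can be sketched like this. Both function names are hypothetical, and this version also looks at the character before the match, which the post does not mention.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Characters that can be part of a C++ identifier. Forgetting the '_'
// here is what lets "new" match inside "new_handler".
bool isWordChar(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Hypothetical check: true only if `word` occurs at `pos` as a whole
// identifier, i.e. not glued to other word characters on either side.
bool isWholeWordAt(const std::string& src, std::size_t pos, const std::string& word)
{
    if (pos > src.size()) return false;
    if (src.compare(pos, word.size(), word) != 0) return false;
    const bool startsClean = (pos == 0) || !isWordChar(src[pos - 1]);
    const std::size_t end = pos + word.size();
    const bool endsClean = (end >= src.size()) || !isWordChar(src[end]);
    return startsClean && endsClean;
}
```

With this, `isWholeWordAt("new_handler", 0, "new")` is false while `isWholeWordAt("p = new T;", 4, "new")` is true. The same check rejects a keyword that is immediately followed by digits, such as the `int` inside `int134`.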

edit 1: Added in support for spaces in the preprocessor. #if will be parsed and colored the same way as # if, #'\t'if, etc., for all preprocessor directives.
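One way to handle the whitespace between `#` and the directive name is to skip blanks on both sides of the `#` before reading the name. This is only a sketch under that assumption; `directiveName` is a hypothetical helper, not a function from the program.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: if `line` is a preprocessor line, returns the directive
// name ("if", "include", ...); otherwise returns an empty string.
std::string directiveName(const std::string& line)
{
    std::size_t i = 0;
    auto skipBlanks = [&] {
        while (i < line.size() && (line[i] == ' ' || line[i] == '\t')) ++i;
    };

    skipBlanks();                              // indentation before '#'
    if (i >= line.size() || line[i] != '#') return "";
    ++i;
    skipBlanks();                              // handles "# if" and "#\tif"

    const std::size_t start = i;
    while (i < line.size() && std::isalpha(static_cast<unsigned char>(line[i]))) ++i;
    return line.substr(start, i - start);
}
```

The returned name can then be looked up in the preprocessor word list exactly as if the line had been written without the extra whitespace.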

[Edited by - Drew_Benton on January 13, 2006 12:43:33 PM]

##### Share on other sites
ooh, fancy.

for the input:
Quote:
 string text = "jimmy says \"hi how are you \".";

your program generates the correct output. nice one!

unlike you.
*points finger at source tags*
string text = "jimmy says \"hi how are you \".";

##### Share on other sites
[grin] Thanks! I was reading this post over again, been busy with school and at first I thought you were saying it did it wrong, and I was like nooooo lol. Then I read it again [smile] My next task will be converting it to PHP to see if I can pull off a web based version. I tried it once already but no luck, so I'll have to try again.

##### Share on other sites
Edit: Ignore me. I got it working...

##### Share on other sites
Quote:
 Original post by AdamWebb
 Edit: Ignore me. I got it working...

If there was trouble, what was it? Chances are something wasn't clear then or needs updating. Of course I'm still working on using a config file, so if it was related to that, getting the right names of the files, then you can ignore me [wink]

##### Share on other sites
Ok guys, I got it updated. It is now command line based, so you do not have to rename your files to match the hardcoded names it used to require. I've also taken out the timer code, so the code is smaller now; the package is only 54KB. I've updated the readme as well to reflect this. If anyone has any questions or needs help, feel free to ask!

edit 1: Fixed a few logical errors when processing valid C++ syntax. For example:
#include <iostream>/* #include <iostream> */ #include <iostream>/* #include <iostream> */ #include <iostream> /* #include <iostream> *//*#include <iostream> */ #include <iostream> /* #include <iostream> *//*#include <iostream>*/ #include <iostream> /* #include <iostream> *//*#include <iostream>*/ #include <iostream> /* #include <iostream> *//*#include <iostream> */#include <iostream> /* #include <iostream> */

This now works and is parsed correctly; before, it was incorrect. Given this, I think I will need to do more testing with weird examples using numbers in the name as well. I've kept a big regression test, and it validated after these updates, so nothing should break. But if you notice anything wrong, please point it out!

edit 2: Fixed another parsing error.
int int134;
int int_134;
int _int_134;

These will all parse correctly now; the <keyword><number> pattern was generating incorrect coloring before.

edit 3: I've made a few optimizations to make post processing faster. Rather than scan through at the end and replace '\n' with <br />, I do that at parse time, so immediate speed ups are seen. However, there is still an overhead of replacing & with &amp; for the code preview, so I have made two versions of the program now. One version will only generate the final code to use in your web pages, and the other one will generate the final code along with the preview window. That way, if you just need the specific code and don't need a preview, you can get rather instant results, as opposed to having to wait seconds for the preview to generate. Of course with smaller files this is not a problem, but if you have larger files, then you might need it. I found it a lot easier just to make two programs rather than try tricky logic in one. Any feedback/ideas on this?

[Edited by - Drew_Benton on January 16, 2006 6:33:40 PM]

##### Share on other sites
i have a new complaint/request.

stop naming all your files demo!

i have 8 files/programs belong to you, and they are named demo, demo1, demo(1)...

i have no idea which one is which...

other than that, its still good.

##### Share on other sites
Quote:
 Original post by rip-off
 i have a new complaint/request.
 stop naming all your files demo!
 i have 8 files/programs belong to you, and they are named demo, demo1, demo(1)...
 i have no idea which one is which...
 other than that, its still good.

Hehehe, actually, I was thinking the same thing myself when I started making changes and stuff, so what I did was make the project have 4 configurations: Debug (No Preview), Debug - Preview, Release (No Preview), and Release - Preview. In addition, the debug versions have _d appended on. I'm still trying to break that bad habit of not doing this until the end, but I promise I will improve [grin]. I've gone ahead and finished up my Wiki page for it.

I feel that right now it's pretty much in a final form. I started to want to add more stuff to it, but then I decided against that "Feature Creep" and kept it doing exactly what I had wanted it to do. I plan on using the base of it for other projects similar in nature though. Thanks for all your time and testing, rip-off! If you want, you can delete all of the files you have, then get the new final "Parser Demo.zip". I went ahead and cleaned out all of my old backup code files and kept only this one version.

##### Share on other sites
A few suggestions:
1. Operators, you should perhaps be wrapping them in spans as well.
2. Includes, In general I tend to view the include text as a string (as does VS).
3. Identifiers: I may not want my identifiers to be whatever the default color is, wrap them in spans as well.

##### Share on other sites
Just when I thought my work was done! [wink] I've gone ahead and updated the code so it can handle that. Getting the operators right required me to change my post-parsing so that it no longer looks only for alphanumeric characters or '_', but I've tested again with large files and so far no bugs have crept in [smile] I didn't add in 'identifiers' explicitly because you can change the default text to display whatever color you want via CSS. Thanks for the suggestions! I've updated all files and uploaded them over the old ones.

##### Share on other sites
I have placed the now final version here on my webpage. If you look around the stuff I've added so far on my page, you can see it in action [smile]. Overall, I am very happy with it as is; it has saved me a lot of time when re-adding my programming content. Washu suggested I look up an LL(k) parser to do this in as well, so I will see if I can come up with a second version that is rule/definition based so any language can be supported. Ok, thanks for reading!

##### Share on other sites
Great Job!

I will check out your wiki most thoroughly [wink]

I have a few requests.

1. Ability to output <br /> instead of <br> (important for XHTML)
2. Command-line option to suppress <br> (not needed if used within <pre>)
3. Command-line option to not change spaces to nbsp;
4. Version for linux

Just thought I'd mention that the source tag example says "SoureColor"

##### Share on other sites
Quote:
 Original post by Boder
 I have a few requests.
 1. Ability to output <br /> instead of <br> (important for XHTML)
 2. Command-line option to suppress <br> (not needed if used within <pre>)
 3. Command-line option to not change spaces to nbsp;
 4. Version for linux

Great ideas! The first one was really an overlooked bug in which I just forgot to use "<br />" rather than "<br>". Thanks for bringing that up! I have added 1-3; it only took a few seconds with how it's designed. I've updated the SourceColor Wiki page, but the main change is just that you add two extra parameters when you call it, [break string] [space string], that specify what you want the breaking string to be after a new line as well as what a space string would be. If you do not want a breaking string, just pass in "", and the same goes for spaces, so #1 and #2 are done with that one little change. The same thing was applied for #3 as well; now 4x 'Space' strings are used for a tab.

Since I'm not a Linux programmer, I cannot help you with #4 directly, but I've gone ahead and put the source up on the Wiki now (Code coloring compliments of itself [grin]). The main reason I was holding off on the source was just for testing, bugs, and all of that stuff, but now I feel it is fairly rock solid. I've added a lot of stuff to the Wiki now and had no problems at all with the program. So, if you are a Linux programmer and can make a Linux port, that would be greatly appreciated! I could add that to the site as well with credits to you. I think the code I wrote is pretty much ready to be used as is; I just don't have a compiler to make the files and such.

So, if anyone makes any neat changes that I didn't think of, feel free to share, please! [smile] I think I will work on a PHP port soon if time permits, as well as another version. I just haven't gotten around to it, busy porting SDL to C++. Anyways, Boder, the one thing I didn't get is the: "Just thought I'd mention that the source tag example says "SoureColor"". Can you explain some more please? It's kinda late and I'm not sure what you mean.