Plain text to HTML conversion/HTML cleaning

2 comments, last by polly 18 years, 5 months ago
Hi there,

Obviously not a game-related question, but I thought I'd ask here in case anyone knew.

Does anyone know of a (Java) class or API that allows you to stream text in from a plain text/html file (or database) and clean it/convert it to HTML, a line at a time, before streaming it back out again?

My company is having a bit of a problem with reading large amounts of text into memory and then trying to convert it into HTML. Basically, we need to be able to handle a large number of potentially very big files in this way.

The only way I can think of to get around the problem would be to stream the text into memory a small piece at a time whilst maintaining a set of "states", then stream the result out again, storing any state change in local variables and waiting for the next chunk of text. But I really, really don't want to write it myself.

Comments and observations are welcome even if you don't know of something that will do the job.

Cheers,
Jon
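For illustration only, here is a minimal sketch of the "small piece at a time plus a set of states" idea for the plain-text-to-HTML direction. It is not an existing library: the class name StreamingTextToHtml, the rule that a blank line ends a paragraph, and the choice to emit <br/> between lines are all assumptions made for the example. The only state carried from one line to the next is a single boolean.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch: converts plain text to simple HTML one line at a time.
final class StreamingTextToHtml {

    // Escape the characters that are unsafe in HTML text content.
    static String escape(String line) {
        StringBuilder sb = new StringBuilder(line.length() + 16);
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reads one line, writes its HTML, then forgets it. Memory use depends on
    // the longest line, not on the size of the whole input.
    static void convert(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        boolean inParagraph = false; // the only state kept between lines
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                // Assumed rule: a blank line closes the current paragraph.
                if (inParagraph) {
                    out.write("</p>\n");
                    inParagraph = false;
                }
            } else if (inParagraph) {
                out.write("<br/>\n");
                out.write(escape(line));
            } else {
                out.write("<p>");
                out.write(escape(line));
                inParagraph = true;
            }
        }
        if (inParagraph) {
            out.write("</p>\n");
        }
        out.flush();
    }

    // Convenience wrapper for small inputs and for trying the sketch out.
    static String convert(String text) {
        StringWriter out = new StringWriter();
        try {
            convert(new StringReader(text), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert("Hello\nworld\n\nA < B & C\n"));
    }
}
```

For real files, the Reader/Writer overload would be handed a FileReader and FileWriter (or database streams) instead of strings. Richer markup such as lists or headings would need more state than one boolean, but the shape of the loop stays the same.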
I don't see what you mean by

Quote:
that allows you to stream text in from a plain text/html file (or database) and clean it/convert it to HTML


If the file is already HTML, then why would you want to clean it (what criteria would you use?), and especially, what do you mean by "convert it to HTML"?

Jc

A concrete example could be useful
If the source data is XHTML compliant, then you can just use a forward-scanning XML parser to parse it. Those parsers read the tags and give you events when you reach various points in the source stream; you feed them chunks of text and they keep the parse state internally.
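As a rough sketch of what that looks like in Java, the standard SAX API (javax.xml.parsers and org.xml.sax) reports elements and character data as events while it reads, so no tree is built. The example below uses those events to turn well-formed XHTML into plain text. The class and handler names are made up for illustration, and the choice of p, br and div as the elements that end a line is an assumption. It also assumes input without a DOCTYPE; a real XHTML DOCTYPE may make the parser try to load the DTD, which would need extra configuration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: streams well-formed XHTML through SAX and writes plain text.
final class XhtmlToTextSketch {

    // Writes character data to the output as soon as the parser reports it.
    static final class TextHandler extends DefaultHandler {
        private final Writer out;

        TextHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            // Assumption: only these elements end a line in the plain-text output.
            if (qName.equals("p") || qName.equals("br") || qName.equals("div")) {
                try {
                    out.write('\n');
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        }
    }

    // The parser pulls from the Reader as it goes; parse state lives inside the parser.
    static void convert(Reader in, Writer out) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.newSAXParser().parse(new InputSource(in), new TextHandler(out));
        out.flush();
    }

    // Convenience wrapper for small inputs and for trying the sketch out.
    static String convert(String xhtml) {
        StringWriter out = new StringWriter();
        try {
            convert(new StringReader(xhtml), out);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert(
            "<html><body><p>Hello <b>world</b></p><p>A &lt; B<br/>done</p></body></html>"));
    }
}
```

This only works when the input is already well-formed XML; tag soup or pre-XHTML markup would make the parser throw, so it does not replace the cleaning step.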
enum Bool { True, False, FileNotFound };
Quote:Original post by Anonymous Poster
I don't see what you mean by

Quote:
that allows you to stream text in from a plain text/html file (or database) and clean it/convert it to HTML


If the file is already HTML, then why would you want to clean it (what criteria would you use?), and especially, what do you mean by "convert it to HTML"?

Jc

A concrete example could be useful


OK. Maybe I wasn't too clear. Essentially there are two different operations being performed here:

1) Cleaning HTML. User-supplied or document HTML may not be valid, or it may be in an older version of HTML (i.e. anything before XHTML), so it needs to be put through an HTML cleaner to convert it to decent XHTML. We have an HTML cleaner at the moment, but it requires all the HTML to be loaded before it can start "cleaning" it.

2) Converting HTML to plain text and converting plain text to HTML. Again, we have code that does this, but it requires all the HTML to be in memory before it can convert it to plain text, or all the plain text before it can convert it to HTML.

In both cases the whole HTML document needs to be loaded into memory before either operation can be performed. I just wondered if anyone knew of anything that could perform one, or both, operations without having the whole DOM tree loaded.

Jon

This topic is closed to new replies.
