Plain text to HTML conversion/HTML cleaning

Hi there,

Obviously not a game-related question, but I thought I'd ask here in case anyone knew.

Does anyone know of a (Java) class or API that allows you to stream text in from a plain text/html file (or database) and clean it/convert it to HTML, a line at a time, before streaming it back out again?

My company is having a bit of a problem with reading large amounts of text into memory and then trying to convert it into HTML. Basically, we need to be able to handle a large number of potentially very big files in this way.

The only way I can think of to get around the problem would be to stream the text into memory a small piece at a time whilst maintaining a set of "states", then stream the result out again, storing any state change in local variables and waiting for the next chunk of text. But I really, really don't want to write it myself.

Comments and observations are welcome even if you don't know of something that will do the job.

Cheers,
Jon
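
For reference, the "small piece at a time plus a set of states" idea could look roughly like the sketch below for the plain-text-to-HTML direction. This is only an illustration under assumptions: the class name `StreamingTextToHtml`, the rule that a blank line ends a paragraph, and the use of `<br/>` for line breaks are all made up here, not taken from any existing library. The only state kept between lines is whether a `<p>` is currently open, so memory use does not grow with file size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: converts plain text to simple HTML one line at a time.
class StreamingTextToHtml {

    // The only state carried from one line to the next.
    private boolean inParagraph = false;

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String line) {
        StringBuilder sb = new StringBuilder(line.length() + 16);
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Reads lines from 'in' and writes HTML to 'out' without buffering the whole input. */
    void convert(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                // Assumed rule: a blank line closes the current paragraph.
                if (inParagraph) {
                    out.write("</p>\n");
                    inParagraph = false;
                }
            } else {
                if (!inParagraph) {
                    out.write("<p>");
                    inParagraph = true;
                } else {
                    out.write("<br/>\n");
                }
                out.write(escape(line));
            }
        }
        // Close whatever is still open at end of input.
        if (inParagraph) {
            out.write("</p>\n");
            inParagraph = false;
        }
        out.flush();
    }

    /** Convenience wrapper for small strings, mainly for trying the converter out. */
    static String convertString(String text) {
        StringWriter out = new StringWriter();
        try {
            new StreamingTextToHtml().convert(new StringReader(text), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(convertString("a < b\nsecond line\n\nnext & last"));
    }
}
```

In real use, `convert` would be handed a `FileReader` (or a reader over a database stream) and a `FileWriter` instead of the string wrappers. A richer converter would need more state than one boolean (lists, headings and so on), but the shape stays the same.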

Guest Anonymous Poster
I don't see what you mean by

Quote:

that allows you to stream text in from a plain text/html file (or database) and clean it/convert it to HTML


If the file is already HTML, then why would you want to clean it (what criteria would you use?), and especially, what do you mean by "convert it to HTML"?

Jc

A concrete example could be useful

If the source data is XHTML compliant, then you can just use a forward-scanning XML parser to parse it. Those parsers read the tags and give you events when you reach various points in the source stream; you feed them chunks of text and they keep the parse state internally.
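
As a rough sketch of that approach in Java, the standard StAX API (`javax.xml.stream`) is one forward-only parser that ships with the JDK. Note that StAX is a pull parser: it reads from a `Reader` itself and you ask it for the next event, rather than you pushing chunks at it (SAX is the callback-style alternative). The class name `StreamingXhtmlToText` and the choice of which tags produce a newline are assumptions made for this example. It turns well-formed XHTML into plain text without building a DOM tree.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: extracts plain text from well-formed XHTML, event by event.
class StreamingXhtmlToText {

    static void toPlainText(Reader in, Writer out) throws XMLStreamException, IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities. With DTD support off,
        // named entities such as &nbsp; are not defined, so this sketch
        // assumes the input does not rely on them.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        XMLStreamReader xml = factory.createXMLStreamReader(in);
        try {
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                    // Text may arrive in several pieces; just pass each one through.
                    out.write(xml.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = xml.getLocalName();
                    // Assumed rule: these elements end a line in the text output.
                    if (name.equals("p") || name.equals("br") || name.equals("div")) {
                        out.write('\n');
                    }
                }
            }
        } finally {
            xml.close();
        }
        out.flush();
    }

    /** Convenience wrapper for small strings, mainly for trying the method out. */
    static String convert(String xhtml) {
        StringWriter out = new StringWriter();
        try {
            toPlainText(new StringReader(xhtml), out);
        } catch (XMLStreamException | IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert("<html><body><p>Hello <b>world</b></p><p>Bye</p></body></html>"));
    }
}
```

This only works if the input really is well-formed XML; a strict parser like this will throw on tag soup, so it does not replace a cleaning step for invalid HTML.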

Quote:
Original post by Anonymous Poster
I don't see what you mean by

Quote:

that allows you to stream text in from a plain text/html file (or database) and clean it/convert it to HTML


If the file is already HTML, then why would you want to clean it (what criteria would you use?), and especially, what do you mean by "convert it to HTML"?

Jc

A concrete example could be useful


OK. Maybe I wasn't too clear. Essentially there are two different operations being performed here:

1) Cleaning HTML. User-supplied/document HTML may not be valid, or it may be in an older version of HTML (i.e. anything before XHTML), so it needs to be put through an HTML cleaner to convert it to decent XHTML. We have an HTML cleaner at the moment, and it requires all the HTML to be loaded before it can start "cleaning" it.

2) Converting HTML to plain text and converting plain text to HTML. Again, we have code that does this, but it requires all the HTML to be in memory before it can convert it to plain text, or all the plain text before it can convert it to HTML.

In both cases the whole HTML document needs to be loaded into memory before either operation can be performed. I just wondered if anyone knew of anything that could perform one, or both, operations without having the whole DOM tree loaded.

Jon
