Save web pages as text file

4 comments, last by furiousp 18 years, 8 months ago
Hi, when I'm at a website I can save the current web page as a text file with my browser. I want to write a program that does this automatically. Does anyone have any ideas on how I could go about it? (I haven't written any programs that interact with web pages before.) I generally use Java, but I have compilers for C++, C#, J#, and VB; would any of those be better for this? Thanks.
Use any language you like.
And how to do it? Just open the .html file, read it word by word and HTML tag by HTML tag, and think of an appropriate algorithm. Look up the HTML standards to find out how HTML tags work.

You could also just strip the tags, but you have to watch out for special ones such as scripts, HTML comments, image tags and the like; these should not be shown.
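
To make the tag-stripping idea concrete, here is a rough sketch in Java (the language the original poster said they use). The class and method names are made up for illustration, and regular expressions are only an approximation of real HTML parsing: malformed markup, or a ">" inside an attribute value, will trip it up. It drops script and style blocks and comments entirely, removes every remaining tag (image tags included), and decodes just a handful of common entities.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a regex-based approximation, not a real HTML parser.
final class TagStripper {
    // Whole <script>...</script> and <style>...</style> elements, contents included.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // HTML comments, possibly spanning several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag, e.g. <p>, </b>, <img src="...">.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities only; "&amp;" goes last so that
        // "&amp;lt;" ends up as the literal text "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse the runs of whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Calling TagStripper.strip on a page's source gives one long line of visible text. If you need line breaks preserved, replace block-level tags such as <p> and <br> with newlines before the generic tag removal.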

FYI, you have quite a long way ahead of you in this task.
What's the purpose? Is it for fun or learning? Otherwise I'd guess you can find tons of applications that do this already.


I assume you can use the automation APIs for Internet Explorer, see IDM_SAVEAS.

http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/constants/saveas.asp

That way I assume you could save in .mht format directly, meaning frames, images, sound etc. get embedded in the same file.

I would personally do this from C#.


If you do everything from scratch you could use HttpWebRequest in .NET, but you need to do lots of work yourself, such as parsing HTML, loading frames, images etc. You'd also need to define how to save all this (a single file? several files?).
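
For comparison, the from-scratch route looks roughly like this in Java, using java.net.http.HttpClient (Java 11 or later) in place of .NET's HttpWebRequest. This is only a sketch: the class name, the User-Agent string and the 30-second timeout are arbitrary choices for illustration. It fetches the raw HTML of a single page and writes it to one file. It does not load frames or images, which is exactly the extra work mentioned above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper: downloads the HTML source of one page, nothing more.
final class PageFetcher {
    // Builds a plain GET request; the header value and timeout are illustrative.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", "page-saver-example")
                .GET()
                .build();
    }

    // Returns the response body as a string, following ordinary redirects.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    // Saves the page source to a single file, encoded as UTF-8.
    static void save(String url, Path target) throws IOException, InterruptedException {
        Files.writeString(target, fetch(url), StandardCharsets.UTF_8);
    }
}
```

Usage would be something like PageFetcher.save("http://example.com/", Path.of("page.html")), where the URL and file name are placeholders.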
Thanks for the replies. I should have mentioned that all I want to do is take information from a table on a web page. If I save the page as a text file in a browser, it is stored as a tab-delimited text file, which is easy to process.
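
If the page source is already in hand, the browser's "save as text" result for a table can be approximated with something like the Java sketch below. It assumes a simple, non-nested table whose rows and cells have explicit closing tags; the class and method names are invented for illustration. Each <tr> becomes one line and each <td> or <th> becomes one tab-separated field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flattens simple (non-nested) HTML tables to tab-delimited lines.
final class TableToTsv {
    private static final Pattern ROW = Pattern.compile("(?is)<tr\\b[^>]*>(.*?)</tr\\s*>");
    private static final Pattern CELL = Pattern.compile("(?is)<t[dh]\\b[^>]*>(.*?)</t[dh]\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    static List<String> toTsvLines(String html) {
        List<String> lines = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop markup inside the cell (links, bold, ...) and tidy whitespace.
                String text = TAG.matcher(cell.group(1)).replaceAll("");
                cells.add(text.replaceAll("\\s+", " ").trim());
            }
            if (!cells.isEmpty()) {
                lines.add(String.join("\t", cells));
            }
        }
        return lines;
    }
}
```

The returned lines can be written out with Files.write to get a file in the same shape as the one the browser produces.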

To be more exact, what I'm doing is taking information from the Google public service search for each day in a particular month to get statistics for that month (which words or terms were searched for the most).
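
Once each day's table is available as tab-delimited lines, the monthly statistics are a matter of summing counts per term. The Java sketch below assumes each line has the form term, tab, count. That layout is a guess, since the thread does not show the actual columns, and the class and method names are made up. Lines whose second field is not a number (a header row, for instance) are skipped.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: accumulates per-term counts across daily tab-delimited files.
final class MonthlyTotals {
    // Adds one day's lines ("term<TAB>count", an assumed layout) to the running totals.
    static Map<String, Long> addDay(Map<String, Long> totals, List<String> tsvLines) {
        for (String line : tsvLines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // not a term/count pair
            }
            try {
                long count = Long.parseLong(fields[1].trim());
                // Terms are compared case-insensitively.
                totals.merge(fields[0].trim().toLowerCase(Locale.ROOT), count, Long::sum);
            } catch (NumberFormatException e) {
                // Header row or other non-numeric count: skip it.
            }
        }
        return totals;
    }

    // Returns the n terms with the highest totals, largest first.
    static List<Map.Entry<String, Long>> top(Map<String, Long> totals, int n) {
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Start with an empty HashMap, call addDay once per daily file, then call top to list the most searched terms for the month.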
Have you considered using the Google API? It gives you direct access to the query data so you don't have to parse out the information yourself.
Cool, I didn't know about that. I'll give it a try, thanks.

This topic is closed to new replies.
