HTML File Utility

I'm trying to write a small Win32 console-based program that:

- Loads an HTML file
- Extracts all the e-mail addresses in that HTML file
- Outputs those e-mail addresses into a separate file

Any help would be appreciated, thanks.

Faet
http://www.garrith.com/offworld

-= Off World Technologies, Inc. =-

Edited by - faet83 on February 24, 2002 6:54:26 PM

Ok, this is what I got so far.

--------------------------
#include <iostream.h>
#include <fstream.h>
#include <stdlib.h>

int main()
{
    char buffer[256];
    char email_buf[256];

    ifstream sourcefile( "source.html" );
    ofstream destfile( "dest.txt" );

    if( !sourcefile.is_open() )
    {
        cout << "Error Opening File"; exit(1);
    }

    while( !sourcefile.eof() )
    {
        sourcefile.getline( buffer, 256 );
        // Here is where i'm lost
        // How do I analyze my buffer and determine if there is
        // a email address on that line

        destfile << email_buf << ", ";
    }
    destfile.close();

    return 0;
}


-----------------------------

Any help would be appreciated

-= Off World Technologies, Inc. =-

Edited by - faet83 on February 24, 2002 6:47:43 PM

Use std::string instead and load each line into it with getline. Then use find() to check whether the line contains "mailto:"; if it doesn't, move on to the next line. That may work.

quote:
Original post by faet83
// Here is where i'm lost
// How do I analyze my buffer and determine if there is
// a email address on that line


Check the line buffer using a regular expression. There are quite a few packages out there; PCRE comes immediately to mind. Also check the Boost libraries.
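
As a rough illustration of the regular-expression approach, here is a sketch that uses std::regex from the C++11 standard library rather than PCRE or Boost, simply because it needs no extra package. The helper name find_emails and the pattern are assumptions made for this example; the pattern is a deliberately loose approximation of an e-mail address, not a full validator.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every substring of `line` that looks like
// an e-mail address according to a simple, permissive pattern.
std::vector<std::string> find_emails(const std::string& line)
{
    // local part, '@', domain labels, a dot, then a top-level domain
    // of at least two letters.
    static const std::regex pattern(
        R"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})");

    std::vector<std::string> found;
    // sregex_iterator walks over all non-overlapping matches in the line.
    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

Unlike a search for "mailto:", this also picks up addresses that appear as plain text in the page, and it handles several addresses on one line.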

I'm more of a C coder, so I can't help you much with std::string, but I can help you with regular expressions. Maybe Rickmeister or someone else can help you with std::string.

How about the find() function? How do I use that? Thanks.

-= Off World Technologies, Inc. =-

I'm just trying to learn.

---------------------------
Live Online Interpretative Services
http://www.garrith.com/offworld
---------------------------


-= Off World Technologies, Inc. =-

Here's what I've got so far:

Where am I going wrong...

#include <iostream>
#include <fstream>
#include <string>
#include <stdlib.h>

int main()
{
    using namespace std;

    string buffer;
    string out_buf;

    ifstream sourcefile( "source.html" );
    ofstream destfile( "dest.txt" );

    if( !sourcefile.is_open() )
    {
        cout << "Error Opening File"; exit(1);
    }

    while( !sourcefile.eof() )
    {
        getline( sourcefile, buffer );
        // Here is where i'm lost
        // How do I analyze my buffer and determine if there is
        // a email address on that line
        buffer.find( buffer, out_buf, "mailto:" );

    }
    destfile.close();

    return 0;
}

Faet

-= Off World Technologies, Inc. =-

You'll want to extract the string following "mailto:". Use find to locate the start of that string (i.e. it begins immediately after "mailto:"); it most likely ends with a '"' double quote.
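
The extraction described here might look something like the sketch below. The helper name extract_mailto is hypothetical, and it assumes the address is terminated by a double quote as suggested; if no closing quote is found, it falls back to taking the rest of the line.

```cpp
#include <string>

// Hypothetical helper: returns the text between the first "mailto:" and
// the next double quote, or an empty string if "mailto:" is not present.
std::string extract_mailto(const std::string& line)
{
    const std::string marker = "mailto:";

    // find() gives the offset of the marker, or npos if it is absent.
    const std::string::size_type start = line.find(marker);
    if (start == std::string::npos) {
        return "";
    }

    // The address begins immediately after the marker.
    const std::string::size_type begin = start + marker.size();

    // Search for the closing quote, starting from the address itself.
    const std::string::size_type end = line.find('"', begin);
    if (end == std::string::npos) {
        return line.substr(begin);          // no quote: take the rest
    }
    return line.substr(begin, end - begin); // substr takes (offset, length)
}
```

Note that find() takes the text to search for (and optionally a starting offset) and returns a position; it does not copy anything into another string. The copy is done by substr(). This sketch handles only the first "mailto:" on a line.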

Is my usage of string::find correct?

Thanks

-= Off World Technologies, Inc. =-

Again, I'm more of a C coder, so I can't say specifically, but it looks to me that "buffer.find( buffer, out_buf, "mailto:" );" will search for "mailto:" and put that into the buffer rather than the address following it. You'll have to read the docs and go from there.

Using C, I would check the line for "mailto:" using the strstr function. If the returned value isn't null, then I know the line contains the string and the return value is a pointer to the beginning of "mailto:". I would then advance the pointer by 7 to skip over "mailto:" and then store/write the following characters from the line until I encountered a closing '"'. And then loop until EOF.
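
The strstr approach could be sketched as follows. It is written as C++ so it fits alongside the rest of the thread's code, but the scanning itself is plain C-style pointer work; the function name extract_with_strstr is made up for this example.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: C-style scan of one line for "mailto:".
// Returns the characters after "mailto:" up to the closing double quote
// (or the end of the line), or an empty string if there is no "mailto:".
std::string extract_with_strstr(const char* line)
{
    // strstr returns a pointer to the first occurrence, or a null pointer.
    const char* p = std::strstr(line, "mailto:");
    if (p == nullptr) {
        return std::string();
    }

    p += 7; // skip over the 7 characters of "mailto:"

    std::string address;
    // Copy characters until the closing quote or the terminating '\0'.
    while (*p != '\0' && *p != '"') {
        address += *p;
        ++p;
    }
    return address;
}
```

With the original char buffer[256] version of the program, the buffer filled by sourcefile.getline could be passed straight to this function once per line.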

std::string::find_first_of() will find the first occurrence of a given character, or of any character from a given set, such as a colon in your case. It returns the integer offset into the string, so use that plus one and std::string::substr() to extract the desired address. See the STL link in either my or Kylotan's signatures (it's the same) for documentation on the Standard Template Library.
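
A small sketch of the find_first_of() plus substr() idea. The helper name after_first_colon and the choice of terminating characters are assumptions for illustration. One caveat: this looks for the first colon of any kind, so a line that contains something like "http:" before the "mailto:" would give the wrong result.

```cpp
#include <string>

// Hypothetical helper: returns what follows the first ':' in `text`,
// stopping before the next double quote, single quote, '>' or space.
// Returns an empty string if `text` contains no colon.
std::string after_first_colon(const std::string& text)
{
    const std::string::size_type colon = text.find_first_of(':');
    if (colon == std::string::npos) {
        return "";
    }

    // "That plus one": the address starts right after the colon.
    const std::string::size_type begin = colon + 1;

    // find_first_of with a set stops at whichever terminator comes first.
    const std::string::size_type end = text.find_first_of("\"'> ", begin);
    if (end == std::string::npos) {
        return text.substr(begin);
    }
    return text.substr(begin, end - begin);
}
```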

[ GDNet Start Here | GDNet Search Tool | GDNet FAQ | MS RTFM [MSDN] | SGI STL Docs | Google! ]
Thanks to Kylotan for the idea!
