download HTML via Winsock

Started by
6 comments, last by vlzvl 15 years, 9 months ago
Hello, I'm the author of the Jolt3D! engine (jolt-3d.sf.net). I'm trying to write an alternative to the InternetReadFile() function of WININET.DLL for reading HTML pages from C programs. My goal is to drop that DLL and use only the Winsock DLL, but I've hit a problem: with the following Winsock code I'm not getting correct results, or perhaps I don't know the "GET" syntax well enough. The (simple) code is:

	// address
	IN_ADDR		iaHost;
	LPHOSTENT	lpHostEntry;
	iaHost.s_addr = inet_addr(Servername);
	if (iaHost.s_addr == INADDR_NONE) // Wasn't an IP address string, assume it is a name		
		lpHostEntry = gethostbyname(Servername);
	else // It was a valid IP address string
		lpHostEntry = gethostbyaddr((const char *)&iaHost, sizeof(struct in_addr), AF_INET);
	if (lpHostEntry == NULL)
	{
		J_event("error A");
		return;
	}
	// socket
	SOCKET	Socket;	
	Socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if (Socket == INVALID_SOCKET)
	{
		J_event("error B");
		return;
	}
	// port
	LPSERVENT lpServEnt;
	SOCKADDR_IN saServer;
	lpServEnt = getservbyname("http", "tcp");
	if (lpServEnt == NULL)
		saServer.sin_port = htons(80);
	else
		saServer.sin_port = lpServEnt->s_port;
	// fill rest
	saServer.sin_family = AF_INET;
	saServer.sin_addr = *((LPIN_ADDR)*lpHostEntry->h_addr_list);
	// connect
	int nRet = connect(Socket, (LPSOCKADDR)&saServer, sizeof(SOCKADDR_IN));
	if (nRet == SOCKET_ERROR)
	{
		J_event("error C");
		closesocket(Socket); // release the socket on failure, as in the later error paths
		return;
	}
	// build the HTTP request
	char szBuffer[1024];
	sprintf(szBuffer, "GET %s\n", Filename);
	nRet = send(Socket, szBuffer, strlen(szBuffer), 0);
	if (nRet == SOCKET_ERROR)
	{
		J_event("error D");
		closesocket(Socket);
		return;
	}
	// receive the file contents and print to local 'index.html'
	FILE *f=fopen("index.html","wb");
	while(1)
	{
		nRet = recv(Socket, szBuffer, sizeof(szBuffer), 0);
		if (nRet == SOCKET_ERROR)
		{
			J_event("error E");
			break;
		}
		else if (nRet == 0) // the server closed the connection
			break;
		fwrite(szBuffer, nRet, 1, f); // write to file
	}
	closesocket(Socket);	
	fclose(f);
...where 'Servername' is either an IP address or a domain name, e.g. www.google.com, and 'Filename' is a specific HTML file with its directory, e.g. /files/index.html. As you can see, I'm using the simplest GET syntax, "GET %s\r\n", with no HTTP/1.1, no Host:, or anything else. This syntax works for www.google.com/index.html, but not with:

http://www.nba.com/games/20071030/scoreboard.html // request timed out
http://jolt-3d.sf.net/index.htm // error 400, bad URI ???

My questions are:

1) Do these errors have to do with bad parameters after GET?
2) Why does the system halt when I use "GET %s HTTP/1.1\r\n" (or any other version)?
3) My internet connection is not ADSL but GPRS / 3G. Could Winsock be confused by this somehow?

P.S. I found something odd when comparing Explorer with my code. I tried to download a file that doesn't exist from my site (on SourceForge) both ways. Explorer returned the familiar SourceForge error page, showing the error code, the server and the URL, all filled in normally. With my Winsock code the server name was NOT filled in, and the URL was formatted with %1/%%3 etc.; more specifically it was /home/groups/%1%%2/htdocs/ff.htm instead of the correct /home/groups/j/jo/jolt-3d/htdocs/ff.htm. What is going on?

If anyone has some time, please check this code. I think it will help anyone who wants to download an HTML page "freely" without getting their hands on commercial products. Thanks.
http://jolt-3d.sf.net Jolt3D! 3D Game Engine, 1999-2008
You can try the HTTP-GET library, that does exactly that.
enum Bool { True, False, FileNotFound };
Nice library. Although it doesn't handle redirection and some other things, it is far better than my code :)
Thanks.
Yes, it's somewhat minimal :-)

However, it should not be too hard to put redirect parsing, cookies, and whatever else you need on top of what's there. The networking and request/response part works fairly well.
...I suppose that is a library of yours (I saw the ~hplus directory :) ).
Really nice work! Just one more question: is there a way to skip the header-like text before the actual HTML page?
I'm using a number of HTML pages from my C programs in real time and parsing them byte by byte, so I know (and need) the same byte offsets for several of these pages; but with the header, things (and offsets :) ) change...
Do I have to work out where the <HTML> starts, or is there an easier way?
My thanks.
HTTP headers are fixed. You can send a minimal subset, but it needs to conform to the specification.
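As a rough illustration of what a minimal conforming request could look like, here is a sketch in C. The helper name build_get_request is hypothetical (it is not from the thread or from any library), and snprintf is assumed to be available (older MSVC runtimes only offer _snprintf). An HTTP/1.1 request needs a Host header and must end with an empty line; a server that has only seen "GET %s HTTP/1.1\r\n" keeps waiting for the remaining header lines, which is one plausible reason for recv() appearing to hang. A missing Host header is also a likely cause of trouble on shared hosts that serve many sites from one address.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (an assumption, not part of the original code):
   formats a minimal HTTP/1.1 GET request into out.
   Returns the request length, or -1 if it does not fit in out_size. */
static int build_get_request(char *out, size_t out_size,
                             const char *host, const char *path)
{
    int n = snprintf(out, out_size,
                     "GET %s HTTP/1.1\r\n"   /* request line               */
                     "Host: %s\r\n"          /* required in HTTP/1.1       */
                     "Connection: close\r\n" /* ask server to close after  */
                     "\r\n",                 /* empty line ends the header */
                     path, host);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}
```

The returned length can be passed straight to send(). With "Connection: close" the server closes the socket after the response, so a receive loop that stops when recv() returns 0 still terminates.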

HTTP supports partial GET requests. They need to be supported by the server. Some servers do not support them, and some deliberately disable them.
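A partial GET is expressed with the Range request header. The sketch below is only an illustration: build_range_request and parse_status_code are hypothetical names, and snprintf is assumed to be available. A server that honors the range answers with status 206; a server that ignores it answers 200 and sends the whole page, so the status code should be checked before relying on byte offsets.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: GET request for bytes first..last (inclusive).
   Returns the request length, or -1 if it does not fit in out_size. */
static int build_range_request(char *out, size_t out_size,
                               const char *host, const char *path,
                               unsigned long first, unsigned long last)
{
    int n = snprintf(out, out_size,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Range: bytes=%lu-%lu\r\n"
                     "Connection: close\r\n"
                     "\r\n",
                     path, host, first, last);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}

/* Hypothetical helper: extracts the status code from a status line such as
   "HTTP/1.1 206 Partial Content". Returns -1 if the line cannot be parsed. */
static int parse_status_code(const char *response)
{
    int major, minor, code;
    if (sscanf(response, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    return code;
}
```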
The headers end after the character sequence "\r\n\r\n" (CR, LF, CR, LF). That character sequence cannot be part of the header. Thus, you can just look for that sequence, and when you find it, you know that the data starts with the very next byte. That may or may not be "<HTML>", by the way: it could be "<html>", "<?xml>", "<!DOCTYPE>", "<!-->", or some extra blanks inserted by whoever generated the page.
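The search for the end of the headers can be sketched like this in C. The function name find_body_offset is made up for the example, and it assumes the response received so far has been accumulated in one buffer, since the four-byte sequence can be split across two recv() calls.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first body byte in buf,
   or -1 if CR LF CR LF does not occur within the first len bytes. */
static long find_body_offset(const char *buf, size_t len)
{
    size_t i;
    for (i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return (long)(i + 4); /* body starts right after the separator */
    }
    return -1; /* headers not complete yet, keep receiving */
}
```

Once the offset is known, everything from buf + offset onward is page data, so the byte offsets inside the page no longer depend on the size of the header.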
That's the info I wanted :) Thanks to both of you.
