Combining Files 101 ... Help needed :(

Started by
8 comments, last by flangazor 19 years, 4 months ago
So I decided to write an encryption program... no problem! Then I decided to write a program that would take two files, encrypt them, and place them into one file... no problem. HOWEVER... and it begins.

I learned that when I combined two files, "1.txt" (16 bytes) and "2.txt" (20 bytes), I got a combined file "new.txt" of 39 bytes. Now, if I were writing a program to break this single file "new.txt" back into its original "1.txt" and "2.txt", I would have it read from the beginning of "new.txt" until 16 bytes have passed. To extract the second file I would start with an offset of 16 bytes and read until 16 + 20 = 36 bytes have passed. But as you can see, the file size is 39, which is 3 bytes greater than the originals. Does this have anything to do with the file encoding? Unicode? ANSI?

My question: how does one go about combining files with characters ranging from 0-255 using the [FILE *, fopen("rb" & "awb"), putc, getc] library? Am I going to have to use the Win32 API to deal with file combining, or can this be done with the library mentioned above? It's amazing that after being around computers so long I wasn't aware of this until now.

Thanks in advance,
toonkides
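The symptom described above is exactly what text-mode newline translation produces on Windows. A small probe (the file name is my own, not from the thread) makes the difference between "w" and "wb" measurable: it writes a single '\n' in the given fopen mode and reports the resulting on-disk size.

```cpp
// Probe: how many bytes does one '\n' occupy on disk for a given fopen mode?
// In text mode ("w") on Windows, '\n' is translated to "\r\n" (2 bytes);
// in binary mode ("wb") it is always exactly 1 byte.
#include <cstdio>

long bytesWrittenForNewline(const char* mode)
{
    const char* path = "newline_probe.bin";  // hypothetical scratch file
    FILE* out = std::fopen(path, mode);
    if (!out) return -1;
    std::fputc('\n', out);
    std::fclose(out);

    // Reopen in binary mode to measure the true on-disk size.
    FILE* in = std::fopen(path, "rb");
    if (!in) return -1;
    std::fseek(in, 0, SEEK_END);
    long size = std::ftell(in);
    std::fclose(in);
    std::remove(path);
    return size;
}
```

Three translated newlines would account precisely for the 3 extra bytes in "new.txt".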
You can safely write characters in the [0..255] range IF you open the file as "wb" (write binary) and if you use fwrite to write to your file.
Writing to the file is no problem. I was actually using putc instead of fwrite; perhaps that's the reason for some problems.

After disabling the encryption (so the program just combines files now), the weird thing is that when I look at the "new.txt" file I notice that it contains the contents of 1.txt and 2.txt, nothing more and nothing less, but the size is 3 bytes greater!

I'm just curious as to where those extra bytes are coming from.
tia
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>

int combinec(const char* infile1, const char* infile2, const char* outfile)
{
    using namespace std;
    FILE* file1 = fopen(infile1, "rb");
    if (!file1)
    {
        return 0;
    }
    FILE* file2 = fopen(infile2, "rb");
    if (!file2)
    {
        fclose(file1);
        return 0;
    }
    fseek(file1, 0, SEEK_END);
    unsigned int size1 = ftell(file1);
    fseek(file2, 0, SEEK_END);
    unsigned int size2 = ftell(file2);
    if (size1 + size2 < min(size1, size2))
    {
        fclose(file1);
        fclose(file2);
        return 0;
    }
    char* data = (char*)malloc(size1 + size2);
    fseek(file1, 0, SEEK_SET);
    fseek(file2, 0, SEEK_SET);
    if (fread(data, 1, size1, file1) != size1)
    {
        free(data);
        fclose(file1);
        fclose(file2);
        return 0;
    }
    fclose(file1);
    if (fread(data + size1, 1, size2, file2) != size2)
    {
        free(data);
        fclose(file2);
        return 0;
    }
    fclose(file2);
    FILE* file3 = fopen(outfile, "wb");
    if (!file3)
    {
        free(data);
        return 0;
    }
    fwrite(data, 1, size1 + size2, file3);
    free(data);
    fclose(file3);
    return 1;
}

void combinecpp(std::string infile1, std::string infile2, std::string outfile)
{
    std::ifstream reader1(infile1.c_str(), std::ios::binary);
    std::ifstream reader2(infile2.c_str(), std::ios::binary);
    std::ofstream writer(outfile.c_str(), std::ios::binary);
    if (!reader1 || !reader2 || !writer)
    {
        throw std::runtime_error("file not found");
    }
    std::copy(std::istreambuf_iterator<char>(reader1), std::istreambuf_iterator<char>(), std::ostreambuf_iterator<char>(writer));
    std::copy(std::istreambuf_iterator<char>(reader2), std::istreambuf_iterator<char>(), std::ostreambuf_iterator<char>(writer));
}

int main()
{
    if (!combinec("text1.txt", "text2.txt", "textc.txt"))
    {
        std::cout << "failed c\n";
    }
    try
    {
        combinecpp("text1.txt", "text2.txt", "textcpp.txt");
    }
    catch (...)
    {
        std::cout << "failed cpp\n";
    }
}


Enigma
Quote:Original post by Toonkides
I'm just curious as to where those extra bytes are coming from.
A newline may be 1 or 2 characters.
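If the extra bytes are 1-vs-2-character newlines, they will show up as carriage returns (0x0D). A sketch that counts them, opening the file in binary mode so every raw byte is visible (the function name is my own):

```cpp
// Count raw carriage-return bytes (0x0D) in a file. Three "\r\n" pairs in
// place of three "\n" bytes would explain exactly 3 bytes of growth.
#include <cstdio>

long countCarriageReturns(const char* path)
{
    FILE* f = std::fopen(path, "rb");  // binary mode: no translation
    if (!f) return -1;
    long count = 0;
    int c;
    while ((c = std::fgetc(f)) != EOF)
    {
        if (c == '\r') ++count;
    }
    std::fclose(f);
    return count;
}
```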
Quote:Original post by Toonkides
Writing to the file is no problem. I was actually using putc instead of fwrite; perhaps that's the reason for some problems.

After disabling the encryption (so the program just combines files now), the weird thing is that when I look at the "new.txt" file I notice that it contains the contents of 1.txt and 2.txt, nothing more and nothing less, but the size is 3 bytes greater!

I'm just curious as to where those extra bytes are coming from.
tia


Use a hex editor so you can see all the bytes. Observe the values and positions of bytes in the result file that are not in the originals.
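If no hex editor is handy, a few lines of C stdio can serve as one. This sketch (function name is my own) prints each byte alongside its offset, so stray 0x0D/0x0A bytes stand out, and returns the number of bytes dumped:

```cpp
// Minimal hex dump: one "offset  value" line per byte, read in binary mode.
#include <cstdio>

long hexDump(const char* path)
{
    FILE* f = std::fopen(path, "rb");
    if (!f) return -1;
    long offset = 0;
    int c;
    while ((c = std::fgetc(f)) != EOF)
    {
        std::printf("%06lx  %02x\n", offset, c);
        ++offset;
    }
    std::fclose(f);
    return offset;  // total bytes seen
}
```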
Quote: After disabling the encryption (so the program just combines files now), the weird thing is that when I look at the "new.txt" file I notice that it contains the contents of 1.txt and 2.txt, nothing more or less added, but the size is 3 bytes greater!


- If your files contain text and you mix text and binary modes, the file contents will "appear" the same, but the size can differ. So what might have happened is that you saved 36 bytes in binary mode and then converted to ASCII text, ending up with 39. That's one idea, but I doubt that's what is happening in your case.

- Do you think you can post some of your code, so we can see where the bytes are coming from? You may be accidentally appending a newline or something without knowing it.

- You could also try to read/write in a different way: try using read/write of the fstream class. That is what I generally use, and it's awesome. It's a little complicated to use at first, but once you get the hang of it, it is super fast and efficient.

I hope this helps! Good luck.
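The fstream read/write suggestion above might look like this sketch (the function name and buffer size are my own choices, not from the thread):

```cpp
// Append one file's raw bytes to an already-open binary output stream,
// copying in fixed-size chunks via istream::read / ostream::write.
#include <fstream>

bool appendFileTo(std::ofstream& out, const char* path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    char buffer[4096];
    // read() sets failbit on a short final read, but gcount() still
    // reports how many bytes landed in the buffer, so we flush those too.
    while (in.read(buffer, sizeof buffer) || in.gcount() > 0)
    {
        out.write(buffer, in.gcount());
    }
    return static_cast<bool>(out);
}
```

Calling this twice on one ofstream combines two files with no translation and no size surprises.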
I had a PM from the OP regarding the code I posted earlier:
Quote:Original post by Toonkides
One thing that I am curious about is this :
Let's say that I was working with huge files, I'm talking about gigabytes.

Would it be possible to not use data? I mean, read the file and immediately start appending it to the outfile as you're reading it.

The reason I ask is because of the following example :

data = (char*)malloc(1000000000) ;

First off I'll point out that there are actually a number of bugs in my original code: the malloc call was not checked for failure, and most of the variables are actually declared C++-style, not C-style (you can tell I don't program in C much)!
Now to answer the question. I don't think C has any library functions for direct streaming of files. However, it should be simple to adjust my code to iteratively read and write chunks to avoid a huge malloc call:
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <stdexcept>
#include <string>

int combinec(const char* infile1, const char* infile2, const char* outfile)
{
    using namespace std;
    const long maxChunkSize = 1 << 16;
    FILE* file1,* file2,* file3;
    long size1, size2, dataRead;
    unsigned int currentChunkSize;
    char* data;
    file1 = fopen(infile1, "rb");
    if (!file1)
    {
        return 0;
    }
    if (fseek(file1, 0, SEEK_END) == -1)
    {
        fclose(file1);
        return 0;
    }
    size1 = ftell(file1);
    if (size1 == -1)
    {
        fclose(file1);
        return 0;
    }
    if (fseek(file1, 0, SEEK_SET) == -1)
    {
        fclose(file1);
        return 0;
    }
    file2 = fopen(infile2, "rb");
    if (!file2)
    {
        fclose(file1);
        return 0;
    }
    if (fseek(file2, 0, SEEK_END) == -1)
    {
        fclose(file1);
        fclose(file2);
        return 0;
    }
    size2 = ftell(file2);
    if (size2 == -1)
    {
        fclose(file1);
        fclose(file2);
        return 0;
    }
    if (fseek(file2, 0, SEEK_SET) == -1)
    {
        fclose(file1);
        fclose(file2);
        return 0;
    }
    file3 = fopen(outfile, "wb");
    if (!file3)
    {
        fclose(file1);
        fclose(file2);
        return 0;
    }
    data = (char*)malloc(min(max(size1, size2), maxChunkSize));
    if (!data)
    {
        fclose(file1);
        fclose(file2);
        fclose(file3);
        return 0;
    }
    dataRead = 0;
    while (dataRead < size1)
    {
        currentChunkSize = min(size1 - dataRead, maxChunkSize);
        if (fread(data, 1, currentChunkSize, file1) != currentChunkSize)
        {
            free(data);
            fclose(file1);
            fclose(file2);
            fclose(file3);
            return 0;
        }
        if (fwrite(data, 1, currentChunkSize, file3) != currentChunkSize)
        {
            free(data);
            fclose(file1);
            fclose(file2);
            fclose(file3);
            return 0;
        }
        dataRead += currentChunkSize;
    }
    if (fclose(file1) == EOF)
    {
        free(data);
        fclose(file2);
        fclose(file3);
        return 0;
    }
    dataRead = 0;
    while (dataRead < size2)
    {
        currentChunkSize = min(size2 - dataRead, maxChunkSize);
        if (fread(data, 1, currentChunkSize, file2) != currentChunkSize)
        {
            free(data);
            fclose(file2);
            fclose(file3);
            return 0;
        }
        if (fwrite(data, 1, currentChunkSize, file3) != currentChunkSize)
        {
            free(data);
            fclose(file2);
            fclose(file3);
            return 0;
        }
        dataRead += currentChunkSize;
    }
    if (fclose(file2) == EOF)
    {
        free(data);
        fclose(file3);
        return 0;
    }
    if (fclose(file3) == EOF)
    {
        return 0;
    }
    free(data);
    return 1;
}

typedef struct
{
    std::FILE* file1;
    std::FILE* file2;
    std::FILE* file3;
    char* data;
} combinecneatStruct;

int combinecneatimpl(combinecneatStruct* data, const char* infile1, const char* infile2, const char* outfile)
{
    using namespace std;
    const long maxChunkSize = 1 << 16;
    long size1, size2, dataRead;
    unsigned int currentChunkSize;
    (*data).file1 = fopen(infile1, "rb");
    if (!(*data).file1)
    {
        return 0;
    }
    if (fseek((*data).file1, 0, SEEK_END) == -1)
    {
        return 0;
    }
    size1 = ftell((*data).file1);
    if (size1 == -1)
    {
        return 0;
    }
    if (fseek((*data).file1, 0, SEEK_SET) == -1)
    {
        return 0;
    }
    (*data).file2 = fopen(infile2, "rb");
    if (!(*data).file2)
    {
        return 0;
    }
    if (fseek((*data).file2, 0, SEEK_END) == -1)
    {
        return 0;
    }
    size2 = ftell((*data).file2);
    if (size2 == -1)
    {
        return 0;
    }
    if (fseek((*data).file2, 0, SEEK_SET) == -1)
    {
        return 0;
    }
    (*data).file3 = fopen(outfile, "wb");
    if (!(*data).file3)
    {
        return 0;
    }
    (*data).data = (char*)malloc(min(max(size1, size2), maxChunkSize));
    if (!(*data).data)
    {
        return 0;
    }
    dataRead = 0;
    while (dataRead < size1)
    {
        currentChunkSize = min(size1 - dataRead, maxChunkSize);
        if (fread((*data).data, 1, currentChunkSize, (*data).file1) != currentChunkSize)
        {
            return 0;
        }
        if (fwrite((*data).data, 1, currentChunkSize, (*data).file3) != currentChunkSize)
        {
            return 0;
        }
        dataRead += currentChunkSize;
    }
    dataRead = 0;
    while (dataRead < size2)
    {
        currentChunkSize = min(size2 - dataRead, maxChunkSize);
        if (fread((*data).data, 1, currentChunkSize, (*data).file2) != currentChunkSize)
        {
            return 0;
        }
        if (fwrite((*data).data, 1, currentChunkSize, (*data).file3) != currentChunkSize)
        {
            return 0;
        }
        dataRead += currentChunkSize;
    }
    return 1;
}

int combinecneat(const char* infile1, const char* infile2, const char* outfile)
{
    combinecneatStruct data;
    int returnVal;
    data.file1 = 0;
    data.file2 = 0;
    data.file3 = 0;
    data.data = 0;
    returnVal = combinecneatimpl(&data, infile1, infile2, outfile);
    if (data.file1 != 0)
    {
        if (fclose(data.file1) == EOF)
        {
            returnVal = 0;
        }
    }
    if (data.file2 != 0)
    {
        if (fclose(data.file2) == EOF)
        {
            returnVal = 0;
        }
    }
    if (data.file3 != 0)
    {
        if (fclose(data.file3) == EOF)
        {
            returnVal = 0;
        }
    }
    if (data.data != 0)
    {
        free(data.data);
    }
    return returnVal;
}

void combinecpp(std::string infile1, std::string infile2, std::string outfile)
{
    std::ifstream reader1(infile1.c_str(), std::ios::binary);
    std::ifstream reader2(infile2.c_str(), std::ios::binary);
    if (!reader1 || !reader2)
    {
        throw std::runtime_error("file not found");
    }
    std::ofstream writer(outfile.c_str(), std::ios::binary);
    if (!writer)
    {
        throw std::runtime_error("failed creating output file");
    }
    std::copy(std::istreambuf_iterator<char>(reader1), std::istreambuf_iterator<char>(), std::ostreambuf_iterator<char>(writer));
    std::copy(std::istreambuf_iterator<char>(reader2), std::istreambuf_iterator<char>(), std::ostreambuf_iterator<char>(writer));
}

int main()
{
    if (!combinec("text1.txt", "text2.txt", "textc.txt"))
    {
        std::cout << "failed c\n";
    }
    if (!combinecneat("text1.txt", "text2.txt", "textcneat.txt"))
    {
        std::cout << "failed cneat\n";
    }
    try
    {
        combinecpp("text1.txt", "text2.txt", "textcpp.txt");
    }
    catch (...)
    {
        std::cout << "failed cpp\n";
    }
}

The code required to do that, with all the necessary error checking, is so disgusting that I also tried to write a "neat" version. However, that's not particularly neat either. I also made a slight adjustment to the C++ version to check for the existence of the input files before creating the ofstream object. This is more consistent with the C code, as it doesn't create an output file when one or both input files are missing.

Personally I think anyone who has the option of using the C++ way and chooses to do it in C is just plain barking mad!

Enigma
Quote: Original post by Drew_Benton
After disabling the encryption (so the program just combines files...

- You could also try to read/write in a different way: try using read/write of the fstream class. That is what I generally use, and it's awesome. It's a little complicated to use at first, but once you get the hang of it, it is super fast and efficient.


:)

Well, I fixed the problem. And I am 100% confident that it had something to do with closing the file with fclose() and then reopening it for append with fopen("awb").

Yes, I have actually used the fstream library before with no problem; I actually find it easier. But if you ask why I use fopen now, it's because I tested both for the program I wanted to make, and I noticed a few seconds' difference when copying a file character by character using fopen vs. fstream.

fstream was much slower, which disappointed me, at least for a program that reads character by character.


Consider the following :
C++ implementation using fstream

char c ;
int startTime = timeGetTime() ;
while(fin >> c)
fout << c ;
cout << startTime << " - " << timeGetTime() << " = " << startTime - timeGetTime() ;

vs.

C implementation using fopen/fclose
int startTime = timeGetTime() ;
int c ;
while((c = getc(fp)) != EOF)
putc(c, fwp) ;
printf("%d - %d = %d", startTime, timeGetTime(), startTime - timeGetTime()) ;

the bottom one is MUCH faster; don't ask why.

Another incentive to learning it was that PHP uses the fopen API :)
Quote:Original post by bleyblue2
You can safely write characters in the [0..255] range IF you open the file as "wb" (write binary) and if you use fwrite to write to your file.
Only in ANSI C (as opposed to ISO C):
Quote:
man fopen(3) says:
The mode string can also include the letter ``b'' either as a third character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with ANSI C3.159-1989 (``ANSI C'') and has no effect; the ``b'' is ignored.


On that matter, perhaps you should read the rest of the options to fopen. i.e. using "awb+" is contradictory and will likely ignore the 'w' (and 'b').
Quote: The argument mode points to a string beginning with one of the following sequences (additional characters may follow these sequences):

r      Open text file for reading. The stream is positioned at the beginning of the file.

r+     Open for reading and writing. The stream is positioned at the beginning of the file.

w      Truncate file to zero length or create text file for writing. The stream is positioned at the beginning of the file.

w+     Open for reading and writing. The file is created if it does not exist, otherwise it is truncated. The stream is positioned at the beginning of the file.

a      Open for writing. The file is created if it does not exist. The stream is positioned at the end of the file.

a+     Open for reading and writing. The file is created if it does not exist. The stream is positioned at the end of the file.
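Going by the table above, the standard append-binary spellings are "ab" and "a+b"; a string like "awb" is not one of the defined sequences and may be rejected or misread by an implementation. A minimal sketch (the helper name is my own):

```cpp
// Open a file for appending raw bytes: created if missing, every write goes
// to the end, and no newline translation is performed.
#include <cstdio>

FILE* openAppendBinary(const char* path)
{
    return std::fopen(path, "ab");
}
```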

This topic is closed to new replies.
