suliman

any fast program for reading large txt files?

9 posts in this topic

Hi

Is there a program (or better yet, complete C++ code) that reads a text file (ASCII data) line by line, takes a custom separator (like ";" or ","), and splits every "word" (a column, actually) into separate strings? I can then handle the strings myself.

 

I have huge text files (60 MB and up, 200,000 rows of somewhat unfiltered measurement data), and MATLAB is way too slow to deal with them.

 

Thanks a lot!

Erik

0


How is the input file structured? And how should it be formatted? We can't read your mind.

0


The structure may differ. Preferably the program will determine the number of rows and columns automatically and arrange the data into a matrix accordingly. When the number of columns varies between rows, the maximum number of columns should be used. It should also skip the first x rows (headers and other stuff).

 

Everything as one huge char array? Even if it is 80 MB? And then go through it from beginning to end, sorting the data? Seems weird to me, but as you say, maybe it is the fastest way for the CPU.

 

Example of a possible text file (but with around a hundred columns and 200,000 rows). Many of these files exist.

 

some text in the beginning
header1; header2; header3;header4
234 ; 23423.322 ; error code ; 234233
11; 123;12;123423
 

So there definitely needs to be some flexibility, which of course is the challenge here. I know some C++, but thought there would be something like this already. Once properly sorted, selected parts need to be saved again into text files, but I think I can manage that.

0


Loading it all at once is the fastest way, indeed. Parsing it yourself is not really hard, and will be as fast as any ready-made implementation. If you are still looking for a ready-made solution, try iniparser, but you will have to adapt your text format to it.
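
As a rough illustration of "loading all at once", here is a minimal sketch in standard C++. The function name read_whole_file is made up for this example and is not from any library; it assumes the file fits comfortably in memory.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: reads the entire file at `path` into `out`.
// Returns false if the file cannot be opened or fully read.
bool read_whole_file(const std::string& path, std::string& out)
{
    // Open at the end so tellg() reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return false;

    const std::streamoff size = in.tellg();
    if (size < 0) return false;

    out.resize(static_cast<std::size_t>(size));
    if (size == 0) return true; // empty file, nothing to read

    in.seekg(0, std::ios::beg);
    in.read(&out[0], static_cast<std::streamsize>(size));
    return in.gcount() == static_cast<std::streamsize>(size);
}
```

Opening in binary mode keeps the byte count equal to the file size on every platform; line endings such as "\r\n" then have to be handled by the parser.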

 

Finally, I don't know the context of your application, but text files are only a good solution if you need to save text, which doesn't seem to be the case here. If you need performance, consider using a binary file. If you want a lot of well-organized data, consider using a database (there are local-file solutions, such as SQLite).

1


Everything as one huge char array? Even if it is 80 MB?

 

A desktop PC typically has several GB of RAM; 80 MB is nothing :)

 

But if you need to conserve RAM (maybe you use it for something else), it could be done in chunks too.

Just read as much as fits in your buffer, parse as far as you can, and repeat until the end of the file.
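
A minimal sketch of that chunked approach, assuming the data is line-oriented. The name for_each_line_chunked and the callback interface are invented for this example; the important detail is that a line cut off at the end of one chunk is carried over into the next.

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: reads `path` in chunks of `chunk_bytes` and calls
// `on_line` once per complete line (without the trailing '\n').
// Returns false if the file could not be opened.
bool for_each_line_chunked(const std::string& path, std::size_t chunk_bytes,
                           const std::function<void(const std::string&)>& on_line)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    std::vector<char> buffer(chunk_bytes);
    std::string carry; // unfinished line left over from the previous chunk

    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;
        carry.append(buffer.data(), static_cast<std::size_t>(got));

        // Hand out every complete line currently in `carry`.
        std::size_t start = 0;
        for (;;) {
            const std::size_t nl = carry.find('\n', start);
            if (nl == std::string::npos) break;
            on_line(carry.substr(start, nl - start));
            start = nl + 1;
        }
        carry.erase(0, start); // keep only the unfinished tail
    }
    if (!carry.empty()) on_line(carry); // last line without a trailing newline
    return true;
}
```

Memory use stays near the chunk size plus the longest line, regardless of how large the file is.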

Edited by Olof Hedman
2

If you are not afraid of delving into somewhat system-dependent functionality, you could give memory mapping a try. Unfortunately, you cannot do that from pure C++; you need to call a function specific to your operating system (such as [tt]mmap[/tt] under POSIX or [tt]CreateFileMapping[/tt]/[tt]MapViewOfFile[/tt] under Windows). On the positive side, it is not really all that complicated; you'll probably need less than half an hour of RTFM to grok it.

Memory mapping basically makes a file part of your program's memory space, without you explicitly "loading" anything. You get a pointer, and the data within your file is just "magically" present at that location. You can also tell the operating system to do "copy on write", so any modifications you make stay private to your program's memory (otherwise, if you modify the memory, the on-disk file will be "magically" modified as well).

Like this, you could map the whole file. You get back a pointer, and you know the size of the file, so you can trivially iterate over that memory region with a simple [tt]for[/tt] loop and do whatever you want.

For example, given [tt]234 ; 23423.322 ; 5; 1[/tt], you want to extract these numbers as strings? Nothing easier than that: do not extract anything.

Instead, set a [tt]char*[/tt] to the location where the string starts (after seeing a newline or a semicolon), and write a zero byte at the next semicolon (or newline). No need to allocate memory and copy data around. This is the way some in situ XML parsers (such as RapidXML or FastXML) work. They are ultra fast because they avoid doing tens of thousands of dynamic memory allocations and copying data.
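
The in-place idea can be sketched as follows. To keep the example portable it works on a std::string buffer rather than a mapped file, and the function name split_in_place is invented here. The same loop would apply to any writable char buffer that has a terminating zero byte after the last field.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` in place. Every separator and newline
// is overwritten with '\0', and each field is returned as a pointer into
// `text`, so no field is copied. The pointers are only valid while `text`
// is alive and not modified.
std::vector<std::vector<char*>> split_in_place(std::string& text, char sep)
{
    std::vector<std::vector<char*>> rows;
    if (text.empty()) return rows;

    const bool ends_with_newline = (text.back() == '\n');
    char* p = &text[0];
    char* const end = p + text.size(); // *end is std::string's own '\0'
    char* field = p;
    rows.emplace_back();

    for (; p != end; ++p) {
        if (*p == sep) {
            *p = '\0'; // terminate the current field
            rows.back().push_back(field);
            field = p + 1;
        } else if (*p == '\n') {
            *p = '\0';
            if (p > field && p[-1] == '\r') p[-1] = '\0'; // tolerate CRLF
            rows.back().push_back(field);
            field = p + 1;
            if (field != end) rows.emplace_back(); // more data follows
        }
    }
    // The last field has no newline after it; the string's own
    // terminator ends it.
    if (!ends_with_newline) rows.back().push_back(field);
    return rows;
}
```

Fields keep their surrounding spaces (for example "234 "), so trimming or number parsing is left to the caller.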

Of course you can do anything else you want too (e.g. copy the strings, or parse numbers, or whatever).

I acknowledge that this may sound a bit scary to a beginner, but once one has learned to use memory mapping, one also learns to love its beauty and its ease. Plus, there is probably no faster way of doing such a thing (well, not without using a binary file, anyway).
0


A desktop PC typically has several GB of RAM; 80 MB is nothing

 

You'd still want to constrain your application's total memory usage to 250 or 500 MB, though, and 80 MB is already quite big for loading a file. But as Olof Hedman said, you may want to load your data in chunks instead. This will give you a balance between performance and memory usage.

0


Hi again

I managed to get all the data into a char array, so I can access it like so:

 

char test = data[203]; // assign character 203 in the ascii file
 

So I can loop through everything now; the file has 65 million chars :)

Is this an OK way to do it:

 

1. Loop until I find a separator. Look at what I have got so far. Is it a number? (How do I check that?) If so, save the number to cleanData[columnID][rowID]; otherwise save NaN to the slot (any clever way of indicating NaN?).

2. On finding a separator, columnID++;

3. On finding a newline/return, rowID++, columnID=0;

4. Go on until the end of the file.

 

Will it be reasonably fast? It doesn't need to be lightning fast, but it has to be OK at least. This is a method I would understand.

 

The data will be processed a bit once it is in the cleanData structure and then saved to a new ASCII file (FYI, I think I can manage that if I get that far).

Thanks for your help

Erik

0


#include <iostream>  //standard headers; the old <iostream.h>/<fstream.h> are not part of ISO C++
#include <fstream>
#include <cstddef>
#include <cstdlib>

//LIST DATA TYPE DEF'S
typedef char DataT_;
typedef long DataT_Key;
///////////////////////

#include "List_.h"  //project header (posted below), so quotes rather than angle brackets


using namespace std;

 

//data browser meant for opening very large resource files
//and browsing page by page for peaks and looks at the data
//for research and study of the file structure
//purpose: to dynamic load one page of 6.4k bytes of raw data
//so that memory is managed for speed and performance and not overloaded

//key data will serve as data position for bit page following
//page #

class DataBrowser
{

List_ *Page;   //80x80 bytes (6.4k/page)
fstream DataFile_;  //file stream
char * fname;  //file name

 

public:

DataBrowser();
~DataBrowser();


DataBrowser(char*); //overload construct parameter sets filename

void setfname(char*);

int openfile(char*);

void closefile();

 

int LoadPage(long);  //load by page index
   //page index = (pos 0) + (page_num * 6400)
   //returns true if the page is valid else 0

int DisplayPage(); //displays this->page if valid
   //returns true if display success;

List_ * GetPage(); //returns this->page memory address

 

}; //class DataBrowser

 

DataBrowser::DataBrowser() { fname = NULL; Page = new List_; } //no file name until setfname() is called
DataBrowser::~DataBrowser(){ delete Page; } //deleting NULL is a no-op, so no check is needed

DataBrowser::DataBrowser(char * f_name)
{
fname = f_name;
cout<<"\nconstructing file name: "<<this->fname;
Page = new List_;
}
 //overload construct parameter sets filename

void DataBrowser::setfname(char* f_name)
{
fname = f_name;
}

 

int DataBrowser::openfile(char* f_name)
{
DataFile_.open(f_name, ios::in | ios::ate | ios::binary);
DataFile_.seekg(0,ios::beg);
return (DataFile_.fail() ? 0 : 1);
}

void DataBrowser::closefile()
{
if(DataFile_.is_open())
{ DataFile_.close();DataFile_.clear();}
else return;
}


   //load by page index
   //page index = (pos 0) + (page_num * 6400)
   //returns true if the page is valid else 0

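int DataBrowser::LoadPage(long page_num)
{
 const long kPageBytes = 6400; //80x80 bytes, matching the page size documented above

 //return 0 on failure, as the comment above promises (-1 would test as true)
 if(!openfile(fname)){ cout<<"\nfile read error"; closefile(); return 0;}

 long pos = page_num * kPageBytes;
 long epos = pos + kPageBytes;

 system("cls"); //clears the console; Windows only
 cout<<"file: "<<fname;
 cout<<"\nLoading Page: "<<page_num<<endl;
 DataFile_.seekg(pos, ios::beg);

 DataT_ tmp_;
 //test the read itself, so a failed get() never enqueues a stale byte
 while(pos < epos && DataFile_.get(tmp_))
 {
  Page->enQueue( (tmp_ == '\a' ? 0 : tmp_) );
  ++pos;
 }
 int hit_eof = DataFile_.eof();
 closefile();
 return hit_eof ? 0 : 1; //0 when the end of the file was reached
}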

//displays this->page if valid
//returns true if display success

int DataBrowser::DisplayPage()
{
if(Page == NULL) return 0;
//drain the queue; Size() avoids comparing against a pointer to a deleted node
while(Page->Size() > 0){ cout<<  Page->deQueue() ;}
return 1;
}

List_ * DataBrowser::GetPage(){return Page;} //returns this->page memory address


 

Edited by Mathimetric
-1


//List class def

//programmer : MJO 2013'

//#include "datatype.h"

 

#include <stddef.h>

 

 

class List_
{
 
 

 typedef struct Node_
 {
  DataT_ data_;
  DataT_Key key_;
  
  Node_ *next;
  Node_ *prev;
 }Node_;

private: 

Node_ * list_;
Node_ * head_;
Node_ * tail_;
Node_ * Itorator_;

long size_;

public:
////////////////////////////////////////////////////////////////////////////////////////////////////
//construct/destruct//
List_();
~List_();
////////////////////////////////////////////////////////////////////////////////////////////////////
//stack function//
int Push(DataT_ );
DataT_  Pop();
DataT_ Top();
////////////////////////////////////////////////////////////////////////////////////////////////////
//list function//
int Insert(DataT_Key);
int Delete(DataT_Key);
////////////////////////////////////////////////////////////////////////////////////////////////////
//queue function//
int enQueue(DataT_ );
DataT_  deQueue();
////////////////////////////////////////////////////////////////////////////////////////////////////
//general function
int clear_list();
int HasNext();
int HasPrev();
Node_ * iBegin();  //return itorator_ = head
Node_ * iEnd(); //return itorator_ = tail
Node_ * gHead();
Node_ * gTail();
Node_ * gItorator();
Node_ * gNext();
Node_ * gPrev();
long  Size();
DataT_ ShowData(Node_ *);
////////////////////////////////////////////////////////////////////////////////////////////////////
};//List_ class


////////////////////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////////////

//construct/destruct//
List_::List_()
{
 list_ = NULL;
 head_= NULL;
 tail_= NULL;
 Itorator_= NULL;
 size_ = 0;
}

List_::~List_()
{
 clear_list();
}

long List_::Size()
{
return size_;
}

//stack function//
int List_::Push(DataT_ input_)
{
 if(list_ == NULL){ 
 //start new list
 list_ = new Node_;
 list_->next = NULL;
 list_->prev = NULL;
 //set list addressing
 head_ = list_;
 tail_ = list_;
 Itorator_ = list_;
 list_->data_ = input_;
 return ++size_; 
 //else push onto list STACK
 }else{
 head_->prev = new Node_;
 Itorator_ = head_;
 head_ = head_->prev;
 head_->next = Itorator_;
 Itorator_->prev = head_;
 head_->prev = NULL;
 head_->data_ = input_;
 return ++size_;
 } 
 
}

DataT_  List_::Pop()
{
 //pop head ,delete, fix addressing, return dataT
if(head_->next != NULL)
{
 Itorator_ = head_;
 head_ = head_->next;
 DataT_ temp_ = Itorator_->data_;
 delete Itorator_;
 head_->prev = NULL;
 --size_;
 return temp_;
}
else
{
 Itorator_ = head_; 
 DataT_ temp_ = Itorator_->data_;
 delete Itorator_;
 Itorator_ = NULL;
 head_ = NULL;
 tail_ = NULL;
 list_ = NULL;
 --size_;
 return temp_;
}

}


DataT_  List_::Top()
{
 Itorator_ = head_;
 return Itorator_->data_;

}
//list function//
int List_::Insert(DataT_Key key_in)
{
 
 return 0;

}
int List_::Delete(DataT_Key)
{
return 0;


}

//queue function//
int List_::enQueue(DataT_ input_)
{
 if(list_ == NULL){
 
 //start new list
 list_ = new Node_;
 list_->next = NULL;
 list_->prev = NULL; 
 //set list addressing
 head_ = list_;
 tail_ = list_;
 Itorator_ = list_;
 list_->data_ = input_;
 return ++size_;
 }else{

 head_->prev = new Node_;
 Itorator_ = head_;
 head_ = head_->prev;
 head_->next = Itorator_;
 Itorator_->prev = head_;
 head_->prev = NULL;
 head_->data_ = input_;
 return ++size_; 
 }
 
 


}

DataT_  List_::deQueue()
{

if(tail_->prev != NULL)
{
 Itorator_ = tail_;
 DataT_ temp_ = Itorator_->data_;
 tail_ = tail_->prev;
 delete Itorator_;
 tail_->next = NULL;
 --size_;
 return temp_;
}else{

 Itorator_ = tail_;
 DataT_ temp_ = Itorator_->data_;
 delete Itorator_;
 tail_ = NULL; 
 Itorator_ = NULL;
 head_ = NULL;
 list_ = NULL;
 --size_;
 return temp_;
}

}

List_::Node_*  List_::gHead()
{
return head_;
}

List_::Node_* List_::gTail()
{
return tail_;
}

List_::Node_* List_::gItorator()
{
return Itorator_;
}

//general function
int List_::clear_list()
{
if(head_ == NULL) return -1; //had no list
//walk the list iteratively: no recursion, and no static state shared between lists
while(head_ != NULL){
 Node_ *next_ = head_->next;
 delete head_;
 head_ = next_;
 --size_;
 }
tail_ = NULL;
list_ = NULL;
Itorator_ = NULL;
return size_;//size would be zero
}


List_::Node_ * List_::iBegin()
{
return Itorator_ = head_;
}

List_::Node_ * List_::iEnd()
{
return Itorator_ = tail_;
}

List_::Node_ * List_::gNext()
{
return Itorator_ =  (Itorator_->next == NULL ? NULL : Itorator_->next );
}
List_::Node_ * List_::gPrev()
{
return Itorator_ =  (Itorator_->prev == NULL ? NULL : Itorator_->prev );
}

DataT_ List_::ShowData(List_::Node_ *in_pointer)
{
return in_pointer->data_;
}

int List_::HasNext()
{
return Itorator_->next == NULL ? 0: 1;
}
int List_::HasPrev()
{
return Itorator_->prev == NULL ? 0: 1;
}

-1
