
any fast program for reading large txt files?


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

10 replies to this topic

#1 suliman   Members   -  Reputation: 553


Posted 06 March 2013 - 02:49 PM

Hi

Is there any program (or better yet, complete C++ code) that reads a text file (ASCII data) line by line and, given a custom separator (like ";" or ","), splits every "word" (a column, actually) into separate strings? I can then handle the strings myself.

 

I have huge text files (60 MB and up, 200,000 rows of somewhat unfiltered measured data) and MATLAB is way too slow to deal with them.

 

Thanks a lot!

Erik




#2 Vortez   Crossbones+   -  Reputation: 2697


Posted 06 March 2013 - 05:50 PM

How is the input file structured? And how should it be formatted? We can't read your mind.



#3 ~Nem   Members   -  Reputation: 311


Posted 07 March 2013 - 12:55 AM

Depends on what you need to do, but the fastest way to do this in C++ will be to load the entire file into a large char array, then scan for your delimiters and replace them with null characters in place.  You can then store the address of the start of the delimited value and use it like a string later (since it is null terminated).
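A minimal sketch of that approach, assuming plain ASCII input with no quoting or escaping of the separator. The names load_file and split_in_place are made up for illustration, and surrounding spaces are left in the fields untouched.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read the whole file into one buffer and append a final '\0'.
// Returns false if the file cannot be opened or read.
bool load_file(const std::string& path, std::vector<char>& buffer)
{
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in) return false;
    const std::streamoff size = in.tellg();   // opened at the end, so this is the size
    if (size < 0) return false;
    in.seekg(0, std::ios::beg);
    buffer.assign(static_cast<std::size_t>(size) + 1, '\0');
    if (size > 0 && !in.read(buffer.data(), static_cast<std::streamsize>(size)))
        return false;
    return true;
}

// Overwrite every separator and line break with '\0' and remember where each
// field starts. rows[r][c] is then a C string living inside `buffer`.
// `buffer` must end with a '\0', as produced by load_file().
std::vector<std::vector<char*> > split_in_place(std::vector<char>& buffer, char sep)
{
    std::vector<std::vector<char*> > rows;
    if (buffer.empty()) return rows;
    const std::size_t n = buffer.size() - 1;  // exclude the final '\0'
    std::size_t i = 0;
    while (i < n) {
        rows.push_back(std::vector<char*>());
        bool end_of_line = false;
        while (!end_of_line) {
            char* field = buffer.data() + i;
            while (i < n && buffer[i] != sep && buffer[i] != '\n') ++i;
            end_of_line = (i == n) || (buffer[i] == '\n');
            if (i < n) {
                // Also drop the '\r' of a Windows-style "\r\n" line ending.
                if (buffer[i] == '\n' && i > 0 && buffer[i - 1] == '\r')
                    buffer[i - 1] = '\0';
                buffer[i] = '\0';
                ++i;
            }
            rows.back().push_back(field);
        }
    }
    return rows;
}
```

The pointers stay valid only as long as the buffer is neither resized nor destroyed, which is the price for not copying anything.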



#4 suliman   Members   -  Reputation: 553


Posted 07 March 2013 - 02:31 AM

The structure may differ. Preferably the program will determine the number of rows and columns automatically and arrange the data into a matrix accordingly. When the number of columns differs between rows, the maximum number of columns should be used. It should also skip the first x rows (headers and other stuff).
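One possible way to find those dimensions is a first pass that only counts. The sketch below assumes the text is already in memory; TableShape and measure_table are hypothetical names, and blank lines are counted as rows with a single column.

```cpp
#include <cstddef>
#include <string>

struct TableShape {
    std::size_t rows;      // number of rows after the skipped ones
    std::size_t max_cols;  // widest row seen
};

// Count rows and the maximum number of columns, ignoring the first
// `skip_rows` lines (headers and other stuff).
TableShape measure_table(const std::string& text, char sep, std::size_t skip_rows)
{
    TableShape shape = { 0, 0 };
    std::size_t line = 0;  // index of the current line
    std::size_t cols = 1;  // a line with k separators has k + 1 columns
    for (std::size_t i = 0; i <= text.size(); ++i) {
        const bool at_end = (i == text.size());
        // Stop if there is no unterminated last line left to count.
        if (at_end && (i == 0 || text[i - 1] == '\n')) break;
        if (at_end || text[i] == '\n') {
            if (line >= skip_rows) {
                ++shape.rows;
                if (cols > shape.max_cols) shape.max_cols = cols;
            }
            ++line;
            cols = 1;
        } else if (text[i] == sep) {
            ++cols;
        }
    }
    return shape;
}
```

With the shape known, the matrix can be allocated once and filled in a second pass.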

 

Everything as one huge char array? Even if it is 80 MB? And then go through it from beginning to end, sorting the data? Seems weird to me, but as you say, maybe it is the fastest way for the CPU.

 

Example of a possible text file (but with around a hundred columns and 200,000 rows). There are many of these files.

 

some text in the beggining
header1; header2; header3;header4
234 ; 23423.322 ; error code ; 234233
11; 123;12;123423
 

So there definitely needs to be some flexibility, which of course is the challenge here. I know some C++ but thought there would be something like this already. Once properly sorted, selected parts need to be saved again into text files, but I think I can manage that.



#5 KnolanCross   Members   -  Reputation: 1271


Posted 07 March 2013 - 09:13 AM

Loading it all at once is indeed the fastest way. Parsing it yourself is not really hard, and will be as fast as any ready-made implementation. If you are still looking for a ready-made solution, try iniparser, but you will have to adapt your text format to it.

 

Finally, I don't know the context of your application, but using text files is only a good solution if you need to save text, which doesn't seem to be the case here. If you need performance, consider using a binary file. If you want a lot of well-organized data, consider using a database (there are local-file solutions, such as SQLite).


Currently working on a scene editor for ORX (http://orx-project.org), using kivy (http://kivy.org).


#6 Olof Hedman   Crossbones+   -  Reputation: 2740


Posted 07 March 2013 - 01:02 PM

Everything as a huge char array? Even if it 80MB?

 

A desktop PC typically has several GB of RAM; 80 MB is nothing :)

 

But if you need to conserve ram (maybe you use it for something else), it could be done in chunks too. 

Just read as much as your buffer can hold, parse as far as you can, and repeat until the end of the file.


Edited by Olof Hedman, 07 March 2013 - 01:04 PM.


#7 samoth   Crossbones+   -  Reputation: 4683


Posted 08 March 2013 - 09:12 AM

If you are not afraid of delving into somewhat system-dependent functionality, you could give memory mapping a try. Unfortunately, you cannot do that from pure C++; you need to call a function specific to your operating system (such as mmap under POSIX or CreateFileMapping/MapViewOfFile under Windows). On the positive side, it is not really all that complicated; you will probably need less than half an hour of RTFM to grok it.

Memory mapping basically makes a file part of your program's memory space, without you explicitly "loading" anything. You get a pointer, and the data within your file is just "magically" present at that location. You can also tell the operating system to do "copy on write", so any modifications you make stay private to your program's memory (otherwise, if you modify the memory, the on-disk file will be "magically" modified as well).

Like this, you could map the whole file. You get back a pointer, and you know the size of the file, so you can trivially iterate over that memory region with a simple for loop and do whatever you want.

For example, given 234 ; 23423.322 ; 5; 1, you want to extract these numbers as strings? Nothing easier than that: Do not extract anything.

Instead, set a char* to the location where the string starts (after seeing a newline or a semicolon), and write a zero byte at the next semicolon (or newline). No need to allocate memory and copy data around. This is the way some in situ XML parsers (such as RapidXML or FastXML) work. They are ultra fast because they avoid doing tens of thousands of dynamic memory allocations and copying data.

Of course you can do anything else you want too (e.g. copy the strings, or parse numbers, or whatever).

I acknowledge that this sounds maybe a bit scary to a beginner, but once one has learned to use memory mapping, one also learns to love its beauty and its ease. Plus, there is probably no faster way of doing such a thing (well, not without using a binary file, anyway).

#8 BrentChua   Crossbones+   -  Reputation: 1066


Posted 08 March 2013 - 09:31 AM

A desktop PC has typically several GB of ram, 80MB is nothing

 

You'd still want to constrain your application's total memory usage to 250 or 500 MB, though, and 80 MB is already quite big for loading a file. But as Olof Hedman said, you may want to load your data in chunks instead. This will give you a balance between performance and memory usage.



#9 suliman   Members   -  Reputation: 553


Posted 08 March 2013 - 11:31 AM

Hi again

I managed to get all the data into a char array, so I can access it like so:

 

char test = data[203]; // assign character 203 in the ascii file
 

So I can loop through everything now; the file has 65 million chars :)

Is this an ok way to do it:

 

1. Loop until I find a separator. Look at what I have so far. Is it a number? (How do I check that?) If so, save the number to cleanData[columnID][rowID]; otherwise save NaN to the slot (any clever way of indicating NaN?).

2. if finding separator, columnID++;

3. if finding newline/return, rowID++, columnID=0;

4. Go on until end of file.
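The steps above could be sketched roughly as follows. This is only one way to do it: field_to_double and parse_table are hypothetical names, the result is indexed as table[row][column] rather than cleanData[columnID][rowID], and the number check relies on std::strtod, which also accepts things like "inf" or hexadecimal floats and follows the current C locale for the decimal point. NaN is represented by the real floating-point NaN value and can be tested with std::isnan.

```cpp
#include <cctype>
#include <cmath>
#include <cstdlib>
#include <limits>
#include <string>
#include <vector>

// Convert one field to a double. A field that is not entirely a number
// (apart from surrounding whitespace) becomes NaN.
double field_to_double(const std::string& field)
{
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const char* begin = field.c_str();
    char* end = nullptr;
    const double value = std::strtod(begin, &end);  // skips leading whitespace
    if (end == begin) return nan;                   // nothing numeric was found
    while (*end != '\0' && std::isspace(static_cast<unsigned char>(*end))) ++end;
    if (*end != '\0') return nan;                   // trailing junk, e.g. "12abc"
    return value;
}

// Steps 1 to 4: walk the text once, close a field at every separator,
// close a row at every newline.
std::vector<std::vector<double> > parse_table(const std::string& text, char sep)
{
    std::vector<std::vector<double> > table(1);
    std::string field;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == sep) {
            table.back().push_back(field_to_double(field));
            field.clear();
        } else if (c == '\n') {
            table.back().push_back(field_to_double(field));
            field.clear();
            table.push_back(std::vector<double>());
        } else {
            field += c;
        }
    }
    // Finish a last line that has no trailing newline, or drop the empty
    // row that was opened after the final newline.
    if (!field.empty() || !table.back().empty())
        table.back().push_back(field_to_double(field));
    else
        table.pop_back();
    return table;
}
```

Building a std::string per field is not the fastest option, but for a single pass over tens of megabytes it should be reasonable; the in-place approach from earlier in the thread avoids even that copy.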

 

Will it be reasonably fast? It doesn't need to be lightning fast, but it has to be OK at least. This is a method I would understand.

 

The data will be managed a bit once it is in the cleanData structure and then saved to a new ASCII file (FYI, I think I can manage that if I get that far).

Thanks for your help

Erik



#10 Mathimetric   Members   -  Reputation: 133


Posted 10 March 2013 - 06:32 AM

#include <iostream>  //standard headers have no .h suffix
#include <fstream>
#include <cstddef>
#include <cstdlib>   //system()

//LIST DATA TYPE DEF'S
typedef char DataT_;
typedef long DataT_Key;
///////////////////////

#include "List_.h"   //local header: the List_ class from the next post


using namespace std;

 

//data browser meant for opening very large resource files
//and browsing page by page to peek at the data
//for research and study of the file structure
//purpose: to dynamically load one page of 6.4k bytes of raw data
//so that memory is managed for speed and performance and not overloaded

//key data will serve as data position for bit page following
//page #

class DataBrowser
{

List_ *Page;   //80x80 bytes (6.4k/page)
fstream DataFile_;  //file stream
char * fname;  //file name

 

public:

DataBrowser();
~DataBrowser();


DataBrowser(char*); //overload construct parameter sets filename

void setfname(char*);

int openfile(char*);

void closefile();

 

int LoadPage(long);  //load by page index
   //page index = (pos 0) + (page_num * 6400)
   //returns true if the page is valid else 0

int DisplayPage(); //displays this->page if valid
   //returns true if display success;

List_ * GetPage(); //returns this->page memory address

 

}; //class DataBrowser

 

DataBrowser::DataBrowser() : fname(NULL) { Page = new List_; } //no file name yet
DataBrowser::~DataBrowser(){ delete Page; } //deleting a NULL pointer is a no-op

DataBrowser::DataBrowser(char * f_name)
{
fname = f_name;
cout<<"\nconstructing file name: "<<this->fname;
Page = new List_;
}
 //overload construct parameter sets filename

void DataBrowser::setfname(char* f_name)
{
fname = f_name;
}

 

int DataBrowser::openfile(char* f_name)
{
DataFile_.open(f_name, ios::in | ios::ate | ios::binary);
DataFile_.seekg(0,ios::beg);
return (DataFile_.fail() ? 0 : 1);
}

void DataBrowser::closefile()
{
if(DataFile_.is_open())
{ DataFile_.close();DataFile_.clear();}
else return;
}


   //load by page index
   //page index = (pos 0) + (page_num * 6400)
   //returns true if the page is valid else 0

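int DataBrowser::LoadPage(long page_num)
{
 //returns 1 if a full page was read, 0 if the file could not be
 //opened or the end of the file was reached first
 if(fname == NULL || !openfile(fname)){ cout<<"\nfile read error"; closefile(); return 0;}

 const long page_size = 6400; //80x80 bytes, as documented in the class
 const long pos = page_num * page_size;

 system("cls"); //Windows-only: clears the console
 cout<<"file: "<<fname;
 cout<<"\nLoading Page: "<<page_num<<endl;
 DataFile_.seekg(pos, ios::beg);

 DataT_ tmp_;
 long count = 0;
 //test the read itself so a failed get() never enqueues a stale byte
 while(count < page_size && DataFile_.get(tmp_))
 {
  Page->enQueue( (tmp_ == '\a' ? 0 : tmp_) ); //swallow the bell character
  ++count;
 }
 closefile();
 return (count == page_size ? 1 : 0);
}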

//displays this->page if valid
//returns true if display success

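int DataBrowser::DisplayPage()
{
 if(Page == NULL || Page->Size() == 0) return 0; //nothing to display

 //deQueue() removes each byte from the page as it is printed
 while(Page->Size() > 0){ cout<<  Page->deQueue() ;}
 return 1;
}

List_ * DataBrowser::GetPage(){return Page;} //returns this->page memory address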


 


Edited by Mathimetric, 10 March 2013 - 06:36 AM.


#11 Mathimetric   Members   -  Reputation: 133


Posted 10 March 2013 - 06:35 AM

//List class def

//programmer : MJO 2013'

//#include "datatype.h"

 

#include <stddef.h>

 

 

class List_
{
 
 

 typedef struct Node_
 {
  DataT_ data_;
  DataT_Key key_;
  
  Node_ *next;
  Node_ *prev;
 }Node_;

private: 

Node_ * list_;
Node_ * head_;
Node_ * tail_;
Node_ * Itorator_;

long size_;

public:
////////////////////////////////////////////////////////////////////////////////////////////////////
//construct/destruct//
List_();
~List_();
////////////////////////////////////////////////////////////////////////////////////////////////////
//stack function//
int Push(DataT_ );
DataT_  Pop();
DataT_ Top();
////////////////////////////////////////////////////////////////////////////////////////////////////
//list function//
int Insert(DataT_Key);
int Delete(DataT_Key);
////////////////////////////////////////////////////////////////////////////////////////////////////
//queue function//
int enQueue(DataT_ );
DataT_  deQueue();
////////////////////////////////////////////////////////////////////////////////////////////////////
//general function
int clear_list();
int HasNext();
int HasPrev();
Node_ * iBegin();  //return itorator_ = head
Node_ * iEnd(); //return itorator_ = tail
Node_ * gHead();
Node_ * gTail();
Node_ * gItorator();
Node_ * gNext();
Node_ * gPrev();
long  Size();
DataT_ ShowData(Node_ *);
////////////////////////////////////////////////////////////////////////////////////////////////////
};//List_ class


////////////////////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////////////////////

//construct/destruct//
List_::List_()
{
 list_ = NULL;
 head_= NULL;
 tail_= NULL;
 Itorator_= NULL;
 size_ = 0;
}

List_::~List_()
{
 clear_list();
}

long List_::Size()
{
return size_;
}

//stack function//
int List_::Push(DataT_ input_)
{
 if(list_ == NULL){ 
 //start new list
 list_ = new Node_;
 list_->next = NULL;
 list_->prev = NULL;
 //set list addressing
 head_ = list_;
 tail_ = list_;
 Itorator_ = list_;
 list_->data_ = input_;
 return ++size_; 
 //else push onto list STACK
 }else{
 head_->prev = new Node_;
 Itorator_ = head_;
 head_ = head_->prev;
 head_->next = Itorator_;
 Itorator_->prev = head_;
 head_->prev = NULL;
 head_->data_ = input_;
 return ++size_;
 } 
 
}

DataT_  List_::Pop()
{
 //pop head ,delete, fix addressing, return dataT
 if(head_ == NULL) return 0; //empty list: nothing to pop
if(head_->next != NULL)
{
 Itorator_ = head_;
 head_ = head_->next;
 DataT_ temp_ = Itorator_->data_;
 delete Itorator_;
 head_->prev = NULL;
 --size_;
 return temp_;
}
else
{
 Itorator_ = head_; 
 DataT_ temp_ = Itorator_->data_;
 delete Itorator_;
 Itorator_ = NULL;
 head_ = NULL;
 tail_ = NULL;
 list_ = NULL;
 --size_;
 return temp_;
}

}


DataT_  List_::Top()
{
 if(head_ == NULL) return 0; //empty list: nothing on top
 Itorator_ = head_;
 return Itorator_->data_;

}
//list function//
int List_::Insert(DataT_Key key_in)
{
 
 return 0;

}
int List_::Delete(DataT_Key)
{
return 0;


}

//queue function//
int List_::enQueue(DataT_ input_)
{
 if(list_ == NULL){
 
 //start new list
 list_ = new Node_;
 list_->next = NULL;
 list_->prev = NULL; 
 //set list addressing
 head_ = list_;
 tail_ = list_;
 Itorator_ = list_;
 list_->data_ = input_;
 return ++size_;
 }else{

 head_->prev = new Node_;
 Itorator_ = head_;
 head_ = head_->prev;
 head_->next = Itorator_;
 Itorator_->prev = head_;
 head_->prev = NULL;
 head_->data_ = input_;
 return ++size_; 
 }
 
 


}

DataT_  List_::deQueue()
{

if(tail_ == NULL) return 0; //empty list: nothing to dequeue
if(tail_->prev != NULL)
{
 Itorator_ = tail_;
 DataT_ temp_ = Itorator_->data_;
 tail_ = tail_->prev;
 delete Itorator_;
 tail_->next = NULL;
 --size_;
 return temp_;
}else{

 Itorator_ = tail_;
 DataT_ temp_ = Itorator_->data_;
 delete Itorator_;
 tail_ = NULL; 
 Itorator_ = NULL;
 head_ = NULL;
 list_ = NULL;
 --size_;
 return temp_;
}

}

List_::Node_*  List_::gHead()
{
return head_;
}

List_::Node_* List_::gTail()
{
return tail_;
}

List_::Node_* List_::gItorator()
{
return Itorator_;
}

//general function
int List_::clear_list()
{
if(head_ == NULL) return -1; //had no list

 //walk from head to tail and delete every node; a plain loop avoids
 //the static pointer and the deep recursion of a recursive version
 while(head_ != NULL){
 Node_ *next_ = head_->next;
 delete head_;
 head_ = next_;
 --size_;
 }
 tail_ = NULL;
 list_ = NULL;
 Itorator_ = NULL; //do not leave the iterator pointing at a deleted node
return size_;//size would be zero
}


List_::Node_ * List_::iBegin()
{
return Itorator_ = head_;
}

List_::Node_ * List_::iEnd()
{
return Itorator_ = tail_;
}

List_::Node_ * List_::gNext()
{
return Itorator_ =  (Itorator_->next == NULL ? NULL : Itorator_->next );
}
List_::Node_ * List_::gPrev()
{
return Itorator_ =  (Itorator_->prev == NULL ? NULL : Itorator_->prev );
}

DataT_ List_::ShowData(List_::Node_ *in_pointer)
{
return in_pointer->data_;
}

int List_::HasNext()
{
return Itorator_->next == NULL ? 0: 1;
}
int List_::HasPrev()
{
return Itorator_->prev == NULL ? 0: 1;
}





