Sagar_Indurkhya

Text/Data Processing

I have been tinkering with a data manipulation program for my simulation engine for the last few days, and I've realized that I should think through my data file format more clearly before I move on (I already have one that works well, but it needs to be expanded). The basic gist is that I wrote a simulation program that models a whole colony of bacterial cells. It runs the model and then writes out two data files. The first, JobName.CellBirthLog.dat, is a log recording when each cell was born, along with some of its values. It looks like this (note: cells are identified by their lineage, so 0112 is a 4th-generation cell and will have children 01121 and 01122 if it divides):
[Source]
# Job Name: TestRun4
# File Name: TestRun4.CellBirthLog.dat
# Generated by PLGRNSE v. 1.0.
# Author: SagarIndurkhya
# Date: Mon Jul 23 13:32:52 2007


# KEY: 
#	(0) Cell ID Number
#	(1) Time of Birth
#	(2) Maximum Cell Lifetime
#	(3) Initial Cell Volume
#	(4) Final Cell Volume

# Number of Entries for each cell
5

# Birth Data
0	0	3738.25	7.27417e-16	1.33003e-15
01	3738	3956.53	6.65076e-16	1.32734e-15
02	3738	3809.38	6.65076e-16	1.37637e-15
021	7548	4090.85	6.88241e-16	1.34726e-15
022	7548	4025.25	6.88241e-16	1.36981e-15
...
[/Source]
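A minimal sketch of the cell ID scheme described before the listing, assuming one character per generation and that a division appends 1 or 2 to the parent's ID. The function names (`generation`, `daughters`, `parent`) are made up for illustration and are not part of PLGRNSE.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Generation number under the scheme in the post: "0" is generation 1 and
// "0112" is generation 4, i.e. one generation per character (assumption).
std::size_t generation(const std::string& id) { return id.size(); }

// IDs of the two daughter cells produced when a cell divides.
std::pair<std::string, std::string> daughters(const std::string& id) {
    return {id + "1", id + "2"};
}

// ID of the parent cell. The root cell "0" has no parent, so this returns
// an empty string for it.
std::string parent(const std::string& id) {
    return id.size() > 1 ? id.substr(0, id.size() - 1) : std::string();
}
```

Keeping the ID as a string rather than a number matters here: reading 021 as an integer would drop the leading zero and lose a generation.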
(The comment lines start with # so that the data can be processed easily by Matlab and GNUPlot.) The second data file, JobName.CellSpeciesLog.dat, holds each cell's protein concentrations, etc. It looks like this:
[Source]
# Job Name: TestRun4
# File Name: TestRun4.CellSpeciesLog.dat
# Generated by PLGRNSE v. 1.0.
# Author: SagarIndurkhya
# Date: Mon Jul 23 13:32:52 2007

#############################################################
# SPECIES DESCRIPTION
# ___________________
# (0) - Time
# (1) - Plac
# (2) - PcI
# (3) - Ptet
# (4) - PtetGFP
# (5) - Protein_lacI
# (6) - Protein_tetR
# (7) - Protein_cI
# (8) - Protein_GFP
# 
# Note: Volume Calculations enabled, second row for each
# entry is (*)/CellVolume.
#
# Note: Surface Area Calculations enabled, third row for each
# entry is (*)/CellSurfaceArea.
#
#############################################################


# (0)	(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)	
0
0	  4	4	4	40	0	1000	0	0	
5.49891e+15	5.49891e+15	5.49891e+15	5.49891e+16	0	1.37473e+18	0	0	
1.02263e+10	1.02263e+10	1.02263e+10	1.02263e+11	0	2.55658e+12	0	0	
0
600	  0	4	0	0	24032	926	11	267	
0	4.85355e+15	0	0	2.91601e+19	1.1236e+18	1.33473e+16	3.23975e+17	
0	9.4097e+09	0	0	5.65335e+13	2.17834e+12	2.58767e+10	6.28097e+11	
0
1200	  0	1	0	0	36027	466	33	750	
0	1.08594e+15	0	0	3.91232e+19	5.06049e+17	3.58361e+16	8.14456e+17	
0	2.18467e+09	0	0	7.87073e+13	1.01806e+12	7.20943e+10	1.63851e+12	
0
1800	  0	0	0	1	24588	225	255	2893	
0	0	0	9.82722e+14	2.41632e+19	2.21113e+17	2.50594e+17	2.84302e+18	
0	0	0	2.04395e+09	5.02566e+13	4.59888e+11	5.21207e+11	5.91314e+12	
[/Source]
(Note that in the data shown the cell hasn't divided yet, so there is only one cell.) The problem I'm having is that it is hard to extract information that has to be commented out for GNUPlot or Matlab. Usually I just use cin >> ... for each of the numbers. However, other things, like whether Volume Calculations have been enabled, are represented by text. They have to be text because I want the data to be readable by humans at every stage. So I'm wondering if there is an easier way to juggle data files than just using the << and >> operators. I don't want XML, but perhaps there is something else?

Well, if you read the file in with getline (rather than the stream operators), you can examine each line's content before deciding how to parse it. You need to give the program a way to know what type of line it is reading. One option is to start the file with a header section in a specific format, e.g. begin each line with a text key saying what the value is, followed by the value itself, appropriately formatted. You can use string streams to process each line in the normal way. You can also have some sort of special line to indicate where the data begins; once you've read that line, you can deal with the data as normal. Used correctly, the stream objects are quite flexible: you can do a lot of clever tricks with them and still get robust IO.

Another option is to output pure data in one file and all non-data in another (appropriately matched). I guess this is equivalent to what I suggested above, but it may be easier for you programmatically if you have to output in a specific order. The two can even be recombined into a single file afterwards.

Also, why not XML? Are you just worried about the file size increasing? One way around that is to pipe the data through a compression tool (e.g. gzip, bzip2). TinyXML is a nice, light, easy library for XML IO if you do decide to go for it.
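As a rough sketch of the getline-plus-string-stream idea, here is one way the CellBirthLog file from the first post could be read. It assumes that header comments of the form `# Key: value` are worth keeping, that the first non-comment line is the column count, and that every later non-comment line is a data row. `BirthRecord`, `BirthLog`, `trim` and `parseBirthLog` are hypothetical names, not anything from the original program.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One row of the birth data section.
struct BirthRecord {
    std::string id;              // kept as text so leading zeros survive
    std::vector<double> values;  // remaining columns, in file order
};

struct BirthLog {
    std::map<std::string, std::string> header;  // e.g. "Job Name" -> "TestRun4"
    std::size_t columns = 0;                    // the "Number of Entries" value
    std::vector<BirthRecord> records;
};

// Strip leading and trailing spaces, tabs and carriage returns.
std::string trim(const std::string& s) {
    const char* ws = " \t\r";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

BirthLog parseBirthLog(std::istream& in) {
    BirthLog log;
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty()) continue;
        if (line[0] == '#') {
            // Comment line: keep it only if it looks like "# Key: value".
            const auto colon = line.find(':');
            if (colon != std::string::npos) {
                const std::string key = trim(line.substr(1, colon - 1));
                const std::string value = trim(line.substr(colon + 1));
                if (!key.empty() && !value.empty()) log.header[key] = value;
            }
            continue;
        }
        std::istringstream fields(line);
        if (log.columns == 0) {  // first non-comment line is the column count
            fields >> log.columns;
            continue;
        }
        BirthRecord rec;
        fields >> rec.id;
        for (double v; fields >> v;) rec.values.push_back(v);
        // Rows with the wrong number of columns (such as "...") are skipped.
        if (rec.values.size() + 1 == log.columns) log.records.push_back(rec);
    }
    return log;
}
```

Called on an open `std::ifstream`, this would leave `header["Job Name"]` holding `TestRun4` while GNUPlot and Matlab still see only comments. A flag such as the volume calculation setting would be easiest to pick up the same way if it were written on its own keyed line, for instance a hypothetical `# Volume Calculations: enabled`.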
Hope this helps somehow,

Dan
