
# Text/Data Processing


I have been tinkering with a data manipulation program for my simulation engine for the last few days, and I've realized that I should think through my data file format more clearly before I move on (I already have one that works well, but it needs to be expanded). The gist is that I wrote a simulation program that models a whole colony of bacterial cells. It spits out two data files. The first, JobName.CellBirthLog.dat, is a log that records when each cell was born along with some of its values. It looks like this (note: cells are identified by lineage, so 0112 is a 4th generation cell and will have children 01121 and 01122 if it divides):
[Source]
# Job Name: TestRun4
# File Name: TestRun4.CellBirthLog.dat
# Generated by PLGRNSE v. 1.0.
# Author: SagarIndurkhya
# Date: Mon Jul 23 13:32:52 2007

# KEY:
#	(0) Cell ID Number
#	(1) Time of Birth
#	(3) Initial Cell Volume
#	(4) Final Cell Volume

# Number of Entries for each cell
5

# Birth Data
0	0	3738.25	7.27417e-16	1.33003e-15
01	3738	3956.53	6.65076e-16	1.32734e-15
02	3738	3809.38	6.65076e-16	1.37637e-15
021	7548	4090.85	6.88241e-16	1.34726e-15
022	7548	4025.25	6.88241e-16	1.36981e-15
...
[/Source]
The comment lines start with # so that the data can be processed easily by MATLAB and gnuplot. The second data file (JobName.CellSpeciesLog.dat) holds each cell's protein concentrations, etc., and looks like this:
[Source]
# Job Name: TestRun4
# File Name: TestRun4.CellSpeciesLog.dat
# Generated by PLGRNSE v. 1.0.
# Author: SagarIndurkhya
# Date: Mon Jul 23 13:32:52 2007

#############################################################
# SPECIES DESCRIPTION
# ___________________
# (0) - Time
# (1) - Plac
# (2) - PcI
# (3) - Ptet
# (4) - PtetGFP
# (5) - Protein_lacI
# (6) - Protein_tetR
# (7) - Protein_cI
# (8) - Protein_GFP
#
# Note: Volume Calculations enabled, second row for each
# entry is (*)/CellVolume.
#
# Note: Surface Area Calculations enabled, third row for each
# entry is (*)/CellSurfaceArea.
#
#############################################################

# (0)	(1)	(2)	(3)	(4)	(5)	(6)	(7)	(8)
0
0	  4	4	4	40	0	1000	0	0
5.49891e+15	5.49891e+15	5.49891e+15	5.49891e+16	0	1.37473e+18	0	0
1.02263e+10	1.02263e+10	1.02263e+10	1.02263e+11	0	2.55658e+12	0	0
0
600	  0	4	0	0	24032	926	11	267
0	4.85355e+15	0	0	2.91601e+19	1.1236e+18	1.33473e+16	3.23975e+17
0	9.4097e+09	0	0	5.65335e+13	2.17834e+12	2.58767e+10	6.28097e+11
0
1200	  0	1	0	0	36027	466	33	750
0	1.08594e+15	0	0	3.91232e+19	5.06049e+17	3.58361e+16	8.14456e+17
0	2.18467e+09	0	0	7.87073e+13	1.01806e+12	7.20943e+10	1.63851e+12
0
1800	  0	0	0	1	24588	225	255	2893
0	0	0	9.82722e+14	2.41632e+19	2.21113e+17	2.50594e+17	2.84302e+18
0	0	0	2.04395e+09	5.02566e+13	4.59888e+11	5.21207e+11	5.91314e+12
[/Source]
(Note that in the data shown, the cell hasn't divided yet, so there is only one cell.) The problem I'm having is that I find it hard to extract the information that has to be commented out so that gnuplot and MATLAB ignore it. Usually I just use cin >> ... for each of the numbers. However, other things, like whether Volume Calculations have been enabled, are represented by text, because I want the data to be readable by humans at every stage. So I'm wondering if there is an easier way to juggle data files than just using the << and >> operators. I don't want XML, but perhaps there is something else?
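For reference, here is a minimal sketch of one way to read this kind of file while staying within standard C++ streams: read it line by line with std::getline, inspect the # lines for the text flags, and hand every other line to a std::istringstream. The names ParsedLog and parseLog are hypothetical, and matching the exact phrases "Volume Calculations enabled" and "Surface Area Calculations enabled" is an assumption based on the header shown above. This is not a definitive implementation.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for one parsed log file (names are illustrative only).
struct ParsedLog {
    bool volumeCalc = false;       // header contained "Volume Calculations enabled"
    bool surfaceAreaCalc = false;  // header contained "Surface Area Calculations enabled"
    std::vector<std::vector<double>> rows;  // one vector per non-comment, non-blank line
};

// Reads the stream line by line. Lines whose first non-blank character is '#'
// are treated as comments and scanned for the flag phrases (assumed wording);
// every other line is split into whitespace-separated numbers.
ParsedLog parseLog(std::istream& in) {
    ParsedLog log;
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) {
            continue;  // blank line
        }
        if (line[first] == '#') {
            if (line.find("Volume Calculations enabled") != std::string::npos) {
                log.volumeCalc = true;
            }
            if (line.find("Surface Area Calculations enabled") != std::string::npos) {
                log.surfaceAreaCalc = true;
            }
            continue;  // comment lines carry no numeric data
        }
        std::istringstream fields(line);
        std::vector<double> row;
        double value = 0.0;
        while (fields >> value) {
            row.push_back(value);
        }
        if (!row.empty()) {
            log.rows.push_back(row);
        }
    }
    return log;
}
```

A caller would open the file with std::ifstream and pass it to parseLog; grouping the resulting rows into per-cell entries is left to the caller. One caveat: reading every field as a double would turn a cell ID such as 021 into 21, so for the birth log the first column would need to be read into a std::string instead.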