Sagar_Indurkhya

Biological Programming

I'm writing a proposal for a research project and I don't know where to start with this issue: I am proposing to design a programming language / scripting language that would compile to DNA, which would then be inserted into E. coli. Now I know I can take the DNA and put it in the E. coli because I am working with a team of at least 30 engineers (grad students, profs, postdocs) who have done this type of thing. However, I am confused as to how I would start thinking about the programming language aspect. The cell is a living organism, and say I would like to create an organelle inside the cell that would glow. Well, my plan would be to have the compiler express an instruction as the nucleotides that make up a gene for a fluorescent protein. Yet, how can I expand this model to be abstract? Any thoughts are welcome. What I said above could be totally wrong. I just wanted to see what you guys had to say about this whole idea.

Share this post


Link to post
Share on other sites
This sounds really ambitious...

Though here is how I view your issue: DNA, basically, has been identified as comprising a series of genes, each gene "coding for a given protein". DNA is transcribed to RNA, which is then translated into these proteins.

The DNA can be seen as a language itself that is expressed in the form of proteins; the presence of such proteins inside the cell/blood/organs then has an influence (or not...) on the host metabolism.

It is also known that there can be strong interdependence between the genes.

Now regarding programming languages, all languages are, at one point or another, translated into the CPU assembly language. The thing is that the CPU assembly was designed in a deterministic way by humans, and the purpose and effect of each instruction is, thus, well known and documented.

In a way, the DNA is already a programming language in that it is a sequence of coding instructions.

The usual aim, in designing a programming language, is to change the paradigm for the programmer: for instance, C focuses on functions, which make it possible to organize code more modularly than in ASM. Object-oriented languages revolve around design concepts that make it possible to model applications from a different perspective.

In a word, I see 2 distinct research domains in your question:

1. Compilation to DNA: here, you can use a very simple language that would, for example, be composed only of the 20 standard amino acids and their control sequences, and develop a hardware device that generates the DNA molecule corresponding to your sequence.

2. Given you already have the device described above, abstract the DNA "language" into a new programming paradigm that will be at the center of the language syntax and grammar. Now, I don't know enough about genetics to have an idea how to abstract it...

Hope this helps.

Share this post


Link to post
Share on other sites
Part 1 of what you described we have discussed, and we know we can do it (although the biologists will hopefully have to try to shorten the timeframe).

"2. Given you already have the device described above, abstract the DNA "language" into a new programming paradigm that will be at the center of the language syntax and grammar. Now, I don't know enough about genetics to have an idea how to abstract it..."

This is precisely what I am having trouble with... I think that all these years of programming in C++ have corrupted my mind with one paradigm.

Share this post


Link to post
Share on other sites
I don't think DNA works like that, does it? I mean, each "instruction" of DNA is supposed to code for a particular amino acid to be turned into a specific protein. DNA and living organisms just don't function at a basic level like computers do. I'm sort of getting the impression that you're trying to turn E. Coli into a computer, and I just don't think they work that way.

Now, I may be totally off-base, of course, and I'm not in university yet (although I'm taking a university level bio course right now); this is all just off the top of my head. It's a bit confusing as to what you're trying to do...

Share this post


Link to post
Share on other sites
Well, I recognize that we can't just turn E. coli into a computer. So I'm trying to grasp how to build a language around DNA. To do this I probably need to draw upon a really deep knowledge of molecular biology and computer science. Too bad I'm a junior in high school, so my knowledge in both fields isn't all that great.

What my team is trying to do is develop the ability to program a cell. What I see is:

1) Write some source code in the language I'm trying to develop. Compile it with a compiler (we'll write this as well) and output a DNA sequence.

2) Fabricate DNA

3) Insert DNA in E. Coli

4) Make observations on respective E. Coli cells and make changes to the language.

The idea is that we could theoretically reprogram cells to do what we wanted (to a degree). Will we fully complete this? Probably not. Is it possible for this idea to work? I suspect so.

I don't think the language that we will have to develop will look like any traditional languages like C/C++, Java, LISP, etc. Upon further googling I have determined that this field is related to Systems Biology, although I am working with a Synthetic Biology team.

Share this post


Link to post
Share on other sites
I'm definitely not very knowledgeable about this kind of thing, but I have tried one experiment. Remember the Human Genome Project? Well, as part of the project, you can download a (mostly complete) set of text files listing the A/C/G/T pattern. So, being a programmer, I thought "I can compress this data by using two bits per letter", and hacked together a quick program to do the conversion.

Once it was done, I popped the file open in my hex-editor (which uses an ascii-character-per-byte rather than hexadecimal display), and noticed that the output looked VERY similar to common compressed data that you'd see in a ZIP file or so.

So I tried compressing the output file using WinZip and WinRAR. Usually when you compress something that has very few repetitive patterns in it (or if it's already compressed), the resulting file is just about the same size as the input. And that's what happened with the human genome output.
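For anyone who wants to repeat the experiment, here is a minimal Python sketch of the two-bits-per-letter packing. The bit pattern assigned to each letter and the function names are assumptions (the original program isn't shown), and real genome files also contain N and other symbols that this sketch does not handle.

```python
# Hypothetical 2-bit encoding; the mapping used in the original experiment is unknown.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack_2bit(seq: str) -> bytes:
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpacking stays simple.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack_2bit(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])
```

The length has to be stored separately because the last byte may be padded; `unpack_2bit(pack_2bit(seq), len(seq))` should give back `seq`.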


And while I have no idea how to READ the data, it still freaks the heck out of me to think that it might be some kind of compressed data.


That got me thinking... the DNA is data. What operates on the DNA? (I'm not totally sure... enzymes or something). So it's kind of like a Turing machine model - the DNA is the Turing tape, and all the entities that read/write DNA are Turing machines.

If you guys who're more into biology than I am spend time figuring out the rules to whatever is reading the DNA, you could probably go a LONG way toward figuring out how to write a language that "compiles" into DNA.


This leads to other interesting discussions, like:

- Since DNA isn't an active entity (it's just data), does that mean that comparing DNA between different species might be missing the possibility that whatever is reading the DNA might be performing totally different operations?!

- DNA is susceptible to mutations and other things changing the sequence. But what about the other parts of the cell that are acting on the DNA?

- How do you find out all of the rules that govern how DNA is read? These interactions occur at the molecular scale. How do you scientifically record and analyze a (human?) cell in normal operation?

[Edited by - Nypyren on March 17, 2006 10:40:51 PM]

Share this post


Link to post
Share on other sites
Quote:
Original post by Shannon Barber
Are you trying to perform computations using the cell or trying to write a "DNA compiler" that allows you to code the organism characteristics in a "higher level" language than nucleotide sequences?


I am trying to "code the organism characteristics in a 'higher level' language than nucleotide sequences".

Share this post


Link to post
Share on other sites
Most people already "program" DNA at higher level than sequence, by splicing together cassettes containing either genes or regulatory sequences.

I am interested to know what sort of input you would type into your compiler. Do you picture entering:

print("Hello World!");

and having the E. coli spell out "Hello World!" in lights on the Petri dish?

Assuming that E. coli is the processor and DNA is the program, I think it would help to define more clearly what is input and what is output.



Share this post


Link to post
Share on other sites
Quote:
Original post by Nypyren
(About the fact that DNA is hard to compress).


Any random sequence of data is hard to compress, and this is the case for DNA. You'd have the same trouble compressing an average sequence of coin flips.
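The coin-flip point is easy to check. This is a small Python sketch, assuming zlib as a stand-in for WinZip/WinRAR (it is not the same tool) and using made-up test data: seeded pseudo-random bytes versus a highly repetitive string.

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (about 1.0 means no gain)."""
    return len(zlib.compress(data, 9)) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable
random_bytes = bytes(rng.getrandbits(8) for _ in range(100_000))
repetitive_bytes = b"ACGT" * 25_000  # same length, but one pattern repeated

random_ratio = compression_ratio(random_bytes)
repetitive_ratio = compression_ratio(repetitive_bytes)
print(f"random: {random_ratio:.3f}, repetitive: {repetitive_ratio:.3f}")
```

The random input should stay at roughly its original size while the repetitive one shrinks to a tiny fraction, which is all that "looks like compressed data" really tells you.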

Quote:

That got me thinking... the DNA is data. What operates on the DNA? So it's kind of like a Turing machine model - the DNA is turing tape, and all the entities that read/write DNA are Turing machines.


Essentially the only operation applied to DNA (in eukaryotes) is copying: either replication into more DNA, or transcription into RNA. The RNA is then picked up by ribosomes, which use it as a template for building proteins through a process somewhat similar to copying (translation).

Quote:

- Since DNA isn't an active entity (it's just data), does that mean that comparing DNA between different species might be missing the possibility that whatever is reading the DNA might be performing totally different operations?!


They're not missing it. There are some differences in the actual sequence-to-amino-acid associations. However, most associations are the same for all species.

Quote:

- DNA is susceptible to mutations and other things changing the sequence. But what about the other parts of the cell that are acting on the DNA?


These parts are coded for by the DNA, so they will change as the DNA mutates.

Quote:
- How do you find out all of the rules that govern how DNA is read? These interactions occur at the molecular scale. How do you scientifically record and analyze a (human?) cell in normal operation?


Microscopy. However, a lot of work is done by killing the cell and isolating the important parts (DNA, proteins, ribosomes).

Share this post


Link to post
Share on other sites
OK, well, since I haven't been blasted for being too far out of my depth, and seeing that you're the same age as me, let me fill you in on what my AP Bio textbook has to say about DNA expression (a field that is definitely not fully understood):

There are four types of codons for DNA: Adenine, Thymine, Guanine, and Cytosine. Abbreviations are A, T, G and C codons, respectively. As you probably know, DNA is a double helix, composed of two complementary strands of DNA, with A pairing with T and G with C. Therefore, a sequence that looks like this:

ACCTAGGAC would be paired with this:
TGGATCCTG

Now, if you want to think about the most basic unit of gene expression, that would be a triplet of codons. DNA is read in threes, so the above would code for three amino acids (which make up the proteins coded for by DNA), like this:

ACC TAG GAC

If a single codon, say a G, were inserted as, say, the third codon, it would be read like this instead:

ACG CTA GGA C

As you can see, this will end up coding for completely different amino acids than it did before, and won't make the right protein at all.
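The shift in reading frame can be reproduced with a few lines of Python. This is only a sketch of the example above; the function names are made up for illustration.

```python
def split_triplets(seq: str) -> list:
    """Split a sequence into groups of three, reading from the first base."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def insert_base(seq: str, position: int, base: str) -> str:
    """Insert one base so that it becomes the base at 1-based `position`."""
    index = position - 1
    return seq[:index] + base + seq[index:]

original = "ACCTAGGAC"
shifted = insert_base(original, 3, "G")  # the extra G becomes the third base

print(split_triplets(original))
print(split_triplets(shifted))
```

Every triplet after the insertion point changes, which is why a single extra base is so much more disruptive than swapping one base for another.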

Through the process of transcription (in prokaryotic cells, like E. coli, things work a bit differently), a gene is copied into an RNA sequence that is the complement of the section of DNA copied, except that Thymine is replaced with Uracil. Only one side of the DNA sequence is expressed. For the first above sequence, if this sequence was to be expressed:

ACCTAGGAC the complementary RNA (mRNA) would be:
UGGAUCCUG
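Both pairing rules (DNA to DNA, and DNA to mRNA) are simple character substitutions. Here is a short Python sketch; the function names are invented for illustration, and like the example above it ignores strand direction, which matters in real biology.

```python
# Base-pairing rules as translation tables: A-T and G-C for DNA,
# with U taking the place of T when the copy is RNA.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")

def complement_dna(strand: str) -> str:
    """Return the DNA strand that pairs with `strand`, base by base."""
    return strand.translate(DNA_COMPLEMENT)

def transcribe(template: str) -> str:
    """Return the mRNA complementary to a DNA template strand."""
    return template.translate(RNA_COMPLEMENT)

print(complement_dna("ACCTAGGAC"))
print(transcribe("ACCTAGGAC"))
```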

That's transcription. Translation follows, when the mRNA sequence is "translated" into a protein. A big complex is formed which reads each triplet and turns it into the appropriate amino acid. There are, obviously, 4^3 = 64 possible triplets, so up to 64 kinds of amino acids could be coded for, although there are actually only twenty. Most amino acids are coded for by more than one triplet.

So in this complex, the RNA is scanned and amino acids are added to the forming protein as coded for by the triplets on the RNA. Eventually, the new protein is complete and takes off to do its duty, which is dependent on its molecular makeup and its shape and structure.
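To make the 64-versus-20 counting concrete, here is a Python sketch that builds the standard genetic code as a lookup table. The 64-letter string is the usual compact listing of the standard code in T, C, A, G order (one-letter amino acid codes, `*` for stop); the variable and function names are assumptions for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per triplet, in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each DNA triplet (coding strand) to its amino acid letter.
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def triplets_for(amino_acid: str) -> list:
    """All triplets that code for the given one-letter amino acid (or '*')."""
    return sorted(t for t, aa in CODON_TABLE.items() if aa == amino_acid)

distinct_amino_acids = set(CODON_TABLE.values()) - {"*"}
print(len(CODON_TABLE), len(distinct_amino_acids))
print(triplets_for("F"), triplets_for("W"), triplets_for("*"))
```

Phenylalanine (F) has two triplets, tryptophan (W) only one, and three triplets act as stop signals rather than coding for an amino acid.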


There's lots more to it; the above is a simplification. Also, the process is definitely not fully understood. The issue I see here with your idea is that, unlike with a computer, we don't fully understand the fundamentals of how the process works. You can't really build a house if there isn't a foundation. If you do figure out all of these basics, you're probably in for a Nobel prize.

I would definitely encourage you to look into this further - let me know what you find out. I just don't think you're going to end up with a paradigm similar to that of computer programming at all. This is all off the top of my head, so you'll definitely want to do a *lot* more research if you're truly interested in this.

Share this post


Link to post
Share on other sites
Quote:
Original post by silverphyre673...Biology...


I know, I've been reading through Campbell and Reece's for a couple of weeks. The problem is how to think of this on an abstract level.

Question: In a function-based programming language, we just call a function to do something at any time. In a cell, we can't do that. Everything is created when the cell is created, and activations are controlled by chemical reactions. Where is the jump? The leap in thought, I might say.

Share this post


Link to post
Share on other sites
The first thing that comes to mind is a macro system with first-class macros (basically a compile-time functional language). Not sure if that would be useful, though.

Share this post


Link to post
Share on other sites
A correction is in order first: A codon is the term for a triplet of nucleotide bases. The nucleotides that make up a gene are A, T, G, and C (again, replacing T with U in RNA). Adenine, Thymine, Guanine and Cytosine are not codons, but TTT, GAC, ATG, etc. are. Sorry.

I think the most fundamental issue is to abstract the functions of proteins in the cellular system. You would have to be able to understand the effects of the amino acids making it up, as well as its shape and structure, on the actual cell. This would be really tough; I'm sure you could figure out a few general rules, but you must remember that the cell is a much more complex device than a computer, and that the "programs" (coding for proteins) that you develop for it can actually kill it.

The thing is that we mostly know how a protein is coded for (as I described above). However, we don't know everything - fortunately for you, prokaryotes are much simpler than eukaryotes. In eukaryotes, a large share of the base pairs in their genomes (about 97% is the figure often quoted) don't code for proteins, and we don't really understand what they do.

I'm sure you could pretty easily come up with a programming language that codes for adding the amino acids - the only instruction is adding an amino acid, and this instruction just needs to take one argument, for the type of amino acid to add. Remember, though, that multiple codons can code for the same amino acid - in mRNA, UUU and UUC (or TTT and TTC, the equivalent in DNA) both code for the amino acid phenylalanine. The start codon, AUG, marks the beginning of a gene and also codes for methionine. This amino acid is often removed after the protein is formed. There are three stop codons, UAA, UAG and UGA, all of which mark the end of a gene.

So I suppose I could think of three instructions, one for starting a gene, one for adding an amino acid within a gene, and one for ending a gene. All of these really involve adding a nucleotide sequence, but this is one level of abstraction, albeit a minor one. You might do something like this:

[code]
START_GENE
ADD_ACID Phenylalanine
ADD_ACID Tryptophan
ADD_ACID Valine
END_GENE
[/code]

This would translate to a genetic sequence of

ATGTTTTGGGTTTAG

It could also translate to this, which would theoretically result in the same protein:

ATGTTCTGGGTGTGA
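The three-instruction language above is small enough to prototype. What follows is a hedged Python sketch, not a real tool: `compile_gene`, the `pick` parameter, and the ordering of the codon lists are all invented for illustration, and only the three amino acids from the example are included.

```python
# Hypothetical mini-compiler for the START_GENE / ADD_ACID / END_GENE sketch.
# Only the amino acids used in the example are listed; a real table needs all 20.
CODON_CHOICES = {
    "Phenylalanine": ["TTT", "TTC"],
    "Tryptophan": ["TGG"],
    "Valine": ["GTT", "GTG", "GTC", "GTA"],
}
START_CODON = "ATG"
STOP_CODONS = ["TAG", "TGA", "TAA"]

def compile_gene(source: str, pick: int = 0) -> str:
    """Translate the toy instructions into a DNA string.

    `pick` selects which synonymous codon to use (wrapping around). It is an
    invented knob that shows several outputs can encode the same protein.
    """
    dna = []
    for line in source.strip().splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines inside the program
        op = parts[0]
        if op == "START_GENE":
            dna.append(START_CODON)
        elif op == "ADD_ACID":
            choices = CODON_CHOICES[parts[1]]
            dna.append(choices[pick % len(choices)])
        elif op == "END_GENE":
            dna.append(STOP_CODONS[pick % len(STOP_CODONS)])
        else:
            raise ValueError(f"unknown instruction: {line!r}")
    return "".join(dna)

PROGRAM = """
START_GENE
ADD_ACID Phenylalanine
ADD_ACID Tryptophan
ADD_ACID Valine
END_GENE
"""

print(compile_gene(PROGRAM, pick=0))
print(compile_gene(PROGRAM, pick=1))
```

With `pick=0` this reproduces the first sequence given above. A different `pick` swaps in synonymous codons, so the DNA changes while the encoded amino acids stay the same; whether the cell actually treats the two versions identically is a separate, experimental question.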

Because you'd need to figure out how the basics of DNA work, and whether your rules for abstraction really work, you would need to fully automate the process of coding the protein, adding it to the cell, and then seeing how the gene was expressed.

The most people have really been doing with this lately has been taking genes from one organism and adding them to another. A couple of examples that I can think of that have been done include putting a gene from fireflies into tobacco and making the plant glow, making plants glow when they need watering by adding a gene from a fluorescent jellyfish, and adding a gene that codes for insulin production to bacteria and producing the substance that way.

We haven't really been creating our own genes yet because we don't understand most of what goes on "under the hood." It's really, really complex, and depends both on how DNA expression works in general and on the intricacies of how the organism in question works (of course, the makeup of the organism is coded for by DNA too, so it all comes down to how genes interact with other genes).

If you do figure this out, please make sure to mention me when you get the Nobel prize [grin]. I'd definitely talk to a professor about this. Good luck.

Share this post


Link to post
Share on other sites
Thanks, I've actually decided to base the language around membrane-receptor logic gate structures. Yes, I am indeed getting ready to discuss this with many professors. Thanks again!

Share this post


Link to post
Share on other sites
