So I started to implement translation process (from RNA/DNA sequence to protein sequence).
First of all, I needed to prepare codon table – instruction what three nucleotides code what amino acid. I managed that with Python dictionary. I also prepared backwards dictionary (for each amino acid a list of codons that encodes it).
However, when I started to implement the functionality, couple of questions appeared. What reading frame should be taken? Single or all? Should the sequence be translated on the whole length or from first start codon? From the first start codon or all start codons? Should it end with stop codon or be translated to the end of the input sequence with a gap representing stop codon? Should it be translated for one strand or also for reversed sequence?
I decided to check out existing solutions:
One chosen reading frame (or choice of the sequence by uppercase), option to reverse the sequence and possibility to change genetic code on your own (I don’t think it’s necessary)
Same but with multiple choice of genetic codes.
Multiple choice of genetic codes.
One reading frame, choice of genetic codes, trim, reverse options
all reading frames in both directions
I still don’t have all the answers but I decided to leave the decisions to the user for most of the options (so I have to implement all the options). I didn’t plan to, but now I think that implementing various genetic codes is necessary.
I started with finding start codons with regex and translating the sequence in three reading frames. So far so good. Next project-post, I think it will be done.