Translate nucleic acid sequence into its corresponding amino acid sequence












13














Goal of the program



The goal of the program is to translate a nucleic acid sequence into its corresponding amino acid sequence. The nucleic sequences
have to be formatted in a specific format called fasta. There is an existing implementation of this program in C here: emboss transeq



Fasta format



A fasta file looks like this:



>sequenceId comment
nucleic sequence


for example:



>Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

>Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC
AATCAACATAAAA


A nucleic sequence is a string composed of the letters A, C, T, G, U, N



Expected output



A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A



With the Seq1 define above:



codon      CCT TTA TCT AAT CTT TGG AGC ATG ...
amino acid P L S N L W S M ...


The expected output is:



>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY
>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
GRYRPKXPNPSRTXPTRSPSGRRQST*X


Specific rules



The program takes 4 parameters:





  • clean : if true, write STOP codon as X instead of *


  • trim : if true, remove all X and * chars from the right end of the amino - acid sequence


  • alternative: different way to compute reverse frame


  • frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]


The frame define the position in the nucleic sequence to start from:



-frame 1
|-frame 2
||-frame 3
|||
CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC
|||
||-frame -1
|-frame -2
-frame -3


frame "F" = "1", "2", "3"
frame "R" = "-1", "-2", "-3"
frame "6" = "1", "2", "3", "-1", "-2", "-3"


In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame



For example if the program is used with frame=6, the expected output is



>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY
>Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT
>Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX
>Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R
>Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX
>Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
*NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG
>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
GRYRPKXPNPSRTXPTRSPSGRRQST*X
>Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
VGTALXLLIRAELXQPGALLGDDNQHKX
>Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
*VPP*XS*SEQNXANPEPFWETTINIK
>Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
LC*LSSPRRAPGWXSSARIRXLRAVPT
>Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
FMLIVVSQKGSGLA*FCSD*EX*GGTYX
>Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
FYVDCRLPEGLRVGXVLLGLGXLGRYLP


Improvements




  • Performance improvements

  • Is the code easy to read / understand ?

  • Is the code idiomatic ?


( full project with tests + flags handling is avaible on github: gotranseq)



code :



package transeq

import (
"bufio"
"bytes"
"context"
"encoding/binary"
"fmt"
"io"
"runtime"
"sync"
)

var (
letterCode = map[byte]uint8{
'A': aCode,
'C': cCode,
'T': tCode,
'G': gCode,
'N': nCode,
'U': uCode,
}
standard = map[string]byte{
"TTT": 'F',
"TCT": 'S',
"TAT": 'Y',
"TGT": 'C',
"TTC": 'F',
"TCC": 'S',
"TAC": 'Y',
"TGC": 'C',
"TTA": 'L',
"TCA": 'S',
"TAA": '*',
"TGA": '*',
"TTG": 'L',
"TCG": 'S',
"TAG": '*',
"TGG": 'W',
"CTT": 'L',
"CCT": 'P',
"CAT": 'H',
"CGT": 'R',
"CTC": 'L',
"CCC": 'P',
"CAC": 'H',
"CGC": 'R',
"CTA": 'L',
"CCA": 'P',
"CAA": 'Q',
"CGA": 'R',
"CTG": 'L',
"CCG": 'P',
"CAG": 'Q',
"CGG": 'R',
"ATT": 'I',
"ACT": 'T',
"AAT": 'N',
"AGT": 'S',
"ATC": 'I',
"ACC": 'T',
"AAC": 'N',
"AGC": 'S',
"ATA": 'I',
"ACA": 'T',
"AAA": 'K',
"AGA": 'R',
"ATG": 'M',
"ACG": 'T',
"AAG": 'K',
"AGG": 'R',
"GTT": 'V',
"GCT": 'A',
"GAT": 'D',
"GGT": 'G',
"GTC": 'V',
"GCC": 'A',
"GAC": 'D',
"GGC": 'G',
"GTA": 'V',
"GCA": 'A',
"GAA": 'E',
"GGA": 'G',
"GTG": 'V',
"GCG": 'A',
"GAG": 'E',
"GGG": 'G',
}
)

const (
nCode = uint8(0)
aCode = uint8(1)
cCode = uint8(2)
tCode = uint8(3)
uCode = uint8(3)
gCode = uint8(4)

stopByte = '*'
unknown = 'X'
// Length of the array to store code/bytes
// uses gCode because it's the biggest uint8 of all codes
arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
)

func createCodeArray(clean bool) byte {

resultMap := map[uint32]byte{}
twoLetterMap := map[string]byte{}

tmpCode := make(uint8, 4)

for codon, aaCode := range standard {
// generate 3 letter code
for i := 0; i < 3; i++ {
tmpCode[i] = letterCode[codon[i]]
}
// each codon is represented by an unique uint32:
// each possible nucleotide is represented by an uint8 (255 possibility)
// the three first bytes are the the code for each nucleotide
// last byte is unused ( eq to uint8(0) )
// example:
// codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16
uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16
resultMap[uint32Code] = aaCode

// generate 2 letter code
codes, ok := twoLetterMap[codon[0:2]]
if !ok {
twoLetterMap[codon[0:2]] = byte{aaCode}
} else {
twoLetterMap[codon[0:2]] = append(codes, aaCode)
}
}
for twoLetterCodon, codes := range twoLetterMap {
uniqueAA := true
for i := 0; i < len(codes); i++ {

if codes[i] != codes[0] {
uniqueAA = false
}
}
if uniqueAA {
first := letterCode[twoLetterCodon[0]]
second := letterCode[twoLetterCodon[1]]

uint32Code := uint32(first) | uint32(second)<<8
resultMap[uint32Code] = codes[0]
}
}
// if clean is specified, we want to replace all '*' by 'X' in the output
// sequence, so replace all occurrences of '*' directly in the ref map
if clean {
for k, v := range resultMap {
if v == stopByte {
resultMap[k] = unknown
}
}
}
r := make(byte, arrayCodeSize)
for k, v := range resultMap {
r[k] = v
}
return r
}

func computeFrames(frameName string) (frames int, reverse bool, err error) {

frames = make(int, 6)
reverse = false

switch frameName {
case "1":
frames[0] = 1
case "2":
frames[1] = 1
case "3":
frames[2] = 1
case "F":
for i := 0; i < 3; i++ {
frames[i] = 1
}
case "-1":
frames[3] = 1
reverse = true
case "-2":
frames[4] = 1
reverse = true
case "-3":
frames[5] = 1
reverse = true
case "R":
for i := 3; i < 6; i++ {
frames[i] = 1
}
reverse = true
case "6":
for i := range frames {
frames[i] = 1
}
reverse = true
default:
err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)
}
return frames, reverse, err
}

type writer struct {
buf *bytes.Buffer
currentLineLen int
bytesToTrim int
}

func (w *writer) addByte(b byte) {
w.buf.WriteByte(b)
w.currentLineLen++
if b == stopByte || b == unknown {
w.bytesToTrim++
} else {
w.bytesToTrim = 0
}
}

func (w *writer) addUnknown() {
w.buf.WriteByte(unknown)
w.currentLineLen++
w.bytesToTrim++
}

func (w *writer) newLine() {
w.buf.WriteByte('n')
w.currentLineLen = 0
w.bytesToTrim++
}

const (
// size of the buffer for writing to file
maxBufferSize = 1024 * 1024 * 30
// max line size for sequence
maxLineSize = 60
// suffixes ta add to sequence id for each frame
suffixes = "123456"
)

// Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame
func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {

arrayCode := createCodeArray(clean)
framesToGenerate, reverse, err := computeFrames(frame)
if err != nil {
return err
}

fnaSequences := make(chan encodedSequence, 10)
errs := make(chan error, 1)

var wg sync.WaitGroup
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {

wg.Add(1)

go func() {

defer wg.Done()

startPosition := make(int, 3)
w := &writer{
buf: bytes.NewBuffer(nil),
bytesToTrim: 0,
currentLineLen: 0,
}

for sequence := range fnaSequences {

select {
case <-ctx.Done():
return
default:
}

frameIndex := 0
startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
nuclSeqLength := len(sequence) - idSize

Translate:
for _, startPos := range startPosition {

if framesToGenerate[frameIndex] == 0 {
frameIndex++
continue
}

// sequence id should look like
// >sequenceID_<frame> comment
idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
if idEnd != -1 {
w.buf.Write(sequence[4 : 4+idEnd])
w.buf.WriteByte('_')
w.buf.WriteByte(suffixes[frameIndex])
w.buf.Write(sequence[4+idEnd : idSize])
} else {
w.buf.Write(sequence[4:idSize])
w.buf.WriteByte('_')
w.buf.WriteByte(suffixes[frameIndex])
}
w.newLine()

// if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
// from right end of the sequence)
w.bytesToTrim = 0
w.currentLineLen = 0

// read the sequence 3 letters at a time, starting at a specific position
// corresponding to the frame
for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

if w.currentLineLen == maxLineSize {
w.newLine()
}
// create an uint32 from the codon, to retrieve the corresponding
// AA from the map
codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

b := arrayCode[codonCode]
if b != byte(0) {
w.addByte(b)
} else {
w.addUnknown()
}
}

// the last codon is only 2 nucleotid long, try to guess
// the corresponding AA
if (nuclSeqLength-startPos)%3 == 2 {

if w.currentLineLen == maxLineSize {
w.newLine()
}
codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

b := arrayCode[codonCode]
if b != byte(0) {
w.addByte(b)
} else {
w.addUnknown()
}
}

// the last codon is only 1 nucleotid long, no way to guess
// the corresponding AA
if (nuclSeqLength-startPos)%3 == 1 {
if w.currentLineLen == maxLineSize {
w.newLine()
}
w.addUnknown()
}

if trim && w.bytesToTrim > 0 {
// remove the last bytesToTrim bytes of the buffer
// as they are 'X', '*' or 'n'
w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
w.currentLineLen -= w.bytesToTrim
}

if w.currentLineLen != 0 {
w.newLine()
}
frameIndex++
}

if reverse && frameIndex < 6 {

// get the complementary sequence.
// Basically, switch
// A <-> T
// C <-> G
// N is not modified
for i, n := range sequence[idSize:] {

switch n {
case aCode:
sequence[i+idSize] = tCode
case tCode:
// handle both tCode and uCode
sequence[i+idSize] = aCode
case cCode:
sequence[i+idSize] = gCode
case gCode:
sequence[i+idSize] = cCode
default:
//case N -> leave it
}
}
// reverse the sequence
for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
sequence[i], sequence[j] = sequence[j], sequence[i]
}

if !alternative {
// Staden convention: Frame -1 is the reverse-complement of the sequence
// having the same codon phase as frame 1. Frame -2 is the same phase as
// frame 2. Frame -3 is the same phase as frame 3
//
// use the matrix to keep track of the forward frame as it depends on the
// length of the sequence
switch nuclSeqLength % 3 {
case 0:
startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
case 1:
startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
case 2:
startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
}
}
// run the same loop, but with the reverse-complemented sequence
goto Translate
}

if w.buf.Len() > maxBufferSize {
_, err := out.Write(w.buf.Bytes())
if err != nil {
select {
case errs <- fmt.Errorf("fail to write to output file: %v", err):
default:
}
cancel()
return
}
w.buf.Reset()
}
pool.Put(sequence)
}

if w.buf.Len() > 0 {
_, err := out.Write(w.buf.Bytes())
if err != nil {
select {
case errs <- fmt.Errorf("fail to write to output file: %v", err):
default:
}
cancel()
return
}
}
}()
}
readSequenceFromFasta(ctx, inputSequence, fnaSequences)

wg.Wait()
select {
case err, ok := <-errs:
if ok {
return err
}
default:
}
return nil
}

func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {

feeder := &fastaChannelFeeder{
idBuffer: bytes.NewBuffer(nil),
commentBuffer: bytes.NewBuffer(nil),
sequenceBuffer: bytes.NewBuffer(nil),
fastaChan: fnaSequences,
}
// fasta format is:
//
// >sequenceID some comments on sequence
// ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT
// TTTGCGGTCACATGACGACTTCGGCAGCGA
//
// see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
// section 1 for details
scanner := bufio.NewScanner(inputSequence)
Loop:
for scanner.Scan() {

line := scanner.Bytes()
if len(line) == 0 {
continue
}
if line[0] == '>' {

if feeder.idBuffer.Len() > 0 {
select {
case <-ctx.Done():
break Loop
default:
}
feeder.sendFasta()
}
feeder.reset()

// parse the ID of the sequence. ID is formatted like this:
// >sequenceID comments
seqID := bytes.SplitN(line, byte{' '}, 2)
feeder.idBuffer.Write(seqID[0])

if len(seqID) > 1 {
feeder.commentBuffer.WriteByte(' ')
feeder.commentBuffer.Write(seqID[1])
}
} else {
// if the line doesn't start with '>', then it's a part of the
// nucleotide sequence, so write it to the buffer
feeder.sequenceBuffer.Write(line)
}
}
// don't forget to push last sequence
select {
case <-ctx.Done():
default:
feeder.sendFasta()
}
close(fnaSequences)
}

// a type to hold an encoded fasta sequence
//
// s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)
// s[4:idSize] stores the sequence id, and the comment id there is one
// s[idSize:] stores the nucl sequence
type encodedSequence byte

var pool = sync.Pool{
New: func() interface{} {
return make(encodedSequence, 512)
},
}

func getSizedSlice(idSize, requiredSize int) encodedSequence {
s := pool.Get().(encodedSequence)
binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))

for len(s) < requiredSize {
s = append(s, byte(0))
}
return s[0:requiredSize]
}

func (f *fastaChannelFeeder) sendFasta() {

idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()
requiredSize := idSize + f.sequenceBuffer.Len()

s := getSizedSlice(idSize, requiredSize)

if f.commentBuffer.Len() > 0 {
copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())
}

copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())

// convert the sequence of bytes to an array of uint8 codes,
// so a codon (3 nucleotides | 3 bytes ) can be represented
// as an uint32
for i, b := range f.sequenceBuffer.Bytes() {

switch b {
case 'A':
s[i+idSize] = aCode
case 'C':
s[i+idSize] = cCode
case 'G':
s[i+idSize] = gCode
case 'T', 'U':
s[i+idSize] = tCode
case 'N':
s[i+idSize] = nCode
default:
fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))
}
}
f.fastaChan <- s
}

type fastaChannelFeeder struct {
idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer
fastaChan chan encodedSequence
}

func (f *fastaChannelFeeder) reset() {
f.idBuffer.Reset()
f.sequenceBuffer.Reset()
f.commentBuffer.Reset()
}









share|improve this question

















This question has an open bounty worth +500
reputation from felix ending in 6 days.


This question has not received enough attention.





















    13














    Goal of the program



    The goal of the program is to translate a nucleic acid sequence into its corresponding amino acid sequence. The nucleic sequences
    have to be formatted in a specific format called fasta. There is an existing implementation of this program in C here: emboss transeq



    Fasta format



    A fasta file looks like this:



    >sequenceId comment
    nucleic sequence


    for example:



    >Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
    CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
    CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

    >Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
    GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC
    AATCAACATAAAA


    A nucleic sequence is a string composed of the letters A, C, T, G, U, N



    Expected output



    A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A



    With the Seq1 define above:



    codon      CCT TTA TCT AAT CTT TGG AGC ATG ...
    amino acid P L S N L W S M ...


    The expected output is:



    >Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
    PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY
    >Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
    GRYRPKXPNPSRTXPTRSPSGRRQST*X


    Specific rules



    The program takes 4 parameters:





    • clean : if true, write STOP codon as X instead of *


    • trim : if true, remove all X and * chars from the right end of the amino - acid sequence


    • alternative: different way to compute reverse frame


    • frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]


    The frame define the position in the nucleic sequence to start from:



    -frame 1
    |-frame 2
    ||-frame 3
    |||
    CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
    CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC
    |||
    ||-frame -1
    |-frame -2
    -frame -3


    frame "F" = "1", "2", "3"
    frame "R" = "-1", "-2", "-3"
    frame "6" = "1", "2", "3", "-1", "-2", "-3"


    In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame



    For example if the program is used with frame=6, the expected output is



    >Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
    PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY
    >Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
    LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT
    >Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
    FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX
    >Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
    VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R
    >Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
    KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX
    >Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
    *NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG
    >Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
    GRYRPKXPNPSRTXPTRSPSGRRQST*X
    >Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
    VGTALXLLIRAELXQPGALLGDDNQHKX
    >Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
    *VPP*XS*SEQNXANPEPFWETTINIK
    >Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
    LC*LSSPRRAPGWXSSARIRXLRAVPT
    >Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
    FMLIVVSQKGSGLA*FCSD*EX*GGTYX
    >Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
    FYVDCRLPEGLRVGXVLLGLGXLGRYLP


    Improvements




    • Performance improvements

    • Is the code easy to read / understand ?

    • Is the code idiomatic ?


    ( full project with tests + flags handling is avaible on github: gotranseq)



    code :



    package transeq

    import (
    "bufio"
    "bytes"
    "context"
    "encoding/binary"
    "fmt"
    "io"
    "runtime"
    "sync"
    )

    var (
    letterCode = map[byte]uint8{
    'A': aCode,
    'C': cCode,
    'T': tCode,
    'G': gCode,
    'N': nCode,
    'U': uCode,
    }
    standard = map[string]byte{
    "TTT": 'F',
    "TCT": 'S',
    "TAT": 'Y',
    "TGT": 'C',
    "TTC": 'F',
    "TCC": 'S',
    "TAC": 'Y',
    "TGC": 'C',
    "TTA": 'L',
    "TCA": 'S',
    "TAA": '*',
    "TGA": '*',
    "TTG": 'L',
    "TCG": 'S',
    "TAG": '*',
    "TGG": 'W',
    "CTT": 'L',
    "CCT": 'P',
    "CAT": 'H',
    "CGT": 'R',
    "CTC": 'L',
    "CCC": 'P',
    "CAC": 'H',
    "CGC": 'R',
    "CTA": 'L',
    "CCA": 'P',
    "CAA": 'Q',
    "CGA": 'R',
    "CTG": 'L',
    "CCG": 'P',
    "CAG": 'Q',
    "CGG": 'R',
    "ATT": 'I',
    "ACT": 'T',
    "AAT": 'N',
    "AGT": 'S',
    "ATC": 'I',
    "ACC": 'T',
    "AAC": 'N',
    "AGC": 'S',
    "ATA": 'I',
    "ACA": 'T',
    "AAA": 'K',
    "AGA": 'R',
    "ATG": 'M',
    "ACG": 'T',
    "AAG": 'K',
    "AGG": 'R',
    "GTT": 'V',
    "GCT": 'A',
    "GAT": 'D',
    "GGT": 'G',
    "GTC": 'V',
    "GCC": 'A',
    "GAC": 'D',
    "GGC": 'G',
    "GTA": 'V',
    "GCA": 'A',
    "GAA": 'E',
    "GGA": 'G',
    "GTG": 'V',
    "GCG": 'A',
    "GAG": 'E',
    "GGG": 'G',
    }
    )

    const (
    nCode = uint8(0)
    aCode = uint8(1)
    cCode = uint8(2)
    tCode = uint8(3)
    uCode = uint8(3)
    gCode = uint8(4)

    stopByte = '*'
    unknown = 'X'
    // Length of the array to store code/bytes
    // uses gCode because it's the biggest uint8 of all codes
    arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
    )

    func createCodeArray(clean bool) byte {

    resultMap := map[uint32]byte{}
    twoLetterMap := map[string]byte{}

    tmpCode := make(uint8, 4)

    for codon, aaCode := range standard {
    // generate 3 letter code
    for i := 0; i < 3; i++ {
    tmpCode[i] = letterCode[codon[i]]
    }
    // each codon is represented by an unique uint32:
    // each possible nucleotide is represented by an uint8 (255 possibility)
    // the three first bytes are the the code for each nucleotide
    // last byte is unused ( eq to uint8(0) )
    // example:
    // codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16
    uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16
    resultMap[uint32Code] = aaCode

    // generate 2 letter code
    codes, ok := twoLetterMap[codon[0:2]]
    if !ok {
    twoLetterMap[codon[0:2]] = byte{aaCode}
    } else {
    twoLetterMap[codon[0:2]] = append(codes, aaCode)
    }
    }
    for twoLetterCodon, codes := range twoLetterMap {
    uniqueAA := true
    for i := 0; i < len(codes); i++ {

    if codes[i] != codes[0] {
    uniqueAA = false
    }
    }
    if uniqueAA {
    first := letterCode[twoLetterCodon[0]]
    second := letterCode[twoLetterCodon[1]]

    uint32Code := uint32(first) | uint32(second)<<8
    resultMap[uint32Code] = codes[0]
    }
    }
    // if clean is specified, we want to replace all '*' by 'X' in the output
    // sequence, so replace all occurrences of '*' directly in the ref map
    if clean {
    for k, v := range resultMap {
    if v == stopByte {
    resultMap[k] = unknown
    }
    }
    }
    r := make(byte, arrayCodeSize)
    for k, v := range resultMap {
    r[k] = v
    }
    return r
    }

    func computeFrames(frameName string) (frames int, reverse bool, err error) {

    frames = make(int, 6)
    reverse = false

    switch frameName {
    case "1":
    frames[0] = 1
    case "2":
    frames[1] = 1
    case "3":
    frames[2] = 1
    case "F":
    for i := 0; i < 3; i++ {
    frames[i] = 1
    }
    case "-1":
    frames[3] = 1
    reverse = true
    case "-2":
    frames[4] = 1
    reverse = true
    case "-3":
    frames[5] = 1
    reverse = true
    case "R":
    for i := 3; i < 6; i++ {
    frames[i] = 1
    }
    reverse = true
    case "6":
    for i := range frames {
    frames[i] = 1
    }
    reverse = true
    default:
    err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)
    }
    return frames, reverse, err
    }

    type writer struct {
    buf *bytes.Buffer
    currentLineLen int
    bytesToTrim int
    }

    func (w *writer) addByte(b byte) {
    w.buf.WriteByte(b)
    w.currentLineLen++
    if b == stopByte || b == unknown {
    w.bytesToTrim++
    } else {
    w.bytesToTrim = 0
    }
    }

    func (w *writer) addUnknown() {
    w.buf.WriteByte(unknown)
    w.currentLineLen++
    w.bytesToTrim++
    }

    func (w *writer) newLine() {
    w.buf.WriteByte('n')
    w.currentLineLen = 0
    w.bytesToTrim++
    }

    const (
    // size of the buffer for writing to file
    maxBufferSize = 1024 * 1024 * 30
    // max line size for sequence
    maxLineSize = 60
    // suffixes ta add to sequence id for each frame
    suffixes = "123456"
    )

    // Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame
    func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {

    arrayCode := createCodeArray(clean)
    framesToGenerate, reverse, err := computeFrames(frame)
    if err != nil {
    return err
    }

    fnaSequences := make(chan encodedSequence, 10)
    errs := make(chan error, 1)

    var wg sync.WaitGroup
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {

    wg.Add(1)

    go func() {

    defer wg.Done()

    startPosition := make(int, 3)
    w := &writer{
    buf: bytes.NewBuffer(nil),
    bytesToTrim: 0,
    currentLineLen: 0,
    }

    for sequence := range fnaSequences {

    select {
    case <-ctx.Done():
    return
    default:
    }

    frameIndex := 0
    startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

    idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
    nuclSeqLength := len(sequence) - idSize

    Translate:
    for _, startPos := range startPosition {

    if framesToGenerate[frameIndex] == 0 {
    frameIndex++
    continue
    }

    // sequence id should look like
    // >sequenceID_<frame> comment
    idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
    if idEnd != -1 {
    w.buf.Write(sequence[4 : 4+idEnd])
    w.buf.WriteByte('_')
    w.buf.WriteByte(suffixes[frameIndex])
    w.buf.Write(sequence[4+idEnd : idSize])
    } else {
    w.buf.Write(sequence[4:idSize])
    w.buf.WriteByte('_')
    w.buf.WriteByte(suffixes[frameIndex])
    }
    w.newLine()

    // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
    // from right end of the sequence)
    w.bytesToTrim = 0
    w.currentLineLen = 0

    // read the sequence 3 letters at a time, starting at a specific position
    // corresponding to the frame
    for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

    if w.currentLineLen == maxLineSize {
    w.newLine()
    }
    // create an uint32 from the codon, to retrieve the corresponding
    // AA from the map
    codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

    b := arrayCode[codonCode]
    if b != byte(0) {
    w.addByte(b)
    } else {
    w.addUnknown()
    }
    }

    // the last codon is only 2 nucleotid long, try to guess
    // the corresponding AA
    if (nuclSeqLength-startPos)%3 == 2 {

    if w.currentLineLen == maxLineSize {
    w.newLine()
    }
    codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

    b := arrayCode[codonCode]
    if b != byte(0) {
    w.addByte(b)
    } else {
    w.addUnknown()
    }
    }

    // the last codon is only 1 nucleotid long, no way to guess
    // the corresponding AA
    if (nuclSeqLength-startPos)%3 == 1 {
    if w.currentLineLen == maxLineSize {
    w.newLine()
    }
    w.addUnknown()
    }

    if trim && w.bytesToTrim > 0 {
    // remove the last bytesToTrim bytes of the buffer
    // as they are 'X', '*' or 'n'
    w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
    w.currentLineLen -= w.bytesToTrim
    }

    if w.currentLineLen != 0 {
    w.newLine()
    }
    frameIndex++
    }

    if reverse && frameIndex < 6 {

    // get the complementary sequence.
    // Basically, switch
    // A <-> T
    // C <-> G
    // N is not modified
    for i, n := range sequence[idSize:] {

    switch n {
    case aCode:
    sequence[i+idSize] = tCode
    case tCode:
    // handle both tCode and uCode
    sequence[i+idSize] = aCode
    case cCode:
    sequence[i+idSize] = gCode
    case gCode:
    sequence[i+idSize] = cCode
    default:
    //case N -> leave it
    }
    }
    // reverse the sequence
    for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
    sequence[i], sequence[j] = sequence[j], sequence[i]
    }

    if !alternative {
    // Staden convention: Frame -1 is the reverse-complement of the sequence
    // having the same codon phase as frame 1. Frame -2 is the same phase as
    // frame 2. Frame -3 is the same phase as frame 3
    //
    // use the matrix to keep track of the forward frame as it depends on the
    // length of the sequence
    switch nuclSeqLength % 3 {
    case 0:
    startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
    case 1:
    startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
    case 2:
    startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
    }
    }
    // run the same loop, but with the reverse-complemented sequence
    goto Translate
    }

    if w.buf.Len() > maxBufferSize {
    _, err := out.Write(w.buf.Bytes())
    if err != nil {
    select {
    case errs <- fmt.Errorf("fail to write to output file: %v", err):
    default:
    }
    cancel()
    return
    }
    w.buf.Reset()
    }
    pool.Put(sequence)
    }

    if w.buf.Len() > 0 {
    _, err := out.Write(w.buf.Bytes())
    if err != nil {
    select {
    case errs <- fmt.Errorf("fail to write to output file: %v", err):
    default:
    }
    cancel()
    return
    }
    }
    }()
    }
    readSequenceFromFasta(ctx, inputSequence, fnaSequences)

    wg.Wait()
    select {
    case err, ok := <-errs:
    if ok {
    return err
    }
    default:
    }
    return nil
    }

    func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {

    feeder := &fastaChannelFeeder{
    idBuffer: bytes.NewBuffer(nil),
    commentBuffer: bytes.NewBuffer(nil),
    sequenceBuffer: bytes.NewBuffer(nil),
    fastaChan: fnaSequences,
    }
    // fasta format is:
    //
    // >sequenceID some comments on sequence
    // ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT
    // TTTGCGGTCACATGACGACTTCGGCAGCGA
    //
    // see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
    // section 1 for details
    scanner := bufio.NewScanner(inputSequence)
    Loop:
    for scanner.Scan() {

    line := scanner.Bytes()
    if len(line) == 0 {
    continue
    }
    if line[0] == '>' {

    if feeder.idBuffer.Len() > 0 {
    select {
    case <-ctx.Done():
    break Loop
    default:
    }
    feeder.sendFasta()
    }
    feeder.reset()

    // parse the ID of the sequence. ID is formatted like this:
    // >sequenceID comments
    seqID := bytes.SplitN(line, byte{' '}, 2)
    feeder.idBuffer.Write(seqID[0])

    if len(seqID) > 1 {
    feeder.commentBuffer.WriteByte(' ')
    feeder.commentBuffer.Write(seqID[1])
    }
    } else {
    // if the line doesn't start with '>', then it's a part of the
    // nucleotide sequence, so write it to the buffer
    feeder.sequenceBuffer.Write(line)
    }
    }
    // don't forget to push last sequence
    select {
    case <-ctx.Done():
    default:
    feeder.sendFasta()
    }
    close(fnaSequences)
    }

    // a type to hold an encoded fasta sequence
    //
    // s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)
    // s[4:idSize] stores the sequence id, and the comment id there is one
    // s[idSize:] stores the nucl sequence
    type encodedSequence byte

    var pool = sync.Pool{
    New: func() interface{} {
    return make(encodedSequence, 512)
    },
    }

    func getSizedSlice(idSize, requiredSize int) encodedSequence {
    s := pool.Get().(encodedSequence)
    binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))

    for len(s) < requiredSize {
    s = append(s, byte(0))
    }
    return s[0:requiredSize]
    }

    func (f *fastaChannelFeeder) sendFasta() {

    idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()
    requiredSize := idSize + f.sequenceBuffer.Len()

    s := getSizedSlice(idSize, requiredSize)

    if f.commentBuffer.Len() > 0 {
    copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())
    }

    copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())

    // convert the sequence of bytes to an array of uint8 codes,
    // so a codon (3 nucleotides | 3 bytes ) can be represented
    // as an uint32
    for i, b := range f.sequenceBuffer.Bytes() {

    switch b {
    case 'A':
    s[i+idSize] = aCode
    case 'C':
    s[i+idSize] = cCode
    case 'G':
    s[i+idSize] = gCode
    case 'T', 'U':
    s[i+idSize] = tCode
    case 'N':
    s[i+idSize] = nCode
    default:
    fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))
    }
    }
    f.fastaChan <- s
    }

    type fastaChannelFeeder struct {
    idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer
    fastaChan chan encodedSequence
    }

    func (f *fastaChannelFeeder) reset() {
    f.idBuffer.Reset()
    f.sequenceBuffer.Reset()
    f.commentBuffer.Reset()
    }









    share|improve this question

















    This question has an open bounty worth +500
    reputation from felix ending in 6 days.


    This question has not received enough attention.



















      13












      13








      13


      2





      Goal of the program



      The goal of the program is to translate a nucleic acid sequence into its corresponding amino acid sequence. The nucleic sequences
      have to be formatted in a specific format called fasta. There is an existing implementation of this program in C here: emboss transeq



      Fasta format



      A fasta file looks like this:



      >sequenceId comment
      nucleic sequence


      for example:



      >Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
      CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

      >Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC
      AATCAACATAAAA


      A nucleic sequence is a string composed of the letters A, C, T, G, U, N



      Expected output



      A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A



      With the Seq1 define above:



      codon      CCT TTA TCT AAT CTT TGG AGC ATG ...
      amino acid P L S N L W S M ...


      The expected output is:



      >Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY
      >Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      GRYRPKXPNPSRTXPTRSPSGRRQST*X


      Specific rules



      The program takes 4 parameters:





      • clean : if true, write STOP codon as X instead of *


      • trim : if true, remove all X and * chars from the right end of the amino - acid sequence


      • alternative: different way to compute reverse frame


      • frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]


      The frame define the position in the nucleic sequence to start from:



      -frame 1
      |-frame 2
      ||-frame 3
      |||
      CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
      CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC
      |||
      ||-frame -1
      |-frame -2
      -frame -3


      frame "F" = "1", "2", "3"
      frame "R" = "-1", "-2", "-3"
      frame "6" = "1", "2", "3", "-1", "-2", "-3"


      In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame



      For example if the program is used with frame=6, the expected output is



      >Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY
      >Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT
      >Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX
      >Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R
      >Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX
      >Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      *NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG
      >Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      GRYRPKXPNPSRTXPTRSPSGRRQST*X
      >Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      VGTALXLLIRAELXQPGALLGDDNQHKX
      >Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      *VPP*XS*SEQNXANPEPFWETTINIK
      >Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      LC*LSSPRRAPGWXSSARIRXLRAVPT
      >Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      FMLIVVSQKGSGLA*FCSD*EX*GGTYX
      >Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      FYVDCRLPEGLRVGXVLLGLGXLGRYLP


      Improvements




      • Performance improvements

      • Is the code easy to read / understand ?

      • Is the code idiomatic ?


      ( full project with tests + flags handling is avaible on github: gotranseq)



      code :



      package transeq

      import (
      "bufio"
      "bytes"
      "context"
      "encoding/binary"
      "fmt"
      "io"
      "runtime"
      "sync"
      )

      var (
      letterCode = map[byte]uint8{
      'A': aCode,
      'C': cCode,
      'T': tCode,
      'G': gCode,
      'N': nCode,
      'U': uCode,
      }
      standard = map[string]byte{
      "TTT": 'F',
      "TCT": 'S',
      "TAT": 'Y',
      "TGT": 'C',
      "TTC": 'F',
      "TCC": 'S',
      "TAC": 'Y',
      "TGC": 'C',
      "TTA": 'L',
      "TCA": 'S',
      "TAA": '*',
      "TGA": '*',
      "TTG": 'L',
      "TCG": 'S',
      "TAG": '*',
      "TGG": 'W',
      "CTT": 'L',
      "CCT": 'P',
      "CAT": 'H',
      "CGT": 'R',
      "CTC": 'L',
      "CCC": 'P',
      "CAC": 'H',
      "CGC": 'R',
      "CTA": 'L',
      "CCA": 'P',
      "CAA": 'Q',
      "CGA": 'R',
      "CTG": 'L',
      "CCG": 'P',
      "CAG": 'Q',
      "CGG": 'R',
      "ATT": 'I',
      "ACT": 'T',
      "AAT": 'N',
      "AGT": 'S',
      "ATC": 'I',
      "ACC": 'T',
      "AAC": 'N',
      "AGC": 'S',
      "ATA": 'I',
      "ACA": 'T',
      "AAA": 'K',
      "AGA": 'R',
      "ATG": 'M',
      "ACG": 'T',
      "AAG": 'K',
      "AGG": 'R',
      "GTT": 'V',
      "GCT": 'A',
      "GAT": 'D',
      "GGT": 'G',
      "GTC": 'V',
      "GCC": 'A',
      "GAC": 'D',
      "GGC": 'G',
      "GTA": 'V',
      "GCA": 'A',
      "GAA": 'E',
      "GGA": 'G',
      "GTG": 'V',
      "GCG": 'A',
      "GAG": 'E',
      "GGG": 'G',
      }
      )

      const (
      nCode = uint8(0)
      aCode = uint8(1)
      cCode = uint8(2)
      tCode = uint8(3)
      uCode = uint8(3)
      gCode = uint8(4)

      stopByte = '*'
      unknown = 'X'
      // Length of the array to store code/bytes
      // uses gCode because it's the biggest uint8 of all codes
      arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
      )

      func createCodeArray(clean bool) byte {

      resultMap := map[uint32]byte{}
      twoLetterMap := map[string]byte{}

      tmpCode := make(uint8, 4)

      for codon, aaCode := range standard {
      // generate 3 letter code
      for i := 0; i < 3; i++ {
      tmpCode[i] = letterCode[codon[i]]
      }
      // each codon is represented by an unique uint32:
      // each possible nucleotide is represented by an uint8 (255 possibility)
      // the three first bytes are the the code for each nucleotide
      // last byte is unused ( eq to uint8(0) )
      // example:
      // codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16
      uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16
      resultMap[uint32Code] = aaCode

      // generate 2 letter code
      codes, ok := twoLetterMap[codon[0:2]]
      if !ok {
      twoLetterMap[codon[0:2]] = byte{aaCode}
      } else {
      twoLetterMap[codon[0:2]] = append(codes, aaCode)
      }
      }
      for twoLetterCodon, codes := range twoLetterMap {
      uniqueAA := true
      for i := 0; i < len(codes); i++ {

      if codes[i] != codes[0] {
      uniqueAA = false
      }
      }
      if uniqueAA {
      first := letterCode[twoLetterCodon[0]]
      second := letterCode[twoLetterCodon[1]]

      uint32Code := uint32(first) | uint32(second)<<8
      resultMap[uint32Code] = codes[0]
      }
      }
      // if clean is specified, we want to replace all '*' by 'X' in the output
      // sequence, so replace all occurrences of '*' directly in the ref map
      if clean {
      for k, v := range resultMap {
      if v == stopByte {
      resultMap[k] = unknown
      }
      }
      }
      r := make(byte, arrayCodeSize)
      for k, v := range resultMap {
      r[k] = v
      }
      return r
      }

      func computeFrames(frameName string) (frames int, reverse bool, err error) {

      frames = make(int, 6)
      reverse = false

      switch frameName {
      case "1":
      frames[0] = 1
      case "2":
      frames[1] = 1
      case "3":
      frames[2] = 1
      case "F":
      for i := 0; i < 3; i++ {
      frames[i] = 1
      }
      case "-1":
      frames[3] = 1
      reverse = true
      case "-2":
      frames[4] = 1
      reverse = true
      case "-3":
      frames[5] = 1
      reverse = true
      case "R":
      for i := 3; i < 6; i++ {
      frames[i] = 1
      }
      reverse = true
      case "6":
      for i := range frames {
      frames[i] = 1
      }
      reverse = true
      default:
      err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)
      }
      return frames, reverse, err
      }

      type writer struct {
      buf *bytes.Buffer
      currentLineLen int
      bytesToTrim int
      }

      func (w *writer) addByte(b byte) {
      w.buf.WriteByte(b)
      w.currentLineLen++
      if b == stopByte || b == unknown {
      w.bytesToTrim++
      } else {
      w.bytesToTrim = 0
      }
      }

      func (w *writer) addUnknown() {
      w.buf.WriteByte(unknown)
      w.currentLineLen++
      w.bytesToTrim++
      }

      func (w *writer) newLine() {
      w.buf.WriteByte('n')
      w.currentLineLen = 0
      w.bytesToTrim++
      }

      const (
      // size of the buffer for writing to file
      maxBufferSize = 1024 * 1024 * 30
      // max line size for sequence
      maxLineSize = 60
      // suffixes ta add to sequence id for each frame
      suffixes = "123456"
      )

      // Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame
      func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {

      arrayCode := createCodeArray(clean)
      framesToGenerate, reverse, err := computeFrames(frame)
      if err != nil {
      return err
      }

      fnaSequences := make(chan encodedSequence, 10)
      errs := make(chan error, 1)

      var wg sync.WaitGroup
      ctx, cancel := context.WithCancel(context.Background())
      defer cancel()

      for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {

      wg.Add(1)

      go func() {

      defer wg.Done()

      startPosition := make(int, 3)
      w := &writer{
      buf: bytes.NewBuffer(nil),
      bytesToTrim: 0,
      currentLineLen: 0,
      }

      for sequence := range fnaSequences {

      select {
      case <-ctx.Done():
      return
      default:
      }

      frameIndex := 0
      startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

      idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
      nuclSeqLength := len(sequence) - idSize

      Translate:
      for _, startPos := range startPosition {

      if framesToGenerate[frameIndex] == 0 {
      frameIndex++
      continue
      }

      // sequence id should look like
      // >sequenceID_<frame> comment
      idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
      if idEnd != -1 {
      w.buf.Write(sequence[4 : 4+idEnd])
      w.buf.WriteByte('_')
      w.buf.WriteByte(suffixes[frameIndex])
      w.buf.Write(sequence[4+idEnd : idSize])
      } else {
      w.buf.Write(sequence[4:idSize])
      w.buf.WriteByte('_')
      w.buf.WriteByte(suffixes[frameIndex])
      }
      w.newLine()

      // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
      // from right end of the sequence)
      w.bytesToTrim = 0
      w.currentLineLen = 0

      // read the sequence 3 letters at a time, starting at a specific position
      // corresponding to the frame
      for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

      if w.currentLineLen == maxLineSize {
      w.newLine()
      }
      // create an uint32 from the codon, to retrieve the corresponding
      // AA from the map
      codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

      b := arrayCode[codonCode]
      if b != byte(0) {
      w.addByte(b)
      } else {
      w.addUnknown()
      }
      }

      // the last codon is only 2 nucleotid long, try to guess
      // the corresponding AA
      if (nuclSeqLength-startPos)%3 == 2 {

      if w.currentLineLen == maxLineSize {
      w.newLine()
      }
      codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

      b := arrayCode[codonCode]
      if b != byte(0) {
      w.addByte(b)
      } else {
      w.addUnknown()
      }
      }

      // the last codon is only 1 nucleotid long, no way to guess
      // the corresponding AA
      if (nuclSeqLength-startPos)%3 == 1 {
      if w.currentLineLen == maxLineSize {
      w.newLine()
      }
      w.addUnknown()
      }

      if trim && w.bytesToTrim > 0 {
      // remove the last bytesToTrim bytes of the buffer
      // as they are 'X', '*' or 'n'
      w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
      w.currentLineLen -= w.bytesToTrim
      }

      if w.currentLineLen != 0 {
      w.newLine()
      }
      frameIndex++
      }

      if reverse && frameIndex < 6 {

      // get the complementary sequence.
      // Basically, switch
      // A <-> T
      // C <-> G
      // N is not modified
      for i, n := range sequence[idSize:] {

      switch n {
      case aCode:
      sequence[i+idSize] = tCode
      case tCode:
      // handle both tCode and uCode
      sequence[i+idSize] = aCode
      case cCode:
      sequence[i+idSize] = gCode
      case gCode:
      sequence[i+idSize] = cCode
      default:
      //case N -> leave it
      }
      }
      // reverse the sequence
      for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
      sequence[i], sequence[j] = sequence[j], sequence[i]
      }

      if !alternative {
      // Staden convention: Frame -1 is the reverse-complement of the sequence
      // having the same codon phase as frame 1. Frame -2 is the same phase as
      // frame 2. Frame -3 is the same phase as frame 3
      //
      // use the matrix to keep track of the forward frame as it depends on the
      // length of the sequence
      switch nuclSeqLength % 3 {
      case 0:
      startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
      case 1:
      startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
      case 2:
      startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
      }
      }
      // run the same loop, but with the reverse-complemented sequence
      goto Translate
      }

      if w.buf.Len() > maxBufferSize {
      _, err := out.Write(w.buf.Bytes())
      if err != nil {
      select {
      case errs <- fmt.Errorf("fail to write to output file: %v", err):
      default:
      }
      cancel()
      return
      }
      w.buf.Reset()
      }
      pool.Put(sequence)
      }

      if w.buf.Len() > 0 {
      _, err := out.Write(w.buf.Bytes())
      if err != nil {
      select {
      case errs <- fmt.Errorf("fail to write to output file: %v", err):
      default:
      }
      cancel()
      return
      }
      }
      }()
      }
      readSequenceFromFasta(ctx, inputSequence, fnaSequences)

      wg.Wait()
      select {
      case err, ok := <-errs:
      if ok {
      return err
      }
      default:
      }
      return nil
      }

      func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {

      feeder := &fastaChannelFeeder{
      idBuffer: bytes.NewBuffer(nil),
      commentBuffer: bytes.NewBuffer(nil),
      sequenceBuffer: bytes.NewBuffer(nil),
      fastaChan: fnaSequences,
      }
      // fasta format is:
      //
      // >sequenceID some comments on sequence
      // ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT
      // TTTGCGGTCACATGACGACTTCGGCAGCGA
      //
      // see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
      // section 1 for details
      scanner := bufio.NewScanner(inputSequence)
      Loop:
      for scanner.Scan() {

      line := scanner.Bytes()
      if len(line) == 0 {
      continue
      }
      if line[0] == '>' {

      if feeder.idBuffer.Len() > 0 {
      select {
      case <-ctx.Done():
      break Loop
      default:
      }
      feeder.sendFasta()
      }
      feeder.reset()

      // parse the ID of the sequence. ID is formatted like this:
      // >sequenceID comments
      seqID := bytes.SplitN(line, byte{' '}, 2)
      feeder.idBuffer.Write(seqID[0])

      if len(seqID) > 1 {
      feeder.commentBuffer.WriteByte(' ')
      feeder.commentBuffer.Write(seqID[1])
      }
      } else {
      // if the line doesn't start with '>', then it's a part of the
      // nucleotide sequence, so write it to the buffer
      feeder.sequenceBuffer.Write(line)
      }
      }
      // don't forget to push last sequence
      select {
      case <-ctx.Done():
      default:
      feeder.sendFasta()
      }
      close(fnaSequences)
      }

      // a type to hold an encoded fasta sequence
      //
      // s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)
      // s[4:idSize] stores the sequence id, and the comment id there is one
      // s[idSize:] stores the nucl sequence
      type encodedSequence byte

      var pool = sync.Pool{
      New: func() interface{} {
      return make(encodedSequence, 512)
      },
      }

      func getSizedSlice(idSize, requiredSize int) encodedSequence {
      s := pool.Get().(encodedSequence)
      binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))

      for len(s) < requiredSize {
      s = append(s, byte(0))
      }
      return s[0:requiredSize]
      }

      func (f *fastaChannelFeeder) sendFasta() {

      idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()
      requiredSize := idSize + f.sequenceBuffer.Len()

      s := getSizedSlice(idSize, requiredSize)

      if f.commentBuffer.Len() > 0 {
      copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())
      }

      copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())

      // convert the sequence of bytes to an array of uint8 codes,
      // so a codon (3 nucleotides | 3 bytes ) can be represented
      // as an uint32
      for i, b := range f.sequenceBuffer.Bytes() {

      switch b {
      case 'A':
      s[i+idSize] = aCode
      case 'C':
      s[i+idSize] = cCode
      case 'G':
      s[i+idSize] = gCode
      case 'T', 'U':
      s[i+idSize] = tCode
      case 'N':
      s[i+idSize] = nCode
      default:
      fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))
      }
      }
      f.fastaChan <- s
      }

      type fastaChannelFeeder struct {
      idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer
      fastaChan chan encodedSequence
      }

      func (f *fastaChannelFeeder) reset() {
      f.idBuffer.Reset()
      f.sequenceBuffer.Reset()
      f.commentBuffer.Reset()
      }









      share|improve this question















      Goal of the program



      The goal of the program is to translate a nucleic acid sequence into its corresponding amino acid sequence. The nucleic sequences
      have to be formatted in a specific format called fasta. There is an existing implementation of this program in C here: emboss transeq



      Fasta format



      A fasta file looks like this:



      >sequenceId comment
      nucleic sequence


      for example:



      >Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
      CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

      >Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC
      AATCAACATAAAA


      A nucleic sequence is a string composed of the letters A, C, T, G, U, N



      Expected output



      A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A



      With the Seq1 define above:



      codon      CCT TTA TCT AAT CTT TGG AGC ATG ...
      amino acid P L S N L W S M ...


      The expected output is:



      >Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY
      >Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      GRYRPKXPNPSRTXPTRSPSGRRQST*X


      Specific rules



      The program takes 4 parameters:





      • clean : if true, write STOP codon as X instead of *


      • trim : if true, remove all X and * chars from the right end of the amino - acid sequence


      • alternative: different way to compute reverse frame


      • frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]


      The frame define the position in the nucleic sequence to start from:



      -frame 1
      |-frame 2
      ||-frame 3
      |||
      CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
      CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC
      |||
      ||-frame -1
      |-frame -2
      -frame -3


      frame "F" = "1", "2", "3"
      frame "R" = "-1", "-2", "-3"
      frame "6" = "1", "2", "3", "-1", "-2", "-3"


      In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame



      For example if the program is used with frame=6, the expected output is



      >Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY
      >Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT
      >Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX
      >Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R
      >Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX
      >Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds
      *NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG
      >Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      GRYRPKXPNPSRTXPTRSPSGRRQST*X
      >Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      VGTALXLLIRAELXQPGALLGDDNQHKX
      >Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      *VPP*XS*SEQNXANPEPFWETTINIK
      >Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      LC*LSSPRRAPGWXSSARIRXLRAVPT
      >Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      FMLIVVSQKGSGLA*FCSD*EX*GGTYX
      >Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene
      FYVDCRLPEGLRVGXVLLGLGXLGRYLP


      Improvements




      • Performance improvements

      • Is the code easy to read / understand ?

      • Is the code idiomatic ?


      ( full project with tests + flags handling is avaible on github: gotranseq)



      code :



      package transeq

      import (
      "bufio"
      "bytes"
      "context"
      "encoding/binary"
      "fmt"
      "io"
      "runtime"
      "sync"
      )

      var (
      letterCode = map[byte]uint8{
      'A': aCode,
      'C': cCode,
      'T': tCode,
      'G': gCode,
      'N': nCode,
      'U': uCode,
      }
      standard = map[string]byte{
      "TTT": 'F',
      "TCT": 'S',
      "TAT": 'Y',
      "TGT": 'C',
      "TTC": 'F',
      "TCC": 'S',
      "TAC": 'Y',
      "TGC": 'C',
      "TTA": 'L',
      "TCA": 'S',
      "TAA": '*',
      "TGA": '*',
      "TTG": 'L',
      "TCG": 'S',
      "TAG": '*',
      "TGG": 'W',
      "CTT": 'L',
      "CCT": 'P',
      "CAT": 'H',
      "CGT": 'R',
      "CTC": 'L',
      "CCC": 'P',
      "CAC": 'H',
      "CGC": 'R',
      "CTA": 'L',
      "CCA": 'P',
      "CAA": 'Q',
      "CGA": 'R',
      "CTG": 'L',
      "CCG": 'P',
      "CAG": 'Q',
      "CGG": 'R',
      "ATT": 'I',
      "ACT": 'T',
      "AAT": 'N',
      "AGT": 'S',
      "ATC": 'I',
      "ACC": 'T',
      "AAC": 'N',
      "AGC": 'S',
      "ATA": 'I',
      "ACA": 'T',
      "AAA": 'K',
      "AGA": 'R',
      "ATG": 'M',
      "ACG": 'T',
      "AAG": 'K',
      "AGG": 'R',
      "GTT": 'V',
      "GCT": 'A',
      "GAT": 'D',
      "GGT": 'G',
      "GTC": 'V',
      "GCC": 'A',
      "GAC": 'D',
      "GGC": 'G',
      "GTA": 'V',
      "GCA": 'A',
      "GAA": 'E',
      "GGA": 'G',
      "GTG": 'V',
      "GCG": 'A',
      "GAG": 'E',
      "GGG": 'G',
      }
      )

      const (
      nCode = uint8(0)
      aCode = uint8(1)
      cCode = uint8(2)
      tCode = uint8(3)
      uCode = uint8(3)
      gCode = uint8(4)

      stopByte = '*'
      unknown = 'X'
      // Length of the array to store code/bytes
      // uses gCode because it's the biggest uint8 of all codes
      arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
      )

      func createCodeArray(clean bool) byte {

      resultMap := map[uint32]byte{}
      twoLetterMap := map[string]byte{}

      tmpCode := make(uint8, 4)

      for codon, aaCode := range standard {
      // generate 3 letter code
      for i := 0; i < 3; i++ {
      tmpCode[i] = letterCode[codon[i]]
      }
      // each codon is represented by an unique uint32:
      // each possible nucleotide is represented by an uint8 (255 possibility)
      // the three first bytes are the the code for each nucleotide
      // last byte is unused ( eq to uint8(0) )
      // example:
      // codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16
      uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16
      resultMap[uint32Code] = aaCode

      // generate 2 letter code
      codes, ok := twoLetterMap[codon[0:2]]
      if !ok {
      twoLetterMap[codon[0:2]] = byte{aaCode}
      } else {
      twoLetterMap[codon[0:2]] = append(codes, aaCode)
      }
      }
      for twoLetterCodon, codes := range twoLetterMap {
      uniqueAA := true
      for i := 0; i < len(codes); i++ {

      if codes[i] != codes[0] {
      uniqueAA = false
      }
      }
      if uniqueAA {
      first := letterCode[twoLetterCodon[0]]
      second := letterCode[twoLetterCodon[1]]

      uint32Code := uint32(first) | uint32(second)<<8
      resultMap[uint32Code] = codes[0]
      }
      }
      // if clean is specified, we want to replace all '*' by 'X' in the output
      // sequence, so replace all occurrences of '*' directly in the ref map
      if clean {
      for k, v := range resultMap {
      if v == stopByte {
      resultMap[k] = unknown
      }
      }
      }
      r := make(byte, arrayCodeSize)
      for k, v := range resultMap {
      r[k] = v
      }
      return r
      }

      func computeFrames(frameName string) (frames int, reverse bool, err error) {

      frames = make(int, 6)
      reverse = false

      switch frameName {
      case "1":
      frames[0] = 1
      case "2":
      frames[1] = 1
      case "3":
      frames[2] = 1
      case "F":
      for i := 0; i < 3; i++ {
      frames[i] = 1
      }
      case "-1":
      frames[3] = 1
      reverse = true
      case "-2":
      frames[4] = 1
      reverse = true
      case "-3":
      frames[5] = 1
      reverse = true
      case "R":
      for i := 3; i < 6; i++ {
      frames[i] = 1
      }
      reverse = true
      case "6":
      for i := range frames {
      frames[i] = 1
      }
      reverse = true
      default:
      err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)
      }
      return frames, reverse, err
      }

      type writer struct {
      buf *bytes.Buffer
      currentLineLen int
      bytesToTrim int
      }

      func (w *writer) addByte(b byte) {
      w.buf.WriteByte(b)
      w.currentLineLen++
      if b == stopByte || b == unknown {
      w.bytesToTrim++
      } else {
      w.bytesToTrim = 0
      }
      }

      func (w *writer) addUnknown() {
      w.buf.WriteByte(unknown)
      w.currentLineLen++
      w.bytesToTrim++
      }

      func (w *writer) newLine() {
      w.buf.WriteByte('n')
      w.currentLineLen = 0
      w.bytesToTrim++
      }

      const (
      // size of the buffer for writing to file
      maxBufferSize = 1024 * 1024 * 30
      // max line size for sequence
      maxLineSize = 60
      // suffixes ta add to sequence id for each frame
      suffixes = "123456"
      )

      // Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame
      func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {

      arrayCode := createCodeArray(clean)
      framesToGenerate, reverse, err := computeFrames(frame)
      if err != nil {
      return err
      }

      fnaSequences := make(chan encodedSequence, 10)
      errs := make(chan error, 1)

      var wg sync.WaitGroup
      ctx, cancel := context.WithCancel(context.Background())
      defer cancel()

      for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {

      wg.Add(1)

      go func() {

      defer wg.Done()

      startPosition := make(int, 3)
      w := &writer{
      buf: bytes.NewBuffer(nil),
      bytesToTrim: 0,
      currentLineLen: 0,
      }

      for sequence := range fnaSequences {

      select {
      case <-ctx.Done():
      return
      default:
      }

      frameIndex := 0
      startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

      idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
      nuclSeqLength := len(sequence) - idSize

      Translate:
      for _, startPos := range startPosition {

      if framesToGenerate[frameIndex] == 0 {
      frameIndex++
      continue
      }

      // sequence id should look like
      // >sequenceID_<frame> comment
      idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
      if idEnd != -1 {
      w.buf.Write(sequence[4 : 4+idEnd])
      w.buf.WriteByte('_')
      w.buf.WriteByte(suffixes[frameIndex])
      w.buf.Write(sequence[4+idEnd : idSize])
      } else {
      w.buf.Write(sequence[4:idSize])
      w.buf.WriteByte('_')
      w.buf.WriteByte(suffixes[frameIndex])
      }
      w.newLine()

      // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
      // from right end of the sequence)
      w.bytesToTrim = 0
      w.currentLineLen = 0

      // read the sequence 3 letters at a time, starting at a specific position
      // corresponding to the frame
      for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

      if w.currentLineLen == maxLineSize {
      w.newLine()
      }
      // create an uint32 from the codon, to retrieve the corresponding
      // AA from the map
      codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

      b := arrayCode[codonCode]
      if b != byte(0) {
      w.addByte(b)
      } else {
      w.addUnknown()
      }
      }

      // the last codon is only 2 nucleotid long, try to guess
      // the corresponding AA
      if (nuclSeqLength-startPos)%3 == 2 {

      if w.currentLineLen == maxLineSize {
      w.newLine()
      }
      codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

      b := arrayCode[codonCode]
      if b != byte(0) {
      w.addByte(b)
      } else {
      w.addUnknown()
      }
      }

      // the last codon is only 1 nucleotid long, no way to guess
      // the corresponding AA
      if (nuclSeqLength-startPos)%3 == 1 {
      if w.currentLineLen == maxLineSize {
      w.newLine()
      }
      w.addUnknown()
      }

      if trim && w.bytesToTrim > 0 {
      // remove the last bytesToTrim bytes of the buffer
      // as they are 'X', '*' or 'n'
      w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
      w.currentLineLen -= w.bytesToTrim
      }

      if w.currentLineLen != 0 {
      w.newLine()
      }
      frameIndex++
      }

      if reverse && frameIndex < 6 {

      // get the complementary sequence.
      // Basically, switch
      // A <-> T
      // C <-> G
      // N is not modified
      for i, n := range sequence[idSize:] {

      switch n {
      case aCode:
      sequence[i+idSize] = tCode
      case tCode:
      // handle both tCode and uCode
      sequence[i+idSize] = aCode
      case cCode:
      sequence[i+idSize] = gCode
      case gCode:
      sequence[i+idSize] = cCode
      default:
      //case N -> leave it
      }
      }
      // reverse the sequence
      for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
      sequence[i], sequence[j] = sequence[j], sequence[i]
      }

      if !alternative {
      // Staden convention: Frame -1 is the reverse-complement of the sequence
      // having the same codon phase as frame 1. Frame -2 is the same phase as
      // frame 2. Frame -3 is the same phase as frame 3
      //
      // use the matrix to keep track of the forward frame as it depends on the
      // length of the sequence
      switch nuclSeqLength % 3 {
      case 0:
      startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
      case 1:
      startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
      case 2:
      startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
      }
      }
      // run the same loop, but with the reverse-complemented sequence
      goto Translate
      }

      if w.buf.Len() > maxBufferSize {
      _, err := out.Write(w.buf.Bytes())
      if err != nil {
      select {
      case errs <- fmt.Errorf("fail to write to output file: %v", err):
      default:
      }
      cancel()
      return
      }
      w.buf.Reset()
      }
      pool.Put(sequence)
      }

      if w.buf.Len() > 0 {
      _, err := out.Write(w.buf.Bytes())
      if err != nil {
      select {
      case errs <- fmt.Errorf("fail to write to output file: %v", err):
      default:
      }
      cancel()
      return
      }
      }
      }()
      }
      readSequenceFromFasta(ctx, inputSequence, fnaSequences)

      wg.Wait()
      select {
      case err, ok := <-errs:
      if ok {
      return err
      }
      default:
      }
      return nil
      }

      func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {

      feeder := &fastaChannelFeeder{
      idBuffer: bytes.NewBuffer(nil),
      commentBuffer: bytes.NewBuffer(nil),
      sequenceBuffer: bytes.NewBuffer(nil),
      fastaChan: fnaSequences,
      }
      // fasta format is:
      //
      // >sequenceID some comments on sequence
      // ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT
      // TTTGCGGTCACATGACGACTTCGGCAGCGA
      //
      // see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
      // section 1 for details
      scanner := bufio.NewScanner(inputSequence)
      Loop:
      for scanner.Scan() {

      line := scanner.Bytes()
      if len(line) == 0 {
      continue
      }
      if line[0] == '>' {

      if feeder.idBuffer.Len() > 0 {
      select {
      case <-ctx.Done():
      break Loop
      default:
      }
      feeder.sendFasta()
      }
      feeder.reset()

      // parse the ID of the sequence. ID is formatted like this:
      // >sequenceID comments
      seqID := bytes.SplitN(line, byte{' '}, 2)
      feeder.idBuffer.Write(seqID[0])

      if len(seqID) > 1 {
      feeder.commentBuffer.WriteByte(' ')
      feeder.commentBuffer.Write(seqID[1])
      }
      } else {
      // if the line doesn't start with '>', then it's a part of the
      // nucleotide sequence, so write it to the buffer
      feeder.sequenceBuffer.Write(line)
      }
      }
      // don't forget to push last sequence
      select {
      case <-ctx.Done():
      default:
      feeder.sendFasta()
      }
      close(fnaSequences)
      }

      // a type to hold an encoded fasta sequence
      //
      // s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)
      // s[4:idSize] stores the sequence id, and the comment id there is one
      // s[idSize:] stores the nucl sequence
      type encodedSequence byte

      var pool = sync.Pool{
      New: func() interface{} {
      return make(encodedSequence, 512)
      },
      }

      func getSizedSlice(idSize, requiredSize int) encodedSequence {
      s := pool.Get().(encodedSequence)
      binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))

      for len(s) < requiredSize {
      s = append(s, byte(0))
      }
      return s[0:requiredSize]
      }

      func (f *fastaChannelFeeder) sendFasta() {

      idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()
      requiredSize := idSize + f.sequenceBuffer.Len()

      s := getSizedSlice(idSize, requiredSize)

      if f.commentBuffer.Len() > 0 {
      copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())
      }

      copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())

      // convert the sequence of bytes to an array of uint8 codes,
      // so a codon (3 nucleotides | 3 bytes ) can be represented
      // as an uint32
      for i, b := range f.sequenceBuffer.Bytes() {

      switch b {
      case 'A':
      s[i+idSize] = aCode
      case 'C':
      s[i+idSize] = cCode
      case 'G':
      s[i+idSize] = gCode
      case 'T', 'U':
      s[i+idSize] = tCode
      case 'N':
      s[i+idSize] = nCode
      default:
      fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))
      }
      }
      f.fastaChan <- s
      }

      type fastaChannelFeeder struct {
      idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer
      fastaChan chan encodedSequence
      }

      func (f *fastaChannelFeeder) reset() {
      f.idBuffer.Reset()
      f.sequenceBuffer.Reset()
      f.commentBuffer.Reset()
      }






      performance algorithm go bioinformatics multiprocessing






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited yesterday







      felix

















      asked Dec 16 '18 at 2:43









      felixfelix

      303311




      303311






      This question has an open bounty worth +500
      reputation from felix ending in 6 days.


      This question has not received enough attention.








      This question has an open bounty worth +500
      reputation from felix ending in 6 days.


      This question has not received enough attention.
























          1 Answer
          1






          active

          oldest

          votes


















          7














          This covers an interesting topic. Great work!



          Because I am unfamiliar with this area, I utilized your unit testing to ensure changes I make did not break functionality. If they do, I apologize and please let me know.



          Utilize implicit repetition and iota



          Rather than manually defining the type and value of nCode, aCode, etc. we an implicitly get the value using iota. This also simplifies the assignment of arrayCodeSize.



          const (
          nCode = uint8(0)
          aCode = uint8(1)
          cCode = uint8(2)
          tCode = uint8(3)
          uCode = uint8(3)
          gCode = uint8(4)

          stopByte = '*'
          unknown = 'X'
          // Length of the array to store code/bytes
          // uses gCode because it's the biggest uint8 of all codes
          arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
          )


          Becomes:



          const (
          nCode = iota
          aCode
          cCode
          tCode
          uCode
          gCode

          stopByte = '*'
          unknown = 'X'

          // Length of the array to store code/bytes
          // uses gCode because it's the biggest of all codes
          arrayCodeSize = (gCode | gCode<<8 | gCode<<16) + 1
          )


          Check error return values



          There are fourteen places in your code where the error return value is unchecked. If this is intentional, it's common practice to assign it to the blank identifier: _.



          Here is a list of occurrences:




          • w.buf.WriteByte() in addByte()

          • w.buf.WriteByte() in addUnknown()

          • w.buf.WriteByte() in newLine()

          • Many Write and WriteByte calls in Translate()

          • feeder.idBuffer.Write() and such, in readSequenceFromFasta


          Update: These always return nil.



          Reduce complexity of Translate()



          The Translate() function is complex. It's current cyclomatic complexity is 45. (By the end of this, it's complexity is 27).



          I notice that you define a go statement, which acts as a worker. I will leave it up to you to choose a fitting name, but for now "worker()" is sufficient.



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          Translate:
          for _, startPos := range startPosition {

          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          goto Translate
          }

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But that's a ton of arguments for the worker, can more be done? Sure! But first, let's get rid of goto.



          goto



          You use a goto statement to re-run a block of code again. That says to me: recursive function.



          So, let's move this to a separate function. Again, I have no idea the proper name, and will leave that to you. For now, I'll call it getComplexityAndReverse() -- a verbose name, but it should suffice.



          func getComplexityAndAlternate(startPosition int, framesToGenerate int,
          frameIndex int, sequence encodedSequence, idSize int, w writer,
          arrayCode byte, nuclSeqLength int, options Options, reverse bool) {
          for _, startPos := range startPosition {
          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, w, arrayCode, nuclSeqLength, options, reverse)
          }
          }


          And we can simplify worker() even more:



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But, there's still loads of long arguments. However, now that things are broken into functions, I recommend shortening these variable names. These long names make the code very verbose.



          With my limited knowledge, I see the following:





          • fnaSequencesfnaSeqs


          • framesToGenerateframes


          • arrayCodecodes (†)


          • startPositionstarts (or sPos)


          • sequences (changed in some places)


          † This may be incorrect jargon.



          Move things to the lowest scope



          For example, startPosition (now starts) can be declared in a lower scope.



          While we're at it, starts can be declared as such:



          starts := int{0, 1, 2}


          Resulting in:



          (Within worker())



          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          starts := int{0, 1, 2}

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(starts, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }


          (Within getComplexityAndAlternate())



          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          starts = int{0, 2, 1}
          case 1:
          starts = int{1, 0, 2}
          case 2:
          starts = int{2, 1, 0}
          }
          }


          Duplicate code



          The following code is used twice and should instead be a function:



          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }


          Becomes a function (again, choose whatever name you want):



          func writeOrCancel(w writer, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          if _, err := out.Write(w.buf.Bytes()); err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }


          Unroll short loops



          case "F":
          for i := 0; i < 3; i++ {
          frames[i] = 1
          }


          And



          case "R":
          for i := 3; i < 6; i++ {
          frames[i] = 1
          }
          reverse = true


          Shouldn't be loops. By using loops you use more magic numbers. Instead, unroll them:



          case "F":
          frames[0] = 1
          frames[1] = 1
          frames[2] = 1
          // ...
          case "R":
          frames[3] = 1
          frames[4] = 1
          frames[5] = 1
          reverse = true


          There's probably an even easier way to clean up the switch statement in computeFrames().



          Don't name return arguments unless it simplifies code



          In computeFrames() your return arguments are named, but they don't need to be.



          Use straightforward conditions



          // generate 2 letter code
          codes, ok := twoLetterMap[codon[0:2]]

          if !ok {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          } else {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          }


          Is more clearly:



          // generate 2 letter code
          if codes, ok := twoLetterMap[codon[0:2]]; ok {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          } else {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          }


          Exit loops early



          You can break early upon the condition codes[i] != codes[0].



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true
          for i := 0; i < len(codes); i++ {
          if codes[i] != codes[0] {
          uniqueAA = false
          }
          }
          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Becomes:



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true

          for _, c := range codes {
          if c != codes[0] {
          uniqueAA = false
          break
          }
          }

          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Combine global const declarations



          It's common practice to combine them, so readers of your code don't need to search the entire document.



          Move things to separate files



          Your writer structure is relatively separate from everything else. I've moved it to a writer.go file -- moving the two constants it uses along with it.



          You can also simplify the field names. If you feel explanation is needed, that's the purpose of documentation, not the field names themselves.



          Rather than writing the following:



          w := &writer{
          buf: bytes.NewBuffer(nil),
          toTrim: 0,
          clen: 0,
          }


          We can use a newWriter() function, which follows Go APIs:



          func newWriter(buf byte) *writer {
          return &writer{buf: bytes.NewBuffer(buf)}
          }


          Also note that specifying the default values (0) is not needed.



          Conclusion: More to be done



          I was not able to address all of the things I saw, but you should get the gist.



          I would urge you to continue to break concrete operations into functions. Even though you may end up with lots of function arguments, in my opinion that's better than hard-to-read deep nesting and long functions. Perhaps once seeing how things break up, you can simplify the whole architecture of the package.



          Here is a GitHub Gist of the final code. It includes other formatting things I did not mention explicitly.



          Hope this helped. Your project looks promising, best of luck!






          share|improve this answer























          • Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points:
            – felix
            6 hours ago










          • (1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code
            – felix
            6 hours ago












          • @felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference.
            – esote
            4 hours ago










          • I don't talk about the performance of the modified code because it's likely the same as before.
            – esote
            4 hours ago












          • (1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code)
            – felix
            4 hours ago











          Your Answer





          StackExchange.ifUsing("editor", function () {
          return StackExchange.using("mathjaxEditing", function () {
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
          });
          });
          }, "mathjax-editing");

          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "196"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209750%2ftranslate-nucleic-acid-sequence-into-its-corresponding-amino-acid-sequence%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          7














          This covers an interesting topic. Great work!



          Because I am unfamiliar with this area, I utilized your unit testing to ensure changes I make did not break functionality. If they do, I apologize and please let me know.



          Utilize implicit repetition and iota



          Rather than manually defining the type and value of nCode, aCode, etc. we an implicitly get the value using iota. This also simplifies the assignment of arrayCodeSize.



          const (
          nCode = uint8(0)
          aCode = uint8(1)
          cCode = uint8(2)
          tCode = uint8(3)
          uCode = uint8(3)
          gCode = uint8(4)

          stopByte = '*'
          unknown = 'X'
          // Length of the array to store code/bytes
          // uses gCode because it's the biggest uint8 of all codes
          arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
          )


          Becomes:



          const (
          nCode = iota
          aCode
          cCode
          tCode
          uCode
          gCode

          stopByte = '*'
          unknown = 'X'

          // Length of the array to store code/bytes
          // uses gCode because it's the biggest of all codes
          arrayCodeSize = (gCode | gCode<<8 | gCode<<16) + 1
          )


          Check error return values



          There are fourteen places in your code where the error return value is unchecked. If this is intentional, it's common practice to assign it to the blank identifier: _.



          Here is a list of occurrences:




          • w.buf.WriteByte() in addByte()

          • w.buf.WriteByte() in addUnknown()

          • w.buf.WriteByte() in newLine()

          • Many Write and WriteByte calls in Translate()

          • feeder.idBuffer.Write() and such, in readSequenceFromFasta


          Update: These always return nil.



          Reduce complexity of Translate()



          The Translate() function is complex. It's current cyclomatic complexity is 45. (By the end of this, it's complexity is 27).



          I notice that you define a go statement, which acts as a worker. I will leave it up to you to choose a fitting name, but for now "worker()" is sufficient.



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          Translate:
          for _, startPos := range startPosition {

          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          goto Translate
          }

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But that's a ton of arguments for the worker, can more be done? Sure! But first, let's get rid of goto.



          goto



          You use a goto statement to re-run a block of code again. That says to me: recursive function.



          So, let's move this to a separate function. Again, I have no idea the proper name, and will leave that to you. For now, I'll call it getComplexityAndReverse() -- a verbose name, but it should suffice.



          func getComplexityAndAlternate(startPosition int, framesToGenerate int,
          frameIndex int, sequence encodedSequence, idSize int, w writer,
          arrayCode byte, nuclSeqLength int, options Options, reverse bool) {
          for _, startPos := range startPosition {
          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, w, arrayCode, nuclSeqLength, options, reverse)
          }
          }


          And we can simplify worker() even more:



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But, there's still loads of long arguments. However, now that things are broken into functions, I recommend shortening these variable names. These long names make the code very verbose.



          With my limited knowledge, I see the following:





          • fnaSequencesfnaSeqs


          • framesToGenerateframes


          • arrayCodecodes (†)


          • startPositionstarts (or sPos)


          • sequences (changed in some places)


          † This may be incorrect jargon.



          Move things to the lowest scope



          For example, startPosition (now starts) can be declared in a lower scope.



          While we're at it, starts can be declared as such:



          starts := int{0, 1, 2}


          Resulting in:



          (Within worker())



          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          starts := int{0, 1, 2}

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(starts, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }


          (Within getComplexityAndAlternate())



          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          starts = int{0, 2, 1}
          case 1:
          starts = int{1, 0, 2}
          case 2:
          starts = int{2, 1, 0}
          }
          }


          Duplicate code



          The following code is used twice and should instead be a function:



          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }


          Becomes a function (again, choose whatever name you want):



          func writeOrCancel(w writer, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          if _, err := out.Write(w.buf.Bytes()); err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }


          Unroll short loops



          case "F":
          for i := 0; i < 3; i++ {
          frames[i] = 1
          }


          And



          case "R":
          for i := 3; i < 6; i++ {
          frames[i] = 1
          }
          reverse = true


          Shouldn't be loops. By using loops you use more magic numbers. Instead, unroll them:



          case "F":
          frames[0] = 1
          frames[1] = 1
          frames[2] = 1
          // ...
          case "R":
          frames[3] = 1
          frames[4] = 1
          frames[5] = 1
          reverse = true


          There's probably an even easier way to clean up the switch statement in computeFrames().



          Don't name return arguments unless it simplifies code



          In computeFrames() your return arguments are named, but they don't need to be.



          Use straightforward conditions



          // generate 2 letter code
          codes, ok := twoLetterMap[codon[0:2]]

          if !ok {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          } else {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          }


          Is more clearly:



          // generate 2 letter code
          if codes, ok := twoLetterMap[codon[0:2]]; ok {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          } else {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          }


          Exit loops early



          You can break early upon the condition codes[i] != codes[0].



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true
          for i := 0; i < len(codes); i++ {
          if codes[i] != codes[0] {
          uniqueAA = false
          }
          }
          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Becomes:



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true

          for _, c := range codes {
          if c != codes[0] {
          uniqueAA = false
          break
          }
          }

          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Combine global const declarations



          It's common practice to combine them, so readers of your code don't need to search the entire document.



          Move things to separate files



          Your writer structure is relatively separate from everything else. I've moved it to a writer.go file -- moving the two constants it uses along with it.



          You can also simplify the field names. If you feel explanation is needed, that's the purpose of documentation, not the field names themselves.



          Rather than writing the following:



          w := &writer{
          buf: bytes.NewBuffer(nil),
          toTrim: 0,
          clen: 0,
          }


          We can use a newWriter() function, which follows Go APIs:



          func newWriter(buf byte) *writer {
          return &writer{buf: bytes.NewBuffer(buf)}
          }


          Also note that specifying the default values (0) is not needed.



          Conclusion: More to be done



          I was not able to address all of the things I saw, but you should get the gist.



          I would urge you to continue to break concrete operations into functions. Even though you may end up with lots of function arguments, in my opinion that's better than hard-to-read deep nesting and long functions. Perhaps once seeing how things break up, you can simplify the whole architecture of the package.



          Here is a GitHub Gist of the final code. It includes other formatting things I did not mention explicitly.



          Hope this helped. Your project looks promising, best of luck!






          share|improve this answer























          • Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points:
            – felix
            6 hours ago










          • (1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code
            – felix
            6 hours ago












          • @felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference.
            – esote
            4 hours ago










          • I don't talk about the performance of the modified code because it's likely the same as before.
            – esote
            4 hours ago












          • (1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code)
            – felix
            4 hours ago
















          7














          This covers an interesting topic. Great work!



          Because I am unfamiliar with this area, I utilized your unit testing to ensure changes I make did not break functionality. If they do, I apologize and please let me know.



          Utilize implicit repetition and iota



          Rather than manually defining the type and value of nCode, aCode, etc. we an implicitly get the value using iota. This also simplifies the assignment of arrayCodeSize.



          const (
          nCode = uint8(0)
          aCode = uint8(1)
          cCode = uint8(2)
          tCode = uint8(3)
          uCode = uint8(3)
          gCode = uint8(4)

          stopByte = '*'
          unknown = 'X'
          // Length of the array to store code/bytes
          // uses gCode because it's the biggest uint8 of all codes
          arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
          )


          Becomes:



          const (
          nCode = iota
          aCode
          cCode
          tCode
          uCode
          gCode

          stopByte = '*'
          unknown = 'X'

          // Length of the array to store code/bytes
          // uses gCode because it's the biggest of all codes
          arrayCodeSize = (gCode | gCode<<8 | gCode<<16) + 1
          )


          Check error return values



          There are fourteen places in your code where the error return value is unchecked. If this is intentional, it's common practice to assign it to the blank identifier: _.



          Here is a list of occurrences:




          • w.buf.WriteByte() in addByte()

          • w.buf.WriteByte() in addUnknown()

          • w.buf.WriteByte() in newLine()

          • Many Write and WriteByte calls in Translate()

          • feeder.idBuffer.Write() and such, in readSequenceFromFasta


          Update: These always return nil.



          Reduce complexity of Translate()



          The Translate() function is complex. It's current cyclomatic complexity is 45. (By the end of this, it's complexity is 27).



          I notice that you define a go statement, which acts as a worker. I will leave it up to you to choose a fitting name, but for now "worker()" is sufficient.



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          Translate:
          for _, startPos := range startPosition {

          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          goto Translate
          }

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But that's a ton of arguments for the worker, can more be done? Sure! But first, let's get rid of goto.



          goto



          You use a goto statement to re-run a block of code again. That says to me: recursive function.



          So, let's move this to a separate function. Again, I have no idea the proper name, and will leave that to you. For now, I'll call it getComplexityAndReverse() -- a verbose name, but it should suffice.



          func getComplexityAndAlternate(startPosition int, framesToGenerate int,
          frameIndex int, sequence encodedSequence, idSize int, w writer,
          arrayCode byte, nuclSeqLength int, options Options, reverse bool) {
          for _, startPos := range startPosition {
          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, w, arrayCode, nuclSeqLength, options, reverse)
          }
          }


          And we can simplify worker() even more:



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But, there's still loads of long arguments. However, now that things are broken into functions, I recommend shortening these variable names. These long names make the code very verbose.



          With my limited knowledge, I see the following:





          • fnaSequencesfnaSeqs


          • framesToGenerateframes


          • arrayCodecodes (†)


          • startPositionstarts (or sPos)


          • sequences (changed in some places)


          † This may be incorrect jargon.



          Move things to the lowest scope



          For example, startPosition (now starts) can be declared in a lower scope.



          While we're at it, starts can be declared as such:



          starts := int{0, 1, 2}


          Resulting in:



          (Within worker())



          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          starts := int{0, 1, 2}

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(starts, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }


          (Within getComplexityAndAlternate())



          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          starts = int{0, 2, 1}
          case 1:
          starts = int{1, 0, 2}
          case 2:
          starts = int{2, 1, 0}
          }
          }


          Duplicate code



          The following code is used twice and should instead be a function:



          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }


          Becomes a function (again, choose whatever name you want):



          func writeOrCancel(w writer, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          if _, err := out.Write(w.buf.Bytes()); err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }


          Unroll short loops



          case "F":
          for i := 0; i < 3; i++ {
          frames[i] = 1
          }


          And



          case "R":
          for i := 3; i < 6; i++ {
          frames[i] = 1
          }
          reverse = true


          Shouldn't be loops. By using loops you use more magic numbers. Instead, unroll them:



          case "F":
          frames[0] = 1
          frames[1] = 1
          frames[2] = 1
          // ...
          case "R":
          frames[3] = 1
          frames[4] = 1
          frames[5] = 1
          reverse = true


          There's probably an even easier way to clean up the switch statement in computeFrames().



          Don't name return arguments unless it simplifies code



          In computeFrames() your return arguments are named, but they don't need to be.



          Use straightforward conditions



          // generate 2 letter code
          codes, ok := twoLetterMap[codon[0:2]]

          if !ok {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          } else {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          }


          Is more clearly:



          // generate 2 letter code
          if codes, ok := twoLetterMap[codon[0:2]]; ok {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          } else {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          }


          Exit loops early



          You can break early upon the condition codes[i] != codes[0].



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true
          for i := 0; i < len(codes); i++ {
          if codes[i] != codes[0] {
          uniqueAA = false
          }
          }
          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Becomes:



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true

          for _, c := range codes {
          if c != codes[0] {
          uniqueAA = false
          break
          }
          }

          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Combine global const declarations



          It's common practice to combine them, so readers of your code don't need to search the entire document.



          Move things to separate files



          Your writer structure is relatively separate from everything else. I've moved it to a writer.go file -- moving the two constants it uses along with it.



          You can also simplify the field names. If you feel explanation is needed, that's the purpose of documentation, not the field names themselves.



          Rather than writing the following:



          w := &writer{
          buf: bytes.NewBuffer(nil),
          toTrim: 0,
          clen: 0,
          }


          We can use a newWriter() function, which follows Go APIs:



          func newWriter(buf byte) *writer {
          return &writer{buf: bytes.NewBuffer(buf)}
          }


          Also note that specifying the default values (0) is not needed.



          Conclusion: More to be done



          I was not able to address all of the things I saw, but you should get the gist.



          I would urge you to continue to break concrete operations into functions. Even though you may end up with lots of function arguments, in my opinion that's better than hard-to-read deep nesting and long functions. Perhaps once seeing how things break up, you can simplify the whole architecture of the package.



          Here is a GitHub Gist of the final code. It includes other formatting things I did not mention explicitly.



          Hope this helped. Your project looks promising, best of luck!






          share|improve this answer























          • Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points:
            – felix
            6 hours ago










          • (1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code
            – felix
            6 hours ago












          • @felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference.
            – esote
            4 hours ago










          • I don't talk about the performance of the modified code because it's likely the same as before.
            – esote
            4 hours ago












          • (1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code)
            – felix
            4 hours ago














          7












          7








          7






          This covers an interesting topic. Great work!



          Because I am unfamiliar with this area, I utilized your unit testing to ensure changes I make did not break functionality. If they do, I apologize and please let me know.



          Utilize implicit repetition and iota



          Rather than manually defining the type and value of nCode, aCode, etc. we an implicitly get the value using iota. This also simplifies the assignment of arrayCodeSize.



          const (
          nCode = uint8(0)
          aCode = uint8(1)
          cCode = uint8(2)
          tCode = uint8(3)
          uCode = uint8(3)
          gCode = uint8(4)

          stopByte = '*'
          unknown = 'X'
          // Length of the array to store code/bytes
          // uses gCode because it's the biggest uint8 of all codes
          arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
          )


          Becomes:



          const (
          nCode = iota
          aCode
          cCode
          tCode
          uCode
          gCode

          stopByte = '*'
          unknown = 'X'

          // Length of the array to store code/bytes
          // uses gCode because it's the biggest of all codes
          arrayCodeSize = (gCode | gCode<<8 | gCode<<16) + 1
          )


          Check error return values



          There are fourteen places in your code where the error return value is unchecked. If this is intentional, it's common practice to assign it to the blank identifier: _.



          Here is a list of occurrences:




          • w.buf.WriteByte() in addByte()

          • w.buf.WriteByte() in addUnknown()

          • w.buf.WriteByte() in newLine()

          • Many Write and WriteByte calls in Translate()

          • feeder.idBuffer.Write() and such, in readSequenceFromFasta


          Update: These always return nil.



          Reduce complexity of Translate()



          The Translate() function is complex. It's current cyclomatic complexity is 45. (By the end of this, it's complexity is 27).



          I notice that you define a go statement, which acts as a worker. I will leave it up to you to choose a fitting name, but for now "worker()" is sufficient.



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          Translate:
          for _, startPos := range startPosition {

          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          goto Translate
          }

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But that's a ton of arguments for the worker, can more be done? Sure! But first, let's get rid of goto.



          goto



          You use a goto statement to re-run a block of code again. That says to me: recursive function.



          So, let's move this to a separate function. Again, I have no idea the proper name, and will leave that to you. For now, I'll call it getComplexityAndReverse() -- a verbose name, but it should suffice.



          func getComplexityAndAlternate(startPosition int, framesToGenerate int,
          frameIndex int, sequence encodedSequence, idSize int, w writer,
          arrayCode byte, nuclSeqLength int, options Options, reverse bool) {
          for _, startPos := range startPosition {
          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, w, arrayCode, nuclSeqLength, options, reverse)
          }
          }


          And we can simplify worker() even more:



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But, there's still loads of long arguments. However, now that things are broken into functions, I recommend shortening these variable names. These long names make the code very verbose.



          With my limited knowledge, I see the following:





          • fnaSequencesfnaSeqs


          • framesToGenerateframes


          • arrayCodecodes (†)


          • startPositionstarts (or sPos)


          • sequences (changed in some places)


          † This may be incorrect jargon.



          Move things to the lowest scope



          For example, startPosition (now starts) can be declared in a lower scope.



          While we're at it, starts can be declared as such:



          starts := int{0, 1, 2}


          Resulting in:



          (Within worker())



          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          starts := int{0, 1, 2}

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(starts, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }


          (Within getComplexityAndAlternate())



          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          starts = int{0, 2, 1}
          case 1:
          starts = int{1, 0, 2}
          case 2:
          starts = int{2, 1, 0}
          }
          }


          Duplicate code



          The following code is used twice and should instead be a function:



          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }


          Becomes a function (again, choose whatever name you want):



          func writeOrCancel(w writer, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          if _, err := out.Write(w.buf.Bytes()); err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }


          Unroll short loops



          case "F":
          for i := 0; i < 3; i++ {
          frames[i] = 1
          }


          And



          case "R":
          for i := 3; i < 6; i++ {
          frames[i] = 1
          }
          reverse = true


          Shouldn't be loops. By using loops you use more magic numbers. Instead, unroll them:



          case "F":
          frames[0] = 1
          frames[1] = 1
          frames[2] = 1
          // ...
          case "R":
          frames[3] = 1
          frames[4] = 1
          frames[5] = 1
          reverse = true


          There's probably an even easier way to clean up the switch statement in computeFrames().



          Don't name return arguments unless it simplifies code



          In computeFrames() your return arguments are named, but they don't need to be.



          Use straightforward conditions



          // generate 2 letter code
          codes, ok := twoLetterMap[codon[0:2]]

          if !ok {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          } else {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          }


          Is more clearly:



          // generate 2 letter code
          if codes, ok := twoLetterMap[codon[0:2]]; ok {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          } else {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          }


          Exit loops early



          You can break early upon the condition codes[i] != codes[0].



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true
          for i := 0; i < len(codes); i++ {
          if codes[i] != codes[0] {
          uniqueAA = false
          }
          }
          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Becomes:



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true

          for _, c := range codes {
          if c != codes[0] {
          uniqueAA = false
          break
          }
          }

          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Combine global const declarations



          It's common practice to combine them, so readers of your code don't need to search the entire document.



          Move things to separate files



          Your writer structure is relatively separate from everything else. I've moved it to a writer.go file -- moving the two constants it uses along with it.



          You can also simplify the field names. If you feel explanation is needed, that's the purpose of documentation, not the field names themselves.



          Rather than writing the following:



          w := &writer{
          buf: bytes.NewBuffer(nil),
          toTrim: 0,
          clen: 0,
          }


          We can use a newWriter() function, which follows Go APIs:



          func newWriter(buf byte) *writer {
          return &writer{buf: bytes.NewBuffer(buf)}
          }


          Also note that specifying the default values (0) is not needed.



          Conclusion: More to be done



          I was not able to address all of the things I saw, but you should get the gist.



          I would urge you to continue to break concrete operations into functions. Even though you may end up with lots of function arguments, in my opinion that's better than hard-to-read deep nesting and long functions. Perhaps once seeing how things break up, you can simplify the whole architecture of the package.



          Here is a GitHub Gist of the final code. It includes other formatting things I did not mention explicitly.



          Hope this helped. Your project looks promising, best of luck!






          share|improve this answer














          This covers an interesting topic. Great work!



          Because I am unfamiliar with this area, I utilized your unit testing to ensure changes I make did not break functionality. If they do, I apologize and please let me know.



          Utilize implicit repetition and iota



          Rather than manually defining the type and value of nCode, aCode, etc. we an implicitly get the value using iota. This also simplifies the assignment of arrayCodeSize.



          const (
          nCode = uint8(0)
          aCode = uint8(1)
          cCode = uint8(2)
          tCode = uint8(3)
          uCode = uint8(3)
          gCode = uint8(4)

          stopByte = '*'
          unknown = 'X'
          // Length of the array to store code/bytes
          // uses gCode because it's the biggest uint8 of all codes
          arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1
          )


          Becomes:



          const (
          nCode = iota
          aCode
          cCode
          tCode
          uCode
          gCode

          stopByte = '*'
          unknown = 'X'

          // Length of the array to store code/bytes
          // uses gCode because it's the biggest of all codes
          arrayCodeSize = (gCode | gCode<<8 | gCode<<16) + 1
          )


          Check error return values



          There are fourteen places in your code where the error return value is unchecked. If this is intentional, it's common practice to assign it to the blank identifier: _.



          Here is a list of occurrences:




          • w.buf.WriteByte() in addByte()

          • w.buf.WriteByte() in addUnknown()

          • w.buf.WriteByte() in newLine()

          • Many Write and WriteByte calls in Translate()

          • feeder.idBuffer.Write() and such, in readSequenceFromFasta


          Update: These always return nil.



          Reduce complexity of Translate()



          The Translate() function is complex. It's current cyclomatic complexity is 45. (By the end of this, it's complexity is 27).



          I notice that you define a go statement, which acts as a worker. I will leave it up to you to choose a fitting name, but for now "worker()" is sufficient.



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          Translate:
          for _, startPos := range startPosition {

          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          goto Translate
          }

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But that's a ton of arguments for the worker, can more be done? Sure! But first, let's get rid of goto.



          goto



          You use a goto statement to re-run a block of code again. That says to me: recursive function.



          So, let's move this to a separate function. Again, I have no idea the proper name, and will leave that to you. For now, I'll call it getComplexityAndReverse() -- a verbose name, but it should suffice.



          func getComplexityAndAlternate(startPosition int, framesToGenerate int,
          frameIndex int, sequence encodedSequence, idSize int, w writer,
          arrayCode byte, nuclSeqLength int, options Options, reverse bool) {
          for _, startPos := range startPosition {
          if framesToGenerate[frameIndex] == 0 {
          frameIndex++
          continue
          }

          // sequence id should look like
          // >sequenceID_<frame> comment
          idEnd := bytes.IndexByte(sequence[4:idSize], ' ')
          if idEnd != -1 {
          w.buf.Write(sequence[4 : 4+idEnd])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          w.buf.Write(sequence[4+idEnd : idSize])
          } else {
          w.buf.Write(sequence[4:idSize])
          w.buf.WriteByte('_')
          w.buf.WriteByte(suffixes[frameIndex])
          }
          w.newLine()

          // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'
          // from right end of the sequence)
          w.bytesToTrim = 0
          w.currentLineLen = 0

          // read the sequence 3 letters at a time, starting at a specific position
          // corresponding to the frame
          for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          // create an uint32 from the codon, to retrieve the corresponding
          // AA from the map
          codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 2 nucleotid long, try to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 2 {

          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8

          b := arrayCode[codonCode]
          if b != byte(0) {
          w.addByte(b)
          } else {
          w.addUnknown()
          }
          }

          // the last codon is only 1 nucleotid long, no way to guess
          // the corresponding AA
          if (nuclSeqLength-startPos)%3 == 1 {
          if w.currentLineLen == maxLineSize {
          w.newLine()
          }
          w.addUnknown()
          }

          if options.Trim && w.bytesToTrim > 0 {
          // remove the last bytesToTrim bytes of the buffer
          // as they are 'X', '*' or 'n'
          w.buf.Truncate(w.buf.Len() - w.bytesToTrim)
          w.currentLineLen -= w.bytesToTrim
          }

          if w.currentLineLen != 0 {
          w.newLine()
          }
          frameIndex++
          }

          if reverse && frameIndex < 6 {

          // get the complementary sequence.
          // Basically, switch
          // A <-> T
          // C <-> G
          // N is not modified
          for i, n := range sequence[idSize:] {

          switch n {
          case aCode:
          sequence[i+idSize] = tCode
          case tCode:
          // handle both tCode and uCode
          sequence[i+idSize] = aCode
          case cCode:
          sequence[i+idSize] = gCode
          case gCode:
          sequence[i+idSize] = cCode
          default:
          //case N -> leave it
          }
          }
          // reverse the sequence
          for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {
          sequence[i], sequence[j] = sequence[j], sequence[i]
          }

          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1
          case 1:
          startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2
          case 2:
          startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0
          }
          }
          // run the same loop, but with the reverse-complemented sequence
          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, w, arrayCode, nuclSeqLength, options, reverse)
          }
          }


          And we can simplify worker() even more:



          func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,
          ctx context.Context, framesToGenerate int, arrayCode byte,
          options Options, reverse bool, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          defer wg.Done()

          startPosition := make(int, 3)

          w := &writer{
          buf: bytes.NewBuffer(nil),
          bytesToTrim: 0,
          currentLineLen: 0,
          }

          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }

          if w.buf.Len() > 0 {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }
          }


          But, there's still loads of long arguments. However, now that things are broken into functions, I recommend shortening these variable names. These long names make the code very verbose.



          With my limited knowledge, I see the following:





          • fnaSequencesfnaSeqs


          • framesToGenerateframes


          • arrayCodecodes (†)


          • startPositionstarts (or sPos)


          • sequences (changed in some places)


          † This may be incorrect jargon.



          Move things to the lowest scope



          For example, startPosition (now starts) can be declared in a lower scope.



          While we're at it, starts can be declared as such:



          starts := int{0, 1, 2}


          Resulting in:



          (Within worker())



          for sequence := range fnaSequences {

          select {
          case <-ctx.Done():
          return
          default:
          }

          frameIndex := 0
          starts := int{0, 1, 2}

          idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))
          nuclSeqLength := len(sequence) - idSize

          getComplexityAndAlternate(starts, framesToGenerate, frameIndex,
          sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)

          if w.buf.Len() > maxBufferSize {
          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          w.buf.Reset()
          }
          pool.Put(sequence)
          }


          (Within getComplexityAndAlternate())



          if !options.Alternative {
          // Staden convention: Frame -1 is the reverse-complement of the sequence
          // having the same codon phase as frame 1. Frame -2 is the same phase as
          // frame 2. Frame -3 is the same phase as frame 3
          //
          // use the matrix to keep track of the forward frame as it depends on the
          // length of the sequence
          switch nuclSeqLength % 3 {
          case 0:
          starts = int{0, 2, 1}
          case 1:
          starts = int{1, 0, 2}
          case 2:
          starts = int{2, 1, 0}
          }
          }


          Duplicate code



          The following code is used twice and should instead be a function:



          _, err := out.Write(w.buf.Bytes())
          if err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }


          Becomes a function (again, choose whatever name you want):



          func writeOrCancel(w writer, out io.Writer, errs chan error,
          cancel context.CancelFunc) {
          if _, err := out.Write(w.buf.Bytes()); err != nil {
          select {
          case errs <- fmt.Errorf("fail to write to output file: %v", err):
          default:
          }
          cancel()
          return
          }
          }


          Unroll short loops



          case "F":
          for i := 0; i < 3; i++ {
          frames[i] = 1
          }


          And



          case "R":
          for i := 3; i < 6; i++ {
          frames[i] = 1
          }
          reverse = true


          Shouldn't be loops. By using loops you use more magic numbers. Instead, unroll them:



          case "F":
          frames[0] = 1
          frames[1] = 1
          frames[2] = 1
          // ...
          case "R":
          frames[3] = 1
          frames[4] = 1
          frames[5] = 1
          reverse = true


          There's probably an even easier way to clean up the switch statement in computeFrames().



          Don't name return arguments unless it simplifies code



          In computeFrames() your return arguments are named, but they don't need to be.



          Use straightforward conditions



          // generate 2 letter code
          codes, ok := twoLetterMap[codon[0:2]]

          if !ok {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          } else {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          }


          Is more clearly:



          // generate 2 letter code
          if codes, ok := twoLetterMap[codon[0:2]]; ok {
          twoLetterMap[codon[0:2]] = append(codes, aaCode)
          } else {
          twoLetterMap[codon[0:2]] = byte{aaCode}
          }


          Exit loops early



          You can break early upon the condition codes[i] != codes[0].



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true
          for i := 0; i < len(codes); i++ {
          if codes[i] != codes[0] {
          uniqueAA = false
          }
          }
          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Becomes:



          for twoLetterCodon, codes := range twoLetterMap {
          uniqueAA := true

          for _, c := range codes {
          if c != codes[0] {
          uniqueAA = false
          break
          }
          }

          if uniqueAA {
          first := letterCode[twoLetterCodon[0]]
          second := letterCode[twoLetterCodon[1]]

          uint32Code := uint32(first) | uint32(second)<<8
          resultMap[uint32Code] = codes[0]
          }
          }


          Combine global const declarations



          It's common practice to combine them, so readers of your code don't need to search the entire document.



          Move things to separate files



          Your writer structure is relatively separate from everything else. I've moved it to a writer.go file -- moving the two constants it uses along with it.



          You can also simplify the field names. If you feel explanation is needed, that's the purpose of documentation, not the field names themselves.



          Rather than writing the following:



          w := &writer{
          buf: bytes.NewBuffer(nil),
          toTrim: 0,
          clen: 0,
          }


          We can use a newWriter() function, which follows Go APIs:



          func newWriter(buf byte) *writer {
          return &writer{buf: bytes.NewBuffer(buf)}
          }


          Also note that specifying the default values (0) is not needed.



          Conclusion: More to be done



          I was not able to address all of the things I saw, but you should get the gist.



          I would urge you to continue to break concrete operations into functions. Even though you may end up with lots of function arguments, in my opinion that's better than hard-to-read deep nesting and long functions. Perhaps once seeing how things break up, you can simplify the whole architecture of the package.



          Here is a GitHub Gist of the final code. It includes other formatting things I did not mention explicitly.



          Hope this helped. Your project looks promising, best of luck!







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited 4 hours ago

























          answered 19 hours ago









          esoteesote

          2,0991933




          2,0991933












          • Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points:
            – felix
            6 hours ago










          • (1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code
            – felix
            6 hours ago












          • @felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference.
            – esote
            4 hours ago










          • I don't talk about the performance of the modified code because it's likely the same as before.
            – esote
            4 hours ago












          • (1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code)
            – felix
            4 hours ago


















          • Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points:
            – felix
            6 hours ago










          • (1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code
            – felix
            6 hours ago












          • @felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference.
            – esote
            4 hours ago










          • I don't talk about the performance of the modified code because it's likely the same as before.
            – esote
            4 hours ago












          • (1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code)
            – felix
            4 hours ago
















          Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points:
          – felix
          6 hours ago




          Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points:
          – felix
          6 hours ago












          (1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code
          – felix
          6 hours ago






          (1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code
          – felix
          6 hours ago














          @felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference.
          – esote
          4 hours ago




          @felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference.
          – esote
          4 hours ago












          I don't talk about the performance of the modified code because it's likely the same as before.
          – esote
          4 hours ago






          I don't talk about the performance of the modified code because it's likely the same as before.
          – esote
          4 hours ago














          (1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code)
          – felix
          4 hours ago




          (1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code)
          – felix
          4 hours ago


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Code Review Stack Exchange!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          Use MathJax to format equations. MathJax reference.


          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209750%2ftranslate-nucleic-acid-sequence-into-its-corresponding-amino-acid-sequence%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          How to make a Squid Proxy server?

          Is this a new Fibonacci Identity?

          19世紀