Ytdyklly

Question

Goal of the program

The goal of the program is to translate a nucleic acid sequence into its corresponding amino acid sequence. The nucleic sequences
have to be formatted in a specific format called fasta. There is an existing implementation of this program in C here: emboss transeq

Fasta format

A fasta file looks like this:

>sequenceId comment

nucleic sequence

for example:

>Seq1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC



>Seq2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC

AATCAACATAAAA

A nucleic sequence is a string composed of the letters A, C, T, G, U, N

Expected output

A combinaison of 3 nucleic acid, named codon, gives a specif amino acid, for exemple GCT is the code for Alanine, symbolised by the letter A

With the Seq1 define above:

codon      CCT TTA TCT AAT CTT TGG AGC ATG ...

amino acid  P   L   S   N   L   W   S   M  ...

The expected output is:

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

Specific rules

The program takes 4 parameters:

clean : if true, write STOP codon as X instead of *

trim : if true, remove all X and * chars from the right end of the amino - acid sequence

alternative: different way to compute reverse frame

frame: a string value in ["1", "2", "3", "F", "-1", "-2", "-3", "R", "6" ]

The frame define the position in the nucleic sequence to start from:

-frame 1

|-frame 2

||-frame 3

|||

CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA

CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

                                                              |||

                                                              ||-frame -1

                                                              |-frame -2

                                                              -frame -3





frame "F" = "1", "2", "3"

frame "R" = "-1", "-2", "-3"

frame "6" = "1", "2", "3", "-1", "-2", "-3"

In the output file, we add the frame used to the sequenceId: sequenceId = sequenceId_frame

For example if the program is used with frame=6, the expected output is

>Seq1_1 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

PLSNLWSMSWHSWNRPQPPHPCRTQSCTNTSSDSSAIQKSISSFY

>Seq1_2 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

LYLIFGA*AGIVGTALSLLIRAEPSPVPTPLLILRPSRSLYPHFT

>Seq1_3 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

FI*SLEHELA*LEPPSASSSVQNPVLYQHLF*FFGHPEVYILILX

>Seq1_4 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

VK*GYRLLDGRRIRRGVGTGLGSARMRRLRAVPTMPAHAPKIR*R

>Seq1_5 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

KMRI*TSGWPKNQKRCWYRTGFCTDEEAEGGSNYASSCSKD*IKX

>Seq1_6 [organism=Carpodacus mexicanus] [clone=6b] actin (act) mRNA, partial cds

*NEDIDFWMAEESEEVLVQDWVLHG*GG*GRFQLCQLMLQRLDKG

>Seq2_1 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

GRYRPKXPNPSRTXPTRSPSGRRQST*X

>Seq2_2 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

VGTALXLLIRAELXQPGALLGDDNQHKX

>Seq2_3 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

*VPP*XS*SEQNXANPEPFWETTINIK

>Seq2_4 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

LC*LSSPRRAPGWXSSARIRXLRAVPT

>Seq2_5 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FMLIVVSQKGSGLA*FCSD*EX*GGTYX

>Seq2_6 [organism=uncultured bacillus sp.] [isolate=A2] corticotropin (CT) gene

FYVDCRLPEGLRVGXVLLGLGXLGRYLP

Improvements

Performance improvements

Is the code easy to read / understand ?

Is the code idiomatic ?

( full project with tests + flags handling is avaible on github: gotranseq)

code :

package transeq



import (

    "bufio"

    "bytes"

    "context"

    "encoding/binary"

    "fmt"

    "io"

    "runtime"

    "sync"

)



var (

    letterCode = map[byte]uint8{

        'A': aCode,

        'C': cCode,

        'T': tCode,

        'G': gCode,

        'N': nCode,

        'U': uCode,

    }

    standard = map[string]byte{

        "TTT": 'F',

        "TCT": 'S',

        "TAT": 'Y',

        "TGT": 'C',

        "TTC": 'F',

        "TCC": 'S',

        "TAC": 'Y',

        "TGC": 'C',

        "TTA": 'L',

        "TCA": 'S',

        "TAA": '*',

        "TGA": '*',

        "TTG": 'L',

        "TCG": 'S',

        "TAG": '*',

        "TGG": 'W',

        "CTT": 'L',

        "CCT": 'P',

        "CAT": 'H',

        "CGT": 'R',

        "CTC": 'L',

        "CCC": 'P',

        "CAC": 'H',

        "CGC": 'R',

        "CTA": 'L',

        "CCA": 'P',

        "CAA": 'Q',

        "CGA": 'R',

        "CTG": 'L',

        "CCG": 'P',

        "CAG": 'Q',

        "CGG": 'R',

        "ATT": 'I',

        "ACT": 'T',

        "AAT": 'N',

        "AGT": 'S',

        "ATC": 'I',

        "ACC": 'T',

        "AAC": 'N',

        "AGC": 'S',

        "ATA": 'I',

        "ACA": 'T',

        "AAA": 'K',

        "AGA": 'R',

        "ATG": 'M',

        "ACG": 'T',

        "AAG": 'K',

        "AGG": 'R',

        "GTT": 'V',

        "GCT": 'A',

        "GAT": 'D',

        "GGT": 'G',

        "GTC": 'V',

        "GCC": 'A',

        "GAC": 'D',

        "GGC": 'G',

        "GTA": 'V',

        "GCA": 'A',

        "GAA": 'E',

        "GGA": 'G',

        "GTG": 'V',

        "GCG": 'A',

        "GAG": 'E',

        "GGG": 'G',

    }

)



const (

    nCode = uint8(0)

    aCode = uint8(1)

    cCode = uint8(2)

    tCode = uint8(3)

    uCode = uint8(3)

    gCode = uint8(4)



    stopByte = '*'

    unknown  = 'X'

    // Length of the array to store code/bytes

    // uses gCode because it's the biggest uint8 of all codes

    arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1

)



func createCodeArray(clean bool) byte {



    resultMap := map[uint32]byte{}

    twoLetterMap := map[string]byte{}



    tmpCode := make(uint8, 4)



    for codon, aaCode := range standard {

        // generate 3 letter code

        for i := 0; i < 3; i++ {

            tmpCode[i] = letterCode[codon[i]]

        }

        // each codon is represented by an unique uint32:

        // each possible nucleotide is represented by an uint8 (255 possibility)

        // the three first bytes are the the code for each nucleotide

        // last byte is unused ( eq to uint8(0) )

        // example:

        // codon 'ACG' ==> uint32(aCode) | uint32(cCode)<<8 | uint32(gCode)<<16

        uint32Code := uint32(tmpCode[0]) | uint32(tmpCode[1])<<8 | uint32(tmpCode[2])<<16

        resultMap[uint32Code] = aaCode



        // generate 2 letter code

        codes, ok := twoLetterMap[codon[0:2]]

        if !ok {

            twoLetterMap[codon[0:2]] = byte{aaCode}

        } else {

            twoLetterMap[codon[0:2]] = append(codes, aaCode)

        }

    }

    for twoLetterCodon, codes := range twoLetterMap {

        uniqueAA := true

        for i := 0; i < len(codes); i++ {



            if codes[i] != codes[0] {

                uniqueAA = false

            }

        }

        if uniqueAA {

            first := letterCode[twoLetterCodon[0]]

            second := letterCode[twoLetterCodon[1]]



            uint32Code := uint32(first) | uint32(second)<<8

            resultMap[uint32Code] = codes[0]

        }

    }

    // if clean is specified, we want to replace all '*' by 'X' in the output

    // sequence, so replace all occurrences of '*' directly in the ref map

    if clean {

        for k, v := range resultMap {

            if v == stopByte {

                resultMap[k] = unknown

            }

        }

    }

    r := make(byte, arrayCodeSize)

    for k, v := range resultMap {

        r[k] = v

    }

    return r

}



func computeFrames(frameName string) (frames int, reverse bool, err error) {



    frames = make(int, 6)

    reverse = false



    switch frameName {

    case "1":

        frames[0] = 1

    case "2":

        frames[1] = 1

    case "3":

        frames[2] = 1

    case "F":

        for i := 0; i < 3; i++ {

            frames[i] = 1

        }

    case "-1":

        frames[3] = 1

        reverse = true

    case "-2":

        frames[4] = 1

        reverse = true

    case "-3":

        frames[5] = 1

        reverse = true

    case "R":

        for i := 3; i < 6; i++ {

            frames[i] = 1

        }

        reverse = true

    case "6":

        for i := range frames {

            frames[i] = 1

        }

        reverse = true

    default:

        err = fmt.Errorf("wrong value for -f | --frame parameter: %s", frameName)

    }

    return frames, reverse, err

}



type writer struct {

    buf            *bytes.Buffer

    currentLineLen int

    bytesToTrim    int

}



func (w *writer) addByte(b byte) {

    w.buf.WriteByte(b)

    w.currentLineLen++

    if b == stopByte || b == unknown {

        w.bytesToTrim++

    } else {

        w.bytesToTrim = 0

    }

}



func (w *writer) addUnknown() {

    w.buf.WriteByte(unknown)

    w.currentLineLen++

    w.bytesToTrim++

}



func (w *writer) newLine() {

    w.buf.WriteByte('n')

    w.currentLineLen = 0

    w.bytesToTrim++

}



const (

    // size of the buffer for writing to file

    maxBufferSize = 1024 * 1024 * 30

    // max line size for sequence

    maxLineSize = 60

    // suffixes ta add to sequence id for each frame

    suffixes = "123456"

)



// Translate read a fata file, translate each sequence to the corresponding prot sequence in the specified frame

func Translate(inputSequence io.Reader, out io.Writer, frame string, clean, trim, alternative bool) error {



    arrayCode := createCodeArray(clean)

    framesToGenerate, reverse, err := computeFrames(frame)

    if err != nil {

        return err

    }



    fnaSequences := make(chan encodedSequence, 10)

    errs := make(chan error, 1)



    var wg sync.WaitGroup

    ctx, cancel := context.WithCancel(context.Background())

    defer cancel()



    for nWorker := 0; nWorker < runtime.NumCPU(); nWorker++ {



        wg.Add(1)



        go func() {



            defer wg.Done()



            startPosition := make(int, 3)

            w := &writer{

                buf:            bytes.NewBuffer(nil),

                bytesToTrim:    0,

                currentLineLen: 0,

            }



            for sequence := range fnaSequences {



                select {

                case <-ctx.Done():

                    return

                default:

                }



                frameIndex := 0

                startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



                idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

                nuclSeqLength := len(sequence) - idSize



            Translate:

                for _, startPos := range startPosition {



                    if framesToGenerate[frameIndex] == 0 {

                        frameIndex++

                        continue

                    }



                    // sequence id should look like

                    // >sequenceID_<frame> comment

                    idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

                    if idEnd != -1 {

                        w.buf.Write(sequence[4 : 4+idEnd])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                        w.buf.Write(sequence[4+idEnd : idSize])

                    } else {

                        w.buf.Write(sequence[4:idSize])

                        w.buf.WriteByte('_')

                        w.buf.WriteByte(suffixes[frameIndex])

                    }

                    w.newLine()



                    // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

                    // from right end of the sequence)

                    w.bytesToTrim = 0

                    w.currentLineLen = 0



                    // read the sequence 3 letters at a time, starting at a specific position

                    // corresponding to the frame

                    for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        // create an uint32 from the codon, to retrieve the corresponding

                        // AA from the map

                        codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 2 nucleotid long, try to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 2 {



                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



                        b := arrayCode[codonCode]

                        if b != byte(0) {

                            w.addByte(b)

                        } else {

                            w.addUnknown()

                        }

                    }



                    // the last codon is only 1 nucleotid long, no way to guess

                    // the corresponding AA

                    if (nuclSeqLength-startPos)%3 == 1 {

                        if w.currentLineLen == maxLineSize {

                            w.newLine()

                        }

                        w.addUnknown()

                    }



                    if trim && w.bytesToTrim > 0 {

                        // remove the last bytesToTrim bytes of the buffer

                        // as they are 'X', '*' or 'n'

                        w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

                        w.currentLineLen -= w.bytesToTrim

                    }



                    if w.currentLineLen != 0 {

                        w.newLine()

                    }

                    frameIndex++

                }



                if reverse && frameIndex < 6 {



                    // get the complementary sequence.

                    // Basically, switch

                    //   A <-> T

                    //   C <-> G

                    // N is not modified

                    for i, n := range sequence[idSize:] {



                        switch n {

                        case aCode:

                            sequence[i+idSize] = tCode

                        case tCode:

                            // handle both tCode and uCode

                            sequence[i+idSize] = aCode

                        case cCode:

                            sequence[i+idSize] = gCode

                        case gCode:

                            sequence[i+idSize] = cCode

                        default:

                            //case N -> leave it

                        }

                    }

                    // reverse the sequence

                    for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

                        sequence[i], sequence[j] = sequence[j], sequence[i]

                    }



                    if !alternative {

                        // Staden convention: Frame -1 is the reverse-complement of the sequence

                        // having the same codon phase as frame 1. Frame -2 is the same phase as

                        // frame 2. Frame -3 is the same phase as frame 3

                        //

                        // use the matrix to keep track of the forward frame as it depends on the

                        // length of the sequence

                        switch nuclSeqLength % 3 {

                        case 0:

                            startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

                        case 1:

                            startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

                        case 2:

                            startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

                        }

                    }

                    // run the same loop, but with the reverse-complemented sequence

                    goto Translate

                }



                if w.buf.Len() > maxBufferSize {

                    _, err := out.Write(w.buf.Bytes())

                    if err != nil {

                        select {

                        case errs <- fmt.Errorf("fail to write to output file: %v", err):

                        default:

                        }

                        cancel()

                        return

                    }

                    w.buf.Reset()

                }

                pool.Put(sequence)

            }



            if w.buf.Len() > 0 {

                _, err := out.Write(w.buf.Bytes())

                if err != nil {

                    select {

                    case errs <- fmt.Errorf("fail to write to output file: %v", err):

                    default:

                    }

                    cancel()

                    return

                }

            }

        }()

    }

    readSequenceFromFasta(ctx, inputSequence, fnaSequences)



    wg.Wait()

    select {

    case err, ok := <-errs:

        if ok {

            return err

        }

    default:

    }

    return nil

}



func readSequenceFromFasta(ctx context.Context, inputSequence io.Reader, fnaSequences chan encodedSequence) {



    feeder := &fastaChannelFeeder{

        idBuffer:       bytes.NewBuffer(nil),

        commentBuffer:  bytes.NewBuffer(nil),

        sequenceBuffer: bytes.NewBuffer(nil),

        fastaChan:      fnaSequences,

    }

    // fasta format is:

    //

    // >sequenceID some comments on sequence

    // ACAGGCAGAGACACGACAGACGACGACACAGGAGCAGACAGCAGCAGACGACCACATATT

    // TTTGCGGTCACATGACGACTTCGGCAGCGA

    //

    // see https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

    // section 1 for details

    scanner := bufio.NewScanner(inputSequence)

Loop:

    for scanner.Scan() {



        line := scanner.Bytes()

        if len(line) == 0 {

            continue

        }

        if line[0] == '>' {



            if feeder.idBuffer.Len() > 0 {

                select {

                case <-ctx.Done():

                    break Loop

                default:

                }

                feeder.sendFasta()

            }

            feeder.reset()



            // parse the ID of the sequence. ID is formatted like this:

            // >sequenceID comments

            seqID := bytes.SplitN(line, byte{' '}, 2)

            feeder.idBuffer.Write(seqID[0])



            if len(seqID) > 1 {

                feeder.commentBuffer.WriteByte(' ')

                feeder.commentBuffer.Write(seqID[1])

            }

        } else {

            // if the line doesn't start with '>', then it's a part of the

            // nucleotide sequence, so write it to the buffer

            feeder.sequenceBuffer.Write(line)

        }

    }

    // don't forget to push last sequence

    select {

    case <-ctx.Done():

    default:

        feeder.sendFasta()

    }

    close(fnaSequences)

}



// a type to hold an encoded fasta sequence

//

//  s[0:4] stores the size of the sequence id + the size of the comment as an uint32 (little endian)

//  s[4:idSize] stores the sequence id, and the comment id there is one

//  s[idSize:] stores the nucl sequence

type encodedSequence byte



var pool = sync.Pool{

    New: func() interface{} {

        return make(encodedSequence, 512)

    },

}



func getSizedSlice(idSize, requiredSize int) encodedSequence {

    s := pool.Get().(encodedSequence)

    binary.LittleEndian.PutUint32(s[0:4], uint32(idSize))



    for len(s) < requiredSize {

        s = append(s, byte(0))

    }

    return s[0:requiredSize]

}



func (f *fastaChannelFeeder) sendFasta() {



    idSize := 4 + f.idBuffer.Len() + f.commentBuffer.Len()

    requiredSize := idSize + f.sequenceBuffer.Len()



    s := getSizedSlice(idSize, requiredSize)



    if f.commentBuffer.Len() > 0 {

        copy(s[idSize-f.commentBuffer.Len():idSize], f.commentBuffer.Bytes())

    }



    copy(s[4:4+f.idBuffer.Len()], f.idBuffer.Bytes())



    // convert the sequence of bytes to an array of uint8 codes,

    // so a codon (3 nucleotides | 3 bytes ) can be represented

    // as an uint32

    for i, b := range f.sequenceBuffer.Bytes() {



        switch b {

        case 'A':

            s[i+idSize] = aCode

        case 'C':

            s[i+idSize] = cCode

        case 'G':

            s[i+idSize] = gCode

        case 'T', 'U':

            s[i+idSize] = tCode

        case 'N':

            s[i+idSize] = nCode

        default:

            fmt.Printf("WARNING: invalid char in sequence %s: %s, ignoring", s[4:4+idSize], string(b))

        }

    }

    f.fastaChan <- s

}



type fastaChannelFeeder struct {

    idBuffer, commentBuffer, sequenceBuffer *bytes.Buffer

    fastaChan                               chan encodedSequence

}



func (f *fastaChannelFeeder) reset() {

    f.idBuffer.Reset()

    f.sequenceBuffer.Reset()

    f.commentBuffer.Reset()

}

score 7 · Accepted Answer · 2019-01-08 22:00:28Z

This covers an interesting topic. Great work!

Because I am unfamiliar with this area, I utilized your unit testing to ensure changes I make did not break functionality. If they do, I apologize and please let me know.

Utilize implicit repetition and `iota`

Rather than manually defining the type and value of nCode, aCode, etc. we an implicitly get the value using iota. This also simplifies the assignment of arrayCodeSize.

const (

    nCode = uint8(0)

    aCode = uint8(1)

    cCode = uint8(2)

    tCode = uint8(3)

    uCode = uint8(3)

    gCode = uint8(4)



    stopByte = '*'

    unknown  = 'X'

    // Length of the array to store code/bytes

    // uses gCode because it's the biggest uint8 of all codes

    arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1

)

Becomes:

const (

    nCode = iota

    aCode

    cCode

    tCode

    uCode

    gCode



    stopByte = '*'

    unknown  = 'X'



    // Length of the array to store code/bytes

    // uses gCode because it's the biggest of all codes

    arrayCodeSize = (gCode | gCode<<8 | gCode<<16) + 1

)

Check `error` return values

There are fourteen places in your code where the error return value is unchecked. If this is intentional, it's common practice to assign it to the blank identifier: _.

Here is a list of occurrences:

~~w.buf.WriteByte() in addByte()~~

~~w.buf.WriteByte() in addUnknown()~~

~~w.buf.WriteByte() in newLine()~~

~~Many Write and WriteByte calls in Translate()~~

~~feeder.idBuffer.Write() and such, in readSequenceFromFasta~~

Update: These always return nil.

Reduce complexity of `Translate()`

The Translate() function is complex. It's current cyclomatic complexity is 45. (By the end of this, it's complexity is 27).

I notice that you define a go statement, which acts as a worker. I will leave it up to you to choose a fitting name, but for now "worker()" is sufficient.

func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,

    ctx context.Context, framesToGenerate int, arrayCode byte,

    options Options, reverse bool, out io.Writer, errs chan error,

    cancel context.CancelFunc) {

    defer wg.Done()



    startPosition := make(int, 3)



    w := &writer{

        buf:            bytes.NewBuffer(nil),

        bytesToTrim:    0,

        currentLineLen: 0,

    }



    for sequence := range fnaSequences {



        select {

        case <-ctx.Done():

            return

        default:

        }



        frameIndex := 0

        startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



        idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

        nuclSeqLength := len(sequence) - idSize



    Translate:

        for _, startPos := range startPosition {



            if framesToGenerate[frameIndex] == 0 {

                frameIndex++

                continue

            }



            // sequence id should look like

            // >sequenceID_<frame> comment

            idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

            if idEnd != -1 {

                w.buf.Write(sequence[4 : 4+idEnd])

                w.buf.WriteByte('_')

                w.buf.WriteByte(suffixes[frameIndex])

                w.buf.Write(sequence[4+idEnd : idSize])

            } else {

                w.buf.Write(sequence[4:idSize])

                w.buf.WriteByte('_')

                w.buf.WriteByte(suffixes[frameIndex])

            }

            w.newLine()



            // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

            // from right end of the sequence)

            w.bytesToTrim = 0

            w.currentLineLen = 0



            // read the sequence 3 letters at a time, starting at a specific position

            // corresponding to the frame

            for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



                if w.currentLineLen == maxLineSize {

                    w.newLine()

                }

                // create an uint32 from the codon, to retrieve the corresponding

                // AA from the map

                codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



                b := arrayCode[codonCode]

                if b != byte(0) {

                    w.addByte(b)

                } else {

                    w.addUnknown()

                }

            }



            // the last codon is only 2 nucleotid long, try to guess

            // the corresponding AA

            if (nuclSeqLength-startPos)%3 == 2 {



                if w.currentLineLen == maxLineSize {

                    w.newLine()

                }

                codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



                b := arrayCode[codonCode]

                if b != byte(0) {

                    w.addByte(b)

                } else {

                    w.addUnknown()

                }

            }



            // the last codon is only 1 nucleotid long, no way to guess

            // the corresponding AA

            if (nuclSeqLength-startPos)%3 == 1 {

                if w.currentLineLen == maxLineSize {

                    w.newLine()

                }

                w.addUnknown()

            }



            if options.Trim && w.bytesToTrim > 0 {

                // remove the last bytesToTrim bytes of the buffer

                // as they are 'X', '*' or 'n'

                w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

                w.currentLineLen -= w.bytesToTrim

            }



            if w.currentLineLen != 0 {

                w.newLine()

            }

            frameIndex++

        }



        if reverse && frameIndex < 6 {



            // get the complementary sequence.

            // Basically, switch

            //   A <-> T

            //   C <-> G

            // N is not modified

            for i, n := range sequence[idSize:] {



                switch n {

                case aCode:

                    sequence[i+idSize] = tCode

                case tCode:

                    // handle both tCode and uCode

                    sequence[i+idSize] = aCode

                case cCode:

                    sequence[i+idSize] = gCode

                case gCode:

                    sequence[i+idSize] = cCode

                default:

                    //case N -> leave it

                }

            }

            // reverse the sequence

            for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

                sequence[i], sequence[j] = sequence[j], sequence[i]

            }



            if !options.Alternative {

                // Staden convention: Frame -1 is the reverse-complement of the sequence

                // having the same codon phase as frame 1. Frame -2 is the same phase as

                // frame 2. Frame -3 is the same phase as frame 3

                //

                // use the matrix to keep track of the forward frame as it depends on the

                // length of the sequence

                switch nuclSeqLength % 3 {

                case 0:

                    startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

                case 1:

                    startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

                case 2:

                    startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

                }

            }

            // run the same loop, but with the reverse-complemented sequence

            goto Translate

        }



        if w.buf.Len() > maxBufferSize {

            _, err := out.Write(w.buf.Bytes())

            if err != nil {

                select {

                case errs <- fmt.Errorf("fail to write to output file: %v", err):

                default:

                }

                cancel()

                return

            }

            w.buf.Reset()

        }

        pool.Put(sequence)

    }



    if w.buf.Len() > 0 {

        _, err := out.Write(w.buf.Bytes())

        if err != nil {

            select {

            case errs <- fmt.Errorf("fail to write to output file: %v", err):

            default:

            }

            cancel()

            return

        }

    }

}

But that's a ton of arguments for the worker, can more be done? Sure! But first, let's get rid of goto.

`goto`

You use a goto statement to re-run a block of code again. That says to me: recursive function.

So, let's move this to a separate function. Again, I have no idea the proper name, and will leave that to you. For now, I'll call it getComplexityAndReverse() -- a verbose name, but it should suffice.

func getComplexityAndAlternate(startPosition int, framesToGenerate int,

    frameIndex int, sequence encodedSequence, idSize int, w writer,

    arrayCode byte, nuclSeqLength int, options Options, reverse bool) {

    for _, startPos := range startPosition {

        if framesToGenerate[frameIndex] == 0 {

            frameIndex++

            continue

        }



        // sequence id should look like

        // >sequenceID_<frame> comment

        idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

        if idEnd != -1 {

            w.buf.Write(sequence[4 : 4+idEnd])

            w.buf.WriteByte('_')

            w.buf.WriteByte(suffixes[frameIndex])

            w.buf.Write(sequence[4+idEnd : idSize])

        } else {

            w.buf.Write(sequence[4:idSize])

            w.buf.WriteByte('_')

            w.buf.WriteByte(suffixes[frameIndex])

        }

        w.newLine()



        // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

        // from right end of the sequence)

        w.bytesToTrim = 0

        w.currentLineLen = 0



        // read the sequence 3 letters at a time, starting at a specific position

        // corresponding to the frame

        for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



            if w.currentLineLen == maxLineSize {

                w.newLine()

            }

            // create an uint32 from the codon, to retrieve the corresponding

            // AA from the map

            codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



            b := arrayCode[codonCode]

            if b != byte(0) {

                w.addByte(b)

            } else {

                w.addUnknown()

            }

        }



        // the last codon is only 2 nucleotid long, try to guess

        // the corresponding AA

        if (nuclSeqLength-startPos)%3 == 2 {



            if w.currentLineLen == maxLineSize {

                w.newLine()

            }

            codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



            b := arrayCode[codonCode]

            if b != byte(0) {

                w.addByte(b)

            } else {

                w.addUnknown()

            }

        }



        // the last codon is only 1 nucleotid long, no way to guess

        // the corresponding AA

        if (nuclSeqLength-startPos)%3 == 1 {

            if w.currentLineLen == maxLineSize {

                w.newLine()

            }

            w.addUnknown()

        }



        if options.Trim && w.bytesToTrim > 0 {

            // remove the last bytesToTrim bytes of the buffer

            // as they are 'X', '*' or 'n'

            w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

            w.currentLineLen -= w.bytesToTrim

        }



        if w.currentLineLen != 0 {

            w.newLine()

        }

        frameIndex++

    }



    if reverse && frameIndex < 6 {



        // get the complementary sequence.

        // Basically, switch

        //   A <-> T

        //   C <-> G

        // N is not modified

        for i, n := range sequence[idSize:] {



            switch n {

            case aCode:

                sequence[i+idSize] = tCode

            case tCode:

                // handle both tCode and uCode

                sequence[i+idSize] = aCode

            case cCode:

                sequence[i+idSize] = gCode

            case gCode:

                sequence[i+idSize] = cCode

            default:

                //case N -> leave it

            }

        }

        // reverse the sequence

        for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

            sequence[i], sequence[j] = sequence[j], sequence[i]

        }



        if !options.Alternative {

            // Staden convention: Frame -1 is the reverse-complement of the sequence

            // having the same codon phase as frame 1. Frame -2 is the same phase as

            // frame 2. Frame -3 is the same phase as frame 3

            //

            // use the matrix to keep track of the forward frame as it depends on the

            // length of the sequence

            switch nuclSeqLength % 3 {

            case 0:

                startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

            case 1:

                startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

            case 2:

                startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

            }

        }

        // run the same loop, but with the reverse-complemented sequence

        getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,

            sequence, idSize, w, arrayCode, nuclSeqLength, options, reverse)

    }

}

And we can simplify worker() even more:

func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,

    ctx context.Context, framesToGenerate int, arrayCode byte,

    options Options, reverse bool, out io.Writer, errs chan error,

    cancel context.CancelFunc) {

    defer wg.Done()



    startPosition := make(int, 3)



    w := &writer{

        buf:            bytes.NewBuffer(nil),

        bytesToTrim:    0,

        currentLineLen: 0,

    }



    for sequence := range fnaSequences {



        select {

        case <-ctx.Done():

            return

        default:

        }



        frameIndex := 0

        startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



        idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

        nuclSeqLength := len(sequence) - idSize



        getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,

            sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)



        if w.buf.Len() > maxBufferSize {

            _, err := out.Write(w.buf.Bytes())

            if err != nil {

                select {

                case errs <- fmt.Errorf("fail to write to output file: %v", err):

                default:

                }

                cancel()

                return

            }

            w.buf.Reset()

        }

        pool.Put(sequence)

    }



    if w.buf.Len() > 0 {

        _, err := out.Write(w.buf.Bytes())

        if err != nil {

            select {

            case errs <- fmt.Errorf("fail to write to output file: %v", err):

            default:

            }

            cancel()

            return

        }

    }

}

But, there's still loads of long arguments. However, now that things are broken into functions, I recommend shortening these variable names. These long names make the code very verbose.

With my limited knowledge, I see the following:

fnaSequences → fnaSeqs

framesToGenerate → frames

arrayCode → codes (†)

startPosition → starts (or sPos)

sequence → s (changed in some places)

† This may be incorrect jargon.

Move things to the lowest scope

For example, startPosition (now starts) can be declared in a lower scope.

While we're at it, starts can be declared as such:

starts := int{0, 1, 2}

Resulting in:

(Within worker())

for sequence := range fnaSequences {



    select {

    case <-ctx.Done():

        return

    default:

    }



    frameIndex := 0

    starts := int{0, 1, 2}



    idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

    nuclSeqLength := len(sequence) - idSize



    getComplexityAndAlternate(starts, framesToGenerate, frameIndex,

        sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)



    if w.buf.Len() > maxBufferSize {

        _, err := out.Write(w.buf.Bytes())

        if err != nil {

            select {

            case errs <- fmt.Errorf("fail to write to output file: %v", err):

            default:

            }

            cancel()

            return

        }

        w.buf.Reset()

    }

    pool.Put(sequence)

}

(Within getComplexityAndAlternate())

if !options.Alternative {

    // Staden convention: Frame -1 is the reverse-complement of the sequence

    // having the same codon phase as frame 1. Frame -2 is the same phase as

    // frame 2. Frame -3 is the same phase as frame 3

    //

    // use the matrix to keep track of the forward frame as it depends on the

    // length of the sequence

    switch nuclSeqLength % 3 {

    case 0:

        starts = int{0, 2, 1}

    case 1:

        starts = int{1, 0, 2}

    case 2:

        starts = int{2, 1, 0}

    }

}

Duplicate code

The following code is used twice and should instead be a function:

_, err := out.Write(w.buf.Bytes())

if err != nil {

    select {

    case errs <- fmt.Errorf("fail to write to output file: %v", err):

    default:

    }

    cancel()

    return

}

Becomes a function (again, choose whatever name you want):

func writeOrCancel(w writer, out io.Writer, errs chan error,

    cancel context.CancelFunc) {

    if _, err := out.Write(w.buf.Bytes()); err != nil {

        select {

        case errs <- fmt.Errorf("fail to write to output file: %v", err):

        default:

        }

        cancel()

        return

    }

}

Unroll short loops

case "F":

    for i := 0; i < 3; i++ {

        frames[i] = 1

    }

And

case "R":

    for i := 3; i < 6; i++ {

        frames[i] = 1

    }

    reverse = true

Shouldn't be loops. By using loops you use more magic numbers. Instead, unroll them:

case "F":

    frames[0] = 1

    frames[1] = 1

    frames[2] = 1

// ...

case "R":

    frames[3] = 1

    frames[4] = 1

    frames[5] = 1

    reverse = true

There's probably an even easier way to clean up the switch statement in computeFrames().

Don't name return arguments unless it simplifies code

In computeFrames() your return arguments are named, but they don't need to be.

Use straightforward conditions

// generate 2 letter code

codes, ok := twoLetterMap[codon[0:2]]



if !ok {

    twoLetterMap[codon[0:2]] = byte{aaCode}

} else {

    twoLetterMap[codon[0:2]] = append(codes, aaCode)

}

Is more clearly:

// generate 2 letter code

if codes, ok := twoLetterMap[codon[0:2]]; ok {

    twoLetterMap[codon[0:2]] = append(codes, aaCode)

} else {

    twoLetterMap[codon[0:2]] = byte{aaCode}

}

Exit loops early

You can break early upon the condition codes[i] != codes[0].

for twoLetterCodon, codes := range twoLetterMap {

    uniqueAA := true

    for i := 0; i < len(codes); i++ {

        if codes[i] != codes[0] {

            uniqueAA = false

        }

    }

    if uniqueAA {

        first := letterCode[twoLetterCodon[0]]

        second := letterCode[twoLetterCodon[1]]



        uint32Code := uint32(first) | uint32(second)<<8

        resultMap[uint32Code] = codes[0]

    }

}

Becomes:

for twoLetterCodon, codes := range twoLetterMap {

    uniqueAA := true



    for _, c := range codes {

        if c != codes[0] {

            uniqueAA = false

            break

        }

    }



    if uniqueAA {

        first := letterCode[twoLetterCodon[0]]

        second := letterCode[twoLetterCodon[1]]



        uint32Code := uint32(first) | uint32(second)<<8

        resultMap[uint32Code] = codes[0]

    }

}

Combine global `const` declarations

It's common practice to combine them, so readers of your code don't need to search the entire document.

Move things to separate files

Your writer structure is relatively separate from everything else. I've moved it to a writer.go file -- moving the two constants it uses along with it.

You can also simplify the field names. If you feel explanation is needed, that's the purpose of documentation, not the field names themselves.

Rather than writing the following:

w := &writer{

    buf:    bytes.NewBuffer(nil),

    toTrim: 0,

    clen:   0,

}

We can use a newWriter() function, which follows Go APIs:

func newWriter(buf byte) *writer {

    return &writer{buf: bytes.NewBuffer(buf)}

}

Also note that specifying the default values (0) is not needed.

Conclusion: More to be done

I was not able to address all of the things I saw, but you should get the gist.

I would urge you to continue to break concrete operations into functions. Even though you may end up with lots of function arguments, in my opinion that's better than hard-to-read deep nesting and long functions. Perhaps once seeing how things break up, you can simplify the whole architecture of the package.

Here is a GitHub Gist of the final code. It includes other formatting things I did not mention explicitly.

Hope this helped. Your project looks promising, best of luck!

Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points: — 6 hours ago
(1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code — 6 hours ago
@felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference. — 4 hours ago
I don't talk about the performance of the modified code because it's likely the same as before. — 4 hours ago
(1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code) — 4 hours ago

score 7 · Accepted Answer · 2019-01-08 22:00:28Z

This covers an interesting topic. Great work!

Because I am unfamiliar with this area, I utilized your unit testing to ensure changes I make did not break functionality. If they do, I apologize and please let me know.

Utilize implicit repetition and `iota`

Rather than manually defining the type and value of nCode, aCode, etc. we an implicitly get the value using iota. This also simplifies the assignment of arrayCodeSize.

const (

    nCode = uint8(0)

    aCode = uint8(1)

    cCode = uint8(2)

    tCode = uint8(3)

    uCode = uint8(3)

    gCode = uint8(4)



    stopByte = '*'

    unknown  = 'X'

    // Length of the array to store code/bytes

    // uses gCode because it's the biggest uint8 of all codes

    arrayCodeSize = (uint32(gCode) | uint32(gCode)<<8 | uint32(gCode)<<16) + 1

)

Becomes:

const (

    nCode = iota

    aCode

    cCode

    tCode

    uCode

    gCode



    stopByte = '*'

    unknown  = 'X'



    // Length of the array to store code/bytes

    // uses gCode because it's the biggest of all codes

    arrayCodeSize = (gCode | gCode<<8 | gCode<<16) + 1

)

Check `error` return values

There are fourteen places in your code where the error return value is unchecked. If this is intentional, it's common practice to assign it to the blank identifier: _.

Here is a list of occurrences:

~~w.buf.WriteByte() in addByte()~~

~~w.buf.WriteByte() in addUnknown()~~

~~w.buf.WriteByte() in newLine()~~

~~Many Write and WriteByte calls in Translate()~~

~~feeder.idBuffer.Write() and such, in readSequenceFromFasta~~

Update: These always return nil.

Reduce complexity of `Translate()`

The Translate() function is complex. It's current cyclomatic complexity is 45. (By the end of this, it's complexity is 27).

I notice that you define a go statement, which acts as a worker. I will leave it up to you to choose a fitting name, but for now "worker()" is sufficient.

func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,

    ctx context.Context, framesToGenerate int, arrayCode byte,

    options Options, reverse bool, out io.Writer, errs chan error,

    cancel context.CancelFunc) {

    defer wg.Done()



    startPosition := make(int, 3)



    w := &writer{

        buf:            bytes.NewBuffer(nil),

        bytesToTrim:    0,

        currentLineLen: 0,

    }



    for sequence := range fnaSequences {



        select {

        case <-ctx.Done():

            return

        default:

        }



        frameIndex := 0

        startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



        idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

        nuclSeqLength := len(sequence) - idSize



    Translate:

        for _, startPos := range startPosition {



            if framesToGenerate[frameIndex] == 0 {

                frameIndex++

                continue

            }



            // sequence id should look like

            // >sequenceID_<frame> comment

            idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

            if idEnd != -1 {

                w.buf.Write(sequence[4 : 4+idEnd])

                w.buf.WriteByte('_')

                w.buf.WriteByte(suffixes[frameIndex])

                w.buf.Write(sequence[4+idEnd : idSize])

            } else {

                w.buf.Write(sequence[4:idSize])

                w.buf.WriteByte('_')

                w.buf.WriteByte(suffixes[frameIndex])

            }

            w.newLine()



            // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

            // from right end of the sequence)

            w.bytesToTrim = 0

            w.currentLineLen = 0



            // read the sequence 3 letters at a time, starting at a specific position

            // corresponding to the frame

            for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



                if w.currentLineLen == maxLineSize {

                    w.newLine()

                }

                // create an uint32 from the codon, to retrieve the corresponding

                // AA from the map

                codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



                b := arrayCode[codonCode]

                if b != byte(0) {

                    w.addByte(b)

                } else {

                    w.addUnknown()

                }

            }



            // the last codon is only 2 nucleotid long, try to guess

            // the corresponding AA

            if (nuclSeqLength-startPos)%3 == 2 {



                if w.currentLineLen == maxLineSize {

                    w.newLine()

                }

                codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



                b := arrayCode[codonCode]

                if b != byte(0) {

                    w.addByte(b)

                } else {

                    w.addUnknown()

                }

            }



            // the last codon is only 1 nucleotid long, no way to guess

            // the corresponding AA

            if (nuclSeqLength-startPos)%3 == 1 {

                if w.currentLineLen == maxLineSize {

                    w.newLine()

                }

                w.addUnknown()

            }



            if options.Trim && w.bytesToTrim > 0 {

                // remove the last bytesToTrim bytes of the buffer

                // as they are 'X', '*' or 'n'

                w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

                w.currentLineLen -= w.bytesToTrim

            }



            if w.currentLineLen != 0 {

                w.newLine()

            }

            frameIndex++

        }



        if reverse && frameIndex < 6 {



            // get the complementary sequence.

            // Basically, switch

            //   A <-> T

            //   C <-> G

            // N is not modified

            for i, n := range sequence[idSize:] {



                switch n {

                case aCode:

                    sequence[i+idSize] = tCode

                case tCode:

                    // handle both tCode and uCode

                    sequence[i+idSize] = aCode

                case cCode:

                    sequence[i+idSize] = gCode

                case gCode:

                    sequence[i+idSize] = cCode

                default:

                    //case N -> leave it

                }

            }

            // reverse the sequence

            for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

                sequence[i], sequence[j] = sequence[j], sequence[i]

            }



            if !options.Alternative {

                // Staden convention: Frame -1 is the reverse-complement of the sequence

                // having the same codon phase as frame 1. Frame -2 is the same phase as

                // frame 2. Frame -3 is the same phase as frame 3

                //

                // use the matrix to keep track of the forward frame as it depends on the

                // length of the sequence

                switch nuclSeqLength % 3 {

                case 0:

                    startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

                case 1:

                    startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

                case 2:

                    startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

                }

            }

            // run the same loop, but with the reverse-complemented sequence

            goto Translate

        }



        if w.buf.Len() > maxBufferSize {

            _, err := out.Write(w.buf.Bytes())

            if err != nil {

                select {

                case errs <- fmt.Errorf("fail to write to output file: %v", err):

                default:

                }

                cancel()

                return

            }

            w.buf.Reset()

        }

        pool.Put(sequence)

    }



    if w.buf.Len() > 0 {

        _, err := out.Write(w.buf.Bytes())

        if err != nil {

            select {

            case errs <- fmt.Errorf("fail to write to output file: %v", err):

            default:

            }

            cancel()

            return

        }

    }

}

But that's a ton of arguments for the worker, can more be done? Sure! But first, let's get rid of goto.

`goto`

You use a goto statement to re-run a block of code again. That says to me: recursive function.

So, let's move this to a separate function. Again, I have no idea the proper name, and will leave that to you. For now, I'll call it getComplexityAndReverse() -- a verbose name, but it should suffice.

func getComplexityAndAlternate(startPosition int, framesToGenerate int,

    frameIndex int, sequence encodedSequence, idSize int, w writer,

    arrayCode byte, nuclSeqLength int, options Options, reverse bool) {

    for _, startPos := range startPosition {

        if framesToGenerate[frameIndex] == 0 {

            frameIndex++

            continue

        }



        // sequence id should look like

        // >sequenceID_<frame> comment

        idEnd := bytes.IndexByte(sequence[4:idSize], ' ')

        if idEnd != -1 {

            w.buf.Write(sequence[4 : 4+idEnd])

            w.buf.WriteByte('_')

            w.buf.WriteByte(suffixes[frameIndex])

            w.buf.Write(sequence[4+idEnd : idSize])

        } else {

            w.buf.Write(sequence[4:idSize])

            w.buf.WriteByte('_')

            w.buf.WriteByte(suffixes[frameIndex])

        }

        w.newLine()



        // if in trim mode, nb of bytes to trim (nb of successive 'X', '*' and 'n'

        // from right end of the sequence)

        w.bytesToTrim = 0

        w.currentLineLen = 0



        // read the sequence 3 letters at a time, starting at a specific position

        // corresponding to the frame

        for pos := startPos + 2 + idSize; pos < len(sequence); pos += 3 {



            if w.currentLineLen == maxLineSize {

                w.newLine()

            }

            // create an uint32 from the codon, to retrieve the corresponding

            // AA from the map

            codonCode := uint32(sequence[pos-2]) | uint32(sequence[pos-1])<<8 | uint32(sequence[pos])<<16



            b := arrayCode[codonCode]

            if b != byte(0) {

                w.addByte(b)

            } else {

                w.addUnknown()

            }

        }



        // the last codon is only 2 nucleotid long, try to guess

        // the corresponding AA

        if (nuclSeqLength-startPos)%3 == 2 {



            if w.currentLineLen == maxLineSize {

                w.newLine()

            }

            codonCode := uint32(sequence[len(sequence)-2]) | uint32(sequence[len(sequence)-1])<<8



            b := arrayCode[codonCode]

            if b != byte(0) {

                w.addByte(b)

            } else {

                w.addUnknown()

            }

        }



        // the last codon is only 1 nucleotid long, no way to guess

        // the corresponding AA

        if (nuclSeqLength-startPos)%3 == 1 {

            if w.currentLineLen == maxLineSize {

                w.newLine()

            }

            w.addUnknown()

        }



        if options.Trim && w.bytesToTrim > 0 {

            // remove the last bytesToTrim bytes of the buffer

            // as they are 'X', '*' or 'n'

            w.buf.Truncate(w.buf.Len() - w.bytesToTrim)

            w.currentLineLen -= w.bytesToTrim

        }



        if w.currentLineLen != 0 {

            w.newLine()

        }

        frameIndex++

    }



    if reverse && frameIndex < 6 {



        // get the complementary sequence.

        // Basically, switch

        //   A <-> T

        //   C <-> G

        // N is not modified

        for i, n := range sequence[idSize:] {



            switch n {

            case aCode:

                sequence[i+idSize] = tCode

            case tCode:

                // handle both tCode and uCode

                sequence[i+idSize] = aCode

            case cCode:

                sequence[i+idSize] = gCode

            case gCode:

                sequence[i+idSize] = cCode

            default:

                //case N -> leave it

            }

        }

        // reverse the sequence

        for i, j := idSize, len(sequence)-1; i < j; i, j = i+1, j-1 {

            sequence[i], sequence[j] = sequence[j], sequence[i]

        }



        if !options.Alternative {

            // Staden convention: Frame -1 is the reverse-complement of the sequence

            // having the same codon phase as frame 1. Frame -2 is the same phase as

            // frame 2. Frame -3 is the same phase as frame 3

            //

            // use the matrix to keep track of the forward frame as it depends on the

            // length of the sequence

            switch nuclSeqLength % 3 {

            case 0:

                startPosition[0], startPosition[1], startPosition[2] = 0, 2, 1

            case 1:

                startPosition[0], startPosition[1], startPosition[2] = 1, 0, 2

            case 2:

                startPosition[0], startPosition[1], startPosition[2] = 2, 1, 0

            }

        }

        // run the same loop, but with the reverse-complemented sequence

        getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,

            sequence, idSize, w, arrayCode, nuclSeqLength, options, reverse)

    }

}

And we can simplify worker() even more:

func worker(wg *sync.WaitGroup, fnaSequences chan encodedSequence,

    ctx context.Context, framesToGenerate int, arrayCode byte,

    options Options, reverse bool, out io.Writer, errs chan error,

    cancel context.CancelFunc) {

    defer wg.Done()



    startPosition := make(int, 3)



    w := &writer{

        buf:            bytes.NewBuffer(nil),

        bytesToTrim:    0,

        currentLineLen: 0,

    }



    for sequence := range fnaSequences {



        select {

        case <-ctx.Done():

            return

        default:

        }



        frameIndex := 0

        startPosition[0], startPosition[1], startPosition[2] = 0, 1, 2



        idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

        nuclSeqLength := len(sequence) - idSize



        getComplexityAndAlternate(startPosition, framesToGenerate, frameIndex,

            sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)



        if w.buf.Len() > maxBufferSize {

            _, err := out.Write(w.buf.Bytes())

            if err != nil {

                select {

                case errs <- fmt.Errorf("fail to write to output file: %v", err):

                default:

                }

                cancel()

                return

            }

            w.buf.Reset()

        }

        pool.Put(sequence)

    }



    if w.buf.Len() > 0 {

        _, err := out.Write(w.buf.Bytes())

        if err != nil {

            select {

            case errs <- fmt.Errorf("fail to write to output file: %v", err):

            default:

            }

            cancel()

            return

        }

    }

}

But, there's still loads of long arguments. However, now that things are broken into functions, I recommend shortening these variable names. These long names make the code very verbose.

With my limited knowledge, I see the following:

fnaSequences → fnaSeqs

framesToGenerate → frames

arrayCode → codes (†)

startPosition → starts (or sPos)

sequence → s (changed in some places)

† This may be incorrect jargon.

Move things to the lowest scope

For example, startPosition (now starts) can be declared in a lower scope.

While we're at it, starts can be declared as such:

starts := int{0, 1, 2}

Resulting in:

(Within worker())

for sequence := range fnaSequences {



    select {

    case <-ctx.Done():

        return

    default:

    }



    frameIndex := 0

    starts := int{0, 1, 2}



    idSize := int(binary.LittleEndian.Uint32(sequence[0:4]))

    nuclSeqLength := len(sequence) - idSize



    getComplexityAndAlternate(starts, framesToGenerate, frameIndex,

        sequence, idSize, *w, arrayCode, nuclSeqLength, options, reverse)



    if w.buf.Len() > maxBufferSize {

        _, err := out.Write(w.buf.Bytes())

        if err != nil {

            select {

            case errs <- fmt.Errorf("fail to write to output file: %v", err):

            default:

            }

            cancel()

            return

        }

        w.buf.Reset()

    }

    pool.Put(sequence)

}

(Within getComplexityAndAlternate())

if !options.Alternative {

    // Staden convention: Frame -1 is the reverse-complement of the sequence

    // having the same codon phase as frame 1. Frame -2 is the same phase as

    // frame 2. Frame -3 is the same phase as frame 3

    //

    // use the matrix to keep track of the forward frame as it depends on the

    // length of the sequence

    switch nuclSeqLength % 3 {

    case 0:

        starts = int{0, 2, 1}

    case 1:

        starts = int{1, 0, 2}

    case 2:

        starts = int{2, 1, 0}

    }

}

Duplicate code

The following code is used twice and should instead be a function:

_, err := out.Write(w.buf.Bytes())

if err != nil {

    select {

    case errs <- fmt.Errorf("fail to write to output file: %v", err):

    default:

    }

    cancel()

    return

}

Becomes a function (again, choose whatever name you want):

func writeOrCancel(w writer, out io.Writer, errs chan error,

    cancel context.CancelFunc) {

    if _, err := out.Write(w.buf.Bytes()); err != nil {

        select {

        case errs <- fmt.Errorf("fail to write to output file: %v", err):

        default:

        }

        cancel()

        return

    }

}

Unroll short loops

case "F":

    for i := 0; i < 3; i++ {

        frames[i] = 1

    }

And

case "R":

    for i := 3; i < 6; i++ {

        frames[i] = 1

    }

    reverse = true

Shouldn't be loops. By using loops you use more magic numbers. Instead, unroll them:

case "F":

    frames[0] = 1

    frames[1] = 1

    frames[2] = 1

// ...

case "R":

    frames[3] = 1

    frames[4] = 1

    frames[5] = 1

    reverse = true

There's probably an even easier way to clean up the switch statement in computeFrames().

Don't name return arguments unless it simplifies code

In computeFrames() your return arguments are named, but they don't need to be.

Use straightforward conditions

// generate 2 letter code

codes, ok := twoLetterMap[codon[0:2]]



if !ok {

    twoLetterMap[codon[0:2]] = byte{aaCode}

} else {

    twoLetterMap[codon[0:2]] = append(codes, aaCode)

}

Is more clearly:

// generate 2 letter code

if codes, ok := twoLetterMap[codon[0:2]]; ok {

    twoLetterMap[codon[0:2]] = append(codes, aaCode)

} else {

    twoLetterMap[codon[0:2]] = byte{aaCode}

}

Exit loops early

You can break early upon the condition codes[i] != codes[0].

for twoLetterCodon, codes := range twoLetterMap {

    uniqueAA := true

    for i := 0; i < len(codes); i++ {

        if codes[i] != codes[0] {

            uniqueAA = false

        }

    }

    if uniqueAA {

        first := letterCode[twoLetterCodon[0]]

        second := letterCode[twoLetterCodon[1]]



        uint32Code := uint32(first) | uint32(second)<<8

        resultMap[uint32Code] = codes[0]

    }

}

Becomes:

for twoLetterCodon, codes := range twoLetterMap {

    uniqueAA := true



    for _, c := range codes {

        if c != codes[0] {

            uniqueAA = false

            break

        }

    }



    if uniqueAA {

        first := letterCode[twoLetterCodon[0]]

        second := letterCode[twoLetterCodon[1]]



        uint32Code := uint32(first) | uint32(second)<<8

        resultMap[uint32Code] = codes[0]

    }

}

Combine global `const` declarations

It's common practice to combine them, so readers of your code don't need to search the entire document.

Move things to separate files

Your writer structure is relatively separate from everything else. I've moved it to a writer.go file -- moving the two constants it uses along with it.

You can also simplify the field names. If you feel explanation is needed, that's the purpose of documentation, not the field names themselves.

Rather than writing the following:

w := &writer{

    buf:    bytes.NewBuffer(nil),

    toTrim: 0,

    clen:   0,

}

We can use a newWriter() function, which follows Go APIs:

func newWriter(buf byte) *writer {

    return &writer{buf: bytes.NewBuffer(buf)}

}

Also note that specifying the default values (0) is not needed.

Conclusion: More to be done

I was not able to address all of the things I saw, but you should get the gist.

I would urge you to continue to break concrete operations into functions. Even though you may end up with lots of function arguments, in my opinion that's better than hard-to-read deep nesting and long functions. Perhaps once seeing how things break up, you can simplify the whole architecture of the package.

Here is a GitHub Gist of the final code. It includes other formatting things I did not mention explicitly.

Hope this helped. Your project looks promising, best of luck!

Thanks for the answer, there is some interesting ideas ! However, I disagree with a few points: — 6 hours ago
(1) can't use iota this way because tCode = uCode = uint(3) (2) buf.Write() and buf.WriteByte() always return a nil error, cf godoc bytes, so no need to check for it (3) I agree that high cyclomatic complexity is bad, but methods with 10 (!!) arguments is not a good solution (4) starts = int{0, 2, 1} allocates a new array instead of reusing the same slice. Not really sure it's worth it as it's in the hot spot of the code ! And this answer doesn't talk at all about the performance of the modified code — 6 hours ago
@felix (1) If you check, you will see that tcode is 3, and ucode is 4. That's how implicit repetition works in Go. (2) Fair point (3) In my opinion, it's better than code that's hard to read and understand, and definitely better than tons of nesting. If you split things into functions as I recommend, you may notice that functions (such as the one with 10 arguments) can be split up to do separate things. I left that work for you to implement. (4) If you do profiling, I highly doubt it will cause any noticeable performance difference. — 4 hours ago
I don't talk about the performance of the modified code because it's likely the same as before. — 4 hours ago
(1) sorry it wasn't clear enough, I meant both tCode and uCode have to be equal to uint(3) (cf initial code) — 4 hours ago

Translate nucleic acid sequence into its corresponding amino acid sequence

Goal of the program

Fasta format

Expected output

Specific rules

Improvements

This question has an open bounty worth +500 reputation from felix ending in 6 days.

Goal of the program

Fasta format

Expected output

Specific rules

Improvements

This question has an open bounty worth +500 reputation from felix ending in 6 days.

Goal of the program

Fasta format

Expected output

Specific rules

Improvements

Goal of the program

Fasta format

Expected output

Specific rules

Improvements

This question has an open bounty worth +500 reputation from felix ending in 6 days.

This question has an open bounty worth +500 reputation from felix ending in 6 days.

1 Answer 1

Utilize implicit repetition and iota

Check error return values

Reduce complexity of Translate()

goto

Move things to the lowest scope

Duplicate code

Unroll short loops

Don't name return arguments unless it simplifies code

Use straightforward conditions

Exit loops early

Combine global const declarations

Move things to separate files

Conclusion: More to be done

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Utilize implicit repetition and iota

Check error return values

Reduce complexity of Translate()

goto

Move things to the lowest scope

Duplicate code

Unroll short loops

Don't name return arguments unless it simplifies code

Use straightforward conditions

Exit loops early

Combine global const declarations

Move things to separate files

Conclusion: More to be done

Utilize implicit repetition and iota

Check error return values

Reduce complexity of Translate()

goto

Move things to the lowest scope

Duplicate code

Unroll short loops

Don't name return arguments unless it simplifies code

Use straightforward conditions

Exit loops early

Combine global const declarations

Move things to separate files

Conclusion: More to be done

Utilize implicit repetition and iota

Check error return values

Reduce complexity of Translate()

goto

Move things to the lowest scope

Duplicate code

Unroll short loops

Don't name return arguments unless it simplifies code

Use straightforward conditions

This question has an open bounty worth +500
reputation from felix ending in 6 days.

This question has an open bounty worth +500
reputation from felix ending in 6 days.

This question has an open bounty worth +500
reputation from felix ending in 6 days.

This question has an open bounty worth +500
reputation from felix ending in 6 days.

1 Answer
1

Utilize implicit repetition and `iota`

Check `error` return values

Reduce complexity of `Translate()`

`goto`

Combine global `const` declarations

1 Answer
1

1 Answer
1

Utilize implicit repetition and `iota`

Check `error` return values

Reduce complexity of `Translate()`

`goto`

Combine global `const` declarations

Utilize implicit repetition and `iota`

Check `error` return values

Reduce complexity of `Translate()`

`goto`

Combine global `const` declarations

Utilize implicit repetition and `iota`

Check `error` return values

Reduce complexity of `Translate()`

`goto`

Combine global `const` declarations

Utilize implicit repetition and `iota`

Check `error` return values

Reduce complexity of `Translate()`

`goto`

Combine global `const` declarations