Optimizing custom JPEG decompression












The aim of my code is to decode an image format that is based on the JPEG compression/decompression chain. As far as I know it is not compatible with the standard JPEG flow, since every library I have tried fails to decode the data properly. I am only interested in decompression here. The decoder follows the standard pattern:




  • Read Huffman values -> Like normal JPEG

  • unzigzag -> Like normal JPEG

  • Dequantize -> Like normal JPEG

  • IDCT -> Almost like normal JPEG, but different range/clamping

  • Color Space conversion -> Custom, not YCbCr


For one 8x8 block, everything except the last step currently looks like this:



int16_t processBlock(int16_t prevDc, BitStream &stream, const tHuffTable &dcTable, const tHuffTable &acTable,
                     float *quantTable, bool isLuminance, int16_t *outBlock) {
    int16_t workBlock[64] = {0};
    int16_t curDc = decodeBlock(stream, workBlock, dcTable, acTable, prevDc);
    unzigzag(workBlock);
    dequantize(workBlock, quantTable);
    idct(outBlock, workBlock, isLuminance);
    return curDc;
}


After this, outBlock is passed to the color space conversion, which depends on the image type.
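The unzigzag and dequantize helpers called in processBlock are not shown here. As a rough sketch only, assuming the standard JPEG zigzag order, a quantization table stored in natural (row-major) order, and truncation when converting back to int16_t (all three are assumptions, as is the kZigzag name), they could look like this:

```cpp
#include <cstdint>
#include <cstring>

// Standard JPEG zigzag order: kZigzag[i] is the row-major position of the
// i-th coefficient read from the entropy-coded stream.
// (kZigzag is a hypothetical name, not taken from the original code.)
static const uint8_t kZigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

// Reorders 64 coefficients in place from stream order to row-major order.
void unzigzag(int16_t *block) {
    int16_t tmp[64];
    for (int i = 0; i < 64; ++i) {
        tmp[kZigzag[i]] = block[i];
    }
    std::memcpy(block, tmp, sizeof(tmp));
}

// Scales each coefficient by its quantization step.
// Assumption: quantTable is in row-major order and the product is truncated.
void dequantize(int16_t *block, const float *quantTable) {
    for (int i = 0; i < 64; ++i) {
        block[i] = static_cast<int16_t>(block[i] * quantTable[i]);
    }
}
```

If the real table is stored in zigzag order instead, the two steps would have to be swapped or the table reordered once up front.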



What I want to optimize is overall performance. The image is decompressed one 16x16 pixel region at a time, with 4 luminance blocks for component 1, 1 chrominance block for component 2, and 1 chrominance block for component 3. There are 4 more blocks for another luminance component, but I don't know what it is used for, so it can be ignored. The code looks like this:



void decodeImageType0(
        uint32_t width,
        uint32_t height,
        std::vector<uint8_t> &outData,
        BitStream &stream,
        const tHuffTable &dcLumTable,
        const tHuffTable &acLumTable,
        const tHuffTable &dcCromTable,
        const tHuffTable &acCromTable,
        float *lumQuant[4],
        float *cromQuant[4]) {
    int16_t lum0[4][64]{};
    int16_t lum1[4][64]{};
    int16_t crom0[64]{};
    int16_t crom1[64]{};
    uint32_t colorBlock[16 * 16]{};

    const auto actualHeight = ((height + 15) / 16) * 16;
    const auto actualWidth = ((width + 15) / 16) * 16;

    int16_t prevDc[4] = {0};
    for (auto y = 0; y < (actualHeight / 16); ++y) {
        for (auto x = 0; x < (actualWidth / 16); ++x) {
            for (auto &lum : lum0) {
                prevDc[0] = processBlock(prevDc[0], stream, dcLumTable, acLumTable, lumQuant[0], true, lum);
            }
            prevDc[1] = processBlock(prevDc[1], stream, dcCromTable, acCromTable, cromQuant[1], false, crom0);
            prevDc[2] = processBlock(prevDc[2], stream, dcCromTable, acCromTable, cromQuant[2], false, crom1);
            for (auto &lum : lum1) {
                prevDc[3] = processBlock(prevDc[3], stream, dcLumTable, acLumTable, lumQuant[3], true, lum);
            }

            decodeColorBlockType0(lum0, lum1, crom0, crom1, colorBlock);
            for (auto row = 0; row < 16; ++row) {
                if (y * 16 + row >= height || x * 16 >= width) {
                    continue;
                }

                const auto numPixels = std::min(16u, width - x * 16);
                memcpy(outData.data() + (y * 16 + row) * width * 4 + x * 16 * 4, &colorBlock[row * 16], numPixels * 4);
            }
        }
    }
}


My measurements show that over 80% of the time is spent inside the idct function, so this is where I want to optimize. The function below already includes every optimization I could think of. Caching the static coefficients used by the IDCT improved performance significantly, but I hope there is still room for more: nanojpg, for example, is 3 times faster (although it produces invalid results for this format).



float idctHelper(const int16_t *inBlock, int32_t u, int32_t v, int32_t blockWidth, int32_t blockHeight) {
    glm::vec<4, float, glm::packed_lowp> vec3{};

    float result = 0.0f;
    for (auto y = 0; y < blockHeight; ++y) {
        for (auto x = 0; x < blockWidth; x += 4) {
            const auto idx = (v * 8 + u) * 64 + y * 8 + x;
            vec3 = glm::vec<4, float, glm::packed_lowp>(inBlock[y * blockWidth + x], inBlock[y * blockWidth + x + 1],
                                                        inBlock[y * blockWidth + x + 2], inBlock[y * blockWidth + x + 3]) *
                   glm::vec<4, float, glm::packed_lowp>(idctLookup[idx], idctLookup[idx + 1],
                                                        idctLookup[idx + 2], idctLookup[idx + 3]);
            result += vec3.x + vec3.y + vec3.z + vec3.w;
        }
    }

    return result;
}

template<typename T, typename U = T>
U clamp(T value, T min, T max) {
    return static_cast<U>(std::min<T>(std::max<T>(value, min), max));
}

void idct(int16_t *outBlock, int16_t *inBlock, bool isLuminance, int32_t blockWidth = 8, int32_t blockHeight = 8) {
    for (auto y = 0; y < blockHeight; ++y) {
        for (auto x = 0; x < blockWidth; ++x) {
            auto value = static_cast<int16_t>(std::round(
                    0.25f * idctHelper(inBlock, x, y, blockWidth, blockHeight)));
            if (isLuminance) {
                value = clamp<int16_t>(static_cast<int16_t>(value + 128), 0, 255);
            } else {
                value = clamp<int16_t>(value, -256, 255);
            }

            outBlock[y * blockWidth + x] = value;
        }
    }
}
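One possible direction, which is not part of the code above and has not been benchmarked here: the cosine basis is separable, so the double sum in idct can be split into two 8-point 1-D passes, first over rows and then over columns. That brings the work per block from 64 x 64 = 4096 multiply-adds down to 2 x 8 x 64 = 1024. The sketch below assumes a fixed 8x8 block and keeps the same clamping as idct; the names idctSeparable and IdctBasis are made up for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

namespace {

constexpr float kPi = 3.14159265358979323846f;

// c[n][k] = 0.5 * alpha(k) * cos((2n + 1) * k * pi / 16).
// The two factors of 0.5 (one per pass) replace the 0.25 used in idct.
struct IdctBasis {
    float c[8][8];
    IdctBasis() {
        for (int n = 0; n < 8; ++n) {
            for (int k = 0; k < 8; ++k) {
                const float alpha = (k == 0) ? 1.0f / std::sqrt(2.0f) : 1.0f;
                c[n][k] = 0.5f * alpha * std::cos((2 * n + 1) * k * kPi / 16.0f);
            }
        }
    }
};

}  // namespace

// Hypothetical separable variant of idct for a fixed 8x8 block.
void idctSeparable(int16_t *outBlock, const int16_t *inBlock, bool isLuminance) {
    static const IdctBasis basis;  // built once on first call
    float tmp[64];

    // Pass 1: transform each row (frequency index x -> pixel column u).
    for (int y = 0; y < 8; ++y) {
        for (int u = 0; u < 8; ++u) {
            float sum = 0.0f;
            for (int x = 0; x < 8; ++x) {
                sum += basis.c[u][x] * inBlock[y * 8 + x];
            }
            tmp[y * 8 + u] = sum;
        }
    }

    // Pass 2: transform each column (frequency index y -> pixel row v),
    // then round and clamp exactly like idct does.
    for (int v = 0; v < 8; ++v) {
        for (int u = 0; u < 8; ++u) {
            float sum = 0.0f;
            for (int y = 0; y < 8; ++y) {
                sum += basis.c[v][y] * tmp[y * 8 + u];
            }
            int value = static_cast<int>(std::lround(sum));
            if (isLuminance) {
                value = std::min(std::max(value + 128, 0), 255);
            } else {
                value = std::min(std::max(value, -256), 255);
            }
            outBlock[v * 8 + u] = static_cast<int16_t>(value);
        }
    }
}
```

Because the floating-point additions happen in a different order, a value that lands very close to a .5 boundary may round differently from idct, so the output should be compared against the existing implementation before relying on it.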


This is the cache that is created once:



float alphaFunction(int32_t n) {
    static float INV_SQRT_2 = 1.0f / sqrtf(2.0f);

    if (n == 0) {
        return INV_SQRT_2;
    } else {
        return 1;
    }
}

for (auto u = 0; u < 8; ++u) {
    for (auto v = 0; v < 8; ++v) {
        for (auto x = 0; x < 8; ++x) {
            for (auto y = 0; y < 8; ++y) {
                idctLookup[(v * 8 + u) * 64 + y * 8 + x] = alphaFunction(x) * alphaFunction(y) *
                        cosf((2 * u + 1) * x * (float) M_PI / 16.0f) *
                        cosf((2 * v + 1) * y * (float) M_PI / 16.0f);
            }
        }
    }
}
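Written out, the lookup table together with idctHelper and idct evaluates the expression below for each output pixel, before the offset and clamping are applied. The symbols follow the variable names in the code: (u, v) is the output pixel position and (x, y) is the coefficient index, which is the reverse of the naming used in most textbooks.

```latex
\[
\mathrm{out}(u, v) = \operatorname{round}\!\left(
  \frac{1}{4} \sum_{y=0}^{7} \sum_{x=0}^{7}
  \alpha(x)\,\alpha(y)\,\mathrm{in}(x, y)\,
  \cos\frac{(2u+1)\,x\,\pi}{16}\,
  \cos\frac{(2v+1)\,y\,\pi}{16}
\right),
\qquad
\alpha(n) =
\begin{cases}
  1/\sqrt{2} & n = 0 \\
  1          & n > 0
\end{cases}
\]
```

Each of the 64 output pixels needs its own 64 products, which is why the table holds 64 x 64 entries.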









      c++ performance image






      asked 6 hours ago









      Cromon






















