Is Higham's computation of the mean worth the price?














In Accuracy and Stability of Numerical Algorithms, equation 1.6a, Higham gives the following update formula for the mean:
$$
M_{1} := x_1, \quad
M_{k+1} := M_{k} + \frac{x_{k+1} - M_k}{k+1}
$$

OK, one benefit of this update is that it doesn't overflow, which is a potential problem when we naively compute the full sum $\sum_{i=1}^{k} x_i$ and only then divide by $k$.



But the naive sum requires $N-1$ adds and 1 division, and potentially vectorizes very nicely, whereas the update proposed by Higham requires $2N-2$ additions and $N-1$ divisions, and I'm unaware of any assembly instruction that can vectorize all this.
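
For concreteness, here is a minimal scalar sketch of the two variants being compared; the function names and comments are purely illustrative, not Higham's notation:

    #include <cstddef>
    #include <vector>

    // Naive mean: accumulate the full sum, then divide once.
    double mean_naive(const std::vector<double>& x) {
        double sum = 0.0;                     // the running sum can overflow for large data
        for (double v : x) sum += v;          // one addition per element
        return sum / x.size();                // a single division at the end
    }

    // Higham's running update (eq. 1.6a): M_{k+1} = M_k + (x_{k+1} - M_k) / (k + 1).
    double mean_higham(const std::vector<double>& x) {
        double m = x[0];                      // assumes x is non-empty
        for (std::size_t k = 1; k < x.size(); ++k)
            m += (x[k] - m) / double(k + 1);  // one subtraction, one division, one addition per element
        return m;                             // stays within the range of the data, so no overflow
    }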



So is Higham's update formula worth using? Are there benefits I'm not seeing?



Note: Higham gives the following generic advice:
"It is advantageous to express update formulae as newvalue = oldvalue + smallcorrection if the small correction can be computed with many correct significant figures."



Update 1.6a does take this form, but it's not clear to me that the small correction can be computed to many correct significant figures.



Edit: I found an empirical study of the performance of various methods of computing means, and 1.6a came highly recommended; see



Robert F. Ling (1974) Comparison of Several Algorithms for Computing Sample Means and Variances, Journal of the American Statistical Association, 69:348, 859-866, DOI: 10.1080/01621459.1974.10480219



However, it still wasn't clear to me after reading that paper that the update was worth the price; and in any case I was hoping for worst- and average-case bounds on the accumulated rounding error.















Tags: floating-point, numerics






asked 5 hours ago by user14717, edited 2 hours ago






















          2 Answers



















          Suppose you have $m$ numbers and the vector length $l$ divides $m$, i.e., $m=ln$. Then you can apply Higham's scheme to each group of $n$ numbers. This vectorizes nicely, especially if the numbers are interleaved, i.e., all the first elements are contiguous in memory, followed by all the second elements, etc. When you have computed the average of each group you have $l$ numbers left. You apply a sequential implementation of Higham's scheme and then you are done. If the vector length does not divide $m$, then pad the data with zeros, split the data into $l$ groups with $n$ numbers each and proceed as before. When you have the average $\mu_j$ of each group, you compute $(n/m)\mu_j$ and simply add these numbers using the naive summation algorithm.
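
          A rough scalar sketch of this blocking idea (my own illustration, not code from Higham): $l$ independent accumulators play the role of vector lanes, and per-lane counts are tracked instead of literally padding with zeros, which amounts to the same thing.

          #include <array>
          #include <cstddef>
          #include <vector>

          // Blocked Higham mean with l independent running means, one per "lane".
          // Because the accumulators are independent, the inner update is a
          // candidate for SIMD vectorization (l = 4 for SSE, 16 for AVX-512 floats).
          template <std::size_t l>
          double blocked_higham_mean(const std::vector<double>& x) {  // assumes x is non-empty
            const std::size_t m = x.size();
            std::array<double, l> mu{};            // per-lane running means
            std::array<std::size_t, l> count{};    // per-lane element counts

            for (std::size_t i = 0; i < m; ++i) {
              const std::size_t j = i % l;         // interleaved assignment of elements to lanes
              mu[j] += (x[i] - mu[j]) / double(++count[j]);
            }

            // Combine: weight each lane mean by its share of the total count.
            double mean = 0.0;
            for (std::size_t j = 0; j < l; ++j)
              mean += (double(count[j]) / double(m)) * mu[j];
            return mean;
          }

          With $l$ equal to the SIMD width, the inner update is one subtract, one divide and one add across all lanes, which is essentially the multiple-accumulator approach discussed in the comments below.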




          Whether Higham's formula is worth using depends on the situation. When overflow is an issue, the naive formula is useless. Both formulas have low arithmetic intensity, and I expect Higham's formula to run very slowly simply because division has very high latency. Personally, I find that speed is very secondary compared with reliability and accuracy.


          Higham's primary goal in this very first chapter is simply to alert the reader to the fact that computing with floating point numbers is profoundly different from real arithmetic. His advice about corrections is quoted correctly, but I would have written it a bit differently. Specifically, it does not matter if the correction is computed with a large relative error as long as the correction is small relative to the original value.
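
          To illustrate with made-up numbers: if $M_k = 100$ and the true correction is $0.01$, then even a 10% relative error in the computed correction perturbs $M_{k+1}$ by only $0.001$, a relative error of roughly $10^{-5}$.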




          answered 2 hours ago by Carl Christian













          • user14717: I'm not sure that we're using the word "vectorize" in the same way. I meant vectorize to mean "compiles down to AVX-512 instructions"; it seems like you are talking about parallelization? (2 hours ago)










          • Richard: @user14717: If you pad your array so that it's a multiple of the length of your AVX512 vector and then treat each part of the AVX512 vector as a separate accumulator, then you can do this in a vectorized fashion. I doubt that a compiler would do this automatically, though. (2 hours ago)




















          Higham's algorithm seems very useful for online algorithms, or for settings where you have limited storage capacity, such as edge processing. In these situations it is probably always worth the cost.



          But, to address your question as posed, I implemented an SSE2 version which I think captures it:



          #include <chrono>
          #include <cstdlib>
          #include <iomanip>
          #include <iostream>
          #include <xmmintrin.h>

          int main(){
            const int num_count = 100000;

            alignas(16) float data[num_count];
            for(int i=0;i<num_count;i++)
              data[i] = rand()%100000+rand()/(double)RAND_MAX;

            {
              // Naive mean: accumulate the full sum, divide once at the end.
              const auto t1 = std::chrono::high_resolution_clock::now();
              float sum=0;
              for(int i=0;i<num_count;i++){
                sum += data[i];
              }
              const float mean=sum/num_count;

              const auto t2 = std::chrono::high_resolution_clock::now();
              const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
              std::cout << "Exec time:\t" << time_span1.count() << " s\n";
              std::cout << "Simple Mean = "<<std::setprecision(20)<<mean<<std::endl;
            }

            {
              // Higham-style running update, four lanes at a time with SSE intrinsics.
              const auto t1 = std::chrono::high_resolution_clock::now();
              __m128 mean = _mm_load_ps(&data[0]);   // per-lane means, seeded with the first block of four
              for (int i=4;i<num_count;i+=4){
                const __m128 x = _mm_load_ps(&data[i]);
                const __m128 diff = _mm_sub_ps(x,mean);
                const __m128 k = _mm_set1_ps((float)(i/4)); // per-lane divisor; note it is 1 on the first pass, so the seed block is replaced rather than averaged in
                const __m128 div = _mm_div_ps(diff, k);
                mean = _mm_add_ps(mean, div);
              }
              float result[4];
              _mm_store_ps(result, mean);
              const float tmean = (result[0] + result[1] + result[2] + result[3])/4; //I'm suspicious about this step: probably throws away precision
              const auto t2 = std::chrono::high_resolution_clock::now();
              const auto time_span1 = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
              std::cout << "Exec time:\t" << time_span1.count() << " s\n";
              std::cout << "Higham Mean = "<<std::setprecision(20)<<tmean<<std::endl;
            }
          }


          and observe that



          Exec time:  0.000225851 s
          Simple Mean = 49891.23046875
          Exec time: 0.0003759360000000000002 s
          Higham Mean = 49890.26171875


          Higham's mean takes a little longer to calculate and the values differ by what might be a non-negligible amount, though the accuracy you need really depends on your implementation.






          answered 1 hour ago by Richard













          • user14717: Ok, this really changes the calculus. The clang generated code for 1.6a takes 14x as long as the naive version, but yours is only 1.5x. (1 hour ago)












