APEI Generic Hardware Error












      Over the past week my server (running Debian Jessie) has rebooted twice. In the syslog I see this before each reboot, and at no other points:



      Aug 15 13:32:58 hoshimiya kernel: [296512.005355] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
      Aug 15 13:32:58 hoshimiya kernel: [296512.005360] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
      Aug 15 13:32:58 hoshimiya kernel: [296512.005361] {1}[Hardware Error]: event severity: corrected
      Aug 15 13:32:58 hoshimiya kernel: [296512.005362] {1}[Hardware Error]: Error 0, type: corrected
      Aug 15 13:32:58 hoshimiya kernel: [296512.005363] {1}[Hardware Error]: fru_text: CorrectedErr
      Aug 15 13:32:58 hoshimiya kernel: [296512.005364] {1}[Hardware Error]: section_type: memory error
      Aug 15 13:32:58 hoshimiya kernel: [296512.005365] [Firmware Warn]: error section length is too small


      Some googling leads me to believe that this is to do with my ECC RAM detecting and recovering from an error. Is this correct? If it's recovering, why does the system reboot? I'd like to prevent the system from rebooting, if at all possible.







      hardware














      asked Aug 15 '14 at 19:04 by moujik
      edited Apr 27 '16 at 17:58 by Anthon






















          2 Answers


















                  It looks like your RAM is failing, or at least producing errors that are being corrected. Depending on their severity, these errors may be affecting the system's ability to function, forcing it to reboot afterwards.

                  From the looks of this thread, the last message, about the error section length being too small, is likely the culprit.

                  Excerpt from [PATCH 1/1] efi: cper: Support different length of Error Section:




                  Some fields might be added to the Error Section in the newer UEFI
                  spec. For example, the fields 'Reserved', 'Rank Number', 'Card Handle'
                  and 'Module Handle' are added to the Memory Error Section started from
                  UEFI spec 2.3. Unfortunately, there will have the following warning
                  message if the memory corrected error is detected and the field
                  'revision' in struct acpi_generic_data is less then 0x203 (UEFI spec
                  2.3):



                  {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3
                  {1}[Hardware Error]: It has been corrected by h/w and requires no further action
                  {1}[Hardware Error]: event severity: corrected
                  {1}[Hardware Error]: Error 0, type: corrected
                  {1}[Hardware Error]: section_type: memory error
                  [Firmware Warn]: error section length is too small


                  This behavior causes this corrected error cannot be displayed
                  correctly. To solve the issue, this patch supports different length of
                  the Error Section for different UEFI spec version.



                  And, this patch employs a pre-defined structure to clean up the
                  duplicated codes in function cper_estatus_print_section.



                  With applying this patch, the memory corrected error could be
                  displayed correctly after injecting the error.



                  Tested on v3.14-rc5 with Grantley platform and Intel RAStool.




                  So it would seem that a patch for that particular warning is in the works and might be available in a newer version of the kernel.
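
To make the length mismatch in the patch description a bit more concrete, here is a rough Python model of it. This is only a sketch, not the kernel code: the field layouts are my reading of the Memory Error Section (the older layout ends at the error type byte, the newer one adds 'Reserved', 'Rank Number', 'Card Handle' and 'Module Handle'), and the function names are hypothetical.

```python
import struct

# Assumed little-endian, unpadded layouts of the UEFI Memory Error Section.
# OLD: 4 x 64-bit fields, 8 x 16-bit fields, 3 x 64-bit IDs, 1 error-type byte.
OLD_LAYOUT = "<4Q8H3QB"
# NEW: OLD plus Reserved (1 byte), Rank Number, Card Handle, Module Handle.
NEW_LAYOUT = OLD_LAYOUT + "B3H"

OLD_LEN = struct.calcsize(OLD_LAYOUT)
NEW_LEN = struct.calcsize(NEW_LAYOUT)

UEFI_2_3 = 0x203  # revision threshold quoted in the patch description


def too_small_before_patch(actual_length: int) -> bool:
    """Model of a parser that always expects the newer, longer section."""
    return actual_length < NEW_LEN


def too_small_after_patch(revision: int, actual_length: int) -> bool:
    """Model of a parser that picks the expected length from the revision."""
    expected = NEW_LEN if revision >= UEFI_2_3 else OLD_LEN
    return actual_length < expected
```

Under these assumptions, firmware reporting an older revision hands over the shorter section, which a parser expecting only the longer one flags as too small even though the corrected error itself is valid.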

















                  answered Aug 21 '14 at 13:54 by slm

























                          FYI, I had what appeared to be a very similar issue.

                          As it turned out, the solution was to take the memory out and reseat it. After that, everything was back to normal.
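
If you want to check whether reseating (or any other fix) actually made a difference, you could count the hardware error events in the log before and after. A minimal sketch in Python, assuming the kernel log format shown in the question; `count_severities` is just a name I made up for illustration:

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   ... kernel: [296512.005361] {1}[Hardware Error]: event severity: corrected
SEVERITY_RE = re.compile(r"\{\d+\}\[Hardware Error\]: event severity: (\w+)")


def count_severities(lines):
    """Count hardware error events by the severity the kernel reported."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

You could feed it an open log file, for example `count_severities(open("/var/log/syslog", errors="replace"))`, where the path is an assumption and may differ on your system. Only the "event severity" line of each record is counted, so one logged event contributes one count.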

















                          answered Dec 5 '17 at 21:02 by Darren Harrison





























