Weird I/O stalls affecting a whole desktop

After a recent hardware migration I started experiencing weird I/O stalls affecting my desktop Debian Stretch system. Typical symptoms, all happening during each stall:




  • I stop being able to interact with Chromium, my web browser. Nothing works: webpage scrolling (usually this is how I notice the stall), switching tabs, etc. No mouse-over actions either, whether on a web page or in the Chromium UI.


  • In a virtual terminal, I can't run new processes anymore. For example, I open a new tab in mate-terminal and my shell doesn't show up, just the cursor blinking. In a terminal with a shell opened before the stall, I can type a command, but usually it doesn't start; sudo something doesn't even ask for a password.


  • Other programs, like RStudio, can't save anything to disk and often hang when they attempt to.



  • I see in the output of journalctl -f that if the stall is long enough, journald itself restarts. Example:



    sty 30 14:03:54 liori-pc systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
    sty 30 14:03:54 liori-pc systemd[1]: systemd-journald.service: Unit entered failed state.
    sty 30 14:03:54 liori-pc systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
    sty 30 14:03:54 liori-pc systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
    sty 30 14:03:54 liori-pc systemd[1]: Stopped Flush Journal to Persistent Storage.
    sty 30 14:03:54 liori-pc systemd[1]: Stopping Flush Journal to Persistent Storage...
    sty 30 14:03:54 liori-pc systemd[1]: Stopped Journal Service.
    sty 30 14:03:54 liori-pc systemd[1]: Starting Journal Service...
    sty 30 14:03:54 liori-pc systemd-journald[23935]: Journal started
    sty 30 14:03:54 liori-pc systemd-journald[23935]: System journal (/var/log/journal/2318080f60e357aaf765e98d0000035c) is 2.1G, max 4.0G, 1.8G free.


  • When using dm_crypt, a dmcrypt_write process starts taking 100% of a single CPU core (I later got rid of dm_crypt from this system, but stalls still happen).


  • I watch /proc/meminfo and see that the Dirty number is never more than a few megabytes. Notably, during a stall, this number doesn't change.


  • In rare cases, I even get a kernel message of the form "INFO: task «some_process» blocked for more than 120 seconds.", with «some_process» usually being mdX_raid5, chromium or one of its threads, etc. Example log.
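The Dirty observation in the list above could be recorded with timestamps, so that a flat line during a stall is visible afterwards. The following is only a minimal sketch and not part of the original setup: the function names parse_meminfo, sample and watch are made up for illustration, and whether such a process keeps running during a stall is itself an assumption to verify.

```python
import time


def parse_meminfo(text):
    """Map each /proc/meminfo field name to its leading integer value."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def sample(path="/proc/meminfo"):
    """Return (Dirty, Writeback) in kB; a missing field yields None."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return info.get("Dirty"), info.get("Writeback")


def watch(interval=1.0, count=None):
    """Print a timestamped sample every `interval` seconds.

    Runs `count` times, or forever when `count` is None.
    """
    done = 0
    while count is None or done < count:
        dirty, writeback = sample()
        print(time.strftime("%H:%M:%S"),
              "Dirty:", dirty, "kB",
              "Writeback:", writeback, "kB", flush=True)
        done += 1
        time.sleep(interval)
```

Calling watch() from an interpreter started before a stall prints to the terminal only, so the logging itself does not depend on the stalled disks.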



Initially my setup was just a single 600GB ext4 file system on a partition of a single 1TB drive (currently /dev/sdd). Then I migrated to 3×6TB drives (/dev/sd{b,c,e}), with LVM-based raid5, bcache with its cache on an SSD drive, then dm_crypt; that's when the stalls started. While debugging, I simplified the setup to just LVM-raid5, with no bcache or dm_crypt; the stalls still happen, though I feel they are less frequent now.



This kind of stall happens several times a day and usually lasts a few minutes. I noticed that I can break it by explicitly requesting some disk operation: sometimes by logging in to this system over ssh from a remote machine, or (almost always) by just running cat /dev/sdb >/dev/null or cat /dev/sdc >/dev/null (sometimes one works, sometimes the other; notably, cat /dev/sde >/dev/null never helped). Then everything that was stalled suddenly starts working again.



So I suspect the problem is caused by one of, or an interaction of, the following:




  • The drives: all three are Seagate Skyhawk ST6000VX0023. Two of them were unused before this setup; the third one (/dev/sdc) had been in use for half a year.

  • Disk controllers: the motherboard (Gigabyte Z68X-UD3H-B3) has two controllers: a Marvell 88SE9172, to which one of the drives is connected, and the chipset's built-in controller (Intel® Z68), to which the other two are connected. (Can I check in software which drive is on which controller?)

  • Some bug in the controller kernel drivers.

  • Some bug in LVM or raid5.
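On the question of which drive sits on which controller: on Linux, /sys/block/sdX is a symlink into the device tree, and the resolved path contains the PCI address of the controller the disk hangs off. A minimal sketch of reading that, assuming the usual sysfs layout (the helper names below are made up for illustration):

```python
import os
import re

# PCI addresses look like 0000:00:1f.2 (domain:bus:device.function).
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_pci_address(sysfs_path):
    """Return the last PCI address in a resolved sysfs path, or None.

    The last address is the device closest to the disk, i.e. the
    controller itself rather than a bridge above it.
    """
    matches = PCI_ADDR.findall(sysfs_path)
    return matches[-1] if matches else None


def controller_for(disk, sys_block="/sys/block"):
    """Resolve e.g. 'sdb' to the PCI address of its controller."""
    resolved = os.path.realpath(os.path.join(sys_block, disk))
    return controller_pci_address(resolved)


def report(disks=("sdb", "sdc", "sde")):
    """Print one line per disk; the disk names are those from the post."""
    for disk in disks:
        print(disk, controller_for(disk))
```

The address printed for each disk can then be passed to lspci -s to see whether it belongs to the Marvell or the Intel controller.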


This is a Debian Stretch system with some backported packages installed, most notably kernel 4.19.0-0.bpo.1-amd64. Intel Core i7-2600k, 16GB of RAM.



At this point I ran out of ideas. How do I debug this problem further?



Edit: I started a script that reads a single random sector from one of these drives every 4 seconds, and I have had no stalls for 2 days now. So it does look like some system component (LVM? raid?) doesn't properly wake up the devices from some kind of low-power mode when necessary.
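The script itself is not shown in the post. A workaround of that kind could look roughly like the sketch below; this is an illustration under assumptions, not the author's script. The 512-byte sector size, the function names and the device list are all hypothetical choices, and reading /dev/sdX directly normally requires root.

```python
import os
import random
import time

SECTOR_SIZE = 512  # assumed logical sector size


def read_random_sector(path, sector_size=SECTOR_SIZE):
    """Read one randomly chosen sector from a device or file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size, also for block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        sectors = size // sector_size
        if sectors == 0:
            return b""
        offset = random.randrange(sectors) * sector_size
        return os.pread(fd, sector_size, offset)
    finally:
        os.close(fd)


def keep_awake(paths, interval=4.0, rounds=None):
    """Every `interval` seconds, read a random sector from one of `paths`.

    Runs `rounds` times, or forever when `rounds` is None.
    """
    done = 0
    while rounds is None or done < rounds:
        read_random_sector(random.choice(paths))
        done += 1
        time.sleep(interval)
```

For example, keep_awake(["/dev/sdb", "/dev/sdc", "/dev/sde"]) would mirror the 4-second interval described above. A read may occasionally be served from the kernel's cache instead of the disk, although with random offsets on a 6TB drive that should be rare.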










  • Does an actual time gap show up in the journalctl output? I note that there doesn't appear to be one in the output you've quoted above - but that happens to start about where my journalctl output generally seemed to resume after a pause like this on my former laptop. (You're not likely to have the problem that laptop had, however - it had been bought brand new in May 2008, and was a low-power laptop, so only 32-bit.)

    – Ed Grimm
    Feb 3 at 2:40











  • @EdGrimm, I think so. At least, when I notice the stall and switch to journalctl -f output running in some terminal, I don't get any new messages until the stall ends. That doesn't mean I know when exactly the stall started.

    – liori
    Feb 3 at 3:12











  • I expected from your description of the stall that journalctl -f wouldn't report anything until the stall ended. But when the stall ends, it could just log new stuff (unlikely, unless journalctl itself was restarted at the end), it could catch up part of the way, or it could catch up all of the way, such that there's no noticeable gap. It would be much easier to tell on a busier machine that had regularly occurring logging. It's not really something I expect you'd be able to answer off-hand, because our brains try to only note things that are noteworthy.

    – Ed Grimm
    Feb 3 at 3:23
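Whether a gap exists can be checked mechanically rather than by eye. The sketch below assumes log lines that begin with a Unix timestamp, as produced by journalctl -o short-unix where that output mode is available; the function name find_gaps and the 60-second threshold are made up for illustration.

```python
def find_gaps(lines, threshold=60.0):
    """Return (previous, current, seconds) for each gap above `threshold`.

    Each line is expected to start with a Unix timestamp; lines that
    do not (for example "-- Reboot --" markers) are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue
        try:
            stamp = float(fields[0])
        except ValueError:
            continue
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps
```

On a quiet machine a long gap may simply mean that nothing was logged, so a hit is only a hint that a stall happened in that interval.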






  • @EdGrimm: I'm starting to suspect that the stalls happen only when the machine is actually not busy. I didn't get any stall yesterday when I was running a heavy I/O task for ~10 hours, and normally I would get a few of them. To test this hypothesis, I'm now leaving a small process doing small random reads all over the drives. The machine is certainly not overheating: I managed to observe sensors twice before, and except for the case of dm_crypt taking 100% of a CPU core, the temperatures of both the CPU and drives did not go over ~55°C.

    – liori
    Feb 3 at 16:44






  • That sounds like the stalls are due to the drives going into a low-power mode and taking a really long time to come out of it. My knowledge of how to fix that issue went obsolete with IDE, but hopefully that suggestion will help you find a better answer, or somebody with a better answer will come by and give it.

    – Ed Grimm
    Feb 3 at 20:22
















Tags: io, delay






asked Feb 3 at 0:24 by liori, edited Feb 6 at 1:01












