Segmentation fault after interrupting computation



Message boards : Number crunching : Segmentation fault after interrupting computation

Anders
Joined: Jan 15 07
Posts: 2
Credit: 6,292
RAC: 0
Message 420 - Posted 16 Jan 2007 18:20:20 UTC

    Last modified: 16 Jan 2007 18:27:03 UTC

    Two of my work units stopped with a segmentation fault: 26861 and 26906. What they have in common is that the scheduler preempted them to start a task from another project, and they were restarted an hour later.

    I also noticed that although I had done at least 18 minutes of work in one of them, after restarting it counted from only about 3 minutes. This is in contrast to another project I am working on (Rectilinear Crossing Numbers), which always resumes at the exact second it stopped at, even though that project does not even implement checkpoints.

    Good luck with the bug chasing,

    Anders

    PS: After posting, I noticed that another thread deals in part with the same problem; I had only read the other part. In my case it definitely cannot be a memory problem: I have 1 GB that is never fully used.

    abc@home staff
    Forum moderator
    Project administrator
    Project developer
    Joined: Nov 8 06
    Posts: 342
    Credit: 44,383
    RAC: 0
    Message 423 - Posted 16 Jan 2007 18:25:40 UTC - in response to Message 420.

      Last modified: 16 Jan 2007 18:26:12 UTC

      Hi,

      I know about the preempting and segfaulting problem. It doesn't seem to be
      caused by abc, but we are still looking into it.
      The checkpointing is still a bit off, and progress might drop because
      we checkpoint only between so-called blocks of numbers.
      At the beginning of our search party (the entire ABC search, that is,
      not per WU), these blocks can be quite irregular
      and big, which causes poor progress and sometimes long workunits.
      It gets better the further we go, though; it is already a lot better
      than it was yesterday.
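
To illustrate why progress can fall back after a restart when checkpoints are written only between blocks, here is a minimal sketch in Python. It is not the ABC@home code: the block sizes, the JSON checkpoint file, and the function names (`load_checkpoint`, `save_checkpoint`, `run`) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next block to process (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_block"]


def save_checkpoint(path, next_block):
    """Write the checkpoint atomically: temp file first, then rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_block": next_block}, f)
    os.replace(tmp, path)


def run(blocks, path, stop_after_items=None):
    """Process blocks (a list of block sizes) starting at the checkpoint.

    Returns the number of items processed in this run. If
    stop_after_items is given, the run is "preempted" after that many
    items, possibly in the middle of a block.
    """
    done = 0
    for index in range(load_checkpoint(path), len(blocks)):
        for _ in range(blocks[index]):
            if stop_after_items is not None and done >= stop_after_items:
                return done  # preempted: work inside this block is lost
            done += 1  # stands in for one unit of real work
        save_checkpoint(path, index + 1)  # checkpoint only between blocks
    return done


if __name__ == "__main__":
    blocks = [3, 20, 2]  # hypothetical, irregular block sizes
    with tempfile.TemporaryDirectory() as folder:
        checkpoint = os.path.join(folder, "checkpoint.json")
        worked = run(blocks, checkpoint, stop_after_items=18)
        kept = sum(blocks[:load_checkpoint(checkpoint)])
        print(f"worked {worked} units, kept {kept} after restart")
```

With these made-up block sizes, an interruption after 18 units falls inside the large second block, so a restart resumes from the end of the first block and keeps only 3 units. That is the same kind of drop as the 18 minutes versus about 3 minutes reported at the top of this thread.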

      Anders
      Joined: Jan 15 07
      Posts: 2
      Credit: 6,292
      RAC: 0
      Message 425 - Posted 16 Jan 2007 18:38:49 UTC - in response to Message 423.

        Hi Hendrik,

        > I know the problem of preempting and segfaulting, it doesn't seem to be
        > a problem caused by abc but we are still looking into it.
        > The checkpointing is now still a bit off and progress might drop because
        > we checkpoint between so called blocks of numbers.


        Could preempting not simply send a SIGSTOP, so that work resumes at the exact point where it stopped, unless the client is shut down? I have the impression that this is what happens in RCN, where 'ps' still shows the task while it is suspended, whereas in abc the process disappears completely. Maybe this could solve the problem.

        > In the beginning of our search party (the entire ABC search that is,
        > not per wu) these blocks can be quite irregular
        > and big, causing not good progress and sometimes long workunits.
        > The further we go the better it gets though, now it's already a lot better
        > than it was yesterday.


        The work units were not even that big; they took just a tiny bit over an hour, so the BOINC scheduler kicked in and switched to my second project. As long as suspending/resuming does not work, only work units that take less than an hour of wall-clock time (so possibly even less CPU time if the computer is used in parallel) can be completed, which is quite short after all.

        Anders
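
The SIGSTOP suggestion above can be tried outside BOINC. The following is a minimal sketch in Python for POSIX systems, not a description of how the BOINC client actually suspends tasks. The helper name `suspend_and_resume` and the sleeping child process are assumptions for illustration. A stopped process stays in memory (and in 'ps'), so it carries on exactly where it was once it receives SIGCONT.

```python
import os
import signal
import subprocess
import sys


def suspend_and_resume(child):
    """Stop a child process with SIGSTOP, then continue it with SIGCONT.

    Returns (was_stopped, was_continued) as reported by waitpid.
    POSIX only; `child` must be a direct child of this process.
    """
    os.kill(child.pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid report a stopped (not terminated) child.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    os.kill(child.pid, signal.SIGCONT)
    # WCONTINUED makes waitpid report that the child has been resumed.
    _, status = os.waitpid(child.pid, os.WCONTINUED)
    was_continued = os.WIFCONTINUED(status)
    return was_stopped, was_continued


if __name__ == "__main__":
    # A stand-in for a long-running work unit.
    worker = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        print(suspend_and_resume(worker))
    finally:
        worker.kill()
        worker.wait()
```

Note that this only helps while the process is kept alive. If the task is removed from memory, as described for abc above, it can only restart from its last checkpoint.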

        abc@home staff
        Forum moderator
        Project administrator
        Project developer
        Joined: Nov 8 06
        Posts: 342
        Credit: 44,383
        RAC: 0
        Message 428 - Posted 16 Jan 2007 18:45:38 UTC - in response to Message 425.

          The BOINC core client should handle signals, and I think it does.
          Even if the workunits are not long, irregular progress can occur.
          I agree that it's annoying when the BOINC scheduler switches tasks and you
          lose your work that way. I don't understand why RCN doesn't have that issue.

          Dagorath
          Joined: Jan 7 07
          Posts: 381
          Credit: 3,365,400
          RAC: 0
          Message 431 - Posted 16 Jan 2007 19:00:08 UTC - in response to Message 425.

            Last modified: 16 Jan 2007 19:01:40 UTC

            > The work units were not even that big, they just took a tiny bit over an hour, so that the boinc scheduler kicked in and switched to my second project. As long as suspending/resuming does not work, only work units taking less than an hour wall clock time (so maybe even less cpu time if the computer is used in parallel) can be executed, which is quite short after all.
            >
            > Anders


            The problem with suspending and resuming hits a lot of projects. In my experience, I could not crunch SIMAP and MalariaControl on the same computer because one interfered with the other. I think they still do not understand why. Others crunch SIMAP and Malaria side by side with no problems. Very strange.

            Edit: Sorry, this is not the ideal solution, but it may help until a better one is found :)


            It helps if you set your preferences so that the computer works longer on each project before switching to a different project. You can do that in your preferences on the website or in a global_prefs_override.xml file in your BOINC installation folder. The line you need to edit in global_prefs_override.xml is...

            <cpu_scheduling_period_minutes>240.000000</cpu_scheduling_period_minutes>

            With the above example the computer will stay on one project for 240 minutes. It won't cure the problem completely but it will help. You can make the time even higher if you want. I have had it up to 10 hours with no problem.
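
For anyone who would rather script this change than edit the file by hand, here is a small Python sketch that sets the value in the XML text. It assumes the root element of global_prefs_override.xml is `<global_preferences>`; the function name `set_scheduling_period` is made up for this example. The client presumably has to re-read the file or be restarted before the new value takes effect.

```python
import xml.etree.ElementTree as ET


def set_scheduling_period(xml_text, minutes):
    """Return xml_text with cpu_scheduling_period_minutes set to `minutes`.

    The element is created if it is missing; other settings are kept.
    """
    root = ET.fromstring(xml_text)
    node = root.find("cpu_scheduling_period_minutes")
    if node is None:
        node = ET.SubElement(root, "cpu_scheduling_period_minutes")
    # Same six-decimal format as the line quoted in this post.
    node.text = f"{minutes:.6f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed minimal file content, for illustration only.
    original = (
        "<global_preferences>"
        "<cpu_scheduling_period_minutes>60.000000"
        "</cpu_scheduling_period_minutes>"
        "</global_preferences>"
    )
    print(set_scheduling_period(original, 240))
```

Make a backup copy of the file before overwriting it with the result.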



            Copyright © 2013 University of Leiden