Segmentation fault after interrupting computation |
Message boards : Number crunching : Segmentation fault after interrupting computation
| Author | Message |
|---|---|
|
Two of my work units stopped with a segmentation fault: 26861 and 26906. What they have in common is that the scheduler preempted them to start a task from another project, and they were restarted an hour later. | |
| ID: 420 | Rating: 0 |
| |
"Two of my work units stopped with a segmentation fault: 26861 and 26906. They have as peculiarity that they were preempted by the scheduler to start a task from another project and they were restarted an hour later." Hi, I know about the preempting and segfaulting problem. It doesn't seem to be caused by abc, but we are still looking into it. The checkpointing is still a bit off, and progress might drop because we only checkpoint between so-called blocks of numbers. At the beginning of our search (the entire ABC search, that is, not a single wu) these blocks can be quite irregular and big, which causes poor progress and sometimes long workunits. It gets better the further we go, though; it is already a lot better now than it was yesterday. | |
| ID: 423 | Rating: 0 |
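To illustrate why checkpointing only between blocks makes progress jump and can lose work after a preemption, here is a minimal sketch. It is not the ABC application's code: the function names, the JSON checkpoint file, and the sum-of-squares "work" are all hypothetical stand-ins. State is saved only at block boundaries, so an interruption in the middle of a large block throws away everything done in that block.

```python
import json
import os
import tempfile


def save_checkpoint(path, next_block, partial_sum):
    """Atomically record which block to start from after a restart."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"next_block": next_block, "partial_sum": partial_sum}, handle)
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (next_block, partial_sum); start from scratch if no file exists."""
    if not os.path.exists(path):
        return 0, 0
    with open(path) as handle:
        state = json.load(handle)
    return state["next_block"], state["partial_sum"]


def process_block(block):
    """Hypothetical stand-in for the real computation on one block of numbers."""
    return sum(n * n for n in block)


def run(blocks, path, stop_after=None):
    """Process blocks in order, checkpointing after each completed block.

    stop_after simulates a preemption: the run returns None once that many
    blocks have been finished in this session.
    """
    next_block, total = load_checkpoint(path)
    finished = 0
    for index in range(next_block, len(blocks)):
        if stop_after is not None and finished == stop_after:
            return None  # simulated preemption before the next block starts
        total += process_block(blocks[index])
        save_checkpoint(path, index + 1, total)
        finished += 1
    return total
```

Calling `run(blocks, path, stop_after=1)` and then `run(blocks, path)` resumes from the second block instead of starting over. With irregular block sizes, the time between checkpoints varies just as much, which matches the uneven progress described above.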
| |
|
Hi Hendrik, "I know the problem of preempting and segfaulting, it doesn't seem to be..." Could preempting not simply send a SIGSTOP, so that work resumes at the exact location where it stopped, unless the client is shut down? I have the impression that this is what happens in RCN, where 'ps' still shows the task while it is suspended, whereas in abc the process disappears completely. Maybe this could solve the problem. "In the beginning of our search party (the entire ABC search that is,..." The work units were not even that big; they took just a tiny bit over an hour, so the BOINC scheduler kicked in and switched to my second project. As long as suspending/resuming does not work, only work units that take less than an hour of wall-clock time (so possibly even less CPU time if the computer is used for other things in parallel) can be completed, which is quite short after all. Anders | |
| ID: 425 | Rating: 0 |
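The SIGSTOP idea can be tried out on any POSIX system. The sketch below is only a demonstration of the signal mechanism, not a description of how the BOINC client actually suspends tasks: it starts a throwaway child process (a Python one-liner that sleeps, chosen purely for illustration), stops it, confirms that it is stopped but still alive, and then continues it.

```python
import os
import signal
import subprocess
import sys


def suspend_and_resume():
    """Stop a child with SIGSTOP, verify it is still alive, then continue it."""
    # Hypothetical stand-in for a long-running science application.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid report a stopped (not exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        was_stopped = os.WIFSTOPPED(status)
        # poll() returns None while the process still exists, as 'ps' would show.
        still_listed = child.poll() is None
        # SIGCONT lets the process carry on exactly where it was stopped.
        os.kill(child.pid, signal.SIGCONT)
    finally:
        child.terminate()
        exit_code = child.wait()
    return was_stopped, still_listed, exit_code
```

A process stopped this way keeps its memory and its position in the computation, so no checkpoint is needed to resume it. That only holds while the process stays in memory; once it is removed, as reported for abc above, the restart has to come from a checkpoint.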
| |
Hi Hendrik, The BOINC core client should handle signals, and I think it does. Irregular progress can occur even if the workunits are not long. I agree that the BOINC scheduler switching tasks and losing your work that way is an annoyance. I don't understand why RCN doesn't have that issue. | |
| ID: 428 | Rating: 0 |
| |
"The work units were not even that big, they just took a tiny bit over an hour, so that the boinc scheduler kicked in and switched to my second project. As long as suspending/resuming does not work, only work units taking less than an hour wall clock time (so maybe even less cpu time if the computer is used in parallel) can be executed, which is quite short after all." The problem with suspending and resuming hits a lot of projects. In my experience, I could not crunch SIMAP and MalariaControl on the same computer because one interfered with the other. I think they still do not understand why. Others crunch SIMAP and Malaria side by side with no problems. Very strange. edit: sorry, this is not the ideal solution, but it may help until a better solution is found :) It helps if you set your preferences so that the computer works longer on each project before switching to a different one. You can do that in your preferences on the website or in a global_prefs_override.xml file in your BOINC installation folder. The line you need to edit in global_prefs_override.xml is... <cpu_scheduling_period_minutes>240.000000</cpu_scheduling_period_minutes> With the above example the computer will stay on one project for 240 minutes. It won't cure the problem completely, but it will help. You can make the time even higher if you want. I have had it set to 10 hours with no problem. | |
| ID: 431 | Rating: 0 |
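For anyone who would rather generate the override file than edit it by hand, here is a small sketch. It assumes that global_prefs_override.xml wraps its settings in a `<global_preferences>` root element; check that against an existing file from your own installation before relying on it. The function name is made up for this example, and only the `cpu_scheduling_period_minutes` line comes from the post above.

```python
import xml.etree.ElementTree as ET


def build_override(minutes):
    """Return the text of a minimal override file with one setting."""
    # Assumption: the settings live inside a <global_preferences> root element.
    root = ET.Element("global_preferences")
    period = ET.SubElement(root, "cpu_scheduling_period_minutes")
    # Same six-decimal formatting as the example line in the post.
    period.text = f"{minutes:.6f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the file content for a 240-minute switching interval.
    print(build_override(240))
```

Save the printed text as global_prefs_override.xml in the BOINC installation folder mentioned above. The client probably has to re-read its preferences or be restarted before the new interval takes effect.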
| |