Problems with child processes not cleaning up #38
Can you give me more info? The functionality you mention is actually already in place -- in the Process class: https://github.com/shaneharter/PHP-Daemon/blob/master/Core/Lib/Process.php#L39 So clearly things aren't working the way they're supposed to :)
Looks like that should work for one worker, but say for instance I have 2 workers. The first is configured for one fork, the second for 5. Forking mode is aggressive, so the second worker forks 5 processes on startup. Worker 1 throws an error and the timeout callback for worker 1 sets shutdown true on the daemon. The daemon process shuts down and the first worker process shuts down, but the 5 processes for the second worker don't. Am I doing it wrong? Also hoping you had some examples demonstrating this:
I saw that after I had already created an abstract base class that implements Core_IWorker and then had all my workers extending the base class.
Ben, I think I have an idea of what's going wrong here.... So when the parent Daemon process exits, it eventually gets to the Core_Daemon::__destruct method: https://github.com/shaneharter/PHP-Daemon/blob/master/Core/Daemon.php#L349 It iterates over all the workers and plugins and calls teardown() on them. One of those plugins is an instance of Core_Plugin_ProcessManager. You can see here that Core_Plugin_ProcessManager::teardown waits for processes, killing them if needed. But I believe I've made a careless mistake in Core_Daemon::__destruct. My theory is that Core_Plugin_ProcessManager::teardown is never being called because I mistakenly combine the plugin and worker arrays with the + operator instead of using array_merge. At one point these were associative arrays, but in a refactoring they were changed to be numeric, so plugins and workers have the same keys. Could you do me a favor and modify that line from: to: If you don't mind sending over a PR for that I'd be happy to merge it. If it doesn't solve your issue, try adding some additional logging to Core_Plugin_ProcessManager::teardown and let's see what it's doing. To your last comment, about using an ON_FORK callback: in the setup() method of your Daemon class, you could do something like this: $this->on(self::ON_FORK, function() {
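The key-collision behavior described above is easy to demonstrate. A minimal sketch (the array contents are illustrative, not the daemon's actual internals):

```php
<?php
// With numeric keys, `+` keeps only the left-hand value for each duplicate
// key, so one array's entries silently disappear -- and a plugin like the
// ProcessManager never gets its teardown() call.
$workers = [0 => 'Workers_Primes'];
$plugins = [0 => 'Core_Plugin_ProcessManager'];

$union  = $workers + $plugins;             // only 'Workers_Primes' survives
$merged = array_merge($workers, $plugins); // both survive, keys renumbered

var_dump(count($union));  // int(1)
var_dump(count($merged)); // int(2)
```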
FWIW, issues like these are the motivation behind my working on test coverage for this library. I wrote this initially in 2010 and open sourced it a year later. In 2012 I added background workers. Last year I made progress cleaning up the background workers and applying lessons learned. This year I have 2 goals:
Thanks for looking at this. I had noticed that too and wondered why you hadn't used array_merge, but didn't put 2 and 2 together. I'll make your changes and the change you suggested in the ticket prior to this one in the morning. If all works well I'll fork and send a pull request tomorrow.
Sounds good, thanks for the bug report!
Okay, those changes have the daemon cleaning up now on shutdown. I'm still running into an issue when a worker is recycled on timeout or when setting auto_restart=true on the worker. It doesn't die and strange things happen. I'll see if I can work out a simple example to demonstrate.
Okay, I replicated it pretty easily. From looking at the code, if a worker throws an exception and times out, the mediator kills the process at line 724 and the daemon timeout callback gets called. This works correctly, but if you don't do anything in the daemon timeout callback and just let it sit with only the daemon process running for about 15 seconds, the daemon will fork another worker process and load it up with 20+ calls to throwException(). I would think it should only issue the single call on the new worker, after which is_idle() would return false. Here's how to reproduce:

Examples/PrimeNumbers/Daemon.php

protected function setup_workers()
{
$that = $this;
$via = new \Core_Worker_Via_SysV();
$via->malloc(30 * 1024 * 1024);
$this->worker('PrimeNumbers', new Workers_Primes(), $via);
$this->PrimeNumbers->timeout(5);
$this->PrimeNumbers->workers(1);
$this->PrimeNumbers->onTimeout(function ($call, $log) use($that)
{
$log("Job {$call->id} Timed Out!");
});
}
protected function execute()
{
if($this->PrimeNumbers->is_idle())
{
$this->PrimeNumbers->throwException();
}
}

Examples/PrimeNumbers/Workers/Primes.php

public function throwException()
{
throw new \Exception('Testing throwing an exception');
}
By the way, I've just seen in ProcessManager.php (lines 120-122): shouldn't that also be an array_merge operation instead of "+=", if a flat array of processes is expected from the $processes property (an array of arrays, if I'm not wrong)?
That's actually correct the way it is, I think. $list is a numeric array with the key being the pid. If you use array_merge the keys get renumbered, the pid association is lost, and bad things happen.
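A quick sketch of the distinction (the pids and values are made up):

```php
<?php
// Here the pid IS the key, so `+` is the right operator: it preserves the
// integer keys, while array_merge would renumber them from zero and the
// pid-to-process mapping would be lost.
$running = [4101 => 'worker A', 4102 => 'worker B'];
$new     = [4103 => 'worker C'];

$kept = $running + $new;             // keys: 4101, 4102, 4103
$lost = array_merge($running, $new); // keys: 0, 1, 2 -- pids gone

var_dump(array_keys($kept));
var_dump(array_keys($lost));
```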
Oh yes, thanks!
Solid debugging work. If either of you are looking for work in San Francisco, lmk :) Let me play with this and see what develops. |
I see what you mean. Have you stepped through the IPC using the --debugworkers option? I'm doing that now.
OK, I've reproduced both bugs. @cturbelin one reason I don't love the idea of protecting these calls is that there shouldn't be these issues. Protecting the call would avoid the fatal error, but it would mask this bug and make it more subtle.
@brunnels So playing with this... Here's what's happening with is_idle in your example:
Now here's where I need to dig deeper. I think what we're seeing here is this: ProcessManager keeps a hash of running processes. It only removes processes from this list when it gets the SIGCHLD signal from the OS, which it checks for when it runs its reap() method. The reap() method gets called on idle. It should be pretty much instantaneous, but something is going wrong here and I need to do more digging. I'm not sure WHY it's happening, but the reason is_idle might be returning true when you expect it to return false is that it compares RUNNING calls to the number of allocated processes. Calls are only marked as "running" when the Daemon gets an Ack message from the worker. That's flawed and should be improved. If the mediator queues up 10 calls, just because a worker hasn't acked them as running yet doesn't mean that the workers are idle. I'll continue to work on this and let you know what I find.
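The flaw described above can be modeled in a few lines. This is a toy model, not the library's actual code; the call statuses and the idle rule are illustrative only:

```php
<?php
// Calls only count as "running" once a worker has ACKed them, so a backlog
// of queued-but-unacked calls still looks "idle" to the naive check.
$allocated = 1;
$calls = [
    ['status' => 'queued'],  // dispatched, no ACK from a worker yet
    ['status' => 'queued'],
];

$running = count(array_filter($calls, fn ($c) => $c['status'] === 'running'));
$is_idle = $running < $allocated;        // true -- yet work is pending

// A stricter rule would treat any call that hasn't returned as busy:
$busy           = count(array_filter($calls, fn ($c) => $c['status'] !== 'returned'));
$is_really_idle = $busy === 0;           // false
```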
So is there a better way to determine if a worker is idle or not? Here's my scenario: I'm updating records in one database from another database, and I need to update each record every hour. My first worker, with a single process, queries the local database for a list of records that need to be updated and returns the list to the daemon. The daemon then loops through that list, queueing work for the second worker to pull the records from the remote database and update the local database. If the first worker doesn't find any records that need updating, it sleeps for 55 seconds. The first worker has a timeout of 60 seconds. It should never have more than one piece of work queued, and it should not run unless there is no work queued for the second worker. If any worker child process throws an uncaught error, I log it in the timeout callback and send an email, but I want the worker to fork a new child and continue on. Here's some code from my daemon:

protected function setup_workers()
{
$that = $this;
$staleVia = new Core_Worker_Via_SysV();
$staleVia->malloc(75 * 1024 * 1024);
$this->worker('staleRecordGetter', new StaleRecordWorker(), $staleVia);
$this->staleRecordGetter->timeout(60);
$this->staleRecordGetter->workers(1);
$this->staleRecordGetter->auto_restart(true);
$this->staleRecordGetter->onReturn(function ($call, $log) use($that)
{
$that->logDebug("Job {$call->id} to {$call->method}() Complete");
if($call->method == 'getStaleRecords')
{
if($call->return instanceof PropelObjectCollection && count($call->return))
{
$that->logDebug(count($call->return) . ' Stale records found');
$that->updateRecords($call->return);
$that->staleRecordGetter->wait(5);
}
else
{
$that->logDebug('No stale records to update');
$that->staleRecordGetter->wait(55);
}
}
});
$this->staleRecordGetter->onTimeout(function ($call, $log) use($that)
{
$that->logDebug("Job {$call->id} Timed Out!");
$that->staleRecordGetterTimeout($call);
});
$updaterVia = new Core_Worker_Via_SysV();
$updaterVia->malloc(110 * 1024 * 1024);
$this->worker('recordUpdater', new RecordUpdateWorker(), $updaterVia);
$this->recordUpdater->timeout(30);
$this->recordUpdater->workers(5);
$this->recordUpdater->auto_restart(true);
$this->recordUpdater->onReturn(function ($call, $log) use($that)
{
$that->logDebug("Job {$call->id} to {$call->method}() Complete");
$that->recordUpdaterReturn($call->return);
});
$this->recordUpdater->onTimeout(function ($call, $log) use($that)
{
$that->logDebug("Job {$call->id} Timed Out!");
$that->recordUpdaterTimeout($call);
});
}
protected function execute()
{
if($this->recordUpdater->is_idle())
{
if($this->staleRecordGetter->is_idle())
{
$this->staleRecordGetter->getStaleRecords();
}
}
}
protected function updateRecords(PropelObjectCollection $staleRecords)
{
foreach($staleRecords as $staleRecord)
{
$this->recordUpdater->updateRecord($staleRecord);
}
}
Fixed issue with workers not tearing down properly
Okay, I've worked out one of the bigger issues I've been running into. I'm using PDO to connect to my databases. The problem with PDO is that when a process forks, the parent process is copied as the child, and these duplicate processes then continue executing side by side from that point onward. When the child processes terminate, they clean up and close the PDO connection; if the parent tries to use it again, it's closed and bad things happen. So the best practice for PDO connections seems to be: don't use them in the daemon, only in a worker or a task.
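One common way to follow that practice is to open the connection lazily, per process. This is a sketch of the pattern, not code from this library; a plain stdClass stands in for the real `new PDO($dsn, ...)` call so the sketch runs without a database:

```php
<?php
// Per-process lazy connection: if the current pid differs from the pid that
// opened the cached handle, we are in a freshly forked child and must open
// a new connection instead of reusing (and later closing) the parent's.
function connection(): stdClass
{
    static $conn = null;
    static $ownerPid = null;

    if ($conn === null || $ownerPid !== getmypid()) {
        $conn = (object) ['opened_by' => getmypid()]; // stand-in for new PDO(...)
        $ownerPid = getmypid();
    }
    return $conn;
}
```

Called from a worker, the first invocation after the fork opens a fresh handle, so the parent's connection is never shared with, or torn down by, a child.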
So it looks like what I really wanted was a running_count() method on the worker, so I added one. Now I can check whether the worker actually has work running, not just whether it has fewer processes running than max workers. This matters because I'm not sure the MIXED and LAZY queuing strategies are working as I expect: I never see running workers == max workers in those modes, no matter how much I queue.
I almost set the title "problem killing children" but thought that sounded bad for some reason... lol.
I'm using the latest Ubuntu 64-bit server, symfony 1.4, and Propel 1.7, with connections to an Oracle database using pdo_oci8. The issue I'm seeing is that the only way to stop the child processes that open a database connection is SIGKILL. This is causing me some pain when recycling processes, catching errors, restarting the parent daemon, etc...
Any way to add an option to force cleanup of child processes, or a check to see that they're all gone, with a SIGKILL after a specific timeout?
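For reference, the kind of escalating cleanup being asked for usually looks something like this. This is a sketch only (the function name and timeout are made up, and it is not the library's ProcessManager code); it needs the pcntl and posix extensions:

```php
<?php
// Ask each child to exit with SIGTERM, wait up to $timeout seconds while
// reaping the ones that comply, then SIGKILL anything still alive.
function reap_children(array $pids, int $timeout = 5): void
{
    foreach ($pids as $pid) {
        posix_kill($pid, SIGTERM);                    // polite request first
    }

    $deadline = microtime(true) + $timeout;
    while ($pids && microtime(true) < $deadline) {
        foreach ($pids as $i => $pid) {
            if (pcntl_waitpid($pid, $status, WNOHANG) > 0) {
                unset($pids[$i]);                     // exited and reaped
            }
        }
        usleep(50000);
    }

    foreach ($pids as $pid) {
        posix_kill($pid, SIGKILL);                    // force the stragglers
        pcntl_waitpid($pid, $status);                 // reap the zombie
    }
}
```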