
Problems with child processes not cleaning up #38

Merged
merged 1 commit into shaneharter:master on May 1, 2014

Conversation

@brunnels
Contributor

I almost set the title "problem killing children" but thought that sounded bad for some reason... lol.

I'm using the latest Ubuntu 64-bit server, Symfony 1.4, and Propel 1.7 with connections to an Oracle database via pdo_oci8. The issue I'm seeing is that the only way to stop child processes that have opened a database connection is SIGKILL. This is causing me some pain when recycling processes, catching errors, restarting the parent daemon, etc.

Any way to add an option to force cleanup of child processes, or a check that they're all gone followed by a SIGKILL after a specific timeout?
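For reference, what I have in mind is the usual TERM-then-KILL escalation; a minimal sketch (illustrative only, not PHP-Daemon API; $pid and the 30-second grace period are assumptions):

  // Ask the child nicely first, then force-kill after a grace period.
  posix_kill($pid, SIGTERM);
  $deadline = time() + 30;
  while (time() < $deadline && posix_kill($pid, 0)) {  // signal 0 just tests existence
    usleep(250000);
  }
  if (posix_kill($pid, 0)) {
    posix_kill($pid, SIGKILL);  // didn't exit in time; force it
  }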

@shaneharter
Owner

Can you give me more info? The functionality you mention is actually already in place -- in the Process class:

https://github.com/shaneharter/PHP-Daemon/blob/master/Core/Lib/Process.php#L39

So clearly things aren't working the way they're supposed to :)

@brunnels
Contributor Author

Looks like that should work for one worker, but say I have 2 workers: the first is configured for one fork, the second for 5. Forking mode is aggressive, so the second worker forks 5 processes on startup. Worker 1 throws an error, and its timeout callback sets shutdown to true on the daemon. The daemon process shuts down, and the first worker process shuts down with it, but the 5 processes for the second worker don't.

Am I doing it wrong? I was also hoping you had some examples demonstrating this:

  You can use the Core_Daemon::on(ON_FORK) method to provide universal setup code that is run after every fork and in every worker. The setup() method defined here can be used if you want specific setup code run in this forked process.

I saw that after I had already created an abstract base class that implements Core_IWorker and then had all my workers extending the base class.

@shaneharter
Owner

Ben,

I think I have an idea of what's going wrong here....

So when the parent Daemon process exits, it eventually gets to the Core_Daemon::__destruct method. https://github.com/shaneharter/PHP-Daemon/blob/master/Core/Daemon.php#L349

It iterates over all the workers and plugins and calls teardown() on them. One of those plugins is an instance of Core_Plugin_ProcessManager. You can see here that Core_Plugin_ProcessManager::teardown waits for processes, killing them if needed:
https://github.com/shaneharter/PHP-Daemon/blob/master/Core/Plugin/ProcessManager.php#L63

But, I believe I've made a careless mistake in Core_Daemon::__destruct. My theory is that Core_Plugin_ProcessManager::teardown is never being called because I mistakenly combine the plugin and worker arrays with the + operator instead of using array_merge. At one point these were associative arrays but in a refactoring they were changed to be numeric, so plugins and workers have the same keys.

Could you do me a favor and modify that line from:

  foreach($this->workers + $this->plugins as $object) {

to:

  foreach(array_merge($this->workers, $this->plugins) as $object) {

If you don't mind sending over a PR for that, I'd be happy to merge it. If it doesn't solve your issue, try adding some additional logging to Core_Plugin_ProcessManager::teardown and let's see what it's doing.
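To illustrate the difference (a standalone sketch, not library code):

  $workers = array(0 => 'WorkerA');
  $plugins = array(0 => 'ProcessManager');

  // The union operator keeps the left-hand value for every colliding key,
  // so the plugin at key 0 is silently dropped and its teardown() never runs:
  print_r($workers + $plugins);             // Array ( [0] => WorkerA )

  // array_merge renumbers numeric keys, so both objects survive the merge:
  print_r(array_merge($workers, $plugins)); // Array ( [0] => WorkerA [1] => ProcessManager )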

To your last comment, about using an ON_FORK callback:

In the setup() method of your Daemon class, you could do something like this:

  $this->on(self::ON_FORK, function() {
    // This function will be run in each child process immediately after forking.
    // It's triggered here: https://github.com/shaneharter/PHP-Daemon/blob/master/Core/Plugin/ProcessManager.php#L159
  });

@shaneharter
Owner

FWIW, issues like these are the motivation behind my work on test coverage for this library.

I wrote this initially in 2010 and open sourced it a year later. In 2012 I added background workers. Last year I made progress cleaning up the background workers and applying lessons learned. This year I have 2 goals:

  1. Provide an alternate channel for IPC aside from SysV. Using OS message queues and shared memory was the easiest path, but I've experienced some frustrating bugs when running at high throughput. My intention is to write an IWorkerVia object that communicates between worker and daemon using a socket.
  2. Refactor where necessary (static methods and lack of DI...) to provide a baseline of code coverage, focusing on the critical paths first.

@brunnels
Contributor Author

Thanks for looking at this. I had noticed that too and wondered why you hadn't used array_merge, but I didn't put 2 and 2 together. In the morning I'll make this change along with the one you suggested in the previous ticket. If all works well I'll fork and send a pull request tomorrow.

@shaneharter
Owner

Sounds good, thanks for the bug report!

@brunnels
Contributor Author

Okay, those changes have the daemon cleaning up on shutdown now. I'm still running into an issue when a worker is recycled on timeout, or when setting auto_restart=true on the worker: it doesn't die, and strange things happen. I'll see if I can work out a simple example to demonstrate.

@brunnels
Contributor Author

Okay, I replicated it pretty easily. From looking at the code: if a worker throws an exception and times out, the mediator kills the process at line 724 and then the daemon timeout callback gets called. This works correctly, but if you do nothing in the daemon timeout callback and just let it sit with only the daemon process running for about 15 seconds, the daemon forks another worker process and loads it up with 20+ calls to throwException(). I would expect only a single call on the new worker, after which is_idle() would return false.

Here's how to reproduce:

Examples/PrimeNumbers/Daemon.php

  protected function setup_workers()
  {
    $that = $this;

    $via = new \Core_Worker_Via_SysV();
    $via->malloc(30 * 1024 * 1024);  // 30MB of shared memory for the via

    $this->worker('PrimeNumbers', new Workers_Primes(), $via);
    $this->PrimeNumbers->timeout(5);   // 5-second call timeout
    $this->PrimeNumbers->workers(1);   // single worker process

    $this->PrimeNumbers->onTimeout(function ($call, $log) use($that)
    {
      $log("Job {$call->id} Timed Out!");
    });
  }

  protected function execute()
  {
    // Queue another exception-throwing call whenever the worker looks idle
    if($this->PrimeNumbers->is_idle())
    {
      $this->PrimeNumbers->throwException();
    }
  }

Examples/PrimeNumbers/Workers/Primes.php

  public function throwException()
  {
    throw new \Exception('Testing throwing an exception');
  }

@cturbelin

By the way, I've just seen this in ProcessManager.php (lines 120-122):

  $list = array();
  foreach($this->processes as $process_group)
    $list += $process_group;

Shouldn't that also be array_merge instead of "+=", if a flat array of processes is being built from the $processes property? (It's an array of arrays, if I'm not wrong.)

@brunnels
Contributor Author

That's actually correct the way it is, I think. $list is keyed by PID; if you use array_merge, the integer keys get renumbered, the PID association is lost, and bad things happen.
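To illustrate with made-up PIDs (a standalone sketch, not library code):

  $group1 = array(4101 => 'process A', 4102 => 'process B');
  $group2 = array(4103 => 'process C');

  // "+=" flattens the groups while preserving the PID keys:
  $list = array();
  $list += $group1;
  $list += $group2;
  print_r(array_keys($list));                          // 4101, 4102, 4103

  // array_merge would renumber the integer keys and lose the PIDs:
  print_r(array_keys(array_merge($group1, $group2)));  // 0, 1, 2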

@cturbelin

Oh yes, thanks!

@shaneharter
Owner

Solid debugging work. If either of you are looking for work in San Francisco, lmk :)

Let me play with this and see what develops.

@shaneharter
Owner

I see what you mean. Have you stepped through the IPC using the --debugworkers option? I'm doing that now.

@shaneharter
Owner

OK, I've reproduced both bugs. @cturbelin, one reason I don't love the idea of protecting these calls is that these issues shouldn't exist in the first place. Protecting the call would avoid the fatal error, but it would mask this bug and make it more subtle.

shaneharter reopened this Apr 22, 2014
@shaneharter
Owner

@brunnels So playing with this... Here's what's happening with is_idle in your example:

  • The daemon starts a 1-second event loop
  • On the first iteration of your event loop, the daemon calls PrimeNumbers->throwException()
  • On the second iteration of your event loop, there is 1 running call (and 1 worker process), so is_idle() returns false
  • That continues for the next 5 calls to execute() over the next 5 seconds
  • The mediator runs the timeout on that first call and kills the process

Now here's where I need to dig deeper. I think what we're seeing here is this:

ProcessManager keeps a hash of running processes. It only removes processes from this list when it gets the SIGCHLD signal from the OS, which it checks for when it runs its reap() method. The reap() method gets called on-idle. It should be pretty much instantaneous but there's something going wrong here... I need to do more digging.
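For reference, non-blocking reaping typically looks something like this (a sketch, not the actual ProcessManager code; the PID-keyed $processes list is an assumption):

  // Collect any children that have exited, without blocking the event loop.
  while (($pid = pcntl_waitpid(-1, $status, WNOHANG)) > 0) {
    unset($this->processes[$pid]);  // forget the reaped child
  }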

I'm not sure WHY it's happening, but the reason is_idle might return true when you expect false is that it compares RUNNING calls to the number of allocated processes. Calls are only marked as "running" once the daemon gets an Ack message from the worker. That's flawed and should be improved: if the mediator queues up 10 calls, the fact that a worker hasn't acked them as running yet doesn't mean the workers are idle.
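In pseudocode, the flawed check amounts to something like this (method and property names assumed; not the actual mediator code):

  public function is_idle()
  {
    // Only acked calls count as "running", so calls that are queued but
    // not yet acknowledged make a busy worker pool look idle.
    return $this->running_count() < count($this->processes);
  }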

I'll continue to work on this and let you know what I find.

@brunnels
Contributor Author

So is there a better way to determine if a worker is idle or not? Here's my scenario:

I'm updating records in one database from another database, and I need to update each record every hour. My first worker, with a single process, queries the local database for a list of records that need updating and returns the list to the daemon. The daemon then loops through that list, queueing work for the second worker to pull the records from the remote database and update the local database. If the first worker doesn't find any records that need updating, it sleeps for 55 seconds; the first worker has a timeout of 60 seconds. The first worker should never have more than one piece of work queued, and it should not run unless there is no work queued for the second worker. If any worker child process throws an uncaught error, I log it in the timeout callback and send an email, but I want the worker to fork a new child and continue on.

Here's some code from my daemon:

  protected function setup_workers()
  {
    $that = $this;

    $staleVia = new Core_Worker_Via_SysV();
    $staleVia->malloc(75 * 1024 * 1024);

    $this->worker('staleRecordGetter', new StaleRecordWorker(), $staleVia);
    $this->staleRecordGetter->timeout(60);
    $this->staleRecordGetter->workers(1);
    $this->staleRecordGetter->auto_restart(true);

    $this->staleRecordGetter->onReturn(function ($call, $log) use($that)
    {
      $that->logDebug("Job {$call->id} to {$call->method}() Complete");

      if($call->method == 'getStaleRecords')
      {
        if($call->return instanceof PropelObjectCollection && count($call->return))
        {
          $that->logDebug(count($call->return) . ' Stale records found');
          $that->updateRecords($call->return);
          $that->staleRecordGetter->wait(5);
        }
        else
        {
          $that->logDebug('No stale records to update');
          $that->staleRecordGetter->wait(55);
        }
      }
    });

    $this->staleRecordGetter->onTimeout(function ($call, $log) use($that)
    {
      $that->logDebug("Job {$call->id} Timed Out!");

      $that->staleRecordGetterTimeout($call);
    });

    $updaterVia = new Core_Worker_Via_SysV();
    $updaterVia->malloc(110 * 1024 * 1024);

    $this->worker('recordUpdater', new RecordUpdateWorker(), $updaterVia);
    $this->recordUpdater->timeout(30);
    $this->recordUpdater->workers(5);
    $this->recordUpdater->auto_restart(true);

    $this->recordUpdater->onReturn(function ($call, $log) use($that)
    {
       $that->logDebug("Job {$call->id} to {$call->method}() Complete");
       $that->recordUpdaterReturn($call->return);
    });

    $this->recordUpdater->onTimeout(function ($call, $log) use($that)
    {
      $that->logDebug("Job {$call->id} Timed Out!");
      $that->recordUpdaterTimeout($call);
    });
  }

  protected function execute()
  {
    // Only query for stale records when the updater pool is fully idle
    // and no stale-record lookup is already in flight
    if($this->recordUpdater->is_idle())
    {
      if($this->staleRecordGetter->is_idle())
      {
        $this->staleRecordGetter->getStaleRecords();
      }
    }
  }

  protected function updateRecords(PropelObjectCollection $staleRecords)
  {
    foreach($staleRecords as $staleRecord)
    {
      $this->recordUpdater->updateRecord($staleRecord);
    }
  }

Fixed issue with workers not tearing down properly
@brunnels
Contributor Author

Okay, I've worked out one of the bigger issues I've been running into. I'm using PDO to connect to my databases. The problem with PDO is that when a process forks, the parent process is copied as the child, and the two processes then continue executing side by side, sharing the inherited database connection. When a child process terminates, it cleans up and closes the PDO connection; if the parent then tries to use that connection again, it's already closed and bad things happen.

So the best practice for PDO connections seems to be: don't use them in the daemon, only in a worker or a task.
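A sketch of that practice, with the connection opened lazily inside the worker so each forked child builds its own handle (class name and DSN are placeholders):

  class RecordUpdateWorker implements Core_IWorker
  {
    private $pdo;

    private function db()
    {
      if ($this->pdo === null) {
        // Created after the fork, inside this child only -- never
        // inherited from the parent and never closed out from under it.
        $this->pdo = new PDO('oci:dbname=MYDB', 'user', 'pass');
      }
      return $this->pdo;
    }

    // ... Core_IWorker interface methods omitted ...
  }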

@brunnels
Contributor Author

So it looks like what I really wanted was a running_count() method on the worker, so I added one. Now I can determine whether the worker has fewer processes running than max workers, rather than just checking idleness. This is important because I'm not sure the MIXED and LAZY queuing strategies are working as I expect: I never see running workers == max workers in those modes, no matter how much I queue.
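With that accessor, execute() can check for spare capacity instead of full idleness, e.g. (a sketch; the threshold of 5 matches the workers(5) configuration above):

  // Queue more work whenever the pool has a free process:
  if ($this->recordUpdater->running_count() < 5) {
    // dispatch another updateRecord() call here
  }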

shaneharter added a commit that referenced this pull request May 1, 2014
Problems with child processes not cleaning up
shaneharter merged commit 7e76f00 into shaneharter:master May 1, 2014