
Restarting PostgreSQL sometimes causes Pool.getconn() to hang #147

Closed
gward opened this issue Jul 4, 2016 · 5 comments

gward commented Jul 4, 2016

Imagine a simple application that runs an infinite loop, periodically querying a database via momoko:

import logging
import random
import time

import momoko
import tornado.gen

log = logging.getLogger(__name__)


@tornado.gen.coroutine
def run(dsn, io_loop):
    pool = momoko.Pool(dsn=dsn, size=1, ioloop=io_loop)
    yield pool.connect()
    while True:
        conn = yield pool.getconn()
        cur = yield conn.execute('select pg_backend_pid()')
        pid = cur.fetchall()[0][0]
        log.info('pid: %r', pid)
        cur.close()
        pool.putconn(conn)
        time.sleep(random.uniform(0.2, 0.8))

As long as I never restart my PostgreSQL server, this works perfectly. But when I restart PG, bad stuff happens. Specifically:

  • almost every time, some calls to getconn() fail with DatabaseNotAvailable: No database connection available. That's annoying; I would think that a connection pool would take care of reconnecting when my database server restarts. I don't want to wrap a retry loop around momoko, because IMHO momoko should do that for me!
  • sometimes, the whole loop hangs forever because pool.getconn() never yields a result. My query-forever app is dead and has to be manually restarted.

The latter bug is what I'm really complaining about. If momoko's position is that PG restarts are an application problem rather than a connection pool problem, and it's my job to retry getconn() to handle them (roughly as sketched at the end of this comment), that's not ideal, but I can live with it.

But if momoko occasionally freezes up when PG restarts, that's a more serious problem. IMHO that needs to be fixed.
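For reference, here is roughly the retry wrapper I'd have to maintain around getconn() myself. This is only an untested sketch, not momoko API: get_connection(), max_tries and delay are names I made up, and it assumes Tornado >= 4.1 for gen.sleep(); the only real momoko call is pool.getconn(), and psycopg2.DatabaseError is what its failures surface as in my testing.

import logging

import psycopg2
from tornado import gen

log = logging.getLogger(__name__)


@gen.coroutine
def get_connection(pool, max_tries=5, delay=0.5):
    # Retry pool.getconn() while the database is unavailable,
    # sleeping a bit between attempts and giving up after max_tries.
    for attempt in range(1, max_tries + 1):
        try:
            conn = yield pool.getconn()
            raise gen.Return(conn)
        except psycopg2.DatabaseError:
            log.exception('getconn() failed (attempt %d of %d)', attempt, max_tries)
            if attempt == max_tries:
                raise
            yield gen.sleep(delay)

The caller would then write conn = yield get_connection(pool) instead of conn = yield pool.getconn().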

gward commented Jul 4, 2016

Oops, forgot to show my complete reproduction script:

$ cat pgrestart.py
#!/usr/bin/python

# What happens when momoko is connected to a database and the postgresql server is
# restarted? momoko should recover, but doesn't.

from __future__ import print_function

import sys
import logging
import random
import time

import psycopg2

from tornado import gen, ioloop
import momoko

log = logging.getLogger('pgrestart')


@gen.coroutine
def test(dsn, io_loop):
    log.info('creating pool: %r', dsn)
    pool = momoko.Pool(dsn=dsn, size=1, ioloop=io_loop)
    yield pool.connect()
    log.info('pool: %r', pool)

    while True:
        log.debug('calling pool.getconn()')
        try:
            conn = yield pool.getconn()
        except psycopg2.DatabaseError:
            log.exception('error from pool.getconn()')
            time.sleep(1)
            continue

        log.debug('querying conn %r', conn)
        cur = yield conn.execute('select pg_backend_pid()')
        pid = cur.fetchall()[0][0]
        log.info('pid: %r', pid)
        cur.close()
        pool.putconn(conn)

        time.sleep(random.uniform(0.2, 0.8))

    pool.close()


def main():
    dsn = sys.argv[1]

    logging.basicConfig(format='[%(asctime)s %(levelname)-1.1s %(name)s] %(message)s',
                        level=logging.DEBUG)
    logging.getLogger("momoko").setLevel(logging.INFO)

    io_loop = ioloop.IOLoop.instance()
    future = test(dsn, io_loop)
    io_loop.add_future(future, lambda x: io_loop.stop())
    io_loop.start()


main()

If I run this normally, the output is very boring:

python pgrestart.py 'host=localhost dbname=psycopg2_test'
[2016-07-04 16:25:46,807 I pgrestart] creating pool: 'host=localhost dbname=psycopg2_test'
[2016-07-04 16:25:46,819 I pgrestart] pool: <momoko.connection.Pool object at 0x7f55c52ffc50>
[2016-07-04 16:25:46,819 D pgrestart] calling pool.getconn()
[2016-07-04 16:25:46,820 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f55bdec27d0>
[2016-07-04 16:25:46,821 I pgrestart] pid: 29496
[2016-07-04 16:25:47,528 D pgrestart] calling pool.getconn()
[2016-07-04 16:25:47,530 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f55bdec27d0>
[2016-07-04 16:25:47,531 I pgrestart] pid: 29496
[...repeat forever...]

But if I start it running, and then in another terminal I run sudo service postgresql restart, here's what I get almost every time:

[2016-07-04 16:26:52,676 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:52,682 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:26:52,683 I pgrestart] pid: 29510
[2016-07-04 16:26:53,445 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:53,448 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:26:54,450 D pgrestart] calling pool.getconn()
[2016-07-04 16:26:54,470 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:26:54,471 I pgrestart] pid: 29543

That is, one query fails while PG is down, and then we're back in business with a new backend. I'd love it if momoko hid that and did the necessary retries for me, but that's not the serious problem.

Here is the serious problem, which I see occasionally:

[2016-07-04 16:27:12,581 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:12,583 D pgrestart] querying conn <momoko.connection.Connection object at 0x7f020b16a7d0>
[2016-07-04 16:27:12,584 I pgrestart] pid: 29634
[2016-07-04 16:27:13,087 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:13,088 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:27:14,089 D pgrestart] calling pool.getconn()
[2016-07-04 16:27:14,090 E tornado.application] Exception in callback <functools.partial object at 0x7f020a8fa890>
Traceback (most recent call last):
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/ioloop.py", line 600, in _run_callback
    ret = callback()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 579, in on_connection_available
    conn = fut.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available

...and that's it. My app is dead, frozen, hung. Nothing happens until I kill it and restart it.

Note that I can greatly increase the frequency of hangs by decreasing the sleep after "error from pool.getconn()". E.g. with a 1 sec sleep as shown, roughly 1 in 10 restarts results in a hung process. But if I reduce the sleep to 0.1 sec:

--- a/pgrestart.py
+++ b/pgrestart.py
@@ -32,7 +32,7 @@ def test(dsn, io_loop):
             conn = yield pool.getconn()
         except psycopg2.DatabaseError:
             log.exception('error from pool.getconn()')
-            time.sleep(1)
+            time.sleep(0.1)
             continue

         log.debug('querying conn %r', conn)

then my process hangs almost every single time. It's dead easy to reproduce.

gward commented Jul 4, 2016

And here is the output of running pgrestart.py with momoko debugging cranked up:

[2016-07-04 16:31:47,029 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,029 D momoko] Acquired free connection 8
[2016-07-04 16:31:47,030 D momoko] Obtained connection: 8
[2016-07-04 16:31:47,031 D pgrestart] querying conn <momoko.connection.Connection object at 0x7fe91745a7d0>
[2016-07-04 16:31:47,032 I pgrestart] pid: 29784
[2016-07-04 16:31:47,032 D momoko] About to release connection 8
[2016-07-04 16:31:47,033 D momoko] The connection is alive
[2016-07-04 16:31:47,033 D momoko] Handling free connection 8
[2016-07-04 16:31:47,033 D momoko] No outstanding requests - adding to free pool
[2016-07-04 16:31:47,611 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,611 D momoko] Acquired free connection 8
[2016-07-04 16:31:47,611 D momoko] Obtained connection: 8
[2016-07-04 16:31:47,612 D momoko] Method failed Asynchronously
[2016-07-04 16:31:47,612 D momoko] About to release connection 8
[2016-07-04 16:31:47,612 D momoko] The connection is dead
[2016-07-04 16:31:47,612 D momoko] All connections are dead - aborting waiting queue
[2016-07-04 16:31:47,612 E pgrestart] error from pool.getconn()
Traceback (most recent call last):
  File "pgrestart.py", line 32, in test
    conn = yield pool.getconn()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
    value = future.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available
[2016-07-04 16:31:47,713 D pgrestart] calling pool.getconn()
[2016-07-04 16:31:47,713 D momoko] All connections are dead
[2016-07-04 16:31:47,715 D momoko] Adding dead connection
[2016-07-04 16:31:47,716 E tornado.application] Exception in callback <functools.partial object at 0x7fe916be8520>
Traceback (most recent call last):
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/ioloop.py", line 600, in _run_callback
    ret = callback()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/data/src/momoko/momoko/connection.py", line 579, in on_connection_available
    conn = fut.result()
  File "/var/lib/venv/renesys/local/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
DatabaseNotAvailable: No database connection available

gward commented Jul 4, 2016

Oh yeah, I'm seeing this on:

  • Ubuntu 15.10
  • system package postgresql-9.4 9.4.8-0ubuntu0.15.10
  • system package python 2.7.9-1 (but python --version reports 2.7.10: weird!)
  • system package python-psycopg2 2.6.1-1build1
  • latest git master of momoko (commit 678f494)

haizaar added a commit that referenced this issue Jul 18, 2016

haizaar commented Jul 18, 2016

You are right about both. Sorry for the late response - really busy at work and with a 10-month-old daughter :)

The reconnect logic was only intended to reconnect idle connections that were silently dropped by the server - for example, if you run a lazy query once an hour and the server is restarted somewhere in between.
Trying to implement general retry logic at such a deep level is muddy water in my experience, which is why I leave it to the application developer.

Regarding the hangs - this is indeed a bug. It turns out there is an even simpler reproduction: instead of restarting, just shut down the server and your script will hang.

I've fixed this in master. Can you please have a look?

haizaar commented Jul 18, 2016

And thanks for the thorough reproduction (as usual :)

friedcell added a commit to friedcell/momoko that referenced this issue Jul 13, 2023
* commit '5c15607a69a80f1a40c25bb0c88461ce21f15804':
  Going 2.2.4
  Dropping support for Python 3.2
  Ping may fail early when all connections are dead. Fixes FSX#147
  flake8 cleanup
  Provide a useful __repr__() method for ConnectionContainer.
  Improve one small log message.
  Fix grammar/typo in comment and docstring.
  Fixed tutorial examples. Closes FSX#143.
  Fixes 139 - `all_dead` should check for pending
  Explicitly document that dict parameters are supported on execute.
  Explicitly document that dict parameters are supported.
  Fix typo.