How Does SQLAlchemy Manage Database Transactions? || fabian writes.

I’ve been working on database stuff for the last few months at Sendwave. It’s… unexpectedly satisfying? There’s so much thought that’s gone into making databases reliable and useful and fast and somehow mostly text-based.

Sendwave is using the SQLAlchemy ORM for database operations, and we’re getting to a point now where scale and performance concerns demand that we’re doing things efficiently. This means we’re no longer able to treat SQLAlchemy as a black-box, but instead, it’s time to dive in to the way SQLAlchemy itself uses the database.

In this blog post, I’ll explain how, when and why SQLAlchemy utilises database transactions, and give you ✨ tools for exploring this sort of thing yourself ✨.

Along the way, I’ll jump a lot between what’s happening from the perspective of the database, and what’s happening from the perspective of the higher-level SQLAlchemy API. Our systems use Postgres, so a lot of the SQL presented will be Postgres-specific. If you’re using a different database, the techniques will probably be reusable, but you’ll need to adapt some of them to your specific DB.

A Quick Introduction to Database Transactions

Why should I care about transactions? In short, they’re the fundamental building block of all SQL databases. In SQL, you enter a transaction whenever you write BEGIN, and exit the transaction whenever you write COMMIT or ROLLBACK.

As far as I’m aware, there’s two main, related reasons for transactions¹:

To allow for atomic writes, that is, to say to the database “please do all of these things” (i.e. COMMIT), or “actually I changed my mind, please do none of those things” (i.e. ROLLBACK).
To ensure that results and changes being applied on one connection don’t have effects on other, concurrent connections. Writes occurring within a transaction aren’t normally available to other sessions until they’ve been committed².

In database parlance, these properties are known as Atomicity and Isolation, respectively. If you’ve heard of an ACID-compliant database, we’ve just covered the A and the I.

You might not have manually written SQL for transactions before! I personally don’t do it very often; and for many of my manual workloads, I don’t care about batching a series of statements so much. This doesn’t mean you haven’t used transactions, though – many ORMs will use them automatically and Postgres automatically wraps single queries in transactions so that it can maintain those guarantees about atomicity and isolation.

Connection States

Transactions are so fundamental to the SQL model that each Postgres connection is described by one of three states, which describe the relationship between a connection and a possible transaction open on the connection³:

active - the database is actually doing work on the connection, i.e. running a query. This also implies that the connection is in a transaction, because, as we discussed, every query runs in a transaction.
idle in transaction - the connection started a transaction at some point (ran BEGIN), but now, it’s just waiting. Most likely, the application loaded some data, and is processing it in some way before writing something back or loading more data.
idle - the connection is waiting. You could disconnect from the database if you wanted and it wouldn’t result in data loss.

Our specific problem at Sendwave, and the reason I started investigating this in the first place, was that we had a very large number of connections that were idle in transaction. When a connection is in this state, it’s not really doing work, but it also can’t be used for something else, so it sorta represents a missed opportunity, especially when you start brushing up against the connection limits of the database⁴. It’s ok to idle in transaction for short periods of time – like, if you’re doing some processing on a database record before writing it back – but it’s less ideal to be idle in transaction whilst making slow HTTP requests or otherwise waiting for something out of your control, because you’re effectively restricting the degree to which your system can scale based on the amount of time you spend performing these slower tasks.

SQLAlchemy doesn’t directly expose transaction state to API users, but it’s clear that in order to minimise the amount of time we’re wasting in this state, we need to understand what’s happening.

Still with me? Cool 🥳. Time to introduce the tools we’re going to use to figure out SQLAlchemy’s use of database transactions.

Tools for Understanding Database Performance

We’ll be using two main tools for understanding what’s going on in the database:

The pg_stat_activity view, which gives us insight into what the different connections to our database are doing at any point in time, and
Verbose SQL logging in SQLAlchemy.

`pg_stat_activity`, the Swiss Army Knife of Postgres Connection Tracking

Let’s start with Postgres’ pg_stat_activity view. This lists some critical info about every connection to the database, and what they’re all up to.

Assuming you’ve got Postgres installed, fire up a new terminal and log into Postgres:

psql -x  # -x chooses a nicer display format for this task

Now, run the following query:

SELECT
    backend_start,
    xact_start,
    query_start,
    state_change,
    state,
    substring(query for 50) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend';

This query selects every connection that’s running on the current system. Note that WHERE backend_type = 'client backend' excludes automated processes, which are critical for Postgres’ normal operation, and that we’re only including a subset of the columns available in the pg_stat_activity view – the others are useful, but aren’t as relevant for what we’ll be doing today.

You should see one record indicating the query we literally just ran:

-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 12:46:24.080431+01
xact_start    | 2019-12-15 12:46:25.100934+01
query_start   | 2019-12-15 12:46:25.100934+01
state_change  | 2019-12-15 12:46:25.100936+01
state         | active
query         | SELECT                       +
              |     backend_start,           +
              |     xact_start,              +
              |     quer

To make it easier to read results in future, let’s exclude the diagnostic query as we run it, using:

SELECT
    backend_start,
    xact_start,
    query_start,
    state_change,
    state,
    substring(query for 50) AS query
FROM pg_stat_activity
WHERE
    backend_type = 'client backend' AND
    pid != pg_backend_pid(); -- ✨ new! ✨

The new WHERE condition references the connection’s pid. The pid, or process ID, is a number referring to the unique Postgres ‘backend process’ running the query. There’s one pid per database connection, and one for each of the automated processes we’re filtering out with backend_type = 'client backend'. If you run that query now, the results should come back empty, because we’ve got no other queries running against the database.

Keep this database shell open! Throughout this article, we’ll run this exact query again and again to inspect the state of our connections from the perspective of the database.

Turn on SQLAlchemy statement logging

Tool number two is verbose logging of the SQL emitted by SQLAlchemy. Doing this allows us to see what’s actually happening under the hood in SQLAlchemy without having to read and understand all the SQLAlchemy source code⁵.

The fastest way to turn on statement logging is to add echo=True when you instantiate the SQLAlchemy engine:

engine = sqlalchemy.create_engine(os.environ['DATABASE_URL'], echo=True)
Session = orm.sessionmaker(bind=engine)

…but it’s also possible to do this with logging configuration. In this example repo, I’ve configured SQL output using the logging mechanism, so I can customise the formatting to my liking.

Digging in

Let’s write some code and run it line-by-line to see what it does. Run the following in an ipython shell or Jupyter notebook, and run our pg_stat_activity query after each step. You won’t get output after every line of python, but I’ve paused to show output when anything has changed.

# Configure a connection to the database at the URL specified by the
# DATABASE_URL environment variable.
# Remember that we're using `echo=True` so we can see all generated SQL.
engine = sqlalchemy.create_engine(os.environ['DATABASE_URL'], echo=True)

# Create a session factory. Calling `Session()` will create new SQLAlchemy
# ORM sessions.
Session = orm.sessionmaker(bind=engine)

# Create a new session which we'll use for the following investigation.
session = Session()

Even after creating a new session, there still hasn’t been any SQL emitted. Let’s keep going:

q = session.query(Zebra).filter(Zebra.when_born <= now() - 2*YEAR)
old_zebra = q.first()

select version()
{}
select current_schema()
{}
SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
{}
SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
{}
show standard_conforming_strings
{}
BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
WHERE zebras.when_born <= %(when_born_1)s
 LIMIT %(param_1)s
{'when_born_1': datetime.datetime(2017, 12, 15, 13, 55, 24, 401458), 'param_1': 1}

Wow, that’s a lot of SQL statements for a single query! Let’s walk through them:

The first 5 statements are sanity and compatibility checks – to check that the database environment matches what the SQLAlchemy library authors were expecting when they wrote the code. It runs just once per engine⁶.
The sixth query is a BEGIN, which opens up a new transaction before SQLAlchemy attempts to load any data.
The seventh actually loads the thing that we wanted to load.

Let’s run our pg_stat_activity query now to see what the database status is:

-[ RECORD 1 ]-+---------------------------------------------------
backend_start | 2019-12-15 12:55:24.40754+01
xact_start    | 2019-12-15 12:55:24.422549+01
query_start   | 2019-12-15 12:55:24.422586+01
state_change  | 2019-12-15 12:55:24.423075+01
state         | idle in transaction
query         | SELECT zebras.id AS zebras_id, zebras.name AS zebr

There’s a wealth of information in this record. Let’s run through it quickly:

The connection to the database was made at 12:55:24.40754+01 (backend_start)
About 15 ms later, a transaction was opened (xact_start)
The most recent query SELECT zebras.id AS zebras_id, zebras.name AS zebr (query) was started about 0.04 ms after that (query_start)
The most recent query took 0.5 ms to run, because at 12:55:24.423075+01 the state changed back from active to idle in transaction (state_change)
The transaction is still open, and the database is waiting for our connection to do something or to close out the transaction (state is idle in transaction).

Let’s keep writing python code, and get a watering hole for our Zebra:

wh = session.query(WateringHole).first()

You’ll see that this time, SQLAlchemy only emits the relevant statement, because we’ve already run the sanity checks (once per engine) and we’re already in a transaction:

SELECT watering_holes.id AS watering_holes_id, watering_holes.name AS watering_holes_name, watering_holes.latitude AS watering_holes_latitude, watering_holes.longitude AS watering_holes_longitude
FROM watering_holes
 LIMIT %(param_1)s
{'param_1': 1}

If we run our pg_stat_activity query again, we’ll get something like:

-[ RECORD 1 ]-+---------------------------------------------------
backend_start | 2019-12-15 12:55:24.40754+01
xact_start    | 2019-12-15 12:55:24.422549+01
query_start   | 2019-12-15 12:57:18.384988+01
state_change  | 2019-12-15 12:57:18.385397+01
state         | idle in transaction
query         | SELECT watering_holes.id AS watering_holes_id, wat

Note that:

We’re using the same connection and the same transaction, so backend_start and xact_start haven’t changed
query_start, state_change, and query have all changed to reflect the new query⁷. They’re about two minutes after the backend_start and xact_start because I ran this part of the example two minutes after running the first part.

Let’s wrap it up and rollback the query:

session.rollback()

ROLLBACK

-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 12:55:24.40754+01
xact_start    |
query_start   | 2019-12-15 12:58:04.490102+01
state_change  | 2019-12-15 12:58:04.490205+01
state         | idle
query         | ROLLBACK

The state is now idle, which means the connection’s literally doing nothing. It’s not in a transaction (note that xact_start is blank now), and it’s not running in a query – it’s just waiting to do work.

Calling session.commit() instead of session.rollback() looks pretty much the same. For completeness:

session = Session()
# Grab any ol' Zebra
print(session.query(Zebra).first())
session.commit()

BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
 LIMIT %(param_1)s
{'param_1': 1}
Zebra(name='Stevie')
COMMIT

-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 12:55:24.40754+01
xact_start    |
query_start   | 2019-12-15 13:00:30.617891+01
state_change  | 2019-12-15 13:00:30.617905+01
state         | idle
query         | COMMIT

Finding loads in all the wrong places

Things kinda make sense so far, right? We’ve now determined:

A SQLAlchemy session opens a transaction only when it first needs access to the database, and
It closes the transaction when you rollback() or commit().

Things get slightly hairier when you continue to work with objects after these events, though. Let’s load up an object and close out the transaction:

session = Session()
# Grab any ol' Zebra
zebra = session.query(Zebra).first()
session.commit()

BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
 LIMIT %(param_1)s
{'param_1': 1}
COMMIT

Cool – the SQL indicates that the transaction’s been committed (you can verify that with the pg_stat_activity query, if you want).

Now, let’s try saying hello to that zebra:

print('Hello,', zebra.name)

BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
WHERE zebras.id = %(param_1)s
{'param_1': 1}
Hello, Stevie!

Wait, that first line is a BEGIN! Did SQLAlchemy really open another transaction just to get data we already know about?

Let’s run pg_stat_activity to make sure:

-[ RECORD 1 ]-+---------------------------------------------------
backend_start | 2019-12-15 13:02:08.225582+01
xact_start    | 2019-12-15 13:02:08.335367+01
query_start   | 2019-12-15 13:02:08.335425+01
state_change  | 2019-12-15 13:02:08.335628+01
state         | idle in transaction
query         | SELECT zebras.id AS zebras_id, zebras.name AS zebr

… yep, the connection is idle in transaction. SQLAlchemy’s default behaviour is that when you access a property after the transaction has ended (i.e. through a commit or a rollback), the object gets reloaded from the database from a new transaction. This was really surprising to me – I wouldn’t have expected a simple property access to load data from the database!

There’s an explanation of the rationale for that behaviour in that link, but for now, let’s talk about what to do if you don’t want a new transaction to open automatically.

Use `expunge` to detach objects from the session

If you want to access the data without opening a session, you can expunge the object from the session. This detaches an instance from the given session, which means that its properties won’t be reloaded after commit:

session = Session()
zebra = session.query(Zebra).first()
print(f'Hello, {zebra.name}!')
session.expunge(zebra)
session.commit()
print(f'{zebra.name}, are you still there?')

Observe that there’s no BEGIN after accessing the Zebra’s properties…

BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
 LIMIT %(param_1)s
{'param_1': 1}
Hello, Stevie!
COMMIT
Stevie, are you still there?

…and that our printout of SQL connection usage indicates that the connection is still idle (and not idle in transaction):

-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 13:10:31.69101+01
xact_start    |
query_start   | 2019-12-15 13:12:07.608321+01
state_change  | 2019-12-15 13:12:07.608364+01
state         | idle
query         | COMMIT

Note that you have to be careful when you’re expunging stuff. For example, changes we make to our Zebra buddy won’t automatically stick after we’ve expunged them from the session⁸:

# 'expunge' is the action, 'detached' is the state
assert inspect(zebra).detached
zebra.number_of_stripes = 63
session.commit()
other_session = Session()
# Reload the zebra in a different session
zr = other_session.query(Zebra).filter(Zebra.id == zebra.id).one()
print(f'{zr.name} has {zr.number_of_stripes} stripes')

BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
WHERE zebras.id = %(id_1)s
{'id_1': 1}
Stevie has None stripes

Similarly, if you’re inserting objects, make sure to session.flush() the object to the database before expunging it:

session = Session()
janet = Zebra(name='Janet', when_born=parser.parse('2018-06-05'))
session.add(janet)
session.flush()
# ID should have been set when we flushed Janet to the database,
# which indicates that she was correctly saved.
assert janet.id
session.expunge(janet)
session.commit()
print(f"Hi {janet.name}, we're glad you're still here!")

BEGIN (implicit)
INSERT INTO zebras (name, when_born, when_stripes_counted, number_of_stripes) VALUES (%(name)s, %(when_born)s, %(when_stripes_counted)s, %(number_of_stripes)s) RETURNING zebras.id
{'name': 'Janet', 'when_born': datetime.datetime(2018, 6, 5, 0, 0), 'when_stripes_counted': None, 'number_of_stripes': None}
COMMIT
Hi Janet, we're glad you're still here!

Use `close()` to enforce that you’re definitely not in a transaction

Now that expunge() has given us the ability to use objects outside of a session context, we might want to enforce that transactions aren’t accidentally opened by querying object state.

Calling close() means that under no circumstances will another database transaction be opened to reload attributes, but the semantics of what this means for objects are subtle. In the following two examples, I’ll load two Zebras, detach one and keep the other one in the session:

session = Session()
[za, zb] = session.query(Zebra).limit(2)
session.expunge(za)

If we close() the session without calling rollback() or commit(), then, per the docs, everything is expunged from the session:

session = Session()
[za, zb] = session.query(Zebra).limit(2)
session.close()
assert inspect(za).detached
assert inspect(zb).detached
print(f"Hi {za.name}, we're glad you're still here!")
print(f"Hi {zb.name}, we're glad you're here too!")

BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
 LIMIT %(param_1)s
{'param_1': 2}
ROLLBACK
Hi Stevie, we're glad you're still here!
Hi Anya, we're glad you're here too!

On the other hand, if you explicitly commit() or rollback() before calling close(), the objects get expired before they get detached, and so any subsequent attempts to load will cause failures:

session = Session()
[za, zb] = session.query(Zebra).limit(2)
# Expunge one explicitly before session close
session.expunge(za)
# Or rollback, both have the same effect
session.commit()
session.close()
# Should work, was detached before commit()
print(f"Hi {za.name}, we're glad you're still here!")
# Will raise, was detached after being expired
print(f"Hi {zb.name}, we're glad you're here too!")

[sql omitted]

Hi Stevie, we're glad you're still here!
Traceback (most recent call last):
  File "/Users/fabian/Code/sqlalchemy-example/utils.py", line 36, in step
    yield
  File "/Users/fabian/Code/sqlalchemy-example/002_loads_in_all_the_wrong_places.py", line 85, in <module>
    print(f"Hi {zb.name}, we're glad you're here too!")
  File "/Users/fabian/.local/share/virtualenvs/sqlalchemy-example-uvKGK8dZ/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py", line 282, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/Users/fabian/.local/share/virtualenvs/sqlalchemy-example-uvKGK8dZ/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py", line 705, in get
    value = state._load_expired(state, passive)
  File "/Users/fabian/.local/share/virtualenvs/sqlalchemy-example-uvKGK8dZ/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 660, in _load_expired
    self.manager.deferred_scalar_loader(self, toload)
  File "/Users/fabian/.local/share/virtualenvs/sqlalchemy-example-uvKGK8dZ/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 913, in load_scalar_attributes
    "attribute refresh operation cannot proceed" % (state_str(state))
sqlalchemy.orm.exc.DetachedInstanceError: Instance <Zebra at 0x10d91d5c0> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: http://sqlalche.me/e/bhk3)

Our first Zebra pal was able to be greeted because they’d previously been detached through the explicit expunge() call. Calling session.commit() expired the stored info about our second Zebra pal, who was still associated with the session. Then, when SQLAlchemy went to refresh the info about our second zebra from the database, it raised an error because the session doesn’t exist anymore, and thus there’s nothing that expired data can be reloaded from.

Observe, however, that no transaction was re-opened as part of this attempt, and so our connection state is again idle:

-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 13:15:51.734891+01
xact_start    |
query_start   | 2019-12-15 13:16:33.850546+01
state_change  | 2019-12-15 13:16:33.850578+01
state         | idle
query         | COMMIT

Wrapping up

Alright, that’s it for now. Hopefully this post has given you the tools you need to keep digging if you want to keep exploring 🕵🏽‍♀️

Here’s what we covered:

pg_stat_activity is wonderfully useful for understanding what every connection is doing in a Postgres database
Turning on logging at a lower level of the stack is a great way to understand what’s going on at higher levels
SQLAlchemy opens database transactions the first time you attempt to load from a Session object, and closes them when you commit() or rollback()
Closing a transaction with commit() or rollback() will cause all associated objects to be expired, and reload their properties on the next access
You can expunge() objects from the session so that you can continue to use them outside of a database transaction
You can call Session.close() to guarantee that objects loaded within a session won’t reopen transactions, or otherwise access the database anymore.

Now that we’ve got a good picture of how sessions interact with database transactions, we’re well-positioned to build systems that make it easy to reason about transactions! That’ll require some careful thought and another few thousand words, though, so we’ll come back to it in a future post.

As always, I’d love to hear from you with thoughts or feedback.

See Transactions in the Postgres docs for a great explanation of the how / why here. ↩︎
There are ways of modifying what’s visible from one transaction to another – this is referred to as transaction isolation. The fact that people have thought these details through so thoroughly and that it all works so well strikes me as a significant feat of engineering. I’d taken this for granted for so long, but it’s a hard problem both from a design and a technological perspective, and it’s amazing to me that we’ve got off-the-shelf technology to do it at scale now. ↩︎
There’s actually five, but for the sake of simplicity, I’m going to say a lot of very reductionistic things in this article. This is a high-level primer to a relatively complex topic, so please bear with me 🐻 ↩︎
I really hoped I’d be able to fit a discussion about connection limits into this article but I think it’s slightly too off topic, and there’s so much good interesting stuff to talk about there that I’d have a hard time stopping 😅 The tl;dr: many cloud database providers set hard limits on the number of concurrent connections for your database. It turns out that there are some real good reasons for this – notably that the performance of some postgres subsystems is linear in the number of connections, and so at a certain point, total system throughput actually decreases compared to running the extra work sequentially. Brandur has an excellent writeup about this, and there’s lots of other good info in the Postgres Wiki. ↩︎
Note that we could also see all SQL queries by enabling and reading Postgres query logs. For our purposes, it’s more convenient to have SQLAlchemy emit these queries so that we can also print things from the Python code to see exactly when each query is being run, but query logs are more accurate because they don’t rely upon the SQLAlchemy authors having added a log statement everywhere a query is run. Using query logs would be a great last resort if the results I was seeing from SQLAlchemy weren’t making any sense. ↩︎
I discovered this by searching for the string ’test plain returns’ in the SQLAlchemy code, and then using PyCharm’s excellent code navigation features to discover the places that that method was called. ↩︎
Why has state_change updated, even though the state is the same? When SQLAlchemy ran our query, we briefly transitioned to the ‘active’ state and then back again to the ‘idle in transaction’ state. So we’ve actually seen two state changes, and the state_change variable simply contains the timestamp at which the most recent one occurred. ↩︎
You’re able to merge detached instances back into sessions, but merging is complex and comes with a lot of warnings. I’m avoiding writing about it because it’s scary and nuanced and I’m praying I never have to learn how it works 😅 but the docs for it seem pretty solid, at least. ↩︎