I’ve been working on database stuff for the last few months at Sendwave. It’s… unexpectedly satisfying? There’s so much thought that’s gone into making databases reliable and useful and fast and somehow mostly text-based.
Sendwave is using the SQLAlchemy ORM for database operations, and we’re getting to a point now where scale and performance concerns demand that we do things efficiently. This means we can no longer treat SQLAlchemy as a black box; instead, it’s time to dive into the way SQLAlchemy itself uses the database.
In this blog post, I’ll explain how, when and why SQLAlchemy utilises database transactions, and give you ✨ tools for exploring this sort of thing yourself ✨.
Along the way, I’ll jump a lot between what’s happening from the perspective of the database, and what’s happening from the perspective of the higher-level SQLAlchemy API. Our systems use Postgres, so a lot of the SQL presented will be Postgres-specific. If you’re using a different database, the techniques will probably be reusable, but you’ll need to adapt some of them to your specific DB.
A Quick Introduction to Database Transactions
Why should I care about transactions? In short, they’re the fundamental building block of all SQL databases. In SQL, you enter a transaction whenever you write BEGIN, and exit the transaction whenever you write COMMIT or ROLLBACK.
As far as I’m aware, there’s two main, related reasons for transactions1:
- To allow for atomic writes, that is, to say to the database “please do all of these things” (i.e. COMMIT), or “actually I changed my mind, please do none of those things” (i.e. ROLLBACK).
- To ensure that results and changes being applied on one connection don’t have effects on other, concurrent connections. Writes occurring within a transaction aren’t normally available to other sessions until they’ve been committed2.

In database parlance, these properties are known as Atomicity and Isolation, respectively. If you’ve heard of an ACID-compliant database, we’ve just covered the A and the I.
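To make the atomicity point concrete, here’s a minimal sketch using SQLAlchemy Core. The accounts table is purely hypothetical and isn’t part of this post’s examples; engine.begin() wraps the block in a single transaction, committing if it succeeds and rolling back if it raises:

import os

import sqlalchemy
from sqlalchemy import text

engine = sqlalchemy.create_engine(os.environ['DATABASE_URL'])

# engine.begin() runs the block in one transaction: COMMIT if it succeeds,
# ROLLBACK if an exception is raised, so either both UPDATEs take effect
# or neither does. (The 'accounts' table here is just for illustration.)
with engine.begin() as conn:
    conn.execute(text("UPDATE accounts SET balance = balance - 10 WHERE id = 1"))
    conn.execute(text("UPDATE accounts SET balance = balance + 10 WHERE id = 2"))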
You might not have manually written SQL for transactions before! I personally don’t do it very often; and for many of my manual workloads, I don’t care about batching a series of statements so much. This doesn’t mean you haven’t used transactions, though – many ORMs will use them automatically and Postgres automatically wraps single queries in transactions so that it can maintain those guarantees about atomicity and isolation.
Connection States
Transactions are so fundamental to the SQL model that each Postgres connection is described by one of three states, which describe the relationship between a connection and a possible transaction open on the connection3:
- active - the database is actually doing work on the connection, i.e. running a query. This also implies that the connection is in a transaction, because, as we discussed, every query runs in a transaction.
- idle in transaction - the connection started a transaction at some point (ran BEGIN), but now, it’s just waiting. Most likely, the application loaded some data, and is processing it in some way before writing something back or loading more data.
- idle - the connection is waiting. You could disconnect from the database if you wanted and it wouldn’t result in data loss.
Our specific problem at Sendwave, and the reason I started investigating this in the first place, was that we had a very large number of connections that were idle in transaction. When a connection is in this state, it’s not really doing work, but it also can’t be used for something else, so it sorta represents a missed opportunity, especially when you start brushing up against the connection limits of the database4. It’s ok to idle in transaction for short periods of time – like, if you’re doing some processing on a database record before writing it back – but it’s less ideal to be idle in transaction whilst making slow HTTP requests or otherwise waiting for something out of your control, because you’re effectively restricting the degree to which your system can scale based on the amount of time you spend performing these slower tasks.
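To make that concrete, here’s a rough sketch of the anti-pattern, written with a SQLAlchemy session and the example Zebra model we’ll set up later in the post (the sleep stands in for a slow HTTP request):

import time

session = Session()

# This query opens a transaction (SQLAlchemy emits BEGIN + SELECT)...
zebra = session.query(Zebra).first()

# ...and the connection now sits 'idle in transaction' for the whole of
# this slow, non-database work, tying up a connection slot the entire time.
time.sleep(30)  # stand-in for a slow HTTP request

session.commit()  # only now does the transaction close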
SQLAlchemy doesn’t directly expose transaction state to API users, but it’s clear that in order to minimise the amount of time we’re wasting in this state, we need to understand what’s happening.
Still with me? Cool 🥳. Time to introduce the tools we’re going to use to figure out SQLAlchemy’s use of database transactions.
Tools for Understanding Database Performance
We’ll be using two main tools for understanding what’s going on in the database:
- The pg_stat_activity view, which gives us insight into what the different connections to our database are doing at any point in time, and
- Verbose SQL logging in SQLAlchemy.
pg_stat_activity, the Swiss Army Knife of Postgres Connection Tracking
Let’s start with Postgres’ pg_stat_activity view. This lists some critical info about every connection to the database, and what they’re all up to.
Assuming you’ve got Postgres installed, fire up a new terminal and log into Postgres:
psql -x # -x chooses a nicer display format for this task
Now, run the following query:
SELECT
backend_start,
xact_start,
query_start,
state_change,
state,
substring(query for 50) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend';
This query selects every connection that’s running on the current system. Note that WHERE backend_type = 'client backend' excludes automated processes, which are critical for Postgres’ normal operation, and that we’re only including a subset of the columns available in the pg_stat_activity view – the others are useful, but aren’t as relevant for what we’ll be doing today.
You should see one record indicating the query we literally just ran:
-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 12:46:24.080431+01
xact_start | 2019-12-15 12:46:25.100934+01
query_start | 2019-12-15 12:46:25.100934+01
state_change | 2019-12-15 12:46:25.100936+01
state | active
query | SELECT +
| backend_start, +
| xact_start, +
| quer
To make it easier to read results in future, let’s exclude our own diagnostic query from the results as we run it:
SELECT
backend_start,
xact_start,
query_start,
state_change,
state,
substring(query for 50) AS query
FROM pg_stat_activity
WHERE
backend_type = 'client backend' AND
pid != pg_backend_pid(); -- ✨ new! ✨
The new WHERE condition references the connection’s pid. The pid, or process ID, is a number referring to the unique Postgres ‘backend process’ running the query. There’s one pid per database connection, and one for each of the automated processes we’re filtering out with backend_type = 'client backend'. If you run that query now, the results should come back empty, because we’ve got no other connections to the database.
Keep this database shell open! Throughout this article, we’ll run this exact query again and again to inspect the state of our connections from the perspective of the database.
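If you’d rather poll this from Python than keep a psql shell handy, a small helper along these lines works too. This is just a sketch: it assumes the same DATABASE_URL environment variable used later in the post, and it opens its own monitoring connection, which the pg_backend_pid() condition filters back out of the results:

import os

import sqlalchemy
from sqlalchemy import text

# A separate engine just for monitoring, so we don't disturb the
# connections we're trying to observe.
monitor_engine = sqlalchemy.create_engine(os.environ['DATABASE_URL'])

ACTIVITY_QUERY = text("""
    SELECT backend_start, xact_start, query_start, state_change, state,
           substring(query for 50) AS query
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
      AND pid != pg_backend_pid()
""")

def show_connections():
    with monitor_engine.connect() as conn:
        for row in conn.execute(ACTIVITY_QUERY):
            print(row.state, '|', row.query)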
Turn on SQLAlchemy statement logging
Tool number two is verbose logging of the SQL emitted by SQLAlchemy. Doing this allows us to see what’s actually happening under the hood in SQLAlchemy without having to read and understand all the SQLAlchemy source code5.
The fastest way to turn on statement logging is to add echo=True when you instantiate the SQLAlchemy engine:
engine = sqlalchemy.create_engine(os.environ['DATABASE_URL'], echo=True)
Session = orm.sessionmaker(bind=engine)
…but it’s also possible to do this with logging configuration. In this example repo, I’ve configured SQL output using the logging mechanism, so I can customise the formatting to my liking.
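For reference, a minimal version of the logging-based approach looks something like this – SQLAlchemy’s sqlalchemy.engine logger emits each statement at INFO level, so turning that logger up has the same effect as echo=True, with the formatting under your control:

import logging

# Route SQLAlchemy's statement logging through the standard logging module,
# so the format and destination can be customised.
logging.basicConfig(format='%(asctime)s %(message)s')
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)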
Digging in
Let’s write some code and run it line-by-line to see what it does. Run the following in an ipython shell or Jupyter notebook, and run our pg_stat_activity query after each step. You won’t get output after every line of python, but I’ve paused to show output when anything has changed.
import os

import sqlalchemy
from sqlalchemy import orm

# Configure a connection to the database at the URL specified by the
# DATABASE_URL environment variable.
# Remember that we're using `echo=True` so we can see all generated SQL.
engine = sqlalchemy.create_engine(os.environ['DATABASE_URL'], echo=True)
# Create a session factory. Calling `Session()` will create new SQLAlchemy
# ORM sessions.
Session = orm.sessionmaker(bind=engine)
# Create a new session which we'll use for the following investigation.
session = Session()
Even after creating a new session, there still hasn’t been any SQL emitted. Let’s keep going:
q = session.query(Zebra).filter(Zebra.when_born <= now() - 2*YEAR)
old_zebra = q.first()
select version()
{}
select current_schema()
{}
SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
{}
SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
{}
show standard_conforming_strings
{}
BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
WHERE zebras.when_born <= %(when_born_1)s
LIMIT %(param_1)s
{'when_born_1': datetime.datetime(2017, 12, 15, 13, 55, 24, 401458), 'param_1': 1}
Wow, that’s a lot of SQL statements for a single query! Let’s walk through them:
- The first 5 statements are sanity and compatibility checks – to check that the database environment matches what the SQLAlchemy library authors were expecting when they wrote the code. These checks run just once per engine6.
- The sixth query is a BEGIN, which opens up a new transaction before SQLAlchemy attempts to load any data.
- The seventh actually loads the thing that we wanted to load.
Let’s run our pg_stat_activity query now to see what the database status is:
-[ RECORD 1 ]-+---------------------------------------------------
backend_start | 2019-12-15 12:55:24.40754+01
xact_start | 2019-12-15 12:55:24.422549+01
query_start | 2019-12-15 12:55:24.422586+01
state_change | 2019-12-15 12:55:24.423075+01
state | idle in transaction
query | SELECT zebras.id AS zebras_id, zebras.name AS zebr
There’s a wealth of information in this record. Let’s run through it quickly:
- The connection to the database was made at 12:55:24.40754+01 (backend_start)
- About 15 ms later, a transaction was opened (xact_start)
- The most recent query, SELECT zebras.id AS zebras_id, zebras.name AS zebr (query), was started about 0.04 ms after that (query_start)
- The most recent query took 0.5 ms to run, because at 12:55:24.423075+01 the state changed back from active to idle in transaction (state_change)
- The transaction is still open, and the database is waiting for our connection to do something or to close out the transaction (state is idle in transaction).
Let’s keep writing python code, and get a watering hole for our Zebra:
wh = session.query(WateringHole).first()
You’ll see that this time, SQLAlchemy only emits the relevant statement, because we’ve already run the sanity checks (once per engine) and we’re already in a transaction:
SELECT watering_holes.id AS watering_holes_id, watering_holes.name AS watering_holes_name, watering_holes.latitude AS watering_holes_latitude, watering_holes.longitude AS watering_holes_longitude
FROM watering_holes
LIMIT %(param_1)s
{'param_1': 1}
If we run our pg_stat_activity query again, we’ll get something like:
-[ RECORD 1 ]-+---------------------------------------------------
backend_start | 2019-12-15 12:55:24.40754+01
xact_start | 2019-12-15 12:55:24.422549+01
query_start | 2019-12-15 12:57:18.384988+01
state_change | 2019-12-15 12:57:18.385397+01
state | idle in transaction
query | SELECT watering_holes.id AS watering_holes_id, wat
Note that:
- We’re using the same connection and the same transaction, so backend_start and xact_start haven’t changed
- query_start, state_change, and query have all changed to reflect the new query7. They’re about two minutes after the backend_start and xact_start because I ran this part of the example two minutes after running the first part.
Let’s wrap it up and roll back the transaction:
session.rollback()
ROLLBACK
-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 12:55:24.40754+01
xact_start |
query_start | 2019-12-15 12:58:04.490102+01
state_change | 2019-12-15 12:58:04.490205+01
state | idle
query | ROLLBACK
The state is now idle, which means the connection’s literally doing nothing. It’s not in a transaction (note that xact_start is blank now), and it’s not running a query – it’s just waiting to do work.
Calling session.commit() instead of session.rollback() looks pretty much the same. For completeness:
session = Session()
# Grab any ol' Zebra
print(session.query(Zebra).first())
session.commit()
BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
LIMIT %(param_1)s
{'param_1': 1}
Zebra(name='Stevie')
COMMIT
-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 12:55:24.40754+01
xact_start |
query_start | 2019-12-15 13:00:30.617891+01
state_change | 2019-12-15 13:00:30.617905+01
state | idle
query | COMMIT
Finding loads in all the wrong places
Things kinda make sense so far, right? We’ve now determined:
- A SQLAlchemy session opens a transaction only when it first needs access to the database, and
- It closes the transaction when you rollback() or commit().
Things get slightly hairier when you continue to work with objects after these events, though. Let’s load up an object and close out the transaction:
session = Session()
# Grab any ol' Zebra
zebra = session.query(Zebra).first()
session.commit()
BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
LIMIT %(param_1)s
{'param_1': 1}
COMMIT
Cool – the SQL indicates that the transaction’s been committed (you can verify that with the pg_stat_activity query, if you want).
Now, let’s try saying hello to that zebra:
print(f'Hello, {zebra.name}!')
BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
WHERE zebras.id = %(param_1)s
{'param_1': 1}
Hello, Stevie!
Wait, that first line is a BEGIN! Did SQLAlchemy really open another transaction just to get data we already know about?
Let’s run pg_stat_activity to make sure:
-[ RECORD 1 ]-+---------------------------------------------------
backend_start | 2019-12-15 13:02:08.225582+01
xact_start | 2019-12-15 13:02:08.335367+01
query_start | 2019-12-15 13:02:08.335425+01
state_change | 2019-12-15 13:02:08.335628+01
state | idle in transaction
query | SELECT zebras.id AS zebras_id, zebras.name AS zebr
… yep, the connection is idle in transaction. SQLAlchemy’s default behaviour is that when you access a property after the transaction has ended (i.e. through a commit or a rollback), the object gets reloaded from the database in a new transaction. This was really surprising to me – I wouldn’t have expected a simple property access to load data from the database!
There’s an explanation of the rationale for that behaviour in that link, but for now, let’s talk about what to do if you don’t want a new transaction to open automatically.
Use expunge to detach objects from the session
If you want to access the data without opening a new transaction, you can expunge the object from the session. This detaches an instance from the given session, which means that its properties won’t be reloaded after commit:
session = Session()
zebra = session.query(Zebra).first()
print(f'Hello, {zebra.name}!')
session.expunge(zebra)
session.commit()
print(f'{zebra.name}, are you still there?')
Observe that there’s no BEGIN after accessing the Zebra’s properties…
BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
LIMIT %(param_1)s
{'param_1': 1}
Hello, Stevie!
COMMIT
Stevie, are you still there?
…and that our printout of SQL connection usage indicates that the connection is still idle (and not idle in transaction):
-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 13:10:31.69101+01
xact_start |
query_start | 2019-12-15 13:12:07.608321+01
state_change | 2019-12-15 13:12:07.608364+01
state | idle
query | COMMIT
Note that you have to be careful when you’re expunging stuff. For example, changes we make to our Zebra buddy won’t automatically stick after we’ve expunged them from the session8:
# 'expunge' is the action, 'detached' is the state
assert inspect(zebra).detached
zebra.number_of_stripes = 63
session.commit()
other_session = Session()
# Reload the zebra in a different session
zr = other_session.query(Zebra).filter(Zebra.id == zebra.id).one()
print(f'{zr.name} has {zr.number_of_stripes} stripes')
BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
WHERE zebras.id = %(id_1)s
{'id_1': 1}
Stevie has None stripes
Similarly, if you’re inserting objects, make sure to session.flush() the object to the database before expunging it:
session = Session()
janet = Zebra(name='Janet', when_born=parser.parse('2018-06-05'))
session.add(janet)
session.flush()
# ID should have been set when we flushed Janet to the database,
# which indicates that she was correctly saved.
assert janet.id
session.expunge(janet)
session.commit()
print(f"Hi {janet.name}, we're glad you're still here!")
BEGIN (implicit)
INSERT INTO zebras (name, when_born, when_stripes_counted, number_of_stripes) VALUES (%(name)s, %(when_born)s, %(when_stripes_counted)s, %(number_of_stripes)s) RETURNING zebras.id
{'name': 'Janet', 'when_born': datetime.datetime(2018, 6, 5, 0, 0), 'when_stripes_counted': None, 'number_of_stripes': None}
COMMIT
Hi Janet, we're glad you're still here!
Use close() to enforce that you’re definitely not in a transaction
Now that expunge() has given us the ability to use objects outside of a session context, we might want to enforce that transactions aren’t accidentally opened by querying object state. Calling close() means that under no circumstances will another database transaction be opened to reload attributes, but the semantics of what this means for objects are subtle. In the following two examples, I’ll load two Zebras, detach one and keep the other one in the session:
session = Session()
[za, zb] = session.query(Zebra).limit(2)
session.expunge(za)
If we close() the session without calling rollback() or commit(), then, per the docs, everything is expunged from the session:
session = Session()
[za, zb] = session.query(Zebra).limit(2)
session.close()
assert inspect(za).detached
assert inspect(zb).detached
print(f"Hi {za.name}, we're glad you're still here!")
print(f"Hi {zb.name}, we're glad you're here too!")
BEGIN (implicit)
SELECT zebras.id AS zebras_id, zebras.name AS zebras_name, zebras.when_born AS zebras_when_born, zebras.when_stripes_counted AS zebras_when_stripes_counted, zebras.number_of_stripes AS zebras_number_of_stripes
FROM zebras
LIMIT %(param_1)s
{'param_1': 2}
ROLLBACK
Hi Stevie, we're glad you're still here!
Hi Anya, we're glad you're here too!
On the other hand, if you explicitly commit() or rollback() before calling close(), the objects get expired before they get detached, and so any subsequent attempts to load will cause failures:
session = Session()
[za, zb] = session.query(Zebra).limit(2)
# Expunge one explicitly before session close
session.expunge(za)
# Or rollback, both have the same effect
session.commit()
session.close()
# Should work, was detached before commit()
print(f"Hi {za.name}, we're glad you're still here!")
# Will raise, was detached after being expired
print(f"Hi {zb.name}, we're glad you're here too!")
[sql omitted]
Hi Stevie, we're glad you're still here!
Traceback (most recent call last):
File "/Users/fabian/Code/sqlalchemy-example/utils.py", line 36, in step
yield
File "/Users/fabian/Code/sqlalchemy-example/002_loads_in_all_the_wrong_places.py", line 85, in <module>
print(f"Hi {zb.name}, we're glad you're here too!")
File "/Users/fabian/.local/share/virtualenvs/sqlalchemy-example-uvKGK8dZ/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py", line 282, in __get__
return self.impl.get(instance_state(instance), dict_)
File "/Users/fabian/.local/share/virtualenvs/sqlalchemy-example-uvKGK8dZ/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py", line 705, in get
value = state._load_expired(state, passive)
File "/Users/fabian/.local/share/virtualenvs/sqlalchemy-example-uvKGK8dZ/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 660, in _load_expired
self.manager.deferred_scalar_loader(self, toload)
File "/Users/fabian/.local/share/virtualenvs/sqlalchemy-example-uvKGK8dZ/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 913, in load_scalar_attributes
"attribute refresh operation cannot proceed" % (state_str(state))
sqlalchemy.orm.exc.DetachedInstanceError: Instance <Zebra at 0x10d91d5c0> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: http://sqlalche.me/e/bhk3)
Our first Zebra pal was able to be greeted because they’d previously been detached through the explicit expunge() call. Calling session.commit() expired the stored info about our second Zebra pal, who was still associated with the session. Then, when SQLAlchemy went to refresh the info about our second zebra from the database, it raised an error because the instance is no longer bound to a session, and thus there’s nothing that expired data can be reloaded from.
Observe, however, that no transaction was re-opened as part of this attempt, and so our connection state is again idle:
-[ RECORD 1 ]-+------------------------------
backend_start | 2019-12-15 13:15:51.734891+01
xact_start |
query_start | 2019-12-15 13:16:33.850546+01
state_change | 2019-12-15 13:16:33.850578+01
state | idle
query | COMMIT
Wrapping up
Alright, that’s it for now. Hopefully this post has given you the tools you need to keep digging if you want to keep exploring 🕵🏽♀️
Here’s what we covered:
- pg_stat_activity is wonderfully useful for understanding what every connection is doing in a Postgres database
- Turning on logging at a lower level of the stack is a great way to understand what’s going on at higher levels
- SQLAlchemy opens database transactions the first time you attempt to load from a Session object, and closes them when you commit() or rollback()
- Closing a transaction with commit() or rollback() will cause all associated objects to be expired, and reload their properties on the next access
- You can expunge() objects from the session so that you can continue to use them outside of a database transaction
- You can call Session.close() to guarantee that objects loaded within a session won’t reopen transactions, or otherwise access the database anymore.
Now that we’ve got a good picture of how sessions interact with database transactions, we’re well-positioned to build systems that make it easy to reason about transactions! That’ll require some careful thought and another few thousand words, though, so we’ll come back to it in a future post.
As always, I’d love to hear from you with thoughts or feedback.
See Transactions in the Postgres docs for a great explanation of the how / why here. ↩︎
There are ways of modifying what’s visible from one transaction to another – this is referred to as transaction isolation. The fact that people have thought these details through so thoroughly and that it all works so well strikes me as a significant feat of engineering. I’d taken this for granted for so long, but it’s a hard problem both from a design and a technological perspective, and it’s amazing to me that we’ve got off-the-shelf technology to do it at scale now. ↩︎
There’s actually five, but for the sake of simplicity, I’m going to say a lot of very reductionistic things in this article. This is a high-level primer to a relatively complex topic, so please bear with me 🐻 ↩︎
I really hoped I’d be able to fit a discussion about connection limits into this article but I think it’s slightly too off topic, and there’s so much good interesting stuff to talk about there that I’d have a hard time stopping 😅 The tl;dr: many cloud database providers set hard limits on the number of concurrent connections for your database. It turns out that there are some real good reasons for this – notably that the performance of some postgres subsystems is linear in the number of connections, and so at a certain point, total system throughput actually decreases compared to running the extra work sequentially. Brandur has an excellent writeup about this, and there’s lots of other good info in the Postgres Wiki. ↩︎
Note that we could also see all SQL queries by enabling and reading Postgres query logs. For our purposes, it’s more convenient to have SQLAlchemy emit these queries so that we can also print things from the Python code to see exactly when each query is being run, but query logs are more accurate because they don’t rely upon the SQLAlchemy authors having added a log statement everywhere a query is run. Using query logs would be a great last resort if the results I was seeing from SQLAlchemy weren’t making any sense. ↩︎
I discovered this by searching for the string ‘test plain returns’ in the SQLAlchemy code, and then using PyCharm’s excellent code navigation features to discover the places where that method was called. ↩︎
Why has state_change updated, even though the state is the same? When SQLAlchemy ran our query, we briefly transitioned to the ‘active’ state and then back again to the ‘idle in transaction’ state. So we’ve actually seen two state changes, and the state_change variable simply contains the timestamp at which the most recent one occurred. ↩︎

You’re able to merge detached instances back into sessions, but merging is complex and comes with a lot of warnings. I’m avoiding writing about it because it’s scary and nuanced and I’m praying I never have to learn how it works 😅 but the docs for it seem pretty solid, at least. ↩︎