Intro to Optimizing Database Performance in Python Applications

Introduction

I am sure we have all experienced the joy of a Google page that won't load. Or a Zoom call that takes forever to connect. Or supposedly having a 5G connection and still not being able to load an Instagram page. If you couldn't tell, I get pretty frustrated. Just like a slow-loading application can be incredibly annoying for users, slow database interactions can drag down a Python application's overall speed and responsiveness.

One author describes database performance as "the optimization of resource use to increase throughput and minimize contention, enabling the largest possible workload to be processed." Wait, what? Let's break that down a little more.

Workload can be thought of as the demand placed on the database. Have you ever waited for Taylor Swift tickets to drop, only for the webpage to suddenly become incredibly laggy or even break? That is the result of incredibly high demand at that moment. Workload fluctuates over time and can sometimes be unpredictable.

But at a baseline, what is our database capable of handling? How much data can the database process in a given amount of time? That is throughput. Throughput goes hand in hand with resources: the hardware and software available to the system that the database uses to process data and requests.

Finally, contention is when multiple components of the workload try to use a single resource at the same time. As workload increases, so does the chance of contention, which decreases throughput.

So all together, database performance is trying to use resources in the most efficient way to increase the database's capability to swiftly process large amounts of data, while preventing conflicts.

In the context of Python, this article will provide an introduction to thinking about best practices and factors to consider when trying to optimize database performance.

Choosing the Right Database

Which Database Management System (DBMS) will best support the features you want to implement and the performance characteristics you need? Should you choose SQL or NoSQL? What is the best way to communicate with your DBMS?

Hold on, what even is a DBMS? When I first read about improving database performance, articles would consistently reference the DBMS, an acronym for a concept I rarely, if ever, had thought about. So let's break this concept down into an analogy about making dinner.

Let's say you find a super cool new recipe you want to try out. That is your database. The source of knowledge that you need if you want to make anything. Then you check the ingredients list and the kitchen tools you will need. If you want to cook the dish using the recipe, you must get the right ingredients and use the right tools. In the same way, if you want to access your database (which you do), you must choose the right database management system. A DBMS will handle tasks like creating, modifying, and organizing data in the database. Just like how different ingredients can lead to different dishes, the DBMS you select will change the way you handle various tasks.

So what are your options? At a very basic level, DBMSs can be categorized into two groups: SQL and NoSQL. As the names suggest, SQL DBMSs like MySQL, PostgreSQL, and SQLite communicate with the database using SQL (Structured Query Language). SQL allows a developer to interact with the DBMS by writing queries to retrieve, insert, update, or delete data. SQL is generally the more appropriate choice when your data is highly relational. Relational databases store your data in tables organized by rows and columns, and tables can be linked or joined together through shared key columns.
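To make "linked or joined tables" concrete, here is a small hypothetical sketch using Python's built-in sqlite3 module: each row in a restaurants table references a row in a cuisines table through a cuisine_id key, and a JOIN reassembles the related rows. All table names and data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cuisines (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE restaurants (
        id INTEGER PRIMARY KEY,
        name TEXT,
        cuisine_id INTEGER REFERENCES cuisines(id)
    );
    INSERT INTO cuisines (id, name) VALUES (1, 'Japanese'), (2, 'Mexican');
    INSERT INTO restaurants (name, cuisine_id) VALUES
        ('Noodle Bar', 1), ('Taco Spot', 2), ('Ramen House', 1);
""")

# join the two tables on their shared key
rows = conn.execute("""
    SELECT restaurants.name, cuisines.name
    FROM restaurants
    JOIN cuisines ON restaurants.cuisine_id = cuisines.id
    WHERE cuisines.name = 'Japanese'
""").fetchall()
print(rows)  # the two Japanese restaurants, each paired with its cuisine
conn.close()
```

Because the cuisine lives in its own table, renaming "Japanese" is a one-row update instead of a change to every restaurant row, which is the kind of consistency relational schemas are good at.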

A NoSQL database is best used when you have unstructured or semi-structured data. It allows for a flexible schema and can be scaled quickly to manage large amounts of data. This article will not go in-depth on SQL versus NoSQL databases, but plenty of resources cover the differences in detail.

Choosing the right database can significantly impact the speed and performance of an application. As we saw above, depending on the structure of your data (relational or not), you want to choose a database that suits that data so it can efficiently store and retrieve information.

In the next few sections, we will go over other important factors to consider when optimizing database performance. This article will focus more on SQL databases.

Efficient Querying

In its simplest form, a SQL query is an instruction to the database to retrieve data. Let's go back to our dinner example, except this time, you are ordering the food. Efficient querying is like giving a well-organized, clear order to the chef: by including only the essential information needed to get the dish you want, you ensure the fastest and most accurate preparation. Similarly, in a database, writing efficient SQL queries speeds up data retrieval and increases throughput by making good use of resources.

Not only does efficient querying streamline the backend, but it also contributes to a smoother and faster experience for users interacting with your Python application.
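As a concrete sketch using Python's built-in sqlite3 module, compare pulling an entire table into Python and filtering there with asking the database for only the rows and columns you need. The orders table and its data below are made up for illustration.

```python
import sqlite3

# build a small in-memory table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, dish TEXT, price REAL)")
conn.executemany(
    "INSERT INTO orders (dish, price) VALUES (?, ?)",
    [("ramen", 12.0), ("sushi", 18.5), ("curry", 11.0)],
)

# less efficient: fetch every column of every row, then filter in Python
all_rows = conn.execute("SELECT * FROM orders").fetchall()
cheap = [row[1] for row in all_rows if row[2] < 15]

# more efficient: let the database do the filtering and only return the column we need
cheap_fast = [row[0] for row in conn.execute(
    "SELECT dish FROM orders WHERE price < 15"
)]

print(cheap, cheap_fast)  # ['ramen', 'curry'] ['ramen', 'curry']
conn.close()
```

On a three-row table the difference is invisible, but on millions of rows the first version ships far more data over the connection and burns CPU inside the application, while the second lets the DBMS use its own resources (and any indexes) to do the work.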

Indexing Basics

In a database, indexing can increase the speed of querying by creating flags to point out where data is stored. Like a table of contents in a book, it allows the user to quickly find and retrieve data.

In SQL, an index helps the database find data based on specific columns. Without an index, the database would need to scan the entire table to find the data a query requests. A database index instead keeps the values of a column in sorted order, along with references to the rows where each value lives. When you execute a query, the DBMS can use this roadmap to quickly navigate the indexed columns and grab the information from the relevant rows.

So why not use indexes everywhere? Although indexes are a great way to speed up queries, too many indexes on a table increase storage requirements and can reduce the throughput your database can handle. Furthermore, every insert, update, or delete on the table also forces its indexes to be adjusted, adding overhead to write operations.

As far as implementing indexes in SQL, the code is relatively simple. Let's say you have a table called restaurants with the following columns: id, name, location, and rating. You want to improve the speed of queries that involve searching for restaurants based on their location. See below for a basic example in SQL:

CREATE INDEX idx_location ON restaurants(location);
  • CREATE INDEX is the SQL statement used to create the index

  • idx_location is the name given to the index

  • ON restaurants(location) specifies the table and column where the index should be created

The DBMS will manage the index, keeping it updated as the data in the table changes. As your data continues to change, it is also important that you monitor and maintain the indexes you have created to make sure they are improving query performance in the most efficient way possible. In conclusion, as long as indexes are used appropriately, they are another great tool to optimize database performance.
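We can actually watch the DBMS switch strategies once an index exists. Below is a small sketch using Python's sqlite3 module and a restaurants table with made-up rows; SQLite's EXPLAIN QUERY PLAN statement reports whether a query will scan the whole table or search an index. The exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT, location TEXT, rating REAL)"
)
conn.executemany(
    "INSERT INTO restaurants (name, location, rating) VALUES (?, ?, ?)",
    [("Noodle Bar", "Brooklyn", 4.5), ("Taco Spot", "Queens", 4.2)],
)

query = "SELECT name FROM restaurants WHERE location = ?"

# without an index, SQLite plans a full scan of the table
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, ("Brooklyn",)).fetchone()[-1]
print(plan_before)  # e.g. "SCAN restaurants"

# after creating the index, the planner searches it instead
conn.execute("CREATE INDEX idx_location ON restaurants(location)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("Brooklyn",)).fetchone()[-1]
print(plan_after)  # e.g. "SEARCH restaurants USING INDEX idx_location (location=?)"

conn.close()
```

Checking query plans like this is also how you verify that an index you created is actually being used, which is part of the monitoring and maintenance mentioned above.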

Caching for Performance

A cache is a specialized storage layer that holds a subset of data, enabling users to retrieve information much faster than accessing the primary storage location. Sticking with our cooking analogy, imagine you love eating Shin Ramen and you eat it multiple times per week. Going to the store every time you want to make ramen would be extremely annoying and time-consuming. But what if you kept ramen packs readily available at home? Then you could expedite the cooking process and reduce the number of times you had to travel to the store.

In the world of databases, a cache is your cabinet full of ramen. Caching stores copies of frequently accessed or computationally expensive data in a high-speed, easily accessible storage layer. When you make a SQL query, the system first checks the cache for the requested information. If the data is there, it can be returned quickly without querying the database itself, a process that takes much more time even with indexing. Altogether, this reduces the workload on the database and the stress on its resources, resulting in a smoother, more responsive application for the end user.

As with all great things, there is a catch. Caching is best used for frequently accessed data that is relatively static, or data that would otherwise be resource-intensive to retrieve; if the underlying data changes often, cached copies go stale and must be invalidated or expired. It is also important to balance how much data is stored in the cache: too little and you aren't maximizing its potential; too much and you will slow down its performance. In conclusion, caching is another useful tool that can be used alongside efficient querying and indexing to strategically store data and optimize retrieval.
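As a minimal, hypothetical sketch of caching on the application side, Python's built-in functools.lru_cache can memoize query results in memory (production systems often reach for a dedicated cache such as Redis or Memcached instead). The table, rows, and function name below are invented for illustration.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (name TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?)",
    [("Noodle Bar", "Brooklyn"), ("Taco Spot", "Queens"), ("Ramen House", "Brooklyn")],
)

db_hits = 0  # track how often we actually query the database

@lru_cache(maxsize=32)  # maxsize bounds the cache, like a cabinet with finite shelves
def restaurants_in(location):
    global db_hits
    db_hits += 1
    rows = conn.execute(
        "SELECT name FROM restaurants WHERE location = ?", (location,)
    ).fetchall()
    return tuple(name for (name,) in rows)

print(restaurants_in("Brooklyn"))  # queries the database
print(restaurants_in("Brooklyn"))  # answered from the cache, no query runs
print(db_hits)  # 1
```

Note the trade-off discussed above: if a restaurant's row changes in the database, the cached result is now stale, so this pattern suits data that changes rarely or that you periodically expire (for example with restaurants_in.cache_clear()).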

Connection Pooling

Before jumping into connection pooling, let's briefly summarize what happens when an application connects to a database:

  • Open a Connection: To connect to the database, an application uses a database driver, a software component that lets the application communicate with the DBMS. First, the application requests a connection to the database through the driver.

  • Network Socket: After the request, a network socket is opened. A network socket is a way for data to be exchanged between the application and the database. The socket establishes the connection.

  • User Authentication: After the connection is established, the user needs to be authenticated, which could involve verifying user credentials like a username and password.

  • Database Operations: Now that the user has been authenticated, the application can perform database operations like various querying commands.

  • Network Socket is Closed: When the application has received the data it requested, the connection can be closed, and the associated network socket closes with it, ending the communication channel.

If we use Python's sqlite3 module to connect to an SQLite database as an example, the steps map to the following commands:

import sqlite3

# establish (request) a connection to the database;
# .connect() is the connection request, and the sqlite3 module itself is the driver
conn = sqlite3.connect('example.db')

# sqlite3 doesn't have a traditional authentication step;
# with a client-server DBMS, credentials would be checked here
# after the connection is established

# .cursor() gives you a cursor object,
# which allows you to execute SQL against the database
cursor = conn.cursor()

# execute a SQL command (any query would do here)
cursor.execute("CREATE TABLE IF NOT EXISTS restaurants (id INTEGER PRIMARY KEY, name TEXT)")

# commit the changes to the database
conn.commit()

# close the connection; the communication channel (for SQLite, a file handle
# rather than a true network socket) is closed
conn.close()

Already you can see this is a somewhat extensive operation that could reduce the efficiency of your application if you need to make multiple connections at once.

Connection pooling is a tool used to manage and reuse database connections. Instead of opening and closing a new connection for each operation, connection pooling maintains a "pool" of established connections. Let's relate it to our food analogy. Think of your top 3 restaurants. Each of these restaurants represents a database connection. Instead of dialing the phone number for each restaurant every time you want to place an order (connection request), what if you kept those 3 restaurants on speed dial (connection pool)? Now you are only a click away from your favorite foods. You also wouldn't keep every restaurant you know on speed dial, since there just isn't enough space (pool size). Similarly, the connection pool has a limited size to prevent overconsumption of resources.

Bringing it back to our database, when the application wants to access the database, it can use one of the existing connections from the pool. When the operation is completed, instead of closing the connection it is returned to the pool. This technique can reduce the costly operation of opening and closing connections, improving performance and resource efficiency.

How can we create a connection pool? Using SQLAlchemy, I created a very basic example of how you might establish a connection pool.

from sqlalchemy import create_engine, text

# a file-based SQLite URL as a stand-in; swap in your own database URL
# (with SQLAlchemy 2.x, file-based SQLite gets a QueuePool, so pool_size applies)
database_url = 'sqlite:///example.db'

# create a connection pool with a maximum of 5 persistent connections
engine = create_engine(database_url, pool_size=5, max_overflow=10)

# get a connection from the pool
connection = engine.connect()

# use the connection for some operation
result = connection.execute(text("SELECT 1"))

# release the connection back to the pool (it stays open for reuse)
connection.close()
  • The pool_size parameter specifies how many persistent connections are kept in the pool

  • The max_overflow parameter specifies how many additional connections can be opened temporarily once pool_size is exhausted

Keeping a connection open also has resource costs, which is why it is important to weigh several factors when deciding to use a connection pool, including how frequently the application talks to the database and what those interactions look like (many short queries versus a few long-running ones).

Conclusion

When we think of optimizing database performance, we can put ourselves back in the shoes of the chef trying to create a seamless dining experience for her customers. Choosing the right database, creating efficient queries, utilizing indexing and caching, and implementing connection pooling are all essential ingredients in this recipe.

These are just the basics of database optimization and as you progress in Python and other languages, another great topic to further explore would be how ORM frameworks like SQLAlchemy can help simplify and optimize database interactions.