Database access optimization in Django-pre-loading related data

created at 07-17-2021 views: 2

Introduction

Django's model layer provides an ORM system, which allows us to use the database to store related data without learning SQL. It is often more efficient to obtain all the required data in one query than to obtain the data separately in multiple queries. However, because the database retrieval process of the Django model is hidden in the background, it is easy to retrieve the database multiple times without paying attention, and waste unnecessary time. Therefore, it is very important to fully understand the query mechanism of the Django model.

The official django document gives many suggestions for database access optimization: Database access optimization. These suggestions are very helpful for improving the efficiency of the code, but they have a lot of content. This article only introduces one of the optimization methods, preloading associated data.

Foreign keys and many-to-many relationships are often used in the model, but queryset will not retrieve the data of the associated object if it is not specified when obtaining the data of the object. When you call the associated object, queryset will access the database again. Therefore, when you loop through multiple objects and call the objects associated with their foreign keys, Django will keep accessing the database to obtain the data it needs.

It may be a bit abstract to say that, the following is a detailed explanation in conjunction with an example, which uses the following model.

from django.db import models
from django.contrib.auth.models import User

class Category(models.Model):
    name = models.CharField('category', max_length=16)

class Topic(models.Model):
    name = models.CharField('topic', max_length=16)

class Article(models.Model):
    'article'
    title = models.CharField('title', max_length=100)
    content = models.TextField('content')
    pub_date = models.DateTimeField('created at')
    category = models.ForeignKey(Category)
    topics = models.ManyToManyField(Topic)

class ArticleComment(BaseComment):
    'article comment'
    content = models.TextField('comment')
    article = models.ForeignKey(Article, related_name='comments')

Let's take a brief look at the query mechanism. Construct a QuerySet object through the model's Manager. The QuerySet is lazy loaded, and it always waits until the result is needed to access the database. When QuerySet accesses the database, it actually uses SQL statements to get the results. We can view SQL statements through logging and adjust the logging level to DEBUG. I have an introduction in another article: How to view the sql statements executed by Django ORM. Or check the query property print(QuerySet.query).

>>> Article.objects.all()
SELECT `blog_article`.`id`, `blog_article`.`title`, `blog_article`.`content`, `blog_article`.`pub_date`, `blog_article`.`category_id` FROM `blog_article`;

It can be seen that the general QuerySet only fetches the data in the table corresponding to the model, but does not fetch the data in the associated table. This means that only the foreign key id is obtained, not the data pointed to by the foreign key. As for the many-to-many relationship, nothing can be obtained, because the data of the many-to-many relationship is actually stored in another intermediate table.

Article and Category are associated with a foreign key, which is a many-to-one relationship. An article can only belong to one category, and one category can contain multiple articles. To get the category of an article, call the Article.category property. But because the article cached in the article is only the id of the article category, Article.category.id, rather than the complete Category object, when the category attribute of the article is used, django will access the database again to retrieve its content.

As follows, when the article category Article.category is printed with the for loop, the database will be accessed once in each loop. Moreover, the classification of articles is often repeated. The same classification may be retrieved multiple times in for. Such usage is obviously quite time-consuming.

# Visit the database once to get the Article object
for a in Article.objects.all():
     # Access the database n times, and retrieve the Category data every time it loops
     print(a.category)

Use select_related to obtain objects and related objects at one time, and only need to access the database once:

for a in Article.objects.all().select_related('category'):
     # The data has been cached, and the database will not be accessed again
     print(a.category)

Then use the query attribute to look at the SQL statement, select_related() uses JOIN to get the data of the Category model. In this way, the object associated with the foreign key is preloaded, and the database will not be accessed when the associated object is called again.

>>> Article.objects.select_related('category') # all()可以省略
SELECT `blog_article`.`id`, `blog_article`.`title`, `blog_article`.`content`, `blog_article`.`pub_date`, `blog_article`.`category_id`, `blog_category`.`id`, `blog_category`.`name` FROM `blog_article` INNER JOIN `blog_category` ON (`blog_article`.`category_id` = `blog_category`.`id`);

To get the foreign keys of foreign keys, just separate them with double underscores, and so on. For example: ArticleComment.objects.select_related('article__category') can preload the article to which the comment belongs and the category to which the article belongs at the same time.

However, in order to avoid the result set being too large due to the addition of multiple related objects, select_related is limited to obtaining single-value relationships-foreign keys and one-to-one relationships.

When preloading multiple related objects, you need to use prefetch_related, which queries each relationship separately, and then completes the connection between the related objects in Python.

Next, I will use Article and Category as examples. In the article table of the database, the category id is saved, but in the category table, the id of the subordinate article is not saved. There are two ways to get articles under a certain category:

c = Category.objects.get(id=1)
# These two methods are equivalent, you must visit the database once
c.article_set.all()
Article.objects.filter(category=c)

Let's look at the situation when searching for articles in multiple categories:

# Visit the database 1 time to get the classification
for c in Category.objects.all():
     # Visit the database n times to get articles
     c.article_set.all()

In this case, select_related cannot be used, because when there are multiple related objects, prefetch_related is required. This method will load all required associated objects into the memory, and load the objects directly from the cache every time c.article_set.all() is called.

# Visit the database twice to get categories and articles
for c in Category.objects.prefetch_related('article_set'):
     # Call the cache directly, no longer access the database
     c.article_set.all()

Why is it twice? The first step is to retrieve the classification, and the second step is to retrieve the article to which it belongs, using SELECT and IN statements to query, which is equivalent to:

>>> # prefetch_related
>>> Category.objects.prefetch_related('article_set')
SELECT `blog_category`.`id`, `blog_category`.`name`, `blog_category`.`number` FROM `blog_category`;
SELECT `blog_article`.`id`, `blog_article`.`title`, `blog_article`.`content`, `blog_article`.`pub_date`, `blog_article`.`category_id`, FROM `blog_article` WHERE `blog_article`.`category_id` IN (1, 2, 3, ...);
...
>>> # no prefetch_related
>>> c = Category.objects.all()
>>> a = Article.objects.filter(category__in=c)
>>> print(c)
SELECT `blog_category`.`id`, `blog_category`.`name`, `blog_category`.`number` FROM `blog_category`; args=()
...
>>> print(a)
SELECT `blog_article`.`id`, `blog_article`.`title`, `blog_article`.`content`, `blog_article`.`pub_date`, `blog_article`.`category_id` FROM `blog_article` WHERE `blog_article`.`category_id` IN (SELECT `blog_category`.`id` FROM `blog_category`)
...

Many-to-many relationships are also similar. Taking Article and Topic as examples, this method can also be used to preload related objects, Article.objects.prefetch_related('topics').

Prefetch-further control the preload operation

Prefetch can be used to further control the operation of preloading. For example, the following code uses Prefetch to limit the articles under the category to articles with id greater than 5:

>>> from django.db.models import Prefetch
>>> c=Category.objects.prefetch_related('article_set').get(id=2)
SELECT `blog_article`.`id`, `blog_article`.`title`, `blog_article`.`content`, `blog_article`.`pub_date`, `blog_article`.`category_id`, FROM `blog_article` WHERE `blog_article`.`category_id` IN (2);
>>> c.article_set.count() # No need to access the database
11
>>> qs=Article.objects.filter(id__gt=5)
>>> c=Category.objects.prefetch_related(Prefetch('article_set',queryset=qs)).get(id=2)
SELECT `blog_article`.`id`, `blog_article`.`title`, `blog_article`.`content`, `blog_article`.`pub_date`, `blog_article`.`category_id`, FROM `blog_article` WHERE (`blog_article`.`id` > 5 AND `blog_article`.`category_id` IN (2));
>>> c.article_set.count() # The result is different from the previous one
7

In addition, you can also use the to_attr parameter to specify the preload result as the attribute of the initial object, so that the original Manager will not be overwritten, and the attribute specified by to_attr will save the preload result in the list.

>>> c=Category.objects.prefetch_related(Prefetch('article_set',queryset=qs,to_attr='aidgt5')).get(id=2)
SELECT `blog_article`.`id`, `blog_article`.`title`, `blog_article`.`content`, `blog_article`.`pub_date`, `blog_article`.`category_id`, FROM `blog_article` WHERE (`blog_article`.`id` > 5 AND `blog_article`.`category_id` IN (2));
>>> c.article_set.count() # Execute the SQL statement, because the queryset is not cached
SELECT COUNT(*) AS `__count` FROM `blog_article` WHERE `blog_article`.`category_id` = 2;
11
>>> len(c.aidgt5) # Cached, no need to access the database
7

Preload performance comparison

Using select_related and prefetch_related can greatly reduce the number of database accesses, but how much does it improve performance? We still don't have an intuitive impression. Next, we will compare the efficiency of non-preloading and preloading through actual running code. (Actually, because I don’t know how to analyze the complexity of the algorithm, I can only compare the actual running time...)

def articles_retrieve_no_prefetch():
    for a in Article.objects.all():
        print(a.category,'\n',a.topics.all())

def articles_retrieve_prefetch():
    for a in Article.objects.all() \
    .select_related('category') \
    .prefetch_related('topics'):
        print(a.category,'\n',a.topics.all())

The above two functions respectively define a simple database query and a database query that uses preloading, and print the results.

The test method is to run the two functions 100 times respectively, and count the time spent in a single run. The specific statistical function will be put later, the hardware and software configuration will not be described in detail, and the result will be directly displayed, where the average value is the time spent in a single operation of the function, and the unit is s.

Obviously, the performance of preloading is higher.

performance comparison

Although django's model is simple and easy to use, it can't be done in a simple way. It is necessary to deeply understand the principles behind it and use query methods reasonably. Otherwise, it is easy to perform many unnecessary database accesses, causing serious performance waste.

Therefore, database access optimization is very important. This article only introduces an optimization method. For more optimization methods, please refer to the official django document: Database access optimization.

The detailed statistical functions are given below. If you are interested, you can run it in your own project.

'''run this uder Django project'''
import time, statistics

from django.utils import timezone
from blog.models import Article

def count_run_time(func):
    start=time.perf_counter()
    func()
    end=time.perf_counter()
    return end-start

def statistic_run_time(func, n):
    data = [ count_run_time(func) for i in range(n)]
    mean = statistics.mean(data)
    sd = statistics.stdev(data, xbar=mean)
    return [data, mean, sd, max(data), min(data)]

def compare_articles_retrieve_time(n):
    result1 = statistic_run_time(articles_retrieve_no_prefetch, n)
    result2 = statistic_run_time(articles_retrieve_prefetch, n)
    print('compare\t no prefetch\t prefetch')
    print('average\t',result1[1],'\t',result2[1])
    print('SD\t',result1[2],'\t',result2[2])
    print('max\t',result1[3],'\t',result2[3])
    print('min\t',result1[4],'\t',result2[4])
Please log in to leave a comment.