Working with huge data sets in Django

When prefetch_related() no longer cuts it

(Update in 2023: Django has fixed most of the issues this story addresses. Cursor pagination can still be useful, and so can custom Prefetch objects.)

Last week in my Django work at Lab Digital, we fixed several performance issues that plagued a client's website. When a webshop has a million products, which are imported and exported every day, this presents a huge challenge. Simple Django performance tricks no longer work.

Recap: Django’s toolbox for huge data

Update: as of Django 4.1, QuerySet.iterator() supports prefetching related objects, provided the chunk_size argument is given. In older versions, no prefetching was done. Cursor pagination can still be a smart move, though.

When a large amount of data is retrieved, queryset.iterator() ensures Django neither caches nor loads all results into memory. This reduces memory consumption by processing one record at a time.
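
A minimal sketch of that pattern, assuming the Product model used throughout this article (handle() is a hypothetical stand-in for your own per-record logic):

for product in Product.objects.all().iterator():
    # Process one record at a time; nothing accumulates
    # in the queryset's internal result cache.
    handle(product)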

This optimisation conflicts with another tool: prefetch_related(). That feature allows Django to resolve related records in a single extra query. Instead of making a query per record (1M in our case), just one extra query is made. To achieve this, however, Django needs to inspect all the results beforehand.
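
To see the difference, a quick sketch using this article's models:

# Without prefetching: one extra query per product (N+1 queries total).
for product in Product.objects.all():
    list(product.stockrecords.all())  # hits the database on every iteration

# With prefetching: two queries in total, regardless of result size.
for product in Product.objects.prefetch_related("stockrecords"):
    list(product.stockrecords.all())  # served from the prefetch cache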

Thus, it’s not possible to use both queryset.iterator() and prefetch_related() together.

Even if it were possible, it wouldn't be efficient on 1M records…

Optimisation #1: Prefetching in batches

Since prefetch_related() depends on walking through a result set first, it would be tempting to fetch the data in batches of 1000 records and perform a prefetch on each batch. Unfortunately, that puts a lot of strain on the database: it has to repeat the same query over and over, throwing away all preceding results just to fetch, say, records 984000–985000.
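
For illustration: naive batching via queryset slicing compiles to OFFSET/LIMIT, while cursor (keyset) pagination filters on the last seen key instead (last_seen_id is a hypothetical variable holding the previous batch's final primary key):

# Slicing compiles to "... LIMIT 1000 OFFSET 984000": the database
# must still walk past and discard the first 984000 rows, every batch.
batch = Product.objects.order_by("id")[984000:985000]

# Keyset/cursor style: filter on the last seen primary key instead,
# which an index on "id" can satisfy directly.
batch = Product.objects.order_by("id").filter(id__gt=last_seen_id)[:1000]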

With the help of cursor pagination, everything comes together. Cursor pagination only fetches the next 1000 records after a previously known point. There is even a Django package for it: django-cursor-pagination.

from cursor_pagination import CursorPaginator

def chunked_queryset_iterator(queryset, size, *, ordering=('id',)):
    """Split a queryset into chunks.

    This can be used instead of ``queryset.iterator()``,
    so ``.prefetch_related()`` also works.

    .. note::
        The ordering must uniquely identify the object,
        and be in the same order (ASC/DESC).
    """
    pager = CursorPaginator(queryset, ordering)
    after = None
    while True:
        page = pager.page(after=after, first=size)
        if page:
            yield from page.items
        else:
            return
        if not page.has_next:
            break
        # take last item, next page starts after this.
        after = pager.cursor(instance=page[-1])

Now results can be processed again in batches:

products = Product.objects.prefetch_related("stockrecords", "categories")
for product in chunked_queryset_iterator(products, 1000):
    print("Product:", product.default_stockrecord)
    print("Categories:", [c.id for c in product.categories.all()])

Note: as of Django 4.1, this can simply be written as:

for product in (
    Product.objects
    .prefetch_related("stockrecords", "categories")
    .iterator(chunk_size=1000)
):
    print("Product:", product.default_stockrecord)
    print("Categories:", [c.id for c in product.categories.all()])

Optimisation #2: Use the prefetch cache

Update: this behavior has been fixed in Django 2.0.

When using queryset.first(), Django performs a query, even when the data was prefetched. Using queryset.all()[0], however, reads the prefetch cache. This leads to the following extra optimisation, moving away from the standard django-oscar logic:

from django.utils.functional import cached_property

class Product:
    ...

    @cached_property
    def default_stockrecord(self):
        try:
            # By using .all()[0] the prefetch cache is read,
            # using .first() ignores that. All our products
            # have one stockrecord anyway.
            return self.stockrecords.all()[0]
        except IndexError:
            return None

Optimisation #3: Prefetch only the fields that are needed

Django’s Prefetch object helps to adjust which data is retrieved from the database. When queryset.values_list() no longer suffices, consider this:

from django.db.models import Prefetch

Product.objects.prefetch_related(
    "stockrecords",
    Prefetch("categories", queryset=Category.objects.only("id")),
)

Concluding

With a few simple tricks, our background tasks run 3 times faster than before! The metrics in New Relic also confirmed this after our latest deployment.

All this avoided the need to drop into raw SQL, rewrite our importer/exporter, or architect a completely new solution. Such solutions are still on the table, and will likely be needed eventually. They require more time, however, and would be more costly for the client.

This is the ongoing challenge: improving the site while keeping the client's future plans in mind, so we pick the right solution at the right time.

Tags used in this article:
django, python, software-development, performance