Django sitemap generation: a detailed explanation and optimization for the sitemap-too-big problem

created at 07-16-2021

Introduction

A sitemap lists the links of a website in one place. Many websites have deeply nested link structures that are hard for crawlers to reach; a sitemap makes those pages easier to discover and shows crawlers how the site is organized. Sitemaps are usually stored in the site's root directory under the name sitemap, where they guide crawlers and increase the chance that important content pages are indexed. A sitemap can also be a navigation page generated from the structure, framework, and content of the website. That kind of sitemap improves the user experience: it helps visitors who are lost find the page they are looking for.

Overview

A sitemap is an XML file on your website that tells search engine indexers how often your pages change and the "importance" of certain pages relative to other pages on the site. This information can help search engines index your website.

The Django sitemap framework allows you to represent this information using Python code, thereby automatically creating this XML file.
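To make the target format concrete, here is a minimal sketch that builds a one-entry sitemap with the standard library only. Django renders the real file from a template; the helper name build_sitemap and the sample entry are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org)
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for a list of dicts (hypothetical helper)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Only <loc> is mandatory; the other elements are optional hints
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    {
        "loc": "https://example.com/post/1/",
        "lastmod": "2021-07-16",
        "changefreq": "weekly",
        "priority": 0.5,
    },
])
print(xml_text)
```

Each <url> element corresponds to one object returned by a Sitemap class, as described in the sections that follow.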

Installation

To install the sitemap application, follow these steps:

  1. In settings.py, add 'django.contrib.sitemaps' to your INSTALLED_APPS setting. (Note: the sitemap application does not install any database tables. The only reason it needs to be in INSTALLED_APPS is so that the Loader() template loader can find the default templates.)
  2. Make sure that the TEMPLATES setting in settings.py contains a DjangoTemplates backend whose APP_DIRS option is set to True. This is the default, so you only need to change it back if you have modified these settings.
  3. Make sure that the sites framework is installed.
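As a rough sketch, the relevant parts of settings.py might look like the fragment below. The app list is abbreviated, and SITE_ID = 1 is an assumption (the id of the default Site row); adjust both to your project.

```python
# settings.py (fragment): a sketch, not a complete settings file
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.sites",     # the sites framework (step 3)
    "django.contrib.sitemaps",  # the sitemap framework (step 1)
    # ... your own apps ...
]

# Needed by django.contrib.sites; the value 1 is assumed here
SITE_ID = 1

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [],
        "APP_DIRS": True,  # lets Django find the sitemap app's templates (step 2)
        "OPTIONS": {},
    },
]
```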

Initialization

To activate sitemap generation on your website, add the following code to your URLconf (the urls.py file in the same directory as settings.py):

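from django.contrib.sitemaps.views import sitemap
from django.urls import path

# `sitemaps` is the dictionary of Sitemap classes described below;
# define or import it before using it here.
urlpatterns = [
    path('sitemap.xml', sitemap, {'sitemaps': sitemaps},
         name='django.contrib.sitemaps.views.sitemap'),
]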

When the user visits /sitemap.xml, Django will generate and return a sitemap.

The name of the sitemap file is not important, but its location is. Search engines only index the links in a sitemap that are at the sitemap's own URL level or below. For example, if sitemap.xml is located in the root directory of your website, it can reference any URL on your website. However, if your sitemap is located at /content/sitemap.xml, it can only reference URLs that start with /content/.
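The scope rule above can be expressed as a small check. This is a sketch using only the standard library; the function name in_sitemap_scope is made up for illustration, and real search engines may apply additional rules.

```python
import posixpath
from urllib.parse import urlsplit


def in_sitemap_scope(sitemap_url: str, page_url: str) -> bool:
    """Return True if page_url sits at or below the sitemap's directory."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    # A sitemap can only describe URLs on its own scheme and host
    if (sitemap.scheme, sitemap.netloc) != (page.scheme, page.netloc):
        return False
    # '/content/sitemap.xml' -> '/content/', '/sitemap.xml' -> '/'
    base_dir = posixpath.dirname(sitemap.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    return page.path.startswith(base_dir)


root_sitemap = "https://example.com/sitemap.xml"
content_sitemap = "https://example.com/content/sitemap.xml"

print(in_sitemap_scope(root_sitemap, "https://example.com/blog/post/"))
print(in_sitemap_scope(content_sitemap, "https://example.com/content/a/"))
print(in_sitemap_scope(content_sitemap, "https://example.com/blog/post/"))
```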

The sitemap view takes one extra, required argument: {'sitemaps': sitemaps}. sitemaps should be a dictionary that maps a short section label (for example, news or blog) to its Sitemap class (for example, NewsSitemap or BlogSitemap). A label can also map to an instance of a Sitemap class (for example, BlogSitemap(some_var)).

Sitemap class

A Sitemap class is a Python class that represents one "section" of entries in your sitemap. For example, one Sitemap class can represent all the entries of your weblog, while another can represent all the events in your event calendar.

In the simplest case, all these sections are combined into a single sitemap.xml, but the framework can also generate a sitemap index that references individual sitemap files, one per section. (See Creating a sitemap index below.)

A Sitemap class must be a subclass of django.contrib.sitemaps.Sitemap. It can live anywhere in your code base.

Example

Suppose you have a blog system with an Entry model, and you want the sitemap to include a link to every blog entry. The sitemap class might look like this:

from django.contrib.sitemaps import Sitemap
from blog.models import Entry

class BlogSitemap(Sitemap):
    changefreq = "never"
    priority = 0.5

    def items(self):
        return Entry.objects.filter(is_draft=False)

    def lastmod(self, obj):
        # The Entry model below has no pub_date field; use date_added instead
        return obj.date_added

Note:

  1. changefreq and priority correspond to the <changefreq> and <priority> elements in the generated XML. Like lastmod in the example, they can also be defined as methods that take an item.
  2. items() is a method that returns a sequence or QuerySet of objects. The returned objects are passed to any callable methods (location, lastmod, changefreq and priority) that correspond to sitemap properties.
  3. lastmod should return a datetime object.
  4. This example does not define a location method, but you can add one to specify the URL of each object. By default, location() calls get_absolute_url() on the object and returns the result as its URL. In other words, a model used in a sitemap, such as Entry, needs to implement get_absolute_url().

Entry model example

from django.db import models

class Entry(models.Model):
    title = models.CharField(max_length=60)
    text = models.TextField()
    is_draft = models.BooleanField()
    date_added = models.DateTimeField(auto_now_add=True)

    class Meta:
        verbose_name_plural = 'entries'

    def get_absolute_url(self):
        return f"/post/{self.title}.html"

Sitemap class in detail

The Sitemap class can define the following methods/attributes:

items

Required. A method that returns a sequence or QuerySet of objects.

The framework does not care what type of objects they are. All that matters is that these objects are passed to the location(), lastmod(), changefreq() and priority() methods.

location

Optional. Its value can be a method or attribute.

If it is a method, it should return the absolute path of a given object as returned by items().

If it is an attribute, its value should be a string representing the absolute path of each object returned by items().

The "absolute path" mentioned above means a URL that does not include the protocol or domain name. Examples:

Correct: '/foo/bar/'
Wrong: 'example.com/foo/bar/'
Wrong: 'https://example.com/foo/bar/'
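The three examples above can be checked mechanically. The helper below is a sketch using only the standard library; the name is_absolute_path is made up, and Django itself does not perform this validation for you.

```python
from urllib.parse import urlsplit


def is_absolute_path(location: str) -> bool:
    """True if location starts with '/' and has no protocol or domain."""
    parts = urlsplit(location)
    return not parts.scheme and not parts.netloc and location.startswith("/")


for candidate in ("/foo/bar/", "example.com/foo/bar/", "https://example.com/foo/bar/"):
    print(candidate, is_absolute_path(candidate))
```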

If location is not provided, the framework calls the get_absolute_url() method on each object returned by items().

This value ends up in the <loc></loc> element of the generated XML.

lastmod

Optional. Method or attribute.

If it is a method, it should take one argument, an object as returned by items(), and return that object's last-modified date/time as a datetime.

If it is an attribute, its value should be a datetime representing the last-modified date/time of every object returned by items().

If all items in a sitemap have a lastmod, the response generated by views.sitemap() will have a Last-Modified header equal to the latest lastmod. You can activate ConditionalGetMiddleware to make Django respond appropriately to requests with an If-Modified-Since header, which prevents the sitemap from being sent again if it has not changed.
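The Last-Modified rule above can be sketched in plain Python. This only illustrates the rule as described and is not Django's actual implementation; the function name last_modified_header is made up, and the datetimes are assumed to be timezone-aware UTC values.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(lastmods):
    """Return an HTTP date for the newest lastmod, or None.

    The header is only meaningful when every item reports a lastmod.
    """
    if not lastmods or any(value is None for value in lastmods):
        return None
    latest = max(lastmods)
    # usegmt=True requires a datetime whose tzinfo is timezone.utc
    return format_datetime(latest, usegmt=True)


complete = [
    datetime(2021, 7, 1, 8, 30, tzinfo=timezone.utc),
    datetime(2021, 7, 16, 12, 0, tzinfo=timezone.utc),
]
partial = complete + [None]

print(last_modified_header(complete))
print(last_modified_header(partial))
```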

changefreq

Optional. Method or attribute. Indicates how frequently the current entry changes.

If it is a method, it should take one argument, an object as returned by items(), and return the change frequency of that object as a string.

If it is an attribute, its value should be a string representing the frequency of changes for each object returned by items().

Whether it is a method or an attribute, the possible values of changefreq are:

  • 'always': changes every time the page is accessed
  • 'hourly': updated every hour
  • 'daily': updated every day
  • 'weekly': updated every week
  • 'monthly': updated every month
  • 'yearly': updated every year
  • 'never': never updated (for example, archived pages)
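Since a changefreq value is just a string, a typo such as 'Daily' would be written to the XML unchanged. A small guard like the following can catch that in your own code. It is a sketch; the helper name validate_changefreq is an assumption and is not part of Django.

```python
# The seven values listed above
CHANGEFREQ_VALUES = frozenset(
    {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}
)


def validate_changefreq(value: str) -> str:
    """Return value unchanged if it is a valid changefreq, else raise."""
    if value not in CHANGEFREQ_VALUES:
        raise ValueError(f"invalid changefreq: {value!r}")
    return value


print(validate_changefreq("weekly"))
```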

priority

Optional. Method or attribute. Indicates the priority of the current item relative to other URLs on the website.

If it is a method, it should take one argument, an object as returned by items(), and return the priority of that object as a string or a floating point number.

If it is an attribute, its value should be a string or floating point number representing the priority of every object returned by items().

Example values of priority: 0.4, 1.0. The default priority of a page is 0.5. For more information, see the sitemaps.org documentation.
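The Sitemaps protocol restricts priority to the range 0.0 to 1.0. The sketch below accepts either a string or a number, as the text above allows, and rejects values outside that range; the helper name check_priority is made up for illustration.

```python
def check_priority(value) -> float:
    """Convert a string or number to a float and enforce 0.0 <= p <= 1.0."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise ValueError(f"priority must be between 0.0 and 1.0, got {value!r}")
    return number


print(check_priority("0.4"))
print(check_priority(1))
```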

protocol

Optional.

This attribute defines the protocol ('http' or 'https') of the URLs in the sitemap. If it is not set, the protocol of the request for the sitemap is used. If the sitemap is built outside of a request context, the default value is 'http'.
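The fallback order described above can be sketched as follows. The function and parameter names are assumptions made for illustration; inside Django the domain comes from the current site and the scheme from the request.

```python
def absolute_url(location, domain, request_scheme=None, sitemap_protocol=None):
    """Combine protocol, domain and absolute path into a full URL.

    Order of precedence: the sitemap's protocol attribute, then the scheme
    of the incoming request, then 'http' when there is no request.
    """
    protocol = sitemap_protocol or request_scheme or "http"
    return f"{protocol}://{domain}{location}"


print(absolute_url("/foo/bar/", "example.com"))
print(absolute_url("/foo/bar/", "example.com", request_scheme="https"))
```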

limit

Optional.

This attribute defines the maximum number of URLs included on each page of the sitemap. Its value should not exceed the default value of 50000, which is the upper limit allowed by the Sitemaps protocol.
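When a section holds more URLs than limit, it is split into pages, and pages after the first are addressed with a ?p=N query parameter. The sketch below estimates how many pages a section needs; the function names and the sample numbers are assumptions made for illustration.

```python
import math

PROTOCOL_MAX_URLS = 50000  # upper limit allowed by the Sitemaps protocol


def sitemap_page_count(total_urls: int, limit: int = PROTOCOL_MAX_URLS) -> int:
    """Number of sitemap pages needed for total_urls entries."""
    if not 1 <= limit <= PROTOCOL_MAX_URLS:
        raise ValueError("limit must be between 1 and 50000")
    # An empty section still produces one (empty) page
    return max(1, math.ceil(total_urls / limit))


def sitemap_page_urls(section_url: str, total_urls: int, limit: int = PROTOCOL_MAX_URLS):
    """URLs of every page of one section; page 1 has no ?p= parameter."""
    pages = sitemap_page_count(total_urls, limit)
    return [section_url] + [f"{section_url}?p={n}" for n in range(2, pages + 1)]


print(sitemap_page_count(120000))
print(sitemap_page_urls("https://example.com/sitemap-blog.xml", 2500, limit=1000))
```

Lowering limit produces more, smaller pages, which is one way to keep each response small when a section grows very large.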

i18n

Optional.

A boolean attribute that defines whether the URLs of this sitemap should be generated for all of your LANGUAGES. The default value is False.

Sitemap for static views

Sometimes we want search engines to index views that are neither object detail pages nor flatpages. The solution is to list the URL names of these views in items() and call reverse() in the location method of the sitemap.

from django.contrib import sitemaps
from django.urls import reverse

class StaticViewSitemap(sitemaps.Sitemap):
    priority = 0.5
    changefreq = 'daily'
    # Maximum number of URLs per sitemap page
    limit = 1000

    def items(self):
        return ['main', 'about', 'license']

    def location(self, item):
        return reverse(item)

urls.py

# Alias the framework's views so they do not clash with the app's own views
from django.contrib.sitemaps import views as sitemap_views
from django.urls import path

from . import views
from .sitemaps import StaticViewSitemap

sitemaps = {
    'static': StaticViewSitemap,
}

urlpatterns = [
    path('', views.main, name='main'),
    path('about/', views.about, name='about'),
    path('license/', views.license, name='license'),
    # ...
    # Sitemap index: links to every section and to each page of a section
    path('sitemap.xml', sitemap_views.index, {'sitemaps': sitemaps}),
    # A single section of the sitemap, e.g. /sitemap-static.xml
    path('sitemap-<section>.xml', sitemap_views.sitemap, {'sitemaps': sitemaps},
         name='django.contrib.sitemaps.views.sitemap'),
]

Then visit:

http://example.com/sitemap.xml
