Add support for big sitemapse (> 50K urls) #9861

Open
incognitozen opened this issue May 5, 2022 · 18 comments

Comments

@incognitozen

incognitozen commented May 5, 2022

What version of Hugo are you using (hugo version)?

$ hugo version
hugo v0.97.3+extended linux/amd64 BuildDate=unknown

Does this issue reproduce with the latest release?

Yes,

If you create a sitemap on a site with over 50K URLs, Google complains that the file is too big.

Your Sitemap contains too many URLs. Please create multiple Sitemaps with up to 50000 URLs each and submit all Sitemaps.

I looked at the docs and noticed that there is no way to override this. Technically not a bug, but this makes it difficult to submit a sitemap to Google.

@incognitozen
Author

Hi @carerragt

Thanks for pointing in the right direction.

Please note davidsneighbour response on that thread.

Often requested, but technically not possible.

This means that there is reasonable demand and need for this feature. I have a site with a single 'type'. I don't have categories, tags, or other taxonomies. There are 79K URLs on my site, all belonging to the same type.

Hence, Ju52's proposal may not work for me, as I don't have different sections on the site. I understand that this may not be the top priority, but it is a problem worth fixing. I run hundreds of sites, all in WordPress. At least 50% of them would have more than 50K URLs.

@sifigi4335

Perhaps what you should be asking is how to split the sitemap list to multiple files. The 50k is a Google limitation, not Hugo's per se.

@incognitozen
Author

hi @carerragt

Sure, I understand.

It does raise the point that the very purpose of a sitemap is to submit it to search engines. Without this need, there is no requirement for a sitemap. Both Google and Bing, which provide consoles for managing sitemap submission, specifically require that a sitemap with more than 50K URLs be split into chunks.

I could open a forum thread, but if you solicit community feedback from those who have larger sites, they will tell you that this might be a very important feature for them.

@bep bep added this to the v0.100.0 milestone May 25, 2022
@midzer

midzer commented May 25, 2022

I also have a site with 50k+ pages in a single sitemap.

Adopting some kind of automatic splitting in Hugo because of an external limit might break things for others. We should first complain about the limit to the search engines. Maybe those providers can raise the limit to, let's say, 100k?

@incognitozen
Author

@midzer

You can certainly try, but there is a rationale for limiting the file to 50K URLs: the size of the file.
Try downloading a file that has 50K URLs and the size will be approx. 4MB.
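
As a rough check of the size argument, the sketch below builds a synthetic 50,000-entry sitemap in memory and measures its UTF-8 size. The base URL, the entry naming, and the fixed lastmod value are made up for illustration; the real figure depends on URL length and on which optional elements a site emits.

```python
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_size_bytes(n_urls, base="https://www.example.com/posts/"):
    """Return the UTF-8 size of a synthetic sitemap with n_urls entries.

    The base URL and the entry-NNNNNN naming are hypothetical placeholders.
    """
    header = f'<?xml version="1.0" encoding="utf-8"?>\n<urlset xmlns="{NS}">\n'
    footer = "</urlset>\n"
    body = "".join(
        "  <url>\n"
        f"    <loc>{base}entry-{i:06d}/</loc>\n"
        "    <lastmod>2022-05-05T00:00:00+00:00</lastmod>\n"
        "  </url>\n"
        for i in range(n_urls)
    )
    return len((header + body + footer).encode("utf-8"))


size = sitemap_size_bytes(50_000)
print(f"{size / 1_000_000:.1f} MB for 50,000 URLs")
```

With short URLs and no lastmod the file comes out smaller; with long slugs or alternate-language links it grows accordingly.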

Furthermore, Google and Bing certainly don't need to change their processing pipelines because a static site generator decided that sitemap.xml shouldn't be split. If Hugo wants to be adopted, then the onus of adding features or making changes in line with industry expectations lies with Hugo, not with other providers.

@bep bep modified the milestones: v0.100.0, v0.101.0 May 31, 2022
@bep bep modified the milestones: v0.101.0, v0.102.0 Jun 16, 2022
@bep bep modified the milestones: v0.102.0, v0.103.0 Aug 28, 2022
@bep bep modified the milestones: v0.103.0, v0.104.0 Sep 15, 2022
@bep bep modified the milestones: v0.104.0, v0.105.0 Sep 23, 2022
@FuadEfendi

FuadEfendi commented Oct 3, 2022

The "sitemaps" protocol supports a main "sitemap index" with many child "sitemaps" (50k URLs each). Example:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

Reference:
https://www.sitemaps.org/protocol.html

For example, the Amazon.com website used this in the past, when they had millions of pages. It seems they stopped ;) perhaps because there are now billions of pages to list?

And BTW, Google understands this "index" file. You can put a link to it in robots.txt; it is not necessary to submit it explicitly.
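
A minimal sketch of producing such an index file with Python's standard library, assuming the child sitemaps already exist. The function names and the example URLs are hypothetical; only the element names, the namespace, and the `Sitemap:` line in robots.txt come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(children):
    """Return sitemap index XML as bytes.

    children is an iterable of (loc, lastmod) pairs; lastmod may be None,
    in which case the optional <lastmod> element is left out.
    """
    ET.register_namespace("", NS)  # serialize without an ns0: prefix
    index = ET.Element(f"{{{NS}}}sitemapindex")
    for loc, lastmod in children:
        entry = ET.SubElement(index, f"{{{NS}}}sitemap")
        ET.SubElement(entry, f"{{{NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(entry, f"{{{NS}}}lastmod").text = lastmod
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


def robots_txt_line(index_url):
    """Return the robots.txt line that points crawlers at the index."""
    return f"Sitemap: {index_url}"


# Hypothetical child sitemap URLs, for illustration only.
xml_bytes = build_sitemap_index([
    ("https://www.example.com/sitemap-01.xml", "2024-06-10"),
    ("https://www.example.com/sitemap-02.xml", None),
])
print(xml_bytes.decode("utf-8"))
print(robots_txt_line("https://www.example.com/sitemap_index.xml"))
```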

@bep bep modified the milestones: v0.105.0, v0.106.0 Oct 26, 2022
@bep bep modified the milestones: v0.106.0, v0.107.0 Nov 18, 2022
@bep bep modified the milestones: v0.107.0, v0.108.0 Dec 3, 2022
@bep bep modified the milestones: v0.119.0, v0.120.0 Oct 5, 2023
@bep bep modified the milestones: v0.120.0, v0.121.0 Oct 31, 2023
@bep bep modified the milestones: v0.121.0, v0.122.0 Dec 6, 2023
@bep bep modified the milestones: v0.122.0, v0.123.0, v0.124.0 Jan 27, 2024
@bep bep modified the milestones: v0.124.0, v0.125.0 Mar 4, 2024
@FuadEfendi

Sitemaps are “nice to have” for SEO; there is no need to include pagination and taxonomy pages in sitemaps. In that case (if we ignore generated pages) it is quite easy to write a tool that generates the sitemap as part of the Hugo build (some JavaScript run as part of a Node command, etc.); it could be part of a custom theme.

@FuadEfendi

Workaround:

  1. Let Hugo create the default sitemap.xml
  2. Download it and split it into multiple files, 50,000 URLs each
  3. Follow https://www.sitemaps.org/protocol.html and create the necessary files accordingly; place them in the "/static" folder
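
The splitting step can be sketched as follows, assuming a sitemap that uses the standard sitemaps.org namespace. The function name, the `sitemap-NN.xml` naming scheme, and the paths in the closing comment are hypothetical; a sitemap index listing the resulting files is still needed, as the protocol describes.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit quoted in this thread


def split_sitemap(source, out_dir, max_urls=MAX_URLS):
    """Split one large sitemap into files of at most max_urls entries.

    Returns the list of file names written into out_dir.
    """
    ET.register_namespace("", NS)  # serialize without an ns0: prefix
    # Keep a readable prefix in case alternate-language links are present.
    ET.register_namespace("xhtml", "http://www.w3.org/1999/xhtml")

    urls = ET.parse(str(source)).getroot().findall(f"{{{NS}}}url")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    names = []
    for start in range(0, len(urls), max_urls):
        urlset = ET.Element(f"{{{NS}}}urlset")
        urlset.extend(urls[start:start + max_urls])
        name = f"sitemap-{len(names) + 1:02d}.xml"  # hypothetical naming scheme
        ET.ElementTree(urlset).write(
            str(out_dir / name), encoding="utf-8", xml_declaration=True
        )
        names.append(name)
    return names


# Example call (hypothetical paths):
# split_sitemap("public/sitemap.xml", "static")
```

Entries keep their original order, so the first file holds the first 50,000 URLs of the generated sitemap.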

Note: "sitemaps" are needed for documents that are not reachable, or not easily reachable, from "home". For example, huge websites such as Amazon need sitemaps: the only other way to "reach" a product is via the search bar.

So, I don't think sitemaps are as important for static websites as for e-commerce... "categories" and "pagination" replace them.

@FuadEfendi

As per documentation at https://gohugo.io/templates/sitemap-template/, we can explicitly use page front matter:

sitemap:
  changeFreq: ""
  disable: false
  filename: sitemap-01.xml
  priority: -1

Hugo also supports sitemapindex.xml generation.

A simple script can traverse your tons of documents and insert sitemap-01.xml for first 50,000, sitemap-02.xml for 2nd, and so on. This is just a workaround, but Hugo has made huge progress since this ticket was initially created.

@jmooring
Member

jmooring commented Jun 10, 2024

insert sitemap-01.xml for first 50,000, sitemap-02.xml

You are confusing site configuration with front matter override. You cannot override the filename in front matter. That's why the front matter override example in the documentation does not include filename.

@FuadEfendi

Ok, I didn't know that... but then, to confirm: we have the sitemap index feature, and we still don't have multi-index support? For now, I run a local build which generates a huge sitemap, then I split it manually and disable sitemap generation, then deploy the sitemaps from the "static" folder as a workaround.

@jmooring
Member

With a multilingual project we create one sitemap index, and individual sitemaps per language (site). Regardless of whether a project is monolingual or multilingual, we don't split sitemaps based on the number of entries.

That's why this issue is open.

@bep bep modified the milestones: v0.125.0, v0.128.0 Jun 10, 2024
@bep bep changed the title Sitemap exceeding 50K urls Add support for big sitemapse (> 50K urls) Jun 10, 2024
@gohugoio gohugoio locked and limited conversation to collaborators Jun 10, 2024
@gohugoio gohugoio unlocked this conversation Jun 10, 2024
@bep
Member

bep commented Jun 10, 2024

I think it's relatively clear what this issue is about. If you want to discuss workarounds, use https://discourse.gohugo.io/

One workaround could be to add your own sitemap template to your theme/project:

https://github.com/gohugoio/hugo/blob/master/tpl/tplimpl/embedded/templates/_default/sitemap.xml

And possibly filter out your 50k most interesting URLs from a SEO perspective ...

@FuadEfendi

I have a 270k-term modern terminology dictionary, all English; why should I filter the "most interesting" terms?
My workaround is simple: let Hugo generate the huge XML, then take scissors and cut it into 6 pieces; or just write a Java application which will generate what I need and place it into the "static" folder. I'll need an hour for that. Since it is too hard for Hugo ;)

@FuadEfendi

Yes, multilingual support adds more complexity

@FuadEfendi

Anyway, after some more thinking: sitemaps were invented for pages which are not reachable from the homepage. For Hugo-based sites, sitemaps are not needed at all; but that is my personal opinion.

I love the Amazon example: they used sitemaps approx. ten years ago, but now they don't. Perhaps they prefer to upload product listings in a different, specialized format to Google and other sites.

@FuadEfendi

FuadEfendi commented Jun 10, 2024

Sorry for writing too much, but continuing logically: I had a past "price comparison" site where product pages were reachable only from search results pages; it was nonsense to have "pagination" for such a huge site. So, I used sitemaps to explicitly generate URLs where I wanted the Search Engine to land.

It's important to note that sitemaps are not necessary for typical Articles or blog sites with a well-structured menu/submenu/pagination. They are only required in specific cases, such as the one I encountered: a site with a few hundred thousand products, accessible solely through the Search Bar. In such instances, Google may not discover these pages due to the lack of a link route from Home to Child to Sub-Child, and so on.
Therefore, sitemaps are particularly useful for managing large sites. For instance, I disabled pagination for my 270k dictionary site; it's not user-friendly to paginate the letter 'K' with 1000 links on a page, spread across 20 pages. In such cases, sitemaps can help to streamline the user ("robot" lol) experience.

Therefore, in Hugo, the use case for sitemaps is only for huge sites where we are forced to disable pagination.

Some other non-Hugo use cases for sitemaps: an SPA (Single-Page Application) that we want to make searchable, etc.

@bep bep modified the milestones: v0.128.0, v0.129.0 Jun 21, 2024
6 participants