Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML Sitemaps integration #15302

Open
2 of 8 tasks
jdevalk opened this issue May 25, 2020 · 11 comments
Open
2 of 8 tasks

XML Sitemaps integration #15302

jdevalk opened this issue May 25, 2020 · 11 comments

Comments

@jdevalk
Copy link
Contributor

jdevalk commented May 25, 2020

We want to let the Yoast SEO XML sitemaps run on indexables as this will perform much better. Some of the work needed for this is already done in the jdv/xml-sitemaps branch, see its changes compared to trunk.

As soon as a site has completed indexing, the current XML sitemaps code (henceforth known as “classic XML sitemaps”) should no longer be loaded. Instead the new XML sitemaps code (henceforth: "XML sitemaps integration") should be loaded.

The XML sitemaps for Yoast SEO should remain functionally the same regardless of which of them runs.

We'll also add a feature toggle that allows switching to classic XML sitemaps, so users can force the old implementation with multilingual solutions that are not compatible.

Code parts of the XML sitemaps integration

All of the code in classic XML sitemaps should have a counterpart in the XML sitemaps integration.

  • Conditional
    Allows only loading stuff on XML sitemap requests. Already done, in branch.
  • Integration
    Takes care of adding query vars and registering the rewrites. Already done, in branch.
    Takes care of other XML sitemaps code that has no other logical place.
  • Router
    Takes care of routing to the right xml sitemap presenter (through template_redirect, I think)
    Uses Routes, which are a simple mapping of urls to presenters. Similar like we did here. Some of the pre-work for this is already done, as this required making sure Routes are loaded for every request, adding a REST conditional
  • Presenters
    We should build an abstract presenter, that the normal presenters build upon. The index presenter probably can't use that abstract.
    • Post type
    • Taxonomy
    • Author
    • Index

Images

Images in the XML sitemaps are taken from the internal linking table. Therefore we no longer need to parse every post. All the images and all the URLs for a sitemap should be grabbed in one find_many call.

Tasks

  • Build the conditional.
  • Make routes work on all requests.
  • Make sure the router routes to a presenter.
  • Build sitemap index presenter.
  • Build abstract sitemap presenter.
  • Build post type presenter.
  • Build taxonomy presenter.
  • Build feature toggle.
@omarreiss
Copy link
Contributor

Let's also make sure the Router has a method of registering routes, so we can register routes for the sitemap similar to the way we register WP REST routes. This way we enable the router to actually do the routing and we keep the interface as consistent as possible.

In a normal web application, a router would also take care of redirects. I was thinking the Router could act as middleware for the redirect handler in Premium and the redirects integration in free as well. But since redirects need to be done as soon as possible in the loading order and the Router would only route on template_redirect this doesn't seem like the best idea.

@mmikhan
Copy link
Member

mmikhan commented May 13, 2021

Please inform the customer of conversation # 734729 when this conversation has been closed.

@suascat
Copy link

suascat commented Jun 14, 2021

Please inform the customer of conversation # 744280 when this conversation has been closed.

@wpsumo
Copy link

wpsumo commented Jun 14, 2021

@jdevalk @iamazik Any update oin the rebuild of XML sitemap?

@archon810
Copy link

We run a busy server with lots of data and visitors, and started hitting this issue more and more lately. When it happens, the same query is stuck by the hundreds with various offsets in MySQL, and slave load shoots up to 300-400. We have 3 32GB 8-CPU slave servers, and this is the only situation currently that brings them to their knees.

Here's an example:
image

And a sample of stuck queries: https://gist.github.com/e55ae2677ac44c7c3c3cf07004e5121e

These queries not only bog down the db servers and reduce the number of available connections, they also make the slaves lag.

From what I can tell, it is indeed sitemap generation that causes these. Once one query, which is as noted in #12161 quite expensive, takes a long time, it snowballs when more sitemaps are accessed and eventually destroys everything.

Is generating sitemaps from indexables the solution here? If so, can we please get an update on the status of this task? It's coming up on 2 years now. With Yoast's sale, will there be more resources dedicated to resolving such deal-breaking issues?

Thank you.

@archon810
Copy link

archon810 commented Sep 17, 2021

By the way, in the queries the LIMIT is 100, meaning 100 posts at a time, but the XML sitemaps contain 1000 entries, which is the default. Why is each query only getting 100 at a time instead of 1000?

Changing 100 to 1000 in the query doesn't really affect the run time negatively because the majority of the time is spent on searching and ordering, not returning data.

https://github.com/Yoast/wordpress-seo/blob/trunk/inc/sitemaps/class-post-type-sitemap-provider.php#L545
https://github.com/Yoast/wordpress-seo/blob/trunk/inc/sitemaps/class-sitemaps.php#L580

@archon810
Copy link

We found where the step is being reset to 100 for some reason https://github.com/Yoast/wordpress-seo/blob/trunk/inc/sitemaps/class-post-type-sitemap-provider.php#L175. As a temporary workaround, can you provide a filter to override this behavior perhaps?

@archon810
Copy link

Even with the index described here #12161 (comment), everything exploded again just now.

image

How can we get this under control? @jdevalk

@archon810
Copy link

I think I can explain why this happens sometimes and brings our dbs to their knees - bots.

Specifically, when bots, whether malicious or not (like MailRu) start requesting dozens of sitemaps almost at the same time. I'm not sure if there's no cache for those, or if the cache expires, but all of them start generating at the same time and requests start piling up, each request making the issue worse and worse.

Here's an example of such an attack:
image
image

The evil IP is definitely a bot https://www.abuseipdb.com/check/91.213.50.11.

The solution is either moving sitemap regeneration to cron or always serving cached sitemaps while regenerating them in the background with another method. Whatever the solution, it can't involve running really expensive queries triggered by site visitors.

@jdevalk
Copy link
Contributor Author

jdevalk commented Nov 7, 2021

@archon810 if you have time, testing this (still in draft) pull request would be appreciated: #17580

it's a rebuild of the XML sitemaps on indexables which should be loads faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants