Scrapers Detail Admins Page load too slow #90

Open
bezkos opened this issue Jun 6, 2017 · 8 comments
Comments

@bezkos

bezkos commented Jun 6, 2017

If you have many scrapers (around 100) and try to change one of them, the detail page loads too slowly (around 35 sec).
I inspected the problem with DjDT (Django Debug Toolbar) and saw that loading this page takes 5790 SQL queries, 5780 of them duplicates. I know you can't simply use select_related or prefetch_related here as in plain Django to eliminate this, but I think it's important to fix, because DDS with many scrapers is unusable at the moment.

@holgerd77
Owner

Hi @bezkos, what do you mean by "change one of them"? In other words, what are you changing or doing that makes the detail pages load so slowly?

@bezkos
Author

bezkos commented Jun 7, 2017

On the scrapers page I have around 100 scrapers. When I need to change, for example, the XPath in one of them, the page takes around 35 sec to load, with 5790 SQL queries, 5780 of them duplicates.
This is part of the results from the Debug Toolbar:

SELECT "dynamic_scraper_scrapedobjclass"."id", "dynamic_scraper_scrapedobjclass"."name", "dynamic_scraper_scrapedobjclass"."scraper_scheduler_conf", "dynamic_scraper_scrapedobjclass"."checker_scheduler_conf", "dynamic_scraper_scrapedobjclass"."comments" FROM "dynamic_scraper_scrapedobjclass" WHERE "dynamic_scraper_scrapedobjclass"."id" = 104
Duplicated 5758 times.
Connection: default
C:\venv27\lib\site-packages\dynamic_scraper/models.py in __str__(203)
return self.name + " (" + self.scraped_obj_class.name + ")"
18 {% trans 'Home' %}
19 › {{ opts.app_config.verbose_name }}
20 › {% if has_change_permission %}{{ opts.verbose_name_plural|capfirst }}{% else %}{{ opts.verbose_name_plural|capfirst }}{% endif %}
21 › {% if add %}{% blocktrans with name=opts.verbose_name %}Add {{ name }}{% endblocktrans %}{% else %}{{ original|truncatewords:"18" }}{% endif %}
22
23 {% endblock %}
24 {% endif %}
25
C:\venv27\lib\site-packages\django\contrib\admin\templates\admin\change_form.html

SELECT "dynamic_scraper_scrapedobjclass"."id", "dynamic_scraper_scrapedobjclass"."name", "dynamic_scraper_scrapedobjclass"."scraper_scheduler_conf", "dynamic_scraper_scrapedobjclass"."checker_scheduler_conf", "dynamic_scraper_scrapedobjclass"."comments" FROM "dynamic_scraper_scrapedobjclass" WHERE "dynamic_scraper_scrapedobjclass"."id" = 89
Duplicated 5758 times.
Connection: default
C:\venv27\lib\site-packages\dynamic_scraper/models.py in __str__(55)
return self.name + " (" + str(self.obj_class) + ")"

SELECT "dynamic_scraper_scrapedobjclass"."id", "dynamic_scraper_scrapedobjclass"."name", "dynamic_scraper_scrapedobjclass"."scraper_scheduler_conf", "dynamic_scraper_scrapedobjclass"."checker_scheduler_conf", "dynamic_scraper_scrapedobjclass"."comments" FROM "dynamic_scraper_scrapedobjclass" WHERE "dynamic_scraper_scrapedobjclass"."id" = 27
Duplicated 5758 times.
Connection: default
C:\venv27\lib\site-packages\dynamic_scraper/models.py in __str__(55)
return self.name + " (" + str(self.obj_class) + ")"
And so on: there are 5758 duplicates for each id.
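
The pattern in the toolbar output (a `__str__` that reads a related object, called once per rendered choice) can be reproduced outside Django. The following is only a rough sketch using Python's built-in sqlite3 module; the table names `obj_class` and `obj_attr` and both function names are hypothetical stand-ins, not DDS code. It counts how many SELECT statements are needed when every label triggers its own lookup of the related class.

```python
import sqlite3


def build_demo_db(n_classes=3, attrs_per_class=4):
    """Create a small in-memory database (hypothetical schema, not the DDS one)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obj_class (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE obj_attr ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "obj_class_id INTEGER REFERENCES obj_class(id))"
    )
    for c in range(1, n_classes + 1):
        conn.execute(
            "INSERT INTO obj_class (id, name) VALUES (?, ?)", (c, "class%d" % c)
        )
        for a in range(attrs_per_class):
            conn.execute(
                "INSERT INTO obj_attr (name, obj_class_id) VALUES (?, ?)",
                ("attr%d" % a, c),
            )
    conn.commit()
    return conn


def labels_with_lookup_per_row(conn):
    """Build 'attr (class)' labels with one extra SELECT per attribute row."""
    statements = []
    conn.set_trace_callback(statements.append)  # record every executed statement
    labels = []
    rows = conn.execute(
        "SELECT name, obj_class_id FROM obj_attr ORDER BY id"
    ).fetchall()
    for name, class_id in rows:
        # This mimics a lazy foreign-key access inside __str__.
        (class_name,) = conn.execute(
            "SELECT name FROM obj_class WHERE id = ?", (class_id,)
        ).fetchone()
        labels.append("%s (%s)" % (name, class_name))
    conn.set_trace_callback(None)
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    return labels, len(selects)
```

The query count here grows as one plus the number of rows. In an admin form with several inline rows, each row renders its own dropdown, so the same per-option lookups are repeated for every row, which would be consistent with the large duplicate counts reported above.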

@holgerd77
Owner

Ah, I thought you meant the detail pages of the websites you are going to scrape.

By detail page, do you mean the edit form page of a scraper in the admin, or the overview page with all the scrapers?

Have you got one Scraped Obj Class for every scraper? And how many Scraped Obj Classes have you got?

@holgerd77
Owner

Can you run a test: edit the C:\venv27\lib\site-packages\dynamic_scraper/models.py file and remove the part of the returned name that is in parentheses, both in line 55 and line 203?

So just leave return self.name.

@bezkos
Author

bezkos commented Jun 7, 2017

Yes, I did the test and it fixes the problem.
Load time is down to 1.8 sec from 35 sec.
Queries are down to 29 (20 duplicates) from 5790.

SELECT "dynamic_scraper_scrapedobjattr"."id", "dynamic_scraper_scrapedobjattr"."name", "dynamic_scraper_scrapedobjattr"."order", "dynamic_scraper_scrapedobjattr"."obj_class_id", "dynamic_scraper_scrapedobjattr"."attr_type", "dynamic_scraper_scrapedobjattr"."id_field", "dynamic_scraper_scrapedobjattr"."save_to_db" FROM "dynamic_scraper_scrapedobjattr" ORDER BY "dynamic_scraper_scrapedobjattr"."order" ASC
Duplicated 11 times.
Connection: default

I have around 100 obj classes and 110 scrapers.
I mean the edit form page of a scraper in the admin.

@holgerd77
Owner

This was actually trickier than I thought. I experimented with 2-3 different things, none of them completely satisfying (I thought I could quickly fix this since I'm doing a minor release today anyhow).

I actually need the complete names, otherwise users get confused when selecting the scraped object attributes for the scraper, so simplifying the naming is not an option. I also experimented with simple caching of the name, which didn't work either.

Limiting the choices to only the attributes of the corresponding scraped object class is also trickier than one might think, since the object class is not yet determined when adding a new scraper or adding new scraper elems. I have now added such a limitation, but it works only for already saved scrapers, for already added attributes.

Let me know if this improves the performance situation for you. Otherwise you will have to monkey patch this for yourself in your installed DDS version.

Greetings
Holger

@bezkos
Author

bezkos commented Jun 8, 2017

Ok @holgerd77, I found a way to reduce time and queries by 75%.
class ScraperElemInline(admin.TabularInline):
    model = ScraperElem
    extra = 3

    def formfield_for_foreignkey(self, db_field, request=None, **kwargs):
        # Fetch each attribute together with its object class in one query,
        # so building the choice labels does not hit the database per option.
        if db_field.name == 'scraped_obj_attr':
            kwargs['queryset'] = ScrapedObjAttr.objects.select_related('obj_class').all()
        return super(ScraperElemInline, self).formfield_for_foreignkey(db_field, request, **kwargs)
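
For context, `select_related('obj_class')` tells Django to pull the related row in the same query through a SQL join instead of looking it up later per object. The following is a rough stdlib-only illustration of that idea, not DDS or Django code; the schema and the names `build_demo_db` and `labels_with_join` are assumptions made up for this sketch.

```python
import sqlite3


def build_demo_db(n_classes=3, attrs_per_class=4):
    """Create a small in-memory database (hypothetical schema, not the DDS one)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obj_class (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE obj_attr ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "obj_class_id INTEGER REFERENCES obj_class(id))"
    )
    for c in range(1, n_classes + 1):
        conn.execute(
            "INSERT INTO obj_class (id, name) VALUES (?, ?)", (c, "class%d" % c)
        )
        for a in range(attrs_per_class):
            conn.execute(
                "INSERT INTO obj_attr (name, obj_class_id) VALUES (?, ?)",
                ("attr%d" % a, c),
            )
    conn.commit()
    return conn


def labels_with_join(conn):
    """Build 'attr (class)' labels from a single joined SELECT."""
    statements = []
    conn.set_trace_callback(statements.append)  # record every executed statement
    rows = conn.execute(
        "SELECT a.name, c.name "
        "FROM obj_attr AS a "
        "INNER JOIN obj_class AS c ON c.id = a.obj_class_id "
        "ORDER BY a.id"
    ).fetchall()
    conn.set_trace_callback(None)
    labels = ["%s (%s)" % (attr_name, class_name) for attr_name, class_name in rows]
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    return labels, len(selects)
```

In this sketch the labels come out identical to a per-row lookup, but the number of SELECT statements stays at one regardless of how many attributes exist.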

@bezkos
Author

bezkos commented Jun 8, 2017

And my last update, with no duplicates and <1 sec load (down from 35 sec).
In models.py:
class WithObJClass(models.Manager):
    def get_queryset(self):
        return super(WithObJClass, self).get_queryset().select_related('obj_class')

@python_2_unicode_compatible
class ScrapedObjAttr(models.Model):
    ATTR_TYPE_CHOICES = (
        ('S', 'STANDARD'),
        ('T', 'STANDARD (UPDATE)'),
        ('B', 'BASE'),
        ('U', 'DETAIL_PAGE_URL'),
        ('I', 'IMAGE'),
    )
    name = models.CharField(max_length=200)
    order = models.IntegerField(default=100)
    obj_class = models.ForeignKey(ScrapedObjClass)
    attr_type = models.CharField(max_length=1, choices=ATTR_TYPE_CHOICES)
    id_field = models.BooleanField(default=False)
    save_to_db = models.BooleanField(default=True)
    objects = WithObJClass()

    def __str__(self):
        return self.name + " (" + str(self.obj_class.name) + ")"

    class Meta(object):
        ordering = ['order',]

holgerd77 reopened this Jun 9, 2017