The script parses various sections of the www.wildberries.ru marketplace. It can clean its own tables, but its main purpose is to accumulate information over time about how sales and goods prices change in different sections.
- Python 3.6+
- MySQL server
How to get it on Ubuntu 18.04+:
- You need MySQL server and Python pip, if you don't have them yet:

    sudo apt install python3-pip
    sudo apt install mysql-server
- Install the required Python modules (bs4, pandas, requests) and the MySQL connector:

    pip3 install bs4
    pip3 install pandas
    pip3 install requests
    pip3 install mysql-connector-python
- Check the MySQL installation and root password, and choose the default database:

    mysql -u root -p
    mysql> USE sys;
    mysql> quit
How to get it on Windows:
- You need MySQL Community Server; if you don't have it yet, get the installer here: https://dev.mysql.com/downloads/windows/installer/8.0.html
When you see "Choosing a Setup Type", select the "Custom" type and click Next. On the next screen, all you need is MySQL Server, MySQL Workbench and Connector/Python.
- Run MySQL Workbench and choose SYS as the default database.
- Install the required Python modules (bs4, pandas, requests) and the MySQL connector:
    pip install bs4
    pip install pandas
    pip install requests
    pip install mysql-connector-python
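To verify that the server and the connector work together, you can run a quick check from Python on either platform. This is just a sanity-check sketch; replace the password with your MySQL root password:

    import mysql.connector

    # Connect as root to the "sys" database and print the server version.
    conn = mysql.connector.connect(
        user="root",
        password="<mysql root password>",
        host="localhost",
        database="sys",
    )
    print(conn.get_server_info())
    conn.close()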
The first thing you need for real work with this script is a whole bunch of proxy servers; otherwise www.wildberries.ru will ban you for an hour or so after you fetch about 100 positions in a short time. The script is designed to work with a premium account on www.hidemyname.org, which costs $8 per month. After payment, send the feedback form on https://hidemyname.org/en/feedback/ with a request for API access. You'll get an e-mail with some keys and options.
Now you need to initialize the API. This is a one-time procedure for every key. Go to https://hidemyna.me/ru/proxy-list/
There will be a form; click the IP:port link below it. A small window with a single field and a button will appear. Paste your key into the field and click "Login". If you see a list of IP addresses, you're ready to work; there is no need to copy them.
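The crawler fetches and rotates proxies itself, but to illustrate the idea, here is a rough sketch of pulling a proxy list and routing a request through a random proxy with the requests module. The API URL and response format below are hypothetical; use the endpoint and parameters from the e-mail hidemyname sends you:

    import random
    import requests

    # Hypothetical endpoint: take the real URL and parameters from the
    # e-mail you received after requesting API access.
    PROXY_API_URL = "https://hidemyna.me/api/proxylist.txt?code=<your key>"

    # Assume the API returns one "ip:port" pair per line.
    resp = requests.get(PROXY_API_URL, timeout=10)
    proxy_list = [line.strip() for line in resp.text.splitlines() if line.strip()]

    # Route a request through a randomly chosen proxy.
    proxy = random.choice(proxy_list)
    page = requests.get(
        "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki",
        proxies={"http": "http://" + proxy, "https": "http://" + proxy},
        timeout=10,
    )
    print(page.status_code)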
Next you need to edit the config.ini file. It has fewer than ten lines:
    [mysql]
    mysql_user = root
    mysql_pass = <mysql root password>
    mysql_host = localhost
    mysql_base = sys

    [params]
    expiry_hours = 24
    timeout_between_urls = 3
    proxy_list_key = <your hidemyname 15-digit key>
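For reference, such a file can be read with Python's standard configparser; a minimal sketch, assuming the section and option names above:

    import configparser

    # Read config.ini and pull out the values the crawler needs.
    config = configparser.ConfigParser()
    config.read("config.ini")

    mysql_user = config["mysql"]["mysql_user"]
    mysql_pass = config["mysql"]["mysql_pass"]
    mysql_host = config["mysql"]["mysql_host"]
    mysql_base = config["mysql"]["mysql_base"]

    expiry_hours = config.getint("params", "expiry_hours")
    timeout_between_urls = config.getint("params", "timeout_between_urls")
    proxy_list_key = config["params"]["proxy_list_key"]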
expiry_hours - each record in the MySQL table has a timestamp, and if you run the script more than once a day, for example, positions that are too "young" will not be updated. 24 hours is a good option; you can choose anything from 1 hour up to 168 hours (1 week).
timeout_between_urls - the delay between requests, in seconds. The actual delay is a random value from 0 to this timeout. Don't choose too low a delay; 3 seconds is fine.
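As an illustration (not the script's actual code), these two parameters might be applied like this:

    import random
    import time
    from datetime import datetime, timedelta

    def wait_between_urls(timeout_between_urls: int) -> None:
        # Sleep for a random time between 0 and timeout_between_urls seconds.
        time.sleep(random.uniform(0, timeout_between_urls))

    def is_fresh(record_timestamp: datetime, expiry_hours: int) -> bool:
        # A record younger than expiry_hours is "fresh" and is skipped
        # instead of being updated on this run.
        return datetime.now() - record_timestamp < timedelta(hours=expiry_hours)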
Usage: crawler.py [-h] [-c] [-d] [-u] [-s] [-n] [-m] database source
Required arguments:
- database = Table name
- source = The page URL where crawling starts. If the URL contains query parameters, it must be in quotation marks (see the examples below).
Optional arguments:
- -c = Cleans the table before starting. Use it if you want to start some table from scratch.
- -d = Debug mode.
- -u = Updates the http proxies table.
- -s = Uses https proxies too. Overrides the -u argument.
- -n = Doesn't use proxies. This may be needed for debug mode, or if you want to parse fewer than 100 positions.
- -m = Fills empty material cells. Sometimes positions are parsed without the "material" cell; in that case the cell contains just "---". At the end of the whole parsing run the script checks and refills such cells, but the argument is left in just in case you want to try to refill them again. It will work only if expiry_hours has not passed yet.
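For orientation, the interface above could be declared with argparse roughly like this (a sketch, not necessarily the crawler's exact code):

    import argparse

    parser = argparse.ArgumentParser(prog="crawler.py")
    parser.add_argument("database", help="Table name")
    parser.add_argument("source", help="Page URL from where you start to crawl")
    parser.add_argument("-c", action="store_true", help="Clean the table before starting")
    parser.add_argument("-d", action="store_true", help="Debug mode")
    parser.add_argument("-u", action="store_true", help="Update http proxies table")
    parser.add_argument("-s", action="store_true", help="Use https proxies too (overrides -u)")
    parser.add_argument("-n", action="store_true", help="Don't use proxies")
    parser.add_argument("-m", action="store_true", help="Fill empty material cells")
    args = parser.parse_args()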
The first time you start the crawler on a given day, you must use the -u or -s argument, because the proxies change constantly. Example:

    python crawler.py -u backpacks https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki
When you see that some section has more than 20000 positions, you must narrow your request, or many positions in the middle will go unnoticed. You can split it into two or more queries. Note that now the URL needs quotation marks around it. Example:

    python crawler.py backpacks "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki?price=200;2000"
    python crawler.py backpacks "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki?price=2000;20000"
Here I split the backpacks section into "cheaper than 2000 rub" and "from 2000 rub to 20000 rub" parts. You can split and narrow these sections using three or even more filters, for example if you need only black polyester backpacks for men priced from 3000 rub to 10000 rub:

    python crawler.py backpacks "https://www.wildberries.ru/catalog/aksessuary/sumki-i-ryukzaki/ryukzaki/ryukzaki?price=3000;10000&color=0&kind=1&consists=6"

And so on.
Telegram: @berenzorn
Email: [email protected]
Feel free to contact me if you have any questions. Have a nice day :)