This repository contains the source code for the final project of the Modern Search Engines course at the University of Tübingen.
- Local set up for development
- Remote set up for deployment
- Crawler set up at local computer
- Frontend
- Quality check
## Local set up for development

- Tear down everything:

```shell
./scripts/teardown.sh docker-compose.yml
```
- Create output directories and initialize environment variables:

```shell
cp -rf example.env .env
cp -rf example.frontend.env frontend/.env
```
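The `example.env` files are plain `KEY=VALUE` files. If you need to read them from a script, a minimal parser sketch looks like this (an illustration only; it assumes no quoting or `export` prefixes, and the actual variable names are project-specific):

```python
def parse_env(text):
    """Tiny .env parser sketch: ignores blank lines, comments, and malformed lines."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```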
- Start the project locally:

```shell
./scripts/startup.sh docker-compose.yml
```
## Remote set up for deployment

- To start the project on the server, first create the external volumes:

```shell
docker volume create prod_tuesearch_database
docker volume create prod_tuesearch
```
- Change the passwords in `.env` and start the containers with:

```shell
./scripts/startup.sh prod.docker-compose.yml
```
- Analogously, tear down with:

```shell
./scripts/teardown.sh prod.docker-compose.yml
```
- And remove the external volumes (if needed) with:

```shell
docker volume rm prod_tuesearch_database
docker volume rm prod_tuesearch
```
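For the password step above, random credentials can be generated with Python's standard library (a sketch; the actual variable names inside `.env` are project-specific):

```python
import secrets

# Generate a URL-safe random password suitable for the production .env.
# 24 random bytes encode to a 32-character token.
password = secrets.token_urlsafe(24)
print(password)
```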
## Crawler set up at local computer

**Important note:** when stopping a crawler, stop it gracefully so it has time to unreserve its reserved jobs.
- Add the `.env` file from Discord to the root directory and start the crawler with:

```shell
docker-compose -f docker-compose.yml up loop_worker
```
- To start more than one crawler, run:

```shell
docker-compose -f docker-compose.yml up --build --scale loop_worker=2 loop_worker
```

Change the number `2` to the number of crawlers you want to start. Start with one crawler and increase the number gradually to see if everything works fine. Be polite to other websites and use at most 4 crawlers at the same time to avoid overloading the crawled websites.
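The politeness advice above can also be enforced inside a crawler with a per-host minimum delay. A minimal Python sketch (the delay value and the injectable `clock`/`sleep` parameters are assumptions for illustration, not part of this project's crawler):

```python
import time

MIN_DELAY = 1.0  # assumed minimum seconds between requests to the same host

_last_request = {}

def politeness_wait(host, clock=time.monotonic, sleep=time.sleep):
    """Block just long enough to keep MIN_DELAY between hits to the same host."""
    now = clock()
    last = _last_request.get(host)
    if last is not None and now - last < MIN_DELAY:
        sleep(MIN_DELAY - (now - last))
    _last_request[host] = clock()
```

The `clock` and `sleep` parameters are injectable only to make the sketch testable; a real crawler would call it as `politeness_wait(host)` before each fetch.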
## Frontend

- Start the mock-up server:

```shell
docker-compose -f docker-compose.yml up --build backend_mockup_server
```

and test the mock API at `localhost:4001/search?q=tubingen`.
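When scripting against the mock API, the query URL can be built with Python's standard library (a sketch; only the host, port, and query parameter come from the command above):

```python
from urllib.parse import urlencode

# Build the mock-API search URL with a properly encoded query string.
base = "http://localhost:4001/search"
url = f"{base}?{urlencode({'q': 'tubingen'})}"
print(url)  # → http://localhost:4001/search?q=tubingen
```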
- Install dependencies:

```shell
npm install
```
- Start the frontend:

```shell
npm run dev
```
- Open the browser at `https://localhost:5000/`.
## Quality check

Some regularly used SQL queries to check crawl quality:
- Test relevance ratio:

```sql
SELECT count(*) FROM `documents` WHERE relevant = 1;
SELECT count(*) FROM `documents` WHERE relevant = 0;
```
- Update priority list:

```sql
SELECT j.url, j.priority
FROM jobs AS j
JOIN documents AS d ON j.id = d.job_id
WHERE d.relevant = 1;
```
- Update block list:

```sql
SELECT j.url, j.priority
FROM jobs AS j
JOIN documents AS d ON j.id = d.job_id
WHERE d.relevant = 0;
```
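The two counts from the relevance-ratio queries combine into a single ratio; a quick sketch (the counts here are made-up numbers for illustration):

```python
# Relevance ratio from the two COUNT(*) results above (hypothetical values).
relevant, irrelevant = 1200, 300
ratio = relevant / (relevant + irrelevant)
print(f"relevance ratio: {ratio:.0%}")  # → relevance ratio: 80%
```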