Academic project for the web data integration course held by Prof. G. Costagliola at the Dipartimento di Informatica ('Department of Computer Science') of the University of Salerno.
Adbis is an ebook and audiobook aggregator: users query a single web site and can then buy books from several e-commerce web sites.
The available sources are:
- Amazon, Kobo and Google Books (via API) for ebooks;
- Audible and ilNarratore for audiobooks;
- QLibri for reviews.
The page markup of the sources may change without notice, which can cause the application to stop working as expected at any time.
The Adbis architecture is based on a mediator that sits between the user and the above-mentioned sources. Previously retrieved results are stored in a database that acts as a cache.
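The mediator-plus-cache idea can be sketched as follows. This is a minimal illustration in Python rather than the project's PHP code: the `Mediator` class, the `Wrapper` type, the `fake_source` function and the in-memory dictionary standing in for the database are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical type: a wrapper takes a keyword and returns result titles.
Wrapper = Callable[[str], List[str]]


class Mediator:
    """Queries every source wrapper and caches the merged results per keyword."""

    def __init__(self, wrappers: Dict[str, Wrapper]) -> None:
        self.wrappers = wrappers
        # An in-memory dict stands in for the database cache (assumption).
        self.cache: Dict[str, List[str]] = {}

    def search(self, keyword: str) -> List[str]:
        key = keyword.strip().lower()
        if key in self.cache:
            # Cache hit: no source is contacted.
            return self.cache[key]
        results: List[str] = []
        for name, wrapper in self.wrappers.items():
            results.extend(f"{name}: {title}" for title in wrapper(key))
        self.cache[key] = results
        return results


calls: List[str] = []


def fake_source(keyword: str) -> List[str]:
    # Stand-in for a real source wrapper; records every call it receives.
    calls.append(keyword)
    return [f"{keyword} (ebook)"]


mediator = Mediator({"source_a": fake_source})
first = mediator.search("Dracula")
second = mediator.search("dracula ")  # served from the cache
```

The second search normalizes to the same key as the first, so the source is contacted only once.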
Apart from Google Play Books, which exposes an API, the sources have to be scraped to retrieve their data. Scraping classes are implemented as a hierarchy: common methods are gathered in the abstract superclass, while the subclasses specialize the type of item to scrape.
According to this scheme, the abstract class `Scraper` is the superclass of the `BookScraper`, `AudiobookScraper` and `ReviewScraper` subclasses.
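A rough sketch of such a hierarchy is shown below, in Python for illustration only (the project is written in PHP). The class names come from the text above; the `scrape` and `build_item` methods and the dictionary-based rows are hypothetical and only show how shared logic can live in the abstract superclass.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Scraper(ABC):
    """Abstract superclass holding the logic shared by all scrapers."""

    def scrape(self, rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
        # Shared step: skip rows without a title, then delegate item creation.
        return [self.build_item(row) for row in rows if row.get("title", "").strip()]

    @abstractmethod
    def build_item(self, row: Dict[str, str]) -> Dict[str, str]:
        """Each subclass decides which kind of item a raw row becomes."""


class BookScraper(Scraper):
    def build_item(self, row: Dict[str, str]) -> Dict[str, str]:
        return {"type": "book", "title": row["title"].strip()}


class AudiobookScraper(Scraper):
    def build_item(self, row: Dict[str, str]) -> Dict[str, str]:
        return {"type": "audiobook", "title": row["title"].strip()}


class ReviewScraper(Scraper):
    def build_item(self, row: Dict[str, str]) -> Dict[str, str]:
        return {"type": "review", "title": row["title"].strip()}


raw_rows = [{"title": " Dracula "}, {"title": ""}]
books = BookScraper().scrape(raw_rows)
reviews = ReviewScraper().scrape(raw_rows)
```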
Every scraper connects to search pages via cURL. The resulting pages are scraped with XPath queries, which are stored in the source wrappers. Extracted strings are checked so that only valid results are returned; for each valid result a new entity is built and added to a set. The set is returned to the wrapper, which in turn returns it to the mediator.
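The extract, validate and build steps could look like the sketch below. It is written in Python for illustration and is not the project's code: the page is a hard-coded, well-formed snippet instead of a page downloaded with cURL, and the `QUERIES` dictionary, the `Book` entity and the markup are assumptions. Python's `xml.etree.ElementTree` supports only a subset of XPath, which is enough here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, Set

# Hypothetical search page; a real scraper would download it first.
PAGE = """
<html><body>
  <div class="result"><span class="title">Dracula</span><span class="price">4.99</span></div>
  <div class="result"><span class="title">  </span><span class="price">1.00</span></div>
  <div class="result"><span class="title">Frankenstein</span><span class="price">n/a</span></div>
</body></html>
"""

# Hypothetical XPath queries, as a source wrapper might store them.
QUERIES: Dict[str, str] = {
    "item": ".//div[@class='result']",
    "title": "span[@class='title']",
    "price": "span[@class='price']",
}


@dataclass(frozen=True)
class Book:
    title: str
    price: float


def scrape_books(page: str, queries: Dict[str, str]) -> Set[Book]:
    root = ET.fromstring(page.strip())
    books: Set[Book] = set()
    for node in root.findall(queries["item"]):
        title = (node.findtext(queries["title"]) or "").strip()
        raw_price = (node.findtext(queries["price"]) or "").strip()
        try:
            price = float(raw_price)
        except ValueError:
            continue  # invalid price: discard the result
        if not title:
            continue  # empty title: discard the result
        books.add(Book(title, price))
    return books


books = scrape_books(PAGE, QUERIES)
```

Only the first result survives validation: the second has a blank title and the third has a price that is not a number.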
To determine whether a result is similar to the keyword queried by the user, we implemented a similarity metric based on the Jaccard index, a value in the [0, 1] range that expresses how similar two strings are.
The basic algorithm is divided into the following steps:
- Tokenize both `keyword` and `target` strings;
- Remove stop words from keyword set `K` and target set `T`;
- Calculate the Jaccard index over `K` and `T`:
  - if `J(K, T)` is greater than or equal to `0.5`, then the `keyword` and `target` strings are similar;
  - else check whether set `K` is contained in set `T` or vice versa: if one of them is contained in the other, consider the `keyword` and `target` strings similar.
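The steps above can be sketched as follows, in Python for illustration rather than the project's PHP. The Jaccard index is J(K, T) = |K ∩ T| / |K ∪ T|. The stop-word list, the lowercasing, the regular-expression tokenizer and the rule that an empty token set is never similar are assumptions of this sketch, not details taken from the project.

```python
import re
from typing import Set

# Tiny illustrative stop-word list (assumption; the project's list is not shown).
STOP_WORDS = {"a", "an", "the", "of", "and", "il", "la", "di"}
THRESHOLD = 0.5


def tokenize(text: str) -> Set[str]:
    """Lowercase the text, split it into word tokens and drop stop words."""
    return set(re.findall(r"\w+", text.lower())) - STOP_WORDS


def jaccard(k: Set[str], t: Set[str]) -> float:
    """J(K, T) = |K ∩ T| / |K ∪ T|, defined here as 0.0 when both sets are empty."""
    if not k and not t:
        return 0.0
    return len(k & t) / len(k | t)


def is_similar(keyword: str, target: str) -> bool:
    k, t = tokenize(keyword), tokenize(target)
    if not k or not t:
        # Assumption: an empty set would be trivially "contained" in any
        # other set, so it is treated as not similar.
        return False
    if jaccard(k, t) >= THRESHOLD:
        return True
    # Fallback: containment of one token set in the other.
    return k <= t or t <= k


exact = is_similar("The Lord of the Rings", "Lord of the Rings")
contained = is_similar("dracula", "Dracula by Bram Stoker illustrated edition")
unrelated = is_similar("dracula", "frankenstein")
```

The containment fallback matters for short queries: a one-word keyword matched against a long title has a low Jaccard index even when the keyword appears in the title.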
The back end is written in object-oriented PHP 7, a MySQL database caches data about previous search results, and the front-end interface is built with Bootstrap.
PHP dependencies are managed with Composer, JavaScript dependencies with npm.
After cloning or downloading the repository or a release, make sure to run the following commands (`composer` and `npm` have to be installed):
- in project root: `composer install`
- in `view` subfolder: `npm install` (dependencies should be installed anyway, despite security warnings)
A web server has to be configured in order to properly use the routing functionality; PHP's integrated development server isn't enough for that, so please rely on Apache or nginx.
Adbis was developed on an Apache server configured to support PHP; the following alias configuration makes it reachable at the `http://localhost:8080/adbis/` URL:
Alias /adbis "<parentDir>/adbis/"
<Directory "<parentDir>/adbis">
Options Indexes FollowSymLinks MultiViews ExecCGI
AllowOverride All
Require all granted
</Directory>
Also make sure that the `mod_rewrite` module is enabled.
Adbis authors are Antonio Addeo (@AddeusExMachina) and Simone Bisogno (@bissim).