GitHub - Rayen-Allaya/scraper: a web scraper that obtains information from any website using just its URL and the CSS class you want to scrape. It has no predefined purpose, so you can use it to gather information from any site you like.

Index 🔖

FAQ
Demo
UMLs
Objective
Documentation
Usage Limitations
Contributions
Contributing Guidelines
Code Of Conduct

Enjoying this project? Please consider giving it a star ⭐️. Your support means a lot to us!

Objective ⭐

The main goal of this web scraper is to gather information from any website using its URL and a target CSS class. Beyond scraping, the application lets you automate API creation, analyze and compare websites, generate reports, export the obtained information in Excel format, and retrieve specific results using keywords. Because it has no predefined purpose, you can collect data from whichever sites you prefer.

Documentation 📖

Postman Documentation

Custom Usage ⚙️

Make a POST request to the /api/v1/scrappe endpoint at http://localhost:5000. The request body should contain the following parameters:

keyWord (string): The keyword to filter articles by (optional).
url (string): The URL of the web page to scrape (mandatory).
objectClass (string): The CSS class of the elements to scrape from the web page (mandatory).

The API endpoint responds with a JSON object containing the following properties:

state: A string indicating the state of the scraping process.
objects found: The number of objects found after filtering.
key-word: The stored record of the keyword used for filtering, along with a flag showing whether it was newly created.
scanned webpage: The stored record of the webpage that was scraped, including its URL and CSS class.
found articles: An array of articles that match the filtering criteria.

If the response is too large, the API uses compression middleware to reduce its size.
The API relies on a findOrCreate method for Mongoose so that scraping the same website more than once does not create duplicate results in the database.
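
Because large responses may arrive compressed, a client that reads the raw bytes itself has to decompress them before parsing. The following is a minimal sketch, assuming the server negotiates compression through the standard Accept-Encoding and Content-Encoding headers and that gzip is the encoding in use; the helper name decode_json_body is hypothetical and not part of this project.

```python
import gzip
import json
from typing import Optional


def decode_json_body(raw: bytes, content_encoding: Optional[str]) -> dict:
    """Decode a JSON response body, gunzipping it first when needed.

    `content_encoding` is the value of the Content-Encoding response
    header, or None when the header is absent (assumption: only gzip
    is handled here).
    """
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

HTTP clients such as browsers or Postman usually decompress responses automatically, so a helper like this is only needed with lower-level clients.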

Body Example

{
    "url": "https://www.url.com.ar",
    "objectClass": ".css-class-selector",
    "keyWord": "keyword"
}
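
As an illustration of sending this body from a script, here is a minimal Python sketch that uses only the standard library. It assumes the API is running at http://localhost:5000 as described above and that it accepts a JSON body with a Content-Type of application/json; the function names build_scrape_request and scrape are hypothetical.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the API runs locally on port 5000, as stated in this README.
BASE_URL = "http://localhost:5000"


def build_scrape_request(
    url: str, object_class: str, keyword: Optional[str] = None
) -> urllib.request.Request:
    """Build the POST request for /api/v1/scrappe without sending it."""
    body = {"url": url, "objectClass": object_class}
    if keyword:
        # keyWord is optional, so it is only included when provided.
        body["keyWord"] = keyword
    return urllib.request.Request(
        BASE_URL + "/api/v1/scrappe",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url: str, object_class: str, keyword: Optional[str] = None) -> dict:
    """Send the request and return the parsed JSON response."""
    request = build_scrape_request(url, object_class, keyword)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, scrape("https://www.url.com.ar", ".css-class-selector", "keyword") would send the body shown above.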

Response Example

{
    "state": "success",
    "objects found": 2,
    "key-word": {
        "doc": {
            "_id": "64d40fa677d90019c57302ed",
            "keyword": "keyword",
            "createdAt": "2023-08-09T22:13:58.108Z",
            "updatedAt": "2023-08-10T17:08:08.459Z",
            "__v": 0,
            "usedTimes": 28
        },
        "created": false
    },
    "scanned webpage": {
        "_id": "64d3e3459686e7f4087acfdb",
        "cssClass": ".css-class-selector",
        "url": "https://www.url.com.ar",
        "__v": 0,
        "createdAt": "2023-08-09T19:04:37.137Z",
        "scrapedTimes": 69,
        "updatedAt": "2023-08-10T17:08:08.328Z"
    },
    "found articles": [
        {
            "_id": "64d4fcf821aef9f1dd17bbb8",
            "websiteTarget": "64d3e3459686e7f4087acfdb",
            "keywords": [
                "64d40fa677d90019c57302ed"
            ],
            "title": "Some Title",
            "link": "/some/link/related/to/the/article",
            "createdAt": "2023-08-10T15:06:32.535Z",
            "updatedAt": "2023-08-10T17:08:08.643Z",
            "__v": 2
        }
    ]
}
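
To show how a response like this might be consumed, the sketch below extracts the title and link of each found article. It assumes that relative links, such as the one in the example, should be resolved against the URL of the scanned webpage; that is an interpretation of the example, not documented behavior. The function name summarize_articles is hypothetical.

```python
from urllib.parse import urljoin


def summarize_articles(response: dict) -> list:
    """Return a list of {title, link} dicts from a scrape response.

    Assumption: article links are relative to the scanned webpage URL,
    so they are joined with it to form absolute links.
    """
    base_url = response["scanned webpage"]["url"]
    return [
        {"title": article["title"], "link": urljoin(base_url, article["link"])}
        for article in response.get("found articles", [])
    ]
```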

Export data to xlsx

Make a POST request to /api/v1/export/to-excel. The request body should contain the following parameters:

scanned webpage (object): The scanned webpage object from the /api/v1/scrappe response (mandatory).
found articles (array of objects): The found articles array from the /api/v1/scrappe response (mandatory).

Body Example

{
    "scanned webpage": {
        "_id": "64d3e3459686e7f4087acfdb",
        "cssClass": ".css-class-selector",
        "url": "https://www.url.com.ar",
        "__v": 0,
        "createdAt": "2023-08-09T19:04:37.137Z",
        "scrapedTimes": 69,
        "updatedAt": "2023-08-10T17:08:08.328Z"
    },
    "found articles": [
        {
            "_id": "64d4fcf821aef9f1dd17bbb8",
            "websiteTarget": "64d3e3459686e7f4087acfdb",
            "keywords": [
                "64d40fa677d90019c57302ed"
            ],
            "title": "Some Title",
            "link": "/some/link/related/to/the/article",
            "createdAt": "2023-08-10T15:06:32.535Z",
            "updatedAt": "2023-08-10T17:08:08.643Z",
            "__v": 2
        }
    ]
}
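
Since the export body reuses two fields of the scrape response, it can be built directly from that response. The sketch below shows one way to do it in Python. It assumes the API runs at http://localhost:5000 and, more importantly, that the endpoint replies with the bytes of the .xlsx file in the response body; this README does not state the response format, so check it before relying on export_to_excel. All function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the API runs locally on port 5000, as stated in this README.
BASE_URL = "http://localhost:5000"


def build_export_body(scrape_response: dict) -> dict:
    """Keep only the two fields the export endpoint requires."""
    return {
        "scanned webpage": scrape_response["scanned webpage"],
        "found articles": scrape_response["found articles"],
    }


def build_export_request(scrape_response: dict) -> urllib.request.Request:
    """Build the POST request for /api/v1/export/to-excel without sending it."""
    return urllib.request.Request(
        BASE_URL + "/api/v1/export/to-excel",
        data=json.dumps(build_export_body(scrape_response)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def export_to_excel(scrape_response: dict, out_path: str = "export.xlsx") -> None:
    """Send the export request and save the reply to disk.

    Assumption: the response body is the raw .xlsx file.
    """
    request = build_export_request(scrape_response)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())
```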

Documentation

Usage Limitations

You can send up to 100 requests every 10 minutes.
If the webpage has incorrectly nested elements, the scraper will fail.
Before using this tool, please read the FAQ.
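
To stay under the request limit from a script, a client can throttle itself. The following is a minimal sketch of a sliding-window limiter using the 100 requests per 10 minutes figure stated above. Whether the server counts requests in a sliding or a fixed window is not documented here, so treat this as a conservative client-side guard; the class name SlidingWindowLimiter is hypothetical.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Track request timestamps and report how long to wait before the next one."""

    def __init__(self, max_requests=100, window_seconds=600.0, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._sent = deque()

    def wait_time(self) -> float:
        """Return the seconds to wait before another request is allowed (0.0 if none)."""
        now = self._clock()
        # Drop timestamps that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self._window:
            self._sent.popleft()
        if len(self._sent) < self._max:
            return 0.0
        return self._window - (now - self._sent[0])

    def record(self) -> None:
        """Register that a request was just sent."""
        self._sent.append(self._clock())
```

A typical loop would call time.sleep(limiter.wait_time()) before each request and limiter.record() right after sending it.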

Contributors ❤️

Special thanks to:

Contributions 📈

Contributions are welcome! Please read our guidelines.
