Site Scraping - Breakdown #28

Closed
appatalks opened this issue May 23, 2023 · 1 comment

https://github.com/simonw/strip-tags/tree/main

For example, for a CNN.com article:

$ cat /tmp/index.html | strip-tags '.article__content'


appatalks commented May 26, 2023

I need to figure out a way to handle external sources more effectively. Right now I'm preloading a list of predefined sources with a bash script and some creative curl commands. But if I ask it to scrape an arbitrary URL, how do I do the equivalent of curl from HTML and JavaScript? fetch runs into cross-origin (CORS) permission errors, and I'm trying to avoid Node.js.

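Just an example:

```javascript
// Fetch a URL's HTML via XMLHttpRequest, wrapped in a Promise.
// Note: in a browser this is still subject to the same-origin policy, so
// cross-origin URLs fail unless the target server sends CORS headers.
function scrapeURL(url) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function () {
      if (xhr.readyState === XMLHttpRequest.DONE) {
        if (xhr.status === 200) {
          resolve(xhr.responseText);
        } else {
          // A status of 0 usually means a network error or a blocked cross-origin request
          reject(new Error('Failed to retrieve the HTML content (status ' + xhr.status + ')'));
        }
      }
    };
    xhr.open('GET', url);
    xhr.send();
  });
}

// Usage example
const randomURL = 'https://example.com'; // Replace with the desired URL
scrapeURL(randomURL)
  .then((htmlContent) => {
    // Process the scraped HTML content
    console.log(htmlContent);
  })
  .catch((error) => {
    console.error('Error scraping the URL:', error);
  });
```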
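
One possible way around the cross-origin errors without Node.js in the browser is to send the request to an endpoint on the page's own origin and let the server do the actual download, for instance with the curl setup that already exists. The sketch below assumes such an endpoint. The `/proxy` path, the `url` query parameter, and the function names `buildProxyURL` and `scrapeViaProxy` are all hypothetical and not part of this project.

```javascript
// Hypothetical same-origin endpoint (an assumption, not an existing route).
// The server behind it would run curl or similar and return the fetched HTML.
const PROXY_PATH = '/proxy';

// Build the proxy request URL, percent-encoding the target so its own
// query string survives as a single parameter value.
function buildProxyURL(targetURL, proxyPath = PROXY_PATH) {
  const parsed = new URL(targetURL); // throws TypeError on a malformed URL
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error('Only http(s) URLs are supported: ' + targetURL);
  }
  return proxyPath + '?url=' + encodeURIComponent(parsed.href);
}

// Fetch the page through the proxy. The request goes to the page's own
// origin, so the browser does not apply cross-origin restrictions to it.
async function scrapeViaProxy(targetURL) {
  const response = await fetch(buildProxyURL(targetURL));
  if (!response.ok) {
    throw new Error('Proxy request failed with status ' + response.status);
  }
  return response.text();
}
```

Calling `scrapeViaProxy('https://example.com')` would then resolve with whatever HTML the server-side fetch returned. The protocol check is only a first line of defense: an endpoint like this should also validate the target on the server, since it will otherwise fetch any URL it is handed.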
