Site Scraping - Breakdown #28

appatalks · 2023-05-23T21:07:35Z

https://github.com/simonw/strip-tags/tree/main

For example, with a CNN.com article:

$ cat /tmp/index.html | strip-tags '.article__content'
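
For comparison, a rough browser-side analogue of that selector-based extraction might look like the sketch below. This is not how strip-tags itself is implemented. `extractText` and `collapseWhitespace` are hypothetical names, and the sketch assumes a browser environment where `DOMParser` is a built-in global.

```javascript
// Collapse runs of whitespace so the extracted text reads as one clean string.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: return the text of every element matching `selector`,
// one match per line, with all markup dropped.
// ASSUMPTION: runs in a browser, where DOMParser is available.
function extractText(html, selector) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  return Array.from(doc.querySelectorAll(selector))
    .map((el) => collapseWhitespace(el.textContent))
    .filter((text) => text.length > 0)
    .join('\n');
}
```

With the page source in a string, `extractText(html, '.article__content')` would play roughly the same role as the command above.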

appatalks · 2023-05-26T00:47:58Z

I need to figure out a way to handle external sources more effectively. Right now I'm preloading a list of predefined sources with a bash script and creative curl commands. But if I ask to scrape a random URL, how do I do the equivalent of curl from HTML and JavaScript? Fetch runs into cross-origin (CORS) permission errors, and I'm trying to avoid Node.js.

Just an example:

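```javascript
// Fetch the raw HTML of a page, wrapped in a Promise.
// Note: in a browser, XMLHttpRequest is subject to the same cross-origin
// restrictions as fetch, so this only works for URLs that allow it.
function scrapeURL(url) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function () {
      // Wait until the request has fully completed.
      if (xhr.readyState === XMLHttpRequest.DONE) {
        if (xhr.status === 200) {
          resolve(xhr.responseText);
        } else {
          // A blocked or failed request ends up here with status 0.
          reject(new Error(`Failed to retrieve the HTML content (status ${xhr.status})`));
        }
      }
    };
    xhr.open('GET', url);
    xhr.send();
  });
}

// Usage example
const randomURL = 'https://example.com'; // Replace with the desired URL
scrapeURL(randomURL)
  .then((htmlContent) => {
    // Process the scraped HTML content
    console.log(htmlContent);
  })
  .catch((error) => {
    console.error('Error scraping the URL:', error);
  });
```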
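
The XMLHttpRequest version above is subject to the same cross-origin restrictions as fetch, so it tends to fail on arbitrary URLs in the same way. One common workaround, sketched here, is to send the request to a small same-origin endpoint that downloads the page server-side (for instance with the existing curl-based script, so Node.js is not required). The `/proxy` path, its `url` query parameter, and the function names are all assumptions made for illustration, not something described in this thread.

```javascript
// Build the same-origin proxy URL for a target page.
// ASSUMPTION: the server exposes a hypothetical `/proxy?url=...` endpoint that
// downloads the page (e.g. with curl) and returns its HTML.
function buildProxyURL(targetURL, proxyPath = '/proxy') {
  // Validate early: new URL() throws a TypeError on malformed input.
  const parsed = new URL(targetURL);
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error('Only http and https URLs are supported');
  }
  return `${proxyPath}?url=${encodeURIComponent(parsed.href)}`;
}

// Fetch a page through the proxy. `fetchImpl` is injectable so the function
// can be exercised without a network connection.
async function scrapeViaProxy(targetURL, fetchImpl = fetch) {
  const response = await fetchImpl(buildProxyURL(targetURL));
  if (!response.ok) {
    throw new Error(`Proxy request failed with status ${response.status}`);
  }
  return response.text();
}
```

Because the browser only ever talks to its own origin, no CORS headers are needed from the target site. Usage would be along the lines of `scrapeViaProxy('https://example.com').then((html) => console.log(html))`.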

appatalks closed this as completed Mar 13, 2024
