scrape contents of a webpage and convert to org-mode markup language, adds as subtree to org document

cphouser/org-scrape

Usage

example-1.org

...
** [[https://github.com/xeroxcat/org-scrape]] 

...

Executing org-scrape-link with the cursor on the above link replaces the link with a copy of the page (fetched with the Python requests library, parsed with BeautifulSoup, and converted to org syntax with pandoc):

...
** GitHub - xeroxcat/org-scrape: scrape contents of a webpage and convert to org-mode markup language, adds as subtree to org document
[[#start-of-content][Skip to content]]
[[/join?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E%2F%3Crepo-name%3E&source=header-repo][Sign up]]
- Why GitHub?
  [[/features][Features →]]
  - [[/features/code-review/][Code review]]
  - [[/features/project-management/][Project management]]
  - [[/features/integrations][Integrations]]
...
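
The fetch, parse, and convert steps described above can be sketched in Python roughly as follows. This is a minimal illustration, not the contents of scrape.py: the function names (build_pandoc_command, html_to_org, scrape_page), the 30-second timeout, and the choice of the "html.parser" backend are all assumptions. It also assumes pandoc is on PATH and that requests and beautifulsoup4 are installed.

```python
import subprocess


def build_pandoc_command():
    """Return a pandoc invocation that reads HTML on stdin and writes org on stdout."""
    return ["pandoc", "--from=html", "--to=org"]


def html_to_org(html):
    """Pipe an HTML string through pandoc and return the org-mode text."""
    result = subprocess.run(
        build_pandoc_command(),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pandoc exits non-zero
    )
    return result.stdout


def scrape_page(url):
    """Fetch `url` and return (page title, org-formatted page body)."""
    # Third-party imports stay local so the helpers above work without them.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Fall back to the URL when the page has no <title> element.
    title = soup.title.get_text(strip=True) if soup.title else url
    return title, html_to_org(str(soup))
```

The page title is returned separately because the example above uses it as the text of the new heading.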

example-2.org

...
** [[https://github.com/xeroxcat/org-scrape][div#readme]]


...

When the link has a description, the description is interpreted as a CSS selector, and only the first matching element is formatted and inserted.

...
** GitHub - xeroxcat/org-scrape: scrape contents of a webpage and convert to org-mode markup language, adds as subtree to org document: div#readme
**** README.org
     :PROPERTIES:
     :CUSTOM_ID: readme.org
     :CLASS:    Box-title pr-3
     :END:
*** [[#usage][]]Usage
    :PROPERTIES:
    :CUSTOM_ID: usage
    :END:
 =example-1.org=
     ...
...
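
The link-description rule shown in example-2.org can be illustrated with a short Python sketch. The real link handling lives in org-scrape.el, so everything here is hypothetical: the names parse_org_link, heading_title, and first_match_html are invented for illustration, and the "title: selector" heading format is inferred from the example output above. select_one is the BeautifulSoup call that returns only the first element matching a selector.

```python
import re

# Matches [[target]] and [[target][description]].
ORG_LINK = re.compile(r"\[\[([^\[\]]+)\](?:\[([^\[\]]*)\])?\]")


def parse_org_link(text):
    """Return (url, selector) for the first org link in `text`, or None.

    `selector` is None when the link has no description.
    """
    match = ORG_LINK.search(text)
    if match is None:
        return None
    url, description = match.group(1), match.group(2)
    return url, (description or None)


def heading_title(page_title, selector):
    """Compose the heading text: the page title, plus the selector if one was given."""
    return f"{page_title}: {selector}" if selector else page_title


def first_match_html(html, selector):
    """Return the HTML of the first element matching `selector`, or None."""
    from bs4 import BeautifulSoup  # third-party; imported lazily

    element = BeautifulSoup(html, "html.parser").select_one(selector)
    return str(element) if element is not None else None
```

For the link in example-2.org, parse_org_link yields the URL together with the selector div#readme; for the link in example-1.org the selector is None and the whole page is converted.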

Designed for easily capturing formatted text from websites into an org file, with a focus on extracting the same field from a set of identically formatted pages.

Setup

scrape.py

Scrape to org-mode formatted text.

Usage:
  scrape.py <url> [-e element] [-n] [-t]

Options:
  -e element  A CSS select string specifying the element to scrape from
  -n          Don't remove blank lines from output
  -t          Don't remove all org mode <<targets>> generated by pandoc
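
To make the -n and -t options concrete, here is a minimal sketch of the cleanup they switch off. It assumes argparse as a stand-in for whatever argument parsing scrape.py actually uses, and the names build_parser, postprocess, and ORG_TARGET are hypothetical. The regular expression for <<targets>> is a simplification.

```python
import argparse
import re

# An org target looks like <<name>>; this pattern is a simplified approximation.
ORG_TARGET = re.compile(r"<<[^<>\n]+>>")


def build_parser():
    """Build a parser mirroring the usage text above (argparse is an assumption)."""
    parser = argparse.ArgumentParser(
        prog="scrape.py", description="Scrape to org-mode formatted text."
    )
    parser.add_argument("url")
    parser.add_argument("-e", dest="element", metavar="element",
                        help="CSS selector for the element to scrape")
    parser.add_argument("-n", dest="keep_blank_lines", action="store_true",
                        help="don't remove blank lines from output")
    parser.add_argument("-t", dest="keep_targets", action="store_true",
                        help="don't remove org mode targets generated by pandoc")
    return parser


def postprocess(org_text, keep_blank_lines=False, keep_targets=False):
    """Apply the default cleanup unless the matching flag disables it."""
    if not keep_targets:
        org_text = ORG_TARGET.sub("", org_text)
    lines = org_text.splitlines()
    if not keep_blank_lines:
        # Drop lines that are empty or contain only whitespace.
        lines = [line for line in lines if line.strip()]
    return "\n".join(lines)
```

With neither flag set, both cleanups run, which matches the wording of the options: each flag turns one cleanup step off rather than on.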

configuration

  • Edit the shebang line of scrape.py to point to a Python 3 environment that has the libraries in requirements.txt installed.

org-scrape.el

An elisp snippet that defines a function to convert a link at the cursor to a heading in the current document rendered by scrape.py.

configuration (spacemacs)

  • Edit the path of scrape.py to its stored location.
  • Paste the snippet into the body of a (with-eval-after-load 'org body) form in dotspacemacs/user-config.
