example-1.org
...
** [[https://github.com/xeroxcat/org-scrape]]
...
Executing org-scrape-link with the cursor on the above link replaces it with a copy of the page (fetched with the Python requests library, parsed with BeautifulSoup, and converted to org syntax with pandoc):
...
** GitHub - xeroxcat/org-scrape: scrape contents of a webpage and convert to org-mode markup language, adds as subtree to org document
[[#start-of-content][Skip to content]]
[[/join?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E%2F%3Crepo-name%3E&source=header-repo][Sign up]]
- Why GitHub? [[/features][Features →]]
- [[/features/code-review/][Code review]]
- [[/features/project-management/][Project management]]
- [[/features/integrations][Integrations]]
...
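The fetch, parse, and convert steps named above can be sketched roughly as follows. This is not the code in scrape.py; the function names (pandoc_command, fetch_html, first_match, html_to_org, scrape_to_org), the 30-second timeout, and the choice of the html.parser backend are all assumptions made for illustration. It also assumes a pandoc executable is available on PATH.

```python
import subprocess
from typing import List, Optional


def pandoc_command() -> List[str]:
    """Command line asking pandoc to read HTML on stdin and write org on stdout."""
    return ["pandoc", "--from=html", "--to=org"]


def fetch_html(url: str) -> str:
    """Download a page with the requests library."""
    # Third-party import kept local so the module loads without requests installed.
    import requests

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def first_match(html: str, selector: Optional[str] = None) -> str:
    """Return the whole page, or only the first element matching a CSS selector."""
    # Third-party import kept local for the same reason as above.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    if selector is None:
        return str(soup)
    node = soup.select_one(selector)
    if node is None:
        raise ValueError(f"no element matches {selector!r}")
    return str(node)


def html_to_org(html: str) -> str:
    """Pipe an HTML fragment through pandoc and return the org text."""
    result = subprocess.run(
        pandoc_command(),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def scrape_to_org(url: str, selector: Optional[str] = None) -> str:
    """Hypothetical end-to-end helper: fetch, select, convert."""
    return html_to_org(first_match(fetch_html(url), selector))
```

Calling scrape_to_org("https://github.com/xeroxcat/org-scrape") would, under these assumptions, return org text similar in shape to the output shown above.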
example-2.org
... ** [[https://github.com/xeroxcat/org-scrape][div#readme]] ...
When the link has a description, the description is interpreted as a CSS selector, and only the first matching element is formatted and inserted.
...
** GitHub - xeroxcat/org-scrape: scrape contents of a webpage and convert to org-mode markup language, adds as subtree to org document: div#readme
**** README.org
:PROPERTIES:
:CUSTOM_ID: readme.org
:CLASS: Box-title pr-3
:END:
*** [[#usage][]]Usage
:PROPERTIES:
:CUSTOM_ID: usage
:END:
=example-1.org=
...
...
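To make the link handling concrete, here is a minimal sketch of how an org link could be split into a URL and an optional CSS selector, and then turned into a scrape.py command line using the -e option. The names ORG_LINK, parse_org_link, and scrape_argv are hypothetical, and the actual elisp function may work differently.

```python
import re
from typing import List, Optional, Tuple

# Matches [[target]] or [[target][description]].
# Simplification: descriptions containing square brackets (for example a CSS
# attribute selector such as a[href]) are not handled by this pattern.
ORG_LINK = re.compile(r"\[\[([^\]\[]+)\](?:\[([^\]\[]*)\])?\]")


def parse_org_link(text: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (url, selector) for the first org link in text, or None."""
    match = ORG_LINK.search(text)
    if match is None:
        return None
    url, description = match.group(1), match.group(2)
    # A missing or empty description means "scrape the whole page".
    return url, (description or None)


def scrape_argv(url: str, selector: Optional[str]) -> List[str]:
    """Build the argument list for scrape.py; -e is added only with a selector."""
    argv = ["scrape.py", url]
    if selector:
        argv += ["-e", selector]
    return argv


if __name__ == "__main__":
    link = "** [[https://github.com/xeroxcat/org-scrape][div#readme]]"
    parsed = parse_org_link(link)
    if parsed is not None:
        print(scrape_argv(*parsed))
```

A link without a description yields a selector of None, so the resulting command scrapes the full page, matching the behaviour of the first example.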
Designed for easily capturing formatted text from websites into an org file, with a focus on extracting the same field from a set of identically formatted pages.
Scrape to org-mode formatted text.
Usage:
  scrape.py <url> [-e element] [-n] [-t]

Options:
  -e element  A CSS select string specifying the element to scrape from
  -n          Don't remove blank lines from output
  -t          Don't remove all org mode <<targets>> generated by pandoc
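The -n and -t options above imply two default cleanup passes over pandoc's output: dropping blank lines and dropping org <<targets>>. A minimal sketch of such passes is shown below. The function names and the regular expression are assumptions for illustration and are not taken from scrape.py.

```python
import re

# An org target is written <<name>>. This pattern is an assumption about what
# needs removing; it is not copied from scrape.py.
ORG_TARGET = re.compile(r"<<[^<>\n]+>>")


def strip_targets(org_text: str) -> str:
    """Remove every <<target>> marker (the default; -t would skip this)."""
    return ORG_TARGET.sub("", org_text)


def strip_blank_lines(org_text: str) -> str:
    """Remove empty and whitespace-only lines (the default; -n would skip this)."""
    return "\n".join(line for line in org_text.splitlines() if line.strip())


def postprocess(
    org_text: str,
    keep_blank_lines: bool = False,
    keep_targets: bool = False,
) -> str:
    """Apply the default cleanup, honouring hypothetical -n / -t style switches."""
    if not keep_targets:
        org_text = strip_targets(org_text)
    if not keep_blank_lines:
        org_text = strip_blank_lines(org_text)
    return org_text


if __name__ == "__main__":
    sample = "* <<usage>>Usage\n\nSome text\n"
    print(postprocess(sample))
```

Targets are removed before blank lines so that a line holding nothing but a target is also dropped by the second pass.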
- Edit the shebang line of scrape.py to point to a Python 3 environment with the libraries in requirements.txt.
An elisp snippet that defines a function to convert a link at the cursor to a heading in the current document rendered by scrape.py.
- Edit the path of scrape.py to its stored location.
- Paste the snippet into the body of the (with-eval-after-load 'org body) form in dotspacemacs/user-config.