Convert an HTML page to markdown, including re-linking and downloading of related images.


This is a project I'm using to migrate content - it may or may not do exactly what you want for your content, but hopefully it's useful.

Reverse engineer markdown from an HTML page, including:

  • Re-linking and downloading of images
  • Front Matter metadata generation (Currently only YAML is supported)

Usage as a dotnet tool

dotnet tool install dotnet-html2md -g


html2md --uri|-u <URI> [--uri|-u <URI> [ ... ]] --output|-o <OUTPUT LOCATION>


--image-output|-i <IMAGE OUTPUT LOCATION>
If no image output location is specified, images will be written to the same folder as the markdown file.

--include-tags|--it|-t <TAG|XPATH,[TAG|XPATH[,...]]>
If unspecified the entire body tag will be processed, otherwise only text contained in the specified tags will be processed.

--exclude-tags|--et|-e <TAG|XPATH,[TAG|XPATH[,...]]>
Allows for specific tags to be ignored.

--image-path-prefix|--ipp <IMAGE PATH PREFIX>
The prefix to apply to all rendered image URLs - helpful when you're going to be serving images from a 
different location, relative or absolute.

--default-code-language <LANGUAGE>
The default language to use on code blocks converted from pre tags - defaults to csharp

--code-language-class-map <CLASSNAME:LANGUAGE,CLASSNAME:LANGUAGE,...>
Map between a pre tag's class names and languages. E.g. you might map the class name "sh_csharp" to "csharp" 
and "sh_powershell" to "powershell".

--front-matter-data <PROPERTY:[XPATH|{{MACRO}}|{{'CONSTANT'}}]>
Allows for configuration of information to be extracted to a Front Matter property. This can be an XPath to an element 
or attribute in the HTML page, a string constant or a supported macro.
Supported macros:
RelativeUriPath: The relative path of the page being converted, e.g. /pages/page-1

--front-matter-data-list <PROPERTY:XPATH[:Date]>
Allows for configuration of list-based information to be extracted to a Front Matter property. You can optionally specify
that the data should be formatted as a date. (Values not convertible to dates will be rendered as-is.)

--logging <None|Trace|Debug|Information|Warning|Error|Critical>
By default no logging takes place - you can turn on logging at different levels with this flag.

Usage as a nuget package

Install-Package Html2md.Core


var converter = new MarkdownConverter(new ConversionOptions());

ConversionResult converted = await converter.ConvertAsync("");

// Alternatively you can convert multiple pages at once:

ConversionResult converted = await converter.ConvertAsync(

You can also extract Front Matter metadata:

var options = new ConversionOptions
{
    FrontMatter =
    {
        Enabled = true,
        SingleValueProperties =
        {
            { "Title", "//h1" },
            { "Author", "{{'Mike Goatly'}}" },
            { "RedirectFrom", @"{{RelativeUriPath}}" }
        },
        ArrayValueProperties =
        {
            { "Tags", @"//p[@class='tags']/a" }
        }
    }
};

var converter = new MarkdownConverter(options);

ConversionResult converted = await converter.ConvertAsync("");

Where the resulting markdown would be:

---
Title: Article Title
Author: Mike Goatly
RedirectFrom: /some-article
Tags:
  - Help
  - Coding
---


ConvertedDocument is the result of a conversion process, containing:

  • Documents: The markdown representations of all the converted pages.
  • Images: A collection of images referenced in the documents. Each image includes the downloaded raw data as a byte array.


In ConversionOptions you can specify:

  • ImagePathPrefix: The prefix to apply to all rendered image URLs - helpful when you're going to be serving images from a different location, relative or absolute.
  • DefaultCodeLanguage: The default code language to apply to code blocks mapped from pre tags. The default is csharp.
  • IncludeTags: The set of tags or XPaths for tags to include in the conversion process. If this is empty then all elements will be processed.
  • ExcludeTags: The set of tags or XPaths for tags to exclude from the conversion process. You can use this if there are certain parts of a document you don't want translating to markdown, e.g. aside, nav, etc.
  • CodeLanguageClassMap: A dictionary mapping between class names that can appear on pre tags and the language they map to. E.g. you might map the class name "sh_csharp" to "csharp" and "sh_powershell" to "powershell".
  • FrontMatter: Configuration for how Front Matter metadata should be emitted into a converted document.
    • Enabled: Whether Front Matter metadata should be emitted. Defaults to false.
    • SingleValueProperties: Configuration of information to be extracted to a Front Matter property. This can be an XPath to an element or attribute in the HTML page, a string constant or a supported macro. Supported macros: RelativeUriPath.
    • ArrayValueProperties: Configuration of list-based information to be extracted to a Front Matter property.

Converted content

<em> and <i>

<em>italic</em> becomes *italic*

<strong> and <b>

<strong>bold</strong> becomes **bold**


Linked images from the same domain (relative or absolute) are downloaded and returned in the Images collection of the ConvertedDocument. Images from a different domain are not downloaded and their URLs are left untouched.

With ConversionOptions.ImagePathPrefix of "":

<img src="" alt="My image"> becomes ![My image](img.png)

With ConversionOptions.ImagePathPrefix of "/static/images/":

<img src="" alt="My image"> becomes ![My image](/static/images/img.png)
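The re-linking rule above can be sketched in a few lines. This is only an illustration, not the library's implementation: `ImageLinkSketch` and `ToMarkdown` are hypothetical names, "same domain" is assumed to mean "same host", and the rewritten link is assumed to be the prefix followed by the image's file name, as the two examples suggest.

```csharp
using System;
using System.IO;

public static class ImageLinkSketch
{
    // Hypothetical helper: builds the markdown for one <img> tag.
    public static string ToMarkdown(Uri pageUri, string src, string alt, string imagePathPrefix)
    {
        // Resolve relative sources against the page being converted.
        var imageUri = new Uri(pageUri, src);

        // Assumption: "same domain" means the hosts match.
        // Images on another host are not downloaded, so the URL is kept as-is.
        if (!string.Equals(imageUri.Host, pageUri.Host, StringComparison.OrdinalIgnoreCase))
        {
            return $"![{alt}]({src})";
        }

        // Same-host images are re-linked to prefix + file name.
        var fileName = Path.GetFileName(imageUri.AbsolutePath);
        return $"![{alt}]({imagePathPrefix}{fileName})";
    }
}
```

Calling `ImageLinkSketch.ToMarkdown` with a page on an illustrative host such as example.com, a `src` of `/images/img.png` and a prefix of `/static/images/` would yield `![My image](/static/images/img.png)`; an empty prefix would yield `![My image](img.png)`.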


<a href="">Some blog</a> becomes [Some blog](

If the link is to an image, then the image is downloaded and the link's URL is updated as with images.


Paragraph tags cause an additional new line to be inserted after the paragraph's text.

<p>para 1</p><p>para 2</p> becomes:

para 1

para 2


<blockquote>quoted text</blockquote> becomes:

> quoted text

Nested styling is also supported, though you'll currently get additional lines if you use multiple paragraphs. This doesn't seem to bother any renderers I've seen so far:

<blockquote>
    <p>Para 1</p>
    <p>Para 2</p>
</blockquote>

becomes:

> Para 1
> Para 2

<h1> ... <h6>

Header tags get converted to the markdown equivalent:

<h2>Header 2</h2><h3>Header 3</h3> becomes:

## Header 2

### Header 3


Tables are converted, though markdown tables are much more limited than HTML tables.

Where a header row is present in the source it is used as the markdown table header:

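<table>
    <tr><th>Header 1</th><th>Header 2</th></tr>
    <tr><td>1-1</td><td>1-2</td></tr>
    <tr><td>2-1</td><td>2-2</td></tr>
</table>

becomes:
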
| Header 1 | Header 2 |
| --- | --- |
| 1-1 | 1-2 |
| 2-1 | 2-2 |

If no header row is found, the first row of the table is assumed to be the header:



| 1-1 | 1-2 |
| --- | --- |
| 2-1 | 2-2 |
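
The header rule for tables can be sketched as follows. This is a rough illustration under stated assumptions rather than the converter's actual code: `TableSketch` and its methods are hypothetical, cells are assumed to be plain text already, and the `---` delimiter row is simply the form markdown requires after a header.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableSketch
{
    // Used when the source table has an explicit header row.
    public static string WithHeader(string[] header, IEnumerable<string[]> body)
    {
        var lines = new List<string>
        {
            FormatRow(header),
            // Markdown needs a delimiter row between header and body.
            FormatRow(header.Select(_ => "---")),
        };
        lines.AddRange(body.Select(row => FormatRow(row)));
        return string.Join("\n", lines);
    }

    // Used when no header row is found: the first row is promoted to header.
    // Assumes the table has at least one row.
    public static string PromotingFirstRow(IReadOnlyList<string[]> rows)
    {
        return WithHeader(rows[0], rows.Skip(1));
    }

    private static string FormatRow(IEnumerable<string> cells)
    {
        return "| " + string.Join(" | ", cells) + " |";
    }
}
```

Because a markdown table cannot exist without a header, promoting the first row is the only way to keep a headerless HTML table in table form.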


<pre>content</pre> becomes:


However, if the pre tag has a code class name it will have the DefaultCodeLanguage in the ConversionOptions applied to it:

<pre class="code">content</pre> with a DefaultCodeLanguage of csharp becomes:

``` csharp
content
```

Additionally, if you have configured the CodeLanguageClassMap mapping lang_ps to powershell:

<pre class="lang_ps">content</pre> becomes:

``` powershell
content
```

As would <pre class="code lang_ps">content</pre>, because the class names are looked up in the CodeLanguageClassMap before falling back to the default code language.
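
The lookup order described above might be sketched like this. It is a minimal illustration, not the library's code: `CodeLanguageSketch.Resolve` is a hypothetical helper, the class attribute is assumed to be whitespace-separated, and an empty string stands in for "no language".

```csharp
using System;
using System.Collections.Generic;

public static class CodeLanguageSketch
{
    // Returns the language for a <pre> tag's class attribute,
    // or an empty string when no language applies.
    public static string Resolve(
        string classAttribute,
        IReadOnlyDictionary<string, string> classMap,
        string defaultLanguage)
    {
        var classNames = classAttribute.Split(' ', StringSplitOptions.RemoveEmptyEntries);

        // 1. Mapped class names win, so "code lang_ps" resolves to the mapped language.
        foreach (var name in classNames)
        {
            if (classMap.TryGetValue(name, out var mapped))
            {
                return mapped;
            }
        }

        // 2. Otherwise the generic "code" class selects the default language.
        if (Array.IndexOf(classNames, "code") >= 0)
        {
            return defaultLanguage;
        }

        // 3. A plain <pre> gets no language.
        return string.Empty;
    }
}
```

With a map of `lang_ps` to `powershell` and a default of `csharp`, the class attributes `code`, `lang_ps` and `code lang_ps` would resolve to `csharp`, `powershell` and `powershell` respectively.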


<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>

becomes:

- Item 1
- Item 2
- Item 3


<ol>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ol>

becomes:

1. Item 1
1. Item 2
1. Item 3

Markdown renderers should automatically apply the correct numbering to lists like this.


