Convert Docx to Markdown With Pandoc


I prefer to use Microsoft Word for most of my writing, but I really like Markdown. I prefer Word because its spelling and grammar checker is superior to that of every other word processor or text editor I have tried. In addition, Word has text-to-speech built in. I use it to have my text read aloud to me, and I catch a lot of errors this way. Although I write my blog posts in English, it is not my first language, and I need these tools to keep spelling and grammar errors to a minimum.

This blog uses the static site generator Pelican (Update: this blog is now using WordPress) (Update 2023: this blog now uses Hugo), which generates the blog from either reStructuredText or Markdown files. I have written about Pelican in my blog post The Static Site Generator Pelican VS WordPress.

I have been using Pandoc to convert Markdown to Word documents or PDFs for years. A Google search for a way to convert from Word to Markdown did not give any usable result. Therefore, up until now I have just copied and pasted the text, making sure not to add any Markdown syntax until after I had done the spell checking in Word.

Then, a couple of weeks ago, I was reading the Pandoc docs to solve a different problem and came across the section that describes how Pandoc can convert from docx to Markdown. I do not know if this is new or why Google did not find it for me, but I immediately forgot the problem I was trying to solve and began testing it.

It turns out to be quite simple to convert a docx to markdown. The following example is from the Pandoc demos site.

pandoc -s example30.docx -t markdown -o example35.md

However, the generated markdown from the above command has a few issues.

The lines are hard-wrapped at a fixed width; Pandoc's default is 72 columns. I do not know why wrapping is the default, but I do not like it. Fortunately, this is easy to fix with the option --wrap=none.

Links do not use the reference style. I prefer reference-style links because they make the text less cluttered by moving the link targets to the bottom of the file. This is also easy to fix with the option --reference-links.

With the two options added the command looks like this.

pandoc -s example30.docx --wrap=none --reference-links -t markdown -o example35.md
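If there are several documents to convert, the command above can be wrapped in a small shell function. This is only a minimal sketch: the name docx2md is my own invention and not part of Pandoc, and it assumes every input file name ends in .docx and that the Markdown file should be written next to it.

```shell
# docx2md: hypothetical helper name, not part of Pandoc.
# Converts each .docx file given as an argument to a .md file next to it,
# using the same options as the command above.
docx2md() {
    for src in "$@"; do
        # Strip the .docx suffix and append .md to get the output name.
        out="${src%.docx}.md"
        # Stop at the first file that Pandoc fails to convert.
        pandoc -s "$src" --wrap=none --reference-links -t markdown -o "$out" || return 1
    done
}
```

With the function loaded in the shell, running docx2md on first-post.docx and second-post.docx would write first-post.md and second-post.md. The file names are quoted throughout, so names containing spaces should work as well.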

Now the generated Markdown is very readable and close to what I would write myself. I only use Word to write text with simple formatting like lists, italic, bold, and links. I add the syntax for images and code to the generated Markdown file, alongside the metadata that Pelican needs. Although I do not use it at this time, Pandoc can also extract images from a .docx file.

The option to extract images from the docx file and more can be found on the Pandoc options page.

Edit: The options page URL has changed and is now http://pandoc.org/README.html#reader-options
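For reference, the option for extracting images is --extract-media, which takes a directory as its argument. The sketch below adds it to the command from earlier. I have not used this in my own workflow, so treat it as a starting point: the function name docx2md_media and the default directory name media are assumptions of mine, not something Pandoc defines.

```shell
# docx2md_media: hypothetical helper name, not part of Pandoc.
# Converts one .docx file to Markdown and asks Pandoc to write the
# embedded images under the directory given as the second argument.
# The default directory name "media" is an assumption for this sketch.
docx2md_media() {
    src="$1"
    media_dir="${2:-media}"
    # Strip the .docx suffix and append .md to get the output name.
    out="${src%.docx}.md"
    pandoc -s "$src" --wrap=none --reference-links \
        --extract-media="$media_dir" -t markdown -o "$out"
}
```

Calling docx2md_media with post.docx and images as arguments would write post.md and place the extracted files under the images directory. Pandoc adjusts the image references in the generated Markdown so that they point to the extracted files.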

So there you have it: sometimes what you need is right under your nose :).