pandoc - issue #132

Better handling of tables in HTML to markdown conversion


Posted on Mar 5, 2009 by Swift Elephant

Describe the proposed feature, including illustrative examples.

Currently Pandoc 'handles' tables in HTML to markdown conversion by just stripping all table markup tags (<table> <tr> <th> <td>), usually leaving each table cell's content as a normal text paragraph.

Best of all would be if Pandoc could convert HTML tables that contain no nested tables into Pandoc markdown tables, and leave tables with nested tables in place as HTML. Second best would be if all tables were simply left as HTML, perhaps with an option to choose whether to strip them or leave them alone.

I attach a Perl script which simulates the behavior I'm looking for. It also covers the issue with <pre> blocks, which I am filing at the same time as this one.

/BP

Comment #1

Posted on Mar 11, 2009 by Swift Elephant

I encountered an issue with my conversion script and some input files, so I changed it to use a temporary file when converting content with tidy and pandoc through the shell.

New version attached.

/BP

Comment #2

Posted on Mar 25, 2009 by Grumpy Dog

If you run pandoc with --parse-raw, the table tags will be passed through unscathed. Is this sufficient for your needs? I hesitate to try to write an HTML table -> pandoc table converter, because (i) HTML tables are still sometimes used for layout in web pages, and (ii) pandoc tables include information about the relative widths of columns, which would be hard to derive from HTML tables.

Comment #3

Posted on Apr 5, 2009 by Massive Wombat

Sorry, I had totally missed --parse-raw. It is indeed what I needed. See, two days of writing bad Perl for nothing! Not really: I still think the HTML to Pandoc markdown conversion is kinda neat! :-)

Comment #4

Posted on Apr 5, 2009 by Swift Elephant

Apologies for writing that last comment under the wrong identity. bpjonsson@gmail.com, melroch@gmail.com and bpj@melroch.se are all me, in case you haven't noticed!

/BP

Comment #5

Posted on Nov 1, 2009 by Grumpy Dog

I'd like to close this bug. But I thought your script might be useful to others. Would you mind if I put a link to it (or the script itself) on the pandoc website, under Extras? If you think this is a good idea, you should perhaps add a license to the source file.

Comment #6

Posted on Dec 5, 2009 by Grumpy Dog

(No comment was entered for this change.)

Comment #7

Posted on Feb 6, 2010 by Swift Elephant

I've written a new and IMHO improved script for converting HTML tables into markdown. It no longer tries to integrate pandoc or tidy: you have to pipe output from pandoc's HTML to markdown conversion (with the --parse-raw option) into it, like this:

pandoc -r html -R -w markdown in.html | perl html-table2pandoc >out.md

Input must be UTF-8 encoded, and output is always UTF-8 encoded too. Not much of a limitation, since the same applies to Pandoc!

Comment #8

Posted on Mar 11, 2015 by Quick Elephant

I'm currently using pandoc for a knitr clone in Python, which uses IPython as a backend. Unfortunately, most objects in IPython are only converted to HTML and not to markdown (like pandas.DataFrames and results from statsmodels). It would be really nice if pandoc could support the use case of converting a string like <table>...</table> to the right markdown representation, so that the converted table can be included in the markdown file, which is then converted to docx/html/pdf...
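
For illustration only, here is a minimal sketch of the kind of conversion being asked for, using nothing but Python's standard library. It is not how pandoc does it and not the Perl script from earlier in this thread. The names TableCollector and html_table_to_markdown are made up for this example. It assumes the first row is the header, that cells contain plain text only, and that a pipe-style table is an acceptable output. As suggested in the original request, input containing a nested table is returned unchanged as HTML.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect cell text from an HTML table (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []       # list of rows, each a list of cell strings
        self.depth = 0       # current <table> nesting depth
        self.nested = False  # set when a table appears inside a table
        self._cell = None    # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth > 1:
                self.nested = True
        elif tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell without an enclosing <tr>
                self.rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the row stays intact.
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append(text.replace("|", "\\|"))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def html_table_to_markdown(html):
    """Return a pipe table, or the input unchanged if it cannot be converted."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    rows = [r for r in parser.rows if r]
    if parser.nested or not rows:
        return html  # leave nested or empty tables as raw HTML
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad short rows
    lines = ["| " + " | ".join(rows[0]) + " |",
             "|" + "|".join(["---"] * width) + "|"]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ("<table><tr><th>a</th><th>b</th></tr>"
              "<tr><td>1</td><td>2</td></tr></table>")
    print(html_table_to_markdown(sample))
```

The sketch ignores colspan, rowspan, alignment and any inline markup inside cells, so it only suits simple data tables of the sort a DataFrame would produce.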

Comment #9

Posted on Mar 22, 2015 by Quick Elephant

Update: According to https://github.com/jgm/pandoc/issues/2015 tables are now parsed, but there is a bug in pandoc <=1.13.2, which has problems with "bbtext" ( in normal rows).

Status: WontFix

Labels:
Type-Enhancement Priority-Medium