How to convert PDF to HTML? [closed]

Is there a proper library I can use to convert PDF to HTML, or to some other format that can easily be converted to HTML? I searched similar questions, but had no luck. I want to be able to extract text from PDFs, and possibly images. I'm not looking to embed the PDF inside the HTML.

asked Dec 3, 2011 at 18:44 by Luchian Grigore

6 Answers

If you're on Linux, try pdftohtml :

sudo apt-get install poppler-utils
pdftohtml -enc UTF-8 -noframes infile.pdf outfile.html

On macOS (with Homebrew), pdftohtml can be installed with:

brew install pdftohtml

The open-source ebook converter Calibre can also convert PDF files to HTML and is available on macOS, Windows, and Linux.
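If you have many files, the pdftohtml call above can be scripted. The following is only a sketch: it assumes pdftohtml is already installed and on your PATH, and the function names (build_pdftohtml_command, convert_directory) are made up for illustration. It uses exactly the flags from the command above.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, html_path):
    """Return the argument list for the pdftohtml call shown above."""
    return [
        "pdftohtml",
        "-enc", "UTF-8",   # output text encoding
        "-noframes",       # do not generate a frameset
        str(pdf_path),
        str(html_path),
    ]


def convert_directory(directory):
    """Convert every *.pdf in `directory`, writing each .html next to its PDF."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH; install poppler-utils first")
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        html_path = pdf_path.with_suffix(".html")
        # check=True raises CalledProcessError if pdftohtml exits with an error
        subprocess.run(build_pdftohtml_command(pdf_path, html_path), check=True)
```

For example, convert_directory("docs") would convert every PDF in a (hypothetical) docs folder.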

answered Nov 27, 2016 at 22:37

Please note all layout will be gone.

Commented Nov 8, 2020 at 18:02

Is there any way to inline the images so I don't need to host JPGs?

Commented Dec 16, 2020 at 13:26

@chovy Supply the -dataurls option to generate inline images. Supply -c to generate complex HTML, with each page of the PDF on a separate HTML page and the page layout more or less preserved. I noticed that the images, boxes, and other decorations on each page are rendered as a single background image, while the text is extracted and shown in front of it. That keeps the layout more or less the same, with some minor overlapping, and the result is quite interesting. Example use:

pdftohtml -dataurls -c pdf_file_with_bookmarks.pdf sample_output.html
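For anyone wondering what "inline images" means here: a data URL carries the image bytes, base64-encoded, inside the HTML itself, so there is no separate JPG to host. The sketch below shows that general mechanism only. It is not pdftohtml's own code, and the helper names (to_data_url, img_tag) are made up for illustration.

```python
import base64


def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a data URL usable in an <img src=...> attribute."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def img_tag(image_bytes, mime_type="image/jpeg"):
    """Build an <img> element whose content is embedded rather than linked."""
    return f'<img src="{to_data_url(image_bytes, mime_type)}">'
```

The trade-off is size: base64 output is roughly a third larger than the raw bytes, so a page with many images becomes noticeably heavier.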

Commented Mar 16, 2021 at 13:42

How to install on Arch?

Commented Aug 15, 2021 at 6:21

Pdftohtml effectively only extracts text from the PDF. All formatting/font/colors are removed, which makes this tool fairly useless as a "converter". HTML is a lot more than just text. Calibre unfortunately does the same thing.

Commented Mar 23, 2023 at 18:40

Like I mentioned in the comment above, it is definitely possible to convert PDF to HTML using the tool Able2Extract 7, which can be downloaded from here.

I have been using this tool for almost 2 years now and I am pretty happy with it. It lets you convert PDF to Word, Excel, PowerPoint, Publisher, HTML, OpenOffice formats, and more.

Important note: This tool is not freeware.

answered Jun 7, 2012 at 6:27 by Siddharth Rout

This tool is good at accurately converting pdf to .html or .docx. I use it with Calibre to pre-process a .pdf file into .html or .docx, so it will render correctly on my eReader (Kindle or Sony).

Commented Feb 20, 2013 at 12:57

Actually, pdf.investintech.com lets you convert a PDF to HTML online. I tried it with a research paper, and the conversion was pretty accurate, except for mathematical formulae. One drawback, though, is that the output is not very smart: for example, each line is wrapped in its own absolutely positioned div.

Commented Aug 9, 2014 at 13:03

Why is almost every response to this question on Stack Overflow like an advert for a paid-for solution?

Commented Nov 2, 2020 at 4:31

@JayCroghan Way back in 2012, there was actually no reliable freeware.

Commented Nov 2, 2020 at 4:53

@SiddharthRout It seems that even now there isn't really any great freeware for this.

Commented Nov 3, 2020 at 5:02

pdfbox-2.0.3.jar
fontbox-2.0.3.jar
preflight-2.0.3.jar
xmpbox-2.0.3.jar
pdfbox-tools-2.0.3.jar
pdfbox-debugger-2.0.3.jar

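import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.tools.PDFText2HTML;

public class PdfToHtml {
    public static void main(String[] args) {
        // "infile.pdf" is a placeholder path; the original snippet left the file reading elided.
        // try-with-resources closes the document and the stream automatically.
        try (InputStream is = new FileInputStream("infile.pdf");
             PDDocument pdd = PDDocument.load(is)) { // in-memory representation of the PDF document
            PDFText2HTML converter = new PDFText2HTML(); // the converter
            String html = converter.getText(pdd); // That's it!
            System.out.println(html);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}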

Please note: Images do not get pushed to the HTML output.

answered Nov 23, 2016 at 20:42 by Sergio Muriel

This library seems to work better - but it produces invalid, unparsable HTML. This is quite disappointing for such an Apache project.

Commented Jul 7, 2019 at 8:21

It's not that difficult to convert PDF to HTML. There are many online options, which may, however, expose your data to third parties. Follow these steps, and the output is great.

1. Open the pdf2htmlEX page. (You can either follow the steps below or follow the directions on that page.)
2. The package is available for download for Windows from here. Of the many options available, I recommend downloading "pdf2htmlEX-win32-0.14.6-upx-with-poppler-data.zip (pdf2htmlEx.exe is packed with UPX)".
3. After downloading and unzipping, the conversion is just one cmd command away.

C:\Users\kjk\Downloads\pdf2htmlEX-win32-0.14.6-upx-with-poppler-data>pdf2htmlEX.exe c:\1\abc.pdf

Final Command:

pdf2htmlEX.exe c:\1\abc.pdf

abc.pdf will be converted to HTML and saved as abc.html in the folder the command was run from, which here is the folder containing the exe.
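The same command can be driven from a script. This is only a sketch under two assumptions: that pdf2htmlEX is invoked exactly as above (executable followed by the PDF path, no extra options), and that it writes the .html into the folder it is run from. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def expected_output_name(pdf_path):
    """abc.pdf -> abc.html, matching the naming described above."""
    return Path(pdf_path).with_suffix(".html").name


def convert_with_pdf2htmlex(exe_path, pdf_path, work_dir):
    """Run pdf2htmlEX from `work_dir` and return the expected output path."""
    work_dir = Path(work_dir)
    subprocess.run(
        # Absolute paths, so the call still works after changing the working directory
        [str(Path(exe_path).resolve()), str(Path(pdf_path).resolve())],
        cwd=work_dir,   # assumption: output lands in the working directory
        check=True,     # raise CalledProcessError if the conversion fails
    )
    return work_dir / expected_output_name(pdf_path)
```

Running from a chosen work_dir keeps the generated HTML out of the download folder; check that the returned path actually exists before relying on it.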