Friday, October 18, 2013

Convert document to HTML with Apache Tika

Apache Tika has a wonderful feature: it can transform a source document (PDF, MS Office, OpenOffice, etc.) into HTML during content extraction. It sounds pretty simple, but I dug through a lot of Google search results and couldn't find a simple working example anywhere.

So here is a working snippet I extracted from tika-app:

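import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.ExpandedTitleContentHandler;

// Serialize the SAX events Tika emits as indented UTF-8 HTML.
ByteArrayOutputStream out = new ByteArrayOutputStream();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
handler.setResult(new StreamResult(out));
// Renders the title as <title></title> instead of a self-closed <title/>.
ExpandedTitleContentHandler handler1 = new ExpandedTitleContentHandler(handler);

// tikaParser is a Tika parser instance; file is the source document as a byte[].
tikaParser.parse(new ByteArrayInputStream(file), handler1, new Metadata());
return new String(out.toByteArray(), "UTF-8");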
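
If you are curious what the TransformerHandler half of this does on its own, here is a rough sketch that needs only the JDK, no Tika on the classpath. The class name SaxToHtmlSketch and the render method are made up for illustration, and the hand-fired SAX events are only a stand-in for what a Tika parser would emit while reading a real document.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class: shows only the JAXP serialization step.
class SaxToHtmlSketch {

    // Wraps the given text in html/body/p and returns the serialized HTML.
    static String render(String text) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        // An identity handler: SAX events in, serialized markup out.
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));

        // Fire the events by hand; in the real snippet the parser does this.
        AttributesImpl noAttrs = new AttributesImpl();
        char[] chars = text.toCharArray();
        handler.startDocument();
        handler.startElement("", "html", "html", noAttrs);
        handler.startElement("", "body", "body", noAttrs);
        handler.startElement("", "p", "p", noAttrs);
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "p", "p");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();

        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Tika & friends"));
    }
}
```

The point of the sketch is that the handler does not care where the events come from, which is why the same handler can simply be passed to the parser as its ContentHandler.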
 
The tika-app snippet works pretty nicely. Here is an example of an original MS Office document:

And here is how the result looks in my webapp as an HTML preview:

