PHP Classes

Extract PDF to text and XML: I need to parse a PDF file and convert whole text into XML

By Anand Lagad - 8 years ago (2015-06-14)

I need PHP code to parse any PDF file and convert it into the XML format.

I do not think we can inspect HTML-like tags inside a PDF, so I think we should first parse the whole PDF and then convert it into XML.

What I want is this: if the PDF document contains a table, the table field names should become XML tags and the table data should become their values.
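
The XML half of this request can be sketched independently of the PDF parsing. The example below is a minimal illustration, assuming the table has already been extracted into a header row and data rows; the function name `tableToXml`, the `table` and `row` wrapper elements, and the sample data are all hypothetical.

```php
<?php
// Hypothetical helper: turn one extracted table (header row + data rows)
// into XML where each column name becomes an element name.
function tableToXml(array $headers, array $rows): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $table = $doc->appendChild($doc->createElement('table'));

    // Column names may contain spaces or start with digits, which XML
    // element names do not allow, so normalise them first.
    $tags = array_map(function (string $h): string {
        $tag = preg_replace('/[^A-Za-z0-9_]+/', '_', trim($h));
        return preg_match('/^[A-Za-z_]/', $tag) ? $tag : '_' . $tag;
    }, $headers);

    foreach ($rows as $row) {
        $rowEl = $table->appendChild($doc->createElement('row'));
        foreach ($tags as $i => $tag) {
            $cell = $doc->createElement($tag);
            // createTextNode escapes &, < and > in the cell value.
            $cell->appendChild($doc->createTextNode((string) ($row[$i] ?? '')));
            $rowEl->appendChild($cell);
        }
    }
    return $doc->saveXML();
}

// Made-up sample data standing in for a table extracted from a PDF.
echo tableToXml(
    ['Name', 'Unit Price'],
    [['Widget', '9.99'], ['Nut & Bolt', '0.50']]
);
```

Header text is normalised because a field such as "Unit Price" is not a legal XML element name as written. Getting the header and rows out of the PDF in the first place is the hard part and is not covered by this sketch.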

  • 1 Clarification request
  • 1. Manuel Lemos - 8 years ago (2015-06-15)

    I do not think that right now there is a class here that can convert an arbitrary PDF document to XML, HTML or any format that preserves the document structure.

    There are classes for converting PDF to images of the pages, but I am not sure if that would address your needs.

    There are solutions that require using external Web services or external programs like xpdf or Ghostscript. If that would do for you, maybe somebody can submit a class that wraps around those Web services or programs.
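
A wrapper of the kind described above could look roughly like the following. This is only a sketch, assuming the `pdftotext` command-line tool (shipped with xpdf) is installed and on the PATH; the function names `buildPdfToTextCommand` and `extractPdfText` are hypothetical. It returns plain text, so it does not by itself recover tables or other document structure.

```php
<?php
// Build the command line for pdftotext. The "-layout" flag asks it to keep
// the physical layout of the page, and "-" sends the text to stdout.
function buildPdfToTextCommand(string $pdfPath, string $binary = 'pdftotext'): string
{
    return escapeshellcmd($binary) . ' -layout ' . escapeshellarg($pdfPath) . ' -';
}

// Run the external program and return the extracted text.
// Assumes pdftotext is installed and on the PATH.
function extractPdfText(string $pdfPath): string
{
    if (!is_readable($pdfPath)) {
        throw new InvalidArgumentException("Cannot read file: $pdfPath");
    }
    exec(buildPdfToTextCommand($pdfPath), $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("pdftotext failed with exit code $exitCode");
    }
    return implode("\n", $lines);
}
```

Calling `extractPdfText('report.pdf')` would return the text of a hypothetical `report.pdf`. The path is passed through `escapeshellarg()` so that file names containing spaces or shell metacharacters cannot alter the command.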

    2 Recommendations

    Sweeper: Clean HTML to remove unwanted tags and attributes

    Jill Lingoff (package author, reputation 25) - 5 years ago (2018-12-07)

    Here are two methods.

    One: Use a custom mapping table when doing File > Save As in Adobe Acrobat (http://flaurora-sonora.000webhostapp.com/Clean%20HTML%20V2.3.zip). Installing this file makes "Clean HTML v2.3" appear in the "Save as type" select box.

    Two: Use Adobe Acrobat's "Save As... HTML (.html, .htm)" option, then use the "clean_PDF" Sweeper profile.

    Neither method is likely to convert the structure of the PDF content to HTML perfectly. This is due to the difference between the PDF and HTML formats themselves: PDF positions content on a page, while HTML holds content in a nested structure. Funnily, a PDF is made accessible exactly by applying HTML-like tags to its content.

    So, in short, PDFs often do not contain the sort of content structure desired. Achieving that structure involves converting from PDF as cleanly as possible and then using manual or automated methods (like Sweeper) to create it.
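
To illustrate what "creating that structure" can mean in an automated step, here is a minimal sketch that rebuilds table rows from positioned text. It assumes some earlier step has already produced text fragments with page coordinates; the fragment shape (`x`, `y`, `text`), the function name `groupIntoRows`, the tolerance value, and the sample data are all assumptions for illustration, not the output of any particular tool.

```php
<?php
// Group positioned text fragments into rows, then order each row left to right.
// Each fragment is ['x' => float, 'y' => float, 'text' => string]; this shape
// is an assumption made for illustration. Here y is taken to grow down the
// page; PDF coordinates usually grow upward, so flip the sort if needed.
function groupIntoRows(array $fragments, float $tolerance = 2.0): array
{
    // Sort by vertical position so fragments on the same line are adjacent.
    usort($fragments, fn($a, $b) => $a['y'] <=> $b['y']);

    $rows = [];
    $currentY = null;
    foreach ($fragments as $f) {
        // Start a new row when the vertical gap exceeds the tolerance.
        if ($currentY === null || abs($f['y'] - $currentY) > $tolerance) {
            $rows[] = [];
            $currentY = $f['y'];
        }
        $rows[count($rows) - 1][] = $f;
    }

    // Within each row, order by horizontal position and keep only the text.
    return array_map(function (array $row): array {
        usort($row, fn($a, $b) => $a['x'] <=> $b['x']);
        return array_column($row, 'text');
    }, $rows);
}

// Made-up fragments: two columns, two lines, in scrambled order.
$fragments = [
    ['x' => 120.0, 'y' => 50.4, 'text' => 'Price'],
    ['x' => 20.0,  'y' => 50.0, 'text' => 'Name'],
    ['x' => 20.0,  'y' => 70.0, 'text' => 'Widget'],
    ['x' => 120.0, 'y' => 70.3, 'text' => '9.99'],
];
print_r(groupIntoRows($fragments));
```

A fixed tolerance is the simplest possible heuristic. Real documents with wrapped cells, merged columns, or mixed font sizes would likely need something more careful.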


    PHP DOC DOCX PDF to Text Converter: Convert DOCX, DOC, PDF to plain text

    Dave Smith (reputation 6845) - 8 years ago (2015-06-14)

    The innovation nomination description indicates that this class will extract document elements in addition to text, which is what you will need to extract tables.

    • 8 Comments
    • 1. Manuel Lemos - 8 years ago (2015-06-15)

      I think the original poster wants a solution that preserves the original document structure. So, just extracting text may not be enough for him.

    • 2. adam berger - 8 years ago (2015-06-15)

      An interesting project. I would be happy to try the same class to convert PDF to XML. I am waiting for results :)

    • 3. adam berger - 8 years ago (2015-06-15) in reply to comment 2 by adam berger

      I suggest first converting to HTML, keeping that result in a cache, and then converting it to XML. This can be done on the fly using the cache.

    • 4. Manuel Lemos - 8 years ago (2015-06-15) in reply to comment 3 by adam berger

      Well, XHTML is still HTML and XML.
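
The HTML-then-XML idea from the comments above can be sketched with PHP's DOM extension, which parses HTML and can serialise the resulting tree as well-formed XML. This is a minimal illustration, assuming the PDF has already been converted to HTML by some other means; the function name `htmlToXml` and the sample markup are hypothetical.

```php
<?php
// Parse (possibly sloppy) HTML and re-serialise it as well-formed XML.
function htmlToXml(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    // Serialising with saveXML() writes empty elements in self-closing form.
    return $doc->saveXML($doc->documentElement);
}

// Made-up HTML standing in for the output of a PDF-to-HTML conversion.
echo htmlToXml('<html><body><p>Total<br>9.99</p></body></html>');
```

This only changes the serialisation. The element names stay HTML ones (`p`, `br`, `table`), so mapping them to custom XML tags would still be a separate step.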

    • 5. Dave Smith - 8 years ago (2015-06-15) in reply to comment 1 by Manuel Lemos

      If the comments for the innovation nomination of this class are correct, and I am not misreading them, the class should be able to get the document elements, not just text. That is the basis of my recommendation.

    • 6. Manuel Lemos - 8 years ago (2015-06-16) in reply to comment 5 by Dave Smith

      What the nomination comments say is that extracting document elements is not a trivial task. That class just extracts text using a simple approach.

    • 7. Dave Smith - 8 years ago (2015-06-16) in reply to comment 6 by Manuel Lemos

      Okay, looks like I was confused. Better to have tried and failed than to not have tried at all :)

      Looks like adam berger will attempt the non-trivial task.

    • 8. Manuel Lemos - 8 years ago (2015-06-16) in reply to comment 7 by Dave Smith

      That is OK, maybe my wording was not ideal either.

