python - How to extract text and text coordinates from a PDF file?


I want to extract the text boxes and text box coordinates from a PDF file.

Many other Stack Overflow posts address various solutions that try to extract text in an ordered fashion, but it took me quite a while to figure out how to do the intermediate step of getting the text and the text locations.

So once I found it, I thought it was worth posting here. Given a PDF file, the output should look like:

   489, 41,  "signature"
   500, 52,  "b"
   630, 202, "a_g_i_r"

Newlines are converted to underscores in the final output. This is the minimal working solution I found.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer

# Open a PDF file.
fp = open('/Users/me/Downloads/test.pdf', 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# A password, if needed, can be supplied as the 2nd parameter.
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object (replaces the plain device above).
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)

def parse_obj(lt_objs):

    # Loop over the object list.
    for obj in lt_objs:

        # If it's a text box, print its text and location (x0, y0 of the bounding box).
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print("%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_')))

        # If it's a container, recurse.
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# Loop over all pages in the document.
for page in PDFPage.create_pages(document):

    # Read the page into a layout object.
    interpreter.process_page(page)
    layout = device.get_result()

    # Extract the text from this object.
    parse_obj(layout._objs)
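
If you would rather collect the results than print them, the same recursive walk can return a list of rows. The following is only a sketch under a few assumptions: the names `collect_text_boxes`, `format_row`, `FakeBox` and `FakeFigure` are made up for illustration and are not part of pdfminer, and the two `Fake*` classes are stand-ins that mimic only the attributes the walk touches (`bbox`, `get_text()`, iteration over children) so the snippet runs without a PDF.

```python
def collect_text_boxes(lt_objs, box_type, container_type):
    """Return (x0, y0, text) for every box_type in lt_objs, recursing into container_type."""
    rows = []
    for obj in lt_objs:
        if isinstance(obj, box_type):
            # bbox is (x0, y0, x1, y1); keep the first corner, as in the code above.
            text = obj.get_text().replace("\n", "_")
            rows.append((obj.bbox[0], obj.bbox[1], text))
        elif isinstance(obj, container_type):
            # Containers are iterated directly instead of going through _objs.
            rows.extend(collect_text_boxes(obj, box_type, container_type))
    return rows


def format_row(row):
    """Format one row the same way the print call above does."""
    x0, y0, text = row
    return "%6d, %6d, %s" % (x0, y0, text)


# Hypothetical stand-ins for pdfminer layout objects (illustration only).
class FakeBox(object):
    def __init__(self, bbox, text):
        self.bbox = bbox
        self._text = text

    def get_text(self):
        return self._text


class FakeFigure(object):
    def __init__(self, children):
        self._children = children

    def __iter__(self):
        return iter(self._children)


if __name__ == "__main__":
    page = [
        FakeBox((489.0, 41.5, 560.0, 55.0), "signature\n"),
        FakeFigure([FakeBox((630.2, 202.9, 640.0, 260.0), "a\ng\ni\nr\n")]),
    ]
    for row in collect_text_boxes(page, FakeBox, FakeFigure):
        print(format_row(row))
```

With pdfminer itself, the call would presumably be `collect_text_boxes(layout, pdfminer.layout.LTTextBoxHorizontal, pdfminer.layout.LTFigure)`, where `layout` is the object returned by `device.get_result()` in the page loop. Note that the coordinates are in PDF space, so the origin is the bottom-left corner of the page.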
