Extract Text From Pdf With Formatting Python

This article looks at how we can use Python to work with PDF (Portable Document Format) files. PDF files can contain text, images, links, audio, and video, and you can also add hyperlinks to them. This article will show you how to extract text and images from PDF files using Python.

The topics we are covering in this article are given below.

Reading text PDF files.
Reading tables in PDF files.
Extracting images from PDF files.
Writing a PDF file.

Working with PDF files in Python is easy: you can use different Python libraries/modules such as PyPDF2, tabula-py, and PyMuPDF. We are going to use some of these libraries in this tutorial because they are simple to work with; you only need to install the library and run some code in your IDE. So, let's start with how to extract text and images from PDF using Python.

Contents

1 Reading PDF files
2 Reading tables in PDF files
- 2.1 Step -1: Get a sample file
- 2.2 Step -2: Install the required library/module
3 Extracting images from PDF files
4 Writing PDF files
5 Final Words

Reading PDF files

Step -1: Get a sample file

The first thing we need for reading PDF files is a .pdf file (sample.pdf). Once you have the .pdf file to work with, let's get to the coding.
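Before moving on, it can be worth confirming that the sample file really is a PDF. The following is a minimal sketch and not part of the original steps: the helper name `looks_like_pdf` is hypothetical, and it only checks that the file begins with the `%PDF-` marker that PDF files normally start with.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the '%PDF-' header marker.

    Hypothetical helper for illustration; it does not validate the
    rest of the file.
    """
    # Open in binary mode so the bytes are read exactly as stored on disk.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

Calling `looks_like_pdf('sample.pdf')` before the later steps gives an early hint if the file was renamed from another format or is empty.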

Step -2: Install the required library/module

You need to install a library called PyPDF2 for Python. The code below uses the older PdfFileReader interface, which was removed in PyPDF2 3.0, so install a release earlier than 3.0 by running this command in your terminal.

          pip3 install "PyPDF2<3.0"

Step -3: Writing the code

Open your IDE (I am using PyCharm; you can use a different one like VS Code) and start writing code. But before that, let's see the steps we need to write the code:

Import the PyPDF2 module in your IDE.
Open the PDF file in binary mode and save the file object as PDFfile.
Create an object of the PdfFileReader class.
Print the number of pages in the PDF file using the 'numPages' property. It tells us the number of pages (our PDF file has 206 pages).
Then create a page object by specifying the page index (starting with 0) of the page whose content we are extracting. Here we extract text from index 85, which is the 86th page.
Now use a function called 'extractText()' to extract the text from the specified page.
Lastly, close the PDF file.

Now let's see the process in Python code:

          # import the PyPDF2 module
          import PyPDF2

          # open the PDF file in binary mode
          PDFfile = open('sample.pdf', 'rb')

          PDFfilereader = PyPDF2.PdfFileReader(PDFfile)

          # print the number of pages
          print(PDFfilereader.numPages)

          # provide the page index (0-based)
          pages = PDFfilereader.getPage(85)

          # extract the text from that page
          print(pages.extractText())

          # close the PDF file
          PDFfile.close()

Output:

          206
          76pronounced:declareddiscreet......................................................... .........................Complete the Table as shown below. Comprehension.

In the first line of the output you can see a number (206); that is the number of pages in the file. The rest of the text is the content of the specified page.
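Because page indexes start at 0, it is easy to be off by one when you think in printed page numbers. The sketch below is a small illustration of that conversion; the helper name `page_index` is hypothetical and not part of PyPDF2.

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number into the 0-based index used by getPage().

    Hypothetical helper: raises ValueError when the page does not exist.
    """
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            f"page {page_number} is outside the range 1..{num_pages}"
        )
    return page_number - 1


# The 86th page of a 206-page file is index 85.
print(page_index(86, 206))
```

With a helper like this, the value passed to getPage() could be written as `page_index(86, PDFfilereader.numPages)`, which also catches requests for pages that are not in the file.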

Reading tables in PDF files

Step -1: Get a sample file

The first thing we need for reading a table in a PDF file is a .pdf file (sample.pdf) that contains a table. Once you have the .pdf file to work with, let's get to the coding.

Step -2: Install the required library/module

Method -1:

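You need to install a library called tabula-py for Python, which helps read tables in a PDF file. The example below also uses the tabulate package to print the result. Note that tabula-py needs a Java runtime installed on your machine. You can install both packages by running a command in your terminal:

          pip3 install tabula-py tabulate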

Open your IDE (I am using PyCharm; you can use a different one like VS Code) and start writing code. But before that, let's see the steps we need to take to write the code:

First, import the tabula library.
Second, read the PDF file that contains a table.

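          from tabula import read_pdf
          from tabulate import tabulate

          # read the tables from the PDF file
          # (recent tabula-py releases return a list of DataFrames)
          dfs = read_pdf("abc.pdf", pages="all")  # path of the PDF file

          # print each table
          for df in dfs:
              print(tabulate(df))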

You can also read multiple tables as independent tables. You can use the code below to do so:

          import tabula

          # select the PDF file
          file = "sample.pdf"

          # read both tables as independent tables
          tables = tabula.read_pdf(file, pages=1, multiple_tables=True)
          print(tables[0])
          print(tables[1])
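Once the tables are available as a list, a natural next step is to save each one to its own file. The sketch below assumes each item is a pandas DataFrame (or any object with a compatible `to_csv` method); the function name `save_tables` and the `table_N.csv` naming scheme are hypothetical choices made for illustration.

```python
from pathlib import Path


def save_tables(tables, out_dir="tables"):
    """Write each table in the list to its own CSV file.

    Hypothetical helper: `tables` is assumed to be a list of pandas
    DataFrames, such as the list returned by tabula.read_pdf().
    Returns the list of paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for number, table in enumerate(tables, start=1):
        path = out / f"table_{number}.csv"
        # index=False leaves the DataFrame row index out of the file
        table.to_csv(path, index=False)
        paths.append(path)
    return paths
```

For example, `save_tables(tables)` would write the two tables read from the page into a `tables` folder next to the script.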

Method -2:

You need to install a library called camelot-py for Python. It helps to read tables in a PDF file. You can install it by running a command in your terminal:

          pip3 install camelot-py

Let's see the steps we need to write the code:

Import the Camelot library.
Extract all the tables from the PDF.
Finally, print them.

It's a very simple process: you can just copy and paste the code into your IDE, but don't forget to keep the PDF file in the same folder as the Python file.
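The three steps above can be sketched as follows. This is a minimal illustration rather than a definitive version: the function names `extract_tables` and `print_all_tables` are hypothetical, the file name is assumed to be sample.pdf, and Camelot is imported inside the function so that the module is only required when tables are actually read. It relies on `camelot.read_pdf()` with `pages="all"` and on each returned table exposing its data as a pandas DataFrame through `.df`.

```python
def extract_tables(path, pages="all"):
    """Read the tables from a PDF with Camelot (hypothetical wrapper)."""
    # Imported here so the rest of the script works without Camelot installed.
    import camelot

    return camelot.read_pdf(path, pages=pages)


def print_all_tables(tables):
    """Print every extracted table and return how many were printed."""
    count = len(tables)
    for i in range(count):
        print(f"--- Table {i} ---")
        # Each Camelot table holds its data as a DataFrame in `.df`
        print(tables[i].df)
    return count
```

To run it on the sample file, call `print_all_tables(extract_tables("sample.pdf"))`.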

Extracting images from PDF files

Step -1: Get a sample file

The first thing we need for extracting images from PDF files is a .pdf file (sample.pdf) that contains the images you want to extract. Once you have the .pdf file to work with, let's get to the coding.

Step -2: Install the required library/module

You need to install a library called PyMuPDF (you can use PyPDF2 as well, but this is easier) for Python. You can install it by running a command in your terminal.

          pip3 install PyMuPDF Pillow

Step -3: Writing the code

Let's start writing the code. But before that, let's see the steps we need to take:

Import the fitz module (the import name of PyMuPDF) in your IDE.
Next, create a variable called file and store the file name "sample.pdf" in it.
Then open the PDF file with fitz.open().
Then create another variable called image_list by calling get_page_images() on the PDF (named getPageImageList() in older PyMuPDF releases) with a page number.
Next, loop over this image list.
From each entry we extract the xref, because we just want the pixels (if you want, you can extract other things such as the position or properties of the image).
Next, convert it into a pixmap; for that we create a variable called pix.
Then add an "if" condition: if the image is grayscale or RGB we simply save it; otherwise we convert it to RGB first and then save it.
Lastly, print the number of images detected.

          # import the library
          import fitz

          file = 'sample.pdf'

          # open the PDF file
          pdf = fitz.open(file)

          # select the page number (getPageImageList in older PyMuPDF releases)
          image_list = pdf.get_page_images(0)

          # loop over the images
          for image in image_list:
              xref = image[0]
              pix = fitz.Pixmap(pdf, xref)
              if pix.n - pix.alpha < 4:
                  # grayscale or RGB: save directly (writePNG in older releases)
                  pix.save(f'{xref}.png')
              else:
                  # CMYK: convert to RGB first, then save
                  pix1 = fitz.Pixmap(fitz.csRGB, pix)
                  pix1.save(f'{xref}.png')
                  pix1 = None
              pix = None

          # print the number of images found
          print(len(image_list), 'detected')

Output:

          2 detected
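If you would rather keep each image in its original format instead of converting it to PNG, PyMuPDF also provides `extract_image()`, which returns the raw image bytes together with a file extension. The sketch below is one possible way to use it, under stated assumptions: `doc` is a document opened with `fitz.open()`, and the function name `save_page_images` and the `images` output folder are hypothetical.

```python
from pathlib import Path


def save_page_images(doc, page_index=0, out_dir="images"):
    """Save every image on one page in its original format.

    Hypothetical helper: `doc` is assumed to be a PyMuPDF document.
    Returns the list of files that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    saved = []
    for image in doc.get_page_images(page_index):
        xref = image[0]  # the first item of each entry is the xref
        info = doc.extract_image(xref)  # dict with the raw bytes and extension
        path = out / f"{xref}.{info['ext']}"
        path.write_bytes(info["image"])
        saved.append(path)
    return saved
```

For example, `save_page_images(fitz.open('sample.pdf'), 0)` would write the images from the first page into the `images` folder.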

Writing PDF files

We're going to use the FPDF module to write the PDF file. Install the FPDF module using the command below:

          pip3 install fpdf==1.7

Once you're done with the installation, use the code below to write the PDF file:

          from fpdf import FPDF

          text = "Hello, this text will be stored in PDF file by GeekyHumans"

          # create the document and add a page
          pdf = FPDF()
          pdf.add_page()
          pdf.set_xy(0, 0)
          pdf.set_font('arial', 'B', 13.0)

          # write the text, left-aligned
          pdf.cell(ln=0, h=5.0, align='L', w=0, txt=text, border=0)

          # save the file
          pdf.output('test.pdf', 'F')

Now you're good to go with the PDF. A new PDF file will be created in the same folder where your Python code resides.
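One thing to watch for: with the built-in fonts, this version of FPDF encodes text as Latin-1, so characters outside that range can raise an encoding error when the file is written. The sketch below shows one way to sanitize text beforehand; the helper name `to_latin1` is hypothetical, and replacing unsupported characters with "?" is just one possible policy.

```python
def to_latin1(text):
    """Replace characters that cannot be encoded as Latin-1 with '?'.

    Hypothetical helper for preparing text before passing it to pdf.cell().
    """
    return text.encode("latin-1", "replace").decode("latin-1")


# Accented Latin-1 letters survive; the check mark does not.
print(to_latin1("café ✓"))
```

The cleaned string can then be passed as the `txt` argument of `pdf.cell()`.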

Final Words

In this article, we covered how to extract text and images from PDF using Python. Writing and reading a PDF file can be a tough task as it involves a lot of elements such as text, images, and tables. But we made it simple for you to understand the basics of manipulating a PDF file using Python. I hope you understood the code and it was easy for you to implement. Please let us know in the comment section if you're facing any problems or you're not able to run the code.
