Tuesday 6 December 2022

Using PyPDF2 to extract text from a pdf

Today I learnt something very useful: how to extract text from a pdf file in Python, using the PyPDF2 module.

First I installed it, as I've written up on my blog here.

Then I wanted to extract text from the Supplementary File of a paper by Monir et al. 2022.

I wrote a small Python script to do this, extract_data_from_pdf_file.py :

# Python script to extract data from a pdf file.

import os
import sys
import PyPDF2

#====================================================================#

def main():
       
    # check the command-line arguments:
    if len(sys.argv) != 2 or not os.path.exists(sys.argv[1]):
        print("Usage: %s input_pdf_file" % sys.argv[0])
        sys.exit(1)

    input_pdf_file = sys.argv[1]

    # following the example at https://www.geeksforgeeks.org/extract-text-from-pdf-file-using-python/:
    # create a pdf file object (opened in binary mode):
    pdfFileObj = open(input_pdf_file, 'rb')
    # create a pdf reader object:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # print the number of pages in the pdf file:
    print("Number of pages in input pdf file: %d" % pdfReader.numPages)
    # create a page object for the first page (pages are numbered from 0):
    pageObj = pdfReader.getPage(0)
    # extract text from the page:
    print(pageObj.extractText())
    # close the pdf file object:
    pdfFileObj.close()

    print("FINISHED\n")

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#
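
A note on PyPDF2 versions: the script above uses the method names from PyPDF2 releases before 3.0. From PyPDF2 3.0 onwards those names (PdfFileReader, numPages, getPage, extractText) were removed in favour of PdfReader, len(reader.pages), reader.pages[i] and extract_text(). Below is a minimal sketch of the same steps with the newer names, assuming a PyPDF2 release that has PdfReader. The function names first_page_summary and summarise_pdf are made up for illustration and are not part of PyPDF2.

```python
def first_page_summary(reader):
    """Return (number of pages, text of the first page) for a PdfReader-like object."""
    num_pages = len(reader.pages)                # replaces pdfReader.numPages
    first_page = reader.pages[0]                 # replaces pdfReader.getPage(0)
    return num_pages, first_page.extract_text()  # replaces pageObj.extractText()


def summarise_pdf(input_pdf_file):
    """Open a pdf file and summarise it (hypothetical helper; needs PyPDF2)."""
    # imported here so that first_page_summary() can be tried without PyPDF2 installed:
    from PyPDF2 import PdfReader                 # replaces PyPDF2.PdfFileReader
    # the 'with' block closes the file even if reading the pdf fails:
    with open(input_pdf_file, 'rb') as pdf_file_obj:
        return first_page_summary(PdfReader(pdf_file_obj))
```

With this, something like num_pages, text = summarise_pdf("Monir2022_SuppTable1.pdf") should give the same two pieces of information that the script prints.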

 Now I can run the script:

% python3 /nfs/users/nfs_a/alc/Documents/git/Python/extract_data_from_pdf_file.py   Monir2022_SuppTable1.pdf

This is the output I see. It is just the text from the first page of the pdf, but the script could easily be changed to take other pages, either by changing the page index passed to pdfReader.getPage(0) (0 is the first page) or by calling pdfReader.getPage() in a loop over page numbers:

 
Number of pages in input pdf file: 77
Genomic characteristics of recently recognized Vibrio cholerae El Tor lineages associated with cholera in Bangladesh, 1991-2017 Authors:  Md Mamun Monir1, Talal Hossain1, Masatomo Morita2, Makoto Ohnishi2, Fatema-Tuz Johura1, Marzia Sultana1, Shirajum Monira1, Tahmeed Ahmed1, Nicholas Thomson3, Haruo Watanabe2, Anwar Huq4, Rita R. Colwell4,5, Kimberley Seed6, and Munirul Alam1§.  Table S1. Genetic characteristics of strains included in the study Lineage Strain ID Year Source Reference Accession SXT ICE Acquired antibiotic resistance profile gyrA ToxR ctxB rstA CTX  PLE
BD-0 4670 1991 No data Mutreja et al. 2011, Nature ERR019883 ICEVflInd1 ant(3'')-Ia, catB9, sul1, qacE El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MG116025 1991 No data Mutreja et al. 2011, Nature ERR018122 ICEgen catB9, dfrA1 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) MG116226 1991 No data Mutreja et al. 2011, Nature ERR025396 ICEVchBan5 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_3 CTT CTX-3 PLE(-) 4660 1994 No data Mutreja et al. 2011, Nature ERR018117 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, sul2 El tor gyrA 4 ctxB_1 CTT CTX-3 PLE(-) A346_1 1994 No data Mutreja et al. 2011, Nature ERR025392 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) A346_2 1994 No data Mutreja et al. 2011, Nature ERR018179 ICEVchInd5 aph(6)-Id, catB9, dfrA1, sul2 Ser83 to ARG 4 ctxB_1 TTAC CTX-2 PLE(-) MJ1485 1994 No data Mutreja et al. 2011, Nature ERR018120 ICEVchInd4 aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) 4672 2000 No data Mutreja et al. 2011, Nature ERR019884 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, floR, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB035 2012 Env This study DRR335720 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB037 2012 Env This study DRR335721 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) El tor gyrA 4 ctxB_1 TTAC CTX-2 PLE(-) MAB039 2012 Clinical This study DRR335723 ICEtet aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2, tet(A) Asn253 to Asp 4 ctxB_1 TTAC CTX-2 PLE(-) BD-1 4679 1999 No data Mutreja et al. 2011, Nature ERR018114 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4661 2001 No data Mutreja et al. 2011, Nature ERR018116 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4662 2001 No data Mutreja et al. 
2011, Nature ERR025373 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-) 4663 2001 No data Mutreja et al. 2011, Nature ERR018115 ICEgen aph(3'')-Ib, aph(6)-Id, catB9, dfrA1, floR, sul2 Haitian gyrA Ser83 to Ile 4 ctxB_1 CTT CTX-3 PLE(-)
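
To take text from more than the first page, the getPage() call can go in a loop. Here is a minimal sketch, using the same PyPDF2 method names as the script above (so it assumes a PyPDF2 release before 3.0). The function extract_text_from_pages and its first_page/last_page parameters are hypothetical names chosen for illustration.

```python
def extract_text_from_pages(pdfReader, first_page=0, last_page=None):
    """Return a list with the text of pages first_page..last_page (0-based, inclusive)."""
    # by default, go up to the last page of the pdf file:
    if last_page is None:
        last_page = pdfReader.numPages - 1
    page_texts = []
    for page_number in range(first_page, last_page + 1):
        # create a page object for this page and extract its text:
        pageObj = pdfReader.getPage(page_number)
        page_texts.append(pageObj.extractText())
    return page_texts
```

Inside main(), before pdfFileObj.close(), a call such as extract_text_from_pages(pdfReader, 0, 2) would then return the text of the first three pages as a list of strings, which can be printed or parsed one page at a time.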

 

 
