Working with PDF Files in Python


Python is a versatile language, in large part because of its extensive set of libraries covering a wide range of needs. Most of us work with PDF (Portable Document Format) files, and Python offers several ways to handle them. Here we will use a Python library called PyPDF2.

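PyPDF2 is a pure-Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs as well as merge entire files together.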

Because PyPDF2 can perform so many operations on PDFs, it works like a Swiss Army knife.

Getting Started

PyPDF2 is not part of the Python standard library, so it has to be installed first. Fortunately this is easy with pip: just run the command below in a terminal.

C:\Users\rajesh>pip install pypdf2
Collecting pypdf2
Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
100% |████████████████████████████████| 81kB 83kB/s
Building wheels for collected packages: pypdf2
Building wheel for pypdf2 (setup.py) ... done
Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\53\84\19\35bc977c8bf5f0c23a8a011aa958acd4da4bbd7a229315c1b7
Successfully built pypdf2
Installing collected packages: pypdf2
Successfully installed pypdf2-1.26.0

To verify the installation, import PyPDF2 in the Python shell.

>>> import PyPDF2
>>>
The import completes without an error, so the installation succeeded.
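The same check can also be done from a script rather than the interactive shell. The following is a minimal sketch; the helper name `pypdf2_version` is made up for this illustration, and the fallback to "unknown" assumes the installed release may not expose a `__version__` attribute.

```python
import importlib


def pypdf2_version():
    """Return the installed PyPDF2 version string, or None if it is missing."""
    try:
        module = importlib.import_module("PyPDF2")
    except ImportError:
        # PyPDF2 is not installed in this environment
        return None
    # Fall back to "unknown" if the release does not expose __version__
    return getattr(module, "__version__", "unknown")


version = pypdf2_version()
if version is None:
    print("PyPDF2 is not installed; run: pip install pypdf2")
else:
    print("PyPDF2 version:", version)
```

Running it prints either the version string or a reminder to install the package, which is handy at the top of a longer script.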

Extracting Metadata

We can extract some useful data from any PDF, such as the author, title, and subject of the document and the number of pages it contains.

Below is a Python program that uses the PyPDF2 package to extract this information from a PDF file.

from PyPDF2 import PdfFileReader
def extract_pdfMeta(path):
   with open(path, 'rb') as f:
      pdf = PdfFileReader(f)
      info = pdf.getDocumentInfo()
      number_of_pages = pdf.getNumPages()
   print("Author: \t", info.author)
   print()
   print("Creator: \t", info.creator)
   print()
   print("Producer: \t",info.producer)
   print()
   print("Subject: \t", info.subject)
   print()
   print("title: \t",info.title)
   print()
   print("Number of Pages in pdf: \t",number_of_pages)
if __name__ == '__main__':
   path = 'DeepLearning.pdf'
   extract_pdfMeta(path)

Output

Author: Nikhil Buduma,Nicholas Locascio

Creator: AH CSS Formatter V6.2 MR4 for Linux64 : 6.2.6.18551 (2014/09/24 15:00JST)

Producer: Antenna House PDF Output Library 6.2.609 (Linux64)

Subject: None

title: Fundamentals of Deep Learning

Number of Pages in pdf: 298

So we can get useful information about a PDF file without opening it in a viewer.
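Printing is fine for a quick look, but the same fields are often easier to reuse as a plain dictionary. The sketch below is one possible way to do that. The helper `metadata_to_dict`, the `FIELDS` tuple, and the "(not set)" placeholder are all assumptions made for this illustration. A `SimpleNamespace` filled with the values from the output above stands in for the object returned by `getDocumentInfo()`, so the example runs without a PDF file.

```python
from types import SimpleNamespace

# Attribute names used in the program above
FIELDS = ("author", "creator", "producer", "subject", "title")


def metadata_to_dict(info, number_of_pages):
    """Collect the metadata fields into a dict, marking empty ones."""
    result = {name: getattr(info, name, None) or "(not set)" for name in FIELDS}
    result["pages"] = number_of_pages
    return result


# Stand-in for pdf.getDocumentInfo(), using the values printed above
sample_info = SimpleNamespace(
    author="Nikhil Buduma,Nicholas Locascio",
    creator="AH CSS Formatter V6.2 MR4 for Linux64 : 6.2.6.18551 (2014/09/24 15:00JST)",
    producer="Antenna House PDF Output Library 6.2.609 (Linux64)",
    subject=None,
    title="Fundamentals of Deep Learning",
)

summary = metadata_to_dict(sample_info, 298)
for key, value in summary.items():
    print("{}: {}".format(key, value))
```

In a real script, `info` and `number_of_pages` from `extract_pdfMeta` could be passed in place of the stand-in values.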

Extracting Text from a PDF

We can extract text from a PDF with PyPDF2. The release used here (1.26.0) does not, however, provide a dedicated helper for extracting images.

Let us try to extract text from a particular page (say, page 50) of the PDF file used above.

#Import pypdf2
from PyPDF2 import PdfFileReader
def extract_pdfText(path):
   with open(path, 'rb') as f:
      pdf = PdfFileReader(f)
      # getPage() takes a zero-based index, so 50 selects the 51st page
      page = pdf.getPage(50)
      print(page)
      print('Page type: {}'.format(str(type(page))))
      # Extract text from the selected page
      text = page.extractText()
      print(text)
if __name__ == '__main__':
   path = 'DeepLearning.pdf'
   extract_pdfText(path)

Output

{'/Annots': IndirectObject(1421, 0),
'/Contents': IndirectObject(179, 0),
'/CropBox': [0, 0, 595.3, 841.9],
'/Group': {'/CS': '/DeviceRGB', '/S': '/Transparency', '/Type': '/Group'},
'/MediaBox': [0, 0, 504, 661.5],
'/Parent': IndirectObject(4863, 0),
'/Resources': IndirectObject(1423, 0),
'/Rotate': 0,
'/Type': '/Page'}

Page type: <class 'PyPDF2.pdf.PageObject'>
time. In inverted dropout, any neuron whose activation hasn†t been silenced has its
output divided by p before the value is propagated to the next layer. With this
fix, Eoutput=p⁄xp+1ƒ
p⁄0=
x, and we can avoid arbitrarily scaling neuronal
output at test time.

SummaryIn this chapter, we†ve learned all of the basics involved in training feed-forward neural
networks. We†ve talked about gradient descent, the backpropagation algorithm, as
well as various methods we can use to prevent overfitting. In the next chapter, we†ll
put these lessons into practice when we use the TensorFlow library to efficiently
implement our first neural networks. Then in
Chapter 4

, we†ll return to the problem
of optimizing objective functions for training neural networks and design algorithmsto significantly improve performance. These improvements will enable us to process
much more data, which means we†ll be able to build more comprehensive models.
Summary | 37

We do get some text from page 50, but it is not clean. Unfortunately, PyPDF2 has very limited support for extracting text from PDFs.
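Some of the noise in the output above can be reduced with a little post-processing. The following is a rough sketch using only the standard library, not a PyPDF2 feature. The helper name `tidy_extracted_text` is made up, and the replacement of the dagger character with an apostrophe is an assumption based on how it appears in this particular sample ("hasn†t", "we†ve"). Other PDFs may be garbled differently.

```python
import re


def tidy_extracted_text(text):
    """Apply simple cleanups to text returned by extractText()."""
    # Assumption: in this sample the dagger (U+2020) replaced an apostrophe
    text = text.replace("\u2020", "'")
    # Join broken lines and collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = (
    "time. In inverted dropout, any neuron whose activation hasn\u2020t been silenced has its\n"
    "output divided by p before the value is propagated to the next layer."
)
print(tidy_extracted_text(raw))
```

This does not repair missing spaces such as "algorithmsto" or the scrambled formula. Those would need a different extraction tool or manual correction.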

Rotating a Specific Page of a PDF File

>>> import PyPDF2
>>> deeplearningFile = open('DeepLearning.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(deeplearningFile)
>>> page = pdfReader.getPage(0)
>>> page.rotateClockwise(90)
{'/Contents': [IndirectObject(4870, 0), IndirectObject(4871, 0), IndirectObject(4872, 0), IndirectObject(4873, 0), IndirectObject(4874, 0), IndirectObject(4875, 0), IndirectObject(4876, 0), IndirectObject(4877, 0)],
'/CropBox': [0, 0, 595.3, 841.9],
'/MediaBox': [0, 0, 504, 661.5],
'/Parent': IndirectObject(4862, 0),
'/Resources': IndirectObject(4889, 0),
'/Rotate': 90,
'/Type': '/Page'}
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> pdfWriter.addPage(page)
>>> resultPdfFile = open('rotatedPage.pdf', 'wb')
>>> pdfWriter.write(resultPdfFile)
>>> resultPdfFile.close()
>>> deeplearningFile.close()

Output

The resulting file, rotatedPage.pdf, contains the first page of DeepLearning.pdf rotated 90 degrees clockwise.
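The interactive session above can also be wrapped in a reusable function. The sketch below assumes the PyPDF2 1.26.0 release installed earlier, since the camelCase names used throughout this article belong to that release and later releases of the library renamed them. The function name `rotate_first_page` and its parameters are made up for this illustration. PyPDF2 is imported inside the function so the definition loads even where the package is absent.

```python
def rotate_first_page(src_path, dst_path, angle=90):
    """Write the first page of src_path, rotated clockwise, to dst_path."""
    # Page rotation only makes sense in 90-degree steps
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90")

    # Imported here so the sketch can be loaded without PyPDF2 installed
    import PyPDF2

    with open(src_path, "rb") as src:
        reader = PyPDF2.PdfFileReader(src)
        page = reader.getPage(0)  # zero-based: index 0 is the first page
        page.rotateClockwise(angle)

        writer = PyPDF2.PdfFileWriter()
        writer.addPage(page)
        # Write while the source file is still open
        with open(dst_path, "wb") as dst:
            writer.write(dst)


# Example call, equivalent to the session above:
# rotate_first_page("DeepLearning.pdf", "rotatedPage.pdf", 90)
```

Using `with` blocks means both files are closed even if an error occurs partway through, which the manual `close()` calls in the session do not guarantee.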