environmentaltools.common.pdf

environmentaltools.common.pdf(file_name: str, encoding: str = 'latin-1', table: bool = False, guess: bool = False, area: list = None)

Read PDF files and extract text or tabular data.

Extracts content from PDF files using either text extraction (PyPDF2) or table extraction (tabula-py) methods.

Parameters:
  • file_name (str) – Path to PDF file.

  • encoding (str) – Character encoding for table extraction. Defaults to “latin-1”.

  • table (bool) – If True, extracts tables using tabula. If False, extracts plain text from the first page only. Defaults to False.

  • guess (bool) – If True, tabula will guess table locations. Defaults to False.

  • area (list, optional) – Coordinates [top, left, bottom, right] defining table area for extraction. Defaults to None (auto-detect).

Returns:

Extracted text string (if table=False) or DataFrame with table data (if table=True).

Return type:

str or pd.DataFrame
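
Example:

The following is a minimal usage sketch, not part of the library. The helpers make_area and read_report are hypothetical names introduced here for illustration, and the import path assumes that pdf is importable from environmentaltools.common, as the page title suggests. The ordering check in make_area assumes page coordinates that grow downward and to the right, which is how tabula interprets its area argument.

```python
from typing import List, Optional


def make_area(top: float, left: float, bottom: float, right: float) -> List[float]:
    """Hypothetical helper: build an `area` list in the documented order.

    Assumes coordinates grow downward and to the right, so a valid
    region has top < bottom and left < right.
    """
    if top >= bottom or left >= right:
        raise ValueError("expected top < bottom and left < right")
    return [top, left, bottom, right]


def read_report(file_name: str, area: Optional[List[float]] = None):
    """Hypothetical wrapper around the documented function.

    Without an area it returns the plain text of the first page (str).
    With an area it returns the table found in that region (DataFrame).
    """
    # Imported lazily so this sketch loads even if the package is missing.
    from environmentaltools.common import pdf

    if area is None:
        # table defaults to False: text extraction
        return pdf(file_name)
    # table=True switches to table extraction; guess=False keeps the
    # explicit area instead of letting tabula guess table locations.
    return pdf(file_name, table=True, guess=False, area=area)
```

Calling read_report("report.pdf") would then return a string, while read_report("report.pdf", area=make_area(50, 20, 400, 560)) would return a DataFrame. The file name and coordinates are placeholders.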