environmentaltools.common.pdf

environmentaltools.common.pdf(file_name: str, encoding: str = 'latin-1', table: bool = False, guess: bool = False, area: list = None)

Read PDF files and extract text or tabular data.

Extracts content from PDF files using either text extraction (PyPDF2) or table extraction (tabula-py) methods.

Parameters:

file_name (str) – Path to PDF file.
encoding (str) – Character encoding for table extraction. Defaults to “latin-1”.
table (bool) – If True, extracts tables using tabula. If False, extracts plain text from the first page only. Defaults to False.
guess (bool) – If True, tabula will guess table locations. Defaults to False.
area (list, optional) – Coordinates [top, left, bottom, right] defining the table area to extract. Defaults to None (auto-detect).

Returns:

Extracted text as a string (if table=False), or a DataFrame with the table data (if table=True).

Return type:

str or pd.DataFrame
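
Example:

The sketch below shows one way the two modes might be called. The helper names validate_area and extract_pdf are hypothetical and not part of environmentaltools. The import path assumes that pdf is importable from environmentaltools.common, as the qualified name above suggests. The area check only enforces the [top, left, bottom, right] ordering documented above. It is a precaution added for illustration, not something the library is known to require.

```python
from typing import Optional, Sequence


def validate_area(area: Optional[Sequence[float]]) -> Optional[list]:
    """Hypothetical helper: check the [top, left, bottom, right] convention."""
    if area is None:
        # None is passed through unchanged (documented as auto-detect).
        return None
    if len(area) != 4:
        raise ValueError("area must have four values: [top, left, bottom, right]")
    top, left, bottom, right = (float(v) for v in area)
    if bottom <= top or right <= left:
        raise ValueError("area must satisfy top < bottom and left < right")
    return [top, left, bottom, right]


def extract_pdf(file_name: str, table: bool = False, area=None):
    """Hypothetical wrapper around environmentaltools.common.pdf."""
    # Assumption: pdf is importable from environmentaltools.common.
    # Imported lazily so validate_area works without the package installed.
    from environmentaltools.common import pdf

    return pdf(file_name, table=table, guess=False, area=validate_area(area))


if __name__ == "__main__":
    # "report.pdf" is a placeholder path; replace it with a real file.
    text = extract_pdf("report.pdf")  # str: text of the first page
    frame = extract_pdf("report.pdf", table=True, area=[50, 30, 400, 560])
```

Checking the result with isinstance(result, str) is a simple way to handle both return types in calling code, since the return type depends on the table argument.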