environmentaltools.common.pdf
- environmentaltools.common.pdf(file_name: str, encoding: str = 'latin-1', table: bool = False, guess: bool = False, area: list = None)
Read PDF files and extract text or tabular data.
Extracts content from PDF files using either text extraction (PyPDF2) or table extraction (tabula-py) methods.
- Parameters:
file_name (str) – Path to PDF file.
encoding (str) – Character encoding for table extraction. Defaults to “latin-1”.
table (bool) – If True, extracts tables using tabula. If False, extracts plain text from the first page only. Defaults to False.
guess (bool) – If True, tabula will guess table locations. Defaults to False.
area (list, optional) – Coordinates [top, left, bottom, right] defining table area for extraction. Defaults to None (auto-detect).
- Returns:
Extracted text string (if table=False) or DataFrame with table data (if table=True).
- Return type:
str or pd.DataFrame
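Example:
The following is a minimal usage sketch, not part of the library. It assumes the function can be imported as `from environmentaltools.common import pdf`, as the qualified name above suggests. The helpers `check_area` and `read_pdf_content` are hypothetical names introduced here for illustration. The ordering check (bottom greater than top, right greater than left) is an assumption based on coordinates being measured from the top-left corner of the page.

```python
from typing import List, Optional, Sequence


def check_area(area: Optional[Sequence[float]]) -> Optional[List[float]]:
    """Validate a [top, left, bottom, right] box before passing it as `area`.

    Hypothetical helper, not provided by environmentaltools.
    """
    if area is None:
        # None is the documented default, so it is passed through unchanged.
        return None
    if len(area) != 4:
        raise ValueError("area must have four values: [top, left, bottom, right]")
    top, left, bottom, right = (float(v) for v in area)
    # Assumption: coordinates grow downward and rightward from the page origin.
    if bottom <= top or right <= left:
        raise ValueError("area needs bottom > top and right > left")
    return [top, left, bottom, right]


def read_pdf_content(file_name: str, table: bool = False, area=None):
    """Return text (table=False) or a DataFrame (table=True) from a PDF."""
    # Imported lazily so check_area can be used without the package installed.
    from environmentaltools.common import pdf

    if not table:
        # Plain-text mode: per the documentation, only the first page is read.
        return pdf(file_name)
    # Table mode: pass an explicit, validated region and keep guess=False.
    return pdf(file_name, table=True, guess=False, area=check_area(area))
```

A call such as `read_pdf_content("report.pdf", table=True, area=[50, 30, 400, 560])` would then request the table inside that region, where `report.pdf` and the coordinates are placeholder values.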