A PDF Text Extractor, Processor, and Formatter.
pdf-fmt is a powerful utility designed to extract text from PDF
documents and then clean, filter, and structure the output.
It is useful for converting raw PDF dumps into clean, formatted text.
Note that pdf-fmt is under active development, so you might encounter bugs
and issues.
- Raw text extraction
- Copy to clipboard and/or write to file
- Extensive configuration schema
  - See configuration
- Supports numerous formats
- Image extraction
  - Bring your own OCR
  - Under development
- and many others to come...
There is plenty of PDF tooling out there, but it seems to be geared towards OCR and generally does not help with extracting and processing the output text.
Personally, I use it to collate lecture slides for note taking and knowledge management. I hope that it will be useful for you as well.
This is not an OCR (Optical Character Recognition) tool. It only processes selectable text (text you can highlight with your cursor) found in the PDF structure.
If your file contains images of text, you can use the image extraction feature and then pass the output images to your OCR tool. This feature is currently under development.
For converting non-PDF files (like .docx, .pptx, .odt) to PDF before
extraction, either dependency needs to be installed and accessible in your $PATH:
Cannot set gray non-stroke color because /'Pattern x' is an invalid float value
You can ignore this error, or use qpdf or Ghostscript to convert your PDF before running pdf-fmt.
Inaccurate locale enforcement, e.g. localization -> localization (left unchanged) even with UK locale enforcement enabled.
Upstream locale enforcement libraries may yield inaccurate words. I am working on adding a configuration option to define your own locale mappings to override Breame's.
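The planned override could work roughly as sketched below. This is only an illustration and not pdf-fmt's implementation: UPSTREAM_US_TO_UK, apply_locale and the overrides parameter are hypothetical names, and the small dictionary stands in for whatever the upstream library provides.

```python
import re

# Hypothetical stand-in for the mappings an upstream library would supply.
# It deliberately lacks "localization" to mirror the issue described above.
UPSTREAM_US_TO_UK = {"color": "colour", "organize": "organise"}


def apply_locale(text, overrides=None):
    """Replace whole words using upstream mappings, letting user overrides win."""
    mapping = {**UPSTREAM_US_TO_UK, **(overrides or {})}

    def swap(match):
        word = match.group(0)
        replacement = mapping.get(word.lower())
        if replacement is None:
            return word  # unknown word: leave it untouched
        # Preserve a leading capital letter from the original word.
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, text)


# Without an override the missing word is left alone; with one it is corrected.
print(apply_locale("Localization of color"))
print(apply_locale("Localization of color", {"localization": "localisation"}))
```

Merging the user mapping last means a user-defined entry always takes precedence over the upstream one, which is the behaviour the planned option describes.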
pdf-fmt is currently undergoing a major rewrite. Stay tuned.
Windows and Linux users can get the compiled binary from the latest release.
After downloading, open PowerShell on Windows or a terminal on Linux.
On Windows, run:
cd Downloads
mv pdf-fmt-x64-0.6.1.exe pdf-fmt.exe
./pdf-fmt.exe
For Windows users, remember to set execution policy.
On Linux, run:
cd Downloads
mv pdf-fmt-x64-0.6.1 pdf-fmt
chmod +x ./pdf-fmt
./pdf-fmt
You can also choose to do the following after this step:
- Add it to your system $PATH
  - Set an alias pointing to the binary, or rename it manually
- Create the configuration file
- Choose the binary corresponding to your operating system
  - macOS is not supported.
If you wish to get an updated version of the executable, download the latest release and remove the old executable file.
If you wish to use pdf-fmt on macOS, you can use the script installer or compile from source instead.
The version number might be different from the one in the above example.
- We encourage using the latest version, especially when major new features are added
You can use pdf-fmt via the script installer, which sets up an isolated Python virtual environment to manage all dependencies.
- You need to have Git and Python 3.10 or above installed
  - To confirm, run which git and which python in a Linux/macOS terminal
  - For Windows users, run where git and where python in Command Prompt
If you are only downloading the compiled binaries, you can ignore this part.
These prerequisites also apply to compiling from source.
- Other prerequisites are documented in the section on compiling from source
- The script will prompt for confirmation before starting the installation
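The prerequisite check above can also be done from Python, for instance in a pre-flight script. The sketch below is illustrative only and is not part of the pdf-fmt installer; check_prerequisites and MIN_PYTHON are hypothetical names.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # shutil.which searches PATH, much like `which git` or `where git`.
    if shutil.which("git") is None:
        problems.append("git was not found on PATH")
    # Compare only (major, minor) of the running interpreter.
    if sys.version_info[:2] < min_python:
        found = ".".join(map(str, sys.version_info[:2]))
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {wanted}+ is required, found {found}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Returning a list instead of exiting early lets the caller report every missing prerequisite in one go.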
Before running scripts, please review their contents by opening the URL they
call in a browser. E.g. https://raw.githubusercontent.com/...
- Alternatively, you can view them here
Set execution policy to RemoteSigned.
Then, open PowerShell.
Invoke-RestMethod -Uri 'https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/install.ps1' -OutFile install.ps1
Get-Content install.ps1
.\install.ps1
Open a terminal.
curl -o install.sh https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/install.sh
cat install.sh
chmod +x install.sh
./install.sh
The installer places the Python script inside your new .venv folder.
Activate the environment and run the script:
For Linux or macOS
source .venv/bin/activate
chmod +x ./pdf-fmt.py
./pdf-fmt.py
For Windows
.venv\Scripts\activate
pdf-fmt
The output is printed to the terminal and copied to your clipboard by default.
To update the script, run git pull in the repository the script creates
under the pdf-fmt directory.
Compiling from source requires running the script installer or the following commands. This example assumes Linux. See the script usage example for how to activate the virtual environment on each OS.
It is recommended to use pyenv to manage
different versions of Python. It is also recommended to install ccache
so that compiled artifacts are cached between builds. You will also need to meet the Nuitka requirements.
After installing pyenv, follow its instructions on configuring with pyenv init.
Then, run the following immediately after you change directory into the cloned repository.
pyenv install 3.11
pyenv local 3.11
You can use any other target Python version, though pdf-fmt primarily supports
Python 3.10 or above.
# Either clone the repository or change directory to it if you have used the
# script installer prior
git clone --depth 1 https://github.com/bladeacer/pdf-fmt
cd pdf-fmt
chmod +x ./scripts/compile.sh
./scripts/compile.sh
The script creates a separate virtual environment for
compiling from source. It outputs the binary to the build/ directory once
compiling is done.
Compilation too slow? Increase the number specified in the jobs count. Only do this if you have sufficient CPU cores and hardware. Remove the
--low-memory flag at your own risk. If the compilation takes up too much memory, it will crash and exit without completing.
Compilation logs can be found at nuitka-build.log.
Crash reports can be found at nuitka-crash-report.xml.
Alternatively, you can call this script on Linux or macOS.
The configuration options available are documented in the
pdf-fmt.yaml file.
- filters: Regex rules for character exclusion and pattern-based filtering
  - excluding footers matching a regex pattern.
  - includes optional spelling enforcement (UK or US English).
- conversion: Lists supported non-PDF formats (see handling non-PDF formats).
- formatting: Controls line re-wrapping and indentation conversion
  - converting single-space indents to Markdown lists
  - enforcing capitalisation at the start of each line.
- actions: Defines post-extraction behaviour
  - copying to the system clipboard and/or writing to an output file.
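To make the filters and formatting options above more concrete, here is a rough sketch of what footer exclusion, indent conversion and line capitalisation might look like. The function names and the footer pattern are hypothetical. They are not taken from pdf-fmt's source or from its configuration schema.

```python
import re


def drop_footers(lines, footer_pattern):
    """Remove lines that fully match a footer regex, e.g. page counters."""
    footer = re.compile(footer_pattern)
    return [line for line in lines if not footer.fullmatch(line.strip())]


def indents_to_markdown_list(lines):
    """Turn lines indented by exactly one space into Markdown list items."""
    return ["- " + line[1:] if re.match(r" \S", line) else line for line in lines]


def capitalise_line_starts(lines):
    """Upper-case the first character of each line, leaving the rest untouched."""
    return [line[:1].upper() + line[1:] for line in lines]


raw = [
    "introduction to graphs",
    " nodes are vertices",
    "Page 1 of 12",
    "a graph has nodes and edges",
]
# Apply the steps in sequence: filter first, then format.
cleaned = capitalise_line_starts(
    indents_to_markdown_list(drop_footers(raw, r"Page \d+ of \d+"))
)
print("\n".join(cleaned))
```

Filtering before formatting means footer patterns are matched against the raw extracted lines, so they do not need to account for later capitalisation or list markers.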
For extensive customisation, consider creating your own
configuration file. If you do, ensure that it is named pdf-fmt.yaml.
pdf-fmt will look for the configuration file in the following locations.
- $PDF_FMT_CONFIG_PATH environment variable
- Default configuration directory
  - APPDATA if you are on Windows
  - $XDG_CONFIG_HOME or ~/.config if you are on Linux
- The current working directory of the script
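The lookup described above can be pictured with the following sketch. It rests on several assumptions that the text does not confirm: that the list is in priority order, that PDF_FMT_CONFIG_PATH points directly at a file, and that the file sits directly inside the default configuration directory. The function names are hypothetical and this is not pdf-fmt's actual code.

```python
import os
import sys
from pathlib import Path

CONFIG_NAME = "pdf-fmt.yaml"


def candidate_config_paths(env=None, platform=None, cwd=None):
    """List config locations in the documented order (assumed highest priority first)."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    cwd = Path.cwd() if cwd is None else Path(cwd)
    candidates = []

    # 1. Explicit override via environment variable (assumed to be a file path).
    if env.get("PDF_FMT_CONFIG_PATH"):
        candidates.append(Path(env["PDF_FMT_CONFIG_PATH"]))

    # 2. Default configuration directory for the platform.
    if platform.startswith("win"):
        if env.get("APPDATA"):
            candidates.append(Path(env["APPDATA"]) / CONFIG_NAME)
    else:
        base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
        candidates.append(Path(base) / CONFIG_NAME)

    # 3. Current working directory.
    candidates.append(cwd / CONFIG_NAME)
    return candidates


def find_config(**kwargs):
    """Return the first candidate that exists on disk, or None."""
    return next((p for p in candidate_config_paths(**kwargs) if p.is_file()), None)
```

Separating the candidate list from the existence check makes the search order easy to inspect without touching the file system.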
Note: the configuration schema in this repository reflects the development branch.
The released binaries might not support some options yet. These are indicated
with [DEV].
This table documents the currently supported platforms for pdf-fmt and
highlights platforms where we are seeking community confirmation of functionality.
- Primarily, we aim to support the latest, most widely used version of each platform
- This means that LTS or stable versions of a platform are sometimes preferred when testing for compatibility
We welcome your contributions! Please help us by:
- Opening a pull request (PR) to confirm that pdf-fmt works on your platform, noting any specific setup caveats or workarounds.
- Creating an issue if you encounter problems with the installer script or compiling from source.
| Platform | Display Protocol | C Standard Library | Known to work? | Comments |
|---|---|---|---|---|
| Alpine Linux x64 (musl-based) | X11 | musl | Untested | Contributions are welcome |
| Arch Linux x64 | Wayland | glibc | Untested | Contributions are welcome |
| Arch Linux x64 | X11 | glibc | Untested | Contributions are welcome |
| Debian x64 (glibc) | Wayland | glibc | Untested | Contributions are welcome |
| Debian x86 (glibc) | X11 | glibc | Untested | Contributions are welcome |
| EndeavourOS x64 (Arch-based) | Wayland | glibc | Partial | Script works out of the box. Contributions are welcome for binary/compiling from source |
| EndeavourOS x64 (Arch-based) | X11 | glibc | Yes | Binary/script/compiling from source works. |
| Fedora 43 x64 (RPM-based) | Wayland | glibc | Partial | Binary works out of the box. Contributions are welcome for script/compiling from source |
| Fedora x64 (RPM-based) | X11 | glibc | Untested | Contributions are welcome |
| FreeBSD stable x64 | X11 | BSD libc | Untested | Contributions are welcome |
| NetBSD x64 | X11 | BSD libc | Untested | Contributions are welcome |
| OpenBSD x64 | X11 | BSD libc | Untested | Contributions are welcome |
| Ubuntu LTS x64 (Debian-based) | Wayland | glibc | Untested | Contributions are welcome |
| Ubuntu LTS x64 (Debian-based) | X11 | glibc | Untested | Contributions are welcome |
| macOS 14 (Sonoma) | N/A | libSystem (BSD libc) | Untested | Contributions are welcome |
| Windows 10 x86 | N/A | MSVCRT (via MSVC/MinGW) | Untested | Contributions are welcome |
| Windows 11 x64 | N/A | MSVCRT (via MSVC/MinGW) | Partial | Binary works out of the box. Contributions are welcome for script/compiling from source |
To check the C Standard Library used on Linux, run ldd --version.
To check the Display Protocol currently used on Linux, run echo $XDG_SESSION_TYPE.
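If you plan to report a platform, the same details can be gathered from Python. The helper below is a hypothetical convenience sketch and is not shipped with pdf-fmt. Note that platform.libc_ver() may return empty strings on systems where it cannot identify the C library, in which case ldd --version remains the more reliable check.

```python
import os
import platform


def describe_platform():
    """Collect the details that the platform table asks contributors to report."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "system": platform.system(),
        "machine": platform.machine(),
        # Empty strings mean the C library could not be determined.
        "libc": f"{libc_name} {libc_version}".strip() or "unknown",
        # XDG_SESSION_TYPE is typically "wayland" or "x11" on Linux desktops.
        "display_protocol": os.environ.get("XDG_SESSION_TYPE", "N/A"),
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    for key, value in describe_platform().items():
        print(f"{key}: {value}")
```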
You may need to install patchelf
- See Compile from source for more details.
| Python Version | Known to work? | Comments |
|---|---|---|
| 3.10 | Yes | Compiling from source, script works. |
| 3.11 | Yes | Compiling from source, script works. Used in GitHub Actions. |
| 3.12 | Untested | WIP |
| 3.13 | Partial | Compiling from source, script works. |
Create your own fork or clone the repository. The example below shows cloning this repository on Linux.
Do note that this repository has its own Code of Conduct and Contributing Guide.
git clone https://github.com/bladeacer/pdf-fmt
cd pdf-fmt
chmod +x scripts/setup.sh
./scripts/dev.sh
TBC
The script, the compiled binaries and compiling from source should work on all major operating systems that support Git, Python, pdfminer.six and pyperclip.
Note: These dependencies are slightly larger than their C equivalents, though this is a calculated trade-off.
Using unittest, which is part of Python's standard library. You can make use of the
script installer for cloning the repository.
python -m unittest discover -s tests -v
Alternatively, you can run the script.
Using act.
curl -o act.sh https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/act.sh
chmod +x act.sh
./act.sh
GPLv3. See the license file for details.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Existing PDF tooling for inspiration, LibreOffice CLI. Nuitka for compilation, GitHub for hosting and CI.
My friend Potato for testing the binary on Windows.
My friend Floodlight for testing the binary on Fedora.
The code of conduct was adopted from the Contributor Covenant.
The contributing guide was adopted from conduct.