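Websites commonly offer PDF reports for download. Approaches to automated testing of PDF content (text, layout, images, and so on) fall roughly into the following categories:
- Convert every PDF page to an image and compare image against image – this checks everything the user will see in one pass, but if the report contains variable content (such as a timestamp), preparing the reference images in advance becomes difficult.
- Convert the whole PDF to another format (such as HTML, XML, or plain text) and then parse the converted output – many tools are available for this, and some can also extract individual images. The challenge lies in how to parse the text file.
- Retrieve the data items to verify directly by Object ID – this requires agreeing with the developers beforehand, so that they embed matching IDs for the objects to be verified automatically. Perhaps especially suited to extracting values from PDF documents filled in by a PDF Form Filler?
Because the test framework is written in Python, candidate solutions have to meet a few requirements:
- No installation required – the library can be checked straight into the VCS, which simplifies deployment to test machines; each machine only has to update its own local copy.
- Pure Python – cross-platform; this is also somewhat related to the "no installation" requirement above.
- Multilingual support – text extracted from PDF files must be uniformly convertible to Unicode.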
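To make the last requirement concrete, here is a minimal sketch of a helper that normalizes whatever a text-extraction step returns into a Unicode string. The name `to_unicode` and the list of candidate encodings are assumptions for illustration only; they are not part of any PDF library.

```python
def to_unicode(data, encodings=("utf-8", "big5")):
    """Return `data` as a Unicode string.

    Byte strings are decoded with the first candidate encoding that
    succeeds. The candidate list is an assumption and should be
    adjusted to the documents actually being tested.
    """
    if not isinstance(data, bytes):
        return data  # already text, nothing to do
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of %r could decode the data" % (encodings,))
```

Downstream parsing code can then assume it always works on Unicode text, regardless of which encoding the extraction step produced.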
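Searching for python pdf (parser or extractor) and doing a fair amount of surfing turned up several candidates:
Although some examples on the web use pyPDF to extract text from a PDF, in the area of PDF parsing/extraction most people are still talking about PDFMiner. After all, pyPDF specializes in manipulating PDFs (splitting, merging, encrypting and decrypting, and so on), whereas PDFMiner focuses on extracting and analyzing text data.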
PDF Is Evil!
Why is PDF evil?
PDF is evil. Although it is called a PDF “document", it’s nothing like Word or HTML document. PDF is more like a graphic representation. PDF contents are just a bunch of instructions that tell how to place the stuff at each exact position on a display or paper. In most cases, it has no logical structure such as sentences or paragraphs and it cannot adapt itself when the paper size changes. PDFMiner attempts to reconstruct some of those structures by guessing from its positioning, but there’s nothing guaranteed to work. Ugly, I know. Again, PDF is evil.
Programming with PDFMiner
— Yusuke Shinyama
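This is exactly why we need tools like PDFMiner: to actually stitch the "pasted-together pieces of text" back into continuous text.
PDF is not evil, because it was born for presentation and printing…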
In a PDF, the text is not continous, but made from a lot of small groups of characters positioned absolutely in the page. The focus of PDF is to keep the layout intact. It’s not content oriented but presentation oriented.
Advanced PDF Parsing Using Python. What is the Best Library?
— Etienne
Installing PDFMiner
The simplest way is to install the pdfminer package through EasyInstall.
    $ sudo easy_install pdfminer
    install_dir /usr/local/lib/python2.6/dist-packages/
    Searching for pdfminer
    Reading http://pypi.python.org/simple/pdfminer/
    Reading http://www.unixuser.org/~euske/python/pdfminer/index.html
    Best match: pdfminer 20110515
    Downloading http://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz#md5=f3905f801ed469900d9e5af959c7631a
    Processing pdfminer-20110515.tar.gz
    Running pdfminer-20110515/setup.py -q bdist_egg --dist-dir /tmp/easy_install-oY2vXu/pdfminer-20110515/egg-dist-tmp-fsF5jc
    zip_safe flag not set; analyzing archive contents...
    pdfminer.cmapdb: module references __file__
    Adding pdfminer 20110515 to easy-install.pth file
    Installing pdf2txt.py script to /usr/local/bin
    Installing dumppdf.py script to /usr/local/bin
    Installing latin2ascii.py script to /usr/local/bin
    Installed /usr/local/lib/python2.6/dist-packages/pdfminer-20110515-py2.6.egg
    Processing dependencies for pdfminer
    Finished processing dependencies for pdfminer
The installation also puts three command-line tools in place: pdf2txt.py, dumppdf.py, and latin2ascii.py.
The same works on Windows:
    C:\>easy_install pdfminer
    Searching for pdfminer
    Reading http://pypi.python.org/simple/pdfminer/
    Reading http://www.unixuser.org/~euske/python/pdfminer/index.html
    Best match: pdfminer 20110515
    Downloading http://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz#md5=f3905f801ed469900d9e5af959c7631a
    Processing pdfminer-20110515.tar.gz
    Running pdfminer-20110515\setup.py -q bdist_egg --dist-dir c:\users\jeremy~1\appdata\local\temp\easy_install-dxz4iz\pdfminer-20110515\egg-dist-tmp-1ks9pp
    zip_safe flag not set; analyzing archive contents...
    pdfminer.cmapdb: module references __file__
    Adding pdfminer 20110515 to easy-install.pth file
    Installing dumppdf.py script to C:\Python27\Scripts
    Installing latin2ascii.py script to C:\Python27\Scripts
    Installing pdf2txt.py script to C:\Python27\Scripts
    Installed c:\python27\lib\site-packages\pdfminer-20110515-py2.7.egg
    Processing dependencies for pdfminer
    Finished processing dependencies for pdfminer
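The three command-line tools are likewise installed, this time into C:\Python27\Scripts. Add that directory to the PATH environment variable and the tools can be invoked directly.
To support CJK languages, you have to install from source:
- Download pdfminer-<version>.tar.gz from PyPI and extract it.
- Change to the extracted directory and run make cmap. (I don't know why the Makefile sets PYTHON=python2; changing it to python is enough.)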
    $ make cmap
    python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt cp950 big5
    reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...
    writing 'CNS1-H.pickle.gz'...
    writing 'ETHK-B5-V.pickle.gz'...
    writing 'ETHK-B5-H.pickle.gz'...
    ...
    python tools/conv_cmap.py pdfminer/cmap Adobe-GB1 cmaprsrc/cid2code_Adobe_GB1.txt cp936 gb2312
    reading 'cmaprsrc/cid2code_Adobe_GB1.txt'...
    writing 'GBT-EUC-V.pickle.gz'...
    writing 'GB-EUC-H.pickle.gz'...
    writing 'UniGB-UTF32-H.pickle.gz'...
    ...
    python tools/conv_cmap.py pdfminer/cmap Adobe-Japan1 cmaprsrc/cid2code_Adobe_Japan1.txt cp932 euc-jp
    reading 'cmaprsrc/cid2code_Adobe_Japan1.txt'...
    writing 'Add-V.pickle.gz'...
    writing '78ms-RKSJ-H.pickle.gz'...
    writing 'Hankaku-V.pickle.gz'...
    ...
    python tools/conv_cmap.py pdfminer/cmap Adobe-Korea1 cmaprsrc/cid2code_Adobe_Korea1.txt cp949 euc-kr
    reading 'cmaprsrc/cid2code_Adobe_Korea1.txt'...
    writing 'KSCms-UHC-HW-V.pickle.gz'...
    writing 'UniKS-UTF32-V.pickle.gz'...
    writing 'KSC-V.pickle.gz'...
Windows has no make, so run the commands one by one, following the output above:
    python tools\conv_cmap.py pdfminer\cmap Adobe-CNS1 cmaprsrc\cid2code_Adobe_CNS1.txt cp950 big5
    python tools\conv_cmap.py pdfminer\cmap Adobe-GB1 cmaprsrc\cid2code_Adobe_GB1.txt cp936 gb2312
    python tools\conv_cmap.py pdfminer\cmap Adobe-Japan1 cmaprsrc\cid2code_Adobe_Japan1.txt cp932 euc-jp
    python tools\conv_cmap.py pdfminer\cmap Adobe-Korea1 cmaprsrc\cid2code_Adobe_Korea1.txt cp949 euc-kr
- Run python setup.py install to install the package.
    $ sudo python setup.py install
    ...
    running install_lib
    creating /usr/local/lib/python2.6/dist-packages/pdfminer
    copying build/lib.linux-x86_64-2.6/pdfminer/converter.py -> /usr/local/lib/python2.6/dist-packages/pdfminer
    copying build/lib.linux-x86_64-2.6/pdfminer/glyphlist.py -> /usr/local/lib/python2.6/dist-packages/pdfminer
    ...
    running install_scripts
    copying build/scripts-2.6/pdf2txt.py -> /usr/local/bin
    copying build/scripts-2.6/dumppdf.py -> /usr/local/bin
    copying build/scripts-2.6/latin2ascii.py -> /usr/local/bin
    changing mode of /usr/local/bin/pdf2txt.py to 755
    changing mode of /usr/local/bin/dumppdf.py to 755
    changing mode of /usr/local/bin/latin2ascii.py to 755
    running install_egg_info
    Writing /usr/local/lib/python2.6/dist-packages/pdfminer-20110515.egg-info
The same three command-line tools are installed.
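Where make is unavailable, the four conv_cmap.py invocations listed earlier could also be driven from a small Python script instead of being typed by hand. The following is only a sketch and not part of PDFMiner: the names `CMAP_JOBS`, `build_commands`, and `run_all` are made up for illustration, the arguments are copied from the make cmap output, and the script is assumed to run from the extracted source directory.

```python
import os
import subprocess
import sys

# (registry, encodings...) copied from the `make cmap` output.
CMAP_JOBS = [
    ("Adobe-CNS1", "cp950", "big5"),
    ("Adobe-GB1", "cp936", "gb2312"),
    ("Adobe-Japan1", "cp932", "euc-jp"),
    ("Adobe-Korea1", "cp949", "euc-kr"),
]


def build_commands(python=sys.executable):
    """Return one argv list per conv_cmap.py invocation."""
    script = os.path.join("tools", "conv_cmap.py")
    outdir = os.path.join("pdfminer", "cmap")
    commands = []
    for job in CMAP_JOBS:
        registry, encodings = job[0], list(job[1:])
        # For example: cmaprsrc/cid2code_Adobe_CNS1.txt
        table = os.path.join(
            "cmaprsrc", "cid2code_%s.txt" % registry.replace("-", "_"))
        commands.append([python, script, outdir, registry, table] + encodings)
    return commands


def run_all():
    """Run every command in order, stopping at the first failure."""
    for argv in build_commands():
        subprocess.check_call(argv)
```

Calling `run_all()` from the source directory would then have the same effect as the four commands above, and because paths are built with `os.path.join` the same script should work on both Windows and Linux.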
Test the installation with pdf2txt.py:
    $ pdf2txt.py samples/simple1.pdf
    Hello World Hello World H e l l o W o r l d H e l l o W o r l d
Non-English PDFs work too:
    $ pdf2txt.py samples/jo.pdf
    ...
    宇 宙 塵 を た べ 、
    ...
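Because PDFMiner is a pure Python implementation, it also works if you simply extract the archive and put the pdfminer subdirectory under the current working directory. I am not sure, though, whether the *.pickle.gz files produced by make cmap are platform-dependent.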
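Keeping the pdfminer subdirectory in the VCS, rather than in the current working directory, might look roughly like this. It is a minimal sketch: the directory name `third_party` and the helper `add_vendor_dir` are hypothetical and chosen only for illustration.

```python
import os
import sys


def add_vendor_dir(base_dir, name="third_party"):
    """Put <base_dir>/<name> at the front of sys.path and return its path.

    `third_party` is a hypothetical directory that would hold the
    extracted `pdfminer` package next to the test code.
    """
    vendor_dir = os.path.abspath(os.path.join(base_dir, name))
    if vendor_dir not in sys.path:
        # Insert at the front so the checked-in copy is found first.
        sys.path.insert(0, vendor_dir)
    return vendor_dir
```

After a call such as `add_vendor_dir(os.path.dirname(os.path.abspath(__file__)))`, an `import pdfminer` would pick up the copy under third_party/pdfminer, provided it has been checked in there.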
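Extracting PDF text directly from a program
Since tools/pdf2txt.py handles multilingual PDF files correctly, there is not much left to worry about. All that remains is pulling the "convert a PDF to a text file" functionality out of pdf2txt.py…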
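    from StringIO import StringIO  # Python 2 module

    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams


    def pdf_to_text(pdf_file):
        """Return the text content of pdf_file as a unicode string."""
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        # Open the input before the try block, so that a failed open()
        # is not masked by a NameError raised during cleanup.
        infp = open(pdf_file, 'rb')
        try:
            outfp = StringIO()
            # The codec must be given explicitly.
            device = TextConverter(rsrcmgr, outfp, codec='utf-8', laparams=laparams)
            try:
                process_pdf(rsrcmgr, device, infp)
                # Decode back to unicode for easier downstream processing.
                return outfp.getvalue().decode('utf-8')
            finally:
                device.close()
                outfp.close()
        finally:
            infp.close()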
The codec argument must always be given; UTF-8 is the default choice.
The result is decoded back to Unicode to make later processing easier.
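The end?
The story is not over yet…
Once the text has been extracted from the PDF, the next job is parsing that plain text, and this is the part that really takes time and effort…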
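What that parsing step looks like depends entirely on the report. As a rough sketch, assuming a hypothetical report whose extracted text contains "label: value" lines, a regular expression can pull out the fields to verify. The sample text, the field names, and the helper `parse_fields` are all invented for illustration.

```python
import re

# One "label: value" pair per line; the label is everything before the
# first colon. Only spaces and tabs are trimmed, so a match never spans
# more than one line.
FIELD_RE = re.compile(
    r"^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def parse_fields(text):
    """Return a dict mapping each label to its value."""
    return {m.group(1): m.group(2) for m in FIELD_RE.finditer(text)}


# Hypothetical text, as it might come back from a PDF-to-text step.
sample = u"""Monthly Report
Customer:  王小明
Total : 1,234.50
Generated at: 2011-05-15 10:30
"""

fields = parse_fields(sample)
```

A test can then compare fixed fields such as `fields["Total"]` exactly, and check variable content such as `fields["Generated at"]` against a pattern instead of a literal value.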
While looking for a solution I stumbled on a few interesting things, shared here:
- openpipeline.org – An open source software for crawling, parsing, analyzing and routing documents.
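- Dirk Loss: Python tools for penetration testers – lists many Python modules that are useful in the testing field.
- Refine, reuse and request data | ScraperWiki – ScraperWiki is an interesting project where anyone can write programs to collect, process, and analyze data that is publicly available on the web.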
- Nullege: A Search Engine for Python source code
Other PDFMiner resources
Other documents
- Denis Papathanasiou (2011-11-11)
- pdf – Python – Help using pdfminer as a library – Stack Overflow (2011-04-20) – follows tools/pdf2txt.py and uses TextConverter to extract text from a PDF.
- Python处理pdf文件的包 at 男单 618 (2011-05-12)
- PDFMiner by Patrice Neff – Memonic (2011-04-19)
- Python PDFMiner 解析pdf 文本 – warmb123的专栏 – 博客频道 – CSDN.NET (2011-02-18)
- Denis Papathanasiou » Blog Archive » Extracting Text & Images from PDF Files (2010-10-28)
- Old Nabble – python-chinese @ googlegroups – <CPyUG> pdfminer读取PDF,能否直接获取文本? (2010-08-02)
- Herb’s Blog: OpenDataBC: Extracting Data from A4CA PDFs (2010-05-20) – uses TextConverter to turn unstructured PDF data published by the Canadian public sector into CSV files that can be put to further use.
- PDFMiner を使ってテキストを抽出 – 文字处理技术 – STPDomain – Powered by Discuz! (2009-05-01)