Batch-Converting Word Documents to PDF Files

许爸玩编程® · Filed under Everyday Office Problems

2022-12-12

Problem statement

I have a large number of Word documents (.docx files) stored in a single directory, and I need to convert all of them to PDF.

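Breaking the problem down

Input: given a directory, list all the files in it and skip anything that is not a Word document.
Conversion requirements: the PDF must preserve the original formatting, including headings, font sizes, typefaces, indentation, tables, and so on. In short, the PDF should look exactly the way the document looks in Microsoft Word.
Output: save the converted PDF files to a specified directory, for example output_pdf.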
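
The input step can be sketched with the standard library alone. This is only an illustration, not the script used later in this article: list_word_files is a hypothetical helper name, and skipping names that start with ~$ (the temporary lock files Word leaves next to an open document) is my own assumption about what you would want to exclude.

```python
from pathlib import Path

WORD_EXTENSIONS = {".doc", ".docx"}

def list_word_files(directory):
    """Return the Word documents directly inside `directory`, sorted by path."""
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file()
        and p.suffix.lower() in WORD_EXTENSIONS
        and not p.name.startswith("~$")  # assumption: skip Word lock files
    )
```

Subdirectories and files with other extensions are ignored, which matches the requirement stated above.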

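Implementation option 1

Converting Word to PDF essentially means having the Office Word application process the document and write out the result.

On Windows with Microsoft Office Word installed, a single document is easy: open it in Word by hand and save or export it as PDF.

The problem here is the number of files. With several hundred Word files or more, converting them one at a time by hand is far too slow, so we want a program to automate the job.

On Windows, Python's pywin32 library can drive Word through its COM automation interface (the same object model that VBA uses) and convert Word files to PDF automatically. Alternatively, the docx2pdf library already wraps the calls to Office Word and is simpler to use. Install it first with pip install docx2pdf

A simple example: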

from docx2pdf import convert

inputFile = "document.docx"
outputFile = "document2.pdf"

convert(inputFile, outputFile)

Once docx2pdf is installed, you can also convert from the command line without writing any Python code.

Run docx2pdf -h in a terminal for the detailed usage instructions.
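
For the batch case, the docx2pdf README describes passing a folder instead of a single file. The sketch below wraps that in a hypothetical helper, convert_folder. The folder-to-folder form is my reading of the README rather than something tested in this article, the directory names are placeholders, and Microsoft Word still has to be installed.

```python
from pathlib import Path

def convert_folder(input_dir, output_dir):
    """Convert every .docx in input_dir to a PDF in output_dir using docx2pdf."""
    src = Path(input_dir)
    if not src.is_dir():
        raise NotADirectoryError(f"not a directory: {src}")
    dst = Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists first

    # Imported here so the path checks above work even where docx2pdf is not installed.
    from docx2pdf import convert
    convert(str(src), str(dst))
```

A call such as convert_folder("word-files", "output_pdf") would then cover the whole directory in one go.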

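Implementation option 2

If you look at the docx2pdf source code, you will find that on Windows it also goes through the pywin32 library. It is simply packaged to be more convenient to use.

Working with win32 directly is not complicated either. Reference code: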

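import os
from win32com.client import constants, gencache

# Export one Word document to PDF through Word's COM interface
def createPdf(wordPath, pdfPath):
    word = gencache.EnsureDispatch('Word.Application')
    try:
        # Word resolves relative paths against its own working directory, so pass absolute paths
        doc = word.Documents.Open(os.path.abspath(wordPath), ReadOnly=1)
        try:
            doc.ExportAsFixedFormat(os.path.abspath(pdfPath),
                                    constants.wdExportFormatPDF,
                                    Item=constants.wdExportDocumentWithMarkup,
                                    CreateBookmarks=constants.wdExportCreateHeadingBookmarks)
        finally:
            # Close the document even if the export fails
            doc.Close(constants.wdDoNotSaveChanges)
    finally:
        word.Quit(constants.wdDoNotSaveChanges)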

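Either of the two approaches above is a simple way to convert Word documents to PDF files.

[Important] Options 1 and 2 both require Office Word to be installed on your Windows system.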

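Implementation option 3

However, I want the program to be cross-platform, so that it also runs on macOS and Linux. For that reason I chose the LibreOffice office suite: it runs on macOS, Linux, and Windows, and it is free and open source. Yes, completely free.

Installing the free, open-source LibreOffice

Download LibreOffice for your platform here: https://www.libreoffice.org/download/download-libreoffice/

I will not go through the installation steps; follow the documentation on the official site.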

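LibreOffice headless mode

We then use LibreOffice's headless mode to convert Word files to PDF.

Headless mode means LibreOffice can run without a graphical interface.

In Python, the subprocess module can call the LibreOffice command-line tool to do the conversion.

Reference code: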

import subprocess

def convert_to_pdf(input_file, output_directory):
    subprocess.run(['libreoffice', '--headless', '--convert-to', 'pdf', input_file, '--outdir', output_directory])

Replace libreoffice in the program above with the real, full path to LibreOffice on your own computer.

Note for macOS: on my machine, for example, I have to replace it with /Applications/LibreOffice.app/Contents/MacOS/soffice
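
If you would rather not hard-code the path, a lookup along the following lines may help. It is a sketch under stated assumptions: find_soffice is a hypothetical helper, the macOS path is the one from my machine, and the Windows path is only the usual default install location, so adjust both for your own setup.

```python
import os
import shutil

# Assumed default install locations; adjust for your own machine.
DEFAULT_SOFFICE_PATHS = (
    "/Applications/LibreOffice.app/Contents/MacOS/soffice",  # macOS (path from this article)
    r"C:\Program Files\LibreOffice\program\soffice.exe",     # Windows (assumed default)
)

def find_soffice(names=("soffice", "libreoffice"), extra_paths=DEFAULT_SOFFICE_PATHS):
    """Return the path of a LibreOffice executable, or None if none is found."""
    for name in names:
        found = shutil.which(name)  # search the directories on PATH
        if found:
            return found
    for path in extra_paths:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```

The function tries PATH first and only then falls back to the listed locations, so an explicitly configured installation takes precedence.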

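How can you tell whether the full path you found is correct? It is simple: in a terminal, type the full path followed by -h. If a long block of help text appears, the path is right. It looks something like this: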

/Applications/LibreOffice.app/Contents/MacOS/soffice -h
LibreOffice 7.6.4.1 e19e193f88cd6c0525a17fb7a176ed8e6a3e2aa1

Usage: soffice [argument...]
       argument - switches, switch parameters and document URIs (filenames).

Using without special arguments:
Opens the start center, if it is used without any arguments.
   {file}              Tries to open the file (files) in the components
                       suitable for them.
   {file} {macro:///Library.Module.MacroName}
                       Opens the file and runs specified macro from
                       My Macros container.
   {file} {macro://./Library.Module.MacroName}
                       Opens the file and runs specified macro from
                       the file.

Getting help and information:
   --help | -h | -?    Shows this help and quits.
   ......
   ......
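
The same check can be scripted. The sketch below uses a hypothetical helper, looks_like_libreoffice, which runs the candidate path with -h and looks for the word LibreOffice in the output, as in the sample above. That every platform prints the help text to standard output and exits with status 0 is an assumption, not something I have verified everywhere.

```python
import subprocess

def looks_like_libreoffice(soffice_path, timeout=30):
    """Return True if running `soffice_path -h` succeeds and prints LibreOffice help."""
    try:
        result = subprocess.run(
            [soffice_path, "-h"],
            capture_output=True,
            text=True,
            errors="replace",
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # OSError covers a missing or non-executable path
        return False
    return result.returncode == 0 and "LibreOffice" in result.stdout
```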

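File formats supported by LibreOffice

LibreOffice can convert between many file formats. By changing the arguments in the Python snippet above, you can convert between all sorts of formats, for example xlsx->PDF, word->html, ppt->PDF, and so on.

For the full list of supported formats, see the official documentation: https://help.libreoffice.org/6.3/en-US/text/shared/guide/convertfilters.html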
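
As a rough illustration of changing the arguments, the target format can be turned into a parameter. build_convert_command and convert_file are hypothetical names introduced for this sketch; the flags are the same ones used in the snippet earlier, and the value passed to --convert-to has to be a format LibreOffice actually supports for that input.

```python
import subprocess

def build_convert_command(soffice, input_file, output_dir, target_format="pdf"):
    """Build the headless LibreOffice command line for one conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,
        input_file,
        "--outdir", output_dir,
    ]

def convert_file(soffice, input_file, output_dir, target_format="pdf"):
    """Run the conversion; raises CalledProcessError on a non-zero exit status."""
    command = build_convert_command(soffice, input_file, output_dir, target_format)
    subprocess.run(command, check=True)
```

For example, convert_file("soffice", "report.xlsx", "output_pdf") would request a PDF, while passing "html" as the last argument would request HTML instead.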

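Implementation option 4

If you want to install neither Office Word nor LibreOffice, is there still a way to convert Word documents to PDF?

There is, but the original formatting can no longer be preserved: headings, font sizes, typefaces, indentation, tables, and the like are not guaranteed to survive.

The idea is to read the contents of the Word file with the third-party library python-docx (pip install python-docx). It reads the file directly, so it works without Office Word installed. A second third-party library, such as reportlab or pdfkit, then generates the PDF.

However, the results of this approach are fairly poor and the implementation is tedious, so I will not go into detail here.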

Final complete code

Finally, here is the complete, finished Python code:

import os
import subprocess

def convert_to_pdf(input_file, output_directory):
    # Replace the soffice path below with the LibreOffice path on your own machine
    subprocess.run(['/Applications/LibreOffice.app/Contents/MacOS/soffice', '--headless', '--convert-to', 'pdf', input_file, '--outdir', output_directory])


word_path = './word-files'
output_pdf_path = './output_pdf'

# Find files with a .docx or .doc extension in the given directory; subdirectories and other extensions are ignored
# To include subdirectories as well, use os.walk()
files = [
    f
    for f in os.listdir(word_path)
    if f.lower().endswith(('.doc', '.docx'))
]
files.sort()
# print(files)

# Create the output directory if it does not exist yet
os.makedirs(output_pdf_path, exist_ok=True)

for file in files:
    # os.path.splitext() splits the name from its extension; file_ext keeps the leading dot (for example .docx)
    file_name, file_ext = os.path.splitext(file)
    word_filename = os.path.join(word_path, file)
    pdf_filename = os.path.join(output_pdf_path, file_name + '.pdf')
    print(word_filename, '-->', pdf_filename)
    convert_to_pdf(word_filename, output_pdf_path)

    # Alternative on a machine with Microsoft Word installed (requires: import docx2pdf)
    # docx2pdf.convert(word_filename, pdf_filename)
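
The comment in the script points to os.walk() for including subdirectories. A minimal sketch of that variant follows. iter_word_files is a hypothetical helper, and mirroring the input directory layout under the output directory is my own design choice, not part of the script above.

```python
import os

def iter_word_files(root, output_root):
    """Yield (word_path, output_dir) pairs for every .doc/.docx file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        relative = os.path.relpath(dirpath, root)
        # Mirror the subdirectory layout below output_root
        output_dir = os.path.normpath(os.path.join(output_root, relative))
        for name in sorted(filenames):
            if name.lower().endswith((".doc", ".docx")):
                yield os.path.join(dirpath, name), output_dir
```

Each pair can then be handed to a conversion function that takes a Word file and an output directory. Creating the output directory first with os.makedirs(output_dir, exist_ok=True) avoids relying on LibreOffice to create it.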