markitdown

A Python tool for converting files and Office documents to Markdown. It can optionally use a large language model to describe images.

Key features

MarkItDown is a utility for converting various files to Markdown (for example, for indexing or text analysis). It supports:

PDF
PowerPoint
Word
Excel
Images (EXIF metadata and OCR)
Audio (EXIF metadata and speech transcription)
HTML
Text-based formats (CSV, JSON, XML)
ZIP files (iterates over contents)

Installation and usage

To install MarkItDown, use pip: pip install markitdown. Alternatively, you can install it from source: pip install -e .

Command line

markitdown path-to-file.pdf > document.md

You can also pipe content to it:

cat path-to-file.pdf | markitdown

Python API

Basic usage in Python:

from markitdown import MarkItDown


md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
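
The same two calls, convert() and text_content, can be wrapped in a loop to convert a whole folder. The following is a minimal sketch, not part of MarkItDown: the helper name convert_tree, the default suffix list, and the "<name>.md" output naming are assumptions made for illustration. The converter is passed in as an argument so that the helper works with any object exposing convert(path).text_content.

```python
from pathlib import Path


def convert_tree(converter, src_dir, out_dir,
                 suffixes=(".pdf", ".docx", ".pptx", ".xlsx")):
    """Convert matching files directly inside src_dir to Markdown files.

    `converter` is any object with a convert(path) method whose result
    has a text_content attribute, for example a MarkItDown instance.
    Returns the list of Markdown files that were written.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Non-recursive on purpose: file names within one folder are unique,
    # so "<original name>.md" cannot collide.
    for path in sorted(src_dir.glob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        result = converter.convert(str(path))
        target = out_dir / (path.name + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written


# Usage (requires `pip install markitdown`; folder names are placeholders):
#   from markitdown import MarkItDown
#   convert_tree(MarkItDown(), "docs", "markdown_out")
```

Keeping the original extension in the output name (report.pdf.md rather than report.md) avoids overwriting when report.pdf and report.docx sit in the same folder.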

To use a large language model for image descriptions, provide llm_client and llm_model:

from markitdown import MarkItDown
from openai import OpenAI


client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
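
Because the LLM arguments are optional, a script may want to enable them only when credentials are available. The sketch below shows one way to do that. The helper llm_kwargs is hypothetical and not part of MarkItDown; it assumes the key is stored in the OPENAI_API_KEY environment variable, which is the variable the OpenAI client reads by default.

```python
import os


def llm_kwargs(env=None, model="gpt-4o"):
    """Return keyword arguments for MarkItDown(), or {} when no API key is set.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # No credentials: fall back to a plain MarkItDown() without an LLM.
        return {}
    from openai import OpenAI  # imported lazily so the fallback needs no openai install
    return {"llm_client": OpenAI(api_key=api_key), "llm_model": model}


# Usage (requires `pip install markitdown openai`):
#   from markitdown import MarkItDown
#   md = MarkItDown(**llm_kwargs())
```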

Docker

docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md

Running tests and checks

Install hatch in your environment and run the tests:

pip install hatch  # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
hatch shell
hatch test

(Alternative) Use the Devcontainer, which has all dependencies installed:

# Reopen the project in Devcontainer and run:
hatch test

Run the pre-commit checks before submitting a PR: pre-commit run --all-files
