Incrementally resume long-PDF ingestion using cached PageIndex doc_id #43

Open
plasma16 wants to merge 1 commit into VectifyAI:main from plasma16:feat/long-pdf-resume
plasma16 commented May 7, 2026

Summary

  • add long-PDF ingest checkpoint state in .openkb/long_pdf_jobs.json
  • cache doc_id and description after successful PageIndex indexing
  • on re-run, reuse cached doc_id for long PDFs and retry compilation directly
  • persist index/compile failure state for troubleshooting and incremental retry
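For readers who want a picture of what such a checkpoint file could hold, here is a minimal sketch. Only the path `.openkb/long_pdf_jobs.json` and the cached fields `doc_id` and `description` come from the summary above; the per-file key, the `status` and `error` fields, and every function name are hypothetical and may not match the actual diff.

```python
import json
import os
import tempfile
from pathlib import Path

# Path taken from the PR summary; everything else below is an assumed layout.
JOBS_FILE = Path(".openkb") / "long_pdf_jobs.json"


def load_jobs(path=JOBS_FILE):
    """Return the checkpoint mapping, or an empty dict if no file exists yet."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_jobs(jobs, path=JOBS_FILE):
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(jobs, fh, indent=2)
    os.replace(tmp, path)


def record_indexed(jobs, key, doc_id, description):
    """Cache the indexing result so a later run can skip re-indexing."""
    jobs[key] = {
        "status": "indexed",  # hypothetical status value
        "doc_id": doc_id,
        "description": description,
        "error": None,
    }


def record_failure(jobs, key, stage, error):
    """Persist which stage failed ("index" or "compile") and why."""
    entry = jobs.setdefault(key, {"doc_id": None, "description": None})
    entry["status"] = f"{stage}_failed"
    entry["error"] = str(error)
```

Keeping a cached `doc_id` in the entry even after a compile failure is what would let a retry skip straight to compilation.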

Why

When long PDF ingestion fails after indexing, re-running currently re-indexes the same document. This change makes retries incremental for long PDFs while leaving existing skip behavior unchanged for other file types.
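The retry behavior described here can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `ingest_long_pdf`, `index_fn`, `compile_fn`, and the `status`/`error` fields are hypothetical stand-ins, and the checkpoint state is an in-memory dict rather than the JSON file.

```python
def ingest_long_pdf(key, jobs, index_fn, compile_fn):
    """Index only when no doc_id is cached, then (re)try compilation.

    index_fn(key) -> (doc_id, description) and compile_fn(doc_id, description)
    are hypothetical callables standing in for the real indexing and
    compilation steps.
    """
    entry = dict(jobs.get(key, {}))
    doc_id = entry.get("doc_id")

    if doc_id is None:
        # No cached result: this is the expensive step we want to avoid repeating.
        try:
            doc_id, description = index_fn(key)
        except Exception as exc:
            jobs[key] = {"status": "index_failed", "doc_id": None,
                         "description": None, "error": str(exc)}
            raise
        entry = {"status": "indexed", "doc_id": doc_id,
                 "description": description, "error": None}
        jobs[key] = entry

    try:
        compile_fn(doc_id, entry.get("description"))
    except Exception as exc:
        # Keep the cached doc_id so the next run retries compilation directly.
        entry["status"] = "compile_failed"
        entry["error"] = str(exc)
        jobs[key] = entry
        raise

    entry["status"] = "done"
    entry["error"] = None
    jobs[key] = entry
    return doc_id
```

In this sketch, a run that fails during compilation leaves `doc_id` in the state, so the second call never invokes `index_fn` again.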

Scope

  • only long-document (long_pdf) ingestion path
  • no queue/cursor behavior for non-PDF files
