
Fix XLSX ingestion memory spikes with streaming parser #42

Open
plasma16 wants to merge 1 commit into VectifyAI:main from plasma16:fix/xlsx-memory-spike

Conversation


@plasma16 plasma16 commented May 7, 2026

Summary

  • route .xlsx conversion through a streaming openpyxl reader (read_only=True, data_only=True)
  • cap scan bounds (max_rows=5000, max_cols=64) to prevent pathological worksheet ranges from exploding memory
  • stop scanning after sustained empty tails to avoid sparse-sheet runaway processing
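
The capped, tail-aware scan described above can be sketched as a small helper over a row iterator. This is a minimal illustration, not the PR's actual implementation: the function name `scan_rows` and the empty-tail threshold of 100 rows are assumptions (the PR does not state a threshold), while the `max_rows=5000` / `max_cols=64` caps come from the summary.

```python
from itertools import islice

def scan_rows(rows, max_rows=5000, max_cols=64, empty_tail=100):
    """Collect worksheet rows with bounded scanning.

    `rows` is any iterable of cell-value tuples (e.g. what
    openpyxl's ws.iter_rows(values_only=True) yields in
    read_only mode). Rows beyond `max_rows` and cells beyond
    `max_cols` are ignored; scanning stops early after
    `empty_tail` consecutive empty rows, and trailing empty
    rows are trimmed from the result.
    """
    def is_empty(cells):
        return all(c is None or str(c).strip() == "" for c in cells)

    kept, empty_run = [], 0
    for row in islice(rows, max_rows):          # hard row cap
        cells = list(row)[:max_cols]            # hard column cap
        if is_empty(cells):
            empty_run += 1
            if empty_run >= empty_tail:         # sustained empty tail: stop
                break
        else:
            empty_run = 0
        kept.append(cells)
    while kept and is_empty(kept[-1]):          # drop the trailing blanks we buffered
        kept.pop()
    return kept
```

In the streaming setup the summary describes, this would be fed from a read-only workbook, e.g. `load_workbook(path, read_only=True, data_only=True)` followed by `ws.iter_rows(values_only=True)`, so only one row is materialized at a time.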

Why

Some workbooks report huge used ranges (e.g. max_row=1048571) despite having very little real data, which can cause generic converters to consume excessive RAM.

Result

Significantly lower memory use during XLSX ingest while preserving useful sheet content for KB compilation.

@plasma16 plasma16 force-pushed the fix/xlsx-memory-spike branch from 9075e4c to c3a1f11 Compare May 7, 2026 07:30
