Hacker News

Show HN: DataBridge: Open-Source Multi-Modal Parser and Retriever for LLMs

Hacker News - Tue, 03/11/2025 - 11:40am

Hey HN!

When testing complex PDFs (with diagrams, formulas, etc.), I found that Gemini would straight up hallucinate, while GPT-4o asked me to run a script to parse the file: "It looks like you uploaded a PDF, but I can't directly analyze it unless I extract its contents. Do you want me to process the document and find the relevant data related to IRR (Internal Rate of Return)?" That script only handled plain text, though, and missed the tables and images where the actual answer was.

We're building DataBridge, an open-source parser and retriever using specialized multi-modal embeddings (ColPali-inspired), aiming to solve this efficiently and cheaply. It handles PDFs, videos, DOCs, and TXT files, and can also take in rules to extract or transform content. In my experience, higher-quality ingestion significantly improves retrieval accuracy for LLMs.

db.ingest_file(file="/path/file.pdf", filename="report2025", use_colpali=True, rules=[MetadataExtraction(schema=json_object)])

This single command embeds text, visual elements, and metadata into one vector space. This article explains ColPali vs. traditional parsing pipelines better than I can: https://medium.com/@shashankvats/colpali-explained-bridging-....
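For anyone unfamiliar with ColPali-style retrieval, here is a rough sketch of the late-interaction (MaxSim) scoring that ColPali uses to compare a query against a page. This is not DataBridge's code or API: the function names (`maxsim_score`, `rank_pages`) and the toy 2-D vectors are made up for illustration, and a real system would get these vectors from a vision-language model rather than by hand.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query token vector, take its best
    match among the page's patch vectors, then sum those best matches."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: List[Vector], pages: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Score every page against the query and sort best-first.
    `pages` maps a hypothetical page name to its patch vectors."""
    scored = [(name, maxsim_score(query_vecs, vecs)) for name, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (assumed, hand-written): two query token vectors, two pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "chart_page": [[1.0, 0.0], [0.0, 1.0]],  # has a patch matching each query token
    "text_page": [[0.5, 0.5]],               # one patch, partial match for both
}
ranking = rank_pages(query, pages)
```

The point of keeping one vector per patch, instead of pooling a page into a single vector, is that a small table cell or chart label can still win the match for the one query token it answers.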

We’re also currently working on an MCP server integration so tools like Claude Desktop can directly benefit from this enhanced context.

Would love your thoughts and any feedback.

Docs: https://databridge.gitbook.io/databridge-docs

Comments URL: https://news.ycombinator.com/item?id=43333571

Points: 1

# Comments: 0

Categories: Hacker News

Schelling Points

Hacker News - Tue, 03/11/2025 - 11:38am
Categories: Hacker News

Ask HN: How can DuckDuckGo afford their free private LLM model access?

Hacker News - Tue, 03/11/2025 - 11:35am

They allow private access to o3-mini, llama 3.3 70B, etc.

It is a nice service, but it must cost real money to provide.

Comments URL: https://news.ycombinator.com/item?id=43333507

Points: 3

# Comments: 1

Categories: Hacker News
