Hacker News
Show HN: DataBridge: Open-Source Multi-Modal Parser and Retriever for LLMs
Hey HN!
When testing complex PDFs (with diagrams, formulas, etc.), I found Gemini would straight up hallucinate, while GPT-4o asked me to run a script to parse the file: "It looks like you uploaded a PDF, but I can't directly analyze it unless I extract its contents. Do you want me to process the document and find the relevant data related to IRR (Internal Rate of Return)?" That script only handled plain text, though, and missed the tables and images where the actual answer was.
We're building DataBridge, an open-source parser and retriever using specialized multi-modal embeddings (ColPali-inspired), aiming to solve this efficiently and cheaply. It handles PDFs, videos, DOCs, and TXT files, and can also take in rules to extract or transform content. In my experience, higher-quality ingestion significantly improves retrieval accuracy for LLMs.
db.ingest_file(file="/path/file.pdf", filename="report2025", use_colpali=True, rules=[MetadataExtraction(schema=json_object)])
This single command embeds text, visual elements, and metadata into one vector space. This article explains ColPali vs. traditional parsing pipelines better than I can: https://medium.com/@shashankvats/colpali-explained-bridging-....
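For anyone unfamiliar with ColPali-style retrieval, here is a minimal sketch of late-interaction (MaxSim) scoring, the scheme ColPali uses to compare a query against a page. This is not DataBridge's code: the function names and toy 2-D vectors below are made up for illustration, and a real model would produce many high-dimensional vectors per query and per page.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query vector, take its best match
    among the page's patch vectors, then sum those best matches."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Hypothetical toy embeddings: two query-token vectors, two patch vectors per page.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]
page_b = [[0.5, 0.5], [0.4, 0.4]]

# Rank pages by score, highest first.
pages = {"page_a": page_a, "page_b": page_b}
ranked = sorted(pages.items(), key=lambda kv: maxsim_score(query, kv[1]), reverse=True)

for name, vecs in ranked:
    print(name, round(maxsim_score(query, vecs), 3))
```

Because each query vector is matched against every patch of the page image, a table or diagram region can win the match even when no extracted text exists for it, which is the property that plain-text parsing misses.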
We’re also currently working on an MCP server integration so tools like Claude Desktop can directly benefit from this enhanced context.
Would love your thoughts and any feedback.
Docs: https://databridge.gitbook.io/databridge-docs
Comments URL: https://news.ycombinator.com/item?id=43333571
Points: 1
# Comments: 0
Certificate Apocalypse: Bringing Your Chromecast Back from the Dead
Article URL: https://mensfeld.pl/2025/03/bringing-your-chromecast-back/
Comments URL: https://news.ycombinator.com/item?id=43333568
Points: 2
# Comments: 0
How the US economy went from booming to a recession scare in only 20 days
Article URL: https://www.cnn.com/2025/03/11/business/recession-economy-trump-dow-stocks/index.html
Comments URL: https://news.ycombinator.com/item?id=43333565
Points: 4
# Comments: 0
Schelling Points
Article URL: https://en.wikipedia.org/wiki/Focal_point_(game_theory)
Comments URL: https://news.ycombinator.com/item?id=43333543
Points: 1
# Comments: 0
Saving U.S. Climate and Environmental Data Before It Goes Away
Article URL: https://e360.yale.edu/features/eric-nost-interview
Comments URL: https://news.ycombinator.com/item?id=43333531
Points: 4
# Comments: 0
The IT world moves fast, so why are admins slow to upgrade?
Article URL: https://www.theregister.com/2025/03/10/database_upgrades_comment/
Comments URL: https://news.ycombinator.com/item?id=43333529
Points: 1
# Comments: 0
Microplastics Are Messing with Photosynthesis in Plants
Article URL: https://www.scientificamerican.com/article/microplastic-pollution-is-messing-with-photosynthesis-in-plants/
Comments URL: https://news.ycombinator.com/item?id=43333528
Points: 3
# Comments: 0
I decided to quit a Director's job
Article URL: https://zaidesanton.substack.com/p/losing-60k-why-i-decided-to-quit
Comments URL: https://news.ycombinator.com/item?id=43333511
Points: 2
# Comments: 0
New Biosensor can detect airborne bird flu
Article URL: https://source.washu.edu/2025/03/new-biosensor-can-detect-airborne-bird-flu-in-under-five-minutes/
Comments URL: https://news.ycombinator.com/item?id=43333510
Points: 2
# Comments: 0
Ask HN: How can DuckDuckGo afford their free private LLM model access?
They allow private access to o3-mini, llama 3.3 70B, etc.
It is a nice service, but it must cost real money to provide.
Comments URL: https://news.ycombinator.com/item?id=43333507
Points: 3
# Comments: 1
Closing the borders alone won't fix the problems
Article URL: https://world.hey.com/dhh/closing-the-borders-alone-won-t-fix-the-problems-aca49088
Comments URL: https://news.ycombinator.com/item?id=43333496
Points: 1
# Comments: 1
AI-Generated Voice Evidence Poses Dangers in Court
Article URL: https://www.lawfaremedia.org/article/ai-generated-voice-evidence-poses-dangers-in-court
Comments URL: https://news.ycombinator.com/item?id=43333484
Points: 2
# Comments: 0
Ozempic's New Frontier: The War on Aging
Article URL: https://www.wsj.com/health/wellness/ozempic-weight-loss-drug-aging-health-benefits-d93a22f8
Comments URL: https://news.ycombinator.com/item?id=43332983
Points: 1
# Comments: 0
Trump doubles planned tariffs on Canadian steel and aluminum to 50%
Article URL: https://apnews.com/article/trump-economy-tariffs-stock-musk-business-8a5f28d9bb16e0b8a924d99ead0907fa
Comments URL: https://news.ycombinator.com/item?id=43332980
Points: 1
# Comments: 0
Write more blog articles, not fewer
Article URL: https://rmoff.net/2025/03/11/write-more-blog-articles-not-fewer-dont-leave-the-scraps-on-the-cutting-floor/
Comments URL: https://news.ycombinator.com/item?id=43332977
Points: 1
# Comments: 0
Solar adds more new capacity to US grid in 2024 than any source in 20 years
Article URL: https://electrek.co/2025/03/10/solar-new-capacity-us-grid-2024/
Comments URL: https://news.ycombinator.com/item?id=43332968
Points: 1
# Comments: 1
NIST Selects HQC as Fifth Algorithm for Post-Quantum Encryption
Article URL: https://www.nist.gov/news-events/news/2025/03/nist-selects-hqc-fifth-algorithm-post-quantum-encryption
Comments URL: https://news.ycombinator.com/item?id=43332944
Points: 2
# Comments: 0
VictoriaLogs Source Reading
Article URL: https://medium.com/@waynest/victorialogs-source-reading-833db3e8511b
Comments URL: https://news.ycombinator.com/item?id=43332921
Points: 2
# Comments: 0
18F Transformed Government Technology and Why Its Elimination Matters
Article URL: https://theconversation.com/how-18f-transformed-government-technology-and-why-its-elimination-matters-251333
Comments URL: https://news.ycombinator.com/item?id=43332902
Points: 4
# Comments: 1
Single-Base Editing Seems to Work
Article URL: https://www.science.org/content/blog-post/single-base-editing-seems-work
Comments URL: https://news.ycombinator.com/item?id=43332897
Points: 1
# Comments: 0