Search, Tika, and Document Processing with Tim Allison

A nerd out about Enterprise Search, information extraction from documents, a bit of PDF history and more

Enterprise search, not Google search, is big business, especially with the advent of LLMs and people wanting to know everything about an organisation immediately!

To dig into this, I interviewed Tim Allison, who recently left NASA’s JPL, where he worked on several significant projects, including DARPA’s Safedocs program, which looked at how to make PDFs more secure against attacks. One result was a new corpus of 8 million PDFs for developers to use.

We also discussed Tim’s association with Apache Tika, the document parsing toolkit that powers many search systems today, and how it extracts information from PDFs and other documents across many formats.

Search and document parsing have many wrinkles. What documents are you parsing? Are they text, images or something completely different? How will you or your users search for them? How will you tune your engine so that users find meaningful content fast? Search is not set and forget: you can’t just stick everything in an Elasticsearch cluster and hope for the best. I also expect most people don’t realise how much open-source software is used under the hood to deliver results to users. From the query interface to the document parser and document storage, so much of this is open source and often maintained by teams small enough to count on your fingers.

If you’re interested in what Tim does, you can follow him on LinkedIn. If you’re interested in Apache Tika, click here, and if you’re interested in Safedocs, you can find out more right here.

Idea Ignition: Fueling Startups from Concept to Cloud
Transforming startup visions into cloud realities - from concept inception to market-ready MVP.
Appears in episode
Tom Barber