Web crawling and API traversal have existed for about as long as the internet itself. Many crawlers have come before, so why, when I was working at the Accelerator for Princeton University, did we embark on writing a new one?
Crawlers are often designed for a single task: crawling websites in general, or crawling one specific target. I wanted to create something that would serve more than one purpose. There is always a lot of replicated code when building separate crawlers, and I wanted to remove as much of that as possible whilst also providing a standardised output format for every site we crawled, to make the ETL process as easy as possible.
This is why the distributed crawler was born. It allowed us to crawl both Telegram and the YouTube Data API without having to rebuild huge chunks of very similar crawler logic for each platform. I also wanted it to support multiple cloud backends, to offer fairly extensible configuration properties for the researchers who were going to use it, and to have as small a footprint as possible.
I started with the Telegram portion of the crawler and, as ever, building and testing something that required many SIM cards to operate was an interesting challenge. With a SIM farm in my office, I succeeded in connecting the crawler to multiple Telegram API backends using a distributed technique, allowing the platform to scale quite effectively. Of course, ensuring that the data you collect is both accurate and complete is its own challenge when you are crawling multiple channels inside a Telegram account to build a global picture of what is happening on the Telegram network. That involved a lot of manual verification, checkpoints, and checksums to ensure the platform curated the data the researchers needed to do the work that mattered to them.
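To give a flavour of the kind of bookkeeping involved, here is a minimal sketch of the checkpoint-and-checksum idea in Go. The types, field names, and key choices are illustrative only, not the crawler's actual code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Checkpoint records how far a channel has been crawled and a digest of what
// was collected, so a restarted worker can resume and verify its coverage.
type Checkpoint struct {
	ChannelID    int64  `json:"channel_id"`
	LastMessage  int64  `json:"last_message_id"`
	MessageCount int    `json:"message_count"`
	Digest       string `json:"digest"` // SHA-256 over the collected message IDs
}

// digestMessages produces a deterministic checksum over the message IDs we
// collected, so two crawl passes over the same range can be compared.
func digestMessages(ids []int64) string {
	h := sha256.New()
	for _, id := range ids {
		fmt.Fprintf(h, "%d\n", id)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	collected := []int64{1001, 1002, 1005} // message IDs from one crawl pass
	cp := Checkpoint{
		ChannelID:    42,
		LastMessage:  collected[len(collected)-1],
		MessageCount: len(collected),
		Digest:       digestMessages(collected),
	}
	out, _ := json.MarshalIndent(cp, "", "  ")
	fmt.Println(string(out)) // persisted to the backend between runs
}
```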
The other interesting challenge was collating the multimedia that is now synonymous with every social media network. That meant finding files, downloading different file types, processing them, and extracting additional data using OCR and metadata extraction on the files themselves, so that researchers had search context and lookups when investigating topics embedded in the media.
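Conceptually, the media handling behaved like a pipeline of extractors run over each downloaded file. The sketch below is a simplified, hypothetical version of that idea; the interface and names are not lifted from the real codebase.

```go
package media

import "context"

// Extractor pulls searchable context out of one kind of media file.
// OCR and metadata extraction are two possible implementations.
type Extractor interface {
	Supports(mimeType string) bool
	Extract(ctx context.Context, path string) (map[string]string, error)
}

// Pipeline runs every applicable extractor over a downloaded file and merges
// the results into a single set of searchable fields.
type Pipeline struct {
	Extractors []Extractor
}

func (p *Pipeline) Process(ctx context.Context, path, mimeType string) (map[string]string, error) {
	fields := map[string]string{}
	for _, e := range p.Extractors {
		if !e.Supports(mimeType) {
			continue
		}
		out, err := e.Extract(ctx, path)
		if err != nil {
			return nil, err // in practice you would log and continue
		}
		for k, v := range out {
			fields[k] = v
		}
	}
	return fields, nil
}
```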
Once I had built out this framework, I realised I wanted to make it pluggable so I could add the YouTube Data API. At that point I leveraged Claude Code to help me refactor the existing codebase towards a more generic interface that would let me plug in different social media endpoints. This was quite effective, and it gave Claude a useful entry point to then write the YouTube API layer on top of the Go libraries that YouTube provides.

The YouTube layer was easier to implement in a lot of respects, not least because it didn't require access to SIM cards. However, the YouTube Data API is quite restrictive in terms of both the data you can get and the amount you can collect. Luckily, I was working for Princeton University, so we made a successful application to YouTube for access to their research API. This is essentially the same as the Data API, but scaled up for larger-volume data access.

What you don't get with either the Data API or the research API is the video content itself, nor the transcriptions for that content (unless, of course, you happen to own the video). You can analyse titles, descriptions, and engagement metrics such as comments, but you cannot access the content of the video itself. That isn't immediately problematic, but it could cause problems for people wanting to build on this in the future.

There is a workaround, and we will look to add it to the distributed crawler in the future: an open-source project called InnerTube. InnerTube leverages a public yet undocumented API that exposes a number of YouTube features that aren't readily available through the Data API. That includes the transcriptions we otherwise can't access, as well as things like the related videos you would see in the sidebar. The assumption is that smart TVs and other smart devices need access to YouTube without a user being logged in, so a public API has to exist for them; InnerTube cleverly leverages this to provide the same functionality as a library for other applications.
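In spirit, the generic interface looked something like the Go sketch below; the type and method names are illustrative rather than the actual ones in the repository.

```go
package crawler

import "context"

// Post is the platform-agnostic unit of collected content, feeding into the
// unified output format discussed later in this post.
type Post struct {
	Platform  string
	ID        string
	AuthorID  string
	Text      string
	MediaURLs []string
	Raw       []byte // original payload, kept for provenance
}

// Platform is what each social media integration (Telegram, YouTube, ...)
// implements so the core crawler can drive it without platform-specific code.
type Platform interface {
	Name() string
	// Crawl streams posts for a single target (a Telegram channel, a YouTube
	// channel ID, a search query) onto out until the context is cancelled.
	Crawl(ctx context.Context, target string, out chan<- Post) error
}
```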
I mentioned earlier that I wanted the distributed crawler to support multiple backends. The reason, of course, is that beyond the Azure Kubernetes cluster where we were going to run it, people might want to run this in a range of different places. Being able to leverage different backends was important:
Local mode, so you can run it on your laptop (as a lot of people will do)
An Azure backend, built on the Azure SDK, to support standard cloud operations
The other backend I implemented inside the distributed crawler was support for Dapr, a microservices framework from the Linux Foundation. The cool thing about Dapr is that it has plugins for a whole range of different storage features and functionality. Rather than my having to implement lots of additional backends beyond the core ones I've just mentioned, it lets people store the data in whichever services they are interested in leveraging as their own backends. This gives users a very flexible deployment strategy when it comes to spinning this up, especially in cloud environments and even more so inside a Kubernetes deployment.
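As a rough illustration of what a Dapr-backed storage call looks like, here is a minimal sketch using the Dapr Go SDK's state API. The component name "statestore" and the key scheme are assumptions made for the example, not the crawler's actual configuration.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	dapr "github.com/dapr/go-sdk/client"
)

// post is a trimmed-down stand-in for the crawler's unified record.
type post struct {
	Platform string `json:"platform"`
	ID       string `json:"id"`
	Text     string `json:"text"`
}

func main() {
	// The Dapr sidecar decides whether "statestore" resolves to Azure Blob
	// Storage, Redis, Postgres, etc.; the crawler code stays the same.
	client, err := dapr.NewClient()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	p := post{Platform: "telegram", ID: "1001", Text: "example post"}
	data, err := json.Marshal(p)
	if err != nil {
		log.Fatal(err)
	}

	key := fmt.Sprintf("%s-%s", p.Platform, p.ID) // hypothetical key scheme
	if err := client.SaveState(context.Background(), "statestore", key, data, nil); err != nil {
		log.Fatal(err)
	}
}
```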
The other interesting thing with Dapr is that you can use not just storage but also messaging protocols to move the data around. Rather than persisting it straight to disk, you could, for example, send it to a message bus and have a different service process it in real time. And so real-time data processing and streaming for the distributed crawler became a reality inside the project: we could send the data straight to a message bus, have it picked up by Databricks, and have it processed almost instantly.
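Here is a hedged sketch of the same idea using Dapr pub/sub instead of state storage; the component name "pubsub" and the topic "posts" are placeholders for whatever a deployment actually configures.

```go
package main

import (
	"context"
	"log"

	dapr "github.com/dapr/go-sdk/client"
)

// publishPost sends a serialised post to a Dapr pub/sub topic so a downstream
// consumer (for example, a Databricks ingestion job) can pick it up in near
// real time. The component name "pubsub" and topic "posts" are illustrative.
func publishPost(client dapr.Client, data []byte) error {
	return client.PublishEvent(context.Background(), "pubsub", "posts", data)
}

func main() {
	client, err := dapr.NewClient()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	if err := publishPost(client, []byte(`{"platform":"telegram","id":"1001"}`)); err != nil {
		log.Fatal(err)
	}
}
```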
The final piece of the puzzle was the unified output format I alluded to earlier. This format follows very closely an output specification originally used by Junkipedia, the standard specification they use across the different social media sites in their service. It made sense for us to mimic it closely so that we could also ingest their data, should we need to, to bolster the platform.
The idea is that every social media post has basically the same content; it's just the metadata surrounding it that differs slightly. Supporting those different types while keeping the platform's output in a consistent shape means the data can be ingested very easily when it hits Databricks. Regardless of which social media platform you are crawling, you know the process of getting that data into a readable format will be much shorter than if you were running individual crawlers for each service.
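To make that concrete, here is a hypothetical example of what a unified record might look like once serialised. The field names are illustrative and do not reproduce the actual Junkipedia-derived specification.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// UnifiedPost is the shared shape every platform crawler emits. Only the
// platform-specific details end up in the Metadata map.
type UnifiedPost struct {
	Platform    string            `json:"platform"`
	ID          string            `json:"id"`
	AuthorID    string            `json:"author_id"`
	PublishedAt time.Time         `json:"published_at"`
	Text        string            `json:"text"`
	MediaURLs   []string          `json:"media_urls,omitempty"`
	Metadata    map[string]string `json:"metadata,omitempty"` // platform-specific extras
}

func main() {
	posts := []UnifiedPost{
		{Platform: "telegram", ID: "1001", AuthorID: "channel-42",
			PublishedAt: time.Now(), Text: "a telegram message",
			Metadata: map[string]string{"views": "1200"}},
		{Platform: "youtube", ID: "dQw4w9WgXcQ", AuthorID: "UC123",
			PublishedAt: time.Now(), Text: "a video title and description",
			Metadata: map[string]string{"like_count": "345"}},
	}
	out, _ := json.MarshalIndent(posts, "", "  ")
	fmt.Println(string(out)) // both platforms land in Databricks the same way
}
```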
And that, basically, is the distributed crawler. The cool thing about it is that I was allowed to open source it: you can find a link to the distributed crawler on GitHub right here. That is the platform to date. It supports a number of different backends, as we've discussed, and it currently supports Telegram and YouTube, but we have plans to add more over time, and to make it more extensible and more useful for a whole range of social media sites, and probably standard website crawling as well, in the future.