A peek inside Qu
Clinton Dreisbach on the guts of our online data engine
This past January, we launched a data exploration tool that lets the public dissect the 113 million mortgage applications that Americans filed between 2007 and 2012. This data, from the Home Mortgage Disclosure Act (HMDA), has been publicly available for years, but never before has it been in such an accessible format. This web-based tool is powered by Qu, a data platform we built to help us quickly deliver data like this to researchers and developers. To learn about it and where it’s going, I spoke to Clinton Dreisbach, Qu’s lead developer.
To start, what is Qu? Can you give some of the backstory on why we built it?
Qu is an open-source platform to deliver large sets of data. It allows you to query that data, combine it with other data, and summarize that data. We built it because we wanted to serve millions of mortgage application records, and there was nothing out there that could do the same thing on the scale we were looking for. There are some smaller things—Socrata, CKAN’s data tables—and some really large enterprise-y things like Apache Drill, but nothing really in the middle, for serving 10–100 million rows of data easily.
It’s important to note that Qu isn’t just “the CFPB data platform”; it’s a platform for building your own data APIs.
Right; other people can use it for their own data sets that have nothing to do with us.
The work we’re doing right now is to make that as easy as possible.
What’s the difference between Qu and tools like Socrata and CKAN? Is it an alternative to them, or a complement?
Yes and yes? I think it makes a nice complement with CKAN, as CKAN is more focused on being a data catalog, whereas Qu is a data provider. That is, CKAN is great for showing the world your data sets, including sets in non-machine-readable formats, like PDF or Word documents. Qu is good for taking the machine-readable data and putting a simple API on top of it.
The features found on our HMDA tool—those are applications built using the API, not Qu itself, right?
Socrata and CKAN are applications. You download them and install them. They are like WordPress in this way: a web application you put on your server. You configure the application, but in the end, you have that application.
Qu was like this until recently. The big change we are making is that Qu is becoming a toolkit for building your API. It doesn’t take much, and you might only have one simple file. For example, here’s the file that runs api.consumerfinance.gov. This was generated by the Leiningen template (linked above), but you can add whatever you want.
This is how Qu has become more like Django or Rails. It makes it infinitely extensible without mucking around in the source code of Qu itself. Right now, we’ve just begun exploring what that can give us.
Elementary question: what’s the benefit to building your own API instead of just using the one that comes out of the box with one of those other products?
To be honest, right now, not a lot, apart from being able to pick up upgrades to Qu’s core software easily. But the end goal will make it matter a lot, because you will be able to pick and choose Qu components (the database, the formats) easily. Qu has always tried to follow the principle that APIs should be discoverable by a human. So, the API has an HTML interface that should let you use the whole thing.
Going back a few minutes to what you were saying about making Qu infinitely extensible, you said you’re just beginning to explore the benefits of this, but you must have had some reason for doing it to begin with.
This fell out of me wanting to make the database interchangeable. Doing that led me to think about the best way to make it switchable through configuration. And I ended up with an application template/builder rather than an application.
So what this means is, if I have a data set that I want to provide an API for, but don’t know how to build APIs, a future version of Qu will let me build a powerful one with relative ease, without forcing me to run it from a database I don’t have.
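The idea of a backend chosen through configuration can be sketched roughly as follows. This is an illustration in Python rather than Clojure, and it is not Qu’s actual adapter interface: the names `Backend`, `InMemoryBackend`, `BACKENDS`, and `build_api` are all hypothetical.

```python
from typing import Callable, Dict, List, Protocol


class Backend(Protocol):
    """Hypothetical backend interface (an assumption, not Qu's real protocol)."""

    def query(self, dataset: str, limit: int) -> List[dict]: ...


class InMemoryBackend:
    """Toy backend that serves rows from a dict of lists."""

    def __init__(self, data: Dict[str, List[dict]]) -> None:
        self._data = data

    def query(self, dataset: str, limit: int) -> List[dict]:
        # Unknown data sets return an empty result instead of raising.
        return self._data.get(dataset, [])[:limit]


# Registry mapping a configuration string to a backend factory.
# Supporting another database means adding an entry here.
BACKENDS: Dict[str, Callable[..., Backend]] = {"memory": InMemoryBackend}


def build_api(config: dict) -> Callable[[str, int], List[dict]]:
    """Assemble a query function from configuration alone."""
    factory = BACKENDS[config["backend"]]
    backend = factory(**config.get("options", {}))
    return backend.query
```

With this shape, switching databases is a change to the configuration dictionary, not to the code that serves requests.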
Cool. Why did you choose Clojure?
Two reasons. First, for dealing with this much data, we need to use all the capabilities of our machines. There are not a lot of languages out there that make using multiple threads easy, and Clojure’s one of the few. (By the way, here’s a curriculum I wrote that explains why Clojure is good at this.)
#2: Clojure is fundamentally about data. It’s not an object-oriented language. Everything in Clojure is a data structure, which fits well when you’re writing programs to transform data.
#3 (I said two, but not true): Clojure is nothing more than a library for the Java Virtual Machine. This lets us use next-level technology while still being able to use all the Java libraries that exist today. In addition, most government and corporate environments know how to deploy Java applications. It’s a nice mix of looking forward without overwhelming our existing infrastructure.
And #4: I like using Clojure. Qu started as a prototype, so I used what I know and love. The prototype grew—like they do—and became the real application.
What have been some of the bigger engineering challenges?
Figuring out how to deliver an arbitrary amount of data was a big deal. If you’re working with our mortgage application data set, you can request any amount of data for download and we will serve it. This is hard, because we have a finite amount of memory and a very large amount of data. You can ask for 4 GB of data and we will serve it, yet we never keep that much data in memory.
Clojure made this fun and easy: it has “lazy sequences,” which not only let us defer processing until we need the data, but also let us garbage-collect data after we’ve used it.
We release the memory the data was using. The only data in memory is the data currently being delivered. Once you’ve got the data, we throw it away. This allows us to service multiple requests for large data sets without exploding. Imagine a window that you can look at a bunch of data through. That window moves over the data, showing only what’s necessary at any given time.
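The moving-window idea can be approximated in Python with generators, which play a role loosely similar to Clojure’s lazy sequences. This is only a sketch under that assumption, not Qu’s code; `read_records`, `windows`, and `total_loan_amount` are made-up names, and the records are synthetic.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_records(total: int) -> Iterator[dict]:
    """Yield records one at a time instead of building a full list."""
    for i in range(total):
        yield {"id": i, "loan_amount": 100 + i}


def windows(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive windows of at most `size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def total_loan_amount(total: int, size: int) -> int:
    """Sum a field across all records while holding only one window."""
    running = 0
    for batch in windows(read_records(total), size):
        running += sum(r["loan_amount"] for r in batch)
        # `batch` is rebound on the next iteration, so the previous
        # window becomes unreachable and can be garbage-collected.
    return running
```

Only one window of records is referenced at any moment, so memory use depends on the window size rather than on the size of the data set.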
We serve that data using HTTP streaming, so you don’t have to wait for it all to be ready before you start receiving it.
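As a rough illustration of streaming a response, here is a minimal Python WSGI sketch. It is not how Qu is implemented: `csv_chunks` and `app` are hypothetical names, and the three rows are invented sample data. The point is that the response body is a generator, so a server can send each chunk as it is produced.

```python
import csv
import io
from typing import Iterable, Iterator, List


def csv_chunks(rows: Iterable[dict], fields: List[str]) -> Iterator[bytes]:
    """Encode rows as CSV, yielding the header and then one chunk per row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    yield buf.getvalue().encode("utf-8")
    for row in rows:
        # Reuse the buffer so only the current row is held in memory.
        buf.seek(0)
        buf.truncate(0)
        writer.writerow(row)
        yield buf.getvalue().encode("utf-8")


def app(environ, start_response):
    """WSGI application whose body is produced lazily, chunk by chunk."""
    start_response("200 OK", [("Content-Type", "text/csv")])
    rows = ({"id": i, "state": "NY"} for i in range(3))  # invented sample rows
    return csv_chunks(rows, ["id", "state"])
```

Because no `Content-Length` is set, the total size does not need to be known before the first bytes are sent.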
So if you want to download a big file, we don’t have to tell you, “Okay, sit tight while we generate the data set for you—then you can come back and download your huge file.” Instead, it starts immediately.
Yes, although right now we still have to do that for queries that are hard to calculate. We’re working on that.
What are some of the things at the top of your to-do list for Qu?
Our roadmap is public. I want to make Qu even easier to customize. Individual organizations should be able to take Qu and add new data backends, new endpoints, and new data formats very easily.
I want to overhaul the way you import data. Right now, it’s complex. You have to know a special format for the data definition. This should be easy to write, or even better, partially inferred.
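To make “partially inferred” concrete, here is one way column types could be guessed from sample rows. This is a speculative Python sketch, not Qu’s data definition format or its loader; `infer_type`, `infer_definition`, and the type names `integer`, `number`, and `string` are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List

KINDS = ["integer", "number", "string"]  # ordered from narrowest to widest


def infer_type(values: Iterable[str]) -> str:
    """Guess the narrowest type that fits every non-empty value."""
    widest = -1  # -1 means no non-empty value has been seen yet
    for value in values:
        if value == "":
            continue  # blanks carry no type information
        try:
            int(value)
            level = 0
        except ValueError:
            try:
                float(value)
                level = 1
            except ValueError:
                level = 2
        widest = max(widest, level)
    # A column with no usable values falls back to "string".
    return KINDS[widest] if widest >= 0 else "string"


def infer_definition(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Infer a type for each column named in the first row."""
    if not rows:
        return {}
    return {col: infer_type(row.get(col, "") for row in rows) for col in rows[0]}
```

A person would still need to review the result, since a column of ZIP codes, for example, would be inferred as integers even though it is better treated as text.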
And I want to continue to make the whole thing pluggable. For example, adding an admin dashboard adds a bunch of complexity for something you might not need. But, having a plugin for an admin dashboard lets you have that or leave it out as you wish.
My biggest goal is to get others using Qu so we can see what they need and they can contribute back. I’d love to see a CKAN/Qu integration. But I’d love to see someone else write it.
If someone else wanted to make some code contributions, where should they focus?
Definitely on data loading. It’s the first part of the app I wrote. It’s both pretty easy to understand and crufty. Here’s a sample data set ready for loading, and here is the definition file. It is huge and gross. I would love proposals on a better format for describing data coming in.
Soon, once things are a little more settled on this modularization, I’d love to see people write database adapters for DBs other than Mongo. Here are the docs on that.
Matthew Burton is the former Acting CIO of the Consumer Financial Protection Bureau. Though he has moved back to Brooklyn, he still works with the Bureau’s technology team on a part-time basis.
Clinton Dreisbach is a Clojure and Python hacker for the Consumer Financial Protection Bureau. He is the lead developer on Qu, the CFPB’s public data platform, and a contributor to Clojure, Hy, and other open source projects.