Fundamentals of Data Viz for the Web
This lesson is based on a workshop by Erie Meyer and Chris Goranson
Getting Started
In this tutorial, we'll begin exploring the CFPB dataset using RAW, a fantastic data visualization tool developed by Density Design. RAW lets you copy and paste CSV data into a window, select the type of visualization you'd like to design, and then develop the visualization further through simple drag-and-drop controls using variables of your choice.
RAW is particularly useful for two reasons: it is a great introduction to many of the data visualizations we see more and more frequently in the news and elsewhere, and it provides you with embeddable code. Working with the code yourself is a fantastic and easy way to better understand how data visualization libraries work (in this case, D3). You can practice modifying the raw code yourself and always return to the visualization to see how your changes affect the final design.
When you have time, you can see RAW in action here:
Step One: Getting the Data
Now that we know what tool we're going to use for this tutorial, let's go to the CFPB website and get some data! Open your web browser, and go to http://www.consumerfinance.gov/complaintdatabase/. Scroll down the page until you see the button that reads "Download Options and API".
Important Note: The CFPB website already provides some pretty great data visualization tools through the Socrata application. If you wanted to, you could stop here and simply click the "View Complaint Data" button to begin exploring the data. Since we want to demonstrate not only the richness and flexibility of the data on the site but also ways to visualize the information outside of the embedded tools, we're going to continue by manually downloading the data and visualizing it elsewhere.
Click on the "Download Options and API" button. In the next section you'll be presented with an option to download the actual, raw data itself.
For this tutorial, we're going to work a bit with the student loan data. On the screen, select the "Student Loan" radio button, then from the first download option select "Only complaints with a consumer narrative", and finally make sure the file format is "CSV". This will give us a file that is easy to work with in both MS Excel and RAW. Click the download button.
Now that we have our data (check your downloads folder to make sure it's there), move the dataset to a good working folder location that you'll remember for this project, and you're ready to proceed to step two.
Step Two: Setting up the Data
Now that we have our dataset, let's continue!
Go ahead and open the dataset in a spreadsheet program such as MS Excel or Google Sheets. If you have access to MS Excel, you can first open the program, then go to File --> Open, navigate to the file, and make sure you enable "All Files" in the dropdown menu (otherwise MS Excel may behave as though it can't open the CSV, which it actually can).
With the dataset open in MS Excel or other application, let's take note of a few things. First, this dataset is huge! With over 12,000 records going back to March 1st, 2012, we have a lot of data to look at. While there are data visualization methods that work well with extremely large datasets, we will want to use a subset of this data. Since we're doing some exploratory data visualization here, we're going to make some choices about what we want to begin looking at without the benefit of a broader analytical framework. For this exercise that's okay—but we would want to check our data and methods if, for example, we were making decisions on policy. Data visualizations are powerful, so we do need to take care to make sure we're using the data correctly.
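If you'd rather check the size and date range of the dataset programmatically than by scrolling through the spreadsheet, a minimal Python sketch might look like the following. The column names ("Date received", "Issue", "Company"), the MM/DD/YYYY date format, and the sample rows are assumptions made for illustration, so compare them against the header row of your own download before relying on it.

```python
import csv
import io
from datetime import datetime

# Hypothetical stand-in for the downloaded file; the column names and
# date format are assumptions, not guaranteed to match the real export.
SAMPLE_CSV = """Date received,Product,Issue,Company
03/01/2012,Student loan,Can't repay my loan,Lender A
07/15/2014,Student loan,Dealing with my lender or servicer,Lender B
01/05/2016,Student loan,Can't repay my loan,Lender A
"""


def summarize(csv_text):
    """Return the record count, column names, and earliest/latest dates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    dates = [
        datetime.strptime(row["Date received"], "%m/%d/%Y").date()
        for row in rows
    ]
    return {
        "records": len(rows),
        "columns": reader.fieldnames,
        "earliest": min(dates),
        "latest": max(dates),
    }


summary = summarize(SAMPLE_CSV)
print(summary["records"], summary["earliest"], summary["latest"])
```

To run this against the real file, you could read its contents with open(path, newline="", encoding="utf-8") and pass the text to summarize.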
Let's start by taking a subset of the data. Remember that we're already working with a subset of the student loan data: we selected only records with a complaint narrative. Next, let's narrow the dataset down even further.
First, select the column entitled "Issue". In MS Excel, the "Data" tab offers a "Filter" button that applies a filter to the column. Use it to show only the records that say "Can't repay my loan" by entering that text in the filter's search input.
Let's also look only at records since the beginning of 2016. Our data is already sorted by date, so go to the earliest record that has a date of 1/1/16 or later and select it by clicking on its row number. Next, go back to the top of the dataset and, while holding down the shift key, click row number 1 (the header row). You should now have a selection of only the records where "Issue" is "Can't repay my loan" and that were submitted since the beginning of the year. Copy the data and paste it into a new sheet.
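The same two filters (issue and date) can also be expressed in a few lines of Python, which is handy if you want to repeat the subset later with fresh data. This is only a sketch under assumptions: the column names "Date received", "Issue", and "Company", the MM/DD/YYYY date format, and the sample rows are all illustrative and may not match your download exactly.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample rows standing in for the downloaded CSV.
SAMPLE_CSV = """Date received,Product,Issue,Company
03/01/2012,Student loan,Can't repay my loan,Lender A
12/30/2015,Student loan,Can't repay my loan,Lender B
01/04/2016,Student loan,Dealing with my lender or servicer,Lender A
01/05/2016,Student loan,Can't repay my loan,Lender A
02/10/2016,Student loan,Can't repay my loan,Lender C
"""


def filter_complaints(csv_text, issue, since):
    """Keep rows whose Issue matches and whose date is on or after `since`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        received = datetime.strptime(row["Date received"], "%m/%d/%Y").date()
        if row["Issue"] == issue and received >= since:
            kept.append(row)
    return kept


subset = filter_complaints(SAMPLE_CSV, "Can't repay my loan", date(2016, 1, 1))
for row in subset:
    print(row["Date received"], row["Company"])
```

Unlike the spreadsheet approach, this version does not depend on the rows being sorted by date, because every row is checked individually.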
Step Three: Working with RAW
Now that we have the data we want, let's use RAW to start visualizing it. Open a new browser tab and go to http://raw.densitydesign.org. Next, click on the "Use it Now!" button. You'll be presented with a nice step-by-step walkthrough on how to begin building your data visualization. If you just want to start with one of the sample datasets provided, open the "choose one of our samples" pulldown menu and select a dataset of your choice. If you're ready to jump right in with the student loan data we created, simply copy the data from your spreadsheet and paste it into the input box.
If the data was read correctly, you'll see a little thumbs-up message at the bottom of the window informing you that the records were successfully parsed. Next, scroll down and select a data visualization type. You'll want to select a data visualization that works well with your data. This step might involve a bit of trial and error, but it provides a fantastic way to understand how data visualizations work. Let's start with one that allows us to see how the "Can't repay my loan" complaint breaks down by company. This might provide some interesting insight into where these types of complaints are coming from. IMPORTANT NOTE: we haven't normalized anything here yet, so it's quite possible that some of the companies with the most complaints are simply the companies with the most student loans. Before drawing any conclusions, we'd want to do some further analysis. Leave the first choice selected (Alluvial Diagram), and scroll down.
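To make the normalization caveat concrete, here is a small Python sketch that compares raw complaint counts with complaints per 1,000 loans. The company names and the loan totals are entirely hypothetical (the CFPB download does not include loan volumes), so treat this as an illustration of the idea rather than a real analysis.

```python
from collections import Counter

# Hypothetical "Company" values taken from a filtered subset.
companies = [
    "Lender A", "Lender A", "Lender A",
    "Lender B",
    "Lender C", "Lender C",
]

# Hypothetical loan volumes; in practice these would have to come
# from another data source.
loans_outstanding = {"Lender A": 3000, "Lender B": 500, "Lender C": 1000}

raw_counts = Counter(companies)


def complaints_per_thousand(counts, loans):
    """Scale each company's complaint count by its number of loans."""
    return {name: 1000 * n / loans[name] for name, n in counts.items()}


rates = complaints_per_thousand(raw_counts, loans_outstanding)
for name in sorted(rates, key=rates.get, reverse=True):
    print(name, raw_counts[name], rates[name])
```

In this made-up example, the company with the most complaints also has the lowest complaint rate, which is exactly the kind of reversal that unnormalized counts can hide.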
Next, scroll down to the "Map your dimensions" section of the tool. Drag "Issue" and "Company" under Steps.
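Conceptually, an alluvial diagram draws a band between each pair of values in adjacent steps, with the band's width reflecting how many records share that pair. The Python sketch below computes those pair counts for "Issue" and "Company". It is not RAW's own code, just an approximation of the idea, and the rows and the flow_counts helper are hypothetical.

```python
from collections import Counter

# Hypothetical rows from the filtered subset.
rows = [
    {"Issue": "Can't repay my loan", "Company": "Lender A"},
    {"Issue": "Can't repay my loan", "Company": "Lender A"},
    {"Issue": "Can't repay my loan", "Company": "Lender B"},
    {"Issue": "Can't repay my loan", "Company": "Lender A"},
]


def flow_counts(records, source, target):
    """Count how many records share each (source, target) pair."""
    return Counter((r[source], r[target]) for r in records)


flows = flow_counts(rows, "Issue", "Company")
total = sum(flows.values())

# Each pair corresponds to one band; its share of the total suggests
# how wide the band would be relative to the others.
for (issue, company), n in flows.most_common():
    print(f"{issue} -> {company}: {n} ({n / total:.0%})")
```

Because we filtered down to a single issue, every band starts from the same node on the left. Adding more issues back into the subset would produce several starting nodes.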
Finally, customize your visualization. Set the width to 700, height to 1000 and the node width to 10. Leave the other settings as they are.
If you'd like to save your data visualization, scroll down to the "Download" section to download the graphic as a PNG file. Try placing the file in a document or presentation. Alternatively, you can copy the "Embed Code" and paste it directly into a web page: