Whether you’re a manager or an analyst, if you’ve ever had to work with data, chances are you’re no stranger to spreadsheets. Much of the data we interact with is stored in spreadsheets, and their ubiquity is justified by the versatility and portability of tools such as Microsoft Excel or Google Sheets. But oftentimes, spreadsheets present data in a report format, which is conducive to reading, but not to analysis. For example, a company’s financial statements might contain a report that looks like this:
This report is easy to read, but what if we wanted to analyze this revenue data, say by making a time series forecast model for each department’s revenue, or by comparing quarterly revenues for each product line with the quarterly advertising expenditures for each product line? In order to do so, we would want to pivot the data so that it looks like this:
To execute such a pivot in Excel is not trivial, and it would require too much manual input to be scalable if there were more rows and columns in the original report. This tutorial will demonstrate how to automate the pivot in a way that scales to larger datasets, using the open-source database Postgres. Before proceeding, one disclaimer is necessary: the following approach uses iteration, meaning it is not suited for very large datasets of millions or billions of rows. Think of this approach as a good way to handle data that can fit within the 1-million-row limit of an Excel spreadsheet, while being much more efficient than manual operation. Data that does not fit in an Excel spreadsheet will likely already be in an analytical format, not a report format.
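To make the goal concrete, here is a minimal Python sketch of the same wide-to-long pivot we are about to automate in Postgres. The region names and prices are illustrative (the New York figure is the one we’ll see in the Zillow data later), and the dictionary stands in for the spreadsheet:

```python
# Report format: one row per region, one column per week (illustrative values)
report = {
    "New York, NY": {"2017-10-14": 517000, "2017-10-21": 519000},
    "Denver-Aurora-Lakewood, CO": {"2017-10-14": None, "2017-10-21": None},
}

# Analytical (long) format: one row per (region, week, price) observation
long_rows = [
    (region, week, price)
    for region, weeks in report.items()
    for week, price in weeks.items()
]
```

Each cell of the report becomes its own row, keyed by region and week, which is exactly the shape a forecasting model or a join against advertising spend would want.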
Step 1: Download Postgres
Here’s a link to download Postgres.
Step 2: Open pgAdmin, the graphical user interface (GUI) that comes with the download
You can open pgAdmin directly from the start menu. You should see something like this.
Step 3: Create database
Right click Servers in the menu on the left hand side, then choose Create -> Server. Name your server under the General tab, and enter localhost in the Host name/address box under the Connection tab. Click Save to set up the server.
Now you can set up your database, which is where we’ll be storing and working with the data. Right click the server you just created, choose
Create -> Database, and name your database. Finally, right click the database you created, and choose
Query Tool. A blank screen will come up, where we can write SQL statements.
Download Real Estate Data
For this demo, we’ll be using a dataset of the weekly median listing prices in various US metropolitan areas, as reported by Zillow.com. We’ll have to go to this link and download the
Median List Price (Raw, All Homes, Weekly) dataset.
Let’s open the Excel file we just downloaded.
The data is displayed in report format, where each row is a region, and that region’s median listing prices for each week are represented as columns. We would read this as “The median listing price of all homes in the New York metropolitan area, during the week of 10/14/2017, was 517,000 dollars”. We want to restructure the data so that it looks like this:
Here is the roadmap that we will follow to produce the desired output:
We’ll start by deleting superfluous rows and columns
- We don’t need RegionID, because it doesn’t give us any analytical value
- We can recreate the SizeRank index if we need to; for now it is unnecessary
- We don’t need the row for the entire US, since we’re just looking at individual metro regions
- We don’t need RegionType, since all the remaining rows have the same value
- We don’t need StateName, because we can parse that from the region name
- We also don’t need the first row, which is just column headers
- But let’s note that the range of dates in this dataset is from 10/14/2017 to 5/29/2021, in 7-day increments.
So here’s what we’re left with:
A couple things to notice:
- In each row, the first cell is the region, and the following cells are the median listing price at weekly intervals
- Some cells are blank — for example, this dataset does not contain median listing price data for Denver during the time interval pictured
Next, we’ll concatenate each row into a single string
In a separate sheet, we’ll type the command
=TEXTJOIN(" SPACE ", FALSE, then highlight all cells in the first row until you reach the end of the date range (191 cells, counting the region cell), close the parenthesis, and hit ENTER. The
TEXTJOIN function concatenates cells into a single string, with the contents of each cell separated by the delimiter of your choice. Typically, we might use a comma as a delimiter, but since there are commas in the region names, we’ve chosen
" SPACE " instead. Importantly, this delimiter works only because we know that no such string already exists in the data. If it did, we would choose a different delimiter. The
FALSE parameter tells the function not to ignore blank cells. This is important, because as mentioned previously, certain rows do not have data for every week.
This is what your screen should look like:
You should see one long string composed of all of the row’s cells, separated by
SPACE. Now, we just have to repeat the operation for the rest of the rows. We can do so by dragging the concatenated cell down 95 rows, because there are 95 rows in the source spreadsheet. You should now see this:
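If you want to sanity-check what TEXTJOIN is doing, here is a quick Python equivalent of the formula above, using a made-up row; str.join plays the role of TEXTJOIN, and blank cells are kept, just like the FALSE parameter specifies:

```python
# One spreadsheet row: region first, then weekly prices; a blank cell is ""
row = ["New York, NY", "517000", "", "521000"]

# Equivalent of Excel's =TEXTJOIN(" SPACE ", FALSE, ...): blanks are kept
textjoin = " SPACE ".join(row)
print(textjoin)  # New York, NY SPACE 517000 SPACE  SPACE 521000
```

Notice the blank cell still produces a chunk between two delimiters, which is what will let us recover the missing weeks later.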
Our work in Excel is now complete, and we need to save this sheet as a .csv file that can be read into our Postgres database.
Our first order of business in Postgres is to read in the data
To do so, we run the following code:
--Create a table with a single text column to house all the
--concatenated strings
CREATE TABLE excel(
    textjoin_string TEXT
);

--populate the table with concatenated strings from the .csv file
COPY excel
FROM 'C:\Users\you\Documents\textjoins.csv'
CSV;
Make sure you replace the path and filename with the path and filename corresponding to where you saved the .csv file from earlier. Note that in our copy statement, we are not using the
HEADER keyword, since the first row of the .csv is data, not column headers.
Next, we’ll define 3 tables:
- zillow_weekly_median_listprice, the table which will house our final output
- staging, the table in which we will convert each string into a pivoted table
- zillow_weekly_median_listprice_date_range, a table of the range of dates in the dataset
--Final analytical table
CREATE TABLE zillow_weekly_median_listprice(
    region TEXT,
    median_listprice TEXT,
    week_date DATE);

--Table in which text strings will be pivoted
--rownumber will be blank initially, but we will need it in order to
--define functions later
CREATE TABLE staging(
    region TEXT,
    median_listprice TEXT,
    rownumber INT);

--Table of range of dates in dataset
CREATE TABLE zillow_weekly_median_listprice_date_range AS
SELECT week_date::DATE AS week_date
FROM generate_series('10-14-2017'::DATE, '5-29-2021'::DATE, '7 day'::INTERVAL) AS week_date;

--add an autoincrement column, which we will use later to merge the
--dates onto the pivoted table
ALTER TABLE zillow_weekly_median_listprice_date_range
ADD rownumber SERIAL;
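As a sanity check on that date range, we can generate the same weekly series in plain Python; it contains exactly 190 dates, one per week from 10/14/2017 through 5/29/2021, with a 1-based counter playing the role of the SERIAL rownumber column:

```python
from datetime import date, timedelta

# Equivalent of generate_series('10-14-2017', '5-29-2021', '7 day')
start, end = date(2017, 10, 14), date(2021, 5, 29)
week_dates = []
d = start
while d <= end:
    week_dates.append(d)
    d += timedelta(weeks=1)

# 1-based numbering, like the SERIAL rownumber column
rownumbered = list(enumerate(week_dates, start=1))
```

The count of 190 is worth remembering; we’ll use it again when checking our work at the end.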
With our tables defined, we can now define 3 functions:
- pivot_string, a function which converts a single concatenated string into a pivoted table
- populate_final, a function which merges dates with the pivoted table created by the previous function, and inserts the result into the final table
- convert_textjoins, a function which iterates through each concatenated string, using the previous two functions to create the properly pivoted and merged output and insert it into the final table
--Function takes one parameter, an individual concatenated string
CREATE OR REPLACE FUNCTION pivot_string(textjoin_string TEXT)
RETURNS void AS
$$
--This part of the code splits the concatenated string into chunks
--separated by the string SPACE, with each chunk becoming a row in
--the column median_listprice of the table staging
INSERT INTO staging(median_listprice)
SELECT REGEXP_SPLIT_TO_TABLE($1, 'SPACE');

--The first row of the column median_listprice is the region name,
--so we should set the region column, which is currently blank, to
--that value
UPDATE staging
SET region = (SELECT median_listprice FROM staging LIMIT 1);

--Now let's delete the first row of the table, so we're left with a
--populated region column and a populated median_listprice column
DELETE FROM staging
WHERE median_listprice = (SELECT DISTINCT region FROM staging);

--The rownumber column in the table definition is just a placeholder
--so that the next function can be created
--Here we replace that column with an auto-increment column, which
--we can use to merge this table with the date range table we
--created earlier
ALTER TABLE staging
DROP rownumber;

ALTER TABLE staging
ADD rownumber SERIAL;
$$
LANGUAGE SQL VOLATILE;

--This function takes no inputs, and merges the staging table,
--populated using the previous function, with the date range table
--we created earlier, before using the result to populate the final
--table
CREATE OR REPLACE FUNCTION populate_final()
RETURNS void AS
$$
--merges the staging table and date range, then populates final
--table
INSERT INTO zillow_weekly_median_listprice(region, median_listprice, week_date)
SELECT
    s.region AS region,
    s.median_listprice AS median_listprice,
    w.week_date AS week_date
FROM staging s, zillow_weekly_median_listprice_date_range w
WHERE s.rownumber = w.rownumber;

--deletes data from staging table, so that it can be repopulated
--with the next concatenated string
DELETE FROM staging;
$$
LANGUAGE SQL VOLATILE;

--Iteratively passes each concatenated string from the excel table
--into the 2 previously created functions, resulting in a
--populated final table
CREATE OR REPLACE FUNCTION convert_textjoins()
RETURNS void AS
$$
--declare the variable r, representing a row of the excel table, as
--a text variable
DECLARE
    r TEXT;
BEGIN
--iterate through each row of the excel table, performing the
--previously created functions
FOR r IN
    SELECT * FROM excel
LOOP
    PERFORM pivot_string(r);
    PERFORM populate_final();
END LOOP;
END;
$$
LANGUAGE plpgsql VOLATILE;
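Before running convert_textjoins, it may help to see the same split-and-merge logic written out in plain Python. This is only a sketch with a made-up input string, not what Postgres executes, and the strip() call folds in the space cleanup that the SQL version performs in a later step:

```python
import re
from datetime import date, timedelta

def pivot_string_py(textjoin_string, start=date(2017, 10, 14)):
    # Split on the literal 'SPACE', mirroring REGEXP_SPLIT_TO_TABLE
    chunks = [c.strip() for c in re.split("SPACE", textjoin_string)]
    # The first chunk is the region name; the rest are weekly prices
    region, prices = chunks[0], chunks[1:]
    # Pair each price with a weekly date, like the rownumber join
    # against the date-range table; blank cells become None (NULL)
    return [
        (region, start + timedelta(weeks=i), price or None)
        for i, price in enumerate(prices)
    ]

rows = pivot_string_py("New York, NY SPACE 517000 SPACE  SPACE 521000")
```

Each concatenated string becomes a small long-format table of (region, week_date, price) rows, which is precisely what populate_final appends to the final table.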
Now we just have to run the last function:
SELECT convert_textjoins();
Checking our work
The zillow_weekly_median_listprice table should now be populated with the correctly pivoted data. There were 95 regions and 190 weeks, and we are also including null values (what were originally blank cells in the Excel spreadsheet). So we should have 95 times 190 rows in the table, or 18,050 rows. To check this, we'll run the following code:
SELECT COUNT(*) FROM zillow_weekly_median_listprice;
Now let’s check that each region has 190 rows:
SELECT
    region,
    COUNT(*) AS weeks
FROM zillow_weekly_median_listprice
GROUP BY region;
Looks good at first glance, but let’s double check to make sure that each region has 190 weeks:
SELECT
    region,
    COUNT(*) AS weeks
FROM zillow_weekly_median_listprice
GROUP BY region
HAVING COUNT(*) <> 190;
Let’s do a quick inspection of the
zillow_weekly_median_listprice table to see if the data appears to be structured properly:
SELECT * FROM zillow_weekly_median_listprice;
That looks pretty good! Now we just have one last thing to fix.
Fixing data type
The last step we have to take is converting the
median_listprice column to integer, so that it can be used to perform numerical calculations.
First, let’s look at how many rows in the
zillow_weekly_median_listprice table are null:
SELECT COUNT(*)
FROM zillow_weekly_median_listprice
WHERE median_listprice IS NULL;
We should see 0 null values. But didn’t we have blank cells in the original spreadsheet? If you recall, we originally used
' SPACE ' as the delimiter in the Excel
TEXTJOIN function, but used
'SPACE' as the delimiter in our
REGEXP_SPLIT_TO_TABLE function. So each cell currently contains some superfluous spaces. The rows which correspond to blank cells in the original spreadsheet are not empty, but rather contain only spaces.
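Here is a small Python illustration of why those spaces appear, using a made-up row (the Denver price is hypothetical). Joining with " SPACE " but splitting on the shorter "SPACE" leaves the padding behind:

```python
import re

# A row with two blank cells, joined the way TEXTJOIN did it
joined = " SPACE ".join(["Denver-Aurora-Lakewood, CO", "", "", "315000"])

# Splitting on the shorter delimiter 'SPACE' leaves stray spaces behind
chunks = re.split("SPACE", joined)
print(chunks)  # ['Denver-Aurora-Lakewood, CO ', '  ', '  ', ' 315000']

# After deleting spaces, the formerly blank cells are empty strings
cleaned = [c.replace(" ", "") for c in chunks[1:]]
print(cleaned)  # ['', '', '315000']
```

Note that we only clean the price chunks; in the SQL pipeline the region name (which legitimately contains spaces) was already pulled out into its own column before this step.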
Let’s replace those using the following code:
UPDATE zillow_weekly_median_listprice
SET median_listprice = REPLACE(median_listprice, ' ', '');
Now that we’ve effectively deleted spaces from our
median_listprice column, let's count how many rows have a
median_listprice string of length 0. This would mean that they previously contained only spaces, and now contain an empty string.
SELECT COUNT(*)
FROM zillow_weekly_median_listprice
WHERE LENGTH(median_listprice) = 0;
These 227 rows have empty strings for their
median_listprice column, and we have to convert these into null values in order to change the data type of the column from text into integer (there's no way to convert an empty string into an integer).
UPDATE zillow_weekly_median_listprice
SET median_listprice = NULL
WHERE LENGTH(median_listprice) = 0;
And finally, we can cast the column as an integer datatype.
ALTER TABLE zillow_weekly_median_listprice
ALTER COLUMN median_listprice TYPE INT
USING median_listprice::INT;
By creating a few functions in Postgres, we’ve given ourselves the ability to automatically pivot datasets from a report format to an analytical format. The business use cases for this are endless. Maybe you’ve web-scraped a company’s financial reports from Yahoo Finance. Or perhaps you’ve stumbled on a cool PivotTable output that you would like to turn into a time series. In either case, or in any similar case, you would be able to structure your data for analysis using the roadmap we’ve discussed in this article.
Thank you for reading! Let me know in the comments if this trick helped save you time that you would have otherwise spent wrangling data. And also let me know if you would like to see me analyze this particular dataset further.