Excel Functions in Python & Pandas
Prior to starting at Flatiron School, I was buried under balance sheets and P&L’s working as an accountant. Like my other accounting brethren, I spent the majority of my time either in an ERP system, like SAP or NetSuite, or hammering away ‘V-lookups’ and various other formulas in Excel.
With accounting, there are only so many ways things can be done. Publicly traded companies must follow guidelines stated by GAAP, the SEC and other governing bodies for regulatory reasons. A lot of the work can be relatively procedural due to this. So, you might understand why I was so overwhelmed when I first started learning how to code in Python, where there are about one million ways to do the same thing depending on your approach. (OK — one million might be an exaggeration but you get the point.)
I was struggling to understand the syntax (What’s a function vs what’s a method?). I was looking for parallels or anything that would make coding feel slightly familiar. So I was delighted when I discovered Python’s library Pandas! Pandas stands for “Python’s Data Analysis Library” and is mainly designed to work with tabular data using data frames. Essentially, it’s like Python’s version of Excel.
With this in mind, I’d like to share some of my favorite Excel formulas in Python Pandas!
- Excel Text to Columns vs Panda .read_csv()
Pandas allows you to create your own data frame from scratch using Python code, but better than that, you can read .csv’s, connect to SQL, import JSON files, ect. In Excel, when you’re given an .csv file, you need to highlight that first column and use the “Text to Columns” option in order for the data to be separated into individual columns.
In Pandas we are able to use the .read_csv() built in function to view the .csv file in a Pandas data frame. Assigning that function to our variable ‘stats,’ moving forward whenever we call ‘stats’ it will display our table of data.
In the basketball statistic example above, our data frame only has 5 lines. With larger data sets it might not make sense to view the entire table at once. Using the .head() method will show you the first five lines, and alternatively, the .tail() method will display the last 5 giving you a sense of what your data looks like.
2. Excel V-lookup vs Python .merge()
V-lookup, my go-to formula. I think I used v-lookups at least once every day. This is Excel’s method of joining two sets of data together. The user would select a key identifier from data set #1, search for that key identifier in set #2 and if there was a match, return a specified piece of data from set #2.
Pandas offers a great method called .merge() that allows users to join two data frames together. Referring back to our ‘stats’ data frame that shows a players’ in game points, rebounds, ect. Let’s say we had another table “records” that tracked our team’s wins/loss or if the game was played at home or away. Using the .merge() method and passing the on = ‘Game’ argument, combines both of the tables together using the “Game” column as the key identifier.
This allows us to further our analysis to see if our player’s stats impact our team’s record.
3. Excel Pivot Tables vs Python .pivot_tables()
Another one of my go-to ways of analyzing data quickly in Excel was to use a pivot table. A pivot table takes a larger set of data and allows the user to summarize the data with a variety of different aggregation statistics and grouping methods. Being completely honest here, I prefer the Excel user interface for setting up Pivot Tables, so getting used to setting them up in Pandas took some time.
Let’s take another look into our newly combined data frame. What if we wanted to know if the player scored more points in an away games that the team won compared to home games the team won? Well, a good way would be to call the .pivot_table() method in Pandas:
With one quick line of code, we are quickly able to see the player averages more points in home than away games and are able to see that there isn’t any data for away games won. Which makes sense, because referring back to the ‘record’ data frame the team is 0–2 away. If you would prefer to see a 0, instead of the “NaN” (Python lingo for not a number) you could apply the .fillna(0) method to the end of the pivot table function.
4. Excel Aggregating Functions vs Python Groupby’s
Excel has a lot of great quick aggregating functions (e.g. sum, avg, max, min, ect.) that can be used to aggregate across rows and columns. Python has all of these and more. Enabling coders to quickly see the qualities of their data. A few examples:
Groupby() in Pandas are great way to aggregate across categorical data. Groupby()’s Excel equivalent are the sumif/countif, ect family. For example if we wanted to return the average of all statistics dependent on if the game as played at home or away:
Diving deeper, let’s say we’re only interested in home games. We can further specify the groupby() using boolean logic as:
5. Python .describe()
Lastly, my favorite Python method is the .describe():
This Python method takes a data frame and returns descriptive statistics for every numeric column. My first steps when initially reviewing any new data set I receive is to first run a .head() to see what type of data I am working with, and then quickly followed up with a .describe() to get a quick, better understanding of the size of the data set and quick breakdown of the statistical qualities.
For me, being able to compare coding to prior technical experience has helped me better understand and retain Python syntax. I hope this quick tutorial will help any other Excel users transition into the world of coding & programming in Python’s Pandas Library.