Importing and analyzing a Quickbooks general ledger with Pandas¶
Pandas is amazing. What is pandas? Well, I'm glad you asked. Per http://pandas.pydata.org/, pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
For us accountants, Pandas is a complete game changer for the analysis, auditing, and completely beefing up your ability to do powerful data analytics. In this example we'll be pulling in a quickbooks general ledger from excel, but you can import multiple file formats. The more standardized the format, the better it will play with Pandas. The best formats are CSV or pipe delimited, but as we show here, excel can be used as well.
This post is not a complete tutorial on how to use Pandas, rather it's more of a show and tell of what's possible with Pandas. And it's only scratching the surface!
Ok, to get started, first I'll import the pandas library. For display purposes, I'll format the numbers so they will show commas and two decimal places.
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
Now we'll import the sample Quickbooks general ledger excel file.¶
If you're trying this at home, download the source file here https://github.com/danshorstein/python4cpas/raw/master/01_Pandas_and_Jupyter_notebook_starter/general-ledger-sample.xlsx and save it to the same location as your jupyter notebook.
gl = pd.read_excel(r'general-ledger-sample.xlsx')
Let's see what this puppy looks like.¶
I'll use the .head() method to view the first 5 rows of data.
gl.head()
Yuck! That's ugly! Help?¶
There are several issues with this file we need to fix. As I mentioned above, excel files are not super friendly for data analysis or Pandas.
1) I see some blank columns we need to delete (Column head = Unnamed: x and details are all "NaN", which stands for Not a Number). I'll also fill in any NaNs in the Debit and Credit column with 0s.
2) The first 5 columns were imported as a big nested index. Don't worry about what that means, we'll fix that next. This is one of the issues that can come up when importing from an excel file.
Fortunately, I've written some code to fix the format.¶
I won't get into the details of what the code to fix the file is doing, but I attempted to write it in a way that would work with any standard formatted quickbooks excel general ledger. Feel free to try it out and let me know if it works for you. Here's the code, but don't worry about what it means for now. Just know it's awesome.
def acct_append(row, new_accts):
if pd.isnull(row[1]):
new_accts.append(row[0])
else:
new_accts.append('{} | {}'.format(*row))
def fix_qb_gl(gl):
gl = gl.dropna(axis=1, how='all')
main_acct = list(gl.index.get_level_values(1))
sub_acct = list(gl.index.get_level_values(2))
acct = list(zip(main_acct, sub_acct))
new_accts = []
acct_append(acct[0], new_accts)
for idx, (m, s) in enumerate(acct[1:]):
if str(m).startswith('Total'):
m = 'DELETE'
if str(s).startswith('Total'):
s = 'DELETE'
idx += 1
acct[idx] = m, s
if pd.isnull(m): # Fill NA if main is NA
acct[idx] = acct[idx - 1][0], acct[idx][1]
if pd.isnull(s): # If main is NA, then fill NA if sub is NA
acct[idx] = acct[idx][0], acct[idx-1][1]
acct_append(acct[idx], new_accts) # Create the new acct
gl = gl.reset_index(drop=True)
gl['Acct'] = pd.Series(new_accts)
gl[['Debit', 'Credit']] = gl[['Debit', 'Credit']].fillna(0)
gl['Net'] = gl.apply(lambda x: (x['Debit'] - x['Credit']
if 'DELETE' not in x['Acct']
else 0), axis=1)
gl = gl.fillna('NA')
gl = gl.where(gl['Net'] != 0).dropna()
columns = ['Acct', 'Type', 'Date', 'Num', 'Name', 'Memo',
'Split', 'Debit', 'Credit', 'Net']
gl = gl[columns]
gl['Date'] = gl['Date'].apply(pd.datetime.date)
return gl
gl = fix_qb_gl(gl)
Let's see how it looks now...¶
gl.head()
Ah, much better! Now it's looking like a GL! Let's see some more info about this file.¶
FYI, the file format in Pandas is called a DataFrame. You can get information about the size and type of data by calling the .info() method.
gl.info()
Now let's create a pivot table summing up the account activies.¶
pivot = gl.pivot_table(values=['Debit', 'Credit', 'Net'], index='Acct', aggfunc='sum', margins=True)
accts = list(gl.Acct.unique())
accts.append('All')
pivot.loc[accts]
gl[gl.Acct == 'Office Equipment']
Let's see all transactions that are over $30,000¶
gl.where(abs(gl.Net) > 30000).dropna()
Now let's see the transactions that occurred between 1/26/18 and 1/27/18¶
start_date = pd.datetime(2018,1,26).date()
end_date = pd.datetime(2018,1,27).date()
gl.where((start_date <= gl.Date) & (gl.Date <= end_date)).dropna()
AWESOME¶
Well there you go, I've shown that you can fairly easily import a quickbooks general ledger file from excel into pandas, fix the formatting issues, and run some simple queries. More to come next time! Oh ps, you can save any of these to excel or csv, which I'll get more into in a future post.
-Dan
Thanks a lot, your blog is so helpful!
ReplyDeleteI am an internal auditor, and also a newbie to Python. Your post inspired me a lot on how to solve my daily problems that I cannot solved for many years. Keep up your good work!!!
Glad you find it useful! More (hopefully) helpful Python stuff is coming soon!
DeleteHi Daniel, would you mind sharing a little bit more on how to tidy up the format of the data. As I think not many of us using quickbook but different accounting systems, every time we have to tidy the data before we can enter to any analysis work, and for my experience it is the most time consuming and frustrating workload. It would be so helpful that if there is any handy guide for us to deal with those formatting issues. Thanks!
DeleteThat's a great idea Roy. I don't know how simple it would be to create a generic guide to tidying up data, but perhaps I could come up with a few common issues that I find, and ways to handle them. If you have any examples you'd want me to walk through, I'd be happy to do so (just please remove any sensitive data) - if so, email to python4cpas@gmail.com
DeleteHi,
ReplyDeleteJust curious, but what are the primary advantages of using python for data analytics over say VB.NET? I'd imagine the latter plays better with Excel, which lends itself better to BI overall. Or is this just an incorrect assumption on my part?
Hey Redeam, thanks for the question.
DeleteI am not really a programmer, I've just started picking up on it over the past few years. I can do a little with visual basic, but when I discovered Python, everything changed. It's just so easy to use, has tons of support, and you can do so much with very little code. But I'd say it's probably just a personal preference. If you are good at using VB.NET, that's probably better for doing work with Excel. But otherwise, I think the advantages of Python's ease of use and fantastic library of packages and modules makes it a great language for doing data analytics (for example, web scraping/requests, datetime calculations, importing many different varieties of data types, etc is really easy with Python, and I have no idea about VB.NET)
Haha, thanks for responding! My experience is very similar to yours (i.e. I am an accountant, recently a CPA, doing accounting at a bank), except in terms of programming, I started with VB 6.0 and then moved to VB.NET and SQL. It is just recently that I started learning python, so I thought I'd pick your brain a bit. Any recommended resources to get started? I've mostly been reading PEPs, wikipedia, stack, and also the book 'Automate the Boring Stuff.' Also, are there any libraries to look into specifically that are targeted at our profession or are just generally useful (in addition to Pandas)? both BeautifulSoup and openpyxl seem interesting. Thanks again!
DeleteNice, glad to find another data oriented CPA! One of my goals in starting this blog was to try and find other data oriented / pythonic accounting people, as there aren't a ton of us (yet...), and get a little community going. One day I hope to get a few open source accounting projects going. Congrats on getting your CPA! As far as good resources go, I think that's a great idea for my next blog post, so thanks for the idea. I'll get a good list together of some recommended resources for Python beginners and put it up soon.
ReplyDeleteRegarding accounting oriented Python, there's really not much that I could find out there. I've seen a few github projects that mention accounting, but I didn't find them very user friendly or useful, but there may be others I haven't found yet. That's one area I would love to try and use this blog to help contribute, and get some open source accounting python projects out there.
For now though, here are some recommendations to get you going:
TRAINING:
Coursera has some great beginner Python courses. But since you are not new to programming, just new to Python, I'd actually recommend taking one or both of these courses: https://training.talkpython.fm/courses/explore_python_jumpstart/python-language-jumpstart-building-10-apps and https://training.talkpython.fm/courses/explore_pythonic_code/write-pythonic-code-like-a-seasoned-developer. They are the only Python courses I've ever paid for, but I got so much more value out of them than the cost of the course.
PODCASTS:
Check out these three podcasts: talk python (https://talkpython.fm/episodes/all), python bytes (https://pythonbytes.fm/episodes/all), and Podcast.__init__ (https://www.podcastinit.com/) - they are packed full of interesting interviews, interesting libraries and Python news, and great for inspiration about new projects and just cool stuff you can do with Python. Also gives a fantastic insight into the Python community
PYTHON VIDEOS:
There are several excellent Python conferences held each year, and most of them post their talks on youtube. Check out ones that interest you. Also, David Beazley is a superstar coder and entertaining speaker, and has some great videos. This one's a fun one to start with: https://youtu.be/j6VSAsKAj98
BOOKS:
I have all three of these, and they all have different approaches but are great ways to get more understanding of how to use Python and the different libraries. The automate the boring stuff is good too, but he keeps it at a pretty basic level and only give a few examples, where as these books dive in a little deeper, and present the code in a more Pythonic way.
https://www.amazon.com/gp/product/1449319793/
https://www.amazon.com/gp/product/1449340377/
https://www.amazon.com/gp/product/1491946008/
Hi Dan,
DeleteI am excited to find your blog. As I am a data oriented CPA and looking for similar minds too. :) I learnt VB a little bit back in high school and never touched coding again until I find VB very useful in cleaning accounting data during work. I learnt and used VB during the past couple years. I recently learn about Python and find it a more powerful tool than VB. So I started learning it and came across your blog. Love your posts and looking forward to more.
Thanks for the comment, glad to see you are following this. Stay tuned for more hopefully soon! In the meantime check out my latest post here https://www.python4cpas.com/2018/12/?m=1
DeleteDaniel. Awesome website idea and I look forward to future posts! I'm a corporate controller and always looking to add to my skillsets, so I've recently picked up learning Python.
ReplyDeleteMy questions is, from the above analysis, I don't see how this couldn't have been achieved quicker and easier by using Pivot Tables? Maybe I'm just not understanding the full potential of pandas.
Thanks,
RN!
Hey Robert, thanks for reading and for the comment! I presume you are referring to Excel's Pivot Tables rather than the pandas.pivot_table() method. I agree in this example, it probably would have been just as easy to filter and view the data in Excel Pivot Tables once it's in pivot table-friendly format. I also agree that you (and honestly I) haven't discovered the full potential of pandas. One thing pandas is really good at is cleaning up data, so start there if you're looking to get into it. After that, check out its capabilities with dealing with dates. It's also nice to be able to automate a big part of the process if you have a multi-step process with different data sources; Python and pandas is great for situations like that, where you can collect data from different sources, munge it together and bring into a pandas DataFrame, and even run some specific views, and save them to file or on an intranet page, or even email them out. Pivot Tables can do parts of that, but not the full thing.
DeleteHi Daniel,
ReplyDeleteHave you tried applying Python codes on VAT analysis/reconciliation? I am in a statutory reporting team which mainly responsible for VAT reconciliation, tax and year end audit. Currently I have challenge dealing with VAT reconciliation.
As most people would know, product sold could be classified as exempted, zero rated or standard rated. So my job is trying to identify the nature of VAT derived from sales. At the moment, my team is using excel and SAP to match the sales to its nature. However, the challenge is we have a list of items for a sales invoice which only some items are subject to VAT. Worse, the invoice is in PDF format. So I have to manually check each sales record on SAP (like few thousands or more per month) and cross check with invoice.
I learn some basic Python skill, however, so far I haven't seen particular skill to extract numbers in PDF and cross check with Excel.
Hey there, thanks for the comment! I have not done any analysis on VAT per se, but your work sounds like a great opportunity for Python. I actually recently discovered a library that does a great job of extracting data from PDF files, called pdfplumber, and I plan to do a post on that soon. I'll try and show an example of how to compare items from a PDF file with items from an excel file :)
DeleteHi Daniel, this is really helpful. It will wonderful a little explanation of the code (#), specially the part where you change the format of the file to make it workable. I do not only want to copy but to be able to write it myself at some point.
ReplyDelete