Monday, January 9, 2017

Module Highlight - Using difflib to compare employee and vendor addresses

Fun_with_difflib

Module Highlight - difflib

Welcome to my first Module Highlight! These posts will be shorter, focused posts with a goal of highlighting and providing useuful examples of a module that I like. Difflib is a built-in Python module that does quite a few things, but we will focus mainly on one of its features: the ability to find close matches to inputs. Difflib can be used to compare things such as sentences, lists, files, etc., but my favorite thing about it is the get_close_matches function, which looks for a match between the first input, which is one word or phrase, and the second input, which is a list of words or phrases.

For you Excel-oriented folks, this is like doing a vlookup that finds matches even if the two items don't match exactly. I can't count how many times I've been frustrated trying to do a vlookup and not find any matches, only to discover there's an extra space in one of the words, or one cell is a number and the other is treated text, and sometimes it's just that a word is misspelled.

This is where difflib.get_close_matches function is awesome! Let's look at an example to really understand how great it is.

We will be looking at elements of a few other modules as part of this post:

While the main focus is on using difflib, I am using a few other modules and wanted to give them a shoutout:

  • Jupyter Notebook - This post, and many future posts, will use a fantastic tool called Jupyter Notebook. Jupyter notebook allows you to write Python and documentation in your web browser, and you can run each line of code individually to see how it works, and tweak it. I won't go into any details about how to use it in this post, but see these instructions if you want to learn how to start it up, and play around with a Jupyter Notebook. Otherwise, you can do all of this example in IDLE.

  • Faker - This is a cool library that you can use to generate sample data such as names of inviduals and companies, addresses, socials, phone numbers, emails, etc. I'm using it in this post to generate most of the sample data, consisting of names, companies, and addresses.

  • Collections - specifically the namedtuple function with Collections. Namedtuple makes reading and using codea easier by letting you call tuples by their descriptive name rather than using the location (e.g. employee.name versus employee[0]). This is a more advanced topic, so I won't go into it much in this post, but just know that I'm using it and you should check it out if you want.
  • CSV - This is a very handy module for accountants that lets python read and save to CSV files. It's really useful, and you'll see a lot of it in future posts.

  • Example use of Difflib - Comparing two lists of addresses

    One common internal/external audit procedure I've done is to compare a list of employee addresses to vendor payment addresses. This can be done using data analytic software, but it's not as easily done with Excel. There is more than one way to do this, but difflib provides an easy way by using its get_close_matches. For this example, I've created a fake list of employees and a fake list of companies. I used a python library called faker to do this, then for purposes of this exercise, I manually adjusted a few of the addresses so some would be close but not equal matches between the two lists, so we could see if they are picked up by difflib. I've included the steps I took to create and save these fake lists at the end of this post in case you want to see how it was done.

    First I'll import difflib, csv, and the namedtuple method from collections.

    In [9]:
    import difflib
    import csv
    from collections import namedtuple
    

    Jupyter Notebook Comment:

    Each box of code in this notebook is part of the same instance of Python. Once you type in code, hit CTRL+ENTER to run the code. You can tell the code has been run when the "In [ ]:" turns into a numbered version, like "In [1]:". After you've run the first piece of code, any variables created or modules imported are available for access as later lines/boxes of code are run. It only resets if you close and reopen the Jupyter Notebook, or if you restart the kernel. The reason the box above is [9] instead of [1] is because I actually started with running the sample data code below (see Bonus section).

    Make the namedtuples

    Here we are defining namedtuples for Employee and Company variables to define ".name" and ".address" as the two elements of those tuples.

    In [10]:
    Employee = namedtuple('Employee', 'name, address')
    Company = namedtuple('Company', 'name, address')
    

    Now we will pull in the data from the csv files I created.

    If you downloaded the csv files to use, make sure to use the correct file path. In this case, the files are in a subfolder called "sample_data".

    In [11]:
    with open('sample_data/employees.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        employees = [Employee(row[0], row[1]) for row in reader]
        
    with open('sample_data/companies.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        companies = [Company(row[0], row[1]) for row in reader]
    

    Jupyter Notebook Comment:

    Notice that there are no Outputs below the above three boxes of code. This is because so far, we have only performed imports or put objects into variables. The next box will be the first box where we print an output.

    Let's see what the first three records in each file look like, to make sure this is pulled in correctly.

    In [12]:
    for employee in employees[:3]:
        print(employee)
        
    print()
        
    for company in companies[:3]:
        print(company)
    
    Employee(name='Craig Diaz', address='21955 Ruth Loaf, West Stephanie, WA 05103-3189')
    Employee(name='Sara Johnson', address='055 Susan Extensions, South Allenberg, GU 11551-2762')
    Employee(name='Amanda Kramer', address='90754 Sampson Crescent Suite 315, West Michelleshire, AR 35432')
    
    Company(name='Boyd, Thomas and Hayes', address='8909 Aguilar Village Suite 403, North Aaron, KS 05456-7446')
    Company(name='Fisher-Drake', address='19338 Parker Green Suite 336, Lake Kimberly, TX 00652-7366')
    Company(name='Cruz-Hudson', address='USNV Patel, FPO AA 52413-8440')
    

    Ok, time to see if we can find any potentially matching addresses!

    Here's the standard format of the difflib.get_close_matches() function:

    difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
    
    

    Inputs of word and possibilities must be input; and n and cutoff have default vaules of 3 and 0.6, respectively.

    The word will be each employee's address. The possibilities must be a list. In our scenario we want to compare an employee's address to each vendor's address. We will be iterating over the list of employee addresses and comparing each to each of the vendor addresses, therefore we have to pass that item in as a list (we do so by putting it in [brackets]). The reason we are passing in each address individually rather than as a list, is so we can return the full name and address rather than just the matching address. It takes a little bit longer to do it this way, so if you're dealing with loads more records, it may be more efficient to find the potential matches first, and look up the names after.

    The part of the function we need to adjust is the cutoff. The cutoff represents a number between 0 and 1, and is the cutoff where if the matching score of the two words is not ABOVE the cutoff point, then it is ignored. I tried a few different cutoffs, and finally came to .8 as an appropriate cutoff for our case. 0.7 included too many potential matches, but 0.9 didn't include enough. Use your judgment here.

    The n=3 tells get_close_matches() how many results to give. By default it provides 3 results. In our case we are only looking at one potential match, so this does not need to be adjusted.

    In [13]:
    for employee in employees:
        for company in companies:
            if difflib.get_close_matches(employee.address, [company.address], cutoff=.8):
                print('-------POSSIBLY THE SAME ADDRESS!!------')
                print(employee)
                print(company)
                print()
    
    -------POSSIBLY THE SAME ADDRESS!!------
    Employee(name='Rebecca Ramirez', address='99963 Thompson Point, Ramseyburgh, KS 55309')
    Company(name='Johnson-Vega', address='99963 Tomson Pt, Ramseyburgh, KS 55309')
    
    -------POSSIBLY THE SAME ADDRESS!!------
    Employee(name='Erica Jensen', address='607 Moody Hill Apt. 996, East Jennytown, DC 70760')
    Company(name='Jones and Sons', address='607 Moody Hl, #996, E Jennytown, DC 70760')
    
    -------POSSIBLY THE SAME ADDRESS!!------
    Employee(name='Patricia Burgess', address='27148 Ramirez Turnpike Suite 454, East Daniel, ME 53518-5616')
    Company(name='Silva, Brown and Chang', address='27148 Ramirez Tpk #454, E Daniel, ME 53518')
    
    -------POSSIBLY THE SAME ADDRESS!!------
    Employee(name='Thomas Bass', address='935 Harris Highway Apt. 184, North Leonard, VI 26461')
    Company(name='Curtis, Mays and Spears', address='935 Harris Hwy Apt184, N Leonard, VI 26461')
    
    -------POSSIBLY THE SAME ADDRESS!!------
    Employee(name='Julian Graham', address='5365 Williams Center Suite 343, Hughesfurt, SC 78140-4247')
    Company(name='Orr LLC', address='5365 Williams Ctr Ste 343, Hughesfurt, SC 78140')
    
    

    What does this mean??

    Take a look at the addresses above. Under each "POSSIBLY THE SAME ADDRESS!!" line, our code is telling us that the employee address and the company address for the names provided are greater than .8 in the get_close_match cutoff. If you take a look at the actual addresses, you can see that indeed they appear to be the same address. I was pretty amazed when I discovered this module.

    What else can I do with this awesome tool?

    Another great use is if you have two trial balances with similar but not exactly the same worded descriptions, and you are trying to match them up. This works great in that case as well, just be careful if there are accounts that have the same name other than one number, like several years of the same expense as different accounts.

    That's it for now, good luck, and check out the bonus code below on how I created the CSV files using faker.

    I will be saving all of the code and sample data on these posts up on github. Just go to https://github.com/danshorstein/python4cpas to download any of the files, including the csv files for this exercise.

    Cheers! Daniel








    BONUS! Here's how I generated fake data and saved to CSV files

    Creating fake lists

    As noted above, I used faker to generate some fake lists. You can skip this part if you want. But if you would like to try to do this as well, you'll need to install the faker library. To do this, go to your command prompt, type "pip install faker", and it should install the module.

    import modules

    Here we are importing the faker, collections, and csv modules. From the collections module, we only imported namedtuple.

    In [1]:
    import faker
    from collections import namedtuple
    import csv
    

    Make the 'fake' variable so we can start generating sample data

    Now we create an instance to generate some fake data. I had to read the faker module instructions to see how to do this.

    In [2]:
    fake = faker.Factory.create()
    

    Make the namedtuples

    Here we are defining namedtuples for Employee and Company variables to define ".name" and ".address" as the two elements of those tuples.

    In [3]:
    Employee = namedtuple('Employee', 'name, address')
    Company = namedtuple('Company', 'name, address')
    

    Create the sample data

    Now we create variabled called employees and companies, which include lists of employee and company names and addresses generated for 500 randomly generated items each. The format I used here to create these lists is a more advanced topic, called a list comprehension. I'll try and do a post about list comprehensions at some point.

    In [4]:
    employees = [Employee(fake.name(), fake.address().replace('\n', ', ')) for _ in range(500)]
    companies = [Company(fake.company(), fake.address().replace('\n', ', ')) for _ in range(500)]
    

    Check the sample data

    Let's see what this data looks like. I'll print the first three of each (you can ignore the [None] output):

    In [5]:
    [print(n+1, _) for n, _ in enumerate(employees[:3])]
    print()
    [print(n+1, _) for n, _ in enumerate(companies[:3])]
    
    1 Employee(name='Craig Diaz', address='21955 Ruth Loaf, West Stephanie, WA 05103-3189')
    2 Employee(name='Sara Johnson', address='055 Susan Extensions, South Allenberg, GU 11551-2762')
    3 Employee(name='Amanda Kramer', address='90754 Sampson Crescent Suite 315, West Michelleshire, AR 35432')
    
    1 Company(name='Boyd, Thomas and Hayes', address='8909 Aguilar Village Suite 403, North Aaron, KS 05456-7446')
    2 Company(name='Fisher-Drake', address='19338 Parker Green Suite 336, Lake Kimberly, TX 00652-7366')
    3 Company(name='Cruz-Hudson', address='USNV Patel, FPO AA 52413-8440')
    
    Out[5]:
    [None, None, None]

    Looks good! Now let's "fix" a few of these addresses so they are similar

    Now let's replace five of the randomly generated addresses with our own "addresses" but make them slightly different, to see if difflib catches them. I won't change any of the numbers, but will change some of the spellings, remove the last 4 of zip, and abbreviate street types.

    In [6]:
    new_address1 = '99963 Thompson Point, Ramseyburgh, KS 55309', '99963 Tomson Pt, Ramseyburgh, KS 55309'
    new_address2 = '607 Moody Hill Apt. 996, East Jennytown, DC 70760', '607 Moody Hl, #996, E Jennytown, DC 70760'
    new_address3 = '27148 Ramirez Turnpike Suite 454, East Daniel, ME 53518-5616', '27148 Ramirez Tpk #454, E Daniel, ME 53518'
    new_address4 = '935 Harris Highway Apt. 184, North Leonard, VI 26461', '935 Harris Hwy Apt184, N Leonard, VI 26461'
    new_address5 = '5365 Williams Center Suite 343, Hughesfurt, SC 78140-4247', '5365 Williams Ctr Ste 343, Hughesfurt, SC 78140'
    
    employees[20] = Employee(employees[20].name, new_address1[0])
    employees[40] = Employee(employees[40].name, new_address2[0])
    employees[60] = Employee(employees[60].name, new_address3[0])
    employees[80] = Employee(employees[80].name, new_address4[0])
    employees[100] = Employee(employees[100].name, new_address5[0])
    
    companies[3] = Company(companies[3].name, new_address1[1])
    companies[25] = Company(companies[25].name, new_address2[1])
    companies[50] = Company(companies[50].name, new_address3[1])
    companies[70] = Company(companies[70].name, new_address4[1])
    companies[95] = Company(companies[95].name, new_address5[1])
    

    Let's see if we did it correctly...

    Let's look at the first updated address for each to make sure the addresses changed correctly:

    In [7]:
    print(employees[20])
    print(companies[3])
    
    Employee(name='Rebecca Ramirez', address='99963 Thompson Point, Ramseyburgh, KS 55309')
    Company(name='Johnson-Vega', address='99963 Tomson Pt, Ramseyburgh, KS 55309')
    

    Looks awesome! Save the files, and we're done here!

    Now we will save the employees and companies to csv files, so we can simulate pulling in real data csv files to analyze with difflib. Make sure to create a subfolder first called "sample_data".

    In [8]:
    with open('sample_data/employees.csv', 'w', newline='') as csvfile:
        row_writer = csv.writer(csvfile, delimiter=',')
        [row_writer.writerow([employee.name, employee.address]) for employee in employees]
        
    with open('sample_data/companies.csv', 'w', newline='') as csvfile:
        row_writer = csv.writer(csvfile, delimiter=',')
        [row_writer.writerow([company.name, company.address]) for company in companies]