Python

Python is an interpreted high-level general-purpose programming language. Python’s design philosophy emphasizes code readability with its notable use of significant indentation.

Do Excel things, get notebook Python code with Mito

#343: Do Excel things, get notebook Python code with Mito

11/30/21 by Michael Kennedy (@mkennedy)

Web player: https://podcastaddict.com/episode/132012853
Episode: https://talkpython.fm/episodes/download/343/do-excel-things-get-notebook-python-code-with-mito.mp3

Here’s a question: What’s the most common way to explore data? Would you say pandas and matplotlib? Maybe you went more general and said Jupyter notebooks. How about Excel, or Google Sheets, or Numbers, or some other spreadsheet app? Yeah, my bet is on Excel. And while it has many drawbacks, it makes exploring tabular data very accessible to many people, most of whom aren’t even developers or data scientists.

On this episode, we’re talking about a tool called Mito. This is an add-in for Jupyter notebooks that injects an Excel-like interface into the notebook. You pass it data via a pandas dataframe (or some other source) and then you can explore it as if you’re using Excel. The cool thing is though, just below that, it’s writing the pandas code you’d need to do to actually accomplish that outcome in code.

I think this will make pandas and Python data exploration way more accessible to many more people. So if you’ve been intimidated by pandas, or know someone who has, this could be what you’ve been looking for.

Below is the code I am using to generate the crunched numbers for Albany County, for example my 2021 Albany County Election Spreadsheet

Below is the code I am using to generate the crunched numbers for Albany County, for example my 2021 Albany County Election Spreadsheet. Text is converted using pdftotext -layout 2021GeneralRecanvass.txt which converts the PDF to a structured text file for easier parsing. The Excel Writer library is used for extensive formatting of the file for easy reading and parsing. View this as a separate page for syntax highlighting.

from io import StringIO
import re
import pandas as pd
import numpy as np

# paths
# electionResultText = '/home/andy/Desktop/2021GeneralRecanvass.txt'
# enrollmentXLS = '/home/andy/enroll/nov21-Enrollment/AlbanyED_nov21.xlsx' 
# outputPath = '/tmp/2021_albany_county_races.xlsx'

# electionResultText = '/home/andy/Documents/GIS.Data/election.districts/2020GeneralRecanvass.txt'
# enrollmentXLS = '/home/andy/enroll/nov20-Enrollment/AlbanyED_nov20.xlsx' 
# outputPath = '/tmp/2020_albany_county_races.xlsx'

electionResultText = '/home/andy/Documents/GIS.Data/election.districts/2019'
enrollmentXLS = '/home/andy/enroll/nov19-Enrollment/AlbanyED_nov19.xlsx' 
outputPath = '/tmp/2018_albany_county_races.xlsx'

# number to excel column letter
def letter(colNum):
    import math
    
    if colNum > 26:
        return chr(ord('@')+math.floor(colNum/26))+chr(ord('@')+colNum%26)
    else:
        return chr(ord('@')+colNum)

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None) # show everything when previewing

with open(electionResultText) as f:
    data = f.read()

enroll = pd.read_excel(enrollmentXLS, header=4)
enroll = enroll[(enroll['STATUS']=='Active')][['ELECTION DIST', 'TOTAL']].dropna()
enroll['ELECTION DIST'] = enroll['ELECTION DIST'].astype(str).str.replace('  ',' ')
enroll['ELECTION DIST']=enroll['ELECTION DIST'].str.strip()
enroll['Enrollment'] = enroll['TOTAL']
enroll.drop(labels=['TOTAL'], axis=1, inplace=True)

# array containing election result dataframes
er = {}
    
# split each race which is divided by ten or more equal signs
for raceData in re.split('={10,}',data):

    # blank out old data frame
    df = None

    # split lines
    rows = raceData.split('\n')

    race = ""
    candidates = {}
    startAt = 0
    
    for i, line in enumerate(rows):
        # find race name
        if re.search('VOTES\s*?PERCENT', line):
            race = rows[i+1]
        
        race = race.replace('  ',' ')
        race = race.rstrip()
        #print(race)
        
        # find candidate names
        for result in re.findall('(\d\d)\s*?=\s*?(\w.*?)\d', line):
            candidates[int(result[0])] = result[1].rstrip()
       
        # find start at location for CSV reader
        if re.findall('-{5,}', line):
            startAt = i+3
            break
    
    # skips enrollment stats, as the don't have candidates
    if not candidates or not 2 in candidates:
        continue
    
    if not race:
        continue
        
    #print(race)
    
    df = pd.read_csv(
        StringIO(raceData),
        header=None,
        skiprows=startAt,
        sep='(?<=\d)\s{1,}(?=\d)',
        engine='python',
        on_bad_lines='warn')
    
    df=df[df[0].str.contains('^\d{4}').fillna(False)] # ONLY ROWS STARTING WITH 4-DIGIT ED CODE
    
    df.reset_index(drop=True, inplace=True) # reset index so we can use it in formulas

    df=df.rename(candidates,axis=1) # rename columns 
    
    df.iloc[:,1:]=df.iloc[:,1:].apply(pd.to_numeric) # make sure all columns are numeric 
    df.iloc[:,1:]=df.iloc[:,1:].convert_dtypes('int32') # cast columns to int32

    # crunching
    df.insert(1, 'Ballot', df.iloc[:,1:].convert_dtypes('int32').sum(axis=1)) # add total column
    
    df.insert(2, 'Blanks', df['OVER VOTES'].convert_dtypes('int32')) # add blank
    df.drop(labels=['OVER VOTES'], axis=1, inplace=True)

    df['Blanks']+=df['UNDER VOTES'].convert_dtypes('int32') # add under votes
    df.drop(labels=['UNDER VOTES'], axis=1, inplace=True)

    df.insert(3, 'Canvas', '=F'+(df.index+2).map(str)+'-G'+(df.index+2).map(str))  
    df.insert(4, 'TO %', '=(F'+(df.index+2).map(str)+'/E'+(df.index+2).map(str)+')')
    df.insert(5, 'DO %', '=(G'+(df.index+2).map(str)+'/F'+(df.index+2).map(str)+')')
        
    # create check columns
    df['CHECK'] = '=F'+(df.index+2).map(str)+'-G'+(df.index+2).map(str)
    df['CHECK %'] = '=0'
    
    # add percent columns
    for i, col in enumerate(df.columns[6:-2]): 
        try:
            # check columns
            if (len(df.columns)-i-2) > 0:
                # add to check column
                df['CHECK'] += '-'+letter(i+i+11)+(df.index+2).map(str)

                # add check percent
                df['CHECK %'] += '+'+letter(i+i+12)+(df.index+2).map(str)

            # add percent
            df.insert(i+i+7, col+' %', '=('+letter(i+i+11)+(df.index+2).map(str)+'/H'+(df.index+2).map(str)+')') 
        except:
            pass

    # add T W E Columns, drop combined field
    df[['ED Code','Municipality','String','Ward','ED']]=df[0].str.extract('(\d\d\d\d)\s*?(.*?)(\s*?WARD\s*(\d*))?\s*?ED\s*(\d*)')

    # temporary string used for data merge with enrollments
    df['ELECTION DIST'] = df['Municipality'].str.strip().replace('  ',' ') + \
    ' '+df['Ward'].fillna(0).astype(str).str.zfill(3)+df['ED'].str.zfill(3).astype('str')

    # merge on ED Key
    df=df.merge(enroll, on='ELECTION DIST', how='left')
    df=df.drop(0, axis=1)
    df=df.drop(labels=['ELECTION DIST','String'],axis=1)
    
    # move columns to proper order
    cols = list(df.columns)
    df = df[cols[-5:]+cols[:-5]]
    
    df['Municipality']=df['Municipality'].str.title()
    df['Municipality'] = df['Municipality'].str.strip().replace('  ',' ') # remove whitespace around muni column
    
    # array with election result dataframes
    er[race]=df.fillna(0)
    
# write file
ew = pd.ExcelWriter(outputPath)

for i, race in enumerate(er):
    raceStr = str(i+1)+' '+race
    if (len(race)>28):
        raceStr = str(i+1)+' '+race[:10] + '...' + race[-10:]
    
    # disble header, manually write, as pre-defined headers can't be formatted
    er[race].fillna(0).to_excel(ew,sheet_name=raceStr, index=False)
    
    headForm = ew.book.add_format({'text_wrap': 1, 'font_family': 'Arial', 'bold': True , 
                                   'valign': 'vcenter', 'align': 'center', 'bg_color': '#CCCCCC'})
    for colnum, value in enumerate(er[race].columns.values):
        ew.sheets[raceStr].write(0, colnum, value, headForm)
    
    ew.sheets[raceStr].set_row(0,50)
    
    # set column width for all columns to 20, freeze panes
    bodyForm = ew.book.add_format({'text_wrap': 1, 'font_family': 'Arial', 'num_format': '#,##0', 'valign': 'vcenter', 'align': 'right'})
    bodyPerForm = ew.book.add_format({'text_wrap': 1, 'font_family': 'Arial', 'num_format': '0.0%', 'valign': 'vcenter', 'align': 'right'})
    muniForm = ew.book.add_format({'text_wrap': 1, 'font_family': 'Arial', 'valign': 'vcenter', 'align': 'center'})
    ew.sheets[raceStr].set_column('A1:A9999', 6, bodyForm)
    ew.sheets[raceStr].set_column('B1:B9999', 15, muniForm)
    ew.sheets[raceStr].set_column('C1:D9999', 4, bodyForm)
    ew.sheets[raceStr].set_column('E1:H9999', 10, bodyForm)
    ew.sheets[raceStr].set_column('I1:J9999', 10, bodyPerForm)
    
    for colNum in range(11,40):
        if colNum%2:
            ew.sheets[raceStr].set_column(letter(colNum)+'1:'+letter(colNum)+'9999', 12, bodyForm)
        else:
            ew.sheets[raceStr].set_column(letter(colNum)+'1:'+letter(colNum)+'9999', 12, bodyPerForm)

    ew.sheets[raceStr].freeze_panes('E2')
        
ew.save()

25 Pandas Functions You Didn’t Know Existed

#341: 25 Pandas Functions You Didn’t Know Existed

11/17/21 by Michael Kennedy (@mkennedy)

Web player: https://podcastaddict.com/episode/131419907
Episode: https://talkpython.fm/episodes/download/341/25-pandas-functions-you-didn-t-know-existed.mp3

Do you do anything with Jupyter notebooks? If you do, there is a very good chance you’re working with the pandas library. This is one of THE primary tools of anyone doing computational work or data exploration with Python. Yet, this library is massive and knowing the idiomatic way to use it can be hard to discover.

That’s why I’ve invited Bex Tuychiev to be our guest. He wrote an excellent article highlighting 25 idiomatic Pandas functions and properties we should all keep in our data toolkit. I’m sure there is something here for all of us to take away and use pandas that much better.

There is power in even little data πŸ”Œ

There is power in even little data πŸ”Œ

People talk a lot about big data these days. But often the scale of data needed to understand problems is relatively small compared to the power of modern desktop computers.

The fact is with free, open source software like Jupyter, PANDAS and Python along with easily downloadable data from public sources there is so much to learn by manipulating data and exploring.