
Python Cheat Sheet

Keywords

Python 3 reserves 35 keywords for its own use (Python 2 reserved 31, including print and exec, which are no longer keywords):

False      await      else       import     pass
None       break      except     in         raise
True       class      finally    is         return
and        continue   for        lambda     try
as         def        from       nonlocal   while
assert     del        global     not        with
async      elif       if         or         yield
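
The authoritative list for whichever interpreter you are running comes from the standard-library keyword module; a quick check:

import keyword

# print the reserved words for the running interpreter
print(len(keyword.kwlist))  # 35 on Python 3.7+
print(keyword.kwlist)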

Lists vs. Tuples

Lists are more common than tuples, mostly because they are mutable. But there are a few cases where you might prefer tuples:

  • In some contexts, like a return statement, it is syntactically simpler to create a tuple than a list. In other contexts, you might prefer a list.
  • If you want to use a sequence as a dictionary key, you have to use an immutable type like a tuple or string (both of these cases are sketched below).
  • If you are passing a sequence as an argument to a function, using tuples reduces the potential for unexpected behavior due to aliasing.
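
A minimal sketch of the first two points (the names are illustrative):

def min_max(values):
    # returning two comma-separated values implicitly creates a tuple
    return min(values), max(values)

lo, hi = min_max([3, 1, 4, 1, 5])
print(lo, hi)  # 1 5

# a tuple can serve as a dictionary key; a mutable list cannot
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[(3, 4)])  # 5.0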

Regular Expressions

^
Matches the beginning of the line.

$
Matches the end of the line.

.
Matches any character (a wildcard).

\s
Matches a whitespace character.

\S
Matches a non-whitespace character (opposite of \s).

*
Matches zero or more of the immediately preceding character (or group).

*?
Matches zero or more of the immediately preceding character (or group), in "non-greedy mode".

+
Matches one or more of the immediately preceding character (or group).

+?
Matches one or more of the immediately preceding character (or group), in "non-greedy mode".

[aeiou]
Matches a single character as long as that character is in the specified set. In this example, it would match "a", "e", "i", "o", or "u", but no other characters.

[a-z0-9]
You can specify ranges of characters using the minus sign. This example matches a single character that must be a lowercase letter or a digit.

[^A-Za-z]
When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.

( )
When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().

\b
Matches the empty string, but only at the start or end of a word.

\B
Matches the empty string, but not at the start or end of a word.

\d
Matches any decimal digit; equivalent to the set [0-9].

\D
Matches any non-digit character; equivalent to the set [^0-9].

Refer here for the regular expression cheat sheet.
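
A few of these metacharacters in action (a minimal sketch; the sample line is invented for illustration):

import re

text = 'From: stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

# ^From: anchors at the start of the line; \S+ matches the non-whitespace
# address; the parentheses extract just that subset of the match
print(re.findall(r'^From: (\S+)', text))  # ['stephen.marquard@uct.ac.za']

# greedy vs. non-greedy: .+ grabs as much as possible, .+? as little
print(re.findall(r'F.+:', text))   # matches all the way to the last colon
print(re.findall(r'F.+?:', text))  # ['From:']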

Tuples

The following code reads an input text file and prints the distribution of all alphabetic characters in the file, as a list of (character, count) tuples sorted alphabetically.

import string
import operator

fname = input("Enter a file name: ")
fhand = open(fname)
counts = dict()
for line in fhand:
    words = line.split()
    if len(words) == 0:   # skip blank lines
        continue
    # strip punctuation, whitespace, digits, and case before counting
    line = line.translate(str.maketrans("", "", string.punctuation))
    line = line.strip()
    line = line.lower()
    line = ''.join([i for i in line if not i.isdigit()])
    line = line.replace(" ", "")
    for letter in line:
        if letter not in counts:
            counts[letter] = 1
        else:
            counts[letter] += 1
fhand.close()

# counts.items() yields (character, count) tuples; sort them by character
sorted_lst = sorted(counts.items(), key=operator.itemgetter(0))
## change the key to operator.itemgetter(1) to sort by frequency instead
print(sorted_lst)
for key, val in sorted_lst:
    print(key, val)
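
For reference, the counting loop above can be written more compactly with collections.Counter from the standard library; a minimal sketch that gives the same result for plain ASCII text:

from collections import Counter

fname = input("Enter a file name: ")
with open(fname) as fhand:
    text = fhand.read().lower()

# keep only alphabetic characters, then count them
counts = Counter(ch for ch in text if ch.isalpha())
for letter, n in sorted(counts.items()):
    print(letter, n)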

Web Scraping

Reading a web page using the BeautifulSoup library

BeautifulSoup is a very useful Python library for web scraping. Compared to Python's built-in packages, BeautifulSoup provides powerful features such as encoding conversion (Unicode, UTF-8), parse-tree navigation, and built-in search methods. Refer here for the source code and documentation of BeautifulSoup.

The code snippet below reads the BeautifulSoup project's website and prints all HTML elements with 'a' tags. Note that you need to install BeautifulSoup (pip install beautifulsoup4) before running this code.

import urllib.request
import bs4  # this is Beautiful Soup

req = urllib.request.Request('http://www.crummy.com/software/BeautifulSoup')
with urllib.request.urlopen(req) as response:
    the_page = response.read()

## get a bs4 object; naming the parser explicitly avoids a warning
soup = bs4.BeautifulSoup(the_page, 'html.parser')

## compare the two print statements
#print(soup)
#print(soup.prettify())

## show how to find all 'a' tags
print(soup.find_all('a'))
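
To pull out just the link targets rather than the whole tags, continue from the soup object above (a small extension, not part of the original example):

## extract the href attribute from each anchor tag that has one
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(links[:10])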

Reading JSON using json package

To read JSON, Python provides a built-in module called json that parses JSON text into a dictionary. The following code reads the user's input for a location, sends the input to Google's geocoding API, and prints the corresponding geographical information. (Google's geocoding endpoint now requires an API key; this snippet reflects the older keyless interface.)

import urllib.request
import urllib.parse
import json

serviceurl = 'http://maps.googleapis.com/maps/api/geocode/json?'
while True:
    address = input('Enter location: ')
    if len(address) < 1:
        break
    url = serviceurl + urllib.parse.urlencode(
        {'sensor': 'false', 'address': address})
    print('Retrieving', url)
    req = urllib.request.Request(url)
    with urllib.request.urlopen(req) as response:
        data = response.read()
        print('Retrieved', len(data), 'characters')
        try:
            js = json.loads(data.decode())
            print(json.dumps(js, indent=4))
        except ValueError:
            js = None

        if js is None or 'status' not in js or js['status'] != 'OK':
            print('==== Failure To Retrieve ====')
            print(data)
            continue
        lat = js["results"][0]["geometry"]["location"]["lat"]
        lng = js["results"][0]["geometry"]["location"]["lng"]
        try:
            country_code = js["results"][0]["address_components"][3]["short_name"]
        except (IndexError, KeyError):
            print('Country code is empty')
            country_code = "nil"

        print('lat', lat, 'lng', lng, 'country_code', country_code)
        location = js['results'][0]['formatted_address']
        print(location)
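
For reference, here is the core json round-trip in isolation (a minimal sketch with an invented record):

import json

# parse a JSON string into a Python dictionary
record = json.loads('{"status": "OK", "results": [{"lat": 42.37}]}')
print(record["status"])             # OK
print(record["results"][0]["lat"])  # 42.37

# serialize back to a pretty-printed JSON string
print(json.dumps(record, indent=4))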

Reading CSV using Pandas DataFrame

Sponsored by NumFOCUS, pandas is a very powerful library for processing numerical data, offering high-performance, easy-to-use data structures and analysis tools. In particular, one of the most useful data types that pandas provides is the DataFrame, whose programming syntax is very similar to R's data.frame.

The following code reads a CSV file returned from an HTTP request, loads the user records into a pandas DataFrame, and prints out the records of users who are 40-year-old males.

import pandas as pd

# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.user', 
    sep='|', names=u_cols)
print(users[(users.age==40) & (users.sex == "M")])

# select the list of users who are female programmers
#FProgrammer = users[(users.sex == "F") & (users.occupation == "programmer")]
# print out the average age of this group of users
#print(FProgrammer.age.mean())
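
The same filter-and-aggregate pattern generalizes with groupby; a quick sketch on the same users frame:

# average age for every occupation, highest first
print(users.groupby('occupation').age.mean().sort_values(ascending=False))

# count users by occupation and sex
print(users.groupby(['occupation', 'sex']).size().unstack(fill_value=0))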

Extracting web data table using BeautifulSoup library

The following code gets the content of Harvard University's Wikipedia page, extracts a data table of class wikitable, and parses it into a pandas DataFrame.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Get the HU Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

page = req.text
soup = BeautifulSoup(page, 'html.parser')

rows = [row for row in soup.find("table", "wikitable").find_all("tr")]

## get the column headers of the data table
columns = [col.get_text() for col in rows[0].find_all("th") if col.get_text()]

## get the row index of the data table
indexes = [row.find("th").get_text() for row in rows[1:]]

## read the content of the data table except the column headers and row indexes
## since the data is in percentages, parse it with the to_num function
## (a conditional expression handles "0%" correctly, unlike the and/or idiom)
to_num = lambda s: int(s[:-1]) if s.endswith("%") else None
values = [to_num(value.get_text()) for row in rows[1:]
          for value in row.find_all("td")]
## stack the data as a list of tuples; each tuple represents a row
stacked_values = list(zip(*[values[i::len(columns)] for i in range(len(columns))]))

## parse the data into a Pandas DataFrame 
df = pd.DataFrame(stacked_values, columns=columns, index=indexes)

## remove rows with na values: df.dropna()
## remove columns with na values: df.dropna(axis=1)
## replace na with 0
df_clean = df.fillna(0).astype(int)

# the df looks clean now! can run some simple statistics
df_clean.describe()

 
