Welcome to DataQuest: Data Analyst course notes!¶

Popular functions¶

* type() - gives the type of a value
* str() - converts to a string
* int() - converts to an integer
* .replace() - substitutes one piece of text for another
* .lower() - converts all uppercase text to lowercase text
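A short sketch of these functions working together. The variable names and sample values here are made up for illustration.

```python
raw_count = "749"                 # numbers read from a file arrive as strings
count = int(raw_count)            # "749" -> 749
count_type = type(count)          # the int type
label = str(count + 1)            # 750 -> "750"

city = "ANAHEIM, California"
# Remove the comma, then convert everything to lowercase
city_clean = city.replace(",", "").lower()
print(count, count_type, label, city_clean)
```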

Lists¶

* months = [] - initialise a list
* months.append('value') - adds to the end of the list
* months = [1, "January", 2, "February"] - creates a list with values
* months[0] - accesses a value in the list
* len(months) - returns the length of the list
* month_slice = months[2:4] - gives the items at indexes 2 and 3, but not 4
* split_list = g.split(",") - splits the string g into a list at each comma
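The list operations above can be tried out in one go. This is a minimal sketch; the string assigned to g is a sample row rather than real file contents.

```python
months = []                       # start with an empty list
months.append("January")          # months is now ["January"]

months = [1, "January", 2, "February"]
first = months[0]                 # indexes start at 0
length = len(months)
month_slice = months[2:4]         # indexes 2 and 3; index 4 is excluded

g = "Albuquerque,749"
split_list = g.split(",")         # one list item per comma-separated field
print(first, length, month_slice, split_list)
```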

Files¶

* f = open("crime_rates.csv", "r") - opens a file for reading
* g = f.read() - returns the contents of the file as a string

shorthand:

* g = open("crime_rates.csv").read()
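Neither form above closes the file explicitly. A common alternative is a with block, which closes the file automatically. The sketch below writes a small stand-in file first so that it runs on its own; the file name sample.csv and its two rows are assumptions, not the course data.

```python
import os
import tempfile

# Create a small stand-in file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as f:
    f.write("Albuquerque,749\nAnaheim,371\n")

# The file is closed automatically when the with block ends
with open(path, "r") as f:
    g = f.read()

rows = g.split("\n")              # the trailing newline leaves an empty last item
print(rows)
```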

for loops¶

    for row in rows:
        print(row)  # do something with each row

list of lists¶

    three_rows = ["Albuquerque,749", "Anaheim,371", "Anchorage,828"]
    final_list = []
    for row in three_rows:
        split_list = row.split(',')
        final_list.append(split_list)

If you have a list of lists

first_element = data[0] gives you the first inner list (from the list of lists), which may hold several items
first_element[0] then gives you the first item in that inner list
shorthand: data[0][0]
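A small sketch of indexing into a list of lists, assuming data holds city/count pairs like the ones built earlier in this section:

```python
data = [["Albuquerque", "749"], ["Anaheim", "371"], ["Anchorage", "828"]]

first_element = data[0]           # the whole first inner list
first_item = first_element[0]     # the first item of that inner list
same_item = data[0][0]            # shorthand for the two steps above
last_count = data[2][1]           # third row, second item
print(first_element, first_item, same_item, last_count)
```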

To get a list of lists from a csv:

Long way -

    f = open('dq_unisex_names.csv', 'r')
    names = f.read()
    names_list = names.split('\n')

    nested_list = []
    for element in names_list:
        comma_list = element.split(',')
        nested_list.append(comma_list)

    print(nested_list[0:5])

short way -

    import csv

    f = open("world_alcohol.csv")
    reader = csv.reader(f)
    world_alcohol = list(reader)
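Besides being shorter, csv.reader also copes with quoted fields that contain commas, which a plain split(',') does not. The sketch below uses io.StringIO in place of a real file; the two rows are invented sample data, not the contents of world_alcohol.csv.

```python
import csv
import io

# Made-up sample: the country name contains a comma and is quoted
text = 'Year,Country,Beverage\n1986,"Korea, Republic of",Wine\n'

reader = csv.reader(io.StringIO(text))
rows = list(reader)               # list of lists, one inner list per row

# Splitting by hand breaks the quoted field into two pieces
naive = text.split("\n")[1].split(",")
print(rows[1])
print(naive)
```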

Booleans¶

Booleans help you to filter data according to specified criteria:

== returns True if both variables are equivalent, and False if they’re different
!= returns True if both variables are different, and False if they’re equivalent

Use parentheses for cleaner code:

    t = (8 == 8)  # True

Remember that when using len() to retrieve the last element from a list, you should subtract 1:

    crime_last = crime[len(crime) - 1]

The length of the list is not the index of the last element, because list indexes begin at 0.
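A quick sketch of why the subtraction is needed, using a made-up three-item crime list. Python also accepts a negative index, crime[-1], as a shorter way to reach the last element.

```python
crime = [749, 371, 828]

crime_last = crime[len(crime) - 1]    # len(crime) is 3, so this is crime[2]
crime_last_short = crime[-1]          # negative indexes count from the end

# Using the length itself as an index goes one past the end
try:
    crime[len(crime)]
    out_of_range = False
except IndexError:
    out_of_range = True
print(crime_last, crime_last_short, out_of_range)
```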

If Statements¶

    if value > 500:
        print(value)  # do something

If Else Statements¶

    if temperature > 50:
        print("It's hot!")
    else:
        print("It's cold!")

In Statement¶

The in statement checks whether a specific element is in a list:

    animals = ["cat", "dog", "rabbit"]
    if "cat" in animals:
        print("Cat found")

Or assign to a variable

    animals = ["cat", "dog", "rabbit"]
    cat_found = "cat" in animals

The in statement can also check whether a specific key is in a dict:

    students = {
        "Tom": 60,
        "Jim": 70
    }

"Tom" in students will return True
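Note that in looks at the keys of a dict, not its values. A minimal sketch with the same students dict:

```python
students = {"Tom": 60, "Jim": 70}

has_tom = "Tom" in students               # True: "Tom" is a key
has_60 = 60 in students                   # False: 60 is a value, not a key
has_60_value = 60 in students.values()    # True: check the values explicitly
print(has_tom, has_60, has_60_value)
```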

Dictionaries¶

scores = {} - initialise a dictionary

Stupid way: scores["Tom"] = 70 - assigns "Tom" a score of 70. "Tom" is the key of the dict.

Clever way:

    students = {
        "Tom": 60,
        "Jim": 70
    }

This gives us key/value pairs
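A brief sketch of reading and updating key/value pairs. The name "Ann" is a made-up example of a key that is not in the dict.

```python
scores = {}
scores["Tom"] = 70                # add one key/value pair at a time

students = {"Tom": 60, "Jim": 70}
tom_score = students["Tom"]       # look up a value by its key
students["Tom"] = 65              # assigning to an existing key overwrites it
missing = students.get("Ann")     # .get() returns None instead of raising KeyError

pairs = list(students.items())    # the key/value pairs as tuples
print(scores, tom_score, missing, pairs)
```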

Functions¶

    def clean_text(string_value):
        # Remove every comma from the string
        cleaned_value = string_value.replace(",", "")
        return cleaned_value

    sentence = "Howdy,james,bond!"
    sentence = clean_text(sentence)

NumPy¶

NumPy gives you the ability to work with multidimensional arrays, e.g. a table where table[2, 2] gives you the value at row 2, column 2 (counting from 0).

    import numpy
    nfl = numpy.genfromtxt("nfl.csv", delimiter=",")

generate an array:

    matrix = numpy.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

find the shape of an array: matrix.shape OR vector.shape

type of an array: numbers.dtype

every element in a numpy array has to be of the same type
numpy will convert all of the elements in the array to a type it guessed
elements that can't be converted to the selected type will be NaN - Not a Number
missing elements will resolve to na - Not Available (in a float array read by genfromtxt() these also show up as nan)
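The conversion rules above can be seen on a tiny example. This sketch feeds genfromtxt() an in-memory string through io.StringIO instead of a file; the three rows are invented, with one non-numeric field and one empty field.

```python
import io
import numpy

# Mixing ints and floats: numpy picks one type for the whole array
mixed = numpy.array([1, 2.5, 3])

# Row 2 holds text that is not a number; row 3 has an empty second field
text = "1,2\n3,abc\n4,\n"
table = numpy.genfromtxt(io.StringIO(text), delimiter=",")

print(mixed.dtype)
print(table)
```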

To specify that the genfromtxt() function should read in the data as strings:

    import numpy
    world_alcohol = numpy.genfromtxt('world_alcohol.csv', delimiter=",", dtype='U75', skip_header=1)
    print(world_alcohol)

* slicing works the same way for vectors and lists
* indexing (getting an element) works the same way for vectors and lists

To get an entire column (slicing) from an array:

    countries = world_alcohol[:,2]

To get a matrix from a matrix:

    matrix = numpy.array([
        [5, 10, 15],
        [20, 25, 30],
        [35, 40, 45]
    ])

    print(matrix[:,0:2])

Result:

    [
        [5, 10],
        [20, 25],
        [35, 40]
    ]

This specifies that the new matrix should include all the rows and columns 0 and 1: the slice 0:2 runs from column 0 up to, but excluding, column 2.
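Checking the shape of the result is a quick way to confirm which columns a slice kept. A minimal sketch with the same 3x3 matrix:

```python
import numpy

matrix = numpy.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])

first_two_columns = matrix[:, 0:2]    # all rows, columns 0 and 1
second_column = matrix[:, 1]          # a single index gives a 1-D vector

print(first_two_columns.shape)        # rows first, then columns
print(second_column)
```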

Array comparisons:

    vector = numpy.array([5, 10, 15, 20])
    vector == 10

numpy will compare 10 to each value in vector and build a new vector of True/False values, e.g. [False, True, False, False]

Select rows or columns from an array or matrix according to specified criteria:

    matrix = numpy.array([
        [5, 10, 15],
        [20, 25, 30],
        [35, 40, 45]
    ])

    second_column_25 = (matrix[:,1] == 25)
    print(matrix[second_column_25, :])
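The comparison produces a vector of True/False values, and using that vector as an index keeps only the positions marked True. A short sketch that spells out the intermediate values:

```python
import numpy

vector = numpy.array([5, 10, 15, 20])
equal_to_ten = (vector == 10)         # one True/False per element
tens = vector[equal_to_ten]           # keeps only the elements marked True

matrix = numpy.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])
second_column_25 = (matrix[:, 1] == 25)   # one True/False per row
selected = matrix[second_column_25, :]    # rows whose second column is 25

print(equal_to_ten, tens)
print(second_column_25, selected)
```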

Recipes¶

Open a file and read in each row into a list of lists

    f = open("la_weather.csv", 'r')
    data = f.read()
    rows = data.split('\n')
    weather_data = []
    for row in rows:
        split_row = row.split(",")
        weather_data.append(split_row)

Counting frequency in a dict

    pantry = ["apple", "orange", "grape", "apple", "orange", "apple", "tomato", "potato", "grape"]

    pantry_counts = {}
    for element in pantry:
        if element in pantry_counts:
            pantry_counts[element] = pantry_counts[element] + 1
        else:
            pantry_counts[element] = 1
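For comparison, the standard library's collections.Counter builds the same tally in one call. This is an alternative to the loop above rather than the approach the notes use; the sketch runs both and checks that they agree.

```python
from collections import Counter

pantry = ["apple", "orange", "grape", "apple", "orange", "apple", "tomato", "potato", "grape"]

# Manual tally, as in the recipe above
pantry_counts = {}
for element in pantry:
    if element in pantry_counts:
        pantry_counts[element] = pantry_counts[element] + 1
    else:
        pantry_counts[element] = 1

# Counter does the same counting internally
counter_counts = Counter(pantry)

print(pantry_counts)
print(dict(counter_counts) == pantry_counts)
```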

A function to read a csv, split the string on newlines ('\n') and convert the elements in the list (of lists) to integers.

    def read_csv(filename):
        data = open(filename).read()
        data_split = data.split("\n")
        # Drop the header row and the last element
        # (typically an empty string left by a trailing newline)
        string_list = data_split[1:len(data_split)-1]

        final_list = []
        for element in string_list:
            int_fields = []
            string_fields = element.split(",")
            for elmnt in string_fields:
                int_fields.append(int(elmnt))
            final_list.append(int_fields)

        return final_list

    cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
    print(cdc_list[0:10])

Function to retrieve the frequency from a list of lists for a specific column

    def calc_counts(data, column):
        column_total = {}
        for element in data:
            chosen_column = element[column]
            birth_column = element[4]  # the births count is at index 4
            if chosen_column in column_total:
                column_total[chosen_column] = column_total[chosen_column] + birth_column
            else:
                column_total[chosen_column] = birth_column

        return column_total

Convert the second element in each list of 2 elements to a numerical value

    temp_list = []
    numerical_list = []

    # print(nested_list[0][1])

    for element in nested_list:
        # print(element)
        first_element = element[0]
        second_element = float(element[1])
        temp_list.append(first_element)
        temp_list.append(second_element)
        numerical_list.append(temp_list)
        temp_list = []

Get the second column from a list

    weather = []
    for element in weather_data:
        weather.append(element[1])

Counting the number of unique values in a list

    pantry = ["apple", "orange", "grape", "apple", "orange", "apple", "tomato", "potato", "grape"]
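One possible way to do this, assuming the goal is the number of distinct items, is to convert the list to a set, which keeps each value only once. This is a sketch of one approach and may not be the one the course intended.

```python
pantry = ["apple", "orange", "grape", "apple", "orange", "apple", "tomato", "potato", "grape"]

unique_items = set(pantry)            # duplicates collapse into one entry each
unique_count = len(unique_items)      # number of distinct values
print(unique_count)
```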