.. DataQuest data analyst documentation master file, created by
   sphinx-quickstart on Mon Jan 09 22:34:42 2017.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to DataQuest: Data Analyst course notes!
==================================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

Popular functions
#################

* ``type()`` - returns the type of a value
* ``str()`` - converts a value to a string
* ``int()`` - converts a value to an integer
* ``.replace()`` - replaces every occurrence of one substring with another
* ``.lower()`` - converts all uppercase characters to lowercase

Lists
#############
* ``months = []`` - initialises an empty list
* ``months.append('value')`` - adds a value to the end of the list
* ``months = [1, "January", 2, "February"]`` - creates a list with values
* ``months[0]`` - accesses the first value in the list
* ``len(months)`` - returns the length of the list
* ``month_slice = months[2:4]`` - returns the items at indices 2 and 3, but not 4
* ``split_list = g.split(",")`` - splits the string ``g`` into a list at each comma
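
The list operations above can be tried together. This is a minimal sketch; the month names and the string ``g`` are hypothetical values chosen only for illustration.

```python
# Hypothetical values used only for illustration.
months = []                      # start with an empty list
months.append("January")         # add values to the end
months.append("February")
months.append("March")
months.append("April")

first = months[0]                # indexing starts at 0
count = len(months)              # number of items in the list
month_slice = months[2:4]        # indices 2 and 3; index 4 is excluded

# str.split() turns a delimited string into a list
g = "Albuquerque,749"
split_list = g.split(",")

print(first, count, month_slice, split_list)
```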

Files
#############
* ``f = open("crime_rates.csv", "r")`` - opens a file for reading
* ``g = f.read()`` - returns the contents of the file as a string

Shorthand:

* ``g = open("crime_rates.csv").read()``
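
Neither form above closes the file explicitly. One common alternative is a ``with`` block, which closes the file when the block ends. The sketch below assumes a small stand-in file written to a temporary directory; the path and its contents are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in for crime_rates.csv so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "crime_rates.csv")
with open(path, "w") as f:
    f.write("Albuquerque,749\nAnaheim,371\n")

# The with block closes the file when the block ends.
with open(path, "r") as f:
    g = f.read()

rows = g.split("\n")
print(rows)
```

Because the stand-in file ends with a newline, the last element of ``rows`` is an empty string.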


for loops
#############
::

    for row in rows:
        print(row)  # do something with each row

list of lists
#############
::

    three_rows = ["Albuquerque,749", "Anaheim,371", "Anchorage,828"]
    final_list = []
    for row in three_rows:
        split_list = row.split(',')
        final_list.append(split_list)

If you have a list of lists called ``data``:

* ``first_element = data[0]`` returns the first inner list, which may hold several items
* ``first_element[0]`` returns the first item of that inner list
* shorthand: ``data[0][0]``
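
The two-step lookup and its shorthand can be compared directly. This is a minimal sketch that rebuilds ``data`` from the three rows above, using a list comprehension in place of the explicit loop.

```python
# Rebuild the list of lists from the three rows in the notes.
three_rows = ["Albuquerque,749", "Anaheim,371", "Anchorage,828"]
data = [row.split(",") for row in three_rows]

first_element = data[0]        # the first inner list
city = first_element[0]        # first item of that inner list
same_city = data[0][0]         # shorthand for the two steps above
crime_rate = int(data[0][1])   # values are strings until converted

print(first_element, city, crime_rate)
```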

To get a list of lists from a CSV file:

Long way::

    f = open('dq_unisex_names.csv', 'r')
    names = f.read()
    names_list = names.split('\n')

    nested_list = []
    for element in names_list:
        comma_list = element.split(',')
        nested_list.append(comma_list)

    print(nested_list[0:5])

Short way::

    import csv

    f = open("world_alcohol.csv")
    reader = csv.reader(f)
    world_alcohol = list(reader)
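
The two approaches do not always give the same result. The sketch below assumes a hypothetical CSV text (held in memory with ``io.StringIO`` instead of a file on disk) in which one field is quoted and contains a comma.

```python
import csv
import io

# Hypothetical CSV text standing in for a file on disk.
# The second row has a quoted field that contains a comma.
text = 'name,count\n"Smith, Alex",12\nCasey,7\n'

# Long way: a plain split breaks the quoted field into two pieces
# and leaves an extra [''] row after the final newline.
long_way = [line.split(",") for line in text.split("\n")]

# Short way: csv.reader understands quoting.
short_way = list(csv.reader(io.StringIO(text)))

print(long_way[1])
print(short_way[1])
```

For data without quoted fields, both approaches produce the same rows apart from the trailing empty row.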


Booleans
#############

Booleans help you filter data according to specified criteria:

* ``==`` returns ``True`` if both values are equal, and ``False`` if they differ
* ``!=`` returns ``True`` if both values differ, and ``False`` if they are equal

Use parentheses for cleaner code::

    t = (8 == 8)  # True

When using ``len()`` to retrieve the last element of a list, subtract 1::

    crime_last = crime[len(crime) - 1]

List indices begin at 0, so the length of a list is one greater than the index of its last element.
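
A minimal sketch of the off-by-one point, assuming a hypothetical ``crime`` list of three values. It also shows the negative index ``-1``, which is not covered in these notes but reaches the same element.

```python
# Hypothetical crime-rate values used only for illustration.
crime = [749, 371, 828]

last_index = len(crime) - 1       # 3 items, so the last index is 2
crime_last = crime[last_index]

# Negative indices count from the end, so -1 is also the last element.
same_last = crime[-1]

# Indexing with len(crime) itself is one past the end and raises IndexError.
try:
    crime[len(crime)]
    out_of_range = False
except IndexError:
    out_of_range = True

print(crime_last, same_last, out_of_range)
```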

If Statements
#############
::

    if value > 500:
        print(value)  # do something with the value

If Else Statements
##################

::

    if temperature > 50:
        print("It's hot!")
    else:
        print("It's cold!")

In Statement
#############
The ``in`` operator checks whether a specific element is in a list::

    animals = ["cat", "dog", "rabbit"]
    if "cat" in animals:
        print("Cat found")

Or assign the result to a variable::

    animals = ["cat", "dog", "rabbit"]
    cat_found = "cat" in animals

The ``in`` operator can also check whether a specific key is in a dict::

    students = {
        "Tom": 60,
        "Jim": 70
    }

``"Tom" in students`` returns ``True``.

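
With a dict, ``in`` looks at the keys only, not the values. A minimal sketch using the ``animals`` and ``students`` examples from this section:

```python
animals = ["cat", "dog", "rabbit"]
students = {"Tom": 60, "Jim": 70}

cat_found = "cat" in animals             # True: "cat" is an element of the list
tom_found = "Tom" in students            # True: "Tom" is a key
score_found = 60 in students             # False: 60 is a value, not a key
value_found = 60 in students.values()    # True: search the values explicitly

print(cat_found, tom_found, score_found, value_found)
```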
Dictionaries
#############
* ``scores = {}`` - initialises an empty dictionary
* ``scores["Tom"] = 70`` - assigns ``"Tom"`` a score of 70; ``"Tom"`` is the key of the entry, not a numeric index

Adding keys one at a time is tedious. To create a dictionary with several key/value pairs at once::

    students = {
        "Tom": 60,
        "Jim": 70
    }
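
Both ways of filling a dictionary can be sketched side by side. The names and scores are the ones used in this section; the overwrite of Tom's score is added only to show what happens when a key already exists.

```python
scores = {}               # start with an empty dictionary
scores["Tom"] = 70        # add a key/value pair
scores["Jim"] = 60
scores["Tom"] = 80        # assigning to an existing key overwrites its value

students = {"Tom": 60, "Jim": 70}   # build a dict in one step

tom_score = students["Tom"]         # look a value up by its key
for name in students:               # looping over a dict yields its keys
    print(name, students[name])
```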

Functions
#############
::

    def clean_text(string_value):
        # Remove every comma from the string
        cleaned_value = string_value.replace(",", "")
        return cleaned_value

    sentence = "Howdy,james,bond!"
    sentence = clean_text(sentence)  # "Howdyjamesbond!"

NumPy
#############
NumPy gives you the ability to work with multidimensional arrays,
e.g. a table where ``table[2, 2]`` gives you the value at row index 2, column index 2.

Read a CSV file into an array::

    import numpy
    nfl = numpy.genfromtxt("nfl.csv", delimiter=",")

Generate an array::

    matrix = numpy.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

Find the shape of an array::

    matrix.shape  # or vector.shape

Find the type of the elements in an array::

    numbers.dtype

* all elements of a NumPy array have to be of the same type
* NumPy converts all of the elements in the array to a type it guesses
* elements that can't be converted to the selected type become ``nan`` (Not a Number)
* missing elements resolve to NA (Not Available); in a float array read with ``genfromtxt()`` they also appear as ``nan``
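
A minimal sketch of the conversion rules above, assuming a hypothetical three-line CSV text held in memory with ``io.StringIO``. One field cannot be parsed as a number and one is empty.

```python
import io
import numpy

# Hypothetical CSV text: one unparseable field ("x") and one empty field.
text = "1,2\n3,x\n5,\n"

# With the default float dtype, both problem fields are read as nan.
arr = numpy.genfromtxt(io.StringIO(text), delimiter=",")

print(arr.dtype)
print(arr)
```

Note that ``nan`` never compares equal to anything, including itself, so ``numpy.isnan()`` is the usual way to test for it.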

To specify that the ``genfromtxt()`` function should read in the data as strings and skip the header row::

    import numpy
    world_alcohol = numpy.genfromtxt('world_alcohol.csv', delimiter=",", dtype='U75', skip_header=1)
    print(world_alcohol)

Indexing (getting an element) and slicing work the same way for vectors as for lists.

To get an entire column from an array (slicing)::

    countries = world_alcohol[:,2]

To get a matrix from a matrix::

    matrix = numpy.array([
        [5, 10, 15],
        [20, 25, 30],
        [35, 40, 45]
    ])
    print(matrix[:,0:2])

Result::

    [
        [5, 10],
        [20, 25],
        [35, 40]
    ]

The slice ``0:2`` selects columns 0 and 1 and excludes column 2; the ``:`` before the comma selects all the rows.
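
A few more slices of the same matrix, as a minimal sketch. The variable names are chosen only for illustration.

```python
import numpy

matrix = numpy.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45],
])

first_two_columns = matrix[:, 0:2]   # all rows, columns 0 and 1
last_column = matrix[:, 2]           # all rows, column 2 only (a 1-D vector)
middle_rows = matrix[1:3, :]         # rows 1 and 2, all columns
single_value = matrix[1, 2]          # row 1, column 2

print(first_two_columns.shape, last_column.shape)
```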

Array comparisons::

    vector = numpy.array([5, 10, 15, 20])
    vector == 10

NumPy compares 10 to each value in ``vector`` and builds a new vector of ``True``/``False`` values, e.g.::

    [False, True, False, False]

Select rows from a matrix according to specified criteria::

    matrix = numpy.array([
        [5, 10, 15],
        [20, 25, 30],
        [35, 40, 45]
    ])
    second_column_25 = (matrix[:,1] == 25)
    print(matrix[second_column_25, :])
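
The comparison and the selection can be run end to end. This is a minimal sketch; the combined condition at the end uses ``&``, which is not covered in these notes and is included only as an illustration.

```python
import numpy

vector = numpy.array([5, 10, 15, 20])
equal_to_ten = (vector == 10)        # element-wise comparison
tens = vector[equal_to_ten]          # keep only the matching elements

matrix = numpy.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45],
])
second_column_25 = (matrix[:, 1] == 25)   # one True/False per row
matching_rows = matrix[second_column_25, :]

# Conditions can be combined with & (and) and | (or).
between = vector[(vector > 5) & (vector < 20)]

print(equal_to_ten, tens, matching_rows, between)
```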


Recipes
#############
Open a file and read each row into a list of lists::

    f = open("la_weather.csv", 'r')
    data = f.read()
    rows = data.split('\n')
    weather_data = []
    for row in rows:
        split_row = row.split(",")
        weather_data.append(split_row)

Count how often each value occurs in a list, using a dict::

    pantry = ["apple", "orange", "grape", "apple", "orange", "apple", "tomato", "potato", "grape"]

    pantry_counts = {}
    for element in pantry:
        if element in pantry_counts:
            pantry_counts[element] = pantry_counts[element] + 1
        else:
            pantry_counts[element] = 1
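
The same counts can also be produced with ``collections.Counter`` from the standard library. That is a different technique from the loop in these notes, shown here only for comparison.

```python
from collections import Counter

pantry = ["apple", "orange", "grape", "apple", "orange", "apple", "tomato", "potato", "grape"]

# Manual loop from the notes.
pantry_counts = {}
for element in pantry:
    if element in pantry_counts:
        pantry_counts[element] = pantry_counts[element] + 1
    else:
        pantry_counts[element] = 1

# Counter builds the same mapping in one call.
counter_counts = Counter(pantry)

unique_values = len(pantry_counts)   # number of distinct items
print(pantry_counts, unique_values)
```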

A function that reads a CSV file, splits the string on ``'\n'``, and converts the elements of the resulting list of lists to integers::

    def read_csv(filename):
        data = open(filename).read()
        data_split = data.split("\n")
        # Drop the header row and the last element
        # (assumed to be the empty string left by a trailing newline)
        string_list = data_split[1:len(data_split)-1]

        final_list = []
        for element in string_list:
            int_fields = []
            string_fields = element.split(",")
            for elmnt in string_fields:
                int_fields.append(int(elmnt))
            final_list.append(int_fields)
        return final_list

    cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
    print(cdc_list[0:10])

A function that totals the births (column index 4) for each distinct value in a chosen column of a list of lists::

    def calc_counts(data, column):
        column_total = {}
        for element in data:
            chosen_column = element[column]
            birth_column = element[4]
            if chosen_column in column_total:
                column_total[chosen_column] = column_total[chosen_column] + birth_column
            else:
                column_total[chosen_column] = birth_column
        return column_total
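
A minimal usage sketch for ``calc_counts()``. The sample rows are hypothetical, and the column layout (date fields in the first four columns, births at index 4) is an assumption made for illustration.

```python
def calc_counts(data, column):
    column_total = {}
    for element in data:
        chosen_column = element[column]
        birth_column = element[4]
        if chosen_column in column_total:
            column_total[chosen_column] = column_total[chosen_column] + birth_column
        else:
            column_total[chosen_column] = birth_column
    return column_total

# Hypothetical rows: the first four columns are assumed to be date fields,
# and column index 4 is assumed to hold the number of births.
sample = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 2, 1, 2, 11000],
]

births_by_month = calc_counts(sample, 1)   # group on column index 1
births_by_year = calc_counts(sample, 0)    # group on column index 0
print(births_by_month, births_by_year)
```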

Convert the second element of each two-element list to a numerical value::

    numerical_list = []
    for element in nested_list:
        first_element = element[0]
        second_element = float(element[1])
        numerical_list.append([first_element, second_element])

Get the second column from a list of lists::

    weather = []
    for element in weather_data:
        weather.append(element[1])

Count the number of unique values in a list: build ``pantry_counts`` as in the frequency recipe above; ``len(pantry_counts)`` then gives the number of unique values.

Remember
#############
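Slice the list inside the call to ``print()``::

    months = ["Jan", "Feb"]
    print(months[0:1])

NOT::

    print(months)[0:1]

The second form slices the return value of ``print()``, which is ``None``, and raises a ``TypeError`` in Python 3.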

This entire document is written with the RST syntax. In the right sidebar, you should find a link **show source**, which shows the RST source code.