Welcome to the Day-3 Morning Lab! We are glad for you to join us. Today, we are learning about how Pandas, a data manipulation tool, works, and working on cleaning some data of your own! Iris Dataset We are going to lead the Iris dataset from sklearn again. This time, however, we will load the full dataset and parse it ourselves (instead of using return_X_y.) Let’s begin by importing the Iris dataset, as we expect. from sklearn.datasets import load_iris And, load the dataset to see what it looks like. iris = load_iris() iris.keys() dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']) We have a pretty large dictionary full of information! Let’s pull out data (our input data), target (our output data), and feature_names, the names of our feature. iris_in = iris["data"] iris_out = iris["target"] iris_names = iris["feature_names"] Data Manipulation pandas is a very helpful utility that allow us to see into data more conveniently. The object that we are usually working with, when using pandas, is called a DataFrame. We can actually create a DataFrame pretty easily. Let’s first import pandas import pandas as pd Loading Data We have aliased it as pd so that its easier to type. Awesome! Let’s make a DataFrame. df = pd.DataFrame(iris_in) df 0 1 2 3 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 .. ... ... ... ... 145 6.7 3.0 5.2 2.3 146 6.3 2.5 5.0 1.9 147 6.5 3.0 5.2 2.0 148 6.2 3.4 5.4 2.3 149 5.9 3.0 5.1 1.8 [150 rows x 4 columns] Nice! We have our input data contained in a data frame and nicely printed in a table; cool! However, the column names 1, 2, 3, 4 aren’t exactly the most useful labels for us. Instead, then, let’s change the column headers to: iris_names sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) How? We can both get and set the columns via df.columns: df.columns = iris_names Let’s look at the DataFrame again! df sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) 0 5.1 3.5 1.4 0.2 1 4.9 3.0 1.4 0.2 2 4.7 3.2 1.3 0.2 3 4.6 3.1 1.5 0.2 4 5.0 3.6 1.4 0.2 .. ... ... ... ... 145 6.7 3.0 5.2 2.3 146 6.3 2.5 5.0 1.9 147 6.5 3.0 5.2 2.0 148 6.2 3.4 5.4 2.3 149 5.9 3.0 5.1 1.8 [150 rows x 4 columns] Excellent! Now our data frame looks much more reasonable. Wranging Data How do we manipulate the data around? Well, we can index this data by both columns and rows. Indexing by columns first is very easy. Pandas tables are, by default, “column-major”. This means that we can just index the columns just like a list! df["petal width (cm)"] 0 0.2 1 0.2 2 0.2 3 0.2 4 0.2 ... 145 2.3 146 1.9 147 2.0 148 2.3 149 1.8 Name: petal width (cm), Length: 150, dtype: float64 Nice! I want to know introduce the idea of a “cursor”. A “cursor” is used to index this high-dimensional data; think about it as the way to turn this table into something like an indexable 1-D list. The simplest cursor is .loc (“locator.”) Unlike list indexing directly, .loc is “row-major:” the first index selects rows instead of columns. df.loc[0] sepal length (cm) 5.1 sepal width (cm) 3.5 petal length (cm) 1.4 petal width (cm) 0.2 Name: 0, dtype: float64 Nice! You can see that .loc turned our table into a list, with each “sample” of the data more clearly represented by indexing it like a list. What if, then, we want to select the “pedal width” value inside this sample? We just select the first index, a comma, then select the second index. df.loc[0, "petal width (cm)"] 0.2 Excellent! We can see, because we changed the header columns to be strings, we have to index them like strings. What if, instead of the first row, we want to get… say, the first, fifth, and sixth rows? Unlike traditional lists, Pandas’ cursors can be indexed by a list. So this: df.loc[0] sepal length (cm) 5.1 sepal width (cm) 3.5 petal length (cm) 1.4 petal width (cm) 0.2 Name: 0, dtype: float64 turns into df.loc0,2,8,9 sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) 0 5.1 3.5 1.4 0.2 2 4.7 3.2 1.3 0.2 8 4.4 2.9 1.4 0.2 9 4.9 3.1 1.5 0.1 This would give us the 0th, 2nd, 8th, and 9th row! This is all good, but, it’s kind of annoying to type the column names (like “petal width (cm)”) every time! No worries, we can address this. iloc is a variant of loc which uses integer indexes. For row indexing, the syntax remains exactly the same; iloc, however, converts all column indexes to integers sequentially. Therefore: df.loc[0, "petal width (cm)"] becomes df.iloc[0, 3] 0.2 Nice! Isn’t that convenient. Some statistics The main gist of the lab here is to manipulate the input data a little. Pandas provides many helpful utilities to help us with that. For instance, let’s take a single feature in the data, say, the pedal with: pwidth = df["petal width (cm)"] # same pwidth = df.iloc[:,3], where : returns everything in the row dimention pwidth 0 0.2 1 0.2 2 0.2 3 0.2 4 0.2 ... 145 2.3 146 1.9 147 2.0 148 2.3 149 1.8 Name: petal width (cm), Length: 150, dtype: float64 We can now find out how distributed this data is, to glean some info about normalization! The most basic is for us to find the mean width of the petals: pwidth.mean() 1.1993333333333336 Awesome! We can calculate the standard by applying this constant to that entire row. The syntax works just like how you expect—subtracting a scalar from the whole column just subtracts that constant from every element—without any fuss: (((pwidth-pwidth.mean())**2).sum()/len(pwidth))0.5 0.7596926279021594 Cool! In the scheme of things, that’s actually a pretty good. However, if it was not, we could normalize the data! Let’s first get the norm of the vector pwidth_norm = sum(pwidth2)**0.5 pwidth_norm 17.38763928772391 And, let’s normalize our vector by this norm! pwidth_normd = pwidth/pwidth_norm pwidth_normd 0 0.011502 1 0.011502 2 0.011502 3 0.011502 4 0.011502 ... 145 0.132278 146 0.109273 147 0.115024 148 0.132278 149 0.103522 Name: petal width (cm), Length: 150, dtype: float64 Excellent. Let’s find out its standard deviation again! This time we will use .std() instead. pwidth_normd.std() 0.04383790440709825 Much better. Now you try Load the wine dataset into a DataFrame and manipulate it. Feed slices back into our functions yesterday! Can you make the subsets of the data you made yesterday via the .iloc notation to make slicing easier? Can you quantify the accuracy, precision, and recall on a shuffled version of the wine dataset and logistic regression? seed=0 Is there any columns that need normalisation? Any outliers (2 std. dev away)? Why/why not? Create a balanced version of the wine dataset between red and white classes. Does fitting this normalized version into our model makes training results better?