matplotlib is one of the best data visualization libraries for Python (or does anyone disagree?). It’s quite easy to use, and the resulting plots are really pretty 🙂

In matplotlib #1 I will focus on the basics of the matplotlib library and show an example of using matplotlib.pyplot (a scatter plot). First of all, you need to install matplotlib on your machine, as it is not included in most Python distributions. The easiest way is to use apt-get (Ubuntu) or pip (Ubuntu/Windows). Please refer to the installation guide (https://matplotlib.org/users/installing.html). If you have any problems installing, check Stack Overflow for possible solutions. The most common error is caused by missing dependencies (e.g. pkg-config/libpng-dev/libfreetype6-dev).
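If you are not sure whether the installation worked, a quick check from Python can help. This is only a minimal sketch: the helper name is_installed is made up for this post, and the list of package names simply matches the libraries used later in this tutorial.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the given top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Package names used later in this post ('sklearn' is the import name of scikit-learn)
for name in ("matplotlib", "pandas", "sklearn"):
    print(name, "installed" if is_installed(name) else "missing")
```

Anything reported as missing can then be installed with pip or apt-get as described above.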

OK, once you have matplotlib installed, we can do some magic ;).

First of all, import matplotlib; for plots you will usually use:

import matplotlib.pyplot as plt

For this example, we will use the iris example dataset that ships with scikit-learn (it is the same dataset that is bundled with R):

from sklearn import datasets
iris = datasets.load_iris()

To prepare this data as a DataFrame, we will do some pandas-based manipulations:

import pandas as pd
frame = pd.DataFrame(iris.data, columns = iris.feature_names)
frame['type'] = iris.target

So now we have a pandas DataFrame with 4 columns of measurements and a 5th column telling us which plant each row comes from (0 – ‘setosa’, 1 – ‘versicolor’, 2 – ‘virginica’). To simply try out the library, we could also use randomly generated data (e.g. from numpy’s random module).
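As a rough sketch of the random-data option: the column names 'x', 'y' and 'type', the sample size and the distribution parameters below are all arbitrary choices for illustration and have nothing to do with the iris data.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example gives the same numbers on every run
rng = np.random.default_rng(seed=0)

# Hypothetical stand-in data: two numeric columns and a label column with values 0, 1 or 2
random_frame = pd.DataFrame({
    "x": rng.normal(loc=5.0, scale=1.0, size=50),
    "y": rng.normal(loc=3.0, scale=0.5, size=50),
    "type": rng.integers(low=0, high=3, size=50),
})

print(random_frame.head())
```

Such a frame can be passed to the plotting calls below in the same way as the iris frame, just with different column names.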

To check what our DataFrame looks like, we can look at its ‘head’.

frame.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  type
0                5.1               3.5                1.4               0.2     0
1                4.9               3.0                1.4               0.2     0
2                4.7               3.2                1.3               0.2     0
3                4.6               3.1                1.5               0.2     0
4                5.0               3.6                1.4               0.2     0

Now we can proceed to making our first plot. A line plot (the easiest one) doesn’t fit this data, so we will use a scatter plot of all the plants together, showing the relation between sepal length and petal length.

plt.scatter(frame['sepal length (cm)'], frame['petal length (cm)'])

plt.scatter() creates a scatter plot; its first two arguments are the x values and the y values, respectively.

To actually see the plot we need to write one more line:

plt.show()
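If you are working on a machine without a display (e.g. a remote server), plt.show() has nothing to open. One possible workaround, sketched below, is to write the figure to a file with plt.savefig(). The file name is just a placeholder, the 'Agg' backend is my assumption for a display-less setup, and the few points are the sepal and petal lengths from the head() output above.

```python
import os

import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so use a file-only backend
import matplotlib.pyplot as plt

fig_path = "scatterplot_example.png"  # placeholder file name

# Sepal length (x) and petal length (y) of the first five rows shown by frame.head()
plt.scatter([5.1, 4.9, 4.7, 4.6, 5.0], [1.4, 1.4, 1.3, 1.5, 1.4])
plt.savefig(fig_path, dpi=150)  # write the current figure to disk instead of showing it
plt.close()                     # release the figure once it has been saved

print("saved:", os.path.exists(fig_path))
```

On a normal desktop setup you can skip the matplotlib.use() line and call both plt.savefig() and plt.show().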

[Image: scatterplot1]

Looks nice, but we don’t really know what the units are or what the x and y values represent. Also, you can see that some of the points are separated from the rest. In the next step we will color the dots to see whether they belong to one of the three species. But first things first.

[Note: if you want to clear the figure, you can always do:

plt.clf()

and start over]

To add x and y labels with units, you can do:

plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')

[Image: scatterplot2]

OK, so now we would like to add some color to see whether the separated dots belong to one species. We have a column that says numerically which plant each row comes from. We can color the dots using the ‘c’ argument of plt.scatter() (we have to provide an array of colors whose length equals the number of dots). In matplotlib, colors can be specified in several ways, e.g. by simple names such as ‘red’, ‘blue’ and ‘green’.

colors = frame['type'].map({0: 'red', 1: 'blue', 2: 'green'})
plot1 = plt.scatter(frame['sepal length (cm)'], frame['petal length (cm)'], c = colors)
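To illustrate what "differently defined colors" can mean, here is a small sketch that uses matplotlib.colors.to_rgba() to convert three ways of writing red (a name, a hex string and an RGB tuple) into the same RGBA value. The variable names are mine, and there are more color formats than the three shown.

```python
from matplotlib import colors as mcolors

# Three different ways of specifying the same color
red_specs = ["red", "#ff0000", (1.0, 0.0, 0.0)]

# to_rgba() converts any valid color specification to an (r, g, b, a) tuple
red_as_rgba = [mcolors.to_rgba(spec) for spec in red_specs]

for spec, rgba in zip(red_specs, red_as_rgba):
    print(spec, "->", rgba)
```

Any of these forms should be usable wherever a color is expected, including the ‘c’ argument of plt.scatter().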

[Image: scatterplot3]

So now our plot gives us some information. But we need a legend, because I don’t think any of you will remember which color is which species.

To do it nicely, it is better to split the data by ‘type’ and create a separate scatter plot for each species (on the same figure).

plot1 = plt.scatter(frame[frame['type'] == 0]['sepal length (cm)'], frame[frame['type'] == 0]['petal length (cm)'], c = 'red')
plot2 = plt.scatter(frame[frame['type'] == 1]['sepal length (cm)'], frame[frame['type'] == 1]['petal length (cm)'], c = 'blue')
plot3 = plt.scatter(frame[frame['type'] == 2]['sepal length (cm)'], frame[frame['type'] == 2]['petal length (cm)'], c = 'green')
plt.legend([plot1, plot2, plot3], ['setosa', 'versicolor', 'virginica'])
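The three nearly identical lines can also be written as a loop. The sketch below is one possible variant, not the only way to do it: it passes label= to each plt.scatter() call so that plt.legend() can pick up the names by itself, and it takes the species names from iris.target_names. The species_colors dictionary and the loop variable names are my own.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
frame = pd.DataFrame(iris.data, columns=iris.feature_names)
frame['type'] = iris.target

# Color chosen for each numeric species code (same colors as above)
species_colors = {0: 'red', 1: 'blue', 2: 'green'}

plt.figure()  # start from a fresh, empty figure
for type_id, color in species_colors.items():
    subset = frame[frame['type'] == type_id]  # rows of one species only
    plt.scatter(subset['sepal length (cm)'], subset['petal length (cm)'],
                c=color, label=iris.target_names[type_id])

# With label= set on each scatter call, legend() needs no arguments
legend = plt.legend()
```

Call plt.show() afterwards to display the result, as before.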

Finally, we can add a title to our plot:

plt.title('Iris sepal/petal length')

Overall, our code now looks like this:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
frame = pd.DataFrame(iris.data, columns = iris.feature_names)
frame['type'] = iris.target
plot1 = plt.scatter(frame[frame['type'] == 0]['sepal length (cm)'], frame[frame['type'] == 0]['petal length (cm)'], c = 'red')
plot2 = plt.scatter(frame[frame['type'] == 1]['sepal length (cm)'], frame[frame['type'] == 1]['petal length (cm)'], c = 'blue')
plot3 = plt.scatter(frame[frame['type'] == 2]['sepal length (cm)'], frame[frame['type'] == 2]['petal length (cm)'], c = 'green')
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend([plot1, plot2, plot3], ['setosa', 'versicolor', 'virginica'])
plt.title('Iris sepal/petal length')
plt.show()

[Image: scatterplot4]

I hope everything was understandable. Have fun with matplotlib, and if you have any questions or suggestions for what to cover in matplotlib #2, write in the comments!
