# Python basics 4: Cleaning data¶

This tutorial explores further concepts in Numpy such as categorical data, advanced indexing and dealing with Not-a-Number (NaN) data.

Follow the instructions below to download the tutorial and open it in the Sandbox.

## Download the tutorial notebook¶

Download the Python basics 4 tutorial notebook

To view this notebook on the Sandbox, you will need to first download it to your computer, then upload it to the Sandbox. Ensure you have followed the set-up prerequisities listed in Python basics 1: Jupyter, and then follow these instructions:

Download the notebook by clicking the link above.

On the Sandbox, open the

**Training**folder.Click the

**Upload Files**button as shown below.

Select the downloaded notebook using the file browser. Click

**OK**.The solution notebook will appear in the

**Training**folder. Double-click to open it.

You can now use the tutorial notebook as an interactive version of this webpage.

Note

The tutorial notebook should look like the text and code below. However, the tutorial notebook outputs are blank (i.e. no results showing after code cells). Follow the instructions in the notebook to run the cells in the tutorial notebook. Refer to this page to check your outputs look similar.

## Numpy dictionaries and categorical data¶

We will introduce a numpy structure called a **dictionary**. This will be useful for the next lesson on `xarray`

.

A dictionary represents a mapping between **keys** and **values**. The keys and values are Python objects of any type. We declare a dictionary using curly braces `{}`

. Inside we specify the key then its associated value, with the keys and values separated by a colon `:`

. Commas `,`

are used to separate elements in the dictionary.

```
dictionary_name = {key1: value1, key2: value2, key3: value3}
```

For example:

```
[ ]:
```

```
d = {1: 'one',
2: 'two',
3: 'apple'}
```

In the above dictionary `d`

, we have three **keys** `1`

, `2`

, `3`

, and their respective **values** `'one'`

, `'two'`

and `'apple'`

.

We can look up elements in a dictionary using the `[ key_name ]`

to address the value stored under a key. The syntax looks like:

```
dictionary_name[key_name]
```

In our example dictionary `d`

above, we can call upon the value associated with the key name `1`

like so:

```
d[1]
```

```
[ ]:
```

```
print(d[1], " + ", d[2], " = ", d[3])
```

Elements in a dictionary can be modified or new elements added by using the `dictionary_name[key_name] = value`

syntax.

```
[ ]:
```

```
d[3] = 'three'
d[4] = 'four'
print(d[1], " + ", d[2], " = ", d[3])
```

Again, the dictionary name, key name, and value must be specified.

Dictionaries are useful for data analysis (including satellite data analysis) because they make it easy to assign **categorical values** to our dataset. Remote sensing can be used to create classification products that use categorical values. These products do not contain continuous values. They use discrete values to represent different classes individual pixels can belong to.

As an example, the following cells simulate a very simple image containing three different land cover types. Value `1`

represents area covered with grass, `2`

croplands and `3`

city.

First, we import the libraries we want to use.

```
[4]:
```

```
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors
```

We will now create a 2-dimensional 100 pixel x 100 pixel numpy array where every value is `1`

. This is done using the `numpy.ones`

function. Then, we use array indexing to assign part of the area to have the value `2`

, and another part to have the value `3`

.

```
[5]:
```

```
# grass = 1
area = np.ones((100,100))
# crops = 2
area[10:60,20:50] = 2
# city = 3
area[70:90,60:80] = 3
area.shape, area.dtype
```

```
[5]:
```

```
((100, 100), dtype('float64'))
```

```
[6]:
```

```
area
```

```
[6]:
```

```
array([[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.]])
```

We now have a matrix filled with 1s, 2s and 3s. At this point, there is no association between the numbers and the different types of ground cover.

If we want to show what the area looks like according to the grass/crops/city designation, we might want to give each of the classifications a colour.

```
[7]:
```

```
# We map the values to colours
index = {1: 'green', 2: 'yellow', 3: 'grey'}
# Create a discrete colour map
cmap = colors.ListedColormap(index.values())
# Plot
plt.imshow(area, cmap=cmap)
```

```
[7]:
```

```
<matplotlib.image.AxesImage at 0x7fe7b5487f98>
```

In the case above, every pixel had a value of either `1`

, `2`

or `3`

. What happens if our dataset is incomplete and there is no data in some places?

This is a common problem in real-life datasets. Real datasets can be incomplete and may be missing data at certain times or places. To deal with this, we use the special value known as `NaN`

, which stands for **Not a Number**.

`NaNs`

are designated by the numpy `np.nan`

function.

```
[8]:
```

```
arr = np.array([1,2,3,4,5,np.nan,7,8,9], dtype=np.float32)
arr
```

```
[8]:
```

```
array([ 1., 2., 3., 4., 5., nan, 7., 8., 9.], dtype=float32)
```

To compute statistics on arrays containing NaN values, numpy has special versions of common functions such as `mean`

, standard deviation `std`

, and `sum`

that ignore the `NaN`

values. For example, the next cell shows the difference between using the usual `mean`

function and the `nanmean`

function.

The `mean`

function cannot handle `NaN`

values so it will return `nan`

. The `nanmean`

function does not include `NaN`

values in the calculation, and therefore returns a number value.

```
[9]:
```

```
print(np.mean(arr))
print(np.nanmean(arr))
```

```
nan
4.875
```

Note that `NaN`

is generally not used as a key in dictionary key-value entries because there are different ways of expressing `NaN`

in Python and they are not always equivalent. However, it is still possible to visualise data with `NaNs`

; there will be gaps in the image where there is no data.

## Exercises¶

### 4.1 The harvesting season has arrived and our cropping lands have changed colour to brown. Can you:¶

#### 4.1.1 Modify the yellow area to contain the new value `4`

?¶

#### 4.1.2 Add a new entry to the `index`

dictionary mapping number `4`

to the value `brown`

.¶

#### 4.1.3 Plot the area.¶

```
[ ]:
```

```
# 4.1.1 Modify the yellow area to hold the value 4
```

```
[ ]:
```

```
# 4.1.2 Add a new key-value pair to index that maps 4 to 'brown'
```

```
[ ]:
```

```
# 4.1.3 Copy the cmap definition and re-run it to add the new colour
# Plot the area
```

Hint:If you want to plot the new area, you have to redefine`cmap`

so the new value is assigned a colour in the colour map. Copy and paste the`cmap = ...`

line from the original plot.

### 4.2 Set `area[20:40, 80:95] = np.nan`

. Plot the area now.¶

```
[ ]:
```

```
# Set the nan area
```

```
[ ]:
```

```
# Plot the entire area
```

### 4.3 Find the median of the `area`

array from 4.2 using `np.nanmedian`

. Does this match your visual interpretation? How does this compare to using `np.median`

?¶

```
[ ]:
```

```
# Use np.nanmedian to find the median of the area
```

## Conclusion¶

Two key Python capabilities have been introduced in this section. We can organise our data using the dictionary syntax, and understand incomplete datasets that may use `NaN`

values to show blanks. The next lesson provides a guide to xarray, a Python package that builds on these concepts to make multi-dimensional data easier to load and use.