This post explores weather data – more specifically, daily maximum and minimum temperatures – for the Netherlands between 2005 and 2015, as part of Assignment II in the Coursera course ‘Applied Plotting, Charting & Data Representation in Python’ by the University of Michigan.
The course provides a NOAA dataset in a CSV file, which contains a subset of the National Centers for Environmental Information (NCEI) daily climate records from land surface stations across the globe. The data is location-based, and my subset covers six locations in the Netherlands with the following variables: land station identification code, date recorded in YYYY-MM-DD format, element (minimum, TMIN, or maximum, TMAX, temperature), and temperature in tenths of degrees Celsius.¹
The assignment was to create a line graph of the record high and record low temperatures by day of the year over the period 2005-2014, shading the area between the record high and record low temperatures for each day. The graph should also include a scatter overlay of the 2015 data points for which the ten-year record high or record low was broken.
Analysis and visualisation
I used three packages for the visualisation and later ran a limited statistical analysis with SciPy, which is not part of the original assignment.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats
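The loading step itself is not shown in this post. A minimal sketch of what it might look like follows, assuming the CSV has the columns `Date`, `Element` and `Data_Value` used in the code further down, plus a station column that is called `ID` here purely for illustration. The sample rows and station codes are made up, and the function name `load_temperatures` is hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the course CSV; station IDs and values are made up.
SAMPLE_CSV = """ID,Date,Element,Data_Value
NLM00000001,2008-02-28,TMAX,85
NLM00000001,2008-02-29,TMAX,91
NLM00000002,2008-02-29,TMIN,-12
NLM00000002,2008-03-01,TMIN,5
"""


def load_temperatures(source):
    """Read the CSV, convert tenths of degrees Celsius to degrees, drop leap days."""
    df = pd.read_csv(source, parse_dates=["Date"])
    # The file stores temperatures in tenths of degrees Celsius.
    df["Data_Value"] = df["Data_Value"] / 10.0
    # Remove 29 February so every year has the same 365 calendar days.
    is_leap_day = (df["Date"].dt.month == 2) & (df["Date"].dt.day == 29)
    return df[~is_leap_day].reset_index(drop=True)


df = load_temperatures(io.StringIO(SAMPLE_CSV))
print(df)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.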
I used pandas to manage the data once imported into Python. The first hurdle was to extract the highest TMAX value and the lowest TMIN value for each date out of 43,000 records from six locations.² As I kept running into problems applying multiple functions to the full dataset, I decided to go with a split-apply-combine approach: filter each element separately, keep its extreme value per date, and concatenate the results.
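# Split: keep only the TMAX rows
dfx = df[df.Element == 'TMAX']
# Apply: keep the rows that hold the highest value for their date
dfx = dfx[dfx.groupby('Date')['Data_Value'].transform('max') == dfx['Data_Value']]
# Several stations can tie for the extreme; keep one row per date
dfx = dfx.drop_duplicates(subset='Date', keep='first')
# Same steps for TMIN, keeping the lowest value per date
dfn = df[df.Element == 'TMIN']
dfn = dfn[dfn.groupby('Date')['Data_Value'].transform('min') == dfn['Data_Value']]
dfn = dfn.drop_duplicates(subset='Date', keep='first')
# Combine: stack both subsets and order them by date and element
df = pd.concat([dfx, dfn]).sort_values(['Date', 'Element'])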
The resulting data frame contains data for 2005 up to and including 2014, with two entries for each day: maximum temperature and minimum temperature. This data was then compared to TMIN and TMAX in 2015 for the same days, to yield the data for the scatter overlay. This results in the following graph (Figure 1).
The purpose of the figure is to indicate trends and to establish the maximum temperature records broken in 2015. Since it was going to be impossible to distinguish between values for individual days, and to prevent clutter on the x-axis, I left out the ticks for individual days and provided monthly labels instead. This makes the graph somewhat inconsistent, because the lines are still based on those individual data points; a better approach would be to make the chart interactive, so that individual data points can be read off easily.
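The steps described above (record band for 2005-2014, comparison with 2015, monthly tick labels) could be sketched roughly as follows. This is not the original script: the temperatures are synthetic, and the wide layout with `TMAX` and `TMIN` columns, the `MonthDay` key and all variable names are assumptions made for the example.

```python
import calendar

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily extremes (one row per date) standing in for the real data.
rng = np.random.default_rng(0)
dates = pd.date_range("2005-01-01", "2015-12-31", freq="D")
dates = dates[~((dates.month == 2) & (dates.day == 29))]  # drop leap days
season = 10 - 10 * np.cos(2 * np.pi * (dates.dayofyear.to_numpy() - 15) / 365)
daily = pd.DataFrame({
    "Date": dates,
    "TMAX": season + 5 + rng.normal(0, 3, len(dates)),
    "TMIN": season - 5 + rng.normal(0, 3, len(dates)),
})

# A month-day key lines up the same calendar day across years.
daily["MonthDay"] = daily["Date"].dt.strftime("%m-%d")
is_2015 = daily["Date"].dt.year == 2015

# Record high and low per calendar day over 2005-2014.
base = daily[~is_2015].groupby("MonthDay").agg(
    high=("TMAX", "max"), low=("TMIN", "min")
)
y2015 = daily[is_2015].set_index("MonthDay")[["TMAX", "TMIN"]].reindex(base.index)

# 2015 days that broke the ten-year record.
broke_high = y2015["TMAX"] > base["high"]
broke_low = y2015["TMIN"] < base["low"]

x = np.arange(len(base))
high, low = base["high"].to_numpy(), base["low"].to_numpy()
hi_mask, lo_mask = broke_high.to_numpy(), broke_low.to_numpy()

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, high, color="firebrick", lw=1, label="Record high 2005-2014")
ax.plot(x, low, color="steelblue", lw=1, label="Record low 2005-2014")
ax.fill_between(x, low, high, color="grey", alpha=0.2)
ax.scatter(x[hi_mask], y2015["TMAX"].to_numpy()[hi_mask],
           s=12, color="black", label="2015 record high broken")
ax.scatter(x[lo_mask], y2015["TMIN"].to_numpy()[lo_mask],
           s=12, color="purple", label="2015 record low broken")

# One tick per month instead of one per day.
month_starts = [i for i, md in enumerate(base.index) if md.endswith("-01")]
ax.set_xticks(month_starts)
ax.set_xticklabels(list(calendar.month_abbr)[1:])
ax.set_ylabel("Temperature (°C)")
ax.legend(loc="lower center", ncol=2, frameon=False)
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk. Grouping on a month-day string rather than the day-of-year number avoids a one-day shift after February in leap years.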
The mean TMIN over the full year is 6.4°C, and the mean TMAX is 14.2°C; the maximum TMAX and the minimum TMIN are naturally much farther apart, at 36.7°C and -22.9°C respectively. In 2015, the record maximum temperature was broken 48 times, and the temperature in 2015 was 1.7°C higher on average over all days.
The instructions do not specify why the year 2015 is singled out, but I have to assume it is simply because it is the most recent data. However, this graph might suggest that 2015 is special in some other way, namely that more records were broken than in other years. A quick analysis suggests this is not the case. Figure 2 shows the number of records broken by year for TMIN and TMAX; for each year the frequencies are derived in relation to all previous years. So, if a record high is set for January 1st in 2012 and is subsequently broken in 2013, this counts for both years.
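The counting rule described above can be illustrated with a small sketch. It assumes the data has been reshaped so that rows are calendar days and columns are years; the numbers below are made up, and `count_new_records` is a hypothetical helper, not the code behind Figure 2.

```python
import pandas as pd

# Hypothetical daily maxima (°C) for three calendar days over four years.
tmax = pd.DataFrame(
    {
        2005: [5.0, 6.0, 7.0],
        2006: [6.0, 5.0, 7.5],
        2007: [5.5, 8.0, 7.2],
        2008: [7.0, 7.0, 9.0],
    },
    index=["01-01", "01-02", "01-03"],
)


def count_new_records(wide, kind="high"):
    """Count, per year (column), the days that beat every earlier year."""
    if kind == "high":
        # Best value seen up to and including each year, shifted one year right
        # so that each column holds the record standing before that year.
        previous_best = wide.cummax(axis=1).shift(1, axis=1)
        is_record = wide.gt(previous_best)
    else:
        previous_best = wide.cummin(axis=1).shift(1, axis=1)
        is_record = wide.lt(previous_best)
    # The first year has no predecessor (NaN), so it never counts as a record.
    return is_record.sum()


high_counts = count_new_records(tmax, "high")
low_counts = count_new_records(tmax, "low")
print(high_counts)
```

For 1 January in this toy table, 2006 (6.0) beats 2005 and 2008 (7.0) beats 2006, so the same calendar day contributes a record to both years, which is the behaviour described in the text.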
Overall, there does not seem to be much of a difference between years. The outliers, with values deviating more than 1.5 standard deviations from the mean, are 2010, 2013 and 2014 for TMIN, and 2010 and 2014 for TMAX; neither set includes 2015. However, the number of degrees by which the records are broken does seem to matter statistically. Paired t-tests for both TMIN and TMAX comparing 2005-2014 to 2015 show that temperatures are significantly higher in 2015. TMAX averages 21.0°C in 2015 and 15.2°C over the previous ten years (t=-23.82, p<0.001), and TMIN averages 5.4°C in 2015 against -0.9°C (t=23.66, p<0.001). All in all, that would in my opinion justify including the scatter overlay in Figure 1.
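For readers who want to try a similar check, here is a minimal sketch of the two steps using `scipy.stats.ttest_rel` for the paired t-test. All numbers are hypothetical and unrelated to the results above; the reading of the outlier rule (distance from the mean, sample standard deviation) and the helper `format_p` are assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical number of records broken per year.
counts = np.array([30, 32, 29, 31, 55, 30])

# Flag years whose count lies more than 1.5 sample standard deviations from the mean.
deviation = np.abs(counts - counts.mean())
is_outlier = deviation > 1.5 * counts.std(ddof=1)

# Hypothetical paired observations (°C) for the same five calendar days.
record_2005_2014 = np.array([12.1, 13.4, 11.8, 14.0, 12.9])
tmax_2015 = np.array([13.0, 14.1, 12.5, 15.2, 13.1])

# Paired t-test: each calendar day is compared with itself across the two periods.
result = stats.ttest_rel(record_2005_2014, tmax_2015)


def format_p(p):
    """Report very small p-values as an upper bound instead of 0.000."""
    return "p < 0.001" if p < 0.001 else f"p = {p:.3f}"


print(is_outlier)
print(f"t = {result.statistic:.2f}, {format_p(result.pvalue)}")
```

The sign of t depends only on the order of the two arguments: with the earlier period first, a negative statistic means the 2015 values are higher.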
¹ As suggested in the instructions, leap days were removed from the dataset.
² I can provide the full script upon request should anyone be interested, but here I include only the main challenges and features of the data cleaning and graphics processes needed to render the figures.