Data Wrangling

This chapter gives an overview of the most important data manipulation functions used throughout this book.

It is a fact that the data import and manipulation part of analyses often takes more time than the actual analysis or visualization itself. This is because the exchange data formats are not standardized and meters and sensors record at different time intervals. Data quality, missing data, data imputation and data cleansing also play a major role.

# Load data set

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/retomarek/edap/main/edap/sampleData/flatTempHum.csv",
                 sep = ";")
df.head()
time FlatA_Hum FlatA_Temp FlatB_Hum FlatB_Temp FlatC_Hum FlatC_Temp FlatD_Hum FlatD_Temp
0 2018-10-03 00:00:00 53.0 24.43 38.8 22.40 44.0 24.5 49.0 24.43
1 2018-10-03 01:00:00 53.0 24.40 38.8 22.40 44.0 24.5 49.0 24.40
2 2018-10-03 02:00:00 53.0 24.40 39.3 22.40 44.7 24.5 48.3 24.38
3 2018-10-03 03:00:00 53.0 24.40 40.3 22.40 45.0 24.5 48.0 24.33
4 2018-10-03 04:00:00 53.3 24.40 41.0 22.37 45.2 24.5 47.7 24.30

wide to long

dfWide = df.copy()
dfWide.head()
time FlatA_Hum FlatA_Temp FlatB_Hum FlatB_Temp FlatC_Hum FlatC_Temp FlatD_Hum FlatD_Temp
0 2018-10-03 00:00:00 53.0 24.43 38.8 22.40 44.0 24.5 49.0 24.43
1 2018-10-03 01:00:00 53.0 24.40 38.8 22.40 44.0 24.5 49.0 24.40
2 2018-10-03 02:00:00 53.0 24.40 39.3 22.40 44.7 24.5 48.3 24.38
3 2018-10-03 03:00:00 53.0 24.40 40.3 22.40 45.0 24.5 48.0 24.33
4 2018-10-03 04:00:00 53.3 24.40 41.0 22.37 45.2 24.5 47.7 24.30
dfLong = dfWide.melt(id_vars=["time"], var_name="dataPoint")
dfLong.head()
time dataPoint value
0 2018-10-03 00:00:00 FlatA_Hum 53.0
1 2018-10-03 01:00:00 FlatA_Hum 53.0
2 2018-10-03 02:00:00 FlatA_Hum 53.0
3 2018-10-03 03:00:00 FlatA_Hum 53.0
4 2018-10-03 04:00:00 FlatA_Hum 53.3

long to wide

dfWide = dfLong.pivot(index='time', columns='dataPoint', values='value')
dfWide.head()
dataPoint FlatA_Hum FlatA_Temp FlatB_Hum FlatB_Temp FlatC_Hum FlatC_Temp FlatD_Hum FlatD_Temp
time
2018-10-03 00:00:00 53.0 24.43 38.8 22.40 44.0 24.5 49.0 24.43
2018-10-03 01:00:00 53.0 24.40 38.8 22.40 44.0 24.5 49.0 24.40
2018-10-03 02:00:00 53.0 24.40 39.3 22.40 44.7 24.5 48.3 24.38
2018-10-03 03:00:00 53.0 24.40 40.3 22.40 45.0 24.5 48.0 24.33
2018-10-03 04:00:00 53.3 24.40 41.0 22.37 45.2 24.5 47.7 24.30

Access Data

get values of column(s)

df["FlatA_Temp"]
0        24.43
1        24.40
2        24.40
3        24.40
4        24.40
         ...  
17176      NaN
17177      NaN
17178      NaN
17179      NaN
17180      NaN
Name: FlatA_Temp, Length: 17181, dtype: float64

Note

A Series is returned!

df[["FlatA_Temp", "FlatA_Hum"]]
FlatA_Temp FlatA_Hum
0 24.43 53.0
1 24.40 53.0
2 24.40 53.0
3 24.40 53.0
4 24.40 53.3
... ... ...
17176 NaN NaN
17177 NaN NaN
17178 NaN NaN
17179 NaN NaN
17180 NaN NaN

17181 rows × 2 columns

Note

A Dataframe is returned!

Rename column name(s)

df.columns
Index(['time', 'FlatA_Hum', 'FlatA_Temp', 'FlatB_Hum', 'FlatB_Temp',
       'FlatC_Hum', 'FlatC_Temp', 'FlatD_Hum', 'FlatD_Temp'],
      dtype='object')
df1 = df.copy()
df1.rename({'FlatA_Temp': 'A_Temp', 'FlatA_Hum': 'A_Hum'}, axis=1, inplace=True)
df1.columns
Index(['time', 'A_Hum', 'A_Temp', 'FlatB_Hum', 'FlatB_Temp', 'FlatC_Hum',
       'FlatC_Temp', 'FlatD_Hum', 'FlatD_Temp'],
      dtype='object')

select/drop column(s)

select column(s) to keep

df1 = df.copy()
df1 = df[['FlatA_Temp','FlatB_Temp']]
#df1.head()

drop column(s) by index

df2 = df.copy()
df2.drop(df.columns[[1, 2]], axis=1, inplace=True)
#df2.head()

drop column(s) by name

df3 = df.copy()
df3 = df.drop(['FlatA_Hum', 'FlatB_Hum', 'FlatC_Hum', 'FlatD_Hum'], axis=1)
#df3.head()