Data Wrangling¶
This chapter gives an overview of the most important data manipulation functions used throughout this book.
It is a fact that the data import and manipulation part of analyses often takes more time than the actual analysis or visualization itself. This is because the exchange data formats are not standardized and meters and sensors record at different time intervals. Data quality, missing data, data imputation and data cleansing also play a major role.
# Load data set
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/retomarek/edap/main/edap/sampleData/flatTempHum.csv",
sep = ";")
df.head()
| time | FlatA_Hum | FlatA_Temp | FlatB_Hum | FlatB_Temp | FlatC_Hum | FlatC_Temp | FlatD_Hum | FlatD_Temp | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-10-03 00:00:00 | 53.0 | 24.43 | 38.8 | 22.40 | 44.0 | 24.5 | 49.0 | 24.43 |
| 1 | 2018-10-03 01:00:00 | 53.0 | 24.40 | 38.8 | 22.40 | 44.0 | 24.5 | 49.0 | 24.40 |
| 2 | 2018-10-03 02:00:00 | 53.0 | 24.40 | 39.3 | 22.40 | 44.7 | 24.5 | 48.3 | 24.38 |
| 3 | 2018-10-03 03:00:00 | 53.0 | 24.40 | 40.3 | 22.40 | 45.0 | 24.5 | 48.0 | 24.33 |
| 4 | 2018-10-03 04:00:00 | 53.3 | 24.40 | 41.0 | 22.37 | 45.2 | 24.5 | 47.7 | 24.30 |
wide to long¶
dfWide = df.copy()
dfWide.head()
| time | FlatA_Hum | FlatA_Temp | FlatB_Hum | FlatB_Temp | FlatC_Hum | FlatC_Temp | FlatD_Hum | FlatD_Temp | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-10-03 00:00:00 | 53.0 | 24.43 | 38.8 | 22.40 | 44.0 | 24.5 | 49.0 | 24.43 |
| 1 | 2018-10-03 01:00:00 | 53.0 | 24.40 | 38.8 | 22.40 | 44.0 | 24.5 | 49.0 | 24.40 |
| 2 | 2018-10-03 02:00:00 | 53.0 | 24.40 | 39.3 | 22.40 | 44.7 | 24.5 | 48.3 | 24.38 |
| 3 | 2018-10-03 03:00:00 | 53.0 | 24.40 | 40.3 | 22.40 | 45.0 | 24.5 | 48.0 | 24.33 |
| 4 | 2018-10-03 04:00:00 | 53.3 | 24.40 | 41.0 | 22.37 | 45.2 | 24.5 | 47.7 | 24.30 |
dfLong = dfWide.melt(id_vars=["time"], var_name="dataPoint")
dfLong.head()
| time | dataPoint | value | |
|---|---|---|---|
| 0 | 2018-10-03 00:00:00 | FlatA_Hum | 53.0 |
| 1 | 2018-10-03 01:00:00 | FlatA_Hum | 53.0 |
| 2 | 2018-10-03 02:00:00 | FlatA_Hum | 53.0 |
| 3 | 2018-10-03 03:00:00 | FlatA_Hum | 53.0 |
| 4 | 2018-10-03 04:00:00 | FlatA_Hum | 53.3 |
long to wide¶
dfWide = dfLong.pivot(index='time', columns='dataPoint', values='value')
dfWide.head()
| dataPoint | FlatA_Hum | FlatA_Temp | FlatB_Hum | FlatB_Temp | FlatC_Hum | FlatC_Temp | FlatD_Hum | FlatD_Temp |
|---|---|---|---|---|---|---|---|---|
| time | ||||||||
| 2018-10-03 00:00:00 | 53.0 | 24.43 | 38.8 | 22.40 | 44.0 | 24.5 | 49.0 | 24.43 |
| 2018-10-03 01:00:00 | 53.0 | 24.40 | 38.8 | 22.40 | 44.0 | 24.5 | 49.0 | 24.40 |
| 2018-10-03 02:00:00 | 53.0 | 24.40 | 39.3 | 22.40 | 44.7 | 24.5 | 48.3 | 24.38 |
| 2018-10-03 03:00:00 | 53.0 | 24.40 | 40.3 | 22.40 | 45.0 | 24.5 | 48.0 | 24.33 |
| 2018-10-03 04:00:00 | 53.3 | 24.40 | 41.0 | 22.37 | 45.2 | 24.5 | 47.7 | 24.30 |
Access Data¶
get values of column(s)¶
df["FlatA_Temp"]
0 24.43
1 24.40
2 24.40
3 24.40
4 24.40
...
17176 NaN
17177 NaN
17178 NaN
17179 NaN
17180 NaN
Name: FlatA_Temp, Length: 17181, dtype: float64
Note
A Series is returned!
df[["FlatA_Temp", "FlatA_Hum"]]
| FlatA_Temp | FlatA_Hum | |
|---|---|---|
| 0 | 24.43 | 53.0 |
| 1 | 24.40 | 53.0 |
| 2 | 24.40 | 53.0 |
| 3 | 24.40 | 53.0 |
| 4 | 24.40 | 53.3 |
| ... | ... | ... |
| 17176 | NaN | NaN |
| 17177 | NaN | NaN |
| 17178 | NaN | NaN |
| 17179 | NaN | NaN |
| 17180 | NaN | NaN |
17181 rows × 2 columns
Note
A Dataframe is returned!
Rename column name(s)¶
df.columns
Index(['time', 'FlatA_Hum', 'FlatA_Temp', 'FlatB_Hum', 'FlatB_Temp',
'FlatC_Hum', 'FlatC_Temp', 'FlatD_Hum', 'FlatD_Temp'],
dtype='object')
df1 = df.copy()
df1.rename({'FlatA_Temp': 'A_Temp', 'FlatA_Hum': 'A_Hum'}, axis=1, inplace=True)
df1.columns
Index(['time', 'A_Hum', 'A_Temp', 'FlatB_Hum', 'FlatB_Temp', 'FlatC_Hum',
'FlatC_Temp', 'FlatD_Hum', 'FlatD_Temp'],
dtype='object')
select/drop column(s)¶
select column(s) to keep¶
df1 = df.copy()
df1 = df[['FlatA_Temp','FlatB_Temp']]
#df1.head()
drop column(s) by index¶
df2 = df.copy()
df2.drop(df.columns[[1, 2]], axis=1, inplace=True)
#df2.head()
drop column(s) by name¶
df3 = df.copy()
df3 = df.drop(['FlatA_Hum', 'FlatB_Hum', 'FlatC_Hum', 'FlatD_Hum'], axis=1)
#df3.head()