Data Preprocessing

data preprocessing

Author

Sathvik

Published

October 9, 2022

Libraries

Load datasets

Code

duans <- read_excel("_data/n=149 with Duans.xlsx")
labworks <- read_excel("_data/091522 LABWORKS DOWNLOAD.xlsx") %>% filter(NAME == "PEARSON") %>% 
  select(-c(UNITS, STATUS))
usgs_gauge_data <- read_excel("_data/1 of 3 - USGS gage data downloaded 092022.xlsx",
                              sheet = "Pearson", guess_max = 1048576)

Had a problem while loading the data.guess_max determines how many cells in each column are used to make a guess of the column type. we can provide a guess_max for read_excel to correctly guess the column type.

Data and the datatypes of Columns

Code

dm_draw(dm(duans, labworks, usgs_gauge_data), view_type = "all", column_types = TRUE, focus = TRUE)

cleaning columns with correct data types

labworks$RESULT is character type in the table, correcting it to numeric type. “NA” value introduced while correcting data type in (only one)observation where result is “>24200”.

Filtered usgs_guage_data for the labworks datetime range

Code

labworks$RESULT <- as.numeric(labworks$RESULT)

Warning: NAs introduced by coercion

Code

labworks$WATERBODY <- as.factor(labworks$WATERBODY)
labworks$LCOD <- as.factor(labworks$LCOD)
labworks$PARAMETER <- as.factor(labworks$PARAMETER)
labworks <- labworks %>% rename("datetime" = "DATETIME")
# labworks <- labworks %>% rename("Name" = "NAME")

usgs_gauge_data <- usgs_gauge_data %>% 
  filter(between (datetime, min(labworks$datetime), max(labworks$datetime)))

Code

labworks <- labworks %>% 
  pivot_wider(names_from = PARAMETER, values_from =  RESULT)

Warning: Values from `RESULT` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
  {data} %>%
    dplyr::group_by(NAME, WATERBODY, LCOD, datetime, PARAMETER) %>%
    dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
    dplyr::filter(n > 1L)

Converting the dataframes to data.table and merging them to the nearest datetime.

Code

labworks <- data.table(labworks)
usgs_gauge_data <- data.table(usgs_gauge_data)
duans <- data.table(duans)

setkey(labworks, datetime)
setkey(usgs_gauge_data, datetime)
setkey(duans, datetime)

labworks_USGS_combined <- usgs_gauge_data[labworks , roll = "nearest" ] %>% select(-NAME)

all_data <- duans[labworks_USGS_combined, roll = "nearest"]

names(all_data) <- str_to_title(names(all_data))

Information about data.table
Information about setkey
Information about roll = “nearest”

Code

duans

Code

labworks

Code

usgs_gauge_data

Code

labworks_USGS_combined

Code

all_data

saving merged file

Code

all_data <- apply(as.data.frame(all_data),2 ,as.character)
write.csv(all_data, "_data/merged_data.csv")

--- title: "Data Preprocessing" author: "Sathvik" desription: "Data Preprocessing" date: "10/09/2022" format: html: df-print: paged toc: true code-fold: true code-copy: true code-tools: true categories: - data preprocessing --- ## Libraries ```{r, include=FALSE} library(readxl) library(dplyr) library(tidyr) library(dm) library(stringr) library(data.table) ``` ## Load datasets ```{r} duans <- read_excel("_data/n=149 with Duans.xlsx") labworks <- read_excel("_data/091522 LABWORKS DOWNLOAD.xlsx") %>% filter(NAME == "PEARSON") %>% select(-c(UNITS, STATUS)) usgs_gauge_data <- read_excel("_data/1 of 3 - USGS gage data downloaded 092022.xlsx", sheet = "Pearson", guess_max = 1048576) ``` Had a problem while loading the data.`guess_max` determines how many cells in each column are used to make a guess of the column type. we can provide a `guess_max` for read_excel to correctly guess the column type. ## Data and the datatypes of Columns ```{r} dm_draw(dm(duans, labworks, usgs_gauge_data), view_type = "all", column_types = TRUE, focus = TRUE) ``` cleaning columns with correct data types labworks$RESULT is character type in the table, correcting it to numeric type. "NA" value introduced while correcting data type in (only one)observation where result is ">24200". Filtered usgs_guage_data for the labworks datetime range ```{r} labworks$RESULT <- as.numeric(labworks$RESULT) labworks$WATERBODY <- as.factor(labworks$WATERBODY) labworks$LCOD <- as.factor(labworks$LCOD) labworks$PARAMETER <- as.factor(labworks$PARAMETER) labworks <- labworks %>% rename("datetime" = "DATETIME") # labworks <- labworks %>% rename("Name" = "NAME") usgs_gauge_data <- usgs_gauge_data %>% filter(between (datetime, min(labworks$datetime), max(labworks$datetime))) ``` ```{r} labworks <- labworks %>% pivot_wider(names_from = PARAMETER, values_from = RESULT) ``` Converting the dataframes to data.table and merging them to the nearest datetime. ```{r} labworks <- data.table(labworks) usgs_gauge_data <- data.table(usgs_gauge_data) duans <- data.table(duans) setkey(labworks, datetime) setkey(usgs_gauge_data, datetime) setkey(duans, datetime) labworks_USGS_combined <- usgs_gauge_data[labworks , roll = "nearest" ] %>% select(-NAME) all_data <- duans[labworks_USGS_combined, roll = "nearest"] names(all_data) <- str_to_title(names(all_data)) ``` Information about [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) <br/> Information about [setkey](https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/setkey)<br/> Information about [roll = "nearest"](https://jangorecki.gitlab.io/data.cube/library/data.table/html/data.table.html)<br/> ::: panel-tabset ## duans ```{r} duans ``` ## labworks ```{r} labworks ``` ## usgs_gauge_data ```{r} usgs_gauge_data ``` ## labworks USGS combined ```{r} labworks_USGS_combined ``` ## All data combined ```{r} all_data ``` ::: ## saving merged file ```{r} all_data <- apply(as.data.frame(all_data),2 ,as.character) write.csv(all_data, "_data/merged_data.csv") ```