duans <-read_excel("_data/n=149 with Duans.xlsx")labworks <-read_excel("_data/091522 LABWORKS DOWNLOAD.xlsx") %>%filter(NAME =="PEARSON") %>%select(-c(UNITS, STATUS))usgs_gauge_data <-read_excel("_data/1 of 3 - USGS gage data downloaded 092022.xlsx",sheet ="Pearson", guess_max =1048576)
Had a problem while loading the data.guess_max determines how many cells in each column are used to make a guess of the column type. we can provide a guess_max for read_excel to correctly guess the column type.
labworks$RESULT is character type in the table, correcting it to numeric type. “NA” value introduced while correcting data type in (only one)observation where result is “>24200”.
Filtered usgs_guage_data for the labworks datetime range
Warning: Values from `RESULT` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(NAME, WATERBODY, LCOD, datetime, PARAMETER) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
Converting the dataframes to data.table and merging them to the nearest datetime.
Code
labworks <-data.table(labworks)usgs_gauge_data <-data.table(usgs_gauge_data)duans <-data.table(duans)setkey(labworks, datetime)setkey(usgs_gauge_data, datetime)setkey(duans, datetime)labworks_USGS_combined <- usgs_gauge_data[labworks , roll ="nearest" ] %>%select(-NAME)all_data <- duans[labworks_USGS_combined, roll ="nearest"]names(all_data) <-str_to_title(names(all_data))
---title: "Data Preprocessing"author: "Sathvik"desription: "Data Preprocessing"date: "10/09/2022"format: html: df-print: paged toc: true code-fold: true code-copy: true code-tools: truecategories: - data preprocessing---## Libraries```{r, include=FALSE}library(readxl)library(dplyr)library(tidyr)library(dm)library(stringr)library(data.table)```## Load datasets```{r}duans <-read_excel("_data/n=149 with Duans.xlsx")labworks <-read_excel("_data/091522 LABWORKS DOWNLOAD.xlsx") %>%filter(NAME =="PEARSON") %>%select(-c(UNITS, STATUS))usgs_gauge_data <-read_excel("_data/1 of 3 - USGS gage data downloaded 092022.xlsx",sheet ="Pearson", guess_max =1048576)```Had a problem while loading the data.`guess_max` determines how many cells in each column are used to make a guess of the column type. we can provide a `guess_max` for read_excel to correctly guess the column type. ## Data and the datatypes of Columns```{r}dm_draw(dm(duans, labworks, usgs_gauge_data), view_type ="all", column_types =TRUE, focus =TRUE)```cleaning columns with correct data typeslabworks$RESULT is character type in the table, correcting it to numeric type. "NA" value introduced while correcting data type in (only one)observation where result is ">24200". Filtered usgs_guage_data for the labworks datetime range```{r}labworks$RESULT <-as.numeric(labworks$RESULT)labworks$WATERBODY <-as.factor(labworks$WATERBODY)labworks$LCOD <-as.factor(labworks$LCOD)labworks$PARAMETER <-as.factor(labworks$PARAMETER)labworks <- labworks %>%rename("datetime"="DATETIME")# labworks <- labworks %>% rename("Name" = "NAME")usgs_gauge_data <- usgs_gauge_data %>%filter(between (datetime, min(labworks$datetime), max(labworks$datetime)))``````{r}labworks <- labworks %>%pivot_wider(names_from = PARAMETER, values_from = RESULT)```Converting the dataframes to data.table and merging them to the nearest datetime.```{r}labworks <-data.table(labworks)usgs_gauge_data <-data.table(usgs_gauge_data)duans <-data.table(duans)setkey(labworks, datetime)setkey(usgs_gauge_data, datetime)setkey(duans, datetime)labworks_USGS_combined <- usgs_gauge_data[labworks , roll ="nearest" ] %>%select(-NAME)all_data <- duans[labworks_USGS_combined, roll ="nearest"]names(all_data) <-str_to_title(names(all_data))```Information about [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html)<br/>Information about [setkey](https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/setkey)<br/>Information about [roll = "nearest"](https://jangorecki.gitlab.io/data.cube/library/data.table/html/data.table.html)<br/>::: panel-tabset## duans```{r}duans```## labworks```{r}labworks```## usgs_gauge_data```{r}usgs_gauge_data```## labworks USGS combined```{r}labworks_USGS_combined```## All data combined```{r}all_data```:::## saving merged file```{r}all_data <-apply(as.data.frame(all_data),2 ,as.character)write.csv(all_data, "_data/merged_data.csv")```