data preprocessing
Author

Sathvik

Published

October 9, 2022

Libraries

Load datasets

Code
duans <- read_excel("_data/n=149 with Duans.xlsx")
labworks <- read_excel("_data/091522 LABWORKS DOWNLOAD.xlsx") %>% filter(NAME == "PEARSON") %>% 
  select(-c(UNITS, STATUS))
usgs_gauge_data <- read_excel("_data/1 of 3 - USGS gage data downloaded 092022.xlsx",
                              sheet = "Pearson", guess_max = 1048576)

Had a problem while loading the data.guess_max determines how many cells in each column are used to make a guess of the column type. we can provide a guess_max for read_excel to correctly guess the column type.

Data and the datatypes of Columns

Code
dm_draw(dm(duans, labworks, usgs_gauge_data), view_type = "all", column_types = TRUE, focus = TRUE)

cleaning columns with correct data types

labworks$RESULT is character type in the table, correcting it to numeric type. “NA” value introduced while correcting data type in (only one)observation where result is “>24200”.

Filtered usgs_guage_data for the labworks datetime range

Code
labworks$RESULT <- as.numeric(labworks$RESULT)
Warning: NAs introduced by coercion
Code
labworks$WATERBODY <- as.factor(labworks$WATERBODY)
labworks$LCOD <- as.factor(labworks$LCOD)
labworks$PARAMETER <- as.factor(labworks$PARAMETER)
labworks <- labworks %>% rename("datetime" = "DATETIME")
# labworks <- labworks %>% rename("Name" = "NAME")

usgs_gauge_data <- usgs_gauge_data %>% 
  filter(between (datetime, min(labworks$datetime), max(labworks$datetime)))
Code
labworks <- labworks %>% 
  pivot_wider(names_from = PARAMETER, values_from =  RESULT)
Warning: Values from `RESULT` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
  {data} %>%
    dplyr::group_by(NAME, WATERBODY, LCOD, datetime, PARAMETER) %>%
    dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
    dplyr::filter(n > 1L)

Converting the dataframes to data.table and merging them to the nearest datetime.

Code
labworks <- data.table(labworks)
usgs_gauge_data <- data.table(usgs_gauge_data)
duans <- data.table(duans)

setkey(labworks, datetime)
setkey(usgs_gauge_data, datetime)
setkey(duans, datetime)

labworks_USGS_combined <- usgs_gauge_data[labworks , roll = "nearest" ] %>% select(-NAME)

all_data <- duans[labworks_USGS_combined, roll = "nearest"]

names(all_data) <- str_to_title(names(all_data))

Information about data.table
Information about setkey
Information about roll = “nearest”

Code
duans
Code
labworks
Code
usgs_gauge_data
Code
labworks_USGS_combined
Code
all_data

saving merged file

Code
all_data <- apply(as.data.frame(all_data),2 ,as.character)
write.csv(all_data, "_data/merged_data.csv")