## Setting up and loading data

### `setup` chunk

This chunk first sets up key preferences for this R Markdown file, in `opts_chunk$set()`. By default it is set to not show code, errors, or warning messages in the output.

### Import and inspect data

The `read_data` chunk loads linelist data into RStudio. You can choose among six options for loading data. Delete or comment out the code that you do not need, and make sure the code that you do need is recognised as code (not commented out):

1) Load data from a specific sheet of an Excel file.
2) Load data from an Excel file with macros (this requires `read_excel`).
3) Load data from a particular range of cells in an Excel file.
4) Load data from a particular sheet of a password-protected Excel file. Note that this requires installing some additional packages.
5) Load data from a CSV file.
6) Load data from a Stata file.

When you load real data, you need to specify the right file name and path. The placeholders assume you are working in an RStudio project, and that the linelist is in a subfolder called `Data`.

The `inspect_data` chunk will show you some basic information about your linelist: the dimensions and the column names.

### Understand your expected data format

There is no R chunk for this step, but to ensure you clean your data correctly, make sure you are aware of:

- The columns required for your outbreak report, and their column names
- The contents of columns, in particular the expected possible values and their spellings for categorical columns

You can use the function `msf_dict()` from the `{sitrep}` package for this. The package contains data dictionaries and simulated datasets for measles, meningitis, AJS, cholera, and diphtheria.

The `msf_dict()` function will show you the data dictionary for your selected disease - see the example for measles below. In the resulting dictionary:

- The data dictionary shows variable names in the `variable` column.
- Accepted values for each variable are specified in the `values_short` and `values_long` columns. `values_short` has the shortened value and `values_long` has the full-text value. You will need to recode your current linelist values into the `values_short` content.

```{r eval=F}
## get MSF standard dictionary for measles
recode_dict <- msf_dict("measles", compact = FALSE) |>
  select("variable"     = data_element_shortname,
         "values_short" = option_code,
         "values_long"  = option_name)

## browse dictionary
View(recode_dict)
```

### `clean_column_names` chunk

This chunk fixes the column names in two steps:

1) Use the `clean_names()` function from the package `{janitor}` to automatically standardise column names as per good coding practice (lower case, with spaces and punctuation removed).
2) Manually clean column names to match the linelist standard. The example code shows some renaming for measles, e.g. we have the column `sex` that we want to rename as `sex_id`. The syntax for this is `rename(data, NEW_NAME = OLD_NAME)`.

To facilitate this step, you can also use the function `msf_dict_rename_helper()` to create a template based again on the data dictionary held for that disease in the `{sitrep}` package. Do this with the following steps:

1) Run `msf_dict_rename_helper("xxxx")`, where "xxxx" refers to the disease name. For instance, you can type `msf_dict_rename_helper("Measles")`. This will copy a rename command to your clipboard.
2) Paste the result in your code and edit it to rename the specific columns you need.

Be careful! You still need to be aware of what each variable means and what values it takes. If any columns in the MSF dictionary are not in your data set, you should comment them out, but be aware that some analyses may not run because of this.

Here is an example of what the pasted code looks like, which you can then edit so that the names of the columns in your data are on the right-hand side.
```{r eval=F}
## Add the appropriate column names after the equals signs

linelist_cleaned <- rename(linelist_cleaned,
  acute_otitis_media             =   , # BOOLEAN (REQUIRED)
  age_days                       =   , # INTEGER_POSITIVE (REQUIRED)
  age_months                     =   , # INTEGER_POSITIVE (REQUIRED)
  age_years                      =   , # INTEGER_POSITIVE (REQUIRED)
  candidiasis                    =   , # BOOLEAN (REQUIRED)
  case_number                    =   , # TEXT (REQUIRED)
  cough                          =   , # BOOLEAN (REQUIRED)
  croup                          =   , # BOOLEAN (REQUIRED)
  date_of_consultation_admission =   , # DATE (REQUIRED)
  residential_status             =   , # TEXT (optional)
  residential_status_brief       =   , # TEXT (optional)
  treatment_facility_name        =   , # TEXT (optional)
  treatment_facility_site        =   , # TEXT (optional)
  treatment_location             =   , # ORGANISATION_UNIT (optional)
  trimester                      =     # TEXT (optional)
)
```

### `standardise_capitalisation` chunk

Before browsing data, you can standardise the capitalisation of categorical values. This minimises the number of corrections that are needed later on in the code.

### `browse_data` chunk

You will want to look at your data to find out what errors and typos exist in the column values. This chunk shows you a few ways to explore. The `tbl_summary()` function in particular will show you all the values within categorical columns.

### `recode_factor_vars` chunk

This chunk is for recoding factor (categorical) variables. You will need to edit this section to recode the values in your dataset to suit the values in the expected linelist format. You can look at the data dictionary object (`recode_dict`) and the outputs from the `browse_data` chunk to write the correct code.

There is an initial example in this code which shows you how to fix misspellings in the columns `sex_id` and `outcome`. You should put the various incorrect spellings that need correction into the brackets. Multiple different incorrect spellings can be listed within the brackets.
For example:

```{r eval=F}
linelist_processing <- linelist_processing |>
  ## correct variants of sex_id; values not listed are kept as they are
  mutate(sex_id = case_match(
    sex_id,
    c("M", "m")      ~ "Male",
    c("F", "FEMALE") ~ "Female",
    .default = sex_id
  )) |>
  ## correct variants of outcome; values not listed are kept as they are
  mutate(outcome = case_match(
    outcome,
    c("Dead in facility - short")                 ~ "Dead in facility (<4h)",
    c("Dead in facility - long")                  ~ "Dead in facility (>4h)",
    c("Sent home", "Home")                        ~ "Discharged home",
    c("Death in community", "Dead in community")  ~ "Dead in community",
    c("DOA")                                      ~ "Dead on arrival",
    c("Left")                                     ~ "Left against medical advice",
    c("Transferred - MSF")                        ~ "Transferred (to an MSF facility)",
    c("Transferred - External")                   ~ "Transferred (to External Facility)",
    .default = outcome
  ))
```

### `recode_numeric_vars` chunk

This chunk will help recode numeric variables, including restructuring the age column so that it is represented by two columns (age as a number, and unit). You will need to add to it based on your dataset by comparing to the variables in the standard data dictionary.

### `save_recoded_data` chunk

Save your recoded dataset as an Excel file. This automatically names your file "linelist_recoded_DATE", where DATE is the current date. You can now use this file with the analysis template.
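
As a minimal sketch of the naming convention described for `save_recoded_data`, the code below builds a file name of the form `linelist_recoded_DATE.xlsx` and writes a small data frame to it. The toy data frame, the `out_dir` folder, and the use of `writexl::write_xlsx()` are assumptions for illustration only; the template's own chunk may use a different export function and folder.

```r
## Hypothetical stand-in for the recoded linelist (not real data)
linelist_recoded <- data.frame(
  case_number = c("A001", "A002"),
  sex_id      = c("Male", "Female")
)

## Build a file name of the form "linelist_recoded_DATE.xlsx"
## (Sys.Date() is converted to text as YYYY-MM-DD)
file_name <- paste0("linelist_recoded_", Sys.Date(), ".xlsx")

## Assumed output folder: a temporary directory is used here so the sketch
## does not write into your project; replace it with your own folder
out_dir  <- tempdir()
out_path <- file.path(out_dir, file_name)

## Write the file only if {writexl} is installed (an assumption, see above)
if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(linelist_recoded, path = out_path)
}
```

Because the date is written as `YYYY-MM-DD`, files saved on different days sort chronologically by name.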