tail(anthro.02)Introduction
This vignette demonstrates the use of mwana package’s functions to estimate the prevalence of wasting based on:
- Weight-for-height z-score (WFHZ) and/or oedema
- Raw MUAC values and/or oedema
- MUAC-for-age z-score (MFAZ) and/or oedema, and
- Combined prevalence.
The prevalence functions in mwana were carefully conceived and designed to simplify the workflow of a nutrition data analyst, especially when dealing with datasets containing imperfections that require additional layers of analysis. Let’s try to clarify this with two scenarios that you may have faced:
When analysing a multi-area dataset, users will likely need to estimate the prevalence for each area individually. Afterwards, they must extract the results and collate in a summary table for sharing.
When working with MUAC data, when age ratio test is rated as problematic, an additional tool is required to weight the prevalence and correct for age bias, thereby the associated likely prevalence over-estimation. In an unfortunate cases wherein multiple areas face this issue, the workflow must be repeated several times, making the process boredom and highly error-prone.
With mwana, you no longer have to worry about this. The functions are designed to deal with that. To demonstrate their use, we will use different datasets to represent different scenarios:
-
anthro.02: a survey data with survey weights. Learn more about this data with?anthro.02. -
anthro.03: district-level SMART surveys with two districts whose WFHZ standard deviations are rated as problematic while the rest lay within range. Do?anthro.03for more details. -
anthro.04: a community-based sentinel site data. The data has different characteristics that require different analysis approaches.
Estimation of the prevalence of wasting based on WFHZ
To estimate the prevalence of wasting based on WFHZ we use the mw_estimate_prevalence_wfhz() function. The dataset to supply must have been wrangled by mw_wrangle_wfhz().
As usual, we start off by inspecting our dataset:
#> # A tibble: 6 × 14
#> province strata cluster sex age weight height oedema muac wtfactor wfhz
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 Nampula Urban 285 1 59.5 13.8 90.7 n 149 487. 0.689
#> 2 Nampula Rural 234 1 59.5 17.2 105. n 193 1045. 0.178
#> 3 Nampula Rural 263 1 59.6 18.4 100 n 156 952. 2.13
#> 4 Nampula Rural 257 1 59.7 15.9 100. n 149 987. 0.353
#> 5 Nampula Rural 239 1 59.8 12.5 91.5 n 135 663. -0.722
#> 6 Nampula Rural 263 1 60.0 14.3 93.8 n 142 952. 0.463
#> # ℹ 3 more variables: flag_wfhz <dbl>, mfaz <dbl>, flag_mfaz <dbl>
We can see that the dataset contains the required variables for a WFHZ prevalence analysis, including for a weighted analysis. This dataset has already been wrangled, therefore there is no need to call the WFHZ wrangler.
Estimation of unweighted prevalence
To achieve this we do:
anthro.02 |>
mw_estimate_prevalence_wfhz(
wt = NULL,
oedema = oedema
)This will return:
#> # A tibble: 1 × 16
#> gam_n gam_p gam_p_low gam_p_upp gam_p_deff sam_n sam_p sam_p_low sam_p_upp
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 86 0.0408 0.0322 0.0494 Inf 14 0.00664 0.00273 0.0106
#> # ℹ 7 more variables: sam_p_deff <dbl>, mam_n <dbl>, mam_p <dbl>,
#> # mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>, N <dbl>
If for some reason the variable oedema is not available in the dataset, or it’s there but not plausible, we can exclude it from the analysis by setting the argument oedema to NULL:
anthro.02 |>
mw_estimate_prevalence_wfhz(
wt = NULL,
oedema = NULL # Setting oedema to NULL
)And we get:
#> # A tibble: 1 × 16
#> gam_n gam_p gam_p_low gam_p_upp gam_p_deff sam_n sam_p sam_p_low sam_p_upp
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 72 0.0342 0.0263 0.0420 Inf 0 0 0 0
#> # ℹ 7 more variables: sam_p_deff <dbl>, mam_n <dbl>, mam_p <dbl>,
#> # mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>, N <dbl>
If we inspect the gam_n and gam_p columns of this output table and the previous, we notice differences in the numbers. This occurs because oedema cases were excluded in the second implementation. It is noteworthy that you will observe a change if there are positive cases of oedema in the dataset; otherwise, setting oedema = NULL will have no effect whatsoever.
The above output summary does not show results by province. We can control this by supplying the variable or set of variables containing the locations where the data was collected, or any other category (such as teams, sex, etc.) after oedema. In our case, we will use the column province:
anthro.02 |>
mw_estimate_prevalence_wfhz(
wt = NULL,
oedema = oedema,
province # province is the variable name holding data on where the survey was conducted.
)And voila :
#> # A tibble: 2 × 17
#> province gam_n gam_p gam_p_low gam_p_upp gam_p_deff sam_n sam_p sam_p_low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Nampula 53 0.0546 0.0397 0.0695 Inf 10 0.0103 0.00282
#> 2 Zambezia 33 0.0290 0.0195 0.0384 Inf 4 0.00351 0.0000639
#> # ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#> # mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>, N <dbl>
A table with two rows is returned with each province’s statistics.
Estimation of weighted prevalence
To get the weighted prevalence, we use the wt argument. We pass to it the column name containing the final survey weights. In our case, the column name is wtfactor:
anthro.02 |>
mw_estimate_prevalence_wfhz(
wt = wtfactor, # Passing the wtfactor to wt
oedema = oedema,
province
)And you get:
#> # A tibble: 2 × 17
#> province gam_n gam_p gam_p_low gam_p_upp gam_p_deff sam_n sam_p sam_p_low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Nampula 53 0.0595 0.0410 0.0779 1.52 10 0.0129 0.00272
#> 2 Zambezia 33 0.0261 0.0161 0.0361 1.16 4 0.00236 -0.000255
#> # ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#> # mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>, N <dbl>
The work under the hood of
mw_estimate_prevalence_wfhzUnder the hood, before it begins with the prevalence estimation, the function first checks the quality of the WFHZ standard deviation. If it is not problematic, it proceeds with a complex-sample-based analysis; otherwise, prevalence is estimated applying the PROBIT method. This is as you see in the body of the plausibility report generated by ENA. The
anthro.02dataset has no such issues, so you don’t seemw_estimate_prevalence_wfhzin action in this regard. To see that, let’s use theanthro.03dataset.
anthro.03 contains problematic standard deviation in Metuge and Maravia districts; the rest lay within range.
Let’s inspect our dataset:
#> # A tibble: 6 × 9
#> district cluster team sex age weight height oedema muac
#> <chr> <int> <int> <chr> <dbl> <dbl> <dbl> <chr> <int>
#> 1 Metuge 2 2 m 9.99 10.1 69.3 n 172
#> 2 Metuge 2 2 f 43.6 10.9 91.5 n 130
#> 3 Metuge 2 2 f 32.8 11.4 91.4 n 153
#> 4 Metuge 2 2 f 7.62 8.3 69.5 n 133
#> 5 Metuge 2 2 m 28.4 10.7 82.3 n 143
#> 6 Metuge 2 2 f 12.3 6.6 69.4 n 121
Now let’s apply the prevalence function. This data needs to be wrangled before passing it to the prevalence function:
anthro.03 |>
mw_wrangle_wfhz(
sex = sex,
.recode_sex = TRUE,
height = height,
weight = weight
) |>
mw_estimate_prevalence_wfhz(
wt = NULL,
oedema = oedema,
district
)The returned output will be:
#> ================================================================================
#> # A tibble: 4 × 17
#> district gam_n gam_p gam_p_low gam_p_upp gam_p_deff sam_n sam_p sam_p_low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Cahora-Ba… 22 0.0738 0.0348 0.113 Inf 1 0.00336 -0.00348
#> 2 Chiuta 10 0.0444 0.0129 0.0759 Inf 1 0.00444 -0.00466
#> 3 Maravia NA 0.0450 NA NA NA NA 0.00351 NA
#> 4 Metuge NA 0.0251 NA NA NA NA 0.00155 NA
#> # ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#> # mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>, N <dbl>
In this output, while in Cahora-Bassa and Chiúta districts all columns are populated with numbers, in Metuge and Maravia, only the gam_p, sam_p and mam_p columns are filled with numbers, and everything else with NA. These are the districts wherein the PROBIT method was applied.
Estimation of the prevalence of wasting based on MFAZ
The prevalence of wasting based on MFAZ can be estimated using the mw_estimate_prevalence_mfaz() function. This function is implemented in the same way as demonstrated in WFHZ, with the exception that its data wrangling is based on MUAC, which was also demonstrated in the plausibility checks.
Estimation of the prevalence of wasting based on raw MUAC values
This job is assigned to three different functions: mw_estimate_prevalence_muac(), mw_estimate_prevalence_screening() and mw_estimate_prevalence_screening2(). The former is designed for survey data, and the latter two for data derived from screenings. Nonetheless, under the hood, they all follow the following logic:
- If age ratio test result is not problematic, a normal analysis is performed. This means that for data derived from survey, standard complex-sample-based prevalence is estimated.
- If age ratio test is problematic and the proportion of children aged 24-59 months is < 66% (the expected), the prevalence is age-weighted; else, an age-unweighted prevalence is estimated.
When working with a multiple-area dataset, this logic is applied area wise.
How does it work on a multi-area dataset
Fundamentally, the function performs the standard deviation and age ratio tests, evaluates their acceptability, and returns a summarised table by area. It then iterates over that summary table (row-by-row) checking the above conditionals, and then the function accesses the original dataframe, pulls out the area-specific dataset, then it estimates the prevalence accordingly, and binds the results into a summary dataset.
Estimation for survey data
To demonstrate this we will use the anthro.04 dataset.
As usual, let’s first inspect it:
#> # A tibble: 6 × 6
#> analysis_unit cluster sex age muac oedema
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 Unit A 23 1 40 171 n
#> 2 Unit A 23 1 10 155 n
#> 3 Unit A 23 1 24 152 n
#> 4 Unit A 23 2 31 135 n
#> 5 Unit A 23 1 36 152 n
#> 6 Unit A 23 2 36 141 n
You see that this data has already been wrangled, so we will go straight to the prevalence estimation.
Important
As in ENA Software, make sure you run the plausibility check before you call the prevalence function. This is good to know about the acceptability of your data. If we do that with
anthro.04we will see which province has issues, hence what we should expect to see in below demonstrations is based on the conditionals stated above.
anthro.04 |>
mw_wrangle_age(age = age) |>
mw_wrangle_muac(
muac = muac,
.recode_muac = TRUE,
.to = "cm",
age = age,
sex = sex,
.recode_sex = FALSE
) |>
mutate(muac = recode_muac(muac, "mm")) |>
mw_estimate_prevalence_muac(
muac = muac,
age = age,
wt = NULL,
oedema = oedema,
analysis_unit
)This will return:
#> ================================================================================
#> # A tibble: 3 × 17
#> analysis_unit gam_n gam_p gam_p_low gam_p_upp gam_p_deff sam_n sam_p
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Unit A 39 0.0644 0.0389 0.0898 Inf 7 0.0116
#> 2 Unit B NA 0.121 NA NA NA NA 0.0293
#> 3 Unit C 19 0.0909 0.0492 0.133 Inf 5 0.0239
#> # ℹ 9 more variables: sam_p_low <dbl>, sam_p_upp <dbl>, sam_p_deff <dbl>,
#> # mam_n <dbl>, mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>,
#> # mam_p_deff <dbl>, N <dbl>
We see that in Unit A, all columns are filled with numbers; in Unit B, some columns are filled with numbers, while other are filled with NAs: this is where the age-weighting approach was applied.
Alternatively, we can choose to apply the function that estimates age-weighted prevalence inside mw_estimate_prevalence_muac() directly onto our dataset. This can be done by calling the mw_estimate_age_weighted_prev_muac() function. It is noteworthy that although possible, it is recommend to use the main function. This is simply due the fact that if we decide to use the function independently, then we must, before calling it, check the acceptability of the age ratio test, and then evaluate if the conditions that fits the use mw_estimate_age_weighted_prev_muac() are there. We would have to do that ourselves. This can be boredom along the workflow, thereby increase the risk of picking a wrong analysis workflow.
anthro.04 |>
mw_wrangle_age(age = age) |>
mw_wrangle_muac(
muac = muac,
.recode_muac = TRUE,
.to = "cm",
age = age,
sex = sex,
.recode_sex = FALSE
) |>
mutate(muac = recode_muac(muac, "mm")) |>
mw_estimate_age_weighted_prev_muac(
muac = muac,
oedema = oedema,
has_age = TRUE,
age = age,
age_cat = FALSE,
raw_muac = FALSE,
analysis_unit
)This returns the prevalence estimates split into age categories and the overall age-weighted estimate. The latter is given in the last three columns: sam, mam, and gam.
#> ================================================================================
#> # A tibble: 3 × 13
#> analysis_unit u2oedema u2sam u2mam u2gam o2oedema o2sam o2mam o2gam
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Unit A 0 0.0368 0.121 0.158 0 0 0.0216 0.0216
#> 2 Unit B 0 0.0748 0.218 0.292 0 0.00665 0.0293 0.0359
#> 3 Unit C 0 0.0851 0.128 0.213 0 0.00617 0.0494 0.0556
#> # ℹ 4 more variables: sam_p <dbl>, mam_p <dbl>, gam_p <dbl>, N <int>
Estimation of weighted prevalence
For this we go back anthro.02 dataset.
We approach this task as follows:
anthro.02 |>
mw_wrangle_age(
age = age,
.decimals = 2
) |>
mw_wrangle_muac(
sex = sex,
.recode_sex = FALSE,
muac = muac,
.recode_muac = TRUE,
.to = "cm",
age = age
) |>
mutate(
muac = recode_muac(muac, .to = "mm")
) |>
mw_estimate_prevalence_muac(
muac = muac,
age = age,
wt = wtfactor,
oedema = oedema,
province
)This will return:
#> ================================================================================
#> # A tibble: 2 × 17
#> province gam_n gam_p gam_p_low gam_p_upp gam_p_deff sam_n sam_p sam_p_low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Nampula 61 0.0571 0.0369 0.0773 2.00 19 0.0196 0.00706
#> 2 Zambezia 57 0.0552 0.0380 0.0725 1.67 10 0.0133 0.00412
#> # ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#> # mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>, N <dbl>
Warning
You may have noticed that in the above code block, we called the
recode_muac()function insidemutate(). This is because after you usemw_wrangle_muac(), it puts the MUAC variable in centimetres. Themw_estimate_prevalence_muac()function was defined to accept MUAC in millimetres; therefore, it must be converted to millimetres.
Estimation for non-survey data
The anthro.04 dataset will be used to illustrate the application of these functions.
anthro.04 |>
mw_wrangle_age(age = age) |>
mw_wrangle_muac(
muac = muac,
.recode_muac = TRUE,
.to = "cm",
age = age,
sex = sex,
.recode_sex = FALSE
) |>
mutate(muac = recode_muac(muac, "mm")) |>
mw_estimate_prevalence_screening(
muac = muac,
age = age,
oedema = oedema,
analysis_unit
)The returned output is:
#> ================================================================================
#> # A tibble: 3 × 8
#> analysis_unit gam_n gam_p sam_n sam_p mam_n mam_p N
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 Unit A 39 0.0644 7 0.0116 32 0.0528 606
#> 2 Unit B NA 0.121 NA 0.0293 NA 0.0920 1354
#> 3 Unit C 19 0.0909 5 0.0239 14 0.0670 209
The mw_estimate_prevalence_screening2() function is applied as demonstrated below. In this example, the input data contains age in months rather than in categories. To meet the function’s requirements, we convert the age variable into two categories and store the result in a new age_cat variable.
anthro.04 |>
mutate(age_cat = ifelse(age < 24, "6-23", "24-59")) |>
mw_wrangle_muac(sex = sex, .recode_sex = TRUE, muac = muac) |>
mw_estimate_prevalence_screening2(
age_cat = age_cat,
muac = muac,
oedema = oedema,
analysis_unit
)This will return:
anthro.04 |>
mutate(age_cat = ifelse(age < 24, "6-23", "24-59")) |>
mw_wrangle_muac(sex = sex, .recode_sex = TRUE, muac = muac) |>
mw_estimate_prevalence_screening2(
age_cat = age_cat,
muac = muac,
oedema = oedema,
analysis_unit
)Estimation of the combined prevalence of wasting
The estimation of the combined prevalence of wasting is a task attributed to the mw_estimate_prevalence_combined() function. The case-definition is based on the WFHZ, the raw MUAC values and oedema. From the workflow standpoint, it combines the workflow demonstrated in Section 2 and in Section 4.
To demonstrate it’s implementation we will use the anthro.01 dataset.
Let’s inspect the data:
#> # A tibble: 6 × 11
#> area dos cluster team sex dob age weight height oedema
#> <chr> <date> <int> <int> <chr> <date> <int> <dbl> <dbl> <chr>
#> 1 District… 2023-12-04 1 3 m NA 59 15.6 109. n
#> 2 District… 2023-12-04 1 3 m NA 8 7.5 68.6 n
#> 3 District… 2023-12-04 1 3 m NA 19 9.7 79.5 n
#> 4 District… 2023-12-04 1 3 f NA 49 14.3 100. n
#> 5 District… 2023-12-04 1 3 f NA 32 12.4 92.1 n
#> 6 District… 2023-12-04 1 3 f NA 17 9.3 77.8 n
#> # ℹ 1 more variable: muac <int>
Data wrangling
It combines the data wrangling workflow of WFHZ and MUAC:
## Apply the wrangling workflow ----
anthro.01 |>
mw_wrangle_age(
dos = dos,
dob = dob,
age = age,
.decimals = 2
) |>
mw_wrangle_muac(
sex = sex,
.recode_sex = TRUE,
muac = muac,
.recode_muac = TRUE,
.to = "cm",
age = age
) |>
mutate(
muac = recode_muac(muac, .to = "mm")
) |>
mw_wrangle_wfhz(
sex = sex,
weight = weight,
height = height,
.recode_sex = FALSE
)This is to get the wfhz and flag_wfhz the mfaz and flag_mfaz added to the dataset. In the output below, we have just selected these columns:
#> ================================================================================
#> ================================================================================
#> # A tibble: 1,191 × 5
#> area wfhz flag_wfhz mfaz flag_mfaz
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 District E -1.83 0 -1.45 0
#> 2 District E -0.956 0 -1.67 0
#> 3 District E -0.796 0 -0.617 0
#> 4 District E -0.74 0 -1.02 0
#> 5 District E -0.679 0 -0.93 0
#> 6 District E -0.432 0 -1.10 0
#> 7 District E -0.078 0 -0.255 0
#> 8 District E -0.212 0 -0.677 0
#> 9 District E -1.07 0 -2.18 0
#> 10 District E -0.543 0 -0.403 0
#> # ℹ 1,181 more rows
Under the hood, the function applies the same analysis approach as in mw_estimate_prevalence_wfhz and in mw_estimate_prevalence_muac(). In this function, a concept of “combined flags” is used.
What is combined flag?
Combined flags consists in defining as flag any observation that is flagged in either
flag_wfhzorflag_mfazvectors. A new columncflagsfor combined flags is created and added to the dataset. This ensures that all flagged observations from both WFHZ and MFAZ data are excluded from the prevalence analysis.
| flag_wfhz | flag_mfaz | cflags |
|---|---|---|
| 1 | 0 | 1 |
| 0 | 1 | 1 |
| 0 | 0 | 0 |
Now that we understand what happens under the hood, we can now proceed to implement it:
## Apply the workflow ----
anthro.01 |>
mw_wrangle_age(
dos = dos,
dob = dob,
age = age,
.decimals = 2
) |>
mw_wrangle_muac(
sex = sex,
.recode_sex = TRUE,
muac = muac,
.recode_muac = TRUE,
.to = "cm",
age = age
) |>
mutate(
muac = recode_muac(muac, .to = "mm")
) |>
mw_wrangle_wfhz(
sex = sex,
weight = weight,
height = height,
.recode_sex = FALSE
) |>
mw_estimate_prevalence_combined(
wt = NULL,
oedema = oedema,
area
)We get this:
#> ================================================================================
#> ================================================================================
#> # A tibble: 2 × 5
#> area std_wfhz age_ratio_prop age_ratio_pval path
#> <chr> <chr> <dbl> <chr> <chr>
#> 1 District E Excellent 0.628 Excellent analyse
#> 2 District G Excellent 0.673 Excellent analyse
#> # A tibble: 2 × 17
#> area cgam_n cgam_p cgam_p_low cgam_p_upp cgam_p_deff csam_n csam_p
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 District E 44 0.0878 0.0521 0.124 Inf 3 0.00599
#> 2 District G 47 0.0703 0.0447 0.0958 Inf 5 0.00747
#> # ℹ 9 more variables: csam_p_low <dbl>, csam_p_upp <dbl>, csam_p_deff <dbl>,
#> # cmam_n <dbl>, cmam_p <dbl>, cmam_p_low <dbl>, cmam_p_upp <dbl>,
#> # cmam_p_deff <dbl>, N <dbl>
In district E NAs were returned because there were issues with the data. I leave it to you to figure out what was/were the issue/issues.
Tip
Consider running the plausibility checkers.
