SFER Review Process

STUDY RATINGS

To determine whether a program caused a particular outcome, a study’s research design must be able to rule out alternative explanations. For example, an employment program for low-income fathers may measure employment levels before and after program participation, but changes in employment between the two points in time may be caused by factors other than the program. Fathers who are motivated to attend the program may also be motivated to seek out jobs, so their employment levels might increase over time regardless of program participation. To measure the true effects of the program, we must also estimate the “counterfactual”—that is, what would have happened in the absence of the program.

In the SFER, only studies that used a comparison group with characteristics that are initially similar to those of the treatment group are considered credible impact studies. The outcomes of the comparison group represent the counterfactual. In the example above, the comparison group would be a group of similar fathers who did not participate in the program. These fathers could be followed over the same period of time and used to establish what the program participants’ outcomes would have been without the program. The differences at follow-up between this group (those who did not participate in the program) and the treatment group (those who did) likely reflect the effects of the program on employment, rather than the effects of other factors.
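As a rough illustration of the counterfactual logic above, the following Python sketch contrasts a simple pre/post change with a comparison-group estimate. The employment rates are invented purely for illustration and do not come from any reviewed study, and the function names are hypothetical.

```python
def pre_post_change(before: float, after: float) -> float:
    """Change in the treatment group's outcome over time (no counterfactual)."""
    return after - before


def comparison_group_estimate(treatment_followup: float,
                              comparison_followup: float) -> float:
    """Difference at follow-up between the treatment and comparison groups."""
    return treatment_followup - comparison_followup


# Hypothetical employment rates (percent), invented for illustration only.
treatment_before, treatment_after = 40.0, 60.0
comparison_after = 52.0  # similar fathers who did not participate

naive = pre_post_change(treatment_before, treatment_after)
estimate = comparison_group_estimate(treatment_after, comparison_after)

print(f"Pre/post change: {naive:.1f} percentage points")
print(f"Comparison-group estimate: {estimate:.1f} percentage points")
```

In this made-up case, the pre/post change is larger than the comparison-group estimate because part of the change would have happened without the program. The comparison-group estimate is only as credible as the similarity between the two groups at the start.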

Not all comparison groups provide equally plausible counterfactual comparisons, and this review does not designate all studies with a comparison group as credible impact studies. In some cases, studies use comparison groups that differ in important ways from the treatment groups. For example, if a comparison group includes fathers with a lower level of educational attainment than those in the treatment group, the fathers in the comparison group may have poorer employment prospects regardless of the program. In this case, the comparison group is not a good representation of the counterfactual because the treatment-group fathers and comparison-group fathers are different before the program begins.

A study design that randomly assigns participants to treatment or comparison groups is one of the best designs for establishing causality. In a randomized controlled trial, fathers are assigned by chance to one of the two groups. The key advantage of this design is that fathers in the treatment and comparison groups are similar, on average, in all initial characteristics, whether they are measured (such as education or employment history) or unmeasured (such as intrinsic motivation to get a job). If the treatment and comparison groups are very similar at the beginning of the study, the comparison group will be an excellent representation of the counterfactual.

To indicate each study’s quality for determining the effects of the program, we assigned a rating to every study that included participant outcomes. This rating reflects how much confidence readers can place in the research design’s ability to determine whether the program, rather than other factors, caused the reported outcomes. We took into account factors such as the use of a comparison group, use of random assignment, and similarities between the treatment and comparison groups before the start of the program.

There are three rating categories: high, moderate, and low. Only impact studies that used random assignment (randomized controlled trials) could receive a high rating. Studies with a nonrandomly assigned comparison group (quasi-experimental designs) that was equivalent at baseline could receive a moderate rating. [1] We assigned low ratings to studies that reported outcomes but did not use a comparison group (such as pre/post designs), as well as to studies that had methodological problems. Studies that did not include participant outcomes were unrated. See the table below for more details on the quality rating system.


[1] Regression discontinuity and single case designs also have strong internal (causal) validity, but we did not identify any relevant studies with these designs.




SUMMARY OF RATING CRITERIA


High Rating
Randomized controlled trials received a high rating if:
  • The sample was randomly assigned to at least two conditions (for example, treatment and comparison groups).
  • The sample meets the What Works Clearinghouse (WWC) [a] standards for low levels of overall and differential attrition.
  • The sample members were not reassigned after random assignment was conducted (for example, members assigned to the treatment group were not switched to the comparison group or vice versa).
  • There are no confounding factors. A confounding factor occurs when one part of the design lines up exactly with either the treatment or the comparison group. An example would be a study in which all fathers in the treatment group are from one county, and all fathers in the comparison group are from another county. In this case, we cannot distinguish between the effect of the program and the effect of county-related factors, such as access to other available services.
  • The analysis includes statistical adjustments for selected measures (baseline measures of the outcomes, race/ethnicity, and socioeconomic status) if the treatment and comparison groups are not equivalent on these measures at baseline.
Quasi-experimental designs received a high rating if:
  • Not applicable; these studies cannot receive a high rating because the sample was not randomly assigned.
Pre/post or other designs received a high rating if:
  • Not applicable; these studies cannot receive a high rating because there is no comparison group.
Moderate Rating
Randomized controlled trials received a moderate rating if:
  • The sample members were not reassigned after random assignment was conducted.
  • The sample meets the WWC standards for low levels of overall and differential attrition.
  • There are no confounding factors.
  • The study includes groups that were not equivalent on selected baseline measures (baseline measures of the outcomes, race/ethnicity, or socioeconomic status), but the analysis does not include statistical adjustments.
OR
  • The study has high rates of overall or differential attrition OR sample members were reassigned after random assignment was conducted.
  • There are no confounding factors.
  • There is baseline equivalence of the treatment and comparison groups on selected measures (baseline measures of the outcomes, race/ethnicity, and socioeconomic status).
  • The analysis includes statistical adjustments for the selected measures.
Quasi-experimental designs received a moderate rating if:
  • There are no confounding factors.
  • There is baseline equivalence of the treatment and comparison groups on selected measures (baseline outcomes, race/ethnicity, and socioeconomic status).
  • The analysis includes statistical adjustments for the selected measures.
Pre/post or other designs received a moderate rating if:
  • Not applicable; these studies cannot receive a moderate rating because there is no comparison group.
Low Rating
  • A study received a low rating if it includes participant outcomes but does not meet the criteria for a high or moderate rating.
Unrated
  • We did not rate studies that do not include participant outcomes.

[a] WWC is an initiative of the U.S. Department of Education’s Institute of Education Sciences, which reviews and evaluates education research. For more information, visit http://ies.ed.gov/ncee/wwc/.
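
The rating criteria summarized above can be read as a decision procedure. The Python sketch below is one possible encoding of that reading, not the reviewers' actual tool. The `Study` class, its field names, and the `rate` function are all hypothetical, and each field simply paraphrases one criterion from the table.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # All field names are hypothetical paraphrases of the table's criteria.
    design: str  # "rct", "qed", or "other" (for example, pre/post)
    has_participant_outcomes: bool = True
    low_attrition: bool = True  # meets WWC overall and differential attrition standards
    reassigned: bool = False  # members switched groups after random assignment
    has_confound: bool = False
    baseline_equivalent: bool = True  # on baseline outcomes, race/ethnicity, SES
    adjusts_for_selected_measures: bool = True


def rate(study: Study) -> str:
    """Return "high", "moderate", "low", or "unrated" for a study."""
    if not study.has_participant_outcomes:
        return "unrated"
    # Every high and moderate path in the table requires no confounding factors.
    if study.has_confound:
        return "low"

    equivalent_and_adjusted = (
        study.baseline_equivalent and study.adjusts_for_selected_measures
    )

    if study.design == "rct":
        intact = study.low_attrition and not study.reassigned
        if intact:
            # High: adjustments are needed only if groups differ at baseline.
            if study.baseline_equivalent or study.adjusts_for_selected_measures:
                return "high"
            # Moderate: groups differ at baseline and no adjustment was made.
            return "moderate"
        # High attrition or reassignment: held to the quasi-experimental bar.
        return "moderate" if equivalent_and_adjusted else "low"

    if study.design == "qed":
        return "moderate" if equivalent_and_adjusted else "low"

    # Pre/post and other designs without a comparison group.
    return "low"


print(rate(Study(design="rct")))                       # high
print(rate(Study(design="rct", low_attrition=False)))  # moderate
print(rate(Study(design="qed")))                       # moderate
print(rate(Study(design="other")))                     # low
```

Under this reading, a randomized controlled trial with high attrition or reassignment is held to the same requirements as a quasi-experimental design, which is why both branches share the `equivalent_and_adjusted` check.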