Quick note: in this document I use all caps to denote things that you’d have to fit in for your own applications. For example, VARIABLE1 is a generic variable name, and CONDITION is a generic condition you might want to include (like age < 100
). Also, Stata accepts return scalars (like r(N)
with or without macro quotes (` and ‘). For formatting reasons that are hard to explain, I don’t use the macro quotes here; you are welcome to use them in your code code if you want!
Testing Idioms
Key to testing in Stata is the assert
command, the syntax for which is quite sample: type assert
then a statement. If the statement is true, Stata will march forward; if the statement is false, Stata will stop and throw an error, drawing the programmers attention to the problem. For example:
* Make sure VARIABLE is never missing: assert VARIABLE != . * Make sure VARIABLE1 is not missing if VARIABLE2 is not missing assert VARIABLE1 != . if VARIABLE2 != .
There are, however, situations where the syntax of assert
is limited. assert
always tries to test things row by row, just like most Stata commands. But if you want to test a more general property of your dataset or don’t want an absolutist test (a certain number of known exceptions are permissible), you need to combine assert
with a helper command. For example:
* Make sure no more than 10 missing observations of
* VARIABLE1. Note r(N) is a local macro where "count"
* puts the number it displays.
count if VARIABLE1 == .
assert r(N) <= 10
* Make sure your data has no duplicate values of
* VARIABLE2. Here we use the two return values
* from the "duplicates report" command.
duplicates report VARIABLE2
assert r(N) ==
r(unique_value)
If you aren’t sure what values are made available through a command or how those values are called, type return list
or ereturn list
after you’re run the relevant command (like summarize,
duplicates report
, or regress).
When To Write Tests
The best way to get into writing tests is to think about how you check your data interactively to make stuff work. After a merge
or egen
, most people pause to browse the data and/or watch the code step by step, or do a set of quick tabs or plots. But these are not systematic, and you generally only do them once (when you’re first writing the code).
A great way to write tests is to think about what you’re looking for when you do these interactive tests and convert the logic of those interactive interrogations into systematic assert statements. That way they’ll be baked into your code, and will be executed every time your code runs!1
- After
merge:
No where are problems with data made more clear then in a merge. ALWAYS add tests after a merge! More on that below.
- After complicated commands (like
bysort
andegen
): If you have to think more than a little about how to get Stata to do something, there’s a chance you missed something. Add a test or two to make sure you did it right! Personally, for example, I almost never usebysort
oregen
without adding tests — it’s just not a natural way to think about things, so I know I may have screwed up (and often have!). - Before
drop if
statements: Dropping observations masks problems. Before you drop variables, add a test to, say, count the number of observations you expect to drop (count if CONDITION
, then assertr(N)
< NUMBER_YOU_EXPECT_TO_DROP, then
drop if CONDITION`) - When using conditions (
if
statements) with more than one or two simple clauses
Test Examples
General Tests
Test number of observations is right:
count assert r(N) == VALUE
Check var that should have no missing has no missing.
assert MY_VAR != .
Check my unique identifier is actually unique.
duplicates report MY_IDENTIFIER assert r(N) == r(unique_value)
In a panel, make sure each ID group has exactly one observation for each of five years
bysort ID: egen test_obs = count(YEAR) bysort ID: egen test_min = min(YEAR) bysort ID: egen test_max = max(YEAR) assert test_obs == 5 & test_min == 1990 & test_max == 1995 drop test_*
Make sure values of gender have a reasonable value. Note this is a “reasonableness” test, not an absolute test. It’s possible this would fail and the data is ok, but this way if there’s a problem your attention will be flagged so you can check.
sum GENDER assert r(mean) >= 0.4 & r(mean) < 0.6
Check that a new variable (a Herfindahl) has reasonable values. Do so by checking has right range of values, and that places with more groups have lower values (because we’ve all gotten these backwards in our lives at least once!). For context, assume data is long (I’m calculating Herf within each group identified with ID, and within each ID there are grows for each GROUP).
Like the test above, this is not an absolutist test — it COULD be the correlation is negative if group sizes are really strange. But this will PROBABLY pass and if it fails, you definitely want to check to make sure you know why it’s failing.
assert MY_HERF >= 0 & MY_HERF <= 1 bysort ID: egen NUM_GROUPS = count(GROUP) corr NUM_GROUPS MY_HERF assert r(rho) < 0
One annoyance of Stata is that replace
doesn’t give you access to the number of changes that are made, so you can’t easily test for whether your replace
command made changes. But there are ways to check replace
when important — here’s an admittedly verbose way of making sure your replace
command is doing what you expect. I don’t use it for singular replace
commands, but it is useful when I’m running a loop with a replace inside and I want to make sure it’s doing the right thing at each stage of the loop. Here’s an example where I want to make sure at each step there is exactly 1 and only 1 change made:
forvalues x in VAR1 VAR2 VAR3 VAR4 { count if `x' == NEW_VALUE & CONDITION local pre_replace_count = r(N) replace `x' = NEW_VALUE if CONDITION count if `x' == NEW_VALUE & CONDITION assert r(N) - `pre_replace_count' == 1 }
Post-merge checks
After a merge, make sure that there are no observations in second dataset not in first.
assert _m != 2
Imagine we know 5 observations in original data won’t merge and that’s ok. Add check that ONLY 5 fail.
count if _m == 1 assert r(N) < 6 drop _m
After Regressions
Make sure num obs is correct and obs you expect in regression aren’t dropping out due to missing values or something.
reg Y1 X1 X2 assert e(N) == 5
- Like many things in my life, credit for this way of thinking goes to Adriane Fresh