Somebody asked how to count the number of occurrences of a string within a string. For example, given the following data, I want to generate new variables *countSS*, *countSM*, and *countSG* that contain the number of occurrences of “SS”, “SM”, and “SG”, respectively, in the variable *awards*.

*————————————————————————————*

**clear**

**input** *id* str40 *awards*

1 "SS; SS; SM; SG"

2 "SM; SG"

3 "SG; SG; SG; SS"

4 "SS; SS; SG; SG; SS; SM; SG"

end

**list**

*————————————————————————————*

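Here is one solution using the extended macro function -subinstr- (see -help extended_fcn-). It loops over the observations, copies each value of *awards* into a local macro, and stores the number of matches reported by the count() option.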

*————————————————————————————*

**local** tocount *SS SM SG*

**foreach** t of local tocount {

**gen** count`t' = 0

**local** N = _N

**forvalues** i = 1/`N' {

**local** a = awards[`i']

**local** c : subinstr local a "`t'" "`t'", all count(local c2)

**replace** count`t' = `c2' in `i'

}

}

*————————————————————————————*

Thanks to Jacob Reynolds (jlreynol@nps.edu) for the question. For the best advice on Stata, though, Statalist is the place to ask :). See “Stuck? Hello Statalist”.


Nick Cox, on 21 January 2011 at 1:59 AM said: The number of occurrences can be got from a comparison of lengths before and after blanking out.

gen noccur_SS = (length(awards) - length(subinstr(awards, "SS", "", .))) / length("SS")

In this case we know that the length of “SS” is 2. I wrote it out like this to lead up to the more general rule (mixing now Stata and pseudocode):

(length(original) - length(original_with_substr_blanked)) / length(substr)

Thus you don’t need a loop over observations. I think you do need to do this separately for each substring.
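The comparison-of-lengths rule can be wrapped in a -foreach- over the substrings, so each substring still gets its own pass but no observation is looped over. What follows is only a sketch under stated assumptions: it reuses the example data from the post, and the variable names count`t' are chosen here to match the names asked for in the question, not taken from the comment.

```stata
* Sketch: count occurrences of each substring without looping over observations.
* Assumption: the data and variable names follow the example in the post.
clear
input id str40 awards
1 "SS; SS; SM; SG"
2 "SM; SG"
3 "SG; SG; SG; SS"
4 "SS; SS; SG; SG; SS; SM; SG"
end

foreach t in SS SM SG {
    * (length before - length after blanking out) / length of the substring
    gen count`t' = (length(awards) - length(subinstr(awards, "`t'", "", .))) / length("`t'")
}

list
```

For the first observation, blanking out “SS” shortens the string from 14 to 10 characters, so the rule gives (14 - 10) / 2 = 2. Note that -subinstr()- blanks out non-overlapping occurrences, so a value such as “SSS” would count as one “SS”; that does not arise with the semicolon-separated codes used here.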

Mitch, on 21 January 2011 at 9:37 AM said: The simplicity of the solution is amazing! Thanks, Nick.

Nick Cox, on 21 January 2011 at 2:56 AM said: There are also two [sic] -egen- functions for this within -egenmore- from SSC. Neither of them uses the trick above. I’d prefer to believe that the reason for that was that -subinstr()- wasn’t available when the functions were written, both about ten years ago, but I can’t rule out without checking that the authors (one of them me) just overlooked this simpler way to do it.

Nick Cox, on 25 January 2011 at 1:43 AM said: -subinstr()- was in fact added in Stata 7.

Jacob Reynolds, on 22 January 2011 at 1:45 AM said: Nick & Mitch,

That last comment about comparing lengths was the best ticket. I was able to count the awards like I needed by generating as many counting variables as req’d (g pa_XX); total of 14.

I wish I could have gotten the more “eloquent” code above to work, but the comparison line is more my speed in thesis work…maybe when I come back for a PhD

Thank you for your time and attention to this, guys!

Jake

Mitch, on 22 January 2011 at 10:30 AM said: I always like a simpler solution. Not knowing any better, I had come up with a complex one. ‘Eloquence’, I think, is not about complexity but simplicity. Nick’s solution is an example.