Counting occurrences of strings within strings


Somebody asked how to count the number of occurrences of a string within a string. For example, given the following data, I want to generate new variables countSS, countSM, and countSG that contain the number of occurrences of “SS”, “SM”, and “SG” in the variable awards.

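------------------------------------------------------------------------------------
clear
input id str40 awards
1    "SS; SS; SM; SG"
2    "SM; SG"
3    "SG; SG; SG; SS"
4    "SS; SS; SG; SG; SS; SM; SG"
end
list
------------------------------------------------------------------------------------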

Here is one solution using the macro extended function -subinstr- (-help extended_fcn-).

------------------------------------------------------------------------------------
local tocount SS SM SG
foreach t of local tocount {
    * one counter variable per substring: countSS, countSM, countSG
    gen count`t' = 0
    local N = _N
    forvalues i = 1/`N' {
        * copy this observation's string into a local macro
        local a = awards[`i']
        * replace `t' with itself; count() stores the number of replacements in c2
        local c : subinstr local a "`t'" "`t'", all count(local c2)
        replace count`t' = `c2' in `i'
    }
}
------------------------------------------------------------------------------------
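
To see what the count() option of the extended function hands back, here is a minimal sketch on a single hard-coded string. The macro names a, c, and n are only illustrative; the string is the first observation from the example data.

```stata
* put one example string into a local macro
local a "SS; SS; SM; SG"
* replace "SS" with itself, so the text is unchanged,
* and ask -subinstr- to store the number of replacements in local n
local c : subinstr local a "SS" "SS", all count(local n)
display `n'
```

This displays 2, since "SS" appears twice in the string. The loop over observations is needed because the extended function works on a macro, not on a variable.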




*Thanks to Jacob Reynolds (jlreynol@nps.edu) for the question. For the best advice on Stata, though, Statalist is the place to ask :). See “Stuck? Hello Statalist”.

6 Responses

  1. The number of occurrences can be got from a comparison of lengths before and after blanking out.

    gen noccur_SS = (length(awards) - length(subinstr(awards, "SS", "", .))) / length("SS")

    In this case we know that the length of “SS” is 2. I wrote it out like this to lead up to the more general rule (mixing now Stata and pseudocode)

    (length(original) - length(original_with_substr_blanked)) / length(substr)

    Thus you don’t need a loop over observations. I think you do need to do this separately for each substring.

  2. There are also two [sic] -egen- functions for this within -egenmore- from SSC. Neither of them uses the trick above. I’d prefer to believe that the reason for that was that -subinstr()- wasn’t available when the functions were written, both about ten years ago, but I can’t rule out without checking that the authors (one of them me) just overlooked this simpler way to do it.

  3. Nick & Mitch,
    That last comment about comparing lengths was the best ticket. I was able to count the awards like I needed by generating as many counting variables as req’d (g pa_XX); total of 14.

    I wish I could have gotten the more “eloquent” code above to work, but the comparison line is more my speed in thesis work…maybe when I come back for a PhD :)

    Thank you for your time and attention to this guys!

    Jake

    • I always like a simpler solution. Not knowing any better, I had come up with a complex one. ‘Eloquence’, I think, is not about complexity but simplicity. Nick’s solution is an example. :)
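
For reference, the length-comparison rule from the first response can be applied to all three substrings with a short loop. This is only a sketch on the example data from the post; the variable names noccur_SS, noccur_SM, and noccur_SG simply extend the naming used in that response.

```stata
clear
input id str40 awards
1    "SS; SS; SM; SG"
2    "SM; SG"
3    "SG; SG; SG; SS"
4    "SS; SS; SG; SG; SS; SM; SG"
end

* one pass per substring, no loop over observations:
* (length before blanking - length after blanking) / length of the substring
foreach t in SS SM SG {
    gen noccur_`t' = (length(awards) - length(subinstr(awards, "`t'", "", .))) / length("`t'")
}
list
```

For the first observation this gives 2, 1, and 1 for SS, SM, and SG. Note that the rule counts plain substring matches, so it assumes no code can appear inside another code or straddle a separator.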
