• "Everything should be made as simple as possible, but not simpler." - Anonymous (although often attributed to Albert Einstein)
  • Categories

  • Recent Comments

  • RSS N.E.C. The Stata Blog

  • RSS Statalist: the Stata forum

  • RSS Stackoverflow [Stata]

  • Google Analytics Stats

  • Enter your email address to subscribe.

Haiku: cls


clean results window
in Stata thirteen, typing
cls clears screen

 

Where to buy Stata in the Philippines


In the Philippines, there has been an emerging demand in the public sector for data-driven analysis of various govenrment programs (yes, emerging.. don’t ask why.. but the leadership has a lot to do with this.. bless them). I am lucky enough to be invited to introduce Stata to some government agencies (and some private institutions) and how it can be used with their data. Often the the question where to get a copy of Stata pops up. So, for reference: here is the sole distributor of Stata in the country:

Columbia CP Ltd
Unit 515, Building 1-A
Palmdale Heights Pinagbuhatan
Pasig City, METRO MANILA
PHILIPPINES

Tel: 0-917 649-6421 (if calling from the Philippines)
Tel: 63-2-5664488 (if calling from outside the Philippines)

Email: sales@columbiacp.com
URL: http://www.columbiacp.com

I told them they could ask the distributor to come and discuss the best licensing option base on their needs. Although I really don’t know if the distributor really  does this. :p But that would be bad service if they don’t.

 

break me


Sometimes we want to break a continuous variable into a smaller set of categories—into evenly spaced or equally sized groups, or groups based on limits we specify, or groups based on another variable or a set of variables.

Let us take for example the variable price of cars in auto.dta.
sysuse auto.dta, clear    // open a Stata built-in data
summ price

Suppose we want to break price into 5 categories according to different ways of grouping: 1 evenly spaced, another equally sized, and so on. We create the following variables: price_cat1, price_cat2, ..., price_cat4.
  1. price_cat1 groups the prices into evenly spaced intervals, say, by 3300.
  2. price_cat2 groups the prices into 5 equally sized groups. Equally is used loosely here. In our example data there are 74 cars, 1 group will definitely have 1 less car in it.
  3. price_cat3 groups the prices into categories bounded to the left by 0, 5000, 8000, 10000. Why those boundaries? Because I can… :p
  4. price_cat4 groups the prices based on the variable rep78.

This is a walk in the park if you have already mastered the generate and replace commands and the qualifier if. But, even better (a walk in the park with ice cream and a puppy), are the shortcuts. For each case, we first use if followed by shorter ways to get the same result.
*** Case 1: groups the prices into evenly spaced intervals
* using the if qualifier
gen price_cat1 = .
replace price_cat1 = 1 if price>=0 & price<3300
replace price_cat1 = 2 if price>=3300 & price<6600
replace price_cat1 = 3 if price>=6600 & price<9900
replace price_cat1 = 4 if price>=9900 & price<13200
replace price_cat1 = 5 if price>=13200 & price<16500

. tab price_cat1, m

 price_cat1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |         55       74.32       77.03
          3 |          7        9.46       86.49
          4 |          6        8.11       94.59
          5 |          4        5.41      100.00
------------+-----------------------------------
      Total |         74      100.00

* using the egen function cut()
drop price_cat1
egen price_cat1 = cut(price), at(0(3300)16500)

The generated codes for the categories correspond to the left ends of the intervals. In the at() option, you may also list of breakpoints separated by spaces or commas instead of using the interval syntax for numlist, e.g. at(0 3300 6600 9900 13200 16600). Adding the option icodes will generate codes 0,1,..,5 instead of the left ends.
. tab price_cat1, m

 price_cat1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          2        2.70        2.70
       3300 |         55       74.32       77.03
       6600 |          7        9.46       86.49
       9900 |          6        8.11       94.59
      13200 |          4        5.41      100.00
------------+-----------------------------------
      Total |         74      100.00

That’s it! The 6 lines we wrote earlier can be written concisely in 1 line. Be careful however with the breakpoints. In our example above, the intervals are inclusive of the left end, i.e., from a up to (but excluding b) or, in set notation, [a,b). In some cases, you may want that the interval is inclusive of the right end, (a,b]. In the last case and where there are observations exactly at the breakpoints, egen cut() could be erroneous. We can use instead the function recode() or autocode() for the case (a,b].
* using recode() or autocode() function
gen price_cat1b = recode(price, 3300, 6600, 9900, 13200, 16500)
gen price_cat1c = autocode(price, 5, 0, 16500)

. tab price_cat1b, m

price_cat1b |      Freq.     Percent        Cum.
------------+-----------------------------------
       3300 |          2        2.70        2.70
       6600 |         55       74.32       77.03
       9900 |          7        9.46       86.49
      13200 |          6        8.11       94.59
      16500 |          4        5.41      100.00
------------+-----------------------------------
      Total |         74      100.00

. tab price_cat1c, m

price_cat1c |      Freq.     Percent        Cum.
------------+-----------------------------------
       3300 |          2        2.70        2.70
       6600 |         55       74.32       77.03
       9900 |          7        9.46       86.49
      13200 |          6        8.11       94.59
      16500 |          4        5.41      100.00
------------+-----------------------------------
      Total |         74      100.00

Both commands generate the same results (only because we want them to). In recode(), Stata compares the first argument (price in our case) with the second argument (3300, the first right end). If the price is less than or equal to 3300 it will return 3300. If not, then Stata compares the price with the third argument (the next boundary), and so on. Using the longhand with the if qualifier, this is equivalent to:
gen price_cat1b = 3300 if price<=3300
replace price_cat1b = 6600 if price>3300 & price<=6600
replace price_cat1b = 9900 if price>6600 & price<=9900
.
.
.

In autocode(), Stata breaks price (first argument) into 5 evenly spaced groups (second argument) from 0 to 16500 (last 2 arguments). If the price is less than or equal to the first generated breakpoint (3300), it will fall in this category, and so on. Thus, by using autocode(), we have instructed Stata to automatically generate the breakpoints that will yield evenly spaced intervals (in this case the same as our manually generated breakpoints) and then automatically assign the prices into the categories.

Let us save Case 2: groups the prices into “equally” sized groups for another post…

 
%d bloggers like this: