Ways to count the number of unique values in a variable

There are at least 3 convenient ways to count the number of distinct values contained in a variable: -tab-, -inspect-, and -codebook-.

tab varname, nofreq
display r(r)

The option nofreq supresses the reporting of the frequency table. Besides displaying output in the results window, Stata stores the results of some commands so that you can use them in subsequent commands. Results of r-class commands, such as -tab-, are stored in r(). In the expample above, display r(r) returns the number of rows in the table, that is, the number of unique observations for variable varname. The problem with using -tab- to count the unique number of values is its row limits: 12,000 rows (Stata/MP and Stata/SE), 3,000 rows (Stata/IC), or 500 rows (Small Stata).

inspect varlist
display r(N_unique)

Besides reporting the number of unique values, -inspect- also reports: the number of negative, zero,  positive, and missing values. It also draws a histogram. There is no need for r(N_unique) if the number of unique values is less than or equal to 99 as -inspect- reports the actual number. But if the number of unique values is more than 99, it will return “More than 99 unique values”. In this case, you need to type the second line.

codebook varlist

-codebook- also provide other summaries besides unique values: type of variable (numeric, etc), the range of values, mean, standard deviation, missing values, and some percentiles.

Note: If varlist is not specified in -inspect- and -codebook-, the commands will return the reports for all variables.

Loading and storing data to and from Stata and Mata

mkmat varlist, matrix(M)    /*creates Stata matrix M*/
mata{
M=st_matrix(“M”)        /*loads matrix M to Mata*/

st_matrix(“M”, M)
}
svmat M    /*loads matrix M back to Stata*/

OR

mata{
M=st_data(.,.)   /*reads current Stata data as matrix M in Mata*/

st_store(.,1..cols(M), M)    /*loads matrix M back to Stata*/
}

Friendly -fren-

Today I downloaded resources (do-files and datasets) used in the the book “Microeconometrics Using Stata” (Cameron and Trivedi, 2009) from Stata Press. When I unzipped the file, I found that all the file names in the folder have the prefix “mus”. I want to get rid of these prefixes. Renaming the files one by one seems crude (something that a muggle would do in Harry Potter’s world); somebody must have written a Stata program for this! So I googled using the keywords “stata rename filenames”, and there it was: “FREN: Stata module to batch rename files (Windows only)”.

-fren- (Liu Wei, 2010) renames file names (with specific file extensions) in the current directory. Its syntax is straight forward:

fren file_extension , from(str1) [ to(str2) ]

For example, to get rid of the prefix “mus” from the names of do-files and datasets in the folder, I first installed the program and changed the directory:

ssc install fren
cd path
fren dta , from(mus)
fren do , from(mus)

And, all those “mus” were gone. Happiness…

To check other commands Liu Wei has written, go to Liu Wei at IDEAS.

The all new syntax for -merge-

With Stata 11 comes an improved version of -merge-. -merge-  combines the observations from another Stata data file (the using dataset) to corresponding observation in the current data file (the master dataset). The observations are matched based on specified variable(s).

What’s different with the new syntax what are its value-added? First, the new syntax helps minimize merging errors. The Stata 11 syntax for -merge- is:

merge type varlist using, options

where “type” can be: 1:1, m:1, 1:m, or m:m

1:1 — Stata expects 1 observation in the master file matches exactly 1 observation in the using file

m:1 — many observations in the master file matches exactly 1 observation in the using file

1:m — 1 observation in the master file is matches more than 1 observations in the using file

m:m  — many observations in the master file matches many observations in the using file

By specifying the type of data, Stata will know what to expect and will return an error if either the master or the using database is not consistent with its expectations, thereby minimizing the chance of making errors.

Second, the new command automatically sorts both the master and using datasets according the the variables in varlist—this is my favorite. Gone are the days when you need to open the using file just to sort and save it.

Third, the output includes a table of _merge. In the pre-Stata 11 -merge-, we almost always type “tab _merge” every after merging datasets to make sure that we got it right. Stata 11 saves you this step by automatically reporting the match summaries unless you opt not to by using the option “noreport”.

While the old -merge- syntax will still work in Stata 11, it is worth learning its all new syntax.

Variable names and transformers

What are acceptable variable names? How do you use the commands -rename-, -renpfix-, and -renvars- to rename variable names?

A Stata variable name can contain up to 32 characters in any version of Stata (Small, Stata/IC, Stata/SE, or Stata/MP). Not all characters in the keyboard, however, are allowed in naming variables. A variable name may contain only the digits 0 to 9 and upper or lower case English alphabets (A to Z); and the first character cannot be a number. For example,

population
_Income
_1997
income0001
INC

are allowed, but not the following:

pop#
2gdppc
2007
~1997
POP/2001

Renaming variables are made easy by the commands -rename-, -renpfix-, and -renvars-. -renvars- (Weesie and Cox, 2005) is not available in the original package of Stata, but you may download and install it by typing:

net install dm88_1.pkg (see note below)

Do you need all these 3 renaming commands? Not really, but you will see that in some cases one is more convenient to use than the others.

If you only need to rename one variable, use -rename-. The syntax for -rename- is:

rename old_variable_name new_variable_name

For example, to rename the variable _Income to income, type:

rename _Income income

If, instead, you just want to replace the prefix of variables, use -renpfix-. The syntax for -renpfix- is:

renpfix old_prefix [new_prefix]

[Note: If new_prefix is not declared, -renpfix- returns the new variable names without the prefix provided they are allowed as variable names.]

For example, to change the prefix exportshare of the variables

exportshare1970
exportshare1980
exportshare1990
exportshare2000

to s_, type:

renpfix exportshare s_

This transforms the variables above to:

s_1970
s_1980
s_1990
s_2000

To drop the prefix export in the variable names, type:

renpfix export

This transforms the variables to:

share1970
share1980
share1990
share2000

But, you cannot do this:

renpfix exportshare

Why? Because, the first character of variable names cannot be numbers.

-renvars-, if installed, is the most flexible. It allows you to rename more than one variable at once, it can change the case of the variable names, and more. The syntax for -renvars- are:

renvars varlist \ new_varlist
renvars varlist, options

Examples:

renvars population income \ pop inc
/* renames population to pop and income to inc */

renvars population income, upper
/* renames population to POPULATION and income to INCOME */

renvars population income, trim(3)
/*renames population to pop and income to inc */

renvars population income, prefix(x_)
/*renames population to x_population and income to x_income */

For more -renvars- options, type:

help renvars

Note: Sometimes “net install dm88_1″ returns the error:

file http://www.stata.com/dm88_1.pkg not found
server says file temporarily redirected to http://www.stata.com/error/404.html
could not load dm88_1.pkg from http://www.stata.com/
r(601);

You may also install -renvars- by typing “search renvars”, then click on “dm88_1″ and then click “(click here to install)”.

Data dictionary and -infile- Which data input command (Part 3)

A piece of FIES

As we mentioned in Part 2 of this series, -infile- is used to load fixed-format data into Stata. But, what is a fixed-format (also fixed-width) data file? A fixed-format data is a text file wherein the format is specified by the length of each variable. For example, the following data is a 29-column data, where columns 1-4 is allotted for the 4-digit year; columns 5-19 is for country name; and columns 20-29 is for the data on population (I used the lines to represent the spaces to fill in the space allotted for each variable):

1960Philippines____27053834__
1960Thailand______27652013__
2007Philippines____87892094__
2007Thailand______63832135__

A data dictionary, a text file that instructs Stata how to read and store the contents of the data file, is required to load a fixed-format data. The dictionary and data may be in a single file or in two separate files. The dictionary contains the following information: storage type, data format, variable name, and and the starting column (or length of each variable). The dictionary may also contain the labels of each variable. For example, we can write a dictionary for our population data above as:

/*beginning of dictionary*/
dictionary {
_column(1) int year %4f “Year”
_column(5) str15  country  %15s “Country”
_column(20) float population %10f “Population”
}
/*end of dictionary; begginning of data*/
1960Philippines____27053834__
1960Thailand______27652013__
2007Philippines____87892094__
2007Thailand______63832135__
/*end of data (Note: the lines are only used to represent the blank spaces to fill in the allotted columns for each variable)*/

We can save this as population.dct file, and load into Stata by typing:

infile using population.dct, clear

The Family Income and Expenditure Survey (FIES) of the Philippines is another example of a fixed-width data, with a dictionary in a separate file. The image above—a string of numbers broken by blank spaces—is taken from a FIES data file. You can load this file into Stata with the help of the data dictionary, fies.dct, that accompanies the data, fies.dat:

infile fies.dct, using(fies.dat) clear

-infile- Which data input command (Part 2)

In Which data input command: -infile- or -insheet-?, we introduced the input command -insheet-. Here, we will introduce the more flexible data input command -infile-, which can read both delimited and free format text files. For example, the comma-delimited file in our previous post, population.csv, with the following data:

1960,Philippines,27053834
1960,Thailand,27652013
2007,Philippines,87892094
2007,Thailand,63832135

can also be loaded into Stata, by typing:

infile year str15 country population using population.csv, clear

Note that the first line of the data above are not variable names. -infile-, unlike -insheet-, does not recognize variable names in the data. The capability to read variable names is one advantage -insheet- has over -infile-. Also, the default of -infile- is to read the data as numbers. If, for example, we fail to declare the second variable as string:

infile year country population using population.csv, clear

Stata will still read the data but will input the second variable, country, as missing values. By inserting str15 before the string variable country, we are telling Stata to read and store country as a string variable with at most 15 characters.

The major advantage of -infile- is that it can read text files that are in free format. For example, instead of a comma-delimited file as above, we have a data that is separated only by white spaces:

1960 “Philippines” 27053834
1960 “Thailand” 27652013
2007 “Philippines” 87892094
2007 “Thailand” 63832135

If we use -insheet- to load this file, Stata will read it as a data with only 1 variable, v1, with the following entries:

“1960 Philippines 27053834″
“1960 Thailand 27652013″
“2007 Philippines 87892094″
“2007 Thailand 63832135″

To correctly load the data, use the -infile- command:

infile year str15 population using population.txt

Another advantage of -infile- is when you have a fixed-format text file with a dictionary. To be continued…

Which data input command: -infile- or -insheet-?

Stata has more than one data input command, depending on the format of the data you wish to load. We have already discussed that -use- loads Stata-format datasets (-forvalues- and the other -use- syntax) and the user-written -usespss- loads SPSS-format datasets (Reading SPSS data file into Stata). The other data input commands are: -insheet- and -infile-. Which one will you use?

Often, you need to load a data that is not in Stata format. Examples are data downloaded from web-based databases, e.g., World Bank’s World Development Indicators, UN’s Commodity Trade Statistics, IMF’s International Financial Statistics, which can be saved as tab or character delimited files. Some datasets are in undelimited text file format (or free format) with or without data dictionaries. The public-use file of the Family Income and Expenditure Survey (FIES) of the Philippines is a text file format with a dictionary.

The -insheet- command reads text files in which the values are separated by a tab or a character (usually a comma) and there is only one observation per line. These files are created using a spreadsheet or a database program, such as, Microsoft Excel. The general syntax of -insheet- is:

insheet [varlist] using filename [, options]

For example, the file population.csv is a comma-delimited file with the following data:

year,country,population
1960,Philippines,27053834
1960,Thailand,27652013
2007,Philippines,87892094
2007,Thailand,63832135

To read this data using -insheet-, type:

insheet using population.csv, comma names

The above command loads the data population.csv into Stata. By using the option comma and names, we declare that the data in the file are separated by commas and that the first line of the file contains variable names, respectively. In most cases, however, this is not necessary as Stata is clever enough to recognize comma- and tab- delimited text files or whether the first line of the data are variable names. For other options available for -insheet-, type: help insheet.

If the first line of population.csv are not variable names, the command:

insheet using population.csv

reads the file and names the variables in the file as v1, v2, and v3.

If you don’t want these v’s, you may name your variables by using the optional varlist:

insheet year country population using population.csv

If Stata detects more than 3 variables in the dataset, it will return an error indicating that too few variables are specified. What happens if you use -insheet- for a file that is not delimited? Stata will read the file as a data that contains only variable.

Although a bit crude, you may also copy-paste your data from a spreadsheet into Stata data editor (to bring out the Stata editor window, type: “edit”).

-infile-, on the other hand, reads files that are not delimited (or in free format)… to be continued.

Reading SPSS data file into Stata

Is it possible to open an SPSS file (*.sav) directly from Stata? Yes. Thanks for the program -usespss- written by Sergiy Radyakin of the World Bank, which loads a *.sav dataset into Stata preserving variable names and labels. Since -usespss- is user-written, you need to install it first from the Statistical Software Components (SSC) archive, which is Stata’s download site for user-written programs. Installing files from SSC is made easy by using the -ssc install- command; but if you want to read the description of the program first before installing, you may use the -ssc describe- command:

ssc describe usespss

/* Provides a description of the program -usespss */
ssc install usespss
/* Installs -usespss- into Stata */

You can now use -usespss- after installation. The basic syntax for -usespss- (for the full syntax, type: “help usespss”) is:

usespss using filename.sav [, clear saving(filename) iff(condition) inn(condition)]

For example:

usespss using filename.sav
/* loads filename.sav into Stata memory */

usespss using filename.sav, clear
/* clears the existing data in memory and loads filename.sav */

usespss using filename.sav, saving(filename.dta)
/* loads filename.sav and saves it as a Stata file filename.dta */

usespss using filename.sav, clear inn(1/50)
/* loads the first 50 observations in filename.sav */

usespss using filename.sav, clear iff(var1<1000)
/* loads the observations in filename.sav that satisfy the condition var1<1000 */

[Note: The -iff- and -inn- options work in same manner as the Stata qualifiers -if- and -in-, respectively. We have showed some examples how the qualifiers -if- and -in- are used in "Putting observations in order" and "_n _N and bysort."]

Another way to convert an SPSS to Stata data is to use the software called Stat/Transfer. Or, if you have SPSS, you may save the data as a Stata-formatted file.

Putting observations in order

Is it necessary to put observations in a certain order? In a number of cases, yes. The most obvious case is when you are using the qualifier -in- to specify a subset in your data. For example,

drop in 1/100               /* Drops the observations from line 1 to line 100 */
keep in 30/l                   /* Keeps the observations from line 30 to the last line, denoted by small letter l */

If the observations were in arbitrary order, then you wouldn’t know which ones were dropped or kept, would you? This is when -sort- and -gsort- come in handy. These two put the observations in a certain order. The -sort- command put the observations in ascending order based on a specific variable or a set of variables. The basic syntax for -sort- is:

sort varlist

If varlist is only one variable, then Stata will sort the observations in ascending order based on that variable. If there are 2 variables, var1 and var2, after sort, Stata will sort the observations according to var1 first. Then, for observations with common var1, Stata will sort them according to var2. If there are more than 2 variables, then the observations will be sorted by the first variable first, then the second variable second, and so on. -gsort-, on the other hand, can sort the observations in either ascending or descending order. The basic syntax for -gsort- is:

gsort [+ or -] varname [+ or -] varname [+ or -] varname

A plus sign (+) before the varname instructs Stata to order the observations in ascending order, while a minus sign (-) implies descending order of observations. For example, to sort the countries by their geographical region (regn) in alphabetical order and by GDP per capita (gdppc), from highest to lowest:

gsort + regngdppc

The -by varlist:- prefix also requires the observations to be sorted according to the varlist. But, as we have discussed in “_n, its big brother _N, and Super -bysort-,” this can be conveniently written as:

bysort varlist:

or

by varlist, sort:

%d bloggers like this: