In -destring- complication, Anup asked how to split a string variable. In his case, he has a variable of the form 28-18-0018-02183100-02-O-B, where 28 represents the state code, 18 the district code, 0018 the subdistrict code, and 02183100 the village code. His problem is how to extract the state, district, etc. codes separately from the variable and label each code accordingly.
In response, Freddy provided a solution using the substr function, assuming that each part of the code always has the same number of characters, i.e., a district code is always 2 characters and a subdistrict code is always 4 characters.
gen state = substr(yourvariablename, 1, 2)
gen district = substr(yourvariablename, 4, 2)
gen subdistrict = substr(yourvariablename, 7, 4)
A similar solution was suggested for numeric variables in Splitting numbers, before nsplit was discussed.
But what about when the codes are not all the same length but are separated by a character, such as a hyphen, as in 8-18-18-02183100-02-O-B and 8-18-018-02183100-02-O-B? In this case, split would be helpful.
split literally splits string variables into parts, using a specified character or string as the separator. The basic syntax for split is (see help split):
split stringvariable, parse(stringseparator)
To split a code of the form 8-18-0018-02183100-02-O-B into 7 parts using the hyphen as a parser:
split yourvariablename, parse(-)
This will create 7 new variables: yourvariablename1, yourvariablename2, and so on. You may specify a new prefix using the gen() option. You may also want to rename the variables afterward.
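As a minimal sketch of the steps above, here is split applied to the two example codes, with a new prefix and a rename afterward. The variable name code, the prefix part, and the names given to the pieces are assumptions made for illustration only.

```stata
* Toy data: the two example codes from the text (variable name is illustrative)
clear
input str30 code
"8-18-18-02183100-02-O-B"
"8-18-018-02183100-02-O-B"
end

* Split on the hyphen; gen(part) names the pieces part1, part2, ..., part7
split code, parse(-) gen(part)

* Give the first four pieces meaningful names
rename (part1 part2 part3 part4) (state district subdistrict village)

list state district subdistrict village
```

The new variables are still strings. If you convert them with destring, keep in mind that leading zeros (as in 018) will be lost, which may or may not matter for your codes.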
In splitting variables, string or numeric, I would like to echo Nick Cox’s comment in Splitting numbers, “the bottom line is just standard: be careful.”
In And we’re rolling, rolling; rolling on the river, Hasan asked how he could “keep only those values that were calculated using at least 3 observations” after he calculated the 4-period rolling standard deviation of a set of observations. One solution is to tag the periods in which more than 1 observation within the window (in this case, 4 periods) is missing, then set the calculated standard deviations for these periods to missing.
Two things to note are:
rolling requires that your data have been declared as a time-series dataset (see help tsset). Time-series operators, such as L. for lags, are allowed.
The keep() option in rolling allows you to keep the date variable, which you can use as an identifier when merging files.
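To make the first point concrete, here is a small sketch of declaring a time-series dataset and then using lag operators. The variable names date, x, x_lag1, and x_lag2 are made up for this example.

```stata
* Toy data: five consecutive periods (all names are illustrative)
clear
set obs 5
gen date = _n
gen x = date^2

* Declare the time variable; lag operators need this
tsset date

* L.x is x one period earlier, L2.x is x two periods earlier
gen x_lag1 = L.x
gen x_lag2 = L2.x

list date x x_lag1 x_lag2
```

The lagged values are missing wherever the earlier period does not exist, which is what the tag variable in the illustration relies on for the first few dates.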
Here is an illustration (assuming nonrecursive analysis):
clear
set obs 20
set seed 1
gen date = _n
gen v1 = 1+int((100)*runiform())
gen v2 = v1
replace v2 = . in 1/4
replace v2 = . in 10/12
replace v2 = . in 18/20
tsset date

rolling sd2 = r(sd), window(4) keep(date) saving(f2, replace): sum v2
merge 1:1 date using f2, nogenerate

gen tag = missing(l3.v2) + missing(l2.v2) + missing(l1.v2) + missing(v2) > 1
gen sd = sd2 if tag==0
In the first block, we created an artificial data set of 20 uniformly distributed random integers between 1 and 100, set some observations to missing, and told Stata that we are dealing with a time-series data set.
In the second block, we calculated the rolling standard deviation over a 4-period window. By using the saving() option rather than clear, we kept the current data in memory and saved the results of the rolling command in f2.dta. We then merged this file into our current data.
In the last block, we generated the variable tag, which is 1 if the expression missing(l3.v2) + missing(l2.v2) + missing(l1.v2) + missing(v2) > 1 is true, i.e., if the number of missing observations within the 4-period window is more than 1. Otherwise, tag is 0. Finally, in the last line, we created a new variable sd that is missing whenever fewer than 3 nonmissing observations were used in the window.
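A possible alternative, sketched here rather than taken from the original exchange, is to let rolling also store r(N), the number of nonmissing observations that summarize used in each window, and then filter on that count instead of building tag by hand. The names n2 and f3 are assumptions made for this sketch.

```stata
* Same artificial data as in the illustration above
clear
set obs 20
set seed 1
gen date = _n
gen v2 = 1 + int(100*runiform())
replace v2 = . in 1/4
replace v2 = . in 10/12
replace v2 = . in 18/20
tsset date

* Store both the standard deviation and the nonmissing count per window
rolling sd2 = r(sd) n2 = r(N), window(4) keep(date) saving(f3, replace): summarize v2
merge 1:1 date using f3, nogenerate

* Keep the sd only when at least 3 observations were used;
* the !missing() check guards against missing n2 counting as "large"
gen sd = sd2 if n2 >= 3 & !missing(n2)
```

This should give the same sd as the tag approach for this example, but it is worth comparing the two on your own data before relying on either.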