8/3/2019 Stata Training 1
1/58
Introduction to Stata
Fitsum Zewdu
J. Research Fellow
EEPRI
The Stata Interface
Windows
The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Window pull-down menu or the buttons on the tool bar.
These are the Stata windows:
Stata Results           To see recent commands and output
Stata Command           To enter a command
Stata Browser           To view the data file (needs to be opened)
Stata Editor            To edit the data file (needs to be opened)
Stata Viewer            To get help on how to use Stata
Variables               To see a list of variables
Review                  To see recent commands
Stata Do-file Editor    To write or edit a program (needs to be opened)
Menus
Stata displays 9 drop-down menus across the top of the outer window, from left to right:
File
    Open                  open a Stata data file (use)
    Save/Save as          save the Stata data in memory to disk
    Do                    execute a do-file
    Filename              copy a filename to the command line
    Print                 print log or graph
    Exit                  quit Stata
Edit
    Copy/Paste            copy text among the Command, Results, and Log windows
    Copy Table            copy table from Results window to another file
    Table copy options    what to do with table lines in Copy Table
Prefs                     various options for setting preferences. For example, you can save a particular layout of the different Stata windows or change the colors used in Stata windows.
Data, Graphics, Statistics    build and run Stata commands from menus
User                      menus for user-supplied Stata commands (download from Internet)
Window                    bring a Stata window to the front
Help                      Stata command syntax and keyword searches
Button bar
The buttons on the button bar are, from left to right (the equivalent command, where there is one, follows the colon):
Open a Stata data file: use
Save the Stata data in memory to disk: save
Print a log or graph
Open a log, or suspend/close an open log: log
Open a new viewer
Bring Results window to front
Bring Graph window to front
New Do-file Editor: doedit
Edit the data in memory: edit
Browse the data in memory: browse
Scroll another page when --more-- is displayed: Space Bar
Stop current command or do-file: Ctrl-Break
EXPLORING DATA FILES
Common Stata Syntax
This section covers commands that are used for preliminary exploration of data in a file. Stata commands follow the same syntax:
[by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]
Items inside the square brackets are either optional or not available for every command. This general syntax applies to nearly all Stata commands.
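The pieces of the syntax can be tried one at a time. The sketch below is only an illustration: it assumes Stata's bundled example dataset (loaded with sysuse auto) rather than the ERHS file used in this training, so the variable names price, mpg, make and foreign come from that dataset.

```stata
* Load Stata's bundled example data (an assumption; not the ERHS file)
sysuse auto, clear

* command varlist
summarize price mpg

* command varlist if exp
summarize price mpg if foreign == 1

* command varlist in range
list make price in 1/5

* command varlist, options
summarize price, detail

* by varlist: command varlist if exp, options
bysort foreign: summarize price if mpg > 20, detail
```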
Logical and relational operators used in Stata
~     not
==    equal
~=    not equal
!=    not equal
>     greater than
>=    greater than or equal
<     less than
<=    less than or equal
Examining dataset
clear
The clear command removes all data, variables, and labels from memory to get ready to use a new data file.
You can clear memory using the clear command or by using the clear option of the use command.
This command does not delete any data saved to the hard drive.
set memory
First, you can check how much memory is allocated to hold your data using the memory command.
By default, about 10MB is available for reading in a data file.
Whenever we try to read a data file bigger than the free memory, we get the error message:
no room to add more observations
r(901);
. memory
                                              bytes
--------------------------------------------------------------------
Details of set memory usage
    overhead (pointers)                       5,808        0.06%
    data                                    107,448        1.02%
                                       ----------------------------
    data + overhead                         113,256        1.08%
    free                                 10,372,496       98.92%
                                       ----------------------------
    Total allocated                      10,485,752      100.00%
--------------------------------------------------------------------
Other memory usage
    set maxvar usage                      1,816,666
    set matsize usage                     1,315,200
    programs, saved results, etc.             3,338
                                       ---------------
    Total                                 3,135,204
-------------------------------------------------------
Grand total                              13,620,956
In this case we have to allocate more memory, say 25MB (if 25MB is sufficient for the current file), with the set memory command before trying to use our file.
set memory 25m
Now that we have allocated enough memory, we will be able to read bigger files, provided they fit within the allocated memory.
If we want to allocate 25m (25 megabytes) every time we start Stata, we can type:
set memory 25m, permanently
use
This command opens an existing Stata data file. The syntax is:
use filename [, clear]                                        opens new file
use [varlist] [if exp] [in range] using filename [, clear]    opens selected parts of file
If there is no extension, Stata assumes it is .dta.
If there is no path, Stata assumes it is in the current folder.
You can use a path name such as: use C:\...\ERHScons1999
If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999"
You can open selected variables of a file using a variable list.
You can open selected records of a file using if or in.
Here are some examples of the use command:
use ERHScons1999                           opens the file ERHScons1999.dta for analysis
use if q1a == 1 using ERHScons1999         opens data from region 1
use in 5/25 using ERHScons1999             opens records 5 through 25 of the file
use hhid hhsize cons using ERHScons1999    opens 3 variables from the ERHScons1999 file
use C:\training\ERHScons1999               opens the file ERHScons1999.dta in the specified folder
use "C:\data files\ERHScons1999"           use quotation marks if there are spaces
use ERHScons1999, clear                    clears memory before opening the new file
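The different forms of use can be tried without the ERHS file. This is a sketch under assumptions: it runs as a do-file, takes Stata's bundled auto dataset as stand-in data, and writes it to a temporary file (the macro name cars is hypothetical) so that there is a .dta file on disk to open in parts.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption, not the ERHS file)
sysuse auto, clear

* Write a temporary copy to disk so there is a file to open
tempfile cars
save `cars'

* Open only three variables
use make price mpg using `cars', clear

* Open only the records that satisfy a condition
use if foreign == 1 using `cars', clear

* Open only records 5 through 25
use in 5/25 using `cars', clear
```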
save
The save command saves the dataset as a .dta file under the name you choose. Editing the dataset changes data in the computer's memory; it does not change the data stored on the computer's disk.
save C:\...\consumption.dta, replace
The replace option allows you to save a changed file to the disk, replacing the original file. Stata guards against accidentally overwriting your data file, so you need the replace option to tell Stata that you know the file exists and you want to replace it.
edit
This command opens the Data Editor window, which lets us view all observations in memory.
You can change the data using the Data Editor window, but it is not recommended to edit data this way.
It is better to correct errors in the data using a Do-file program that can be saved (we will see Do-file programs later).
browse
This window is exactly like the Data Editor window except that you can't change the data.
describe
This command provides a brief description of the data file. You can use des or d and Stata will understand. The output includes:
the number of variables
the number of observations (records)
the size of the file
the list of variables and their characteristics
Example 1: Using describe to show information about a data file
. des

Contains data from C:\training\ERHSCONS1999.dta
  obs:         1,452
 vars:            15                          24 Feb 2007 07:07
 size:       113,256 (98.9% of memory free)   (_dta has notes)
-----------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
q1a             float  %9.0g       reg        Region
q1b             double %15.0g      w          Wereda
q1c             double %17.0g      pa         Peseant association
q1d             double %12.0g                 Household id
sexh            byte   %8.0g       sexhh      Sex of household head
ageh            float  %9.0g       p1s1q4     Age of household head
cons            float  %9.0g                  consumption per month
food            float  %9.0g                  food cons per month
hhsize          byte   %8.0g                  household size
aeu             float  %9.0g                  adult equivalent units in household
fpi             float  %9.0g                  food price index
rconspc         float  %9.0g                  real consumption per capita 1994 prices
rconsae         float  %9.0g                  real consumption per adult 1994 prices
poor            double %8.2f
hhid            double %12.0f                 selected household unique id
-----------------------------------------------------------------------------
Sorted by:  hhid
list
This command lists values of variables in data set.
The syntax is:
list [varlist] [if exp] [in range]
examples:
. list                         lists entire dataset
. list in 1/10                 lists observations 1 through 10
. list hhsize q1a food         lists selected variables
. list hhsize sexh in 1/20     lists observations 1-20 for selected variables
. list if q1a < 6              lists cases in regions 1 through 5
if
This qualifier is used to select certain records in carrying out a command:
command if exp
Examples:
. list hhid q1a food if food>12000     lists data if food is above 12000
. tab q1a if cons>10000 & cons<...
. browse if food>=1200                 browse data if food consumption is 1200 or above
Note that if conditions always test equality with ==, not a single =. Also note that | indicates or while & indicates and.
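The if qualifier, ==, & and | can be combined as in the following sketch. It assumes Stata's bundled auto dataset (not the ERHS file), so the variables price, mpg, rep78 and foreign belong to that dataset. The last command illustrates a point worth knowing: Stata treats a missing value as larger than any number, so a condition such as rep78 > 3 is also true where rep78 is missing unless missing values are excluded.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption)
sysuse auto, clear

* One condition
list make price if price > 12000

* Two conditions that must both hold (and)
tabulate foreign if price > 5000 & price < 10000

* Either condition may hold (or); note the double equal sign
count if foreign == 1 | mpg >= 30

* Missing values compare as larger than any number, so exclude them explicitly
count if rep78 > 3 & !missing(rep78)
```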
in
We have also used in to select records based on the case number. The syntax is:
command in range
For example:
. list in 10              lists observation number 10
. summarize in 10/20      summarizes observations 10-20
. l in -10/-1             lists the last 10 observations
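The same ranges can be tried on any data in memory. A minimal sketch, assuming Stata's bundled auto dataset (74 observations) in place of the ERHS file:

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption)
sysuse auto, clear

* A single observation
list make price in 10

* A block of observations
summarize price in 10/20

* Negative numbers count back from the end of the data
list make price in -10/-1
```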
codebook
The codebook command is a great tool for getting a quick overview of the variables in the data file. It produces a kind of electronic codebook from the data file, displaying information about variables' names, labels and values.
. codebook

sexh                                                   Sex of household head
----------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sexhh

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/1452

            tabulation:  Freq.   Numeric  Label
                           400         0  Female
                          1052         1  Male
inspect
This is another useful command for getting a quick overview of a data file. The inspect command displays information about the values of variables and is useful for checking data accuracy.
. inspect sexh

sexh:  Sex of household head              Number of Observations
----------------------------                                   Non-
                                          Total    Integers   Integers
|     #          Negative                     -           -          -
|     #          Zero                       400         400          -
|     #          Positive                  1052        1052          -
|     #                                   -----       -----      -----
|  #  #          Total                     1452        1452          -
|  #  #          Missing                      -
+----------------------                   -----
0              1                           1452
   (2 unique values)

sexh is labeled and all values are documented in the label.
count
The count command shows the number of observations that satisfy an if condition. If no condition is specified, count displays the number of observations in the data.
. count
1452
. count if q1a==3
466
Descriptive Statistics
tabulate, tab1, tab2
These are three related commands that produce frequency tables for discrete variables.
They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable).
tabulate or tab    produces a frequency table for one or two variables
tab1               produces a one-way frequency table for each variable in the variable list
tab2               produces all possible two-variable tables from the list of variables
You can use several options with these commands:
all       gives all the tests of association for two-way tables
cell      gives the overall percentage for two-way tables
column    gives column percentages for two-way tables
row       gives row percentages for two-way tables
nofreq    suppresses printing the frequencies
chi2      provides the chi-squared test for two-way tables
There are many other options, including other statistical tests. For more information, type help tabulate
Some examples of the tabulate commands are:
. tabulate q1a                          produces a table of frequencies by region
. tabulate q1a sexh                     produces a cross-tab of frequencies by region and sex of head
. tabulate q1a hhsize, row              produces a cross-tab by region and hhsize with row percentages
. tabulate sexh hhsize, cell nofreq     produces a cross-tab of overall percentages by sex and hhsize
. tab1 q1a q1b hhsize                   produces three tables, a frequency table for each variable
. tab2 q1a poor sexh                    produces three tables, a cross-tab of each pair of variables
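The tabulate family and its options can also be tried on Stata's bundled auto dataset. That dataset is an assumption made for illustration (it is not the ERHS file); foreign and rep78 are two of its discrete variables.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption)
sysuse auto, clear

* One-way frequency table
tabulate foreign

* Two-way table with row percentages
tabulate rep78 foreign, row

* Two-way table showing only overall (cell) percentages
tabulate rep78 foreign, cell nofreq

* Two-way table with a chi-squared test of association
tabulate rep78 foreign, chi2

* One frequency table per variable
tab1 foreign rep78

* All pairwise cross-tabs (a single table here, since two variables are listed)
tab2 foreign rep78
```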
summarize
The summarize command produces statistics on continuous variables like age, food, cons, and hhsize. The syntax looks like this:
summarize [varlist] [if exp] [in range] [, detail]
By default, it produces the following statistics:
Number of observations
Average (or mean)
Standard deviation
Minimum
Maximum
If you specify detail, Stata gives you additional statistics, such as
skewness and kurtosis
the four smallest values
the four largest values
various percentiles
Here are some examples:
. summarize                             gives statistics on all variables
. summarize hhsize food                 gives statistics on selected variables
. summarize hhsize cons if q1a==3       gives statistics on two variables for one region
by
This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is:
by varlist: command
Note: by requires the data to be sorted by varlist. The bysort prefix sorts the data and then runs the command in one step, so it is the form most commonly used.
An example of the by prefix is:
bysort sexh: sum rconsae     for each sex of household head, gives statistics on real consumption per adult equivalent
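The difference between by and bysort can be seen in a short sketch. It assumes Stata's bundled auto dataset rather than the ERHS file, with foreign and rep78 as the grouping variables.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption)
sysuse auto, clear

* by needs the data sorted on the grouping variable first
sort foreign
by foreign: summarize price

* bysort does the sorting and the grouping in one step
bysort rep78: summarize mpg
```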
help
The help command gives you information about any Stata command or topic:
help [command]
For example,
. help tabulate      gives a description of the tabulate command
. help summarize     gives a description of the summarize command
STORING COMMANDS AND OUTPUT
The following topics are covered:
Using the Do-file Editor
log using
log off
log on
log close
set logtype to move tables from Stata to Word and
Excel
Using the Do-file Editor
The Do-file Editor allows you to store a program (a set of commands).
It makes it easier to check and fix errors.
It allows you to run the commands later.
It lets you show others how you got your results.
It allows you to collaborate with others on the analysis.
In general, any time you are running more
than 10 commands to get a result, it is easier
and safer to use a Do-file to store the
commands.
To open the Do-file Editor, you can click on Window/Do-file Editor or click on the envelope on the Tool Bar.
Keyboard commands are quicker to use than the buttons. The most useful ones are:
Control-O Open file
Control-S Save file
Control-C Copy
Control-X Cut
Control-V Paste
Control-Z Undo
Control-F Find
Control-H Find and Replace
To run the commands in a Do-file, you can click on the Do button (the second-to-last one) or click on Tools/Do.
If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button.
Note: If you would like to add a note to a do-file but do not want Stata to execute it, enclose the note in /* */.
Saving the Output
The Stata Results window does not keep all the output you generate.
It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results.
Thus, we need to use log to save the output.
log using
This command creates a file with a copy of all the
commands and output from Stata. The syntax is:
log using filename [, append replace [ text | smcl ] ]
append     adds the output to an existing file
replace    replaces an existing file with the output
text       tells Stata to create the log file in text (ASCII) format
smcl       tells Stata to create the log file in SMCL format
Here are some examples:
log using temp22                       saves output to a file called temp22
log using temp22, replace              saves output to an existing file, temp22, replacing its contents
log using temp22, append               saves output to an existing file, temp22, adding to its contents
log using "d:\my data\myfile.txt"      saves output in the specified file in the specified folder
log off
This command temporarily turns off the logging of output.
log on
This command is used to restart the logging.
log close
This command is used to turn off the logging and save the file.
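A complete logging session might look like the sketch below. The log name mylog and the use of Stata's bundled auto dataset are assumptions made for illustration; the log file is written to the current folder.

```stata
/* Start a text-format log, replacing any earlier file of the same name */
log using mylog, replace text

/* Stand-in data: Stata's bundled auto dataset (an assumption) */
sysuse auto, clear
summarize price

/* Output produced while the log is off is not recorded */
log off
list make in 1/3

/* Resume recording */
log on
tabulate foreign

/* Stop logging and save the file */
log close
```

The notes in this sketch use the /* */ comment markers, so Stata skips them when the do-file runs.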
set logtype text
This command tells Stata to always save the log
files in text (ASCII) format
set logtype smcl
This command tells Stata to always save log files in
SMCL format.
CREATING NEW VARIABLES
So far, we have seen how to explore the data using existing variables.
Now we will discuss how to create new
variables.
When new variables are created, they are in
memory and they will appear in the Data
Browser, but they will not be saved on the
hard-disk unless you use the save command.
generate
This command is used to create a new variable. It
is similar to compute in SPSS.
The syntax is:
generate newvar = exp [if exp]
where exp is an expression like
price*quant or
1000*kg
generate cannot be used to change the definition of an existing variable.
You can use gen or g as an abbreviation for generate.
If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.
If you use if, the new variable will have missing values where the if condition is false.
For example,
generate age2 = age*age
creates an age squared variable
gen yield = outputkg/area if area>0
creates a new yield variable if area is positive
gen price = value/quant if quant>0
creates a new price variable if quant is positive
gen highprice = (price>1000)
creates a dummy variable equal to 1 for high prices
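These patterns can be run as written on Stata's bundled auto dataset, which stands in here for the ERHS file (an assumption; the new variable names are hypothetical). The last two commands add a caution: because Stata treats a missing value as larger than any number, a dummy built from a comparison equals 1 where the original variable is missing unless those observations are excluded.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption)
sysuse auto, clear

* A squared term
generate price2 = price*price

* A ratio, computed only where the denominator is positive
generate wtperlen = weight/length if length > 0

* A dummy: 1 where the expression is true, 0 where it is false
generate highprice = (price > 10000)

* rep78 has missing values; exclude them so the dummy is missing there too
generate goodrep = (rep78 > 3) if !missing(rep78)
tabulate goodrep, missing
```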
replace
This command is used to change the definition of an existing variable.
The syntax is the same:
replace oldvar = exp [if exp] [in range]
For example,
replace price = avgprice if price > 100000
replaces high values with an average price
replace income =. if income
tabulate generate
This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable.
The syntax is:
tabulate oldvariable, generate(newvariable)
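tab q1a, gen(region)
This creates 6 new variables, region1 to region6, one for each value of q1a. They are numbered in the sorted order of the values of q1a, not by the values themselves:
region1 = 1 if q1a==1 and 0 otherwise
region2 = 1 if q1a==3 (the second distinct value) and 0 otherwise
and so on, up to region6, which equals 1 for the largest value of q1a and 0 otherwise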
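How the dummies are numbered can be checked with a small sketch. It assumes Stata's bundled auto dataset and builds a hypothetical variable, code, that takes only the values 3 and 8, so the numbering of the new variables can be compared with the values of the original one.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption)
sysuse auto, clear

* Hypothetical categorical variable with non-consecutive codes 3 and 8
generate code = cond(foreign, 8, 3)

* Creates c1 (for code==3) and c2 (for code==8)
tabulate code, generate(c)

* Compare the dummies with the original codes
tabulate code c1
tabulate code c2
```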
egen
This is an extended version of generate (extended generate) that creates a new variable by aggregating the existing data.
The syntax is:
egen newvar = fcn(arguments) [if exp] [in range] [, by(varlist)]
count()     number of non-missing values
diff()      compares variables, 1 if different, 0 otherwise
fill()      fill with a pattern
group()     creates a group id from a list of variables
iqr()       interquartile range
ma()        moving average
max()       maximum value
mean()      mean
median()    median
min()       minimum value
pctile()    percentile
rank()      rank
rmean()     mean across variables
sd()        standard deviation
std()       standardize variables
sum()       sums
egen avg = mean(yield)
creates variable of average yield over
entire sample
egen avg2 = median(income), by(sex)
creates variable of median income for each
sex
egen regprod = sum(prod), by(region)
creates variable of total production for
each region
Exercise:
We want to know which households have expenditure (cons) above the village average, i.e. create a dummy variable (1 for those who consume above the village/peasant association average and 0 otherwise).
egen avecon = mean(cons), by(q1c)
gen highavecon = (cons > avecon)
list hhid q1c cons avecon highavecon in 650/675
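The same two-step pattern (a group mean with egen, then a comparison with generate) can be practised on Stata's bundled auto dataset. This is a sketch under that assumption: foreign plays the role of the village identifier, price plays the role of cons, and the new variable names are hypothetical.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption)
sysuse auto, clear

* Step 1: attach the group average to every observation in the group
egen avgprice = mean(price), by(foreign)

* Step 2: flag observations above their own group's average
generate aboveavg = (price > avgprice) if !missing(price)

* Check the result
list make foreign price avgprice aboveavg in 1/10
```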
Arithmetic
+     addition
-     subtraction
*     multiplication
/     division
^     power
Logical
~     not
|     or
&     and
Relational
>     greater than
<     less than
>=    greater than or equal
<=    less than or equal
==    equal
~=    not equal
!=    not equal
Here are some examples to illustrate the use of these operators. Suppose you want to create a dummy variable indicating households in the Amhara region. One way is to write:
generate AmD = 0
replace AmD = 1 if q1a==3
Or you can get exactly the same result with just one command:
generate AmD = (q1a==3)
As another example, suppose a household must have a female head and be located in Dodota wereda to be selected:
gen DDfemale = 0
replace DDfemale = 1 if q1b==9 & sexh==0
An easier way to do this would be:
gen DDfemale = (q1b==9 & sexh==0)
abs(x)          computes the absolute value of x.
exp(x)          calculates e to the x power.
ln(x)           computes the natural logarithm of x.
log(x)          is a synonym for ln(x), the natural logarithm.
log10(x)        computes the log base 10 of x.
sqrt(x)         computes the square root of x.
invnorm(p)      provides the inverse cumulative normal; invnorm(norm(z)) = z.
normden(z)      provides the standard normal density.
normden(z,s)    provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not missing; otherwise, the result is missing.
norm(z)         provides the cumulative standard normal.
group(x)        creates a categorical variable that divides the data into x as nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
int(x)          gives the integer obtained by truncating x.
round(x,y)      gives x rounded into units of y.
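Several of these functions can be used inside generate or checked quickly with display. The sketch assumes Stata's bundled auto dataset, and the new variable names are hypothetical. It leaves out the normal-distribution functions and group(), whose names depend on the Stata version in use.

```stata
* Stand-in data: Stata's bundled auto dataset (an assumption)
sysuse auto, clear

* Functions inside generate
generate lnprice   = ln(price)
generate sqrtwt    = sqrt(weight)
generate price1000 = round(price, 1000)
generate mpgint    = int(mpg/3)

* Functions evaluated directly with display
display abs(-3.5)
display exp(1)
display log10(1000)
```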