Bisection method 1. %%: remainder 2. %/%: integer division 3. & |: logical AND/OR; on vectors: & …

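I. R Basics (1)

Data types: 1. Booleans: TRUE/FALSE 2. integer 3. floating point (numeric) 4. character (strings) 5. special values (NA, NaN)

Note: integer and numeric are different types and must not be confused.

Basic operators:

1. %% : remainder 2. %/% : integer division 3. & and | : logical AND/OR. On vectors, & works elementwise and returns a vector, while && returns a single value (which is what control flow needs). 4. Type tests and conversions: is.na(), is.character(), as.numeric(), is.array(), ...

5. == compares elementwise for exact equality; identical() tests whether two whole objects are exactly equal and returns a single TRUE/FALSE; all.equal() differs from both in that it tolerates floating-point error.

List variables: ls() or objects(). Delete, e.g.: rm("circumference.in.meters")

Run a script file: source("data/example.R"), or source("data/example.R", echo=TRUE) [also echoes the code and prints the results]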

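II. R Basics (2): below, x is a one-dimensional vector, A a matrix, s a string (or character vector), lst a list, and Dt a data frame

Data structures: Vectors • Arrays • Matrices • Lists • Dataframes

Vectorized operations (including & and | above) + the recycling rule for vectors (the shorter vector is repeated automatically; R warns if the longer length is not a whole multiple of the shorter one)

Vector: x[c(2,4)]: keep elements 2 and 4 of the vector;

x[c(-1,-3)]: drop elements 1 and 3 # deletion by negative index is common in R

rep(1, 4): repeat 1 four times; sequences: seq(1, 10, by = , length.out = k) (supply either by or length.out)

Array: array(x, dim=c(2,2)), matrix(x, nrow = 2): unless told otherwise, both fill in the elements of x in column-major order.

For matrix(), byrow = TRUE fills row by row instead.

For the same two-dimensional object, is.array() and is.matrix() agree (a matrix is a special 2-d array).

Array indexing: x.arr[1,2]: the element at position (1,2); x.arr[3]: the third element (column-major order); x.arr[,2]: the second column; other index patterns work similarly.

Common functions: rowSums(x)/colSums(x), rowMeans(x); matrix multiplication: A %*% B

Important: apply(x, 1, mean) = rowMeans(x), where 1 refers to the first dimension (rows)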

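Matrices: common functions: t(A), det(A); diag(A) <- x replaces the diagonal of A with x;

solve(A): inverse of A; solve(A, b): solves for x such that Ax = b;

rownames(A) <- s, colnames(A) <- s; summary(A): per-column summary when A is all numeric. Build a diagonal matrix: diag(x).

List: elements may have different types; select with [ ] (returns a sub-list) or [[ ]] (returns the element itself)

lst <- c(lst, *) appends * to lst; length(lst) <- 3 truncates lst to its first 3 elements

Naming elements: lst2 <- list(family="gaussian", mean=7, sd=1, is.symmetric=TRUE): every element has a name. Add: lst2$was.estimated <- FALSE; delete: lst2$was.estimated <- NULL

Dataframe: all elements within a column share one type, while different columns may differ; a hybrid of matrix and list

Basic operations: 1. Taking columns: Dt$v1 or Dt[, 'v1']; 2. head(Dt,3), tail(Dt,3): the first / last rows

3. Taking elements: Dt[a,b] or Dt['name1','name2']; selecting by condition: Dt[Dt$v1 == a, ]

4. Adding a row: rbind(Dt, list(v1=-3, v2=-5, logicals=TRUE)) 5. with(Dt, expr): inside expr the columns can be used without the Dt$ prefix

6. Plotting: plot(v1 ~ v2, data = Dt)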

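III. Getting Data:

<1> Reading and writing data: read.csv() / read.table(); write.table(), write.csv()

Factors: create with factor(); list the categories with levels() (the levels of the factor, by default the distinct values in x)

Number of categories: nlevels(), e.g.: nlevels(factor(rio$id))

birthwt$race <- factor(c("white", "black", "other")[birthwt$race]): the integer codes in birthwt$race index into the label vector in front, which supplies the element values

t-test: res <- t.test(x, y); then res$p.value, res$statistic

Counts per category: table(), e.g.: table(factor(rio$nationality))

Linear regression: lm(y ~ x, data = D). Special forms: lm(y ~ ., data = D) regresses y on all other columns; to leave out only x: lm(y ~ . - x, data = D); several predictors: lm(y ~ z + x, data = D). Correlation: cor(); model residuals: residuals(model_name);

model summary: summary(linear.model.2)

Data transformation: standardization (Z-score): scale(data, center = TRUE, scale = TRUE) [subtracts each column's mean and divides by its standard deviation];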

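Functions likely to be examined: <1> diff(x): the difference between x[t] and x[t-k]; diff(x, lag = lag, differences = m) is equivalent to applying x[(1+lag):n] - x[1:(n-lag)] m times (lag sets the lag, m the number of iterations, and n is the current length at each iteration);

<2> writing cumulative functions such as a cummean (base R provides cumsum, cumprod, cummax and cummin, but no cummean)

Aggregation on subsets: aggregate(a[,-1], by = a[1], mean). Note that by must be a list; here it is the one-column a[1]. Because the grouping is by the first column, a[,-1] is used to drop that column, which by = a[1] already uses.

The apply family: <1> apply(input, 1, func): the second argument says whether to compute by row (1) or by column (2)

<2> lapply(input, func): returns a list.

<3> sapply: differs from <2> only in that it returns a vector or matrix instead of a list whenever the results can be simplified

<4> tapply: grouped computation [like GROUP BY; returns a vector]: tapply(X, index, func, simplify = TRUE), where X is a vector of data and index the grouping factor. With simplify = TRUE and scalar results the output is a named vector or array; simplify = FALSE always gives a list-mode array (simplify = "array" is an option of sapply, not tapply)

<5> mapply: for functions that take several arguments, applied in parallel over the corresponding elements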

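Sorting: <1> order and rank differ: rank(x) gives each element's rank, and ties get the average rank by default, hence fractional values; order(x) gives the indices that put x in sorted order. Both work in ascending order by default.

<2> sort() returns the sorted vector; the default is decreasing = FALSE

<3> A[which.min(A)] and A[which(A == min(A))] need not agree: when several elements attain the minimum, which.min() returns only the first

<4> min(A, na.rm = TRUE) removes the influence of NA values

<5> Position of the smallest element of a matrix: mat <- matrix(rnorm(40), 10, 4)

which(mat == min(mat, na.rm=TRUE), arr.ind = TRUE)

Merge: merge(x = A, y = B, by.x = "a", by.y = "b"). If no by is given, merge uses all common columns by default. The resulting data frame has all columns of both data frames, except that the by.x and by.y columns are merged into one.

df2.2 <- merge(x=fha, y=ua, by.x="City", by.y="NAME", all.x=TRUE): keeps every row of x; where a row of x has no match in y, all the values coming from y are set to NA

Factorial and combinations: factorial(5): 5 factorial; choose(6,2): the binomial coefficient C(6,2)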

IV. Graphics (2):

Functions: hist(), boxplot(), plot(), points(), lines(), text(), mtext(), axis(), par(): set graphical parameters

plot.ecdf(x, pch = ""), pie(x, col = c("a","b")), barplot(x, col = c("a","b"), horiz = TRUE/FALSE),

symbols(x, y, circles = 10^quakes$mag): magnitude (bubble) plot; qqnorm(x); qqplot(x, y)

boxplot(x ~ y, data = Dt, horizontal = TRUE/FALSE); barplot(table(a,b), beside = TRUE/FALSE)

Adding a line: abline(lm(y~x), col="blue", lwd=2, lty=2)

Higher dimensions: contour(x): contour plot; image(x): heat map

Parameters, for example: hist(x, breaks = 4 (may also be a vector), col = "*", freq = TRUE/FALSE, xlab = "*", main = "*"); a color can be given as rgb(0, 0.5, 0.5, 0.5) (the last value of rgb is the transparency); lines(density(x), lwd = 3 (line width), lty = 2 (line type))

legend("bottomright", legend = c("a","b"));

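Other parameters: pch: point symbol; col; col.axis: color of the axis annotation; col.lab: color of the labels; cex: text size (cex.lab; cex.axis; cex.main)

font: font style (1-5); font.axis, font.lab, font.main

Subplot layout: layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE), widths = c(2,1), heights = c(1,1)): two rows in total; the first row holds one plot, the second row two plots with widths in ratio 2:1

An empty plot with axes: plot(c(0,9), c(-10,20), type='n', ylab=" "): a blank canvas to which text, points and lines can be added by coordinate

Overlaying: call par(new = TRUE) before the next plot(), or pass add = TRUE to functions that support it (e.g. curve(), hist())

Margins around the plot: par(mar = c(5,8,5,8)): the order is c(bottom, left, top, right), so this leaves 5 lines at the bottom, 8 on the left, 5 at the top and 8 on the right

Other plots: perspective plot: persp(x,y,z); time-series plot: ts.plot(x);

Several series at once: matplot(x, cbind(sine, cosine))

Pairwise relationships: pairs(x): if x has three columns, this gives the C(3,2) = 3 pairwise scatterplots (each drawn twice, mirrored across the diagonal) plus 3 label panels on the diagonal

Saving a figure: pdf("names.pdf", width = 6, height = 4); ...; dev.off()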

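V. ggplot2

The biggest difference from plot is that the drawing commands are chained with + (much like layers),

and data need not be passed each time: the first command is ggplot(data = D, aes(x = X, y = Y, size = Z, color = H)) + the next layer. The aes mapping in the initial ggplot() call is global; a mapping given in a later layer applies to that layer only.

aes maps data to aesthetics. <1> Its arguments should be columns of D rather than literal values: color = "blue" inside aes() means there is a single category named "blue", so all points come out red (the default first color). <2> Settings made in the global aes (such as color) affect every layer of the plot, whereas colors set within individual layers do not affect each other. <3> Plot parameters: color, alpha (transparency), size

An example: p <- ggplot(data = mpg, aes(x=displ, y=hwy)) + geom_point(aes(color = class))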

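Geoms: <1> points: geom_point(); <2> lines: geom_line(); <3> smoothed fit: geom_smooth(). Its defaults are method = "auto" (loess for small data sets) and se = TRUE, the latter drawing the 95% confidence band (the shaded area); use method = lm for a linear fit

<4> Bar chart: ggplot(X, aes(x = H, fill = as.factor(M))) + geom_bar() by default stacks the counts of the groups like the stripes of the German flag; geom_bar(position = position_dodge()) places them side by side within each x value, like the French flag

<5> Histogram: geom_histogram(binwidth = {the bin width you want}) (use bins = to give the number of bins instead) <6> A more elaborate histogram + density plot: geom_histogram(aes(y = ..density..), color = "black", fill = "white") + geom_density(alpha = 0.2, fill = "green") + geom_vline(aes(xintercept = mean(X)), color = "red", size = 2, linetype = "dashed")

where v stands for vertical and h for horizontal: geom_hline draws a horizontal line. ..density.. puts the density rather than the default count on the y axis. Note that layers can cover one another, so their order matters; aes(color = class) is commonly used

<7> Box plot geom_boxplot(): ggplot(mpg, aes(x="", y=hwy)) + geom_boxplot(color="blue");

one box per class of x: ggplot(mpg, aes(x=class, y=hwy, fill=class)) + geom_boxplot()

<8> stat (summary statistics): a point at the median: stat_summary(fun = "median", geom = "point") (the argument was called fun.y before ggplot2 3.3.0)

Histogram via stat_bin(); the geom argument changes its shape: stat_bin(geom = "line")

Violin plot geom_violin(): ggplot(mpg, aes(x="", y=hwy)) + geom_violin()

<8.5> Bar plot: ggplot(mpg, aes(x=class, fill=as.factor(cyl))) + geom_bar(position = position_dodge()): the groups are placed side by side

<9> QQ plot in ggplot:

<10> Pie chart: ggplot(A, aes(x = "", fill = class)) + geom_bar() + coord_polar(theta = "y", start = 0).

Note that aes here does not split x into categories, so without the last layer the plot would be a single stacked bar (the German-flag look); if x were split into categories, the result would be several concentric rings. Setting fill = as.factor(cyl) gives wedges of different sizes.

Formatting functions: <1> coord_flip() swaps the horizontal and vertical axes; it can, for example, be added after a box plot.

<2> Common logarithm on the y axis: scale_y_log10(); the tick labels are then shown in "e" notation

<3> To show powers of 10 instead of labels such as "1e3", use scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), labels = trans_format("log10", math_format(10^.x))) (these helpers come from the scales package).

The breaks argument sets the spacing of the ticks on the y axis; the labels argument formats them as mathematical expressions

<4> coord_trans(y = "log10") draws a different picture from the previous method: its axis ticks are not evenly spaced

<5> Axis ranges can also be set with an extra layer: lims(x = c(1,2), y = c(1,2))

<6> Title: ggtitle("x") <7> Labels: labs(x = "", y = "") or, to change only one, xlab("")

<8> Subplots by the value of Z: facet_wrap(~Z, nrow = 3). Subplots can also be arranged in two dimensions, with attribute x for the rows and attribute y for the columns: facet_grid(x ~ y, scales = "free_y") [optional; lets the y range differ from row to row]. Adding space = "free" also lets the panel sizes vary.

Several plots can also be arranged with grid.arrange(p1, p2, p3, ncol = 2, widths = c(1,2)) (gridExtra package). It is not a layer, so no + goes in front of it, and it can combine plots of different types <9> guides(fill = "none") removes the legend (older ggplot2 versions used guides(fill = FALSE)) <10> ggsave("plot.pdf")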

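VI. Text: below, str denotes a string

Basics: 1. class("abc"): "character"

Functions: cat(s): prints with formatting (escape sequences are interpreted); sprintf: the counterpart of C's sprintf/printf

str.vec[-(1:2)]: drop elements 1 and 2; note that a:b is the closed range [a,b], unlike in Python! tolower() and toupper();

as.character(); (a conversion that fails, e.g. as.numeric("abc"), gives NA); nchar(): an escape sequence such as \n is counted too (as one character); not to be confused with length()

substr(): vectorized; e.g. substr(str, 1, 4) extracts characters 1 through 4. For vectorization, note this example:

substr(presidents, 1, 1:5); replacing characters: substr(phrase, 1, 1) = "L"

Replace and search: gsub("s1", "s2", str): replaces every match. Unlike substr, which keeps the rest of the original when the replacement is too short (character-by-character replacement), gsub simply puts s2 in place of s1. sub("s1", "s2", str): replaces only the first match. str itself is unchanged by the replacement; assign the result to change it. gsub("[ae]", "o", phrase): every a or e is replaced by o

Does str contain x? grep(x, str): returns the indices of the matching elements; grepl(x, str): returns TRUE/FALSE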

Splitting: split.obj = strsplit(str, split=","); length(split.obj): 1 (the result is a list whose single element is a vector of strings)

It can be flattened with unlist: unlist(strsplit(str, split=",")) gives a character vector.

Note: strsplit is vectorized: strsplit(c(ingredients, great.profs, favorite.cats), split=", "); to take one of the results: split.list[[1]]

Splitting into single characters: split.chars = strsplit(ingredients, split="")[[1]],

length(split.chars) = total number of characters = nchar(ingredients)

Regular expressions, e.g.: strsplit("For tough problems, you need R.", split = "[[:space:]]|[[:punct:]]")

splits at whitespace (spaces, newlines) and at punctuation marks

Other character classes: <1> [:digit:] = [0-9]; <2> [:alpha:] = [a-zA-Z]

<3> [:alnum:] = [:digit:] + [:alpha:]

Joining strings:

<1> paste("Spider", "Man") # Default is to separate by " ": paste's default is sep = " ", so paste(str1, str2, sep = "") = paste0(str1, str2)

Default arguments of paste: paste(..., sep = " ", collapse = NULL)

<2> paste is vectorized: when str1 is a vector, paste(str1, c("D", "R")) recycles the second vector to the length of str1 and then joins element by element.

<3> paste(presidents, collapse="; ") gives: "Clinton; Bush; Reagan; Carter; Ford"

where presidents is: "Clinton" "Bush" "Reagan" "Carter" "Ford"

<4> An important example: paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")", sep="", collapse="; ") gives:

"Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"

Reading sentences from a text file: <1> trump.lines = readLines("data/trump.txt") # class: "character"

<2> trump.text = paste(trump.lines, collapse=" "): joins the lines, separated by spaces

<3> First 60 characters: substr(trump.text, 1, 60)

<4> Extract the words: trump.words = strsplit(trump.text, split=" ")[[1]]

<5> Build a word table: trump.wordtab = table(trump.words): the names of wordtab are the words and the values are their counts, so a word's frequency can be looked up by name, e.g.: trump.wordtab["America"]

A word that does not occur returns NA, e.g.: trump.wordtab["Canada"] # NA

<6> Sort by frequency: trump.wordtab.sorted = sort(trump.wordtab, decreasing=TRUE)

Show the top 20: head(trump.wordtab.sorted, 20) # First 20

Dates and times:

Basic examples: <1> class(Sys.Date()) # "Date" <2> Current time: Sys.time()

<3> class(Sys.time()) # "POSIXct" "POSIXt"; note that a time carries these two classes

<4> as.Date accepts two standard formats: as.Date("2019-10-23") or as.Date("2019/10/23")

For a non-standard format, pass a format string built from conversion codes, for example:

as.Date("October 23, 2019", format = "%B %d, %Y"): %B: full month name; %d: day of the month as a number; %Y: four-digit year; %y: two-digit year; %b: abbreviated month name (e.g. Oct); %m: month as a number. A string that is in neither standard format and has no format supplied raises an error; one that does not match the supplied format gives NA.

Related functions, example: bdays <- c(tukey=as.Date('1915-06-16'),fisher=as.Date('..),..)

<1> Weekday, month and quarter of these dates: weekdays(bdays); months(bdays); quarters(bdays)

How dates are stored: <1> POSIXct: stores date/time values as the number of seconds since January 1, 1970 <2> POSIXlt: stores date/time values as a list with elements for second, minute, hour, day, month, and year (years are counted from 1900)

<3> as.POSIXlt(dts) and as.POSIXct(dts) print the same

<4> POSIXlt lets you extract components: p <- as.POSIXlt("2019/10/23"); p$mday

p$year + 1900 # 2019

Some functions: <1> range(bdays); mean(bdays): the mean date;

bdays[3] - bdays[1]: difference between two dates;

difftime(bdays[3], bdays[1], units='weeks'): the difference in weeks;

<2> Date sequences: seq(as.Date('2019-10-1'), by='days', length=10);

seq(as.Date('2019-9-4'), to=as.Date('2020-1-1'), by='2 weeks').

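VII. Control Flow:

Vectorized: log(), abs(), ifelse(x>0, x, -x); if and else, however, are not vectorized

A common mistake: after the } that closes the if block, else does not follow on the same line but starts a new line

Note: braces may be omitted when the body of the if is a single line

Common idioms: <1> u.vec[-0.5 <= u.vec & u.vec <= 0.5] = 999

<2> (0 > 0) && (all.equal(42%%6, 169%%13))

<3> Use && and || for control or conditionals, & and | for subsetting or indexing

Writing repeat loops:

Elementwise (parallel) maximum across several vectors: pmax()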

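VIII. Function (2):

Basic concepts: <1> Without a return(), the function returns the value of its last evaluated expression.

<2> The structure of a function has three basic parts: 1. Inputs (or arguments) 2. Body (code that is executed) 3. Output (or return value)

<3> R doesn't let your function have multiple outputs, but you can return a list

<4> Default arguments: an argument that has a default value may be omitted from the call.

<5> With argument names, order does not matter: Named arguments can go in any order when explicitly tagged

<6> A function looks up a name in its own environment first and falls back to the global environment if it is not found there, so it is best to define everything the function uses inside it. Variables defined inside a function also live only inside that function (Each function has its own environment); an assignment inside a function does not change a global variable of the same name!

<7> Built-in constants may be used freely: Generally OK for built-in constants like pi, letters, month.name, etc.

<8> Things to watch in functions, such as results produced as intermediate steps (Side effects are things that happen as a result of a function call, but that aren't returned as an output):

• Printing something out to the console • Plotting something on the display • Saving an R data file, or a PDF, etc.

<9> How to design a function (pseudocode is a good start): 1. Start with the big-picture view of the task 2. Break the task into a few big parts 3. Figure out how to fit the parts together 4. Repeat this for each part

Vectorizing a function, example: if..else is not vectorized, but ifelse(x^2 > 1, 2*abs(x)-1, x^2) is

Argument checks and returning a list: <1> use stopifnot(condition A, condition B) <2> give the list elements proper names.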

IX. Function (2):

<1> plot.lm(lm) has the same effect as plot(lm)

<2> R's assert function: stopifnot("a" %in% names(object), is.numeric(y))

Sub-function: a function built on top of another function.

X. R Packages

Creating an R package: 1. Click File | New Project. 2. Choose "New Directory": 3. Then "R Package": 4. Then give your package a name and click "Create Project":

Contents of a package: 1. An R/ directory. 2. A basic DESCRIPTION file. 3. A basic NAMESPACE file. 4. RStudio also includes an RStudio project file, pkgname.Rproj

Life cycle of a package: 1. Source 2. Bundled 3. Binary 4. Installed 5. In-memory

Package states: 1. Source packages: a directory on your computer

2. Bundled packages: a .tar.gz file (files in .Rbuildignore are not included in the bundle)

3. Binary packages: .tgz on Mac, .zip on Windows

4. Installed package: a binary package that has been unpacked

5. In-memory packages: packages loaded into memory, e.g.: library(devtools) # or require(devtools)

Coding style: <1> { goes at the end of the line of code; <2> } goes on a line of its own;

<3> keep each line of code under 80 characters

Also note: calling functions from other packages inside your own package requires declaring them explicitly; otherwise the dependency is not loaded automatically and the function fails.

library() and require(): the main difference is how they react when a package is not found. While library() throws an error, require() prints a warning message and returns FALSE

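XI. R Packages (2):

<1> DESCRIPTION: list the other packages you use after Imports

In DESCRIPTION, Title is at most one line and Description at most one paragraph, with continuation lines indented 4 spaces

<2> Roxygen2: #' @param, #' @example

<3> Namespace: when two attached packages provide functions of the same name, the later one masks the earlier

<4> LazyData: If the DESCRIPTION contains LazyData: true, then datasets will be lazily loaded. This means that they won't occupy any memory until you use them

Using roxygen2:

Using Rcpp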

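XII. Random Variate Generation

Basic functions: 1. sample(1:100, size = 6, replace = TRUE): sampling with replacement; sample also works on character vectors

2. Distribution prefixes: p: distribution function (CDF); r: random draws; d: density; q: quantile function (inverse CDF)

3. Example, CDF of the exponential distribution:

plot(this.range, pexp(this.range), type="l", main="exp", xlab="x", ylab="P(X<x)")

4. Acceptance-rejection (A-R) method: to generate X ~ f, choose t with t(x) >= f(x) for all x, let c be the integral of t, and draw Y ~ h where h(x) = t(x)/c.

Draw U ~ U(0,1) independently of Y; if U <= f(Y)/t(Y), set X <- Y, otherwise draw again. Then X ~ f.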
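The accept/reject loop of item 4 can be sketched as follows, written in Python (the second language of these notes) so that it runs with the standard library alone. The target f = Beta(2, 2), the constant envelope t(x) = 1.5 and the names target_density, ENVELOPE and accept_reject are illustrative assumptions, not course code.

```python
import random


def target_density(x):
    """Assumed target density f: Beta(2, 2) on [0, 1], f(x) = 6 x (1 - x)."""
    return 6.0 * x * (1.0 - x)


# Envelope t(x) = 1.5 >= f(x) on [0, 1] (f peaks at x = 0.5 with value 1.5).
# Its integral is c = 1.5, so h = t / c is the Uniform(0, 1) density.
ENVELOPE = 1.5


def accept_reject(n, rng):
    """Draw n values from f by acceptance-rejection."""
    draws = []
    while len(draws) < n:
        y = rng.random()  # candidate Y ~ h = Uniform(0, 1)
        u = rng.random()  # U ~ Uniform(0, 1), independent of Y
        if u <= target_density(y) / ENVELOPE:
            draws.append(y)  # accept: X <- Y
    return draws


rng = random.Random(0)
sample = accept_reject(2000, rng)
sample_mean = sum(sample) / len(sample)  # Beta(2, 2) has mean 0.5
```

On average a fraction 1/c of the candidates is accepted, so a tighter envelope (c closer to 1) wastes fewer draws.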

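5. Code from the lab assignment:

6. Convolution method:

1. X_i ~ Bern(p): the sum of r of them ~ B(r, p) 2. geometric → Pascal (negative binomial) 3. standard normal → chi-square with n degrees of freedom (sum of n squares) 4. exponential → Gamma(n, λ)

Multivariate normal: 1. MASS::mvrnorm(n, mu = mu, Sigma = Sigma)

Decomposing \Sigma: the Cholesky decomposition method and the SVD method: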

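XIII. Optimization: [figure]

<1> Bisection method

<2> Secant-type method (Brent's): uniroot(f, lower = 0, upper = 5*n, a = a, n = n); remember to unlist the result

or, for example: out <- uniroot(f, interval = c(-5*n, 0), a = a, n = n); unlist(out)

<3> Gradient descent: see HW5:

<4> Newton's method:

<5> Fisher scoring: replace the Hessian matrix in <4> with the information matrix

<5.5> Coordinate descent: gradient descent, but updating only one of the p coordinates at a time

<6> Functions:

<6.1> One dimension: optimize(func, lower = 1, upper = 4, maximum = TRUE): func is a function already written

<6.2> Several dimensions: optim(par, func, gradient [optional], method = "BFGS", hessian = TRUE [whether to return the Hessian]); par holds the starting values of the parameters of func.

<6.3> Nonlinear least squares:

fit2 <- nls(pcgmp ~ y0*pop^a, data = gmp, start = list(y0 = 5000, a = 0.1)); summary(fit2)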
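The bisection method listed as <1> in this section has no code in the notes; a minimal sketch in Python (the second language of these notes) might look like this. The function name bisect_root, the tolerance and the test function x^2 - 2 are assumptions chosen for illustration.

```python
def bisect_root(f, lower, upper, tol=1e-8, max_iter=200):
    """Find a root of f in [lower, upper] by repeatedly halving the interval.

    Requires f(lower) and f(upper) to have opposite signs (or one to be zero).
    """
    f_lower, f_upper = f(lower), f(upper)
    if f_lower == 0:
        return lower
    if f_upper == 0:
        return upper
    if f_lower * f_upper > 0:
        raise ValueError("f(lower) and f(upper) must have opposite signs")
    for _ in range(max_iter):
        mid = (lower + upper) / 2
        f_mid = f(mid)
        if f_mid == 0 or (upper - lower) / 2 < tol:
            return mid
        if f_lower * f_mid < 0:
            upper = mid                   # sign change in the left half
        else:
            lower, f_lower = mid, f_mid   # sign change in the right half
    return (lower + upper) / 2


root = bisect_root(lambda x: x ** 2 - 2, 0.0, 2.0)  # approximates sqrt(2)
```

Each iteration halves the bracketing interval, so the error shrinks by a factor of 2 per step; Brent's method, as used by uniroot, keeps such a bracket but usually converges faster.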

XIV. Cross-Validation:

Basic concepts: 1. The more complex the model (in the overfitting regime), the larger the gap between training error and test error usually is. Error here means ordinary MSE. To fit a polynomial: lm.10 = lm(y ~ poly(x,10))

Example from class: training-error and test-error curves.

Splitting into train and validation sets: inds = sample(rep(1:2, length=n)); dat.tr = dat[inds==1,]

Cross-validation, steps: • Split data into k parts or folds • Use all but one fold to train your model/method • Use the left out folds to make predictions • Rotate around the roles of folds, k rounds total • Compute squared error of all predictions
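The five steps above can be sketched in plain Python (the second language of these notes). The model here is a simple least-squares line, and the helper names fit_line and kfold_mse as well as the simulated data are assumptions for illustration only.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def kfold_mse(xs, ys, k, rng):
    """k-fold cross-validated mean squared error of the fitted line."""
    n = len(xs)
    fold_of = [i % k for i in range(n)]  # balanced fold labels 0..k-1
    rng.shuffle(fold_of)                 # random assignment to folds
    squared_errors = []
    for fold in range(k):                # rotate the held-out fold
        train = [i for i in range(n) if fold_of[i] != fold]
        test = [i for i in range(n) if fold_of[i] == fold]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared_errors += [(ys[i] - (intercept + slope * xs[i])) ** 2
                           for i in test]
    return sum(squared_errors) / n       # average over all predictions


rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.1) for x in xs]  # simulated data
cv_error = kfold_mse(xs, ys, k=5, rng=rng)
```

Setting k equal to the number of observations turns this into leave-one-out validation.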

Definition of the cross-validation error:

Leave-one-out validation (i.e. k-fold with K = n):

cv.errs = colMeans((pred.mat - dat$y)^2)

Plotting: xlim = range(c(x,x0)); ylim = range(c(y,y0))

xx = seq(min(xlim), max(xlim), length=100): a grid of increasing points used to draw the prediction curve:

loess (locally weighted smoothing), commonly used for:

<1> Fitting a line to a scatter plot or time plot in the presence of noise

<2> Linear regression where least squares fitting doesn't create a line of good fit, or is too costly to be practical <3> Data exploration and analysis

A suitable degree of smoothness for the curve can then be chosen by CV

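XV. Bootstrap (resampling)

<1> When the distribution is unknown, use the ecdf in place of the cdf; the whole resampling procedure is the bootstrap

<1.1> When writing it by hand, use sample(1:n, size = n, replace = TRUE)

<1.2> With a package: obj <- boot(data = D, statistic = func, R = 200); obj$t holds the bootstrapped statistics (then take mean or sd)

<1.3> the <- function(x, data){...}

Obj2 <- bootstrap(1:n, nboot = 200, theta = the, data); Obj2$thetastar gives access similar to <1.2>

<2> Nonparametric bootstrap:

<2.1> Variance estimation:

<2.2> Bias estimation:

<2.3> Confidence interval estimation

<2.3.1> First Order Normal Approximation Interval (assumes a normal distribution)

<2.3.2> Bootstrap Percentile Interval: relies only on the empirical distribution function

Simplest method of making confidence intervals for the unknown parameter is to take α/2 and 1 − α/2 quantiles of the bootstrap distribution of the estimator θ_n as endpoints of the 100(1-α)% confidence interval.

<2.3.3> Basic Bootstrap Interval

<2.3.4> Studentized Bootstrap Interval (code)

<2.3.5> Accelerated Bias-Corrected Percentile (BCa) Interval

Mainly used when the distribution is asymmetric

<2.3.6> Methods 1, 2, 3 and 5 above are available from a package; the type argument selects which interval is output

<3> Parametric bootstrap: We use a parametric model rather than the empirical distribution

Parametric versus nonparametric

<1> The nonparametric bootstrap is nearly always valid. It fails only when the sample size is too small, when the square root law does not hold, when the data are not IID, or when the parameter is not a good enough description of the original distribution

<2> The parametric bootstrap goes wrong when the model is wrong. When the model is right, however, it performs much better than the nonparametric bootstrap in small samples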
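As an illustration of the percentile interval in <2.3.2>, here is a minimal hand-written sketch in Python (the second language of these notes) using only the standard library. The function name percentile_interval, the simulated data and the simple order-statistic quantiles are assumptions; this is not the boot package's method.

```python
import random
import statistics


def percentile_interval(data, statistic, n_boot=2000, alpha=0.05, rng=None):
    """Bootstrap percentile interval: quantiles of the resampled statistic."""
    rng = rng or random.Random()
    n = len(data)
    # Resample n observations with replacement, n_boot times, and sort
    # the resulting statistics.
    boot_stats = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Simple order-statistic approximation to the alpha/2 and 1 - alpha/2
    # quantiles of the bootstrap distribution.
    lower_index = int((alpha / 2) * n_boot)
    upper_index = int((1 - alpha / 2) * n_boot) - 1
    return boot_stats[lower_index], boot_stats[upper_index]


rng = random.Random(42)
data = [rng.gauss(10, 2) for _ in range(100)]  # simulated sample
lower, upper = percentile_interval(data, statistics.mean, rng=rng)
```

Any statistic that maps a list to a number (for example statistics.median) can be passed in place of the mean.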

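XVI. Permutation Tests: comparable to the bootstrap, with a similar underlying idea.

Basic idea: permutation can be used to generate the distribution under the null hypothesis, so that the p-value decides whether to retain or reject it

Steps of a permutation test: 1. Choose a statistic \theta

Testing for identical distributions with permutation: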
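The code for this test is not preserved in the notes. A minimal two-sample sketch in Python (the second language of these notes) could look like the following; the choice of the absolute difference in means as the statistic, the function name permutation_test and the simulated groups are all assumptions for illustration.

```python
import random


def permutation_test(group_a, group_b, n_perm=2000, rng=None):
    """Permutation p-value for H0: both groups share one distribution.

    Statistic (an assumed choice): absolute difference of the group means.
    """
    rng = rng or random.Random()

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the observations at random
        permuted = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if permuted >= observed:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return (at_least_as_extreme + 1) / (n_perm + 1)


rng = random.Random(7)
p_same = permutation_test([rng.gauss(0, 1) for _ in range(30)],
                          [rng.gauss(0, 1) for _ in range(30)], rng=rng)
p_shifted = permutation_test([rng.gauss(0, 1) for _ in range(30)],
                             [rng.gauss(3, 1) for _ in range(30)], rng=rng)
```

Under the null hypothesis the group labels are exchangeable, which is why shuffling the pooled data reproduces the null distribution of the statistic.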

Testing a correlation coefficient:

Permutation and bootstrap: the bootstrap counterpart of the examples above:

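Python I. Data structure. Numbers and strings: <1> int(-2.8): -2, bool(-2): True, bool(0): False, not True == False

22//8: 2, 22%8: 6,

<2> Complex numbers: c = a + bj; j is the imaginary unit in Python. Testing whether a complex number is > 0 raises an error.

<3> Non-strings cannot be joined to a string with +; convert them with str() first. Note that + joins with no separator.

<4> string.title(): first letter of each word upper case, the rest lower case; .upper(): all upper case,

<5> .isalnum(): whether all characters are letters or digits; .isdecimal(): whether the string contains only digits,

<6> str * 3: repeats str three times

<7> ' '.join([stra, strb]): join with a space;

', '.join([str1, str2, "2020"]): the string in front is the separator used when joining, like collapse=', ' in R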
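A short runnable recap of the string operations in <3> to <7>; the variable names and sample strings are made up for illustration.

```python
title = "statistical software"
year = 2020

# + only joins strings, so the number must go through str() first;
# nothing is inserted between the pieces.
label = title.title() + " (" + str(year) + ")"

shouted = title.upper()                          # all upper case
joined = ", ".join(["R", "Python", str(year)])   # separator comes first
repeated = "ab" * 3                              # repetition

letters_and_digits = "abc123".isalnum()   # only letters and digits
has_point = "12.5".isdecimal()            # the "." is not a decimal digit
only_digits = "2020".isdecimal()

try:
    title + year            # str + int is not allowed
    mixed_concat_ok = True
except TypeError:
    mixed_concat_ok = False
```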

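<8> Reading input: input_str = input('input something')

'stra,strb'.split('s'): split on s; str.split(' '): split on spaces,

An echoed string is shown with quotes, an echoed list with [ ],

<9> Padding and alignment: 'Statistical software (2020)'.ljust(30); 'Statistical software'.center(50, "=")

<10> print("xx %i" % (a)) or print(f"a has shape {a.shape}: {a}")

Data structures: <1> list: L.sort() sorts in place and returns None, so printing its result shows None (a list of strings sorts just as well);

<2> [:]: all elements; [:3]: the first 3 elements; [2:7]: indices 2 to 6;

<3> .index(): position of the first occurrence

<4> mylist.append(newelement); mylist.insert(position, newelement)

<5> sort changes the list itself, sorted does not: L.sort() is a method that sorts in place, sorted(L) is a function that returns a new list

tuple: elements cannot be changed or deleted, but concatenation works: tuple4 = tuple2 + tuple3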
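The difference between sort and sorted, and the immutability of tuples, can be checked with a few lines; the variable names are illustrative.

```python
scores = [3, 1, 2]

ordered_copy = sorted(scores)           # new list; scores is untouched
unchanged_after_sorted = scores == [3, 1, 2]

result_of_sort = scores.sort()          # sorts in place and returns None

point = (1, 2)
extended = point + (3,)                 # concatenation builds a new tuple

try:
    point[0] = 10                       # tuples do not support assignment
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False
```

This is why print(L.sort()) shows None: the sorted data is in L itself, not in the return value.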

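Dictionary {key1: value1, key2: value2}:

<1> Delete an entry: del dic0['a'];

<2> dic0.keys(); dic0.values(); <3> dic0.items() returns dict_items([(a,b),(c,d)]);

<4> dic0.get('key1') is essentially dic0['key1'], except that a missing key gives None instead of a KeyError

Set: <1> set() breaks a string or list into its distinct elements; A|B: union, A&B: intersection, A-B: difference

<2> set.add(), set.remove() <3> Distinguish set(['apple']) # {'apple'} from set('apple') # {'a', 'e', 'l', 'p'}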

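II. Function

if..else example:

if a:
    while a < 200:
        do xxx
elif b:
    for i in range(10):
        do xx
else:
    do xxx

Functions: def fun(*x): <1> whatever is passed in is collected into a tuple; returning several values returns a tuple automatically;

<2> Local variables of a function cannot be used outside it

<3> list(map(int, list1)) resembles R's apply family: int() is not vectorized, and the result of map must be wrapped in list(),

<4> map(lambda x: 2*x, L): anonymous function

<5> list(filter(is_odd, L)) keeps the elements for which the function returns True; the function in the first position should return a bool

<6> L = [1,2,3,4]; reduce (from functools), examples:

reduce(lambda x,y: x+y, L, 0): gives 10; 0 is the initial value; x is the running result and the elements of L are passed as y

reduce(lambda x,y: x*y, L, 1): gives 24;

reduce applies the two-argument function passed to it to elements 1 and 2 of the collection, then applies it to that result and the third element, and so on; the initial value may be omitted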
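To make the order of evaluation concrete, the sketch below records every (accumulator, element) pair that reduce passes to the function; traced_add and steps are illustrative names.

```python
from functools import reduce

L = [1, 2, 3, 4]

total = reduce(lambda acc, item: acc + item, L, 0)    # ((((0+1)+2)+3)+4)
product = reduce(lambda acc, item: acc * item, L, 1)  # ((((1*1)*2)*3)*4)
no_initial = reduce(lambda acc, item: acc + item, L)  # starts from L[0]

steps = []


def traced_add(acc, item):
    """Same as addition, but remembers each call for inspection."""
    steps.append((acc, item))
    return acc + item


reduce(traced_add, L, 0)
# steps now holds (running result, next element) for each call
```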

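<7> range(100)*2 raises an error; convert first: list(range(100))*2

III. numpy

Arrays: <1> np.array([[],[]]), np.zeros((3,4), dtype=int), np.full((2,3), fill_value=7), np.ones(), np.empty(): leaves the values uninitialized, np.eye(),

<2> np.arange(0,10,2) (0 2 4 6 8); np.linspace(0,8,5): 5 elements from 0 to 8 (both 0 and 8 included)

range returns a range object, while numpy.arange and numpy.linspace return an array; the step of np.arange may be fractional, but the step of range must be an integer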
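A quick check of these differences, assuming numpy is installed; the variable names are illustrative.

```python
import numpy as np

evens = np.arange(0, 10, 2)      # stop value 10 is excluded
grid = np.linspace(0, 8, 5)      # 5 points, both end points included
halves = np.arange(0, 2, 0.5)    # a fractional step is allowed

try:
    range(0, 2, 0.5)             # range insists on integer arguments
    range_accepts_float = True
except TypeError:
    range_accepts_float = False

is_array = isinstance(evens, np.ndarray)
is_range_object = isinstance(range(3), range)
```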

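<3> b.ndim = len(b.shape); b.shape; b.size = the product of the elements of b.shape,

<4> a[::-1]: reversed along the first axis; a[:,0]: first column; a[:,1:]: all but the first column. Do not write a[,1:] (a syntax error)

<5> d.reshape(4,1); d[np.newaxis,:]: adds a leading axis; d.T: transpose,

<6> np.concatenate((a,b)): axis=0 by default, stacking rows (1-d vectors are simply joined end to end); axis=1 joins columns,

<7> np.stack((b,b), axis=0): joins along a new axis,

<8> np.split(d, 2): equal parts; np.split(d, (2,3,5), axis=1): cuts in front of the given columns, 2, 3 and 5 being the break points; enumerate(): yields (index, item) tuples; usual idiom: for i, p in enumerate(parts):

Functions: print("Column sums:", b.sum(axis=0)); print("Row sums:", b.sum(axis=1))

<2> a.mean(), a.min(), a.std(),

<3> < compares elementwise; c.all() / c.any(): all True / at least one True

<4> np.sum(a<0): number of negative entries; ~c: swaps True and False

<5> np.sort() leaves the original unchanged; a.argsort(): returns the sorting indices

<6> Matrix multiplication: np.matmul or @,

reduce(lambda x,y: np.matmul(x,y), (a for i in range(n))): the n-th power of the matrix a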
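The reduce idiom for matrix powers can be compared with numpy's own np.linalg.matrix_power; the small test matrix below is an arbitrary choice for illustration.

```python
from functools import reduce

import numpy as np

a = np.array([[1, 1],
              [0, 1]])
n = 5

# Multiply n copies of a together, left to right.
power_by_reduce = reduce(np.matmul, (a for _ in range(n)))

# Library routine for the same computation.
power_builtin = np.linalg.matrix_power(a, n)
```

The reduce version performs n - 1 multiplications in sequence, so for large n the library routine is the more practical choice.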

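<7> np.linalg.inv(): returns the inverse matrix; np.linalg.solve(a,b): solves the linear system

Random numbers: np.random.uniform(0,1,(3,4)), np.random.normal(0,1,(3,4)): both give (3,4) arrays

np.random.randint(-2,10,(3,4)): (lower bound included, upper excluded); np.random.RandomState(seed=),

np.random.randn(100): N(0,1) by default

<8> Correlation coefficients: out = np.corrcoef(X.T, X.T); return out[:4,:4]: the correlations among the first 4 columns of X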

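IV. Plotting: import matplotlib.pyplot as plt

plt.plot(sorted(x), sorted(y)), plt.title(), plt.xlabel(), plt.grid(), plt.show()

Scatter plot: plt.plot(a, 'ro'); points joined by lines: plt.plot(a, 'o-', mfc='w'): mfc is the fill color of the markers,

Annotating points with their coordinates:

Several lines / grids of plots: fig, ax = plt.subplots(2,2); ax[0,0].plot(a), ..., plt.show()

plt.subplot(2,2,1) (numbering starts at 1), plt.plot(a), ..., plt.show()

Box plot: plt.boxplot([a,b]) (one box per data set)

Bar charts and histograms: barh is the horizontal version of bar; example:

plt.hist(x, bins=100 (number of bins), color='r', cumulative=True): cumulative=True turns the histogram into an empirical distribution function,

Pie chart: plt.pie(x, autopct='%1.0f%%', labels=l, explode=(0,0,0,0.1), wedgeprops={'width': 0.1, 'edgecolor': 'k'}); k is black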

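V. pandas

Basic operations: Dataframe: Each row is given as a dict, list, Series, or NumPy array.

<1> wh = pd.read_csv(""); wh.drop("snow", axis=1).head(): drops a column without changing the original Dataframe

<2> wh["Rainy"] = wh["Precipitation amount (mm)"] > 5: adds a column;

Series: s = pd.Series([1,4,5], index=list("abc")); s.name = "grades". With the default (unchanged) integer index, s[k] looks up by label rather than by position; slices are positional, so s[-4:] gives the last four; pd.Series(val, index=idx)

Pay particular attention to the difference between row number and index! df[0] raises an error (df[...] selects a column by name)

• loc: use explicit indices • iloc: use the implicit integer indices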
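A small illustration of label-based versus position-based access, assuming pandas is installed; the Series, the DataFrame and their values are made up.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=list("abc"))
by_label = s.loc["b"]        # explicit index: the label "b"
by_position = s.iloc[0]      # implicit index: first position

df = pd.DataFrame({"fst": [1, 2], "scd": [3, 4]}, index=["a", "b"])
cell_by_label = df.loc["a", "fst"]     # row label, column label
cell_by_position = df.iloc[-1, -1]     # last row, last column

try:
    df[0]                    # df[...] looks for a COLUMN named 0
    column_zero_exists = True
except KeyError:
    column_zero_exists = False
```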

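Inspecting a dataframe: Dt.describe()

Creating a Dataframe: <1> df = pd.DataFrame([[a1,b1],[a2,b2]], columns=["fst","scd"], index=["a","b"])

<2> df = pd.DataFrame([{"fst":a1,"scd":b1},{"fst":a2,"scd":b2}]),

<3> pd.DataFrame({"a":s1,"b":s2}) # where s1, s2 are Series

<4> df.loc[0,"fst"]: by label; df.iloc[-1,-1]: by position; df.sort_values(): sort

Missing values: <1> wh[wh.isnull().any(axis=1)] (the rows that contain missing values),

<2> wh.dropna(axis=1): drops the columns containing nulls; wh.dropna(): drops the rows containing nulls

Conversion: <1> pd.to_numeric(series, downcast="integer", errors="coerce"): values that cannot be converted become NaN,

<2> df.astype({"a":float,"b":str}),

<3> pd.Series(["1","2"]).map(int): converts to int

<4> Convert to integer type: pd.to_numeric(pd.Series([1,1.0]), downcast="integer")

String handling: names = pd.Series(["donald", "theresa", "angela", "vladimir"])

names.str.capitalize(): capitalizes the first letter; via .str the values can be treated just like strings.

names.str.split(); names.str.split(expand=True): expands the pieces into separate columns

Combining:

<1> pd.concat([df1,df2], ignore_index=True): stacks rows; pd.concat([a,c], axis=1): joins columns

<2> pd.merge(df1,df2): joins on the shared columns and adds the other columns; by default only the rows present in both are kept;

(how="outer"): keeps the union of the rows of both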
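The default inner merge, the outer merge and row-wise concat can be compared on two toy tables, assuming pandas is installed; the column names city, pop and area are invented for the example.

```python
import pandas as pd

df1 = pd.DataFrame({"city": ["A", "B", "C"], "pop": [1, 2, 3]})
df2 = pd.DataFrame({"city": ["B", "C", "D"], "area": [20, 30, 40]})

# Default: join on the shared column "city", keep only matching rows.
inner = pd.merge(df1, df2)

# how="outer": keep every city from either table; gaps become NaN.
outer = pd.merge(df1, df2, how="outer")

# concat stacks rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([df1, df1], ignore_index=True)
```

The outer result corresponds to all = TRUE in R's merge, and the default inner result to all = FALSE.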

Transforming a df: wh.rename(columns={"a":"apple"}): renames a column,

Grouping: groups = wh3.groupby("Month"); len(groups) # 12; groups behaves much like a dictionary;

<2> the keys are the group names; groups.size() gives the number of rows in each group

<3> groups.get_group(2): extracts February,

<4> Mean by group: groups["time"].mean()

<5> Sort within each group: wh.groupby("Month").apply(lambda df: df.sort_values("Temperature"))

HW1: Basic operations

1. vector2 <- c("5",7,12) # Result: "5" "7" "12", because it converts 7 and 12 into character

2. Data repair: rain.df.fixed[rain.df.fixed<0] = 0 # We assign the negative data as 0

daily_fixed <- apply(rain.df.fixed[,c(4:27)],1,sum)

3. Extracting elements

HW2: Dataframe operations

1. Delete a column: state.df$Region <- NULL

2. Add a column: state.df <- data.frame(state.df, Center.x = state.center$x)

3. sapply returns a vector, lapply returns a list; both need only a dataframe + a function. For example: <1> lapply(pros[,c(1:4,6:9)], function(x) {plot(x~pros$svi)}) (sapply works here too) <2> lapply(tests, function(x) x$p.value). mapply: for the multi-argument case

4. R's dim returns all dimensions: unlike Python!

5. Number of missing values: print(paste("number of missing data:", sum(is.na(rio))))

6. Most frequent value: table(factor(rio$nationality))[which.max(table(factor(rio$nationality)))]

7. Largest total medal count: rio[which(rio$total == max(rio$total)),1:3]

8. Grouped computation with tapply: total.by.nat <- tapply(rio$total, rio$nationality, sum): groups by country and returns a vector whose names are the group names

9. Number of countries without medals: length(total.by.nat[which(total.by.nat == 0)])

10. Oldest athlete (possibly several): rio[which(rio$age == max(rio$age)),c(1:3,14)]

11. Ignore missing values: mean(x, na.rm = T)

12. Sort from largest to smallest: order(sports$ave_weight, decreasing = T)

13. Merge two data frames, keeping only the intersection: dat.ij <- merge(dat.m, dat.w, by = "Country", all = F); length(intersect(dat.m$Country, dat.w$Country))

14. Elements in which two sets differ: length(setdiff(dat.m$Country, dat.w$Country))

15. Number of rows with missing values: length(which(rowSums(is.na(dat.lj)) >= 1))

16. Set operations: setdiff, union, intersect

HW3: Plotting and ggplot

1. Trimming some points: x.trimmed <- x[(x<1)&(x>-1)]

2. plot(x, y, type="p", pch = c(1,16), col = c("black","blue","black","red")): note the recycling rule here; black, blue and red points alternate

3. Adding points and a connecting line to a plot: points(x2, y2, pch=16, col = "blue"); lines(x2, y2, col = "red", lwd = 2)

4. Adding a legend to the plot above: legend("bottomright", title="Legend", c("Cubic","Quadratic"), pch = c(1,16), col = c("black","blue"))

5. Drawing a gray confidence band (between the 0.1 and 0.9 quantiles): rect(xleft = -2, ybottom = qnorm(0.1), xright = 2, ytop = qnorm(0.9), col = "gray")

[Figures: output of the code snippets above, in order]

<1> Jitter plot: ggplot(data = gm, aes(x=continent, y=gdpPercap)) + geom_jitter()

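HW4: Text processing

1. text(1.5, 1, " Linear", font = 1): adds text to a plot at the given coordinates, which are those of the plot itself

2. Displaying a local image:

HW5: Random variables, algorithms and optimization

1. Sampling with replacement: letter = c("A","G","C","T"); letter_1000 <- sample(letter, 1000, replace = T)

Code for the A-R method and for gradient descent

Number of rows of the data: nrow(test_data)

A common combination: lm1 = lm(lpsa~lcavol+lweight, data=train_data); l_predict1 = predict(lm1, data.frame(lcavol = test_dat$lcavol, lweight = test_dat$lweight))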

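HW6,7: Python part

1. Adding an entry to a dict: result_dict["key1"] = value1; note that dict values can be of any type

2. len(set(str)) gives the number of distinct letters in str

3. Note the use of enumerate + str.find() here

4. Formatted output.

5. Extracting the elements > 0 from a list

6.

7. Norms of the row vectors of a matrix: np.sqrt(np.sum(a**2, axis=1))

8.

9.

10.

11. Rows whose second element is larger than their second-to-last element: array[array[:,1]>array[:,-2],:]

12.

Both the inner and the outer bootstrap use loops of 200 iterations

Here q is used to estimate t*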
