1. Getting Started
1.1 Installing and Loading Packages
# installing the dslabs package
install.packages("dslabs")
# loading the dslabs package into the R session
library(dslabs)
1.2 Installing Several Packages at Once
# to install two packages at the same time
install.packages(c("tidyverse", "dslabs"))
1.3 Listing Installed Packages
# to see the list of all installed packages
installed.packages()
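1.4 Some RStudio Keyboard Shortcuts
- Save the script: Ctrl+S (Windows), Command+S (Mac)
- Run the entire script: Ctrl+Shift+Enter (Windows), Command+Shift+Return (Mac)
- Run the current line of the script: Ctrl+Enter (Windows), Command+Return (Mac)
- Open a new script: Ctrl+Shift+N (Windows), Command+Shift+N (Mac)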
2. R Basics
2.1 Objects in R
Use <- to assign a value to a variable:
# assigning values to variables
a <- 1
b <- 1
c <- -1
List all variables in the current session:
ls()
2.2 Functions
Use help(function_name) or ?function_name to view a function's documentation.
args(function_name) shows the arguments a function accepts.
Comments in R start with #.
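As a small illustration of the points above, the sketch below inspects the base function log(). That function is chosen here only as an example and does not appear in the notes. The help calls are commented out because they open a viewer and are only useful in an interactive session.

```r
# args() prints the signature: log() takes x and an optional base
args(log)

# Arguments can be passed by position or by name
by_position <- log(8, 2)
by_name     <- log(x = 8, base = 2)
by_position   # 3
by_name       # 3

# Open the documentation page (interactive sessions only)
# help("log")
# ?log
```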
2.3 Indexing in R Starts at 1!
The first element of a vector has index 1, not 0.
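A minimal sketch of what 1-based indexing means in practice, using a made-up vector v. Note that index 0 is not an error in R; it simply selects nothing.

```r
v <- c(10, 20, 30)   # hypothetical example vector

v[1]           # first element: 10
v[length(v)]   # last element: 30
v[0]           # index 0 selects nothing: numeric(0)
```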
3. Data Types
3.1 Checking the Type of a Variable
Use the class() function:
# determining that the murders dataset is of the "data frame" class
class(murders)
3.2 Getting More Information About a Data Frame
# finding out more about the structure of the object
str(murders)
head(murders) shows the first 6 rows of the dataset.
3.3 Getting a Column of a Data Frame
Use the $ operator:
# using the accessor operator to obtain the population column
murders$population
The result is a vector (here, a numeric vector).
3.4 Getting the Column Names of a Data Frame
# displaying the variable names in the murders dataset
names(murders)
3.5 factor
# obtaining the levels of a factor
levels(murders$region)
The levels are the fixed set of categories a factor can take, much like the values of an enum.
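To make the enum analogy concrete, here is a minimal sketch that builds a factor from made-up region labels (not taken from the murders dataset) and shows that each value is stored as an integer index into the levels.

```r
# Hypothetical data for illustration only
region <- factor(c("West", "South", "West", "Northeast"))

levels(region)       # "Northeast" "South" "West" (sorted alphabetically by default)
as.integer(region)   # 3 2 3 1: the position of each value within levels(region)
table(region)        # number of observations per level
```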
4. Vectors, Sorting
4.1 Creating a Vector
# We may create vectors of class numeric or character with the concatenate function
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")
4.2 Naming the Elements of a Vector
# We can also name the elements of a numeric vector
# Note that the two lines of code below have the same result
codes <- c(italy = 380, canada = 124, egypt = 818)
codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)
The names() function can be used as well:
# We can also name the elements of a numeric vector using the names() function
codes <- c(380, 124, 818)
country <- c("italy","canada","egypt")
names(codes) <- country
4.3 Accessing Vector Elements by Index
Use []:
# Using square brackets is useful for subsetting to access specific elements of a vector
codes[2]
codes[c(1,3)]
codes[1:2]
4.4 Accessing Vector Elements by Name
# If the entries of a vector are named, they may be accessed by referring to their name
codes["canada"]
codes[c("egypt","italy")]
4.5 Type Coercion in R
R attempts to coerce the elements of a vector to a common type:
x <- c(1, "canada", 3)
Here 1 and 3 are converted to the strings "1" and "3".
as.character() converts numbers to strings.
as.numeric() converts strings to numbers; a string that cannot be converted becomes NA.
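The coercion rules above can be checked with a short sketch. The suppressWarnings() call is an addition made here: as.numeric() emits a warning when it introduces NA, and wrapping it keeps the output quiet.

```r
x <- c(1, "canada", 3)
class(x)   # "character": the numbers were coerced to strings
x          # "1" "canada" "3"

as.character(1:3)   # "1" "2" "3"

# "canada" cannot be parsed as a number, so it becomes NA
y <- suppressWarnings(as.numeric(x))
y          # 1 NA 3
```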
4.6 Sorting with sort()
Use sort() to sort a vector:
x <- c(31, 4, 15, 92, 65)
x
sort(x) # puts elements in order
# result: 4 15 31 65 92
4.7 order()
order() returns, for each position in the sorted vector, the index that element had in the original vector:
index <- order(x) # returns index that will put x in order
x[index] # rearranging by this index puts elements in order
order(x)
# order(x) returns: 2 3 1 5 4
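One reason order() is useful is that the same index can rearrange a second, parallel vector. The sketch below pairs x with made-up labels (an assumption for illustration) and reorders both consistently.

```r
x <- c(31, 4, 15, 92, 65)
labels <- c("a", "b", "c", "d", "e")   # hypothetical names paired with x

ord <- order(x)   # 2 3 1 5 4
x[ord]            # 4 15 31 65 92, the same as sort(x)
labels[ord]       # "b" "c" "a" "e" "d": labels follow their values
```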
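4.8 min() and max()
min() and max() return the smallest and largest values of a vector.
4.9 which.min() and which.max(): Getting the Corresponding Index
which.min() and which.max() return the index of the smallest and largest value, respectively.
If that value occurs more than once, the index of its first occurrence is returned.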
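A minimal sketch contrasting the value-returning and index-returning functions, using a made-up vector that deliberately contains repeated extremes to show the first-occurrence rule.

```r
x <- c(31, 4, 15, 92, 4, 92)   # hypothetical vector with ties

min(x)         # 4
max(x)         # 92
which.min(x)   # 2: the first of the two 4s
which.max(x)   # 4: the first of the two 92s

x[which.max(x)] == max(x)   # TRUE
```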
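4.10 rank()
order() returns, for each position in the sorted vector, the index of that element before sorting.
rank() returns, for each element, the position it would occupy after sorting.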
x <- c(31, 4, 15, 92, 65)
x
rank(x) # returns ranks (smallest to largest)
So, when x contains no ties, x[order(x)][rank(x)] and x[rank(x)][order(x)] both give back x.
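The relationship between order() and rank() can be verified directly. This sketch assumes a vector without ties, since rank() averages tied positions by default and the round trip would then no longer hold.

```r
x <- c(31, 4, 15, 92, 65)

o <- order(x)   # 2 3 1 5 4: where each sorted element came from
r <- rank(x)    # 3 1 2 5 4: where each original element ends up

# order and rank are inverse permutations, so both round trips recover x
identical(x[o][r], x)   # TRUE
identical(x[r][o], x)   # TRUE
```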
4.11 Vector Arithmetic
# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)]
# how to obtain the murder rate
murder_rate <- murders$total / murders$population * 100000
# ordering the states by murder rate, in decreasing order
murders$state[order(murder_rate, decreasing=TRUE)]
5. Indexing, Data Wrangling, Plots
5.1 Indexing
# defining murder rate as before
murder_rate <- murders$total / murders$population * 100000
# creating a logical vector that specifies if the murder rate in that state is less than or equal to 0.71
index <- murder_rate <= 0.71
# determining which states have murder rates less than or equal to 0.71
murders$state[index]
# calculating how many states have a murder rate less than or equal to 0.71
sum(index)
# creating the two logical vectors representing our conditions
west <- murders$region == "West"
safe <- murder_rate <= 1
# defining an index and identifying states with both conditions true
index <- safe & west
murders$state[index]
5.2 which()
which() returns the indices of the TRUE elements of a logical vector:
x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
which(x) # returns indices that are TRUE: 2 4 5
For example:
# to determine the murder rate in Massachusetts we may do the following
index <- which(murders$state == "Massachusetts")
index
murder_rate[index]
5.3 match()
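match() returns the positions at which the elements of one vector first appear in another vector.
Elements that are not found yield NA. For example: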
# to obtain the indices and subsequent murder rates of New York, Florida, Texas, we do:
index <- match(c("New York", "Florida", "Texas"), murders$state)
index
murders$state[index]
murder_rate[index]
# another example
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
match(x, y) # 1 NA NA 2 NA
match(y, x) # 1 4 NA
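A common use of match() is as a lookup: find the positions in a key vector, then use them to index a parallel value vector. The sketch below reuses the country codes from section 4; the wanted vector, including the missing "france", is a made-up query.

```r
country <- c("italy", "canada", "egypt")
code    <- c(380, 124, 818)

wanted <- c("egypt", "france", "italy")   # hypothetical query
idx <- match(wanted, country)             # 3 NA 1
code[idx]                                 # 818 NA 380: NA propagates for "france"
```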
5.4 %in%
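%in% tests whether each element of one vector appears in another vector; it is similar to match().
The difference is that match() returns indices, whereas %in% returns TRUE or FALSE.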
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
y %in% x # TRUE TRUE FALSE
5.5 mutate
mutate(), from the dplyr package, modifies a data frame.
For example, to add a column:
# installing and loading the dplyr package
install.packages("dplyr")
library(dplyr)
# adding a column with mutate
library(dslabs)
data("murders")
murders <- mutate(murders, rate = total / population * 100000)
5.6 filter
Use filter() to subset rows:
# subsetting with filter
filter(murders, rate <= 0.71)
It keeps only the rows that satisfy the given condition.
5.7 select
Use select() to pick columns:
# selecting columns with select
new_table <- select(murders, state, region, rate)
5.8 The Pipe
The pipe %>% passes data from one function to the next:
# using the pipe
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)
5.9 Creating a Data Frame
# creating a data frame with stringsAsFactors = FALSE
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
                     exam_1 = c(95, 80, 90, 85),
                     exam_2 = c(90, 85, 85, 90),
                     stringsAsFactors = FALSE)
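Note that before R 4.0.0, data.frame() converted character strings to factors by default; stringsAsFactors = FALSE turns this off.
Since R 4.0.0, strings are no longer converted to factors by default.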
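The effect of the argument is easy to see by setting it explicitly both ways, which gives the same result on any R version. The two small data frames below are made up for illustration.

```r
df_chr <- data.frame(names = c("John", "Juan"), stringsAsFactors = FALSE)
df_fct <- data.frame(names = c("John", "Juan"), stringsAsFactors = TRUE)

class(df_chr$names)   # "character"
class(df_fct$names)   # "factor"
```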
5.10 Scatterplots with plot()
# a simple scatterplot of total murders versus population
x <- murders$population /10^6
y <- murders$total
plot(x, y)
5.11 Histograms with hist()
# a histogram of murder rates
hist(murders$rate)
5.12 Boxplots with boxplot()
# boxplots of murder rates by region
boxplot(rate~region, data = murders)
6. Programming Basics
6.1 Conditionals
# an example showing the general structure of an if-else statement
a <- 0
if (a != 0) {
  print(1/a)
} else {
  print("No reciprocal for 0.")
}
# an example that tells us which states, if any, have a murder rate less than 0.5
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population*100000
ind <- which.min(murder_rate)
if (murder_rate[ind] < 0.5) {
  print(murders$state[ind])
} else {
  print("No state has a murder rate that low.")
}
# changing the condition to < 0.25 changes the result
if (murder_rate[ind] < 0.25) {
  print(murders$state[ind])
} else {
  print("No state has a murder rate that low.")
}
# the ifelse() function works similarly to an if-else conditional
a <- 0
ifelse(a > 0, 1/a, NA)
# the ifelse() function is particularly useful on vectors
a <- c(0,1,2,-4,5)
result <- ifelse(a > 0, 1/a, NA)
6.2 any() and all()
any() returns TRUE if at least one element of a logical vector is TRUE.
all() returns TRUE only if every element is TRUE.
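A minimal sketch of both functions, first on a literal logical vector and then on the result of a comparison. The values are made up for illustration.

```r
z <- c(TRUE, TRUE, FALSE)
any(z)   # TRUE: at least one element is TRUE
all(z)   # FALSE: not every element is TRUE

x <- c(3, 8, 12)   # hypothetical numeric vector
any(x > 10)        # TRUE: 12 exceeds 10
all(x > 0)         # TRUE: every element is positive
```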
6.3 is.na()
Check whether values are NA:
data(na_example)
no_nas <- ifelse(is.na(na_example), 0, na_example)
sum(is.na(no_nas))
6.4 Defining Your Own Functions
# example of defining a function to compute the average of a vector x
avg <- function(x) {
  s <- sum(x)
  n <- length(x)
  s/n
}
6.5 identical
Check whether two objects are exactly equal:
# we see that the above function and the built-in mean() function return identical results
x <- 1:100
identical(mean(x), avg(x))
6.6 Default Argument Values
# functions can have multiple arguments as well as default values
avg <- function(x, arithmetic = TRUE) {
  n <- length(x)
  ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
}
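To show the default in action, the sketch below redefines avg() so the block is self-contained, then calls it with and without the second argument. It uses a plain if/else instead of ifelse(), a substitution made for this sketch because the flag is a single value; the input vector is made up.

```r
avg <- function(x, arithmetic = TRUE) {
  n <- length(x)
  # arithmetic mean by default, geometric mean when arithmetic = FALSE
  if (arithmetic) sum(x) / n else prod(x)^(1 / n)
}

x <- c(1, 2, 4)   # hypothetical input

a_mean <- avg(x)                       # default used: (1 + 2 + 4) / 3
g_mean <- avg(x, arithmetic = FALSE)   # (1 * 2 * 4)^(1/3), about 2
a_mean
g_mean
```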
6.7 for Loops
# a very simple for-loop
for (i in 1:5) {
  print(i)
}
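A loop usually does more than print. As a slightly larger sketch, the loop below accumulates the sum 1 + 2 + ... + n for an arbitrarily chosen n and checks the result against the vectorized sum().

```r
n <- 10      # arbitrary example size
total <- 0

for (i in 1:n) {
  total <- total + i   # add the current value to the running sum
}

total                # 55
total == sum(1:n)    # TRUE: the vectorized form gives the same answer
```

In practice the vectorized sum(1:n) is the more idiomatic choice in R; the loop is shown only to illustrate the syntax.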