Data Science (Part 1): Basic R

1. Getting Started

1.1 Installing and Loading Packages

# installing the dslabs package
install.packages("dslabs")

# loading the dslabs package into the R session
library(dslabs)

1.2 Installing Several Packages at Once

# to install two packages at the same time
install.packages(c("tidyverse", "dslabs"))

1.3 Listing Installed Packages

# to see the list of all installed packages
installed.packages()

1.4 Some RStudio Keyboard Shortcuts

  • Save the script: Ctrl+S (Win), Command+S (Mac)
  • Run the entire script: Ctrl+Shift+Enter (Win), Command+Shift+Return (Mac)
  • Run the current line of the script: Ctrl+Enter (Win), Command+Return (Mac)
  • Open a new script: Ctrl+Shift+N (Win), Command+Shift+N (Mac)

2. R Basics

2.1 Objects in R

Use <- to assign a value to a variable:

# assigning values to variables
a <- 1
b <- 1
c <- -1

To list all variables in the current session:

ls()

2.2 Functions

To read a function's documentation, use help(function_name) or ?function_name.

args(function_name) shows the arguments that a function accepts.
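
As a small illustration, the sketch below inspects the base function log() and then calls it with arguments given by position and by name. The variable names by_position and by_name are only illustrative.

```r
# Show the arguments that log() accepts: x and base (base defaults to exp(1))
args(log)

# Arguments can be matched by position...
by_position <- log(8, 2)

# ...or by name, which is easier to read
by_name <- log(x = 8, base = 2)

by_position   # log base 2 of 8
by_name       # same value
```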

Comments in R start with #.

2.3 Indexing in R Starts at 1!

The first element of a vector has index 1, not 0.
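
A minimal sketch of what this means in practice, using a made-up vector v:

```r
v <- c(10, 20, 30)

first <- v[1]           # the first element: 10
last <- v[length(v)]    # the last element: 30

# Index 0 does not raise an error; it silently returns an empty vector
zero <- v[0]
length(zero)            # 0
```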

3. Data Types

3.1 Checking the Type of a Variable

Use the class() function:

# determining that the murders dataset is of the "data frame" class
class(murders)

3.2 Getting More Information About a Data Frame

# finding out more about the structure of the object
str(murders)

head(murders) shows the first 6 rows of the dataset.

3.3 Getting a Column of a Data Frame

Use the $ operator:

# using the accessor operator to obtain the population column
murders$population

The result is a vector.

3.4 Getting the Column Names of a Data Frame

# displaying the variable names in the murders dataset
names(murders)

3.5 factor

# obtaining the levels of a factor
levels(murders$region)

A factor is essentially an enumerated (categorical) value; its levels are the allowed categories.
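
The sketch below builds a small factor by hand (the values are made up, not taken from murders) to show that a factor stores integer codes that point into its levels:

```r
# A made-up character vector turned into a factor
region <- factor(c("West", "South", "West", "Northeast"))

# Levels are the distinct categories, sorted alphabetically by default
lv <- levels(region)          # "Northeast" "South" "West"

# Internally each element is an integer code into the levels
codes <- as.integer(region)   # 3 2 3 1

# Counting how many observations fall in each category
table(region)
```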

4. Vectors, Sorting

4.1 Creating a Vector

# We may create vectors of class numeric or character with the concatenate function
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")

4.2 Naming the Elements of a Vector

# We can also name the elements of a numeric vector
# Note that the two lines of code below have the same result
codes <- c(italy = 380, canada = 124, egypt = 818)
codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)

You can also use the names() function:

# We can also name the elements of a numeric vector using the names() function
codes <- c(380, 124, 818)
country <- c("italy","canada","egypt")
names(codes) <- country

4.3 Accessing Vector Elements by Index

Use []:

# Using square brackets is useful for subsetting to access specific elements of a vector
codes[2]
codes[c(1,3)]
codes[1:2]

4.4 Accessing Vector Elements by Name

# If the entries of a vector are named, they may be accessed by referring to their name
codes["canada"]
codes[c("egypt","italy")]

4.5 Type Coercion in R

R tries to coerce the elements of a vector to a common type:

x <- c(1, "canada", 3)

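Here 1 and 3 become the strings "1" and "3".

as.character() converts numbers to strings.

as.numeric() converts strings to numbers; anything that cannot be converted becomes NA (with a warning).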
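
A short sketch of these conversions; the variable names are only illustrative, and suppressWarnings() is used just to keep the output quiet:

```r
# Mixing numbers and strings coerces everything to character
x <- c(1, "canada", 3)
class(x)                         # "character"

# Explicit conversion in both directions
chars <- as.character(1:3)       # "1" "2" "3"
back <- as.numeric(chars)        # 1 2 3

# Values that cannot be parsed as numbers become NA (R also emits a warning)
mixed <- suppressWarnings(as.numeric(c("1", "b", "3")))
mixed                            # 1 NA 3
```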

4.6 Sorting with sort()

Use sort() to sort a vector:

x <- c(31, 4, 15, 92, 65)
x
sort(x)    # puts elements in order
# result: 4 15 31 65 92

4.7 order()

order() returns, for each position in the sorted result, the index that element had in the original vector:

index <- order(x)    # returns index that will put x in order
x[index]    # rearranging by this index puts elements in order
order(x)
# result of order(x): 2 3 1 5 4

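4.8 min() and max()

These return the smallest and largest values of a vector.

4.9 which.min() and which.max(): Getting the Corresponding Index

min() and max() return the smallest and largest values themselves.

which.min() and which.max() return the indices of the smallest and largest values.

If the value occurs more than once, the index of the first occurrence is returned.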
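
A minimal sketch with a made-up vector in which the minimum appears twice, to show the tie behaviour:

```r
x <- c(31, 4, 15, 92, 4)

lo <- min(x)              # 4
hi <- max(x)              # 92

# Indices rather than values
i_lo <- which.min(x)      # 2: the first of the two 4s
i_hi <- which.max(x)      # 4

x[i_lo] == lo             # TRUE
```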

4.10 rank()

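order() returns, for each position in the sorted vector, the index the element had before sorting.

rank() returns, for each element, the position it would occupy after sorting.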

x <- c(31, 4, 15, 92, 65)
x
rank(x)    # returns ranks (smallest to largest)

So, when there are no ties, x[order(x)][rank(x)] and x[rank(x)][order(x)] are both equal to x.
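
The relationship can be checked directly. The sketch below assumes a vector without ties, where order(x) and rank(x) are inverse permutations of each other:

```r
x <- c(31, 4, 15, 92, 65)

o <- order(x)    # 2 3 1 5 4
r <- rank(x)     # 3 1 2 5 4

# Sorting and then indexing by rank restores the original vector
same_1 <- identical(x[o][r], x)

# Indexing by rank and then by order also restores it
same_2 <- identical(x[r][o], x)

# rank is the inverse permutation of order (valid only without ties)
inverse <- identical(order(o), as.integer(r))
```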

4.11 Vector Arithmetic

# The name of the state with the maximum population is found by doing the following
murders$state[which.max(murders$population)]

# how to obtain the murder rate
murder_rate <- murders$total / murders$population * 100000

# ordering the states by murder rate, in decreasing order
murders$state[order(murder_rate, decreasing=TRUE)]

5. Indexing, Data Wrangling, Plots

5.1 Indexing

# defining murder rate as before
murder_rate <- murders$total / murders$population * 100000
# creating a logical vector that specifies if the murder rate in that state is less than or equal to 0.71
index <- murder_rate <= 0.71
# determining which states have murder rates less than or equal to 0.71
murders$state[index]
# calculating how many states have a murder rate less than or equal to 0.71
sum(index)

# creating the two logical vectors representing our conditions
west <- murders$region == "West"
safe <- murder_rate <= 1
# defining an index and identifying states with both conditions true
index <- safe & west
murders$state[index]

5.2 which()

which() returns the indices of the elements of a logical vector that are TRUE:

x <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
which(x)    # returns indices that are TRUE: 2 4 5

For example:

# to determine the murder rate in Massachusetts we may do the following
index <- which(murders$state == "Massachusetts")
index
murder_rate[index]

5.3 match()

match() returns the positions at which the elements of one vector appear in another vector.

Elements with no match yield NA. For example:

# to obtain the indices and subsequent murder rates of New York, Florida, Texas, we do:
index <- match(c("New York", "Florida", "Texas"), murders$state)
index
murders$state[index]
murder_rate[index]

# another example
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
match(x, y) # 1 NA NA  2 NA
match(y, x) # 1 4 NA

5.4 %in%

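%in% checks whether the elements of one vector are present in another vector, much like match().

The difference is that match() returns indices, while %in% returns TRUE or FALSE.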

x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")
y %in% x # TRUE TRUE FALSE
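
To make the connection concrete, this sketch reproduces the result of %in% from match() by testing which positions are not NA:

```r
x <- c("a", "b", "c", "d", "e")
y <- c("a", "d", "f")

idx <- match(y, x)          # 1 4 NA
present <- y %in% x         # TRUE TRUE FALSE

# An element is "in" x exactly when match() found a position for it
from_match <- !is.na(idx)
identical(present, from_match)
```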

5.5 mutate

mutate(), from the dplyr package, modifies a data frame.

For example, to add a column:

# installing and loading the dplyr package
install.packages("dplyr")
library(dplyr)

# adding a column with mutate
library(dslabs)
data("murders")
murders <- mutate(murders, rate = total / population * 100000)

5.6 filter

The filter() function keeps only the rows that satisfy a condition:

# subsetting with filter
filter(murders, rate <= 0.71)

In other words, it selects rows by a condition.

5.7 select

select() chooses columns:

# selecting columns with select
new_table <- select(murders, state, region, rate)

5.8 The Pipe

The pipe %>% passes the data from one function to the next:

# using the pipe
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)

5.9 Creating a Data Frame

# creating a data frame with stringsAsFactors = FALSE
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"), 
                     exam_1 = c(95, 80, 90, 85), 
                     exam_2 = c(90, 85, 85, 90),
                     stringsAsFactors = FALSE)

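Note that before R 4.0, data.frame() converted strings to factors by default; passing stringsAsFactors = FALSE turns this off.

Since R 4.0, strings are no longer converted to factors by default.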
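
The effect of the argument can be seen by setting it explicitly, which behaves the same way regardless of the R version. The data frame names df_chr and df_fct are only illustrative:

```r
# Strings kept as plain character vectors
df_chr <- data.frame(names = c("John", "Juan"), stringsAsFactors = FALSE)

# Strings converted to factors
df_fct <- data.frame(names = c("John", "Juan"), stringsAsFactors = TRUE)

class(df_chr$names)   # "character"
class(df_fct$names)   # "factor"
```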

5.10 plot: Scatterplots

# a simple scatterplot of total murders versus population
x <- murders$population /10^6
y <- murders$total
plot(x, y)

5.11 hist: Histograms

# a histogram of murder rates
hist(murders$rate)

5.12 boxplot: Boxplots

# boxplots of murder rates by region
boxplot(rate~region, data = murders)

6. Programming Basics

6.1 Conditionals

# an example showing the general structure of an if-else statement
a <- 0
if(a!=0){
  print(1/a)
} else{
  print("No reciprocal for 0.")
}

# an example that tells us which states, if any, have a murder rate less than 0.5
library(dslabs)
data(murders)
murder_rate <- murders$total / murders$population*100000
ind <- which.min(murder_rate)
if(murder_rate[ind] < 0.5){
  print(murders$state[ind]) 
} else{
  print("No state has murder rate that low")
}

# changing the condition to < 0.25 changes the result
if(murder_rate[ind] < 0.25){
  print(murders$state[ind]) 
} else{
  print("No state has a murder rate that low.")
}

# the ifelse() function works similarly to an if-else conditional
a <- 0
ifelse(a > 0, 1/a, NA)

# the ifelse() function is particularly useful on vectors
a <- c(0,1,2,-4,5)
result <- ifelse(a > 0, 1/a, NA)

6.2 any and all

any() is TRUE if at least one element of the vector is TRUE.

all() is TRUE only if every element of the vector is TRUE.
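
A minimal sketch with made-up vectors; in practice the logical vector usually comes from a comparison:

```r
z <- c(TRUE, FALSE, TRUE)
any_z <- any(z)            # TRUE: at least one element is TRUE
all_z <- all(z)            # FALSE: not every element is TRUE

x <- c(3, 7, 12)
any_big <- any(x > 10)     # TRUE: 12 is greater than 10
all_pos <- all(x > 0)      # TRUE: every element is positive
```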

6.3 is.na()

Check whether values are NA:

data(na_example)
no_nas <- ifelse(is.na(na_example), 0, na_example) 
sum(is.na(no_nas))

6.4 Defining Your Own Functions

# example of defining a function to compute the average of a vector x
avg <- function(x){
  s <- sum(x)
  n <- length(x)
  s/n
}

6.5 identical

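identical() checks whether two objects are exactly equal; here it compares the results of two functions:

# we see that the above function and the pre-built R mean() function give identical results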
x <- 1:100
identical(mean(x), avg(x))

6.6 Default Argument Values

# functions can have multiple arguments as well as default values
avg <- function(x, arithmetic = TRUE){
  n <- length(x)
  ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
}
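
A usage sketch for the function above, repeated here so the example runs on its own. The input vector is made up; with arithmetic = FALSE the function returns the geometric mean:

```r
avg <- function(x, arithmetic = TRUE){
  n <- length(x)
  ifelse(arithmetic, sum(x)/n, prod(x)^(1/n))
}

x <- c(1, 4, 16)

# Omitting the second argument uses the default, arithmetic = TRUE
arith <- avg(x)                      # (1 + 4 + 16) / 3 = 7

# Overriding the default by name gives the geometric mean
geom <- avg(x, arithmetic = FALSE)   # (1 * 4 * 16)^(1/3), approximately 4
```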

6.7 for Loops

# a very simple for-loop
for(i in 1:5){
  print(i)
}
