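I worked through a mock exam for the Big Data Analysis Engineer (빅데이터 분석기사) practical test.
(1) Short-answer section
Crawling / Metadata / Curse of dimensionality / Factor analysis / CART / Single stochastic imputation
Distribution visualization / Homoscedasticity / Stepwise selection / Elbow method
(2) Hands-on section
1. In the BostonHousing data set, replace the 50 largest values of the owner-occupied home value (medv) with the minimum of those 50 values, then find the mean over the records whose per-capita crime rate by town (crim) is 1 or higher. (The solution below takes the mean of crim.)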
> library(mlbench)
> library(dplyr)
> data("BostonHousing")
> ds <- BostonHousing
> str(ds)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : num 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ b : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
> head(ds)
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
> summary(ds)
crim zn indus chas nox rm age
Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850 Min. :3.561 Min. : 2.90
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.5380 Median :6.208 Median : 77.50
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.5547 Mean :6.285 Mean : 68.57
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710 Max. :8.780 Max. :100.00
dis rad tax ptratio b lstat medv
Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73 Min. : 5.00
1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02
Median : 3.207 Median : 5.000 Median :330.0 Median :19.05 Median :391.44 Median :11.36 Median :21.20
Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65 Mean :22.53
3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00
Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97 Max. :50.00
> ds2 <- ds %>% arrange(desc(medv))
> head(ds)
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
> head(ds2)
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
162 1.46336 0 19.58 0 0.605 7.489 90.8 1.9709 5 403 14.7 374.43 1.73 50
163 1.83377 0 19.58 1 0.605 7.802 98.2 2.0407 5 403 14.7 389.61 1.92 50
164 1.51902 0 19.58 1 0.605 8.375 93.9 2.1620 5 403 14.7 388.45 3.32 50
167 2.01019 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 369.30 3.70 50
187 0.05602 0 2.46 0 0.488 7.831 53.6 3.1992 3 193 17.8 392.63 4.45 50
196 0.01381 80 0.46 0 0.422 7.875 32.0 5.6484 4 255 14.4 394.23 2.97 50
> min_medv <- min(ds2$medv[1:50])
> min_medv
[1] 34.9
> ds2$medv[1:50] <- min_medv
> head(ds2)
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
162 1.46336 0 19.58 0 0.605 7.489 90.8 1.9709 5 403 14.7 374.43 1.73 34.9
163 1.83377 0 19.58 1 0.605 7.802 98.2 2.0407 5 403 14.7 389.61 1.92 34.9
164 1.51902 0 19.58 1 0.605 8.375 93.9 2.1620 5 403 14.7 388.45 3.32 34.9
167 2.01019 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 369.30 3.70 34.9
187 0.05602 0 2.46 0 0.488 7.831 53.6 3.1992 3 193 17.8 392.63 4.45 34.9
196 0.01381 80 0.46 0 0.422 7.875 32.0 5.6484 4 255 14.4 394.23 2.97 34.9
> ds3 <- ds2 %>% filter(crim >= 1)
> head(ds3)
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
162 1.46336 0 19.58 0 0.605 7.489 90.8 1.9709 5 403 14.7 374.43 1.73 34.9
163 1.83377 0 19.58 1 0.605 7.802 98.2 2.0407 5 403 14.7 389.61 1.92 34.9
164 1.51902 0 19.58 1 0.605 8.375 93.9 2.1620 5 403 14.7 388.45 3.32 34.9
167 2.01019 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 369.30 3.70 34.9
369 4.89822 0 18.10 0 0.631 4.970 100.0 1.3325 24 666 20.2 375.52 3.26 34.9
370 5.66998 0 18.10 1 0.631 6.683 96.8 1.3567 24 666 20.2 375.33 3.73 34.9
> result <- mean(ds3$crim)
> print(result)
[1] 10.13898
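
The steps of problem 1 can also be sketched in base R as a small reusable function. This is a minimal sketch, not part of the original solution: `cap_top_n` and the toy data frame are hypothetical names made up for illustration, and the same call should apply to BostonHousing with `col = "medv"` and `n = 50`.

```r
# Replace the n largest values of column `col` with the smallest of those n values.
# Base R only; `cap_top_n` is a hypothetical helper name.
cap_top_n <- function(df, col, n) {
  ord <- order(df[[col]], decreasing = TRUE)  # row indices, largest value first
  top <- ord[seq_len(min(n, nrow(df)))]       # indices of the n largest values
  df[[col]][top] <- min(df[[col]][top])       # overwrite them with their minimum
  df
}

# Toy data standing in for BostonHousing (values are made up)
toy <- data.frame(medv = c(50, 40, 30, 20, 10),
                  crim = c(2, 0.5, 3, 1, 0.1))

capped <- cap_top_n(toy, "medv", 3)
capped$medv                                        # 30 30 30 20 10
crim_mean <- mean(capped$crim[capped$crim >= 1])   # (2 + 3 + 1) / 3 = 2
print(crim_mean)
```

Because the replacement only changes medv, the mean of crim over the rows with crim >= 1 is the same with or without it. If the question is meant to ask for the mean of medv instead, the last step would average `capped$medv` over the same rows.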
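2. From the iris data set, sample 70% of the data and compute the standard deviation of the sepal length (Sepal.Length).
The answer to this problem depends on whether the 70% sample is drawn at random or taken in order from the top.
- Random sampling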
> data("iris")
> library(caret)
> set.seed(2022)
> idx <- createDataPartition(iris$Sepal.Length, p=0.7)
> x_train <- iris[idx$Resample1,]
> result <- sd(x_train$Sepal.Length)
> print(result)
[1] 0.8397347
- First 70% of the rows, in order
> n <- nrow(iris)
> n_train <- round(n * 0.7)
> ds <- iris[1:n_train, ]
> sd(ds$Sepal.Length)
[1] 0.6632932
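
For the random variant, a plain simple random sample is another option. As far as I understand caret's documentation, createDataPartition stratifies a numeric outcome by percentile groups, so its sample does not have to contain exactly 70% of the rows. The sketch below uses only base R `sample()`; the seed is arbitrary and the names `n_train`, `idx`, and `sd_random` are illustrative, so the resulting value will differ from the one shown above.

```r
data("iris")

set.seed(2022)                        # arbitrary seed, only for reproducibility
n_train <- round(nrow(iris) * 0.7)    # 70% of 150 rows = 105
idx <- sample(nrow(iris), n_train)    # simple random sample without replacement

sd_random <- sd(iris$Sepal.Length[idx])
print(sd_random)
```

Which sampling scheme the exam expects is not stated in the problem, so it seems safest to fix the seed and state the method used.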
3. In the mtcars data set, apply min-max scaling to the wt column, then count the records whose scaled value is greater than 0.5.
- Min-max normalization using preProcess from the caret package
> data("mtcars")
> ds <- mtcars
> head(ds)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> summary(ds)
mpg cyl disp hp drat wt qsec
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 Median :17.71
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 Mean :17.85
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
vs am gear carb
Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
> pre_ds <- preProcess(ds, method="range")
> ds2 <- predict(pre_ds, ds)
# Check that the data has been normalized
> head(ds2)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 0.4510638 0.5 0.2217511 0.2049470 0.5253456 0.2830478 0.2333333 0 1 0.5 0.4285714
Mazda RX4 Wag 0.4510638 0.5 0.2217511 0.2049470 0.5253456 0.3482485 0.3000000 0 1 0.5 0.4285714
Datsun 710 0.5276596 0.0 0.0920429 0.1448763 0.5023041 0.2063411 0.4892857 1 1 0.5 0.0000000
Hornet 4 Drive 0.4680851 0.5 0.4662010 0.2049470 0.1474654 0.4351828 0.5880952 1 0 0.0 0.0000000
Hornet Sportabout 0.3531915 1.0 0.7206286 0.4346290 0.1797235 0.4927129 0.3000000 0 0 0.0 0.1428571
Valiant 0.3276596 0.5 0.3838863 0.1872792 0.0000000 0.4978266 0.6809524 1 0 0.0 0.0000000
> ds3 <- ds2 %>% filter(wt > 0.5)
> result <- nrow(ds3)
> print(result)
[1] 11
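
The min-max scaling used here is the usual formula (x - min) / (max - min), so the count can be cross-checked without caret. A minimal base R sketch for the wt column alone, with `wt_scaled` and `n_heavy` as illustrative names:

```r
data("mtcars")

wt <- mtcars$wt
wt_scaled <- (wt - min(wt)) / (max(wt) - min(wt))  # min-max scale to [0, 1]

range(wt_scaled)                 # 0 and 1 after scaling
n_heavy <- sum(wt_scaled > 0.5)  # number of cars with scaled weight above 0.5
print(n_heavy)                   # 11, matching the caret-based result
```

Scaling only the column in question avoids transforming the other ten variables, which the problem does not ask for.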