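This is the second post of R practice based on the 2nd Big Data Analysis Engineer (빅데이터 분석기사) practical exam.
Once again I worked through two Task Type 1 problems; the R code and its output are shown below.
As before, the actual exam questions can't be reproduced for copyright reasons, so I solved problems written in a similar spirit.
1. Take the first 80% of the rows of the given data as the training set, replace the missing values (NA) in 'total_bedrooms' with the median of 'total_bedrooms', and compute the absolute difference between the standard deviation before and after the replacement.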
# Load the data
> ds <- read.csv("c:/data/exam2/housing.csv")
# Inspect the dataset
> str(ds)
'data.frame': 20640 obs. of 10 variables:
$ longitude : num -122 -122 -122 -122 -122 ...
$ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
$ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
$ total_rooms : num 880 7099 1467 1274 1627 ...
$ total_bedrooms : num 129 1106 190 235 280 ...
$ population : num 322 2401 496 558 565 ...
$ households : num 126 1138 177 219 259 ...
$ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
$ median_house_value: num 452600 358500 352100 341300 342200 ...
$ ocean_proximity : chr "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
> head(ds)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
1 -122.23 37.88 41 880 129 322 126 8.3252
2 -122.22 37.86 21 7099 1106 2401 1138 8.3014
3 -122.24 37.85 52 1467 190 496 177 7.2574
4 -122.25 37.85 52 1274 235 558 219 5.6431
5 -122.25 37.85 52 1627 280 565 259 3.8462
6 -122.25 37.85 52 919 213 413 193 4.0368
median_house_value ocean_proximity
1 452600 NEAR BAY
2 358500 NEAR BAY
3 352100 NEAR BAY
4 341300 NEAR BAY
5 342200 NEAR BAY
6 269700 NEAR BAY
> summary(ds)
longitude latitude housing_median_age total_rooms total_bedrooms population households
Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2 Min. : 1.0 Min. : 3 Min. : 1.0
1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0
Median :-118.5 Median :34.26 Median :29.00 Median : 2127 Median : 435.0 Median : 1166 Median : 409.0
Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636 Mean : 537.9 Mean : 1425 Mean : 499.5
3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0
Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320 Max. :6445.0 Max. :35682 Max. :6082.0
NA's :207
median_income median_house_value ocean_proximity
Min. : 0.4999 Min. : 14999 Length:20640
1st Qu.: 2.5634 1st Qu.:119600 Class :character
Median : 3.5348 Median :179700 Mode :character
Mean : 3.8707 Mean :206856
3rd Qu.: 4.7432 3rd Qu.:264725
Max. :15.0001 Max. :500001
> summary(ds$total_bedrooms)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.0 296.0 435.0 537.9 647.0 6445.0 207
# Extract the first 80% of the rows
> row <- nrow(ds)
> idx <- row*0.8
> ds2 <- ds[c(1:idx),]
> str(ds2)
'data.frame': 16512 obs. of 10 variables:
$ longitude : num -122 -122 -122 -122 -122 ...
$ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
$ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
$ total_rooms : num 880 7099 1467 1274 1627 ...
$ total_bedrooms : num 129 1106 190 235 280 ...
$ population : num 322 2401 496 558 565 ...
$ households : num 126 1138 177 219 259 ...
$ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
$ median_house_value: num 452600 358500 352100 341300 342200 ...
$ ocean_proximity : chr "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
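Here 20640 * 0.8 happens to be a whole number, so the split index is clean. For a row count where it isn't, it may be clearer to round down explicitly. A minimal sketch on toy data (the helper name `head_split` and the toy data frame are my own assumptions, not part of the exam):

```r
# Hypothetical helper: keep the first `ratio` share of rows, rounding down
head_split <- function(df, ratio = 0.8) {
  n_train <- floor(nrow(df) * ratio)   # explicit rounding of the cut-off row
  df[seq_len(n_train), , drop = FALSE] # seq_len() is also safe when n_train is 0
}

# Toy data with 7 rows: 7 * 0.8 = 5.6, so 5 rows are kept
toy <- data.frame(id = 1:7, x = c(3, 1, 4, 1, 5, 9, 2))
train <- head_split(toy)
nrow(train)
```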
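# Compute the median (from the training subset, ignoring NAs)
> med <- median(ds2$total_bedrooms, na.rm=TRUE)
# Replace the missing values
> ds3 <- ds2
> ds3$total_bedrooms <- ifelse(is.na(ds3$total_bedrooms), med, ds3$total_bedrooms)
# Compute the standard deviation before (NAs dropped) and after the replacement
> sd1 <- sd(ds2$total_bedrooms, na.rm=TRUE)
> sd2 <- sd(ds3$total_bedrooms)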
> sd1-sd2
[1] 1.975147
# Take the absolute value and print it
> result <- abs(sd1-sd2)
> print(result)
[1] 1.975147
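The steps for problem 1 can also be bundled into one small function, which makes it easier to practice on other columns. This is only a sketch under my own assumptions: the function name, its arguments, and the toy data are made up for illustration, and the median is taken from the training subset.

```r
# Hypothetical helper: absolute change in sd after median imputation
# on the first `ratio` share of rows of `df`, for the column named `col`
sd_shift_after_median_impute <- function(df, col, ratio = 0.8) {
  train <- df[seq_len(floor(nrow(df) * ratio)), , drop = FALSE]
  x <- train[[col]]
  med <- median(x, na.rm = TRUE)        # median of the non-missing values
  filled <- ifelse(is.na(x), med, x)    # NAs replaced by the median
  abs(sd(x, na.rm = TRUE) - sd(filled)) # sd before (NAs dropped) vs. after
}

# Toy data: 10 rows, so the first 8 rows form the training subset
toy <- data.frame(v = c(1, NA, 3, 5, 100, 7, NA, 9, 11, 13))
sd_shift_after_median_impute(toy, "v")
```

On the real data the call would look like `sd_shift_after_median_impute(ds, "total_bedrooms")`.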
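2. Compute the sum of the outliers in the charges column of the insurance dataset (an outlier is a value 1.5 or more standard deviations away from the mean).
# Load the data
> ds <- read.csv("c:/data/exam2/insurance.csv")
# Inspect the data
> str(ds)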
> str(ds)
'data.frame': 1338 obs. of 7 variables:
$ age : int 19 18 28 33 32 31 46 37 37 60 ...
$ sex : chr "female" "male" "male" "male" ...
$ bmi : num 27.9 33.8 33 22.7 28.9 ...
$ children: int 0 1 3 0 0 0 1 3 2 0 ...
$ smoker : chr "yes" "no" "no" "no" ...
$ region : chr "southwest" "southeast" "southeast" "northwest" ...
$ charges : num 16885 1726 4449 21984 3867 ...
> head(ds)
age sex bmi children smoker region charges
1 19 female 27.900 0 yes southwest 16884.924
2 18 male 33.770 1 no southeast 1725.552
3 28 male 33.000 3 no southeast 4449.462
4 33 male 22.705 0 no northwest 21984.471
5 32 male 28.880 0 no northwest 3866.855
6 31 female 25.740 0 no southeast 3756.622
> summary(ds)
age sex bmi children smoker region
Min. :18.00 Length:1338 Min. :15.96 Min. :0.000 Length:1338 Length:1338
1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000 Class :character Class :character
Median :39.00 Mode :character Median :30.40 Median :1.000 Mode :character Mode :character
Mean :39.21 Mean :30.66 Mean :1.095
3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
Max. :64.00 Max. :53.13 Max. :5.000
charges
Min. : 1122
1st Qu.: 4740
Median : 9382
Mean :13270
3rd Qu.:16640
Max. :63770
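# Compute the outlier bounds
# (variable names chosen so they don't shadow the base functions mean, sd, max, min)
> avg <- mean(ds$charges)
> sdev <- sd(ds$charges)
> upper <- avg + 1.5*sdev
> lower <- avg - 1.5*sdev
# Extract a dataset containing only the outliers
> library(dplyr)
> ds2 <- ds %>% filter(charges <= lower | charges >= upper)
# Print the sum of charges over the outliers
> result <- sum(ds2$charges)
> print(result)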
[1] 6421430
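The same rule can be written in base R without dplyr, which is handy if a package is not available in the exam environment. A minimal sketch on toy numbers (the function name `sum_sd_outliers` and the toy vector are assumptions of mine):

```r
# Hypothetical helper: sum of values lying k or more sd away from the mean
sum_sd_outliers <- function(x, k = 1.5) {
  avg <- mean(x)
  sdev <- sd(x)
  is_out <- x <= avg - k * sdev | x >= avg + k * sdev  # two-sided rule
  sum(x[is_out])
}

# Toy data: mean is 17 and sd is about 14.58, so the bounds are
# roughly -4.87 and 38.87, and only 50 counts as an outlier
toy <- c(10, 12, 11, 13, 12, 11, 50)
sum_sd_outliers(toy)  # 50
```

On the real data the call would be `sum_sd_outliers(ds$charges)`.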