3.2.1 R 데이터 불러오기

Ch2. 인사관리 데이터를 통한 R 데이터 핸들링 1편

데이터 다운로드 링크: https://www.kaggle.com/liujiaqi/hr-comma-sepcsv

변수 설명
- satisfaction_level : 직무 만족도
- last_evaluation : 마지막 평가점수
- number_project : 진행 프로젝트 수
- average_monthly_hours : 월평균 근무시간
- time_spend_company : 근속년수
- work_accident : 사건사고 여부(0: 없음, 1: 있음)
- left : 이직 여부(0: 잔류, 1: 이직)
- promotion_last_5years: 최근 5년간 승진여부(0: 승진 x, 1: 승진)
- sales : 부서
- salary : 임금 수준
데이터 불러오기

DATA = read.csv('C:/R/HR_comma_sep.csv')
DATA = read.csv('C:\\R\\HR_comma_sep.csv')

데이터 파악하기

head(DATA) # 데이터 윗부분 띄우기

##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  3             0    1                     0 sales    low
## 2                  6             0    1                     0 sales medium
## 3                  4             0    1                     0 sales medium
## 4                  5             0    1                     0 sales    low
## 5                  3             0    1                     0 sales    low
## 6                  3             0    1                     0 sales    low

str(DATA) # 데이터 strings 파악

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

summary(DATA) # 요약된 데이터 살펴보기

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923

데이터 strings 변경

summary(DATA$left)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2381  0.0000  1.0000

DATA$Work_accident=as.factor(DATA$Work_accident)
DATA$left=as.factor(DATA$left)
DATA$promotion_last_5years=as.factor(DATA$promotion_last_5years)

summary(DATA$left)

##     0     1 
## 11428  3571

left 변수가 numeric으로 되어 있을 때와 Factor로 되어 있을 때, 요약값이 다르게 표시 되는 것을 알 수 있습니다.

재미라도 꿈꾸자

이 블로그 검색

3.2.1 R 데이터 불러오기

Ch2. 인사관리 데이터를 통한 R 데이터 핸들링 1편

태그

이 블로그의 인기 게시물

6.1.2 고수들이 자주 쓰는 R코드 소개 2편 [중복 데이터 제거 방법]

4.4.1 R 문자열(TEXT) 데이터 처리하기 1

3. Resampling 방법론(Leave one out , Cross Validation)

4. 통계적 추정(점추정,구간추정)

3.2.3 R 시각화[ggplot2] 2편 (히스토그램, 밀도글래프, 박스플롯, 산점도)