
Text mining sample using R

늘근이 2016. 1. 19. 22:24
> library(tm)

> reut21578 <- system.file("texts","crude",package="tm")

> crudeCorp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

> crudeCorp[[1]]


<<PlainTextDocument>>

Metadata: 16

Content: chars: 527


> inspect(crudeCorp[1])

<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1

[[1]]
<<PlainTextDocument>>
Metadata: 16
Content: chars: 527
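
To look inside a document rather than only at its summary, something like the following may help. This is a minimal sketch that assumes the `crude` dataset bundled with tm (loaded with `data("crude")`) holds the same 20 Reuters articles as the XML files read above; the variable name `first` is only illustrative.

```r
library(tm)

# Assumption: the bundled dataset mirrors the files under texts/crude
data("crude")

length(crude)                  # number of documents in the corpus
first <- crude[[1]]            # a single PlainTextDocument
meta(first, "id")              # one of its document-level metadata fields
substr(as.character(first)[1], 1, 60)   # first 60 characters of the text
```

Using the bundled dataset avoids the XML reader, which can be convenient when the XML package is not installed.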

 
 
Strip extra whitespace (runs of whitespace collapse into a single space)
 

> crudeCorp <- tm_map(crudeCorp,stripWhitespace)

 

> inspect(crudeCorp[1])

<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1

[[1]]
<<PlainTextDocument>>
Metadata: 16
Content: chars: 514
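
The drop from 527 to 514 characters comes from runs of spaces, tabs and newlines being collapsed. A small sketch on a plain string (the sample text in `messy` is made up for illustration):

```r
library(tm)

# Hypothetical sample text with repeated spaces, newlines and a tab
messy <- "crude   oil\n\n prices\tfell"

# stripWhitespace also works on plain character vectors
clean <- stripWhitespace(messy)

clean
nchar(messy) - nchar(clean)    # how many characters were dropped
```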

 

 

Convert to lowercase

 

> crudeCorp <- tm_map(crudeCorp, content_transformer(tolower))
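
The reason for wrapping `tolower` in `content_transformer` is that the wrapper changes only the text and hands back a document object of the same class. A minimal sketch on a toy corpus (`toyCorp` and its two sentences are assumptions for illustration, not part of the original data):

```r
library(tm)

# Toy corpus built from a character vector
toyCorp <- VCorpus(VectorSource(c("Crude OIL Prices", "OPEC Said")))

# content_transformer applies tolower to the text and keeps the document class
lowered <- tm_map(toyCorp, content_transformer(tolower))

class(lowered[[1]])
content(lowered[[1]])
```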

 

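Or, if a type-related error occurs after a tm version update (this typically happens when tolower was applied without content_transformer), convert the documents back to PlainTextDocument: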

 

>  crudeCorp <- tm_map(crudeCorp, PlainTextDocument)
 
 
 
Remove stopwords
 
> crudeCorp <- tm_map(crudeCorp, removeWords, stopwords("english"))
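
The order of the steps matters here: the entries in `stopwords("english")` are lowercase and the match is case-sensitive, so lowercasing should come first. A sketch on a plain string (the sentence is made up):

```r
library(tm)

sentence <- "The price of the oil"

# Without lowercasing, the capitalized "The" is not matched
kept <- removeWords(sentence, stopwords("english"))

# Lowercasing first lets every stopword match
removed <- removeWords(tolower(sentence), stopwords("english"))

kept
removed
```

The removed words leave extra spaces behind, so running stripWhitespace again afterwards may be worthwhile.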
 
 
Create the document-term matrix
 
 
> crudeDtm <- DocumentTermMatrix(crudeCorp)
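
A document-term matrix has one row per document and one column per term, with counts in the cells. A small sketch on a toy corpus (the two sentences are assumptions for illustration) makes the layout visible:

```r
library(tm)

toyCorp <- VCorpus(VectorSource(c("oil prices rise", "opec cuts oil output")))
toyDtm  <- DocumentTermMatrix(toyCorp)

dim(toyDtm)          # documents x terms
as.matrix(toyDtm)    # dense view; fine for a toy corpus, costly for a large one
```

Note that by default tm drops terms shorter than three characters, which is why very short tokens never show up as columns.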
 
 
Find terms that occur at least 10 times across the documents
 
> findFreqTerms(crudeDtm, 10)
 [1] "bpd"        "crude"      "dlrs"       "government"
 [5] "kuwait"     "last"       "market"     "mln"       
 [9] "new"        "official"   "oil"        "one"       
[13] "opec"       "pct"        "price"      "prices"    
[17] "reuter"     "said"       "said."      "saudi"     
[21] "sheikh"     "u.s."       "will" 
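
The threshold applies to the total count of a term over all documents, not to the count within one document. A sketch on a toy corpus (hypothetical sentences) that compares findFreqTerms with a manual column sum:

```r
library(tm)

toyCorp <- VCorpus(VectorSource(c("oil oil prices", "oil opec", "opec prices prices")))
toyDtm  <- DocumentTermMatrix(toyCorp)

# Terms whose total count over all documents is at least 3
frequent <- findFreqTerms(toyDtm, 3)

# The same selection done by hand on the dense matrix
totals <- colSums(as.matrix(toyDtm))
manual <- names(totals)[totals >= 3]

frequent
manual
```

In the listing above, "said" and "said." are counted as separate terms because punctuation was never removed; applying tm's removePunctuation in an earlier tm_map step would likely merge them, at the cost of turning "u.s." into "us".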
 

 

 

Find terms whose correlation with "crude" is at least 0.7

 

> findAssocs(crudeDtm, "crude" , 0.7)
$crude
 dlr corp fell four 
0.75 0.71 0.71 0.70 
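
The score is the correlation between the per-document counts of two terms, rounded to two decimals. A sketch on a toy corpus (the four documents are assumptions for illustration) that checks the result against `cor`:

```r
library(tm)

toyCorp <- VCorpus(VectorSource(c("crude oil",
                                  "crude crude oil oil",
                                  "opec",
                                  "crude opec")))
toyDtm <- DocumentTermMatrix(toyCorp)

assoc <- findAssocs(toyDtm, "crude", 0.7)

# Same number computed by hand from the two count columns
m <- as.matrix(toyDtm)
manual <- round(cor(m[, "crude"], m[, "oil"]), 2)

assoc$crude
manual
```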

 

 

 

Dictionary (restrict the matrix to a chosen set of terms)

 

> crudeDic = c("prices" , "crude" , "opec")
> inspect(DocumentTermMatrix(crudeCorp, list(dictionary = crudeDic)))
<<DocumentTermMatrix (documents: 20, terms: 3)>>
Non-/sparse entries: 31/29
Sparsity           : 48%
Maximal term length: 6
Weighting          : term frequency (tf)

              Terms
Docs           crude opec prices
  character(0)     2    0      3
  character(0)     0   10      3
  character(0)     2    0      0
  character(0)     3    0      0
  character(0)     0    0      0
  character(0)     1    6      2
  character(0)     0    1      0
  character(0)     0    2      1
  character(0)     0    1      0
  character(0)     0    6      7
  character(0)     5    5      4
  character(0)     2    1      0
  character(0)     0    2      4
  character(0)     2    4      1
  character(0)     0    0      0
  character(0)     0    0      2
  character(0)     0    0      2
  character(0)     2    0      2
  character(0)     0    0      2
  character(0)     1    0      0
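
With a dictionary, the matrix has exactly the listed terms as columns, and documents that contain none of them simply get a row of zeros. A sketch on a toy corpus (`toyCorp` and `toyDic` are illustrative names and data):

```r
library(tm)

toyCorp <- VCorpus(VectorSource(c("crude prices rose", "opec met", "market calm")))
toyDic  <- c("prices", "crude", "opec")

# Only the dictionary terms become columns
dicDtm <- DocumentTermMatrix(toyCorp, list(dictionary = toyDic))

Terms(dicDtm)
as.matrix(dicDtm)
```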
 
 

Remove terms with high sparsity (with 0.3, only terms that appear in more than 70% of the documents are kept)

 

 

> removeSparseTerms(crudeDtm, 0.3)
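
removeSparseTerms returns a new matrix instead of modifying its argument, so the result needs to be assigned if it is to be used later. A sketch on a toy corpus (hypothetical sentences; `denseDtm` is an illustrative name):

```r
library(tm)

toyCorp <- VCorpus(VectorSource(c("oil prices", "oil opec", "oil opec", "oil market")))
toyDtm  <- DocumentTermMatrix(toyCorp)

# Keep only terms present in more than 70% of the documents
denseDtm <- removeSparseTerms(toyDtm, 0.3)

Terms(toyDtm)      # all terms before filtering
Terms(denseDtm)    # terms that survive the sparsity filter
```

Here only "oil" appears in enough documents to survive; "opec" is in half of them and is dropped.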

 


 

