> library(tm)
> reut21578 <- system.file("texts","crude",package="tm")
> crudeCorp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))
> crudeCorp[[1]]
<<PlainTextDocument>>
Metadata: 16
Content: chars: 527
> inspect(crudeCorp[1])
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
[[1]]
<<PlainTextDocument>>
Metadata: 16
Content: chars: 527
Strip extra whitespace
> crudeCorp <- tm_map(crudeCorp,stripWhitespace)
> inspect(crudeCorp[1])
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1
[[1]]
<<PlainTextDocument>>
Metadata: 16
Content: chars: 514
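The character count of the first document drops from 527 to 514 because runs of whitespace are collapsed into single spaces. Here is a minimal base-R sketch of that effect, assuming the transformation behaves like a regular-expression replacement of whitespace runs. The helper name `collapse_ws` and the sample string are made up for illustration and are not part of tm.

```r
# Hypothetical helper (not a tm function): collapse each run of
# whitespace characters into a single space.
collapse_ws <- function(x) gsub("[[:space:]]+", " ", x)

# Made-up sample text with a double space, a tab and a newline.
raw   <- "Crude  oil\tprices \n rose"
clean <- collapse_ws(raw)

nchar(raw)    # length before collapsing
nchar(clean)  # length after collapsing (shorter)
clean
```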
Convert to lowercase
> crudeCorp <- tm_map(crudeCorp, content_transformer(tolower))
Or (*if a type-related error occurs after a tm version update, convert the documents back to PlainTextDocument)
> crudeCorp <- tm_map(crudeCorp, PlainTextDocument)
Remove stopwords
> crudeCorp <- tm_map(crudeCorp, removeWords, stopwords("english"))
Create the document-term matrix
> crudeDtm <- DocumentTermMatrix(crudeCorp)
Look up terms that occur at least 10 times across the documents
> findFreqTerms(crudeDtm, 10)
[1] "bpd" "crude" "dlrs" "government"
[5] "kuwait" "last" "market" "mln"
[9] "new" "official" "oil" "one"
[13] "opec" "pct" "price" "prices"
[17] "reuter" "said" "said." "saudi"
[21] "sheikh" "u.s." "will"
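The threshold applies to the total count of a term over the whole corpus. A minimal base-R sketch of the same idea on a toy matrix follows; the counts, document names and the threshold of 3 are made up for illustration, and the sketch assumes the lookup amounts to comparing column sums against the lower bound.

```r
# Toy document-term matrix with made-up counts:
# rows are documents, columns are terms.
dtm <- matrix(c(4, 0, 1,
                5, 2, 0,
                3, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(Docs  = c("d1", "d2", "d3"),
                              Terms = c("oil", "opec", "saudi")))

# Total count of each term over all documents.
totals <- colSums(dtm)

# Keep the terms whose total count reaches the (made-up) threshold.
frequent <- names(totals)[totals >= 3]
frequent
```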
Find terms whose correlation with "crude" is at least 0.7
> findAssocs(crudeDtm, "crude" , 0.7)
$crude
dlr corp fell four
0.75 0.71 0.71 0.70
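The scores above can be read as correlations between the count column of "crude" and the count column of each other term. A rough base-R sketch on a toy matrix follows, assuming a plain Pearson correlation over documents; the counts are made up, and the sketch is not tm's own implementation.

```r
# Toy document-term matrix with made-up counts:
# four documents (rows) and three terms (columns).
dtm <- matrix(c(2, 1, 0,
                0, 0, 3,
                3, 2, 1,
                1, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("crude", "dlr", "opec")))

# Pearson correlation of every term column with the "crude" column.
cors <- cor(dtm)["crude", ]

# Drop the term itself, then keep only correlations of at least 0.7.
cors  <- cors[names(cors) != "crude"]
assoc <- round(cors[cors >= 0.7], 2)
assoc
```

In this toy data "dlr" rises and falls together with "crude" and passes the limit, while "opec" is negatively correlated and is dropped.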
Term dictionary
> crudeDic = c("prices" , "crude" , "opec")
> inspect(DocumentTermMatrix(crudeCorp, list(dictionary = crudeDic)))
<<DocumentTermMatrix (documents: 20, terms: 3)>>
Non-/sparse entries: 31/29
Sparsity : 48%
Maximal term length: 6
Weighting : term frequency (tf)
              Terms
Docs           crude opec prices
  character(0)     2    0      3
  character(0)     0   10      3
  character(0)     2    0      0
  character(0)     3    0      0
  character(0)     0    0      0
  character(0)     1    6      2
  character(0)     0    1      0
  character(0)     0    2      1
  character(0)     0    1      0
  character(0)     0    6      7
  character(0)     5    5      4
  character(0)     2    1      0
  character(0)     0    2      4
  character(0)     2    4      1
  character(0)     0    0      0
  character(0)     0    0      2
  character(0)     0    0      2
  character(0)     2    0      2
  character(0)     0    0      2
  character(0)     1    0      0
Remove terms with high sparsity
> removeSparseTerms(crudeDtm, 0.3)
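The call returns a new matrix rather than modifying crudeDtm, so the result has to be assigned to keep it. To show what the 0.3 threshold means, here is a minimal base-R sketch on a toy matrix, assuming a term's sparsity is the share of documents in which it does not occur and that only terms below the threshold are kept. The counts are made up for illustration.

```r
# Toy document-term matrix with made-up counts:
# five documents (rows) and three terms (columns).
dtm <- matrix(c(1, 0, 2,
                3, 0, 0,
                1, 1, 0,
                2, 0, 0,
                1, 0, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(NULL, c("oil", "kuwait", "opec")))

# Sparsity per term: fraction of documents where the count is zero.
sparsity <- colMeans(dtm == 0)

# Keep only the terms whose sparsity is below the 0.3 threshold.
kept <- dtm[, sparsity < 0.3, drop = FALSE]
colnames(kept)
```

Only "oil" occurs in enough documents to survive here, which illustrates how aggressive a threshold of 0.3 is.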