Natural Language Processing in Python

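An aside on cursive fonts... My colleagues at the job in Gwanghwamun that I recently joined first complained that my code was hard to read in code review, then teased me for being a weirdo, and now that I have been using a cursive font for about six months, they just shrug it off.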

The result is a word cloud, where each word is drawn at a size that depends on how often it occurs. It looks quite impressive, but the source is simple.

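# NLP toolkit, word cloud renderer, and plotting library
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Download the corpora used in the NLTK book (includes the Gutenberg texts)
nltk.download("book", quiet=False)

# List the available Gutenberg files (output is only shown in an interactive console)
nltk.corpus.gutenberg.fileids()

# Load the raw text of Jane Austen's "Emma"
emma_raw = nltk.corpus.gutenberg.raw("austen-emma.txt")

# Build the word cloud; more frequent words are drawn larger
cw = WordCloud(max_font_size=100).generate(emma_raw)

# Render the word cloud image
plt.imshow(cw, interpolation="mitchell")
plt.show()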

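Really, that's it... But please don't look down on engineers because of it, flexible as Python may be. No matter how small a building is, once it has been built to the requirements, moving it 1 cm sideways is tremendously hard.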
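To make the "size follows frequency" idea behind the word cloud concrete, here is a minimal standard-library sketch. The function `font_sizes`, its parameters, and the linear scaling rule are assumptions made for illustration; the wordcloud library uses its own, more involved sizing and layout logic, so this is not how `WordCloud.generate` is implemented.

```python
import re
from collections import Counter


def font_sizes(text, max_font_size=100, min_font_size=10, top_n=5):
    """Map the top_n most frequent words to font sizes.

    Hypothetical helper: sizes scale linearly with the count,
    relative to the most frequent word.
    """
    # Lowercase the text and pull out runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    top = counts[0][1]  # count of the most frequent word
    return {
        word: max(min_font_size, round(max_font_size * count / top))
        for word, count in counts
    }


sample = "emma emma emma emma harriet harriet knightley"
sizes = font_sizes(sample)
print(sizes)  # {'emma': 100, 'harriet': 50, 'knightley': 25}
```

The most frequent word gets `max_font_size`, and every other word shrinks in proportion to its count, which is roughly the visual effect seen in the plot.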

Now, some related links.

https://www.nltk.org/book/

NLTK Book

https://github.com/nltk

Natural Language Toolkit

Typing some of the code from there into the PyCharm Python console gives:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> sense = nltk.corpus.gutenberg.words('austen-sense.txt')
>>> len(sense)
141576

As for text preprocessing:

https://github.com/lovit This person seems to be very good at natural language processing.
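For readers who have not seen text preprocessing before, the sketch below shows the kind of step it usually refers to: lowercasing, tokenizing, and dropping stopwords before counting. It uses only the standard library and is not taken from the lovit repositories; the `preprocess` function and the tiny `STOPWORDS` set are assumptions for illustration, and the sample sentence is the opening line of Sense and Sensibility.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "had"}


def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


sample = "The family of Dashwood had long been settled in Sussex."
tokens = preprocess(sample)
print(tokens)
# ['family', 'dashwood', 'long', 'been', 'settled', 'sussex']

# Frequencies of the cleaned tokens; every token occurs once in this sample
print(Counter(tokens))
```

A cleaned token list like this could be joined back into a string and passed to a word cloud, so that filler words do not dominate the picture.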
