All of the data is attached above.
The AutoCrawler solution does not work properly when the dataset is large.
I suspect the saving mechanism is broken.
After one more try without the "naver" option, I will fix the code.
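One hedged guess before the retry: save_object_to_file in the snapshot below swallows exceptions and writes whatever requests returns, even an HTTP error page, so a bad save only surfaces later at validate_image. A minimal stricter variant for debugging (the save_binary_response name is mine, assuming the same streamed requests.get call used in download_images):

import shutil

def save_binary_response(response, file_path):
    # fail loudly on 4xx/5xx instead of silently saving an error page
    response.raise_for_status()
    with open(file_path, 'wb') as f:
        shutil.copyfileobj(response.raw, f)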
current code snapshot
---
"""
Copyright 2018 YoongiKim
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import os
import requests
import shutil
from multiprocessing import Pool
import argparse
from collect_links import CollectLinks
import imghdr
import base64

class Sites:
    GOOGLE = 1
    NAVER = 2
    GOOGLE_FULL = 3
    NAVER_FULL = 4

    @staticmethod
    def get_text(code):
        if code == Sites.GOOGLE:
            return 'google'
        elif code == Sites.NAVER:
            return 'naver'
        elif code == Sites.GOOGLE_FULL:
            return 'google'
        elif code == Sites.NAVER_FULL:
            return 'naver'

    @staticmethod
    def get_face_url(code):
        # bug fix: `code == Sites.GOOGLE or Sites.GOOGLE_FULL` is always truthy,
        # so the Naver branch below was unreachable
        if code == Sites.GOOGLE or code == Sites.GOOGLE_FULL:
            return "&tbs=itp:face"
        if code == Sites.NAVER or code == Sites.NAVER_FULL:
            return "&face=1"

class AutoCrawler:
    def __init__(self, skip_already_exist=True, n_threads=4, do_google=True, do_naver=True, download_path='download',
                 full_resolution=False, face=False):
        """
        :param skip_already_exist: Skips keyword already downloaded before. This is needed when re-downloading.
        :param n_threads: Number of threads to download.
        :param do_google: Download from google.com (boolean)
        :param do_naver: Download from naver.com (boolean)
        :param download_path: Download folder path
        :param full_resolution: Download full resolution image instead of thumbnails (slow)
        :param face: Face search mode
        """
        self.skip = skip_already_exist
        self.n_threads = n_threads
        self.do_google = do_google
        self.do_naver = do_naver
        self.download_path = download_path
        self.full_resolution = full_resolution
        self.face = face

        os.makedirs('./{}'.format(self.download_path), exist_ok=True)

    @staticmethod
    def all_dirs(path):
        paths = []
        for dir in os.listdir(path):
            if os.path.isdir(path + '/' + dir):
                paths.append(path + '/' + dir)
        return paths

    @staticmethod
    def all_files(path):
        paths = []
        for root, dirs, files in os.walk(path):
            for file in files:
                # bug fix: join against the walked root, not the top-level path,
                # so files in subdirectories are counted too
                file_path = os.path.join(root, file)
                if os.path.isfile(file_path):
                    paths.append(file_path)
        return paths

    @staticmethod
    def get_extension_from_link(link, default='jpg'):
        splits = str(link).split('.')
        if len(splits) == 0:
            return default
        ext = splits[-1].lower()
        if ext == 'jpg' or ext == 'jpeg':
            return 'jpg'
        elif ext == 'gif':
            return 'gif'
        elif ext == 'png':
            return 'png'
        else:
            return default

    @staticmethod
    def validate_image(path):
        ext = imghdr.what(path)
        if ext == 'jpeg':
            ext = 'jpg'
        return ext  # returns None if not valid

    @staticmethod
    def make_dir(dirname):
        current_path = os.getcwd()
        path = os.path.join(current_path, dirname)
        if not os.path.exists(path):
            os.makedirs(path)

    @staticmethod
    def get_keywords(keywords_file='keywords.txt'):
        # read search keywords from file
        with open(keywords_file, 'r', encoding='utf-8-sig') as f:
            text = f.read()
            lines = text.split('\n')
            lines = filter(lambda x: x != '' and x is not None, lines)
            keywords = sorted(set(lines))

        print('{} keywords found: {}'.format(len(keywords), keywords))

        # re-save sorted keywords
        with open(keywords_file, 'w+', encoding='utf-8') as f:
            for keyword in keywords:
                f.write('{}\n'.format(keyword))

        return keywords

    @staticmethod
    def save_object_to_file(object, file_path, is_base64=False):
        try:
            with open('{}'.format(file_path), 'wb') as file:
                if is_base64:
                    file.write(object)
                else:
                    shutil.copyfileobj(object.raw, file)
        except Exception as e:
            print('Save failed - {}'.format(e))

    @staticmethod
    def base64_to_object(src):
        header, encoded = str(src).split(',', 1)
        data = base64.decodebytes(bytes(encoded, encoding='utf-8'))
        return data

    def download_images(self, keyword, links, site_name):
        self.make_dir('{}/{}'.format(self.download_path, keyword))
        total = len(links)

        for index, link in enumerate(links):
            try:
                print('Downloading {} from {}: {} / {}'.format(keyword, site_name, index + 1, total))

                if str(link).startswith('data:image/jpeg;base64'):
                    response = self.base64_to_object(link)
                    ext = 'jpg'
                    is_base64 = True
                elif str(link).startswith('data:image/png;base64'):
                    response = self.base64_to_object(link)
                    ext = 'png'
                    is_base64 = True
                else:
                    response = requests.get(link, stream=True)
                    ext = self.get_extension_from_link(link)
                    is_base64 = False

                no_ext_path = '{}/{}/{}_{}'.format(self.download_path, keyword, site_name, str(index).zfill(4))
                path = no_ext_path + '.' + ext
                self.save_object_to_file(response, path, is_base64=is_base64)

                del response

                ext2 = self.validate_image(path)
                if ext2 is None:
                    print('Unreadable file - {}'.format(link))
                    os.remove(path)
                else:
                    if ext != ext2:
                        path2 = no_ext_path + '.' + ext2
                        os.rename(path, path2)
                        print('Renamed extension {} -> {}'.format(ext, ext2))

            except Exception as e:
                print('Download failed - ', e)
                continue

    def download_from_site(self, keyword, site_code):
        site_name = Sites.get_text(site_code)
        add_url = Sites.get_face_url(site_code) if self.face else ""

        try:
            collect = CollectLinks()  # initialize chrome driver
        except Exception as e:
            print('Error occurred while initializing chromedriver - {}'.format(e))
            return

        try:
            print('Collecting links... {} from {}'.format(keyword, site_name))

            if site_code == Sites.GOOGLE:
                links = collect.google(keyword, add_url)
            elif site_code == Sites.NAVER:
                links = collect.naver(keyword, add_url)
            elif site_code == Sites.GOOGLE_FULL:
                links = collect.google_full(keyword, add_url)
            elif site_code == Sites.NAVER_FULL:
                links = collect.naver_full(keyword, add_url)
            else:
                print('Invalid Site Code')
                links = []

            print('Downloading images from collected links... {} from {}'.format(keyword, site_name))
            self.download_images(keyword, links, site_name)

            print('Done {} : {}'.format(site_name, keyword))

        except Exception as e:
            print('Exception {}:{} - {}'.format(site_name, keyword, e))

    def download(self, args):
        self.download_from_site(keyword=args[0], site_code=args[1])

    def do_crawling(self):
        keywords = self.get_keywords()

        tasks = []

        for keyword in keywords:
            dir_name = '{}/{}'.format(self.download_path, keyword)
            if os.path.exists(os.path.join(os.getcwd(), dir_name)) and self.skip:
                print('Skipping already existing directory {}'.format(dir_name))
                continue

            if self.do_google:
                if self.full_resolution:
                    tasks.append([keyword, Sites.GOOGLE_FULL])
                else:
                    tasks.append([keyword, Sites.GOOGLE])

            if self.do_naver:
                if self.full_resolution:
                    tasks.append([keyword, Sites.NAVER_FULL])
                else:
                    tasks.append([keyword, Sites.NAVER])

        pool = Pool(self.n_threads)
        pool.map_async(self.download, tasks)
        pool.close()
        pool.join()
        print('Task ended. Pool join.')

        self.imbalance_check()

        print('End Program')

    def imbalance_check(self):
        print('Data imbalance checking...')

        dict_num_files = {}

        for dir in self.all_dirs(self.download_path):
            n_files = len(self.all_files(dir))
            dict_num_files[dir] = n_files

        avg = 0
        for dir, n_files in dict_num_files.items():
            avg += n_files / len(dict_num_files)
            print('dir: {}, file_count: {}'.format(dir, n_files))

        dict_too_small = {}

        for dir, n_files in dict_num_files.items():
            if n_files < avg * 0.5:
                dict_too_small[dir] = n_files

        if len(dict_too_small) >= 1:
            print('Data imbalance detected.')
            print('Below keywords have smaller than 50% of average file count.')
            print('I recommend you to remove these directories and re-download for that keyword.')
            print('_________________________________')

            print('Too small file count directories:')
            for dir, n_files in dict_too_small.items():
                print('dir: {}, file_count: {}'.format(dir, n_files))

            print("Remove directories above? (y/n)")
            answer = input()

            if answer == 'y':
                # remove directories with too few files
                print("Removing too small file count directories...")
                for dir, n_files in dict_too_small.items():
                    shutil.rmtree(dir)
                    print('Removed {}'.format(dir))

                print('Now re-run this program to re-download removed files. (with skip_already_exist=True)')
        else:
            print('Data imbalance not detected.')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--skip', type=str, default='true',
                        help='Skips keyword already downloaded before. This is needed when re-downloading.')
    parser.add_argument('--threads', type=int, default=4, help='Number of threads to download.')
    parser.add_argument('--google', type=str, default='true', help='Download from google.com (boolean)')
    parser.add_argument('--naver', type=str, default='true', help='Download from naver.com (boolean)')
    parser.add_argument('--full', type=str, default='false', help='Download full resolution image instead of thumbnails (slow)')
    parser.add_argument('--face', type=str, default='false', help='Face search mode')
    args = parser.parse_args()

    _skip = False if str(args.skip).lower() == 'false' else True
    _threads = args.threads
    _google = False if str(args.google).lower() == 'false' else True
    _naver = False if str(args.naver).lower() == 'false' else True
    _full = False if str(args.full).lower() == 'false' else True
    _face = False if str(args.face).lower() == 'false' else True

    print('Options - skip:{}, threads:{}, google:{}, naver:{}, full_resolution:{}, face:{}'.format(_skip, _threads, _google, _naver, _full, _face))

    crawler = AutoCrawler(skip_already_exist=_skip, n_threads=_threads, do_google=_google, do_naver=_naver, full_resolution=_full, face=_face)
    crawler.do_crawling()
=-=-=-=-=
error message:
Downloading images from collected links... "breast cancer" CT from google
Exception google:"breast cancer" CT - [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'C:\\morpheus\\JOE\\AutoCrawler-master 2\\AutoCrawler-master\\download/"breast cancer" CT'
Task ended. Pool join.
Data imbalance checking...
This is the problem: the keyword "breast cancer" CT contains double quotes, which Windows does not allow in file or directory names, so creating the download folder fails with WinError 123.
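A hedged fix is to strip the characters Windows forbids in path components before creating the keyword directory, while still searching with the original keyword (the sanitize_dirname helper is my sketch, not part of the snapshot):

import re

def sanitize_dirname(keyword):
    # remove characters Windows rejects in file/directory names
    return re.sub(r'[\\/:*?"<>|]', '', keyword).strip()

# '"breast cancer" CT'  ->  'breast cancer CT'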
[4876:14980:0625/200842.641:ERROR:ssl_client_socket_impl.cc(959)] handshake failed; returned -1, SSL error code 1, net_error -200
[4876:14980:0625/200842.641:ERROR:ssl_client_socket_impl.cc(959)] handshake failed; returned -1, SSL error code 1, net_error -200
collect_links.py, line 48:
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
self.browser = webdriver.Chrome(executable, chrome_options=chrome_options)
still...
430: https://www.oncologynurseadvisor.com/wp-content/uploads/sites/13/2019/01/c752406354ff50ac055bfccde4105efd_bookmarkImage_335x250_large_original_1.jpg
[0625/211556.818:ERROR:ssl_client_socket_impl.cc(959)] handshake failed; returned -1, SSL error code 1, net_error -100
[0625/211556.820:ERROR:ssl_client_socket_impl.cc(959)] handshake failed; returned -1, SSL error code 1, net_error -100
[0625/211557.085:ERROR:ssl_client_socket_impl.cc(959)] handshake failed; returned -1, SSL error code 1, net_error -100
[0625/211557.088:ERROR:ssl_client_socket_impl.cc(959)] handshake failed; returned -1, SSL error code 1, net_error -100
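These handshake failures look like Chrome-level log noise from sites with bad certificates rather than a crawler bug; the actual image download happens through requests and succeeds or fails on its own. A hedged tweak to quiet them, added next to the options above (both are standard Chrome flags):

chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--log-level=3')  # only fatal messages on the console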
Maybe it's the keyword?
Hmm...
Running this across different systems (Mac, Windows Server, Windows 10), the problem takes different forms. Now it shows up as Click time out - //input[@type="button"].
Giving the chrome_options.add_argument('--headless') option keeps a window from opening and seems faster... but since it causes the error above, I launched it with a visible window and the downloads come through properly.
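Since headless mode seems tied to the click timeout, a hedged workaround is to make it switchable instead of hard-coded (the headless flag and default executable value are my additions to the CollectLinks constructor shown above):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class CollectLinks:
    def __init__(self, executable='chromedriver', headless=True):
        chrome_options = Options()
        if headless:  # disable when headless runs hit 'Click time out'
            chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.browser = webdriver.Chrome(executable, chrome_options=chrome_options)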
Collecting links... Breast cancer CT from google
Scrolling down
Scraping links
Collect links done. Site: google, Keyword: Breast cancer CT, Total: 434
Downloading images from collected links... Breast cancer CT from google
Downloading Breast cancer CT from google: 1 / 434
Download failed - Invalid URL 'None': No schema supplied. Perhaps you meant http://None?
Downloading Breast cancer CT from google: 2 / 434
Downloading Breast cancer CT from google: 3 / 434
Downloading Breast cancer CT from google: 4 / 434
Downloading Breast cancer CT from google: 5 / 434
Downloading Breast cancer CT from google: 6 / 434
Downloading Breast cancer CT from google: 7 / 434
Downloading Breast cancer CT from google: 8 / 434
Downloading Breast cancer CT from google: 9 / 434
Downloading Breast cancer CT from google: 10 / 434
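The Invalid URL 'None' failure suggests the link collector can return None entries (thumbnails whose src never resolved). A hedged guard before calling download_images, assuming links is the list returned by collect:

# drop None / empty links so requests.get never sees 'None'
links = [link for link in links if link]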
Only 434 from Google? Hmm... Checking in the browser window, it says there are no more results. I should look for more data sources and try different keyword combinations.
Still, the image quality is decent.
Instead of breast cancer CT, I changed the keyword to breast cancer ct scan images and ran it in full resolution mode.
That returns 700 images.
Still too few...
Digging around the internet, there is already far more image data out there than I imagined:
https://www.cancerimagingarchive.net/collections/
There is even an app for downloading the data. Probably torrent-based?
Anyway, there are quite a few sites besides that one. I should keep them well organized.