
Crawling Web Page Image URLs with Python Multithreading

Published: 2022-01-17 14:25:12


mini-spider

Feature description: a multithreaded web crawler that collects image URLs from web pages (it can also extract URLs matching other patterns). The goal is to use Python to build a mini targeted crawler, mini_spider.py, that crawls breadth-first from a set of seed links and saves to disk the URLs whose form matches a given pattern. Run it with: python mini_spider.py -c spider.conf. Configuration file spider.conf:

[spider]

feedfile: ./urls                      # path to the seed file
result: ./result.data                 # file where crawl results are stored, one URL per line
max_depth: 6                          # maximum crawl depth (seeds are level 0)
crawl_interval: 1                     # crawl interval, in seconds
crawl_timeout: 2                      # crawl timeout, in seconds
thread_count: 8                       # number of crawler threads
filter_url: .*\.(gif|png|jpg|bmp)$    # URL pattern to match

Seed file: urls (a list of starting URLs; a hypothetical example appears after spider.conf below).

Crawl strategy:
Breadth-first page crawling, fetching with multiple threads.
Collect the links that match the pattern (for example, URLs with gif, png, jpg or bmp extensions) and store their absolute URLs in result.data, one per line; the images themselves can also be saved locally. A short check of the pattern follows this block.
Both relative and absolute paths are handled when extracting links from HTML.
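As a quick, standalone illustration (not part of the project code), the filter_url pattern only accepts URLs that end in one of the image extensions:

import re

# the pattern from spider.conf; the trailing $ anchors the match at the image extension
pattern = re.compile(r'.*\.(gif|png|jpg|bmp)$')

print(bool(pattern.match('http://www.example.com/images/logo.png')))       # True
print(bool(pattern.match('http://www.example.com/index.html')))            # False
print(bool(pattern.match('http://www.example.com/photo.jpg?size=large')))  # False: the query string breaks the match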

mini_spider.py

#!/usr/bin/env python
# Copyright (c) , Inc. All Rights Reserved
"""
This module is the main module
@Time   : /11/09
@File   : mini_spider.py
@Author : cenquanyu@
"""
import log
from worker.SpiderWorker import SpiderWorker
from worker.param_parser import parm_parser


def main():
    """Main method to run mini spider"""
    # get input params
    args = parm_parser.get_args()
    # init log config
    log.init_log('./log/mini_spider')
    if args:
        # read config file spider.conf
        conf_params = parm_parser.set_config_by_file(args.conf)
        # use config set up spider initial params
        spider = SpiderWorker(conf_params)
        # init result_path, make it complete
        spider.set_path()
        # init url queue
        spider.set_url_queue()
        # start to crawl url
        spider.start_crawl_work()
    return


if __name__ == '__main__':
    main()
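The log module imported at the top is referenced but not listed in this article (the other imports map to the worker package shown below). A minimal stand-in exposing the same init_log(path) call, written here purely as an assumption about its interface, could look like this:

# log.py -- hypothetical stand-in for the logging helper used by mini_spider.py
import os
import logging
import logging.handlers


def init_log(log_path, level=logging.INFO):
    """Write log records to <log_path>.log (rotated daily) and to the console."""
    log_dir = os.path.dirname(log_path)
    if log_dir and not os.path.exists(log_dir):
        os.makedirs(log_dir)

    formatter = logging.Formatter('%(levelname)s: %(asctime)s * %(thread)d %(message)s')
    logger = logging.getLogger()
    logger.setLevel(level)

    file_handler = logging.handlers.TimedRotatingFileHandler(log_path + '.log', when='D', backupCount=7)
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    console_handler = logging.StreamHandler()
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)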

spider.conf

[spider]
feedfile: ./urls
result: ./result.data
max_depth: 6
crawl_interval: 1
crawl_timeout: 2
thread_count: 8
filter_url: .*\.(gif|png|jpg|bmp)$
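The seed file urls is not reproduced in the article. Assuming one seed URL per line, which is how set_url_queue in SpiderWorker.py is written to read it here, a hypothetical example would be:

http://www.example.com/
http://www.example.com/gallery.html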

SpiderThread.py (threading module)

#!/usr/bin/env python
# Copyright (c) , Inc. All Rights Reserved
"""
This module is the threading module, used to enable multithreaded processing of requests
@Time   : /11/09
@File   : SpiderThread.py
@Author : cenquanyu@
"""
import logging
import re
import time
import threading

from worker.UrlHandler import UrlHandler


class SpiderThread(threading.Thread):
    """Provide multi thread for mini spider"""

    def __init__(self, urlqueue, result_path, max_depth, interval, timeout, filter_url, total_urlset):
        threading.Thread.__init__(self)
        self.urlqueue = urlqueue
        self.result_path = result_path
        self.max_depth = max_depth
        self.interval = interval
        self.timeout = timeout
        self.filter_url = filter_url
        self.total_urlset = total_urlset
        self.lock = threading.Lock()

    def can_download(self, url):
        """
        Judge whether the url can be downloaded. Write your download rules here.
        :param url: target url
        :return: True, False
        """
        if not UrlHandler.is_url(url):
            return False
        try:
            # regular expression matching image URLs
            pattern = re.compile(self.filter_url)
        except Exception as e:
            logging.error("the filter url %s is not a valid regex, re.compile failed: %s" % (self.filter_url, e))
            return False
        # if url length < 1 or url is not an image-type url
        if len(url.strip(' ')) < 1 or not pattern.match(url.strip(' ')):
            return False
        # if url is already in the total url set (avoid repeat downloads)
        if url in self.total_urlset:
            return False
        return True

    def run(self):
        """
        Run the crawling thread.
        Get a task from the queue and add sub urls into the queue; crawling strategy -- BFS.
        :return: no return
        """
        while True:
            try:
                # get url and the page level
                url, level = self.urlqueue.get(block=True, timeout=self.timeout)
            except Exception as e:
                logging.error('Can not finish the task. job done. %s' % e)
                break
            self.urlqueue.task_done()
            # sleep interval
            time.sleep(self.interval)
            # judge if url can be downloaded
            if self.can_download(url):
                UrlHandler.download_url(self.result_path, url)
                # lock while adding the url to the total url set
                self.lock.acquire()
                self.total_urlset.add(url)
                self.lock.release()
            # get the sub urls of this url
            suburls = UrlHandler.get_urls(url)
            suburl_level = level + 1
            # if sub url level is larger than max_depth, stop crawling deeper
            if suburl_level > self.max_depth:
                continue
            for suburl in suburls:
                self.urlqueue.put((suburl, suburl_level))
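The queue carries (url, level) tuples, and the level check in run() is what caps the breadth-first crawl at max_depth. The same pattern reduced to a toy in-memory graph, with no networking, just to show the traversal order (illustration only):

from queue import Queue

# toy link graph standing in for UrlHandler.get_urls()
graph = {
    'seed': ['a', 'b'],
    'a': ['a1.png', 'c'],
    'b': ['b1.jpg'],
}

max_depth = 1
queue = Queue()
queue.put(('seed', 0))
visited = []

while not queue.empty():
    url, level = queue.get()
    visited.append((url, level))
    if level + 1 > max_depth:
        continue                      # same cut-off as in SpiderThread.run()
    for suburl in graph.get(url, []):
        queue.put((suburl, level + 1))

print(visited)
# [('seed', 0), ('a', 1), ('b', 1)] -- nothing beyond max_depth is enqueued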

SpiderWorker.py (main worker module)

#!/usr/bin/env python
# Copyright (c) , Inc. All Rights Reserved
"""
This module is the main worker, the central module for crawling tasks
@Time   : /11/09
@File   : SpiderWorker.py
@Author : cenquanyu@
"""
import os
import logging
from queue import Queue

from worker.SpiderThread import SpiderThread


class SpiderWorker(object):

    def __init__(self, *args, **kwargs):
        params = args[0]
        self.urls = params[0]
        self.result_path = params[1]
        self.maxdepth = params[2]
        self.interval = params[3]
        self.timeout = params[4]
        self.thread_count = params[5]
        self.filter_url = params[6]
        self.total_urlset = set()
        self.urlqueue = Queue()

    def set_abs_dir(self, path):
        """
        Complete the result path and make sure its parent directory exists
        :param path: result file path from the config
        :return: absolute result output path
        """
        file_path = os.path.join(os.getcwd(), path)
        # create the parent directory only; creating the file path itself as a
        # directory would make the append in UrlHandler.download_url fail
        file_dir = os.path.dirname(file_path)
        if not os.path.exists(file_dir):
            try:
                os.makedirs(file_dir)
            except os.error as err:
                logging.error("mkdir result-saved dir error: %s." % err)
        return str(file_path)

    def set_path(self):
        """Complete the result path"""
        self.result_path = self.set_abs_dir(self.result_path)

    def set_url_queue(self):
        """
        Read the seed file and put every seed url into the queue at level 0
        :return: True or False
        """
        try:
            # the feedfile is assumed to hold one seed URL per line
            with open(self.urls) as f:
                for seed_url in f:
                    seed_url = seed_url.strip()
                    if seed_url:
                        self.urlqueue.put((seed_url, 0))
        except Exception as e:
            logging.error(e)
            return False
        return True

    def start_crawl_work(self):
        """
        Start to work
        :return: nothing
        """
        thread_list = []
        for i in range(self.thread_count):
            thread = SpiderThread(self.urlqueue, self.result_path, self.maxdepth, self.interval,
                                  self.timeout, self.filter_url, self.total_urlset)
            thread_list.append(thread)
            logging.info("%s start..." % thread.name)
            thread.start()
        for thread in thread_list:
            thread.join()
            logging.info("thread %s work is done" % thread.name)
        self.urlqueue.join()
        logging.info("queue is all done")
        return
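For a quick manual run without going through argparse, the worker can be driven directly with a parameter tuple in the same order that parm_parser.set_config_by_file returns (the values below just mirror spider.conf):

from worker.SpiderWorker import SpiderWorker

# (feedfile, result, max_depth, crawl_interval, crawl_timeout, thread_count, filter_url)
params = ('./urls', './result.data', 6, 1, 2, 8, r'.*\.(gif|png|jpg|bmp)$')

spider = SpiderWorker(params)
spider.set_path()
spider.set_url_queue()
spider.start_crawl_work()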

UrlHandler.py (URL handling and HTTP request module)

#!/usr/bin/env python
# Copyright (c) , Inc. All Rights Reserved
"""
This module is used to handle URLs and HTTP related requests
@Time   : /11/09
@File   : UrlHandler.py
@Author : cenquanyu@
"""
import os
import logging
from urllib import parse, request

import chardet
import requests
from bs4 import BeautifulSoup


class UrlHandler(object):
    """Public url tools for handling urls"""

    @staticmethod
    def is_url(url):
        """
        Ignore urls starting with javascript
        :param url:
        :return: True or False
        """
        if url.startswith("javascript"):
            return False
        return True

    @staticmethod
    def get_content(url, timeout=10):
        """
        Get html content
        :param url: the target url
        :param timeout: request timeout, default 10
        :return: content of the html page, or None when an error happens
        """
        try:
            response = requests.get(url, timeout=timeout)
        except requests.HTTPError as e:
            logging.error("url %s request error: %s" % (url, e))
            return None
        except Exception as e:
            logging.error(e)
            return None
        return UrlHandler.decode_html(response.content)

    @staticmethod
    def decode_html(content):
        """
        Decode html content
        :param content: original html content
        :return: decoded html content, or None on error
        """
        encoding = chardet.detect(content)['encoding']
        if encoding == 'GB2312':
            encoding = 'GBK'
        else:
            encoding = 'utf-8'
        try:
            content = content.decode(encoding, 'ignore')
        except Exception as err:
            logging.error("Decode error: %s.", err)
            return None
        return content

    @staticmethod
    def get_urls(url):
        """
        Get all sub urls of this url
        :param url: origin url
        :return: the set of sub urls
        """
        urlset = set()
        if not UrlHandler.is_url(url):
            return urlset
        content = UrlHandler.get_content(url)
        if content is None:
            return urlset
        tag_list = ['img', 'a', 'style', 'script']
        linklist = []
        soup = BeautifulSoup(content, 'html.parser')
        for tag in tag_list:
            linklist.extend(soup.find_all(tag))
        # collect urls from 'src' and 'href' attributes
        for link in linklist:
            if link.has_attr('src'):
                urlset.add(UrlHandler.parse_url(link['src'], url))
            if link.has_attr('href'):
                urlset.add(UrlHandler.parse_url(link['href'], url))
        return urlset

    @staticmethod
    def parse_url(url, base_url):
        """
        Parse a url to make it complete and standard
        :param url: the current url
        :param base_url: the base url
        :return: completed url
        """
        if url.startswith('http') or url.startswith('//'):
            url = parse.urlparse(url, scheme='http').geturl()
        else:
            url = parse.urljoin(base_url, url)
        return url

    @staticmethod
    def download_image_file(result_dir, url):
        """
        Download the image as a file and save it in the result dir
        :param result_dir: base path
        :param url: download url
        :return: True on success, False on failure
        """
        if not os.path.exists(result_dir):
            try:
                os.mkdir(result_dir)
            except os.error as err:
                logging.error("download to path, mkdir error: %s" % err)
        try:
            path = os.path.join(result_dir,
                                url.replace('/', '_').replace(':', '_').replace('?', '_').replace('\\', '_'))
            logging.info("download url..: %s" % url)
            request.urlretrieve(url, path, None)
        except Exception as e:
            logging.error("download url %s fail: %s" % (url, e))
            return False
        return True

    @staticmethod
    def download_url(result_file, url):
        """
        Save a URL that matches the pattern into the result file
        :param result_file: base path
        :param url: download url
        :return: True on success, False on failure
        """
        try:
            path = os.path.join(os.getcwd(), result_file)
            logging.info("download url..: %s" % url)
            with open(path, 'a') as f:
                f.write(url + '\n')
        except Exception as e:
            logging.error("download url %s fail: %s" % (url, e))
            return False
        return True
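parse_url is what lets the crawler follow relative, root-relative and protocol-relative links alike. Assuming the worker package is importable from the project root, its behaviour looks like this:

from worker.UrlHandler import UrlHandler

base = 'http://www.example.com/photos/index.html'

print(UrlHandler.parse_url('img/cat.jpg', base))
# http://www.example.com/photos/img/cat.jpg   (relative path joined onto the base url)

print(UrlHandler.parse_url('/static/dog.png', base))
# http://www.example.com/static/dog.png       (root-relative path)

print(UrlHandler.parse_url('//cdn.example.com/pic.gif', base))
# http://cdn.example.com/pic.gif              (protocol-relative url given a default http scheme)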

param_parser.py (parameter parsing module)

#!/usr/bin/env python
# Copyright (c) , Inc. All Rights Reserved
"""
This module is used to parse params
@Time   : /11/09
@File   : param_parser.py
@Author : cenquanyu@
"""
import argparse
import logging
import configparser


class parm_parser(object):

    @staticmethod
    def set_config_by_file(config_file):
        """
        Set SpiderWorker params from the config file
        :param config_file: config file path
        :return: tuple of config values
        """
        config = configparser.ConfigParser()
        config.read(config_file, encoding='utf-8')
        urls = config['spider']['feedfile']                  # feed file path
        result_path = config['spider']['result']             # result storage file
        max_depth = config['spider']['max_depth']            # max crawl depth
        crawl_interval = config['spider']['crawl_interval']  # crawl interval
        crawl_timeout = config['spider']['crawl_timeout']    # crawl timeout
        thread_count = config['spider']['thread_count']      # crawler thread count
        filter_url = config['spider']['filter_url']          # URL pattern
        return (urls, result_path, int(max_depth), int(crawl_interval),
                int(crawl_timeout), int(thread_count), filter_url)

    @staticmethod
    def get_args():
        """
        Get console args and parse them
        :return: parsed args, or None
        """
        try:
            parser = argparse.ArgumentParser(prog='other_mini_spider',
                                             usage='minispider using method',
                                             description='other_mini_spider is a multithreaded crawler')
            parser.add_argument('-c', '--conf', help='config_file')
            parser.add_argument('-v', '--version', help='version', action="store_true")
        except argparse.ArgumentError as e:
            logging.error("get option error: %s." % e)
            return
        args = parser.parse_args()
        if args.version:
            parm_parser.version()
        if args.conf:
            return args

    @staticmethod
    def version():
        """Print mini spider version"""
        print("other_mini_spider version 1.0.0")
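The config loader can also be exercised on its own, which is convenient when debugging parameter handling (a sketch that assumes spider.conf sits in the current working directory):

from worker.param_parser import parm_parser

conf_params = parm_parser.set_config_by_file('spider.conf')
print(conf_params)
# ('./urls', './result.data', 6, 1, 2, 8, '.*\\.(gif|png|jpg|bmp)$')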
