Python and Go crawlers can work together in several ways. Here are some common approaches:
1. Use a message queue
A message queue is a common asynchronous communication mechanism that can be used to decouple crawler components. For example, a message queue system such as RabbitMQ or Kafka can distribute crawl tasks, regardless of which language the producers and consumers are written in.
Example: using RabbitMQ
- Install RabbitMQ:
sudo apt-get install rabbitmq-server
- Install the Python client library:
pip install pika
- Producer:
import pika

# Connect to RabbitMQ and declare the task queue
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='crawl_queue')

def send_task(url):
    # Publish one crawl task (a URL) to the queue
    channel.basic_publish(exchange='', routing_key='crawl_queue', body=url)
    print(f"Sent {url}")

send_task('http://example.com')
connection.close()
- Consumer:
import pika
import requests

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='crawl_queue')

def callback(ch, method, properties, body):
    # Each message body is a URL to crawl
    url = body.decode('utf-8')
    print(f"Received {url}")
    response = requests.get(url)
    print(response.text)

channel.basic_consume(queue='crawl_queue', on_message_callback=callback, auto_ack=True)
print('Waiting for messages. To exit press CTRL+C')
channel.start_consuming()
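Because the queue carries plain URL strings, the consumer does not have to be written in Python. The sketch below shows an equivalent Go worker; it is a minimal example that assumes the rabbitmq/amqp091-go client library and the same 'crawl_queue' queue on a local broker, not a production-ready implementation.
package main

import (
	"io"
	"log"
	"net/http"

	amqp "github.com/rabbitmq/amqp091-go" // assumed AMQP client library
)

func main() {
	// Connect to the same local RabbitMQ broker the Python producer uses
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Declare the queue; this is idempotent and matches the Python side
	q, err := ch.QueueDeclare("crawl_queue", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	msgs, err := ch.Consume(q.Name, "", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	log.Println("Waiting for crawl tasks...")
	for msg := range msgs {
		url := string(msg.Body) // each message body is a URL
		resp, err := http.Get(url)
		if err != nil {
			log.Printf("fetch %s failed: %v", url, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		log.Printf("fetched %s (%d bytes)", url, len(body))
	}
}
With this split, Python code enqueues URLs while one or more Go workers drain the queue, which is the simplest way to share work between the two languages.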
2. Use multithreading or multiprocessing
Multithreading or multiprocessing can be used to handle crawl tasks in parallel and improve throughput: threads suit I/O-bound fetching, while separate processes help when parsing is CPU-bound.
Example: multithreading
import threading
import requests

def crawl(url):
    response = requests.get(url)
    print(response.text)

urls = ['http://example.com', 'http://example.org', 'http://example.net']

threads = []
for url in urls:
    thread = threading.Thread(target=crawl, args=(url,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
Example: multiprocessing
import multiprocessing
import requests

def crawl(url):
    response = requests.get(url)
    print(response.text)

# The __main__ guard is required on platforms that spawn processes (Windows, macOS)
if __name__ == '__main__':
    urls = ['http://example.com', 'http://example.org', 'http://example.net']

    processes = []
    for url in urls:
        process = multiprocessing.Process(target=crawl, args=(url,))
        process.start()
        processes.append(process)

    for process in processes:
        process.join()
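If the fetching side of a mixed Python/Go setup is written in Go instead, the same fan-out is usually expressed with goroutines. The sketch below uses only the Go standard library and the same placeholder URLs as above; it is an illustrative example rather than part of the Python code in this section.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func crawl(url string, wg *sync.WaitGroup) {
	defer wg.Done()
	resp, err := http.Get(url)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s: %d bytes\n", url, len(body))
}

func main() {
	urls := []string{"http://example.com", "http://example.org", "http://example.net"}

	// Launch one goroutine per URL, then wait for all of them to finish
	var wg sync.WaitGroup
	for _, url := range urls {
		wg.Add(1)
		go crawl(url, &wg)
	}
	wg.Wait()
}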
3. Use a web framework
A web framework such as Flask or Django can expose the crawler through an HTTP API, allowing it to be controlled and monitored remotely from other processes or languages.
Example: using Flask
- Install Flask:
pip install Flask
- Create the Flask app:
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

@app.route('/crawl', methods=['POST'])
def crawl():
    # Read the target URL from the JSON request body and fetch it
    url = request.json['url']
    response = requests.get(url)
    return jsonify({'status': 'success', 'content': response.text})

if __name__ == '__main__':
    app.run(debug=True)
- Send a request:
import requests

url = 'http://localhost:5000/crawl'
data = {'url': 'http://example.com'}
response = requests.post(url, json=data)
print(response.json())
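The same endpoint can also be called from a Go crawler that wants to delegate a fetch to the Python service. This is a minimal sketch using only the Go standard library; the /crawl route and port are the ones from the Flask example above, and the payload shape is an assumption based on that handler.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// JSON payload matching what the Flask handler reads: {"url": "..."}
	payload, _ := json.Marshal(map[string]string{"url": "http://example.com"})

	resp, err := http.Post("http://localhost:5000/crawl", "application/json", bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // e.g. {"status": "success", "content": "..."}
}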
4. Use the Scrapy framework
Scrapy is a powerful crawling framework with built-in request scheduling; combined with extensions such as scrapy-redis it can also be used for distributed crawling.
Example: using Scrapy
- Install Scrapy:
pip install scrapy
- Create a Scrapy project:
scrapy startproject myproject
cd myproject
- Create a spider:
# myproject/spiders/example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        self.log('Visited %s' % response.url)
        # These selectors assume the target page marks up each quote in a div.quote block
        for quote in response.css('div.quote'):
            item = {
                'text': quote.css('span.text::text').get(),
                'author_url': quote.xpath('span/small/a/@href').get(),
            }
            yield item
- Configure settings:
# myproject/settings.py
# Distributed scheduling is not built into Scrapy itself; it is usually added via the
# scrapy-redis extension (pip install scrapy-redis), for example:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://localhost:6379"
- Run the spider:
scrapy crawl example -o output.json
With the approaches above, Python and Go crawlers can work together, improving crawl efficiency and reliability.