In Python, we can use the `threading` library together with the `queue` module to limit the resources a multi-threaded crawler consumes. Here is a simple example:
- First, import the required libraries:
```python
import threading
import requests
from bs4 import BeautifulSoup
from queue import Queue
```
- Define a function to process the crawled data:
```python
def process_data(data):
    # Logic for handling the crawled data goes here
    pass
```
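As a sketch only: assuming each item in `data` is a BeautifulSoup tag (as produced by `fetch_url` below), the processing step might simply extract and print the text. Both the loop and the output format here are illustrative assumptions, not part of the original:

```python
def process_data(data):
    # Hypothetical body: print the text of each matched element
    for element in data:
        print(element.get_text(strip=True))
```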
- Define a function to fetch a page's content:
```python
def fetch_url(url, session, result_queue):
    try:
        response = session.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Pick tags and class names that match the target page's structure
        data = soup.find_all('div', class_='content')
        result_queue.put(data)
    except Exception as e:
        print(f"Error fetching {url}: {e}")
```
- Define a function that caps the number of threads:
```python
def limited_thread_spider(urls, max_threads):
    session = requests.Session()
    result_queue = Queue()
    threads = []

    for url in urls:
        # If the thread limit has been reached, wait for the
        # running threads to finish before starting new ones
        if threading.active_count() >= max_threads:
            for thread in threads:
                thread.join()
            threads = []
        thread = threading.Thread(target=fetch_url,
                                  args=(url, session, result_queue))
        thread.start()
        threads.append(thread)

    # Wait for the remaining threads to finish
    for thread in threads:
        thread.join()

    # Process the crawled data
    while not result_queue.empty():
        data = result_queue.get()
        process_data(data)
```
- Use the `limited_thread_spider` function to crawl:

```python
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    # ...
]
max_threads = 5  # Set the maximum number of threads
limited_thread_spider(urls, max_threads)
```

In this example, a queue (`result_queue`) stores the crawled data, and `threading.active_count()` reports how many threads are currently alive. When the number of active threads reaches the maximum, we wait for the running threads to complete before starting new ones. This keeps the thread count within the configured limit.
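Note that polling `threading.active_count()` is somewhat fragile: it counts the main thread too, so the real cap on fetcher threads is `max_threads - 1`. A more idiomatic way to cap concurrency with the standard library is `concurrent.futures.ThreadPoolExecutor`, which manages the worker pool for you. Below is a minimal sketch, reusing the `fetch_url` and `process_data` functions above; the name `pooled_spider` is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

import requests

def pooled_spider(urls, max_threads):
    session = requests.Session()
    result_queue = Queue()
    # The executor never runs more than max_workers threads at once
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        for url in urls:
            executor.submit(fetch_url, url, session, result_queue)
    # The with-block exits only after all submitted tasks have finished
    while not result_queue.empty():
        process_data(result_queue.get())
```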