In Python, crawler concurrency can be controlled in the following ways:
- Using the `threading` module: Python's `threading` module provides basic thread support. You can create multiple threads, each running one crawl task, and use a `threading.Semaphore` to limit how many run at the same time.
Example code:
```python
import threading

import requests
from bs4 import BeautifulSoup


class Crawler(threading.Thread):
    def __init__(self, url, semaphore):
        threading.Thread.__init__(self)
        self.url = url
        self.semaphore = semaphore

    def run(self):
        # The semaphore caps how many threads fetch at the same time
        with self.semaphore:
            self.fetch_url(self.url)

    def fetch_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Process the scraped data
        print(f"Visited {url}")


def main():
    urls = ["http://example.com/page1", "http://example.com/page2", ...]
    concurrency_limit = 5
    semaphore = threading.Semaphore(concurrency_limit)

    threads = []
    for url in urls:
        crawler = Crawler(url, semaphore)
        crawler.start()
        threads.append(crawler)

    # Wait for all threads to finish
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    main()
```
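Note that in this sketch every thread is created and started up front; the semaphore only limits how many are fetching at any moment, so with a very long URL list the number of live threads still grows with the list.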
- Using the `asyncio` module: Python's `asyncio` module provides asynchronous programming support and can handle concurrent tasks more efficiently. You can use an `asyncio.Semaphore` to limit the number of concurrent requests.
Example code:
```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup


class Crawler:
    def __init__(self, url, semaphore):
        self.url = url
        self.semaphore = semaphore

    async def fetch_url(self, session, url):
        # The semaphore caps how many requests are in flight at once
        async with self.semaphore:
            await self.fetch(session, url)

    async def fetch(self, session, url):
        async with session.get(url) as response:
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            # Process the scraped data
            print(f"Visited {url}")


async def main():
    urls = ["http://example.com/page1", "http://example.com/page2", ...]
    concurrency_limit = 5
    semaphore = asyncio.Semaphore(concurrency_limit)

    # Share one ClientSession across all tasks so connections can be reused
    async with aiohttp.ClientSession() as session:
        tasks = [Crawler(url, semaphore).fetch_url(session, url) for url in urls]
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
```
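Here `asyncio.gather` runs all the coroutines concurrently on a single thread, while the semaphore ensures that no more than `concurrency_limit` requests are in flight at once.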
Both approaches let you control crawler concurrency. The `threading` module works well for I/O-bound tasks, while the `asyncio` module is better suited to high-concurrency scenarios, especially I/O-bound ones.
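As a related option, the standard library's `concurrent.futures.ThreadPoolExecutor` can enforce the same limit with less boilerplate than hand-managed threads plus a semaphore, since the pool itself caps how many tasks run at once. A minimal sketch, where the URL list and `max_workers` value are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def fetch_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Process the scraped data
    print(f"Visited {url}")


def main():
    urls = ["http://example.com/page1", "http://example.com/page2"]
    # max_workers limits how many URLs are fetched at the same time
    with ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(fetch_url, urls)


if __name__ == "__main__":
    main()
```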