In Python, a multithreaded crawler with data storage can be implemented in the following steps:
- Import the required libraries:
```python
import threading
import requests
from bs4 import BeautifulSoup
import json
import sqlite3
```
- Create a database connection:
```python
def create_connection():
    # check_same_thread=False allows the connection created here to be used
    # from the worker threads started later
    conn = sqlite3.connect('data.db', check_same_thread=False)
    return conn
```
- Create a table for storing the data (if it does not already exist):
```python
def create_table(conn):
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS web_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT NOT NULL,
        title TEXT NOT NULL,
        content TEXT NOT NULL
    )''')
    conn.commit()
```
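If you want to confirm that the table was actually created, a quick optional check against SQLite's `sqlite_master` catalog works; this snippet is purely illustrative:

```python
conn = create_connection()
create_table(conn)
# Lists the tables in data.db; 'web_data' should appear in the output
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
conn.close()
```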
- Define a function to process the crawled data:
```python
def process_data(url, title, content):
    # Clean, parse, or otherwise transform the data here
    return {
        'url': url,
        'title': title,
        'content': content
    }
```
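The processing step above is currently a pass-through. As one illustration of what "cleaning" could mean, a small hypothetical helper like `clean_text` below (not part of the original code) collapses the whitespace that `soup.get_text()` tends to leave behind; `process_data` could call it on `content` before returning:

```python
import re

def clean_text(text):
    # Collapse runs of spaces, tabs, and newlines into single spaces (illustrative helper)
    return re.sub(r'\s+', ' ', text).strip()
```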
- Define a function to save the data to the database:
```python
def save_data(conn, data):
    cursor = conn.cursor()
    cursor.execute('''INSERT INTO web_data (url, title, content)
                      VALUES (?, ?, ?)''',
                   (data['url'], data['title'], data['content']))
    conn.commit()
```
- Define a crawler function that will run in multiple threads:
```python
def crawl(url, title, conn):
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        content = soup.get_text()
        data = process_data(url, title, content)
        save_data(conn, data)
    except Exception as e:
        print(f"Error while processing {url}: {e}")
```
- Define a function to start multiple threads:
```python
def start_threads(urls, num_threads):
    conn = create_connection()
    create_table(conn)
    threads = []
    for i in range(num_threads):
        # Each thread crawls one URL (cycling through the list); the URL doubles as the title
        url = urls[i % len(urls)]
        thread = threading.Thread(target=crawl, args=(url, url, conn))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    conn.close()
```
- Prepare the list of URLs to crawl and set the number of threads:
```python
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    # ...
]
num_threads = 10
```
- Start the multithreaded crawler:
```python
start_threads(urls, num_threads)
```
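Two caveats about the sketch above: the single sqlite3 connection is shared by every thread (hence `check_same_thread=False` in `create_connection`), and `start_threads` launches exactly `num_threads` threads that cycle through the URL list, so each URL is not necessarily crawled exactly once. If you need every URL handled once and database writes serialized, a variant along the following lines is one option; `crawl_one`, `start_pool`, and `db_lock` are names introduced here for illustration, not part of the original code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

db_lock = threading.Lock()  # serializes writes to the shared connection

def crawl_one(url, conn):
    # Same fetch/parse logic as crawl(), with writes guarded by a lock
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        data = process_data(url, url, soup.get_text())
        with db_lock:
            save_data(conn, data)
    except Exception as e:
        print(f"Error while processing {url}: {e}")

def start_pool(urls, num_threads):
    conn = create_connection()  # created with check_same_thread=False
    create_table(conn)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        pool.map(lambda u: crawl_one(u, conn), urls)  # each URL handled exactly once
    conn.close()
```

With this variant, `start_pool(urls, num_threads)` would replace the `start_threads(urls, num_threads)` call above.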
This example uses an SQLite database to store the data. You can swap in another database such as MySQL or PostgreSQL as needed, and likewise adjust the data-processing and storage logic to fit your use case.
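As a rough idea of what such a swap involves, here is what `create_connection` and `save_data` might look like against PostgreSQL using psycopg2 (one of several available client libraries); the connection parameters below are placeholders, not values from the original example:

```python
import psycopg2

def create_connection():
    # Placeholder credentials; adjust to your own server
    return psycopg2.connect(host='localhost', dbname='crawler',
                            user='crawler', password='secret')

def save_data(conn, data):
    cursor = conn.cursor()
    # psycopg2 uses %s placeholders instead of sqlite3's ?
    cursor.execute('INSERT INTO web_data (url, title, content) VALUES (%s, %s, %s)',
                   (data['url'], data['title'], data['content']))
    conn.commit()
```

Note that the table definition would also change slightly, since PostgreSQL uses `SERIAL` or identity columns rather than SQLite's `AUTOINCREMENT`.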