Scraping fundamentals: how to send a request, how to handle the response, how to store the data, and how to cope with the various anti-scraping measures.
Familiarity with Scrapy, distributed crawlers, crawler performance tuning, large-scale data collection, proxies, app scraping, and so on (a minimal Scrapy spider sketch follows this list).
Familiarity with multithreaded programming, network programming, and the HTTP protocol.
Experience building a complete crawler project.
Anti-scraping countermeasures: cookies, IP pools, CAPTCHAs, and so on (see the session/proxy sketch after this list).
Proficiency with distributed crawling.
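As a minimal sketch of the cookie and IP-pool points above: a requests Session persists cookies across requests, and each request is routed through a proxy drawn from a small pool. The proxy addresses and the example URL are placeholders, not real endpoints.

import random
import requests

# Hypothetical proxy pool; in practice these come from a paid service
# or a self-maintained pool of validated proxies.
PROXY_POOL = [
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
]

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/92.0.4515.107 Safari/537.36',
}

def fetch(url, session):
    # Pick a random proxy per request so one banned IP does not
    # stall the whole crawl.
    proxy = random.choice(PROXY_POOL)
    return session.get(
        url,
        headers=HEADERS,
        proxies={'http': proxy, 'https': proxy},
        timeout=10,
    )

session = requests.Session()   # a Session keeps cookies between requests
# resp = fetch('https://example.com/page', session)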
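For the Scrapy point, a minimal spider sketch; the start URL and CSS selectors are placeholder assumptions, not a real site's markup.

import scrapy

class BoardSpider(scrapy.Spider):
    # Spider name used by `scrapy crawl board`.
    name = 'board'
    # Placeholder start URL; replace with the real board page.
    start_urls = ['https://example.com/board?offset=0']

    def parse(self, response):
        # Illustrative selectors: one <dd> per movie entry.
        for row in response.css('dd'):
            yield {
                'title': row.css('p.name a::attr(title)').get(),
                'score': ''.join(row.css('p.score i::text').getall()),
            }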
import json
import requests
import re
from multiprocessing import Pool
def get_one_page(url, headers):
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    html = response.text
    return html
def parse_one_page(text):
    # Pull each field out of the raw HTML with a regular expression.
    index = re.findall(r'<i class="board-index.*?">(\d+)</i>', text, re.DOTALL)
    images = re.findall(r'<.*?data-src="(.*?)".*?/>', text, re.DOTALL)
    titles = re.findall(r'<p.*?"name".*?title="(.*?)".*?>', text, re.DOTALL)
    actors = re.findall(r'</a>.*?star">.*?:(.*?)\n.*?</p>', text, re.DOTALL)
    times = re.findall(r'</p>.*?<p.*?releasetime">.*?:(.*?)</p>', text, re.DOTALL)
    int_parts = re.findall(r'<p.*?integer">(.*?)</i>', text, re.DOTALL)
    frac_parts = re.findall(r'</i><i.*?fraction">(.*?)</i></p>', text, re.DOTALL)
    # The score is split across two <i> tags; join integer and fraction parts.
    scores = [a + b for a, b in zip(int_parts, frac_parts)]
    informations = []
    for idx, title, actor, release_time, score, image in zip(
            index, titles, actors, times, scores, images):
        informations.append({
            'index': idx,
            'titles': title,
            'actors': actor,
            'time': release_time,  # was parsed but dropped from the dict before
            'scores': score,
            'images': image,
        })
    return informations
def write_to_file(content):
    # Append one JSON record per line; the with-block closes the file.
    with open('result.txt', 'a', encoding='utf-8') as fp:
        fp.write(json.dumps(content, ensure_ascii=False) + '\n')
def main(offset):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                      ' Chrome/92.0.4515.107 Safari/537.36',
        'Connection': 'keep-alive',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
                  'image/avif,image/webp,*/*;q=0.8',
    }
    # The regexes above match a Maoyan-style Top100 board page; this URL is
    # an assumption based on that page's offset pagination.
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url, headers)
    for item in parse_one_page(html):
        write_to_file(item)


if __name__ == '__main__':
    # Crawl the ten pages (offsets 0, 10, ..., 90) in parallel processes.
    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])
Scraping work mainly involves a language's scraping libraries, HTML parsing, and content storage; more complex projects also call for URL deduplication, simulated login, CAPTCHA recognition, multithreading, proxies, and mobile scraping. A deduplication sketch follows.
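A minimal sketch of URL deduplication combined with a thread pool; the URL list and the worker count are illustrative assumptions.

import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

seen = set()                  # fingerprints of URLs already fetched
seen_lock = threading.Lock()  # protects `seen` across worker threads

def url_fingerprint(url):
    # Hash the URL so the dedup set stays small even for long URLs.
    return hashlib.md5(url.encode('utf-8')).hexdigest()

def fetch_once(url):
    fp = url_fingerprint(url)
    with seen_lock:
        if fp in seen:
            return None       # skip URLs we have already crawled
        seen.add(fp)
    return requests.get(url, timeout=10).text

# urls = [...]  # the crawl frontier
# with ThreadPoolExecutor(max_workers=8) as pool:
#     pages = list(pool.map(fetch_once, urls))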