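A public, share-oriented repository of web crawler examples. The repository keeps 5 final Python scripts, organized by platform:

- examples/99designs/requests_webstructure_99designs_contest.py
- examples/ssrn/selenium_crawl4ai_webstructure_ssrn_paper.py
- examples/reddit/praw_api_reddit_submission_enrich.py
- examples/tiktok/playwright_webstructure_tiktok_creator_marketplace.py
- examples/tiktok/playwright_api_tiktok_capture.py

- The three legacy 99designs scripts have been merged into a single public script that supports four modes: list, brief, entries, and all.
- The two legacy SSRN notebooks have been merged into a single public script that supports three modes: list, detail, and all.
- All public entry points are now .py files.
- Legacy scripts, notebooks, and Node packet-capture files have been removed from the repository.
- Real cookies, tokens, absolute paths, private databases, and email verification-code logic are no longer included.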
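
To illustrate how a single script can expose several modes, here is a minimal sketch of mode dispatch with argparse. It is not the repository's actual implementation: the names `STAGES`, `expand_mode`, `build_parser`, and `run` are hypothetical, and the sketch only assumes that `all` chains list, brief, and entries in that order, as the 99designs usage example in this README suggests.

```python
import argparse

# Hypothetical stage order; assumed from the README, which runs
# list, brief, and entries in sequence for --mode all.
STAGES = ["list", "brief", "entries"]


def expand_mode(mode):
    """Map a --mode value to the ordered list of stages to run."""
    if mode == "all":
        return list(STAGES)
    if mode in STAGES:
        return [mode]
    raise ValueError(f"unknown mode: {mode}")


def build_parser():
    """Build a parser with the flags shown in the README usage examples."""
    parser = argparse.ArgumentParser(description="Mode-dispatch sketch")
    parser.add_argument("--mode", choices=STAGES + ["all"], required=True)
    parser.add_argument("--url", help="listing page URL")
    parser.add_argument("--output", default="output", help="output directory")
    return parser


def run(argv):
    """Parse arguments and return the stages that would be executed."""
    args = build_parser().parse_args(argv)
    executed = []
    for stage in expand_mode(args.mode):
        # A real script would call the scraper for this stage here.
        executed.append(stage)
    return executed
```

Keeping the stage order in one list means that adding a stage only requires touching `STAGES` and the per-stage scraper, while `all` picks it up automatically.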

Dependencies are installed in groups by method; you do not need to install all of them at once.
pip install requests beautifulsoup4
pip install selenium webdriver-manager
pip install crawl4ai
pip install praw
pip install playwright
playwright install chromium

Contest list scraping:
python3 examples/99designs/requests_webstructure_99designs_contest.py \
--mode list \
--url "https://99designs.hk/logo-design/contests?sort=start-date%3Adesc&status=won" \
--output output/99designs

Run list, brief, and entries in sequence:
python3 examples/99designs/requests_webstructure_99designs_contest.py \
--mode all \
--url "https://99designs.hk/logo-design/contests?sort=start-date%3Adesc&status=won" \
--output output/99designs

When a page requires a logged-in session, request headers and cookies can be passed in through JSON files.
For reference, see: https://blog.csdn.net/qingliuun/article/details/131168368
{
  "User-Agent": "Mozilla/5.0 ..."
}

{
  "session_cookie_name": "your_cookie_value"
}

Category list scraping:
python3 examples/ssrn/selenium_crawl4ai_webstructure_ssrn_paper.py \
--mode list \
--input data/ssrn_category_list.csv \
--output output/ssrn \
--headless

Paper detail and author information scraping:
python3 examples/ssrn/selenium_crawl4ai_webstructure_ssrn_paper.py \
--mode detail \
--input output/ssrn/paper_list.csv \
--output output/ssrn

Full pipeline run:
python3 examples/ssrn/selenium_crawl4ai_webstructure_ssrn_paper.py \
--mode all \
--input data/ssrn_category_list.csv \
--output output/ssrn \
--headless

Set the environment variables before running; see .env.example for an example.
export REDDIT_CLIENT_ID=...
export REDDIT_CLIENT_SECRET=...
export REDDIT_USER_AGENT="easywebcrawl-demo"
python3 examples/reddit/praw_api_reddit_submission_enrich.py \
--input data/reddit_submission_ids.csv \
--output output/reddit/reddit_submission_enrich.csv

Page-structure scraping example:
python3 examples/tiktok/playwright_webstructure_tiktok_creator_marketplace.py \
--url "https://seller-us.tiktok.com/creator-marketplace" \
--output output/tiktok/creator_marketplace.csv

API response capture example:
python3 examples/tiktok/playwright_api_tiktok_capture.py \
--target-url "https://seller-us.tiktok.com/creator-marketplace" \
--url-includes "/api/creator" \
--output output/tiktok/captured_api_responses.json

Output files:

- output/99designs/contest_list.csv
- output/99designs/contest_brief.csv
- output/99designs/contest_entries.csv
- output/ssrn/paper_list.csv
- output/ssrn/paper_detail.csv
- output/ssrn/author_info.json
- output/reddit/reddit_submission_enrich.csv
- output/tiktok/creator_marketplace.csv
- output/tiktok/captured_api_responses.json
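
The API capture example filters network responses by a URL substring (`--url-includes`) and writes the matches to a JSON file. The sketch below shows one way this could work with Playwright's sync API. It is an assumption about the approach, not the repository's code: the helper names `url_matches`, `make_handler`, and `capture`, the shape of the saved records, and the fixed 5-second wait are all hypothetical.

```python
import json
from pathlib import Path


def url_matches(url, needle):
    """Return True when the response URL contains the filter substring."""
    return needle in url


def make_handler(needle, captured):
    """Build a callback that stores the JSON body of matching responses."""
    def handle(response):
        if not url_matches(response.url, needle):
            return
        try:
            body = response.json()
        except Exception:
            # Non-JSON or unreadable bodies are skipped in this sketch.
            return
        captured.append({"url": response.url, "body": body})
    return handle


def capture(target_url, needle, output):
    """Open the page, record matching responses, and save them as JSON."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    captured = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Register the listener before navigating so early requests are seen.
        page.on("response", make_handler(needle, captured))
        page.goto(target_url)
        page.wait_for_timeout(5000)  # arbitrary wait for background requests
        browser.close()

    out = Path(output)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(captured, ensure_ascii=False, indent=2),
                   encoding="utf-8")
```

A call such as `capture("https://seller-us.tiktok.com/creator-marketplace", "/api/creator", "output/tiktok/captured_api_responses.json")` would mirror the command-line example. Pages that require login need an authenticated browser context, which this sketch does not set up.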

A more detailed introduction in Chinese is available at: