Let's take a look at web scraper, a magical scraping extension for Chrome

Category: IT

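Mention web scraping and you will surely think of Python, Java, programming, code, regular expressions, debugging, proxies... all sorts of IT skills. BUT I'm a mechanical engineering guy and only an amateur at programming. Picking up a little C/C++ in my spare time has already taken me several years; by the time I learned everything above I'd probably be 70. Is there a magic conch... er, a magic tool? It turns out there is: free, lightweight, easy to use, no programming required, and based on Chrome. It is the web scraper named in the title.

Download

web scraper is a Chrome extension. In theory it should also work in the Chromium-based browsers common in China, such as 360 Extreme Browser (360极速) and Cheetah Browser (猎豹).

I won't go into how to install an extension; if you don't know how, a quick search will tell you.

Usage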

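Suppose I want to scrape the details of the 108 users I follow on Zhihu.

1. After installing the extension, open Chrome and go to https://www.zhihu.com/people/yang-seang/following. This page lists the users I follow. The list spans 6 pages, and a quick look shows the following:

  • Each page shows 20 users
  • The pages have the following URLs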
    • https://www.zhihu.com/people/yang-seang/following?page=1
    • https://www.zhihu.com/people/yang-seang/following?page=2
    • https://www.zhihu.com/people/yang-seang/following?page=3
    • ……
    • https://www.zhihu.com/people/yang-seang/following?page=6
  • So the pattern is clear: the number at the end of the URL is the page number.

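2. Press F12 to open the browser's developer tools; a web scraper tab appears in the panel at the bottom.

Select Create sitemap.

Fill in sitemap name and start URL:

  1. sitemap name: yangss-following
  2. start URL: https://www.zhihu.com/people/yang-seang/following?page=[1-6:1]
  3. When done, click the Create sitemap button at the bottom

The sitemap name can be anything you like, unless you have specific requirements.

The [1-6:1] at the end of the start URL means: scrape page 1 through page 6 in steps of 1, that is, every page.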
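To make the range notation concrete, here is a minimal Python sketch that expands a `[start-end:step]` placeholder into the individual page URLs. It only illustrates how the pattern reads; it is not the extension's own code. The function name `expand_start_url` is made up for this example, treating `:step` as optional is an assumption, and zero-padded ranges are not handled.

```python
import re

# Matches a "[start-end:step]" placeholder; ":step" is treated as optional (assumption).
RANGE_RE = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_start_url(url: str) -> list[str]:
    """Expand the first [start-end:step] placeholder in url into concrete URLs.

    Hypothetical helper for illustration only.
    """
    match = RANGE_RE.search(url)
    if match is None:
        return [url]  # no placeholder: just a single start URL
    start, end = int(match.group(1)), int(match.group(2))
    step = int(match.group(3) or 1)
    prefix, suffix = url[:match.start()], url[match.end():]
    # The end page is inclusive, hence end + 1.
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1, step)]


urls = expand_start_url(
    "https://www.zhihu.com/people/yang-seang/following?page=[1-6:1]"
)
for u in urls:
    print(u)
```

Running it prints the six page URLs listed earlier, from page=1 through page=6.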

4. Since there are quite a few steps, I recorded a short video of the whole process.

 

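Importing an existing configuration

You can also use an existing configuration: choose Import sitemap.

  1. Sitemap JSON: paste in the existing configuration
  2. Rename: enter a name
  3. Click the Import sitemap button at the bottom to import it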
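A configuration copied from a web page can easily pick up a stray line break or lose a bracket, so it may be worth checking the text before pasting it in. The sketch below is one possible check in Python. `check_sitemap` is a hypothetical helper, and the keys it looks for (`_id`, `startUrl`, `selectors`, `parentSelectors`) are simply the ones that appear in the configurations in this post, not an official schema.

```python
import json


def check_sitemap(text: str) -> list[str]:
    """Return a list of problems found in a sitemap JSON string (empty if none).

    Hypothetical helper; the expected keys are taken from the configs in this post.
    """
    try:
        sitemap = json.loads(text)
    except json.JSONDecodeError as exc:
        # A raw line break inside a quoted string also ends up here.
        return [f"not valid JSON: {exc}"]
    if not isinstance(sitemap, dict):
        return ["top level is not a JSON object"]

    problems = [f"missing key: {key}"
                for key in ("_id", "startUrl", "selectors")
                if key not in sitemap]

    selectors = sitemap.get("selectors", [])
    known_ids = {sel.get("id") for sel in selectors} | {"_root"}
    for sel in selectors:
        for parent in sel.get("parentSelectors", []):
            if parent not in known_ids:
                problems.append(
                    f"selector {sel.get('id')!r} has unknown parent {parent!r}"
                )
    return problems


sample = (
    '{"_id":"demo","startUrl":"https://example.com/?page=[1-3:1]",'
    '"selectors":[{"parentSelectors":["_root"],"type":"SelectorElement",'
    '"multiple":true,"id":"list","selector":"div.List-item"}]}'
)
print(check_sitemap(sample))            # no problems found
print(check_sitemap('{"_id": "demo"}'))  # reports the missing keys
```

An empty list means the text at least parses and every selector points at a parent that exists; it says nothing about whether the CSS selectors still match the live page.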

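I spent an afternoon scraping vczh (known on Zhihu as 轮子哥). The results were decent, but it runs single-threaded and is a bit slow: with the delay included, it takes about 5 seconds per user.

Below are the configuration from the video above and a configuration for scraping details of the users that Zhihu's @vczh follows.

Configuration from the walkthrough above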
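For a sense of scale, here is a back-of-the-envelope estimate in Python. It assumes the figures given in this post: 20 users per page, about 5 seconds per user, and the 120 pages that appear in the start URL of the vczh configuration. `estimate_run` is just an illustrative name, and the result is an upper bound because the last page may not be full.

```python
def estimate_run(pages: int, users_per_page: int,
                 seconds_per_user: float) -> tuple[int, float]:
    """Return (upper bound on user count, estimated total seconds)."""
    users = pages * users_per_page
    return users, users * seconds_per_user


# Figures taken from this post; the per-user time is only a rough observation.
users, seconds = estimate_run(pages=120, users_per_page=20, seconds_per_user=5)
hours, remainder = divmod(int(seconds), 3600)
print(f"up to {users} users, roughly {hours} h {remainder // 60} min")
```

That works out to a little over three hours, which matches the "whole afternoon" experience.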

{"startUrl":"https://www.zhihu.com/people/yang-seang/following?page=[1-6:1]","selectors":[{"parentSelectors":["_root"],"type":"SelectorElement","multiple":true,"id":"list","selector":"div.List-item","delay":""},{"parentSelectors":["list"],"type":"SelectorText","multiple":false,"id":"name","selector":"div.UserItem-title a.UserLink-link","regex":"","delay":""},{"parentSelectors":["list"],"type":"SelectorText","multiple":false,"id":"answer","selector":"span.ContentItem-statusItem","regex":"","delay":""},{"parentSelectors":["list"],"type":"SelectorText","multiple":false,"id":"follower-number","selector":"span.ContentItem-statusItem","regex":"","delay":""},{"parentSelectors":["list"],"type":"SelectorText","multiple":false,"id":"describe","selector":"div.RichText","regex":"","delay":""}],"_id":"yangss-followeing"}
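A one-line export like this is hard to read. The selectors form a tree through their `parentSelectors` fields, and the following Python sketch prints that tree. `selector_tree` is a hypothetical helper, and the embedded sample is a trimmed copy of the configuration above (three selectors, start URL omitted), not the full export.

```python
import json
from collections import defaultdict


def selector_tree(sitemap: dict) -> list[str]:
    """Return indented lines describing the selector hierarchy of a sitemap."""
    children = defaultdict(list)
    for sel in sitemap["selectors"]:
        for parent in sel["parentSelectors"]:
            children[parent].append(sel)

    lines = []

    def walk(parent_id: str, depth: int, path: frozenset) -> None:
        for sel in children.get(parent_id, []):
            lines.append(
                f"{'  ' * depth}{sel['id']} ({sel['type']}): {sel['selector']}"
            )
            # Guard against a selector that (directly or indirectly) is its own parent.
            if sel["id"] not in path:
                walk(sel["id"], depth + 1, path | {sel["id"]})

    walk("_root", 0, frozenset())
    return lines


# Trimmed copy of the configuration above, for illustration only.
sitemap = json.loads("""
{"_id": "demo", "selectors": [
  {"parentSelectors": ["_root"], "type": "SelectorElement", "multiple": true,
   "id": "list", "selector": "div.List-item"},
  {"parentSelectors": ["list"], "type": "SelectorText", "multiple": false,
   "id": "name", "selector": "div.UserItem-title a.UserLink-link"},
  {"parentSelectors": ["list"], "type": "SelectorText", "multiple": false,
   "id": "describe", "selector": "div.RichText"}
]}
""")
print("\n".join(selector_tree(sitemap)))
```

The same function can be pointed at the longer vczh configuration to see how its link selector opens each user's profile page and how the remaining selectors hang beneath it.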

Scraping the users vczh follows

{"selectors":[{"parentSelectors":["_root"],"type":"SelectorElement","multiple":false,"id":"div-main","selector":"div.Profile-main","delay":""},{"parentSelectors":["div-main"],"type":"SelectorElement","multiple":false,"id":"div.List","selector":"div.List","delay":""},{"parentSelectors":["div.List"],"type":"SelectorElement","multiple":true,"id":"div.List-item","selector":"div.List-item","delay":""},{"parentSelectors":["div.List-item"],"type":"SelectorLink","multiple":true,"id":"UserLink-link","selector":"div.UserItem-title a.UserLink-link","delay":""},{"parentSelectors":["UserLink-link"],"type":"SelectorElement","multiple":false,"id":"Header-content","selector":"div.ProfileHeader-content","delay":""},{"parentSelectors":["Header-content"],"type":"SelectorText","multiple":false,"id":"User-name","selector":"span.ProfileHeader-name","regex":"","delay":""},{"parentSelectors":["Header-content"],"type":"SelectorElement","multiple":false,"id":"Header-detail","selector":"div.ProfileHeader-detail","delay":""},{"parentSelectors":["Header-detail"],"type":"SelectorText","multiple":false,"id":"location","selector":"div.ProfileHeader-detailItem:nth-of-type(1) div.ProfileHeader-detailValue","regex":"","delay":""},{"parentSelectors":["Header-detail"],"type":"SelectorText","multiple":false,"id":"job","selector":"div.ProfileHeader-detailItem:nth-of-type(2) div.ProfileHeader-detailValue","regex":"","delay":""},{"parentSelectors":["Header-detail"],"type":"SelectorText","multiple":false,"id":"detail","selector":"div.RichText","regex":"","delay":""},{"parentSelectors":["UserLink-link"],"type":"SelectorElement","multiple":false,"id":"ul.Tabs","selector":"div.Card ul.Tabs","delay":""},{"parentSelectors":["ul.Tabs"],"type":"SelectorText","multiple":false,"id":"asswer","selector":"li.Tabs-item:nth-of-type(2) a.Tabs-link","regex":"","delay":""},{"parentSelectors":["UserLink-link"],"type":"SelectorElement","multiple":false,"id":"sideColumnItems","selector":"div.Profile-sideColumnItems","delay":""},{"parentSelectors":["sideColumnItems"],"type":"SelectorText","multiple":false,"id":"sideColumnItem","selector":"div.Profile-sideColumnItem:nth-of-type(1)","regex":"","delay":""},{"parentSelectors":["UserLink-link"],"type":"SelectorElement","multiple":false,"id":"a.Button1","selector":"a.Button","delay":""},{"parentSelectors":["a.Button1"],"type":"SelectorText","multiple":false,"id":"guanzhu","selector":"div.NumberBoard-value","regex":"","delay":""},{"parentSelectors":["UserLink-link"],"type":"SelectorElement","multiple":false,"id":"a.Button2","selector":"a.Button:nth-of-type(2)","delay":""},{"parentSelectors":["a.Button2"],"type":"SelectorText","multiple":false,"id":"guanzhuzhe","selector":"div.NumberBoard-value","regex":"","delay":""}],"startUrl":"https://www.zhihu.com/people/excited-vczh/following?page=[1-120:1]","_id":"vczh-following"}

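Summary

A scraper is just a tool. It lets you collect data from the internet, but what the data is used for is up to each person: some scrape just to have a look (like me), others scrape to make money. A gun is always a gun; what matters is whose hand holds it.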

 

 

 
