Simulating the Baidu crawler with Snoopy (PHP code)

<p> // Crawler<br />
public function robotAction()<br />
{<br />
$snoopyPath = dirname(dirname(dirname(dirname(dirname(__FILE__))))).DIRECTORY_SEPARATOR.&#39;library&#39;.DIRECTORY_SEPARATOR.&#39;Snoopy&#39;.DIRECTORY_SEPARATOR.&#39;Snoopy.class.php&#39;;<br />
if (is_file($snoopyPath))<br />
{<br />
require_once $snoopyPath;<br />
$snoopy = new Snoopy();<br />
//$httpurl = &quot;http://www.oldpcs.cn/369.php?tupian=zzl3.jpg&quot;;<br />
$httpurl = &quot;http://www.uenu.com/&quot;;<br />
$snoopy-&gt;agent = &quot;Baiduspider+(+http://www.baidu.com/search/spider.htm)&quot;; // spoof the User-Agent as Baiduspider<br />
//$snoopy-&gt;referer = $login; // spoof the referring page (HTTP Referer)<br />
//$snoopy-&gt;referer = &quot;http://83.8.8.8/vcm/index&quot;; // spoof the referring page (HTTP Referer)<br />
$snoopy-&gt;rawheaders[&quot;Pragma&quot;] = &quot;no-cache&quot;; // cache-control HTTP header<br />
$snoopy-&gt;rawheaders[&quot;X-Forwarded-For&quot;] = &quot;8.8.8.8&quot;; // spoof the client IP<br />
//$snoopy-&gt;submit($action, $formvars);<br />
//header(&quot;Content-type: text/html; charset=utf-8&quot;);<br />
$snoopy-&gt;fetch($httpurl);<br />
$info = $snoopy-&gt;results;<br />
//echo &quot;&lt;hr /&gt;$cookies&lt;hr /&gt;&quot;;<br />
print_r($info);<br />
}else{<br />
echo &#39;Deny.&#39;;<br />
}<br />
}</p>
<p>Solutions for getting your IP blocked while scraping<br />
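kekehu / Data scraping / 2010.01.14 / 18:08<br />
I have been writing a lot of scrapers for various sites lately. While scraping one site, after a little over 100 records I suddenly found that the site would no longer open, and guessed that my IP had been blocked. Going through a proxy still got blocked, so that was no solution. I searched online and found almost nothing useful, but persistence paid off: along the way someone mentioned using the USERAGENT of a search-engine crawler. It was only a brief mention, but it was enough to give me the idea. Here are my solutions:</p>
<p>1. Use Snoopy or curl to send a search-engine crawler's USERAGENT value.<br />
For a list of search-engine crawler USERAGENT values, see: <a href="http://www.geekso.com/spdier-useragent/">http://www.geekso.com/spdier-useragent/</a></p>
<p>2. Use Snoopy or curl to send a referer value.<br />
For example, with Snoopy: $snoopy-&gt;referer = &#39;http://www.google.com&#39;;<br />
With curl, as an entry in the header array: $header[] = &quot;Referer: http://www.google.com/&quot;;</p>
<p>3. Use a proxy through Snoopy or curl.<br />
For example: $snoopy-&gt;proxy_host = &quot;59.108.44.41&quot;;<br />
$snoopy-&gt;proxy_port = &quot;3128&quot;;</p>
<p>4. Forge the client IP with Snoopy or curl (this only has an effect on servers that trust the X-Forwarded-For header).<br />
For example: $snoopy-&gt;rawheaders[&#39;X-Forwarded-For&#39;] = &#39;127.0.0.1&#39;;</p>
<p>5. Combine PHP with a program that reboots the router, so that you obtain a new IP address.</p>
<p>6. If you still appear to be blocked after rebooting the router, the likely cause is that your ISP handed back the same IP address, since the address lease is often tied to the router's MAC address (the remote site itself cannot see your MAC address). Most routers now support changing the MAC address, so you can change it by program or by hand to obtain a different IP.</p>
<p>Collected USERAGENT strings of search-engine crawlers<br />
kekehu / Technical resources / 2010.01.14 / 17:52<br />
Baidu crawler<br />
* Baiduspider+(+http://www.baidu.com/search/spider.htm)</p>
<p>Google crawler<br />
* Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)<br />
* Googlebot/2.1 (+http://www.googlebot.com/bot.html)<br />
* Googlebot/2.1 (+http://www.google.com/bot.html)</p>
<p>Yahoo crawler (the crawlers of Yahoo China and of the US headquarters, respectively)<br />
* Mozilla/5.0 (compatible; Yahoo! Slurp China; <a href="http://misc.yahoo.com.cn/help.html">http://misc.yahoo.com.cn/help.html</a>)<br />
* Mozilla/5.0 (compatible; Yahoo! 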
Slurp; <a href="http://help.yahoo.com/help/us/ysearch/slurp">http://help.yahoo.com/help/us/ysearch/slurp</a>)</p>
<p>Sina iAsk crawler<br />
* iaskspider/2.0(+http://iask.com/help/help_index.html)<br />
* Mozilla/5.0 (compatible; iaskspider/1.0; MSIE 6.0)</p>
<p>Sogou crawler<br />
* Sogou web spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)<br />
* Sogou Push Spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)</p>
<p>NetEase (Yodao) crawler<br />
* Mozilla/5.0 (compatible; YodaoBot/1.0; <a href="http://www.yodao.com/help/webmaster/spider/">http://www.yodao.com/help/webmaster/spider/</a>; )</p>
<p>MSN crawler<br />
* msnbot/1.0 (+http://search.msn.com/msnbot.htm)</p>
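<p>The post mentions "Snoopy or curl" throughout but only shows the Snoopy side, apart from the single $header[] line. As a rough sketch of what the curl side might look like, the code below collects the same settings (crawler USERAGENT, Pragma, Referer, X-Forwarded-For, optional proxy) into a header array and hands it to PHP's cURL extension. The function names buildCrawlerHeaders and fetchAsCrawler and the option keys are hypothetical, chosen only for illustration; the proxy address is the one quoted in the post and is unlikely to still be reachable.</p>

```php
<?php
// Hypothetical helper (not part of Snoopy or cURL): builds the list of
// "Name: value" header lines for a request that poses as a search-engine crawler.
function buildCrawlerHeaders(array $opts = array())
{
    // Default to the Baiduspider string from the USERAGENT list in the post.
    $agent = isset($opts['agent'])
        ? $opts['agent']
        : 'Baiduspider+(+http://www.baidu.com/search/spider.htm)';

    $headers = array(
        'User-Agent: ' . $agent,
        'Pragma: no-cache',
    );
    if (!empty($opts['referer'])) {
        $headers[] = 'Referer: ' . $opts['referer'];
    }
    if (!empty($opts['forwarded_for'])) {
        // Only servers that trust this header are affected by it.
        $headers[] = 'X-Forwarded-For: ' . $opts['forwarded_for'];
    }
    return $headers;
}

// Hypothetical wrapper: fetches $url with the headers above.
// Requires the cURL extension; returns the body as a string, or false on failure.
function fetchAsCrawler($url, array $opts = array())
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, buildCrawlerHeaders($opts));
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    if (!empty($opts['proxy'])) {
        // Proxy given as "host:port", e.g. "59.108.44.41:3128" from the post.
        curl_setopt($ch, CURLOPT_PROXY, $opts['proxy']);
    }
    return curl_exec($ch);
}
```

<p>A call such as fetchAsCrawler('http://www.uenu.com/', array('referer' =&gt; 'http://www.google.com/', 'forwarded_for' =&gt; '8.8.8.8')) would then roughly correspond to the Snoopy example. Keeping the header construction in its own function makes it possible to inspect what will be sent without touching the network.</p>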