<p>Simulating the Baidu crawler with Snoopy: PHP code</p>
<p> // Crawler<br />
public function robotAction()<br />
{<br />
$snoopyPath = dirname(dirname(dirname(dirname(dirname(__FILE__))))).DIRECTORY_SEPARATOR.'library'.DIRECTORY_SEPARATOR.'Snoopy'.DIRECTORY_SEPARATOR.'Snoopy.class.php';<br />
if (is_file($snoopyPath))<br />
{<br />
require_once $snoopyPath;<br />
$snoopy = new Snoopy();<br />
//$httpurl = "<a href="http://www.oldpcs.cn/369.php?tupian=zzl3.jpg">http://www.oldpcs.cn/369.php?tupian=zzl3.jpg</a>";<br />
$httpurl = "<a href="http://www.uenu.com/">http://www.uenu.com/</a>";<br />
$snoopy->agent = "Baiduspider+(+<a href="http://www.baidu.com/search/spider.htm">http://www.baidu.com/search/spider.htm</a>)"; // spoof the User-Agent<br />
//$snoopy->referer = $login; // spoof the referring page (HTTP Referer)<br />
//$snoopy->referer = "<a href="http://83.8.8.8/vcm/index">http://83.8.8.8/vcm/index</a>"; // spoof the referring page (HTTP Referer)<br />
$snoopy->rawheaders["Pragma"] = "no-cache"; // cache-related HTTP header<br />
$snoopy->rawheaders["X-Forwarded-For"] = "8.8.8.8"; // spoof the client IP (only affects servers that trust this header)<br />
//$snoopy->submit($action, $formvars);<br />
//header("Content-type: text/html; charset=utf-8");<br />
$snoopy->fetch($httpurl);<br />
$info = $snoopy->results;</p>
<p> //echo "&lt;hr /&gt;$cookies&lt;hr /&gt;";<br />
print_r($info);<br />
} else {<br />
echo 'Deny.';<br />
}<br />
}</p>
<p></p>
<p></p>
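<p>What to do when your IP gets blocked while scraping<br />
kekehu / Data Scraping / 2010.01.14 / 18:08 / 3555 PV<br />
I have been writing quite a few scrapers for various sites lately. While scraping one site, after a little over 100 records the site suddenly stopped loading, and I guessed my IP had been blocked. Switching to a proxy got blocked as well, so that was no real fix. Searching online turned up very little, but persistence paid off: someone briefly mentioned using the USERAGENT of a search engine crawler. It was only a passing remark, but it was enough to give me the idea. Here are the approaches I came up with:</p>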
<p>1. Use Snoopy or curl to send a search engine crawler's USERAGENT value.<br />
For a list of crawler USERAGENT values, see: <a href="http://www.geekso.com/spdier-useragent/">http://www.geekso.com/spdier-useragent/</a></p>
<p>2. Use Snoopy or curl to send a referer value.<br />
For example, with Snoopy: $snoopy->referer = 'http://www.google.com';<br />
With curl, add it to the header array: $header[] = "Referer: <a href="http://www.google.com/">http://www.google.com/</a>";</p>
<p>3. Use a proxy with Snoopy or curl.<br />
For example: $snoopy->proxy_host = "59.108.44.41";<br />
$snoopy->proxy_port = "3128";</p>
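<p>4. Forge the client IP with Snoopy or curl. This only helps against servers that trust the X-Forwarded-For header.<br />
For example: $snoopy->rawheaders['X-Forwarded-For'] = '127.0.0.1';</p>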
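<p>5. Pair the PHP script with a program that restarts the router, so that you are assigned a new IP address (this assumes your ISP hands out dynamic IPs).</p>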
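<p>6. If you still appear blocked after restarting the router, the router has most likely been given the same IP again. A remote site cannot normally see your router's MAC address, but an ISP often ties the assigned IP to it. Routers nowadays let you change the MAC address, so you can change it by hand or with a program and then reconnect to get a different IP.</p>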
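<p>Steps 1 to 4 above only show the Snoopy side, so here is a rough curl sketch of the same settings. The function names <code>buildScrapeOptions</code> and <code>fetchPage</code> and the option keys (<code>user_agent</code>, <code>referer</code>, <code>proxy_host</code>, <code>proxy_port</code>, <code>forwarded_for</code>) are made up for illustration; the default port 3128 is simply the one from the proxy example above. It assumes the PHP curl extension is installed.</p>

```php
<?php
// Hypothetical helper: turns the settings from steps 1-4 into a curl option array.
// Nothing is sent over the network here; it only builds the options.
function buildScrapeOptions(string $url, array $opts = []): array
{
    $headers = ['Pragma: no-cache'];
    if (!empty($opts['forwarded_for'])) {
        // Step 4: only has an effect on servers that trust this header.
        $headers[] = 'X-Forwarded-For: ' . $opts['forwarded_for'];
    }

    $curl = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_TIMEOUT        => 30,
        // Step 1: crawler User-Agent (Baiduspider string assumed as the default).
        CURLOPT_USERAGENT      => $opts['user_agent']
            ?? 'Baiduspider+(+http://www.baidu.com/search/spider.htm)',
        CURLOPT_HTTPHEADER     => $headers,
    ];

    if (!empty($opts['referer'])) {
        // Step 2: referring page.
        $curl[CURLOPT_REFERER] = $opts['referer'];
    }
    if (!empty($opts['proxy_host'])) {
        // Step 3: HTTP proxy.
        $curl[CURLOPT_PROXY]     = $opts['proxy_host'];
        $curl[CURLOPT_PROXYPORT] = (int) ($opts['proxy_port'] ?? 3128);
    }

    return $curl;
}

// Hypothetical wrapper: performs the request and returns the body, or null on failure.
function fetchPage(string $url, array $opts = []): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildScrapeOptions($url, $opts));
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}
```

<p>A call might look like <code>fetchPage('http://www.uenu.com/', ['referer' => 'http://www.google.com/', 'proxy_host' => '59.108.44.41'])</code>. Keeping the option building separate from the request makes it easy to inspect what will be sent before hitting the target site.</p>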
<p><br />
</p>
<p>A collection of search engine crawler USERAGENT strings<br />
kekehu / Tech Resources / 2010.01.14 / 17:52 / 2972 PV<br />
Baidu crawler<br />
* Baiduspider+(+http://www.baidu.com/search/spider.htm)</p>
<p>Google crawler<br />
* Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)<br />
* Googlebot/2.1 (+http://www.googlebot.com/bot.html)<br />
* Googlebot/2.1 (+http://www.google.com/bot.html)</p>
<p>Yahoo crawlers (Yahoo China and the US headquarters, respectively)<br />
* Mozilla/5.0 (compatible; Yahoo! Slurp China; <a href="http://misc.yahoo.com.cn/help.html">http://misc.yahoo.com.cn/help.html</a>)<br />
* Mozilla/5.0 (compatible; Yahoo! Slurp; <a href="http://help.yahoo.com/help/us/ysearch/slurp">http://help.yahoo.com/help/us/ysearch/slurp</a>)</p>
<p>Sina iAsk crawler<br />
* iaskspider/2.0(+http://iask.com/help/help_index.html)<br />
* Mozilla/5.0 (compatible; iaskspider/1.0; MSIE 6.0)</p>
<p>Sogou crawler<br />
* Sogou web spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)<br />
* Sogou Push Spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)</p>
<p>NetEase (Youdao) crawler<br />
* Mozilla/5.0 (compatible; YodaoBot/1.0; <a href="http://www.yodao.com/help/webmaster/spider/">http://www.yodao.com/help/webmaster/spider/</a>; )</p>
<p>MSN crawler<br />
* msnbot/1.0 (+http://search.msn.com/msnbot.htm)</p>
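<p>To make the strings above easier to plug into Snoopy's <code>agent</code> property or curl's <code>CURLOPT_USERAGENT</code>, they could be kept in a small lookup table. This is only a sketch: the constant name <code>CRAWLER_USER_AGENTS</code>, the array keys, and the function <code>crawlerUserAgent</code> are invented for illustration, and the strings are the ones collected in this post, which may no longer match what the crawlers send today.</p>

```php
<?php
// Hypothetical lookup table; keys are illustrative, values come from the list above.
const CRAWLER_USER_AGENTS = [
    'baidu'  => ['Baiduspider+(+http://www.baidu.com/search/spider.htm)'],
    'google' => [
        'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        'Googlebot/2.1 (+http://www.googlebot.com/bot.html)',
        'Googlebot/2.1 (+http://www.google.com/bot.html)',
    ],
    'yahoo'  => [
        'Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)',
        'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
    ],
    'iask'   => [
        'iaskspider/2.0(+http://iask.com/help/help_index.html)',
        'Mozilla/5.0 (compatible; iaskspider/1.0; MSIE 6.0)',
    ],
    'sogou'  => [
        'Sogou web spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)',
        'Sogou Push Spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)',
    ],
    'yodao'  => ['Mozilla/5.0 (compatible; YodaoBot/1.0; http://www.yodao.com/help/webmaster/spider/; )'],
    'msn'    => ['msnbot/1.0 (+http://search.msn.com/msnbot.htm)'],
];

// Returns the USERAGENT string for an engine; $index selects a variant (0 = first).
function crawlerUserAgent(string $engine, int $index = 0): string
{
    $agent = CRAWLER_USER_AGENTS[$engine][$index] ?? null;
    if ($agent === null) {
        throw new InvalidArgumentException("No USERAGENT for '$engine' at index $index");
    }

    return $agent;
}
```

<p>With Snoopy this would be used as <code>$snoopy->agent = crawlerUserAgent('baidu');</code>.</p>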
<p></p>