
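Java Crawlers vs. Python Crawlers

A comparison of Java and Python crawlers:

Python's syntax is simpler, so crawler code written in it is more concise. Java's syntax is stricter than Python's, and the equivalent code is more complex.

Examples follow.

URL request:

Java version: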

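import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// Fetch the page at the given URL and return its body as a string.
public String call(String url) {
    StringBuilder content = new StringBuilder();
    try {
        URLConnection connection = new URL(url).openConnection();
        connection.connect();
        // The page is read as GBK, as in the original example.
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "gbk"))) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append("\n");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return content.toString();
}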

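Python version:

# Python 3; chardet is a third-party package (pip install chardet)
import chardet
from urllib.request import urlopen

url = "http://www.baidu.com"
data = urlopen(url).read()  # raw response bytes
# chardet may fail to detect an encoding and return None, so fall back to UTF-8
code = chardet.detect(data)["encoding"] or "utf-8"
content = data.decode(code, "ignore")
print(content)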
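The encoding detection step can often be skipped when the server declares a charset in its Content-Type header. The following is a minimal sketch using only the standard library, not part of the original comparison; the names `decode_body` and `fetch` and the UTF-8 fallback are assumptions chosen for illustration.

```python
from urllib.request import urlopen


def decode_body(data, declared_charset, fallback="utf-8"):
    """Decode raw bytes with the declared charset, else with the fallback."""
    try:
        return data.decode(declared_charset or fallback, errors="replace")
    except LookupError:
        # The server declared a codec name Python does not know.
        return data.decode(fallback, errors="replace")


def fetch(url, timeout=10):
    """Download a page and decode it using the charset from its headers."""
    with urlopen(url, timeout=timeout) as resp:
        # get_content_charset() returns None when no charset is declared.
        return decode_body(resp.read(), resp.headers.get_content_charset())


if __name__ == "__main__":
    # Network-free demonstration of the decoding step.
    print(decode_body("百度".encode("gbk"), "gbk"))
```

Calling `fetch("http://www.baidu.com")` would then return the decoded page text. When a page declares no charset and is not UTF-8, detection with chardet as shown above remains the more robust option.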

Regular expressions:

Java version:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract every "content":"..." value from the text, one value per line.
public String call(String content) {
    Pattern p = Pattern.compile("content\":\".*?\"");
    Matcher match = p.matcher(content);
    StringBuilder sb = new StringBuilder();
    while (match.find()) {
        String tmp = match.group();
        tmp = tmp.replace("\"", "");        // drop the quotes
        tmp = tmp.replace("content:", "");  // drop the key
        tmp = tmp.replaceAll("<.*>", "");   // strip markup (greedy match)
        sb.append(tmp).append("\n");
    }
    return sb.toString();
}

Python version:

import re

pattern = re.compile(regex)   # regex: the pattern string
group = pattern.findall(text) # text: the string to search
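For a closer like-for-like comparison with the Java method above, here is a sketch of the same extraction in Python. The function name `extract_comments` and the sample input are hypothetical; the patterns mirror the Java ones, except that a capture group replaces the manual removal of quotes and the `content:` key.

```python
import re

# Capture only the value that follows content":" up to the next quote.
CONTENT_RE = re.compile(r'content":"(.*?)"')
# Same greedy tag pattern as the Java version.
TAG_RE = re.compile(r"<.*>")


def extract_comments(text):
    """Return every content value in text, tags removed, one per line."""
    values = CONTENT_RE.findall(text)  # with one group, findall yields the group
    return "".join(TAG_RE.sub("", value) + "\n" for value in values)


if __name__ == "__main__":
    sample = '{"id":1,"content":"good phone<br/>"},{"id":2,"content":"fast delivery"}'
    print(extract_comments(sample), end="")
```

Because the tag pattern is greedy, a value with several tags loses everything between the first `<` and the last `>`, exactly as in the Java version; `<[^>]*>` would remove only the tags themselves.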

Updated: 2023-11-08