Scraping and Analyzing Articles from The New Yorker Website

Posted on 2017-07-01

Overview

  1. Write a crawler with the Scrapy framework to scrape articles from The New Yorker website, including each article's URL, title, author, publication time, body text, and images.
  2. Save the scraped data to a MySQL database.
  3. Display the articles stored in the database with pagination.
  4. For each article body, count the total words, paragraphs, sentences, and vocabulary (distinct words), and compute the average word length (letters per word), average sentence length (words per sentence), and average paragraph length (sentences per paragraph).

Development environment

Python 2.7

PyCharm 2016.2.3

Windows 7, 64-bit

Python 2.7 environment setup

Installing PyCharm

Download it from the official site at http://www.jetbrains.com/pycharm/download/, then run the installer.

Installing the Scrapy framework

Installing Scrapy pulls in a number of dependencies (lxml, pywin32, twisted, pyOpenSSL, zope.interface, and so on), and these are easy to get wrong during installation, so the safest approach is to start from the latest Python 2.7 release. Run pip install scrapy at the cmd prompt; once the install succeeds, type scrapy in cmd. The result looks like this:

Next, run python from cmd to enter the Python console and check each dependency (a combined check script is sketched after the list):

  • Run import lxml; if no error is raised, lxml is installed correctly.
  • Run import twisted; if no error is raised, twisted is installed correctly.
  • Run import OpenSSL; if no error is raised, pyOpenSSL is installed correctly.
  • Run import zope.interface; if no error is raised, zope.interface is installed correctly.
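
For convenience, the same four checks can be bundled into one small script; a minimal sketch (the file name check_deps.py is made up for illustration):

# check_deps.py: a convenience sketch that bundles the import checks listed above
for module in ('lxml', 'twisted', 'OpenSSL', 'zope.interface'):
    try:
        __import__(module)
        print '%s is importable' % module
    except ImportError as err:
        print '%s failed to import: %s' % (module, err)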

Installing the Flask framework

  • Create a folder named myvir on the D: drive (any folder will do), open cmd, and cd into that directory.
  • Install virtualenv with pip install virtualenv.
  • Install Flask with pip install Flask (a quick way to confirm the install is sketched below).
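
A one-file "hello world" application is a quick way to confirm that Flask works inside the new environment; a minimal sketch (the file name hello.py is arbitrary):

# hello.py: a minimal sketch to confirm the Flask install works
from flask import Flask

app = Flask(__name__)


@app.route('/')
def hello():
    return 'Flask is installed'


if __name__ == '__main__':
    app.run(debug=True)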

Development

Implementing the crawler

In cmd, run scrapy startproject newyorker to create a project named newyorker. Scrapy generates the following directory layout:

newyorker/
    scrapy.cfg
    newyorker/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

These files are:
scrapy.cfg: the project's configuration file.

newyorker/: the project's Python module; the code is added here.

newyorker/items.py: the project's item definitions.

newyorker/pipelines.py: the project's pipelines.

newyorker/settings.py: the project's settings.

newyorker/spiders/: the directory that holds the spider code.
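
The spider below imports NewyorkerItem from newyorker/items.py. The original post does not show that file, but a minimal sketch, with field names taken from the spider code, would look like this:

# newyorker/items.py: a minimal sketch; the fields mirror those filled in by new_spider.py
import scrapy


class NewyorkerItem(scrapy.Item):
    article_url = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    article = scrapy.Field()
    image_urls = scrapy.Field()
    w_sum = scrapy.Field()
    s_sum = scrapy.Field()
    p_sum = scrapy.Field()
    v_sum = scrapy.Field()
    a_sum = scrapy.Field()
    avg_w = scrapy.Field()
    avg_s = scrapy.Field()
    avg_p = scrapy.Field()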

We create a new_spider.py file in the newyorker/spiders directory; this file holds the key code for scraping the articles and computing the statistics. The content of new_spider.py is as follows:

# -*- coding:utf-8 -*-
import re
import requests
from scrapy.http import Request
from scrapy.spiders import CrawlSpider
from bs4 import BeautifulSoup
from scrapy.selector import Selector
from newyorker.items import NewyorkerItem


class newyorker(CrawlSpider):
    name = "newyorker"
    allowed_domains = ["newyorker.com"]
    start_urls = ['http://www.newyorker.com/news/daily-comment/']

    def parse(self, response):
        # collect the article links on the listing page
        sel = Selector(response)
        infos = sel.xpath("//main/div/ul/li")
        for info in infos:
            article_url_part = info.xpath("div/h4/a/@href").extract()[0]
            article_url = 'http://www.newyorker.com/' + article_url_part
            yield Request(article_url, meta={'article_url': article_url}, callback=self.parse_item)
        # follow the next listing pages
        urls = ['http://www.newyorker.com/news/daily-comment/page/{}'.format(str(i)) for i in range(1, 10)]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse_item(self, response):
        item = NewyorkerItem()
        item['article_url'] = response.meta['article_url']
        data = requests.get(response.meta['article_url'])
        sel = Selector(response)
        title = sel.xpath("//h1/text()").extract()[0]
        author = sel.xpath("//div/div/div[2]/p/a/text()").extract()[0]
        time = sel.xpath("//hgroup/div[2]/p/text()").extract()[0]
        soup = BeautifulSoup(data.text, 'lxml')
        image_urls = soup.select('figure > div > picture > img')[0].get('srcset') if soup.find_all('picture', 'component-responsive-image') else None
        articles = soup.select('#articleBody p')
        article = [i.text + '<br />' for i in articles]
        article_process = str(article).replace("', '", " ").strip("['").strip("']").strip(" ?").replace('\\xa0', '')
        w_sum = len(re.findall('[a-zA-Z]+', article_process))                 # total words
        s_sum = len(re.findall(r'[.!?]\s?[A-Z]', article_process))            # total sentences (boundary heuristic)
        p_sum = len(article)                                                  # total paragraphs
        v_sum = len(set(re.findall('[a-zA-Z]+', article_process.lower())))    # vocabulary (distinct words)
        a_sum = len(re.findall('[a-zA-Z]', article_process))                  # total letters
        avg_w = round(float(a_sum) / w_sum, 2)   # average word length (letters per word)
        avg_s = round(float(w_sum) / s_sum, 2)   # average sentence length (words per sentence)
        avg_p = round(float(s_sum) / p_sum, 2)   # average paragraph length (sentences per paragraph)
        item['title'] = title                    # title
        item['author'] = author                  # author
        item['time'] = time                      # publication time
        item['article'] = article_process        # body text
        item['image_urls'] = image_urls          # images
        item['w_sum'] = w_sum                    # word count
        item['s_sum'] = s_sum                    # sentence count
        item['p_sum'] = p_sum                    # paragraph count
        item['v_sum'] = v_sum                    # vocabulary size
        item['a_sum'] = a_sum                    # letter count
        item['avg_w'] = avg_w                    # average word length
        item['avg_s'] = avg_s                    # average sentence length
        item['avg_p'] = avg_p                    # average paragraph length
        yield item
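
To make the statistics concrete, here is a small standalone sketch of the same counting logic from parse_item applied to a made-up paragraph. Note that the sentence count is a heuristic: it counts sentence-ending punctuation followed by a capital letter, so the final sentence boundary of a text is missed.

# -*- coding: utf-8 -*-
# stats_demo.py: a standalone sketch of the counting logic in parse_item, run on made-up text
import re

text = "Scrapy crawls the pages. BeautifulSoup parses them. We count words and sentences."

w_sum = len(re.findall('[a-zA-Z]+', text))                  # total words
s_sum = len(re.findall(r'[.!?]\s?[A-Z]', text))             # sentence boundaries followed by a capital
v_sum = len(set(re.findall('[a-zA-Z]+', text.lower())))     # distinct words
a_sum = len(re.findall('[a-zA-Z]', text))                   # total letters

print 'words:', w_sum
print 'sentences:', s_sum
print 'vocabulary:', v_sum
print 'average word length:', round(float(a_sum) / w_sum, 2)
print 'average sentence length:', round(float(w_sum) / s_sum, 2)
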

We also need to connect to the database. Before connecting to MySQL, make sure that pymysql, the Python driver for MySQL, is installed. The database code goes in pipelines.py:

# -*- coding: utf-8 -*-
import pymysql


def dbHandle():
    conn = pymysql.connect(host="localhost",
                           user="root",
                           passwd="root",
                           db="text",
                           port=3306,
                           charset="utf8",
                           use_unicode=False)
    return conn


class newyorkerPipeline(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        cursor.execute("USE text")
        sql = "INSERT INTO newtext(title,author,time,article,image_urls,w_sum,s_sum,p_sum,v_sum,a_sum,avg_w,avg_s,avg_p) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        try:
            cursor.execute(sql, (item['title'], item['author'], item['time'], item['article'],
                                 item['image_urls'], item['w_sum'], item['s_sum'], item['p_sum'],
                                 item['v_sum'], item['a_sum'], item['avg_w'], item['avg_s'], item['avg_p']))
            cursor.connection.commit()
        except BaseException:
            dbObject.rollback()
        return item
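
For Scrapy to actually route items through this pipeline, it also has to be enabled in settings.py; the class path follows the project layout above, and the priority value 300 is just a conventional choice:

# newyorker/settings.py: register the pipeline
ITEM_PIPELINES = {
    'newyorker.pipelines.newyorkerPipeline': 300,
}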

Start the crawl with scrapy crawl newyorker; the result looks like this:

Opening the database shows that the data we wanted, along with the computed statistics, has been stored. Of course, the database and table have to be created beforehand.
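
The original post does not show how the database and table were created; a rough sketch using pymysql, with column names taken from the INSERT statement above and column types assumed, might look like this:

# -*- coding: utf-8 -*-
# create_table.py: a sketch only; the column types are assumptions, not taken from the original post
import pymysql

conn = pymysql.connect(host="localhost", user="root", passwd="root", port=3306, charset="utf8")
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS text DEFAULT CHARACTER SET utf8")
cur.execute("USE text")
cur.execute("""
    CREATE TABLE IF NOT EXISTS newtext (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        author VARCHAR(255),
        time VARCHAR(64),
        article LONGTEXT,
        image_urls TEXT,
        w_sum INT, s_sum INT, p_sum INT, v_sum INT, a_sum INT,
        avg_w FLOAT, avg_s FLOAT, avg_p FLOAT
    )
""")
conn.commit()
conn.close()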

Displaying the data

We create a Flask project with the following structure:

├─.idea
│ └─dictionaries
├─static
│ └─web.css
├─templates
│ └─base.html
│ └─index.html
├─app.cfg
└─app.py

The code in app.py is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import pymysql
from flask import Flask, render_template, g, current_app
from flask_paginate import Pagination, get_page_args
import click

click.disable_unicode_literals_warning = True

app = Flask(__name__)
app.config.from_pyfile('app.cfg')


@app.before_request
def before_request():
    # open a MySQL connection before each request
    g.conn = pymysql.connect(host="localhost",
                             user="root",
                             passwd="root",
                             db="text",
                             port=3306,
                             charset="utf8",
                             use_unicode=False)
    g.cur = g.conn.cursor()


@app.teardown_request
def teardown(error):
    # close the connection after each request
    if hasattr(g, 'conn'):
        g.conn.close()


@app.route('/')
def index():
    g.cur.execute('select count(*) from newtext')
    total = g.cur.fetchone()[0]
    page, per_page, offset = get_page_args(page_parameter='page',
                                           per_page_parameter='per_page')
    sql = 'select title from newtext order by title limit {}, {}'.format(offset, per_page)
    g.cur.execute(sql)
    users = g.cur.fetchall()
    print "------sss", users
    pagination = get_pagination(page=page,
                                per_page=per_page,
                                total=total,
                                record_name='users',
                                format_total=True,
                                format_number=True)
    return render_template('index.html', users=users,
                           page=page,
                           per_page=per_page,
                           pagination=pagination)


@app.route('/users/', defaults={'page': 1})
@app.route('/users', defaults={'page': 1})
@app.route('/users/page/<int:page>/')
@app.route('/users/page/<int:page>')
def users(page):
    g.cur.execute('select count(*) from newtext')
    total = g.cur.fetchone()[0]
    page, per_page, offset = get_page_args()
    sql = 'select title from newtext order by title limit {}, {}'.format(offset, per_page)
    g.cur.execute(sql)
    users = g.cur.fetchall()
    pagination = get_pagination(page=page,
                                per_page=per_page,
                                total=total,
                                record_name='users',
                                format_total=True,
                                format_number=True)
    return render_template('index.html', users=users,
                           page=page,
                           per_page=per_page,
                           pagination=pagination,
                           active_url='users-page-url')


@app.route('/search/<name>/')
@app.route('/search/<name>')
def search(name):
    """The function is used to test multi values url."""
    args = ('%{}%'.format(name), )
    g.cur.execute('select count(*) from newtext where title like %s', args)
    total = g.cur.fetchone()[0]
    page, per_page, offset = get_page_args()
    sql = 'select title from newtext where title like %s limit {}, {}'.format(offset, per_page)
    g.cur.execute(sql, args)
    users = g.cur.fetchall()
    pagination = get_pagination(page=page,
                                per_page=per_page,
                                total=total,
                                record_name='users')
    return render_template('index.html', users=users,
                           page=page,
                           per_page=per_page,
                           pagination=pagination)


def get_css_framework():
    return current_app.config.get('CSS_FRAMEWORK', 'bootstrap3')


def get_link_size():
    return current_app.config.get('LINK_SIZE', 'sm')


def show_single_page_or_not():
    return current_app.config.get('SHOW_SINGLE_PAGE', False)


def get_pagination(**kwargs):
    kwargs.setdefault('record_name', 'records')
    return Pagination(css_framework=get_css_framework(),
                      link_size=get_link_size(),
                      show_single_page=show_single_page_or_not(),
                      **kwargs)


@click.command()
@click.option('--port', '-p', default=5000, help='listening port')
def run(port):
    app.run(debug=True, port=port)


if __name__ == '__main__':
    run()
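
app.cfg is loaded by app.config.from_pyfile() but its contents are not shown in the original post; a plausible minimal version, using only the keys that app.py reads (and the defaults the code falls back to), would be:

# app.cfg: keys and values taken from the config.get() calls in app.py above
CSS_FRAMEWORK = 'bootstrap3'
LINK_SIZE = 'sm'
SHOW_SINGLE_PAGE = False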

Run the project and open 127.0.0.1:5000 in a browser; the result looks like this:

The web app successfully reads the data from the MySQL database and paginates it.

The source code for this project is on my GitHub: link



Original post: spd.dropsec.xyz