python--半自动爬取Leetcode上面的所有题目并转成word打印.md

Posted by lizhao on 07-09,2019

python--半自动爬取Leetcode上面的所有题目并转成word打印

写在前面

最近想做下算法题目,补补基础,在手机上面或者电脑上面做的不是那么的爽,就想着打印下来,然后在每道题目下面写下草稿,后面再用电脑写代码。

但是找了半天都没有整理版的题目,要不然就是有答案的,要不然就是英文的,要不然就是按分类来的,要不然就是不全的。

所以就打算自己写一个,现在的版本是按照难度来划分的,然后后面的连接是整理好的文本。已经分栏分好了。纸张的利用率也是比较高。

话不多说,看下代码吧。

代码逻辑

代码结构:

---leetcode                     文件夹
------leetcode.py               主要的代码
------python-leetcode.html      题目的html
------python-leetcode-easy.html 所有的简单题目
------python-leetcode-hard.html 所有的困难题目
------python-leetcode-m.html    所有的中等题目
------result1-200.txt           部分结果

<tbody class="reactable-data">

代码流程

def run(self):
    self.get_urls()
    self.get_description_by_request()
    self.save_problem()

这个是main函数的主要流程

  • self.get_urls() 是根据python-leetcode.html读取所有的题目url,然后存到数组里面
  • self.get_description_by_request() 获取详细的题目信息
  • self.save_problem() 将题目信息存到文本里面

具体代码

leetcode.py

# -*- coding: UTF-8 -*-

from urllib import request
# 导入BeautifulSoup模块和re模块,re是python中正则表达式的模块. 涓€浜�-涓夊垎-鍦帮紝鐙鍙戝竷
# from BeautifulSoup  import soup
import bs4
import re
import datetime
import aiohttp, asyncio
import requests, os, time
import random


def get_html(url):  # 获取页面的源代码
    page = request.urlopen(url)
    html = page.read()
    html = html.decode('utf-8')
    return html


USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]


# 随机生成user-agent
class RandomUAMiddleware(object):
    def process_request(self, request, spider):
        request.headers["User-Agent"]=random.choice(USER_AGENTS)


def request_url(url):
    print("request_url")
    response = requests.get(url)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response
    else:
        print("请求失败,errcode = %s" % response.status_code)


def request_url_read(url):
    print("request_url_read")
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response
    else:
        print("请求失败,errcode = %s" % response.status_code)
        return None


async def __get_content(link):
    # print("__get_content")
    async with aiohttp.ClientSession() as session:
        response = await session.get(link)
        content = await response.read()
        return content


class AttrDisplay:
    def gatherAttrs(self):
        return " , ".join("{}={}"
                          .format(k, getattr(self, k))
                          for k in self.__dict__.keys())
        # attrs = []
        # for k in self.__dict__.keys():
        #     item = "{}={}".format(k, getattr(self, k))
        #     attrs.append(item)
        # return attrs
        # for k in self.__dict__.keys():
        #     attrs.append(str(k) + "=" + str(self.__dict__[k]))
        # return ",".join(attrs) if len(attrs) else 'no attr'

    def __str__(self):
        return "[{}:{}]".format(self.__class__.__name__, self.gatherAttrs())


class Problem(AttrDisplay):
    def __init__(self):
        self.number = 0
        self.title = ""
        self.link = ""
        self.description = ""

    def format_to_file(self):
        result = str(self.number) + "、" + self.title
        result += "\n"
        description_str = self.description.replace("\n\n", "\n")
        description_str = description_str.replace("&nbsp;", " ")
        description_str = description_str.replace("&minus;", "-")
        description_str = description_str.replace("&quot;", "\"")
        description_str = description_str.replace("&#39;", "\'")
        description_str = description_str.replace("&gt;", ">")
        description_str = description_str.replace("&lt;", "<")
        description_str = description_str.replace("&le;", "<=")
        description_str = description_str.replace("&lsquo;", "\'")
        description_str = description_str.replace("&rsquo;", "\'")
        description_str = description_str.replace("&lfloor;", " ")
        description_str = description_str.replace("&rarr;", "→")
        result += description_str
        result += "\n"
        result += "\n"
        result += "\n"
        result += "\n"
        return result


class MySpider(object):

    # 初始化
    def __init__(self):
        self.base_url = "https://leetcode-cn.com"

        self.headers = {
            'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) '
                          ' AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
        }
        self.num = 1
        print("MySpider init ")
        self.problem_list = []

    def get_urls(self):
        # 打开一个文件
        fo = open("python-leetcode.html", "r", encoding='UTF-8')
        html_contont = fo.read()
        soup = bs4.BeautifulSoup(html_contont, "html.parser")
        # div_main = soup.html.body.find('tbody', {'class': 'reactable-data'})
        tr_list = soup.html.body.find_all('tr')
        for tr in tr_list:
            problem = Problem()
            for index, td in enumerate(tr.children):
                print(index)
                print(td)
                if index == 1:
                    # print(td.string)
                    problem.number = td.string
                if index == 2:
                    problem.link = self.base_url + td.div.a.attrs["href"]
                    problem.title = td.div.a.string

            self.problem_list.append(problem)

    def get_description_by_request(self):
        for index, problem in enumerate(self.problem_list):
            if index > -1:
                print(problem.link)
                response = request_url_read(problem.link)
                if response is None:
                    print("response is None " + str(index))
                    continue

                rep_html = response.text
                soup = bs4.BeautifulSoup(rep_html, "html.parser")

                # 获取题目内容
                meta_description = soup.html.head.find('meta', {'name': 'description'})
                description = meta_description.attrs["content"]
                problem.description = description

                time.sleep(1)

                if index % 10 == 0:
                    time.sleep(4)

    def save_problem(self):
        for problem in self.problem_list:
            fo = open("result.txt", "a", encoding='UTF-8')  # a	打开一个文件用于追加。
            fo.write(str(problem.format_to_file()))
            fo.close()

    def run(self):
        self.get_urls()
        self.get_description_by_request()
        self.save_problem()


def main():
    mySpider = MySpider()
    # links = mySpider.get_page_links("http://m.sfv5.com/shaonv/2018/0425/5436.html")
    # mySpider.download_by_links(links,'好')
    mySpider.run()


if __name__ == "__main__":
    main()

python-leetcode.html 部分

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<tbody class="reactable-data"><tr><td label="[object Object]"></td><td label="[object Object]">899</td><td value="Orderly Queue" label="[object Object]"><div><a href="/problems/orderly-queue">有序队列</a>&nbsp;<span class="fa fa-info-circle title-tooltip" data-toggle="tooltip" data-placement="top" data-original-title="Orderly Queue" aria-hidden="true" style="cursor: pointer;"></span>&nbsp;&nbsp;&nbsp;<span class="badge badge-pill new-question-pill">新</span></div></td><td label="[object Object]"></td><td value="24.087591240875913" label="[object Object]">24.1%</td><td value="[object Object]" label="[object Object]"><span class="label label-danger round">困难</span></td></tr></tbody>
</body>
</html>  

result1-200.txt

1、两数之和
给定一个整数数组和一个目标值,找出数组中和为目标值的两个数。
你可以假设每个输入只对应一种答案,且同样的元素不能被重复利用。
示例:
给定 nums = [2, 7, 11, 15], target = 9
因为 nums[0] + nums[1] = 2 + 7 = 9
所以返回 [0, 1]




7、反转整数
给定一个 32 位有符号整数,将整数中的数字进行反转。
示例1:
输入: 123
输出: 321

示例 2:
输入: -123
输出: -321

示例 3:
输入: 120
输出: 21

注意:
假设我们的环境只能存储 32 位有符号整数,其数值范围是 [-231, 231- 1]。根据这个假设,如果反转后的整数溢出,则返回 0。




9、回文数
判断一个整数是否是回文数。回文数是指正序(从左向右)和倒序(从右向左)读都是一样的整数。
示例 1:
输入: 121
输出: true

示例2:
输入: -121
输出: false
解释: 从左向右读, 为 -121 。 从右向左读, 为 121- 。因此它不是一个回文数。

示例 3:
输入: 10
输出: false
解释: 从右向左读, 为 01 。因此它不是一个回文数。

进阶:
你能不将整数转为字符串来解决这个问题吗?




13、罗马数字转整数

运行结果

result1-200.txt 各个题目中间会有空格

1、两数之和
给定一个整数数组和一个目标值,找出数组中和为目标值的两个数。
你可以假设每个输入只对应一种答案,且同样的元素不能被重复利用。
示例:
给定 nums = [2, 7, 11, 15], target = 9
因为 nums[0] + nums[1] = 2 + 7 = 9
所以返回 [0, 1]




7、反转整数

扩展

这个内容是半自动的,可以根据自己的需求,爬取不同的题目。 可以看到代码中的题目链接是根据python-leetcode.html中的标签内容进行获取的,所以这里是可以定制化的。

这个里面的内容登录https://leetcode-cn.com/problemset/all/,筛选自己的内容找到标签,整个复制并替换。然后重新运行代码就好了

成品结果

有了文本,然后改到word里面,再排版下像这个样子:

下载链接

代码链接:码云链接 成品下载链接:leetcode所有简单题目包括排版