Listing every Zhihu question link

Libraries used

  • requests
  • json

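How it works

  • Import the required libraries
  • Iterate over candidate IDs to enumerate every possible Zhihu question link
  • Check whether the response status code is 200
  • If it is 200, print the link
  • Save the link to a txt file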

Full code

import requests
import json


def get_links():
    # Build the complete list of candidate question URLs before any request is sent.
    links = []
    number = 19550224
    while number < 900000000:
        number = number + 1
        url = 'https://www.zhihu.com/question/' + str(number)
        links.append(url)
    return links

def write_to_file(content):
    # Append one JSON-encoded link per line; the with block closes the file itself.
    with open('19550224.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main():
    links = get_links()
    for link in links:
        # A browser-like User-Agent is sent with every request.
        headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
        response = requests.get(link, headers=headers)
        # A 200 status code is treated as "this question exists".
        if response.status_code == 200:
            print(link)
            write_to_file(link)


if __name__ == '__main__':
    main()
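
Because write_to_file passes each link through json.dumps, every line of the output file is a JSON string, quotes included. The sketch below shows one way to read such a file back into a plain list of links. The function name read_links and the temporary demo file are assumptions made for illustration and are not part of the original script.

```python
import json
import os
import tempfile


def read_links(path):
    """Return the links stored in a file that has one JSON string per line."""
    links = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                links.append(json.loads(line))
    return links


# Demonstration on a temporary file written the same way as write_to_file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'links.txt')
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps('https://www.zhihu.com/question/19550225', ensure_ascii=False) + '\n')
    demo_links = read_links(path)

print(demo_links)  # prints ['https://www.zhihu.com/question/19550225']
```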

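Other notes

You need to know the address of the first Zhihu question. Alternatively, you can start iterating from 10000000 and find it yourself.

https://www.zhihu.com/question/19550225

This is the first question on Zhihu; its ID is 19550225.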
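
Searching for the first ID by iterating upward could be sketched roughly as follows. The names find_first_valid and zhihu_question_exists are hypothetical, and so are the short User-Agent string and the 10-second timeout. The existence check is passed in as a function so that the search loop can be tried offline with a stand-in check.

```python
def find_first_valid(start, exists, limit):
    """Return the first ID in [start, limit) for which exists(id) is true, else None."""
    for question_id in range(start, limit):
        if exists(question_id):
            return question_id
    return None


def zhihu_question_exists(question_id):
    # Hypothetical helper: treats HTTP 200 as "the question exists",
    # the same check the script in this post uses.
    import requests  # imported here so the offline demo below runs without it
    headers = {'User-Agent': 'Mozilla/5.0'}  # assumed, shortened User-Agent
    url = 'https://www.zhihu.com/question/' + str(question_id)
    return requests.get(url, headers=headers, timeout=10).status_code == 200


# Offline demonstration with a stand-in check instead of real requests.
print(find_first_valid(10, lambda n: n >= 15, 20))  # prints 15
```

Against the real site, the call would look like find_first_valid(10000000, zhihu_question_exists, 20000000), which sends one request per candidate ID until the first hit.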

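Optimized version

The status check now runs inside the loop that generates the IDs, so the script no longer has to build the full list of numbers before sending its first request. This avoids holding hundreds of millions of URLs in memory, and results start appearing right away.

In addition, headers is now defined at module level, so the dictionary is no longer rebuilt on every iteration.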

import requests
import json

# Defined once at module level and reused for every request.
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}

def write_to_file(content):
    # Append one JSON-encoded link per line; the with block closes the file itself.
    with open('zhihu.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def main():
    number = 61158073
    # The original condition (number >= 61158073) was always true,
    # so the loop runs until the script is interrupted.
    while True:
        number = number + 1
        url = 'https://www.zhihu.com/question/' + str(number)
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print(url)
            write_to_file(url)

if __name__ == '__main__':
    main()
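
The same "generate one URL, check it, move on" idea can also be written with a generator. This is only a sketch of a variation, not the code used above, and the name question_urls is an assumption. It uses itertools.count to produce IDs without an upper bound and itertools.islice to take just a few of them for the demonstration.

```python
from itertools import count, islice


def question_urls(start):
    """Yield question URLs one at a time, beginning with the ID after start."""
    for number in count(start + 1):
        yield 'https://www.zhihu.com/question/' + str(number)


# Take only the first three URLs; nothing beyond them is ever built.
for url in islice(question_urls(61158073), 3):
    print(url)
```

In main, the loop body would stay the same: each URL from question_urls would be requested and written to the file when the status code is 200.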