爬虫数据库bug

bug

发布日期: 2021-05-17

更新日期: 2021-05-17

文章字数: 156

阅读次数:

爬虫数据存入数据库bug修改

问题：微博所有数据爬取正常，但存入数据库的数据很少，有时还会重复获取同一时间段的文章，导致微博ID一样，无法存入数据库
解决：修改search.py的parse_by_hour为：

    def parse_by_hour(self, response):
        """以小时为单位筛选"""
        keyword = response.meta.get('keyword')
        is_empty = response.xpath(
            '//div[@class="card card-no-result s-pt20b40"]')
        if is_empty:
            print('当前页面搜索结果为空')
        else:
            # 解析当前页面
            for weibo in self.parse_weibo(response):
                self.check_environment()
                yield weibo
            next_url = response.xpath(
                '//a[@class="next"]/@href').extract_first()
            if next_url:
                next_url = self.base_url + next_url
                yield scrapy.Request(url=next_url,
                                     callback=self.parse_page,
                                     meta={'keyword': keyword})