Laravel chunk和chunkById的坑 - 旺阳知识图谱

Laravel chunk和chunkById的坑 - 旺阳

本站和网页 https://lqwang.net/13.html 的作者无关，不对其内容负责。快照谨为网络故障时之索引，不代表被搜索网站的即时页面。

Laravel chunk和chunkById的坑 - 旺阳
旺阳
主页
归档
分类
标签
写于2020-01-06
12 分钟读完 (大约 1837 个字)
0次访问
Laravel chunk和chunkById的坑
公司中的项目在逐渐的向Laravel框架进行迁移。在编写定时任务脚本的时候，用到了chunk和chunkById的API，记录一下踩到的坑。
一、前言数据库引擎为innodb。
表结构简述，只列出了本文用到的字段。
字段
类型
注释
id
int(11)
ID
type
int(11)
类型
mark_time
int(10)
标注时间（时间戳）
索引，也只列出需要的部分。
索引名
字段
PRIMARY
id
idx_sid_blogdel_marktime
typeblog_delmark_time
Idx_marktime
mark_time
二、需求每天凌晨一点取出昨天标注type为99的所有数据，进行逻辑判断，然后进行其他操作。本文的重点只在于取数据的阶段。
数据按月分表，每个月表中的数据为1000w上下。
三、chunk处理数据代码如下:
123$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunk(1000, function ($rows){ // 业务处理});
从一个月中的数据，筛选出type为99，并且标注时间在某天的00:00:00-23:59:59的数据。可以使用到mark_time和type的索引。
type为99，一天的数据大概在15-25w上下的样子。使用->get()->toArray()内存会直接炸掉。所以使用chunk方法，每次取出1000条数据。
使用chucnk，不会出现内存不够的情况。但是性能较差。粗略估计，从一月数据中取出最后一天的数据，跑完20w数据大概需要一两分钟。
查看源码，底层的chunk方法，是为sql语句添加了限制和偏移量。
1select * from `users` asc limit 500 offset 500;
在数据较多的时候，越往后的话效率会越慢，因为Mysql的limit方法底层是这样的。
limit 10000，10
是扫描满足条件的10010行，然后扔掉前面的10000行，返回最后最后20行。在数据较多的时候，性能会非常差。
查了下API，对于这种情况Laraverl提供了另一个API chunkById。
四、chunkById 原理使用limit和偏移量在处理大量的数据会有性能的明显下降。于是chunkById使用了id进行分页处理。很好理解，代码如下：
1select * from `users` where `id` > :last_id order by `id` asc limit 500;
API会自动保存最后一个ID，然后通过id > :last_id 再加上limit就可以通过主键索引进行分页。只取出来需要的行数。性能会有明显的提升。
五、chunkById的坑API显示chunk和chunkById的用法完全相同。于是把脚本的代码换成了chunkById。
123$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunkById(1000, function ($rows){ // 业务处理});
在执行脚本的时候，1月2号和1月1号的数据没有任何问题。执行速度快了很多。但是在执行12月31号的数据的时候，发现脚本一直执行不完。
在定位后发现是脚本没有进入业务处理的部分，也就是sql一直没有执行完。当时很疑惑，因为刚才执行的没问题，为什么执行12月31号的就出问题了呢。
于是查看sql服务器中的执行情况。
1show full processlist;
发现了问题。上节说了chunkById的底层是通过id进行order by，然后limie取出一部分一部分的数据，也就是我们预想的sql是这样的。
1select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date limit 500;
explain出来的情况如下：
select_type
type
key
rows
Extra
SIMPLE
Range
idx_marktime
2370258
Using index condition; Using where
实际上的sql是这样的：
1select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date order by id limit 500;
实际explain出来的情况是这样的：
select_type
type
key
rows
Extra
SIMPLE
Index
PRIMARY
4379
Using where
chunkById会自动添加order by id。innodb一定会使用主键索引。那么就不会再使用mark_time的索引了。导致sql执行效率及其缓慢。
六、解决方法再次查看chunkById的源码。
123456789101112131415161718192021222324252627282930313233343536373839404142434445/** * Chunk the results of a query by comparing numeric IDs. * * @param int $count * @param callable $callback * @param string|null $column * @param string|null $alias * @return bool */ public function chunkById($count, callable $callback, $column = null, $alias = null) { $column = is_null($column) ? $this->getModel()->getKeyName() : $column; $alias = is_null($alias) ? $column : $alias; $lastId = null; do { $clone = clone $this; // We'll execute the query for the given page and get the results. If there are // no results we can just break and return from here. When there are results // we will call the callback with the current chunk of these results here. $results = $clone->forPageAfterId($count, $lastId, $column)->get(); $countResults = $results->count(); if ($countResults == 0) { break; } // On each chunk result set, we will pass them to the callback and then let the // developer take care of everything within the callback, which allows us to // keep the memory low for spinning through large result sets for working. if ($callback($results) === false) { return false; } $lastId = $results->last()->{$alias}; unset($results); } while ($countResults == $count); return true; }
能看到这个方法有四个参数count，callback，column，alias。
默认的column为null，第一行会进行默认赋值。
1$column = is_null($column) ? $this->getModel()->getKeyName() : $column;
往下跟:
12345678910111213141516/** * Get the primary key for the model. * * @return string */ public function getKeyName() { return $this->primaryKey; } /** * The primary key for the model. * * @var string */ protected $primaryKey = 'id';
能看到默认的column为id。
进入forPageAfterId方法。
12345678910111213141516171819/** * Constrain the query to the next "page" of results after a given ID. * * @param int $perPage * @param int|null $lastId * @param string $column * @return \Illuminate\Database\Query\Builder|static */ public function forPageAfterId($perPage = 15, $lastId = 0, $column = 'id') { $this->orders = $this->removeExistingOrdersFor($column); if (! is_null($lastId)) { $this->where($column, '>', $lastId); } return $this->orderBy($column, 'asc') ->take($perPage); }
能看到如果lastId不为0则自动添加where语句，还会自动添加order by column。
看到这里就明白了。上文的chunkById没有添加column参数，所以底层自动添加了order by id。走了主键索引，没有使用上mark_time的索引。导致查询效率非常低。
chunkById的源码显示了我们可以传递一个column字段来让底层使用这个字段来order by。
代码修改如下：
123$this->dao->where('type', 99)->whereBetween('mark_time', [$date, $date+86399])->select(array('mark_time', 'id'))->chunkById(1000, function ($rows){ // 业务处理}, 'mark_time');
这样最后执行的sql如下：
1select * from `tabel` where `type` = 99 and mark_time between :begin_date and :end_date order by mark_time limit 500;
再次执行脚本，大概执行一次也就十秒作用了，性能提升显著。
七、总结chunk和chunkById的区别就是chunk是单纯的通过偏移量来获取数据，chunkById进行了优化，不实用偏移量，使用id过滤，性能提升巨大。在数据量大的时候，性能可以差到几十倍的样子。
而且使用chunk在更新的时候，也会遇到数据会被跳过的问题。详见解决Laravel中chunk方法分块处理数据的坑
同时chunkById在你没有传递column参数时，会默认添加order by id。可能会遇到索引失效的问题。解决办法就是传递column参数即可。
本人感觉chunkById不光是根据Id分块，而是可以根据某一字段进行分块，这个字段是可以指定的。叫chunkById有一些误导性，chunkByColumn可能更容易理解。算是自己提的小小的建议。
本文标题：Laravel chunk和chunkById的坑
本文作者：旺阳
本文链接：https://www.lqwang.net/13.html
发布时间：2020-01-06
版权声明：本博客所有文章除特别声明外，均采用 CC BY-NC-SA 4.0 许可协议。转载请注明出处！
Laravel, PHP
喜欢这篇文章？打赏一下作者吧
支付宝
微信
事务隔离级别&数据库锁
phpMyAdmin导出长数字到Excel
评论
旺阳
一个有老婆的程序员
浙江杭州
文章
38
分类
12
标签
20
关注我
目录
一、前言
二、需求
三、chunk处理数据
四、chunkById 原理
五、chunkById的坑
六、解决方法
七、总结
友情链接
leetcode题解
leetcode.wang
KTAO'S BLOG
sktao.github.io
windliang
windliang.wang
分类
Blog
Hexo
WordPress
DolphinScheduler
Java
Jenkins
Kubernetes
12
Fluentd
Prometheus
Linux
Mysql
Personal
最新文章
2022-10-16
Using Prometheus To Monitoring K8s Cluster
Kubernetes / Prometheus
2022-10-06
使用Prometheus监控K8s集群
Kubernetes / Prometheus
2022-08-14
Support E2E Test && Jenkins Build Pod Could not connect to Ryuk
DolphinScheduler / Jenkins / Kubernetes
2022-08-07
支持E2E测试 && Jenkins Build Pod Could not connect to Ryuk
DolphinScheduler / Jenkins / Kubernetes
2022-03-26
Apache DolphinScheduler March Online MeetUp
DolphinScheduler
归档
十月 2022
八月 2022
三月 2022
一月 2022
十二月 2021
十一月 2021
七月 2020
五月 2020
三月 2020
一月 2020
十一月 2019
十月 2019
一月 2019
九月 2018
七月 2018
五月 2018
八月 2017
七月 2017
四月 2017
标签
.NET
Cisco
Docker
DolphinScheduler
E2E
Fluentd
Hexo
Java
Jenkins
Kubernetes
12
Laravel
Linux
Mysql
Nginx
PHP
Prometheus
Python
Scala
Spring
WordPress
旺阳
您是第0位访客
2017-2022 旺阳
Powered by Hexo & Icarus
浙ICP备-19044317号
Your browser is out-of-date!
Update your browser to view this website correctly. Update
my browser now
×