OceanBase 分析函数
简介
分析函数(某些数据库下也叫做窗口函数)与聚合函数类似,计算总是基于一组行的集合,不同的是,聚合函数一组只能返回一行,而分析函数每组可以返回多行,组内每一行都是基于窗口的逻辑计算的结果。分析函数可以显著优化需要 self-join 的查询。
分析函数语法
“窗口”也称为 FRAME,OceanBase 数据库同时支持 ROWS 与 RANGE 两种 FRAME 语义,前者是基于物理行偏移的窗口,后者则是基于逻辑值偏移的窗口。
分析函数语法如下:
analytic_function:
analytic_function([ arguments ]) OVER (analytic_clause)
analytic_clause:
[ query_partition_clause ] [ order_by_clause [ windowing_clause ] ]
query_partition_clause:
PARTITION BY { expr[, expr ]... | ( expr[, expr ]... ) }
order_by_clause:
ORDER [ SIBLINGS ] BY{ expr | position | c_alias } [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, { expr | position | c_alias } [ ASC | DESC ][ NULLS FIRST | NULLS LAST ]]...
windowing_clause:
{ ROWS | RANGE } { BETWEEN { UNBOUNDED PRECEDING | CURRENT ROW | value_expr {
PRECEDING | FOLLOWING } } AND{ UNBOUNDED FOLLOWING | CURRENT ROW | value_expr {
PRECEDING | FOLLOWING } } | { UNBOUNDED PRECEDING | CURRENT ROW| value_expr
PRECEDING}}
SUM/MIN/MAX/COUNT/AVG
声明
SUM 的语法为:SUM([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]
MIN 的语法为:MIN([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]
MAX 的语法为:MAX([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]
COUNT 的语法为:COUNT({ * | [ DISTINCT | ALL ] expr }) [ OVER (analytic_clause) ]
AVG 的语法为:AVG([ DISTINCT | ALL ] expr) [ OVER(analytic_clause) ]
说明
以上分析函数都有对应的聚合函数,其中,SUM
返回 expr
的和,MIN
/MAX
返回 expr
的最小值/最大值,COUNT
返回窗口中查询的行数,AVG
返回 expr
的平均值。
对于 COUNT
函数,如果指定了 expr
,即返回 expr
不为 NULL 的统计个数,如果指定 COUNT(*)
返回所有行的统计数目。
例子
obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.17 sec)
obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.03 sec)
obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.01 sec)
obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient>select last_name, sum(salary) over(partition by job_id) totol_s, min(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s, count(*) over(partition by job_id) count_s from exployees;
+-----------+---------+-------+-------+---------+
| last_name | totol_s | min_s | max_s | count_s |
+-----------+---------+-------+-------+---------+
| jim | 2000 | 2000 | 2000 | 1 |
| mike | 36000 | 11000 | 13000 | 3 |
| lily | 36000 | 11000 | 13000 | 3 |
| tom | 36000 | 11000 | 13000 | 3 |
+-----------+---------+-------+-------+---------+
4 rows in set (0.01 sec)
NTH_VALUE/FIRST_VALUE/LAST_VALUE
声明
NTH_VALUE 的语法为:NTH_VALUE (measure_expr, n) [ FROM { FIRST | LAST } ] [ { RESPECT | IGNORE } NULLS ] OVER (analytic_clause)
FIRST_VALUE 的语法为:FIRST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)
LAST_VALUE 的语法为:LAST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)
说明
NTH_VALUE 函数表示第几个值,方向由 [ FROM { FIRST | LAST } ]
确定,默认为 FROM FIRST
,含有是否忽略 NULL 值的标志。其窗口为统一的 analytic_clause
。这里 n
应该是正数,如果 n
是 NULL,函数将返回错误;如果 n
大于窗口内所有的行数,此函数将返回 NULL。
FIRST_VALUE 和 LAST_VALUE 表示从第一个开始计数或者是从最后一个开始计数。
例子
obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)
obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)
obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.01 sec)
obclient> select last_name, first_value(salary) over(partition by job_id) totol_s, last_value(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s from exployees;
+-----------+---------+-------+-------+
| last_name | totol_s | min_s | max_s |
+-----------+---------+-------+-------+
| jim | 2000 | 2000 | 2000 |
| mike | 12000 | 11000 | 13000 |
| lily | 12000 | 11000 | 13000 |
| tom | 12000 | 11000 | 13000 |
+-----------+---------+-------+-------+
4 rows in set (0.01 sec)
LEAD/LAG
声明
LEAD 的语法为:LEAD { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)
LAG 的语法为:LAG { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)
说明
LEAD 和 LAG 含义为可以在一次查询中取出当前行的同一个字段的前面或后面第 N 行的数据,这种操作可以使用相同表的自连接来实现,但 LEAD/LAG 窗口函数有更高的效率。
其中,value_expr
是要做比对的字段,offset
是 value_expr
的偏移量,default
参数的默认值为 NULL,即如果在 LEAD/LAG 没有显示的设置 default
值的情况下,返回值为 NULL。例如:对 LAG 来说,当前行为 4,offset
值为 6,这时候所要找的数据就是第 -2 行,不存在此行即返回 default
的值。
[ { RESPECT | IGNORE } NULLS ]
的语法为是否考虑 NULL 值,默认为 RESPECT
,考虑 NULL 值。
注意 LEAD/LAG 两个函数后必须有 order_by_clause
,数据应该在一个列上排序之后才能有前多少行后多少行的概念。query_partition_clause
是可选的,如果没有 query_partition_clause
,就是全局的数据。
例子
obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)
obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)
obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)
obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> select last_name, lead(salary) over(order by salary) lead, lag(salary) over(order by salary) lag from exployees;
+-----------+-------+-------+
| last_name | lead | lag |
+-----------+-------+-------+
| jim | 11000 | NULL |
| tom | 12000 | 2000 |
| mike | 13000 | 11000 |
| lily | NULL | 12000 |
+-----------+-------+-------+
4 rows in set (0.01 sec)
STDDEV/VARIANCE/STDDEV_SAMP/STDDEV_POP
声明
VARIANCE 的语法为:VARIANCE([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]
STDDEV 的语法为:STDDEV([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]
STDDEV_SAMP 的语法为:STDDEV_SAMP(expr) [ OVER (analytic_clause) ]
STDDEV_POP 的语法为:STDDEV_POP(expr) [ OVER (analytic_clause) ]
说明
VARIANCE 返回的是 expr
的方差,expr
可能是数值类型或者可以转换成数值类型的类型,方差的类型和输入的值的类型相同。
STDDEV 返回的是 expr
的标准差,参数类型方面和 VARIANCE 的相同。
STDDEV_SAMP 返回的是样本标准差。
STDDEV_POP 返回的是总体标准差。
例子
obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)
obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)
obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)
obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> select last_name, stddev(salary) over(order by salary) std, variance(salary) over(order by salary) var, stddev_pop(salary) over() std_pop, stddev_samp(salary) over() from exployees;
+-----------+-------------------+--------------------+-------------------+----------------------------+
| last_name | std | var | std_pop | stddev_samp(salary) over() |
+-----------+-------------------+--------------------+-------------------+----------------------------+
| jim | 0 | 0 | 4387.482193696061 | 5066.228051190222 |
| tom | 4500 | 20250000 | 4387.482193696061 | 5066.228051190222 |
| mike | 4496.912521077347 | 20222222.222222224 | 4387.482193696061 | 5066.228051190222 |
| lily | 4387.482193696061 | 19250000 | 4387.482193696061 | 5066.228051190222 |
+-----------+-------------------+--------------------+-------------------+----------------------------+
4 rows in set (0.00 sec)
NTILE
声明
NTILE(expr) OVER ([ query_partition_clause ] order_by_clause)
说明
NTILE 函数将分区中已经排序的行划分为大小尽可能相同的指定数量的分组,并返回给每行组号。expr
如果是 NULL,则返回 NULL。
例子
obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)
obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)
obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)
obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> select last_name, ntile(10) over(partition by job_id order by salary) ntl from exployees;
+-----------+------+
| last_name | ntl |
+-----------+------+
| jim | 1 |
| tom | 1 |
| mike | 2 |
| lily | 3 |
+-----------+------+
4 rows in set (0.01 sec)
ROW_NUMBER
声明
ROW_NUMBER( ) OVER ([ query_partition_clause ] order_by_clause)
说明
ROW_NUMBER 函数按照 order_by_clause
子句中指定的行的顺序,为每一行分配一个编号。
例子
obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)
obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)
obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)
obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> select last_name, row_number() over(partition by job_id order by salary) ntl from exployees;
+-----------+------+
| last_name | ntl |
+-----------+------+
| jim | 1 |
| tom | 1 |
| mike | 2 |
| lily | 3 |
+-----------+------+
4 rows in set (0.00 sec)
RANK/DENSE_RANK/PERCENT_RANK
声明
RANK 的语法为:RANK( ) OVER ([ query_partition_clause ] order_by_clause)
DENSE_RANK 的语法为:DENSE_RANK( ) OVER([ query_partition_clause ] order_by_clause)
PERCENT_RANK 的语法为:PERCENT_RANK( ) OVER ([ query_partition_clause ] order_by_clause)
说明
RANK 计算每一行数据在某列上的排序,该列由 order_by_clause
中的列决定。例如,按照 salary 排序可以看出员工的收入排名。
DENSE_RANK 的语义基本和 RANK 函数相同,但是 RANK 的排序中间会有‘跳过’,但是 DENSE_RANK 中不会有。
PERCENT_RANK 的语义基本和 RANK 函数相同,但是 PERCENT_RANK 排序的结果是百分比,计算的是给定行的百分比。
例子
obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.10 sec)
obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)
obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> select last_name, rank() over(partition by job_id order by salary) rank, dense_rank() over(partition by job_id order by salary) dense_rank, percent_rank() over(partition by job_id order by salary) percent_rank from exployees;
+-----------+------+------------+----------------------------------+
| last_name | rank | dense_rank | percent_rank |
+-----------+------+------------+----------------------------------+
| jim | 1 | 1 | 0.000000000000000000000000000000 |
| tom | 1 | 1 | 0.000000000000000000000000000000 |
| mike | 2 | 2 | 0.500000000000000000000000000000 |
| lily | 3 | 3 | 1.000000000000000000000000000000 |
+-----------+------+------------+----------------------------------+
4 rows in set (0.01 sec)
CUME_DIST
声明
CUME_DIST() OVER ([ query_partition_clause ] order_by_clause)
说明
该函数计算一个值的分布,返回值为大于 0 小于等于 1 的值。作为一个分析函数,CUME_DIST 在升序情况下计算比当前行的特定列小的数据的占比。例如如下例子中,按 job_id 分组并在薪水排序的情况下,每行数据在窗口内的排序列上的占比。
例子
obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.10 sec)
obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)
obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)
obclient> select last_name, cume_dist() over(partition by job_id order by salary) cume_dist from exployees;
+-----------+----------------------------------+
| last_name | cume_dist |
+-----------+----------------------------------+
| jim | 1.000000000000000000000000000000 |
| tom | 0.333333333333333333333333333333 |
| mike | 0.666666666666666666666666666667 |
| lily | 1.000000000000000000000000000000 |
+-----------+----------------------------------+
4 rows in set (0.01 sec)