Learn the commonly used PromQL syntax to better understand metrics such as QPS and TP99.
PromQL
In Prometheus’s expression language, an expression or sub-expression can evaluate to one of four types:
- Instant vector - a set of time series containing a single sample for each time series, all sharing the same timestamp
- Range vector - a set of time series containing a range of data points over time for each time series
- Scalar - a simple numeric floating point value
- String - a simple string value; currently unused
gauge
A gauge is a value that can go up and down. It is well suited to instantaneous values and to aggregations such as avg, min, max, and topk. For example:
    # HELP go_goroutines Number of goroutines that currently exist.
counter
A counter is a value that only ever increases, such as the number of RPCs handled or the total processing time.
rate irate increase
    rate(grpc_server_handling_seconds_count{job="grpc-go",grpc_type="unary"}[30s])
    irate(grpc_server_handling_seconds_count{job="grpc-go",grpc_type="unary"}[30s])
    increase(grpc_server_handling_seconds_count{job="grpc-go",grpc_type="unary"}[30s])
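I sent 100 RPCs from a single thread here, so the overall growth is fairly smooth. rate takes the first and last samples within the window (30s here) and divides their difference by the elapsed time, which gives the average per-second growth rate over the window. irate uses only the last two samples in the window, so its result depends on how closely spaced those samples are (the step size here); this is why irate rises and falls sharply at the start and end of the burst. rate settles to a stable value after about 30s, because my client sends requests at a steady pace.
The difference between increase and rate is that increase does not divide by the time interval: it returns the growth itself, not the growth rate. The source code makes this obvious:
    // === rate(node parser.ValueTypeMatrix) Vector ===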
Now look at this:
    rate(grpc_server_handling_seconds_count{job="grpc-go",grpc_type="unary"}[2s])
Because step=1s, a 2s window leaves rate with only the last couple of samples to work with, so its result is the same as irate's.
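The relationship between the three functions can be sketched in a few lines of Python. This is a deliberately simplified model, not the Prometheus implementation: the function names are made up for illustration, the sample data is invented, and the real functions additionally handle counter resets and extrapolate to the window boundaries.

```python
# Each sample is a (timestamp_seconds, counter_value) pair inside one range window.
# Simplified on purpose: no counter-reset handling, no extrapolation.

def simple_increase(samples):
    """Growth between the first and last sample in the window."""
    if len(samples) < 2:
        return None
    return samples[-1][1] - samples[0][1]

def simple_rate(samples):
    """Growth divided by the time between the first and last sample."""
    growth = simple_increase(samples)
    if growth is None:
        return None
    return growth / (samples[-1][0] - samples[0][0])

def simple_irate(samples):
    """Per-second growth between the last two samples only."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Invented data: a 30s window scraped every 10s, with a burst at the end.
window = [(0, 0.0), (10, 20.0), (20, 40.0), (30, 90.0)]
print(simple_increase(window), simple_rate(window), simple_irate(window))

# A window that holds only two samples makes rate and irate coincide.
short_window = window[-2:]
print(simple_rate(short_window), simple_irate(short_window))
```

In this model rate smooths the final burst across the whole window while irate reports only the last interval, which matches the sharper edges seen with irate above.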
avg avg_over_time
avg_over_time(range-vector): the average value of all points in the specified interval.
sum sum_over_time
sum_over_time:
The following functions allow aggregating each series of a given range vector over time and return an instant vector with per-series aggregation results:
- avg_over_time(range-vector): the average value of all points in the specified interval.
- min_over_time(range-vector): the minimum value of all points in the specified interval.
- max_over_time(range-vector): the maximum value of all points in the specified interval.
- sum_over_time(range-vector): the sum of all values in the specified interval.
- count_over_time(range-vector): the count of all values in the specified interval.
- quantile_over_time(scalar, range-vector): the φ-quantile (0 ≤ φ ≤ 1) of the values in the specified interval.
- stddev_over_time(range-vector): the population standard deviation of the values in the specified interval.
- stdvar_over_time(range-vector): the population standard variance of the values in the specified interval.
- last_over_time(range-vector): the most recent point value in specified interval.
- present_over_time(range-vector): the value 1 for any series in the specified interval.
Note that all values in the specified interval have the same weight in the aggregation even if the values are not equally spaced throughout the interval.
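The note about equal weighting is easy to see with unevenly spaced samples. The sketch below is an illustration under assumptions: the sample data is invented, and `time_weighted_avg` is a hypothetical contrast function, not something PromQL provides.

```python
# Each sample is a (timestamp_seconds, value) pair for a single series.

def avg_over_time(samples):
    """Plain mean of the values: every sample counts equally, whatever its spacing."""
    values = [value for _, value in samples]
    return sum(values) / len(values)

def time_weighted_avg(samples):
    """Hypothetical contrast: weight each value by how long it stayed the latest one."""
    weighted = 0.0
    for (t0, v0), (t1, _) in zip(samples, samples[1:]):
        weighted += v0 * (t1 - t0)
    return weighted / (samples[-1][0] - samples[0][0])

# Three samples close together, then one much later.
samples = [(0, 10.0), (1, 10.0), (2, 10.0), (30, 40.0)]
print(avg_over_time(samples))      # unweighted mean
print(time_weighted_avg(samples))  # what a time-weighted mean would give instead
```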
sum:
Prometheus supports the following built-in aggregation operators that can be used to aggregate the elements of a single instant vector, resulting in a new vector of fewer elements with aggregated values:
- sum (calculate sum over dimensions)
- min (select minimum over dimensions)
- max (select maximum over dimensions)
- avg (calculate the average over dimensions)
- group (all values in the resulting vector are 1)
- stddev (calculate population standard deviation over dimensions)
- stdvar (calculate population standard variance over dimensions)
- count (count number of elements in the vector)
- count_values (count number of elements with the same value)
- bottomk (smallest k elements by sample value)
- topk (largest k elements by sample value)
- quantile (calculate φ-quantile (0 ≤ φ ≤ 1) over dimensions)
These operators can either be used to aggregate over all label dimensions or preserve distinct dimensions by including a without or by clause. These clauses may be used before or after the expression.
    <aggr-op> [without|by (<label list>)] ([parameter,] <vector expression>)
or
    <aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]
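See the official documentation for details. In short, sum operates on an instant vector and can group its result by label with by or without.
Both sum and sum_over_time are aggregations; the main difference is what they act on, instant vector versus range vector. sum adds up values across series at a single point in time, so it can total multiple instances, while sum_over_time adds up the samples of each series over the time window and returns one result per series.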
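The instant-versus-range distinction can be sketched as follows. The data shapes, the function names, and the label values are assumptions chosen for illustration; they do not reflect how Prometheus stores or evaluates vectors.

```python
from collections import defaultdict

def sum_by(instant_vector, by):
    """Model of `sum by (...)`: one value per series, summed across series per label group."""
    groups = defaultdict(float)
    for labels, value in instant_vector:
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value
    return dict(groups)

def sum_over_time(range_vector):
    """Model of `sum_over_time`: many values per series, summed within each series."""
    return [(labels, sum(values)) for labels, values in range_vector]

# Invented data: two instances of the same job.
instant = [
    ({"job": "grpc-go", "instance": "a"}, 3.0),
    ({"job": "grpc-go", "instance": "b"}, 4.0),
]
ranged = [
    ({"job": "grpc-go", "instance": "a"}, [1.0, 2.0, 3.0]),
    ({"job": "grpc-go", "instance": "b"}, [2.0, 3.0, 4.0]),
]

print(sum_by(instant, ["job"]))  # the two instances collapse into one group
print(sum_over_time(ranged))     # the two instances stay separate
```

The first call reduces the number of series, the second keeps one result per input series.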
Test results:
    // query
    // query
The _over_time functions behave similarly.
histogram
This is the data returned by the gRPC server's metrics API:
    grpc_server_handling_seconds_bucket{grpc_method="SayHello",grpc_service="proto.DemoService",grpc_type="unary",le="0.005"} 231
The Prometheus integration in grpc-go uses the official Prometheus client's default bucket definition directly:
    DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
This matches the buckets returned by the API above.
Bucket (seconds) | Cumulative count |
---|---|
0-0.005 | 231 |
0-0.01 | 231 |
0-0.025 | 231 |
0-0.05 | 231 |
0-0.1 | 231 |
0-0.25 | 231 |
0-0.5 | 231 |
0-1 | 231 |
0-2.5 | 331 |
0-5 | 331 |
0-10 | 331 |
0-+Inf | 331 |
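Originally my service was the official demo, a simple echo program, so it was very fast and every bucket held 231. After I added sleep(1s), the 100 additional requests each took just over 1 second, so they are counted only in the 2.5 bucket and above (331 = 231 + 100).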
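To connect these buckets to TP99, here is a minimal sketch of estimating a quantile from cumulative bucket counts by linear interpolation within the bucket that contains the target rank. To my understanding this is the same basic idea as PromQL's histogram_quantile, but the function name `estimate_quantile` and its edge-case handling are my own simplification, not the Prometheus code.

```python
import math

# Cumulative buckets from the table above: (upper bound in seconds, cumulative count).
BUCKETS = [
    (0.005, 231), (0.01, 231), (0.025, 231), (0.05, 231),
    (0.1, 231), (0.25, 231), (0.5, 231), (1.0, 231),
    (2.5, 331), (5.0, 331), (10.0, 331), (math.inf, 331),
]

def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets."""
    total = buckets[-1][1]
    if not 0 < q <= 1 or total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: fall back to the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

print(estimate_quantile(0.99, BUCKETS))  # about 2.45 seconds
print(estimate_quantile(0.50, BUCKETS))  # inside the first bucket
```

The slow requests took only a little over 1 second, yet the TP99 estimate lands near 2.45 seconds because the sketch has to assume they are spread evenly across the 1 to 2.5 bucket. This is the sense in which bucket-based quantiles are approximations.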
summary
Summaries also measure events and are an alternative to histograms. They are cheaper, but lose more data. They are calculated on the application level, hence aggregation of metrics from multiple instances of the same process is not possible. They are used when the buckets of a metric are not known beforehand, but it is highly recommended to use histograms over summaries whenever possible.
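A summary is similar to a histogram, but a summary computes its quantiles on the client side, and those quantiles cannot be aggregated across instances. A summary's quantiles are fixed in advance in the client, which reports the value it computed directly instead of an estimate interpolated from buckets. With a histogram, quantiles are computed on the Prometheus server, so the quantile can be chosen freely at query time; the cost is paid in Prometheus's performance, not the client's. Summaries are not the recommended choice, and in practice they are rarely used.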
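Why client-side quantiles cannot be aggregated is easiest to see with numbers. In the sketch below the latency data is invented, and `nearest_rank_quantile` is a simple textbook method used only for illustration; it is not what any Prometheus client library uses.

```python
def nearest_rank_quantile(values, percent):
    """Nearest-rank quantile; percent is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = -(-percent * len(ordered) // 100)  # integer ceiling of percent * n / 100
    return ordered[rank - 1]

# Invented data: a busy fast instance and a quiet slow instance.
instance_a = [0.01] * 100  # 100 requests at 10ms
instance_b = [1.0] * 10    # 10 requests at 1s

p90_a = nearest_rank_quantile(instance_a, 90)
p90_b = nearest_rank_quantile(instance_b, 90)

# Averaging the two per-instance quantiles, as one might try with summaries...
averaged = (p90_a + p90_b) / 2
# ...versus the quantile of all requests taken together.
combined = nearest_rank_quantile(instance_a + instance_b, 90)

print(averaged, combined)
```

The averaged figure is far from the true combined quantile because it ignores how many requests each instance served. Histogram buckets avoid this: the counts can be summed across instances first, and the quantile estimated afterwards.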
reference
https://prometheus.io/docs/prometheus/latest/querying/basics/