Basic Implementation of a Conversion Funnel

Following the funnel implementation from the clickhouse-presentations slides:
https://github.com/yandex/clickhouse-presentations/blob/master/meetup9/funnels.pdf

Test environment: a single machine with an Intel Xeon L5520 and 16 GB of RAM (a rather poor setup 😭)
Sample data: the demo dataset from the Analysys (易观) OLAP contest. Officially it contains 600 million rows covering January and February 2017, but the download I got holds only about 300 million rows, all from January — some data was probably lost during the download...

Test cases

  • For January 2017, compute the user conversion for triggering, in order, "search product", "view product", and "create order", with a time window of 1 day
  • For January and February 2017, compute the user conversion for triggering, in order, "log in", "search product", "view product", "create order", and "pay order", with a time window of 7 days, where the "search product" event has content = Apple and the "view product" event has price > 5000.

The original tasks include a time-window concept that ClickHouse's built-in aggregate functions cannot express, so that part is deferred to a later implementation. Also, the downloaded data has no February rows, and the machine is underpowered: several test runs ran out of memory.

Data processing

The approach for loading the test data into ClickHouse follows the one used by the champion of the open-source track of the Analysys OLAP contest.

Original data layout:

  1. User ID, string
  2. Timestamp, millisecond precision, Long
  3. Event ID, Int, ten events from 10001 to 10010
  4. Event name, string, ten events such as launch, log in, search product
  5. Event properties, a JSON string
  6. Date, string

This layout cannot be imported into ClickHouse as-is, because:

  1. Queries must filter on event properties, and ClickHouse cannot filter by parsing JSON directly, so each property inside the JSON has to become its own column. Columnar storage — that is what ClickHouse is about. (ClickHouse was ultimately not adopted for the production environment; a likely reason is that event properties are user-defined, so the full set of columns cannot be fixed in advance. Just a guess...)
  2. We do not know in advance which event properties appear in the sample data, so a pass over the data is needed to discover all of them before creating the table
  3. ClickHouse does not accept dates like 20170130; they must be converted to 2017-01-30

Preprocessing:

1. Read the data files once to detect the dynamic schema, generating the CREATE TABLE SQL and a model file

go run createtable.go -files=`ls 2017*` | tee create_table.sql

createtable.go can be viewed here

The generated create_table.sql:

CREATE TABLE trend_event
(
user_id UInt32,
timestamp UInt64,
event_id UInt32,
event_name String,
event_tag_brand String,
event_tag_content String,
event_tag_how Int32,
event_tag_page_num Int32,
event_tag_price Int32,
event_tag_price_all Int32,
event_date Date
) engine = MergeTree(event_date, (user_id, timestamp, event_date), 8192);

So the sample data contains only these properties: brand, content, how, page_num, price, price_all.
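The linked createtable.go source is not reproduced here, so the schema-discovery step can only be sketched. A hypothetical version in Go: walk the event-property JSON of every row, collect the set of distinct keys, and guess a column type for each. The field handling and type-inference rules below are assumptions, not the original implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// guessType maps a JSON value to a ClickHouse column type (simplified).
func guessType(v interface{}) string {
	switch v.(type) {
	case float64: // encoding/json decodes all JSON numbers as float64
		return "Int32"
	default:
		return "String"
	}
}

// scanProps collects every property key seen across the JSON blobs,
// prefixed the way the generated table names its columns.
func scanProps(jsonRows []string) map[string]string {
	cols := map[string]string{}
	for _, row := range jsonRows {
		var props map[string]interface{}
		if err := json.Unmarshal([]byte(row), &props); err != nil {
			continue // skip malformed rows
		}
		for k, v := range props {
			cols["event_tag_"+k] = guessType(v)
		}
	}
	return cols
}

func main() {
	rows := []string{
		`{"content":"Apple","price":5000}`,
		`{"brand":"Xiaomi","page_num":2}`,
	}
	cols := scanProps(rows)
	names := make([]string, 0, len(cols))
	for k := range cols {
		names = append(names, k)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Printf("%s %s,\n", n, cols[n])
	}
}
```

Running this over the full sample would surface exactly the six properties listed above.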

2. Transform the data using the model file and write it to the output directory

## If the data does not fit on a single node's disk, spread it across machines by month and adjust `ls 2017XXX` accordingly
output="`pwd`/output"
mkdir -p $output
for f in `ls 2017*`;do
go run index.go -file="$f" -out="`pwd`/output"
done

index.go can be viewed here

The processed data (the JSON objects become column values, and the dates become yyyy-MM-dd):
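The index.go source is likewise only linked, so here is a hedged sketch of the per-row transform it performs: reformat the yyyyMMdd date and explode the property JSON into the fixed column order, ready for TabSeparated output. Column order and empty-value handling are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fixDate turns "20170130" into "2017-01-30".
func fixDate(d string) string {
	if len(d) != 8 {
		return d // leave unexpected values untouched
	}
	return d[:4] + "-" + d[4:6] + "-" + d[6:]
}

// explodeProps renders the JSON properties as tab-separated values in a
// fixed column order, using "" for columns absent from this event.
func explodeProps(raw string, cols []string) string {
	var props map[string]interface{}
	_ = json.Unmarshal([]byte(raw), &props) // malformed JSON yields all-empty columns
	out := make([]string, len(cols))
	for i, c := range cols {
		if v, ok := props[c]; ok {
			out[i] = fmt.Sprint(v)
		}
	}
	return strings.Join(out, "\t")
}

func main() {
	cols := []string{"brand", "content", "how", "page_num", "price", "price_all"}
	fmt.Println(fixDate("20170130"))
	fmt.Println(explodeProps(`{"content":"Apple","price":5000}`, cols))
}
```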

Loading the data

  1. Create the table (create_table.sql from the section above)

    clickhouse-client < create_table.sql
  2. Import the data

    cat ./output/2017* | clickhouse-client --query='INSERT INTO trend_event FORMAT TabSeparated'

Funnel practice 1: event conversion

For January 2017 (01-01 through 01-15), compute the user conversion for triggering, in order, search product (10003), view product (10004), and create order (10007)

sequenceMatch

https://clickhouse.yandex/docs/en/agg_functions/parametric_functions.html#sequencematch-pattern-time-cond1-cond2

sequenceMatch(pattern)(time, cond1, cond2, …)

  • aggregate function
  • Matches sequence of events to pattern
  • returns 1 or 0
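What sequenceMatch('(?1).*(?2).*(?3)') decides for a single user can be sketched as a plain subsequence check: order the (time, event) pairs by time and see whether the step events appear in order, with anything allowed in between (the `.*`). This is an illustrative Go re-implementation, not ClickHouse's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

type event struct {
	ts      int64
	eventID int
}

// matchSequence reports whether the step events occur in order; other
// events may appear between them, matching the `.*` in the pattern.
func matchSequence(events []event, steps []int) bool {
	sort.Slice(events, func(i, j int) bool { return events[i].ts < events[j].ts })
	next := 0
	for _, e := range events {
		if next < len(steps) && e.eventID == steps[next] {
			next++
		}
	}
	return next == len(steps)
}

func main() {
	evs := []event{{1, 10003}, {2, 10005}, {3, 10004}, {4, 10007}}
	fmt.Println(matchSequence(evs, []int{10003, 10004, 10007})) // true
	fmt.Println(matchSequence(evs, []int{10004, 10003}))        // false: wrong order
}
```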
-- the inner query
select
user_id,
max((event_id=10003)) as step1_condition,
sequenceMatch('(?1).*(?2)')(toDateTime(timestamp),(event_id=10003),(event_id=10004)) as step2_condition,
sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(timestamp),(event_id=10003),(event_id=10004),(event_id=10007)) as step3_condition
from
trend_event
where
event_date >= '2017-01-01'
and event_date < '2017-01-15'
group by user_id
limit 10

SELECT
sum(step1_condition) as step1_achieved,
sum(step2_condition) as step2_achieved,
sum(step3_condition) as step3_achieved
FROM (
-- the inner query above
)

Arrays functionality

The queries above pull and compute over the whole table directly. This can be optimized further by removing repeated consecutive events, mainly using groupArray, arrayJoin / ARRAY JOIN, and related functions.

The funnel slide deck illustrates this with a diagram:

https://clickhouse.yandex/docs/en/agg_functions/reference.html#grouparray-x-grouparray-max-size-x
https://clickhouse.yandex/docs/en/functions/array_join.html#functions_arrayjoin

groupArray – aggregation function to create arrays
arrayJoin and ARRAY JOIN to split array into rows
Higher order functions

Step one: group events by user, flatten them in time order into arrays, and drop repeated consecutive events

SELECT
user_id,
groupArray(event_id) AS events,
groupArray(timestamp) AS times,
arrayEnumerate(events) AS events_index,
arrayFilter((name, index) -> ((index = 1) OR (events[index] != events[(index - 1)])), events, events_index) AS events_filt,
arrayFilter((time, index) -> ((index = 1) OR (events[index] != events[(index - 1)])), times, events_index) AS times_filt
FROM
(
SELECT *
FROM trend_event
WHERE event_date = '2017-01-30'
ORDER BY timestamp ASC
)
GROUP BY user_id
LIMIT 10

The columns circled in the screenshot need to be dropped in a later step; the select shown there was also missing user_id...
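The arrayFilter((name, index) -> (index = 1) OR (events[index] != events[index - 1]), ...) trick above keeps an element only when it differs from its predecessor, i.e. it compresses runs of consecutive duplicate events. The same compression in Go, for illustration:

```go
package main

import "fmt"

// dedupConsecutive keeps an element only if it differs from its
// predecessor, mirroring the index = 1 OR events[index] != events[index-1]
// filter in the SQL above.
func dedupConsecutive(events []int) []int {
	out := []int{}
	for i, e := range events {
		if i == 0 || e != events[i-1] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	fmt.Println(dedupConsecutive([]int{10003, 10003, 10003, 10004, 10004, 10007}))
	// [10003 10004 10007]
}
```

Note that only *consecutive* duplicates are removed; a user who searches, views, then searches again keeps both search events, which is exactly what an ordered funnel needs.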

Step two: use ARRAY JOIN to turn the result back into a normal, usable table structure:

select
user_id as userId,
eventID,
EventTimestamp
from (
select
user_id,
groupArray(event_id) as events,
groupArray(timestamp) as times,
arrayEnumerate(events) as events_index,
arrayFilter((name, index) -> (index = 1) or (events[index] != events[index - 1]), events, events_index) as events_filt,
arrayFilter((time, index) -> (index = 1) or (events[index] != events[index - 1]), times, events_index) as times_filt
FROM
(select * from trend_event where event_date='2017-01-30' order by timestamp)
group by user_id
)
ARRAY JOIN
events_filt as eventID,
times_filt as EventTimestamp
limit 10

The final table then looks like the result below — quite a bit less data 😂:

Finally, the effect of the optimization:

The result: after the optimization, the query would not finish at all, so the optimization was wasted 😂 (again, the hardware is too weak — not enough memory, and only a single machine). The query range had to be shortened to a single day...

SELECT
sum(step1_condition) as step1_achieved,
sum(step2_condition) as step2_achieved,
sum(step3_condition) as step3_achieved
FROM (
select
user_id,
max((event_id=10003)) as step1_condition,
sequenceMatch('(?1).*(?2)')(toDateTime(timestamp),(event_id=10003),(event_id=10004)) as step2_condition,
sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(timestamp),(event_id=10003),(event_id=10004),(event_id=10007)) as step3_condition
from
(
select
user_id,
eventID as event_id,
EventTimestamp as timestamp
from (
select
user_id,
groupArray(event_id) as events,
groupArray(timestamp) as times,
arrayEnumerate(events) as events_index,
arrayFilter((name, index) -> (index = 1) or (events[index] != events[index - 1]), events, events_index) as events_filt,
arrayFilter((time, index) -> (index = 1) or (events[index] != events[index - 1]), times, events_index) as times_filt
FROM
(select *
from trend_event
where
event_date >= '2017-01-01' and event_date < '2017-01-02'
order by timestamp)
group by user_id
)
ARRAY JOIN
events_filt as eventID,
times_filt as EventTimestamp
)
group by user_id
)

Funnel practice 2: event conversion + event-property filters

For January and February 2017 (here only 01-01 through 01-07), compute the user conversion for triggering, in order, "log in", "search product", "view product", "create order", and "pay order", with a 7-day time window, where the "search product" event has content = Apple and the "view product" event has price > 5000.

SELECT
sum(step1_condition) as step1_achieved,
sum(step2_condition) as step2_achieved,
sum(step3_condition) as step3_achieved,
sum(step4_condition) as step4_achieved,
sum(step5_condition) as step5_achieved
FROM (
select
user_id,
max((event_id=10002)) as step1_condition,
sequenceMatch('(?1).*(?2)')(toDateTime(timestamp),(event_id=10002),(event_id=10003 and event_tag_content='Apple')) as step2_condition,
sequenceMatch('(?1).*(?2).*(?3)')(toDateTime(timestamp),(event_id=10002),(event_id=10003 and event_tag_content='Apple'),(event_id=10004 and event_tag_price>5000)) as step3_condition,
sequenceMatch('(?1).*(?2).*(?3).*(?4)')(toDateTime(timestamp),(event_id=10002),(event_id=10003 and event_tag_content='Apple'),(event_id=10004 and event_tag_price>5000),(event_id=10007)) as step4_condition,
sequenceMatch('(?1).*(?2).*(?3).*(?4).*(?5)')(toDateTime(timestamp),(event_id=10002),(event_id=10003 and event_tag_content='Apple'),(event_id=10004 and event_tag_price>5000),(event_id=10007),(event_id=10008)) as step5_condition
from
trend_event
where
event_date >= '2017-01-01'
and event_date < '2017-01-08'
group by user_id
)
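Each funnel step in the query above is a boolean condition over a row: an event ID plus optional property filters. The same idea can be sketched in Go by representing steps as predicates and counting how deep a user gets; this is an illustrative model of the query's per-user logic (rows assumed already time-ordered), not the author's implementation.

```go
package main

import "fmt"

// row is a simplified event row: ID plus the two properties filtered on.
type row struct {
	eventID int
	content string
	price   int
}

// step is a per-row predicate, like (event_id=10003 AND event_tag_content='Apple').
type step func(row) bool

// funnelDepth returns how many steps were matched in order over the
// time-ordered rows.
func funnelDepth(rows []row, steps []step) int {
	next := 0
	for _, r := range rows {
		if next < len(steps) && steps[next](r) {
			next++
		}
	}
	return next
}

func main() {
	steps := []step{
		func(r row) bool { return r.eventID == 10002 },                         // log in
		func(r row) bool { return r.eventID == 10003 && r.content == "Apple" }, // search, content=Apple
		func(r row) bool { return r.eventID == 10004 && r.price > 5000 },       // view, price>5000
		func(r row) bool { return r.eventID == 10007 },                         // create order
		func(r row) bool { return r.eventID == 10008 },                         // pay order
	}
	rows := []row{
		{10002, "", 0},
		{10003, "Apple", 0},
		{10004, "", 6000},
		{10007, "", 0},
	}
	fmt.Println(funnelDepth(rows, steps)) // 4: the pay step was never reached
}
```

Summing, over all users, the indicator "depth >= k" for each k gives exactly the step1_achieved..step5_achieved totals of the SQL.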

Summary

Simple funnels are easy with ClickHouse, even on a single machine with hardware this poor. But once the time-window concept is added, the built-in functions no longer suffice, and since ClickHouse does not support UDAFs, the source itself has to be modified. The open-source-track champion, VectorLineX, added a custom aggregate function (UDAF-like) to the source, but the branch they shared has since been removed; the patch deserves further study to see whether it can be understood.
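The core algorithm such a custom aggregate would need is not complicated: the funnel only counts steps that all fall within the window of the step-1 event that started the attempt. A simplified, single-pass sketch in Go (my own illustration of the idea, not VectorLineX's patch):

```go
package main

import (
	"fmt"
	"sort"
)

type ev struct {
	ts      int64 // seconds
	eventID int
}

// windowFunnelDepth returns the deepest step reachable such that every
// matched step falls within `window` seconds of the step-1 event that
// opened that attempt. O(n^2) worst case; fine as an illustration.
func windowFunnelDepth(events []ev, steps []int, window int64) int {
	sort.Slice(events, func(i, j int) bool { return events[i].ts < events[j].ts })
	best := 0
	// try each occurrence of the first step as a window start
	for i, e := range events {
		if e.eventID != steps[0] {
			continue
		}
		depth, start := 1, e.ts
		for _, f := range events[i+1:] {
			if f.ts-start > window {
				break // outside the window; this attempt is over
			}
			if depth < len(steps) && f.eventID == steps[depth] {
				depth++
			}
		}
		if depth > best {
			best = depth
		}
	}
	return best
}

func main() {
	day := int64(86400)
	// search at t=0, view an hour later, order two days later
	evs := []ev{{0, 10003}, {3600, 10004}, {2 * day, 10007}}
	fmt.Println(windowFunnelDepth(evs, []int{10003, 10004, 10007}, day)) // 2: the order came too late
}
```

With a 1-day window the order event above no longer counts, which is exactly the behavior the contest tasks require and plain sequenceMatch cannot express.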
