Clickhouse Tutorial

Clickhouse tutorial

ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries.

clickhousegif

参照官网教程:https://clickhouse.yandex/tutorial.html

单节点,测试数据为 1987到2015 美国民用航班数据,1.66亿rows

  • single server
  • use USA civil flights data since 1987 till 2015
  • contains 166 millions rows, 63 Gb of uncompressed data

我的测试基于 centos6.9

其他系统安装见:official Docker images of ClickHouse


安装包下载

https://packagecloud.io/altinity/clickhouse
  • clickhouse-server-common
  • clickhouse-server
  • clickhouse-debuginfo
  • clickhouse-compressor

根据操作系统版本 在el / 6 和 el / 7挑选下载,依次上述排列顺序安装;

安装server包时,缺少依赖

1
2
yum install -y openssl
yum install libicu unixODBC

clickhouse-server启动

1
sudo service clickhouse-server start

clickhouse-client 命令

Interactive mode:

1
2
clickhouse-client
clickhouse-client --host=... --port=... --user=... --password=...

Enable multiline queries:

1
2
clickhouse-client -m
clickhouse-client --multiline

Run queries in batch-mode:

1
2
clickhouse-client --query='SELECT 1'
echo 'SELECT 1' | clickhouse-client

Insert data from file of a specified format:

1
2
clickhouse-client --query='INSERT INTO table VALUES' < data.txt
clickhouse-client --query='INSERT INTO table FORMAT TabSeparated' < data.tsv

sample dataset

https://clickhouse.yandex/tutorial.html

创建表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
$ clickhouse-client --multiline
ClickHouse client version 1.1.54292.
Connecting to localhost:9000.
Connected to ClickHouse server version 1.1.54292.
:) CREATE TABLE ontime
(
Year UInt16,
Quarter UInt8,
Month UInt8,
DayofMonth UInt8,
DayOfWeek UInt8,
FlightDate Date,
UniqueCarrier FixedString(7),
AirlineID Int32,
Carrier FixedString(2),
TailNum String,
FlightNum String,
OriginAirportID Int32,
OriginAirportSeqID Int32,
OriginCityMarketID Int32,
Origin FixedString(5),
OriginCityName String,
OriginState FixedString(2),
OriginStateFips String,
OriginStateName String,
OriginWac Int32,
DestAirportID Int32,
DestAirportSeqID Int32,
DestCityMarketID Int32,
Dest FixedString(5),
DestCityName String,
DestState FixedString(2),
DestStateFips String,
DestStateName String,
DestWac Int32,
CRSDepTime Int32,
DepTime Int32,
DepDelay Int32,
DepDelayMinutes Int32,
DepDel15 Int32,
DepartureDelayGroups String,
DepTimeBlk String,
TaxiOut Int32,
WheelsOff Int32,
WheelsOn Int32,
TaxiIn Int32,
CRSArrTime Int32,
ArrTime Int32,
ArrDelay Int32,
ArrDelayMinutes Int32,
ArrDel15 Int32,
ArrivalDelayGroups Int32,
ArrTimeBlk String,
Cancelled UInt8,
CancellationCode FixedString(1),
Diverted UInt8,
CRSElapsedTime Int32,
ActualElapsedTime Int32,
AirTime Int32,
Flights Int32,
Distance Int32,
DistanceGroup UInt8,
CarrierDelay Int32,
WeatherDelay Int32,
NASDelay Int32,
SecurityDelay Int32,
LateAircraftDelay Int32,
FirstDepTime String,
TotalAddGTime String,
LongestAddGTime String,
DivAirportLandings String,
DivReachedDest String,
DivActualElapsedTime String,
DivArrDelay String,
DivDistance String,
Div1Airport String,
Div1AirportID Int32,
Div1AirportSeqID Int32,
Div1WheelsOn String,
Div1TotalGTime String,
Div1LongestGTime String,
Div1WheelsOff String,
Div1TailNum String,
Div2Airport String,
Div2AirportID Int32,
Div2AirportSeqID Int32,
Div2WheelsOn String,
Div2TotalGTime String,
Div2LongestGTime String,
Div2WheelsOff String,
Div2TailNum String,
Div3Airport String,
Div3AirportID Int32,
Div3AirportSeqID Int32,
Div3WheelsOn String,
Div3TotalGTime String,
Div3LongestGTime String,
Div3WheelsOff String,
Div3TailNum String,
Div4Airport String,
Div4AirportID Int32,
Div4AirportSeqID Int32,
Div4WheelsOn String,
Div4TotalGTime String,
Div4LongestGTime String,
Div4WheelsOff String,
Div4TailNum String,
Div5Airport String,
Div5AirportID Int32,
Div5AirportSeqID Int32,
Div5WheelsOn String,
Div5TotalGTime String,
Div5LongestGTime String,
Div5WheelsOff String,
Div5TailNum String
)
ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);

下载数据(3G+,1.6亿条数据):

https://yadi.sk/d/pOZxpa42sDdgm

load data:

xz -v -c -d < ontime.csv.xz | clickhouse-client --query="INSERT INTO ontime FORMAT CSV"

如果只下载部分测试数据

参考 https://clickhouse.yandex/docs/en/getting_started/example_datasets/ontime.html

1
2
3
4
5
6
7
for s in `seq 1987 2017`
do
for m in `seq 1 12`
do
wget http://transtats.bts.gov/PREZIP/On_Time_On_Time_Performance_${s}_${m}.zip
done
done

(from https://github.com/Percona-Lab/ontime-airline-performance/blob/master/download.sh )

1
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --query="INSERT INTO ontime FORMAT CSVWithNames"; done

测试QUERY

  • count 数目:
1
select count(*) from ontime;

  • the most popular destinations in 2015;
1
2
3
4
5
6
SELECT
OriginCityName,
DestCityName,
count(*) AS flights,
bar(flights, 0, 20000, 40)
FROM ontime WHERE Year = 2015 GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 10

  • cities of departure with most frequent delays for 1 hour or longer;
1
2
3
4
5
6
SELECT OriginCityName, count() AS c, avg(DepDelay > 60) AS delays
FROM ontime
GROUP BY OriginCityName
HAVING c > 100000
ORDER BY delays DESC
LIMIT 10

坚持原创技术分享,您的支持将鼓励我继续创作!
分享