Query Patterns
This document demonstrates some typical query patterns that you can achieve in Timeplus to solve various use cases. You can also check the subpages for some of the highlighted query recipes.
Customer Scenario and Data Model
You are the lead business analyst at a carsharing company. Each car is equipped with sensors that report its location. Customers use the mobile app to find available cars nearby, book them, unlock them, and hit the road. At the end of the trip, the customer parks the car, locks it, and ends the trip. Payment proceeds automatically with the registered credit card.
Some of the typical use cases for time-sensitive insights are:
- How many cars are being driven by users in certain locations? Do we need to move some cars from less busy locations to those hot zones?
- Which cars are being driven too fast or running low on fuel? The service team may need to take action.
- Which users keep booking cars and then canceling them? Shall we send real-time notifications to those users to prevent abuse?
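The last question maps to a streaming aggregation. As a minimal sketch, using the bookings stream described later in this document (the 1-hour window and the threshold of 3 cancellations are illustrative assumptions, not part of the data model):

```sql
SELECT window_start, uid, count(*) AS cancel_count
FROM tumble(bookings, 1h)
WHERE action='cancel'
GROUP BY window_start, uid
HAVING cancel_count >= 3
```

Users appearing in the result are candidates for a real-time notification.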
There are multiple data streams in the system:
dim_user_info
A relatively static stream with all registered user information.
Column | Type | Sample Value |
---|---|---|
uid | string | u00001 |
first_name | string | Foo |
last_name | string | Bar |
email | string | a@timeplus.io |
credit_card | string | 371712345678910 |
gender | string | F |
birthday | string | 1990-01-15 |
dim_car_info
A relatively static stream with all registered cars.
Column | Comment | Type | Sample Value |
---|---|---|---|
cid | car ID | string | c00001 |
license_plate_no | | string | KM235L |
in_service | False if the car is suspended (retired or in maintenance) | bool | True |
car_live_data
A data stream with the latest data from car sensors. While the car engine is running, data is reported every second; otherwise, every half hour.
Column | Comment | Type | Sample Value |
---|---|---|---|
time | datetime of the sensor data | datetime | 2022-01-12 23:00:58.476 |
cid | Car ID | string | c00001 |
longitude | current position | float | -73.985195 |
latitude | current position | float | 40.75896 |
gas_percent | percentage of gas level, 100 means full tank | decimal | 86.12 |
speed_kmh | current driving speed in KM/hour | int | 56 |
total_km | this car's total distance in km; keeps increasing across trips | float | 3536 |
locked | whether the car is locked | bool | True |
in_use | whether someone is using the car | bool | True |
bookings
A data stream with booking events. Each row is generated during the booking lifecycle:
- when the user books the car, a new event with action=add, booking_time=now, and expire=now+30m
- when the user unlocks the car, a new event with action=service
- when the user completes the trip and locks the car, a new event with action=end
- when the user cancels the booking, a new event with action=cancel
- when the user extends the booking for another 30 minutes, a new event with action=extend and an updated expire field
- if the user doesn't unlock the car before the expire time, a new event is added with action=expire
Column | Comment | Type | Sample Value |
---|---|---|---|
time | When the event happens | datetime | 2022-01-12 13:00:58.476 |
bid | booking ID | string | b00001 |
booking_time | When the user books the car; expires in 30 min | datetime | 2022-01-12 13:30:58.476 |
uid | User ID | string | u00001 |
cid | Car ID | string | c00001 |
action | One of: add, cancel, extend, service, expire, end | string | add |
expire | When the booking expires | datetime | 2022-01-12 13:30:58.476 |
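To watch the lifecycle described above for a single booking, a minimal tail query can filter on the booking ID (b00001 is the sample value from the table):

```sql
SELECT time, action, expire FROM bookings WHERE bid='b00001'
```

Each lifecycle event (add, service, end, and so on) for that booking shows up as a new row in the streaming result.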
trips
A data stream with trip details and payment info. Each row is generated at the end of the trip
Column | Comment | Type | Sample Value |
---|---|---|---|
tid | Trip ID | string | t00001 |
start_time | When the trip starts | datetime | 2022-01-12 13:00:58.476 |
end_time | When the trip ends | datetime | 2022-01-13 00:00:58.476 |
bid | booking ID | string | b00001 |
start_lon | Start location | float | -73.985195 |
start_lat | Start location | float | 40.75896 |
end_lon | End location | float | -71.985195 |
end_lat | End location | float | 42.75896 |
distance | distance driven in km | float | 23.2 |
amount | how much the user should pay for the trip | decimal | 40.75 |
The following sections show how to query Timeplus to understand the business.
Streaming Analysis
S-TAIL: Showing raw data with or without filter conditions
Use Case: to start the data exploration, the analyst wants to see all recently reported car IoT data
SELECT * FROM car_live_data
Or focus on cars that are almost out of gas (so the service team can be sent to refuel or suspend them):
SELECT time,cid,gas_percent FROM car_live_data WHERE gas_percent < 25
Result:
time | cid | gas_percent |
---|---|---|
2022-01-12 23:00:58.476 | c00001 | 18 |
S-DOWNSAMPLING: Converting detailed data points to high-level data
Use Case: The sensors on each car may report data as often as every half second or as rarely as every 10 seconds. The analyst may want to reduce the granularity and save only per-minute data downstream.
SELECT window_start, cid, avg(gas_percent) AS avg_gas_percent, avg(speed_kmh) AS avg_speed
FROM tumble(car_live_data, 1m) GROUP BY window_start, cid
Result:
window_start | cid | avg_gas_percent | avg_speed |
---|---|---|---|
2022-01-12 23:01:00.000 | c00001 | 34 | 35 |
More practically, the user can create a materialized view to automatically put downsampled data into a new stream/view.
CREATE MATERIALIZED VIEW car_live_data_1min AS
SELECT window_start AS time, cid, avg(gas_percent) AS avg_gas, avg(speed_kmh) AS avg_speed
FROM tumble(car_live_data, 1m) GROUP BY window_start, cid
Then the user can query the data via
SELECT * FROM car_live_data_1min
Result:
time | cid | avg_gas | avg_speed |
---|---|---|---|
2022-01-12 23:01:00.000 | c00001 | 34 | 35 |
S-AGG-RECENT: Showing aggregation for the recent data
Use Case: the analyst wants to monitor the total revenue for the past hour.
Timeplus provides a special syntax to get such results easily:
SELECT sum(amount) FROM trips EMIT LAST 1h
Once the query is submitted, it first shows results based on the data from the past hour, then shows new results in a streaming fashion.
Result:
sum(amount) |
---|
3500.42 |
There are other ways to get similar results, with more verbose queries:
- We can apply a global aggregation to the data in the most recent 1-hour window:
SELECT sum(amount) FROM trips WHERE end_time > date_sub(now(), 1h)
- The other solution is to use a hop window aggregation. Similar to the tumble window in S-DOWNSAMPLING, the data are grouped per fixed-size time window, such as an hour. Tumble windows do not overlap each other, so they are ideal for downsampling without data duplication (for example, for a count aggregation, no data point is counted twice). A hop window, by contrast, slides along the timeline (toward the past or the future) by a fixed step. For example, the following query uses a hop window to get the total revenue for the past 1 hour, emitting the result every second:
SELECT window_start, window_end, sum(amount) FROM hop(trips, end_time, 1s, 1h) GROUP BY window_start, window_end
S-SESSION: analyzing activities with active sessions
Use Case: The analyst wants to track the daily movement of the cars. The sensors on the cars report data every second while the engine is running, and every half hour when the engine is off. If the server doesn't receive data from a running car for 5 seconds, the car is considered disconnected. We can run the following query to show the trip distance for each running car:
SELECT cid,window_start,window_end,max(total_km)-min(total_km) AS trip_km
FROM session(car_live_data, time, 5s, cid)
GROUP BY __tp_session_id, cid, window_start, window_end
HAVING trip_km > 0
Result:
cid | window_start | window_end | trip_km |
---|---|---|---|
c00040 | 2022-03-23 21:42:08.000 | 2022-03-23 21:42:12.000 | 0.05395412226778262 |
c00078 | 2022-03-23 21:42:08.000 | 2022-03-23 21:42:33.000 | 0.4258001818272703 |
More complex queries can be created to aggregate the data by car ID and trip end time.
WITH query_1 AS (
SELECT cid, window_start AS w_start, window_end AS w_end, max(total_km)-min(total_km) AS trip_km
FROM session(car_live_data, time, 20m, cid) GROUP BY __tp_session_id, cid, window_start, window_end
)
SELECT cid, window_start, window_end, sum(trip_km) AS trip_km
FROM tumble(query_1, w_end, 1h) GROUP BY cid, window_start, window_end
Result:
cid | window_start | window_end | trip_km |
---|---|---|---|
c00001 | 2022-01-12 00:00:00.000 | 2022-01-12 23:59:59.999 | 17.2 |
c00002 | 2022-01-12 00:00:00.000 | 2022-01-12 23:59:59.999 | 4.1 |
This is a continuously running streaming query. Every hour (or every day, depending on the tumble window size), the analysis results can be sent to email/Slack or a Kafka topic for further processing.
S-TIME-TRAVEL: Going back to a past time and running analysis since then
Use Case: the analysts don't need to keep watching the streaming charts or dashboards. They can rewind to a past time and run the streaming analysis from that moment. This helps them understand what happened a few hours ago (such as at midnight).
For example, the analyst wants to understand how users booked cars starting 2 hours ago:
SELECT window_start,count(*) FROM tumble(bookings,15m)
WHERE action='add' GROUP BY window_start
EMIT LAST 2h
Or they can specify an exact timestamp, e.g.
SELECT window_start,count(*) FROM tumble(bookings,15m)
WHERE action='add' and _tp_time>='2022-01-12 06:00:00.000' GROUP BY window_start
Result:
window_start | count(*) |
---|---|
2022-01-12 06:00:00.000 | 34 |
2022-01-12 06:15:00.000 | 23 |
Not only is the past data analyzed; the latest incoming data is also processed continuously.
S-MVIEW: Creating a materialized view to keep the latest analysis results and cache them for other systems to query
Use Case: Unlike traditional SQL queries, streaming queries never end until the user cancels them. The analysis results are continuously pushed to the web UI or to Slack/Kafka destinations. The analysts want to run advanced streaming queries in Timeplus and cache the results as a materialized view, so that they can use regular SQL tools/systems to read the streaming insights as regular tables. Materialized views are also useful for downsampling the data to reduce the data volume for future analysis and storage.
CREATE MATERIALIZED VIEW today_revenue as
SELECT sum(amount) FROM trips WHERE end_time > today();
-- in Timeplus or other connected SQL clients
SELECT * FROM today_revenue
S-DROP-LATE: Dropping late events to get real-time aggregation insights
Use Case: The streaming data may arrive late for many reasons, such as network latency, IoT sensor malfunction, etc. When we run streaming analysis (such as payment per minute), we aggregate the data based on its event time (when the payment actually happened, not when Timeplus receives the data), and we don't want to wait for events that arrive too late.
Watermarks are a common mechanism in the stream processing world to set the bar on how late events can be. Unlike other systems, Timeplus makes it very easy to handle late events without explicitly setting a watermark policy.
For a query like this
SELECT window_start,window_end,sum(amount),count(*)
FROM tumble(trips,end_time,1m) GROUP BY window_start,window_end
It will show the total payment every minute, for example
window_start | window_end | sum(amount) | count(*) |
---|---|---|---|
2022-01-12 10:00:00.000 | 2022-01-12 10:01:00.000 | 200 | 42 |
2022-01-12 10:01:00.000 | 2022-01-12 10:02:00.000 | 300 | 67 |
Consider two cars returned at the same time, 10:00:10. Both tripA and tripB are supposed to be counted in the first time window. However, for some reason, the tripA data point arrives in Timeplus at 10:01:15, and the tripB data point arrives at 10:01:16. Timeplus accepts the tripA data, adds it to the first window's aggregation, and closes the first window; the watermark is advanced to 10:01:00. So when the tripB data point arrives, it is considered too late and is not included in the streaming result. But it is still available when we run a historical query.
data point | event time | arrive time | note |
---|---|---|---|
tripA | 2022-01-12 10:00:10.000 | 2022-01-12 10:01:15.000 | included in the 1st window; triggers the watermark change |
tripB | 2022-01-12 10:00:10.000 | 2022-01-12 10:01:16.000 | its event time is below the watermark; the 1st window has been closed (no more data accepted). Dropped for streaming analysis, but still available to historical queries |
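To recover such a dropped event afterwards, a bounded historical query can recompute the first window and pick up tripB. A minimal sketch, assuming Timeplus's table() function to read a stream as a bounded historical table (the timestamps are the sample values above):

```sql
SELECT sum(amount), count(*) FROM table(trips)
WHERE end_time >= '2022-01-12 10:00:00' AND end_time < '2022-01-12 10:01:00'
```

Unlike the streaming query, this scans the stored data once and returns, so both tripA and tripB are counted.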
S-WAIT-LATE: Waiting for extra time for late events
Use Case: as a more advanced variant of the above case, the analysts may want to wait an extra period for late events. This makes the streaming analysis less real-time, but it can include more data points when network latency is unpredictable.
Given a similar scenario, the query with the advanced setting is:
SELECT window_start,window_end,sum(amount),count(*)
FROM tumble(trips,end_time,1m) GROUP BY window_start,window_end
EMIT AFTER WATERMARK AND DELAY 30s