How Do I Grab The “next” Event When The Offset Is Variable For Items That Can Be Repeatedly Processed?
Solution 1:
This is a gaps-and-islands problem, but the islands being defined by a REQ transaction make it a bit more complicated than some.
You could use nested lead and lag functions and some manipulation to get what you need:
select distinct item,
       coalesce(start_tran,
                lag(start_tran) over (partition by item order by timestamp)) as start_tran,
       coalesce(end_tran,
                lead(end_tran) over (partition by item order by timestamp)) as end_tran,
       coalesce(end_time,
                lead(end_time) over (partition by item order by timestamp))
         - coalesce(start_time,
                    lag(start_time) over (partition by item order by timestamp)) as time
from (
  select item, timestamp, start_tran, start_time, end_tran, end_time
  from (
    select item,
           timestamp,
           case when lag_tran is null or transaction like 'REQ%'
                then transaction end as start_tran,
           case when lag_tran is null or transaction like 'REQ%'
                then timestamp end as start_time,
           case when lead_tran is null or lead_tran like 'REQ%'
                then transaction end as end_tran,
           case when lead_tran is null or lead_tran like 'REQ%'
                then timestamp end as end_time
    from (
      select item, transaction, timestamp,
             lag(transaction)
               over (partition by item order by timestamp) as lag_tran,
             lead(transaction)
               over (partition by item order by timestamp) as lead_tran
      from transactions
    )
  )
  where start_tran is not null or end_tran is not null
)
order by item, start_tran;
With additional records for a second cycle for items 1 and 2 that could give:
      ITEM START_TRAN END_TRAN   TIME
---------- ---------- ---------- -----------
         1 REQ-A      PICKUP     01:53:30.0
         1 REQ-E      PICKUP     01:23:30.0
         2 REQ-B      MAIL       00:24:13.0
         2 REQ-F      REQ-F      00:0:0.0
         3 REQ-C      PICKUP     01:46:30.0
         4 REQ-D      PULL       00:23:59.0
         5 REQ-A      PICKUP     01:43:59.0

SQL Fiddle showing all the intermediate steps.
It's not quite as scary as it might look at first glance. The innermost query takes the raw data and adds an extra column for the lead and lag transactions. Taking just the first set of item-1 records that would be:
ITEM TRANSACTION TIMESTAMP            LAG_TRAN LEAD_TRAN
---- ----------- -------------------- -------- ---------
   1 REQ-A       2014-07-31T09:51:32Z          PULL
   1 PULL        2014-07-31T10:22:21Z REQ-A    TRANSFER
   1 TRANSFER    2014-07-31T10:22:23Z PULL     ARRIVE
   1 ARRIVE      2014-07-31T11:45:01Z TRANSFER PICKUP
   1 PICKUP      2014-07-31T11:45:02Z ARRIVE   REQ-E

Notice REQ-E popping up as the last lead_tran? That's the first transaction for the second cycle of records for this item, and it is going to be useful later. The next level of query uses those lead and lag values, treating REQ values as start and end markers, to null out everything except the first and last record for each cycle.
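If it helps to see what the innermost query is doing outside SQL, here is a small Python sketch (my own illustration, not the Oracle query) of attaching each row's previous (lag) and next (lead) transaction within an item, ordered by timestamp. The REQ-E timestamp is a hypothetical placeholder for the second cycle; the original data only shows REQ-E as the final lead_tran.

```python
# Item 1's transactions in timestamp order; the REQ-E timestamp is a
# made-up placeholder for the start of the second cycle.
rows = [
    ("REQ-A",    "2014-07-31T09:51:32Z"),
    ("PULL",     "2014-07-31T10:22:21Z"),
    ("TRANSFER", "2014-07-31T10:22:23Z"),
    ("ARRIVE",   "2014-07-31T11:45:01Z"),
    ("PICKUP",   "2014-07-31T11:45:02Z"),
    ("REQ-E",    "2014-07-31T12:30:00Z"),  # hypothetical timestamp
]

def with_lead_lag(rows):
    """Return (transaction, timestamp, lag_tran, lead_tran) per row,
    mimicking lag()/lead() over one item's partition."""
    out = []
    for i, (tran, ts) in enumerate(rows):
        lag_tran = rows[i - 1][0] if i > 0 else None
        lead_tran = rows[i + 1][0] if i + 1 < len(rows) else None
        out.append((tran, ts, lag_tran, lead_tran))
    return out
```

The first row has a null lag_tran (no earlier row in the partition) and PICKUP's lead_tran is REQ-E, exactly as in the table above.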
ITEM TIMESTAMP            START_TRAN START_TIME           END_TRAN END_TIME
---- -------------------- ---------- -------------------- -------- --------------------
   1 2014-07-31T09:51:32Z REQ-A      2014-07-31T09:51:32Z
   1 2014-07-31T10:22:21Z
   1 2014-07-31T10:22:23Z
   1 2014-07-31T11:45:01Z
   1 2014-07-31T11:45:02Z                                 PICKUP   2014-07-31T11:45:02Z

The next level of query removes any rows which do not represent the start or end (or both - see REQ-F in the Fiddle), as we aren't interested in them:
ITEM TIMESTAMP            START_TRAN START_TIME           END_TRAN END_TIME
---- -------------------- ---------- -------------------- -------- --------------------
   1 2014-07-31T09:51:32Z REQ-A      2014-07-31T09:51:32Z
   1 2014-07-31T11:45:02Z                                 PICKUP   2014-07-31T11:45:02Z

We now have pairs of rows for each cycle (or a single row for REQ-F). The final level uses lead and lag again to fill in the blanks: if start_tran is null then this is an end row and we should use the previous row's start data; if end_tran is null then this is a start row and we should use the next row's end data.
ITEM START_TRAN START_TIME           END_TRAN END_TIME             TIME
---- ---------- -------------------- -------- -------------------- ----------
   1 REQ-A      2014-07-31T09:51:32Z PICKUP   2014-07-31T11:45:02Z 01:53:30.0
   1 REQ-A      2014-07-31T09:51:32Z PICKUP   2014-07-31T11:45:02Z 01:53:30.0

That makes both rows the same, so the distinct removes the duplicates.
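For reference, the overall logic of the query can be sketched in a few lines of Python (my own illustration, not part of the answer): walk one item's transactions in timestamp order, open a cycle at the first row or at any REQ row, and close it when the following row is a REQ or there is no following row.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

def cycles(rows):
    """rows: (transaction, iso_timestamp) pairs for one item, in order.
    Returns one (start_tran, end_tran, elapsed) tuple per REQ-delimited
    cycle, matching the final output of the SQL above."""
    out, start = [], None
    for i, (tran, ts) in enumerate(rows):
        if start is None or tran.startswith("REQ"):
            start = (tran, ts)  # first row, or a new REQ, opens a cycle
        nxt = rows[i + 1][0] if i + 1 < len(rows) else None
        if nxt is None or nxt.startswith("REQ"):  # this row closes the cycle
            elapsed = (datetime.strptime(ts, FMT)
                       - datetime.strptime(start[1], FMT))
            out.append((start[0], tran, elapsed))
            start = None
    return out
```

A lone REQ row (like REQ-F above) opens and closes its own cycle with zero elapsed time, just as the SQL produces.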
Solution 2:
This should give you the same result. I am reproducing Gordon's answer, which still holds good:
select item,
       min(transaction) keep (dense_rank first order by timestamp) as StartTx,
       min(transaction) keep (dense_rank last order by timestamp) as EndTx,
       max(timestamp) - min(timestamp)
from transactions t
group by item;
Even if you have duplicate transactions, they are handled by the KEEP (DENSE_RANK ...) clause. See the documentation for DENSE_RANK and the keyword KEEP.
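As a rough Python sketch of what KEEP (DENSE_RANK FIRST/LAST) computes per group (my own illustration, assuming no timestamp ties; with ties, the real clause applies MIN over the tied rows):

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

def summarize(transactions):
    """transactions: (item, transaction, iso_timestamp) triples.
    For each item, return (StartTx, EndTx, elapsed): the transaction at
    the earliest timestamp, the one at the latest, and the time span."""
    by_item = defaultdict(list)
    for item, tran, ts in transactions:
        by_item[item].append((datetime.strptime(ts, FMT), tran))
    result = {}
    for item, rows in by_item.items():
        rows.sort()  # order by timestamp within the group
        result[item] = (rows[0][1], rows[-1][1], rows[-1][0] - rows[0][0])
    return result
```

Note that, like the GROUP BY query, this collapses all of an item's rows into a single summary, so repeated cycles for one item merge into one overall span.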