Aggregate For Each Day Over Time Series, Without Using Non-equijoin Logic

June 25, 2023 Post a Comment

Initial Question Given the following dataset paired with a dates table: MembershipId | ValidFromDate | ValidToDate ========================================== 0001 | 1997-01

Solution 1:

If most of your membership validity intervals are longer than few days, have a look at an answer by Martin Smith. That approach is likely to be faster.

When you take calendar table (DIM.[Date]) and left join it with Memberships, you may end up scanning the Memberships table for each date of the range. Even if there is an index on (ValidFromDate, ValidToDate), it may not be super useful.

It is easy to turn it around. Scan the Memberships table only once and for each membership find those dates that are valid using CROSS APPLY.

Sample data

DECLARE@TTABLE (MembershipId int, ValidFromDate date, ValidToDate date);

INSERTINTO@TVALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');

DECLARE@RangeFromdate='2006-01-01';
DECLARE@RangeTodate='2006-12-31';

Query 1

SELECTCA.dt
    ,COUNT(*) ASMembershipCountFROM
    @TASMembershipsCROSSAPPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= Memberships.ValidFromDate
            AND dbo.Calendar.dt <= Memberships.ValidToDate
            AND dbo.Calendar.dt >= @RangeFrom
            AND dbo.Calendar.dt <= @RangeTo
    ) ASCAGROUPBYCA.dtORDERBYCA.dtOPTION(RECOMPILE);

OPTION(RECOMPILE) is not really needed, I include it in all queries when I compare execution plans to be sure that I'm getting the latest plan when I play with the queries.

When I looked at the plan of this query I saw that the seek in the Calendar.dt table was using only ValidFromDate and ValidToDate, the @RangeFrom and @RangeTo were pushed to the residue predicate. It is not ideal. The optimiser is not smart enough to calculate maximum of two dates (ValidFromDate and @RangeFrom) and use that date as a starting point of the seek.

It is easy to help the optimiser:

Query 2

SELECTCA.dt
    ,COUNT(*) ASMembershipCountFROM
    @TASMembershipsCROSSAPPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= 
                CASE WHEN Memberships.ValidFromDate > @RangeFrom 
                THEN Memberships.ValidFromDate 
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <= 
                CASE WHEN Memberships.ValidToDate < @RangeTo 
                THEN Memberships.ValidToDate 
                ELSE @RangeTo END
    ) ASCAGROUPBYCA.dtORDERBYCA.dtOPTION(RECOMPILE)
;

In this query the seek is optimal and doesn't read dates that may be discarded later.

Finally, you may not need to scan the whole Memberships table. We need only those rows where the given range of dates intersects with the valid range of the membership.

Query 3

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM@TAS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >=CASEWHEN Memberships.ValidFromDate >@RangeFromTHEN Memberships.ValidFromDate 
                ELSE@RangeFromENDAND dbo.Calendar.dt <=CASEWHEN Memberships.ValidToDate <@RangeToTHEN Memberships.ValidToDate 
                ELSE@RangeToEND
    ) AS CA
WHERE
    Memberships.ValidToDate >=@RangeFromAND Memberships.ValidFromDate <=@RangeToGROUPBY
    CA.dt
ORDERBY
    CA.dt
OPTION(RECOMPILE)
;

Two intervals [a1;a2] and [b1;b2] intersect when

a2 >= b1 and a1 <= b2

These queries assume that Calendar table has an index on dt.

You should try and see what indexes are better for the Memberships table. For the last query, if the table is rather large, most likely two separate indexes on ValidFromDate and on ValidToDate would be better than one index on (ValidFromDate, ValidToDate).

You should try different queries and measure their performance on the real hardware with real data. Performance may depend on the data distribution, how many memberships there are, what are their valid dates, how wide or narrow is the given range, etc.

I recommend to use a great tool called SQL Sentry Plan Explorer to analyse and compare execution plans. It is free. It shows a lot of useful stats, such as execution time and number of reads for each query. The screenshots above are from this tool.

Solution 2:

On the assumption your date dimension contains all dates contained in all membership periods you can use something like the following.

The join is an equi join so can use hash join or merge join not just nested loops (which will execute the inside sub tree once for each outer row).

Assuming index on (ValidToDate) include(ValidFromDate) or reverse this can use a single seek against Memberships and a single scan of the date dimension. The below has an elapsed time of less than a second for me to return the results for a year against a table with 3.2 million members and general active membership of 1.4 million (script)

DECLARE@StartDateDATE='2016-01-01',
        @EndDateDATE='2016-12-31';

WITH MD
     AS (SELECTDate,
                SUM(Adj) AS MemberDelta
         FROM   Memberships
                CROSS APPLY (VALUES ( ValidFromDate, +1),
                                    --Membership count decremented day after the ValidToDate
                                    (DATEADD(DAY, 1, ValidToDate), -1) ) V(Date, Adj)
         WHERE--Members already expired before the time range of interest can be ignored
          ValidToDate >=@StartDateAND--Members whose membership starts after the time range of interest can be ignored
          ValidFromDate <=@EndDateGROUPBYDate),
     MC
     AS (SELECT DD.DateKey,
                SUM(MemberDelta) OVER (ORDERBY DD.DateKey ROWS UNBOUNDED PRECEDING) AS CountOfNonIgnoredMembers
         FROM   DIM_DATE DD
                LEFTJOIN MD
                  ON MD.Date = DD.DateKey)
SELECT DateKey,
       CountOfNonIgnoredMembers AS MembershipCount
FROM   MC
WHERE  DateKey BETWEEN@StartDateAND@EndDateORDERBY DateKey

Demo (uses extended period as the calendar year of 2016 isn't very interesting with the example data)

Solution 3:

One approach is to first use an INNER JOIN to find the set of matches and COUNT() to project MemberCount GROUPed BY DateKey, then UNION ALL with the same set of dates, with a 0 on that projection for the count of members for each date. The last step is to SUM() the MemberCount of this union, and GROUP BY DateKey. As requested, this avoids LEFT JOIN and NOT EXISTS. As another member pointed out, this is not an equi-join, because we need to use a range, but I think it does what you intend.

This will serve up 1 year's worth of data with around 100k logical reads. On an ordinary laptop with a spinning disk, from cold cache, it serves 1 month in under a second (with correct counts).

Here is an example that creates 3.3 million rows of random duration. The query at the bottom returns one month's worth of data.

--Stay quiet for a momentSET NOCOUNT ONSET STATISTICS IO OFF
SET STATISTICS TIME OFF

--Clean up if re-runningDROPTABLE IF EXISTS DIM_DATE
DROPTABLE IF EXISTS FACT_MEMBER

--Date dimensionCREATETABLE DIM_DATE
  (
  DateKey DATENOTNULL 
  )

--Membership factCREATETABLE FACT_MEMBER
  (
  MembershipId INTNOTNULL
  , ValidFromDateKey DATENOTNULL
  , ValidToDateKey DATENOTNULL
  )

--Populate Date dimension from 2001 through end of 2018DECLARE@startDateDATE='2001-01-01'DECLARE@endDateDATE='2018-12-31'
;WITH CTE_DATE AS
(
SELECT@startDateAS DateKey
UNIONALLSELECT
       DATEADD(DAY, 1, DateKey)
FROM
       CTE_DATE AS D
WHERE
       D.DateKey <@endDate
)
INSERTINTO
  DIM_DATE
  (
  DateKey
  )
SELECT
  D.DateKey
FROM
  CTE_DATE AS D
OPTION (MAXRECURSION 32767)

--Populate Membership fact with members having a random membership length from 1 to 36 months 
;WITH CTE_DATE AS
(
SELECT@startDateAS DateKey
UNIONALLSELECT
       DATEADD(DAY, 1, DateKey)
FROM
       CTE_DATE AS D
WHERE
       D.DateKey <@endDate
)
,CTE_MEMBER AS
(
SELECT1AS MembershipId
UNIONALLSELECT MembershipId +1FROM CTE_MEMBER WHERE MembershipId <500
)
,
CTE_MEMBERSHIP
AS
(
SELECTROW_NUMBER() OVER (ORDERBY NEWID()) AS MembershipId
  , D.DateKey AS ValidFromDateKey
FROM
  CTE_DATE AS D
  CROSSJOIN CTE_MEMBER AS M
)
INSERTINTO
    FACT_MEMBER
    (
    MembershipId
    , ValidFromDateKey
    , ValidToDateKey
    )
SELECT
    M.MembershipId
    , M.ValidFromDateKey
      , DATEADD(MONTH, FLOOR(RAND(CHECKSUM(NEWID())) * (36-1)+1), M.ValidFromDateKey) AS ValidToDateKey
FROM
    CTE_MEMBERSHIP AS M
OPTION (MAXRECURSION 32767)

--Add clustered Primary Key to Date dimensionALTERTABLE DIM_DATE ADDCONSTRAINT PK_DATE PRIMARY KEY CLUSTERED
    (
    DateKey ASC
    )

--Index--(Optimize in your spare time)DROP INDEX IF EXISTS SK_FACT_MEMBER ON FACT_MEMBER
CREATE CLUSTERED INDEX SK_FACT_MEMBER ON FACT_MEMBER
    (
    ValidFromDateKey ASC
    , ValidToDateKey ASC
    , MembershipId ASC
    )


RETURN--Start test--Emit statsSET STATISTICS IO ONSET STATISTICS TIMEON--Establish range of datesDECLARE@rangeStartDateDATE='2010-01-01'
  , @rangeEndDateDATE='2010-01-31'--UNION the count of members for a specific date range with the "zero" set for the same range, and SUM() the counts
;WITH CTE_MEMBER
AS
(
SELECT
    D.DateKey
    , COUNT(*) AS MembershipCount
FROM
    DIM_DATE AS D
    INNERJOIN FACT_MEMBER AS M ON
        M.ValidFromDateKey <=@rangeEndDateAND M.ValidToDateKey >=@rangeStartDateAND D.DateKey BETWEEN M.ValidFromDateKey AND M.ValidToDateKey
WHERE
    D.DateKey BETWEEN@rangeStartDateAND@rangeEndDateGROUPBY
    D.DateKey

UNIONALLSELECT
    D.DateKey
    , 0AS MembershipCount
FROM
    DIM_DATE AS D
WHERE
    D.DateKey BETWEEN@rangeStartDateAND@rangeEndDate
)
SELECT
    M.DateKey
    , SUM(M.MembershipCount) AS MembershipCount
FROM
    CTE_MEMBER AS M
GROUPBY
    M.DateKey
ORDERBY
    M.DateKey ASC
OPTION (RECOMPILE, MAXDOP 1)

Solution 4:

Here's how I'd solve this problem with equijoin:

--data generationdeclare@Membershiptable (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insertinto@Membershipvalues
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')

declare@startDatedate, @endDatedateselect@startDate=MIN(ValidFromDate), @endDate=max(ValidToDate) from@Membership--in order to use equijoin I need all days between min date and max date from Membership table (both columns)
;with cte as (
    select@startDate [date]
    unionallselect DATEADD(day, 1, [date]) from cte
    where [date] <@endDate
)
--in this query, we will assign value to each day:--one, if project started on that day--minus one, if project ended on that day--then, it's enough to (cumulative) sum all this values to get how many projects were ongoing on particular dayselect [date],
       sum(casewhen [DATE] = ValidFromDate then1else0end+casewhen [DATE] = ValidToDate then-1else0end)
            over (orderby [date] rowsbetween unbounded preceding andcurrentrow)
from cte [c]
leftjoin@Membership [m]
on [c].[date] = [m].ValidFromDate  or [c].[date] = [m].ValidToDate
option (maxrecursion 0)

Here's another solution:

--data generationdeclare@Membershiptable (MembershipId varchar(10), ValidFromDate date, ValidToDate date)
insertinto@Membershipvalues
('0001', '1997-01-01', '2006-05-09'),
('0002', '1997-01-01', '2017-05-12'),
('0003', '2005-06-02', '2009-02-07')

;with cte as (
    selectCAST('2016-01-01'asdate) [date]
    unionallselect DATEADD(day, 1, [date]) from cte
    where [date] <'2016-12-31'
)

select [date],
       (selectCOUNT(*) from@Membershipwhere ValidFromDate < [date]) - 
       (selectCOUNT(*) from@Membershipwhere ValidToDate < [date]) [ongoing]
from cte
option (maxrecursion 0)

Solution 5:

Pay attention, I think @PittsburghDBA is right when it says that current query return wrong result. The last day of membership is not counted and so final sum is lower than it should be. I have corrected it in this version.

This should improve a bit your actual progress:

declare@sdate='20160101';
declare@edate= getdate();

with 
x as (
    select d, sum(c) c
    from (
        select ValidFromDateKey d, count(MembershipID) c
        from Memberships
        groupby ValidFromDateKey 

        unionall-- dateadd needed to count last day of membership too!!select dateadd(dd, 1, ValidToDateKey) d, -count(MembershipID) c
        from Memberships
        groupby ValidToDateKey 
    )x
    groupby d
),
c as
(
    select d, sum(x.c) over (orderby d) as c
    from x
)
select d.day, c cnt
from calendar d
leftjoin c on d.day = c.d
where d.day between@sand@eorderby d.day;

Spark Sql Blog