
Doing Sum Of Columns Based On Some Complex Logic In Pyspark

Here is the question from the attached image. Table:

Row  Col1  Col2  Col3  Result
  1    10    20   100      30
  2    20    40   200      60
  3    30    60     0     240

Solution 1:

This answers (correctly, I might add) the original version of the question.

In SQL, you can express this using window functions. Use one cumulative sum to define the groups, then an additional cumulative sum within each group:

select t.*,
       (case when col3 <> 0 then col1 + col2
             else sum(col2 + case when col3 = 0 then col1 else 0 end) over (partition by grp order by row desc)
        end) as result
from (select t.*,
             sum(case when col3 <> 0 then 1 else 0 end) over (order by row desc) as grp
      from t
     ) t;

Here is a db<>fiddle (which uses Postgres).

Note:

Your description says that the else logic should be:

else sum(col2) over (partition by grp order by row desc)

Your example says:

else sum(col2 + col3) over (partition by grp order by row desc)

And in my opinion, this seems most logical:

else sum(col1 + col2) over (partition by grp order by row desc)
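The three candidates only differ on rows where col3 = 0 (here, row 3). A small plain-Python sketch of the window arithmetic (using the same grp definition as the query above) shows what each variant yields for that row:

```python
rows = [(1, 10, 20, 100), (2, 20, 40, 200), (3, 30, 60, 0)]

# grp: cumulative count of non-zero col3 values, scanning rows in descending order
grp = {}
count = 0
for r, c1, c2, c3 in sorted(rows, key=lambda t: -t[0]):
    if c3 != 0:
        count += 1
    grp[r] = count

def else_value(expr):
    """Cumulative sum of expr within each grp, rows in descending order."""
    out, running = {}, {}
    for r, c1, c2, c3 in sorted(rows, key=lambda t: -t[0]):
        g = grp[r]
        running[g] = running.get(g, 0) + expr(c1, c2, c3)
        out[r] = running[g]
    return out

a = else_value(lambda c1, c2, c3: c2)        # description: sum(col2)
b = else_value(lambda c1, c2, c3: c2 + c3)   # example:     sum(col2 + col3)
c = else_value(lambda c1, c2, c3: c1 + c2)   # suggested:   sum(col1 + col2)
print(a[3], b[3], c[3])  # row 3 is the only row where the else branch applies
```

Because row 3 is alone in its group on this sample data, each variant reduces to a single row's expression, which makes the differences between the three readings easy to inspect.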
