wildrose2023
New Contributor II

Hierarchy Modelling for SellOut Data

Hi everyone,

 

I'm looking for some input on the following modeling concept.

The idea is pretty straightforward: we receive SellOut data from different providers. One file contains data at the SKU (product) level, and another contains data at the Brand level (which is the lowest level in that file's hierarchy).

The SKU-level data is accurate for each SKU, but some SKUs are not reported. This means that simply summing the SKU data results in incorrect Brand totals. In these cases, the external file with Brand-level data becomes the source of truth.

 

To keep things clean and simple in Power BI (avoiding complex DAX like ISINSCOPE and instead relying on plain SUM, AVG, etc.), I'm aiming to bring everything into a single fact table. For additive measures, my approach is to assign the missing values to a blank row, which seems to solve the problem neatly.
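For illustration, here is a minimal PySpark sketch of how that blank row could be derived at ETL time. The table names (sku_sales, brand_totals) and column names are hypothetical, not from the original post:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: SKU-level facts and the trusted brand-level totals.
sku_sales = spark.table("sku_sales")        # columns: brand, sku, sales_amount
brand_totals = spark.table("brand_totals")  # columns: brand, sales_amount

# Sum the reported SKUs per brand.
reported = sku_sales.groupBy("brand").agg(
    F.sum("sales_amount").alias("reported_amount")
)

# The blank row absorbs the gap between the trusted brand total
# and what the reported SKUs add up to.
blank_rows = (
    brand_totals.join(reported, "brand", "left")
    .withColumn(
        "sales_amount",
        F.col("sales_amount") - F.coalesce(F.col("reported_amount"), F.lit(0.0)),
    )
    .where(F.col("sales_amount") != 0)
    .select("brand", F.lit(None).cast("string").alias("sku"), "sales_amount")
)

# One fact table: reported SKUs plus the blank plug rows, so plain SUM works.
fact = sku_sales.unionByName(blank_rows)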

 

However, for non-additive measures, like weighted distribution expressed as a percentage, there needs to be mathematical logic to determine the value attributed to the blank row so that aggregated results remain correct.
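To make that reverse calculation concrete, here is a small sketch. It assumes (the post does not specify the formula) that the brand-level weighted distribution is a weight-averaged SKU WD, so the blank row's value x must satisfy brand_wd = (sum of w_i * wd_i + blank_weight * x) / (sum of w_i + blank_weight); all numbers below are hypothetical:

def blank_row_wd(brand_wd, sku_wds, sku_weights, blank_weight):
    # Solve brand_wd = (sum(w_i * wd_i) + blank_weight * x) / (sum(w_i) + blank_weight)
    # for x, the WD value to stamp on the blank row.
    weighted_sum = sum(w * wd for w, wd in zip(sku_weights, sku_wds))
    total_weight = sum(sku_weights) + blank_weight
    return (brand_wd * total_weight - weighted_sum) / blank_weight

# Example: two reported SKUs, trusted brand-level WD of 80%.
x = blank_row_wd(brand_wd=0.80, sku_wds=[0.90, 0.70],
                 sku_weights=[100, 50], blank_weight=30)
# x = (0.80 * 180 - 125) / 30 = 19 / 30, roughly 0.633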

Does my diagram make sense to you, and does this approach seem solid? Do you think the NON-ADDITIVE part makes it much more complicated to do in Databricks? As always, thank you for the help and input!

(Diagram attached: wildrose2023_0-1744898975160.png)

 

1 ACCEPTED SOLUTION
v-priyankata
Honored Contributor II

Hi @wildrose2023,
Thank you for sharing your detailed explanation and diagram. You've clearly put a lot of thought into modeling your SellOut data for both additive and non-additive measures.

 

Additive Measures:
Your approach of using a blank row to store the "missing" SKU-level data to align with the brand-level source of truth makes sense. This ensures that simple aggregations work cleanly without extra DAX complexity.

 

Non-Additive Measures:
For weighted distributions (non-additive), your method of reverse-calculating the blank row's value so that the overall brand-level measure aggregates to the correct average also looks solid. This technique ensures that the final brand-level WD aligns with the trusted brand-level file.

 

Regarding Databricks (or Spark-based environments):
In principle, the calculation logic for the blank row (reverse-calculating the missing piece) is the same regardless of the underlying platform (Power BI, Databricks, etc.). The challenge in Databricks would typically be replicating the dynamic calculation of that blank row's value in PySpark or SQL. However, if you pre-calculate and stamp the value in your ETL layer (as your diagram suggests), you'd avoid runtime complexity in the query layer, just as in Power BI.
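As a hedged sketch of what "pre-calculate and stamp in the ETL layer" could look like in Databricks, under the same weighted-average assumption as the earlier example (the table names sku_wd and brand_wd and the blank-row weight are all hypothetical):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sku_wd = spark.table("sku_wd")      # columns: brand, sku, wd, weight
brand_wd = spark.table("brand_wd")  # trusted source: brand, wd

per_brand = sku_wd.groupBy("brand").agg(
    F.sum(F.col("weight") * F.col("wd")).alias("weighted_sum"),
    F.sum("weight").alias("reported_weight"),
)

BLANK_WEIGHT = 30.0  # assumed; in practice this could come from the additive plug

# Stamp the reverse-calculated WD on the blank row at ETL time, so the
# query layer (Power BI or Databricks SQL) only needs a weighted average.
blank_rows = (
    brand_wd.join(per_brand, "brand")
    .withColumn(
        "wd",
        (F.col("wd") * (F.col("reported_weight") + F.lit(BLANK_WEIGHT))
         - F.col("weighted_sum")) / F.lit(BLANK_WEIGHT),
    )
    .select("brand", F.lit(None).cast("string").alias("sku"),
            "wd", F.lit(BLANK_WEIGHT).alias("weight"))
)

fact_wd = sku_wd.unionByName(blank_rows)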

 

If this information is helpful, please "Accept it as a solution" and give a "kudos" to assist other community members in resolving similar issues more efficiently. Thank you.


3 REPLIES
lbendlin
Esteemed Contributor III

"but some SKUs are not reported."

How do you know that? It is very hard to report on the absence of things if you don't know that they are absent.


Thank you @v-priyankata for validating 🙂
