DebbieE
Community Champion

Fabric PySpark Help: Adding a Min into a groupBy agg code in Notebooks

It would be really useful if we had a PySpark forum.

I'm SQL through and through, and learning PySpark is a NIGHTMARE.

I have the following code that finds all the contestants with more than one record in a list that should be unique:

from pyspark.sql.functions import col, count

dfcont.groupBy('contestant') \
    .agg(count('contestant').alias('TotalRecords')) \
    .filter(col('TotalRecords') > 1) \
    .show(1000)
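As a quick sanity check of what that groupBy/count/filter produces, the same duplicate-finding logic can be sketched in plain Python on toy data (the contestant names here are invented purely for illustration):

```python
from collections import Counter

# Invented sample data: a list that should contain each contestant once
contestants = ["Alice", "Bob", "Alice", "Cara", "Bob", "Bob", "Dan"]

# Count occurrences, then keep only names appearing more than once,
# mirroring groupBy('contestant').agg(count(...)).filter(> 1)
counts = Counter(contestants)
duplicates = {name: n for name, n in counts.items() if n > 1}
print(duplicates)  # {'Alice': 2, 'Bob': 3}
```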
 
There are 4 contestants that have more than one value in the list
 
I also want to bring through the min contestantKey into the above result set.
 
In SQL it would be:

SELECT Min(CustomerID), Customer, Count(*)
GROUP BY Customer
HAVING Count(*) > 1
 
I find it unbelievably frustrating that in SQL I can just type that in a couple of seconds, and yet I have been struggling to do it in PySpark for... well, too long.
 
Any help would be appreciated.  
1 ACCEPTED SOLUTION
Anonymous
Not applicable

Hi @DebbieE ,

Thanks for using Fabric Community.
As I understand it, you want each group's minimum key alongside its record count.
Spark SQL Code:

df = spark.sql("SELECT Min(CustomerID), CompanyName, Count(*) FROM gopi_lake_house.customer_table1 group by CompanyName having count(*)>1")
display(df)

 


Pyspark Code:

Can you please try the below code:
# Note: this min is pyspark.sql.functions.min, which shadows
# the Python built-in min in this cell
from pyspark.sql.functions import col, count, min

result = dfcont.groupBy('CompanyName') \
    .agg(min('CustomerID').alias('minCustomerID'),
         count('CustomerID').alias('TotalRecords')) \
    .filter(col('TotalRecords') > 1)

result.show(1000)
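To see concretely what this min-plus-count aggregation returns, here is the same logic emulated in plain Python on a small invented dataset (the IDs and company names are made up for illustration):

```python
from collections import defaultdict

# Invented (CustomerID, CompanyName) rows; two companies have duplicates
rows = [
    (101, "Contoso"),
    (102, "Contoso"),
    (201, "Fabrikam"),
    (301, "Adventure Works"),
    (303, "Adventure Works"),
    (302, "Adventure Works"),
]

# Group IDs by company, mirroring groupBy('CompanyName')
groups = defaultdict(list)
for customer_id, company in rows:
    groups[company].append(customer_id)

# Keep groups with more than one record (the filter/HAVING step),
# reporting the minimum ID and the record count per group
result = {
    company: {"minCustomerID": min(ids), "TotalRecords": len(ids)}
    for company, ids in groups.items()
    if len(ids) > 1
}
print(result)
```

Fabrikam has only one row, so it is dropped by the filter, just as single-record contestants are dropped by the PySpark version above.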

 



Hope this is helpful. Please let me know in case of further queries.

 


3 REPLIES

DebbieE
Community Champion

Yay, it worked! Thank you so much. The exercise is to try to do everything I usually do in SQL with PySpark, so having these extra examples is gold too.

Anonymous
Not applicable

Hi @DebbieE ,

Glad to know that your query got resolved. Please continue using the Fabric Community for your further queries.
