Page 17 sur 25
Group, count and sort
List unique values
[code]print(df_Deduped.devise.unique())[/code]
Count rows
[code]number = df.shape[0]
print(number)[/code]
Count values (group by)
If you have a field named Country in a data-frame named df:
[code]CountCountry = df.groupby(['Country'], dropna=False).size().sort_values(ascending=False)
print(CountCountry)[/code]
Count values (group by) and get it in a dataframe
[code]CountCommune = pd.DataFrame(df.groupby(['id_commune'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
print(tabulate(CountCommune.head(10), headers='keys', tablefmt='psql', showindex=False))[/code]
Count values grouping by several fields to find duplicates
[code]df_Duplicates = pd.DataFrame(df.groupby(['id_commune', 'PAGES'], dropna=False).size(), columns=['Total'])\
.sort_values(['Total'], ascending=False).reset_index()
df_Duplicates = df_Duplicates[(df_Duplicates['Total'] > 1)]
print(tabulate(df_Duplicates.head(10), headers='keys', tablefmt='psql', showindex=False))[/code]
Group concat
[code]df = df.groupby(['country'])['city'].apply(lambda x: ', '.join(x)).reset_index()[/code]
Count a specific value
[code]print (len(df[df['Your field'] == 'United States']))[/code]
Count with condition
[code]myCount = df[df['Field where search the condition'] == 'A value'].count()['Field to count'][/code]
Count empty values
[code]number = df['Your field'].isna().sum()[/code]
Keep empty values in a group by and fill them
[code]newDf = pd.DataFrame(df.groupby(['Your field'], dropna=False).size(), columns=['Total']).sort_values(['Total'], ascending=False).reset_index()
newDf = newDf.fillna('Unknow')[/code]
Count email domains (group by)
You have a Email field with clean emails. Thanks to Pandas it is easy to group and count their DNS.
First create a virtual field to split the emails and recover the 2nd part of the split. Then just group your new field.
[code]df['Domain'] = df['Email'].str.split('@').str[1]
df_Email_DNS_count = pd.DataFrame(df.groupby(['Domain']).size(), columns=['Total'])\
.sort_values(['Total'], ascending=False).reset_index()[/code]
Calculate percentages
Acording your needs, you can use the sum from a Total field.
[code]df['Percent'] = (df['Total'] / df['Total'].sum()) * 100
df['Percent'] = df['Percent'].round(decimals=2)[/code]
Or the number of rows in your data-frame (useful if you work with a multi-valued field, see above the Analyze data from a multi-valued field chapter) .
[code]df['Percent'] = (df['Total'] / df.shape[0]) * 100
df['Percent'] = df['Percent'].round(decimals=2)[/code]
Sum group by year
[code]df['Year'] = df['DateInvoice'].dt.to_period('Y')
df_AmountYear = df.groupby(['Year'], dropna=False)['AmountEuro'].sum().reset_index()
df_AmountYear = df_AmountYear.sort_values(['Year'], ascending=False)
print('\nAmount per year:')
print(tabulate(df_AmountYear, headers='keys', tablefmt='psql', showindex=False))[/code]
Sort on a field
[code]df.sort_values(by=['Field'], ascending=False, inplace=True)
# Recreate the index if necessary (if you will use the index later)
df = df.reset_index(drop=True)[/code]
Sort on a field containing both text and numerical values
[code]df = df.loc[pd.to_numeric(df['My field'], errors='coerce').sort_values().index][/code]
Sum and count
If you want sum an Amount field per country counting the number of lines:
[code]df_AmountCountries = df.groupby(['Country'], dropna=False)\
.agg({'ID': 'count', 'AmountEuro': 'sum'})\
.reset_index()\
.rename(columns={'ID': 'Number'})[/code]