
Top 20 Pandas Functions You Aren't Using (But Should Be)

Muhammad Anas


This blog post will explore 20 powerful and unique Pandas functions that can significantly enhance your data analysis workflow. We will be using the famous Iris dataset as an example to demonstrate each function.

The Iris dataset contains four features: Sepal Length, Sepal Width, Petal Length, and Petal Width, along with their corresponding Iris species. We'll also use another dataset towards the end to show some datetime Pandas features.

All code and files will be hosted on my GitHub, which can be found here.

Before we begin, ensure you have Python installed; if not, here's how to do it. Also, you need to install pandas which you can do by going to your terminal and typing:

pip3 install pandas

If you want to avoid all this hassle, go to colab.research.google.com, and you can start coding with Python and Pandas straightaway, as everything you need is pre-installed!

Now we can get started with expanding our Pandas toolkit! But first, let's import Pandas and set up the dataset.

import pandas as pd # Importing Pandas

# Load the Iris dataset
iris_df = pd.read_csv('iris.csv')

1. Count Unique Values

The nunique() function is used in Pandas to count the number of unique values in a Series or DataFrame column. It helps us understand the diversity and variety of values present in a dataset, allowing us to gain insights into the uniqueness of the data.

This function is handy for data quality analysis, identifying duplicates, and understanding the distribution of distinct values.

Let's illustrate the usage of nunique() with a logical and real-life code example using the Iris dataset:

# Count unique species in the dataset
num_unique_species = iris_df['Species'].nunique()
print(num_unique_species)

We apply the nunique() function on the 'Species' column of the dataset to count the number of unique species present.

The result, stored in the num_unique_species variable, represents the count of unique species in the Iris dataset. By printing this value, we can easily observe the diversity and variety of species in the dataset. This information is valuable for understanding the composition of the dataset and can guide further analysis or decision-making.
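nunique() also works on an entire DataFrame, returning the per-column counts in one call. Here's a quick sketch (assuming the same iris_df loaded above):

# Count unique values in every column at once
print(iris_df.nunique())

The 'Species' row will show 3, while each measurement column shows how many distinct values it contains.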

💡
In simple terms, the nunique() function helps us find out how many different types of species are present in the Iris dataset.

By applying this function, we can obtain a single number that tells us the count of unique species, allowing us to comprehend the variety of species within the dataset.

Learn More ➡️

2. Transform Values

The map() function in Pandas is used to transform values in a Series or DataFrame column. It enables us to replace existing values with new values based on a mapping dictionary, another Series, or a custom function.

This function is particularly useful when we want to perform value mapping or transformation on specific columns.

Learn more about Series and DataFrames.

Let's explore the usage of map() with a logical and real-life code example:

# Create a mapping dictionary for flower colors
color_mapping = {
    'setosa': 'blue',
    'versicolor': 'orange',
    'virginica': 'purple'
}

# Map flower colors using the mapping dictionary
iris_df['Flower Color'] = iris_df['Species'].map(color_mapping)
print(iris_df[['Species', 'Flower Color']].head())

In the code above, we start by creating a mapping dictionary color_mapping, where the keys represent the original species names ('setosa', 'versicolor', 'virginica'), and the values represent the corresponding color codes ('blue', 'orange', 'purple') we want to map them to.

Next, we use the map() function on the 'Species' column to transform the species names into their corresponding color codes using the color_mapping dictionary.

The resulting 'Flower Color' column is added to the iris_df DataFrame, containing the mapped color values for each species. By printing the 'Species' and 'Flower Color' columns using head(), we can observe the transformed values side by side.
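map() isn't limited to dictionaries; it also accepts a plain function. As a small sketch, this uppercases each species name (the 'Species Upper' column name is just illustrative):

# Pass a function to map() instead of a dictionary
iris_df['Species Upper'] = iris_df['Species'].map(str.upper)
print(iris_df['Species Upper'].head())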

💡
In simple terms, the map() function allows us to convert the species names in the 'Species' column of the Iris dataset into corresponding color codes.

By providing a mapping dictionary, we can easily transform the values and create a new column ('Flower Color') that represents the mapped colors.

This function is helpful when we need to replace existing values with new values based on specific mapping rules or transformations.

Learn More ➡️

3. Group Data

The groupby() function in Pandas groups data based on one or more columns in a DataFrame.

It enables us to perform operations on each group separately, which is particularly useful for aggregation and summarization tasks.

This function allows us to split the data into groups based on a specified column or columns and then apply functions or calculations to each group.

Let's explore the usage of groupby() with a logical and real-life code example:

# Group the data by species and calculate the mean of sepal length
species_grouped = iris_df.groupby('Species')['sepal_length'].mean()

pd.DataFrame(species_grouped)

In the code above, we apply the groupby() function on the DataFrame, specifying the column 'Species' to group the data by.

Next, we select the 'sepal_length' column and use the mean() function to calculate the average sepal length for each species. The result is a Series object, species_grouped, where the species names are the index and the corresponding mean sepal length values are the values.

By printing the species_grouped Series, we can observe the average sepal length for each species. This information allows us to compare the average sepal length across different species in the Iris dataset, providing insights into their characteristic differences.
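groupby() isn't limited to a single statistic, either; chaining agg() computes several in one pass. A quick sketch using the same columns:

# Compute several summary statistics per species at once
summary = iris_df.groupby('Species')['sepal_length'].agg(['mean', 'min', 'max'])
print(summary)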

💡
In simple terms, the groupby() function helps us group the data in the Iris dataset based on the species column. By specifying the column to group by, we create distinct groups for each species.

We then calculate the mean sepal length within each group using the mean() function.

The result allows us to see the average sepal length for each species, aiding in the comparison and analysis of sepal length across different species.

Learn More ➡️

4. Create a Pivot Table

The .pivot_table() function in Pandas is used to create a pivot table based on a DataFrame. It allows you to summarize and aggregate data based on two or more columns, providing a compact representation of the data.

Pivot tables are particularly useful for analyzing and visualizing data from multiple perspectives.

If you don't know what Pivot Tables are, read more about them here ➡️

Let's explore the usage of .pivot_table() with a nice example:

# Create a pivot table to calculate the average petal length for each species based on sepal width

pivot_table = pd.pivot_table(
    iris_df,
    values='petal_length',
    index='Species',
    columns='sepal_width',
    aggfunc='mean')

print(pivot_table)

In the code above, we use the .pivot_table() function to create a pivot table based on the DataFrame.

Within the .pivot_table() function, we specify:

  • the column to be aggregated (values='petal_length'),
  • the columns to be used as row labels (index='Species'), and
  • the columns to be used as column labels (columns='sepal_width').

Additionally, we specify the aggregation function (aggfunc='mean') to calculate the average petal length for each species and sepal width combination.

The resulting pivot table, stored in the pivot_table variable, presents a compact representation of the data. The rows represent the species, the columns represent the sepal widths, and the values represent the average petal lengths corresponding to each combination of species and sepal width.

By printing the pivot_table, we can observe the average petal length for each species based on different sepal widths. This information provides a comprehensive view of how petal length varies across species and sepal width categories.
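Because most species/sepal-width combinations never occur, the table above will contain many NaN cells. As a hedged variation, the fill_value parameter replaces those gaps:

# Same pivot table, but absent combinations become 0 instead of NaN
pivot_table_filled = pd.pivot_table(
    iris_df,
    values='petal_length',
    index='Species',
    columns='sepal_width',
    aggfunc='mean',
    fill_value=0)

print(pivot_table_filled)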

💡
In simple terms, the .pivot_table() function allows us to summarize and aggregate data in a tabular format.

We can specify which columns to use as row and column labels, which column to aggregate, and the aggregation function to apply.

The resulting pivot table provides a condensed representation of the data, making it easier to analyze and compare values across different categories.

Learn More ➡️

5. Bin Data into Intervals

The cut() function in Pandas is used to divide continuous data into bins or intervals. It is beneficial when transforming a numerical column into categorical bins, allowing for better analysis and visualization.

This function helps in discretizing data and creating meaningful categories based on specific ranges or criteria.

Let's explore the usage of cut() with an example:

# Create three bins for sepal length: Short, Medium, and Long

sepal_length_bins = pd.cut(
    iris_df['sepal_length'],
    bins=[0, 5, 6.5, 10],
    labels=['Short', 'Medium', 'Long'])

pd.DataFrame(sepal_length_bins.head())

In the code above, we use the cut() function to divide the 'sepal_length' column into three bins: 'Short', 'Medium', and 'Long'.

Within the cut() function, we specify the column to be binned (iris_df['sepal_length']), the bin edges (bins=[0, 5, 6.5, 10]), and the labels for each bin (labels=['Short', 'Medium', 'Long']).

In this example, sepal lengths below 5 will be categorized as 'Short', lengths between 5 and 6.5 will be categorized as 'Medium', and lengths above 6.5 will be categorized as 'Long'.

The resulting sepal_length_bins Series contains the bin labels corresponding to each sepal length value in the 'sepal_length' column. By printing the head() of this Series, we can observe the transformed values, where each sepal length is assigned to the respective bin category.
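A natural follow-up is to count how many flowers land in each bin, which pairs cut() with value_counts() (covered later in this post):

# Count how many sepal lengths fall into each bin
print(sepal_length_bins.value_counts())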

💡
In simple terms, the cut() function helps us create categories or bins for numerical data.

In this example, we divide the sepal length values in the Iris dataset into three bins: 'Short', 'Medium', and 'Long'. By specifying the bin edges and labels, we can assign each sepal length value to its appropriate bin.

This transformation allows us to analyze and visualize the data in terms of these meaningful categories instead of continuous values.

Learn More ➡️

6. Unpivot Data


The melt() function in Pandas transforms a DataFrame from a wide format to a long format by "unpivoting" the data. It is handy when converting columns into rows, making the data more suitable for analysis and visualization.

This function helps in restructuring data by gathering multiple columns into key-value pairs.

Let's explore the usage of melt() with an example:

# Convert the DataFrame from wide to long format
melted_df = pd.melt(
    iris_df,
    id_vars='Species',
    value_vars=[
        'sepal_length',
        'sepal_width',
        'petal_length',
        'petal_width'])

print(melted_df.head())

The iris dataset is in a wide format, where each attribute (Sepal Length, Sepal Width, Petal Length, Petal Width) has its own column.

If you're wondering what's a wide format, learn about it here ➡️

We then use the melt() function to transform the iris_df DataFrame from a wide format to a long format. We specify the id_vars parameter as 'Species' to indicate that we want to keep the 'Species' column as an identifier. The value_vars parameter lists the columns we want to unpivot or melt, which are ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'] in this example.

The resulting melted_df DataFrame contains the unpivoted data, where each row represents a unique combination of 'Species' and an attribute column. The 'variable' column indicates the attribute name and the 'value' column contains the corresponding attribute values.

By printing the head() of melted_df, we can observe the transformed data, where the attribute columns are converted into key-value pairs. This long format is often more suitable for further analysis, as it allows for easier aggregation, filtering, and visualization of the data.
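By default the melted columns are named 'variable' and 'value'; the var_name and value_name parameters give them friendlier labels. A brief sketch:

# Rename the generated key/value columns while melting
melted_named = pd.melt(
    iris_df,
    id_vars='Species',
    value_vars=['sepal_length', 'sepal_width',
                'petal_length', 'petal_width'],
    var_name='Measurement',
    value_name='Centimeters')

print(melted_named.head())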

💡
In simple terms, the melt() function helps us convert a DataFrame from a wide format to a long format.

It gathers multiple columns and stacks them into key-value pairs, with each row representing a unique combination of identifiers and attributes.

This transformation is useful when we want to analyze or visualize the data in a more structured and organized manner, especially when dealing with data that has attributes spread across different columns.

Learn More ➡️

7. Apply a Function

The apply() function in Pandas is used to apply a function to each element or row in a Series or DataFrame. It provides a flexible way to perform custom operations on your data.

This function allows you to process data in a more granular and personalized manner, as you can define your own function or use built-in functions.

Let's explore the usage of apply() with a useful example:

# Apply a lambda function to calculate the square of each Sepal Length value

iris_df['sepal_length_squared'] = iris_df['sepal_length'].apply(lambda x: x**2)
print(iris_df[['sepal_length', 'sepal_length_squared']].head())

We use the apply() function to apply a lambda function to each value in the 'sepal_length' column. The lambda function calculates the square of each value by raising it to the power of 2.

The resulting values are stored in a new column called 'sepal_length_squared'. By printing the head() of the iris_df DataFrame, we can observe both the original 'sepal_length' column and the newly added 'sepal_length_squared' column, which contains the squared values.
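apply() also works row by row on a whole DataFrame when you pass axis=1. As a sketch, this combines two columns into a rough petal area (the 'petal_area' name is just illustrative):

# Apply a function across each row to combine two columns
iris_df['petal_area'] = iris_df.apply(
    lambda row: row['petal_length'] * row['petal_width'], axis=1)

print(iris_df[['petal_length', 'petal_width', 'petal_area']].head())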

💡
In simple terms, the apply() function allows us to perform a custom operation on each element in a Series or DataFrame.

In this example, we use a lambda function to calculate the square of each sepal length value in the 'sepal_length' column.

This transformation is useful when we want to derive new values or perform calculations based on existing data. The result is a modified DataFrame with the original column and the newly created column reflecting the applied function.

Learn More ➡️

8. Sort Data

The sort_values() function in Pandas is used to sort a DataFrame or Series by one or more columns. It allows you to arrange your data in ascending or descending order, making it easier to analyze and visualize.

This function is particularly useful when you want to explore the data in a specific order or identify the top or bottom values based on certain criteria.

Let's explore the usage of sort_values() with a quick example:

# Sort the DataFrame by Sepal Length in descending order
sorted_df = iris_df.sort_values('sepal_length', ascending=False)

print(sorted_df.head())

We use the sort_values() function to sort the DataFrame based on the 'sepal_length' column in descending order. By specifying the column to sort by ('sepal_length') and setting the ascending parameter to False, we arrange the data in descending order of sepal length.

The resulting sorted_df DataFrame contains the rows sorted based on the sepal length values in descending order. By printing the head() of sorted_df, we can observe the top rows with the longest sepal lengths.
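sort_values() also accepts a list of columns, sorting by the first and breaking ties with the next. A quick sketch:

# Sort by species A-Z, then by sepal length (longest first) within each species
multi_sorted = iris_df.sort_values(
    ['Species', 'sepal_length'], ascending=[True, False])

print(multi_sorted.head())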

💡
In simple terms, the sort_values() function helps us arrange the data in a DataFrame or Series based on the values in one or more columns.

In this example, we sort the Iris DataFrame by sepal length in descending order to identify the flowers with the longest sepals.

Sorting the data allows us to explore the dataset in a specific order and identify patterns or outliers based on the sorted criteria.

Learn More ➡️

9. Count Value Occurrences

The value_counts() function in Pandas is used to count the occurrences of unique values in a Series. It is handy for understanding the distribution of values in a categorical column.

This function allows us to quickly determine the frequency of each unique value in a column, providing insights into the data's composition.

Let's explore the usage of value_counts() with a useful example:

# Count the occurrences of each species
species_counts = iris_df['Species'].value_counts()

print(species_counts)

We use the value_counts() function on the 'Species' column to count the occurrences of each unique species. The resulting species_counts Series contains the counts of each species, with the species names as the index and the corresponding frequencies as the values.

By printing the species_counts, we can observe the number of occurrences for each species in the dataset. This information helps us understand species distribution and identify any imbalances or biases in the data.
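Passing normalize=True turns the counts into proportions, which is often easier to read when checking class balance. A one-line sketch:

# Show each species as a fraction of the dataset instead of a raw count
print(iris_df['Species'].value_counts(normalize=True))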

💡
In simple terms, the value_counts() function allows us to count the occurrences of each unique value in a Series.

In this example, we count the occurrences of each species in the 'Species' column of the Iris dataset.

The result provides a frequency count for each species, helping us analyze the distribution of species in the dataset and gain insights into the composition of the data.

Learn More ➡️

10. Fill in Missing Values

The fillna() function in Pandas is used to fill missing or NaN (Not a Number) values in a DataFrame or Series. It allows you to replace missing values with specific values or methods like forward-fill or backward-fill.

This function helps in handling missing data and ensuring the completeness of the dataset.

Let's explore the usage of fillna() with a logical and real-life code example:

# Fill missing values in the 'sepal_width' column with the mean value
iris_df['sepal_width'] = iris_df['sepal_width'].fillna(
    iris_df['sepal_width'].mean())

print(iris_df['sepal_width'].isnull().sum())

We then use the fillna() function to fill any missing values in the 'sepal_width' column with the mean value of that column.

By calling iris_df['sepal_width'].mean(), we calculate the mean value of the 'sepal_width' column. The fillna() function replaces any missing values with this calculated mean value.

After filling in the missing values, we check for remaining null values by calling iris_df['sepal_width'].isnull().sum(). This expression returns the sum of null values in the 'sepal_width' column. If the output is zero, all missing values have been successfully filled.
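The forward-fill and backward-fill methods mentioned earlier propagate neighbouring values instead of a computed statistic. A minimal sketch on a throwaway Series:

import numpy as np

# ffill() copies the last valid value downward; bfill() works upward
s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill())  # 1.0, 1.0, 1.0, 4.0
print(s.bfill())  # 1.0, 4.0, 4.0, 4.0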

💡
In simple terms, the fillna() function helps us handle missing values in a DataFrame or Series.

In this example, we fill the missing values in the 'sepal_width' column of the Iris dataset with the mean value of that column.

By doing so, we ensure that there are no missing values, allowing us to work with complete and reliable data.

Learn More ➡️

11. Change Data Types

The astype() function in Pandas is used to change the data type of a column in a DataFrame or a Series. It allows you to convert a column from one data type to another, such as from integer to float, string to datetime, or vice versa.

This function helps in ensuring the appropriate data type for analysis and computation.

Let's explore the usage of astype() with a practical example:

# Convert the 'sepal_length' column to float
iris_df['sepal_length'] = iris_df['sepal_length'].astype(float)

print(iris_df['sepal_length'].dtype)

We use the astype() function to convert the 'sepal_length' column from its original data type to float. By calling iris_df['sepal_length'].astype(float), we specify the desired data type as float and apply the conversion to the 'sepal_length' column.

After the conversion, we check the data type of the 'sepal_length' column by calling iris_df['sepal_length'].dtype. This expression returns the data type of the column. If the output is float, it indicates that the conversion was successful.
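astype() also understands Pandas' own dtypes; converting a repetitive string column to 'category' is a common memory optimization. A hedged sketch that leaves the original column untouched:

# Store the three species labels as a categorical instead of plain strings
species_cat = iris_df['Species'].astype('category')
print(species_cat.dtype)  # category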

💡
In simple terms, the astype() function allows us to change the data type of a column in a DataFrame or Series.

In this example, we convert the 'sepal_length' column in the Iris dataset from its original data type to float.

This conversion ensures that the values in the column are treated as floating-point numbers, which might be necessary for certain calculations or operations.

Learn More ➡️

12. Find Duplicates

The duplicated() function in Pandas is used to identify duplicate rows in a DataFrame.

It helps in detecting and handling any duplicate entries in your dataset, allowing you to ensure data integrity and identify potential issues.

Let's explore the usage of duplicated() with a logical example:

# Check for duplicate rows based on all columns
duplicates = iris_df.duplicated()

print(duplicates.sum())

We then use the duplicated() function on the iris_df DataFrame to check for duplicate rows. By calling iris_df.duplicated(), the function returns a boolean Series where True indicates a duplicated row and False indicates a unique row. Each row in the Series corresponds to a row in the DataFrame.

To count the total number of duplicate rows, we use the sum() function on the duplicates Series. By calling duplicates.sum(), we get the sum of True values, which represents the count of duplicate rows.
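duplicated() also takes a subset of columns and a keep parameter that controls which occurrence is flagged. A brief sketch:

# Flag rows whose species AND sepal length combination has appeared before
partial_dupes = iris_df.duplicated(subset=['Species', 'sepal_length'])
print(partial_dupes.sum())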

💡
In simple terms, the duplicated() function helps us find duplicate rows in a DataFrame.

In this example, we check for duplicate rows in the Iris dataset by applying duplicated() on the iris_df DataFrame.

By counting the number of True values in the resulting boolean Series, we can determine the total count of duplicate rows. This information is valuable for data quality analysis and ensuring the uniqueness of the data.

Learn More ➡️

13. Remove Duplicates

The drop_duplicates() function in Pandas is used to remove duplicate rows from a DataFrame.

It helps in cleaning and ensuring the uniqueness of the data by eliminating any redundant entries.

Let's explore the usage of drop_duplicates() with a simple example:

# Remove duplicate rows based on all columns
deduplicated_df = iris_df.drop_duplicates()

print(deduplicated_df.shape[0])

We use the drop_duplicates() function on the iris_df DataFrame to remove duplicate rows. By calling iris_df.drop_duplicates(), the function returns a new DataFrame with duplicate rows removed. The original iris_df DataFrame remains unchanged.

We use the shape attribute to check the number of rows in the deduplicated DataFrame. By calling deduplicated_df.shape[0], we get the number of rows in the DataFrame, which represents the count of unique rows after removing duplicates.

💡
In simple terms, the drop_duplicates() function helps us eliminate duplicate rows from a DataFrame.

In this example, we remove duplicate rows in the Iris dataset by applying drop_duplicates() on the iris_df DataFrame.

The resulting deduplicated_df DataFrame contains only the unique rows, ensuring the uniqueness of the data. This operation is useful for data cleaning and maintaining the integrity of the dataset.

Learn More ➡️

14. Pattern Matching

The str.contains() function in Pandas is used to check whether each element of a string column contains a specific pattern or substring. It is used when searching for a specific pattern within the values of a string column.

This function helps in identifying rows that contain a particular pattern, allowing us to filter or analyze the data based on specific criteria.

Let's explore the usage of str.contains() with a quick example:

# Check if the 'Species' column contains the pattern 'versi'
contains_versi = iris_df['Species'].str.contains('versi')

print(contains_versi.head())

We use the str.contains() function on the 'Species' column to check whether each value contains the pattern 'versi'.

By calling iris_df['Species'].str.contains('versi'), the function returns a boolean Series indicating whether the pattern is present (True) or not (False) in each value of the 'Species' column.

The resulting contains_versi Series contains True for the rows where the pattern 'versi' is found in the 'Species' column and False for the rows where it is not found. By printing contains_versi.head(), we can observe the boolean values corresponding to the first few rows of the 'Species' column.
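The usual next step is to use that boolean Series as a filter, keeping only the matching rows:

# Keep only the rows whose species name contains 'versi'
versicolor_rows = iris_df[contains_versi]
print(versicolor_rows.head())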

💡
In simple terms, the str.contains() function allows us to search for a specific pattern within the values of a string column.

In this example, we use it to check whether the 'Species' column in the Iris dataset contains the pattern 'versi'.

By doing so, we obtain a boolean Series that indicates which rows have the specified pattern. This function is useful when filtering or analyzing the data based on specific patterns or substrings within the string values.

Learn More ➡️

15. String Replacement

The str.replace() function in Pandas is used to replace occurrences of a pattern or substring with a new value in a string column. It is used when we want to modify or update specific parts of the string values within a column.

This function helps in performing string replacement operations, which can be useful for data cleaning, standardization, or transforming the data to a desired format.

Let's explore the usage of str.replace() with a useful example:

# Replace 'setosa' with 'SETOSA' in the 'Species' column
replaced_species = iris_df['Species'].str.replace('setosa', 'SETOSA')

print(replaced_species.head())

We use the str.replace() function on the 'Species' column to replace all occurrences of 'setosa' with 'SETOSA'. By calling iris_df['Species'].str.replace('setosa', 'SETOSA'), the function performs the replacement operation, resulting in a new Series called replaced_species.

The resulting replaced_species Series contains the modified values where all instances of 'setosa' have been replaced with 'SETOSA'. By printing replaced_species.head(), we can observe the updated values of the 'Species' column for the first few rows.
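str.replace() also understands regular expressions when you pass regex=True. A small sketch:

# Capitalize the leading 'v' of species names using a regex pattern
regex_replaced = iris_df['Species'].str.replace(r'^v', 'V', regex=True)
print(regex_replaced.unique())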

💡
In simple terms, the str.replace() function allows us to replace specific patterns or substrings within the string values of a column.

In this example, we use it to replace all occurrences of 'setosa' with 'SETOSA' in the 'Species' column of the Iris dataset. By doing so, we obtain a new Series with the modified values.

Learn More ➡️

16. Extract Substrings

The str.extract() function in Pandas extracts substrings from a string column using regular expressions. It is used when we want to extract specific segments or patterns from the values within a string column.

This function helps in retrieving specific information from the strings, such as extracting numbers, dates, or other structured patterns.

Let's explore the usage of str.extract() with an example:

# Extract the numeric part from the 'Species' column
numeric_species = iris_df['Species'].str.extract(r'(\d+)')

print(numeric_species.head())

We use the str.extract() function on the 'Species' column to extract the numeric part from each string value.

By calling iris_df['Species'].str.extract(r'(\d+)'), we specify the regular expression pattern (\d+) to capture one or more digits from the strings (the raw-string prefix r avoids escape-sequence warnings).

The resulting numeric_species DataFrame contains the extracted values, with NaN wherever the pattern finds no match. Note that the standard iris species names contain no digits, so every row here will be NaN. By printing numeric_species.head(), we can observe the result for the first few rows of the 'Species' column.
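To see the function actually capture something, here is a throwaway sketch on made-up sample codes (not part of the Iris data):

# Extract the trailing number from made-up sample codes
samples = pd.Series(['iris-001', 'iris-002', 'iris-003'])
print(samples.str.extract(r'(\d+)'))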

💡
In simple terms, the str.extract() function allows us to extract specific substrings or patterns from the values of a string column.

In this example, we use it to extract the numeric part from the 'Species' column in the Iris dataset.

The extracted values are then stored in a new DataFrame.

Learn More ➡️

17. Categorical to Dummy

The get_dummies() function in Pandas converts categorical variables into dummy or indicator variables. It is used when we want to transform categorical columns into numerical representations that can be used in machine learning models or further analysis.

This function helps in handling categorical data by creating binary columns to represent each category.

Let's explore the usage of get_dummies() with a logical example:

# Convert the 'Species' column into dummy variables
dummy_species = pd.get_dummies(iris_df['Species'])

print(dummy_species.head())

We use the get_dummies() function on the 'Species' column to convert it into dummy variables. By calling pd.get_dummies(iris_df['Species']), the function generates binary columns for each unique category in the 'Species' column.

The resulting dummy_species DataFrame contains the transformed data, where each category in the 'Species' column is represented by a separate binary column. The value of each binary column is 1 if the original value matches the category, and 0 otherwise. By printing dummy_species.head(), we can observe the dummy variables corresponding to the first few rows of the 'Species' column.
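To use the dummies in a model, you would typically attach them back to the original frame; pd.concat does that, and drop_first=True avoids a redundant column. A sketch:

# Attach the dummy columns to the original DataFrame, dropping one redundant level
encoded_df = pd.concat(
    [iris_df, pd.get_dummies(iris_df['Species'], drop_first=True)], axis=1)

print(encoded_df.head())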

💡
In simple terms, the get_dummies() function allows us to convert categorical variables into binary columns.

In this example, we use it to transform the 'Species' column in the Iris dataset into dummy variables. Each unique category in the column is represented by a separate binary column, where a value of 1 indicates the presence of that category, and 0 indicates its absence.

This transformation is useful for handling categorical data in machine learning models or performing further analysis with numerical representations.

Learn more ➡️

18. Extract Year from Dates

To showcase this feature, we'll need to switch our dataset; for this example, I'll be using an electricity-production dataset, which you'll be able to find in the GitHub repo (LINK)

The dt.year accessor in Pandas is used to extract the year component from a datetime series. It is used when we want to extract only the year information from dates in a datetime column.

This accessor allows us to focus on the year component of the dates and perform analysis or operations based on the year.

Let's explore the usage of dt.year with a quick example:

# Load the electric_production dataset
electric_df = pd.read_csv('electric_production.csv')

# Convert the 'Date' column to datetime and extract the year
electric_df['Date'] = pd.to_datetime(electric_df['Date'])
electric_df['Year'] = electric_df['Date'].dt.year
print(electric_df[['Date', 'Year']].head())

In the code above, we start by loading the electric_production dataset using Pandas' read_csv() function and storing it in the electric_df DataFrame. The dataset contains two columns: 'Date' representing the date of electric production and 'Production' representing the production value on that date.

Next, we convert the 'Date' column to a datetime data type using pd.to_datetime(). By calling electric_df['Date'] = pd.to_datetime(electric_df['Date']), we ensure that the 'Date' column is recognized as a datetime column.

We then use the dt.year accessor on the 'Date' column to extract the year component. By calling electric_df['Date'].dt.year, we extract only the year information from each date and create a new column called 'Year' in the electric_df DataFrame.

The resulting DataFrame contains the original 'Date' and the newly created 'Year' column. By printing electric_df[['Date', 'Year']].head(), we can observe the first few rows with both the original dates and the extracted year values.
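The same dt accessor exposes the other date components too. A quick sketch that builds a throwaway frame rather than modifying electric_df (so the resampling example in the next section is unaffected):

# Pull out month and weekday alongside the year
dates = electric_df['Date']
print(pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Weekday': dates.dt.day_name(),
}).head())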

💡
In simple terms, the dt.year accessor allows us to focus on the year component of dates in a datetime column.

In this example, we use it to extract the year from the 'Date' column in the electric_production dataset. By doing so, we create a new column called 'Year' containing only the year information.

This extraction can be helpful for analyzing or aggregating data based on yearly trends, patterns, or comparisons.

Learn More ➡️

19. Resample Time Series Data

The resample() function in Pandas is used to resample time series data to a different frequency. It is used when we want to change the time intervals of our data, either by upsampling (increasing the frequency) or downsampling (decreasing the frequency).

This function helps in aggregating or summarizing data over different time intervals, such as daily, monthly, or yearly.

Let's explore the usage of resample() with a useful example:

# Convert the 'Date' column to datetime and set it as the index
electric_df['Date'] = pd.to_datetime(electric_df['Date'])
electric_df.set_index('Date', inplace=True)

# Resample the data to monthly frequency and calculate the mean
monthly_mean = electric_df.resample('M').mean()
print(monthly_mean.head())

Firstly, we convert the 'Date' column to a datetime data type using pd.to_datetime().

By calling electric_df['Date'] = pd.to_datetime(electric_df['Date']), we ensure that the 'Date' column is recognized as a datetime column.

We then set the 'Date' column as the index of the electric_df DataFrame using electric_df.set_index('Date', inplace=True). Setting the index as the datetime column allows us to perform time-based operations and resampling.

Next, we use the resample() function on the electric_df DataFrame to resample the data to a monthly frequency. By calling electric_df.resample('M'), we specify 'M' as the frequency, which stands for monthly. This operation aggregates the data over monthly intervals.

Finally, we calculate the mean of each month's data using the mean() function. By calling electric_df.resample('M').mean(), we calculate the mean production value for each month.

The resulting monthly_mean DataFrame contains the resampled data, where each row represents the mean production value for a specific month. By printing monthly_mean.head(), we can observe the resampled data for the first few months.
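Swapping the frequency string changes the granularity; for instance, 'Y' aggregates per year (recent Pandas versions prefer the spellings 'YE' and 'ME'). A hedged sketch:

# Resample the same data to yearly means instead of monthly
yearly_mean = electric_df.resample('Y').mean()
print(yearly_mean.head())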

💡
In simple terms, the resample() function allows us to change the time intervals of our time series data. In this example, we resample the electric_production dataset to a monthly frequency.

By doing so, we aggregate the data over monthly intervals and calculate the mean production value for each month.

This resampling is helpful for analyzing and summarizing time series data at different frequencies, enabling us to observe patterns, trends, or seasonality on a larger time scale.

Learn More ➡️

20. Save Data to CSV

The .to_csv() function in Pandas is used to export a DataFrame to a CSV (Comma-Separated Values) file. It is used when we want to save our DataFrame as a CSV file, which is a common file format for storing tabular data.

This function helps in saving the data for future use, sharing it with others, or using it in other applications that accept CSV files as input.

Let's create a dummy dataset on flowers and export it as a CSV file using the .to_csv() function:

# Create a dummy dataset on flowers
flowers_data = {
    'Name': ['Rose', 'Lily', 'Tulip', 'Sunflower', 'Daisy'],
    'Color': ['Red', 'White', 'Pink', 'Yellow', 'White'],
    'Petals': [5, 6, 4, 10, 8],
    'Fragrance': ['Yes', 'Yes', 'No', 'No', 'Yes']
}

flowers_df = pd.DataFrame(flowers_data)

# Export the DataFrame as a CSV file
flowers_df.to_csv('flowers_dataset.csv', index=False)

In the code above, we first create a dummy dataset on flowers using a Python dictionary called flowers_data. The dataset contains information about each flower's name, color, number of petals, and fragrance.

Next, we create a DataFrame called flowers_df using the pd.DataFrame() function, passing in the flowers_data dictionary. This DataFrame represents our dummy flower dataset.

To export the DataFrame as a CSV file, we use the .to_csv() function. By calling flowers_df.to_csv('flowers_dataset.csv', index=False), we specify the filename 'flowers_dataset.csv' and set index=False to exclude the index column from the exported CSV file.

After executing this code, a CSV file named 'flowers_dataset.csv' will be created in the same directory as your Python script or notebook. This file will contain the data from the flowers_df DataFrame, with each row representing a flower and each column representing a specific attribute.
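A quick sanity check is to read the file straight back with read_csv(), confirming the round trip:

# Reload the exported file to verify it was written correctly
reloaded_df = pd.read_csv('flowers_dataset.csv')
print(reloaded_df.head())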

💡
In simple terms, the .to_csv() function allows us to export a DataFrame as a CSV file.

In this example, we create a dummy dataset on flowers and store it in the flowers_df DataFrame. By using .to_csv(), we save this DataFrame as a CSV file named 'flowers_dataset.csv'.

This file can be used for future analysis, shared with others, or imported into other applications that accept CSV files as input.

Learn More ➡️

Conclusion

In conclusion, we have explored several powerful functions in Pandas for data manipulation and analysis.

Throughout this blog post, we have seen how these functions can be used to perform various tasks such as counting unique values, transforming values, grouping data, creating pivot tables, binning data, unpivoting data, applying functions, sorting data, counting value occurrences, filling missing values, changing data types, finding duplicates, removing duplicates, pattern matching, string replacement, substring extraction, converting categorical variables to dummies, extracting the year from dates, resampling time series data, and saving data to CSV.

While it may seem overwhelming to remember all of these functions, it's important to note that the Pandas documentation is a valuable resource. It provides detailed explanations, examples, and usage guidelines for each function. Rather than memorizing all the functions, it's more efficient to understand their capabilities and consult the documentation as needed. This allows you to leverage the full potential of Pandas and apply the appropriate functions based on your data analysis requirements.

For further learning, I recommend the book "Python for Data Analysis" by Wes McKinney, the creator of Pandas. This book provides in-depth coverage of data analysis techniques using Python and Pandas. It covers various topics, including data manipulation, data cleaning, visualization, and more. You can read the e-book on Wes McKinney's website here.

I hope this blog post has provided valuable insights into the power and versatility of Pandas functions. Feel free to explore the Pandas documentation and continue your journey in mastering data manipulation with Python!

If you have any further questions or need assistance, feel free to contact me on Twitter or connect with me on LinkedIn.

Happy coding and data analysis!



Muhammad is a Machine Learning Maniac who loves to help others break into ML in an easy and friendly manner, without the use of math equations! Long Live Machine Learning
