Pandas Basic DataFrame Operations
Hello everyone! Here we will work through some basic pandas operations, learning them module by module. This is the first module, in which we go through how a DataFrame is created, how it is read, and how we can apply various operations on rows and columns. We will discuss the rest in another module.
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes. In other words, the data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
    import pandas as pd

    # Initialise the data as a dictionary of lists.
    data = {'Name': ['ram', 'sham', 'alpha', 'gamma'],
            'Age': [20, 21, 19, 18],
            'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
            'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}

    # Create the DataFrame.
    df = pd.DataFrame(data)

    # Print the output.
    print(df)
        Name  Age    Address Qualification
    0    ram   20      Delhi           Msc
    1   sham   21     Kanpur            MA
    2  alpha   19  Allahabad           MCA
    3  gamma   18    Kannauj           Phd
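The three components mentioned earlier (data, rows, and columns) can be inspected directly. The following is a minimal sketch that uses a shortened version of the sample data; the variable names `row_labels`, `column_labels`, `values`, and `shape` are illustrative only.

```python
import pandas as pd

# A shortened version of the sample data used in this module.
data = {'Name': ['ram', 'sham', 'alpha', 'gamma'],
        'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

row_labels = list(df.index)        # row labels (a default integer range here)
column_labels = list(df.columns)   # column labels
values = df.to_numpy()             # the underlying data as a NumPy array
shape = df.shape                   # (number of rows, number of columns)

print(row_labels)
print(column_labels)
print(values)
print(shape)
```

Because no index was supplied, pandas assigns the default row labels 0 to 3.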
    # The DataFrame is created; save it to a CSV file.
    df.to_csv("name.csv")
Column Selection: To select columns in a pandas DataFrame, we access them by their column names. Passing a list of names returns a DataFrame that contains only those columns.
df[['Name', 'Qualification']]
| | Name | Qualification |
---|---|---|
0 | ram | Msc |
1 | sham | MA |
2 | alpha | MCA |
3 | gamma | Phd |
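As a small sketch of the difference between selecting one column and several: a single name returns a Series, while a list of names returns a DataFrame. The two-row data and the names `single` and `subset` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['ram', 'sham'],
                   'Age': [20, 21],
                   'Qualification': ['Msc', 'MA']})

single = df['Name']                      # one label -> Series
subset = df[['Name', 'Qualification']]   # list of labels -> DataFrame

print(type(single).__name__)
print(type(subset).__name__)
```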
Row Selection: pandas provides dedicated indexers to retrieve rows from a DataFrame. DataFrame.loc[] retrieves rows by their index label, and rows can also be selected by passing an integer position to DataFrame.iloc[].
    # Read the CSV back, using the "Name" column as the index.
    df = pd.read_csv("name.csv", index_col="Name")

    # Retrieve rows by index label.
    first = df.loc["ram"]
    second = df.loc["gamma"]

    print(first, "\n\n\n", second)
    Unnamed: 0           0
    Age                 20
    Address          Delhi
    Qualification      Msc
    Name: ram, dtype: object

    Unnamed: 0             3
    Age                   18
    Address          Kannauj
    Qualification        Phd
    Name: gamma, dtype: object
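The `Unnamed: 0` entry in this output is the original row index, which `to_csv()` writes as an unnamed first column by default. One way to avoid it, sketched here, is to pass `index=False` when saving. The in-memory `io.StringIO` buffer and the name `restored` are assumptions made so the example does not touch the file system.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['ram', 'gamma'], 'Age': [20, 18]})

# Write to an in-memory buffer instead of a file (an assumption made
# here so that the sketch leaves no file behind).
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # do not write the row index
buffer.seek(0)

# Read it back with "Name" as the index; no extra column appears.
restored = pd.read_csv(buffer, index_col="Name")
print(restored.columns.tolist())
print(restored.loc["ram"])
```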
The iloc[] indexer allows us to retrieve rows and columns by position. To do that, we specify the positions of the rows that we want and, optionally, the positions of the columns as well. df.iloc is very similar to df.loc, but it uses only integer positions to make its selections.
    # Select the row at integer position 2 (the third row).
    row2 = df.iloc[2]
    row2
    Unnamed: 0               2
    Age                     19
    Address          Allahabad
    Qualification          MCA
    Name: alpha, dtype: object
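iloc[] can also take a column position alongside the row position. The sketch below rebuilds a similar DataFrame in memory (an assumption, so that it runs without name.csv) and compares positional selection with the label-based equivalent; `cell`, `block`, and `by_label` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Age': [20, 21, 19, 18],
                   'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd']},
                  index=['ram', 'sham', 'alpha', 'gamma'])

cell = df.iloc[2, 1]                    # third row, second column
block = df.iloc[0:2, [0, 2]]            # first two rows, first and third columns
by_label = df.loc['alpha', 'Address']   # the same cell, selected by label

print(cell)
print(block)
print(by_label)
```

Note that an iloc slice such as `0:2` excludes the end position, as ordinary Python slicing does.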
Missing data occurs when no information is provided for one or more items or for a whole unit. It is a very big problem in real-life scenarios. In pandas, missing values are also referred to as NA (Not Available) values.
Checking for missing values using isnull() and notnull()
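    import numpy as np

    # Dictionary of lists; np.nan marks the missing values.
    # (Named `data` rather than `dict` to avoid shadowing the built-in.)
    data = {'First ': [100, 90, np.nan, 95, 89, 0, 100, np.nan],
            'Second ': [30, 45, 56, np.nan, 1, 40, np.nan, 70],
            'Third ': [np.nan, 40, 80, 98, np.nan, np.nan, 13, 55]}

    # Create a DataFrame from the dictionary.
    df = pd.DataFrame(data)
    df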
| | First | Second | Third |
---|---|---|---|
0 | 100.0 | 30.0 | NaN |
1 | 90.0 | 45.0 | 40.0 |
2 | NaN | 56.0 | 80.0 |
3 | 95.0 | NaN | 98.0 |
4 | 89.0 | 1.0 | NaN |
5 | 0.0 | 40.0 | NaN |
6 | 100.0 | NaN | 13.0 |
7 | NaN | 70.0 | 55.0 |
df.isnull()
| | First | Second | Third |
---|---|---|---|
0 | False | False | True |
1 | False | False | False |
2 | True | False | False |
3 | False | True | False |
4 | False | False | True |
5 | False | False | True |
6 | False | True | False |
7 | True | False | False |
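notnull() is the inverse of isnull(), and both combine well with sum() and boolean indexing. A minimal sketch on a smaller, made-up DataFrame (the names `present`, `missing_per_column`, and `complete_rows` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First': [100, 90, np.nan, 95],
                   'Second': [30, 45, 56, np.nan]})

present = df.notnull()                         # True where a value exists
missing_per_column = df.isnull().sum()         # number of NaN values per column
complete_rows = df[df.notnull().all(axis=1)]   # rows without any NaN

print(present)
print(missing_per_column)
print(complete_rows)
```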
To fill null values in a dataset, we can use the fillna(), replace() and interpolate() functions. fillna() and replace() substitute NaN values with a value that we supply.
The interpolate() function also fills NA values in the DataFrame, but it uses an interpolation technique to estimate the missing values rather than a hard-coded value.
df.fillna(0)
| | First | Second | Third |
---|---|---|---|
0 | 100.0 | 30.0 | 0.0 |
1 | 90.0 | 45.0 | 40.0 |
2 | 0.0 | 56.0 | 80.0 |
3 | 95.0 | 0.0 | 98.0 |
4 | 89.0 | 1.0 | 0.0 |
5 | 0.0 | 40.0 | 0.0 |
6 | 100.0 | 0.0 | 13.0 |
7 | 0.0 | 70.0 | 55.0 |
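Only fillna() is demonstrated above, so here is a short sketch of replace() and interpolate() on a small, made-up DataFrame. The sentinel value -1 and the names `replaced` and `interpolated` are assumptions for illustration; interpolate() is called with its default linear method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First': [100.0, np.nan, 90.0, 80.0],
                   'Second': [30.0, 40.0, np.nan, 60.0]})

# replace(): swap NaN for a fixed value of our choosing.
replaced = df.replace(np.nan, -1)

# interpolate(): estimate each gap from its neighbouring values
# (linear interpolation along each column by default).
interpolated = df.interpolate()

print(replaced)
print(interpolated)
```

Here the gap in `First` lies between 100 and 90, so linear interpolation fills it with 95; the gap in `Second` lies between 40 and 60 and is filled with 50.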
To drop null values from a DataFrame, we use the dropna() function. This function can drop rows or columns that contain null values in different ways. By default, it drops every row that contains at least one null value.
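    # Named `data` rather than `dict` to avoid shadowing the built-in.
    data = {'First Score': [100, 90, np.nan, 95],
            'Second Score': [30, np.nan, 45, 56],
            'Third Score': [52, 40, 80, 98],
            'Fourth Score': [np.nan, np.nan, np.nan, 65]}

    df = pd.DataFrame(data)
    print(df)

    # Drop every row that contains at least one NaN.
    df.dropna()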
       First Score  Second Score  Third Score  Fourth Score
    0        100.0          30.0           52           NaN
    1         90.0           NaN           40           NaN
    2          NaN          45.0           80           NaN
    3         95.0          56.0           98          65.0
| | First Score | Second Score | Third Score | Fourth Score |
---|---|---|---|---|
3 | 95.0 | 56.0 | 98 | 65.0 |
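The "different ways" in which dropna() can work are controlled by its `axis`, `how`, `thresh`, and `subset` parameters. The following sketch applies each of them to the same scores data; the result variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56],
                   'Third Score': [52, 40, 80, 98],
                   'Fourth Score': [np.nan, np.nan, np.nan, 65]})

# Drop columns (instead of rows) that contain any NaN.
complete_columns = df.dropna(axis=1)

# Drop a row only if every value in it is NaN.
non_empty_rows = df.dropna(how='all')

# Keep rows that have at least 3 non-NaN values.
at_least_three = df.dropna(thresh=3)

# Consider only 'First Score' when deciding which rows to drop.
by_subset = df.dropna(subset=['First Score'])

print(complete_columns)
print(non_empty_rows)
print(at_least_three)
print(by_subset)
```

dropna() returns a new DataFrame in each case, so `df` itself is left unchanged.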