Pandas Basics
Honestly, I'd rather avoid the pandas DataFrame entirely; it always feels hard to use, though perhaps my skill level just isn't there yet. Still, I often need it for reading and writing the lab's Excel files and for exporting results to Excel, and seaborn, which I used before, is also built on pandas. So here is a review of the basic operations.
Reading, Writing and Creating
Data Reading and Writing
SQL
- Reading
import pandas as pd
from sqlalchemy import create_engine
db = create_engine("mysql+pymysql://username:password@localhost:3306/database_name?charset=utf8")  # 3306 is MySQL's default port
sql = "select * from test"
df = pd.read_sql(sql, db, index_col="index") # index_col sets the index column; auto-generated by default
- Writing
df.to_sql(name='test', con=db,
if_exists="append", # If table exists: append, replace (delete and recreate), fail (don't insert)
index=False # Don't insert df index to database
)
Excel
- Reading
df = pd.read_excel(r'file_path',
    sheet_name='specify sheet; defaults to the first',
    index_col=0,   # Use column 0 as the index; omit to auto-generate a new one (read_excel has no index parameter)
    header=0,      # Use row n as the column names; default is 0, can be None (auto-generate 0..n)
    usecols=[0, 2] # Only import columns 0 and 2
)
- Writing
'''
Write to different sheets
'''
# Create workbook
excelWriter = pd.ExcelWriter('file_path/test.xlsx')  # ExcelWriter, not ExcelFile (which is a reader)
# Write to workbook
df.to_excel(
excelWriter,
sheet_name='',
index=False, # Don't write df index to excel
encoding='utf-8', # Only in older pandas; this parameter was removed in 2.0
columns=['a', 'b'], # Specify columns to write
na_rep=0, # Missing value handling (fill with 0)
inf_rep=0, # Infinity handling (fill with 0)
)
# Save (nothing is written without saving)
excelWriter.save()  # In pandas 2.0+, use excelWriter.close() or a with-block instead
'''
Direct write
'''
df.to_excel('file_path/test.xlsx') # Parameters: index, encoding, columns, na_rep, inf_rep
CSV
- Reading
df = pd.read_csv(
r'file_path/test.csv',
sep=",", # Specify delimiter; default is comma
nrows=2, # Specify number of rows to read
encoding='utf-8',
engine='python', # Add this when path contains Chinese characters
usecols=[0, 2], # Only import columns 0, 2
index_col=0, # Set column 0 as index
header=0 # Set row n as columns, default is 0, can be None
)
- Writing
df.to_csv(
r'file_path/test.csv',
index=False, # Don't write index column
columns=['a', 'b'], # Specify columns to write
sep=',', # Set delimiter (default is comma)
na_rep=0, # Fill missing values with 0
encoding='utf-8',
)
TXT
- Reading
pd.read_table(r'file_path/test.txt', sep='\t') # Default delimiter is tab; can also read csv files
Source: https://www.modb.pro/db/26894
Creating
What annoys me most is that DataFrame has no np.zeros / np.ones style helper to initialize an empty DataFrame from an existing one.
df_empty = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
So one approach is to take the column names from an existing DataFrame and create a new one from them.
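A minimal sketch of that approach (the frame contents and column names are illustrative):

```python
import pandas as pd

# An existing frame whose column layout we want to reuse
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})

# No np.zeros/np.ones analogue exists, but an empty frame with the
# same columns is one line:
df_empty = pd.DataFrame(columns=df.columns)

print(list(df_empty.columns))  # ['A', 'B', 'C', 'D']
print(len(df_empty))           # 0
```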
Indexing
Pandas data indexing is like a book’s table of contents, allowing us to quickly find the chapter we want. For large amounts of data, creating reasonable and business-meaningful indexes is crucial for data analysis.
Understanding Indexes
Below is an example of indexes in a simple DataFrame:
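The original figure is not reproduced here, but a small DataFrame with country names as the row index behaves the same way (all numbers are made up for illustration):

```python
import pandas as pd

# Stand-in for the figure: rows indexed by country, columns by metric
df = pd.DataFrame({'population': [100, 200, 300],
                   'area': [10, 20, 30]},
                  index=['China', 'India', 'USA'])

print(df.index)    # the row index: country names
print(df.columns)  # the column index: each label points to a Series
print(df.loc['India', 'population'])  # 200
```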

Where:
- The row index is the index of the data; each column label points to a Series
- The DataFrame's row index is also the index shared by the Series that make up its columns
- Building indexes makes data more intuitive, e.g., each row is for a specific country
- Building indexes facilitates data processing
- Indexes allow duplicates, but typically shouldn’t in business scenarios
Sometimes data with many row and column levels will have multi-level indexes.
Building Indexes
Previously we learned that we can specify indexes when loading data to generate a DataFrame:
data = 'https://www.gairuo.com/file/data/dataset/team.xlsx'
df = pd.read_excel(data, index_col='name') # Set index to name
df
'''
team Q1 Q2 Q3 Q4
name
Liver E 89 21 24 64
Arry C 36 37 37 57
Ack A 57 60 18 84
Eorge C 93 96 71 78
Oah D 65 49 61 86
'''
If we didn’t specify an index when loading, we can use df.set_index() to specify one:
df.set_index('month') # Set month as index
df.set_index(['month', 'year']) # Set month and year as multi-level index
'''
sale
month year
1 2012 55
4 2014 40
2013 84
10 2014 31
'''
s = pd.Series([1, 2, 3, 4])
df.set_index(s) # Specify an index
df.set_index([s, 'year']) # Specify both a custom index and existing field
df.set_index([s, s**2]) # Computed index
# Other parameters
df.set_index('month', drop=False) # Keep original column
df.set_index('month', append=True) # Keep original index
df.set_index('month', inplace=True) # Build the index and overwrite df
Resetting Indexes
Sometimes we want to discard the existing index and start over; we can use df.reset_index():
df.reset_index() # Clear index
df.set_index('month').reset_index() # Equivalent to doing nothing
# Delete original index, month column is gone
df.set_index('month').reset_index(drop=True)
df2.reset_index(inplace=True) # Overwrite to take effect
# Cancel year level index
df.set_index(['month', 'year']).reset_index(level=1)
df2.reset_index(level='class') # Same as above using level index name
df.reset_index(level='class', col_level=1) # Column index
# Fill in specified name for non-existent level names
df.reset_index(level='class', col_level=1, col_fill='species')
Index Types
To accommodate various business data processing needs, indexes have different types defined for various data types:
Numeric Index
There are several types:
- RangeIndex: immutable index over a monotonic integer range.
- Int64Index: immutable ndarray of int64 labels, ordered and sliceable.
- UInt64Index: for unsigned integer labels.
- Float64Index: for float64 labels.
(Note: Int64Index, UInt64Index and Float64Index were deprecated in pandas 1.4 and removed in 2.0; a plain pd.Index now handles all numeric dtypes.)
pd.RangeIndex(1,100,2)
# RangeIndex(start=1, stop=100, step=2)
pd.Int64Index([1,2,3,-4], name='num')
# Int64Index([1, 2, 3, -4], dtype='int64', name='num')
pd.UInt64Index([1,2,3,4])
# UInt64Index([1, 2, 3, 4], dtype='uint64')
pd.Float64Index([1.2,2.3,3,4])
# Float64Index([1.2, 2.3, 3.0, 4.0], dtype='float64')
Categorical Index
A categorical can only take a limited (usually fixed) number of possible values (categories). Think of it as an enum: gender, for example, has only male and female, yet appears in every row; processing it as plain text would be inefficient. The underlying implementation is pandas.Categorical.
pd.CategoricalIndex(['a', 'b', 'a', 'b'])
# CategoricalIndex(['a', 'b', 'a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')
Categories will be explained in detail later; their advantages only show with very large datasets.
Interval Index
pd.interval_range(start=0, end=5)
'''
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
closed='right',
dtype='interval[int64]')
'''
Multi Index
Will be explained in detail later in the tutorial.
arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
'''
MultiIndex([(1, 'red'),
(1, 'blue'),
(2, 'red'),
(2, 'blue')],
names=['number', 'color'])
'''
DateTime Index
# From one date continuously to another date
pd.date_range(start='1/1/2018', end='1/08/2018')
# Specify start time and periods
pd.date_range(start='1/1/2018', periods=8)
# Monthly periods
pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')
# Nested periods
pd.period_range(start=pd.Period('2017Q1', freq='Q'),
end=pd.Period('2017Q2', freq='Q'), freq='M')
Timedelta Index
import datetime
import numpy as np

pd.TimedeltaIndex(data=['06:05:01.000030', '+23:59:59.999999',
                        '22 day 2 min 3us 10ns', '+23:29:59.999999',
                        '+12:19:59.999999'])
# Using datetime / numpy timedeltas
pd.TimedeltaIndex(['1 days', '1 days, 00:00:05',
                   np.timedelta64(2, 'D'),
                   datetime.timedelta(days=2, seconds=2)])
Period Index
t = pd.period_range('2020-5-1 10:00:05', periods=8, freq='S')
pd.PeriodIndex(t, freq='S')
Index Object
Row and column indexes in Pandas are actually Index objects. Here’s how to create an index object:
Creating Objects
pd.Index([1, 2, 3])
# Int64Index([1, 2, 3], dtype='int64')
pd.Index(list('abc'))
# Index(['a', 'b', 'c'], dtype='object')
# Can define a name
pd.Index(['e', 'd', 'a', 'b'], name='something')
Viewing
df.index
# RangeIndex(start=0, stop=4, step=1)
df.columns
# Index(['month', 'year', 'sale'], dtype='object')
Properties
The following methods also apply to df.columns, since they are all index objects:
# Properties
df.index.name # Name
df.index.array # array
df.index.dtype # Data type
df.index.shape # Shape
df.index.size # Number of elements
df.index.values # array
# Others, less commonly used
df.index.empty # Is empty
df.index.is_unique # Has no duplicates
df.index.names # List of names
df.index.is_all_dates # All datetime
df.index.has_duplicates # Has duplicates
Operations
The following methods also apply to df.columns, since they are all index objects:
# Methods
df.index.astype('int64') # Convert type
df.index.isin() # Check existence, see example below
df.index.rename('number') # Rename index
df.index.nunique() # Number of unique values
df.index.sort_values(ascending=False,) # Sort, descending
df.index.map(lambda x:x+'_') # Map function processing
df.index.str.replace('_', '') # String replace
df.index.str.split('_') # Split
df.index.to_list() # Convert to list
df.index.to_frame(index=False, name='a') # Convert to DataFrame
df.index.to_series() # Convert to series
df.index.to_numpy() # Convert to numpy
df.index.unique() # Deduplicate
df.index.value_counts() # Deduplicate with counts
df.index.where(df.index=='a') # Filter
df.index.rename('grade', inplace=False) # Rename index name
df.index.rename(['species', 'year']) # Multi-level, rename index names
df.index.max() # Maximum value
df.index.argmax() # Index of maximum value
df.index.any()
df.index.all()
df.index.T # Transpose; useful for multi-level indexes
Index Renaming
Modify row and column index names.
# Rename column indexes one by one
df.rename(columns={"A": "a", "B": "c"})
df.rename(str.lower, axis='columns')
# Rename row indexes
df.rename(index={0: "x", 1: "y", 2: "z"})
df.rename({1: 2, 2: 4}, axis='index')
# Modify data type
df.rename(index=str)
# Re-modify indexes
replacements = {l1:l2 for l1, l2 in zip(list1, list2)}
df.rename(replacements) # Renames the row index by default
Index Name Renaming
Note, this modifies the name of the index, not the index or column names themselves:
s.rename_axis("animal") # Rename the index's name
df.rename_axis(["dow", "hr"]) # Rename the names of a multi-level row index
df.rename_axis('info', axis="columns") # Rename the column index's name
# Rename multi-level row index names
df.rename_axis(index={'a': 'A', 'b': 'B'})
# Rename multi-level column index names
df.rename_axis(columns={'name': 's_name', 'b': 'B'})
df.rename_axis(columns=str.upper) # Column index names to uppercase
Query and Modify
Data Inspection
When we first get data, we need to spot-check it: partly to understand the data structure, partly to sample for data-quality issues. Common methods:
| Syntax | Operation | Return Type |
|---|---|---|
| df.head(n) | View first n rows of DataFrame | DataFrame |
| df.tail(n) | View last n rows of DataFrame | DataFrame |
| df.sample(n) | View n random samples | DataFrame |
All of the above select entire rows.
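A quick runnable sketch of the three methods on a tiny made-up frame (random_state is passed to sample only so the draw is repeatable):

```python
import pandas as pd

# A tiny illustrative frame
df = pd.DataFrame({'name': list('abcdef'), 'Q1': [10, 20, 30, 40, 50, 60]})

print(df.head(2))                    # first 2 rows
print(df.tail(2))                    # last 2 rows
print(df.sample(2, random_state=0))  # 2 random rows; random_state fixes the draw
```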
View Head df.head()
After loading data, we usually need to look at the head data:
df.head()
out:
name team Q1 Q2 Q3 Q4
0 Liver E 89 21 24 64
1 Arry C 36 37 37 57
2 Ack A 57 60 18 84
3 Eorge C 93 96 71 78
4 Oah D 65 49 61 86
# Can specify quantity
df.head(15)
View Tail df.tail()
View the last tail data.
df.tail()
out:
name team Q1 Q2 Q3 Q4
95 Gabriel C 48 59 87 74
96 Austin7 C 21 31 30 43
97 Lincoln4 C 98 93 1 20
98 Eli E 11 74 58 91
99 Ben E 21 43 41 74
# Can specify quantity
df.tail(15)
View Sample df.sample()
df.sample() will randomly return one sample data.
df.sample()
out:
name team Q1 Q2 Q3 Q4
79 Tyler A 75 16 44 63
# Can specify quantity
df.sample(15)
Data truncation:
# Remove data before and after index
df.truncate(before=2, after=4) # Only keep index 2-4
s.truncate(before="60", after="66")
df.truncate(before="A", after="B", axis="columns") # Select columns
Column Operations
Both of the following methods can represent a column:
df['name'] # Returns the column as a Series
df.name
df.Q1
# df.1Q cannot be used even if the column is named 1Q
# df.my name cannot be used because of the space; use an underscore in the name instead
Note that when a column name is a valid Python variable name, it can be accessed directly as an attribute.
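A short sketch of the two access styles (the frame contents are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Liver', 'Arry'], 'Q1': [89, 36]})

print(df['Q1'].tolist())  # [89, 36] -- bracket access always works
print(df.Q1.tolist())     # [89, 36] -- attribute access works because 'Q1' is a valid identifier
# df.1Q would be a SyntaxError, and spaces cannot appear in attribute names;
# such columns must be accessed as df['1Q'] or df['my name']
```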
Selecting Rows and Columns
Sometimes we need to select partial columns or rows by condition. Common methods:
| Operation | Syntax | Return Type |
|---|---|---|
| Select column | df[col] | Series |
| Select row by index | df.loc[label] | Series |
| Select row by position | df.iloc[loc] | Series |
| Select rows by slice | df[5:10] | DataFrame |
| Filter rows by expr | df[bool_vec] | DataFrame |
Next we will focus on these query methods.
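Before diving in, here is a compact sketch contrasting label-based and position-based selection on a made-up frame:

```python
import pandas as pd

# Illustrative frame: names as the row index, one score column
df = pd.DataFrame({'Q1': [89, 36, 57]}, index=['Liver', 'Arry', 'Ack'])

print(df.loc['Arry'])  # select a row by its label -> Series
print(df.iloc[1])      # select the same row by position -> Series
print(df[1:3])         # slice rows by position -> DataFrame
print(df['Q1'])        # select a column -> Series
```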
Slicing []
We can slice rows like a Python list to select a subset, but a single integer index (e.g. df[0]) is not supported:
df[:2] # First two rows
df[4:10]
df[:] # All data, rarely used
df[:10:2] # By step
s[::-1] # Reverse order
Can also select columns:
df['name'] # Only one column, Series
df[['Q1', 'Q2']] # Select two columns
df[['name']] # Select one column; returns a DataFrame, note the difference from above
By Label .loc
The format of df.loc is df.loc[<row expression>, <column expression>]:
Single label:
# Represents index, strings need quotes
df.loc[0] # Select row with index 0
df.loc[8]
Single list of labels:
df.loc[[0,5,10]] # Rows with index 0, 5, 10
df.loc[['Eli', 'Ben']] # If index is name
# Boolean selection, length must match index
df.loc[[False, True]*50] # Show rows where True; every other one here
Slice with labels (includes both start and stop):
df.loc[0:5] # Index slice, rows 0-5, including 5
df.loc['2010':'2014'] # If index is time, can query with string
df.loc[:] # All
# This method also works on a Series
Column filtering (the row part must be given):
dft.loc[:, ['Q1', 'Q2']] # All rows, Q1 and Q2 columns
dft.loc[:10, 'Q1':] # Rows 0-10, Q1 and all later columns
By Position .iloc
df.iloc is similar to df.loc, but only uses natural index (0-n index for rows and columns), not labels.
df.iloc[:3]
df.iloc[:]
df.iloc[2:20:3]
s.iloc[:3]
Get Specific Value .at
Similar to loc, but only retrieves a single scalar value; the form is df.at[<row label>, <column label>]:
# Note: string index needs quotes
df.at[4, 'Q1'] # 65
df.at['lily', 'Q1'] # 65 assuming index is name
df.at[0, 'name'] # 'Liver'
df.loc[0].at['name'] # 'Liver'
# Get value of other column corresponding to specified column value
df.set_index('name').at['Eorge', 'team'] # 'C'
df.set_index('name').team.at['Eorge'] # 'C'
# Get value at specified index of column
df.team.at[3] # 'C'
Similarly, iat is like iloc and only supports numeric positions:
df.iat[4, 2] # 65
df.loc[0].iat[1] # 'E'
.get does dictionary-style access, returning a default when the key is missing (0 in the example):
df.get('name', 0) # Is name column
df.get('nameXXX', 0) # 0, returns default value
s.get(3, 0) # 93, Series passes index returns specific value
df.name.get(99, 0) # 'Ben'
Expression Filtering
[] slicing can use expressions for filtering:
df[df['Q1'] == 8] # Q1 equals 8
df[~(df['Q1'] == 8)] # Not equal to 8
df[df.name == 'Ben'] # Name is Ben
df.loc[df['Q1'] > 90, 'Q1':] # Q1 greater than 90; show columns from Q1 onward
df.loc[(df.Q1 > 80) & (df.Q2 < 15)] # and relationship
df.loc[(df.Q1 > 90) | (df.Q2 < 90)] # or relationship
df[df.Q1 > df.Q2]
The index part of df.loc can take an expression for data filtering.
df.loc[df['Q1'] == 8] # Equals 8
df.loc[df.Q1 == 8] # Equals 8
df.loc[df['Q1'] > 90, 'Q1':] # Q1 greater than 90; show columns from Q1 onward
# Other expressions same as slicing
df.loc[:, lambda df: df.columns.str.len()==4] # Boolean sequence
df.loc[:, lambda df: [i for i in df.columns if 'Q' in i]] # Column name list
df.iloc[:3, lambda df: df.columns.str.len()==2] # Boolean sequence
Logical comparison functions:
df.eq() # Equal ==
df.ne() # Not equal !=
df.le() # Less than or equal <=
df.lt() # Less than <
df.ge() # Greater than or equal >=
df.gt() # Greater than >
# All support axis{0 or 'index', 1 or 'columns'}, default 'columns'
df[df.Q1.ne(89)] # Q1 not equal to 89
df.loc[df.Q1.gt(90) & df.Q2.lt(90)] # and relationship, Q1>90 and Q2<90
Other functions:
# isin
df[df.team.isin(['A','B'])] # Contains groups A and B
df[df.isin({'team': ['C', 'D'], 'Q1':[36,93]})] # Complex query, other values are NaN
Function Filtering
df[lambda df: df['Q1'] == 8] # Q1 is 8
df.loc[lambda df: df.Q1 == 8, 'Q1':'Q2'] # Q1 is 8, show Q1 and Q2
where and mask
s.where(s > 90) # Non-matching values become NaN
s.where(s > 90, 0) # Non-matching values become 0
# np.where, greater than 80 is True otherwise False
np.where(s>80, True, False)
np.where(df.num>=60, 'pass', 'fail')
s.mask(s > 90) # Matching values become NaN
s.mask(s > 90, 0) # Matching values become 0
# Example: show divisible values, show negative for non-divisible
m = df.loc[:,'Q1':'Q4'] % 3 == 0
df.loc[:,'Q1':'Q4'].where(m, -df.loc[:,'Q1':'Q4'])
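A small self-contained sketch of where versus mask on a Series (the values are made up):

```python
import pandas as pd

s = pd.Series([85, 92, 78, 95])

print(s.where(s > 90).tolist())     # non-matching values become NaN (dtype becomes float)
print(s.where(s > 90, 0).tolist())  # [0, 92, 0, 95]  -- non-matching values become 0
print(s.mask(s > 90, 0).tolist())   # [85, 0, 78, 0]  -- matching values become 0
```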
# lookup takes equal-length row and column label lists, returns an array
# Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0
df.lookup([1,3,4], ['Q1','Q2','Q3']) # array([36, 96, 61])
df.lookup([1], ['Q1']) # array([36])
query
df.query('Q1 > Q2 > 90') # Write SQL-like where statement directly
df.query('Q1 + Q2 > 180')
df.query('Q1 == Q2')
df.query('(Q1<50) & (Q2>40) and (Q3>90)')
df.query('Q1 > Q2 > Q3 > Q4')
df.query('team != "C"')
df.query('team not in ("E","A","B")')
# For column names with spaces, use backticks
df.query('B == `team name`')
# Supports passing variables, e.g., 40 points above average
a = df.Q1.mean()
df.query('Q1 > @a+40')
df.query('Q1 > `Q2`+@a')
# df.eval() usage similar to df.query
df[df.eval("Q1 > 90 > Q3 > 10")]
df[df.eval("Q1 > `Q2`+@a")]
filter
Use filter() to select rows and columns by their labels.
df.filter(items=['Q1', 'Q2']) # Select two columns
df.filter(regex='Q', axis=1) # Column names containing Q
df.filter(regex='e$', axis=1) # Ending with e
df.filter(regex='1$', axis=0) # Regex, index names containing 1
df.filter(like='2', axis=0) # Index contains 2
# Index starting with 2, column names containing Q
df.filter(regex='^2', axis=0).filter(like='Q', axis=1)
Merge and Add Rows/Columns
Adding Columns
Assume original data is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
index=['dog', 'hawk'])
slen = len(df['num_legs'])
- Direct assignment
df['a'] = pd.Series(np.random.randn(slen), index=df.index) # Remember to add index
df['b'] = None # Add a column with None values
df['c'] = [2, 4] # Add list data
# c1 and c3 columns have same order, c2 is opposite, see below
df['c1'] = ['no', 'yes']
df.index = [1, 0]
df['c2'] = pd.Series(['no', 'yes'])
df['c3'] = pd.Series(['no', 'yes'], index=df.index)
- loc method
df.loc[:,'d'] = pd.Series(np.random.randn(slen), index=df.index)
df.loc[:, 'd'] = [2, 4]
- insert method
With insert, the new column name cannot duplicate an existing one; even updating a just-inserted column this way raises an error
df.insert(len(df.columns), 'e', pd.Series(np.random.randn(slen), index=df.index))
df.insert(len(df.columns), 'ee', [1,2])
- assign method
assign method parameters can be Series, scalar, list, and can add multiple columns at once
df = df.assign(f=df.num_wings.mean()) # Use num_wings column mean as new column f
df = df.assign(A=df.num_wings.sum(), B=[1,2]) # Add columns A and B
- concat method
pd.concat([df, pd.Series(['yes', 'yes']).rename('t')], axis=1) # Add column t
Notes:
- Each method’s parameters can be Series, scalar, or list
- insert method’s new column name cannot be same as existing ones, even updating just-added columns will error
- With df['a'] = pd.Series(['no', 'yes']), if the DataFrame's index order has been changed, the values align to the Series' own index by default; use index=df.index to align with the DataFrame's current index order
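Putting the column-adding notes together, here is a small end-to-end sketch using assign (the frame mirrors the example above; the new column names 'total' and 'tag' are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
                  index=['dog', 'hawk'])

# assign returns a new DataFrame and can add several columns in one call
df2 = df.assign(total=df.num_legs + df.num_wings,  # computed from existing columns
                tag=['mammal', 'bird'])            # plain list, aligned by position

print(df2['total'].tolist())  # [4, 4]
print(df2['tag'].tolist())    # ['mammal', 'bird']
```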
Adding Rows
import pandas as pd
import numpy as np
# Create empty DataFrame
df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
- Using loc
for i in range(4):
df.loc[i] = [np.random.randint(-1, 1) for n in range(3)]
# df.loc[i] = 5  # adds a record with every value set to 5
- Using append (removed in pandas 2.0; use pd.concat instead)
df.append({'lib': 2, 'qty1': 3, 'qty2': 4}, ignore_index=True)
# append can also directly add DataFrame
df2 = pd.DataFrame([[1,2,3], [2,3,4]], columns=['lib', 'qty1', 'qty2'])
df.append(df2, ignore_index=True) # ignore_index=True ignores df2's index
- Regenerate DataFrame
Collect the rows to be added as dictionaries in a list, then build the DataFrame from that list:
row_list = []
input_rows = [[1,2,3], [2,3,4]] # Data to insert
for row in input_rows:
dict1 = dict(lib=row[0], qty1=row[1], qty2=row[2]) # Convert data to dict
row_list.append(dict1) # Save to list
df = pd.DataFrame(row_list)
Merge
append() Vs. concat()
When concatenating or merging DataFrames there are generally two directions: vertical and horizontal. Vertical means stacking two or more DataFrames top to bottom into one, with no alignment even if indexes or columns repeat: crude, direct stacking. Horizontal merging considers shared indexes, combining all column data that sit on the same index onto one row, much like using Excel's VLOOKUP to merge two tables that share an id but carry different data.
Common functions for concatenating and merging DataFrames are append(), concat() and merge(). append() can only concatenate vertically, while concat() and merge() can do both. concat() defaults to vertical concatenation (the same as append()); to merge horizontally with concat() you need to set the axis parameter. Default: concat(axis=0) is vertical, concat(axis=1) is horizontal.
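A minimal sketch of the two directions with concat (frames and values are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'b': [3, 4]}, index=['x', 'y'])

v = pd.concat([df1, df2])          # vertical: crude stacking, unshared cells become NaN
h = pd.concat([df1, df2], axis=1)  # horizontal: columns merged on the shared index

print(v.shape)  # (4, 2)
print(h.shape)  # (2, 2)
```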
concat() Vs. merge()
After discussing concat() and append() horizontal and vertical concatenation above, let’s talk about the difference between concat() and merge(). Both can do horizontal and vertical merging, but there are some differences in usage and output.
Here we introduce the concept of join. concat()'s default join method is outer, while merge() defaults to inner. Another important difference is that concat() merges by index, while merge() merges by column label. If the merge column has been set as the index, reset it before merging, otherwise a KeyError will occur.
# concat with outer join (default)
pd.concat([population, unemployment], axis=1)
# concat with inner join
pd.concat([population, unemployment], axis=1, join='inner')
For merge operations:
# merge() default is inner join, parameter is 'how' not 'join'
pd.merge(population, unemployment, left_on='ZipCode', right_on='Zip')
# If column is set as index, need to reset_index() first
population = population.reset_index()
unemployment = unemployment.reset_index()
pd.merge(population, unemployment, left_on='ZipCode', right_on='Zip', how='outer')
# Using same column name with on parameter
population.rename(columns={'ZipCode':'Zip'}, inplace=True)
pd.merge(population, unemployment, on=['Zip'], how='outer')
join() Vs. concat()
join has four merge methods: how='left', how='right', how='inner' and how='outer'. merge() also has all these methods. merge() is the most powerful of the four functions, but in practice you don’t always need to use merge() - sometimes append() and concat() may be more convenient.
# join default merge method is how='left'
population.join(unemployment)
# join with right
population.join(unemployment, how='right')
df1.join(df2, how='left') means merging based on the left DataFrame. join and concat both merge by index, so the corresponding indexes must exist. Compared with join, concat lacks the left and right merge methods, but their outer and inner results are identical:
population.join(unemployment, how='outer')
pd.concat([population, unemployment], join='outer', axis=1)
# Above two results are the same
population.join(unemployment, how='inner')
pd.concat([population, unemployment], join='inner', axis=1)
# Above two results are the sameSummary of append(), concat(), join() and merge()
append()
Syntax: df1.append(df2)
Description: append() simply stacks two DataFrames vertically, no index needed.
concat()
Syntax: pd.concat([df1, df2])
Description: concat() can merge multiple rows or columns horizontally or vertically, can use inner or outer method, needs index.
join()
Syntax: df1.join(df2)
Description: join() can use multiple merge methods, besides inner and outer can also use left and right, these operations also need index.
merge()
Syntax: pd.merge(df1, df2)
Description: Most versatile merge function. No index needed.
merge_ordered() Function
merge_ordered() function can perform two operations with one function: merge() and sort_values().
pd.merge_ordered(hardware, software, on=['', ''], suffixes=['', ''], fill_method='ffill')
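Since the call above only shows the parameter shape, here is a runnable sketch with made-up monthly tables (the names hardware and software echo the template; all values are illustrative):

```python
import pandas as pd

# Made-up time-ordered tables
hardware = pd.DataFrame({'date': ['2020-01', '2020-03'], 'hw': [100, 120]})
software = pd.DataFrame({'date': ['2020-02', '2020-03'], 'sw': [80, 90]})

# Outer-merges on 'date' and keeps the result sorted by it;
# fill_method='ffill' carries the last seen value forward into gaps
m = pd.merge_ordered(hardware, software, on='date', fill_method='ffill')
print(m)
```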