Comparing Columns of Different Dataframes with Data Reconciliation

The work seems easy:

Verify that all the records were successfully migrated. However, as with most chores, there is typically an ‘if’ clause attached:

If any records don’t match, create a report of them.

The original plan was to carry out this reconciliation using Excel and macros. But I soon recognised the difficulty and concluded that Excel might not be the best tool for the job, so the idea of implementing it in Python emerged.

A few assumptions:
Assume that the source and destination dataframes are df1 and df2, respectively. For the time being, let’s suppose that they both have the same number of columns and that the paired columns have the same datatypes (that is, column 1 in df1 matches column 1 in df2, and so on). Assume as well that the column(s) being compared on are stored in the uid variable.

At the end of the article, I’ll briefly go over the additional steps and checks one may perform to make sure the data is adequately clean before running the main comparison script.

1. Records were correctly migrated

Perform an inner merge over all of the columns to produce a dataframe containing only the identical records. Since the merge is performed on all columns, uid is not relevant here.

df_merge = pd.merge(df1, df2, how='inner', on=df1.columns.tolist())
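
As a quick illustration with some made-up Fruit/Qty data (the same kind of data used in the later examples), only rows that are identical in every column survive this merge:

import pandas as pd

df1 = pd.DataFrame([['Apple', 1], ['Banana', 5]], columns=['Fruit', 'Qty'])
df2 = pd.DataFrame([['Apple', 2], ['Banana', 5]], columns=['Fruit', 'Qty'])

# Only the Banana row is identical in both dataframes
df_merge = pd.merge(df1, df2, how='inner', on=df1.columns.tolist())
print(df_merge)
#     Fruit  Qty
# 0  Banana    5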

2. Records in the source but not in the target

Occasionally, stray records somehow never make it through the migration process, leaving them in the source but not in the target.

There are numerous ways to locate these records; here is one:

Append df_merge back to df1, mark the duplicated rows (keep=False designates every duplicate as True), and then select only the rows that are not duplicates.

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)
df1_only['Duplicated'] = df1_only.duplicated(keep=False)
df1_only = df1_only[~df1_only['Duplicated']]

3. Records in the target but not the source

It’s possible for records to appear unexpectedly during the migration process; this is most often the result of a transformation error or human interference.

In a similar vein, we can spot the records that are present in the target but absent from the source.

df2_only = pd.concat([df2, df_merge]).reset_index(drop=True)
df2_only['Duplicated'] = df2_only.duplicated(keep=False)
df2_only = df2_only[~df2_only['Duplicated']]
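
As an aside (this is not part of the script, just a sketch of an alternative), the same two “only” dataframes can also be obtained in a single pass using the indicator argument of pd.merge:

# An outer merge with indicator=True adds a '_merge' column that marks each row
# as 'left_only', 'right_only' or 'both'
merged = pd.merge(df1, df2, how='outer', on=df1.columns.tolist(), indicator=True)
df1_only_alt = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
df2_only_alt = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')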

4. Records present in both the source and the target but with different values

Let’s say there are two dataframes, each containing just one record:

df1 = pd.DataFrame([['Apple',1]], columns=['Fruit', 'Qty'])
df2 = pd.DataFrame([['Apple',2]], columns=['Fruit', 'Qty'])

By observation, df_merge would be empty, and df1 and df2 would be identical to df1_only and df2_only respectively.

It would be meaningful to create a new dataframe showing that Apple appears in both df1 and df2, but with different values in the Qty column.

# Label each dataframe
df1_only['S/T'] = 'Source'
df2_only['S/T'] = 'Target'
# Identify the duplicated rows (pd.concat replaces the removed DataFrame.append)
df3 = pd.concat([df1_only, df2_only]).reset_index(drop=True)
df3['Duplicated'] = df3.duplicated(subset='Fruit', keep=False)
# The `subset` argument considers only a subset of the columns when identifying duplicated rows.

Applying this to a less trivial example:

>>> cols = ['Fruit', 'Qty']
>>> df1 = pd.DataFrame([['Apple',1],['Orange',3]], columns=cols)
>>> df2 = pd.DataFrame([['Apple',2],['Pear',4]], columns=cols)
...
...
>>> # Note that `df1` is still the same as `df1_only`
>>> # Repeat the step in the above code block to get `df3`
>>> df3
    Fruit  Qty     S/T  Duplicated
0   Apple    1  Source        True
1  Orange    3  Source       False
2   Apple    2  Target        True
3    Pear    4  Target       False

Boolean indexing on the Duplicated column can now be used to pick out:

  • records whose Fruit value is the same in the source and the target but whose Qty differs
  • records that are only in the source
  • records that are only in the target

df_diff = df3[df3['Duplicated']]
df1_only = df3[(~df3['Duplicated']) & (df3['S/T']=='Source')]
df2_only = df3[(~df3['Duplicated']) & (df3['S/T']=='Target')]
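
For the Apple/Orange/Pear example above, df_diff ends up holding the two Apple rows (one from each side), while df1_only and df2_only hold Orange and Pear respectively:

>>> df_diff
   Fruit  Qty     S/T  Duplicated
0  Apple    1  Source        True
2  Apple    2  Target        True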

5. Additional checks or steps

To make sure the downstream code runs successfully, the script performs a few additional checks before the main comparison (a sketch of these checks follows the list):

a. Both dataframes contain the same number of columns.

b. If the column names in the two dataframes differ, print the orphan columns that were found.

c. The columns of both dataframes are in the same order. (This is not important for the comparison itself, but it is visually tough for a human to inspect two dataframes with the columns jumbled up.)

d. Each pair of columns has exactly the same datatype. (If a source column’s datatype is int while the target column’s is float, there may be rows with empty values.)

e. Based on the column(s) the comparison is done on (i.e. uid), check whether the records are unique. (Uniqueness should always be 100%; otherwise, extra columns would need to be added to uid before the comparison can be applied.)
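
A simplified sketch of what these checks might look like in code (the function name and the exact messages below are illustrative only):

def pre_checks(df_left, df_right, uid):
    """Illustrative sketch of checks a. to e. above."""
    # a. Same number of columns
    if len(df_left.columns) != len(df_right.columns):
        print('Column counts differ')

    # b. Report orphan column names that exist on only one side
    orphans = set(df_left.columns) ^ set(df_right.columns)
    if orphans:
        print('Orphan columns:', orphans)

    # c. Columns in the same order (cosmetic, but easier to eyeball)
    if list(df_left.columns) != list(df_right.columns):
        print('Column order differs')

    # d. Matching datatypes, column by column
    for left_col, right_col in zip(df_left.columns, df_right.columns):
        if df_left[left_col].dtype != df_right[right_col].dtype:
            print(f'dtype mismatch: {left_col} vs {right_col}')

    # e. uid column(s) must identify records uniquely in both dataframes
    for name, df in [('left', df_left), ('right', df_right)]:
        if df.duplicated(subset=uid).any():
            print(f'uid is not unique in the {name} dataframe; add more columns to uid')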

6. Using the script

Using the script is straightforward: just pass three positional arguments:

  • df_left and df_right refer to your input dataframes
  • uid refers to a single column (str) or a list of columns that make up the unique key (e.g. 'Fruit')

There are two optional keyword arguments that can be passed too:

  • labels (default = ('Left', 'Right')) allows you to rename the input dataframes, perhaps to Source & Target; the labels form the keys of the output ordered dictionary
  • drop (default = [[], []]) can take a list of columns to exclude from the comparison for each dataframe
>>> df1 = pd.DataFrame([['Apple',1],['Orange',3],['Banana',5]], columns=['Fruit','Qty'])
>>> df2 = pd.DataFrame([['Apple',2],['Pear',4],['Banana',5]], columns=['Fruit','Qty'])
>>> d1 = diff_func(df1, df2, 'Fruit')

The output d1 will be an ordered dictionary with 6 dataframes:

>>> d1['Left']
    Fruit  Qty
0   Apple    1
1  Orange    3
2  Banana    5
>>> d1['Right']
    Fruit  Qty
0   Apple    2
1    Pear    4
2  Banana    5
>>> d1['Merge']
    Fruit  Qty
0  Banana    5
>>> d1['Left_only']
    Fruit  Qty
1  Orange    3
>>> d1['Right_only']
  Fruit  Qty
3  Pear    4
>>> d1['Diff']
  Left or Right  Fruit  Qty
0          Left  Apple    1
1         Right  Apple    2
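
To give an idea of what happens under the hood, here is a simplified sketch of the function’s core logic; the actual script also runs the checks from section 5 and does a little more housekeeping:

from collections import OrderedDict

import pandas as pd


def diff_func(df_left, df_right, uid, labels=('Left', 'Right'), drop=[[], []]):
    """Simplified sketch: reconcile two dataframes into 6 output dataframes."""
    uid = [uid] if isinstance(uid, str) else list(uid)
    lab = f'{labels[0]} or {labels[1]}'

    # Drop excluded columns (one drop-list per dataframe)
    dfl = df_left.drop(columns=drop[0])
    dfr = df_right.drop(columns=drop[1])

    # 1. Records that are identical in both dataframes
    df_merge = pd.merge(dfl, dfr, how='inner', on=dfl.columns.tolist())

    # 2./3. Records missing from one side: append the merge, then drop all duplicates
    left_rest = pd.concat([dfl, df_merge]).drop_duplicates(keep=False).copy()
    right_rest = pd.concat([dfr, df_merge]).drop_duplicates(keep=False).copy()

    # 4. Records whose uid appears on both sides but with differing values
    left_rest[lab] = labels[0]
    right_rest[lab] = labels[1]
    df3 = pd.concat([left_rest, right_rest]).reset_index(drop=True)
    dup = df3.duplicated(subset=uid, keep=False)

    df_diff = df3[dup].reset_index(drop=True)
    left_only = df3[~dup & (df3[lab] == labels[0])].drop(columns=lab)
    right_only = df3[~dup & (df3[lab] == labels[1])].drop(columns=lab)

    return OrderedDict([
        (labels[0], df_left),
        (labels[1], df_right),
        ('Merge', df_merge),
        (f'{labels[0]}_only', left_only),
        (f'{labels[1]}_only', right_only),
        ('Diff', df_diff),
    ])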

Conclusion

While the original purpose for writing such a script was merely to reconcile a specific set of migration data, I realised that if it could be generalised, it could be used to compare any two dataframes. Hence, I decided to write a generic function that runs a series of checks sequentially, compares the dataframes, and finally outputs the result as a dictionary containing 6 dataframes.

Why use OrderedDict when dictionaries are already ordered?

I used ordered dictionaries because this script was first developed when Python dictionaries were still unordered. Since switching to a standard dictionary wouldn’t noticeably improve efficiency, I chose to keep the OrderedDict out of sentimentality and to make sure the script still works with older versions of Python.
