Python: how to compare strings/files/contents

Created at 06-27-2021

1 File content difference comparison method

File content differences are compared with the difflib module. As part of Python's standard library, difflib needs no installation. It compares sequences, such as the lines of two files, and can output the differences in several formats, including fairly readable HTML, much like the diff command on Linux. This makes it handy for comparing versions of code and configuration files. See the official difflib documentation for details.
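
As a rough sketch of the diff-like usage described above, the following compares two small files with difflib.unified_diff, which produces output in the style of `diff -u`. The file names and contents are made up for illustration and are written to a temporary directory first.

```python
import difflib
import tempfile
from pathlib import Path

# Hypothetical sample files, created only for this illustration.
workdir = Path(tempfile.mkdtemp())
old_path = workdir / "app_old.conf"
new_path = workdir / "app_new.conf"
old_path.write_text("host = localhost\nport = 8080\ndebug = true\n")
new_path.write_text("host = localhost\nport = 9090\ndebug = true\n")

# Read both files as lists of lines, keeping the line endings.
with open(old_path) as f_old, open(new_path) as f_new:
    old_lines = f_old.readlines()
    new_lines = f_new.readlines()

# unified_diff yields the header lines, hunk markers, and changed lines lazily.
diff_lines = list(
    difflib.unified_diff(
        old_lines, new_lines, fromfile=old_path.name, tofile=new_path.name
    )
)
print("".join(diff_lines), end="")
```

Lines starting with "-" come from the first file and lines starting with "+" from the second, as in the output of the diff command.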

1.1 The difference between the two strings

This example uses the difflib module to compare the differences between two strings, and then outputs them in a version control style. The sample code is as follows:

import difflib
from pprint import pprint

text1_lines = '''  1. Beautiful is better than ugly.
  2. Explicit is better than implicit.
  3. Simple is better than complex.
  4. Complex is better than complicated.'''.splitlines(keepends=True)
# Split by rows 
text2_lines = '''  1. Beautiful is better than ugly.
  3.   Simple is better than complex.
  4. Complicated is better than complex.
  5. Flat is better than nested.'''.splitlines(keepends=True)

d = difflib.Differ()  # Create a Differ() object
result = list(d.compare(text1_lines, text2_lines))  # Use the "compare" method to compare strings
pprint(result)

The example uses the Differ() class to compare two lists of lines. In addition, difflib's SequenceMatcher() class supports comparing sequences of any type, as long as the elements are hashable, and the HtmlDiff() class outputs the comparison result in HTML format. Running the example prints the delta as a list of lines, each marked with a prefix that shows how it differs.
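
As a side note on the SequenceMatcher class mentioned above, here is a minimal sketch that compares two lists of integers rather than text. The sample lists are made up for illustration.

```python
import difflib

# Two hypothetical sequences; any hashable elements work, not only strings.
a = [1, 2, 3, 4, 5]
b = [1, 2, 4, 5, 6]

# The first argument is an optional "junk" filter; None disables it.
matcher = difflib.SequenceMatcher(None, a, b)

# ratio() returns a similarity score between 0.0 and 1.0.
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")

# get_opcodes() describes how to turn a into b, step by step.
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, a[i1:i2], b[j1:j2])
```

Each opcode tag is one of 'equal', 'delete', 'insert', or 'replace', and the index pairs identify the affected slices of the two sequences.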

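Each line of a Differ delta begins with a two-letter code:

  • '- ': the line is unique to sequence 1.
  • '+ ': the line is unique to sequence 2.
  • '  ': the line is common to both sequences.
  • '? ': the line is not present in either input sequence; it is a guide that points at the differences within a line.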
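
To make these prefixes concrete, here is a minimal sketch that groups the lines of a delta by their leading code. The input lines are hypothetical, chosen so that the result is easy to follow.

```python
import difflib

# Hypothetical inputs: one line replaced and one line appended.
old = ["apple\n", "banana\n", "cherry\n"]
new = ["apple\n", "blueberry\n", "cherry\n", "date\n"]

delta = list(difflib.Differ().compare(old, new))

# The first two characters of every delta line are the code; the rest is the text.
removed = [line[2:] for line in delta if line.startswith("- ")]
added = [line[2:] for line in delta if line.startswith("+ ")]
unchanged = [line[2:] for line in delta if line.startswith("  ")]

print("removed:  ", removed)
print("added:    ", added)
print("unchanged:", unchanged)
```

Lines starting with "? " only appear when two lines are similar enough for Differ to mark the changes inside them, so they are skipped in this grouping.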

1.2 Generate beautiful comparative HTML format documents

The sample code is as follows:

import difflib

text1_lines = '''  1. Beautiful is better than ugly.
  2. Explicit is better than implicit.
  3. Simple is better than complex.
  4. Complex is better than complicated.'''.splitlines(keepends=True)
# Split by rows 
text2_lines = '''  1. Beautiful is better than ugly.
  3.   Simple is better than complex.
  4. Complicated is better than complex.
  5. Flat is better than nested.'''.splitlines(keepends=True)

d = difflib.HtmlDiff()  # Create an HtmlDiff() object
with open("test.html", "w") as file:
    # Use the make_file method to compare the strings and write them into the html file
    file.write(d.make_file(text1_lines, text2_lines))

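The make_file method compares the two lists of lines and returns a complete HTML document, which the example writes to test.html. Opening that file in a browser shows the two texts side by side with the differences highlighted.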
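
As a small variation on the same idea, make_file also accepts column titles and can limit the report to the changed lines plus some context. The sketch below uses made-up input lines and writes the report to a temporary directory; the names old.txt, new.txt, and diff_report.html are placeholders.

```python
import difflib
import tempfile
from pathlib import Path

# Hypothetical inputs for illustration.
old_lines = ["alpha\n", "beta\n", "gamma\n"]
new_lines = ["alpha\n", "beta!\n", "gamma\n", "delta\n"]

html = difflib.HtmlDiff().make_file(
    old_lines,
    new_lines,
    fromdesc="old.txt",  # title of the left column
    todesc="new.txt",    # title of the right column
    context=True,        # show only changed lines and their surroundings
    numlines=2,          # number of context lines around each change
)

# Write with an explicit encoding; make_file declares utf-8 in the page by default.
out_path = Path(tempfile.mkdtemp()) / "diff_report.html"
out_path.write_text(html, encoding="utf-8")
print("report written to", out_path)
```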

2 File directory difference comparison method

When we audit code or verify backup results, we often need to check that the original and target files are consistent. Python's standard library includes the filecmp module for this purpose. filecmp can compare single files, sets of files, and whole directories, including their subdirectories. For example, a report can list the files or subdirectories that exist only in the target, and files that have the same name in both places are still compared to decide whether they are actually the same file. The filecmp module ships with Python 2.3 and later, so no additional installation is required. See the official filecmp documentation for details. filecmp provides three operations. cmp (single file comparison) is as follows:

filecmp.cmp(f1, f2, shallow=True)

Compare the files named f1 and f2, returning True if they seem equal, False otherwise.

cmpfiles (multi-file comparison) is as follows:

filecmp.cmpfiles(dir1, dir2, common, shallow=True)

Compare the files in the two directories dir1 and dir2 whose names are given by common.
Returns three lists of file names: match, mismatch, errors.

For example, cmpfiles('a', 'b', ['c', 'd/e']) will compare a/c with b/c and a/d/e with b/d/e. 
'c' and 'd/e' will each be in one of the three returned lists.

dircmp (directory comparison) is as follows:

class filecmp.dircmp(a, b, ignore=None, hide=None)

Construct a new directory comparison object, to compare the directories a and b. 
ignore is a list of names to ignore, and defaults to filecmp.DEFAULT_IGNORES. 
hide is a list of names to hide, and defaults to [os.curdir, os.pardir].

2.1 Single file comparison

Single file comparison: the filecmp.cmp(f1, f2, shallow=True) function compares the files named f1 and f2 and returns True if they appear equal and False otherwise. shallow defaults to True, which means that two files whose os.stat() signatures (file type, size, and modification time) are identical are taken to be equal without their contents being read. If the signatures differ, or if shallow is False, the files are treated as different when their sizes or contents differ. The sample code is as follows:

import filecmp

print(filecmp.cmp("test1.txt", "test2.txt"))  # False
print(filecmp.cmp("test2.txt", "test3.txt"))  # True
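
The effect of the shallow parameter can be sketched with two files that have the same size and modification time but different contents. The file names and the timestamp below are made up for illustration.

```python
import filecmp
import os
import tempfile
from pathlib import Path

# Two hypothetical files with the same size but different contents.
workdir = Path(tempfile.mkdtemp())
first = workdir / "first.txt"
second = workdir / "second.txt"
first.write_text("aaaa\n")
second.write_text("bbbb\n")

# Force identical access and modification times, so that both files
# end up with the same os.stat() signature (type, size, mtime).
stamp_ns = 1_600_000_000 * 10**9
os.utime(first, ns=(stamp_ns, stamp_ns))
os.utime(second, ns=(stamp_ns, stamp_ns))

filecmp.clear_cache()  # drop any cached results from earlier comparisons

# Shallow: identical signatures are accepted without reading the files.
shallow_equal = filecmp.cmp(first, second, shallow=True)
# Deep: the contents are read and compared.
deep_equal = filecmp.cmp(first, second, shallow=False)

print("shallow:", shallow_equal)
print("deep:   ", deep_equal)
```

In this setup the shallow comparison should report the files as equal while the deep comparison reports them as different, which is why shallow=False is the safer choice when verifying backups.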

2.2 Multi-file comparison

Multi-file comparison: the filecmp.cmpfiles(dir1, dir2, common, shallow=True) function compares the files named in common between the dir1 and dir2 directories. It returns three lists of file names: match, mismatch, and errors. match contains the files that are the same in both directories, mismatch contains the files that differ, and errors contains the files that could not be compared, for example because a file is missing from one of the directories or cannot be read.

The complete sample code is as follows:

import filecmp

print(filecmp.cmpfiles('one', 'two', ['test1.txt', 'test2.txt', 'test3.txt', 'test4.txt', 'test5.txt']))
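
For a self-contained variant, the sketch below builds two temporary directories and shows how files end up in each of the three lists. The directory names follow the example above, while the file names and contents are hypothetical.

```python
import filecmp
import tempfile
from pathlib import Path

# Build two small directories to compare.
root = Path(tempfile.mkdtemp())
dir_one = root / "one"
dir_two = root / "two"
dir_one.mkdir()
dir_two.mkdir()

# same.txt is identical in both directories.
(dir_one / "same.txt").write_text("identical\n")
(dir_two / "same.txt").write_text("identical\n")
# changed.txt exists in both directories but differs.
(dir_one / "changed.txt").write_text("short\n")
(dir_two / "changed.txt").write_text("a much longer line\n")
# missing.txt exists only in the first directory.
(dir_one / "missing.txt").write_text("only in one\n")

match, mismatch, errors = filecmp.cmpfiles(
    dir_one,
    dir_two,
    ["same.txt", "changed.txt", "missing.txt"],
    shallow=False,  # compare contents, not just os.stat() signatures
)

print("match:   ", match)
print("mismatch:", mismatch)
print("errors:  ", errors)
```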

2.3 Directory comparison

Create a directory comparison object with the filecmp.dircmp(a, b, ignore=None, hide=None) class, where:

  • a and b are the names of the directories to be compared.
  • ignore is the list of names to ignore; the default is filecmp.DEFAULT_IGNORES.
  • hide is the list of names to hide; the default is [os.curdir, os.pardir].

The dircmp class provides detailed information about a directory comparison, such as the files that exist only in directory a, the subdirectories that exist in both a and b, and the files that match. It also supports recursion into subdirectories. dircmp provides three methods for printing reports:

  1. report(): Print (to sys.stdout) a comparison between a and b.
  2. report_partial_closure(): Print a comparison between a and b and common immediate subdirectories.
  3. report_full_closure(): Print a comparison between a and b and common subdirectories (recursively).

The dircmp class offers a number of interesting attributes that may be used to get various bits of information about the directory trees being compared.

  1. left: The directory a.
  2. right: The directory b.
  3. left_list: Files and subdirectories in a, filtered by hide and ignore.
  4. right_list: Files and subdirectories in b, filtered by hide and ignore.
  5. common: Files and subdirectories in both a and b.
  6. left_only: Files and subdirectories only in a.
  7. right_only: Files and subdirectories only in b.
  8. common_dirs: Subdirectories in both a and b.
  9. common_files: Files in both a and b.
  10. common_funny: Names in both a and b, such that the type differs between the directories, or names for which os.stat() reports an error.
  11. same_files: Files which are identical in both a and b, using the class’s file comparison operator.
  12. diff_files: Files which are in both a and b, whose contents differ according to the class’s file comparison operator.
  13. funny_files: Files which are in both a and b, but could not be compared.
  14. subdirs: A dictionary mapping names in common_dirs to dircmp objects.
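
A few of these attributes can be tried out with the following sketch, which builds two temporary directory trees. The directory names one and two follow the article's examples; the file and subdirectory names are hypothetical.

```python
import filecmp
import tempfile
from pathlib import Path

# Build two small directory trees to compare.
root = Path(tempfile.mkdtemp())
left = root / "one"
right = root / "two"
for directory in (left, right, left / "shared_dir", right / "shared_dir"):
    directory.mkdir()

(left / "a.txt").write_text("same\n")
(right / "a.txt").write_text("same\n")
(left / "b.txt").write_text("left\n")
(right / "b.txt").write_text("right version, longer\n")
(left / "only_left.txt").write_text("x\n")
(right / "only_right.txt").write_text("y\n")

dc = filecmp.dircmp(left, right)

# Attributes are computed lazily, on first access.
print("left_only:   ", dc.left_only)
print("right_only:  ", dc.right_only)
print("common_dirs: ", dc.common_dirs)
print("common_files:", dc.common_files)
print("same_files:  ", dc.same_files)
print("diff_files:  ", dc.diff_files)
```

The two versions of b.txt deliberately differ in size, so the comparison does not depend on file timestamps.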

Example: compare the directories one and two. The example creates a dircmp object and calls its report() method to print a summary of the differences. The code is as follows:

import filecmp

cmp = filecmp.dircmp("one", "two")
# report() prints to sys.stdout and returns None, so call it directly
cmp.report()
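
Because the report methods print directly to sys.stdout, one way to keep the report as a string is to redirect the output. The sketch below does that and also walks the subdirs attribute to collect differing files recursively. The collect_diff_files helper and all file names are hypothetical, introduced only for this illustration.

```python
import contextlib
import filecmp
import io
import tempfile
from pathlib import Path

# Build two small directory trees with one nested subdirectory each.
root = Path(tempfile.mkdtemp())
left = root / "one"
right = root / "two"
for directory in (left, right, left / "sub", right / "sub"):
    directory.mkdir()

(left / "b.txt").write_text("left\n")
(right / "b.txt").write_text("right version, longer\n")
(left / "sub" / "c.txt").write_text("one\n")
(right / "sub" / "c.txt").write_text("one plus more\n")


def collect_diff_files(dc, prefix=""):
    """Return the relative paths of differing files, descending into subdirs."""
    found = [prefix + name for name in dc.diff_files]
    for name, sub_dc in dc.subdirs.items():
        found.extend(collect_diff_files(sub_dc, prefix + name + "/"))
    return found


dc = filecmp.dircmp(left, right)
differing = collect_diff_files(dc)
print(differing)

# Capture the recursive report in a string instead of printing it.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    dc.report_full_closure()
report_text = buffer.getvalue()
print(report_text)
```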
