Oct 24, 2023 by Charles Beumier
Image format testing is a necessary action for digital preservation to ensure that the data will be readable in the long term. It may also be part of the solution to detect image manipulation for cybersecurity defense or in Capture-The-Flag exercises.
The number of image formats is huge and each has its particularities. This post presents some tips to test image files and a few references to available open-source tools that check common image formats. The last section describes a simple BMP image manipulation which is detected and repaired with a hexadecimal editor.
To check whether a file is a valid image, we suggest at least four methods: displaying the file with a viewer, using a format checker, accessing its metadata, and deriving its file type or extension from its content. If one of these methods reports trouble, one can go back to the format description and interpret the file content accordingly.
Most of the time, displaying an image file is the simplest and fastest way to check the validity of the digital information. A standard image viewer should report errors in case the format is not consistent with the version(s) it supports. Trying several image viewers may be helpful since they often differ in their ability to report problems.
When more precise error reporting is needed, more specialized tools can be run. They may support one or several image formats. A few tools are presented in section 2.
Reading the metadata of an image is a valid way to gather important information or to check that the file is properly formed. One of the best-known tools for most file types (not only images) is exiftool (https://exiftool.org/). From the command line you can read and sometimes edit metadata values.
The CheckFileType web application (https://www.checkfiletype.com/) derives the type or possible extension of any uploaded file (up to 16MB) from the file content. For some types it prints metadata, and for some files it suggests a list of possible matching extensions. This is particularly useful if the file has no extension, or a wrong one.
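As a rough illustration of content-based type detection, the following Python sketch compares the first bytes of a file with a few well-known signatures ("magic numbers"). It is not how CheckFileType works internally, which we do not know; the function name guess_image_type and the short signature table are assumptions made for this example, and the table is far from exhaustive.

```python
from typing import Optional

# Small, non-exhaustive table of leading byte signatures (an assumption
# for this sketch; real detectors know many more formats and rules).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def guess_image_type(data: bytes) -> Optional[str]:
    """Return a likely image type for `data`, or None if nothing matches."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def guess_file_type(path: str) -> Optional[str]:
    """Same check applied to the first bytes of a file on disk."""
    with open(path, "rb") as f:
        return guess_image_type(f.read(16))
```

A matching signature only says what the file claims to be; it does not prove that the rest of the file is well formed, which is why the other checks remain useful.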
If none of the previous four methods helps, you have to go back to the format description. The website fileformat.com holds a list containing most (if not all) image formats (https://docs.fileformat.com/image). However, the description of some formats may not be as complete as in other references. For example, BMP is better described by its Wikipedia page.
The PNG file chunk inspector analyses the different chunks contained in a PNG image. The tool is a web application (https://www.nayuki.io/page/png-file-chunk-inspector) which verifies the checksum of each chunk and displays errors accordingly.
pngcheck (https://manpages.ubuntu.com/manpages/xenial/man1/pngcheck.1.html) verifies the checksums contained in the chunks of a PNG image and reports possible inconsistencies.
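To make the idea of a chunk checksum concrete, here is a minimal Python sketch of the kind of verification such tools perform. It is only an illustration, not the implementation of pngcheck or of the chunk inspector. In a PNG file each chunk stores a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 computed over the type and data. The helper names check_png_chunks and make_chunk are ours.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def check_png_chunks(data: bytes):
    """Return a list of (chunk_type, crc_ok) pairs for the chunks in `data`."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file: bad signature")
    results = []
    pos = len(PNG_SIGNATURE)
    while pos + 12 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body_end = pos + 8 + length
        if body_end + 4 > len(data):
            raise ValueError("truncated chunk %r" % ctype)
        (stored_crc,) = struct.unpack(">I", data[body_end:body_end + 4])
        # The CRC covers the chunk type and the chunk data, not the length.
        computed_crc = zlib.crc32(data[pos + 4:body_end]) & 0xFFFFFFFF
        results.append((ctype.decode("latin-1"), stored_crc == computed_crc))
        pos = body_end + 4
        if ctype == b"IEND":
            break
    return results


def make_chunk(ctype: bytes, payload: bytes) -> bytes:
    """Build a well-formed chunk (helper used only for the demo below)."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)


# Demo on a synthetic 1x1 8-bit grayscale PNG built in memory.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # one filter byte followed by one pixel
png = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
       + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))
good_report = check_png_chunks(png)

damaged = bytearray(png)
damaged[23] ^= 0x01  # flip one bit inside the IHDR data (height field)
bad_report = check_png_chunks(bytes(damaged))

print(good_report)
print(bad_report)
```

On the intact data every chunk is reported as consistent; after the single bit flip only the IHDR chunk fails its CRC, which is the kind of localized error message the dedicated tools provide.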
To detect inconsistencies in JPEG files, the JHOVE and Bad Peggy applications are efficient solutions. In a blog post of the Open Preservation Foundation (https://openpreservation.org/blogs/jpegvalidation/), Bad Peggy performed slightly better than JHOVE on the 3070 files of a Google image test suite.
JHOVE is the JSTOR/Harvard Object Validation Environment, which is able to identify the format of a digital object and to validate that the object conforms to the specification of that format. The initial release supports many formats such as ASCII, HTML, XML, PDF, GIF, JPEG, TIFF, AIFF and WAVE, and can be extended by adding other format modules. JHOVE is written in Java and needs a 1.5 compliant Java Runtime Environment. It offers a command-line and a GUI application.
Bad Peggy uses the Java Image IO library to check image files in the BMP, GIF, JPEG and PNG formats. It detects small deviations from the standard (e.g. extra data, unknown values) as well as major damage (unknown formats, truncated files, inconsistencies).
Several programs or interfaces are available to check TIFF files. JHOVE seems to be a major reference, while DPF Manager (https://www.projecttracks.be/en/toolbox-overview/digitaal-bewaren/validating-tiff-files-with-dpf-manager) has the advantage of offering a clear interface. A comparison of these two programs and a few others can be found at https://openpreservation.org/blogs/tiff-format-validation-easy-peasy/.
Note that the link given in that post to download and install DPF Manager is broken, but the GitHub repository is available.
The BMP format is quite simple, as it consists of headers with metadata followed by the array of pixel values. As shown in the use case in section 3, a simple hexadecimal editor and the format description can sometimes be enough to detect an inconsistency and repair it.
In this small exercise we only manipulated the first metadata fields present in the file header. According to the format description (https://docs.fileformat.com/image/bmp/), a BMP file should start with the 2 bytes "BM", followed by 4 bytes for the file size and two 2-byte fields reserved for the application creating the image. Then, at offset 10, comes a 4-byte offset to the pixel data (possibly stored uncompressed as BGRBGRBGR... for a color image).
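The 14-byte file header described above can be decoded with a few lines of Python. This is a minimal sketch: the function name parse_bmp_file_header and the dictionary keys are our own choices, and the demo header is synthetic rather than taken from the image used in the exercise.

```python
import struct


def parse_bmp_file_header(data: bytes) -> dict:
    """Decode the 14-byte BMP file header (all integers are little-endian)."""
    if len(data) < 14:
        raise ValueError("file too short for a BMP file header")
    # "<" = little-endian without padding: 2s (magic), I (file size),
    # H, H (two reserved fields), I (offset to the pixel data).
    magic, file_size, reserved1, reserved2, pixel_offset = struct.unpack(
        "<2sIHHI", data[:14]
    )
    if magic != b"BM":
        raise ValueError("missing 'BM' signature")
    return {
        "file_size": file_size,
        "reserved1": reserved1,
        "reserved2": reserved2,
        "pixel_offset": pixel_offset,
    }


# Synthetic header: file size 66 bytes, pixel data starting at offset 54.
demo_header = b"BM" + struct.pack("<IHHI", 66, 0, 0, 54)
info = parse_bmp_file_header(demo_header)
print(info)
```

To inspect a real file, pass the result of open(path, "rb").read(14) to the function.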
We first modified the 4-byte integer at offset 2 to see the influence of changing the stored file size. For this we used the Linux hexadecimal editor ghex (sudo apt install ghex), which we recommend for the ease of use, clarity and features of its graphical interface. We had to select little-endian encoding (used by BMP) to correctly interpret the modified bytes. The image viewers ImageMagick, Eye of MATE and Shotwell were not affected by the new values of the file size field. We can assume that these viewers rely on the image width and height for rendering and do not need the file size field.
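The same manipulation can be reproduced without a hexadecimal editor. The Python sketch below, with helper names of our own, overwrites the file size field in memory and checks whether the field agrees with the real length of the data. The sample bytes are a synthetic stand-in for a BMP file (a file header followed by zero bytes), not a displayable image.

```python
import struct


def stored_file_size(data: bytes) -> int:
    """Read the 4-byte little-endian file size stored at offset 2."""
    return struct.unpack_from("<I", data, 2)[0]


def set_file_size_field(data: bytes, value: int) -> bytes:
    """Return a copy of `data` with the file size field overwritten."""
    patched = bytearray(data)
    struct.pack_into("<I", patched, 2, value)
    return bytes(patched)


def size_field_is_consistent(data: bytes) -> bool:
    """True when the stored file size equals the real length of the data."""
    return stored_file_size(data) == len(data)


# Synthetic 66-byte stand-in: 14-byte file header followed by 52 zero bytes.
sample = b"BM" + struct.pack("<IHHI", 66, 0, 0, 54) + bytes(52)
tampered = set_file_size_field(sample, 0x12345678)

print(size_field_is_consistent(sample))    # field matches the real length
print(size_field_is_consistent(tampered))  # field no longer matches
print(tampered[2:6].hex())                 # least significant byte comes first
```

The last line prints 78563412, which is how the value 0x12345678 appears in a hexadecimal editor when stored in little-endian order. A simple consistency test like this one flags the manipulation even though the viewers ignore it.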
We then modified the 4-byte integer at byte offset 10 to change the offset from which pixel values are read. As seen in the picture below, two different effects show up: a shift in the display and/or a change in the colors. The shift, expressed in pixels, is the change in offset divided by 3, since each pixel of this image is stored on 3 consecutive bytes (B, G, R). The change of color appears when the change in offset is not a multiple of 3. In that case, the bytes are read in a GRB or an RBG order instead of BGR, and the color rendering is affected.
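Both effects can be reproduced on a handful of bytes. The Python sketch below is only a toy model of what a viewer does for an uncompressed 24-bit image: the function read_pixels (a name we made up) interprets consecutive byte triples as (B, G, R) starting from a given position, which plays the role of the pixel data offset.

```python
def read_pixels(pixel_bytes: bytes, start: int, count: int):
    """Read `count` pixels stored as B, G, R from `start`; return (R, G, B) tuples."""
    pixels = []
    for i in range(count):
        b, g, r = pixel_bytes[start + 3 * i:start + 3 * i + 3]
        pixels.append((r, g, b))
    return pixels


# Three pixels (pure red, pure green, pure blue), stored as B, G, R.
row = bytes([0, 0, 255,   0, 255, 0,   255, 0, 0])

correct = read_pixels(row, 0, 3)         # original offset
shifted_by_3 = read_pixels(row, 3, 2)    # offset changed by one whole pixel
shifted_by_1 = read_pixels(row, 1, 2)    # offset changed by one byte

print(correct)
print(shifted_by_3)
print(shifted_by_1)
```

With a shift of 3 bytes the colors are intact but the content starts one pixel later. With a shift of 1 byte the channels are mixed: the first pixel is read as green instead of red and the second as magenta instead of green.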
Note that these results are specific to the (common) case of a BGR representation. Monochrome or BGRA BMP images will behave differently.
We can repair the small 'bug' we introduced by using ghex to overwrite the modified offset field with its original value (0x36), so that the image is again properly displayed.
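When the original value is not known, it can often be recomputed. The Python sketch below relies on an assumption that holds for a plain 24-bit BMP without color palette or extra bit masks: the pixel data starts right after the 14-byte file header and the DIB (information) header, whose size is stored as a 4-byte integer at offset 14. For the usual 40-byte information header this gives 14 + 40 = 54 = 0x36, the value restored above. The function names are ours and the sample data is synthetic.

```python
import struct

FILE_HEADER_SIZE = 14


def expected_pixel_offset(data: bytes) -> int:
    """Offset of the pixel data, ASSUMING no palette and no extra bit masks."""
    (dib_header_size,) = struct.unpack_from("<I", data, FILE_HEADER_SIZE)
    return FILE_HEADER_SIZE + dib_header_size


def repair_pixel_offset(data: bytes) -> bytes:
    """Return a copy of `data` with the offset field (at byte 10) recomputed."""
    patched = bytearray(data)
    struct.pack_into("<I", patched, 10, expected_pixel_offset(data))
    return bytes(patched)


# Synthetic stand-in: file header, 40-byte DIB header, 12 bytes of pixels.
dib_header = struct.pack("<I", 40) + bytes(36)
original = b"BM" + struct.pack("<IHHI", 66, 0, 0, 54) + dib_header + bytes(12)

tampered = bytearray(original)
struct.pack_into("<I", tampered, 10, 55)  # simulate the manipulation
repaired = repair_pixel_offset(bytes(tampered))

print(hex(struct.unpack_from("<I", repaired, 10)[0]))
print(repaired == original)
```

For paletted images or images with bit masks the pixel data starts later, so the recomputed value should be treated as a candidate to check in a viewer rather than as a guaranteed fix.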