Reading a Binary File in C#
Introduction
I've been working on a time-series analysis project where the data are stored as structures in massive binary files. Importing the files into a database would cause a performance hit with no added value, so dealing with the files in their original binary format is the best option. My initial supposition was that throughput would be limited by disk speed, but I found that my first implementation resulted in 100% CPU utilization on my research box. It was clearly time to optimize.
While there is a wealth of information available on the innumerable ways of reading files with C#, there is virtually no discussion of the performance implications of various design decisions. Hopefully, this article will allow the reader to improve the performance of binary file reading in their application and will shed some light on some of the undocumented performance traps hidden in the System.IO classes.
Is There Data?
It may seem silly to have a section on checking for the end of a file (EOF), but there are a plethora of methods employed by programmers, and improperly checking for the EOF can absolutely cripple performance and introduce mysterious errors and exceptions into your application.
BinaryReader.PeekChar Method
If you are using this method in any application, God save you. Based on its frequent appearance in .NET newsgroups, this method is widely used, but I'm not certain why it even exists. According to Microsoft, the BinaryReader.PeekChar method "Returns the next available character and does not advance the byte or character position." The return value is an int containing "The next available character, or -1 if no more characters are available or the stream does not support seeking." Gee, that sounds awfully useful in determining if we're at the end of the stream.
The BinaryReader class is used for reading binary files, which are broken into bytes, not chars, so why peek at the next char rather than the next byte? I could understand if there was an issue implementing a common interface, but the TextReader derived classes simply use Peek. Why doesn't the BinaryReader include a plain old Peek method that returns the next byte as an int? By now, you're probably wondering why I'm ranting so much about this. Who cares? So, you get the next byte for free? Well, something entirely unnatural happens somewhere in the bowels of this method that periodically results in a "Conversion Buffer Overflow" exception. As the result of some dark voodoo process, certain two-byte combinations in your binary file cannot be converted into an appropriate return value by the method. I have no idea why certain byte combinations have been deemed toxic to PeekChar, but be prepared for freaky results if you use it.
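Since BinaryReader has no byte-oriented peek, one common workaround (a sketch of my own, not from the article) is an extension method that peeks at the underlying stream directly, sidestepping PeekChar's character conversion entirely:

```csharp
using System;
using System.IO;

static class BinaryReaderExtensions
{
    // Returns the next byte without advancing the stream, or -1 at EOF.
    // Requires a seekable stream; no character conversion is involved,
    // so byte values that upset PeekChar are handled safely.
    public static int PeekByte(this BinaryReader br)
    {
        Stream s = br.BaseStream;
        if (s.Position >= s.Length)
            return -1;
        int b = s.ReadByte();
        s.Position -= 1;   // rewind so the peek does not consume the byte
        return b;
    }
}

class PeekDemo
{
    static void Main()
    {
        // 0xFF 0xFE is not a valid character encoding sequence -- the kind
        // of byte pair that can trip up PeekChar.
        BinaryReader br = new BinaryReader(
            new MemoryStream(new byte[] { 0xFF, 0xFE }));
        Console.WriteLine(br.PeekByte());  // 255
        br.ReadBytes(2);
        Console.WriteLine(br.PeekByte());  // -1
    }
}
```

Note that this still queries Position and Length on every call, which has its own cost, as the next section shows.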
Stream.Position >= Stream.Length
This test is pretty straightforward. If your current position is greater than or equal to the length of the stream, you're going to be pretty hard-pressed to read any additional data. As it turns out, this statement is a massive performance bottleneck.
After finishing the initial build of my application, it was time for some optimization. I downloaded the ANTS Profiler Demo from Red Gate Software, and was shocked to discover that over half the execution time of my program was being spent in the EOF method of my data reader. Without the profiler results, I never would have imagined that this innocuous looking line of code was cutting the performance of my application in half. After all, I opened the FileStream using the FileShare.Read option, so there was no danger of the file's length changing, but it appears as though the position and file length are not cached by the class, so every call to Position or Length results in another file system query. In my benchmarking, I've found that calling both Position and Length takes twice as long as calling one or the other.
_position >= _length (Cache it yourself)
It's sad, but true. This is the fastest method by a long shot. Get the length of your FileStream once when you open it, and don't forget to advance your position counter every time you read. Maybe Microsoft will fix this performance trap someday, but until then, don't forget to cache the file length and position yourself!
Read It!
Now that we know there's data, we have to read it into our data structures. I've included three different approaches, with varying merits. I did not include the dangerous approach of casting a byte array of freshly read data into a structure because I prefer to avoid dangerous code if at all possible.
FileStream.Read with PtrToStructure
Logically, I assumed that the fastest way to read in a structure would be the functional equivalent of C++'s basic_istream::read method. There are plenty of articles and newsgroup posts about using the Marshal class to torture raw bits into a struct. The cleanest implementation I've found is this:
public static TestStruct FromFileStream(FileStream fs)
{
    byte[] buff = new byte[Marshal.SizeOf(typeof(TestStruct))];
    int amt = 0;
    while (amt < buff.Length)
        amt += fs.Read(buff, amt, buff.Length - amt);
    GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
    TestStruct s = (TestStruct)Marshal.PtrToStructure(handle.AddrOfPinnedObject(), typeof(TestStruct));
    handle.Free();
    return s;
}
BinaryReader.ReadBytes with PtrToStructure
This approach is functionally almost identical to the FileStream.Read approach, but I provided it as a more apples-to-apples comparison to the other BinaryReader approach. The code is as follows:
public static TestStruct FromBinaryReaderBlock(BinaryReader br)
{
    byte[] buff = br.ReadBytes(Marshal.SizeOf(typeof(TestStruct)));
    GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
    TestStruct s = (TestStruct)Marshal.PtrToStructure(handle.AddrOfPinnedObject(), typeof(TestStruct));
    handle.Free();
    return s;
}
BinaryReader with individual Read calls for structure fields
I assumed that this would be the slowest method for filling my data structures -- it was certainly the least sexy approach. Here's the relevant sample code:
public static TestStruct FromBinaryReaderField(BinaryReader br)
{
    TestStruct s = new TestStruct();
    s.longField = br.ReadInt64();
    s.byteField = br.ReadByte();
    s.byteArrayField = br.ReadBytes(16);
    s.floatField = br.ReadSingle();
    return s;
}
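To see the field-by-field reader in action end to end, here is a self-contained round trip. The literal values are arbitrary choices of mine; the field order matches the TestStruct fields used above:

```csharp
using System;
using System.IO;

class FieldReadDemo
{
    // Writes one record with BinaryWriter, then reads it back field by
    // field; returns true if every field survives the round trip.
    public static bool RoundTrip()
    {
        MemoryStream ms = new MemoryStream();
        BinaryWriter bw = new BinaryWriter(ms);
        bw.Write(123456789L);        // longField
        bw.Write((byte)7);           // byteField
        bw.Write(new byte[16]);      // byteArrayField
        bw.Write(2.5f);              // floatField
        bw.Flush();

        ms.Position = 0;
        BinaryReader br = new BinaryReader(ms);
        long longField = br.ReadInt64();
        byte byteField = br.ReadByte();
        byte[] byteArrayField = br.ReadBytes(16);
        float floatField = br.ReadSingle();

        return longField == 123456789L && byteField == 7
            && byteArrayField.Length == 16 && floatField == 2.5f;
    }

    static void Main()
    {
        Console.WriteLine(RoundTrip());  // True
    }
}
```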
Results
As I've already foreshadowed, my assumptions about the performance of various read techniques were entirely wrong for my data structures. Using the BinaryReader to populate the individual fields of my structures was more than twice as fast as the other methods. These results are highly sensitive to the number of fields in your structure. If you are concerned about performance, I recommend testing both approaches. I found that, at about 40 fields, the results for the three approaches were nearly equivalent, and beyond that, the block reading approaches gained the upper hand.
Using the Test App
I've thrown together a quick benchmarking application with simplified reading classes to demonstrate the techniques outlined so far. It has facilities to generate sample data and benchmark the three reading approaches with dynamic and cached EOF detection.
Generating Test Data
By default, test data is created in the same directory as the executable with the filename "sampledata.bin". The number of records to be created can be varied. Ten million records will take up a little more than 276 MB, so make sure you have enough disk space to accommodate the data. The 'Randomize Output' checkbox determines whether each record will be created using random data to thwart NTFS's disk compression. Click the 'Generate Data' button to build the file.
Benchmarking
Benchmarking results are more reliable when averaged over many trials. Adjust the number of trials for each test scenario using the 'Test Count' box. 'Update Frequency' can be used to adjust how frequently the status bar will inform you of progress. Designate an update frequency greater than the number of records to avoid including status bar updates in your benchmark results. The 'Drop Best and Worst Trials from Average' check box will omit the longest and shortest trial from the average entry -- they will still be listed in the 'Results' ListView. Select the readers to be tested using the checkboxes -- 'BinaryReader Block' corresponds to the PtrToStructure approach. Select the 'EOF detection' methods to test -- 'Dynamic' uses the Length and Position properties each time EOF is called. Click 'Run Tests' to generate results.
Miscellaneous Findings
StructLayoutAttribute
If you're working with pre-defined binary files, you will become very familiar with the StructLayoutAttribute. This attribute allows you to tell the compiler specifically how to lay out a struct in memory using the LayoutKind and Pack parameters. Marshaling a byte array into a structure where the memory layout differs from its layout on disk will result in corrupted data. Make sure they match.
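The article never shows the TestStruct declaration, so here is a plausible reconstruction of mine, based on the fields read in the samples above; the Pack = 1 value and field order are assumptions:

```csharp
using System;
using System.Runtime.InteropServices;

// Sequential layout with Pack = 1 means no padding bytes are inserted,
// so the in-memory layout matches a tightly packed on-disk record.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct TestStruct
{
    public long longField;
    public byte byteField;
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 16)]
    public byte[] byteArrayField;
    public float floatField;
}

class LayoutDemo
{
    static void Main()
    {
        // With Pack = 1 the marshaled size is exactly 8 + 1 + 16 + 4 = 29.
        Console.WriteLine(Marshal.SizeOf(typeof(TestStruct)));  // 29
    }
}
```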
Warning! Depending on the way a structure is saved, you may need to read and discard empty packing bytes between reading fields when using the BinaryReader.
MarshalAsAttribute
Be sure to use the MarshalAsAttribute for all fixed-width arrays in your structure. Structures with variable-length arrays cannot be marshaled to or from pointers.
Writing Data
Writing binary data can be accomplished in the same ways as reading. I imagine that the performance considerations are very similar as well. So, writing out the fields of a structure using the BinaryWriter is probably optimal for small structures. Larger structures can be marshaled into byte arrays using this pattern:
public byte[] ToByteArray()
{
    byte[] buff = new byte[Marshal.SizeOf(typeof(TestStruct))];
    GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
    Marshal.StructureToPtr(this, handle.AddrOfPinnedObject(), false);
    handle.Free();
    return buff;
}
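Here is a self-contained sketch of that pattern round-tripped through PtrToStructure. The Sample struct and the generic helpers are mine, for illustration, not the article's:

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct Sample
{
    public int a;
    public double b;
}

class RoundTripDemo
{
    // Marshal any struct into a byte array (the write-side pattern).
    public static byte[] ToByteArray<T>(T value) where T : struct
    {
        byte[] buff = new byte[Marshal.SizeOf(typeof(T))];
        GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
        Marshal.StructureToPtr(value, handle.AddrOfPinnedObject(), false);
        handle.Free();
        return buff;
    }

    // Marshal a byte array back into a struct (the read-side pattern).
    public static T FromByteArray<T>(byte[] buff) where T : struct
    {
        GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
        T value = (T)Marshal.PtrToStructure(handle.AddrOfPinnedObject(), typeof(T));
        handle.Free();
        return value;
    }

    static void Main()
    {
        Sample s = new Sample { a = 42, b = 3.5 };
        Sample t = FromByteArray<Sample>(ToByteArray(s));
        Console.WriteLine(t.a == 42 && t.b == 3.5);  // True
    }
}
```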
Marshal.SizeOf
Even modest changes to a method can yield significant boosts to performance when the method is called millions or billions of times during the execution of a program. Apparently, Marshal.SizeOf is evaluated at runtime even when there is a call to typeof as the parameter. I shaved several minutes off of my application's execution time by creating a class with a static Size property to use in place of Marshal.SizeOf. Since the return value is calculated every time the application is started, the dangers of using a constant for size are avoided.
internal sealed class TSSize
{
    public static int _size;

    static TSSize()
    {
        _size = Marshal.SizeOf(typeof(TestStruct));
    }

    public static int Size
    {
        get { return _size; }
    }
}
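The same pattern applied to a small hypothetical PointStruct of my own, with a check that the cached value agrees with a fresh Marshal.SizeOf call:

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct PointStruct
{
    public int x;
    public int y;
}

internal sealed class PSSize
{
    private static readonly int _size;

    // Marshal.SizeOf runs once, at type initialization, instead of on
    // every record read.
    static PSSize() { _size = Marshal.SizeOf(typeof(PointStruct)); }

    public static int Size { get { return _size; } }
}

class SizeDemo
{
    static void Main()
    {
        Console.WriteLine(PSSize.Size);  // 8 (two packed 4-byte ints)
        Console.WriteLine(PSSize.Size == Marshal.SizeOf(typeof(PointStruct)));  // True
    }
}
```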
Source: https://www.codeproject.com/Articles/10750/Fast-Binary-File-Reading-with-C