MATLAB: How to solve issue with strncmp returning incorrect logical values for text comparison…

Tags: compare, fgetl, string, strncmp, text

I need to scan through a text file line by line and pull out numerical variables corresponding to given line beginnings (e.g. save subjectnumber = 2 for the line 'Subject number: 2').
I am currently attempting to do this by opening the file with fopen, then using fgetl to work through the file one line at a time, comparing the first few characters of each line with saved strings that act as 'keys' using the strncmp function. If the first 'n' characters of the line match the key, the script should save the numerical value as a variable in the workspace, to be incorporated later into the final data structure.
However, the strncmp function does not seem to be working correctly, and I cannot figure out why. Regardless of whether I compare char array to char array or convert to strings before comparison, the function returns a logical '0' even when the key matches the line. If I copy and paste the retrieved line from the document into the command window and test strncmp against the key variable saved in the workspace, I get a logical '1' (true) result. In the script itself, however, the function always returns logical '0'.
Has anybody encountered a similar issue before?
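fileID = fopen('textfile.txt');
subkey = " S u b j e c t :";
tline = fgetl(fileID); % read the first line (char vector, or -1 at end of file)
while ischar(tline)
    tline = string(tline) % convert to string (no semicolon, so it is printed)
    % strlength gives the number of characters; length of a string scalar is 1
    submatch = strncmp(tline, subkey, strlength(subkey)) % check match for subject key PROBLEM LINE
    if submatch
        % code to save numerical variable
    end
    tline = fgetl(fileID); % get next line
end
fclose(fileID);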
The printed output for fgetl for the line containing the desired information, and the subsequent strncmp check is:
tline =
" S u b j e c t : 1 " % i.e. identical to the specified key 'subkey' over the first 16 characters
submatch =
logical
0

Best Answer

Explanation: The problem is caused by the file encoding, which is little-endian UCS-2, a two-byte character encoding. What you see as two separate characters (a letter followed by a space) is actually one single two-byte character inside the file. Bringing string into the mix just confuses things further, but does not change this fundamental issue with reading the file.
Your string with space characters does not match because what look like space characters (i.e. ASCII 32, as used in subkey) are not spaces at all in the imported data: they are interpreted as NULL characters (ASCII 0). Of course they are not really characters at all, just the trailing byte of a two-byte character. For example, the first line of the file apparently contains this (note all the NULL "characters"):
>> +tline
ans =
Columns 1 through 21
729 355 42 0 42 0 42 0 32 0 72 0 101 0 97 0 100 0 101 0 114
Columns 22 through 42
0 32 0 83 0 116 0 97 0 114 0 116 0 32 0 42 0 42 0 42 0
Also note that the first two values are quite large: they come from the byte order mark (BOM) at the start of the file, which tells us the byte order and implies something about the file encoding.
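To make the mismatch concrete, here is a minimal sketch using made-up character codes (the names imported and typedKey are illustrative only, not from the original script). It mimics a line read with the wrong encoding, where every letter is followed by a NULL, and compares it with a key typed using real spaces:

```matlab
% Character codes as they arrive when the two-byte encoding is ignored:
% each letter is followed by a NULL (code 0), which merely displays like a blank.
imported = char([83 0 117 0 98 0 106 0]);

% The key as typed in a script: letters separated by real spaces (code 32).
typedKey = 'S u b j';

% The second element already differs: 0 in the imported text, 32 in the key.
importedCodes = double(imported(1:3))
typedCodes    = double(typedKey(1:3))

% So the comparison fails even though both look alike when printed.
isMatch = strncmp(imported, typedKey, numel(typedKey))
```

Text pasted into the command window has typically lost the NULLs on the way, which would explain why the same test succeeds there.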
Solutions:
  • Save the file as UTF-8, and then you won't have any problems.
  • Open the file with fopen, telling MATLAB that it uses two bytes per character, e.g.:
fileID = fopen('textfile.txt','rt','n','UTF16');
When I test this using R2012b, it gives:
>> tline
tline =
*** Header Start ***
>> +tline
ans =
42 42 42 32 72 101 97 100 101 114 32 83 116 97 114 116 32 42 42 42
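As a rough sketch of how the fopen approach might slot into the original loop, the example below builds a small test file and reads it back. Several things here are assumptions rather than facts from the thread: the file content, the key 'Subject:', the use of sscanf to pull out the number, and the encoding name 'UTF-16LE' (the answer above used 'UTF16'; which names are accepted may depend on your MATLAB release and platform). The test file is written without a byte order mark.

```matlab
% Build a small UTF-16LE test file (hypothetical content, no byte order mark).
fname = [tempname '.txt'];
txt = sprintf('*** Header Start ***\nSubject: 1\n');
fid = fopen(fname, 'w');
fwrite(fid, unicode2native(txt, 'UTF-16LE'), 'uint8');
fclose(fid);

% Reopen it, naming the encoding so each two-byte unit becomes one character.
fid = fopen(fname, 'rt', 'n', 'UTF-16LE');
subkey = 'Subject:';        % plain key: no padding needed once decoding is right
subjectnumber = [];
tline = fgetl(fid);         % char vector, or -1 at end of file
while ischar(tline)
    if strncmp(tline, subkey, numel(subkey))
        % Parse whatever follows the key as a number.
        subjectnumber = sscanf(tline(numel(subkey)+1:end), '%f');
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Once the file is decoded correctly, the key no longer needs the interleaved blanks, so it can be written exactly as the text appears in an editor.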