MATLAB: Working with Unicode paths


The following is a followup to:
This question, however, is a bit more specific. I have a file that was created by a program on Windows. I can browse to the file in Windows Explorer (Win 7), but I am unable to:
  1. Open the file in Matlab (using fopen)
  2. cd to a directory with the same name as the file, if I create one (using cd(directory))
I have uploaded the file to a public folder on my Dropbox account: https://www.dropbox.com/sh/d2mghr9xyb426lz/ZEM4DH8XTp
The files are:
  v. Békésy – 1957.txt
  v. Békésy – 1957.zip
I am currently unable to provide instructions for creating such a file from within Matlab, which is why I am providing the files for download. To preserve the name, I have also included the file in a zip, so that even if the zip is renamed on download, the file inside should keep the same name. Incidentally, extracting the zip to a folder with the same name is what created the folder that I cannot cd to with Matlab.
Thus, the question is: how do I get around issues #1 and #2 without manually renaming the file or folder through a Windows interface? I am assuming this might mean using a custom library (mex and/or Java code).
The ideal solution would be generic code that actually works for path/file manipulation, instead of requiring manual intervention any time this problem is encountered.
Thanks, Jim

Best Answer

As others have alluded to, the problem seems to be with Matlab altering the character data in the path. I still don't have a solution for changing the directory, since I don't know of a way of doing this without using Matlab strings.
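Here's how to read a file (in this case into bytes) using Java, which bypasses the Unicode problem:
% DIR_ROOT is the folder that contains the problem file (its own path must be one Matlab can represent).
dir_obj = java.io.File(DIR_ROOT);
dir_files = dir_obj.listFiles();  % array of java.io.File objects; the file names never become Matlab strings
% Read the last listed entry as raw bytes. The Java byte[] arrives as int8, hence the typecast to uint8.
file_bytes = typecast(org.apache.commons.io.FileUtils.readFileToByteArray(dir_files(end)),'uint8');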
NOTE: There are other ways of extracting the bytes of a file, but the method used above (FileUtils from Apache Commons IO) exists on my system and seemed the most straightforward.
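As one of those other methods, here is a minimal sketch in plain Java that avoids the Apache Commons dependency. It assumes a JVM with java.nio.file (Java 7 or later), which the Matlab version in question may or may not bundle. The class name ReadByListing and the helper readLastListedFile are made up for this illustration. As in the snippet above, the file is picked from the directory listing, so its name is never typed or passed through a Matlab string.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class, named for this illustration only.
final class ReadByListing {

    // Returns the bytes of the last entry that listFiles() reports for dirPath.
    // The order of listFiles() is unspecified, so "last" is only meaningful
    // when the directory holds a single file (as in the test folder here).
    static byte[] readLastListedFile(String dirPath) throws IOException {
        File[] entries = new File(dirPath).listFiles();
        if (entries == null || entries.length == 0) {
            throw new IOException("No entries found in " + dirPath);
        }
        File target = entries[entries.length - 1];
        // The File object carries the name exactly as the OS reported it.
        return Files.readAllBytes(target.toPath());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java ReadByListing <directory>");
            return;
        }
        byte[] data = readLastListedFile(args[0]);
        // Decoding as UTF-8 is an assumption about the file's content.
        System.out.println(new String(data, StandardCharsets.UTF_8));
    }
}
```

From Matlab, the same calls should be reachable as java.nio.file.Files.readAllBytes(dir_files(end).toPath()), provided the bundled JVM is recent enough.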
At this point native2unicode() or char() would be fine if you wanted the content as a string.
It seems like the problem is most likely tied to combining characters, which is one way of adding something like an accent to a "normal" letter.
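To make the idea of a combining character concrete, here is a small Java sketch (not taken from the original investigation) that prints the code points of an "e" followed by U+0301, the combining acute accent, and of the single precomposed character that results from NFC normalization. The class name CombiningDemo and the helper codePoints are made up for this illustration, and it assumes Java 8 or later for String.codePoints().

```java
import java.text.Normalizer;

// Hypothetical demo class, named for this illustration only.
final class CombiningDemo {

    // Lists the Unicode code points of s as decimal numbers separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(cp).append(' '));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Letter e followed by COMBINING ACUTE ACCENT: two code points, one visible glyph.
        String decomposed = "e\u0301";
        // NFC normalization merges the pair into the single precomposed character.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(codePoints(decomposed));       // 101 769
        System.out.println(codePoints(composed));         // 233
        System.out.println(decomposed.equals(composed));  // false
    }
}
```

The two strings render identically but compare as different, so any layer that silently converts one form into the other can end up with a name that no longer matches the entry on disk.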
I believe that the name of the file on disk that caused the problem actually contains a combining sequence that adds an accent to an e, thus the 101 769, which is the letter e (code point 101) followed by a combining acute accent (code point 769, U+0301):