MATLAB: Text Extraction and retrieval

data mininginformation retrievalText Analytics Toolboxtext extractiontext miningword count

<P ID=1>
A LITTLE BLACK BIRD.
</P>
<P ID=2>
Story about a bird,
(1811)
</P>
<P ID=3>
Part 1.
</P>
As I am new to text extraction, I need help in;
  1. Writing a code to count the delimiters (</P>)
  2. Remove all punctuation
  3. Break the text into individual documents at each delimiter, knowing that ID=1 refers to document 1, ID=2 refers to document 2. etc

Best Answer

Just tried to make a script to do that. Here is the result (assuming the maximum ID = 10).
% Read your text file
fid = fopen('yourText.txt');
C = textscan(fid,'%s','TextType','string','Delimiter','\n','EndOfLine','\r\n');
C = C{1};
fclose(fid);
% 1. Count the delimiters '</P>'
idx = strfind(C,'</P>');
n = nnz(cellfun(@(x) ~isempty(x), idx));
% 2. Remove all punctuation
C2 = regexprep(C,'[.,!?:;]','');
% 3. Break the text into individual documents at each delimiter
idx2 = find(strcmp(C,'</P>'));
for kk = 1:10
str = ['<P ID=',num2str(kk),'>'];
idx_s = find(strcmp(C,str));
if ~isempty(idx_s)
idx_e = idx2(find(idx2>idx_s,1));
fileName = ['document',num2str(kk),'.txt'];
fid = fopen(fileName,'w');
fprintf(fid,'%s\r\n',C(idx_s:idx_e));
fclose(fid);
end
end