MATLAB: Delete rows - Tanimoto matrix

delete rows, matrix, tanimoto

Hi everyone,
With help from some of you, we wrote two scripts that delete rows from a Tanimoto matrix which we create from RDKit fingerprints.
The idea is to delete a row (= molecule) that is similar to another molecule if:
1. the Tanimoto index of the pair is > 0.7 but less than 1, and
2. the sum of all Tanimoto indices in its row is larger than the row sum of the molecule it is compared with.
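To make the two conditions concrete, here is a minimal sketch on a made-up 3-molecule matrix (the values are invented for illustration and are not taken from test2.csv). It assumes the matrix is symmetric and uses the same `>= 0.7` and `< 1.0` bounds as the scripts in this post.

```matlab
% Toy 3-molecule Tanimoto matrix (made-up values, not from test2.csv)
M = [1.00 0.80 0.20;
     0.80 1.00 0.75;
     0.20 0.75 1.00];
S = sum(M, 2);                      % row sums: 2.00, 2.55, 1.95

U = triu(M, 1);                     % strict upper triangle, so each pair appears once
[r, c] = find(U >= 0.7 & U < 1.0);  % similar pairs: (1,2) and (2,3)

for k = 1:numel(r)
    if S(r(k)) > S(c(k))
        loser = r(k);               % first molecule has the larger row sum
    else
        loser = c(k);               % otherwise the second molecule is dropped
    end
    fprintf('pair (%d,%d): drop molecule %d\n', r(k), c(k), loser);
end
```

In this toy case molecule 2 has the largest row sum, so it is the one dropped in both pairs.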
  • In the first solution we used a for loop with a vector called "good_ones". Rows that get a zero are deleted, and we don't compare them again:
[M, text, alldata] = xlsread('test2.csv');
[r, c] = size(M);
S = sum(M, 2);                      % row sums
good_ones = ones(1, size(M, 1));    % 1 = keep the row, 0 = delete it
% loop over rows
for row = 1:r
    for col = row+1:c
        if good_ones(row) == 1
            % if the value is between 0.7 and 1, compare the row sums
            if (M(row,col) >= 0.7) && (M(row,col) < 1.0)
                if S(row) > S(col)  % the sum of line i is larger, so we delete this line
                    good_ones(row) = 0;
                else                % otherwise we delete the other line (j)
                    good_ones(col) = 0;
                end
            end
        end
    end
end
% the lines marked above are deleted afterwards
new_M = M(find(good_ones), :);
  • In the second script we used the vectorized way. In this solution, however, rows that we have already compared and marked for deletion are compared again with other molecules, and in this way we lose more molecules. So the question is: how can I add a condition here, as I did in the first script, so that only rows that have not been marked before are checked?
An example file of the matrix is attached.
[M, text, alldata] = xlsread('test2.csv.csv'); % M is the numeric data
% text is the text :-)
% alldata is both M + text
[i, j] = size(M);
S = sum(M, 2);                            % row sums
triM = tril(M);                           % lower triangle (including the diagonal)
[r, c] = find(triM >= 0.7 & triM < 1.0);  % find positions of all values in range
deleterow = S(r) > S(c);                  % compare row r with the respective row c
% a true in deleterow means delete the respective r, otherwise delete the respective c
todelete = unique([r(deleterow); c(~deleterow)]);
nM = M;
nM(todelete, :) = [];
text2 = text(2:i+1, 1);
text2(todelete, :) = [];
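One possible way to add the "don't compare marked rows again" condition is to keep the vectorized `find` for locating the candidate pairs, and then walk through those pairs once with a keep mask. The following is only a sketch under assumptions: M is square and symmetric, the toy values are made up (not from test2.csv), and the names `pairs` and `keep` are hypothetical. The pairs are sorted so that they are visited in the same row-by-row order as the nested loops of the first script, and only the row index is checked, as in that script.

```matlab
% Toy 4-molecule matrix (made-up values, not from test2.csv)
M = [1.00 0.80 0.20 0.10;
     0.80 1.00 0.75 0.00;
     0.20 0.75 1.00 0.65;
     0.10 0.00 0.65 1.00];
S = sum(M, 2);                        % row sums

U = triu(M, 1);                       % strict upper triangle: each pair once
[r, c] = find(U >= 0.7 & U < 1.0);    % candidate pairs, found in one vectorized step
pairs = sortrows([r c]);              % row-major order, as in the nested loops

keep = true(size(M, 1), 1);           % true = row is still kept
for k = 1:size(pairs, 1)
    a = pairs(k, 1);
    b = pairs(k, 2);
    if ~keep(a)
        continue                      % row a is already marked: do not compare it again
    end
    if S(a) > S(b)
        keep(a) = false;              % a has the larger row sum
    else
        keep(b) = false;              % otherwise b is marked
    end
end

nM = M(keep, :);                      % rows that survive
disp(find(keep).')
```

On this toy matrix rows 1, 3 and 4 survive. Row 2 is marked by the pair (1,2), so the pair (2,3) is skipped, whereas the `unique(...)` version above would delete row 3 as well. The same `keep` mask could be applied to the row labels in `text2`.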

Best Answer

"...but if you have time i would like to see your solution :)"
Well, I've not tested this extensively, but it does reproduce your script results for the sample dataset.
Presuming
a) S is the column vector of row sums, and
b) M is the triangular form without the diagonal
>> M=M>0.7; % logical array
>> R=squareform(pdist(S,@(x,y) rdivide(x,y)))>1; % sum ratios to match
>> good = ~any(M&R,2) & ~any(M&~R,1).';
>> all(good==Good)
ans =
1
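
For readers without `pdist` and `squareform` (both are part of the Statistics and Machine Learning Toolbox), the same kind of mask can be sketched with `bsxfun` alone. This is an illustration under assumptions, not a tested replacement for the lines above: it takes the strict upper triangle of M, keeps the `< 1.0` bound from the question, uses made-up toy values, and the names `hit` and `bigger` are hypothetical.

```matlab
% Toy symmetric similarity matrix (made-up values, not from test2.csv)
M = [1.00 0.80 0.20;
     0.80 1.00 0.75;
     0.20 0.75 1.00];
S = sum(M, 2);                        % row sums as a column vector

U = triu(M, 1);                       % strict upper triangle, so each pair appears once
hit = U >= 0.7 & U < 1.0;             % pairs inside the similarity range
bigger = bsxfun(@gt, S, S.');         % bigger(i,j) is true when S(i) > S(j)

% Row i is dropped when it has the larger sum in some pair (i,j);
% column j is dropped when row i did not have the larger sum.
good = ~any(hit & bigger, 2) & ~any(hit & ~bigger, 1).';
disp(good.')
```

On this toy matrix only molecule 2 is dropped, so `good` keeps rows 1 and 3. Like the vectorized script in the question, this compares every pair in range and does not skip rows that are already marked.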