MATLAB: Excel table data analysis

data processing and analysisexcel

Hi guys
My data is (are) in an excel table 20000 by 3. I have a list of 358 patients. There are 48 different bacteria that I am looking for. each patent is tested to see if a bacteria is found in their blood. The day each test is conducted is recorded.
1) when was the first day the patient tested positive (regradless of bacteria type) (ex:first time patient A was positive is day 2)
2) List of all the bacterias that have been found for each individual patient.( ex: Patient A shows bactesias:BAC1 BAC2 BAC3)
3) When was the first time each patient tested possitive for each type of bacteria.(ex: First time BAC1 was found is day 4)
4) In what days was each patient exsibiting each bacteria (ex: BAC 1 was forund for patient A in days 300, 4)
The table is in the following format:
patient day they were tested bacteria found
A 300 BAC1
A 2 BAC2
A 4 BAC1
A 8 BAC3
B 66 BAC5
B 55 BAC1
C 208 BAC2
C 77 BAC2
C 51 BAC9
D 90 NAN
I have been struggling with this for a while. I would appriciate any input.
Please let me know if clarifications are needed. Thanks so much for the help in advance!

Best Answer

It's not trivial to learn the tricks, granted...but here's a start for the first...with it and some study of the examples, should get an idea -- it's late; I've got to turn in at this point, though..sorry :)
tBAC=readtable('bacteria.dat','ReadVariableNames',1); % read the data in

tBAC.patient=categorical(tBAC.patient); % fix data types

tBAC.found=categorical(tBAC.found);
[ig,Patient]=findgroups(tBAC.patient); % group index, group names

FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig); % find first infection any kind
FirstInfection(cellfun(@(x) length(x)==0,FirstInfection))={inf}; % clean up missing (no infection)
[table(Patient) cell2table(FirstInfection)] % display result in a table form
results in for the first problem statement...I chose "inf' as the indicator of no infection instead of NaN -- the decimals show up because I'd been doing financial work and had format bank in effect.
ans =
4×2 table
Patient FirstInfection
_______ ______________
A 2.00
B 55.00
C 51.00
D Inf
>>
I slightly modified your data to a file with the header line "patient, testday, found" and made a csv-file of your plain text...I presume you have a file format of your own.
Modifications to the functional for the remainder should suffice I think...altho 3 and 4 need grouping by both patient and bacteria ID.
ADDENDUM:
Added the identifier for which bacteria ID was the first. Having the standalone function means can clean up the return data there instead of afterwards--so there is some payback for the extra code. :)
NB: I got the return arguments from min in wrong order last night; the index is the optional second, not the first. This produces the amplified table:
tBAC=readtable('bacteria.dat','ReadVariableNames',1); % read the data in
tBAC.patient=categorical(tBAC.patient); % fix data types
tBAC.found=categorical(tBAC.found);
% 1. First occurrence of any in each patient
[ig,Patient]=findgroups(tBAC.patient); % group index, group names
%FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig); % find first infection any kind
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig);
table(Patient,FirstInfection,Infection)
% 2. All occurrences in each patient
[AllInfections]=splitapply(@allinfections,tBAC.found,ig);
[table(Patient) cell2table(AllInfections)]
% 3. First occurrence of each bacterium in each patient
[ig,Patient,Bacterium]=findgroups(tBAC.patient,tBAC.found);
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig2);
table(Patient,Bacterium,FirstInfection,Infection)
% 4. Ooccurrences of each bacterium in each patient
% EXERCISE FOR STUDENT :)
function [tFirst,bFirst]=firstinfected(t,r)
% return first time, infection
[tFirst,iFirst]=min(t(r~='NAN'));
if isempty(tFirst)
tFirst=nan;
bFirst='NAN';
else
bFirst=r(iFirst);
end
end
function [b]=allinfections(r)
% return all infections for each
b={unique(r(r~='NAN')).'};
if isempty(b)
b='NAN';
end
end
>> table(Patient,FirstInfection,Infection)
ans =
4×3 table
Patient FirstInfection Infection
_______ ______________ _________
A 2.00 BAC2
B 55.00 BAC1
C 51.00 BAC9
D NaN NAN
ans =
4×2 table
Patient AllInfections
_______ _________________
A [1×3 categorical]
B [1×2 categorical]
C [1×2 categorical]
D [1×0 categorical]
ans =
8×4 table
Patient Bacterium FirstInfection Infection
_______ _________ ______________ _________
A BAC1 4.00 BAC1
A BAC2 2.00 BAC2
A BAC3 8.00 BAC3
B BAC1 55.00 BAC1
B BAC5 66.00 BAC5
C BAC2 77.00 BAC2
C BAC9 51.00 BAC9
D NAN NaN NAN
>>
Unfortunately, the builtin table display function won't show the actual categorical variable values for each patient since they're not the same length of each array--and a table has to be regular in number of variable sfor each row/observation so can't create multiple variables without a lot of ugly NAN values scattered around.
The really cute part is the firstinfection function works for any chosen grouping so that don't have to do anything except use the other grouping variables. You could choose to not populate the tble with the second return or not use the second ID variable since they are the same...
Now, your mission, should you choose to accept it, is last item, #4... :)
Related Question