MATLAB: Excel table data analysis

data processing and analysis, excel

Hi guys

My data are in an Excel table, 20000 by 3. I have a list of 358 patients. There are 48 different bacteria that I am looking for. Each patient is tested to see if a bacterium is found in their blood. The day each test is conducted is recorded. For each patient I would like to find:

1) When was the first day the patient tested positive, regardless of bacteria type? (ex: the first time patient A was positive is day 2)

2) A list of all the bacteria that have been found for each individual patient. (ex: Patient A shows bacteria BAC1, BAC2, BAC3)

3) When was the first time each patient tested positive for each type of bacteria? (ex: the first time BAC1 was found for patient A is day 4)

4) On which days was each patient exhibiting each bacterium? (ex: BAC1 was found for patient A on days 300 and 4)

The table is in the following format:

  patient   day they were tested   bacteria found
  A         300                    BAC1
  A         2                      BAC2
  A         4                      BAC1
  A         8                      BAC3
  B         66                     BAC5
  B         55                     BAC1
  C         208                    BAC2
  C         77                     BAC2
  C         51                     BAC9
  D         90                     NAN

I have been struggling with this for a while. I would appreciate any input.

Please let me know if clarifications are needed. Thanks so much for the help in advance!

Best Answer

It's not trivial to learn the tricks, granted...but here's a start for the first...with it and some study of the examples, should get an idea -- it's late; I've got to turn in at this point, though..sorry :)

tBAC=readtable('bacteria.dat','ReadVariableNames',1);     % read the data in

tBAC.patient=categorical(tBAC.patient);                   % fix data types

tBAC.found=categorical(tBAC.found);
[ig,Patient]=findgroups(tBAC.patient);                    % group index, group names

FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig);  % find first infection any kind
FirstInfection(cellfun(@(x) length(x)==0,FirstInfection))={inf};                  % clean up missing (no infection)
[table(Patient) cell2table(FirstInfection)]                                       % display result in a table form

This gives the following result for the first problem statement. I chose "inf" as the indicator of no infection instead of NaN. The decimals show up because I'd been doing financial work and had format bank in effect.

ans =
  4×2 table
    Patient    FirstInfection
    _______    ______________
    A           2.00         
    B          55.00         
    C          51.00         
    D            Inf         
>>

I slightly modified your data to a file with the header line "patient, testday, found" and made a csv-file of your plain text...I presume you have a file format of your own.

Modifications to the anonymous function should suffice for the remainder, I think, although 3 and 4 need grouping by both patient and bacteria ID.

ADDENDUM:

Added the identifier for which bacteria ID was the first. Having the standalone function means can clean up the return data there instead of afterwards--so there is some payback for the extra code. :)

NB: I got the return arguments from min in wrong order last night; the index is the optional second, not the first. This produces the amplified table:

tBAC=readtable('bacteria.dat','ReadVariableNames',1);     % read the data in
tBAC.patient=categorical(tBAC.patient);                   % fix data types
tBAC.found=categorical(tBAC.found);
% 1.  First occurrence of any in each patient
[ig,Patient]=findgroups(tBAC.patient);                    % group index, group names
%FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig);  % find first infection any kind
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig);
table(Patient,FirstInfection,Infection)
% 2.  All occurrences in each patient
[AllInfections]=splitapply(@allinfections,tBAC.found,ig);
[table(Patient) cell2table(AllInfections)]
% 3.  First occurrence of each bacterium in each patient
[ig2,Patient,Bacterium]=findgroups(tBAC.patient,tBAC.found);   % group by patient AND bacterium
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig2);
table(Patient,Bacterium,FirstInfection,Infection)
% 4.  Occurrences of each bacterium in each patient
%  EXERCISE FOR STUDENT  :)
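function [tFirst,bFirst]=firstinfected(t,r)
  % return first time, infection
  keep=r~='NAN';                 % positive tests only
  t=t(keep); r=r(keep);          % filter both so the index from min lines up with r
  [tFirst,iFirst]=min(t);
  if isempty(tFirst)
    tFirst=nan;
    bFirst='NAN';
  else
    bFirst=r(iFirst);
  end
end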
function [b]=allinfections(r)
  % return all infections for each
  b={unique(r(r~='NAN')).'};
  if isempty(b)
    b='NAN';
  end
end
>> table(Patient,FirstInfection,Infection)
ans =
  4×3 table
    Patient    FirstInfection    Infection
    _______    ______________    _________
    A           2.00             BAC2     
    B          55.00             BAC1     
    C          51.00             BAC9     
    D            NaN             NAN      
ans =
  4×2 table
    Patient      AllInfections  
    _______    _________________
    A          [1×3 categorical]
    B          [1×2 categorical]
    C          [1×2 categorical]
    D          [1×0 categorical]
ans =
  8×4 table
    Patient    Bacterium    FirstInfection    Infection
    _______    _________    ______________    _________
    A          BAC1          4.00             BAC1     
    A          BAC2          2.00             BAC2     
    A          BAC3          8.00             BAC3     
    B          BAC1         55.00             BAC1     
    B          BAC5         66.00             BAC5     
    C          BAC2         77.00             BAC2     
    C          BAC9         51.00             BAC9     
    D          NAN            NaN             NAN      
>>

Unfortunately, the builtin table display function won't show the actual categorical variable values for each patient, since the arrays are not all the same length. A table has to be regular in the number of variables for each row/observation, so you can't create multiple variables without a lot of ugly NAN values scattered around.
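One possible workaround for the display issue, offered only as a sketch: join each patient's bacteria into a single piece of text, so every row holds one char vector and the table prints the names themselves. The data is rebuilt inline from the question's sample, and the 'none' label and the joinNames helper are my own choices, not part of the answer above.

```matlab
% Inline copy of the question's sample data (variable names follow the answer)
patient = categorical({'A';'A';'A';'A';'B';'B';'C';'C';'C';'D'});
found   = categorical({'BAC1';'BAC2';'BAC1';'BAC3';'BAC5';'BAC1';'BAC2';'BAC2';'BAC9';'NAN'});
[ig, Patient] = findgroups(patient);

% One text entry per patient: the unique bacteria joined by spaces
joinNames = @(r) {strjoin(cellstr(unique(r(r ~= 'NAN'))).', ' ')};
AllInfections = splitapply(joinNames, found, ig);

% Patients with no positive test end up with empty text; label them (assumed label)
AllInfections(cellfun(@isempty, AllInfections)) = {'none'};
table(Patient, AllInfections)
```

The cost is that the bacteria are now text rather than categorical values, so this is for reporting, not for further grouping.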

The really cute part is that the firstinfected function works for any chosen grouping, so you don't have to do anything except use the other grouping variables. You could choose not to populate the table with the second return, or not to use the second ID variable, since they are the same...

Now, your mission, should you choose to accept it, is last item, #4... :)
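For anyone who gets stuck on #4, here is a minimal sketch of one way it might go, following the same findgroups/splitapply pattern. It assumes the variable names used above (patient, testday, found), builds the question's sample data inline instead of reading a file, and the names tPos, Days, DaysText and tDays are made up for illustration.

```matlab
% Sample data from the question, built inline so the sketch runs on its own
patient = categorical({'A';'A';'A';'A';'B';'B';'C';'C';'C';'D'});
testday = [300; 2; 4; 8; 66; 55; 208; 77; 51; 90];
found   = categorical({'BAC1';'BAC2';'BAC1';'BAC3';'BAC5';'BAC1';'BAC2';'BAC2';'BAC9';'NAN'});
tBAC = table(patient, testday, found);

% Drop the rows with no bacterium found, then group by patient AND bacterium
tPos = tBAC(tBAC.found ~= 'NAN', :);
[ig, Patient, Bacterium] = findgroups(tPos.patient, tPos.found);

% Groups differ in length, so return the sorted days of each group in a cell
Days = splitapply(@(t) {sort(t).'}, tPos.testday, ig);

% Turn each list of days into text so the table display shows the values
DaysText = cellfun(@(d) strtrim(sprintf('%d ', d)), Days, 'UniformOutput', false);
tDays = table(Patient, Bacterium, DaysText)
```

Patients with no positive test (D here) simply do not appear in the result; keep the numeric lists in Days if you need to compute with them later.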

Related Solutions

MATLAB: REPEATED MEASURES ANOVA MATLAB

Is repeated measures anova the best way to go?

Your design is certainly applicable to a repeated measures design but the answer to that question depends on what you're testing.

It looks like you want to perform a repeated measures ANOVA with a covariate. In addition to patient number, the treatment type is the covariate, and the 6 time points are the within-subject repeated measure.

There are 4 null hypotheses to look at:

Patient number has no effect on the population mean
Treatment type has no effect on the population mean
There is no interaction between patient number and treatment type (usually the most important question)
There is no 3-way interaction between patient number, treatment type, and treatment time.

Setting up a RM Anova with a covariate in Matlab

Your data are attached in a file named Q2unlocked.xlsx. The following lines set up a RM Anova with a covariate (written in R2019b).

% Read in the table
Q2 = readtable('Q2unlocked.xlsx');
% Replace the 6 measurement headers
Q2.Properties.VariableNames(3:8) = {'t1', 't2', 't3', 't4', 't5', 't6'}; 
% The within-subjects design
withinDsgn = table((1:6)','VariableNames',{'Time'});  
% Run the RM Anova (note the interaction between PATIENT and TREAT!)
rm = fitrm(Q2, 't1-t6~PATIENT*TREAT', 'WithinDesign', withinDsgn);

Also see a similar example provided by Matlab.

Testing for assumptions

Before we look at the results, you must make sure your data are appropriate for a RM Anova by confirming the following assumptions.

Independent observations: You collected the data randomly and without bias
Normality: The measurements have an approximately normal distribution
Sphericity: we'll use the Mauchly test.

% Are the data approximately normal? This should ideally be done for each factor
histogram(reshape(Q2{:,3:end},[],1))

A small rightward tail but not too bad.

% Does the data pass the sphericity test?
rm.mauchly
ans = 1×4 table
       W       ChiStat    DF    pValue 
    _______    _______    __    _______

    0.48842    11.537     14    0.64344

Yes: the pValue of 0.643 is above 0.05, so sphericity is not rejected. A pValue < 0.05 would fail.

https://www.mathworks.com/help/stats/mauchlys-test-of-sphericity.html
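If the Mauchly test had failed, the usual move is to read the corrected p-values instead of the plain one. The sketch below shows how that check might be scripted. Since Q2unlocked.xlsx is not available here, it uses the fisheriris sample data that ships with the Statistics and Machine Learning Toolbox as a stand-in, so rmIris, within, sphericity and epsTbl are illustrative names and the numbers will not match the ones in this answer.

```matlab
% Stand-in data: Fisher's iris set from the Statistics and Machine Learning Toolbox
load fisheriris
t = table(species, meas(:,1), meas(:,2), meas(:,3), meas(:,4), ...
    'VariableNames', {'species','meas1','meas2','meas3','meas4'});
within = table((1:4)', 'VariableNames', {'Measurements'});
rmIris = fitrm(t, 'meas1-meas4~species', 'WithinDesign', within);

sphericity = mauchly(rmIris);   % table with W, ChiStat, DF, pValue
epsTbl = epsilon(rmIris);       % epsilon corrections: GreenhouseGeisser, HuynhFeldt, LowerBound

% Decide which p-value column of the ranova table to read
if sphericity.pValue < 0.05
    disp('Sphericity rejected: read pValueGG, pValueHF or pValueLB from ranova.')
else
    disp('Sphericity not rejected: the uncorrected pValue from ranova can be used.')
end
```

The epsilon values are the factors by which the degrees of freedom are scaled to produce those corrected p-values.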

RM Anova results

ranova(rm) or rm.ranova produces the results table showing the p-value and then 3 adjusted p-values (Greenhouse-Geisser, Huynh-Feldt, and lower bound), which correct for violations of the sphericity (compound symmetry) assumption.

% Produce ranova table
rm.ranova
ans = 5×8 table
                            SumSq       DF      MeanSq        F        pValue     pValueGG    pValueHF    pValueLB
                          __________    __    __________    ______    ________    ________    ________    ________

    (Intercept):Time      3.9654e+06     5    7.9308e+05    2.2798    0.053221    0.071668    0.053221    0.14843 
    PATIENT:Time          4.0894e+06     5    8.1788e+05    2.3511    0.047012    0.064753    0.047012    0.14259 
    TREAT:Time            3.2657e+06     5    6.5314e+05    1.8775     0.10606     0.12652     0.10606    0.18747 
    PATIENT:TREAT:Time    3.9524e+06     5    7.9048e+05    2.2723    0.053916    0.072433    0.053916    0.14905 
    Error(Time)           3.1309e+07    90    3.4788e+05

Row 1 represents all differences across the within-subjects factor (treatment times). It has a p-value of 0.053, which is just above the conventional threshold of 0.05. A boxplot will show us if this value seems reasonable.

figure()
boxplot(Q2{:,3:end})
xlabel('Treatment times')
ylabel('measurement value')
title('Data pooled between all Patients and treatment factors')

It doesn't look like there's much of an effect of treatment time when factors are combined

Row 2 shows the interaction between Patient and treatment times and is p=0.047

bpdata = [];  
for i = 1:max(Q2.PATIENT) %assuming patient numbers are 1:max
    bpdata = [bpdata, Q2{Q2.PATIENT==i,3:8},nan(size(unique(Q2.TREAT)))]; 
end
figure()
boxplot(bpdata)
arrayfun(@xline,7:7:size(bpdata,2))
xlabel('6 treatment times across 11 patients')
ylabel('measurement value')
title('Data pooled between treatment factors')
set(gca,'XTick', [])

Clearly the difference between the 6 treatment times differs across subjects (Treatment type combined)

Row 3 shows the interaction between treatment type and treatment times and is p=0.106.

bpdata = [];  
for i = 1:max(Q2.TREAT) %assuming TREAT numbers are 1:max
    bpdata = [bpdata, Q2{Q2.TREAT==i,3:8},nan(size(unique(Q2.PATIENT)))]; 
end
figure()
boxplot(bpdata)
xline(7)
xlabel('6 treatment times across 2 TREAT factors')
ylabel('measurement value')
title('Data pooled between patients')
set(gca,'XTick', [])

Other than the 2nd treatment factor there is no difference between the 2 Treatment types.

Row 4 shows the 3-way interaction between patient, treatment type, and treatment times.

MATLAB: Why am I getting ‘ugly’ output

At this point you should consider creating a small GUI for your input selection. But for your question at hand: you forgot to align the start of the rows, so you missed the fact that they were actually shifted. The smart indent button in the editor and a few spaces get you this:

combinations_available=input(['Enter the number associated with the combination of salts available. Enter 0 if none of the combinations are available.'...
    '\n1:Two of these chlorides        1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl(sodium chloride)   5) KCl(potassium chloride) '...
    '\n2:ZnBr2(zinc bromide)        &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride   4) NaCl(sodium chloride)   5) KCl(potassium chloride)'...
    '\n3:NH4Cl(ammonium chloride)   &  1) CH3CO2K(potassium acetat)   2) ZnSO4(zink sulphate)       3) HCOONa(Sodium formate)   4) HCOOK(potassium formate)'...
    '\n4:ZnBr2(zinc bromide)        &  1) CH3CO2K(potassium acetate)  2) ZnSO4(zink sulphate)       3) HCOONa(sodium formate)   4) HCOOK(potassium formate)'...
    '\n5:CH3CO2K(potassium acetate) &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl(sodium chloride)  5) KCl(potassium chloride)'...
    '\n6:CH3CO2K(potassium acetate) &  1) ZnSO4(zinc sulphate)        2) HCOONa(sodium formate)     3) HCOOK(potassium formate)'...
    '\n7:ZnSO4(zinc sulphate)       &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl(sodium chloride)  5) KCl(potassium chloride)'...
    '\n8:ZnSO4(zinc sulphate)       &  1) HCOONa(sodium formate)      2) HCOOK(potassium formate)'...
    '\n9:HCOONa(sodium formate)     &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl(sodium chloride)  5) KCl,(potassium chloride)'...
    '\n10:HCOOK(potassium formate)  &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl                   5) KCl'...
    '\n11:HCOONa                    &  1) HCOOK. ']);

With the result in the command prompt:

Enter the number associated with the combination of salts available. Enter 0 if none of the combinations are available.
1:Two of these chlorides        1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl(sodium chloride)   5) KCl(potassium chloride) 
2:ZnBr2(zinc bromide)        &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride   4) NaCl(sodium chloride)   5) KCl(potassium chloride)
3:NH4Cl(ammonium chloride)   &  1) CH3CO2K(potassium acetat)   2) ZnSO4(zink sulphate)       3) HCOONa(Sodium formate)   4) HCOOK(potassium formate)
4:ZnBr2(zinc bromide)        &  1) CH3CO2K(potassium acetate)  2) ZnSO4(zink sulphate)       3) HCOONa(sodium formate)   4) HCOOK(potassium formate)
5:CH3CO2K(potassium acetate) &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl(sodium chloride)  5) KCl(potassium chloride)
6:CH3CO2K(potassium acetate) &  1) ZnSO4(zinc sulphate)        2) HCOONa(sodium formate)     3) HCOOK(potassium formate)
7:ZnSO4(zinc sulphate)       &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl(sodium chloride)  5) KCl(potassium chloride)
8:ZnSO4(zinc sulphate)       &  1) HCOONa(sodium formate)      2) HCOOK(potassium formate)
9:HCOONa(sodium formate)     &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl(sodium chloride)  5) KCl,(potassium chloride)
10:HCOOK(potassium formate)  &  1) NH4Cl(ammonium chloride)    2) MgCl2(magnesium chloride)  3) CaCl2(calcium chloride)  4) NaCl                   5) KCl
11:HCOONa                    &  1) HCOOK.
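If counting spaces by hand gets tedious, the padding could be computed instead. This is only a sketch on a shortened, hypothetical version of the menu: it keeps the entries in a cell array and lets pad and join (available since R2016b) line up the columns. The names menu, rows and promptText are mine.

```matlab
% Hypothetical, shortened menu; column 1 is the row label, the rest are the options
menu = { ...
    '1:Two of these chlorides',  '1) NH4Cl(ammonium chloride)',   '2) MgCl2(magnesium chloride)'; ...
    '2:ZnBr2(zinc bromide) &',   '1) NH4Cl(ammonium chloride)',   '2) MgCl2(magnesium chloride)'; ...
    '4:ZnBr2(zinc bromide) &',   '1) CH3CO2K(potassium acetate)', '2) ZnSO4(zinc sulphate)'; ...
    '11:HCOONa &',               '1) HCOOK',                      ''};

% pad() right-pads every entry of a column to the width of its longest entry
for c = 1:size(menu, 2)
    menu(:, c) = pad(menu(:, c));
end

% Join the columns of each row with two spaces, then the rows with newlines
rows = join(string(menu), "  ", 2);
promptText = char(join(rows, newline));
disp(promptText)
% combinations_available = input([promptText newline]);   % interactive, so left commented out
```

Adding or renaming a salt then never disturbs the alignment, because the column widths are recomputed each time.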
