Prospective grading of graft-versus-host disease after unrelated donor marrow transplantation: a grading algorithm versus blinded expert panel review.
In conjunction with a randomized trial of T-cell depletion versus conventional graft-versus-host disease (GVHD) prophylaxis, we assessed GVHD grading by comparing three sources: the transplant center (TC) 100-day score, a grade calculated by a clinical algorithm, and a blinded expert panel review (PR). Weekly skin, gut, and liver clinical staging; clinically verified differential diagnosis; biopsy information; cyclosporine levels; and initiation of treatment were reviewed and graded according to the consensus GVHD grading method, modified by a prospectively determined grading algorithm that specified liver and gut downstaging if a differential diagnosis in that organ was identified. The TC determination of maximum grade was compared with the algorithm-calculated grade and the final expert PR. Of 404 patients reviewed, the TC grade concurred with the algorithm-calculated grade in 72% (the algorithm upgraded 18% and downgraded 10%), whereas the TC grade agreed with the PR in 77% (the PR upgraded 12% and downgraded 11%). The algorithm-calculated grade was 92% concordant with the final PR grade (the PR upgraded 0.7% and downgraded 7%). Blinded, duplicate reviews for quality control (n = 108) agreed with the initial review in 89% of cases. Algorithm and/or PR review reduced the TC-reported incidence of grade II (28% to 23%) and increased grade III (11% to 20%), whereas grade 0 (41% to 42%), grade I (13% to 12%), and grade IV (7% to 6%) were essentially unchanged. Recalculation of the algorithm grading without differential diagnosis downstaging reduced agreement with the TC to a small extent. The original algorithm reclassified 51 (13%) of 404 patients across the boundary between grades 0 through II and grades III or IV, in either direction; calculation without the downstaging reclassified 44 patients (11%). Maximum acute GVHD grade had a major effect on 2-year disease-free survival, but whether the grade was assigned by the TC, the calculated algorithm, or the final PR had little effect on survival within grades or within the grade categories 0 through II versus III or IV.
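The agreement figures above (percent concordant, upgraded, and downgraded, plus the number of patients crossing the boundary between grades 0 through II and grades III or IV) can be tabulated from paired per-patient grades. The following is a minimal sketch of that bookkeeping, not the study's actual analysis code. The function names and the ten-patient data set are hypothetical and chosen only for illustration; the final two lines check the rounding of the percentages reported in the abstract.

```python
from collections import Counter


def compare_grades(reference, other):
    """Fractions of patients where `other` agrees with, upgrades, or
    downgrades the `reference` grade (grades are integers 0 to 4)."""
    if len(reference) != len(other) or not reference:
        raise ValueError("grade lists must be non-empty and of equal length")
    counts = Counter()
    for ref, oth in zip(reference, other):
        if oth == ref:
            counts["agree"] += 1
        elif oth > ref:
            counts["upgrade"] += 1
        else:
            counts["downgrade"] += 1
    n = len(reference)
    return {key: counts[key] / n for key in ("agree", "upgrade", "downgrade")}


def crosses_severity_boundary(grade_a, grade_b):
    """True when one grade is 0 to II and the other is III or IV."""
    return (grade_a >= 3) != (grade_b >= 3)


# Hypothetical paired grades for ten patients (illustration only).
tc_grades = [0, 1, 2, 2, 3, 4, 0, 2, 3, 1]
algorithm_grades = [0, 1, 3, 2, 3, 4, 0, 1, 3, 2]

summary = compare_grades(tc_grades, algorithm_grades)
n_crossing = sum(
    crosses_severity_boundary(a, b) for a, b in zip(tc_grades, algorithm_grades)
)

# Rounding check on the counts reported in the abstract.
pct_with_downstaging = round(100 * 51 / 404)
pct_without_downstaging = round(100 * 44 / 404)
```

Treating the TC grade as the reference matches the way the abstract phrases its comparisons ("the algorithm upgraded 18% and downgraded 10%"); swapping the arguments reverses the upgrade and downgrade fractions.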
We conclude that detailed expert PR yields GVHD scoring that is internally consistent and reproducible, with 89% concordance on duplicate review. Weekly recording of GVHD stage, combined with a calculated grading algorithm that acknowledges differential diagnoses, yields a final maximum grade that is 92% concordant with the blinded expert PR. In multicenter prospective studies, GVHD can be reliably graded by a clinically relevant algorithm that uses all available weekly staging and differential diagnosis data. This approach can reduce investigator bias, facilitate comparison between centers, and perhaps eliminate the need for an expert PR. This technique should be used in future prospective studies of GVHD prophylaxis.
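The calculation described above (weekly organ stages, liver and gut downstaging when a differential diagnosis is present, and a maximum grade over the observation period) might be sketched as follows. This is an illustration under stated assumptions, not the study's algorithm. The stage-to-grade thresholds follow one common summary of the consensus criteria and should be checked against the published table. The abstract does not say by how much an organ is downstaged, so the one-stage reduction here is an assumption exposed as a parameter. All names (`WeeklyAssessment`, `overall_grade`, `maximum_grade`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WeeklyAssessment:
    """One week of organ staging (hypothetical record layout)."""
    skin: int    # stage 0 to 4
    liver: int   # stage 0 to 4
    gut: int     # stage 0 to 4
    liver_differential: bool = False  # other liver diagnosis identified
    gut_differential: bool = False    # other gut diagnosis identified


def overall_grade(skin, liver, gut):
    """Map organ stages to an overall grade 0 to 4.

    Thresholds are an assumed summary of the consensus criteria.
    """
    if skin == 4 or liver == 4:
        return 4
    if liver >= 2 or gut >= 2:
        return 3
    if skin == 3 or liver == 1 or gut == 1:
        return 2
    if skin >= 1:
        return 1
    return 0


def maximum_grade(weeks, downstage_by=1):
    """Highest weekly grade after downstaging liver and gut.

    `downstage_by` is an assumption: the abstract gives no amount.
    Pass 0 to grade without differential diagnosis downstaging.
    """
    best = 0
    for week in weeks:
        liver = week.liver
        gut = week.gut
        if week.liver_differential:
            liver = max(0, liver - downstage_by)
        if week.gut_differential:
            gut = max(0, gut - downstage_by)
        best = max(best, overall_grade(week.skin, liver, gut))
    return best


# Hypothetical patient: liver stage 2 in week 2 with a competing diagnosis.
weeks = [
    WeeklyAssessment(skin=2, liver=0, gut=0),
    WeeklyAssessment(skin=3, liver=2, gut=1, liver_differential=True),
]
grade_with_downstaging = maximum_grade(weeks)
grade_without_downstaging = maximum_grade(weeks, downstage_by=0)
```

In this hypothetical patient the competing liver diagnosis lowers the maximum grade from III to II, which is the kind of boundary crossing the abstract counts when it compares the algorithm with and without downstaging.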