Fat Quantitation in Liver Biopsies Using a Pretrained Classification Based System

Non-Alcoholic Fatty Liver Disease (NAFLD) is a common syndrome that mainly leads to fat accumulation in the liver and to steatohepatitis. It is regarded as a severe medical condition, with a prevalence ranging from 20% to 40% in adult populations of the Western world. Its effect is identified through insulin resistance, which places patients at high risk of mortality. An increased rate of fat aggregation can dramatically accelerate the development of liver steatosis, which in later stages may advance to fibrosis and cirrhosis. In recent years, studies have focused on building methodologies capable of detecting fat cells, combining histology with digital image processing techniques. The current study expands previous work on the detection of fatty liver by additionally identifying a number of diverse histological findings. It combines image analysis with supervised learning of fat droplet features, with the specific goal of excluding other findings from the fat ratio calculation. The method is evaluated on a total set of 40 liver biopsy images at different magnifications and performs satisfactorily (1.95% absolute error).

Keywords-liver biopsy; steatohepatitis; fatty liver; machine learning; image analysis


I. INTRODUCTION
Fatty liver develops in 90% of the population with an alcohol consumption of over 60 g per day. However, a fatty liver may be further classified as non-alcoholic (NAFL) [1] when hepatic steatosis is present without evidence of hepatocellular damage, but with swollen hepatocytes and circular fat cells. NAFLD has been identified as a widespread disease, particularly among adults in Western countries, due to the growing prevalence of obesity and insulin resistance. The latter places patients with NAFLD at high risk of developing hepatitis C, which may lead to cirrhosis, a reduction of hepatocyte activity caused by extensive histological scar tissue. HCV (hepatitis C virus) has been documented as the leading cause of deaths from 2003 to 2012 in the USA [2]. In particular, hepatitis C can increase the incidence of liver cancer by 50%, indicating an irreversible pathological condition, compared with 15% for hepatitis B. Diagnosis of NAFLD requires a liver biopsy, which is obtained by needle insertion. Each biopsy specimen then undergoes histological examination for precise steatosis evaluation. In most cases, a single or two-part tissue sample is divided into formalin-fixed, Hematoxylin-Eosin (H&E) stained sections (Figure 1). A biopsy slide is then scanned under a microscope and a digitized image is extracted by the software. With this technique, each digitized sample can reveal several liver structures, including a) fat cells, b) ballooned cells, i.e., swollen, fat-wrapped hepatocytes, c) central veins, and d) sinusoids, which are responsible for mixing nutrient-rich blood from the portal vein with oxygen-rich blood from the hepatic artery.
Traditional NAFLD examination by a pathologist involves, in most cases, a lengthy and highly subjective visual interpretation of the biopsy slides through a microscope. As a result, intra-observer and inter-observer variability can cause a critical degree of inaccuracy in the quantification of steatosis among clinicians. Due to these limitations, researchers have undertaken the task of developing reliable automated software for the accurate assessment of steatosis prevalence, using both widespread and cutting-edge methods based on digital biopsy image processing. In the earliest years, simple stereological and morphometric techniques were introduced for the quantification of microvesicular and macrovesicular steatosis [3]. Subsequently, some studies identified significant quantitative deviations in the fat ratio between a computerized method and a pathologist's manual scaling [4]. The earliest image analysis methods posed a 2-class detection problem, separating tissue and fat pixels according to their intensity values [5]. A combination of fat cell separation with Body Mass Index (BMI) and the degree of insulin resistance has also been presented [6]. Some studies used diverse color spaces, either by replacing the H&E dye with toluidine-blue (TB-PAS) tissue coloring [7] or by utilizing different color spaces of the processed image [8]. More recently, the conversion of an RGB image to grayscale, in order to highlight the bright white objects of interest, has been attempted [9]. Another work focuses on the detection of large droplet macrovesicular steatosis (ld-MaS), using the calculation of the radial displacement of the hepatic nucleus [10]. One of the most extensive works identifies various histological findings using Machine Learning techniques [11], where a series of annotated histological features is included for the exclusion of non-fat shapes. Finally, two recent studies rely on the parameters of solidity, roughness and elongation [12] and on the circular shape of fat cells [13], in order to separate them from other regions of interest.

The current work introduces a complete methodology for image analysis of liver biopsies, aimed at steatosis assessment. The proposed method is based on image processing techniques, which are combined with clustering and classification algorithms. Apart from the classic segmentation of the image to detect histological findings, the method proposes a pre-trained classification step for their identification. The identification of four different findings and the accurate isolation of fat droplets lead to the elimination of many false positive outcomes. Specifically, the method defines a 4-class classification problem, focusing on the separation between four categories of findings (classes): a) fat cells, b) ballooning degeneration, c) sinusoids and d) veins. The proposed method improves the concept of histological finding classification and characterization by including ballooning degeneration as an additional finding class, in contrast with the method proposed in [11]. Ballooned cells, which in various cases appear as circular structures comparable to fat droplets, are now an interesting field of research according to pathologists. The classification stage could be incorporated into already developed fat detection methodologies, as the last identification stage before fat ratio extraction.

II. METHODS
The classification method consists of four essential steps:
• the implementation of a clustering-based algorithm to separate tissue from the background,
• an iterative morphological opening for circular structure detection and irregular object elimination,
• feature extraction of an annotated set of objects,
• training of a classifier, employing the extracted histological features driven by the annotated ones.
The trained system is able to assign each region of interest to one of four classes (fat droplets, ballooned cells, sinusoids or veins), so that the fat ratio is calculated from the identified fat cells only. Figure 2 presents a flowchart of the proposed method, embedded in a general methodology for fat identification and fat ratio quantification.

Fig. 2. Flowchart of the classification procedure.

A. Histological Image Segmentation
After a biopsy slide is scanned, a large part of the digital image usually represents white background and therefore contains high intensity values in the RGB color channels. In contrast, the H&E stained histological sample consists of low-intensity pixel values. Thus, by applying a clustering technique such as the K-means algorithm, hepatic tissue pixels can be separated from background pixels using only their color values as features. More specifically, during this step K-means divides the image elements into two clusters (K=2), providing a segmented binary image in which 1) white pixels are the background and 2) black pixels are the tissue. At convergence, the outer tissue boundaries are determined by the most remote cluster data points, whereas at each iteration the pixel segmentation is achieved according to the lowest square error criterion, expressed as:

$$E = \sum_{i=1}^{N_r} \sum_{j=1}^{N_c} \min_{k \in \{1,2\}} \lVert I(i,j) - c_k \rVert^2$$

where $N_r$ and $N_c$ denote the number of image rows and columns, respectively. At the same time, $I(i,j)$ indicates the feature vector of each pixel, with the following relation:

$$I(i,j) = [R(i,j),\ G(i,j),\ B(i,j)]$$

The centroids $c_k$ of the two clusters (k=1, 2) are updated in every iteration, until the convergence of K-means, via the following formula:

$$c_k = \frac{1}{|C_k|} \sum_{(i,j) \in C_k} I(i,j)$$

where $C_k$ is the set of pixels currently assigned to cluster k. This update is performed according to the pixel intensities already stored at each segmentation step. As a result, the centroid with the highest sum of intensities characterizes the background pixel cluster, while the other characterizes the tissue cluster. By default, the Euclidean metric is used to compute distances between cluster data points.
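As an illustration of this step, the following is a minimal, dependency-free sketch of a two-cluster K-means on RGB pixel values. It is not the authors' implementation: the function names, the initialisation with the darkest and brightest pixels, and the toy pixel values are assumptions made for the example. The brighter centroid is taken as background, following the rule stated above.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between a pixel and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_background_mask(pixels, max_iter=100):
    """Cluster RGB pixels with K=2 and flag the brighter cluster as background.

    Returns (is_background, centroids). Initialising the centroids with the
    darkest and brightest pixels is an assumption made for this sketch.
    """
    centroids = [tuple(map(float, min(pixels, key=sum))),
                 tuple(map(float, max(pixels, key=sum)))]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each pixel goes to its nearest centroid.
        new_labels = [
            0 if squared_distance(p, centroids[0]) <= squared_distance(p, centroids[1]) else 1
            for p in pixels
        ]
        if new_labels == labels:  # converged: assignments did not change
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(ch) / len(members) for ch in zip(*members))
    # The centroid with the highest intensity sum is treated as background.
    background = max((0, 1), key=lambda k: sum(centroids[k]))
    return [lab == background for lab in labels], centroids


# Toy data (hypothetical): three near-white pixels and three stained pixels.
pixels = [(250, 250, 250), (245, 248, 251), (180, 60, 120),
          (160, 50, 110), (255, 255, 255), (150, 40, 130)]
is_background, centroids = kmeans_background_mask(pixels)
```

On a real slide the pixel list would be the flattened image, and a vectorised library implementation would be preferable for images of this size.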

B. Object of Interest Detection
The next stage detects objects of interest in the previously segmented image. The algorithm aims to capture circular white regions of interest of various sizes. A recurrent morphological opening is applied to detect circular structures, which are probably fat droplets. Although morphological operators can locate objects of a particular structure, the size of fat droplets in a biopsy tends to vary (Figure 3). As a starting point, a circular mask with a minimum radius of 5 pixels is defined as the initial structuring element. The mask radius is then increased by 2 pixels after each loop of the morphological opening procedure. The morphological opening of the binary image by the circular mask produces the binary image $I_{morph}$ as:

$$I_{morph} = (I_{bin} \ominus M_{ci}) \oplus M_{ci}$$

where ⊖ and ⊕ are the morphological operators of erosion and dilation, respectively [14]. $I_{bin}$ denotes the original binary image and $M_{ci}$ the circular mask of the i-th iteration. The radius of $M_{ci}$ increases by 2 pixels in each iteration, to allow larger circular structures to be detected. Next, the current $I_{morph}$ image produced by the i-th iteration is fed into a logical OR. Through this control step, the loop terminates if no further changes are detected between the last two consecutive $I_{morph}$ images. Based on the above methodology, each region is then filtered according to its size, so that very small circular structures, along with very large ones, are discarded from subsequent calculations. Specifically, small regions (<5 pixels) are considered noise resulting from errors in the image capture process, whereas large regions (>1000 pixels) are associated with other large histological structures, such as central veins, portal veins and arteries, sinusoids and bile ducts.
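The iterative opening can be sketched as follows, with the binary image represented as a set of (row, column) coordinates of white candidate pixels. This is one possible reading of the procedure, not the authors' code: accumulating the openings with a logical OR and stopping when two consecutive openings are identical is an interpretation of the text, and the helper names and the `max_radius` safeguard are assumptions.

```python
def disk(radius):
    """Offsets (dy, dx) of a circular structuring element."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def erode(foreground, offsets):
    """Keep pixels where the structuring element fits entirely inside the foreground."""
    return {(y, x) for (y, x) in foreground
            if all((y + dy, x + dx) in foreground for dy, dx in offsets)}


def dilate(foreground, offsets):
    """Grow every foreground pixel by the structuring element."""
    return {(y + dy, x + dx) for (y, x) in foreground for dy, dx in offsets}


def iterative_opening(foreground, start_radius=5, step=2, max_radius=51):
    """Open with growing disks and OR the results together.

    The loop stops when two consecutive openings are identical
    (an assumed reading of the termination rule) or at max_radius.
    """
    combined = set()
    previous = None
    radius = start_radius
    while radius <= max_radius:
        mask = disk(radius)
        opened = dilate(erode(foreground, mask), mask)  # erosion, then dilation
        combined |= opened                              # logical OR of the openings
        if opened == previous:
            break
        previous = opened
        radius += step
    return combined


# Toy image (hypothetical): one round droplet and one thin, non-circular streak.
droplet = {(10 + dy, 10 + dx) for dy, dx in disk(6)}
streak = {(25, x) for x in range(21)}
detected = iterative_opening(droplet | streak)
```

In this toy case the streak is removed because no disk of radius 5 fits inside it, while the core of the droplet survives. The size filter described above (<5 and >1000 pixels) would then be applied to the connected regions of the result.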

C. Features Selection
To develop the pretrained system, a set of 13 biopsy samples with standard H&E histological staining, derived from NAFLD patients, has been annotated. The concept is to use the knowledge extracted from the annotation in order to train a classification-based system. Thus, several features have been extracted from the annotated regions. One part of this subset presents a high steatosis prevalence, while in several cases a high rate of hepatocyte ballooning degeneration is observed. The total number of findings in all 13 liver images is 7305. The dataset was collected from patient cases at St. Mary's Hospital, Imperial College Healthcare NHS Trust, London, UK. The samples were digitized with a Hamamatsu microscope, providing a magnification of ×40. Each of them contains hundreds of regions of interest, most of which are fat droplets and ballooned cells. Given this particular advantage of clinical specimens, an adequate set of training instances becomes available for annotation, which at a later stage can be fed into supervised machine learning classifiers.
As shown in Figure 4, fat areas, as well as ballooning degeneration, sinusoids and veins, have been manually labeled by specialist pathologists using Hamamatsu's NDP.View 2 annotation tool. For each of the annotated structures, an XML structured file is exported directly. Each record includes a unique ID and a title as a description. In addition, it contains the x and y points that correspond to the exact location of the freehand tool area in the original ×40 liver color image. To deal with this information, a parsing function has been developed that retrieves the different annotated findings from the XML files by recognizing the IDs, coordinates and color of the freehand contour (e.g., ff0000 for red, ffff00 for yellow) of all regions. According to the calculation of the histological features in Table I, an identification label is assigned to each region of the 4-class objects: a) fat → 1, b) ballooning → 2, c) sinusoid → 3 and d) vein → 4. The manual region selection is visualized using a grayscale image of the same size as the initial RGB color tissue sample. The four findings are marked with different gray levels, along with colored marker dots. Figure 5 presents the annotated regions as described above.
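A parsing function of this kind might look like the sketch below. The XML layout used here (`annotation` elements with `id`, `title` and `color` attributes and nested `point` elements) is a hypothetical stand-in and should not be read as the actual NDP.view 2 export schema. The assignment of colors to classes is likewise an assumption, since the text does not state which color denotes which finding. Only the class labels 1 to 4 are taken from the text.

```python
import xml.etree.ElementTree as ET

# Assumed color coding (hypothetical); labels follow the text:
# fat -> 1, ballooning -> 2, sinusoid -> 3, vein -> 4.
COLOR_TO_LABEL = {"ff0000": 1, "ffff00": 2, "00ff00": 3, "0000ff": 4}


def parse_annotations(xml_text):
    """Return one dict per annotated region: id, title, class label, contour."""
    regions = []
    root = ET.fromstring(xml_text)
    for ann in root.iter("annotation"):
        color = ann.get("color", "").lstrip("#").lower()
        label = COLOR_TO_LABEL.get(color)
        if label is None:
            continue  # skip contours whose color is not one of the four classes
        contour = [(float(p.get("x")), float(p.get("y"))) for p in ann.iter("point")]
        regions.append({"id": ann.get("id"), "title": ann.get("title"),
                        "label": label, "contour": contour})
    return regions


# Hypothetical example file content.
sample_xml = """
<annotations>
  <annotation id="1" title="fat" color="#ff0000">
    <point x="10" y="10"/><point x="20" y="10"/><point x="20" y="20"/>
  </annotation>
  <annotation id="2" title="ballooning" color="#FFFF00">
    <point x="40" y="40"/><point x="55" y="42"/><point x="50" y="60"/>
  </annotation>
  <annotation id="3" title="note" color="#123456">
    <point x="0" y="0"/>
  </annotation>
</annotations>
"""
regions = parse_annotations(sample_xml)
```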
As a final step of the annotation extraction, a list of features is computed for each distinct liver image, while at the same time a corresponding set of class objects from the entire subset of 13 biopsy images is gathered as the main training set. Thus, a large number of regions from all 13 annotated images has been extracted. A total count of a) 4023 fat droplets, b) 3064 cases of ballooning, c) 165 sinusoids and d) 53 hepatic veins was initially obtained, providing a sufficient training set. The created dataset is unbalanced, due to the varying frequency of the anatomical features, which makes the clinical classification problem harder to solve. Therefore, only a few hundred fat and ballooned cells, randomly selected from all 13 annotated images, are included.

To train a supervised algorithm, an informative feature set is extracted for each annotated region of interest. Details for the annotation and the dataset, which have been employed to provide knowledge, are presented in the next subsection. Each feature is selected in accordance with the characteristics of the histopathological findings. The features are grouped into five categories, to highlight the differences between the four classes (Table I). Shape is a key feature, providing valuable information about biopsy structures. Histological objects also carry other peculiarities that point to one of the four detection classes (fat droplets, ballooned cells, sinusoids, veins), including size, texture and pixel intensity. For instance, lipid droplets have a circular shape, with increased brightness and a smooth texture. In the case of ballooning degeneration, there is a significant distinction in the region texture, due to the interference of fat-swollen hepatocytes. Sinusoids, on the other hand, are described as non-circular, irregularly shaped objects. Finally, veins are distinguished by their large size and a slight deviation of the mean pixel intensity, due to the occurrence of a number of red blood cells.
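To make the role of the shape features concrete, the sketch below computes area, perimeter and circularity from a closed contour, such as the freehand annotation points. Since Table I is not reproduced here, the choice of these three descriptors and the definition of circularity as 4πA/P² are assumptions made for illustration. The function name is hypothetical.

```python
import math


def shape_features(contour):
    """Area (shoelace formula), perimeter and circularity of a closed polygon.

    Circularity is 4*pi*area / perimeter**2: 1.0 for a perfect circle,
    smaller for elongated or irregular shapes.
    """
    n = len(contour)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the contour
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    area = abs(twice_area) / 2.0
    circularity = 4.0 * math.pi * area / perimeter ** 2 if perimeter else 0.0
    return {"area": area, "perimeter": perimeter, "circularity": circularity}


# Toy contours (hypothetical): a square and a nearly circular 64-sided polygon.
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
round_contour = [(math.cos(2 * math.pi * t / 64), math.sin(2 * math.pi * t / 64))
                 for t in range(64)]
square_features = shape_features(square)
round_features = shape_features(round_contour)
```

A droplet-like contour yields a circularity close to 1, whereas an irregular sinusoid-like contour yields a clearly lower value. This is the kind of separation the shape category is meant to provide.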

D. Classification Training
The training set is employed in algorithm learning, where the computed feature set provides the input to several supervised classifiers, including the Naive Bayes classifier, k Nearest Neighbors, a Decision Tree (using the C4.5 algorithm) and Support Vector Machines. The Naive Bayes (NB) algorithm [15] belongs to a family of probabilistic classifiers based on Bayes' theorem. Its task is to assign an unknown liver sample to the class with the greatest posterior probability. Here, each of the extracted histological features is considered independent [16]. In the k Nearest Neighbors (k-NN) algorithm [17], the classification model decides the class of a new sample by taking into account the majority class of its k nearest neighbors. The Euclidean distance is used to determine the members of its neighborhood. With the C4.5 decision tree induction algorithm, a histological object is identified through a repetitive branching process [15]. Given an unknown input sample, the process starts from the tree root and proceeds through the internal nodes until a terminal leaf yields a decision value. During induction, when a branch presents high homogeneity, a leaf node is added for the corresponding class. In Support Vector Machines (SVM) [18], the overall process involves determining the optimal separating hyperplanes, whereby a histological sample is classified with a maximum margin when its predicted value agrees with the annotation label.
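As an example of one of these classifiers, the k-NN decision rule described above can be sketched in a few lines. The two-dimensional feature vectors (circularity and a normalised area) and their values are hypothetical toy data, and the function name is an assumption. In practice the full feature set of Table I would be used, with features scaled to comparable ranges before computing distances.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, sample, k=3):
    """Assign the majority label among the k nearest training samples.

    Nearness is measured with the Euclidean distance.
    """
    neighbours = sorted(
        (math.dist(features, sample), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: (circularity, normalised area).
# Labels follow the text: 1 = fat droplet, 4 = vein.
train_features = [(0.95, 0.10), (0.92, 0.15), (0.90, 0.12),
                  (0.60, 0.90), (0.55, 0.85), (0.65, 0.95)]
train_labels = [1, 1, 1, 4, 4, 4]

droplet_like = knn_predict(train_features, train_labels, (0.93, 0.11))
vein_like = knn_predict(train_features, train_labels, (0.58, 0.88))
```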

III. RESULTS AND DISCUSSION
For the evaluation of the proposed method, a second subset of 27 H&E images is employed for testing and diagnostic purposes, this time at a lower magnification of ×20. Pathologists have provided the fat ratio for these biopsies. For all magnification settings in this study, the captured biopsy images originally exceeded 10,000×10,000 pixels in size. First, the visualization results of the method are examined. In Figure 6, the detected circular structures are marked with a green contour after the biopsy image processing. As seen in the histological sample MS12-23945, all cases of hepatocyte ballooning, along with other structures such as the elongated liver vein (arrow), have been successfully excluded during the initial segmentation stage. However, a weakness of the method is observed in the MS14-9711 sample, in which the high steatosis prevalence (29.8%), in combination with the lower ×20 magnification, results in an agglomeration of some adjacent circular regions (red area). Such regions exhibit a feature set that does not match the size, circularity and eccentricity of a typical fat droplet, so they are excluded at the classification stage, resulting in an underestimation of the total fat ratio. However, it should be noted that this phenomenon diminishes as the magnification of an image increases, since the computer vision algorithm then discriminates better among densely occurring fat droplets. In other words, the outer boundaries of the steatosis structures become more evident, so that fewer agglomerated regions are formed.
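The quantities compared in this evaluation are the fat ratio, i.e., the cumulative area of fat structures divided by the whole tissue area, and its absolute difference from the pathologists' estimate. The sketch below shows this arithmetic. The function names and all numerical values are hypothetical and are not taken from Table II.

```python
def fat_ratio(fat_areas, tissue_area):
    """Fat ratio in percent: cumulative fat area over the whole tissue area."""
    return 100.0 * sum(fat_areas) / tissue_area


def absolute_error(s_annot, s_estimated):
    """Absolute difference between the annotated and the estimated fat ratio."""
    return abs(s_annot - s_estimated)


# Hypothetical example, areas in pixels.
tissue_area = 2000
classified_fat = [120, 80, 100]            # regions kept as fat after classification
segmented_objects = classified_fat + [50]  # segmentation alone also keeps a non-fat object

s_annot = 14.0                                      # pathologists' estimate (made up)
s_class = fat_ratio(classified_fat, tissue_area)    # with the classification stage
s_seg = fat_ratio(segmented_objects, tissue_area)   # segmentation only

error_with_classification = absolute_error(s_annot, s_class)
error_without_classification = absolute_error(s_annot, s_seg)
```

In this made-up case, excluding the non-fat object lowers the estimated fat ratio and brings it closer to the annotated value, which is the effect the classification stage is intended to have.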
After the exclusion of non-fat objects, the fat ratio is computed for each of the 27 testing images by dividing the cumulative area of fat structures by the whole tissue area. This value is derived both from the segmentation method and from a semi-quantitative annotation procedure by St. Mary's pathologists. The difference between these percentages gives the absolute error for each biopsy image and each classification algorithm, as follows:

$$A_{error(1)} = |S_{annot} - S_{class}|$$

where $S_{annot}$ indicates the fat ratio estimated from the St. Mary's pathologists' annotations, while $S_{class}$ is the percentage of liver fat based on the automated detection including the classification stage. In addition, it is possible to compare the absolute error with and without the classification stage, and to measure the improvement in fat quantitation, by providing a second value of absolute error:

$$A_{error(2)} = |S_{annot} - S_{I}|$$

where $S_{I}$ is the fat ratio after the 2 stages of image segmentation, without performing the classification. Table II shows that, in all 27 histological samples, the classification stage yields lower fat ratio values than the segmentation. The four classifiers have reduced the mean fat ratio by up to 1.5%, in comparison with 17.3%. This is because the false positive ballooning, sinusoid and vein findings were excluded from all steatosis calculations. The third main column shows the fat percentage values according to the semi-quantitative estimates of specialist pathologists. The improvement in the diagnostic accuracy of the method is assessed first through the mean absolute error between the doctors and the diagnosis coming from each trained classifier ($A_{error(1)}$), which is then compared with the mean absolute error between the doctors' estimation and the segmentation method without the classification stage ($A_{error(2)}$). According to the results, the trained classifiers present diagnostic reliability in the majority of the ×20 testing samples, reducing the absolute error by up to 0.5% compared with the 2.4% error of segmentation alone. Specifically, a total mean absolute error of 1.95% is achieved using the pre-trained classification step algorithms. The percentage error results are also presented as bars in Figure 7, emphasizing the reliability of the classifiers proposed for the methodology compared with the segmentation stage prior to the identification of hepatic structures. Due to the agglomeration phenomenon described above, and based on Table II, the method shows a reduction in performance in 6 images (MS14-10783, MS14-1559, MS14-2449, MS14-786, MS14-8355 and MS15-1128). However, in most cases (21 biopsies) it achieves an optimal performance. Given this observation, the current outcomes are sufficient for pathologists in comparison with semi-quantitative methods, and are in line with the fat region classification performance of a previous study [19].

IV. CONCLUSIONS
In this study, a pretrained method for various annotated liver tissue findings is presented, employing image processing and machine learning techniques. According to the overall performance, it is concluded that the classification stage improves the reliability of the results for a number of biopsies. The main advantage of the method lies in the fact that the biopsy structures are discerned by a series of histological features, including shape, size, pixel intensity and texture. On this basis, it is shown that the trained classifiers can detect the main differences between circular ballooned cells and fat droplets in NAFLD patients. At present, ballooning degeneration is at the heart of clinical interest among research pathologists, as its assessment emerges as a critical factor in chronic liver diseases. Thus, the proposed work could also be considered a follow-up of a previous study [20] on ballooned cell identification and ballooning degeneration quantification.

Fig. 6. Visualization results of the proposed method