Federated Learning Category - MarkTechPost
https://www.marktechpost.com/category/tech-news/federated-learning/

Meet FedTabDiff: An Innovative Federated Diffusion-based Generative AI Model Tailored for the High-Quality Synthesis of Mixed-Type Tabular Data

One of the central difficulties in generating realistic tabular data is maintaining privacy, especially in sensitive domains like finance and healthcare. As the volume of data and the importance of data analysis grow across fields, privacy concerns increasingly lead to hesitancy in deploying AI models, which makes privacy protection all the more important. Preserving privacy in the financial field is further complicated by mixed attribute types, implicit relationships, and distribution imbalances in real-world datasets.

Researchers from the University of St. Gallen (Switzerland), Deutsche Bundesbank (Germany), and the International Computer Science Institute (USA) have introduced FedTabDiff, a method that generates high-fidelity mixed-type tabular data without centralized access to the original datasets, ensuring privacy and compliance with regulations such as the EU's General Data Protection Regulation and the California Privacy Rights Act.

Traditional methods like anonymization and the elimination of sensitive attributes offer little real protection in high-stakes domains. FedTabDiff instead relies on synthetic data, which is generated through a generative process that captures the inherent properties of real data. The researchers leverage Denoising Diffusion Probabilistic Models (DDPMs), which have been highly successful at generating synthetic images, and apply them in a federated setting for tabular data generation.

FedTabDiff incorporates DDPMs into a federated learning framework, allowing multiple entities to collaboratively train a generative model while respecting data privacy and locality. The DDPMs use a Gaussian diffusion model, employing a forward process that incrementally perturbs data with Gaussian noise and a learned reverse process that restores it. The federated learning component relies on a synchronous update scheme with weighted averaging for model aggregation: the federated optimization computes a weighted average over decentralized model updates, which drives the collaborative learning process. The architecture of FedTabDiff includes a central FinDiff model maintained by a trusted entity and decentralized FinDiff models contributed by individual clients. To evaluate the model, the researchers used standard metrics of fidelity, utility, privacy, and coverage.
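To make the mechanics concrete, here is a minimal sketch of the Gaussian forward (noising) process at the heart of a DDPM, assuming a standard linear noise schedule. The function names, the schedule, and the toy tensor shapes are illustrative assumptions, not details drawn from the FedTabDiff paper.

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    This is the standard DDPM forward process; the schedule is an assumption.
    """
    alphas = 1.0 - betas
    alpha_bar_t = np.prod(alphas[: t + 1])        # cumulative product up to step t
    noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

betas = np.linspace(1e-4, 0.02, 1000)             # assumed linear noise schedule
x0 = np.random.randn(8, 16)                       # a toy batch of encoded tabular rows
xt = forward_diffuse(x0, t=500, betas=betas)      # heavily perturbed sample at t=500
```

The reverse process is a neural network trained to undo this noising step by step; in the federated setting it is that denoiser's parameters, not the data, that are exchanged and averaged across clients.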

FedTabDiff shows strong performance on both financial and medical datasets, proving its effectiveness in diverse scenarios. Compared to non-federated FinDiff models, it performs better on all four metrics. The approach balances privacy protection against fidelity, keeping deviations from the original data under control without letting the synthetic data become unrealistic. FedTabDiff's effectiveness is demonstrated through empirical evaluations on real-world datasets, showcasing its potential for responsible, privacy-preserving AI applications in domains like finance and healthcare.


Check out the Paper. All credit for this research goes to the researchers of this project.

This Artificial Intelligence Paper Presents an Advanced Method for Differential Privacy in Image Recognition with Better Accuracy

Machine learning has advanced considerably in several areas thanks to its performance in recent years. The computing capacity of modern computers and graphics cards has allowed deep learning to achieve results that sometimes exceed those of human experts. However, its use in sensitive areas such as medicine or finance raises confidentiality issues. Differential privacy (DP) is a formal privacy guarantee that prevents adversaries with access to machine learning models from obtaining information about specific training points. The most common training approach for differential privacy in image recognition is differentially private stochastic gradient descent (DPSGD). However, the deployment of differential privacy is limited by the performance deterioration caused by current DPSGD systems.

Existing methods for differentially private deep learning still fall short because, during the stochastic gradient descent process, they accept all model updates regardless of whether the corresponding objective function value improves. In some updates, adding noise to the gradients can worsen the objective function value, especially when convergence is imminent. The resulting models suffer as a consequence: the optimization target degrades, and the privacy budget is wasted. To address this problem, a research team from Shanghai University in China suggests a simulated annealing-based differentially private stochastic gradient descent (SA-DPSGD) approach that accepts a candidate update with a probability that depends on the quality of the update and the number of iterations.

Concretely, a model update is accepted if it yields a better objective function value; otherwise, it is rejected with a certain probability. To prevent settling into a local optimum, the authors use probabilistic rather than deterministic rejections and limit the number of consecutive rejections. The simulated annealing algorithm thus selects model updates probabilistically during the stochastic gradient descent process.

The following gives a high-level explanation of the proposed approach; a code sketch of the acceptance rule follows the list.

1- DPSGD generates updates iteratively, and the objective function value is computed after each one. The energy change from the previous iteration to the current one and the total number of accepted solutions are then used to calculate the acceptance probability of the current solution.

2- When the energy change is negative, the acceptance probability is kept at 1, so updates that step in the right direction are always accepted. Because the model updates are noisy, the true energy change may actually be positive in rare cases, but the training is nevertheless guaranteed to move mostly in the direction of convergence.

3- When the energy change is positive, the acceptance probability falls exponentially as the number of accepted solutions rises. Accepting such a solution would make the energy worse, but deterministic rejections can trap the final solution in a local optimum. The authors therefore proposed accepting updates with positive energy changes with a small, decreasing probability.

4- The number of consecutive rejections is limited, so an update is still accepted after too many rejections in a row. As training approaches convergence, the acceptance probability may drop so low that almost all solutions with positive energy changes are rejected, and the optimization may get stuck at a local optimum. Limiting the number of rejections prevents this by accepting a solution when it becomes essential.
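The acceptance rule described in steps 1-4 can be sketched as follows. This is a hedged illustration: the exact annealing schedule, temperature scaling, and rejection cap are our assumptions, and the paper's precise formulas may differ.

```python
import math
import random

def accept_update(delta_energy, num_accepted, consecutive_rejections,
                  max_rejections=5, temperature=1.0):
    """Simulated-annealing acceptance test for a noisy DPSGD update (illustrative)."""
    if delta_energy < 0:
        return True                      # step 2: objective improved, always accept
    if consecutive_rejections >= max_rejections:
        return True                      # step 4: cap on continuous rejections
    # Step 3: a worse objective is accepted with a probability that decays
    # exponentially as the number of accepted solutions grows.
    prob = math.exp(-delta_energy * num_accepted / temperature)
    return random.random() < prob
```

In a training loop, `delta_energy` would be the change in the (noisy) objective between the candidate model and the current one; rejected candidates are discarded and the previous weights retained.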

To evaluate the performance of the proposed method, SA-DPSGD was tested on three datasets: MNIST, FashionMNIST, and CIFAR-10. Experiments demonstrated that SA-DPSGD significantly outperforms the state-of-the-art schemes DPSGD, DPSGD(tanh), and DPSGD(AUTO-S) regarding privacy cost or test accuracy.

According to the authors, SA-DPSGD significantly bridges the classification accuracy gap between private and non-private models. Thanks to the random update screening, the differentially private gradient descent proceeds in the right direction at each iteration, making the obtained result more accurate. Under the same hyperparameters, SA-DPSGD achieves high accuracies on MNIST, FashionMNIST, and CIFAR-10 compared to the state-of-the-art results, and under freely adjusted hyperparameters the proposed approach achieves even higher accuracies.


Check out the Paper. All credit for this research goes to the researchers of this project.

University of Michigan Researchers Open-Source ‘FedScale’: a Federated Learning (FL) Benchmarking Suite with Realistic Datasets and a Scalable Runtime to Enable Reproducible FL Research on Privacy-Preserving Machine Learning

Federated learning (FL) is an emerging machine learning (ML) setting in which a logically centralized coordinator orchestrates numerous dispersed clients (e.g., smartphones or laptops) to collectively train or evaluate a model. It enables model training and evaluation on end-user data while avoiding the significant costs and privacy hazards of collecting raw data from clients, with applications spanning a wide range of ML tasks. Existing work has focused on improving critical aspects of FL in the context of varied client device execution speeds and non-IID data distributions.

A thorough benchmark for evaluating an FL solution must study its behavior in a practical FL scenario with (1) data heterogeneity and (2) device heterogeneity, under (3) heterogeneous connectivity and (4) availability conditions, at (5) many scales, on a (6) wide range of ML tasks. While the first two elements are frequently cited in the literature, real network connections and client device availability can affect both forms of heterogeneity and impede model convergence. Similarly, large-scale evaluation can reveal an algorithm's resilience, since actual FL deployments frequently involve thousands of concurrent participants out of millions of clients.

Overlooking just one component can skew the FL assessment. Regrettably, established FL benchmarks frequently fall short across numerous dimensions. For starters, they have restricted data flexibility for many real-world FL applications. Even when they include many datasets and FL training objectives (e.g., LEAF), their datasets frequently comprise synthetically created partitions derived from conventional datasets (e.g., CIFAR) and do not represent realistic characteristics. This is because these benchmarks are primarily based on classic ML benchmarks (e.g., MLPerf) or are built for simulated FL systems such as TensorFlow Federated or PySyft.

Second, existing benchmarks frequently ignore system performance, connectivity, and client availability (e.g., FedML and Flower). This prevents FL efforts from accounting for system efficiency, resulting in unduly optimistic statistical performance. Third, their datasets are predominantly small-scale because their experimental setups cannot simulate large-scale FL deployments. While real FL frequently involves thousands of participants in each training round, most available benchmarking platforms can only train tens of participants per round.

Finally, most of them lack user-friendly APIs for automatic integration, necessitating significant technical work for large-scale benchmarking. To facilitate complete and consistent FL assessments, the researchers present FedScale, an FL benchmark and supporting runtime:

  • FedScale, to the best of the authors' knowledge, has the most comprehensive collection of FL datasets for examining various elements of practical FL deployments. It presently has 20 real FL datasets of small, medium, and large sizes for a wide range of task categories, including image classification, object detection, word prediction, speech recognition, and reinforcement learning.

  • FedScale Runtime standardizes and simplifies FL assessment under more realistic conditions. It includes a mobile backend for on-device FL evaluation and a cluster backend for benchmarking various practical FL metrics (for example, actual client round length) on GPUs/CPUs using accurate FL statistical and system information. The cluster backend can efficiently train thousands of clients per round on a small number of GPUs. FedScale Runtime is also extensible, allowing quick implementation of new algorithms and ideas through flexible APIs.

The researchers conducted systematic tests to demonstrate how FedScale enables thorough FL benchmarking and to highlight the critical need to co-optimize system and statistical efficiency, particularly when dealing with system stragglers, accuracy bias, and device energy trade-offs.
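To illustrate what such a runtime must model, here is a toy round simulator capturing per-client availability and heterogeneous system speed. This is not FedScale's actual API; every name and number below is an assumption for illustration.

```python
import random

def run_round(clients, participants_per_round, deadline_s):
    """Simulate one FL round with client availability and straggler dropout."""
    online = [c for c in clients if random.random() < c["availability"]]
    selected = random.sample(online, min(participants_per_round, len(online)))
    finished = [c for c in selected
                if c["train_time_s"] + c["upload_time_s"] <= deadline_s]
    return finished, len(selected) - len(finished)

clients = [{"availability": random.uniform(0.3, 0.9),   # assumed distributions
            "train_time_s": random.uniform(5.0, 60.0),
            "upload_time_s": random.uniform(1.0, 20.0)}
           for _ in range(10_000)]
finished, stragglers = run_round(clients, participants_per_round=1000, deadline_s=45)
print(f"{len(finished)} clients finished in time; {stragglers} stragglers dropped")
```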

FedScale (fedscale.ai) provides high-level APIs for implementing FL algorithms and for deploying and evaluating them at scale across various hardware and software backends. FedScale also features the most comprehensive FL benchmark, including FL tasks ranging from image classification and object detection to language modeling and speech recognition. Furthermore, it delivers datasets that realistically simulate the FL training scenarios where FL will be applied in practice. Best of all, it is open source, with the code freely available on GitHub.

This article is written as a summary by MarkTechPost staff based on the research paper 'FedScale: Benchmarking Model and System Performance of Federated Learning at Scale'. All credit for this research goes to the researchers of this project. Check out the paper, the GitHub link, and the reference article.

Google AI and Tel Aviv Researchers Introduce FriendlyCore: A Machine Learning Framework For Computing Differentially Private Aggregations

Data analysis revolves around the central goal of aggregating metrics. When the data points correspond to personally identifiable information, such as the records or activities of specific users, the aggregation should be conducted while preserving privacy. Differential privacy (DP) restricts each data point's impact on the outcome of the computation and has therefore become the most widely accepted approach to individual privacy.

Although differentially private algorithms are theoretically possible, they are typically less efficient and accurate in practice than their non-private counterparts. In particular, differential privacy is a worst-case requirement: it mandates that the privacy guarantee hold for any two neighboring datasets, regardless of how they were constructed, even if they are not sampled from any distribution. This leads to a significant loss of accuracy, because "unlikely points" that have a major impact on the aggregation must be accounted for in the privacy analysis.

Recent research by Google and Tel Aviv University provides a generic framework for preprocessing the data to ensure its friendliness. When the data is known to be "friendly," the private aggregation stage can be carried out without considering potentially influential "unfriendly" elements. Because the aggregation stage is no longer constrained to the original worst-case setting, the proposed method can significantly reduce the amount of noise introduced at this stage.

Initially, the researchers formally define the conditions under which a dataset can be considered friendly. These conditions vary depending on the type of aggregation required, but they always cover datasets for which the sensitivity of the aggregate is low. For instance, if the aggregate is the average, "friendly" should cover compact datasets, where replacing a single point moves the average only slightly.

The team developed the FriendlyCore filter that reliably extracts a sizable friendly subset (the core) from the input. The algorithm is designed to meet a pair of criteria: 

  1. It must eliminate outliers to retain only elements close to many others in the core.
  2. For nearby datasets that differ by a single element y, the filter outputs all elements except y with almost the same probability. Cores derived from such nearby datasets can then be combined in the aggregation step.

The team then created a friendly DP aggregation algorithm that, by introducing less noise into the total, satisfies a less stringent notion of privacy. By applying such a friendly DP aggregation method to the core generated by a filter satisfying the aforementioned conditions, the team proved that the resulting composition is differentially private in the conventional sense. This aggregation approach extends to further uses, such as clustering and estimating the covariance matrix of a Gaussian distribution.
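The two-stage idea can be sketched for the mean-estimation case. The filtering threshold and noise calibration below are simplified assumptions, not the paper's exact mechanism, and a faithful implementation would also have to account for the privacy cost of the filtering step itself.

```python
import numpy as np

def friendly_core(points, radius, min_neighbors):
    """Keep only points close to many others (a crude stand-in for the filter)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbor_counts = (dists <= radius).sum(axis=1) - 1   # exclude self
    return points[neighbor_counts >= min_neighbors]

def dp_mean_on_core(core, radius, epsilon):
    """DP mean with noise scaled to the core's small diameter, not the whole space."""
    # Inside the core, swapping one point moves the mean by at most ~2*radius/n,
    # so far less noise is needed than under a worst-case sensitivity bound.
    sensitivity = 2.0 * radius / max(len(core), 1)
    noise = np.random.laplace(scale=sensitivity / epsilon, size=core.shape[1])
    return core.mean(axis=0) + noise

points = np.random.randn(800, 2)                # mirrors the 800-sample experiment
core = friendly_core(points, radius=3.0, min_neighbors=400)
estimate = dp_mean_on_core(core, radius=3.0, epsilon=1.0)
```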

The researchers tested the efficacy of the FriendlyCore-based algorithms under the zero-Concentrated Differential Privacy (zCDP) model, putting them through their paces on 800 samples drawn from a Gaussian distribution with an unknown mean. As a benchmark, they compared against the CoinPress algorithm. In contrast to FriendlyCore, CoinPress requires an upper bound R on the norm of the mean. The proposed method is independent of the upper-bound and dimension parameters and hence outperforms CoinPress.

The team also evaluated the efficacy of their k-means clustering method by comparing it to LSH clustering, a recursive locality-sensitive-hashing technique, with each experiment repeated 30 times. For small values of n (the number of samples from the mixture), FriendlyCore frequently fails and produces inaccurate results. Yet as n grows, the proposed technique becomes more likely to succeed (as the generated tuples get closer to each other) and produces very accurate results, while LSH clustering falls behind. FriendlyCore also performs well on huge datasets, even without a clear separation into clusters.


Check out the Paper and reference article. All credit for this research goes to the researchers of this project.

In A New AI Research, Federated Learning Enables Big Data For Rare Cancer Boundary Detection

The number of primary observations produced by healthcare systems has dramatically increased due to recent technological developments and a shift in patient culture from reactive to proactive. Clinical professionals may become burned out since such observations need careful evaluation. There have been several attempts to develop, assess, and ultimately translate machine learning (ML) technologies into clinical settings to address this issue and lessen the load on clinical professionals by identifying pertinent links among these observations. In particular, deep learning (DL) has made strides in ML and has shown promise in tackling these challenging healthcare issues.

According to the literature, robust and accurate models must be trained on huge quantities of data, whose variety affects how well the model generalizes to "out-of-sample" situations, i.e., data from sources that did not take part in model training. To overcome this, models must be trained on data from different sites representing various demographic samples. The current paradigm for such multi-site collaboration is "centralized learning" (CL), in which data from several locations are pooled in a single place after inter-site agreements.

Due to privacy, data ownership, intellectual property, technological difficulties (such as network and storage restrictions), and compliance with various governmental laws, data centralization is difficult to scale (and may not even be practicable), particularly at a worldwide level. In contrast to the centralized paradigm, "federated learning" (FL) refers to a paradigm in which models are trained by exchanging only model parameter updates computed on decentralized data (i.e., each site stores its data locally).
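As a minimal sketch of the paradigm, assuming a placeholder for each site's local training, one federated round looks like this: parameter updates travel to the aggregation server while the data never does.

```python
import numpy as np

def local_train(global_weights, local_data, lr=0.01):
    """Placeholder for a site's local training pass (not code from the study)."""
    pretend_gradient = np.random.randn(*global_weights.shape)
    return global_weights - lr * pretend_gradient

def federated_round(global_weights, sites):
    """One synchronous round: each site trains locally; the server averages updates."""
    updates, sizes = [], []
    for site in sites:                    # raw mpMRI data stays at the site
        updates.append(local_train(global_weights, site["data"]))
        sizes.append(len(site["data"]))
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(updates, sizes))

sites = [{"data": [0] * n} for n in (120, 340, 90)]   # three toy hospital sites
weights = np.zeros(16)
for _ in range(5):                                    # five federated rounds
    weights = federated_round(weights, sites)
```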

FL can thus provide an alternative to CL, possibly leading to a paradigm shift that reduces the need for data sharing, increases access to geographically dispersed collaborators, and thereby expands the volume and variety of data used to train ML models. By allowing ML models to learn from a wealth of data that would otherwise be unavailable, FL can also help address health inequities and the needs of underserved communities. In light of this, the study concentrates on the "rare" disease of glioblastoma, emphasizing how multi-parametric magnetic resonance imaging (mpMRI) scans may be used to determine the extent of the disease.

Although glioblastoma is the most prevalent malignant primary brain tumor, its incidence rate (roughly 3 per 100,000 individuals) falls well below the threshold that defines a rare disease (10 per 100,000 people), so it is still categorized as "rare." Collaboration between geographically disparate sites is required because no single site can amass a dataset large and varied enough to train reliable and generalizable ML models. Despite significant attempts to improve the prognosis of these patients with rigorous multimodal therapy, the median overall survival of glioblastoma patients following standard-of-care treatment is only 14.6 months, and their median survival without treatment is only four months. Despite advancements in glioblastoma subtyping and the expansion of standard-of-care treatment choices over the past 20 years, overall survival has not significantly increased.

This reflects the need to analyze bigger and more diverse data to better understand the illness and its inherent heterogeneity, which is the main challenge in treating these tumors. Radiologically, glioblastomas have three main sub-compartments:

  1. The “enhancing tumor” (ET) represents the breakdown of the blood-brain barrier within the tumor.
  2. The "tumor core" (TC), which combines the ET and the necrotic (NCR) part and represents the surgically relevant portion of the tumor.
  3. The "whole tumor" (WT), which encompasses the TC along with the peritumoral infiltrated tissue.

Identifying the borders of these sub-compartments is crucial to better quantifying and evaluating this uncommon disease and, eventually, informing clinical decision-making. The results of these investigations confirmed the advantages of the FL process, which was based on an aggregation server and achieved performance nearly equal to CL for this use case. Framing the task as a multi-parametric, multi-class learning problem is central to this result.

As opposed to merely transcribing a categorical entry from medical records, this study dealt with a multi-parametric, multi-class challenge whose reference standards require professional clinicians to follow a careful manual annotation protocol. Additionally, because of differences in scanner technology and acquisition techniques, consistent preprocessing pipelines were created at each participating site to manage the varying characteristics of the mpMRI data. These elements, together with the study's extensive global scope and task difficulty, set it apart.

The main scientific contributions of this manuscript are (i) demonstrating the effectiveness of FL at such scale and task complexity as a paradigm-shifting approach; (ii) making a potential impact on the treatment of the rare disease of glioblastoma by publicly releasing clinically deployable trained consensus models; and, most importantly, (iii) paving the way for more FL studies of even greater scale and task complexity. Data and code are available on GitHub.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

IOM Releases Its Second Synthetic Dataset From Trafficking Victim Case Records Generated With Differential Privacy And AI From Microsoft

Researchers at Microsoft are committed to researching ways technology may help the world’s most marginalized peoples improve their human rights situations. Their expertise spans human-computer interaction, data science, and the social sciences. The research team collaborates with community, governmental, and nongovernmental groups to develop available technologies that allow scalable answers to such issues.

International Organization for Migration (IOM) is a United Nations agency that helps migrants and survivors of human trafficking. By offering assistance to governments and migrants in its 175 member nations, IOM strives to promote humanitarian and orderly migration.

Using software built by Microsoft researchers, IOM has released its second synthetic dataset derived from case records of victims of trafficking. It is the first public dataset to depict victim-perpetrator interactions, and the first to be developed with differential privacy, which offers an extra security assurance for repeated data releases while facilitating data sharing and rigorous research that respects privacy and civil liberties. The release is the result of years of cooperation between Microsoft and IOM aimed at the secure sharing of victim case information in ways that can inform collaborative action within the anti-trafficking community. The Counter-Trafficking Data Collaborative (CTDC) data hub is the first worldwide gateway for human trafficking case data, and the collaboration was motivated by a shared commitment to improving that hub's security and usefulness. Since then, IOM and Microsoft have worked together to enhance the use of information on victims and survivors, including their descriptions of traffickers, in the fight against human trafficking.

This work has produced a new user interface, offered as a public web application, that allows users to aggregate and synthesize private data without any of it ever leaving the user's local web browser.

Importance of data privacy while working with vulnerable populations 

All precautions must be taken to prevent traffickers from identifying victims of trafficking in published databases, and people's personal information must be kept confidential to avoid further traumatization or social exclusion. At the same time, a privacy approach that over- or under-reports a given trend in victim cases might mislead decision-makers into allocating limited resources improperly, preventing them from addressing the underlying problem.

IOM and Microsoft's collaboration was founded on the idea that, rather than redacting sensitive data to achieve privacy, it is possible to produce synthetic datasets that properly capture the structure and statistics of the underlying sensitive information while remaining private by design. In light of this guiding principle, and the necessity of providing case-count breakdowns by various attribute combinations (e.g., age range, gender, nationality), a method was developed whereby synthetic data matching all short combinations of case attributes is released alongside privacy-preserving counts of cases. The compiled information is therefore useful both for assessing the quality of the synthetic data and for recovering precise numbers for official reporting.

Datasets aggregated in this way maintain the same level of privacy, since differentially private data has the property that further processing cannot increase the privacy loss. This allowed the team to adapt their preexisting data-synthesis method, which synthesizes records by sampling sets of attributes until all attributes are covered, to extrapolate the noisy reported attribute combinations into complete, differentially private synthetic records. The result is accurate aggregate data for official reporting, synthetic data for engaging exploration and machine learning, and differential privacy assurances that hold even over multiple overlapping data releases, all of which are essential for IOM and similar organizations to establish a strong data ecosystem against human trafficking and other human rights violations.
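A minimal sketch of the aggregation side of this idea, assuming a toy schema and a per-query Laplace mechanism: counts for every short attribute combination are released with noise. Splitting the overall privacy budget across all reported combinations is glossed over here, and the field names are invented for illustration.

```python
import itertools
import random
from collections import Counter

def laplace(scale):
    """Laplace(0, scale) noise as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_marginals(records, max_combo_len, epsilon_per_query):
    """Noisy counts for every attribute combination up to max_combo_len."""
    attrs = sorted(records[0])
    out = {}
    for k in range(1, max_combo_len + 1):
        for combo in itertools.combinations(attrs, k):
            counts = Counter(tuple(r[a] for a in combo) for r in records)
            for value, count in counts.items():
                out[(combo, value)] = count + laplace(1.0 / epsilon_per_query)
    return out

records = [{"age": "18-29", "gender": "F", "region": "A"},   # invented toy records
           {"age": "18-29", "gender": "M", "region": "B"},
           {"age": "30-44", "gender": "F", "region": "A"}]
print(noisy_marginals(records, max_combo_len=2, epsilon_per_query=0.5))
```

Synthetic records are then sampled so that their attribute combinations match these noisy counts, which is what lets the release stay useful for reporting while remaining differentially private.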

Stakeholders may improve their understanding of susceptibility risk factors and implement efficient counter-trafficking actions when they have access to precise yet anonymous patterns of attributes describing victim-perpetrator connections.

What’s next?

To make the solution available to other businesses and government entities, Microsoft and IOM have made it open to the public. It may be used by any interested party to collect and share personal information safely.

Together with the UN Office on Drugs and Crime (UNODC), IOM has been developing guidelines and recommendations to assist countries in generating high-quality administrative data. They have also been working with the International Labor Organization (ILO) of the United Nations to compile a bibliography of studies focusing on the effects of trafficking on public policy. To encourage governments and frontline anti-trafficking organizations to share data securely, IOM is developing an online course that will include a session with instructions on synthetic data.


Check out the reference article. All credit for this research goes to the researchers of this project.

Researchers Developed SmoothNets For Optimizing Convolutional Neural Network (CNN) Architecture Design For Differentially Private Deep Learning

Differential privacy (DP) is used in machine learning to preserve the confidentiality of the information in a dataset. The most widely used algorithm for training deep neural networks with differential privacy is Differentially Private Stochastic Gradient Descent (DP-SGD), which requires clipping and noising per-sample gradients. As a result, model utility decreases compared to non-private training.

There are two main approaches to dealing with the performance drop caused by DP-SGD: architectural modifications and training methods. The first aims to make the network structure more robust against the challenges of DP-SGD by modifying the model's architecture; the second focuses on finding a training strategy that minimizes DP-SGD's negative effect on accuracy. Only a few studies have examined concrete model design choices that offer robustness against the utility reductions of DP-SGD training. In this context, a German research team recently proposed SmoothNet, a new deep architecture designed to reduce this performance loss.

The authors evaluated individual model components of widely used deep learning architectures with respect to their influence on DP-SGD training performance. Based on this study, they distilled the optimal components and assembled a new model architecture, SmoothNet, which achieves state-of-the-art results among differentially privately trained models on the CIFAR-10 and ImageNette reference datasets. The component study showed that the width-to-depth ratio correlates strongly with model performance; in fact, the optimal width-depth ratio is higher for private training than for non-private training. In addition, residual and dense connections are beneficial for models trained with DP-SGD, the SELU activation function performs better than ReLU, and max pooling showed superior results compared to other pooling functions.

Based on the results cited above, the authors proposed a new architecture named SmoothNet, whose core components are building blocks named SmoothBlocks. These blocks are inspired by DenseBlocks, with three modifications: the width of the 3×3 convolutional layers increases rapidly; Group Normalization layers with eight groups replace Batch Normalization; and SELU layers serve as activation functions. The depth of the network is limited to 10 SmoothBlocks, and, as in DenseNets, average pooling is applied between SmoothBlocks. The extracted features are compressed to 2048 features and fed to a classifier block made of three linear layers separated by SELU activation functions.
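A hedged PyTorch reconstruction of one SmoothBlock from this description: a 3×3 convolution, Group Normalization with eight groups, SELU, and a dense (concatenating) connection. The exact channel widths, growth rate, and wiring are our assumptions based on the summary, not the authors' released code.

```python
import torch
import torch.nn as nn

class SmoothBlock(nn.Module):
    """Per the description above: conv3x3 -> GroupNorm(8) -> SELU, densely connected."""
    def __init__(self, in_channels, growth):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, growth, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(num_groups=8, num_channels=growth)
        self.act = nn.SELU()

    def forward(self, x):
        out = self.act(self.norm(self.conv(x)))
        return torch.cat([x, out], dim=1)       # dense (DenseNet-style) connection

block = SmoothBlock(in_channels=16, growth=32)  # growth must be divisible by 8
y = block(torch.randn(2, 16, 32, 32))           # -> torch.Size([2, 48, 32, 32])
```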

To validate the novel architecture, the authors performed an experimental study on CIFAR-10 and ImageNette, a subset of ImageNet. SmoothNet's performance is compared to several standard architectures: ResNet-18, ResNet-34, EfficientNet-B0, and DenseNet-121. Results demonstrate that SmoothNet achieves the highest validation accuracy when trained with DP-SGD.

This article presented an investigation into optimal architectural choices for high-utility training of neural networks with DP guarantees. SmoothNet, a novel network, was proposed to counter the performance decrease associated with DP-SGD training. Results showed that the proposed network outperforms previous works that follow the strategy of architectural modifications.

This article is written as a research summary by MarkTechPost staff based on the research paper 'SmoothNets: Optimizing CNN architecture design for differentially private deep learning'. All credit for this research goes to the researchers of this project. Check out the paper.

Researchers Analyze the Current Findings on Confidential Computing-Assisted Machine Learning (ML) Security and Privacy Techniques Along with the Limitations in Existing Trusted Execution Environment (TEE) Systems

The evolution of machine learning (ML) has broadened its possible uses, but wide application also enlarges the attack surface on ML's security and privacy. ML models often use private and sometimes sensitive data, for example, specific information about people (names, photos, addresses, preferences, etc.), and the architecture of the network itself can be stolen. In response to these risks, several methods for anonymizing data and securing the different stages of the machine learning process have been, and are still being, developed. In practice, however, these solutions are only rarely applied.

In a professional context, the different steps (training/inference) and the data necessary to operate the model may be held by various stakeholders, such as customers and companies, and they may run or be stored in different places (the model provider's server, the data owner's premises, the cloud, etc.). The risk of attack can be present at any of these entities. One promising method for obtaining reliable, privacy-preserving ML is confidential computing. Given the importance and the challenges of securing machine learning models, a research team from England wrote a systematization-of-knowledge (SoK) paper in which they introduce the problem and propose future directions for achieving ML with confidential computing at the hardware, system, and framework levels.

The authors affirm that confidential computing ensures a level of privacy and integrity assurance by employing Trusted Execution Environments (TEEs) to run code on data. A TEE is one of the newest methods for isolating and verifying code execution inside protected memory (also known as an enclave, or the secure world), away from the host's privileged system stack such as the operating system or hypervisor. It rests on three key pillars: a root-of-trust measurement, remote trust establishment and attestation, and trustworthy code execution and compartmentalization. In confidential-computing-assisted ML, owners of data or models must covertly supply them to the TEE of the untrusted host: they prepare the model and/or data, perform remote attestation to verify the integrity of the remote TEE, and then create secure communication channels with it. The primary feature offered by confidential computing is hardware-assisted separation of enclaves/TEEs from the untrusted environment.
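The provisioning flow just described can be summarized in pseudocode. Every function below is hypothetical, named only to mirror the steps in the text; real deployments rely on platform-specific attestation SDKs and services, not this API.

```python
def provision_model_to_tee(model_bytes, tee_endpoint, expected_measurement):
    """Hypothetical sketch: attest a remote TEE, then ship a model over a secure channel."""
    quote = tee_endpoint.request_attestation_quote()    # root-of-trust measurement report
    if not verify_quote(quote, expected_measurement):   # remote attestation step
        raise RuntimeError("TEE measurement mismatch; refusing to provision")
    channel = open_secure_channel(tee_endpoint, quote)  # keys bound to the attested enclave
    channel.send(encrypt_for_channel(channel, model_bytes))
    return channel                                      # subsequently used for inference I/O
```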

The SoK article presents several recommendations. The authors believe the privacy concept remains less clear than security or integrity: a well-founded privacy assurance requires a theoretically grounded protection goal, for instance via differential privacy. They insist that the upstream portion of the ML pipeline, such as data preparation, must be protected at all costs, because leaving it unprotected has unavoidable detrimental effects; this may be accomplished by incorporating TEE-based verification into data signatures. Protecting the whole ML pipeline may also benefit from multiple TEEs/enclaves. The privacy and integrity weaknesses of various ML components (layers, feature maps, numerical calculations) must be carefully studied before designing an ML framework that is TEE-aware and partitionable across heterogeneous TEEs. Additionally, the TEE system must be managed so as to effectively protect the most sensitive ML components with high priority.

The paper describes an exciting and challenging new era in protecting ML against privacy leaks and integrity breaches using confidential computing techniques. Although running training and inference inside TEEs has been the subject of numerous studies, these efforts still struggle with the scarcity of trusted resources inside TEEs: because ML demands far more reliable resources than TEEs typically offer, existing protection measures only guarantee the confidentiality and integrity of the training/inference stage rather than the full ML pipeline. By achieving a hardware-based root of trust, confidential computing establishes a more reliable execution environment for ML operations. Still, the assumption that hiding the training/inference process inside such enclaves is the best course of action must be reconsidered: future researchers and developers must better understand the privacy challenges underlying the ML pipeline so that future security measures can concentrate on the essential components.

This article is written as a research summary by MarkTechPost staff based on the research paper 'SoK: Machine Learning with Confidential Computing'. All credit for this research goes to the researchers of this project. Check out the paper.

3 Machine Learning Business Challenges Rooted in Data Sensitivity

Machine Learning (ML), and in particular Deep Learning, is drastically changing the way we conduct business: data can now be used to guide business strategies that create new value, analyze customers and predict their behavior, or even provide medical diagnosis and care. We may think data is at risk only when these algorithms recommend and direct our purchases on social media or monitor our doorways, elderly, and youngsters, but that is only the tip of the iceberg. Data is used to make banking decisions, detect fraudulent transactions, and set insurance rates. In all these cases, the data is entangled with sensitive information about enterprises or even individuals, and the benefits come bundled with data risks.

One of the most critical challenges companies face today is understanding how to handle and protect their own data while using it to improve their businesses through ML solutions. This data includes customers' personal information as well as business data, such as a company's own sales figures. Clearly, it is essential for a company to handle and protect such data correctly, since its exposure would be a massive vulnerability.

Specifically, three significant business challenges around data protection are worth mentioning:

  • First, companies have to find out how to provide safe access to large datasets for their scientists to train ML models that provide novel business value. 
  • Second, as part of their digital transformation efforts, many companies tend to migrate their ML processes (training and deployment) to cloud platforms where they can be more efficiently handled at a large-scale. However, exposing the data those ML processes consume to the cloud platform comes with its own associated data risks.
  • Third, organizations that want to take advantage of third-party ML-backed services must currently be willing to relinquish ownership of their sensitive data to the provider of those services. 

To address these challenges and be broadly applicable, two essential goals must be met:

  1. Separate plain-text sensitive data from the machine learning process and the platform during both the training and the inference stages of the ML lifecycle;
  2. Fulfill this objective without significantly impacting the performance of the ML model and the platform on which it is trained and deployed. 

In recent years, ML researchers have proposed different methods to protect the data used by ML models. However, none of these solutions satisfies both of the above goals. Most importantly, Protopia AI's Stained Glass Transform™ solution is the only solution on the market that adds a layer of data protection during inference without requiring specialized hardware or incurring significant performance overheads.

Protopia AI's patented technology enables Stained Glass Transforms™ to reduce the information content in inferencing data, increasing data security and enabling privacy-preserving inference. The transforms can be thought of as stained glass covering the raw data behind it. Unlike masking solutions, which scan the data for sensitive information to redact, Protopia AI's solution stochastically transforms real data with respect to the machine learning model the data is intended for. The low-overhead, nonintrusive nature of Stained Glass Transforms™ enables enterprises to secure ownership of their data in increasingly complex environments by dynamically applying the transformations in data pipelines for every record.

While synthetic data can be useful for training some models, inferencing requires real data. On the other hand, inferencing on encrypted data is prohibitively slow for most applications, even with custom hardware. By contrast, Protopia AI's Stained Glass Transforms™ change the representation of the data through a low-overhead, software-only process. These transforms are applicable and effective for a variety of data types, including tabular, text, image, and video data. Protopia AI's solution decouples the ownership of data from where, and on which platform, the inferencing is performed.

Gartner also recently highlighted Protopia AI in its June 2022 report on Cool Vendors in AI Governance and Responsible AI – From Principles to Practice:

"Data sharing and protecting data ownership are hindrances to using SaaS for AI and machine learning. With Protopia AI, the specific target machine learning model is still able to perform accurate inferencing without the need to reverse the transformation. The target model is still trained with the common practices and using the original data. As such, the solution seamlessly integrates with MLOps platforms and data stores. Ultimately, Protopia AI's Stained-Glass Transforms minimize leakage of the sensitive information entangled in inferencing data, which, in many cases, is the barrier to using the data for machine learning and AI."

In the sections that follow, we detail how existing methods are complementary to Protopia AI’s Stained Glass Transform™ solution and where other solutions fall short.

Federated Learning: To protect training data, Google presented Federated Learning [1], a distributed learning framework in which the devices that locally store data collaboratively learn a shared ML model without exposing training data to a centralized training platform. The idea is to send only the ML model's parameters to the cloud, thus protecting the sensitive training data. However, several works in the literature have demonstrated that an attacker can use observations of an ML model's parameters to infer private information in the training data, such as class representatives, membership, and properties of a subset of the training data [2]. Moreover, Federated Learning ignores the inference stage of the ML lifecycle: running inference still exposes the data to the ML model, whether it runs in the cloud or on the edge device.

Differential Privacy: Differential Privacy has received significant attention. The method bounds how much any single record in the training dataset can contribute to the resulting machine learning model: if a single data record is removed from the dataset, the model's output should not change beyond a certain threshold, which defends against membership tests on the training data records. Although very important, training in a differentially private manner still requires access to plain-text data. More importantly, differential privacy does not deal with the inferencing stage in any form or way.
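
The core mechanism can be sketched as the per-example gradient clipping and noise addition used in DP-SGD, a standard instantiation of differentially private training (not tied to any product discussed here); the parameter values below are illustrative:

```python
# Sketch of the core DP-SGD aggregation step: bound each example's influence
# by clipping, then mask the sum with calibrated Gaussian noise. Note the
# model still consumes plain-text data, which is the limitation noted above.
import numpy as np

def dp_aggregate(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    """Clip per-example gradients, then add noise scaled to the clip bound."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)
```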

Synthetic Data: Another method to protect sensitive training data is to train the ML model on synthetic data alone. However, the generated synthetic data might not cover real-world data subspaces that are essential for training a predictive model that remains reliable during the inference stage. This can cause accuracy losses significant enough to make the model unusable after deployment. Moreover, the trained model still needs real data to perform inferencing and prediction, and there is no escaping the challenges of this stage, where synthetic data cannot be used.

Secure Multi-Party Computation and Homomorphic Encryption: Two cryptographic techniques for privacy-preserving computation are Secure Multi-Party Computation (SMC) and Homomorphic Encryption (HE). In SMC, the computation is distributed over multiple secure platforms, which incurs significant computation and communication costs that can be prohibitive in many cases [3]. Homomorphic encryption is even more costly: it operates on data in encrypted form and, even with custom hardware, is orders of magnitude slower [4]. Moreover, deep neural networks, the most widely used ML solution in many domains today, require modifications to be used in a framework that relies on HE [5].
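
As a toy illustration of how SMC keeps inputs private, consider additive secret sharing, a standard building block behind systems like SecureML [3]: each party's share looks random, and only the combination of all shares reveals anything. The prime and values below are arbitrary:

```python
# Toy additive secret sharing: split secrets into random-looking shares,
# compute on shares, and reconstruct only the result. Real SMC protocols add
# secure multiplication, which is where the heavy communication cost arises.
import secrets

P = 2**61 - 1  # all arithmetic is done modulo a public prime

def share(value, n_parties=3):
    """Split a secret into n additive shares modulo P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Two secrets summed without any single party seeing either input:
a_shares, b_shares = share(42), share(100)
sum_shares = [(a + b) % P for a, b in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 142
```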

Confidential Computing: Confidential computing focuses on protecting data during use. Many big companies, including Google, Intel, Meta, and Microsoft, have joined the Confidential Computing Consortium, established in 2019 to promote hardware-based Trusted Execution Environments (TEEs). This approach protects data in use by isolating computations inside these hardware-based TEEs. The main drawback of confidential computing is that it forces companies to incur the cost of migrating their ML-based services to platforms that provide such specialized hardware infrastructure. At the same time, the solution cannot be considered risk-free: in May 2021, a group of researchers introduced SmashEx [6], an attack that collects and corrupts data from TEEs that rely on the Intel Software Guard Extensions (SGX) technology. Protopia AI's Stained Glass Transform™ technology can transform data before it enters the trusted execution environment; as such, it is complementary and minimizes the attack surface along an orthogonal axis. Even if the TEE is breached, the plain-text data is no longer there when Protopia AI's solution is in place.

In conclusion, enterprises have struggled to understand how to protect sensitive information when using their data during the training and inference stages of the ML lifecycle. Questions of data ownership, and of which parties, platforms, and algorithms sensitive data gets exposed to during ML processes, are a central challenge to enabling ML solutions and unlocking their value in today's enterprise. Protopia AI's Stained Glass Transform™ solution privatizes and protects ML data for both training and inference, for any ML application and data type. These lightweight transformations decouple the ownership of plain/raw sensitive information in real data from the ML process without imposing significant overhead in the critical path or requiring specialized hardware.

Note: Thanks to Protopia AI for the thought leadership/ Educational article above. Protopia AI has supported and sponsored this Content.
For more information, products, sales, and marketing, please contact the Protopia AI team at info@protopia.ai

References:

[1] McMahan, Brendan, et al. “Communication-efficient learning of deep networks from decentralized data.” Artificial intelligence and statistics. PMLR, 2017.

[2] Lyu, Lingjuan, et al. “Privacy and robustness in federated learning: Attacks and defenses.” arXiv preprint arXiv:2012.06337 (2020).

[3] Mohassel, Payman, and Yupeng Zhang. “Secureml: A system for scalable privacy-preserving machine learning.” 2017 IEEE symposium on security and privacy (SP). IEEE, 2017.

[4] Xie, Pengtao, et al. “Crypto-nets: Neural networks over encrypted data.” arXiv preprint arXiv:1412.6181 (2014).

[5] Chabanne, Hervé, et al. “Privacy-preserving classification on deep neural network.” Cryptology ePrint Archive (2017).

[6] Cui, Jinhua, et al. “SmashEx: Smashing SGX Enclaves Using Exceptions.” Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.

The post 3 Machine Learning Business Challenges Rooted in Data Sensitivity appeared first on MarkTechPost.

Researchers created a Novel Framework called ‘FedD3’ for Federated Learning in Resource-Constrained Edge Environments via Decentralized Dataset Distillation https://www.marktechpost.com/2022/09/04/researchers-created-a-novel-framework-called-fedd3-for-federated-learning-in-resource-constrained-edge-environments-via-decentralized-dataset-distillation/ Mon, 05 Sep 2022 05:27:09 +0000

For collaborative learning in large-scale distributed systems with a sizable number of networked clients, such as smartphones, connected cars, or edge devices, federated learning has emerged as a paradigm. Previous research has attempted to speed up convergence, reduce the number of required operations, and increase communication efficiency due to the limited bandwidth between clients. However, this type of cooperative optimization still results in high communication volumes for current neural networks with over a billion parameters, necessitating significant network capacity (up to the Gbps level) to function consistently and effectively. Due to this limitation, federated learning models cannot be widely used in commercial wireless mobile networks, such as vehicle communication networks or industrial sensor networks.

This communication bottleneck drove prior federated learning methods to lower the number of communication rounds and, consequently, the communication volume needed to achieve satisfactory learning convergence. Guha et al. propose one-shot federated learning, which reduces the communication costs of training a support vector machine by exchanging information in a single round. Kasturi et al. present a fusion of federated learning that uploads both the model and the data distribution to the server, although it can be challenging to characterize the distribution of a real dataset.

One-shot federated learning methods based on knowledge transfer are general, but sending numerous student models to the server adds communication cost. Inspired by the one-shot scheme, the researchers offer a novel federated learning training scheme with one-shot communication through dataset distillation. It makes intuitive sense to synthesize and send significantly smaller but more valuable datasets with dense characteristics: more informative training data is transferred over the constrained bandwidth without violating anyone’s privacy. In particular, the researchers provide FedD3, a unique federated learning system that incorporates dataset distillation.

It allows for effective federated learning by having each client send its locally distilled dataset to the server in a single shot. The approach can incorporate a pre-trained model and be utilized for personalized and fair learning. It also preserves privacy, one of federated learning’s most significant benefits: like the shared model parameters in earlier federated learning approaches, but far more effectively and efficiently, it derives distilled datasets from the original client data without exposing the raw data.
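
Under this design, the one-shot flow can be sketched as below; `distill_dataset` and `train_fn` are placeholders standing in for the paper's distillation instances and server-side training, and all names are illustrative rather than the authors' actual API:

```python
# High-level sketch of the FedD3 flow under stated assumptions: each client
# distills its local data into a few synthetic samples and uploads only those,
# once; the server then trains a single global model on the pooled samples.
def fedd3_round(clients, server_model, distill_dataset, train_fn, n_synthetic=10):
    distilled = []
    for local_data in clients:
        # One-shot communication: this is the only upload a client ever makes.
        distilled.extend(distill_dataset(local_data, n_synthetic))
    # Server-side training on the union of all clients' synthetic samples.
    return train_fn(server_model, distilled)
```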

Communication efficiency in federated learning is assessed by weighing the accuracy gain against the communication cost, and the experiments specifically show the trade-off between accuracy and communication expense. To capture this trade-off, the researchers suggest a new assessment metric, the γ-accuracy gain. They also examine the effects of external variables such as Non-IID datasets, client count, and local contributions, and show the approach's strong potential in federated learning networks with limited communication budgets.

They demonstrate through experimentation that FedD3 has the following benefits. First, FedD3 obtains much better performance than conventional federated learning even with less communication volume. Second, compared to other one-shot federated learning approaches, it improves accuracy in a distributed system with 500 clients by more than 2.3× (from 42.08% to 94.74%) on Non-IID MNIST.

Following are the four main contributions made in this paper:

  1. Researchers propose a novel framework, FedD3, for effective federated learning in a one-shot manner.
  2. Researchers demonstrate FedD3 with two different dataset distillation instances on the clients.
  3. Researchers introduce a decentralized dataset distillation scheme for federated learning systems, in which distilled data, rather than models, are uploaded to the server.
  4. The paper's studies, particularly the trade-off between accuracy and communication cost, show the design's strong potential in federated learning networks with constrained communication resources. The source code is publicly available on GitHub.
This Article is written as a research summary article by Marktechpost Staff based on the research paper 'Federated Learning via Decentralized Dataset Distillation in Resource-Constrained Edge Environments'. All Credit For This Research Goes To Researchers on This Project. Check out the paper and GitHub link.

The post Researchers created a Novel Framework called ‘FedD3’ for Federated Learning in Resource-Constrained Edge Environments via Decentralized Dataset Distillation appeared first on MarkTechPost.
