The post GraphMa: Graph Processing with Pipeline-Oriented Computation appeared first on Graph Massivizer EU Project.

At its core, GraphMa is a conceptual framework that seamlessly merges the principles of pipeline computation with the intricacies of graph processing. It introduces a series of powerful abstractions that empower developers to decompose complex graph operations into modular, composable functions. These functions can then be orchestrated into streamlined pipelines, facilitating the systematic development and execution of graph algorithms.

- **Computation as Type:** This foundational abstraction elevates computation units to first-class entities, encapsulating them within a well-defined interface. This approach ensures type safety and promotes modularity, enabling the creation of reusable and composable pipeline stages.
- **Higher-Order Traversal Abstraction:** This abstraction provides a versatile mechanism for navigating and accessing data within graphs. It defines methods for traversing various data sources, empowering developers to manipulate and process graph data with flexibility and efficiency.
- **Directed Data-Transfer Protocol:** This protocol governs the seamless and efficient transfer of data between computational stages. It adheres to functional programming principles, ensuring clear directionality and optimized data flow throughout the pipeline.
- **Operator Model:** This model introduces a comprehensive set of constructs for managing the lifecycle and states of operators within the pipeline. It facilitates a wide array of data processing operations, from transformations to aggregations, enabling the construction of sophisticated graph algorithms.
- **Pipeline Abstraction:** This abstraction serves as the overarching framework that orchestrates the entire graph processing workflow. It encapsulates the complexities of data transformation and transmission, providing a high-level blueprint for defining and executing graph processing pipelines.
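
These abstractions can be illustrated with a minimal sketch: each stage is a first-class, composable computation unit behind a uniform interface, and a pipeline orchestrates the directed flow of data through the stages. The names below (`Stage`, `Pipeline`) are purely illustrative, not the actual GraphMa API.

```python
from typing import Callable

class Stage:
    """Wraps a computation unit behind a well-defined, uniform interface."""
    def __init__(self, fn: Callable):
        self.fn = fn
    def __call__(self, data):
        return self.fn(data)

class Pipeline:
    """Orchestrates stages; data flows in one direction through them."""
    def __init__(self, *stages: Stage):
        self.stages = stages
    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

# A toy edge-centric pipeline over an edge list:
edges = [("a", "b"), ("b", "c"), ("a", "c")]
pipeline = Pipeline(
    Stage(lambda es: [e for e in es if e[0] == "a"]),  # filter: edges from "a"
    Stage(len),                                        # aggregate: count them
)
print(pipeline.run(edges))  # 2
```

Because every stage shares the same interface, stages can be reused and recombined into different pipelines without changing their internals.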

GraphMa’s versatility shines through its ability to seamlessly integrate well-established computational models for graph processing. Whether it’s the vertex-centric model, where computations are centered around individual nodes, or the edge-centric model, which focuses on the relationships between nodes, GraphMa provides a flexible platform for implementing and executing these models within its pipeline-oriented architecture.

GraphMa represents a significant leap forward in the field of graph processing. By combining the power of pipeline computation with graph-specific abstractions, it offers a structured and modular approach to tackling the challenges of graph data analysis. Its potential to enhance scalability, efficiency, and expressiveness in graph processing tasks positions it as a valuable tool for researchers and practitioners navigating the complexities of interconnected data. As GraphMa continues to evolve, we can anticipate its widespread adoption and its transformative impact on the way we understand and leverage the power of graphs in the digital age.

Schroeder, Daniel Thilo, Tobias Herb, Brian Elvesæter, and Dumitru Roman. "GraphMa: Towards New Models for Pipeline-Oriented Computation on Graphs." In Companion of the 15th ACM/SPEC International Conference on Performance Engineering, pp. 98-105. 2024.


The post How we implemented scalable graph summarization appeared first on Graph Massivizer EU Project.

- k-bisimulation can be used to create a condensed version of a graph. This condensed version is a graph summary, keeping specific properties of the original graph.
- k-bisimulation partitions the nodes of the graph into equivalence classes which we call blocks.
- We create the summary by creating one node for each block. Then, for each edge in the original graph, we connect the corresponding blocks with an edge with the same label.
- To speed up the computation of the k-bisimulation, we
  - use a partition refinement approach
  - implemented everything in C++, making use of the Boost libraries
  - treat singleton blocks separately
  - then treat blocks with only 2 nodes
  - … and only then the rest
  - devised a new step in the algorithm which remembers which parts might need to be refined from step k-1 to compute the blocks at step k

- Doing all this, we obtained a speedup of 20X compared to the already improved Python implementation.

When graphs get very large, they can become difficult to work with. One way to deal with such a graph is by reducing it to a smaller data structure which maintains the properties you want to preserve. In this specific case, we want to create a quotient graph based on a k-bisimulation. This summary graph preserves paths which were in the original graph, but can be much smaller than the original. For very large graphs, computing k-bisimulation is itself a challenge. There are existing frameworks, but they have their limitations. Some are hard to set up, and the overhead of the framework makes them less scalable. Often these frameworks trade efficiency for broader applicability; they have capabilities to produce a wider variety of summaries. In this blog post, we first define k-bisimulation and look at a naive algorithm to compute it. Then we will look into partition refinement, which is a faster way to compute the same thing. Finally, we will look at further optimizations of partition refinement and discuss how we implemented it.

We start from a labeled graph *G* = (*V*,*E*,*L*), where V is the set of vertices or nodes, L is a set of labels and *E* ⊂ {(*v*_{1},*l*,*v*_{2})|*v*_{1}, *v*_{2} ∈ *V* and *l* ∈ *L*} is the set of labeled edges, also called the triples, of the graph. *v*_{1} is the source vertex of the edge, *v*_{2} is the target vertex. This definition implies that there can be multiple labeled edges between two vertices, but no two edges between the same pair of vertices can have the same label.

Now, we will start talking about paths in a graph. A path is a sequence of edges where the target vertex of the previous edge is the source vertex of the next edge. If we call the source of the first edge *v*_{start} and the target of the last edge *v*_{end}, then we say that this is a path from *v*_{start} to *v*_{end}. The length of the path is the number of edges in the path. In some definitions it is assumed that an edge occurs only once in a path; we make no such assumption. Commonly, we are only interested in the labels of the edges in the path and not in the path itself. Therefore, we introduce the term labelpath to mean the sequence of labels of the path defined above. We define the set of all outgoing labelpaths of vertex *v*_{A}, written *paths*_{k, out}(*v*_{A}), to be the set of all labelpaths of at most length *k* starting at vertex *v*_{A} (and ending anywhere in the graph).

Now we can define our k-bisimulation by first explaining when two vertices are bisimilar. Two vertices *v*_{A} and *v*_{B} are k-bisimilar if *paths*_{k, out}(*v*_{A}) = *paths*_{k, out}(*v*_{B}), i.e., they have the same set of labelpaths up to length *k*.

Side note: to be precise, we are working with forward-bisimilarity which deals with outgoing paths only. Analogously, backward-bisimulation deals with paths ending in a specific node. Forward-backward bisimulation deals with both at the same time, meaning that both incoming and outgoing paths must be equal.

What we now do to create the summary is first fixing the parameter *k*. Then, we use k-bisimilarity as an equivalence relation between nodes, i.e., we consider two nodes equivalent if they are bisimilar. This equivalence relation induces a partition on the vertices of the graph *G*, i.e., we can split the vertices *V* into subsets such that

- None of the subsets is empty and each vertex is in precisely one of the subsets.
- In each set, each vertex is bisimilar to all other vertices in that set.
- No vertex from one subset is bisimilar to a vertex from another subset.

We call each of these subsets a *block* of the partition.

Now we are ready to create our summary as follows. Given a graph *G* = (*V*,*E*,*L*), create a summary graph *S* = (*V*_{S},*E*_{S},*L*), where

- *V*_{S} = {*v*_{B} | *B* is a block in the partition}, i.e., we create one supernode for each of the blocks.
- *E*_{S} = {(*v*_{A}, *l*, *v*_{B}) | (*v*_{a}, *l*, *v*_{b}) ∈ *E* and *v*_{A}, *v*_{B} ∈ *V*_{S} and *v*_{a} ∈ *A* and *v*_{b} ∈ *B*}, i.e., for each edge in the original graph, we create an edge with the same label between the supernodes representing the blocks of its endpoints.

This kind of summary graph is a quotient graph.

To create these summaries, the main algorithmic step is to compute the partition, i.e., to find the subsets. After that, creating the edges is trivial.
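
Given the partition, the edge-creation step can indeed be sketched in a few lines (the function and variable names below are illustrative):

```python
def build_summary(edges, block_of):
    """edges: iterable of (source, label, target) triples.
    block_of: maps each vertex to its block id.
    Returns the set of summary edges between supernodes."""
    summary_edges = set()
    for v1, label, v2 in edges:
        summary_edges.add((block_of[v1], label, block_of[v2]))
    return summary_edges

edges = [("a", "x", "b"), ("c", "x", "d"), ("b", "y", "c")]
block_of = {"a": 0, "c": 0, "b": 1, "d": 1}  # a,c in one block; b,d in another
print(build_summary(edges, block_of))  # two summary edges: (0,'x',1) and (1,'y',0)
```

Because the summary edges form a set, parallel original edges that connect the same pair of blocks with the same label collapse into a single summary edge.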

A naive way to find the blocks is by computing all outgoing labelpaths for all nodes. This works as follows (python pseudo-code):

```
from collections import defaultdict

def find_paths(v, k: int):
    # Collect all outgoing labelpaths of length up to k starting at v.
    labelpaths = set()
    for label, target in v.outgoing_edges():
        if k == 1:
            labelpaths.add((label,))  # tuples, so that paths are hashable
        else:
            for deeperpath in find_paths(target, k - 1):
                labelpaths.add((label,) + deeperpath)
    return labelpaths

# Group all vertices with the same set of labelpaths.
equivalence_map = defaultdict(list)
for v in V:
    paths = frozenset(find_paths(v, k))  # frozenset, so it can be a dict key
    equivalence_map[paths].append(v)
```

In this algorithm, for each vertex, we compute the set of all outgoing labelpaths. Then, we use the equivalence_map to make sure all vertices with the same set of paths get grouped together. In the end, the values in the map are the blocks we are looking for.

The problem with this implementation is that in the worst case the runtime and memory use grow exponentially with the depth of the paths. This happens, for example, with graphs which look like the one in the figure.

Here, the depth of the paths is only 3, which will not cause an issue. An issue arises when we encounter such structures with longer paths. If we make a larger graph with the same structure as the one above but with depth *k* instead of 3, we observe that the number of paths becomes 2^{k}, which for a large *k* means very many paths. The following figure shows the outcome of an experiment for increasing depths.
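
This blow-up is easy to reproduce with a toy "diamond chain" graph, in which every vertex has two differently labeled edges to the next vertex (an illustrative construction in the spirit of the figure, not its exact graph):

```python
def count_labelpaths(out_edges, v, k):
    """out_edges: dict vertex -> list of (label, target). Returns the set
    of labelpaths of length k starting at v."""
    if k == 0:
        return {()}
    return {(label,) + path
            for label, target in out_edges.get(v, [])
            for path in count_labelpaths(out_edges, target, k - 1)}

def diamond_chain(depth):
    # vertex i has two edges, labeled "l" and "r", to vertex i+1
    return {i: [("l", i + 1), ("r", i + 1)] for i in range(depth)}

for depth in (3, 10, 15):
    print(depth, len(count_labelpaths(diamond_chain(depth), 0, depth)))
    # depth k yields 2^k distinct labelpaths: 8, 1024, 32768
```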

What we see is that the execution time keeps growing with increasing depth (note the logarithmic scale on the y-axis). In general, this type of algorithm can behave exponentially. In effect, this implementation is not scalable for deep paths. Even when paths are shorter, this implementation is far from scalable for large graphs because of the large amount of data stored for the paths.

As so often in computer science, the problem was already solved in the 1980s. In the paper

```
@article{doi:10.1137/0216062,
author = {Paige, Robert and Tarjan, Robert E.},
title = {Three Partition Refinement Algorithms},
journal = {SIAM Journal on Computing},
volume = {16},
number = {6},
pages = {973-989},
year = {1987},
doi = {10.1137/0216062},
URL = {https://doi.org/10.1137/0216062}
}
```

the partition refinement algorithm was used to compute bisimulations. The lingo used in the paper is rather different, but the algorithm applies almost directly. There are a few differences, though. In that work, the authors did not care about *k*, but were only interested in the case where *k* reaches infinity, meaning that all paths, independent of length, must be the same for vertices to be bisimilar. We hence adapted the algorithm to our use case. Here, I explain the intuition behind the algorithm.

As mentioned, the algorithm is called partition refinement. It works by creating a partition for depth *k* − 1, and then refining that partition to become the one for level *k* of the bisimulation. In other words, when we need to compute the k-bisimulation, we assume the (k-1)-bisimulation has already been computed; the algorithm works inductively.

For *k* = 0, meaning paths of length 0, all vertices are equivalent. So we define the partition to contain one block that contains all vertices.

For *k* > 0, we assume the (*k*−1)-bisimulation has been computed already. This one has a number of blocks. The vertices within each block are pairwise (*k*−1)-bisimilar.

Now, we work block by block through the blocks at level (*k*−1). For a given block A, we compute a signature for each of the vertices. For a vertex *v*_{a}, this signature consists of the set of all (*l*,*B*) for which (*v*_{a},*l*,*v*_{b}) ∈ *E* and *v*_{b} ∈ block *B* on level (*k*−1).

Based on these signatures, we split block A into smaller blocks, where each new block contains the vertices which have the same signature. These blocks are added to the collection of blocks for level *k*. After this, we throw away the signatures and continue with the next block.
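
One refinement round, as just described, can be sketched as follows (an illustrative, unoptimized version; not the implementation discussed later):

```python
from collections import defaultdict

def refine(blocks, edges):
    """blocks: list of sets of vertices (the (k-1)-partition).
    edges: list of (source, label, target) triples.
    Returns the refined partition for level k."""
    block_of = {v: i for i, block in enumerate(blocks) for v in block}
    # Each vertex's signature: the set of (label, target-block) pairs.
    out = defaultdict(set)
    for v1, label, v2 in edges:
        out[v1].add((label, block_of[v2]))
    new_blocks = []
    for block in blocks:
        by_signature = defaultdict(set)  # split the block by signature
        for v in block:
            by_signature[frozenset(out[v])].add(v)
        new_blocks.extend(by_signature.values())
    return new_blocks

# k=0: everything in one block; refining once gives the 1-bisimulation.
edges = [("a", "x", "b"), ("c", "x", "d"), ("b", "y", "a")]
partition = refine([{"a", "b", "c", "d"}], edges)
print(sorted(sorted(b) for b in partition))  # [['a', 'c'], ['b'], ['d']]
```

Iterating `refine` *k* times (starting from the single all-vertex block) yields the k-bisimulation partition.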

It is important to realize that the partition refinement algorithm will result in precisely the same final partition as the original algorithm. You can either believe this, and skip this section, or follow the informal argument. If that does not convince you, you could read a more formal proof in the original paper, specifically section 3 ‘Relational coarsest partition’.

We need to show the three properties:

- None of the blocks is empty and each vertex is in precisely one of the blocks.
- In each block, each vertex is k-bisimilar to all other vertices in that block.
- No vertex from one block is k-bisimilar to a vertex from another block.

First, it is trivial that none of the blocks is empty, and that each vertex is in precisely one block because we encounter each vertex only once iterating over the blocks.

For the second property, let’s look at two vertices in a block at level *k*. The vertices which ended up here had the same signature. If we look at one element (*l*,*B*) of the signature, we realize that both vertices have one (or more) outgoing edges with label *l* which end up in a vertex in block *B*. But, by induction, the vertices in block *B* are (*k*−1)-bisimilar, meaning that all of them have the same sets of outgoing paths. Prepending all these paths with *l* still results in two sets with the same paths. So, each part of the signature results in a set of paths which are the same for both vertices. Hence, this pair of vertices, and by extension all pairs of vertices in the block are *k*-bisimilar. We chose an arbitrary block, so in all blocks on level *k* the property holds.

We can show the third property by contradiction. Imagine the property holds for the previous round and now we find two vertices *v*_{a} and *v*_{b} which are *k*-bisimilar and in different blocks in the current round. To be *k*-bisimilar, these two vertices need to be (*k*−1)-bisimilar. So, in the previous round, they must have been in the same block *B*. And therefore, their signatures were directly compared. The only way they could have ended up in different blocks is if their signatures were not the same. This can have two causes.

- One of the labels might be different. This, however, means that the vertices are not *k*-bisimilar, which is a contradiction.
- We find a part of the signature for *v*_{a} which has the same label, but ends in a block A, rather than the corresponding part in the signature for *v*_{b}, which refers to block B (if this is not the case, invert the roles of *v*_{a} and *v*_{b}). However, we assumed all was fine until the previous round. This means that *v*_{a} has an edge to a vertex in block A, while *v*_{b} does not have an edge to a vertex in block A. Since the vertices in block B are not (*k*−1)-bisimilar to the vertices in block A, *v*_{a} and *v*_{b} are not *k*-bisimilar, which is a contradiction.

So, given that all three conditions are fulfilled, we are guaranteed a correct *k*-bisimulation with this algorithm.

The theoretical result is interesting, but we apply a few additional insights to speed up the computation.

- When no refinement is happening, we know that we are done, and can stop. This happens at the latest when the number of rounds becomes equal to the diameter of the graph.
- When a block only contains one element, i.e., it becomes a singleton, we never have to look at it again. It can also be stored more efficiently: instead of storing a list for each singleton block, we keep one list containing all vertices which are in a singleton block at this round. This list can simply be extended with new singletons in the next round.

With these tricks applied, we reduced the runtime of the bisimulation algorithm significantly. The following figure shows the runtime on the same type of graphs which illustrated the exponential behavior above. Now, we see that running a graph with depth 20 takes under 0.10 seconds, rather than the 100 seconds needed before. Even using a depth of 1000 results in a runtime of less than 2 seconds.

Now we were ready to run this algorithm on a large graph. We chose a DBpedia dump which contains 8 million entities and about 22 million edges, and ran it on a laptop with an i7-1280P CPU. The bisimulation ran until it reached k=146 before it finished. It took about 12 minutes and 39 seconds, including about 40 seconds to load the data, and used about 40 GB of RAM.

To scale this up further, we had some more ideas, some of which would need more control over memory management. We therefore moved to a C++ implementation.

For the implementation in C++ we heavily relied on the Boost libraries which provide efficient unordered container types like flat sets and flat maps. We also made use of emplacement to avoid object copying when possible. Besides the optimizations done for the Python version, we further optimized the following:

- We keep track of which blocks have been split at the previous round. Only blocks which have vertices with edges to these blocks can be split at the next round. As far as we know, this is a novel addition to the algorithm.
- We also experimented with a reverse index which keeps track of these edges directly; this did, however, not lead to significant speedups.

- We keep a mapping from the vertices to the block in which they are.
- We specialize this for level zero because everything is in the same block.
- We do not keep singletons explicitly. Rather, we map the corresponding vertices to negative integers, which indicates that they are singletons. To keep track of singleton blocks, we only need to remember how many there were.
- After the round, the index of the previous round is cleared as it is no longer needed.

- To create the blocks at level k, we first create a shallow copy of the blocks at level k-1. Only modified blocks will occupy additional memory; the others only occupy one pointer each.
- When splitting blocks, we try to not move all the data around. Rather, we put the first part of the split in place of the old block and put the other new parts in the back, then we update only the necessary mappings.
- In some cases, a split results in only singletons. In that case, we add a special empty block to the result. As soon as possible, that block will be overwritten by other splits, which will be put in these places, rather than appended to the back. Note that in a rare case an empty part could remain. This might contradict the requirements. A final cleanup step, which is not yet implemented, could take care of this.
- We deal with blocks of size 2 first because, when they split, they result in two singletons and always an empty block. The heuristic is that, by doing these first, there is a larger chance that a later split of a larger block will not consist only of singletons and can fill this gap.

With these additional steps, we could further reduce the running time for the DBpedia dataset. It now runs in 56 seconds, including 20 seconds for reading, meaning 36 seconds for the actual computation. To compare, the Python version needed 720 seconds (excluding reading). The speedup is about a factor of 20. The memory usage went down from 40 GB to 10.7 GB, so roughly by a factor of 4.

Both implementations also have a support parameter, which defaults to 1. If the size of a block drops below that value, the block is no longer a candidate for splitting.

It would be possible to port some of the optimizations from the C++ version back to the Python version.

It would also be possible to compute signatures only partially, stopping as soon as enough of a signature has been computed to notice a difference that puts a node into a singleton block. This would, however, require quite some bookkeeping, which would most likely outweigh the potential gains.

Michael Cochez – Assistant Professor, Vrije Universiteit Amsterdam


The post Neurosymbolic quality monitoring for sustainable manufacturing appeared first on Graph Massivizer EU Project.

In the fast-paced world of modern manufacturing, quality monitoring and analysis are paramount to ensuring the reliability and performance of products. As industries strive for excellence, maintaining stringent quality standards across all manufacturing processes becomes essential. This is particularly true for intricate and precision-dependent operations such as welding and soldering, which are foundational to the integrity of countless products.


For Bosch, a global leader in engineering, the importance of quality monitoring cannot be overstated. The company’s diverse product line—from automotive components to home appliances—relies heavily on precise manufacturing processes. For instance, the production of an electric drive involves several intricate welding operations that are critical to the product’s functionality and durability. However, quality monitoring not only ensures product excellence but is also a fundamental lever towards a sustainable automotive industry. By optimizing manufacturing processes and reducing waste, robust quality control measures contribute to sustainable manufacturing practices.

However, conventional quality monitoring often presents significant challenges, mainly derived from the costs associated with the required human intervention, as traditional methods for estimating welding quality are often time-consuming and expensive. For example, to evaluate the quality of spot-welding operations, one common approach involves measuring the diameters of welding spots using ultrasound technology, which, while effective, requires specialized equipment and skilled operators. Apart from this, destructive testing methods are also used, mainly by pulling the welded metal sheets apart and measuring the force required to separate them, leading to an increase in waste.

As such, to address the challenges in quality monitoring, researchers have developed data-driven methods that approach the problem from a multivariate time series perspective. These models estimate quality based on sensor measurements from the spot-welding machine. In Graph Massivizer, we aim at the next generation of such methods, offering versatile explainability and transparency by leveraging expert knowledge.

**Aim**

The aim of our use case is to develop next-generation quality monitoring methods that go beyond the more traditional data-driven approach by combining knowledge and sensor measurements.

On the one hand, we refer to sensor measurements as the time series produced during the welding operation, which can take the form of currents, voltages, temperatures, and so on. These are variables that evolve during each weld and are fundamental to the estimation of the quality (e.g., if an abnormality has been detected in the current that flows through the cathode of the welding machine, an anomaly is likely).

On the other hand, we refer to knowledge that encompasses diverse and rich prior information about the process, derived from expert-knowledge, machine manuals, anomaly reports, etc. This knowledge is represented as a Knowledge-Graph, and it comprises what experts on spot-welding would know.

**Benefits**

The benefits of building models that combine knowledge and data-driven methods are substantial, in terms of both accuracy and explainability. On the one hand, incorporating expert knowledge into quality prediction enhances transparency. This increased clarity makes it easier to apply corrective measures and potentially facilitates preventive maintenance. On the other hand, by integrating expert knowledge with sensor measurements, these models adopt a more informed approach to quality estimation, resulting in more accurate predictions.

All in all, we believe that Graph Massivizer will be a big step towards harmonizing neural and symbolic AI methods. It introduces a revolutionary approach to quality estimation in welding processes, and signifies a profound leap towards realizing a genuinely sustainable automotive industry.

Authors: Mikel Mendibe, Antonis Klironomos, Mohamed Gad-Elrab, Evgeny Kharlamov (Ph. D. at Bosch)


The post Graph sampling algorithms and predicting their qualities and runtimes appeared first on Graph Massivizer EU Project.

With the growing size of graphs in the real world, reducing their complexity becomes increasingly essential and computationally demanding. Because whole graphs are often unavailable for privacy or scalability reasons, gathering relevant samples for analyzing and estimating their features is crucial. Therefore, several graph sampling algorithms have emerged to simplify huge graphs.

**Sampling Methods**

Graph sampling methods fall into three categories:

**Node-based**

The simplest algorithms for sampling graphs involve selecting nodes and their connections. Different algorithms vary in how they choose nodes, whether it’s randomly or based on certain criteria to prioritize nodes with high degrees or PageRank scores. However, some of these methods may not fully capture the degree properties of the graph.

**Edge-based**

Edge-based sampling methods are straightforward and preserve edge-dependent properties, such as path length. However, they tend to mostly sample high-degree nodes and can harm the graph structure.

**Traversal-based**

Traversal-based methods enhance the performance of node- and edge-based methods by taking into account the topological information of a graph. These methods involve exploring the graph’s structure, and can be categorized into random walk (RW) based methods and neighborhood exploration methods. The RW-based methods encompass various approaches such as random jump and Metropolis-Hastings RW, which involve randomly traversing graph edges based on different policies. On the other hand, neighborhood exploration methods explore the neighbors of a seed set based on specific criteria or objectives. Examples of this category include snowball, forest fire, and expansion sampling. While these approaches can preserve structural and degree information, they may be time-consuming depending on the graph’s structure. Moreover, they may yield local samples if the seeds are not sufficiently spread out or if the walks become confined to a specific region.
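
A basic random-walk sampler, one of the traversal-based methods mentioned above, can be sketched as follows. This is an illustrative toy version; real implementations add random jumps, restarts, or Metropolis-Hastings corrections.

```python
import random

def random_walk_sample(adj, seed, steps, rng=random.Random(0)):
    """adj: dict vertex -> list of neighbors.
    Walks `steps` edges from `seed` and returns the sampled edge set."""
    sampled = set()
    v = seed
    for _ in range(steps):
        if not adj.get(v):       # dead end: restart at the seed
            v = seed
            continue
        w = rng.choice(adj[v])   # pick a random outgoing edge
        sampled.add((v, w))
        v = w
    return sampled

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
sample = random_walk_sample(adj, "a", 5)
print(sample)
```

If the walk gets trapped in one region, the sample stays local — which is exactly the weakness noted above.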

**Analyzing graph sampling methods: why?**

In this blog, we evaluate various traits of samples produced by the different sampling methods.

Some methods are successful at preserving specific distributions, like degree and clustering coefficient, as well as distances, while others prioritize efficiency instead. Additionally, the quality of these samplings and their outcomes can depend on graph properties, such as the number of nodes/edges, various centralities (e.g., betweenness and eigenvector), and the distances between nodes, which describe the graph's shape.

**It is crucial to rely on high-quality sampling for accurate results.**

Due to the presence of multiple sampling algorithms, it is important to anticipate their outcomes to avoid unnecessary sampling, which can consume time and memory.

**How do we evaluate sampling? By the quality of samples.**

We can use various quality metrics to assess sampling outcomes, depending on which aspect of the graph we want to prioritize. These can include the distributions of nodes’ degrees, clustering coefficients, and the distances between nodes (hop-plots). We then measure the divergence between the intended properties of the original and sample graphs using divergence metrics such as the Kolmogorov-Smirnov statistic and the Jensen-Shannon divergence.
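
As a concrete sketch of such a metric, the two-sample Kolmogorov-Smirnov statistic is the maximum distance between the empirical CDFs of a property in the original graph and in the sample. The pure-Python version below compares degree distributions; the toy graphs are made up for illustration.

```python
from collections import Counter

def degrees(edges):
    """Degree sequence of an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sorted(deg.values())

def ks_statistic(a, b):
    """Max distance between the empirical CDFs of samples a and b."""
    values = sorted(set(a) | set(b))
    cdf = lambda xs, x: sum(1 for y in xs if y <= x) / len(xs)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in values)

original = [(1, 2), (1, 3), (1, 4), (2, 3)]
sample = [(1, 2), (1, 3)]
d = ks_statistic(degrees(original), degrees(sample))
print(round(d, 2))  # 0.42
```

A value of 0 means the two degree distributions match exactly; values close to 1 mean the sample badly distorts the original distribution.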

**Efficiency of sampling**

As mentioned before, the need is often to sample from large graphs, which can be time-consuming. Therefore, an efficient algorithm is required to produce samples quickly.

**How do we predict sampling outcomes?**

It’s essential to analyze the outcomes of sampling algorithms, whether through analytical or empirical research. The goal of Graph-Massivizer is to predict the quality of the sampling outcome or its execution time for a given sampling algorithm and input graph. To accomplish this, we start by extracting graph features to characterize it as an input to the prediction model.

**Feature extraction and selection**

Let’s now talk about the features that can help us predict outcomes in a graph. When it comes to graphs, we have different features to look at, like size and structural features. Size features tell us about the numbers of nodes and edges in the graph, while structural features describe the shape of the graph, including its diameter, node influence level, and betweenness centralities.

For size features, we have:

• Node/Edge numbers

And for structural features, we have:

• Density, which shows the proportion of existing edges among all possible edges

• Clustering coefficient, which tells us about the level of clustering around each node

• Shortest path length between nodes, which gives us the minimum number of edges connecting two nodes

• Eigenvector/PageRank centrality, showing the authority level of a node

• Betweenness centrality, informing us how often a node appears on shortest paths

• Degree assortativity, which shows the extent to which nodes of similar degree are connected to each other

• Minimum spanning tree degree, representing the node degrees in the minimum tree covering all nodes

• Connected component sizes, which give us the sizes of connected regions
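
Two of the features listed above can be computed with a few lines each; the minimal sketch below (undirected toy graph, illustrative names) shows density and the local clustering coefficient.

```python
def density(n_nodes, edges):
    """Fraction of existing edges among all possible undirected edges."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))

def clustering_coefficient(adj, v):
    """Fraction of v's neighbor pairs that are themselves connected."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for i, u in enumerate(neighbors)
                for w in neighbors[i + 1:] if w in adj[u])
    return 2 * links / (k * (k - 1))

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
print(density(4, edges))               # 4 of 6 possible edges: 2/3
print(clustering_coefficient(adj, 3))  # 1 of 3 neighbor pairs linked: 1/3
```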

But the most important thing to remember is that not all these features may be relevant to what we’re trying to predict.

To avoid using unnecessary features, we use mutual information analysis to select the most relevant ones. This analysis helps us find non-linear connections between graph features and the outcome of the prediction model, which could be the desired quality metric or runtime. We also consider the sampling features, including the sampling algorithm, its type (node, edge, or traversal-based), and sampling rate for the model. When it comes to predicting the quality/runtime of sampling, we treat it as a regression problem. We use three machine learning models for this:

• random forest (RF),

• multilayer perceptron (MLP),

• k-nearest neighbor (kNN).

More specifically,

RF is the ideal model for high-dimensional and discrete data features, as it is both generalizable and robust. It is particularly suitable for graph data with numerous features.

Additionally, MLP is a flexible model that is well-suited for high-dimensional data.

On the other hand, kNN is an efficient model with minimal hyperparameters and demonstrates strong performance in the presence of similar data points.
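
The overall prediction setup — graph and sampling features in, a quality metric or runtime out, treated as regression — can be sketched with a minimal k-nearest-neighbor regressor. The feature vectors and runtimes below are made up for illustration, not project data.

```python
def knn_predict(train_X, train_y, x, k=2):
    """Average the targets of the k training points closest to x."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    return sum(train_y[i] for i in nearest) / k

# Features: (nodes, edges, density); target: sampling runtime in seconds.
train_X = [(1e3, 5e3, 0.01), (1e4, 8e4, 0.002), (1e5, 9e5, 0.0002)]
train_y = [0.1, 1.5, 20.0]
print(knn_predict(train_X, train_y, (2e4, 1e5, 0.001)))  # averages the two nearest
```

In practice, features would be normalized first, since raw node and edge counts dominate the Euclidean distance.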

**Prediction results**

Our project Graph-Massivizer has implemented prediction models that can provide predictions with errors below 20% on the clustering coefficient and hop-plot metrics for most of the considered sampling algorithms. However, their accuracy depends on the sampling algorithm. Therefore, we can select among the models for prediction on each algorithm. Overall, we find RF to be a suitable model for predicting clustering coefficients and hop-plots for most sampling algorithms.

Since high-quality sampling methods can be time-consuming, we must also consider runtime and trade it off against the quality of the sample. We therefore provide an algorithm ranking method that uses the execution-time prediction models to guide algorithm selection. We find kNN to be the best runtime prediction model.
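A minimal sketch of runtime-based algorithm ranking with per-algorithm kNN models is shown below; the algorithm names, features, and runtimes are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
algorithms = ["random-node", "random-edge", "random-walk", "forest-fire"]  # hypothetical

# One kNN runtime model per sampling algorithm, trained on (made-up) past runs.
runtime_models = {}
for i, algo in enumerate(algorithms):
    X = rng.uniform(size=(100, 4))                               # graph features of past runs
    t = (i + 1) * (1.0 + X[:, 0]) + 0.05 * rng.normal(size=100)  # synthetic runtimes
    runtime_models[algo] = KNeighborsRegressor(n_neighbors=5).fit(X, t)

# Rank algorithms for an unseen graph by predicted runtime, fastest first.
new_graph = rng.uniform(size=(1, 4))
predicted = {a: float(m.predict(new_graph)[0]) for a, m in runtime_models.items()}
ranking = sorted(predicted, key=predicted.get)
print(ranking)
```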

**Conclusion**

Our study demonstrates that leveraging a sufficient set of graph features, and training models on realistic graphs, enables accurate prediction of sampling algorithm quality as well as ranking by runtime. This offers a method for algorithm selection that accounts for the trade-off between quality and efficiency. Efficient feature extraction methods remain essential for developing a practical model.

Author: Seyedeh Haleh Seyed Dizaji, Postdoctoral Researcher at Universität Klagenfurt

The post Graph sampling algorithms and predicting their qualities and runtimes appeared first on Graph Massivizer EU Project.

The post Graph massivizer: much more than another Graph Database Platform appeared first on Graph Massivizer EU Project.

**Understanding Graph Massivizer**

Graph Massivizer is a high-performance, scalable, and sustainable platform designed for processing and reasoning over massive graph representations of extreme data. It is part of an EU-funded project to promote climate-neutral and sustainable economic sectors through advanced graph data processing. The Graph-Massivizer project is building a software platform, referred to as the “Toolkit”, based on the massive graph representation of extreme data in general graphs, knowledge graphs (KG), and property graphs, which integrate patterns and store interlinked descriptions of objects, events, situations, and concepts with associated semantics.

The toolkit is made up of a Graphical User Interface, which will be the primary mechanism through which users interact with the information processing subcomponents, and five categories of information processing components:

- A graph database development and management tool for creating and storing graph data.
- A graph database analytical tool for graph analysis, querying and modelling.
- A graph database optimization tool to enhance graph processing performance and predict workloads.
- An environmental impact optimization tool that enables monitoring and reducing environmental impact.
- An orchestration tool to manage heterogeneous resources and graph processing requests by incorporating serverless scheduling, resource management, and allocation mechanisms.

In addition, four use cases were selected for their capacity to demonstrate the effectiveness of the Graph-Massivizer approach. They span four different industries and scenarios: Green and Sustainable Finance, Global Foresight for Environmental Protection, Green AI for the Sustainable Automotive Industry, and a Data Centre Digital Twin for Sustainable Exascale Computing.

**Key Advantages of Graph Massivizer**

Graph Massivizer goes beyond the traditional capabilities of graph databases by offering scalable, high-performance data management, advanced querying, and integrated automated intelligence, all while maintaining a commitment to environmental sustainability. Its comprehensive toolset and user-friendly interface make it accessible and valuable across various industries. As organizations increasingly recognize the importance of interconnected data, Graph Massivizer is well-positioned to drive innovation and provide significant competitive advantages.

A unique aspect of Graph Massivizer is its commitment to environmental sustainability. The platform supports performance modeling and environmental sustainability trade-offs, ensuring high performance is achieved with minimal environmental impact. This focus aligns with the growing demand for eco-friendly technologies in the business world.

The combined use of these five components is new to the market, as current providers offer no more than three of these capabilities simultaneously. While many individual tools are available, an all-in-one solution that integrates development, analysis, optimization, environmental monitoring, and scheduling could significantly simplify adoption for companies. Integration often simplifies tasks, reduces errors, and can lead to cost savings.

Graph Massivizer provides a comprehensive toolkit of open-source software tools and FAIR (Findable, Accessible, Interoperable, and Reusable) graph datasets. These tools cover the entire lifecycle of processing extreme data as massive graphs, making the platform accessible to users with a medium level of technical expertise. The intuitive interface simplifies data management and analysis, although some workloads will still have to be coded in well-known programming languages.

Author: Giovanni Cervellati (IDC Research Manager, Data and Analytics)

The post Synthetic Data Powered Investment and Trading appeared first on Graph Massivizer EU Project.

Peracton Ltd.

In the ever-evolving world of finance, **Synthetic Data Driven Investment and Trading** is emerging as a hybrid approach, where financial algorithms are powered not only by traditional financial data (historic and live) but also by synthetic data [1].

It has the potential to redefine the financial markets landscape [2], both in terms of data markets and of the robustness and performance of algorithms.

**Synthetic Data and Its Impact**

Synthetic data is artificially generated data that mimics historic and real-time data in terms of essential characteristics. In the context of investment and trading, synthetic data can be used to simulate various market conditions and investment scenarios, thereby providing a rich and diverse dataset for analysis. It is generated using complex algorithms and can include a wide range of variables, such as stock prices, volume, fundamental data, and technical data, as well as variables for other securities such as options, futures, and commodities. The key element is that this data captures and reflects the statistical properties of real-world data while being completely artificial.
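As one illustrative way such data can be generated, a geometric Brownian motion produces price paths whose log-returns share the basic statistical shape of real series. The drift and volatility parameters below are made up for the sketch; real generators are calibrated to the market being mimicked.

```python
import numpy as np

def synthetic_prices(s0=100.0, mu=0.05, sigma=0.2, days=252, seed=0):
    """Simulate one year of daily closing prices with geometric Brownian motion.

    mu (annual drift) and sigma (annual volatility) are illustrative values,
    not calibrated to any real market.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / days
    # daily log-returns: deterministic drift plus Gaussian shocks
    log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * rng.normal(size=days)
    return s0 * np.exp(np.cumsum(log_returns))

prices = synthetic_prices()
print(len(prices), round(prices[-1], 2))
```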

Synthetic data serves as the training ground for investment and trading strategies. Machine learning models are used to analyse this data and identify patterns, correlations, and potential investment opportunities. These models are trained, tested, and refined repeatedly on the synthetic data until they achieve the desired level of accuracy and reliability.

The use of synthetic data can lead to significant improvements in **portfolio optimization, market anomaly prediction, and risk management.** By simulating a wide range of market conditions and scenarios, synthetic data allows fund managers to test their strategies in a risk-free environment before implementing them in the real market.

Once the investment strategies have been intensively tested on synthetic data, they are then applied to real market data. However, the transition from synthetic data to real data is not a simple one-to-one process: the real world is far more complex and unpredictable than any synthetic environment. To account for this, an intermediary process generically called backtesting is used. Backtesting involves applying the investment strategies to historical real-world data, which allows investors and traders to see how their strategies would have performed in the past and to make the necessary adjustments.
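The backtesting step can be illustrated with a toy moving-average crossover strategy. This is a sketch of the mechanics only (no look-ahead, compounded daily returns), not a recommended strategy, and the window sizes are arbitrary.

```python
import numpy as np

def backtest_ma_crossover(prices, fast=10, slow=50):
    """Toy backtest: hold the asset while the fast moving average is above
    the slow one; otherwise stay in cash."""
    prices = np.asarray(prices, dtype=float)
    fast_ma = np.convolve(prices, np.ones(fast) / fast, mode="valid")
    slow_ma = np.convolve(prices, np.ones(slow) / slow, mode="valid")
    n = min(len(fast_ma), len(slow_ma))
    # the position held on day t is decided from averages known at day t-1
    position = (fast_ma[-n:] > slow_ma[-n:])[:-1]
    daily_returns = np.diff(prices[-n:]) / prices[-n:-1]
    return float(np.prod(1.0 + position * daily_returns) - 1.0)  # total return

# Example on a synthetic upward-drifting series:
trend = 100 * np.exp(np.linspace(0.0, 0.2, 300))
print(f"strategy return: {backtest_ma_crossover(trend):.2%}")
```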

Furthermore, performance is continuously monitored on real market data. Sophisticated risk management techniques are used to ensure that strategies perform as expected and to adjust them as necessary based on real-world market conditions.

In essence, synthetic data serves as the training and testing ground, while the real market data is the ultimate playing field. The goal is to use synthetic data to develop investment strategies that can then navigate the complexities and uncertainties of the real financial markets.

**Graph-Massivizer Project and Synthetic Data**

The platform created within the Graph-Massivizer project will enable fast, semi-automated creation of realistic and affordable synthetic financial data sets in extreme quantities (PB level), unlimited in size and accessible for green investment and trading. Such data can be used for the following three core topics at the heart of investment and trading:

**Portfolio Optimization**

Synthetic data can help optimize portfolios by enabling traders and fund managers to test various portfolio combinations and strategies under different market conditions including green type of investments. This can lead to the creation of more robust and diversified portfolios that can withstand market volatility and deliver consistent returns.

**Modelling Market Anomalies**

Market anomalies, such as sudden price jumps or crashes, can significantly impact investment and trading performance. Synthetic data can help model these anomalies by simulating their occurrence and studying their impact on various investment strategies. This can enable traders and fund managers to devise strategies to mitigate the impact of these anomalies.

**Risk Management**

Risk management is a critical aspect of any investment and trading strategy. Synthetic data can enhance risk management by providing a comprehensive understanding of various risk factors and their interplay under different market conditions. This can help traders and fund managers to better manage risk and protect their investments.

**Addressing Concerns**

Despite the potential benefits, the use of synthetic data in investment and trading may raise concerns related to **transparency and investor confidence**. Investors and traders may be wary of the artificial nature of synthetic data and its implications for investment decisions.

To address these concerns, it is crucial to ensure that the process of generating and using synthetic data is transparent and well-documented. Investors and traders should be provided with clear explanations of how synthetic data is, or can be, used in investment decision-making and how it contributes to the overall performance of a portfolio.

Moreover, rigorous testing and validation of synthetic data can help build investor confidence. By demonstrating that synthetic data can accurately mimic real market conditions and contribute to successful investment strategies, hedge funds can convince investors of its value.

**Conclusion**

Synthetic Data Powered Investment and Trading represents a promising new frontier in the investment and trading world. By harnessing the power of synthetic data, investors and traders can optimize portfolios, predict market anomalies, and manage risk more effectively.

**References**

[1] Synthetic Equity Market Data, J.P.Morgan, (accessed March 2024) https://www.jpmorgan.com/technology/artificial-intelligence/initiatives/synthetic-data/synthetic-equity-market-data

[2] JPMorgan’s AI team might need synthetic data expertise, McMurray, A., Jan. 2024 (accessed March 2024) https://www.efinancialcareers.com/news/JPMorgan-Synthetic-Data

The post Trading in the Matrix appeared first on Graph Massivizer EU Project.

Peracton Ltd.

Imagine a scenario where extreme quantities of synthetic data are continuously generated and used to train multiple generations of AI-enhanced financial algorithms.

In this scenario, financial algorithms learn to make decisions based on artificially generated market conditions, while traders test their ideas, creating an endless stream of what-if scenarios and possible futures.

The algorithms learn and adapt to a wide range of market situations, on diverse and complex scenarios that may not occur frequently in real-world trading. This can help algorithms become more robust and adaptable, improving their performance in unpredictable market conditions.

**The Synthetic Training Ground**

Synthetic data generation utilizes advanced statistical techniques and machine learning algorithms to create realistic, yet hypothetical, market data sets. This data closely mimics real-world market dynamics, encompassing factors like price movements, volatility, trading volume, and various fundamental and technical indicators. By leveraging extreme quantities of synthetic data, algorithmic traders can:

- **Deep Stress Test Algorithms:** Algorithms are exposed to a multitude of extreme market conditions, including flash crashes, sudden economic shifts, and unforeseen geopolitical events. Rigorous and in-depth stress testing helps identify potential weaknesses and vulnerabilities, enabling traders to refine and fortify their algorithms pre-emptively.
- **Explore Unforeseen Scenarios:** The synthetic data multiverse allows for the exploration of rare or “black swan” events that may not occur frequently in historical data. By training algorithms on these simulated scenarios, traders can build in adaptability, allowing the algorithms to react effectively to unforeseen market disruptions.
- **Optimize Risk Management Strategies:** Through backtesting on synthetic data, traders can optimize risk management parameters within their algorithms. This allows for the creation of dynamic risk profiles that adjust based on the ever-evolving market conditions simulated in the synthetic environment.
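The stress-testing idea above can be sketched by injecting a flash crash into a synthetic price path and measuring a standard risk metric such as maximum drawdown. The shock size, timing, and path parameters are illustrative assumptions.

```python
import numpy as np

def inject_flash_crash(prices, day, drop=0.30):
    """Overlay a sudden one-day crash of `drop` (30% by default) on a path."""
    stressed = np.asarray(prices, dtype=float).copy()
    stressed[day:] *= 1.0 - drop
    return stressed

def max_drawdown(prices):
    """Largest peak-to-trough loss along the path, a standard risk metric."""
    peaks = np.maximum.accumulate(prices)
    return float(np.max(1.0 - prices / peaks))

# A calm synthetic path (steady 0.04% daily growth) versus a stressed copy.
base = 100 * np.exp(np.cumsum(np.full(252, 0.0004)))
stressed = inject_flash_crash(base, day=120)
print(f"drawdown calm: {max_drawdown(base):.1%}, stressed: {max_drawdown(stressed):.1%}")
```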

**Challenges and Safeguards:**

The benefits of synthetic data are numerous, but its utilization presents a unique set of challenges:

- **Data Quality:** The effectiveness of synthetic data hinges on its fidelity to real-world markets. If the statistical properties and relationships between variables are not accurately captured, the resulting algorithms may be misled, leading to suboptimal behaviour of the trading algorithms.
- **Explainability and Transparency:** As with any complex model, understanding the decision-making processes within an algorithm trained on synthetic data can be challenging. This lack of transparency could hinder regulatory oversight and make it difficult to pinpoint the source of errors or biases.

To mitigate these risks, robust frameworks must be put in place for investment and trading. Financial algorithms tested and consolidated with synthetic data must undergo additional tests before operating with live money. Additionally, advancements in explainable AI (XAI) are crucial to shed light on the inner workings of these algorithms, fostering trust and facilitating responsible deployment.

**The Ethical Imperative:**

Beyond regulatory considerations, the ethical implications of synthetic data require thoughtful exploration. One key concern is the potential for such synthetic data to be misused. However, the chances of such misuse are considered low, as synthetic data is used in a contained environment (a simulator or sandbox).

To ensure fair and ethical use, industry participants must adhere to strict ethical codes of conduct. Continuous dialogue between regulators, developers, and users is paramount to prevent the misuse of synthetic data and safeguard the stability of financial markets.

**Conclusion**

The integration of synthetic data into algorithmic trading represents a significant paradigm shift. This approach unlocks many possibilities, fostering the development of adaptable and more robust trading algorithms. However, navigating the challenges and ethical considerations associated with synthetic data is critical to ensure a healthy future for algorithmic trading and the financial landscape. As we delve deeper into this synthetic training ground, a commitment to responsible innovation and robust regulatory frameworks will be essential for harnessing the true potential of this transformative technology.

The post Building massive knowledge graphs using automated ETL pipelines appeared first on Graph Massivizer EU Project.

The post describes the process step by step and discusses how the Graph-Massivizer project supports the development of multiple large knowledge graphs, as well as the considerations to take into account when creating your own graph.

Keep reading HERE and stay tuned for more outcomes coming soon online!

The post From Big Data to Green Data: Reducing the Environmental Impact of Data Science with Graph Massivizer appeared first on Graph Massivizer EU Project.

According to a report by the International Energy Agency (IEA), the collection, storage, processing, and analysis of data account for an estimated 1 to 1.5 percent of global energy consumption. The same IEA report indicates that data centers and the data transmission network are responsible for approximately 1 percent of greenhouse gas emissions related to electricity production and consumption, significantly contributing to global warming. The use of non-renewable resources, such as fossil fuels and rare metals, in hardware production should not be overlooked either. In addition, data centers consume substantial amounts of water, as they require a constant temperature and humidity for optimal operation. Many of them use a liquid cooling system, which allows water recycling but also results in increased electricity consumption. While the continuous evolution of data science technologies advances technological progress, it also leads to rapid obsolescence of hardware and software. This inevitably results in electronic waste, which may contain toxic additives and hazardous substances if not properly disposed of. According to the United Nations (UN), the world generated 53.6 million metric tonnes (Mt) of electronic waste in 2019, and estimates predict that this figure will rise to 74.7 Mt by 2030.

In this context, the work of Graph Massivizer is of utmost importance. Focusing on the four areas investigated in the project (“Sustainable Green Finance,” “Global Environment Protection Foresight,” “Green Artificial Intelligence for the Sustainable Automotive Industry,” and “Data Centre Digital Twin for Exascale Computing”), Graph Massivizer aims to improve data analysis efficiency by 70 percent and reduce the energy impact of extract-transform-load operations on data by 30 percent. Moreover, it is expected to double data center energy efficiency and reduce greenhouse gas emissions associated with operations on graph-organized databases by over 25 percent. The “Data Centre Digital Twin” use case, involving CINECA and the Alma Mater Studiorum University of Bologna, is crucial. It revolves around creating a virtual representation of the world’s fourth-fastest supercomputer, LEONARDO, in digital graph form. This representation is fundamental for studying and comprehending its operation, enabling a clear and concise portrayal of all possible relationships within a complex structure like a data center. The study and analysis of these relationships lay the foundation for optimizing the efficiency and sustainability of the next generation of supercomputers, known as exascale supercomputers.

Data science and data processing not only play essential roles in the fight against climate change; they also have the potential to be indispensable tools for environmental sustainability. The Graph Massivizer project serves as an example of how technology can drive progress, ultimately reversing the course of climate change and reducing the environmental impact of groundbreaking discoveries.

CINECA, December 2023

The post The importance of the semantic knowledge graph appeared first on Graph Massivizer EU Project.

Have a look HERE and stay tuned for the next blog signed by metaphacts!
