
Google Summer Of Code 2018 - Final Report

Efficient implementation of a general Weisfeiler-Lehman algorithm and graph isomorphism in Sagemath

Note: the title, originally “Checking graph isomorphism using the Weisfeiler-Lehman algorithm”, was changed for this report, since the final result presented here doesn’t include any part strictly related to graph isomorphism testing via k-WL.

Overview

The project, spanning slightly over 3 months, consisted of improving the way SageMath checks for graph isomorphism.
The plan was to first implement an interface to one of the fastest (if not the fastest) open-source libraries for automorphism and isomorphism checking, nauty; such an interface could prove useful both as a standalone isomorphism checker and as a benchmarking tool.
Subsequently, an implementation of the Weisfeiler-Lehman (WL) method would be made. This method, described in a 1968 paper by Weisfeiler and Leman, computes the coherent closure of an arbitrary (di)graph in polynomial time. In simpler terms, it starts from an initial partition of the vertices (resp. edges) of a graph G and refines it through several passes, until the partition produced by a refinement round is equal to the one the round started with. This final, stable partition has the property of being equal to, or refined by, the orbits (resp. orbitals) of the original graph’s automorphism group, and we say that WL recognizes a graph if the resulting partition is exactly its orbits (resp. orbitals).
While this is called the Weisfeiler-Lehman algorithm, it would be more correct to speak of a family of algorithms, since for every positive integer k a version of WL (called k-WL) can be defined which recognizes strictly more graphs than any lower-order version1.

As said above, the scope of the project was implementing this family of algorithms, or at least part of it. In particular, the idea was to implement a generic, parametric method that could run any order of Weisfeiler-Lehman, if at all possible. Since it was impossible to tell whether this could be done before enough research had been put into how to actually implement WL, a fallback plan was prepared: manually implementing, separately, the first three orders of WL (the restriction to the first 3 was motivated by the results presented in this paper2).

Lastly, the project called for the implementation of an isomorphism-checking method that could use k-WL to distinguish graphs faster.

Trac Tickets

SageMath’s development process is based around git and an issue-tracking system called Trac. Specifically, every improvement, bug report/fix or new functionality must be reported in a ticket on https://trac.sagemath.org with a detailed description; there it is discussed, possibly implemented/solved and then reviewed, to be merged if the process completes successfully.

This means that the work I’ve done has been divided into tickets on the Trac system, and it is reported here with a similar structure, organized in chronological order.

To access a particular piece of work described here, along with its in-depth description, discussion and code (which is stored in a separate branch for each ticket), it suffices to click on the title of its subsection.

In one particular instance, the ticket’s original associated branch was changed because a complete rework of the implementation was required, which rendered the previous code obsolete. This means that two subsections of this document reference the same ticket; since the old branch is still online, the subsection relative to the old implementation links directly to that branch, skipping the ticket altogether.

#25391 - Issues in compiling SageMath

The subject of the first ticket I worked on, and also the first difficulty I encountered during my project, was getting SageMath to run at all on my system.
I started working on a PC running Fedora 28 64-bit with gcc 8.1, a fairly up-to-date setup, and this caused a few issues when compiling SageMath’s source code.
In particular, as can be seen in the ticket’s description, I had trouble with the python3, python2 and linbox packages.
Now, in all honesty, I could have worked around the issue entirely by switching to an older version of the distro, or to another distro altogether, but I thought this could be a good first contribution to SageMath, and the perfect way to gain familiarity with the Trac system and with Sage’s internal structure and workings before starting on code more relevant to my project.

The actual fixes to the issues I encountered were fairly simple, but debugging the system required a lot of effort, and all the help I could get from my mentor, Dmitrii Pasechnik.
After a couple of weeks of work I was able to produce patches for Sage that solved the problems listed in the ticket; some time later, however, the upstream maintainers of the affected packages released updated versions with the fixes included. The ticket, even if fairly successful in providing a solution to the issue, was thus closed as no longer needed once the packages inside Sage had been updated too.

Still, I consider this ticket very important in my journey, since it really helped me understand how the library I was working on was structured, which parts were where, and what to modify and how. In a sense, it served as a kind of workshop before starting work on the real thing.

#25506 - Nauty interface for isomorphism checking and automorphism group computing

My second ticket consisted of implementing an interface between Sage and nauty3, a very fast program for computing automorphism groups and canonical labels (and thus for checking for isomorphisms between graphs).
The reason I chose this as my second ticket, instead of implementing k-WL first and focusing on interfaces later, is a matter of priorities: I wanted to give something back to the SageMath project, and since implementing k-WL in an efficient and useful manner was a challenge with no guaranteed outcome, the only sure way of contributing was to make SageMath’s automorphism_group and is_isomorphic methods run faster, using a library that was included in Sage but never actually incorporated, that is, nauty.

My work was heavily based on pynauty, a GPLv3 Python module that lets Python code interface with nauty to compute automorphism groups and check for isomorphism.
While keeping the same name to give credit, the package I developed for Sage applies a large number of patches to pynauty’s source, essentially keeping only the C helper functions used to call into nauty.
First, I scrapped the graph-building part included in pynauty, since it was not needed for Sage, and modified the Python code so that a Sage Graph could be passed in. I then modified the C code to parse a SageMath graph and convert it into a representation usable by nauty, without any wasteful intermediate conversions. Finally, I made sure the returned automorphism groups were displayed in a format coherent with the one used by Sage’s current automorphism_group method; of course, the isomorphism-checking part needed no output adjustments, since it returns a simple boolean.
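
To give a concrete idea of the intended entry points, here is a hypothetical usage sketch; the `algorithm="nauty"` keyword below illustrates the dispatch mechanism, but it is not necessarily the ticket’s final API.

```python
from sage.graphs.graph_generators import graphs

G = graphs.PetersenGraph()
H = graphs.PetersenGraph()

# Hypothetical dispatch: ask Sage's existing methods to delegate the
# computation to the nauty interface instead of the default backend
# (the keyword value is illustrative, not the ticket's final name).
aut = G.automorphism_group(algorithm="nauty")
same = G.is_isomorphic(H, algorithm="nauty")
```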

My work on the interface didn’t finish here, though: nauty’s functions, while very fast, were in this form basically useless for Sage, since they require an input (di)graph with no edge labels and no multiple edges, a very strong restriction for a generic graph library.
The only workaround I could find is described in section 14 of nauty’s documentation, to which I added a quick fix to allow for multiple edges (a sketch of the layer construction follows the list):

  1. If the graph is a multigraph, the multiple edges are removed and their multiplicity is translated into a label: a number if the original edge had no label, or a pair with the original label as the first member and the multiplicity as the second otherwise; of course, this means that multiple edges between the same endpoints must at least carry the same labels.
  2. If the graph is edge-labelled (or if it was a multigraph), convert the labels into m consecutive integers starting from 1, where m is the number of unique edge labels in the graph; if we’re dealing with an isomorphism-checking problem, this step must be done on the disjoint union of the two graphs being checked, that is, if two edges in the two graphs have the same label, they must end up with the same integer label.
  3. Each of the n vertices is converted into a (strongly) connected component of \(\lceil \log_2(m+1) \rceil\) vertices (for a total of \(n \lceil \log_2(m+1) \rceil\) vertices), and the edges between the new vertices are determined by the binary representation of the original edge labels; e.g. if between vertices a and b there was an edge with label 5, the new graph will have edges \((a_0, b_0)\) and \((a_2, b_2)\), since the number 5 has only its first and third bits set.
  4. Now, the only thing needed is restricting the action of the computed automorphism group to the vertices of the form \(v_0\), while respecting the original partition, if any, provided to the method; this gives us the automorphism group of the edge-labelled (multi)graph.
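
Below is a minimal Python sketch of the layer construction of step 3, under my own naming; it illustrates the idea rather than reproducing the interface’s code, and assumes vertices numbered 0..n-1 with labels being the integers 1..m produced by step 2.

```python
import math

def layered_graph(n, labelled_edges):
    """Layered unlabelled graph of an edge-labelled graph, in the
    spirit of section 14 of nauty's manual (a sketch, names are mine).

    `labelled_edges` maps (u, v) -> label in 1..m; vertex v in layer i
    is encoded as v + i * n. For an undirected graph, read each
    returned pair as an undirected edge."""
    m = max(labelled_edges.values())
    layers = math.ceil(math.log2(m + 1))  # bits needed for labels 1..m
    edges = set()
    # Vertical paths tie the copies of each vertex together, so an
    # automorphism of the layered graph moves all layers of a vertex
    # at once (layers are also kept apart via vertex colouring).
    for v in range(n):
        for i in range(layers - 1):
            edges.add((v + i * n, v + (i + 1) * n))
    # Horizontal edges in layer i encode bit i of each edge label.
    for (u, v), label in labelled_edges.items():
        for i in range(layers):
            if (label >> i) & 1:
                edges.add((u + i * n, v + i * n))
    return layers, edges
```

For instance, an edge with label 5 produces horizontal copies in layers 0 and 2, matching the example in step 3.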

Finally, I went through Sage’s graph automorphism and isomorphism methods, making them use my interface and adding all the needed glue code, plus modifying the documentation and adding tests for my algorithm. This was, in my opinion, the most tedious part of the ticket, since it required going through all the existing documentation and adapting it, making sure to keep the same level of quality as the original throughout.

The main difficulties of this ticket (my first real work on the project) were understanding how to correctly modify the source code so that it could interact with Sage, and understanding the input and output conventions used by Sage’s graph methods. Overall, though, the hardest part was wrapping my head around more advanced concepts about automorphism groups and isomorphism checking, a feat that required a large amount of studying rather than actual programming, but that would prove useful in my quest to implement k-WL.

#25802 - Efficient k-WL implementation (Part 1)

As anticipated, this subsection is about my first take at a k-WL implementation: the ticket now hosts the second implementation, but the link above takes the reader to the first implementation’s branch.

This implementation was based on a book draft4 describing what is nominally the best-known implementation of k-WL (in terms of running time).

The algorithm I devised to put that description into code essentially generated and memorised all the tuples in \(V^k\) as keys of a hashmap, associating a colour with each of them. The starting colour of each tuple was determined by the equivalence classes induced by their ATPs (Atomic TyPes), that is, by the adjacency matrices of the subgraphs of the original graph G induced by the tuples themselves.
These adjacency matrices were computed, and all tuples with the same ATP were put in the same equivalence class and thus received the same initial colouring; of course, to simplify the code and make execution faster, colours were represented as integers, so after the division into equivalence classes each class was assigned a unique integer identifier ranging from 0 to the number of distinct classes minus one.
After this initialisation phase, the main part of the algorithm begins: a loop that executes a refinement round on the colour classes the tuples are divided into, and goes on until the colour classes obtained after a refinement round are the same as they were before the round started, meaning that they now define a stable colouring for the current order of WL.
Each refinement round is in theory quite simple: for each k-tuple \(t\), compute a multiset of the current colours of the k-tuples obtained by swapping each element of \(t\) with every vertex \(v\) of \(G\), grouping by \(v\) and then sorting the groups; the equivalence classes of these multisets can then be used to induce new colour classes on the k-tuples.
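
As a reference for what a round computes, here is a deliberately naive Python sketch of this scheme for a plain Sage graph; it is nothing like the optimised code in the branch, and `atomic_type` is my own simplified stand-in for the ATP computation.

```python
from itertools import product

def atomic_type(G, t):
    # Simplified ATP: the adjacency pattern of the tuple together with
    # the equality pattern among its entries.
    return (tuple(G.has_edge(u, v) for u in t for v in t),
            tuple(u == v for u in t for v in t))

def naive_k_wl(G, k):
    V = list(G)  # iterating a Sage graph yields its vertices
    # Initial colouring: one small integer per ATP equivalence class.
    classes = {}
    colour = {t: classes.setdefault(atomic_type(G, t), len(classes))
              for t in product(V, repeat=k)}
    while True:
        new_classes, new_colour = {}, {}
        for t in product(V, repeat=k):
            # For each vertex v, the colours of the k tuples obtained
            # by putting v in each position of t; sorting the per-v
            # groups turns the collection into a multiset.
            sig = (colour[t], tuple(sorted(
                tuple(colour[t[:i] + (v,) + t[i + 1:]] for i in range(k))
                for v in V)))
            new_colour[t] = new_classes.setdefault(sig, len(new_classes))
        if len(new_classes) == len(classes):
            return new_colour  # no class was split: stable colouring
        classes, colour = new_classes, new_colour
```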

Since this algorithm had to work with a decently large number of vertices and at least the first few possible values of k, special care was needed to ensure that these multisets occupied as little memory as possible, and only for the time needed to compute the new colour classes, and that sorting and dividing into colour classes was done as efficiently as possible.
One first way this was achieved was by recognising that, since each round refines a partition, two tuples from different initial colour classes can never end up in the same colour class at the end of a round: this meant that multisets could be built on a per-colour-class basis, and then deleted soon after computing that specific colour class’s refinement. Another way of improving space efficiency would have been to compact the multisets, storing a counter for each unique value instead of storing repeated values multiple times, but this improvement never saw the light of day because the implementation was scrapped first.

To both sort the multisets and divide the tuples into classes based on them, a hybrid sorting algorithm was implemented, using either a bucketed counting sort or the standard C++ sort implementation depending on the size of the set to be sorted and on the range of values therein.
The choice of a bucketed counting sort was due to the fact that each multiset was really a set of tuples, and that an ATP could be considered an array, so every sort needed by the algorithm consisted in sorting sets of fixed-length, equally long sequences of integers.
The idea was to first order the whole set of sequences on each sequence’s first value using counting sort, then divide the outer set into buckets whose elements share the same first value, and call the sorting procedure recursively on each newly built bucket. At this point, the procedure decides, based on the range of values in the bucket and on its size, whether C++’s sort would beat another counting sort, and sorts the bucket accordingly on its second value; the same is then done for the third value, the fourth, and so on, until a bucket ends up with only one item and is kept as is, to be later chained with the others to obtain the final sorted set.
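
A compact Python rendition of that recursion follows (the real code was C++; the switching thresholds here are placeholders for the actual tuning):

```python
def hybrid_sort(rows, pos=0, small=8):
    """Lexicographically sort equal-length integer tuples one column
    at a time; a sketch of the idea, not the project's C++ code."""
    if len(rows) <= 1 or pos == len(rows[0]):
        return rows
    lo = min(r[pos] for r in rows)
    hi = max(r[pos] for r in rows)
    # Heuristic switch: use the library sort when the bucket is small
    # or its value range is wide relative to its size (placeholder
    # thresholds, the real ones were tuned experimentally).
    if len(rows) <= small or hi - lo > 4 * len(rows):
        return sorted(rows, key=lambda r: r[pos:])
    # Counting-sort style distribution on the current column...
    buckets = [[] for _ in range(hi - lo + 1)]
    for r in rows:
        buckets[r[pos] - lo].append(r)
    # ...then sort each bucket on the following columns and chain.
    out = []
    for b in buckets:
        out.extend(hybrid_sort(b, pos + 1, small))
    return out
```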

This algorithm’s time performance was very good given the amount of work it needed to do, but after discussing it with my mentor, it became exceedingly clear that the algorithm could be implemented in a different way that was both faster and less memory-hungry. Still, being very straightforward and linear, this implementation proved a perfect correctness-checking tool for my second implementation later on, before I could develop a more precise way of testing.

There was no major difficulty during this part of the project, but rather a long series of minor, distressing inconveniences. In no particular order: implementing the particular sorting method described above and getting it to work reliably, given its convoluted nature, was no easy task; understanding how k-WL should work from a summary description in a paper, through trial and error, also proved quite difficult. All in all, this part required a lot of time to bring to completion, but was very useful in understanding enough of k-WL to be able to implement the second version of the algorithm later on.

#25891 - Implementing generators for Hamming graphs, Egawa graphs and Cai-Furer-Immerman graphs

While testing my first implementation and discussing it with my mentor, and later while we discussed the new implementation and how it should work, I decided, to avoid staying idle, to implement a few SageMath generators for graphs with interesting properties related to k-WL, which would make them useful for future tests or benchmarks.

Specifically, I implemented generators for the following families:

  1. Hamming graphs \(H(n, q)\)
  2. Egawa graphs5
  3. Cai-Fürer-Immerman graphs1

#25802 - Efficient k-WL implementation (Part 2)

This is the last piece of work I did for the project, and also one of the most difficult to develop, both in the theory required and in implementation effort.
Again, as with the idea of k-WL itself, the implementation is linear and not that difficult to explain, but getting to a working implementation, and understanding why it should work the way it does, required a lot of studying and help from my mentor, whom I’d really like to thank.
The idea is simple: in the same way as ATPs could distinguish tuples and give them initial colours in the first implementation of k-WL, particular subgraphs can be constructed which, when compared, divide tuples into classes and thus colour them.
Clearly, this is a very basic and slightly imprecise way of explaining it, so I will now delve further into the details.

The brilliant idea behind this implementation is that there is no need to waste resources actually enumerating and storing all possible k-tuples in \(V^k\): two edges of \(G\) can be distinguished (or coloured the same) based on the set of subgraphs of \(G\) that can be constructed with the edge’s endpoints in their vertex set. If a subgraph can be constructed for one edge that cannot be constructed for the other, the two are clearly not in the same orbital; otherwise they are put in the same colour class for this round, though they might still get separated later on if they are not actually in the same orbital.

Of course, some data will need to be stored if we want to distinguish edges efficiently. The basic form of the algorithm goes as follows (a sketch of the fingerprinting step follows the list):

  1. An adjacency matrix of the graph is created, having any edge labels as entries for pairs of distinct vertices (i.e. \(A_{u,v}\) is an edge label for \(u \neq v\)) and the initial colourings of the vertices on the main diagonal (i.e. the initial colour of vertex \(v\) is stored in \(A_{v,v}\)). This approach clearly doesn’t directly support self-loops, but it’s easy enough to remove them from a graph through initial vertex colouring, or by splitting a node with a self-loop into two different nodes, so this is not really an issue. Furthermore, note the importance of keeping the entries on the main diagonal disjoint from those in the rest of the matrix: if this is not the case, subgraphs that have multiple copies of the same vertex in their vertex set might appear isomorphic to subgraphs to which they are clearly not, when compared via equality of their canonical forms’ adjacency matrices.
  2. Create colour classes out of every possible pair of vertices in \(G\), based on the corresponding entry in the adjacency matrix, and store them in a queue.
  3. For each colour class, one at a time from the queue, create a fingerprint database, where a fingerprint (used as a key) is just a variable-length tuple of integers and its associated value is a set of pairs of vertices (from now on these will be called edges, even if technically they are actual edges only in the complete graph on \(G\)’s vertex set).
  4. For each edge in the currently considered colour class, generate every possible vertex set of cardinality \(k+1\) that contains both the extremes of the edge at least once, and append the results to a list of computed subgraphs that is shared by every edge in the colour classes of this refinement round. The algorithm can compute the adjacency matrix of each subgraph on the fly from its vertex set by traversing the main adjacency matrix accordingly, with the caveat of individualising the edge it’s working on: any entry relative to the ordered pair of vertices forming the edge is considered to have a value that does not occur in the main adjacency matrix, but that is the same for every individualised edge in the refinement round.
  5. While generating the subgraphs for an edge as described above, also build a naturally sorted list of pairs of integers \((a, b)\), where \(a\) is the index of a computed adjacency matrix and \(b\) is the number of times such an adjacency matrix was computed (of course, two different vertex sets can produce an identical adjacency matrix).
  6. After generating all the subgraphs for an edge, the complete list built in point 5 (once flattened) is used as the edge’s fingerprint and stored in the database of point 3 accordingly. It’s easy to see that two edges that must end up in the same colour class will build the same fingerprint, and that the fact that the list of computed subgraphs in point 4 is built incrementally doesn’t influence this: the first edge appends all the needed adjacency matrices and the second re-uses them without needing to add any. If, per absurdum, the second edge appended any new adjacency matrix, then clearly the sets of generated subgraphs for the two edges would not be the same, and they could not be in the same colour class.
  7. Finally, use the database’s values to create the new colour classes for the next round (adding them to a temporary common queue) and update a temporary adjacency matrix with the newly computed labels, then continue with the next colour class (after clearing the current database, of course, since there’s no way any current edge could mix with the next colour class).
  8. After every colour class of the current refinement round has been processed (i.e. when the original queue is empty), if any of the colour classes was refined, copy the temporary queue and matrix of point 7 over the main ones and proceed with another refinement round; otherwise stop.
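
To make steps 4-6 concrete, here is a Python sketch of the fingerprinting of a single edge. Every name is mine; the brute-force `canonical_form` stands in for the much faster machinery described below, and `A` is the adjacency/colour matrix of step 1.

```python
from itertools import combinations_with_replacement, permutations

def canonical_form(mat, s):
    # Brute-force canonical form of a flattened s-by-s matrix: the
    # lexicographically smallest simultaneous row/column permutation
    # (affordable only because s = k + 1 is tiny).
    return min(tuple(mat[p[a] * s + p[b]] for a in range(s) for b in range(s))
               for p in permutations(range(s)))

def edge_fingerprint(A, u, v, k, seen, INDIVIDUALISED=-1):
    """Fingerprint of the pair (u, v): sorted (matrix index, count)
    pairs. `seen` maps canonical matrices to their indices and is
    shared by all edges of the current refinement round."""
    n, s = len(A), k + 1
    counts = {}
    for rest in combinations_with_replacement(range(n), k - 1):
        verts = (u, v) + rest
        # Read the subgraph's matrix off A on the fly, individualising
        # the working edge with a value absent from A.
        mat = tuple(INDIVIDUALISED if (a, b) == (u, v) else A[a][b]
                    for a in verts for b in verts)
        idx = seen.setdefault(canonical_form(mat, s), len(seen))
        counts[idx] = counts.get(idx, 0) + 1
    return tuple(sorted(counts.items()))
```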

Improvements on the basic version

Of course, this type of implementation, at least as described, requires even more processing and memory resources, since it has to compute \(n^{k+1}\) vertex sets (where \(n\) is the number of distinct vertices) for each of the \(n^2\) edges in the adjacency matrix, then compare each of the computed sets’ adjacency matrices with the list of computed subgraphs to create the fingerprint, and then do all this again for as many rounds as needed to reach a stable result. Even without going into details, the time complexity would be \(\Omega(n^{k+3})\), and that’s a very relaxed lower bound. Storing all the adjacency matrices is not a very good idea either, since that would require a huge amount of memory, though quantifying it precisely is difficult.

A few improvements were then made to speed up the implementation and reduce the memory footprint:

In the end, it was time to improve the improvement, that is, to lower the time needed to compute the canonical form of each subgraph. This is the set of improvements on the canonical form computation that finally allowed this implementation to become faster than the previous one while reducing memory usage at the same time:

Other small improvements (which do not change the asymptotic complexity) include making good use of the processor’s cache, reusing data structures and memory (such as the support matrix mentioned right above) and only resetting values when needed (i.e., again in the case of the matrix above, there’s no need to zero out the entire matrix, just the at most \((k+1)^2\) entries modified in the last run of the canonical-form computing function).

Correctness checking test

A quick mention is needed for the correctness-checking test I added to k-WL’s code: a separate method that, albeit slowly, computes the orbitals (or orbits) of a given graph \(G\) using Sage’s already existing methods and checks them against k-WL’s result, returning a Correct, Wrong or Refinable judgement, the last one meaning that k-WL’s result was a union of the correct orbits and a higher value of \(k\) might be able to return the correct ones.

The way the method accomplishes this is actually very simple: compute the true orbitals from the automorphism group and compare the two partitions, along the lines of the sketch below.
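
In outline (a sketch with my own naming; it assumes the automorphism group elements act directly on \(G\)’s vertices):

```python
def wl_judgement(G, wl_classes):
    """Compare k-WL's colour classes on ordered vertex pairs with the
    true orbitals of Aut(G); slow but simple."""
    auts = list(G.automorphism_group())  # enumerate the whole group
    # True orbitals: orbits of Aut(G) on ordered pairs, computed by
    # applying every automorphism to every pair.
    orbitals, seen = [], set()
    for u in G:
        for v in G:
            if (u, v) in seen:
                continue
            orb = frozenset((g(u), g(v)) for g in auts)
            orbitals.append(orb)
            seen |= orb
    wl = [frozenset(c) for c in wl_classes]
    if all(o in wl for o in orbitals):
        return "Correct"    # the two partitions coincide
    if all(any(o <= c for c in wl) for o in orbitals):
        return "Refinable"  # every WL class is a union of orbitals
    return "Wrong"
```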

Current status and future work

The tickets’ current status in the Trac system is that they are all ready to be officially reviewed (except for the first one which, as said, has been closed because it is no longer needed); in particular, #25506 is basically ready to be merged.
Each ticket’s code is complete with documentation and tests and has been unofficially reviewed both by my mentor, Dmitrii Pasechnik, whom I want to particularly thank, and by David Coudert, whom I thank for the time he spent helping me without any obligation to do so.

Future work (and such a future could very well be near, since I intend to continue working on the project after the end of GSoC 2018) includes


  1. Jin-yi Cai, Martin Fürer, and Neil Immerman. An optimal lower bound on the number of variables for graph identifications. Combinatorica, 12(4):389–410, 1992.

  2. S. Kiefer, I. Ponomarenko and P. Schweitzer, “The Weisfeiler-Leman dimension of planar graphs is at most 3,” 2017 32nd Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), Reykjavik, 2017, pp. 1-12. doi: 10.1109/LICS.2017.8005107 

  3. McKay, B.D. and Piperno, A., Practical Graph Isomorphism, II, Journal of Symbolic Computation, 60 (2014), pp. 94-112, http://dx.doi.org/10.1016/j.jsc.2013.09.003 

  4. Martin Grohe, Descriptive Complexity, Canonisation, and Definable Graph Structure Theory (book draft), https://www.lii.rwth-aachen.de/images/Mitarbeiter/pub/grohe/cr.pdf

  5. Yoshimi Egawa, Characterization of H(n, q) by the parameters, Journal of Combinatorial Theory, Series A, Volume 31, Issue 2, 1981, Pages 108-125, ISSN 0097-3165, https://doi.org/10.1016/0097-3165(81)90007-8 

  6. Cameron, P. (1999). Permutation Groups (London Mathematical Society Student Texts). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511623677