How UMAP Works

UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. It provides a very general framework for approaching manifold learning and dimension reduction, but can also provide specific concrete realizations. This article will discuss how the algorithm works in practice. There exist deeper mathematical underpinnings, but for the sake of readability by a general audience these will merely be referenced and linked. If you are looking for the mathematical description please see the UMAP paper.

To begin making sense of UMAP we will need a little bit of mathematical background from algebraic topology and topological data analysis. This will provide a basic algorithm that works well in theory, but unfortunately not so well in practice. The next step will be to make use of some basic Riemannian geometry to bring real world data a little closer to the underlying assumptions of the topological data analysis algorithm. Unfortunately this will introduce new complications, which will be resolved through a combination of deep math (details of which will be elided) and fuzzy logic. We can then put the pieces back together again, and combine them with a new approach to finding a low dimensional representation more fitting to the new data structures at hand. Putting this all together we arrive at the basic UMAP algorithm.

Topological Data Analysis and Simplicial Complexes

Simplicial complexes are a means to construct topological spaces out of simple combinatorial components. This allows one to reduce the complexities of dealing with the continuous geometry of topological spaces to the task of relatively simple combinatorics and counting. This method of taming geometry and topology will be fundamental to our approach to topological data analysis in general, and dimension reduction in particular.

The first step is to provide some simple combinatorial building blocks called *simplices*. Geometrically a simplex is a very simple way to build a \(k\)-dimensional object. A \(k\) dimensional simplex is called a \(k\)-simplex, and it is formed by taking the convex hull of \(k+1\) independent points. Thus a 0-simplex is a point, a 1-simplex is a line segment (between two 0-simplices), a 2-simplex is a triangle (with three 1-simplices as “faces”), and a 3-simplex is a tetrahedron (with four 2-simplices as “faces”). Such a simple construction allows for easy generalization to arbitrary dimensions.

[Figure: the 0-, 1-, 2-, and 3-simplices, drawn as a point, a line segment, a triangle, and a tetrahedron]
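In coordinates this construction is easy to write down: the \(k\)-simplex spanned by affinely independent points \(x_0, \ldots, x_k\) is simply their convex hull,

\[\left\{ \sum_{i=0}^{k} t_i x_i \;\middle|\; t_i \geq 0,\ \sum_{i=0}^{k} t_i = 1 \right\}.\]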

This has a very simple combinatorial underlying structure, and ultimately one can regard a \(k\)-simplex as an arbitrary set of \(k+1\) objects with faces (and faces of faces etc.) given by appropriately sized subsets – one can always provide a “geometric realization” of this abstract set description by constructing the corresponding geometric simplex.
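To make the combinatorial viewpoint concrete, here is a minimal Python sketch (Python being the language of the umap library itself) that enumerates the faces of an abstract simplex as subsets of its vertex set; the function name `faces` is purely illustrative:

```python
from itertools import combinations

def faces(simplex):
    """All proper faces of an abstract simplex given as a set of vertices.

    A j-dimensional face of a k-simplex is just a (j + 1)-element subset
    of its k + 1 vertices, so enumerating faces (and faces of faces,
    etc.) is pure combinatorics.
    """
    verts = sorted(simplex)
    return [frozenset(c)
            for size in range(1, len(verts))          # subset sizes 1 .. k
            for c in combinations(verts, size)]

# The abstract 3-simplex (a tetrahedron) on vertices {0, 1, 2, 3} has
# 4 vertices, 6 edges and 4 triangular faces: 14 proper faces in all.
print(len(faces({0, 1, 2, 3})))                       # -> 14
```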

Simplices can provide building blocks, but to construct interesting topological spaces we need to be able to glue together such building blocks. This can be done by constructing a *simplicial complex*. Ostensibly a simplicial complex is a set of simplices glued together along faces. More explicitly a simplicial complex \(\mathcal{K}\) is a set of simplices such that any face of any simplex in \(\mathcal{K}\) is also in \(\mathcal{K}\) (ensuring all faces exist), and the intersection of any two simplices in \(\mathcal{K}\) is a face of both simplices. A large class of topological spaces can be constructed in this way – just gluing together simplices of various dimensions along their faces. A little further abstraction will get to *simplicial sets*, which are purely combinatorial, have a nice category theoretic presentation, and can generate a much broader class of topological spaces, but that will take us too far afield for this article. The intuition of simplicial complexes will be enough to illustrate the relevant ideas and motivation.

How does one apply these theoretical tools from topology to finite sets of data points? To start we’ll look at how one might construct a simplicial complex from a topological space. The tool we will consider is the construction of a Čech complex given an open cover of a topological space. That’s a lot of verbiage if you haven’t done much topology, but we can break it down fairly easily for our use case. An open cover is essentially just a family of sets whose union is the whole space, and a Čech complex is a combinatorial way to convert that into a simplicial complex. It works fairly simply: let each set in the cover be a 0-simplex; create a 1-simplex between two such sets if they have a non-empty intersection; create a 2-simplex between three such sets if the triple intersection of all three is non-empty; and so on. Now, that doesn’t sound very advanced – just looking at intersections of sets. The key is that the background topological theory actually provides guarantees about how well this simple process can produce something that represents the topological space itself in a meaningful way (the Nerve theorem is the relevant result for those interested). Obviously the quality of the cover is important, and finer covers provide more accuracy, but the reality is that despite its simplicity the process captures much of the topology.

Next we need to understand how to apply that process to a finite set of data samples. If we assume that the data samples are drawn from some underlying topological space then to learn about the topology of that space we need to generate an open cover of it. If our data actually lie in a metric space (i.e. we can measure distance between points) then one way to approximate an open cover is to simply create balls of some fixed radius about each data point. Since we only have finite samples, and not the topological space itself, we cannot be sure it is truly an open cover, but it might be as good an approximation as we could reasonably expect. This approach also has the advantage that the Čech complex associated to the cover will have a 0-simplex for each data point.
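For equal-radius balls the 0- and 1-simplices of the resulting Čech complex can be read straight off the pairwise distances: two balls of radius \(r\) intersect exactly when their centers lie within \(2r\) of each other. A minimal sketch of that computation (the function name and toy data are illustrative, not part of the umap library):

```python
import numpy as np
from itertools import combinations

def cech_one_skeleton(points, radius):
    """0- and 1-simplices of the Cech complex of equal-radius balls.

    Two balls of radius r intersect exactly when their centers are
    within 2r of each other, so the 1-skeleton needs only pairwise
    distances between the data points.
    """
    n = len(points)
    zero_simplices = list(range(n))          # one 0-simplex per data point
    one_simplices = [(i, j) for i, j in combinations(range(n), 2)
                     if np.linalg.norm(points[i] - points[j]) <= 2 * radius]
    return zero_simplices, one_simplices

rng = np.random.default_rng(42)
data = rng.normal(size=(30, 2))              # a small toy 2-d point cloud
_, edges = cech_one_skeleton(data, radius=0.5)
```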

To demonstrate the process let’s consider a test dataset like this

[Figure: the test dataset]

If we fix a radius we can then picture the open sets of our cover as circles (since we are in a nice visualizable two dimensional case). The result is something like this

[Figure: the open cover, drawn as fixed-radius circles about each data point]

We can then depict the simplicial complex of 0-, 1-, and 2-simplices as points, lines, and triangles

[Figure: the resulting simplicial complex of points, lines, and triangles]

It is harder to easily depict the higher dimensional simplices, but you can imagine how they would fit in. There are two things to note here: first, the simplicial complex does a reasonable job of starting to capture the fundamental topology of the dataset; second, most of the work is really done by the 0- and 1-simplices, which are easier to deal with computationally (it is just a graph, in the nodes and edges sense). The second observation motivates the Vietoris-Rips complex, which is similar to the Čech complex but is entirely determined by the 0- and 1-simplices. Vietoris-Rips complexes are much easier to work with computationally, especially for large datasets, and are one of the major tools of topological data analysis.
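Being “entirely determined by the 0- and 1-simplices” means that in a Vietoris-Rips complex a set of points spans a simplex as soon as every pair among them is joined by an edge, i.e. the higher simplices are just the cliques of the neighborhood graph. A hedged sketch extending an edge list to its 2-simplices (again the function name is illustrative):

```python
from itertools import combinations

def rips_two_simplices(n_points, edges):
    """2-simplices of a Vietoris-Rips complex, given its 1-skeleton.

    In a Rips complex a triple of points spans a triangle as soon as
    all three of its edges are present (a clique in the graph); no
    triple-intersection test is needed, which is what makes Rips
    complexes so much cheaper to compute than Cech complexes.
    """
    edge_set = {frozenset(e) for e in edges}
    return [tri for tri in combinations(range(n_points), 3)
            if all(frozenset(pair) in edge_set
                   for pair in combinations(tri, 2))]

print(rips_two_simplices(4, [(0, 1), (0, 2), (1, 2), (2, 3)]))  # [(0, 1, 2)]
```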

If we take this approach to get a topological representation then we can build a dimension reduction algorithm by finding a low dimensional representation of the data that has a similar topological representation. If we only care about the 0- and 1-simplices then the topological representation is just a graph, and finding a low dimensional representation can be described as a graph layout problem. If one wants to use, for example, spectral methods for graph layout then we arrive at algorithms like Laplacian eigenmaps and Diffusion maps. Force directed layouts are also an option, and provide algorithms closer to MDS or Sammon mapping in flavour.

I would not blame those who have read this far for wondering why we took such an abstract roundabout road to simply building a neighborhood-graph on the data and then laying out that graph. There are a couple of reasons. The first reason is that the topological approach, while abstract, provides sound theoretical justification for what we are doing. While building a neighborhood-graph and laying it out in lower dimensional space makes heuristic sense and is computationally tractable, it doesn’t provide the same underlying motivation of capturing the underlying topological structure of the data faithfully – for that we need to appeal to the powerful topological machinery I’ve hinted lies in the background. The second reason is that it is this more abstract topological approach that will allow us to generalize the approach and get around some of the difficulties of the sorts of algorithms described above. While ultimately we will end up with a process that is fairly simple computationally, understanding why various manipulations matter is important to truly understanding the algorithm (as opposed to merely computing with it).

Adapting to Real World Data

The approach described above provides a nice theory for why a neighborhood graph based approach should capture manifold structure when doing dimension reduction. The problem tends to come when one tries to put the theory into practice. The first obvious difficulty (and we can see it even in our example above) is that choosing the right radius for the balls that make up the open cover is hard. If you choose something too small the resulting simplicial complex splits into many connected components. If you choose something too large the simplicial complex turns into just a few very high dimensional simplices (and their faces etc.) and fails to capture the manifold structure anymore. How should one solve this?

The dilemma is in part due to the theorem (called the Nerve theorem) that provides our justification that this process captures the topology. Specifically, the theorem says that the simplicial complex will be (homotopically) equivalent to the union of the cover. In our case, working with finite data, the cover, for certain radii, doesn’t cover the whole of the manifold that we imagine underlies the data – it is that lack of coverage that results in the disconnected components. Similarly, where the points are too bunched up, our cover does cover “too much” and we end up with higher dimensional simplices than we might ideally like. If the data were uniformly distributed across the manifold then selecting a suitable radius would be easy – the average distance between points would work well. Moreover with a uniform distribution we would be guaranteed that our cover would actually cover the whole manifold with no “gaps” and no unnecessarily disconnected components. Similarly, we would not suffer from those unfortunate bunching effects resulting in unnecessarily high dimensional simplices.

If we consider data that is uniformly distributed along the same manifold it is not hard to pick a good radius (a little above half the average distance between points) and the resulting open cover looks pretty good:

[Figure: the open cover of uniformly distributed data]

Because the data is evenly spread we actually cover the underlying manifold and don’t end up with clumping. In other words, all this theory works well assuming that the data is uniformly distributed over the manifold.

Unsurprisingly this uniform distribution assumption crops up elsewhere in manifold learning. The proofs that Laplacian eigenmaps work well require the assumption that the data is uniformly distributed on the manifold. Clearly if we had a uniform distribution of points on the manifold this would all work a lot better – but we don’t! Real world data simply isn’t that nicely behaved. How can we resolve this? By turning the problem on its head: assume that the data is uniformly distributed on the manifold, and ask what that tells us about the manifold itself. If the data looks like it isn’t uniformly distributed that must simply be because the notion of distance is varying across the manifold – space itself is warping: stretching or shrinking according to where the data appear sparser or denser.

By assuming that the data is uniformly distributed we can actually compute (an approximation of) a local notion of distance for each point by making use of a little standard Riemannian geometry. In practical terms, once you push the math through, this turns out to mean that a unit ball about a point stretches to the k-th nearest neighbor of the point, where k is the sample size we are using to approximate the local sense of distance. Each point is given its own unique distance function, and we can simply select balls of radius one with respect to that local distance function!

[Figure: balls of locally varying radius about each data point, each stretching to its k-th nearest neighbor]
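In code this amounts to measuring, for each point, the distance to its k-th nearest neighbor and treating that as the radius of the local unit ball. A minimal sketch using scikit-learn (the function name `local_scales` is illustrative; umap-learn has its own internal machinery for this):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_scales(data, k):
    """Distance from each point to its k-th nearest neighbor.

    Under the uniform-distribution assumption this distance is the
    radius of the 'unit ball' in each point's own local metric:
    rescaling distances from point i by local_scales(data, k)[i] puts
    its k-th neighbor at distance exactly 1.
    """
    # k + 1 neighbors because each point is returned as its own
    # nearest neighbor at distance 0.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(data).kneighbors(data)
    return dists[:, -1]

rng = np.random.default_rng(0)
print(local_scales(rng.normal(size=(100, 5)), k=10)[:3])
```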

This theoretically derived result matches well with many traditional graph based algorithms: a standard approach for such algorithms is to use a k-neighbor graph instead of using balls of some fixed radius to define connectivity. What this means is that each point in the dataset is given an edge to each of its k nearest neighbors – the effective result of our locally varying metric with balls of radius one. Now, however, we can explain why this works in terms of simplicial complexes and the Nerve theorem.

Of course we have traded choosing the radius of the balls for choosing a value for k. However it is often easier to pick a resolution scale in terms of number of neighbors than it is to correctly choose a distance. This is because choosing a distance is very dataset dependent: one needs to look at the distribution of distances in the dataset to even begin to select a good value. In contrast, while a k value is still dataset dependent to some degree, there are reasonable default choices, such as the 10 nearest neighbors, that should work acceptably for most datasets.

At the same time the topological interpretation of all of this gives us a more meaningful interpretation of k. The choice of k determines how locally we wish to estimate the Riemannian metric. A small choice of k means we want a very local interpretation which will more accurately capture fine detail structure and variation of the Riemannian metric. Choosing a large k means our estimates will be based on larger regions, and thus, while missing some of the fine detail structure, they will be more broadly accurate across the manifold as a whole, having more data to make the estimate with.

We also get a further benefit from this Riemannian metric based approach: we actually have a local metric space associated to each point, and can meaningfully measure distance, and thus we could weight the edges of the graph we might generate by how far apart (in terms of the local metric) the points on the edges are. In slightly more mathematical terms we can think of this as working in a fuzzy topology where being in an open set in a cover is no longer a binary yes or no, but instead a fuzzy value between zero and one. Obviously the certainty that points are in a ball of a given radius will decay as we move away from the center of the ball. We could visualize such a fuzzy cover as looking something like this

[Figure: the fuzzy open cover, with confidence fading toward the edge of each ball]

None of that is very concrete or formal – it is merely an intuitive picture of what we would like to have happen. It turns out that we can actually formalize all of this by stealing the singular set and geometric realization functors from algebraic topology and then adapting them to apply to metric spaces and fuzzy simplicial sets. The mathematics involved in this is outside the scope of this exposition, but for those interested you can look at the original work on this by David Spivak and our paper. It will have to suffice to say that there is some mathematical machinery that lets us realize this intuition in a well defined way.

This resolves a number of issues, but a new problem presents itself when we apply this sort of process to real data, especially in higher dimensions: a lot of points become essentially totally isolated. One would imagine that this shouldn’t happen if the manifold the data was sampled from isn’t pathological. So what property are we expecting that manifold to have that we are somehow missing with the current approach? What we need to add is the idea of local connectivity.

Note that this is not a requirement that the manifold as a whole be connected – it can be made up of many connected components. Instead it is a requirement that at any point on the manifold there is some sufficiently small neighborhood of the point that is connected (this “in a sufficiently small neighborhood” is what the “local” part means). For the practical problem we are working with, where we only have a finite approximation of the manifold, this means that no point should be completely isolated – it should connect to at least one other point. In terms of fuzzy open sets what this amounts to is that we should have complete confidence that the open set extends as far as the closest neighbor of each point. We can implement this by simply having the fuzzy confidence decay in terms of distance beyond the first nearest neighbor. We can visualize the result in terms of our example dataset again.

[Figure: the fuzzy open cover with local connectivity, each ball fully confident out to its nearest neighbor]
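A hedged sketch of what such a membership function can look like: confidence is a full 1.0 out to the distance \(\rho\) of the nearest neighbor, then decays exponentially in the distance beyond \(\rho\). Note that umap-learn additionally calibrates the decay rate \(\sigma\) for each point by a binary search; here it is simply passed in to keep the sketch short.

```python
import numpy as np

def fuzzy_memberships(knn_dists, sigma):
    """Fuzzy membership strengths for one point's k-NN edges.

    rho is the distance to the nearest neighbor: membership is a full
    1.0 out to rho (the local connectivity constraint), then decays
    exponentially in the distance *beyond* rho. umap-learn calibrates
    sigma per point with a binary search; here it is passed in
    directly to keep the sketch short.
    """
    rho = knn_dists.min()
    return np.exp(-np.maximum(knn_dists - rho, 0.0) / sigma)

dists = np.array([0.4, 0.5, 0.55, 1.8, 2.0])   # toy distances to 5 neighbors
print(fuzzy_memberships(dists, sigma=0.6))     # first entry is exactly 1.0
```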

Again this can be formalized in terms of the aforementioned mathematical machinery from algebraic topology. From a practical standpoint this plays an important role for high dimensional data – in high dimensions distances tend to be larger, but also more similar to one another (see the curse of dimensionality). This means that the distance to the first nearest neighbor can be quite large, but the distance to the tenth nearest neighbor can often be only slightly larger (in relative terms). The local connectivity constraint ensures that we focus on the difference in distances among nearest neighbors rather than the absolute distance (which shows little differentiation among neighbors).

Just when we think we are almost there, having worked around some of the issues of real world data, we run aground on a new obstruction: our local metrics are not compatible! Each point has its own local metric associated to it, and from point a’s perspective the distance from point a to point b might be 1.5, but from the perspective of point b the distance from point b to point a might only be 0.6. Which point is right? How do we decide? Going back to our graph based intuition we can think of this as having directed edges with varying weights something like this.

[Figure: directed edges between points with disagreeing weights]

Between any two points we might have up to two edges and the weights on those edges disagree with one another. There are a number of options for what to do given two disagreeing weights – we could take the maximum, the minimum, the arithmetic mean, the geometric mean, or something else entirely. What we would really like is some principled way to make the decision. It is at this point that the mathematical machinery we built comes into play. Mathematically we actually have a family of fuzzy simplicial sets, and the obvious choice is to take their union – a well defined operation. There are a few ways to define fuzzy unions, depending on the nature of the logic involved, but here we have relatively clear probabilistic semantics that make the choice straightforward. In graph terms what we get is the following: if we want to merge together two disagreeing edges with weight a and b then we should have a single edge with combined weight \(a + b - a \cdot b\). The way to think of this is that the weights are effectively the probabilities that an edge (1-simplex) exists. The combined weight is then the probability that at least one of the edges exists.
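Applied to a whole matrix of directed edge weights this combination rule is a one-liner. A minimal numpy sketch (umap-learn performs the same operation on sparse matrices; the function name is illustrative):

```python
import numpy as np

def fuzzy_union(weights):
    """Symmetrize a matrix of directed fuzzy edge weights.

    Treating weights[i, j] and weights[j, i] as the probabilities that
    the edge exists from each endpoint's perspective, the combination
    a + b - a * b is the probability that at least one of the two
    directed edges exists. Non-edges have weight 0 and fall out of the
    formula unchanged.
    """
    t = weights.T
    return weights + t - weights * t

w = np.array([[0.0, 0.9],
              [0.3, 0.0]])
print(fuzzy_union(w))   # both off-diagonal entries become 0.93
```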

If we apply this process to union together all the fuzzy simplicial sets we end up with a single fuzzy simplicial complex, which we can again think of as a weighted graph. In computational terms we are simply applying the edge weight combination formula across the whole graph (with non-edges having a weight of 0). In the end we have something that looks like this.

[Figure: the final weighted graph after the fuzzy union]

So in some sense in the end we have simply constructed a weighted graph (although we could make use of higher dimensional simplices if we wished, just at significant extra computational cost). What the mathematical theory lurking in the background did for us is determine why we should construct this graph. It also helped make the decisions about exactly how to compute things, and gives a concrete interpretation of what this graph means. So while in the end we just constructed a graph, the math answered the important questions to get us here, and can help us determine what to do next.

So given that we now have a fuzzy topological representation of the data (which the math says will capture the topology of the manifold underlying the data), how do we go about converting that into a low dimensional representation?

Finding a Low Dimensional Representation

Ideally we want the low dimensional representation to have as similar a fuzzy topological structure as possible. The first question is how do we determine the fuzzy topological structure of a low dimensional representation, and the second question is how do we find a good one.

The first question is largely already answered – we should presumably follow the same procedure we just used to find the fuzzy topological structure of our data. There is a quirk, however: this time around the data won’t be lying on some unknown manifold, we’ll have a low dimensional representation that is lying on a very particular manifold. That manifold is, of course, just the low dimensional euclidean space we are trying to embed into. This means that all the effort we went to previously to make the notion of distance vary across the manifold is going to be misplaced when working with the low dimensional representation. We explicitly want the distance on the manifold to be standard euclidean distance with respect to the global coordinate system, not a varying metric. That saves some trouble. The other quirk is that we made use of the distance to the nearest neighbor, again something we computed given the data. This is also a property we would like to be globally true across the manifold as we optimize toward a good low dimensional representation, so we will have to accept it as a hyper-parameter min_dist to the algorithm.

The second question, ‘how do we find a good low dimensional representation’, hinges on our ability to measure how “close” a match we have found in terms of fuzzy topological structures. Given such a measure we can turn this into an optimization problem of finding the low dimensional representation with the closest fuzzy topological structure. Obviously the properties of our measure of closeness will determine the nature of the optimization techniques we can apply.

Going back to when we were merging together the conflicting weights associated to simplices, we interpreted the weights as the probability of the simplex existing. Thus, since both topological structures we are comparing share the same 0-simplices, we can imagine that we are comparing the two vectors of probabilities indexed by the 1-simplices. Given that these are Bernoulli variables (ultimately the simplex either exists or it doesn’t, and the probability is the parameter of a Bernoulli distribution), the right choice here is the cross entropy.

Explicitly, if the set of all possible 1-simplices is \(E\), and we have weight functions such that \(w_h(e)\) is the weight of the 1-simplex \(e\) in the high dimensional case and \(w_l(e)\) is the weight of \(e\) in the low dimensional case, then the cross entropy will be

\[\sum_{e\in E} w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right) + (1 - w_h(e)) \log\left(\frac{1 - w_h(e)}{1 - w_l(e)}\right)\]

This might look complicated, but if we go back to thinking in terms of a graph we can view minimizing the cross entropy as a kind of force directed graph layout algorithm.

The first term, \(w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right)\), provides an attractive force between the points \(e\) spans whenever there is a large weight associated to the high dimensional case. This is because this term will be minimized when \(w_l(e)\) is as large as possible, which will occur when the distance between the points is as small as possible.

In contrast the second term, \((1 - w_h(e)) \log\left(\frac{1 - w_h(e)}{1 - w_l(e)}\right)\), provides a repulsive force between the ends of \(e\) whenever \(w_h(e)\) is small. This is because the term will be minimized by making \(w_l(e)\) as small as possible.

On balance this process of pull and push, mediated by the weights on edges of the topological representation of the high dimensional data, will let the low dimensional representation settle into a state that relatively accurately represents the overall topology of the source data.
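For concreteness, here is a minimal numpy sketch of the cross entropy above, evaluated over vectors of edge weights (the clipping constant `eps` is an illustrative guard against taking the log of zero, not something prescribed by the formula):

```python
import numpy as np

def fuzzy_set_cross_entropy(w_high, w_low, eps=1e-12):
    """Cross entropy between high- and low-dimensional edge weights.

    The first term is the attractive part (large w_high pulls w_low,
    and hence the embedded points, together); the second term is the
    repulsive part (small w_high pushes them apart). eps guards the
    logarithms against weights of exactly 0 or 1.
    """
    wh = np.clip(w_high, eps, 1.0 - eps)
    wl = np.clip(w_low, eps, 1.0 - eps)
    return np.sum(wh * np.log(wh / wl)
                  + (1.0 - wh) * np.log((1.0 - wh) / (1.0 - wl)))
```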

The UMAP Algorithm

Putting all these pieces together we can construct the UMAP algorithm. The first phase consists of constructing a fuzzy topological representation, essentially as described above. The second phase is simply optimizing the low dimensional representation to have as close a fuzzy topological representation as possible as measured by cross entropy.

When constructing the initial fuzzy topological representation we can take a few shortcuts. In practice, since fuzzy set membership strengths decay away to be vanishingly small, we only need to compute them for the nearest neighbors of each point. Ultimately that means we need a way to quickly compute (approximate) nearest neighbors efficiently, even in high dimensional spaces. We can do this by taking advantage of the Nearest-Neighbor-Descent algorithm of Dong et al. The remaining computations are now only dealing with local neighbors of each point and are thus very efficient.
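An implementation of Nearest-Neighbor-Descent is available as the pynndescent package, which is what umap-learn uses under the hood; assuming it is installed, a basic usage sketch looks like this:

```python
import numpy as np
from pynndescent import NNDescent   # pip install pynndescent

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 50))           # toy high dimensional data

index = NNDescent(data, n_neighbors=15)        # Nearest-Neighbor-Descent
knn_indices, knn_dists = index.neighbor_graph  # (n, 15) index/distance arrays
```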

In optimizing the low dimensional embedding we can again take some shortcuts. We can use stochastic gradient descent for the optimization process. To make the gradient descent problem easier it is beneficial if the final objective function is differentiable. We can arrange for that by using a smooth approximation of the actual membership strength function for the low dimensional representation, selecting from a suitably versatile family. In practice UMAP uses the family of curves of the form \(\frac{1}{1 + a x^{2b}}\). Equally we don’t want to have to deal with all possible edges, so we can use the negative sampling trick (as used by word2vec and LargeVis), to simply sample negative examples as needed. Finally since the Laplacian of the topological representation is an approximation of the Laplace-Beltrami operator of the manifold we can use spectral embedding techniques to initialize the low dimensional representation into a good state.
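The parameters \(a\) and \(b\) are chosen so that the curve \(\frac{1}{1 + a x^{2b}}\) approximates an idealized membership function that is 1 out to min_dist and then decays exponentially. A hedged sketch of that fit with scipy, mirroring (but not reproducing) umap-learn’s own find_ab_params helper:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist, spread=1.0):
    """Fit a and b in 1 / (1 + a * x^(2b)) from min_dist.

    The target is an idealized membership curve that is 1.0 out to
    min_dist and decays exponentially beyond it; curve_fit finds the
    smooth (a, b) approximation that the SGD phase then optimizes
    against.
    """
    x = np.linspace(0.0, spread * 3.0, 300)
    target = np.where(x <= min_dist, 1.0, np.exp(-(x - min_dist) / spread))
    (a, b), _ = curve_fit(lambda x, a, b: 1.0 / (1.0 + a * x ** (2.0 * b)),
                          x, target)
    return a, b

print(fit_ab(min_dist=0.1))   # roughly a ~ 1.6, b ~ 0.9 for the defaults
```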

Putting all these pieces together we arrive at an algorithm that is fast and scalable, yet still built out of sound mathematical theory. Hopefully this introduction has helped provide some intuition for that underlying theory, and for how the UMAP algorithm works in practice.

