git workflow for D (page 3)

On 12/4/17 3:14 PM, Ali Çehreli wrote:

> Dear git experts, given 3 repos, now what are the steps? Is the following correct? What are the exact commands?

Disclaimer: I'm not a git expert.

> - Only once, create the original repo as an upstream of your local repo.

The wording is off (s/repo/remote), but yes. This one I always have to look up, because I don't remember the order of the URL vs. the name, and git doesn't care if you swap them. But the command is this:

git remote add upstream <url>

Where url is the https version of dlang's repository (IMPORTANT: for dlang members, do NOT use the ssh version, as then you can accidentally push to it without any confirmation).

> 
> - For each change:
> 
> 1) Fetch from upstream

git fetch upstream

will fetch EVERYTHING, all branches. But it only updates the remote-tracking references (upstream/master and so on), so you can use them without affecting your local branches.

> 
> 2) Rebase origin/master (on upstream, right?)

No, don't do this every time! If you never commit to your local master (which you shouldn't), you do it this way:

# This checks out your local master branch
git checkout master

# This moves your local master branch along to match upstream's master.
# The --ff-only is to ensure it only works if the merge is a fast-forward.
# A fast forward merge happens when your master is on the same commit
# history as the upstream master.
git merge --ff-only upstream/master

Optionally you can push your local master to origin, but it's not strictly necessary:

git push origin master
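
For anyone who wants to try this sequence without touching a real fork, here is a minimal sketch that plays it out in a scratch directory. The directory names (upstream-src, fork.git, local) are made up for illustration; local directories stand in for the dlang repository and for your GitHub fork:

```shell
set -e
work=$(mktemp -d)                  # scratch area; all names below are hypothetical
cd "$work"

# Stand-in for dlang's repository, with one commit on master.
git init -q upstream-src
cd upstream-src
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"
echo one > file.txt
git add file.txt
git commit -q -m "first upstream commit"
cd ..

# Stand-in for your GitHub fork, and your local clone of it (origin = fork).
git clone -q --bare upstream-src fork.git
git clone -q fork.git local

# Upstream moves on after you forked.
cd upstream-src
echo two >> file.txt
git commit -q -a -m "second upstream commit"
cd ../local

# Only once: register the original repo as the remote named 'upstream'.
git remote add upstream "$work/upstream-src"

# For each change: fetch, then fast-forward local master.
git fetch -q upstream
git checkout -q master
git merge -q --ff-only upstream/master

# Optional: bring the fork's master up to date as well.
git push -q origin master

echo "local master:    $(git rev-parse master)"
echo "upstream master: $(git rev-parse upstream/master)"
echo "fork master:     $(git -C "$work/fork.git" rev-parse master)"
```

All three lines should print the same hash, since nothing was ever committed to the local master.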

> 
> 3) Make changes

Step 2.1: check out a local branch. You can even do this after you have changed some files but *before* you have committed them (IMO one of the best features of git as compared to, say, subversion).

# creates a new branch called mylocalfix based on local master, and checks it out.
git checkout -b mylocalfix
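
A small sketch of the "change first, branch later" case, in a throwaway repo with made-up file names. The edit is made while master is checked out, and the commit still ends up on the new branch only:

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"
echo one > file.txt
git add file.txt
git commit -q -m "state of master"

echo "my fix" >> file.txt          # edited while still on master, not committed
git checkout -q -b mylocalfix      # the uncommitted edit comes along
git commit -q -a -m "my fix"       # and is committed on the new branch

current=$(git rev-parse --abbrev-ref HEAD)
echo "now on: $current"
echo "master:        $(git rev-parse master)"
echo "mylocalfix^:   $(git rev-parse mylocalfix^)"
```

The last two lines print the same hash: master did not move, and the new commit sits directly on top of it.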

> 
> 4) Commit (potentially after 'add')

Protip: git commit -a automatically stages any files that are already tracked in the repo and have been modified (it does not pick up brand-new files).

> 
> 5) Repeat steps 3 and 4 as needed

Correct!

> 
> 6) 'git push -force' so that your GitHub repo is up-to-date right? (There, I mentioned "force". :) )

I'd say:

git push origin mylocalfix

This pushes *only* your mylocalfix branch to *your* fork of the repo.

No need to force as long as you do not want to squash. Squashing is when you merge multiple commits into one commit so that the history looks cleaner. I'd recommend never using force (btw, it's --force with 2 dashes) unless it complains. And then, make sure you aren't doing something foolish before using the --force option! Because you are pushing only to your fork, and to your branch, even if you mess up here, it's pretty easy to recover.

A couple of examples:

1. You commit, but missed a space between "if" and "(". Instead of generating a commit and log for the typo, you just squash the new commit into the first one.
2. You commit a work in progress, but then change the design. The first commit is useless in the history, as it probably doesn't even apply anymore, so you squash them together, as if you only ever committed the correct version.

To squash the last few commits, I recommend using rebase -i:

# Replace 3 with the number of the last few commits you want to work with.
# IMPORTANT: this must contain the commit you want to squash into!
git rebase -i HEAD~3

This will pop up an editor. Follow the instructions listed there! If you want to squash 2 commits together, use "fixup", or even just "f". Note: do NOT "fixup" your first commit, as this will try to squash into someone else's commit that happened before you changed anything!

Once you write the file and exit, git will rebase using your directions.

At this point you need to use --force to push (as long as you have already pushed before), as your commit history now differs from github's.
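
If you'd rather not edit the todo list by hand for a simple fixup, git can prepare it for you. Note this is a different route from the manual edit described above: a sketch using `git commit --fixup` plus `git rebase -i --autosquash`, with GIT_SEQUENCE_EDITOR set to `true` so the prepared list is accepted as-is. The repo, file, and commit names are made up:

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

# Stand-in for the upstream history you branched from.
echo "void main() {}" > app.d
git add app.d
git commit -q -m "upstream state"

git checkout -q -b mylocalfix
echo "// fix" >> app.d
git commit -q -a -m "Fix the bug"

# Oops, forgot something: record it as a fixup of the previous commit.
echo "// forgotten line" >> app.d
git commit -q -a --fixup=HEAD

# Rebase the last 2 commits; autosquash marks the second one as "fixup".
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash HEAD~2

git --no-pager log --oneline
commit_count=$(git rev-list --count HEAD)
subject=$(git log -1 --format=%s)
```

The log should show two commits, "upstream state" and a single "Fix the bug" that contains both edits. HEAD~2 deliberately stops short of the upstream commit, so only your own commits are rewritten.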

> 
> 7) Go to GitHub and press the big button to create a pull request

Correct! After you do this, you can continue to run steps 3, 4, 6 to update your PR.

One further step that I like to do to keep my repo clean:

8) When your PR is pulled:

git fetch upstream
git checkout master
git merge --ff-only upstream/master
git branch -d mylocalfix

This pulls the new changes that were successfully merged into dlang's master into your master. Then it deletes the mylocalfix branch (no longer needed). The lower case -d means to only delete if the changes have been merged (git will complain if they aren't in the history). This is a nice way to clean your local branches up, and verify there isn't anything amiss.

Note also, if you want to work on several fixes at once, you can check out more than one local branch, and switch between them. Just remember to commit before you check out a different branch (git will complain if switching would overwrite uncommitted changes to tracked files, but not about files that haven't ever been added).

Hope this all helps!

-Steve

December 05, 2017

Re: git workflow for D

Posted by H. S. Teoh
in reply to Nick Sabalausky (Abscissa)

On Mon, Dec 04, 2017 at 06:51:42AM -0500, Nick Sabalausky (Abscissa) via Digitalmars-d-learn wrote:
> On 12/03/2017 03:05 PM, bitwise wrote:
> > I've finally started learning git, due to our team expanding beyond one person - awesome, right?
> 
> PROTIP: Version control systems (no matter whether you use git, subversion, or whatever), are VERY helpful on single-person projects, too! Highly recommended! (Or even any time you have a directory tree where you might want to enable undo/redo/magic-time-machine on!)

+100!  (and by '!' I mean 'factorial'. :-P)

I've been using version control for all my personal projects, and I cannot tell you how many times it has saved me from my own stupidity (i.e., having to roll back a whole bunch of changes, or just plain ole consulting an older version of the code that I've forgotten). Esp. with git, it also lets me play with experimental code changes without ever worrying that if things don't work out I might have to revert everything by hand (not fun! and very error-prone).

In fact, I use version control for more than just code: *anything* that's text-based is highly recommended to be put under version control if you're doing any serious amount of editing with it, because it's just such a life-saver. Of course, git works with binaries too, but diffing and such become a lot easier if everything is text-based.  This is why I always prefer text-based file formats when it comes to authoring.

Websites are a good example that really ought to be under version control.  Git, especially, lets you clone the website to a testing server where you can experiment with changes without fear, and once you're happy with the changes, commit and push to the "real" web server. Notice an embarrassing mistake that isn't easy to fix? No problem, just git checkout HEAD^, and that buys you the time you need to fix the problem locally, then re-push.

I've also recently started putting certain subdirectories under /etc in git.  Another life-saver when you screw up a configuration accidentally and need to revert to the last-known good config. Also good for troubleshooting to see exactly what changes were made that led to the current state of things.

tl;dr: use version control WHEREVER you can, even for personal 1-man projects, not only for code, but for *everything* that involves a lot of changes over time.

> > Anyways, I've got things more or less figured out, which is nice, because being clueless about git is a big blocker for me trying to do any real work on dmd/phobos/druntime. As far as working on a single master branch works, I can commit, rebase, merge, squash, push, reset, etc, like the best of em.
> 
> Congrats! Like Arun mentioned, git's CLI can be a royal mess. I've heard it be compared to driving a car by crawling under the hood and pulling on wires - and I agree.
> 
> But it's VERY helpful stuff to know, and the closer you get to understanding it inside and out, the better off you are. (And I admit, I still have a long ways to go myself.)

Here's the thing: in order to use git effectively, you have to forget all the traditional notions of version control. Yes, git does use much of the common VC terminology, and, on the surface, does work in similar ways.

BUT.

You will never be able to avoid problems and unexpected behaviours unless you forget all the traditional VC notions, and begin to think in terms of GRAPHS. Because that's what git is: a system for managing a graph. To be precise, a directed acyclic graph (DAG).

Roughly speaking, a git repo is just a graph (a DAG) of commits, objects, and refs.  Objects are the stuff you're tracking, like files and stuff.  Commits are sets of files (objects) that are considered to be part of a changeset. Refs are just pointers to certain nodes in the graph.

A git 'branch' is nothing but a pointer to some node in the DAG. In git, a 'branch' in the traditional sense is not a first-class entity; what git calls a "branch" is nothing but a node pointer. The traditional "branch" is merely a particular configuration of nodes in the DAG that has no special significance to git.

Git maintains a notion of the 'current branch', i.e., which pointer will serve as the location where new nodes will be added to the DAG. By default, this is the 'master' branch (i.e., a pointer named 'master' pointing to some node in the DAG).

When you run `git commit`, what you're doing is creating a new node in the DAG, with the parent pointer set to the current branch pointer. So if the current branch is 'master', and it's pointing to the node with SHA hash 012345, then `git commit` will create a new node with its parent pointer set to 012345.  After this node is added to the graph, the current pointer, 'master', is updated to point to the new node.
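
To see this pointer update for yourself, here is a minimal sketch in a scratch repo (directory and file names are made up). It records where 'master' points, commits, and then compares the new node's parent with the old pointer value:

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo one > file.txt
git add file.txt
git commit -q -m "first node"
old_tip=$(git rev-parse master)    # where 'master' points before the commit

echo two >> file.txt
git commit -q -a -m "second node"
new_tip=$(git rev-parse master)    # 'master' has moved to the new node
parent=$(git rev-parse master^)    # parent pointer of the new node

echo "old tip: $old_tip"
echo "new tip: $new_tip"
echo "parent:  $parent"
```

The "parent" line matches the "old tip" line, and the "new tip" is a different hash.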

By performing a series of `git commit`s, what you end up with is a linear chain of nodes, with the current branch ('master') pointing to the last node.  This, we traditionally view as a "branch", but in git, there is nothing special at all about this chain; it's just a (sub)graph of some nodes. The git 'branch' is nothing but a pointer to the last of these nodes. You can easily make this pointer point to something else -- you wouldn't normally do this, but sometimes it can be useful.

You can also decide that instead of adding new nodes to 'master', you want to add new nodes elsewhere in the DAG. No problem, just `git checkout` some arbitrary node, and start running `git commit` on it. The first new commit will take that node as parent, and thereby start creating a new chain of nodes "branching off" the 'master' chain.

Merging a branch in git is likewise not something you'd think of in traditional VC terms; it's basically nothing but creating a new node with two parents, one from the tip of each respective branch. You can 'merge' any two arbitrary nodes together. Though of course, in general you'll end up with a huge number of conflicts if the node contents aren't correlated with each other -- but git doesn't actually mind that; you can actually overwrite all the contents with something else altogether and commit that, and git will happily take that as the "merge" of the two unrelated branches. The resulting graph won't make any sense in terms of revision history in the traditional VC sense, but git doesn't care. The point is that as far as git is concerned, it's all just a DAG.  The fact that the contents of two adjacent nodes happen to be similar is just a "coincidence", albeit a usual one.
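
A quick sketch of the "node with two parents" idea, again in a throwaway repo with hypothetical branch and file names. Two chains diverge from a common node, and the merge commit records the tip of each as a parent:

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo base > base.txt
git add base.txt
git commit -q -m "common ancestor"

git checkout -q -b topic
echo topic > topic.txt
git add topic.txt
git commit -q -m "topic work"
topic_tip=$(git rev-parse topic)

git checkout -q master
echo main > main.txt
git add main.txt
git commit -q -m "master work"
master_tip=$(git rev-parse master)

git merge -q --no-edit topic       # creates the two-parent node

# The raw commit object lists one 'parent' line per parent.
git cat-file -p HEAD
parents=$(git cat-file -p HEAD | grep -c '^parent ')
first_parent=$(git rev-parse HEAD^1)
second_parent=$(git rev-parse HEAD^2)
```

The first parent is the old tip of master and the second is the tip of topic; neither of those nodes was changed by the merge.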

The more 'arcane' git operations like rebasing, history rewriting, etc., are at the end of the day nothing more than graph operations, updating a bunch of pointers and moving nodes around.  If you begin thinking of your repo as a graph and forget traditional VC notions of branches, you'll find that git suddenly starts to "make sense", and you'll be able to do amazing things to your repo without losing your way.

[...]
> ([...] there's nothing worse than accidentally losing a bunch of important code, or finding you need to undo a bunch of changes that didn't work out.)

If you think in terms of graphs, you'll hardly ever need to worry about losing changes.  Just think in terms of code: if you were given a bunch of pointers to nodes in a graph, and you need to update these pointers, what's the safest way to do it?  Easy: just save the pointers to some local variables, then do whatever updates you want, and if it doesn't work out, just overwrite the pointers with the saved values, and you're back to where you started.

In git, because everything is SHA-hashed, nodes are actually immutable. Even the so-called history rewriting, technically speaking, isn't really "rewriting"; it's actually creating a NEW subgraph that just happens to be similar to the older part of the graph plus some changes, and updating your refs (pointers) to point to nodes in the new part of the graph instead.  In git, nodes that have nothing pointing to them are considered garbage; `git gc` will delete them from the graph.  So once all your pointers are pointing to the new nodes, you've effectively discarded the old nodes; hence the overall effect is "rewriting" the graph.  But if you still keep a ref to the old nodes, they will still be there; nothing is lost.

It's like dealing with immutable values in D: you can never change them, but you *can* make (modified) copies of them and change your pointers to point to the copies instead of the original values.  As long as you still keep refs to the old nodes, they will never be lost no matter what you do to your graph.  And note that the parent pointers in each node are also part of the SHA hash, so the topology of the old part of the graph is immutable too.  There is literally nothing you can do that can change the content or topology of those old nodes. As long as you have a way to reach them, you will still have your old history completely intact.

And how do you create backup copies of your pointers? Easy: remember a git 'branch' is nothing but a pointer? Well, so you just go `git checkout <branch>; git checkout -b backup_ref` and now you have a pointer called 'backup_ref' that points to that same node that <branch> is pointing to.  Now you can do whatever you want to <branch> -- add new commits, overwrite it with a ref to a completely different node, whatever.  If at any point you decide that you want it to point to the original node again, just `git checkout <branch>; git reset --hard backup_ref`.  As long as you don't touch backup_ref, you will be able to go back to the original state.
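
Here is that save-and-restore spelled out as a sketch in a scratch repo (names are hypothetical, with master playing the role of <branch>):

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo one > file.txt
git add file.txt
git commit -q -m "known good state"

# Save the pointer: backup_ref now points at the same node as master.
git checkout -q master
git checkout -q -b backup_ref
saved=$(git rev-parse backup_ref)

# Do whatever you want to master.
git checkout -q master
echo experiment >> file.txt
git commit -q -a -m "experiment that did not work out"
moved=$(git rev-parse master)

# Restore the pointer from the saved value.
git reset -q --hard backup_ref
restored=$(git rev-parse master)

echo "saved:    $saved"
echo "moved:    $moved"
echo "restored: $restored"
```

`git branch backup_ref <branch>` creates the same kind of pointer without switching branches at all, if you prefer one command to two.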

(See? This is why you have to stop thinking of a git repo in traditional
VC terms.  Your git repo is a graph. (With immutable nodes.) That's all
there is to it.)

> One thing to keep in mind: Any time you're talking about moving anything from one repo to another, there's exactly two basic primitives there: push and pull. Both of them are basically the same simple thing: All they're about is copying the latest new commits (or tags) from WW branch on XX repo, to YY branch on ZZ repo. All other git commands that move anything between repos start out with this basic "push" or "pull" primitive. (Engh, technically "fetch" is even more of a primitive than those, but I find it more helpful to think in terms of "push/pull" for the most typical daily tasks.)

Again, this will all make so much more sense if you think in terms of graphs.

What `git fetch` does is to download a bunch of nodes from a remote source.  Don't even think in terms of branches; think in terms of individual nodes (which imply their own graph connectivity structure -- because the parent pointers are an immutable part of them) that are downloaded from the remote source.  After downloading these nodes, git will create a new pointer (i.e., ref) to point to the last node (i.e., the node from which the other nodes can be reached), usually with a name like upstream/somebranch.  There is nothing special about this name besides the convention that we use names of the form x/y for pointers named 'y' that we downloaded from 'x'; it's just a pointer to some nodes that you downloaded off the 'net.

What 'git pull' does is to try to reconcile these downloaded nodes with the nodes in your local branch -- and here is where wrinkles can arise, because, by convention, git will try to merge the nodes from x/y into the local branch called y.  It's all good if the local branch y points to an ancestor of x/y, i.e., your local branch is just a subgraph of the remote branch, and since the parent pointers of the downloaded nodes already point to y (i.e., they are already a part of the graph! -- because they share an ancestor node), the only thing that's needed is to update y to point to x/y (i.e., the new tip of the branch) instead. This is called 'fast-forwarding'.

But what if your local branch has diverged from the remote branch? I.e., the nodes in local branch 'y' share a common ancestor with the downloaded nodes in x/y, but have different descendent nodes. Now we cannot simply set y to x/y, because that would cause you to lose your pointer to your local nodes, which means `git gc` will garbage-collect them (i.e., your local changes will be lost).  So git tries to be 'helpful' here by attempting to merge the nodes together -- i.e., create a new series of nodes that incorporate the changes from *both* y and x/y.  Unfortunately, this process often causes further problems, because remember, nodes are immutable, so the only way you can merge the changesets together is by creating new nodes ("merge commits" in git parlance) and discarding the old ones.  But once you do that, your local branch 'y' is no longer the same as the remote one, so when it comes time to push your changes to other collaborators, or to pull from remote again later, it causes more conflicts in a never-ending spiral.

The best approach is to avoid this situation altogether, by designating certain branches (usually master) as pull-only, i.e., you never commit changes to them, all your changes are committed to local branches. In terms of graphs, you never change the value of the 'master' pointer, but may add new nodes to the graph by using other pointers ("local branches") for that purpose.  Then `git pull` will always be fast-forward only (the value of the local 'master' pointer will always be equal to, or an ancestor of, the remote 'master' pointer, so it is always possible to just replace the local 'master' pointer with the remote value without losing any nodes).  This is why I recommend to *always* run:

	git pull --ff-only upstream master

The --ff-only tells git not to try to be smart and create a mess of merge commits, but to only ever fast-forward the master pointer.  If this fails, then you know you've made a mistake and updated the master pointer where you should have used a local branch instead.  (How to fix this is left as an exercise for the reader: hint, remember 'master' is just a pointer. Just create a new local branch to point to the current nodes, i.e., backup your pointer, then reset 'master' to the last common ancestor with the upstream nodes, then `git pull`, and rebase your local branch afterwards.)
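
For readers who want to check their answer to that exercise, one possible sketch follows. It is only an illustration under made-up names: a local directory upstream-src stands in for the upstream repo, and myfix is a hypothetical name for the rescue branch.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for the upstream repo.
git init -q upstream-src
cd upstream-src
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"
echo one > upstream.txt
git add upstream.txt
git commit -q -m "upstream: first"
cd ..

# Local clone whose remote is named 'upstream' (-o sets the remote name).
git clone -q -o upstream upstream-src local
cd local
git config user.name "Example"
git config user.email "example@example.invalid"

# The mistake: a commit made directly on master.
echo mine > mine.txt
git add mine.txt
git commit -q -m "my change, committed on master by accident"

# Meanwhile upstream moves on.
cd ../upstream-src
echo two >> upstream.txt
git commit -q -a -m "upstream: second"
cd ../local

# The fast-forward-only pull now refuses, which is the warning sign.
if ! git pull -q --ff-only upstream master 2>/dev/null; then
    echo "pull --ff-only failed: master has diverged"
fi

git branch myfix master                                          # 1. back up the pointer
git fetch -q upstream
git reset -q --hard "$(git merge-base master upstream/master)"   # 2. rewind master
git pull -q --ff-only upstream master                            # 3. now a plain fast-forward
git checkout -q myfix
git rebase -q master                                             # 4. replay the change on top

git --no-pager log --graph --all --oneline
```

Afterwards master matches upstream/master again, and the accidental commit lives on myfix, sitting on top of the new master.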

> > How does one keep their fork up to date? For example, if I fork dmd, and wait a month, do I just fetch using dmd's master as a remote, and then rebase?

If you keep to the convention of never committing to master locally, then you can just `git pull --ff-only upstream master` and it will pull in the latest changes.  Then you just rebase your local branch(es) on top of master.

In graph-centric terms, running `git rebase master` in a local branch B
does the following: (1) find the common ancestor A of master and B; (2)
for each node in B up to (but not including) A, create a corresponding
new node that contains the same changes, but is based on the tip of
master instead of A; (3) set B to point to the last of the new nodes.

Special note: since rebase isn't actually modifying nodes -- remember nodes are immutable -- if you're unsure or want to be extra-careful, you can keep a spare reference to the old tip of B before running the rebase, like this:

	git checkout B
	git checkout -b B-backup	# backup pointer
	git checkout B			# set current branch back to B

	git rebase master		# rebase B onto master

If you then run `git log --graph --all`, you'll see that there are now *two* copies of the commits you made in B: one in the original position branching off master at ancestor A, and the other is now based on master.  'B' will now point to the new nodes, but you'll still be able to access the old nodes via 'B-backup'.  If at any time you wish to 'undo' the rebase, just reset B to B-backup.  (The new nodes will then become unreferenced, and will be garbage-collected. Unless you kept another pointer to them, of course.)
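
A sketch to confirm this in a scratch repo (file names and commit messages are made up; it uses `git branch B-backup` as a one-step way to create the spare pointer):

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo base > base.txt
git add base.txt
git commit -q -m "A: common ancestor"

git checkout -q -b B
echo feature > b.txt
git add b.txt
git commit -q -m "work on B"

git checkout -q master
echo more > m.txt
git add m.txt
git commit -q -m "master moves on"

git checkout -q B
git branch B-backup                 # spare pointer to the old tip of B
old_tip=$(git rev-parse B)
git rebase -q master                # rebase B onto master

new_tip=$(git rev-parse B)
backup_tip=$(git rev-parse B-backup)
new_parent=$(git rev-parse B^)
master_tip=$(git rev-parse master)
git --no-pager log --graph --all --oneline
```

B now points at a new node whose parent is the tip of master, B-backup still points at the old node, and `git diff B-backup^ B-backup` shows the same change as `git diff B^ B`.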

See? No danger of data loss. (Unless you forget to keep a spare pointer to your old nodes. But even in that case, there's still a way out with `git reflog` -- git gc doesn't actually delete nodes until they're past a certain age, so as long as you notice the problem early and not a week or month later, your old nodes will still be there. You just have to dig through `git reflog` to find the old pointer values, i.e., SHA hashes. Once you find the right SHA hash, just `git checkout <hash>` to go back to the old node, then `git checkout -b <oldbranch>` to create a new branch pointer to point to the old nodes.)
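
The reflog route, sketched in a scratch repo with made-up names. Here the "accident" is a hard reset that moves master off the second commit with no backup pointer kept; in real use you would read the hash out of the `git reflog` listing, while the sketch takes the previous HEAD value directly as `HEAD@{1}`:

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.invalid"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two >> file.txt
git commit -q -a -m "second (about to be lost)"
lost=$(git rev-parse HEAD)

# The accident: master is moved back and nothing points at "second" any more.
git reset -q --hard HEAD~1

# The reflog still remembers where HEAD used to be.
git --no-pager reflog
found=$(git rev-parse 'HEAD@{1}')

# Go back to the old node and hang a new branch pointer on it.
git checkout -q "$found"
git checkout -q -b oldbranch
recovered=$(git rev-parse oldbranch)
```

After this, oldbranch points at the commit that the reset had orphaned.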

[...]
> > and do I need a separate branch for each pull request, or is the pull request itself somehow isolated from my changes?
> 
> You *should* create a separate branch for each pull request unless you're a masochist. There's *no* isolation other than whatever isolation YOU create.  (Not my idea of award-winning software design, but meh, it is what it is).
> 
> This is why people are adamant about making a separate branch for each pull request. *Technically* speaking you don't absolutely HAVE to...But if you *don't* create a separate branch for each PR, you're just asking for pain: It'll be a PITA if you want to create another PR before your first one is approved and merged. And it'll be a PITA if your PR is rejected and you want to do any more work on the codebase.
[...]

Just think of it as updating a graph.  You have a local copy of the graph, and you've added a bunch of new nodes to it.  Now you want the upstream people to add your new nodes to their copies of the graph too. Suppose further that these nodes represent several different changesets. What's the best way to manage these nodes?

It should be obvious that the best way is to use a different pointer for each changeset, so that if the upstream people decide to merge changeset A but reject changeset B, you can keep your local copy of the graph straight.  If you use the *same* pointer for all changesets, then it should be no surprise when things become a big mess when upstream merges some changesets but not others, yet locally you have no way of addressing each changeset separately.

Even if all your changes eventually get merged, in the interim you may be running git rebase to apply your changes to the latest upstream code; if you only keep a single pointer around for everything, you're going to lose track of what's going on really quickly.

There's no *requirement* that you do things this way, of course, but it's just a matter of being able to keep your own changesets straight when you have to reconcile your local graph with the remote one.

T

-- 
Never wrestle a pig. You both get covered in mud, and the pig likes it.
