June 07, 2018
On Thu, Jun 07, 2018 at 05:11:40PM -0700, Walter Bright via Digitalmars-d-announce wrote:
> On 6/7/2018 10:17 AM, H. S. Teoh wrote:
> > Exactly!!!  Git was built precisely for decentralized, distributed development.  Anyone should be (and is, if they bothered to put just a tiny amount of effort into it) able to set up a git server and send the URL to prospective collaborators.  Anyone is free to clone the git repo and redistribute that clone to anyone else.  Anyone can create new commits in a local clone and send the URL to another collaborator who can pull the commits.  It should never have become the tool to build walled gardens that inhibit this free sharing of code.
> 
> We have more on Github than just the source code. There are all the comments that go with the PRs. I have most of this archived, as they get emailed to me by Github, but not all of it and recreating all this priceless historical information into a usable form would be very burdensome.

And that is why it's a bad thing to build a walled garden around a code repo, esp. when the underlying VCS is well capable of distributed development.  If only there had been a standard protocol for communicating such associated content, such as PR comments and discussions, bugs and issues (this latter not applicable in our case, thankfully), then we could have set up an archival system to retrieve and store all of this information.  Unfortunately, AFAIK there isn't a way to do this, and so if Github for whatever reason shuts down, all of this valuable information would be lost forever.

The same problem faces us if for whatever reason we decide to move to a different VCS hosting provider in the future: the lack of a common, compatible data exchange format for PRs, comments, issues, etc., means that it will be very hard (practically impossible) to export this data and import it into the new system.  It's a mild form of vendor lock-in. Mild in the sense that we can take the code with us anytime, thanks to the way git works, but the valuable associated information like PR discussions is specific to Github and there is no easy way (if there's a way at all!) to export this data and import it elsewhere.

It's 2018, and history has shown that standard, open data formats are what stand the test of time. We *could* have had a standardized interchange format for representing PR discussions, standard vendor-agnostic protocols for bug-tracking, PR merging, etc. Yet we're still stuck in the 1998 mindset of building walled gardens that lock us into an inescapable dependence on a specific vendor.  Thankfully git allows at least the code to be free from this lock-in, but still, as you said, priceless historical information resides in data that only exists on Github, and the lack of common protocols means we're bound to Github by the fear of losing this data forever if we leave.


T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
June 08, 2018
On Thursday, 7 June 2018 at 19:02:31 UTC, Russel Winder wrote:
> On Thu, 2018-06-07 at 10:17 -0700, H. S. Teoh via Digitalmars-d-announce wrote:
>> […]
>> 
>> Exactly!!!  Git was built precisely for decentralized, distributed development.  Anyone should be (and is, if they bothered to put just a tiny amount of effort into it) able to set up a git server and send the URL to prospective collaborators.  Anyone is free to clone the git repo and redistribute that clone to anyone else.  Anyone can create new commits in a local clone and send the URL to another collaborator who can pull the commits.  It should never have become the tool to build walled gardens that inhibit this free sharing of code.
>> 
>
> I think there is an interesting tension between using a DVCS as a DVCS and no central resource, and thus no mainline version, and using a DVCS in combination with a central resource.  In the latter category the central resource may just be the repository acting as the mainline, or, as with GitHub, GitLab, Launchpad, the central resource provides sharing and reviewing support.
>
> Very few organisations, except perhaps those that use Fossil, actually use DVCS as a DVCS. Everyone seems to want a public mainline version: the repository that represents the official state of the project. It seems the world is not capable of working with a DVCS system that does not even support "eventually consistent". Perhaps because of lack of trying or perhaps because the idea of the mainline version of a project is important to projects.

Well, as Jonathan says, you have to release a build eventually, and you need a mainline version that you know has all the needed commits to release from.

If you have multiple people all releasing their own builds with each build getting a roughly equivalent number of downloads, then a mainline version may not be needed, but I know of no large project like that.

> In the past Gnome, Debian, GStreamer, and many others have had a central mainline Git repository and everything was handled as DVCS, with emailed patches. They tended not to support using remotes and merges via that route, not entirely sure why. GitHub and GitLab supported forking, issues, pull requests, and CI. So many people have found this useful. Not just for having ready made CI on PRs, but because there was a central place that lots of projects were at, there was lots of serendipitous contribution. Gnome, Debian, and GStreamer are moving to private GitLab instances. It seems the use of a bare Git repository is not as appealing to these projects as having the support of a centralised system.

Nobody uses a DVCS alone; even the Linux kernel guys have mailing lists and other software they use to coordinate around git.

> I think that whilst there are many technical reasons for having an element of process support at the mainline location favouring the GitHubs and GitLabs of this Gitty world, a lot of it is about the people and the social system: there is a sense of belonging, a sense of accessibility, and being able to contribute more easily.

There is some of that, but you could reproduce all of that in a technically decentralized manner.

> One of the aspects of the total DVCS is that it can exclude, it is in itself a walled garden, you have to be in the clique to even know the activity is happening.

Right now, yes, mailing lists and bugzilla can be forbidding to the noob, compared to just signing up on github and getting everything at one go. But as Basile's link above points out, there are tools like git-ssb that try to decentralize all that:

http://git-ssb.celehner.com/%25RPKzL382v2fAia5HuDNHD5kkFdlP7bGvXQApSXqOBwc%3D.sha256

> All of this is not just technical, it is socio-technical.

It is all ultimately technical, but yes, social elements come into play.

One big thing that web software like github or trac helps with is reviewing pulls to the main repo. I'm not about to add dozens of remotes to my local repo to review pulls from all the contributors to dmd/druntime/phobos; the github pull review workflow is much easier than the git command-line equivalent.

However, it wouldn't be that hard to decentralize most of what github provides by coming up with a standard format to store issues and other discussion in a git repo, as I'm guessing git-ssb does. The only aspect that might present difficulty is that you may not get as nice a web viewer as github provides, as the built-in gitweb is not very good compared to github's web UI.
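To make that concrete, here is a minimal sketch of what such a format could look like (the directory layout and field names are purely hypothetical, not what git-ssb actually uses): each comment is written as a small JSON file inside the repository itself, so every clone automatically carries the discussion history along with the code.

import std.file : mkdirRecurse, write;
import std.format : format;
import std.json;

// Hypothetical layout: .discussions/pr-<N>/<date>-<author>.json
void saveComment(int prNumber, string author, string date, string text)
{
    // Field names are made up for illustration; a real standard would
    // have to nail these down.
    JSONValue comment = [
        "pr":     JSONValue(prNumber),
        "author": JSONValue(author),
        "date":   JSONValue(date),
        "body":   JSONValue(text)
    ];

    auto dir = format(".discussions/pr-%d", prNumber);
    mkdirRecurse(dir);

    // One file per comment; committing it makes the discussion part of the
    // repo's ordinary git history, so it is distributed with every clone.
    write(format("%s/%s-%s.json", dir, date, author), comment.toPrettyString());
}

void main()
{
    saveComment(1234, "someuser", "2018-06-08", "LGTM, but please add a test.");
}

Merging two people's discussion files would then just be an ordinary git merge, and any static site generator could render the JSON into web pages.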

In that way, while many are now complaining about using github, the OSS community's use of it all these years may have been optimal: as long as a money-losing company was willing to do that work for you, why not use it? Where was all that money being lost, after all, if not on providing features to users who weren't paying enough to sustain it? Then, once you know whether github's business model works or not (apparently not), you consider moving elsewhere.

I don't know why some people are making a big deal about losing data with github; I can't imagine it'd be that hard to scrape. Gitlab has an exporter, which you could simply repurpose for your own uses. You may not get everything, like user permissions or some other internal controls, but you'd get everything that really matters.
June 08, 2018
On 06/08/2018 01:01 AM, H. S. Teoh wrote:
> but the valuable associated information like PR
> discussions is specific to Github and there is no easy way (if there's a
> way at all!) to export this data and import it elsewhere.

For importing, you may be right. For exporting, I'm not sure I agree. With curl and something like Adam's HTML DOM (or heck, even just regex) it shouldn't be too difficult to crawl/scrape all the information into a sensible format. That's a technique I've been wanting to do a LOT more with than I've had a chance to.
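As a rough sketch of that approach (in D, with std.net.curl and std.regex; the PR number below is just an example and the regex is only a placeholder, since GitHub's real markup would have to be inspected first, and a proper HTML parser like Adam's dom.d would be far more robust):

import std.net.curl : get;
import std.regex;
import std.stdio : writeln;

void main()
{
    // Example PR page; a real crawler would walk the full list of PRs.
    auto html = get("https://github.com/dlang/dmd/pull/8000");

    // Placeholder pattern -- GitHub's actual markup differs and changes
    // over time, which is exactly why a DOM parser beats regex here.
    auto commentRe = regex(`<td class="comment-body">(.*?)</td>`, "s");

    foreach (m; matchAll(html, commentRe))
        writeln(m[1]);
}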

Although granted, that's still far more complicated than it SHOULD be, and doesn't help much if there's nowhere to import it into.

> It's 2018, and history has shown that standard, open data formats are
> what stands the test of time.

Yup. Unfortunately, history has also shown that closed-off and locked-in tend to be more lucrative business models. Which is why all the big muscle in the tech world is usually working *against* open standards.
June 08, 2018
On Fri, Jun 08, 2018 at 02:02:12PM -0400, Nick Sabalausky (Abscissa) via Digitalmars-d-announce wrote:
> On 06/08/2018 01:01 AM, H. S. Teoh wrote:
> > but the valuable associated information like PR discussions is specific to Github and there is no easy way (if there's a way at all!) to export this data and import it elsewhere.
> 
> For importing, you may be right. For exporting, I'm not sure I agree. With curl and something like Adam's HTML DOM (or heck, even just regex) it shouldn't be too difficult to crawl/scrape all the information into a sensible format. That's a technique I've been wanting to do a LOT more with than I've had a chance to.

True, you can write a crawler to trawl through all the pages and collate all the info.  But it doesn't seem to be something that can be done overnight, and the extracted data will probably need further processing to be put into a more useful form (e.g., resolving cross-links, parsing references between PRs, etc.); dumping the raw HTML is only the first step.


> Although granted, that's still far more complicated than it SHOULD be, and doesn't help much if there's nowhere to import it into.

Even if there were somewhere to import it, it would still require a fair amount of effort to massage the data into the right format to be imported.


> > It's 2018, and history has shown that standard, open data formats are what stands the test of time.
> 
> Yup. Unfortunately, history has also shown that closed-off and locked-in tend to be more lucrative business models. Which is why all the big muscle in the tech world is usually working *against* open standards.

Of course.  Money corrupts, and where money is involved, you can expect anything else that stands in the way to be shoved aside or thrown out the window completely, no matter how much more sense it may make. Ironic, then, that Github hasn't turned a profit yet. :-D


T

-- 
Which is worse: ignorance or apathy? Who knows? Who cares? -- Erich Schubert
June 08, 2018
On 6/7/2018 10:01 PM, H. S. Teoh wrote:
> And that is why it's a bad thing to build a walled garden around a code
> repo, esp. when the underlying VCS is well capable of distributed
> development.  If only there had been a standard protocol for
> communicating such associated content, such as PR comments and
> discussions, bugs and issues (this latter not applicable in our case,
> thankfully), then we could have set up an archival system to retrieve and
> store all of this information.  Unfortunately, AFAIK there isn't a way
> to do this, and so if Github for whatever reason shuts down, all of this
> valuable information would be lost forever.

Since I have (most) of the Github discussions in email form, I could do something like this if we had to:

https://digitalmars.com/d/archives/digitalmars/D/index.html

There's a program that runs over the NNTP database to generate the static pages:

https://github.com/DigitalMars/ngArchiver
June 08, 2018
On 6/8/2018 2:34 PM, Walter Bright via Digitalmars-d-announce wrote:
> On 6/7/2018 10:01 PM, H. S. Teoh wrote:
>> And that is why it's a bad thing to build a walled garden around a code
>> repo, esp. when the underlying VCS is well capable of distributed
>> development.  If only there had been a standard protocol for
>> communicating such associated content, such as PR comments and
>> discussions, bugs and issues (this latter not applicable in our case,
>> thankfully), then we could have set up an archival system to retrieve and
>> store all of this information.  Unfortunately, AFAIK there isn't a way
>> to do this, and so if Github for whatever reason shuts down, all of this
>> valuable information would be lost forever.
> 
> Since I have (most) of the Github discussions in email form, I could do something like this if we had to:
> 
> https://digitalmars.com/d/archives/digitalmars/D/index.html
> 
> There's a program that runs over the NNTP database to generate the static pages:
> 
> https://github.com/DigitalMars/ngArchiver

Essentially (if not actually) everything on github is available through their APIs.  No need for scraping or other heroics to gather it.
June 08, 2018
On 6/8/2018 3:02 PM, Brad Roberts wrote:
> Essentially (if not actually) everything on github is available through their APIs.  No need for scraping or other heroics to gather it.

That's good to know! The situation I was concerned with is it going dark all of a sudden.

BTW, if someone wants to build a scraper that'll produce static web pages of the dlang PR discussions, that would be pretty cool!
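For what it's worth, a bare-bones sketch of such a scraper, going through the public GitHub REST API instead of scraping HTML (the endpoint and the user.login/created_at/body fields are the real v3 API; the repo, PR number, and output format are just placeholders, and a real tool would also need pagination, rate limiting, and HTML escaping):

import std.file : write;
import std.format : format;
import std.json;
import std.net.curl : HTTP, get;

void main()
{
    enum repo = "dlang/dmd";
    enum pr = 8000;  // example number only

    auto http = HTTP();
    // The GitHub API rejects requests without a User-Agent header.
    http.addRequestHeader("User-Agent", "dlang-pr-archiver");

    // PR comments are served by the "issues" side of the API.
    auto url = format("https://api.github.com/repos/%s/issues/%d/comments", repo, pr);
    auto comments = parseJSON(get(url, http)).array;

    // Render one static HTML page per PR; a real generator would
    // HTML-escape the comment bodies.
    auto page = format("<html><body><h1>%s pull #%d</h1>\n", repo, pr);
    foreach (c; comments)
        page ~= format("<h3>%s (%s)</h3>\n<pre>%s</pre>\n",
                c["user"]["login"].str, c["created_at"].str, c["body"].str);
    page ~= "</body></html>\n";

    write(format("pr-%d.html", pr), page);
}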
June 09, 2018
On Friday, 8 June 2018 at 22:06:29 UTC, Walter Bright wrote:
> On 6/8/2018 3:02 PM, Brad Roberts wrote:
>> Essentially (if not actually) everything on github is available through their APIs.  No need for scraping or other heroics to gather it.
>
> That's good to know! The situation I was concerned with is it going dark all of a sudden.
>
> BTW, if someone wants to build a scraper that'll produce static web pages of the dlang PR discussions, that would be pretty cool!

There are plenty of third-party tools that archive GitHub.

For example, https://www.gharchive.org/. GitHub advertises some of them at https://help.github.com/articles/about-archiving-content-and-data-on-github/#third-party-archival-projects and https://help.github.com/articles/backing-up-a-repository/.

Personally I think the fear of Microsoft ruining GitHub is completely unfounded. Just look at what they did to Xamarin. They bought an interesting product and then made it free for individuals, open sourced it, and improved it drastically. And they sure do hate Linux nowadays, what with .NET Core partially existing to improve Linux / cross-platform support.

June 09, 2018
On Saturday, 9 June 2018 at 00:54:08 UTC, Kapps wrote:
>
> Personally I think the fear of Microsoft ruining GitHub is completely unfounded. Just look at what they did to Xamarin. They bought an interesting product and then made it free for individuals, open sourced it, and improved it drastically. And they sure do hate Linux nowadays, what with .NET Core partially existing to improve Linux / cross-platform support.

These days, I don't think the "evil" of MS is the thing to be concerned about. I'm more concerned about unpredictability and unreliability: the potential for mess-ups, mind-changing, or other surprises down the road. Not that it necessarily WILL happen, but with it being MS, I think it's worth being prepared, just in case.
June 08, 2018
On 6/8/2018 5:54 PM, Kapps wrote:
> Personally I think the fear of Microsoft ruining GitHub is completely unfounded. 

My concern has nothing to do with Microsoft. It's about not totally relying on any third party not under our control.