The tao of R-coding

From Sustainability Methods
Revision as of 12:44, 24 October 2020 by HvW

>The Tao of R-coding< I learned from many, and many learned from me. Many of my alumni will question details that I write here, as they have different views on coding. I think this is a good thing. After all, you have to find your own meaning for the tao of R-coding.

Learning to write code in R takes a long time. This text will not take that time away from you, but it may help you find some shortcuts here and there; hopefully you will learn faster. I made many mistakes in my time, and I hope that this will go on, since otherwise there would be nothing left to learn. Still, I am glad to say that I can recognise a difference between knowledge and experience, and this key difference is the main insight one can gain on the way to mastery. I believe that once you have felt bored by an R-coding problem, you are as close to wisdom as one can get. Let us begin.

First, the setting. Perfecting patience in R-coding demands the right surroundings. Some people need a defined setting, others can code anywhere. I think this distinction is actually inaccurate: I just think the latter are better at adapting, and would do their prime work in a well-defined setting as well, but this is just my own experience. I thought I was very adaptive, but learned that I too benefit from a defined setting.

Place

You need to code regularly, and a lot. Hence it is good to design a space where you can sit easily for hours on end and focus on your coding. You do not need the best laptop, but ideally one that is a tool you feel comfortable with. There is the old Mac vs. PC rant, and I guess this will never be settled. However, it is good if you can afford a computer that works well for you. Some people use a mouse and a keyboard, which can be beneficial. However, I advise you to use as few mouse clicks as possible. Try to rely on keyboard shortcuts instead, which are the easiest bits of code that you should learn first. Switching between windows, copy & paste, tab settings and shortcuts are the rudiments you need to master. In my experience, a second screen can get you a long way. For students, a tablet can be a welcome extension of the screen, as it allows for more mobility. Many people can code everywhere, and this is an advantage. Therefore it is important to get a laptop, and to have a good backpack to allow for mobility and relocation.

Time

You need to code as much as you can. Still, some time slots seem to work better for some people than others. Find your sweet spot, and more importantly, practice! Coding should be a reward you look forward to, punctuating the other duties and activities of the day. Still, it needs to be a pronounced focus if you really plan to get good at it. In my experience, a short break can sometimes get you much further. In addition, roughly planning your code beforehand is really helpful. To this end, coding is just like any other form of writing. To conclude, find the mode of practice that works best for you, since only daily practice allows you to evolve. Lean back, and start coding.


The hardware

If you learn programming, there is a high likelihood that you will stay in one place for at least some years. If your budget permits, consider a sedentary commitment: a second screen can make a real difference when programming, and using an external screen as the main device with the laptop screen as a secondary one can also be easier on your eyes and neck, as the screen position can be higher. Some people are also faster on an external keyboard, which can be adjusted to different typing poses. Lastly, while you should minimise the use of a mouse or touchpad by all means, I prefer a touchpad, mostly because it is more agile in Final Cut. Overall, I think the specs of the computer matter less than people think they do. In my experience, you either do not wait at all for a calculation, or you wait very, very long, if not too long, anyway. In that case, you need to find an alternative server for your calculations regardless. In R, this hardly ever happens, at least while learning to code. If something calculates for a really long time, in most cases you made a mistake. Only large data or endless loops can stall the average calculation. Bayesian stuff may pose a problem, and larger data. Hence try to avoid the latter, and brace for the Bayesian coffee break. Where the hardware is related to the software is in the language settings. Macs and PCs differ in their language settings, and it is good to consider this difference when it comes to comma delimiters, number formats etc. These settings often have a severe influence on Excel and csv files, which is why it is wise to use number formats without any separators at the 000 number groups, and to definitely use points and not commas as decimal marks. Also, switch your computer to English settings, since other settings are a source of many errors. Please forget that any signs outside of the English language exist, since these are a constant source of error.
In addition, some internet resources are a bit picky when it comes to the browser, so using at least two browsers seems preferable; Firefox has proven to be rather robust.
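To illustrate the delimiter issue in R: read.csv() assumes comma separators and a decimal point, while read.csv2() assumes semicolons and a decimal comma, the default in many non-English Excel installations. A small sketch with in-memory data (the species names and values are just examples):

```r
# A semicolon-separated file as exported by a non-English Excel:
txt <- "species;cover\nStipa;12,5\nArtemisia;7,3"

# read.csv2() uses sep = ";" and dec = ",": the numbers parse correctly.
dat <- read.csv2(text = txt)
dat$cover        # numeric: 12.5 and 7.3

# read.csv() uses sep = "," and dec = ".": the file is silently mangled,
# because the semicolons are never split.
bad <- read.csv(text = txt)
ncol(bad)        # 1 column instead of 2
```

The failure is silent, which is exactly why mixed locale settings are such a persistent source of errors.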


The software and settings

The bare minimum is to install R, though most people these days also install RStudio. It is also good to have a code file that automatically installs the most important packages you may need. Slack or other chat software is essential to exchange code with others. However, remember that social media can be a time sink. Many R programmers rely on GitHub, Reddit and other such internet forums. Considering the organisation of your hard drive, build a long-term structure. In five years you will not remember in which semester you did this course, yet time goes on. Try to use a structure that works long term, but focusses on your current work. I propose that folders mirroring a Kanban structure can be quite helpful, allowing you to systematise the projects you are currently working on. Also, you may want to link this to a Notion-fuelled database, which can help you keep track of your projects and deliverables. Some people also use Notion as a platform for a second brain; however, I use Roam Research to this end, because it resembles my personal structure of thinking more. One last point: get backups. Nothing is more horrible to a programmer than data loss. Have at least three physical backups, and ideally also at least two online backups. Make a clear backup plan, with short-term, mid-term and long-term backups that give you a nested structure in case of a fatal system flaw. I once almost went through a data loss, and I never want to have that feeling again. Hence I have a dozen hard drive backups (ok, a bit too much), two Time Machines, a RAID, and Dropbox as well as iCloud. In addition, I always have an SSD with me and use MicroSD cards for the core business. I divide my data into four sections: 1) All R stuff, which is as of today about 22 GB. This is the most important data plus code. The code files are a few MB, but I prefer to keep some data in case I want to get back to it.
For some more intense calculations I also saved the workspace to a file, but these are rare exceptions. 2) The larger datasets are on hard drives, where I keep a backup structure to have the data associated with published papers on two hard drives, one at university and one at home. 3) All writing and presentations (about 60 GB, mostly because of teaching presentations; the Word files are a few hundred MB), so basically all "office"-related files, are in my documents folder. This one is synced across all my machines, so if I work on the laptop it automatically syncs to my desktop computer. The latter is also connected to 4) All movie files. Filming eats up hard drive space in terabytes, so I moved to keeping only the raw originals, the Final Cut projects and the finally rendered movies. Everything in between I delete once a project is finished. I use my desktop folder as a big turnaround folder to download stuff, work on it right away and send it back. Every few months I move all this stuff into a folder, delete the originals, and back up this folder. When working with data from the internet, remember that the internet is changing, and sometimes data will be gone. Hence keeping a copy for the more important projects can be a lifesaver. Also, always add a small txt file with the meta information of the downloaded file in the folder, containing the link where and the date when you downloaded the file. I think the desktop is generally a bad place to save your data. In addition, I would advise you to establish a folder structure that works long term, as you will not remember in a few years which course you had in which year. I work in a Kanban structure for the publications I currently work on, and move finished projects into folders sorted by country. In terms of software, my main workhorses are an e-mail client, Slack, Zoom, Safari, Firefox, Word, Excel, Keynote, Notion, Drafts, RocketChat and Final Cut. Lastly, of course, R.
I can highly endorse Slack for all people who are coding, and consider it very convenient for collaboration focussed on code. Also, I use Notion to structure larger projects and develop nested structures for planning them. An example would be the data science class, which is one project; the different days are sub-projects, and slides, sessions etc. are smaller tasks. Within such smaller tasks, I discovered that nested toggle lists work best for me, and I also use them to plan the analysis of a larger dataset. Nested toggle lists match my brain structure when planning. However, I also use Roam Research to build a second brain. In my case, this is almost exclusively focussed on my work on the normativity of methods. Ever since I started using a second brain, no precious thought or bit of information is lost; instead it is integrated into a growing structure in Roam Research. I wish I had started earlier with this. So many thoughts are lost forever.
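A minimal sketch of such an install-and-load file; the two package names here are placeholders, and you would swap in the packages you actually use (e.g. vegan, ggplot2):

```r
# Packages this project needs -- placeholder names for illustration.
pkgs <- c("MASS", "lattice")

# Install anything that is not yet on this machine...
to_install <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(to_install) > 0) {
  install.packages(to_install)
}

# ...then load everything. library() needs character.only = TRUE
# when the package name arrives as a string.
invisible(lapply(pkgs, library, character.only = TRUE))
```

Sourcing a file like this at the top of every project means a fresh machine is ready to run your scripts after a single call.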

Coding

Everything up until now was only pretext, and now we come to the real deal: coding. Writing code in R is like writing a language. You have dialects, different capabilities of the tongue, and even those minute details that make us all different when we write and talk. You need to find your own way of writing code in R, yet there are many suggestions and norms for how people write code in R. Generally, when thinking in code, I think in a nested structure of at least four different levels: 1) Scripts, 2) Sections, 3) Lines and 4) Commands. Let's break these down individually.

1) Scripts

You should develop a clear procedure for how you name your files. If they have the name of the folder they are in, it does not make any sense. If they all have the same name within each folder, it also does not make any sense. Searching your hard drive for the "main_analysis" file will not get you anywhere if you call the central file that in every folder. I tend to name files after the methodological step plus the project name, for example "data_prep_gobi_plant_diversity" for data preparation. It can be good to add a status stamp, which should be something other than the date you last saved the file, because that information is available anyway. I tend to just label files as starting, pending or finished. Within the scripts, I always have the same structure. I start with all libraries I need to load, and then continue with the data I am loading. Then I continue with data preparation and reformatting. The typical next steps are initial data inspection commands, often intermingled with some simple plots that serve more as data inspection. Then I shift to deeper analysis, and also to some final plots. It is important to label every section with a hashtag label structure, and to divide the sections by some empty lines. I often tend to copy code towards the end of the script if it proved important in the beginning but then shifted out of the main focus. Sometimes you wish you could go back, which is why it can be helpful to keep different versions of your base code. In this case, I recommend that you save the files with the date, since names such as "test", "test_again", "once_more", "nonsense", "shit", "shitoncemore", "nowreally" and "andagain" have proven to be emotionally fulfilling at first, but confusing down the road. What is most relevant to me within the script is to clearly indicate open steps if you need to take a break or want to continue coding later.
Writing a little line to your future self can be really helpful. Also, clearly describe all steps that were part of your final report or publication. You may think you will remember all this, but if you need to redo an analysis after you get revisions for a manuscript, you might not remember what you did six months ago. Some labels noting which code creates which figure can be essential to avoid this problem. Lastly, write into the code which version you used, and when you last updated R. On rare occasions packages change, and sometimes even the analysis may reveal different patterns. While this may sound alarming, it is a normal part of science evolving. So far, this has happened three times to me, but I can clearly state that these three occasions were all very special in their own wonderful way. A last word on the length of a script. I think every script should be like a poem, and thus should be longer than a haiku but shorter than, say, a thousand lines. If the code is too long, you will have trouble finding things. Even the two longest scripts I ever compiled did not exceed a thousand lines. If yours does, build a nested structure, and try to divide the code into data crunching, analysis and plotting, or other steps. Make your code like a poem: it should not be too long, and it should be elegant. Take this advice from someone who could not be further away from elegant code. More on this later.
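A skeleton of such a script, with hashtag section labels, a note to my future self, and the session information recorded at the end. The file name, data and project are hypothetical:

```r
## data_prep_gobi_plant_diversity.R -------------------------------
## Purpose: data preparation, Gobi plant diversity (hypothetical)
## Status:  pending -- TODO: recheck plot 17 before the final run

## 1. Libraries ---------------------------------------------------
library(stats)   # placeholder; your real packages go here

## 2. Load data ---------------------------------------------------
# veg <- read.csv("gobi_plots.csv")   # hypothetical file
veg <- data.frame(plot = 1:5, richness = c(12, 9, 15, 7, 11))

## 3. Data preparation and reformatting ---------------------------
veg$log_richness <- log(veg$richness)

## 4. Initial inspection ------------------------------------------
str(veg)
summary(veg$richness)

## 5. Deeper analysis and final plots -----------------------------
# plot(richness ~ plot, data = veg)   # creates Figure 1 of the report

## 6. Record the R version used -----------------------------------
sessionInfo()
```

The section labels double as a table of contents, and in RStudio the trailing dashes make each section foldable and navigable.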

2) Code sections

I already mentioned labelling all the different code sections. However, we should not forget about the poetry of the code itself. Different sections in code are like different verses in a poem. They should be consistent, and contain some inner logic and flow. As already said, each section should have a label line that describes what the section is about. I try to balance sections, and when they get too long, I divide them into different sections. This is however less guided by their length, and more by the density of information they contain. If a section is long but repetitive, that is ok, as the information density is comparable to shorter sections that are more complicated. I guess everybody needs to find their own rhyme scheme. Nothing would be worse than if we all coded the same, as innovation would become rarer and rarer. Therefore, I propose you experiment with your section composition, and try to find your own style. This is also the part where one can learn much from other people, as their solutions can give you a greater diversity of approaches to integrate for yourself. If you have repeated structures in your code, it might be good to keep the repeating sections in a form that makes them easy to adapt whenever necessary. Using tabs to make nested structures can be very helpful to define an inner logic for how sections are nested within each other. You can have, for instance, the larger steps at the main level, smaller subsections with one tab, and then individual steps within those with two tabs. Sections are the unit of code that can be quickly exchanged between people, as a section is often enough to understand what a specific step is about, yet not so long that it takes forever to understand everything. Hence exchange code snippets with others to get feedback, and to learn from each other. A really elegant section of code can be like a haiku.
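One way to keep a repeating section easy to adapt is to wrap it once and call it per dataset, so a change only has to be made in one place. A small sketch; the inspect() helper and the data are made up for illustration:

```r
## A repeated inspection step, written once instead of copy-pasted.
## Changing what gets reported now means editing one function only.
inspect <- function(x, label) {
  cat("##", label, "----\n")                # section-style label line
  c(n = length(x), mean = mean(x), sd = sd(x))
}

## Hypothetical samples:
a <- c(2, 4, 6)
b <- c(1, 1, 2, 3)

inspect(a, "Sample A")   # n = 3, mean = 4
inspect(b, "Sample B")
```

Each call prints its own label line, so the output stays readable even when the same step recurs many times in a script.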