The tao of R-coding

From Sustainability Methods

In short: This entry provides guidance on R coding in general.

Pretext

I learned from many, and many learned from me. Many of my alumni will question details that I write here, as they have different views on coding. I think this is a good thing. After all, you have to find your own meaning for the tao of R-coding.

Learning to write code in R takes a long time. This text will not take this time away from you, but it may help you to find some shortcuts here and there, which may hopefully help you learn faster afterwards. I made many mistakes in my time, and I hope that this will go on, since otherwise there is nothing left to learn. Still, I am glad to say that I can recognise a difference between knowledge and experience, and this key difference is the main insight one can get into mastery. I believe that once you have been thoroughly bored by an R-coding problem, you are as close to wisdom as one can get. Let us begin.

You in your surrounding

First, the setting. Perfecting patience in R-coding demands the right surrounding. Some people cannot code everywhere, others can code everywhere. I think this is actually inaccurate; the latter are simply better at adapting, and would do their prime work in a well-defined setting as well, but this is just my own experience. I thought I was very adaptive, but learned that I benefit from a defined setting as well.

Place

You need to code regularly, and a lot. Hence it is good to design a space where you can sit easily for hours on end and focus on your coding. You do not need the best laptop, but ideally one that is a tool you feel comfortable with. There is the old Mac vs. PC rant, and I guess it will never be settled. However, it is good if you can afford a computer that works well for you. Some people use a mouse and a keyboard, which can be beneficial. However, I advise you to use as few mouse clicks as possible. Try to rely on keyboard shortcuts instead, which are the easiest bits of code that you should learn first. Switching between windows, copy & paste, tab setting and shortcuts are the rudiments you need to master. In my experience, a second screen can get you a long way. For students, a Padlet can be a welcome extension of the screen, as it allows for more mobility. Many people can code everywhere, and this is an advantage; this is why a laptop and a good backpack that allow for mobility and relocation are so important.

Time

“There is something to be learned from a rainstorm. When meeting with a sudden shower, you try not to get wet and run quickly along the road. But doing such things as passing under the eaves of houses, you still get wet. When you are resolved from the beginning, you will not be perplexed, though you will still get the same soaking. This understanding extends to everything.” - Hagakure

You need to code as much as you can. Still, some time spots seem to work better for some than other time spots. Find your sweet spot, and more importantly, practice! Coding should be a reward you look forward to, interspersed between the other duties and activities of the day. Still, it needs to be a pronounced focus if you are really planning to get good at it. In my experience, a short break can sometimes get you much further. In addition, planning your code beforehand in a rough form is really helpful. I feel that to this end, coding is just like any other form of writing. To conclude, find the mode of practice that works best for you, since only daily practice can allow you to evolve. Lean back, and start coding.

The hardware

When you learn programming, there is a high likelihood that you stay in one place for at least some years. If your budget permits, how about making a sedentary commitment: a second screen can make a real difference when programming, and using a screen as the main device and the laptop screen as a secondary screen can also be more comfortable on your eyes and neck, as the position of the screen can be more upward. Some people are also faster on an external keyboard, and it can be adjusted to different typing poses. Lastly, while you should minimise the use of a mouse or touchpad as much as possible, I prefer a touchpad, mostly because it is more agile in Final Cut. Overall, I think the specs of the computer matter less than people think they do. In my experience, you either do not wait at all for a calculation, or you wait very, very long - if not too long - anyway. In that case, you need to find an alternative server to run your calculations anyway. In R, this hardly ever happens, at least while learning to code. If something calculates really long, in most cases you made a mistake. Only large data or endless loops can stall the average calculations. Bayesian approaches and larger datasets may pose a problem; try to avoid the latter, and brace for the Bayesian coffee break.

Where the hardware relates to the software is in the language settings. Macs and PCs differ in their language settings, and it is good to consider this difference when it comes to comma delimiters, number formats etc. These settings often have a severe influence on Excel- and .csv-files, which is why it is wise to avoid dots as thousands separators, and to definitely use points and not commas as decimal marks. Also, switch your computer to English settings, since other locales are a source of many errors. Please forget that any letters outside of the English alphabet exist, since they are a constant source of error. In addition, some Internet resources are a bit picky when it comes to the browser, hence using at least two browsers seems preferable, and Firefox has proven to be rather robust.
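To illustrate the delimiter issue, here is a minimal sketch in base R; the file names are made up, and which function you need depends on the locale of the machine that produced the file.

```r
# Comma-separated file with "." as decimal mark (typical English-locale export);
# the file names below are only placeholders.
dat_en <- read.csv("survey_data.csv", sep = ",", dec = ".")

# Semicolon-separated file with "," as decimal mark,
# which Excel often produces on German-locale systems.
dat_de <- read.csv2("survey_data_german.csv")   # equivalent to sep = ";", dec = ","

# If in doubt, look at the first raw lines to see which delimiter is used.
readLines("survey_data.csv", n = 3)
```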


The software and settings

I believe that R base is the weapon of a Jedi. Not as clumsy or random as R Studio; an elegant weapon for a more civilised age.

The bare minimum is to install R, though most people these days also install R Studio. It is also good to have a code file that automatically installs the most important packages that you may need. Slack or other chat software is essential to exchange code with others. However, remember that social media can be a time sink. Many R programmers rely on GitHub, Reddit and other such Internet forums.
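Such a setup file does not need to be long. A minimal sketch could look like this; the package selection is only an example.

```r
# Install any of the listed packages that are missing, then load them all.
# The package list is only an illustration.
pkgs <- c("vegan", "ggplot2", "dplyr")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) {
  install.packages(missing)
}
invisible(lapply(pkgs, library, character.only = TRUE))
```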

Considering the organisation of your hard drive, make a long-term structure. You will not remember in five years in which semester you did this course, yet time goes on. Try to use a structure that is long term, but focuses on your current work. I propose that folders that follow a Canvas structure can be quite helpful, allowing you to systematise the project you are currently working on. I organise the publications I am currently working on in a Canvas structure, and move finished projects into folders that are sorted by countries. Also, you may want to link this to a Notion-fuelled database, which can help you to keep track of your projects and deliverables. Some people also use Notion as a platform for a second brain; however, I use 'roam research' to this end, because it resembles my personal structure of thinking more.

Make backups. Nothing is more horrible to a programmer than a data loss. Have at least three physical backups, and ideally also at least two online backups. Make a clear backup plan, where you have short-term, mid-term and long-term backups that allow you a nested structure in case you have a fatal system flaw. I once almost went through a data loss, and I never want to have that feeling again. Hence I have a dozen hard drive backups (ok, a bit too much), two Time Machines, a RAID, and Dropbox as well as iCloud. In addition, I always have an SSD with me and use MicroSD cards for the core business. I divide my data into four sections:

  • 1) All R stuff, which is as of today about 22 GB. This is the most important data plus code. The code files are a few MB, but I prefer to keep some data in case I want to get back to it. For some more intense calculations I also save the workspace into a file, though these are rare exceptions (see the small sketch after this list).
  • 2) The larger datasets are on hard drives, where I keep a backup structure to have the data associated with published papers on two hard drives, one at the university and one at home.
  • 3) All writing and presentations (about 60 GB, mostly because of teaching presentations; the Word files are a few hundred MB), so basically all "office"-related files are in my documents folder. This one is synced across all my technical units, so if I work on the laptop it automatically syncs to my desktop computer. The latter one is also connected to the drives that hold the last section:
  • 4) All movie files. Filming eats up hard drive space in terabytes, and I went over to only keeping the raw originals, the Final Cut projects and the finally rendered movies. Everything in between I delete once a project is finished.
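As for saving a workspace or a single heavy result, a minimal sketch looks like this; the file names and the example object are placeholders.

```r
# A small example object standing in for a heavy model fit
model_fit <- lm(Sepal.Length ~ Species, data = iris)

# Save the complete workspace after a long calculation (file name is a placeholder)
save.image("gobi_analysis_workspace.RData")

# Often it is enough to save one heavy object instead...
saveRDS(model_fit, file = "model_fit.rds")

# ...and restore it in a later session
model_fit <- readRDS("model_fit.rds")
```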

I think the desktop is generally a bad place to save your data, so I use my desktop folder only as a big turnaround folder to download stuff, work on it right away and send it back. Every few months I move all this stuff into a folder, delete the originals, and back up this folder. When working with data from the Internet, remember that the Internet is changing, and sometimes data will be gone. Hence keeping a copy of the more important projects can be a lifesaver. Also, always add a small text file with the meta-information of the downloaded file in the folder, containing the link from which and the date on which you downloaded the file.
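Such a metadata note can even be written from within R. A minimal sketch, with a made-up file name and URL:

```r
# Write a small metadata note next to a downloaded file
# (file name and URL are made up for illustration).
meta <- c(
  "file:   gobi_climate_grid.csv",
  "source: https://example.org/climate/gobi",
  paste("downloaded:", Sys.Date())
)
writeLines(meta, "gobi_climate_grid_META.txt")
```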

My everyday software consists of Firefox, Word, Excel, Keynote, Notion, Drafts, RocketChat, and Final Cut. Lastly, of course, R. I can highly endorse Slack for all people that are coding, and consider it to be very convenient for collaboration that is focused on coding. Also, I use Notion to structure larger projects, and develop nested structures for planning larger projects. An example would be the data science class, which is one project, the different days are sub-projects, and slides, sessions etc. are smaller tasks. Within such smaller tasks, I discovered that nested toggle lists work best for me, and I also use this to plan the analysis of a larger dataset. Nested toggle lists match my brain structure when planning. However, I also use 'roam research' in order to build a second brain. In my case, this is almost exclusively focused on my work on Normativity of Methods. Ever since I started using a second brain, no precious thought or bit of information has been lost; instead it is integrated into a growing structure in 'roam research'. I wish I had started earlier with this. So many thoughts are lost forever.


Coding

“When one is writing an R-Code for someone, one should think that the recipient will make it into a hanging scroll.” - Hagakure

Everything up until now was only pretext, and now we come to the real deal: coding. Writing code in R is like writing in a language. You have dialects, different capabilities of the tongue, and even the little minute details that make us all different when we write and talk. You need to find your own way of writing code in R, yet there are many suggestions and norms for how people write code in R. Generally, when thinking in code, I think in a nested structure of at least four different levels: 1) Scripts, 2) Sections, 3) Lines and 4) Commands. Let's break these down individually.

Scripts

You should develop a clear procedure for how you name your files. Naming a file after the folder it sits in does not make any sense, and neither does giving all files the same name within each folder. Searching your hard drive for the "main_analysis" file will not bring you anywhere if you call the central file like this in every folder. I tend to name files after the methodological step and the project, for example "data_prep_gobi_plant_diversity" for the data preparation of a project on plant diversity in the Gobi. It can be good to add a timestamp, which should be something other than the date you last saved the file, because this information is available anyway. I tend to just label files as starting, pending or finished.

Within the scripts, I always have the same structure. I start with all libraries I need to load, and then continue with the data I am loading. Then I proceed with data preparations and reformatting. The typical next steps are initial data inspection commands, often intermingled with some simple plots that serve more as a data inspection. Then I shift to deeper analysis, and also to some final plots. It is important to label every section with a hashtag-label structure, and to divide the sections by some empty lines. I often tend to copy code towards the end of the script if it has proven to be important in the beginning but then maybe shifted out of the main focus. Sometimes you wish you could go back, which is why it can be helpful to keep different versions of your base code. In this case, I recommend that you save the files with the date, since names such as "test", "test_again", "once_more", "nonsense", "shit", "shitoncemore", "nowreally", "andagain" have proven to be emotionally fulfilling at first, but confusing down the road.
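A minimal skeleton of such a script might look as follows; it uses a built-in dataset so that it runs as is, whereas the real sections would of course hold your own data and models.

```r
#### Libraries ####
library(stats)                    # placeholder; list all packages you need here

#### Data import ####
dat <- iris                       # stands in for read.csv("your_data.csv")

#### Data preparation ####
dat$petal_ratio <- dat$Petal.Length / dat$Petal.Width

#### Initial inspection ####
str(dat)
summary(dat$petal_ratio)
hist(dat$petal_ratio)             # quick look, not a final plot

#### Analysis ####
fit <- lm(petal_ratio ~ Species, data = dat)
summary(fit)

#### Final plots ####
boxplot(petal_ratio ~ Species, data = dat)
```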

What is most relevant to me within the script is to clearly indicate open steps if you need to take a break, or want to continue to code later. Writing a little line to your future self can be really helpful. Also, clearly describe all steps that were part of your final report or publication. You may think you remember all this, but if you need to redo an analysis after you get a revision for a manuscript, you might not remember what you did six months ago. Labelling which code creates which figure can be essential to avoid this challenge. Lastly, write into the code which R version you used, and when you last updated R. On rare occasions packages change, and sometimes even the analysis may reveal different patterns. While this may sound alarming, this is a normal process of science evolving. So far, this has happened three times to me, but I can clearly state that these three occasions were all very special in their own wonderful way.
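A small sketch of both habits, labelling figure provenance and recording the version; the figure number refers to a hypothetical manuscript.

```r
#### Figure 2 of the manuscript: sepal length per species ####
# (the figure number refers to a hypothetical manuscript)
pdf("figure_2_sepal_length.pdf")
boxplot(Sepal.Length ~ Species, data = iris)
dev.off()

# Record which R and package versions produced the results
sessionInfo()
R.version.string   # short form, handy for pasting into a comment
```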

A last word on the length of one script. I think every script should be like a poem, and thus should be longer than a Haiku, but shorter than - say - a thousand lines. If the code is too long, you will have trouble finding things. Even the two longest scripts I ever compiled did not exceed a thousand lines. However, if your code does grow beyond this, build a nested structure, and try to divide the code into data crunching, analysis and plotting, or other steps. Make your code like a poem: it should not be too long, and it should be elegant. Take this advice from someone who could not be further away from an elegant code. More on this later.
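One common way to build such a nested structure is a short master script that calls the parts in order; the file names here are made up.

```r
# Master script that runs the whole analysis step by step
# (file names are made up for illustration)
source("01_data_crunching.R")
source("02_analysis.R")
source("03_plotting.R")
```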

Code sections

I already mentioned to label all different code sections. However, we should not forget about the poetry of the code itself. Different sections in code are like different verses in a poem. They should be consistent, and contain some inner logic and flow. As I already mentioned, each section should have a label line that describes what the section is about. I try to balance sections, and when they get too long, I divide them into different sections. This is however less guided by their length, but more by the density of information they contain. If a section is long but repetitive, this is ok, as the information density is comparable to shorter sections that are more complicated. I guess everybody needs to find their own rhyme scheme. Nothing would be worse than if we all coded the same way, as there would be less and less innovation. Therefore, I propose you experiment with your section composition, and try to find your own style. This is also the part where one can learn much from other people. Their solutions might give you a better diversity of approaches that you can integrate for yourself. If you have repeated structures in your code, it might be good to keep repeating sections in a way that makes it easy to adapt them whenever necessary. Using tabs to make nested structures can be very helpful to define an inner logic for how sections are nested within each other. For instance, you can have the larger steps as main tabs, smaller subsections with one tab, and then individual steps within those with two tabs (see the small sketch below). Sections are the unit of code that can be quickly exchanged between people, as a section is often enough to understand what a specific step is about, yet not so long that it takes too long to understand everything. Hence exchange code snippets with others to get feedback, and to learn from each other. A really elegant section of code can be like a Haiku.
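A small sketch of what such a nested section could look like; how far you indent the levels is of course a matter of taste.

```r
#### Main step: one regression per species ####
  ## Subsection: split the data and loop over the parts
results <- lapply(split(iris, iris$Species), function(sub) {
    # Individual step: the actual model call
    lm(Sepal.Length ~ Sepal.Width, data = sub)
})
```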

Lines of code

While it is easy to label sections with hashtags, and you may also add an explanation to a single line of code, the latter is only advisable if the explanation is short and the line is not self-explanatory. If you added a hashtag explanation to every line, it would probably look a bit clumsy. What is however often the case is that lines become too long. Personally, I think this should be avoided. Some graphic command may go over several lines, but then it is about specifications, and these should become part of your intuitive understanding in the long run. However, many learners want to show off in the beginning and make super long code lines, and these are hard to grasp for others, and also for you after a long time. Try to contain your lines, so they do not become dragons. Code lines are like poetry - they should make sense instantly. I think a good line solves exactly one problem. Also, a good line should look effortless. Try not to impress anyone, least of all yourself. You need to find your own rhyme scheme and rhythm, but lines should ideally flow in the long run. You might need to reconsider and think every now and then, but the majority of your lines should be part of your repertoire.
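A small before-and-after sketch of the same plotting call, first crammed into one line, then with one specification per line:

```r
# Hard to grasp: everything crammed into one long line
plot(iris$Sepal.Length, iris$Petal.Length, col = as.numeric(iris$Species), pch = 19, xlab = "Sepal length [cm]", ylab = "Petal length [cm]", main = "Iris")

# Easier on the eye: the same call, one specification per line
plot(iris$Sepal.Length, iris$Petal.Length,
     col  = as.numeric(iris$Species),
     pch  = 19,
     xlab = "Sepal length [cm]",
     ylab = "Petal length [cm]",
     main = "Iris")
```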

R Language

Just like any other language, R can be learned and has different dialects and accents. Just as there are wonderful words such as serendipity and bamboozlement in the English language, there are fantastic words in R. You need to become a connoisseur of R commands and specifications. Many of the less important arguments can be the most relevant ones for shortcuts, and a loop or a reduction can bring you a long way when doing repetitive tasks. There are differences in terms of grammar, which often also depend on packages. However, especially in data crunching, a small loop can bring you a long way. Luckily, today there are many, many resources on the Internet, and almost all questions can be asked in one of the larger search engines, and will reveal at least some vital steps towards a solution.
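As a small sketch of the repetitive-task point, here is the same job done once with a loop and once with a one-line shortcut:

```r
# The mean of every numeric column of a built-in dataset, first as a loop
for (column in names(iris)[1:4]) {
  cat(column, ":", mean(iris[[column]]), "\n")
}

# ...and the same with a one-line shortcut
colMeans(iris[ , 1:4])
```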

The first of the initial steps is to start building your own stock of commands.
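A possible starter set, applied to a built-in dataset so it runs as is; which commands end up in your personal stock will of course differ.

```r
?mean                   # open the help page of a command
str(iris)               # structure of an object
head(iris)              # first rows of a dataset
summary(iris)           # basic summary statistics
table(iris$Species)     # counts per category
dim(iris)               # size of the dataset
names(iris)             # column names
```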

These norms and support structures are a great way to start. However, I think the most important way to practice is to work together with peers, help each other, review each other's code, dissect problems within a larger circle of learners, and brief each other on your progress. This is how you make the fastest progress.


The Tao of data analysis

“There is surely nothing other than the single purpose of the present moment. A person's whole life is a succession of moment after moment. There will be nothing else to do, and nothing else to pursue. Live being true to the single purpose of the moment.” - Hagakure

Pretext

The next best step should be to go to the data() examples and just look at as many as possible. More datasets come with the individual packages. You may want to check out at least a few hundred examples to understand the rudiments. There is data galore on the web, hence it is vital to look at many examples, and diverse examples. Data is constructed, and these days there are all sorts of constructs available, and all sorts of samples, often through the Internet. The availability and diversity of datasets is ever increasing, and it is vital to become versatile in the different viewpoints that data can offer. You learn to look at parts of the picture, and this does not only imply questions about data quality and bias, but also about the nature of data itself. Hence your data arise from a part of reality, but your analysis can also change reality. Data analysis is a question of responsibility.
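A minimal sketch of how to browse these examples; the vegan package in the last line is just one example of a package that ships with additional datasets.

```r
# List the datasets that come with the currently loaded packages
data()

# Load one example and look at it from several angles
data(airquality)
str(airquality)
summary(airquality)
pairs(airquality)                 # quick visual overview

# Datasets that ship with an additional package (assuming it is installed)
# data(package = "vegan")
```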

The scripts of the ancients

The more you know about the knowledge of the old, the more you will be able to derive from your data. While it is true that no method can solve everything, all methods can solve more than nothing. Being experienced in scientific methods is of utter importance to get the best initial understanding of any given dataset. Statistics stand out to this end, as they are the basis of most approaches related to numbers. However, one has to recognise that there is an almost uncountable number of statistical approaches. Start with the most important ones, as they will help you to cover most of the way. Take correlations. I think a simple correlation can be understood within a few hours, maybe days. You will get the general mechanics, the formula, the deeper mathematics behind it. You can even square this with the textbook preconditions, the interpretation, maybe even the limitations. You are all set. However, this will help you as much as practicing a punch alone and without an experienced teacher when you suddenly find yourself in a real fight. You may be able to throw the punch the way you practised it, but your opponent will not stand in the spot you practised to hit. Reality is messy; you need to be agile and adaptive. R coding is like kung fu: you need a lot of practice, but you also need to get into peer-to-peer practice, and you need an experienced teacher, and learn from the Ancient Ones. Just as every attack of an opponent is different, every dataset is different. However, when you are versatile, you can find your goal no matter what. As the Hagakure says: "Taking an enemy on the battlefield is like a hawk taking a bird. Even though one enters into the midst of thousands of them, it gives no attention to any bird other than the one it has first marked." Finding patterns in data is exactly like this. Once you become experienced and versatile, this is how you will find patterns in data.
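The textbook side of that correlation example fits into a few lines; a minimal sketch on a built-in dataset:

```r
# The textbook correlation, on a built-in dataset
data(airquality)
cor(airquality$Temp, airquality$Ozone, use = "complete.obs")   # Pearson's r
cor.test(airquality$Temp, airquality$Ozone)                    # with test statistic and p-value
plot(Ozone ~ Temp, data = airquality)                          # and always look at the data, too
```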

There is also much new knowledge that is evolving, as data science is a thriving arena. Being versatile in the basics is one thing, but a true master of data science needs to rely equally on the ancients as well as the revolutionary renegades. They all offer knowledge, and we should perceive their knowledge and experience as pure gold that is potentially true. Some methods are established, and it is a good thing to know these. Other methods are evolving, and it is equally good to know the latest tricks and developments. Much creativity, however, comes from combining the old and the new schools of thinking. Innovation within scientific methods is often rooted in the combination of different scientific methods. These unlock different types of knowledge, which is often appropriate to acknowledge the complexity within many datasets. Combining approaches is the path to new knowledge, and new knowledge is what we need these days quite often, since the old knowledge has not solved the problems we face, and new problems are emerging. When you want to approximate solutions to these problems, you have to be like water.

Non-duality in R-coding

Non-duality in data science relates to the difference between predictive power and explanatory power. Any given dataset can be seen from these two perspectives, and equally from none of these perspectives. This is what non-duality of data science is all about. You need to learn to see both in data, and also none. Predictive power and explanatory power are one and the same, and they are not. As the ancients said, it is bad if one thing becomes two. The same is true for data analysis. Many guide their analysis through predictive power, and they become obsessed by the desire to have what they call "the best model". Thoughts on the goodness of fit circulate in their heads like wild dragons, and they never manage to see anything but the best model they can possibly achieve, and hence they fail. Many compromises have been made by people to find the best model, and sadly, the best model may never be found. As the Ancient Ones said: all models are wrong, some models are useful. One should never forget this.

Equally, there are people who obsess about the explanatory power of models. They see the promise of causality in the smallest patterns they find, and never stop iterating about how it all makes sense now. Much has been explained in the past, but much may remain a mystery. Much that once was causal is lost, for no papers are available to remember it. The knowledge of people evolves, and with it the theories that our causality is often built upon. The predictive and the explanatory people even go to war against one another, claiming victory after victory in their fight over superiority. However, many souls were lost in these endless quests, and to me it remains unclear if any real victory was ever won in this eternal iteration on whether patterns are causal or predictive. Both approaches can make sense, and the clever ones never went down the paths of priority, and avoided claiming what is better. They simply claimed their worldview, and were fine. They each lived in their own of the two realms, and ignored the other realm completely. Many thus lived a happy life, half ignorant, but happy. These people are like the Ancient Ones living in their kingdoms of old, with scientific disciplines and textbook knowledge, where kingdoms fought other kingdoms at times, but there was also peace and prosperity.

What can be called a 'modern data scientist' will know nothing of these worlds of old, but will become unattached to their data. One should be aware that all data is normative, but pattern detection and analysis are ideally done with a beginner's mind, knowing everything and nothing at the same time. As the Ancients said, matters of great concern should be treated lightly, to which someone replied that matters of small concern should be treated seriously. The same is true for data analysis. We need to remember that the world is constructed, and that our data is only looking at parts of the picture. Despite all that, one should become detached from the analysis itself and versatile at it at the same time. Then coding becomes a form of art, and one can exceed one's own expectations. Your code becomes a statement, and even if people admire it, this does not matter to you. And thus you become the coder that you should be, and never will be. Then coding becomes a way.

There are small mindfulness exercises in R-coding that have proven beneficial in the past: 1) Chant the mantra "library". l-i-b-r-a-r-y. Only by perfecting this mantra will you master the art of spelling the word correctly, and of loading a library. 2) Close the bracket. Through endless training, you can master the ancient art of closing every bracket you ever opened, understanding the inner circularity of life itself, and of R-coding. Brackets in R are like breathing. Just as you breathe in and breathe out, you open and close brackets. 3) Cleaning the workspace is like cleaning your inner self. After a long work session, the master will ritually close R as if it is second nature, and the ancients know how the very question "do you want to save your workspace" tests the worthiness of the disciple always anew. 4) Let go of your fear when coding. Simply breathe in, and breathe out. More is not expected of you to master the art of coding.
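For the very literal-minded, the first three exercises in code form; a playful sketch, nothing more.

```r
# 1) The mantra: spell it right, then load
library(stats)

# 2) Close every bracket you open
mean(c(1, 2, 3))

# 3) Clean the workspace at the end of a session
rm(list = ls())
```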


The author of this entry is Henrik von Wehrden.