Сергей Кузнецов
23 October 2018
PHPStan: Find Bugs In Your Code Without Writing Tests!
I really like how much productivity a web developer gains by switching from compiled languages like Java or C# to an interpreted one like PHP. Aside from the dead simple execution model (start, handle one request, and die) and a much shorter feedback loop (no need to wait for the compiler), there’s a healthy ecosystem of open-source frameworks and libraries to help developers with their everyday tasks. For these reasons, PHP is the most popular language for web development by far.
But there’s one downside.
When do you find out about errors?
Compiled languages need to know the type of every variable, the return type of every method, and so on before the program runs. This is why the compiler needs to make sure that the program is “correct” and will happily point out mistakes in the source code, like calling an undefined method or passing the wrong number of arguments to a function. The compiler acts as a first line of defense before you are able to deploy the application into production.
On the other hand, PHP is nothing like that. If you make a mistake, the program will crash when the line of code with the mistake is executed. When testing a PHP application, whether manually or automatically, developers spend a lot of their time discovering mistakes that wouldn’t even compile in other languages, leaving less time for testing actual business logic.
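A minimal sketch of that failure mode, using a hypothetical Invoice class and describe() function that are not part of the original article. The typo sits in a branch that everyday use rarely reaches, so the script runs fine until that exact line executes.

```php
<?php
declare(strict_types=1);

// Hypothetical class, for illustration only.
final class Invoice
{
    /** @var float */
    private $total;

    public function __construct(float $total)
    {
        $this->total = $total;
    }

    public function total(): float
    {
        return $this->total;
    }
}

function describe(Invoice $invoice, bool $verbose): string
{
    if ($verbose) {
        // Typo: the method is total(), not getTotal(). PHP does not
        // notice until this exact line is executed.
        return 'Total: ' . $invoice->getTotal();
    }

    return 'Invoice';
}

// The common path works, so the mistake can survive casual testing.
echo describe(new Invoice(10.0), false), PHP_EOL;

// The rarely used path fails at runtime with an Error.
try {
    echo describe(new Invoice(10.0), true), PHP_EOL;
} catch (Error $e) {
    echo get_class($e), ': ', $e->getMessage(), PHP_EOL;
}
```

A compiler for a statically typed language would reject the equivalent program before it ever ran.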
I’d like to change that.
Enter PHPStan
Keeping up with modern PHP practices leads to codebases where we can be sure about types of a lot of data, converging with statically typed languages, although the dynamic nature is still present. Modern PHP codebases are similar to the ones in languages people make much less fun of. Object-oriented code, dependency injection and usage of established design patterns are truly common nowadays.
That led me to the idea of a static analysis tool for PHP that would take over the role of the compiler from other languages. I’ve spent a lot of time working on it, and I’ve been using its various development versions to check our codebase for more than a year.
It’s called PHPStan, it’s open-source and free to use.
What does it currently check for?
Existence of classes used in instanceof, catch, typehints and other language constructs. PHP does not check this and just stays silent instead, rendering the surrounding code unused.
Existence and accessibility of called methods and functions. It also checks the number of passed arguments.
Whether a method returns the same type it declares to return.
Existence and visibility of accessed properties. It will also point out if a different type from the declared one is assigned to the property.
Correct number of parameters passed to sprintf/printf calls based on format strings.
Existence of variables while respecting scopes of branches and loops.
Useless casting like (string) ‘foo’ and strict comparisons (=== and !==) with different types as operands which always result in false.
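Some of the checks above target mistakes that PHP itself lets through without complaint. The sketch below uses hypothetical names (User, Usr, isUser, isAdult) chosen for illustration. It runs without any error or warning, yet both functions are wrong.

```php
<?php
declare(strict_types=1);

final class User
{
}

function isUser($value): bool
{
    // Typo: there is no class named Usr. PHP does not complain about an
    // unknown class in instanceof; the expression is simply false.
    return $value instanceof Usr;
}

function isAdult(string $age): bool
{
    // A strict comparison between a string and an int can never be true.
    return $age === 18;
}

var_dump(isUser(new User())); // bool(false)
var_dump(isAdult('18'));      // bool(false)
```

Both results look like legitimate answers at runtime, which is why this class of bug is hard to spot by testing alone.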
The list is growing with every release. But it’s not the only thing that makes PHPStan useful.
PHPStan is fast…
It manages to check the whole codebase in a single pass. It doesn’t need to go through the code multiple times. And it only needs to go through the code you wish to analyze, e.g. the code you have written. It doesn’t need to parse and analyze 3rd party dependencies. Instead, it uses reflection to find out useful information about somebody else’s code that your codebase uses.
PHPStan is able to check our codebase (6000 files, 600k LOC) in around a minute. And it checks itself in under a second.
…and extensible
Even with current static typing practices, a developer can sometimes justify using dynamic features of PHP like the __get, __set and __call magic methods. They allow you to define new properties and methods dynamically at runtime. Normally, static analysis would complain about accessing undefined properties and methods, but there’s a mechanism for telling the engine exactly how the new properties and methods are created.
This is made possible thanks to a custom abstraction over native PHP reflection which allows the user to define extensions. For more details, check the Class reflection extensions section in the README.
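To make the problem concrete, here is a sketch of the kind of class the previous paragraphs have in mind. Row and its columns are hypothetical. Neither property is declared anywhere, so an analyzer that only looks at declarations would report both accesses as undefined. The extension interface itself is described in the README and is not reproduced here.

```php
<?php
declare(strict_types=1);

// Hypothetical value object that exposes array keys as properties.
final class Row
{
    /** @var array */
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    // Called by PHP whenever an undeclared property is read.
    public function __get(string $name)
    {
        if (!array_key_exists($name, $this->data)) {
            throw new OutOfBoundsException("Unknown column: $name");
        }

        return $this->data[$name];
    }

    public function __isset(string $name): bool
    {
        return isset($this->data[$name]);
    }
}

$row = new Row(['id' => 7, 'email' => 'a@example.com']);

echo $row->id, PHP_EOL;    // 7
echo $row->email, PHP_EOL; // a@example.com
```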
Some methods’ return types depend on their arguments. The type can depend on what class name you pass to the method, or the method may return an object of the same class as the object you passed. This is what Dynamic return type extensions are for.
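A typical case is a service container. The sketch below is hypothetical (Container, Mailer and their methods are made up for illustration). The signature of get() cannot say more than "some object", while the real type is whatever class name the caller passed in.

```php
<?php
declare(strict_types=1);

final class Mailer
{
    public function send(string $to): string
    {
        return "sent to $to";
    }
}

// Hypothetical, deliberately tiny service container.
final class Container
{
    /** @var array */
    private $instances = [];

    /**
     * @return object The precise class depends on $className.
     */
    public function get(string $className)
    {
        if (!isset($this->instances[$className])) {
            $this->instances[$className] = new $className();
        }

        return $this->instances[$className];
    }
}

$container = new Container();
$mailer = $container->get(Mailer::class);

// Without extra knowledge, an analyzer cannot tell that send() exists here.
echo $mailer->send('a@example.com'), PHP_EOL; // sent to a@example.com
```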
And last but not least, if you come up with a new check PHPStan could perform, you can write and use it yourself. It’s possible to come up with framework-specific rules like checking if entities and fields referenced in a DQL query exist or if all generated links in your MVC framework of choice lead to existing controllers.
Choose your level of strictness
Other tools I tried out make a poor first impression when you try to integrate them into an existing codebase. They spit out thousands and thousands of errors, which discourages further use.
Instead, I looked back to how I integrated PHPStan into our codebase during its initial development. Its first versions weren’t as capable as the current one, and they didn’t find as many errors. But it was ideal from the integration perspective: when I had time, I wrote a new rule, fixed what it found in our codebase and merged the new version with the fixes into master. We used the new version for a few weeks to find the errors it was capable of finding, and the cycle repeated. This gradual increase in strictness proved to be really beneficial, so I set out to simulate it even with the current capabilities of PHPStan.
By default, PHPStan checks only code it’s sure about — constants, instantiation, methods called on $this, statically called methods, functions and existing classes in various language constructs. By increasing the level (from the default 0 up to the current 4), you also increase the number of assumptions it makes about the code and the number of rules it checks.
You can also create your own rulesets if the built-in levels do not suit your needs.
Write fewer unit tests! (but focus on the meaningful ones)
You don’t hear this advice often. Developers are forced to write unit tests even for trivial code because there is an equal chance of making a mistake in it, like making a simple typo or forgetting to assign a result to a variable. It’s not very productive to write unit tests for the simple forwarding code that you can usually find in your controllers and facades.
Unit tests come with a cost. They are code that has to be written and maintained like any other. Running PHPStan on every change, ideally on a continuous integration server, prevents these kinds of mistakes without the cost of unit tests. It’s really hard and not economically feasible to achieve 100 % code coverage, but you can statically analyze 100 % of your code.
As for your unit testing efforts, focus them on places where it’s easy to make a mistake that doesn’t look like one from the point of static analysis. That includes: complex filtering of data, loops, conditions, calculations with multiplication and division including rounding.
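As an illustration of a mistake that looks fine to static analysis, consider splitting an amount of money into equal parts. The sketch below assumes amounts in whole cents and a positive number of parts. The function names are hypothetical. Both functions are fully typed and would pass any type check, but only one of them preserves the total.

```php
<?php
declare(strict_types=1);

/**
 * Naive split: every share is rounded down, so cents can go missing.
 *
 * @return int[]
 */
function splitNaive(int $totalCents, int $parts): array
{
    return array_fill(0, $parts, intdiv($totalCents, $parts));
}

/**
 * Fair split: the leftover cents are handed out one by one.
 *
 * @return int[]
 */
function splitFair(int $totalCents, int $parts): array
{
    $base = intdiv($totalCents, $parts);
    $remainder = $totalCents % $parts;

    $shares = [];
    for ($i = 0; $i < $parts; $i++) {
        $shares[] = $base + ($i < $remainder ? 1 : 0);
    }

    return $shares;
}

echo array_sum(splitNaive(100, 3)), PHP_EOL; // 99
echo array_sum(splitFair(100, 3)), PHP_EOL;  // 100
```

A unit test asserting that the shares add up to the original total catches the naive version immediately; no type checker will.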
On the shoulders of giants
Creating PHPStan wouldn’t be possible without the excellent PHP Parser created by Nikita Popov.
PHP in 2016 has established and widely used tools for package management, unit testing and coding standards. It’s a mystery to me there isn’t yet a widely used tool for finding bugs in code without running it. So I created one that is easy enough to use, fast, extensible, and doesn’t bother you with strict requirements on your codebase while still allowing you to benefit from various checks it performs. Check out the GitHub repo to find out how you can integrate it in your project today!
It’s already used by Slevomat, Hele and Bonami, several of the most prominent Czech startups. I hope yours will be next!
If you’re a PHP developer, give PHPStan a shot. If you’re interested in various insights about software development, follow me on Twitter. If you’re interested in my consulting services on code quality, continuous delivery, hiring developers, open-source and a plethora of other topics, please get in touch.
Who uses PHP anyway?
Each Thursday at my office we hold a one hour ‘tech talk’ after lunch to discuss something technical of interest. These tech talks are pretty informal, it’s really just a bunch of us sitting around a projector talking shop with one person facilitating the conversation. I get a lot out of them and it’s one of the reasons I really value working at Vehikl.
This week we chose Object Calisthenics as our source material and we stepped through the list of heuristics, discussing why the application of these rules tends to produce cleaner, more maintainable code. Afterwards, one of our junior developers sent me a message in Slack with a screenshot from the post we were working through.
Read through the whole article this time on Object Calisthenics, and saw this… How do you feel when people say this about PHP? Is it because they’re not developing web applications? Just curious, and if you have nothing to say that’s okay too
It really kinda sucks that given the value the post delivers to its readers, this is the one thing that made enough of an impact that it caused the junior dev to ask me about it.
Here’s my reply.
People shitting on PHP isn’t going to go away, it’s a symptom of a few things.
PHP has a ridiculously flat learning curve, so just about anyone can write code using it. This means a lot of amateurs and ‘get it done’ developers will choose PHP but won’t really ever level up their skills when it comes to software development.
PHP is ridiculously easy to deploy software with. This means that all those amateurs writing less than professional code can get their apps out into the real world. There are a lot of nasty people on the internet and these types of amateur applications are ripe for malicious attacks. This tends to give PHP a bad rap when it comes to building secure and robust applications.
Since PHP is easy to write and easy to deploy, a lot of people are trying to do that, but they need help. They post to Stack Overflow or other message boards, and other amateurs help them out with solutions that they’ve made themselves, which are also amateur. So the shitty code propagates quickly and lots and lots of terrible examples are posted out there for everyone to see.
People are insecure assholes looking to make themselves feel better by shitting on people they don’t know because that’s easy and seemingly without repercussions. These are generally the same people who troll the internet wasting time and taking up space instead of building useful tools and generally being productive. Note that these trolls are shitting on people who actually are building useful tools and being productive, kinda ironic (if I know what irony means).
So how do I feel when people say shitty things about PHP? I can’t say that it doesn’t bother me because it does; it doesn’t make me angry though, it just makes me kinda sad. It makes me sad because they’re taking time out of their day to point out “if you’re doing X then you’re an idiot” and that isn’t helpful in any way. In this particular instance, I don’t think that the author was trying to be malicious. It looks like the statement was made in jest, like they’re in on the joke that people shitting on PHP is a waste of time.
The key takeaway, though, is to try not to identify yourself as any kind of X developer: PHP developer, Laravel developer, Symfony developer, JavaScript developer, Vue, React, Python, whatever. All that does is create tribalism and pull people further apart, which is the opposite of what we should be striving towards. The product of software development is the same no matter what tools are used to make it. Many of the core ideas and philosophies are applicable to all sorts of languages, so sharing this information across disciplines works to benefit the industry as a whole. If you silo yourself, you’re doing yourself a disservice and limiting your opportunities for growth.
Further, learning multiple different programming languages only makes you a better developer. While the core ideas and philosophies may be the same, the ways in which they are applied tend to be slightly different because different languages have different language constructs. This is why developers will ask “what’s the idiomatic way to do X in Y?”. Learning new languages will expose you to these “new” ways of accomplishing common tasks. You can take these experiences back to other languages and see how well they apply. It’s a win-win.
Taking PHP Seriously
Slack uses PHP for most of its server-side application logic, which is an unusual choice these days. Why did we choose to build a new project in this language? Should you?
Most programmers who have only casually used PHP know two things about it: that it is a bad language, which they would never use if given the choice; and that some of the most extraordinarily successful projects in history use it. This is not quite a contradiction, but it should make us curious. Did Facebook, Wikipedia, WordPress, Etsy, Baidu, Box, and more recently Slack all succeed in spite of using PHP? Would they all have been better off expressing their application in Ruby? Erlang? Haskell?
Perhaps not. PHP-the-language has many flaws, which undoubtedly have slowed these efforts down, but PHP-the-environment has virtues which more than compensate for those flaws. And the options for improving on PHP’s language-level flaws are pretty impressive. On balance, PHP provides better support for building, changing, and operating a successful project than competing environments. I would start a new project in PHP today, with a reservation or two, but zero apologies.
Background
Uniquely among modern languages, PHP was born in a web server. Its strengths are tightly coupled to the context of request-oriented, server-side execution.
PHP originally stood for “Personal Home Page.” It was first released in 1995 by Rasmus Lerdorf, with an aim of supporting small, simple dynamic web applications, like the guestbooks and hit counters that were popular in the web’s early days.
From PHP’s inception, it has been used for far more complicated projects than its creators anticipated. It has been through several major revisions, each of which brought new mechanisms for wrangling these more complex applications. Today, in 2016, it is a feature-rich member of the Mixed-Paradigm Developer Productivity Language (MPDPL) family[1], which includes JavaScript, Python, Ruby, and Lua. If you last touched PHP in the early ‘aughts, a contemporary PHP codebase might surprise you with traits, closures, and generators.
Virtues of PHP
PHP gets several things very deeply, and uniquely, right.
First, state. Every web request starts from a completely blank slate. Its namespace and globals are uninitialized, except for the standard globals, functions and classes that provide primitive functionality and life support. By starting each request from a known state, we get a kind of organic fault isolation; if request t encounters a software defect and fails, this bug does not directly interfere with the execution of subsequent request t+1. State does reside in places other than the program heap, of course, and it is possible to statefully mess up a database, or memcache, or the filesystem. But PHP shares that weakness with all conceivable environments that allow persistence. Isolating request heaps from one another reduces the cost of most program defects.
Second, concurrency. An individual web request runs in a single PHP thread. This seems at first like a silly limitation. But since your program executes in the context of a web server, we have a natural source of concurrency available: web requests. Asynchronously curl’ing to localhost (or even another web server) provides a shared-nothing, copy-in/copy-out way of exploiting parallelism. In practice, this is safer and more resilient to error than the locks-and-shared-state approach that most other general-purpose languages provide.
Finally, the fact that PHP programs operate at a request level means that programmer workflow is fast and efficient, and stays fast as the application changes. Many developer productivity languages claim this, but if they do not reset state for each request, and the main event loop shares program-level state with requests, they almost invariably have some startup time. For a typical Python application server, e.g., the debugging cycle will look something like “think; edit; restart the server; send some test requests.” Even if “restart the server” only takes a few seconds of wall-clock time, that takes a big cut of the 15–30 seconds our finite human brains have to hold the most delicate state in place.
I claim that PHP’s simpler “think; edit; reload the page” cycle makes developers more productive. Over the course of a long and complex software project’s life cycle, these productivity gains compound.
The Case Against PHP
If all of the above is true, why all the hate? When you boil the colorful hyperbole away, the most common complaints about PHP cluster around these root causes:
Surprise type conversions. Almost all languages these days let programmers compare, e.g., integers and floats with the >= operator; heck, even C allows this. It’s perfectly clear what is intended. It’s less clear what comparing a string and an integer with == is supposed to mean, and different languages have made different choices. PHP’s choices in this department are especially perverse, leading to surprises and undetected errors. For instance, 123 == “123foo” evaluates to true (see what it’s doing there?), but 0123 == “0123foo” is false (hmm).
Inconsistency around reference, value semantics. PHP 3 had a clear semantic that assignment, argument passing, and return are all by value, creating a logical copy of the data in question. The programmer can opt into reference semantics with a & annotation[2]. This clashed with the introduction of object-oriented programming facilities in PHP 4 and 5, though. Much of PHP’s OO notation is borrowed from Java, and Java has the semantic that objects are treated by reference, while primitive types are treated by value. So the current state of PHP’s semantics is that objects are passed by reference (choosing Java over, say, C++), primitive types are passed by value (where Java, C++, and PHP agree), but the older reference semantics and & notation persist, sometimes interacting with the new world in weird ways.
Failure-oblivious philosophy. PHP tries very, very hard to keep the request running, even if it has done something deeply strange. For instance, division by zero does not throw an exception, or return INF, or fatally terminate the request. By default, it warns and evaluates to the value false. Since false is silently treated as 0 in numeric contexts, many applications are deployed and run with undiagnosed divisions by zero. This particular issue is changed in PHP 7, but the design impulse to keep plowing ahead, past when it could possibly make sense, pervades libraries too.
Inconsistencies in the standard library. When PHP was young, its audience was most familiar with C, and many APIs used the C standard library’s design language: six-character lower case names, success and failure returned in an integer return value with “real” values returned in a callee-supplied “out” param, etc. As PHP matured, the C style of namespacing by prefixing with _ became more pervasive: mysql_…, json_…, etc. And more recently, the Java style of camelCase methods on CamelCase classes has become the most common way of introducing new functionality. So sometimes we see code snippets that interleave expressions like new DirectoryIterator($path) with if (!($f = fopen($p, ‘w+’)) … in a jarring way.
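The first three complaints can be reproduced in a few lines. This is a sketch with hypothetical function names, not code from the article. Note that two behaviours depend on the PHP version: the comments about PHP 8 describe changes made after this article was written, and the division example uses intdiv(), which has thrown an exception since PHP 7.

```php
<?php
// No strict_types here: the point is PHP's default behaviour.

// 1. Surprise type conversions.
// A leading zero makes an integer literal octal, so 0123 is 83.
var_dump(0123 === 83); // bool(true)

// Loose comparison of an int with a string that merely starts with digits
// is version-dependent: true before PHP 8, false from PHP 8 on.
var_dump(123 == '123foo');

// 2. Value versus reference semantics.
function appendByValue(array $items): void
{
    $items[] = 'x'; // modifies a local copy only
}

function appendByReference(array &$items): void
{
    $items[] = 'x'; // modifies the caller's array
}

function renameUser(stdClass $user): void
{
    $user->name = 'renamed'; // objects are shared without any & marker
}

$list = [];
appendByValue($list);
echo count($list), PHP_EOL; // 0
appendByReference($list);   // nothing at the call site reveals the &
echo count($list), PHP_EOL; // 1

$user = new stdClass();
$user->name = 'original';
renameUser($user);
echo $user->name, PHP_EOL;  // renamed

// 3. Division by zero. Since PHP 7, intdiv() and % throw instead of
// carrying on with a bogus value.
try {
    intdiv(1, 0);
} catch (DivisionByZeroError $e) {
    echo get_class($e), PHP_EOL; // DivisionByZeroError
}
```

The two append functions are called in exactly the same way, which is the readability problem the second note at the end of the article describes.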
Lest I seem like an unreflective PHP apologist: these are all serious problems that make defects more likely. And they’re unforced errors. There’s no inherent trade-off between the Good Parts of PHP and these problems. It should be possible to build a PHP that limits these downsides while preserving the good parts.
HHVM and Hack
That successor system to PHP is called Hack[3].
Hack is what programming language people call a ‘gradual typing system’ for PHP. The ‘typing system’ means that it allows the programmer to express automatically verifiable invariants about the data that flows through code: this function takes a string and an integer and returns a list of Fribbles, just like in Java or C++ or Haskell or whatever statically typed language you favor. The ‘gradual’ part means that some parts of your codebase can be statically typed, while other parts are still in rough-and-tumble, dynamic PHP. The ability to mix them enables gradual migration of big codebases.
Rather than spill a ton of ink here describing Hack’s type system and how it works, just go play with it. I’ll be here when you get back.
It’s a neat system, and quite ambitious in what it allows you to express. Having the option of gradually migrating a project to Hack, in case it grows larger than you first expected, is a unique advantage of the PHP ecosystem. Hack’s type checking preserves the ‘think; edit; reload the page’ workflow, because the type checker runs in the background, incrementally updating its model of the codebase when it sees modifications to the filesystem. The Hack project provides integrations with all the popular editors and IDEs so that the feedback about type errors comes as soon as you’re done typing, just like in the web demo.
Let’s evaluate the set of real risks that PHP poses in light of Hack:
Surprise type conversions become errors in Hack files. The entire class of problems boils away.
Reference and value semantics are cleaned up by simply banning old-style references in Hack, since they’re unnecessary in new codebases. This leaves behind the same objects-by-reference-and-everything-else-by-value semantics as Java or C#…
PHP’s failure-obliviousness is more a property of the runtime and libraries, and it is harder for a semantic checker like Hack to reach directly into these systems. However, in practice most forms of failure-obliviousness require surprise type conversions to get very far. For instance, problems that arise from propagating the ‘false’ returned from division by zero eventually cross a type-checked boundary[4], which fails on treating a boolean numerically. These boundaries are more frequent in Hack codebases. By making it easier to write these types, Hack decreases the ‘skid distance’ of many buggy executions in practice.
Finally, inconsistencies in the standard library persist. The most Hack hopes to do is to make it less painful to wrap them in safer abstractions.
Hack provides an option that no other popular member of the MPDPL family has: the ability to introduce a type system after initial development, and only in the parts of the system where the value exceeds the cost.
HHVM
Hack was originally developed as part of the HipHop Virtual Machine, or HHVM, an open source JIT environment for PHP. HHVM provides another important option for the successful project: the ability to run your site faster and more economically. Facebook reports an 11.6x improvement in CPU efficiency over the PHP interpreter, and Wikipedia reports a 6x improvement.
Slack recently migrated its web environments into HHVM, and experienced significant drops in latency for all endpoints, but we lack an apples-to-apples measurement of CPU efficiency at this writing. We’re also in the process of moving portions of our codebase into Hack, and will report our experience here.
Looking Ahead
We started with the apparent paradox that PHP is a really bad language that is used in a lot of successful projects. We find that its reputation as a poor language is, in isolation, pretty well deserved. The success of projects using it has more to do with properties of the PHP environment, and the high-cadence workflow it enables, than with PHP the language. And the advantages of that environment (reduced cost of bugs through fault isolation; safe concurrency; and high developer throughput) are more valuable than the problems that the language’s flaws create.
Also, uniquely among the MPDPLs, there is a clear migration path to a higher performance, safer and more maintainable medium in the form of Hack and HHVM. Slack is in the later stages of a transition to HHVM, and the early stages of a transition to Hack, and we are optimistic that they will let us produce better software, faster.
Slack Technologies, Inc. is looking for great technologists to join us.
Notes
[1] I made up the term ‘MPDPL.’ While there is little direct genetic relationship among them, these languages have influenced one another heavily. Looking past syntax, they are much more similar than different. In a universe of programming languages that includes MIPS assembly, Haskell, C++, Forth, and Erlang it is hard to deny that the MPDPLs form a tight cluster in language design space.
[2] Unfortunately the & was marked in the callee, not the caller. So the programmer declares a desire to receive params by reference, but actually passing them by reference is unmarked. This makes it hard to understand what might change when reading code, and complicates an efficient implementation of PHP significantly. See Figure 2 in dl.acm.org/citation.cfm?id=2660199
[3] Yes, Hack is a nearly unGoogleable programming language name. ‘Hacklang’ is sometimes used when ambiguity is possible. If Google themselves can name a popular language the still-more-unGoogleable Go, why not?
[4] The typechecks in a Hack program are also enforced at runtime by default, because they piggy-back on PHP’s “type hint” facility. This increases safety in mixed codebases where Hack and classic PHP are co-mingled.
12 April 2018